Writing a Simple Web Scraper in Python


Posted by 能干小甜瓜 on 靠谱客. This post walks through writing a simple web scraper in Python with BeautifulSoup, shared here as a reference.

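My friend and I are both very lazy and did not want to scroll up and down the page to check prices, so I wrote a scraper that saves the product prices.
Environment: macOS, Python 3.5
IDE: PyCharm
Libraries used: BeautifulSoup, urllib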

BeautifulSoup: an excellent HTML/XML parser

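Installation: use pip. Command: pip install beautifulsoup4
Note: on Python 3, be sure to install BeautifulSoup4 (bs4 for short). The full code in this post also uses the lxml parser and pandas, which are installed separately: pip install lxml pandas
1. bs4 documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
2. How to use bs4 CSS selectors:
http://cuiqingcai.com/1319.html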

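Requirements

Get each product's price and its matching name, and also download the product images.

1. To do this, find where that information sits in the page's source code.

Chrome's "Inspect" tool shows the HTML source.
In the source, the product price is marked with class = "Price".
The full product name is inside class = product-image, whose parent is class = product-image-and-name-container. (Selecting product-image on its own also collects information about other, non-target products, so the parent class is included.)
Use the bs4 CSS selector (the select method) to locate the elements with these classes.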
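
To make the selector step concrete, here is a minimal sketch. The HTML is a made-up fragment that only borrows the class names mentioned above (the real page's markup is not reproduced here), and it uses Python's built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the post.
SAMPLE_HTML = """
<div class="product-image-and-name-container">
  <a class="product-image" href="/buy/1"><img src="/img/1.jpg" alt="Product A"></a>
</div>
<div class="sidebar">
  <a class="product-image" href="/buy/9"><img src="/img/9.jpg" alt="Unrelated"></a>
</div>
<span class="Price">$10.99</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matches every element with class product-image, including the sidebar one.
loose = soup.select(".product-image")

# The child combinator (>) keeps only elements directly inside the container.
strict = soup.select(".product-image-and-name-container > .product-image")

# get_text(strip=True) returns the text inside the tag without the markup.
prices = [tag.get_text(strip=True) for tag in soup.select(".Price")]

print(len(loose), len(strict), prices)
```

The difference between loose and strict is the reason for adding the parent class to the selector.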

2. Saving the images

all_image = name[i].find_all('img')  # all <img> tags inside this product
for image in all_image:
    # x is the running index of the product, used as the file name
    urllib.request.urlretrieve(image['src'], "/Users/lixuefei/Desktop/test/%s.jpg" % (x))
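
The download step can also be split into two small helpers, sketched below. The names image_targets and download_images are hypothetical, and the sketch assumes that some src values may be relative, which urljoin resolves against the page URL. Returning the saved paths makes it easy to keep each image next to its product later.

```python
import os
import urllib.request
from urllib.parse import urljoin


def image_targets(srcs, page_url, dest_dir):
    """Pair each image src with an absolute URL and a numbered local path."""
    targets = []
    for index, src in enumerate(srcs):
        absolute = urljoin(page_url, src)  # resolves relative src values
        path = os.path.join(dest_dir, "%s.jpg" % index)
        targets.append((absolute, path))
    return targets


def download_images(targets):
    """Fetch every (url, path) pair and return the list of saved paths."""
    saved = []
    for url, path in targets:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved


# Example with placeholder values; nothing is downloaded here.
targets = image_targets(
    ["/img/1.jpg", "http://example.com/2.jpg"],
    "http://example.com/shop/",
    "/tmp/test",
)
print(targets)
```

Calling download_images(targets) would then perform the actual requests.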

3. Full code

"""
Created by LiXuefei in 2017.07.26
"""
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup


# BeautifulSoup approach
def get_content(url):
    """Download the page at url and return its HTML as text."""
    # The with block always closes the connection, even if decoding fails.
    with urllib.request.urlopen(url) as html:
        # decode() also accepts errors='ignore' to skip undecodable bytes.
        return html.read().decode("utf-8")


def save_to_file(file_name, contents):
    """Write contents to file_name (helper, not called below)."""
    with open(file_name, 'w') as fh:
        fh.write(contents)


def get_txt(info):
    """Download the product images and return (names, prices)."""
    soup = BeautifulSoup(info, "lxml")  # use the lxml parser
    name = soup.select('.product-image-and-name-container > .product-image')
    price = soup.select('.Price')

    # Download the first image of each product as 0.jpg, 1.jpg, ...
    for i in range(len(name)):
        all_image = name[i].find_all('img')
        urllib.request.urlretrieve(all_image[0]['src'], "/Users/lixuefei/Desktop/test/%s.jpg" % i)

    # Extract the name and price of each product.
    names, prices = [], []
    for tag, price_tag in zip(name, price):
        # Split the tag's markup on '='; the piece at index 4 holds the name.
        piece = str(tag).split('=')[4]
        # strip() removes these characters from both ends, not the literal
        # word, so this depends on the page's markup at the time of writing.
        names.append(piece.strip('onload' + '" '))
        # Keep only the text inside <span class="Price">...</span>.
        prices.append(price_tag.get_text(strip=True))
    return names, prices


url = "http://www.chemistwarehouse.com.au/Shop-Online/587/Swisse"
content = get_content(url)
content_name, content_price = get_txt(content)
df = pd.DataFrame()
df['name'] = content_name
df['price'] = content_price
df.to_csv('/Users/lixuefei/Desktop/test.csv')
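
One caveat about the name clean-up in the full code above: it passes a string of characters to str.strip, and strip treats that string as a set of characters to remove from both ends, not as a literal word. The sketch below shows the effect on a made-up fragment (the real page's markup may differ) and one possible alternative that takes the text between the first pair of double quotes.

```python
# Hypothetical fragment of the kind produced by splitting a tag's markup on '='.
piece = '"Cod Liver Oil" onload'

# strip() removes any of the characters o, n, l, a, d, '"' and ' ' from both
# ends, so trailing letters of the name that are in that set are lost too.
stripped = piece.strip('onload' + '" ')

# Taking the text between the first pair of double quotes keeps the name whole.
quoted = piece.split('"')[1]

print(repr(stripped))  # 'Cod Liver Oi'
print(repr(quoted))    # 'Cod Liver Oil'
```

Whether the quoted form fits the real page depends on how the attribute is written there, so treat it as an illustration of the pitfall, not a drop-in fix.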

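Open issues

1. On another shopping site, a Chinese one, decoding failed because the page mixes in Chinese text. Passing the 'ignore' argument avoids the error, but it also drops the products' Chinese names.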
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 137363: invalid start byte
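2. When the program runs, it downloads the product images into a folder and writes the product names and prices to a separate CSV file. It would be nicer to combine the images with the names and prices. Maybe use a database? I have not yet worked out a better way to present them.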
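
For the decoding error in item 1, one likely cause (an assumption, since that site is not named) is that the page is served in GBK or GB2312 and not in UTF-8. A byte such as 0xb3 can never start a UTF-8 character, but it is a common lead byte in those encodings. The sketch below uses a hypothetical helper, decode_page, that tries the declared charset first and then a short list of fallbacks, so the Chinese names are kept and not silently dropped.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "gb18030")):
    """Decode raw bytes, trying the declared charset first, then fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Last resort: keep going, but mark undecodable bytes visibly.
    return raw.decode("utf-8", errors="replace")


# Bytes as a GBK-encoded page would deliver them.
gbk_bytes = "啊 Swisse".encode("gbk")
print(decode_page(gbk_bytes))
```

With urllib, the declared charset can usually be read from the response headers via html.headers.get_content_charset(), which returns None when the server does not state one.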