Python Douban Crawler (1): The urllib Library


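I'm 自觉西装, a blogger on 靠谱客. This post introduces the urllib library as part 1 of a Python Douban crawler series; I'm sharing it here in the hope that it serves as a useful reference.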

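Having finished the basics of Python syntax, I wanted to consolidate what I had learned by practicing on a real project. Besides, Python's libraries are very powerful, and only repeated practice makes them second nature.
So I decided to try web crawling. A crawler imitates the way a browser requests a page from a web server and pulls down the content we need from that page for our own use. That makes it a very good data source for big-data analysis.
I won't go into the history and background of crawlers here (to be honest, I haven't looked into it either). Let's go straight to the libraries and tools we'll use:
urllib: ships with Python; used to fetch the server's response to a request, i.e. the HTML source of the page.
BeautifulSoup: parses an HTML document into a tree in which every node is a Python object.
re: regular expressions, used to match exactly the content we need.
sqlite3: Python's interface for working with SQLite databases.
xlwt: writes Excel spreadsheets, used here to store results.
Beyond these Python libraries, the later data visualization will be integrated through the Flask framework and built with tools such as ECharts and WordCloud.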
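To give a rough feel for how two of the standard-library pieces above fit together, here is a minimal sketch that pulls fields out of an HTML fragment with re and stores them with sqlite3. The HTML snippet, its class names, the movie titles, and the table layout are all made up for illustration; they are not Douban's real markup or data.

```python
import re
import sqlite3

# Hypothetical HTML fragment; real pages will use different markup.
html = '''
<div class="item"><span class="title">Movie A</span><span class="rating_num">9.7</span></div>
<div class="item"><span class="title">Movie B</span><span class="rating_num">9.6</span></div>
'''

# Capture the title text and the numeric rating from each item.
pattern = re.compile(
    r'<span class="title">(.*?)</span><span class="rating_num">([\d.]+)</span>'
)
rows = [(title, float(rating)) for title, rating in pattern.findall(html)]

# Store the rows in an in-memory SQLite database (nothing is written to disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, rating REAL)")
conn.executemany("INSERT INTO movie VALUES (?, ?)", rows)
conn.commit()

saved = conn.execute(
    "SELECT title, rating FROM movie ORDER BY rating DESC"
).fetchall()
print(saved)
conn.close()
```

In a real crawler the `html` string would come from urllib, and a file path would replace `":memory:"` so the data survives between runs.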
For now, let's look at some simple uses of urllib in code:


# -*- coding: utf-8 -*-
# Project: worm
# File: test_urllib
# Created: 11:42
import socket
import urllib.error
import urllib.parse
import urllib.request

# Send a GET request
res = urllib.request.urlopen("http://www.baidu.com")
print(res)
print(res.read().decode('utf-8'))  # decode the fetched page source as UTF-8
print('-------------------')

# Send a POST request; httpbin.org is handy for testing
# URL-encode the form fields and convert them to bytes
data = bytes(urllib.parse.urlencode({'hello': 'world'}), encoding='utf-8')
res_post = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(res_post.read().decode('utf-8'))
print('-------------------')

# Timeout handling
try:
    res_chaoshi = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(res_chaoshi.read().decode('utf-8'))
except (urllib.error.URLError, socket.timeout) as e:
    # Depending on where it happens, a timeout surfaces either wrapped in
    # URLError or as a bare socket.timeout, so catch both
    print('timeout')

res_chaoshi = urllib.request.urlopen('http://www.baidu.com')
# Status code: 404 means page not found, 418 means we were detected as a crawler
# (for error codes like these, urlopen raises urllib.error.HTTPError instead)
print(res_chaoshi.status)
print(res_chaoshi.getheaders())  # all response headers sent back by the server
print(res_chaoshi.getheader('Server'))  # a single response header

# Imitate a browser; first try the headers out against httpbin.org
url = 'http://httpbin.org/post'
headers = {
    "Accept": "application/json",
    # Accept-Encoding and X-Amzn-Trace-Id are omitted: urllib does not
    # decompress gzip responses itself, and the trace ID is added on the
    # server side rather than sent by the browser
    "Accept-Language": "zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Host": "httpbin.org",
    "Referer": "http://httpbin.org/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
}
data = bytes(urllib.parse.urlencode({'name': 'szj'}), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))

# Now visit Douban with a browser User-Agent
url = 'https://www.douban.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
# Wrap the URL and headers in a request.Request object to pass them along
req = urllib.request.Request(url=url, headers=headers)
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))

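The timeout branch in the script only prints 'timeout', and an error status such as 404 or 418 would raise an exception that nothing catches. As a rough sketch of slightly fuller error handling, the helper below tells "the server answered with an error code" apart from "no response arrived at all". The function name `fetch`, its return shape, and the 5-second default timeout are my own assumptions, not part of the original script.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5.0, headers=None):
    """Return (status, text); status is None when no HTTP response arrived."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as res:
            # Use the charset the server declares, falling back to UTF-8.
            charset = res.headers.get_content_charset() or "utf-8"
            return res.status, res.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as e:
        # The server did answer, but with an error status (404, 418, ...).
        # HTTPError must be caught before URLError, which is its parent class.
        return e.code, str(e.reason)
    except (urllib.error.URLError, socket.timeout) as e:
        # No usable response: DNS failure, refused connection, timeout, ...
        return None, str(getattr(e, "reason", e))
```

A call such as `status, text = fetch('https://www.douban.com', headers={'User-Agent': '...'})` then lets the caller branch on `status` instead of wrapping every request in its own try/except.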

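The specific points to watch for are all marked in the comments. What is worth adding is that the whole process leans heavily on knowledge of HTML, which later posts will cover in detail. One handy trick: press F12 in Chrome to open the developer tools, where you can view a page's HTML source directly.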
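The developer tools also show the request headers a real browser sends, which is where a User-Agent string like the one in the script can be copied from. The sketch below, which sends nothing over the network, compares that with the User-Agent urllib announces when no headers are supplied. The constant name `BROWSER_UA` is my own.

```python
import urllib.request

# Browser User-Agent string taken from the script above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)

# Headers urllib adds by itself when the request carries none of its own.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # starts with "Python-urllib/"

# A Request that carries the browser's User-Agent instead.
req = urllib.request.Request(
    "https://www.douban.com", headers={"User-Agent": BROWSER_UA}
)
# Request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A default value beginning with `Python-urllib/` makes a script easy for a site to recognize, which is why the script overrides it before requesting Douban.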
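The site http://httpbin.org/post echoes back the request it receives, so it is a convenient way to check that a request was sent and answered correctly.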
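As a small sketch of that check, the code below builds the POST request and reads the form fields back out of the JSON that httpbin returns. The helper names `build_post`, `send`, and `echoed_form` are hypothetical, and the `sample` string is a hand-written stand-in for a real reply (a real one contains more keys), so the example runs without network access.

```python
import json
import urllib.parse
import urllib.request


def build_post(url, fields):
    """Build a form-encoded POST request without sending it."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send(req, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as res:
        return res.read().decode("utf-8")


def echoed_form(body):
    """Extract the form fields that httpbin echoes back in its JSON reply."""
    return json.loads(body).get("form", {})


req = build_post("http://httpbin.org/post", {"name": "szj"})
print(req.get_method(), req.data)  # POST b'name=szj'

# Hand-written stand-in for a reply, so no network access is needed here.
sample = '{"args": {}, "form": {"name": "szj"}, "url": "http://httpbin.org/post"}'
print(echoed_form(sample))  # {'name': 'szj'}
```

Against the live site, `echoed_form(send(req))` should return the same fields that were posted; a mismatch would point to an encoding problem on the sending side.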