Scraping compressed web pages with urllib

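I'm 称心小虾米, a blogger on 靠谱客. This article covers scraping compressed web pages with urllib; I'm sharing it here in the hope that it serves as a useful reference.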

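While scraping pages with urllib recently, I ran into a very strange problem: a page that opened normally in a browser and in Postman came back from urllib as garbled text, and neither UTF-8 nor GBK decoding fixed it.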

Original code:

import http.cookiejar
import urllib.request

# url is the address of the page being scraped (defined elsewhere)
cookies = http.cookiejar.LWPCookieJar()
handlers = [
    urllib.request.HTTPHandler(),
    urllib.request.HTTPSHandler(),
    urllib.request.HTTPCookieProcessor(cookies),
]
opener = urllib.request.build_opener(*handlers)
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36')]
request = urllib.request.Request(url)
text = opener.open(request).read()  # raw bytes; decoding these as UTF-8 or GBK failed

Troubleshooting process

First, I suspected the web page itself was broken, but it opened fine in both the browser and Postman.

Next, I suspected my code. Switching to the requests module worked, and the page came back correctly:

response = requests.request('GET', url, headers=headers)

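The problem is that my whole crawler framework is built on urllib, and it works for most sites (nearly all of them, in fact). Why did it fail for just a few? I wasn't going to rewrite the whole codebase for this one case.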

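Digging further:
Postman showed accept-encoding: gzip, which made me wonder whether the server was sending the data compressed. Unlike requests, urllib does not decompress response bodies automatically, so that would explain the garbled bytes. I tried decompressing the response, changing the code above to: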

import http.cookiejar
import urllib.request
import zlib

cookies = http.cookiejar.LWPCookieJar()
handlers = [
    urllib.request.HTTPHandler(),
    urllib.request.HTTPSHandler(),
    urllib.request.HTTPCookieProcessor(cookies),
]
opener = urllib.request.build_opener(*handlers)
opener.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'),
    ('Accept-encoding', 'gzip'),
]
request = urllib.request.Request(url)
text = opener.open(request).read()
# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
html = zlib.decompress(text, 16 + zlib.MAX_WBITS)
# html is still bytes; call html.decode('utf-8') (or the page's charset) to get text
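
The fix above decompresses every response unconditionally, so it would raise an error on any page the server sends uncompressed. Since the crawler handles many sites, one possible refinement is to decompress only when the body is actually compressed. The sketch below is not from the original code: `decode_body` is a hypothetical helper name, and it assumes the caller passes in the raw bytes together with the value of the response's Content-Encoding header (if any).

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress a response body if needed, then decode it to str.

    Hypothetical helper: raw is the bytes returned by read(), and
    content_encoding is the Content-Encoding header value or None.
    """
    encoding = (content_encoding or "").strip().lower()
    # Some servers omit the header, so also check for the gzip magic number.
    if encoding in ("gzip", "x-gzip") or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    # Undecodable bytes are replaced rather than raising an exception.
    return raw.decode(charset, errors="replace")
```

With the opener from the code above, this could be used roughly as follows:

```python
# response = opener.open(request)
# html = decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Uncompressed pages pass through unchanged apart from the final decode, so the same call should work for every site the crawler visits.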