Python crawler types: urllib and requests


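I am 积极小丸子, a blogger on 靠谱客. This article, which I collected during recent development work, covers the types of Python crawlers (urllib and requests). I found it useful and am sharing it here for reference.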

Overview

#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = 'Fade Zhao'

import requests

'''1. Making requests'''
url = 'http://127.0.0.1:8000/spider'
response = requests.get(url)
response.encoding = 'utf-8'
print(response.text)

# The other HTTP methods are supported as well:
# response = requests.post(url)
# response = requests.delete(url)
# response = requests.options(url)
# response = requests.head(url)

# Submit data through the URL of a GET request
url = 'https://www.baidu.com/s'
data = {'wd': '麦克雷'}
response = requests.get(url, params=data)
print(response.url)  # https://www.baidu.com/s?wd=%E9%BA%A6%E5%85%8B%E9%9B%B7
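The encoded URL in the comment above comes from percent-encoding the UTF-8 bytes of each parameter. As a minimal sketch of that step, the snippet below builds the same query string with the standard library's `urllib.parse.urlencode` rather than with requests itself. The helper name `build_url` is hypothetical and only serves the illustration.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append percent-encoded query parameters to base (hypothetical helper)."""
    return base + '?' + urlencode(params)


# Same base URL and parameter as the requests example
search_url = build_url('https://www.baidu.com/s', {'wd': '麦克雷'})
print(search_url)
```

Each Chinese character becomes three `%XX` groups because it occupies three bytes in UTF-8.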

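'''2. Response and encoding'''
url = 'http://127.0.0.1:8000/spider'
response = requests.get(url)
print(response.content)   # the body as bytes
print(response.text)      # the body decoded to text
print(response.encoding)  # the page encoding inferred from the HTTP headers

# If response.text comes out garbled, you can set the encoding by hand as below.
# Hard-coding it is clumsy, though; chardet can detect the encoding instead.
response.encoding = 'utf-8'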
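To see why a wrong encoding guess produces garbled text, here is a small standalone sketch that needs no network. It assumes the server sent UTF-8 bytes while the client decoded them as ISO-8859-1, which, as far as I know, is what requests falls back to when a text response declares no charset. The sample string is borrowed from the search example earlier.

```python
raw = '麦克雷'.encode('utf-8')    # bytes as a UTF-8 server would send them

wrong = raw.decode('iso-8859-1')  # decoded with the wrong guess: garbled
right = raw.decode('utf-8')       # decoded with the real encoding

print(repr(wrong))
print(right)

# ISO-8859-1 maps every byte to exactly one character, so the damage is
# reversible: re-encode with the wrong codec, then decode with the right one.
recovered = wrong.encode('iso-8859-1').decode('utf-8')
print(recovered)
```

Setting `response.encoding` before reading `response.text` avoids the wrong decode in the first place.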

# chardet.detect returns the detected encoding of the raw content
import chardet

response.encoding = chardet.detect(response.content).get('encoding')
print(response.text)

'''3. Request headers'''

# requests adds request headers in much the same way as urllib: pass a headers argument to the get call
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel …) Gecko/20100101 Firefox/57.0'}
# response = requests.get(url, headers=headers)
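For comparison with the urllib side mentioned above, the following sketch attaches the same kind of header to a `urllib.request.Request`. Nothing is sent over the network here; the request object is only built and inspected. The User-Agent value is a placeholder chosen for the example, not a real browser string.

```python
from urllib.request import Request

url = 'http://127.0.0.1:8000/spider'
headers = {'User-Agent': 'example-crawler/0.1'}  # placeholder value (assumption)

req = Request(url, headers=headers)

# urllib stores header names via str.capitalize(), hence 'User-agent'
print(req.header_items())
ua = req.get_header('User-agent')
print(ua)
```

Passing `req` to `urllib.request.urlopen` would then send the request with that header.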

'''4. Status codes and response headers'''
url = 'http://127.0.0.1:8000/spider'
response = requests.get(url)
if response.status_code == requests.codes.ok:
    res_header = response.headers  # response headers
    print('Response headers:', res_header)
    print(res_header.get('content-type'))
else:
    # raise_for_status() raises an exception for 4XX and 5XX responses
    try:
        response.raise_for_status()
    except Exception as e:
        print('Bad status code:', e)

'''5. Cookies'''
url = 'http://127.0.0.1:8000/spider'
response = requests.get(url=url)

cookie = response.cookies
print(cookie)
# The cookie jar behaves like a dict, so get(key) returns the value
print(cookie.get('name'))

# Adding cookies

url = 'http://127.0.0.1:8000/spider'

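# The cookies argument takes a dict.
# Cookie Version 0 does not allow special characters such as spaces, square brackets, parentheses, equals signs, commas, double quotes, slashes, question marks, @, colons, or semicolons in cookie content.
cookie = {'testCookies_1': 'Hello_Python3', 'testCookies_2': 'Hello_Requests'}
response = requests.get(url=url, cookies=cookie)
print(response.cookies.get('name'))  # >>> LetMe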
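It can help to see what the cookie dict looks like on the wire. The sketch below joins the same dict into a `Cookie` request-header value and parses a `Set-Cookie` value with the standard library's `http.cookies.SimpleCookie`. The helper `cookie_header` is hypothetical, and the `Path=/` attribute is an assumption added for illustration; only `name=LetMe` comes from the output shown above.

```python
from http.cookies import SimpleCookie


def cookie_header(cookies):
    """Join a dict into a Cookie request-header value (hypothetical helper)."""
    return '; '.join(f'{key}={value}' for key, value in cookies.items())


sent = cookie_header({'testCookies_1': 'Hello_Python3',
                      'testCookies_2': 'Hello_Requests'})
print(sent)

# Parse a Set-Cookie value like the one the test server might return
jar = SimpleCookie()
jar.load('name=LetMe; Path=/')
received = {key: morsel.value for key, morsel in jar.items()}
print(received)
```

The semicolon separates cookies in the header, which is one reason it cannot appear inside a cookie value.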

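# When logging in with POST, a site may reject the form submission if the request does not carry the cookies set on the earlier visit, so requests provides Session to carry cookies through the login flow.
url = 'http://127.0.0.1:8000/Blog/login.html'
session = requests.Session()
# First fetch the login page; the server assigns cookies to the visitor
response = session.get(url)
data = {
    'username': 'alex',
    'password': 'alex',
}
# Log in with the same session, which now carries those cookies
response_login = session.post(url, data=data)
# print(response_login.text)
# In this login flow the server assigns cookies however the user reaches the page (logging in or browsing). If the form is submitted without them, the server treats the request as illegitimate and the login fails.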
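The cookie-carrying behaviour described above can be reproduced offline. The sketch below is not the author's server: it starts a toy HTTP server on a free local port that hands out a cookie on GET and accepts a POST only when that cookie comes back. On the client side it uses the standard library's `http.cookiejar.CookieJar` in place of `requests.Session`. The handler class, the `/login` path, and the cookie value `visitor=abc123` are all assumptions made for the demonstration.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class LoginHandler(BaseHTTPRequestHandler):
    """Toy server: GET hands out a visitor cookie, POST requires it."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Set-Cookie', 'visitor=abc123')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def do_POST(self):
        # Drain the form body, then accept only requests carrying the cookie
        self.rfile.read(int(self.headers.get('Content-Length', 0)))
        ok = 'visitor=abc123' in self.headers.get('Cookie', '')
        self.send_response(200 if ok else 403)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(('127.0.0.1', 0), LoginHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/login'
form = b'username=alex&password=alex'

# ProxyHandler({}) ignores proxy environment variables so the demo stays local
no_proxy = ProxyHandler({})

# With a cookie jar: the GET stores the cookie and the POST sends it back
with_jar = build_opener(no_proxy, HTTPCookieProcessor(CookieJar()))
with_jar.open(url).close()
login = with_jar.open(url, data=form)
with_cookie = login.status
login.close()

# Without a jar: the POST arrives with no cookie and is rejected
try:
    without_cookie = build_opener(no_proxy).open(url, data=form).status
except HTTPError as err:
    without_cookie = err.code

server.shutdown()
server.server_close()
print(with_cookie, without_cookie)
```

`requests.Session` plays the role of the opener with a cookie jar here: one object that remembers cookies between calls.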

'''6. Redirects'''
# Logins usually involve redirects. requests handles them well: get/post default to allow_redirects=True, which follows redirects.
url = 'http://www.github.com'
response = requests.get(url)
print(response.status_code)  # >>> 200
# When redirects are followed, response.history lists the intermediate responses
print(response.history)  # a list of two responses
print(response.url)  # prints https://github.com/
# The final URL is https because github redirects every http request to https.
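Redirect following can also be observed without touching github. The sketch below runs a toy local server whose `/old` path answers 301 and points at `/new`, then fetches `/old` with the standard library's urllib, which follows redirects by default much like requests does. The handler and both paths are assumptions for the demonstration, and urllib keeps no equivalent of `response.history`, so only the final status and URL are checked.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy server: /old answers 301 and points at /new."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(301)
            self.send_header('Location', '/new')
        else:
            self.send_response(200)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(('127.0.0.1', 0), RedirectHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

# ProxyHandler({}) ignores proxy environment variables so the demo stays local
opener = build_opener(ProxyHandler({}))
response = opener.open(base + '/old')
final_status = response.status
final_url = response.geturl()
response.close()

server.shutdown()
server.server_close()
print(final_status, final_url)
```

The caller asked for `/old` but ends up with the status and URL of `/new`, which mirrors how `response.url` reports the https address in the github example.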

'''7. Timeouts'''
# Set through the timeout argument of get/post, in seconds
response = requests.get(url, timeout=10)

'''8. Proxies'''

# As above, proxies are set in the get/post call
url = 'http://127.0.0.1:8000/'
proxy = {'https': '60.255.186.169:8888', 'http': '61.135.217.7:80'}
response = requests.get(url, proxies=proxy)
print(response.text)

Example: scraping a novel with requests