Basic Usage of Python Web Crawlers

This post walks through the basic usage of Python web crawlers with urllib; I hope it serves as a useful reference.

1. Overview

1. Import the urllib library.

2. Send the request.

3. Read the returned content.

4. Set the encoding (the b'...' prefix marks raw bytes, which need to be decoded to UTF-8).

5. Print the result.

import urllib.request
# 2. send the request
response=urllib.request.urlopen("http://www.baidu.com")
# 3. read the returned content (raw bytes)
html=response.read()
# 4. decode the bytes to a UTF-8 string
html=html.decode("utf-8")
# 5. print it
print(html)
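In practice urlopen can fail or hang, so it is worth handling errors from the start. A minimal sketch (same URL as above; the 10-second timeout is an arbitrary choice):

import urllib.request
import urllib.error

try:
    # a timeout (in seconds) keeps the call from hanging on a dead server
    response=urllib.request.urlopen("http://www.baidu.com",timeout=10)
    print(response.status)                     # HTTP status code, e.g. 200
    print(response.getheader('Content-Type'))  # response headers are available too
    html=response.read().decode("utf-8")
except urllib.error.HTTPError as e:
    # HTTPError must be caught before URLError, since it is a subclass of it
    print("The server returned an error:",e.code)
except urllib.error.URLError as e:
    print("Failed to reach the server:",e.reason)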

2. Downloading an Image and Saving It Locally

import urllib.request

# Method 1: open the URL directly
#response = urllib.request.urlopen("https://img6.bdstatic.com/img/image/smallpic/weiju112.jpg")

# Method 2: build a Request object first, then open it
req = urllib.request.Request("https://img6.bdstatic.com/img/image/smallpic/weiju112.jpg")
response=urllib.request.urlopen(req)

cat_img = response.read()

# write the raw bytes to a file opened in binary ('wb') mode
with open('aaaabbbbcccc.jpg','wb') as f:
    f.write(cat_img)
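For a straightforward download like this one, the standard library also offers urllib.request.urlretrieve, which streams the URL straight to disk. A one-call sketch equivalent to the code above:

import urllib.request

# download the image directly to a local file, without reading it into a variable first
urllib.request.urlretrieve("https://img6.bdstatic.com/img/image/smallpic/weiju112.jpg","aaaabbbbcccc.jpg")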
3. Youdao Translate

import urllib.request
import urllib.parse
import json

content=input("Please input the content that you will translate:")

url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=https://www.baidu.com/link'

data={}
data['action']='FY_BY_CLICKBUTTON'
data['doctype']='json'
data['i']=content
data['keyfrom']='fanyi.web'
data['type']='auto'
data['typoResult']='true'
data['ue']='UTF-8'
data['xmlVersion']='1.8'

data=urllib.parse.urlencode(data).encode("utf-8") 
response=urllib.request.urlopen(url,data)
html=response.read().decode('utf-8')

res=json.loads(html) # res is a dict parsed from the JSON response
print("The result:%s" % (res['translateResult'][0][0]['tgt']))
4. Youdao Translate with Request Headers (1): build a header dict and pass it as a parameter to Request.

import urllib.request
import urllib.parse
import json

content=input("Please input the content that you will translate:")

url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=https://www.baidu.com/link'

head={} # request headers that imitate a normal browser visit
head['User-Agent']="Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"

data={}
data['action']='FY_BY_CLICKBUTTON'
data['doctype']='json'
data['i']=content
data['keyfrom']='fanyi.web'
data['type']='auto'
data['typoResult']='true'
data['ue']='UTF-8'
data['xmlVersion']='1.8'

data=urllib.parse.urlencode(data).encode("utf-8") 

#response=urllib.request.urlopen(url,data)
req=urllib.request.Request(url,data,head)
response=urllib.request.urlopen(req)

html=response.read().decode('utf-8')

res=json.loads(html) # res is a dict parsed from the JSON response
print("The result:%s" % (res['translateResult'][0][0]['tgt']))

5. Youdao Translate with Request Headers (2): via Request.add_header().

import urllib.request
import urllib.parse
import json

content=input("Please input the content that you will translate:")

url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=https://www.baidu.com/link'

'''
head={} # request headers that imitate a normal browser visit
head['User-Agent']="Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
'''

data={}
data['action']='FY_BY_CLICKBUTTON'
data['doctype']='json'
data['i']=content
data['keyfrom']='fanyi.web'
data['type']='auto'
data['typoResult']='true'
data['ue']='UTF-8'
data['xmlVersion']='1.8'

data=urllib.parse.urlencode(data).encode("utf-8") 

#response=urllib.request.urlopen(url,data)
req=urllib.request.Request(url,data)
req.add_header('User-Agent',"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0")
response=urllib.request.urlopen(req)

html=response.read().decode('utf-8')

res=json.loads(html) # res is a dict parsed from the JSON response
print("The result:%s" % (res['translateResult'][0][0]['tgt']))
6. Using a Proxy

1. Create the parameter dict {'type': 'proxy ip:port'}, e.g. {'http': '123.163.219.132:81'}:

proxy_support=urllib.request.ProxyHandler({})

2. Build a customized opener:

opener=urllib.request.build_opener(proxy_support)

3. Install the opener (optional; see the note after this list):

urllib.request.install_opener(opener)

4. Call the opener:

opener.open(url)
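One design note: install_opener makes the opener the process-wide default, so every later urlopen call goes through the proxy. If that is not wanted, skip step 3 and use the opener directly, which keeps the proxy local to the calls that need it:

# no install_opener: only this request goes through the proxy
response=opener.open(url)
html=response.read().decode('utf-8')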

The complete code:

import urllib.request
import random
import time
while True:
    url='http://www.whatismyip.com.tw' # a site that reports the IP your request appears to come from
    iplist=['171.39.32.171:9999','112.245.170.47:9999','111.76.129.119:808','27.206.143.225:9999','114.138.196.144:9999'] # each entry must be 'ip:port'; free proxies like these go stale quickly

    # 1. create the parameter dict {'type': 'proxy ip:port'}
    proxy_support=urllib.request.ProxyHandler({'http':random.choice(iplist)})
    #proxy_support=urllib.request.ProxyHandler({'http':'123.163.219.132:81'})

    # 2. build a customized opener
    opener=urllib.request.build_opener(proxy_support)
    opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0')]

    # 3. install the opener globally
    urllib.request.install_opener(opener)

    # 4. make the request; urlopen now goes through the chosen proxy
    res=urllib.request.urlopen(url)
    html=res.read().decode('utf-8')
    print(html)
    time.sleep(5)
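Free proxies fail constantly, so each attempt should really get a timeout and error handling; otherwise one dead proxy stalls the whole loop. A more defensive sketch of the same idea (same url and iplist as above; the 5-second timeout is an arbitrary choice):

import random
import time
import urllib.request

url='http://www.whatismyip.com.tw'
iplist=['171.39.32.171:9999','112.245.170.47:9999','111.76.129.119:808']

for ip in iplist:
    proxy_support=urllib.request.ProxyHandler({'http':ip})
    opener=urllib.request.build_opener(proxy_support)
    opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0')]
    try:
        # a short timeout keeps a dead proxy from blocking the loop;
        # URLError and socket.timeout are both OSError subclasses
        res=opener.open(url,timeout=5)
        print(res.read().decode('utf-8'))
    except OSError as e:
        print("Proxy %s failed: %s" % (ip,e))
    time.sleep(5)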


