Python Web Scraping Notes 5: Extracting Information with BeautifulSoup

I am 虚拟滑板, a blogger on 靠谱客. This article is part 5 of my Python web scraping notes, on extracting information with BeautifulSoup, and is shared here as a reference.

I. Three Forms of Information Markup

XML: eXtensible Markup Language
1. Features

Uses tags, much like HTML
Allows comments

2. Example

<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>中关村</streetAddr>
        <city>北京市</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>
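To make the tag structure concrete, here is a minimal sketch of how a program might read such a record with Python's standard-library `xml.etree.ElementTree`. The record is a trimmed copy of the example above, and the variable names are my own.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the <person> record from the example above.
xml_text = """
<person>
  <firstName>Tian</firstName>
  <lastName>Song</lastName>
  <address>
    <city>北京市</city>
    <zipcode>100081</zipcode>
  </address>
  <prof>Computer System</prof><prof>Security</prof>
</person>
"""

root = ET.fromstring(xml_text)

first_name = root.findtext("firstName")          # direct child lookup
city = root.findtext("address/city")             # nested lookup with a path
profs = [p.text for p in root.findall("prof")]   # repeated tags collected into a list
zipcode = root.findtext("address/zipcode")       # element text always comes back as a string

print(first_name, city, profs, type(zipcode).__name__)
```

Repeated `<prof>` tags are how XML expresses multiple values, and every value arrives as text, so any conversion to numbers is left to the program.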

JSON: JavaScript Object Notation
1. Features

Typed key-value pairs: "key": "value"
Multiple values go in a list: "name": ["value1", "value2"]
Nested key-value pairs use {}

2. Example

{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "中关村",
        "city": "北京市",
        "zipcode": "100081"
    },
    "prof": ["Computer System", "Security"]
}
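As a rough illustration of what "typed" means here, the sketch below loads a trimmed copy of the record with Python's standard-library `json` module. The two one-line documents at the end are my own addition, there only to contrast a quoted value with an unquoted one.

```python
import json

# A trimmed copy of the record from the example above.
json_text = """
{
  "firstName": "Tian",
  "lastName": "Song",
  "address": {"city": "北京市", "zipcode": "100081"},
  "prof": ["Computer System", "Security"]
}
"""

person = json.loads(json_text)

# Nested objects become dicts and arrays become lists.
city = person["address"]["city"]
profs = person["prof"]

# The notation itself carries the type: quoted means string, bare digits mean number.
quoted = json.loads('{"zipcode": "100081"}')["zipcode"]
unquoted = json.loads('{"zipcode": 100081}')["zipcode"]

print(city, profs)
print(type(quoted).__name__, type(unquoted).__name__)
```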

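YAML: YAML Ain't Markup Language
1. Features

Key-value pairs carry no type markers: name: value
Nesting is expressed with indentation (spaces only; YAML does not allow tabs for indentation)
"- " introduces each item in a list of sibling values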

2. Example

firstName: Tian
lastName: Song
address:
  streetAddr: 中关村
  city: 北京市
  zipcode: 100081
prof:
  - Computer System
  - Security

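Comparison:

XML: the earliest general-purpose information markup language; highly extensible but verbose
JSON: information is typed, which suits processing by programs
YAML: information is untyped, the share of plain text is the highest, and readability is good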

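II. Information Extraction

Let's try extracting every link from an HTML page:

Approach:
1. Find all the <a> tags
2. Parse each <a> tag and read the link stored in its href attribute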

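Implementation:

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)  # download the demo page
soup = BeautifulSoup(r.text, "html.parser")  # parse with the built-in parser
for link in soup.find_all("a"):  # every <a> tag in the document
    print(link.get("href"))  # the link target, or None if href is missing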

Output:

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
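The same two steps can also be tried without network access. The sketch below runs them on a hand-written HTML string modeled on the demo page. The string itself and the extra `<a name="top">` anchor are my own assumptions, added to show what happens when an `<a>` tag has no href.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (an assumption, not the real page source).
html = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
<a name="top">anchor without a link</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find every <a> tag. Step 2: read its href attribute.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the <a> tags that actually carry an href attribute.
real_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```

`get("href")` yields None for the anchor that has no href, so filtering with `href=True` may be the safer choice when the links feed into further requests.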

III. Searching HTML Content with the bs4 Library

find_all():
Syntax: find_all(name, attrs, recursive, string, limit, **kwargs)
Return value: a list holding the search results
Parameters: see the documentation

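Parameter	Description
name	Search string matched against tag names
attrs	Search string matched against tag attribute values; a specific attribute can be named
recursive	Whether to search all descendants; defaults to True
string	Search string matched against the text between <>…</>
limit	Maximum number of results to return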
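The `limit` parameter and the combination of `name` with `string` are easiest to see on a small input. This is a minimal sketch on a hand-written HTML string modeled on the demo page (the string is my own assumption, not the page's actual source).

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (an assumption, not the real page source).
html = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# limit stops the search once that many results have been collected.
first_only = soup.find_all("a", limit=1)

# name plus string returns the tags whose text equals the given string.
by_text = soup.find_all("a", string="Basic Python")

print(len(first_only), first_only[0]["id"])
print([tag["id"] for tag in by_text])
```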

Using find_all:

1. Searching by tag name:

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all("a"))  # all <a> tags
print(soup.find_all(["a", "b"]))  # all <a> and <b> tags

Result:

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

2. Retrieving every tag (find_all(True) matches all tag names):

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
for tag in soup.find_all(True):  # True matches any tag
    print(tag.name)

Result:

html
head
title
body
p
b
p
a
a

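3. Using a regular expression to find tags whose names start with b:

import re

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# "^b" anchors the match at the start of the tag name; a bare "b" would match any name containing b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)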

Result:

body
b
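Beautiful Soup applies a compiled pattern to each tag name with `search()`, so whether the pattern is anchored matters. The sketch below compares an unanchored and an anchored pattern on a small hand-written table fragment. The fragment is my own, chosen because `table` and `tbody` contain a b without starting with one.

```python
import re

from bs4 import BeautifulSoup

# Hand-written fragment (an assumption for illustration).
html = "<html><body><table><tbody><tr><td><b>bold</b></td></tr></tbody></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name that contains the letter b.
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchored with ^: only tag names that start with b.
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_b)
print(starts_with_b)
```

On the demo page both patterns happen to give the same answer, because `body` and `b` are the only tag names there that contain a b at all.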

4. Searching by attribute (CSS class, id):

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# <a> tags whose class attribute is "py1"
for tag in soup.find_all("a", class_="py1"):
    print(tag)
# <a> tags whose class attribute is "py2"
for tag in soup.find_all("a", attrs={"class": "py2"}):
    print(tag)
# every tag that has an id attribute
for tag in soup.find_all(id=True):
    print(tag)
# tags whose id attribute is "link1"
for tag in soup.find_all(id="link1"):
    print(tag)

Output:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

5. Setting recursive to False:

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all("a", recursive=False))  # search direct children only

Result (an empty list: the <a> tags are not direct children of soup, they appear only among its deeper descendants):

[]
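Whether `recursive=False` finds anything depends on which node the search starts from. A small sketch on a hand-written HTML string modeled on the demo page (the string is my own assumption) shows the same call returning results once it starts at the right parent.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (an assumption, not the real page source).
html = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The only tag directly under soup is <html>, so no <a> is found at that level.
from_root = soup.find_all("a", recursive=False)

# Both <p> tags are direct children of <body>.
paragraphs = soup.body.find_all("p", recursive=False)

# Both links are direct children of the second <p>.
course_p = soup.find("p", class_="course")
links = course_p.find_all("a", recursive=False)

print(from_root, len(paragraphs), [a["id"] for a in links])
```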

6. Searching by string content:

import re

import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find_all(string="Basic Python"))  # exact match
print(soup.find_all(string=re.compile("python")))  # any string containing "python"

Result:

['Basic Python']
['This is a python demo page', 'The demo python introduces several python courses.']
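A search by `string` alone returns the matching strings, not the tags around them, and the pattern is case-sensitive unless told otherwise. The sketch below, on a hand-written HTML string modeled on the demo page (my own assumption), shows one way to get back to the enclosing tag through `.parent` and to ignore case with `re.IGNORECASE`.

```python
import re

from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (an assumption, not the real page source).
html = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Each match is a string node; .parent is the tag that contains it.
matches = soup.find_all(string=re.compile("python"))
parents = [s.parent.name for s in matches]

# Ignoring case also picks up "Basic Python" and "Advanced Python".
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

print(parents)
print(len(any_case))
```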

Shorthand for find_all: calling a tag or the soup object directly is equivalent to calling find_all() on it

<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
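A quick check of this equivalence, as a sketch on a hand-written HTML string modeled on the demo page (the string is my own assumption):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (an assumption, not the real page source).
html = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object forwards its arguments to find_all().
short_ids = [a["id"] for a in soup("a")]
long_ids = [a["id"] for a in soup.find_all("a")]

# The same shorthand works on any tag, with the usual keyword filters.
py1_ids = [a["id"] for a in soup.body("a", class_="py1")]

print(short_ids == long_ids, py1_ids)
```

The shorthand saves typing, though spelling out `find_all` is arguably easier to read in longer scripts.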

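IV. Extended Search Methods

Method	Description (parameters are the same as find_all())
<>.find()	Searches and returns only the first result (a single result, not a list), or None if nothing matches
<>.find_parents()	Searches the ancestor nodes; returns a list
<>.find_parent()	Searches the ancestor nodes; returns a single result
<>.find_next_siblings()	Searches the following sibling nodes; returns a list
<>.find_next_sibling()	Searches the following sibling nodes; returns a single result
<>.find_previous_siblings()	Searches the preceding sibling nodes; returns a list
<>.find_previous_sibling()	Searches the preceding sibling nodes; returns a single result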
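The sketch below tries several of these methods on a hand-written HTML string modeled on the demo page (the string and the variable names are my own assumptions). It is meant to show the direction each method searches in and what comes back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the demo page (an assumption, not the real page source).
html = """
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

link1 = soup.find("a")                 # first match only, a single tag
missing = soup.find("table")           # None, because the page has no <table>
link2 = link1.find_next_sibling("a")   # the next <a> at the same level

parent_class = link2.find_parent("p")["class"]                        # nearest enclosing <p>
ancestors = [t.name for t in link2.find_parents(["p", "body"])]       # nearest ancestor first
previous_id = link2.find_previous_sibling("a")["id"]                  # back to the first link
no_more = link2.find_next_sibling("a")                                # nothing follows link2

print(link1["id"], missing, link2["id"])
print(parent_class, ancestors, previous_id, no_more)
```

Because the single-result methods return None when nothing matches, it is worth checking the result before indexing into it.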