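Overview
I went through Beautiful Soup yesterday and was thoroughly confused: I had forgotten everything about lxml and kept mixing the two up. Painful.
I. Installation
Install the Beautiful Soup and lxml libraries (for example with pip install beautifulsoup4 lxml)
II. Basic usage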
# data source
html = '''
<html>
<head>
<title>The Dormouse`s story</title>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://exmple.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<p class="story">...</p>
'''
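# import the library
from bs4 import BeautifulSoup
# load the data source, naming the parser explicitly
soup = BeautifulSoup(html, 'lxml')
# query the data and print it
# pretty-print the document; missing tags are completed automatically
print(soup.prettify())
# print the text inside the title tag
print(soup.title.string)
# print the name of the title tag
print(soup.title.name)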
<html>
<head>
<title>
The Dormouse`s story
</title>
</head> # the closing head tag was added here
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>
Elsie
</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
</p>
<p class="story">
...
</p>
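</body> # the closing body tag was added here
</html> # the closing html tag was added here
soup.prettify() completes the missing tags and pretty-prints the HTML fragment
soup.title.string and soup.title.name print the title's text and its tag name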
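III. Node selectors
# import the library
from bs4 import BeautifulSoup
# data source: not important here, reuse the html string from section II
# load the data
soup = BeautifulSoup(html, 'lxml')
# query and print; the attribute to access is filled in by the examples below
# print(soup.<node name>)
1. Selecting elements (nodes)
# select the title element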
print('title:',soup.title)
title: <title>The Dormouse`s story</title>
This prints the whole node, tags included
# the type of title
print('type of title:', type(soup.title))
type of title: <class 'bs4.element.Tag'>
The node's type is Tag
2. Extracting node information
Get the node name
soup.title.name
Get the node content
print(soup.title.string)
print(type(soup.title.string))
print(soup.title.text)
print(type(soup.title.text))
print(soup.title.get_text())
print(type(soup.title.get_text()))
print(soup.title.getText())
print(type(soup.title.getText()))
The Dormouse`s story
<class 'bs4.element.NavigableString'>
The Dormouse`s story
<class 'str'>
The Dormouse`s story
<class 'str'>
The Dormouse`s story
<class 'str'>
All four approaches return the node's content, but only .string returns Beautiful Soup's own NavigableString type (a subclass of str); the others return a plain str. getText() is a legacy alias of get_text().
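The four accessors agree on title only because title holds a single string. A minimal sketch of where they diverge, assuming a made-up snippet (doc is not from the notes above) and the built-in html.parser so that nothing beyond bs4 is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written without whitespace between tags
doc = '<div><p>Hello <b>world</b></p><a><span>Elsie</span></a></div>'
soup = BeautifulSoup(doc, 'html.parser')

# p has two children (a string and a <b> tag), so .string is ambiguous
p_string = soup.p.string
# get_text() joins every string found below the node
p_text = soup.p.get_text()
# a has exactly one child, so .string descends into that child
a_string = soup.a.string

print(p_string)  # None
print(p_text)    # Hello world
print(a_string)  # Elsie
```

So .string is convenient for leaf nodes, while get_text() is the safer choice when a node may contain nested tags.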
3. Nested selection
print(soup.head.title)
print(type(soup.head.title))
<title>The Dormouse`s story</title>
<class 'bs4.element.Tag'>
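Just keep chaining with "." and you are done.
4. Related-node selection
Children and descendants
Earlier, print(soup.title) printed the whole node, tags included. print(soup.p) does the same for the first p node: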
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
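soup.p prints the whole node, tags included, with its original formatting
The result is a Beautiful Soup Tag
Next we use contents:
print(soup.p.contents)
print(type(soup.p.contents))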
['\nOnce upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\nand\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n']
<class 'list'>
What got printed?
Everything directly inside the node
Since it is a list, let's look at the type of every item in it
0
<class 'bs4.element.NavigableString'>
Once upon a time there were three little sisters; and their names were
1
<class 'bs4.element.Tag'>
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
<class 'bs4.element.NavigableString'>
3
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
<class 'bs4.element.NavigableString'>
and
5
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
<class 'bs4.element.NavigableString'>
The list mainly holds Beautiful Soup's NavigableString objects and Tag nodes
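The notes do not show the loop that produced the index/type listing above. A minimal sketch of one way to get it, assuming a shortened made-up paragraph (doc) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the story paragraph
doc = '<p>one<a href="#">two</a>three</p>'
soup = BeautifulSoup(doc, 'html.parser')

kinds = []
for i, item in enumerate(soup.p.contents):
    # remember the class name of every direct child
    kinds.append(type(item).__name__)
    print(i)
    print(type(item))
    print(item)

print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']
```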
Next, let's look at children and descendants
Children:
print(soup.p.children)
print(type(soup.p.children))
for i, child in enumerate(soup.p.children):
    print(i, " ", type(child), " ", child)
<list_iterator object at 0x0000000012F87A58>
<class 'list_iterator'> # a list iterator object
0
<class 'bs4.element.NavigableString'>
Once upon a time there were three little sisters; and their names were
1
<class 'bs4.element.Tag'>
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
<class 'bs4.element.NavigableString'>
3
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
<class 'bs4.element.NavigableString'>
and
5
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
<class 'bs4.element.NavigableString'>
This is the same result as with contents above.
Descendants:
Descendants are accessed with descendants
print(soup.p.descendants)
print(type(soup.p.descendants))
for i,d in enumerate(soup.p.descendants):
    print(i, ' ', type(d), ' ', d)
<generator object Tag.descendants at 0x0000000012F81570>
<class 'generator'> # a generator object
0
<class 'bs4.element.NavigableString'>
Once upon a time there were three little sisters; and their names were
1
<class 'bs4.element.Tag'>
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
<class 'bs4.element.NavigableString'>
3
<class 'bs4.element.Tag'>
<span>Elsie</span>
4
<class 'bs4.element.NavigableString'>
Elsie
5
<class 'bs4.element.NavigableString'>
6
<class 'bs4.element.NavigableString'>
7
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8
<class 'bs4.element.NavigableString'>
Lacie
9
<class 'bs4.element.NavigableString'>
and
10
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11
<class 'bs4.element.NavigableString'>
Tillie
12
<class 'bs4.element.NavigableString'>
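From the result we can see that descendants behaves like calling children and, whenever a Tag is met, calling children on that Tag as well, and so on, until every descendant has been listed in document order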
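That description can be checked with a small recursive generator. This is only a sketch of the idea, not how Beautiful Soup implements descendants internally; walk and doc are names made up for the illustration, and html.parser is used to avoid extra dependencies:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def walk(node):
    """Yield every descendant of node, depth first, in document order."""
    for child in node.children:
        yield child
        # only Tag objects have children of their own
        if isinstance(child, Tag):
            yield from walk(child)


# Hypothetical sample markup
doc = '<p>intro<a><span>Elsie</span></a>and<a>Lacie</a></p>'
soup = BeautifulSoup(doc, 'html.parser')

walked = list(walk(soup.p))
builtin = list(soup.p.descendants)

# both traversals visit the same objects in the same order
print(len(walked), len(builtin))
print(all(a is b for a, b in zip(walked, builtin)))
```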
Parents and ancestors
Parent: use parent
print(type(soup.p.parent))
print(soup.p.parent)
<class 'bs4.element.Tag'>
<body><p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
<p class="story">...</p>
</body>
The parent of the p node is the body node, so the body node is returned
Ancestors:
Use parents
print(type(soup.p.parents))
print(soup.p.parents)
for i, p in enumerate(soup.p.parents):
    print(i, ' ', type(p), ' ', p)
<class 'generator'>
<generator object PageElement.parents at 0x0000000012F50570>
0
<class 'bs4.element.Tag'>
<body><p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
<p class="story">...</p>
</body>
1
<class 'bs4.element.Tag'>
<html>
<head>
<title>The Dormouse`s story</title>
</head><body><p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
<p class="story">...</p>
</body></html>
2
<class 'bs4.BeautifulSoup'>
<html>
<head>
<title>The Dormouse`s story</title>
</head><body><p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://exmple.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
<p class="story">...</p>
</body></html>
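It returns a generator object
It walks outward one node at a time until it reaches the outermost node, the BeautifulSoup object itself
The first item is simply the parent node
next(soup.p.parents) is the parent node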
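Printing whole ancestors is noisy, so here is a shorter sketch that collects only their names. The markup in doc is made up for the illustration and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with an explicit html/body wrapper
doc = '<html><body><div><p>text</p></div></body></html>'
soup = BeautifulSoup(doc, 'html.parser')

# walk outward from p and keep just the node names
names = [ancestor.name for ancestor in soup.p.parents]
print(names)  # ['div', 'body', 'html', '[document]']

# the first ancestor is the same object as .parent
first_ancestor = next(iter(soup.p.parents))
print(first_ancestor is soup.p.parent)  # True
```

The last entry, '[document]', is the name of the BeautifulSoup object itself.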
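Siblings
Siblings (nodes at the same level) are accessed with the sibling and siblings attributes
A node has siblings before it and after it: previous_sibling and next_sibling return the nearest one on either side
previous_siblings and next_siblings return all the siblings before it or after it
Taking the following siblings as an example:
print(soup.a.next_siblings)
for i, s in enumerate(soup.a.next_siblings):
    print(i, " ", type(s), " ", s)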
<generator object PageElement.next_siblings at 0x0000000012F90570>
0
<class 'bs4.element.NavigableString'>
1
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
2
<class 'bs4.element.NavigableString'>
and
3
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
4
<class 'bs4.element.NavigableString'>
Here we can see that NavigableString objects and Tag nodes alike
count as siblings as long as they sit at the same level
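Because whitespace strings count as siblings, the nearest sibling is often just a newline. A small sketch of two ways to keep only the tags, assuming a made-up paragraph (doc) parsed with html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; the newlines become NavigableString siblings
doc = ('<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a>\n'
       'and\n<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(doc, 'html.parser')
first = soup.a

# option 1: filter the generator by type
tag_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]

# option 2: let find_next_siblings() do the filtering by tag name
same_ids = [s['id'] for s in first.find_next_siblings('a')]

print(tag_ids)   # ['link2', 'link3']
print(same_ids)  # ['link2', 'link3']
```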
Getting attribute values
To read a node's attributes, do the following
All attributes of a node: soup.p.attrs
A single attribute of a node: soup.p.attrs['xxx']
or soup.p['xxx']
print(soup.p.attrs)
print(type(soup.p.attrs))
print(soup.p.attrs['class'])
print(soup.p['class'])
{'class': ['story']}
<class 'dict'>
['story']
['story']
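Two details are easy to trip over here: class comes back as a list because it is a multi-valued attribute, and indexing a missing attribute raises KeyError. A minimal sketch with a made-up tag (doc), using html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
doc = '<p class="story lead" id="intro">text</p>'
soup = BeautifulSoup(doc, 'html.parser')
p = soup.p

classes = p['class']        # multi-valued attribute, returned as a list
ident = p['id']             # single-valued attribute, returned as a str
missing = p.get('title')    # .get() returns None for a missing attribute

try:
    p['title']              # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, ident, missing, raised)
```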
IV. Method selectors
1.find_all()
- name
# import the library
from bs4 import BeautifulSoup
# data source
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
# import data
soup = BeautifulSoup(html,'lxml')
# query the data
print(soup.find_all(name='ul'))
Return value:
a list containing the two ul nodes
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
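find_all() can be called again on each returned node
for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)
- attrs
Search by attribute: attrs={'class': 'list'}
# find the nodes whose class attribute contains list
print(soup.find_all(attrs={'class': 'list'}))
The return value is a list. Remember that the match is "contains", not "equals": class is a multi-valued attribute, so a node with class="list list-small" matches too
attrs takes a dict, so several attribute/value pairs can be combined in one search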
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
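To illustrate combining several attributes, here is a small sketch on a trimmed-down, made-up version of the two lists (doc), parsed with html.parser. The class_ and id keyword arguments are the shorthand form of the same filters:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down version of the two lists
doc = ('<ul class="list" id="list-1"><li>Foo</li></ul>'
       '<ul class="list list-small" id="list-2"><li>Bar</li></ul>')
soup = BeautifulSoup(doc, 'html.parser')

# one attribute: both lists carry the class "list"
both = soup.find_all(attrs={'class': 'list'})

# two attributes: every pair in the dict has to match
narrowed = soup.find_all(attrs={'class': 'list', 'id': 'list-2'})

# keyword form; class_ has a trailing underscore because class is reserved
by_keyword = soup.find_all(class_='list', id='list-2')

print([ul['id'] for ul in both])        # ['list-1', 'list-2']
print([ul['id'] for ul in narrowed])    # ['list-2']
print([ul['id'] for ul in by_keyword])  # ['list-2']
```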
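- text
Match by text
import re
# find text that contains Foo (case-insensitive)
print(soup.find_all(text=re.compile('Foo', re.I)))
['Foo', 'Foo']
Note that this returns a list of text strings, not a list of nodes. Newer versions of Beautiful Soup call this argument string; text is kept as an older alias
name, attrs, and text can of course be combined when searching for nodes
print(soup.find_all(name='li', attrs={'class': 'element'}, text=re.compile('Foo')))
This finds the nodes whose name is li, whose class attribute contains element, and whose text matches Foo
The return value is a list of nodes
[<li class="element">Foo</li>, <li class="element">Foo</li>]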
2.find()
It is used the same way as find_all(),
except that find_all() returns a list while find() returns only the first match, or None when nothing matches
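A quick sketch of the difference, including the no-match case, on a made-up list (doc) parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
doc = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(doc, 'html.parser')

first = soup.find('li')          # a single Tag
every = soup.find_all('li')      # a list of Tags
missing = soup.find('table')     # no match: None
nothing = soup.find_all('table') # no match: empty list

print(first.string)                # Foo
print([li.string for li in every]) # ['Foo', 'Bar', 'Jay']
print(missing, len(nothing))       # None 0
```

Since find() can return None, it is worth checking the result before chaining further attribute access onto it.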
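3.find_parents() and find_parent()
Return all ancestor nodes, and the direct parent node
4.find_next_siblings() and find_next_sibling()
Return all following siblings, and the first following sibling
5.find_previous_siblings() and find_previous_sibling()
Return all preceding siblings, and the first preceding sibling
6.find_all_next() and find_next()
Return all matching nodes after the current node, and the first matching node after it
7.find_all_previous() and find_previous()
Return all matching nodes before the current node, and the first matching node before it
Note: find, find_all, find_previous_siblings, and the rest:
every method selector above accepts name, attrs, and text to narrow the match
The method name chooses the scope of the search and the arguments narrow it precisely; together they give the final result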
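The notes give no example for these methods, so here is a minimal sketch that starts from one li and searches in each direction. The markup in doc is made up for the illustration and parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
doc = ('<div id="box"><ul id="list-1">'
       '<li>Foo</li><li>Bar</li><li>Jay</li>'
       '</ul><p>after</p></div>')
soup = BeautifulSoup(doc, 'html.parser')

# start from the middle list item
bar = soup.find_all('li')[1]

parent_div = bar.find_parent('div')            # nearest matching ancestor
next_li = bar.find_next_sibling('li')          # first following sibling
prev_li = bar.find_previous_sibling('li')      # first preceding sibling
next_p = bar.find_next('p')                    # first match later in the document
later_lis = bar.find_all_next('li')            # every match later in the document

print(parent_div['id'])                    # box
print(next_li.string, prev_li.string)      # Jay Foo
print(next_p.string)                       # after
print([li.string for li in later_lis])     # ['Jay']
```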
V. CSS selectors
Select with select()
# select the nodes with class panel-heading under a node with class panel
soup.select('.panel .panel-heading')
# select the list of nodes with class element under the node whose id is list-2
soup.select('#list-2 .element')
# select all ul nodes
soup.select('ul')
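The three calls above print nothing on their own. A self-contained sketch showing what they return, on a shortened made-up copy of the panel markup (doc) parsed with html.parser; select_one() is included as the single-result counterpart of select():

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the panel markup
doc = '''
<div class="panel">
<div class="panel-heading"><h4>Hello</h4></div>
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(doc, 'html.parser')

headings = soup.select('.panel .panel-heading')  # descendant selector
items = soup.select('#list-2 .element')          # id, then class
first_ul = soup.select_one('ul')                 # first match only
no_table = soup.select_one('table')              # None when nothing matches

print(len(headings))                   # 1
print([li.string for li in items])     # ['Foo', 'Bar']
print(first_ul['id'])                  # list-1
print(no_table)                        # None
```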
1. Nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>,
<li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
2. Getting attributes
for ul in soup.select('ul'):
    # from here on these are the same attributes the node selectors use
    print(ul.attrs)
    print(ul.attrs['id'])
    print(ul['id'])
{'class': ['list'], 'id': 'list-1'}
list-1
list-1
{'class': ['list', 'list-small'], 'id': 'list-2'}
list-2
list-2
3. Getting text
This again uses the same attributes as the node selectors
for li in soup.select('li'):
    print(li.text)
    print(li.string)
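Summary:
Parse with BeautifulSoup(html, 'lxml')
Node selectors are limited in what they can select, but they are fast
Use find()/find_all() to pick out a single result or several
If you know CSS, you can use select()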