[Web Scraping] An Introduction to the BeautifulSoup4 Module


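I'm 开心小甜瓜, a blogger on 靠谱客 (kaopuke). This article introduces the BeautifulSoup4 module in four parts: the basics, methods for navigating tags, regular expressions, and a few other features. I'm sharing it here as a reference.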

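1. BeautifulSoup4 basics
  - Installing BeautifulSoup4 with pip
  - Importing the BeautifulSoup4 module
  - Creating a bs4.BeautifulSoup object
  - Searching the BeautifulSoup object
2. Working with tags in BeautifulSoup4
  - Children and descendants
  - Siblings
  - Parents
3. Regular expressions
  - Common regular expression symbols
  - Finding images with a regular expression
4. Miscellaneous
  - Getting the attribute dictionary
  - Lambda expressions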

1. BeautifulSoup4 basics

- Installing BeautifulSoup4 with pip

pip install BeautifulSoup4

- Importing the BeautifulSoup4 module

import bs4

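- Creating a bs4.BeautifulSoup object

# Import the urllib.request module
import urllib.request
# urlopen() returns a response object; html.read() returns its body as bytes (the page can also be fetched by other means)
html = urllib.request.urlopen("http://pythonscraping.com/pages/page1.html")
# Parse with "html.parser" from the Python standard library; other parsers can be used but must be installed separately
soup = bs4.BeautifulSoup(html.read(), "html.parser")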
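
To try the constructor without a network connection, the same call can be made on a string. This is a minimal sketch: SAMPLE_HTML is made-up markup standing in for a downloaded page, not the content of the URL above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration only)
SAMPLE_HTML = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>Some body text.</div>
  </body>
</html>
"""

# BeautifulSoup accepts a str as well as the bytes returned by html.read()
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(soup.title.get_text())
print(soup.h1.get_text())
```

Parsing a literal string like this is also a convenient way to test selectors before pointing them at a live page.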

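- Searching the BeautifulSoup object

# Method 1: access a tag by name as an attribute of the soup; chaining several levels gives the same result
print(soup.h1)
print(soup.html.h1)
print(soup.html.body.h1)
# Method 2: find() returns the first match as a bs4.element.Tag, or None if nothing matches
name = soup.find("span", {"class": "red"})
print(name)
print(type(name))
# get_text() strips the tags and returns only the text (this assumes a match was found)
print(name.get_text())
# Method 3: find_all() (findAll is the older alias) returns a list-like <class 'bs4.element.ResultSet'>
nameList = soup.find_all("span", {"class": "green"})
print(type(nameList))
for name in nameList:
    # print(name)
    print(type(name))
    print(name.get_text())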
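
Because find() returns None when nothing matches, calling get_text() on its result can fail on pages that lack the element. The sketch below shows one way to guard against that; the markup and the class name "blue" are assumptions made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with coloured spans (illustration only)
SAMPLE_HTML = """
<p><span class="green">Anna</span> spoke to <span class="green">Pierre</span>
about <span class="red">the war</span>.</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_red = soup.find("span", {"class": "red"})
missing = soup.find("span", {"class": "blue"})  # no such span, so this is None

# Guard before calling get_text() so a missing tag does not raise AttributeError
red_text = first_red.get_text() if first_red is not None else ""
missing_text = missing.get_text() if missing is not None else ""

# find_all() returns an empty ResultSet rather than None, so it is safe to iterate
green_names = [tag.get_text() for tag in soup.find_all("span", {"class": "green"})]

print(red_text, repr(missing_text), green_names)
```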

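2. Working with tags in BeautifulSoup4

- Children and descendants

# Print the direct children of the table only
for child in soup.find("table", {"id": "giftList"}).children:
    print(child)
# Print every descendant of the table (children, grandchildren, and so on)
for child in soup.find("table", {"id": "giftList"}).descendants:
    print(child)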
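
The difference between the two is easiest to see on a small table. This is a sketch on made-up markup (only the id "giftList" comes from the example above), written without whitespace between tags so that no extra text nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table (illustration only)
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Fruit</td><td>$1.50</td></tr>"
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes directly under the table
child_names = [child.name for child in table.children]

# .descendants walks the whole subtree in document order, text nodes included
all_descendants = list(table.descendants)
descendant_tags = [node.name for node in all_descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_tags)
print(len(all_descendants))
```

Here the table has two children (the rows) but ten descendants: six tags plus four text nodes.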

- Siblings

# Print the first sibling after the tr (the tr itself is excluded)
print(soup.find("table", {"id": "giftList"}).tr.next_sibling)
# Print all siblings after the tr (the tr itself is excluded)
for sibling in soup.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
# Print the first sibling before the tr (None if there is none)
print(soup.find("table", {"id": "giftList"}).tr.previous_sibling)
# Print all siblings before the tr (the tr itself is excluded)
for sibling in soup.find("table", {"id": "giftList"}).tr.previous_siblings:
    print(sibling)
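
Two details of sibling navigation are worth seeing in isolation: the singular forms return one node (or None), and whitespace between tags counts as a sibling text node. The sketch below uses made-up markup; the row ids are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical table without whitespace between rows (illustration only)
soup = BeautifulSoup(
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>Fruit</td></tr>'
    '<tr id="gift2"><td>Rock</td></tr>'
    "</table>",
    "html.parser",
)
first_row = soup.find("table", {"id": "giftList"}).tr

next_id = first_row.next_sibling["id"]                  # a single node
later_ids = [row["id"] for row in first_row.next_siblings]  # a generator of nodes
before = first_row.previous_sibling                     # nothing precedes the first row

# With whitespace in the markup, the nearest sibling is a text node
spaced = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
first_li = spaced.li
raw_sibling = first_li.next_sibling                     # the newline between the items
tag_sibling = first_li.find_next_sibling("li")          # skips text, returns the next <li>

print(next_id, later_ids, before, repr(raw_sibling), tag_sibling.get_text())
```

When only tags matter, find_next_sibling() and find_previous_sibling() avoid having to filter out the whitespace nodes by hand.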

- Parents

# .parent gets the parent tag and .previous_sibling gets the previous sibling tag; combined, they reach other cells in the same table row
print(soup.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
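
The chain can be followed step by step on a single row. This sketch assumes a made-up row whose last cell holds the image and whose preceding cell holds a price; only the src value is taken from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table (illustration only)
soup = BeautifulSoup(
    '<table id="giftList"><tr>'
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>",
    "html.parser",
)

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                    # the <td> that wraps the image
price_cell = cell.previous_sibling   # the <td> just before it in the same row
price = price_cell.get_text()

print(cell.name, price)
```

If the real markup has whitespace between cells, previous_sibling may land on a text node first, so the chain may need find_previous_sibling("td") instead.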

3. Regular expressions

- Common regular expression symbols

[Image: table of common regular expression symbols]
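
A few of the most common symbols can be tried directly with Python's standard re module. The patterns and test strings below are chosen for illustration and are not taken from the original table.

```python
import re

# (pattern, text) pairs; each is checked with re.fullmatch, which must match the whole string
EXAMPLES = [
    (r"a*b", "b"),                  # *      zero or more of the preceding item
    (r"a+b", "b"),                  # +      one or more of the preceding item
    (r"colou?r", "color"),          # ?      zero or one of the preceding item
    (r"[A-Z][a-z]{2,4}", "Anna"),   # []     character class; {m,n} repeat m to n times
    (r"img\d+\.jpg", "img12.jpg"),  # \d     a digit; \. a literal dot
    (r"img\d+\.jpg", "img12xjpg"),  #        fails because \. only matches a real dot
    (r"jpg|png", "png"),            # |      either alternative
]
results = [bool(re.fullmatch(pattern, text)) for pattern, text in EXAMPLES]

# ^ and $ anchor a search to the start and end of the string
anchored_hit = bool(re.search(r"^img.*jpg$", "img1.jpg"))
anchored_miss = bool(re.search(r"^img", "an img"))

for (pattern, text), matched in zip(EXAMPLES, results):
    print(f"{pattern!r:22} {text!r:14} {matched}")
print(anchored_hit, anchored_miss)
```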

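- Finding images with a regular expression

# Regular expressions can be passed directly to BeautifulSoup4; this finds the images under a given path
import re
# The dots are escaped so they match literal "." characters rather than any character
images = soup.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image)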
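
Whether the dots in the pattern are escaped changes which images match, since an unescaped "." matches any character. The sketch below compares both forms on made-up src values; the file names are assumptions for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical image tags (illustration only)
soup = BeautifulSoup(
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/gifts/img3-jpg.png">',
    "html.parser",
)

loose = re.compile("../img/gifts/img.*.jpg")        # "." matches any character
strict = re.compile(r"\.\./img/gifts/img.*\.jpg")   # "\." matches only a literal dot

# BeautifulSoup keeps a tag when the pattern is found anywhere in the attribute value
loose_srcs = [img["src"] for img in soup.find_all("img", {"src": loose})]
strict_srcs = [img["src"] for img in soup.find_all("img", {"src": strict})]

print(loose_srcs)
print(strict_srcs)
```

The loose pattern also accepts img3-jpg.png, because its unescaped dot matches the hyphen before "jpg".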

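4. Miscellaneous

- Getting the attribute dictionary

images = soup.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(type(image))
    # attrs is the tag's attribute dictionary: {attribute: value}
    print(type(image.attrs))
    # These two lines are equivalent ways to read the src attribute
    print(image["src"])
    print(image.attrs["src"])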
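
Indexing a tag with an attribute it does not have raises KeyError, so get() is often the safer accessor. This is a small sketch on a made-up tag; the alt, class, and title attributes are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical image tag (illustration only)
soup = BeautifulSoup(
    '<img src="../img/gifts/img1.jpg" alt="Vegetable Basket" class="gift photo">',
    "html.parser",
)
img = soup.img

attr_names = sorted(img.attrs)         # the keys of the attribute dictionary
src = img["src"]                       # same value as img.attrs["src"]
classes = img["class"]                 # class is multi-valued, so it comes back as a list
title = img.get("title")               # missing attribute: None instead of KeyError
title_or_default = img.get("title", "untitled")

print(attr_names, src, classes, title, title_or_default)
```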

- Lambda expressions

# Return every tag that has exactly two attributes
print(soup.find_all(lambda tag: len(tag.attrs) == 2))
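
The function passed to the search is called once per tag and keeps the tags for which it returns True. The sketch below runs the same filter on made-up markup so the result is easy to check by eye.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with tags carrying 2, 1, 2, and 0 attributes (illustration only)
soup = BeautifulSoup(
    '<div id="a" class="x">'
    '<span class="y">one</span>'
    '<a href="/p" title="t">two</a>'
    "<p>three</p>"
    "</div>",
    "html.parser",
)

# Keep only tags with exactly two attributes
two_attr_names = [tag.name for tag in soup.find_all(lambda tag: len(tag.attrs) == 2)]

# Any condition on the tag works, for example matching on name and text together
span_one = soup.find_all(lambda tag: tag.name == "span" and tag.get_text() == "one")

print(two_attr_names, len(span_one))
```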

Category: Python web pages
Published: 2024-01-15 13:35:47
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_ogf3_12__23__18_5.html
