Python from Beginner to the Grave: Web Scraping (Parsing Pages with BeautifulSoup and lxml, Fetching Pages with requests)

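I am 直率火车, a blogger on 靠谱客. This article, collected during recent development work, covers web scraping in Python: parsing pages with BeautifulSoup and lxml, and fetching them with requests. I found it useful and am sharing it here for reference.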

Overview

CSDN Topic Challenge, Round 2
Entry topic: study notes

BeautifulSoup

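Get the text of all p tags

# Get the text of all p tags
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup


def fetch_p(html):
    # Parse with the lxml backend (requires the lxml package to be installed)
    soup = BeautifulSoup(html, 'lxml')
    # find_all returns every <p> tag in document order
    p_list = soup.find_all("p")
    results = [p.text for p in p_list]
    return results


if __name__ == '__main__':
    html = '''
<html>
<head>
<title>This is a simple test page</title>
</head>
<body>
<p class="item-0">The content of the body element is displayed in the browser.</p>
<p class="item-1">The content of the title element is displayed in the browser's title bar.</p>
</body>
</html>
'''
    p_text = fetch_p(html)
    print(p_text)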
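The sample page gives each paragraph a class, so a natural follow-up is to select paragraphs by class rather than taking all of them. The sketch below is one way to do that with `find_all` and its `class_` argument. The function name `fetch_p_by_class` and the shortened sample HTML are made up for illustration, and `html.parser` is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup


def fetch_p_by_class(html, class_name):
    """Return the text of every <p> whose class list contains class_name."""
    # html.parser ships with Python, so no extra parser install is needed.
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with a trailing underscore) filters on the CSS class attribute.
    return [p.get_text(strip=True) for p in soup.find_all("p", class_=class_name)]


# Hypothetical sample input, shortened from the page used above.
SAMPLE_HTML = """
<body>
<p class="item-0">first paragraph</p>
<p class="item-1">second paragraph</p>
</body>
"""

print(fetch_p_by_class(SAMPLE_HTML, "item-1"))
```

With this input only the second paragraph matches.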

Get text with BeautifulSoup

# -*- coding: UTF-8 -*-
# Get text with BeautifulSoup:
# extract all the text of a page
from bs4 import BeautifulSoup


def fetch_text(html):
    soup = BeautifulSoup(html, 'lxml')
    # .text concatenates every text node in the document, whitespace included
    result = soup.text
    return result


if __name__ == '__main__':
    html = '''
<html>
<head>
<title>This is a simple test page</title>
</head>
<body>
<p class="item-0">The content of the body element is displayed in the browser.</p>
<p class="item-1">The content of the title element is displayed in the browser's title bar.</p>
</body>
</html>
'''
    text = fetch_text(html)
    print(text)
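Because `soup.text` keeps the newlines between tags, the printed result contains blank lines. If cleaner output is wanted, `get_text` accepts `separator` and `strip` arguments. The following is a minimal sketch; `fetch_clean_text` and the one-line sample HTML are illustrative names and data, not part of the exercise above.

```python
from bs4 import BeautifulSoup


def fetch_clean_text(html):
    """Return the page text with one non-empty string per line."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text node and drops the ones that end up empty;
    # separator is placed between the remaining pieces.
    return soup.get_text(separator="\n", strip=True)


# Hypothetical sample input.
SAMPLE_HTML = (
    "<html><head><title>Test page</title></head>"
    "<body><p>First.</p>\n<p>Second.</p></body></html>"
)

print(fetch_clean_text(SAMPLE_HTML))
```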

Find all image URLs in a page

# Find all image URLs in a page
from bs4 import BeautifulSoup


def fetch_imgs(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Collect the src attribute of every <img> tag
    imgs = [tag['src'] for tag in soup.find_all('img')]
    return imgs


def test():
    imgs = fetch_imgs(
        '<p><img src="http://example.com"/><img src="http://example.com"/></p>')
    print(imgs)


if __name__ == '__main__':
    test()
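On real pages, `tag['src']` raises `KeyError` for an `<img>` that has no `src`, and many `src` values are relative paths. The sketch below shows one possible way to handle both cases, using `find_all("img", src=True)` and the standard library's `urljoin`. The function name `fetch_img_urls`, the `base_url` parameter, and the sample markup are assumptions added for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_img_urls(html, base_url):
    """Return absolute URLs for every <img> that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only tags that actually carry a src attribute.
    tags = soup.find_all("img", src=True)
    # urljoin resolves relative paths against base_url and leaves
    # absolute URLs unchanged.
    return [urljoin(base_url, tag["src"]) for tag in tags]


# Hypothetical sample input: one relative src, one missing src, one absolute src.
SAMPLE_HTML = (
    '<p><img src="/a.png"/><img alt="no source"/>'
    '<img src="http://example.com/b.png"/></p>'
)

print(fetch_img_urls(SAMPLE_HTML, "http://example.com/page/"))
```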

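Parsing pages with lxml

Get the text of all paragraphs with XPath

# Get the text of all paragraphs with XPath
# -*- coding: UTF-8 -*-
from lxml import etree


def fetch_text(html):
    # etree.HTML parses the string and returns the root element
    tree = etree.HTML(html)
    # //p/text() selects the text nodes directly under every <p>
    result = tree.xpath("//p/text()")
    return result


if __name__ == '__main__':
    html = '''
<html>
<head>
<title>This is a simple test page</title>
</head>
<body>
<p class="item-0">The content of the body element is displayed in the browser.</p>
<p class="item-1">The content of the title element is displayed in the browser's title bar.</p>
</body>
</html>
'''
    texts = fetch_text(html)
    print(texts)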
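Note that `//p/text()` returns only the text nodes that are direct children of each `<p>`, so a paragraph containing inline tags such as `<b>` comes back in pieces. If whole-paragraph strings are wanted, one option is to evaluate `string(.)` on each paragraph element. This is a sketch under that assumption; `fetch_paragraphs` and the sample markup are illustrative.

```python
from lxml import etree


def fetch_paragraphs(html):
    """Return one string per <p>, including text inside nested tags."""
    tree = etree.HTML(html)
    # string(.) concatenates all descendant text nodes of the context node.
    return [p.xpath("string(.)") for p in tree.xpath("//p")]


# Hypothetical sample input with a nested <b> tag.
SAMPLE_HTML = "<html><body><p>plain</p><p>with <b>bold</b> text</p></body></html>"

print(fetch_paragraphs(SAMPLE_HTML))
# For comparison, //p/text() splits the second paragraph around <b>.
print(etree.HTML(SAMPLE_HTML).xpath("//p/text()"))
```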

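Get all the text with XPath

# Get all the text with XPath
# -*- coding: UTF-8 -*-
from lxml import etree


def fetch_text(html):
    tree = etree.HTML(html)
    # //text() selects every text node in the document,
    # including the whitespace between tags
    result = tree.xpath("//text()")
    return result


if __name__ == '__main__':
    html = '''
<html>
<head>
<title>This is a simple test page</title>
</head>
<body>
<p>The content of the body element is displayed in the browser.</p>
<p>The content of the title element is displayed in the browser's title bar.</p>
</body>
</html>
'''
    texts = fetch_text(html)
    print(texts)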
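The list returned by `//text()` also contains the newline-only text nodes that sit between tags. A small filter, sketched below, drops them. The name `fetch_nonblank_text` and the shortened sample page are assumptions for illustration.

```python
from lxml import etree


def fetch_nonblank_text(html):
    """Return every text node, stripped, skipping whitespace-only nodes."""
    tree = etree.HTML(html)
    return [t.strip() for t in tree.xpath("//text()") if t.strip()]


# Hypothetical sample input, laid out one tag per line like the page above.
SAMPLE_HTML = """
<html>
<head>
<title>Test page</title>
</head>
<body>
<p>First.</p>
<p>Second.</p>
</body>
</html>
"""

print(fetch_nonblank_text(SAMPLE_HTML))
```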

Get the text of the paragraph whose class is "item-1" with XPath

# Get the text of the paragraph whose class is "item-1" with XPath
# -*- coding: UTF-8 -*-
from lxml import etree


def fetch_text(html):
    tree = etree.HTML(html)
    # The [@class='item-1'] predicate matches an exact class attribute value
    result = tree.xpath("//p[@class='item-1']/text()")
    return result


if __name__ == '__main__':
    html = '''
<html>
<head>
<title>This is a simple test page</title>
</head>
<body>
<p class="item-0">The content of the body element is displayed in the browser.</p>
<p class="item-1">The content of the title element is displayed in the browser's title bar.</p>
</body>
</html>
'''
    texts = fetch_text(html)
    print(texts)
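The predicate `@class='item-1'` compares the whole attribute value, so it misses an element written as `class="item-1 highlight"`. A common XPath 1.0 idiom pads the class list with spaces and searches for the padded token. The sketch below passes the class name as an XPath variable through lxml's keyword arguments; `HAS_CLASS`, `fetch_text_by_class`, and the sample markup are names and data introduced here for illustration.

```python
from lxml import etree

# "class list contains this token": pad both sides with spaces so that
# "item-1" does not match "item-10". $cls is an XPath variable.
HAS_CLASS = (
    "//p[contains(concat(' ', normalize-space(@class), ' '), "
    "concat(' ', $cls, ' '))]/text()"
)


def fetch_text_by_class(html, class_name):
    """Return the text of every <p> whose class list includes class_name."""
    tree = etree.HTML(html)
    # lxml binds keyword arguments to XPath variables of the same name.
    return tree.xpath(HAS_CLASS, cls=class_name)


# Hypothetical sample input.
SAMPLE_HTML = (
    '<body><p class="item-1 highlight">a</p>'
    '<p class="item-10">b</p><p class="item-1">c</p></body>'
)

print(fetch_text_by_class(SAMPLE_HTML, "item-1"))
```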

Fetching pages with requests

Get the HTML for a URL

# Get the HTML for a URL
# -*- coding: UTF-8 -*-
import requests


def get_html(url):
    # Send a GET request; response.text is the body decoded to str
    response = requests.get(url=url)
    result = response.text
    return result


if __name__ == '__main__':
    url = "http://www.baidu.com"
    html = get_html(url)
    print(html)
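The version above waits indefinitely if the server never answers, and it returns the body even for an error status. A slightly more defensive sketch adds a `timeout` and calls `raise_for_status()`. The `session` parameter and the 10-second default are assumptions added here so the function can also be driven by a `requests.Session` or a stand-in object during testing.

```python
import requests


def get_html(url, timeout=10.0, session=None):
    """Fetch url and return the decoded body, raising on HTTP error statuses."""
    # Fall back to the module-level requests API when no session is supplied.
    http = session if session is not None else requests
    # timeout is in seconds; without it the call can block forever.
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Example call (needs network access, so it is left commented out):
# print(get_html("http://www.baidu.com"))
```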

Fetching a page with requests, with headers

# Fetch the page for a URL, sending custom request headers
# -*- coding: UTF-8 -*-
import requests


def get_html(url, headers=None):
    # Pass headers through; otherwise requests sends its default User-Agent
    response = requests.get(url=url, headers=headers)
    return response.text


if __name__ == '__main__':
    # Write the headers correctly
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
    }
    url = "http://www.baidu.com"
    html = get_html(url, headers)
    print(html)
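To check that custom headers really are attached, a request can be built and prepared without being sent. The sketch below uses `requests.Request(...).prepare()` for that. The helper name `build_get` and the `my-crawler/0.1` User-Agent value are made up for the example.

```python
import requests


def build_get(url, headers=None):
    """Build a GET request and return its prepared form without sending it."""
    return requests.Request("GET", url, headers=headers).prepare()


# Hypothetical User-Agent string, used only for this demonstration.
prepared = build_get("http://www.baidu.com", {"User-Agent": "my-crawler/0.1"})

# Header lookup on a prepared request is case-insensitive.
print(prepared.method, prepared.headers["user-agent"])
```

Nothing goes over the network here, so this is a convenient way to inspect what would be sent.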

POST requests with requests

# POST requests with requests
# -*- coding: UTF-8 -*-
import requests


def get_response(url, data, headers=None):
    # Use keyword arguments: the third positional parameter of
    # requests.post is json, not headers
    response = requests.post(url, data=data, headers=headers)
    result = response.text
    return result


if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
    }
    data = {
        "key1": "value1",
        "key2": "value2"
    }
    url = "http://httpbin.org/post"
    html = get_response(url, data, headers)
    print(html)
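`requests.post` accepts the payload either as `data=` (form-encoded) or as `json=` (a JSON body), and the two produce different request bodies and `Content-Type` headers. The sketch below prepares both variants without sending them, so the difference can be inspected offline. The constant names `URL` and `PAYLOAD` are illustrative.

```python
import json

import requests

URL = "http://httpbin.org/post"
PAYLOAD = {"key1": "value1", "key2": "value2"}

# data= encodes the dict as an HTML form body.
form_request = requests.Request("POST", URL, data=PAYLOAD).prepare()
# json= serializes the dict to JSON and sets a JSON content type.
json_request = requests.Request("POST", URL, json=PAYLOAD).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json.loads(json_request.body))
```

This is also why positional arguments are risky with `requests.post`: a dict passed in the third position is treated as the `json` payload.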

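That is the end of this article.
If you found it useful, likes, bookmarks, and follows are welcome; your encouragement is my greatest motivation.
If you spot any errors or have questions, please point them out.
Home page: 共饮一杯无's blog index

Stay passionate and head for the next adventure.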


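The text and images here were provided by users for study and reference, or collected from the web; copyright belongs to the original author.

Category: Python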
Published: 2024-01-15 10:46:12
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_ogf3_13_z_18_4.html
