Overview
If I see table tags, I usually let pandas do the work; you can then filter out any columns you don't need.
html = """
Header | ||||
---|---|---|---|---|
1 | John | Jim | Russia | 2-1 |
"""
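from io import StringIO
import pandas as pd
# Wrap the string in StringIO: recent pandas versions deprecate passing literal HTML
# read_html returns a list of DataFrames, one per table found
df = pd.read_html(StringIO(html), skiprows=1)
results = df[0]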
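The column filtering mentioned above can be sketched as follows. This is a minimal illustration rather than the original markup: the table structure and the class names "table-id" and "table-country" are assumptions, since only "table-name" and "table-score" appear in the answer. Note that read_html also needs a parser backend such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup: the class names "table-id" and "table-country" are
# assumptions for this sketch; only "table-name" and "table-score" come
# from the answer itself.
html = """
<table>
  <tr><td colspan="5">Header</td></tr>
  <tr>
    <td class="table-id">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-country">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the input.
# skiprows=1 drops the "Header" row, so the columns are labelled 0..4.
tables = pd.read_html(StringIO(html), skiprows=1)
df = tables[0]

# Keep only the two name columns and the score column, selected by label.
results = df[[1, 2, 4]]
print(results)
```

Selecting by position like this is only reasonable when the column order is stable; if the table has real column headers, selecting by name is usually the safer choice.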
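Edit:
However, if you care more about the actual class attributes, I can offer two options.
Option 1
Still parse the table with pandas, but first use BeautifulSoup's .decompose() to remove the columns/tags/classes (whatever you want to call them) that you don't need: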
from io import StringIO
import pandas as pd
import bs4
html = """
Header | ||||
---|---|---|---|---|
1 | John | Jim | Russia | 2-1 |
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
keep_list = ["table-name", "table-score"]
for data in soup.find_all('td'):
    # Look at the first class of each cell
    class_attr = data['class'][0]
    if class_attr in keep_list:
        continue
    # Remove the unwanted cell from the tree before pandas sees it
    data.decompose()
# Wrap the string in StringIO: recent pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(str(soup)), skiprows=1)
results = df[0]
Output:
print(results)
      0    1    2
0  John  Jim  2-1
Option 2
Similar to the other solution: simply find the cells that carry the specific class attributes.
import bs4
html = """
Header | ||||
---|---|---|---|---|
1 | John | Jim | Russia | 2-1 |
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
keep_list = ["table-name", "table-score"]
# Match only the <td> cells whose class is in keep_list
alpha = soup.find_all('td', class_=lambda x: x in keep_list)
for data in alpha:
    print(data.text)
# or, if you want the values in a list
results = [data.text for data in alpha]
Output:
John
Jim
2-1
Or, using a list comprehension, it can be done in 3 lines:
soup = bs4.BeautifulSoup(html, 'html.parser')
keep_list = ["table-name", "table-score"]
results = [ data.text for data in soup.find_all('td', class_=lambda x: x in keep_list)]
Output:
print(results)
['John', 'Jim', '2-1']
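The flat list above loses track of which values belong to which table row. If the table has several rows, one possible approach is to loop over each tr and collect the kept cells per row. This is only a sketch: the markup, the class names "table-id" and "table-country", and the second row are invented for illustration.

```python
import bs4

# Hypothetical markup: "table-id" and "table-country" are assumed class
# names, and the second data row is made up purely for illustration.
html = """
<table>
  <tr><td colspan="5">Header</td></tr>
  <tr>
    <td class="table-id">1</td>
    <td class="table-name">John</td>
    <td class="table-name">Jim</td>
    <td class="table-country">Russia</td>
    <td class="table-score">2-1</td>
  </tr>
  <tr>
    <td class="table-id">2</td>
    <td class="table-name">Ann</td>
    <td class="table-name">Bob</td>
    <td class="table-country">Spain</td>
    <td class="table-score">0-3</td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
keep_list = ["table-name", "table-score"]

rows = []
for tr in soup.find_all("tr"):
    # Passing a list to class_ matches cells carrying any of those classes.
    cells = [td.get_text(strip=True) for td in tr.find_all("td", class_=keep_list)]
    if cells:  # skip rows (such as the header) with no matching cells
        rows.append(cells)

print(rows)
```

Each inner list then corresponds to one table row, which makes it straightforward to unpack the values or hand them to pandas later.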