Overview
I am looking to parse an HTML table with Python/BeautifulSoup...
This is my first attempt at coding anything in Python, so it's probably not the most efficient.
I grabbed a function from another post here (works great for the most part), but I am running into a couple of problems.
The code I am running is here:
def strip_tags(html, invalid_tags):
    bs2 = BeautifulSoup(str(html))
    for tag in bs2.findAll(True):
        if tag.name in invalid_tags:
            # Rebuild the tag's contents as plain text, recursing into child tags
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return bs2

invalid_tags = ['td', 'b']
for row in bs.findAll('tr'):
    col = row.findAll('td')
    for index, item in enumerate(col):
        t = item.findAll('a')
        for ta in t:
            ta.replaceWithChildren()
        col[index] == item  # note: a comparison, not an assignment, so it has no effect
    for item in col:
        # item.string is None when the cell has more than one child
        print(strip_tags(item.string, invalid_tags).string)
The raw data table (HTML) looks like this:
11/10 | N ARMY | -7.5 | NL | 76-65 | W | W | 50.0% | 76.9% | 37.5% | 37.1% | 90.0% | 29.4%
When I run the strip_tags function, it works for all the tags except for the second line... 'None' is returned as the output.
If anyone could provide any insight on why this is happening I would greatly appreciate it.
Edit: wow, thanks for everyone's quick responses. Anyhow, here is what happens when I run the code:
11/10
None
-7.5
NL
76-65
W
W
None
50.0%
76.9%
37.5%
37.1%
90.0%
29.4%
The problem lies around the second line, where it returns 'None' instead of 'N ARMY'. So yes, ideally I would like just the text that is found within the tags.
Solution
If I'm understanding the output you want correctly, you shouldn't need to do any manual removing of tags -- that's why we use BeautifulSoup! ;)
What you need to call is the get_text() method on the tag instances that find_all() returns.
Using your sample html:
11/10 | N ARMY | -7.5 | NL | 76-65 | W | W | 50.0% | 76.9% | 37.5% | 37.1% | 90.0% | 29.4% |
A simple iteration over the tds, and a call to get_text() and we're good to go!
from bs4 import BeautifulSoup

with open('test.html', 'rb') as html:  # My local version of your html file
    soup = BeautifulSoup(html.read(), 'html.parser')

for td in soup.find_all('td'):
    print(td.get_text())
This gives the output:
11/10
N ARMY
-7.5
NL
76-65
W
W
50.0%
76.9%
37.5%
37.1%
90.0%
29.4%
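As for why the original code printed None: in BeautifulSoup, a tag's .string is None whenever the tag has more than one child, which is most likely what happens with the 'N ARMY' cell. Here is a minimal sketch of the difference. Note that the markup below is made up for illustration (the real HTML isn't shown in the question), and row_texts is just a hypothetical helper name, not anything from the library:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the question's real HTML is not shown, so this
# row is an assumption made up for illustration only.
SAMPLE = (
    "<table><tr>"
    "<td>11/10</td>"
    "<td><b>N</b> <a href='#'>ARMY</a></td>"
    "<td>-7.5</td>"
    "</tr></table>"
)


def row_texts(html):
    """Return one list of cell texts per <tr> (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        [td.get_text() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]


soup = BeautifulSoup(SAMPLE, "html.parser")
cells = soup.find_all("td")

# A cell with a single text child: .string returns that text.
print(cells[0].string)

# A cell with several children (<b>, a space, <a>): .string is None.
print(cells[1].string)

# get_text() joins the text of all descendants instead.
print(cells[1].get_text())

# Per-row extraction, one list of strings per table row.
print(row_texts(SAMPLE))
```

If you want the cells grouped by row rather than one long stream, something like row_texts gives you a list per tr, which is closer to the loop structure in your original code.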