Getting list content from HTML: using Python and BeautifulSoup to extract table content from HTML


I'm 爱笑橘子, a blogger on 靠谱客. This post covers extracting table content from HTML with Python and BeautifulSoup; I'm sharing it here as a reference.

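I want to extract certain information from an HTML document. For example, it contains a table (among other tables holding other content) that looks like this: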

Advisory:	RHBA-2013:0947-1
Type:	Bug Fix Advisory
Severity:	N/A
Issued on:	2013-06-13
Last updated on:	2013-06-13
Affected Products:	Red Hat Enterprise Linux ELS (v. 4)

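I want to extract information such as the value after "Issued on:". It looks like BeautifulSoup4 should make this easy, but somehow I can't get it to work. My code so far: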

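from bs4 import BeautifulSoup

# Placeholder name: a unicode string holding the entire HTML document.
soup = BeautifulSoup(unicodestring_containing_the_entire_html_doc, "html.parser")

# soup.table is the first <table> in the document.
table_tag = soup.table

if table_tag['class'] == ['details']:
    # First row only: header cell text plus data cell text.
    print(table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text())
    # Next sibling of the <table> itself, not of its first row.
    a = table_tag.next_sibling
    print(str(a))
    print(table_tag.contents)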

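This gets me the content of the first table row, plus the list of contents. But the next_sibling part doesn't work; I suppose I'm just using it wrong. Of course I could parse the contents myself, but it seems to me that Beautiful Soup is meant to save us from doing exactly that (if I start parsing by hand, I might as well parse the whole document...). If anyone could tell me how to accomplish this, I'd be very grateful. If there is a better way than BeautifulSoup, I'd be interested to hear about it.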
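
One possible approach, offered as a sketch rather than a definitive answer: select the table by its class and loop over its rows with find_all, instead of stepping through next_sibling. In the code above, table_tag.next_sibling is whatever node follows the <table> itself, which is often just a whitespace string, so it never reaches the later rows. The markup in SAMPLE_HTML and the helper name extract_details are assumptions made for illustration; the real page is only known to have a table with class "details" whose rows hold a th and a td.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table class="other"><tr><td>unrelated</td></tr></table>
<table class="details">
  <tr><th>Advisory:</th><td>RHBA-2013:0947-1</td></tr>
  <tr><th>Type:</th><td>Bug Fix Advisory</td></tr>
  <tr><th>Severity:</th><td>N/A</td></tr>
  <tr><th>Issued on:</th><td>2013-06-13</td></tr>
  <tr><th>Last updated on:</th><td>2013-06-13</td></tr>
  <tr><th>Affected Products:</th><td>Red Hat Enterprise Linux ELS (v. 4)</td></tr>
</table>
"""


def extract_details(html):
    """Return a dict mapping each <th> label to its <td> text."""
    soup = BeautifulSoup(html, "html.parser")
    # Select the table by class instead of relying on it being the first one.
    table = soup.find("table", class_="details")
    details = {}
    if table is None:
        return details
    # find_all("tr") yields only row tags, so the whitespace text nodes
    # that .next_sibling can return between tags are skipped.
    for row in table.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header is not None and cell is not None:
            label = header.get_text(strip=True).rstrip(":")
            details[label] = cell.get_text(strip=True)
    return details


details = extract_details(SAMPLE_HTML)
print(details["Issued on"])  # 2013-06-13
```

Building a dict keeps every label available by name, so "Last updated on" or "Severity" can be read the same way without walking the tree again.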