I'm 追寻世界, a blogger at 靠谱客. I recently came across this article on web scraping every forum post with Python and BeautifulSoup while developing, thought it was quite good, and am sharing it here as a reference.

Overview

Hello once again, fellow Stackers. Short description: I am web scraping some data from an automotive forum using Python and saving all the data into CSV files. With some help from other Stack Overflow members, I have managed to mine through all the pages for a certain topic, gathering the date, title and link for each post.

I also have a separate script I am now struggling to integrate (for every link found, Python creates a new soup for it, scrapes through all the posts and then goes back to the previous link).

I would really appreciate any other tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but it looks right to me even after checking it through multiple times.

Here's the code snippet:

link += div.get('href')
savedData += "\n" + title + ", " + link
tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 3:
    for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
        for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += tempNext.get('href')
        post = ""
        post += tempRow.get_text(strip=True)
        postData += post + "\n"
    tempNumber += 1
    tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
    tempSoup = make_soup(tempNewUrl)
    print(tempNewUrl)

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
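Note: the snippet calls a make_soup helper that the post never shows. A minimal sketch of what such a helper could look like, assuming the requests library and BeautifulSoup 4, is:

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    # Fetch the page and parse its HTML into a navigable BeautifulSoup tree
    response = requests.get(url)
    return BeautifulSoup(response.text, "html.parser")

With a helper along these lines, each make_soup(...) call above returns a freshly parsed page for the find_all calls to walk.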

My main issue with it so far is that

tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

does not seem to create a new soup after it has finished scraping all the posts of a forum thread.

This is the output I'm getting:

http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2

http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3

1

So it does seem to find the correct links for the new pages and scrape them; however, on the next iteration it prints the new dates AND the same exact pages. There's also a really weird 10-12 second delay after the last link is printed; only then does it hop down to print the number 1 and then bash out all the new dates.

But after moving on to the next forum thread's link, it scrapes the same exact data every time.

Sorry if it looks really messy; it is sort of a side project, and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers about where my logic might be wrong would be greatly appreciated!

Kind regards, and thanks for reading such an annoyingly long post!

EDIT: I've cut out the majority of the post / code snippet, as I believe people were getting overwhelmed, and left just the essential bit I am trying to work with. Any help would be much appreciated!

Solution

So after spending a little more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all of its pages and continues on with the next link.

Here is the fixed code, in case anyone can make use of it:

link += div.get('href')
savedData += "\n" + title + ", " + link
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 4:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        post = ""
        post += postScrape.get_text(strip=True)
        postData += post + "\n"
        print(post)
    for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
        tempNextPage = ""
        tempNextPage += tempNext.get('href')
        print(tempNextPage)
    soup3 = ""
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
    tempNumber += 1

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)

All I had to do was separate the two for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.

The non-working bit: the first two threads of the provided link have multiple pages of posts, while the following 10+ threads do not. I cannot figure out a way to check the result of

for tempNext in soup3.find_all(title=re.compile("^Next Page -")):

outside of the loop to see whether it's empty or not. If no next-page element / href is found, the code just reuses the last one. But if I reset the value after each run, it no longer mines each page =/ A solution that just created another problem :D
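One possible way around that last problem, offered as a sketch rather than the author's actual fix: use find() for the pagination link instead of find_all(), since find() returns None when no "Next Page" anchor exists, which gives a clean exit condition:

# Hypothetical rework of the paging loop above, reusing the same
# make_soup helper, soup3 starting page and postData accumulator.
while True:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        postData += postScrape.get_text(strip=True) + "\n"
    # find() returns None once the thread has no further pages,
    # so we stop instead of reusing the previous thread's href
    tempNext = soup3.find("a", title=re.compile("^Next Page -"))
    if tempNext is None:
        break
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNext.get('href'))

This also removes the need for the tempNumber counter and the reset-after-each-run workaround, because the break fires exactly when the last page of a thread has been scraped.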

Finally

That's everything 追寻世界 has collected on web scraping every forum post with Python and BeautifulSoup. I hope this article helps you solve the development problems you've run into with it.

If you find the content on 靠谱客 useful, feel free to recommend the site to your programmer friends.

This article was contributed by a community member and compiled from the web for learning and reference; copyright remains with the original author.