I'm 追寻世界, a blogger at 靠谱客. I recently came across this article on web scraping every forum post with Python and BeautifulSoup while developing, thought it was quite good, and am sharing it here as a reference.

Overview

Hello once again, fellow Stackers. Short description: I am web scraping some data from an automotive forum using Python and saving all the data into CSV files. With some help from other Stack Overflow members, I have managed to mine through all the pages for a certain topic, gathering the date, title and link for each post.

I also have a separate script I am now struggling to integrate (for every link found, Python creates a new soup for it, scrapes through all the posts and then goes back to the previous link).

I would really appreciate any other tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but it looks right to me even after checking it through multiple times.

Here's the code snippet:

link += div.get('href')
savedData += "\n" + title + ", " + link
tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 3:
    for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
        for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += tempNext.get('href')
        post = ""
        post += tempRow.get_text(strip=True)
        postData += post + "\n"
    tempNumber += 1
    tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
    tempSoup = make_soup(tempNewUrl)
    print(tempNewUrl)

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
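Note: the snippet calls a make_soup helper that the post never shows. A minimal sketch of what such a helper could look like, assuming the requests library and BeautifulSoup 4, is:

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    # Fetch the page and parse its HTML into a navigable BeautifulSoup tree
    response = requests.get(url)
    return BeautifulSoup(response.text, "html.parser")

With a helper along these lines, each make_soup(...) call above returns a freshly parsed page for the find_all calls to walk.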

My main issue with it so far is that

tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

does not seem to create a new soup after it has finished scraping all the posts of a forum thread.

This is the output I'm getting:

http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2

http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3

1

So it does seem to find the correct links for the new pages and scrape them; however, on the next iteration it prints the new dates AND the same exact pages. There's also a really weird 10-12 second delay after the last link is printed; only then does it hop down to print the number 1 and then bash out all the new dates.

But after moving on to the next forum thread's link, it scrapes the same exact data every time.

Sorry if it looks really messy; it is sort of a side project, and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers about where my logic might be wrong would be greatly appreciated!

Kind regards, and thanks for reading such an annoyingly long post!

EDIT: I've cut out the majority of the post / code snippet, as I believe people were getting overwhelmed, and left just the essential bit I am trying to work with. Any help would be much appreciated!

Solution

So after spending a little more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all of its pages and continues on with the next link.

Here is the fixed code, in case anyone can make use of it:

link += div.get('href')
savedData += "\n" + title + ", " + link
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 4:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        post = ""
        post += postScrape.get_text(strip=True)
        postData += post + "\n"
        print(post)
    for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
        tempNextPage = ""
        tempNextPage += tempNext.get('href')
        print(tempNextPage)
    soup3 = ""
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
    tempNumber += 1

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)

All I had to do was separate the two for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.

The non-working bit: the first two threads of the provided link have multiple pages of posts, while the following 10+ threads do not. I cannot figure out a way to check the result of

for tempNext in soup3.find_all(title=re.compile("^Next Page -")):

outside of the loop to see whether it's empty or not. If no next-page element / href is found, the code just reuses the last one. But if I reset the value after each run, it no longer mines each page =/ A solution that just created another problem :D
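One possible way around that last problem, offered as a sketch rather than the author's actual fix: use find() for the pagination link instead of find_all(), since find() returns None when no "Next Page" anchor exists, which gives a clean exit condition:

# Hypothetical rework of the paging loop above, reusing the same
# make_soup helper, soup3 starting page and postData accumulator.
while True:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        postData += postScrape.get_text(strip=True) + "\n"
    # find() returns None once the thread has no further pages,
    # so we stop instead of reusing the previous thread's href
    tempNext = soup3.find("a", title=re.compile("^Next Page -"))
    if tempNext is None:
        break
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNext.get('href'))

This also removes the need for the tempNumber counter and the reset-after-each-run workaround, because the break fires exactly when the last page of a thread has been scraped.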

Finally

That's everything 追寻世界 has collected on web scraping every forum post with Python and BeautifulSoup. I hope this article helps you solve the development problems you've run into with it.

If you find the content on 靠谱客 useful, feel free to recommend the site to your programmer friends.

This article was contributed by a community member and compiled from the web for learning and reference; copyright remains with the original author.