How can I make the PhantomJS webdriver wait until a specific HTML element has loaded before returning page_source?

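Overview

I have developed the following code for a web crawler object.

It takes two dates as input, builds the list of dates between them, and appends each date to the URL of a web page that holds weather information for a location. It then converts the HTML table on that page into a DataFrame and saves the data as a CSV file on disk (the base link is: https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/2019-1-3, where the date in this example is 2019-1-3):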

from datetime import timedelta, date
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from furl import furl
import os
import time

class WebCrawler():
    def __init__(self, st_date, end_date):
        if not os.path.exists('Data'):
            os.makedirs('Data')
        self.path = os.path.join(os.getcwd(), 'Data')
        self.driver = webdriver.PhantomJS()
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        # Create list of dates between two dates given as inputs.
        dates = []
        total_days = int((self.end_date - self.st_date).days + 1)
        for i in range(total_days):
            date = self.st_date + timedelta(days=i)
            dates.append(date.strftime('%Y-%m-%d'))
        return dates

    def create_link(self, attachment):
        # Attach dates to base link
        f = furl(self.base_url)
        f.path /= attachment
        f.path.normalize()
        return f.url

    def open_link(self, link):
        # Opens link and visits page and returns html source code of page
        self.driver.get(link)
        html = self.driver.page_source
        return html

    def table_to_df(self, html):
        # Finds table of weather data and converts it into pandas dataframe and returns it
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find("table", {"class": "tablesaw-sortable"})
        dfs = pd.read_html(str(table))
        df = dfs[0]
        return df

    def to_csv(self, name, df):
        # Save the dataframe as csv file in the defined path
        filename = name + '.csv'
        df.to_csv(os.path.join(self.path, filename), index=False)

This is how I want to use the WebCrawler object:

date1 = date(2018, 12, 29)
date2 = date(2019, 1, 1)

# Initialize WebCrawler object
crawler = WebCrawler(st_date=date1, end_date=date2)
dates = crawler.date_list()

for day in dates:
    print('**************************')
    print('PROCESSING : ', day)
    link = crawler.create_link(day)
    print('WAITING... ')
    time.sleep(3)
    print('VISIT WEBPAGE ... ')
    html = crawler.open_link(link)
    print('DATA RETRIEVED ... ')
    df = crawler.table_to_df(html)
    print(df.head(3))
    crawler.to_csv(day, df)
    print('DATA SAVED ...')

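The problem is that the first iteration of the loop runs perfectly, but the second one stops with an error saying No tables found. The error is reported around the line table = soup.find("table",{"class":"tablesaw-sortable"}): when the table is missing, soup.find returns None and the pd.read_html call that follows has nothing to parse. This happens because WebCrawler.open_link returns the page source before the page has fully loaded its content, including the table that holds the weather information. It is also possible that the website rejects the request because the crawler keeps its server too busy.

Is there a way to build a loop that keeps trying to open the link until it finds the table, or at least waits until the table has loaded and then returns it?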

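Answer

You can make selenium wait for a specific element. In your case that is the table with the class name "tablesaw-sortable". I strongly recommend using a CSS selector to find this element, because it is fast and less error-prone than fetching all table elements.

Here is the CSS selector, ready-made for you: table.tablesaw-sortable. Set selenium to wait until that element has loaded.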

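Another answer

I rewrote the code using the solution from https://stackoverflow.com/a/26567563/4159473, as suggested by @mildmelon. I also added some delay between each request to the server and before asking for the page source: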

from datetime import timedelta, date
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd
from furl import furl
import os
import time

class WebCrawler():
    def __init__(self, st_date, end_date):
        if not os.path.exists('Data'):
            os.makedirs('Data')
        self.path = os.path.join(os.getcwd(), 'Data')
        # Note: webdriver.PhantomJS was removed in Selenium 4, so this needs Selenium 3.x
        self.driver = webdriver.PhantomJS()
        # Maximum number of seconds to wait for the table to appear
        self.delay_for_page = 7
        self.base_url = 'https://www.wunderground.com/history/daily/ir/mashhad/OIMM/date/'
        self.st_date = st_date
        self.end_date = end_date

    def date_list(self):
        # Create list of dates between two dates given as inputs.
        dates = []
        total_days = int((self.end_date - self.st_date).days + 1)
        for i in range(total_days):
            date = self.st_date + timedelta(days=i)
            dates.append(date.strftime('%Y-%m-%d'))
        return dates

    def create_link(self, attachment):
        # Attach dates to base link
        f = furl(self.base_url)
        f.path /= attachment
        f.path.normalize()
        return f.url

    def open_link(self, link):
        # Opens link and waits until the weather table is present in the page.
        # Raises TimeoutException if it does not appear within delay_for_page seconds.
        self.driver.get(link)
        myElem = WebDriverWait(self.driver, self.delay_for_page).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'tablesaw-sortable')))

    def table_to_df(self, html):
        # Finds table of weather data and converts it into pandas dataframe and returns it
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find("table", {"class": "tablesaw-sortable"})
        dfs = pd.read_html(str(table))
        df = dfs[0]
        return df

    def to_csv(self, name, df):
        # Save the dataframe as csv file in the defined path
        filename = name + '.csv'
        df.to_csv(os.path.join(self.path, filename), index=False)

date1 = date(2019, 2, 1)
date2 = date(2019, 3, 5)

# Initialize WebCrawler object
crawler = WebCrawler(st_date=date1, end_date=date2)
dates = crawler.date_list()

for day in dates:
    print('**************************')
    print('DATE : ', day)
    link = crawler.create_link(day)
    print('WAITING ....')
    print('')
    # Pause between requests so the server is not hit too often
    time.sleep(12)
    print('OPENING LINK ... ')
    try:
        crawler.open_link(link)
        # The wait in open_link has succeeded, so the table is in the source
        html = crawler.driver.page_source
        print("DATA IS FETCHED")
        df = crawler.table_to_df(html)
        print(df.head(3))
        crawler.to_csv(day, df)
        print('DATA SAVED ...')
    except TimeoutException:
        print("NOT FETCHED ...!!!")

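With this version the weather data was retrieved without problems. I think the delay between requests makes the crawl more reliable. The line myElem = WebDriverWait(self.driver, self.delay_for_page).until(EC.presence_of_element_located((By.CLASS_NAME, 'tablesaw-sortable'))) also helped with speed, since the wait returns as soon as the table is present instead of always sleeping for a fixed time.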
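The question also asked for a loop that keeps trying until the table shows up. Below is a minimal sketch of such a retry loop, written independently of Selenium. Here fetch is any callable that returns the HTML or raises when the page is not ready. The names fetch_with_retries, max_attempts, base_delay and retry_on are hypothetical, and the growing delay between attempts is an assumption meant to reduce load on the server, not something the original answers prescribe.

```python
import time


def fetch_with_retries(fetch, max_attempts=3, base_delay=5.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay * attempt_number seconds after each failed attempt
    and re-raises the last error if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Back off a little longer after each failure.
            sleep(base_delay * attempt)
```

With the crawler above, passing retry_on=(TimeoutException,) would retry only when the explicit wait times out and let any other error surface immediately.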
