Web Scraping with Scrapy


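I'm 欢呼猎豹, a blogger on 靠谱客. I collected this article on web scraping with Scrapy during recent development work, found it useful, and am sharing it here as a reference.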

Overview

I have recently been using Scrapy for web scraping, and for Python developers it is very convenient. The detailed documentation is here: http://doc.scrapy.org/en/0.14/index.html

To scrape page data with Scrapy, first create a new project: scrapy startproject myproject

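Once the project is created, there is a myproject/myproject subdirectory. It contains items.py (which defines the things you want to scrape), pipelines.py (which processes the scraped data, for example by saving it to a database), and a spiders folder where you write the spider scripts.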

The example here scrapes book information from a website:

items.py is as follows:

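    from scrapy.item import Item, Field


    class BookItem(Item):
        # Every field the spider assigns must be declared here.
        name = Field()
        author = Field()  # added: the spider sets item['author'], so it must be declared
        publisher = Field()
        publish_date = Field()
        price = Field()

Everything we want to scrape is now defined above: the name, author, publisher, publication date, and price.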
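To illustrate why every field has to be declared, here is a minimal stand-in written without Scrapy. It assumes only that an item behaves like a dict that rejects keys which were not declared. The class names `DeclaredFieldsItem` and `SketchBookItem` and the `isbn` key are hypothetical and exist only for this sketch; it is not Scrapy's implementation.

```python
class DeclaredFieldsItem(dict):
    """Hypothetical stand-in for an item: only declared fields may be assigned."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject any key that was not declared on the class.
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        super().__setitem__(key, value)


class SketchBookItem(DeclaredFieldsItem):
    fields = ('name', 'publisher', 'publish_date', 'price')


item = SketchBookItem()
item['name'] = 'Example Book'   # declared field: accepted

try:
    item['isbn'] = '000'        # undeclared field: rejected
    rejected = False
except KeyError:
    rejected = True
```

The practical consequence is that a spider assigning a key that is missing from the item definition fails at assignment time rather than silently storing the value.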

Next, write the spider that scrapes the information from the website.

spiders/book.py is as follows:

    # These imports follow the Scrapy 0.14 API that this article targets.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    from myproject.items import BookItem


    class BookSpider(CrawlSpider):
        name = 'bookspider'
        # All addresses below are fictional; replace them with the real site.
        allowed_domains = ['test_url.com']
        start_urls = [
            "http://test_url.com",  # page where the crawl starts
        ]
        rules = (
            # Pages matching this rule are not parsed for content; only their links are followed.
            Rule(SgmlLinkExtractor(allow=(r'http://test_url\.com/test\?page_index=\d+',))),
            # Pages matching this rule have their content extracted by parse_item.
            Rule(SgmlLinkExtractor(allow=(r'http://test_url\.com/test\?product_id=\d+',)), callback="parse_item"),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            item = BookItem()
            item['name'] = hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()[0]
            item['author'] = hxs.select('//div[@class="book_detailed"]/p[1]/a/text()').extract()
            publisher = hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
            item['publisher'] = publisher[0] if publisher else ''
            # Capture the date that follows a Chinese label and a full-width colon (U+FF1A).
            publish_date = hxs.select('//div[@class="book_detailed"]/p[3]/text()').re(u"[\u2e80-\u9fffh]+\uff1a([\\d-]+)")
            item['publish_date'] = publish_date[0] if publish_date else ''
            # Match a decimal number such as 45.80 (pattern reconstructed; adjust to the real page).
            prices = hxs.select('//p[@class="price_m"]/text()').re(r"(\d+\.?\d*)")
            item['price'] = prices[0] if prices else ''
            return item
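The two crawl rules can be checked outside Scrapy with the standard `re` module. The sketch below assumes the rules are meant to match a numeric `page_index` or `product_id` query parameter on the fictional site, and that patterns are applied search-style to each URL. The names `LIST_PAGE`, `DETAIL_PAGE`, and `classify` are hypothetical helpers for this illustration only.

```python
import re

# Hypothetical patterns mirroring the two crawl rules (the site is fictional).
# The dot and the question mark are escaped so they match literally.
LIST_PAGE = re.compile(r'http://test_url\.com/test\?page_index=\d+')
DETAIL_PAGE = re.compile(r'http://test_url\.com/test\?product_id=\d+')


def classify(url):
    """Return 'parse' for detail pages, 'follow' for list pages, else 'ignore'."""
    if DETAIL_PAGE.search(url):
        return 'parse'
    if LIST_PAGE.search(url):
        return 'follow'
    return 'ignore'


list_result = classify('http://test_url.com/test?page_index=3')
detail_result = classify('http://test_url.com/test?product_id=42')
other_result = classify('http://test_url.com/about')

# Pitfall: an unescaped "?" makes the preceding "t" optional instead of
# matching a literal question mark, so the real URL no longer matches.
unescaped = re.search(r'http://test_url\.com/test?page_index=\d+',
                      'http://test_url.com/test?page_index=3')
```

Testing patterns this way before a crawl is cheap, and it catches escaping mistakes that would otherwise show up only as a spider that visits no pages.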

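Once the information has been scraped it needs to be saved, which is the job of pipelines.py. (Because Scrapy is built on Twisted, refer to the Twisted documentation for the details of the database operations; this is only a brief introduction to saving items to a database.)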

    from scrapy import log
    from twisted.enterprise import adbapi
    import MySQLdb.cursors


    class MySQLStorePipeline(object):

        def __init__(self):
            # adbapi runs the blocking MySQLdb calls in a thread pool,
            # so database writes do not block the crawl.
            self.dbpool = adbapi.ConnectionPool('MySQLdb',
                    db='test',
                    user='user',
                    passwd='******',
                    cursorclass=MySQLdb.cursors.DictCursor,
                    charset='utf8',
                    use_unicode=False
            )

        def process_item(self, item, spider):
            # Run the insert in a pool thread; failures are routed to handle_error.
            query = self.dbpool.runInteraction(self._conditional_insert, item)
            query.addErrback(self.handle_error)
            return item

        def _conditional_insert(self, tx, item):
            # Only store items that have a name.
            if item.get('name'):
                tx.execute(
                    "insert into book (name, publisher, publish_date, price) "
                    "values (%s, %s, %s, %s)",
                    (item['name'], item['publisher'], item['publish_date'],
                     item['price'])
                )

        def handle_error(self, failure):
            # Assumed handler: it is referenced above but missing from the
            # original listing. Here it simply logs the failure.
            log.err(failure)
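The conditional-insert logic can be tried without MySQL or Twisted. The sketch below swaps in the standard-library `sqlite3` module with an in-memory database, so it is not the pipeline itself. The table schema is an assumption (the article does not show it), and sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def conditional_insert(cursor, item):
    """Insert the item only if it has a non-empty name; return True if inserted."""
    if not item.get('name'):
        return False
    # Parameterized query: values are passed separately, never formatted into the SQL.
    cursor.execute(
        "INSERT INTO book (name, publisher, publish_date, price) "
        "VALUES (?, ?, ?, ?)",
        (item['name'], item['publisher'], item['publish_date'], item['price']),
    )
    return True


# Assumed schema, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE book (name TEXT, publisher TEXT, publish_date TEXT, price TEXT)"
)
cur = conn.cursor()

inserted = conditional_insert(cur, {
    'name': 'Example Book', 'publisher': 'Example Press',
    'publish_date': '2012-05-01', 'price': '45.80',
})
skipped = conditional_insert(cur, {
    'name': '', 'publisher': '', 'publish_date': '', 'price': '',
})
conn.commit()

rows = conn.execute("SELECT name, price FROM book").fetchall()
```

Passing the values as a separate tuple, as the pipeline also does, leaves quoting to the database driver and avoids building SQL by string formatting.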

When that is done, register the pipeline in settings.py:

    ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']

Finally, run scrapy crawl bookspider to start the crawl.

Original article: http://www.chengxuyuans.com/Python/39302.html


Category: Others
Published: 2023-12-28 18:35:34
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_ocfy_13__23__6_3.html
