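Python Crawlers: How Scrapy's CrawlSpider Works, with a Usage Example

Date: 2021-05-22

Question: if you want a crawler to scrape the data of the entire Qiushibaike ("糗百") site, how many ways are there to implement it?

Method 1: recursive crawling with the Spider class of the Scrapy framework (Request objects with callbacks)

Method 2: automatic crawling with CrawlSpider (more concise and efficient)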

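I. A Brief Introduction to CrawlSpider

CrawlSpider is a subclass of Spider. Besides inheriting Spider's features and functionality, it adds more powerful ones of its own, the most notable being the link extractor (LinkExtractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls. When crawling has to continue with the URLs extracted from pages that have already been fetched, CrawlSpider is the better fit.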

II. Usage

1. Create the Scrapy project (in cmd, change to the folder where the project should be created, then run): scrapy startproject projectName (for example: scrapy startproject crawlPro)

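2. Create the spider file (in cmd, change into the created project, then run): scrapy genspider -t crawl spiderName

    [...] /']

    # Define the link extractor and specify its extraction rule
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        # Define the rule; pages reached through matching links are parsed by the callback function
        Rule(page_link, callback='parse_item', follow=True),
    )

    # The parsing function used as the rule's callback
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            # Create the item
            item = QiubaibycrawlItem()
            # Extract the joke's author with an XPath expression
            # (default='' avoids calling strip() on None when nothing matches)
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first(default='').strip('\n')
            # Extract the joke's content with an XPath expression
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='').strip('\n')
            yield item  # submit the item to the pipeline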

4.2 The items file

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class QiubaibycrawlItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        author = scrapy.Field()   # author
        content = scrapy.Field()  # content

4.3 The pipeline file

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


    class QiubaibycrawlPipeline(object):

        def __init__(self):
            self.fp = None

        def open_spider(self, spider):
            print('Spider started')
            # Open with an explicit encoding so non-ASCII text is written correctly
            self.fp = open('./data.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # Write each item submitted by the spider to the file for persistent storage
            self.fp.write(item['author'] + ':' + item['content'] + '\n')
            return item

        def close_spider(self, spider):
            print('Spider finished')
            self.fp.close()
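
The order in which the three pipeline hooks are called can be seen by driving a pipeline by hand, without running a crawl. The sketch below is an illustration only: `TextFilePipeline` and `run_pipeline` are hypothetical names, the output path is made a constructor parameter so the example can write to a temporary folder, and plain dicts stand in for items. In a real project Scrapy calls these hooks itself.

```python
import os
import tempfile


class TextFilePipeline:
    """Same three hooks as the pipeline above, with a configurable output path."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once, when the spider starts.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        self.fp.write(item["author"] + ":" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        # Called once, when the spider finishes.
        self.fp.close()


def run_pipeline(items, path):
    """Mimic the call order: open_spider, process_item per item, close_spider."""
    pipeline = TextFilePipeline(path)
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)


out_path = os.path.join(tempfile.mkdtemp(), "data.txt")
run_pipeline(
    [{"author": "A", "content": "first"}, {"author": "B", "content": "second"}],
    out_path,
)
with open(out_path, encoding="utf-8") as fp:
    written = fp.read()
print(written)
```

Opening the file once in `open_spider` and closing it in `close_spider` avoids reopening it for every item, which is why the article's pipeline keeps the handle in `self.fp`.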

That is all for this article. I hope it is helpful for your studies.
