Python: Basic Usage of Scrapy

Basic usage of Scrapy

Create a project: scrapy startproject proName
A spider file must be created under the spiders directory:
- cd proName
- scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName

settings.py

Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False
Spoof the User-Agent (UA)
USER_AGENT = 'UA'
Set the log level
LOG_LEVEL = 'ERROR'

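Parsing with Scrapy

def parse(self, response):
    # Parse: the author's name + the text of the joke
    div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
    for div in div_list:
        # xpath() returns a list (SelectorList) whose elements are always Selector objects; call .extract() to get the content
        # .extract() pulls out the string stored in each Selector's data attribute
        # If the list is guaranteed to hold a single element, use .extract_first(); otherwise index it with [i]
        # .extract_first() returns element 0 of the list (or None if the list is empty)
        author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
        # Calling .extract() on the list extracts the data string of every Selector in it
        content = div.xpath('./a[1]/div/span//text()').extract()
        content = ''.join(content)  # join the list into a single string
        print(author, content)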

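Persistent storage with Scrapy

Via a terminal command:
- Requirement: only the return value of the parse method can be stored in a local text file
- scrapy crawl qiubai -o ./qiubai.csv
- Note: the output file type must be one of: 'json','jsonlines','jl','csv','xml','marshal','pickle'
- Pros: concise, efficient, and convenient
- Cons: limited (data can only be stored in files with the listed extensions)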

Via pipelines

Coding workflow:

Parse the data

#qiubai.py (file name)

import scrapy
from qiubaiPro.items import QiubaiproItem  # import the item class


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Parse: the author's name + the text of the joke
        div_list = response.xpath('//div[@class="col1 old-style-col1"]/div')
        for div in div_list:
            # xpath() returns a list (SelectorList) whose elements are always Selector objects; call .extract() to get the content
            # .extract() pulls out the string stored in each Selector's data attribute
            # If the list is guaranteed to hold a single element, use .extract_first(); otherwise index it with [i]
            # .extract_first() returns element 0 of the list (or None if the list is empty)
            # Anonymous and regular users keep the user name in different elements,
            # so two XPath expressions are joined with the pipe operator (|)
            author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            # Calling .extract() on the list extracts the data string of every Selector in it
            content = div.xpath('./a[1]/div/span//text()').extract()
            content = ''.join(content)  # join the list into a single string

Define the corresponding fields in the item class

#items.py (file name)

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()  # define the author field
    content = scrapy.Field()  # define the content field

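Wrap the parsed data in an item object

#qiubai.py (file name)
item = QiubaiproItem()  # instantiate the item
item['author'] = author  # store author in the item
item['content'] = content  # store content in the item

Submit the item object to the pipeline for persistent storage

# Submit the item to the pipeline
yield item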

In the pipeline class's process_item method, persist the data stored in the item object it receives

#pipelines.py (file name)

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


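class QiubaiproPipeline:
    fp = None

    # Hook method: called only once, when the spider starts
    def open_spider(self, spider):
        print('Opening file')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # Handles item objects: receives each item submitted by the spider file
    # and is called once for every item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author+':'+content+'\n')
        return item

    # Hook method: called only once, when the spider closes
    def close_spider(self, spider):
        print('Closing file')
        self.fp.close()

Enable the pipeline in the configuration file:

#settings.py (file name)
#Uncomment the following. 300 is the priority: the smaller the number, the higher the priority. Each key/value pair corresponds to one pipeline class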
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
}

Pros:
- Highly general
Cons:
- More steps to set up
Exercise

How can the crawled data be stored both in a local file and in a database?

#settings.py (file name)

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    'qiubaiPro.pipelines.mysqlPileLine': 301,
}

#pipelines.py (file name)

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class QiubaiproPipeline:
    fp = None

    # Hook method: called only once, when the spider starts
    def open_spider(self, spider):
        print('Opening file')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # Handles item objects: receives each item submitted by the spider file
    # and is called once for every item received
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author+':'+content+'\n')
        return item  # returning the item passes it to the next pipeline class to run

    def close_spider(self, spider):
        print('Closing file')
        self.fp.close()
        # In the pipelines file, each pipeline class stores one set of data to one platform or medium


class mysqlPileLine:
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('Connecting to the database')
        self.conn = pymysql.Connect(
            host='127.0.0.1', port=3306, user='root', password='123456789', db='qiubai', charset="utf8")
        # Create the cursor once so that close_spider can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        print('Writing data')
        try:
            # Pass the values as query parameters so that pymysql escapes them
            self.cursor.execute('insert into qiubai values(%s, %s)',
                                (item["author"], item["content"]))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('Closing the connection')
        self.cursor.close()
        self.conn.close()

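Which pipeline class ends up receiving the item submitted by the spider file?
- The item submitted by the spider file is received only by the first pipeline class to run
- The return item in that class's process_item passes the item to the next pipeline class to run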
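
The hand-off described above can be illustrated with a small simulation. This is a simplified stand-in written for this note, not Scrapy's actual implementation: the function run_pipelines and both pipeline classes are hypothetical, and details such as dropped items or asynchronous processing are ignored. It only shows that pipelines run in ascending priority order and that each one receives whatever the previous one returned.

```python
class UpperCasePipeline:
    """Hypothetical first stage: modifies the item, then passes it on."""

    def process_item(self, item, spider):
        item['author'] = item['author'].upper()
        return item


class RecordingPipeline:
    """Hypothetical second stage: records what it was handed."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines_with_priority, spider=None):
    """Run the pipelines from the lowest priority number to the highest."""
    ordered = sorted(pipelines_with_priority, key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        # The return value of one stage is the input of the next.
        item = pipeline.process_item(item, spider)
    return item


recorder = RecordingPipeline()
result = run_pipelines(
    {'author': 'tom', 'content': 'hi'},
    [(recorder, 301), (UpperCasePipeline(), 300)],  # deliberately out of order
)
```

Because 300 sorts before 301, the recorder sees the item only after the first stage has already changed it.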

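Whole-site crawling with Spider

What is whole-site crawling?

It means crawling the page data for every page number under one section of a website

Requirement: crawl the names of the photos on the Xiaohua site (校花网)

Approaches:
- Send the requests manually (recommended)
- Add the URLs of all pages to the start_urls list (not recommended: what if there are tens of thousands of pages?)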

#xiaohua.py

import scrapy


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/tuku/index.html']
    # A generic URL template (immutable)
    url = 'http://www.521609.com/tuku/index_%d.html'
    page_num = 2

    def parse(self, response):
        li_list = response.xpath('//ul[@class="pbl "]/li')
        for li in li_list:
            img_name = li.xpath('./a/p/text()').extract_first()
            print(img_name)
        if self.page_num <= 51:
            print(self.page_num)
            new_url = self.url % self.page_num
            self.page_num += 1
            # Send the request manually: the callback is the function that parses the response
            yield scrapy.Request(url=new_url, callback=self.parse)
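
To make the paging arithmetic in the spider above concrete, the sketch below generates the same sequence of URLs outside Scrapy. The helper name page_urls is an assumption made for this illustration; the template and the page range 2 to 51 are taken from the spider.

```python
URL_TEMPLATE = 'http://www.521609.com/tuku/index_%d.html'


def page_urls(template, first_page, last_page):
    """Yield one URL per page number, both ends inclusive."""
    for page_num in range(first_page, last_page + 1):
        yield template % page_num


# The spider starts at page 2 and stops after page 51.
urls = list(page_urls(URL_TEMPLATE, 2, 51))
```

In the spider, the same URLs are produced one at a time, because each parse call schedules only the next page.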

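Passing arguments with requests

Use case: the data to be parsed is spread over more than one page (deep crawling)

Crawling image data with ImagesPipeline

How does crawling string data with Scrapy differ from crawling image data?
- Strings: parse with xpath and submit to the pipeline for persistent storage
- Images: xpath only yields the value of the image's src attribute; a separate request to that image URL is needed to obtain the binary image data
ImagesPipeline:
- You only need to parse the img src attribute value, wrap it in an item, and submit it to the pipeline. The pipeline then requests the image URL, obtains the binary image data, and also stores it for us
Requirement: crawl the high-resolution images from the Zhanzhang Sucai site (站长素材)

Workflow:

Parse the data (the image URLs)
Submit the item holding the image URL to the designated pipeline class
In the pipelines file, define a custom pipeline class based on ImagesPipeline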

#pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy
# class ZhanzhangproPipeline:
#     def process_item(self, item, spider):
#         return item


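class imgsPileLine(ImagesPipeline):
    # Requests the image data for the image URL stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Returns the file name (relative to IMAGES_STORE) under which the image is saved
    # Recent Scrapy versions also pass a keyword-only item argument
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # pass the item to the next pipeline class to run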
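
The file_path method above takes everything after the last slash of the URL, which would also keep a query string such as ?size=large in the file name. The sketch below shows one possible way to derive a cleaner name with the standard library. The function name image_name_from_url, the fallback name, and the example.com URLs are assumptions for illustration only.

```python
import posixpath
from urllib.parse import unquote, urlparse


def image_name_from_url(url, default='unnamed.jpg'):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlparse(url).path               # drops '?query' and '#fragment'
    name = posixpath.basename(unquote(path))  # decode %20 and similar escapes
    return name or default                  # URLs ending in '/' have no name


name_with_query = image_name_from_url('https://example.com/imgs/cat.jpg?size=large')
name_without_file = image_name_from_url('https://example.com/')
```

Inside file_path, such a helper could be called with request.url in place of the split expression.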

Modify the settings.py configuration file

#settings.py
#Directory in which the images are stored
IMAGES_STORE = './imgs'

Specify the pipeline class to enable: the custom pipeline class

#settings.py

ITEM_PIPELINES = {
    'zhanzhangPro.pipelines.imgsPileLine': 300,
}

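Middleware

Downloader middleware

Position: between the engine and the downloader
Purpose: intercept every request and response in the project
Intercepting requests:
- UA spoofing
- Proxy IPs
Intercepting responses:
- Tamper with the response data or the response object

Setting up UA spoofing and proxy IPs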

#middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
import random


class QiubaiproDownloaderMiddleware:
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    # Proxy IPs available for selection
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    # Intercept requests
    def process_request(self, request, spider):
        # UA spoofing
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None

    # Intercept all responses
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        # After a request fails, assign a proxy to it
        # Check the scheme of the intercepted request's URL (http or https)
        # request.url looks like: http://www.xxx.com
        if request.url.split(':')[0] == 'https':  # the request's scheme
            ip = random.choice(self.PROXY_https)
            request.meta['proxy'] = 'https://'+ip
        else:
            ip = random.choice(self.PROXY_http)
            request.meta['proxy'] = 'http://' + ip
        return request  # resend the corrected request object
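
The scheme check in process_exception can be pulled out into a small helper and exercised on its own. The following is a sketch under assumptions: the function name choose_proxy and the PROXY_POOLS dictionary are made up for this illustration, and the addresses are simply the sample values from the middleware above, which are unlikely to be live proxies. Like the original, it treats anything that is not https as http.

```python
import random
from urllib.parse import urlparse

# Sample addresses copied from the middleware above (placeholders, not verified).
PROXY_POOLS = {
    'http': ['153.180.102.104:80', '195.208.131.189:56055'],
    'https': ['120.83.49.90:9000', '95.189.112.214:35508'],
}


def choose_proxy(url, pools=PROXY_POOLS, rng=random):
    """Pick a random proxy whose pool matches the scheme of url."""
    scheme = 'https' if urlparse(url).scheme == 'https' else 'http'
    return f'{scheme}://{rng.choice(pools[scheme])}'


https_proxy = choose_proxy('https://www.xxx.com')
http_proxy = choose_proxy('http://www.xxx.com')
```

In the middleware, the returned string would be assigned to request.meta['proxy'] before the request is returned for rescheduling.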

Enable the downloader middleware in settings.py

DOWNLOADER_MIDDLEWARES = {
   'qiubaiPro.middlewares.QiubaiproDownloaderMiddleware': 543,
}

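Requirement: crawl the news data (title and content) from NetEase News (网易新闻)
- Parse the URLs of the five section pages from the NetEase News home page (not dynamically loaded)
- The news titles in each section are loaded dynamically
- Parse the URL of each news detail page, fetch its page source, and parse out the news content

#settings.py
#Basic settings
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
#Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
#Enable the pipeline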
ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}

#wangyi.py

import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['https://news.163.com/']
    start_urls = ['https://news.163.com/']
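    models_urls = []  # URLs of the five section pages
    # parse() below extracts the URLs of the five section pages

    # Instantiate a browser object
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The argument is the location of your browser driver; the r'' prefix prevents backslash escapes
        # (newer Selenium 4 releases expect the driver path via a Service object instead of executable_path)
        self.bro = webdriver.Chrome(
            executable_path=r'D:\Learning world\personal project\personal project\Python\爬虫\scrapy\wangyiPro\wangyiPro\chromedriver.exe')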

    def parse(self, response):
        li_list = response.xpath('//div[@class="ns_area list"]/ul/li')
        alist = [3, 4, 6, 7, 8]
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)

        # Request the page of each section in turn
        for url in self.models_urls:
            yield scrapy.Request(url=url, callback=self.parse_model)

    # The news titles in each section are loaded dynamically
    def parse_model(self, response):  # parse the title and detail-page URL of every news item on a section page
        div_list = response.xpath('//div[@class="ndi_main"]/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath(
                './div/div[1]/h3/a/@href').extract_first()
            item = WangyiproItem()
            item['title'] = title
            # Request the news detail page, passing the item along through meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        content = response.xpath(
            '//div[@class="post_body"]/p/text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    # Called when the spider closes; Scrapy passes the close reason
    def closed(self, reason):
        self.bro.quit()

#middlewares.py


# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from time import sleep


class WangyiproDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

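    # This method intercepts the response objects of the five sections and replaces them
    def process_response(self, request, response, spider):  # spider is the spider object
        bro = spider.bro  # the browser object defined in the spider class
        # Pick out the specific responses to replace:
        # the URL identifies the request,
        # and the request identifies the response
        if request.url in spider.models_urls:
            bro.get(request.url)  # request the URL of one of the five sections
            sleep(2)
            page_text = bro.page_source  # includes the dynamically loaded news data
            # response here is the response object of one of the five sections
            # Replace it: build a new response object that contains the dynamically
            # loaded news data, in place of the original one that lacks it
            # How do we obtain the dynamically loaded news data?
            # Selenium makes it easy to get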
            new_response = HtmlResponse(
                url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # response  # 其他请求对应的响应对象
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

#items.py

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

#pipelines.py


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class WangyiproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item

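The CrawlSpider class

CrawlSpider: a subclass of Spider

Ways to crawl a whole site:
- Manual requests based on Spider
- Based on CrawlSpider
Using CrawlSpider:
- Create a project
- cd xxx
- Create the spider file (unlike before, it is based on the CrawlSpider subclass)
- scrapy genspider -t crawl name www.xxx.com
- LinkExtractor (link extractor):
  - Purpose: extracts the links that match a given rule (allow="regular expression")
- Rule (rule parser)
  - Purpose: applies a given rule (the parsing done by callback) to the links extracted by the link extractor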
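Distributed crawling

The concept of a distributed crawler

Build a distributed cluster of machines that jointly crawl one set of resources

Purpose

Improve the efficiency of data crawling

How to implement distribution

Install the scrapy-redis component
Native Scrapy cannot run as a distributed crawler; it has to be combined with the scrapy-redis component
Why can't native Scrapy be distributed?
- The scheduler cannot be shared across the cluster
- The pipeline cannot be shared across the cluster
What the scrapy-redis component does
- It provides native Scrapy with a pipeline and a scheduler that can be shared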
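
The idea of a shared scheduler can be illustrated without Redis. The sketch below is an in-memory simulation under assumptions: the class SharedScheduler, the function crawl, and the node names are invented for this note, and the SHA-1 of the URL is only a stand-in for a request fingerprint. scrapy-redis keeps the queue and the fingerprint set in Redis so that separate machines can reach them; here, one Python object plays that role.

```python
import hashlib
from collections import deque


class SharedScheduler:
    """In-memory stand-in for a request queue and dedupe set shared by all nodes."""

    def __init__(self):
        self.queue = deque()
        self.fingerprints = set()

    def push(self, url):
        """Enqueue url unless an identical one was already seen."""
        fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
        if fp in self.fingerprints:
            return False
        self.fingerprints.add(fp)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def crawl(scheduler, workers):
    """Let the workers take turns pulling from the one shared queue."""
    fetched = {name: [] for name in workers}
    turn = 0
    while True:
        url = scheduler.pop()
        if url is None:
            break
        fetched[workers[turn % len(workers)]].append(url)
        turn += 1
    return fetched


scheduler = SharedScheduler()
for u in ['a', 'b', 'a', 'c']:  # the second 'a' is a duplicate
    scheduler.push(u)
work_split = crawl(scheduler, ['node1', 'node2'])
```

Each URL is fetched by exactly one node and the duplicate is never queued, which is the property a per-process scheduler cannot give a cluster.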

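Implementation steps

Create a project
Create a spider file based on CrawlSpider
Modify the spider file:
- Import: from scrapy_redis.spiders import RedisCrawlSpider
- Comment out start_urls and allowed_domains
- Add a new attribute in their place: redis_key = 'sun', the name of the shared scheduler queue
- Write the data-parsing logic
- Change the parent class of the spider to RedisCrawlSpider

Modify the configuration file settings.py

Specify the pipeline that can be shared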

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline':400
}

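Specify the scheduler

#Add a dedupe filter class: it stores request fingerprints in a Redis set, which makes request deduplication persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
#Use the scheduler provided by the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
#Whether the scheduler persists: when the spider finishes, should the request queue and the fingerprint set in Redis be kept? True = keep, False = clear
SCHEDULER_PERSIST = True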

Specify the Redis server

#settings.py

REDIS_HOST = 'IP address of the Redis server'
REDIS_PORT = 6379

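Redis configuration and operations
- Edit the Redis configuration file
  - linux/mac: redis.conf
  - windows: redis.windows.conf
- Open the configuration file and change it:
  - Delete (or comment out) the line bind 127.0.0.1
  - Turn off protected mode: change protected-mode yes to protected-mode no
- Start the Redis service with the configuration file
  - redis-server <configuration file>
- Start the client
  - redis-cli
- Run the project
  - scrapy runspider xxx.py (the spider source file)
- Put a start URL into the scheduler queue
  - The scheduler queue lives in Redis, so this is done from the Redis client
  - lpush sun www.xxx.com (sun is the redis_key from the spider file; www.xxx.com is the start URL)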

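This article is an original work by kill3r, licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license. Please credit the source when reposting.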