How URL Deduplication Works in Scrapy (Part 1)

I. What URL deduplication is for

  • 1. It avoids repeatedly fetching the same data, which makes the crawler more efficient; it also keeps the database from storing several identical copies of the same records.
  • 2. In a large crawl it lets you pause the spider and resume it some time later without starting over.

II. Understanding URL deduplication

  • 1. When the spider follows the pagination, every page exposes largely the same list of URLs; what actually gets requested is the deduplicated set of URLs (see the note after the printed URLs below for how to bypass this on purpose).
  • 2. The spider code

    import scrapy
    from scrapy import Request
    from urllib import parse


    class BlogSpider(scrapy.Spider):
        name = 'blog'
        allowed_domains = ['blog.jobbole.com', 'python.jobbole.com']
        start_urls = ['http://python.jobbole.com/all-posts/']

        def parse(self, response):
            print(response.url)
            # extract the URL of the next page
            next_page = response.xpath('//div[@class="navigation margin-20"]/a[last()]/@href').get()
            if next_page:
                yield Request(url=parse.urljoin(response.url, next_page), callback=self.parse)
    
  • 3. The URLs printed as the spider runs

    http://python.jobbole.com/all-posts/
    http://python.jobbole.com/all-posts/page/2/
    http://python.jobbole.com/all-posts/page/3/
    http://python.jobbole.com/all-posts/page/4/
    http://python.jobbole.com/all-posts/page/5/
    http://python.jobbole.com/all-posts/page/6/
    http://python.jobbole.com/all-posts/page/7/
    http://python.jobbole.com/all-posts/page/8/
    http://python.jobbole.com/all-posts/page/9/
    http://python.jobbole.com/all-posts/page/10/
    ...
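
By default Scrapy's scheduler asks the dupefilter about every outgoing request and silently drops any it has already seen, which is why each page URL above is printed only once. To deliberately re-crawl a URL, Request accepts a dont_filter=True argument that bypasses this check. A minimal sketch inside the spider's parse method, reusing the start page URL from above:

    # sketch: force Scrapy to schedule a request even if its URL was already seen
    yield Request(url='http://python.jobbole.com/all-posts/',
                  callback=self.parse,
                  dont_filter=True)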
    
  • 4. The URL-dedup source in Scrapy

    # import this in the spider file (only so you can click through to the source)
    from scrapy.dupefilters import RFPDupeFilter
    # the dedup base class, from scrapy.dupefilters
    class BaseDupeFilter(object):

        @classmethod
        def from_settings(cls, settings):
            """
            Build the filter from the settings (settings.py)
            """
            return cls()

        def request_seen(self, request):
            """
            The core method: should return True if the request
            has already been seen, False otherwise
            """
            return False

        def open(self):  # can return deferred
            """
            Called when the spider opens
            """
            pass

        def close(self, reason):  # can return a deferred
            """
            Called when the spider closes
            """
            pass

        def log(self, request, spider):  # log that a request has been filtered
            """
            Logging hook for filtered requests
            """
            pass

    class RFPDupeFilter(BaseDupeFilter):
        # the default dupefilter; its full implementation is shown in section IV
        pass
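
For context, here is roughly where request_seen is consulted: the scheduler checks the dupefilter before enqueuing each request. A slightly abridged excerpt (not runnable on its own) of Scheduler.enqueue_request; details differ between Scrapy versions:

    # abridged sketch of scrapy.core.scheduler.Scheduler.enqueue_request
    def enqueue_request(self, request):
        # self.df is the configured dupefilter (RFPDupeFilter by default)
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False  # duplicate: drop the request
        # ...otherwise the request is pushed onto the scheduler's queue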
    

III. Writing a custom URL dedup filter

  • 1. Create a new file and define a class that inherits from BaseDupeFilter (this first version never filters anything, since request_seen always returns False; it only prints each request as it passes through)

    from scrapy.dupefilters import BaseDupeFilter
    
    class BlogDupeFilter(BaseDupeFilter):
        """
        自定义一个url去重的类
        """
    
        @classmethod
        def from_settings(cls, settings):
            return cls()
    
        def request_seen(self, request):
            print('current request:', request)
            return False
    
        def open(self):  # can return deferred
            pass
    
        def close(self, reason):  # can return a deferred
            pass
    
        def log(self, request, spider):  # log that a request has been filtered
            pass
    
  • 2. Point settings.py at the custom filter

    # tell Scrapy which dupefilter class to use (the dotted path must match where you saved the file)
    DUPEFILTER_CLASS = "utils.dupe_filter.BlogDupeFilter"
    
  • 3. Building on the above, modify request_seen so that visited URLs are stored in a set (a quick check of its behaviour follows the code)

    from scrapy.dupefilters import BaseDupeFilter

    class BlogDupeFilter(BaseDupeFilter):
        """
        A custom URL dedup filter
        """
        def __init__(self):
            self.visited_urls = set()

        @classmethod
        def from_settings(cls, settings):
            return cls()

        def request_seen(self, request):
            print('current request:', request)
            # Simple and blunt: if the URL is already in the set, report it as
            # seen (return True); otherwise record it and let it through
            if request.url in self.visited_urls:
                return True
            self.visited_urls.add(request.url)
            return False
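
A quick check of this filter outside a crawl, using the two baidu example URLs that come up again below: exact repeats are caught, but the same address with its query parameters in a different order slips through, which is exactly what the next point addresses.

    from scrapy import Request

    df = BlogDupeFilter()
    print(df.request_seen(Request('https://www.baidu.com?name=hello&age=18')))  # False: first visit
    print(df.request_seen(Request('https://www.baidu.com?name=hello&age=18')))  # True: exact repeat
    print(df.request_seen(Request('https://www.baidu.com?age=18&name=hello')))  # False: same page, params reordered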
    
  • 4. In the Scrapy source, the request is first run through the request_fingerprint function, and it is the resulting fingerprint that gets added to the set. The reasons:

    • 1. The two URLs below point at the same resource, yet as plain strings they are different members of a set:
      https://www.baidu.com?name=hello&age=18
      https://www.baidu.com?age=18&name=hello
      
    • 2. Rather than writing your own normalization, just use the request_fingerprint helper that Scrapy already provides:

      from scrapy.utils.request import request_fingerprint
      
    • 3. The function returns a fixed-length hash digest (similar to an MD5/SHA1 checksum), so no matter how long the request's URL is, the fingerprint always has the same length. (A short demo follows the code below.)

    # store the request_fingerprint of each request in the set, instead of the raw URL
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint
    
    class BlogDupeFilter(BaseDupeFilter):
        """
        A custom dedup filter based on request fingerprints
        """

        def __init__(self):
            self.visited_set = set()

        @classmethod
        def from_settings(cls, settings):
            return cls()

        def request_seen(self, request):
            print('current request:', request)
            # compare and store fingerprints rather than raw URLs or Request objects
            fp = request_fingerprint(request)
            if fp in self.visited_set:
                return True
            self.visited_set.add(fp)
            return False
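
A small demo of why the fingerprint matters: request_fingerprint canonicalizes the URL before hashing (among other things, the query parameters are sorted), so the two orderings from point 1 above produce identical fingerprints and are treated as one request.

    from scrapy import Request
    from scrapy.utils.request import request_fingerprint

    a = Request('https://www.baidu.com?name=hello&age=18')
    b = Request('https://www.baidu.com?age=18&name=hello')
    print(request_fingerprint(a))                            # fixed-length hex digest
    print(request_fingerprint(a) == request_fingerprint(b))  # True: same canonical URL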
    

IV. Pausing and resuming the crawler

The filters above keep the seen fingerprints only in an in-memory set, so every time the spider is restarted the set starts out empty and requests that were already crawled are no longer filtered out. The Scrapy source has a mechanism for persisting this state:

  • 1. The source code

    import logging
    import os

    from scrapy.utils.job import job_dir
    from scrapy.utils.request import request_fingerprint

    class RFPDupeFilter(BaseDupeFilter):
        """Request Fingerprint duplicates filter"""

        def __init__(self, path=None, debug=False):
            self.file = None
            self.fingerprints = set()
            self.logdupes = True
            self.debug = debug
            self.logger = logging.getLogger(__name__)
            if path:
                # reopen requests.seen and reload the fingerprints recorded by earlier runs
                self.file = open(os.path.join(path, 'requests.seen'), 'a+')
                self.file.seek(0)
                self.fingerprints.update(x.rstrip() for x in self.file)

        @classmethod
        def from_settings(cls, settings):
            debug = settings.getbool('DUPEFILTER_DEBUG')
            # note the job_dir helper used here
            return cls(job_dir(settings), debug)
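
The rest of RFPDupeFilter (lightly abridged from the same module) is where persistence happens: each new fingerprint is also appended to requests.seen, and the file is closed when the spider stops, so a later run with the same JOBDIR starts out already knowing every request seen before.

        def request_seen(self, request):
            fp = self.request_fingerprint(request)
            if fp in self.fingerprints:
                return True
            self.fingerprints.add(fp)
            if self.file:
                # persist the fingerprint so the next run can reload it
                self.file.write(fp + os.linesep)

        def request_fingerprint(self, request):
            return request_fingerprint(request)

        def close(self, reason):  # called when the spider closes
            if self.file:
                self.file.close()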
    
  • 2. The job_dir helper (it resolves the directory that will hold requests.seen)

    import os
    
    def job_dir(settings):
        # read the JOBDIR setting: the directory used to persist the state of seen requests
        path = settings['JOBDIR']
        if path and not os.path.exists(path):
            os.makedirs(path)
        return path
    
  • 3. Configure it in settings.py (the same value can also be passed on the command line; see the note below)

    # directory in which the fingerprints of already-visited requests are persisted
    JOBDIR = 'abc'
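
With JOBDIR set, you can stop the spider (press Ctrl-C once and let it shut down cleanly) and start it again later: the fingerprints persisted in abc/requests.seen are reloaded, so already-crawled requests keep being filtered out. The same value can also be supplied per run on the command line, e.g. scrapy crawl blog -s JOBDIR=abc. Related to the debug flag read in from_settings above, an optional setting makes Scrapy log every duplicate it drops:

    # optional: log each request the dupefilter drops (read by RFPDupeFilter.from_settings above)
    DUPEFILTER_DEBUG = True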
    
