Item和ItemLoader的使用

本章节中要抓取的网站

一、爬虫的目的

我们写爬虫主要目的就是从网页中把哪些非结构性的数据提取出结构性数据,存储到数据库中,在spider的类中默认会创建parse()方法,该方法就是用来解析网页上数据源,Item就是将这些提取的数据包装为结构化数据

Item对象是一种简单的容器,用来保存爬取到的数据,提供类似于字典的API以及用来声明可用字段的简单语法.

Item的使用一般分为两种:

  • 1、常规的Item(就是定义普通需要下载的字段)
  • 2、ItemLoader的使用

二、关于ItemItemLoader的区别

ItemLoader是负责数据的收集、处理、填充,item仅仅是承载了数据本身 数据的收集、处理、填充归功于item loader中两个重要组件官网地址:

  • 输入处理input processors
  • 输出处理output processors

三、Item的使用

  • 1、定义一个Item的类

    import scrapy
    
    class QuotesPageItem(scrapy.Item):
        content = scrapy.Field()
        tags = scrapy.Field()
    
  • 2、在spider爬虫中使用定义好的Item

    import scrapy
    
    from quotes_page.items import QuotesPageItem
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
            quote_list = response.xpath('//div[@class="quote"]')
            for quote in quote_list:
                content = quote.xpath('./span[@class="text"]/text()').get()
                tag = quote.xpath('.//a[@class="tag"]/text()').getall()
                tags = ','.join(tag)
                # 直接使用刚刚定义的`Item`
                yield QuotesPageItem(content=content, tags=tags)
    
  • 3、运行命令运行爬虫

    scrapy crawl quotes --nolog -o quotes.json
    

四、ItemLoader的认识

  • 1、导包

    from scrapy.loader import ItemLoader
    
  • 2、关于ItemLoader的源码

    class ItemLoader(object):
    
        default_item_class = Item
        default_input_processor = Identity()
        default_output_processor = Identity()
        default_selector_class = Selector
    
        def __init__(self, item=None, selector=None, response=None, parent=None, **context):
            """"
            这个地方是本人加上去的,非源码里面的注册
            item:一个类似上面一样的item对象
            selector:选择的节点
            response:是parse函数中response
            """"
            if selector is None and response is not None:
                selector = self.default_selector_class(response)
            self.selector = selector
            context.update(selector=selector, response=response)
            if item is None:
                item = self.default_item_class()
            self.context = context
            self.parent = parent
            self._local_item = context['item'] = item
            self._local_values = defaultdict(list)
    

五、关于ItemLoader的使用步骤

  • 1、在item.py文件中创建一个ItemItemLoader的类
    • 1.Item的类主要是定义哪些字段
    • 2.ItemLoader的类主要是定义规则
  • 2、在spider爬虫中引入两个item.py中定义好的两个类
  • 3、实例化ItemLoader

    item_loader = ItemLoader类(item=Item类, response=response)
    
  • 4、基本的操作

    #针对直接取值的情况
    item_loader.add_value('front_image_url','front_image_url')
    #针对css选择器
    item_loader.add_css('title','div.entry-header h1::text')
    item_loader.add_css('create_date','p.entry-meta-hide-on-mobile::text')
    item_loader.add_css('praise_num','#112547votetotal::text')
    #针对xpath的情况
    item_loader.add_xpath('content','//*[@id="post-112239"]/div[3]/div[3]/p[1]')
    #把结果返回给item对象
    article_item = item_loader.load_item()
    yield article_item
    

六、关于ItemItemLoader类的书写

  • 1、依然是以上面的案例改写
  • 2、导包

    from scrapy.loader import ItemLoader
    # TakeFirst表示取第一个值
    # MapCompose可以多个操作
    from scrapy.loader.processors import TakeFirst, MapCompose
    
  • 3、定义一个ItemLoader的类

    class QuotesItemLoader(ItemLoader):
        # 设置默认的输出取第一个值
        default_output_processor = TakeFirst()
    
  • 3、定义Item

    # 去除html标签
    from w3lib.html import remove_tags
    
    def str_strip(str):
        """
        定义一个函数去重字符中空格及换行
        :param str:
        :return:
        """
        return re.sub(re.compile('[\n\t\r]'), '', str)
    
    def add_title(val):
        """
        定义一个添加标题的函数
        :param val:
        :return:
        """
        return "标题:{0}".format(val)
    
    def format_json(val):
        result = [val]
        return result
    
    class QuotesItem(scrapy.Item):
        """
        定义item,这样可以定义输入的规则
        """
        content = scrapy.Field(
            input_processor=MapCompose(add_title, remove_tags, str_strip)
        )
        tags = scrapy.Field(
            input_processor=MapCompose(remove_tags, str_strip),
            # 单独定义自己的输出,会覆盖默认的输出
            output_processor=MapCompose(format_json)
        )
    

七、上面案例的完整代码

  • 1、items.py文件

    import scrapy
    import re
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst, MapCompose
    
    # 去除html标签
    from w3lib.html import remove_tags
    
    class QuotesItemLoader(ItemLoader):
        # 设置只取第一个值
        default_output_processor = TakeFirst()
    
    def str_strip(str):
        """
        定义一个函数去重字符中空格及换行
        :param str:
        :return:
        """
        return re.sub(re.compile('[\n\t\r]'), '', str)
    
    def add_title(val):
        """
        定义一个添加标题的函数
        :param val:
        :return:
        """
        return "标题:{0}".format(val)
    
    def format_json(val):
        result = [val]
        return result
    
    class QuotesItem(scrapy.Item):
        """
        定义item,这样可以定义输入的规则
        """
        content = scrapy.Field(
            input_processor=MapCompose(add_title, remove_tags, str_strip)
        )
        tags = scrapy.Field(
            input_processor=MapCompose(remove_tags, str_strip),
            output_processor=MapCompose(format_json)
        )
    
  • 2、在spiders爬虫中使用

    import scrapy
    
    from quotes_page.items import QuotesPageItem, QuotesItem, QuotesItemLoader
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
            quote_list = response.xpath('//div[@class="quote"]')
            for quote in quote_list:
                item_loader = QuotesItemLoader(item=QuotesItem(), selector=quote)
                item_loader.add_xpath('content', './span[@class="text"]/text()')
                item_loader.add_xpath('tags', './/a[@class="tag"]/text()')
                yield item_loader.load_item()
    
  • 3、运行

    scrapy crawl quotes -o quotes.json
    
  • 4、结果

    [
      {"content": "标题:“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "tags": ["change", "deep-thoughts", "thinking", "world"]},
      {"content": "标题:“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "tags": ["abilities", "choices"]},
      {"content": "标题:“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
      {"content": "标题:“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "tags": ["aliteracy", "books", "classic", "humor"]},
      {"content": "标题:“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "tags": ["be-yourself", "inspirational"]},
      {"content": "标题:“Try not to become a man of success. Rather become a man of value.”", "tags": ["adulthood", "success", "value"]},
      {"content": "标题:“It is better to be hated for what you are than to be loved for what you are not.”", "tags": ["life", "love"]},
      {"content": "标题:“I have not failed. I've just found 10,000 ways that won't work.”", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
      {"content": "标题:“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "tags": ["misattributed-eleanor-roosevelt"]},
      {"content": "标题:“A day without sunshine is like, you know, night.”", "tags": ["humor", "obvious", "simile"]}
    ]
    

八、建议查看官网的

results matching ""

    No results matching ""