Item和ItemLoader的使用
本章节中要抓取的网站
一、爬虫的目的
我们写爬虫主要目的就是从网页中把哪些非结构性的数据提取出结构性数据,存储到数据库中,在spider
的类中默认会创建parse()
方法,该方法就是用来解析网页上数据源,Item
就是将这些提取的数据包装为结构化数据
Item
对象是一种简单的容器,用来保存爬取到的数据,提供类似于字典的API
以及用来声明可用字段的简单语法.
Item
的使用一般分为两种:
- 1、常规的
Item
(就是定义普通需要下载的字段) - 2、
ItemLoader
的使用
二、关于Item
与ItemLoader
的区别
ItemLoader
是负责数据的收集、处理、填充,item
仅仅是承载了数据本身
数据的收集、处理、填充归功于item loader
中两个重要组件官网地址:
- 输入处理
input processors
- 输出处理
output processors
三、Item
的使用
1、定义一个
Item
的类import scrapy class QuotesPageItem(scrapy.Item): content = scrapy.Field() tags = scrapy.Field()
2、在
spider
爬虫中使用定义好的Item
import scrapy from quotes_page.items import QuotesPageItem class QuotesSpider(scrapy.Spider): name = 'quotes' allowed_domains = ['quotes.toscrape.com'] start_urls = ['http://quotes.toscrape.com/'] def parse(self, response): quote_list = response.xpath('//div[@class="quote"]') for quote in quote_list: content = quote.xpath('./span[@class="text"]/text()').get() tag = quote.xpath('.//a[@class="tag"]/text()').getall() tags = ','.join(tag) # 直接使用刚刚定义的`Item` yield QuotesPageItem(content=content, tags=tags)
3、运行命令运行爬虫
scrapy crawl quotes --nolog -o quotes.json
四、ItemLoader
的认识
1、导包
from scrapy.loader import ItemLoader
2、关于
ItemLoader
的源码class ItemLoader(object): default_item_class = Item default_input_processor = Identity() default_output_processor = Identity() default_selector_class = Selector def __init__(self, item=None, selector=None, response=None, parent=None, **context): """" 这个地方是本人加上去的,非源码里面的注册 item:一个类似上面一样的item对象 selector:选择的节点 response:是parse函数中response """" if selector is None and response is not None: selector = self.default_selector_class(response) self.selector = selector context.update(selector=selector, response=response) if item is None: item = self.default_item_class() self.context = context self.parent = parent self._local_item = context['item'] = item self._local_values = defaultdict(list)
五、关于ItemLoader
的使用步骤
- 1、在
item.py
文件中创建一个Item
和ItemLoader
的类- 1.
Item
的类主要是定义哪些字段 - 2.
ItemLoader
的类主要是定义规则
- 1.
- 2、在
spider
爬虫中引入两个item.py
中定义好的两个类 3、实例化
ItemLoader
类item_loader = ItemLoader类(item=Item类, response=response)
4、基本的操作
#针对直接取值的情况 item_loader.add_value('front_image_url','front_image_url') #针对css选择器 item_loader.add_css('title','div.entry-header h1::text') item_loader.add_css('create_date','p.entry-meta-hide-on-mobile::text') item_loader.add_css('praise_num','#112547votetotal::text') #针对xpath的情况 item_loader.add_xpath('content','//*[@id="post-112239"]/div[3]/div[3]/p[1]') #把结果返回给item对象 article_item = item_loader.load_item() yield article_item
六、关于Item
和ItemLoader
类的书写
- 1、依然是以上面的案例改写
2、导包
from scrapy.loader import ItemLoader # TakeFirst表示取第一个值 # MapCompose可以多个操作 from scrapy.loader.processors import TakeFirst, MapCompose
3、定义一个
ItemLoader
的类class QuotesItemLoader(ItemLoader): # 设置默认的输出取第一个值 default_output_processor = TakeFirst()
3、定义
Item
类# 去除html标签 from w3lib.html import remove_tags def str_strip(str): """ 定义一个函数去重字符中空格及换行 :param str: :return: """ return re.sub(re.compile('[\n\t\r]'), '', str) def add_title(val): """ 定义一个添加标题的函数 :param val: :return: """ return "标题:{0}".format(val) def format_json(val): result = [val] return result class QuotesItem(scrapy.Item): """ 定义item,这样可以定义输入的规则 """ content = scrapy.Field( input_processor=MapCompose(add_title, remove_tags, str_strip) ) tags = scrapy.Field( input_processor=MapCompose(remove_tags, str_strip), # 单独定义自己的输出,会覆盖默认的输出 output_processor=MapCompose(format_json) )
七、上面案例的完整代码
1、
items.py
文件import scrapy import re from scrapy.loader import ItemLoader from scrapy.loader.processors import TakeFirst, MapCompose # 去除html标签 from w3lib.html import remove_tags class QuotesItemLoader(ItemLoader): # 设置只取第一个值 default_output_processor = TakeFirst() def str_strip(str): """ 定义一个函数去重字符中空格及换行 :param str: :return: """ return re.sub(re.compile('[\n\t\r]'), '', str) def add_title(val): """ 定义一个添加标题的函数 :param val: :return: """ return "标题:{0}".format(val) def format_json(val): result = [val] return result class QuotesItem(scrapy.Item): """ 定义item,这样可以定义输入的规则 """ content = scrapy.Field( input_processor=MapCompose(add_title, remove_tags, str_strip) ) tags = scrapy.Field( input_processor=MapCompose(remove_tags, str_strip), output_processor=MapCompose(format_json) )
2、在
spiders
爬虫中使用import scrapy from quotes_page.items import QuotesPageItem, QuotesItem, QuotesItemLoader class QuotesSpider(scrapy.Spider): name = 'quotes' allowed_domains = ['quotes.toscrape.com'] start_urls = ['http://quotes.toscrape.com/'] def parse(self, response): quote_list = response.xpath('//div[@class="quote"]') for quote in quote_list: item_loader = QuotesItemLoader(item=QuotesItem(), selector=quote) item_loader.add_xpath('content', './span[@class="text"]/text()') item_loader.add_xpath('tags', './/a[@class="tag"]/text()') yield item_loader.load_item()
3、运行
scrapy crawl quotes -o quotes.json
4、结果
[ {"content": "标题:“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "tags": ["change", "deep-thoughts", "thinking", "world"]}, {"content": "标题:“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "tags": ["abilities", "choices"]}, {"content": "标题:“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "tags": ["inspirational", "life", "live", "miracle", "miracles"]}, {"content": "标题:“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "tags": ["aliteracy", "books", "classic", "humor"]}, {"content": "标题:“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "tags": ["be-yourself", "inspirational"]}, {"content": "标题:“Try not to become a man of success. Rather become a man of value.”", "tags": ["adulthood", "success", "value"]}, {"content": "标题:“It is better to be hated for what you are than to be loved for what you are not.”", "tags": ["life", "love"]}, {"content": "标题:“I have not failed. I've just found 10,000 ways that won't work.”", "tags": ["edison", "failure", "inspirational", "paraphrased"]}, {"content": "标题:“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "tags": ["misattributed-eleanor-roosevelt"]}, {"content": "标题:“A day without sunshine is like, you know, night.”", "tags": ["humor", "obvious", "simile"]} ]