Powered by GitBook

选择器的使用

在第二章中介绍了解析器,常见的解析器有:xpth,css,正则,在scrapy框架中有个Selector的(注意不是新技术,也是上面的解析器)

一、关于`Selector`选择器的使用

1、直接使用

from scrapy import Selector # 需要查看选择的源码,从这个地方点击进去
html_doc = "<html><title>hello word</title><body></body</html>"
selector = Selector(text=html_doc)
title = selector.xpath('//title/text()').extract_first()

2、使用css选择器

# 获取文本信息
quote.css('.text::text').extract_first()
# 获取属性
quote.css('img::attr("src")').extract_first()

3、使用xpath选择器

# 获取文本信息
quote.xpath('./span[@class="text"]/text()').extract_first()
# 获取属性
quote.xpath('./a[@class="text"]/@href').extract_first()

4、使用正则

关于scrapy框架中返回的response不能直接使用正则,需要在之前使用xpth语法
```
quote.xpath('./a[@class="text"]/@href').re(...)
quote.xpath('./a[@class="text"]/@href').re_first(...)
```

5、总结

1.查看源码中可以知道,使用css最终也是使用xpath选择器,建议直接使用xpath选择器

2.关于extract和extract_first()的源码

    def extract(self):
        """
        Call the ``.extract()`` method for each element is this list and return
        their results flattened, as a list of unicode strings.
        """
        return [x.extract() for x in self]
    # 我们在写代码的时候也可以使用getall
    getall = extract

    def extract_first(self, default=None):
        """
        Return the result of ``.extract()`` for the first element in this list.
        If the list is empty, return the default value.
        """
        for x in self:
            return x.extract()
        else:
            return default
    # 我们写代码的时候也可以使用get
    get = extract_first

results matching ""

No results matching ""