Scraping Paginated Data

When scraping a website, you often need to collect data across multiple pages. This chapter introduces several common approaches to paginated scraping.

Method 1: List multiple URLs directly in the Spider's start_urls
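
A minimal sketch of this approach (the spider name and page URLs here simply mirror the examples later in this chapter):

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog'
    # Each listed URL is scheduled as its own initial request
    start_urls = [
        'http://python.jobbole.com/all-posts/page/1/',
        'http://python.jobbole.com/all-posts/page/2/',
        'http://python.jobbole.com/all-posts/page/3/',
    ]

    def parse(self, response):
        # parse() is called once for each page in start_urls
        ...

This works, but hard-coding every page quickly becomes unwieldy, which motivates the next two methods.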

Method 2: Build start_urls with a list comprehension

start_urls = ['http://python.jobbole.com/all-posts/page/{0}/'.format(page) for page in range(1, 5)]
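
This comprehension expands to the listing URLs for pages 1 through 4, and Scrapy schedules an initial request for each of them at startup. It only helps when the total number of pages is known in advance.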

Method 3: Extract the next-page URL from the response and issue a follow-up request

import scrapy
from scrapy import Request
from urllib import parse


class BlogSpider(scrapy.Spider):
    name = 'blog'
    allowed_domains = ['blog.jobbole.com', 'python.jobbole.com']
    start_urls = ['http://python.jobbole.com/all-posts/']

    def parse(self, response):
        # Each node is one post entry on the listing page
        post_nodes = response.xpath('//div[@id="archive"]/div[@class="post floated-thumb"]')

        for post_node in post_nodes:
            detail_link = post_node.xpath('./div[@class="post-meta"]//a/@href').get()
            title = post_node.xpath('./div[@class="post-meta"]//a[@class="archive-title"]/text()').get()
            # Request the detail page, passing the title along via meta
            yield Request(url=parse.urljoin(response.url, detail_link), callback=self.detail_parse,
                          meta={'title': title})

        # Extract the next-page URL and follow it with the same callback
        next_page = response.xpath('//div[@class="navigation margin-20"]/a[last()]/@href').get()
        if next_page:
            yield Request(url=parse.urljoin(response.url, next_page), callback=self.parse)

    def detail_parse(self, response):
        # Minimal detail callback: read the title passed in via meta and yield an item
        yield {'title': response.meta.get('title'), 'url': response.url}
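
As a variant, Scrapy 1.4 and later provide response.follow, which resolves relative hrefs against response.url itself, so the manual urljoin can be dropped. A minimal sketch of just the next-page tail of parse():

        # Inside parse(), replacing the last three lines above
        next_page = response.xpath('//div[@class="navigation margin-20"]/a[last()]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Because each page yields the request for the one after it, the spider keeps following pagination until no next-page link is found, with no need to know the page count up front.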
