Powered by GitBook

使用scrapy进行模拟登录(一)

一、登录的介绍

常见的登录方式有两种

1、Form提交的(scrapy支持的方式)
2、request payload形式的提交方式(scrapy不支持的提交方式)

二、关于爬虫中发送`POST`请求的补充

虽然上面我们介绍Requests中有个method可以指定发送POST请求,但是我们在实际开发过程中不会这样使用,而是使用Request的子类FormRequest来实现。
如果一开始我们就要执行POST请求,那么我们就要重写爬虫的start_requests(self)方法,并且不再调用start_urls里面的url。

三、使用`FormRequest`模拟登录`github`

1、注意点,登录github的时候,我们需要先访问页面,获取页面中的csrf_token认证的值,才能继续模拟登录,这个地方我们就不用重写start_requests方法

2、具体实现模拟登录github代码

import scrapy

class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        url = 'https://github.com/session'
        token = response.xpath('//input[@name="authenticity_token"]/@value').get()
        print('当前token===>', token)
        data = {
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': token,
            'login': '[email protected]',
            'password': '123456'
        }
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.user_parse)

    def user_parse(self, response):
        with open('github.html', 'w', encoding='utf8') as fb:
            fb.write(response.text)

四、使用`start_requests`重写起始访问网址登陆人人网

  import scrapy

  class RenrenSpider(scrapy.Spider):
      name = 'renren'
      allowed_domains = ['renren.com']
      start_urls = ['http://renren.com/']

      def start_requests(self):
          url = 'http://www.renren.com/PLogin.do'
          data = {
              'email': '[email protected]',
              'password': '123456'
          }
          yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_page)

      def parse_page(self, response):
          with open('renren.html', 'w', encoding='utf8') as f:
              f.write(response.text)

results matching ""

No results matching ""