Scrapy custom commands
I. Scrapy's built-in commands
(scrapy_page) ➜ ~ scrapy -h
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
(scrapy_page) ➜ ~
II. A custom command for running a single spider
1. In the project root directory, create a file such as start.py (the name is arbitrary):

from scrapy.cmdline import execute

if __name__ == "__main__":
    # replace spider_name with your spider's name attribute
    execute("scrapy crawl spider_name --nolog".split())
    # or, equivalently:
    # execute(["scrapy", "crawl", "blog", "--nolog"])
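Because execute() receives the same argument list as the shell command, any scrapy crawl option can be passed from the script as well. A minimal sketch, assuming a spider named blog whose __init__ accepts a category argument (passed via -a); adjust the names to your own project:

from scrapy.cmdline import execute

if __name__ == "__main__":
    # -a passes a keyword argument to the spider, -o writes the scraped items to a file
    execute(["scrapy", "crawl", "blog",
             "-a", "category=python",
             "-o", "items.json",
             "--nolog"])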
2. Right-click the file and run it to start the spider; there is no need to type the command on the command line.
3. An improved version of the script:

import os
import sys

from scrapy.cmdline import execute

if __name__ == "__main__":
    # make sure the project root is on sys.path so the project package can be imported
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute("scrapy crawl blog --nolog".split())
    # or, equivalently:
    # execute(["scrapy", "crawl", "blog", "--nolog"])
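As an alternative, Scrapy also exposes CrawlerProcess for running spiders from a script without going through the command-line parser. A minimal sketch, assuming the file sits in the project root next to scrapy.cfg and the project contains a spider named blog:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    # get_project_settings() reads settings.py, so pipelines and middlewares still apply
    process = CrawlerProcess(get_project_settings())
    process.crawl("blog")   # the spider's name attribute
    process.start()         # blocks until the crawl finishes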
III. A custom command for running all spiders in a project together
1. Create a directory with any name, e.g. commands, at the same level as the spiders directory, as sketched below.
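A rough sketch of the resulting layout, assuming the project is named csdn (matching the COMMANDS_MODULE value used in step 3 below); the empty __init__.py is the usual convention so that csdn.commands can be imported as a package:

csdn/
├── scrapy.cfg
└── csdn/
    ├── __init__.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py      # empty file
    │   └── crawlall.py      # created in the next step
    └── spiders/
        ├── __init__.py
        └── ...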
2. Inside it, create a file named crawlall.py; this file name becomes the custom command:

from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # get the names of all spiders in the project
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
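ScrapyCommand also lets a command define its own options. A small sketch of how crawlall.py could be extended with an --exclude flag, assuming Scrapy 1.5, whose commands still use optparse (newer releases switched to argparse, i.e. parser.add_argument):

from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        # keep the standard Scrapy options and add our own
        ScrapyCommand.add_options(self, parser)
        parser.add_option('--exclude', dest='exclude', default=None,
                          help='name of one spider to skip')

    def run(self, args, opts):
        for name in self.crawler_process.spiders.list():
            if name == opts.exclude:
                continue
            self.crawler_process.crawl(name)
        self.crawler_process.start()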
3. Add the configuration to settings.py:

# COMMANDS_MODULE = '<project name>.<directory name>'
# register the custom command module
COMMANDS_MODULE = 'csdn.commands'
4. Run scrapy -h from the project directory to see the new command listed, and start all spiders with scrapy crawlall.