Scrapy custom commands
I. Scrapy's built-in commands
(scrapy_page) ➜ ~ scrapy -h
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
(scrapy_page) ➜ ~
II. A custom command for running a single spider
1. In the project root directory, create a file such as start.py (the name is arbitrary):

from scrapy.cmdline import execute

if __name__ == "__main__":
    # replace spider_name with your spider's name attribute
    execute("scrapy crawl spider_name --nolog".split())
    # or, equivalently:
    # execute(["scrapy", "crawl", "blog", "--nolog"])
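Because execute() receives the same argument list as the shell command, any scrapy crawl option can be passed from the script as well. A minimal sketch, assuming a spider named blog whose __init__ accepts a category argument (passed via -a); adjust the names to your own project:

from scrapy.cmdline import execute

if __name__ == "__main__":
    # -a passes a keyword argument to the spider, -o writes the scraped items to a file
    execute(["scrapy", "crawl", "blog",
             "-a", "category=python",
             "-o", "items.json",
             "--nolog"])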
2. Right-click the file and run it to start the spider; there is no need to type the command on the command line.
3. An improved version of the script:

import os
import sys

from scrapy.cmdline import execute

if __name__ == "__main__":
    # make sure the project root is on sys.path so the project package can be imported
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute("scrapy crawl blog --nolog".split())
    # or, equivalently:
    # execute(["scrapy", "crawl", "blog", "--nolog"])
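As an alternative, Scrapy also exposes CrawlerProcess for running spiders from a script without going through the command-line parser. A minimal sketch, assuming the file sits in the project root next to scrapy.cfg and the project contains a spider named blog:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    # get_project_settings() reads settings.py, so pipelines and middlewares still apply
    process = CrawlerProcess(get_project_settings())
    process.crawl("blog")   # the spider's name attribute
    process.start()         # blocks until the crawl finishes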
III. A custom command for running all spiders in a project together
1. Create a directory with any name, e.g. commands, at the same level as the spiders directory, as sketched below.
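A rough sketch of the resulting layout, assuming the project is named csdn (matching the COMMANDS_MODULE value used in step 3 below); the empty __init__.py is the usual convention so that csdn.commands can be imported as a package:

csdn/
├── scrapy.cfg
└── csdn/
    ├── __init__.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py      # empty file
    │   └── crawlall.py      # created in the next step
    └── spiders/
        ├── __init__.py
        └── ...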
2. Inside it, create a file named crawlall.py; this file name becomes the custom command:

from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # get the names of all spiders in the project
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
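ScrapyCommand also lets a command define its own options. A small sketch of how crawlall.py could be extended with an --exclude flag, assuming Scrapy 1.5, whose commands still use optparse (newer releases switched to argparse, i.e. parser.add_argument):

from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        # keep the standard Scrapy options and add our own
        ScrapyCommand.add_options(self, parser)
        parser.add_option('--exclude', dest='exclude', default=None,
                          help='name of one spider to skip')

    def run(self, args, opts):
        for name in self.crawler_process.spiders.list():
            if name == opts.exclude:
                continue
            self.crawler_process.crawl(name)
        self.crawler_process.start()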
3. Add the configuration to settings.py:

# COMMANDS_MODULE = '<project name>.<directory name>'
# register the custom command module
COMMANDS_MODULE = 'csdn.commands'
4. Run scrapy -h from the project directory to see the new command listed, and start all spiders with scrapy crawlall.