Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
设置 python 虚拟环境为了避免安装第三方依赖时总是安装在全局而导致不同项目之间的冲突,python 可以使用虚拟环境的方式满足不同应用的需求。
The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.
1 python -m venv virtual-env
命令会创建一个 virtual-enviroment 的文件夹,里面包含 python 编译器、标准库和其他一些文件,其中在 Scripts 文件夹下包含激活虚拟环境的执行文件。
activate.bat 在 windows cmd 里运行 virtual-env\Scripts\activate.bat
Activate.PS1 在 windows pwd 里运行 virtual-env\Scripts\Activate.PS1
activate 在 Unix 和 MacOs 里运行 source virtual-env/bin/activate
deactivate.bat 取消虚拟环境 后面安装 scrapy 等都会安装在虚拟环境里
Scrapy 开始项目1 scrapy startproject tutorial test
在 test 文件夹下创建名为 tutorial 的 Scrapy project,test 文件夹下有以下内容
1 2 3 4 5 6 7 8 9 scrapy.cfg # deploy configuration file tutorial/ # project's Python module, you'll import your code from here __init__.py items.py # project items definition file middlewares.py # project middlewares file pipelines.py # project pipelines file settings.py # project settings file spiders/ # a directory where you'll later put your spiders __init__.py
在 spiders 目录下写自己的爬虫文件,在 setting.py 里可以配置请求头等一些内容
1 2 3 4 5 6 7 8 9 10 11 12 import scrapyclass DoubanSpider (scrapy.Spider): name = 'douban' start_urls = [ 'https://movie.douban.com/top250' ] def parse (self, response ): for title in response.css('div.hd' ): yield { 'title' :title.css('span.title::text' ).get() } yield from response.follow_all(css='span.next a' ,callback=self.parse)
1 2 3 4 5 6 7 8 9 10 11 DEFAULT_REQUEST_HEADERS = { 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' , 'Accept-Encoding' : 'gzip, deflate' , 'Accept-Language' : 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7' , 'Content-Type' : 'text/html; charset=utf-8' , 'Cache-Control' : 'max-age=0' , 'Connection' : 'keep-alive' , 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36' } FEED_EXPORT_ENCODING = "UTF-8"
1 scrapy crawl douabn -O douban-250.json
1 scrapy shell "https://movie.douban.com/top250"
参考链接:Scrapy 文档