Scrapy初体验

2021.04.25 Sun

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

设置 python 虚拟环境

为了避免安装第三方依赖时总是安装在全局而导致不同项目之间的冲突，python 可以使用虚拟环境的方式满足不同应用的需求。

The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.

1	`python -m venv virtual-env`

命令会创建一个 virtual-enviroment 的文件夹，里面包含 python 编译器、标准库和其他一些文件,其中在 Scripts 文件夹下包含激活虚拟环境的执行文件。

activate.bat 在 windows cmd 里运行 virtual-env\Scripts\activate.bat
Activate.PS1 在 windows pwd 里运行 virtual-env\Scripts\Activate.PS1
activate 在 Unix 和 MacOs 里运行 source virtual-env/bin/activate
deactivate.bat 取消虚拟环境

后面安装 scrapy 等都会安装在虚拟环境里

Scrapy 开始项目

1	`scrapy startproject tutorial test`

在 test 文件夹下创建名为 tutorial 的 Scrapy project，test 文件夹下有以下内容

scrapy.cfg            # deploy configuration file
tutorial/             # project's Python module, you'll import your code from here
    __init__.py
    items.py          # project items definition file
    middlewares.py    # project middlewares file
    pipelines.py      # project pipelines file
    settings.py       # project settings file
    spiders/          # a directory where you'll later put your spiders
        __init__.py

在 spiders 目录下写自己的爬虫文件，在 setting.py 里可以配置请求头等一些内容

#spiders/douban250_spider.py
import scrapy
class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = [
        'https://movie.douban.com/top250']
    def parse(self, response):
        for title in response.css('div.hd'):
            yield{
                'title':title.css('span.title::text').get()
            }
        yield from response.follow_all(css='span.next a',callback=self.parse)

#setting.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    'Content-Type': 'text/html; charset=utf-8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
}
FEED_EXPORT_ENCODING = "UTF-8" #设置导出编码格式

1	`scrapy crawl douabn -O douban-250.json #执行并导出结果为 json 文件`

1	`scrapy shell "https://movie.douban.com/top250" #在命令行里测试请求的页面信息`

参考链接：
Scrapy 文档

Python