First Experience with Scrapy
2021.04.25 Sun

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Setting Up a Python Virtual Environment

To keep third-party dependencies from always being installed globally, where they can conflict across projects, Python offers virtual environments so that each application gets its own isolated set of packages.

The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.

python -m venv virtual-env

The command creates a folder named virtual-env that contains a Python interpreter, the standard library, and a few supporting files; the Scripts folder inside it (bin on Unix and macOS) holds the scripts used to activate the virtual environment.

  • activate.bat: in Windows cmd, run virtual-env\Scripts\activate.bat
  • Activate.ps1: in Windows PowerShell, run virtual-env\Scripts\Activate.ps1
  • activate: on Unix and macOS, run source virtual-env/bin/activate
  • deactivate.bat: deactivates the virtual environment (on Unix and macOS, simply run deactivate)

Everything installed from now on, including Scrapy, goes into the virtual environment.
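For example, with the environment activated, Scrapy can be installed through pip and stays scoped to this project (standard pip commands; the exact versions pulled in will vary):

pip install scrapy    # installs Scrapy and its dependencies into virtual-env, not the global site-packages
pip list              # confirm the packages now live inside the virtual environment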

Starting a Scrapy Project

scrapy startproject tutorial test

This creates a Scrapy project named tutorial inside the test folder, which then contains the following:

scrapy.cfg            # deploy configuration file
tutorial/             # project's Python module, you'll import your code from here
    __init__.py
    items.py          # project items definition file
    middlewares.py    # project middlewares file
    pipelines.py      # project pipelines file
    settings.py       # project settings file
    spiders/          # a directory where you'll later put your spiders
        __init__.py

Spiders go into the spiders directory as separate files; settings.py is where things like the default request headers are configured.

# spiders/douban250_spider.py
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # each div.hd block holds one movie entry; yield its title
        for title in response.css('div.hd'):
            yield {
                'title': title.css('span.title::text').get()
            }
        # follow the "next page" link and parse it with this same method
        yield from response.follow_all(css='span.next a', callback=self.parse)
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    'Content-Type': 'text/html; charset=utf-8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
}
FEED_EXPORT_ENCODING = "UTF-8"  # encoding used for exported feeds, keeps Chinese text readable in the JSON output
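Two other settings in the same file are often worth a look when crawling a real site. This is only a sketch of commonly adjusted options, not something the post's project requires; ROBOTSTXT_OBEY and DOWNLOAD_DELAY are standard Scrapy settings, and the values below are illustrative:

# settings.py (optional, commonly adjusted)
ROBOTSTXT_OBEY = True    # the generated project template respects robots.txt by default
DOWNLOAD_DELAY = 1       # wait about one second between requests to the same site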
scrapy crawl douban -O douban-250.json    # run the spider and export the results to a JSON file
scrapy shell "https://movie.douban.com/top250"    # fetch the page and inspect it interactively on the command line
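Inside the shell a response object for the fetched page is already available, so the spider's CSS selectors can be tried out interactively (example expressions only; the actual output depends on the live page):

response.status                                     # HTTP status code of the fetched page
response.css('div.hd span.title::text').getall()    # the titles the spider would extract
response.css('span.next a::attr(href)').get()       # relative URL of the next page, if present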

References:
Scrapy documentation
