## Scrapy Tutorial

**last update: 2022-06-06 10:23:11**

----

[TOC=3,8]

----

[Scrapy | A Fast and Powerful Scraping and Web Crawling Framework](https://scrapy.org/)

https://github.com/scrapy/scrapy

https://github.com/scrapy-plugins

[Scrapy Tutorial — Scrapy 2.5.0 documentation](https://www.osgeo.cn/scrapy/intro/tutorial.html)

----

### Setting up a virtual environment (venv)

> A virtual environment is an isolated Python runtime created for a single application. Giving each application its own virtual environment resolves dependency conflicts between applications.

```shell
# create the virtual environment
python -m venv venv

# activate the virtual environment
source venv/bin/activate
```

[12. Virtual Environments and Packages — Python 3.11.3 documentation](https://docs.python.org/zh-cn/3/tutorial/venv.html#tut-venv)

[virtualenv Lives!](https://hynek.me/articles/virtualenv-lives/)

**On Windows**:

Run Windows PowerShell **as Administrator**:

```shell
PS D:\web\tutorial-env> set-executionpolicy remotesigned
PS D:\web\tutorial-env> get-executionpolicy
RemoteSigned
PS D:\web\tutorial-env> .\Scripts\activate
```

To make the PyCharm terminal activate the virtualenv automatically: Tools > Terminal: check "Activate virtualenv".

----

### conda

A venv solves package-version conflicts between projects, but what if we also need different Python versions? conda makes it easy to install and manage multiple versions of Python and pip.

Download: https://www.anaconda.com/products/individual (mirror in China: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/?C=M&O=D)

[Managing multiple Python versions with conda - 蒲公英云](https://dandelioncloud.cn/article/details/1526009310379524098)

[Anaconda install guide: solving the multiple-Python-version problem - CSDN blog](https://blog.csdn.net/qq_50048105/article/details/113859376)

[Installing Python 3 | 静觅](https://cuiqingcai.com/30035.html)

[Environment setup guides](https://setup.scrape.center/)

----

### DecryptLogin installation and usage

```shell
pip3 install DecryptLogin
```

todo ...

----

### Installing Scrapy

```shell
pip3 install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

scrapy -V
Scrapy 2.8.0 - no active project
```

see: [Installing Scrapy | 静觅](https://setup.scrape.center/scrapy)

----

### Using Scrapy

https://github.com/orgs/Python3WebSpider/repositories?q=scrapy&type=all&language=&sort=

https://github.com/orgs/Python3WebSpider/repositories?q=Pyppeteer+&type=all&language=&sort=

[Cui Qingcai's Python 3 web crawler tutorial (2022 edition) | 静觅](https://cuiqingcai.com/17777.html)

#### Creating a project

```shell
scrapy startproject tutorial
```

~~~
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
~~~

----

#### Generating a spider

~~~shell
cd tutorial
scrapy genspider quotes quotes.toscrape.com
~~~

The command above generates the file tutorial/tutorial/spiders/quotes.py:

~~~python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        pass
~~~

----

#### Using items

Items give the scraped data a well-defined set of fields.

tutorial/tutorial/items.py

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

#### Running the spider

Now edit our spider, tutorial/tutorial/spiders/quotes.py:

```python
import scrapy

from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item
```
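Before launching a full crawl, it can help to sanity-check the CSS selectors interactively with Scrapy's built-in shell. A quick sketch (the returned strings below are the first quote on the page, matching the JSON output shown after the run):

```shell
scrapy shell "https://quotes.toscrape.com"
>>> response.css('div.quote span.text::text').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> response.css('div.quote small.author::text').get()
'Albert Einstein'
```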
Run it:

```shell
scrapy crawl quotes -O quotes.json
```

The result, tutorial/quotes.json:

~~~json
[
{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "“Try not to become a man of success. Rather become a man of value.”", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide", "tags": ["life", "love"]},
{"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "“A day without sunshine is like, you know, night.”", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
~~~
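Note that `-O` overwrites the output file while `-o` appends to it. The same export can also be configured in code through the `FEEDS` setting (Scrapy 2.1+; the `overwrite` option needs 2.4+), so the crawl writes the file without any command-line flag. A minimal sketch:

```python
# tutorial/tutorial/settings.py
# Equivalent of `scrapy crawl quotes -O quotes.json` expressed as a feed export.
FEEDS = {
    "quotes.json": {
        "format": "json",
        "overwrite": True,  # like -O; False appends like -o
    },
}
```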
----

#### Using ItemLoader

[ItemLoader in Scrapy - 知乎](https://zhuanlan.zhihu.com/p/59905612/)

Create a new file tutorial/tutorial/itemloader.py (in Scrapy 2.x the processors live in the `itemloaders` package; the old `scrapy.loader.processors` path is deprecated):

```python
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, Join, Compose


class BaseLoader(ItemLoader):
    pass


class QuoteLoader(BaseLoader):
    pass
```

Modify tutorial/tutorial/spiders/quotes.py:

```python
import scrapy

from ..items import QuoteItem
from ..itemloader import QuoteLoader


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    # def parse(self, response):
    #     for quote in response.css('div.quote'):
    #         item = QuoteItem()
    #         item['text'] = quote.css('span.text::text').get()
    #         item['author'] = quote.css('small.author::text').get()
    #         item['tags'] = quote.css('div.tags a.tag::text').getall()
    #         yield item

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            loader = QuoteLoader(item=QuoteItem(), selector=quote)
            loader.add_css('text', '.text::text')
            loader.add_css('author', '.author::text')
            loader.add_css('tags', '.tag::text')
            yield loader.load_item()
```

Every Item Loader field has one input processor and one output processor.

Run the spider again and you will find that `text` has become a list; to fix that we need the Item Loader's input/output processors.

Modify tutorial/tutorial/itemloader.py:

```python
class BaseLoader(ItemLoader):
    default_output_processor = TakeFirst()
```

Now `text` looks the same as before. The built-in processors are introduced below.

**Identity**

Identity is the simplest processor: it performs no processing and returns the input values unchanged.

**TakeFirst**

TakeFirst returns the first non-empty value of a list, similar to `extract_first()`; it is commonly used as an output processor.

```python
processor = TakeFirst()
print(processor(['', 1, 2, 3]))
# 1
```

**Join**

Join works like the string `join()` method, concatenating a list into a single string; the default separator is a space.

```python
processor = Join(',')
print(processor(['one', 'two', 'three']))
# one,two,three
```

**Compose**

Compose is a processor built by composing several functions: each input value is passed to the first function, its output to the second, and so on, until the last function returns the output of the whole processor.

```python
processor = Compose(str.upper, lambda s: s.strip())
print(processor(' hello world'))
# HELLO WORLD
```

**MapCompose**

MapCompose is like Compose, but it iterates over a list of input values.

```python
processor = MapCompose(str.upper, lambda s: s.strip())
print(processor(['Hello', 'World', 'Python']))
# ['HELLO', 'WORLD', 'PYTHON']
# The input is an iterable; MapCompose walks it and processes each element in turn.
```

**SelectJmes**

SelectJmes queries JSON: pass in a key and it returns the matching value. It requires the jmespath library (`pip install jmespath`):

```python
from itemloaders.processors import SelectJmes

processor = SelectJmes('foo')
print(processor({'foo': 'bar'}))
# bar
```

**There are two ways to attach processors:**

1. `xxx_in` declares the input processor for field `xxx`, and `xxx_out` declares its output processor;
2. the `default_input_processor` and `default_output_processor` attributes declare the default input/output processors.

Modify tutorial/tutorial/itemloader.py:

```python
from scrapy.loader import ItemLoader
from itemloaders.processors import Identity, TakeFirst, Join, Compose


class BaseLoader(ItemLoader):
    default_output_processor = TakeFirst()


class QuoteLoader(BaseLoader):
    tags_out = Identity()
```

Now only `tags` keeps multiple values; every other field takes a single one.

----

### Scrapy features

**Duplicate request filtering**

By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hammering the server because of a programming mistake. This can be configured through the [DUPEFILTER_CLASS](https://www.osgeo.cn/scrapy/topics/settings.html#std-setting-DUPEFILTER_CLASS) setting.

----

### FAQ

**How do I crawl more links?**

A spider **starts from a single entry URL**, but that does not limit it to a one-shot crawl: inside `parse()` you can keep generating further requests with `yield scrapy.Request(next_page, callback=self.parse)`, `response.follow(next_page, self.parse)`, or `yield from response.follow_all(anchors, callback=self.parse)` **until every other page has been crawled.**

----

**How do I process and save the scraped data?**

```shell
scrapy runspider quotes_spider.py -o quotes.jl

cd project
scrapy crawl quotes -O quotes.json
scrapy crawl quotes -o quotes.jl
```

----

**How do I use proxies?**

----

**How do I run large-scale distributed crawls?**

----

**How do I handle logins?**

[Scrapy middleware (Middleware) explained](https://mp.weixin.qq.com/s?__biz=MzAxMjUyNDQ5OA==&mid=2653557181&idx=1&sn=c62810ab78f40336cb721212ab83f7bd&chksm=806e3f00b719b616286ec1a07f9a5b9eeaba105781f93491685fbe732c60db0118852cfeeec8&scene=27)

[Pitfalls when adding cookies in Scrapy - 51CTO blog](https://blog.51cto.com/u_11949039/2859241)

[Scrapy basic components (12): simulating login](https://www.bbsmax.com/A/l1dy7YAxJe/)

> Setting [`COOKIES_DEBUG=True`](https://www.osgeo.cn/scrapy/topics/downloader-middleware.html#std-setting-COOKIES_DEBUG) in `settings.py` lets you watch cookies being passed back and forth in the terminal.

[Settings — Scrapy 2.5.0 documentation](https://www.osgeo.cn/scrapy/topics/settings.html#topics-settings-ref)

----

**How do I handle CAPTCHAs?**

----

**How do I handle sliders and similar anti-bot human-verification checks?**

----

**How do I handle encryption-based anti-scraping?**

----

**How do I use a headless browser?**

----

**How do I integrate Selenium with Scrapy?**

----

**How do I manage and control spiders?**

[Scrapyd 1.4.1 documentation](https://scrapyd.readthedocs.io/en/latest/)

----
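As a rough sketch of the Scrapyd workflow, reusing the `tutorial` project and `quotes` spider from above (`scrapyd-deploy` comes from the separate scrapyd-client package, and deploying assumes the `url` line in the `[deploy]` section of scrapy.cfg has been uncommented to point at the server):

```shell
pip3 install scrapyd scrapyd-client

# start the Scrapyd server (listens on http://localhost:6800 by default)
scrapyd

# from the project directory: package the project as an egg and upload it
scrapyd-deploy

# schedule a run of the quotes spider through Scrapyd's HTTP API
curl http://localhost:6800/schedule.json -d project=tutorial -d spider=quotes
```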