Scrapy notes

Installation

pip install scrapy

Dependencies:

lxml: an efficient XML and HTML parser
parsel: an HTML/XML data extraction library built on top of lxml
w3lib: a multi-purpose helper for handling URLs and web page encodings
Twisted: an asynchronous networking framework
cryptography and pyOpenSSL: handle various network-level security needs

Running pip install scrapy will normally resolve these dependencies automatically.

A service_identity warning appears

The fix is to upgrade the pyasn1 library: pip install --upgrade pyasn1

λ scrapy version
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name opentype'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
Scrapy 1.5.1

Missing the win32api module

ModuleNotFoundError: No module named 'win32api'

(spiderenv) λ pip install pypiwin32

Virtual environment setup

Because of various problems with Python 2, a Python 3 virtual environment is used instead.

virtualenv --no-site-packages spiderenv

Activate the virtual environment

F:\Python_env\spiderenv
λ cd Scripts\

F:\Python_env\spiderenv\Scripts
λ .\activate.bat

F:\Python_env\spiderenv\Scripts
(spiderenv) λ python -V
Python 3.7.0

F:\Python_env\spiderenv\Scripts
(spiderenv) λ pip -V
pip 18.1 from f:\python_env\spiderenv\lib\site-packages\pip (python 3.7)

VS Code support for the Python virtual environment

Configure settings.json so the Python interpreter points at the virtualenv.
(screenshot of the settings.json configuration omitted)

Getting started

(spiderenv) λ scrapy -h                                                         
Scrapy 1.5.1 - project: prob_spider

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

Create a Scrapy project

scrapy startproject demo
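
This creates a project skeleton. With Scrapy 1.5 the generated layout looks roughly like this:

demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py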

Create a new spider

(spiderenv) λ cd demo

(spiderenv) λ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
demo.spiders.example
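
The 'basic' template produces a minimal spider. The generated demo/spiders/example.py should look roughly like this:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass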

The shell command

Drops you into an interactive shell, which is convenient for debugging.

(spiderenv) λ scrapy shell http://example.com

...

[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001C6828A2978>
[s] item {}
[s] request <GET http://example.com>
[s] response <200 http://example.com>
[s] settings <scrapy.settings.Settings object at 0x000001C6828A2B38>
[s] spider <ExampleSpider 'example' at 0x1c6833404e0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
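
From the >>> prompt you can try the extraction tools on the response object right away; assuming the standard example.com page, something like:

>>> response.css('title::text').extract_first()
'Example Domain'
>>> response.xpath('//h1/text()').extract_first()
'Example Domain'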

Crawl with a spider

scrapy crawl myspider
  • Here myspider is the name of the spider, not the name of the project.

Writing spiders

start_urls

A list of URLs. When no particular URLs are specified, the spider starts crawling from this list.

start_requests()

The default implementation generates a Request for each URL in start_urls; it can be overridden.
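
Roughly, the default implementation is equivalent to this sketch:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)

The override below instead uses a lambda callback to pass an extra argument through to parse: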

def start_requests(self):

    ...

    # arg1=arg1 freezes the current value of arg1 as the lambda's default argument
    yield scrapy.Request(url, callback=lambda response, arg1=arg1: self.parse(response, arg1))


def parse(self, response, arg1):

    ...

    item = exampleSpiderItem()
    item['name'] = arg1
    yield item
  • The callback parameter of scrapy.Request can take a lambda here, which lets you pass extra arguments into parse.

parse(response)

Responsible for processing the response and returning the extracted data and/or follow-up URLs to crawl.

Data extraction tools: CSS/XPath

CSS selectors: http://www.w3school.com.cn/cssref/css_selectors.asp

XPath selectors: http://www.w3school.com.cn/xpath/xpath_syntax.asp

To extract the text inside a tag, use //text()

To extract all the text of an element, including text inside nested HTML tags, use string()

To get the actual raw data out of a selector, call the .extract() method

To get only the first matched element, call .extract_first()
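
Putting these together, a parse method might look like the following sketch (the selectors are only illustrative, not tied to a real page):

def parse(self, response):
    # //text() selects the text nodes inside the h1 tag
    title = response.xpath('//h1//text()').extract_first()
    # ::attr(href) is the CSS way to pull out an attribute value
    links = response.css('a::attr(href)').extract()
    # string(.) returns all text of the node, including text in child tags
    full_text = response.xpath('//div').xpath('string(.)').extract_first()
    yield {'title': title, 'links': links, 'full_text': full_text}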

Items

To define output data in a uniform way, Scrapy provides the Item class. An Item object is a simple container that holds the scraped data.

Define the item in items.py

import scrapy

class ProbSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

To assign values to the item in a spider, it has to be imported first:

from examplespider.items import exampleSpiderItem

Item Pipeline

After an Item has been collected in a Spider, it is passed to the Item Pipeline, where components process it one after another in a defined order.

Save items to a JSON file

import json

class ProbSpiderPipeline(object):

    def __init__(self):
        # utf-8 so non-ASCII text written with ensure_ascii=False is stored correctly
        self.file = open('prob.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()

You also need to uncomment the ITEM_PIPELINES block in settings.py to enable the pipeline (the value is the order in which pipelines run; lower numbers run first).

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prob_spider.pipelines.ProbSpiderPipeline': 300,
}

Practice project

Use Scrapy to crawl alternative phrasings of questions from Baidu Zhidao.

https://github.com/daolgts/prob_spider

Reference: https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/item-pipeline.html