Scrapy notes

Installation

pip install scrapy

Dependencies:

lxml: an efficient XML and HTML parser
parsel: an HTML/XML data extraction library built on top of lxml
w3lib: a multi-purpose helper for handling URLs and web page encodings
Twisted: an asynchronous networking framework
cryptography and pyOpenSSL: handle various network-level security needs

Running pip install scrapy will normally resolve these dependencies automatically.

A service_identity warning appears

The fix is to upgrade the pyasn1 library: pip install --upgrade pyasn1

λ scrapy version
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name opentype'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
Scrapy 1.5.1

Missing the win32api module

ModuleNotFoundError: No module named 'win32api'

(spiderenv) λ pip install pypiwin32

Virtual environment setup

Because of various problems with Python 2, a Python 3 virtual environment is used instead.

virtualenv --no-site-packages spiderenv

Activate the virtual environment

F:\Python_env\spiderenv
λ cd Scripts\

F:\Python_env\spiderenv\Scripts
λ .\activate.bat

F:\Python_env\spiderenv\Scripts
(spiderenv) λ python -V
Python 3.7.0

F:\Python_env\spiderenv\Scripts
(spiderenv) λ pip -V
pip 18.1 from f:\python_env\spiderenv\lib\site-packages\pip (python 3.7)

VS Code support for the Python virtual environment

Configure settings.json so the Python interpreter points at the virtualenv.
(screenshot of the settings.json configuration omitted)

Getting started

(spiderenv) λ scrapy -h                                                         
Scrapy 1.5.1 - project: prob_spider

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

Create a Scrapy project

scrapy startproject demo
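
This creates a project skeleton. With Scrapy 1.5 the generated layout looks roughly like this:

demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py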

Create a new spider

(spiderenv) λ cd demo

(spiderenv) λ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
demo.spiders.example
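
The 'basic' template produces a minimal spider. The generated demo/spiders/example.py should look roughly like this:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass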

The shell command

Drops you into an interactive shell, which is convenient for debugging.

(spiderenv) λ scrapy shell http://example.com

...

[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000001C6828A2978>
[s] item {}
[s] request <GET http://example.com>
[s] response <200 http://example.com>
[s] settings <scrapy.settings.Settings object at 0x000001C6828A2B38>
[s] spider <ExampleSpider 'example' at 0x1c6833404e0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
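
From the >>> prompt you can try the extraction tools on the response object right away; assuming the standard example.com page, something like:

>>> response.css('title::text').extract_first()
'Example Domain'
>>> response.xpath('//h1/text()').extract_first()
'Example Domain'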

Crawl with a spider

scrapy crawl myspider
  • Here myspider is the name of the spider, not the name of the project.

Writing spiders

start_urls

A list of URLs. When no particular URLs are specified, the spider starts crawling from this list.

start_requests()

The default implementation generates a Request for each URL in start_urls; it can be overridden.
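
Roughly, the default implementation is equivalent to this sketch:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)

The override below instead uses a lambda callback to pass an extra argument through to parse: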

def start_requests(self):

    ...

    # arg1=arg1 freezes the current value of arg1 as the lambda's default argument
    yield scrapy.Request(url, callback=lambda response, arg1=arg1: self.parse(response, arg1))


def parse(self, response, arg1):

    ...

    item = exampleSpiderItem()
    item['name'] = arg1
    yield item
  • The callback parameter of scrapy.Request can take a lambda here, which lets you pass extra arguments into parse.

parse(response)

Responsible for processing the response and returning the extracted data and/or follow-up URLs to crawl.

Data extraction tools: CSS/XPath

CSS selectors: http://www.w3school.com.cn/cssref/css_selectors.asp

XPath selectors: http://www.w3school.com.cn/xpath/xpath_syntax.asp

To extract the text inside a tag, use //text()

To extract all the text of an element, including text inside nested HTML tags, use string()

To get the actual raw data out of a selector, call the .extract() method

To get only the first matched element, call .extract_first()
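
Putting these together, a parse method might look like the following sketch (the selectors are only illustrative, not tied to a real page):

def parse(self, response):
    # //text() selects the text nodes inside the h1 tag
    title = response.xpath('//h1//text()').extract_first()
    # ::attr(href) is the CSS way to pull out an attribute value
    links = response.css('a::attr(href)').extract()
    # string(.) returns all text of the node, including text in child tags
    full_text = response.xpath('//div').xpath('string(.)').extract_first()
    yield {'title': title, 'links': links, 'full_text': full_text}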

Items

To define output data in a uniform way, Scrapy provides the Item class. An Item object is a simple container that holds the scraped data.

Define the item in items.py

import scrapy

class ProbSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

To assign values to the item in a spider, it has to be imported first:

from examplespider.items import exampleSpiderItem

Item Pipeline

After an Item has been collected in a Spider, it is passed to the Item Pipeline, where components process it one after another in a defined order.

Save items to a JSON file

import json

class ProbSpiderPipeline(object):

    def __init__(self):
        # utf-8 so non-ASCII text written with ensure_ascii=False is stored correctly
        self.file = open('prob.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()

You also need to uncomment the ITEM_PIPELINES block in settings.py to enable the pipeline (the value is the order in which pipelines run; lower numbers run first).

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prob_spider.pipelines.ProbSpiderPipeline': 300,
}

Practice project

Use Scrapy to crawl alternative phrasings of questions from Baidu Zhidao.

https://github.com/daolgts/prob_spider

Reference: https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/item-pipeline.html