Scrapy's main dependencies:

- lxml: an efficient XML and HTML parser
- parsel: an HTML/XML data extraction library built on top of lxml
- w3lib: a multi-purpose helper for handling URLs and web page encodings
- twisted: an asynchronous networking framework
- cryptography and pyOpenSSL: handle various network-level security needs

Normally, running `pip install scrapy` resolves these dependencies automatically.
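To get a feel for what lxml (the parser the rest of the stack builds on) does, here is a minimal, illustrative sketch; the HTML snippet is made up for the example:

```python
# Minimal lxml usage: parse an HTML string and query it with XPath.
from lxml import html

doc = html.fromstring(
    "<html><body><h1>Title</h1><a href='/x'>link</a></body></html>"
)

print(doc.xpath("//h1/text()")[0])  # Title
print(doc.xpath("//a/@href")[0])    # /x
```

parsel wraps this same machinery in a friendlier Selector API, which is what you later use as `response.css(...)` / `response.xpath(...)` inside Scrapy.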
A warning appeared:

```
λ scrapy version
:0: UserWarning: You do not have a working installation of the service_identity module: 'cannot import name opentype'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
Scrapy 1.5.1
```

The fix is to upgrade the pyasn1 library: `pip install --upgrade pyasn1`
```
(spiderenv) λ scrapy -h
Scrapy 1.5.1 - project: prob_spider

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
```
Create a Scrapy project
```
scrapy startproject demo
```
Create a new spider
```
(spiderenv) λ cd demo
(spiderenv) λ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
  demo.spiders.example
```
The shell command

Opens an interactive shell, which is handy for debugging.
```
(spiderenv) λ scrapy shell http://example.com
...
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001C6828A2978>
[s]   item       {}
[s]   request    <GET http://example.com>
[s]   response   <200 http://example.com>
[s]   settings   <scrapy.settings.Settings object at 0x000001C6828A2B38>
[s]   spider     <ExampleSpider 'example' at 0x1c6833404e0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
>>>
```