Date: 2021-05-22
Installing Tornado

If you want to keep things simple you can use the grequests library directly; the code below uses Tornado's asynchronous HTTP client instead. Based on an example in the official Tornado documentation, I adapted it into a simple asynchronous spider class. It is worth reading the latest documentation alongside this.
pip install tornado
An asynchronous spider
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time
from datetime import timedelta

from tornado import httpclient, gen, ioloop, queues


class AsySpider(object):
    """A simple asynchronous spider class."""

    def __init__(self, urls, concurrency=10, **kwargs):
        urls.reverse()
        self.urls = urls
        self.concurrency = concurrency
        self._q = queues.Queue()
        self._fetching = set()
        self._fetched = set()

    def fetch(self, url, **kwargs):
        fetch = getattr(httpclient.AsyncHTTPClient(), 'fetch')
        return fetch(url, **kwargs)

    def handle_html(self, url, html):
        """Handle one fetched HTML page."""
        print(url)

    def handle_response(self, url, response):
        """Inherit and override this method."""
        if response.code == 200:
            self.handle_html(url, response.body)
        elif response.code == 599:  # timeout; retry
            self._fetching.remove(url)
            self._q.put(url)

    @gen.coroutine
    def get_page(self, url):
        try:
            response = yield self.fetch(url)
            print('######fetched %s' % url)
        except Exception as e:
            print('Exception: %s %s' % (e, url))
            raise gen.Return(e)
        raise gen.Return(response)

    @gen.coroutine
    def _run(self):

        @gen.coroutine
        def fetch_url():
            current_url = yield self._q.get()
            try:
                if current_url in self._fetching:
                    return
                print('fetching****** %s' % current_url)
                self._fetching.add(current_url)
                response = yield self.get_page(current_url)
                self.handle_response(current_url, response)  # handle response
                self._fetched.add(current_url)
                for i in range(self.concurrency):
                    if self.urls:
                        yield self._q.put(self.urls.pop())
            finally:
                self._q.task_done()

        @gen.coroutine
        def worker():
            while True:
                yield fetch_url()

        self._q.put(self.urls.pop())  # add the first url

        # Start workers, then wait for the work queue to be empty.
        for _ in range(self.concurrency):
            worker()
        yield self._q.join(timeout=timedelta(seconds=300000))
        assert self._fetching == self._fetched

    def run(self):
        io_loop = ioloop.IOLoop.current()
        io_loop.run_sync(self._run)


class MySpider(AsySpider):

    def fetch(self, url, **kwargs):
        """Override the parent class's fetch() to add cookies, headers,
        a timeout, and so on."""
        # Cookie string copied from the browser
        cookies_str = "PHPSESSID=j1tt66a829idnms56ppb70jri4; pspt=%7B%22id%22%3A%2233153%22%2C%22pswd%22%3A%228835d2c1351d221b4ab016fbf9e8253f%22%2C%22_code%22%3A%22f779dcd011f4e2581c716d1e1b945861%22%7D; key=%E9%87%8D%E5%BA%86%E5%95%84%E6%9C%A8%E9%B8%9F%E7%BD%91%E7%BB%9C%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8; think_language=zh-cn; SERVERID=a66d7d08fa1c8b2e37dbdc6ffff82d9e|1444973193|1444967835; CNZZDATA1254842228=1433864393-1442810831-%7C1444972138"
        headers = {
            # The User-Agent value is truncated in the source:
            'User-Agent': 'mozilla/5.0 (compatible; baiduspider/2.0; +http://',
        }
        # ... (the remainder of the original script is truncated in the
        # source; the surviving tail shows a timing statement,
        # "print time.time() - _st", and an entry point that calls main()
        # under "if __name__ == '__main__':")
```

Any of these approaches is very efficient, but running a crawler this way puts considerable pressure on the target site's server, especially for small sites, so it is best to show some restraint.
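On modern Python the same queue-plus-workers pattern can be written with the standard library's asyncio instead of Tornado's gen coroutines. The sketch below is a minimal illustration under that assumption, not code from the original article: fake_fetch is a hypothetical stand-in for a real HTTP request (such as Tornado's AsyncHTTPClient.fetch) and only simulates network latency.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request; just simulates latency.
    await asyncio.sleep(0.01)
    return 'body of %s' % url


async def crawl(urls, concurrency=10):
    q = asyncio.Queue()
    for u in urls:
        q.put_nowait(u)
    results = {}

    async def worker():
        # Each worker pulls URLs off the shared queue until cancelled.
        while True:
            url = await q.get()
            try:
                results[url] = await fake_fetch(url)
            finally:
                q.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await q.join()        # wait until every queued URL has been handled
    for w in workers:
        w.cancel()        # workers loop forever; cancel them once done
    return results


if __name__ == '__main__':
    pages = ['http://example.com/?page=%d' % i for i in range(20)]
    out = asyncio.run(crawl(pages))
    print(len(out))  # prints 20
```

As in the Tornado version, `q.join()` plus `task_done()` is what signals completion; the workers themselves never exit on their own, so they are cancelled explicitly once the queue drains.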