1. Python Multiprocessing
Reference: Python Asynchronous Programming (Multiprocessing)
2. Python Multi-Process Crawler
Related: Python Crawler Introductory Tutorial
Because the GIL prevents threads from executing Python bytecode in parallel, most Python developers reach for multiple processes when they need more throughput. With the multiprocessing module you can do true parallel programming, for example writing a multi-process crawler to fetch pages concurrently. The drawback is that every process has its own memory space, so a large dataset gets duplicated per process and the memory footprint grows quickly.
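To see why that memory cost arises, note that child processes do not share ordinary Python objects; each one works on its own copy. A minimal sketch of this isolation (the data list and worker function here are illustrative, not from the original):

from multiprocessing import Process

# An ordinary module-level list -- NOT a shared object.
data = []

def worker(n):
    data.append(n)                 # mutates the child's private copy only
    print('child sees:', data)

if __name__ == '__main__':
    procs = [Process(target=worker, args=(i,)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('parent sees:', data)    # still [] -- each process had its own memory

Truly shared state requires explicit primitives such as multiprocessing.Queue, Value, or Array.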
1) Multiprocessing usage example
#!/usr/bin/python
from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

# Shared resources. Note: sharing module-level objects like this relies on the
# fork start method (the Linux default); under spawn (e.g. Windows) each child
# re-imports the module and gets its own copies, so pass these to the Process
# constructor instead if you need portability.
buffer = Queue(10)
buffer.put('init')
empty = Semaphore(0)   # slots the producer may fill (freed by the consumer)
full = Semaphore(1)    # items the consumer may take (starts at 1 for 'init')
lock = Lock()          # serializes access to the queue

class Consumer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            full.acquire()          # wait until an item is available
            lock.acquire()
            print('Consumer get', buffer.get())
            time.sleep(1)
            lock.release()
            empty.release()         # signal the producer that a slot is free

class Producer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            empty.acquire()         # wait until the consumer has freed a slot
            lock.acquire()
            num = random()
            print('Producer put', num)
            buffer.put(num)
            time.sleep(1)
            lock.release()
            full.release()          # signal the consumer that an item is ready

if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.daemon = c.daemon = True      # daemons are killed when the main process exits
    p.start()
    c.start()
    # Both workers loop forever, so join() would block indefinitely;
    # run the demo for a few seconds, then exit and let the daemons be terminated.
    time.sleep(5)
    print('Done')
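Since multiprocessing.Queue already blocks on put() when the queue is full and on get() when it is empty, the explicit semaphores above can be dropped. A minimal variant sketching this (the capacity-1 queue and the 5-second run time are assumptions chosen to mirror the alternating behavior, not part of the original):

from multiprocessing import Process, Queue
import time
from random import random

def producer(q):
    while True:
        num = random()
        q.put(num)                           # blocks while the queue is full
        print('Producer put', num)
        time.sleep(1)

def consumer(q):
    while True:
        print('Consumer get', q.get())       # blocks while the queue is empty
        time.sleep(1)

if __name__ == '__main__':
    q = Queue(1)   # capacity 1 forces producer and consumer to alternate
    p = Process(target=producer, args=(q,), daemon=True)
    c = Process(target=consumer, args=(q,), daemon=True)
    p.start()
    c.start()
    time.sleep(5)  # run the demo briefly; daemons are killed on exit

Passing the queue as a constructor argument also works under the spawn start method, so this variant is portable to Windows.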
2) Multi-process crawler
from multiprocessing import Pool
import requests
from requests.exceptions import ConnectionError

def scrape(url):
    try:
        print(requests.get(url))
    except ConnectionError:
        print('Error occurred', url)
    finally:
        print('URL', url, 'scraped')

if __name__ == '__main__':
    # Initialize a Pool with 3 worker processes; if processes is omitted,
    # it defaults to the number of CPU cores.
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'https://www.meituan.com/',
        'https://blog.csdn.net/',
        'https://www.zhihu.com'
    ]
    # map() distributes the URLs across the workers, calling scrape on each.
    pool.map(scrape, urls)
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for the workers to finish
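pool.map also gathers each call's return value in order, so the crawler can report results in the parent instead of only printing inside the workers. A sketch under that assumption (the timeout and the (url, status) return shape are illustrative additions, not from the original):

from multiprocessing import Pool
import requests
from requests.exceptions import ConnectionError

def scrape(url):
    # Return (url, status code), or (url, None) if the connection failed.
    try:
        return url, requests.get(url, timeout=10).status_code
    except ConnectionError:
        return url, None

if __name__ == '__main__':
    urls = [
        'https://www.baidu.com',
        'https://www.zhihu.com',
    ]
    # The with-statement closes and joins the pool automatically.
    with Pool(processes=3) as pool:
        for url, status in pool.map(scrape, urls):
            print(url, '->', status if status is not None else 'connection error')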