1. Python Multiprocessing
Reference: Python Asynchronous Programming: Multiprocessing
2. Python Multiprocess Crawler
Related: Getting Started with Python Crawlers
To speed up execution in Python, most developers reach for multiple processes: because the GIL keeps threads in a single interpreter from running Python bytecode in parallel, parallel programming is usually done with multiprocessing, which also lets you write a multiprocess crawler to fetch pages concurrently. The drawback is that each process has its own memory space, so crawling large amounts of data can consume a lot of RAM.
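To see that per-process memory concretely, here is a minimal sketch (counter and work are illustrative names, not part of the examples below): a global incremented in the child stays unchanged in the parent.

from multiprocessing import Process

counter = 0  # lives in each process's own memory

def work():
    global counter
    counter += 1                     # modifies the child's private copy
    print('in child:', counter)     # prints 1

if __name__ == '__main__':
    p = Process(target=work)
    p.start()
    p.join()
    print('in parent:', counter)    # still 0: the child's write never reaches the parent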
1) Multiprocessing usage example
#!/usr/bin/python
from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

buffer = Queue(10)
buffer.put('init')    # pre-load one item so the consumer can start
empty = Semaphore(0)  # counts free slots the producer may fill
full = Semaphore(1)   # counts items the consumer may take
lock = Lock()

class Consumer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            full.acquire()   # wait until an item is available
            lock.acquire()
            print('Consumer get', buffer.get())
            time.sleep(1)
            lock.release()
            empty.release()  # tell the producer a slot is free

class Producer(Process):
    def run(self):
        global buffer, empty, full, lock
        while True:
            empty.acquire()  # wait until there is a free slot
            lock.acquire()
            num = random()
            print('Producer put ', num)
            buffer.put(num)
            time.sleep(1)
            lock.release()
            full.release()   # tell the consumer an item is ready

if __name__ == '__main__':
    p = Producer()
    c = Consumer()
    p.daemon = c.daemon = True
    p.start()
    c.start()
    # Both run() loops run forever, so these join() calls never return;
    # stop the program with Ctrl+C (the final print is never reached).
    p.join()
    c.join()
    print('Finished')
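Note that the example shares buffer and the synchronization primitives through module-level globals. That works under the fork start method (the Linux default), where children inherit the parent's objects, but under spawn (the default on Windows, and on macOS since Python 3.8) each child re-imports the module and builds fresh, unrelated objects. A spawn-safe sketch of the same pattern passes the shared objects to the constructors instead; this rewrite is a suggestion, not part of the original tutorial:

from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

class Consumer(Process):
    def __init__(self, buffer, empty, full, lock):
        super().__init__()
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.full.acquire()       # wait until an item is available
            with self.lock:
                print('Consumer get', self.buffer.get())
                time.sleep(1)
            self.empty.release()      # signal a free slot

class Producer(Process):
    def __init__(self, buffer, empty, full, lock):
        super().__init__()
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.empty.acquire()      # wait until there is a free slot
            with self.lock:
                num = random()
                print('Producer put ', num)
                self.buffer.put(num)
                time.sleep(1)
            self.full.release()       # signal an item

if __name__ == '__main__':
    buffer = Queue(10)
    buffer.put('init')
    empty, full, lock = Semaphore(0), Semaphore(1), Lock()
    p = Producer(buffer, empty, full, lock)
    c = Consumer(buffer, empty, full, lock)
    p.daemon = c.daemon = True
    p.start()
    c.start()
    time.sleep(5)  # let them run briefly; daemon processes die with the main process
    print('Finished')

Because empty starts at 0 and full starts at 1 (matching the one pre-loaded item), the two processes strictly alternate: the producer can only put after the consumer has taken.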
2) Multiprocess crawler
from multiprocessing import Pool
import requests
from requests.exceptions import ConnectionError

def scrape(url):
    try:
        print(requests.get(url))
    except ConnectionError:
        print('Error Occurred ', url)
    finally:
        print('URL ', url, ' Scraped')

if __name__ == '__main__':
    # Create a Pool with 3 worker processes; if processes is omitted,
    # the pool defaults to the number of CPU cores.
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'https://www.meituan.com/',
        'https://blog.csdn.net/',
        'https://www.zhihu.com'
    ]
    # map applies scrape to every URL across the worker processes
    # and blocks until all of them finish.
    pool.map(scrape, urls)
    pool.close()
    pool.join()
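pool.map also gathers each worker's return value, in input order, so the crawler can collect results rather than only printing inside the workers. A small sketch along those lines (fetch_status is a hypothetical helper, and the 10-second timeout is an assumed value):

from multiprocessing import Pool
import requests
from requests.exceptions import ConnectionError

def fetch_status(url):
    # return a (url, status) pair instead of printing in the worker
    try:
        return url, requests.get(url, timeout=10).status_code
    except ConnectionError:
        return url, None

if __name__ == '__main__':
    urls = [
        'https://www.baidu.com',
        'https://www.meituan.com/',
        'https://blog.csdn.net/',
        'https://www.zhihu.com'
    ]
    with Pool(processes=3) as pool:  # the context manager tears the pool down for us
        for url, status in pool.map(fetch_status, urls):
            print(url, '->', status)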