Crawling data with a single sequential task is too slow, and managing raw threads or processes by hand brings a lot of overhead, so I switched to a thread pool.

Comparing the thread pool and the process pool:
```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
import time


def parse(i):
    print(">>>>>>:{}".format(i))
    time.sleep(1)


# Suppose we have a list of tasks
lists = [1, 2, 3, 4, 5, 6]


def thread_demo():
    # Single-task baseline: run the jobs one after another
    start1 = time.time()
    for i in lists:
        parse(i)
    end1 = time.time()
    print("single task time1: " + str(end1 - start1))

    # Thread pool: hand in the jobs one by one with submit
    start2 = time.time()
    with ThreadPoolExecutor(3) as thd:
        for i in lists:
            thd.submit(parse, i)
    end2 = time.time()
    print("thread pool submit >> time2: " + str(end2 - start2))

    # Thread pool: hand in the whole list with map
    start3 = time.time()
    with ThreadPoolExecutor(3) as thd1:
        thd1.map(parse, lists)
    end3 = time.time()
    print("thread pool map >> time3: " + str(end3 - start3))


# Process pool
def process_demo():
    # Process pool: submit
    start1 = time.time()
    with ProcessPoolExecutor(3) as pro:
        for i in lists:
            pro.submit(parse, i)
    end1 = time.time()
    print("process pool submit >> time1: " + str(end1 - start1))

    # Process pool: map
    start2 = time.time()
    with ProcessPoolExecutor(3) as pro:
        pro.map(parse, lists)
    end2 = time.time()
    print("process pool map >> time2: " + str(end2 - start2))


if __name__ == '__main__':
    thread_demo()
    process_demo()
```
The output looks like this:
```
/home/derek/.virtualenvs/spider3.5/bin/python /home/derek/Desktop/spider2/Crawl_Spider/spiders/Theadpool.py
>>>>>>:1
>>>>>>:2
>>>>>>:3
>>>>>>:4
>>>>>>:5
>>>>>>:6
single task time1: 6.005887985229492
>>>>>>:1
>>>>>>:2
>>>>>>:3
>>>>>>:4
>>>>>>:5
>>>>>>:6
thread pool submit >> time2: 2.0025179386138916
>>>>>>:1
>>>>>>:2
>>>>>>:3
>>>>>>:4
>>>>>>:5
>>>>>>:6
thread pool map >> time3: 2.002720832824707
>>>>>>:1
>>>>>>:2
>>>>>>:3
>>>>>>:4
>>>>>>:5
>>>>>>:6
process pool submit >> time1: 2.0118319988250732
>>>>>>:1
>>>>>>:2
>>>>>>:3
>>>>>>:4
>>>>>>:5
>>>>>>:6
process pool map >> time2: 2.0081613063812256
Process finished with exit code 0
```
- Even in this simple example the thread pool comes out slightly ahead of the process pool, and this is why the later crawler code mainly uses thread pools: spawning a process inevitably costs more than spawning a thread, and crawling is I/O-bound, so the GIL does not hold the threads back (the last sketch below applies the pattern to real requests).
- map returns results in input order, while results from submit come back in whatever order the tasks finish. The run above doesn't show this because every task takes the same one second; the first sketch below does.
- To add tasks while the pool is already running you would otherwise need to override methods of the thread pool or of Future; the second sketch below shows a standard-library pattern that usually avoids that.
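Here is a minimal sketch of the ordering difference (`square` and the random sleeps are made up for illustration): `map` hands results back in input order no matter which task finishes first, while collecting `submit` futures with `as_completed` hands them back in completion order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n):
    # Random sleep so the tasks finish out of order
    time.sleep(random.random())
    return n * n


if __name__ == '__main__':
    with ThreadPoolExecutor(3) as pool:
        # map: results come back in input order, whatever finishes first
        print(list(pool.map(square, [1, 2, 3, 4, 5, 6])))
        # submit + as_completed: results come back in completion order
        futures = [pool.submit(square, n) for n in [1, 2, 3, 4, 5, 6]]
        print([f.result() for f in as_completed(futures)])
```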
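As for adding tasks on the fly: an executor keeps accepting `submit()` calls until `shutdown()` runs, so in many cases a pending set plus `concurrent.futures.wait` is enough, without touching the pool internals. A rough sketch under that assumption (`fetch` and its fake link-following are placeholders, not real crawler code):

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fetch(url):
    time.sleep(0.5)  # stand-in for a network request
    # Pretend every page links to one more page, two levels deep
    return [url + '/next'] if url.count('/next') < 2 else []


if __name__ == '__main__':
    pool = ThreadPoolExecutor(3)
    pending = {pool.submit(fetch, u) for u in ['a', 'b', 'c']}
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            for new_url in future.result():
                # Submitting here is fine: the pool keeps accepting
                # tasks until shutdown() is called
                pending.add(pool.submit(fetch, new_url))
    pool.shutdown()
```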
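And since the whole point is crawling, the same pattern applied to real downloads looks roughly like this (the `requests` library and the example URLs are my own assumptions, not part of the original measurements):

```python
import requests
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Download one page; a real spider would parse the response here
    resp = requests.get(url, timeout=10)
    return url, resp.status_code


if __name__ == '__main__':
    urls = ['https://example.com/page/{}'.format(i) for i in range(1, 7)]
    with ThreadPoolExecutor(3) as pool:
        for url, status in pool.map(fetch, urls):
            print(url, status)
```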