Crawler Learning: Speeding Up a Crawler


This article was collected by 畅快蜻蜓, a blogger on 靠谱客 (kaopuke), during recent development work. It covers ways to speed up a web crawler and is shared here as a reference.

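Overview

I. Concurrency and Parallelism, Synchronous and Asynchronous

We have already covered the basic operations of a web crawler. Next we look at how to make a crawler faster. There are three approaches: multithreaded crawlers, multiprocess crawlers, and coroutine-based crawlers. Before getting into the details, let us first go over the concepts of concurrency and parallelism, and of synchronous and asynchronous execution.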

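Concurrency: several events take place within the same period of time. The tasks take turns and their execution is interleaved, so they all make progress during that period, but at any single instant only one of them is actually running.
Parallelism: several events take place at the same instant, that is, multiple tasks really execute at the same time.
Synchronous: the concurrent or parallel tasks do not run independently. They follow a certain order, and one task can start only after another has produced its result.
Asynchronous: the concurrent or parallel tasks run independently. One task is not affected by another, so tasks can proceed without waiting for each other.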
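
To make the difference between running tasks one after another and running them concurrently concrete, here is a minimal sketch. It does not use real threads: each task is a generator that hands control back after every step, and a tiny round-robin scheduler interleaves them on a single thread. The names task, run_serially, and run_concurrently are hypothetical helpers invented for this illustration.

```python
from collections import deque


def task(name, steps):
    """A task that hands control back after each step so others can run."""
    for i in range(1, steps + 1):
        yield f"{name}-step{i}"


def run_serially(tasks):
    """Run each task to completion before starting the next one."""
    log = []
    for t in tasks:
        log.extend(t)
    return log


def run_concurrently(tasks):
    """Round-robin on one thread: every task advances one step per turn."""
    log = []
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            log.append(next(t))
            ready.append(t)  # not finished yet, so queue it for another turn
        except StopIteration:
            pass  # this task is done and is dropped from the rotation
    return log


print(run_serially([task("A", 2), task("B", 2)]))
# ['A-step1', 'A-step2', 'B-step1', 'B-step2']
print(run_concurrently([task("A", 2), task("B", 2)]))
# ['A-step1', 'B-step1', 'A-step2', 'B-step2']
```

In the concurrent run both tasks make progress during the same period, yet only one step executes at any instant. Parallelism would require the steps to run on different cores at the same moment.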

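II. Multithreaded Crawlers

A multithreaded crawler executes concurrently. Multiple threads do not truly run at the same time; the crawler gets faster because the interpreter switches quickly between threads. Python (CPython) has a Global Interpreter Lock (GIL). Running a thread consists of acquiring the GIL, executing code until the thread is suspended, and releasing the GIL. A thread must hold the GIL in order to execute, so when several tasks arrive they first compete for the GIL, and only the one that gets it runs. A web crawler, however, is IO-bound, and multithreading improves its efficiency noticeably. With a single thread, every IO operation means waiting for IO, which wastes time. With multiple threads, when thread A is waiting the interpreter automatically switches to thread B, so CPU time is not wasted and the program runs more efficiently.

Python multithreading suits IO-bound code well. A web crawler can use multiple threads while fetching pages and speed things up that way.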
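
The effect of switching threads during IO waits can be sketched without touching the network. In the example below, fake_fetch is a hypothetical stand-in for requests.get: it only sleeps, and sleeping releases the GIL in the same way a real socket wait does. The URLs are placeholders, and the thread pool from concurrent.futures is used here only because it keeps the sketch short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.1):
    """Stand-in for a real request: the sleep plays the role of the IO wait."""
    time.sleep(delay)
    return url, 200


def timed(func):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


urls = [f"http://example.com/page{i}" for i in range(8)]

# One thread: every wait happens back to back
serial_result, serial_time = timed(lambda: [fake_fetch(u) for u in urls])

# Four threads: while one thread waits, another can start its request
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded_result, threaded_time = timed(lambda: list(pool.map(fake_fetch, urls)))

print(f"serial:   {serial_time:.2f}s")
print(f"threaded: {threaded_time:.2f}s")
```

Both runs return the same results in the same order. The threaded run should finish in roughly a quarter of the time, because the waits overlap instead of adding up.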

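There are two ways to use threads in Python:

(1) Function style: call start_new_thread() in the _thread module to create a new thread.

import _thread
import time

# Define the function that each thread runs
def print_time(threadName, delay):
    count = 0
    while count < 3:
        time.sleep(delay)
        count += 1
        print(threadName, time.ctime())

# Start the new threads
_thread.start_new_thread(print_time, ("Thread-1", 1))
_thread.start_new_thread(print_time, ("Thread-2", 2))
print("Main Finished")
# _thread has no join(); keep the main thread alive so the workers can finish
time.sleep(7)

_thread provides low-level, primitive threads. Compared with the threading module, its functionality is fairly limited.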

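(2) Class-wrapper style: create threads with the threading module by subclassing threading.Thread.

The threading module provides the Thread class for working with threads. It includes the following methods:

run(): the method that represents the thread's activity
start(): starts the thread's activity
join([timeout]): blocks the calling thread until the thread whose join() was called terminates, or until the optional timeout elapses
is_alive(): returns whether the thread is alive (the older spelling isAlive() was removed in Python 3.9)
getName(): returns the thread name (deprecated since Python 3.10 in favor of the name attribute)
setName(): sets the thread name (deprecated since Python 3.10 in favor of the name attribute)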
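
The methods listed above can be tried out in a few lines. This is a small sketch that assumes Python 3.9 or later, so it uses is_alive() and the name attribute. The worker function and the delay values are made up for the illustration.

```python
import threading
import time


def worker(delay):
    """Pretend to do some work by sleeping."""
    time.sleep(delay)


t = threading.Thread(target=worker, args=(0.5,), name="Thread-1")
alive_before_start = t.is_alive()   # False: start() has not been called yet

t.start()                           # run() now executes worker in the new thread
t.join(timeout=0.1)                 # gives up waiting after 0.1 s
still_running = t.is_alive()        # True: the worker needs about 0.5 s

t.join()                            # no timeout: block until the thread ends
finished = not t.is_alive()         # True

t.name = "renamed"                  # the attribute replaces setName()/getName()
print(alive_before_start, still_running, finished, t.name)
```

Note that join() with a timeout returns whether or not the thread has ended, so is_alive() is the way to tell which case occurred.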

import threading
import time

class myThread(threading.Thread):
    def __init__(self, name, delay):
        threading.Thread.__init__(self)
        self.name = name
        self.delay = delay

    def run(self):
        print("Starting " + self.name)
        print_time(self.name, self.delay)
        print("Exiting " + self.name)

def print_time(threadName, delay):
    counter = 0
    while counter < 3:
        time.sleep(delay)
        print(threadName, time.ctime())
        counter += 1

threads = []
# Create the new threads
thread1 = myThread("Thread-1", 1)
thread2 = myThread("Thread-2", 2)
# Start the new threads
thread1.start()
thread2.start()
# Add the threads to the thread list
threads.append(thread1)
threads.append(thread2)
# Wait for all threads to finish
for t in threads:
    t.join()
print("Exiting Main Thread")

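Next we use threads to compare the efficiency of different approaches.

The first is a simple single-threaded crawler that sends requests to 200 websites.

With this approach the 200 websites are requested one after another, and the whole run took 207 seconds.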
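
The listing for this single-threaded baseline does not appear in the text above. The following is only a sketch of what such a baseline might look like, not the original code. The function crawl_sequentially and its fetch parameter are hypothetical; passing the fetch function in lets the loop be tried with a stand-in instead of real network requests.

```python
import time


def crawl_sequentially(links, fetch):
    """Request each link in turn; the next request starts only after the
    previous one has returned or failed."""
    results = []
    start = time.time()
    for link in links:
        try:
            results.append((link, fetch(link)))
        except Exception as e:
            # Record the failure and move on to the next link
            results.append((link, e))
    return results, time.time() - start


def fake_fetch(link):
    """Stand-in for a real request so the sketch runs offline."""
    time.sleep(0.05)
    return 200


results, elapsed = crawl_sequentially(
    ["http://example.com/a", "http://example.com/b"], fake_fetch
)
for link, status in results:
    print(status, link)
print("Total time for the single-threaded crawler:", elapsed)
```

For a real run, fetch could be something like lambda link: requests.get(link, timeout=20).status_code, with the links read from aa.txt as in the other listings. The total time is the sum of all the individual waits, which is why the run above took so long.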

The second splits the 200 websites into four groups, each handled by its own thread.

import threading
import requests
import time

# Build the list of links
link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

# Build the threads
start = time.time()

class myThread(threading.Thread):
    def __init__(self, name, link_range):
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range

    def run(self):
        print("Starting " + self.name)
        crawler(self.name, self.link_range)
        print("Exiting " + self.name)

def crawler(threadName, link_range):
    # link_range is inclusive at both ends
    for i in range(link_range[0], link_range[1] + 1):
        try:
            r = requests.get(link_list[i], timeout=20)
            print(threadName, r.status_code, link_list[i])
        except Exception as e:
            print(threadName, 'Error: ', e)

thread_list = []
# 200 links have indices 0..199, 50 per thread
link_range_list = [(0, 49), (50, 99), (100, 149), (150, 199)]
# Create the new threads
for i in range(1, 5):
    thread = myThread("Thread-" + str(i), link_range_list[i-1])
    thread.start()
    thread_list.append(thread)
# Wait for all threads to finish
for thread in thread_list:
    thread.join()
end = time.time()
print("Total time for the simple multithreaded crawler:", end-start)
print("Exiting Main Thread")

One drawback of this approach is that when thread A finishes while thread B is still working, thread A sits idle and its capacity is wasted.
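
The grouping in this approach is a fixed list of index ranges. With inclusive ranges, 200 links occupy indices 0 to 199, so the last range has to end at 199. A small helper can compute the ranges for any number of links and threads, which avoids off-by-one mistakes. The name split_ranges is hypothetical and not part of the original code.

```python
def split_ranges(total, parts):
    """Return inclusive (start, end) index pairs that cover 0..total-1.

    The first (total % parts) ranges get one extra item, so sizes differ
    by at most one. Empty ranges are skipped when parts > total.
    """
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_ranges(200, 4))
# [(0, 49), (50, 99), (100, 149), (150, 199)]
print(split_ranges(10, 3))
# [(0, 3), (4, 6), (7, 9)]
```

Even with perfectly even ranges the drawback described above remains: the split is by count, not by how long each site takes to respond, so a thread that draws fast sites still finishes early and waits.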

The third puts the 200 websites into a queue, and four threads each take tasks from the queue.

import threading
import requests
import time
import queue as Queue

# Build the list of links
link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

class myThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q

    def run(self):
        print("Starting " + self.name)
        while True:
            try:
                crawler(self.name, self.q)
            except Queue.Empty:
                # No URL arrived within the timeout: the queue is drained
                break
        print("Exiting ", self.name)

def crawler(threadName, q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), threadName, r.status_code, url)
    except Exception as e:
        print(q.qsize(), threadName, url, 'Error', e)

threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4"]
workQueue = Queue.Queue(200)
threads = []
# Create the new threads
for tName in threadList:
    thread = myThread(tName, workQueue)
    thread.start()
    threads.append(thread)
# Fill the queue
for url in link_list:
    workQueue.put(url)
# Wait for all threads to finish
for t in threads:
    t.join()
end = time.time()
print("Total time for the Queue multithreaded crawler:", end-start)
print("Exiting Main Thread")

This way no thread finishes early and then sits idle. The results also show that the third approach is the fastest.
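
Why the queue helps can be checked with a small calculation instead of a real crawl. The sketch below takes a list of made-up response times and computes the total finishing time under a fixed split and under a shared queue. The functions static_makespan and queue_makespan and the sample durations are all hypothetical, and the model ignores thread-switching overhead.

```python
import heapq


def static_makespan(durations, workers):
    """Fixed split into contiguous chunks: the run ends when the slowest
    chunk ends, while the other workers sit idle."""
    size = -(-len(durations) // workers)  # ceiling division
    chunks = [durations[i:i + size] for i in range(0, len(durations), size)]
    return max(sum(chunk) for chunk in chunks)


def queue_makespan(durations, workers):
    """Shared queue: whichever worker becomes free first takes the next task."""
    free_at = [0] * workers  # min-heap of the times each worker becomes idle
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


# Four fast sites followed by four slow ones, crawled by two workers
durations = [1, 1, 1, 1, 5, 5, 5, 5]
print(static_makespan(durations, 2))  # 20: one worker gets all the slow sites
print(queue_makespan(durations, 2))   # 12: the slow sites are shared out
```

With the fixed split, the first worker is done after 4 time units and idles for the remaining 16. With the queue, both workers stay busy until the end.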

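III. Multiprocess Crawlers

Because of the GIL, Python threads effectively use only one core at a time and execute asynchronously in a concurrent fashion. A multithreaded crawler therefore cannot take full advantage of a multi-core CPU. Multiple processes, on the other hand, can use multiple cores, and the number of processes is usually chosen according to the number of CPU cores on the machine. Since they run on different cores, the processes execute in parallel. There are two ways to use the multiprocessing library: Process + Queue, and Pool + Queue.

1. A multiprocess crawler using multiprocessing

When there are more processes than CPU cores, the extra processes have to wait until a core becomes available, so they add no further parallelism. We therefore need to know how many cores our own machine has.

from multiprocessing import cpu_count
print(cpu_count())

The result is 4, which means this machine has 4 CPU cores. Next we start three processes and send requests to the 200 pages.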

from multiprocessing import Process, Queue
from queue import Empty
import time
import requests

link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

class MyProcess(Process):
    def __init__(self, q):
        Process.__init__(self)
        self.q = q

    def run(self):
        print("Starting ", self.pid)
        while True:
            try:
                crawler(self.q)
            except Empty:
                # No URL arrived within the timeout: the queue is drained
                break
        print("Exiting ", self.pid)

def crawler(q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        # qsize() is approximate and is not implemented on macOS
        print(q.qsize(), r.status_code, url)
    except Exception as e:
        print(q.qsize(), url, 'Error', e)

if __name__ == '__main__':
    workQueue = Queue(1000)
    # Fill the queue
    for url in link_list:
        workQueue.put(url)
    processes = []
    for i in range(0, 3):
        p = MyProcess(workQueue)
        p.daemon = True
        p.start()
        processes.append(p)
    # Wait for every process, not just the last one started
    for p in processes:
        p.join()
    end = time.time()
    print('Total time for the Process + Queue multiprocess crawler:', end-start)
    print("Main process Ended!")

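2. A multiprocess crawler using Pool + Queue

The second method uses Pool. A Pool is a process pool that provides a specified number of processes for the caller to use. When a new request is submitted to the Pool and the pool is not yet full, a new process is created to execute it. If the number of processes in the pool has already reached the configured maximum, the request waits until a process in the pool becomes free. Before going further, here are the concepts of blocking and non-blocking, which describe the state of the program while it waits for the result of a call:

Blocking: the caller waits for the result to come back. Until there is a result, the current process is suspended.

Non-blocking: after a task is added, the caller does not have to wait for its result before adding other tasks to run.

We use Pool's non-blocking method together with a Queue to fetch page data.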

from multiprocessing import Pool, Manager
from queue import Empty
import time
import requests

link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        try:
            url = q.get(timeout=2)
        except Empty:
            # Another process took the last URL first
            break
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error: ', e)

if __name__ == '__main__':
    manager = Manager()
    workQueue = manager.Queue(1000)
    # Fill the queue
    for url in link_list:
        workQueue.put(url)
    pool = Pool(processes=3)
    for i in range(4):
        pool.apply_async(crawler, args=(workQueue, i))
    print("Started process")
    pool.close()
    pool.join()
    end = time.time()
    print("Total time for the Pool + Queue multiprocess crawler:", end-start)
    print('Main process Ended!')

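For the blocking version, only one line of code needs to change: replace apply_async with apply.

pool.apply(crawler, args=(workQueue, i))

You can see that process 0 runs first, and process 1 runs only after it has finished.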
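
The difference between apply and apply_async can be observed in a short sketch. To keep it runnable anywhere, it swaps in multiprocessing.pool.ThreadPool, which offers the same apply and apply_async methods as Pool but uses threads instead of processes. The work function and its delay are hypothetical stand-ins for the crawler.

```python
import time
from multiprocessing.pool import ThreadPool


def work(index, delay=0.2):
    """Stand-in for one crawler task."""
    time.sleep(delay)
    return index


with ThreadPool(processes=3) as pool:
    # Blocking: each apply() returns only after its task has finished,
    # so the three tasks run one after another
    start = time.perf_counter()
    blocking = [pool.apply(work, args=(i,)) for i in range(3)]
    blocking_time = time.perf_counter() - start

    # Non-blocking: apply_async() returns at once with a result handle,
    # so all three tasks are in flight before any result is collected
    start = time.perf_counter()
    pending = [pool.apply_async(work, args=(i,)) for i in range(3)]
    non_blocking = [p.get() for p in pending]
    non_blocking_time = time.perf_counter() - start

print(blocking, f"{blocking_time:.2f}s")
print(non_blocking, f"{non_blocking_time:.2f}s")
```

The results are identical, but the blocking run takes about three times as long because the waits are not overlapped. With Pool the same pattern applies, with each task running in a separate process.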

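IV. Coroutine-based Crawlers

A coroutine is a lightweight thread that lives in user space.

Advantages:

A coroutine is like a process simulated at the program level rather than the system level. Because everything runs in a single thread, there is no thread context switching, so system overhead is low and execution is fast.
Coroutines make it easy to switch control flow, which simplifies the programming model.
Coroutines are highly scalable and highly concurrent. A single CPU can support tens of thousands of coroutines without trouble.

Disadvantages:

A coroutine is essentially single-threaded, so it cannot use multiple CPU cores at the same time. It has to be combined with processes to do so.
Do not use coroutines for IO operations that block for a long time, because they may block the entire program.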
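
Both the single-thread nature and the high concurrency of coroutines can be seen in a short sketch. It uses the standard library's asyncio rather than gevent, purely so that it runs without extra packages. The fake_fetch coroutine is a hypothetical stand-in for a page request.

```python
import asyncio
import threading
import time


async def fake_fetch(index, thread_ids):
    """Stand-in for one request; records which thread it ran on."""
    thread_ids.add(threading.get_ident())
    # Awaiting hands control back to the event loop, like a non-blocking IO wait
    await asyncio.sleep(0.1)
    return index


async def main(count):
    thread_ids = set()
    results = await asyncio.gather(
        *(fake_fetch(i, thread_ids) for i in range(count))
    )
    return results, thread_ids


start = time.perf_counter()
results, thread_ids = asyncio.run(main(1000))
elapsed = time.perf_counter() - start
print(len(results), "coroutines on", len(thread_ids), "thread in", round(elapsed, 2), "s")
```

All 1000 coroutines share one thread, and their waits overlap, so the run takes far less than 1000 x 0.1 seconds. Replacing asyncio.sleep with the blocking time.sleep would stall the whole loop on every call, which is the second drawback listed above.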

from gevent import monkey  # Mark the operations below that may perform IO
monkey.patch_all()  # Patch before requests is imported so its IO becomes cooperative
# Turn IO into functions that execute asynchronously
import gevent
from gevent.queue import Queue, Empty
import time
import requests

link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()

def crawler(index):
    Process_id = 'Process-' + str(index)
    while not workQueue.empty():
        try:
            url = workQueue.get(timeout=2)
        except Empty:
            # Another coroutine took the last URL first
            break
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, workQueue.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, workQueue.qsize(), url, 'Error: ', e)

def boss():
    for url in link_list:
        workQueue.put_nowait(url)

if __name__ == '__main__':
    workQueue = Queue(1000)
    gevent.spawn(boss).join()
    jobs = []
    for i in range(10):
        jobs.append(gevent.spawn(crawler, i))
    gevent.joinall(jobs)
    end = time.time()
    print('Total time for the gevent + Queue coroutine crawler:', end-start)
    print('Main Ended!')

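from gevent import monkey  # Mark the operations below that may perform IO
monkey.patch_all()
# Turn IO into functions that execute asynchronously

These two lines are what give the crawler its concurrency. Without them, the pages would be fetched one after another.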

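Summary: this article covered multithreaded, multiprocess, and coroutine-based web crawlers. Each has its own strengths, and all of them are good tools for speeding up a web crawler.

Previous article: Crawler Learning: Data Storage

Next article: Crawler Learning: Anti-crawling Issues

Note: these study notes summarize the content of Tang Song's book 《Python网络爬虫从入门到实践》. For the full details of the book, please buy a copy.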

Category: Web crawlers
Published: 2024-07-23 03:35:02
Link: https://www.kaopuke.com/article/k-p-k_13_u_7_o_18_f0_14_z_22_4.html
