Learning Python: Scraping English Articles and Audio with BeautifulSoup and XPath


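Posted by 调皮蜜蜂 on 靠谱客. This article covers scraping English articles and audio with BeautifulSoup and XPath in Python: the basics first, then the article scraper, then the audio scraper. I hope it serves as a useful reference.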

Overview

Basics

urllib

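urllib is the HTTP request library that ships with Python 3.
It consists of four modules:

urllib.request sends HTTP requests
urllib.error handles the exceptions raised while making requests
urllib.parse parses URLs
urllib.robotparser parses robots.txt files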

The request module

The request module is the most important module in urllib. It is generally used to send requests and receive responses.

urllib.request.urlopen()

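Parameter	Description
url	Required. A string giving the URL of the target site
data	Form data. Defaults to None, in which case urllib sends a GET request; when a value is supplied (as bytes), urllib sends a POST request that carries the form data
timeout	Optional. How many seconds to wait; if no response arrives within that time, an exception is raised

For HTTP and HTTPS URLs, this function returns an HTTPResponse object.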

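Method	Description
read()	Returns the response body as bytes; you usually call decode('utf-8') to convert it to str
geturl()	Returns the URL of the resource retrieved (deprecated since Python 3.9 in favor of the url attribute)
getcode()	Returns the status code (deprecated since Python 3.9 in favor of the status attribute)
getheaders()	Returns all response headers
getheader(header)	Returns the value of the given response header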
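
To make the table above concrete, here is a minimal sketch that calls urlopen() and reads the response through those methods. So that it runs without Internet access, it starts a throwaway HTTP server on the loopback address; HelloHandler, fetch, and demo are hypothetical names introduced for this illustration only.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = '<html><body>hello</body></html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url, timeout=5):
    """Open url and return (status, final_url, content_type, text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode('utf-8')  # bytes -> str
        return (response.getcode(), response.geturl(),
                response.getheader('Content-Type'), text)


def demo():
    # Port 0 lets the operating system pick a free port
    server = HTTPServer(('127.0.0.1', 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = 'http://127.0.0.1:%d/' % server.server_address[1]
        return fetch(url)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    print(demo())
```

Against a real site you would call fetch() with that site's URL; the local server only stands in for it here.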

urllib.request.Request()

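Parameter	Description
url	The URL of the target site
data	Form data to submit with a POST request; defaults to None
headers	Headers to attach to the request; defaults to {}
origin_req_host	Host name or IP address of the requester; defaults to None
unverifiable	Whether the request is unverifiable; defaults to False
method	The request method; defaults to None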

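Compared with passing a bare URL to urllib.request.urlopen(), a urllib.request.Request object wraps the request one step further, so that headers, form data, and the request method can be set before it is sent.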
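
As a rough illustration of that wrapping, the sketch below builds a Request and inspects it without sending anything. The endpoint and form fields are made up for the example.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration
url = 'http://www.example.com/login'
form = {'user': 'alice', 'lang': 'en'}

# data must be bytes, so URL-encode the dict and then encode the str
data = urllib.parse.urlencode(form).encode('utf-8')
headers = {'User-Agent': 'Mozilla/5.0'}

req = urllib.request.Request(url, data=data, headers=headers)

print(req.full_url)                  # the target URL
print(req.get_method())              # POST, because data is not None
print(req.get_header('User-agent'))  # Request stores header names capitalized

# An explicit method overrides the GET/POST default
head_req = urllib.request.Request(url, method='HEAD')
print(head_req.get_method())         # HEAD

# Sending it would be: urllib.request.urlopen(req, timeout=5)
```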

A cookie is data that a website stores on the user's local machine in order to identify the user and track the session.

# Getting cookies
import urllib.request
import http.cookiejar
cookie = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(cookie_handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)

# Using cookies
import urllib.request
import http.cookiejar
# Save the cookies to a file
cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(cookie_handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
# Load the cookies from the file and attach them to a request
cookie = http.cookiejar.MozillaCookieJar()
# load() fills the jar in place and returns None, so its result must not be assigned back
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(cookie_handler)
response = opener.open('http://www.baidu.com')
# The response now comes from a request that carried the saved cookies

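Some websites treat a burst of requests from one IP address within a short time as crawler traffic and ban that IP.
Routing requests through changing proxy IP addresses is one way around this check. A few sites that list free proxies:
西刺代理
云代理
快代理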
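
One possible way to route urllib traffic through a randomly chosen proxy is a ProxyHandler, sketched below. The addresses in PROXY_POOL are placeholders rather than working proxies, and build_proxy_opener is a hypothetical helper name.

```python
import random
import urllib.request

# Placeholder addresses: replace them with proxies you are allowed to use
PROXY_POOL = [
    'http://127.0.0.1:8080',
    'http://127.0.0.1:8081',
]


def build_proxy_opener(proxy_pool):
    """Return (opener, proxy); the opener sends HTTP(S) traffic via a random proxy."""
    proxy = random.choice(proxy_pool)
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return urllib.request.build_opener(handler), proxy


opener, proxy = build_proxy_opener(PROXY_POOL)
print('using proxy', proxy)
# With a real proxy in the pool, a request would look like:
# response = opener.open('http://www.baidu.com', timeout=5)
```

Building a fresh opener per request (or per batch of requests) changes the outgoing IP; free proxies are often unreliable, so the calls should still be wrapped in the usual exception handling.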

The parse module

Function	Description
quote	Replaces special characters with percent-escapes, turning a string into a valid URL component
urlencode	Converts a dict into a URL-encoded query string (str)
urlparse	Parses a URL and returns a ParseResult object
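
A short sketch of the three functions listed above; the sample values are arbitrary and nothing is sent over the network.

```python
from urllib.parse import quote, urlencode, urlparse

# quote: percent-escape characters that are not allowed in a URL component
print(quote('hello world'))   # hello%20world
print(quote('你好'))           # %E4%BD%A0%E5%A5%BD

# urlencode: dict -> query string (spaces become + by default)
query = urlencode({'type': 2, 'audio': 'hello world'})
print(query)                  # type=2&audio=hello+world

# urlparse: split a URL into its components
parts = urlparse('http://dict.youdao.com/dictvoice?' + query)
print(parts.scheme, parts.netloc, parts.path, parts.query)
```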

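The error module

This module is used for exception handling and contains two important classes: URLError and HTTPError.
Note that HTTPError is a subclass of URLError, so HTTPError should generally be caught first.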

import socket
import urllib.request
from urllib import error

def download(url, file_path):
    """Download url to file_path; return True on success, False otherwise."""
    try:  # handle each kind of failure with try/except
        response = urllib.request.urlopen(url, timeout=5)
        data = response.read()
        with open(file_path, 'wb') as audio:
            audio.write(data)
        print('%s downloaded' % file_path)
        return True
    except error.HTTPError as err:  # listed before URLError, its parent class
        print('HTTPError, code: %s' % err.code)
    except error.URLError as err:
        print('URLError, reason: %s' % err.reason)
    except socket.timeout:
        print('Timed out!')
    except Exception as err:
        print('Unknown error: %s' % err)
    return False
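
To see why the order of the except clauses matters, the sketch below raises both exception types by hand and reports which clause handles each. The exceptions are constructed directly purely for the demonstration (normally urlopen() raises them), and classify is a hypothetical helper.

```python
from urllib import error


def classify(exc):
    """Return which except clause handles exc when HTTPError is listed first."""
    try:
        raise exc
    except error.HTTPError as err:
        return 'HTTPError %s' % err.code
    except error.URLError as err:
        return 'URLError %s' % err.reason


# HTTPError inherits from URLError
print(issubclass(error.HTTPError, error.URLError))  # True

# Constructed by hand for the demo only
http_err = error.HTTPError('http://www.example.com/missing', 404, 'Not Found', None, None)
url_err = error.URLError('connection refused')

print(classify(http_err))  # HTTPError 404
print(classify(url_err))   # URLError connection refused
```

If the two clauses were swapped, the URLError clause would also catch every HTTPError and the status code would never be reported.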

Beautiful Soup

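Beautiful Soup is a Python library for extracting data from HTML and XML files; see its official documentation for details.
Installation: pip install beautifulsoup4 (pip install bs4 installs the same library through a wrapper package). The examples below use the lxml parser, which is installed separately with pip install lxml.

Object	Description
Tag	Put simply, one of the tags in the HTML document.
NavigableString	The text inside a tag. Once you have a tag, tag.string gives you the text it contains.
BeautifulSoup	Represents the entire content of a document. Most of the time it can be treated as a Tag object; it supports most of the methods for navigating and searching the document tree.
Comment	A special type of NavigableString that holds a comment.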
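
The four object types can be seen side by side in a small sketch like the following. The markup is a made-up fragment, and html.parser is used here only because it ships with Python.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

markup = '<p class="title"><b>The Dormouse\'s story</b><!-- a comment --></p>'
# html.parser is part of the standard library, so no extra parser is needed
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b                  # Tag
text = tag.string             # NavigableString
comment = soup.p.contents[1]  # Comment (the second child of <p>)

print(type(soup).__name__, type(tag).__name__,
      type(text).__name__, type(comment).__name__)
print(tag.name, soup.p['class'], str(text), repr(str(comment)))
```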

Navigating the tree

contents and children:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

head_tag = soup.head
# .contents returns a list of all direct children
print(head_tag.contents)

# .children returns an iterator over the direct children
for child in head_tag.children:
    print(child)

strings and stripped_strings

# If a tag contains more than one string, iterate over them with .strings
# (the whitespace-only strings in the output may differ slightly between parsers)
for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'
# The strings may include a lot of spaces and blank lines; .stripped_strings removes the extra whitespace
for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

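Searching the tree

The find and find_all methods

# Two methods do most of the work when searching the tree: find and find_all.
# find returns as soon as it reaches the first matching tag, so it returns a single element.
# find_all collects every matching tag and returns them all.
# The most common usage is to pass the name and attrs arguments to pick out the tags you want.
soup.find_all("a",attrs={"id":"link2"})
# Alternatively, pass the attribute name directly as a keyword argument:
soup.find_all("a",id='link2')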
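
The difference in return values between the two methods can be checked with a sketch like this one. It uses a trimmed copy of the sample markup and the built-in html.parser so that it runs on its own; the class_ keyword is how Beautiful Soup spells a class filter, since class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')                          # first match only
all_links = soup.find_all('a')                  # every match
by_class = soup.find_all('a', class_='sister')  # filter on the class attribute
missing = soup.find('table')                    # no match: None
none_found = soup.find_all('table')             # no match: empty list

print(first['id'], len(all_links), [a.get_text() for a in by_class])
print(missing, none_found)
```

Because find returns None when nothing matches, code that goes on to index or call methods on the result should check for None first.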

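The select method
To search with CSS selector syntax, use the select method.

# Find by tag name:
print(soup.select('a'))

# Find by class name:
# Put a . in front of the class name.
# For example, to find the tags with class=sister:
print(soup.select('.sister'))

# Find by id:
# Put a # in front of the id. For example:
print(soup.select("#link1"))

# Combined selectors:
# Tag names, class names, and ids combine exactly as they do in a CSS file.
# For example, to find the element with id link1 inside a p tag, separate the two with a space:
print(soup.select("p #link1"))

# To match direct children only, separate with >:
print(soup.select("head > title"))

# Find by attribute:

# An attribute condition can also be added, wrapped in square brackets.
# The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing matches.
print(soup.select('a[href="http://example.com/elsie"]'))

# Getting the content
# select always returns a list-like result; iterate over it and call get_text() on each element to get its text.
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())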
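
As a small follow-up to the selectors above, this sketch combines a class selector with a direct-child selector and reads both the text and an attribute from each match. The markup is a trimmed copy of the sample document, and select_one (which returns only the first match) is included for comparison.

```python
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# <a class="sister"> elements that are direct children of <p class="story">
links = [(a.get_text(), a['href']) for a in soup.select('p.story > a.sister')]
print(links)

# select_one returns the first match (or None) instead of a list
first = soup.select_one('#link1')
print(first.get_text())

title = soup.select('head > title')[0].get_text()
print(title)
```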

Article scraping code

import os
import requests
from lxml import etree
# Document parsing library
from bs4 import BeautifulSoup
# Network request library
import urllib.request

# Retry failed connections up to 5 times by mounting an adapter on a shared session
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=5))

def get_page(url):
    # Build the request headers
    headers = {
          'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
    }
    # Send the request to the target URL and receive a Response object
    # (the 10-second timeout is an added safeguard against hanging forever)
    response = session.get(url=url, headers=headers, timeout=10)
    response.encoding = "utf-8"
    # Return the page source
    return response.text

def crawl(url, i):
    html = get_page(url)
    # Build an lxml.etree._Element object.
    # The HTML parser also repairs markup: if the document is malformed,
    # missing closing tags are filled in automatically.
    html_elem = etree.HTML(html)
    # // selects descendants, * matches any element, text() selects text nodes.
    # xpath() returns a list of matches; for text() each match is a string.
    fy1 = html_elem.xpath('//*[@id="article"]//*/text()')
    # Put each text node on its own line
    text = "".join(s + '\n' for s in fy1)
    text = text.replace("'", "''")  # a bare ' would raise an error on database insert
    os.makedirs('./1', exist_ok=True)  # make sure the output directory exists
    with open('./1/' + str(i) + '.txt', 'w', encoding='utf-8') as fd:
        fd.write(text + '\n')

if __name__ == '__main__':
    i = 1
    # Walk through index pages 1 to 11
    for x in range(1, 12):
        list_url = "http://www.kekenet.com/Article/17371/List_" + str(x) + ".shtml"
        # Build the request
        req = urllib.request.Request(list_url)
        # Open the page
        webpage = urllib.request.urlopen(req)
        # Read the page content
        html = webpage.read()
        # Parse it into a document object
        soup = BeautifulSoup(html, 'html.parser')
        # Find every a tag in the document
        for k in soup.find_all('a'):
            # Read its href attribute
            link = k.get('href')
            # Keep only the article links
            if link is not None and "http://www.kekenet.com/Article/2" in link:
                crawl(link, i)
                i = i + 1
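
The XPath expression used in crawl() can be tried on a small local document. The sketch below, which assumes a made-up HTML fragment, shows one detail worth knowing: //*[@id="article"]//*/text() collects the text of elements nested inside the article node, but not text that sits directly inside that node, whereas //*[@id="article"]//text() collects both.

```python
from lxml import etree

# Made-up fragment: "Intro " is direct text of the article node
html = '<div id="article">Intro <p>First <b>bold</b> tail</p><p>Second</p></div>'
root = etree.HTML(html)

# Text nodes of elements nested inside #article (the expression used in crawl)
descendant_text = root.xpath('//*[@id="article"]//*/text()')
# Every text node inside #article, including its own direct text
all_text = root.xpath('//*[@id="article"]//text()')

print(descendant_text)  # ['First ', 'bold', ' tail', 'Second']
print(all_text)         # ['Intro ', 'First ', 'bold', ' tail', 'Second']
```

Which of the two expressions is appropriate depends on whether the target page puts article text directly inside the container element.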

Audio scraping code

from multiprocessing import Pool
import os
import socket
import urllib.request
from urllib import error, parse

class youdao():
    def __init__(self, type=2, word='hello'):
        '''
        Calls the youdao API
        type = 2: American pronunciation
        type = 1: British pronunciation

        Checks whether the two speech-library directories exist
        next to this script and creates them if they do not
        '''
        self._word = word.lower()  # lowercase

        # Root directory: take the absolute path of this file, then its directory name
        self._dirRoot = os.path.dirname(os.path.abspath(__file__))
        # Create the American and British libraries under the root if they are missing
        for name in ('Speech_US', 'Speech_EN'):
            os.makedirs(os.path.join(self._dirRoot, name), exist_ok=True)
        self.setAccent(type)

    # Set the pronunciation type
    def setAccent(self, type):
        '''
        type = 2: American pronunciation
        type = 1: British pronunciation
        '''
        self._type = type

        if 2 == self._type:
            self._dirSpeech = os.path.join(self._dirRoot, 'Speech_US')  # American library
        else:
            self._dirSpeech = os.path.join(self._dirRoot, 'Speech_EN')  # British library

    def down(self, type, word):
        '''
        Download the MP3 for a word.
        If the speech library already has the MP3, nothing is downloaded.
        Returns the path of the audio file, or None if the download failed.
        '''
        self.setAccent(type)
        if self._getWordMp3FilePath(type, word) is not None:
            print('%s.mp3 already exists, no download needed' % self._word)
            return self._filePath

        self._getURL()  # build the URL
        try:  # handle each kind of failure with try/except
            response = urllib.request.urlopen(self._url, timeout=5)
            data = response.read()
            with open(self._filePath, 'wb') as audio:
                audio.write(data)
            print('%s.mp3 downloaded' % self._word)
            return self._filePath
        except error.HTTPError as err:
            print('HTTPError, code: %s' % err.code)
        except error.URLError as err:
            print('URLError, reason: %s' % err.reason)
        except socket.timeout:
            print('Timed out!')
        except Exception as err:
            print('Unknown error: %s' % err)
        return None

    def _getURL(self):
        '''
        Private helper: build the target URL for the pronunciation
        http://dict.youdao.com/dictvoice?type=2&audio=
        '''
        # quote() escapes spaces as %20 along with any other unsafe characters
        self._url = ('http://dict.youdao.com/dictvoice?type=' + str(self._type)
                     + '&audio=' + parse.quote(self._word))

    def _getWordMp3FilePath(self, type, word):
        '''
        Get the local path of the MP3 file for a word.
        Returns the absolute path if the MP3 exists, otherwise None.
        '''
        self._word = word.lower()  # lowercase
        self._fileName = self._word + '.mp3'  # one file per word
        self._filePath = os.path.join(self._dirSpeech, self._fileName)

        # Check whether the MP3 file already exists
        if os.path.exists(self._filePath):
            return self._filePath
        return None

def audio(word):
    sp = youdao()
    sp.down(2, word)  # American
    sp.down(1, word)  # British

if __name__ == '__main__':
    # Relative path (relative to the current working directory), one word per line
    with open('./javaword.txt', encoding='utf-8') as f:
        words = [line.strip() for line in f if line.strip()]
    pool = Pool(processes=4)  # number of worker processes
    for word in words:
        pool.apply_async(audio, (word,))  # asynchronous, non-blocking
    pool.close()  # close the pool; no further tasks can be added afterwards
    pool.join()   # the main process continues only after all workers have finished
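
One thing to be aware of with apply_async: an exception raised inside a worker is only re-raised when get() is called on the returned result object, so discarding the results hides failures. The sketch below keeps the result objects and collects the outcome for each word. It swaps in multiprocessing.pool.ThreadPool, which offers the same apply_async API as Pool, so that the demo needs no separate worker processes; fake_download is a hypothetical stand-in for the real download call.

```python
from multiprocessing.pool import ThreadPool


def fake_download(word):
    """Stand-in for the real download; fails for an empty word."""
    if not word:
        raise ValueError('empty word')
    return word + '.mp3'


def run_all(words, workers=4):
    """Submit every word and return (word, result-or-error) pairs in order."""
    outcomes = []
    with ThreadPool(processes=workers) as pool:
        pending = [(w, pool.apply_async(fake_download, (w,))) for w in words]
        for word, result in pending:
            try:
                # get() blocks until the task finishes and re-raises its exception
                outcomes.append((word, result.get(timeout=30)))
            except Exception as err:
                outcomes.append((word, 'failed: %s' % err))
    return outcomes


print(run_all(['hello', '', 'world']))
```

The same pattern applies to a process Pool; the worker function then has to be defined at module level so that it can be pickled.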