Overview
Life is short, so I'm learning Python!
I've been thinking about looking at new opportunities lately, and a lot of the job descriptions I've seen ask for some Python and shell scripting, so I've been studying them in my spare time. I'm still just getting started, a total rookie, but I can already write a couple of small crawlers, hehe.
Let me recommend the site I used to teach myself: Liao Xuefeng's tutorial at https://www.liaoxuefeng.com/wiki/1016959663602400. It explains things very simply, and good things are meant to be shared. My first language is Java, and after learning this much Python I honestly believe the saying "Life is short, I use Python!" is spot on.
Most programmers are lazy, and Python will make you even lazier: so much is already packaged up that you can often just import a library and use it directly. So easy! This post shares a small image scraper I wrote myself. The code is rough and the naming differs a lot from Java conventions, so please go easy on me.
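The whole script below boils down to one pattern: fetch a page with requests, parse it with BeautifulSoup, match tags by class, and read attributes off them. Here is a minimal, self-contained sketch of that pattern; the HTML snippet is made up to mirror the site's list-page markup (the real pages may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for one entry of the site's list page (invented for illustration).
html = '''
<div class="list">
  <a class="item-img" href="/a/123.html"><img alt="sample-title" src="/img/1.jpg"></a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# Match every <a class="item-img">, then read the child <img>'s alt text
# and the link's href, just like the scraper does.
for link in soup.find_all("a", class_="item-img"):
    print(link.img.get("alt"), link.get("href"))
# prints: sample-title /a/123.html
```

In the real script the parsed document comes from `requests.get(...).text` instead of a literal string, but the BeautifulSoup calls are the same.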
# -*- coding: UTF-8 -*-
import os
import random
import time
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup

"""
Demo that scrapes images from http://www.shuaia.net/
"""

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
}
params = {"tagname": "美女"}


def get_pageurl(j, target_urls):
    """Collect "title=detail-page-url" strings from list page j."""
    url = "http://www.shuaia.net/e/tags/index.php?page=%d&line=25&tempid=3" % j
    response = requests.get(url=url, headers=headers, params=params)
    if response.status_code != 200:
        return None
    print(response.url)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    for item in soup.find_all(class_='item-img'):
        target_urls.append(item.img.get('alt') + '=' + item.get('href'))
    return target_urls


if __name__ == '__main__':
    j = 0  # page counter; must live outside the loop, or every pass re-fetches page 0
    while True:
        target_urls = get_pageurl(j, [])
        if target_urls is None:
            break  # no more list pages
        print(target_urls)
        j = j + 1
        for item in target_urls:
            file_name, file_url = item.split("=", 1)
            print(file_name)
            img_file = file_name + ".jpg"
            os.makedirs(file_name, exist_ok=True)  # one directory per album
            print("Downloading ->>> " + file_name)
            response_img = requests.get(file_url, headers=headers)
            response_img.encoding = 'utf-8'
            img_soup = BeautifulSoup(response_img.text, 'lxml')
            content_div = img_soup.find('div', class_='wr-single-content-list')
            img_url = 'http://www.shuaia.net' + content_div.img.get('src')
            urlretrieve(url=img_url, filename=file_name + '/' + img_file)
            print(img_url)
            time.sleep(random.randint(0, 5))  # be polite between requests
            # The remaining pages of an album reuse the detail URL with a
            # "_2.html", "_3.html", ... suffix in place of ".html".
            base_url = file_url[:-len(".html")]
            i = 1
            while True:
                crl_file_url = base_url + '_' + str(i + 1) + '.html'
                crl_response = requests.get(crl_file_url, headers=headers)
                if crl_response.status_code != 200:
                    break  # no more pages in this album
                crl_response.encoding = 'utf-8'
                crl_soup = BeautifulSoup(crl_response.text, 'lxml')
                crl_div = crl_soup.find('div', class_='wr-single-content-list')
                crl_img_url = 'http://www.shuaia.net' + crl_div.img.get('src')
                urlretrieve(url=crl_img_url, filename=file_name + '/' + file_name + str(i + 1) + ".jpg")
                i = i + 1
                time.sleep(random.randint(0, 5))
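One weak spot in the script is `urlretrieve`: it can't send the `User-Agent` header and it raises on any HTTP error, which kills the whole run. A more robust alternative is to stream the image through requests. This is only a sketch under my own assumptions; the helper name `download_image` and its signature are mine, not part of the original script:

```python
import os

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder UA, swap in a real one


def download_image(img_url, dest_path, timeout=10):
    """Stream an image to dest_path; return True on success, False on any failure."""
    try:
        resp = requests.get(img_url, headers=HEADERS, stream=True, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # DNS failure, timeout, 404, ... : skip instead of crashing
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with open(dest_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return True
```

Inside the loops you would then replace each `urlretrieve(...)` call with `download_image(crl_img_url, file_name + '/' + img_file)` and simply continue when it returns False.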
Finally
That's the whole of "Life is short, I use Python: scraping images", collected and organized by 文艺翅膀. I hope it helps you with the development problems the topic covers.