Scraping Images from Puxiang by Search Term

Published: 2023-10-09 16:20:56  Author: 不上火星不改名

What it does:

  1. Searches the Puxiang site for images matching a given search term.
  2. Saves each image it finds into a user-specified folder.
  3. Downloads concurrently to speed up the process.

Purpose: the script exists to make it easy to batch-download images matching one or more search terms from the Puxiang site.

Usage notes:

  1. Before running the script, make sure the required Python libraries are available: os and concurrent.futures.ThreadPoolExecutor ship with the standard library, while requests and BeautifulSoup (from bs4) need to be installed, e.g. with pip install requests beautifulsoup4.
  2. Make sure there is enough disk space for the downloaded images.
  3. Because the script downloads with multiple threads, it can put noticeable load on the Puxiang servers; use it sparingly and avoid running it repeatedly in quick succession (see the throttling sketch after this list).
  4. The structure or content of the Puxiang site may change; if it does, the script will likely need matching adjustments.
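
If you do need repeated runs, one gentler option (not part of the original script) is to space requests out with a short pause. A minimal sketch of the idea, where the one-second delay is an assumed value to tune:

import time
import requests

REQUEST_DELAY = 1.0  # assumed pause in seconds between requests; adjust as needed

def polite_get(url, headers):
    # Fetch the URL, then sleep so consecutive calls are spaced out
    response = requests.get(url, headers=headers, timeout=30)
    time.sleep(REQUEST_DELAY)
    return response

Swapping polite_get in for the direct requests.get calls in the script would throttle page fetches and image downloads alike.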

Function overview:

  1. download_image(img_url, folder_path): downloads the image at the given URL and saves it into the specified folder.
  2. get_images_from_details(detail_url): collects all image URLs from the given detail page.
  3. fetch_and_save_images_for_term(search_term, base_folder_path): searches Puxiang for the given term, downloads every image found, and saves the results into a folder under base_folder_path.
  4. main(): the script's entry point; reads the user's input and kicks off the search-and-download process.
  5. if __name__ == "__main__":: ensures the script only runs when executed directly, not when imported, which also makes the functions reusable from other code, as sketched below.
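
Because of the __main__ guard, the functions can be imported and driven from another script instead of going through the interactive main(). A minimal sketch, assuming the full script below has been saved as puxiang_scraper.py (a hypothetical file name) and an output folder of your choosing:

from puxiang_scraper import fetch_and_save_images_for_term

# Batch-download two example terms without interactive prompts
for term in ["爆炸", "c4d"]:
    fetch_and_save_images_for_term(term, r"C:\Users\皮皮\Desktop\downloads")

The full script follows.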

import os
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://www.puxiang.com"
DETAILS_PREFIX = BASE_URL + "/galleries/"  # defined but unused in this version

# Present a regular browser User-Agent so the site serves normal pages
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}


def download_image(img_url, folder_path):
    try:
        img_data = requests.get(img_url, headers=headers).content
        # Name the local file after the last path segment of the image URL
        file_path = os.path.join(folder_path, img_url.split('/')[-1])
        with open(file_path, 'wb') as handler:
            handler.write(img_data)
        print(f"Downloaded and saved to {file_path}")
    except Exception as e:
        print(f"Error downloading {img_url}. Error: {e}")


def get_images_from_details(detail_url):
    response = requests.get(detail_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    image_tags = soup.find_all("img")

    # Keep only images hosted on the Puxiang asset CDN (lazy-loaded via data-src)
    img_urls = [img['data-src'] for img in image_tags
                if img.get('data-src', '').startswith("https://assets.puxiang.com")]
    return img_urls


def fetch_and_save_images_for_term(search_term, base_folder_path):
    # Create a folder named "普象--<search term>" if it does not exist yet
    folder_name = "普象--" + search_term
    folder_path = os.path.join(base_folder_path, folder_name)
    os.makedirs(folder_path, exist_ok=True)

    page_num = 1
    while True:
        search_url = BASE_URL + f"/search/puxiang?page={page_num}&key=" + requests.utils.quote(search_term)

        response = requests.get(search_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Collect the links to detail pages on this results page
        detail_links = [BASE_URL + a['href'] for a in soup.find_all("a", class_="work-img")]

        # No detail links means we have run past the last results page
        if not detail_links:
            break

        # Use ThreadPoolExecutor for concurrent downloading
        with ThreadPoolExecutor(max_workers=4) as executor:
            for detail_link in detail_links:
                img_urls = get_images_from_details(detail_link)
                for img_url in img_urls:
                    executor.submit(download_image, img_url, folder_path)
                print(f"Queued images from {detail_link} for search term '{search_term}' on page {page_num}")

        # Move to the next page
        page_num += 1


def main():
    # Ask for the base folder path
    base_folder_path = input("Enter the folder to save into, e.g. C:\\Users\\皮皮\\Desktop\\花瓣牛马下载: ")

    # Ask for the search terms
    search_terms = input("Enter search terms, comma-separated, e.g. 爆炸,c4d: ").split(',')

    for term in search_terms:
        fetch_and_save_images_for_term(term.strip(), base_folder_path)

    print("Done!")


if __name__ == "__main__":
    main()