EADST

Python: Obtain Baidu Images Using Web Crawler

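Here is the main code. It reads the keywords from a text file, requests the Baidu image search result pages for each keyword, and saves the images it finds into a separate folder per keyword.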

# -*- coding: utf-8 -*-
import os
import re
import time
import requests


class CarCollect:

    def __init__(self, path='./name.txt'):
        self.num = 1           # index of the next image to save for the current keyword
        self.class_number = 0  # number of images wanted per keyword
        self.line_list = []    # keywords to search for
        with open(path, encoding='utf-8') as file:
            lines = [k.strip() for k in file if k.strip()]
        # first line: image count; remaining lines: keywords
        self.class_number = int(lines[0])
        self.line_list = lines[1:]

    def download_picture(self, html, keyword, save_path):
        pic_url = re.findall('"objURL":"(.*?)",', html, re.S)  # get image urls
        print('Finding keyword: ' + keyword + ' images, start downloading...')
        for each in pic_url:
            # stop as soon as enough images have been saved for this keyword
            if self.num > self.class_number:
                break
            print('=' * 60)
            print('Downloading ' + keyword + ' number ' + str(self.num) + ' image, url: ' + str(each))
            try:
                if each:
                    pic = requests.get(each, timeout=7)
                    # os.path.join picks the right separator on every platform
                    string = os.path.join(save_path, keyword + '_' + str(self.num) + '.jpg')
                    if len(pic.content) > 10000:  # keep only images larger than 10,000 bytes
                        with open(string, 'wb') as fp:
                            fp.write(pic.content)
                        self.num += 1
            except (requests.RequestException, OSError):
                print('error, cannot download')

    def __call__(self):
        headers = {
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
            'Upgrade-Insecure-Requests': '1'
        }
        session = requests.Session()
        session.headers.update(headers)

        for word in self.line_list:
            # create a folder named after the keyword and the current time
            save_path = word + '_file'
            time_now = time.strftime("%Y%m%d_%H%M%S", time.localtime())
            save_path += "_" + time_now
            os.mkdir(save_path)
            # get images, one result page (60 entries) at a time
            url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&pn='
            image_number = 0
            self.num = 1
            while image_number < self.class_number:
                try:
                    result = session.get(url + str(image_number), timeout=10, allow_redirects=False)
                    self.download_picture(result.text, word, save_path)
                except requests.RequestException:
                    print('Internet error')
                image_number += 60


if __name__ == '__main__':
    path = './keywords.txt'
    car_collect = CarCollect(path)
    car_collect()
    print('Done.')
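
The two steps that do not need a network connection, building the result page URLs and pulling the image addresses out of the page text, can be tried on their own. The following is a minimal sketch, not part of the original script: the helper names `page_urls` and `extract_image_urls` are hypothetical, the sample string is hand-written rather than a real Baidu response, and the page step of 60 is taken from the loop in the code above.

```python
import re
from urllib.parse import urlencode

BASE_URL = 'https://image.baidu.com/search/flip'
PAGE_STEP = 60  # the main code advances the 'pn' offset by 60 per request

# Same regular expression as in the main code, compiled once.
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)",', re.S)


def page_urls(word, wanted):
    """Build one search URL per result page, mirroring the while loop above."""
    urls = []
    for offset in range(0, wanted, PAGE_STEP):
        # urlencode percent-encodes the keyword, so non-ASCII keywords are safe
        query = urlencode({'tn': 'baiduimage', 'ie': 'utf-8', 'word': word, 'pn': offset})
        urls.append(BASE_URL + '?' + query)
    return urls


def extract_image_urls(html):
    """Return every objURL value found in the page text, in order."""
    return OBJ_URL_PATTERN.findall(html)


# Hand-written sample for illustration only.
sample = (
    '{"objURL":"http://example.com/a.jpg","fromURL":"http://example.com/1"},'
    '{"objURL":"http://example.com/b.jpg","fromURL":"http://example.com/2"},'
)

print(page_urls('Dog', 100))
print(extract_image_urls(sample))
```

With 100 images requested, `page_urls` yields two URLs (offsets 0 and 60), and the sample string yields the two `example.com` addresses. Whether the real result page still exposes `objURL` fields in this form depends on Baidu and may change over time.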

Here is the text file, keywords.txt. The first line is the number of images to download for each keyword. Each following line is one keyword.

20
Dog
Cat
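
To check a keywords file before starting a long crawl, the parsing step can be run separately. This is a small sketch under the format described above; `read_keywords` is a hypothetical helper name, and the temporary file only reproduces the example contents.

```python
import os
import tempfile


def read_keywords(path):
    """Return (images_per_keyword, keywords) read from a keywords file."""
    with open(path, encoding='utf-8') as f:
        # drop surrounding whitespace and skip blank lines
        lines = [line.strip() for line in f if line.strip()]
    if not lines:
        raise ValueError('keywords file is empty')
    # first line: image count; remaining lines: one keyword each
    return int(lines[0]), lines[1:]


# Write the example file to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'keywords.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('20\nDog\nCat\n')
    count, keywords = read_keywords(demo_path)

print(count, keywords)  # 20 ['Dog', 'Cat']
```

If the first line is not an integer, `int()` raises a `ValueError`, which surfaces a malformed file before any request is sent.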
