EADST

Extract Webpage Information with Python

Here is a Python program that extracts information from a webpage with BeautifulSoup and saves the data to a CSV file.

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

# Saved copy of the roster page; urlopen also accepts file:// URLs.
url = 'file:///Users/xd/Desktop/ieee/Region_5_Student_Branch_Counselors_and_Chairs.htm'
save_file = 'ieee_info_1'
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")

# University headings and roster blocks; zip() below pairs them by position.
universities = soup.find_all('div', class_='spoName bullet pad-t15')
people = soup.find_all('div', class_='roster-results')

# Column order of each CSV row (no header line is written to the file).
list_name = ['university', 'name', 'title', 'address', 'email']

for u, p in zip(universities, people):
    info = p.find_all('p')
    university = u.get_text()
    name = info[0].get_text()
    # Skip entries where the position is not filled.
    if name == 'Position Vacant':
        continue
    title = info[2].get_text()
    address = info[3].get_text() + ', ' + info[4].get_text()
    # Drop the first 7 characters of the last paragraph, leaving the email address.
    email = info[-1].get_text()[7:]

    # Append this row to the CSV; rerunning the script adds rows to the same file.
    content = [[university, name, title, address, email]]
    data = pd.DataFrame(columns=list_name, data=content)
    data.to_csv(f"{save_file}.csv", mode='a', index=False, header=False, encoding='utf-8')
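The script indexes the paragraphs by fixed position and appends one row per loop, so a short entry raises an IndexError and the file never gets a header. A possible variant, sketched here with only the standard library csv module instead of pandas, separates the row building from the writing. The names build_row, write_rows and COLUMNS are made up for this sketch, and the sample values in the usage note are invented, not taken from the real page.

```python
import csv

# Same column order as the script above.
COLUMNS = ['university', 'name', 'title', 'address', 'email']


def build_row(university, fields):
    """Return one row as a dict, or None for a vacant or incomplete entry.

    `fields` is the list of paragraph texts for one roster entry, in the
    same order the script above indexes them (hypothetical helper).
    """
    if not fields or fields[0].strip() == 'Position Vacant':
        return None
    # The script reads indexes 0, 2, 3, 4 and -1, so at least 5 items are needed.
    if len(fields) < 5:
        return None
    return {
        'university': university.strip(),
        'name': fields[0].strip(),
        'title': fields[2].strip(),
        'address': fields[3].strip() + ', ' + fields[4].strip(),
        # Mirrors the original [7:] slice on the last paragraph.
        'email': fields[-1][7:].strip(),
    }


def write_rows(rows, path):
    """Write all rows at once, with a header line, overwriting `path`."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Inside the loop, the paragraph texts could be collected with [tag.get_text() for tag in p.find_all('p')], passed to build_row together with u.get_text(), and every result that is not None kept in a list for a single write_rows call at the end. Unlike mode='a', this overwrites the file on each run, so rerunning the script does not duplicate rows.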