FP8 Bit Format Explained
Author: XD / Published: May 6, 2025, 02:15 / Research & Learning
As AI models keep growing, the challenge is no longer just compute: bandwidth, energy consumption, and deployment have become bottlenecks as well. This is why more efficient numerical representations have become a key breakthrough, and the one drawing the most attention is the FP8 (8-bit floating point) format.
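To make the format concrete, here is a minimal Python sketch that decodes one FP8 value in the common E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7, no infinities). The decode_fp8_e4m3 helper and the conventions it assumes are illustrative only, not the API of any particular library.

```python
def decode_fp8_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value (1 sign, 4 exponent, 3 mantissa bits).

    Hypothetical helper for illustration; assumes the common E4M3
    convention with exponent bias 7 and no infinity encoding.
    """
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp = (byte >> 3) & 0xF   # 4 exponent bits
    mant = byte & 0x7         # 3 mantissa bits

    if exp == 0:              # subnormal: no implicit leading 1
        return sign * (mant / 8.0) * 2.0 ** (1 - 7)
    if exp == 0xF and mant == 0x7:
        return float("nan")   # E4M3 reserves only this pattern for NaN
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

# Example: 0x44 -> sign 0, exponent 8, mantissa 4 -> 1.5 * 2^(8-7) = 3.0
print(decode_fp8_e4m3(0x44))
```

The other widely used FP8 variant, E5M2, trades one mantissa bit for an extra exponent bit, giving a wider dynamic range at the cost of precision.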
Understanding FP32 and FP64: Single and Double Precision Floating Point
Understanding BF16: Brain Floating Point Format
Understanding FP16: Half-Precision Floating Point
Lucid Plugin in ChatGPT for Creating Diagrams
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Paper: https://arxiv.org/abs/2211.10438
Code: https://github.com/mit-han-lab/smoothquant
Organization: MIT
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Paper: https://arxiv.org/abs/2306.00978
Code: https://github.com/mit-han-lab/llm-awq/
Organization: MIT
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Paper: https://arxiv.org/abs/2307.09782
Code: https://github.com/microsoft/DeepSpeed
Organization: Microsoft
QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models
Paper: https://arxiv.org/abs/2310.09259
Code: https://github.com/IST-DASLab/QUIK
Organization: ETH Zurich
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Paper: https://arxiv.org/abs/2306.03078
Code: https://github.com/Vahe1994/SpQR
Organization: University of Washington
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of Large Language Models
Paper: https://arxiv.org/abs/2309.05516
Code: https://github.com/intel/neural-compressor
Organization: Intel
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
Paper: https://arxiv.org/abs/2309.02784
Code: None
Organization: Meituan
Review: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Level: Average, Not Recommended
QWEN7B to LLAMA GPTQ Model Structure
QWEN7B to LLAMA7B Model Structure
GGML Q4_0 Quantization Analysis in llama.cpp