Understanding FP32 and FP64: Single and Double Precision Floating Point
Author: XD / Published: August 5, 2024, 05:44 / Research & Study / Reads: 576
Understanding BF16: Brain Floating Point Format
Understanding FP16: Half-Precision Floating Point
Using the Lucid Plugin from ChatGPT to Create Diagrams
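As a quick companion to the floating-point posts listed above, here is a minimal NumPy sketch (mine, not from those posts) that compares the formats. The bit layouts are sign/exponent/mantissa: FP64 is 1/11/52, FP32 is 1/8/23, FP16 is 1/5/10, and BF16 is 1/8/7, which is why BF16 keeps FP32's dynamic range while giving up mantissa precision.

```python
import numpy as np

# Bit layouts (sign / exponent / mantissa):
#   FP64: 1 / 11 / 52   FP32: 1 / 8 / 23   FP16: 1 / 5 / 10   BF16: 1 / 8 / 7
formats = {
    "FP64": np.float64,
    "FP32": np.float32,
    "FP16": np.float16,
}

x = 1.0 / 3.0  # not exactly representable in any binary format

for name, dtype in formats.items():
    v = dtype(x)
    info = np.finfo(dtype)
    print(f"{name}: value={v!r:>22}  machine eps={info.eps:.3e}  max={info.max:.3e}")

# NumPy has no native bfloat16; BF16 reuses FP32's 8-bit exponent (same range)
# but keeps only 7 mantissa bits, so its machine epsilon is 2**-7.
print(f"BF16: machine eps={2**-7:.3e} (same exponent range as FP32)")
```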
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Paper: https://arxiv.org/abs/2211.10438
Code: https://github.com/mit-han-lab/smoothquant
Organization: MIT
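A rough sketch of SmoothQuant's core idea, with variable names of my own choosing: activation outliers are migrated into the weights by a per-input-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α), so that X·W = (X/diag(s))·(diag(s)·W) and both factors become easier to quantize.

```python
import torch

def smooth_scales(act_absmax, weight, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing factors following the SmoothQuant formula
    s_j = max|X_j|^alpha / max|W_j|^(1-alpha). Names here are illustrative."""
    w_absmax = weight.abs().amax(dim=0)   # weight is [out_features, in_features]
    return act_absmax.clamp(min=eps).pow(alpha) / w_absmax.clamp(min=eps).pow(1 - alpha)

# Toy example: fold the scales into a linear layer offline.
torch.manual_seed(0)
W = torch.randn(8, 16)                  # [out_features, in_features]
act_absmax = torch.rand(16) * 20 + 1    # calibration statistic: max|X_j| per channel
s = smooth_scales(act_absmax, W)

W_smoothed = W * s                      # W · diag(s), applied per input channel
# At runtime the activation is divided by s (usually fused into the preceding
# LayerNorm), so X @ W.T == (X / s) @ W_smoothed.T up to floating-point error.
X = torch.randn(4, 16)
assert torch.allclose(X @ W.T, (X / s) @ W_smoothed.T, atol=1e-4)
```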
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Paper: https://arxiv.org/abs/2306.00978
Code: https://github.com/mit-han-lab/llm-awq/
Organization: MIT
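AWQ's key observation, sketched below with made-up names rather than the repo's API, is that a small fraction of weight channels matter much more because they meet large activations; scaling those channels up before quantization (and compensating in the activations) protects them from rounding error. A crude grid search over the scaling exponent might look like this:

```python
import torch

def pseudo_quantize(w, n_bits=4):
    """Simple per-output-channel symmetric round-to-nearest, for illustration only."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

def awq_style_search(W, X, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick per-input-channel scales s = mean|X_j|^alpha that minimize the output
    error of the quantized layer. This mirrors AWQ's search in spirit only; the
    real implementation differs in many details."""
    act_mag = X.abs().mean(dim=0)           # per-input-channel activation magnitude
    ref = X @ W.T
    best_err, best_s = float("inf"), None
    for a in alphas:
        s = act_mag.pow(a).clamp(min=1e-4)
        Wq = pseudo_quantize(W * s)         # scaled-up salient channels survive rounding better
        err = ((X / s) @ Wq.T - ref).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err

torch.manual_seed(0)
W, X = torch.randn(32, 64), torch.randn(128, 64)
s, err = awq_style_search(W, X)
print(f"best MSE after 4-bit weight quantization: {err:.4e}")
```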
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Paper: https://arxiv.org/abs/2307.09782
Code: https://github.com/microsoft/DeepSpeed
Organization: Microsoft
QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models
Paper: https://arxiv.org/abs/2310.09259
Code: https://github.com/IST-DASLab/QUIK
Organization: ETH Zurich
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Paper: https://arxiv.org/abs/2306.03078
Code: https://github.com/Vahe1994/SpQR
Organization: University of Washington
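The gist of SpQR, sketched here in a simplified form (not the paper's exact algorithm): the few largest-magnitude outlier weights are kept in high precision as a sparse matrix, while the remaining dense bulk is quantized to a low bit-width, so the outliers no longer blow up the quantization scales.

```python
import torch

def spqr_style_split(W, n_bits=3, outlier_ratio=0.01):
    """Keep roughly the top 1% of weights by magnitude in full precision (sparse),
    round-to-nearest quantize the rest per output channel. Illustrative only."""
    k = max(1, int(W.numel() * outlier_ratio))
    thresh = W.abs().flatten().topk(k).values.min()
    outlier_mask = W.abs() >= thresh

    dense = W.masked_fill(outlier_mask, 0.0)
    qmax = 2 ** (n_bits - 1) - 1
    scale = dense.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    dense_q = (dense / scale).round().clamp(-qmax, qmax) * scale

    sparse_outliers = (W * outlier_mask).to_sparse()   # stored in high precision
    return dense_q, sparse_outliers

torch.manual_seed(0)
W = torch.randn(64, 64)
W[3, 5] = 40.0                                         # an outlier that would wreck the scale
dense_q, outliers = spqr_style_split(W)
W_rec = dense_q + outliers.to_dense()
print("reconstruction MSE:", (W_rec - W).pow(2).mean().item())
```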
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of Large Language Models
Paper: https://arxiv.org/abs/2309.05516
Code: https://github.com/intel/neural-compressor
Organization: Intel
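The title already states the mechanism: instead of always rounding to nearest, learn whether each weight rounds up or down. Below is a toy take on that idea (my own simplification, not Intel's neural-compressor code): a per-weight rounding offset V, bounded to [-0.5, 0.5], is updated with the sign of its gradient so each step moves it by a fixed amount toward lower layer-output error.

```python
import torch

def ste_round(x):
    """Round with a straight-through gradient so the rounding offset stays learnable."""
    return (x.round() - x).detach() + x

def signround_style(W, X, n_bits=4, steps=200, lr=5e-3):
    """Optimize the rounding of W via signed gradient descent on a small
    calibration batch X. Illustrative sketch only."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    ref = X @ W.T
    V = torch.zeros_like(W, requires_grad=True)
    for _ in range(steps):
        Wq = torch.clamp(ste_round(W / scale + V), -qmax, qmax) * scale
        loss = ((X @ Wq.T) - ref).pow(2).mean()
        loss.backward()
        with torch.no_grad():
            V -= lr * V.grad.sign()          # signed gradient step
            V.clamp_(-0.5, 0.5)
            V.grad.zero_()
    return torch.clamp(torch.round(W / scale + V.detach()), -qmax, qmax) * scale

torch.manual_seed(0)
W, X = torch.randn(16, 32), torch.randn(64, 32)
W_opt = signround_style(W, X)
print("MSE vs full precision:", ((X @ W_opt.T) - X @ W.T).pow(2).mean().item())
```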
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
Paper: https://arxiv.org/abs/2309.02784
Code: None
Organization: Meituan
Review: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Level: Average, Not Recommended
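For context on what H2O proposes (as I read the paper, and with names of my own): KV-cache entries are evicted for all but a small set of "heavy hitter" tokens, those with the largest accumulated attention scores, plus the most recent tokens. A toy version of the eviction policy:

```python
import torch

def h2o_keep_indices(attn_scores, n_heavy=4, n_recent=4):
    """attn_scores: [num_queries, num_keys] attention weights for one head.
    Return the key positions to keep: heavy hitters (largest accumulated
    attention) plus the most recent positions. Illustrative only."""
    num_keys = attn_scores.shape[1]
    acc = attn_scores.sum(dim=0)                       # accumulated score per key position
    recent = torch.arange(max(0, num_keys - n_recent), num_keys)
    acc[recent] = float("-inf")                        # recents are kept unconditionally
    heavy = acc.topk(min(n_heavy, num_keys)).indices
    return torch.unique(torch.cat([heavy, recent]))

torch.manual_seed(0)
scores = torch.softmax(torch.randn(16, 16), dim=-1)   # toy attention weights
print("kept KV positions:", h2o_keep_indices(scores).tolist())
```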
QWEN7B to LLAMA GPTQ Model Structure
QWEN7B to LLAMA7B Model Structure
GGML Q4_0 Quantize Analysis in llama.cpp
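Related to the Q4_0 post: a small Python re-implementation of the Q4_0 block format as I read it from llama.cpp's reference code (32 weights per block, one FP16 scale, 4-bit codes packed two per byte, packing omitted here); the rounding details may not match the current code exactly.

```python
import numpy as np

QK4_0 = 32  # block size used by GGML's Q4_0 format

def quantize_q4_0_block(x):
    """Quantize one block of 32 floats roughly the way llama.cpp's Q4_0 does:
    d = signed max-magnitude element / -8, q = clamp(round(x/d) + 8, 0, 15).
    Treat this as a sketch, not a drop-in replacement."""
    assert x.shape == (QK4_0,)
    max_val = x[np.argmax(np.abs(x))]          # signed value with the largest magnitude
    d = max_val / -8.0 if max_val != 0 else 0.0
    inv_d = 1.0 / d if d != 0 else 0.0
    q = np.clip(np.round(x * inv_d) + 8, 0, 15).astype(np.uint8)  # 4-bit codes, offset by 8
    return np.float16(d), q                    # the scale is stored as FP16 in the block

def dequantize_q4_0_block(d, q):
    return (q.astype(np.float32) - 8.0) * np.float32(d)

x = np.random.randn(QK4_0).astype(np.float32)
d, q = quantize_q4_0_block(x)
err = np.abs(dequantize_q4_0_block(d, q) - x).max()
print(f"scale={float(d):.4f}  max abs error={err:.4f}")
```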
US License Plates Datasets