GGML Q4_0 Quantization Analysis in llama.cpp
Author: XD / Published: 2023-11-13 20:39 / Updated: 2023-11-13 21:32 / Research Notes
For the LLaMA-7B model there are 387 tensors in total, consisting of various weights and biases. These tensors include token_embd.weight, 32 sets of attention and feed-forward network weights and biases (attn_norm.weight, attn_q.weight, attn_k.weight, attn_v.weight, attn_q.bias, attn_k.bias, attn_v.bias, attn_output.weight, ffn_norm.weight, ffn_up.weight, ffn_gate.weight, ffn_down.weight), output_norm.weight, and output.weight.
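As a quick sanity check, the 387 total follows directly from the layout just described; the constants below simply restate it (a minimal sketch, not llama.cpp code):

```cpp
// Tensor count for the layout described above.
constexpr int n_blocks   = 32; // transformer blocks
constexpr int per_block  = 12; // attn_norm, attn_{q,k,v}.weight, attn_{q,k,v}.bias,
                               // attn_output.weight, ffn_norm, ffn_{up,gate,down}.weight
constexpr int standalone = 3;  // token_embd.weight, output_norm.weight, output.weight
static_assert(n_blocks * per_block + standalone == 387, "total tensor count");
```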
Quantization Details:
- Total Tensors for Quantization: 226
  - token_embd.weight
  - 32 sets of: attn_q.weight, attn_k.weight, attn_v.weight, attn_output.weight, ffn_up.weight, ffn_gate.weight, ffn_down.weight
  - output.weight
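These 226 tensors are selected by a simple shape rule: only two-dimensional weight matrices are re-quantized, while one-dimensional norms and biases are copied through, and output.weight is routed to a higher-precision type. The helper below is a simplified paraphrase of that rule, not the actual llama_model_quantize_internal source; the enum and function name are illustrative.

```cpp
#include <string>

// Simplified paraphrase of the tensor-selection rule applied during quantization.
enum class DstType { KEEP_F32, Q4_0, Q6_K };

DstType choose_dst_type(const std::string & name, int n_dims) {
    if (n_dims != 2) {
        // attn_norm, ffn_norm, attn_{q,k,v}.bias, output_norm: stay f32
        return DstType::KEEP_F32;
    }
    if (name == "output.weight") {
        // kept at higher precision, as line [ 387/ 387] in the log shows
        return DstType::Q6_K;
    }
    // token_embd.weight + 32 x 7 block weights -> q4_0 (225 tensors);
    // together with output.weight this gives the 226 quantized tensors
    return DstType::Q4_0;
}
```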
Tensor Breakdown (llama_model_loader):
- type f32: 161 tensors
- type f16: 226 tensors
llama_model_quantize_internal:
- meta size: 6162784 bytes
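The loader's 161/226 split falls out of the same layout: f32 covers the per-block norms and biases plus output_norm.weight, while f16 covers the 2D weight matrices that become the quantization targets. A quick arithmetic check (sketch; the constants restate the listing above):

```cpp
// Where the loader's type counts come from.
constexpr int n_blocks      = 32;
constexpr int f32_per_block = 5; // attn_norm, ffn_norm, attn_{q,k,v}.bias
constexpr int f16_per_block = 7; // attn_{q,k,v,output}.weight, ffn_{up,gate,down}.weight
static_assert(n_blocks * f32_per_block + 1 == 161, "f32: blocks + output_norm.weight");
static_assert(n_blocks * f16_per_block + 2 == 226, "f16: blocks + token_embd.weight + output.weight");
static_assert(161 + 226 == 387, "loader counts add up to the full tensor list");
```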
Example Tensors:
[ 1/ 387] token_embd.weight - [ 4096, 151851, 1, 1], type = f16, quantizing to q4_0 .. size = 1186.34 MB -> 333.66 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 2/ 387] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 3/ 387] blk.0.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.120 0.113 0.097 0.076 0.056 0.038 0.025 0.020
[ 4/ 387] blk.0.attn_k.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.075 0.097 0.115 0.123 0.115 0.097 0.076 0.055 0.037 0.024 0.020
[ 5/ 387] blk.0.attn_v.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021
[ 6/ 387] blk.0.attn_q.bias - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 7/ 387] blk.0.attn_k.bias - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 8/ 387] blk.0.attn_v.bias - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 9/ 387] blk.0.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021
[ 10/ 387] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 11/ 387] blk.0.ffn_up.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 12/ 387] blk.0.ffn_gate.weight - [ 4096, 11008, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.110 0.116 0.110 0.096 0.077 0.057 0.040 0.026 0.021
[ 13/ 387] blk.0.ffn_down.weight - [11008, 4096, 1, 1], type = f16, quantizing to q4_0 .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
... and so on for tensors [ 14/ 387] through [ 385/ 387]: the remaining 31 blocks (blk.1.* through blk.31.*) follow the same pattern as blk.0.* above.
[ 386/ 387] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 387/ 387] output.weight - [ 4096, 151851, 1, 1], type = f16, quantizing to q6_K .. size = 1186.34 MB -> 486.58 MB | hist:
llama_model_quantize_internal: model size = 14727.19 MB
llama_model_quantize_internal: quant size = 4296.76 MB
llama_model_quantize_internal: hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
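Both the per-tensor size reductions and the 16-value histograms above follow from the Q4_0 block format: every 32 consecutive weights share one fp16 scale and are stored as packed 4-bit codes, i.e. 18 bytes per 32 weights, or 4.5 bits per weight. The sketch below mirrors ggml's block_q4_0 layout and its reference row-quantization rule; the helper name and the inline histogram accumulation are illustrative, and the fp32 -> fp16 conversion of the scale is elided.

```cpp
#include <cmath>
#include <cstdint>

#define QK4_0 32                      // weights per Q4_0 block

struct block_q4_0 {
    uint16_t d;                       // per-block scale, stored as fp16 (ggml_half)
    uint8_t  qs[QK4_0 / 2];           // 32 x 4-bit codes packed into 16 bytes
};                                    // 2 + 16 = 18 bytes per 32 weights = 4.5 bits/weight

// Reference-style quantization of one block: the value with the largest magnitude
// is scaled to -8 (code 0), and every weight is rounded into the code range [0, 15].
// hist[] counts how often each 4-bit code occurs; normalized over a whole tensor,
// these counts are the 16 numbers printed after "hist:" in the log above.
static void quantize_block_q4_0(const float * x, block_q4_0 * y, int64_t hist[16]) {
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < QK4_0; ++j) {
        if (std::fabs(x[j]) > amax) { amax = std::fabs(x[j]); max = x[j]; }
    }
    const float d  = max / -8.0f;     // scale so that the extreme value maps to -8
    const float id = d ? 1.0f / d : 0.0f;
    // y->d = fp32_to_fp16(d);        // conversion elided in this sketch
    for (int j = 0; j < QK4_0 / 2; ++j) {
        int q0 = (int)(x[j]           * id + 8.5f);   // low nibble
        int q1 = (int)(x[j + QK4_0/2] * id + 8.5f);   // high nibble
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        y->qs[j] = (uint8_t)(q0 | (q1 << 4));
        ++hist[q0]; ++hist[q1];
    }
}
```

That 18-byte block reproduces the per-tensor numbers: a 4096 x 4096 matrix takes 4096 * 4096 * 2 bytes = 32.00 MB in f16 (the log counts 1024^2-byte MB) and (4096 * 4096 / 32) * 18 bytes = 9.00 MB in q4_0, and the 4096 x 11008 FFN matrices drop from 86.00 MB to 24.19 MB. output.weight is the exception: it is quantized to q6_K (roughly 6.56 bits per weight), which accounts for its 486.58 MB size and the empty histogram on its log line.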