llama.cpp: Efficient 6-bit Data Packing in an 8-bit Array| 东毅居士

llama.cpp: Efficient 6-bit Data Packing in an 8-bit Array

作者：XD / 发表： 2024年1月25日 00:39 / 更新： 2024年1月25日 00:42 / 编程笔记 / 阅读量：1527

This code snippet, adapted from llama.cpp by ggerganov, demonstrates a method for efficiently packing 6-bit values into an 8-bit uint8 array. It involves scaling, clamping, and bitwise manipulation to optimize or compress data, suitable for specific processing or hardware requirements.

// Initialize inverse scale factor with a fixed scaling offset and the maximum scale value.
float iscale = -32.f/max_scale;
// QK_K = 256. Iterate over a subset of the scales array, determined by QK_K divided by 16.
for (int j = 0; j < QK_K/16; ++j) {
    // Scale and round the j-th element of the scales array to the nearest integer.
    int8_t l = nearest_int(iscale * scales[j]);

    // Clamp the value of l to the range [-32, 31] and normalize it to [0, 63].
    l = MAX(-32, MIN(31, l)) + 32;

    // Store the 0-7th scale lower 4 bits of l in y[i].scales if in the first half of the loop.
    if (j < 8) {
        y[i].scales[j] = l & 0xF;
    } 
    // In the second half, store the 8-15th scale lower 4 bits of l into the higher 4 bits of y[i].scales at j-8.
    else {
        y[i].scales[j-8] |= ((l & 0xF) << 4);
    }

    // Shift the higher 4 bits of l to the lower positions.
    l >>= 4;

    // Calculate the index for storing the lower 2 bits(previous l 2 higher bits) of the shifted l and store them in y[i].scales.
    // The specific position in the array is determined by a combination of modulo and division operations.
    y[i].scales[j % 4 + 8] |= (l << (2 * (j / 4)));
}

The key aspects of this code include:

Scaling and Normalization: Adjusts the data values to a suitable range for bit manipulation.
Bitwise Operations: Utilizes masking (&), shifting (<<, >>), and bitwise OR (|=) to pack data efficiently.
Data Optimization: The method packs data into a smaller space, allowing for efficient use of memory and potentially faster processing.

This approach is particularly useful in scenarios where memory optimization is crucial, such as in embedded systems or when dealing with large datasets.

本文作者：XD 转载请标明出处：http://www.eadst.com/blog/231

本站采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。

上一篇
Setting Up v2rayNG with Tencent Cloud Silicon Valley Lighthouse

下一篇
llama.cpp: Definations of Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, and Q8_K Structures

原 llama.cpp: Efficient 6-bit Data Packing in an 8-bit Array

作者：XD / 发表： 2024年1月25日 00:39 / 更新： 2024年1月25日 00:42 / 编程笔记 / 阅读量：1527

llama.cpp: Efficient 6-bit Data Packing in an 8-bit Array