Quantization in ChatGLM.cpp

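A key feature of chatglm.cpp is that it optimizes large models through quantization, which makes efficient inference on a CPU possible.

This post looks at how chatglm.cpp quantizes a model.

Using chatglm.cpp takes two steps:

Run convert.py to quantize the model and produce a file in ggml format
Run ./build/bin/main to load the converted model and run inference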

convert.py

As of this writing (commit: 7da55260), chatglm.cpp supports several LLMs. ChatGLM is used as the example here.

class BaseConverter:
    @classmethod
    def convert(cls, f, model, tokenizer, ggml_type):
        f.write(b"ggml")  # magic
        f.write(struct.pack("ii", cls.MODEL_TYPE.value, 1))  # model type & version
        cls.dump_config(f, model.config, ggml_type)
        cls.dump_tokenizer(f, tokenizer)
        cls.dump_model(f, model, ggml_type)

The conversion has three main steps, which handle the config, the tokenizer, and the model in turn. All three write to the same file object f.
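The file header written by convert can be sketched with the standard library alone. This is a minimal illustration, not the project's reader: the helper names write_header and read_header are hypothetical, and the model type value 1 is only a placeholder.

```python
import io
import struct

MAGIC = b"ggml"


def write_header(f, model_type, version=1):
    """Write the 4-byte magic followed by two native ints (model type, version)."""
    f.write(MAGIC)
    f.write(struct.pack("ii", model_type, version))


def read_header(f):
    """Read the header back and return (model_type, version)."""
    magic = f.read(4)
    if magic != MAGIC:
        raise ValueError(f"unexpected magic: {magic!r}")
    model_type, version = struct.unpack("ii", f.read(8))
    return model_type, version


buf = io.BytesIO()
write_header(buf, model_type=1)  # 1 is a placeholder model type
buf.seek(0)
header = read_header(buf)
print(header)
```

The config, tokenizer, and tensor data follow this header in the same file, so a reader has to consume the sections in the same order in which they were written.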

Only dump_model is covered here.

Here is dump_model from ChatGLMConverter:

@staticmethod
def dump_model(f, model, ggml_type):
    assert torch.allclose(
        model.state_dict()["transformer.word_embeddings.weight"], model.state_dict()["lm_head.weight"]
    ), "unimplemented: lm_head weight must be tied to input embedding"

    weight_names = ["transformer.word_embeddings.weight"]
    for i in range(model.config.num_layers):
        weight_names += [
            f"transformer.layers.{i}.input_layernorm.weight",
            f"transformer.layers.{i}.input_layernorm.bias",
            f"transformer.layers.{i}.attention.query_key_value.weight",
            f"transformer.layers.{i}.attention.query_key_value.bias",
            f"transformer.layers.{i}.attention.dense.weight",
            f"transformer.layers.{i}.attention.dense.bias",
            f"transformer.layers.{i}.post_attention_layernorm.weight",
            f"transformer.layers.{i}.post_attention_layernorm.bias",
            f"transformer.layers.{i}.mlp.dense_h_to_4h.weight",
            f"transformer.layers.{i}.mlp.dense_h_to_4h.bias",
            f"transformer.layers.{i}.mlp.dense_4h_to_h.weight",
            f"transformer.layers.{i}.mlp.dense_4h_to_h.bias",
        ]
    weight_names += [
        "transformer.final_layernorm.weight",
        "transformer.final_layernorm.bias",
    ]
    dump_state_dict(f, weight_names, model.state_dict(), model.config.quantization_bit, ggml_type)

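Inputs:

f: the output file object
model: the loaded ChatGLM model
ggml_type: the target ggml data type (see ggml_type in ggml); the types whose names contain Q are the quantized ones

The function lists every weight name according to the ChatGLM model structure and passes the list to dump_state_dict.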

def dump_state_dict(f, weight_names, state_dict, quantization_bit, ggml_type):
    tensor_info = []
    for name in tqdm(weight_names, desc="Processing model states"):
        tensor = state_dict[name]
        if tensor.ndim == 2:
            # 2d weight: should quantize it if needed

            # step 1: de-quantize it back to float32
            if tensor.dtype == torch.int8:
                assert quantization_bit in [4, 8]
                scale = state_dict[f"{name}_scale"].float()  # channel-wise scale

                if quantization_bit == 4:
                    # convert int4 weight to int8
                    low_bits = ((tensor << 4) & 0xF0) >> 4
                    high_bits = (tensor & 0xF0) >> 4
                    tensor = torch.stack((high_bits, low_bits), dim=-1).view(tensor.shape[0], -1)
                tensor = tensor * scale[:, None]
            else:
                tensor = tensor.float()

            # step 2: quantize it into ggml format
            tensor_ggml_type = ggml_type
        else:
            # 1d weight: convert it to float32
            assert tensor.ndim == 1
            tensor = tensor.float()
            tensor_ggml_type = GGMLType.F32

        dump_tensor(f, name, tensor, tensor_ggml_type)
        tensor_info.append((name, tensor.shape, tensor_ggml_type.name))

    print(tabulate(tensor_info, headers=["name", "shape", "dtype"], tablefmt="psql"))

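In the part the author labels step 2, only 2-D tensors receive the requested ggml_type and are quantized accordingly; 1-D tensors are always stored as F32.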
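Step 1 of dump_state_dict handles checkpoints that are already stored as int4 or int8 by restoring them to float32 before ggml quantization is applied. The pure-Python sketch below mirrors the int4 bit operations on plain integers. The helper names are hypothetical, and it assumes signed 4-bit values with the high nibble first, which is what the stack order and the arithmetic right shift in the original code suggest.

```python
def to_signed4(nibble):
    """Interpret a 4-bit value (0..15) as a signed integer in -8..7."""
    return nibble - 16 if nibble >= 8 else nibble


def unpack_int4_pair(byte):
    """Split one byte into its (high, low) signed 4-bit values."""
    byte &= 0xFF
    high = to_signed4((byte >> 4) & 0x0F)
    low = to_signed4(byte & 0x0F)
    return high, low


def dequantize_row(packed_bytes, scale):
    """Unpack every byte of a row and multiply by the row's scale."""
    values = []
    for b in packed_bytes:
        high, low = unpack_int4_pair(b)
        values.extend((high * scale, low * scale))
    return values


row = dequantize_row([0xD5, 0x7F], scale=0.5)
print(row)  # [-1.5, 2.5, 3.5, -0.5]
```

Each packed byte yields two weights, so a row of N bytes becomes 2N float values, all multiplied by the same per-row scale.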

Every weight is then passed to dump_tensor.

def dump_tensor(f, name: str, tensor: torch.Tensor, ggml_type: GGMLType):
    assert tensor.dtype == torch.float32

    # tensor name
    f.write(struct.pack("i", len(name.encode())))
    f.write(name.encode())

    # tensor shape & dtype
    f.write(struct.pack("i" * (2 + tensor.ndim), tensor.ndim, *tensor.shape, ggml_type.value))

    # tensor data
    if ggml_type == GGMLType.F32:
        tensor = tensor.float()
    elif ggml_type == GGMLType.F16:
        tensor = tensor.half()
    elif ggml_type == GGMLType.Q8_0:
        tensor = quantize_q8_0(tensor)
    elif ggml_type == GGMLType.Q4_0:
        tensor = quantize_q4_0(tensor)
    elif ggml_type == GGMLType.Q4_1:
        tensor = quantize_q4_1(tensor)
    elif ggml_type == GGMLType.Q5_0:
        tensor = quantize_q5_0(tensor)
    elif ggml_type == GGMLType.Q5_1:
        tensor = quantize_q5_1(tensor)
    else:
        raise NotImplementedError(f"Cannot dump tensor of dtype {tensor.dtype}")

    # align address
    aligned_pos = (f.tell() + (GGML_MEM_ALIGN - 1)) // GGML_MEM_ALIGN * GGML_MEM_ALIGN
    f.seek(aligned_pos)
    tensor.numpy().tofile(f)

dump_tensor first writes the tensor's basic metadata to the file: its name, number of dimensions, shape, and ggml type.
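The metadata record can be illustrated with a small round trip. This is a sketch under assumptions: write_tensor_meta and read_tensor_meta are hypothetical helpers, and the type id 2 is used only as an example value for a quantized type.

```python
import io
import struct


def write_tensor_meta(f, name, shape, dtype_id):
    """Write name length, name bytes, ndim, each dimension, then the type id."""
    raw = name.encode()
    f.write(struct.pack("i", len(raw)))
    f.write(raw)
    f.write(struct.pack("i" * (2 + len(shape)), len(shape), *shape, dtype_id))


def read_tensor_meta(f):
    """Read one metadata record back in the same order it was written."""
    (name_len,) = struct.unpack("i", f.read(4))
    name = f.read(name_len).decode()
    (ndim,) = struct.unpack("i", f.read(4))
    shape = struct.unpack("i" * ndim, f.read(4 * ndim))
    (dtype_id,) = struct.unpack("i", f.read(4))
    return name, shape, dtype_id


buf = io.BytesIO()
write_tensor_meta(buf, "transformer.final_layernorm.weight", (4096,), 0)
write_tensor_meta(buf, "transformer.word_embeddings.weight", (128, 64), 2)
buf.seek(0)
first = read_tensor_meta(buf)
second = read_tensor_meta(buf)
print(first, second)
```

Because the name is length-prefixed and the shape is preceded by ndim, a reader can parse each record without knowing the model structure in advance.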

It then calls the quantization routine that matches the type and writes the resulting tensor to the file.
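Before the data is written, dump_tensor rounds the file position up to a multiple of GGML_MEM_ALIGN. The sketch below reproduces that arithmetic. The value 16 is an assumption about the constant, and align_up is a hypothetical helper name.

```python
import io

GGML_MEM_ALIGN = 16  # assumed value of the constant used by convert.py


def align_up(pos, alignment=GGML_MEM_ALIGN):
    """Round pos up to the next multiple of alignment (unchanged if already aligned)."""
    return (pos + alignment - 1) // alignment * alignment


buf = io.BytesIO()
buf.write(b"abc")                  # pretend this is the tensor metadata
buf.seek(align_up(buf.tell()))     # jump forward to the aligned offset
buf.write(b"X")                    # pretend this is the tensor data
print(align_up(17), buf.getvalue())
```

With io.BytesIO, seeking past the end and then writing fills the gap with zero bytes, so the padding between the metadata and the data needs no explicit write.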

Take ggml_type == GGMLType.Q4_0 as an example:

def quantize_q4_0(tensor: torch.Tensor) -> torch.CharTensor:
    # equivalent to ggml_quantize_q4_0 in ggml.c
    assert tensor.shape[1] % GGML_QK4_0 == 0  # the number of columns must be divisible by 32
    tensor = tensor.view(-1, GGML_QK4_0)  # split into blocks of 32
    abs_max_indices = tensor.abs().max(dim=-1, keepdim=True).indices  # index of the largest-magnitude element in each block
    max_values = torch.take_along_dim(tensor, abs_max_indices, dim=-1)  # signed value of that element
    scale = max_values / -8  # per-block scale
    tensor = (tensor / scale + 8).round().clamp(min=0, max=15).char()  # round to the nearest level in [0, 15]
    # compress two int4 weights into an int8
    tensor = tensor[:, :16] | (tensor[:, 16:] << 4)  # pack two int4 values into one int8
    # add scale into each block
    tensor = torch.cat((scale.half().view(torch.int8), tensor), dim=-1)  # prepend the fp16 scale to match the ggml block layout
    return tensor

This mirrors the quantize_row_q4_0_reference function in ggml, which my other post on quantization in ggml also covers.

The code above can be read directly.

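In short, the weights are quantized in blocks of 32 values. Within each block, the element with the largest absolute value determines the scale, and every value is quantized to an int4. The 32 int4 values are split into two halves of 16 and packed pairwise into 16 int8 values. The scale is then concatenated with these int8 values, and the result is written to the file f.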
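The per-block arithmetic can be traced without torch. The following pure-Python sketch quantizes one block of 32 values and restores it. The function names are hypothetical, and the zero-scale guard is an addition of this sketch rather than part of the code shown above.

```python
import struct

QK4_0 = 32  # block size used by Q4_0


def quantize_block_q4_0(values):
    """Quantize 32 floats into 18 bytes: a 2-byte fp16 scale plus 16 packed bytes."""
    assert len(values) == QK4_0
    max_value = max(values, key=abs)      # signed value with the largest magnitude
    scale = max_value / -8
    if scale == 0:                        # all-zero block: every level is the midpoint 8
        q = [8] * QK4_0
    else:
        q = [min(15, max(0, round(v / scale + 8))) for v in values]
    # element i goes into the low nibble, element i + 16 into the high nibble
    packed = bytes(q[i] | (q[i + 16] << 4) for i in range(16))
    return struct.pack("<e", scale) + packed


def dequantize_block_q4_0(block):
    """Invert the packing: recover 32 approximate floats from an 18-byte block."""
    (scale,) = struct.unpack("<e", block[:2])
    low = [(b & 0x0F) - 8 for b in block[2:]]
    high = [(b >> 4) - 8 for b in block[2:]]
    return [level * scale for level in low + high]


values = [(i - 16) / 4 for i in range(32)]   # -4.0, -3.75, ..., 3.75
block = quantize_block_q4_0(values)
restored = dequantize_block_q4_0(block)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(len(block), max_error)
```

Each block stores 32 weights in 18 bytes, which is 4.5 bits per weight. Because the scale is derived from the signed maximum divided by -8, the largest-magnitude element always maps exactly to level 0.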

At this point the model's metadata and data have all been saved.

chatglm.cpp

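When chatglm.cpp builds the model, the inference code does not need to care whether a loaded tensor is quantized. It only has to tag each tensor with the right type; operations on quantized tensors are left entirely to ggml.

Building the model starts with a ModelConfig. convert.py saves the original model's configuration, so the config is read back from the converted file, and it includes the weight type of this quantized checkpoint.

The base class BaseModelForCausalLM constructs a ModelContext. The ctx's dtype comes from the config's dtype and supplies the type tag to ggml.

Then, when each ChatGLM network component is built, for example Linear, its constructor receives this ctx and uses it to set the ggml_type of the weight and bias it creates.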

Each weight tensor therefore carries its ggml_type, and ggml handles it accordingly when it performs tensor operations.
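One place where the type tag matters is storage size. As a rough sketch, the table below maps a type name to (elements per block, bytes per block). The Q4_0 entry follows from quantize_q4_0 above; the Q8_0 entry assumes an fp16 scale plus 32 int8 values. BLOCK_LAYOUT and tensor_nbytes are hypothetical names, not part of chatglm.cpp or ggml.

```python
from math import prod

# type name -> (elements per block, bytes per block); assumed layouts
BLOCK_LAYOUT = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q8_0": (32, 34),  # fp16 scale + 32 int8 values (assumption)
    "Q4_0": (32, 18),  # fp16 scale + 16 packed bytes
}


def tensor_nbytes(shape, type_name):
    """Return the number of data bytes a tensor of this shape occupies for a type."""
    block_elems, block_bytes = BLOCK_LAYOUT[type_name]
    n = prod(shape)
    assert n % block_elems == 0, "element count must be a multiple of the block size"
    return n // block_elems * block_bytes


shape = (4096, 4096)  # an illustrative 2-D weight shape
sizes = {name: tensor_nbytes(shape, name) for name in BLOCK_LAYOUT}
print(sizes)
```

For this illustrative shape, Q4_0 needs 9,437,184 bytes against 67,108,864 bytes for F32, roughly a sevenfold reduction.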