LLM + TensorRT: Pitfall Notes

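These notes record my attempt to use TensorRT to accelerate LLM inference, and the pitfalls I ran into along the way.

Environment: Ubuntu 20.04, CUDA 12.2, PyTorch 2.0.1, TensorRT 8.6.1, torch_tensorrt 1.4.0, transformer 0.6.0

Hardware is limited, so I only planned to try opt-1.3b and baichuan-7B.

After experimenting, opt-1.3b can be accelerated with TensorRT without trouble, while baichuan-7B ran into problems.

This post uses opt-1.3b for the walkthrough, then describes the problems I currently see with baichuan-7B, which I am still working out how to solve.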

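So far I have tried two ways of building a TensorRT engine (the trt model):

Export the model to ONNX, then convert the ONNX file to trt
Use torch_tensorrt to convert the model to trt directly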

Generating trt via ONNX

First, convert the PyTorch model to ONNX format.

This uses the torch.onnx.export function provided by PyTorch.

import torch


def export_dynamic_onnx_model(text, model, tokenizer, dynamic_onnx_model_path):
    # Tokenize a sample prompt; it only provides example inputs for the export.
    inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=True)
    inputs = inputs.to(model.device)

    # Axis 1 (sequence length) is dynamic for both inputs; the batch axis stays fixed.
    dynamic_ax = {'input_ids': [1], 'input_mask': [1]}
    with torch.no_grad():
        torch.onnx.export(model,
                          (inputs['input_ids'], inputs['attention_mask']),
                          dynamic_onnx_model_path,
                          verbose=True,
                          opset_version=17,
                          do_constant_folding=True,
                          # The attention mask is exposed under the name 'input_mask'.
                          input_names=['input_ids', 'input_mask'],
                          output_names=['output'],
                          dynamic_axes=dynamic_ax)
        print("ONNX Model exported to {0}".format(dynamic_onnx_model_path))
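The dynamic_axes argument in the export call above uses the list form, which only says that axis 1 is dynamic. torch.onnx.export also accepts a dict form that maps each axis index to a name, which makes the dynamic dimension easier to recognise in the exported graph. A minimal sketch of building that dict; the helper name make_dynamic_axes and the axis name seq_len are my own choices for illustration, not part of the original code.

```python
from typing import Dict, List


def make_dynamic_axes(tensor_names: List[str], axis: int = 1,
                      axis_name: str = "seq_len") -> Dict[str, Dict[int, str]]:
    """Mark the same axis as dynamic, with a readable name, for every listed tensor.

    Hypothetical helper: the result is meant to be passed as dynamic_axes=...
    to torch.onnx.export.
    """
    return {name: {axis: axis_name} for name in tensor_names}


# Same two input names as in the export call; only the sequence axis is dynamic.
dynamic_ax = make_dynamic_axes(["input_ids", "input_mask"])
print(dynamic_ax)
```

The tensor names have to match input_names in the export call, because the shape flags passed to trtexec later refer to the same names.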

Then use TensorRT's trtexec to build the trt engine from the ONNX file

/path_to/trtexec \
--onnx=/path_to/model.onnx \
--saveEngine=/path_to/model.trt \
--fp16 \
--workspace=80000 \
--minShapes=input_ids:1x1,input_mask:1x1 \
--optShapes=input_ids:1x300,input_mask:1x300 \
--maxShapes=input_ids:1x600,input_mask:1x600 \
--device=1
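The three shape flags have to stay consistent with each other and with the ONNX input names, which is easy to get wrong when editing the command by hand. A small sketch that assembles the same flags from Python; the helper names shape_flag and build_trtexec_args are hypothetical, and the paths are the same placeholders as in the command above.

```python
import shlex
from typing import Dict, List, Sequence


def shape_flag(flag: str, shapes: Dict[str, Sequence[int]]) -> str:
    """Format {'input_ids': (1, 300)} as '--<flag>=input_ids:1x300'."""
    parts = []
    for name, dims in shapes.items():
        parts.append(name + ":" + "x".join(str(d) for d in dims))
    return "--" + flag + "=" + ",".join(parts)


def build_trtexec_args(onnx_path: str, engine_path: str,
                       min_len: int, opt_len: int, max_len: int) -> List[str]:
    """Rebuild the trtexec flags used above for a batch-1, variable-length profile."""
    def profile(seq_len: int) -> Dict[str, Sequence[int]]:
        return {"input_ids": (1, seq_len), "input_mask": (1, seq_len)}

    return [
        "/path_to/trtexec",
        "--onnx=" + onnx_path,
        "--saveEngine=" + engine_path,
        "--fp16",
        "--workspace=80000",  # builder workspace limit, in MiB
        shape_flag("minShapes", profile(min_len)),
        shape_flag("optShapes", profile(opt_len)),
        shape_flag("maxShapes", profile(max_len)),
        "--device=1",
    ]


args = build_trtexec_args("/path_to/model.onnx", "/path_to/model.trt", 1, 300, 600)
print(shlex.join(args))
# To launch the build from Python: subprocess.run(args, check=True)
```

Sequence lengths outside the min/max range of the profile are rejected by the engine at run time, so max_len should cover the prompt plus every token to be generated.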

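Converting directly to trt with torch-tensorrt

(This approach has some problems on my side that I have not figured out yet.)

I followed the official code at https://github.com/pytorch/TensorRT

To avoid running into operators or syntax in the model definition that TensorRT does not support, first generate TorchScript with torch.jit.trace,

then compile it with torch_tensorrt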

import torch
import torch_tensorrt

# Trace with example inputs to get a TorchScript module, then move it to the GPU.
traced_model = torch.jit.trace(model, (inputs['input_ids'], inputs['attention_mask'])).to('cuda:0')

# One dynamic-shape spec per input (input_ids, attention_mask): batch 1, length 1 to 600.
compile_inputs = [
    torch_tensorrt.Input(
        min_shape=[1, 1],
        opt_shape=[1, 300],
        max_shape=[1, 600],
        dtype=torch.int32),
    torch_tensorrt.Input(
        min_shape=[1, 1],
        opt_shape=[1, 300],
        max_shape=[1, 600],
        dtype=torch.int32)
]
trt_model = torch_tensorrt.ts.compile(traced_model,
                                      inputs=compile_inputs,
                                      truncate_long_and_double=True,
                                      enabled_precisions={torch.float16},
                                      device=torch.device('cuda:0'))

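However, this fails with RuntimeError: Trying to create tensor with negative dimension -1: [-1, -1]

I tried dropping attention_mask and passing only input_ids, but that does not work either: it fails while trying to create a tensor of shape [1, -1].

As a last resort, I went on to try fixed-size inputs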

compile_inputs = [
    torch_tensorrt.Input(
        shape=[1, 600],
        dtype=torch.int32),
    torch_tensorrt.Input(
        shape=[1, 600],
        dtype=torch.int32)
]

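This compiles successfully, and saving the result with torch.jit.save gives the TensorRT model. Unfortunately, at run time it complains about the input shape, and I have not worked out why yet.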
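One possible cause, which I have not confirmed on this setup, is that a module compiled with a static [1, 600] shape only accepts inputs of exactly that shape, so any shorter prompt is rejected. Under that assumption, a sketch of right-padding a prompt to the fixed length before calling the model. The helper name pad_to_fixed_length is hypothetical, and pad_id should be read from tokenizer.pad_token_id rather than hard-coded as in this example.

```python
from typing import List, Tuple


def pad_to_fixed_length(token_ids: List[int], length: int,
                        pad_id: int) -> Tuple[List[int], List[int]]:
    """Right-pad token ids to exactly `length`.

    Returns (padded_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    if len(token_ids) > length:
        raise ValueError(f"prompt has {len(token_ids)} tokens, limit is {length}")
    n_pad = length - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask


# Example with a short fixed length so the result is easy to read.
# pad_id=1 is a placeholder value for illustration only.
ids, mask = pad_to_fixed_length([2, 100, 200], length=6, pad_id=1)
print(ids)   # [2, 100, 200, 1, 1, 1]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

With right padding, the logits for the next token sit at the index of the last real token (len(token_ids) - 1), not at the last position of the padded sequence.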

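Things to note:

Make sure the Python tensorrt package and the system-wide TensorRT are the same version, and check that the version reported by dpkg -l | grep TensorRT is compatible with torch_tensorrt.__version__.
Import torch_tensorrt before calling torch.jit.load.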
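The version check in the first note can be scripted. A minimal sketch that compares only the major and minor components of two version strings; the helper names are hypothetical, and the dpkg-style string in the example is a made-up sample of what such a version column might look like, not output captured from my machine.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like '8.6.1' or '8.6.1.6-1+cuda12.0'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(python_pkg_version: str, system_pkg_version: str) -> bool:
    """True when both versions share the same major.minor."""
    return major_minor(python_pkg_version) == major_minor(system_pkg_version)


# In a real check the first string would come from tensorrt.__version__ and the
# second from the version column of `dpkg -l | grep TensorRT`.
print(versions_match("8.6.1", "8.6.1.6-1+cuda12.0"))  # True
```

Comparing major.minor only is a design choice for this sketch; a stricter check could compare the patch level as well.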

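Using the trt engine

I use transformer-deploy here to run the trt model (see its documentation for installation). It provides wrappers around the trt engine and related pieces, which are quite convenient.

First, set up the runtime and load the engine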

import tensorrt as trt
from tensorrt import Logger, Runtime
from transformer_deploy.backends.trt_utils import build_engine, load_engine, save_engine, TensorRTShape

# Engine file written by trtexec (--saveEngine) earlier.
path = "/path_to/model.trt"

trt_logger: Logger = trt.Logger(trt.Logger.ERROR)
runtime: Runtime = trt.Runtime(trt_logger)
# load_engine returns a callable that runs the engine's forward pass.
tensorrt_model = load_engine(engine_file_path=path, runtime=runtime)

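The loaded trt model only performs the forward pass and lacks generation features such as greedy search, so it is worth wrapping it to match the Hugging Face transformers interface, following a reference implementation.

Since the model tested here is opt-1.3B, I simply subclassed transformers.OPTForCausalLM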

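from typing import Callable, Dict

import torch
from transformers import OPTForCausalLM, PretrainedConfig
from transformers.modeling_outputs import CausalLMOutput


def inference_tensorrt(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> Dict[str, torch.Tensor]:
    # Assumption: the callable returned by load_engine takes a dict keyed by the
    # ONNX input names and returns a dict keyed by the ONNX output names.
    return tensorrt_model({'input_ids': input_ids, 'input_mask': attention_mask})


class ModelWrapper(OPTForCausalLM):
    def __init__(
            self, config: PretrainedConfig,
            inference: Callable[[torch.Tensor, torch.Tensor], Dict[str, torch.Tensor]]
    ):
        super().__init__(config)
        self.config: PretrainedConfig = config
        self.inference = inference

    def forward(self, input_ids, attention_mask, **_):
        # Only logits are returned; other generate() kwargs (e.g. the KV cache) are ignored.
        logits = self.inference(input_ids, attention_mask)['output']
        return CausalLMOutput(logits=logits)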

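An expert's view is that for small models the cache matters little at inference time, and recomputing does not add much latency. So for now past_key_value is not wired in ~~(definitely not out of laziness)~~.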
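To make the trade-off concrete, here is a toy sketch of greedy search without a cache: every step feeds the whole sequence back through the forward function, so the work per step grows with the sequence length. The names greedy_generate and toy_forward are hypothetical, and toy_forward is a stand-in for the engine, not the real model.

```python
from typing import Callable, List


def greedy_generate(forward: Callable[[List[int]], List[float]], prompt: List[int],
                    max_new_tokens: int, eos_id: int) -> List[int]:
    """Greedy search without a KV cache: each step re-runs forward on the full sequence."""
    ids = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward(ids)  # next-token logits, recomputed from scratch
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_forward(ids: List[int]) -> List[float]:
    """Toy 5-token vocabulary: always prefers (last token + 1) mod 5."""
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_generate(toy_forward, [0], max_new_tokens=10, eos_id=4))
# [0, 1, 2, 3, 4]
```

With a cache, only the newest token would be fed in at each step. Skipping it keeps the engine to the two inputs it was exported with (input_ids and the mask), at the cost of repeated computation.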

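Comparing the results

I tested this on opt-1.3b.

I ran model.generate(**inputs, max_new_tokens=50, do_sample=False, top_k=5, top_p=0.95) in a loop 100 times. On my machine this took 220 seconds before and 94 seconds after TensorRT acceleration, roughly a 2.3x speedup. Note that with do_sample=False decoding is greedy, so top_k and top_p have no effect. Memory usage dropped slightly.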
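For reference, a minimal sketch of how such a comparison can be timed and how the speedup follows from the two reported numbers. The helper names time_loop and speedup are hypothetical; in a real run, fn would wrap the model.generate call shown above.

```python
import time
from typing import Callable


def time_loop(fn: Callable[[], object], n_iters: int = 100) -> float:
    """Wall-clock seconds spent calling fn() n_iters times."""
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    return time.perf_counter() - start


def speedup(baseline_s: float, optimized_s: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


# Numbers reported above: 220 s vs 94 s for 100 generate calls.
print(f"speedup: {speedup(220, 94):.2f}x")                   # speedup: 2.34x
print(f"per call: {220 / 100:.2f} s -> {94 / 100:.2f} s")    # per call: 2.20 s -> 0.94 s
```

When timing GPU work, it is safer to call torch.cuda.synchronize() before reading the clock and to discard a few warm-up iterations, otherwise queued kernels and one-off initialisation can skew the result.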

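Problems encountered with baichuan-7B

First, converting the model to ONNX fails with a message that the aten::unflatten operator is not supported.

I checked the official site: https://pytorch.org/TensorRT/indices/supported_ops.html

Note: the official site has a page listing unsupported operators, but I can no longer find it...

The suggested fix is to rewrite the operator, so I tried a different formulation, replacing unflatten with reshape.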

class Attention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""
    def forward(
            self,
            hidden_states: torch.Tensor,
            attention_mask: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.LongTensor] = None,
            past_key_value: Optional[Tuple[torch.Tensor]] = None,
            output_attentions: bool = False,
            use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()
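        proj = self.W_pack(hidden_states)

        # Original line, rejected by the ONNX exporter (aten::unflatten):
        # proj = proj.unflatten(-1, (3, self.hidden_size)).unsqueeze(0).transpose(0, -2).squeeze(-2)
        # Equivalent reshape: split the last dimension into (3, hidden_size).
        proj = proj.reshape([*proj.shape[:-1], 3, self.hidden_size]).unsqueeze(0).transpose(0, -2).squeeze(-2)
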
        ...

With that change, the ONNX export succeeded.
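The reason reshape can stand in for unflatten here is that both split the last dimension of size 3 * hidden_size into (3, hidden_size) in row-major order, that is, into three consecutive blocks. A pure-Python sketch on a single token's projection; the helper name unflatten_last and the tiny hidden_size are for illustration only.

```python
from typing import List


def unflatten_last(row: List[float], parts: int, size: int) -> List[List[float]]:
    """Split a flat vector of length parts*size into `parts` consecutive blocks.

    For contiguous row-major data this is what both unflatten(-1, (parts, size))
    and reshape(..., parts, size) do to the last dimension.
    """
    if len(row) != parts * size:
        raise ValueError(f"expected {parts * size} values, got {len(row)}")
    return [row[i * size:(i + 1) * size] for i in range(parts)]


hidden_size = 2
# One token's W_pack output, laid out as three blocks of hidden_size values.
packed = [0.1, 0.2, 1.1, 1.2, 2.1, 2.2]
q, k, v = unflatten_last(packed, 3, hidden_size)
print(q, k, v)  # [0.1, 0.2] [1.1, 1.2] [2.1, 2.2]
```

The unsqueeze/transpose/squeeze chain that follows in the model code only moves this new axis of size 3 to the front, so it is unaffected by the change.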

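When converting the ONNX model to trt, I get an OOM error. After a long search online,

it seems that models this large are not supported yet, although a fix appears to have been made already: https://github.com/microsoft/onnxruntime/pull/16440
As far as I can tell, though, no new release containing it has been published so far.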
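Independently of that fix, a back-of-the-envelope estimate shows why a 7B model is much tighter on memory than opt-1.3b. This sketch assumes roughly 7 billion parameters (taken from the model name, not an exact count) and counts weights only, ignoring builder workspace and activations; the helper name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024 ** 3


# Assumed parameter count: about 7e9, inferred from the name "7B".
for name, nbytes in (("fp32", 4), ("fp16", 2)):
    print(f"{name}: {weight_memory_gib(7e9, nbytes):.1f} GiB")
# fp32: 26.1 GiB
# fp16: 13.0 GiB
```

The engine builder needs room for the weights on top of its workspace, so a conversion that works for a 1.3B model can run out of memory at this size.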

So I am giving up on this route for now. I will come back to accelerating baichuan-7B when I have time.