TensorRT Study Notes (Part 1)

Published: 2023-09-02 16:33:38  Author: silence_cho

These notes organize TensorRT study material for later reference. (Most of the content is excerpted from online resources.)

1. Introduction to TensorRT

Installation: https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html

TensorRT Python API documentation: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html

TensorRT C++ API documentation: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/index.html

The TensorRT workflow is roughly as follows: define a network, build (optimize) it into an engine, then run inference on that engine through an execution context.

1.1 TensorRT API

This section briefly covers several important classes in the TensorRT C++ API: ILogger, IBuilder, INetworkDefinition, ICudaEngine, and IParser.

ILogger class

The TensorRT logging class. It mainly contains an enum defining log severity levels and a virtual log() function that prints messages.

  • Enum type: the larger the number, the less severe the log level

    enum class Severity : int32_t {
      kINTERNAL_ERROR = 0 , 
      kERROR = 1 ,
      kWARNING = 2 ,
      kINFO = 3 ,
      kVERBOSE = 4
    };
    
  • log() virtual function: derived classes must implement it

    virtual void log (Severity severity, AsciiChar const *msg) noexcept=0
    

In practice, one usually derives from ILogger and implements the log() function, for example:

// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kVERBOSE){
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
};

TRTLogger logger;
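Because smaller Severity values mean more severe messages, the comparison `severity <= threshold` in the log() override keeps every message that is at least as severe as the threshold. A minimal self-contained sketch of that filtering logic (the enum here only mirrors the values shown above; it is not the real TensorRT header):

```cpp
#include <cstdint>

// Stand-in for nvinfer1::ILogger::Severity, with the same values as shown
// above (illustration only, not the real TensorRT header).
enum class Severity : int32_t {
    kINTERNAL_ERROR = 0, kERROR = 1, kWARNING = 2, kINFO = 3, kVERBOSE = 4
};

// "severity <= threshold" keeps a message when it is at least as severe as
// the threshold, because lower numbers mean more severe messages.
bool shouldLog(Severity severity, Severity threshold) {
    return severity <= threshold;
}
```

So a logger built with threshold kINFO drops kVERBOSE messages but keeps errors and warnings.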

IBuilder class

Converts a network into an engine according to the configured optimization parameters.

Creating an IBuilder object

This class cannot be subclassed; instances are normally created with the createInferBuilder() function, defined as follows:

inline IBuilder* createInferBuilder(ILogger& logger) noexcept
{
    return static_cast<IBuilder*>(createInferBuilder_INTERNAL(&logger, NV_TENSORRT_VERSION));
}

Usage example:

#include <NvInfer.h>
#include <NvInferRuntime.h>

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kVERBOSE){
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
};

TRTLogger logger;
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);

Member function usage

Several IBuilder member functions are frequently used:

  • nvinfer1::IBuilder::createBuilderConfig()
  • nvinfer1::IBuilder::buildEngineWithConfig()
  • nvinfer1::IBuilder::createNetworkV2()
  • nvinfer1::IBuilder::createOptimizationProfile()

createBuilderConfig()

Function definition:

nvinfer1::IBuilderConfig * nvinfer1::IBuilder::createBuilderConfig()

The function returns an IBuilderConfig instance, which holds the parameters the builder uses while creating the engine. The most important setting is setMaxWorkspaceSize(); usage:

nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f); // 256 MiB
config->setMaxWorkspaceSize(1 << 28);
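The `1 << 28` above is plain byte arithmetic; a small helper makes the conversion explicit (the helper name is mine, not a TensorRT API):

```cpp
#include <cstddef>

// Convert a byte count to MiB, matching the printf above:
// (1 << 28) bytes is exactly 256 MiB.
double bytesToMiB(std::size_t bytes) {
    return bytes / 1024.0 / 1024.0;
}
```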

buildEngineWithConfig()

Function definition (deprecated since TensorRT 8.0, replaced by IBuilder::buildSerializedNetwork()):

TRT_DEPRECATED nvinfer1::ICudaEngine* nvinfer1::IBuilder::buildEngineWithConfig(INetworkDefinition& network, IBuilderConfig& config)

The function creates an engine from the network structure (INetworkDefinition) and the configuration (IBuilderConfig). The same network can therefore produce different engines under different configurations. Usage:

nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
config->setMaxWorkspaceSize(1 << 28);
builder->setMaxBatchSize(1); // batch size 1 at inference time
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

createNetworkV2()

Function definition:

nvinfer1::INetworkDefinition* nvinfer1::IBuilder::createNetworkV2(NetworkDefinitionCreationFlags flags)

The function creates an empty INetworkDefinition instance. The flags parameter is a bit mask built from the NetworkDefinitionCreationFlag enum, defined as:

enum class NetworkDefinitionCreationFlag : int32_t
{
    //! Dynamic shape support requires that the kEXPLICIT_BATCH flag is set.
    //! With dynamic shapes, any of the input dimensions can vary at run-time,
    //! and there are no implicit dimensions in the network specification. This is specified by using the
    //! wildcard dimension value -1.
    kEXPLICIT_BATCH = 0, //!< Mark the network to be an explicit batch network

    //! Setting the network to be an explicit precision network has the following implications:
    //! 1) Precision of all input tensors to the network have to be specified with ITensor::setType() function
    //! 2) Precision of all layer output tensors in the network have to be specified using ILayer::setOutputType()
    //! function
    //! 3) The builder will not quantize the weights of any layer including those running in lower precision(INT8). It
    //! will
    //! simply cast the weights into the required precision.
    //! 4) Dynamic ranges must not be provided to run the network in int8 mode. Dynamic ranges of each tensor in the
    //! explicit
    //! precision network is [-127,127].
    //! 5) Quantizing and dequantizing activation values between higher (FP32) and lower (INT8) precision
    //! will be performed using explicit Scale layers with input/output precision set appropriately.
    kEXPLICIT_PRECISION TRT_DEPRECATED_ENUM = 1, //! <-- Deprecated, used for backward compatibility
};

Setting the NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag makes the batch dimension explicit, which supports both fixed and dynamic input shapes. To summarize:

  • nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0): implicit batch mode (deprecated); only static shapes, with the batch dimension handled implicitly
  • nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1): explicit batch mode; supports both static and dynamic input shapes
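A note on the literal 1 passed to createNetworkV2() throughout this article: the parameter is a bit mask, so the idiomatic spelling shifts by the flag's enum value. A self-contained sketch (the enum is a copy of the value shown above, not the real TensorRT header):

```cpp
#include <cstdint>

// Stand-in for nvinfer1::NetworkDefinitionCreationFlag (value copied from
// the enum above; illustration only).
enum class NetworkDefinitionCreationFlag : int32_t { kEXPLICIT_BATCH = 0 };

// createNetworkV2() takes a bit mask; "explicit batch" is bit kEXPLICIT_BATCH,
// which evaluates to the literal 1 used throughout this article.
uint32_t explicitBatchFlag() {
    return 1u << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
}
```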

createOptimizationProfile()

Function definition:

nvinfer1::IOptimizationProfile * nvinfer1::IBuilder::createOptimizationProfile()

Creates an IOptimizationProfile instance. When the network has dynamically shaped inputs, an IOptimizationProfile is needed to specify the minimum, maximum, and optimal input dimensions. Usage example:

auto profile = builder->createOptimizationProfile();
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();

// minimum allowed input: 1 x num_input x 3 x 3
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, num_input, 3, 3));
// optimal input size
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(1, num_input, 3, 3));
// maximum allowed input: maxBatchSize x num_input x 5 x 5
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(maxBatchSize, num_input, 5, 5));

config->addOptimizationProfile(profile);

INetworkDefinition

Defines a network: its inputs, layers, and output structure. This class cannot be subclassed; it is normally created as follows:

nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);

There are several ways to build the network structure: define it with the C++ API, or load it directly from an ONNX file.

    1. Defining a network structure with the C++ API:
nvinfer1::Weights make_weights(float* ptr, int n){
    nvinfer1::Weights w;
    w.count = n;
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}

nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);

const int num_input = 3;   // in_channel
const int num_output = 2;  // out_channel
float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5}; // first 3 values are the weights of output neuron 1, last 3 of output neuron 2
float layer1_bias_values[]   = {0.3, 0.8};

// add the input layer to the network, specifying the input's name, data type, and full dimensions
nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, num_input, 1, 1));
nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 6);
nvinfer1::Weights layer1_bias   = make_weights(layer1_bias_values, 2);
// add a fully connected layer; note that input is dereferenced
auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
// add an activation layer
auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kSIGMOID); // more pedantically *(layer1->getOutput(0)), i.e. dereference the pointer returned by getOutput
// mark prob as the network output
network->markOutput(*prob->getOutput(0));
    2. Parsing an ONNX file with a parser:
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
// the onnx parser fills network with the parsed layers, much as if they were added via the add* calls
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
if(!parser->parseFromFile("demo.onnx", 1)){
    printf("Failed to parse demo.onnx\n");
}

Member function usage

Some commonly used INetworkDefinition functions:

  • int32_t nvinfer1::INetworkDefinition::getNbInputs() const: number of network inputs
  • ITensor* nvinfer1::INetworkDefinition::getInput(int32_t index) const: pointer to the index-th network input
  • int32_t nvinfer1::INetworkDefinition::getNbLayers() const: number of layers in the network
  • ILayer* nvinfer1::INetworkDefinition::getLayer(int32_t index) const: pointer to the index-th network layer
  • int32_t nvinfer1::INetworkDefinition::getNbOutputs() const: number of network outputs
  • ITensor* nvinfer1::INetworkDefinition::getOutput(int32_t index) const: pointer to the index-th network output
  • bool nvinfer1::INetworkDefinition::hasImplicitBatchDimension() const: whether the network was created with an implicit batch dimension (i.e. without the kEXPLICIT_BATCH flag)
  • void nvinfer1::INetworkDefinition::markOutput(ITensor& tensor): marks a tensor as a network output

ICudaEngine

Declared in the header NvInferRuntime.h.

The engine produced from a network, used to run inference. This class cannot be subclassed either; instances are normally created via builder->buildEngineWithConfig() or runtime->deserializeCudaEngine(), as follows:

// create an engine via IBuilder
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);


// deserialize an engine via IRuntime (an engine created by IBuilder is serialized and saved with engine->serialize())
auto engine_data = load_file("engine.trtmodel");  
nvinfer1::IRuntime* runtime  = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());

Common member functions

Some frequently used ICudaEngine member functions:

  • TRT_DEPRECATED int32_t nvinfer1::ICudaEngine::getNbBindings() const: total number of engine inputs and outputs (bindings); replaced by getNbIOTensors() since TensorRT 8.5. ( Note: If the engine has been built for K profiles, the first getNbBindings() / K bindings are used by profile number 0, the following getNbBindings() / K bindings are used by profile number 1 etc. )

  • char const* nvinfer1::ICudaEngine::getBindingName(int32_t bindingIndex) const: name of the bindingIndex-th binding (a binding is an engine input or output)

  • int32_t nvinfer1::ICudaEngine::getBindingIndex(char const* name) const: bindingIndex of the binding with the given name

  • Dims nvinfer1::ICudaEngine::getBindingDimensions(int32_t bindingIndex) const: dimensions of the bindingIndex-th binding

  • DataType nvinfer1::ICudaEngine::getBindingDataType(int32_t bindingIndex) const: data type of the bindingIndex-th binding

  • int32_t nvinfer1::ICudaEngine::getMaxBatchSize() const: maximum batch size allowed by the engine

  • IHostMemory* nvinfer1::ICudaEngine::serialize() const: serializes the engine so it can be saved as a binary file; usage:

    nvinfer1::IHostMemory* model_data = engine->serialize();
    FILE* f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);
    
  • IExecutionContext* nvinfer1::ICudaEngine::createExecutionContext(): creates an execution context for running inference; usage:

    nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();
    cudaStream_t stream = nullptr;
    cudaStreamCreate(&stream);
    
    float input_data_host[] = {1, 2, 3};
    float* input_data_device = nullptr;
    
    float output_data_host[2];
    float* output_data_device = nullptr;
    cudaMalloc(&input_data_device, sizeof(input_data_host));
    cudaMalloc(&output_data_device, sizeof(output_data_host));
    cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream);
    // an array of pointers giving the device addresses of the input and output tensors
    float* bindings[] = {input_data_device, output_data_device};
    
    bool success  = execution_context->enqueueV2((void**)bindings, stream, nullptr);
    cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    
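The K-profiles note under getNbBindings() above boils down to simple index arithmetic. The helper below (hypothetical, not part of the TensorRT API) computes the absolute binding index of a tensor for a given profile:

```cpp
// With nbProfiles optimization profiles, an engine's bindings are grouped per
// profile: profile p owns the range starting at p * (nbBindings / nbProfiles).
// baseIndex is the tensor's index within one profile's group.
// (Hypothetical helper, not part of the TensorRT API.)
int bindingIndexForProfile(int baseIndex, int profile, int nbBindings, int nbProfiles) {
    int bindingsPerProfile = nbBindings / nbProfiles;
    return profile * bindingsPerProfile + baseIndex;
}
```

For example, an engine with 8 bindings and 4 profiles has 2 bindings per profile, so profile 1's first binding sits at absolute index 2.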

IRuntime

Mainly used to deserialize a serialized engine. This class cannot be subclassed either; it is normally created as follows:

nvinfer1::IRuntime* runtime   = nvinfer1::createInferRuntime(logger);  // requires an ILogger instance

Deserialization is done through the deserializeCudaEngine() function, defined as:

ICudaEngine* nvinfer1::IRuntime::deserializeCudaEngine(void const* blob, std::size_t size)

  • blob: pointer to the memory holding the serialized engine
  • size: size of that memory in bytes

Usage:

auto engine_data = load_file("engine.trtmodel");
nvinfer1::IRuntime* runtime   = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());

IParser

Declared in the header NvOnnxParser.h; parses the network in an ONNX file into an INetworkDefinition object. Created as follows:

nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);

ONNX files are parsed mainly through the parseFromFile() member function, defined as follows:

virtual bool nvonnxparser::IParser::parseFromFile(const char* onnxModelFile, int verbosity)

  • onnxModelFile: path to the ONNX file
  • verbosity: logging verbosity level

Usage:

if(!parser->parseFromFile("demo.onnx", 1)){
    printf("Failed to parse demo.onnx\n");
}

1.2 TensorRT hello world

The following code builds a network with a single fully connected layer, builds an engine from it, serializes the engine, and saves it to a file:

// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>

// cuda include
#include <cuda_runtime.h>

// system include
#include <stdio.h>

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kVERBOSE){
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
};

nvinfer1::Weights make_weights(float* ptr, int n){
    nvinfer1::Weights w;
    w.count = n;     // The number of weights in the array.
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}

int main(){
    // This code implements a minimal neural network, see figure/simple_fully_connected_net.png

    TRTLogger logger; // a logger is required to capture warnings, info messages, etc.

    // ----------------------------- 1. Define builder, config, and network -----------------------------
    // These are the basic required components.
    // Intuitively: you need a builder to build the network; the network has a structure, and that structure can be built under different configurations.
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    // Create a build configuration specifying how TensorRT should optimize the model; the generated engine only runs under that specific configuration.
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    // Create the network definition; createNetworkV2(1) means explicit batch size. With newer TensorRT (>= 7.0), implicit batch (0) is not recommended.
    // So from here on, always use createNetworkV2(1) rather than createNetworkV2(0) or createNetwork.
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);

    // build a model
    /*
        Network definition:

        image
          |
        linear (fully connected)  input = 3, output = 2, bias = True     w=[[1.0, 2.0, 0.5], [0.1, 0.2, 0.5]], b=[0.3, 0.8]
          |
        sigmoid
          |
        prob
    */

    // ----------------------------- 2. Basic information about inputs, model structure, and outputs -----------------------------
    const int num_input = 3;   // in_channel
    const int num_output = 2;  // out_channel
    float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5}; // first 3 values are the weights of output neuron 1, last 3 of output neuron 2
    float layer1_bias_values[]   = {0.3, 0.8};

    // add the input layer to the network, specifying the input's name, data type, and full dimensions
    nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, num_input, 1, 1));
    nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 6);
    nvinfer1::Weights layer1_bias   = make_weights(layer1_bias_values, 2);
    // add a fully connected layer; note that input is dereferenced
    auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
    // add an activation layer
    auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kSIGMOID); // more pedantically *(layer1->getOutput(0)), i.e. dereference the pointer returned by getOutput

    // mark prob as the network output
    network->markOutput(*prob->getOutput(0));

    printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f); // 256 MiB
    config->setMaxWorkspaceSize(1 << 28);
    builder->setMaxBatchSize(1); // batch size 1 at inference time

    // ----------------------------- 3. Build the engine -----------------------------
    // buildCudaEngine was deprecated in TensorRT 7.1.0; use buildEngineWithConfig uniformly instead
    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    if(engine == nullptr){
        printf("Build engine failed.\n");
        network->destroy();
        config->destroy();
        builder->destroy();
        return -1;
    }

    // ----------------------------- 4. Serialize the model and save it -----------------------------
    // serialize the model and store it as a file
    nvinfer1::IHostMemory* model_data = engine->serialize();
    FILE* f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);

    // release in reverse order of construction
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Done.\n");
    return 0;
}

Key takeaways:

  1. You must use createNetworkV2 and pass 1 (explicit batch). createNetwork is deprecated, and implicit batch is officially discouraged (to be confirmed). This choice also determines whether inference uses enqueue or enqueueV2.

  2. Remember to release pointers such as builder and config with ptr->destroy(), otherwise they leak memory.

  3. markOutput marks a node as a model output: each call adds one output, just as each addInput call adds one input. This maps directly onto the bindings used at inference time.

  4. workspaceSize is the scratch workspace size: layers that need extra temporary storage do not allocate it themselves; for the sake of memory reuse they request workspace memory from TensorRT instead.

  5. Always remember: a saved engine only fits the TensorRT version and the device it was compiled for, and is only guaranteed optimal under that configuration. Running an engine across different devices may sometimes work, but it is neither optimal nor recommended.
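The hello-world network above is small enough to compute by hand. The sketch below is a plain C++ reference implementation of the same fully connected + sigmoid forward pass (my own helper, independent of TensorRT), handy for checking the engine's output:

```cpp
#include <cmath>
#include <vector>

// Reference forward pass of the network built above:
// out[o] = sigmoid(b[o] + sum_i w[o * nin + i] * x[i])
std::vector<float> fcSigmoid(const float* x, int nin,
                             const float* w, const float* b, int nout) {
    std::vector<float> out(nout);
    for (int o = 0; o < nout; ++o) {
        float acc = b[o];
        for (int i = 0; i < nin; ++i)
            acc += w[o * nin + i] * x[i];
        out[o] = 1.0f / (1.0f + std::exp(-acc));  // sigmoid
    }
    return out;
}
```

With input {1, 2, 3}, the weights {1.0, 2.0, 0.5, 0.1, 0.2, 0.5}, and biases {0.3, 0.8}, the pre-activations are 6.8 and 2.8.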

1.3 TensorRT inference

The following code loads a serialized engine file, creates the engine and an execution context, and runs inference:

// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>

// cuda include
#include <cuda_runtime.h>

// system include
#include <stdio.h>
#include <math.h>

#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kINFO){
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
} logger;

nvinfer1::Weights make_weights(float* ptr, int n){
    nvinfer1::Weights w;
    w.count = n;
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}


bool build_model(){
    TRTLogger logger;

    // these are the basic required components
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);

    // build a model
    /*
        Network definition:

        image
          |
        linear (fully connected)  input = 3, output = 2, bias = True     w=[[1.0, 2.0, 0.5], [0.1, 0.2, 0.5]], b=[0.3, 0.8]
          |
        sigmoid
          |
        prob
    */

    const int num_input = 3;
    const int num_output = 2;
    float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
    float layer1_bias_values[]   = {0.3, 0.8};

    nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, num_input, 1, 1));
    nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 6);
    nvinfer1::Weights layer1_bias   = make_weights(layer1_bias_values, 2);
    auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
    auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kSIGMOID);
    
    // mark prob as the network output
    network->markOutput(*prob->getOutput(0));

    printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
    config->setMaxWorkspaceSize(1 << 28);
    builder->setMaxBatchSize(1);

    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    if(engine == nullptr){
        printf("Build engine failed.\n");
        return false;
    }

    // serialize the model and store it as a file
    nvinfer1::IHostMemory* model_data = engine->serialize();
    FILE* f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);

    // release in reverse order of construction
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Done.\n");
    return true;
}

vector<unsigned char> load_file(const string& file){
    ifstream in(file, ios::in | ios::binary);
    if (!in.is_open())
        return {};

    in.seekg(0, ios::end);
    size_t length = in.tellg();

    std::vector<uint8_t> data;
    if (length > 0){
        in.seekg(0, ios::beg);
        data.resize(length);

        in.read((char*)&data[0], length);
        //in.read((char*)data.data(), length);
    }
    in.close();
    return data;
}

void inference(){

    // ------------------------------ 1. Prepare and load the model ----------------------------
    TRTLogger logger;
    auto engine_data = load_file("engine.trtmodel");
    // Before running inference, create a runtime instance; like the builder, the runtime needs a logger:
    nvinfer1::IRuntime* runtime   = nvinfer1::createInferRuntime(logger);
    // With the model read into engine_data, deserialize it to obtain the engine
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());
    if(engine == nullptr){
        printf("Deserialize cuda engine failed.\n");
        runtime->destroy();
        return;
    }

    nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();
    cudaStream_t stream = nullptr;
    // create a CUDA stream so this batch's inference is independent
    cudaStreamCreate(&stream);

    /*
        Network definition:

        image
          |
        linear (fully connected)  input = 3, output = 2, bias = True     w=[[1.0, 2.0, 0.5], [0.1, 0.2, 0.5]], b=[0.3, 0.8]
          |
        sigmoid
          |
        prob
    */

    // ------------------------------ 2. Prepare the input data and copy it to the GPU ----------------------------
    float input_data_host[] = {1, 2, 3};
    float* input_data_device = nullptr;

    float output_data_host[2];
    float* output_data_device = nullptr;
    cudaMalloc(&input_data_device, sizeof(input_data_host));
    cudaMalloc(&output_data_device, sizeof(output_data_host));
    cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream);
    // an array of pointers giving the device addresses of the input and output tensors
    float* bindings[] = {input_data_device, output_data_device};

    // ------------------------------ 3. Run inference and copy the result back to the CPU ----------------------------
    bool success      = execution_context->enqueueV2((void**)bindings, stream, nullptr);
    cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("output_data_host = %f, %f\n", output_data_host[0], output_data_host[1]);

    // ------------------------------ 4. Release resources ----------------------------
    printf("Clean memory\n");
    cudaStreamDestroy(stream);
    cudaFree(input_data_device);
    cudaFree(output_data_device);
    execution_context->destroy();
    engine->destroy();
    runtime->destroy();

    // ------------------------------ 5. Verify by computing the result manually ----------------------------
    const int num_input = 3;
    const int num_output = 2;
    float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
    float layer1_bias_values[]   = {0.3, 0.8};

    printf("Manual verification of the result:\n");
    for(int io = 0; io < num_output; ++io){
        float output_host = layer1_bias_values[io];
        for(int ii = 0; ii < num_input; ++ii){
            output_host += layer1_weight_values[io * num_input + ii] * input_data_host[ii];
        }

        // sigmoid
        float prob = 1 / (1 + exp(-output_host));
        printf("output_prob[%d] = %f\n", io, prob);
    }
}

int main(){

    if(!build_model()){
        return -1;
    }
    inference();
    return 0;
}

Key takeaways:

  1. bindings is TensorRT's description of the input and output tensors: bindings = input tensors + output tensors. For example, with input a and outputs b, c, d, bindings = [a, b, c, d], so bindings[0] = a and bindings[2] = c, and engine->getBindingDimensions(0) returns the input's dimensions.

  2. enqueueV2 runs inference asynchronously: the work is queued on the stream for execution. The bindings passed in are pointers to the tensors (note: device pointers), whose shapes correspond to the input/output shapes specified at build time (here all shapes are static).

  3. createExecutionContext can be called multiple times, allowing one engine to have several execution contexts (though take that with a grain of salt).

1.4 Dynamic shape

Making an input's shape dynamic mainly takes two steps:

    1. When building the engine, set the minimum, optimal, and maximum input dimensions through a profile:
    // if the model has multiple inputs, the profile must set dimensions for each of them
    auto profile = builder->createOptimizationProfile();

    // minimum allowed input: 1 x num_input x 3 x 3
    profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, num_input, 3, 3));
    profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(1, num_input, 3, 3));

    // maximum allowed input: maxBatchSize x num_input x 5 x 5
    // if networkDims.d[i] != -1, then minDims.d[i] == optDims.d[i] == maxDims.d[i] == networkDims.d[i]
    profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(maxBatchSize, num_input, 5, 5));
    config->addOptimizationProfile(profile);

    2. At inference time, set the current input shape through the context:
    execution_context->setBindingDimensions(0, nvinfer1::Dims4(ib, 1, ih, iw));
    float* bindings[] = {input_data_device, output_data_device};
    bool success  = execution_context->enqueueV2((void**)bindings, stream, nullptr);

The following code builds a network with a single convolution layer and makes its input dynamic at engine build time, accepting input sizes from (1, 1, 3, 3) to (10, 1, 5, 5):

// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>

// cuda include
#include <cuda_runtime.h>

// system include
#include <stdio.h>
#include <math.h>

#include <iostream> 
#include <fstream> // the ios flags used below come from this header
#include <vector>

using namespace std;

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kINFO){
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
} logger;

nvinfer1::Weights make_weights(float* ptr, int n){
    nvinfer1::Weights w;
    w.count = n;
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}

bool build_model(){
    TRTLogger logger;

    // ----------------------------- 1. Define builder, config, and network -----------------------------
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);

    // build a model
    /*
        Network definition:

        image
          |
        conv(3x3, pad=1)  input = 1, output = 1, bias = True     w=[[1.0, 2.0, 3.1], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2]], b=0.0
          |
        relu
          |
        prob
    */


    // ----------------------------- 2. Basic information about inputs, model structure, and outputs -----------------------------
    const int num_input = 1;
    const int num_output = 1;
    float layer1_weight_values[] = {
        1.0, 2.0, 3.1,
        0.1, 0.1, 0.1,
        0.2, 0.2, 0.2
    }; // row-major
    float layer1_bias_values[]   = {0.0};

    // to use dynamic shapes, the corresponding NetworkDefinition dimensions must be -1; in_channel stays fixed
    nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(-1, num_input, -1, -1));
    nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 9);
    nvinfer1::Weights layer1_bias   = make_weights(layer1_bias_values, 1);
    auto layer1 = network->addConvolution(*input, num_output, nvinfer1::DimsHW(3, 3), layer1_weight, layer1_bias);
    layer1->setPadding(nvinfer1::DimsHW(1, 1));

    auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kRELU); // *(layer1->getOutput(0))
     
    // mark prob as the network output
    network->markOutput(*prob->getOutput(0));

    int maxBatchSize = 10;
    printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
    // configure the scratch workspace, used for layers' temporary storage and for holding intermediate activations
    config->setMaxWorkspaceSize(1 << 28);

    // --------------------------------- 2.1 The optimization profile ----------------------------------
    // if the model has multiple inputs, the profile must set dimensions for each of them
    auto profile = builder->createOptimizationProfile();

    // minimum allowed input: 1 x num_input x 3 x 3
    profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, num_input, 3, 3));
    profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(1, num_input, 3, 3));

    // maximum allowed input: maxBatchSize x num_input x 5 x 5
    // if networkDims.d[i] != -1, then minDims.d[i] == optDims.d[i] == maxDims.d[i] == networkDims.d[i]
    profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(maxBatchSize, num_input, 5, 5));
    config->addOptimizationProfile(profile);

    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    if(engine == nullptr){
        printf("Build engine failed.\n");
        return false;
    }

    // -------------------------- 3. Serialization ----------------------------------
    // serialize the model and store it as a file
    nvinfer1::IHostMemory* model_data = engine->serialize();
    FILE* f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);

    // release in reverse order of construction
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Done.\n");
    return true;
}

vector<unsigned char> load_file(const string& file){
    ifstream in(file, ios::in | ios::binary);
    if (!in.is_open())
        return {};

    in.seekg(0, ios::end);
    size_t length = in.tellg();

    std::vector<uint8_t> data;
    if (length > 0){
        in.seekg(0, ios::beg);
        data.resize(length);

        in.read((char*)&data[0], length);
    }
    in.close();
    return data;
}

void inference(){
    // ------------------------------- 1. Load the model and deserialize -------------------------------
    TRTLogger logger;
    auto engine_data = load_file("engine.trtmodel");
    nvinfer1::IRuntime* runtime   = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());
    if(engine == nullptr){
        printf("Deserialize cuda engine failed.\n");
        runtime->destroy();
        return;
    }

    nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();
    cudaStream_t stream = nullptr;
    cudaStreamCreate(&stream);

    /*
        Network definition:

        image
          |
        conv(3x3, pad=1)  input = 1, output = 1, bias = True     w=[[1.0, 2.0, 3.1], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2]], b=0.0
          |
        relu
          |
        prob
    */

    // ------------------------------- 2. Inputs and outputs -------------------------------
    float input_data_host[] = {
        // batch 0
        1,   1,   1,
        1,   1,   1,
        1,   1,   1,

        // batch 1
        -1,   1,   1,
        1,   0,   1,
        1,   1,   -1
    };
    float* input_data_device = nullptr;

    // a 3x3 input yields a 3x3 output (pad = 1)
    int ib = 2;
    int iw = 3;
    int ih = 3;
    float output_data_host[ib * iw * ih];
    float* output_data_device = nullptr;
    cudaMalloc(&input_data_device, sizeof(input_data_host));
    cudaMalloc(&output_data_device, sizeof(output_data_host));
    cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream);


    // ------------------------------- 3. Inference -------------------------------
    // specify the input size actually used for this inference call
    execution_context->setBindingDimensions(0, nvinfer1::Dims4(ib, 1, ih, iw));
    float* bindings[] = {input_data_device, output_data_device};
    bool success      = execution_context->enqueueV2((void**)bindings, stream, nullptr);
    cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);


    // ------------------------------- 4. Print the results -------------------------------
    for(int b = 0; b < ib; ++b){
        printf("batch %d. output_data_host = \n", b);
        for(int i = 0; i < iw * ih; ++i){
            printf("%f, ", output_data_host[b * iw * ih + i]);
            if((i + 1) % iw == 0)
                printf("\n");
        }
    }

    printf("Clean memory\n");
    cudaStreamDestroy(stream);
    cudaFree(input_data_device);
    cudaFree(output_data_device);
    execution_context->destroy();
    engine->destroy();
    runtime->destroy();
}

int main(){

    if(!build_model()){
        return -1;
    }
    inference();
    return 0;
}

Key takeaways:

  1. Dynamic shape: at build time you specify an allowed range [L, H]; at inference time any shape with L <= shape <= H is accepted.

  2. An OptimizationProfile specifies the range over which input shapes may vary; don't be misled by the word "optimization".

  3. If a dimension of the ONNX input is -1, that dimension is dynamic; otherwise it is fixed, and for a fixed dimension minDims, optDims, and maxDims must all be equal.
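As a closing sanity check, the dynamic-shape network above (3x3 convolution, pad = 1, then ReLU) can also be reproduced on the CPU. A minimal reference implementation using the same weights (my own helper, independent of TensorRT):

```cpp
#include <algorithm>
#include <vector>

// Reference NCHW (C = 1) 3x3 convolution with pad = 1 followed by ReLU,
// matching the layer built in build_model() above.
std::vector<float> conv3x3Relu(const std::vector<float>& in, int ib, int ih, int iw,
                               const float w[9], float bias) {
    std::vector<float> out(in.size(), 0.0f);
    for (int b = 0; b < ib; ++b)
        for (int y = 0; y < ih; ++y)
            for (int x = 0; x < iw; ++x) {
                float acc = bias;
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx) {
                        int sy = y + ky - 1, sx = x + kx - 1;  // pad = 1
                        if (sy < 0 || sy >= ih || sx < 0 || sx >= iw) continue;
                        acc += w[ky * 3 + kx] * in[(b * ih + sy) * iw + sx];
                    }
                out[(b * ih + y) * iw + x] = std::max(acc, 0.0f);  // ReLU
            }
    return out;
}
```

For the all-ones 3x3 batch, the center output pixel sees all nine weights, so it equals their sum: 1.0 + 2.0 + 3.1 + 0.1*3 + 0.2*3 = 7.0.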