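Preface

When people talk about optimizing deep models, the first thing that usually comes to mind is moving to GPU. For CV and NLP models the effect is dramatic: RT (response time) typically drops to about 1/10 of the original. In practice, however, you also run into ranking models, for example recommendation models such as DSSM, ESMM, and DIN. These networks are usually only 4 or 5 layers deep, and on GPU their performance, RT included, actually gets worse. My guess is that the network is so simple that the cost is dominated by repeated IO, which ends up lowering performance.

How to optimize

CV and NLP networks are complex, sometimes large models, and the usual optimization path for them is tf -> gpu -> onnx -> tensorRT. The small models described above call for other, more general approaches; see the referenced article.

The main ideas are roughly:

Prune unnecessary nodes to shrink the graph
Merge some nodes and parameters to shrink the graph
Quantize to lower the numeric precision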
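
To make the third idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It is not the method from the referenced article, just an illustration of what "lowering precision" means; the function names and sample weights are made up for the example.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Each weight now needs 1 byte instead of 4, at the cost of a rounding error of at most half a quantization step.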

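Beyond the approaches mentioned in that article, inference performance can also be improved by:

1. Using shared Embeddings, which in theory also counts as reducing variables
2. Replacing categorical features with numeric features + boundary
3. Shrinking the Embedding dimension
4. Reducing duplicated feature input

Regarding point 2: categorical features are very often fed as string type. When you look at where the model's time goes, you find that the ToString Op in Tensorflow is very expensive. If 10% of the categorical features are converted from String to Float, RT can drop by 10% to 20%. Taking model quality into account, you can replace direct categorical features with continuous numeric features + boundary; in theory this does not hurt model quality.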
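
As a rough illustration of "numeric feature + boundary", the sketch below maps a float value to a bucket index with the standard library. The boundaries and the age feature are hypothetical. In a real model the same bucketing would be done inside the graph, for example with `tf.feature_column.bucketized_column` or the Keras `Discretization` layer, and the bucket index would then feed an embedding lookup.

```python
import bisect

# Hypothetical boundaries for an "age" feature; 4 boundaries give 5 buckets.
AGE_BOUNDARIES = [18.0, 25.0, 35.0, 50.0]


def bucket_index(value, boundaries):
    """Return the bucket id: the number of boundaries that are <= value."""
    return bisect.bisect_right(boundaries, value)


ages = [17.0, 18.0, 30.0, 64.0]
print([bucket_index(a, AGE_BOUNDARIES) for a in ages])  # [0, 1, 2, 4]
```

The model input is now a float instead of a string such as "18-25", so no string processing is needed at serving time.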

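Point 3 is essentially about lowering the model's network complexity. In actual tuning I found that moderately reducing the embedding dim not only lowers RT but can even improve model quality. The right size depends on the scenario and the performance requirements.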
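
To get a feel for how much the embedding dim matters, here is a small back-of-the-envelope calculation. The vocabulary size is a made-up number; an embedding table holds vocab_size × dim parameters, so halving the dim halves the table.

```python
def embedding_params(vocab_size, dim):
    """Number of parameters in one embedding table."""
    return vocab_size * dim


def megabytes(num_params, bytes_per_param=4):
    """Size in MB, assuming float32 parameters by default."""
    return num_params * bytes_per_param / (1024 ** 2)


vocab_size = 1_000_000  # hypothetical vocabulary size
for dim in (64, 32, 16):
    n = embedding_params(vocab_size, dim)
    print(f"dim={dim:>2}: {n:,} params, {megabytes(n):.1f} MB as float32")
```

The same arithmetic shows why shared Embeddings help: two features sharing one table pay for it only once.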

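For point 4, see the referenced article. This optimization happens on the inference side: the rough idea is to cache some of the features to cut IO and bandwidth, which ends up speeding up the whole inference pipeline.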
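
The referenced article is not reproduced here, so the following is only my own sketch of the caching idea: user-side features are fetched once and reused for every candidate item. `fetch_user_features` is a hypothetical stand-in for a remote feature lookup.

```python
from functools import lru_cache

FETCH_COUNT = {"calls": 0}  # counts how often the "remote" lookup runs


def fetch_user_features(user_id):
    """Hypothetical stand-in for a remote feature-store lookup."""
    FETCH_COUNT["calls"] += 1
    return (float(user_id % 7), float(user_id % 3))


@lru_cache(maxsize=10_000)
def cached_user_features(user_id):
    """Cache user features so repeated requests skip the remote call."""
    return fetch_user_features(user_id)


def build_batch(user_id, item_ids):
    """Combine the (cached) user features with each candidate item."""
    user = cached_user_features(user_id)
    return [user + (float(item_id),) for item_id in item_ids]


build_batch(42, [1, 2, 3])
build_batch(42, [4, 5])
print(FETCH_COUNT["calls"])  # 1
```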

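Optimization in practice

Preparation

Before the actual optimization, we need tensorboard and a matching tensorflow version.

tensorboard lets you inspect the model's graph structure and where its time goes. However, tensorboard is very sensitive to the tensorflow version: with a mismatched version it raises no error and simply displays nothing. This makes initial setup very painful, and many people give up at this step.

Here is a tensorboard image I personally recommend, deployed with the following command: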

docker run -d -it -p 8888:8888 -p 6006:6006 -v /root/jupyter:/home/jovyan/work -v /root/tensorboard_log:/tensorboard -e GRANT_SUDO=yes --user root --name=tensorboard jupyter/tensorflow-notebook:tensorflow-2.8.1

Once the docker container is up, run this in a jupyterlab terminal:

pip install tensorboard-plugin-profile==2.8.0
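
Since a version mismatch fails silently, it can help to print what is actually installed. The sketch below uses only the standard library. Comparing major.minor is my own rough heuristic, based on the 2.8.x pairing used above, and not an official compatibility rule.

```python
import re
from importlib import metadata


def major_minor(version):
    """Extract (major, minor) from a version string such as '2.8.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    return (int(match.group(1)), int(match.group(2))) if match else None


def versions_aligned(versions):
    """True if every package shares the same major.minor version."""
    return len({major_minor(v) for v in versions.values()}) == 1


def installed_versions(packages=("tensorflow", "tensorboard",
                                 "tensorboard-plugin-profile")):
    """Look up installed versions without importing the packages."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


found = installed_versions()
print(found)
present = {k: v for k, v in found.items() if v is not None}
if present and not versions_aligned(present):
    print("Warning: major.minor versions differ, tensorboard may show nothing")
```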

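Analyzing the model

Graph structure

With the environment ready, we can start analyzing the model. All models below are in saved_model format, since it is the default model format for tf-serving and other inference engines.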

import tensorflow as tf
model = tf.saved_model.load('./model')
graph_def = model.signatures['serving_default'].graph.as_graph_def()

# Save graph_def as a .pbtxt file
tf.io.write_graph(graph_def, '.', 'model.pbtxt', as_text=True)

The code above saves the model's graph structure. Then open the GRAPH page in tensorboard and upload model.pbtxt to render the graph.

Operator latency

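import tensorflow as tf

# input_datas must be prepared by you: an iterable of dicts that map the
# signature's input names to tensors.

# Load the model before profiling so that only inference is measured.
model = tf.saved_model.load('./model')
infer = model.signatures['serving_default']

with tf.device('/cpu:0'):
    with tf.profiler.experimental.Profile('model-infer-log'):
        for input_data in input_datas:
            predictions = infer(**input_data)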

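Profile writes out a batch of event files. In tensorboard, select profile and, following its instructions, place these event files in the specified directory; you can then see the time spent in each operator.

Model parameters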

import tensorflow as tf

model_path = "model"

model =  tf.saved_model.load(model_path)
model_graph = model.signatures['serving_default'].graph.as_graph_def()

def describe_graph(graph_def, show_nodes=False):
  # print('Input Feature Nodes: {}'.format(
  #     [node.name for node in graph_def.node if node.op=='Placeholder']))
  # print('')
  print('Unused Nodes: {}'.format(
      [node.name for node in graph_def.node if 'unused'  in node.name]))
  print('')
  # print('Output Nodes: {}'.format( 
  #     [node.name for node in graph_def.node if (
  #         'predictions' in node.name or 'softmax' in node.name or '')]))
  # print('')
  print('Quantization Nodes: {}'.format(
      [node.name for node in graph_def.node if 'quant' in node.name]))
  print('')
  print('Constant Count: {}'.format(
      len([node for node in graph_def.node if node.op=='Const'])))
  print('')
  print('Variable Count: {}'.format(
      len([node for node in graph_def.node if 'Variable' in node.op])))
  print('')
  print('Identity Count: {}'.format(
      len([node for node in graph_def.node if node.op=='Identity'])))
  print('', 'Total nodes: {}'.format(len(graph_def.node)), '')

  if show_nodes:
    for node in graph_def.node:
      print('Op:{} - Name: {}'.format(node.op, node.name))
    
describe_graph(model_graph)

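The code above counts how many variables and parameters the current graph has. The most direct sign of a successful optimization is that these counts go down.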
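
To see which kinds of nodes actually disappeared, a per-op histogram is handy. The sketch below is my own helper, not part of TensorFlow. It works on anything with an `.op` attribute, so with a real model you would pass `graph_def.node`; fake nodes are used here to keep it self-contained.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(nodes):
    """Count nodes per op type, e.g. {'Const': 120, 'Identity': 40}."""
    return Counter(node.op for node in nodes)


def histogram_diff(before, after):
    """Return {op: (count_before, count_after)} for ops whose count changed."""
    ops = set(before) | set(after)
    return {op: (before[op], after[op])
            for op in sorted(ops) if before[op] != after[op]}


# Fake nodes standing in for graph_def.node before and after optimization.
nodes_before = [SimpleNamespace(op=op) for op in
                ["Const", "Const", "Identity", "MatMul", "Placeholder"]]
nodes_after = [SimpleNamespace(op=op) for op in
               ["Const", "MatMul", "Placeholder"]]

diff = histogram_diff(op_histogram(nodes_before), op_histogram(nodes_after))
print(diff)  # {'Const': (2, 1), 'Identity': (1, 0)}
```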

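Model optimization

Prune the model

The first step is to determine the model's output nodes. This matters for multi-output models: several outputs may be set up for training, while inference needs only one. Removing the outputs you do not need can sharply reduce the number of nodes visited in the graph, which improves performance.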

signature_def = model.signatures['serving_default']
for output in signature_def.outputs:
    print(output.name)
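
One detail that is easy to trip over: the names printed above are tensor names of the form `<node_name>:<output_index>`, while freeze_graph expects node names as a comma-separated string. A small helper (my own, with made-up example names) can do the conversion:

```python
def to_node_names(tensor_names):
    """Turn tensor names like 'Sigmoid:0' into a 'Sigmoid,...' string."""
    names = []
    for tensor_name in tensor_names:
        node_name = tensor_name.split(":")[0]  # drop the output index
        if node_name not in names:             # keep order, skip duplicates
            names.append(node_name)
    return ",".join(names)


print(to_node_names(["Sigmoid:0"]))                  # Sigmoid
print(to_node_names(["tower/ctr:0", "tower/cvr:0"]))  # tower/ctr,tower/cvr
```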

With the output nodes determined, start optimizing the model:

from tensorflow.python.tools import freeze_graph
from tensorflow.python.saved_model import tag_constants

def freeze_model(saved_model_dir, output_node_names, output_filename):
  initializer_nodes = ''
  freeze_graph.freeze_graph(
      input_saved_model_dir=saved_model_dir,
      output_graph=output_filename,
      saved_model_tags = tag_constants.SERVING,
      output_node_names=output_node_names,
      initializer_nodes=initializer_nodes,
      input_graph=None,
      input_saver=False,
      input_binary=False,
      input_checkpoint=None,
      restore_op_name=None,
      filename_tensor_name=None,
      clear_devices=False,
      input_meta_graph=False,
  )
  print('Graph frozen!')
    
freeze_model(model_path,"Sigmoid","./freeze.pb")

Trim the model

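import tensorflow as tf
from tensorflow.python.tools import optimize_for_inference_lib
from tensorflow.python.framework import dtypes

def get_graph_def_from_file(graph_filepath):
    # Read a frozen binary .pb file back into a GraphDef.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(graph_filepath, 'rb') as f:
        graph_def.ParseFromString(f.read())
    return graph_def

def optimize_graph(graph_filename, output_nodes):
    return optimize_for_inference_lib.optimize_for_inference(
        input_graph_def=get_graph_def_from_file(graph_filename),
        input_node_names=[],
        output_node_names=output_nodes,
        # The library expects the DataType enum value, not the DType object.
        placeholder_type_enum=dtypes.float32.as_datatype_enum
    )


optimize_model_graph = optimize_graph("freeze.pb", ["Sigmoid"])
describe_graph(optimize_model_graph)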

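After these two simple steps, you can run describe_graph above on each result to check the model. When the optimization goes well, the node count drops substantially.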
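
Why does dropping an output shrink the graph so much? Only the nodes reachable from the remaining outputs need to be kept. The toy sketch below shows the idea on a hypothetical two-tower graph written as a plain dict; it is a conceptual illustration, not what the TensorFlow tools run internally.

```python
from collections import deque


def reachable_nodes(inputs_of, output_names):
    """Walk backwards from the outputs and collect every node they need."""
    seen = set()
    queue = deque(output_names)
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        queue.extend(inputs_of.get(name, []))
    return seen


# Hypothetical multi-output graph: node name -> names of its input nodes.
toy_graph = {
    "user_emb": [], "item_emb": [],
    "concat": ["user_emb", "item_emb"],
    "ctr_logit": ["concat"], "ctr": ["ctr_logit"],
    "cvr_tower": ["concat"], "cvr": ["cvr_tower"],
}

kept = reachable_nodes(toy_graph, ["ctr"])
print(sorted(kept))  # ['concat', 'ctr', 'ctr_logit', 'item_emb', 'user_emb']
```

Keeping only the `ctr` output leaves 5 of the 7 nodes; the whole `cvr` branch is gone.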

Export the model

Export the model once optimization is complete.

def convert_graph_def_to_saved_model(export_dir, graph_def, signature_def):
    if tf.compat.v1.gfile.Exists(export_dir):
        tf.compat.v1.gfile.DeleteRecursively(export_dir)
    with tf.compat.v1.Session(graph=tf.Graph()) as session:
        tf.import_graph_def(graph_def, name='')
        tf.compat.v1.saved_model.simple_save(
            session,
            export_dir,
            inputs={
                node.name: session.graph.get_tensor_by_name(node.name) for node in signature_def.inputs if tf.dtypes.resource != node.dtype},
            outputs={node.name: session.graph.get_tensor_by_name(node.name) for node in signature_def.outputs }
        )
        print('Optimized graph converted to SavedModel!')

convert_graph_def_to_saved_model("optimized-model",optimize_model_graph, model.signatures['serving_default'])
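
To check whether the exported model is really faster, measure RT the same way before and after. Below is a minimal timing sketch using only the standard library. `dummy_predict` is a placeholder so the example runs on its own; with a real model you would pass something like `lambda req: infer(**req)` instead.

```python
import statistics
import time


def measure_latency_ms(predict_fn, requests, warmup=10):
    """Time each call and report mean, p50 and p99 latency in milliseconds."""
    for request in requests[:warmup]:  # warm-up calls are not measured
        predict_fn(request)
    samples = []
    for request in requests:
        start = time.perf_counter()
        predict_fn(request)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p99_index = min(len(samples) - 1, int(len(samples) * 0.99))
    return {
        "mean": statistics.mean(samples),
        "p50": statistics.median(samples),
        "p99": samples[p99_index],
    }


def dummy_predict(request):
    """Placeholder for a real model call."""
    return sum(x * x for x in request)


stats = measure_latency_ms(dummy_predict, [list(range(1000))] * 200)
print(stats)
```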