TensorRT command-line tool trtexec: common usage

Overview

After installing TensorRT, go to the /usr/src/tensorrt/bin directory; there you will find a binary executable named trtexec. Running ./trtexec --help prints all the command-line options it supports:

=== Model Options ===
  --uff=<file>                UFF model
  --onnx=<file>               ONNX model
  --model=<file>              Caffe model (default = no model, random weights used)
  --deploy=<file>             Caffe prototxt file
  --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
  --uffInput=<name>,X,Y,Z     Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
  --uffNHWC                   Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)

=== Build Options ===
  --maxBatch                  Set max batch size and build an implicit batch engine (default = 1)
  --explicitBatch             Use explicit batch sizes when building the engine (default = implicit)
  --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
  --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
  --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided
  --minShapesCalib=spec       Calibrate with dynamic shapes using a profile with the min shapes provided
  --optShapesCalib=spec       Calibrate with dynamic shapes using a profile with the opt shapes provided
  --maxShapesCalib=spec       Calibrate with dynamic shapes using a profile with the max shapes provided
                              Note: All three of min, opt and max shapes must be supplied.
                                    However, if only opt shapes is supplied then it will be expanded so
                                    that min shapes and max shapes are set to the same values as opt shapes.
                                    In addition, use of dynamic shapes implies explicit batch.
                                    Input names can be wrapped with escaped single quotes (ex: 'Input:0').
                              Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
                              Each input shape is supplied as a key-value pair where key is the input name and
                              value is the dimensions (including the batch dimension) to be used for that input.
                              Each key-value pair has the key and value separated using a colon (:).
                              Multiple input shapes can be provided via comma-separated key-value pairs.
  --inputIOFormats=spec       Type and formats of the input tensors (default = all inputs in fp32:chw)
                              Note: If this option is specified, please make sure that all inputs are in the same order
                                     as network inputs ID.
  --outputIOFormats=spec      Type and formats of the output tensors (default = all outputs in fp32:chw)
                              Note: If this option is specified, please make sure that all outputs are in the same order
                                     as network outputs ID.
                              IO Formats: spec  ::= IOfmt[","spec]
                                          IOfmt ::= type:fmt
                                          type  ::= "fp32"|"fp16"|"int32"|"int8"
                                          fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32")["+"fmt]
  --workspace=N               Set workspace size in megabytes (default = 16)
  --noBuilderCache            Disable timing cache in builder (default is to enable timing cache)
  --nvtxMode=[default|verbose|none] Specify NVTX annotation verbosity
  --minTiming=M               Set the minimum number of iterations used in kernel selection (default = 1)
  --avgTiming=M               Set the number of times averaged in each iteration for kernel selection (default = 8)
  --noTF32                    Disable tf32 precision (default is to enable tf32, in addition to fp32)
  --fp16                      Enable fp16 precision, in addition to fp32 (default = disabled)
  --int8                      Enable int8 precision, in addition to fp32 (default = disabled)
  --best                      Enable all precisions to achieve the best performance (default = disabled)
  --calib=<file>              Read INT8 calibration cache file
  --safe                      Only test the functionality available in safety restricted flows
  --saveEngine=<file>         Save the serialized engine
  --loadEngine=<file>         Load a serialized engine

=== Inference Options ===
  --batch=N                   Set batch size for implicit batch engines (default = 1)
  --shapes=spec               Set input shapes for dynamic shapes inference inputs.
                              Note: Use of dynamic shapes implies explicit batch.
                                    Input names can be wrapped with escaped single quotes (ex: 'Input:0').
                              Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
                              Each input shape is supplied as a key-value pair where key is the input name and
                              value is the dimensions (including the batch dimension) to be used for that input.
                              Each key-value pair has the key and value separated using a colon (:).
                              Multiple input shapes can be provided via comma-separated key-value pairs.
  --loadInputs=spec           Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
                              Input values spec ::= Ival[","spec]
                                           Ival ::= name":"file
  --iterations=N              Run at least N inference iterations (default = 10)
  --warmUp=N                  Run for N milliseconds to warmup before measuring performance (default = 200)
  --duration=N                Run performance measurements for at least N seconds wallclock time (default = 3)
  --sleepTime=N               Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
  --streams=N                 Instantiate N engines to use concurrently (default = 1)
  --exposeDMA                 Serialize DMA transfers to and from device. (default = disabled)
  --useSpinWait               Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
  --threads                   Enable multithreading to drive engines with independent threads (default = disabled)
  --useCudaGraph              Use cuda graph to capture engine execution and then launch inference (default = disabled)
  --buildOnly                 Skip inference perf measurement (default = disabled)

=== Build and Inference Batch Options ===
                              When using implicit batch, the max batch size of the engine, if not given,
                              is set to the inference batch size;
                              when using explicit batch, if shapes are specified only for inference, they
                              will be used also as min/opt/max in the build profile; if shapes are
                              specified only for the build, the opt shapes will be used also for inference;
                              if both are specified, they must be compatible; and if explicit batch is
                              enabled but neither is specified, the model must provide complete static
                              dimensions, including batch size, for all inputs

=== Reporting Options ===
  --verbose                   Use verbose logging (default = false)
  --avgRuns=N                 Report performance measurements averaged over N consecutive iterations (default = 10)
  --percentile=P              Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
  --dumpOutput                Print the output tensor(s) of the last inference iteration (default = disabled)
  --dumpProfile               Print profile information per layer (default = disabled)
  --exportTimes=<file>        Write the timing results in a json file (default = disabled)
  --exportOutput=<file>       Write the output tensors to a json file (default = disabled)
  --exportProfile=<file>      Write the profile information per layer in a json file (default = disabled)

=== System Options ===
  --device=N                  Select cuda device N (default = 0)
  --useDLACore=N              Select DLA core N for layers that support DLA (default = none)
  --allowGPUFallback          When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
  --plugins                   Plugin library (.so) to load (can be specified multiple times)

=== Help ===
  --help, -h                  Print this message

   In most cases, trtexec is used to parse an ONNX file exported from a PyTorch model, a UFF file generated from TensorFlow, or a Caffe caffemodel/prototxt pair, build an engine file from it, and then run inference with that engine to check the results.

   Take ONNX as an example:

# Parse the ONNX file into an engine file, run inference, and write the results to result.json
./trtexec --onnx=efficientdet-d0-s.onnx --loadInputs='data':o4_clip-1_raw_data.bin --saveEngine=efficientdet-d0-s.engine  --output='regression','classification','anchors' --exportOutput=result.json

# For ONNX models --output does not have to be specified, but --loadInputs must be specified for meaningful inference
./trtexec --onnx=efficientdet-d0-s.onnx --loadInputs='data':o4_clip-1_raw_data.bin --saveEngine=efficientdet-d0-s.engine  --exportOutput=trtexec-result.json

# Load an existing engine file directly for inference and write the output data to trtexec-result.json
./trtexec --loadEngine=efficientdet-d0-s.engine  --loadInputs='data':o4_clip-1_raw_data.bin  --exportOutput=trtexec-result.json

The --onnx, --saveEngine and --loadEngine options are self-explanatory. --output specifies the names of the network's output nodes; when there are several outputs, separate the names with commas. It does not need to be specified for ONNX models, only for UFF and Caffe models. --exportOutput writes the inference results (the data of every output node) to the given file in JSON format.
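The help text does not spell out the JSON layout that --exportOutput writes, so the easiest way to see what you got is to load the file and inspect it (a minimal sketch, making no assumption about the schema; result.json is the file produced by the first command above):

import json

with open("result.json") as f:
    result = json.load(f)

# Show the top-level structure; summarize long value lists instead of dumping them
print(type(result).__name__)
if isinstance(result, list):
    for entry in result:
        print({k: (f"list[{len(v)}]" if isinstance(v, list) else v) for k, v in entry.items()})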

The --loadInputs option deserves a closer look, because the help text is not very explicit about it:

--loadInputs=spec           Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
                              Input values spec ::= Ival[","spec]
                                           Ival ::= name":"file

--loadInputs specifies the model's input data as one or more name:file key-value pairs separated by commas. Here name is the name of an input node, e.g. "input:0" or "data"; obviously a network with a single input needs only one name:file pair, and multiple pairs are needed only when the network has several inputs. But what should file contain? The help text does not say. At first I assumed it was the name of the image file holding the input data, so I specified it like this:

     --loadInputs='data':o4_clip-1.jpg

The inference results were very poor. I then dug through the TensorRT source and traced down to where the loadInputs argument is parsed and the value is read: it reads binary data from file directly, with no other format handling, which means file must be a raw binary data file. How do you get the network's input data into such a file? One way is to add code to your own pipeline that writes the preprocessed input tensor to a file in binary form. A second, more convenient way is to use numpy's tofile() function (do not use numpy's save() function; save() writes the NumPy .npy format, not raw binary!) to dump the preprocessed data to a raw binary file such as o4_clip-1_raw_data.bin (a minimal sketch follows the command below), and then pass that file to trtexec:

     --loadInputs='data':o4_clip-1_raw_data.bin
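A minimal sketch of the idea (assuming the preprocessed input is a 1x3x512x512 fp32 NCHW array; the input name 'data' and the file name here are placeholders, and the full project-specific example comes later in this post):

import numpy as np

# x stands for the already-preprocessed network input (normalized, NCHW); random values here just for illustration
x = np.random.rand(1, 3, 512, 512).astype(np.float32)

# np.save("input.npy", x)          # wrong for trtexec: .npy files start with a header, not raw data
x.tofile("input_raw_data.bin")     # dumps only the raw bytes, which is what --loadInputs reads

You would then run trtexec with --loadInputs='data':input_raw_data.bin.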

Only then did the detection results look reasonably normal. Why "reasonably normal" rather than "exactly as expected"? Because inference with the engine TensorRT builds from the ONNX file is less accurate than running the original .pt model with PyTorch or the same ONNX model with onnxruntime: the score of the same object in the same image drops noticeably. I reported this problem to NVIDIA last year, but it is still unresolved. Here is an example from a project of mine detecting baggage at an airport:

Figure 1

Figure 2

As you can see, with onnxruntime on the ONNX model or PyTorch on the .pt model, the suitcase at the top right of Figure 1 gets a score of 0.958, while the engine built by TensorRT from the same ONNX gives the same suitcase a score of only 0.148 (Figure 2), and that is without even adding --fp16 to shrink the model! If a score threshold is applied when displaying the results, that suitcase simply disappears from the detections. The gap is really substantial, and NVIDIA has not fixed the problem. Deploying models at the edge can be genuinely frustrating when hardware or software defects like this degrade recognition accuracy, especially when your manager or customer does not understand that running a model on an edge device and running it on a PC or server are two different worlds.
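To quantify the gap yourself, one way is to feed the exact same raw binary input to onnxruntime and compare its outputs with what trtexec writes via --exportOutput. A sketch under the assumptions used above (input tensor named "data", d0 input size 1x3x512x512 fp32):

import numpy as np
import onnxruntime as ort

# Reuse exactly the same preprocessed data that was dumped for trtexec
x = np.fromfile("o4_clip-1_raw_data.bin", dtype=np.float32).reshape(1, 3, 512, 512)

sess = ort.InferenceSession("efficientdet-d0-s.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, {"data": x})  # for this model: regression, classification, anchors

for meta, out in zip(sess.get_outputs(), outputs):
    print(meta.name, out.shape, out.dtype)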

 Here is part of the actual code that uses numpy's tofile() to save the model's preprocessed input data:

...  # imports and helpers elided (torch, numpy, plus the project's preprocess, EfficientDetBackbone and to_numpy)
compound_coef = 0
force_input_size = None  # set None to use default size
img_path = 'o4_clip-1.png'

use_cuda = True
use_float16 = False
cudnn.fastest = True
cudnn.benchmark = True

obj_list = ['baggage']

input_sizes = [512, 640, 768, 896, 1024, 1280, 1280, 1536]
input_size = input_sizes[compound_coef] if force_input_size is None else force_input_size
ori_imgs, framed_imgs, framed_metas = preprocess(img_path, max_size=input_size)

if use_cuda:
    x = torch.stack([torch.from_numpy(fi).cuda() for fi in framed_imgs], 0)
else:
    x = torch.stack([torch.from_numpy(fi) for fi in framed_imgs], 0)

x = x.to(torch.float32 if not use_float16 else torch.float16).permute(0, 3, 1, 2)

model = EfficientDetBackbone(compound_coef=compound_coef, num_classes=len(obj_list))
model.load_state_dict(torch.load(f'logs/airport/efficientdet-d0_499_79500.pth'))
model.requires_grad_(False)
model.eval()

if use_cuda:
    model = model.cuda()
if use_float16:
    model = model.half()

with torch.no_grad():
    # features, regression, classification, anchors = model(x)
    regression, classification, anchors = model(x)
    nx = to_numpy(x)
    # np.save("raw_data_numpy.npy", nx)  # don't: np.save writes the .npy format with a header, not raw bytes
    nx.tofile("o4_clip-1_raw_data.bin")  # raw binary dump of the preprocessed fp32 NCHW input tensor
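
A quick sanity check on the generated .bin file (a sketch assuming the d0 input of 1x3x512x512 fp32 as in the code above): a raw dump has no header, so its size should be exactly the element count times 4 bytes.

import os
import numpy as np

expected = 1 * 3 * 512 * 512 * np.dtype(np.float32).itemsize  # 3145728 bytes for a 1x3x512x512 fp32 tensor
actual = os.path.getsize("o4_clip-1_raw_data.bin")
print(expected, actual, "OK" if expected == actual else "size mismatch, probably not a raw dump")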
