Overview
This article is mainly a translation and summary of the README.md that ships with the official INT8 quantization sample.
Part 1: Translation
Description
The sampleINT8 sample performs INT8 calibration and inference.
Specifically, this sample demonstrates how to perform inference in INT8. INT8 inference is available only on GPUs with compute capability 6.1 or 7.x. After the network has been calibrated, the calibration output is cached to avoid repeating the process. You can then reproduce your own experiments with any deep learning framework in order to validate your results on ImageNet networks.
How does this sample work?
An INT8 engine is built from a 32-bit network definition (the caffemodel), just like an FP32 or FP16 engine, but with a few extra steps. Specifically, the builder and the network must be configured for INT8, and the builder needs to know the dynamic range of every tensor. The INT8 calibrator determines how best to represent the weights and activations in INT8 and sets the per-tensor dynamic ranges accordingly. Alternatively, you can set the dynamic range of every tensor yourself; that approach is covered in sampleINT8API and sketched briefly below.
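As a rough illustration of that manual alternative (what sampleINT8API does), here is a minimal sketch; the helper name setTensorDynamicRanges and the idea of a precomputed name-to-maximum-absolute-value map are assumptions for illustration, not code from this sample:

#include <NvInfer.h>
#include <string>
#include <unordered_map>

using namespace nvinfer1;

// Hypothetical helper: apply externally determined per-tensor ranges to a parsed network.
// "ranges" maps a tensor name to the maximum absolute activation value expected for it.
void setTensorDynamicRanges(INetworkDefinition& network,
                            const std::unordered_map<std::string, float>& ranges)
{
    for (int i = 0; i < network.getNbLayers(); ++i)
    {
        ILayer* layer = network.getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            ITensor* tensor = layer->getOutput(j);
            auto it = ranges.find(tensor->getName());
            if (it != ranges.end())
            {
                // TensorRT expects a symmetric range; the INT8 scale is derived from it.
                tensor->setDynamicRange(-it->second, it->second);
            }
        }
    }
}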
This sample is based on mnist_caffe.
Specifically, this sample includes the following steps:
- Defines the network
- Sets up the calibrator
- Configures the builder
- Builds the engine
- Runs the engine
- Verifies the output
Defining the network
Defining an INT8 execution of a network is exactly the same as for any other precision: the weights should be imported as FP32 values, and the builder will calibrate the network to find appropriate quantization factors to reduce the network to INT8 precision. This sample imports the network using the NvCaffe parser:
const IBlobNameToTensor* blobNameToTensor =
parser->parse(locateFile(deployFile).c_str(),
locateFile(modelFile).c_str(),
*network,
DataType::kFLOAT);
Setup the calibrator
Calibration is an additional step required when building an INT8 network. You must provide sample input, in other words calibration data. TensorRT then runs FP32 forward passes, collects statistics about the intermediate activation layers, and uses them to build the INT8 engine.
Calibration data
Calibration must be performed on images that are representative of the actual use case. Since this sample is based on Caffe, all image preprocessing (scaling, cropping, mean subtraction) is done before the network forward pass. This sample uses a utility class, BatchStream, to read the calibration files and convert them into input suitable for calibration. Generating these files is discussed in Batch files for calibration.
You can create the calibration data stream as follows:
BatchStream calibrationStream(CAL_BATCH_SIZE, NB_CAL_BATCHES);
The BatchStream class provides helper methods used to retrieve batch data. The calibrator uses the BatchStream object to retrieve batch data while calibrating. Typically, the BatchStream class should provide implementations of getBatch() and getBatchSize(), which can be invoked from IInt8Calibrator::getBatch() and IInt8Calibrator::getBatchSize(). Ideally, you can write your own custom BatchStream class to provide calibration data. For more information, see BatchStream.h.
Note: the calibration data must be representative of the problem. For example, if your classification data covers 1000 classes, you cannot calibrate with data from only 10 of them.
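For illustration only, here is a minimal sketch of what a custom batch-stream class could look like, with the two methods the calibrator needs; the class name, the raw-float batch file format, and the "batch0", "batch1", ... naming scheme are assumptions, not the sample's actual BatchStream:

#include <fstream>
#include <string>
#include <vector>

// Illustrative batch stream (not the BatchStream shipped with the sample). It assumes each
// batch file stores raw float32 values for batchSize images of imageVolume elements each,
// saved as "batch0", "batch1", ... under a given directory.
class SimpleBatchStream
{
public:
    SimpleBatchStream(int batchSize, int maxBatches, int imageVolume, std::string dir)
        : mBatchSize(batchSize), mMaxBatches(maxBatches), mDir(std::move(dir))
    {
        mBatch.resize(static_cast<size_t>(batchSize) * imageVolume);
    }

    int getBatchSize() const { return mBatchSize; }

    // Loads the next batch into host memory; returns false once the data is exhausted.
    bool next()
    {
        if (mBatchCount >= mMaxBatches)
            return false;
        std::ifstream file(mDir + "/batch" + std::to_string(mBatchCount), std::ios::binary);
        if (!file)
            return false;
        file.read(reinterpret_cast<char*>(mBatch.data()), mBatch.size() * sizeof(float));
        ++mBatchCount;
        return static_cast<bool>(file);
    }

    // Pointer to the current batch in host memory; the calibrator copies it to the GPU.
    const float* getBatch() const { return mBatch.data(); }

private:
    int mBatchSize{0};
    int mMaxBatches{0};
    int mBatchCount{0};
    std::string mDir;
    std::vector<float> mBatch;
};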
Calibrator interface
The sample must implement the IInt8Calibrator interface to provide calibration data and helper functions for reading and writing the calibration table file.
We can create the calibrator object like this:
std::unique_ptr<IInt8Calibrator> calibrator;
TensorRT provides three implementations of IInt8Calibrator: IInt8EntropyCalibrator, IInt8EntropyCalibrator2, and IInt8LegacyCalibrator. See NvInfer.h for more information on the IInt8Calibrator interface and its variants.
This sample uses IInt8EntropyCalibrator2 by default. We can set the calibrator to use IInt8EntropyCalibrator2 as follows:
calibrator.reset(new Int8EntropyCalibrator2(calibrationStream, FIRST_CAL_BATCH, gNetworkName, INPUT_BLOB_NAME));
To perform calibration, the interface must implement getBatchSize() and getBatch() to retrieve data from the BatchStream object.
The builder calls getBatchSize() once, at the start of calibration, to obtain the batch size of the calibration set. getBatch() is then called repeatedly to obtain batches of data until it returns false. Every calibration batch must contain exactly the number of images specified as the batch size.
bool getBatch(void* bindings[], const char* names[], int nbBindings) override
{
    if (!mStream.next())
        return false;

    CHECK(cudaMemcpy(mDeviceInput, mStream.getBatch(), mInputCount * sizeof(float), cudaMemcpyHostToDevice));
    assert(!strcmp(names[0], INPUT_BLOB_NAME));
    bindings[0] = mDeviceInput;
    return true;
}
For each input tensor, a pointer to the input data in GPU memory must be written into the bindings array. The names array contains the names of the input tensors; the position of each tensor in the bindings array must match the position of its name in the names array, and both arrays have size nbBindings.
Because calibration is an expensive process, you can choose to provide an implementation of writeCalibrationCache() to write the calibration table to an appropriate location for later reuse, and then implement readCalibrationCache() to read the calibration table back from that location.
During calibration, the builder checks whether a calibration table exists using readCalibrationCache(). The builder re-runs calibration if any of the following is true:
- the calibration table file does not exist
- it is incompatible with the current version of TensorRT
- it is incompatible with the calibrator variant that generated it
For more information about the IInt8Calibrator interface, see EntropyCalibrator.h.
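As a rough sketch of the caching pair described above (modeled on what EntropyCalibrator.h does, but with illustrative class and member names), an implementation might look like the following; in a real calibrator these two methods override the IInt8EntropyCalibrator2 virtuals:

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Illustrative caching logic only; not the sample's exact class.
class CachingCalibratorSketch
{
public:
    explicit CachingCalibratorSketch(std::string cacheFile)
        : mCacheFileName(std::move(cacheFile)) {}

    // Return a pointer to the cached table (and its length), or nullptr to force recalibration.
    const void* readCalibrationCache(std::size_t& length)
    {
        mCalibrationCache.clear();
        std::ifstream input(mCacheFileName, std::ios::binary);
        if (input.good())
        {
            std::copy(std::istreambuf_iterator<char>(input), std::istreambuf_iterator<char>(),
                      std::back_inserter(mCalibrationCache));
        }
        length = mCalibrationCache.size();
        return length ? mCalibrationCache.data() : nullptr;
    }

    // Persist the table the builder produced so later builds can skip calibration.
    void writeCalibrationCache(const void* cache, std::size_t length)
    {
        std::ofstream output(mCacheFileName, std::ios::binary);
        output.write(reinterpret_cast<const char*>(cache), static_cast<std::streamsize>(length));
    }

private:
    std::string mCacheFileName;
    std::vector<char> mCalibrationCache;
};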
Calibration file
The calibration file stores an activation scale for each tensor in the network. The scales are derived from the dynamic range generated by the calibration algorithm, in other words abs(max_dynamic_range) / 127.0f.
The calibration file is called CalibrationTable<NetworkName>, where <NetworkName> is the name of your network, for example mnist. The file is located in the TensorRT-x.x.x.x/data/mnist directory.
If the CalibrationTable file is not found, the builder will run the calibration algorithm to create it. The contents of the CalibrationTable look like this:
TRT-5100-EntropyCalibration2
data: 3c000889
conv1: 3c8954be
pool1: 3c8954be
conv2: 3dd33169
pool2: 3dd33169
ip1: 3daeff07
ip2: 3e7d50ec
prob: 3c010a14
Where:
- <TRT-xxxx>-<xxxxxxx> is the TensorRT version followed by the calibration algorithm, for example EntropyCalibration2.
- <layer name> : value is the floating-point activation scale determined during calibration for each tensor in the network.
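As an aside, the values above look like the hexadecimal bit patterns of 32-bit floats holding the per-tensor scale; assuming that format (an assumption, not something stated in the sample), a tiny program can decode one entry and recover the tensor's approximate dynamic range as scale * 127:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // "data: 3c000889" from the table above, read as the bit pattern of an IEEE-754 float.
    const std::uint32_t bits = 0x3c000889u;
    float scale;
    std::memcpy(&scale, &bits, sizeof(scale));
    std::printf("scale = %g, approx dynamic range = %g\n", scale, scale * 127.0f);
    return 0;
}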
The CalibrationTable file is generated during the build phase while running the calibration algorithm. After the calibration file has been created, it can be read for subsequent runs without running calibration again. You can provide an implementation of readCalibrationCache() to load the calibration file from a desired location. If the calibration file that is read is compatible with the calibrator type that generated it and with the current TensorRT version, the builder skips the calibration step and uses the per-tensor scale values from the calibration file instead.
The following sections show the code for each step.
Configuring the builder
- Ensure that INT8 inference is supported on the platform:
  if (!builder->platformHasFastInt8()) return false;
- Enable INT8 mode. Setting this flag ensures that the builder auto-tuner will consider INT8 implementations.
  builder->setInt8Mode(true);
- Pass the calibrator object (calibrator) to the builder.
  builder->setInt8Calibrator(calibrator);
Building the engine
After we configure the builder with INT8 mode and calibrator, we can build the engine similar to any FP32 engine.
ICudaEngine* engine = builder->buildCudaEngine(*network);
Running the engine
After the engine has been built, it can be used just like an FP32 engine. For example, inputs and outputs remain in 32-bit floating point.
- Retrieve the names of the input and output tensors to bind the buffers.
  inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME), outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
- Allocate memory for input and output buffers.
  CHECK(cudaMalloc(&buffers[inputIndex], inputSize));
  CHECK(cudaMalloc(&buffers[outputIndex], outputSize));
  CHECK(cudaMemcpy(buffers[inputIndex], input, inputSize, cudaMemcpyHostToDevice));
- Create a CUDA stream and run inference.
  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));
  context.enqueue(batchSize, buffers, stream, nullptr);
- Copy the CUDA buffer output to CPU output buffers for post processing.
  CHECK(cudaMemcpy(output, buffers[outputIndex], outputSize, cudaMemcpyDeviceToHost));
Verifying the output
This sample outputs Top-1 and Top-5 metrics for both FP32 and INT8 precision, as well as for FP16 if it is natively supported by the hardware. These numbers should be within 1% of each other.
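For reference, here is a small sketch (not the sample's code) of how a Top-K hit can be scored for one image from the network's class scores:

#include <cstddef>
#include <vector>

// Returns true if the true label is among the k highest-scoring classes.
bool isTopK(const std::vector<float>& scores, int trueLabel, int k)
{
    int better = 0;
    for (std::size_t i = 0; i < scores.size(); ++i)
    {
        if (scores[i] > scores[trueLabel])
            ++better;
    }
    return better < k;
}

Averaging isTopK(..., 1) and isTopK(..., 5) over the test set gives Top-1 and Top-5 numbers like those reported above.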
TensorRT API layers and ops
In this sample, the following layers are used. For more information about these layers, see the TensorRT Developer Guide: Layers documentation.
Activation layer
The Activation layer implements element-wise activation functions.
Convolution layer
The Convolution layer computes a 2D (channel, height, and width) convolution, with or without bias.
FullyConnected layer
The FullyConnected layer implements a matrix-vector product, with or without bias.
SoftMax layer
The SoftMax layer applies the SoftMax function on the input tensor along an input dimension specified by the user.
Batch files for calibration
Download the MNIST dataset
- This sample requires the training set and training labels.
- Unzip the files obtained above using the gunzip utility. For example, gunzip t10k-labels-idx1-ubyte.gz.
- Lastly, copy these files to the <TensorRT root directory>/samples/data/int8/mnist/ directory.
Running the sample
- Compile this sample by running make in the <TensorRT root directory>/samples/sampleINT8 directory. The binary named sample_int8 will be created in the <TensorRT root directory>/bin directory.
  cd <TensorRT root directory>/samples/sampleINT8
  make
  Where <TensorRT root directory> is where you installed TensorRT.
- Run the sample on MNIST.
  ./sample_int8 mnist
- Verify that the sample ran successfully. If the sample runs successfully you should see output similar to the following:
&&&& RUNNING TensorRT.sample_int8 # ./sample_int8 mnist
[I] FP32 run:400 batches of size 100 starting at 100
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Top1: 0.9904, Top5: 1
[I] Processing 40000 images averaged 0.00170236 ms/image and 0.170236 ms/batch.
[I] FP16 run:400 batches of size 100 starting at 100
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Top1: 0.9904, Top5: 1
[I] Processing 40000 images averaged 0.00128872 ms/image and 0.128872 ms/batch.
INT8 run:400 batches of size 100 starting at 100
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Processing next set of max 100 batches
[I] Top1: 0.9908, Top5: 1
[I] Processing 40000 images averaged 0.000946117 ms/image and 0.0946117 ms/batch.
&&&& PASSED TensorRT.sample_int8 # ./sample_int8 mnist
This output shows that the sample ran successfully; PASSED.
Sample --help options
To see the full list of available options and their descriptions, use the -h or --help command line option.
Additional resources
The following resources provide a deeper understanding of how to perform inference in INT8 using custom calibration:
INT8:
- 8-bit Inference with TensorRT
- INT8 Calibration Using C++
Models:
- MNIST lenet.prototxt
Blogs:
- Fast INT8 Inference for Autonomous Vehicles with TensorRT 3
- Low Precision Inference with TensorRT
- 8-Bit Quantization and TensorFlow Lite: Speeding up Mobile Inference with Low Precision
Videos:
- Inference and Quantization
- 8-bit Inference with TensorRT Webinar
Documentation:
- Introduction to NVIDIA’s TensorRT Samples
- Working with TensorRT Using the C++ API
- NVIDIA’s TensorRT Documentation Library
Part 2: Summary
To be continued…