January 24th, 2022
The whole code is available here.
After completing the simplified introduction to Tensor-RT, you might be wondering: is that all there is to object detection with the YOLOX model?
Well, unfortunately, the answer is no!
In order to make the inference work, we are gonna need two additional processes: preprocessing and postprocessing.
According to YOLOX's official ONNXRuntime inference demo, the preprocess consists of resizing the image to the given input size while maintaining the aspect ratio, and filling the blank space with zeros. On the other hand, the postprocess consists of two parts: converting the raw output into bounding boxes + score threshold + class names, and running NMS to filter out redundant boxes.
Image 1. Full Summary of Object Detection with Tensor-RT.
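To make the preprocess step more concrete, here is a minimal sketch of a letterbox-style resize based on the description above. This is not the exact code from the repository; the function name `letterbox_preprocess` and the fixed 640x640 input size are just illustrative:

```python
import cv2
import numpy as np

def letterbox_preprocess(img, input_size=(640, 640)):
    """Resize while keeping the aspect ratio, fill the blank space with zeros."""
    padded = np.zeros((input_size[0], input_size[1], 3), dtype=np.uint8)
    ratio = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
    new_h, new_w = int(img.shape[0] * ratio), int(img.shape[1] * ratio)
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    padded[:new_h, :new_w] = resized
    # HWC -> CHW, float32, contiguous memory for the network input
    padded = padded.transpose(2, 0, 1).astype(np.float32)
    return np.ascontiguousarray(padded), ratio
```

The returned `ratio` is kept around so the predicted boxes can later be scaled back to the original image coordinates.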

In my implementation, I simply split everything into the three parts mentioned above: preprocessing -> inference -> postprocessing (as shown in trt_main_complete.py, lines 259-279):
ylrunner = YOLOX_runner()  # instantiate the YOLOX interface runner

# preprocess
img_prep, ratio = ylrunner.preprocess(img, shapetuple)
img_prep = img_prep.astype(dtype)[None, ...]

if 'onnx' != USE_MODE:
    # inference with Tensor-RT
    result_infer = model(img_prep, cfg.batch_size)
else:
    # inference with ONNX
    result_infer = model.run(None, {inpname: img_prep})

# post process
result_pp = ylrunner.demo_postprocess(result_infer[0].reshape(img_prep.shape[0], -1, 85), shapetuple)
result_out = ylrunner.filter_with_nms(result_pp[0], nms_threshold=cfg.nms_threshold, score_threshold=cfg.score_threshold)
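Conceptually, the filtering step above boils down to plain score thresholding followed by greedy NMS. Below is a minimal NumPy sketch of that idea (class-agnostic for brevity, and not the exact implementation from the repository; the names `nms` and `filter_detections` are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy NMS over boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # overlap of the current best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap too much with the kept one
        order = order[1:][iou <= iou_threshold]
    return keep

def filter_detections(boxes, scores, class_ids, score_threshold, nms_threshold):
    """Score thresholding followed by (class-agnostic) NMS."""
    mask = scores > score_threshold
    boxes, scores, class_ids = boxes[mask], scores[mask], class_ids[mask]
    keep = nms(boxes, scores, nms_threshold)
    return boxes[keep], scores[keep], class_ids[keep]
```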
Here, we also want to see how the whole pipeline fares. To do that, we add a function .vis that essentially draws the detection output back onto the input image (a minimal sketch of such a helper follows the table below). Here are some examples using royalty-free images:
| Input Image | Tensor-RT Output (FP32) |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
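The drawing helper itself does not need to be anything fancy. A rough sketch of what a .vis-style function does (illustrative only, not the repository's actual implementation) could be:

```python
import cv2

def vis(img, boxes, scores, class_ids, class_names, score_threshold=0.3):
    """Draw the final detections back onto the original image."""
    for box, score, cls_id in zip(boxes, scores, class_ids):
        if score < score_threshold:
            continue
        x1, y1, x2, y2 = map(int, box)
        label = f"{class_names[int(cls_id)]}: {score:.2f}"
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(img, label, (x1, max(0, y1 - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return img
```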
Here, we also compare the outputs of Tensor-RT FP32, Tensor-RT FP16, ONNX FP32, and ONNX FP16. The results are as follows (best viewed by opening the images in a new tab):
| ONNX output FP32 | ONNX output FP16 | Tensor-RT output FP32 | Tensor-RT output FP16 |
|---|---|---|---|
| | | | |
As you can see, the results of Tensor-RT and ONNX, whether using FP32 or FP16, are similar. Then, what's the catch?
In the last part, I also compare the inference speed of Tensor-RT FP32, Tensor-RT FP16, ONNX FP32, and ONNX FP16. The results are as follows (best viewed by opening the images in a new tab):
| ONNX output FP32 | ONNX output FP16 | Tensor-RT output FP32 | Tensor-RT output FP16 |
|---|---|---|---|
| | | | |
Based on the inference log, we can see that the inference speed of ONNX-FP16 is faster than its FP32 counterpart. Interestingly, however, the NMS step of ONNX-FP16 is slower than FP32's (total: 34 ms vs 30 ms). On the other hand, we don't see this slowdown on the Tensor-RT side (both the FP32 and FP16 NMS perform similarly to ONNX-FP32's NMS). We can also see that Tensor-RT FP16 is faster than FP32 by almost 10 ms (~17 ms vs ~25 ms).
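As a side note, if you want to reproduce this kind of per-stage measurement yourself, wrapping each stage with a high-resolution timer is enough. This is just a sketch (the helper `timed` is illustrative), not the logging code that produced the numbers above:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took in milliseconds, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{label}: {elapsed_ms:.2f} ms")
    return result

# example usage (ylrunner / model as in the snippet above):
# img_prep, ratio = timed("preprocess", ylrunner.preprocess, img, shapetuple)
# result_infer = timed("inference", model, img_prep, cfg.batch_size)
```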
In terms of model size, the ONNXRuntime models are slightly smaller than their Tensor-RT counterparts, by around ~5 MB. A snapshot of the model sizes is as follows:
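If you want to check the sizes on your own machine, something as simple as os.path.getsize over the exported files works; the file names below are hypothetical placeholders for wherever your exported models live:

```python
import os

# hypothetical paths; replace them with your own exported model files
for path in ["yolox_s.onnx", "yolox_s_fp16.onnx", "yolox_s.trt", "yolox_s_fp16.trt"]:
    if os.path.exists(path):
        print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")
```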
In summary, the ranking based on speed (fastest first) is as follows:
1. Tensor-RT FP16 (~17 ms)
2. Tensor-RT FP32 (~25 ms)
3. ONNX FP32 (~30 ms)
4. ONNX FP16 (~34 ms)
The ranking based on model size is as follows:
Hope this explanation helps. If there are any questions or mistakes in the content, please don't hesitate to let me know. See you in the next blog and stay safe!