There is no black magic; everything follows the rules.

What do deep learning serving frameworks do?

  • respond to requests (RESTful HTTP or RPC)
  • model inference (with runtime)
  • preprocessing & postprocessing (optional)
  • dynamic batching of queries to increase throughput (see the sketch after this list)
  • monitoring metrics
  • service health check
  • versioning
  • multiple instances
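
Dynamic batching deserves a closer look, since it is the feature you are most likely to end up reimplementing yourself. Below is a minimal sketch of the idea (my own, not taken from any of the frameworks discussed here; `model` is assumed to be any callable that runs a batched forward pass): incoming requests are queued, and a worker thread groups whatever arrives within a short window into one batch.

```python
import queue
import threading
import time

import numpy as np

MAX_BATCH = 32      # largest batch the worker will assemble
MAX_WAIT_S = 0.005  # how long to wait for more requests before running

_pending = queue.Queue()

def submit(image):
    """Called from the request handler; blocks until this image's result is ready."""
    done = threading.Event()
    holder = {}
    _pending.put((image, done, holder))
    done.wait()
    return holder["result"]

def batching_worker(model):
    """Group queued requests into batches and run one inference per batch."""
    while True:
        batch = [_pending.get()]                 # block until at least one request arrives
        deadline = time.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and time.time() < deadline:
            try:
                batch.append(_pending.get(timeout=max(deadline - time.time(), 0)))
            except queue.Empty:
                break
        images = np.stack([image for image, _, _ in batch])
        outputs = model(images)                  # one forward pass for the whole batch
        for (_, done, holder), output in zip(batch, outputs):
            holder["result"] = output
            done.set()

# usage sketch: threading.Thread(target=batching_worker, args=(model,), daemon=True).start()
```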

Actually, when deploying models with Kubernetes, we only need a subset of these features. But we do care about the performance of these frameworks, so let's run a benchmark.

Benchmark

Environments:

  • CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
  • GPU: NVIDIA V100
  • Memory: 251GiB
  • OS: Ubuntu 16.04.6 LTS (Xenial Xerus)

Docker Images:

  • tensorflow/tensorflow:latest-gpu
  • tensorflow/serving:latest-gpu
  • nvcr.io/nvidia/tensorrtserver:19.10-py3

Elapsed time is recorded after warmup, and dynamic batching is disabled.

All the code can be found in this gist.
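
For reference, here is a minimal sketch of the timing methodology against TF Serving's REST endpoint (not the exact gist code; the URL, model name, and random input batch are assumptions):

```python
import json
import time

import numpy as np
import requests

# hypothetical endpoint; assumes TF Serving runs locally with a model named "resnet50"
URL = "http://localhost:8501/v1/models/resnet50:predict"

def benchmark(batch_size, total_images=32000, warmup=5):
    batch = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)
    payload = json.dumps({"instances": batch.tolist()})
    for _ in range(warmup):                          # warmup requests are not timed
        requests.post(URL, data=payload).raise_for_status()
    start = time.perf_counter()
    for _ in range(total_images // batch_size):      # 32000 images in fixed-size batches
        requests.post(URL, data=payload).raise_for_status()
    return time.perf_counter() - start

print(benchmark(batch_size=32))
```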

Framework                             Model     Model Type     Images  Batch size  Time (s)
Tensorflow                            ResNet50  TF SavedModel  32000   32          83.189
Tensorflow                            ResNet50  TF SavedModel  32000   10          86.897
Tensorflow Serving                    ResNet50  TF SavedModel  32000   32          120.496
Tensorflow Serving                    ResNet50  TF SavedModel  32000   10          116.887
Triton (TensorRT Inference Server)    ResNet50  TF SavedModel  32000   32          201.855
Triton (TensorRT Inference Server)    ResNet50  TF SavedModel  32000   10          171.056
Falcon + msgpack + Tensorflow         ResNet50  TF SavedModel  32000   32          115.686
Falcon + msgpack + Tensorflow         ResNet50  TF SavedModel  32000   10          115.572

According to this benchmark, Triton is not ready for production yet, TF Serving is a good option for TensorFlow models, and a self-hosted service is also quite good (though you may need to implement dynamic batching for production).
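
For the "Falcon + msgpack + Tensorflow" rows, the self-hosted service can be as small as the sketch below (a simplification assuming Falcon 3.x; it loads Keras ResNet50 directly rather than a SavedModel, and the route and payload field names are my own choices):

```python
import falcon
import msgpack
import numpy as np
import tensorflow as tf

# load the model once at startup; inference happens in-process
model = tf.keras.applications.ResNet50(weights="imagenet")

class Predict:
    def on_post(self, req, resp):
        # the client is assumed to pack {"images": [...]} with msgpack
        payload = msgpack.unpackb(req.bounded_stream.read(), raw=False)
        batch = np.asarray(payload["images"], dtype=np.float32)
        probs = model(batch, training=False).numpy()
        resp.data = msgpack.packb(probs.tolist())
        resp.content_type = "application/msgpack"

app = falcon.App()
app.add_route("/predict", Predict())
# run with e.g.: gunicorn server:app
```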

Comparison

Tensorflow Serving

https://www.tensorflow.org/tfx/serving

  • coupled with the TensorFlow ecosystem (other formats are supported, but not out of the box)
  • A/B testing
  • provide both gRPC and HTTP RESTful API
  • prometheus integration
  • batching
  • multiple models
  • preprocessing & postprocessing can be implemented with signatures
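
Regarding the last point, here is a rough sketch of baking preprocessing into the serving signature at export time (assumes TF 2.3+; the input, output, and export-path names are my own choices), so clients can send raw JPEG bytes:

```python
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")

# decode + resize + preprocess inside the signature, so the client only sends bytes
@tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="image_bytes")])
def serve(image_bytes):
    def decode(jpeg):
        img = tf.io.decode_jpeg(jpeg, channels=3)
        img = tf.image.resize(img, (224, 224))
        return tf.keras.applications.resnet50.preprocess_input(img)
    images = tf.map_fn(decode, image_bytes, fn_output_signature=tf.float32)
    return {"probabilities": model(images)}

tf.saved_model.save(model, "export/resnet50/1", signatures={"serving_default": serve})
```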

Triton Inference Server

https://github.com/NVIDIA/triton-inference-server/

  • supports multiple backends: ONNX, PyTorch, TensorFlow, Caffe2, TensorRT
  • provides both gRPC and HTTP APIs, with client SDKs
  • internal health check and prometheus metrics
  • batching
  • concurrent model execution
  • preprocessing & postprocessing can be done with ensemble models
  • shm-size, memlock, and stack configurations are not available on Kubernetes
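
As a client-side example for the SDK point, here is a sketch using the modern `tritonclient` HTTP API (newer than the 19.10 image benchmarked above); the model, input, and output names are assumptions and must match the model's `config.pbtxt`:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(32, 224, 224, 3).astype(np.float32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")  # input name is an assumption
infer_input.set_data_from_numpy(batch)

result = client.infer(
    model_name="resnet50",                                       # assumed model name
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("probabilities")],  # assumed output name
)
print(result.as_numpy("probabilities").shape)
```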

Multi Model Server

https://github.com/awslabs/multi-model-server

  • requires Java 8
  • provides an HTTP API
  • Java layer communicates with Python workers through Unix Domain Socket or TCP
  • batching (not mature)
  • multiple models
  • logging via log4j
  • management API
  • you need to write the model loading and inference code yourself (which means you can use any runtime you want)
  • easy to add preprocessing and postprocessing to the service
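
The last two points usually take the shape of a custom service module, packaged into a .mar archive with the model archiver. Here is a rough sketch of that pattern (the `load_my_model` and `preprocess` helpers are hypothetical placeholders for whichever runtime you pick):

```python
import json

class ResNetHandler:
    def __init__(self):
        self.model = None

    def initialize(self, context):
        # context.system_properties contains model_dir, gpu_id, etc.
        model_dir = context.system_properties.get("model_dir")
        self.model = load_my_model(model_dir)  # hypothetical loader: any runtime works

    def handle(self, data, context):
        # `data` is a list of requests: preprocess, run one batched inference, postprocess
        images = [preprocess(item.get("body")) for item in data]  # hypothetical preprocess
        outputs = self.model(images)
        return [json.dumps(output) for output in outputs]

_handler = ResNetHandler()

def handle(data, context):
    # entry point called by the Java frontend through the Python worker
    if _handler.model is None:
        _handler.initialize(context)
    if data is None:
        return None
    return _handler.handle(data, context)
```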

GraphPipe

https://oracle.github.io/graphpipe

  • uses FlatBuffers, which is more efficient
  • last updated about 2 years ago...
  • Oracle laid off the whole team

TorchServe

https://github.com/pytorch/serve

  • forked from Multi Model Server
  • still under development...