BERT: Practical Deployment on NVIDIA GPU
This article is part of a series on Triton Inference Server.
Deep Learning
At a very high level, deep learning is divided into two steps. The first step is training: you take a very deep network and a huge amount of data, and you train the network to find the values for its weights. A classic example is age classification: you train on labeled images, and once the weights are computed, you can do inference. During inference you reuse the weights computed in the training phase. In this series we focus on the inference side.
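To make the two phases concrete, here is a minimal sketch in PyTorch; the tiny model, the random data, and the two-class labels are hypothetical stand-ins, not the actual BERT setup:

```python
import torch
import torch.nn as nn

# --- Training: learn the weight values from (stand-in) labeled data ---
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 4)          # stand-in for training images/features
y = torch.randint(0, 2, (64,))  # stand-in labels (e.g., age buckets)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # compute gradients
    optimizer.step()  # update the weights

# --- Inference: reuse the trained weights; no gradients needed ---
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 4)).argmax(dim=1)
print(prediction)
```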
On the inference side, you need programmability, low latency, and accuracy. You also have to handle large network sizes, and you want high throughput, efficiency, and a good rate of learning. GPUs deliver all of this.
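As a rough way to see the latency and throughput of a model on a GPU, here is a minimal timing sketch; the model and batch size are arbitrary placeholders:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device).eval()
batch = torch.randn(32, 512, device=device)

with torch.no_grad():
    for _ in range(10):  # warm-up runs
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernels run async; sync before timing
    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"latency:    {elapsed / iters * 1e3:.2f} ms/batch")
print(f"throughput: {iters * batch.shape[0] / elapsed:.0f} samples/s")
```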
Challenge during inference
Usually, in a traditional inference system, you dedicate one GPU to running inference on a particular model: one GPU for speech recognition, one for a recommendation system, and another for language processing. The problem is that when one application spikes, its GPU runs at 100% utilization while the others sit idle or underutilized.
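This underutilization is the problem Triton Inference Server, the subject of this series, is designed to address: a single server can host several models on the same GPU and route each request to whichever model the client asks for. Below is a minimal client-side sketch using the `tritonclient` package; the model names, tensor names, and shapes are hypothetical and depend on how the models were actually exported:

```python
import numpy as np
import tritonclient.http as httpclient

# One Triton server (default HTTP port 8000) can serve many models
# from a shared GPU instead of dedicating one GPU per model.
client = httpclient.InferenceServerClient(url="localhost:8000")

def infer(model_name, input_name, data):
    # Wrap a numpy array as a Triton input tensor and run inference.
    inp = httpclient.InferInput(input_name, list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return client.infer(model_name=model_name, inputs=[inp])

# Hypothetical models sharing the same server and GPU.
speech = infer("speech_recognition", "AUDIO_FEATURES",
               np.random.rand(1, 80).astype(np.float32))
recsys = infer("recommender", "USER_FEATURES",
               np.random.rand(1, 32).astype(np.float32))
```

Because both requests go to the same server, a spike in one workload can borrow GPU capacity that would otherwise sit idle on a dedicated-GPU setup.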