BERT Practical deployment on NVIDIA GPU

Author: Soeun Uhm
Triton Inference Server - This article is part of a series.
Part 4: This Article
This post is a summary of a GTC 2020 presentation, with personal thoughts added.

Deep Learning

[Figure: Deep Learning]

At a very high level, deep learning is divided into two steps. The first step is training: you have a very deep network and a huge amount of data, and you train the network to find the values for its weights. The classical example is image classification: you train with labeled images, and once the weights are computed, you can do inference. During inference you use the weights computed in the training phase. Here we will focus on the inference side.
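The train-then-infer split described above can be sketched with a toy one-parameter model (this is purely illustrative, not the BERT pipeline from the talk; the `train`/`infer` names are my own):

```python
# Minimal sketch of the two phases: training computes the weights,
# inference reuses them frozen. Toy 1-D linear model, pure Python.

def train(data, lr=0.1, epochs=100):
    """Training phase: fit w so that y ~ w * x via per-sample SGD."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = w * x
            w -= lr * (pred - y) * x  # gradient step on squared error
    return w

def infer(w, x):
    """Inference phase: weights are fixed; we only run the forward pass."""
    return w * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples from y = 2x
w = train(data)        # training converges to w close to 2.0
print(infer(w, 5.0))   # inference with frozen w, close to 10.0
```

In production, the inference half is exactly what a serving system like Triton runs: the weights never change between requests, which is what makes batching and multi-model scheduling possible.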

Why we use GPUs for inference

For inference you need programmability, low latency, and accuracy. You also have to handle large networks while maintaining high throughput and efficiency. All of this is delivered by using GPUs.

Challenge during inference

In a traditional inference system, you typically dedicate one GPU to running inference on a particular model: one GPU for speech recognition, one GPU for the recommendation system, and another GPU for language processing. The problem is that a traffic spike in one application drives its GPU to 100% utilization while the others are left unused or underutilized.
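This underutilization is what Triton's shared model repository addresses: several models can be served from a single GPU, each with its own instance count. A minimal sketch (model names and file layout are hypothetical; `instance_group` is standard Triton model configuration):

```
model_repository/
├── speech_recognition/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── recommender/
    ├── 1/
    │   └── model.onnx
    └── config.pbtxt

# config.pbtxt excerpt: run two instances of this model on GPU 0,
# so it shares the device with the other models in the repository.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

Because the scheduler multiplexes all loaded models onto the same device, a spike in one model's traffic can absorb capacity that would otherwise sit idle on a dedicated GPU.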

Case Study

