Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision

RAPIDS
GPU POWERED MACHINE LEARNING

RISE OF GPU COMPUTING

[Chart: performance by year, 1980 to 2020, log scale from 10^2 to 10^7. Single-threaded perf grows 1.5X per year, then slows to 1.1X per year. GPU-Computing perf grows 1.5X per year, reaching 1000X by 2025.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
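
The growth rates on this chart compound year over year, which is where a figure like 1000X comes from. A quick check, assuming the rates stay constant (the baseline year is not stated in the extracted text, so only the number of years is computed):

```python
import math


def years_to_reach(factor: float, rate_per_year: float) -> float:
    """Years of compound growth at rate_per_year needed to reach factor."""
    return math.log(factor) / math.log(rate_per_year)


def growth_after(years: float, rate_per_year: float) -> float:
    """Total growth factor after the given number of years."""
    return rate_per_year ** years


# 1.5X per year: how long until 1000X?
gpu_years = years_to_reach(1000, 1.5)

# 1.1X per year over that same period.
cpu_growth = growth_after(gpu_years, 1.1)

print(f"1.5X per year reaches 1000X in about {gpu_years:.1f} years")
print(f"1.1X per year over the same period gives about {cpu_growth:.1f}X")
```

At 1.5X per year, 1000X takes about 17 years, while 1.1X per year yields only about 5X over the same span.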

APPLICATIONS

SYSTEMS

ALGORITHMS

CUDA

ARCHITECTURE

EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science

[Diagram: ARTIFICIAL INTELLIGENCE and DATA SCIENCE, spanning Deep Learning, Traditional Machine Learning, and Analytics]

Deep Learning: dense data types (images, video, voice)

Traditional Machine Learning (regressions, decision trees, graph) and Analytics: tabular/sparse data types

USE CASES IN EVERY INDUSTRY

CONSUMER INTERNET

— Personalized recommendations to drive viewership

— Optimized ad targeting

— Preventing churn by identifying factors that influence loyalty

RETAIL

— Inventory forecasting

— Personalized recommendations

— Optimized pricing and promotions

— Preventing credit card fraud and cyber attacks

FINANCIAL SERVICES

— Personalized guidance on financial products

— Return optimization based on market signals

— Fraud detection

HEALTHCARE

— Better disease prediction with genomic medicine

— Improved health outcomes with analysis of EMRs

— Predictive care/treatment

TODAY’S DATA SCIENCE STIFLES INNOVATION

[Workflow diagram]

Manage Data: All Data, ETL, Structured Data Store

Training: Data Preparation, Model Training, Visualization

Evaluate

Deploy: Inference

Slow Training Times for Data Scientists, i.e., HURRY UP AND WAIT

DATA SCIENCE CHALLENGES

SLOW TRAINING

30+ hours to build GBDT

SLOW DATA PROCESSING

Days: Data Transformation

Weeks: Feature Engineering

Months: Scoring Pipelines

ESCALATING TCO

More servers and infrastructure yielding diminishing performance returns

XGBOOST

Definition

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

It is a powerful tool for solving classification and regression problems in a supervised learning setting.

Source: https://goo.gl/eTxVtA
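
To make "gradient boosted decision trees" concrete, here is a minimal pure-Python sketch of boosting with depth-1 trees (stumps) on a single feature, using squared-error loss. It is not XGBoost's algorithm, which adds regularization, second-order gradient information, and heavily optimized tree construction. The function names, toy data, and hyperparameters are all illustrative assumptions.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Pick the threshold split on x that minimizes squared error on the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For the loss 0.5 * (y - p)^2, the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return mean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))


# Toy data (made up for illustration): three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

for n in (0, 5, 20):
    print(n, "trees, training MSE:", round(mse(fit_boosted(xs, ys, n_trees=n), xs, ys), 4))
```

Each added stump corrects part of the error left by the trees before it, so the training error falls as trees are added.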

PREDICT: WHO ENJOYS COMPUTER GAMES
Example of Decision Tree

Source: https://goo.gl/eTxVtA

COMBINE TREES FOR STRONGER PREDICTIONS
Example of Using Ensembled Decision Trees

Source: https://goo.gl/eTxVtA
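
In a tree ensemble, the final prediction for an input is the sum of the scores that each tree assigns to it. A minimal sketch of that idea, loosely following the "who enjoys computer games" setup. The features, thresholds, and leaf scores below are made-up illustrative values, not figures taken from the slide or its source.

```python
def tree_age_gender(person):
    """Hypothetical first tree: split on age, then on gender."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.25
    return -1.0


def tree_computer_use(person):
    """Hypothetical second tree: split on daily computer use."""
    return 0.5 if person["uses_computer_daily"] else -0.5


TREES = [tree_age_gender, tree_computer_use]


def ensemble_score(person):
    """The ensemble prediction is the sum of the individual tree scores."""
    return sum(tree(person) for tree in TREES)


child = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandparent = {"age": 70, "is_male": True, "uses_computer_daily": False}

print(ensemble_score(child))        # 2.5
print(ensemble_score(grandparent))  # -1.5
```

A higher score means the ensemble considers the person more likely to enjoy computer games. Neither tree alone separates the two cases as clearly as the sum does.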

RAPIDS OVERVIEW

RAPIDS
GPU Accelerated Data Science

RAPIDS is a set of open source libraries for GPU accelerating data preparation and machine learning.

OSS website: http://www.rapids.ai/

RE-IMAGINING DATA SCIENCE WORKFLOW
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA

cuDF: Data preparation / wrangling

cuML: Optimized ML model training

Visualization: Data visualization libraries

[Diagram output: data insights]

RAPIDS LIBRARIES

cuDF

GPU accelerated software for data manipulation and data preparation. Accelerates loading, filtering, and manipulation of data in preparation for model training. Python drop-in Pandas replacement built on CUDA C++.

cuML

GPU accelerated traditional machine learning libraries: XGBoost, Kalman, K-means, KNN, DBScan, PCA, TSVD and more.

cuGRAPH

Collection of graph analytics libraries. Coming soon.

RAPIDS — OPEN GPU DATA SCIENCE
Software Stack Python

[Stack diagram. Workflow stages: Data Preparation, Model Training, Visualization. Layers: PYTHON; DASK; RAPIDS (CUDF, CUML, CUGRAPH) alongside DEEP LEARNING FRAMEWORKS (CUDNN); APACHE ARROW on GPU Memory; CUDA.]

THE RAPIDS VALUE PROPOSITION
High Performance, Easy-to-use

Data Scientist

Reduced Training Time: Drastically improve your productivity with near-interactive data science

Hassle-Free Integration: Accelerate your Python data science toolchain with minimal code changes and no new tools to learn

Open Source: Customizable, extensible, interoperable. The open-source software is supported by NVIDIA and built on Apache Arrow

Data Science Leader

Top Model Accuracy: Increase machine learning model accuracy by iterating on models faster and deploying them more frequently

TCO Reduction: Decrease the server costs, footprint, and power consumption of your ML workloads, reducing the TCO

Increased Data Scientist Productivity: Reduce training time, allowing data scientists to be more productive

RAPIDS DEPLOYMENT STACK

TARGET INDUSTRIES

Retail Finance CICN Healthcare

TARGET AUDIENCE AND RECOMMENDED SYSTEMS

Individual Data Scientist

Quadro GV100 WS: 2 GV100, NVLink

DGX Station: 4 V100, NVLink

Cloud: V100 Cloud Instances

Shared Infrastructure For Data Scientists

V100 Servers: 4-8 V100, NVLink, HGX-1, HGX-2

DGX-1: 8 V100, NVLink

DGX-2: 16 V100, NVLink

Cloud: V100 Cloud Instances

PILLARS OF RAPIDS PERFORMANCE

CUDA Architecture: Massively parallel processing

NVLink/NVSwitch: High-speed connections between GPUs for distributed algorithms

[Diagram: NVSWITCH, 6x NVLINK]

Memory Architecture: Large virtual GPU memory, high-speed memory

[Diagram: stacked memory with four DRAM Core Dies on a Base Die, labeled TSV and µ-Bump]

DESIGNED TO DO THE PREVIOUSLY IMPOSSIBLE

1 NVIDIA Tesla V100 32 GB Tensor Core GPUs

2 Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card

3 Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth

4 Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth

5 PCIe Switch Complex

6 Two Intel Xeon Platinum CPUs

7 1.5 TB System Memory

8 30 TB NVME SSDs Internal Storage

9 Dual 10/25/100 Gb/sec Ethernet

NVSWITCH: THE REVOLUTIONARY AI NETWORK FABRIC

• Inspired by leading edge research that demands unrestricted model parallelism

• Like the evolution from dial-up to broadband, NVSwitch delivers a networking fabric for the future, today

• Delivering 2.4 TB/s bisection bandwidth, equivalent to a PCIe bus with 1,200 lanes

• NVSwitches on DGX-2 = all of Netflix HD in <45s
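
The headline DGX-2 numbers can be cross-checked with simple arithmetic. The sketch below uses only figures from the slides, plus one labeled assumption: roughly 2 GB/s of bidirectional bandwidth per PCIe lane (approximately PCIe 3.0), which is what the 1,200-lane comparison appears to imply.

```python
# Figures taken from the slides.
GPU_BOARDS = 2
GPUS_PER_BOARD = 8
HBM2_PER_GPU_GB = 32
NVSWITCHES_PER_BOARD = 6
NETWORK_PORTS = 8
PORT_SPEED_GBPS = 100          # EDR InfiniBand / 100 GigE, per direction
BISECTION_BANDWIDTH_GB_S = 2400  # 2.4 TB/s

# Assumption (not from the slides): bidirectional bandwidth of one PCIe lane.
ASSUMED_PCIE_LANE_GB_S = 2.0

total_gpus = GPU_BOARDS * GPUS_PER_BOARD
total_hbm2_gb = total_gpus * HBM2_PER_GPU_GB
total_nvswitches = GPU_BOARDS * NVSWITCHES_PER_BOARD
# "Bi-directional" counts both directions of every port.
total_network_gbps = NETWORK_PORTS * PORT_SPEED_GBPS * 2
equivalent_pcie_lanes = BISECTION_BANDWIDTH_GB_S / ASSUMED_PCIE_LANE_GB_S

print("GPUs:", total_gpus)                           # 16
print("Total HBM2 (GB):", total_hbm2_gb)             # 512
print("NVSwitches:", total_nvswitches)               # 12
print("Network bandwidth (Gb/s):", total_network_gbps)  # 1600
print("Equivalent PCIe lanes:", equivalent_pcie_lanes)  # 1200.0
```

Under that per-lane assumption, the computed totals match the 512GB, twelve-NVSwitch, 1600 Gb/sec, and 1,200-lane figures quoted on the slides.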

TRADITIONAL HPC CLUSTER

300 Servers

$3M

180 kW

GPU-ACCELERATED HPC + AI CLUSTER

1 DGX-2

10 kW

1/8 the Cost

1/15 the Space

1/18 the Power
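
The power ratio follows directly from the two kW figures, and the cost ratio implies a rough system cost. A quick check using only the slide's numbers. The implied cost is derived from the 1/8 ratio and is not a quoted price.

```python
# Figures from the slide.
cluster_cost_usd = 3_000_000
cluster_power_kw = 180
dgx2_power_kw = 10
cost_fraction = 1 / 8

power_ratio = cluster_power_kw / dgx2_power_kw
implied_dgx2_cost_usd = cluster_cost_usd * cost_fraction

print("Power ratio:", power_ratio)                   # 18.0
print("Implied DGX-2 cost (USD):", implied_dgx2_cost_usd)  # 375000.0
```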

FASTER INSIGHTS FOR MACHINE LEARNING
DGX-2 544X Speedup Compared to CPU-Only Server Nodes

[Bar chart: Process Time (min), axis 0 to 3,500, for 1 CPU instance, 20 CPU instances, 30 CPU instances, 50 CPU instances, 100 CPU instances, and HGX-2. Each bar is split into cuIO/cuDF (Load and Data prep), Data Conversion, and XGBoost. Callout: 544X speedup.]

GPU measurements completed on DGX-2 running RAPIDS.

CPU: 20 CPU cluster; comparison is prorated to 1 CPU (61 GB of memory, 8 vCPUs, 64-bit platform), Apache Spark.

US Mortgage Data, Fannie Mae and Freddie Mac 2006-2017 | 146M mortgages

Benchmark: 200GB CSV dataset | Data preparation includes joins, variable transformations

FASTER SPEEDS, REAL WORLD BENEFITS

Time in seconds. Shorter is better.

Configuration    cuIO/cuDF: Load and Data Preparation    cuML: XGBoost
20 CPU Nodes     2,741                                   2,290
30 CPU Nodes     1,675                                   1,956
50 CPU Nodes     715                                     1,999
100 CPU Nodes    379                                     1,948
DGX-2            42                                      169
5x DGX-1         19                                      157

[End-to-End panel: same six configurations, axis 0 to 10,000, with bars split into cuIO/cuDF (Load and Data Preparation), Data Conversion, and XGBoost.]

Benchmark

200GB CSV dataset; Data preparation includes joins, variable transformations.

CPU Cluster Configuration

CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark

DGX Cluster Configuration

5x DGX-1 on InfiniBand network
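
The bar values in the benchmark panels above translate into speedups with one division each. The sketch below assumes the two value series belong to the panels as labeled in the code (the 2,741 to 19 series to load and data preparation, the 2,290 to 157 series to XGBoost). That attribution is an assumption about the chart layout.

```python
# Times in seconds. The panel attribution is an assumption (see lead-in).
TIMES = {
    "load_and_data_prep": {
        "20 CPU Nodes": 2741, "30 CPU Nodes": 1675, "50 CPU Nodes": 715,
        "100 CPU Nodes": 379, "DGX-2": 42, "5x DGX-1": 19,
    },
    "xgboost": {
        "20 CPU Nodes": 2290, "30 CPU Nodes": 1956, "50 CPU Nodes": 1999,
        "100 CPU Nodes": 1948, "DGX-2": 169, "5x DGX-1": 157,
    },
}


def speedup(stage: str, baseline: str, target: str) -> float:
    """How many times faster the target configuration is than the baseline."""
    return TIMES[stage][baseline] / TIMES[stage][target]


for stage in TIMES:
    print(stage)
    print("  DGX-2 vs 20 CPU nodes:  ", round(speedup(stage, "20 CPU Nodes", "DGX-2"), 1))
    print("  DGX-2 vs 100 CPU nodes: ", round(speedup(stage, "100 CPU Nodes", "DGX-2"), 1))
    print("  100 vs 20 CPU nodes:    ", round(speedup(stage, "20 CPU Nodes", "100 CPU Nodes"), 1))
```

Under this reading, adding CPU nodes helps data preparation substantially (about 7X from 20 to 100 nodes) but barely helps XGBoost training (about 1.2X), while a single DGX-2 is faster than 100 CPU nodes on both stages.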

DGX POD FOR RAPIDS

RAPIDS.AI - Open GPU Data Science

PORTING EXISTING CODE
CPU vs GPU

Principal Component Analysis (PCA)

[Side-by-side code comparison: Before… / …Now!]

Training and query results:

• CPU: ~5 minutes

• GPU: ~7 seconds
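
For readers unfamiliar with what this benchmark computes: PCA finds the directions of greatest variance in a dataset. The sketch below finds the first principal component of a few 2-D points with power iteration in plain Python. It illustrates the technique only. It is not the code shown on the slide, and production CPU and GPU libraries use far more efficient solvers. The data points are made up.

```python
import math


def first_principal_component(points, iterations=100):
    """Return (unit direction, variance along it) of the top principal component."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]

    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]

    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Variance explained by the component: v^T * cov * v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim))
                   for i in range(dim))
    return v, variance


# Toy data lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
component, variance = first_principal_component(points)

print("direction:", [round(c, 3) for c in component])
print("slope of direction:", round(component[1] / component[0], 2))
print("variance along it:", round(variance, 2))
```

The recovered direction has a slope close to 2, matching the line the toy points were scattered around.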

HOW? DOWNLOAD AND DEPLOY

Cloud | On-premises

Source code, libraries, packages

Source available on GitHub | Container available on NGC and Docker Hub | Conda and PIP (PIP available at a later date)

GitHub: https://github.com/rapidsai

NGC: https://ngc.nvidia.com/

Docker Hub: https://hub.docker.com/u/rapidsai

Conda: https://anaconda.org/rapidsai

ACCELERATING MACHINE LEARNING
The RAPIDS Ecosystem

Open Source Community

Enterprise Data Science Platforms

Startups

Deep Learning Integration

GPU Servers

Storage Partners

https://rapids.ai/
