
ORNL is managed by UT-Battelle for the US Department of Energy

Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation

Chris Davis, Sophie Voisin, Devin White, Andrew Hardin

Scalable and High Performance Geocomputation Team, Geographic Information Science and Technology Group, Oak Ridge National Laboratory

GTC 2017 – May 2017


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

The Story

• We are:
– Developing an HPC suite of applications
– Spread across multiple R&D teams
– In an Agile development process
– Delivering to a production environment
– Needing to support multiple systems / multiple capabilities
– Collecting performance metrics for system optimization

Why We Use NVIDIA-Docker: Resource Optimization

[Diagram: NVIDIA-Docker, plain Docker, and virtual machines compared on GPU access, flexibility, and operating system support.]

Hardware – Quadro: Compute + Display

Card         M4000    P6000
Capability   5.2      6.1
Block        32       32
SM           13       30
Cores        1664     3840
Memory       8 GB     24 GB

Hardware – Tesla: Compute Only

Card         K40      K80
Capability   3.5      3.7
Block        16       16
SM           15       13
Cores        2880     2496
Memory       12 GB    12 GB


Hardware – High End

DELL C4130

GPU 4 x K80

RAM 256GB

Cores 48

SSD Storage 400GB

Constructing Containers

• Build Container:
– Based off NVIDIA images at gitlab.com
– https://gitlab.com/nvidia/cuda/tree/centos7
– CentOS 7
– CUDA 8.0 / 7.5
– cuDNN 5.1
– GCC 4.9.2
– Cores: 24
– Mount local folder with code

• Compile against chosen compute capability
• Copy product inside container
• "docker commit" container updates to new image
• "docker save" to Isilon
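A minimal sketch of what this per-capability build-and-save loop could look like when driven from Python. The base image tag, mount paths, image names, and make target are assumptions for illustration, not the team's actual scripts:

```python
#!/usr/bin/env python
"""Sketch of the build step: compile inside a throwaway container for one
compute capability, commit the result, and save the image to shared storage."""
import subprocess

BASE_IMAGE = "nvidia/cuda:8.0-cudnn5-devel-centos7"   # assumed base image tag
CODE_DIR   = "/mnt/code"                              # local folder with source (assumption)
ISILON     = "/mnt/isilon/images"                     # shared Isilon mount (assumption)

def build_for_capability(cap):
    sm = cap.replace(".", "")
    name = "build_sm%s" % sm
    # Compile against the chosen compute capability inside the container.
    subprocess.check_call([
        "docker", "run", "--name", name,
        "-v", "%s:/src" % CODE_DIR,
        BASE_IMAGE,
        "bash", "-c", "cd /src && make GPU_ARCH=sm_%s" % sm,   # hypothetical make target
    ])
    # "docker commit" the container's changes to a new image.
    image = "hpc-app:cuda8-sm%s" % sm
    subprocess.check_call(["docker", "commit", name, image])
    # "docker save" the image to Isilon so the other servers can load it.
    with open("%s/%s.tar" % (ISILON, image.replace(":", "_")), "wb") as f:
        subprocess.check_call(["docker", "save", image], stdout=f)
    subprocess.check_call(["docker", "rm", name])

for cap in ["3.0", "3.5", "3.7", "5.0", "5.2", "6.0", "6.1"]:
    build_for_capability(cap)
```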

[Diagram: build workflow — code from the Git repo is mounted into a container on the HPC server (NVIDIA-Docker, GPUs/CPUs, local drive); built container images are saved to Isilon, and compile statistics are written to the PostgreSQL Compile Stats database.]

Running Containers

• For each compute capability:
– "docker load" from Isilon storage
– Run container & profile script
– Send nvprof results to Profile Stats DB
– Container/image removed
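A matching sketch of the load / run / profile loop, again with assumed image names and an assumed in-container /profile.sh wrapper that runs nvprof and pushes its results to the Profile Stats database:

```python
#!/usr/bin/env python
"""Sketch of the per-capability load / run / clean-up loop on each HPC server."""
import subprocess

ISILON = "/mnt/isilon/images"                         # shared image store (assumption)
CAPS   = ["3.0", "3.5", "3.7", "5.0", "5.2", "6.0", "6.1"]

for cap in CAPS:
    image = "hpc-app:cuda8-sm%s" % cap.replace(".", "")
    tarball = "%s/%s.tar" % (ISILON, image.replace(":", "_"))

    # "docker load" the image that the build server saved to Isilon.
    with open(tarball, "rb") as f:
        subprocess.check_call(["docker", "load"], stdin=f)

    # Run the container under nvidia-docker; /profile.sh is a hypothetical
    # in-container script that runs nvprof and sends results to the Profile DB.
    subprocess.check_call(["nvidia-docker", "run", "--rm", image, "/profile.sh"])

    # Remove the image once profiling is done to free local disk.
    subprocess.check_call(["docker", "rmi", image])
```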

[Diagram: as on the previous slide — container images loaded from Isilon onto the HPC server, data pulled from Isilon, and profiling results written to the PostgreSQL Profile Stats database.]

Hooking It All Together

[Diagram: multiple HPC servers, each running containers under NVIDIA-Docker against local GPUs/CPUs and a local drive, all connected to shared Isilon storage (container images and data), the Git repo, and the PostgreSQL Compile Stats / Profile Stats databases.]

• One server generates containers

• All servers pull containers from Isilon

• Data to be processed pulled from Isilon

• Container build stats stored in Compiler DB

• Container execution stats stored in Profiler DB

Profiling Combinations

• nvprof
– Output parsed
– Sent to Profile DB

• Containers for:
– CUDA version
– Each capability
– All capabilities
– CPU only

• Data sets: 4

• Total of 104 profiles

[Diagram: profiling matrix — CUDA 7.5 and CUDA 8.0 containers built per compute capability (3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1), plus an all-capabilities build and a CPU-only build, run over datasets D1–D4 on the M4000, K80, P6000, and K40.]

Database

[Diagram: three Postgres tables sharing common fields (hostname, dataset, CUDA version, compute capability, number of CPU threads, GPU device, timestamp); the Compile DB additionally records compile time, the Run Time DB records execution time, and the NVPROF DB records per-kernel / API-call statistics (step name, step time, step time percent, number of calls, average / min / max time).]

• Postgres databases
– Shared fields
– Compile DB
– Run Time DB
– NVPROF DB
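As an illustration of how a parsed nvprof summary row might land in the NVPROF (Profile Stats) database, a psycopg2 sketch; the table name, column names, example values, and connection settings are assumptions based on the fields listed above:

```python
"""Sketch of inserting one parsed nvprof summary row into the profile database."""
import psycopg2

row = {                                   # example values (hypothetical)
    "hostname": "hpc-node-01",
    "dataset": "D1",
    "cuda_version": "8.0",
    "compute_capability": "3.7",
    "gpu_device": "Tesla K80",
    "num_cpu_threads": 6,
    "step_name": "step2",
    "kernel_or_api_call": "nmiKernel",
    "step_time_percent": 61.2,
    "step_time": 24.8,
    "num_calls": 1,
    "ave_time": 24.8,
    "min_time": 24.8,
    "max_time": 24.8,
}

conn = psycopg2.connect(dbname="profile_stats", host="dbserver")   # assumed DSN
with conn, conn.cursor() as cur:
    cols = ", ".join(row)
    vals = ", ".join("%%(%s)s" % k for k in row)   # named placeholders, e.g. %(hostname)s
    cur.execute("INSERT INTO nvprof_results (%s) VALUES (%s)" % (cols, vals), row)
conn.close()
```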


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

Example HPC Application

• Geospatial metadata generator
– Leverages open source 3rd-party libraries
• OpenCV, Caffe, GDAL, …
– Computer Vision Algorithms – GPU enabled
• SURF, ORB, NCC, NMI, …
– Automated matching against control data
– Calculates geospatial metadata for input imagery

Satellites Manned Aircraft Unmanned Aerial Systems

Example HPC Application - GTC16

• Two-step Image Re-alignment Application using NMI

[Diagram: application pipeline — Input Image, Preprocessing, Source Selection, Global Localization, Registration, Resection, Metadata / Output Image — with work split between CPU and GPU. Core libraries: NITRO, GDAL, Proj.4, libpq (Postgres), OpenCV, CUDA, OpenMP.]

Normalized Mutual Information

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}$$

computed from the source ($S$), control ($C$), and joint ($J$) histograms.

Example HPC Application - GTC16

• Global Localization

[Diagram: application pipeline as above; this slide covers the Global Localization stage.]

Control 382x100

Tactical 258x67

• Objective
– Re-align the source image with the control image

• Method: in-house implementation
– Roughly match source and control images

– Coarse resolution

– Mask for non-valid data

– Exhaustive search

Solutions 4250
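Assuming the exhaustive coarse search simply slides the 258 × 67 tactical image over the 382 × 100 control image one pixel at a time, the solution count shown is just the number of possible offsets:

$$(382 - 258 + 1) \times (100 - 67 + 1) = 125 \times 34 = 4250 \text{ solutions}$$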


Example HPC Application - GTC16

• Global Localization

Example HPC Application - GTC16

• Similarity Metric
– Normalized Mutual Information
– Histogram with masked area
• Missing data

• Artifact

• Homogeneous area

Source image and mask: $N_S \times M_S$ pixels
Control image and mask: $N_C \times M_C$ pixels
Solution space: $n \times m$ NMI coefficients

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_x p(x)\,\log_2 p(x)$$

where $H$ is the entropy and $p(x)$ the probability density function, with $x \in [0, 255]$ for $S$ and $C$ and $x \in [0, 65535]$ for the joint histogram $J$.
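For reference, a small NumPy sketch of the masked NMI computation described above (illustrative only; the production code evaluates this exhaustively in a CUDA kernel):

```python
"""Minimal NumPy sketch of masked NMI between two equally sized 8-bit patches."""
import numpy as np

def entropy(hist):
    """Shannon entropy (base 2) of a histogram, ignoring empty bins."""
    p = hist.astype(np.float64)
    p = p[p > 0]
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def masked_nmi(source, control, mask):
    """NMI = (H_S + H_C) / H_J over the pixels where mask is True."""
    s = source[mask].astype(np.uint16)
    c = control[mask].astype(np.uint16)
    h_s = entropy(np.bincount(s, minlength=256))
    h_c = entropy(np.bincount(c, minlength=256))
    # Joint histogram: 256 x 256 = 65,536 bins, one per (source, control) pair.
    h_j = entropy(np.bincount(s * 256 + c, minlength=65536))
    return (h_s + h_c) / h_j

# Example: two random 11x11 patches with a full (all-valid) mask.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, (11, 11), dtype=np.uint8)
b = rng.integers(0, 256, (11, 11), dtype=np.uint8)
print(masked_nmi(a, b, np.ones((11, 11), dtype=bool)))
```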

Example HPC Application - GTC16

Summary
• Global Localization as coarse re-alignment
– Problem: joint histogram computation for each solution
• No compromise on the number of bins – 65,536
• Exhaustive search
– Solution: leverage the K80's specifications
• 12 GB of memory
• 1 thread per solution
• Less than 25 seconds for 61K solutions on a 131K-pixel image

Kernel specifications (1 solution / thread)

occupancy                    100%
threads / block              128
stack frame                  264,192 bytes
total memory / block         33.81 MB
total memory / SM            541.06 MB
total memory / GPU           7.03 GB
memory %                     61.06%
spill stores – spill loads   0 – 0
registers                    27
smem / block                 0
smem / SM                    0
smem %                       0.00%
cmem[0] – cmem[2]            448 – 20
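These figures are consistent with each thread keeping its own private joint histogram. Assuming a 4-byte counter per bin (and the K80's 13 SMs from the hardware slide), the totals follow:

$$65{,}536 \text{ bins} \times 4\,\text{B} = 262{,}144\,\text{B} \approx 264{,}192\,\text{B (stack frame)}$$
$$264{,}192\,\text{B} \times 128 \tfrac{\text{threads}}{\text{block}} \approx 33.81\,\text{MB/block};\quad 33.81 \times 16 \tfrac{\text{blocks}}{\text{SM}} \approx 541\,\text{MB/SM};\quad 541 \times 13\,\text{SMs} \approx 7.03\,\text{GB/GPU}$$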

Example HPC Application - GTC16

• Registration

Control 382x100
Tactical 258x67

[Diagram: application pipeline as above; this slide covers the Registration stage.]

Example HPC Application - GTC16

• Registration

Control 382x100
Tactical 258x67
Tactical & Control 4571x1555

[Diagram: application pipeline as above; this slide covers the Registration stage.]

• Objective
– Refine the localization

• Method
– Use ~400× higher resolution
– Keypoint matching

Example HPC Application - GTC16

• Registration Workflow

[Diagram: keypoints are detected in the source image and described with 11×11 intensity values; corresponding 73×73-pixel search windows are taken from the control image; matching descriptors within the search windows produces a tiepoint list.]

• Similarity Metric
– Normalized Mutual Information
– Small "images" but numerous keypoints
• Numerous keypoints
– up to 65,536 with the GPU SURF detector
• Image / descriptor size
– 11 × 11 intensity values to describe
• Search area
– 73 × 73 control sub-image
• Solution space
– 63 × 63 = 3969 per keypoint
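The per-keypoint solution space follows from sliding the 11 × 11 descriptor inside the 73 × 73 search window, and the worst-case keypoint count gives the ~260M coefficients quoted on the summary slide:

$$(73 - 11 + 1)^2 = 63 \times 63 = 3969 \text{ solutions per keypoint};\qquad 3969 \times 65{,}536 \approx 2.6 \times 10^{8} \text{ NMI coefficients}$$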

Application

Descriptors: 11x11 intensity values

Search windows: 73x73 pixels

Solution spaces: 63x63 NMI coefficients

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_x p(x)\,\log_2 p(x)$$

where $H$ is the entropy and $p(x)$ the probability density function, with $x \in [0, 255]$ for $S$ and $C$ and $x \in [0, 65535]$ for the joint histogram $J$.

Example HPC Application - GTC16

Summary
• Registration refines the re-alignment
– Problem: joint histogram computation for each solution
• No compromise on the number of bins – 65,536
• Exhaustive search
– Solution: leverage the K80's specifications
• 12 GB of memory
• 1 block per solution
• Leverage the limited number of descriptor values: at most 121 << 65,536
• Less than 100 seconds for 65K keypoints (260M NMI coefficients)
• About 10K keypoints in less than 20 seconds

Kernel: find the best match for all keypoints
– 1 block per keypoint, optimized for the 63 × 63 search windows
– 64 threads / block (1 idle); each thread computes a "row" of solutions

Sparse joint histogram: 65,536 bins but only 121 values
– Leverage the 11 × 11 descriptor size
– Create 2 lists (length 121) of intensity values: indices for the source and for the corresponding control subset
– Update the joint histogram counts from the lists
– Loop over the lists to retrieve each aggregate count; set it to 0 after its first retrieval
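A small Python sketch of the sparse joint-histogram idea (illustrative only; the production version is the CUDA kernel described above, one block per keypoint). A dictionary stands in for the per-thread aggregate counts used on the GPU:

```python
"""With 11x11 descriptors there are at most 121 (source, control) pairs, so only
121 of the 65,536 joint-histogram bins can ever be populated."""
import numpy as np

def entropy_from_counts(counts, n):
    """Shannon entropy (base 2) from a sparse {bin: count} mapping."""
    p = np.array(list(counts.values()), dtype=np.float64) / n
    return -np.sum(p * np.log2(p))

def sparse_nmi(src_vals, ctl_vals):
    """NMI of two length-121 intensity lists using sparse histograms only."""
    n = len(src_vals)
    joint, h_src, h_ctl = {}, {}, {}
    for s, c in zip(src_vals.tolist(), ctl_vals.tolist()):
        joint[(s, c)] = joint.get((s, c), 0) + 1   # at most 121 populated joint bins
        h_src[s] = h_src.get(s, 0) + 1
        h_ctl[c] = h_ctl.get(c, 0) + 1
    return (entropy_from_counts(h_src, n) + entropy_from_counts(h_ctl, n)) / \
           entropy_from_counts(joint, n)

# Exhaustive 63x63 search of one 11x11 descriptor inside one 73x73 window.
rng = np.random.default_rng(0)
descriptor = rng.integers(0, 256, (11, 11), dtype=np.uint8)
window     = rng.integers(0, 256, (73, 73), dtype=np.uint8)
best = max(
    ((dy, dx, sparse_nmi(descriptor.ravel(), window[dy:dy + 11, dx:dx + 11].ravel()))
     for dy in range(63) for dx in range(63)),
    key=lambda t: t[2],
)
print("best offset and NMI:", best)
```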


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

Compile Time Results

[Chart: compile time in seconds and resulting binary size in MB for CUDA 7.5 vs CUDA 8.0, by compute-capability setting (OFF, 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1, 3.0–5.2, 3.0–6.1).]

Run Time Results

[Charts: average run time in seconds for datasets D1–D4, comparing CPU, CUDA 7.5, and CUDA 8.]

K80 - Kernel Time Results in Seconds with nvprof

[Charts: Step 1 and Step 2 kernel timings (average, min, max, std) for datasets D1–D4, CUDA 7.5 vs CUDA 8.]

Run Time Results

[Charts: Step 2 kernel time in seconds (average, min, max, std) for datasets D1–D4 on the K40, K80, M4000, and P6000, CUDA 7.5 vs CUDA 8.]


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

Lessons Learned

• GPU isolation: ran into an issue when swapping out the P6000 and K40
– nvidia-smi swapped the GPU IDs of the K40 and M4000
– This caused nvidia-docker to ignore the NV_GPU value
– UUID vs. index
– Our application can set the GPU index in a multi-GPU environment (defaults to 0)
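A sketch of one way to pin a container to a GPU by UUID instead of index, so device re-enumeration cannot silently change the target card; it assumes the nvidia-docker v1 NV_GPU variable and uses example device and image names:

```python
"""Pin a container to a specific GPU via its stable UUID rather than its index."""
import os
import subprocess

# Map GPU names to UUIDs via nvidia-smi (names may repeat on multi-GPU nodes;
# a real script would key on bus ID or serial instead).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,uuid", "--format=csv,noheader"]
).decode()
uuids = dict(line.split(", ") for line in out.strip().splitlines())

env = dict(os.environ, NV_GPU=uuids["Tesla K40m"])   # pin by UUID, not index
subprocess.check_call(
    ["nvidia-docker", "run", "--rm", "hpc-app:cuda8-sm35", "/profile.sh"],  # hypothetical image/script
    env=env,
)
```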

Future Work

• Move off desktop machines to a full testing platform with dedicated hardware and multiple GPU types

• Investigate Docker Registry & Docker Swarm for managing containers

• Enhance Database analysis to autogenerate reports

• Generalize the process so that any GPU application can be containerized and profiled with this architecture

Thank you!


Customer Resources

DELL C4130

GPU 4 x K80

RAM 256GB

Cores 48

SSD Storage 400GB

[Chart: run time with 6 threads in seconds for datasets D1–D4, CPU vs CUDA 7.5.]
