
ORNL is managed by UT-Battelle for the US Department of Energy

Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation

Chris Davis, Sophie Voisin, Devin White, Andrew Hardin

Scalable and High Performance Geocomputation Team, Geographic Information Science and Technology Group, Oak Ridge National Laboratory

GTC 2017 – May 2017


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

The Story

• We are:
– Developing an HPC suite of applications
– Spread across multiple R&D teams
– In an Agile development process
– Delivering to a production environment
– Needing to support multiple systems / multiple capabilities
– Collecting performance metrics for system optimization

Why We Use NVIDIA-Docker: Resource Optimization

[Diagram: NVIDIA-Docker, plain Docker, and virtual machines compared on GPU access, flexibility, and operating system support.]

Hardware – Quadro: Compute + Display

Card         M4000    P6000
Capability   5.2      6.1
Block        32       32
SM           13       30
Cores        1664     3840
Memory       8 GB     24 GB

Hardware – Tesla: Compute Only

Card         K40      K80
Capability   3.5      3.7
Block        16       16
SM           15       13
Cores        2880     2496
Memory       12 GB    12 GB


Hardware – High End

DELL C4130

GPU 4 x K80

RAM 256GB

Cores 48

SSD Storage 400GB

Constructing Containers

• Build Container:
– Based off NVIDIA images at gitlab.com
– https://gitlab.com/nvidia/cuda/tree/centos7
– CentOS 7
– CUDA 8.0 / 7.5
– cuDNN 5.1
– GCC 4.9.2
– Cores: 24
– Mount local folder with code

• Compile against chosen compute capability
• Copy product inside container
• "docker commit" container updates to new image
• "docker save" to Isilon
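A minimal sketch of what this per-capability build-and-save loop could look like when driven from Python. The base image tag, mount paths, image names, and make target are assumptions for illustration, not the team's actual scripts:

```python
#!/usr/bin/env python
"""Sketch of the build step: compile inside a throwaway container for one
compute capability, commit the result, and save the image to shared storage."""
import subprocess

BASE_IMAGE = "nvidia/cuda:8.0-cudnn5-devel-centos7"   # assumed base image tag
CODE_DIR   = "/mnt/code"                              # local folder with source (assumption)
ISILON     = "/mnt/isilon/images"                     # shared Isilon mount (assumption)

def build_for_capability(cap):
    sm = cap.replace(".", "")
    name = "build_sm%s" % sm
    # Compile against the chosen compute capability inside the container.
    subprocess.check_call([
        "docker", "run", "--name", name,
        "-v", "%s:/src" % CODE_DIR,
        BASE_IMAGE,
        "bash", "-c", "cd /src && make GPU_ARCH=sm_%s" % sm,   # hypothetical make target
    ])
    # "docker commit" the container's changes to a new image.
    image = "hpc-app:cuda8-sm%s" % sm
    subprocess.check_call(["docker", "commit", name, image])
    # "docker save" the image to Isilon so the other servers can load it.
    with open("%s/%s.tar" % (ISILON, image.replace(":", "_")), "wb") as f:
        subprocess.check_call(["docker", "save", image], stdout=f)
    subprocess.check_call(["docker", "rm", name])

for cap in ["3.0", "3.5", "3.7", "5.0", "5.2", "6.0", "6.1"]:
    build_for_capability(cap)
```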

[Diagram: build workflow — code from the Git repo is mounted into a container on the HPC server (NVIDIA-Docker, GPUs/CPUs, local drive); built container images are saved to Isilon, and compile statistics are written to the PostgreSQL Compile Stats database.]

Running Containers

• For each compute capability:
– "docker load" from Isilon storage
– Run container & profile script
– Send nvprof results to Profile Stats DB
– Container/image removed
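A matching sketch of the load / run / profile loop, again with assumed image names and an assumed in-container /profile.sh wrapper that runs nvprof and pushes its results to the Profile Stats database:

```python
#!/usr/bin/env python
"""Sketch of the per-capability load / run / clean-up loop on each HPC server."""
import subprocess

ISILON = "/mnt/isilon/images"                         # shared image store (assumption)
CAPS   = ["3.0", "3.5", "3.7", "5.0", "5.2", "6.0", "6.1"]

for cap in CAPS:
    image = "hpc-app:cuda8-sm%s" % cap.replace(".", "")
    tarball = "%s/%s.tar" % (ISILON, image.replace(":", "_"))

    # "docker load" the image that the build server saved to Isilon.
    with open(tarball, "rb") as f:
        subprocess.check_call(["docker", "load"], stdin=f)

    # Run the container under nvidia-docker; /profile.sh is a hypothetical
    # in-container script that runs nvprof and sends results to the Profile DB.
    subprocess.check_call(["nvidia-docker", "run", "--rm", image, "/profile.sh"])

    # Remove the image once profiling is done to free local disk.
    subprocess.check_call(["docker", "rmi", image])
```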

[Diagram: as on the previous slide — container images loaded from Isilon onto the HPC server, data pulled from Isilon, and profiling results written to the PostgreSQL Profile Stats database.]

Hooking It All Together

[Diagram: multiple HPC servers, each running containers under NVIDIA-Docker against local GPUs/CPUs and a local drive, all connected to shared Isilon storage (container images and data), the Git repo, and the PostgreSQL Compile Stats / Profile Stats databases.]

• One server generates containers

• All servers pull containers from Isilon

• Data to be processed pulled from Isilon

• Container build stats stored in Compiler DB

• Container execution stats stored in Profiler DB

Profiling Combinations

• nvprof
– Output parsed
– Sent to Profile DB

• Containers for:
– CUDA version
– Each capability
– All capabilities
– CPU only

• Data sets: 4

• Total of 104 profiles

[Diagram: profiling matrix — CUDA 7.5 and CUDA 8.0 containers built per compute capability (3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1), plus an all-capabilities build and a CPU-only build, run over datasets D1–D4 on the M4000, K80, P6000, and K40.]

Database

[Diagram: three Postgres tables sharing common fields (hostname, dataset, CUDA version, compute capability, number of CPU threads, GPU device, timestamp); the Compile DB additionally records compile time, the Run Time DB records execution time, and the NVPROF DB records per-kernel / API-call statistics (step name, step time, step time percent, number of calls, average / min / max time).]

• Postgres databases
– Shared fields
– Compile DB
– Run Time DB
– NVPROF DB
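As an illustration of how a parsed nvprof summary row might land in the NVPROF (Profile Stats) database, a psycopg2 sketch; the table name, column names, example values, and connection settings are assumptions based on the fields listed above:

```python
"""Sketch of inserting one parsed nvprof summary row into the profile database."""
import psycopg2

row = {                                   # example values (hypothetical)
    "hostname": "hpc-node-01",
    "dataset": "D1",
    "cuda_version": "8.0",
    "compute_capability": "3.7",
    "gpu_device": "Tesla K80",
    "num_cpu_threads": 6,
    "step_name": "step2",
    "kernel_or_api_call": "nmiKernel",
    "step_time_percent": 61.2,
    "step_time": 24.8,
    "num_calls": 1,
    "ave_time": 24.8,
    "min_time": 24.8,
    "max_time": 24.8,
}

conn = psycopg2.connect(dbname="profile_stats", host="dbserver")   # assumed DSN
with conn, conn.cursor() as cur:
    cols = ", ".join(row)
    vals = ", ".join("%%(%s)s" % k for k in row)   # named placeholders, e.g. %(hostname)s
    cur.execute("INSERT INTO nvprof_results (%s) VALUES (%s)" % (cols, vals), row)
conn.close()
```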


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

Example HPC Application

• Geospatial metadata generator
– Leverages open source 3rd-party libraries
• OpenCV, Caffe, GDAL, …
– Computer Vision Algorithms – GPU enabled
• SURF, ORB, NCC, NMI, …
– Automated matching against control data
– Calculates geospatial metadata for input imagery

Satellites Manned Aircraft Unmanned Aerial Systems

Example HPC Application - GTC16

• Two-step Image Re-alignment Application using NMI

[Diagram: application pipeline — Input Image, Preprocessing, Source Selection, Global Localization, Registration, Resection, Metadata / Output Image — with work split between CPU and GPU. Core libraries: NITRO, GDAL, Proj.4, libpq (Postgres), OpenCV, CUDA, OpenMP.]

Normalized Mutual Information

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}$$

computed from the source ($S$), control ($C$), and joint ($J$) histograms.

Example HPC Application - GTC16

• Global Localization

[Diagram: application pipeline as above; this slide covers the Global Localization stage.]

Control 382x100

Tactical 258x67

• Objective
– Re-align the source image with the control image

• Method: in-house implementation
– Roughly match source and control images

– Coarse resolution

– Mask for non-valid data

– Exhaustive search

Solutions 4250
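Assuming the exhaustive coarse search simply slides the 258 × 67 tactical image over the 382 × 100 control image one pixel at a time, the solution count shown is just the number of possible offsets:

$$(382 - 258 + 1) \times (100 - 67 + 1) = 125 \times 34 = 4250 \text{ solutions}$$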


Example HPC Application - GTC16

• Global Localization

Example HPC Application - GTC16

• Similarity Metric
– Normalized Mutual Information
– Histogram with masked area
• Missing data

• Artifact

• Homogeneous area

Source image and mask: $N_S \times M_S$ pixels
Control image and mask: $N_C \times M_C$ pixels
Solution space: $n \times m$ NMI coefficients

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_x p(x)\,\log_2 p(x)$$

where $H$ is the entropy and $p(x)$ the probability density function, with $x \in [0, 255]$ for $S$ and $C$ and $x \in [0, 65535]$ for the joint histogram $J$.
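For reference, a small NumPy sketch of the masked NMI computation described above (illustrative only; the production code evaluates this exhaustively in a CUDA kernel):

```python
"""Minimal NumPy sketch of masked NMI between two equally sized 8-bit patches."""
import numpy as np

def entropy(hist):
    """Shannon entropy (base 2) of a histogram, ignoring empty bins."""
    p = hist.astype(np.float64)
    p = p[p > 0]
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def masked_nmi(source, control, mask):
    """NMI = (H_S + H_C) / H_J over the pixels where mask is True."""
    s = source[mask].astype(np.uint16)
    c = control[mask].astype(np.uint16)
    h_s = entropy(np.bincount(s, minlength=256))
    h_c = entropy(np.bincount(c, minlength=256))
    # Joint histogram: 256 x 256 = 65,536 bins, one per (source, control) pair.
    h_j = entropy(np.bincount(s * 256 + c, minlength=65536))
    return (h_s + h_c) / h_j

# Example: two random 11x11 patches with a full (all-valid) mask.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, (11, 11), dtype=np.uint8)
b = rng.integers(0, 256, (11, 11), dtype=np.uint8)
print(masked_nmi(a, b, np.ones((11, 11), dtype=bool)))
```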

Example HPC Application - GTC16

Summary
• Global Localization as coarse re-alignment
– Problem: joint histogram computation for each solution
• No compromise on the number of bins – 65,536
• Exhaustive search
– Solution: leverage the K80's specifications
• 12 GB of memory
• 1 thread per solution
• Less than 25 seconds for 61K solutions on a 131K-pixel image

Kernel specifications (1 solution / thread)

occupancy                    100%
threads / block              128
stack frame                  264,192 bytes
total memory / block         33.81 MB
total memory / SM            541.06 MB
total memory / GPU           7.03 GB
memory %                     61.06%
spill stores – spill loads   0 – 0
registers                    27
smem / block                 0
smem / SM                    0
smem %                       0.00%
cmem[0] – cmem[2]            448 – 20
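These figures are consistent with each thread keeping its own private joint histogram. Assuming a 4-byte counter per bin (and the K80's 13 SMs from the hardware slide), the totals follow:

$$65{,}536 \text{ bins} \times 4\,\text{B} = 262{,}144\,\text{B} \approx 264{,}192\,\text{B (stack frame)}$$
$$264{,}192\,\text{B} \times 128 \tfrac{\text{threads}}{\text{block}} \approx 33.81\,\text{MB/block};\quad 33.81 \times 16 \tfrac{\text{blocks}}{\text{SM}} \approx 541\,\text{MB/SM};\quad 541 \times 13\,\text{SMs} \approx 7.03\,\text{GB/GPU}$$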

Example HPC Application - GTC16

• Registration

Control 382x100
Tactical 258x67

[Diagram: application pipeline as above; this slide covers the Registration stage.]

Example HPC Application - GTC16

• Registration

Control 382x100
Tactical 258x67
Tactical & Control 4571x1555

[Diagram: application pipeline as above; this slide covers the Registration stage.]

• Objective
– Refine the localization

• Method
– Use ~400× higher resolution
– Keypoint matching

Example HPC Application - GTC16

• Registration Workflow

[Diagram: keypoints are detected in the source image and described with 11×11 intensity values; corresponding 73×73-pixel search windows are taken from the control image; matching descriptors within the search windows produces a tiepoint list.]

• Similarity Metric
– Normalized Mutual Information
– Small "images" but numerous keypoints
• Numerous keypoints
– up to 65,536 with the GPU SURF detector
• Image / descriptor size
– 11 × 11 intensity values to describe
• Search area
– 73 × 73 control sub-image
• Solution space
– 63 × 63 = 3969 per keypoint
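The per-keypoint solution space follows from sliding the 11 × 11 descriptor inside the 73 × 73 search window, and the worst-case keypoint count gives the ~260M coefficients quoted on the summary slide:

$$(73 - 11 + 1)^2 = 63 \times 63 = 3969 \text{ solutions per keypoint};\qquad 3969 \times 65{,}536 \approx 2.6 \times 10^{8} \text{ NMI coefficients}$$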

Application

Descriptors: 11x11 intensity values

Search windows: 73x73 pixels

Solution spaces: 63x63 NMI coefficients

$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_x p(x)\,\log_2 p(x)$$

where $H$ is the entropy and $p(x)$ the probability density function, with $x \in [0, 255]$ for $S$ and $C$ and $x \in [0, 65535]$ for the joint histogram $J$.

Example HPC Application - GTC16

Summary
• Registration refines the re-alignment
– Problem: joint histogram computation for each solution
• No compromise on the number of bins – 65,536
• Exhaustive search
– Solution: leverage the K80's specifications
• 12 GB of memory
• 1 block per solution
• Leverage the limited number of descriptor values: at most 121 << 65,536
• Less than 100 seconds for 65K keypoints (260M NMI coefficients)
• About 10K keypoints in less than 20 seconds

Kernel: find the best match for all keypoints
– 1 block per keypoint, optimized for the 63 × 63 search windows
– 64 threads / block (1 idle); each thread computes a "row" of solutions

Sparse joint histogram: 65,536 bins but only 121 values
– Leverage the 11 × 11 descriptor size
– Create 2 lists (length 121) of intensity values: indices for the source and for the corresponding control subset
– Update the joint histogram counts from the lists
– Loop over the lists to retrieve each aggregate count; set it to 0 after its first retrieval
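A small Python sketch of the sparse joint-histogram idea (illustrative only; the production version is the CUDA kernel described above, one block per keypoint). A dictionary stands in for the per-thread aggregate counts used on the GPU:

```python
"""With 11x11 descriptors there are at most 121 (source, control) pairs, so only
121 of the 65,536 joint-histogram bins can ever be populated."""
import numpy as np

def entropy_from_counts(counts, n):
    """Shannon entropy (base 2) from a sparse {bin: count} mapping."""
    p = np.array(list(counts.values()), dtype=np.float64) / n
    return -np.sum(p * np.log2(p))

def sparse_nmi(src_vals, ctl_vals):
    """NMI of two length-121 intensity lists using sparse histograms only."""
    n = len(src_vals)
    joint, h_src, h_ctl = {}, {}, {}
    for s, c in zip(src_vals.tolist(), ctl_vals.tolist()):
        joint[(s, c)] = joint.get((s, c), 0) + 1   # at most 121 populated joint bins
        h_src[s] = h_src.get(s, 0) + 1
        h_ctl[c] = h_ctl.get(c, 0) + 1
    return (entropy_from_counts(h_src, n) + entropy_from_counts(h_ctl, n)) / \
           entropy_from_counts(joint, n)

# Exhaustive 63x63 search of one 11x11 descriptor inside one 73x73 window.
rng = np.random.default_rng(0)
descriptor = rng.integers(0, 256, (11, 11), dtype=np.uint8)
window     = rng.integers(0, 256, (73, 73), dtype=np.uint8)
best = max(
    ((dy, dx, sparse_nmi(descriptor.ravel(), window[dy:dy + 11, dx:dx + 11].ravel()))
     for dy in range(63) for dx in range(63)),
    key=lambda t: t[2],
)
print("best offset and NMI:", best)
```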


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

Compile Time Results

[Chart: compile time in seconds and resulting binary size in MB for CUDA 7.5 vs CUDA 8.0, by compute-capability setting (OFF, 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1, 3.0–5.2, 3.0–6.1).]

Run Time Results

[Charts: average run time in seconds for datasets D1–D4, comparing CPU, CUDA 7.5, and CUDA 8.]

K80 - Kernel Time Results in Seconds with nvprof

[Charts: Step 1 and Step 2 kernel timings (average, min, max, std) for datasets D1–D4, CUDA 7.5 vs CUDA 8.]

Run Time Results

[Charts: Step 2 kernel time in seconds (average, min, max, std) for datasets D1–D4 on the K40, K80, M4000, and P6000, CUDA 7.5 vs CUDA 8.]


Outline

• Background

• Example HPC Application

• Study Results

• Lessons Learned / Future Work

Lessons Learned

• GPU isolation: ran into an issue when swapping out the P6000 and K40
– nvidia-smi swapped the GPU IDs of the K40 and M4000
– This caused nvidia-docker to ignore the NV_GPU value
– UUID vs. index
– Our application can set the GPU index in a multi-GPU environment (defaults to 0)
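A sketch of one way to pin a container to a GPU by UUID instead of index, so device re-enumeration cannot silently change the target card; it assumes the nvidia-docker v1 NV_GPU variable and uses example device and image names:

```python
"""Pin a container to a specific GPU via its stable UUID rather than its index."""
import os
import subprocess

# Map GPU names to UUIDs via nvidia-smi (names may repeat on multi-GPU nodes;
# a real script would key on bus ID or serial instead).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,uuid", "--format=csv,noheader"]
).decode()
uuids = dict(line.split(", ") for line in out.strip().splitlines())

env = dict(os.environ, NV_GPU=uuids["Tesla K40m"])   # pin by UUID, not index
subprocess.check_call(
    ["nvidia-docker", "run", "--rm", "hpc-app:cuda8-sm35", "/profile.sh"],  # hypothetical image/script
    env=env,
)
```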

Future Work

• Move off desktop machines to a full testing platform with dedicated hardware and multiple GPU types

• Investigate Docker Registry & Docker Swarm for managing containers

• Enhance Database analysis to autogenerate reports

• Generalize the process so that any GPU application can be containerized and profiled with this architecture

Thank you!


Customer Resources

DELL C4130

GPU 4 x K80

RAM 256GB

Cores 48

SSD Storage 400GB

[Chart: run time with 6 threads in seconds for datasets D1–D4, CPU vs CUDA 7.5.]
