Maximizing GPU Power for Vision and Depth Sensor Processing

From NVIDIA's Tegra K1 to GPUs on the Cloud

Chen Sagiv, Eri Rubin

SagivTech Ltd.

Source: on-demand.gputechconf.com/...sensor-K1-tegra-cloud.pdf


Page 1:

Maximizing GPU Power for Vision and Depth Sensor Processing

From NVIDIA's Tegra K1 to GPUs on the Cloud

Chen Sagiv, Eri Rubin

SagivTech Ltd.

Page 2:

• Mobile Revolution

• Mobile – Cloud Concept

• 3D Imaging

• Two use cases:

• SceneNet on Tegra K1

• Depth Sensing on Tegra K1

• SagivTech Streaming Infrastructure

• Take-home Tips for Tegra K1

Today’s Talk

Page 3:

• Established in 2009 and headquartered in Israel

• Core domain expertise: GPU Computing and Computer Vision

• What we do: Technology, Solutions, Projects, EU Research, Training

• GPU expertise:

- Hard-core optimizations
- Efficient streaming for single- or multi-GPU systems
- Mobile GPUs

SagivTech Snapshot

Page 4:

• In 1984, this was cutting-edge science fiction in The Terminator

• 30 years later, science fiction is becoming a reality!

The Mobile Revolution is happening now!

Page 5:

The Combined Model: Mobile & Cloud Computing

Page 6:

• Understanding, interpretation and interaction with our surroundings via mobile devices

• Demand for immense processing power for implementation of computationally-intensive algorithms in real time with low latency

• Computation tasks are divided between the device and the server

• With CUDA – it’s simply easier!

Mobile – Cloud Concept

Page 7:

• Acquisition – Depth Sensors

• Processing – modeling, segmentation, recognition, tracking

• Visualization – Digital Holography

3D Imaging is happening now!

Page 8:

• If you’ve been to a concert recently, you’ve probably seen how many people take videos of the event with mobile phone cameras

• Each user has only one video – taken from one angle and location and of only moderate quality

Mobile Crowdsourcing Video Scene Reconstruction

Page 9:

Leverage the power of multiple mobile phone cameras to create a high-quality 3D video experience that is sharable via social networks.

The Idea behind SceneNet

Page 10:

Creation of the 3D Video Sequence

The scene is photographed by several people using their cell phone cameras.

The video data is transmitted via the cellular network to a High Performance Computing server.

Following time synchronization, resolution normalization and spatial registration, the several videos are merged into a 3D video cube.

Page 11:

The Event Community

A 3D video event is created.

The 3D video event will be available on the internet as a public or private event.

The event will create a community, where each member may provide another piece of the puzzle and view all of the information.

Page 12:

GPU Computing in SceneNet

Video Registration & 3D Reconstruction

Computational Acceleration

Page 13:

Bilateral Filter Acceleration on Tegra K1

Page 14:

Bilateral Filter Acceleration on Tegra K1

Page 15:

Bilateral Filter Acceleration on Tegra K1

Page 16:

Bilateral Filter Acceleration on Tegra K1

Image Size     1 CPU Thread   4 CPU Threads   GPU      Speedup (GPU vs. 4 CPU threads)
256 x 256      630 ms         170 ms          2.8 ms   x60
512 x 512      2550 ms        690 ms          12 ms    x57
1024 x 1024    10300 ms       2720 ms         45 ms    x60
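The bilateral filter being timed weights each neighbor by both spatial closeness and intensity similarity. The slides do not show the kernel itself, so the following is only a minimal CPU reference in C++; the image layout (row-major grayscale) and the sigma parameters are illustrative assumptions, not taken from the talk:

```cpp
#include <cmath>
#include <vector>

// Bilateral filter for one pixel of a row-major grayscale image.
// Each neighbor's weight is the product of a spatial Gaussian and a
// range (intensity-difference) Gaussian.
float bilateralPixel(const std::vector<float>& img, int w, int h,
                     int x, int y, int radius,
                     float sigmaSpace, float sigmaRange) {
    float center = img[y * w + x];
    float sum = 0.0f, norm = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;  // clamp at borders
            float v  = img[ny * w + nx];
            float ws = std::exp(-(dx * dx + dy * dy) /
                                (2.0f * sigmaSpace * sigmaSpace));
            float wr = std::exp(-(v - center) * (v - center) /
                                (2.0f * sigmaRange * sigmaRange));
            sum  += ws * wr * v;
            norm += ws * wr;
        }
    }
    return sum / norm;  // norm > 0: the center pixel always contributes
}
```

On a GPU, the speedups in the table above typically come from mapping one thread per output pixel and staging the neighborhood in shared memory; this sketch only fixes the arithmetic being accelerated.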

Page 17:

• The Mission: Running a depth sensing technology on a mobile platform

• The Challenge: First time on Tegra K1

• Extreme optimizations on a CPU-GPU platform to allow the device to handle other tasks in parallel

• The Expertise:

• Mantis Vision – the 3D core technology and Structured light algorithms

• SagivTech – the GPU computing expertise

• The bottom line: depth sensing is running in real time, in parallel with other compute-intensive applications!

First Depth Sensing Module for Mobile Devices – on Tegra K1

Page 18:

• In one word: Easy!

• Started with the most similar platform - GTX630, based on the GK208.

• Took only a few hours to transfer all the code.

• What's our secret?

Migrating from Discrete Kepler to K1

Page 19:

Our Infra is composed of a set of modules:

SagivTech Infra Stack

STInfraSys

STInfraGPU

STStreamingGPU

STMultiGPU

STCudaKernels

STCudaFunctions

STGLInterop

Page 20:

for (int ....) {
    START_BLOCK_TIME();
    // ... calculate some stuff ...
    TAKE_BLOCK_SUB_TIME("2. First Part");
    // ... calculate some stuff ...
    TAKE_BLOCK_SUB_TIME("3. Second Part");
}

Timing Code Sample: a simple one-liner to time a block

Page 21:

Timers:
---------
BENCHMARK:                 Recent Avg   Global Avg   Max time   Count
---------------------------------------------------------------------
|MyFunc.1. First Part         142.594      142.659    156.859     100
|MyFunc.2. Calculation        1706.63      1720.07    1987.78     100

Timing Code Sample: a simple one-liner to time a block
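The macro names and report format suggest a block timer built on a monotonic clock. Below is a guess at how such macros could be implemented with std::chrono; this is an illustrative sketch, not SagivTech's actual code, and it reduces the report to a running average and count:

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

// Illustrative re-creation of block timing macros: START_BLOCK_TIME()
// resets a reference point, and each TAKE_BLOCK_SUB_TIME(name) records
// the milliseconds elapsed since the previous mark under `name`.
struct BlockTimer {
    std::chrono::steady_clock::time_point mark = std::chrono::steady_clock::now();
    std::map<std::string, std::pair<double, long>> stats;  // name -> (total ms, count)

    void reset() { mark = std::chrono::steady_clock::now(); }

    double sub(const std::string& name) {
        auto now = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(now - mark).count();
        auto& s = stats[name];
        s.first += ms;
        ++s.second;
        mark = now;  // next sub-time measures from here
        return ms;
    }

    void report() const {
        std::printf("BENCHMARK: %-24s %12s %8s\n", "name", "avg (ms)", "count");
        for (const auto& kv : stats)
            std::printf("|%-24s %12.3f %8ld\n", kv.first.c_str(),
                        kv.second.first / kv.second.second, kv.second.second);
    }
};

static BlockTimer g_timer;
#define START_BLOCK_TIME()        g_timer.reset()
#define TAKE_BLOCK_SUB_TIME(name) g_timer.sub(name)
```

std::chrono::steady_clock is used rather than system_clock so that intervals stay correct even if the wall clock is adjusted mid-run.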

Page 22:

The major functionalities provided by the NDArray are:
• Initialize an NDArray of any arbitrary size
• Bind to an existing device/host pre-allocated pointer
• Copy to/from host/device
• Load and save functionality to/from file, especially useful for regression purposes
• Most of the functionality of the NDArray is done in an asynchronous manner

NDArray

Page 23:

• STL style code, no need to free and alloc

• Async is hidden from the user

NDArray Code Sample

st::CArray1D<int> arr_h1;                           // host array, empty until Init
st::CArray1D<int> arr_d1(iArrayLength, false, 512); // device array
arr_h1.Init(iArrayLength);                          // allocate iArrayLength elements on the host
arr_h1.Fill(11);                                    // fill with a constant value
arr_h1.CopyTo(arr_d1);                              // host-to-device copy, async is hidden
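The essence of this interface can be sketched on the CPU alone; the real CArray1D additionally manages device memory, streams, and asynchronous copies. Only Init, Fill, and CopyTo appear on the slide, so everything else here (class name included) is invented for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU-only sketch of an STL-style 1D array: allocation is owned by the
// object, so there is no explicit alloc/free, matching the slide's point.
template <typename T>
class Array1D {
public:
    Array1D() = default;
    explicit Array1D(std::size_t n) { Init(n); }

    void Init(std::size_t n) { data_.assign(n, T{}); }            // (re)allocate
    void Fill(const T& v)    { std::fill(data_.begin(), data_.end(), v); }

    // In the real library this copy could be host<->device and asynchronous;
    // here it is a plain synchronous host-side copy.
    void CopyTo(Array1D<T>& dst) const { dst.data_ = data_; }

    std::size_t Size() const { return data_.size(); }
    const T& operator[](std::size_t i) const { return data_[i]; }

private:
    std::vector<T> data_;  // RAII: freed automatically on destruction
};
```

The RAII style is what makes "no need to free and alloc" work: buffer lifetime follows object lifetime, on host or device alike.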

Page 24:

Single line regression system

Regression Code Sample

st::RegressionParameters par = st::System::GetInstance().GetRegParams();
par.mode = regressionMode;
st::System::GetInstance().SetRegressionParams(par);

if (!ST_REGRESSION(h_cmpNDArr)) return 1;
return 0;

Page 25:

ST MultiGPU Real World Use Case

Four GPUs, four pipes

Utilization: 96%+

FPS: 20.46

Scaling: 3.79 – near-linear scaling!

Note: NO gaps in the profiler
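Gap-free utilization across four pipes comes from keeping every pipe fed from a shared work queue. A minimal host-side producer/consumer sketch in C++ (frames are stood in by integers, and each worker thread stands in for one GPU pipe; the real infra drives CUDA streams per GPU):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Thread-safe frame queue: producers push, workers pop until the queue is
// closed and drained. This is the backbone that keeps pipes gap-free.
class FrameQueue {
public:
    void push(int frame) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(frame); }
        cv_.notify_one();
    }
    bool pop(int& frame) {  // returns false once closed and empty
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        frame = q_.front(); q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Runs `workers` consumer threads over `frames` frames; returns the number
// of frames processed (stand-in for per-GPU pipeline work).
int runPipeline(int frames, int workers) {
    FrameQueue q;
    std::atomic<int> processed{0};
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            int f;
            while (q.pop(f)) ++processed;  // "process" the frame
        });
    for (int f = 0; f < frames; ++f) q.push(f);
    q.close();
    for (auto& t : pool) t.join();
    return processed.load();
}
```

Near-linear scaling (3.79 of 4, about 95% efficiency) requires that workers never starve, which is why the queue blocks instead of polling and the producer runs ahead of the consumers.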

Page 26:

GPU streaming

Page 27:

• Need to remember that Android is overlaid on a Linux base

• Code development and testing (including CUDA) can be done on any PC

• Profiling on Logan – NVProf for Logan – can be ported to your PC

Key Points for Developing on the K1

Page 28:

• There is a strong separation between the Android system and the NDK

• A CUDA developer doesn’t need to become an Android developer

• From the Android developer viewpoint this is simply a library

• An Android developer doesn’t need to become a CUDA developer

Key Points for Developing on the K1

Page 29:

• Only 1 SMX (compared to 15 on the K20X)

• Only one RAM, shared by the CPU and the GPU

• Shared memory is similar in behavior to shared memory in Kepler 2

• LDG - very useful, easy optimization

• We used Thrust and moved to CUB (for streams)

• Will be possible to use existing library infrastructure on Logan

Take Home Tips for CUDA on Tegra K1

Page 30:

• Development methodology is similar to discrete GPU development

• No dynamic parallelism

• No Hyper-Q

• Don’t underestimate Tegra’s CPU - the challenge is to divide work between the various components

Take Home Tips for CUDA on Tegra K1

Page 31:

This project is partially funded by the European Union under the 7th Framework Programme, FET-Open SME, Grant agreement no. 309169.

Mobile Crowdsourcing Video Scene Reconstruction

Page 32:

Thank You

For more information please contact:

Nizan Sagiv

nizan@sagivtech.com

+972 52 811 3456