General purpose processing using embedded GPUs: A study of latency and its variation

Matthias Rosenthal and Amin Mazloumian, May 2016

Zürcher Fachhochschule

Agenda

•  General-purpose GPU computing
•  Embedded CPU/GPU versus CPU/FPGA
•  CPU-GPU data transfer
    –  Unified Virtual Addressing (DMA)
    –  Memory mapped (zero copy)
•  Latency results
•  Kernel-loop solution avoiding GPU kernel launch

GPU Computing

Originally used for 3D game rendering, GPUs are now heavily used in:

•  High-performance computing
•  Financial modeling
•  Robotics
•  Gas and oil exploration
•  Cutting-edge scientific research

→ What about embedded systems?

CPU vs. GPU

[Figure: CPU vs. GPU performance comparison. SP = single precision, DP = double precision. Source: http://michaelgalloy.com/2013/06/11/cpu-vs-gpu-performance.html]

CPU vs. GPU

•  CPUs: huge cache, optimized for several threads: sequential instructions
•  GPUs: 100+ simple cores for huge parallelization: intensive parallelization

Discrete vs. Integrated GPU

[Figure: block diagrams of a discrete GPU and an integrated GPU, each showing CPU, GPU, and cache.]

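CPU/GPU Computing vs. CPU/FPGA

Criteria compared: flexibility & maintenance, power consumption, development cost, latency, latency variation.

                       CPU/GPU         CPU/FPGA
Latency                Microseconds    Nanoseconds
Latency variation      ?               No variation

[Figure: the slide also rates flexibility & maintenance, power consumption, and development cost as High, Mid, or Low for each platform; the per-column ratings are not recoverable from the extracted text. A further label reads (CPU/GPU/DSP/FPGA).]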

Example: Nvidia TK1

-  GPU: 192 CUDA cores
-  CPU: quad-core ARM Cortex-A15
-  Video decode: Full HD, 60 Hz
-  Video encode: Full HD, 30 Hz
-  Networking: Gigabit Ethernet

GPU Programming: CUDA

[Figure: CUDA software stack with the labels "Linux compilation model", "Additional libraries", and "Standard CUDA program". Source: https://code.msdn.microsoft.com/vstudio/NVIDIA-GPU-Architecture-45c11e6d]

Nvidia TK1

[Figure: GPU memory hierarchy with the labels "64 KByte configurable L1/SMEM/RO", "128 KByte L2", and three blocks of 192 cores, with the TK1 marked. Source: GPU Performance Analysis, Nvidia (2012)]

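Data Transfer on TK1

[Figure: TK1 block diagram with input video/audio/data, CPU and CPU cache, GPU and GPU cache, input and output buffers in DRAM, and output video/audio/data. A question mark marks the open choice of transfer method.]

2 options for data transfer to the GPU in CUDA:

•  Unified Virtual Addressing (GPU DMA transfer)
•  Memory mapped (zero copy)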

CUDA Data Transfer

Method 1: Unified Virtual Addressing (with CPU-GPU DMA)

•  Allocation in GPU memory
•  Local access for the first GPU
•  No direct CPU access
•  DMA transfer CPU <-> GPU

[Figure: one CPU and two GPUs; data is moved with cudaMemcpy.]

CUDA Data Transfer

GPU processing with Unified Virtual Addressing (DMA):

Step 1: Copy data to GPU memory

Step 2: Process data in the GPU using 1000s of threads

Step 3: Copy results back to host memory

CUDA Data Transfer

    // Step 0: allocate memory
    cudaMalloc( &dev_a, size );
    cudaMalloc( &dev_b, size );
    cudaMalloc( &dev_c, size );

    // Step 1: copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); // GPU-DMA
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // GPU-DMA

    // Step 2: launch add() kernel on GPU
    add <<< N, M >>>( dev_a, dev_b, dev_c );

    // Step 3: copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

CUDA Data Transfer

Method 2: Memory mapped (zero copy)

•  Allocation in CPU memory
•  Local access for the CPU
•  Memory mapped for the GPUs

[Figure: one CPU and two GPUs.]

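CUDA Data Transfer

GPU processing, memory mapped (zero copy):

Step 1: Copy data to GPU memory

Step 2: Process data in the GPU using 1000s of threads

Step 3: Copy results back to host memory

With mapped memory the GPU reads and writes the host buffers directly, so steps 1 and 3 need no explicit cudaMemcpy.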

Typical GPU workflow: Memory-mapped

    // Step 0: allocate memory (cudaMalloc is replaced by cudaMallocHost)
    // cudaMalloc( &dev_a, size );
    cudaMallocHost( &dev_a, size );
    // cudaMalloc( &dev_b, size );
    cudaMallocHost( &dev_b, size );
    // cudaMalloc( &dev_c, size );
    cudaMallocHost( &dev_c, size );

    // Step 1: copy inputs to device (no longer needed)
    // cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    // cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    // Step 2: launch add() kernel on GPU
    add <<< N, M >>>( dev_a, dev_b, dev_c );

    // Step 3: copy device result back to host copy of c (no longer needed)
    // cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
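The memory-mapped workflow can be sketched as a complete function. This is a minimal sketch, not the authors' code: the body of `add`, its length argument `n`, and the helper name `zero_copy_add` are assumptions, and it uses `cudaHostAlloc` with `cudaHostAllocMapped` plus `cudaHostGetDevicePointer`, a variant that does not rely on host and device pointers being interchangeable.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Element-wise add. The slides launch add<<<N, M>>> without showing its body,
// so this definition (including the length argument n) is an assumption.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element vectors through mapped host memory
// and returns true if every result matches the expected sum.
bool zero_copy_add(int n) {
    const size_t size = n * sizeof(float);
    float *a = nullptr, *b = nullptr, *c = nullptr;              // host views
    float *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;  // device views

    // Request support for mapped host memory. Older toolkits want this call
    // before the CUDA context exists; the return value is ignored here.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Step 0: allocate pinned host memory that is mapped into the GPU's
    // address space, then ask for the matching device pointers.
    bool ok =
        cudaHostAlloc((void**)&a, size, cudaHostAllocMapped) == cudaSuccess &&
        cudaHostAlloc((void**)&b, size, cudaHostAllocMapped) == cudaSuccess &&
        cudaHostAlloc((void**)&c, size, cudaHostAllocMapped) == cudaSuccess &&
        cudaHostGetDevicePointer((void**)&dev_a, a, 0) == cudaSuccess &&
        cudaHostGetDevicePointer((void**)&dev_b, b, 0) == cudaSuccess &&
        cudaHostGetDevicePointer((void**)&dev_c, c, 0) == cudaSuccess;

    if (ok) {
        // Step 1 disappears: the CPU fills the buffers in place.
        for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

        // Step 2: launch the kernel on the mapped buffers.
        const int threads = 256;
        add<<<(n + threads - 1) / threads, threads>>>(dev_a, dev_b, dev_c, n);

        // Step 3 disappears as well, but the launch is asynchronous:
        // wait for the kernel before the CPU reads c.
        ok = cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (c[i] == 3.0f * i);
    }

    if (a) cudaFreeHost(a);
    if (b) cudaFreeHost(b);
    if (c) cudaFreeHost(c);
    return ok;
}
```

Calling `zero_copy_add(1024)` from `main` should return true on a device that supports mapped host memory. The only synchronization point left is `cudaDeviceSynchronize()`; no `cudaMemcpy` appears.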

DMA vs. Memory-mapped

[Figure: measured comparison of DMA (cudaMemcpy) and memory-mapped (zero copy) transfers, annotated "Factor 2".]

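GPU Latency Variation: output = input

    // __global__ because the kernel is launched as identity<<<1,1>>>
    __global__ void identity(float *input, float *output, int numElem) {
        for (int index = 0; index < numElem; index++) {
            output[index] = input[index];
        }
    }

[Figure: measured latency for input size = 25, with annotations (90%) and (0.01%).]

Tested on a Linux kernel with PREEMPT_RT (full preemption).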
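A measurement of this kind could be set up as follows. This is a hedged sketch only: the slides do not show how the latencies were recorded, so the host-side `std::chrono` timing, the use of plain device memory, the nearest-rank percentile rule, and the helper names `percentile` and `measure_launch_latency_us` are all assumptions.

```cuda
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Same copy kernel as on the slide.
__global__ void identity(const float* input, float* output, int numElem) {
    for (int index = 0; index < numElem; index++) {
        output[index] = input[index];
    }
}

// Nearest-rank percentile (p in 0..100) of a set of samples.
// Takes the vector by value because it sorts its own copy.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    size_t rank = (size_t)std::ceil(p * samples.size() / 100.0);
    rank = std::max<size_t>(1, std::min(rank, samples.size()));
    return samples[rank - 1];
}

// Times `launches` launch-plus-synchronize round trips of identity<<<1,1>>>
// and returns each duration in microseconds.
std::vector<double> measure_launch_latency_us(int numElem, int launches) {
    using clock = std::chrono::steady_clock;
    std::vector<double> us;
    float *in = nullptr, *out = nullptr;
    if (cudaMalloc((void**)&in, numElem * sizeof(float)) != cudaSuccess) return us;
    if (cudaMalloc((void**)&out, numElem * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return us;
    }
    cudaMemset(in, 0, numElem * sizeof(float));

    us.reserve(launches);
    for (int i = 0; i < launches; ++i) {
        auto t0 = clock::now();
        identity<<<1, 1>>>(in, out, numElem);
        cudaDeviceSynchronize();  // wait until the kernel has finished
        auto t1 = clock::now();
        us.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }

    cudaFree(in);
    cudaFree(out);
    return us;
}
```

With the samples from `measure_launch_latency_us(25, 100000)`, `percentile(samples, 90.0)` gives the value that 90% of launches stay at or below, and `percentile(samples, 99.99)` exposes the rare slow launches. Seeing a 0.01% tail at all needs many launches.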

GPU Latency Variation

-  There is a huge variation in processing time.
-  For 100 bytes of data (25 float values) per thread:
    -  90% of the launches take less than 40 microseconds.
    -  0.01% of the launches take around 500 microseconds.
-  Slow launches drop the update rate from 25 kHz to 2 kHz.
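The two update rates follow from the two latencies if one update completes per kernel launch, which is the assumption made in this short sketch. The function name `update_rate_hz` is illustrative only.

```cuda
// Update rate in Hz when each update costs one launch of the given latency.
// Assumes launches run back to back with no other overhead.
__host__ __device__ inline double update_rate_hz(double latency_us) {
    return 1.0e6 / latency_us;
}
```

A 40-microsecond launch corresponds to `update_rate_hz(40.0)`, i.e. 25 kHz, and a 500-microsecond launch to `update_rate_hz(500.0)`, i.e. 2 kHz. A system that has to guarantee its update rate must therefore be sized for the rare slow launch, not the typical one.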

GPU Latency Variation

[Figure: latency of identity<<<1,1>>> on the Jetson TK1 with the RT kernel for input sizes 25, 250, 2500, and 25000; the 90% level is marked.]

Our Solution for Latency Variation

GPU, kernel loop:

    while (true) {
        poll_CPU_flag();
        output_data = fct(input_data);
    }

CPU:

    ...
    wait_for_input_in_DRAM();
    flag_to_GPU();
    ...

[Figure: TK1 block diagram with CPU and CPU cache, GPU and GPU cache, and input and output buffers in DRAM.]

•  Implement kernel loops in GPU cores
•  Memory-mapped (zero copy) data access
•  Each GPU kernel loop produces output from its input data (memory-mapped)
•  The number of GPU cores limits the number of kernel loops
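The kernel-loop idea can be sketched as one long-running kernel that polls a flag in mapped host memory instead of being relaunched for every input. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Mailbox` layout, the doubling function `fct`, the `quit` flag, and the helper `run_kernel_loop` are all hypothetical, and a single GPU thread is used for clarity.

```cuda
#include <atomic>
#include <cstring>
#include <cuda_runtime.h>

const int kElems = 25;  // 25 floats = 100 bytes, as in the measurements

// Hypothetical shared block living in mapped (zero copy) host memory.
struct Mailbox {
    int request;   // CPU writes the round number here: flag_to_GPU()
    int response;  // GPU writes the round it has finished
    int quit;      // CPU sets this to end the kernel loop
    float input[kElems];
    float output[kElems];
};

// Placeholder for the real processing function.
__device__ float fct(float x) { return 2.0f * x; }

// Launched once; volatile accesses force every poll to re-read memory.
__global__ void kernel_loop(volatile Mailbox* box) {
    int seen = 0;
    while (true) {
        while (box->request == seen && !box->quit) { }  // poll_CPU_flag()
        if (box->quit) break;
        seen = box->request;
        __threadfence_system();  // read the flag before reading the data
        for (int i = 0; i < kElems; ++i) box->output[i] = fct(box->input[i]);
        __threadfence_system();  // publish the data before the flag
        box->response = seen;
    }
}

// Hypothetical driver: feeds `rounds` inputs through the running kernel and
// returns true if every output matches fct(input).
bool run_kernel_loop(int rounds) {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // return value ignored, see lead-in
    Mailbox* host_box = nullptr;
    Mailbox* dev_box = nullptr;
    if (cudaHostAlloc((void**)&host_box, sizeof(Mailbox), cudaHostAllocMapped)
            != cudaSuccess) return false;
    std::memset(host_box, 0, sizeof(Mailbox));
    bool ok = cudaHostGetDevicePointer((void**)&dev_box, host_box, 0) == cudaSuccess;
    if (ok) kernel_loop<<<1, 1>>>(dev_box);  // the only kernel launch

    volatile Mailbox* box = host_box;
    for (int r = 1; ok && r <= rounds; ++r) {
        for (int i = 0; i < kElems; ++i) box->input[i] = (float)(r + i);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // data before flag
        box->request = r;
        while (ok && box->response != r) {
            // Stop waiting if the kernel is no longer running.
            ok = (cudaStreamQuery(0) == cudaErrorNotReady);
        }
        std::atomic_thread_fence(std::memory_order_seq_cst);  // flag before data
        for (int i = 0; ok && i < kElems; ++i) {
            ok = (box->output[i] == 2.0f * (r + i));
        }
    }

    box->quit = 1;  // let the kernel loop exit
    cudaDeviceSynchronize();
    cudaFreeHost(host_box);
    return ok;
}
```

`run_kernel_loop(100)` exercises 100 request and response round trips with a single launch. The `cudaStreamQuery` call inside the wait loop is a safety net for this sketch; a latency-critical host loop would likely poll only the flag. A long-running kernel can also be stopped by a display watchdog on GPUs that drive a screen.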

SoCs with GPU as Industrial Modules

[Figure: photos of industrial modules: Nvidia TK1 module (two shown), Nvidia TX1 module, Snapdragon 820 module, Allwinner A80 module.]

Sources: Nvidia, Avionic Design, Toradex, Intrinsyc, Theobroma Systems

SoCs with GPU as Industrial Modules

Mobile processor applications:

•  Android TV
•  Video conferencing
•  Lecture recording and streaming
•  Medical imaging
•  Driving assistance

Source: Google / PMK

Conclusion

-  Our results confirm that for small data chunks, memory-mapped transfers are more efficient
-  We observe a huge but rare variation in GPU processing time
-  The variation reduces the update rate by an order of magnitude
-  Our solution is to implement GPU kernel loops and memory-mapped transfer