Upload
doankien
View
216
Download
0
Embed Size (px)
Citation preview
0 0
0
0 University of Michigan
Leveraging Mobile GPUs for Flexible High-speed Wireless Communication
Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski *University of Michigan, Ann Arbor
The 3rd International Workshop on Parallelism in Mobile Platforms
(PRISM-3) Jun 14, 2015
1 1
1
1 University of Michigan
Popular Mobile Device & Applications
2 2
2
2 University of Michigan
1.0E-01
1.0E+00
1.0E+01
1.0E+02
1.0E+03
2000 2004 2008 2010
Fast Technology Evolution
3 3
3
3 University of Michigan
Support of Multiple Standards
4 4
4
4 University of Michigan
Wireless Processor in Mobile Devices ! Hardware accelerator based processors
! ASIC ! DSP with many ASICs ! FPGA customized as ASICs
" High performance " High energy efficiency
Software-defined Radio!
✕ Long time to market ✕ Hard and costly to maintain compatibility
5 5
5
5 University of Michigan
Software-defined Radio (SDR) ! Wireless system is implemented entirely in software
! General-purpose programming language
" Very short time to market • Only implement new software for new technologies
" Easy to maintain compatibility • Different programs for different standards
? Performance ? Energy efficiency
6 6
6
6 University of Michigan
Processor for SDR
! Good programmability
! High computing throughput
! High energy efficiency
! Explore DLP and TLP
GPU!
7 7
7
7 University of Michigan
Desktop GPU Power ≥ 100W
Graphics Processing Unit (GPU) ! High throughput:
! GOPS-level peak throughput
! SIMT execution model ! Make use of abundant DLP and TLP
! Good programming support ! CUDA/OpenCL
Mobile GPU Power ~ 1W
8 8
8
8 University of Michigan
Our Contribution ! Parallel implementations of LTE baseband kernels on a mobile
GPGPU
! Performance of running LTE baseband ! Meet the specification peak data rate for physical layer kernels ! Get close to 10Mpbs typical data rate with for the Turbo decoder ! Meet 20ms latency budget
! Power and energy efficiency of a mobile GPGPU ! High power compared with application-specific processors ! Still in an acceptable range ! Implications of GPU architecture to further reduce the power
9 9
9
9 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Implications for future mobile GPGPUs
! Conclusion
10 10
10
10 University of Michigan
LTE Baseband Downlink Receiver ! Downlink Receiver has the most computations in a mobile device
RF# OFDM#Demodula.on#
MIMO##Detect#
Channel#Es.mate#
Modula.on#Demapper#Descrambling#Rate#
Match#Turbo#
Decoder#
Antenna
PHY Layer kernel MAC Layer kernel Radio
11 11
11
11 University of Michigan
Parallelism in Each Kernel ! The PHY layer kernels
! Antenna-level ! Symbol-level ! Subcarrier-level ! Algorithm-level
! The Turbo Decoder ! Trellis-level ! Subblock-level
! For all kernels ! Batch multiple packets
12 12
12
12 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Methodology ! PHY layer ! Turbo decoder ! Power and energy efficiency
! Implications for future mobile GPGPUs
! Conclusion
13 13
13
13 University of Michigan
Experimental Setup ! Jetson TK1 Board
! NVIDIA 4-Plus-1™ quad-core ARM® Cortex-A15 CPU
! NVIDIA Kepler GK20a GPU
! 192 CUDA cores
! ≤ 852MHz
! 256 KB RF / 64 KB L1 memory / 128 KB L2 cache
! 2G DDR3L RAM (shared by CPU and GPU)
! Wattsup.net Power Meter ! 33% board power consumed by SoC
14 14
14
14 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Methodology ! PHY layer ! Turbo decoder ! Power and energy efficiency
! Implications for future mobile GPGPUs
! Conclusion
15 15
15
15 University of Michigan
PHY Layer Throughput
Packet batching increases throughput Saturate at 8
0 20 40 60 80
100 120 140 160 180 200
0 5 10 15 20 25 30 35
Ach
ieve
d Th
roug
hput
s (M
bps)
The number of packets 1x1+QPSK 2x2+QPSK 1x1+16QAM 2x2+16QAM
16 16
16
16 University of Michigan
Compare with Specification Peak
0.0#
1.0#
2.0#
3.0#
4.0#
5.0#
6.0#
7.0#
0# 1# 2# 3# 4# 5# 6# 7#
Throughp
ut#Normalized
#to#
Specifica>o
n#Pe
ak#
The#number#of#packets#
1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM#
Fulfill all but one specification peak data rates Overall, 2 packets -- a good batching size
17 17
17
17 University of Michigan
0.0#
2.0#
4.0#
6.0#
8.0#
10.0#
12.0#
14.0#
16.0#
18.0#
0# 1# 2# 3# 4# 5# 6# 7# 8# 9#
Throughp
ut#Normalized
#to#
Average/Typical#U
ser#R
ate#
The#number#of#packets#
1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM#
Compare with 10Mbps Typical Rate
Fulfill all typical user data rates
18 18
18
18 University of Michigan
0 10 20 30 40 50 60 70 80 90
0 5 10 15 20 25 30 35 Wor
st C
ase
Late
ncy
(ms)
The number of packets
1x1+QPSK 2x2+QPSK 1x1+16QAM 2x2+16QAM
PHY Layer Latency
Packet batching also increases latency Packet batching also increases latency, can batch ≤ 8 packets given 20ms budgets
19 19
19
19 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Methodology ! PHY layer ! Turbo decoder ! Power and energy efficiency
! Implications for future mobile GPGPUs
! Conclusion
20 20
20
20 University of Michigan
Turbo Throughput
0"
2"
4"
6"
8"
10"
0" 2" 4" 6" 8" 10" 12" 14" 16" 18"
Turbo"De
coding"
Throughp
ut"(M
bps)"
The"number"of"packets"#Subblock=16" #Subblock=32" #Subblock=64"
Throughput increases with #subblocks an batching sizes
*Qi Zheng, et al, “Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors”, ISCAS 2013, Beijing
32 subblocks + 2 packets: 8.4 Mpbs throughput, 4.8 ms latency <<peak rate, but ~10Mbp typical rate
21 21
21
21 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Methodology ! PHY layer ! Turbo decoder ! Power and energy efficiency
! Implications for future mobile GPGPUs
! Conclusion
22 22
22
22 University of Michigan
Energy Efficiency
Higher frequency achieves better energy efficiency – “Run to finish”
0.0
5.0
10.0
15.0
20.0
25.0
30.0
72 252 468 684 852
Ene
rgy
Effi
cien
cy (M
b/J)
GPU Frequency (MHz)
23 23
23
23 University of Michigan
Power Consumption
Processor Standards Details
ASIC (LeoCore) WiMAX 70mW@ 70MHz
FPGA (MAGALI) LTE 234mW@400MHz
DSP (Tomahawk) LTE, WiMAX 904mW@170MHz
CPU (SoftLTE) LTE [email protected]
Mobile GPU LTE 1390mW@852MHz
24 24
24
24 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Implications for future mobile GPGPUs
! Conclusion
25 25
25
25 University of Michigan
Reduce Energy from Unused Register ! Register file is underutilized running PHY layers
! Due to max_thread/SM and max_thread_block/SM
! Power gate unused registers
0%#10%#20%#30%#40%#50%#60%#70%#80%#90%#100%#
1# 2# 4# 8# 16#
Normalized
#RF#U:liza:
on#
The#number#of#packets#
1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM#
26 26
26
26 University of Michigan
! Trellis Algorithm ! Turbo algorithm ! Viterbi algorithm ! Baum-Welch algorithm ! Coding theory ! Speech recognition ! Data compression
! Architecture and ISA support ! Similar to AES instruction set in Intel CPU
GPU Support for Trellis Algorithm
27 27
27
27 University of Michigan
Outline ! Motivation
! LTE baseband kernels ! Running LTE baseband on a mobile GPGPU
! Implications for future mobile GPGPUs
! Conclusion
28 28
28
28 University of Michigan
Conclusion ! Mobile GPUs can be used for wireless communication in a
mobile device ! Meet specification peaks for 3 out of 4 PHY configurations ! Meet the typical rate for all PHY configurations ! Turbo is close to the typical data rate
! Need special supports for the Turbo decoder in a mobile GPU
! Energy is high compared with application-specific solutions ! But in the accepting range ! Need further optimizations
29 29
29
29 University of Michigan
Thanks!
Any questions?
30 30
30
30 University of Michigan
Backup Slides
31 31
31
31 University of Michigan
Quick Subscription Growth
0
1000
2000
3000
4000
5000
6000
7000
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Wor
ld m
obile
-cel
lula
r sub
scrip
tions
(M
illio
ns)*
Year
*ITU report -- The World in 2011: ICT Facts and Figures
32 32
32
32 University of Michigan
Our Contribution ! Parallel implementations of LTE baseband kernels on a mobile
GPGPU
! Throughputs and latency of different configurations
! Energy efficiency of a mobile GPGPU
33 33
33
33 University of Michigan
Parallelism in PHY Layer Kernels Kernel Parallelism Number of threads
OFDM Modulation
(FFT)
Antenna-level Symbol-level
Algorithm-level
Channel Estimation
Antenna-level Subcarrier-level
MIMO detector Symbol-level Subcarrier-level
Modulation demapper
Antenna-level Symbol-level
Subcarrier-level Algorithm-level
Descrambling Antenna-level Symbol-level
Subcarrier-level
Nant ×Nsym ×NFFT
Nant ×Nsub
Nsym ×Nsub
Nant ×Nsym ×Nsub ×NMod
Nant ×Nsym ×Nsub × log2(NMod )
Nant ×Nsym ×Nsub
34 34
34
34 University of Michigan
Parallelism in Turbo Decoder* ! Total number of threads
! Implementation performance tradeoff
€
Nthread = N packet ⋅Nsubblock ⋅Threadtrellis
Parallelism Scheme Throughput Latency Bit Error Rate
Packet-level Better Worse No Change
Subblock-level Better No Change Worse
Trellis-level Better No Change No Change
Subblock+NII Worse No Change Better
Subblock+TS Worse No Change Better
*Qi Zheng, et al, “Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors”, ISCAS 2013, Beijing
35 35
35
35 University of Michigan
Mobile GPU Architecture ! GPU on NVIDIA Jetson Tegra K1 board
! Kepler architecture – 192 CUDA cores@852 MHz ! 64KB L1 memory / 128KB L2 cache / Share 2GB DDR3L with CPU
36 36
36
36 University of Michigan
Performance Metrics ! Collected stats for a variety of important events ! CUDA Event API
! Throughput ! Latency
! NVIDIA Profiler ! Achieved occupancy ! Memory utilization ! Dynamic instruction count
37 37
37
37 University of Michigan
GPU Occupancy
Packet batching increases throughput… which saturates at 8 packets due to full GPU utilization
0%#10%#20%#30%#40%#50%#60%#70%#80%#90%#100%#
0# 2# 4# 6# 8# 10# 12# 14# 16# 18#
Achieved
#Occup
ancy#
The#number#of#packets#
1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM#
0"20"40"60"80"
100"120"140"160"180"200"
0" 5" 10" 15" 20" 25" 30" 35"
Achieved
"Throu
ghpu
ts"(M
bps)"
The"number"of"packets"
1x1+QPSK" 2x2+QPSK" 1x1+16QAM" 2x2+16QAM"
38 38
38
38 University of Michigan
Hotspot Analysis
0%#10%#20%#30%#40%#50%#60%#70%#80%#90%#100%#
1x1+QPSK#
2x2+QPSK#
1x1+16QAM#
2x2+16QAM#
Execu9
on#Tim
e#Breakd
own#
Rate#Matching#
Descrambling#
Demodula9on#
MIMO#Detec9on#
Channel#Es9ma9on#
FFT#
Demodulation dominates due to its heavy computation
39 39
39
39 University of Michigan
Hotspot Analysis -- #instructions
Demodulation dominates due to its heavy computation
0%#10%#20%#30%#40%#50%#60%#70%#80%#90%#
100%#
1x1+QPSK(
2x2+QPSK(
1x1+16QAM(
2x2+16QAM(
Executed
(Instruc7on
s(
Rate(Matching(
Descrambling(
Demodula7on(
MIMO(Detec7on(
Channel(Es7ma7on(
FFT(
40 40
40
40 University of Michigan
Turbo Throughput
0"
2"
4"
6"
8"
10"
0" 2" 4" 6" 8" 10" 12" 14" 16" 18"
Turbo"De
coding"
Throughp
ut"(M
bps)"
The"number"of"packets"#Subblock=16" #Subblock=32" #Subblock=64"
Throughput increases with the number of subblocks and packets, and starts saturating at 64 subblocks
41 41
41
41 University of Michigan
Frequency Scaling -- PHY
0 5
10 15 20 25 30 35 40 45
0 100 200 300 400 500 600 700 800 900
Ach
ieve
d Th
roug
hput
(M
bps)
GPU Frequency (MHz)
1x1+QPSK 2x2+QPSK 1x1+16QAM 2x2+16QAM
Throughputs have almost linear scaling with the frequency
42 42
42
42 University of Michigan
Frequency Scaling -- Turbo
Throughputs have almost linear scaling with the frequency
0.0#
1.0#
2.0#
3.0#
4.0#
5.0#
6.0#
7.0#
8.0#
9.0#
0# 100# 200# 300# 400# 500# 600# 700# 800# 900#
Achieved
#Throu
ghpu
t#(Mbp
s)#
GPU#Frequency#(MHz)#