Leveraging Mobile GPUs for Flexible High-speed Wireless ...caogao/prism_3_talk.pdf · Leveraging Mobile GPUs for Flexible High-speed Wireless Communication Qi Zheng, Cao Gao, Trevor


Leveraging Mobile GPUs for Flexible High-speed Wireless Communication

Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski
University of Michigan, Ann Arbor

The 3rd International Workshop on Parallelism in Mobile Platforms (PRISM-3), Jun 14, 2015


Popular Mobile Device & Applications



Fast Technology Evolution

[Figure: log-scale axis from 1.0E-01 to 1.0E+03, plotted over the years 2000, 2004, 2008, 2010]


Support of Multiple Standards



Wireless Processor in Mobile Devices

• Hardware accelerator based processors
  • ASIC
  • DSP with many ASICs
  • FPGA customized as ASICs

✓ High performance
✓ High energy efficiency
✕ Long time to market
✕ Hard and costly to maintain compatibility

Software-defined Radio!


Software-defined Radio (SDR)

• Wireless system is implemented entirely in software
  • General-purpose programming language

✓ Very short time to market
  • Only implement new software for new technologies
✓ Easy to maintain compatibility
  • Different programs for different standards

? Performance
? Energy efficiency


Processor for SDR

• Good programmability
• High computing throughput
• High energy efficiency
• Exploit data-level parallelism (DLP) and thread-level parallelism (TLP)

GPU!


Graphics Processing Unit (GPU)

• High throughput
  • GOPS-level peak throughput
• SIMT execution model
  • Makes use of abundant DLP and TLP
• Good programming support
  • CUDA/OpenCL

Desktop GPU power: ≥ 100 W
Mobile GPU power: ~ 1 W


Our Contribution

• Parallel implementations of LTE baseband kernels on a mobile GPGPU
• Performance of running LTE baseband
  • Meets the specification peak data rate for the physical layer kernels
  • Gets close to the 10 Mbps typical data rate for the Turbo decoder
  • Meets the 20 ms latency budget
• Power and energy efficiency of a mobile GPGPU
  • High power compared with application-specific processors
  • Still in an acceptable range
  • Implications for GPU architecture to further reduce power


Outline

• Motivation
• LTE baseband kernels
• Running LTE baseband on a mobile GPGPU
• Implications for future mobile GPGPUs
• Conclusion


LTE Baseband Downlink Receiver

• The downlink receiver has the most computation in a mobile device

[Figure: receiver block diagram. Blocks: Antenna, RF, OFDM Demodulation, MIMO Detect, Channel Estimate, Modulation Demapper, Descrambling, Rate Match, Turbo Decoder. Legend: PHY layer kernel, MAC layer kernel, Radio]


Parallelism in Each Kernel

• The PHY layer kernels
  • Antenna-level
  • Symbol-level
  • Subcarrier-level
  • Algorithm-level
• The Turbo decoder
  • Trellis-level
  • Subblock-level
• For all kernels
  • Batch multiple packets




Outline: Running LTE baseband on a mobile GPGPU

• Methodology
• PHY layer
• Turbo decoder
• Power and energy efficiency


Experimental Setup

• Jetson TK1 board
  • NVIDIA 4-Plus-1™ quad-core ARM® Cortex-A15 CPU
  • NVIDIA Kepler GK20a GPU
    • 192 CUDA cores
    • ≤ 852 MHz
    • 256 KB RF / 64 KB L1 memory / 128 KB L2 cache
  • 2 GB DDR3L RAM (shared by CPU and GPU)
• Wattsup.net power meter
  • 33% of board power is consumed by the SoC








PHY Layer Throughput

[Figure: achieved throughput (Mbps, 0 to 200) versus the number of packets (0 to 35). Series: 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]

• Packet batching increases throughput
• Throughput saturates at 8 packets


Compare with Specification Peak

[Figure: throughput normalized to specification peak (0.0 to 7.0) versus the number of packets (0 to 7). Series: 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]

• All but one configuration fulfills its specification peak data rate
• Overall, 2 packets is a good batching size
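The comparison above divides each achieved throughput by a target rate, so a value of 1.0 or more means the target is met. A minimal sketch of that normalization follows; the function name and the numbers are hypothetical and are not taken from the talk.

```python
def normalized_throughput(achieved_mbps, target_mbps):
    """Return achieved throughput as a multiple of the target rate."""
    if target_mbps <= 0:
        raise ValueError("target_mbps must be positive")
    return achieved_mbps / target_mbps


# Hypothetical values for illustration only (not measurements from the talk).
ratio = normalized_throughput(achieved_mbps=45.0, target_mbps=30.0)
meets_target = ratio >= 1.0
print(f"normalized throughput = {ratio:.2f}, meets target: {meets_target}")
```

The same helper applies to the typical-rate comparison by passing the typical user rate as the target.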


Compare with 10 Mbps Typical Rate

[Figure: throughput normalized to average/typical user rate (0.0 to 18.0); horizontal axis from 0 to 9]

• All configurations fulfill the typical user data rate


PHY Layer Latency

[Figure: worst-case latency (ms, 0 to 90) versus the number of packets (0 to 35). Series: 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]

• Packet batching also increases latency
• At most 8 packets can be batched within the 20 ms budget
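Choosing a batching size under a latency budget amounts to picking the largest batch whose worst-case latency still fits. The sketch below shows that selection; the function name and the latency table are hypothetical placeholders, not the measurements behind the slide.

```python
def largest_batch_within_budget(latency_ms_by_batch, budget_ms=20.0):
    """Return the largest batch size whose worst-case latency fits the budget.

    latency_ms_by_batch maps batch size -> worst-case latency in ms.
    Returns None when no batch size fits.
    """
    feasible = [b for b, lat in latency_ms_by_batch.items() if lat <= budget_ms]
    return max(feasible) if feasible else None


# Hypothetical worst-case latencies in ms (illustration only).
example_latency_ms = {1: 3.0, 2: 5.5, 4: 10.0, 8: 19.0, 16: 38.0, 32: 76.0}
best_batch = largest_batch_within_budget(example_latency_ms)
print(f"largest batch within 20 ms: {best_batch}")
```

With real measurements, one table per antenna and modulation configuration would be needed, since latency differs across configurations.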








Turbo Throughput

[Figure: Turbo decoding throughput (Mbps, 0 to 10) versus the number of packets (0 to 18). Series: 16, 32, and 64 subblocks]

• Throughput increases with the number of subblocks and the batching size
• 32 subblocks + 2 packets: 8.4 Mbps throughput, 4.8 ms latency
• Far below the peak rate, but close to the ~10 Mbps typical rate

*Qi Zheng, et al., “Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors”, ISCAS 2013, Beijing








Energy Efficiency

[Figure: energy efficiency (Mb/J, 0.0 to 30.0) versus GPU frequency (72, 252, 468, 684, 852 MHz)]

• Higher frequency achieves better energy efficiency: “run to finish”
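Energy efficiency in Mb/J is throughput in Mbps divided by power in W. The sketch below uses a toy model to show one way a fixed power overhead could make efficiency rise with frequency, which is consistent with "run to finish". The talk does not give a power model, so every constant here is hypothetical, and the model ignores voltage scaling.

```python
def energy_efficiency_mb_per_joule(throughput_mbps, power_watts):
    """Mb/J equals Mbps divided by W, since 1 W = 1 J/s."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return throughput_mbps / power_watts


# Toy model, all constants hypothetical: throughput grows linearly with
# frequency; power has a fixed part plus a part proportional to frequency.
STATIC_POWER_W = 1.0
DYNAMIC_W_PER_MHZ = 0.0005
MBPS_PER_MHZ = 0.01

frequencies_mhz = [72, 252, 468, 684, 852]  # frequency points shown on the slide
efficiencies = [
    energy_efficiency_mb_per_joule(
        MBPS_PER_MHZ * f, STATIC_POWER_W + DYNAMIC_W_PER_MHZ * f
    )
    for f in frequencies_mhz
]
for f, e in zip(frequencies_mhz, efficiencies):
    print(f"{f} MHz: {e:.2f} Mb/J")
```

In this model the fixed power is amortized over more bits per second at higher clocks, so the printed efficiency increases monotonically.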


Power Consumption

Processor         Standards     Details
ASIC (LeoCore)    WiMAX         70 mW @ 70 MHz
FPGA (MAGALI)     LTE           234 mW @ 400 MHz
DSP (Tomahawk)    LTE, WiMAX    904 mW @ 170 MHz
CPU (SoftLTE)     LTE           [email protected]
Mobile GPU        LTE           1390 mW @ 852 MHz







Reduce Energy from Unused Registers

• The register file is underutilized when running the PHY layer kernels
  • Due to the max_thread/SM and max_thread_block/SM limits
• Power gate unused registers

[Figure: normalized RF utilization (0% to 100%); horizontal axis ticks 1, 2, 4, 8, 16]
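To see how per-SM thread and block limits can leave registers idle, the sketch below estimates register file utilization for a given kernel shape. The register count of 65536 follows from the 256 KB register file in the setup (4 bytes per register); the limits of 2048 threads and 16 blocks per SM are assumed Kepler values, and the kernel shape is hypothetical. Allocation granularity and shared-memory limits are ignored.

```python
def register_file_utilization(
    threads_per_block,
    regs_per_thread,
    rf_registers=65536,        # 256 KB / 4 B per register
    max_threads_per_sm=2048,   # assumed Kepler limit
    max_blocks_per_sm=16,      # assumed Kepler limit
):
    """Estimate the fraction of the register file used by resident blocks."""
    blocks_by_threads = max_threads_per_sm // threads_per_block
    blocks_by_regs = rf_registers // (threads_per_block * regs_per_thread)
    resident_blocks = min(max_blocks_per_sm, blocks_by_threads, blocks_by_regs)
    used = resident_blocks * threads_per_block * regs_per_thread
    return used / rf_registers


# Hypothetical kernel: small blocks with few registers per thread.
utilization = register_file_utilization(threads_per_block=64, regs_per_thread=20)
print(f"register file utilization: {utilization:.1%}")
```

In this example the block limit caps residency at 16 blocks, so most of the register file stays unused; that idle share is the kind of capacity power gating would target.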


GPU Support for Trellis Algorithm

• Trellis algorithm
  • Turbo algorithm
  • Viterbi algorithm
  • Baum-Welch algorithm
  • Coding theory
  • Speech recognition
  • Data compression
• Architecture and ISA support
  • Similar to the AES instruction set in Intel CPUs
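As an illustration of the kind of inner loop that trellis algorithms repeat, the sketch below implements one add-compare-select step of Viterbi decoding over a generic trellis. Whether this is the primitive the authors would accelerate is an assumption; the function name and the example trellis are hypothetical.

```python
import math


def acs_step(path_metrics, branches):
    """Run one add-compare-select step (lower metric is better).

    path_metrics: current metric for each state.
    branches: iterable of (prev_state, next_state, branch_metric).
    Returns (new_metrics, survivors), where survivors[s] is the chosen
    predecessor of state s, or None if s is unreachable.
    """
    n = len(path_metrics)
    new_metrics = [math.inf] * n
    survivors = [None] * n
    for prev, nxt, branch_metric in branches:
        candidate = path_metrics[prev] + branch_metric  # add
        if candidate < new_metrics[nxt]:                # compare
            new_metrics[nxt] = candidate                # select
            survivors[nxt] = prev
    return new_metrics, survivors


# Hypothetical 2-state trellis with four branches.
metrics, survivors = acs_step(
    [0.0, 5.0],
    [(0, 0, 1.0), (1, 0, 0.5), (0, 1, 2.0), (1, 1, 0.0)],
)
print(metrics, survivors)
```

Each next state can be processed independently within a step, which is what makes trellis-level parallelism possible.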







Conclusion

• Mobile GPUs can be used for wireless communication in a mobile device
  • Meet specification peaks for 3 out of 4 PHY configurations
  • Meet the typical rate for all PHY configurations
  • Turbo is close to the typical data rate
• The Turbo decoder needs special support in a mobile GPU
• Energy is high compared with application-specific solutions
  • But in the acceptable range
  • Needs further optimization


Thanks!

Any questions?



Backup Slides



Quick Subscription Growth

[Figure: world mobile-cellular subscriptions (millions, 0 to 7000)* by year, 1998 to 2011]

*ITU report, The World in 2011: ICT Facts and Figures


Our Contribution

• Parallel implementations of LTE baseband kernels on a mobile GPGPU
• Throughput and latency of different configurations
• Energy efficiency of a mobile GPGPU


Parallelism in PHY Layer Kernels

Kernel                  Parallelism                                                     Number of threads
OFDM Modulation (FFT)   Antenna-level, Symbol-level, Algorithm-level                    N_ant × N_sym × N_FFT
Channel Estimation      Antenna-level, Subcarrier-level                                 N_ant × N_sub
MIMO detector           Symbol-level, Subcarrier-level                                  N_sym × N_sub
Modulation demapper     Antenna-level, Symbol-level, Subcarrier-level, Algorithm-level  N_ant × N_sym × N_sub × N_Mod
                                                                                        N_ant × N_sym × N_sub × log2(N_Mod)
Descrambling            Antenna-level, Symbol-level, Subcarrier-level                   N_ant × N_sym × N_sub
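The thread counts in the table can be evaluated for a concrete configuration. The sketch below does so; the function name, the dictionary keys, and the example parameters (2 antennas, 14 symbols, 1200 subcarriers, 2048-point FFT, 16QAM) are assumptions for illustration, and the talk does not state which values were used. The table lists two expressions for the modulation demapper, and the sketch keeps both without assuming which step uses which.

```python
import math


def phy_thread_counts(n_ant, n_sym, n_sub, n_fft, n_mod):
    """Evaluate the per-kernel thread-count expressions from the table."""
    bits_per_symbol = int(math.log2(n_mod))
    return {
        "ofdm_fft": n_ant * n_sym * n_fft,
        "channel_estimation": n_ant * n_sub,
        "mimo_detector": n_sym * n_sub,
        "demapper_n_mod": n_ant * n_sym * n_sub * n_mod,
        "demapper_log2_n_mod": n_ant * n_sym * n_sub * bits_per_symbol,
        "descrambling": n_ant * n_sym * n_sub,
    }


# Hypothetical configuration, chosen for illustration only.
counts = phy_thread_counts(n_ant=2, n_sym=14, n_sub=1200, n_fft=2048, n_mod=16)
for kernel, threads in counts.items():
    print(f"{kernel}: {threads} threads")
```

The counts span more than two orders of magnitude across kernels, which is one reason batching packets helps the smaller kernels fill the GPU.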


Parallelism in Turbo Decoder*

• Total number of threads

  $N_{\text{thread}} = N_{\text{packet}} \cdot N_{\text{subblock}} \cdot \text{Thread}_{\text{trellis}}$

• Implementation performance tradeoff

Parallelism Scheme   Throughput   Latency     Bit Error Rate
Packet-level         Better       Worse       No Change
Subblock-level       Better       No Change   Worse
Trellis-level        Better       No Change   No Change
Subblock+NII         Worse        No Change   Better
Subblock+TS          Worse        No Change   Better

*Qi Zheng, et al., “Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors”, ISCAS 2013, Beijing
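The total thread count is the product of the three parallelism levels. The sketch below evaluates it for 2 packets and 32 subblocks; the value of 8 threads per trellis is an assumption (one thread per state of the 8-state LTE turbo trellis) and is not stated in the talk.

```python
def turbo_thread_count(n_packet, n_subblock, threads_per_trellis):
    """Total threads = packets x subblocks x threads per trellis."""
    for name, value in (
        ("n_packet", n_packet),
        ("n_subblock", n_subblock),
        ("threads_per_trellis", threads_per_trellis),
    ):
        if value < 1:
            raise ValueError(f"{name} must be at least 1")
    return n_packet * n_subblock * threads_per_trellis


# Assumed: 8 threads per trellis, one per state of the LTE turbo code.
total_threads = turbo_thread_count(n_packet=2, n_subblock=32, threads_per_trellis=8)
print(f"total threads: {total_threads}")
```

Doubling either the packet count or the subblock count doubles the thread count, but per the tradeoff table only the subblock count affects the bit error rate.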


Mobile GPU Architecture

• GPU on the NVIDIA Jetson Tegra K1 board
  • Kepler architecture: 192 CUDA cores @ 852 MHz
  • 64 KB L1 memory / 128 KB L2 cache / 2 GB DDR3L shared with the CPU


Performance Metrics

• Collected stats for a variety of important events
• CUDA Event API
  • Throughput
  • Latency
• NVIDIA Profiler
  • Achieved occupancy
  • Memory utilization
  • Dynamic instruction count
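CUDA events report elapsed time in milliseconds, so throughput follows from the number of bits processed in a timed region. The sketch below shows that conversion only; the function name and the example figures are hypothetical, and the timing call itself is left out.

```python
def throughput_mbps(bits_processed, elapsed_ms):
    """Convert a bit count and an elapsed time in ms to Mbps."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    bits_per_ms = bits_processed / elapsed_ms  # numerically equal to kbps
    return bits_per_ms / 1e3


# Hypothetical measurement: 1,000,000 bits decoded in 100 ms.
rate = throughput_mbps(bits_processed=1_000_000, elapsed_ms=100.0)
print(f"throughput: {rate:.1f} Mbps")
```

For a batched run, the elapsed time of the whole batch is also the latency seen by each packet in it, which is why batching trades latency for throughput.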


GPU Occupancy

[Figure: achieved occupancy (0% to 100%); horizontal axis from 0 to 18]
[Figure: achieved throughput (Mbps, 0 to 200) versus the number of packets (0 to 35). Series: 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]

• Packet batching increases throughput, which saturates at 8 packets due to full GPU utilization


Hotspot Analysis

[Figure: execution time breakdown (0% to 100%) for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM. Components: Rate Matching, Descrambling, Demodulation, MIMO Detection, Channel Estimation, FFT]

• Demodulation dominates due to its heavy computation


Hotspot Analysis: Executed Instructions

[Figure: executed instructions breakdown (0% to 100%) for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM. Components: Rate Matching, Descrambling, Demodulation, MIMO Detection, Channel Estimation, FFT]

• Demodulation dominates due to its heavy computation


Turbo Throughput

[Figure: Turbo decoding throughput (Mbps, 0 to 10) versus the number of packets (0 to 18). Series: 16, 32, and 64 subblocks]

• Throughput increases with the number of subblocks and packets, and starts saturating at 64 subblocks


Frequency Scaling: PHY

[Figure: achieved throughput (Mbps, 0 to 45) versus GPU frequency (MHz, 0 to 900). Series: 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]

• Throughput scales almost linearly with frequency


Frequency Scaling: Turbo

[Figure: achieved throughput (Mbps, 0.0 to 9.0) versus GPU frequency (MHz, 0 to 900)]

• Throughput scales almost linearly with frequency
