1 1
1
1 University of Michigan
Architecting an LTE Base Station with Graphics Processing Units
Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti+, Achilleas Anastasopoulos*,
Scott Mahlke*, Trevor Mudge* *University of Michigan, Ann Arbor +Arizona State University, Tempe
SiPS ’13
Oct 17, 2013
2 2
2
2 University of Michigan
Wireless Base Station
3 3
3
3 University of Michigan
Baseband SDR in a Base Station
GOPS Throughput
4 4
4
4 University of Michigan
1.0E-01
1.0E+00
1.0E+01
1.0E+02
1.0E+03
Peak
Dat
a R
ate
(Mbp
s)
Technology Evolution
5 5
5
5 University of Michigan
Baseband processor
Processor for Wireless Base Station
! Flexibility -- Good programmability
! Performance -- High computing throughput
6 6
6
6 University of Michigan
GPU
Baseband processor
Processor for Wireless Base Station
! Flexibility -- Good programmability
! Performance -- High computing throughput
7 7
7
7 University of Michigan
Graphics Processing Unit " High Throughput:
" GOPS/TOPS-level peak throughput
" Good Programming Support " CUDA " OpenCL
" High Efficiency
Processor GFLOP/dollar GFLOP/watt Nvidia GeForce GTX680 6.192 15.848
Intel Xeon E7-8837 0.037 0.656
Intel Itanium 8350 0.007 0.150
8 8
8
8 University of Michigan
GPU Architecture " SIMT – Single Instruction Multiple Threads
" Thousands of cores on a GPU " Explore Data-Level Parallelism
t0 t1 t2 tn-2 tn-1 tn ..…
d0 d1 d2 dn-2 dn-1 dn ..…
Instruction Instruction Fetch Unit
ALU
Memory
9 9
9
9 University of Michigan
GPU Architecture " SIMT – Single Instruction Multiple Threads
" Thousands of cores on a GPU " Explore Data-Level Parallelism
" Multithreading " Hide long memory latency
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
Thread Group 0: ld R0 [R1+Offset] sub R2, R0, #2 add R0, R2, R3
Dcache miss
10 10
10
10 University of Michigan
GPU Architecture " SIMT – Single Instruction Multiple Threads
" Thousands of cores on a GPU " Explore Data-Level Parallelism
" Multithreading " Hide long memory latency
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
Thread Group 1: mul R3,R4,R5,R6 ld R0 [R1+Offset] sub R2, R0, #2 add R0, R2, R3
11 11
11
11 University of Michigan
GPU Architecture " SIMT – Single Instruction Multiple Threads
" Thousands of cores on a GPU " Explore Data-Level Parallelism
" Multithreading " Hide long memory latency
GOPS/TOPS-level peak throughput
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
SIMT
Multithreading
12 12
12
12 University of Michigan
GPU Mapping Challenge " Core underutilization
" Over 1000 cores on a commercial GPU
" Pipeline stall " Long memory access latency
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
t0 t1 t2 tn-2 tn-1 tn ..…
SIMT
Multithreading
13 13
13
13 University of Michigan
Our Contribution " Previous works
" Mapped only a single kernel onto a GPU
" Base station study on traditional platforms, such as DSP, FPGA, etc
" In this work " Key Kernel Parallelization
" Kernel runtime performance
" Minimum number of GPUs needed
" System Power consumption
14 14
14
14 University of Michigan
Outline " Motivation
" GPU Architecture " Key Kernels Parallelization
" Experimental Results
" Conclusion
15 15
15
15 University of Michigan
List of Key Kernels " PHY Layer
" SC-FDMA demodulation (FFT) " Transform decoder (IDFT) " Channel estimation " MIMO detection " Modulation demapper
" Turbo decoder
16 16
16
16 University of Michigan
Parallelism in PHY Layer Kernels
Parallelism Description Num of threads
User-level Process data from different users in parallel #thread = Nusr
Antenna-level Process data from different receiver antennae in parallel #thread = Nant
Symbol-level Processing SC-FDMA symbols in a subframe in parallel #thread = Nsym
Subcarrier-level Each subcarrier in a symbol of is processed in parallel #thread = Nsub
Algorithm-level parallelism inherent in each algorithm
Varies based on kernels
17 17
17
17 University of Michigan
Parallelism in PHY Layer Kernels Kernel Parallelism Number of threads
FFT/IFFT User-level
Antenna-level Symbol-level
Algorithm-level
Channel Estimation
User-level Antenna-level
Subcarrier-level
MIMO detector User-level
Symbol-level Subcarrier-level
Modulation demapper
User-level Antenna-level Symbol-level
Subcarrier-level Algorithm-level
€
Nusr × Nant × Nsym × NFFT
€
Nusr × Nant × Nsub
€
Nusr × Nsym × Nsub
€
Nusr × Nant × Nsym × Nsub × NMod
Nusr × Nant × Nsym × Nsub × log2 (NMod )
18 18
18
18 University of Michigan
Parallelism in Turbo Decoder* " Total number of threads
" Implementation performance tradeoff
€
Nthread = N packet ⋅Nsubblock ⋅Threadtrellis
Parallelism Scheme Throughput Latency Bit Error Rate
Packet-level Better Worse No Change
Subblock-level Better No Change Worse
Trellis-level Better No Change No Change
Subblock+NII Worse No Change Better
Subblock+TS Worse No Change Better
*Qi Zheng, et al, “Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors”, ISCAS 2013, Beijing
19 19
19
19 University of Michigan
Experimental Setup " Nvidia GeForce GTX680
" 8 Streaming Multiprocessors " 1536 Streaming Processors " 64KB L1 cache + shared memory " 512KB L2 cache " 2GB DRAM
" GPU Runtime measure " CUDA event record
" GPU Power measure " GPU-Z
20 20
20
20 University of Michigan
Experimental Setup Kernel Configuration
Turbo decoder
Code Rate = 1/3, Codeword Length = 6144, Iteration Number = 5
Modulation demapper 16QAM/64QAM
SC-FDMA FFT 2048
Decoding IFFT 1200
MIMO 1x1, 2x2, 4x4
21 21
21
University of Michigan
" One LTE subframe on a GPU " Different antenna configurations
PHY Layer Kernel Runtime
!"
!#$"
!#%"
!#&"
!#'"
("
(#$"
(#%"
(#&"
(#'"
$"
))*" +))*" ,+,-"./0/1023" 45677/8"/9:;6:27"
,2.<86:27"./;6==/3>(&?@,"
,2.<86:27"./;6==/3>&%?@,"
ABC"D6E/3"F/
37/8"G<7
:;/9"H;
9I"2J"6
7"D*K"9<LJ36;/"
(M(" $M$" %M%"
22 22
22
22 University of Michigan
Turbo Decoder Performance
Schemes Throughput
(Mbps)
Worst-case Codeword Latency
(ms)
BER
Tellis-level Subblock Num
Codeword Num
State-level 512 2 77.6 0.7 1.6×10-3
State-level 256 4 78.2 1.7 4.1×10-3
State-level For-Back 256 2 78.3 0.7 4.1×10-3
State-level For-Back 128 7 80.6 3.1 2.0×10-3
23 23
23
23 University of Michigan
Turbo Decoder Performance
Schemes Throughput
(Mbps)
Worst-case Codeword Latency
(ms)
BER
Tellis-level Subblock Num
Codeword Num
State-level 512 2 77.6 0.7 1.6×10-3
State-level 256 4 78.2 1.7 4.1×10-3
State-level For-Back 256 2 78.3 0.7 4.1×10-3
State-level For-Back 128 7 80.6 3.1 2.0×10-3
24 24
24
24 University of Michigan
!"
#"
$"
%"
&"
'"
("
)"
*"
+"
#!"
'!" )'" #!!" #'!" $!!" %!!"
,-.
/01"2
3"456
(*!"47
89"
70:;"<:=:">:=0"?@/A9B"
7CD" 5-1/2" 52=:E"
Number of needed GPUs " Meet latency and throughput requirements
€
• (tk )∑ ≤ 1ms for a subframe• Thturbo ≥Thsys
25 25
25
25 University of Michigan
Number of needed GPUs " Meet latency and throughput requirements
!"
#"
$"
%"
&"
'"
("
)"
*"
+"
#!"
'!" )'" #!!" #'!" $!!" %!!"
,-.
/01"2
3"456
(*!"47
89"
70:;"<:=:">:=0"?@/A9B"
7CD" 5-1/2" 52=:E"
€
• (tk )∑ ≤ 1ms for a subframe• Thturbo ≥Thsys
26 26
26
26 University of Michigan
!"
#"
$"
%"
&"
'"
("
)"
*"
+"
#!"
'!" )'" #!!" #'!" $!!" %!!"
,-.
/01"2
3"456
(*!"47
89"
70:;"<:=:">:=0"?@/A9B"
7CD" 5-1/2" 52=:E"
Number of needed GPUs " Meet latency and throughput requirements
€
• (tk )∑ ≤ 1ms for a subframe• Thturbo ≥Thsys
27 27
27
University of Michigan
Power and Energy " Support 75Mbps
" Two GTX680 GPUs + One Intel Core 2 CPU " Total power is 188W
!""#
$%"#
&%'#
()%&#
!%(# !%$#
*+,-./#0123456-78,9#
:4-5;#<,=;<,-## >?@ABCD#AA:##B,=;<E+.#FAA:## C;<4G7H;+#<,87II,-##?J7++,G#,3H87H;+## CFCK#<,L,=L;-##
!"#
!$#
$"#
$$#
%"#
%$#
&'()*#+,-#
!""#
$%"#
&%'#
()%&#
!%(# !%$#
*+,-./#0123456-78,9#
:4-5;#<,=;<,-## >?@ABCD#AA:## B,=;<E+.#FAA:## C;<4G7H;+#<,87II,-## ?J7++,G#,3H87H;+## CFCK#<,L,=L;-##
28 28
28
28 University of Michigan
Conclusion " Highly parallel GPU implementations of all key kernels
" Kernel runtimes under different configurations
" Up to four GTX680 GPUs needed for ≤150Mbps " Can fit into a motherboard with low latency
" Dual-GPU solution consumes 188W for 75Mbps " Competitive with a commercial solution
29 29
29
29 University of Michigan
Thanks!
Any questions?
30 30
30
30 University of Michigan
Backup
31 31
31
31 University of Michigan
Our Contribution " Previous works
" Mapped only a single LTE kernel onto a GPU
" Base station study on traditional platforms, such as DSP, FPGA, etc
" In this work " Parallel implementations of LTE key signal processing kernels on GPU
" Kernel runtime performance under different system configurations
" The number of GPUs needed for the baseband subsystem in an LTE base station
" Power consumption of the GPU-based solution
32 32
32
32 University of Michigan
Parallelism in PHY Layer Kernels " User-level Parallelism
" #thread = Nusr
" Antenna-level Parallelism " #thread = Nant
" Symbol-level Parallelism " #thread = Nsym
" Subcarrier-level Parallelism " #thread = Nsub
" Algorithm-level Parallelism
33 33
33
33 University of Michigan
Parallelism in PHY Layer Kernels " User-level Parallelism
" #thread = Nusr
" Antenna-level Parallelism " #thread = Nant
" Symbol-level Parallelism " #thread = Nsym
" Subcarrier-level Parallelism " #thread = Nsub
" Algorithm-level Parallelism
34 34
34
34 University of Michigan
Parallelism in PHY Layer Kernels " User-level Parallelism
" #thread = Nusr
" Antenna-level Parallelism " #thread = Nant
" Symbol-level Parallelism " #thread = Nsym
" Subcarrier-level Parallelism " #thread = Nsub
" Algorithm-level Parallelism
35 35
35
35 University of Michigan
Parallelism in PHY Layer Kernels " User-level Parallelism
" #thread = Nusr
" Antenna-level Parallelism " #thread = Nant
" Symbol-level Parallelism " #thread = Nsym
" Subcarrier-level Parallelism " #thread = Nsub
" Algorithm-level Parallelism
36 36
36
36 University of Michigan
Parallelism in PHY Layer Kernels " User-level Parallelism
" #thread = Nusr
" Antenna-level Parallelism " #thread = Nant
" Symbol-level Parallelism " #thread = Nsym
" Subcarrier-level Parallelism " #thread = Nsub
" Algorithm-level Parallelism
37 37
37
37 University of Michigan
Parallelism in PHY Layer Kernels " User-level Parallelism
" #thread = Nusr
" Antenna-level Parallelism " #thread = Nant
" Symbol-level Parallelism " #thread = Nsym
" Subcarrier-level Parallelism " #thread = Nsub
" Algorithm-level Parallelism
38 38
38
38 University of Michigan
Number of needed GPUs " The minimum number of GTX680 GPUs needed for the
baseband system of an LTE base station
Peak Data Rate (Mbps)
Number of GPUs
PHY Turbo Total
50 1 1 2
75 1 1 2
100 1 2 3
150 2 2 4
200 2 3 5
300 5 4 9
39 39
39
39 University of Michigan
Processor for Wireless Base Station
Good programmability
High computing throughput
40 40
40
40 University of Michigan
Processor for Wireless Base Station
High computing throughput
Good programmability GPU
41 41
41
University of Michigan
Power and Energy " Support 75Mbps
" two GTX680 GPUs + one Intel Core 2 CPU " Total power is 188W
Kernel Power (W) Energy (J/subframe)
Turbo decoder 63.3 144.0
SC-FDMA FFT 56.7 3.4
Decoding IFFT 56.9 5.7
Modulation demapper 56.3 26.5
Channel estimation 61.8 1.2
MIMO detector 57.7 1.3