Scaling the Peak: Maximizing floating point performance on the Epiphany NoC
Anish Varghese, Gaurav Mitra, Robert Edwards and Alistair Rendell
Research School of Computer Science
The Australian National University
May 30, 2015
A. Varghese et.al. (ANU) Scaling the Peak PTC, 2015 1 / 37
Outline
1 Introduction
2 Programming the Epiphany
3 Performance Experiments
4 Implementing High Performance Applications
5 Energy Measurement
6 Summary
Introduction
Energy efficient HPC at ANU
Parallella-16
Parallella-64 Prototype
NVIDIA Jetson TK1 - ARM-GPU SoC
TI Keystone II - ARM-DSP SoC
Programming the Epiphany Programming Considerations
Programming Considerations
Store code and data in different banks
Interleave FMADD with load/stores
Unroll inner loops
Performance Experiments Experiment Platform
Experiment Platform
Parallella-64 Prototype
ZedBoard evaluation module with Zynq SoC
Daughter card with Epiphany-IV 64-core (E64G401)
Dual core ARM Cortex-A9 host at 667 MHz
Epiphany eCores at 600 MHz
512 MB of DDR3 RAM on the host
32 MB shared with eCores
Performance Experiments On-chip Communication
DMA vs Direct Write Bandwidth
[Figure: bandwidth (MB/s) against total message size (bytes) for DMA transfers and direct on-chip writes. Data labels from the plot:]
Total message size (bytes)   DMA transfer (MB/s)   Direct on-chip write (MB/s)
16                           5.59                  34.42
80                           30.68                 88.37
800                          282.43                91.64
4000                         809.81                91.98
8000                         1,261.62              91.99
24000                        1,801.4               90.19
32000                        1,828.04              90.19
40000                        1,905.67              90.19
48000                        1,963.71              90.19
Message Transfer Latency
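[Figure: transfer time (µs) against total message size (bytes) for DMA transfers and direct on-chip writes. Data labels from the plot:]
Total message size (bytes)   DMA transfer (µs)   Direct on-chip write (µs)
8                            2.73                3.58 × 10^-2
80                           2.74                0.36
200                          2.74                0.9
400                          2.74                1.79
600                          2.74                2.69
800                          2.74                3.58
2000                         3.35                8.95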
Latency of ≈ 7 cycles per transfer
Performance Experiments Off-chip Communication
Core-wise utilization of external memory link
[Figure: heat map of external memory link utilization per eCore on the 8 × 8 grid (rows 1 to 8, columns 1 to 8); colour scale from 0.00 to 0.18.]
150 MB/sec using Direct Writes
400 MB/sec using DMA
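To put these link rates in perspective, the arithmetic below estimates how long one matrix operand would take to cross the external memory link. This is only a back-of-envelope sketch: `transfer_seconds` and `matrix_bytes` are hypothetical helper names, and treating 1 MB as 10^6 bytes is an assumption about the units used on the slide.

```cpp
#include <cassert>
#include <cstddef>

// Seconds needed to move `bytes` over a link sustaining `mb_per_sec`.
// Assumption: 1 MB = 1e6 bytes.
double transfer_seconds(std::size_t bytes, double mb_per_sec)
{
    assert(mb_per_sec > 0.0);
    return static_cast<double>(bytes) / (mb_per_sec * 1e6);
}

// Bytes occupied by an n x n matrix of single-precision floats.
std::size_t matrix_bytes(std::size_t n)
{
    return n * n * sizeof(float);
}
```

Under these assumptions a 512 × 512 single-precision operand (1,048,576 bytes) would need roughly 7 ms at 150 MB/sec and roughly 2.6 ms at 400 MB/sec.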
Implementing High Performance Applications Heat Stencil
Five-point star-shaped heat stencil
\[
T^{new}_{i,j} = w_1 T^{prev}_{i,j+1} + w_2 T^{prev}_{i,j} + w_3 T^{prev}_{i,j-1} + w_4 T^{prev}_{i+1,j} + w_5 T^{prev}_{i-1,j}
\]
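As a point of reference for the formula above, a plain scalar version of one stencil sweep might look like the sketch below. It is not the hand-tuned Epiphany kernel: `stencil_sweep` is a hypothetical name, the row-major layout is an assumption, and boundary cells are simply left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One sweep of the five-point stencil over the interior of a rows x cols
// grid stored row-major. w[0..4] stand for w1..w5 in the formula.
void stencil_sweep(const std::vector<float>& prev, std::vector<float>& next,
                   std::size_t rows, std::size_t cols, const float w[5])
{
    assert(prev.size() == rows * cols && next.size() == rows * cols);
    for (std::size_t i = 1; i + 1 < rows; ++i) {
        for (std::size_t j = 1; j + 1 < cols; ++j) {
            const std::size_t c = i * cols + j;
            next[c] = w[0] * prev[c + 1]      // T_prev(i, j+1)
                    + w[1] * prev[c]          // T_prev(i, j)
                    + w[2] * prev[c - 1]      // T_prev(i, j-1)
                    + w[3] * prev[c + cols]   // T_prev(i+1, j)
                    + w[4] * prev[c - cols];  // T_prev(i-1, j)
        }
    }
}
```

Each interior point costs five multiplies and four adds, which is why the kernel maps naturally onto a chain of fused multiply-add instructions.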
Single-Core Stencil - 20 columns
Diving into Assembly
.Lb:
    dogrid1 26,27,28,29,30,47,48,49,50,51,52,53, 1, 2, 3, 4, 5, 6, 7, 8, 9,10
    ldr r20,[r0,#0 + stride]
    dogrid0 31,32,33,34,35,52,53,54,55,56,57,58, 6, 7, 8, 9,10,11,12,13,14,15
    ldr r41,[r1,#0]
    dogrid1 36,37,38,39,40,57,58,59,60,61,62,63,11,12,13,14,15,16,17,18,19,20
    str r21,[r1],#stride
    ; 2nd row
    dogrid0 43,44,45,46,47,20,21,22,23,24,25,26,16,17,18,19,20,1+stride,2+stride,3+stride,4+stride,5+stride
    add r0,r0,#(stride * 4)
    dogrid1 48,49,50,51,52,25,26,27,28,29,30,31, 1, 2, 3, 4, 5, 6, 7, 8, 9,10
    ldr r42,[r0,#stride]
    dogrid0 53,54,55,56,57,30,31,32,33,34,35,36, 6, 7, 8, 9,10,11,12,13,14,15
    ldr r63,[r1,#0]
    dogrid1 58,59,60,61,62,35,36,37,38,39,40,41,11,12,13,14,15,16,17,18,19,20
    str r43,[r1],#stride
    ; 1st row
    dogrid0 21,22,23,24,25,42,43,44,45,46,47,48,16,17,18,19,20,1+stride,2+stride,3+stride,4+stride,5+stride
    add r0,r0,#(stride * 4)
    sub r2,r2,#1
    nop
    bne .Lb
Writing Macros
.macro dogrid0 a0,a1,a2,a3,a4,b0,b1,b2,b3,b4,b5,b6,o0,o1,o2,o3,o4,i0,i1,i2,i3,i4
    fmadd r15,r\a0,r3
    str r8,[r0,#\o0]
    fmadd r16,r\a1,r3
    str r9,[r0,#\o1]
    fmadd r17,r\a2,r3
    str r10,[r0,#\o2]
    fmadd r18,r\a3,r3
    str r11,[r0,#\o3]
    fmadd r19,r\a4,r3
    str r14,[r0,#\o4]
    fmadd r15,r\b0,r4
    fmadd r16,r\b1,r4
    fmadd r17,r\b2,r4
    fmadd r18,r\b3,r4
    fmadd r19,r\b4,r4
    fmadd r15,r\b1,r5
    .
    .
    .
Interleaved pair
Code available at https://github.com/parallella/parallella-examples/tree/master/heat_stencil
Floating point performance - Single-Core
[Figure: bar chart of single-core GFLOPS (y-axis, 0 to 1.2) against number of columns (20, 40, 60, 80), with one series each for 20, 40, 60 and 80 rows; bar labels range from 0.97 to 1.14 GFLOPS.]
Stencil evaluated for 50 iterations.
81-95% of peak performance
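The percentage-of-peak figures can be checked with a few lines of arithmetic. The sketch below assumes that each eCore retires one FMADD, counted as two floating point operations, per cycle; the slides do not state this explicitly, so treat it as an assumption, and treat `peak_gflops` and `percent_of_peak` as hypothetical helper names.

```cpp
#include <cassert>

// Assumption (not stated on the slides): one FMADD, i.e. 2 flops, per cycle.
constexpr double kFlopsPerCycle = 2.0;

// Peak GFLOPS for `cores` eCores clocked at `clock_mhz`.
double peak_gflops(double clock_mhz, int cores)
{
    assert(clock_mhz > 0.0 && cores > 0);
    return clock_mhz * 1e-3 * kFlopsPerCycle * cores;
}

// Measured performance as a percentage of that peak.
double percent_of_peak(double measured_gflops, double clock_mhz, int cores)
{
    return 100.0 * measured_gflops / peak_gflops(clock_mhz, cores);
}
```

With the 600 MHz eCore clock from the platform slide, a single core peaks at about 1.2 GFLOPS under this assumption, so the lowest and highest bar labels of 0.97 and 1.14 GFLOPS correspond to roughly 81% and 95% of peak.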
Multi-Core Stencil
Synchronization between neighbouring eCores
DMA for boundary transfers
Floating point performance - Multi-Core
[Figure: bar chart of multi-core GFLOPS (y-axis, 0 to 80) against number of columns (20, 40, 60, 80), with one series each for 20, 40, 60 and 80 rows; bar labels range from 45.7 to 72.8 GFLOPS.]
Lighter colors show performance without communication
65.2 GFLOPS: 85% of peak performance with communication
Implementing High Performance Applications Matrix Multiplication
Single-Core Matmul
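\[
\begin{aligned}
C_{11} &= A_{11}B_{11} + A_{12}B_{21} + A_{13}B_{31} + \dots \\
C_{12} &= A_{11}B_{12} + A_{12}B_{22} + A_{13}B_{32} + \dots \\
&\;\;\vdots \\
C_{21} &= A_{21}B_{11} + A_{22}B_{21} + A_{23}B_{31} + \dots \\
C_{22} &= A_{21}B_{12} + A_{22}B_{22} + A_{23}B_{32} + \dots
\end{aligned}
\]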
Operand matrices in different memory banks
Stack moved to bottom half of bank 1
Hand-tuned assembly code
Diving into Assembly (again)
.macro doMult areg,index,
    fmadd r32,r\areg,r16
    ldrd r22,[r1,#\index + 3]
    fmadd r33,r\areg,r17
    ldr r\aprev,[r0,#\inc]
    fmadd r34,r\areg,r18
    fmadd r35,r\areg,r19
    fmadd r36,r\areg,r20
    fmadd r37,r\areg,r21
    ldrd r16,[r1,#\index + 4]
    fmadd r38,r\areg,r22
    ldrd r18,[r1,#\index + 5]
    fmadd r39,r\areg,r23
    ldrd r20,[r1,#\index + 6]
    fmadd r40,r\areg,r16
    ldrd r22,[r1,#\index + 7]
    .
    .
    .
Interleaved pair
Code available at https://github.com/parallella/parallella-examples/tree/master/matmul_optimized
Floating point performance - Single-Core
Dimensions   GFLOPS   % Peak
8 × 8        0.85     70.5
16 × 16      1.07     89.5
20 × 20      1.11     92.5
24 × 24      1.12     93.4
32 × 32      1.15     95.9
Multi-Core Matmul - Cannon’s Algorithm
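For readers unfamiliar with Cannon's algorithm, the sketch below simulates its schedule on the host for a p × p grid of "cores", each owning a b × b block. It is an illustration under stated assumptions, not the Epiphany implementation: `cannon_matmul` is a hypothetical name, and the neighbour-to-neighbour block shifts are modelled by index arithmetic rather than real data transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major, n x n with n = p * b

// Host-side simulation of Cannon's algorithm on a p x p grid of cores.
Matrix cannon_matmul(const Matrix& A, const Matrix& B, std::size_t p, std::size_t b)
{
    const std::size_t n = p * b;
    assert(A.size() == n * n && B.size() == n * n);
    Matrix C(n * n, 0.0f);

    // After the initial skew, core (i, j) holds A block (i, (i + j) mod p)
    // and B block ((i + j) mod p, j). Every later step shifts A one block
    // left and B one block up, which advances k by one (mod p).
    for (std::size_t step = 0; step < p; ++step) {
        for (std::size_t i = 0; i < p; ++i) {
            for (std::size_t j = 0; j < p; ++j) {
                const std::size_t k = (i + j + step) % p;
                // Local block multiply-accumulate: C(i,j) += A(i,k) * B(k,j).
                for (std::size_t r = 0; r < b; ++r) {
                    for (std::size_t c = 0; c < b; ++c) {
                        float acc = C[(i * b + r) * n + (j * b + c)];
                        for (std::size_t t = 0; t < b; ++t) {
                            acc += A[(i * b + r) * n + (k * b + t)] *
                                   B[(k * b + t) * n + (j * b + c)];
                        }
                        C[(i * b + r) * n + (j * b + c)] = acc;
                    }
                }
            }
        }
    }
    return C;
}
```

After p steps every core has seen each of the p block pairs it needs exactly once, so only one block of A and one block of B has to be resident per core at any time.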
Buffering Strategy
Transfer of Matrix A and Matrix B
Floating Point Performance - Multi-Core
Matrix C      Num of eCores: 4 × 4     Num of eCores: 8 × 8
(per-core)    GFLOPS    % Peak         GFLOPS    % Peak
8 × 8         5.1       26             20.3      26
16 × 16       12.8      67             51.4      67
20 × 20       14.4      75             57.6      75
24 × 24       15.4      80             62.2      81
32 × 32       16.3      85             65.3      85
Floating Point Performance - Off-chip
Matrix C       GFLOPS   % Peak    % Memory Transfers
512 × 512      8.3      10.8 %    87.2 %
1024 × 1024    8.6      11.1 %    86.9 %
1536 × 1536    6.3      8.2 %     89.1 %
Energy Measurement Setup
Measurement Setup
Energy Measurement API
Measurement API
#include <boost/asio.hpp>

/* io, serial, serial_port and baud_rate are declared outside these functions (not shown on the slide) */

/* Initialize serial port IO objects from Measured Device */
void power_init()
{
    io = new boost::asio::io_service();
    serial = new boost::asio::serial_port(*io, serial_port);
    serial->set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
}

/* Send single character 'r' to reset ADC timer and start measurement */
void power_start()
{
    boost::asio::write(*serial, boost::asio::buffer("r",1));
}

/* Send single character 's' to stop ADC timer and measurement */
void power_stop()
{
    boost::asio::write(*serial, boost::asio::buffer("s",1));
}

/* Perform serial IO object cleanup */
void power_cleanup()
{
    delete serial;  /* destroy the port before the io_service it was built on */
    delete io;
}
Energy Measurement Measurements
Energy Efficiency for Matmul - Comparisons
Parallella-16
Dimensions     nJoules/Flop
128 × 128      0.33
256 × 256      1.82
512 × 512      1.98
1024 × 1024    1.83

NVIDIA Jetson
Dimensions     nJoules/Flop
128 × 128      9.7
256 × 256      1.06
512 × 512      0.16
1024 × 1024    0.06
128×128 experiment on Parallella-16 used only on-chip SRAM
Same numbers can be derived analytically
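One plausible reading of the analytic route is that energy per flop is simply average power divided by sustained flop rate, or equivalently measured energy divided by the flop count of the multiplication. The sketch below encodes both forms; the function names are hypothetical, the 2n³ flop count for an n × n matmul is the usual convention rather than something stated on the slide, and no power figures from the experiments are assumed.

```cpp
#include <cassert>

// Energy per flop in nanojoules from average power (W) and sustained GFLOPS.
// 1 W / 1 GFLOPS = 1e-9 J per flop = 1 nJ per flop.
double nanojoules_per_flop(double avg_power_watts, double sustained_gflops)
{
    assert(avg_power_watts >= 0.0 && sustained_gflops > 0.0);
    return avg_power_watts / sustained_gflops;
}

// Flop count of a dense n x n matrix multiplication
// (one multiply and one add per inner-product term).
double matmul_flops(double n)
{
    return 2.0 * n * n * n;
}

// Energy per flop in nanojoules from a measured energy to solution (J).
double nanojoules_per_flop_from_energy(double energy_joules, double n)
{
    assert(energy_joules >= 0.0 && n > 0.0);
    return energy_joules / matmul_flops(n) * 1e9;
}
```

As a purely hypothetical example, a device drawing 2 W while sustaining 4 GFLOPS would spend 0.5 nJoules/Flop.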
Summary
Summary & Future Work
Achieved high-performance
85% of Peak for On-chip
Introduced a method to measure energy to solution
High programming effort was required
Compiler could be optimized further
Energy efficiency
Observed on-chip efficiency of 330 pJoules/Flop
4096 eCores => 10 pJoules/Flop
Higher bandwidth and more SRAM per eCore needed
Questions?