
Scaling the Peak: Maximizing floating point performance on the Epiphany NoC

Anish Varghese, Gaurav Mitra, Robert Edwards and Alistair Rendell

Research School of Computer Science
The Australian National University

May 30, 2015

A. Varghese et.al. (ANU) Scaling the Peak PTC, 2015 1 / 37


Outline

1 Introduction

2 Programming the Epiphany

3 Performance Experiments

4 Implementing High Performance Applications

5 Energy Measurement

6 Summary


Introduction

Energy efficient HPC at ANU

Parallella-16
Parallella-64 Prototype
NVIDIA Jetson TK1 - ARM-GPU SoC
TI Keystone II - ARM-DSP SoC

Programming the Epiphany

Programming Considerations

Store code and data in different banks

Interleave FMADD with load/stores

Unroll inner loops
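
As a rough illustration of the last two points, the sketch below uses portable C++ rather than Epiphany assembly. It unrolls a dot-product loop by four and keeps four independent accumulators, so consecutive fused multiply-adds do not depend on each other and operand loads can overlap with arithmetic. The function name `dot_unrolled4` and the unroll factor are assumptions made for this example, not taken from the slides.

```cpp
#include <cmath>
#include <cstddef>

// Dot product with the inner loop unrolled by four.
// Four independent accumulators form four separate FMA dependency chains,
// which leaves room to interleave loads for the next operands.
float dot_unrolled4(const float* a, const float* b, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 = std::fma(a[i],     b[i],     acc0);
        acc1 = std::fma(a[i + 1], b[i + 1], acc1);
        acc2 = std::fma(a[i + 2], b[i + 2], acc2);
        acc3 = std::fma(a[i + 3], b[i + 3], acc3);
    }
    for (; i < n; ++i) {  // remainder when n is not a multiple of 4
        acc0 = std::fma(a[i], b[i], acc0);
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Whether a compiler produces a comparable schedule from such code depends on the target and flags; the deck itself relies on hand-written assembly.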

Performance Experiments

Experiment Platform

Parallella-64 Prototype

ZedBoard evaluation module with Zynq SoC
Daughter card with Epiphany-IV 64-core (E64G401)
Dual core ARM Cortex-A9 host at 667 MHz
Epiphany eCores at 600 MHz
512 MB of DDR3 RAM on the host
32 MB shared with eCores
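
The eCore clock rate fixes the peak against which the later "% of peak" figures can be read. The helper below is a minimal sketch that assumes each eCore completes one fused multiply-add (2 flops) per cycle; that rate is an assumption of this example, chosen because it is consistent with the percentages reported later in the deck. The function names are hypothetical.

```cpp
// Peak single-precision rate in GFLOPS for `cores` eCores clocked at
// `clock_mhz`, assuming `flops_per_cycle` flops per core per cycle
// (2 = one fused multiply-add, an assumption of this sketch).
double peak_gflops(int cores, double clock_mhz, double flops_per_cycle = 2.0) {
    return cores * clock_mhz * flops_per_cycle / 1000.0;
}

// Achieved rate expressed as a percentage of a given peak.
double percent_of_peak(double achieved_gflops, double peak) {
    return 100.0 * achieved_gflops / peak;
}
```

Under that assumption, one eCore at 600 MHz peaks at 1.2 GFLOPS and the 64-core chip at 76.8 GFLOPS.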

On-chip Communication

DMA vs Direct Write Bandwidth

[Figure: bandwidth (MB/s) against total message size (bytes) for DMA transfers and direct on-chip writes. Bar labels:]

Total Message Size (Bytes)    DMA Transfer (MB/s)    Direct On-Chip Write (MB/s)
16                            5.59                   34.42
80                            30.68                  88.37
800                           282.43                 91.64
4000                          809.81                 91.98
8000                          1,261.62               91.99
24000                         1,801.4                90.19
32000                         1,828.04               90.19
40000                         1,905.67               90.19
48000                         1,963.71               90.19


Message Transfer Latency

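[Figure: transfer time (usec) against total message size (bytes) for DMA transfers and direct on-chip writes. Bar labels:]

Total Message Size (Bytes)    DMA Transfer (usec)    Direct On-Chip Write (usec)
8                             2.73                   0.0358
80                            2.74                   0.36
200                           2.74                   0.9
400                           2.74                   1.79
600                           2.74                   2.69
800                           2.74                   3.58
2000                          3.35                   8.95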

Latency of ≈ 7 cycles per transfer
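
A small helper can make the choice between the two transfer methods explicit. The sketch below takes samples of (message size, DMA time, direct-write time) and returns the smallest sampled size at which DMA is at least as fast. The struct and function names are hypothetical, and any sample values fed to it from the chart rely on reading the first series of bar labels as DMA and the second as direct writes, which is an assumption.

```cpp
#include <cstddef>
#include <vector>

struct LatencySample {
    std::size_t bytes;  // total message size
    double dma_us;      // time for a DMA transfer, microseconds
    double direct_us;   // time for direct on-chip writes, microseconds
};

// Returns the smallest sampled message size for which DMA is at least as
// fast as direct writes, or 0 if direct writes win at every sampled size.
// Samples are expected in increasing order of size.
std::size_t dma_crossover(const std::vector<LatencySample>& samples) {
    for (const LatencySample& s : samples) {
        if (s.dma_us <= s.direct_us) {
            return s.bytes;
        }
    }
    return 0;
}
```

With the times labelled on the latency chart, read under the assumption above, the first sampled size where DMA wins is 800 bytes.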

Off-chip Communication

Core-wise utilization of external memory link

[Figure: heat map over the 8 × 8 eCore grid (rows 1 to 8, columns 1 to 8) showing each core's utilization of the external memory link; colour scale from 0.00 to 0.18]

150 MB/sec using Direct Writes

400 MB/sec using DMA

Implementing High Performance Applications

Heat Stencil

Five-point star-shaped heat stencil

\[
T^{new}_{i,j} = w_1 T^{prev}_{i,j+1} + w_2 T^{prev}_{i,j} + w_3 T^{prev}_{i,j-1} + w_4 T^{prev}_{i+1,j} + w_5 T^{prev}_{i-1,j}
\]
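
For reference, the update can be written as a plain C++ loop. This is a minimal sketch, not the hand-tuned Epiphany kernel from the talk: the function name `stencil_step`, the row-major layout and the choice to copy boundary cells through unchanged are all assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// One sweep of the five-point stencil over the interior of a rows x cols
// grid stored row-major. Boundary cells are copied through unchanged.
// w[0]..w[4] correspond to w1..w5 in the formula.
std::vector<float> stencil_step(const std::vector<float>& prev,
                                std::size_t rows, std::size_t cols,
                                const float (&w)[5]) {
    std::vector<float> next(prev);
    for (std::size_t i = 1; i + 1 < rows; ++i) {
        for (std::size_t j = 1; j + 1 < cols; ++j) {
            const std::size_t c = i * cols + j;
            next[c] = w[0] * prev[c + 1]      // T(i, j+1)
                    + w[1] * prev[c]          // T(i, j)
                    + w[2] * prev[c - 1]      // T(i, j-1)
                    + w[3] * prev[c + cols]   // T(i+1, j)
                    + w[4] * prev[c - cols];  // T(i-1, j)
        }
    }
    return next;
}
```

Iterating the stencil means calling `stencil_step` repeatedly and feeding each result back in as `prev`.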


Single-Core Stencil - 20 columns


Diving into Assembly

.Lb:
    dogrid1 26,27,28,29,30,47,48,49,50,51,52,53, 1, 2, 3, 4, 5, 6, 7, 8, 9,10
    ldr r20,[r0,#0 + stride]
    dogrid0 31,32,33,34,35,52,53,54,55,56,57,58, 6, 7, 8, 9,10,11,12,13,14,15
    ldr r41,[r1,#0]
    dogrid1 36,37,38,39,40,57,58,59,60,61,62,63,11,12,13,14,15,16,17,18,19,20
    str r21,[r1],#stride
    ; 2nd row
    dogrid0 43,44,45,46,47,20,21,22,23,24,25,26,16,17,18,19,20,1+stride,2+stride,3+stride,4+stride,5+stride
    add r0,r0,#(stride * 4)
    dogrid1 48,49,50,51,52,25,26,27,28,29,30,31, 1, 2, 3, 4, 5, 6, 7, 8, 9,10
    ldr r42,[r0,#stride]
    dogrid0 53,54,55,56,57,30,31,32,33,34,35,36, 6, 7, 8, 9,10,11,12,13,14,15
    ldr r63,[r1,#0]
    dogrid1 58,59,60,61,62,35,36,37,38,39,40,41,11,12,13,14,15,16,17,18,19,20
    str r43,[r1],#stride
    ; 1st row
    dogrid0 21,22,23,24,25,42,43,44,45,46,47,48,16,17,18,19,20,1+stride,2+stride,3+stride,4+stride,5+stride
    add r0,r0,#(stride * 4)
    sub r2,r2,#1
    nop
    bne .Lb


Writing Macros

.macro dogrid0 a0,a1,a2,a3,a4,b0,b1,b2,b3,b4,b5,b6,o0,o1,o2,o3,o4,i0,i1,i2,i3,i4
    fmadd r15,r\a0,r3
    str r8,[r0,#\o0]
    fmadd r16,r\a1,r3
    str r9,[r0,#\o1]
    fmadd r17,r\a2,r3
    str r10,[r0,#\o2]
    fmadd r18,r\a3,r3
    str r11,[r0,#\o3]
    fmadd r19,r\a4,r3
    str r14,[r0,#\o4]
    fmadd r15,r\b0,r4
    fmadd r16,r\b1,r4
    fmadd r17,r\b2,r4
    fmadd r18,r\b3,r4
    fmadd r19,r\b4,r4
    fmadd r15,r\b1,r5
    ...

Interleaved pair

Code available at https://github.com/parallella/parallella-examples/tree/master/heat_stencil


Floating point performance - Single-Core

[Figure: single-core GFLOPS against number of columns (20, 40, 60, 80), one bar series per row count (20, 40, 60 and 80 rows); bar labels range from 0.97 to 1.14 GFLOPS]

Stencil evaluated for 50 iterations.

81-95% of peak performance


Multi-Core Stencil

Synchronization between neighbouring eCores

DMA for boundary transfers
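
The boundary transfer between two vertically adjacent cores can be pictured as below. This is a host-side sketch only: `std::copy_n` stands in for the DMA transfer, the tile layout (one halo row above and below the interior rows) and the function name are assumptions of this example, and the synchronization that real eCores need before and after the exchange is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each tile holds (rows + 2) x cols values, row-major:
// row 0 is the top halo, rows 1..rows are interior, row rows + 1 is the
// bottom halo. `upper` sits directly above `lower` in the core grid.
void exchange_boundary_rows(std::vector<float>& upper, std::vector<float>& lower,
                            std::size_t rows, std::size_t cols) {
    // Last interior row of the upper tile -> top halo of the lower tile.
    std::copy_n(upper.data() + rows * cols, cols, lower.data());
    // First interior row of the lower tile -> bottom halo of the upper tile.
    std::copy_n(lower.data() + cols, cols, upper.data() + (rows + 1) * cols);
}
```

A full exchange would do the same for left and right neighbours, where the boundary is a column and the copy is strided.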


Floating point performance - Multi-Core

[Figure: multi-core GFLOPS against number of columns (20, 40, 60, 80), one bar series per row count (20, 40, 60 and 80 rows), shown with and without communication; bar labels range from 45.7 to 72.8 GFLOPS]

Lighter colors show performance without communication

65.2 GFLOPS: 85% of peak performance with communication

Matrix Multiplication

Single-Core Matmul

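\[
\begin{aligned}
C_{11} &= A_{11}B_{11} + A_{12}B_{21} + A_{13}B_{31} + \dots \\
C_{12} &= A_{11}B_{12} + A_{12}B_{22} + A_{13}B_{32} + \dots \\
&\;\;\vdots \\
C_{21} &= A_{21}B_{11} + A_{22}B_{21} + A_{23}B_{31} + \dots \\
C_{22} &= A_{21}B_{12} + A_{22}B_{22} + A_{23}B_{32} + \dots
\end{aligned}
\]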

Operand matrices in different memory banks

Stack moved to bottom half of bank 1

Hand-tuned assembly code
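
The expansion above is the usual row-by-column product. A minimal C++ reference (not the hand-tuned assembly) is sketched below. The i-k-j loop order, in which one element of A is multiplied against a whole row of B, is a choice made for this sketch, as are the function name and the row-major layout.

```cpp
#include <cstddef>
#include <vector>

// C += A * B for n x n row-major matrices.
// Loop order i-k-j: a single element of A is reused across a whole row of B,
// so the innermost loop is a run of independent multiply-adds.
void matmul_accumulate(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```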


Diving into Assembly (again)

; the parameter list is cut off in the source; the body also uses \aprev and \inc
.macro doMult areg,index,
    fmadd r32,r\areg,r16
    ldrd r22,[r1,#\index + 3]
    fmadd r33,r\areg,r17
    ldr r\aprev,[r0,#\inc]
    fmadd r34,r\areg,r18
    fmadd r35,r\areg,r19
    fmadd r36,r\areg,r20
    fmadd r37,r\areg,r21
    ldrd r16,[r1,#\index + 4]
    fmadd r38,r\areg,r22
    ldrd r18,[r1,#\index + 5]
    fmadd r39,r\areg,r23
    ldrd r20,[r1,#\index + 6]
    fmadd r40,r\areg,r16
    ldrd r22,[r1,#\index + 7]
    ...

Interleaved pair

Code available at https://github.com/parallella/parallella-examples/tree/master/matmul_optimized


Floating point performance - Single-Core

Dimensions    GFLOPS    % Peak
8 × 8         0.85      70.5
16 × 16       1.07      89.5
20 × 20       1.11      92.5
24 × 24       1.12      93.4
32 × 32       1.15      95.9


Multi-Core Matmul - Cannon’s Algorithm
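
Cannon's algorithm places one block of A, B and C on each core of a q × q grid, skews A and B once, then alternates a local block multiply with a rotation of A along rows and B along columns. The sketch below simulates this on the host with plain vectors; it illustrates the data movement only and is not the Epiphany implementation. The names `Block`, `block_multiply_add` and `cannon_matmul` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

using Block = std::vector<float>;  // one bs x bs block, row-major

// Local work on one simulated core: c += a * b.
void block_multiply_add(const Block& a, const Block& b, Block& c, std::size_t bs) {
    for (std::size_t i = 0; i < bs; ++i)
        for (std::size_t k = 0; k < bs; ++k)
            for (std::size_t j = 0; j < bs; ++j)
                c[i * bs + j] += a[i * bs + k] * b[k * bs + j];
}

// Cannon's algorithm on a simulated q x q grid of cores.
// A and B are n x n row-major with n = q * bs. Returns C = A * B.
std::vector<float> cannon_matmul(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t q, std::size_t bs) {
    const std::size_t n = q * bs;
    std::vector<Block> a(q * q, Block(bs * bs));
    std::vector<Block> b(q * q, Block(bs * bs));
    std::vector<Block> c(q * q, Block(bs * bs, 0.0f));

    // Initial skew: core (r, s) starts with A block (r, k) and B block (k, s),
    // where k = (r + s) mod q.
    for (std::size_t r = 0; r < q; ++r)
        for (std::size_t s = 0; s < q; ++s) {
            const std::size_t k = (r + s) % q;
            for (std::size_t i = 0; i < bs; ++i)
                for (std::size_t j = 0; j < bs; ++j) {
                    a[r * q + s][i * bs + j] = A[(r * bs + i) * n + (k * bs + j)];
                    b[r * q + s][i * bs + j] = B[(k * bs + i) * n + (s * bs + j)];
                }
        }

    for (std::size_t step = 0; step < q; ++step) {
        for (std::size_t core = 0; core < q * q; ++core)
            block_multiply_add(a[core], b[core], c[core], bs);

        // Rotate with wrap-around: every core takes A from its right
        // neighbour (A moves left) and B from the core below (B moves up).
        std::vector<Block> a_next(q * q), b_next(q * q);
        for (std::size_t r = 0; r < q; ++r)
            for (std::size_t s = 0; s < q; ++s) {
                a_next[r * q + s] = a[r * q + (s + 1) % q];
                b_next[r * q + s] = b[((r + 1) % q) * q + s];
            }
        a.swap(a_next);
        b.swap(b_next);
    }

    // Gather the per-core C blocks back into one row-major matrix.
    std::vector<float> C(n * n);
    for (std::size_t r = 0; r < q; ++r)
        for (std::size_t s = 0; s < q; ++s)
            for (std::size_t i = 0; i < bs; ++i)
                for (std::size_t j = 0; j < bs; ++j)
                    C[(r * bs + i) * n + (s * bs + j)] = c[r * q + s][i * bs + j];
    return C;
}
```

After the skew, core (r, s) sees block index k = (r + s + step) mod q for both operands at every step, so over q steps it accumulates every term of its C block exactly once.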


Buffering Strategy

Transfer of Matrix A and Matrix B


Floating Point Performance - Multi-Core

                       Num of eCores
Matrix C               4 × 4                 8 × 8
(per-core)             GFLOPS    % Peak      GFLOPS    % Peak
8 × 8                  5.1       26          20.3      26
16 × 16                12.8      67          51.4      67
20 × 20                14.4      75          57.6      75
24 × 24                15.4      80          62.2      81
32 × 32                16.3      85          65.3      85


Floating Point Performance - Off-chip

Matrix C         GFLOPS    % Peak    % Memory Transfers
512 × 512        8.3       10.8 %    87.2 %
1024 × 1024      8.6       11.1 %    86.9 %
1536 × 1536      6.3       8.2 %     89.1 %

Energy Measurement

Measurement Setup


Measurement API

#include <boost/asio.hpp>

/* io, serial, serial_port and baud_rate are globals declared elsewhere (not shown on the slide) */

/* Initialize serial port IO objects from Measured Device */
void power_init()
{
    io = new boost::asio::io_service();
    serial = new boost::asio::serial_port(*io, serial_port);
    serial->set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
}

/* Send single character 'r' to reset ADC timer and start measurement */
void power_start()
{
    boost::asio::write(*serial, boost::asio::buffer("r",1));
}

/* Send single character 's' to stop ADC timer and measurement */
void power_stop()
{
    boost::asio::write(*serial, boost::asio::buffer("s",1));
}

/* Perform serial IO object cleanup */
void power_cleanup()
{
    /* Destroy the port before the io_service it was constructed with */
    delete serial;
    delete io;
}

Measurements

Energy Efficiency for Matmul - Comparisons

Parallella-16

Dimensions       nJoules/Flop
128 × 128        0.33
256 × 256        1.82
512 × 512        1.98
1024 × 1024      1.83

NVIDIA Jetson

Dimensions       nJoules/Flop
128 × 128        9.7
256 × 256        1.06
512 × 512        0.16
1024 × 1024      0.06

128×128 experiment on Parallella-16 used only on-chip SRAM

Same numbers can be derived analytically
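
One way to arrive at a nJoules/Flop figure from a measurement is to divide the energy to solution by the operation count. The sketch below assumes the usual 2n³ flop count for an n × n matrix multiplication and takes average power and run time as inputs. The slides do not list measured power or run times, so the function name and any values passed to it are hypothetical.

```cpp
// Energy per floating point operation, in nanojoules, for an n x n matrix
// multiplication. Counts 2 * n^3 flops (one multiply and one add per term).
// avg_power_watts and seconds are measured inputs supplied by the caller.
double matmul_nj_per_flop(double avg_power_watts, double seconds, double n) {
    const double joules = avg_power_watts * seconds;
    const double flops = 2.0 * n * n * n;
    return joules / flops * 1e9;
}
```

For example, a hypothetical run drawing 2 W for 1 second on a 1000 × 1000 problem works out to 1 nJoule/Flop.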


Summary

Summary & Future Work

Achieved high-performance
    85% of Peak for On-chip

Introduced a method to measure energy to solution

High programming effort was required
    Compiler could be optimized further

Energy efficiency
    Observed on-chip efficiency of 330 pJoules/Flop
    4096 eCores => 10 pJoules/Flop

Higher bandwidth and more SRAM per eCore needed


Questions?
