51
Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels J. Laukemann, J. Hammer, G. Hager, G. Wellein

AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

Automatic Throughput and Critical Path Analysisof x86 and ARM Assembly Kernels

J. Laukemann, J. Hammer, G. Hager, G. Wellein

Page 2: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

1. Analytic Performance ModelingWhy?Assumptions & Related Tools

2. Throughput & Latency - NomenclatureDefinition of Throughput, Critical Path and Loop-Carried Dependency

3. OSACA: Automating the in-core model constructionOverview, Structure and Output

4. Gauss-Seidel-Method Example

5. Future Work

18.11.2019 2PMBS19 | OSACA | Jan Laukemann

Overview

Page 3: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• How fast can my kernel run at best?

• What are the relevant hardwarebottlenecks?

• Apply simplified model of underlyinghardware• In-core execution

• Data transfer

• Combining execution and data transfer

18.11.2019 3PMBS19 | OSACA | Jan Laukemann

ECM Model

Roofline Model

Performance Modeling for Loop Kernels

Page 4: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• How fast can my kernel run at best?

• What are the relevant hardwarebottlenecks?

• Apply simplified model of underlyinghardware• In-core execution

• Data transfer

• Combining execution and data transfer

18.11.2019 4PMBS19 | OSACA | Jan Laukemann

ECM Model

Roofline Model

Performance Modeling for Loop Kernels

Page 5: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

1. All Data in L1

2. Average distribution of port scheduling

PMBS19 | OSACA | Jan Laukemann

Assumptions & Related Tools

18.11.2019 5

OSACA v0.21 OSACA v0.3 IACA2 (EoL) LLVM-MCA3

Throughput ✔ ✔ ✔ ✔

Critical Path ✘ ✔ ✘ 😕

Loop-Carried Dependencies

✘ ✔ ✘ 😕

1 Presented at PMBS182 Intel Architecture Code Analyzer (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer) 3 LLVM Machine Code Analyzer (https://llvm.org/docs/CommandGuide/llvm-mca.html)

Page 6: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 6PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

TP: 1 cy

Throughput & Latency

Page 7: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 7PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

TP: 1 cy

Throughput & Latency

ADD MUL SUB DIV

Page 8: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 8PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

t

TP: 1 cy

Throughput & Latency

ADD MUL SUB DIV

Page 9: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 9PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

t

TP: 1 cy

Throughput & Latency

ADD MUL SUB DIV

Page 10: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 10PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

t

TP: 1 cy

Throughput & Latency

ADD MUL SUB DIV

Page 11: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 11PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

1 cy/it

CP: 3 cy

t

TP: 1 cy

Throughput & Latency

ADD MUL SUB DIV

Page 12: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

TP: ThroughputCP: Critical Path

18.11.2019 12PMBS19 | OSACA | Jan Laukemann

• Dependencies within loop

• No loop-carried dependencies

1 cy/it

CP: 3 cy

t

TP: 1 cy

Throughput & Latency

ADD MUL SUB DIV

Page 13: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 13PMBS19 | OSACA | Jan Laukemann

TP: 1 cy

Throughput & Latency

• Dependencies within loop

• Loop-carried dependencies

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 14: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 14PMBS19 | OSACA | Jan Laukemann

t

TP: 1 cy

Throughput & Latency

• Dependencies within loop

• Loop-carried dependencies

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 15: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 15PMBS19 | OSACA | Jan Laukemann

t

TP: 1 cy

Throughput & Latency

• Dependencies within loop

• Loop-carried dependencies

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 16: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 16PMBS19 | OSACA | Jan Laukemann

t

TP: 1 cy

Throughput & Latency

• Dependencies within loop

• Loop-carried dependencies

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 17: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 17PMBS19 | OSACA | Jan Laukemann

3 cy/itCP: 3 cyLCD: 3 cy

t

TP: 1 cy

Throughput & Latency

• Dependencies within loop

• Loop-carried dependencies

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 18: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 18PMBS19 | OSACA | Jan Laukemann

TP: 1 cy

Throughput & Latency

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 19: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 19PMBS19 | OSACA | Jan Laukemann

t

TP: 1 cy

Throughput & Latency

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 20: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 20PMBS19 | OSACA | Jan Laukemann

t

TP: 1 cy

Throughput & Latency

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 21: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 21PMBS19 | OSACA | Jan Laukemann

t

TP: 1 cy

Throughput & Latency

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 22: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 22PMBS19 | OSACA | Jan Laukemann

3 cy/itLCD: 3 cy CP: 5 cy

t

TP: 1 cy

Throughput & Latency

TP: ThroughputCP: Critical Path LCD: Loop-Carried Dependency

Page 23: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 23PMBS19 | OSACA | Jan Laukemann

movl $111,%ebx #START MARKER.byte 100,103,144 #START MARKER.L22:vmovapd 0(%r13,%rax),%ymm0vfmadd213pd (%r14,%rax),%ymm1,%ymm0vmovapd %ymm0,(%r12,%rax)addq $32,%raxcmpq %rax,%r15jne .L22

movl $222,%ebx #END MARKER .byte 100,103,144 #END MARKER

mov x1,#111 //START.byte 213,3,32,31 //START.L18:ldr q2, [x20, x0]ldr q1, [x21, x0]fmla v1.2d, v2.2d, v0.2dstr q1, [x19, x0]add x0, x0, #16cmp x22, x0bne .L18

mov x1,#222 //END.byte 213,3,32,31 //END

Marked Assembly

OSACA Workflow: Input – Database

x86

arm

Page 24: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

Intel Cascade Lake

18.11.2019 24PMBS19 | OSACA | Jan Laukemann

movl $111,%ebx #START MARKER.byte 100,103,144 #START MARKER.L22:vmovapd 0(%r13,%rax),%ymm0vfmadd213pd (%r14,%rax),%ymm1,%ymm0vmovapd %ymm0,(%r12,%rax)addq $32,%raxcmpq %rax,%r15jne .L22

movl $222,%ebx #END MARKER .byte 100,103,144 #END MARKER

mov x1,#111 //START.byte 213,3,32,31 //START.L18:ldr q2, [x20, x0]ldr q1, [x21, x0]fmla v1.2d, v2.2d, v0.2dstr q1, [x19, x0]add x0, x0, #16cmp x22, x0bne .L18

mov x1,#222 //END.byte 213,3,32,31 //END

Marked Assembly

OSACA Workflow: Input – Database

x86

arm

Page 25: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

load_latency: {gpr: 4, xmm: 4, ymm: 4, zmm: 4}load_throughput: {port_pressure: [0,0,0,0.5 ... ,0]}- name: vfmadd213pdoperands:- class: "register"name: "ymm"source: truedestination: false

- class: "register"name: "ymm"source: truedestination: false

- class: "register"name: "ymm"source: truedestination: true

throughput: 0.5latency: 4 # 0 DV 1 2 D 3 D 4 5 6 7 port_pressure: [0.5,0,0.5,0.5,0.5,0.5,0.5,0,0,0,0]

18.11.2019 25PMBS19 | OSACA | Jan Laukemann

movl $111,%ebx #START MARKER.byte 100,103,144 #START MARKER.L22:vmovapd 0(%r13,%rax),%ymm0vfmadd213pd (%r14,%rax),%ymm1,%ymm0vmovapd %ymm0,(%r12,%rax)addq $32,%raxcmpq %rax,%r15jne .L22

movl $222,%ebx #END MARKER .byte 100,103,144 #END MARKER

mov x1,#111 //START.byte 213,3,32,31 //START.L18:ldr q2, [x20, x0]ldr q1, [x21, x0]fmla v1.2d, v2.2d, v0.2dstr q1, [x19, x0]add x0, x0, #16cmp x22, x0bne .L18

mov x1,#222 //END.byte 213,3,32,31 //END

Marked Assembly

Machine Files / Databases

OSACA Workflow: Input – Database

x86

arm

Page 26: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

load_latency: {gpr: 4, xmm: 4, ymm: 4, zmm: 4}load_throughput: {port_pressure: [0,0,0,0.5 ... ,0]}- name: vfmadd213pdoperands:- class: "register"name: "ymm"source: truedestination: false

- class: "register"name: "ymm"source: truedestination: false

- class: "register"name: "ymm"source: truedestination: true

throughput: 0.5latency: 4 # 0 DV 1 2 D 3 D 4 5 6 7 port_pressure: [0.5,0,0.5,0.5,0.5,0.5,0.5,0,0,0,0]

18.11.2019 26PMBS19 | OSACA | Jan Laukemann

movl $111,%ebx #START MARKER.byte 100,103,144 #START MARKER.L22:vmovapd 0(%r13,%rax),%ymm0vfmadd213pd (%r14,%rax),%ymm1,%ymm0vmovapd %ymm0,(%r12,%rax)addq $32,%raxcmpq %rax,%r15jne .L22

movl $222,%ebx #END MARKER .byte 100,103,144 #END MARKER

mov x1,#111 //START.byte 213,3,32,31 //START.L18:ldr q2, [x20, x0]ldr q1, [x21, x0]fmla v1.2d, v2.2d, v0.2dstr q1, [x19, x0]add x0, x0, #16cmp x22, x0bne .L18

mov x1,#222 //END.byte 213,3,32,31 //END

Marked Assembly

Machine Files / Databases

OSACA Workflow: Input – Database

x86

arm

Page 27: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

load_latency: {gpr: 4, xmm: 4, ymm: 4, zmm: 4}load_throughput: {port_pressure: [0,0,0,0.5 ... ,0]}- name: vfmadd213pdoperands:- class: "register"name: "ymm"source: truedestination: false

- class: "register"name: "ymm"source: truedestination: false

- class: "register"name: "ymm"source: truedestination: true

throughput: 0.5latency: 4 # 0 DV 1 2 D 3 D 4 5 6 7 port_pressure: [0.5,0,0.5,0.5,0.5,0.5,0.5,0,0,0,0]

18.11.2019 27PMBS19 | OSACA | Jan Laukemann

movl $111,%ebx #START MARKER.byte 100,103,144 #START MARKER.L22:vmovapd 0(%r13,%rax),%ymm0vfmadd213pd (%r14,%rax),%ymm1,%ymm0vmovapd %ymm0,(%r12,%rax)addq $32,%raxcmpq %rax,%r15jne .L22

movl $222,%ebx #END MARKER .byte 100,103,144 #END MARKER

mov x1,#111 //START.byte 213,3,32,31 //START.L18:ldr q2, [x20, x0]ldr q1, [x21, x0]fmla v1.2d, v2.2d, v0.2dstr q1, [x19, x0]add x0, x0, #16cmp x22, x0bne .L18

mov x1,#222 //END.byte 213,3,32,31 //END

Marked Assembly

Machine Files / Databases

OSACA Workflow: Input – Database

x86

arm

Page 28: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

Combined Analysis Report------------------------

Port pressure in cycles| 0 - 0DV | 1 | 2 - 2D | 3 - 3D | 4 | 5 | 6 | 7 || CP | LCD |

-------------------------------------------------------------------------------------------------179 | | | | | | | | || | | .L22:180 | | | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vmovapd 0(%r13,%rax), %ymm0

181 | 0.50 | 0.50 | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vfmadd213pd (%r14,%rax),%ymm1,%ymm0182 | | | 0.50 | 0.50 | 1.00 | | | || 5.0 | | vmovapd %ymm0, (%r12,%rax)

183 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | 1.0 | addq $32, %rax184 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | | cmpq %rax, %r15

185 | | | | | | | | || | | * jne .L221.00 1.00 1.50 1.00 1.50 1.00 1.00 0.50 0.50 13.0 1.0

Loop-Carried Dependencies Analysis Report-----------------------------------------183 | 1.0 | addq $32, %rax | [183]

18.11.2019 28PMBS19 | OSACA | Jan Laukemann

185: jne 179: label 180: vmovapd

181: vfmadd213pd

182: vmovapd

181: LOAD

4 4

4

183: addq

184: compq

1

OSACA Workflow: Output

Page 29: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 29PMBS19 | OSACA | Jan Laukemann

185: jne 179: label 180: vmovapd

181: vfmadd213pd

182: vmovapd

181: LOAD

4 4

4

183: addq

184: compq

1

OSACA Workflow: OutputCombined Analysis Report------------------------

Port pressure in cycles| 0 - 0DV | 1 | 2 - 2D | 3 - 3D | 4 | 5 | 6 | 7 || CP | LCD |

-------------------------------------------------------------------------------------------------179 | | | | | | | | || | | .L22:180 | | | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vmovapd 0(%r13,%rax), %ymm0

181 | 0.50 | 0.50 | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vfmadd213pd (%r14,%rax),%ymm1,%ymm0182 | | | 0.50 | 0.50 | 1.00 | | | || 5.0 | | vmovapd %ymm0, (%r12,%rax)

183 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | 1.0 | addq $32, %rax184 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | | cmpq %rax, %r15

185 | | | | | | | | || | | * jne .L221.00 1.00 1.50 1.00 1.50 1.00 1.00 0.50 0.50 13.0 1.0

Loop-Carried Dependencies Analysis Report-----------------------------------------183 | 1.0 | addq $32, %rax | [183]

Page 30: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 30PMBS19 | OSACA | Jan Laukemann

185: jne 179: label 180: vmovapd

181: vfmadd213pd

182: vmovapd

181: LOAD

4 4

4

183: addq

184: compq

1

OSACA Workflow: OutputCombined Analysis Report------------------------

Port pressure in cycles| 0 - 0DV | 1 | 2 - 2D | 3 - 3D | 4 | 5 | 6 | 7 || CP | LCD |

-------------------------------------------------------------------------------------------------179 | | | | | | | | || | | .L22:180 | | | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vmovapd 0(%r13,%rax), %ymm0

181 | 0.50 | 0.50 | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vfmadd213pd (%r14,%rax),%ymm1,%ymm0182 | | | 0.50 | 0.50 | 1.00 | | | || 5.0 | | vmovapd %ymm0, (%r12,%rax)

183 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | 1.0 | addq $32, %rax184 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | | cmpq %rax, %r15

185 | | | | | | | | || | | * jne .L221.00 1.00 1.50 1.00 1.50 1.00 1.00 0.50 0.50 13.0 1.0

Loop-Carried Dependencies Analysis Report-----------------------------------------183 | 1.0 | addq $32, %rax | [183]

Page 31: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 31PMBS19 | OSACA | Jan Laukemann

185: jne 179: label 180: vmovapd

181: vfmadd213pd

182: vmovapd

181: LOAD

4 4

4

183: addq

184: compq

1

OSACA Workflow: OutputCombined Analysis Report------------------------

Port pressure in cycles| 0 - 0DV | 1 | 2 - 2D | 3 - 3D | 4 | 5 | 6 | 7 || CP | LCD |

-------------------------------------------------------------------------------------------------179 | | | | | | | | || | | .L22:180 | | | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vmovapd 0(%r13,%rax), %ymm0

181 | 0.50 | 0.50 | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vfmadd213pd (%r14,%rax),%ymm1,%ymm0182 | | | 0.50 | 0.50 | 1.00 | | | || 5.0 | | vmovapd %ymm0, (%r12,%rax)

183 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | 1.0 | addq $32, %rax184 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | | cmpq %rax, %r15

185 | | | | | | | | || | | * jne .L221.00 1.00 1.50 1.00 1.50 1.00 1.00 0.50 0.50 13.0 1.0

Loop-Carried Dependencies Analysis Report-----------------------------------------183 | 1.0 | addq $32, %rax | [183]

Page 32: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

Combined Analysis Report------------------------

Port pressure in cycles| 0 - 0DV | 1 | 2 - 2D | 3 - 3D | 4 | 5 | 6 | 7 || CP | LCD |

-------------------------------------------------------------------------------------------------179 | | | | | | | | || | | .L22:180 | | | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vmovapd 0(%r13,%rax), %ymm0

181 | 0.50 | 0.50 | 0.50 0.50 | 0.50 0.50 | | | | || 4.0 | | vfmadd213pd (%r14,%rax),%ymm1,%ymm0182 | | | 0.50 | 0.50 | 1.00 | | | || 5.0 | | vmovapd %ymm0, (%r12,%rax)

183 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | 1.0 | addq $32, %rax184 | 0.25 | 0.25 | | | | 0.25 | 0.25 | || | | cmpq %rax, %r15

185 | | | | | | | | || | | * jne .L221.00 1.00 1.50 1.00 1.50 1.00 1.00 0.50 0.50 13.0 1.0

Loop-Carried Dependencies Analysis Report-----------------------------------------183 | 1.0 | addq $32, %rax | [183]

18.11.2019 32PMBS19 | OSACA | Jan Laukemann

185: jne 179: label 180: vmovapd

181: vfmadd213pd

182: vmovapd

181: LOAD

4 4

4

183: addq

184: compq

1

OSACA Workflow: Output

Page 33: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• Limited by loop-carried dependency

• Create code with -Ofast, -funroll-loops(+ architecture specific flags)

• Analyze for Intel Cascake Lake, AMD Zen andMarvell ThunderX2

18.11.2019 33

do it=1, itmaxdo k=1, kmax-1

do i=1, imax-1phi(i,k,t0) = 0.25 * (

phi(i,k-1,t0) + phi(i+1,k,t0) +phi(i,k+1,t0) + phi(i-1,k,t0))

dodo

do

PMBS19 | OSACA | Jan Laukemann

Gauss-Seidel Method Example

Page 34: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 34

mov x1, #111 // START MARKER.byte 213,3,32,31 // START MARKER

.L20:ldr d31, [x15, x18, lsl 3]ldr d0, [x15, 8]mov x14, x15add x16, x15, 24ldr d2, [x15, x30, lsl 3]add x15, x15, 32fadd d1, d31, d0fadd d3, d1, d30fadd d4, d3, d2fmul d5, d4, d9str d5, [x14], 8ldr d6, [x14, x18, lsl 3]ldr d16, [x14, 8]add x13, x14, 8ldr d7, [x14, x30, lsl 3]fadd d17, d6, d16fadd d18, d17, d5fadd d19, d18, d7fmul d20, d19, d9str d20, [x15, -24]

PMBS19 | OSACA | Jan Laukemann

ldr d21, [x13, x18, lsl 3]ldr d23, [x14, 16]ldr d22, [x13, x30, lsl 3]fadd d24, d21, d23fadd d25, d24, d20fadd d26, d25, d22fmul d27, d26, d9str d27, [x14, 8]ldr d30, [x15]ldr d28, [x16, x18, lsl 3]ldr d29, [x16, x30, lsl 3]fadd d31, d28, d30fadd d2, d31, d27fadd d0, d2, d29fmul d30, d0, d9str d30, [x15, -8]cmp x7, x15bne .L20mov x1, #222 // END MARKER.byte 213,3,32,31 // END MARKER

Gauss-Seidel Method Example

Page 35: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 35PMBS19 | OSACA | Jan Laukemann

mov x1, #111 // START MARKER.byte 213,3,32,31 // START MARKER

.L20:ldr d31, [x15, x18, lsl 3]ldr d0, [x15, 8]mov x14, x15add x16, x15, 24ldr d2, [x15, x30, lsl 3]add x15, x15, 32fadd d1, d31, d0fadd d3, d1, d30fadd d4, d3, d2fmul d5, d4, d9str d5, [x14], 8ldr d6, [x14, x18, lsl 3]ldr d16, [x14, 8]add x13, x14, 8ldr d7, [x14, x30, lsl 3]fadd d17, d6, d16fadd d18, d17, d5fadd d19, d18, d7fmul d20, d19, d9str d20, [x15, -24]

ldr d21, [x13, x18, lsl 3]ldr d23, [x14, 16]ldr d22, [x13, x30, lsl 3]fadd d24, d21, d23fadd d25, d24, d20fadd d26, d25, d22fmul d27, d26, d9str d27, [x14, 8]ldr d30, [x15]ldr d28, [x16, x18, lsl 3]ldr d29, [x16, x30, lsl 3]fadd d31, d28, d30fadd d2, d31, d27fadd d0, d2, d29fmul d30, d0, d9str d30, [x15, -8]cmp x7, x15bne .L20mov x1, #222 // END MARKER.byte 213,3,32,31 // END MARKER

Gauss-Seidel Method Example

Page 36: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 36PMBS19 | OSACA | Jan Laukemann

mov x1, #111 // START MARKER.byte 213,3,32,31 // START MARKER

.L20:ldr d31, [x15, x18, lsl 3]ldr d0, [x15, 8]mov x14, x15add x16, x15, 24ldr d2, [x15, x30, lsl 3]add x15, x15, 32fadd d1, d31, d0fadd d3, d1, d30fadd d4, d3, d2fmul d5, d4, d9str d5, [x14], 8ldr d6, [x14, x18, lsl 3]ldr d16, [x14, 8]add x13, x14, 8ldr d7, [x14, x30, lsl 3]fadd d17, d6, d16fadd d18, d17, d5fadd d19, d18, d7fmul d20, d19, d9str d20, [x15, -24]

ldr d21, [x13, x18, lsl 3]ldr d23, [x14, 16]ldr d22, [x13, x30, lsl 3]fadd d24, d21, d23fadd d25, d24, d20fadd d26, d25, d22fmul d27, d26, d9str d27, [x14, 8]ldr d30, [x15]ldr d28, [x16, x18, lsl 3]ldr d29, [x16, x30, lsl 3]fadd d31, d28, d30fadd d2, d31, d27fadd d0, d2, d29fmul d30, d0, d9str d30, [x15, -8]cmp x7, x15bne .L20mov x1, #222 // END MARKER.byte 213,3,32,31 // END MARKER

Gauss-Seidel Method Example

Page 37: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 37PMBS19 | OSACA | Jan Laukemann

mov x1, #111 // START MARKER.byte 213,3,32,31 // START MARKER

.L20:ldr d31, [x15, x18, lsl 3]ldr d0, [x15, 8]mov x14, x15add x16, x15, 24ldr d2, [x15, x30, lsl 3]add x15, x15, 32fadd d1, d31, d0fadd d3, d1, d30fadd d4, d3, d2fmul d5, d4, d9str d5, [x14], 8ldr d6, [x14, x18, lsl 3]ldr d16, [x14, 8]add x13, x14, 8ldr d7, [x14, x30, lsl 3]fadd d17, d6, d16fadd d18, d17, d5fadd d19, d18, d7fmul d20, d19, d9str d20, [x15, -24]

ldr d21, [x13, x18, lsl 3]ldr d23, [x14, 16]ldr d22, [x13, x30, lsl 3]fadd d24, d21, d23fadd d25, d24, d20fadd d26, d25, d22fmul d27, d26, d9str d27, [x14, 8]ldr d30, [x15]ldr d28, [x16, x18, lsl 3]ldr d29, [x16, x30, lsl 3]fadd d31, d28, d30fadd d2, d31, d27fadd d0, d2, d29fmul d30, d0, d9str d30, [x15, -8]cmp x7, x15bne .L20mov x1, #222 // END MARKER.byte 213,3,32,31 // END MARKER

Gauss-Seidel Method Example

Page 38: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 38PMBS19 | OSACA | Jan Laukemann

mov x1, #111 // START MARKER.byte 213,3,32,31 // START MARKER

.L20:ldr d31, [x15, x18, lsl 3]ldr d0, [x15, 8]mov x14, x15add x16, x15, 24ldr d2, [x15, x30, lsl 3]add x15, x15, 32fadd d1, d31, d0fadd d3, d1, d30fadd d4, d3, d2fmul d5, d4, d9str d5, [x14], 8ldr d6, [x14, x18, lsl 3]ldr d16, [x14, 8]add x13, x14, 8ldr d7, [x14, x30, lsl 3]fadd d17, d6, d16fadd d18, d17, d5fadd d19, d18, d7fmul d20, d19, d9str d20, [x15, -24]

ldr d21, [x13, x18, lsl 3]ldr d23, [x14, 16]ldr d22, [x13, x30, lsl 3]fadd d24, d21, d23fadd d25, d24, d20fadd d26, d25, d22fmul d27, d26, d9str d27, [x14, 8]ldr d30, [x15]ldr d28, [x16, x18, lsl 3]ldr d29, [x16, x30, lsl 3]fadd d31, d28, d30fadd d2, d31, d27fadd d0, d2, d29fmul d30, d0, d9str d30, [x15, -8]cmp x7, x15bne .L20mov x1, #222 // END MARKER.byte 213,3,32,31 // END MARKER

Gauss-Seidel Method Example

Page 39: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 39

Port pressure in cycles| 0 - 0DV | 1 - 1DV | 2 | 3 | 4 | 5 || CP | LCD |

----------------------------------------------------------------------------520 | | | | | | || | | .L20:521 | | | | 0.50 | 0.50 | || 4.0 | | ldr d31, [x15, x18, lsl 3]522 | | | | 0.50 | 0.50 | || | | ldr d0, [x15, 8]523 | 0.50 | 0.50 | | | | || | | mov x14, x15524 | 0.33 | 0.33 | 0.33 | | | || | | add x16, x15, 24525 | | | | 0.50 | 0.50 | || | | ldr d2, [x15, x30, lsl 3]526 | 0.33 | 0.33 | 0.33 | | | || | | add x15, x15, 32527 | 0.50 | 0.50 | | | | || 6.0 | | fadd d1, d31, d0528 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d3, d1, d30529 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d4, d3, d2530 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d5, d4, d9531 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d5, [x14], 8532 | | | | 0.50 | 0.50 | || 4.0 | | ldr d6, [x14, x18, lsl 3]533 | | | | 0.50 | 0.50 | || | | ldr d16, [x14, 8]534 | 0.33 | 0.33 | 0.33 | | | || | | add x13, x14, 8535 | | | | 0.50 | 0.50 | || | | ldr d7, [x14, x30, lsl 3]536 | 0.50 | 0.50 | | | | || 6.0 | | fadd d17, d6, d16537 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d18, d17, d5538 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d19, d18, d7539 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d20, d19, d9540 | | | | 0.50 | 0.50 | 1.00 || | | str d20, [x15, -24]541 | | | | 0.50 | 0.50 | || | | ldr d21, [x13, x18, lsl 3]542 | | | | 0.50 | 0.50 | || | | ldr d23, [x14, 16]543 | | | | 0.50 | 0.50 | || | | ldr d22, [x13, x30, lsl 3]544 | 0.50 | 0.50 | | | | || | | fadd d24, d21, d23545 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d25, d24, d20546 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d26, d25, d22547 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d27, d26, d9548 | | | | 0.50 | 0.50 | 1.00 || | | str d27, [x14, 8]549 | | | | 0.50 | 0.50 | || | | ldr d30, [x15]550 | | | | 0.50 | 0.50 | || | | ldr d28, [x16, x18, lsl 3]551 | | | | 0.50 | 0.50 | || | | ldr d29, [x16, x30, lsl 3]552 | 0.50 | 0.50 | | | | || | | fadd d31, d28, d30553 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d2, d31, d27554 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d0, d2, d29555 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d30, d0, d9556 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d30, [x15, -8]557 | 0.33 | 0.33 | 0.33 | | | || | | cmp x7, x15558 | | | | | | || | | * bne .L20

9.83 9.83 1.33 8.00 8.00 4.00 100.0 72.0

Loop-Carried Dependencies Analysis Report-----------------------------------------526 | 1.0 | add x15, x15, 32 | [526]555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

PMBS19 | OSACA | Jan Laukemann

Gauss-Seidel Method Example – Output

Page 40: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 40

Port pressure in cycles| 0 - 0DV | 1 - 1DV | 2 | 3 | 4 | 5 || CP | LCD |

----------------------------------------------------------------------------520 | | | | | | || | | .L20:521 | | | | 0.50 | 0.50 | || 4.0 | | ldr d31, [x15, x18, lsl 3]522 | | | | 0.50 | 0.50 | || | | ldr d0, [x15, 8]523 | 0.50 | 0.50 | | | | || | | mov x14, x15524 | 0.33 | 0.33 | 0.33 | | | || | | add x16, x15, 24525 | | | | 0.50 | 0.50 | || | | ldr d2, [x15, x30, lsl 3]526 | 0.33 | 0.33 | 0.33 | | | || | | add x15, x15, 32527 | 0.50 | 0.50 | | | | || 6.0 | | fadd d1, d31, d0528 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d3, d1, d30529 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d4, d3, d2530 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d5, d4, d9531 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d5, [x14], 8532 | | | | 0.50 | 0.50 | || 4.0 | | ldr d6, [x14, x18, lsl 3]533 | | | | 0.50 | 0.50 | || | | ldr d16, [x14, 8]534 | 0.33 | 0.33 | 0.33 | | | || | | add x13, x14, 8535 | | | | 0.50 | 0.50 | || | | ldr d7, [x14, x30, lsl 3]536 | 0.50 | 0.50 | | | | || 6.0 | | fadd d17, d6, d16537 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d18, d17, d5538 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d19, d18, d7539 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d20, d19, d9540 | | | | 0.50 | 0.50 | 1.00 || | | str d20, [x15, -24]541 | | | | 0.50 | 0.50 | || | | ldr d21, [x13, x18, lsl 3]542 | | | | 0.50 | 0.50 | || | | ldr d23, [x14, 16]543 | | | | 0.50 | 0.50 | || | | ldr d22, [x13, x30, lsl 3]544 | 0.50 | 0.50 | | | | || | | fadd d24, d21, d23545 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d25, d24, d20546 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d26, d25, d22547 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d27, d26, d9548 | | | | 0.50 | 0.50 | 1.00 || | | str d27, [x14, 8]549 | | | | 0.50 | 0.50 | || | | ldr d30, [x15]550 | | | | 0.50 | 0.50 | || | | ldr d28, [x16, x18, lsl 3]551 | | | | 0.50 | 0.50 | || | | ldr d29, [x16, x30, lsl 3]552 | 0.50 | 0.50 | | | | || | | fadd d31, d28, d30553 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d2, d31, d27554 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d0, d2, d29555 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d30, d0, d9556 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d30, [x15, -8]557 | 0.33 | 0.33 | 0.33 | | | || | | cmp x7, x15558 | | | | | | || | | * bne .L20

9.83 9.83 1.33 8.00 8.00 4.00 100.0 72.0

Loop-Carried Dependencies Analysis Report-----------------------------------------526 | 1.0 | add x15, x15, 32 | [526]555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

PMBS19 | OSACA | Jan Laukemann

Block Throughput 2.46 cy

9.83 9.83

Gauss-Seidel Method Example – Output

Page 41: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 41

Port pressure in cycles| 0 - 0DV | 1 - 1DV | 2 | 3 | 4 | 5 || CP | LCD |

----------------------------------------------------------------------------520 | | | | | | || | | .L20:521 | | | | 0.50 | 0.50 | || 4.0 | | ldr d31, [x15, x18, lsl 3]522 | | | | 0.50 | 0.50 | || | | ldr d0, [x15, 8]523 | 0.50 | 0.50 | | | | || | | mov x14, x15524 | 0.33 | 0.33 | 0.33 | | | || | | add x16, x15, 24525 | | | | 0.50 | 0.50 | || | | ldr d2, [x15, x30, lsl 3]526 | 0.33 | 0.33 | 0.33 | | | || | | add x15, x15, 32527 | 0.50 | 0.50 | | | | || 6.0 | | fadd d1, d31, d0528 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d3, d1, d30529 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d4, d3, d2530 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d5, d4, d9531 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d5, [x14], 8532 | | | | 0.50 | 0.50 | || 4.0 | | ldr d6, [x14, x18, lsl 3]533 | | | | 0.50 | 0.50 | || | | ldr d16, [x14, 8]534 | 0.33 | 0.33 | 0.33 | | | || | | add x13, x14, 8535 | | | | 0.50 | 0.50 | || | | ldr d7, [x14, x30, lsl 3]536 | 0.50 | 0.50 | | | | || 6.0 | | fadd d17, d6, d16537 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d18, d17, d5538 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d19, d18, d7539 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d20, d19, d9540 | | | | 0.50 | 0.50 | 1.00 || | | str d20, [x15, -24]541 | | | | 0.50 | 0.50 | || | | ldr d21, [x13, x18, lsl 3]542 | | | | 0.50 | 0.50 | || | | ldr d23, [x14, 16]543 | | | | 0.50 | 0.50 | || | | ldr d22, [x13, x30, lsl 3]544 | 0.50 | 0.50 | | | | || | | fadd d24, d21, d23545 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d25, d24, d20546 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d26, d25, d22547 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d27, d26, d9548 | | | | 0.50 | 0.50 | 1.00 || | | str d27, [x14, 8]549 | | | | 0.50 | 0.50 | || | | ldr d30, [x15]550 | | | | 0.50 | 0.50 | || | | ldr d28, [x16, x18, lsl 3]551 | | | | 0.50 | 0.50 | || | | ldr d29, [x16, x30, lsl 3]552 | 0.50 | 0.50 | | | | || | | fadd d31, d28, d30553 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d2, d31, d27554 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d0, d2, d29555 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d30, d0, d9556 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d30, [x15, -8]557 | 0.33 | 0.33 | 0.33 | | | || | | cmp x7, x15558 | | | | | | || | | * bne .L20

9.83 9.83 1.33 8.00 8.00 4.00 100.0 72.0

Loop-Carried Dependencies Analysis Report-----------------------------------------526 | 1.0 | add x15, x15, 32 | [526]555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

PMBS19 | OSACA | Jan Laukemann

100

Block Throughput 2.46 cy

Critical Path 25.0 cy

9.83 9.83

Gauss-Seidel Method Example – Output

Page 42: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 42

Port pressure in cycles| 0 - 0DV | 1 - 1DV | 2 | 3 | 4 | 5 || CP | LCD |

----------------------------------------------------------------------------520 | | | | | | || | | .L20:521 | | | | 0.50 | 0.50 | || 4.0 | | ldr d31, [x15, x18, lsl 3]522 | | | | 0.50 | 0.50 | || | | ldr d0, [x15, 8]523 | 0.50 | 0.50 | | | | || | | mov x14, x15524 | 0.33 | 0.33 | 0.33 | | | || | | add x16, x15, 24525 | | | | 0.50 | 0.50 | || | | ldr d2, [x15, x30, lsl 3]526 | 0.33 | 0.33 | 0.33 | | | || | | add x15, x15, 32527 | 0.50 | 0.50 | | | | || 6.0 | | fadd d1, d31, d0528 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d3, d1, d30529 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d4, d3, d2530 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d5, d4, d9531 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d5, [x14], 8532 | | | | 0.50 | 0.50 | || 4.0 | | ldr d6, [x14, x18, lsl 3]533 | | | | 0.50 | 0.50 | || | | ldr d16, [x14, 8]534 | 0.33 | 0.33 | 0.33 | | | || | | add x13, x14, 8535 | | | | 0.50 | 0.50 | || | | ldr d7, [x14, x30, lsl 3]536 | 0.50 | 0.50 | | | | || 6.0 | | fadd d17, d6, d16537 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d18, d17, d5538 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d19, d18, d7539 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d20, d19, d9540 | | | | 0.50 | 0.50 | 1.00 || | | str d20, [x15, -24]541 | | | | 0.50 | 0.50 | || | | ldr d21, [x13, x18, lsl 3]542 | | | | 0.50 | 0.50 | || | | ldr d23, [x14, 16]543 | | | | 0.50 | 0.50 | || | | ldr d22, [x13, x30, lsl 3]544 | 0.50 | 0.50 | | | | || | | fadd d24, d21, d23545 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d25, d24, d20546 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d26, d25, d22547 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d27, d26, d9548 | | | | 0.50 | 0.50 | 1.00 || | | str d27, [x14, 8]549 | | | | 0.50 | 0.50 | || | | ldr d30, [x15]550 | | | | 0.50 | 0.50 | || | | ldr d28, [x16, x18, lsl 3]551 | | | | 0.50 | 0.50 | || | | ldr d29, [x16, x30, lsl 3]552 | 0.50 | 0.50 | | | | || | | fadd d31, d28, d30553 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d2, d31, d27554 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fadd d0, d2, d29555 | 0.50 | 0.50 | | | | || 6.0 | 6.0 | fmul d30, d0, d9556 | | | | 0.50 | 0.50 | 1.00 || 4.0 | | str d30, [x15, -8]557 | 0.33 | 0.33 | 0.33 | | | || | | cmp x7, x15558 | | | | | | || | | * bne .L20

9.83 9.83 1.33 8.00 8.00 4.00 100.0 72.0

Loop-Carried Dependencies Analysis Report-----------------------------------------526 | 1.0 | add x15, x15, 32 | [526]555 | 72.0 | fmul d30, d0, d9 | [528, 529, 530, 537, 538, 539, 545, 546, 547, 553, 554, 555]

PMBS19 | OSACA | Jan Laukemann

100 72

Block Throughput 2.46 cy

Critical Path 25.0 cy

Loop-Carried Dep. 18.0 cy9.83 9.83

Gauss-Seidel Method Example – Output

Page 43: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 43PMBS19 | OSACA | Jan Laukemann

521: ldr 4

527: fadd 528: fadd

522: ldr

529: fadd

525: ldr

530: fmul 537: fadd 538: fadd 539: fmul 545: fadd 546: fadd 553: fadd547: fmul 554: fadd 555: fmul

556: str

523: mov

531: str

532: ldr

536: fadd

533: ldr

535: ldr

534: add

542: ldr

548: str544: fadd

541: ldr

543: ldr

540: str

526: add

549: ldr

557: cmp

552: fadd

524: add

551: ldr

550: ldr

6 6 6 6 6 6 6 6 6 6 6 6

6

61

CP

LCD 1

LCD 2

Gauss-Seidel Method Example – Output

Page 44: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 44PMBS19 | OSACA | Jan Laukemann

ArchitectureUnroll

factor

MeasuredPrediction [cy/it]

OSACA IACA LLVM-MCA

MLUP/s cy/it TP LCD CP TP LCD CP TP LCD CP

Intel Cascade Lake X 4x 178.3 14.02 2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

AMD Zen 4x 194.4 11.83 2.0 11.5 15.0 3.0 18.0 24.0

Marvell ThunderX2 4x 118.9 18.50 2.46 18.0 25.0

Results & Comparison

Page 45: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 45PMBS19 | OSACA | Jan Laukemann

ArchitectureUnroll

factor

MeasuredPrediction [cy/it]

OSACA IACA LLVM-MCA

MLUP/s cy/it TP LCD CP TP LCD CP TP LCD CP

Intel Cascade Lake X 4x 178.3 14.02 2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

AMD Zen 4x 194.4 11.83 2.0 11.5 15.0 3.0 18.0 24.0

Marvell ThunderX2 4x 118.9 18.50 2.46 18.0 25.0

Results & Comparison

2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

Page 46: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 46PMBS19 | OSACA | Jan Laukemann

ArchitectureUnroll

factor

MeasuredPrediction [cy/it]

OSACA IACA LLVM-MCA

MLUP/s cy/it TP LCD CP TP LCD CP TP LCD CP

Intel Cascade Lake X 4x 178.3 14.02 2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

AMD Zen 4x 194.4 11.83 2.0 11.5 15.0 3.0 18.0 24.0

Marvell ThunderX2 4x 118.9 18.50 2.46 18.0 25.0

Results & Comparison

2.0 11.5 15.0 3.0 18.0 24.0

2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

Page 47: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

18.11.2019 47PMBS19 | OSACA | Jan Laukemann

ArchitectureUnroll

factor

MeasuredPrediction [cy/it]

OSACA IACA LLVM-MCA

MLUP/s cy/it TP LCD CP TP LCD CP TP LCD CP

Intel Cascade Lake X 4x 178.3 14.02 2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

AMD Zen 4x 194.4 11.83 2.0 11.5 15.0 3.0 18.0 24.0

Marvell ThunderX2 4x 118.9 18.50 2.46 18.0 25.0

Results & Comparison

2.46 18.0 25.0

2.0 11.5 15.0 3.0 18.0 24.0

2.19 14.0 18.02.0

(14.0)2.0 14.75 19.0

Page 48: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• Automatic extraction, throughput and critical path analysis

• Cross-platform (Intel, AMD, ARM)

• Accurate predictions

• Open Source

• Allows architectural exploration

18.11.2019 48PMBS19 | OSACA | Jan Laukemann

Summary – OSACA

Page 49: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• Support of hidden dependencies

• More precise LCD analysis

• Support new micro-architectures (Zen 2, Power 9, …)

• More precise latency analysis for FMA instructions

• Considering ROB, register renaming, retirement, …

• Optimally balanced port utilization

18.11.2019 49PMBS19 | OSACA | Jan Laukemann

Future Work

Page 50: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• Support of hidden dependencies

• More precise LCD analysis

• Support new micro-architectures (Zen 2, Power 9, …)

• More precise latency analysis for FMA instructions

• Considering ROB, register renaming, retirement, …

• Optimally balanced port utilization

18.11.2019 50PMBS19 | OSACA | Jan Laukemann

Future Work

Page 51: AutomaticThroughput and Critical Path Analysis of x86 and ARM … · 2019. 11. 18. · J. Laukemann, J. Hammer, G. Hager, G. Wellein. 1. Analytic Performance Modeling Why? Assumptions

• Support of hidden dependencies

• More precise LCD analysis

• Support new micro-architectures (Zen 2, Power 9, …)

• More precise latency analysis for FMA instructions

• Considering ROB, register renaming, retirement, …

• Optimally balanced port utilization

18.11.2019 51PMBS19 | OSACA | Jan Laukemann

Future Work

Open Source Architecture Code Analyzer

github.com/RRZE-HPC/OSACA

Reproduce at: https://github.com/RRZE-HPC/OSACA-CP-2019/