
A Roadmap to Restoring Computing's Former Glory

David I. August

Princeton University

(Not speaking for Parakinetics, Inc.)

Golden era of computer architecture

[Figure: SPEC CINT performance (log scale) by year, 1992-2012, spanning the CPU92, CPU95, CPU2000, and CPU2006 suites; annotated "~3 years behind".]

Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters

10 Cores!

10-Core Intel Xeon: “Unparalleled Performance”

P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)

[Diagram: parallel resources with automatic speculation, automatic pipelining, automatic allocation/scheduling, and commit.]

MULTICORE ARCHITECTURE (CIRCA 2010)

[Diagram: parallel resources with automatic speculation, automatic pipelining, automatic allocation/scheduling, and commit.]

[Figure: two plots of threads vs. time: realizable parallelism, and parallel library calls. Credit: Jack Dongarra]

“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
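Taken at face value (a back-of-the-envelope reading, not a figure from the deck), doubling every 18 years is compound growth of

\[ 2^{1/18} \approx 1.04, \]

that is, roughly 4% per year from compiler advances, a small fraction of the annual gains hardware delivered during the golden era shown above.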

Multicore Needs:

1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.

Parallel Programming · Automatic Parallelization · Parallel Libraries · Computer Architecture

The proposal: implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization. A roadmap to restoring computing's former glory.

Multicore Needs (recap): items 1-4 above.

One Implementation

[System diagram: new or existing sequential code and new or existing libraries, each carrying insight annotations, feed the DSWP family of optis, speculative optis, and other optis; a Complainer/Fixer drives the loop; the output is parallelized code running on machine-specific performance primitives.]

Spec-PS-DSWP

[Schedule over cycles 0-5 on four cores: the sequential list-traversal stage (LD:1-LD:5) runs on Core 1, the parallel work stage (W:1-W:4) is spread across Cores 2 and 3, and commits (C:1-C:3) run on Core 4.]

Example

    A: while (node) {
    B:   node = node->next;
    C:   res = work(node);
    D:   write(res);
       }

Program Dependence Graph: nodes A, B, C, D, connected by control dependences and data dependences.

[Schedule: statements A1-D2 spread across Cores 1-3 over time.]
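Before the speculative variants, it helps to see a baseline DSWP partitioning of this loop. Below is a minimal hand-written sketch, assuming POSIX threads and a toy busy-waiting queue; real DSWP inserts produce/consume primitives automatically, and work(), the list contents, and the queue here are hypothetical stand-ins. The sequential pointer chase (A+B) becomes stage 1, work() (C) stage 2, and output (D) stage 3, with values flowing one way between cores.

    /* DSWP sketch: three pipeline stages on three threads, linked by
     * single-producer/single-consumer queues. Toy code: busy-waiting
     * queues and GCC builtins, not production synchronization. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct node { struct node *next; int val; } node_t;

    #define QSIZE 64
    #define DONE  ((intptr_t)-1)             /* assumed out-of-band token */
    typedef struct {
        intptr_t buf[QSIZE];
        volatile unsigned head, tail;
    } queue_t;

    static void q_push(queue_t *q, intptr_t v) {
        while (q->tail - q->head == QSIZE) ;      /* full: spin */
        q->buf[q->tail % QSIZE] = v;
        __sync_synchronize();                     /* publish data first */
        q->tail++;
    }

    static intptr_t q_pop(queue_t *q) {
        while (q->tail == q->head) ;              /* empty: spin */
        intptr_t v = q->buf[q->head % QSIZE];
        __sync_synchronize();
        q->head++;
        return v;
    }

    static queue_t q1, q2;                        /* inter-stage queues */
    static int work(node_t *n) { return 2 * n->val; }   /* stand-in */

    /* Stage A+B: the sequential pointer chase, alone on one core. */
    static void *stage_traverse(void *list) {
        for (node_t *n = list; n; n = n->next)
            q_push(&q1, (intptr_t)n);
        q_push(&q1, DONE);
        return NULL;
    }

    /* Stage C: the expensive work(), decoupled from the traversal. */
    static void *stage_work(void *unused) {
        intptr_t v;
        while ((v = q_pop(&q1)) != DONE)
            q_push(&q2, (intptr_t)work((node_t *)v));
        q_push(&q2, DONE);
        return unused;
    }

    /* Stage D: output in original program order. */
    static void *stage_write(void *unused) {
        intptr_t v;
        while ((v = q_pop(&q2)) != DONE)
            printf("%ld\n", (long)v);
        return unused;
    }

    int main(void) {
        node_t c = { NULL, 3 }, b = { &c, 2 }, a = { &b, 1 };
        pthread_t t1, t2, t3;
        pthread_create(&t1, NULL, stage_traverse, &a);
        pthread_create(&t2, NULL, stage_work, NULL);
        pthread_create(&t3, NULL, stage_write, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        return 0;
    }

The communication is acyclic and one-way, which is the property the later comparison slides lean on: queue latency adds pipeline-fill time but never lands on the loop's recurrence.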

Spec-DOALL

[Same example and Program Dependence Graph. Spec-DOALL assigns each whole iteration to its own core: A1, B1, C1, D1 on Core 1; A2, B2, C2, D2 on Core 2; A3, B3, ... on Core 3.]

To issue iterations before the exit test resolves, Spec-DOALL speculates the loop exit: "A: while (node) {" becomes "while (true) {", with the exit test checked inside the loop, as sketched below.

[Schedule: iterations B2-D2, B3-D3, B4-D4 issued in parallel across cores. Result on 197.parser: slowdown.]
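A runnable toy version of that transformation, sequential and with the speculation runtime stubbed out as no-ops (spec_begin/spec_commit/spec_squash are hypothetical hooks, not an API from the deck):

    /* Loop-exit speculation, Spec-DOALL style: "while (node)" becomes
     * "while (1)", and the exit test moves inside the loop as a
     * misspeculation check. Stubs stand in for a speculation runtime. */
    #include <stdio.h>
    #include <stddef.h>

    typedef struct node { struct node *next; int val; } node_t;

    static void spec_begin(void)  {}   /* hypothetical: checkpoint     */
    static void spec_commit(void) {}   /* hypothetical: publish writes */
    static void spec_squash(void) {}   /* hypothetical: discard writes */

    static int  work(node_t *n)  { return 2 * n->val; }
    static void write_res(int r) { printf("%d\n", r); }

    int main(void) {
        node_t c = { NULL, 3 }, b = { &c, 2 }, a = { &b, 1 };
        node_t *node = &a;

        while (1) {                    /* A: exit test speculated away */
            spec_begin();
            node = node->next;         /* B: the pointer chase         */
            if (!node) {               /* exit check, now in-loop      */
                spec_squash();         /* squash the extra iteration   */
                break;
            }
            int res = work(node);      /* C */
            write_res(res);            /* D */
            spec_commit();
        }
        return 0;
    }

Exit speculation alone does not rescue this loop: B is still a loop-carried dependence, so parallel iterations would misspeculate on the pointer chase constantly, which is presumably what the 197.parser slowdown records.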

Spec-DOACROSS and Spec-DSWP

[Schedules on Cores 1-3. Spec-DOACROSS: whole iterations rotate around the cores (B1/C1/D1 on Core 1, B2/C2/D2 on Core 2, ...). Spec-DSWP: each stage stays on one core (all Bi on Core 1, all Ci on Core 2, all Di on Core 3). Throughput: 1 iter/cycle for both.]

Comparison: Spec-DOACROSS and Spec-DSWP

[Schedules under varying communication latency. Comm. latency = 1: both achieve 1 iter/cycle. Comm. latency = 2: Spec-DOACROSS falls to 0.5 iter/cycle, while Spec-DSWP still achieves 1 iter/cycle after the pipeline fill time.]
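A simplified throughput model (my gloss on the slide, assuming unit-time statements) captures why. Spec-DOACROSS forwards the loop-carried value core-to-core every iteration, so the communication latency L sits on the recurrence; Spec-DSWP pins each stage to a core and sends values one way, so L only lengthens the fill:

\[
\mathrm{Throughput}_{\mathrm{DOACROSS}} = \frac{1}{L}\ \text{iter/cycle},
\qquad
\mathrm{Throughput}_{\mathrm{DSWP}} = \frac{1}{\max_s t_s} = 1\ \text{iter/cycle after fill.}
\]

At L = 1 both give 1 iter/cycle; at L = 2 the model reproduces the slide's 0.5 vs. 1 iter/cycle.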

TLS vs. Spec-DSWP [MICRO 2010]

[Figure: performance speedup (×) vs. (number of total cores, number of nodes), from (1,1) to (128,32); y-axis 0-50×; series: TLS and Spec-PS-DSWP; geomean of 11 benchmarks on the same cluster.]

Multicore Needs (recap): items 1-4 above.

[System diagram repeated (see above).]


    char *memory;

    void *alloc(int size);

    void *alloc(int size) {
        void *ptr = memory;
        memory = memory + size;
        return ptr;
    }

[Execution plan: alloc1-alloc6 on Cores 1-3 over time; the dependence through memory serializes the calls.]


    char *memory;

    @Commutative
    void *alloc(int size);

    void *alloc(int size) {
        void *ptr = memory;
        memory = memory + size;
        return ptr;
    }

[Execution plan: with the annotation, alloc1-alloc6 may run concurrently across Cores 1-3.]


[Same code and execution plan, annotated @Commutative.]

Easily Understood Non-Determinism!

[MICRO ’07, Top Picks ’08; Automatic: PLDI ’11]

~50 of ½ million LOCs modified in SPECint2000. The mods also include Non-Deterministic Branch.
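A minimal sketch of the contract @Commutative declares, assuming a lock-based realization (the deck does not show the system's actual mechanism): calls to alloc() may run in any order, so the parallelizer may drop the inter-call dependence on memory, provided each call executes atomically. The pool setup and worker harness below are hypothetical.

    /* Sketch: a lock makes the annotated alloc() atomic, so concurrent
     * calls commute: any interleaving hands out distinct blocks, just
     * not deterministic ones. Lock-based atomicity is an assumption of
     * this sketch. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static char *memory;                               /* bump pointer */
    static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    /* @Commutative */
    void *alloc(int size) {
        pthread_mutex_lock(&alloc_lock);               /* atomic region */
        void *ptr = memory;
        memory = memory + size;
        pthread_mutex_unlock(&alloc_lock);
        return ptr;
    }

    static void *worker(void *unused) {
        for (int i = 0; i < 4; i++)
            printf("got %p\n", alloc(16));             /* order varies */
        return unused;
    }

    int main(void) {
        memory = malloc(1 << 20);                      /* backing pool */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Which caller gets which block varies run to run, but every schedule is correct: that is the easily understood non-determinism the slide names.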

Multicore Needs (recap): items 1-4 above.

[System diagram repeated (see above).]

Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]

[Figure: the same three transformations (Unroll, Sum Reduction, Rotate) applied in different orders give wildly different results, with observed speedups of 0.90X, 0.10X, 30.0X, 1.1X, 0.8X, and 1.5X.]
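The search those numbers imply can be driven by a loop like the following sketch: enumerate orderings, rebuild, time, keep the best. mycc, its -passes= flag, app.c, and the pass names are hypothetical placeholders, not a real compiler's interface.

    /* Hypothetical iterative-compilation driver: try each ordering of
     * three passes, rebuild and time the program, keep the fastest. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);   /* wall-clock time */
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    static double build_and_run(const char *passes) {
        char cmd[256];
        snprintf(cmd, sizeof cmd, "mycc -passes=%s app.c -o app", passes);
        if (system(cmd) != 0) return 1e30;     /* build failed */
        double t0 = now_sec();
        if (system("./app") != 0) return 1e30; /* run failed */
        return now_sec() - t0;
    }

    int main(void) {
        const char *orders[] = {               /* all 6 orderings */
            "unroll,sumred,rotate", "unroll,rotate,sumred",
            "sumred,unroll,rotate", "sumred,rotate,unroll",
            "rotate,unroll,sumred", "rotate,sumred,unroll",
        };
        const char *best = NULL;
        double best_t = 1e30;
        for (int i = 0; i < 6; i++) {
            double t = build_and_run(orders[i]);
            printf("%-24s %8.3f s\n", orders[i], t);
            if (t < best_t) { best_t = t; best = orders[i]; }
        }
        printf("best ordering: %s\n", best ? best : "(none)");
        return 0;
    }

The Complainer slides that follow make this search targeted rather than blind.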

PS-DSWP Complainer

[PDG view of the remaining blockers:
• Red edges: dependences between malloc() and free()
• Blue edges: dependences between rand() calls
• Green edges: flow dependences inside the inner loop
• Orange edges: dependences between function calls]

The Complainer asks "Who can help me?" and the blockers are discharged in turn across the following slides: a Sum Reduction, a Commutative annotation from the PROGRAMMER, and a Commutative annotation from the LIBRARY.

Scalable Speedup!

[Figure: speedup vs. number of cores (1-64), y-axis 0-50, comparing Parallel HMMER V2 with HMMER with Commutative.]

Multicore Needs (recap): items 1-4 above.

[System diagram repeated (see above).]

Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].

Restoration of Trend

[Figure: the performance trend of the opening slide restored: architecture/devices of the Era of DIY (multicore, reconfigurable, GPUs, clusters) compounded with compiler technology; Proebsting's Law ("Compiler Advances Double Computing Power Every 18 Years!") quoted again.]

Compiler technology inspired class of architectures?

The End
