A Roadmap to Restoring Computing's Former Glory
David I. August
Princeton University
(Not speaking for Parakinetics, Inc.)
Golden era of computer architecture
[Figure: SPEC CINT performance (log scale) versus year, 1992 to 2012, for the CPU92, CPU95, CPU2000, and CPU2006 suites. Annotation: ~3 years behind.]
Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters
10 Cores!
10-Core Intel Xeon: “Unparalleled Performance”
P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)
Automatic Speculation
Automatic Pipelining
Parallel Resources
Automatic Allocation/Scheduling
Commit
MULTICORE ARCHITECTURE (CIRCA 2010)
Automatic Pipelining
Parallel Resources
Automatic Speculation
Automatic Allocation/Scheduling
Commit
Realizable parallelism
[Figure: threads versus time, with a panel labeled "Parallel Library Calls".]
Credit: Jack Dongarra
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
Parallel Programming
Automatic Parallelization
Parallel Libraries
Computer Architecture
Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization
A Roadmap to restoring computing’s former glory.
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Figure: framework diagram. Labels: New or Existing Sequential Code; New or Existing Libraries; Insight Annotation (on both inputs); One Implementation; DSWP Family Optis; Speculative Optis; Other Optis; Complainer/Fixer; Machine Specific Performance Primitives; Parallelized Code.]
Spec-PS-DSWP
[Figure: execution schedule on Cores 1 to 4 over time steps 0 to 5, with stages labeled LD:1 to LD:5, W:1 to W:4, and C:1 to C:3, shown alongside the P6 SUPERSCALAR ARCHITECTURE diagram.]
Example
A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }
[Figure: schedule over time placing A1, B1, C1, D1 and A2, B2, C2, D2 on Cores 1 to 3.]
Program Dependence Graph
[Figure: graph over statements A, B, C, and D with control-dependence and data-dependence edges.]
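The dependence structure of the example loop can be sketched in code. The following is a minimal sketch, not the deck's tooling: the edge set is one reading of the four statements (the slide's graph itself did not survive extraction), and the names `build_pdg`, `transitive_closure`, and `on_cycle` are hypothetical. It marks which statements sit on a dependence cycle, which is what keeps a loop from being split freely across cores.

```c
#include <assert.h>
#include <stdbool.h>

enum stmt { A, B, C, D, N_STMTS };

/* dep[x][y] is true when statement y depends on statement x.
 * The edges are an assumed reading of the example loop. */
static void build_pdg(bool dep[N_STMTS][N_STMTS]) {
    for (int i = 0; i < N_STMTS; i++)
        for (int j = 0; j < N_STMTS; j++)
            dep[i][j] = false;
    dep[A][B] = dep[A][C] = dep[A][D] = true; /* control: the loop test guards B, C, D */
    dep[B][A] = true; /* data: the loop test reads node, written by B */
    dep[B][B] = true; /* data: B reads the node it wrote one iteration earlier */
    dep[B][C] = true; /* data: work(node) reads node */
    dep[C][D] = true; /* data: write(res) reads res */
}

/* Warshall's algorithm: afterwards dep[x][y] means y depends on x,
 * directly or through a chain of other statements. */
static void transitive_closure(bool dep[N_STMTS][N_STMTS]) {
    for (int k = 0; k < N_STMTS; k++)
        for (int i = 0; i < N_STMTS; i++)
            for (int j = 0; j < N_STMTS; j++)
                if (dep[i][k] && dep[k][j])
                    dep[i][j] = true;
}

/* Returns true when statement s lies on a dependence cycle. */
static bool on_cycle(int s) {
    assert(s >= 0 && s < N_STMTS);
    bool dep[N_STMTS][N_STMTS];
    build_pdg(dep);
    transitive_closure(dep);
    return dep[s][s];
}
```

Under these assumed edges, `on_cycle` reports A and B as cyclic (the pointer chase and the loop test) and C and D as acyclic, so only the list traversal is inherently sequential.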
Example: Spec-DOALL
[Figure: Spec-DOALL schedule over time for the example loop, with statements A to D of iterations 1, 2, and 3 assigned to Cores 1 to 3, next to the Program Dependence Graph over A, B, C, and D with control-dependence and data-dependence edges.]
Example: Spec-DOALL
A: while (true) {        [replaces: while (node) {]
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }
[Figure: Spec-DOALL schedule over time with statements B, C, and D of iterations 1 to 4 on Cores 1 to 3 and A1, A2, A3 issued up front, next to the Program Dependence Graph over A, B, C, and D with control-dependence and data-dependence edges.]
197.parser: Slowdown
Spec-DOACROSS
[Figure: schedule over time of statements B, C, and D for iterations 1 to 7 on Cores 1 to 3.]
Throughput: 1 iter/cycle
Spec-DSWP
[Figure: schedule over time of statements B, C, and D for iterations 1 to 7 on Cores 1 to 3.]
Throughput: 1 iter/cycle
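The stage partitioning behind a DSWP-style schedule can be emulated in a single thread. This is a minimal sketch under stated assumptions, not the compiler's output: it uses a simplified variant of the example loop that visits every node, `work` is a hypothetical stand-in, the queues are plain arrays, and the three stages run one after another rather than concurrently on separate cores. What it shows is that values flow only forward between stages, so the result matches the sequential loop.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 64

struct node { int value; struct node *next; };

/* Hypothetical stand-in for the slide's work(): any pure per-node function. */
static int work(const struct node *n) { return n->value * n->value; }

/* Reference version: traversal, work, and write in one sequential loop. */
static size_t run_sequential(const struct node *node, int out[MAX_ITEMS]) {
    assert(out != NULL);
    size_t n = 0;
    while (node && n < MAX_ITEMS) {
        out[n++] = work(node);   /* C, then D */
        node = node->next;       /* B */
    }
    return n;
}

/* Stage-partitioned version: each stage only reads its input queue and
 * writes its output queue, so no value ever flows backwards. */
static size_t run_staged(const struct node *node, int out[MAX_ITEMS]) {
    assert(out != NULL);
    const struct node *q1[MAX_ITEMS]; /* queue from stage 1 to stage 2 */
    int q2[MAX_ITEMS];                /* queue from stage 2 to stage 3 */
    size_t n = 0;

    /* Stage 1 (A, B): the pointer chase, the only cyclic part. */
    for (; node && n < MAX_ITEMS; node = node->next)
        q1[n++] = node;

    /* Stage 2 (C): per-node work, fed by q1. */
    for (size_t i = 0; i < n; i++)
        q2[i] = work(q1[i]);

    /* Stage 3 (D): output, fed by q2. */
    for (size_t i = 0; i < n; i++)
        out[i] = q2[i];

    return n;
}
```

In a real pipeline each stage would run on its own core and the queues would be bounded inter-core channels; the emulation only checks that the partition preserves the sequential result.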
Comparison: Spec-DOACROSS and Spec-DSWP
                      Spec-DOACROSS     Spec-DSWP
Comm. Latency = 1:    1 iter/cycle      1 iter/cycle
Comm. Latency = 2:    0.5 iter/cycle    1 iter/cycle
[Figure: schedules over time of statements B, C, and D for iterations 1 to 7 on Cores 1 to 3 under each scheme; the Spec-DSWP schedule is annotated "Pipeline Fill time".]
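The throughput figures in the comparison can be reproduced with a small timing model. This is a sketch under assumptions that are not stated on the slide: every statement takes one cycle, a value produced in cycle t is usable on the same core in cycle t + 1 and on another core in cycle t + latency, and three cores are available. The function names are hypothetical.

```c
#include <assert.h>

/* Cycle in which D of the last iteration issues under a DOACROSS-style plan:
 * each iteration runs B, C, D on one core, and B(i) needs the node produced
 * by B(i-1) on a different core. With three cores and lat >= 1 the target
 * core is always free by the time the value arrives. */
static long doacross_last_issue(long iters, long lat) {
    assert(iters >= 1 && lat >= 1);
    long b = (iters - 1) * lat; /* B(i) issues lat cycles after B(i-1) */
    return b + 2;               /* C and D follow B on the same core */
}

/* Same measurement under a DSWP-style plan: B, C, and D each own a core,
 * so the cyclic dependence B(i-1) -> B(i) never crosses cores and the
 * communication latency is paid once per stage boundary. */
static long dswp_last_issue(long iters, long lat) {
    assert(iters >= 1 && lat >= 1);
    long c = 0, d = 0;
    for (long i = 0; i < iters; i++) {
        long b = i;                                       /* B issues every cycle */
        long c_ready = b + lat;                           /* node arrives at C's core */
        c = (i > 0 && c + 1 > c_ready) ? c + 1 : c_ready;
        long d_ready = c + lat;                           /* res arrives at D's core */
        d = (i > 0 && d + 1 > d_ready) ? d + 1 : d_ready;
    }
    return d;
}
```

For 100 iterations the model gives a last issue at cycle 101 for both plans at latency 1. At latency 2 it gives cycle 200 for the DOACROSS-style plan (0.5 iter/cycle) and cycle 103 for the DSWP-style plan (1 iter/cycle after a longer fill).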
TLS vs. Spec-DSWP [MICRO 2010]
Geomean of 11 benchmarks on the same cluster
[Figure: performance speedup (X), 0 to 50, versus (Number of Total Cores, Number of Nodes) from (1,1) and (8,2) up to (128,32) in steps of (8,2), for TLS and Spec-PS-DSWP.]
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
char *memory;

void *alloc(int size);

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

Execution Plan
[Figure: schedule over time of alloc1 to alloc6 on Cores 1 to 3.]
char *memory;

@Commutative
void *alloc(int size);

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

Execution Plan
[Figure: schedule over time of alloc1 to alloc6 on Cores 1 to 3, with alloc annotated @Commutative.]
Easily Understood Non-Determinism!
[MICRO ’07, Top Picks ’08; Automatic: PLDI ’11]
~50 of ½ Million LOCs modified in SpecINT 2000
Mods also include Non-Deterministic Branch
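Why reordering calls to alloc is harmless can be checked directly. The sketch below is an illustration under assumptions: the 1024-byte `pool`, `reset_pool`, `disjoint`, and `run_order` are hypothetical helpers, and the calls run sequentially, so the atomicity that concurrent calls would need is not modeled. It shows that either call order hands out non-overlapping blocks and consumes the same amount of memory, even though the returned addresses differ.

```c
#include <assert.h>
#include <stddef.h>

static char pool[1024];      /* hypothetical backing store for the sketch */
static char *memory = pool;

/* The slide's bump allocator, with a bounds check added for the sketch. */
static void *alloc(int size) {
    assert(size >= 0 && size <= (int)(pool + sizeof pool - memory));
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

static void reset_pool(void) { memory = pool; }

/* Returns 1 when the blocks [p, p + psize) and [q, q + qsize) do not overlap. */
static int disjoint(const char *p, int psize, const char *q, int qsize) {
    return p + psize <= q || q + qsize <= p;
}

/* Calls alloc(first) then alloc(second) on a fresh pool, stores the number
 * of bytes consumed in *used, and returns 1 when the two blocks are disjoint. */
static int run_order(int first, int second, long *used) {
    reset_pool();
    const char *p = alloc(first);
    const char *q = alloc(second);
    *used = (long)(memory - pool);
    return disjoint(p, first, q, second);
}
```

Calling `run_order(16, 32, &used)` and `run_order(32, 16, &used)` both succeed and both leave 48 bytes consumed; only the addresses each caller receives change, which is the non-determinism the annotation declares acceptable.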
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]
[Figure: search tree over orderings of the Sum Reduction, Unroll, and Rotate transformations, with speedups of 0.10X, 0.8X, 0.90X, 1.1X, 1.5X, and 30.0X depending on the path taken.]
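Of the transformations named here, sum reduction is the easiest to show in isolation. The following is a minimal sketch, assuming integer data (so reassociation cannot change the result) and a hypothetical cap of 16 workers; the per-worker loops run one after another here, whereas a parallelizing compiler would place each on its own core.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* Original form: every iteration depends on the previous one through sum. */
static long sum_sequential(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Sum reduction: each worker accumulates into a private partial sum, which
 * removes the cross-iteration dependence; the partials are combined at the end. */
static long sum_reduction(const int *a, size_t n, size_t workers) {
    assert(workers >= 1 && workers <= MAX_WORKERS);
    long partial[MAX_WORKERS] = {0};

    for (size_t w = 0; w < workers; w++)        /* each w is independent */
        for (size_t i = w; i < n; i += workers) /* round-robin slice of a */
            partial[w] += a[i];

    long sum = 0;
    for (size_t w = 0; w < workers; w++)
        sum += partial[w];
    return sum;
}
```

For the integers 1 to 10, both functions return 55 regardless of the worker count, which is why the transformation is safe to apply before the other optimizations search for a profitable ordering.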
PS-DSWP Complainer
• Red Edges: Deps between malloc() & free()
• Blue Edges: Deps between rand() calls
• Green Edges: Flow Deps inside Inner Loop
• Orange Edges: Deps between function calls
PS-DSWP Complainer: “Who can help me?”
• Unroll
• Sum Reduction
• Rotate
• Programmer Annotation
PS-DSWP Complainer, responses shown in successive steps:
• Sum Reduction
• PROGRAMMER: Commutative
• LIBRARY: Commutative
Scalable Speedup!
[Figure: speedup (y-axis ticks 0 to 50) versus x-axis ticks 1 to 64, for Parallel HMMER V2 and HMMER with Commutative.]
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
Performance relative to Best Sequential
128 Cores in 32 Nodes with Intel Xeon Processors [MICRO 2010]
Restoration of Trend
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Compiler Technology
Architecture/Devices
Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters
Compiler technology inspired class of architectures?
The End