Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
– Break the problem, use half the energy
– 1000 mickey mouse cores
– Hardware is sequential
– Server throughput (how many pins?)
– What about GPUs and databases?
• Current bugs to exploiting parallelism (or are they?)
– Dark silicon
– Amdahl’s Law
– The Cloud
• The answer
– The fundamental concept vis-à-vis parallelism
– What it means re: the transformation hierarchy
It starts with the raw material (Moore’s Law)
• The first microprocessor (Intel 4004), 1971
– 2300 transistors
– 106 KHz
• The Pentium chip, 1992
– 3.1 million transistors
– 66 MHz
• Today
– More than one billion transistors
– Frequencies in excess of 5 GHz
• Tomorrow?
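As a rough check on the raw material, the two data points above imply an average doubling period for transistor count. The sketch below uses only the figures on this slide; the helper name `doubling_time_years` is made up for illustration, and the result is an average over the whole interval, not a claim about any particular year.

```python
import math

def doubling_time_years(year0, count0, year1, count1):
    """Average years per doubling between two (year, transistor count) points."""
    doublings = math.log2(count1 / count0)
    return (year1 - year0) / doublings

# Figures as given on the slide: Intel 4004 (1971) and Pentium (1992).
period = doubling_time_years(1971, 2300, 1992, 3.1e6)
print(f"about {period:.1f} years per doubling")
```

With these two points the period comes out at roughly two years per doubling.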
And what we have done with this raw material
[Figure: number of transistors over time, divided between the microprocessor and cache]
Too many people do not realize: Parallelism did not start with Multi-core
• Pipelining
• Out-of-order Execution
• Multiple operations in a single microinstruction
• VLIW (horizontal microcode exposed to the software)
One thousand mickey mouse cores
• Why not a million? Why not ten million?
• Let’s start with 16
– What if we could replace 4 with one more powerful core?
• …and we learned:
– One more powerful core is not enough
– Sometimes we need several
– Morphcore was born
– BUT not all morphcore (fixed function vs flexibility)
The Asymmetric Chip Multiprocessor (ACMP)
[Figure: three chip organizations compared]
– ACMP approach: one large core plus 12 Niagara-like cores
– “Niagara” approach: 16 Niagara-like cores
– “Tile-Large” approach: 4 large cores
Large core vs. Small Core
Large core
• Out-of-order
• Wide fetch, e.g. 4-wide
• Deeper pipeline
• Aggressive branch predictor (e.g. hybrid)
• Many functional units
• Trace cache
• Memory dependence speculation
Small core
• In-order
• Narrow fetch, e.g. 2-wide
• Shallow pipeline
• Simple branch predictor (e.g. gshare)
• Few functional units
Throughput vs. Serial Performance
[Figure: speedup vs. 1 large core (y-axis, 0 to 9) plotted against degree of parallelism (x-axis, 0 to 1) for Niagara, Tile-Large, and ACMP]
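The shape of that comparison can be approximated with a simple analytic model. This is a sketch under stated assumptions, not the data behind the plot: one large core occupies the area of 4 small cores (from the earlier slide), a small core is assumed to run at half the speed of a large core (`SMALL_PERF`, a hypothetical value), the parallel part scales perfectly, and on the ACMP the large core is assumed to help during the parallel part.

```python
SMALL_PERF = 0.5  # assumption: small core speed relative to one large core

def speedup(p, serial_perf, parallel_perf):
    """Speedup vs. one large core for a program with parallel fraction p."""
    time = (1.0 - p) / serial_perf + p / parallel_perf
    return 1.0 / time

def niagara(p):
    # 16 small cores; the serial part also runs on a small core.
    return speedup(p, SMALL_PERF, 16 * SMALL_PERF)

def tile_large(p):
    # 4 large cores; the serial part runs on a large core.
    return speedup(p, 1.0, 4 * 1.0)

def acmp(p):
    # 1 large core for the serial part; large core plus 12 small cores in parallel.
    return speedup(p, 1.0, 1.0 + 12 * SMALL_PERF)

for p in (0.0, 0.5, 0.9, 1.0):
    print(f"p={p:.1f}  Niagara={niagara(p):.2f}  "
          f"Tile-Large={tile_large(p):.2f}  ACMP={acmp(p):.2f}")
```

Under these assumptions the all-small-core chip wins only when nearly everything is parallel, the all-large-core chip wins only when almost nothing is, and the asymmetric chip is ahead across the wide middle range.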
Server throughput
• The Good News: Not a software problem
– Each core runs its own problem
• The Bad News: How many pins?
– Memory bandwidth
• More Bad News: How much energy?
– Each core runs its own problem
What about GPUs and Databases?
• In theory, absolutely!
• GPUs (SMT + SIMD + Predication)
– Provided there are no conditional branches (Divergence)
– Provided memory accesses line up nicely (Coalescing)
• Databases
– Provided there are no critical sections
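The two GPU provisos can be illustrated with a toy cost model. It is a simplified sketch, not a description of any particular GPU: a "warp" is modeled as 32 lanes executing in lockstep, a divergent branch makes the warp run both paths with some lanes masked off (predication), and memory is assumed to be fetched in aligned 128-byte segments. The function names and the segment size are assumptions for illustration.

```python
def warp_branch_cost(takes_branch, then_cost, else_cost):
    """Cycles for one warp to execute an if/else under predication.

    If the lanes disagree, the warp pays for both paths.
    """
    cost = 0
    if any(takes_branch):
        cost += then_cost
    if not all(takes_branch):
        cost += else_cost
    return cost

def memory_transactions(addresses, segment_bytes=128):
    """Number of distinct aligned segments touched by one warp's loads."""
    return len({addr // segment_bytes for addr in addresses})

LANES = 32
uniform = [True] * LANES                        # every lane takes the branch
divergent = [lane % 2 == 0 for lane in range(LANES)]  # lanes disagree

print(warp_branch_cost(uniform, 10, 10))        # one path executed
print(warp_branch_cost(divergent, 10, 10))      # both paths executed

coalesced = [4 * lane for lane in range(LANES)]    # adjacent 4-byte words
strided = [128 * lane for lane in range(LANES)]    # one word per segment
print(memory_transactions(coalesced), memory_transactions(strided))
```

In this model divergence doubles the branch cost and the strided access pattern needs 32 memory transactions where the coalesced one needs a single transaction.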
Dark Silicon
• Too many transistors: we cannot power them all
– All those cores powered down
– All that parallelism wasted
• Not really: The Refrigerator! (aka: Accelerators)
– Fork (in parallel)
– Although not all at the same time!
Amdahl’s Law
• The serial bottleneck always limits performance
• Heterogeneous cores AND control over them can minimize the effect
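Both bullets can be seen in the usual statement of Amdahl’s Law, speedup = 1 / ((1 - p) + p / n) for parallel fraction p on n cores. The sketch below adds a hypothetical `serial_boost` parameter to model a faster core handling the serial part; that parameter and the numbers chosen are assumptions for illustration.

```python
def amdahl_speedup(p, n, serial_boost=1.0):
    """Speedup over one baseline core.

    p            -- fraction of the work that is parallelizable
    n            -- number of baseline cores running the parallel part
    serial_boost -- assumed speed of the core running the serial part,
                    relative to a baseline core (1.0 = classic Amdahl)
    """
    return 1.0 / ((1.0 - p) / serial_boost + p / n)

p = 0.95
for n in (16, 1000, 1_000_000):
    print(n, round(amdahl_speedup(p, n), 2))

# A core twice as fast on the serial part raises the ceiling.
print(round(amdahl_speedup(p, 1000, serial_boost=2.0), 2))
```

With 5% serial work, no number of identical cores gets past a speedup of 20, while speeding up only the serial part moves that limit.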
The Cloud
• It is behind the curtain: how do we manage it?
• Answer: the on-chip run-time system
• Answer: Pragmas beyond the Cloud
The fundamental concept:
Synchronization
[Figure: the transformation hierarchy, top to bottom]
Problem
Algorithm
Program
ISA (Instruction Set Architecture)
Microarchitecture
Circuits
Electrons
At every layer we synchronize
• Algorithm: task dependencies
• ISA: sequential control flow (implicit)
• Microarchitecture: ready bits
• Circuit: clock cycle (implicit)
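The microarchitecture bullet can be made concrete with a toy scheduler in which an operation fires only when the ready bits of all its sources are set. This is a minimal sketch, not a model of any real machine: it assumes every destination has a unique (already renamed) name, unit latency, and unlimited functional units, and the function and register names are hypothetical.

```python
def dataflow_schedule(instructions, initially_ready):
    """Group instructions into cycles using ready bits.

    instructions    -- list of (dest, [sources]) in program order
    initially_ready -- names whose ready bit is set before execution
    Returns a list of cycles; each cycle lists the dests that fire together.
    """
    ready = set(initially_ready)
    pending = list(instructions)
    cycles = []
    while pending:
        # An instruction may fire only if every source's ready bit is set.
        fire = [ins for ins in pending if all(src in ready for src in ins[1])]
        if not fire:
            raise ValueError("deadlock: some source never becomes ready")
        cycles.append([dest for dest, _ in fire])
        # Results set their ready bits at the end of the cycle.
        ready.update(dest for dest, _ in fire)
        pending = [ins for ins in pending if ins not in fire]
    return cycles

program = [
    ("t1", ["a", "b"]),
    ("t2", ["c", "d"]),
    ("t3", ["t1", "t2"]),
    ("t4", ["t3", "a"]),
]
schedule = dataflow_schedule(program, ["a", "b", "c", "d"])
print(schedule)
```

The program is written sequentially, yet `t1` and `t2` fire in the same cycle; the ready bits are the only synchronization needed to keep `t3` and `t4` in order.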
Who understands this?
• Should this be part of students’ parallelism education?
• Where should it come in the curriculum?
• Can students even understand these different layers?
Parallel to Sequential to Parallel
• Guri says: think sequential, execute parallel
– i.e. don’t throw away 60 years of computing experience
– The original HPS model of out-of-order execution
– Synchronization is obvious: restricted data flow
• At the higher level, parallel at larger granularity
– Pragmas in Java? Who would have thought!
– Dave Kuck’s CEDAR project, vintage 1985
– Synchronization is necessary: coarse-grain data flow
Can we do more?
• The run-time system – part of the chip design
– The chip knows the chip resources
– On-chip monitoring can supply information
– The run-time system can direct the use of those resources
• The Cloud – the other extreme, and today’s be-all
– How do we harness its capability?
– What is needed from the hierarchy to make it work?
My message
• Parallelism is a serious goal IF we want to solve the most challenging problems (cure cancer, predict tsunamis)
• Telling people to think parallel is nice, but often silly
• Examining the transformation hierarchy and seeing where we can leverage seems to me a sounder approach