Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)

Random thoughts on Parallelism

• Why the sudden preoccupation with parallelism?

• The Silliness (or what I call Meganonsense)
  – Break the problem, use half the energy
  – 1000 mickey mouse cores
  – Hardware is sequential
  – Server throughput (how many pins?)
  – What about GPUs and Data Base?

• Current bugs to exploiting parallelism (or are they?)
  – Dark silicon
  – Amdahl’s Law
  – The Cloud

• The answer
  – The fundamental concept vis-à-vis parallelism
  – What it means re: the transformation hierarchy






It starts with the raw material (Moore’s Law)

• The first microprocessor (Intel 4004), 1971
  – 2300 transistors
  – 106 KHz

• The Pentium chip, 1992
  – 3.1 million transistors
  – 66 MHz

• Today
  – more than one billion transistors
  – Frequencies in excess of 5 GHz

• Tomorrow ?
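As a rough check on the growth rate implied by the first two data points above, the average doubling period can be computed directly. This is only a back-of-the-envelope sketch; the function name `doubling_period_years` is made up for illustration, and the inputs are the transistor counts and years quoted in the bullets.

```python
import math


def doubling_period_years(count_start, year_start, count_end, year_end):
    """Average number of years per doubling between two data points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings


# Figures as quoted on the slide: Intel 4004 (1971) and Pentium (1992).
period = doubling_period_years(2300, 1971, 3.1e6, 1992)
print(f"about {period:.1f} years per doubling")
```

With these two points the answer comes out at roughly two years per doubling, which is the usual statement of Moore's Law.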

And what we have done with this raw material

[Figure: number of transistors (y-axis) over time (x-axis), with regions labeled Cache and Microprocessor]

Too many people do not realize: Parallelism did not start with Multi-core

• Pipelining

• Out-of-order Execution

• Multiple operations in a single microinstruction

• VLIW (horizontal microcode exposed to the software)






One thousand mickey mouse cores

• Why not a million? Why not ten million?

• Let’s start with 16
  – What if we could replace 4 with one more powerful core?

• …and we learned:
  – One more powerful core is not enough
  – Sometimes we need several
  – Morphcore was born
  – BUT not all morphcore (fixed function vs flexibility)

The Asymmetric Chip Multiprocessor (ACMP)

[Figure: three chip layouts.
 ACMP Approach: one large core plus 12 Niagara-like cores.
 “Niagara” Approach: 16 Niagara-like cores.
 “Tile-Large” Approach: 4 large cores.]

Large core vs. Small Core

Large Core
• Out-of-order
• Wide fetch, e.g. 4-wide
• Deeper pipeline
• Aggressive branch predictor (e.g. hybrid)
• Many functional units
• Trace cache
• Memory dependence speculation

Small Core
• In-order
• Narrow fetch, e.g. 2-wide
• Shallow pipeline
• Simple branch predictor (e.g. Gshare)
• Few functional units

Throughput vs. Serial Performance

[Figure: speedup vs. 1 large core (y-axis, 0 to 9) as a function of degree of parallelism (x-axis, 0 to 1), with curves for Niagara, Tile-Large, and ACMP]
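The trade-off between the three layouts can be approximated with a simple Amdahl-style model. The sketch below is not the model behind the slide's plot. It assumes that a small core delivers half the performance of a large core (`SMALL_PERF`, a made-up value), that the serial part runs on the fastest core available, and that in the ACMP the large core also helps with the parallel part.

```python
def speedup(parallel_fraction, serial_perf, parallel_perf):
    """Speedup relative to one large core (performance 1.0)."""
    serial_time = (1.0 - parallel_fraction) / serial_perf
    parallel_time = parallel_fraction / parallel_perf
    return 1.0 / (serial_time + parallel_time)


SMALL_PERF = 0.5  # ASSUMPTION: one small core = half of one large core

# name -> (performance on serial code, aggregate performance on parallel code)
CONFIGS = {
    "Niagara": (SMALL_PERF, 16 * SMALL_PERF),     # 16 small cores
    "Tile-Large": (1.0, 4 * 1.0),                 # 4 large cores
    "ACMP": (1.0, 1.0 + 12 * SMALL_PERF),         # 1 large + 12 small cores
}


def speedup_table(fractions=(0.0, 0.5, 0.9, 1.0)):
    """Speedup of every layout at each degree of parallelism."""
    return {
        name: [speedup(f, serial, parallel) for f in fractions]
        for name, (serial, parallel) in CONFIGS.items()
    }


for name, row in speedup_table().items():
    print(f"{name:10s}", [round(s, 2) for s in row])
```

Under these assumptions the all-small-core layout wins only when nearly everything is parallel, the all-large-core layout never beats the ACMP, and the ACMP is ahead at a degree of parallelism of 0.9.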

Server throughput

• The Good News: Not a software problem
  – Each core runs its own problem

• The Bad News: How many pins?
  – Memory bandwidth

• More Bad News: How much energy?
  – Each core runs its own problem

What about GPUs and Data Base

• In theory, absolutely!

• GPUs (SMT + SIMD + Predication)
  – Provided there are no conditional branches (Divergence)
  – Provided memory accesses line up nicely (Coalescing)

• Data Bases
  – Provided there are no critical sections
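The two GPU provisos can be made concrete with a toy cost model. This is a deliberately simplified sketch, not a description of any particular GPU: the group width `WARP_SIZE` and transaction size `SEGMENT_BYTES` are assumed values, and both sides of a divergent branch are assumed to take equally long.

```python
WARP_SIZE = 32        # ASSUMPTION: threads that execute in lockstep
SEGMENT_BYTES = 128   # ASSUMPTION: bytes delivered per memory transaction


def memory_transactions(addresses):
    """Number of distinct memory segments touched by one group-wide load."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def lane_utilization(takes_branch):
    """Fraction of lane-slots doing useful work on an if/else.

    With predication, each side of the branch that has at least one
    taker costs one full pass over all lanes.
    """
    taken = sum(1 for t in takes_branch if t)
    not_taken = len(takes_branch) - taken
    passes = (1 if taken else 0) + (1 if not_taken else 0)
    return 1.0 / passes


# Coalesced: consecutive 4-byte words. Strided: one word per segment.
coalesced = [4 * i for i in range(WARP_SIZE)]
strided = [SEGMENT_BYTES * i for i in range(WARP_SIZE)]
print(memory_transactions(coalesced), memory_transactions(strided))

# Uniform branch vs. a branch that splits the group in half.
uniform = [True] * WARP_SIZE
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]
print(lane_utilization(uniform), lane_utilization(divergent))
```

In this model the strided access pattern needs one transaction per thread instead of one per group, and a divergent branch halves the useful work per pass.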






Dark Silicon

• Too many transistors: we cannot power them all
  – All those cores powered down
  – All that parallelism wasted

• Not really: The Refrigerator! (aka: Accelerators)
  – Fork (in parallel)
  – Although not all at the same time!

Amdahl’s Law

• The serial bottleneck always limits performance

• Heterogeneous cores AND control over them can minimize the effect
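For reference, the usual statement of Amdahl's Law follows. The symbols are introduced here and do not appear on the slide: $f$ is the fraction of the work that can be parallelized, $n$ is the number of identical cores, and $S$ is the speedup over one core.

```latex
S(n) = \frac{1}{(1 - f) + \dfrac{f}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - f}

% One simple way to model heterogeneous cores (an assumption, not the slide's):
% p_s = performance of the core that runs the serial part,
% p_p = aggregate performance available to the parallel part.
S = \frac{1}{\dfrac{1 - f}{p_s} + \dfrac{f}{p_p}}
```

The limit shows why the serial part dominates: with $f = 0.9$, no number of cores gives more than a 10x speedup. The second form shows where heterogeneity helps, since raising $p_s$ shrinks the serial term directly.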

The Cloud

• It is behind the curtain: how do we manage it?

• Answer: the on-chip run-time system

• Answer: Pragmas beyond the Cloud






The fundamental concept: Synchronization

[Figure: the transformation hierarchy, top to bottom: Problem, Algorithm, Program, ISA (Instruction Set Arch), Microarchitecture, Circuits, Electrons]

At every layer we synchronize

• Algorithm: task dependencies

• ISA: sequential control flow (implicit)

• Microarchitecture: ready bits

• Circuit : clock cycle (implicit)
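The "ready bits" idea at the microarchitecture layer can be illustrated with a toy dataflow scheduler: an operation issues in the first cycle in which all of its source values are marked ready. This is a minimal sketch under stated assumptions (every operation takes one cycle, there are unlimited functional units, and each value name is written exactly once); the function `dataflow_schedule` and the sample program are hypothetical.

```python
def dataflow_schedule(instructions, initially_ready):
    """Group instructions into cycles, issuing each when its sources are ready.

    instructions: list of (dest, [sources]) in program order.
    Returns a list of cycles; each cycle is a list of dest names issued.
    """
    ready = set(initially_ready)   # the "ready bits", one per value name
    pending = list(instructions)
    cycles = []
    while pending:
        fire = [ins for ins in pending if all(src in ready for src in ins[1])]
        if not fire:
            raise ValueError("deadlock: some source value is never produced")
        cycles.append([dest for dest, _ in fire])
        # Results become visible to consumers at the end of the cycle.
        ready.update(dest for dest, _ in fire)
        pending = [ins for ins in pending if ins not in fire]
    return cycles


program = [
    ("t1", ["a", "b"]),
    ("t2", ["t1", "c"]),   # waits for t1
    ("t3", ["a", "c"]),    # independent of t1 and t2
    ("t4", ["t2", "t3"]),  # waits for both
]
schedule = dataflow_schedule(program, {"a", "b", "c"})
print(schedule)
```

The program is written sequentially, yet `t3` issues alongside `t1` because nothing it needs is outstanding. The only synchronization is the check of the ready bits.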

Who understands this?

• Should this be part of students’ parallelism education?

• Where should it come in the curriculum?

• Can students even understand these different layers?

Parallel to Sequential to Parallel

• Guri says: think sequential, execute parallel
  – i.e. don’t throw away 60 years of computing experience
  – The original HPS model of out-of-order execution
  – Synchronization is obvious: restricted data flow

• At the higher level, parallel at larger granularity
  – Pragmas in Java? Who would have thought!
  – Dave Kuck’s CEDAR project, vintage 1985
  – Synchronization is necessary: coarse-grain data flow

Can we do more?

• The run-time system – part of the chip design
  – The chip knows the chip resources
  – On-chip monitoring can supply information
  – The run-time system can direct the use of those resources

• The Cloud – the other extreme, and today’s be-all
  – How do we harness its capability?
  – What is needed from the hierarchy to make it work?

My message

• Parallelism is a serious goal IF we want to solve the most challenging problems (Cure cancer, predict tsunamis)

• Telling people to think parallel is nice, but often silly

• Examining the transformation hierarchy and seeing where we can leverage seems to me a sounder approach