Parallelism and Heterogeneity Scaling Laws

Page 2: Parallelism and Heterogenity Scaling Laws

Amdahl’s Law [1967, Gene Amdahl]

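Maximum speedup achievable on a multicore

Program phases: a serial phase (fraction 1 - F) and a parallel phase (fraction F)

Time on 1 core = $\frac{1-F}{1} + \frac{F}{1}$

Time on N cores = $\frac{1-F}{1} + \frac{F}{N}$

$$\text{Speedup} = \frac{1}{\frac{1-F}{1} + \frac{F}{N}}$$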

If F = 0.35@ 4 cores, speedup = 2@ cores, speedup = 31
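As a quick numerical check of the speedup formula on this slide, here is a minimal Python sketch. The function name `amdahl_speedup` and the sample values of F and N are illustrative choices, not taken from the slides.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's Law: speedup on n cores when a fraction f of the work is parallel."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial_time = 1.0 - f      # the serial phase is not sped up
    parallel_time = f / n      # the parallel phase is split across n cores
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Illustrative values (assumed, not from the slides).
    for f in (0.5, 0.9, 0.99):
        for n in (4, 64, 256):
            print(f"F={f:<4} N={n:<3} speedup={amdahl_speedup(f, n):6.2f}")
```

As N grows, the speedup approaches 1 / (1 - F), so the serial fraction sets the ceiling.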

Page 3: Parallelism and Heterogenity Scaling Laws

Strong scaling vs Weak scaling

Strong Scaling: If a new machine has K times more resources, how much does performance improve?

[Figure: speedup (0 to 80) vs. # of cores (1 to 256) for F = 0.5, 0.9, 0.99. Annotation: 99% parallel, 72x speedup.]

Weak Scaling: If a new machine has K times more resources, can we solve a bigger problem size?

Page 4: Parallelism and Heterogenity Scaling Laws

Amdahl’s Law for Multicores [Hill and Marty, 2008]

Multicore chip partitioned into
  multiple cores (each includes its L1 cache)
  uncore (Intel terminology for the shared L2 and L3 caches)

Resources per chip are bounded
  by area, power, $, or a combination
  a bound of N total resources per chip. How many cores? How big should each be?

[Figure: cores C0 to C3, each with its own L1$, connected to a shared cache and memory.]

Page 5: Parallelism and Heterogenity Scaling Laws

Core Types

Your favorite trick can be used to improve single-core performance using the same resources
  but this is becoming increasingly hard to do power-efficiently

Wimpy core: consumes 1 CU (CU: a measure of core resources); performance = 1

Hulk core: consumes R CUs; performance = Perf(R)

Page 6: Parallelism and Heterogenity Scaling Laws

Hulk Cores

If Perf(R) >= R, always use the hulk cores: they speed up everything.

Unfortunately, life isn’t easy: Perf(R) < R.

Assume Perf(R) = √R. Is this a reasonable assumption? Microprocessor examples seem to indicate so.

How to design a core for a specific Perf(R)? Basic idea: do many instructions in parallel.

Page 7: Parallelism and Heterogenity Scaling Laws

Multicores under consideration

Symmetric

Asymmetric

Morphing

Page 8: Parallelism and Heterogenity Scaling Laws

Symmetric Multicores

How many cores? How big is each core?

Chip is bounded to N CUs; each core has R CUs

Number of cores per chip = N/R

For example, let’s say N = 16: R = 1 (16 cores), R = 4 (4 cores), R = 16 (1 core)

Page 9: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore : Performance

Serial phase (1-F) runs on 1 thread on 1 core with performance Perf(R)
  Execution time = (1-F) / Perf(R)

Parallel phase uses all N/R cores, each core at Perf(R)
  Execution time = F / [Perf(R) * N/R]

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F \cdot R}{\text{Perf}(R) \cdot N}}$$

Serial phase term: helped by a higher Perf(R). Parallel phase term: helped by more cores!
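The symmetric-multicore speedup can be evaluated directly. This is a minimal sketch that assumes Perf(R) = √R, the performance model these slides adopt for Hulk cores. The names `perf` and `symmetric_speedup` are illustrative.

```python
import math


def perf(r: float) -> float:
    """Assumed single-core performance model: Perf(R) = sqrt(R)."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: float) -> float:
    """Speedup of an n-CU chip built from n/r identical cores of r CUs each."""
    serial_time = (1.0 - f) / perf(r)          # one core at Perf(R)
    parallel_time = f / (perf(r) * n / r)      # n/r cores, each at Perf(R)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    n = 16  # 16-CU chip, as in the example on the previous slide
    for r in (1, 2, 4, 8, 16):
        s = symmetric_speedup(0.9, n, r)
        print(f"R={r:<2} cores={n // r:<2} speedup(F=0.9)={s:.2f}")
```

With R = 1 the expression reduces to the classic Amdahl formula, since Perf(1) = 1 and all N cores are wimpy.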

Page 10: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore (Chip = 16 CUs)

Need lots of parallelism in the multicore world!

[Figure: speedup (0 to 16) vs. per-core CUs (R = 1, 2, 4, 8, 16; 16 cores at R = 1, 4 cores at R = 4, 1 core at R = 16) for F = 0.5.]

F = 0.5: R = 16 (1 core) is optimal!

Page 11: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore (Chip = 16 CUs)

More parallelism helps, but speedup is limited!

[Figure: speedup (0 to 16) vs. per-core CUs (R = 1, 2, 4, 8, 16; 16 cores at R = 1, 4 cores at R = 4, 1 core at R = 16) for F = 0.5 and 0.9.]

F = 0.9, R = 2: speedup 6.7x @ 8 cores

Page 12: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore (Chip = 16 CUs)

Applications with high F: significant performance loss with bigger cores

[Figure: speedup (0 to 16) vs. per-core CUs (R = 1, 2, 4, 8, 16; 16 cores at R = 1, 4 cores at R = 4, 1 core at R = 16) for F = 0.5, 0.9, 0.99, 0.999.]

Performance loss ∝ R / √R = √R

Page 13: Parallelism and Heterogenity Scaling Laws

Model bias towards parallelism

Remember: Perf(R) = √R when scaling up a CPU
  Let’s say the 1st-gen system = 1 CU

Now consider a 2nd-gen 4 CU system
  Four 1 CU cores or one 4 CU core?
  When F = 0.999, always pick four 1 CU cores (speedup ~4 vs. speedup = 2)

Even the parallel fraction is not perfectly parallel
  Synchronization, contention, locks, etc.
  Need a SW-Perf(R) (depends on the application)
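The two speedups quoted on this slide follow from the symmetric-multicore formula with N = 4, F = 0.999, and Perf(R) = √R. The arithmetic, worked out as a check:

```latex
\text{Four 1-CU cores } (R = 1):\quad
\frac{1}{\frac{1 - 0.999}{\sqrt{1}} + \frac{0.999 \cdot 1}{\sqrt{1} \cdot 4}}
= \frac{1}{0.001 + 0.24975} \approx 3.99

\text{One 4-CU core } (R = 4):\quad
\frac{1}{\frac{1 - 0.999}{\sqrt{4}} + \frac{0.999 \cdot 4}{\sqrt{4} \cdot 4}}
= \frac{1}{0.0005 + 0.4995} = 2
```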

Page 14: Parallelism and Heterogenity Scaling Laws

Multicore Moore’s Law

Since the 1970s: Technology Moore’s Law
  Double transistors every 2 years. Should possibly continue...

Microarchitect’s Moore’s Law
  Double single-thread performance every 2 years
  Stopped due to the power required

Multicore’s Moore’s Law
  2x cores every 2 years (1 in 2007, 8 in 2010)
  Need to double software threads every two years
  Need HW to enable 2x threads every two years

Page 15: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. per-core CUs (R = 1 to 256; 256 cores at R = 1, 16 cores at R = 16, 1 core at R = 256) for F = 0.99 and 0.999.]

F = 0.999: R = 1, 204x @ 256 cores. More wimpy cores

F = 0.99: R = 2 (vs R = 1 on the 16 CU chip), 80x @ 128 cores. More cores, lil hulk cores!

Page 16: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. per-core CUs (R = 1 to 256; 256 cores at R = 1, 16 cores at R = 16, 1 core at R = 256) for F = 0.5, 0.9, 0.99, 0.999.]

F = 0.999: R = 1, 204x @ 256 cores. More wimpy cores

F = 0.99: R = 2 (vs R = 1 on the 16 CU chip), 80x @ 128 cores. More cores, lil hulk cores!

F = 0.9: R = 32 (vs R = 2 on the 16 CU chip), 8 cores on both the 256 CU and 16 CU chips. Hulk cores

Page 17: Parallelism and Heterogenity Scaling Laws

Symmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. per-core CUs (R = 1 to 256; 256 cores at R = 1, 16 cores at R = 16, 1 core at R = 256) for F = 0.5, 0.9, 0.99, 0.999.]

F = 0.999: R = 1, 204x @ 256 cores. More wimpy cores

F = 0.99: R = 2 (vs R = 1 on the 16 CU chip), 80x @ 128 cores. More cores, lil hulk cores!

F = 0.9: R = 32 (vs R = 2 on the 16 CU chip), 8 cores on both the 256 CU and 16 CU chips. Hulk cores

With more CUs per chip, need hulk cores
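To see where the best core size falls on a 256-CU chip, one can sweep R and keep the value with the highest speedup. The sketch below assumes Perf(R) = √R and, like the formula itself, treats the core count N/R as a continuous quantity. The helper name `best_core_size` is hypothetical.

```python
import math


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Symmetric-multicore speedup, assuming Perf(R) = sqrt(R)."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def best_core_size(f: float, n: int) -> tuple[int, float]:
    """Return (R, speedup) for the integer core size R in 1..n with the highest speedup."""
    best_r = max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r))
    return best_r, symmetric_speedup(f, n, best_r)


if __name__ == "__main__":
    for f in (0.5, 0.9, 0.99, 0.999):
        r, s = best_core_size(f, 256)
        print(f"F={f:<5} best R={r:<3} cores~{256 / r:6.1f} speedup={s:6.1f}")
```

Because the sweep covers every integer R rather than only powers of two, the optimum it reports can fall between the tick marks on the plot's x-axis.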

Page 18: Parallelism and Heterogenity Scaling Laws

Cost-Effective Multicore Computing

Is Speedup(N cores) < N that bad?

It depends on the cost of adding cores: $$$, power
  Cost-ratio = Cost(N cores) / Cost(1 core)

If chip budget is the cost, Cost-ratio << N
  Much of the multicore cost is outside the cores [IEEE 1995]
  Caches, memory controller, SSD, etc.

If power is the cost, Cost-ratio can approach N

Multicore computing is cost-effective if Speedup > Cost-ratio
  Intel 6-core = $1600; AMD 10-core = $2000
  If the 10-core speedup over the 6-core is > 1.25x, then it is cost-effective
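The price example can be checked with a few lines of Python. This sketch reads the criterion as "speedup must exceed the cost ratio", which is what the 1.25x threshold in the example implies. The function names and the sample speedups are hypothetical.

```python
def cost_ratio(cost_n_cores: float, cost_baseline: float) -> float:
    """Cost of the larger system relative to the baseline system."""
    return cost_n_cores / cost_baseline


def is_cost_effective(speedup: float, ratio: float) -> bool:
    """Assumed criterion: the speedup must exceed the cost ratio."""
    return speedup > ratio


if __name__ == "__main__":
    # Prices quoted on the slide: $1600 (6-core) vs. $2000 (10-core).
    ratio = cost_ratio(2000.0, 1600.0)
    print(f"cost ratio = {ratio:.2f}")
    # Hypothetical speedups of the 10-core part over the 6-core part.
    for speedup in (1.1, 1.25, 1.5):
        verdict = is_cost_effective(speedup, ratio)
        print(f"speedup {speedup:.2f} -> cost-effective: {verdict}")
```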

Page 19: Parallelism and Heterogenity Scaling Laws

Multicores in Servers and Clients

Multicore parallelism pays off where the cost-ratio is low and applications have the parallelism (high F)

Clients (high F is hard)
  Smartphones just moved to dual-cores; how many cores?

Servers can use vast parallelism (MapReduce, data analysis)
  natural overlap across clients
  causing the move to cloud computing

Page 20: Parallelism and Heterogenity Scaling Laws

Asymmetric Multicores

Enhance some cores to improve performance for the serial phase.

Many designs are possible (in this talk, 1 Hulk core).

How to enhance a core? Coming up in the last third of the class.

Page 21: Parallelism and Heterogenity Scaling Laws

Asymmetric Multicores

Total chip resources = N CUs

Assume two types of cores on chip
  One Hulk core of R CUs, plus N-R wimpy cores of 1 CU each
  Total cores = N-R+1

Page 22: Parallelism and Heterogenity Scaling Laws

Asymmetric Cores : Performance

Program phases: serial (1-F) and parallel (F)

Serial phase = (1-F) / [K * Perf(R)]
Parallel phase = F / [K * Perf(R) + N - K*R]
  where K is the # of Hulk cores. In our case, K = 1

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{\text{Perf}(R) + N - R}}$$
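For the K = 1 case, the asymmetric speedup can be sketched as follows, assuming Perf(R) = √R. The function name is illustrative, and the sample (F, R) pairs are the ones annotated on the plot on the following slide.

```python
import math


def asymmetric_speedup(f: float, n: int, r: int) -> float:
    """Speedup with one Hulk core of r CUs plus (n - r) wimpy 1-CU cores.

    Assumes Perf(R) = sqrt(R) and a single Hulk core (K = 1).
    """
    hulk = math.sqrt(r)
    serial_time = (1.0 - f) / hulk           # serial phase runs on the Hulk core
    parallel_time = f / (hulk + (n - r))     # Hulk core plus all wimpy cores
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    n = 256
    for f, r in ((0.99, 41), (0.9, 118)):
        s = asymmetric_speedup(f, n, r)
        print(f"F={f} R={r} cores={n - r + 1} speedup={s:.1f}")
```

Setting R = 1 makes every core wimpy and recovers the classic Amdahl speedup on N cores.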

Page 23: Parallelism and Heterogenity Scaling Laws

Asymmetric Multicore (Chip = 256 CUs)

Asymmetric cores offer great potential
  with 1 Hulk core, speedup increases significantly
  helps take care of Amdahl’s law

[Figure: speedup (0 to about 230) vs. per-core CUs (R = 1 to 256; 256 cores at R = 1, 1 Hulk + 240 wimpy cores at R = 16, 1 core at R = 256) for F = 0.5, 0.9, 0.99, 0.999.]

F = 0.99: R = 41 (vs 3 for symmetric), 216 cores (vs. 85 cores), speedup = 166 (vs 80)

F = 0.9: R = 118 (vs 28 for symmetric), 139 cores (vs 9 cores), speedup = 65.6

Asymmetric cores provide bang for the buck

Page 24: Parallelism and Heterogenity Scaling Laws

Asymmetric Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 240) vs. # of Hulk cores (1, 2, 4, 8, 16) for F = 0.5, 0.9, 0.99, 0.999; 1 Hulk + 240 wimpy cores at the left, 4 Hulk + 192 wimpy cores in the middle, 16 Hulk cores at the right.]

Low parallelism: only Hulk cores!

Higher parallelism: more wimpy cores!

As F increases, always increase the number of wimpy cores!

Page 25: Parallelism and Heterogenity Scaling Laws

Asymmetric Multicores : Challenge

Task Management : How to schedule computation?

Locality : How to keep data close to task?

Coordinate Tasks : How to synchronize data?

Page 26: Parallelism and Heterogenity Scaling Laws

Morphing Multicores

Chip consists of N 1 CU cores
  efficient for the parallel phase

At runtime, glue R 1 CU cores together to create one R CU core
  improves performance for the serial phase

How to dynamically glue cores?
  Not the focus; needs future research

Advantage: can harness all cores on the chip
  Core optimized

Page 27: Parallelism and Heterogenity Scaling Laws

Morphing Multicores : Performance

N 1 CU cores, of which R are glued together

Serial phase uses the R CU core at Perf(R)
  execution time = (1-F) / Perf(R)

Parallel phase uses all N cores
  execution time = F / N

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{N}}$$
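A minimal sketch of the morphing-multicore speedup, following the speedup formula on this slide: the serial phase runs on the glued core at Perf(R), and the parallel phase runs on all N 1-CU cores. Perf(R) = √R is assumed, and the function name is illustrative.

```python
import math


def morphing_speedup(f: float, n: int, r: int) -> float:
    """Speedup when r of the n 1-CU cores are glued into one core for the serial phase.

    Assumes Perf(R) = sqrt(R).
    """
    serial_time = (1.0 - f) / math.sqrt(r)   # glued r-CU core at Perf(R)
    parallel_time = f / n                    # all n 1-CU cores
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    n = 256
    for r in (1, 16, 256):
        print(f"F=0.99 R={r:<3} speedup={morphing_speedup(0.99, n, r):.1f}")
```

Since gluing costs nothing in the parallel phase under this model, a larger R never hurts, which is why the best choice is to glue the whole chip for serial work.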

Page 28: Parallelism and Heterogenity Scaling Laws

Morphing Multicore (Chip = 256 CUs)

[Figure: speedup (0 to 256) vs. Hulk-core CUs (R = 1 to 256) for F = 0.5, 0.9, 0.99, 0.999.]

Morphing multicores are awesome!
  Especially at higher chip resource levels

How to glue?

Page 29: Parallelism and Heterogenity Scaling Laws

Multicore Amdahl’s Law

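Symmetric

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F \cdot R}{\text{Perf}(R) \cdot N}}$$

Asymmetric

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{\text{Perf}(R) + N - R}}$$

Morphing

$$\text{Speedup} = \frac{1}{\frac{1-F}{\text{Perf}(R)} + \frac{F}{N}}$$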
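The three formulas can be compared side by side by picking the best R for each design. This is a sketch under the Perf(R) = √R assumption, with one Hulk core in the asymmetric case. The function names and the `best` helper are hypothetical.

```python
import math


def perf(r: int) -> float:
    """Assumed core performance model: Perf(R) = sqrt(R)."""
    return math.sqrt(r)


def symmetric(f: float, n: int, r: int) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))


def asymmetric(f: float, n: int, r: int) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))


def morphing(f: float, n: int, r: int) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f / n)


def best(model, f: float, n: int) -> tuple[int, float]:
    """Best integer R in 1..n for the given model, with its speedup."""
    r = max(range(1, n + 1), key=lambda size: model(f, n, size))
    return r, model(f, n, r)


if __name__ == "__main__":
    n = 256
    for f in (0.5, 0.9, 0.99, 0.999):
        row = [best(m, f, n) for m in (symmetric, asymmetric, morphing)]
        print(f"F={f:<5} " + "  ".join(f"R={r:<3} {s:6.1f}x" for r, s in row))
```

For any fixed R, the parallel-phase throughput satisfies N >= Perf(R) + N - R >= Perf(R) * N / R whenever 1 <= R <= N, so in this model morphing is never worse than asymmetric, which is never worse than symmetric.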

Page 30: Parallelism and Heterogenity Scaling Laws

Challenges (1/2)

Serial fraction (1-F) has fine-grain parallelism

Parallel fraction (F) has serialization overheads
  You will learn about these in the next 2-3 weeks.

Software challenges for asymmetric and dynamic multicores

How much parallelism in future software?

Page 31: Parallelism and Heterogenity Scaling Laws

Challenges (2/2)

Parallelism all the time?

Amdahl’s Law affects the serial fraction? Need to increase core speed.

Lots of walls: power, area, shared caches
  How to scale CPU performance?