Multicore Programming Challenges
Michael Perrone, IBM Master Inventor, Mgr., Multicore Computing Dept.
IBM Research © 2009


Page 1:

Multicore Programming Challenges

Michael Perrone, IBM Master Inventor, Mgr., Multicore Computing Dept.

Page 2:

Multicore Performance Challenge

[Figure: Performance vs. # of Cores]

Page 3:

Take Home Messages

“Who needs 100 cores to run MS Word?” – Dave Patterson, Berkeley

• Performance is critical and it's not free!

• Data movement is critical to performance!

Which curve are you on?

[Figure: Performance vs. # of Cores]

Page 4:

Outline

• What’s happening?

• Why is it happening?

• What are the implications?

• What can we do about it?

Page 5:

What’s happening?

• Industry shift to multicore
– Intel, IBM, AMD, Sun, nVidia, Cray, etc.

• Increasing
– # of cores
– Heterogeneity (e.g., Cell processor, system level)

• Decreasing
– Core complexity (e.g., Cell processor, GPUs); decreasing since the single-core Pentium 4
– Bytes per FLOP

[Diagram: single core → homogeneous multicore → heterogeneous multicore]

Page 6:

Heterogeneity: Amdahl’s Law for Multicore

[Figure: Amdahl's Law speedup vs. number of cores for unicore, homogeneous multicore, and heterogeneous multicore designs, split into serial and parallel fractions]

Heterogeneity helps even for square-root performance growth (Hill & Marty, 2008).

Loophole: Have cores work in concert on serial code…
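
For reference, a sketch of the Hill & Marty (2008) model behind this kind of figure (my rendering, using their baseline assumption that a core built from r base-core-equivalent units delivers perf(r) = √r; f is the parallel fraction and n the total resource budget):

```latex
% Symmetric (homogeneous): n/r identical cores of r BCEs each.
\[
  S_{\mathrm{sym}}(f,n,r) =
    \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f\,r}{\mathrm{perf}(r)\,n}}
\]
% Asymmetric (heterogeneous): one big r-BCE core plus (n - r) small cores;
% serial code runs on the big core, parallel code on all cores.
\[
  S_{\mathrm{asym}}(f,n,r) =
    \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]
```

The asymmetric curve dominates because the big core attacks the serial term without giving up much parallel throughput.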

Page 7:

Good & Bad News

GOOD NEWS

Multicore programming is parallel programming

BAD NEWS

Multicore programming is parallel programming

Page 8:

Many Levels of Parallelism

• Node

• Socket

• Chip

• Core

• Thread

• Register/SIMD

• Multiple instruction pipelines

• Need to be aware of all of them!

Page 9:

Additional System Types

[Diagrams: four ways of attaching accelerators, reconstructed from the figure labels]

• Homogeneous bus attached: multiple multicore CPUs sharing a system bus and main memory
• Heterogeneous bus attached: a Power core plus accelerators on the system bus, with a bridge to main memory and an on-chip I/O bus
• IO bus attached: a multicore CPU whose bridge connects over PCIe to accelerators with their own memory
• Network attached: a multicore CPU system connected through NICs (InfiniBand or Ethernet) to a remote accelerator system with its own bus, bridge, and memory

Page 10:

Multicore Programming Challenge

[Quadrant chart: Performance (lower → higher) vs. Programmability (harder → easier); regions labeled “Interesting research!”, “'Lazy' Programming”, “Nirvana” (high performance and easy programming), and “Danger Zone!”; arrow: better tools, better programming]

Page 11:

Outline

• What’s happening?

• Why is it happening?

– HW Challenges

– BW Challenges

• What are the implications?

• What can we do about it?

Page 12:

Power Density – The fundamental problem

[Plot: power density (W/cm², log scale 1–1000) vs. process node (1.5µ down to 0.07µ) for i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III; the trend passes “hot plate” and heads toward “nuclear reactor” levels]

Source: Fred Pollack, Intel. “New Microprocessor Challenges in the Coming Generations of CMOS Technologies,” Micro32.

Page 13:

What’s causing the problem?

[Image: gate-stack cross-section; oxide thickness Tox ≈ 11 Å]

The gate dielectric is approaching a fundamental limit (a few atomic layers).

[Plot: power density (W/cm², log scale 0.001–1000) vs. gate length (1 µm down to 0.01 µm), 1994–2004; passive (leakage) power catches up with active power near the 65 nm node]

Page 14:

Microprocessor Clock Speed Trends

[Plot: clock frequency (MHz, log scale 10²–10⁴) vs. year (1990–2010)]

Managing power dissipation is limiting clock speed increases.

Page 15:

Intuition: Power vs. Performance Trade Off

[Plot: relative power vs. relative performance on a single core; power rises much faster than performance (marked values include 0.7, 1, 1.3, 1.4, 1.6, 1.8, and 5)]
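
A hedged back-of-the-envelope version of that intuition (a standard CMOS approximation, not taken from the slide): dynamic power scales roughly with frequency times voltage squared, and voltage tracks frequency, so power grows roughly as the cube of single-core performance.

```latex
% Rough CMOS dynamic-power model: P ~ C V^2 f, with V ~ f, so P ~ f^3.
\[
  P \;\propto\; C\,V^{2} f \;\approx\; k\,f^{3}
\]
% Example: two cores at 0.8x frequency vs. one core at 1.0x:
%   power:     2 x 0.8^3 = 1.02  (about the same budget)
%   peak perf: 2 x 0.8   = 1.6   (60% more, if the work parallelizes)
```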

Page 16:

Outline

• What’s happening?

• Why is it happening?

– HW Challenges

– BW Challenges

• What are the implications?

• What can we do about it?

Page 17:

The Hungry Beast

Processor (“beast”) ← Data Pipe ← Data (“food”)

• Pipe too small = starved beast
• Pipe big enough = well-fed beast
• Pipe too big = wasted resources

Page 18:

The Hungry Beast

Processor (“beast”) ← Data Pipe ← Data (“food”)

• Pipe too small = starved beast
• Pipe big enough = well-fed beast
• Pipe too big = wasted resources

If FLOPS grow faster than pipe capacity… the beast gets hungrier!

Page 19:

Move the food closer: cache

[Diagram: a cache sits between the processor and the data]

Load more food while the beast eats.

Page 20:

What happens if the beast is still hungry?

If the data set doesn't fit in cache:
– Cache misses
– Memory latency exposed
– Performance degraded

Several important application classes don't fit:
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads

Page 21:

Make the food bowl larger: cache size is steadily increasing

Implications:
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS

Page 22:

Make the food bowl larger: cache size is steadily increasing

Implications:
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS

But…
– Important application working sets are growing faster
– Multicore is even more demanding on cache than unicore

Page 23:

The beast is hungry!

Data pipe not growing fast enough!

Page 24:

The beast had babies

• Multicore makes the data problem worse!

– Efficient data movement is critical

– Latency hiding is critical

Page 25:

GOAL: The proper care and feeding of hungry beasts

Page 26:

Outline

• What’s happening?

• Why is it happening?

• What are the implications?

• What can we do about it?

Page 27:

Example: The Cell/B.E. Processor

Page 28:

Feeding the Cell Processor

• 8 SPEs, each with:
– LS (local store)
– MFC (memory flow controller)
– SXU (execution unit)

• PPE (64-bit Power Architecture with VMX), handling:
– OS functions
– Disk IO
– Network IO

[Block diagram: PPE (PPU with L1, and L2 at 32B/cycle) and eight SPEs (each SPU/SXU with LS and MFC, 16B/cycle ports) on the EIB (up to 96B/cycle); MIC to dual XDR memory at 16B/cycle; BIC to FlexIO at 16B/cycle (2x)]

Page 29:

Cell Approach: Feed the beast more efficiently

Explicitly “orchestrate” the data flow:

• Enables detailed programmer control of data flow
– Get/put data when & where you want it
– Hides latency: simultaneous reads, writes & computes

• Avoids restrictive HW cache management, which is unlikely to determine the optimal data flow and can be very inefficient

• Allows more efficient use of the existing bandwidth

BOTTOM LINE:

It’s all about the data!
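
To make “orchestrate the data flow” concrete, here is a minimal double-buffering sketch in the style of the Cell SDK SPE intrinsics (mfc_get and the tag-status calls are real SDK functions; the chunk size and the process() kernel are hypothetical placeholders):

```c
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 16384   /* bytes per DMA; illustrative, multiple of 128 */

extern void process(char *data, int len);   /* hypothetical compute kernel */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

/* Stream n chunks from effective address ea: DMA chunk i+1 into one
   buffer while computing on chunk i in the other. */
void stream(uint64_t ea, int n)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);   /* prefetch first chunk */
    for (int i = 0; i < n; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < n)   /* kick off the next transfer before computing */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);          /* wait for current DMA  */
        mfc_read_tag_status_all();
        process((char *)buf[cur], CHUNK);      /* overlaps the next DMA */
        cur = nxt;
    }
}
```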

Page 30:

Lessons Learned: Cell Processor

• Core simplicity impacted algorithmic design
– Increased predictability
– Avoid recursion & branches
– Simpler code is better code (e.g., bubble vs. comb sort)

• Heterogeneity
– Serial core must balance parallel cores well

• Programmability suffered
– Forced to address data flow directly
– Led to better algorithms & performance portability

Page 31:

What are the implications?

• Computational Complexity

• Parallel programming

• Communication

• Synchronization

• Collecting metadata

• Merging Operations

• Grouping Operations

• Memory Layout

• Memory Conflicts

• Debugging

Some general, some Cell-specific.

Page 32:

Computational complexity is inadequate

• Focus on computes: O(N), O(N²), O(ln N), etc.

• Ignores BW analysis
– Memory flows are now the bottlenecks
– Memory hierarchies are critical to performance
– Need to incorporate memory into the picture

• Need “data complexity”
– Necessarily HW dependent
– Calculate the data movement (tracking where the data come from) and divide by BW to get the data-transfer time
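
One hedged way to make “data complexity” concrete (my formulation, in roofline style, not the slide's): bound the run time by the slower of compute and data movement.

```latex
% F flops at peak rate P; Q bytes moved at bandwidth B.
\[
  T \;\gtrsim\; \max\!\left(\frac{F}{P},\; \frac{Q}{B}\right),
  \qquad \text{BW-bound when } \frac{F}{Q} < \frac{P}{B}.
\]
```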

Page 33:

Don’t apply computational complexity blindly

O(N) isn't always better than O(N²)

[Plot: run time vs. N; for small N (“you are here”) the O(N²) curve lies below the O(N) curve]

More cores can lead to smaller N per core…

Page 34:

Where is your data?

[Plot: run time vs. N (“locality”); run time steps up as the data spills from L1 cache to L2, L3, disk, and tape]

Put your data where you want it, when you want it!

Localize your data!

Page 35:

Example: Compression

• Compress to reduce data flow

• Increases the slope of the O(N) compute term (extra compression work)

• But reduces run time (less data read & written)

[Diagrams: read/compute/write timeline with and without compression, and run time vs. N for both versions]

Page 36:

Implication: Communication Overhead

• BW can swamp compute

• Minimize communication


Page 37:

Implication: Communication Overhead

• Modify partitioning to reduce communications

• Trade off with synchronization

[Diagram: two partitionings of the same domain with total boundary (communication) length 9L vs. 4L]

Page 38:

Implications: Synchronization Overhead

[Diagram: thread timelines idling at a barrier; the idle time is synchronization overhead]

Page 39:

Implications: Synchronization – Load Balancing

• Modify data partitioning to balance workloads

[Diagram: uniform vs. adaptive partitioning of the same data]

Page 40:

Implications: Synchronization – Nondeterminism


Page 41:

Implications: Synchronization – Nondeterminism

[Plot: probability vs. run time; a deterministic run is a narrow spike, the average nondeterministic run is a spread distribution, and the max over N threads sits further right]

Page 42:

Implications: Metadata - Parallel sort example

• Collect a histogram in the first pass

• Use the histogram to parallelize the second pass

[Diagram: unsorted data → metadata (histogram) → sorted data]
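
A minimal serial sketch of the two-pass idea (a counting partition on byte keys; in the parallel version each core would scatter into, and then sort, its own range of buckets):

```c
/* Pass 1 builds a histogram of the keys; a prefix sum turns it into
   per-bucket start offsets; pass 2 scatters each key to its bucket. */
void histogram_partition(const unsigned char *in, unsigned char *out, int n)
{
    int count[256] = {0}, offset[256], sum = 0;

    for (int i = 0; i < n; i++) count[in[i]]++;   /* pass 1: histogram  */
    for (int b = 0; b < 256; b++) {               /* metadata: offsets  */
        offset[b] = sum;
        sum += count[b];
    }
    for (int i = 0; i < n; i++)                   /* pass 2: scatter    */
        out[offset[in[i]]++] = in[i];
}
```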

Page 43:

Implications: Merge Operations – FFT Example

[Diagram: 2D FFT buffers – input image, tile, transposed tile, transposed buffer, transposed image]

• Naive:
– 1D FFT (x axis)
– Transpose
– 1D FFT (y axis)
– Transpose

• Improved (merge steps):
– FFT/Transpose (x axis)
– FFT/Transpose (y axis)

• Avoid unnecessary data movement

Page 44:

Implications: Restructure to Avoid Data Movement

[Diagram: the naive order repeatedly alternates Compute A → Transform A to B → Compute B → Transform B to A; the restructured order groups all the Compute A steps, transforms A to B once, then runs all the Compute B steps]
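
A schematic sketch of that restructuring (all function names hypothetical; the grouping is only valid when the repeated A-steps and B-steps are independent of one another):

```c
extern void compute_A(double *a);
extern void compute_B(double *b);
extern void transform_A_to_B(const double *a, double *b);
extern void transform_B_to_A(const double *b, double *a);

/* Naive: two transforms paid on every step. */
void naive(double *a, double *b, int steps)
{
    for (int t = 0; t < steps; t++) {
        compute_A(a);
        transform_A_to_B(a, b);
        compute_B(b);
        transform_B_to_A(b, a);
    }
}

/* Restructured: group same-representation work, transform once each way. */
void restructured(double *a, double *b, int steps)
{
    for (int t = 0; t < steps; t++) compute_A(a);
    transform_A_to_B(a, b);
    for (int t = 0; t < steps; t++) compute_B(b);
    transform_B_to_A(b, a);
}
```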

Page 45:

Implications: Streaming Data & Finite Automata

[Diagram: the DFA is replicated; each copy scans an overlapping chunk of the data stream]

• Replicate & overlap

Enables loop unrolling & software pipelining

Page 46:

Implications: Streaming Data – NID Example

• Find (lots of) substrings in a (long) string
• Build a graph of the words & represent it as a DFA
• Sample word list: “the”, “that”, “math”

Page 47:

Implications: Streaming Data – NID Example

• Random access to a large state transition table (STT)

Page 48:

Implications: Streaming Data – Hiding Latency

Page 49:

Implications: Streaming Data – Hiding Latency

Enables loop unrolling & software pipelining

Page 50:

Roofline Model (S. Williams)

[Plot: processing rate vs. data locality (low → high); runs start out latency-bound and rise toward the compute-bound ceiling; software pipelining lifts the latency-bound region]
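
For reference, the usual statement of Williams' roofline bound, with operational intensity I (flops per byte) standing in for the slide's “data locality” axis:

```latex
\[
  \text{Attainable FLOP/s} \;=\;
    \min\bigl(\text{peak compute},\; \text{peak memory bandwidth}\times I\bigr)
\]
```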

Page 51:

Implications: Group Like Operations – Tokenization Ex.

• Intuitive:
– Get data (serial)
– State transition (serial)
– Action (branchy & nondeterministic)
– Repeat

[Diagram: Data → DFA → Action]

Page 52:

Implications: Group Like Operations – Tokenization Ex.

Better:
– Get data (serial)
– State transition (serial)
– Add action to a list (serial)
– Repeat
– Process the action lists (serial)

[Diagram: Data → DFA → actions appended to action lists 1–3]

Benefits (see the sketch below):
• Loop unrolling
• SIMD
• Load balance
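
A hedged sketch of the deferral (types, table sizes, and handlers are hypothetical): the scan loop runs only the DFA transition and an append per byte; the branchy, nondeterministic action handlers run afterwards in same-kind batches.

```c
#define MAX_TOKENS 65536
#define N_STATES   256
#define N_ACTIONS  2

struct pending { int pos; };   /* where in the stream the action fired */

extern const signed char action_of[N_STATES];   /* -1 = no action       */
extern void run_action_batch(int action, const struct pending *l, int n);

static struct pending lists[N_ACTIONS][MAX_TOKENS];
static int            lens[N_ACTIONS];

void scan(const unsigned char *data, int n,
          const unsigned char stt[N_STATES][256])
{
    unsigned char s = 0;
    for (int i = 0; i < n; i++) {         /* tight, branch-light loop   */
        s = stt[s][data[i]];
        int a = action_of[s];
        if (a >= 0)
            lists[a][lens[a]++] = (struct pending){ i };
    }
    for (int a = 0; a < N_ACTIONS; a++)   /* batched, unrollable, SIMD- */
        run_action_batch(a, lists[a], lens[a]);  /* friendly handlers   */
}
```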

Page 53:

Implications: Convert BW to Compute Bound – NN Example

• Neural net function F(X): RBF, MLP, KNN, etc.
• N basis functions: dot product + nonlinearity
• D input dimensions
• D×N matrix of parameters
• If the parameters are too big for cache, the computation is BW bound

[Diagram: input X → F → output]

Page 54:

Implications: Convert BW to Compute Bound – NN Example

• Split the function over multiple SPEs
– Avoids unnecessary memory traffic
– Reduces compute time per SPE
– Minimal merge overhead

[Diagram: partial results from each SPE are merged]

Page 55:

Implications: Pay Attention to Memory Hierarchy

Register file → L1 → L2 → main memory:
– BW: high → low
– Latency: low → high
– Size: small → larger

Page 56:

Implications: Pay Attention to Memory Hierarchy

• Data eviction rate
• Optimal tiling
• Shared memory space can impact load balancing

[Diagrams: core/cache topologies – private L1s with private L2s, private L1s with a shared L3, and private L1s with a shared L2]

Page 57:

Implications: Memory Hierarchy & Tiling

[Diagram: blocked matrix multiply, C = A × B]

Optimal tiling depends on cache size
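
A minimal cache-tiling sketch (generic C, not from the slide; TILE would be tuned to the cache in question):

```c
#define N    1024
#define TILE 64   /* tune: three TILE x TILE blocks should fit in cache */

/* C += A * B in TILE x TILE blocks, so each block of A, B, and C is
   reused many times while it is resident in cache. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```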

Page 58:

Implications: Data Re-Use – FFT Revisited

[Diagram: N×N data envelope; stride-1 access walks along cachelines, while large-stride access touches a new cacheline per element]

• Long strides thrash the cache

• Use full cachelines where possible

Page 59:

Implications: Handle Race Conditions (Debugging)

• “Heisenberg uncertainty principle”
– Instrumenting the code changes its behavior
– Hard to maintain exact timing

[Diagram: threads 1 and 2 both write the data and thread 1 reads it; depending on timing the read sees the good value, the bad value, or something undefined]

Page 60:

Implications: More Cores – More Memory Conflicts

• Avoid bank conflicts
– Plan the data layout
– Avoid strides that are multiples of the number of banks
– Randomize start points
– Make critical data sizes and the number of threads relatively prime

[Diagram: 8 threads over memory banks 1–8; a conflict-free layout spreads the threads across banks, while aligned accesses pile onto one bank and create a hot spot]
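
One common fix, sketched generically (the sizes and bank count are illustrative, not from the slide): pad a power-of-two row length so strided accesses stop mapping to the same bank.

```c
#define ROWS 1024
#define COLS 1024   /* power of two: column walks revisit the same bank */
#define PAD  1      /* one extra element skews successive rows          */

static float a[ROWS][COLS + PAD];   /* column stride is now odd, so     */
                                    /* consecutive rows hit different   */
                                    /* banks instead of one hot spot    */
float sum_column(int j)
{
    float s = 0.0f;
    for (int i = 0; i < ROWS; i++)
        s += a[i][j];               /* strided but conflict-free        */
    return s;
}
```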

Page 61:

Implications: Reduce Data Movement

Convolve the data with a Green's function: ∑_{i,j} D(x+i, y+j) G(x, y, i, j)

• A new G at each (x, y)
• Radial symmetry of G reduces BW requirements

[Diagram: data grid with the Green's function centered at (X, Y)]

Page 62:

Implications: Reduce Data Movement

[Diagram: the data grid partitioned across SPE 0 – SPE 7]

Page 63:

Implications: Reduce Data Movement

[Diagram: the same data grid partitioned across SPE 0 – SPE 7]

Page 64:

Implications: Reduce Data Movement

For each X:
– Load the next column of data
– Load the next column of indices
– For each Y:
• Load Green's functions
• SIMDize Green's functions
• Compute the convolution at (X, Y)
– Cycle buffers

[Diagram: H × (2R+1) data buffer and Green's index buffer sliding with (X, Y)]
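
A schematic of the buffer cycling (H, R, and the load/compute stubs are hypothetical stand-ins for the slide's): keep 2R+1 columns resident, load one new column per X, and rotate pointers instead of copying data.

```c
#define H 512
#define R 8
#define W (2 * R + 1)   /* columns kept resident around the current X   */

extern void load_column(float *dst, int x);   /* DMA/read stub          */
extern void convolve_at(float **cols, int x); /* computes all Y at X    */

void sweep(int width)
{
    static float storage[W][H];
    float *cols[W];

    for (int c = 0; c < W; c++) {   /* prime the sliding window         */
        cols[c] = storage[c];
        load_column(cols[c], c);
    }
    for (int x = R; x < width - R; x++) {
        convolve_at(cols, x);
        float *oldest = cols[0];    /* cycle: reuse the oldest buffer   */
        for (int c = 0; c < W - 1; c++) cols[c] = cols[c + 1];
        cols[W - 1] = oldest;
        if (x + R + 1 < width)
            load_column(cols[W - 1], x + R + 1);
    }
}
```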

Page 65:

Outline

• What’s happening?

• Why is it happening?

• What are the implications?

• What can we do about it?

Page 66:

What can we do about it?

• We want:
– High performance
– Low power
– Easy programmability
…choose any two!

• We need:
– A “magic” compiler
– Multicore-enabled libraries
– Multicore-enabled tools
– New algorithms

Page 67:

What can we do about it?

• Compiler “magic”
– OpenMP, autovectorization
– BUT… doesn't encourage parallel thinking

• Programming models
– CUDA, OpenCL, Pthreads, UPC, PGAS, etc.

• Tools
– Cell SDK, RapidMind (Intel), PeakStream (Google), Cilk (Intel), Gedae, VSIPL++, Charm++, Atlas, FFTW, PHiPAC

• If you want performance…
– No substitute for better algorithms & hand-tuning!
– Performance analyzers: HPCToolkit, FDPR-Pro, Code Analyzer, Diablo, TAU, Paraver, VTune, SunStudio Performance Analyzer, PDT, Trace Analyzer, Thor, etc.

Page 68:

What can we do about it? Example: OpenCL

• Open “standard”

• Based on C; not difficult to learn

• Allows a natural transition from (proprietary) CUDA programs

• Interoperates with MPI

• Provides application portability
– Hides specifics of the underlying accelerator architecture
– Avoids HW lock-in: “future-proofs” applications

• Weaknesses
– No double precision, no recursion & an accelerator-only model

Portability does not equal performance portability!
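
For flavor, a minimal OpenCL C kernel (a generic SAXPY sketch, not from the talk; the host-side setup with contexts, buffers, and clEnqueueNDRangeKernel is omitted):

```c
/* OpenCL C: one work-item computes one element of y = a*x + y. */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```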

Page 69:

What can we do about it?

Hide Complexity in Libraries

• Manually
– Slow, expensive, and a new library for each architecture

• Autotuners
– Search the program space for optimal performance
– Examples: Atlas (BLAS), FFTW (FFT), Spiral (DSP), OSKI (sparse BLAS), PHiPAC (BLAS)

• Local Optimality Problem:

– F() & G() may be optimal, but will F(G()) be?

Page 70:

What can we do about it?

It's all about the data! And the data problem is growing.

• Intelligent software prefetching
– Use DMA engines
– Don't rely on HW prefetching

• Efficient data management
– Multibuffering: hide the latency!
– BW utilization: make every byte count!
– SIMDization: make every vector count!
– Problem/data partitioning: make every core work!
– Software multithreading: keep every core busy!

Page 71:

Conclusions

• Programmability will continue to suffer
– No pain, no gain

• Incorporate data flow into algorithmic development
– Computational complexity vs. “data flow” complexity

• Restructure algorithms to minimize:
– Synchronization, communication, non-determinism, load imbalance, non-locality

• Data management is the key to better performance
– Merge/group data operations to minimize memory traffic
– Restructure data traffic: tile, align, SIMDize, compress
– Minimize memory bottlenecks

Page 72:

Backup Slides

Page 73:

Abstract

The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher processor performance, increased clock frequency, has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, processor frequencies have peaked in recent years and are receding from their high point. At the same time, competitive market conditions are giving a business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. This desire for newer, faster and larger is driving continued demand for higher computer performance.

The industry’s response to address these challenges has been to embrace “multicore” technology by designing processors that have multiple processing cores on each silicon chip. Increasing the number of cores per chip has enabled processor peak performance to double with each doubling of the number of cores. With performance doubling occurring at approximately constant clock frequency so that energy costs can be controlled, multicore technology is poised to deliver the performance users need for their next generation applications while at the same time reducing total cost of ownership per FLOP.

The multicore solution to the clock frequency problem comes at a cost: Performance scaling on multicore is generally sub-linear and frequently decreases beyond some number of cores. For a variety of technical reasons, off-chip bandwidth is not increasing as fast as the number of cores per chip which is making memory and communication bottlenecks the main barriers to improved performance. What these bottlenecks mean to multicore users is that precise and flexible control of data flows will be crucial to achieving high performance. Simple mappings of their existing algorithms to multicore will not result in the naïve performance scaling one might expect from increasing the number of cores per chip. Algorithmic changes, in many cases major, will have to be made to get value out of multicore. Multicore users will have to re-think and in many cases re-write their applications if they want to achieve high performance. Multicore forces each programmer to become a parallel programmer; to think of their chips as clusters; and to deal with the issues of communication, synchronization, data transfer and non-determinism as integral elements of their algorithms. And for those already familiar with parallel programming, multicore processors add a new level of parallelism and additional layers of complexity.

This talk will highlight some of the challenges that need to be overcome in order to get better performance scaling on multicore, and will suggest some solutions.

Page 74:

Cell comparison: ~4× the FLOPS @ ~½ the power (both 65 nm technology)

[Chip die photos, to scale]

Page 75:

To-Scale Comparison of L2

[Die photos: L2 caches of IBM, AMD, and Intel processors vs. the Cell/B.E., to scale]

Page 76:

Intel Multi-Core Forum (2006)

[Plot: SDET throughput vs. number of processors (0–24) on Linux; throughput scaling labeled 9.8× (“The Issue”)]

Page 77:

The “Yale Patt Ladder”

Problem

Algorithm

Program

ISA (Instruction Set Architecture)

Microarchitecture

Circuits

Electrons

To improve performance, we need people who can cross between levels.