
IBM Research

© 2008

Feeding the Multicore Beast: It’s All About the Data!

Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept. (mpp@us.ibm.com)


Outline

History: Data challenge

Motivation for multicore

Implications for programmers

How Cell addresses these implications

Examples

• 2D/3D FFT – Medical Imaging, Petroleum, general HPC…
• Green’s Functions – Seismic Imaging (Petroleum)
• String Matching – Network Processing: DPI & Intrusion Detection
• Neural Networks – Finance


Chapter 1:

The Beast is Hungry!


The Hungry Beast

Diagram: Processor (“beast”) fed Data (“food”) through a Data Pipe

Pipe too small = starved beast

Pipe big enough = well-fed beast

Pipe too big = wasted resources


If flops grow faster than pipe capacity…

… the beast gets hungrier!


Move the food closer

Example: Intel Tulsa

– Xeon MP 7100 series

– 65nm, 349mm2, 2 Cores

– 3.4 GHz @ 150W

– ~54.4 SP GFlops

– http://www.intel.com/products/processor/xeon/index.htm

Large cache on chip

– ~50% of area

– Keeps data close for efficient access

If the data is local, the beast is happy!

– True for many algorithms


What happens if the beast is still hungry?


If the data set doesn’t fit in cache

– Cache misses

– Memory latency exposed

– Performance degraded

Several important application classes don’t fit

– Graph searching algorithms

– Network security

– Natural language processing

– Bioinformatics

– Many HPC workloads


Make the food bowl larger


Cache size steadily increasing

Implications

– Chip real estate reserved for cache

– Less space on chip for computes

– More power required for fewer FLOPS


But…

– Important application working sets are growing faster

– Multicore even more demanding on cache than uni-core


Chapter 2:

The Beast Has Babies


Power Density – The fundamental problem

Chart: power density (W/cm², log scale from 1 to 1000) vs. process generation (1.5 µm down to 0.07 µm) for i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III, trending past “Hot Plate” toward “Nuclear Reactor” levels.

Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32


What’s causing the problem?

Gate dielectric approaching a fundamental limit (a few atomic layers); gate stack Tox ≈ 11 Å

Chart: power density (W/cm²) vs. gate length (microns), with the 65 nm node marked

Power, signal jitter, etc...


Diminishing Returns on Frequency

Chart: clock speed (MHz, log scale 10²–10⁴) vs. year (1990–2010), showing frequency-driven design points

In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.


Power vs Performance Trade Offs

Chart: relative power vs. relative performance for frequency scaling

We need to adapt our algorithms to get performance out of multicore


Implications of Multicore

There are more mouths to feed

– Data movement will take center stage

Complexity of cores will stop increasing

… and has started to decrease in some cases

Complexity increases will center around communication

Assumption

– Achieving a significant % of peak performance is important


Chapter 3:

The Proper Care and Feeding of Hungry Beasts


Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70W


Feeding the Cell Processor

8 SPEs each with

– LS

– MFC

– SXU

PPE

– OS functions

– Disk IO

– Network IO

Block diagram: eight SPEs (each an SPU with SXU, LS, and MFC) and the PPE (PPU with PXU, L1, and L2; 64-bit Power Architecture with VMX) connected by the EIB (up to 96B/cycle); the MIC attaches dual XDR™ memory and the BIC attaches FlexIO™, with 16B/cycle links (32B/cycle to the L2).


Cell Approach: Feed the beast more efficiently

Explicitly “orchestrate” the data flow between main memory and each SPE’s local store

– Use SPE’s DMA engine to gather & scatter data between main memory and local store

– Enables detailed programmer control of data flow

• Get/Put data when & where you want it

• Hides latency: Simultaneous reads, writes & computes

– Avoids restrictive HW cache management

• HW caching is unlikely to determine the optimal data flow

• Potentially very inefficient

– Allows more efficient use of the existing bandwidth


BOTTOM LINE:

It’s all about the data!
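A minimal sketch of the double-buffering pattern described above, in plain C. The dma_get_async/dma_put_async/dma_wait helpers are hypothetical stand-ins for the SPE’s tagged MFC DMA commands, not the actual Cell SDK API; the point is overlapping the get of the next chunk, the compute on the current chunk, and the put of the finished chunk.

```c
/* Minimal double-buffering sketch. dma_get_async()/dma_put_async()/dma_wait()
 * are hypothetical stand-ins for tagged MFC DMA commands, not the Cell SDK.  */
#include <stddef.h>

#define CHUNK 4096                          /* bytes moved per DMA transfer   */

extern void dma_get_async(void *ls, unsigned long long ea, size_t sz, int tag);
extern void dma_put_async(const void *ls, unsigned long long ea, size_t sz, int tag);
extern void dma_wait(int tag);              /* block until tag has no pending ops */
extern void compute(char *buf, size_t sz);  /* work on one chunk in the LS    */

/* Stream nbytes (a multiple of CHUNK) starting at main-memory address ea
 * through two local-store buffers, overlapping DMA with computation.          */
void process_stream(unsigned long long ea, size_t nbytes)
{
    static char buf[2][CHUNK];              /* ping-pong buffers in the LS    */
    int cur = 0;

    dma_get_async(buf[cur], ea, CHUNK, cur);           /* prime the pipeline  */
    for (size_t off = 0; off < nbytes; off += CHUNK) {
        int nxt = cur ^ 1;
        if (off + CHUNK < nbytes) {
            dma_wait(nxt);                  /* earlier put of buf[nxt] done?   */
            dma_get_async(buf[nxt], ea + off + CHUNK, CHUNK, nxt);
        }
        dma_wait(cur);                      /* current chunk has arrived       */
        compute(buf[cur], CHUNK);           /* overlaps the in-flight get      */
        dma_put_async(buf[cur], ea + off, CHUNK, cur);
        cur = nxt;                          /* swap buffers                    */
    }
    dma_wait(0);
    dma_wait(1);                            /* drain the final writes          */
}
```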


Cell Comparison: ~4x the FLOPS @ ~½ the power

Both 65nm technology (shown to scale)


Memory Managing Processor vs. Traditional General Purpose Processor

Die comparison: IBM, AMD, and Intel general-purpose processors vs. the Cell/B.E.


Examples of Feeding Cell

2D and 3D FFTs

Seismic Imaging

String Matching

Neural Networks (function approximation)


Feeding FFTs to Cell

Diagram: the input image is streamed buffer by buffer through the local store; tiles are transposed and written back to the transposed image.

SIMDized data

DMAs double buffered

Pass 1: For each buffer

• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer

Pass 2: For each buffer

• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer
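A sketch of one such pass, assuming a hypothetical in-place fft_1d() routine. Four rows are fetched, transformed, and stored transposed, so running the same pass twice — fft_pass(in, tmp) then fft_pass(tmp, out) — yields the 2D FFT; on Cell the memcpy and the transposed store would be the double-buffered DMA gets/puts, and the four row FFTs map onto the 4-wide SIMD lanes.

```c
/* Sketch of one pass of the 2D FFT scheme above (fft_1d is hypothetical). */
#include <complex.h>
#include <string.h>

#define N 1024                          /* image is N x N, N divisible by 4 */

extern void fft_1d(float complex *row, int n);    /* in-place 1D FFT        */

void fft_pass(const float complex (*in)[N], float complex (*out)[N])
{
    static float complex rows[4][N];    /* working buffer ("local store")   */

    for (int r = 0; r < N; r += 4) {
        memcpy(rows, in[r], sizeof rows);         /* "DMA get" four rows     */

        for (int k = 0; k < 4; k++)               /* four 1D FFTs            */
            fft_1d(rows[k], N);

        for (int c = 0; c < N; c++)               /* transposed "DMA put":   */
            for (int k = 0; k < 4; k++)           /* row r+k, column c goes  */
                out[c][r + k] = rows[k][c];       /* to row c, column r+k    */
    }
}
```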


3D FFTs

Long stride trashes cache

Cell DMA allows prefetch

Diagram: an N×N×N volume accessed along the stride-1 axis vs. the stride-N² axis; fetching a single element drags in its whole data envelope (cache line).
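To make the stride problem concrete, a hedged sketch of gathering one stride-N² “pencil” of an N³ volume into a contiguous buffer. On a cache-based processor nearly every iteration touches a new cache line; on Cell the same addresses would be packed into a DMA (list) transfer and prefetched into the local store while the previous pencil is being transformed.

```c
/* Sketch: gather one z-pencil (stride N*N elements) into contiguous memory. */
#include <complex.h>
#include <stddef.h>

#define N 256                                      /* volume is N x N x N   */

void gather_pencil(const float complex *vol,       /* flattened N^3 volume  */
                   int x, int y,                   /* which pencil          */
                   float complex *pencil)          /* N contiguous outputs  */
{
    for (int z = 0; z < N; z++)                    /* stride N*N per step   */
        pencil[z] = vol[(size_t)z * N * N + (size_t)y * N + x];
}
```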


Feeding Seismic Imaging to Cell


New G at each (x,y)

Radial symmetry of G reduces BW requirements

Output(x, y) = Σ over i,j of D(x+i, y+j) · G(x, y, i, j), where D is the data and G the Green’s function


Feeding Seismic Imaging to Cell

Diagram: the data volume partitioned across SPE 0 – SPE 7


Feeding Seismic Imaging to Cell

For each X

– Load next column of data

– Load next column of indices

– For each Y

• Load Green’s functions

• SIMDize Green’s functions

• Compute convolution at (X,Y)

– Cycle buffers

Diagram: a data buffer of height H and a Green’s index buffer slide across the image; the convolution window of width 2R+1 is centered at (X,Y).
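A plain-C sketch of the per-(X,Y) convolution loop above. The names, array layout, and the green_for() lookup (standing in for the Green’s index buffer) are illustrative assumptions, not the original kernel; on Cell the column loads and the Green’s-function loads become the double-buffered DMAs listed above.

```c
/* Sketch of the per-(x,y) convolution with a spatially varying Green's
 * function of radius R.                                                      */
#define NX 512
#define NY 512
#define R  8                                /* Green's function radius       */

float output[NX][NY];

void convolve(const float data[NX + 2*R][NY + 2*R],       /* padded input   */
              const float *(*green_for)(int x, int y))    /* (2R+1)^2 table */
{
    for (int x = 0; x < NX; x++) {          /* per column: data + index load */
        for (int y = 0; y < NY; y++) {      /* per point in the column       */
            const float *g = green_for(x, y);        /* G chosen per (x,y)   */
            float acc = 0.0f;
            for (int i = -R; i <= R; i++)
                for (int j = -R; j <= R; j++)
                    acc += data[x + R + i][y + R + j]         /* D(x+i,y+j)  */
                         * g[(i + R) * (2*R + 1) + (j + R)];  /* G(x,y,i,j)  */
            output[x][y] = acc;             /* result at (x,y)               */
        }
    }
}
```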


Feeding String Matching to Cell

Find (lots of) substrings in (long) string

Build graph of words & represent as DFA

Problem: Graph doesn’t fit in LS

Sample Word List:

“the”

“that”

“math”
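A minimal sketch of the DFA formulation, assuming a flat per-state transition table (Aho-Corasick style). The transition table is what outgrows the local store: every input byte forces one random lookup into it.

```c
/* Minimal sketch of DFA-based multi-pattern matching (Aho-Corasick style).
 * The state layout is an illustrative assumption; the point is that next[]
 * is a large randomly-accessed table, which is what outgrows the LS.        */
#include <stdint.h>
#include <stddef.h>

#define ALPHABET 256

typedef struct {
    uint32_t next[ALPHABET];        /* transition for each input byte         */
    uint32_t match;                 /* nonzero if this state completes a word */
} dfa_state_t;

/* Count pattern hits in one input stream. On Cell, states not resident in
 * the local store are fetched by DMA, and several streams are interleaved
 * (see the software-multithreading sketch below) to hide that latency.      */
size_t count_matches(const dfa_state_t *dfa, const uint8_t *text, size_t n)
{
    size_t hits = 0;
    uint32_t s = 0;                               /* start state             */
    for (size_t i = 0; i < n; i++) {
        s = dfa[s].next[text[i]];                 /* one random table lookup */
        if (dfa[s].match)
            hits++;
    }
    return hits;
}
```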



Hiding Main Memory Latency


Software Multithreading
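A sketch of the software-multithreading idea named here: one core keeps several independent streams (“software threads”) in flight, each with one outstanding asynchronous fetch, and round-robins between them so that while one stream waits for its data the others compute. fetch_async()/fetch_wait() and step() are hypothetical stand-ins for a tagged DMA get, a tag-status wait, and the per-record work (e.g. one DFA transition from the previous sketch).

```c
/* Sketch of software multithreading on one core.                            */
#include <stdint.h>
#include <stddef.h>

#define NSTREAMS 8                        /* concurrent "software threads"   */

typedef struct {
    uint64_t next_ea;                     /* main-memory address to fetch    */
    uint8_t  buf[128];                    /* local-store landing area        */
    int      done;
} ctx_t;

extern void fetch_async(void *ls, uint64_t ea, size_t sz, int tag);
extern void fetch_wait(int tag);          /* block until tag completes       */
extern uint64_t step(ctx_t *c);           /* consume buf, return next ea or 0 */

void run(ctx_t ctx[NSTREAMS])
{
    for (int t = 0; t < NSTREAMS; t++)    /* prime one fetch per context     */
        fetch_async(ctx[t].buf, ctx[t].next_ea, sizeof ctx[t].buf, t);

    for (int live = NSTREAMS; live > 0; ) {
        live = 0;
        for (int t = 0; t < NSTREAMS; t++) {          /* round-robin          */
            if (ctx[t].done)
                continue;
            fetch_wait(t);                            /* its record arrived   */
            ctx[t].next_ea = step(&ctx[t]);           /* compute on it        */
            if (ctx[t].next_ea == 0) {
                ctx[t].done = 1;                      /* stream finished      */
                continue;
            }
            fetch_async(ctx[t].buf, ctx[t].next_ea,   /* overlap next fetch   */
                        sizeof ctx[t].buf, t);
            live++;
        }
    }
}
```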


Feeding Neural Networks to Cell

Neural net function F(X)

– RBF, MLP, KNN, etc.

If too big for LS, BW Bound

N Basis functions: dot product + nonlinearity

D Input dimensions

DxN Matrix of parameters

Diagram: input X passes through the network F to produce the output.
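A sketch of the basis-function evaluation described above: N basis functions, each a D-dimensional dot product followed by a nonlinearity (tanh is just an illustrative choice). The D×N parameter matrix is streamed in full for every input, which is what makes the evaluation bandwidth bound once the matrix no longer fits in the local store.

```c
/* Sketch: evaluate F(x) as a sum of N basis functions over a D-dim input.  */
#include <math.h>

float eval_net(int D, int N,
               const float *params,   /* D x N matrix; basis n occupies      */
                                      /* params[n*D .. n*D + D - 1]          */
               const float *w,        /* N output weights                    */
               const float *x)        /* D-dimensional input                 */
{
    float y = 0.0f;
    for (int n = 0; n < N; n++) {
        float dot = 0.0f;
        for (int d = 0; d < D; d++)               /* dot product             */
            dot += params[n * D + d] * x[d];      /* streams the matrix      */
        y += w[n] * tanhf(dot);                   /* nonlinearity            */
    }
    return y;
}
```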


Convert BW Bound to Compute Bound

Split function over multiple SPEs

Avoids unnecessary memory traffic

Reduce compute time per SPE

Minimal merge overhead

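A sketch of the split-and-merge partitioning, reusing eval_net() from the previous sketch. Each of K cores (modeled here by the k loop; on Cell these slices run in parallel, one per SPE) evaluates only its own slice of the basis functions, so its slice of the parameter matrix can stay resident in its local store; the merge is just a K-way sum of partial results.

```c
/* Sketch: split the N basis functions across K cores, then merge.          */
#define K 8                                       /* number of SPEs          */

float eval_net(int D, int N, const float *params,
               const float *w, const float *x);   /* from previous sketch    */

float eval_net_split(int D, int N, const float *params,
                     const float *w, const float *x)
{
    float partial[K];

    for (int k = 0; k < K; k++) {                 /* one slice per "SPE"     */
        int n0 = k * N / K, n1 = (k + 1) * N / K;
        partial[k] = eval_net(D, n1 - n0, params + n0 * D, w + n0, x);
    }

    float y = 0.0f;                               /* minimal merge step      */
    for (int k = 0; k < K; k++)
        y += partial[k];
    return y;
}
```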


Moral of the Story: It’s All About the Data!

The data problem is growing: multicore

Intelligent software prefetching

– Use DMA engines

– Don’t rely on HW prefetching

Efficient data management

– Multibuffering: Hide the latency!

– BW utilization: Make every byte count!

– SIMDization: Make every vector count!

– Problem/data partitioning: Make every core work!

– Software multithreading: Keep every core busy!


Backup


Abstract

Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor's design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.
