
IBM Research

© 2008

Feeding the Multicore Beast: It’s All About the Data!

Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept. (mpp@us.ibm.com)


Outline

History: Data challenge

Motivation for multicore

Implications for programmers

How Cell addresses these implications

Examples

• 2D/3D FFT – Medical Imaging, Petroleum, general HPC…
• Green’s Functions – Seismic Imaging (Petroleum)
• String Matching – Network Processing: DPI & Intrusion Detection
• Neural Networks – Finance


Chapter 1:

The Beast is Hungry!


The Hungry Beast

Diagram: Processor (“beast”) fed Data (“food”) through a Data Pipe

Pipe too small = starved beast

Pipe big enough = well-fed beast

Pipe too big = wasted resources


If flops grow faster than pipe capacity…

… the beast gets hungrier!


Move the food closer

Example: Intel Tulsa

– Xeon MP 7100 series

– 65nm, 349mm2, 2 Cores

– 3.4 GHz @ 150W

– ~54.4 SP GFlops

– http://www.intel.com/products/processor/xeon/index.htm

Large cache on chip

– ~50% of area

– Keeps data close for efficient access

If the data is local, the beast is happy!

– True for many algorithms


What happens if the beast is still hungry?


If the data set doesn’t fit in cache

– Cache misses

– Memory latency exposed

– Performance degraded

Several important application classes don’t fit

– Graph searching algorithms

– Network security

– Natural language processing

– Bioinformatics

– Many HPC workloads


Make the food bowl larger


Cache size steadily increasing

Implications

– Chip real estate reserved for cache

– Less space on chip for computes

– More power required for fewer FLOPS


But…

– Important application working sets are growing faster

– Multicore even more demanding on cache than uni-core


Chapter 2:

The Beast Has Babies


Power Density – The fundamental problem

Chart: power density (W/cm², log scale from 1 to 1000) vs. process generation (1.5 µm down to 0.07 µm) for i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III, trending past “Hot Plate” toward “Nuclear Reactor” levels.

Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32


What’s causing the problem?

Gate dielectric approaching a fundamental limit (a few atomic layers); gate stack Tox ≈ 11 Å

Chart: power density (W/cm²) vs. gate length (microns), with the 65 nm node marked

Power, signal jitter, etc...


Diminishing Returns on Frequency

Chart: clock speed (MHz, log scale 10²–10⁴) vs. year (1990–2010), showing frequency-driven design points

In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.


Power vs Performance Trade Offs

Chart: relative power vs. relative performance for frequency scaling

We need to adapt our algorithms to get performance out of multicore


Implications of Multicore

There are more mouths to feed

– Data movement will take center stage

Complexity of cores will stop increasing

… and has started to decrease in some cases

Complexity increases will center around communication

Assumption

– Achieving a significant % of peak performance is important


Chapter 3:

The Proper Care and Feeding of Hungry Beasts


Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70W


Feeding the Cell Processor

8 SPEs each with

– LS

– MFC

– SXU

PPE

– OS functions

– Disk IO

– Network IO

Block diagram: eight SPEs (each an SPU with SXU, LS, and MFC) and the PPE (PPU with PXU, L1, and L2; 64-bit Power Architecture with VMX) connected by the EIB (up to 96B/cycle); the MIC attaches dual XDR™ memory and the BIC attaches FlexIO™, with 16B/cycle links (32B/cycle to the L2).


Cell Approach: Feed the beast more efficiently

Explicitly “orchestrate” the data flow between main memory and each SPE’s local store

– Use SPE’s DMA engine to gather & scatter data between main memory and local store

– Enables detailed programmer control of data flow

• Get/Put data when & where you want it

• Hides latency: Simultaneous reads, writes & computes

– Avoids restrictive HW cache management

• HW caching is unlikely to determine the optimal data flow

• Potentially very inefficient

– Allows more efficient use of the existing bandwidth


BOTTOM LINE:

It’s all about the data!
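A minimal sketch of the double-buffering pattern described above, in plain C. The dma_get_async/dma_put_async/dma_wait helpers are hypothetical stand-ins for the SPE’s tagged MFC DMA commands, not the actual Cell SDK API; the point is overlapping the get of the next chunk, the compute on the current chunk, and the put of the finished chunk.

```c
/* Minimal double-buffering sketch. dma_get_async()/dma_put_async()/dma_wait()
 * are hypothetical stand-ins for tagged MFC DMA commands, not the Cell SDK.  */
#include <stddef.h>

#define CHUNK 4096                          /* bytes moved per DMA transfer   */

extern void dma_get_async(void *ls, unsigned long long ea, size_t sz, int tag);
extern void dma_put_async(const void *ls, unsigned long long ea, size_t sz, int tag);
extern void dma_wait(int tag);              /* block until tag has no pending ops */
extern void compute(char *buf, size_t sz);  /* work on one chunk in the LS    */

/* Stream nbytes (a multiple of CHUNK) starting at main-memory address ea
 * through two local-store buffers, overlapping DMA with computation.          */
void process_stream(unsigned long long ea, size_t nbytes)
{
    static char buf[2][CHUNK];              /* ping-pong buffers in the LS    */
    int cur = 0;

    dma_get_async(buf[cur], ea, CHUNK, cur);           /* prime the pipeline  */
    for (size_t off = 0; off < nbytes; off += CHUNK) {
        int nxt = cur ^ 1;
        if (off + CHUNK < nbytes) {
            dma_wait(nxt);                  /* earlier put of buf[nxt] done?   */
            dma_get_async(buf[nxt], ea + off + CHUNK, CHUNK, nxt);
        }
        dma_wait(cur);                      /* current chunk has arrived       */
        compute(buf[cur], CHUNK);           /* overlaps the in-flight get      */
        dma_put_async(buf[cur], ea + off, CHUNK, cur);
        cur = nxt;                          /* swap buffers                    */
    }
    dma_wait(0);
    dma_wait(1);                            /* drain the final writes          */
}
```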


Cell Comparison: ~4x the FLOPS @ ~½ the power

Both 65nm technology (shown to scale)


Memory Managing Processor vs. Traditional General Purpose Processor

Die comparison: IBM, AMD, and Intel general-purpose processors vs. the Cell/B.E.


Examples of Feeding Cell

2D and 3D FFTs

Seismic Imaging

String Matching

Neural Networks (function approximation)


Feeding FFTs to Cell

Diagram: the input image is streamed buffer by buffer through the local store; tiles are transposed and written back to the transposed image.

SIMDized data

DMAs double buffered

Pass 1: For each buffer

• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer

Pass 2: For each buffer

• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer
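A sketch of one such pass, assuming a hypothetical in-place fft_1d() routine. Four rows are fetched, transformed, and stored transposed, so running the same pass twice — fft_pass(in, tmp) then fft_pass(tmp, out) — yields the 2D FFT; on Cell the memcpy and the transposed store would be the double-buffered DMA gets/puts, and the four row FFTs map onto the 4-wide SIMD lanes.

```c
/* Sketch of one pass of the 2D FFT scheme above (fft_1d is hypothetical). */
#include <complex.h>
#include <string.h>

#define N 1024                          /* image is N x N, N divisible by 4 */

extern void fft_1d(float complex *row, int n);    /* in-place 1D FFT        */

void fft_pass(const float complex (*in)[N], float complex (*out)[N])
{
    static float complex rows[4][N];    /* working buffer ("local store")   */

    for (int r = 0; r < N; r += 4) {
        memcpy(rows, in[r], sizeof rows);         /* "DMA get" four rows     */

        for (int k = 0; k < 4; k++)               /* four 1D FFTs            */
            fft_1d(rows[k], N);

        for (int c = 0; c < N; c++)               /* transposed "DMA put":   */
            for (int k = 0; k < 4; k++)           /* row r+k, column c goes  */
                out[c][r + k] = rows[k][c];       /* to row c, column r+k    */
    }
}
```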


3D FFTs

Long stride trashes cache

Cell DMA allows prefetch

Diagram: an N×N×N volume accessed along the stride-1 axis vs. the stride-N² axis; fetching a single element drags in its whole data envelope (cache line).
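To make the stride problem concrete, a hedged sketch of gathering one stride-N² “pencil” of an N³ volume into a contiguous buffer. On a cache-based processor nearly every iteration touches a new cache line; on Cell the same addresses would be packed into a DMA (list) transfer and prefetched into the local store while the previous pencil is being transformed.

```c
/* Sketch: gather one z-pencil (stride N*N elements) into contiguous memory. */
#include <complex.h>
#include <stddef.h>

#define N 256                                      /* volume is N x N x N   */

void gather_pencil(const float complex *vol,       /* flattened N^3 volume  */
                   int x, int y,                   /* which pencil          */
                   float complex *pencil)          /* N contiguous outputs  */
{
    for (int z = 0; z < N; z++)                    /* stride N*N per step   */
        pencil[z] = vol[(size_t)z * N * N + (size_t)y * N + x];
}
```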


Feeding Seismic Imaging to Cell


New G at each (x,y)

Radial symmetry of G reduces BW requirements

Output(x, y) = Σ over i,j of D(x+i, y+j) · G(x, y, i, j), where D is the data and G the Green’s function


Feeding Seismic Imaging to Cell

Diagram: the data volume partitioned across SPE 0 – SPE 7


Feeding Seismic Imaging to Cell

For each X

– Load next column of data

– Load next column of indices

– For each Y

• Load Green’s functions

• SIMDize Green’s functions

• Compute convolution at (X,Y)

– Cycle buffers

Diagram: a data buffer of height H and a Green’s index buffer slide across the image; the convolution window of width 2R+1 is centered at (X,Y).
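A plain-C sketch of the per-(X,Y) convolution loop above. The names, array layout, and the green_for() lookup (standing in for the Green’s index buffer) are illustrative assumptions, not the original kernel; on Cell the column loads and the Green’s-function loads become the double-buffered DMAs listed above.

```c
/* Sketch of the per-(x,y) convolution with a spatially varying Green's
 * function of radius R.                                                      */
#define NX 512
#define NY 512
#define R  8                                /* Green's function radius       */

float output[NX][NY];

void convolve(const float data[NX + 2*R][NY + 2*R],       /* padded input   */
              const float *(*green_for)(int x, int y))    /* (2R+1)^2 table */
{
    for (int x = 0; x < NX; x++) {          /* per column: data + index load */
        for (int y = 0; y < NY; y++) {      /* per point in the column       */
            const float *g = green_for(x, y);        /* G chosen per (x,y)   */
            float acc = 0.0f;
            for (int i = -R; i <= R; i++)
                for (int j = -R; j <= R; j++)
                    acc += data[x + R + i][y + R + j]         /* D(x+i,y+j)  */
                         * g[(i + R) * (2*R + 1) + (j + R)];  /* G(x,y,i,j)  */
            output[x][y] = acc;             /* result at (x,y)               */
        }
    }
}
```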


Feeding String Matching to Cell

Find (lots of) substrings in (long) string

Build graph of words & represent as DFA

Problem: Graph doesn’t fit in LS

Sample Word List:

“the”

“that”

“math”
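A minimal sketch of the DFA formulation, assuming a flat per-state transition table (Aho-Corasick style). The transition table is what outgrows the local store: every input byte forces one random lookup into it.

```c
/* Minimal sketch of DFA-based multi-pattern matching (Aho-Corasick style).
 * The state layout is an illustrative assumption; the point is that next[]
 * is a large randomly-accessed table, which is what outgrows the LS.        */
#include <stdint.h>
#include <stddef.h>

#define ALPHABET 256

typedef struct {
    uint32_t next[ALPHABET];        /* transition for each input byte         */
    uint32_t match;                 /* nonzero if this state completes a word */
} dfa_state_t;

/* Count pattern hits in one input stream. On Cell, states not resident in
 * the local store are fetched by DMA, and several streams are interleaved
 * (see the software-multithreading sketch below) to hide that latency.      */
size_t count_matches(const dfa_state_t *dfa, const uint8_t *text, size_t n)
{
    size_t hits = 0;
    uint32_t s = 0;                               /* start state             */
    for (size_t i = 0; i < n; i++) {
        s = dfa[s].next[text[i]];                 /* one random table lookup */
        if (dfa[s].match)
            hits++;
    }
    return hits;
}
```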



Hiding Main Memory Latency


Software Multithreading
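A sketch of the software-multithreading idea named here: one core keeps several independent streams (“software threads”) in flight, each with one outstanding asynchronous fetch, and round-robins between them so that while one stream waits for its data the others compute. fetch_async()/fetch_wait() and step() are hypothetical stand-ins for a tagged DMA get, a tag-status wait, and the per-record work (e.g. one DFA transition from the previous sketch).

```c
/* Sketch of software multithreading on one core.                            */
#include <stdint.h>
#include <stddef.h>

#define NSTREAMS 8                        /* concurrent "software threads"   */

typedef struct {
    uint64_t next_ea;                     /* main-memory address to fetch    */
    uint8_t  buf[128];                    /* local-store landing area        */
    int      done;
} ctx_t;

extern void fetch_async(void *ls, uint64_t ea, size_t sz, int tag);
extern void fetch_wait(int tag);          /* block until tag completes       */
extern uint64_t step(ctx_t *c);           /* consume buf, return next ea or 0 */

void run(ctx_t ctx[NSTREAMS])
{
    for (int t = 0; t < NSTREAMS; t++)    /* prime one fetch per context     */
        fetch_async(ctx[t].buf, ctx[t].next_ea, sizeof ctx[t].buf, t);

    for (int live = NSTREAMS; live > 0; ) {
        live = 0;
        for (int t = 0; t < NSTREAMS; t++) {          /* round-robin          */
            if (ctx[t].done)
                continue;
            fetch_wait(t);                            /* its record arrived   */
            ctx[t].next_ea = step(&ctx[t]);           /* compute on it        */
            if (ctx[t].next_ea == 0) {
                ctx[t].done = 1;                      /* stream finished      */
                continue;
            }
            fetch_async(ctx[t].buf, ctx[t].next_ea,   /* overlap next fetch   */
                        sizeof ctx[t].buf, t);
            live++;
        }
    }
}
```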


Feeding Neural Networks to Cell

Neural net function F(X)

– RBF, MLP, KNN, etc.

If too big for LS, BW Bound

N Basis functions: dot product + nonlinearity

D Input dimensions

DxN Matrix of parameters

Diagram: input X passes through the network F to produce the output.
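A sketch of the basis-function evaluation described above: N basis functions, each a D-dimensional dot product followed by a nonlinearity (tanh is just an illustrative choice). The D×N parameter matrix is streamed in full for every input, which is what makes the evaluation bandwidth bound once the matrix no longer fits in the local store.

```c
/* Sketch: evaluate F(x) as a sum of N basis functions over a D-dim input.  */
#include <math.h>

float eval_net(int D, int N,
               const float *params,   /* D x N matrix; basis n occupies      */
                                      /* params[n*D .. n*D + D - 1]          */
               const float *w,        /* N output weights                    */
               const float *x)        /* D-dimensional input                 */
{
    float y = 0.0f;
    for (int n = 0; n < N; n++) {
        float dot = 0.0f;
        for (int d = 0; d < D; d++)               /* dot product             */
            dot += params[n * D + d] * x[d];      /* streams the matrix      */
        y += w[n] * tanhf(dot);                   /* nonlinearity            */
    }
    return y;
}
```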


Convert BW Bound to Compute Bound

Split function over multiple SPEs

Avoids unnecessary memory traffic

Reduce compute time per SPE

Minimal merge overhead

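A sketch of the split-and-merge partitioning, reusing eval_net() from the previous sketch. Each of K cores (modeled here by the k loop; on Cell these slices run in parallel, one per SPE) evaluates only its own slice of the basis functions, so its slice of the parameter matrix can stay resident in its local store; the merge is just a K-way sum of partial results.

```c
/* Sketch: split the N basis functions across K cores, then merge.          */
#define K 8                                       /* number of SPEs          */

float eval_net(int D, int N, const float *params,
               const float *w, const float *x);   /* from previous sketch    */

float eval_net_split(int D, int N, const float *params,
                     const float *w, const float *x)
{
    float partial[K];

    for (int k = 0; k < K; k++) {                 /* one slice per "SPE"     */
        int n0 = k * N / K, n1 = (k + 1) * N / K;
        partial[k] = eval_net(D, n1 - n0, params + n0 * D, w + n0, x);
    }

    float y = 0.0f;                               /* minimal merge step      */
    for (int k = 0; k < K; k++)
        y += partial[k];
    return y;
}
```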


Moral of the Story: It’s All About the Data!

The data problem is growing: multicore

Intelligent software prefetching

– Use DMA engines

– Don’t rely on HW prefetching

Efficient data management

– Multibuffering: Hide the latency!

– BW utilization: Make every byte count!

– SIMDization: Make every vector count!

– Problem/data partitioning: Make every core work!

– Software multithreading: Keep every core busy!


Backup


Abstract

Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor's design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.
