POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

POD: A Heterogeneous Many-Core

Platform Using a 3D-integrated

Acceleration Layer

Hsien-Hsin S. Lee Georgia Tech

withDong Hyuk Woo Georgia TechJosh Fryman Intel CorporationAllan Knies Intel Research BerkeleyMarsha Eng Intel Corporation

2Hsien-Hsin S. Lee: POD

‘Many-Core’? Are We There Yet?

• What is new?– On-Die

Performance Scalability

Area ScalabilityArea Scalability

- How many cores on a die?How many cores on a die?

Power ScalabilityPower Scalability

- How much power per core?How much power per core?

Performance Scalability

Power Issue 1: Thermal Control

3D Cooler Pro

Pure copperCooler

jetCooligy’s channel

Cooking-Aware Computing(Source: Kevin Skadron, UVa) PS3 Grill

leakddstdddd IVIVfCVP 2

Power Issue 2: Supply

Google Server Farms (Oregon)

Technology (nm) 65 45 32 22 16

# of cores 4 8 16 32 64

Frequency (GHz) 3.0 3.0 3.0 3.0 3.0

Vdd (V) 1.3 1.1 0.9 0.8 0.7

Power / core (W) 37.5 26.8 18.0 14.2 10.9

Total power (W) 150 215 288 454 696

Scalable Power Consumption?

* The above pictures (>=8) do not represent any current or future Intel products.

A Thousand Cores, Power Consumption?

• UIUC: Programming Model for one thousand cores [DAC-44]

• Mark Horowitz (Stanford): You will never see a one-thousand core processor [C2S2/GSRC review, Dec 2007]

Some Known Facts on Interconnection

• One main energy hog!– Alpha 21364:

• 20% (25W / 125W)

– MIT RAW: • 40% of individual tile power

– UT TRIPS: • 25%

• Mesh-based MIMD many-core?– Excessive energy in its router logic– Unpredictable communication pattern

• Unfriendly to low-power techniques

Interconnection Network related

Many Many-Cores . . . To Name a Few

P* : Homogeneous

c* : Homogeneous (Energy Efficient)

P+c*: Heterogeneous

0 10 20 30 40 50 60 70

Single-chip power budget

cePower-Equivalent Performance

P* c* P+c*Woo and Lee. To appear in IEEE Computer

Power-Equivalent Performance per Joule

0 10 20 30 40 50 60 70

P* c* P+c*Woo and Lee. To appear in IEEE Computer

PODPOD

Parallel-On-DemandParallel-On-Demand

Parallel-On-DieParallel-On-Die

Woo, Fryman, Knies, Eng, and Lee. To appear in IEEE MICRO, July 2008

Modern Constraints and Issues

• Area & Power scalability (efficiency)

• On-chip wire delay

• Backward/forward binary compatibility

• Extensibility

• Low NRE

A “Snap-On” Broad-Purpose Accelerator

Memory ControllerL2 Cache

x86 Processor

Processing Element

Row Response Queue (RRQ)

Memory ring

External TLB (xTLB)

Arbiter (ARB)

Memory Bus (MBUS)

Interleaved 2D Torus Links

Data Return Buffer (DRB)

Instruction Bus (IBUS)

ORTree

3D Die Stacking, Extending Moore’s Law

Processor Layer

Cache Layer

DRAM Layer

Analog Layer

Face to Face

Back to Back

Face to Back

Heat Sink

3D-integrated POD

You live and work here

You get your beer here

Host Processor

POD Array (with 8x8 PEs)

Host Processor

Instruction Bus (IBus)

Host Processor

ORTree: Status Monitoring

Host ProcessorFlag status

Data Return Buffer (DRB)

Host Processor

2D Torus P2P Links

0 7 1 6 2 5 3 4

Host Processor

2D Torus P2P Links

Host Processor

RRQ & MBus: DMA

Host Processor

RowResponse

MemoryBus

On-chip MC / Ring

Host Processor

Arbitrator

Memory Controller

Wire Delay Aware Design

EFLAGS! MemFence

Simplified PE Pipelining

DEC / RF

Inter-PE Communication

• Fully synchronized and orchestrated by the host processor – No crossbar– No contention / arbitration– No large buffer of conventional on-chip routers– Nothing unpredictable from the standpoint of

programmers!

Memory Organization

RRQn-1

x86 Host Core

Data Ring: ~66B

RRQ Ctrl Ring: ~2B

Address Ring: ~8B

LLC Ctrl Ring: ~2B

Worst-case:

Execution Model

• Heterogeneous ISAs– Host Processor (e.g., x86 CISC)

• For backward compatibility

– RISC-type VLIW POD• For area- and power-efficiency

• Host orchestrates the entire execution

• Instruction Encapsulation– SendBits (x86 extension)– Followed by 96-bit POD VLIW instructions

Programming Model

• Example code: Reduction

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

Programming Model (2x2 POD)

• Malloc memory at the host memory spacechar* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

$ASM add r15 = r15, r14

• Initialize ai

base_addr: 16

16: 017: 018: 019: 0

$ASM add r15 = r15, r14

• Load the address of my PE ID

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Load my PE ID

r15 2 r15 2

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Broadcast base_addr

r15 2 r15 3

r15 0 r15 1

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Calculate the pointer to my ai

r14r15

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Load my ai from the host memory space

r14r15

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Wait until it is loaded

r14r15

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Update S

r14r15

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• 1st iteration

r1r2r14r15

331618

r1r2r14r15

441619

r1r2r14r15

111616

r1r2r14r15

221617

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

• Transfer my ai to my east-side neighbor

r1r2r14r15

331618

r1r2r14r15

441619

r1r2r14r15

111616

r1r2r14r15

221617

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

• Update S

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

• Copy rowsum to r1

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

• 1st iteration

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

• Transfer my rowsum to my north-side neighbor

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

• Update S

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

101618

r1r2r14r15

101619

r1r2r14r15

101616

r1r2r14r15

101617

• Done!

base_addr: 16

16: 117: 218: 319: 4

$ASM add r15 = r15, r14

r1r2r14r15

101618

r1r2r14r15

101619

r1r2r14r15

101616

r1r2r14r15

101617

• Broadcast the pointer to ‘result’

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r2$ASM popmask

POD_mfence();

128: <- addr of ‘result’

r1r2r14r15

1012818

r1r2r14r15

1012819

r1r2r14r15

1012816

r1r2r14r15

1012817

• Only PE0 will write-back!

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

r1r2r14r15

101282

r1r2r14r15

101282

r1r2r14r15

101282

r1r2r14r15

101282

• Load PE_ID

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

• Compare PE_ID

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

• Enable PE only when equal to zero

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

• Write-back

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

• Enable ALL PEs

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

• Wait until the result is written-back

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_mfence();

Dimension Virtualization for POD

• Virtualization of the number of PEs– Six runtime variables needed in HW

• Total number of PEs in a row• Total number of PEs in a column• Total number of PEs• Per-PE x-coordinate value• Per-PE y-coordinate value• PE ID of each PE

• Virtualization of the existence of a POD layer– Software/hardware binary translator– One PE on a conventional general-purpose

processor layer (area overhead ~9%)

EvaluationEvaluation

Physical Design Evaluation @ 45nm

• PE– 128KB 2R/W-1R-1W SRAM, Basic GP ALU,

simplified SSE ALU– ~1.9 mm2 (~1.33W)

• RRQ– (conservative estimation)– ~1.9 mm2

• POD Total: ~137 mm2 (~100W)

• Penryn (6MB L2) : ~107 mm2

Physical Design Evaluation

AGUDL1

Intel Penryn Core 2 Duo processor die photo

Physical Design Evaluation

Intel Penryn Core 2 Duo processor die photo

Performance Result

64869.3

GFLOPS

113.3GFLOPS

134.8GFLOPS

860.2GFLOPS

91.6GFLOPS

68.2GFLOPS

504.6GFLOPS

66.8GFLOPS

95.6GFLOPS

DenseMMM

(SP FP)

(DP FP)

OptionPricing(SP FP)

DownSampling(SP FP)

Histogram

LinearRegression

(SP FP)

K-means

(SP FP)

CollisionDetection(SP FP)

RayCasting(SP FP)

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Performance per mm2

DenseMMM FFT IDCT OptionPricing

DownSampling

Histogram LinearRegression

K-means CollisionDetection

RayCasting

Performance per Joule

DownSampling

RayCasting

Interconnection Power

Input buffer X-bar arbiter Link

RAW 31% 30% ~0% 39%

TRIPS 35% 33% 1% 31%

Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003

2D Torus P2P Link Active Time

DownSampling

RayCasting

N E W S N E W S N E W S N E W S N E W S N E W S N E W S N E W S N E W S N E W S

Conclusions

• We propose POD – Parallel-On-Demand many-core architecture – Broad-purpose acceleration – High energy and area efficiency– Backward compatibility

• A single-chip POD can achieve up to 1.5 TFLOPS of IEEE SP FP operations.

• Interconnection architecture of POD is extremely power- and area-efficient.

Host Processor

Let’s Use Power to Compute, Not Commute!

Georgia Tech

ECE MARS Labs

http://arch.ece.gatech.edu

Back-up SlidesBack-up Slides

x86 Modification

• 5 new instructions:– sendibits, getort, drainort, movl, getdrb,

• 3 modified instructions:– lfence, sfence, mfence

• 4 possible new co-processor registers or else mem-mapped addrs:– ibits register, ort status, drb value, xTLB support

• IBits queue

• All other modifications external to the x86 core– Arbiter for LLC priority can be external to LLC

• Deferred Design Issue: LLC-POD Coherence– v1 Study: uncachable– v2 Prototype: LLC Snoops on A/D rings

POD ISA

• “Instructions” are 3-wide VLIW bundles, – One of each G,X,M (32 bits each)

• Integer Pipe (G):– and,or,not,xor,shl,shr,add,sub,mul,cmp,cmov,xfer,xferdrb : !

• SSE Pipe (X):– pfp{fma,add,sub,mul,div,max,min},pi{add,sub,mul,and,or,not,c

mp},shuffle,cvt,xfer : !idiv

• Memory Pipe (M):– ld,st,blkrd/wr,mask{set,pop,push},mov,xfer

• Hybrid (G+X+M):– movl

PODSIM

Native

machinePODSIM

PE pipelining

Memory subsystem

Accurate modeling of on-chip, off-chip bandwidth

Limitation:

Host processor overhead not modeled (I$ miss, branch misprediction)

LLC takeover of the MC ring by the host

Out-of-order Host Issues

• Speculative execution– No recovery mechanism in each PE– Broadcast POD instructions in a non-speculative

manner• As long as host-side code does not depend on the results

from POD, this approach does not affect the performance.

• Instruction Reordering– POD instructions are strongly ordered.

• IBits queue – Similar to the store queue

Programming Model

• Example: Reduction

for ( i = 0; i < input_size; i++ )host_input[i] = i+1;

blk_size = input_size / npememcpy( &local_input[0], &host_input[blk_size*mypeID], my_input_size )memory_fence;

local_sum = 0for ( i = 0; i < blk_size; i++ )

local_sum += local_input[i]

row_sum = local_sumfor ( i=0; i< (npeX-1); i++ ) {

xfer.east local_sum = local_sumrow_sum += local_sum

global_sum = row_sumfor ( i=0; i< (npeY-1); i++ ) {

xfer.north row_sum = row_sumglobal_sum += row_sum

Programming Model

0 1 2 3 4 5 6 7

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

4 5 6 7

0 1 2 3

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

4 5 6 7

0 1 2 3

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Programming Model

0 1 2 3 4 5 6 72 (blk_size)

Design Space Exploration

• Baseline Design

POD Design Space Exploration

• 3D many-POD processorDie 0

Power-Equivalent Performance per Watt

0 10 20 30 40 50 60 70

P* c* P+c*

Microfluid Channels for 3D Integration

• Challenges– Routing– Fluid Pressure– Cost

a stack

C4 bump

fluid in fluid out

package substrate

Courtesy : Young-Joon Lee, Georgia Tech

transistor Signal TSVmicrofluidic

channel

metal layers

P/G nets

Clock net

fluidic via(hot)

fluidic via(cool)

Avatrel cover

Clock TSV

I/O TSVI/O

buffer

P/G TSV

Many-Core? Do You Need It?

Well, it is hard to say in Computing WorldWell, it is hard to say in Computing World

If you were plowing a field, which would you rather use:

Two strong oxen or 1024 chickens?--- Seymour Cray

What Did We Learn from Deep Blue?

DEEP BLUE

480 chess chips

Can evaluate 200,000,000 moves per second!!

Feng-Hsiung Hsu

POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer

Documents

POD Basics

Transparent Heterogeneous Cloud Acceleration

PEACE IN A POD - Europlan · S-POD S-POD S POD FIXED AND FOCUSED The S POD is a single person private space suitable for uninterrupted work and making important calls. This quiet

pod kupolom

POD Introduction

Airflow on Kubernetes: Containerizing your Workflows · Golang templating language to templatize manifests Automate deployment of Airflow with Helm using Terraform Pod Pod Pod Pod

Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous

Driver Pod with One Driver Driver Pod with Two Drivers...Driver Pod Installation 13 Driver Pod Installation 14a Move driver pod back and position directly above ceiling access hole

181078 POD

Last Updated: October 20, 2010 - Cisco€¦ · Design and Implementation Guides† Compact Pod Design– Compact Pod Implementation– – Large Pod Design – Large Pod Implementation

GLUCOSE 1 GLUCOSA TRINDER.GOD-POD. Colorimetric 1 TRINDER.GOD-POD. Colorimetrico (Calibrator included / Calibrador incluido) GLUCOSE 1 GLUCOSA Liquid.GOD-POD. Colorimetric / Liquid.GOD-POD

MS Word Template_102504 - Cisco - Global Home Page · Web view... Complete, CCNP Premium POD – Complete, CCNP Premium POD – Switch, CCNP Extreme POD – Switch, CCNP Extreme POD

Heterogeneous Hardware Acceleration of Parallel Algorithmsraata/Conference_files/Chimera_AIP2012... · 2016-04-24 · Heterogeneous Hardware Acceleration of Parallel Algorithms Department

Dynamic Application Reconfiguration on Heterogeneous Hardware · acceleration based on the currently available hardware re-sources. ThroughTornadoVM, we introduce a new level of

POD/EIM nonlinear model reduction and POD/4-D VAR with TR ...myweb.fsu.edu/rstefanescu/Papers/Navon_Stefanescu_Poitiers.pdf · 1. POD/DEIM POD/EIM justi cation and methodology POD/DEIM

Twitter Pod

Frequently Asked Questions about SSA’s Promoting ... - POD · Link to the POD Evaluation Helpdesk. Link to the POD Evaluation Helpdesk. Link to the POD Evaluation Helpdesk. Title:

Installing Windows XP and Server 2003 · Web viewNetworks Lab Pod Diagram (x = pod number)Breakout Box Patch Panel Pod Switch Fa0/1 Fa0/0 Console Pod Router (Rear) Pod x Server Addr:

Pod research

sheltered lounge · 2020. 5. 19. · lounge pod with privacy wing + ottoman sectional lounge desk pod lounge lounge pod lounge pod + privacy hood Caav designed by Justin Champaign