
Uri Weiser, EE Technion

Alternative Computing Day February 9th, 2009

Asymmetric Chip Multi-Core: The Future Chip Multiprocessor

1

Sailing Basics

[Diagram: boat, buoy, and wind direction]

2

Sailing - wind shift

[Diagram: the course after the wind shift; buoy and wind direction marked]

3


25th America's Cup, September 23rd 1983: Liberty against Australia II

Best-of-seven race series; score 3:0 to Liberty; the 4th race started

4

Sailing competition: getting there first

[Diagram: Boat 1 and Boat 2 on a course with a start line, buoy, and wind direction marked]

– What is the strategy of Boat 1?

– What is the strategy of Boat 2?

Do not follow. Invent!

5

Outline

I selected the other way, but not as far as alternative computing

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

6

Cores implications

CMP has been ubiquitous since Paul Otellini announced the Core 2 Duo in June 2006

Reasons: Power wall

– Single-core performance/power trend

– Process technology

Open questions

– How many cores? 10s? 1000s? Implications!

– Cache architecture?

– Are the cores symmetric (aka homogeneous) or asymmetric?

– If Asymmetric

– Same ISA

– Different ISA extensions

– Application specific accelerators

– Task/threads scheduling

7


NUMA implications

The second boat chose NUMA in 2007

Reasons: Integration and Performance

Implications

– Synchronize Memory Controllers (MC)

– Future: multiple MCs on die

Open issues/opportunities

– All memory traffic runs through the CPU

– Task/Threads assignments and scheduling

[Diagram: two NUMA nodes, each with four cores sharing a cache ($) and local memory (Mem), connected to each other]

8
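One practical consequence of on-die NUMA is thread placement: a thread runs fastest on cores near the memory node that holds its data. Below is a minimal, Linux-only sketch using Python's os.sched_setaffinity; the node-to-core mapping is an invented example (real systems expose it via /sys/devices/system/node or libnuma).

import os

# Assumed topology for illustration only: node 0 owns cores 0-3,
# node 1 owns cores 4-7.
NODE_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def pin_to_node(node: int) -> None:
    # Restrict the calling process (pid 0) to one node's cores so its
    # memory accesses stay local to that node's memory controller.
    os.sched_setaffinity(0, NODE_CORES[node])

pin_to_node(0)
print(os.sched_getaffinity(0))  # -> {0, 1, 2, 3}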

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

9

The Computer runs multiple programs

CPUs & Caches

Memory

Mass storage

?

Fast I/Oe.g. camera, Accelerators

Slow I/Oe.g. human interface

Display

Communication

Computation and “scratch pad” The “outside” world

Computation enginese.g. Graphics

engine, Media, Communication

engines,

10

Flying machines: are they all the same?

11

12

Flying machines: are they all the same?

13

Flying machines: are they all the same?

Requirements and program behavior

QoS

– Application response: latency + performance

– Differential services in a multiple-applications environment

– Priority of programs

– Priority of data

– Lossless or loss-tolerant (e.g. music, video, communication…)

– Power constraints

– Reliability

14

Parallelism

– Thread-level parallelism: many/huge numbers of threads, 10s or 1000s

– Data-level parallelism: vector/SIMD

– Instruction-level parallelism: ILP

– Pipelining (one dimension or a few dimensions: graphics/systolic algorithms)

Requirements and program behavior

Remember Amdahl's law

15
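For reference, Amdahl's law ties all of these back to the serial fraction: with parallel fraction $p$ on $n$ cores,

\[
\mathrm{Speedup}(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n \to \infty} \mathrm{Speedup}(n) = \frac{1}{1-p}
\]

For example, $p = 0.95$ caps the speedup at 20x no matter how many cores are added, which is why serial phases dominate the scheduling discussion later in the deck.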

Requirements and program behavior (cont.)

– Memory BW

– Computation vs. memory access ratio, e.g. Bytes/FLOP (e.g. 0.05 Byte/FLOP)

– Streaming, e.g. media, communication, stock-exchange data

– Computation behavior: FP, integer, branches, loops

– Locality of program/data: impact on caches, TLBs, memory

– Sharing

– Footprint: program/data

16

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

Memory Wall

NUMA implications

Scheduling

Conclusions

17

CMP implication: Same ISA / ISA extensions / Application-Specific Accelerators

Analog Circuit Paradigm

GBWP = Gain-Bandwidth Product = constant at a given technology

[Plot: gain vs. frequency; two operating points (Gain1, BW1) and (Gain2, BW2) on the constant-GBWP curve, e.g. Gain1 × BW1 = Gain2 × BW2]

18
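A hedged numeric illustration of the constant-GBWP trade (the values are invented for the example): an amplifier with a 10 MHz gain-bandwidth product can run at high gain and low bandwidth or vice versa,

\[
\mathrm{Gain}_1 \cdot \mathrm{BW}_1 = 100 \times 0.1\,\mathrm{MHz} = \mathrm{Gain}_2 \cdot \mathrm{BW}_2 = 10 \times 1\,\mathrm{MHz} = \mathrm{GBWP} = 10\,\mathrm{MHz}
\]

The next slides map the same trade onto performance vs. application range.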

Analog Circuit Paradigm (cont.)

[Plot: gain vs. frequency; operating points BW1 … BWn along the constant-GBWP curve]

19

Application domain: Same ISA / ISA extensions / Application-Specific Accelerators

[Plot: performance vs. application range; a general-purpose engine covers a wide range at modest performance, while application-specific accelerators (not our generic ISA) reach much higher performance over narrow ranges]

Application-specific engines achieve higher performance/power than general-purpose engines

20

Application domain: Same ISA / ISA extensions / Application-Specific Accelerators

[Plot: performance vs. application range; the original ISA spans the broad range, while general-purpose engines with ISA extensions (ISA extension A, …) lift performance over parts of it]

21

The memory Wall

Many execution engines → more performance → more memory accesses

How to reduce the performance impact of memory latency?

– Caches

– Threads

22

The memory Wall: the Cache solution

The "multi-core" machines' solution: use the cache as a memory-access "filter"

Implications:

– Number of cores: cache area limits the number of cores

– BW: reduces memory-access bandwidth

– Coherency: in a flat memory structure, caches need a coherency mechanism

– Cache access time (see example: Nahalal)

23
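To make the cache-as-filter point concrete, here is a minimal sketch of the standard average-memory-access-time (AMAT) model; all latencies and miss rates are invented placeholders, not numbers from the talk.

# Minimal AMAT sketch: the cache "filters" accesses, so only misses pay
# the full memory latency, and only misses consume memory bandwidth.
# All numbers below are illustrative assumptions.

def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average memory access time for a single cache level."""
    return hit_cycles + miss_rate * miss_penalty_cycles

print(amat(3, 0.02, 200))  # 2% miss rate  -> 7.0 cycles on average
print(amat(3, 0.20, 200))  # 20% miss rate -> 43.0 cycles on average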

Example: Nahalal; Shared $ vs. Private $

[Aerial view of the Nahalal cooperative village, the layout's inspiration]

[Diagrams: a conventional CMP (cores, interconnect, MC, shared $, memory) alongside the Nahalal cache organization: CPUs CPU0–CPU7 in a ring around a central shared cache (S$), with private caches (P$) on the outside]

24
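A hedged back-of-the-envelope model of why a central shared cache helps: hits to shared data pay a short, uniform distance instead of an average over far banks. The latencies and the shared-access fraction below are invented for illustration, not taken from the Nahalal work.

# Toy model: average L2 hit time when shared data sits in one central
# bank (Nahalal-style) vs. spread over distant banks. All numbers are
# illustrative assumptions.

def avg_hit_time(f_shared: float, t_shared: float, t_private: float) -> float:
    """Hit latency averaged over shared-data and private-data hits."""
    return f_shared * t_shared + (1.0 - f_shared) * t_private

F_SHARED = 0.3                        # assumed fraction of shared-data hits
print(avg_hit_time(F_SHARED, 10, 8))  # central shared bank -> 8.6 cycles
print(avg_hit_time(F_SHARED, 25, 8))  # far shared banks    -> 13.1 cycles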

Nahalal Cache Performance vs. CIM

27.41% improvement in average cache access time; 41.1% in apache

[Bar chart: average cache hit access time (cycles, 0–50) for CIM vs. NAHALAL on equake, fma3d, barnes, water, apache, zeus, and specjbb, with per-benchmark improvements of 3.9%, 8.57%, 40.53%, 41.1%, 29.06%, 29.35%, and 39.4%]

[Diagram: CIM layout with CPU0–CPU7 surrounding cache banks Bank0–Bank7]

25

The memory Wall: the Threads solution

The "multi-threaded" machines' solution: hide memory latency behind very many threads

With 1000s of threads, no cache is valuable/needed

Implications:

– Applications: 1000s of concurrent threads

– BW: major impact on memory-access bandwidth

– Copies: need to keep many threads' register files (RFs) and states

– Exception handling: complex structure

26
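A minimal sketch of the latency-hiding arithmetic behind this approach: with enough ready threads per core, memory latency overlaps with other threads' compute. The cycle counts are invented for illustration.

# If each thread computes for C cycles between memory accesses, and an
# access takes L cycles, roughly 1 + L/C ready threads keep one core
# busy with no cache at all. Numbers are illustrative assumptions.

def threads_to_hide_latency(compute_cycles: int, mem_latency_cycles: int) -> int:
    return 1 + mem_latency_cycles // compute_cycles

print(threads_to_hide_latency(10, 400))  # -> 41 threads per core
print(threads_to_hide_latency(2, 400))   # -> 201 threads per core

Multiplied across tens of cores, this is exactly the 1000s-of-threads regime the slide describes, and every miss now goes to memory, hence the bandwidth impact.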

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

27

Nahalal cache structure vs. Asymmetric NUMA architecture

[Diagram, left: the Nahalal cache structure: CPUs, each with a private cache, around a shared cache. Right: an asymmetric NUMA architecture: CPU clusters, each with private memory, sharing a "shared memory" through a bridge]

28

NUMA implication: a platform view

[Diagram: CPUs with private cache and memory on one side of a bridge; I/Os, accelerators, communication, storage, and external devices on the other; the link is the NUMA boundary]

29

Platform NUMA

[Diagram: CPUs and memory behind a bridge; I/Os, accelerators, communication, storage, and external devices with platform memory (aka M3) on the other side]

• Can we improve platform data movements?

• Reduce unnecessary traffic

• Efficient on-loading

30

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

31

Task Scheduling

Current platform task scheduling is performed by the OS, based on:

– Available resources

– Ready tasks

Are there other dimensions?

– Scheduling based on the thread's essence

– Can we impact scheduling based on "the data content"?*

*Commex Technologies, patent pending

32

33

ACCMP - Asymmetric Cluster CMP

Big cores and small cores, same ISA

Serial phases execute on the large core

Parallel phases execute on all cores

[Diagram: execution alternating serial and parallel phases; the large core is labeled with parameters a and β]

34

ACCMP - Analysis

ACCMP Performance vs. Power

[Plot: relative performance (1–5) vs. relative power (0–25) for a Symmetric Upper Bound and asymmetric configurations (a=1, β=4) and (a=0.33, β=6); the asymmetric curves lie above the symmetric bound]

ACCMP delivers more performance per unit power!

35
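A sketch of the Amdahl-style modeling behind charts like this one; the formula below is a common textbook variant for asymmetric CMPs, not necessarily the exact model used in the talk, and the parameter values are invented.

# Amdahl-style ACCMP sketch: one big core (relative performance beta)
# runs serial phases; parallel phases run on n small cores plus the
# big core. f is the parallel fraction.

def symmetric_speedup(f: float, n: int) -> float:
    return 1.0 / ((1.0 - f) + f / n)

def asymmetric_speedup(f: float, n: int, beta: float) -> float:
    # Serial phase sped up by beta; parallel phase uses the n small
    # cores plus the big core as beta small-core equivalents.
    return 1.0 / ((1.0 - f) / beta + f / (n + beta))

f, n = 0.9, 16
print(symmetric_speedup(f, n))        # -> 6.4
print(asymmetric_speedup(f, n, 4.0))  # -> ~14.3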

Multiple Applications Scheduling

[Diagram: timelines of App-A and App-B over time t]

Observation: serial threads are the bottleneck of applications

Current OSes grant serial threads the same priority as parallel threads, resulting in lower system throughput and unfairness

Solution: boost the priority of serial threads

ACCMP: serial threads are granted the more powerful cores

Benchmark throughput

[Chart: throughput of synthetic benchmarks "A" and "B" as each benchmark's serial-to-parallel ratio sweeps from ∞:1 to 1:∞]

36
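A minimal sketch of the boost-serial-threads policy; the thread names, the is_serial flag, and the core labels are invented for illustration.

# Toy scheduler: serial threads get the big core (the priority boost);
# parallel threads share the small cores. All names are assumptions.

from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    is_serial: bool  # e.g. the only runnable thread of its process

def assign(threads, big_cores, small_cores):
    placement, big = {}, list(big_cores)
    for t in sorted(threads, key=lambda t: not t.is_serial):  # serial first
        if t.is_serial and big:
            placement[t.name] = big.pop(0)  # boost: powerful core
        else:
            placement[t.name] = small_cores[len(placement) % len(small_cores)]
    return placement

threads = [Thread("A-serial", True), Thread("B-worker1", False),
           Thread("B-worker2", False)]
print(assign(threads, ["BIG0"], ["SMALL0", "SMALL1", "SMALL2"]))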

Applications' data requirements

Differential services

– Some data is latency-intolerant (e.g. banking transactions)

– Some data is bandwidth-intolerant (e.g. video)

– Some data is more "valuable"…

Why not handle data according to its "value" (in addition to its processing requirements)?

37

Data Content Aware (DCA) Architecture*

Task assignment based on data content

[Diagram: a system without DCA vs. a system with DCA, where a content-aware (CA) stage classifies work before it reaches the cores]

*Commex Technologies, patent pending

38
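The Commex classifier itself is patent-pending and not described in the deck, so as a generic illustration this sketch steers small, latency-sensitive payloads to a dedicated fast queue; the classification rule is an invented stand-in.

# Generic sketch of content-aware task assignment: classify incoming
# work by its data content and steer latency-sensitive items to a fast
# path. The rule below is an invented placeholder, not Commex's
# proprietary classifier.

from collections import deque

fast_queue = deque()  # served by a powerful/reserved core
bulk_queue = deque()  # served by the remaining cores

def classify(payload: bytes) -> str:
    # Invented rule: small transactional messages are latency-sensitive,
    # large streaming payloads are bandwidth-oriented.
    return "latency" if len(payload) < 256 else "bulk"

def dispatch(payload: bytes) -> None:
    (fast_queue if classify(payload) == "latency" else bulk_queue).append(payload)

dispatch(b"BUY 100 ACME")   # -> fast_queue
dispatch(bytes(64 * 1024))  # -> bulk_queue
print(len(fast_queue), len(bulk_queue))  # 1 1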

Commex Classification vs. Symmetric

[Plot: latency (µsec, 0–5000) vs. load (PPS, 0–2,000,000). With Commex classification, latency plateaus around 900–940 µsec as load grows (620, 640, 770, …, 940); the symmetric system climbs from 580 through 1200, 1517, and 1900 to 2740 µsec, ending in no service without content awareness. Test does not include UDP zero copy.]

39

Latency vs. load

40

Conclusions

The environment is changing

– Many open issues → interesting research topics

– Core homogeneity?

– Computation assignment and scheduling?

– Thread-essence based?

– Data-content based?

– Cache or no cache?

– Cache structure?

– …

41

Thanks: this research is related to the work of Zvika Guz, Tomer Morad, Leonid Azriel, and Commex Technologies

Thank you

42
