
Uri Weiser, EE Technion

Alternative Computing Day February 9th, 2009

Asymmetric Chip Multi-Core: The Future Chip Multiprocessor

1

Sailing Basics

[Diagram: boat, buoy, and wind direction]

2

Sailing - wind shift

[Diagram: the course after the wind shift; buoy and wind direction marked]

3


25th America's Cup, September 23rd 1983: Liberty against Australia II

Best-of-seven race series; score 3:0 to Liberty; the 4th race started

4

Sailing competition: getting there first

[Diagram: Boat 1 and Boat 2 on a course with a start line, buoy, and wind direction marked]

– What is the strategy of Boat 1?

– What is the strategy of Boat 2?

Do not follow. Invent!

5

Outline

I selected the other way, but not as far as alternative computing

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

6

Cores implications

CMP has been ubiquitous since Paul Otellini announced the Core 2 Duo in June 2006

Reasons: Power wall

– Single-core performance/power trend

– Process technology

Open questions

– How many cores? 10s? 1000s? Implications!

– Cache architecture?

– Are the cores symmetric (aka homogeneous) or asymmetric?

– If Asymmetric

– Same ISA

– Different ISA extensions

– Application specific accelerators

– Task/threads scheduling

7


NUMA implications

The second boat chose NUMA in 2007

Reasons: Integration and Performance

Implications

– Synchronize Memory Controllers (MC)

– Future: multiple MCs on die

Open issues/opportunities

– All memory traffic runs through the CPU

– Task/Threads assignments and scheduling

[Diagram: two NUMA nodes, each with four cores sharing a cache ($) and local memory (Mem), connected to each other]

8
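One practical consequence of on-die NUMA is thread placement: a thread runs fastest on cores near the memory node that holds its data. Below is a minimal, Linux-only sketch using Python's os.sched_setaffinity; the node-to-core mapping is an invented example (real systems expose it via /sys/devices/system/node or libnuma).

import os

# Assumed topology for illustration only: node 0 owns cores 0-3,
# node 1 owns cores 4-7.
NODE_CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def pin_to_node(node: int) -> None:
    # Restrict the calling process (pid 0) to one node's cores so its
    # memory accesses stay local to that node's memory controller.
    os.sched_setaffinity(0, NODE_CORES[node])

pin_to_node(0)
print(os.sched_getaffinity(0))  # -> {0, 1, 2, 3}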

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

9

The Computer runs multiple programs

CPUs & Caches

Memory

Mass storage

?

Fast I/Oe.g. camera, Accelerators

Slow I/Oe.g. human interface

Display

Communication

Computation and “scratch pad” The “outside” world

Computation enginese.g. Graphics

engine, Media, Communication

engines,

10

Flying machines: are they all the same?

11

12

Flying machines: are they all the same?

13

Flying machines: are they all the same?

Requirements and program behavior

QoS

– Application response: latency + performance

– Differential services in a multiple-applications environment

– Priority of programs

– Priority of data

– Lossless or loss-tolerant (e.g. music, video, communication…)

– Power constraints

– Reliability

14

Parallelism

– Thread-level parallelism: many/huge numbers of threads, 10s or 1000s

– Data-level parallelism: vector/SIMD

– Instruction-level parallelism: ILP

– Pipelining (one dimension or a few dimensions: graphics/systolic algorithms)

Requirements and program behavior

Remember Amdahl's law

15
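For reference, Amdahl's law ties all of these back to the serial fraction: with parallel fraction $p$ on $n$ cores,

\[
\mathrm{Speedup}(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n \to \infty} \mathrm{Speedup}(n) = \frac{1}{1-p}
\]

For example, $p = 0.95$ caps the speedup at 20x no matter how many cores are added, which is why serial phases dominate the scheduling discussion later in the deck.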

Requirements and program behavior (cont.)

– Memory BW

– Computation vs. memory access ratio, e.g. Bytes/FLOP (e.g. 0.05 Byte/FLOP)

– Streaming, e.g. media, communication, stock-exchange data

– Computation behavior: FP, integer, branches, loops

– Locality of program/data: impact on caches, TLBs, memory

– Sharing

– Footprint: program/data

16

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

Memory Wall

NUMA implications

Scheduling

Conclusions

17

CMP implication: Same ISA / ISA extensions / Application-Specific Accelerators

Analog Circuit Paradigm

GBWP = Gain-Bandwidth Product = constant at a given technology

[Plot: gain vs. frequency; two operating points (Gain1, BW1) and (Gain2, BW2) on the constant-GBWP curve, e.g. Gain1 × BW1 = Gain2 × BW2]

18
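A hedged numeric illustration of the constant-GBWP trade (the values are invented for the example): an amplifier with a 10 MHz gain-bandwidth product can run at high gain and low bandwidth or vice versa,

\[
\mathrm{Gain}_1 \cdot \mathrm{BW}_1 = 100 \times 0.1\,\mathrm{MHz} = \mathrm{Gain}_2 \cdot \mathrm{BW}_2 = 10 \times 1\,\mathrm{MHz} = \mathrm{GBWP} = 10\,\mathrm{MHz}
\]

The next slides map the same trade onto performance vs. application range.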

Analog Circuit Paradigm (cont.)

[Plot: gain vs. frequency; operating points BW1 … BWn along the constant-GBWP curve]

19

Application domain: Same ISA / ISA extensions / Application-Specific Accelerators

[Plot: performance vs. application range; a general-purpose engine covers a wide range at modest performance, while application-specific accelerators (not our generic ISA) reach much higher performance over narrow ranges]

Application-specific engines achieve higher performance/power than general-purpose engines

20

Application domain: Same ISA / ISA extensions / Application-Specific Accelerators

[Plot: performance vs. application range; the original ISA spans the broad range, while general-purpose engines with ISA extensions (ISA extension A, …) lift performance over parts of it]

21

The memory Wall

Many execution engines → more performance → more memory accesses

How to reduce the performance impact of memory latency?

– Caches

– Threads

22

The memory Wall: the Cache solution

The "multi-core" machines' solution: use the cache as a memory-access "filter"

Implications:

– Number of cores: cache area limits the number of cores

– BW: reduces memory-access bandwidth

– Coherency: in a flat memory structure, caches need a coherency mechanism

– Cache access time (see example: Nahalal)

23
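To make the cache-as-filter point concrete, here is a minimal sketch of the standard average-memory-access-time (AMAT) model; all latencies and miss rates are invented placeholders, not numbers from the talk.

# Minimal AMAT sketch: the cache "filters" accesses, so only misses pay
# the full memory latency, and only misses consume memory bandwidth.
# All numbers below are illustrative assumptions.

def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average memory access time for a single cache level."""
    return hit_cycles + miss_rate * miss_penalty_cycles

print(amat(3, 0.02, 200))  # 2% miss rate  -> 7.0 cycles on average
print(amat(3, 0.20, 200))  # 20% miss rate -> 43.0 cycles on average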

Example: Nahalal; Shared $ vs. Private $

[Aerial view of the Nahalal cooperative village, the layout's inspiration]

[Diagrams: a conventional CMP (cores, interconnect, MC, shared $, memory) alongside the Nahalal cache organization: CPUs CPU0–CPU7 in a ring around a central shared cache (S$), with private caches (P$) on the outside]

24
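A hedged back-of-the-envelope model of why a central shared cache helps: hits to shared data pay a short, uniform distance instead of an average over far banks. The latencies and the shared-access fraction below are invented for illustration, not taken from the Nahalal work.

# Toy model: average L2 hit time when shared data sits in one central
# bank (Nahalal-style) vs. spread over distant banks. All numbers are
# illustrative assumptions.

def avg_hit_time(f_shared: float, t_shared: float, t_private: float) -> float:
    """Hit latency averaged over shared-data and private-data hits."""
    return f_shared * t_shared + (1.0 - f_shared) * t_private

F_SHARED = 0.3                        # assumed fraction of shared-data hits
print(avg_hit_time(F_SHARED, 10, 8))  # central shared bank -> 8.6 cycles
print(avg_hit_time(F_SHARED, 25, 8))  # far shared banks    -> 13.1 cycles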

Nahalal Cache Performance vs. CIM

27.41% improvement in average cache access time; 41.1% in apache

[Bar chart: average cache hit access time (cycles, 0–50) for CIM vs. NAHALAL on equake, fma3d, barnes, water, apache, zeus, and specjbb, with per-benchmark improvements of 3.9%, 8.57%, 40.53%, 41.1%, 29.06%, 29.35%, and 39.4%]

[Diagram: CIM layout with CPU0–CPU7 surrounding cache banks Bank0–Bank7]

25

The memory Wall: the Threads solution

The "multi-threaded" machines' solution: hide memory latency behind very many threads

With 1000s of threads, no cache is valuable/needed

Implications:

– Applications: 1000s of concurrent threads

– BW: major impact on memory-access bandwidth

– Copies: need to keep many threads' register files (RFs) and states

– Exception handling: complex structure

26
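A minimal sketch of the latency-hiding arithmetic behind this approach: with enough ready threads per core, memory latency overlaps with other threads' compute. The cycle counts are invented for illustration.

# If each thread computes for C cycles between memory accesses, and an
# access takes L cycles, roughly 1 + L/C ready threads keep one core
# busy with no cache at all. Numbers are illustrative assumptions.

def threads_to_hide_latency(compute_cycles: int, mem_latency_cycles: int) -> int:
    return 1 + mem_latency_cycles // compute_cycles

print(threads_to_hide_latency(10, 400))  # -> 41 threads per core
print(threads_to_hide_latency(2, 400))   # -> 201 threads per core

Multiplied across tens of cores, this is exactly the 1000s-of-threads regime the slide describes, and every miss now goes to memory, hence the bandwidth impact.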

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

27

Nahalal cache structure vs. Asymmetric NUMA architecture

[Diagram, left: the Nahalal cache structure: CPUs, each with a private cache, around a shared cache. Right: an asymmetric NUMA architecture: CPU clusters, each with private memory, sharing a "shared memory" through a bridge]

28

NUMA implication: a platform view

[Diagram: CPUs with private cache and memory on one side of a bridge; I/Os, accelerators, communication, storage, and external devices on the other; the link is the NUMA boundary]

29

Platform NUMA

[Diagram: CPUs and memory behind a bridge; I/Os, accelerators, communication, storage, and external devices with platform memory (aka M3) on the other side]

• Can we improve platform data movements?

• Reduce unnecessary traffic

• Efficient on-loading

30

Outline

Today's environment

– On die: Cores; Platform: NUMA

The application environment

Cores implications

The memory wall

NUMA implications

Scheduling

Conclusions

31

Task Scheduling

Current platform task scheduling is performed by the OS, based on:

– Available resources

– Ready tasks

Are there other dimensions?

– Scheduling based on the thread's essence

– Can we impact scheduling based on "the data content"?*

*Commex Technologies, patent pending

32

33

ACCMP - Asymmetric Cluster CMP

Big cores and small cores, same ISA

Serial phases execute on the large core

Parallel phases execute on all cores

[Diagram: execution alternating serial and parallel phases; the large core is labeled with parameters a and β]

34

ACCMP - Analysis

ACCMP Performance vs. Power

[Plot: relative performance (1–5) vs. relative power (0–25) for a Symmetric Upper Bound and asymmetric configurations (a=1, β=4) and (a=0.33, β=6); the asymmetric curves lie above the symmetric bound]

ACCMP delivers more performance per unit power!

35
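A sketch of the Amdahl-style modeling behind charts like this one; the formula below is a common textbook variant for asymmetric CMPs, not necessarily the exact model used in the talk, and the parameter values are invented.

# Amdahl-style ACCMP sketch: one big core (relative performance beta)
# runs serial phases; parallel phases run on n small cores plus the
# big core. f is the parallel fraction.

def symmetric_speedup(f: float, n: int) -> float:
    return 1.0 / ((1.0 - f) + f / n)

def asymmetric_speedup(f: float, n: int, beta: float) -> float:
    # Serial phase sped up by beta; parallel phase uses the n small
    # cores plus the big core as beta small-core equivalents.
    return 1.0 / ((1.0 - f) / beta + f / (n + beta))

f, n = 0.9, 16
print(symmetric_speedup(f, n))        # -> 6.4
print(asymmetric_speedup(f, n, 4.0))  # -> ~14.3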

Multiple Applications Scheduling

[Diagram: timelines of App-A and App-B over time t]

Observation: serial threads are the bottleneck of applications

Current OSes grant serial threads the same priority as parallel threads, resulting in lower system throughput and unfairness

Solution: boost the priority of serial threads

ACCMP: serial threads are granted the more powerful cores

Benchmark throughput

[Chart: throughput of synthetic benchmarks "A" and "B" as each benchmark's serial-to-parallel ratio sweeps from ∞:1 to 1:∞]

36
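A minimal sketch of the boost-serial-threads policy; the thread names, the is_serial flag, and the core labels are invented for illustration.

# Toy scheduler: serial threads get the big core (the priority boost);
# parallel threads share the small cores. All names are assumptions.

from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    is_serial: bool  # e.g. the only runnable thread of its process

def assign(threads, big_cores, small_cores):
    placement, big = {}, list(big_cores)
    for t in sorted(threads, key=lambda t: not t.is_serial):  # serial first
        if t.is_serial and big:
            placement[t.name] = big.pop(0)  # boost: powerful core
        else:
            placement[t.name] = small_cores[len(placement) % len(small_cores)]
    return placement

threads = [Thread("A-serial", True), Thread("B-worker1", False),
           Thread("B-worker2", False)]
print(assign(threads, ["BIG0"], ["SMALL0", "SMALL1", "SMALL2"]))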

Applications' data requirements

Differential services

– Some data is latency-intolerant (e.g. banking transactions)

– Some data is bandwidth-intolerant (e.g. video)

– Some data is more "valuable"…

Why not handle data according to its "value" (in addition to its processing requirements)?

37

Data Content Aware (DCA) Architecture*

Task assignment based on data content

[Diagram: a system without DCA vs. a system with DCA, where a content-aware (CA) stage classifies work before it reaches the cores]

*Commex Technologies, patent pending

38
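The Commex classifier itself is patent-pending and not described in the deck, so as a generic illustration this sketch steers small, latency-sensitive payloads to a dedicated fast queue; the classification rule is an invented stand-in.

# Generic sketch of content-aware task assignment: classify incoming
# work by its data content and steer latency-sensitive items to a fast
# path. The rule below is an invented placeholder, not Commex's
# proprietary classifier.

from collections import deque

fast_queue = deque()  # served by a powerful/reserved core
bulk_queue = deque()  # served by the remaining cores

def classify(payload: bytes) -> str:
    # Invented rule: small transactional messages are latency-sensitive,
    # large streaming payloads are bandwidth-oriented.
    return "latency" if len(payload) < 256 else "bulk"

def dispatch(payload: bytes) -> None:
    (fast_queue if classify(payload) == "latency" else bulk_queue).append(payload)

dispatch(b"BUY 100 ACME")   # -> fast_queue
dispatch(bytes(64 * 1024))  # -> bulk_queue
print(len(fast_queue), len(bulk_queue))  # 1 1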

Commex Classification vs. Symmetric

[Plot: latency (µsec, 0–5000) vs. load (PPS, 0–2,000,000). With Commex classification, latency plateaus around 900–940 µsec as load grows (620, 640, 770, …, 940); the symmetric system climbs from 580 through 1200, 1517, and 1900 to 2740 µsec, ending in no service without content awareness. Test does not include UDP zero copy.]

39

Latency vs. load

40

Conclusions

The environment is changing

– Many open issues → interesting research topics

– Core homogeneity?

– Computation assignment and scheduling?

– Thread-essence based?

– Data-content based?

– Cache or no cache?

– Cache structure?

– …

41

Thanks: this research is related to the work of Zvika Guz, Tomer Morad, Leonid Azriel, and Commex Technologies

Thank you

42
