Computer Science and Engineering Advanced Computer Architecture CSE 8383 April 17, 2008 Session 11



Contents

1. Multi-Core

Why now?

A Paradigm Shift

Multi-Core Architecture

2. Case Studies

IBM Cell

Intel Core 2 Duo

AMD


The Path to Multi-Core


Background

Wafer: A thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed.

Die Size: The physical surface area of the processor on the wafer, typically measured in square millimeters (mm^2). In essence, a "die" is a chip. The smaller the chip, the more of them can be made from a single wafer.

Circuit Size: The level of miniaturization of the processor. To pack more transistors into the same space, they must be made continually smaller. Measured in microns (µm) or nanometers (nm).


Examples

386C: die size 42 mm^2, 1.0 µm technology, 275,000 transistors

486C: die size 90 mm^2, 0.7 µm technology, 1.2 million transistors

Pentium: die size 148 mm^2, 0.5 µm technology, 3.2 million transistors

Pentium III: die size 106 mm^2, 0.18 µm technology, 28 million transistors


Pentium III (0.18 µm process technology)

Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32



Process Technology (nm)       90   65   45   32   22
Integration Capacity (BT)      2    4    8   16   32

(BT = billions of transistors)


Increasing Die Size

Using the same technology

Increasing the die size by 2-3X yields only a 1.5-1.7X gain in performance.

Power is proportional to Die-area * Frequency

We cannot produce microprocessors with ever-increasing die size. The constraint is POWER.
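The trade-off above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming power scales exactly as die area times frequency, that frequency stays fixed (same technology), and that the 2X die pairs with the 1.5X gain and the 3X die with the 1.7X gain. The function name and that pairing are illustrative assumptions, not taken from the slides.

```python
# Illustrative arithmetic only: power is taken as proportional to
# die area * frequency, as stated on the slide.
def perf_per_watt_ratio(area_scale: float, perf_scale: float,
                        freq_scale: float = 1.0) -> float:
    """Relative performance per watt after growing the die."""
    power_scale = area_scale * freq_scale
    return perf_scale / power_scale

# Assumed pairing of the two ends of the quoted ranges.
small = perf_per_watt_ratio(area_scale=2.0, perf_scale=1.5)
large = perf_per_watt_ratio(area_scale=3.0, perf_scale=1.7)

print(f"2X die: {small:.2f}x performance per watt")
print(f"3X die: {large:.2f}x performance per watt")
```

Under these assumptions, performance per watt drops in both cases, which is why a bigger die is not a sustainable route to more performance.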


Reducing Circuit Size

Reducing circuit size in particular is key to reducing the size of the chip.

The first generation Pentium used a 0.8 micron circuit size, and required 296 square millimeters per chip.

The second generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.


Shrinking transistors by 30% every generation means that transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases.

Gate thickness cannot keep on shrinking, which means a slowing frequency increase and less threshold voltage reduction.


Processor Evolution

• Gate delay reduces by 1/√2 (frequency up by √2)
• Number of transistors in a constant area goes up by 2 (deeper pipelines, ILP, more caches)
• Additional transistors enable an additional increase in performance
• Result: 2x performance at roughly equal cost

[Figure: Generation i (0.5 µm, for example) scaled to Generation i+1 (0.35 µm, for example)]
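The per-generation factors follow from the 30% linear shrink. The sketch below works them out, assuming gate delay tracks the linear feature size. That is a simplification, and the function and key names are hypothetical.

```python
SHRINK = 0.7  # linear feature size scales by ~0.7x per generation (30% shrink)

def scale_one_generation(feature_um: float) -> dict:
    """Idealized scaling factors for one process generation."""
    area_per_transistor = SHRINK ** 2  # ~0.49x the area per transistor
    gate_delay = SHRINK                # assumed: delay tracks feature size
    return {
        "feature_um": feature_um * SHRINK,
        "density_gain": 1 / area_per_transistor,  # ~2x transistors per area
        "frequency_gain": 1 / gate_delay,         # ~1.43x, close to sqrt(2)
    }

result = scale_one_generation(0.5)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Starting from 0.5 µm, this gives the 0.35 µm feature size of the next generation, roughly twice the transistors in the same area, and a frequency gain close to √2.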


What happens to power if we hold die size constant at each generation?

Allows ~ 100% growth in transistors each generation



What happens to die Size if we hold power constant at each generation?

Die size has to shrink ~25% in area each generation, which allows only ~50% growth in transistors. This limits PERFORMANCE, and power density is still a problem.


Power Density continues to soar

[Figure: power density trend across processor generations (Pentium at 90 W)]

Source: Pat Gelsinger, Intel Developer Forum, Spring 2004


Business as Usual won’t work: Power is a Major Barrier

As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges.

Higher costs:
• Thermal packaging
• Fans
• Electricity
• Air conditioning


A new Paradigm Shift

Old Paradigm: Performance == improved frequency, unconstrained power, voltage scaling

New Paradigm: Performance == improved IPC, multi-core, power-efficient micro-architecture advancement


Multiple CPUs on a Single Chip

An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die.


Multi-core

• Gate delay does not reduce much
• The frequency and performance of each core is the same as or a little less than the previous generation

[Figure: one Generation i core in technology generation i becomes multiple Generation i cores on the same die in technology generation i+1]


From HT to Many-Core

[Figure: increasing number of hardware threads per chip (log scale, 1 to 100) versus year, 2003 to 2013. Hyper-Threading (HT) is followed by the multi-core era (scalar and parallel applications) and then the many-core era (massively parallel applications).]

Intel predicts 100’s of cores on a chip in 2015.

Multi-cores are Reality

[Figure: # of cores over time]

Source: Saman Amarasinghe, MIT (6.189 2007, lecture-1)


Multi-Core Architecture


Multi-core Architecture

Multiple cores are being integrated on a single chip and made available for general purpose computing

Higher levels of integration: multiple processing cores, caches, memory controllers, and some I/O processing

Network on Chip (NoC)


[Figure: two organizations in which processors (P) and memory modules (M) are connected through interconnection networks]

Shared memory

• One copy of data shared among multiple cores

• Synchronization via locking

• Example: Intel

Distributed memory

• Cores access local data

• Cores exchange data
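The two styles can be contrasted in a small sketch. Python threads stand in for cores here, which is only an analogy: the function names are hypothetical and nothing in it is specific to any processor discussed in these slides. The first function keeps one shared copy of the result guarded by a lock, and the second lets each worker operate on its own local data and exchange results as messages.

```python
import queue
import threading

def shared_memory_sum(values, workers=4):
    """Shared memory style: one copy of `total`, synchronized by a lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization via locking
            total += partial

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def message_passing_sum(values, workers=4):
    """Distributed memory style: local data, results exchanged as messages."""
    mailbox = queue.Queue()

    def work(local_chunk):
        mailbox.put(sum(local_chunk))  # send the partial result

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in range(workers))

data = list(range(100))
print(shared_memory_sum(data), message_passing_sum(data))
```

Both functions compute the same sum. The difference is where the data lives and how the workers coordinate.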


Memory Access Alternatives

                             Global Memory                      Distributed Memory
Shared address space         SMP (Symmetric Multiprocessors)    DSM (Distributed Shared Memory)
Distributed address space    -                                  MP (Message Passing)


Network on Chip (NoC)

[Figure: a traditional bus (control, data, I/O) compared with a switch network]


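Shared Memory

[Figure: three shared-memory organizations; P = processor, PC = primary cache, SC = secondary cache]

Shared Global Memory: each processor has its own primary and secondary cache; all processors share the global memory.

Shared Secondary Cache: each processor has its own primary cache; all processors share one secondary cache in front of the global memory.

Shared Primary Cache: the processors share one primary cache, which sits in front of the secondary cache and the global memory.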


General Architecture

Conventional Microprocessor: one CPU core (registers, L1 I$, L1 D$) with an L2 cache, connected to main memory and I/O.

Multiple cores: CPU cores 1 to N, each with its own registers, L1 I$, L1 D$, and L2 cache, connected to shared main memory and I/O.


General Architecture (cont)

Shared Cache: CPU cores 1 to N, each with its own registers, L1 I$, and L1 D$, share a single L2 cache in front of main memory and I/O.

Multithreaded Shared Cache: CPU cores 1 to N, each with several register sets (regs), L1 I$, and L1 D$, share a single L2 cache in front of main memory and I/O.


“Case Studies”


Case Study 1: “IBM’s Cell Processor”


Cell Highlights

Supercomputer on a chip

Multi-core microprocessor (9 cores)

>4 GHz clock frequency

10X performance for many applications


Key Attributes

Cell is Multi-core
- Contains 64-bit Power Architecture
- Contains 8 synergistic processor elements

Cell is a Broadband Architecture
- SPE is RISC architecture with SIMD organization and local store
- 128+ concurrent transactions to memory per processor

Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Locking caches (via replacement management table)

Cell is a Security Enabled Architecture
- Isolate SPE for flexible security programming


Cell Processor Components


Cell BE Processor Block Diagram


POWER Processing Element (PPE)

POWER Processing Unit (PPU) connected to a 512KB L2 cache.

Responsible for running the OS and coordinating the SPEs.

Key design goals: maximize the performance/power ratio as well as the performance/area ratio.

Dual-issue, in-order processor with dual-thread support

Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions.


Synergistic Processing Elements (SPE)

Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations

Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).

Compute engine with SIMD support and 256KB of dedicated local storage.

The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synchronization operations with other SPUs and the PPU.


SPE (cont.)

Each SPE operates directly on instructions and data from its dedicated local store.

SPEs rely on a channel interface to access the main memory and other local stores.

The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.

SIMD support can perform operations on 16 8-bit, 8 16-bit, 4 32-bit integers, or 4 single-precision floating-point numbers per cycle.

At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations per second, or 25.6 GFLOPS in single precision.
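The peak figures follow from lanes per cycle times clock rate. In the sketch below, the factor of two for single precision assumes each lane completes a fused multiply-add per cycle, counted as two floating-point operations. That factor is not stated on the slide, and the names are illustrative.

```python
CLOCK_GHZ = 3.2

# SIMD lanes per 128-bit operation, from the element widths listed above.
LANES = {"int8": 16, "int16": 8, "int32": 4, "fp32": 4}

def peak_gops(kind: str, ops_per_lane: int = 1) -> float:
    """Peak giga-operations per second for one SPU."""
    return LANES[kind] * ops_per_lane * CLOCK_GHZ

int8_gops = peak_gops("int8")
# Assumption: one multiply-add per lane per cycle counts as 2 flops.
fp32_gflops = peak_gops("fp32", ops_per_lane=2)

print(f"8-bit integer: {int8_gops:.1f} Gops")
print(f"single precision: {fp32_gflops:.1f} GFLOPS")
```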


Four Levels of Parallelism

Blade level: 2 Cell processors per blade

Chip level: 9 cores

Instruction level: dual-issue pipelines on each SPE

Register level: native SIMD on SPE and PPE VMX


Cell Chip Floor plan


Element Interconnect Bus (EIB)

Implemented as a ring

Interconnects 12 elements:
• 1 PPE with 51.2GB/s aggregate bandwidth
• 8 SPEs, each with 51.2GB/s aggregate bandwidth
• MIC: 25.6GB/s of memory bandwidth
• 2 IOIF: 35GB/s (out), 25GB/s (in) of I/O bandwidth

Supports two transfer modes:
• DMA between SPEs
• MMIO/DMA between PPE and system memory

Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007



An EIB consists of the following:

1. Four 16-byte-wide rings (two in each direction)
1.1 Each ring is capable of handling up to 3 concurrent non-overlapping transfers
1.2 Supports up to 12 data transfers at a time

2. A shared command bus
2.1 Distributes commands
2.2 Sets up end-to-end transactions
2.3 Handles coherency

3. A central data arbiter to connect the 12 Cell elements
3.1 Implemented in a star-like structure
3.2 Controls access to the EIB data rings on a per-transaction basis
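The ring limits above (4 rings, up to 3 non-overlapping transfers each, 12 in total) can be illustrated with a toy arbiter. This is a sketch under stated assumptions, not the actual EIB arbitration logic. The element numbering 0 to 11, the first-fit ring choice, and every name in the code are hypothetical.

```python
from dataclasses import dataclass, field

NUM_ELEMENTS = 12           # elements on the ring, numbered 0..11 (assumed)
MAX_TRANSFERS_PER_RING = 3  # limit quoted in the list above

def segments(src: int, dst: int, clockwise: bool) -> set:
    """Ring segments a transfer occupies; segment i joins elements i and i+1."""
    used, node = set(), src
    while node != dst:
        if clockwise:
            used.add(node)
            node = (node + 1) % NUM_ELEMENTS
        else:
            node = (node - 1) % NUM_ELEMENTS
            used.add(node)
    return used

@dataclass
class Ring:
    clockwise: bool
    active: list = field(default_factory=list)

    def try_grant(self, src: int, dst: int) -> bool:
        wanted = segments(src, dst, self.clockwise)
        if len(self.active) >= MAX_TRANSFERS_PER_RING:
            return False
        if any(wanted & busy for busy in self.active):
            return False  # would overlap a transfer already on this ring
        self.active.append(wanted)
        return True

class Arbiter:
    """Toy central arbiter: first ring that can carry the transfer wins."""
    def __init__(self):
        self.rings = [Ring(True), Ring(True), Ring(False), Ring(False)]

    def request(self, src: int, dst: int):
        if src == dst:
            return None
        for index, ring in enumerate(self.rings):
            if ring.try_grant(src, dst):
                return index
        return None

arbiter = Arbiter()
for src, dst in [(0, 3), (4, 6), (2, 5), (7, 9), (10, 11)]:
    print(f"{src} -> {dst}: ring {arbiter.request(src, dst)}")
```

A transfer that overlaps one already on a ring, or that arrives when the ring already carries three, falls through to the next ring. With four rings the ceiling is 4 x 3 = 12 concurrent transfers.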

Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007




Cell Manufacturing Parameters

About 234 million transistors (compared with 125 million for the Pentium 4); runs at more than 4.0 GHz

Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters

The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process

In March 2007 IBM announced that the 65 nm version of Cell BE (Broadband Engine) is in production


Cell Power Consumption

Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz

Including the eight SPEs, the PPE, and other logic, the Cell processor will dissipate close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
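A short calculation shows how much of the chip power the SPEs account for and how steeply power grows with clock rate. The numbers are the ones quoted above; the variable and function names are illustrative.

```python
NUM_SPES = 8
SPE_WATTS = {2: 1, 3: 2, 4: 4}      # per-SPE power at 2, 3, 4 GHz
CHIP_WATTS = {2: 15, 3: 30, 4: 60}  # whole-chip power at 2, 3, 4 GHz

def spe_share(ghz: int) -> float:
    """Fraction of chip power drawn by the eight SPEs together."""
    return NUM_SPES * SPE_WATTS[ghz] / CHIP_WATTS[ghz]

for ghz in (2, 3, 4):
    print(f"{ghz} GHz: SPEs use {spe_share(ghz):.0%} of {CHIP_WATTS[ghz]} W")

# Doubling the clock from 2 GHz to 4 GHz multiplies chip power by this factor.
power_growth = CHIP_WATTS[4] / CHIP_WATTS[2]
print(f"2 GHz -> 4 GHz: {power_growth:.0f}x power")
```

With these figures the SPEs draw a little over half of the chip power at every clock rate, and doubling the frequency quadruples the power.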


Cell Power Management

Dynamic Power Management (DPM)

Five Power Management States

One linear sensor

Ten digital thermal sensors


Case Study 2: “Intel’s Core 2 Duo”


Intel Core 2 Duo Highlights

Multi-core microprocessor (2 cores)

Clock frequency ranges from 1.5 to 3 GHz

2X performance for many applications

Dedicated level 1 cache and shared level 2 cache

Its shared L2 cache comes in two flavors: 2MB and 4MB, depending on the model

It supports 64-bit architecture


Intel Core 2 Duo Block Diagram

Dedicated L1$

Shared L2$

The two cores exchange data implicitly through the shared level 2 cache


Intel Core 2 Duo Architecture

Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory

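[Figure: data sharing between cores without and with a shared L2 cache]

Without a shared cache: Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory.

With the shared L2 cache: only one copy needs to be retrieved.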


Intel’s Core 2 Duo Manufacturing Parameters

About 291 million transistors

Compared to Cell’s 221 square millimeters, Core 2 Duo has a smaller die size, between 107 and 143 square millimeters depending on the model.

Intel’s current process technology for the dual core ranges between 65 nm and 45 nm (2007), with an estimate of 155 million transistors.


Intel Core 2 Duo Power Consumption

Power consumption in Core 2 Duo ranges from 65 W to 130 W, depending on the model.

Assuming a 75 W processor model (Conroe is 65 W), it will cost about $4 to keep the computer running for a whole month.
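The monthly figure can be reproduced with a simple energy calculation. The slide does not give an electricity price, so the rate below is a hypothetical value chosen only to show how a result near $4 arises. The function name is also illustrative.

```python
def monthly_cost_usd(watts: float, usd_per_kwh: float,
                     hours_per_day: float = 24, days: int = 30) -> float:
    """Energy cost of running a constant load for one month."""
    kwh = watts * hours_per_day * days / 1000
    return kwh * usd_per_kwh

ASSUMED_RATE = 0.075  # USD per kWh; hypothetical, not from the slides

cost = monthly_cost_usd(75, ASSUMED_RATE)
print(f"75 W for 30 days: ${cost:.2f}")
```

A 75 W load running continuously for 30 days uses 54 kWh, so the cost scales directly with whatever local rate applies.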


Intel Core 2 Duo Power Management

It uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements)

Aggressive clock gating

Enhanced Speed-Step

Low VCC Arrays

Blocks controlled via sleep transistors

Low leakage transistors


Case Study 3: “AMD’s Quad-Core Processor (Barcelona)”


AMD Quad-Core Highlights

Designed to enable simultaneous 32- and 64-bit computing

Minimizes the cost of transition and maximizes current investments

Integrated DDR2 Memory Controller

Increases application performance by dramatically reducing memory latency

Scales memory bandwidth and performance to match compute needs

HyperTransport Technology: provides up to 24.0 GB/s peak bandwidth per processor, reducing I/O bottlenecks


AMD Quad-Core Block Diagram

Dedicated L1$ and L2$

Shared L3$


AMD Quad-Core Architecture

It has a crossbar switch instead of the usual bus used in dual core processors

It lowers the probability of having memory access collisions

The L3$ alleviates memory access latency, since the higher number of cores makes memory accesses more likely


AMD Quad-Core Architecture (cont)

Replacement policies:
• L1, L2: pseudo-LRU
• L3: sharing-aware pseudo-LRU

Cache hierarchy:

Dedicated L1 cache
• 2-way associative
• 8 banks (each 16B wide)

Dedicated L2 cache
• 16-way associative
• Victim cache, exclusive w.r.t. L1

Shared L3 cache
• 32-way associative
• Fills from L3 leave likely shared lines in L3
• Victim cache, partially exclusive w.r.t. L2
• Sharing-aware replacement policy
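The exclusive victim-cache relationship between L1 and L2 can be sketched as follows. This is a simplified model, not AMD's implementation: both levels are treated as fully associative, true LRU stands in for pseudo-LRU, and the class and method names are hypothetical.

```python
from collections import OrderedDict

class ExclusiveTwoLevel:
    """Toy L1 plus an exclusive L2 victim cache (a line lives in one level only)."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1 = OrderedDict()  # oldest entry first, so LRU is at the front
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def _install_l1(self, addr):
        self.l1[addr] = True
        if len(self.l1) > self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict LRU line from L1
            self.l2[victim] = True                   # the victim moves to L2
            if len(self.l2) > self.l2_lines:
                self.l2.popitem(last=False)          # L2 drops its LRU line

    def access(self, addr) -> str:
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "L1 hit"
        if addr in self.l2:
            del self.l2[addr]        # exclusive: the line leaves L2
            self._install_l1(addr)
            return "L2 hit"
        self._install_l1(addr)       # fill from memory goes to L1 only
        return "miss"

cache = ExclusiveTwoLevel(l1_lines=2, l2_lines=4)
for addr in (1, 2, 3, 1):
    print(addr, cache.access(addr))
```

Because a line is never held in both levels at once, the effective capacity is the sum of the two caches rather than the size of the larger one.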


AMD Quad-Core Manufacturing Parameters

The current AMD process technology for Quad-Core is 65nm

It comprises approximately 463M transistors (about 119M fewer than Intel’s quad-core Kentsfield)

It has a die size of 285 square millimeters (Compared to Cell’s 221 square millimeters)


AMD Quad-Core Power Consumption

Power consumption in AMD Quad-Core ranges from 68 W to 95 W (compared to 65 W to 130 W for Intel’s Core 2 Duo), depending on the model.

AMD CoolCore Technology

Reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power

Power can be switched on or off within a single clock cycle, saving energy with no impact to performance


AMD Quad-Core Power Management

Native quad-core technology enables enhanced power management across all four cores
