Computer Science and Engineering Advanced Computer Architecture CSE 8383 April 17, 2008 Session 11

Page 1: Computer Science and Engineering Advanced Computer Architecture CSE 8383 April 17, 2008 Session 11

Computer Science and Engineering

Advanced Computer Architecture

CSE 8383

April 17, 2008

Session 11

Page 2:

Contents

1. Multi-Core
• Why now?
• A Paradigm Shift
• Multi-Core Architecture

2. Case Studies
• IBM Cell
• Intel Core 2 Duo
• AMD

Page 3:

The Path to Multi-Core

Page 4:

Background

Wafer: A thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed.

Die Size: The physical surface area of the processor on the wafer, typically measured in square millimeters (mm^2). In essence, a "die" is a chip; the smaller the chip, the more of them can be made from a single wafer.

Circuit Size: The level of miniaturization of the processor. To pack more transistors into the same space, they must be made continually smaller. Measured in microns (µm) or nanometers (nm).
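
The link between die size and dies per wafer can be sketched with a simple area ratio. This is only an upper bound under stated assumptions: the 300 mm wafer diameter and the function name are illustrative choices, not values from the slides, and edge loss, scribe lines, and defects are ignored.

```python
import math

def gross_dies_upper_bound(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Upper bound on dies per wafer: wafer area divided by die area.

    Ignores edge loss, scribe lines, and defects, so real counts are lower.
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    return int(wafer_area // die_area_mm2)

# Hypothetical 300 mm wafer; the die areas are illustrative values in mm^2.
for area in (296, 148, 106):
    print(area, "mm^2 ->", gross_dies_upper_bound(300, area), "dies at most")
```

Halving the die area roughly doubles the bound, which is the point the definition makes.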

Page 5:

Examples

• 386C: die size 42 mm^2, 1.0 µm technology, 275,000 transistors
• 486C: die size 90 mm^2, 0.7 µm technology, 1.2 million transistors
• Pentium: die size 148 mm^2, 0.5 µm technology, 3.2 million transistors
• Pentium III: die size 106 mm^2, 0.18 µm technology, 28 million transistors

Page 6:

Pentium III (0.18 µm process technology)

Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32

Page 8:

Integration capacity by process technology (BT, billions of transistors)

Technology (nm):             90   65   45   32   22
Integration Capacity (BT):    2    4    8   16   32

Page 9:

Increasing Die Size

Using the same technology, increasing the die size by 2-3X yields about 1.5-1.7X in performance.

Power is proportional to die area × frequency.

We cannot produce microprocessors with ever-increasing die size. The constraint is POWER.
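
A minimal sketch of why growing the die is a poor trade, using only the slide's proportionality (power scales with die area times frequency) and its quoted 2-3X area for 1.5-1.7X performance range. The function name is hypothetical and the frequency is assumed unchanged.

```python
def relative_power(area_factor: float, freq_factor: float = 1.0) -> float:
    """Power relative to baseline, using the proportionality P ~ area x frequency."""
    return area_factor * freq_factor

# (area factor, performance factor) pairs taken from the slide's 2-3X -> 1.5-1.7X range.
cases = [(2.0, 1.5), (3.0, 1.7)]
for area, perf in cases:
    power = relative_power(area)
    print(f"area x{area}: perf x{perf}, power x{power}, perf/power = {perf / power:.2f}")
```

In both cases performance per unit of power drops below the baseline value of 1.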

Page 10:

Reducing Circuit Size

Reducing circuit size in particular is key to reducing the size of the chip.

The first generation Pentium used a 0.8 micron circuit size, and required 296 square millimeters per chip.

The second generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
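
As a rough check on the Pentium numbers, the sketch below compares the quoted die-size ratio with a purely geometric estimate. It assumes every linear dimension shrinks with the circuit size, which is an idealization; the function name is hypothetical.

```python
def ideal_area_ratio(old_feature_um: float, new_feature_um: float) -> float:
    """Area ratio if every linear dimension shrinks with the feature size."""
    return (new_feature_um / old_feature_um) ** 2

ideal = ideal_area_ratio(0.8, 0.6)   # purely geometric estimate
actual = 148 / 296                   # die sizes quoted on the slide
print(f"ideal ratio {ideal:.4f}, quoted ratio {actual:.2f}")
```

The geometric estimate and the quoted 50% drop are close, so most of the reduction is consistent with the shrink alone.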

Page 11:

Shrinking transistors by 30% every generation: transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases.

Gate oxide thickness cannot keep shrinking, which slows the frequency increase and allows less threshold voltage reduction.

Page 12:

Processor Evolution

Generation i (0.5 µm, for example) → Generation i+1 (0.35 µm, for example)

• Gate delay reduces by 1/√2 (frequency up by √2)
• Number of transistors in a constant area goes up by 2 (deeper pipelines, ILP, more caches)
• Additional transistors enable an additional increase in performance
• Result: 2x performance at roughly equal cost
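
The per-generation scaling can be worked through numerically. This sketch assumes a 30% linear shrink (a factor of 0.7, close to 1/√2) and that gate delay scales with the linear dimension; both are simplifications, and the variable names are illustrative.

```python
import math

shrink = 0.7                       # 30% linear shrink per generation
area_per_transistor = shrink ** 2  # each transistor occupies about 0.49 of its old area
density_gain = 1 / area_per_transistor
gate_delay = shrink                # assumption: delay scales with the linear dimension
frequency_gain = 1 / gate_delay

print(f"density x{density_gain:.2f}, frequency x{frequency_gain:.2f}")
print(f"0.5 um * 0.7 = {0.5 * shrink:.2f} um; 1/sqrt(2) = {1 / math.sqrt(2):.3f}")
```

The density gain comes out near 2 and the frequency gain near √2, matching the factors on the slide.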

Page 13:

What happens to power if we hold die size constant at each generation?

Allows ~ 100% growth in transistors each generation

Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32

Page 14:

What happens to die size if we hold power constant at each generation?

Die size has to reduce ~25% in area each generation, leaving only 50% growth in transistors, which limits PERFORMANCE. Power density is still a problem.

Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
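
The growth figures for the two scenarios follow from one multiplication. This is a sketch assuming transistor density doubles each generation; the function name is hypothetical.

```python
def transistor_growth(area_factor: float, density_gain: float = 2.0) -> float:
    """Fractional growth in transistor count for a given change in die area."""
    return density_gain * area_factor - 1

constant_die = transistor_growth(1.0)        # die size held constant
constant_power = transistor_growth(1 - 0.25) # die area cut by ~25% to hold power constant

print(f"constant die size: {constant_die:.0%}, constant power: {constant_power:.0%}")
```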

Page 15:

Power Density continues to soar

Pat Gelsinger (Pentium at 90 W)

Source: Intel Developer Forum, Spring 2004

Page 16:

Business as Usual won’t work: Power is a Major Barrier

As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges.

Higher costs:
• Thermal packaging
• Fans
• Electricity
• Air conditioning

Page 17:

A New Paradigm Shift

Old paradigm: Performance == improved frequency, unconstrained power, voltage scaling

New paradigm: Performance == improved IPC, multi-core, power-efficient microarchitecture advancement

Page 18:

Multiple CPUs on a Single Chip

An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die.

Page 19:

Multi-core

• Gate delay does not reduce much
• The frequency and performance of each core is the same or a little less than in the previous generation

[Figure: Generation i cores shown at Technology Generation i and at Technology Generation i+1]

Page 20:

From HT to Many-Core

[Figure: increasing hardware threads (log scale, 1 to 100) versus year (2003 to 2013), from HT through the multi-core era (scalar and parallel applications) to the many-core era (massively parallel applications)]

Intel predicts 100’s of cores on a chip in 2015

Page 21:

Multi-cores are Reality

[Figure: # of cores]

Source: Saman Amarasinghe, MIT (6.189 2007, lecture-1)

Page 22:

Multi-Core Architecture

Page 23:

Multi-core Architecture

Multiple cores are being integrated on a single chip and made available for general purpose computing

Higher levels of integration (multiple processing cores, caches, memory controllers, some I/O processing)

Network on Chip (NoC)

Page 24:

Shared memory

[Figure: interconnection network linking processors (P) and memory modules (M)]

• One copy of data shared among multiple cores
• Synchronization via locking
• Intel

Distributed memory

[Figure: interconnection network linking processors (P) and memory modules (M)]

• Cores access local data
• Cores exchange data
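
The two styles can be contrasted in software. The sketch below is only an analogy, using Python threads in place of cores: the shared-memory workers update one counter under a lock, while the distributed-memory workers keep private data and exchange it through a queue. All function and variable names are hypothetical.

```python
import threading
import queue

# Shared memory: one copy of the counter, guarded by a lock.
counter = 0
lock = threading.Lock()

def add_shared(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:              # synchronization via locking
            counter += 1

# Distributed memory: each worker keeps local data and sends a message.
def add_local(n: int, outbox: queue.Queue) -> None:
    local = 0                   # private to this worker
    for _ in range(n):
        local += 1
    outbox.put(local)           # explicit data exchange

def run(workers: int = 4, n: int = 1000) -> tuple[int, int]:
    """Run both styles and return (shared total, message-passing total)."""
    global counter
    counter = 0
    outbox: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=add_shared, args=(n,)) for _ in range(workers)]
    threads += [threading.Thread(target=add_local, args=(n, outbox)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    message_total = sum(outbox.get() for _ in range(workers))
    return counter, message_total

print(run())
```

Both styles reach the same total; they differ in where the data lives and how the workers coordinate.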

Page 25:

Memory Access Alternatives

                             Global Memory                      Distributed Memory
Shared address space         SMP (Symmetric Multiprocessors)    DSM (Distributed Shared Memory)
Distributed address space                                       MP (Message Passing)

Page 26:

Network on Chip (NoC)

[Figure: a traditional bus (control, data, I/O) compared with a switch network]

Page 27:

Shared Memory

Three organizations for processors (P), primary caches (PC), and secondary caches (SC):

• Shared Global Memory: each P has its own PC and SC; only global memory is shared
• Shared Secondary Cache: each P has its own PC; the secondary cache and global memory are shared
• Shared Primary Cache: the processors share the primary cache, the secondary cache, and global memory

Page 28:

General Architecture

• Conventional microprocessor: one CPU core (registers, L1 I$, L1 D$) with an L2 cache, connected to main memory and I/O
• Multiple cores: CPU cores 1 to N, each with its own registers, L1 I$, L1 D$, and L2 cache, all connected to main memory and I/O

Page 29:

General Architecture (cont)

• Shared cache: CPU cores 1 to N, each with its own registers, L1 I$, and L1 D$, sharing one L2 cache connected to main memory and I/O
• Multithreaded shared cache: CPU cores 1 to N, each with several register sets plus L1 I$ and L1 D$, sharing one L2 cache connected to main memory and I/O

Page 30:

“Case Studies”

Page 31:

Case Study 1: “IBM’s Cell Processor”

Page 32:

Cell Highlights

Supercomputer on a chip

Multi-core microprocessor (9 cores)

>4 GHz clock frequency

10X performance for many applications

Page 33:

Key Attributes

Cell is Multi-core
- Contains 64-bit Power Architecture
- Contains 8 Synergistic Processor Elements

Cell is a Broadband Architecture
- SPE is RISC architecture with SIMD organization and local store
- 128+ concurrent transactions to memory per processor

Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Locking caching (via replacement management table)

Cell is a Security Enabled Architecture
- Isolate SPE for flexible security programming

Page 34:

Cell Processor Components

Page 35:

Cell BE Processor Block Diagram

Page 36:

POWER Processing Element (PPE)

POWER Processing Unit (PPU) connected to a 512KB L2 cache.

Responsible for running the OS and coordinating the SPEs.

Key design goals: maximize the performance/power ratio as well as the performance/area ratio.

Dual-issue, in-order processor with dual-thread support

Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions.

Page 37:

Synergistic Processing Elements (SPE)

Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations

Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).

Compute engine with SIMD support and 256KB of dedicated local storage.

The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synch operations with other SPUs and the PPU.

Page 38:

SPE (cont.)

Each SPE operates directly on instructions and data from its dedicated local store, and relies on a channel interface to access main memory and other local stores.

The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.

SIMD support can perform operations on 16 8-bit, 8 16-bit, 4 32-bit integers, or 4 single-precision floating-point numbers per cycle.

At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations per second, or 25.6 GFLOPS in single precision.
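
The peak figures can be reproduced from the register width and the clock rate. This is a sketch, not a statement of how the figures were originally derived: it assumes one SIMD instruction per cycle and that the single-precision figure counts a fused multiply-add as two floating-point operations. The function name is hypothetical.

```python
REGISTER_BITS = 128   # SPE register width from the slides
CLOCK_GHZ = 3.2

def simd_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits

int8_gops = simd_lanes(8) * CLOCK_GHZ        # one 8-bit operation per lane per cycle
# Assumption: a fused multiply-add counts as two floating-point operations.
sp_gflops = simd_lanes(32) * 2 * CLOCK_GHZ

print(simd_lanes(8), simd_lanes(16), simd_lanes(32))
print(f"{int8_gops:.1f} G 8-bit ops/s, {sp_gflops:.1f} GFLOPS single precision")
```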

Page 39:

Four Levels of Parallelism

• Blade level: 2 Cell processors per blade
• Chip level: 9 cores
• Instruction level: dual-issue pipelines on each SPE
• Register level: native SIMD on SPE and PPE VMX

Page 40:

Cell Chip Floor plan

Page 41:

Element Interconnect Bus (EIB)

Implemented as a ring

Interconnects 12 elements:
• 1 PPE with 51.2 GB/s aggregate bandwidth
• 8 SPEs, each with 51.2 GB/s aggregate bandwidth
• MIC: 25.6 GB/s of memory bandwidth
• 2 IOIF: 35 GB/s (out), 25 GB/s (in) of I/O bandwidth

Supports two transfer modes:
• DMA between SPEs
• MMIO/DMA between PPE and system memory

Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007

Page 42:

Element Interconnect Bus (EIB)

An EIB consists of the following:

1. Four 16-byte-wide rings (two in each direction)
   1.1 Each ring is capable of handling up to 3 concurrent non-overlapping transfers
   1.2 Supports up to 12 data transfers at a time
2. A shared command bus
   2.1 Distributes commands
   2.2 Sets up end-to-end transactions
   2.3 Handles coherency
3. A central data arbiter to connect the 12 Cell elements
   3.1 Implemented in a star-like structure
   3.2 Controls access to the EIB data rings on a per-transaction basis

Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
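
The EIB counts and per-element bandwidth can be cross-checked with a few lines of arithmetic. The ring count, transfers per ring, ring width, and element list come from the slides; the bus clock is an assumption not stated there (half of a 3.2 GHz core clock), and the variable names are illustrative.

```python
RINGS = 4
TRANSFERS_PER_RING = 3
RING_WIDTH_BYTES = 16

elements = {"PPE": 1, "SPE": 8, "MIC": 1, "IOIF": 2}

max_concurrent = RINGS * TRANSFERS_PER_RING
total_elements = sum(elements.values())

# Assumption (not stated on the slides): the bus runs at half of a 3.2 GHz core clock.
bus_clock_ghz = 3.2 / 2
one_way_gb_s = RING_WIDTH_BYTES * bus_clock_ghz   # 16 bytes per bus cycle
aggregate_gb_s = 2 * one_way_gb_s                 # send plus receive

print(max_concurrent, total_elements, one_way_gb_s, aggregate_gb_s)
```

Under that assumption, the one-way figure lines up with the MIC bandwidth and the two-way figure with the per-element aggregate quoted for the PPE and SPEs.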

Page 43:

Element Interconnect Bus (EIB)

Page 44:

Cell Manufacturing Parameters

About 234 million transistors (compared with 125 million for the Pentium 4); runs at more than 4.0 GHz

As compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters

The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process

In March 2007 IBM announced that the 65 nm version of Cell BE (Broadband Engine) is in production

Page 45:

Cell Power Consumption

Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz

Including the eight SPEs, the PPE, and other logic, the Cell processor will dissipate close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
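
The quoted operating points show power rising faster than clock rate. The sketch below computes clock rate per watt at each point, using only the figures quoted on the slide; the function name is hypothetical and clock rate is used as a crude stand-in for performance.

```python
# (clock in GHz -> watts) pairs quoted on the slide
spe_power = {2.0: 1.0, 3.0: 2.0, 4.0: 4.0}
chip_power = {2.0: 15.0, 3.0: 30.0, 4.0: 60.0}

def ghz_per_watt(table: dict[float, float]) -> dict[float, float]:
    """Clock rate delivered per watt at each operating point."""
    return {ghz: ghz / watts for ghz, watts in table.items()}

for ghz, eff in ghz_per_watt(chip_power).items():
    print(f"{ghz} GHz: {eff:.3f} GHz per watt")
```

Doubling the clock from 2 GHz to 4 GHz quadruples the power, so efficiency by this measure is halved.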

Page 46:

Cell Power Management

Dynamic Power Management (DPM)

Five Power Management States

One linear sensor

Ten digital thermal sensors

Page 47:

Case Study 2: “Intel’s Core 2 Duo”

Page 48:

Intel Core 2 Duo Highlights

Multi-core microprocessor (2 cores)

Clock frequency ranges from 1.5 to 3 GHz

2X performance for many applications

Dedicated level 1 cache and shared level 2 cache

Its shared L2 cache comes in two flavors: 2 MB and 4 MB, depending on the model

It supports 64-bit architecture

Page 49:

Intel Core 2 Duo Block Diagram

Dedicated L1$

Shared L2$

The two cores exchange data implicitly through the shared level 2 cache

Page 50:

Intel Core 2 Duo Architecture

Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory.

Without a shared cache: Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory.

With a shared cache: only one copy needs to be retrieved.

Page 51:

Intel’s Core 2 Duo Manufacturing Parameters

About 291 million transistors

Compared to Cell’s 221 square millimeters, Core 2 Duo has a smaller die size, between 107 and 143 square millimeters depending on the model.

The current Intel process technology for the dual core ranges between 65 nm and 45 nm (2007), with an estimate of 155 million transistors.

Page 52:

Intel Core 2 Duo Power Consumption

Power consumption in Core 2 Duo ranges from 65 W to 130 W depending on the model.

Assuming a 75 W processor model (Conroe is 65 W), it will cost about $4 to keep the computer running for a whole month.
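
The monthly cost estimate can be unpacked as energy times price. The slide gives no electricity rate, so the sketch below works backwards to the rate implied by the quoted $4; a 30-day month and round-the-clock operation are assumptions, and the function name is hypothetical.

```python
def monthly_energy_kwh(watts: float, days: int = 30) -> float:
    """Energy used by a constant load running around the clock."""
    return watts * 24 * days / 1000

energy = monthly_energy_kwh(75)   # 75 W processor model from the slide
implied_rate = 4.00 / energy      # $/kWh implied by the quoted $4 per month

print(f"{energy:.0f} kWh per month, implied rate ${implied_rate:.3f}/kWh")
```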

Page 53:

Intel Core 2 Duo Power Management

It uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements)

Aggressive clock gating

Enhanced Speed-Step

Low VCC Arrays

Blocks controlled via sleep transistors

Low leakage transistors

Page 54:

Case Study 3: “AMD’s Quad-Core Processor (Barcelona)”

Page 55:

AMD Quad-Core Highlights

Designed to enable simultaneous 32- and 64-bit computing
• Minimizes the cost of transition and maximizes current investments

Integrated DDR2 Memory Controller
• Increases application performance by dramatically reducing memory latency
• Scales memory bandwidth and performance to match compute needs

HyperTransport Technology
• Provides up to 24.0 GB/s peak bandwidth per processor, reducing I/O bottlenecks

Page 56:

AMD Quad-Core Block Diagram

Dedicated L1$ and L2$

Shared L3$

Page 57:

AMD Quad-Core Architecture

It has a crossbar switch instead of the usual bus used in dual core processors

It lowers the probability of having memory access collisions

An L3$ alleviates memory access latency, since the higher number of cores makes memory accesses more likely

Page 58:

AMD Quad-Core Architecture (cont)

Cache hierarchy:

Dedicated L1 cache
• 2-way associative
• 8 banks (each 16 B wide)

Dedicated L2 cache
• 16-way associative
• Victim cache, exclusive w.r.t. L1

Shared L3 cache
• 32-way associative
• Fills from L3 leave likely shared lines in L3
• Victim cache, partially exclusive w.r.t. L2
• Sharing-aware replacement policy

Replacement policies:
• L1, L2: pseudo-LRU
• L3: sharing-aware pseudo-LRU
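
To illustrate what a pseudo-LRU policy does, here is a minimal sketch of the common tree-based variant for a single cache set. It is a generic textbook scheme, not AMD's actual implementation, and it does not model the sharing-aware part; the class and method names are hypothetical.

```python
class TreePLRU:
    """Tree-based pseudo-LRU for one cache set with a power-of-two number of ways."""

    def __init__(self, ways: int) -> None:
        assert ways >= 2 and ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        self.bits = [0] * (ways - 1)   # one bit per tree node; 0 = victim is on the left

    def touch(self, way: int) -> None:
        """Record an access: set each node on the path to point away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1    # accessed left, so the next victim is on the right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0    # accessed right, so the next victim is on the left
                node, lo = 2 * node + 2, mid

    def victim(self) -> int:
        """Follow the node bits from the root to the way chosen for replacement."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo

plru = TreePLRU(4)
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())
```

The scheme needs only one bit per tree node (15 bits for a 16-way set) instead of full ordering state, at the cost of sometimes evicting a line that is not the true least recently used one.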

Page 59:

AMD Quad-Core Manufacturing Parameters

The current AMD process technology for Quad-Core is 65 nm

It comprises approximately 463M transistors (about 119M fewer than Intel’s quad-core Kentsfield)

It has a die size of 285 square millimeters (compared to Cell’s 221 square millimeters)

Page 60:

AMD Quad-Core Power Consumption

Power consumption of the AMD Quad-Core ranges from 68 W to 95 W depending on the model (compared with 65 W to 130 W for Intel’s Core 2 Duo).

AMD CoolCore Technology

Reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power

Power can be switched on or off within a single clock cycle, saving energy with no impact to performance

Page 61:

AMD Quad-Core Power Management

Native quad-core technology enables enhanced power management across all four cores