Computer Science and Engineering
Advanced Computer Architecture
CSE 8383
April 17, 2008
Session 11
Contents
1. Multi-Core
Why now?
A Paradigm Shift
Multi-Core Architecture
2. Case Studies
IBM Cell
Intel Core 2 Duo
AMD
The Path to Multi-Core
Background
Wafer: A thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed.
Die Size: The physical surface area of the processor on the wafer, typically measured in square millimeters (mm^2). In essence a "die" is really a chip. The smaller the chip, the more of them can be made from a single wafer.
Circuit Size: The level of miniaturization of the processor. To pack more transistors into the same space, they must continually be made smaller. Measured in microns (µm) or nanometers (nm).
Examples
Processor      Die size    Technology   Transistors
386C           42 mm^2     1.0 µm       275,000
486C           90 mm^2     0.7 µm       1.2 million
Pentium        148 mm^2    0.5 µm       3.2 million
Pentium III    106 mm^2    0.18 µm      28 million
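To make the trend in the examples concrete, the sketch below computes transistors per square millimeter from the figures listed on the slide. The dictionary layout and the function name `density_per_mm2` are illustrative choices, not part of the lecture.

```python
# (die size in mm^2, transistor count), as listed in the examples above.
chips = {
    "386C": (42, 275_000),
    "486C": (90, 1_200_000),
    "Pentium": (148, 3_200_000),
    "Pentium III": (106, 28_000_000),
}

def density_per_mm2(die_mm2, transistors):
    """Average transistor density over the whole die."""
    return transistors / die_mm2

densities = {name: density_per_mm2(*spec) for name, spec in chips.items()}

for name, d in densities.items():
    print(f"{name:12s} {d:10.0f} transistors/mm^2")
```

Density rises at every step in this list, even where the die itself gets smaller (Pentium to Pentium III).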
Pentium III (0.18 µm process technology)
Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
Integration Capacity by nm Process Technology
Technology (nm)                                   90   65   45   32   22
Integration capacity (billions of transistors)     2    4    8   16   32
Increasing Die Size
Using the same technology:
Increasing the die size 2-3X yields only a 1.5-1.7X gain in performance.
Power is proportional to die area * frequency.
We cannot produce microprocessors with ever-increasing die size. The constraint is POWER.
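A rough way to see why die growth is a poor trade is to compute relative performance per watt under the stated proportionality (power ~ die area * frequency). The sketch below pairs the low ends (2X area, 1.5X performance) and the high ends (3X area, 1.7X performance) of the quoted ranges; that pairing is an assumption made here for illustration, and `perf_per_watt` is a hypothetical helper.

```python
def perf_per_watt(area_factor, perf_factor, freq_factor=1.0):
    """Relative performance per watt, assuming power ~ die area * frequency."""
    relative_power = area_factor * freq_factor
    return perf_factor / relative_power

# Same technology and same frequency; only the die grows.
low_end = perf_per_watt(area_factor=2.0, perf_factor=1.5)
high_end = perf_per_watt(area_factor=3.0, perf_factor=1.7)

print(f"2X die, 1.5X performance: {low_end:.2f} of baseline perf/W")
print(f"3X die, 1.7X performance: {high_end:.2f} of baseline perf/W")
```

In both cases efficiency falls below the baseline of 1.0, which is the sense in which power is the constraint.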
Reducing Circuit Size
Reducing circuit size in particular is key to reducing the size of the chip.
The first generation Pentium used a 0.8 micron circuit size, and required 296 square millimeters per chip.
The second generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
Shrinking transistors by 30% every generation: transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases.
Gate thickness cannot keep on shrinking, which slows the frequency increase and allows less threshold voltage reduction.
Processor Evolution
• Gate delay reduces by 1/√2 (frequency up by √2)
• Number of transistors in a constant area goes up by 2 (deeper pipelines, ILP, more caches)
• Additional transistors enable an additional increase in performance
• Result: 2x performance at roughly equal cost
[Figure: Generation i (0.5 µm, for example) scaled to Generation i+1 (0.35 µm, for example)]
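The per-generation scaling described above can be checked with a few lines of arithmetic. This is a minimal sketch under an idealized assumption: every linear dimension shrinks by the same factor (0.7, i.e. a 30% shrink) and gate delay scales with that factor. The function name and dictionary keys are illustrative.

```python
import math

def scale_generation(feature_um, shrink=0.7):
    """Idealized scaling of one process generation by a linear shrink factor."""
    return {
        "feature_um": feature_um * shrink,
        "area_factor": shrink ** 2,            # same circuit occupies ~half the area
        "density_factor": 1 / shrink ** 2,     # ~2x transistors in a constant area
        "gate_delay_factor": shrink,           # ~1/sqrt(2)
        "frequency_factor": 1 / shrink,        # ~sqrt(2)
    }

nxt = scale_generation(0.5)
for key, value in nxt.items():
    print(f"{key:18s} {value:.3f}")
print(f"sqrt(2) for comparison: {math.sqrt(2):.3f}")
```

Starting from 0.5 µm, the sketch lands on 0.35 µm with roughly twice the density and roughly √2 times the frequency.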
What happens to power if we hold die size constant at each generation?
Allows ~ 100% growth in transistors each generation
Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
What happens to die size if we hold power constant at each generation?
Die size has to reduce ~25% in area each generation, which allows only ~50% growth in transistors and limits PERFORMANCE. Power density is still a problem.
Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
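The two growth figures quoted on these slides follow from one multiplication: transistor count scales with die area times transistor density. The sketch below assumes density doubles each generation, as stated earlier in the lecture; `transistor_growth` is a hypothetical helper name.

```python
def transistor_growth(area_factor, density_factor=2.0):
    """Fractional growth in transistor count from one generation to the next."""
    return area_factor * density_factor - 1.0

constant_die = transistor_growth(area_factor=1.0)     # die size held constant
constant_power = transistor_growth(area_factor=0.75)  # die area reduced ~25%

print(f"Constant die size: {constant_die:.0%} more transistors")
print(f"Constant power:    {constant_power:.0%} more transistors")
```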
Source: Intel Developer Forum, Spring 2004
Pat Gelsinger (Pentium at 90 W)
Power Density continues to soar
Business as Usual won’t work: Power is a Major Barrier
As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges.
Higher costs:
• Thermal packaging
• Fans
• Electricity
• Air conditioning
A new Paradigm Shift
Old Paradigm: Performance == improved frequency, unconstrained power, voltage scaling
New Paradigm: Performance == improved IPC, multi-core, power-efficient microarchitecture advancement
Multiple CPUs on a Single Chip
An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die
Multi-core
• Gate delay does not reduce much
• The frequency and performance of each core is the same or a little less than the previous generation
[Figure: one Generation i core at Technology Generation i; multiple Generation i cores on one die at Technology Generation i+1]
From HT to Many-Core
[Figure: increasing hardware threads (axis ticks 1, 10, 100) over 2003-2013, moving from HT through the Multi-core Era (scalar and parallel applications) to the Many-core Era (massively parallel applications)]
Intel predicts 100's of cores on a chip in 2015
Multi-cores are Reality
[Figure: # of cores]
Source: Saman Amarasinghe, MIT (6.189 2007, lecture-1)
Multi-Core Architecture
Multi-core Architecture
Multiple cores are being integrated on a single chip and made available for general purpose computing
Higher levels of integration (multiple processing cores, caches, memory controllers, some I/O processing)
Network on Chip (NoC)
[Figure: two organizations of processors (P) and memories (M) connected through interconnection networks]
Shared memory
• One copy of data shared among multiple cores
• Synchronization via locking
• Intel
Distributed memory
• Cores access local data
• Cores exchange data
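The two programming styles listed above can be contrasted in a small sketch. Python threads stand in for cores here purely for illustration; the names `shared_worker`, `message_worker`, and `mailbox` are hypothetical. The first version keeps one copy of the data and synchronizes via a lock; the second lets each worker update only local data and then exchange the result as a message.

```python
import queue
import threading

N_WORKERS = 4
INCREMENTS = 1000

# Shared memory style: one copy of the data, guarded by a lock.
shared = {"total": 0}
lock = threading.Lock()

def shared_worker():
    for _ in range(INCREMENTS):
        with lock:                      # synchronization via locking
            shared["total"] += 1

# Distributed memory style: local data only, results exchanged as messages.
mailbox = queue.Queue()

def message_worker():
    local_total = 0                     # private to this worker
    for _ in range(INCREMENTS):
        local_total += 1
    mailbox.put(local_total)            # exchange data explicitly

def run(worker):
    threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(shared_worker)
run(message_worker)
message_total = sum(mailbox.get() for _ in range(N_WORKERS))

print(shared["total"], message_total)
```

Both versions reach the same total; they differ in where the data lives and in who pays the coordination cost (a lock on every update versus one message per worker).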
Memory Access Alternatives
                      Shared address space               Distributed address space
Global memory         SMP (Symmetric Multiprocessors)    -
Distributed memory    DSM (Distributed Shared Memory)    MP (Message Passing)
Network on Chip (NoC)
[Figure: Traditional Bus (control, data, I/O) versus Switch Network]
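Shared Memory
Shared Global Memory: each processor (P) has its own primary cache (PC) and secondary cache (SC); all share the global memory.
Shared Secondary Cache: each processor (P) has its own primary cache (PC); all share one secondary cache in front of the global memory.
Shared Primary Cache: the processors (P) share one primary cache, backed by the secondary cache and the global memory.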
General Architecture
Conventional Microprocessor: one CPU core (registers) with L1 I$ and L1 D$, an L2 cache, main memory, and I/O.
Multiple cores: CPU core 1 through CPU core N, each with its own registers, L1 I$, L1 D$, and L2 cache, sharing main memory and I/O.
General Architecture (cont)
Shared Cache: CPU core 1 through CPU core N, each with its own registers, L1 I$, and L1 D$, sharing one L2 cache, main memory, and I/O.
Multithreaded Shared Cache: CPU core 1 through CPU core N, each with several register sets (one per hardware thread), L1 I$, and L1 D$, sharing one L2 cache, main memory, and I/O.
“Case Studies”
Case Study 1: “IBM’s Cell Processor”
Cell Highlights
Supercomputer on a chip
Multi-core microprocessor (9 cores)
>4 GHz clock frequency
10X performance for many applications
Key Attributes
Cell is Multi-core
- Contains 64-bit Power Architecture
- Contains 8 synergistic processor elements
Cell is a Broadband Architecture
- SPE is RISC architecture with SIMD organization and local store
- 128+ concurrent transactions to memory per processor
Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Locking caches (via replacement management table)
Cell is a Security Enabled Architecture
- Isolate SPE for flexible security programming
Cell Processor Components
Cell BE Processor Block Diagram
POWER Processing Element (PPE)
POWER Processing Unit (PPU) connected to a 512KB L2 cache.
Responsible for running the OS and coordinating the SPEs.
Key design goals: maximize the performance/power ratio as well as the performance/area ratio.
Dual-issue, in-order processor with dual-thread support
Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions.
Synergistic Processing Elements (SPE)
Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations
Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).
Compute engine with SIMD support and 256KB of dedicated local storage.
The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synchronization operations with other SPUs and the PPU.
SPE (cont.)
SPEs operate directly on instructions and data from their dedicated local stores.
They rely on a channel interface to access the main memory and other local stores.
The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.
SIMD support can perform operations on 16 8-bit, 8 16-bit, 4 32-bit integers, or 4 single-precision floating-point numbers per cycle.
At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations per second, or 25.6 GFLOPS in single precision.
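The peak figures quoted for the SPU can be reproduced from the clock rate and the SIMD width. The sketch below assumes a 128-bit SIMD datapath (matching the 128-bit registers) and, for the floating-point figure, that a fused multiply-add counts as two operations per lane; that second assumption is made here to explain the 25.6 figure and is not stated on the slide. The helper names are illustrative.

```python
CLOCK_GHZ = 3.2
SIMD_BITS = 128

def lanes(element_bits):
    """Number of elements processed per SIMD instruction."""
    return SIMD_BITS // element_bits

def peak_giga_ops(element_bits, ops_per_lane=1):
    """Peak operations per second, in billions, at one SIMD instruction per cycle."""
    return CLOCK_GHZ * lanes(element_bits) * ops_per_lane

int8_gops = peak_giga_ops(8)                    # 16 lanes of 8-bit integers
sp_gflops = peak_giga_ops(32, ops_per_lane=2)   # 4 lanes, multiply-add = 2 flops (assumption)

print(f"8-bit integer peak:    {int8_gops:.1f} billion ops/s")
print(f"Single-precision peak: {sp_gflops:.1f} GFLOPS")
```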
Four Levels of Parallelism
Blade level: 2 Cell processors per blade
Chip level: 9 cores
Instruction level: dual-issue pipelines on each SPE
Register level: native SIMD on SPE and PPE VMX
Cell Chip Floor plan
Element Interconnect Bus (EIB)
Implemented as a ring
Interconnects 12 elements:
- 1 PPE with 51.2 GB/s aggregate bandwidth
- 8 SPEs, each with 51.2 GB/s aggregate bandwidth
- MIC: 25.6 GB/s of memory bandwidth
- 2 IOIF: 35 GB/s (out), 25 GB/s (in) of I/O bandwidth
Supports two transfer modes:
- DMA between SPEs
- MMIO/DMA between PPE and system memory
Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
Element Interconnect Bus (EIB)
An EIB consists of the following:
1. Four 16-byte-wide rings (two in each direction)
1.1 Each ring capable of handling up to 3 concurrent non-overlapping transfers
1.2 Supports up to 12 data transfers at a time
2. A shared command bus
2.1 Distributes commands
2.2 Sets up end-to-end transactions
2.3 Handles coherency
3. A central data arbiter to connect the 12 Cell elements
3.1 Implemented in a star-like structure
3.2 Controls access to the EIB data rings on a per-transaction basis
Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
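The ring parameters above determine the EIB's concurrency limit and, given a bus clock, a theoretical peak bandwidth. In the sketch below the ring count, ring width, and transfers per ring come from the slide; the 1.6 GHz bus clock (half of a 3.2 GHz core clock) and the one-input-plus-one-output port per element are assumptions introduced here, not figures from the slide.

```python
RINGS = 4                 # from the slide: four rings
TRANSFERS_PER_RING = 3    # from the slide: up to 3 concurrent transfers per ring
BYTES_PER_CYCLE = 16      # from the slide: 16-byte-wide rings
EIB_CLOCK_GHZ = 1.6       # assumption: bus runs at half of a 3.2 GHz core clock

max_transfers = RINGS * TRANSFERS_PER_RING

# Theoretical peak if every transfer slot moves 16 bytes on every bus cycle.
peak_gb_s = max_transfers * BYTES_PER_CYCLE * EIB_CLOCK_GHZ

# Assumption: each element has one 16-byte input port and one 16-byte output port.
port_gb_s = 2 * BYTES_PER_CYCLE * EIB_CLOCK_GHZ

print(f"Concurrent transfers: {max_transfers}")
print(f"Theoretical peak:     {peak_gb_s:.1f} GB/s")
print(f"Per-element in+out:   {port_gb_s:.1f} GB/s")
```

Under these assumptions the per-element figure matches the 51.2 GB/s aggregate bandwidth listed for the PPE and each SPE.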
Element Interconnect Bus (EIB)
Cell Manufacturing Parameters
About 234 million transistors (compared with 125 million for Pentium 4), running at more than 4.0 GHz
As compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters
The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process
In March 2007 IBM announced that the 65 nm version of Cell BE (Broadband Engine) is in production
Cell Power Consumption
Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz
Including the eight SPEs, the PPE, and other logic, the Cell processor will dissipate close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
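Using only the numbers quoted above, a short calculation shows how much of the chip's power budget the eight SPEs account for at each clock rate. The table layout and the helper `spe_share` are illustrative.

```python
N_SPES = 8
spe_watts = {2: 1, 3: 2, 4: 4}       # watts per SPE at 2, 3, and 4 GHz (from the slide)
chip_watts = {2: 15, 3: 30, 4: 60}   # approximate whole-chip watts (from the slide)

def spe_share(freq_ghz):
    """Fraction of total chip power drawn by all eight SPEs."""
    return N_SPES * spe_watts[freq_ghz] / chip_watts[freq_ghz]

shares = {f: spe_share(f) for f in spe_watts}

for f, share in shares.items():
    print(f"{f} GHz: SPEs draw {N_SPES * spe_watts[f]:2d} W of {chip_watts[f]} W ({share:.0%})")
```

With these figures the SPEs account for a little over half of the total at every clock rate, and both the per-SPE and whole-chip power double with each 1 GHz step.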
Cell Power Management
Dynamic Power Management (DPM)
Five Power Management States
One linear sensor
Ten digital thermal sensors
Case Study 2: “Intel’s Core 2 Duo”
Intel Core 2 Duo Highlights
Multi-core microprocessor (2 cores)
Clock frequency ranges from 1.5 to 3 GHz
2X performance for many applications
Dedicated level 1 cache and shared level 2 cache
The shared L2 cache comes in two flavors, 2 MB and 4 MB, depending on the model
It supports a 64-bit architecture
Intel Core 2 Duo Block Diagram
Dedicated L1$
Shared L2$
The two cores exchange data implicitly through the shared level 2 cache
Intel Core 2 Duo Architecture
Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory
[Figure annotation] Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory
[Figure annotation] One copy needed to be retrieved
Intel’s Core 2 Duo Manufacturing Parameters
About 291 million transistors
Compared to Cell’s 221 square millimeters, Core 2 Duo has a smaller die size, between 107 and 143 square millimeters depending on the model.
The current Intel process technology for the dual-core parts ranges from 65 nm to 45 nm (2007), with an estimate of 155 million transistors.
Intel Core 2 Duo Power Consumption
Power consumption in Core 2 Duo ranges from 65 W to 130 W depending on the model.
Assuming a 75 W processor model (Conroe is 65 W), it will cost about $4 to keep the computer running for a whole month.
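The monthly cost estimate can be unpacked into energy used and the electricity price it implies. The slide does not state a price, so the sketch below works backwards from the $4 figure; the 30-day month and the helper names are assumptions made here.

```python
HOURS_PER_MONTH = 24 * 30   # assumption: a 30-day month, running continuously

def monthly_kwh(watts, hours=HOURS_PER_MONTH):
    """Energy used in one month, in kilowatt-hours."""
    return watts * hours / 1000

def monthly_cost(watts, dollars_per_kwh):
    """Electricity cost for one month of continuous operation."""
    return monthly_kwh(watts) * dollars_per_kwh

energy = monthly_kwh(75)
implied_rate = 4.0 / energy     # price per kWh implied by the $4 estimate

print(f"Energy per month: {energy:.1f} kWh")
print(f"Implied price:    ${implied_rate:.3f} per kWh")
print(f"130 W model:      ${monthly_cost(130, implied_rate):.2f} per month")
```

The last line reuses the implied price to estimate the top of the quoted 65 W to 130 W range.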
Intel Core 2 Duo Power Management
It uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements)
Aggressive clock gating
Enhanced Speed-Step
Low VCC Arrays
Blocks controlled via sleep transistors
Low leakage transistors
Case Study 3: “AMD’s Quad-Core Processor (Barcelona)”
AMD Quad-Core Highlights
Designed to enable simultaneous 32- and 64-bit computing
Minimizes the cost of transition and maximizes current investments
Integrated DDR2 Memory Controller
Increases application performance by dramatically reducing memory latency
Scales memory bandwidth and performance to match compute needs
HyperTransport Technology provides up to 24.0 GB/s peak bandwidth per processor, reducing I/O bottlenecks
AMD Quad-Core Block Diagram
Dedicated L1$ and L2$
Shared L3$
AMD Quad-Core Architecture
It has a crossbar switch instead of the usual bus used in dual core processors
It lowers the probability of having memory access collisions
A shared L3$ alleviates memory access latency, since the higher number of cores makes memory accesses more frequent
AMD Quad-Core Architecture (cont)
Replacement policies:
- L1, L2: pseudo-LRU
- L3: sharing-aware pseudo-LRU
Cache hierarchy:
Dedicated L1 cache
- 2-way associative
- 8 banks (each 16 B wide)
Dedicated L2 cache
- 16-way associative
- Victim cache, exclusive w.r.t. L1
Shared L3 cache
- 32-way associative
- Fills from L3 leave likely shared lines in L3
- Victim cache, partially exclusive w.r.t. L2
- Sharing-aware replacement policy
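The phrase "victim cache, exclusive w.r.t. L1" can be illustrated with a toy model. The sketch below is a simplification and not AMD's design: both levels are fully associative, replacement is true LRU rather than pseudo-LRU, and the class name `ExclusiveTwoLevelCache` is hypothetical. It shows the two defining behaviors: L2 is filled only by lines evicted from L1, and a line promoted back into L1 is removed from L2, so no line is held in both levels.

```python
from collections import OrderedDict

class ExclusiveTwoLevelCache:
    """Toy L1 plus exclusive victim L2 (fully associative, true LRU)."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()   # least recently used entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, addr):
        if addr in self.l1:
            self.l1.move_to_end(addr)        # refresh LRU position
            return "L1 hit"
        if addr in self.l2:
            del self.l2[addr]                # exclusive: the line leaves L2
            self._fill_l1(addr)
            return "L2 hit"
        self._fill_l1(addr)                  # fill from memory goes straight to L1
        return "miss"

    def _fill_l1(self, addr):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)   # evict LRU line from L1
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)           # make room in the victim cache
            self.l2[victim] = True                    # L2 only receives L1 victims
        self.l1[addr] = True

cache = ExclusiveTwoLevelCache(l1_lines=2, l2_lines=2)
results = [cache.access(a) for a in ["A", "B", "C", "A", "A"]]
print(results)
print("L1:", list(cache.l1), "L2:", list(cache.l2))
```

In this trace, the third access evicts A into L2, the fourth finds it there and swaps it back into L1, and the fifth hits in L1.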
AMD Quad-Core Manufacturing Parameters
The current AMD process technology for Quad-Core is 65nm
It comprises approximately 463M transistors (about 119M fewer than Intel’s quad-core Kentsfield)
It has a die size of 285 square millimeters (Compared to Cell’s 221 square millimeters)
AMD Quad-Core Power Consumption
Power consumption in AMD Quad-Core ranges from 68 W to 95 W (compared to 65 W to 130 W for Intel’s Core 2 Duo), depending on the model.
AMD CoolCore Technology
Reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power
Power can be switched on or off within a single clock cycle, saving energy with no impact to performance
AMD Quad-Core Power Management
Native quad-core technology enables enhanced power management across all four cores