8/7/2019 multicore.processor.examples
1/30
Sudhakar Yalamanchili, Georgia Institute of Technology
Multicore: Commercial Processors
ECE 4100/6100 (2)
Some Examples
Desktop and Server/Enterprise Space
  Intel
  AMD
  SUN Microsystems
The Embedded Space
  Freescale Semiconductor
ECE 4100/6100 (3)
Focus
The Chip Level Architecture
  What do we have on chip?
The Core Architecture
  Note the presence/absence/configuration of concepts studied earlier in class
  Rationalize the design decisions that led to the preceding
  What can/should we expect next?
Building systems using multicore chips
The Intel Core Duo Processor Series
ECE 4100/6100 (5)
Intel Core Duo
Homogeneous cores
Bus-based on-chip interconnect
Shared memory
Traditional I/O
Cores: classic OOO (reservation stations, issue ports, schedulers, etc.)
L2 cache: large, shared, set-associative, with prefetch, etc.
Source: Intel Corp.
ECE 4100/6100 (6)
Intel Core Duo: Vital Stats
151 million transistors; shared 2 MB L2 cache
Each core has a 12-stage pipeline (Yonah)
Low-power (less than 25 watts) dual-core microprocessor
Supports Intel's Vanderpool virtualization technology
EM64T (Intel x86-64 extensions) is not supported
  Desktop market: not severe, due to the lack of 64-bit OS and software
  The Sossaman processor for servers, which is based on Yonah, also lacks EM64T support: a severe disadvantage
Communication between the L2 cache and both execution cores is handled by an arbitration bus unit
  Eliminates cache coherency traffic over the FSB
  Raises the core-to-L2 latency
  The increase in clock frequency offsets the impact
Core processors communicate with the system chipset over a 667 MT/s front side bus (FSB), up from the 533 MT/s used by the fastest Pentium M
Intel Core Solo uses the same two-core die as the Core Duo, but features only one active core
  Chips with one core failing quality control can still be sold
  Core 2 Duo processors will also include the ability to disable one core to conserve power
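As a rough worked example of what the FSB figures above imply, the sketch below converts a transfer rate into peak bandwidth. The 64-bit (8-byte) data bus width is an assumption that is not stated on the slide, and fsb_peak_gbs is a hypothetical helper name.

```python
def fsb_peak_gbs(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak bus bandwidth in GB/s (10^9 bytes per second).

    bus_bytes=8 assumes a 64-bit data bus; the slide does not state the width.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    for name, rate in [("Pentium M, 533 MT/s", 533), ("Core Duo, 667 MT/s", 667)]:
        print(f"{name}: {fsb_peak_gbs(rate):.2f} GB/s peak")
```

This is a theoretical peak only; sustained bandwidth is lower because the bus also carries address and coherency traffic.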
ECE 4100/6100 (7)
The Core micro-architecture
Source: Ars Technica
ECE 4100/6100 (8)
The Core Execution core
Source: Ars Technica
ECE 4100/6100 (9)
Intel Core Duo
High memory latency due to the lack of an on-die memory controller (further aggravated by the system chipset's use of DDR-II RAM)
  Main-memory transactions have to pass through the Northbridge of the chipset
  Higher latency compared to AMD's Turion platform
  Weakness shared by the entire line of Pentium processors
  L2 cache is quite effective at hiding main-memory latency
Execution units
  Three 64-bit integer execution units: one CIU (complex) + two SIU (simple)
  Two FPUs
Poor floating-point unit (FPU) throughput
Little improvement in "performance per watt" in single-threaded applications compared to its predecessor
ECE 4100/6100 (10)
Core 2 Duo and Core Duo
Very similar architectures
Bump in the processor speed
Increase in Level 2 cache (2 MB to 4 MB)
Both chips are built on a 65 nm process technology and support a 667 MT/s front-side bus (FSB)
14-stage pipeline
Source: Intel Corp.
ECE 4100/6100 (11)
Intel Core 2 Duo Processor
Processor die size: 143 mm^2
ECE 4100/6100 (13)
Wide Dynamic Execution
Source: Bit Tech
ECE 4100/6100 (14)
Wide Dynamic Execution
Source: Bit Tech
ECE 4100/6100 (15)
Wide Dynamic Execution
Pipe width of 4 execution units per core (Pentium M/Pentium 4 NetBurst have 3)
  Delivery of more instructions per clock cycle
Pipeline depth of 14 stages vs. 31 in the Pentium 4 Prescott
  Compromise between efficient execution of short instructions and long instructions
Ops fusion
  Less work for the processor pipeline to run
  Micro-ops fusion: fuses together repetitive instructions in x86 code
  Macro-ops fusion: works on the x86 instructions themselves, not just their micro-op derivatives
  Instruction loads and micro-ops can be reduced by approximately 15% and 10%, respectively
ECE 4100/6100 (16)
Intelligent Power Capability
Source: Bit Tech
ECE 4100/6100 (17)
Intelligent Power Capability
SpeedStep technology
  Dynamic clock speed reduction
  Intel mobile processors include this already
  Enhanced SpeedStep used in Core 2 Duo
Controller that turns on sections of the processor as needed
  One core can be shut down for single-threaded applications
Power consumption decreased by enhancements to Intel's 65 nm process node
  Use of low-k dielectrics and strained silicon
  Use of low-leakage and "sleep" transistors
ECE 4100/6100 (18)
Advanced Smart Cache
Source: Bit Tech
ECE 4100/6100 (21)
Smart Memory Access
Improved prefetch units
Memory disambiguation
  Allows re-ordering instructions more efficiently
Source: Ars Technica
Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8
[Figure: Execution without memory disambiguation]
[Figure: Memory aliasing]
[Figure: Execution with and without memory disambiguation]
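The idea behind memory disambiguation can be sketched as follows: a load may be moved ahead of earlier stores only if it does not alias any of them. This is a simplified model with all addresses known up front; real hardware has to predict aliasing before store addresses are resolved and recover when the prediction is wrong. The function name and the operation encoding are hypothetical.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # ("load" or "store", address)


def loads_that_can_bypass(ops: List[Op]) -> List[int]:
    """Return the indices of loads that alias no earlier store.

    Those loads are safe to hoist above the earlier stores. A load that
    does alias an earlier store must wait for that store's data.
    """
    safe = []
    earlier_store_addrs = set()
    for i, (kind, addr) in enumerate(ops):
        if kind == "store":
            earlier_store_addrs.add(addr)
        elif kind == "load" and addr not in earlier_store_addrs:
            safe.append(i)
    return safe


if __name__ == "__main__":
    program = [("store", 0x100), ("load", 0x200), ("load", 0x100)]
    print(loads_that_can_bypass(program))
```

In this example the load from 0x200 can run early, while the load from 0x100 depends on the preceding store to the same address.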
ECE 4100/6100 (22)
Advanced Digital Media Boost
Source: Bit Tech
ECE 4100/6100 (23)
Advanced Digital Media Boost
Streaming SIMD Extension (SSE) instructions
  SSE instructions are an extension of the standard x86 instruction set
  Utilized in multimedia encoding, decoding, image manipulation and encryption
SSE execution is 128 bits wide
  Up from 64 bits
  Double the SSE performance over the previous generation
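A small sketch of why the wider datapath doubles SSE throughput: a 128-bit operation takes one pass through a 128-bit unit but two passes through a 64-bit unit. This is a simplified model that assumes the narrower design splits each 128-bit operation into two halves; the function names are hypothetical.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements packed into one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide register size")
    return register_bits // element_bits


def passes_needed(register_bits: int, datapath_bits: int) -> int:
    """Passes through the execution unit for one full-width operation."""
    return -(-register_bits // datapath_bits)  # ceiling division


if __name__ == "__main__":
    print("32-bit floats per 128-bit register:", simd_lanes(128, 32))
    print("passes on a 64-bit datapath:", passes_needed(128, 64))
    print("passes on a 128-bit datapath:", passes_needed(128, 128))
```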
ECE 4100/6100 (24)
Comparison of SSE to prior processors
Source: Ars Technica
ECE 4100/6100 (25)
Intel Conroe Vs Presler
What is the major difference?
  Shared L2 versus separate caches
[Figure: Conroe and Presler]
Source: Bit Tech
ECE 4100/6100 (26)
Intel's Roadmap for Multicore
Source: Adapted from Tom's Hardware
[Figure: roadmap chart, 2006 to 2008, with rows for desktop, mobile, and enterprise processors; chart entries listed below in extraction order]
  SC 1 MB; DC 2 MB; DC 2/4 MB shared; DC 3 MB/6 MB shared (45 nm)
  DC 2/4 MB; DC 2/4 MB shared; DC 4 MB; DC 3 MB/6 MB shared (45 nm)
  DC 2 MB; DC 4 MB; DC 16 MB; QC 4 MB; QC 8/16 MB shared; 8C 12 MB shared (45 nm)
  SC 512 KB/1/2 MB; 8C 12 MB shared (45 nm)
Drivers are
  Market segments
  More cache
  More cores
80 core processor prototype has been designed!
ECE 4100/6100 (27)
Intel Chipset Example
Source: Extreme Tech
ECE 4100/6100 (28)
References and Links
http://www.intel.com/products/processor/coreduo/
http://en.wikipedia.org/wiki/Intel_Core
http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1
http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/
http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop
http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/
http://www.hardcoreware.net/reviews/review-347-1.htm
http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1
http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1
http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1
http://arstechnica.com/articles/paedia/cpu/core.ars/1
http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4
http://www.extremetech.com/article2/0,1697,1988794,00.asp
http://www.coreduoinfo.com/blog/about-intel-core-duo/
http://67.91.114.164/intel_c2d_info.htm
http://www.pcper.com/article.php?aid=272&type=expert
AMD MultiCore Processors
ECE 4100/6100 (30)
Dual Core AMD Opteron
Source: AMD
ECE 4100/6100 (31)
AMD Multicore (Dual-core) Opteron
Two AMD Opteron CPU cores on a single die
  Each has 1 MB L2 cache
90 nm, ~205 million transistors
  Approximately the same die size as the 130 nm single-core AMD Opteron processor
95 watt power envelope
  Fits into the 90 nm power infrastructure
Introduced with the K8 Revision E core in April 2005
[Figure: Core 0 and Core 1, each with a 1 MB L2, attached to the Northbridge]
Source: AMD
ECE 4100/6100 (32)
Opteron Core Pipeline
Source: ChipArchitect
ECE 4100/6100 (33)
AMD Opteron Processor Core Architecture
[Figure: block diagram of the Opteron core. Labeled blocks: Fetch; Branch Prediction; L1 Icache (64 KB); L1 Dcache (64 KB); Scan/Align/Decode; Fastpath; Microcode Engine; Instruction Control Unit (72 entries); Int Decode & Rename; FP Decode & Rename; three reservation stations (Res); three ALUs, three AGUs, and MULT; 36-entry FP scheduler; FADD, FMUL, FMISC; 44-entry Load/Store Queue]
Source: The 3D shop
ECE 4100/6100 (34)
Dual Core AMD Opteron
AMD64 technology
  Runs 32-bit applications and is 64-bit capable
  Compatible with the x86 software infrastructure
  Enables a single architecture across 32- and 64-bit environments
Direct Connect Architecture
  NUMA system: each processor shares its memory with other processors in the system
Integrated Memory Controller
  On-die DDR2 DRAM memory controller offers memory BW up to 10.7 GB/s per processor
HyperTransport
  Point-to-point interconnect can be used to build a mesh of multiple-processor Opteron systems
  Scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
  24.0 GB/s peak bandwidth per processor
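The two peak-bandwidth figures above can be reproduced with simple arithmetic. The configuration used here is an assumption, since the slide gives only the totals: dual-channel DDR2-667 with 64-bit channels for the memory controller, and three HyperTransport links, each 16 bits wide at 2 GT/s in both directions. The function names are hypothetical.

```python
def ddr_peak_gbs(mt_per_s: float, channels: int = 2, channel_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s; defaults assume dual-channel, 64-bit channels."""
    return mt_per_s * 1e6 * channels * channel_bytes / 1e9


def ht_peak_gbs(links: int = 3, gt_per_s: float = 2.0,
                link_bytes: int = 2, directions: int = 2) -> float:
    """Aggregate HyperTransport bandwidth in GB/s; defaults are assumptions."""
    return links * gt_per_s * link_bytes * directions


if __name__ == "__main__":
    print(f"DDR2-667, dual channel: {ddr_peak_gbs(667):.1f} GB/s")
    print(f"Three HyperTransport links: {ht_peak_gbs():.1f} GB/s")
```

Both results are theoretical peaks counted across all channels and link directions, which is how such per-processor totals are usually quoted.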
ECE 4100/6100 (35)
Dual Core AMD Opteron
Not a simple aggregation of K8 cores
  Integrated the cores for efficiency
Dual-core Opteron acts very much like an SMP system
  Compatible with existing single-threaded and multi-threaded (hyperthreaded) software
MOESI coherency protocol (O = Owned)
  Updates through the system request interface
SSE3 support with 10 new instructions
Quad-core upgradeability
Hardware-assisted AMD Virtualization
Optimized power management
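For reference, the state changes of a single cache line under MOESI can be sketched as below. This is the textbook form of the protocol, not AMD's implementation; details such as the system request interface are left out, and the event names are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def next_state(state: State, event: str, others_have_copy: bool = False) -> State:
    """Next state of one line in one cache (textbook MOESI, simplified)."""
    if event == "local_write":
        # Writer gains the only valid copy; other copies are invalidated.
        return State.M
    if event == "local_read":
        if state is State.I:
            return State.S if others_have_copy else State.E
        return state
    if event == "snoop_read":
        # A dirty line supplies the data and stays dirty as Owned.
        if state in (State.M, State.O):
            return State.O
        if state is State.E:
            return State.S
        return state
    if event == "snoop_write":
        return State.I
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    print(next_state(State.M, "snoop_read").value)
```

The Owned state is what distinguishes MOESI from MESI: a modified line can be shared with another cache without first being written back to memory.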
ECE 4100/6100 (36)
Dual Core AMD Opteron
Source: Elec Design
ECE 4100/6100 (37)
AMD Opteron (SOI)
Source: Chip Architect
ECE 4100/6100 (38)
AMD 64 bit Core
1 MB L2 cache
Detailed discussion of the 64-bit core architecture at:
  http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
ECE 4100/6100 (41)
Cache coherency
Source: Chip Architect
ECE 4100/6100 (42)
AMD Athlon 64 X2
Source: AMD
ECE 4100/6100 (43)
References and Links
http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1
http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html
http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml
http://en.wikipedia.org/wiki/Opteron
http://en.wikipedia.org/wiki/Athlon_64_X2
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html
http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp
http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html
http://www.extremetech.com/article2/0,1697,1675784,00.asp
http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991
http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm
http://www.nextcomputing.com/advantages/thruadv.shtml
http://arstechnica.com/news.ars/post/20060817-7535.html
http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html
SUN UltraSPARC Multicore
ECE 4100/6100 (45)
SUN UltraSPARC T1
Eight cores, each 4-way threaded
1.2 GHz
Cache
  L1-I: 16 KB, 4-way, 32 B lines
  L1-D: 8 KB, 4-way, 16 B lines
  3 MB internal L2 cache partitioned into four banks, with four memory controllers
Data moved between the L2 and the cores using an integrated crossbar switch to provide high throughput
Source: Sun
ECE 4100/6100 (46)
SUN UltraSPARC T1
Source: Sun
ECE 4100/6100 (47)
SUN UltraSPARC T1 Pipeline
T1's integer pipeline: Fetch, Thread Selection, Decode, Execute, Memory Access, Writeback
Source: Sun
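The Thread Selection stage decides which of a core's four hardware threads issues in a given cycle. The sketch below assumes a plain round-robin policy that skips threads that are not ready (for example, stalled on a cache miss); the slide does not describe the actual T1 policy, and select_thread is a hypothetical name.

```python
from typing import List, Optional


def select_thread(ready: List[bool], last: int) -> Optional[int]:
    """Pick the next ready thread after `last` in round-robin order.

    Returns None when no thread is ready, which models a pipeline bubble.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None


if __name__ == "__main__":
    # Threads 1 and 2 are stalled, so issue alternates between 0 and 3.
    ready = [True, False, False, True]
    last = 0
    for cycle in range(4):
        chosen = select_thread(ready, last)
        print(f"cycle {cycle}: thread {chosen}")
        if chosen is not None:
            last = chosen
```

Switching threads every cycle lets the core hide memory latency: while one thread waits on a miss, the others keep the pipeline busy.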
ECE 4100/6100 (48)
SUN UltraSPARC T2 (Niagara 2)
Source: Sun
ECE 4100/6100 (51)
UltraSparc T2 Memory System
Source: Sun
ECE 4100/6100 (52)
UltraSparc T2 Core Block Diagram
IFU: Instruction Fetch Unit
  16 KB I$, 32 B lines, 8-way SA
  64-entry fully-associative ITLB
EXU0/1: Integer Execution Units
  4 threads share each unit
  Executes one integer instruction/cycle
LSU: Load/Store Unit
  8 KB D$, 16 B lines, 4-way SA
  128-entry fully-associative DTLB
FGU: Floating/Graphics Unit
SPU: Stream Processing Unit
  Cryptographic acceleration
TLU: Trap Logic Unit
  Updates machine state, handles exceptions and interrupts
MMU: Memory Management Unit
  Hardware tablewalk (HWTW)
  8 KB, 64 KB, 4 MB, 256 MB pages
Source: Sun
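The four page sizes supported by the MMU determine how an address splits into a page number and a page offset. The sketch below shows only that arithmetic; it does not model the T2 TLB or tablewalk hardware, and the names are hypothetical.

```python
from typing import Tuple

PAGE_SIZES = {
    "8KB": 8 * 1024,
    "64KB": 64 * 1024,
    "4MB": 4 * 1024 * 1024,
    "256MB": 256 * 1024 * 1024,
}


def offset_bits(page_size: int) -> int:
    """Number of low address bits used as the offset within a page."""
    return (page_size - 1).bit_length()


def split_address(vaddr: int, page_size: int) -> Tuple[int, int]:
    """Split an address into (page number, offset) for a power-of-two page size."""
    bits = offset_bits(page_size)
    return vaddr >> bits, vaddr & (page_size - 1)


if __name__ == "__main__":
    for name, size in PAGE_SIZES.items():
        print(f"{name}: {offset_bits(size)} offset bits")
```

Larger pages leave fewer page-number bits to translate, so one TLB entry covers more memory.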
ECE 4100/6100 (53)
UltraSparc T2 Core Pipeline
8 stages for integer operations:
  Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Writeback
  3-cycle load-use
  Memory (translation, tag/data access)
  Bypass (late select, formatting)
12 stages for floating-point:
  Fetch, Cache, Pick, Decode, Execute, FX1, FX2, FX3, FX4, FX5, FB, FW
  6-cycle latency for dependent FP ops
  Longer pipeline for divide/sqrt
ECE 4100/6100 (54)
References and Links
http://realworldtech.com/page.cfm?ArticleID=RWT090406012516&p=4
http://www.opensparc.net/cgi-bin/goto.php?w=/pubs/preszo/06/HotChips06_09_ppt_master.pdf
http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf
The Embedded Multicores
ECE 4100/6100 (56)
Freescale MPC8572 PowerQUICC III Processor
Source: Freescale
ECE 4100/6100 (57)
Freescale MPC8572 PowerQUICC III Processor
Dual embedded e500 cores
  36-bit physical addressing
  Double-precision floating-point
Integrated L1/L2 cache
  L1 cache: 32 KB data and 32 KB instruction
  Shared L2 cache: 1 MB with ECC
  L2 configurable as SRAM or cache, and I/O transactions can be stashed into L2 cache regions
Integrated DDR memory controller with full ECC support
Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
Four on-chip triple-speed Ethernet controllers
ECE 4100/6100 (58)
References and Links
http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf
ECE 4100/6100 (59)
Summary
Multicore technology spans the product spectrum
  The downward migration of leading edge technology continues
Architectural principles are key to
  Developers: extracting performance
  Designers: improving performance
  Marketing: understanding new markets for performance
Research spans the spectrum of software, security, reliability, parallelism, virtualization and much more!