multicore.processor.examples


  • 8/7/2019 multicore.processor.examples


    Sudhakar Yalamanchili, Georgia Institute of Technology

    Multicore: Commercial Processors

    ECE 4100/6100 (2)

    Some Examples

    Desktop and Server/Enterprise Space
      Intel
      AMD
      SUN Microsystems
    The Embedded Space
      Freescale Semiconductor


    ECE 4100/6100 (3)

    Focus

    The Chip Level Architecture
      What do we have on chip?
    The Core Architecture
      Note the presence/absence/configuration of concepts studied earlier in class
      Rationalize the design decisions that led to the preceding
      What can/should we expect next?
    Building systems using multicore chips

    The Intel Core Duo Processor Series


    ECE 4100/6100 (5)

    Intel Core Duo

    Homogeneous cores
    Bus-based on-chip interconnect
    Shared memory
    Traditional I/O

    Classic OOO: reservation stations, issue ports, schedulers, etc.

    Large, shared set associative, prefetch, etc.

    Source: Intel Corp.

    ECE 4100/6100 (6)

    Intel Core Duo: Vital Stats

    151 million transistors; shared 2 MB L2 cache
    Each core has a 12-stage pipeline (Yonah)
    Low-power (less than 25 watts) dual-core microprocessor
    Supports Intel's Vanderpool virtualization technology
    EM64T (Intel x86-64 extensions) is not supported
      Desktop market: not severe, given the lack of 64-bit OS and software
      Sossaman processor for servers, which is based on Yonah, also lacks EM64T support: a severe disadvantage
    Communication between the L2 cache and both execution cores is handled by an arbitration bus unit
      Eliminates cache coherency traffic over the FSB
      Raises the core-to-L2 latency
      The increase in clock frequency offsets the impact
    Core processors communicate with the system chipset over a 667 MT/s front side bus (FSB), up from the 533 MT/s used by the fastest Pentium M
    Intel Core Solo uses the same two-core die as the Core Duo, but features only one active core
      Chips with one core failing quality control can still be sold
      Core 2 Duo processors will also include the ability to disable one core to conserve power
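The step from a 533 MT/s to a 667 MT/s front side bus can be turned into a rough peak-bandwidth figure: transfer rate times bytes moved per transfer. The sketch below assumes a 64-bit (8-byte) data bus, which the slide does not state, and the function name is purely illustrative.

```python
def fsb_peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64):
    """Peak bandwidth in GB/s (10^9 bytes/s) for a bus of the given width."""
    # Assumption: a 64-bit data bus unless the caller says otherwise.
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


core_duo = fsb_peak_bandwidth_gb_s(667)   # roughly 5.3 GB/s
pentium_m = fsb_peak_bandwidth_gb_s(533)  # roughly 4.3 GB/s
print(f"Core Duo FSB:  {core_duo:.2f} GB/s")
print(f"Pentium M FSB: {pentium_m:.2f} GB/s")
```

These are peak numbers only; sustained bandwidth depends on the chipset and memory behind the bus.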


    ECE 4100/6100 (7)

    The Core micro-architecture

    Source: Ars Technica

    ECE 4100/6100 (8)

    The Core Execution core

    Source: Ars Technica


    ECE 4100/6100 (9)

    Intel Core Duo

    High memory latency due to the lack of an on-die memory controller (further aggravated by the system chipset's use of DDR-II RAM)
      Main-memory transactions have to pass through the Northbridge of the chipset
      Higher latency compared to AMD's Turion platform
      Weakness shared by the entire line of Pentium processors
      L2 cache is quite effective at hiding main-memory latency
    Execution units
      Three 64-bit integer exec units: one CIU (complex) + two SIU (simple)
      Two FPUs
    Poor floating point unit (FPU) throughput
    Little "performance per watt" gain in single-threaded applications compared to its predecessor

    ECE 4100/6100 (10)

    Core 2 Duo and Core Duo

    Very similar architectures
    Bump in the processor speed
    Increase in Level 2 cache (2 MB to 4 MB)
    Both chips are built on a 65-nm process technology and support a 667 MHz front-side bus (FSB)
    14-stage pipeline

    Source: Intel Corp.


    ECE 4100/6100 (11)

    Intel Core™2 Duo Processor

    Processor die size: 143 mm²


    ECE 4100/6100 (13)

    Wide Dynamic Execution

    Source: Bit Tech

    ECE 4100/6100 (14)

    Wide Dynamic Execution

    Source: Bit Tech


    ECE 4100/6100 (15)

    Wide Dynamic Execution

    Pipe width of 4 execution units per chip (Pentium M/Pentium 4 NetBurst have 3)
      Delivery of more instructions per clock cycle
    Pipeline depth of 14 vs. 31 in the Pentium 4 Prescott
      Compromise between efficient execution of short instructions and long instructions
    Ops fusion
      Less work for the processor pipeline to run
      Micro-ops fusion: fuses together repetitive instructions in x86 code
      Macro-ops fusion: works on the x86 instructions themselves, not just their micro-op derivatives
      Instruction loads and micro-ops can be reduced by approximately 15% and 10%, respectively
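To make macro-ops fusion concrete, here is a toy model of a decoder that merges two adjacent x86 instructions into one internal operation. It assumes the fusible pattern is a compare or test followed immediately by a conditional jump; the slide does not spell out the pattern, and the tuple encoding and function names are hypothetical.

```python
# Assumption: only cmp/test followed directly by a conditional jump can fuse.
FUSIBLE_FIRST = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    """True for jcc-style mnemonics such as jne or jz, but not for jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def macro_fuse(instructions):
    """Return a new list in which fusible adjacent pairs become one entry.

    Each instruction is a tuple: (mnemonic, operand, ...).
    """
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (current[0] in FUSIBLE_FIRST and following is not None
                and is_conditional_jump(following[0])):
            # One fused operation carries the operands of both instructions.
            fused.append((current[0] + "+" + following[0],)
                         + current[1:] + following[1:])
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


program = [
    ("mov", "eax", "[x]"),
    ("cmp", "eax", "ebx"),
    ("jne", "loop"),
    ("add", "eax", "1"),
]
for op in macro_fuse(program):
    print(op)
```

The fused program has one fewer entry to track through the pipeline, which is the "less work" the slide refers to.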

    ECE 4100/6100 (16)

    Intelligent Power Capability

    Source: Bit Tech


    ECE 4100/6100 (17)

    Intelligent Power Capability

    SpeedStep technology
      Dynamic clock speed reduction
      Intel mobile processors include this already
      Enhanced SpeedStep used in Core 2 Duo
    Controller that turns on sections of the processor as needed
      One core can be shut down for single-threaded applications
    Power consumption decreased by enhancements to Intel's 65nm process node
      use Low-K dielectrics and strained silicon
      use low-leakage and "sleep" transistors
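Why dynamic clock speed reduction saves power can be sketched with the standard CMOS dynamic-power relation, P proportional to C * V^2 * f. The scale factors below are made-up illustrative values, not SpeedStep operating points, and the function name is hypothetical.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to nominal, from P ~ C * V^2 * f.

    Both arguments are fractions of the nominal value (1.0 = unchanged).
    Switched capacitance C is assumed constant, so it cancels out.
    """
    return voltage_scale ** 2 * frequency_scale


# Hypothetical operating point: 80% voltage at 60% clock frequency.
print(f"{relative_dynamic_power(0.8, 0.6):.3f} of nominal dynamic power")
```

Because voltage enters squared, lowering voltage together with frequency saves more than lowering frequency alone. Leakage power is not covered by this relation.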

    ECE 4100/6100 (18)

    Advanced Smart Cache

    Source: Bit Tech


    ECE 4100/6100 (21)

    Smart Memory Access

    Improved prefetch units
    Memory disambiguation
      Allows re-ordering instructions more efficiently

    Source: Ars Technica

    Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8

    [Figures: memory aliasing; execution without memory disambiguation; execution with and without memory disambiguation]
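The idea behind memory disambiguation can be sketched as a load that runs ahead of older stores whose addresses are not yet known, followed by a check once those addresses resolve. This is a simplified model under stated assumptions, not Intel's predictor: the function name, the dictionary-based memory, and the replay rule are all hypothetical.

```python
def speculative_load(memory, older_stores, load_addr):
    """Run a load ahead of older pending stores, then verify the guess.

    memory:       dict of address -> committed value
    older_stores: list of (address, value) in program order, not yet committed
    Returns (value, replayed): the value the load must finally observe, and
    whether the early result had to be discarded because of an alias.
    """
    early_value = memory.get(load_addr, 0)  # load hoisted above the stores
    aliased = False
    forwarded = early_value
    for addr, value in older_stores:  # store addresses resolve later
        if addr == load_addr:
            aliased = True
            forwarded = value  # the youngest matching older store wins
    if aliased:
        return forwarded, True   # mis-speculation: replay with the store data
    return early_value, False    # no alias: the reordering was safe


memory = {0x10: 1}
print(speculative_load(memory, [(0x20, 5)], 0x10))  # no alias with the store
print(speculative_load(memory, [(0x10, 7)], 0x10))  # alias forces a replay
```

When aliases are rare, most loads take the second return path and the reordering costs nothing.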

    ECE 4100/6100 (22)

    Advanced Digital Media Boost

    Source: Bit Tech


    ECE 4100/6100 (23)

    Advanced Digital Media Boost

    Streaming SIMD Extension (SSE) instructions
      SSE instructions are an extension of the standard x86 instruction set
      Utilized in multimedia encoding, decoding, image manipulation and encryption
    128-bit SSE instructions execute on a full 128-bit datapath
      Up from 64 bits
      Double the SSE performance over the previous generation
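The doubling claim follows from simple width arithmetic, sketched below. The assumption, not stated on the slide, is that a 128-bit operation on a 64-bit datapath needs two passes while a 128-bit datapath needs one; the function names are illustrative.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements processed by one SIMD instruction."""
    return register_bits // element_bits


def passes_needed(instruction_bits, datapath_bits):
    """Passes through the execution unit for one instruction (ceiling)."""
    return -(-instruction_bits // datapath_bits)


print("32-bit floats per 128-bit SSE register:", simd_lanes(128, 32))
old = passes_needed(128, 64)    # assumed older 64-bit datapath
new = passes_needed(128, 128)   # full-width 128-bit datapath
print("Relative SSE throughput:", old / new)
```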

    ECE 4100/6100 (24)

    Comparison of SSE to prior processors

    Source: Ars Technica


    ECE 4100/6100 (25)

    Intel Conroe Vs Presler

    What is the major difference?
      Shared L2 (Conroe) versus separate caches (Presler)

    Source: Bit Tech

    ECE 4100/6100 (26)

    Intel's Roadmap for Multicore

    Source: Adapted from Tom's Hardware

    [Figure: roadmap timelines for 2006, 2007 and 2008 covering desktop, mobile and enterprise processors. Labels recovered from the figure, one timeline per line (SC = single core, DC = dual core, QC = quad core, 8C = eight core):
      SC 1MB, DC 2MB, DC 2/4MB shared, DC 3MB/6MB shared (45nm)
      DC 2/4MB, DC 2/4MB shared, DC 4MB, DC 3MB/6MB shared (45nm)
      DC 2MB, DC 4MB, DC 16MB, QC 4MB, QC 8/16MB shared, 8C 12MB shared (45nm), SC 512KB/1/2MB]

    Drivers are
      Market segments
      More cache
      More cores
    80 core processor prototype has been designed!


    ECE 4100/6100 (27)

    Intel Chipset Example

    Source: Extreme Tech

    ECE 4100/6100 (28)

    References and Links

    http://www.intel.com/products/processor/coreduo/
    http://en.wikipedia.org/wiki/Intel_Core
    http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1
    http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/
    http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop
    http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/
    http://www.hardcoreware.net/reviews/review-347-1.htm
    http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1
    http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1
    http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1
    http://arstechnica.com/articles/paedia/cpu/core.ars/1
    http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4
    http://www.extremetech.com/article2/0,1697,1988794,00.asp
    http://www.coreduoinfo.com/blog/about-intel-core-duo/
    http://67.91.114.164/intel_c2d_info.htm
    http://www.pcper.com/article.php?aid=272&type=expert


    AMD MultiCore Processors

    ECE 4100/6100 (30)

    Dual Core AMD Opteron

    Source: AMD


    ECE 4100/6100 (31)

    AMD Multicore (Dual-core) Opteron

    Two AMD Opteron CPU cores on a single die
      Each has 1MB L2 cache
    90nm, ~205 million transistors
      Approximately same die size as 130nm single-core AMD Opteron processor
    95 watt power envelope fits into 90nm power infrastructure
    Introduced with K8 Revision E core in April 2005

    [Figure: Core 0 with 1-MB L2, Core 1 with 1-MB L2, Northbridge]

    Source: AMD

    ECE 4100/6100 (32)

    Opteron Core Pipeline

    Source: ChipArchitect


    ECE 4100/6100 (33)

    AMD Opteron Processor Core Architecture

    [Figure: block diagram of the Opteron core. Labeled blocks: Fetch; Branch Prediction; L1 Icache 64KB; L1 Dcache 64KB; Scan/Align/Decode; Fastpath; Microcode Engine; µops; Instruction Control Unit (72 entries); Int Decode & Rename; FP Decode & Rename; three Res blocks; three ALUs; three AGUs; MULT; 36-entry FP scheduler; FADD; FMUL; FMISC; 44-entry Load/Store Queue]

    Source: The 3D shop

    ECE 4100/6100 (34)

    Dual Core AMD Opteron

    AMD64 technology
      Runs 32-bit applications and is 64-bit capable
      Compatible with the x86 software infrastructure
      Enables a single architecture across 32- and 64-bit environments
    Direct Connect Architecture
      NUMA system
        Each processor shares its memory with other processors in the system
      Integrated Memory Controller
        On-die DDR2 DRAM memory controller offers memory BW up to 10.7 GB/s per processor
      HyperTransport
        Point-to-point interconnect can be used to build a mesh of multiple-processor Opteron systems
        Scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
        24.0 GB/s peak bandwidth per processor
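One plausible way to arrive at the two peak figures quoted above is sketched below. The configuration is an assumption, not something the slide states: dual-channel DDR2-667 with 64-bit channels for memory, and three HyperTransport links, each 16 bits wide per direction at 2 GT/s, counted in both directions. The function name is illustrative.

```python
def peak_gb_s(transfers_per_s, bytes_per_transfer, paths=1):
    """Peak bandwidth in GB/s (10^9 bytes/s) summed over parallel paths."""
    return transfers_per_s * bytes_per_transfer * paths / 1e9


# Assumed: 2 memory channels, 8 bytes wide, 667 MT/s each.
memory_bw = peak_gb_s(667e6, 8, paths=2)

# Assumed: 3 links x 2 directions, 2 bytes wide, 2 GT/s each.
hypertransport_bw = peak_gb_s(2e9, 2, paths=3 * 2)

print(f"Memory:         {memory_bw:.1f} GB/s")
print(f"HyperTransport: {hypertransport_bw:.1f} GB/s")
```

Under these assumptions the results line up with the 10.7 GB/s and 24.0 GB/s on the slide; other configurations would give different totals.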


    ECE 4100/6100 (35)

    Dual Core AMD Opteron

    Not a simple aggregation of K8 cores
      Integrated the cores for efficiency
    Dual-core Opteron acts very much like an SMP system
    Compatible with existing single-threaded and multi-threaded (hyperthreaded) software
    MOESI coherency protocol (O = Owned)
      Updates through system request interface
    SSE3 support with 10 new instructions
    Quad-core upgradeability
    Hardware assisted AMD Virtualization
    Optimized Power Management
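The MOESI protocol named above can be illustrated with a small state-transition sketch for a single cache line. This follows the textbook description of MOESI (Modified, Owned, Exclusive, Shared, Invalid) and is not claimed to match the Opteron implementation in detail; the function names are hypothetical.

```python
# States of one cache line in one cache: M, O, E, S, I.

def on_local_read(state, others_have_copy):
    """This core reads the line."""
    if state == "I":
        # Miss: load as Shared if another cache holds it, else Exclusive.
        return "S" if others_have_copy else "E"
    return state  # a hit leaves the state unchanged


def on_local_write(state):
    """This core writes the line; other copies are invalidated."""
    return "M"


def on_remote_read(state):
    """Another core reads the line."""
    # A Modified line becomes Owned: it supplies the dirty data to the
    # reader without first writing it back to memory.
    return {"M": "O", "O": "O", "E": "S", "S": "S", "I": "I"}[state]


def on_remote_write(state):
    """Another core writes the line, so this copy is no longer valid."""
    return "I"


state = on_local_read("I", others_have_copy=False)  # I -> E
state = on_local_write(state)                       # E -> M
state = on_remote_read(state)                       # M -> O
print(state)
```

The Owned state is what distinguishes MOESI from MESI: dirty data can be shared between caches while main memory stays stale.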

    ECE 4100/6100 (36)

    Dual Core AMD Opteron

    Source: Elec Design


    ECE 4100/6100 (37)

    AMD Opteron (SOI)

    Source: Chip Architect

    ECE 4100/6100 (38)

    AMD 64 bit Core

    1MB L2 Cache
    Detailed discussion of the 64-bit core architecture at:
    http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html


    ECE 4100/6100 (41)

    Cache coherency

    Source: Chip Architect

    ECE 4100/6100 (42)

    AMD Athlon 64 X2

    Source: AMD


    ECE 4100/6100 (43)

    References and Links

    http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1
    http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html
    http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml
    http://en.wikipedia.org/wiki/Opteron
    http://en.wikipedia.org/wiki/Athlon_64_X2
    http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html
    http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
    http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp
    http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html
    http://www.extremetech.com/article2/0,1697,1675784,00.asp
    http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991
    http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm
    http://www.nextcomputing.com/advantages/thruadv.shtml
    http://arstechnica.com/news.ars/post/20060817-7535.html
    http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html

    SUN UltraSPARC Multicore


    ECE 4100/6100 (45)

    SUN UltraSPARC T1

    Eight cores, each 4-way threaded
    1.2 GHz
    Cache
      16K 4-way 32B L1-I
      8K 4-way 16B L1-D
      3MB internal L2 cache partitioned into four banks, with four memory controllers
    Data moved between the L2 and the cores using an integrated crossbar switch to provide high throughput

    Source: Sun

    ECE 4100/6100 (46)

    SUN UltraSPARC T1

    Source: Sun


    ECE 4100/6100 (47)

    SUN UltraSPARC T1 Pipeline

    T1's integer pipeline
      Fetch, Thread Selection, Decode, Execute, Memory Access, Writeback
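The Thread Selection stage is what lets four threads share one core pipeline. A minimal sketch of one possible policy follows: each cycle, pick the next thread in round-robin order that is not stalled. Round-robin is an assumption made for illustration (the slide does not state the selection policy), and the function name and input format are hypothetical.

```python
def select_threads(num_threads, stalled_by_cycle):
    """Pick one ready thread per cycle in round-robin order.

    stalled_by_cycle: one set of stalled thread ids per cycle.
    Returns the chosen thread id per cycle, or None if every thread
    is stalled in that cycle (a pipeline bubble).
    """
    last = -1  # so that thread 0 is considered first
    issued = []
    for stalled in stalled_by_cycle:
        choice = None
        for offset in range(1, num_threads + 1):
            candidate = (last + offset) % num_threads
            if candidate not in stalled:
                choice = candidate
                break
        issued.append(choice)
        if choice is not None:
            last = choice
    return issued


# Four threads; thread 2 is stalled (say, on a cache miss) in cycle 2.
schedule = select_threads(4, [set(), set(), {2}, set(), {0, 1, 2, 3}])
print(schedule)
```

A stalled thread is simply skipped, so one thread's memory latency is hidden by work from the others as long as at least one thread is ready.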

    Source: Sun

    ECE 4100/6100 (48)

    SUN UltraSPARC T2 Niagara 2

    Source: Sun


    ECE 4100/6100 (51)

    UltraSparc T2 Memory System

    Source: Sun

    ECE 4100/6100 (52)

    UltraSparc T2 Core Block Diagram

    IFU: Instruction Fetch Unit
      16 KB I$, 32B lines, 8-way SA
      64-entry fully-associative ITLB
    EXU0/1: Integer Execution Units
      4 threads share each unit
      Executes one integer instruction/cycle
    LSU: Load/Store Unit
      8KB D$, 16B lines, 4-way SA
      128-entry fully-associative DTLB
    FGU: Floating/Graphics Unit
    SPU: Stream Processing Unit
      Cryptographic acceleration
    TLU: Trap Logic Unit
      Updates machine state, handles exceptions and interrupts
    MMU: Memory Management Unit
      Hardware tablewalk (HWTW)
      8KB, 64KB, 4MB, 256MB pages

    Source: Sun


    ECE 4100/6100 (53)

    UltraSparc T2 Core Pipeline

    8 stages for integer operations:
      Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Writeback
      3-cycle load-use
      Memory (translation, tag/data access)
      Bypass (late select, formatting)
    12 stages for floating-point:
      Fetch, Cache, Pick, Decode, Execute, FX1, FX2, FX3, FX4, FX5, FB, FW
      6-cycle latency for dependent FP ops
      Longer pipeline for divide/sqrt

    ECE 4100/6100 (54)

    References and Links

    http://realworldtech.com/page.cfm?ArticleID=RWT090406012516&p=4

    http://www.opensparc.net/cgi-bin/goto.php?w=/pubs/preszo/06/HotChips06_09_ppt_master.pdf

    http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf


    The Embedded Multicores

    ECE 4100/6100 (56)

    Freescale MPC8572 PowerQUICC III Processor

    Source: Freescale


    ECE 4100/6100 (57)

    Freescale MPC8572 PowerQUICC III Processor

    Dual embedded e500 core
      36-bit physical addressing
      Double-precision floating-point
    Integrated L1/L2 cache
      L1 cache: 32 KB data and 32 KB instruction
      Shared L2 cache: 1 MB with ECC
      L2 configurable as SRAM or cache; I/O transactions can be stashed into L2 cache regions
    Integrated DDR memory controller with full ECC support
    Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
    Four on-chip triple-speed Ethernet controllers

    ECE 4100/6100 (58)

    References and Links

    http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf


    ECE 4100/6100 (59)

    Summary

    Multicore technology spans the product spectrum
      The downward migration of leading edge technology continues
    Architectural principles are key to
      Developers: extracting performance
      Designers: improving performance
      Marketing: understanding new markets for performance
    Research spans the spectrum of software, security, reliability, parallelism, virtualization and much more!