S2 1405 Schooler Presentation

  • View
    216

  • Download
    0

Embed Size (px)

Text of S2 1405 Schooler Presentation

  • 8/3/2019 S2 1405 Schooler Presentation

    1/35

    Tile Processors:Many-Core for Embedded

    and Cloud Computing

    Richard SchoolerVP Software Engineering

    Tilera Corporation

    rschooler@tilera.com

  • 8/3/2019 S2 1405 Schooler Presentation

    2/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.2

    Exploiting Natural Parallelism

    High-performance applications have lots ofparallelism! Embedded apps:

    Networking: packets, flows

    Media: streams, images, functional & data parallelism Cloud apps:

    Many clients: network sessions Data mining: distributed data & computation

    Lots of different levels: SIMD (fine-grain data parallelism) Thread/process (medium-grain task parallelism) Distributed system (coarse-grain job parallelism)

  • 8/3/2019 S2 1405 Schooler Presentation

    3/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.3

    Every one is going Manycore, but can thearchitecture scale?

    The computing world

    is ready for radicalchange

    Time

    #Cores

    2010

    1

    2006

    2

    4

    32

    20202005

    Intel

    Sun IBM Cell

    nCores

    Larrabee

    Performance

    &

    Performance/WGap

    2014

    Performance

  • 8/3/2019 S2 1405 Schooler Presentation

    4/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.4

    Key Many-Core Challenges: The 3 Ps

    Performance challenge How to scale from 1 to 1000 cores the number of

    cores is the new Megahertz

    Power efficiency challenge Performance per watt is the new metric systems are

    often constrained by power & cooling

    Programming challenge How to provide a converged many core solution in a

    standard programming environment

  • 8/3/2019 S2 1405 Schooler Presentation

    5/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.5

    Problems cannot be solved by the samelevel of thinking that created them.

    Current technologies fail to deliver Incremental performance increase High power Low level of Integration

    Increasingly bigger cores

    We need to have a new thinking to get 10 x performance

    10 x performance per watt Converged computing Standard programming models

    http://www.google.com.hk/imgres?imgurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/albert_einstein.jpg&imgrefurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/quotes_by_albert_einstein.html&usg=__LvhalC7aLpumHqxWYGcDvid_Rbo=&h=288&w=460&sz=33&hl=zh-CN&start=54&um=1&itbs=1&tbnid=o47CTU5Z_NLSSM:&tbnh=80&tbnw=128&prev=/images%3Fq%3Dalbert%2Beinstein%26start%3D40%26um%3D1%26hl%3Dzh-CN%26newwindow%3D1%26safe%3Dstrict%26sa%3DN%26rls%3DIBMA,IBMA:2006-45,IBMA:en%26ndsp%3D20%26tbs%3Disch:1http://www.google.com.hk/imgres?imgurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/albert_einstein.jpg&imgrefurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/quotes_by_albert_einstein.html&usg=__LvhalC7aLpumHqxWYGcDvid_Rbo=&h=288&w=460&sz=33&hl=zh-CN&start=54&um=1&itbs=1&tbnid=o47CTU5Z_NLSSM:&tbnh=80&tbnw=128&prev=/images%3Fq%3Dalbert%2Beinstein%26start%3D40%26um%3D1%26hl%3Dzh-CN%26newwindow%3D1%26safe%3Dstrict%26sa%3DN%26rls%3DIBMA,IBMA:2006-45,IBMA:en%26ndsp%3D20%26tbs%3Disch:1
  • 8/3/2019 S2 1405 Schooler Presentation

    6/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.6

    Stepping Back: How Did We Get Here?

    Moores Conundrum:

    More devices =>? More performance

    Old answers: More complex cores; biggercaches But power-hungry

    New answers: More cores But do conventional approaches scale?

    Diminishing returns!

  • 8/3/2019 S2 1405 Schooler Presentation

    7/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.7

    The Old Challenge: CPU-on-a-chip

    20 MIPS CPUin 1987

    Few thousand gates

  • 8/3/2019 S2 1405 Schooler Presentation

    8/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.8

    What to do with all those transistors?

    The Opportunity: Billions of Transistors

    OldCPU:

  • 8/3/2019 S2 1405 Schooler Presentation

    9/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.9

    ASICs have high performance and low power

    Custom-routed, short wires Lots of ALUs, registers, memories huge on-chip parallelism

    memmem

    mem

    mem

    mem

    Take Inspiration from ASICs

    But how to build a programmable chip?

  • 8/3/2019 S2 1405 Schooler Presentation

    10/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.10

    Replace Long Wires with RoutedInterconnect

    Ctrl

    [IEEE Computer 97]

  • 8/3/2019 S2 1405 Schooler Presentation

    11/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.11

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    Bypass Net

    RF

    From Centralized Clump of CPUs

  • 8/3/2019 S2 1405 Schooler Presentation

    12/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.12

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    R

    Scalar Operand Network (SON) [TPDS 2005]

    To Distributed ALUs, Routed Bypass

    Network

  • 8/3/2019 S2 1405 Schooler Presentation

    13/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.13

    From a Large Centralized Cache

  • 8/3/2019 S2 1405 Schooler Presentation

    14/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.14

    to a Distributed Shared Cache

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    R

    $

    [ISCA 1999]

  • 8/3/2019 S2 1405 Schooler Presentation

    15/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.15

    Distributed Everything + RoutedInterconnect Tiled Multicore

    A

    LU

    ALU

    ALU

    ALU

    ALU

    A

    LU

    R

    $

    Each tile is a processor, so programmable

  • 8/3/2019 S2 1405 Schooler Presentation

    16/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.16

    Tiled Multicore Captures ASIC Benefitsand is Programmable

    Scales to large numbers of cores Modular design and verify 1 tile Power efficient

    Short wires plus locality opts

    CV2f Chandrakasan effect, more cores at

    lower freq and voltage CV2f

    ProcessorCore

    = TileCore + Switch

    SCurrent Bus Architecture

  • 8/3/2019 S2 1405 Schooler Presentation

    17/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.17

    Tilera processor portfolioDemonstrating the scale of many-core

    200920082007 2010

    TILEPro36

    TILE64

    TILEPro64

    Gx64 & Gx100Up to 8x performance

    Gx16 & Gx362x the performance

    . . .

  • 8/3/2019 S2 1405 Schooler Presentation

    18/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.18

    Memory Controller (DDR3) Memory Controller (DDR3)

    Memory Controller (DDR3) Memory Controller(DDR3)

    mPIPE

    1.2GHz 1.5GHz 32 MBytes total cache

    546 Gbps peak mem BW

    200 Tbps iMesh BW

    80-120 Gbps packet I/O 8 ports XAUI / 2 XAUI

    2 40Gb Interlaken

    32 ports 1GbE (SGMII)

    80 Gbps PCIe I/O 3 StreamIO ports (20Gb)

    Wire-speed packet eng. 120Mpps

    MiCA engines:

    40 Gbps crypto

    compress & decompress

    FlexibleI/O

    UART x2,USB x2,JTAG,I2C, SPI

    MiCA

    MiCA

    Se

    rDes

    PCIe 2.0

    8-lane

    SerDes

    PCIe 2.04-lane

    SerDes

    PCIe 2.08-lane

    Interlaken

    Interlaken

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    TILE-Gx100:Complete System-on-a-Chip with 100 64-bit cores

  • 8/3/2019 S2 1405 Schooler Presentation

    19/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.19

    36 Processor Cores 866M, 1.2GHz, 1.5GHz clk

    12 MBytes total cache

    40 Gbps total packet I/O

    4 ports 10GbE (XAUI)

    16 ports 1GbE (SGMII)

    48 Gbps PCIe I/O

    2 16Gbps Stream IO ports

    Wire-speed packet engine

    60Mpps

    MiCA engine:

    20 Gbps crypto

    Compress & decompress

    Flexible

    I/O

    UARTx2,USBx2,JTAG,I2C, SPI

    MiCA

    mPIPE

    10 GbEXAUI

    SerDes

    4x GbESGMII

    10 GbEXAUI

    Se

    rDes

    4x GbESGMII

    10 GbEXAUI

    SerDes

    4x GbESGMII

    10 GbEXAUI

    SerDes

    4x GbESGMII

    SerDesPCIe 2.0

    8-lane

    SerDes

    PCIe 2.04-lane

    SerDes

    PCIe 2.04-lane

    Memory Controller (DDR3)

    Memory Controller (DDR3)

    TILE-Gx36:Scaling to a broad range of applications

  • 8/3/2019 S2