Transcript
  • 8/3/2019 S2 1405 Schooler Presentation

    1/35

    Tile Processors:Many-Core for Embedded

    and Cloud Computing

    Richard SchoolerVP Software Engineering

    Tilera Corporation

    [email protected]

  • 8/3/2019 S2 1405 Schooler Presentation

    2/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.2

    Exploiting Natural Parallelism

    High-performance applications have lots ofparallelism! Embedded apps:

    Networking: packets, flows

    Media: streams, images, functional & data parallelism Cloud apps:

    Many clients: network sessions Data mining: distributed data & computation

    Lots of different levels: SIMD (fine-grain data parallelism) Thread/process (medium-grain task parallelism) Distributed system (coarse-grain job parallelism)

  • 8/3/2019 S2 1405 Schooler Presentation

    3/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.3

    Every one is going Manycore, but can thearchitecture scale?

    The computing world

    is ready for radicalchange

    Time

    #Cores

    2010

    1

    2006

    2

    4

    32

    20202005

    Intel

    Sun IBM Cell

    nCores

    Larrabee

    Performance

    &

    Performance/WGap

    2014

    Performance

  • 8/3/2019 S2 1405 Schooler Presentation

    4/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.4

    Key Many-Core Challenges: The 3 Ps

    Performance challenge How to scale from 1 to 1000 cores the number of

    cores is the new Megahertz

    Power efficiency challenge Performance per watt is the new metric systems are

    often constrained by power & cooling

    Programming challenge How to provide a converged many core solution in a

    standard programming environment

  • 8/3/2019 S2 1405 Schooler Presentation

    5/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.5

    Problems cannot be solved by the samelevel of thinking that created them.

    Current technologies fail to deliver Incremental performance increase High power Low level of Integration

    Increasingly bigger cores

    We need to have a new thinking to get 10 x performance

    10 x performance per watt Converged computing Standard programming models

    http://www.google.com.hk/imgres?imgurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/albert_einstein.jpg&imgrefurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/quotes_by_albert_einstein.html&usg=__LvhalC7aLpumHqxWYGcDvid_Rbo=&h=288&w=460&sz=33&hl=zh-CN&start=54&um=1&itbs=1&tbnid=o47CTU5Z_NLSSM:&tbnh=80&tbnw=128&prev=/images%3Fq%3Dalbert%2Beinstein%26start%3D40%26um%3D1%26hl%3Dzh-CN%26newwindow%3D1%26safe%3Dstrict%26sa%3DN%26rls%3DIBMA,IBMA:2006-45,IBMA:en%26ndsp%3D20%26tbs%3Disch:1http://www.google.com.hk/imgres?imgurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/albert_einstein.jpg&imgrefurl=http://www.damonchernavsky.com/Famous_Quotes/Quotes_By_Albert_Einstein/quotes_by_albert_einstein.html&usg=__LvhalC7aLpumHqxWYGcDvid_Rbo=&h=288&w=460&sz=33&hl=zh-CN&start=54&um=1&itbs=1&tbnid=o47CTU5Z_NLSSM:&tbnh=80&tbnw=128&prev=/images%3Fq%3Dalbert%2Beinstein%26start%3D40%26um%3D1%26hl%3Dzh-CN%26newwindow%3D1%26safe%3Dstrict%26sa%3DN%26rls%3DIBMA,IBMA:2006-45,IBMA:en%26ndsp%3D20%26tbs%3Disch:1
  • 8/3/2019 S2 1405 Schooler Presentation

    6/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.6

    Stepping Back: How Did We Get Here?

    Moores Conundrum:

    More devices =>? More performance

    Old answers: More complex cores; biggercaches But power-hungry

    New answers: More cores But do conventional approaches scale?

    Diminishing returns!

  • 8/3/2019 S2 1405 Schooler Presentation

    7/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.7

    The Old Challenge: CPU-on-a-chip

    20 MIPS CPUin 1987

    Few thousand gates

  • 8/3/2019 S2 1405 Schooler Presentation

    8/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.8

    What to do with all those transistors?

    The Opportunity: Billions of Transistors

    OldCPU:

  • 8/3/2019 S2 1405 Schooler Presentation

    9/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.9

    ASICs have high performance and low power

    Custom-routed, short wires Lots of ALUs, registers, memories huge on-chip parallelism

    memmem

    mem

    mem

    mem

    Take Inspiration from ASICs

    But how to build a programmable chip?

  • 8/3/2019 S2 1405 Schooler Presentation

    10/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.10

    Replace Long Wires with RoutedInterconnect

    Ctrl

    [IEEE Computer 97]

  • 8/3/2019 S2 1405 Schooler Presentation

    11/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.11

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    Bypass Net

    RF

    From Centralized Clump of CPUs

  • 8/3/2019 S2 1405 Schooler Presentation

    12/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.12

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    R

    Scalar Operand Network (SON) [TPDS 2005]

    To Distributed ALUs, Routed Bypass

    Network

  • 8/3/2019 S2 1405 Schooler Presentation

    13/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.13

    From a Large Centralized Cache

  • 8/3/2019 S2 1405 Schooler Presentation

    14/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.14

    to a Distributed Shared Cache

    ALU

    ALU

    ALU

    ALU

    ALU

    ALU

    R

    $

    [ISCA 1999]

  • 8/3/2019 S2 1405 Schooler Presentation

    15/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.15

    Distributed Everything + RoutedInterconnect Tiled Multicore

    A

    LU

    ALU

    ALU

    ALU

    ALU

    A

    LU

    R

    $

    Each tile is a processor, so programmable

  • 8/3/2019 S2 1405 Schooler Presentation

    16/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.16

    Tiled Multicore Captures ASIC Benefitsand is Programmable

    Scales to large numbers of cores Modular design and verify 1 tile Power efficient

    Short wires plus locality opts

    CV2f Chandrakasan effect, more cores at

    lower freq and voltage CV2f

    ProcessorCore

    = TileCore + Switch

    SCurrent Bus Architecture

  • 8/3/2019 S2 1405 Schooler Presentation

    17/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.17

    Tilera processor portfolioDemonstrating the scale of many-core

    200920082007 2010

    TILEPro36

    TILE64

    TILEPro64

    Gx64 & Gx100Up to 8x performance

    Gx16 & Gx362x the performance

    . . .

  • 8/3/2019 S2 1405 Schooler Presentation

    18/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.18

    Memory Controller (DDR3) Memory Controller (DDR3)

    Memory Controller (DDR3) Memory Controller(DDR3)

    mPIPE

    1.2GHz 1.5GHz 32 MBytes total cache

    546 Gbps peak mem BW

    200 Tbps iMesh BW

    80-120 Gbps packet I/O 8 ports XAUI / 2 XAUI

    2 40Gb Interlaken

    32 ports 1GbE (SGMII)

    80 Gbps PCIe I/O 3 StreamIO ports (20Gb)

    Wire-speed packet eng. 120Mpps

    MiCA engines:

    40 Gbps crypto

    compress & decompress

    FlexibleI/O

    UART x2,USB x2,JTAG,I2C, SPI

    MiCA

    MiCA

    Se

    rDes

    PCIe 2.0

    8-lane

    SerDes

    PCIe 2.04-lane

    SerDes

    PCIe 2.08-lane

    Interlaken

    Interlaken

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    10 GbEXAUI S

    erDes4x GbE

    SGMII

    TILE-Gx100:Complete System-on-a-Chip with 100 64-bit cores

  • 8/3/2019 S2 1405 Schooler Presentation

    19/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.19

    36 Processor Cores 866M, 1.2GHz, 1.5GHz clk

    12 MBytes total cache

    40 Gbps total packet I/O

    4 ports 10GbE (XAUI)

    16 ports 1GbE (SGMII)

    48 Gbps PCIe I/O

    2 16Gbps Stream IO ports

    Wire-speed packet engine

    60Mpps

    MiCA engine:

    20 Gbps crypto

    Compress & decompress

    Flexible

    I/O

    UARTx2,USBx2,JTAG,I2C, SPI

    MiCA

    mPIPE

    10 GbEXAUI

    SerDes

    4x GbESGMII

    10 GbEXAUI

    Se

    rDes

    4x GbESGMII

    10 GbEXAUI

    SerDes

    4x GbESGMII

    10 GbEXAUI

    SerDes

    4x GbESGMII

    SerDesPCIe 2.0

    8-lane

    SerDes

    PCIe 2.04-lane

    SerDes

    PCIe 2.04-lane

    Memory Controller (DDR3)

    Memory Controller (DDR3)

    TILE-Gx36:Scaling to a broad range of applications

  • 8/3/2019 S2 1405 Schooler Presentation

    20/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.20

    Full-Featured General Converged Cores

    Processor Each core is a complete computer 3-way VLIW CPU SIMD instructions: 32, 16, and 8-bit ops Instructions for video (e.g., SAD) and

    networking Protection and interrupts

    Memory L1 cache and L2 Cache Virtual and physical address space Instruction and data TLBs

    Cache integrated 2D DMA engine

    Runs SMP Linux

    Runs off-the-shelf C/C++ programs

    Signal processing and general apps

    Core

    Cache

    16K L1-I

    64K L2

    I-TLB

    D-TLB

    2DDMA8K L1-D

    Register File

    Three Execution Pipelines

    TerabitSwitch

  • 8/3/2019 S2 1405 Schooler Presentation

    21/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.21

    Software must complement the hardwareEnable re-use of existing code-bases

    Standards-based development environment e.g. gcc, C, C++, Java, Linux

    Comprehensive command-line & GUI-based tools

    Support multiple OS models

    One OS running SMP Multiple virtualized OSs with protection

    Bare metal or zero-overhead with background OS environment

    Support range of parallel programming styles Threaded programming (pThreads, TBB)

    Run-to-Completion with load-balancing

    Decomposition & Pipelining

    Higher-level frameworks (Erlang, OpenMP, Hadoop etc.)

    http://images.google.com/imgres?imgurl=http://www.alistairdickie.com/overlay/images/stories/linux-logo.gif&imgrefurl=http://www.alistairdickie.com/overlay/index.php%3Foption%3Dcom_content%26task%3Dblogcategory%26id%3D15%26Itemid%3D26&h=446&w=473&sz=35&hl=en&start=1&um=1&tbnid=UCR_ba2bRHrHqM:&tbnh=122&tbnw=129&prev=/images%3Fq%3DLinux%2Blogo%26svnum%3D10%26um%3D1%26hl%3Den%26rls%3DIBMA,IBMA:2006-45,IBMA:en
  • 8/3/2019 S2 1405 Schooler Presentation

    22/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.22

    Software Roadmap

    Standards & open source integration Compiler: gcc, g++ 4.4+

    Linux:

    Kernel: Tile architecture integrated to 2.6.36 User-space: glibc, broader set of standard

    packages

    Extended programming and runtime

    environments Java: porting OpenJDK

    Virtualization: porting KVM

  • 8/3/2019 S2 1405 Schooler Presentation

    23/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.23

    Tile architecture:The future of many-core computing

    Multicore is the way forward But we need the right architecture to utilize it

    The Tile architecture addresses the challenges Scales to 100s of cores

    Delivers very low power

    Runs your existing code

    Standards-based software Familiar tools

    Full range of standard programming environments

  • 8/3/2019 S2 1405 Schooler Presentation

    24/35

    Thank you!

    Questions?

  • 8/3/2019 S2 1405 Schooler Presentation

    25/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.25

    Research Vision to Commercial Product

    2002

    2007

    100Btransistors

    2018

    1BTransistors

    in 2007

    1996

    The opportunity

    CPU

    Mem

    1997

    A blankslate

    MIT Raw16 cores Tile Processor

    64 cores

    MemoryController (DDR3) MemoryController (DDR3)

    MemoryController (DDR3) MemoryController(DDR3)

    mPIPE

    FlexibleI/O

    UART x2,USB x2,JTAG,I2C,SPI

    MiCA

    MiCA

    SerDes

    PCIe2.08-lane

    SerDes

    PCIe2.04-lane

    SerDes

    PCIe2.08-lane

    Interlaken

    Interlaken

    10GbEXAUI S

    erDes

    4xGbESGMII

    10GbEXAUI S

    erDes4xGbE

    SGMII

    10GbEXAUI S

    erDes4xGbE

    SGMII

    10GbEXAUI S

    erDes4xGbE

    SGMII

    10GbEXAUI S

    erDes

    4xGbESGMII

    10GbEXAUI S

    erDes4xGbE

    SGMII

    10GbEXAUI S

    erDes4xGbE

    SGMII

    10GbEXAUI S

    erD

    es4xGbE

    SGMII

    TILE-Gx100100 cores

    The future?

    2010

  • 8/3/2019 S2 1405 Schooler Presentation

    26/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.26

    Standard tools and programming model

    Multicore Development Environment

    Standard application stack

    Standard programming SMP Linux 2.6

    ANSI C/C++

    Java, PHP

    Integrated tools GCC compiler

    Standard gdb gprof

    Eclipse IDE

    Innovative tools Multicore debug

    Multicore profile

    Standards-based tools

    Application layer Open source apps

    Standard C/C++ libs

    Operating System layer 64-way SMP Linux

    Zero Overhead Linux

    Bare metal environment

    Hypervisor layer Virtualizes hardware

    I/O device drivers

    Tile Processor

    Tile Tile Tile Tile

    Hypervisor

    Operating System

    Applications

    Virtualization and highspeed I/O drivers

    kernel drivers

    Applicationslibraries

    http://images.google.com/imgres?imgurl=http://www.alistairdickie.com/overlay/images/stories/linux-logo.gif&imgrefurl=http://www.alistairdickie.com/overlay/index.php%3Foption%3Dcom_content%26task%3Dblogcategory%26id%3D15%26Itemid%3D26&h=446&w=473&sz=35&hl=en&start=1&um=1&tbnid=UCR_ba2bRHrHqM:&tbnh=122&tbnw=129&prev=/images%3Fq%3DLinux%2Blogo%26svnum%3D10%26um%3D1%26hl%3Den%26rls%3DIBMA,IBMA:2006-45,IBMA:en
  • 8/3/2019 S2 1405 Schooler Presentation

    27/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.27

    Standard Software Stack

    Compiler, OSHypervisor

    LanguageSupport

    InfrastructureApps

    Perl

    gcc & g++

    Commercial LinuxDistribution

    ManagementProtocols IPMI 2.0

    Transcoding

    NetworkMonitoring

  • 8/3/2019 S2 1405 Schooler Presentation

    28/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.28

    High single core performanceComparable to Atom & ARM Cortex-A9 cores

    - Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php- TILE-Gx and single thread Atom results were measured in Tilera labs- Single core, single thread result for ARM is calculated based on chip scores

    Single-Core Single thread CoreMark Comparison

    -

    500

    1,000

    1,500

    2,000

    2,500

    3,000

    3,500

    Tilera

    TILEPro64

    866 MHz

    Tilera

    TILE-Gx36

    1.25 GHz

    ARM

    Cortex-A9

    1 GHz

    Intel

    Atom N270

    1600 MHz

    CoreMarkScore

    http://coremark.org/home.phphttp://coremark.org/home.php
  • 8/3/2019 S2 1405 Schooler Presentation

    29/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.29 29

    Significant value across multiple markets

    Networking Multimedia Wireless Cloud Classification L4-7 Services Load Balancing

    Monitoring/QoS Security

    VideoConferencing

    Media Streaming

    Transcoding

    Base Station Media Gateway Service Nodes

    Test Equipment

    Apache Memcached Web Applications

    LAMP stack

    High Performance

    Low Power

    Standard Programming

    Over 100 customersOver 40 customers going into production

    Tier 1 customers in all target markets

  • 8/3/2019 S2 1405 Schooler Presentation

    30/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.30

    Targeting markets with highly parallelapplications

    Web ServingIn Memory Cache

    Data Mining

    Web

    TranscodingVideo deliveryWireless media

    Media delivery

    Lawful interceptionSurveillance

    Other

    Government

    Common ThemesHundreds and Thousands of servers running each application

    thousands of parallel transactionsAll need better performance and power efficiency

  • 8/3/2019 S2 1405 Schooler Presentation

    31/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.31

    200 Tbps on-chip bandwidth

    2 Dimensional mesh network

    Scales to large numbers of cores

    Modular: Design-and-verify 1 tile

    Power efficient:

    Short wires & locality optimize CV2f Chandrakasan effect, more cores at

    lower freq and voltage CV2f

    Core + Switch = Tile

    Traditional Bus/Ring Architecture

    S

    ProcessorCore

    The Tile Processor ArchitectureMesh interconnect, power-optimized cores

  • 8/3/2019 S2 1405 Schooler Presentation

    32/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.32

    Global 64-bit address space

    Distributed caches Big centralized caches dont scale Contention

    Long latency

    High power

    Distributed caches have numerousbenefits

    Lower power (less logic lit-up per access)

    Exploit locality (local L1 & L2) Can exploit various cache placement

    mechanisms to enhance performance

    Distributed everythingCache, memory management, connectivity

    CompletelyHW Coherent

    CacheSystem

  • 8/3/2019 S2 1405 Schooler Presentation

    33/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.33

    Highest compute density

    2U form factor

    4 hot pluggable modules

    8 Tilera TILEPro processors

    512 general purpose cores

    1.3 trillion operations /sec

  • 8/3/2019 S2 1405 Schooler Presentation

    34/35

    HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.34

    Power efficient and eco-friendly server

    10,000 cores in a 8 Kilowatt rack

    35-50 watts max per node Server power of 400 watts

    90%+ efficient power supplies

    Shared fans and power supplies

  • 8/3/2019 S2 1405 Schooler Presentation

    35/35

    HPEC 15 September 2010

    Coherent distributed cache system

    Globally Shared Physical Address Space Full Hardware Cache Coherence Standard shared memory programming model

    Distributed cache Each tile has local L1 and L2 caches Aggregate of L2 serves as a globally shared L3

    Any cache block can be replicated locally Hardware tracks sharers, invalidates stale copies

    Dynamic Distributed Cache (DDC) Memory pages distributed across all cores or

    homed by allocating core

    Coherent I/O Hardware maintains coherence I/O reads/writes coherent with tile caches Reads/writes delivered by HW to home cache Header/packet delivered directly to tile caches

    SerDes

    GbE 0

    GbE 1

    FlexibleI/O

    UARTJTAG

    SPI, I2C

    DDR2 Controller 3

    DDR2 Controller 0

    DDR2 Controller 2

    DDR2 Controller 1

    PCIe 0

    SerDes

    PCIe 1

    XAUI 0

    SerDes

    XAUI 1

    SerDes

    CompletelyCoherentSystem