2010.Walalch.oct


Slide 1/34

    Computer Software:

    The 'Trojan Horse' of HPC

    Steve Wallach

swallach at conveycomputer dot com - Convey Computer

Slide 2/34

    Discussion

- The state of HPC software today
- What Convey Computer is doing
- The path to Exascale Computing

Slide 3/34

    Mythology

In the old days, we were told:

"Beware of Greeks bearing gifts"

Slide 4/34

    Today

    Beware of Geeks bearing Gifts

Slide 5/34

What problems are we solving?

New Hardware Paradigms:
- Uniprocessor performance is leveling off
- MPP and multi-threading for the masses

Slide 6/34

Déjà Vu: Multi-Core Evolves

- Many-core as ILP fizzles
- x86 extended with SSE2, SSE3, and SSE4 application-specific enhancements
  - Co-processor within the x86 micro-architecture
  - Basically performance enhancements via on-chip parallel instructions
  - Instructions for specific application acceleration: one application instruction replaces MANY generic instructions
- "Déjà vu all over again" (Yogi Berra): as in the 1980s, we need more performance than the micro delivers
  - GPU, CELL, and FPGAs, each a different software environment
- Heterogeneous Computing, AGAIN

Slide 7/34

    Current Languages

Fortran: Fortran 66, Fortran 77, Fortran 95/2003, High Performance Fortran, Co-Array Fortran

C family: C, C++, UPC, Stream C, C# (Microsoft), Ct (Intel)

Slide 8/34

    Another Bump in the Road

GPGPUs are very cost-effective for many applications.

Matrix multiply in Fortran:

do i = 1, n1
  do k = 1, n3
    c(i,k) = 0.0
    do j = 1, n2
      c(i,k) = c(i,k) + a(i,j) * b(j,k)
    enddo
  enddo
enddo

Slide 9/34

    PGI Fortran to CUDA

Simplified matrix multiplication in CUDA, using a tiled algorithm:

__global__ void matmulKernel( float* C, float* A, float* B, int N2, int N3 )
{
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int aFirst = 16 * by * N2;
    int bFirst = 16 * bx;
    float Csub = 0;
    for( int j = 0; j < N2; j += 16 ) {
        __shared__ float Atile[16][16], Btile[16][16];
        Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
        Btile[ty][tx] = B[bFirst + j*N3 + N3 * ty + tx];
        __syncthreads();
        for( int k = 0; k < 16; ++k )
            Csub += Atile[ty][k] * Btile[k][tx];
        __syncthreads();
    }
    int c = N3 * 16 * by + 16 * bx;
    C[c + N3 * ty + tx] = Csub;
}

void matmul( float* A, float* B, float* C, size_t N1, size_t N2, size_t N3 )
{
    void *devA, *devB, *devC;
    cudaSetDevice(0);
    cudaMalloc( &devA, N1*N2*sizeof(float) );
    cudaMalloc( &devB, N2*N3*sizeof(float) );
    cudaMalloc( &devC, N1*N3*sizeof(float) );
    cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );
    dim3 threads( 16, 16 );
    dim3 grid( N1 / threads.x, N3 / threads.y );
    matmulKernel<<< grid, threads >>>( (float*)devC, (float*)devA, (float*)devB, N2, N3 );
    cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );
}

Michael Wolfe, Portland Group, "How We Should Program GPGPUs", Linux Journal, November 1st, 2008
http://www.linuxjournal.com/article/10216

Pornographic Programming: can't define it, but you know it when you see it.

Slide 10/34

Find the Accelerator Programmer

Accelerators can be beneficial, but they aren't free (like waiting for the next clock speed boost was):
- Worst case, you will have to completely rethink your algorithms and/or data structures
- Performance tuning is still time consuming
- Don't forget our long history of parallel computing...

Pat McCormick, LANL, "One of These Things Isn't Like the Other... Now What?", Fall Creek Falls Conference, Chattanooga, TN, Sept. 2009

Slide 11/34

    Programming Model

- Tradeoff: programmer productivity vs. performance
- Web programming is mostly done with scripting and interpretive languages (Java, JavaScript) and server-side languages (Python, Ruby, etc.)
- Matlab users trade off productivity for performance
- Moore's Law helps performance; Moore's Law hurts productivity (multi-core)
- What languages are being used?
  http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

Slide 12/34

    Interpretive Languages

- User cycles are more important than computer cycles
- We need ISAs that model and directly support these interpretive languages
- FPGA cores to support extensions to the standard ISA
- Language Directed Design:
  - ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park, MD
  - FAP, the Fortran Assembly Program (IBM 7090)
  - B1700 S-languages (Burroughs)
  - Architecture defined by compiler writers

Slide 13/34

    Kathy Yelick, 2008 Keynote, Salishan Conference

    NO

Slide 14/34

    Introduction to Convey Computer

- Developer of the HC-1 hybrid-core computer system
- Leverages the Intel x86 ecosystem
- FPGA-based coprocessor for performance & efficiency
- Experienced team

Slide 15/34

The Convey Hybrid-Core Computer

- Extends the x86 ISA with the performance of a hardware-based architecture
- Adapts to application workloads
- Programmed in ANSI standard C/C++ and Fortran
- Leverages the x86 ecosystem

Slide 16/34

Convey Product

- Reconfigurable co-processor to the Intel x86-64
- Shared 64-bit virtual and physical memory (cache coherent)
- Coprocessor executes instructions that are viewed as extensions to the x86 ISA
- Convey-developed compilers: C, C++, & Fortran (based on Open64)
  - Automatic vectorization/parallelization: SIMD and multi-threading
  - Generates both x86 and coprocessor instructions

Slide 17/34

Inside the Coprocessor

[Block diagram: a host interface and instruction fetch/decode feed scalar processing and the Application Engines; a crossbar connects them to eight memory controllers]

- Application Engines (AEs): personalities dynamically loaded into the AEs implement application-specific instructions
- 16 DDR2 memory channels, standard or Scatter-Gather DIMMs, 80 GB/sec throughput
- System interface and memory management implemented by the coprocessor infrastructure; direct I/O interface
- Crossbar: non-blocking, virtual output queuing, round-robin arbitration, 31-way interleave

Slide 18/34

Convey Scatter-Gather DIMMs

- Standard DIMMs are optimized for cache line transfers
  - performance drops dramatically when the access pattern is strided or random
- Convey Scatter-Gather DIMMs are optimized for 8-byte transfers
  - deliver high bandwidth for random or strided 64-bit accesses
  - prime number (31) interleave maintains performance for power-of-two strides (see the sketch below)
- Supports both the SIMD and parallel multi-threading compute models
- Out-of-order loads and stores
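To see why a prime interleave helps, count how many banks a strided access stream actually touches. A minimal C sketch (the bank counts, strides, and access counts here are illustrative assumptions, not Convey's actual controller logic):

#include <stdio.h>

/* Count how many distinct banks a strided stream touches,
   assuming a word at address a lands in bank (a % nbanks). */
static int banks_touched(int nbanks, long stride, long naccesses) {
    int used[64] = {0}, count = 0;
    for (long a = 0, i = 0; i < naccesses; i++, a += stride) {
        long b = a % nbanks;
        if (!used[b]) { used[b] = 1; count++; }
    }
    return count;
}

int main(void) {
    long strides[] = {1, 2, 8, 64, 1024};
    for (int s = 0; s < 5; s++)
        printf("stride %4ld: 32 banks -> %2d used, 31 banks -> %2d used\n",
               strides[s],
               banks_touched(32, strides[s], 1024),
               banks_touched(31, strides[s], 1024));
    return 0;
}

With 32 banks, a stride of 64 or 1024 collapses every access onto a single bank; with 31 banks, any stride that is not a multiple of 31 cycles through all 31 banks.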

Slide 19/34

Strided Memory

[Bandwidth chart; the labeled figure is 1.2 GBytes/sec]

Slide 20/34

SP Personality

Vector architecture optimized for signal processing applications:
- Load-store vector architecture with single-precision floating point
- Multiple function pipes for data parallelism
- Multiple functional units and out-of-order execution for instruction parallelism
- 4 Application Engines / 32 function pipes, with vector elements distributed across the function pipes (an illustrative loop follows below)

[Diagram: in each function pipe, dispatch and a crossbar connect the vector register file to fma, add, integer, logical, reciprocal/divide, miscellaneous, load, and store units, and out to the memory crossbar]
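As an illustration (our own hypothetical loop, not from the slides), the kind of kernel this personality targets is a dense stream of independent single-precision fused multiply-adds:

#include <stddef.h>

/* Single-precision multiply-accumulate: y[i] = a*x[i] + y[i].
   Every iteration is independent, so a vectorizer can distribute
   the elements round-robin across the 32 function pipes, each
   pipe's fma unit retiring one element per cycle. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}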

Slide 21/34

UCSD InsPecT Personality: 4 Application Engines / 40 State Machines

- Hardware implementation of the search kernel from the InsPecT proteomics application
- Entire Best Match routine implemented as a state machine
- Multiple state machines for data parallelism: a MIMD mode of computation
- Operates on main memory using virtual addresses
- Lightweight thread dispatcher on the x86

[Diagram, 1 of 40 state machines: Substring Fetch and Protein Fetch feed Peptide Mass Memory and PRM Scores Memory; Score, Save Match, Temp Match Memory, and Update Score-To-Beat stages lead to Store Matches, all behind dispatch and the crossbar]

Slide 22/34

Linux Kernel

[Diagram: the Intel host processor runs the Convey Linux kernel, with shared libraries (libcny_usr.so, libcny_sys.so) over kernel & loadable modules (cpKERN, cpSYS, cpHAL) that drive the Convey coprocessor]

- Shared libraries: coprocessor allocation, coprocessor dispatch & synchronization
- Loadable modules: initialization & management, exception and error handling, context save/restore
- Enhanced Linux kernel: multiple page sizes, page coloring for the prime number interleave, process management for coprocessor threads
- Intel 64/Linux LSB binary compatibility

Slide 23/34

Berkeley's 13 Motifs

    http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-23.html

Slide 24/34

Compiler Complexity vs. Application Performance

[3-D chart: one axis runs SISD, SIMD, MIMD/Threads, Full Custom; another runs Cache-based, Stride-1 Physical, Stride-N Physical, Stride-N Smart; the vertical axis is an application performance metric (e.g. efficiency) from Low to High]

What does this all mean? Each application is a different point on this 3-D grid (actually a curve). How do we create architectures to do this?

Slide 25/34

Convey ISA (Compiler View)

Personalities, all extending the x86-64 ISA:
- Vector, 32-bit float: seismic
- Vector, 64-bit float: finite element
- Bit/logical: data mining, sorting/tree traversal
- Systolic/state machine: bio-informatics
- User written

Slide 26/34

Convey Compilers

[Diagram: C/C++ and Fortran95 front ends feed a common optimizer; an Intel 64 optimizer & code generator and a Convey vectorizer & code generator emit Intel64 code and coprocessor code, which the linker combines with other gcc-compatible objects into a single executable]

- Program in ANSI standard C/C++ and Fortran
- The unified compiler generates x86 & coprocessor instructions
- Seamless debugging environment for Intel & coprocessor code
- The executable can run on x86_64 nodes or on Convey Hybrid-Core nodes

Slide 27/34

Slide 28/34

    3D Finite Difference (3DFD)

- Personality designed for nearest-neighbor operations on structured grids; maximizes data reuse
- Reconfigurable registers: 1D (vector), 2D, and 3D modes, with 8192 elements in a register
- Operations on entire cubes: adding every point to its neighbor to the left, times a scalar, is a single instruction, for neighbors up to 7 points away in any direction
- Finite difference method for post-stack reverse-time migration, e.g. the stencil below (with a C sketch after it):

X(I,J,K) = S0*Y(I  ,J  ,K  )
         + S1*Y(I-1,J  ,K  )
         + S2*Y(I+1,J  ,K  )
         + S3*Y(I  ,J-1,K  )
         + S4*Y(I  ,J+1,K  )
         + S5*Y(I  ,J  ,K-1)
         + S6*Y(I  ,J  ,K+1)

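A plain-C sketch of the same 7-point stencil (the grid extents, row-major layout, and scalar loop are ours for illustration; the personality executes this update as a few vector instructions rather than a scalar loop):

/* 7-point stencil on an NX x NY x NZ grid; x and y each
   hold NX*NY*NZ floats in row-major order. */
#define NX 64
#define NY 64
#define NZ 64
#define IDX(i,j,k) (((i)*NY + (j))*NZ + (k))

void stencil7(const float *y, float *x, const float s[7]) {
    for (int i = 1; i < NX-1; i++)
        for (int j = 1; j < NY-1; j++)
            for (int k = 1; k < NZ-1; k++)
                x[IDX(i,j,k)] = s[0]*y[IDX(i,j,k)]
                              + s[1]*y[IDX(i-1,j,k)] + s[2]*y[IDX(i+1,j,k)]
                              + s[3]*y[IDX(i,j-1,k)] + s[4]*y[IDX(i,j+1,k)]
                              + s[5]*y[IDX(i,j,k-1)] + s[6]*y[IDX(i,j,k+1)];
}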

Slide 29/34

    3D Grid Data Storage

- The personality maps the 3-D grid data structure to register files within function units to achieve maximum data reuse
- Dimensions of the 3-D data structure can be easily changed
- A few dozen instructions will perform a 14th-order stencil operation

Slide 30/34

    3DFD Personality Framework

- The number of Function Pipe Groups is set to match the available memory bandwidth
- The number of Function Pipes is set based on the resources available
- 3-D hardware grid structure (sketched below):
  - X axis: number of FPGs
  - Y axis: number of FPs per FPG
  - Z axis: vector elements within an FP

[Diagram: an address generator feeds each Function Pipe Group, which contains multiple Function Pipes]

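One way to picture the mapping (a hypothetical sketch; NUM_FPG and FP_PER_FPG are made-up values, and the real assignment is internal to the personality):

/* Map a 3-D grid point onto the 3DFD hardware grid: X picks the
   Function Pipe Group, Y the pipe within the group, Z the vector
   element slot within that pipe. Sizes are illustrative only. */
#define NUM_FPG    8
#define FP_PER_FPG 4

typedef struct { int fpg, fp, elem; } hw_slot;

hw_slot map_point(int x, int y, int z) {
    hw_slot s = { x % NUM_FPG, y % FP_PER_FPG, z };
    return s;
}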

Slide 31/34

Slide 32/34

    Kathy Yelick, 2008 Keynote, Salishan Conference

Slide 33/34

Slide 34/34

    Finally