2010.Walalch.oct


Slide 1/34

    Computer Software:

    The 'Trojan Horse' of HPC

    Steve Wallach

swallach at conveycomputer dot com - Convey Computer

Slide 2/34

    Discussion

- The state of HPC software today
- What Convey Computer is doing
- The path to Exascale Computing

Slide 3/34

    Mythology

In the old days, we were told:

"Beware of Greeks bearing gifts"

Slide 4/34

    Today

    Beware of Geeks bearing Gifts

Slide 5/34

What problems are we solving?

New Hardware Paradigms:
- Uniprocessor performance is leveling off
- MPP and multi-threading for the masses

Slide 6/34

Déjà Vu: Multi-Core Evolves

- Many-core as ILP fizzles
- x86 extended with SSE2, SSE3, and SSE4 application-specific enhancements
  - Co-processor within the x86 micro-architecture
  - Basically performance enhancements via on-chip parallel instructions
  - Instructions for specific application acceleration: one application instruction replaces MANY generic instructions
- "Déjà vu all over again" (Yogi Berra): as in the 1980s, we need more performance than the micro delivers
  - GPU, CELL, and FPGAs, each a different software environment
- Heterogeneous Computing, AGAIN

Slide 7/34

    Current Languages

Fortran: Fortran 66, Fortran 77, Fortran 95/2003, High Performance Fortran, Co-Array Fortran

C family: C, C++, UPC, Stream C, C# (Microsoft), Ct (Intel)

Slide 8/34

    Another Bump in the Road

GPGPUs are very cost-effective for many applications.

Matrix multiply in Fortran:

do i = 1, n1
  do k = 1, n3
    c(i,k) = 0.0
    do j = 1, n2
      c(i,k) = c(i,k) + a(i,j) * b(j,k)
    enddo
  enddo
enddo

Slide 9/34

    PGI Fortran to CUDA

Simplified matrix multiplication in CUDA, using a tiled algorithm:

__global__ void matmulKernel( float* C, float* A, float* B, int N2, int N3 )
{
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int aFirst = 16 * by * N2;
    int bFirst = 16 * bx;
    float Csub = 0;
    for( int j = 0; j < N2; j += 16 ) {
        __shared__ float Atile[16][16], Btile[16][16];
        Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
        Btile[ty][tx] = B[bFirst + j*N3 + N3 * ty + tx];
        __syncthreads();
        for( int k = 0; k < 16; ++k )
            Csub += Atile[ty][k] * Btile[k][tx];
        __syncthreads();
    }
    int c = N3 * 16 * by + 16 * bx;
    C[c + N3 * ty + tx] = Csub;
}

void matmul( float* A, float* B, float* C, size_t N1, size_t N2, size_t N3 )
{
    void *devA, *devB, *devC;
    cudaSetDevice(0);
    cudaMalloc( &devA, N1*N2*sizeof(float) );
    cudaMalloc( &devB, N2*N3*sizeof(float) );
    cudaMalloc( &devC, N1*N3*sizeof(float) );
    cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );
    dim3 threads( 16, 16 );
    dim3 grid( N1 / threads.x, N3 / threads.y );
    matmulKernel<<< grid, threads >>>( (float*)devC, (float*)devA, (float*)devB, N2, N3 );
    cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );
}

Michael Wolfe, Portland Group, "How We Should Program GPGPUs", Linux Journal, November 1st, 2008
http://www.linuxjournal.com/article/10216

Pornographic Programming: can't define it, but you know it when you see it.

Slide 10/34

Find the Accelerator Programmer

Accelerators can be beneficial, but they aren't free (like waiting for the next clock speed boost was):
- Worst case, you will have to completely rethink your algorithms and/or data structures
- Performance tuning is still time consuming
- Don't forget our long history of parallel computing...

Pat McCormick, LANL, "One of These Things Isn't Like the Other... Now What?", Fall Creek Falls Conference, Chattanooga, TN, Sept. 2009

Slide 11/34

    Programming Model

- Tradeoff: programmer productivity vs. performance
- Web programming is mostly done with scripting and interpretive languages (Java, JavaScript) and server-side languages (Python, Ruby, etc.)
- Matlab users trade off productivity for performance
- Moore's Law helps performance; Moore's Law hurts productivity (multi-core)
- What languages are being used?
  http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

Slide 12/34

    Interpretive Languages

- User cycles are more important than computer cycles
- We need ISAs that model and directly support these interpretive languages
- FPGA cores to support extensions to the standard ISA
- Language Directed Design:
  - ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park, MD
  - FAP, the Fortran Assembly Program (IBM 7090)
  - B1700 S-languages (Burroughs)
  - Architecture defined by compiler writers

Slide 13/34

    Kathy Yelick, 2008 Keynote, Salishan Conference

    NO

Slide 14/34

    Introduction to Convey Computer

- Developer of the HC-1 hybrid-core computer system
- Leverages the Intel x86 ecosystem
- FPGA-based coprocessor for performance & efficiency
- Experienced team

Slide 15/34

The Convey Hybrid-Core Computer

- Extends the x86 ISA with the performance of a hardware-based architecture
- Adapts to application workloads
- Programmed in ANSI standard C/C++ and Fortran
- Leverages the x86 ecosystem

Slide 16/34

Convey Product

- Reconfigurable co-processor to the Intel x86-64
- Shared 64-bit virtual and physical memory (cache coherent)
- Coprocessor executes instructions that are viewed as extensions to the x86 ISA
- Convey-developed compilers: C, C++, & Fortran (based on Open64)
  - Automatic vectorization/parallelization: SIMD and multi-threading
  - Generates both x86 and coprocessor instructions

Slide 17/34

Inside the Coprocessor

[Block diagram: a host interface and instruction fetch/decode feed scalar processing and the Application Engines; a crossbar connects them to eight memory controllers]

- Application Engines (AEs): personalities dynamically loaded into the AEs implement application-specific instructions
- 16 DDR2 memory channels, standard or Scatter-Gather DIMMs, 80 GB/sec throughput
- System interface and memory management implemented by the coprocessor infrastructure; direct I/O interface
- Crossbar: non-blocking, virtual output queuing, round-robin arbitration, 31-way interleave

Slide 18/34

Convey Scatter-Gather DIMMs

- Standard DIMMs are optimized for cache line transfers
  - performance drops dramatically when the access pattern is strided or random
- Convey Scatter-Gather DIMMs are optimized for 8-byte transfers
  - deliver high bandwidth for random or strided 64-bit accesses
  - prime number (31) interleave maintains performance for power-of-two strides (see the sketch below)
- Supports both the SIMD and parallel multi-threading compute models
- Out-of-order loads and stores
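To see why a prime interleave helps, count how many banks a strided access stream actually touches. A minimal C sketch (the bank counts, strides, and access counts here are illustrative assumptions, not Convey's actual controller logic):

#include <stdio.h>

/* Count how many distinct banks a strided stream touches,
   assuming a word at address a lands in bank (a % nbanks). */
static int banks_touched(int nbanks, long stride, long naccesses) {
    int used[64] = {0}, count = 0;
    for (long a = 0, i = 0; i < naccesses; i++, a += stride) {
        long b = a % nbanks;
        if (!used[b]) { used[b] = 1; count++; }
    }
    return count;
}

int main(void) {
    long strides[] = {1, 2, 8, 64, 1024};
    for (int s = 0; s < 5; s++)
        printf("stride %4ld: 32 banks -> %2d used, 31 banks -> %2d used\n",
               strides[s],
               banks_touched(32, strides[s], 1024),
               banks_touched(31, strides[s], 1024));
    return 0;
}

With 32 banks, a stride of 64 or 1024 collapses every access onto a single bank; with 31 banks, any stride that is not a multiple of 31 cycles through all 31 banks.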

Slide 19/34

Strided Memory

[Bandwidth chart; the labeled figure is 1.2 GBytes/sec]

Slide 20/34

SP Personality

Vector architecture optimized for signal processing applications:
- Load-store vector architecture with single-precision floating point
- Multiple function pipes for data parallelism
- Multiple functional units and out-of-order execution for instruction parallelism
- 4 Application Engines / 32 function pipes, with vector elements distributed across the function pipes (an illustrative loop follows below)

[Diagram: in each function pipe, dispatch and a crossbar connect the vector register file to fma, add, integer, logical, reciprocal/divide, miscellaneous, load, and store units, and out to the memory crossbar]
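As an illustration (our own hypothetical loop, not from the slides), the kind of kernel this personality targets is a dense stream of independent single-precision fused multiply-adds:

#include <stddef.h>

/* Single-precision multiply-accumulate: y[i] = a*x[i] + y[i].
   Every iteration is independent, so a vectorizer can distribute
   the elements round-robin across the 32 function pipes, each
   pipe's fma unit retiring one element per cycle. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}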

Slide 21/34

UCSD InsPecT Personality: 4 Application Engines / 40 State Machines

- Hardware implementation of the search kernel from the InsPecT proteomics application
- Entire Best Match routine implemented as a state machine
- Multiple state machines for data parallelism: a MIMD mode of computation
- Operates on main memory using virtual addresses
- Lightweight thread dispatcher on the x86

[Diagram, 1 of 40 state machines: Substring Fetch and Protein Fetch feed Peptide Mass Memory and PRM Scores Memory; Score, Save Match, Temp Match Memory, and Update Score-To-Beat stages lead to Store Matches, all behind dispatch and the crossbar]

Slide 22/34

Linux Kernel

[Diagram: the Intel host processor runs the Convey Linux kernel, with shared libraries (libcny_usr.so, libcny_sys.so) over kernel & loadable modules (cpKERN, cpSYS, cpHAL) that drive the Convey coprocessor]

- Shared libraries: coprocessor allocation, coprocessor dispatch & synchronization
- Loadable modules: initialization & management, exception and error handling, context save/restore
- Enhanced Linux kernel: multiple page sizes, page coloring for the prime number interleave, process management for coprocessor threads
- Intel 64/Linux LSB binary compatibility

Slide 23/34

Berkeley's 13 Motifs

    http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-23.html

Slide 24/34

Compiler Complexity vs. Application Performance

[3-D chart: one axis runs SISD, SIMD, MIMD/Threads, Full Custom; another runs Cache-based, Stride-1 Physical, Stride-N Physical, Stride-N Smart; the vertical axis is an application performance metric (e.g. efficiency) from Low to High]

What does this all mean? Each application is a different point on this 3-D grid (actually a curve). How do we create architectures to do this?

Slide 25/34

Convey ISA (Compiler View)

Personalities, all extending the x86-64 ISA:
- Vector, 32-bit float: seismic
- Vector, 64-bit float: finite element
- Bit/logical: data mining, sorting/tree traversal
- Systolic/state machine: bio-informatics
- User written

Slide 26/34

Convey Compilers

[Diagram: C/C++ and Fortran95 front ends feed a common optimizer; an Intel 64 optimizer & code generator and a Convey vectorizer & code generator emit Intel64 code and coprocessor code, which the linker combines with other gcc-compatible objects into a single executable]

- Program in ANSI standard C/C++ and Fortran
- The unified compiler generates x86 & coprocessor instructions
- Seamless debugging environment for Intel & coprocessor code
- The executable can run on x86_64 nodes or on Convey Hybrid-Core nodes

Slide 27/34

Slide 28/34

    3D Finite Difference (3DFD)

- Personality designed for nearest-neighbor operations on structured grids; maximizes data reuse
- Reconfigurable registers: 1D (vector), 2D, and 3D modes, with 8192 elements in a register
- Operations on entire cubes: adding every point to its neighbor to the left, times a scalar, is a single instruction, for neighbors up to 7 points away in any direction
- Finite difference method for post-stack reverse-time migration, e.g. the stencil below (with a C sketch after it):

X(I,J,K) = S0*Y(I  ,J  ,K  )
         + S1*Y(I-1,J  ,K  )
         + S2*Y(I+1,J  ,K  )
         + S3*Y(I  ,J-1,K  )
         + S4*Y(I  ,J+1,K  )
         + S5*Y(I  ,J  ,K-1)
         + S6*Y(I  ,J  ,K+1)

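A plain-C sketch of the same 7-point stencil (the grid extents, row-major layout, and scalar loop are ours for illustration; the personality executes this update as a few vector instructions rather than a scalar loop):

/* 7-point stencil on an NX x NY x NZ grid; x and y each
   hold NX*NY*NZ floats in row-major order. */
#define NX 64
#define NY 64
#define NZ 64
#define IDX(i,j,k) (((i)*NY + (j))*NZ + (k))

void stencil7(const float *y, float *x, const float s[7]) {
    for (int i = 1; i < NX-1; i++)
        for (int j = 1; j < NY-1; j++)
            for (int k = 1; k < NZ-1; k++)
                x[IDX(i,j,k)] = s[0]*y[IDX(i,j,k)]
                              + s[1]*y[IDX(i-1,j,k)] + s[2]*y[IDX(i+1,j,k)]
                              + s[3]*y[IDX(i,j-1,k)] + s[4]*y[IDX(i,j+1,k)]
                              + s[5]*y[IDX(i,j,k-1)] + s[6]*y[IDX(i,j,k+1)];
}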

Slide 29/34

    3D Grid Data Storage

- The personality maps the 3-D grid data structure to register files within function units to achieve maximum data reuse
- Dimensions of the 3-D data structure can be easily changed
- A few dozen instructions will perform a 14th-order stencil operation

Slide 30/34

    3DFD Personality Framework

- The number of Function Pipe Groups is set to match the available memory bandwidth
- The number of Function Pipes is set based on the resources available
- 3-D hardware grid structure (sketched below):
  - X axis: number of FPGs
  - Y axis: number of FPs per FPG
  - Z axis: vector elements within an FP

[Diagram: an address generator feeds each Function Pipe Group, which contains multiple Function Pipes]

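One way to picture the mapping (a hypothetical sketch; NUM_FPG and FP_PER_FPG are made-up values, and the real assignment is internal to the personality):

/* Map a 3-D grid point onto the 3DFD hardware grid: X picks the
   Function Pipe Group, Y the pipe within the group, Z the vector
   element slot within that pipe. Sizes are illustrative only. */
#define NUM_FPG    8
#define FP_PER_FPG 4

typedef struct { int fpg, fp, elem; } hw_slot;

hw_slot map_point(int x, int y, int z) {
    hw_slot s = { x % NUM_FPG, y % FP_PER_FPG, z };
    return s;
}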

Slide 31/34

Slide 32/34

    Kathy Yelick, 2008 Keynote, Salishan Conference

Slide 33/34

Slide 34/34

    Finally