8/8/2019 2010.Walalch.oct
Computer Software:
The 'Trojan Horse' of HPC
Steve Wallach
swallach at conveycomputer dot com — Convey
swallach - oct 2010 - ucsb 2
Discussion
- The state of HPC software today
- What Convey Computer is doing
- The path to Exascale Computing
Mythology
In the old days, we were told:
Beware of Greeks bearing gifts
Today
Beware of Geeks bearing Gifts
What problems are we solving?
- New hardware paradigms
- Uniprocessor performance leveling off
- MPP and multi-threading for the masses
Deja Vu — Multi-Core Evolves to Many-Core
- ILP fizzles
- x86 extended with SSE2, SSE3, and SSE4 — application-specific enhancements
- Co-processor within the x86 micro-architecture
- Basically performance enhancements by on-chip parallel instructions for specific application acceleration
- One application instruction replaces MANY generic instructions
- "Deja vu all over again" — Yogi Berra
- 1980s: needed more performance than the micro delivered — GPU, CELL, and FPGAs; different software environment
- Heterogeneous computing AGAIN
Current Languages
- Fortran 66, Fortran 77, Fortran 95, Fortran 2003
- High Performance Fortran (HPF)
- Co-Array Fortran
- C, C++, UPC, Stream C, C# (Microsoft), Ct (Intel)
Another Bump in the Road
GPGPUs are very cost-effective for many applications.

Matrix multiply in Fortran:

    do i = 1,n1
      do k = 1,n3
        c(i,k) = 0.0
        do j = 1,n2
          c(i,k) = c(i,k) + a(i,j) * b(j,k)
        enddo
      enddo
    enddo
PGI Fortran to CUDA

Simplified matrix multiplication in CUDA, using a tiled algorithm:

    __global__ void matmulKernel( float* C, float* A, float* B, int N2, int N3 )
    {
        int bx = blockIdx.x, by = blockIdx.y;
        int tx = threadIdx.x, ty = threadIdx.y;
        int aFirst = 16 * by * N2;
        int bFirst = 16 * bx;
        float Csub = 0;
        for( int j = 0; j < N2; j += 16 ) {
            __shared__ float Atile[16][16], Btile[16][16];
            Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
            Btile[ty][tx] = B[bFirst + j * N3 + N3 * ty + tx];
            __syncthreads();
            for( int k = 0; k < 16; ++k )
                Csub += Atile[ty][k] * Btile[k][tx];
            __syncthreads();
        }
        int c = N3 * 16 * by + 16 * bx;
        C[c + N3 * ty + tx] = Csub;
    }

    void matmul( float* A, float* B, float* C, size_t N1, size_t N2, size_t N3 )
    {
        void *devA, *devB, *devC;
        cudaSetDevice(0);
        cudaMalloc( &devA, N1*N2*sizeof(float) );
        cudaMalloc( &devB, N2*N3*sizeof(float) );
        cudaMalloc( &devC, N1*N3*sizeof(float) );
        cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
        cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );
        dim3 threads( 16, 16 );
        dim3 grid( N1 / threads.x, N3 / threads.y );
        matmulKernel<<< grid, threads >>>( devC, devA, devB, N2, N3 );
        cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost );
        cudaFree( devA );
        cudaFree( devB );
        cudaFree( devC );
    }

Michael Wolfe, Portland Group, "How We Should Program GPGPUs", Linux Journal, November 1st, 2008 — http://www.linuxjournal.com/article/10216

Pornographic programming: can't define it, but you know it when you see it.
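For readers who don't follow CUDA, the same tiled algorithm can be sketched in plain C — an illustrative sketch only, not Convey or PGI output; the function name and the multiple-of-16 size assumption are mine:

```c
#include <stddef.h>

#define TILE 16

/* Tiled matrix multiply: C[N1][N3] = A[N1][N2] * B[N2][N3].
 * Assumes N1, N2, N3 are multiples of TILE, mirroring the CUDA
 * kernel's grid layout. Each TILE x TILE block of A and B is
 * reused TILE times once "loaded", which is exactly what the
 * __shared__ tiles buy in the GPU version. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t N1, size_t N2, size_t N3)
{
    for (size_t i = 0; i < N1 * N3; i++)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < N1; i0 += TILE)
        for (size_t j0 = 0; j0 < N2; j0 += TILE)
            for (size_t k0 = 0; k0 < N3; k0 += TILE)
                /* multiply one TILE x TILE block pair */
                for (size_t i = i0; i < i0 + TILE; i++)
                    for (size_t j = j0; j < j0 + TILE; j++) {
                        float a = A[i * N2 + j];
                        for (size_t k = k0; k < k0 + TILE; k++)
                            C[i * N3 + k] += a * B[j * N3 + k];
                    }
}
```

The blocking changes nothing about the arithmetic; it only keeps each tile hot in cache (or, on the GPU, in shared memory) while it is reused.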
Find the Accelerator Programmer
- Accelerators can be beneficial. It isn't free (like waiting for the next clock speed boost).
- Worst case: you will have to completely rethink your algorithms and/or data structures.
- Performance tuning is still time consuming.
- Don't forget our long history of parallel computing...
Pat McCormick, LANL, "One of These Things Isn't Like the Other... Now What?", Fall Creek Falls Conference, Chattanooga, TN, Sept. 2009
Programming Model
(swallach - June 2010 EAGE Barcelona)
- Tradeoff: programmer productivity vs. performance
- Web programming is mostly done with scripting and interpretive languages — Java, Javascript
- Server-side programming languages (Python, Ruby, etc.)
- Matlab — users trade off productivity for performance
- Moore's Law helps performance; Moore's Law hurts productivity (multi-core)
- What languages are being used? http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Interpretive Languages
- User cycles are more important than computer cycles
- We need ISAs that model and directly support these interpretive languages
- FPGA cores to support extensions to the standard ISA
- Language-directed design: ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park, MD
- FAP — Fortran Assembly Program (IBM 7090)
- B1700 S-languages (Burroughs)
- Architecture defined by compiler writers
Kathy Yelick, 2008 Keynote, Salishan Conference
NO
Introduction to Convey Computer
- Developer of the HC-1 hybrid-core computer system
- Leverages the Intel x86 ecosystem
- FPGA-based coprocessor for performance & efficiency
- Experienced team
The Convey Hybrid-Core Computer
- Extends the x86 ISA with the performance of a hardware-based architecture
- Adapts to application workloads
- Programmed in ANSI standard C/C++ and Fortran
- Leverages the x86 ecosystem
Convey Product
- Reconfigurable co-processor to Intel x86-64
- Shared 64-bit virtual and physical memory (cache coherent)
- Coprocessor executes instructions that are viewed as extensions to the x86 ISA
- Convey-developed compilers: C, C++, & Fortran (based on Open64)
  - Automatic vectorization/parallelization: SIMD, multi-threading
  - Generates both x86 and coprocessor instructions
Inside the Coprocessor
[Block diagram: host interface, instruction fetch/decode, scalar processing, and the Application Engines connected through a crossbar to eight memory controllers, plus a direct I/O interface]
- Personalities dynamically loaded into AEs implement application-specific instructions
- 16 DDR2 memory channels; standard or scatter-gather DIMMs; 80 GB/sec throughput
- System interface and memory management implemented by coprocessor infrastructure
- Crossbar: non-blocking, virtual output queuing, round-robin arbitration, 31-way interleave
Convey Scatter-Gather DIMMs
- Standard DIMMs are optimized for cache line transfers — performance drops dramatically when the access pattern is strided or random
- Convey scatter-gather DIMMs are optimized for 8-byte transfers — they deliver high bandwidth for random or strided 64-bit accesses
- Prime number (31) interleave maintains performance for power-of-two strides
- Supports both SIMD and parallel multi-threading compute models
- Out-of-order loads and stores
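Why a prime interleave helps can be seen with a few lines of C — a toy model, not the Convey memory controller; the function name is mine. It counts how many distinct banks a strided access stream touches:

```c
/* Count the distinct memory banks a strided access stream hits
 * under a given interleave factor. With 32 (power-of-two) banks,
 * a stride of 32 words lands on a single bank; with a prime
 * 31-way interleave, every power-of-two stride is coprime to 31
 * and so still cycles through all 31 banks. Toy model only. */
int banks_touched(int stride, int nbanks, int naccesses)
{
    int seen[64] = {0};   /* supports up to 64 banks */
    int count = 0;
    for (int i = 0; i < naccesses; i++) {
        int bank = (i * stride) % nbanks;
        if (!seen[bank]) { seen[bank] = 1; count++; }
    }
    return count;
}
```

For example, `banks_touched(32, 32, 1000)` is 1 (every access conflicts on one bank), while `banks_touched(32, 31, 1000)` is 31 (the stream spreads evenly) — which is the point of the prime interleave above.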
Strided Memory
[Chart: strided memory bandwidth; 1.2 GBytes/sec]
SP Personality
- Vector architecture optimized for signal processing applications
- Load-store vector architecture with single-precision floating point
- Multiple function pipes for data parallelism
- Multiple functional units and out-of-order execution for instruction parallelism
- 4 Application Engines / 32 function pipes; vector elements distributed across function pipes
[Diagram: each function pipe holds a vector register file connected to the crossbar, fed by a crossbar/dispatch stage, with functional units — fma (x4), add (x2), integer, logical, reciprocal/divide, misc, load, store]
UCSD InsPect Personality — 4 Application Engines / 40 State Machines
- Entire Best Match routine implemented as a state machine
- Multiple state machines for data parallelism
- Operates on main memory using virtual addresses
- MIMD mode of computation
- Lightweight thread dispatcher on x86
- Hardware implementation of the search kernel from the InsPect proteomics application
[Diagram: 1 of 40 state machines — substring fetch, protein fetch, peptide mass memory, PRM scores memory, score, save match, temp match memory, update score-to-beat, store matches — connected via crossbar/dispatch]
Convey Linux Kernel
[Diagram: the Intel host processor runs an enhanced Linux kernel with Convey additions — the cpKERN, cpSYS, and cpHAL kernel & loadable modules and the libcny_usr.so / libcny_sys.so shared libraries — managing the Convey coprocessor]
- Shared libraries: coprocessor allocation; coprocessor dispatch & synchronization
- Loadable modules: initialization & management; exception and error handling; context save/restore
- Enhanced Linux kernel: multiple page sizes; page coloring for prime-number interleave; process management for coprocessor threads; Intel 64/Linux LSB binary compatibility
Berkeley's 13 Motifs
http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-23.html
What does this all mean?
[3-D chart: application performance metric (e.g. efficiency) vs. compiler complexity, from low to high — spanning SISD / SIMD / MIMD-threads on one axis and cache-based / stride-1 physical / stride-N physical / stride-N smart / full custom on another]
- Each application is a different point on this 3-D grid (actually a curve)
- How do we create architectures to do this?
Convey ISA (Compiler View)
Extensions to the x86-64 ISA:
- Vector (32-bit float) — seismic
- Vector (64-bit float) — finite element
- Bit/logical — data mining, sorting/tree traversal
- Systolic/state machine — bio-informatics
- User written
Convey Compilers
[Diagram: C/C++ and Fortran 95 front ends feed a common optimizer, which drives both an Intel 64 optimizer & code generator and a Convey vectorizer & code generator; the linker combines the Intel 64 code, coprocessor code, and other gcc-compatible objects into a single executable]
- Program in ANSI standard C/C++ and Fortran
- Unified compiler generates x86 & coprocessor instructions
- Seamless debugging environment for Intel & coprocessor code
- Executable can run on x86_64 nodes or on Convey Hybrid-Core nodes
3D Finite Difference (3DFD)
Personality designed for nearest-neighbor operations on structured grids — maximizes data reuse.
- Reconfigurable registers: 1D (vector), 2D, and 3D modes; 8192 elements in a register
- Operations on entire cubes: "add points to their neighbor to the left times a scalar" is a single instruction; up to 7 points away in any direction
- Finite difference method for post-stack reverse-time migration

    X(I,J,K) = S0*Y(I  ,J  ,K  )
             + S1*Y(I-1,J  ,K  )
             + S2*Y(I+1,J  ,K  )
             + S3*Y(I  ,J-1,K  )
             + S4*Y(I  ,J+1,K  )
             + S5*Y(I  ,J  ,K-1)
             + S6*Y(I  ,J  ,K+1)
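The 7-point stencil above can be written out as a scalar C loop nest — a sketch for illustration only (the function name and cubic-grid assumption are mine); on the 3DFD personality the whole update is a handful of vector instructions over register-resident cubes:

```c
#include <stddef.h>

/* Scalar version of the slide's 7-point stencil: each interior
 * point of X is a weighted sum of Y at that point and its six
 * axis neighbors. n is the edge length of a cubic grid stored
 * row-major; s[0..6] are the coefficients S0..S6. */
#define IDX(i, j, k, n) \
    ((size_t)(i) * (n) * (n) + (size_t)(j) * (n) + (size_t)(k))

void stencil7(const double *Y, double *X, const double s[7], int n)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                X[IDX(i, j, k, n)] =
                      s[0] * Y[IDX(i,     j,     k,     n)]
                    + s[1] * Y[IDX(i - 1, j,     k,     n)]
                    + s[2] * Y[IDX(i + 1, j,     k,     n)]
                    + s[3] * Y[IDX(i,     j - 1, k,     n)]
                    + s[4] * Y[IDX(i,     j + 1, k,     n)]
                    + s[5] * Y[IDX(i,     j,     k - 1, n)]
                    + s[6] * Y[IDX(i,     j,     k + 1, n)];
}
```

Each output point touches seven inputs, and neighboring output points share most of those inputs — the data reuse the 3DFD register cubes are designed to capture.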
3D Grid Data Storage
- Personality maps the 3-D grid data structure to register files within function units to achieve maximum data reuse
- Dimensions of the 3-D data structure can be easily changed
- A few dozen instructions will perform a 14th-order stencil operation
3DFD Personality Framework
- The number of Function Pipe Groups is set to match available memory bandwidth
- The number of Function Pipes is set based on the resources available
- 3-D HW grid structure: X axis — number of FPGs; Y axis — number of FPs per FPG; Z axis — vector elements within a FP
[Diagram: Function Pipe Groups, each containing Function Pipes and an Address Generator]
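One way to picture the X/Y/Z mapping above is as an index decomposition — the names, counts, and modulo scheme here are hypothetical, for illustration only, not Convey's actual allocation:

```c
/* Hypothetical sketch of the 3-D hardware mapping: a grid point
 * (x, y, z) is assigned so that x selects a Function Pipe Group,
 * y a Function Pipe within that group, and z a vector element
 * within the pipe. Struct and function names are made up;
 * real configurations depend on the FPGA resources available. */
struct hw_coord { int fpg, fp, elem; };

struct hw_coord map_point(int x, int y, int z,
                          int n_fpg, int n_fp_per_fpg)
{
    struct hw_coord c;
    c.fpg  = x % n_fpg;         /* X axis -> function pipe group */
    c.fp   = y % n_fp_per_fpg;  /* Y axis -> pipe within group   */
    c.elem = z;                 /* Z axis -> vector element      */
    return c;
}
```

With, say, 4 FPGs of 8 pipes each, point (5, 3, 7) would land on group 1, pipe 3, element 7 — neighboring X values spread across groups, which is what lets the framework scale pipe-group count to the memory bandwidth.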
Kathy Yelick, 2008 Keynote, Salishan Conference
Finally
Recommended