HPCE / dt10 / 2012 / 0.2
High Performance Computing for Engineers
• Research
  – Testing communication protocols
  – Evaluating signal-processing filters
  – Simulating analogue and digital designs
• Tools
  – CAD tools: synthesis, place-and-route, verification
  – Libraries/toolboxes: filter design, compressive sensing
• Products
  – Oil exploration and discovery
  – Mobile-phone apps
  – Financial computing
High Performance Computing for Engineers
• Types of performance metrics
  – Throughput
  – Latency
  – Power
  – Design-time
  – Capital and running costs
• Required versus desired performance
  – Subject to a throughput of X, minimise average power
  – Subject to a budget of Y, maximise energy efficiency
  – Subject to Z development days, maximise throughput
What is available to you
• Types of compute device
  – Multi-core CPUs
  – GPUs (Graphics Processing Units)
  – MPPAs (Massively Parallel Processor Arrays)
  – FPGAs (Field Programmable Gate Arrays)
• Types of compute system
  – Embedded systems
  – Mobile phones
  – Tablets
  – Laptops
  – Grid computing
  – Cloud computing
2012 : LG Optimus 2X
NVidia Tegra 2
- CPU : Dual-core ARM Cortex A9
- GPU : ULP GeForce (8 cores)
Imgs: http://www.techradar.com/reviews/phones/mobile-phones/lg-optimus-2x-929388/review, http://www.anandtech.com/show/2911
2012 : Lenovo Thinkpad Edge E525
AMD Fusion A8-3500M
- CPU : Quad-core 2.4GHz Phenom-II
- GPU : HD 6620G 400MHz (320 cores)
Imgs: http://laptops-specs.blogspot.com/2011/09/lenovo-thinkpad-edge-e525-specs.html, http://www.techradar.com/images/zoom/amd-llano-965315/index1
2012 : Imperial HPC Cluster
• cx2 - SGI Altix ICE 8200 EX
  – Racks and racks of high-performance PCs
  – 3000+ x64 cores running at 3GHz
  – Available to researchers and undergrads (if they ask nicely)
• Grid-management system
  – Run a program on 1000 PCs with one command
Performance and Efficiency Relative to CPU
[Chart: two panels, Performance and Power Efficiency, comparing MPPA, GPU, and FPGA relative to a CPU baseline, for Uniform, Gaussian, and Exponential distributions plus a geometric mean]
Design tradeoffs
• Task-based parallelism vs threads
  – Easy to program (less time coding)
  – Easy to get right (less time testing)
• Many implementations and APIs
  – Intel Threading Building Blocks (TBB)
  – Microsoft .NET Task Parallel Library
  – OpenCL
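As a hedged illustration of the task-based style (using Python's standard concurrent.futures module, not any of the libraries above): tasks are plain function calls handed to a scheduler, with no explicit threads, mutexes, or joins to manage.

```python
from concurrent.futures import ThreadPoolExecutor

def work(x):
    # A task is just a function call; no locks or thread management needed
    return x * x

with ThreadPoolExecutor() as pool:
    # Submit independent tasks; the runtime maps them onto worker threads
    results = list(pool.map(work, range(8)))

print(results)  # squares of 0..7
```

The programmer states *what* is independent; the runtime decides *where* and *when* each task runs, which is the core idea behind TBB and the .NET Task Parallel Library as well.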
Design tradeoffs
[Figure — Src: NVIDIA CUDA Compute Unified Device Architecture, Programmer's Guide]
What will you learn
• Systems: what high-performance systems do you have
• Methods: how can these systems be programmed
• Practice: concrete experience with multi-core and GPUs
• Analysis: knowing what to use and when
What you won’t learn
• Multi-threaded programming
  – PThreads, Windows threads, mutexes, spin-locks, ...
  – We’ll look at the concepts and hardware, but ignore the practice
  – Not needed when using modern task-based methods
• OpenMP – API for parallelising for-loops in C/C++
  – Old technology, not very user-friendly
  – Doesn’t map nicely to architectures such as GPUs
  – We’ll use modern techniques such as TBB and CUDA/OpenCL
• MPI (Message Passing Interface)
  – Point-to-point communication between processes over a network
  – Important, but very specialised: an entire course by itself
  – This course only considers common non-specialist systems
Structure of the course
• Exam (50%) + two practical courseworks (50%)
• Task-based project using Intel Threading Building Blocks
  – Simple and robust framework for task-level parallelism
  – Highly portable: Linux, Windows, POSIX source
• GPU-based project using CUDA or OpenCL
  – If you have a GPU in your laptop, use that
  – Certain lab machines have GPUs compatible with CUDA
  – Will also explore using OpenCL to target both CPUs and GPUs
Skills needed
• Basic programming
  – If you can’t program in _any_ language then worry
• Intel TBB uses C++ rather than C
  – Some weird C++ stuff, but not scary: explained in lectures
  – Working examples given and explained
  – Templates given as starting point for project work
• GPU programming uses CUDA or OpenCL (both C-like)
  – Lets you use whatever graphics card you happen to have
  – Working examples, explained in lectures
  – Template as starting point for project work
• Not expected to become a guru, just make it faster
Key Focus: Engineering
• How does this apply to you?
• Examples from Elec. Eng. problems
  – Mathematical analysis
  – Simulation of digital circuits
  – VLSI circuit layout
  – Communication channel evaluation
  – (Fractal zoomers)
• Tools and languages used in EE
  – C
  – MATLAB
  – qsub (Imperial HPC cluster)
Simple example : Totient function
• Euler’s totient function: totient(n)
  – Number of integers in range 1..n which are relatively prime to n
  – Integers i and j are relatively prime if gcd(i,j)=1
  – Totient not included in MATLAB
Version 0 : Simple loop
• Euler’s totient function: totient(n)
  – Number of integers in range 1..n which are relatively prime to n
  – Not included in MATLAB
  – Integers i and j are relatively prime if gcd(i,j)=1
function [res]=totient_v0(n)
res=0;
for i=1:n              % Loop over all numbers in 1..n
    if gcd(i,n)==1     % Check if relatively prime
        res=res+1;     % Count any that are
    end
end
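For readers without MATLAB, here is a direct Python rendering of the same loop — a sketch assuming only the standard-library math.gcd.

```python
from math import gcd

def totient_v0(n):
    res = 0
    for i in range(1, n + 1):   # Loop over all numbers in 1..n
        if gcd(i, n) == 1:      # Check if relatively prime
            res += 1            # Count any that are
    return res

print(totient_v0(12))  # 1, 5, 7, 11 are coprime to 12 -> 4
```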
Version 1 : Vectorising
• Convert loops into vector operations
  – Standard MATLAB optimisation
  – Actually a way of making parallelism explicit

function [res]=totient_v1(n)
numbers=1:n;                   % Generate all numbers in 1..n
gcd_res=(gcd(numbers,n)==1);   % Perform GCD on all numbers
res=sum(gcd_res==1);           % Count all relatively prime numbers
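The vectorised version translates naturally to NumPy (an assumption here is the np.gcd ufunc, present in modern NumPy releases): the whole 1..n range is processed as one array, with no explicit loop.

```python
import numpy as np

def totient_v1(n):
    numbers = np.arange(1, n + 1)          # Generate all numbers in 1..n
    coprime = np.gcd(numbers, n) == 1      # Element-wise GCD over the whole vector
    return int(np.count_nonzero(coprime))  # Count the relatively prime ones

print(totient_v1(12))  # -> 4
```

As in MATLAB, the value of vectorising is not just speed: it makes the independence of the per-element work explicit, which is exactly what a parallel runtime needs.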
Version 2 : Parallel for loop
• MATLAB supports a parfor command
  – Each loop iteration is/may be executed in parallel
  – Can operate on multiple cores, and even multiple machines

function [res]=totient_v2(n)
res=0;
parfor i=1:n           % Loop over all numbers in 1..n
    if gcd(i,n)==1     % Check if relatively prime
        res=res+1;     % Count any that are
    end
end
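The closest standard-Python analogue to parfor is an executor map. This is a sketch of the pattern only: Python threads will not actually speed up this CPU-bound loop because of the GIL, so a process pool would be needed for a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd

def totient_v2(n):
    def is_coprime(i):
        return gcd(i, n) == 1   # Check if relatively prime
    with ThreadPoolExecutor() as pool:
        # map hands each iteration to the pool, like parfor schedules iterations
        return sum(pool.map(is_coprime, range(1, n + 1)))

print(totient_v2(12))  # -> 4
```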
HPCE / dt10 / 2012 / 0.28
Version 3 : Agglomeration
• Too much overhead with current parallel loop
  – Each parallel iteration has a cost due to scheduling
  – Process space in chunks, using smaller vectors

function [res]=totient_v3(n, step)
if nargin<2
    step=1000;             % How large each chunk should be
end
res=0;
parfor i=1:ceil(n/step)    % Loop over each chunk (ceil covers a final partial chunk)
    % Then process each chunk as a vector
    numbers=(i-1)*step+1:min(i*step,n);
    rel_prime=(gcd(numbers,n)==1);
    res=res+sum(rel_prime);
end
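The same agglomeration idea in Python: each task receives a whole chunk, so scheduling overhead is paid once per chunk rather than once per iteration. The chunk size here is illustrative, and the thread pool again shows the pattern rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd

def count_chunk(lo, hi, n):
    # Each task processes a whole chunk, amortising scheduling overhead
    return sum(1 for i in range(lo, hi) if gcd(i, n) == 1)

def totient_v3(n, step=1000):
    with ThreadPoolExecutor() as pool:
        # One task per chunk of the range 1..n; the last chunk may be partial
        futures = [pool.submit(count_chunk, lo, min(lo + step, n + 1), n)
                   for lo in range(1, n + 1, step)]
        return sum(f.result() for f in futures)

print(totient_v3(12, step=5))  # chunks [1..5], [6..10], [11..12] -> 4
```

Tuning step trades scheduling overhead against load balance, which is the tradeoff the slide's timing results illustrate.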
HPCE / dt10 / 2012 / 0.29
Results from my dual-core laptop
[Chart: measured results for n up to 2.5×10^5, comparing v0: For Loop, v1: Vectorised, v2: ParFor Loop, v3: ParFor Chunked]