HPCE / dt10 / 2012 / 0.2
High Performance Computing for Engineers
• Research
  – Testing communication protocols
  – Evaluating signal-processing filters
  – Simulating analogue and digital designs
• Tools
  – CAD tools: synthesis, place-and-route, verification
  – Libraries/toolboxes: filter design, compressive sensing
• Products
  – Oil exploration and discovery
  – Mobile-phone apps
  – Financial computing
High Performance Computing for Engineers
• Types of performance metrics
  – Throughput
  – Latency
  – Power
  – Design-time
  – Capital and running costs
• Required versus desired performance
  – Subject to a throughput of X, minimise average power
  – Subject to a budget of Y, maximise energy efficiency
  – Subject to Z development days, maximise throughput
What is available to you
• Types of compute device
  – Multi-core CPUs
  – GPUs (Graphics Processing Units)
  – MPPAs (Massively Parallel Processor Arrays)
  – FPGAs (Field Programmable Gate Arrays)
• Types of compute system
  – Embedded systems
  – Mobile phones
  – Tablets
  – Laptops
  – Grid computing
  – Cloud computing
2012 : LG Optimus 2X
NVidia Tegra 2
- CPU : Dual-core ARM Cortex A9
- GPU : ULP GeForce (8 cores)
Imgs: http://www.techradar.com/reviews/phones/mobile-phones/lg-optimus-2x-929388/review, http://www.anandtech.com/show/2911
2012 : Lenovo Thinkpad Edge E525
AMD Fusion A8-3500M
- CPU : Quad-core 2.4GHz Phenom-II
- GPU : HD 6620G 400MHz (320 cores)
Imgs: http://laptops-specs.blogspot.com/2011/09/lenovo-thinkpad-edge-e525-specs.html, http://www.techradar.com/images/zoom/amd-llano-965315/index1
2012 : Imperial HPC Cluster
• cx2 - SGI Altix ICE 8200 EX
  – Racks and racks of high-performance PCs
  – 3000+ x64 cores running at 3GHz
  – Available to researchers and undergrads (if they ask nicely)
• Grid-management system
  – Run a program on 1000 PCs with one command
Performance and Efficiency Relative to CPU
[Chart: two panels, Performance and Power Efficiency, comparing MPPA, GPU, and FPGA relative to a CPU baseline, for Uniform, Gaussian, and Exponential distributions plus a geometric mean]
Design tradeoffs
• Task-based parallelism vs threads
  – Easy to program (less time coding)
  – Easy to get right (less time testing)
• Many implementations and APIs
  – Intel Threading Building Blocks (TBB)
  – Microsoft .NET Task Parallel Library
  – OpenCL
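As a hedged illustration of the task-based style (using Python's standard concurrent.futures module, not any of the libraries above): tasks are plain function calls handed to a scheduler, with no explicit threads, mutexes, or joins to manage.

```python
from concurrent.futures import ThreadPoolExecutor

def work(x):
    # A task is just a function call; no locks or thread management needed
    return x * x

with ThreadPoolExecutor() as pool:
    # Submit independent tasks; the runtime maps them onto worker threads
    results = list(pool.map(work, range(8)))

print(results)  # squares of 0..7
```

The programmer states *what* is independent; the runtime decides *where* and *when* each task runs, which is the core idea behind TBB and the .NET Task Parallel Library as well.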
Design tradeoffs
[Figure — Src: NVIDIA CUDA Compute Unified Device Architecture, Programmer's Guide]
What will you learn
• Systems: what high-performance systems do you have
• Methods: how can these systems be programmed
• Practice: concrete experience with multi-core and GPUs
• Analysis: knowing what to use and when
What you won’t learn
• Multi-threaded programming
  – PThreads, Windows threads, mutexes, spin-locks, ...
  – We’ll look at the concepts and hardware, but ignore the practice
  – Not needed when using modern task-based methods
• OpenMP – API for parallelising for-loops in C/C++
  – Old technology, not very user-friendly
  – Doesn’t map nicely to architectures such as GPUs
  – We’ll use modern techniques such as TBB and CUDA/OpenCL
• MPI (Message Passing Interface)
  – Point-to-point communication between processes over a network
  – Important, but very specialised: an entire course by itself
  – This course only considers common non-specialist systems
Structure of the course
• Exam (50%) + two practical courseworks (50%)
• Task-based project using Intel Threading Building Blocks
  – Simple and robust framework for task-level parallelism
  – Highly portable: Linux, Windows, POSIX source
• GPU-based project using CUDA or OpenCL
  – If you have a GPU in your laptop, use that
  – Certain lab machines have GPUs compatible with CUDA
  – Will also explore using OpenCL to target both CPUs and GPUs
Skills needed
• Basic programming
  – If you can’t program in _any_ language then worry
• Intel TBB uses C++ rather than C
  – Some weird C++ stuff, but not scary: explained in lectures
  – Working examples given and explained
  – Templates given as starting point for project work
• GPU programming uses CUDA or OpenCL (both C-like)
  – Lets you use whatever graphics card you happen to have
  – Working examples, explained in lectures
  – Template as starting point for project work
• Not expected to become a guru, just make it faster
Key Focus: Engineering
• How does this apply to you?
• Examples from Elec. Eng. problems
  – Mathematical analysis
  – Simulation of digital circuits
  – VLSI circuit layout
  – Communication channel evaluation
  – (Fractal zoomers)
• Tools and languages used in EE
  – C
  – MATLAB
  – qsub (Imperial HPC cluster)
Simple example : Totient function
• Euler’s totient function: totient(n)
  – Number of integers in range 1..n which are relatively prime to n
  – Integers i and j are relatively prime if gcd(i,j)=1
  – Totient not included in MATLAB
Version 0 : Simple loop
• Euler’s totient function: totient(n)
  – Number of integers in range 1..n which are relatively prime to n
  – Not included in MATLAB
  – Integers i and j are relatively prime if gcd(i,j)=1
function [res]=totient_v0(n)
res=0;
for i=1:n              % Loop over all numbers in 1..n
    if gcd(i,n)==1     % Check if relatively prime
        res=res+1;     % Count any that are
    end
end
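For readers without MATLAB, here is a direct Python rendering of the same loop — a sketch assuming only the standard-library math.gcd.

```python
from math import gcd

def totient_v0(n):
    res = 0
    for i in range(1, n + 1):   # Loop over all numbers in 1..n
        if gcd(i, n) == 1:      # Check if relatively prime
            res += 1            # Count any that are
    return res

print(totient_v0(12))  # 1, 5, 7, 11 are coprime to 12 -> 4
```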
Version 1 : Vectorising
• Convert loops into vector operations
  – Standard MATLAB optimisation
  – Actually a way of making parallelism explicit

function [res]=totient_v1(n)
numbers=1:n;                   % Generate all numbers in 1..n
gcd_res=(gcd(numbers,n)==1);   % Perform GCD on all numbers
res=sum(gcd_res==1);           % Count all relatively prime numbers
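The vectorised version translates naturally to NumPy (an assumption here is the np.gcd ufunc, present in modern NumPy releases): the whole 1..n range is processed as one array, with no explicit loop.

```python
import numpy as np

def totient_v1(n):
    numbers = np.arange(1, n + 1)          # Generate all numbers in 1..n
    coprime = np.gcd(numbers, n) == 1      # Element-wise GCD over the whole vector
    return int(np.count_nonzero(coprime))  # Count the relatively prime ones

print(totient_v1(12))  # -> 4
```

As in MATLAB, the value of vectorising is not just speed: it makes the independence of the per-element work explicit, which is exactly what a parallel runtime needs.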
Version 2 : Parallel for loop
• MATLAB supports a parfor command
  – Each loop iteration is/may be executed in parallel
  – Can operate on multiple cores, and even multiple machines

function [res]=totient_v2(n)
res=0;
parfor i=1:n           % Loop over all numbers in 1..n
    if gcd(i,n)==1     % Check if relatively prime
        res=res+1;     % Count any that are
    end
end
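The closest standard-Python analogue to parfor is an executor map. This is a sketch of the pattern only: Python threads will not actually speed up this CPU-bound loop because of the GIL, so a process pool would be needed for a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd

def totient_v2(n):
    def is_coprime(i):
        return gcd(i, n) == 1   # Check if relatively prime
    with ThreadPoolExecutor() as pool:
        # map hands each iteration to the pool, like parfor schedules iterations
        return sum(pool.map(is_coprime, range(1, n + 1)))

print(totient_v2(12))  # -> 4
```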
HPCE / dt10 / 2012 / 0.28
Version 3 : Agglomeration
• Too much overhead with current parallel loop
  – Each parallel iteration has a cost due to scheduling
  – Process space in chunks, using smaller vectors

function [res]=totient_v3(n, step)
if nargin<2
    step=1000;             % How large each chunk should be
end
res=0;
parfor i=1:ceil(n/step)    % Loop over each chunk (ceil covers a final partial chunk)
    % Then process each chunk as a vector
    numbers=(i-1)*step+1:min(i*step,n);
    rel_prime=(gcd(numbers,n)==1);
    res=res+sum(rel_prime);
end
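The same agglomeration idea in Python: each task receives a whole chunk, so scheduling overhead is paid once per chunk rather than once per iteration. The chunk size here is illustrative, and the thread pool again shows the pattern rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd

def count_chunk(lo, hi, n):
    # Each task processes a whole chunk, amortising scheduling overhead
    return sum(1 for i in range(lo, hi) if gcd(i, n) == 1)

def totient_v3(n, step=1000):
    with ThreadPoolExecutor() as pool:
        # One task per chunk of the range 1..n; the last chunk may be partial
        futures = [pool.submit(count_chunk, lo, min(lo + step, n + 1), n)
                   for lo in range(1, n + 1, step)]
        return sum(f.result() for f in futures)

print(totient_v3(12, step=5))  # chunks [1..5], [6..10], [11..12] -> 4
```

Tuning step trades scheduling overhead against load balance, which is the tradeoff the slide's timing results illustrate.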
HPCE / dt10 / 2012 / 0.29
Results from my dual-core laptop
[Chart: measured results for n up to 2.5×10^5, comparing v0: For Loop, v1: Vectorised, v2: ParFor Loop, v3: ParFor Chunked]