High–Performance Computing (HPC)
Prepared by: Abdussamad Muntahi


This presentation was prepared by Abdussamad Muntahi for the Seminar on High Performance Computing held on 11/7/13 (Thursday), organized by the BRAC University Computer Club (BUCC) in collaboration with the BRAC University Electronics and Electrical Club (BUEEC).


Page 1: High–Performance Computing

High–Performance Computing (HPC)

Prepared By:

Abdussamad Muntahi

1

© Copyright: Abdussamad Muntahi & BUCC, 2013
Page 2: High–Performance Computing

Introduction

• High-speed computing, originally implemented only in supercomputers for scientific research

• Tools and systems are available to implement and create high-performance computing systems

• Used for scientific research and computational science
• The main focus of the discipline is developing parallel-processing algorithms and software, so that programs can be divided into small, independent parts that can be executed simultaneously by separate processors

• HPC systems have shifted from supercomputers to computing clusters

2

Page 3: High–Performance Computing

What is a Cluster?

• A cluster is a group of machines interconnected in such a way that they work together as a single system

3

• Terminology
o Node – an individual machine in a cluster

o Head/Master node – connected to both the private network of the cluster and a public network, and used to access a given cluster. Responsible for providing the user an environment to work in and for distributing tasks among the other nodes

o Compute nodes – connected only to the private network of the cluster and generally used for running jobs assigned to them by the head node(s)

Page 4: High–Performance Computing

What is a Cluster?

• Types of Cluster
o Storage

Storage clusters provide a consistent file-system image, allowing simultaneous read and write access to a single shared file system

o High-availability (HA)
Provides continuous availability of services by eliminating single points of failure

o Load-balancing
Sends network service requests to multiple cluster nodes to balance the request load among the cluster nodes

o High-performance
Uses cluster nodes to perform concurrent calculations, allowing applications to work in parallel to enhance their performance
Also referred to as computational clusters or grid computing

4

Page 5: High–Performance Computing

Benefits of Cluster

• Reduced Cost
o The price of off-the-shelf consumer desktops has plummeted in recent years, and this drop in price has corresponded with a vast increase in their processing power and performance. The average desktop PC today is many times more powerful than the first mainframe computers.

• Processing Power
o The parallel processing power of a high-performance cluster can, in many cases, prove more cost-effective than a mainframe with similar power. This reduced price per unit of power enables enterprises to get a greater ROI (Return on Investment).

• Scalability
o Perhaps the greatest advantage of computer clusters is the scalability they offer. While mainframe computers have a fixed processing capacity, computer clusters can be easily expanded as requirements change by adding nodes to the network.

5

Page 6: High–Performance Computing

Benefits of Cluster

• Improved Network Technology
o In clusters, computers are typically connected via a single virtual local area network (VLAN), and the network treats each computer as a separate node. Information can be passed throughout these networks with very little lag, ensuring that data doesn’t bottleneck between nodes.

6

• Availability
o When a mainframe computer fails, the entire system fails. However, if a node in a computer cluster fails, its operations can be simply transferred to another node within the cluster, ensuring that there is no interruption in service.

Page 7: High–Performance Computing

Invention of HPC

• Need for ever‐increasing performance

• And the visionary concept of Parallel Computing

7

Page 8: High–Performance Computing

Why we need ever‐increasing performance

• Computational power is increasing, but so are our computation problems and needs.

• Some Examples:
– Case 1: Complete a time-consuming operation in less time

• I am an automotive engineer 

• I need to design a new car that consumes less gasoline 

• I’d rather have the design completed in 6 months than in 2 years 

• I want to test my design using computer simulations rather than building very expensive prototypes and crashing them 

– Case 2: Complete an operation under a tight deadline
• I work for a weather prediction agency

• I am getting input from weather stations/sensors 

• I’d like to predict tomorrow’s forecast today 

8

Page 9: High–Performance Computing

Why we need ever‐increasing performance

– Case 3: Perform a high number of operations per second
• I am an engineer at Amazon.com

• My web server gets 1,000 hits per second

• I'd like my web server and databases to handle 1,000 transactions per second so that customers do not experience bad delays

9

Page 10: High–Performance Computing

Why we need ever‐increasing performance

10

Climate modeling, protein folding, drug discovery, energy research, data analysis

Page 11: High–Performance Computing

Where are we using HPC?

• Used to solve complex modeling problems in a spectrum of disciplines

• Topics include:
o Artificial intelligence
o Climate modeling
o Automotive engineering
o Cryptographic analysis
o Geophysics
o Molecular biology
o Molecular dynamics
o Nuclear physics
o Physical oceanography
o Plasma physics
o Quantum physics
o Quantum chemistry
o Solid-state physics
o Structural dynamics

• HPC is currently applied to business uses as well
o Data warehouses
o Transaction processing

11

Page 12: High–Performance Computing

Top 10 Supercomputers for HPC

12

[Table: TOP500 list of the top 10 supercomputers, June 2013. Copyright (c) 2000-2009 TOP500.Org | All trademarks and copyrights on this page are owned by their respective owners]

Page 13: High–Performance Computing

Fastest Supercomputer: Tianhe-2 (MilkyWay-2) @ China's National University of Defense Technology

13

Page 14: High–Performance Computing

Changing times

• From 1986 – 2002, microprocessors were speeding like a rocket, increasing in performance an average of 50% per year

• Since then, it’s dropped to about 20% increase per year

14

Page 15: High–Performance Computing

The Problem

• Up to now, performance increases have been attributed to increasing density of transistors

• But there are inherent problems

• A little Physics lesson –
– Smaller transistors = faster processors

– Faster processors = increased power consumption

– Increased power consumption = increased heat

– Increased heat = unreliable processors

15

Page 16: High–Performance Computing

An intelligent solution

• Move away from single-core systems to multicore processors

• "core" = processing unit
• Introduction of parallelism!!!

• But …
– Adding more processors doesn't help much if programmers aren't aware of them…

– … or don’t know how to use them.

– Serial programs don’t benefit from this approach (in most cases)

16

Page 17: High–Performance Computing

Parallel Computing

• A form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently, i.e. "in parallel"

• So, we need to rewrite serial programs so that they're parallel.

• Or write translation programs that automatically convert serial programs into parallel programs.
– This is very difficult to do.
– Success has been limited.

17

Page 18: High–Performance Computing

Parallel Computing

• Example
– Compute n values and add them together.

– Serial solution (a sketch follows below):

18
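The slide's serial code is not reproduced in this transcript; below is a minimal stand-in sketch in C, assuming a hypothetical Compute_next_value() function that produces each value.

#include <stdio.h>

/* Hypothetical stand-in for whatever produces each value. */
double Compute_next_value(int i) {
    return (double)i;
}

int main(void) {
    int n = 200;           /* number of values to add */
    double sum = 0.0;

    /* One core walks through all n values, one addition each. */
    for (int i = 0; i < n; i++) {
        double x = Compute_next_value(i);
        sum += x;
    }

    printf("sum = %f\n", sum);
    return 0;
}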

Page 19: High–Performance Computing

Parallel Computing

• Example
– We have p cores, p much smaller than n.

– Each core performs a partial sum of approximately n/p values.

19

Each core uses its own private variables and executes this block of code independently of the other cores (a sketch follows below).
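A minimal sketch of that per-core block plus the final combination step, written with POSIX threads (the slides do not name a threading library, so pthreads is an assumption): each thread stands in for one core and keeps its own private my_sum.

#include <pthread.h>
#include <stdio.h>

#define N 200     /* number of values        */
#define P 8       /* number of cores/threads */

static double partial[P];   /* one result slot per "core" */

/* Hypothetical stand-in for whatever produces each value. */
static double Compute_next_value(int i) { return (double)i; }

/* The block of code each core executes independently. */
static void *core_work(void *arg) {
    int my_rank    = (int)(long)arg;
    int my_n       = N / P;              /* ~n/p values per core */
    int my_first_i = my_rank * my_n;
    int my_last_i  = my_first_i + my_n;
    double my_sum  = 0.0;                /* private variable     */

    for (int my_i = my_first_i; my_i < my_last_i; my_i++)
        my_sum += Compute_next_value(my_i);

    partial[my_rank] = my_sum;
    return NULL;
}

int main(void) {
    pthread_t threads[P];

    for (long r = 0; r < P; r++)
        pthread_create(&threads[r], NULL, core_work, (void *)r);
    for (int r = 0; r < P; r++)
        pthread_join(threads[r], NULL);  /* wait for all cores   */

    /* The master then adds the p partial sums (the extra additions
       counted on the next slide). */
    double sum = 0.0;
    for (int r = 0; r < P; r++)
        sum += partial[r];

    printf("sum = %f\n", sum);
    return 0;
}

Compiled with something like cc -pthread, this prints the same total as the serial version.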

Page 20: High–Performance Computing

Parallel Computing

• Example
– After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.

– E.g., for n = 200:
• Serial – will take 200 additions

• Parallel (for 8 cores)
– each core will perform n/p = 25 additions

– and the master will perform 8 more additions + 8 receive operations

– Total: 41 operations

20

Page 21: High–Performance Computing

Parallel Computing

• Some coding constructs can be recognized by an automatic program generator, and converted to a parallel construct.

• However, it’s likely that the result will be a very inefficient program.

• Sometimes the best parallel solution is to step back and devise an entirely new algorithm.

• Parallel computer programs are more difficult to write than sequential programs

• Potential problems
– Race conditions (output depending on the sequence or timing of other events); a small example follows below

– Communication and synchronization between the different subtasks

21
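A small hypothetical illustration of a race condition (not from the slides), again assuming POSIX threads: two threads increment a shared counter without synchronization, and because each counter++ is a separate read-modify-write, the final value is usually less than the expected 2,000,000.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared, unprotected */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* The output depends on how the two threads interleave. */
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}

Guarding the update with a mutex (or using an atomic increment) removes the race, at the cost of the communication and synchronization overhead mentioned above.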

Page 22: High–Performance Computing

Parallel Computing

• Parallel computer classification
– The semiconductor industry has settled on two main trajectories

• Multicore trajectory – CPU
– Coarse, heavyweight threads; better performance per thread
– Maximizes the speed of sequential programs

• Many-core trajectory – GPU
– Large number of much smaller cores to improve the execution throughput of parallel applications
– Fine, lightweight threads
– Single-thread performance is poor

22

Presenter Notes: A thread of execution is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.
Page 23: High–Performance Computing

CPU vs. GPU

• CPU
– Uses sophisticated control logic to allow single-thread execution

– Uses a large cache to reduce the latency of instruction and data access

– Neither of these contributes to the peak calculation speed

23

[Diagram: CPU block layout – Control logic, a few ALUs, Cache, DRAM]

Page 24: High–Performance Computing

CPU vs. GPU

• GPU
– Needs to conduct a massive number of floating-point calculations

– Optimizes the execution throughput of massive numbers of threads

– Cache memories help control the bandwidth requirements and reduce DRAM accesses

24

[Diagram: GPU block layout – many small cores, DRAM]

Page 25: High–Performance Computing

CPU vs. GPU

• Speed
– Calculation speed: 367 GFLOPS vs. 32 GFLOPS

– Ratio is about 10 to 1 for GPU vs. CPU

– But speed-up depends on
• Problem set

• Level of parallelism

• Code optimization

• Memory management

25

Presenter Notes: FLOPS (FLoating-point Operations Per Second) is a measure of computer performance, especially in fields of scientific calculation that make heavy use of floating-point operations, similar to the older, simpler instructions per second.
Page 26: High–Performance Computing

CPU vs. GPU

26

Architecture: CPU took a right hand turn

Page 27: High–Performance Computing

CPU vs. GPU

27

Architecture: GPU still keeping up with Moore’s Law

Page 28: High–Performance Computing

CPU vs. GPU

28

• Architecture and Technology
– Control hardware dominates μprocessors

• Complex, difficult to build and verify

• Scales poorly
– Pay for max throughput, sustain average throughput

– Control hardware doesn’t do any math!

– Industry moving from “instructions per second” to “instructions per watt”

• Traditional μprocessors are not power‐efficient

– We can continue to put more transistors on a chip
• … but we can't scale their voltage like we used to …

• … and we can’t clock them as fast …

Page 29: High–Performance Computing

Why GPU?

29

• GPU is a massively parallel architecture
– Many problems map well to GPU-style computing

– GPUs have a large amount of arithmetic capability

– Increasing amount of programmability in the pipeline

– CPUs have dual- and quad-core chips, but a GPU currently has 240 cores (GeForce GTX 280)

• Memory Bandwidth
– CPU – 3.2 GB/s; GPU – 141.7 GB/s

• Speed
– CPU – 20 GFLOPS (per core)

– GPU – 933 (single-precision or int) / 78 (double-precision) GFLOPS

• Direct access to compute units in new APIs

Page 30: High–Performance Computing

CPU + GPU

30

• CPU and GPU together are a powerful combination
– CPUs consist of a few cores optimized for serial processing

– GPUs consist of thousands of smaller, more efficient cores designed for parallel performance

– Serial portions of the code run on the CPU

– Parallel portions run on the GPU

– Overall performance is significantly faster

• This idea ignited the movement of GPGPU (General‐Purpose computation on GPU)

Page 31: High–Performance Computing

GPGPU

31

• Using GPU (graphics processing unit) together with a CPU to accelerate general‐purpose scientific and engineering applications

• GPGPU computing offers speed-up by
– Offloading compute-intensive portions of the application to the GPU

– while the remainder of the code still runs on the CPU

• Data-parallel algorithms take advantage of GPU attributes
– Large data arrays, streaming throughput

– Fine‐grain SIMD (single‐instruction multiple‐data) parallelism

– Low‐latency floating point computation

Page 32: High–Performance Computing

Parallel Programming

• HPC parallel programming models associated with different computing technologies
– Parallel programming in CPU clusters

– General-purpose GPU programming

32

Page 33: High–Performance Computing

Operational Model: CPU

• Originally designed for distributed-memory architectures

• Tasks are divided among p processes

• Data-parallel, compute-intensive functions should be selected to be assigned to these processes

• Functions that are executed many times, but independently on different data, are prime candidates
– e.g. the body of a for-loop

33

Page 34: High–Performance Computing

Operational Model: CPU

• Execution model allows each task to operate independently

• Memory model assumes that memory is private to each task
– Data is moved point-to-point between processes

• Perform some collective computations and, at the end, gather results from the different processes (a sketch using point-to-point messages follows below)
– Needs synchronization after each process finishes its tasks

34
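As one concrete realization of this model, a minimal point-to-point sketch using MPI (which the next two slides introduce); the program itself is an illustration, not taken from the slides. Every process does some local work, the workers send their results to process 0, and process 0 gathers and combines them.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_result = rank * 1.0;   /* stand-in for this task's local work */

    if (rank != 0) {
        /* Workers move their data point-to-point to the gathering process. */
        MPI_Send(&my_result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Task 0 gathers results; the blocking receives also act as the
           synchronization after the other tasks finish.                  */
        double total = my_result, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("gathered total = %f\n", total);
    }

    MPI_Finalize();
    return 0;
}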

Page 35: High–Performance Computing

Programming Language: MPI

• Message Passing Interface
– An application programming interface (API) specification that allows processes to communicate with one another by sending and receiving messages

– Now a de facto standard for parallel programs running on distributed-memory systems in computer clusters and supercomputers

– A message-passing API with language-independent protocol and semantic specifications

– Supports both point-to-point and collective communication

– Communications are defined by the APIs

35

Page 36: High–Performance Computing

Programming Language: MPI

• Message Passing Interface
– Goals are standardization, high performance, scalability, and portability

– Consists of a specific set of routines (i.e. APIs) directly callable from C, C++, Fortran, and any language able to interface with such libraries

– A program consists of autonomous processes
• The processes may run either the same code (SPMD style) or different codes (heterogeneous)

– Processes communicate with each other via calls to MPI functions (a minimal example follows below)

36
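A minimal collective-communication sketch in C, tying back to the earlier sum example: every process computes a partial sum of the n values and MPI_Reduce combines them on rank 0. The routine names are standard MPI; the program itself is an illustration, not taken from the slides.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for whatever produces each value. */
static double Compute_next_value(int i) { return (double)i; }

int main(int argc, char *argv[]) {
    int rank, size, n = 200;
    double my_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);                      /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's id       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* number of processes (p) */

    /* Each process sums its own slice of the n values (SPMD style). */
    for (int i = rank; i < n; i += size)
        my_sum += Compute_next_value(i);

    /* Collective communication: combine the partial sums on rank 0. */
    MPI_Reduce(&my_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}

With a typical MPI installation this would be built with mpicc and started with something like mpirun -np 8, so all 8 processes run the same executable.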

Page 37: High–Performance Computing

Operation Model: GPU

37

• CPU and GPU operate with separate memory pools

• CPUs are masters and GPUs are workers
– CPUs launch computations onto the GPUs

– CPUs can be used for other computations as well

– GPUs have limited communication back to CPUs

• CPU must initiate data transfers to the GPU memory
– Synchronous data transfer – the CPU waits for the transfer to complete

– Asynchronous data transfer – the CPU continues with other work and checks later whether the transfer is complete

Page 38: High–Performance Computing

Operation Model: GPU

38

• The GPU cannot directly access main memory

• The CPU cannot directly access GPU memory

• Data must be copied explicitly between the two (a sketch follows below)
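A minimal sketch of those explicit copies, assuming the CUDA runtime API that the following slides introduce (illustration only): the host allocates device memory, copies data in synchronously, then copies it back asynchronously and synchronizes.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);      /* host (CPU) memory   */
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data = NULL;                        /* device (GPU) memory */
    cudaMalloc((void **)&d_data, bytes);

    /* Synchronous copy: the CPU waits until the transfer completes. */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* Asynchronous copy: the CPU may continue with other work and
       synchronize later (true overlap needs page-locked host memory,
       omitted here for brevity). */
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);
    cudaDeviceSynchronize();                     /* wait for completion */

    cudaFree(d_data);
    free(h_data);
    printf("copies done\n");
    return 0;
}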

Page 39: High–Performance Computing

Operation Model: GPU

39

• The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)
– Data-parallel, compute-intensive functions should be off-loaded to the device

– Functions that are executed many times, but independently on different data, are prime candidates
• e.g. the body of a for-loop

– A function compiled for the device is called a kernel

– The kernel is executed on the device by many different threads

– Both host (CPU) and device (GPU) manage their own memory – host memory and device memory

Page 40: High–Performance Computing

Programming Language: CUDA

40

• Compute Unified Device Architecture
– Introduced by NVIDIA in late 2006

– It is a compiler and toolkit for programming NVIDIA GPUs

– API extends the C programming language

– Adds library functions to access GPU

– Adds directives to translate C into instructions that run on the host CPU or the GPU when needed

– Allows easy multi‐threading ‐ parallel execution on all thread processors on the GPU

– Runs on thousands of threads

– It is a scalable model

Page 41: High–Performance Computing

Programming Language: CUDA

41

• Compute Unified Device Architecture
– General-purpose programming model

• User kicks off batches of threads on the GPU

• Specific language and tools

– Driver for loading computation programs into the GPU
• Standalone driver – optimized for computation

• Interface designed for compute – graphics‐free API

• Explicit GPU memory management

– Objectives
• Express parallelism

• Give a high-level abstraction from the hardware
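Putting the pieces together, a minimal CUDA vector-addition sketch: a kernel compiled for the device, explicit host/device memory management, and a launch of a batch of lightweight threads. The syntax is standard CUDA C, but the example itself is an illustration rather than something taken from the slides.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel: a function compiled for the device; each of the many
   lightweight threads handles one element.                      */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Host memory. */
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Device memory and explicit copies (the host cannot touch these directly). */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch a batch of threads: enough 256-thread blocks to cover n. */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);               /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Each thread computes its own global index from blockIdx, blockDim, and threadIdx, which is what lets the same kernel scale across thousands of threads.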

Page 42: High–Performance Computing

42

The End
