High–Performance Computing (HPC)
Prepared by: Abdussamad Muntahi


This presentation was prepared by Abdussamad Muntahi for the Seminar on High Performance Computing held on 11/7/13 (Thursday), organized by the BRAC University Computer Club (BUCC) in collaboration with the BRAC University Electronics and Electrical Club (BUEEC).


Page 1: High–Performance Computing

High–Performance Computing (HPC)

Prepared By:

Abdussamad Muntahi

1

© Copyright: Abdussamad Muntahi & BUCC, 2013
Page 2: High–Performance Computing

Introduction

• High-speed computing, originally implemented only in supercomputers for scientific research

• Tools and systems are available to implement and create high-performance computing systems

• Used for scientific research and computational science
• The main focus of the discipline is developing parallel-processing algorithms and software, so that programs can be divided into small, independent parts that can be executed simultaneously by separate processors

• HPC systems have shifted from supercomputers to computing clusters

2

Page 3: High–Performance Computing

What is a Cluster?

• A cluster is a group of machines interconnected in such a way that they work together as a single system

3

• Terminology
o Node – an individual machine in a cluster

o Head/Master node – connected to both the private network of the cluster and a public network, and used to access a given cluster. Responsible for providing the user an environment to work in and for distributing tasks among the other nodes

o Compute nodes – connected only to the private network of the cluster and generally used for running jobs assigned to them by the head node(s)

Page 4: High–Performance Computing

What is a Cluster?

• Types of Cluster
o Storage

Storage clusters provide a consistent file-system image, allowing simultaneous read and write access to a single shared file system

o High-availability (HA)
Provides continuous availability of services by eliminating single points of failure

o Load-balancing
Sends network service requests to multiple cluster nodes to balance the request load among the cluster nodes

o High-performance
Uses cluster nodes to perform concurrent calculations, allowing applications to work in parallel to enhance their performance
Also referred to as computational clusters or grid computing

4

Page 5: High–Performance Computing

Benefits of Cluster

• Reduced Cost
o The price of off-the-shelf consumer desktops has plummeted in recent years, and this drop in price has corresponded with a vast increase in their processing power and performance. The average desktop PC today is many times more powerful than the first mainframe computers.

• Processing Power
o The parallel processing power of a high-performance cluster can, in many cases, prove more cost-effective than a mainframe with similar power. This reduced price per unit of power enables enterprises to get a greater ROI (Return on Investment).

• Scalability
o Perhaps the greatest advantage of computer clusters is the scalability they offer. While mainframe computers have a fixed processing capacity, computer clusters can be easily expanded as requirements change by adding nodes to the network.

5

Page 6: High–Performance Computing

Benefits of Cluster

• Improved Network Technology
o In clusters, computers are typically connected via a single virtual local area network (VLAN), and the network treats each computer as a separate node. Information can be passed throughout these networks with very little lag, ensuring that data doesn’t bottleneck between nodes.

6

• Availability
o When a mainframe computer fails, the entire system fails. However, if a node in a computer cluster fails, its operations can be simply transferred to another node within the cluster, ensuring that there is no interruption in service.

Page 7: High–Performance Computing

Invention of HPC

• Need for ever‐increasing performance

• And the visionary concept of Parallel Computing

7

Page 8: High–Performance Computing

Why we need ever‐increasing performance

• Computational power is increasing, but so are our computation problems and needs.

• Some Examples:
– Case 1: Complete a time-consuming operation in less time

• I am an automotive engineer 

• I need to design a new car that consumes less gasoline 

• I’d rather have the design completed in 6 months than in 2 years 

• I want to test my design using computer simulations rather than building very expensive prototypes and crashing them 

– Case 2: Complete an operation under a tight deadline
• I work for a weather prediction agency

• I am getting input from weather stations/sensors 

• I’d like to predict tomorrow’s forecast today 

8

Page 9: High–Performance Computing

Why we need ever‐increasing performance

– Case 3: Perform a high number of operations per second
• I am an engineer at Amazon.com

• My web server gets 1,000 hits per second

• I'd like my web server and databases to handle 1,000 transactions per second so that customers do not experience bad delays

9

Page 10: High–Performance Computing

Why we need ever‐increasing performance

10

Climate modeling, protein folding, drug discovery, energy research, data analysis

Page 11: High–Performance Computing

Where are we using HPC?

• Used to solve complex modeling problems in a spectrum of disciplines

• Topics include:
o Artificial intelligence
o Climate modeling
o Automotive engineering
o Cryptographic analysis
o Geophysics
o Molecular biology
o Molecular dynamics
o Nuclear physics
o Physical oceanography
o Plasma physics
o Quantum physics
o Quantum chemistry
o Solid-state physics
o Structural dynamics

• HPC is currently applied to business uses as well
o Data warehouses
o Transaction processing

11

Page 12: High–Performance Computing

Top 10 Supercomputers for HPC

12

[Table: TOP500 list of the top 10 supercomputers, June 2013. Copyright (c) 2000-2009 TOP500.Org | All trademarks and copyrights on this page are owned by their respective owners]

Page 13: High–Performance Computing

Fastest Supercomputer: Tianhe-2 (MilkyWay-2) @ China's National University of Defense Technology

13

Page 14: High–Performance Computing

Changing times

• From 1986 – 2002, microprocessors were speeding like a rocket, increasing in performance an average of 50% per year

• Since then, it’s dropped to about 20% increase per year

14

Page 15: High–Performance Computing

The Problem

• Up to now, performance increases have been attributed to increasing density of transistors

• But there are inherent problems

• A little Physics lesson –
– Smaller transistors = faster processors

– Faster processors = increased power consumption

– Increased power consumption = increased heat

– Increased heat = unreliable processors

15

Page 16: High–Performance Computing

An intelligent solution

• Move away from single-core systems to multicore processors

• "core" = processing unit
• Introduction of parallelism!!!

• But …
– Adding more processors doesn't help much if programmers aren't aware of them…

– … or don’t know how to use them.

– Serial programs don’t benefit from this approach (in most cases)

16

Page 17: High–Performance Computing

Parallel Computing

• A form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently, i.e. "in parallel"

• So, we need to rewrite serial programs so that they're parallel.

• Or write translation programs that automatically convert serial programs into parallel programs.
– This is very difficult to do.
– Success has been limited.

17

Page 18: High–Performance Computing

Parallel Computing

• Example
– Compute n values and add them together.

– Serial solution (a sketch follows below):

18
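The slide's serial code is not reproduced in this transcript; below is a minimal stand-in sketch in C, assuming a hypothetical Compute_next_value() function that produces each value.

#include <stdio.h>

/* Hypothetical stand-in for whatever produces each value. */
double Compute_next_value(int i) {
    return (double)i;
}

int main(void) {
    int n = 200;           /* number of values to add */
    double sum = 0.0;

    /* One core walks through all n values, one addition each. */
    for (int i = 0; i < n; i++) {
        double x = Compute_next_value(i);
        sum += x;
    }

    printf("sum = %f\n", sum);
    return 0;
}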

Page 19: High–Performance Computing

Parallel Computing

• Example
– We have p cores, p much smaller than n.

– Each core performs a partial sum of approximately n/p values.

19

Each core uses its own private variables and executes this block of code independently of the other cores (a sketch follows below).
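A minimal sketch of that per-core block plus the final combination step, written with POSIX threads (the slides do not name a threading library, so pthreads is an assumption): each thread stands in for one core and keeps its own private my_sum.

#include <pthread.h>
#include <stdio.h>

#define N 200     /* number of values        */
#define P 8       /* number of cores/threads */

static double partial[P];   /* one result slot per "core" */

/* Hypothetical stand-in for whatever produces each value. */
static double Compute_next_value(int i) { return (double)i; }

/* The block of code each core executes independently. */
static void *core_work(void *arg) {
    int my_rank    = (int)(long)arg;
    int my_n       = N / P;              /* ~n/p values per core */
    int my_first_i = my_rank * my_n;
    int my_last_i  = my_first_i + my_n;
    double my_sum  = 0.0;                /* private variable     */

    for (int my_i = my_first_i; my_i < my_last_i; my_i++)
        my_sum += Compute_next_value(my_i);

    partial[my_rank] = my_sum;
    return NULL;
}

int main(void) {
    pthread_t threads[P];

    for (long r = 0; r < P; r++)
        pthread_create(&threads[r], NULL, core_work, (void *)r);
    for (int r = 0; r < P; r++)
        pthread_join(threads[r], NULL);  /* wait for all cores   */

    /* The master then adds the p partial sums (the extra additions
       counted on the next slide). */
    double sum = 0.0;
    for (int r = 0; r < P; r++)
        sum += partial[r];

    printf("sum = %f\n", sum);
    return 0;
}

Compiled with something like cc -pthread, this prints the same total as the serial version.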

Page 20: High–Performance Computing

Parallel Computing

• Example
– After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.

– E.g., for n = 200:
• Serial – will take 200 additions

• Parallel (for 8 cores)
– each core will perform n/p = 25 additions

– and the master will perform 8 more additions + 8 receive operations

– Total: 41 operations

20

Page 21: High–Performance Computing

Parallel Computing

• Some coding constructs can be recognized by an automatic program generator, and converted to a parallel construct.

• However, it’s likely that the result will be a very inefficient program.

• Sometimes the best parallel solution is to step back and devise an entirely new algorithm.

• Parallel computer programs are more difficult to write than sequential programs

• Potential problems
– Race conditions (output depending on the sequence or timing of other events); a small example follows below

– Communication and synchronization between the different subtasks

21
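A small hypothetical illustration of a race condition (not from the slides), again assuming POSIX threads: two threads increment a shared counter without synchronization, and because each counter++ is a separate read-modify-write, the final value is usually less than the expected 2,000,000.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared, unprotected */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* The output depends on how the two threads interleave. */
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}

Guarding the update with a mutex (or using an atomic increment) removes the race, at the cost of the communication and synchronization overhead mentioned above.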

Page 22: High–Performance Computing

Parallel Computing

• Parallel computer classification
– The semiconductor industry has settled on two main trajectories

• Multicore trajectory – CPU
– Coarse, heavyweight threads; better performance per thread
– Maximizes the speed of sequential programs

• Many-core trajectory – GPU
– Large number of much smaller cores to improve the execution throughput of parallel applications
– Fine, lightweight threads
– Single-thread performance is poor

22

Presenter Notes: A thread of execution is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.
Page 23: High–Performance Computing

CPU vs. GPU

• CPU
– Uses sophisticated control logic to allow single-thread execution

– Uses a large cache to reduce the latency of instruction and data access

– Neither of these contributes to the peak calculation speed

23

[Diagram: CPU block layout – Control logic, a few ALUs, Cache, DRAM]

Page 24: High–Performance Computing

CPU vs. GPU

• GPU
– Needs to conduct a massive number of floating-point calculations

– Optimizes the execution throughput of massive numbers of threads

– Cache memories help control the bandwidth requirements and reduce DRAM accesses

24

[Diagram: GPU block layout – many small cores, DRAM]

Page 25: High–Performance Computing

CPU vs. GPU

• Speed
– Calculation speed: 367 GFLOPS vs. 32 GFLOPS

– Ratio is about 10 to 1 for GPU vs. CPU

– But speed-up depends on
• Problem set

• Level of parallelism

• Code optimization

• Memory management

25

Presenter Notes: FLOPS (FLoating-point Operations Per Second) is a measure of computer performance, especially in fields of scientific calculation that make heavy use of floating-point operations, similar to the older, simpler instructions per second.
Page 26: High–Performance Computing

CPU vs. GPU

26

Architecture: CPU took a right hand turn

Page 27: High–Performance Computing

CPU vs. GPU

27

Architecture: GPU still keeping up with Moore’s Law

Page 28: High–Performance Computing

CPU vs. GPU

28

• Architecture and Technology
– Control hardware dominates μprocessors

• Complex, difficult to build and verify

• Scales poorly
– Pay for max throughput, sustain average throughput

– Control hardware doesn’t do any math!

– Industry moving from “instructions per second” to “instructions per watt”

• Traditional μprocessors are not power‐efficient

– We can continue to put more transistors on a chip
• … but we can't scale their voltage like we used to …

• … and we can’t clock them as fast …

Page 29: High–Performance Computing

Why GPU?

29

• GPU is a massively parallel architecture
– Many problems map well to GPU-style computing

– GPUs have a large amount of arithmetic capability

– Increasing amount of programmability in the pipeline

– CPUs have dual- and quad-core chips, but a GPU currently has 240 cores (GeForce GTX 280)

• Memory Bandwidth
– CPU – 3.2 GB/s; GPU – 141.7 GB/s

• Speed
– CPU – 20 GFLOPS (per core)

– GPU – 933 (single-precision or int) / 78 (double-precision) GFLOPS

• Direct access to compute units in new APIs

Page 30: High–Performance Computing

CPU + GPU

30

• CPU and GPU together are a powerful combination
– CPUs consist of a few cores optimized for serial processing

– GPUs consist of thousands of smaller, more efficient cores designed for parallel performance

– Serial portions of the code run on the CPU

– Parallel portions run on the GPU

– Overall performance is significantly faster

• This idea ignited the movement of GPGPU (General‐Purpose computation on GPU)

Page 31: High–Performance Computing

GPGPU

31

• Using GPU (graphics processing unit) together with a CPU to accelerate general‐purpose scientific and engineering applications

• GPGPU computing offers speed-up by
– Offloading compute-intensive portions of the application to the GPU

– while the remainder of the code still runs on the CPU

• Data-parallel algorithms take advantage of GPU attributes
– Large data arrays, streaming throughput

– Fine‐grain SIMD (single‐instruction multiple‐data) parallelism

– Low‐latency floating point computation

Page 32: High–Performance Computing

Parallel Programming

• HPC parallel programming models associated with different computing technologies
– Parallel programming in CPU clusters

– General-purpose GPU programming

32

Page 33: High–Performance Computing

Operational Model: CPU

• Originally designed for distributed-memory architectures

• Tasks are divided among p processes

• Data-parallel, compute-intensive functions should be selected to be assigned to these processes

• Functions that are executed many times, but independently on different data, are prime candidates
– e.g. the body of a for-loop

33

Page 34: High–Performance Computing

Operational Model: CPU

• Execution model allows each task to operate independently

• Memory model assumes that memory is private to each task
– Data is moved point-to-point between processes

• Perform some collective computations and, at the end, gather results from the different processes (a sketch using point-to-point messages follows below)
– Needs synchronization after each process finishes its tasks

34
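As one concrete realization of this model, a minimal point-to-point sketch using MPI (which the next two slides introduce); the program itself is an illustration, not taken from the slides. Every process does some local work, the workers send their results to process 0, and process 0 gathers and combines them.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_result = rank * 1.0;   /* stand-in for this task's local work */

    if (rank != 0) {
        /* Workers move their data point-to-point to the gathering process. */
        MPI_Send(&my_result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Task 0 gathers results; the blocking receives also act as the
           synchronization after the other tasks finish.                  */
        double total = my_result, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("gathered total = %f\n", total);
    }

    MPI_Finalize();
    return 0;
}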

Page 35: High–Performance Computing

Programming Language: MPI

• Message Passing Interface
– An application programming interface (API) specification that allows processes to communicate with one another by sending and receiving messages

– Now a de facto standard for parallel programs running on distributed-memory systems in computer clusters and supercomputers

– A message-passing API with language-independent protocol and semantic specifications

– Supports both point-to-point and collective communication

– Communications are defined by the APIs

35

Page 36: High–Performance Computing

Programming Language: MPI

• Message Passing Interface
– Goals are standardization, high performance, scalability, and portability

– Consists of a specific set of routines (i.e. APIs) directly callable from C, C++, Fortran, and any language able to interface with such libraries

– A program consists of autonomous processes
• The processes may run either the same code (SPMD style) or different codes (heterogeneous)

– Processes communicate with each other via calls to MPI functions (a minimal example follows below)

36
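A minimal collective-communication sketch in C, tying back to the earlier sum example: every process computes a partial sum of the n values and MPI_Reduce combines them on rank 0. The routine names are standard MPI; the program itself is an illustration, not taken from the slides.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for whatever produces each value. */
static double Compute_next_value(int i) { return (double)i; }

int main(int argc, char *argv[]) {
    int rank, size, n = 200;
    double my_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);                      /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's id       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* number of processes (p) */

    /* Each process sums its own slice of the n values (SPMD style). */
    for (int i = rank; i < n; i += size)
        my_sum += Compute_next_value(i);

    /* Collective communication: combine the partial sums on rank 0. */
    MPI_Reduce(&my_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}

With a typical MPI installation this would be built with mpicc and started with something like mpirun -np 8, so all 8 processes run the same executable.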

Page 37: High–Performance Computing

Operation Model: GPU

37

• CPU and GPU operate with separate memory pools

• CPUs are masters and GPUs are workers
– CPUs launch computations onto the GPUs

– CPUs can be used for other computations as well

– GPUs have limited communication back to CPUs

• CPU must initiate data transfers to the GPU memory
– Synchronous data transfer – the CPU waits for the transfer to complete

– Asynchronous data transfer – the CPU continues with other work and checks later whether the transfer is complete

Page 38: High–Performance Computing

Operation Model: GPU

38

• The GPU cannot directly access main memory

• The CPU cannot directly access GPU memory

• Data must be copied explicitly between the two (a sketch follows below)
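A minimal sketch of those explicit copies, assuming the CUDA runtime API that the following slides introduce (illustration only): the host allocates device memory, copies data in synchronously, then copies it back asynchronously and synchronizes.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);      /* host (CPU) memory   */
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data = NULL;                        /* device (GPU) memory */
    cudaMalloc((void **)&d_data, bytes);

    /* Synchronous copy: the CPU waits until the transfer completes. */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* Asynchronous copy: the CPU may continue with other work and
       synchronize later (true overlap needs page-locked host memory,
       omitted here for brevity). */
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);
    cudaDeviceSynchronize();                     /* wait for completion */

    cudaFree(d_data);
    free(h_data);
    printf("copies done\n");
    return 0;
}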

Page 39: High–Performance Computing

Operation Model: GPU

39

• The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)
– Data-parallel, compute-intensive functions should be off-loaded to the device

– Functions that are executed many times, but independently on different data, are prime candidates
• e.g. the body of a for-loop

– A function compiled for the device is called a kernel

– The kernel is executed on the device by many different threads

– Both host (CPU) and device (GPU) manage their own memory – host memory and device memory

Page 40: High–Performance Computing

Programming Language: CUDA

40

• Compute Unified Device Architecture
– Introduced by NVIDIA in late 2006

– It is a compiler and toolkit for programming NVIDIA GPUs

– API extends the C programming language

– Adds library functions to access GPU

– Adds directives to translate C into instructions that run on the host CPU or the GPU when needed

– Allows easy multi‐threading ‐ parallel execution on all thread processors on the GPU

– Runs on thousands of threads

– It is a scalable model

Page 41: High–Performance Computing

Programming Language: CUDA

41

• Compute Unified Device Architecture
– General-purpose programming model

• User kicks off batches of threads on the GPU

• Specific language and tools

– Driver for loading computation programs into the GPU
• Standalone driver – optimized for computation

• Interface designed for compute – graphics‐free API

• Explicit GPU memory management

– Objectives
• Express parallelism

• Give a high-level abstraction from the hardware
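Putting the pieces together, a minimal CUDA vector-addition sketch: a kernel compiled for the device, explicit host/device memory management, and a launch of a batch of lightweight threads. The syntax is standard CUDA C, but the example itself is an illustration rather than something taken from the slides.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel: a function compiled for the device; each of the many
   lightweight threads handles one element.                      */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Host memory. */
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Device memory and explicit copies (the host cannot touch these directly). */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch a batch of threads: enough 256-thread blocks to cover n. */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);               /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Each thread computes its own global index from blockIdx, blockDim, and threadIdx, which is what lets the same kernel scale across thousands of threads.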

Page 42: High–Performance Computing

42

The End
