ISCM-10, Taub Computing Center: High Performance Computing for Computational Mechanics. Moshe Goldberg, March 29, 2001


1

ISCM-10

Taub Computing Center

High Performance Computing for Computational Mechanics

Moshe Goldberg, March 29, 2001

2

High Performance Computing for CM

Agenda:

1) Overview
2) Alternative Architectures
3) Message Passing
4) "Shared Memory"
5) Case Study

3

1) High Performance Computing

- Overview

4

Some Important Points

* Understanding HPC concepts
* Why should programmers care about the architecture?
* Do compilers make the right choices?
* Nowadays, there are alternatives

5

Trends in computer development

* Speed of calculation is steadily increasing
* Memory may not be in balance with high calculation speeds
* Workstations are approaching speeds of especially efficient designs
* Are we approaching the limit of the speed of light?
* To get an answer faster, we must perform calculations in parallel

6

Some HPC concepts

* HPC
* HPF / Fortran90
* cc-NUMA
* Compiler directives
* OpenMP
* Message passing
* PVM/MPI
* Beowulf

7

MFLOPS for parix (origin2000), ax=b

[Chart: MFLOPS (0 to 4000) vs. number of processors (1 to 12), for n=2001, n=3501, and n=5001.]

8

ideal parallel speedup

[Chart: ideal speedup (1 to 11) vs. number of processors (1 to 12).]

ideal speedup = (time for 1 cpu) / (time for n cpu's)
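For example, using the timings of the sample program shown later in this talk: one thread takes 16.0 sec and four threads take 4.6 sec, so the speedup is 16.0 / 4.6, roughly 3.5.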

9

speedup for parix (origin2000), ax=b

[Chart: speedup vs. number of processors (1 to 12) for n=2001, n=3501, and n=5001, compared with the ideal line.]

10

"or" - MFLOPS for matrix multiply (n=3001)

0.0

2000.0

4000.0

6000.0

8000.0

10000.0

12000.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

MF

LO

PS

source

blas

11

"or" - Speedup for Matrix multiply (n=3001)

1.0

3.0

5.0

7.0

9.0

11.0

13.0

15.0

17.0

19.0

21.0

23.0

25.0

27.0

29.0

31.0

33.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

sp

ee

du

p

ideal

source

blas

12

"or" - solve linear equations

0.0

1000.0

2000.0

3000.0

4000.0

5000.0

6000.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

MF

LO

PS

n=2001

n=3501

n=5001

13

"or" - solve linear equations

1.0

3.0

5.0

7.0

9.0

11.0

13.0

15.0

17.0

19.0

21.0

23.0

25.0

27.0

29.0

31.0

33.0

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33

processors

spe

ed

up

ideal

n=2001

n=3501

n=5001

14

2) Alternative Architectures

15

Units Shipped -- All Vectors

[Chart: vector systems shipped per year (0 to 700), 1990-2000, broken down by vendor: Cray, Fujitsu, NEC, Other.]

Source: IDC, 2001

16

Units Shipped -- Capability Vector

[Chart: capability-class vector systems shipped per year (0 to 140), 1990-2000, broken down by vendor: Cray, Fujitsu, NEC, Other.]

Source: IDC, 2001

17

18

IUCC (Machba) computers

Cray J90 -- 32 cpu, memory 4 GB (500 MW)

Origin2000 -- 112 cpu (R12000, 400 MHz), 28.7 GB total memory

PC cluster -- 64 cpu (Pentium III, 550 MHz), total memory 9 GB

Mar 2001

19

Chris Hempel, hpc.utexas.edu

20

21

Chris Hempel, hpc.utexas.edu

22

Symmetric Multiple Processors

[Diagram: several CPUs connected over a memory bus to a single shared memory.]

Examples: SGI Power Challenge, Cray J90/T90

23

Distributed Parallel Computing

[Diagram: several CPUs, each paired with its own local memory.]

Examples: SP2, Beowulf

24

25

26

27

3) Message Passing

28

MPI commands -- examples

call MPI_SEND(sum,1,MPI_REAL,ito,itag,MPI_COMM_WORLD,ierror)

call MPI_RECV(sum,1,MPI_REAL,ifrom,itag,MPI_COMM_WORLD,istatus,ierror)

29

Some basic MPI functions

Setup: mpi_init, mpi_finalize

Environment: mpi_comm_size, mpi_comm_rank

Communication: mpi_send, mpi_recv

Synchronization: mpi_barrier
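A minimal sketch (not from the original slides) of how these basic calls fit together in a Fortran program: initialize MPI, query the communicator size and this process's rank, then finalize.

      program hello
      include 'mpif.h'
      integer ierr, rank, nprocs
c     start MPI and find out who we are
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      print *, 'process', rank, 'of', nprocs
c     every process must call mpi_finalize before exiting
      call mpi_finalize(ierr)
      end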

30

Other important MPI functions

Asynchronous communication: mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait/nowait

Collective communication: mpi_barrier, mpi_bcast, mpi_gather, mpi_scatter, mpi_reduce, mpi_allreduce

Derived data types: mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_type_pack, mpi_type_commit, mpi_type_free

Creating communicators: mpi_comm_dup, mpi_comm_split, mpi_intercomm_create, mpi_comm_free
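As a small illustrative sketch (not from the slides) of one collective call: each process contributes one value, and mpi_reduce sums the contributions onto rank 0.

      program redsum
      include 'mpif.h'
      integer ierr, rank, nprocs
      real part, total
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
c     each process contributes one value
      part = real(rank + 1)
c     combine the values with MPI_SUM; the result lands on rank 0
      call mpi_reduce(part, total, 1, MPI_REAL, MPI_SUM,
     &                0, MPI_COMM_WORLD, ierr)
      if (rank .eq. 0) print *, 'sum =', total
      call mpi_finalize(ierr)
      end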

31

4) “Shared Memory”

32

Fortran directives -- examples

CRAY:
CMIC$ DO ALL
      do i=1,n
         a(i)=i
      enddo

SGI:
C$DOACROSS
      do i=1,n
         a(i)=i
      enddo

OpenMP:
C$OMP parallel do
      do i=1,n
         a(i)=i
      enddo

33

OpenMP Summary

OpenMP standard – first published Oct 1997

Directives

Run-time Library Routines

Environment Variables

Versions for f77, f90, c, c++

34

OpenMP Summary

Parallel Do Directive

c$omp parallel do private(I) shared(a)
      do I=1,n
         a(I)=I+1
      enddo
c$omp end parallel do

(the "end parallel do" directive is optional)

35

OpenMP Summary

Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
         a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
         b(k)=k
      enddo
c$omp end do
c$omp end parallel

36

OpenMP Summary

Parallel Do Directive - Clauses

shared
private
default(private|shared|none)
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
ordered
copyin(var)
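A brief illustrative sketch (not from the slides) of the reduction clause, summing an array in parallel; each thread keeps a private partial sum that is combined at the end of the loop.

      program redclause
      integer i, n
      parameter (n = 1000)
      real a(n), s
      do i = 1, n
         a(i) = real(i)
      enddo
      s = 0.0
c$omp parallel do private(i) reduction(+:s)
      do i = 1, n
         s = s + a(i)
      enddo
c$omp end parallel do
      print *, 'sum =', s
      end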

37

OpenMP Summary

Run-Time Library Routines

Execution environment

omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
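A brief illustrative sketch (not from the slides) using two of these routines: inside a parallel region, each thread reports its own number and the size of the team.

      program whoami
      integer omp_get_thread_num, omp_get_num_threads
c$omp parallel
c     each thread prints its id and the number of threads in the team
      print *, 'thread', omp_get_thread_num(),
     &         'of', omp_get_num_threads()
c$omp end parallel
      end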

38

OpenMP Summary

Run-Time Library Routines

Lock routines

omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock

39

OpenMP Summary

Environment Variables

OMP_NUM_THREADS
OMP_DYNAMIC
OMP_NESTED

40

RISC memory levels

Single CPU

[Diagram: one CPU, with a cache between the CPU and main memory.]

41


42

RISC memory levels

Multiple CPU's

[Diagram: two CPUs (CPU 0 and CPU 1), each with its own cache (Cache 0, Cache 1), sharing one main memory.]

43


44

45

A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end

46

A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end

47

A sample program

Run on Technion origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:
                   threads
Compile          1      2      4
No parallel    15.0   15.3
Parallel       16.0   26.0   26.8

Is this running in parallel?

48

A sample program

Run on Technion origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:
                   threads
Compile          1      2      4
No parallel    15.0   15.3
Parallel       16.0   26.0   26.8

Is this running in parallel? WHY NOT?

49

A sample program

c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo

Is this running in parallel? WHY NOT?

Answer: by default, the variables a,b,c,d are defined as SHARED

50

A sample program

Solution: define a,b,c,d as PRIVATE:

c$omp parallel do private(a,b,c,d)

Elapsed time, sec:
                   threads
Compile          1      2      4
No parallel    15.0   15.3
Parallel       16.0    8.5    4.6

This is now running in parallel.
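For reference, a recap putting the fix in context: the routine from the earlier slide with the private clause added to the directive, which is the only change.

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do private(a,b,c,d)
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end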

51

5) Case Study

52

HPC in the Technion

SGI Origin2000 -- 22 cpu (R10000, 250 MHz), total memory 5.6 GB

PC cluster (Linux Red Hat 6.1) -- 6 cpu (Pentium II, 400 MHz), memory 500 MB/cpu

53

Fluent test case -- Stability of a subsonic turbulent jet

Source: Viktoria Suponitsky, Faculty of Aerospace Engineering, Technion

54

55

Reading "Case25unstead.cas"...

10000 quadrilateral cells, zone 1, binary.

19800 2D interior faces, zone 9, binary.

50 2D wall faces, zone 3, binary.

100 2D pressure-inlet faces, zone 7, binary.

50 2D pressure-outlet faces, zone 5, binary.

50 2D pressure-outlet faces, zone 6, binary.

50 2D velocity-inlet faces, zone 2, binary.

100 2D axis faces, zone 4, binary.

10201 nodes, binary.

10201 node flags, binary.

Fluent test case

10 time steps, 20 iterations per time step

56

57

58

Fluent test case

SMP command: fluent 2d -t8 -psmpi -g < inp

Host spawning Node 0 on machine "parix".

ID    Comm.  Hostname  O.S.   PID    Mach ID  HW ID  Name
----------------------------------------------------------------
host  net    parix     irix   19732  0        7      Fluent Host
n7    smpi   parix     irix   19776  0        7      Fluent Node
n6    smpi   parix     irix   19775  0        6      Fluent Node
n5    smpi   parix     irix   19771  0        5      Fluent Node
n4    smpi   parix     irix   19770  0        4      Fluent Node
n3    smpi   parix     irix   19772  0        3      Fluent Node
n2    smpi   parix     irix   19769  0        2      Fluent Node
n1    smpi   parix     irix   19768  0        1      Fluent Node
n0*   smpi   parix     irix   19767  0        0      Fluent Node

59

Fluent test case

Cluster command:
fluent 2d -cnf=clinux1,clinux2,clinux3,clinux4,clinux5,clinux6 -t6 -pnet -g < inp

Node 0 spawning Node 5 on machine "clinux6".

ID    Comm.  Hostname  O.S.        PID    Mach ID  HW ID  Name
----------------------------------------------------------------
n5    net    clinux6   linux-ia32  3560   5        9      Fluent Node
n4    net    clinux5   linux-ia32  19645  4        8      Fluent Node
n3    net    clinux4   linux-ia32  16696  3        7      Fluent Node
n2    net    clinux3   linux-ia32  17259  2        6      Fluent Node
n1    net    clinux2   linux-ia32  18328  1        5      Fluent Node
host  net    clinux1   linux-ia32  10358  0        3      Fluent Host
n0*   net    clinux1   linux-ia32  10400  0        -1     Fluent Node

60

Fluent test - time for multiple cpu's

[Chart: total run time, sec (0 to 450) vs. number of cpu's (1 to 8), for the origin2000 and the PC cluster.]

61

Fluent test - speedup by cpu's

[Chart: speedup vs. number of cpu's (1 to 8) for the origin2000 and the PC cluster, compared with the ideal line.]

62

TOP500 (November 2, 2000)

63

TOP500 (November 2, 2000)