1
Introduction to Supercomputing at ARSC
Kate Hedstrom
Arctic Region Supercomputing Center (ARSC)
January 2004
2
Topics
• Introduction to Supercomputers at ARSC
– Computers
• Accounts
– Getting an account
– Kerberos
– Getting help
• Architectures of parallel computers
– Programming models
• Running Jobs
– Compilers
– Storage
– Interactive and batch
3
Introduction to ARSC Supercomputers
• They’re all Parallel Computers
• Three Classes:
– Shared Memory
– Distributed Memory
– Distributed & Shared Memory
4
Cray X1: klondike
• 128 MSPs
• 4 MSPs/node
• 4 Vector CPUs/MSP, 800 MHz
• 512 GB Total
• 21 TB Disk
• 1600 GFLOPS peak
• NAC required
5
Cray SX-6: rime
• 8 500 MHz NEC Vector CPUs
• 64 GB of shared memory
• 1 TB RAID-5 Disk
• 64 GFLOPS peak
• Only one in the USA
• On loan from Cray
• Non-NAC
6
Cray SV1ex: chilkoot
• 32 Vector CPUs, 500 MHz
• 32 GB Shared memory
• 2 TB Disk
• 64 GFLOPS peak
• NAC required
7
Cray T3E: yukon
• 272 CPUs, 450 MHz
• 256 MB per processor
• 69.6 GB total distributed memory
• 230 GFLOPS peak
• NAC required
8
IBM Power4: iceberg
• 2 nodes of 32-way p690+s, 1.7 GHz (2 cabinets), 256 GB each
• 92 nodes of 8-way p655+s, 1.5 GHz (6 cabinets)
• 6 nodes of 8-way p655s, 1.1 GHz (1 cabinet)
• 16 GB Memory/Node
• 22 TB Disk
• 5000 GFLOPS
• NAC required
9
IBM Regatta: iceflyer
• 8-way, 16 GB front end coming soon
• 32 1.7 GHz Power4 CPUs in
– 24-way SMP node
– 7-way interactive node
– 1 test node
– 32-way SMP node soon
• 256 GB Memory
• 217 GFLOPS
• Non-NAC
10
IBM SP Power3: icehawk
• 50 4-way SMP nodes => 200 CPUs, 375 MHz
• 2 GB Memory/Node
• 36 GB Disk/Node
• 264 GFLOPS peak for 176 CPUs (max per job)
• Leaving soon
• NAC required
11
Storing Files
• Robotic tape silos
• Two Sun storage servers
• Nanook
– Non-NAC systems
• Seawolf
– NAC systems
12
Accounts, Logging In
• Getting an Account/Project
• Doing a NAC
• Logging in with Kerberos
13
Getting an Account/Project
• Academic: Applicant for resources is a PI:
– Full-time faculty or staff research person
– Non-commercial work, must reside in USA
– PI may add users to their project
– http://www.arsc.edu/support/accounts/acquire.html
• DoD Applicant
– http://www.hpcmo.hpc.mil/Htdocs/SAAA
• Commercial, Federal, State
– Contact User Services Director
– Barbara Horner-Miller, [email protected]
– Academic guidelines apply
14
Doing a National Agency Check (NAC)
• Required for HPCMO Resources only
– Not required for workstations, Cray SX-6, or IBM Regatta
• Not a security clearance
– But there are detailed questions covering the last 5-7 years
• Electronic Personnel Security Questionnaire (EPSQ)
– Windows-only software
• Fill out EPSQ cover sheet
– http://www.arsc.edu/support/policy/pdf/OPM_Cover.pdf
• Fingerprinting, Proof of Citizenship (passport, visa, etc.)
– See http://www.arsc.edu/support/policy/accesspolicy.html
15
Logging in with Kerberos
• On non-ARSC systems, download the Kerberos 5 clients
– http://www.arsc.edu/support/howtos/krbclients.html
• Used with SecureID
– Uses a PIN to generate a key at login time
• Login requires user name, pass phrase, & key
– Don’t share your PIN or SecureID with anyone
• Foreign Nationals or others with problems
– Contact ARSC to use ssh to connect to an ARSC gateway
– Still need Kerberos & SecureID after connecting
16
SecureID
17
From ARSC System
• Enter username
• Enter <return> for principal
• Enter pass phrase
• Enter SecureID passcode
• From that system: ssh iceflyer
• ssh handles X11 handshaking
18
From Your System
• Get Kerberos clients installed
• Get ticket: kinit [email protected]
• See tickets: klist
• Log into an ARSC system:
krlogin -l username iceflyer
ssh -l username iceflyer
ktelnet -l username iceflyer
19
Rime and Rimegate
• Log into rimegate as usual, with your rimegate username (arscxxx):
ssh -l arscksh rimegate
• Compile on rimegate (sxf90, sxc++)
• Log into rime from rimegate:
ssh rime
• Rimegate $HOME is /rimegate/users/username on rime
20
Supercomputer Architectures
• They’re all Parallel Computers
• Three Classes:
– Shared Memory
– Distributed Memory
– Distributed & Shared Memory
21
Shared Memory Architecture: Cray SV1, SX-6, IBM Regatta
22
Distributed Memory Architecture: Cray T3E
23
Cluster Architecture: IBM iceberg, icehawk, Cray X1
• Scalable, distributed, shared-memory parallel processor
24
Programming Models
• Vector Processing
– compiler detection or manual directives
• Threaded Processing (SMP)
– OpenMP, Pthreads, Java threads
– shared memory only
• Distributed Processing (MPP)
– message passing with MPI
– shared or distributed memory
25
Vector Programming
• Vector CPUs are specialized for array/matrix operations
– 64-element (SV1, X1), 256-element (SX-6) Vector Registers
– Operations proceed assembly-line fashion
– High memory-to-CPU bandwidth
• Less CPU time wasted waiting for data from memory
– Once loaded, produces one result per clock cycle
• Compiler does a lot of the work
26
Vector Programming
• Codes will run without modification.
• Cray compilers automatically detect loops which are safe to vectorize.
• Request a listing file to find out what vectorized.
• Programmer can assist the compiler:
– Directives and pragmas can force vectorization (see the sketch below)
– Eliminate conditions which inhibit vectorization (e.g., subroutine calls and data dependencies in loops)
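For instance, a minimal sketch (not from the original slides; the array names are made up): the first loop vectorizes on its own, while the second uses the Cray IVDEP directive to assert that the indirect addressing hides no dependence.

! The compiler vectorizes this simple loop automatically.
do i = 1, n
   y(i) = a*x(i) + y(i)
end do

! Indirect addressing: the compiler must assume ix(i) may repeat,
! which inhibits vectorization.  If the indices are known to be
! distinct, the directive tells the compiler it is safe:
!DIR$ IVDEP
do i = 1, n
   y(ix(i)) = y(ix(i)) + x(i)
end do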
27
Threaded Programming on Shared-Memory Systems
• OpenMP
– Directives/pragmas added to serial programs
– A portable standard implemented on Cray (one node), SGI, IBM (one node), etc.
• Other Threaded Paradigms
– Java Threads
– Pthreads
28
OpenMP Fortran Example

!$omp parallel do
do n = 1,10000
   A(n) = x * B(n) + c
end do
___________________________________________________
On 2 CPUs, this directive divides the work as follows:
CPU 1:
do n = 1,5000
   A(n) = x * B(n) + c
end do
CPU 2:
do n = 5001,10000
   A(n) = x * B(n) + c
end do
29
OpenMP C Example
#pragma omp parallel for
for (n = 0; n < 10000; n++)
   A[n] = x * B[n] + c;
___________________________________________________
On 2 CPUs, this pragma divides the work as follows:
CPU 1:
for (n = 0; n < 5000; n++)
   A[n] = x * B[n] + c;
CPU 2:
for (n = 5000; n < 10000; n++)
   A[n] = x * B[n] + c;
30
Threads Dynamically Appear and Disappear
Number set by Environment (OMP_NUM_THREADS)
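A minimal sketch of this fork/join behavior (illustrative, not from the original slides; assumes the standard omp_lib module):

program forkjoin
   use omp_lib            ! OpenMP runtime routines
   implicit none
   print *, 'serial region: one thread'
!$omp parallel
   ! Threads exist only inside this region.
   print *, 'thread', omp_get_thread_num(), 'of', omp_get_num_threads()
!$omp end parallel
   print *, 'serial region again: extra threads are gone'
end program forkjoin

With export OMP_NUM_THREADS=4 set before the run, the parallel region would print four lines.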
31
Distributed Processing
Concept:
1) Divide the problem explicitly
2) CPUs perform tasks concurrently
3) Recombine results
4) All processors may or may not be doing the same thing
(Figure: Branimir Gjetvaj)
32
Distributed Processing
• Data needed by a given CPU must be stored in the memory associated with that CPU
• Performed on distributed- or shared-memory computers
• Multiple copies of code are running
• Messages/data are passed between CPUs
• Multi-level: can be combined with vector and/or OpenMP
33
Distributed Processing using MPI (Fortran)

• Initialization

   call mpi_init(ierror)
   call mpi_comm_size(MPI_COMM_WORLD, npes, ierror)
   call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierror)

• Simple send/receive

   ! Processor 0 sends individual messages to others
   if (my_rank == 0) then
      do dest = 1, npes-1
         call mpi_send(x, max_size, MPI_REAL, dest, 0, comm, ierr)
      end do
   else
      call mpi_recv(x, max_size, MPI_REAL, 0, 0, comm, status, ierr)
   end if
34
Distributed Processing using MPI (C)

• Initialization

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &npes);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

• Simple send/receive

   /* Processor 0 sends individual messages to others */
   if (my_rank == 0) {
      for (dest = 1; dest < npes; dest++) {
         MPI_Send(x, max_size, MPI_FLOAT, dest, 0, comm);
      }
   } else {
      MPI_Recv(x, max_size, MPI_FLOAT, 0, 0, comm, &status);
   }
35
Number of Processes Constant
Number set by Environment
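On the IBMs, for instance, POE fixes the MPI task count at launch; an illustrative interactive run (batch jobs use LoadLeveler's total_tasks keyword instead, as in the script later in this deck):
poe ./my_job -procs 4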
36
Message Passing Activity Example
37
Cluster Programming
• Shared-memory methods between processors on one node:
– OpenMP, threads, or MPI
• Distributed-memory methods between processors on multiple nodes:
– MPI
• Mixed mode (see the sketch below):
– MPI distributes to nodes, OpenMP within node
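A minimal mixed-mode sketch (illustrative, not from the original slides; assumes the mpif.h header, and the loop body is a stand-in for real work). One MPI task per node takes a block of the iteration space, and OpenMP threads on that node share the block:

program mixed
   implicit none
   include 'mpif.h'
   integer :: ierr, my_rank, n
   real    :: partial, total
   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierr)
   ! MPI level: each task (one per node) gets its own block of work.
   partial = 0.0
   ! OpenMP level: threads on the node share this task's loop.
!$omp parallel do reduction(+:partial)
   do n = my_rank*1000 + 1, (my_rank+1)*1000
      partial = partial + real(n)       ! stand-in for real work
   end do
   ! Recombine the per-node results with MPI.
   call mpi_reduce(partial, total, 1, MPI_REAL, MPI_SUM, 0, &
                   MPI_COMM_WORLD, ierr)
   if (my_rank == 0) print *, 'total =', total
   call mpi_finalize(ierr)
end program mixed

On iceberg such a code might be compiled with mpxlf90_r -qsmp=omp (illustrative) and launched with one MPI task per node.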
38
Programming Environments
• Compilers
• File Systems
• Running jobs
– Interactive
– Batch
• See individual machine documentation
– http://www.arsc.edu/support/resources/hardware.html
39
Cray Compilers
• SV1, T3E
– f90, cc, CC
• X1
– ftn, cc, CC
• SX-6 front end (rimegate)
– sxf90, sxc++
• SX-6 (rime)
– f90, cc, c++
• No extra flags for MPI, OpenMP
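For example, the same source builds on each system with its own driver (illustrative command lines; prog.f90 is a made-up file name):
f90 -o prog prog.f90    (SV1, T3E, or rime)
ftn -o prog prog.f90    (X1)
sxf90 -o prog prog.f90  (cross-compiling for rime on rimegate)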
40
IBM Compilers
• Serial
– xlf, xlf90, xlf95, xlc, xlC
• OpenMP
– Add -qsmp=omp, _r extension for thread-safe libraries, e.g. xlf_r
• MPI
– mpxlf, mpxlf90, mpxlf95, mpcc, mpCC
• Might be best to always use the _r extension (mpxlf90_r)
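For example (illustrative command lines; the file names are made up):
xlf90_r -qsmp=omp -o omp_prog omp_prog.f90
mpxlf90_r -o mpi_prog mpi_prog.f90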
41
File Systems
• Local storage
– $HOME
– /tmp or /wrktmp or /wrkdir -> $WRKDIR
– /scratch -> $SCRATCH
• Permanent storage
– $ARCHIVE
• Quotas
– quota -v on Cray
– qcheck on IBM
42
Running a job
• Get files from $ARCHIVE to the system’s disk
• Keep source in $HOME, but run in $WRKDIR
• Use $SCRATCH for local-to-node temporary files; clean up before the job ends
• Put results out to $ARCHIVE
• $WRKDIR is purged
43
Iceflyer Filesystems
• Smallish $HOME
• Larger /wrkdir/username
• $ARCHIVE for long-term storage, especially larger files
• qcheck to check quotas
44
SX6 Filesystems
• Separate from the rest of the ARSC systems
• Rimegate has /home, /scratch
• Rime mounts them as /rimegate/home, /rimegate/scratch
• Rime has its own home, /tmp, /atmp, etc.
45
Interactive
• Works on the command line
• Limits exist on resources (time, # CPUs, memory)
• Good for debugging
• Larger jobs must be submitted to the batch system
46
Batch Schedulers
• Cray: NQS
– Commands:
• qsub, qstat, qdel
• IBM: LoadLeveler
– Commands:
• llclass, llq, llsubmit, llcancel, llmap, xloadl
47
NQS Script (rime)
#@$-q batch        # job queue class
#@$-s /bin/ksh     # which shell
#@$-eo             # stdout and stderr together
#@$-lM 100MW
#@$-lT 30:00       # time requested h:m:s
#@$-c 8            # 8 cpus
#@$                # required last command
# beginning of shell script
cd $QSUB_WORKDIR # cd to submission directory
export F_PROGINF=DETAIL
export OMP_NUM_THREADS=8
./my_job
48
NQS Commands
• qstat to find out job status, list of queues
• qsub to submit job
• qdel to delete job from queue
49
LoadLeveler Script (iceflyer)
#!/bin/ksh
#@ total_tasks = 4
#@ node_usage = shared
#@ wall_clock_limit = 1:00:00
#@ job_type = parallel
#@ output = out.$(jobid)
#@ error = err.$(jobid)
#@ class = large
#@ notification = error
#@ queue
poe ./my_job
50
LoadLeveler Commands
• llclass to find list of classes
• llq to see list of jobs in queue
• llsubmit to submit job
• llcancel to delete job from queue
• llmap is a local program to see the load on the machine
• xloadl is an X11 interface to LoadLeveler
51
Getting Help
• Consultants and Specialists are here to serve YOU
– 907-474-5102
• http://www.arsc.edu/support/support.html
52
Homework
• Make sure you can log into
– iceflyer
– rimegate
– rime
• Ask consultants for help if necessary