November 2018
High Performance Computing @ AUB
American University of Beirut
GradEx Workshop
Mher Kazandjian
How is this talk structured?
• History of computing
• Scientific computing workflows
• Computer architecture overview
• Do's and don'ts
• Demos and walk-throughs
Goals
• Demonstrate how you (as users) can benefit from AUB's HPC facilities
• Attract users, because:
  • we want to boost scientific computing research
  • we want to help you
  • we have capacity
This presentation is based on actual feedback and use cases collected from users over the past year
History of computing
Alan Turing 1912-1954
Growth over time
~12 orders of magnitude since 1960

For the $1,000 you would have spent on computing hardware in 1970, today you can buy hardware that performs ~10^12 times more calculations.

What is HPC used for today?
● Solving scientific problems
● Data mining and deep learning
● Military research and security
● Cloud computing
● Blockchain (cryptocurrency)

What is HPC used for today?
● https://blog.openai.com/ai-and-compute/
Multicore CPUs hit the market in ~2005
Growth over time
Users at home started benefiting from parallelism.
Prior to that, applications that scaled well were restricted to mainframes, datacenters, and HPC clusters.
HPC @ AUB in 2006
8 compute nodes
Specs per node:
 - 4 cores
 - 8 GB RAM
~80 GFlops total
HPC is all about scalability
• The high-speed network is the "most" important component
But what is scalability?
Performance improvement as the number of cores (resources) increases,
for the same problem size - strong ("hard") scalability
But what is scalability?
This is a CPU under a microscope
But what is scalability?
Prog.exe
2 sec
Serial runtime = T_serial
But what is scalability?
Prog.exe
1 sec
Prog.exe
parallel runtime = T_parallel
But what is scalability?
Prog.exe
0.5 sec
Prog.exe Prog.exe Prog.exe
parallel runtime = T_parallel
But what is scalability?
Prog.exe
0.5 sec
Prog.exe Prog.exe Prog.exe
Very nice!!
But this is almost never the case.
First demo – First scalability diagram
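A toy version of such a scalability diagram can be generated from Amdahl's law, which explains why the ideal halving of runtime shown above rarely happens: if a fraction f of a program stays serial, the speedup on p cores is capped at 1/f. A minimal sketch (the 5% serial fraction and 2-second serial runtime are illustrative assumptions, not measurements of any particular code):

```python
# Amdahl's law: with a serial fraction f, the speedup on p cores is
#   S(p) = 1 / (f + (1 - f) / p),  which approaches 1/f as p grows.
def amdahl_speedup(p, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

t_serial = 2.0  # seconds, as in the Prog.exe example above
for p in (1, 2, 4, 8, 16):
    s = amdahl_speedup(p, serial_fraction=0.05)  # assume 5% serial code
    print(f"{p:3d} cores: T_parallel = {t_serial / s:.3f} s  (speedup {s:.2f}x)")
```

Even with only 5% serial code, 16 cores give roughly a 9x speedup rather than 16x, which is why measured scalability diagrams flatten out.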
But what is scalability?
Repeat the same process across multiple processors
Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe
But what is scalability?
Wait!
 - how do these processors talk to each other?
 - how much data needs to be transferred for a certain task?
 - how fast do the processes communicate with each other?
 - how often should the processes communicate with each other?
Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe
At the single chip level
Through the cache memory of the CPU
Typical latency ~ ns (or less)
Typical bandwidth > 150 GB/s
At the single chip level
Through the RAM
Random Access Memory (aka RAM)
Typical latency ~ a few to tens of ns
Typical bandwidth ~ 10 to 50 GB/s (sometimes more)
https://ark.intel.com/#@Processors
Second demo: bandwidth and some lingo
- An array is just a bunch of bytes
- Bandwidth is the speed with which information is transferred
- A float (double precision) is 8 bytes
- An array of one million elements is 1000 x 1000 x 8 bytes = 8 MB
- If I measure the time to initialize this array, I can measure how fast the CPU can access the RAM (since initializing the array implies visiting each memory address and setting it to zero)
- bandwidth = size of array / time to initialize it
Second demo: bandwidth and some lingo
Intel i7-6700HQ - https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3-50-GHz
 - Advertised bandwidth = 34 GB/s
 - Measured bandwidth (single-thread quickie) = 22.8 GB/s
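The recipe above can be sketched in a few lines of Python (the C demo would be closer to the metal; here the zero-fill inside bytearray() runs in native code, so interpreter overhead stays small). The 400 MB buffer size is an arbitrary choice, just large enough to defeat the CPU caches:

```python
import time

# Estimate single-threaded write bandwidth to RAM by timing how long it
# takes to allocate and zero-initialize a large buffer, as described above.
n_bytes = 400 * 1000 * 1000  # 400 MB, far larger than any CPU cache

t0 = time.perf_counter()
buf = bytearray(n_bytes)     # allocates and zero-fills n_bytes of memory
t1 = time.perf_counter()

bandwidth = n_bytes / (t1 - t0)  # bytes per second
print(f"write bandwidth ~ {bandwidth / 1e9:.1f} GB/s")
```

The reported number also includes page-fault overhead for first-touch pages, so expect it to fall short of the advertised figure, much like the 22.8 GB/s vs 34 GB/s gap above.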
At the single motherboard level
Through the RAM
Random Access Memory (aka RAM)
Typical latency ~ a few to tens of ns
Typical bandwidth ~ 10 to 100 GB/s (sometimes more)
Random Access Memory (aka RAM)
Through QPI (QuickPath Interconnect)
 - typical latency for small data ~ ns
 - typical bandwidth ~ 100 GB/s
TIP: server = node = compute node (each socket of a multi-socket node is a NUMA node)
QPI
Second demo: bandwidth multi-threaded
 - https://github.com/jeffhammond/STREAM
 - https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GTs-Intel-QPI
 - 2 x sockets, expected bandwidth ~102 GB/s
 - measured ~75 GB/s
 - on a completely idle node ~95 GB/s is possible
Another benchmark
2-socket Intel Xeon server
https://github.com/jeffhammond/STREAM
At the cluster level (multiple nodes)
Through the network (ethernet)
Typical latency ~ 10 to 100 microseconds
Typical bandwidth ~ 100 MB/s to a few hundred MB/s
At the cluster level (multiple nodes)
Through the network (InfiniBand - high-speed network)
Typical latency ~ a few microseconds down to < 1 microsecond
Typical bandwidth > 3 GB/s
Benefits over ethernet:
 - Remote Direct Memory Access (RDMA)
 - higher bandwidth
 - much lower latency
https://en.wikipedia.org/wiki/InfiniBand
What hardware do we have at AUB?
- Arza:
  - 256 cores, 1 TB RAM, IBM cluster
  - production simulations, benchmarking
  - http://website.aub.edu.lb/it/hpc/Pages/home.aspx
- vLabs
  - see Vassili's slide
  - very flexible, easy to manage, Windows support
- public cloud
  - infinite resources - limited by $$$
  - two pilot projects being tested - will be open soon for testing
http://website.aub.edu.lb/it/hpc/Pages/home.aspx
Parallelization libraries / software
SMP parallelism
 - OpenMP
 - CUDA
 - Matlab
 - Spark (recently deployed and tested)
Distributed parallelism (cluster wide)
 - MPI
 - Spark
 - MPI + OpenMP (hybrid)
 - MPI + CUDA
 - MPI + CUDA + OpenMP
 - Spark + CUDA (not tested - any volunteers?)
Linux/Unix culture
> 99% of HPC clusters worldwide use some kind of Linux / Unix
- Clicking your way through a software install is easy for you (on Windows or Mac), but a nightmare for power users.
- Linux is:
  - open source
  - free
  - secure (at least much more secure than Windows et al.)
    - no need for an antivirus that slows down your system
  - respectful of your privacy
  - backed by huge community support in scientific computing
  - 99.8% of all HPC systems worldwide since 1996 are non-Windows machines
    https://github.com/mherkazandjian/top500parser
Software stack on the HPC cluster
- Matlab
- C, C++, Java, Fortran
- Python 2 and Python 3
  - jupyter notebooks
- Tensorflow (deep learning)
- Scala
- Spark
- R
  - R studio, R server (new)
Cluster usage: Demo
- The scheduler: resource manager
  - bjobs
  - bqueues
  - bhosts
  - lsload
- Important places
  - /gpfs1/my_username
  - /gpfs1/apps/sw
- Basic linux knowledge
- Sample job script
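The scheduler commands listed above (bjobs, bqueues, bhosts, lsload) belong to IBM's LSF batch system, so a minimal job script would use #BSUB directives along the following lines. This is a sketch, not the cluster's official template: the queue name, core count, and program name are placeholders - take the real values from the user guide.

```bash
#!/bin/bash
#BSUB -J my_first_job            # job name
#BSUB -n 4                       # number of cores requested
#BSUB -o my_first_job.%J.out     # stdout file (%J expands to the job id)
#BSUB -e my_first_job.%J.err     # stderr file
#BSUB -q normal                  # queue name (placeholder - see bqueues)

# run from your space on the shared filesystem
cd /gpfs1/my_username
./my_prog.exe --param 1
```

Submit it with `bsub < job_script.sh` and watch it with `bjobs`.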
Cluster usage: Documentation
https://hpc-aub-users-guide.readthedocs.io/en/latest/ https://github.com/hpcaubuserguide/hpcaub_userguide
The guide is for you
 - we want you to contribute to it directly
 - please send us pull requests
Cluster usage: Job scripts
https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html
In the user guide, there are samples and templates for many use cases:
 - we will help you write your own if your use case is not covered
 - this is 90% of the getting-started task
 - recent success story: a Spark server job template
How to benefit from the HPC hardware?
- Run many serial jobs that do not need to communicate
  - aka embarrassingly parallel jobs (nothing embarrassing about it though, as long as you get your job done)
  - e.g.
    - train several neural networks with different numbers of layers
    - do a parameter sweep for a certain model

      ./my_prog.exe --param 1 &
      ./my_prog.exe --param 2 &
      ./my_prog.exe --param 3 &

    These would execute simultaneously.
- Difficulty: very easy
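The same parameter sweep can be driven from Python's standard library; a sketch where run_model is a hypothetical stand-in for your real program:

```python
from concurrent.futures import ProcessPoolExecutor

def run_model(param):
    # stand-in for e.g. "./my_prog.exe --param <param>": some CPU-heavy
    # computation that depends only on its own parameter
    return param, param * sum(range(100_000))

if __name__ == "__main__":
    # each parameter runs in its own process; the runs never communicate,
    # which is exactly what makes the job embarrassingly parallel
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(run_model, [1, 2, 3]))
    print(results)
```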
How to benefit from the HPC hardware?
- Run many serial jobs that do not need to communicate
- Demo
How to benefit from the HPC hardware?
- Run an SMP parallel program (i.e. on one node, using threads)
- e.g.
  - matlab
  - C/C++/python/Java
- Difficulty: very easy to medium (problem dependent)
How to benefit from the HPC hardware?
- Run an SMP parallel program (i.e. on one node, using threads)
- C
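In C this slide's example would typically be an OpenMP parallel-for; the same one-node, many-threads idea can be sketched in Python, with the caveat that the GIL serializes pure-Python bytecode, so the worker must spend its time in calls that release the GIL (hashlib does, for large buffers):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# eight 10 MB buffers to hash; hashing is the stand-in "parallel loop body"
chunks = [bytes([i]) * 10_000_000 for i in range(8)]

def digest(chunk):
    # hashlib releases the GIL while hashing large buffers, so several
    # of these calls really do run at the same time on different cores
    return hashlib.sha256(chunk).hexdigest()

# all threads live in one process on one node - this is SMP parallelism
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest, chunks))

print(len(digests), "digests computed")
```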
How to benefit from the HPC hardware?
- Run an SMP parallel program (i.e. on one node, using threads)
- Demo: matlab parfor
How to benefit from the HPC hardware?
- Run a hybrid MPI + OpenMP parallel job
- Demo: Gauß (astrophysics N-Body code) scalability diagram - single node
[diagram: one MPI process with several OpenMP threads underneath it]
How to benefit from the HPC hardware?
- Run a deep learning job
- Demo: Tensorflow
How to benefit from the HPC hardware?
- jupyter notebooks (connect through a web interface)
- R-server (connect through a web interface)
- Spark (full cluster configuration - up to 1 TB RAM usage)
- Map-Reduce
How to benefit from the HPC hardware?
- Optimal performance benefits
- Go low level
  - C/Fortran/C++
  - need a good design
  - need a good understanding of the architecture
- Currently the only customers running such codes:
  - Chemistry department research group
  - Physics department
  - a Computer Science research group is getting involved too
Workflows: best practices
- Prototype on your own machine: laptop, terminal/workstation at your department
- When you think your job could benefit from HPC resources:
  - talk to us (we can help you assess your program better)
  - prepare a clean prototype
- We will provide you with pilot-project access
- Tune / parallelize your application [we can help you with that if needed]
- Run production jobs
- If you need specific hardware that is not available on campus:
  - go to the cloud
  - ideal for testing / benchmarking your code/app on the latest and greatest hardware
Containers
- Run on top of the host kernel
- Zero overhead
- Portable
- Currently used on the cluster:
  + deep learning containers
  + R studio server (new)
- We can help you build a container customized to your problem
  + you can create your own container too (no admin rights needed)
- Pros:
  + reproducibility and portability
- Cons:
  + you must be a bit of a geek to set up a container, and willing to put in the effort (lots of help available online though)
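The slides do not name the container runtime in use; assuming a Singularity/Apptainer-style setup (which matches the "no admin rights" remark for running containers), a minimal recipe for a custom container might look like the sketch below. The base image and packages are illustrative only:

```
Bootstrap: docker
From: ubuntu:18.04

%post
    # runs once at build time, inside the container
    apt-get update && apt-get install -y python3 python3-pip
    pip3 install numpy

%runscript
    # what "singularity run my_container.sif" executes
    exec python3 "$@"
```

Built with `singularity build my_container.sif recipe.def`; building may need root or `--fakeroot`, but running the resulting image does not need admin rights.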
Thank you for attending!