November 2018
High Performance Computing @ AUB
American University of Beirut
GradEx Workshop
Mher Kazandjian
How is this talk structured?
• History of computing
• Scientific computing workflows
• Computer architecture overview
• Do's and don'ts
• Demos and walk-throughs
Goals
• Demonstrate how you (as users) can benefit from AUB's HPC facilities
• Attract users, because:
  • we want to boost scientific computing research
  • we want to help you
  • we have capacity
This presentation is based on actual feedback and use cases collected from users over the past year
History of computing
Alan Turing 1912-1954
Growth over time
~12 orders of magnitude since 1960

For the $1,000 you would have spent on computing hardware in 1970, today you can buy hardware that performs ~10^12 times more calculations.

What is HPC used for today?
● Solving scientific problems
● Data mining and deep learning
● Military research and security
● Cloud computing
● Blockchain (cryptocurrency)

What is HPC used for today?
● https://blog.openai.com/ai-and-compute/
Multicore CPUs hit the market in ~2005
Growth over time
Users at home started benefiting from parallelism.
Prior to that, applications that scaled well were restricted to mainframes, datacenters, and HPC clusters.
HPC @ AUB in 2006
8 compute nodes
Specs per node:
 - 4 cores
 - 8 GB RAM
~80 GFlops total
HPC is all about scalability
• The high-speed network is the "most" important component
But what is scalability?
Performance improvement as the number of cores (resources) increases,
for the same problem size - strong ("hard") scalability
But what is scalability?
This is a CPU under a microscope
But what is scalability?
Prog.exe
2 sec
Serial runtime = T_serial
But what is scalability?
Prog.exe
1 sec
Prog.exe
parallel runtime = T_parallel
But what is scalability?
Prog.exe
0.5 sec
Prog.exe Prog.exe Prog.exe
parallel runtime = T_parallel
But what is scalability?
Prog.exe
0.5 sec
Prog.exe Prog.exe Prog.exe
Very nice!!
But this is almost never the case.
First demo – First scalability diagram
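A toy version of such a scalability diagram can be generated from Amdahl's law, which explains why the ideal halving of runtime shown above rarely happens: if a fraction f of a program stays serial, the speedup on p cores is capped at 1/f. A minimal sketch (the 5% serial fraction and 2-second serial runtime are illustrative assumptions, not measurements of any particular code):

```python
# Amdahl's law: with a serial fraction f, the speedup on p cores is
#   S(p) = 1 / (f + (1 - f) / p),  which approaches 1/f as p grows.
def amdahl_speedup(p, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

t_serial = 2.0  # seconds, as in the Prog.exe example above
for p in (1, 2, 4, 8, 16):
    s = amdahl_speedup(p, serial_fraction=0.05)  # assume 5% serial code
    print(f"{p:3d} cores: T_parallel = {t_serial / s:.3f} s  (speedup {s:.2f}x)")
```

Even with only 5% serial code, 16 cores give roughly a 9x speedup rather than 16x, which is why measured scalability diagrams flatten out.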
But what is scalability?
Repeat the same process across multiple processors
Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe
But what is scalability?
Wait!
 - how do these processors talk to each other?
 - how much data needs to be transferred for a certain task?
 - how fast do the processes communicate with each other?
 - how often should the processes communicate with each other?
Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe Prog.exe
At the single chip level
Through the cache memory of the CPU
Typical latency ~ ns (or less)
Typical bandwidth > 150 GB/s
At the single chip level
Through the RAM
Random Access Memory (aka RAM)
Typical latency ~ a few to tens of ns
Typical bandwidth ~ 10 to 50 GB/s (sometimes more)
https://ark.intel.com/#@Processors
Second demo: bandwidth and some lingo
- An array is just a bunch of bytes
- Bandwidth is the speed with which information is transferred
- A float (double precision) is 8 bytes
- An array of one million elements is 1000 x 1000 x 8 bytes = 8 MB
- If I measure the time to initialize this array, I can measure how fast the CPU can access the RAM (since initializing the array implies visiting each memory address and setting it to zero)
- bandwidth = size of array / time to initialize it
Second demo: bandwidth and some lingo
Intel i7-6700HQ - https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3-50-GHz
 - Advertised bandwidth = 34 GB/s
 - Measured bandwidth (single-thread quickie) = 22.8 GB/s
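The recipe above can be sketched in a few lines of Python (the C demo would be closer to the metal; here the zero-fill inside bytearray() runs in native code, so interpreter overhead stays small). The 400 MB buffer size is an arbitrary choice, just large enough to defeat the CPU caches:

```python
import time

# Estimate single-threaded write bandwidth to RAM by timing how long it
# takes to allocate and zero-initialize a large buffer, as described above.
n_bytes = 400 * 1000 * 1000  # 400 MB, far larger than any CPU cache

t0 = time.perf_counter()
buf = bytearray(n_bytes)     # allocates and zero-fills n_bytes of memory
t1 = time.perf_counter()

bandwidth = n_bytes / (t1 - t0)  # bytes per second
print(f"write bandwidth ~ {bandwidth / 1e9:.1f} GB/s")
```

The reported number also includes page-fault overhead for first-touch pages, so expect it to fall short of the advertised figure, much like the 22.8 GB/s vs 34 GB/s gap above.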
At the single motherboard level
Through the RAM
Random Access Memory (aka RAM)
Typical latency ~ a few to tens of ns
Typical bandwidth ~ 10 to 100 GB/s (sometimes more)
Random Access Memory (aka RAM)
Through QPI (QuickPath Interconnect)
 - typical latency for small data ~ ns
 - typical bandwidth ~ 100 GB/s
TIP: server = node = compute node (each socket of a multi-socket node is a NUMA node)
QPI
Second demo: bandwidth multi-threaded
 - https://github.com/jeffhammond/STREAM
 - https://ark.intel.com/products/64597/Intel-Xeon-Processor-E5-2665-20M-Cache-2_40-GHz-8_00-GTs-Intel-QPI
 - 2 x sockets, expected bandwidth ~102 GB/s
 - measured ~75 GB/s
 - on a completely idle node ~95 GB/s is possible
Another benchmark
2-socket Intel Xeon server
https://github.com/jeffhammond/STREAM
At the cluster level (multiple nodes)
Through the network (ethernet)
Typical latency ~ 10 to 100 microseconds
Typical bandwidth ~ 100 MB/s to a few hundred MB/s
At the cluster level (multiple nodes)
Through the network (InfiniBand - high-speed network)
Typical latency ~ a few microseconds down to < 1 microsecond
Typical bandwidth > 3 GB/s
Benefits over ethernet:
 - Remote Direct Memory Access (RDMA)
 - higher bandwidth
 - much lower latency
https://en.wikipedia.org/wiki/InfiniBand
What hardware do we have at AUB?
- Arza:
  - 256 cores, 1 TB RAM, IBM cluster
  - production simulations, benchmarking
  - http://website.aub.edu.lb/it/hpc/Pages/home.aspx
- vLabs
  - see Vassili's slide
  - very flexible, easy to manage, Windows support
- public cloud
  - infinite resources - limited by $$$
  - two pilot projects being tested - will be open soon for testing
http://website.aub.edu.lb/it/hpc/Pages/home.aspx
Parallelization libraries / software
SMP parallelism
 - OpenMP
 - CUDA
 - Matlab
 - Spark (recently deployed and tested)
Distributed parallelism (cluster wide)
 - MPI
 - Spark
 - MPI + OpenMP (hybrid)
 - MPI + CUDA
 - MPI + CUDA + OpenMP
 - Spark + CUDA (not tested - any volunteers?)
Linux/Unix culture
> 99% of HPC clusters worldwide use some kind of Linux / Unix
- Clicking your way through a software install is easy for you (on Windows or Mac), but a nightmare for power users.
- Linux is:
  - open source
  - free
  - secure (at least much more secure than Windows et al.)
    - no need for an antivirus that slows down your system
  - respectful of your privacy
  - backed by huge community support in scientific computing
  - 99.8% of all HPC systems worldwide since 1996 are non-Windows machines
    https://github.com/mherkazandjian/top500parser
Software stack on the HPC cluster
- Matlab
- C, C++, Java, Fortran
- Python 2 and Python 3
  - jupyter notebooks
- Tensorflow (deep learning)
- Scala
- Spark
- R
  - R studio, R server (new)
Cluster usage: Demo
- The scheduler: resource manager
  - bjobs
  - bqueues
  - bhosts
  - lsload
- Important places
  - /gpfs1/my_username
  - /gpfs1/apps/sw
- Basic linux knowledge
- Sample job script
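The scheduler commands listed above (bjobs, bqueues, bhosts, lsload) belong to IBM's LSF batch system, so a minimal job script would use #BSUB directives along the following lines. This is a sketch, not the cluster's official template: the queue name, core count, and program name are placeholders - take the real values from the user guide.

```bash
#!/bin/bash
#BSUB -J my_first_job            # job name
#BSUB -n 4                       # number of cores requested
#BSUB -o my_first_job.%J.out     # stdout file (%J expands to the job id)
#BSUB -e my_first_job.%J.err     # stderr file
#BSUB -q normal                  # queue name (placeholder - see bqueues)

# run from your space on the shared filesystem
cd /gpfs1/my_username
./my_prog.exe --param 1
```

Submit it with `bsub < job_script.sh` and watch it with `bjobs`.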
Cluster usage: Documentation
https://hpc-aub-users-guide.readthedocs.io/en/latest/ https://github.com/hpcaubuserguide/hpcaub_userguide
The guide is for you
 - we want you to contribute to it directly
 - please send us pull requests
Cluster usage: Job scripts
https://hpc-aub-users-guide.readthedocs.io/en/latest/jobs.html
In the user guide, there are samples and templates for many use cases:
 - we will help you write your own if your use case is not covered
 - this is 90% of the getting-started task
 - recent success story: a Spark server job template
How to benefit from the HPC hardware?
- Run many serial jobs that do not need to communicate
  - aka embarrassingly parallel jobs (nothing embarrassing about it though, as long as you get your job done)
  - e.g.
    - train several neural networks with different numbers of layers
    - do a parameter sweep for a certain model

      ./my_prog.exe --param 1 &
      ./my_prog.exe --param 2 &
      ./my_prog.exe --param 3 &

    These would execute simultaneously.
- Difficulty: very easy
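The same parameter sweep can be driven from Python's standard library; a sketch where run_model is a hypothetical stand-in for your real program:

```python
from concurrent.futures import ProcessPoolExecutor

def run_model(param):
    # stand-in for e.g. "./my_prog.exe --param <param>": some CPU-heavy
    # computation that depends only on its own parameter
    return param, param * sum(range(100_000))

if __name__ == "__main__":
    # each parameter runs in its own process; the runs never communicate,
    # which is exactly what makes the job embarrassingly parallel
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(run_model, [1, 2, 3]))
    print(results)
```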
How to benefit from the HPC hardware?
- Run many serial jobs that do not need to communicate
- Demo
How to benefit from the HPC hardware?
- Run an SMP parallel program (i.e. on one node, using threads)
- e.g.
  - matlab
  - C/C++/python/Java
- Difficulty: very easy to medium (problem dependent)
How to benefit from the HPC hardware?
- Run an SMP parallel program (i.e. on one node, using threads)
- C
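In C this slide's example would typically be an OpenMP parallel-for; the same one-node, many-threads idea can be sketched in Python, with the caveat that the GIL serializes pure-Python bytecode, so the worker must spend its time in calls that release the GIL (hashlib does, for large buffers):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# eight 10 MB buffers to hash; hashing is the stand-in "parallel loop body"
chunks = [bytes([i]) * 10_000_000 for i in range(8)]

def digest(chunk):
    # hashlib releases the GIL while hashing large buffers, so several
    # of these calls really do run at the same time on different cores
    return hashlib.sha256(chunk).hexdigest()

# all threads live in one process on one node - this is SMP parallelism
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest, chunks))

print(len(digests), "digests computed")
```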
How to benefit from the HPC hardware?
- Run an SMP parallel program (i.e. on one node, using threads)
- Demo: matlab parfor
How to benefit from the HPC hardware?
- Run a hybrid MPI + OpenMP parallel job
- Demo: Gauß (astrophysics N-Body code) scalability diagram - single node
[diagram: one MPI process with several OpenMP threads underneath it]
How to benefit from the HPC hardware?
- Run a deep learning job
- Demo: Tensorflow
How to benefit from the HPC hardware?
- jupyter notebooks (connect through a web interface)
- R-server (connect through a web interface)
- Spark (full cluster configuration - up to 1 TB RAM usage)
- Map-Reduce
How to benefit from the HPC hardware?
- Optimal performance benefits
- Go low level
  - C/Fortran/C++
  - need a good design
  - need a good understanding of the architecture
- Currently the only customers running such codes:
  - Chemistry department research group
  - Physics department
  - a Computer Science research group is getting involved too
Workflows: best practices
- Prototype on your own machine: laptop, terminal/workstation at your department
- When you think your job could benefit from HPC resources:
  - talk to us (we can help you assess your program better)
  - prepare a clean prototype
- We will provide you with pilot-project access
- Tune / parallelize your application [we can help you with that if needed]
- Run production jobs
- If you need specific hardware that is not available on campus:
  - go to the cloud
  - ideal for testing / benchmarking your code/app on the latest and greatest hardware
Containers
- Run on top of the host kernel
- Zero overhead
- Portable
- Currently used on the cluster:
  + deep learning containers
  + R studio server (new)
- We can help you build a container customized to your problem
  + you can create your own container too (no admin rights needed)
- Pros:
  + reproducibility and portability
- Cons:
  + you must be a bit of a geek to set up a container, and willing to put in the effort (lots of help available online though)
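The slides do not name the container runtime in use; assuming a Singularity/Apptainer-style setup (which matches the "no admin rights" remark for running containers), a minimal recipe for a custom container might look like the sketch below. The base image and packages are illustrative only:

```
Bootstrap: docker
From: ubuntu:18.04

%post
    # runs once at build time, inside the container
    apt-get update && apt-get install -y python3 python3-pip
    pip3 install numpy

%runscript
    # what "singularity run my_container.sif" executes
    exec python3 "$@"
```

Built with `singularity build my_container.sif recipe.def`; building may need root or `--fakeroot`, but running the resulting image does not need admin rights.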
Thank you for attending!