Shifter: Containers in HPC environments
HPC Advisory Council Switzerland
Miguel Gila, CSCS
March 21, 2016
Docker
What is Docker and how does it work?
§ “Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries – anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in.” [*]
[Figure: Virtual Machines vs. Containers]
[*] https://www.docker.com/what-docker
What is Docker and how does it work?
§ A Docker image is a file that contains a filesystem. It can be a full filesystem (e.g. CentOS) or a partial one (e.g. python3.4-slim). Images are stored either on a public registry (Docker Hub) or on a private one
§ Docker leverages the namespaces feature of the kernel to isolate processes
§ In its simplest form, Docker basically
1. Pulls an image to the local system
2. Creates a chrooted environment with the image (=container)
3. Runs our application in the container ('isolated' from the host thanks to kernel namespaces)
§ However, it can also do other things:
  § Isolate the network by creating NAT or bridge devices
  § Be used through a nice GUI
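
The namespace isolation mentioned above can be observed from any Linux process. The following is a minimal sketch, not part of Docker: it assumes a Linux host with /proc mounted, and the helper names current_namespaces and shares_namespace are illustrative. Each entry in /proc/self/ns is a symbolic link whose target (for example pid:[4026531836]) identifies a namespace; two processes are in the same namespace when the targets match.

```python
import os

NS_DIR = "/proc/self/ns"  # Linux-specific; assumed to exist on the host


def current_namespaces(ns_dir=NS_DIR):
    """Return {namespace_type: identifier} for the calling process.

    Returns an empty dict when ns_dir is missing (e.g. on non-Linux hosts).
    """
    result = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            # The link target looks like "pid:[4026531836]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may be unreadable without privileges; skip them.
            continue
    return result


def shares_namespace(a, b, ns_type):
    """True when both snapshots report the same identifier for ns_type."""
    return ns_type in a and ns_type in b and a[ns_type] == b[ns_type]


if __name__ == "__main__":
    for ns_type, ident in current_namespaces().items():
        print(ns_type, ident)
```

Running the script once on the host and once inside a container, then comparing the two listings, shows which namespaces the container does not share with the host.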
Docker in HPC environments
§ Docker is a nice tool, but it's not built for HPC environments, because:
  § It does not integrate well with workload managers
  § It does not isolate users on shared filesystems
  § It requires running a daemon on all nodes
  § It is not designed to run on diskless clients
  § Its network is, by default, 'NATed'
  § Building Docker is done within a Docker container. It can be done outside, but is a complex task (Go language, seriously??)
§ But after all, a sysadmin can make anything work on a cluster, right?
  § We can create (and hopefully maintain) monstrous wrappers to run Docker containers…
Shifter
What is Shifter and how does it work?
§ Shifter is a container-based solution designed from the ground up for HPC environments
§ It leverages the current Docker environment and can use Docker images to create containers
§ Shifter basically
1. Pulls an image to a shared location (/scratch)
2. Creates a loop device with the image (=container)
3. Creates a chrooted environment on the loop device
4. Runs our application in the chrooted environment
§ Designed to work on HPC clusters and, particularly, on Cray systems
§ It is possible to choose which filesystems to expose to the containers
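
Steps 2 to 4 above can be sketched as a dry-run plan. This is an illustration under stated assumptions, not udiRoot's implementation: it only builds the argument lists of standard Linux commands (mount with a loop device, chroot, umount) and executes nothing. The function name and all paths are hypothetical.

```python
import shlex


def shifter_like_plan(image_file, mount_point, command):
    """Build (but do not run) the commands for steps 2 to 4.

    image_file  -- image already pulled to a shared location (step 1)
    mount_point -- hypothetical directory on the compute node
    command     -- the user's application as a list of arguments
    """
    if not command:
        raise ValueError("command must not be empty")
    return [
        # Step 2: attach the image file to a loop device, read-only.
        ["mount", "-o", "loop,ro", image_file, mount_point],
        # Steps 3 and 4: run the application chrooted into the mounted image.
        ["chroot", mount_point] + list(command),
        # Clean up once the application has finished.
        ["umount", mount_point],
    ]


if __name__ == "__main__":
    plan = shifter_like_plan(
        "/scratch/images/centos-6.7.squashfs",  # hypothetical path
        "/mnt/container",                       # hypothetical mount point
        ["cat", "/etc/redhat-release"],
    )
    for argv in plan:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Actually executing such a plan requires root privileges, which is one reason a dedicated, audited tool performs these steps on behalf of the user.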
Architecture of Shifter
§ Shifter consists of two parts:
  § udiRoot is responsible for creating the loop devices, doing the chroot and cleaning up after the binary execution is done. Workload manager plugins are available. Written in C
  § imageGateway is responsible for fetching a Docker image, converting it to a squashfs file and transferring it to a shared location on a filesystem. It also keeps track of the images and tells udiRoot which ones are available. Written in Python
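
The bookkeeping role of imageGateway can be pictured with a small in-memory catalog. Everything here is hypothetical (class and method names, the PULLING state, the file path) and is not taken from the Shifter code base. The point is the division of labour: the gateway records an image and marks it ready after conversion, and the node-side component only receives a path for images that are ready.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    tag: str        # e.g. "centos:6.7"
    status: str     # "PULLING" (hypothetical state) or "READY"
    path: str = ""  # location of the converted image on shared storage


class ImageCatalog:
    """Toy stand-in for the gateway's image database."""

    def __init__(self):
        self._records = {}

    def start_pull(self, tag):
        """Register an image whose download and conversion are in progress."""
        self._records[tag] = ImageRecord(tag, "PULLING")

    def mark_ready(self, tag, path):
        """Record where the converted image was stored and mark it usable."""
        record = self._records[tag]
        record.status = "READY"
        record.path = path

    def lookup(self, tag):
        """Return the image path if the image is ready, else None."""
        record = self._records.get(tag)
        if record is None or record.status != "READY":
            return None
        return record.path
```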
[Figure: Shifter architecture. Labels: external node, compute node, scratch, udiRoot, imageGateway, query (RESTful API), db, Docker Hub or private registry]
Workflow
1. User creates an image on his/her computer and pushes it to Docker Hub
docker build (step 1):

$ docker build .
Uploading context 10240 bytes
Step 1 : FROM busybox
Pulling repository busybox
 ---> e9aa60c60128MB/2.284 MB (100%) endpoint: https://cdn-registry-1.docker.io/v1/
Step 2 : RUN ls -lh /
 ---> Running in 9c9e81692ae9
total 24
drwxr-xr-x 2 root root 4.0K Mar 12 2013 bin
drwxr-xr-x 5 root root 4.0K Oct 19 00:19 dev
drwxr-xr-x 2 root root 4.0K Oct 19 00:19 etc
drwxr-xr-x 2 root root 4.0K Nov 15 23:34 lib
lrwxrwxrwx 1 root root 3 Mar 12 2013 lib64 -> lib
dr-xr-xr-x 116 root root 0 Nov 15 23:34 proc
lrwxrwxrwx 1 root root 3 Mar 12 2013 sbin -> bin
dr-xr-xr-x 13 root root 0 Nov 15 23:34 sys
drwxr-xr-x 2 root root 4.0K Mar 12 2013 tmp
drwxr-xr-x 2 root root 4.0K Nov 15 23:34 usr
 ---> b35f4035db3f
Step 3 : CMD echo Hello world
 ---> Running in 02071fceb21b
 ---> f52f38b7823e
Successfully built f52f38b7823e
Removing intermediate container 9c9e81692ae9
Removing intermediate container 02071fceb21b

docker push (step 2), to Docker Hub or a private registry:

$ docker push miguelgila/wlcg_wn:20161212
2. User tells imageGateway to pull his/her image and make it available
$ ssh santis01
$ module load shifter
$ shifterimg pull docker:ubuntu:14.04
$ sleep 300 # :-)
$ shifterimg images
santis docker READY 2934337e50 2015-12-10T09:03:38 python:2.7-slim
santis docker READY a4f04f71be 2015-12-10T09:47:17 python:3.5-slim
santis docker READY b0197931ac 2015-12-10T09:53:54 python:3.2-slim
santis docker READY 0bb6b50d84 2015-12-10T10:34:24 python:3.3-slim
santis docker READY cd86406b32 2015-12-10T10:36:55 python:3.5.1
santis docker READY 48bc52cc28 2015-12-10T10:47:35 python:3.4.3
santis docker READY 0b92f1735f 2015-12-10T11:17:26 python:3.4-slim
santis docker READY 3fba104814 2015-12-10T11:31:23 centos:6.7
santis docker READY 12c9d795d8 2015-12-10T11:37:35 centos:6.6
santis docker READY 3e0c71ada2 2015-12-10T15:19:24 ubuntu:15.10
santis docker READY e751d10964 2015-12-10T13:38:05 miguelgila/wlcg_wn:20151125v2
santis docker READY d55e68e6cc 2015-12-10T15:52:09 ubuntu:14.04
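
Instead of sleeping for a fixed time, a script could check whether the image has reached READY. The parser below is a sketch that assumes the six-column layout of the shifterimg images listing above (system, type, status, identifier, timestamp, tag); the function names are illustrative, and no shifterimg command is invoked here.

```python
def parse_images(listing):
    """Parse `shifterimg images` style output into a list of dicts."""
    fields = ("system", "type", "status", "ident", "timestamp", "tag")
    images = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != len(fields):
            continue  # skip blank or unexpected lines
        images.append(dict(zip(fields, parts)))
    return images


def is_ready(listing, tag):
    """True when an image with this tag is listed with status READY."""
    return any(
        img["tag"] == tag and img["status"] == "READY"
        for img in parse_images(listing)
    )


# Two lines copied from the listing above.
SAMPLE = """\
santis docker READY 3fba104814 2015-12-10T11:31:23 centos:6.7
santis docker READY d55e68e6cc 2015-12-10T15:52:09 ubuntu:14.04
"""

if __name__ == "__main__":
    print(is_ready(SAMPLE, "ubuntu:14.04"))
```

In a real polling loop the listing would come from running shifterimg images (for instance through subprocess.run), with a short sleep between attempts and an upper bound on the number of tries.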
[Figure: shifterimg pull. imageGateway takes the Docker image from Docker Hub (or a private registry) and places the resulting Shifter image in /scratch/santis, where compute nodes can reach it]
3. User submits a job with sbatch and prepends shifter to his/her executable
#!/bin/bash -l
#SBATCH --job-name="shifter_osversion"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --exclusive
#SBATCH --output=/users/miguelgi/jobs/out/shifter_hostname.stdout.log.%j
#SBATCH --error=/users/miguelgi/jobs/out/shifter_hostname.stdout.log.%j
#SBATCH --image=docker:miguelgila/wlcg_wn:20151218
#SBATCH --imagevolume=/users/miguelgi:/users/miguelgi
#SBATCH --imagevolume=/apps:/apps
#SBATCH --imagevolume=/scratch/santis:/scratch/santis

module load slurm
module load shifter/15.12.0

srun shifter --volume=/scratch/santis:/scratch/santis --volume=/users:/users cat /etc/redhat-release
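
A job script like the one above can also be generated programmatically. This sketch uses only directives that appear in the script above (--image and --imagevolume, next to standard Slurm options); the function name and its parameters are illustrative.

```python
def make_job_script(job_name, image, volumes, command, nodes=1, time="00:30:00"):
    """Return the text of a batch script that runs `command` under shifter.

    volumes is a list of (host_path, container_path) pairs.
    """
    lines = [
        "#!/bin/bash -l",
        '#SBATCH --job-name="{}"'.format(job_name),
        "#SBATCH --nodes={}".format(nodes),
        "#SBATCH --time={}".format(time),
        "#SBATCH --image={}".format(image),
    ]
    for host_path, container_path in volumes:
        lines.append(
            "#SBATCH --imagevolume={}:{}".format(host_path, container_path)
        )
    lines.append("")
    lines.append("module load shifter")
    # Prepending `shifter` makes srun start the command inside the container.
    lines.append("srun shifter {}".format(command))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_job_script(
        "shifter_osversion",
        "docker:centos:6.7",
        [("/users", "/users")],
        "cat /etc/redhat-release",
    ))
```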
[Figure: sbatch. SRUN/SPANK calls udiRoot, which creates a loop device (/dev/loopY) from the Shifter image in /scratch/santis; the shifter binary then runs the executable in the loop device. /users is also shown]
Use cases
Full OS containers
§ Cray compute nodes run CLE (a version of SLES 11)
§ With Shifter it is possible to run applications built for a specific OS:
CentOS 6 – non-interactive session

[miguelgi@santis01]-[11:34:45]-[~/examples]:-($ salloc -t 00:15:00 -N1 --image=docker:centos:6.7
salloc: Granted job allocation 16313
salloc: Waiting for resource configuration
salloc: Nodes nid00012 are ready for job
[miguelgi@santis01]-[11:34:50]-[~/examples]:-)$ srun shifter cat /etc/redhat-release
CentOS release 6.7 (Final)
[miguelgi@santis01]-[11:35:03]-[~/examples]:-)$ srun shifter yum --version |head -4
3.2.29
  Installed: rpm-4.8.0-47.el6.x86_64 at 2015-08-19 18:25
  Built    : CentOS BuildSystem <http://bugs.centos.org> at 2015-07-24 11:28
  Committed: Lubos Kardos <[email protected]> at 2015-06-15
Debian 7.9 – interactive session

[miguelgi@santis01]-[10:52:40]-[~/examples]:-)$ salloc -t 00:15:00 -N1 --image=docker:debian:7.9
salloc: Granted job allocation 16294
salloc: Waiting for resource configuration
salloc: Nodes nid00012 are ready for job
[miguelgi@santis01]-[10:58:29]-[~/examples]:-($ srun --pty shifter /bin/bash
[miguelgi@nid00012]-[10:58:31]-[~/examples]:-)$ cat /etc/debian_version
7.9
[miguelgi@nid00012]-[10:58:33]-[~/examples]:-)$ uname -a
Linux nid00012 3.0.101-0.46.1_1.0502.8871-cray_ari_c #1 SMP Tue Aug 25 21:41:26 UTC 2015 x86_64 GNU/Linux
[miguelgi@nid00012]-[10:59:46]-[~/examples]:-)$ apt-get --version |head -n1
apt 0.9.7.9 for amd64 compiled on Oct 17 2014 09:15:56
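
The Debian session shows a Debian 7.9 userland on top of the Cray kernel reported by uname: a container brings its own filesystem but shares the host kernel. A minimal way to observe both from Python, assuming a Linux system (the helper name, the list of release files and the "unknown" fallback are illustrative):

```python
import os
import platform

DEFAULT_RELEASE_FILES = (
    "/etc/os-release",
    "/etc/debian_version",
    "/etc/redhat-release",
)


def userland_and_kernel(release_files=DEFAULT_RELEASE_FILES):
    """Return (userland_description, kernel_release).

    The userland string comes from the first release file found in the
    (container's) filesystem; the kernel release comes from the running
    kernel, which a container shares with its host.
    """
    userland = "unknown"
    for path in release_files:
        if os.path.isfile(path):
            with open(path, errors="replace") as handle:
                first_line = handle.readline().strip()
            userland = "{}: {}".format(path, first_line)
            break
    return userland, platform.release()


if __name__ == "__main__":
    userland, kernel = userland_and_kernel()
    print("userland:", userland)
    print("kernel:  ", kernel)
```

Run inside different images on the same node, the first line changes with the image while the second stays the same.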
Application containers: Python/Ruby
Python 3.5.1 – interactive session

[miguelgi@santis01]-[09:57:23]-[~]:-)$ salloc -t 01:00:00 -n1 --image=docker:python:3.5.1
salloc: Granted job allocation 16274
salloc: Waiting for resource configuration
salloc: Nodes nid00012 are ready for job
[miguelgi@santis01]-[10:00:51]-[~]:-)$ srun --pty shifter /bin/bash
[miguelgi@nid00012]-[09:01:04]-[~]:-)$ python -V
Python 3.5.1
[miguelgi@nid00012]-[09:01:08]-[~]:-)$ which python
/usr/local/bin/python
Python 3.2 – non-interactive session

[miguelgi@santis01]-[10:16:35]-[~]:-)$ salloc -t 01:00:00 -n1 --image=docker:python:3.2-slim
salloc: Granted job allocation 16278
salloc: Waiting for resource configuration
salloc: Nodes nid00012 are ready for job
[miguelgi@santis01]-[10:19:34]-[~/tmp]:-)$ srun shifter ./myscript.py
3.2.6

myscript.py

$ cat myscript.py
#! /usr/bin/env python
import platform
print(platform.python_version())
Ruby 2.1.8 – interactive session

[miguelgi@santis01]-[09:57:23]-[~]:-)$ salloc -t 01:00:00 -n1 --image=docker:ruby:2.1.8
salloc: Granted job allocation 16280
salloc: Waiting for resource configuration
salloc: Nodes nid00012 are ready for job
[miguelgi@santis01]-[10:22:14]-[~]:-)$ srun --pty shifter /bin/bash
[miguelgi@nid00012]-[09:22:28]-[~]:-)$ ruby -v
ruby 2.1.8p440 (2015-12-16 revision 53160) [x86_64-linux]
Ruby 2.1.8 – non-interactive session

[miguelgi@santis01]-[10:22:05]-[~]:-)$ salloc -t 01:00:00 -n1 --image=docker:ruby:2.1.8
salloc: Granted job allocation 16280
salloc: Waiting for resource configuration
salloc: Nodes nid00012 are ready for job
[miguelgi@santis01]-[10:22:08]-[~/tmp]:-)$ srun shifter ./myscript.rb
2.1.8

myscript.rb

$ cat myscript.rb
#!/usr/bin/env ruby
puts RUBY_VERSION
§ Shifter can also run application-specific containers, as in the Python and Ruby sessions above
Multi-node containers
§ It is possible to run the same container across multiple nodes (hostname.py session below)
§ Working on getting MPI to function across nodes
§ Working on making GPUs accessible to containers, with good results so far (native performance!)
GPU

lucasbe@santis01 ~/shifter-gpu> sbatch ./nvidia-docker/samples/cuda-stream/benchmark.sbatch
Submitted batch job 496
lucasbe@santis01 /scratch/santis/lucasbe/jobs> cat shifter-gpu.out.log
Launching GPU stream benchmark on nid00012 ...
STREAM Benchmark implementation in CUDA
Array size (double precision) = 1073.74 MB
using 192 threads per block, 699051 blocks
Function  Rate (GB/s)  Avg time(s)  Min time(s)  Max time(s)
Copy:     184.3169     0.01167758   0.01165104   0.01170397
Scale:    183.1849     0.01175387   0.01172304   0.01178598
Add:      180.3075     0.01790012   0.01786518   0.01792288
Triad:    180.1056     0.01790700   0.01788521   0.01794291
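
The reported rates can be cross-checked from the array size and the minimum times. The sketch below assumes the usual STREAM accounting (Copy and Scale touch two arrays, Add and Triad three), an array of 2^30 bytes (the 1073.74 MB printed above) and 1 GB = 10^9 bytes. Under these assumptions it reproduces the Rate column to within rounding.

```python
ARRAY_BYTES = 2 ** 30  # 1073741824 bytes, printed as 1073.74 MB

# kernel -> (number of arrays read or written, minimum time in seconds)
KERNELS = {
    "Copy": (2, 0.01165104),
    "Scale": (2, 0.01172304),
    "Add": (3, 0.01786518),
    "Triad": (3, 0.01788521),
}


def rate_gb_per_s(arrays, min_time, array_bytes=ARRAY_BYTES):
    """Bytes moved divided by the best (minimum) time, in GB/s."""
    return arrays * array_bytes / min_time / 1e9


if __name__ == "__main__":
    for name, (arrays, min_time) in KERNELS.items():
        rate = rate_gb_per_s(arrays, min_time)
        print("{:<6} {:.4f} GB/s".format(name, rate))
```

The launch configuration is consistent as well: 2^30 bytes of doubles are 2^27 = 134,217,728 elements, and 699051 blocks of 192 threads give 134,217,792 threads, the smallest multiple of 192 that covers them.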
hostname.py

#! /usr/bin/env python
import platform
import socket
print(platform.python_version())
print(socket.gethostname())

Python 3.2 – non-interactive session

[miguelgi@santis01]-[10:39:43]-[~/examples]:-)$ salloc -t 00:15:00 -N2 -w nid00013,nid00014 --image=docker:python:3.2-slim
salloc: Granted job allocation 16285
salloc: Waiting for resource configuration
salloc: Nodes nid000[13-14] are ready for job
[miguelgi@santis01]-[10:41:21]-[~/examples]:-)$ srun shifter ./hostname.py
3.2.6
nid00013
3.2.6
nid00014
Practical use case: WLCG Swiss Tier-2
§ CSCS operates the cluster Phoenix on behalf of CHIPP, the Swiss Institute of Particle Physics
§ Phoenix runs Tier-2 jobs for ATLAS, CMS and LHCb, three experiments of the LHC at CERN that are part of WLCG (Worldwide LHC Computing Grid)
§ WLCG jobs need and expect a RHEL-compatible OS. All software is precompiled and exposed in a cvmfs[*] filesystem
§ But Cray XC compute nodes run CLE, a modified version of SLES 11 SP3
§ So, how do we get these jobs to run on a Cray?
[*] https://cernvm.cern.ch/portal/filesystem
Practical use case: WLCG Swiss Tier-2
§ Using Shifter, we are able to run unmodified ATLAS, CMS and LHCb production jobs on a Cray XC40 TDS
§ Jobs see standard CentOS 6 containers
§ Nodes are shared: multiple single-core and multi-core jobs, from different experiments, can run on the same compute node
§ Job efficiency is comparable on both systems (Phoenix and the Cray XC40 TDS)
JOBID  USER      ACCOUNT  NAME            NODELIST  ST  REASON  START_TIME  END_TIME      TIME_LEFT   NODES  CPU
82471  atlasprd  atlas    a53eb5f8_34f0_  nid00043  R   None    15:03:33    Thu 15:03     1-23:54:18  1      8
82476  cms04     cms      gridjob         nid00043  R   None    15:08:39    Tomorr 03:08  11:59:24    1      2
82451  lhcbplt   lhcb     gridjob         nid00043  R   None    15:00:10    Tomorr 03:00  11:50:55    1      2
82447  lhcbplt   lhcb     gridjob         nid00043  R   None    14:59:31    Tomorr 02:59  11:50:16    1      2
82448  lhcbplt   lhcb     gridjob         nid00043  R   None    14:59:31    Tomorr 02:59  11:50:16    1      2
82449  lhcbplt   lhcb     gridjob         nid00043  R   None    14:59:31    Tomorr 02:59  11:50:16    1      2
82450  lhcbplt   lhcb     gridjob         nid00043  R   None    14:59:31    Tomorr 02:59  11:50:16    1      2
82446  lhcbplt   lhcb     gridjob         nid00043  R   None    14:49:01    Tomorr 02:49  11:39:46    1      2
82444  lhcbplt   lhcb     gridjob         nid00043  R   None    14:48:01    Tomorr 02:48  11:38:46    1      2
82445  lhcbplt   lhcb     gridjob         nid00043  R   None    14:48:01    Tomorr 02:48  11:38:46    1      2
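
The listing can be tallied to see how the node is shared between experiments. This sketch assumes the column order of the listing above and reads only the ACCOUNT column and the final CPU column, because END_TIME contains a space; the function name is illustrative.

```python
from collections import Counter

# Job lines copied from the listing above (header omitted).
LISTING = """\
82471 atlasprd atlas a53eb5f8_34f0_ nid00043 R None 15:03:33 Thu 15:03 1-23:54:18 1 8
82476 cms04 cms gridjob nid00043 R None 15:08:39 Tomorr 03:08 11:59:24 1 2
82451 lhcbplt lhcb gridjob nid00043 R None 15:00:10 Tomorr 03:00 11:50:55 1 2
82447 lhcbplt lhcb gridjob nid00043 R None 14:59:31 Tomorr 02:59 11:50:16 1 2
82448 lhcbplt lhcb gridjob nid00043 R None 14:59:31 Tomorr 02:59 11:50:16 1 2
82449 lhcbplt lhcb gridjob nid00043 R None 14:59:31 Tomorr 02:59 11:50:16 1 2
82450 lhcbplt lhcb gridjob nid00043 R None 14:59:31 Tomorr 02:59 11:50:16 1 2
82446 lhcbplt lhcb gridjob nid00043 R None 14:49:01 Tomorr 02:49 11:39:46 1 2
82444 lhcbplt lhcb gridjob nid00043 R None 14:48:01 Tomorr 02:48 11:38:46 1 2
82445 lhcbplt lhcb gridjob nid00043 R None 14:48:01 Tomorr 02:48 11:38:46 1 2
"""


def cpus_per_account(listing):
    """Sum the CPU column (last field) per ACCOUNT (third field)."""
    totals = Counter()
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 5 or not parts[0].isdigit():
            continue  # skip the header and blank lines
        totals[parts[2]] += int(parts[-1])
    return totals


if __name__ == "__main__":
    totals = cpus_per_account(LISTING)
    for account, cpus in sorted(totals.items()):
        print(account, cpus)
    print("total", sum(totals.values()))
```

For this snapshot the tally is 8 CPUs for atlas, 2 for cms and 16 for lhcb, 26 in total on nid00043.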
Wrap-up
Docker vs. Shifter

Docker
§ Can isolate the network by creating NAT or bridge devices. What about IB?
§ Users can write as root on exposed RW filesystems
§ Needs a local daemon running
§ Isn't SLURM-friendly
§ Can run on multiple nodes with its own tool (Swarm)
§ Can use GPUs?
§ MPI?
§ Can run images from a private registry

Shifter
§ Shows all of /dev, /sys and /proc to the container environment. Easy
§ Users can write as their $USER on any exposed RW filesystem
§ Does not need a local daemon on the CN
§ Is SLURM-friendly (SPANK plugin)
§ Can run on multiple nodes with WLM integration
§ Can use GPUs. Working on it!
§ MPI on its way!
§ Can run images from a private registry
Conclusion
§ Shifter works very well in our HPC environment
§ It’s being constantly developed and new features are appearing on a weekly basis
§ It’s open source and developed by the HPC community
§ It needs some additional work to cover basic HPC use cases (MPI)
§ Interacting with some parts of Shifter is not very user-friendly:
  § The process of pulling images is easy, but has no visual feedback
  § At times, error messages are difficult to understand
§ No ACLs yet
Reference links
§ NERSC info: http://www.nersc.gov/research-and-development/user-defined-images/
§ The code: https://github.com/NERSC/shifter
Questions?
Thank you for your attention.