Inside Docker for Fedora20/RHEL7

ver1.8e
Etsuji Nakai
Twitter @enakai00
Open Cloud Campus


$ who am i

Etsuji Nakai
– Senior solution architect and cloud evangelist at Red Hat.
– The author of the "Professional Linux Systems" series.
  • Available only in Japanese (some are in Korean translation).
  • Translation offers from publishers are welcome ;-)

Books:
– Self-study Linux: Deploy and Manage by Yourself
– Professional Linux Systems: Deployment and Management
– Professional Linux Systems: Network Management
– Professional Linux Systems: Technology for Next Decade

New OpenStack book is in store now!


Contents

– What is Linux Container
– Device Mapper Thin-Provisioning
– Network Namespace
– systemd and cgroups

(*) The contents of this document are based on Fedora20 with docker-io-1.0.0-1.fc20.x86_64.


What is Linux Container

Traditional server virtualization

Traditional "server virtualization" is a technology to create software-emulated "virtual machines" hosting various guest operating systems.

[Figure: three types of server virtualization, compared with a baremetal OS running directly on a physical machine. In each type, guest OSes run in virtual machines on top of a hypervisor.]
– Hardware-assisted virtualization (the hypervisor is embedded in firmware).
– Software-assisted virtualization (the hypervisor is installed on the physical machine): VMware vSphere, Xen, etc.
– Software-assisted virtualization (the host OS provides the hypervisor feature as a kernel module): Linux KVM.

What is container technology?

"Linux Container" is a Linux kernel feature to contain a group of processes in an independent execution environment called a container.

The Linux kernel provides an independent application execution environment for each container, which includes:
– Independent filesystem.
– Independent network interface and IP address.
– Usage limit for memory and CPU time.

You can use containers on Linux virtual machines as well as on baremetal servers, since containers can coexist with traditional server virtualization technology.

[Figure: on a baremetal OS, all user processes share one user space on top of the Linux kernel. With containers, user processes are grouped into separate user spaces (containers) on top of the same Linux kernel.]

Under the hood

A container supports separation of various resources. Internally, each kind of separation is realized by a different kernel technology, mostly a "namespace."
– Filesystem separation → Mount namespace (kernel 2.4.19)
– Hostname separation → UTS namespace (kernel 2.6.19)
– IPC separation → IPC namespace (kernel 2.6.19)
– User (UID/GID) separation → User namespace (kernel 2.6.23 to kernel 3.8)
– Process table separation → PID namespace (kernel 2.6.24)
– Network separation → Network namespace (kernel 2.6.24)
– Usage limit of CPU/Memory → Control groups

(*) Reference: "Namespaces in operation, part 1: namespaces overview"
• http://lwn.net/Articles/531114/

A Linux container is realized by integrating these namespace features. There are multiple container management tools such as lxctools, libvirt and docker. They may use different parts of these features.
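On a running Linux system, the namespaces a process belongs to can be inspected under /proc/<PID>/ns. The following is a minimal sketch that is not part of the original slides; the function name list_ns is made up for illustration, and which entries appear depends on the kernel version.

```shell
#!/bin/sh
# list_ns PID: print "<type> <identifier>" for each namespace of a process.
# Each entry under /proc/<PID>/ns is a symlink whose target looks like
# "net:[4026531956]"; processes in the same namespace share the same target.
list_ns() {
    for entry in /proc/"$1"/ns/*; do
        [ -L "$entry" ] || continue
        printf '%s %s\n' "$(basename "$entry")" "$(readlink "$entry")"
    done
}

# Show the namespaces of the current shell (only where /proc/<PID>/ns exists).
if [ -d "/proc/$$/ns" ]; then
    list_ns "$$"
fi
```

Running the same function against a PID inside a container (as root on the host) should show different identifiers for the namespaces that the container tool separated.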

Resource separation / Process tables

Processes in all containers are executed on the same Linux kernel. But inside a container, you can see only the processes in that container.
– This is because each container has its own process table. On the host Linux, which is outside the containers, you can see all processes including the ones in containers.

Processes seen inside container:

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:49 ?        00:00:00 /bin/sh /usr/local/bin/init.sh
root        35     1  0 09:49 ?        00:00:00 /usr/sbin/sshd
root        47     1  0 09:49 ?        00:00:00 /usr/sbin/httpd
apache      49    47  0 09:49 ?        00:00:00 /usr/sbin/httpd
apache      50    47  0 09:49 ?        00:00:00 /usr/sbin/httpd
...
apache      56    47  0 09:49 ?        00:00:00 /usr/sbin/httpd
root        57     1  0 09:49 ?        00:00:00 /bin/bash

Processes seen outside container:

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
...
root       802     1  0 18:10 ?        00:01:20 /usr/bin/docker -d --selinux-enabled -H fd://
...
root      3687   802  0 18:49 pts/2    00:00:00 /bin/sh /usr/local/bin/init.sh
root      3736  3687  0 18:49 ?        00:00:00 /usr/sbin/sshd
root      3748  3687  0 18:49 ?        00:00:00 /usr/sbin/httpd
48        3750  3748  0 18:49 ?        00:00:00 /usr/sbin/httpd
...
48        3757  3748  0 18:49 ?        00:00:00 /usr/sbin/httpd
root      3758  3687  0 18:49 pts/2    00:00:00 /bin/bash

Resource separation / Process tables (cont.)

In the example on the previous page, the docker daemon fork/exec-ed the initial process "init.sh" and put it in a new "PID namespace." After that, all processes fork/exec-ed from init.sh are put in the same namespace.
– Inside the container, the initial process has PID=1 independently of the host. Likewise, its child processes have independent PIDs.
– Since Docker 1.0 doesn't support the user namespace, the same UIDs/GIDs as on the host are used even in the container. User/group names can differ because /etc/passwd in the container is different.
  • Reference: "Docker 1.0 and user namespaces"
    https://groups.google.com/forum/#!topic/docker-dev/MoIDYDF3suY

[Figure: the docker daemon fork/execs "/bin/sh /usr/local/bin/init.sh" (PID=1) into a new PID namespace; sshd, httpd and bash are its children in the same namespace.]

init.sh:

#!/bin/sh
service sshd start
service httpd start
while [[ true ]]; do
    /bin/bash
done

Resource separation / Filesystem

A specific directory on the host is bind-mounted as the root directory of the container. Inside the container, that directory is seen as the root directory. This mechanism is very similar to a "chroot jail."

[Figure: within the mount namespace, the container's "/" (etc, bin, sbin, ...) is a bind mount of /export/container01/rootfs/ (etc, bin, sbin, ...) on the host.]

When using traditional container management tools such as lxctools or libvirt, you need to prepare the directory contents by hand.
– You can put the minimum contents for a specific application, such as application binaries and shared libraries, in the directory.
– It's also possible to copy the whole root filesystem of a specific Linux distribution to the directory.
– If necessary, special filesystems such as /dev, /proc and /sys are mounted in the container by the management tool.

Resource separation / Filesystem (cont.)

Docker provides its own disk image management system, which mounts the specified image on the host and makes it the root filesystem of the container.

Filesystems seen in a container (the root filesystem is the disk image mounted on the host; some files are separately bind-mounted):

# df -a
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 10190136 169036 9480428 2% /
/dev/mapper/docker-252:3-130516-d798a41bcba1dbe621bf2dd87de0f9c6dd9f9c8aadb79f84e01705ee82f364c6 10190136 169036 9480428 2% /
proc 0 0 0 - /proc
sysfs 0 0 0 - /sys
tmpfs 1025136 0 1025136 0% /dev
shm 65536 0 65536 0% /dev/shm
devpts 0 0 0 - /dev/pts
/dev/vda3 14226800 3013432 10467640 23% /.dockerinit
/dev/vda3 14226800 3013432 10467640 23% /etc/resolv.conf
/dev/vda3 14226800 3013432 10467640 23% /etc/hostname
/dev/vda3 14226800 3013432 10467640 23% /etc/hosts
devpts 0 0 0 - /dev/console
...

Specified disk image mounted on the host:

# df
Filesystem 1K-blocks Used Available Use% Mounted on
...
/dev/dm-2 10190136 169036 9480428 2% /var/lib/docker/devicemapper/mnt/d798a41bcba1dbe621bf2dd87de0f9c6dd9f9c8aadb79f84e01705ee82f364c6

Resource separation / Network

A container uses Linux's "veth" devices for network communication.
– A veth is a pair of logical NIC devices connected through a (virtual) crossover cable.

One side of the veth pair is placed in the container's network namespace so that it can be seen only inside the container. The other side is connected to a Linux bridge on the host.
– The device in the container is renamed, for example to "eth0." By means of the namespace, network settings such as IP address, routing table and iptables are configured independently in the container.
– The connection between the bridge and a physical network is up to the host configuration.

Docker creates a bridge "docker0", and packets from containers are forwarded with IP masquerade.
– Packets from the physical network targeted to specified ports are forwarded to the container using the port forwarding feature of iptables.

[Figure: the container's eth0 (inside its network namespace) is paired with vethXX on the host Linux, which is attached to the bridge docker0 (172.17.42.1). Traffic leaves through the host's eth0 to the physical network with IP masquerade.]

Resource separation / CPU and Memory

Processes inside a container see all physical memory and CPU cores, but their allocation is restricted with Linux's control groups (cgroups).
– In theory, fine-grained allocation control including the number of CPU cores, CPU time quota and I/O bandwidth is possible.

Docker uses systemd's unit mechanism to manage the group of processes in the container.
– When creating a container, Docker asks systemd to create a new unit to start the initial process. As a result, all processes fork/exec-ed from the initial process belong to the same unit. At the same time, systemd creates a new cgroup for the unit.

# systemd-cgls
...
└─system.slice
  ├─docker-cc08291a81556ba55f049e50fd2c04287b04c6cf657a8a9971ef42468a2befa7.scope
  │ ├─7444 nginx: master process ngin
  │ ├─7458 nginx: worker proces
  │ ├─7459 nginx: worker proces
  │ ├─7460 nginx: worker proces
  │ └─7461 nginx: worker proces
...

"docker-<Container ID>.scope" is the name of the cgroup.


Device Mapper Thin-Provisioning

What is Device Mapper?

Device Mapper is a Linux kernel mechanism to create a logical block device that provides additional features on top of physical block devices. Each feature is implemented by a software module that wraps the underlying devices. Typical modules are:
– dm-raid : adds a software RAID feature
– dm-multipath : adds multipath access to LUNs
– dm-crypt : adds an encryption feature
– dm-delay : adds an access delay emulation feature

[Figure: three examples of a logical device /dev/dm1: dm-raid mirroring /dev/sda and /dev/sdb; dm-crypt doing encryption/decryption on /dev/sda; dm-delay adding access delay on /dev/sda.]

What is Device Mapper Thin-Provisioning?

Device Mapper Thin-Provisioning (dm-thin) is a relatively new module which provides "thin-provisioning" and "snapshot" features similar to those of commercial storage appliances.

dm-thin uses two block devices: one for the "block pool" and the other for the "metadata device."
– Fixed-size blocks are dynamically allocated to logical devices, so blocks are consumed only when data is actually written.
– Pointers from segments of logical devices to blocks in the block pool are stored in the metadata device.
– CoW (Copy on Write) snapshots are created by allowing different logical devices to point to the same block. You can create multi-generation snapshots with this mechanism.

[Figure: logical devices #001, #002 and #003 point to blocks in the block pool. The pointers from segments of logical devices to blocks in the pool are stored in the metadata device.]

Using dm-thin through LVM interface

On recent Linux distributions, you can use dm-thin through the LVM interface as below.
– First, create a volume group as usual.

# fallocate -l $((1024*1024*1024)) pooldev.img
# losetup -f pooldev.img
# losetup -a
/dev/loop0: [64768]:39781720 (/root/pooldev.img)
# pvcreate /dev/loop0
# vgcreate vg_data /dev/loop0

– Then, define a "thin pool". This creates LVs for the block pool and the metadata device in the background.

# lvcreate -L 900M -T vg_data/thinpool
  Logical volume "lvol1" created
  Logical volume "thinpool" created
# lvs
  LV       VG      Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
...
  lvol0    vg_data -wi-------   4.00m
  thinpool vg_data twi-a-tz-- 900.00m               0.00

[Figure: in VG vg_data, LV thinpool is the block pool and LV lvol1 is the metadata device. Logical devices vol00, vol01, ... are allocated from the pool.]

Using dm-thin through LVM interface (cont.)

– Define a new logical device, specifying its logical size with the -V option.

# lvcreate -V 100G -T vg_data/thinpool -n vol00
  Logical volume "vol00" created
# lvs
  LV       VG      Attr       LSize   Pool     Origin Data%  Move Log Cpy%Sync Convert
...
  lvol0    vg_data -wi-------   4.00m
  thinpool vg_data twi-a-tz-- 900.00m                   0.00
  vol00    vg_data Vwi-a-tz-- 100.00g thinpool          0.00

– Create a snapshot with the following command.

# lvcreate -s --name vol01 vg_data/vol00
  Logical volume "vol01" created
# lvs
  LV       VG      Attr       LSize   Pool     Origin Data%  Move Log Cpy%Sync Convert
...
  lvol0    vg_data -wi-------   4.00m
  thinpool vg_data twi-a-tz-- 900.00m                   0.00
  vol00    vg_data Vwi-a-tz-- 100.00g thinpool          0.00
  vol01    vg_data Vwi---tz-k 100.00g thinpool vol00

– Snapshots are inactive by default for the sake of data protection. You can use a snapshot after activating it with the following command.

# lvchange -K -ay /dev/vg_data/vol01
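Note that the example defines a 100G logical device on a 900M pool, which is exactly what thin provisioning allows. As a rough illustration of how far the pool is overcommitted, here is a small arithmetic sketch; the function name overcommit_percent is hypothetical and not part of LVM.

```shell
#!/bin/sh
# overcommit_percent VIRTUAL_MIB POOL_MIB
# Print the virtual size as an integer percentage of the physical pool size.
overcommit_percent() {
    echo $(( $1 * 100 / $2 ))
}

# vol00 is 100G (102400 MiB) defined on a 900 MiB thin pool.
overcommit_percent 102400 900
```

Because blocks are consumed only when data is written, such a ratio is harmless until the pool actually fills up, so the Data% column of the thin pool in "lvs" is the value to watch.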

Use of Thin Provisioning in Docker

Docker has a plugin mechanism for image management drivers, and the "Device Mapper driver" is used in Fedora20/RHEL7. It stores each image in a logical device of "Device Mapper Thin Provisioning (dm-thin)."
– When starting a new container, a snapshot of the specified image is attached to the container.
– When storing the image with "docker commit", it creates a new snapshot of the snapshot. You should stop the container with "docker stop" before executing "docker commit."

[Figure: container lifecycle on dm-thin]
– run: a snapshot of the local image is created when starting a container, and the container's processes run on the snapshot.
– stop / start: when a container is stopped, all processes in it are stopped. (The snapshot image is not deleted.)
– commit: a new local image is saved by taking a snapshot of the snapshot.
– rm: when a container is removed, the associated snapshot is deleted.

How Docker uses Device Mapper Thin-Provisioning?

Docker uses the native dm interface of the dm-thin module instead of the LVM interface.
– When the docker service is launched, it attaches the following "data" and "metadata" disk image files to loop devices, and creates a block pool with them.

# ls -lh /var/lib/docker/devicemapper/devicemapper/
total 1.2G
-rw-------. 1 root root 100G May 11 21:37 data
-rw-------. 1 root root 2.0G May 11 22:05 metadata

# losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop0         0      0         1  0 /var/lib/docker/devicemapper/devicemapper/data
/dev/loop1         0      0         1  0 /var/lib/docker/devicemapper/devicemapper/metadata

# lsblk
NAME                       MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
...
loop0                        7:0    0  100G  0 loop
└─docker-252:3-130516-pool 253:0    0  100G  0 dm
loop1                        7:1    0    2G  0 loop
└─docker-252:3-130516-pool 253:0    0  100G  0 dm

Here /dev/loop0 (data) is the block pool device and /dev/loop1 (metadata) is the metadata device.

How Docker uses Device Mapper Thin-Provisioning? (cont.)

Configuration data of logical devices is stored in the following JSON files.
– /var/lib/docker/devicemapper/metadata/<Image ID>

# docker images enakai/httpd
REPOSITORY     TAG      IMAGE ID       CREATED        VIRTUAL SIZE
enakai/httpd   ver1.0   d3d92adfcafb   36 hours ago   206.6 MB

# cat /var/lib/docker/devicemapper/metadata/d3d92adfcafb* | python -mjson.tool
{
    "device_id": 72,
    "initialized": false,
    "size": 10737418240,
    "transaction_id": 99
}

– The logical device with device ID "0" has a special role. It is created with a size of 10GB when the Docker service is started for the first time. Docker initializes it as an empty ext4 filesystem.

# cat /var/lib/docker/devicemapper/metadata/base | python -mjson.tool
{
    "device_id": 0,
    "initialized": true,
    "size": 10737418240,
    "transaction_id": 1
}

– When you download images from an external registry, snapshots of this device are used to store those images. Therefore, all logical devices have the same 10GB size and ext4 filesystem.

Manipulating image contents by hand

As a sort of hacking technique, you can mount disk image contents by hand, using the dmsetup command to interact with the dm-thin module.
– First, using the commands on the previous page, check the "device_id" and "size" of the disk image you want to mount. In addition, check the name of the thin pool with the following command. It's "docker-252:3-130516-pool" in this example.

# lsblk
NAME                       MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
...
loop0                        7:0    0  100G  0 loop
└─docker-252:3-130516-pool 253:0    0  100G  0 dm
loop1                        7:1    0    2G  0 loop
└─docker-252:3-130516-pool 253:0    0  100G  0 dm

– For the sake of simplicity, set these values in shell variables.

# device_id=72
# size=10737418240
# pool=docker-252:3-130516-pool

Manipulating image contents by hand (cont.)

– Activate and mount the logical device with the following commands. The "rootfs" directory is the root filesystem seen from a container.

# dmsetup create myvol --table "0 $(($size / 512)) thin /dev/mapper/$pool $device_id"
# lsblk
...
loop0                        7:0    0  100G  0 loop
└─docker-252:3-130516-pool 253:0    0  100G  0 dm
  └─myvol                  253:1    0   10G  0 dm
loop1                        7:1    0    2G  0 loop
└─docker-252:3-130516-pool 253:0    0  100G  0 dm
  └─myvol                  253:1    0   10G  0 dm
# mount /dev/mapper/myvol /mnt
# ls /mnt
id  lost+found  rootfs
# cat /mnt/rootfs/var/www/html/index.html
Hello, World!

– Finally, unmount and deactivate the logical device.

# umount /mnt
# dmsetup remove myvol

(*) Modifying the contents of images is not a supported procedure of Docker. You should do it at your own risk as it may damage the image.
– Reference: https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt
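The table string passed to "dmsetup create" is the only non-obvious part of this procedure: device-mapper counts in 512-byte sectors, so the size in bytes is divided by 512. The following sketch only prints the table line without touching any device; the function name thin_table is made up for illustration.

```shell
#!/bin/sh
# thin_table SIZE_BYTES POOL_NAME DEVICE_ID
# Print a device-mapper table line for a dm-thin logical device.
# Layout: <start sector> <length in sectors> thin <pool device> <device id>
thin_table() {
    sectors=$(( $1 / 512 ))   # device-mapper uses 512-byte sectors
    printf '0 %s thin /dev/mapper/%s %s\n' "$sectors" "$2" "$3"
}

# Values from the example: the 10GB image with device_id 72.
thin_table 10737418240 docker-252:3-130516-pool 72
```

Printing the line first makes it easy to check the sector count (20971520 for a 10GB device) before running dmsetup as root.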


Network Namespace

Network configuration in Docker

The container's logical NIC "eth0" is connected to a Linux bridge "docker0." Communication between the container and the external network is controlled with iptables on the host.
– Packets from a container are forwarded with IP masquerade.
– Packets from the external network to specified ports are forwarded to a container with the port forwarding feature of iptables.

[Figure: the container's eth0 (inside its network namespace) is paired with vethXX on the host Linux, which is attached to the bridge docker0 (172.17.42.1). The host's eth0 forwards the traffic with IP masquerade.]

As an example, start a container with port forwarding from 8000 to 80, and from 2222 to 22.

# docker run -itd -p 8000:80 -p 2222:22 enakai/httpd:ver1.0
a7838c84cd008161086839379e4a0be2d0e109e02c779229cde49f53b79ae1d5

– One end of a veth pair is connected to the bridge "docker0."

# brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.56847afe9799       no              veth66c0

# ifconfig docker0
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.42.1  netmask 255.255.0.0  broadcast 0.0.0.0
...

Network configuration in Docker (cont.)

– The nat table of iptables is configured as below.

# iptables-save
# Generated by iptables-save v1.4.19.1 on Fri Jun 13 22:36:14 2014
*nat
...
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER    ①
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER    ②
-A POSTROUTING -s 172.17.0.0/16 ! -d 172.17.0.0/16 -j MASQUERADE    ③
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 2222 -j DNAT --to-destination 172.17.0.23:22    ④
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 8000 -j DNAT --to-destination 172.17.0.23:80    ⑤
COMMIT

① Packets from an external network are processed in the DOCKER chain for port forwarding.
② Packets from localhost to localhost's IP addresses (except "127.0.0.0/8") are processed in the DOCKER chain, too.
③ Packets from a container to an external network are forwarded with IP masquerade.
④⑤ Port forwarding configuration specified with "docker run".

– I'm not sure why "127.0.0.0/8" is excluded in ②. But anyway, packets to "127.0.0.0/8" are processed appropriately because... (see next page.)
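The two port forwarding rules in the nat table follow a fixed pattern built from the host port, the container's IP address and the container port. As a sketch of that pattern (it only prints text in iptables-save syntax and changes nothing), with the hypothetical function name dnat_rule:

```shell
#!/bin/sh
# dnat_rule HOST_PORT CONTAINER_IP CONTAINER_PORT
# Print (not apply) a DNAT rule in the same form as the port forwarding
# rules of the DOCKER chain in the listing.
dnat_rule() {
    printf '%s\n' "-A DOCKER ! -i docker0 -p tcp -m tcp --dport $1 -j DNAT --to-destination $2:$3"
}

# The two mappings given to "docker run" in the example.
dnat_rule 2222 172.17.0.23 22
dnat_rule 8000 172.17.0.23 80
```

The "! -i docker0" match means the rule is skipped for packets arriving from the bridge itself, so only traffic from outside the bridge is redirected.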

Network configuration in Docker (cont.)

– The Docker daemon provides a port forwarding proxy feature, and packets which are not processed by iptables are handled by it.
– Originally, the feature was prepared for hosts without iptables. I'm not sure why packets to "127.0.0.0/8" are selectively handled by it.

# lsof -i -P
COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
...
docker  20003  root  11u   IPv6 177010      0t0  TCP *:2222 (LISTEN)
docker  20003  root  12u   IPv6 178468      0t0  TCP *:8000 (LISTEN)
...

Network namespace manipulation

As a sort of hacking technique, you can directly manipulate network namespaces. Without Docker, you would use network namespaces in the following steps.
– Define a new namespace.
– Add network configuration in the namespace, such as logical NIC, IP address, routing table and iptables.
– Launch processes in the namespace.

You can use the "ip netns" command to manipulate network namespaces. But you need some additional operations to manipulate network namespaces created by Docker.
– Find the PID of one of the processes in the container.

# systemd-cgls
...
└─system.slice
  ├─docker-61151db106a7fd6d5cf937a03eac0e9b33c7799d3d48b6cddc83070839afeea9.scop
  │ ├─502 /bin/sh /usr/local/bin/init.sh
  │ ├─545 /usr/sbin/sshd
  │ ├─557 /usr/sbin/httpd
...

– There is a symlink to the descriptor to manipulate the namespace in the /proc filesystem of this process.

# ls -l /proc/502/ns/net
lrwxrwxrwx 1 root root 0 June 13 22:52 /proc/502/ns/net -> net:[4026532255]
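The symlink target (net:[4026532255] in this example) identifies the namespace, so two processes are in the same network namespace exactly when their targets match. A minimal sketch of such a check follows; the function name same_netns is an assumption for illustration, and reading the link of another user's process generally requires root.

```shell
#!/bin/sh
# same_netns PID1 PID2
# Succeed if both processes are in the same network namespace, judged by
# comparing the targets of their /proc/<PID>/ns/net symlinks.
# Returns 2 if either link cannot be read.
same_netns() {
    ns1=$(readlink "/proc/$1/ns/net") || return 2
    ns2=$(readlink "/proc/$2/ns/net") || return 2
    [ "$ns1" = "$ns2" ]
}

# A process always shares a namespace with itself. Comparing a host shell
# with a container process such as PID 502 would report a difference.
if same_netns "$$" "$$"; then
    echo "same network namespace"
fi
```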

Network namespace manipulation (cont.)

– By creating a symlink to the descriptor under /var/run/netns/, you make the ip command recognize the namespace.

# mkdir /var/run/netns
# ln -s /proc/502/ns/net /var/run/netns/foo-ns
# ip netns
foo-ns

– From this point, you can execute any command inside the namespace "foo-ns."

# ip netns exec foo-ns <command>

– For example, by starting bash inside the namespace, you can see the network configuration in the container. But configurations other than the network are the same as on the host, since you switched only the network namespace.

# ip netns exec foo-ns bash
# ifconfig eth0
eth0: flags=67<UP,BROADCAST,RUNNING>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 0.0.0.0
...
# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.17.42.1     0.0.0.0         UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
# exit

Adding more logical NICs

With the "ip netns" hacking technique, you can add logical NICs after starting a new container. The following is an example of adding a logical NIC which connects to the physical network through a bridge "br0." (This is not a supported operation of Docker.)
– Create a bridge "br0" and move the IP address (192.168.200.20/24 in this case) of the physical NIC to the bridge.

# brctl addbr br0; ip link set br0 up
# ip addr del 192.168.200.20/24 dev eth0; ip addr add 192.168.200.20/24 broadcast 192.168.200.255 dev br0; brctl addif br0 eth0; route add default gw 192.168.200.1
# echo 'NM_CONTROLLED="no"' >> /etc/sysconfig/network-scripts/ifcfg-eth0
# systemctl enable network.service

(*) You should understand what you're doing with these commands. A mistake may disable the network connection.

[Figure: target configuration. In addition to eth0 / vethXX / docker0 (IP masquerade), the container gets eth1 (192.168.200.99) paired with vethYY on the host. vethYY and the host's eth0 are attached to the bridge br0 (192.168.200.20), which connects to the external network.]

Adding more logical NICs (cont.)

– Create a veth pair "veth-host / veth-guest", and attach "veth-host" to the bridge br0.

# ip link add name veth-host type veth peer name veth-guest
# ip link set veth-guest down
# brctl addif br0 veth-host
# brctl show br0
bridge name     bridge id               STP enabled     interfaces
br0             8000.525400677470       no              eth0
                                                        veth-host

• At this point, both veth-host and veth-guest are visible on the host, not in the container.

[Figure: veth-host is attached to br0 together with the host's eth0. Its peer veth-guest is still on the host and not yet in the container.]

Adding more logical NICs (cont.)

– Add veth-guest to the container's namespace. At this point, veth-guest becomes invisible on the host.

# ip link set veth-guest netns foo-ns
# ifconfig veth-guest
veth-guest: error fetching interface information: Device not found

– From this point, you can use "ip netns exec" to make additional network configurations in the container. The following commands rename the logical NIC to "eth1" and add an IP address. In addition, they modify the routing table so that the default route goes through eth1.

# ip netns exec foo-ns ip link set veth-guest name eth1
# ip netns exec foo-ns ip addr add 192.168.200.99/24 dev eth1
# ip netns exec foo-ns ip link set eth1 up
# ip netns exec foo-ns ip route delete default
# ip netns exec foo-ns ip route add default via 192.168.200.1
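If you repeat these steps for several containers, it may help to generate the command sequence from parameters and review it before running it as root. The following is a dry-run sketch under that assumption; print_nic_setup is a made-up helper that only prints the commands shown above and executes none of them.

```shell
#!/bin/sh
# print_nic_setup NETNS OLD_NAME NEW_NAME ADDR_CIDR GATEWAY
# Print (not run) the commands that rename a NIC already moved into NETNS,
# assign its address, bring it up and replace the default route.
print_nic_setup() {
    ns=$1 old=$2 new=$3 addr=$4 gw=$5
    echo "ip netns exec $ns ip link set $old name $new"
    echo "ip netns exec $ns ip addr add $addr dev $new"
    echo "ip netns exec $ns ip link set $new up"
    echo "ip netns exec $ns ip route delete default"
    echo "ip netns exec $ns ip route add default via $gw"
}

# Parameters of the example in this document.
print_nic_setup foo-ns veth-guest eth1 192.168.200.99/24 192.168.200.1
```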

Adding more logical NICs (cont.)

– Log in to the container and check the network configuration inside the container.

# ssh enakai@localhost -p 2222
$ ifconfig eth1
eth1      Link encap:Ethernet  HWaddr BE:53:16:06:BF:3A
          inet addr:192.168.200.99  Bcast:0.0.0.0  Mask:255.255.255.0
...
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.200.1   0.0.0.0         UG    0      0        0 eth1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 eth0
192.168.200.0   0.0.0.0         255.255.255.0   U     0      0        0 eth1

– Now you can directly access the container without port forwarding.

$ curl http://192.168.200.99:80
Hello, World!

– You can remove the symlink in /var/run/netns once you have finished the configuration.

# rm /var/run/netns/foo-ns

By the way, there is a shell script to automate this procedure.
– jpetazzo/pipework
– https://github.com/jpetazzo/pipework


systemd and cgroups

Basics of systemd and cgroups

Refer to the following slides for systemd basics.
– Your first dive into systemd
  • http://www.slideshare.net/enakai/systemd-study-v14e

In particular, you need to understand how systemd manages cgroups in conjunction with units.
– systemd defines various "units" corresponding to services and daemons.
– When systemd starts a service as a unit, it dynamically creates a cgroup for that unit. All processes of the service are placed under this group.
– If you specify "CPUShares" and "MemoryLimit" in the unit's configuration file, they are translated to the corresponding cgroups settings. ("CPUShares" specifies the relative weight of CPU time allocation, and "MemoryLimit" specifies the upper limit of memory usage.)

Basics of systemd and cgroups (cont.)

You can check the cgroups status managed by systemd with the following command.

# systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 23
├─user.slice
│ └─user-0.slice
│   ├─session-1.scope
│   │ ├─439 sshd: root@pts/0
│   │ ├─444 -bash
│   │ ├─464 systemd-cgls
│   │ └─465 systemd-cgls
│   └─user@0.service
│     ├─441 /usr/lib/systemd/systemd --user
│     └─442 (sd-pam)
└─system.slice
  ├─polkit.service
  │ └─352 /usr/lib/polkit-1/polkitd --no-debug
  ├─auditd.service
  │ └─301 /sbin/auditd -n
  ├─systemd-udevd.service
  │ └─248 /usr/lib/systemd/systemd-udevd
...

How Docker works with systemd?

When starting a container, Docker asks systemd to create a new unit to start the initial process.
– As a result, all processes fork/exec-ed from the initial process belong to the same unit and are placed under the same cgroup. The unit name is "docker-<container ID>.scope".

# docker run -td -p 8000:80 -p 2222:22 enakai/httpd:ver1.0
# systemd-cgls -a
...
└─system.slice
  ├─var-lib-docker-devicemapper-mnt-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.mount
  ├─docker-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.scope
  │ ├─496 /bin/sh /usr/local/bin/init.sh
  │ ├─538 /usr/sbin/sshd
  │ ├─550 /usr/sbin/httpd
  │ ├─552 /bin/bash
  │ ├─553 /usr/sbin/httpd
  │ ├─554 /usr/sbin/httpd
  │ ├─555 /usr/sbin/httpd
  │ ├─556 /usr/sbin/httpd
  │ ├─557 /usr/sbin/httpd
  │ ├─558 /usr/sbin/httpd
  │ ├─559 /usr/sbin/httpd
  │ └─560 /usr/sbin/httpd
...

How Docker works with systemd? (cont.)

– You can check the unit status corresponding to a container.

# unitname=docker-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.scope
# systemctl status $unitname
docker-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.scope - docker container a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b
   Loaded: loaded (/run/systemd/system/docker-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.scope; static)
  Drop-In: /run/systemd/system/docker-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.scope.d
           └─90-BlockIOAccounting.conf, 90-CPUAccounting.conf, 90-Description.conf, 90-MemoryAccounting.conf, 90-Slice.conf
   Active: active (running) since Fri 2014-06-13 23:05:27 JST; 1min 41s ago
   CGroup: /system.slice/docker-a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b.scope
           ├─496 /bin/sh /usr/local/bin/init.sh
           ├─538 /usr/sbin/sshd
           ├─550 /usr/sbin/httpd
           ├─552 /bin/bash
           ├─553 /usr/sbin/httpd
           ├─554 /usr/sbin/httpd
           ├─555 /usr/sbin/httpd
...
           └─560 /usr/sbin/httpd

Jun 13 23:05:27 fedora20 systemd[1]: Started docker container a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a...488b.
Hint: Some lines were ellipsized, use -l to show in full.
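Since the unit name is derived mechanically from the full container ID, the mapping in both directions can be sketched with plain string handling. The function names scope_unit and container_id are hypothetical, and the "docker-<container ID>.scope" pattern is the one observed in this document, so it may differ with other Docker versions.

```shell
#!/bin/sh
# scope_unit CONTAINER_ID
# Print the systemd unit name for a container, following the
# "docker-<container ID>.scope" pattern.
scope_unit() {
    printf 'docker-%s.scope\n' "$1"
}

# container_id UNIT_NAME
# Reverse mapping: strip the "docker-" prefix and the ".scope" suffix.
container_id() {
    id=${1#docker-}
    printf '%s\n' "${id%.scope}"
}

unitname=$(scope_unit a985fc6dbe8dfc6335474ae68291ad3c51cddcbc28c1a47f7c4bc8b37e3b488b)
echo "$unitname"
container_id "$unitname"
```

The resulting $unitname can then be passed to "systemctl status" or "systemctl show" as in the example.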

How Docker works with systemd? (cont.)

– The "docker run" command has "-c" and "-m" options. They are translated to the unit's configuration parameters "CPUShares" and "MemoryLimit".

# systemctl show $unitname | grep -E "(CPUShares=|MemoryLimit=)"
CPUShares=1024
MemoryLimit=18446744073709551615

– After starting a container, you can change these parameters through systemd's interface.

# systemctl set-property $unitname CPUShares=512 --runtime
# systemctl show $unitname | grep -E "(CPUShares=|MemoryLimit=)"
CPUShares=512
MemoryLimit=18446744073709551615

systemd will be more tightly integrated with cgroups in the future. After that, additional resource controls (CPU pinning, CPU quota, I/O bandwidth) may be added to Docker.
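CPUShares is a relative weight, so its effect depends on what else is competing for CPU. The following sketch shows the arithmetic for the simplified case where every listed group is fully busy; the function name cpu_share_percent is made up for illustration. Note also that the MemoryLimit value shown above, 18446744073709551615, is 2^64-1, which in effect means that no limit is set.

```shell
#!/bin/sh
# cpu_share_percent OWN_SHARES [OTHER_SHARES...]
# Print the integer percentage of CPU time the first group would get when
# all listed groups compete for CPU. Shares are weights, not hard limits:
# an otherwise idle system still lets a low-share group use the whole CPU.
cpu_share_percent() {
    own=$1
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( own * 100 / total ))
}

# A container lowered to 512 shares competing with one at the default 1024.
cpu_share_percent 512 1024
```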


Let's learn the up-to-date technology with Fedora/RHEL