Copyright (C) 2014, NTTPC Communications, Inc. All Rights Reserved.

Trying and evaluating the new features of GlusterFS 3.5


My presentation at LinuxCon/CloudOpen Japan 2014. Only a few days have passed since GlusterFS 3.5 was released, so feel free to correct me if you find any mistakes or misunderstandings. Thanks.


Page 1: Trying and evaluating the new features of GlusterFS 3.5


Page 2: Trying and evaluating the new features of GlusterFS 3.5


Agenda

•  About me
•  The new features
•  Additional news
•  Conclusion

Page 3: Trying and evaluating the new features of GlusterFS 3.5


About Me

Page 4: Trying and evaluating the new features of GlusterFS 3.5


About me

•  Work as a ...
    •  Programmer
    •  Software Engineer
•  (Most recently) supporting GlusterFS/Red Hat Storage introduction with Red Hat K.K.
•  Also interested in
    •  Cloud Computing
    •  Big Data/Data Science
    •  New technologies

Page 5: Trying and evaluating the new features of GlusterFS 3.5


About me

•  GlusterFS since 2007 (v1.3.7)
    •  for my internet crawler at first.
•  Love Gluster because of the ...
    •  Potential
    •  Performance
    •  Code
    •  Community
•  Introduced or introducing it into ...
    •  Printer and scanner solution (field trial)
    •  Email services
    •  File storage services (WebDAV, NFS)
    •  Backup services
    •  Shared storage platform
    •  Medical service
•  A board member of the Gluster Community

Page 6: Trying and evaluating the new features of GlusterFS 3.5


My system

Node     Seg.1: 192.168.79.0/24, GigE    Seg.2: 10.0.0.0/8, 100BaseT (USB Ethernet)
eins     .1                              .79.0.1
zwei     .2                              .79.0.2
drei     .3                              .79.0.3
vier     .4                              .79.0.4
fuenf    .5                              .79.0.5
sechs    .6                              .79.0.6
sieben   .7                              .79.0.7

Figure labels: storage pool; (mainly) client

•  Seven nodes, connected to two separate physical network segments.
•  Seg.1 is for GlusterFS and Seg.2 is for other purposes (e.g. SSH)
•  Each node is set up with:
    •  CentOS 6.5 x86_64
    •  GlusterFS 3.5.0 (from source tarball)

Page 7: Trying and evaluating the new features of GlusterFS 3.5


My system

•  Intel NUC DN2820FYKH
    •  Celeron 2.4GHz dual-core, 1MB cache
    •  8GB RAM
    •  1TB Solid-state hard drive (w/ 8GB SLC SSD)
    •  7.5W TDP
•  Why?
    •  Separate the various loads (mainly disk access and network traffic)
    •  Cheap enough to build (38k JPY/node)
    •  Saves money on electricity (2 JPY/d/node)
    •  Keeps my room's temperature from rising

Page 8: Trying and evaluating the new features of GlusterFS 3.5


My system

GlusterFS 3.5.0 was installed on each node in the following way:

% sudo yum install -y openssh-clients make rpm-build bison flex automake libtool ncurses-devel readline-devel openssl-devel libxml2-devel libibverbs-devel libacl-devel libattr-devel python-devel python-setuptools lvm2-devel systemtap-sdt-devel libaio-devel xfsprogs glib2-devel
% tar xzf glusterfs-3.5.0.tar.gz && cd glusterfs-3.5.0
% ./configure --prefix=/usr/local/glusterfs-3.5.0 --enable-bd-xlator --enable-fusermount --enable-systemtap --enable-debug --enable-crypt-xlator --enable-qemu-block --enable-glupy
% make && sudo make install
# ln -sfn /usr/local/glusterfs-3.5.0 /usr/local/glusterfs
# cp -p /etc/init.d/glusterd /etc/init.d/glusterd-3.5.0
# cat <<EOF >> ~/.zshrc
export PATH=\$PATH:/usr/local/glusterfs/sbin
export MANPATH=\$MANPATH:/usr/local/glusterfs/share/man
EOF
# source ~/.zshrc
# echo "/usr/local/glusterfs/lib" > /etc/ld.so.conf.d/glusterfs.conf
# ldconfig
# sed -i 's/SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
# chkconfig iptables off
# /etc/init.d/iptables stop

Page 9: Trying and evaluating the new features of GlusterFS 3.5


12 new features

Page 10: Trying and evaluating the new features of GlusterFS 3.5


Overview

Columns: Features | OpenStack | Operation | Management | Scalability | Performance | Stability | Security | Dev

AFR_CLI_enhancements ✔
Exposing Volume Capabilities ✔
File Snapshot ✔
GFID Access ✔
On-Wire Compression + Decompression ✔
Prevent NFS restart on Volume change (Part 1) ✔
Quota Scalability ✔ ✔
readdir_ahead ✔
zerofill ✔ ✔
Brick Failure Detection ✔
Disk encryption ✔
Geo-Replication Enhancement ✔ ✔

Page 11: Trying and evaluating the new features of GlusterFS 3.5


OpenStack Integration Enhancements

Page 12: Trying and evaluating the new features of GlusterFS 3.5


File Snapshot

features/qemu-block xlator: works on a file <file_name> under a mount point of a volume, through a fuse hook to the glusterfs client process.

# setfattr -n trusted.glusterfs.block-format -v qcow2:<file_size(in KB/MB/GB)> <file_name>

Take a snapshot:
# setfattr -n trusted.glusterfs.block-snapshot-create -v <snapshot_name1> <file_name>

Take a snapshot:
# setfattr -n trusted.glusterfs.block-snapshot-create -v <snapshot_name2> <file_name>

Restore from a snapshot:
# setfattr -n trusted.glusterfs.block-snapshot-goto -v <snapshot_name1> <file_name>

# setfattr -n trusted.glusterfs.block-snapshot-delete -v <snapshot_name2> <file_name>
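The setfattr calls on this slide can be wrapped in small helpers. The following is only a sketch: the function names, the `DRY_RUN` switch, and the example file path are hypothetical, and only the xattr names come from the slide. With `DRY_RUN=1` (the default here) the commands are printed rather than executed, so the script can be tried on a machine without GlusterFS.

```shell
#!/bin/sh
# Dry-run wrappers around the file-snapshot xattrs shown on the slide.
# DRY_RUN=1 prints each command; DRY_RUN=0 would actually execute it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snap_format <file> <size>   e.g. size = 1GB (KB/MB/GB units, as on the slide)
snap_format() { run setfattr -n trusted.glusterfs.block-format -v "qcow2:$2" "$1"; }
# snap_create|snap_goto|snap_delete <file> <snapshot_name>
snap_create() { run setfattr -n trusted.glusterfs.block-snapshot-create -v "$2" "$1"; }
snap_goto()   { run setfattr -n trusted.glusterfs.block-snapshot-goto -v "$2" "$1"; }
snap_delete() { run setfattr -n trusted.glusterfs.block-snapshot-delete -v "$2" "$1"; }

# Hypothetical file under a mounted volume
f=/mnt/glusterfs/vol0/disk.img
snap_format "$f" 1GB
snap_create "$f" snap1
snap_create "$f" snap2
snap_goto   "$f" snap1
snap_delete "$f" snap2
```

The sequence at the bottom mirrors the slide: format, two snapshots, restore to the first, delete the second.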

Page 13: Trying and evaluating the new features of GlusterFS 3.5


File Snapshot

[Figure: OpenStack Cinder uses a file <file_name> under a mount point of a volume as block storage. The block-format, block-snapshot-create (take a snapshot), block-snapshot-goto (restore from a snapshot) and block-snapshot-delete operations pass through a fuse hook to the glusterfs client process. Xlators shown: features/qemu-block xlator, BD xlator.]

Page 14: Trying and evaluating the new features of GlusterFS 3.5


zerofill

[Figure: a user app (e.g. Cinder) issues the ZEROFILL fop (glfs_zerofill function) through libgfapi; AFR passes it to each glusterfsd, where the posix_do_zerofill function writes the zeros (0000 0000 0000). Related labels: SCSI WRITESAME command, BLKZEROOUT ioctl on Linux.]

Page 15: Trying and evaluating the new features of GlusterFS 3.5


zerofill

Server offloaded zerofill vs repeated zeroing

[root@llmvm02 remote]# time ./offloaded aakash-test log 20
real 3m34.155s
user 0m0.018s
sys 0m0.040s
[root@llmvm02 remote]# time ./manually aakash-test log 20
real 4m23.043s
user 0m2.197s
sys 0m14.457s

1.23 times faster!

[root@llmvm02 remote]# time ./offloaded aakash-test log 25;
real 4m28.363s
user 0m0.021s
sys 0m0.025s
[root@llmvm02 remote]# time ./manually aakash-test log 25
real 5m34.278s
user 0m2.957s
sys 0m18.808s

1.25 times faster!

http://www.gluster.org/community/documentation/index.php/Features/zerofill
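The two speedup figures follow directly from the `real` times above. A minimal sketch of the arithmetic, with hypothetical helper names `to_seconds` and `speedup`:

```shell
#!/bin/sh
# to_seconds "3m34.155s" -> 214.155  (converts the `time` format to seconds)
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup <manual_real_time> <offloaded_real_time>
# Prints how many times faster the offloaded run was, to two decimals.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

speedup 4m23.043s 3m34.155s   # runs with argument 20 -> 1.23
speedup 5m34.278s 4m28.363s   # runs with argument 25 -> 1.25
```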

Page 16: Trying and evaluating the new features of GlusterFS 3.5


Operation Enhancements

Page 17: Trying and evaluating the new features of GlusterFS 3.5


AFR_CLI_enhancements

Before 3.5.0

# gluster volume heal vol1
Heal operation on volume vol1 has been successful
# gluster volume heal vol1 info
...
# gluster volume heal vol1 info healed
...
# gluster volume heal vol1 info heal-failed
...
# gluster volume heal vol1 info split-brain
...

•  Too many operations to know all the situations...
•  What I want to know is not the file names...
•  How long does the healing take?
•  I don't know when the split-brain was detected, but...

Page 18: Trying and evaluating the new features of GlusterFS 3.5


AFR_CLI_enhancements

After 3.5.0

# gluster volume heal vol1 statistics
Gathering crawl statistics on volume vol1 has been successful
------------------------------------------------
Crawl statistics for brick no 0
Hostname of brick eins
Starting time of crawl: Mon May 19 10:13:02 2014
Ending time of crawl: Mon May 19 10:13:02 2014
Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0
...

Wow! I can get the statistical and historical information at a glance!
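Because the statistics output is line-oriented, it is easy to condense further. The sketch below assumes the field labels appear exactly as in the output shown on this slide; `summarize_heal_stats` is a hypothetical helper name, and the sample text is copied from the slide.

```shell
#!/bin/sh
# Condense `gluster volume heal <vol> statistics` output (read from stdin)
# into one line per crawl: "<host> healed=<n> split-brain=<n> failed=<n>".
summarize_heal_stats() {
    awk '
        /^Hostname of brick/               { host = $4 }
        /^No\. of entries healed:/         { healed = $NF }
        /^No\. of entries in split-brain:/ { split_brain = $NF }
        /^No\. of heal failed entries:/    {
            printf "%s healed=%s split-brain=%s failed=%s\n",
                   host, healed, split_brain, $NF
        }
    '
}

# Sample taken from the slide
sample_stats='Crawl statistics for brick no 0
Hostname of brick eins
Starting time of crawl: Mon May 19 10:13:02 2014
Ending time of crawl: Mon May 19 10:13:02 2014
Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0'

printf '%s\n' "$sample_stats" | summarize_heal_stats
# On a live system: gluster volume heal vol1 statistics | summarize_heal_stats
```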

Page 19: Trying and evaluating the new features of GlusterFS 3.5


Management Enhancements

Page 20: Trying and evaluating the new features of GlusterFS 3.5


Exposing Volume Capabilities

Before 3.5.0

# gluster volume info
Volume Name: bd0
Type: Distribute
Volume ID: 019d0f4b-d11a-480e-9be8-0c79902f0746
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: sieben:/tmp/bd0-meta

I can't tell which volume type a volume supports, so I have to track it with other tools like Excel...

Page 21: Trying and evaluating the new features of GlusterFS 3.5


Exposing Volume Capabilities

After 3.5.0

# gluster volume info
Volume Name: bd0
Type: Distribute
Volume ID: 019d0f4b-d11a-480e-9be8-0c79902f0746
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: sieben:/tmp/bd0-meta
Brick1 VG: bd0-vg

•  Probe the type of volume
•  Provide a list of capabilities of an xlator/volume.

Yeah! I can understand the volume type and the details!
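With the capabilities exposed in `gluster volume info`, a script can pick them up instead of a spreadsheet. A minimal sketch, assuming the "Xlator N:" and "Capability N:" line format shown on this slide; `list_capabilities` is a hypothetical helper name and the sample is the slide's own output.

```shell
#!/bin/sh
# Print "<volume> <xlator> <capability>" for every capability line found
# in `gluster volume info` output read from stdin.
list_capabilities() {
    awk -F': ' '
        /^Volume Name:/       { vol = $2 }
        /^Xlator [0-9]+:/     { xl = $2 }
        /^Capability [0-9]+:/ { print vol, xl, $2 }
    '
}

# Sample taken from the slide
sample_info='Volume Name: bd0
Type: Distribute
Volume ID: 019d0f4b-d11a-480e-9be8-0c79902f0746
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: sieben:/tmp/bd0-meta
Brick1 VG: bd0-vg'

printf '%s\n' "$sample_info" | list_capabilities
# On a live system: gluster volume info | list_capabilities
```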

Page 22: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

Here we create a VG from a single 2GB PV:

# dd if=/dev/zero of=/tmp/bd-loop6 bs=1M count=2048
# losetup /dev/loop6 /tmp/bd-loop6
# pvcreate /dev/loop6
# vgcreate bd0-vg /dev/loop6
Volume group "bd0-vg" successfully created

This VG becomes a volume of GlusterFS.

If you want thin-provisioned BDs, run the lvcreate command. (And the names are fixed.)

# lvcreate --thin bd0-vg -L 1000M
Logical volume "lvol0" created
Logical volume "lvol1" created

Page 23: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

A logical volume pool for thin-provisioning. It is not needed when thin-provisioning is not used.

# lvdisplay bd0-vg
--- Logical volume ---
LV Name lvol1
VG Name bd0-vg
LV UUID PSAFkr-Vyr8-fkGU-kDnA-rWUF-fFFT-111Snr
LV Write Access read/write
LV Creation host, time sieben, 2014-05-18 14:38:21 +0900
LV Pool transaction ID 0
LV Pool metadata lvol1_tmeta
LV Pool data lvol1_tdata
LV Pool chunk size 64.00 KiB
LV Zero new blocks yes
LV Status available
# open 0
LV Size 1000.00 MiB
Allocated pool data 0.00%
Allocated metadata 0.88%
Current LE 250
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:5

Page 24: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

# mkdir /tmp/bd0-meta
# gluster volume create bd0 sieben:/tmp/bd0-meta\?bd0-vg force
volume create: bd0: success: please start the volume to access data
# gluster volume start bd0
volume start: bd0: success
# gluster volume info bd0
Volume Name: bd0
Type: Distribute
Volume ID: 019d0f4b-d11a-480e-9be8-0c79902f0746
Status: Started
Xlator 1: BD
Capability 1: thin
Capability 2: offload_copy
Capability 3: offload_snapshot
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: sieben:/tmp/bd0-meta
Brick1 VG: bd0-vg
# mkdir /mnt/glusterfs/bd0
# mount -t glusterfs sieben:/bd0 /mnt/glusterfs/bd0

•  /tmp/bd0-meta is the metadata store for the BD xlator.
•  "?" (question mark) is the separator between the brick path and the VG name.

Page 25: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

Create a file that is backed by an LV (or simply use -v "lv" when thin-provisioning is not needed):

# touch /mnt/glusterfs/bd0/lv0
# setfattr -n "user.glusterfs.bd" -v "thin:1024MB" /mnt/glusterfs/bd0/lv0
# lvdisplay bd0-vg
--- Logical volume ---
LV Name lvol1
VG Name bd0-vg
LV UUID PSAFkr-Vyr8-fkGU-kDnA-rWUF-fFFT-111Snr
LV Write Access read/write
LV Creation host, time sieben.infinibridge.net, 2014-05-18 14:38:21 +0900
LV Pool transaction ID 1
LV Pool metadata lvol1_tmeta
LV Pool data lvol1_tdata
LV Pool chunk size 64.00 KiB
LV Zero new blocks yes
LV Status available
# open 0
LV Size 1000.00 MiB
Allocated pool data 0.00%
Allocated metadata 0.98%
Current LE 250
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:5

Page 26: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

--- Logical volume ---
LV Path /dev/bd0-vg/a9790eba-ffbf-4d9c-a674-e02c61ece935
LV Name a9790eba-ffbf-4d9c-a674-e02c61ece935
VG Name bd0-vg
LV UUID Z4HtWM-W0jk-YiK5-66ED-zOMw-YhFp-nrnRUU
LV Write Access read/write
LV Creation host, time sieben.infinibridge.net, 2014-05-18 14:47:31 +0900
LV Pool name lvol1
LV Status available
# open 0
LV Size 1.00 GiB
Mapped size 0.00%
Current LE 256
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:9

Page 27: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

Here we create nine more LVs in the same way.

# for i in `seq 1 9`; do touch /mnt/glusterfs/bd0/lv$i; setfattr -n "user.glusterfs.bd" -v "thin:1024MB" /mnt/glusterfs/bd0/lv$i; done
# lvdisplay -C bd0-vg
LV                                   VG     Attr       LSize    Pool  Origin Data% Move Log Cpy%Sync Convert
39b82644-f8ef-435d-b14e-d199a7e264fa bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
6002ddb2-28f1-463c-8666-f683fe2441ed bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
69993340-d691-4502-a9d5-375b8be0fb9e bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
82af50a2-0124-41d8-a887-d8c30427a663 bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
996969dd-3e32-491b-95d1-f279e6808d5b bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
a19ac2af-94df-4d01-b7c3-bbfcbfe5d09e bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
a9790eba-ffbf-4d9c-a674-e02c61ece935 bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
d6fd964a-67f8-4d48-96d1-343bed4ee792 bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
ea58b011-3a41-4bf0-9fe6-3862e24b86f6 bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
f7df48e5-09b1-4314-b729-1f38e5ceec2e bd0-vg Vwi-a-tz-- 1.00g    lvol1        0.00
lvol1                                bd0-vg twi-a-tz-- 1000.00m              0.00

Page 28: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

Create a mount point for each LV:

# mkdir /mnt/bd0-lv/{39b82644-f8ef-435d-b14e-d199a7e264fa,6002ddb2-28f1-463c-8666-f683fe2441ed,69993340-d691-4502-a9d5-375b8be0fb9e,82af50a2-0124-41d8-a887-d8c30427a663,996969dd-3e32-491b-95d1-f279e6808d5b,a19ac2af-94df-4d01-b7c3-bbfcbfe5d09e,a9790eba-ffbf-4d9c-a674-e02c61ece935,d6fd964a-67f8-4d48-96d1-343bed4ee792,ea58b011-3a41-4bf0-9fe6-3862e24b86f6,f7df48e5-09b1-4314-b729-1f38e5ceec2e}
# ls /mnt/bd0-lv
39b82644-f8ef-435d-b14e-d199a7e264fa  a19ac2af-94df-4d01-b7c3-bbfcbfe5d09e
6002ddb2-28f1-463c-8666-f683fe2441ed  a9790eba-ffbf-4d9c-a674-e02c61ece935
69993340-d691-4502-a9d5-375b8be0fb9e  d6fd964a-67f8-4d48-96d1-343bed4ee792
82af50a2-0124-41d8-a887-d8c30427a663  ea58b011-3a41-4bf0-9fe6-3862e24b86f6
996969dd-3e32-491b-95d1-f279e6808d5b  f7df48e5-09b1-4314-b729-1f38e5ceec2e

Format each LV with XFS and mount it:

# for x in 39b82644-f8ef-435d-b14e-d199a7e264fa 6002ddb2-28f1-463c-8666-f683fe2441ed 69993340-d691-4502-a9d5-375b8be0fb9e 82af50a2-0124-41d8-a887-d8c30427a663 996969dd-3e32-491b-95d1-f279e6808d5b a19ac2af-94df-4d01-b7c3-bbfcbfe5d09e a9790eba-ffbf-4d9c-a674-e02c61ece935 d6fd964a-67f8-4d48-96d1-343bed4ee792 ea58b011-3a41-4bf0-9fe6-3862e24b86f6 f7df48e5-09b1-4314-b729-1f38e5ceec2e; do mkfs.xfs -i size=512 /dev/bd0-vg/$x && mount -t xfs /dev/bd0-vg/$x /mnt/bd0-lv/$x; done

Page 29: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

# df -h | grep bd0-lv
/dev/dm-13 1014M 33M 982M 4% /mnt/bd0-lv/39b82644-f8ef-435d-b14e-d199a7e264fa
/dev/dm-16 1014M 33M 982M 4% /mnt/bd0-lv/6002ddb2-28f1-463c-8666-f683fe2441ed
/dev/dm-18 1014M 33M 982M 4% /mnt/bd0-lv/69993340-d691-4502-a9d5-375b8be0fb9e
/dev/dm-11 1014M 33M 982M 4% /mnt/bd0-lv/82af50a2-0124-41d8-a887-d8c30427a663
/dev/dm-12 1014M 33M 982M 4% /mnt/bd0-lv/996969dd-3e32-491b-95d1-f279e6808d5b
/dev/dm-17 1014M 33M 982M 4% /mnt/bd0-lv/a19ac2af-94df-4d01-b7c3-bbfcbfe5d09e
/dev/dm-9 1014M 33M 982M 4% /mnt/bd0-lv/a9790eba-ffbf-4d9c-a674-e02c61ece935
/dev/dm-14 1014M 33M 982M 4% /mnt/bd0-lv/d6fd964a-67f8-4d48-96d1-343bed4ee792
/dev/dm-15 1014M 33M 982M 4% /mnt/bd0-lv/ea58b011-3a41-4bf0-9fe6-3862e24b86f6
/dev/dm-10 1014M 33M 982M 4% /mnt/bd0-lv/f7df48e5-09b1-4314-b729-1f38e5ceec2e
# mount | grep bd0-lv
/dev/mapper/bd0--vg-39b82644--f8ef--435d--b14e--d199a7e264fa on /mnt/bd0-lv/39b82644-f8ef-435d-b14e-d199a7e264fa type xfs (rw)
/dev/mapper/bd0--vg-6002ddb2--28f1--463c--8666--f683fe2441ed on /mnt/bd0-lv/6002ddb2-28f1-463c-8666-f683fe2441ed type xfs (rw)
/dev/mapper/bd0--vg-69993340--d691--4502--a9d5--375b8be0fb9e on /mnt/bd0-lv/69993340-d691-4502-a9d5-375b8be0fb9e type xfs (rw)
/dev/mapper/bd0--vg-82af50a2--0124--41d8--a887--d8c30427a663 on /mnt/bd0-lv/82af50a2-0124-41d8-a887-d8c30427a663 type xfs (rw)
/dev/mapper/bd0--vg-996969dd--3e32--491b--95d1--f279e6808d5b on /mnt/bd0-lv/996969dd-3e32-491b-95d1-f279e6808d5b type xfs (rw)
/dev/mapper/bd0--vg-a19ac2af--94df--4d01--b7c3--bbfcbfe5d09e on /mnt/bd0-lv/a19ac2af-94df-4d01-b7c3-bbfcbfe5d09e type xfs (rw)
/dev/mapper/bd0--vg-a9790eba--ffbf--4d9c--a674--e02c61ece935 on /mnt/bd0-lv/a9790eba-ffbf-4d9c-a674-e02c61ece935 type xfs (rw)
/dev/mapper/bd0--vg-d6fd964a--67f8--4d48--96d1--343bed4ee792 on /mnt/bd0-lv/d6fd964a-67f8-4d48-96d1-343bed4ee792 type xfs (rw)
/dev/mapper/bd0--vg-ea58b011--3a41--4bf0--9fe6--3862e24b86f6 on /mnt/bd0-lv/ea58b011-3a41-4bf0-9fe6-3862e24b86f6 type xfs (rw)
/dev/mapper/bd0--vg-f7df48e5--09b1--4314--b729--1f38e5ceec2e on /mnt/bd0-lv/f7df48e5-09b1-4314-b729-1f38e5ceec2e type xfs (rw)

Thanks to thin-provisioning, 10GB of block devices in total are created on the 2GB VG!
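The degree of over-provisioning can be worked out from the sizes used in this walkthrough: ten thin LVs of 1024 MiB each, a 2048 MiB PV/VG, and a 1000 MiB thin pool. A small sketch of that arithmetic (`overcommit_pct` is a hypothetical helper name):

```shell
#!/bin/sh
# overcommit_pct <lv_count> <lv_size_mib> <backing_size_mib>
# Prints the provisioned capacity as an integer percentage of the backing capacity.
overcommit_pct() {
    echo $(( $1 * $2 * 100 / $3 ))
}

overcommit_pct 10 1024 2048   # ten 1 GiB LVs against the 2 GiB VG        -> 500
overcommit_pct 10 1024 1000   # ten 1 GiB LVs against the 1000 MiB pool   -> 1024
```

So the LVs advertise five times the VG's size, and a little over ten times the size of the thin pool that actually backs them.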

Page 30: Trying and evaluating the new features of GlusterFS 3.5


Review: How to use BD xlator

The block devices are shared with GlusterFS as files.

[sechs]# mount -t glusterfs localhost:/bd0 /mnt/glusterfs/bd0
[sechs]# mount -t xfs -o loop /mnt/glusterfs/bd0/lv0
[sechs]# df -h | grep bd0-lv
1014M 33M 982M 4% /mnt/bd0-lv/lv1

[Figure: raw block device → physical volume → volume group → LVs. The LVs are converted into files with the lvm2 development library (BD volume: one file per LV), and the files are shared with GlusterFS. Snapshot and clone are available because the files are LVs.]

Page 31: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Before 3.5.0

[Figure: a client accessing two glusterfsd processes through AFR]

1. One of the backend storage devices failed!
2. R/W ops from a client
3. glusterfsd returned "Input/output error" or "Read-only filesystem" directly.

Page 32: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

After 3.5.0

[Figure: a client accessing two glusterfsd processes through AFR]

1. One of the backend storage devices failed!
2. glusterfsd writes logs and shuts itself down.
3. R/W ops from a client
4. The client gets no error and completes the operation.
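The idea behind the detection can be illustrated with a periodic stat() on the brick directory. The sketch below is not GlusterFS's implementation (per the brick log later in this deck, that lives in `posix_health_check_thread_proc` and ends with a SIGTERM); it only mimics the behaviour in shell. The function names and the `max_checks` parameter are hypothetical; the interval plays the role of `storage.health-check-interval`.

```shell
#!/bin/sh
# check_brick <dir>: succeeds only if stat() on the directory works.
check_brick() {
    stat "$1" >/dev/null 2>&1
}

# watch_brick <dir> <interval_seconds> <max_checks>
# Polls the directory up to <max_checks> times. On the first failed check it
# prints a message and returns 1; it returns 0 if every check passed.
watch_brick() {
    dir=$1
    interval=$2
    remaining=$3
    while [ "$remaining" -gt 0 ]; do
        if ! check_brick "$dir"; then
            echo "health-check failed on $dir, going down"
            return 1
        fi
        remaining=$((remaining - 1))
        if [ "$remaining" -gt 0 ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Example: two quick checks of a directory that exists everywhere
watch_brick / 0 2 && echo "brick directory is healthy"
```

A real brick process would stop serving instead of returning; here the return code stands in for that.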

Page 33: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Setup for test

# brick="/mnt/lv4/vol4"; gluster volume create vol4 eins:$brick zwei:$brick drei:$brick vier:$brick fuenf:$brick sechs:$brick
# gluster volume start vol4
# gluster volume set vol4 storage.health-check-interval 10
# gluster volume info vol4
Volume Name: vol4
Type: Distribute
Volume ID: 706122a9-44fc-4d1d-8c3b-97482d98b95c
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv4/vol4
Brick2: zwei:/mnt/lv4/vol4
Brick3: drei:/mnt/lv4/vol4
Brick4: vier:/mnt/lv4/vol4
Brick5: fuenf:/mnt/lv4/vol4
Brick6: sechs:/mnt/lv4/vol4
Options Reconfigured:
storage.health-check-interval: 10

Page 34: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Setup for test (contd.)

[sechs]# dmsetup table
vg0-swift: 0 209715200 linear 8:7 838862848
vg0-cinder: 0 209715200 linear 8:7 419432448
vg0-lv4: 0 209715200 linear 8:7 1468008448
vg0-lv3: 0 209715200 linear 8:7 1258293248
vg0-lv2: 0 209715200 linear 8:7 1048578048
vg0-lv1: 0 209715200 linear 8:7 209717248
vg0-lv0: 0 209715200 linear 8:7 2048
vg0-glance: 0 209715200 linear 8:7 629147648

Page 35: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Brick failure test

[sechs]# echo 0 209715200 error > dmsetup-error-target
[sechs]# dmsetup load vg0-lv4 dmsetup-error-target
[sechs]# dmsetup resume vg0-lv4
[sechs]# dmsetup table
vg0-swift: 0 209715200 linear 8:7 838862848
vg0-cinder: 0 209715200 linear 8:7 419432448
vg0-lv4: 0 209715200 error
vg0-lv3: 0 209715200 linear 8:7 1258293248
vg0-lv2: 0 209715200 linear 8:7 1048578048
vg0-lv1: 0 209715200 linear 8:7 209717248
vg0-lv0: 0 209715200 linear 8:7 2048
vg0-glance: 0 209715200 linear 8:7 629147648
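The error-target line does not have to be typed by hand; it can be derived from the existing `dmsetup table` output. A minimal sketch, assuming each device has a single-segment table as in this test; `error_table` and `sectors_to_gib` are hypothetical helper names, and the sample lines are copied from the slide.

```shell
#!/bin/sh
# error_table <device>: reads `dmsetup table` output on stdin and prints a
# table line that maps the same sector range of <device> to the "error" target.
error_table() {
    awk -v dev="$1:" '$1 == dev { print $2, $3, "error" }'
}

# sectors_to_gib <sectors>: 512-byte sectors to whole GiB (2097152 sectors per GiB)
sectors_to_gib() {
    echo $(( $1 / 2097152 ))
}

# Sample lines taken from the slide
sample_table='vg0-lv4: 0 209715200 linear 8:7 1468008448
vg0-lv3: 0 209715200 linear 8:7 1258293248'

printf '%s\n' "$sample_table" | error_table vg0-lv4   # -> 0 209715200 error
sectors_to_gib 209715200                              # -> 100
# On a live system: dmsetup table | error_table vg0-lv4 > dmsetup-error-target
```

The second helper shows that each LV in this test is 100 GiB.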

Page 36: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

var/log/glusterfs/bricks/mnt-lv4-vol4.log on the failed node

[2014-05-18 18:49:53.720594] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2014-05-18 18:50:04.238239] W [posix-helpers.c:1294:posix_health_check_thread_proc] 0-vol4-posix: stat() on /mnt/lv4/vol4 returned: Input/output error
[2014-05-18 18:50:04.238328] M [posix-helpers.c:1314:posix_health_check_thread_proc] 0-vol4-posix: health-check failed, going down
Message from syslogd@sechs at May 19 03:50:04 ...
glusterfsd: [2014-05-18 18:50:04.238328] M [posix-helpers.c:1314:posix_health_check_thread_proc] 0-vol4-posix: health-check failed, going down
[2014-05-18 18:50:34.238551] M [posix-helpers.c:1319:posix_health_check_thread_proc] 0-vol4-posix: still alive! -> SIGTERM
Message from syslogd@sechs at May 19 03:50:34 ...
glusterfsd: [2014-05-18 18:50:34.238551] M [posix-helpers.c:1319:posix_health_check_thread_proc] 0-vol4-posix: still alive! -> SIGTERM
[2014-05-18 18:50:34.238910] W [glusterfsd.c:1095:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x7f1144ebab7d] (-->/lib64/libpthread.so.0(+0x79d1) [0x7f114554d9d1] (-->/usr/local/glusterfs-3.5.0/sbin/glusterfsd(glusterfs_sigwaiter+0xf0) [0x4085af]))) 0-: received signum (15), shutting down

Page 37: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

syslog on the failed node

May 19 03:49:55 sechs kernel: XFS (dm-7): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 4096
May 19 03:49:57 sechs kernel: XFS (dm-7): metadata I/O error: block 0x6400108 ("xlog_iodone") error 5 buf count 4096
May 19 03:49:57 sechs kernel: XFS (dm-7): xfs_do_force_shutdown(0x2) called from line 1062 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa04dd131
May 19 03:49:57 sechs kernel: XFS (dm-7): Log I/O Error Detected. Shutting down filesystem
May 19 03:49:57 sechs kernel: XFS (dm-7): Please umount the filesystem and rectify the problem(s)
May 19 03:50:04 sechs glusterfsd: [2014-05-18 18:50:04.238328] M [posix-helpers.c:1314:posix_health_check_thread_proc] 0-vol4-posix: health-check failed, going down
Message from syslogd@sechs at May 19 03:50:04 ...
glusterfsd: [2014-05-18 18:50:04.238328] M [posix-helpers.c:1314:posix_health_check_thread_proc] 0-vol4-posix: health-check failed, going down
May 19 03:50:27 sechs kernel: XFS (dm-7): xfs_log_force: error 5 returned.
Message from syslogd@sechs at May 19 03:50:34 ...
glusterfsd: [2014-05-18 18:50:34.238551] M [posix-helpers.c:1319:posix_health_check_thread_proc] 0-vol4-posix: still alive! -> SIGTERM
May 19 03:50:34 sechs glusterfsd: [2014-05-18 18:50:34.238551] M [posix-helpers.c:1319:posix_health_check_thread_proc] 0-vol4-posix: still alive! -> SIGTERM
May 19 03:50:57 sechs kernel: XFS (dm-7): xfs_log_force: error 5 returned.

Page 38: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

gluster volume status

# gluster volume status vol4
Status of volume: vol4
Gluster process                     Port    Online  Pid
------------------------------------------------------------------------------
Brick eins:/mnt/lv4/vol4            49160   Y       2925
Brick zwei:/mnt/lv4/vol4            49159   Y       440
Brick drei:/mnt/lv4/vol4            49152   Y       32500
Brick vier:/mnt/lv4/vol4            49152   Y       32657
Brick fuenf:/mnt/lv4/vol4           49152   Y       24517
Brick sechs:/mnt/lv4/vol4           N/A     N       N/A
NFS Server on localhost             2049    Y       29535
NFS Server on zwei                  N/A     N       N/A
NFS Server on vier                  N/A     N       N/A
NFS Server on drei                  N/A     N       N/A
NFS Server on eins                  N/A     N       N/A
NFS Server on fuenf                 N/A     N       N/A
NFS Server on sechs                 N/A     N       N/A
Task Status of Volume vol4
------------------------------------------------------------------------------
There are no active volume tasks
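To spot failed bricks without reading the whole table, the Online column can be filtered. A sketch that assumes the column layout shown on this slide (Online is the second-to-last field); `offline_bricks` is a hypothetical helper name and the sample rows are copied from the slide.

```shell
#!/bin/sh
# Print the bricks whose Online column is "N" in `gluster volume status`
# output read from stdin. Only lines starting with "Brick" are considered.
offline_bricks() {
    awk '$1 == "Brick" && $(NF-1) == "N" { print $2 }'
}

# Sample rows taken from the slide
sample_status='Gluster process Port Online Pid
Brick eins:/mnt/lv4/vol4 49160 Y 2925
Brick fuenf:/mnt/lv4/vol4 49152 Y 24517
Brick sechs:/mnt/lv4/vol4 N/A N N/A
NFS Server on localhost 2049 Y 29535
NFS Server on zwei N/A N N/A'

printf '%s\n' "$sample_status" | offline_bricks
# On a live system: gluster volume status vol4 | offline_bricks
```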

Page 39: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Processes on the failed node

# ps -ef | grep glusterfsd | grep -v grep | wc -l
0

Page 40: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Restart glusterd (and glusterfsd) on the failed node

[sechs]# service glusterd restart

[2014-05-18 18:58:17.197872] I [glusterfsd.c:1959:main] 0-/usr/local/glusterfs-3.5.0/sbin/glusterfsd: Started running /usr/local/glusterfs-3.5.0/sbin/glusterfsd version 3.5git (/usr/local/glusterfs-3.5.0/sbin/glusterfsd -s sechs --volfile-id vol4.sechs.mnt-lv4-vol4 -p /var/lib/glusterd/vols/vol4/run/sechs-mnt-lv4-vol4.pid -S /var/run/23afc72b5ceddccd28b405b1cdf5b4df.socket --brick-name /mnt/lv4/vol4 -l /usr/local/glusterfs-3.5.0/var/log/glusterfs/bricks/mnt-lv4-vol4.log --xlator-option *-posix.glusterd-uuid=0765d288-a59b-4ccf-90ae-c3332c83dbf4 --brick-port 49152 --xlator-option vol4-server.listen-port=49152)
[2014-05-18 18:58:17.205310] I [socket.c:3561:socket_init] 0-socket.glusterfsd: SSL support is NOT enabled
[2014-05-18 18:58:17.205486] I [socket.c:3576:socket_init] 0-socket.glusterfsd: using system polling thread
[2014-05-18 18:58:17.205880] I [socket.c:3561:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-05-18 18:58:17.205949] I [socket.c:3576:socket_init] 0-glusterfs: using system polling thread
[2014-05-18 18:58:18.834910] I [graph.c:254:gf_add_cmdline_options] 0-vol4-server: adding option 'listen-port' for volume 'vol4-server' with value '49152'

Page 41: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Restart glusterd (and glusterfsd) on the failed node (contd.)

[2014-05-18 18:58:18.834976] I [graph.c:254:gf_add_cmdline_options] 0-vol4-posix: adding option 'glusterd-uuid' for volume 'vol4-posix' with value '0765d288-a59b-4ccf-90ae-c3332c83dbf4'
[2014-05-18 18:58:18.837332] I [rpcsvc.c:2064:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2014-05-18 18:58:18.837510] W [options.c:848:xl_opt_validate] 0-vol4-server: option 'listen-port' is deprecated, preferred is 'transport.socket.listen-port', continuing with correction
[2014-05-18 18:58:18.837572] I [socket.c:3561:socket_init] 0-tcp.vol4-server: SSL support is NOT enabled
[2014-05-18 18:58:18.837601] I [socket.c:3576:socket_init] 0-tcp.vol4-server: using system polling thread
[2014-05-18 18:58:18.838445] E [common-utils.c:93:mkdir_p] 0-: Failed due to reason Input/output error
[2014-05-18 18:58:18.838505] I [mem-pool.c:539:mem_pool_destroy] 0-vol4-changelog: size=108 max=0 total=0
[2014-05-18 18:58:18.838533] E [xlator.c:403:xlator_init] 0-vol4-changelog: Initialization of volume 'vol4-changelog' failed, review your volfile again
[2014-05-18 18:58:18.838561] E [graph.c:307:glusterfs_graph_init] 0-vol4-changelog: initializing translator failed
[2014-05-18 18:58:18.838610] E [graph.c:502:glusterfs_graph_activate] 0-graph: init failed

Page 42: Trying and evaluating the new features of GlusterFS 3.5


Brick Failure Detection

Restart glusterd (and glusterfsd) on the failed node (contd.)

[2014-05-18 18:58:18.839480] W [glusterfsd.c:1095:cleanup_and_exit] (-->/usr/local/glusterfs-3.5.0/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x1b5) [0x7f2981c837d8] (-->/usr/local/glusterfs-3.5.0/sbin/glusterfsd(mgmt_getspec_cbk+0x36a) [0x40cf77] (-->/usr/local/glusterfs-3.5.0/sbin/glusterfsd(glusterfs_process_volfp+0x18a) [0x408bf2]))) 0-: received signum (0), shutting down

Page 43: Trying and evaluating the new features of GlusterFS 3.5


Scalability Enhancement

Page 44: Trying and evaluating the new features of GlusterFS 3.5


Quota Scalability

Before 3.5.0

Directory quota limit = a few hundred per volume

Page 45: Trying and evaluating the new features of GlusterFS 3.5


Quota Scalability

After 3.5.0

Directory quota limit = 65536 per volume

Page 46: Trying and evaluating the new features of GlusterFS 3.5


Performance Enhancements

Page 47: Trying and evaluating the new features of GlusterFS 3.5


On-Wire Compression + Decompression

Write ops (FUSE client → storage pool):
1. open and write
2. Compression
3. Transport
4. Decompression and write to disk

Read ops (storage pool → FUSE client):
1. open and read
2. read and Compression
3. Transport
4. Decompression

Page 48: Trying and evaluating the new features of GlusterFS 3.5


On-Wire Compression + Decompression

# gluster volume create vol-comp eins:/mnt/lv3/vol-comp
# gluster volume set vol-comp network.compression on
# gluster volume set vol-comp network.compression.compression-level 8
# gluster volume set vol-comp network.compression.min-size 50
# gluster volume set vol-comp performance.write-behind off
# gluster volume set vol-comp performance.strict-write-ordering on
# gluster volume set vol-comp performance.open-behind off
# gluster volume info vol-comp
Volume Name: vol-comp
Type: Distribute
Volume ID: 92b47734-2552-4168-b3c3-151093562e4f
Status: Created
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv3/vol-comp
Options Reconfigured:
network.compression.min-size: 50
network.compression.compression-level: 8
performance.open-behind: off
performance.write-behind: off
performance.strict-write-ordering: on
network.compression.mode: server
network.compression: on

•  compression-level: -1: default compression (= 8), 0: no compression, 1: best speed, 9: best compression
•  min-size: data is compressed only when its size exceeds this value in bytes.
•  Turn off the performance translators to avoid Input/output error

On-Wire Compression + Decompression

# gluster volume start vol-comp
# mount -t glusterfs localhost:/vol-comp /mnt/glusterfs/vol-comp
# dd if=/dev/zero of=/mnt/glusterfs/vol-comp/1gb.dat bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 33.8606 s, 31.7 MB/s
# diff /mnt/glusterfs/vol-comp/1gb.dat /tmp/1gb.dat
#

Throughput was 31.7 MB/s with compression, versus 117 MB/s without compression.

The empty diff output shows that compression and decompression executed correctly.

•  CPU load on the client is higher than without network compression.
•  tcpdump showed that the 1 GB of zeros was compressed into non-zero data on the wire.
•  A high-end CPU might show better performance.
•  There are still issues and limitations:
   •  It cannot work with striped volumes.
   •  For glusterfs versions <= 3.5, it cannot work with AFR.
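To get a feel for why a stream of zeros shrinks so much before it reaches the wire, the sketch below compresses 1 MiB of zeros with gzip at level 8. It assumes the compression translator uses zlib's deflate, as gzip does; the helper name compressed_size is made up for illustration, and nothing here touches a Gluster volume.

```shell
#!/bin/sh
# Hypothetical helper: print how many bytes N bytes of zeros occupy after
# gzip compression at the given level (1 = best speed, 9 = best compression).
compressed_size() {
    bytes=$1
    level=$2
    head -c "$bytes" /dev/zero | gzip "-$level" | wc -c | tr -d ' '
}

orig=1048576
comp=$(compressed_size "$orig" 8)
echo "original: $orig bytes, compressed: $comp bytes"
```

Real data compresses far less than zeros, so the saving in transferred bytes on a production workload will be smaller, while the CPU cost remains.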

readdir_ahead

Before 3.5.0

volume: read-ahead

Sequential file access can be fast, but sequential directory access like "ls" cannot.

readdir_ahead

After 3.5.0

volume: readdir-ahead, read-ahead

Sequential reads of large directories can complete faster!

readdir_ahead

How-to

# gluster volume set vol0 readdir-ahead enable
volume set: success
# gluster volume info vol0
Volume Name: vol0
Type: Distribute
Volume ID: cf9db2aa-5ee8-40c3-8ca9-8316ab31ba59
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv0/vol0
Brick2: zwei:/mnt/lv0/vol0
Options Reconfigured:
performance.readdir-ahead: enable

readdir-ahead is disabled by default.

readdir_ahead

Setup for evaluation

# brick="/mnt/lv4/vol4"; gluster volume create vol4 eins:$brick zwei:$brick drei:$brick vier:$brick fuenf:$brick sechs:$brick
# gluster volume start vol4
# mount -t glusterfs localhost:/vol4 /mnt/glusterfs/vol4
# mkdir /mnt/glusterfs/vol4/manyfiles
# for a in `seq 0 9`; do for b in `seq 0 9`; do for c in `seq 0 9`; do
    for d in `seq 0 9`; do for e in `seq 0 9`; do for f in `seq 0 9`; do
    for g in `seq 0 9`; do for h in `seq 0 9`; do for i in `seq 0 9`; do
      file="/mnt/glusterfs/vol4/manyfiles/8kb${a}${b}${c}${d}${e}${f}${g}${h}${i}.dat"
      echo ${file}
      dd if=/dev/zero of=${file} bs=1K count=8
      if [ $? -ne 0 ]; then break; fi
    done; done; done; done; done; done; done; done; done
...
^C
# df -ki /mnt/glusterfs/vol4
Filesystem Inodes IUsed IFree IUse% Mounted on
localhost:vol4 314572800 3394646 311178154 2% /mnt/glusterfs/vol4
# umount /mnt/glusterfs/vol4

Result: about 3 million 8 KB files.
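The nested loops above generate nine-digit, zero-padded file names. As a sketch, a single counter with printf produces the same names and stops at the first failed write, whereas the break above only leaves the innermost loop. The helper name make_files and the temporary directory are illustrative assumptions; point it at the mounted volume to reproduce the setup.

```shell
#!/bin/sh
# Hypothetical helper: create COUNT files of 8 KiB each in DIR, named
# 8kb000000000.dat, 8kb000000001.dat, ... Stops at the first failed write.
make_files() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        file="$dir/8kb$(printf '%09d' "$i").dat"
        dd if=/dev/zero of="$file" bs=1024 count=8 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Small demonstration in a temporary directory instead of a Gluster mount.
demo_dir=$(mktemp -d)
make_files "$demo_dir" 5
ls "$demo_dir"
rm -rf "$demo_dir"
```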

readdir_ahead

Evaluation

Without readdir-ahead:

# mount -t glusterfs localhost:/vol4 /mnt/glusterfs/vol4
# for i in `seq 0 2`; do time ls /mnt/glusterfs/vol4/manyfiles > /dev/null; done
26.24s user 18.70s system 6% cpu 12:03.05 total
26.58s user 12.10s system 5% cpu 11:45.92 total
26.53s user 21.61s system 5% cpu 14:14.75 total
# umount /mnt/glusterfs/vol4

With readdir-ahead:

# gluster volume stop vol4 && gluster volume start vol4
# gluster volume set vol4 readdir-ahead enable
# mount -t glusterfs localhost:/vol4 /mnt/glusterfs/vol4
# for i in `seq 0 2`; do time ls /mnt/glusterfs/vol4/manyfiles > /dev/null; done
26.24s user 17.97s system 11% cpu 6:25.09 total
26.58s user 22.36s system 10% cpu 8:02.83 total
26.57s user 22.83s system 10% cpu 8:13.01 total
# gluster volume reset vol4
# umount /mnt/glusterfs/vol4

1.68 times faster!
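The 1.68 figure can be reproduced from the six wall-clock times above by comparing the mean of the three runs without readdir-ahead to the mean of the three runs with it. The sketch below does that arithmetic with awk; the function names are illustrative, and averaging the runs is an assumption about how the figure was derived.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# Convert a "minutes:seconds" wall-clock time (as printed by `time`) to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f", $1 * 60 + $2 }'
}

# Print the mean of the wall-clock times given as arguments, in seconds.
mean_seconds() {
    total=0
    for t in "$@"; do
        total=$(echo "$total $(to_seconds "$t")" | awk '{ printf "%.2f", $1 + $2 }')
    done
    echo "$total $#" | awk '{ printf "%.2f", $1 / $2 }'
}

before=$(mean_seconds 12:03.05 11:45.92 14:14.75)   # readdir-ahead disabled
after=$(mean_seconds 6:25.09 8:02.83 8:13.01)       # readdir-ahead enabled
speedup=$(echo "$before $after" | awk '{ printf "%.2f", $1 / $2 }')
echo "mean before: ${before}s, mean after: ${after}s, speedup: ${speedup}x"
```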

Stability Enhancements

Prevent NFS restart on Volume change (Part 1)

Gluster NFS Graph

A single nfs/server exists on top of all the volumes.

nfs/server
    option nfs3.vol4.volume-id 706122a9-44fc-4d1d-8c3b-97482d98b95c
    option rpc-auth.addr.vol4.allow *
    option nfs3.vol-gfid-access.volume-id 73abf812-4fff-42bd-822b-3036b72f060d
    option rpc-auth.addr.vol-gfid-access.allow *
    option nfs3.vol2.volume-id d0517697-5372-44a1-960f-6db0d988f3b2
    option rpc-auth.addr.vol2.allow *
    option nfs3.vol-comp.volume-id 92b47734-2552-4168-b3c3-151093562e4f
    option rpc-auth.addr.vol-comp.allow *
    option nfs3.vol1.volume-id ba03d1e6-a520-4e7f-ac4c-2440a205e80e
    option rpc-auth.addr.vol1.allow *
    option nfs3.vol0.volume-id cf9db2aa-5ee8-40c3-8ca9-8316ab31ba59
    option rpc-auth.addr.vol0.allow *
    option nfs.drc on
    option nfs.nlm on
    option nfs.dynamic-volumes on

vol0 (debug/io-stats)
    vol0-write-behind (performance/write-behind)
        vol0-dht (cluster/distribute)
            vol0-client-0 (protocol/client)
            vol0-client-1 (protocol/client)

vol1 (debug/io-stats)
    vol1-write-behind (performance/write-behind)
        vol1-dht (cluster/distribute)
            vol1-client-0 (protocol/client)
            vol1-client-1 (protocol/client)

vol2 (debug/io-stats)
    vol2-write-behind (performance/write-behind)
        vol2-dht (cluster/distribute)
            vol2-client-0 (protocol/client)
            vol2-client-1 (protocol/client)

My presentation last year (2013)

•  NFS and Multi-tenancy
   •  'nfs.rpc-auth-allow' for multi-tenancy
   •  some operations on a volume affect IOs to other volumes

[Figure: IO to Vol0, Vol1 and Vol2; an operation on one volume (e.g. gluster volume set ...) affects the IO to the other volumes.]

Prevent NFS restart on Volume change (Part 1)

•  "Some operations" on a volume
   •  gluster volume {set|reset} <volumeName> nfs.rpc-auth-allow
   •  gluster volume {start|stop} <volumeName>
   •  gluster volume add-brick
   •  gluster volume remove-brick <volumeName> <brick1> ... <brickn> commit

Prevent NFS restart on Volume change (Part 1)

•  The following internal NFS options are no longer affected by volume changes.
   •  nfs.readdir-size
   •  nfs.nlm
   •  nfs.acl
   •  nfs.mount-rmtab
   •  nfs.drc
   •  nfs.drc-size
   •  nfs.read-size
   •  nfs.write-size
   •  nfs.readdir-size
   •  nfs.export-dir
   •  nfs.export-dirs
   •  nfs.enable-ino32
   •  nfs.export-volumes
   •  nfs.addr-namelookup
   •  nfs.outstanding-rpc-limit
   •  nfs.mount-mtab
   •  nfs.register-with-portmap

Geo-Replication Enhancement

Before 3.5.0

[Figure: a storage pool (a cluster) served by a single gsyncd, which is a SPOF!]

•  identify file changes with xattrs
•  directory crawl with rsync

Geo-Replication Enhancement

After 3.5.0

[Figure: a storage pool (a cluster) with a gsyncd for each peer]

•  identify file changes with changelog in memory

Geo-Replication Enhancement

Changelog

# cat /var/lib/glusterd/vols/vol0/vol0.eins.mnt-lv0-vol0.vol
volume vol0-posix
    type storage/posix
    option volume-id cf9db2aa-5ee8-40c3-8ca9-8316ab31ba59
    option directory /mnt/lv0/vol0
end-volume

volume vol0-changelog
    type features/changelog
    option changelog-dir /mnt/lv0/vol0/.glusterfs/changelogs
    option changelog-brick /mnt/lv0/vol0
    subvolumes vol0-posix
end-volume
...
volume vol0-server
    type protocol/server
    option auth.addr./mnt/lv0/vol0.allow *
    option auth.login.863ccc05-1ba2-47cc-8a15-240ad4e8c736.password c8d200d6-db0b-4f87-be0f-664e08f4ceee
    option auth.login./mnt/lv0/vol0.allow 863ccc05-1ba2-47cc-8a15-240ad4e8c736
    option transport-type tcp
    subvolumes /mnt/lv0/vol0
end-volume

Geo-Replication Enhancement

Changelog (contd.)

# ls -a /mnt/lv0/vol0/.glusterfs/changelogs
. ..

The changelog directory is empty. Is it of no use without gsyncd???

Security Enhancement

Disk encryption

Write ops (FUSE client to storage pool)

1. open and write
2. Encryption
3. Transport
4. Write the encrypted data to disk

Read ops (storage pool to FUSE client)

1. open and read
2. read from underlying disks
3. Transport
4. Decryption

Disk encryption

Setup

# gluster volume info
Volume Name: vol2
Type: Replicate
Volume ID: e0332771-a3c2-4fe5-980c-b3860cfe3baf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv2/vol2
Brick2: zwei:/mnt/lv2/vol2
# gluster volume set vol2 encryption on
volume set: success
# for x in quick-read write-behind open-behind; do gluster volume set vol2 performance.$x off; done
# gluster volume set vol2 encryption.master-key /var/lib/glusterd/vols/vol2/encryption.master-key
# openssl rand -hex 32 > /var/lib/glusterd/vols/vol2/encryption.master-key
# gluster volume set vol2 encryption.data-key-size 512
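The master key in the setup above is 32 random bytes written as 64 hex characters. As a sketch, the key file can be created with owner-only permissions and checked before it is handed to the volume. The helper name generate_master_key, the fallback to /dev/urandom, and the temporary path are assumptions for illustration, not part of the GlusterFS tooling.

```shell
#!/bin/sh
# Hypothetical helper: write a random 256-bit key, hex-encoded, to the given
# path with owner-only permissions. Falls back to /dev/urandom without openssl.
generate_master_key() {
    keyfile=$1
    (
        umask 077
        if command -v openssl >/dev/null 2>&1; then
            openssl rand -hex 32 > "$keyfile"
        else
            head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n' > "$keyfile"
            echo >> "$keyfile"
        fi
    )
}

# Demonstration with a temporary file instead of the real key location.
keyfile=$(mktemp)
generate_master_key "$keyfile"
echo "key length: $(tr -d '\n' < "$keyfile" | wc -c | tr -d ' ') hex characters"
rm -f "$keyfile"
```

Generating the key before pointing encryption.master-key at it avoids a window in which the option refers to a file that does not exist yet.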

Disk encryption

Setup (contd.)

# gluster volume info
Volume Name: vol2
Type: Replicate
Volume ID: e0332771-a3c2-4fe5-980c-b3860cfe3baf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv2/vol2
Brick2: zwei:/mnt/lv2/vol2
Options Reconfigured:
encryption.data-key-size: 512
encryption.master-key: /var/lib/glusterd/vols/vol2/encryption.master-key
performance.open-behind: off
performance.write-behind: off
performance.quick-read: off
features.encryption: on
# mount -t glusterfs -o xlator-option=vol2-crypt.master-key=/var/lib/glusterd/vols/vol2/encryption.master-key localhost:/vol2 /mnt/glusterfs/vol2

Disk encryption

Encryption test

# echo "test" > /mnt/glusterfs/vol2/test.txt
# cat /mnt/glusterfs/vol2/test.txt
test
[eins]# cat /mnt/lv2/vol2/test.txt
Zd??]K!q??tuv
[zwei]# cat /mnt/lv2/vol2/test.txt
Zd??]K!q??tuv

ASCII files on the bricks are encrypted.

# dd if=/dev/zero of=/mnt/glusterfs/vol1/test.dat bs=1 count=32
# dd if=/dev/zero of=/mnt/glusterfs/vol2/test.dat bs=1 count=32
[eins]# dd if=/dev/zero of=/tmp/test.dat bs=1 count=32
[eins]# diff /tmp/test.dat /mnt/lv2/vol2/test.dat
Binary files /tmp/test.dat and /mnt/lv2/vol2/test.dat differ
[eins]# diff /tmp/test.dat /mnt/lv1/vol1/test.dat
#

Binary files on the bricks are also encrypted.

This also confirms that a volume without encryption (vol1) never encrypts the data, so the raw data on its bricks can be read directly.

# tcpdump -i eth0 -XX

The transported zeroed data can be seen fully encrypted.

Disk encryption

Decryption test

# dd if=/dev/zero of=/tmp/1gb.dat bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.61505 s, 297 MB/s
# diff3 /tmp/1gb.dat /mnt/glusterfs/vol1/1gb.dat /mnt/glusterfs/vol2/1gb.dat
#

Perfect!

Disk encryption

Performance test

# dd if=/dev/zero of=/mnt/glusterfs/vol1/1gb.dat bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 18.4542 s, 58.2 MB/s
# dd if=/dev/zero of=/mnt/glusterfs/vol2/1gb.dat bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 263.633 s, 4.1 MB/s
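To put the two dd results side by side, the sketch below recomputes the throughput from the byte count and elapsed times reported above and derives the slowdown factor of the encrypted volume (vol2) relative to the plain one (vol1). The function name throughput_mb is illustrative; the numbers are taken from the dd output.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# Throughput in MB/s (1 MB = 10^6 bytes, as dd reports) for a byte count and
# an elapsed time in seconds.
throughput_mb() {
    echo "$1 $2" | awk '{ printf "%.1f", $1 / $2 / 1000000 }'
}

bytes=1073741824
plain=$(throughput_mb "$bytes" 18.4542)    # vol1, no encryption
crypt=$(throughput_mb "$bytes" 263.633)    # vol2, encryption on
slowdown=$(echo "263.633 18.4542" | awk '{ printf "%.1f", $1 / $2 }')
echo "plain: ${plain} MB/s, encrypted: ${crypt} MB/s, slowdown: ${slowdown}x"
```

In this single-run measurement the encrypted write takes roughly 14 times as long; repeated runs would be needed for a firmer figure.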

Disk encryption

Does it work with NFS? (No!)

# mount -t nfs -o vers=3,hard,intr,nosuid localhost:/vol2 /mnt/nfs/vol2
mount.nfs: Connection timed out

Disk encryption

Compromising with the same MK (master key)

# cp /var/lib/glusterd/vols/vol2/encryption.master-key /tmp
# mount -t glusterfs -o xlator-option=vol2-crypt.master-key=/tmp/encryption.master-key localhost:/vol2 /mnt/glusterfs/vol-crypt
# diff /mnt/glusterfs/vol-crypt/test.txt /tmp/test.txt
#

Disk encryption

Compromising with a different MK keeping mounted

# openssl rand -hex 32 > /tmp/encryption.master-key
# diff /mnt/glusterfs/vol-crypt/test.txt /tmp/test.txt
#

Disk encryption

Compromising with an invalid MK

# umount /mnt/glusterfs/vol-crypt
# mount -t glusterfs -o xlator-option=vol2-crypt.master-key=/tmp/encryption.master-key localhost:/vol2 /mnt/glusterfs/vol-crypt
# diff /mnt/glusterfs/vol-crypt/test.txt /tmp/test.txt
diff: /mnt/glusterfs/vol-crypt/test.txt: Invalid argument
# ls -lh /mnt/glusterfs/vol-crypt
total 1.1G
-rw-r--r-- 1 root root 1.0G May 18 23:31 1gb.dat
-rw-r--r-- 1 root root 32 May 18 22:57 test.dat
-rw-r--r-- 1 root root 5 May 18 22:55 test.txt
# cp /mnt/glusterfs/vol-crypt/test.txt ~/
cp: reading `/mnt/glusterfs/vol-crypt/test.txt': Invalid argument
# ls -l ~/test.txt
-rw-r--r-- 1 root root 0 May 19 00:38 /root/test.txt

Disk encryption

Compromising with an invalid MK (contd.)

# echo "test2" > /mnt/glusterfs/vol-crypt/test2.txt
# cat /mnt/glusterfs/vol-crypt/test2.txt
test2
# diff /mnt/glusterfs/vol-crypt/test2.txt /tmp/test2.txt
#

A file can be written with an invalid MK. (Is it okay?)

# \rm /mnt/glusterfs/vol-crypt/test.txt
rm: cannot remove `/mnt/glusterfs/vol-crypt/test.txt': Invalid argument
# ls -lh /mnt/glusterfs/vol-crypt
total 1.1G
-rw-r--r-- 1 root root 1.0G May 18 23:31 1gb.dat
-rw-r--r-- 1 root root 6 May 19 00:39 test2.txt
-rw-r--r-- 1 root root 32 May 18 22:57 test.dat
-rw-r--r-- 1 root root 5 May 18 22:55 test.txt
# \rm /mnt/glusterfs/vol-crypt/test2.txt
# ls -lh /mnt/glusterfs/vol-crypt
total 1.1G
-rw-r--r-- 1 root root 1.0G May 18 23:31 1gb.dat
-rw-r--r-- 1 root root 32 May 18 22:57 test.dat
-rw-r--r-- 1 root root 5 May 18 22:55 test.txt

Disk encryption

Compromising with an invalid MK (contd.)

# mv /mnt/glusterfs/vol-crypt/test.txt /mnt/glusterfs/vol-crypt/test2.txt
mv: cannot move `/mnt/glusterfs/vol-crypt/test.txt' to a subdirectory of itself, `/mnt/glusterfs/vol-crypt/test2.txt'

# umount /mnt/glusterfs/vol-crypt
# mount -t glusterfs -o xlator-option=vol2-crypt.master-key=/var/lib/glusterd/vols/vol2/encryption.master-key localhost:/vol2 /mnt/glusterfs/vol-crypt
# ls -lh /mnt/glusterfs/vol-crypt
total 1.1G
-rw-r--r-- 1 root root 1.0G May 18 23:31 1gb.dat
-rw-r--r-- 1 root root 6 May 19 00:44 test2.txt
-rw-r--r-- 1 root root 32 May 18 22:57 test.dat
-rw-r--r-- 1 root root 5 May 18 22:55 test.txt
# cat /mnt/glusterfs/vol-crypt/test2.txt
cat: /mnt/glusterfs/vol-crypt/test2.txt: Invalid argument
# \rm /mnt/glusterfs/vol-crypt/test2.txt
rm: cannot remove `/mnt/glusterfs/vol-crypt/test2.txt': Invalid argument

A user with the proper MK cannot handle the file created with the invalid MK. (Is it okay?)

Disk encryption

Compromising with volume reset

# gluster volume info vol2
Volume Name: vol2
Type: Replicate
Volume ID: e0332771-a3c2-4fe5-980c-b3860cfe3baf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv2/vol2
Brick2: zwei:/mnt/lv2/vol2
Options Reconfigured:
encryption.data-key-size: 512
encryption.master-key: /var/lib/glusterd/vols/vol2/encryption.master-key
performance.open-behind: off
performance.write-behind: off
performance.quick-read: off
features.encryption: on
# gluster volume reset vol2
volume reset: success: reset volume successful

Disk encryption

Compromising with volume reset (contd.)

# gluster volume info vol2
Volume Name: vol2
Type: Replicate
Volume ID: e0332771-a3c2-4fe5-980c-b3860cfe3baf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: eins:/mnt/lv2/vol2
Brick2: zwei:/mnt/lv2/vol2

# cat /mnt/glusterfs/vol-crypt/test2.txt
U�%U?0��x^-�bO
# cat /mnt/glusterfs/vol-crypt/test.txt
Zd��]K!q�tuv

After the reset, the client appears to return the raw encrypted data. Might this be a way of cracking?

Enhancement for Developers

GFID Access

A Volume
A Volume/.gfid
    62fe0d4f-dfe9-4a2d-b811-176c6d347a7c
    52fd93ea-45ba-47f0-916a-bd3774239237
    e5593949-79c3-463c-909b-8cc8ef014eb4
    bf758c70-ff2b-4f0d-bfc9-860ece79c246
    70aaacf9-1c09-44e2-97a2-9486adf10225
    e5498fc4-7345-4f5f-af59-81acff1fd083
    f6a608ed-0c68-4d1a-a4d7-fb375ba8fd63
    d420cbb3-c1e8-47d3-b317-0c8afbc7a8c4
    a76dd563-e878-45a0-ac48-59084d86bd0c
    f9dbc760-c8f3-41e6-8d24-68b24c4c577b
    5c0374a8-18fe-4dd6-89e7-f6551111d980

You can deal with each file by GFID.

Single namespace, just under the mount point.

GFID Access

# brick="/mnt/lv3/vol-gfid-access"; gluster volume create vol-gfid-access eins:$brick zwei:$brick
# gluster volume start vol-gfid-access
# mkdir /mnt/glusterfs/vol-gfid-access
# mount.glusterfs -o aux-gfid-mount localhost:/vol-gfid-access /mnt/glusterfs/vol-gfid-access
# for i in `seq 0 9`; do dd if=/dev/zero of=/mnt/glusterfs/vol-gfid-access/$i.dat bs=1M count=1; done
# ls -a /mnt/glusterfs/vol-gfid-access/.gfid
ls: cannot open directory /mnt/glusterfs/vol-gfid-access/.gfid: Stale file handle
# ls -a '/mnt/glusterfs/vol-gfid-access/.gfid/0svu9Cc1wVRLOBiu5NqF3ncw=='
ls: cannot access /mnt/glusterfs/vol-gfid-access/.gfid/0svu9Cc1wVRLOBiu5NqF3ncw==: No such file or directory
# ls -ld /mnt/glusterfs/vol-gfid-access/.gfid/
drwxr-xr-x 3 root root 166 May 19 03:03 /mnt/glusterfs/vol-gfid-access/.gfid/

GFID Access

# stat /mnt/glusterfs/vol-gfid-access/.gfid/
  File: `/mnt/glusterfs/vol-gfid-access/.gfid/'
  Size: 166  Blocks: 0  IO Block: 131072  directory
Device: 16h/22d  Inode: 13  Links: 3
Access: (0755/drwxr-xr-x)  Uid: ( 0/ root)  Gid: ( 0/ root)
Access: 2014-05-19 03:03:14.146605880 +0900
Modify: 2014-05-19 03:03:04.968605874 +0900
Change: 2014-05-19 03:03:04.968605874 +0900
# strace ls -a /mnt/glusterfs/vol-gfid-access/.gfid
...
stat("/mnt/glusterfs/vol-gfid-access/.gfid", {st_mode=S_IFDIR|0755, st_size=166, ...}) = 0
open("/mnt/glusterfs/vol-gfid-access/.gfid", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ESTALE (Stale file handle)
...

•  How can I make it work?
•  Once it works, applications using GlusterFS can manage their data in a single namespace.
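One possible reason for the "No such file or directory" error in the ls attempt above is the form of the name: the string starting with 0s looks like the base64 encoding that getfattr prints by default, while the entries under .gfid are shown elsewhere in these slides as canonical UUID strings. This is an assumption, not something confirmed here. The sketch below converts the base64 form into the UUID form; the helper name gfid_b64_to_uuid is made up, and it relies on base64, od and sed being available.

```shell
#!/bin/sh
# Hypothetical helper: convert a GFID in getfattr's default base64 form
# (prefix "0s") into the canonical UUID form.
gfid_b64_to_uuid() {
    printf '%s' "${1#0s}" | base64 -d | od -An -tx1 | tr -d ' \n' |
        sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/'
}

uuid=$(gfid_b64_to_uuid '0svu9Cc1wVRLOBiu5NqF3ncw==')
echo "$uuid"
# On an aux-gfid-mount one could then try, for example:
#   stat "/mnt/glusterfs/vol-gfid-access/.gfid/$uuid"
```

Whether listing the .gfid directory itself is meant to work is a separate question that this conversion does not answer.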

Additional news

Added files

•  cli/src/cli-quotad-client
•  contrib/qemu (related qemu codes)
•  error-codes.json
•  extras
   •  geo-rep
   •  glusterfs-georep-logrotate
   •  gluster-rsyslog-*.conf
   •  hook-scripts/add-brick
   •  logger.conf.example
   •  post-upgrade-script-for-quota.sh
   •  pre-upgrade-script-for-quota.sh
•  geo-replication
•  gf-error-codes.h.template
•  libgfchangelog.pc.in
•  libglusterfs/src
   •  client_t
   •  glusterfs-acl
   •  timespec
•  rpc/rpc-lib/src/rpc-drc
•  run-tests.sh
•  tests (a lot of test codes!)
•  xlators
   •  cluster
      •  dht/src
         •  dht-shared.c
   •  encryption
      •  crypt
   •  features
      •  changelog
      •  compress
      •  gfid-access
      •  glupy (glupy has merged!)
      •  qemu-block
      •  quota
         •  quota-enforcer-client.c
         •  quotad-aggregator
         •  quotad-helpers
   •  performance
      •  readdir-ahead
   •  playground (template for xlator development)
   •  storage
      •  bd (replacement of bd_map)

Conclusion

Conclusion

•  Update for "everyone"
   •  12 features, 8 categories.
•  Contribution by HekaFS
   •  Disk encryption has been one of my dreams since 2.0.2.
•  Voice of users
   •  Brick Failure Detection
   •  Prevent NFS restart on Volume change

This is exactly the power of a great community! Use the latest version, and join us!

To contact us, e-mail here -> [email protected]