IO Performance Part2

1

© 2009 IBM Corporation

Linux on System z Disk I/O Performancepart 2

Mustafa Mešanović – Linux on System z PerformanceSeptember, 28th 2011

2 © 2009 IBM Corporation

Trademarks

Source If Applicable

IBM Presentation Template Full Version

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Other product and service names might be trademarks of IBM or other companies.

http://www.ibm.com/legal/copytrade.shtml


Agenda

■ Measurement setup description

■ Results comparison for FCION/ECKD and FCP/SCSI

■ Conclusions


Summary – Performance considerations (from part 1)

■ to speed up the data transfer to one or a group of volumes use more than 1 rank, because then simultaneously

– more than 8 physical disks are used– more cache and NVS can be used– more device adapters can be used

■ use more than 1 FICON Express channel

■ in case of ECKD use – more than 1 subchannel per volume in the host channel subsystem– High Performance FICON and/or Read Write Track Data

■ speed up techniques– use more than 1 rank: storage pool striping, Linux striped logical volume– use more channels: ECKD logical path groups, SCSI Linux multipath multibus– ECKD use more than 1 subchannel: PAV, HyperPAV


Configuration – host and connection to storage server

z9 LPAR:■ 8 CPUs■ 512MiB■ 4 FICON Express4

– 2 ports per feature used for FICON– 2 ports per feature used for FCP– Total 8 paths FICON, 8 paths FCP

DS8700:■ 8 FCION Express4

– 1 port per feature used for FICON– 1 port per feature used for FCP

Linux:■ SLES11 SP1 (with HyperPAV, High Performance

FICON)■ Kernel: 2.6.32.13-0.5-default (+dm stripe patch)■ Device-mapper:

– multipath: version 1.2.0– striped: version 1.3.0– Multipath-tools-0.4.8-40.23.1

■ 8 FICON paths defined in a channel path group

FICON PortFCP PortNot used

System z DS8K

Sw

itch


Configuration – DS8700

DS8700:

■ 360 GiB cache■ High Performance FICON, Read Write Track Data and HyperPAV feature■ Connected with 8 FCP and 8 FICON channels/paths■ Firmware level: 6.5.1.203■ Storage pool striped volumes are configured in extent pools of 4 ranks

– On each server 1 extent pool for ECKD and 1 extent pool for SCSI■ Other volumes are configured in extent pools of a single rank

– On each server 4 extent pools for ECKD and 4 extent pools for SCSI

Further details about DS8800 here: http://www.redbooks.ibm.com/abstracts/sg248786.html


Setting up multipath devices (1/5)

■ In case the adapters and disks are not already accessible (case 8 channel paths and 2 volumes)#--- mask out Adapter ID ---echo free 0.0.1700 > /proc/cio_ignoreecho free 0.0.1780 > /proc/cio_ignoreecho free 0.0.1800 > /proc/cio_ignoreecho free 0.0.5000 > /proc/cio_ignoreecho free 0.0.5100 > /proc/cio_ignoreecho free 0.0.5900 > /proc/cio_ignoreecho free 0.0.5a00 > /proc/cio_ignoreecho free 0.0.5b00 > /proc/cio_ignore#--- enable FCP Adapter ---chccwdev -e 0.0.1700chccwdev -e 0.0.1780chccwdev -e 0.0.1800chccwdev -e 0.0.5000chccwdev -e 0.0.5100chccwdev -e 0.0.5900chccwdev -e 0.0.5a00chccwdev -e 0.0.5b00



■ Define 8 channels paths to volumes 6024 and 6114 #--- create Units ---#--- volume #1 ---echo 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1700/0x500507630410c7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1780/0x500507630408c7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1800/0x500507630400c7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5000/0x500507630418c7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5100/0x50050763041bc7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5900/0x500507630413c7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5a00/0x500507630403c7ed/unit_addecho 0x4060402400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5b00/0x50050763040bc7ed/unit_add#--- volume #2 ---echo 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5100/0x50050763041bc7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5900/0x500507630413c7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5a00/0x500507630403c7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5b00/0x50050763040bc7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1700/0x500507630410c7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1780/0x500507630408c7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1800/0x500507630400c7ed/unit_addecho 0x4061401400000000 > /sys/bus/ccw/drivers/zfcp/0.0.5000/0x500507630418c7ed/unit_add



■ Prepare multipath.conf for device mapper to use multibus and the appropriate value for switching paths/etc/init.d/multipathd stop

Shutting down multipathd cp /etc/multipath.conf /etc/multipath.conf.backupvim /etc/multipath.conf

set path_grouping_policy multibusset rr_min_io 1See also man pages of multipath.conf



■ Starting up the multipath daemon results in this output/etc/init.d/multipathd startStarting multipathd dmsetup table36005076304ffc7ed0000000000006114_part1: 0 20971488 linear 253:2 32

36005076304ffc7ed0000000000006024: 0 20971520 multipath 0 0 1 1 round-robin 0 8 1 8:64 1 8:96 1 8:112 1 8:0 1 8:32 1 8:16 1 8:48 1 8:80 1

36005076304ffc7ed0000000000006024_part1: 0 20971488 linear 253:0 32




■ Multipath detailsmultipath -ll 36005076304ffc7ed0000000000006024 dm-0 IBM,2107900size=10G features='0' hwhandler='0' wp=rw`-+- policy='round-robin 0' prio=1 status=active |- 4:0:12:1076117600 sde 8:64 active ready running |- 6:0:10:1076117600 sdg 8:96 active ready running |- 7:0:11:1076117600 sdh 8:112 active ready running |- 0:0:15:1076117600 sda 8:0 active ready running |- 2:0:14:1076117600 sdc 8:32 active ready running |- 1:0:8:1076117600 sdb 8:16 active ready running |- 3:0:9:1076117600 sdd 8:48 active ready running `- 5:0:12:1076117600 sdf 8:80 active ready running36005076304ffc7ed0000000000006114 dm-2 IBM,2107900size=10G features='0' hwhandler='0' wp=rw`-+- policy='round-robin 0' prio=1 status=active |- 4:0:12:1075069025 sdi 8:128 active ready running |- 6:0:10:1075069025 sdk 8:160 active ready running |- 5:0:12:1075069025 sdj 8:144 active ready running |- 7:0:11:1075069025 sdl 8:176 active ready running |- 0:0:15:1075069025 sdm 8:192 active ready running |- 1:0:8:1075069025 sdn 8:208 active ready running |- 2:0:14:1075069025 sdo 8:224 active ready running `- 3:0:9:1075069025 sdp 8:240 active ready running


Creating a striped logical volume (1/x)

■ Logical volume from 2 multipath devices#--- create physical volumes ---pvcreate /dev/mapper/36005076304ffc7ed0000000000006024_part1 /dev/mapper/36005076304ffc7ed0000000000006114_part1No physical volume label read from /dev/mapper/36005076304ffc7ed0000000000006024_part1 Physical volume "/dev/mapper/36005076304ffc7ed0000000000006024_part1" successfully created No physical volume label read from /dev/mapper/36005076304ffc7ed0000000000006114_part1 Physical volume "/dev/mapper/36005076304ffc7ed0000000000006114_part1" successfully created#--- create volume group ---vgcreate -v vg01 --physicalextentsize 4M /dev/mapper/36005076304ffc7ed0000000000006024_part1 /dev/mapper/36005076304ffc7ed0000000000006114_part1Wiping cache of LVM-capable devices Adding physical volume '/dev/mapper/36005076304ffc7ed0000000000006024_part1' to volume group 'vg01' Adding physical volume '/dev/mapper/36005076304ffc7ed0000000000006114_part1' to volume group 'vg01' Archiving volume group "vg01" metadata (seqno 0). Creating volume group backup "/etc/lvm/backup/vg01" (seqno 1). Volume group "vg01" successfully created


Creating a striped logical volume (2/2)

■ Logical volume from 2 multipath devices#--- create logical volume ---lvcreate --stripes 2 --stripesize 64K --extents 5118 --name vol01 vg01Logical volume "vol01" createdlvdisplay --- Logical volume --- LV Name /dev/vg01/vol01 VG Name vg01 LV UUID uBqAxh-QvlQ-4Qq8-peI3-UJj1-PRyr-IrhsjV LV Write Access read/write LV Status available # open 0 LV Size 19.99 GB Current LE 5118 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 1024 Block device 253:4#--- activate LV ---lvchange -a y /dev/vg01/vol01#--- create file system ---mkfs -t ext3 /dev/vg01/vol01#--- mount LV ---mount /dev/vg01/vol01 /mnt/subw0


Verifying device mapper tables

■ Check datadmsetup table36005076304ffc7ed0000000000006114_part1: 0 20971488 linear 253:2 32


36005076304ffc7ed0000000000006024_part1: 0 20971488 linear 253:0 32

vg01-vol01: 0 41156608 striped 2 128 253:1 384 253:3 384



HyperPAV devices

■ Alias devices have no name assignment■ The LCU, to which they belong can be detected from the address■ 72f0 and 72f1 are alias devices for 7297 echo free 0.0.72f0 > /proc/cio_ignoreecho free 0.0.72f1 > /proc/cio_ignore#--- enable alias devices ---chccwdev -e 0.0.72f0chccwdev -e 0.0.72f1lsdasd Bus-ID Status Name Device Type BlkSz Size Blocks ============================================================================ 0.0.72f0 alias ECKD 0.0.72f1 alias ECKD 0.0.7116 active dasda 94:0 ECKD 4096 7043MB 1803060 0.0.7117 active dasdb 94:4 ECKD 4096 7043MB 1803060 0.0.7203 active dasdc 94:8 ECKD 4096 46068MB 11793600 0.0.7297 active dasdd 94:12 ECKD 4096 7043MB 1803060


IOZone Benchmark description

■ Workload– Threaded I/O benchmark (IOzone)– Each process writes or reads a single file– Options to bypass page cache, separate execution of sequential write, sequential read,

random read/write

■ Setup– Main memory was restricted to 512 MiB – File size: 2 GiB, Record size: 8 KiB or 64 KiB– Run with 1, 8 and 32 processes – Sequential run: write, rewrite, read– Random run: write, read (with previous sequential write)– Runs with direct I/O and Linux page cache– Sync and drop caches prior to every invocation of the workload to reduce noise


Measured scenarios: FICON/ECKD

FICON/ECKD (always 8 paths)■ 1 single disk

– configured as 1 volume in a extent pool, containing 1 rank■ 1 single disk

– configured as 1 volume in a extent pool, containing 1 rank– using 7 HyperPAV alias devices

■ 1 storage pool striped disk– configured as 1 volume in a extent pool, containing 4 ranks– using 7 HyperPAV alias devices

■ 1 striped logical volume– built from 2 storage pool striped disks, 1 from sever0

and 1 from server1– each disk

• configured as 1 volume in a extent pool, containing 4 ranks• using 7 HyperPAV alias devices

■ 1 striped logical volume– built from 8 disks, 4 from sever0 and 4 from server1– each disk

• configured as 1 volume in a extent pool, containing 1 rank• using 7 HyperPAV alias devices

FICON/ECKD measurement series

ECKD 1d ECKD 1d hpav

ECKD 1d hpav sps

ECKD 2d lv hpav sps

ECKD 8d lv hpav


Measured scenarios: FCP/SCSI

FCP/SCSI measurement series

SCSI 1d SCSI 1d mb SCSI 1d mb sps

SCSI 2d lv mb sps

SCSI 8d lv mb

FCP/SCSI■ 1 single disk

– configured as 1 volume in a extent pool, containing 1 rank– using 1 path*

■ 1 single disk– configured as 1 volume in a extent pool, containing 1 rank– using 8 paths, multipath multibus, rr_min_io = 1

■ 1 storage pool striped disk– configured as 1 volume in a extent pool, containing 4 ranks– using 8 paths, multipath multibus, rr_min_io = 1

■ 1 striped logical volume– built from 2 storage pool striped disks, 1 from sever0

and 1 from server1– each disk

• configured as 1 volume in a extent pool, containing 4 ranks• using 8 paths, multipath multibus, rr_min_io = 1

■ 1 striped logical volume– built from 8 disks, 4 from sever0 and 4 from server1– each disk

• configured as 1 volume in a extent pool, containing 1 rank• using 8 paths, multipath multibus, rr_min_io = 1


Database like scenario, random read

■ ECKD– For 1 process the scenarios show equal throughput– For 8 processes HyperPAV improves the throughput by

up to 5.9x– For 32 processes the combination Linux logical volume

with HyperPAV dasd improves throughput by 13.6x■ SCSI

– For 1 process the scenarios show equal throughput– For 8 processes multipath multibus improves

throughput by 1.4x– For 32 processes multipath multibus improves

throughput by 3.5x■ ECKD versus SCSI

– Throughput for corresponding scenario is always higher with SCSI

1 8 32

Throughput 8 KiB requests

direct io random read

ECKD 1d ECKD 1d hpav ECKD 1d hpav sps

ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d

SCSI 1d mb SCSI 1d mb sps

SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut


Database like scenario, random write

■ ECKD– For 1 process the throughput for all scenarios show

minor deviation– For 8 processes HyperPAV + storage pool striping or

Linux logical volume improve the throughput by 3.5x– For 32 processes the combination Linux logical volume

with HyperPAV or storage pool striped dasd improves throughput by 10.8x

■ SCSI– For 1 process the throughput for all scenarios show

minor deviation– For 8 processes the combination storage pool striping

and multipath multibus improves throughput by 5.4x– For 32 processes the combination Linux logical volume

and multipath multibus improves throughput by 13.3x■ ECKD versus SCSI

– ECKD is better for 1 process– SCSI is better for multiple processes

■ General– More NVS keeps throughput up with 32 processes

1 8 32


direct io random write


ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d


SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut


Database like scenario, sequential read

1 8 32


direct io sequential read


ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d


SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut

■ ECKD– For 1 process the scenarios show equal throughput– For 8 processes HyperPAV improves the throughput by



– For 1 process the throughput for all scenarios show minor deviation

– For 8 processes multipath multibus improves throughput by 1.5x


■ ECKD versus SCSI– Throughput for corresponding scenario is always higher

with SCSI■ General

– Same picture as for random read


File server, sequential write


minor deviation– For 8 processes HyperPAV + Linux logical volume

improve the throughput by 5.7x– For 32 processes the combination Linux logical volume


– For 1 process multipath multibus improves throughput by 2.5x

– For 8 processes the combination storage pool striping and multipath multibus improves throughput by 2.1x

– For 32 processes the combination Linux logical volume and multipath multibus improves throughput by 4.3x

■ ECKD versus SCSI– For 1 process sometimes advantages for ECKD,

sometimes for SCSI– SCSI is better in most cases for multiple processes

1 8 32


direct io sequential write


ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d


SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut


File server, sequential read


minor deviation– For 8 processes HyperPAV improves the throughput by



– For 1 process multipath multibus improves throughput by 2.8x


– For 32 processes the combination Linux logical volume and multipath multibus improves throughput by 6.5x

■ ECKD versus SCSI– SCSI is better in most cases

1 8 32


direct io sequential read


ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d


SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut


Effect of page cache with sequential write

■ General– Compared to direct I/O:

• Helps to increase throughput for scenarios with 1 or a few processes

• Limits throughput in the many process case – Advanatage of SCSI scenarios with additional features

no longer visible■ ECKD

– HyperPAV,storage pool striping and Linux logical volume still improves throughput up to 4.6x

■ SCSI– Multipath multibus with storage pool striping and/or

Linux logical volume still improves throughput up to 2.2x 1 8 32


page cache sequential write


ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d


SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut


Effect of page cache with sequential read

■ General– The SLES11 read ahead setting of 1024 helps a lot to

improve throughput – Compared to direct I/O:

• Big throughput increase with 1 or a few processes• Limits throughput in the many process case for

SCSI– Advanatage of SCSI scenarios with additional features

no longer visible• The number of available pages in the page cache

limit the throughput at a certain rate■ ECKD

– HyperPAV,storage pool striping and Linux logical volume still improves throughput up to 9.3x

■ SCSI– Multipath multibus with storage pool striping and/or

Linux logical volume still improves throughput up to 4.8x

1 8 32


page cache sequential read


ECKD 2d lv hpav sps

ECKD 8d lv hpav

SCSI 1d


SCSI 2d lv mb sps

SCSI 8d lv mb

number of processes

Thro

ughp

ut


Ext3 and XFS comparisonSCSI, 2sps-mp-lv (kernel version 2.6.39), 32 threads

ext3 xfs

IOzone - page cache - throughput

Initial writeRewriteReadRandom readRandom write

ext3 xfs

IOzone - dio - throughput

Initial writeRewriteReadRandom readRandom write

■ General– xfs improves disk I/O, especially write in our case– page cached I/O has lower throughput, due to memory constraint setup

■ Improvement in our setup– sequential write up to 62% (page cached I/O)– sequential write 20% (direct I/O)– random write up to 41% (direct I/O)


Conclusions

■ Small sets of I/O processes benefit from Linux page cache in case of sequential I/O

■ Larger I/O requests from the application lead to higher throughput

■ Reads benefit most from using HyperPAV (FICON/ECKD) and multipath multibus (FCP/SCSI)

■ Writes benefit from the NVS size– Can be increased by the use of

• Storage pool striped volumes• Linux striped logical volumes to disks of different extent pools

■ The results may vary with other– Storage servers– Linux distributions– Number of disks– Number of channel paths


CPU consumption

■ Linux features, like page cache, PAV, striped logical volume or multipath consume additional processor cycles

■ The consumption – grows with number of I/O requests and/or number of I/O processes– depends on the Linux distribution and versions of components like device mapper or device

drivers– depends on customizable values as e.g. Linux memory size (and implicit page cache size),

read ahead value, number of alias devices, number of paths, rr_min_io setting, I/O request size from the applications

– is similar for ECKD and SCSI in the 1 disk case with no further options

■ HyperPAV and static PAV in SLES11 are much more CPU saving than static PAV in older Linux distributions

■ The CPU consumption in the measured scenarios– needed for disk I/O– for the same amount of transferred data– between a simple and a complex setup is

• for ECKD up to 2x• for SCSI up to 2.5x


Summary

Linux options Hardware options Choice of Linux distribution Appropriate number and size of I/Os File system Placement of temp files direct I/O or page cache Read ahead setting Use of striped logical volume

● ECKD● HyperPAV● High Performance FICON for small

I/O requests● SCSI

● Single path configuration is not supported!

● Multipath multibus – choose rr_min_io value

FICON Express4 or 8 number of channel paths to the storage

server Port selection to exploit link speed No switch interconnects with less

bandwidth Storage server configuration

● Extent pool definitions● Disk placement ● Storage pool striped volumes


Questions

■ Further information is at– Linux on System z – Tuning hints and tips

http://www.ibm.com/developerworks/linux/linux390/perf/index.html– Live Virtual Classes for z/VM and Linux

http://www.vm.ibm.com/education/lvc/

Research & DevelopmentSchönaicher Strasse 22071032 Böblingen, Germany

[email protected]

Mustafa MešanovićLinux on System zPerformance Evaluation

http://www.ibm.com/developerworks/linux/linux390/perf/index.html

http://www.vm.ibm.com/education/lvc/

Documents

IO Performance Part2