
Avago 6Gb/s SAS and 12Gb/s SAS Performance Tuning Guide

User Guide

Version 1.0

October 2014

DB15-001127-01


For a comprehensive list of changes to this document, see the Revision History.

Avago Technologies, the A logo, LSI, Storage by LSI, DataBolt, MegaRAID, MegaRAID Storage Manager, and Fusion-MPT are trademarks of Avago Technologies in the United States and other countries. All other brand and product names may be trademarks of their respective companies.

Data subject to change. Copyright © 2014 Avago Technologies. All Rights Reserved.

Corporate Headquarters: San Jose, CA, 800-372-2447
Email: [email protected]
Website: www.lsi.com


Table of Contents

Chapter 1: Introduction . . . 6
  1.1 Overview . . . 6
  1.2 Performance Metrics . . . 7
  1.3 Performance Measurement Characteristics . . . 8
  1.4 Performance Testing Overview . . . 8
  1.5 References . . . 10

Chapter 2: Calculate Expected Performance . . . 12
  2.1 Bottlenecks and Limitations . . . 12
  2.2 Limitations . . . 14
    2.2.1 Interface Connection Limitations . . . 14
    2.2.2 Device Hardware Limitations . . . 16
    2.2.3 Bottleneck Examples . . . 16
      2.2.3.1 6Gb/s SAS Controller Bottleneck Example . . . 17
      2.2.3.2 12Gb/s SAS Controller PCIe Bottleneck Example . . . 18
      2.2.3.3 12Gb/s SAS Controller with PCIe and Drive Bottleneck Example . . . 19
      2.2.3.4 12Gb/s SAS Controller Small Sequential IOPs Bottleneck Example . . . 20
      2.2.3.5 12Gb/s SAS Controller Throughput Bottleneck Example . . . 21
  2.3 Queue Depth and Expected Performance . . . 22

Chapter 3: Build Your Test Setup . . . 24
  3.1 Host System Considerations . . . 24
    3.1.1 Processor Architecture and Core Organization . . . 24
    3.1.2 Memory . . . 27
    3.1.3 PCIe Slot Choice . . . 27
    3.1.4 Non Uniform Memory Architecture . . . 28
    3.1.5 BIOS Options . . . 29
  3.2 Storage Components and Performance . . . 29
    3.2.1 Initiators and Performance . . . 30
      3.2.1.1 Initiator Features that Affect Performance . . . 30
    3.2.2 Expanders and Performance . . . 32
      3.2.2.1 Expanders and Latency . . . 32
      3.2.2.2 DataBolt Technology . . . 32
    3.2.3 Storage Drives and Performance . . . 33
    3.2.4 Target-Mode Controllers and Performance . . . 34
    3.2.5 SSD Preconditioning . . . 34
      3.2.5.1 SNIA SSD Preconditioning . . . 35
      3.2.5.2 Alternative SSD Preconditioning . . . 35
  3.3 Storage Topology . . . 36
    3.3.1 Direct Attached Topology . . . 36
    3.3.2 Expander Attached Topology - Single . . . 37
    3.3.3 Expander Attached Topology - Cascade . . . 37
    3.3.4 Expander Attached Topology - Tree . . . 38
    3.3.5 Multipath Topology . . . 39
    3.3.6 Topology Guidelines for Better Performance . . . 40

Chapter 4: Configure Your Test Parameters . . . 42
  4.1 Operating System Environments . . . 42
    4.1.1 Windows Operating System . . . 42
      4.1.1.1 Windows Operating System Hotfixes . . . 42
      4.1.1.2 MSI-X Interrupt Vectors . . . 42
      4.1.1.3 Process Affinity . . . 42
      4.1.1.4 Driver Version and Customization . . . 42
      4.1.1.5 Disk Write Cache . . . 45
    4.1.2 Linux Operating System . . . 46
      4.1.2.1 Linux Kernel Version . . . 46
      4.1.2.2 Linux Drivers . . . 46
      4.1.2.3 MSI-X Interrupt Vectors . . . 46
      4.1.2.4 I/O Scheduler . . . 47
      4.1.2.5 Block Layer I/O Scheduler Queue . . . 47
      4.1.2.6 SCSI Queue Depth . . . 48
      4.1.2.7 Nomerges Setting . . . 48
      4.1.2.8 Rotational Setting . . . 48
      4.1.2.9 Add Random Setting . . . 48
      4.1.2.10 Linux Write Cache . . . 48
  4.2 Volume Configurations . . . 49
    4.2.1 Volume Configurations and Performance . . . 49
    4.2.2 Volume Type . . . 49
    4.2.3 Strip Size . . . 51
    4.2.4 Cache Policy . . . 51
    4.2.5 Disk Cache Policy . . . 51
    4.2.6 I/O Policy . . . 51
    4.2.7 Consistency and Initialization . . . 52
    4.2.8 Background Operations . . . 52
    4.2.9 MegaRAID FastPath Software . . . 52
    4.2.10 Guidelines on Volume Configurations for Better Performance . . . 53
  4.3 Software Tools . . . 54
    4.3.1 Linux Performance Monitoring Tools . . . 55
      4.3.1.1 sar . . . 55
      4.3.1.2 iostat . . . 58
      4.3.1.3 blktrace . . . 59
      4.3.1.4 blkparse . . . 60
    4.3.2 Windows XPerf . . . 61
    4.3.3 Windows Performance Monitor (Perfmon) . . . 64

Chapter 5: Benchmark Resources . . . 68
  5.1 Benchmarking Basics . . . 68
  5.2 Iometer for Windows . . . 70
    5.2.1 Run Iometer . . . 70
    5.2.2 Iometer Tips and Tricks . . . 71
    5.2.3 Interpret Iometer Results . . . 72
    5.2.4 Iometer References . . . 75
  5.3 Vdbench . . . 75
    5.3.1 Install Vdbench . . . 76
    5.3.2 Run Vdbench . . . 77
    5.3.3 Sample Vdbench Script . . . 77
    5.3.4 Interpret Vdbench Results . . . 77
  5.4 Jetstress . . . 77
    5.4.1 Install Jetstress . . . 78
    5.4.2 Create your Jetstress Test . . . 78
      5.4.2.1 Select Capacity and Throughput . . . 78
      5.4.2.2 Select Test Type . . . 79
      5.4.2.3 Define Test Run . . . 79
      5.4.2.4 Configure Databases . . . 79
      5.4.2.5 Select Database Source . . . 80
    5.4.3 Start the Test . . . 80
      5.4.3.1 Characterize the Jetstress Workload . . . 80
    5.4.4 Interpret Jetstress Results . . . 81
      5.4.4.1 Transactional I/O Performance . . . 81
      5.4.4.2 Background Database Maintenance I/O Performance . . . 81
  5.5 fio for Linux . . . 81
    5.5.1 Get Started with fio . . . 82
    5.5.2 fio Performance-Related Parameters . . . 83
    5.5.3 Interpret fio Output . . . 84
  5.6 Verify Benchmark Results for Validity . . . 85

Chapter 6: Compare Measured Results with Expected Results . . . 86
  6.1 Performance Result Examples for MegaRAID . . . 86
    6.1.1 Eight Drive Direct Attached Example Results . . . 86
    6.1.2 Twenty-four Drive Expander Attached Example Results . . . 87
    6.1.3 Forty Drive Expander Attached Example Results . . . 88
  6.2 Performance Results Examples for IT Controllers . . . 89
    6.2.1 Eight Drive Direct Attached Example Results . . . 90
    6.2.2 Twenty-four Drive Expander Attached Example Results . . . 90
    6.2.3 Forty Drive Expander Attached Example Results . . . 91

Chapter 7: Troubleshoot Performance Issues . . . 92

Appendix A: Performance Testing Checklist . . . 94

Revision History . . . 96
  Version 1.0, October 2014 . . . 96
  Advance, Version 0.1, March 2014 . . . 96


Chapter 1: Introduction

Use this Performance Tuning Guide for Avago® 6Gb/s SAS and 12Gb/s SAS I/O controller, ROC controller, and expander products. This document targets only the storage-specific performance of these products and aims to help you:

- Understand the performance measurement process
- Reach a desired performance goal for a storage topology
- Debug any unexpected results or bottlenecks that you might encounter during performance measurement

This document focuses on performance-related settings and configurations only. See the References section for related documents. For initial and basic device bring-up, refer to the documentation for your product.

1.1 Overview

In general, the performance measurement process might have the following steps:

1. Decide what to measure.

2. Understand what to expect.

3. Build your test configuration.

4. Configure different parameters that might influence your performance tests.

5. Run the performance benchmark test and capture the results.

6. Analyze and compare your results with the expected results.

7. If you have any unexpected results, troubleshoot issues until you achieve the expected results.

The performance measurement process can vary depending on your measurement objective. The objective might be a benchmarking exercise for a new product or a debug effort to understand why a certain measurement is not attaining expected results. This tuning guide organizes its chapters to match the performance measurement process.

Chapter 1, Introduction: Introduces the performance measurement process with commonly used metrics and methodologies. Reviews factors to consider during the benchmarking process.

Chapter 2, Calculate Expected Performance: Introduces the bottlenecks and limitations that you might encounter during performance measurement. This chapter helps you learn what to expect from a specific storage configuration with Avago 6Gb/s and 12Gb/s SAS products.

Chapter 3, Build Your Test Setup: Helps you set up your storage topology and configure specific parameters that can affect what you try to measure. Addresses settings and options that may not change between different runs of a specific performance measurement project, such as the storage topology.

Chapter 4, Configure Your Test Parameters: Helps you understand the different tunable hardware and software options after your system is set up. Addresses options that may change between different runs of a specific performance measurement project, such as different volume configurations of a specific storage topology.

Chapter 5, Benchmark Resources: Helps you choose the benchmarking tool or system monitoring tool that best suits the metric that you intend to measure. Discusses different settings and tool tips for getting reliable results from these tools and for validating the results.


Chapter 6, Compare Measured Results with Expected Results: Helps you analyze your results and compare them with the expected results. This chapter provides example results from Avago standard performance runs against which to gauge your results.

Chapter 7, Troubleshoot Performance Issues: Reviews questions that you might ask or additional tests that you might run in the case of unexpected results. This chapter takes you through different debugging steps to isolate and root-cause an issue more quickly.

1.2 Performance Metrics

This section lists the commonly used primary and secondary performance metrics for performance analysis. Primary performance metrics include throughput and latency.

Throughput (MBPS and IOPs): The rate at which data can be transferred in a unit of time. Throughput is typically given in terms of I/Os per second (IOPs) and megabytes per second (MB/s or MBPS). IOPs generally measures data of a random nature; MB/s generally measures data of a sequential nature.

Throughput of small I/O sizes is often expressed in IOPs, whereas throughput of large I/Os is expressed in MBPS. Both units represent the same quantity, but with a different scale factor; a larger throughput value indicates greater performance. When expressed as MB/s, throughput is often called bandwidth.

NOTE Avago uses binary base (1 KB = 1024 Bytes) when representing MBPS. Be wary of tools that might represent MBPS in decimal base (1 KB = 1000 bytes).
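The two throughput units are related by simple arithmetic: MBPS = IOPs × I/O size. The following minimal Python sketch (an illustration, not a tool from this guide) converts between the two using the binary base described in the note above:

```python
# Convert between IOPs and MBPS using the binary base (1 KB = 1024 bytes)
# that Avago uses; some tools use the decimal base instead.
MIB = 1024 * 1024  # bytes per megabyte (binary base)

def iops_to_mbps(iops: float, io_size_bytes: int) -> float:
    """Throughput in MBPS implied by an IOPs rate at a given I/O size."""
    return iops * io_size_bytes / MIB

def mbps_to_iops(mbps: float, io_size_bytes: int) -> float:
    """IOPs rate implied by an MBPS throughput at a given I/O size."""
    return mbps * MIB / io_size_bytes

# Example: 650,000 IOPs at 4 KiB per I/O is roughly 2539 MBPS.
print(iops_to_mbps(650_000, 4096))
```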

Latency: The time to complete an I/O. Under certain conditions, latency is the inverse of throughput, and a tradeoff exists between the two. Latency is generally lower on lightly loaded systems and higher on heavily loaded systems that issue several I/Os simultaneously. Lower latencies are more desirable, and many applications have requirements around latency thresholds. Several latency variations follow:

- Minimum Latency: Latency of the single fastest I/O measured.
- Maximum Latency: Latency of the single slowest I/O measured.
- Average Latency: Latency of all I/Os measured, averaged together.
- Percentile Latency: Maximum latency of a certain percentage of all the I/Os measured. Typical percentiles used are 95%, 99%, and 99.9%. Use percentile latency to remove extreme, uncharacteristic I/O outliers that skew the latency calculations.
- Histogram Latency: Distribution of the latencies of all the I/Os measured, using predetermined ranges (buckets).
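To make the variations concrete, the following Python sketch (hypothetical sample data and a simple nearest-rank percentile, for illustration only) derives several of these statistics from a list of per-I/O latencies:

```python
import math

# Per-I/O latencies in milliseconds; hypothetical sample values.
latencies_ms = [0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 5.9, 1.2, 1.0, 0.9]

def percentile(samples, pct):
    """Nearest-rank percentile of the measured latencies."""
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100.0)
    return ordered[rank - 1]

print("minimum:", min(latencies_ms))
print("maximum:", max(latencies_ms))
print("average:", sum(latencies_ms) / len(latencies_ms))
# With only ten samples, the 90th percentile already excludes the extreme
# 5.9 ms outlier; the 95/99/99.9 percentiles need correspondingly more samples.
print("90th percentile:", percentile(latencies_ms, 90))
```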

NOTE When you compare latencies of different products, make sure that the throughput is the same. A storage controller might tune for its maximum throughput, and thus compromise on latency, or vice versa.

Less commonly used performance metrics, or secondary metrics, might prove useful in certain situations depending on the point of interest. Secondary performance metrics include the following:

Utilization: Percent of time that a resource is used, such as a CPU, a storage link, or a disk.

Efficiency: A ratio, typically throughput divided by utilization. A commonly used efficiency metric is IOPs / % CPU.


Interrupt rate: Number of host driver interrupts per second or per I/O.

The metrics outlined in this section describe storage performance at a fundamental level. Most applications and third party benchmarks have their own method to express performance by using different terminologies that depend on what is measured. For example, database benchmarks use transactions as the base unit to measure performance. A transaction could be a single I/O or more complex with multiple read and write I/Os issued to complete a task. Refer to the application or benchmark documentation before you analyze the metrics it produces.

1.3 Performance Measurement Characteristics

Any metrics that you use to characterize performance must have two common characteristics:

Reliability: Performance must be a measurement of the system or device in a deterministic, known state. Performance measured when the storage system or device is in a state unknowingly influenced by external variables, such as equipment failures or transient cache states, might result in inaccurate measurements.

Repeatability: Performance measurements of a storage system under the same configuration and environmental conditions must always provide the same results; only then can you consider those results valid. Do not consider measurements with a high level of variance valid; analyze them closely for any possible discrepancies between runs. Such analysis helps uncover variables that were previously, and wrongly, assumed to be constants.
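One way to apply the repeatability criterion in practice is to compute the run-to-run variance of repeated benchmark runs. The sketch below is a minimal example, assuming a 3 percent relative standard deviation as an arbitrary, illustrative tolerance:

```python
import statistics

def is_repeatable(results, tolerance_pct=3.0):
    """Flag a set of repeated benchmark results (e.g., IOPs per run) as
    repeatable when the relative standard deviation stays within the
    tolerance; otherwise the runs deserve a closer look for hidden
    variables. The 3% default is an illustrative convention, not a
    threshold from this guide."""
    rel_std_pct = statistics.stdev(results) / statistics.mean(results) * 100
    return rel_std_pct <= tolerance_pct

runs_iops = [498_000, 502_000, 499_500]  # hypothetical repeated runs
print(is_repeatable(runs_iops))          # True: variance is small
```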

1.4 Performance Testing Overview

This section provides a performance testing overview. Each topic is treated in more detail throughout this document. As previously listed, any performance task uses the following steps:

1. Decide what to measure.

2. Understand what to expect.

3. Build your test configuration.

4. Configure different parameters that might influence your performance tests.

5. Run the performance benchmark test and capture the results.

6. Analyze and compare your results with the expected results.

7. If you have any unexpected results, troubleshoot issues until you achieve the expected results.

In performance analysis, always first ask, “What is being measured?”.

You can quantify performance in many different ways, depending on the intended purpose of the storage system or a specific device. Some devices focus on delivering as many I/Os as possible. Other devices might focus on delivering fewer I/Os, but in the fastest manner possible. You must decide the operating conditions under which you measure the performance of a device or a system. After you make that decision, you can easily decide the host system, storage topology, media type, link speeds at different interfaces, operating system requirements, RAID configurations, software tools, and so on.

For example, when you measure storage controller performance, you want to eliminate bottlenecks that the storage controller does not cause. You also want to control variables that might affect your measurement. The following high-level guidelines help you prepare your system for the performance test.


General Guidelines for Better Performance Measurements

- Use a host system with the latest high-performance processors and chipsets.
- Use the latest motherboards that allow more than one CPU socket, and populate all the CPU sockets if possible.
- Use the latest system BIOS version for the motherboard.
- Tune the system BIOS settings for performance rather than for power-saving mode.
- Use an up-to-date operating system and implement any necessary patches or updates that might affect performance.
- Set the interface speeds (such as PCI Express® (PCIe®) and SAS) to their maximum so that the controller is the only bottleneck. For example, to measure PCIe Generation 3 (8 Gb/s) controller performance, you do not want to configure your motherboard PCIe slot to PCIe Generation 2 (5 Gb/s) speed. A sketch for verifying the negotiated PCIe link follows this list.
- Make sure sufficient drives are in place to exercise the maximum controller performance. For example, to measure the maximum bandwidth of a SAS controller, you might need more than 20 hard drives; with fewer direct-attached drives, the measurement is bottlenecked by the drives.
- Make sure the cables and connectors are not prone to signal integrity issues. For example, use the appropriate cable length to connect the controller and expander, and use cables and connectors that meet the specification standards, such as SAS, SATA, and PCIe, of your storage devices.
- Make sure sufficient cooling is in place, so temperature variations do not affect your measurements.
- Choose the benchmarking tool or system monitoring tool that properly measures the metric of your interest. For example, if latency is your prime metric, the Vdbench tool might be better than the v1.1.0 Iometer tool.
- Make sure performance-related features of other devices are in a known state. For example:
  — A 12Gb/s SAS expander might enable a buffering feature that is advantageous for 6Gb/s drives.
  — Write cache on hard drives impacts performance.
- Update all devices in your system with the latest firmware and software, such as BIOS, drivers, tools, and utilities.
- Run workloads that represent your real-life scenarios.
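For the interface-speed guideline above, the negotiated PCIe link speed and width can be checked on Linux through sysfs. The following sketch assumes a reasonably recent kernel that exposes the current_link_speed and current_link_width attributes; the PCI address shown is hypothetical:

```python
from pathlib import Path

# Hypothetical PCI address of the storage controller; find the real one
# with a tool such as lspci.
DEVICE = Path("/sys/bus/pci/devices/0000:03:00.0")

def read_attr(name: str) -> str:
    """Read one sysfs attribute of the device, if the kernel exposes it."""
    return (DEVICE / name).read_text().strip()

# A PCIe Generation 3 x8 controller should train at 8 GT/s and width 8;
# lower values point to a slot, cable, or BIOS configuration bottleneck.
print("link speed:", read_attr("current_link_speed"))
print("link width:", read_attr("current_link_width"))
```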

Overlooking any of the basic guidelines can result in an unreliable or inconsistent performance measurement. The following table lists such problems and their potential causes.

Problem: Performance measurement lower than expected
Potential causes:
- Insufficient disks
- Link not running at the expected speed
- CPU utilization at 100%
- Disk sees random I/O when sequential I/O is intended
- Unexpected file system or operating system influence
- A component failed and generated errors
- Incorrect performance expectation
- One or more disks in the virtual drive has lower performance than the other drives (that is, a drive becomes defective, has many reallocated sectors or media errors, and so on)
- Background tasks running, such as a consistency check or patrol read

Problem: Performance measurement higher than expected
Potential causes:
- Using more disks than expected
- Disk sees sequential I/O when random I/O is intended
- I/Os are serviced out of cache instead of reaching the disks
- Incorrect performance expectation

Problem: Unstable performance measurement results
Potential causes:
- System processes starting and stopping
- Intermittent component errors
- Problems with interrupt and process affinity
- Use of nonpreconditioned SSDs

Problem: Results not repeatable
Potential causes:
- System or other processes starting and stopping
- Intermittent errors
- Inconsistent test process
- RAID background operations starting and stopping
- Use of nonpreconditioned SSDs
- Too-short test time
- Thermal problems

Problem: Insensitivity to expected parameter changes
Potential causes:
- Parameter remained unchanged
- Incorrect expectation
- I/O not going to the expected target devices

Problem: Runtime hardware or software errors
Potential causes:
- Thermal problems
- Use of uninitialized volumes
- Illegal topology
- Use of broken or inappropriate cables and drive enclosures
- Insufficient drive power


For any measurement, first develop a baseline: a simple, stable test environment and measurement. Then deviate from the baseline by changing only one factor at a time, to help isolate and root-cause any issue that might occur.

When you have made sure that your test system is ready for measurement and your baseline proves no issues exist, you may run your benchmark and obtain your results. If you are running your tests for the first time, it is a good practice to rerun the same tests for repeatability. You might also monitor your results closely to check for any anomalies such as errors, link failures, improper worker assignments, and so on. When your results are valid, compare them with the expected results, results from other benchmarks, and benchmarks published by product vendors. If the results match, use these results as your golden reference for further tests. If these results differ, revisit your test to understand the bottleneck that stops you from reaching the expected results.

NOTE If you see any performance issues with Avago products, capture all information about your test to create a support request for Avago. Work with your FAE to use the LSIGet tool (http://sae.lsi.com/ or ftp://ftp0.lsil.com/outgoing_perm/CaptureScripts) to capture all information. This tool captures information about the host system, storage topology, RAID volume information, and so on. Also provide the benchmark-related information and any associated scripts that you used.

1.5 References

Refer to the following Avago documentation for product-specific information. Contact your FAE to obtain documentation.

- LSI Scrutiny Tool User Guide
- LSI SAS-3 Architecture Guide
- StorCLI Reference Manual
- MegaRAID SAS Device Driver Installation User Guide
- MegaRAID SAS Software User's Guide
- Linux Device Mapping in LSI Expander-Designed Backplanes SEN



12Gb/s SAS Controllers
— LSISAS3xxx PCI Express to 12Gb/s SAS Controller Datasheet
— LSISAS3xxx PCI Express to 12Gb/s SAS Controller Configuration Programming Guide
— LSISAS3108 PCI Express to 12Gb/s SAS ROC Controller Register Programming Guide
— LSISAS3108 PCI Express to 12Gb/s SAS/SATA ROC Controller SDK Programming Guide
— LSISAS3xxx Controller Reference Schematic

6Gb/s SAS Controllers
— LSISAS2xxx PCI Express to 6Gb/s SAS/SATA Controller Design Considerations SEN
— LSISAS2xxx PCI Express to 6Gb/s SAS/SATA ROC Controller Reference Manual
— LSISAS2208 PCI Express to 6Gb/s SAS/SATA ROC Controller Programming Guide
— LSISAS2208 PCI Express to 6Gb/s SAS/SATA ROC Controller SDK Programming Guide

12Gb/s SAS Expanders
— LSISAS3xXX 12Gb/s SAS/SATA Expander Family Register Reference Manual
— LSISAS3xXX-R 12Gb/s SAS/SATA Expander Family Register Reference Manual
— LSI 12Gb/s SAS/SATA Expander Software Development Kit Programming Guide
— 12Gb/s SAS/SATA Expander Firmware Configuration Programming Guide
— LSI 12Gb/s Expander Tools (Xtools) User Guide
— LSI 12Gb/s Expander Flash (g3Xflash) User Guide
— LSI 12Gb/s Expander Manufacturing Image (g3Xmfg) User Guide
— LSI 12Gb/s Expander Diagnostics Utility (g3Xutil) User Guide
— LSI 12Gb/s Expander IP Configuration Utility (g3Xip) User Guide
— Configuration Page Definition for 12Gb/s SAS/SATA Expander Firmware Application Note

6Gb/s SAS Expanders
— LSISAS2xXX Expander Design Considerations SEN
— LSISAS2xXX Expander Reference Manual
— LSI 6Gb/s SAS/SATA Expander SDK Programming Guide
— LSI Expander Flash Utility (Xflash) User Guide
— LSI Expander Tools (Xtools) User Guide
— Configuration Page Definition for 6Gb/s SAS/SATA Expander Firmware

HBAs
— LSI SAS 9xxx-xx PCI Express to 12Gb/s Serial Attached SCSI (SAS) Host Bus Adapter User Guide
— PCI Express to xGb/s Serial Attached SCSI (SAS) Host Bus Adapters User Guide
— Quick Installation Guide LSI SAS 9xxx-xx PCI Express to 12Gb/s SAS Host Bus Adapter


Chapter 2: Calculate Expected Performance

This chapter explains how to calculate the expected performance of your system. To understand the expected performance, you must understand the bottlenecks and limitations of the different devices and interfaces in your system. This performance guide reviews the bottlenecks and limitations related to the following Avago storage products:

SAS Storage I/O Controllers (IOCs)
— LSISAS2008, LSISAS2308 (6Gb/s SAS)
— LSISAS3004, LSISAS3008 (12Gb/s SAS)

RAID-on-Chip ICs (ROCs)
— LSISAS2108, LSISAS2116, LSISAS2208 (6Gb/s SAS)
— LSISAS3108 (12Gb/s SAS)

Host Bus Adapters (HBAs)
— LSI SAS 92xx (6Gb/s SAS)
— LSI SAS 93xx (12Gb/s SAS)

RAID Controllers
— LSI MegaRAID SAS 6Gb/s RAID (LSI MegaRAID SAS 92xx)
— LSI MegaRAID SAS 12Gb/s RAID (LSI MegaRAID SAS 93xx)

SAS Expanders
— LSISAS2x36, LSISAS2x28, LSISAS2x24, LSISAS2x20 (6Gb/s SAS)
— LSISAS3x48, LSISAS3x40, LSISAS3x36, LSISAS3x36-R, LSISAS3x28-R, LSISAS3x24-R (12Gb/s SAS)

The performance of each product depends on the PCIe interface and SAS speeds, the processing power of the product's CPU and DMA engines, and the I/O routing capabilities of the hardware modules. To measure a product's performance capability, avoid other bottlenecks in the system as much as possible. The maximum system performance is the minimum of the maximum device or interface performance in the system; after all, a chain is only as strong as its weakest link.

2.1 Bottlenecks and Limitations

Consider the following figure as an example of the possible bottlenecks in a storage system. The DDR block and DDR path do not apply to IOC products.

Figure 1 Example Storage Configuration


For an I/O read or write, the I/O path is as follows:

Operating system or Application > host CPU > PCIe interface > storage controller > SAS > Expander > Drives

The following table lists factors (including, but not limited to) that affect performance at each level of the previous I/O path.

Table 1 I/O Path Elements that Affect Performance

Operating system or Application:
- Other applications using host resources
- Network loads
- Operating system type (Windows, Linux, et al.)
- Benchmark type (synthetic, application)
- Queue depth or number of outstanding I/Os per physical drive
- MSI-X interrupt vector support

Host CPU:
- Processor and I/O architecture
- CPU speed
- Number of CPU sockets
- Number of processor cores per CPU
- Hyper-threading
- Memory size and speed
- NUMA
- Host chipset

PCIe interface:
- Link rate (2.5 Gb/s, 5 Gb/s, 8 Gb/s)
- Link width
- Signal integrity (SI)

Storage controller:
- IOC and RAID CPU core capability
- DMA and Fast Path engines
- I/O coalescing
- Interrupt coalescing
- Controller mode (initiator or target)
- RAID only: DDR memory type, speed, and size
- RAID only: RAID initialization or other background operations (Rebuild, Reconstruction, Patrol Read, Consistency Check)
- RAID only: RAID volume configurations
- RAID only: Number of drive groups and volumes

SAS:
- Link rate (3 Gb/s, 6 Gb/s, 12 Gb/s)
- Link width
- SI

Expander:
- Connection routing and arbitration
- DataBolt (end device buffering)

Drives:
- Individual drive performance
- Number of drives
- Protocol (SAS or SATA)
- Media type (HDD or SSD)
- Link rate (3 Gb/s, 6 Gb/s, 12 Gb/s)
- Write cache
- Preconditioning (SSD only)


NOTE The storage topology (how the storage components connect) can affect performance. For ease of explanation, only a controller > expander > drives topology is chosen here. A storage topology can be built many different ways, and performance can differ between topologies; see Section 3.3, Storage Topology, for more information.

Performance measurement can occur for a specific device or a specific topology.

When you measure device performance, overprovision all other interfaces and devices so the device is the only bottleneck and you measure the maximum device capabilities.

When you measure a specific topology performance, keep all devices and interfaces at the maximum capability so the measurement exposes any device or interface that performs lower than others.

2.2 Limitations

This section presents maximum interface and drive limitations, including:

- Maximum theoretical and practical bottlenecks of SAS/SATA and PCIe interfaces in a storage topology
- Expected maximum performance for different drive types (SAS/SATA, SSD/HDD)
- Hardware limitations of storage controllers

2.2.1 Interface Connection Limitations

Interface Limitations


Table 2  Generation 2 Interface Connection Limitations (PhysBW, Uni-Directional, MB/s)

  Technology       Width   Theoretical   Practical
  PCIe (5 Gb/s)    x1      500           400
                   x4      2000          1600
                   x8      4000          3200
  SAS (6 Gb/s)     x1      600           550
                   x4      2400          2200
                   x8      4800          4400
  SATA (6 Gb/s)    x1      300           260 (3 Gb/s), 520 (6 Gb/s)
                   x4      1200          1040 (3 Gb/s), 2080 (6 Gb/s)
                   x8      2400          2080 (3 Gb/s), 4160 (6 Gb/s)
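As a cross-check on the theoretical columns: SAS links (and PCIe generations 1 and 2) use 8b/10b encoding, so each transmitted byte costs 10 bits on the wire. The following minimal sketch of that arithmetic reproduces the SAS and PCIe (5 Gb/s) theoretical values in these tables; note that PCIe generation 3 uses 128b/130b encoding, so the formula does not apply to it:

```python
# Theoretical unidirectional bandwidth of an 8b/10b-encoded serial link:
# each data byte costs 10 bits on the wire, so MB/s = line rate / 10.
def phys_bw_mbps(line_rate_gbps: float, lanes: int) -> int:
    """Theoretical MB/s for an 8b/10b link of the given rate and width."""
    return int(line_rate_gbps * 1000 / 10) * lanes

print(phys_bw_mbps(6, 1))    # SAS 6 Gb/s  x1 ->  600
print(phys_bw_mbps(6, 4))    # SAS 6 Gb/s  x4 -> 2400
print(phys_bw_mbps(12, 8))   # SAS 12 Gb/s x8 -> 9600
print(phys_bw_mbps(5, 8))    # PCIe 5 GT/s x8 -> 4000
```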




Table 3  Generation 3 Interface Connection Limitations (PhysBW, Uni-Directional, MB/s)

  Technology       Width   Theoretical   Practical
  PCIe (8 Gb/s)    x1      800           790
                   x4      3200          3200
                   x8      6400          6400
  SAS (12 Gb/s)    x1      1200          1100
                   x4      4800          4400
                   x8      9600          8800
  SATA (12 Gb/s)   x1      600           260 (3 Gb/s), 490 (6 Gb/s)
                   x4      2400          820 (3 Gb/s), 1540 (6 Gb/s)
                   x8      4800          1640 (3 Gb/s), 3080 (6 Gb/s)

Disk Drive Limitations

Table 4  Disk Drive Interface Limitations

  Generation               Drive Type    Disk K IOPs   Sustained MB/s
  Generation 2 (6 Gb/s)    SAS 2.5-in.   40 to 250     80 to 210
                           SAS 3.5-in.   40 to 250     90 to 220
                           SATA 2.5-in.  10 to 70      40 to 120
                           SATA 3.5-in.  10 to 70      80 to 150
  Generation 3 (12 Gb/s)   SAS HDD       40 to 250     100 to 220
                           SATA HDD      10 to 80      50 to 150
                           SAS SSD       10 to 120     550
                           SATA SSD      10 to 100     550



The previous tables help you choose the right storage topology for your measurement. For example, to measure the maximum IOPs of a controller expected to deliver 500,000 IOPs, eight direct-attached SAS HDDs of 100,000 IOPs each might be sufficient because 8 x 100,000 = 800,000 IOPs > 500,000 IOPs. However, the same topology is not sufficient to measure the maximum MBPS of the same controller if the controller is expected to exceed 4000 MBPS and the drives deliver only 200 MBPS maximum each. The drives limit the performance at 8 x 200 = 1600 MBPS, so the eight-drive direct-attached SAS HDD topology is not suited to measure the 4000-MBPS limit of the controller.
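This sizing check is easy to script before you build a topology. The following minimal Python sketch (the drive and controller figures are the example values from the paragraph above, not measured data) verifies whether a proposed drive set can expose a controller's limits:

    # Hypothetical sizing check: can N drives saturate the controller?
    def topology_can_saturate(n_drives, drive_iops, drive_mbps,
                              ctrl_iops, ctrl_mbps):
        """Return (iops_ok, mbps_ok) for a proposed topology."""
        return (n_drives * drive_iops > ctrl_iops,
                n_drives * drive_mbps > ctrl_mbps)

    # Eight SAS HDDs at 100,000 IOPs and 200 MB/s each, against a
    # controller expected to reach 500,000 IOPs and 4000 MBPS:
    iops_ok, mbps_ok = topology_can_saturate(8, 100_000, 200, 500_000, 4000)
    print(iops_ok)  # True:  800,000 IOPs > 500,000 IOPs
    print(mbps_ok)  # False: 1600 MBPS < 4000 MBPS, drives limit the test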

2.2.2 Device Hardware Limitations

The following table lists the maximum performance for Avago controllers.

Table 5 Device Hardware Maximum Performance

  Generation                 Controller   SATA Maximum IOPs                                SAS Maximum IOPs                                SAS Maximum Read MBiPS       SAS Maximum Write MBiPS
  6 Gb/s SAS, 5 GT/s PCIe    LSISAS2008   245,000 at 0.5 KB                                350,000 at 0.5 KB                               3100                         2700
                             LSISAS2108   256,000 at 0.5 KB [4K SR RAID0 SATA 3Gb/s SSD]   308,000 at 0.5 KB [4K SR RAID0 6Gb/s SAS SSD]   1721 [RAID0 6Gb/s SAS SSD]   934 [RAID0 6Gb/s SAS SSD]
  6 Gb/s SAS, 8 GT/s PCIe    LSISAS2308   460,000 at 4 KB                                  640,000 at 4 KB                                 4320                         4300
                             LSISAS2208   502,000 at 4 KB [4K SR RAID0]                    521,000 at 4 KB [4K SR RAID0]                   4315                         4281
  12 Gb/s SAS, 8 GT/s PCIe   LSISAS3008   683,000 at 4 KB                                  1.45 million at 4 KB [4K RR]                    5930                         6590
                             LSISAS3108   653,000 at 4 KB                                  1.43 million at 4 KB [4K RR]                    5930                         6590

NOTE Do not expect RAID performance to equal JBOD performance. RAID operations have additional I/O overheads that reduce maximum capability.

2.2.3 Bottleneck Examples

The maximum throughput is the minimum of the maximum performances of all the interfaces and devices in the path. As a result, bottlenecks might be due to:

 - The limitation of any interface, device, or topology
 - The number of devices and links
 - Computational overheads, and so on

For example, when finding the maximum IOPs of the system, the processing capabilities of the storage controller and the host CPU usually cause the bottleneck. But if there are too few drives to meet the maximum IOPs of the controller, the number of drives becomes the bottleneck.

While finding the maximum MBPS of the system, factors such as controller DDR interface, SAS, PCIe interface, host CPU, or number of drives can cause the bottleneck. The factor with the lowest maximum MBPS for a specific workload becomes the bottleneck.

Knowing the bottleneck or limitation of your system configuration helps you understand the expected maximum performance of your system. The following examples discuss different bottlenecks related to storage controllers and performance.



2.2.3.1 6Gb/s SAS Controller Bottleneck Example

The following figure shows a SAS controller that uses an 8 Gb/s PCIe interface and 6Gb/s SAS. This setup uses a x8 link to an expander with 40 drives. The drives are 6Gb/s SAS drives, and each drive is capable of 120 MB/s and 460 IOPs maximum for random I/Os.

Figure 2 6Gb/s SAS Controllers, Revision C1 and Later

Given the nature of the SAS and PCIe interfaces used, the controller's SAS connection becomes the performance bottleneck in the case that follows.

Maximum random MB/s = Minimum (PCIe 8 Gb/s, SAS 6Gb/s, 40x drives)

= Minimum (6.4 GB/s, 4.4 GB/s, 40 x 120 MB/s)

= 4.4 GB/s (SAS bottleneck)
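The minimum-of-all-limits arithmetic above repeats in every example in this section, so it is easy to script. A minimal Python sketch (the figures are the example values from this section, in GB/s; illustrative only, not an Avago tool):

    # Illustrative bottleneck calculator: throughput is the minimum of
    # all interface and device limits in the I/O path (values in GB/s).
    def bottleneck(limits):
        """Return (name, value) of the slowest element in the path."""
        name = min(limits, key=limits.get)
        return name, limits[name]

    # Section 2.2.3.1: x8 PCIe 8 Gb/s, x8 SAS 6Gb/s, 40 drives at 120 MB/s.
    case = {"PCIe x8 (8 Gb/s)": 6.4,
            "SAS x8 (6 Gb/s)": 4.4,
            "40 drives x 120 MB/s": 40 * 0.120}
    print(bottleneck(case))  # ('SAS x8 (6 Gb/s)', 4.4): SAS bottleneck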


2.2.3.2 12Gb/s SAS Controller PCIe Bottleneck Example

Compared to Figure 2, 6Gb/s SAS Controllers, Revision C1 and Later, the following figure uses a 12Gb/s SAS controller with 12Gb/s SAS instead of the 6Gb/s SAS.

Figure 3 12Gb/s SAS Controllers with PCIe Bottleneck

Now the drive throughput becomes the bottleneck for Random Read/Random Write. The 40 6Gb/s SAS drives give 4.8 GB/s (40 x 120 MB/s):

Maximum Random MB/s = Minimum (PCIe 8 Gb/s, SAS 12Gb/s, 40x drives at 120 MB/s)

= Minimum (6.4 GB/s, 8.8 GB/s, 4.8 GB/s)

= 4.8 GB/s (drive bottleneck)

If the expander is a 12Gb/s expander with DataBolt, the expander can extract almost 12 Gb/s performance from 6 Gb/s drives. With DataBolt enabled, the same drives can reach up to 9.6 GB/s (2 x 4.8). Assuming the drives reach 7.2 GB/s for Random Read/Random Write, the PCIe interface becomes the bottleneck.

Maximum MB/s = Minimum (PCIe 8 Gb/s, SAS 12Gb/s, 40x drives at 6 Gb/s + DataBolt)

= Minimum (6.4 GB/s, 8.8 GB/s, 7.2 GB/s)

= 6.4 GB/s (PCIe bottleneck)


2.2.3.3 12Gb/s SAS Controller with PCIe and Drive Bottleneck Example

Compared to Figure 3, the following figure illustrates two cases: one with 40 drives and another with 24 drives.

Figure 4 12Gb/s SAS Controller with PCIe and Drive Bottleneck

The 40-drive case behaves similarly to the previous example, where the bottleneck is the PCIe interface because the drives can reach 7.2 GB/s sequential reads/writes.

For the 24-drive case, the drive performance falls to 4.3 GB/s, so the number of drives causes the bottleneck:

Maximum sequential MB/s = Minimum (PCIe 8 Gb/s, SAS 12Gb/s, 24x drives)

= Minimum (6.4 GB/s, 8.8 GB/s, 4.3 GB/s)

= 4.3 GB/s (number-of-drives bottleneck)


2.2.3.4 12Gb/s SAS Controller Small Sequential IOPs Bottleneck Example

The following figure shows the SAS link width reduced to x4 instead of x8, as in earlier examples.

Figure 5 IOPs Small Sequential Random Write

Consider the small sequential IOPs bottlenecks for this case. For 4-KB I/O, the controller can give 600,000 IOPs, which gives 600,000 x 4 KB = 2400 MB/s = 2.4 GB/s.

However, the SAS link is only x4 and is bottlenecked at 2.2 GB/s:

Maximum 4-KB sequential IOPs = Minimum (PCIe 8 Gb/s at x8 link, controller IOPs limit, SAS 6Gb/s at x4 link)

= Minimum (6.4 GB/s, 2.4 GB/s, 2.2 GB/s)

= 2.2 GB/s (SAS link width bottleneck)


2.2.3.5 12Gb/s SAS Controller Throughput Bottleneck Example

The following figure shows a SAS 12Gb/s controller that reaches 1,250,000 IOPs. However, the SAS link to the expander is 3Gb/s, which limits performance to 2.2 GB/s even though the link is a x8 link. This scenario uses forty 6 Gb/s SSDs, each capable of 20,000 IOPs for 0.5-KB Random Write I/Os.

Figure 6 12Gb/s SAS Controller Throughput Bottleneck

For 0.5-KB Random Writes, the 40 drives can reach only 40 x 20,000 = 800,000 IOPs, which is 800,000 x 0.5 KB = 400 MB/s = 0.4 GB/s. The controller limit is 1.25 million IOPs at 0.5-KB Random Writes, which is 1,250,000 IOPS x 0.5 KB = 625 MB/s = 0.625 GB/s.

Assuming the host chipset reaches 1.6 GB/s with a x8 PCIe 8 Gb/s link:

Maximum random IOPs at 0.5-KB Random Writes = Minimum (chipset, controller, SAS 3Gb/s at x8 link, 40x SSD with 20,000 IOPs each)

= Minimum (1.6 GB/s, 0.625 GB/s, 2.2 GB/s, 0.4 GB/s)

= 0.4 GB/s (the drives' random performance and the number of drives cause the bottleneck)

If the drives reach 40,000 IOPs instead of 20,000 IOPs, the next possible bottleneck is the controller IOPs limitation. The equation becomes:

Maximum random IOPs at 0.5-KB Random Writes = Minimum (chipset, controller, SAS 3Gb/s at x8 link, 40x SSD with 40,000 IOPs each)

= Minimum (1.6 GB/s, 0.625 GB/s, 2.2 GB/s, 0.8 GB/s)

= 0.625 GB/s (controller IOPs bottleneck)


2.3 Queue Depth and Expected Performance

Queue depth (Qd) is the number of outstanding I/Os for each device. More outstanding I/Os let a device sustain its workload without incurring idle time at the disk. Synthetic benchmarking tools let you directly control the Qd, so it is easier to measure or compare Qd effects with synthetic benchmarks than with real-world applications.

Storage applications have many different queues, such as the following:

 - Drive queue depth
 - Adapter SAS core outstanding I/O count
 - Driver maximum outstanding I/O count, and the individual count for each device presented

To understand your total expected performance, it is important to understand the effect of queue depth on your specific media, adapter, and driver.

As shown in the following figure (which shows the Qd scaling for a 1-drive case and an 8-drive case), increasing the Qd increases performance until the drive is saturated. At saturation, the drive performs at its maximum capability; increasing the Qd beyond this level does not increase performance.

NOTE In some cases, too large a Qd increase can add overhead because the drive queue is full and overloaded, and the drive might not perform at its optimal operating conditions.
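You can locate the saturation point empirically by sweeping the queue depth. The following minimal Python sketch assumes a Linux host with the fio benchmarking tool installed; /dev/sdX is a placeholder for the device under test, and the job parameters are examples, not recommendations:

    # Sweep the device queue depth with fio and report 4-KB random read
    # IOPs at each step. direct=1 bypasses the page cache so the drive
    # actually sees the requested Qd. Run as root against a scratch device.
    import json
    import subprocess

    def iops_at_qd(device, qd, runtime_s=60):
        result = subprocess.run(
            ["fio", "--name=qd_sweep", "--filename=" + device,
             "--rw=randread", "--bs=4k", "--direct=1",
             "--ioengine=libaio", "--iodepth=" + str(qd),
             "--runtime=" + str(runtime_s), "--time_based",
             "--output-format=json"],
            capture_output=True, check=True, text=True)
        return json.loads(result.stdout)["jobs"][0]["read"]["iops"]

    for qd in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        print(qd, iops_at_qd("/dev/sdX", qd))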

Figure 7 HGST Direct Attached Throughput of JBOD RAID Types for Sequential Workloads

The following figure illustrates the Qd at controller level and at driver level. This graph compares the Qd scaling of IT/IR, MegaRAID, and iMR controllers with the Windows operating system in an 8x drive direct attached topology. Maximum Qd constraints affect the actual Qd.

Adapters have a limit on the maximum outstanding I/Os (OIO) they can support:

 - 12Gb/s SAS IT/IR controllers have a hardcoded value of approximately 9000 I/Os, but practically they can reach about 5000 to 7000 I/Os maximum.
 - 12Gb/s SAS MegaRAID controllers have the maximum OIO set to approximately 920.
 - 12Gb/s SAS iMR controllers have the maximum OIO set to approximately 234.


Each controller generation might have a different maximum OIO setting, depending on the design considerations at design time. When the adapter hits its OIO limit, there is no additional benefit in queuing more outstanding I/Os from the benchmarking tool.

As the graph indicates, IT/IR controllers have the highest OIO support, so they scale up even after the Qd exceeds 32. MegaRAID controllers show the next highest OIO support; however, the MegaRAID Windows driver limits the Qd per physical drive to 32.

NOTE Each operating system driver might have slightly different algorithms as to how the Max Adapter Outstanding IO is divided amongst the available disks.

iMR controllers show the lowest maximum OIO support because of resource limitations, so their scaling is the lowest among these controllers.

Figure 8 JBOD Write 8 SAS SSDs Throughput of All RAID Types for Sequential, Random, and OLTP Workloads

NOTE The previous figure highlights a 12Gb/s SAS SSD under Windows 2008 R2 SP1, which is among the first devices to provide additional performance benefits beyond 32 outstanding I/Os.


Chapter 3: Build Your Test Setup

Preparing your test setup for performance benchmarking presents challenges. Many variables can affect performance, but keeping variables known and constant helps provide a reliable and repeatable measurement. This chapter covers parameters that are not expected to change between different tests of a performance test project. For example, consider a performance test whose goal is to measure R0, R5, and R6 performance on eight SAS HDDs directly attached to a 12Gb/s SAS controller; in this example, the 8-drive direct-attached SAS topology is a fixed configuration for all R0, R5, and R6 tests.

This chapter reviews set-up related parameters, such as the following:

 - Host system considerations
 - Storage topology
 - Storage components

3.1 Host System Considerations

Many host-specific factors can affect performance, including (but not limited to) the following:

 - Processor architecture
   — Processor organization and architecture
   — Processor count
   — Processor generation
   — Number of cores
   — Hyperthreading status
   — Processor clock speed
   — Chipset
 - Memory
   — Memory type
   — Memory speed
   — Memory configuration
 - PCIe slot
   — Link speed
   — Link width
   — Location relative to a processor (on multiprocessor architectures)
 - BIOS settings

The following sections discuss these factors in detail.

3.1.1 Processor Architecture and Core Organization

The CPU, memory, bridge, and PCI slot organization affects the system's efficiency. Newer multiprocessor systems with an architecture where the memory and the PCI slots connect directly to the CPUs, such as shown in the following figure, can perform much better than older system architectures, such as the one in Figure 10.


Figure 9 Series 9 SMC


Figure 10 Series 7 SMC

Performance measurement in older system architectures might not yield the maximum results that Avago publishes. Older systems are limited by memory, CPU clock, chipset, and mezzanine bus speed, which can all reduce maximum observed performance.

Processor Choice

In addition to the processor I/O architecture, the choice of each processor affects performance. Systems that use Intel Xeon® E3 (or larger) or 4th Generation Intel Core processors do not need chipset components for PCIe attachment because the chipset is part of the CPU. Avago recommends that LSISAS3008 and LSISAS3108 enterprise-class storage controller performance testing be done on systems with Intel Enterprise processors.

Page 27: Avago 6Gb/s SAS and 12Gb/s SAS Performance Tuning … · Avago 6Gb/s SAS and 12Gb/s SAS Performance Tuning Guide User Guide Version 1.0 October 2014 ... 1.3 Performance Measurement

Avago Technologies Confidential- 27 -

Avago 6Gb/s SAS and 12Gb/s SAS Performance Tuning GuideOctober 2014

Chapter 3: Build Your Test SetupHost System Considerations

Avago performance measurements and performance targets are based on the latest host computer components, used in 2-socket systems based on Intel processors. Use of any different host computer system likely results in lower measured performance.

Number of Processors and Cores

The more total cores available, the better the performance can be, because the I/O load is shared across different processor cores.

Hyper-Threading Technology

Hyper-Threading Technology (HTT) makes a physical processor appear as if it has more cores. For example, a processor with 16 physical cores might appear as 32 logical cores to the operating system. Enabling HTT usually gives better performance.

Process Affinity

Process affinity is the binding of a certain process to run on a specific processor in a multiprocessor environment. If the processes (applications) are not spread across the processor cores in a balanced manner, performance suffers because certain cores might be overloaded while others sit unused. Therefore, it is important to manage affinity and spread the load evenly across the cores. Environments that do not manage this affinity can lose performance, especially the IOPs of small I/Os.

Microsoft® Windows Server®, by default, does a good job of managing process to core assignments without user intervention. Linux distributions require you to explicitly assign processes to cores for optimal load balancing.
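On Linux, you can manage the affinity of the benchmark process yourself; a minimal sketch using only the standard library (taskset or numactl achieve the same result from the shell):

    # Pin the current process (and the I/O threads it spawns) to cores
    # 0-3 so the load does not migrate mid-measurement (Linux only).
    import os

    os.sched_setaffinity(0, {0, 1, 2, 3})  # pid 0 = the calling process
    print(os.sched_getaffinity(0))         # confirm the mask took effect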

CPU Clock

The CPU clock can affect performance. Higher clock rates improve system throughput, especially for 8-KB and smaller I/Os. A higher CPU clock speed improves latency as well.

For testing Avago SAS 12 Gb/s controllers or newer, use CPUs with a clock of 2.6 GHz or higher. Performance measurements with slower CPUs result in lower than optimal IOPs. Use enterprise computers for performance measurement of enterprise-class storage controllers; using older desktops or workstations for performance measurement yields lower results than expected.

3.1.2 Memory

In addition to the system architecture, the memory size, type, speed, bus width, and population affect system performance. Servers might have limitations in all of these areas. The server manufacturer provides a User's Guide, and possibly other documents, with specifications and guidance on memory selection and population. Refer to such documents to identify the configurations that best suit your case and provide the best performance.

Additional limitations might exist beyond the number of slots, the maximum size for each DIMM, total size, and the maximum speed that the system supports. The amount of memory and speed that a particular system supports can vary depending on the population. Multi-channel support can increase performance if populated correctly. For example, 12 DIMM slots (4 channels of 3 slots each) might be available. If a single DIMM is populated for each channel, in the recommended slots, performance might increase. However, if more than one slot for each channel is used, the bus speed and performance can decrease.

3.1.3 PCIe Slot Choice

The PCI slot location, PCIe link speed, link width, and relative location to a processor can affect performance, as described in the list that follows. The slot location and number of devices attached to the same processor or bridge can reduce throughput.


NOTE For Linux, you can use the built-in lspci command to find the controller's bus, device, and function number. For a Windows operating system, use the lspci build available at http://eternallybored.org/misc/pciutils/.

PCIe Link Width

Performance is linearly proportional to the link width of the PCIe slot. For example, with a x8 link you can achieve twice the performance of a x4 link. On motherboards, the physical connector of a PCIe slot might be wider than its electrical connection. Be wary of such PCIe connectors, and use the PCIe bus that matches the maximum links that the storage controller supports. Before you run performance tests, confirm that the negotiated link widths are as expected (x8 with a PCIe x8-capable slot, x4 with a PCIe x4-capable slot, and so on).

PCIe Link Speed

PCIe slots with different link speeds might be available on a motherboard. Choose a PCIe bus designed for the highest link speed for better performance. Before you run performance tests, confirm that the negotiated link rates are as expected (8 Gb/s with a PCIe 8 Gb/s-capable slot, 5 Gb/s with a PCIe 5 Gb/s slot, and so on).

Actual Link Speed versus Negotiated Link Speed

In an actual system, additional factors can cause the actual negotiated speed and width to be lower than the maximum supported by the slots or the storage controller. Therefore, you must verify the negotiated speed and width. Read the PCIe Configuration Space to verify the capabilities and the currently negotiated speed and width.

NOTE You can use tools such as lsiutil, Scrutiny, lspci, or MegaCLI to read the PCIe Configuration space.
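On a Linux host, the capability and the currently negotiated values appear in the LnkCap and LnkSta lines of lspci -vv output. The following sketch compares the two for one device; run it as root, and treat the bus:device.function address as a placeholder for your controller:

    # Compare PCIe link capability (LnkCap) with the negotiated link
    # (LnkSta) for one device, using lspci -vv (Linux, run as root).
    import re
    import subprocess

    def link_info(bdf):
        text = subprocess.run(["lspci", "-vv", "-s", bdf],
                              capture_output=True, check=True,
                              text=True).stdout
        cap = re.search(r"LnkCap:.*?Speed ([^,\s]+).*?Width (x\d+)", text)
        sta = re.search(r"LnkSta:.*?Speed ([^,\s]+).*?Width (x\d+)", text)
        return cap.groups(), sta.groups()

    capable, negotiated = link_info("01:00.0")  # placeholder address
    print("capable:   ", capable)
    print("negotiated:", negotiated)  # investigate any mismatch first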

PCIe Slot Location Relative to a Processor

The PCIe slot location relative to a processor in a multiprocessor environment can affect performance. See Figure 9 and Figure 10. Consider a case where the benchmark application runs on CPU1, either by operating system allocation or as forced by affinity settings.

If the storage controller is placed on PCIe slots 1 through 3 (native to CPU1), the system can give higher performance.

If the storage controller is placed on PCIe slots 4 through 6 (native to CPU2), the system gives lower performance.

These differences occur because of the additional latency caused by the access over the QPI bus between the CPUs, which does not occur if the memory and PCIe slot are native to the CPU.

NOTE When you use an unfamiliar server, test each PCIe slot to find the slot that gives the highest performance. Not all slots yield the same data throughput.

For Better Performance Results

 - Run the CPU at the maximum supported clock.
 - Use a CPU that provides a higher number of cores.
 - Use HTT.
 - Manage affinity to spread the load across the cores if the operating system does not automatically manage affinity.
 - Use the PCIe slot that gives the best performance compared to all the other slots available.

3.1.4 Non Uniform Memory Architecture

Non Uniform Memory Access (NUMA) is a multiprocessing feature that lets a CPU access the memory attached to another CPU. A process can reside in memory local to its CPU or in memory non-local to its CPU, and performance varies depending on where the data resides. Local memory accesses are faster, so performance is higher. Accessing non-local memory adds overhead because the access must cross the QPI interprocessor bus; this overhead increases latency and reduces performance. NUMA is proven to help multiprocessing and handling processes across CPUs, but the benefits are limited to particular workloads.

You can extend the PCIe slot location example in the previous section to NUMA. A process might run on CPU1 while the process memory is on the DDR3 native to the other CPU (CPU2). In this case, CPU1 incurs additional latency because of the QPI bus access, which is not present if the process uses the memory native to CPU1.

3.1.5 BIOS Options

System BIOS can provide configurable options that can affect performance. The following settings might be configurable in your system BIOS. Set them as follows for best performance:

 - Choose high performance rather than energy-saving or balanced options.
 - Increase fan settings to run cooler.
 - Enable hyperthreading to increase processor capabilities.
 - Set any controllable PCIe slot width and speed options to the maximum setting.
 - Set the QPI speed to maximum.
 - Set Maximum Read Request to Auto or the largest possible value.
 - Set Maximum Payload to Auto, 256 bytes, or larger.

NOTE The maximum payload size of the host system depends on the chipset. Setting the payload size to the maximum supported value provides maximum performance. Lower values cause higher overhead for each I/O and thus reduce performance.

3.2 Storage Components and Performance

Initiators, expanders, and targets are the major components that make up storage systems; the following sections discuss each component and its impact on performance. The following three basic elements comprise any storage topology:

 - Initiator
 - Expander
 - Target

Initiator

Initiators include host bus adapters, which might be an I/O controller or a RAID-on-Chip controller, and which might sit on a motherboard or on an HBA card that fits in any PCIe slot on a motherboard.

Expander

Expanders can be used as a simple JBOD or with a multitude of functions such as self-configuration, SCSI Enclosure Services (SES), zoning, and DataBolt. SAS switches made of multiple expanders might also be present.

Target

Targets can be any number of SAS or SATA drives (HDD or SSD), or an HBA in target mode.


3.2.1 Initiators and Performance

Avago storage controller operation modes are divided into the following major modes:

 - Initiator Target (IT)
 - Integrated RAID (IR)
 - MegaRAID
 - Integrated MegaRAID (iMR)

Initiator Target (IT)

IT mode allows the controller to support only the raw JBOD mode and does not allow any RAID capabilities. IT mode also lets the controller operate in target mode, as explained later in a separate subsection.

Integrated RAID (IR)

IR mode allows the controller to support basic RAID modes such as R0, R1, and R10. However, the firmware, rather than dedicated hardware, performs the RAID operations.

NOTE IR mode is defeatured and superseded by iMR from 12Gb/s SAS onward; therefore, this document does not discuss IR. Further references to RAID assume MegaRAID.

MegaRAID

MegaRAID mode uses the MegaRAID firmware stack and the hardware RAID modules, and it has DDR caching features. These controllers provide the best RAID capabilities and the highest RAID performance of all these modes.

Integrated MegaRAID (iMR)

Integrated MegaRAID mode, commonly referred to as iMR, uses the MegaRAID stack; however, the firmware implements the RAID functions instead of using the hardware RAID modules. The performance is significantly lower than in MegaRAID mode.

The RAID (IR/iMR/MegaRAID) modes allow JBOD options. However, the JBOD performance of RAID controllers might differ slightly from that of IT controllers because of programming differences in the firmware and drivers. Avago controllers support the following configurations:

 - JBOD
 - RAID0
 - RAID1
 - RAID10
 - RAID5
 - RAID6
 - RAID50
 - RAID60

3.2.1.1 Initiator Features that Affect Performance

The following sections review initiator features that affect performance.

Interrupt Coalescing

Interrupt coalescing allows more than one I/O completion to be coalesced before an interrupt is raised to the CPU. This option decreases the interrupts to the host CPU per I/O, which improves the maximum IOPs for small I/Os. For example, if the interrupt coalescing depth is set to 10, the host CPU is interrupted only once every 10 I/Os.


I/O Coalescing

I/O coalescing allows smaller I/Os to be grouped and processed together. This feature improves the throughput of small I/Os compared to handling the I/Os individually, and it is useful for RAID operations. For JBOD, this feature is not used because the I/Os can be handled faster with the FastPath feature, if the hardware supports FastPath. See the following FastPath section for more information.

Maximum I/O Size

The storage controller capability might limit the maximum I/O size. Performance measurements usually use 0.5-KB to 4-MB I/O sizes; however, controllers might not natively support I/Os up to 4 MB.

For example, the MegaRAID maximum I/O size limit is 252 KB. Larger I/Os are split into multiple I/Os and processed, which can impact performance because it effectively reduces the number of outstanding I/Os per drive. Additional overheads might result from the split and join operations, as the sketch that follows quantifies.
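The cost of the split is easy to quantify. A minimal Python sketch using the 252-KB figure quoted above:

    # How a controller-side maximum I/O size splits large host I/Os.
    import math

    def split_count(io_size_kb, max_io_kb=252):
        """Number of controller I/Os generated by one host I/O."""
        return math.ceil(io_size_kb / max_io_kb)

    # A 1-MB host I/O becomes 5 back-to-back controller I/Os, so each
    # large queued I/O effectively consumes several outstanding slots.
    print(split_count(1024))  # 5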

FastPath

Storage controllers might support hardware FastPath, where a hardware I/O accelerator handles I/Os without firmware involvement. FastPath helps improve performance for JBODs and certain RAID configurations. RAID configurations that require parity calculations, and RAID volumes that use cache for their reads or writes, require firmware involvement and cannot use FastPath. Use FastPath whenever your application permits.

Multipath and Drive Listing

IT and IR controllers do not support a multipath topology; MegaRAID controllers do. When drives are used in a multipath topology with IT/IR controllers, each drive is listed twice, and the enclosure/slot mapping set in the controller NVDATA decides how the Target IDs for the drives are assigned. MegaRAID controllers expose only one path and balance the drives across multiple ports when both ports of the controller are used, so that both controller ports can perform at maximum performance.

NOTE The MegaRAID controller allows multiple paths, but only one path is used for I/Os at any time, not both paths at the same time; current designs do not support active-active I/Os on both paths. Using both ports of a drive might help for SSDs, but not for HDDs: on an HDD the data must go to a single medium, whereas an SSD can write the data in parallel, so using multiple ports can scale performance.

DMA Engines – Single Context versus Dual Context

Storage controllers use DMA engines as part of their I/O processing, and the configuration of these DMA engines can significantly impact performance. Avago 12Gb/s SAS controllers have eight Tx DMAs. A DMA has two modes, single context and dual context, and different configurations perform differently in these modes. Use EDFBFlags[6:5] in Manufacturing Page 30 of NVDATA to control these modes, and tune these fields to the mode that best suits your application.

For direct attached devices, single context mode might suffice, but dual context mode best suits 12Gb/s SAS expanders with DataBolt enabled:

 - The IT controller sets all eight DMA engines to dual context mode when 12Gb/s SAS expanders are detected in the topology; otherwise, the controller sets all eight DMA engines to single context mode.
 - MegaRAID sets four DMA engines to single context mode and four to dual context mode by default. When 24 or more devices are attached, MegaRAID sets all eight Tx DMA engines to dual context mode.

NOTE If the DMA context is not set correctly you might see issues such as the performance not scaling with an odd number of drives, but scaling with an even number of drives, or vice versa.


I/O Size Tuning for Expander Buffering Solutions

12Gb/s SAS storage controllers offer other tunable parameters to improve the buffering solutions that 12Gb/s SAS expanders provide. The parameters, located in Manufacturing Page 30 of the controller NVDATA, include:

 - EDFBMaxGroupUnload
 - EDFBThresholdSAS
 - EDFBThresholdSATA

EDFBMaxGroupUnload

Specifies the maximum number of entries that a specific DMA group unloads to the DMA engines before moving to another DMA group. A value of 0 in this field uses the hardware default value. From phase 4 of the 12Gb/s SAS firmware, this field is set to 4, which is the recommended value.

EDFBThresholdSAS and EDFBThresholdSATA

Specify the maximum number of data frames (SAS and SATA, respectively) to transmit during EDFB before switching to an alternate context. A value of 0x00 indicates that the firmware should program the setting based on values returned by the EDFB expanders it locates. If multiple EDFB expanders return differing values, the firmware uses the lowest value found.

For Avago 12Gb/s SAS expanders, set these parameters to 0; the controller then assigns the values dynamically by issuing vendor-specific SMP commands to the Avago 12Gb/s SAS expanders.

With non-Avago expanders, set these fields based on the buffer size.

3.2.2 Expanders and Performance

3.2.2.1 Expanders and Latency

Expander attached topologies incur additional latency at each expander level because of arbitration and additional I/O connection time. Arbitration can become a large component of the performance impact, especially in deeply cascaded topologies, because each additional expander in the cascade adds to the time required to establish a connection.

 - In 12Gb/s SAS expanders, the arbitration process can take from as little as 161.33 nanoseconds to a maximum of 401.66 nanoseconds.
 - 12Gb/s SAS expanders add 53.33 nanoseconds of connection time for each expander in the topology.

These times are in the nanosecond range, which is small compared to the time drives take to process I/Os and return responses, but you must still understand the impact expanders can have on the overall storage fabric and the cost each additional expander adds.
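As a rough model of that impact, a Python sketch using the per-expander figures above, under the simplifying assumption that the arbitration and connection delays apply at every expander in the path:

    # Rough added connection latency through a cascade of 12Gb/s SAS
    # expanders, in nanoseconds, per the figures quoted above.
    ARBITRATION_NS = (161.33, 401.66)  # best / worst case per expander
    CONNECT_NS = 53.33                 # per expander in the path

    def added_latency_ns(hops):
        best = hops * (ARBITRATION_NS[0] + CONNECT_NS)
        worst = hops * (ARBITRATION_NS[1] + CONNECT_NS)
        return best, worst

    print(added_latency_ns(3))  # 3 cascaded expanders: ~644 to ~1365 ns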

3.2.2.2 DataBolt Technology

12Gb/s SAS expanders support DataBolt® technology, or buffering (previously called end device frame buffering (EDFB)). The DataBolt technology allows 3Gb/s and 6Gb/s SAS drives to transfer the data at up to 6Gb/s and 12Gb/s SAS rates respectively, with the use of a buffering module on the expander phys. DataBolt technology removes rate matching, so the performance per drive nearly doubles.

In expander attached configurations with DataBolt disabled, the negotiated link rate for a 6 Gb/s drive is 6 Gb/s. With DataBolt enabled, the negotiated link rate for a 6 Gb/s drive is 12 Gb/s. The DataBolt feature does not affect drives that support 12 Gb/s speeds. Refer to the LSI DataBolt Bandwidth Aggregation Technology: 12Gb/s SAS Performance Test Results White Paper for more information.

The following concepts pertain to DataBolt technology:

Rate Matching

Rate matching permits faster communication channels to run traffic to slower devices by slowing down the faster channel with deletable ALIGN primitives. As these primitives route through expanders, the primitives are removed and the data is then spaced at the rate expected by the slower device. The problem with rate matching is that, during communication with slower devices, the faster channel yields 50 percent to 75 percent less throughput during the connection.

DataBolt Technology

DataBolt technology can permit communication between channels that operate at different link rates without using rate matching. DataBolt uses two 24-KB buffers dedicated to inbound and outbound transactions, which permit read and write commands to be serviced at the same time in a nonblocking memory fashion. This action is transparent to the attached devices, which makes for seamless integration into SAS domains. The DataBolt technology is T10 compliant.

HDD versus SSD

DataBolt technology functions with HDDs and SSDs. Consider how the performance characteristics of each target device might impact DataBolt performance. The expander manufacturing page exposes several tuning parameters that can affect the performance measured through the expander; the optimal values depend on the target devices used. Avago does not recommend modifying the default tuning parameters in the manufacturing page without extensive testing. The default values were tested with both HDDs and SSDs independently and found to be sufficient for both drive types.

Enable DataBolt

The LSISAS3x36 or LSISAS3x48 expanders do not enable the DataBolt feature by default. Use the expander manufacturing page 0xFF15 to manually enable the DataBolt feature. You must modify the XML file included in the expander firmware package, build a new manufacturing page, and upload the page to the expander. Refer to the 12Gb/s SAS/SATA Expander Firmware Configuration Programming Guide. The following steps describe the general process:

1. Create a separate copy of the default sas3xMfg.xml file. The exact file name depends on the expander version. For example, the manufacturing page XML file for an evaluation LSISAS3x48 expander is named sas3xMfgEval.xml.

2. Modify the XML file using the following changes:

— Set EDFBEnable to 00000001
— Set EDFBPhyEnablesLow to FFFFFFFF to enable EDFB on PHY 0 to PHY 31
— Set EDFBPhyEnablesHigh to FF to enable DataBolt on PHY 32 to PHY 39
— Set PhyMaskLow in EDFBPerfSettings to FFFFFFFF to enable DataBolt performance tuning on PHY 0 to PHY 31
— Set PhyMaskLow in EDFBPerfSettings to 000000FF to enable DataBolt performance tuning on PHY 32 to PHY 39

3. Save the XML file.

4. Refer to LSI 12Gb/s Expander Tools (Xtools) User Guide to build (use g3xmfg) and upload (use g3xutil) the manufacturing page.

5. Reset the expander for the new changes to take effect.

6. Use the edfbinfo command to verify the DataBolt status. Refer to the LSI 12Gb/s SAS/SATA Expander SDK Programming Guide for more information.
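If you prefer to script the edits in step 2, a sketch using Python's standard ElementTree module follows. The tag names are the field names listed above, but the exact layout of the manufacturing-page XML is an assumption; verify the element paths against your copy of the file before building the page:

    # Hedged sketch of step 2: set the DataBolt (EDFB) fields in a copy
    # of the manufacturing-page XML. Tag names match the fields listed
    # above; the file layout is assumed, so adapt the lookups as needed.
    import xml.etree.ElementTree as ET

    tree = ET.parse("sas3xMfgEval_copy.xml")  # your copy from step 1
    root = tree.getroot()

    for tag, value in (("EDFBEnable", "00000001"),
                       ("EDFBPhyEnablesLow", "FFFFFFFF"),
                       ("EDFBPhyEnablesHigh", "FF")):
        for elem in root.iter(tag):
            elem.text = value

    # PhyMaskLow appears under EDFBPerfSettings; step 2 uses FFFFFFFF
    # for PHY 0-31 and 000000FF for the PHY 32-39 entry.
    for settings in root.iter("EDFBPerfSettings"):
        for mask in settings.iter("PhyMaskLow"):
            mask.text = "FFFFFFFF"

    tree.write("sas3xMfgEval_copy.xml")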

3.2.3 Storage Drives and Performance

The storage targets can be any number of SAS or SATA drives of type HDD or SSD, or a controller in target mode (see Section 3.2.4, Target-Mode Controllers and Performance for information about controllers in target mode). This section discusses different features of these drives and the effect on performance.

SSD versus HDD

 - Solid-state drives (SSDs) perform especially well with random I/Os because they have no rotating parts.


 - Brand-new SSDs show very high performance compared to used SSDs because the performance of an SSD depends on what was previously written. You must precondition SSDs for their performance to be repeatable; HDDs need no such preconditioning. See Section 3.2.5, SSD Preconditioning, for preconditioning information.
 - HDDs perform better with disk write cache enabled, whereas SSDs do not gain much from disk write cache.
 - HDD performance varies based on where most data is located: performance improves if the data is at the outer sectors (short stroking) and is lower if the data is at the inner sectors. SSD performance is homogeneous because the drives have zero seek time.

SAS versus SATA

 - SAS provides better enterprise features than SATA, and SAS drives usually perform better than SATA drives.
 - SATA drives usually have larger density and slower rotational speed, and therefore perform slower than SAS drives.
 - SATA performance is lower when the drives attach to expanders, because the SATA tunneling protocol (STP) adds translations that are not present on native SATA transfers.

SAS Nearline

 - Nearline SAS drives combine the larger density that comes with SATA drives and the reliable interface performance that comes with SAS.
 - The rotational speed of SAS nearline drives is the same as that of SATA drives.
 - The performance of SAS nearline drives falls between that of native SAS drives and SATA drives.

Link Speed

The faster the drive interface link rate, the better the performance. For example, 6Gb/s SAS drives perform better than 3Gb/s SAS drives.

Rotational Speed

The higher the rotational speed, the better the performance. For example, 15,000 RPM drives perform better than 7,000 RPM or 10,000 RPM drives.

3.2.4 Target-Mode Controllers and Performance

IT controllers can act as a SAS target and receive commands from SAS initiators through the SAS connection. To complete read/write operations, a target-mode controller copies data between the host memory and the initiator over SAS; because no drive media is in the path, target performance is high.

A target that performs better than the available SAS/SATA drives helps with loopback testing and other useful performance tests. For example, an LSISAS3008 controller in target mode provides up to 300,000 IOPs for small I/Os and up to 5600 MB/s for large I/Os, which is extremely high compared to any drive target.

3.2.5 SSD Preconditioning

SSD performance and latency at any instant depend on what was written to the flash before that instant. It is important to precondition SSDs, that is, to run I/Os to the storage until a steady state is reached. If SSDs are not preconditioned, you might see inconsistent performance results when the SSD enters its maintenance mode to erase used sectors so the sectors can be rewritten, a process known as garbage collection. Performance is not faster after preconditioning; preconditioning is designed to have benchmarks measure the steady-state (slower) performance instead of the initial (faster) performance.

The unique features of SSDs create distinct requirements for accurately measuring SSD performance. The SNIA Solid State Storage Initiative's Performance Test Specification (PTS 1.1) clearly defines steady-state performance, and Avago uses the SNIA definition as a benchmarking guideline. For more information on the SNIA benchmarking standardization, access the document at the following link: http://snia.org/sites/default/files/SSS%20PTS%20Client%20-%20v1.1.pdf

Avago emphasizes that the SSD benchmarking goal is repeatable, accurate, consistent, and representative results. The following sections describe two methods to accurately measure SSD performance. The first method uses the SNIA methodology and generates the most precise and accurate benchmarking results, but it can take considerable time to execute. The second method provides comparably accurate results but takes significantly less time to complete. The benchmarking tool used is irrelevant.

3.2.5.1 SNIA SSD Preconditioning

1. Purge the drive.

Prior to any artificial benchmarking, put the drive into a known state that emulates the state as received from the manufacturer. This state is typically called the fresh-out-of-box (FOB) state. Most devices support a secure erase command and others might have a proprietary method to put the drives into a known state.

2. Write the entire user capacity of the device twice with 128-KB sequential writes aligned on 4-KB boundaries. Set the queue depth to the highest value supported by the device.

Avago generally uses 256 for RAID virtual devices. You can use a smaller value for HBA testing.

The amount of time it takes to write the device twice depends on the capacity and the performance. You can estimate this value by multiplying the capacity of the device in MB by 2, then dividing that number by the steady-state megabytes per second obtained by the benchmark tool. The result is the number of seconds needed to write the entire device; for added margin, you can add time. (A small calculation for this estimate and for the steady-state checks in step 3 appears after these steps.)

3. Run the desired data point until steady state is achieved.

You can determine steady state using two methods: data excursion and slope excursion.

Data Excursion

The variation of y within the measurement window is within 20% of the average (Max(y) − Min(y) <= 0.20 × Average(y)).

Slope Excursion

A linear curve fit of the data within the measurement window stays within 10% of the average within the measurement window.

The measurement window is the anticipated area of steady state. You can often determine the measurement window by simple observation.

4. Collect data immediately after you reach steady state.

Idle time can significantly change the performance numbers. Collect performance statistics for a long enough time to ensure precise averages; one to five minutes is generally sufficient, depending on performance characteristics.
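The fill-time estimate in step 2 and the steady-state criteria in step 3 are both small calculations. A minimal Python sketch (a simplified reading of the PTS criteria; window holds the per-interval results, such as MB/s, that your benchmark tool reports):

    # Step 2: estimate the seconds needed to write the device twice.
    def fill_time_s(capacity_mb, steady_write_mbps, passes=2):
        return passes * capacity_mb / steady_write_mbps

    # Step 3: data excursion within 20% of the window average, and the
    # fitted line's total excursion within 10% of the window average.
    def steady_state(window):
        n = len(window)            # needs at least 2 samples
        avg = sum(window) / n
        data_ok = (max(window) - min(window)) <= 0.20 * avg
        xm = (n - 1) / 2           # least-squares slope about the mean
        slope = (sum((i - xm) * (y - avg) for i, y in enumerate(window))
                 / sum((i - xm) ** 2 for i in range(n)))
        slope_ok = abs(slope * (n - 1)) <= 0.10 * avg
        return data_ok and slope_ok

    print(fill_time_s(800_000, 450))                 # 800-GB SSD: ~3556 s
    print(steady_state([412, 405, 409, 411, 408]))   # True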

3.2.5.2 Alternative SSD Preconditioning

The alternative preconditioning method recognizes that the full matrix of data points and tests (I/O sizes, queue depths, read:write mixtures) can number in the thousands, making it impractical to follow the SNIA-defined test flow. For that reason, Avago defined a shorter testing method that still provides repeatable and precise steady-state performance with minimal preconditioning overhead. This method reduces excessive preconditioning while maintaining accurate and consistent results, run over run. Very large capacity or cMLC drives might require additional preconditioning time, and data points should be verified using the SNIA method to ensure absolute steady state.

1. Purge the drive.

Prior to any artificial benchmarking, put the drive into a known state that emulates the state as received from the manufacturer. This state is typically called the fresh-out-of-box (FOB) state. Most devices support a secure erase command and others might have a proprietary method to put the drives into a known state.

2. Write the entire user capacity of the device twice with 128-KB sequential writes aligned on 4-KB boundaries. Set the queue depth to the highest value supported by the device.

Avago generally uses 256 for RAID virtual devices. You can use a smaller value for HBA testing.


The amount of time it takes to write the device twice depends on the capacity and the performance. You can estimate this value by multiplying the capacity of the device in MB by 2, then dividing that number by the steady-state megabytes per second obtained by the benchmark tool. The result is the number of seconds needed to write the entire device; for added margin, you can add time.

3. Run 1-MB sequential writes at queue depth 256, for 2 hours.

4. Run all sequential I/O patterns in the following pattern:

a. All writes.

Run small to large I/O sizes, and low to high queue depths. That is, run all I/Os at the smallest Qd first, and run all I/Os at the largest Qd last.

b. All reads.

Run small to large I/O sizes, and low to high queue depths.

5. Run 4-KB random writes at queue depth 256, for 4 hours.

6. Run all random I/O patterns in the following pattern:

a. All writes.

Run small to large I/O sizes, and low to high queue depths.

b. All reads.

Run small to large I/O sizes, and low to high queue depths.

3.3 Storage Topology

The number and type of initiators, expanders, and targets present in a topology might vary, but all storage topologies can be categorized into one of the following:

 - Direct attached
 - Expander attached - single
 - Expander attached - cascade
 - Expander attached - tree
 - Multipath

3.3.1 Direct Attached Topology

A direct attached topology has no expander, so the number of drives is limited by the number of phys available on the initiator. Usually 8 or 16 drives are direct attached to a storage controller.

Figure 11 Direct Attached Topology Example

If a single expander can support the number of drives you need, with sufficient ports available for the controller, use that configuration. For example, consider the following setups:

1. Single SAS3x48 expander with x8 wide ports to a controller and 40 drives

2. Two SAS3x48 expanders with two x4 wide ports to a controller and 40 drives


The first topology gives better overall performance than the second one.

Direct Attached Topology Considerations

 - Simple configuration.
 - No added latency that might occur with expanders.
 - Might be suitable to check the storage controller's maximum IOPs performance and basic latency characteristics.
 - Not suitable for checking the maximum bandwidth (MBPS) of a storage controller, because the MBPS is usually drive limited.

3.3.2 Expander Attached Topology - Single

The expander attached topology - single has only one expander and is common for applications similar to Just a Bunch of Disks (JBOD). The number of phys on the expander, excluding the phys that connect to the initiators, limits the number of drives.

Figure 12 Single Expander Topology Example

Expander Attached Topology - Single Considerations

 - Relatively simple topology.
 - Suitable for checking both the maximum MBPS and IOPs of a storage controller. However, the topology can still be drive limited for MBPS if the drives perform poorly.
 - Expanders allow bandwidth aggregation. Adding a x8 link to an initiator, instead of x4, can reach up to twice the x4 link performance.
 - The latency is higher than in a direct attached topology, but lower than in multiple-expander topologies.
 - The same expander chip can be configured differently (in terms of routing and other phy/connection attributes), so performance can vary between two platforms that use the same expander chip.

3.3.3 Expander Attached Topology - Cascade

The expander attached topology - cascade has two or more expanders connected in series (cascade) fashion. This topology is used when more drives are needed than a single expander can support; an edge expander set is an example of this topology. If n expanders are present, a maximum of n + 1 hops is present between the controller and the last expander.


Figure 13 Cascaded Expander Topology Example

Expander Attached Topology - Cascade Considerations

 - Relatively complex, but suitable for scaling the topology with more drives.
 - Suitable for checking both the maximum MBPS and IOPs of one or more storage controllers.
 - Suitable for measuring the expander's capability to route I/Os under different use cases.
 - Use x4 or x8 connections between expanders and between a controller and expander. You must keep the link width uniform throughout the cascade; one narrower link introduces a bottleneck.
 - Latency increases for devices farther down the cascade compared to devices near the controller. The drives at the far end must win arbitration at each expander level, and the drives directly attached to the expanders at each level have a higher probability of winning the arbitration. The more levels (or hops), the higher the latency. Arbitration schemes other than the default might reduce this impact.
 - Less fault tolerant: if one expander fails, all the storage behind it is unreachable.
 - The same expander chip can be configured differently (in terms of routing and other phy/connection attributes), so performance can vary between two platforms that use the same expander chip.

3.3.4 Expander Attached Topology - Tree

The expander attached topology - tree has two or more expanders connected in a tree branching fashion to reduce the maximum number of hops between controller and expanders. The number of phys on the controller limits the number of expanders that you can connect in parallel.


Figure 14 Tree Expander Topology Example

Expander Attached Topology - Tree Considerations

Relatively complex topology, but suitable for scaling the topology with more drives. Suitable for checking both maximum MBPS and IOPS of one or more storage controllers. Suitable for measuring the expander’s capability to route the I/Os under different use cases. Latency is lesser than in the cascade configuration. Improved fault tolerance than in cascade configuration. If one expander fails, the storage on other branches

might still be available depending on the topology.

3.3.5 Multipath Topology

The multipath topology is a more complex variation of the cascade and tree topologies. This topology usually uses two or more initiators and allows multiple paths from multiple initiators to each drive, so availability is higher than in other topologies. The multipath topology requires SAS drives because they are dual ported, unlike SATA drives. The following figure illustrates multipath in a simple manner. In a more complex example, multiple expanders could replace each single expander, in either a cascade or tree fashion, to allow many more drives to connect while each still has a path to both sides.


Figure 15 Path Redundancy Application Example

Multipath Topology Considerations

- Very complex; suitable for large topologies and external storage enclosures.
- Higher fault tolerance than the other topologies.
- Suitable for checking both the maximum MBps and IOPS of one or more storage controllers in many different use cases.
- Suitable for measuring the expander capability under different use cases.
- Adds multiple variables to the performance. For the performance to be deterministic, you must know the status of all the devices. A minor glitch can affect the performance on a large scale. Performance issues are hard to reproduce and debug.

3.3.6 Topology Guidelines for Better Performance

- Choose a topology that best suits your need, keeping in mind performance and latency.
- Make sure to use correct cables and that no signal integrity (SI) issues exist on these cables or connectors. Otherwise, long debug times might occur.
- Make sure your drives and expanders are detected properly, using any tool of your preference (example tools include MSM, Scrutiny, StorCLI/MegaCLI, sg_utils, device manager listing, storlibTest, or lsiutil).
- Make sure the link width between expanders, and between expanders and controllers, is wide enough throughout the topology so you do not add additional bottlenecks to your application.
- Use the DataBolt technology with 3Gb/s and 6Gb/s SAS drives. Avoid using the phys with the DataBolt feature to connect to initiators and expanders, so more DataBolt-capable phys are available for drives.

NOTE PHYS [40:47] of the SAS3x48 expander do not offer DataBolt capability.

- Excess SES polling can interfere with arbitration and indirectly affect the performance. Make sure your SES polling intervals are not so short that they cause such interaction.


Before you run performance tests, confirm that all connected links are up and that the negotiated link rates are as expected (12 Gb/s with SAS 12Gb/s drives or expanders, 6 Gb/s with SAS 6Gb/s drives or expanders, and so on). You can use the following tools to view SAS link information: the lsigetwin utility, the lsigetlin utility, LSIUTIL, or Scrutiny.


Chapter 4: Configure Your Test Parameters

This chapter discusses parameters that may change between your tests in the same performance testing project. For example, volume configurations can change between tests. This chapter covers the following topics:

- Operating system environments
- Volume configurations
- Benchmarking and system monitoring tools

4.1 Operating System Environments

Operating systems behave differently when it comes to performance because each system's default settings, configurations, and modes of operation can differ.

4.1.1 Windows Operating System

The parameters discussed in this section might apply to all Windows operating systems; however, this section uses the Windows Server 2008 R2 operating system as an example.

4.1.1.1 Windows Operating System Hotfixes

In general, disable operating system automatic updates. Apply only the hotfixes or updates suggested for Windows operating systems that fix known performance issues. Assess the possible performance risk before you install any hotfix or update. Installing Windows hotfixes can alter your performance. For example, the Windows Server 2008 R2 hotfix KB2769701 can exhibit high variability of small I/O throughput.

4.1.1.2 MSI-X Interrupt Vectors

By default, the Windows operating system usually does a good job using the available and supported MSI-X vectors on storage controllers. The exact number of MSI-X vectors depends on the number of CPU cores. For better performance, confirm that the total number of interrupt vectors assigned is greater than 16 and the interrupts are balanced across all the CPU cores. Use the following steps to help confirm that the total number of interrupt vectors assigned is greater than 16:

1. Navigate to Start > Control Panel > Device Manager.

2. Right-click the controller in the Storage Controllers device list.

3. Select Properties.

4. Click the Resources tab. The Resource settings box shows each IRQ entry paired with the number of interrupt vectors currently assigned.

4.1.1.3 Process Affinity

Affinity refers to the binding of a process to specific processor cores on a multiprocessor system. You can force any application to run on specific processor cores by using the following steps:

1. Navigate to Windows Task Manager > Processes.

2. Right-click the process of interest. For example, Dynamo.exe if you run an IOmeter benchmark.

3. Select Set Affinity and choose the CPUs or Nodes on which you would like to run this process. You can select one or more.
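You can also launch a process with a preset affinity from a command prompt by using the built-in start command. The mask value and program name in this sketch are illustrative assumptions:

Example: start /affinity 3 Dynamo.exe launches Dynamo.exe bound to CPU 0 and CPU 1 (the /affinity value is a hexadecimal CPU mask; 3 is binary 11).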

4.1.1.4 Driver Version and Customization

Use the latest recommended driver version of the storage controller for better performance results.


Avago storage controllers ship with driver default settings tuned for the best performance. In certain cases, your application or topology might require custom settings. For such needs, the Windows Driver Configuration Utility (WDCFG) is provided with the driver package. You can customize different driver parameters and choose the values that best match your needs. This utility provides run-time control over various configuration parameters (registry entries) that configure the Avago host storage drivers used on the Windows operating system. The driver package contains the utility and its user guide. Use WDCFG to make all changes, rather than editing the registry manually, because WDCFG provides a number of safety checks and protections, including a history stack that permits a return to prior settings.

NOTE Work with your FAE for any assistance with WDCFG and to choose the settings most suitable for your application.

The following table provides information regarding the parameters that can affect performance.

Table 6 Customizable 6Gb/s SAS, 12Gb/s SAS, and MegaRAID Driver Parameters

Each parameter entry lists the minimum, maximum, and default values and the stability impact, followed by a description.

6Gb/s SAS, 12Gb/s SAS, and MegaRAID

NumberofRequestMessageBuffers (Minimum: 10; Maximum: 9,999,999; Default: —; Stability Impact: Low)
Number of request message buffers to allocate at SOD. The driver makes sure that the value used is not greater than the IOC's reported request FIFO depth. The default value is the number of firmware Global Credits that the firmware issues to the host driver. Be extremely careful when you change this parameter because it can cause starvation of I/O resources at lower levels, adversely affecting storage system operation. Also be careful with the maximum, because requests that exceed the firmware credits are dropped.

6Gb/s SAS and 12Gb/s SAS Only

DisableFwQueueFullHandling (Minimum: 0; Maximum: 1; Default: 0; Stability Impact: None)
For SAS devices only. If this registry entry is present and the value is nonzero, the firmware does not handle Queue Full returns by the target; they return to the host driver. If this registry entry is not present, or is present with a 0 value, the firmware handles the Queue Full return.

MaxSASQueueDepth (Minimum: 1; Maximum: 254; Default: 64; Stability Impact: None)
The maximum number of concurrent I/Os issued to a single target ID for SAS devices. Setting this value too high can cause multiple Queue Full returns back to the OS, which can cause Event 11 and Event 15 to appear in the Windows Event Log. A 0 value results in a queue depth of 1. Values greater than 254 are forced to 20 by StorPort; therefore, 254 is the maximum value. Setting this parameter to anything larger than the maximum target queue depth currently in use by target devices causes Queue Full status returns. The maximum target queue depth in use by target devices is a very nebulous number, which changes dynamically over time based on the present mixture of I/O sizes, current workload, and resources available in the target device. This parameter affects storage system performance; storage system stability should not be impacted.


MaxSATAQueueDepth (Minimum: 1; Maximum: 254; Default: 32; Stability Impact: None)
The maximum number of concurrent I/Os issued to a single target ID for SATA devices. Setting this value too high can cause multiple Queue Full returns back to the OS, which can cause Event 11 and Event 15 to appear in the Windows Event Log. A 0 value results in a queue depth of 1. Values greater than 254 are forced to 20 by StorPort; therefore, 254 is the maximum value. Setting this parameter to anything larger than the maximum target queue depth currently in use by target devices causes Queue Full status returns. The maximum target queue depth in use by target devices is a very nebulous number, which changes dynamically over time based on the present mixture of I/O sizes, current workload, and resources available in the target device. This parameter affects storage system performance; storage system stability should not be impacted.

MaxSGList (Minimum: 17 (64 KB); Maximum: 513 (2 MB); Default: 257 (1 MB); Stability Impact: High)
Controls the maximum I/O size that the driver handles. Setting this parameter to 33 provides a 128-KB maximum I/O size. You can significantly impact storage system performance by setting this parameter, which can be tuned to optimize for specific I/O sizes. The default setting (257) provides good performance when I/O demands vary (the normal situation).

MegaRAID Only

balancecount (Minimum: 1; Maximum: 4,294,967,295; Default: —; Stability Impact: None)
The maximum number of I/Os sent to each disk in a RAID 1 volume before switching to the other disk.

busywaitcount (Minimum: 1; Maximum: 4,294,967,295; Default: —; Stability Impact: None)
The number of requests that the adapter must complete before it resumes I/O requests to the miniport driver, in a Queue Full state.

coalescedepth (Minimum: 1; Maximum: 4,294,967,295; Default: —; Stability Impact: None)
The maximum number of I/Os that the driver can coalesce.

coalescestart (Minimum: 2; Maximum: 4,294,967,295; Default: —; Stability Impact: None)
The minimum number of I/Os before the driver starts to coalesce.

fastpathoff (Minimum: 1; Maximum: 1; Default: not used (a); Stability Impact: None)
Disables the FastPath I/O algorithm for all I/Os. To enable FastPath, you must remove this parameter entirely.

limitsges (Minimum: 1; Maximum: 1; Default: not used (a); Stability Impact: None)
If the parameter exists, the maximum number of SGEs for MPT frames is limited.

maxnumrequests (Minimum: 1; Maximum: 1024; Default: —; Stability Impact: None)
Sets the maximum number of requests from the OS.

maxtransfersize (Minimum: 1; Maximum: 4,294,967,295; Default: —; Stability Impact: None)
Sets the maximum number of bytes to transfer.

msiqueues (Minimum: 1; Maximum: 16; Default: —; Stability Impact: None)
Sets the maximum number of MSI queues.

nobusywait (Minimum: 1; Maximum: 1; Default: not used (a); Stability Impact: None)
If this parameter exists, the driver returns the Queue Full status back to StorPort and does not use the StorportBusyWait mechanism.

nonuma (Minimum: 1; Maximum: 1; Default: not used (a); Stability Impact: None)
If this parameter exists, NUMA support is disabled in the driver.

Nosrbflush (Minimum: 1; Maximum: 1; Default: not used (a); Stability Impact: None)
If this parameter exists, the driver issues the SRB_FUNCTION_FLUSH command to the firmware when it receives the command from the OS. By default, this command is not passed to the firmware and is completed back to the OS by the driver. Enabling this parameter causes a significant performance drop.

Qdepth (Minimum: 1; Maximum: 254; Default: —; Stability Impact: None)
Sets the device queue depth.

a. By default, this registry entry is not present.


4.1.1.5 Disk Write Cache

Disk write cache (or write back cache) permits your system to function faster by acknowledging a write while data is still in the disk cache rather than waiting for the data to commit to storage media. However, if power is lost prior to the actual write, the data is lost. Therefore, not all drives permit write cache. Enabling disk write cache improves performance, but you must consider your specific situation before you decide to enable write cache. SSDs generally ignore the disk write cache setting because they might have their own internal caching algorithm.

You can enable write caching in the controller firmware (write back cache for MegaRAID products) or in the operating system. From Disk Manager, right-click the drive and select Properties > Policies. See the following example screen.

Figure 16 Write Cache Policies Example

Write-cache buffer flushing is enabled (unchecked, as shown by the second checkbox in the previous figure) by default. If this feature is enabled, an I/O must actually be written to the drive before it completes. That is, writing an I/O to the cache does not mean the I/O is complete.

If write-cache buffer flushing is enabled and you use unbuffered I/O, the performance is the same as disabling the drive's write cache (see the following table). Some benchmark tools, such as HD Tune Pro, use buffered I/O (not direct I/O). Iometer uses unbuffered I/O (direct I/O).

Results indicate that Flushing Enabled is synonymous with no drive cache when you compare writes. Everything must still reach the drive platters to complete. Reads are not affected.

The following table shows how enabling or disabling the write cache option and the write-cache buffer flushing option affects the performance.

Table 7 Windows Write-Cache Buffer Flushing Comparison

Block Size (KB)   Windows Flushing Enabled (IOPS)   Windows Flushing Disabled (IOPS)   Write Cache Disabled (IOPS)
0.5               250                               23,486                             250
1                 250                               22,331                             250
2                 250                               21,468                             250
4                 249                               20,098                             249
8                 248                               17,855                             248
16                245                               11,984                             246
32                240                               5,997                              241
64                232                               2,977                              232
128               215                               1,499                              215
256               189                               747                                188
512               151                               377                                152
1024              108                               189                                108
2048              69                                94                                 69
4096              40                                47                                 40
8192              21                                24                                 22


4.1.2 Linux Operating System

Linux default settings might not be tuned for the best performance, and you might need to tune them manually. This section assumes a general Linux distribution; some steps might differ on your Linux distribution. Refer to the documentation from your distributor for equivalent commands and settings.

4.1.2.1 Linux Kernel Version

The Linux kernel version depends on the Linux distribution that it ships with. For better performance, adhere to the following guidelines:

- Choose a distribution with the latest kernel version that is free of any known performance issues.
- Make sure your Linux distribution supports the RQ_affinity = 2 option.

4.1.2.2 Linux Drivers

Avago storage controllers are released with prebuilt drivers. Move to the latest recommended driver version of the storage controller for better performance results.

While applying operating system updates, the Linux kernel version might change, and the newer kernel might not pick up the latest driver installed on the system; the kernel might pick the in-box version instead. In such cases, build the latest driver from its source and install it. The source is available as dynamic kernel module support (DKMS). The instructions to build and install a Linux device driver by using DKMS are available with Avago product documentation, such as the MegaRAID SAS Device Driver Installation User Guide.

4.1.2.3 MSI-X Interrupt Vectors

Linux might not balance the MSI-X interrupt vectors across the CPU cores. You might have to manually assign the interrupts to the cores.

IT Controllers

To list all interrupts assigned to the controller, run cat /proc/interrupts | grep mptsas. Avago provides the set_affinity.sh script with its driver downloads to set the interrupt affinity automatically.

MegaRAID Controllers

Use one of the following methods to set the affinity:

If you use Linux version 6.3 or newer and the RQ_affinity setting is available, set the RQ_affinity setting to 2 for the device. Do not use the affinity steps in the following option.

Use the command line as outlined in this section to assign the interrupt affinity manually.

Use the CPU ID of the core and the IRQ number of the interrupt vector in the /proc file system to assign interrupt vectors to specific cores:

echo "{CPU ID MASK}" > /proc/irq/{IRQ #}/smp_affinity



Run cat /proc/interrupts to identify the IRQ numbers assigned to the controller. Run grep with mptsas to filter the results.

Add manually-assigned IRQs to the banned interrupts list to avoid rebalancing by the system across the CPUs:

export IRQBALANCE_BANNED_INTERRUPTS="{IRQ #}...{IRQ #}"
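Putting the two steps together, a minimal sketch that spreads two hypothetical IRQ numbers (130 and 131) across cores 0 and 1 and bans them from rebalancing might look like the following:

echo "1" > /proc/irq/130/smp_affinity
echo "2" > /proc/irq/131/smp_affinity
export IRQBALANCE_BANNED_INTERRUPTS="130 131"

The masks 1 and 2 select CPU core 0 and CPU core 1, respectively.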

4.1.2.4 I/O Scheduler

The Linux kernel uses I/O scheduling to control disk access. The 2.6 kernel lets applications select different I/O schedulers, depending on usage patterns, to optimize the kernel I/O. The four I/O schedulers are Completely Fair Queuing (the default), Deadline, NOOP, and Anticipatory. Avago uses the Deadline and NOOP schedulers for performance tuning. Make sure to run the same I/O scheduler for all storage devices attached to the same controller.

Deadline (deadline)

This option uses a deadline algorithm aimed to minimize I/O latency to provide near real-time behavior. The algorithm uses a round robin policy among multiple I/O requests to prevent starvation. Avago uses the Deadline scheduler for storage systems with both rotating media and with SSDs to reduce the I/O latency as much as possible.

You must change the I/O scheduler on each individual device:

Syntax: echo {Scheduler-Name} > /sys/block/{Device-Name}/queue/scheduler

Example: echo "deadline" > /sys/block/"SAS3108" /queue/scheduler

To verify the I/O scheduler for a block device run:

Syntax: cat /sys/block/{Device-Name}/queue/scheduler

Example: cat /sys/block/"SAS3108" /queue/scheduler
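In the output of the cat command, the active scheduler typically appears in square brackets. For example, a device set to Deadline reports something like:

noop [deadline] cfq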

NOOP (noop)

This option uses a basic FIFO queue and performs the minimum work required to complete an I/O. This algorithm assumes performance for an I/O is optimized by the application or another component in the system (block device, HBA, or externally attached controller).

You must change the I/O scheduler on each individual device:

Syntax: echo {Scheduler-Name} > /sys/block/{Device-Name}/queue/scheduler

Example: echo "noop" > /sys/block/"SAS3108" /queue/scheduler

To verify the I/O scheduler for a block device run:

Syntax: cat /sys/block/{Device-Name}/queue/scheduler

Example: cat /sys/block/"SAS3108" /queue/scheduler

4.1.2.5 Block Layer I/O Scheduler Queue

The Linux I/O scheduler queue size can impact performance. The queue size determines how many incoming requests are stored in the I/O scheduler's request queue for the scheduler to optimize. Configure the queue size at the block layer for individual devices through the nr_requests variable:

Syntax: echo "{QUEUE SIZE}" > /sys/block/{DEVICE NAME}/queue/nr_requests

Example: echo "128" > /sys/block/"SAS3008"/queue/nr_requests

You can query the current I/O scheduler queue size in a similar manner:

cat /sys/block/{DEVICE NAME}/queue/nr_requests

The I/O scheduler queue default size in most Linux versions is 128; that is, 128 reads and 128 writes can queue to the device at any instance before the process is put to sleep. You can increase or decrease the size. The queue size might impact the system performance.


Latency-sensitive applications that use writeback I/O might lower the nr_requests value to prevent filling the device queue with write I/Os. The exact queue size that yields optimal performance varies from system to system and is workload dependent. Test this setting on your system to decide what value yields the best performance.

4.1.2.6 SCSI Queue Depth

The SCSI queue depth defines the number of transfers that can be outstanding for the device at any given time. You can configure this limit at the block layer for individual devices through the queue_depth variable:

echo "{QUEUE SIZE}" > /sys/block/{DEVICE NAME}/device/queue_depth

You can query the current SCSI queue depth in a similar manner:

cat /sys/block/{DEVICE NAME}/device/queue_depth

The default SCSI queue depth for an I/O device in Linux varies on the device. For example, a SAS hard drive might have a default SCSI queue depth of 32. You can increase or decrease this value. The SCSI queue depth might impact the system performance. The exact SCSI queue depth yielding optimal performance varies from system to system and is workload dependent. Test this setting on your system to decide what setting yields the best performance.
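As a concrete illustration (the device name sda and the value 64 are assumptions; tune for your workload):

Example: echo "64" > /sys/block/"sda"/device/queue_depth

Example: cat /sys/block/"sda"/device/queue_depth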

4.1.2.7 Nomerges Setting

The nomerges setting helps manage contiguous I/Os. This setting affects Write Back performance optimization. To optimize Write Back performance, set nomerges to 0 for HDDs or to 1 for SSDs.

NOTE The nomerges option requires the device queue depth setting. If the device queue depth setting is less than the queue depth pushed from the benchmarking tool, the block layer performs merges even if nomerges is set to 1.

Syntax: echo "{NOMERGES}" > /sys/block/{DEVICE NAME}/queue/nomerges

Example: echo "0" > /sys/block/"sda"/queue/nomerges

4.1.2.8 Rotational Setting

The rotational setting states if the device is rotational (1, HDD) or nonrotational (0, SSD). The driver properly sets this value unless you switch drives without performing a driver reload.

Syntax: echo "{rotational}" > /sys/block/{DEVICE NAME}/queue/rotational

Example: For HDDs: echo "1" > /sys/block/"sda"/queue/rotational
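A matching example for nonrotational devices, assuming sdb is an SSD:

Example: For SSDs: echo "0" > /sys/block/"sdb"/queue/rotational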

4.1.2.9 Add Random Setting

The add_random setting manages the device's contribution to the disk entropy pool. The default value is 1. Set add_random to 0 for SSDs, because contributing to the random entropy pool does not benefit SSD performance.

Syntax: echo "{add_random}" > /sys/block/{DEVICE NAME}/queue/add_random

Example: echo "0" > /sys/block/"sdb"/queue/add_random

4.1.2.10 Linux Write Cache

Use the hdparm tool to enable write cache for SATA drives. Use the sdparm tool to enable write cache on SAS drives. Refer to the Linux man page for each command for details.
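A minimal sketch, assuming /dev/sda is a SATA drive and /dev/sdb is a SAS drive:

Example: hdparm -W1 /dev/sda enables the write cache on the SATA drive (hdparm -W0 disables it).

Example: sdparm --set=WCE /dev/sdb enables the write cache on the SAS drive (sdparm --clear=WCE disables it).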


4.2 Volume Configurations

4.2.1 Volume Configurations and Performance

Performance varies depending on how a volume is configured. Understanding the following ideas helps you to understand performance:

Drive group: A group of one or more physical drives. Drive groups can be made of any simple RAID type, such as R0, R1, R5, or R6; or of a spanned RAID type, such as R10, R50, or R60.

Logical drives, virtual drives, or volumes: You can create logical drives, virtual drives, or volumes from drive groups. A virtual drive can be on a single drive or on a drive group.

When creating the drive groups and virtual drives, you must choose certain parameters, such as the following parameters that can affect the performance:

- Volume type (R0, R1, R10, R5, R6, R50, R60)
- Stripe size (64 KB or 256 KB)
- Read cache policy (read ahead, no read ahead)
- Write cache policy (write back, write through)
- I/O policy (direct I/O or cached I/O)
- Access policy
- Disk cache policy (enabled, disabled, unchanged)
- Consistency and initialization
- Background operations
- FastPath capability

Sections that follow review the effects of these parameters on performance.

4.2.2 Volume Type

JBOD

Just a Bunch of Drives (JBOD) indicates a raw mode without any RAID feature. This scenario is equivalent to IT mode or to the JBOD mode in MegaRAID/iMR. JBOD is the fastest mode in terms of performance per drive because JBOD does not have the RAID overhead. Though JBOD mode is simple to use, it lacks the redundancy, fault tolerance, and performance benefits that a RAID mode provides.

The hardware in the LSISAS2208, LSISAS2308, LSISAS3004, LSISAS3008, and LSISAS3108 controllers can completely run these I/Os without firmware involvement. This feature is called FastPath. FastPath is possible with some RAID modes as well, which Section 4.2.9, MegaRAID FastPath Software reviews.

RAID

The RAID feature permits multiple drives to be configured as a RAID volume or virtual drive (VD) and exposed to the operating system as a single drive. Refer to Chapter 2 in the MegaRAID SAS Software User’s Guide for a RAID introduction. Avago MegaRAID firmware supports RAID levels 0, 1, 5, 6, 10, 50, and 60. The LSISAS3004 and LSISAS3008 iMR firmware supports RAID levels 0, 1, 5, 10, and 50. The MegaRAID SAS RAID controllers provide reliability, high performance, and fault-tolerant disk subsystem management.

Multiple volumes generally yield better performance. Performance varies between RAID levels and I/O types. Sequential I/Os with RAID 0 (striping) typically perform the best, and sequential I/Os with RAID 1 (mirroring) perform the lowest. Random write I/Os with RAID 5 or RAID 6 typically have lower performance because of parity calculations.

RAID 0: RAID 0 stripes the data and uses more than one drive to write the stripes. By doing so, RAID 0 provides a performance improvement through parallel writes and reads. Performance scales with the number of drives present in the RAID 0 volume, which is advantageous compared to using a single drive to store all the data, because single-drive performance is limited to that drive's performance. With the same number of drives, RAID 0 performance is almost the same as JBOD performance.

RAID 1: RAID 1 mirrors the data of one drive to another drive. Two writes must occur for each write, so the write performance does not double with two drives. However, RAID 1 helps read performance. Data can be read from either drive, and the reads can occur in parallel. With a proper load-balancing algorithm, reads scale to almost twice the performance of a single drive. Performance is impacted if the volume is undergoing a rebuild.

RAID 10: RAID 10 uses striping and mirroring, so the performance features of both RAID 0 and RAID 1 apply. Read performance scales almost up to the number of drives present in the RAID 10 volume; however, the write performance scales only up to half the number of drives. Performance is impacted if the volume is undergoing a rebuild.

RAID 5: RAID 5 calculates and distributes parity across the drives. Write performance suffers because of these parity calculations. However, read performance scales almost up to the number of data drives present (total drives minus one parity drive). Using the hardware RAID accelerators in ROCs improves the write performance. If the firmware or software handles the parity calculations, the performance decreases. Performance is affected if the volume is undergoing a rebuild.

Initialize RAID 5 volumes for better performance, because consistent volumes avoid the need to access each drive individually to do the read-modify-write.

RAID 6: RAID 6 uses dual distributed parity, similar to RAID 5. Dual parity calculations do not show significant overhead compared to single parity, so the performance is similar to RAID 5 volumes. Initialize the volumes for better performance. Performance is affected if the volume is undergoing a rebuild.

RAID 50: RAID 50 is the span formed by RAID 5 and RAID 0, and thus combines the performance properties of both RAID 5 and RAID 0. That is, data is striped to use more than one drive at a time. This approach reduces the rebuild times compared to a single large RAID 5 made of all drives.

RAID 60: RAID 60 is the span formed by RAID 6 and RAID 0. RAID 60 is similar to RAID 50, previously described.

NOTE RAID50 and RAID60 performance results are almost the same as RAID5 and RAID6 performance results, respectively; this document does not discuss RAID50 and RAID60 explicitly in detail.

Performance and Volume Type

Read Performance: Redundancy allows the same data to be present in more than one location, so it provides the liberty to load balance reads across different drives. The read performance scales with the level of redundancy. For example, R1 has two drives with the same data. The data can be read simultaneously from both drives, so you can achieve almost twice the single-drive performance.


Write Performance: Striping (R0, R10, R50, or R60) allows data to be written to more than one drive in parallel. The write performance of the volume scales with the number of data drives present in a stripe.

Parity Generation: Parity generation in the R5, R6, R50, and R60 modes adds overhead to the writes and limits the write performance. The ROC IC hardware RAID modules can compute these XOR (parity) calculations at a faster rate than the firmware computation, so the ROCs provide higher R5, R6, R50, and R60 write performance than the IOCs. However, the data still must be cached, which requires the firmware, so the performance is lower compared to the hardware FastPath I/Os.

4.2.3 Strip Size

Strip size determines how much overhead is involved during a write operation. In general, the lower the strip size, the higher the number of striping operations per I/O, and the lower the performance. That is, the higher the strip size, the better the performance. However, performance is negatively impacted if host commands are larger than the strip size or if multiple random I/Os land within the same stripe (strip size × number of data drives); for example, eight data drives with a 256-KB strip size form a 2-MB stripe. You can improve performance by matching the strip size to the expected I/O size. The 256-KB default size provides a compromise for general operation with small, random I/O and large, streaming I/O.

6Gb/s SAS MegaRAID controllers used a 64-KB strip size as the default. Newer MegaRAID controllers use a 256-KB strip size as the default. iMR controllers support only a 64-KB strip size. The maximum strip size that MegaRAID supports is 1024 KB.

4.2.4 Cache Policy

Cache can improve read and write performance. However, if the cache cannot keep up with the I/Os that use it, the cache becomes the bottleneck. Cached I/Os cannot use the hardware FastPath engine, so the performance might be lower.

Read Ahead and Write Back modes use the cache for reads and writes, respectively. This combination suits HDD volumes, but not SSD volumes. Using the cache on the front end boosts performance because the cache is flushed later in the background; accessing rotational HDDs is significantly slower than cache accesses.

No Read Ahead and Write Through modes avoid using the cache for both reads and writes. If no parity generation exists, this setting helps I/Os use FastPath and improves performance. This combination suits SSD volumes, but not HDD volumes. Accessing SSDs directly gives better performance and latency than using the cache in between.

Read Ahead with Write Through and No Read Ahead with Write Back modes help only reads or only writes, respectively. Because these two modes use the cache, the I/Os cannot use FastPath and the performance decreases.

4.2.5 Disk Cache Policy

Enabling disk write cache (physical drive cache for MegaRAID) helps write performance. However, enterprise servers might keep the disk write cache disabled to avoid any data integrity issues that can arise if a drive loses power abruptly. Using disk write cache might not be advantageous for SSDs. Disk write cache is advantageous for HDDs because cache writes are significantly faster than writes to the rotating media.

4.2.6 I/O Policy

I/O policy allows Cached I/O and Direct I/O modes. Cached I/O retains the write data in the cache, whereas Direct I/O releases the cache line after the writes to the disks complete. For consistent performance, use the Direct I/O policy, because Cached I/O performance might vary over time depending on the availability of cache lines and on what I/O already resides in the cache as dirty data.

4.2.7 Consistency and Initialization

Initialization is the process of writing zeros to a volume. Initialization is important for consistent performance, especially for volumes that require parity generation. Consistent volumes do not need the additional read-modify-writes that inconsistent volumes require.

If initialization is running in the background, as explained in the following section, performance is affected. Wait until the initialization finishes before you run the actual performance tests.

4.2.8 Background Operations

Background operations significantly impact performance because benchmarking tools do not account for such I/Os. Background Initialization, Patrol Read, Consistency Check, Rebuild, and Reconstruction operations should not run while measuring I/O performance. Make sure these operations are disabled or completed, and are not scheduled to run during the performance measurement.

MegaRAID provides options to control the percentage rate at which these background operations are issued. However, these percentages do not represent the exact bandwidth that the background operations consume. For example, setting the rebuild rate to 30% does not mean that 70% of the bandwidth is used for normal I/O. It only means that the wait time before submitting the next Rebuild command is set to 70% of its maximum wait time. The options control only the submission of the commands to the drives; background operations are usually handled when the controller is idle. If many I/Os are already in progress, I/Os might continue to proceed at almost the 100% rate.

4.2.9 MegaRAID FastPath Software

Avago MegaRAID FastPath software is a high-performance I/O accelerator for SSDs that can be enabled so that a hardware I/O accelerator handles I/Os without firmware involvement. The FastPath feature is available in some 6Gb/s Avago MegaRAID SAS controller cards and in all 12Gb/s Avago MegaRAID SAS controller cards through the purchase of a software license. Consult the product documentation for any Avago MegaRAID card to determine its FastPath capability.

Using FastPath benefits certain workloads, depending on the RAID type and volume configuration. Though the FastPath feature can be enabled and available for all configurations, it is not always possible to use FastPath for all configurations. Therefore, the MegaRAID firmware uses FastPath only for configurations where its use makes sense. The following table identifies which configurations can use FastPath:

Table 8 FastPath Software Capability Matrix

RAID Level                       HDDs, HDDs and SSDs    SSDs Only
                                 Reads    Writes        Reads    Writes
IT Adapter                       Yes      Yes           Yes      Yes
MegaRAID JBOD, iMR JBOD          Yes      Yes           Yes      Yes
RAID 0                           Yes      Yes           Yes      Yes
RAID 1 (two drives)              Yes      No            Yes      No
RAID 1 (more than two drives)    No       No            Yes      No
RAID 10                          No       No            Yes      No
RAID 5 and RAID 50               Yes      No            Yes      No
RAID 6 and RAID 60               Yes      No            Yes      No


The Avago MegaRAID controller uses FastPath under the following conditions:

- The virtual drive is configured with Write Through, No Read Ahead, and Direct I/O.
- Cut-through I/O is enabled in the controller.
- No background operations, such as consistency checks, volume initialization, patrol reads, or copy back, are running (verify by using MSM or StorCLI).
- The controller runs in non-degraded mode for the best performance possible.
- I/O operations are within a single RAID strip.

Avago MegaRAID topologies can support a mix of FastPath-enabled volumes and non-FastPath volumes. The firmware evaluates the FastPath capability on a per volume basis while doing a media check to determine if the underlying storage consists of HDDs or SSDs. If FastPath is enabled on a volume, the firmware does not touch the I/O in normal cases. The firmware is involved in error cases. With FastPath software, an Avago MegaRAID controller can see substantial performance gains compared to non-FastPath configurations.

4.2.10 Guidelines on Volume Configurations for Better Performance

The following are guidelines only. Choose your options based on what best suits your application.

HDD Volume Parameter Settings
— Stripe Size: 256 KB (default)
— Read Policy: Always Read Ahead
— Write Policy: Write Back
— I/O Policy: Direct I/O
— Access Policy: Read Write
— Disk Cache Policy: Unchanged

SSD Volume Parameter Settings
— Stripe Size: 256 KB (default)
— Read Policy: No Read Ahead
— Write Policy: Write Through
— I/O Policy: Direct I/O
— Access Policy: Read Write
— Disk Cache Policy: Unchanged
— If prompted to enable SSD Caching (CacheCade), respond No

- Make sure your volumes are consistent before you run performance tests.
- Make sure no background operations are running.
- Make sure a sufficient queue depth (Qd) is set from the benchmarking tools. Set the queue depths in accordance with the number of physical drives: a Qd of 8 per drive for a JBOD is not the same as a Qd of 8 for an 8-drive RAID volume. Set the Qd to 64 (8 × 8) for the 8-drive RAID volume to get the same performance.

- You might need to increase the number of volumes present in a drive group to get better performance. You might also need to increase the number of threads/workers to match the number of virtual drives (volumes). For example, on a RAID 0 drive group made of 8 physical drives, it is better to create 2, 4, or 8 volumes and assign them to different workers instead of creating one volume and assigning it to one worker.

4.3 Software Tools

After you complete your storage topology and install the necessary operating system, you need software tools to set up and monitor your configuration and to measure the performance. The following table summarizes the tools used in the Avago performance lab. Refer to the documentation of the product or tool of your interest to use the tool that best suits your need.

The second table describes benchmarking tools; the following chapter describes the commonly used benchmarking tools in detail.

Table 9 Tools to Program and Configure the Storage Controllers and Expander

Sasflash: Programs SAS controllers; not for use with MegaRAID products. Use the sas2flash tool for 6Gb/s SAS and the sas3flash tool for 12Gb/s SAS.

Sas2parser: Merges the NVDATA files of the SAS controllers with their firmware images. Useful for making custom changes in NVDATA, then merging and flashing to the controller.

Storcli: Command line tool to program, monitor, and manage MegaRAID 6Gb/s SAS and 12Gb/s SAS controllers. Useful for scripting and automation.

MegaCli: Command line tool to program, monitor, and manage MegaRAID 6Gb/s SAS controllers. Useful for scripting and automation.

MegaRAID Storage Manager (MSM): GUI-based tool to program, monitor, and manage MegaRAID 6Gb/s SAS and 12Gb/s SAS controllers. Easy to start with and to configure different parameters.

MegaREC: Recovery tool for MegaRAID controllers. Useful if the controller is bricked.

Lsiutil: Internal tool with many debugging options for Avago controllers and expanders. Not for use with MegaRAID products.

Scrutiny: Customer tool for debugging and configuring Avago controllers and expanders. Officially supported for 12Gb/s SAS controllers and expanders.

Xtools (Xflash/Xutil/Xmfg): Xutil and Xflash flash the firmware and manufacturing images of 6Gb/s SAS expanders (g3xutil and g3xflash are for 12Gb/s SAS expanders). Xmfg creates a manufacturing image from XML files.

Table 10 Benchmarking Tools

IOmeter: Easy-to-use, GUI-based benchmarking tool with synthetic workloads. Works with Windows and Linux, and supports command line options as well. Not suitable if latency must be analyzed in depth.

VDBench: Command line tool that is powerful for measuring latency at a greater granularity. Supports Windows and Linux. Java-based benchmark tool, so it is easy to run on any OS. Provides text- and HTML-based results.

Fio: Command line tool suitable for benchmarking, QA, and verification purposes. Provides different I/O engines and various result formats.

JetStress: Benchmarking tool that simulates a Microsoft Exchange database workload without a full Exchange installation. Suitable as a real-world workload simulator. Not a general-purpose tool.

TPC-C (Transaction Processing Performance Council – C): An online transaction processing (OLTP) benchmark. Simulates a complete environment where a population of terminal operators executes transactions against a database. Transaction-based tool.

TPC-E: Simulates the OLTP workload of a brokerage firm. The focus of the benchmark is the central database that executes transactions related to the firm's customer accounts.

Orion: Oracle Orion is a tool for predicting the performance of an Oracle database without having to install Oracle or create a database.


The following table describes system tools.

Table 11 System Tools

Windows – Perfmon: Performance monitoring tool that comes with Windows. Allows creating different performance counters to measure any performance parameter of interest; for example, plot the MBps of an SSD over time to verify whether the preconditioning is sufficient.

Windows – Xperf: Comes as part of the Windows Performance Toolkit. Needs the Windows Performance Recorder (WPR) and the Windows Performance Analyzer (WPA).

Windows – msinfo32: Windows tool to collect all the information about the system. Allows saving the complete configuration to a file.

Linux – Xtrace: eXtended trace utility, similar to strace, ptrace, and truss, but with extended functionality and unique features, such as dumping function calls (dynamically or statically linked), dumping the call stack, and more.

Linux – mdadm: Linux utility to manage software RAID devices. Allows creating software-level RAID volumes on any non-RAID storage controller as well.

Linux – sar: A Linux command that writes to standard output the contents of selected cumulative activity counters in the operating system.

Linux – iostat: Linux command to report CPU statistics and I/O statistics for devices, partitions, and network filesystems (NFS).

Linux – blktrace: Linux command to generate traces of the I/O traffic on block devices.

Linux – blkparse: Linux command to produce formatted output of event streams of block devices.

Windows – diskpart: Windows tool to manage objects (disks, partitions, or volumes) by using scripts or direct input at a command prompt.

4.3.1 Linux Performance Monitoring Tools

When you run a performance test under Linux, you can monitor the system during the test. This monitoring can give additional insight into the system that might not be available through the performance test alone. Linux offers several command line tools installed by default to help you monitor a Linux system.

Many tools discussed in this chapter are a part of the sysstat package. This package is installed by default on many common Linux distributions or available in the package repositories.

4.3.1.1 sar

The sar command line utility reports the values of cumulative activity counters in the Linux operating system. Invoke sar by using one of the following two methods:

- standalone: Looks for the current day's data and shows the performance data recorded for the current day.
- sar file: Invoked by using the -f flag, passing in a sar data file; shows the performance data stored in that sa file.


You must enable sar logging to use sar. Many systems enable sar by default. Debian®-based systems might require that you modify /etc/default/sysstat to set ENABLED to true. Red Hat®-based systems enable sar by default and are set to log 7 days of statistics. Use the following syntax:

sar <command flags> <# of seconds between each run> <# of times to run sar>

For example, sar -u 5 3 runs the cumulative real-time CPU usage every five seconds, a total of three times.

sar generates a wide range of statistics based on the command flags provided. The following command flags are the most common:

- -u: real-time usage of all CPUs
- -P: real-time usage of individual CPUs or cores
- -r: memory statistics, including free and used memory
- -b: I/O statistics, including transactions and bytes, broken down by reads and writes
- -d: I/O statistics for individual block devices
- -w: number of context switches per second
- -q: run queue and load average
- -s: report the data starting at the specified start time
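For example, to watch block device activity during a test run, you might combine these flags; the interval and count values here are illustrative:

Example: sar -d -p 5 12 reports I/O statistics for individual block devices every five seconds, twelve times (-p prints device names in a readable form).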


4.3.1.2 iostat

The iostat command line utility reports system storage input and output statistics, and observes the time the devices are active in relation to their average transfer rates. iostat generates the following three report types:

- CPU Utilization: global averages among all processors
- Device Utilization: statistics for each physical device or partition
- Network Filesystem: statistics for each mounted network filesystem

A typical use is to run iostat during a performance test to monitor the system in near real time. You can use the following command line interval options to have iostat display either a single report or a continuous report at fixed intervals:

- iostat -d 5: displays a continuous device report every five seconds
- iostat -d 5 3: displays three device reports at five-second intervals

iostat can report the statistics only for specific devices if the devices are passed into the command line options, as the following example shows:

- iostat -d sda sdb 5 3: displays three device reports at five-second intervals for device sda and device sdb

The report interval to use with iostat depends on the test and the type of behavior that you want to observe. Frequent iostat reports (such as every 2 seconds) might be better suited to picking up small events during the test, while more infrequent iostat reports (such as every minute) might give a better overall view of the system. iostat output goes to Linux stdout and can be redirected to a text file, if necessary, by using the > operator.

The following command flags are the most common:

- -c: display the CPU use report
- -d: display the device use report
- -n: display the network filesystem report
- -k: display statistics in KB/s
- -m: display statistics in MB/s
- -t: display the time of each report
- -x: display extended statistics
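As a combined illustration (the device name and interval are assumptions), the flags can be stacked:

Example: iostat -d -x -m sda 5 3 displays three extended device reports, in MB/s, at five-second intervals for device sda.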


4.3.1.3 blktrace

The blktrace block layer I/O tracing mechanism provides detailed information about request queue operations up to user space. blktrace includes three major components: a kernel component, a utility to record the I/O trace information from the kernel to user space, and utilities to analyze and view the trace information.

First, mount the debug file system: mount -t debugfs debugfs /sys/kernel/debug

Second, run: blktrace -d <dev> [-r debugfs_path] [-o output] [-k] [-w time] [-a action] [-A action_mask] [-v]

The parameter options include filter masks, buffer information, tracing information, network information, file options, and version. Refer to the blktrace Linux man page for details and examples: http://linux.die.net/man/8/blktrace.


4.3.1.4 blkparse

The blkparse utility interprets the blktrace output files for metrics such as queue depth.
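A minimal trace-and-parse sequence might look like the following; the device name, output prefix, and 30-second window are illustrative assumptions:

Example: blktrace -d /dev/sda -o sda -w 30 records 30 seconds of traces to files named sda.blktrace.<cpu>.

Example: blkparse -i sda produces formatted per-event output and summary statistics from those trace files.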


4.3.2 Windows XPerf

Prerequisites

Before you can collect traces, you must install the .NET Framework 4.5 and the Windows SDK for Windows 8 or later. After installation, two new programs appear in the Start menu: Windows Performance Recorder and Windows Performance Analyzer.

Windows SDK provides the XPerf tool that collects various performance traces on Windows. Windows SDK for Windows 8 provides a graphical user interface tool to record and analyze performance traces. Use the following steps to record performance traces using Windows Performance Recorder:

1. Run the Windows Performance Recorder tool by navigating to Start > Windows Performance Recorder.

2. Click More Options to see additional performance profiles.

3. Select CPU Usage and Disk I/O Activity as shown in the following figure. Select additional profiles if necessary.

Figure 17 Profile Selection

4. Use the drop-down options to select the appropriate Performance scenario, Detail level, and Logging mode.

General, Verbose, and File, respectively, are good choices for all generic scenarios.

5. Click Start to begin recording.

6. Start any I/O generators or benchmarking tools to send I/Os to disk.

For example, run IOmeter with workloads that might uncover the performance issues currently being debugged.

7. After the appropriate test runs finish, click Save on the Windows Performance Recorder tool. Save the trace to a convenient location when asked, as shown in the following figure.


Figure 18 Save Test Example

8. Run the Windows Performance Analyzer tool and load the trace file saved in the previous step. The Performance Analyzer might take a few minutes to load and analyze the trace.

9. Double-click Graph Explorer to see all available traces.

10. Right-click the traces of interest and select Add Graph to Analysis View to add them to the Analysis view for further analysis, as shown in the following figure.


Figure 19 Graph Explorer Example

11. After you add all traces of interest to the analysis view, open the View Editor and make appropriate changes to customize the view. The following figure shows the View Editor icon.

Figure 20 Open the View Editor


4.3.3 Windows Performance Monitor (Perfmon)

The Windows operating system ships with a performance monitor to trace important performance counters and get deeper insight into the storage subsystem. Perfmon has an easy-to-use GUI to create and run performance monitoring tasks. Follow these steps to create a data collector profile and start a monitoring task.

1. Run perfmon at a command line or in the Windows Run dialog to launch the Perfmon GUI.

2. On the left of the Performance Monitor GUI, expand Data Collector Sets and right-click User Defined.

3. Select New > Data Collector Set.

4. In the screen that appears, enter a new name for the data collector set. Select Create manually (Advanced) and click Next.

5. In the screen that appears, select Create data logs and select Performance counter.

6. Click Next.

7. In the screen that appears, click Add to add performance counters of your interest to the Data Collector Set.

8. In the screen that appears, scroll to the Physical Disk section and click the down arrow to list the counter options.

Figure 21 Physical Disk Option

9. Select all Physical Disk counters and double-click or click Add. Apply them to either only the storage devices you are measuring, or select <All instances>.


Figure 22 Add Performance Counters

10. After you add the required performance counters to the list, click OK.

11. Complete the remaining Data Collector Set wizard actions.

The new Data Collector Set appears on the left of the Performance Monitor GUI, at Data Collector Sets > User Defined.

12. Select the Data Collector Set that you created and right-click DataCollector01 in the right portion of the GUI.

13. Select Properties from the pop-up menu.

14. In the screen that appears, select the desired Log format and Sample interval. Binary log format can be viewed by using a performance log viewer GUI that Windows provides. If you must process data, the comma separated values (CSV) format is recommended.
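If you collected a binary log and later need CSV, the relog utility that ships with Windows can convert it. A sketch, assuming a log file named DataCollector01.blg:

relog DataCollector01.blg -f csv -o DataCollector01.csv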


Figure 23 Properties

15. Click OK.

16. Right-click on the Data Collector Set and select Start from the pop-up menu to start the monitoring task.


Figure 24 Start Collector Set

Optionally, you can run the following commands to start and stop performance monitoring from the command line.

— logman start <Data Collector Set name>

— logman stop <Data Collector Set name>
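For example, for a data collector set named DiskPerf01 (a hypothetical name; logman query lists the sets defined on the system):

logman query
logman start DiskPerf01
logman stop DiskPerf01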

Perfmon starts logging data to the output directory specified when you created the data collector set.

When the performance monitoring stops, the results are stored in a DataCollector.csv file that you can import into Excel for analysis and graphing.


Chapter 5: Benchmark Resources

After you set up and configure your test hardware, choose a benchmark that best suits what you want to measure. Your benchmark must measure the metrics of your interest for sufficient duration and with sufficient granularity.

Each benchmarking tool has its own merits and demerits. Evaluate different benchmarks and choose the tool that produces results with the set of metrics that best suits your real-world workloads.

After you choose the benchmark, configure the input parameters and output formats correctly so that it captures all the relevant metrics properly. It is a good practice to have a simple, short test that is a good indicator of your actual runs. Before you run your actual test, run this sample test first to check the maximum IOPS and MBPS for different performance corners of your configuration. Compare the results of this sample test against your expected results to make sure they match. Sample test advantages include the following:

 Proves your topology is free of any obvious issues.
 Verifies the input parameters of the benchmarking tool are as expected.
 Verifies the output results format is as expected.

If your setup has any issues, sample tests catch the issues quickly, which saves time. Without these tests, you might only see the problem after the actual performance test completes, which can be lengthy.

The following sections provide detailed explanations on how to install, run, and interpret results for select commonly used benchmark tools.

5.1 Benchmarking Basics

This section discusses basic parameters related to benchmarking and their impact on performance.

 Workers or Threads: Benchmarking tools use threads to send I/Os and to measure the performance metrics. IOmeter uses workers, whereas Linux tools use threads.

 Managers and Instances: A manager is one instance of the benchmark tool and can have one or more workers. You can have more than one manager; in Linux, you may run multiple instances of the benchmark. However, at the end you must merge the results from all the managers or instances to get the complete result. Typically only one manager is needed, but you might require more than one when multiple controllers are benchmarked at the same time, or when multiple unrelated metrics are measured at the same time.

 Queue Depth: The number of outstanding I/Os per drive. The benchmark tools allow direct modification. Qd may be set for a physical drive, or for a logical drive in the case of RAID volumes.

Before running benchmarks, identify the minimum Qd at which the drives are saturated and provide maximum performance. Use a Qd that is equal to or greater than this minimum Qd for your benchmarking runs. Multiply this Qd by the number of physical drives to select the Qd for RAID volumes.
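For example, if each drive saturates at a Qd of 8, a RAID volume built from 12 such drives would use a volume Qd of 8 × 12 = 96.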

 I/O Type: I/Os can be sequential, random, or a mix of sequential and random. For HDDs, random performance is usually drive-limited; HDDs tend to give very high sequential performance and low random performance. For SSDs, both sequential and random performance numbers are at similar levels.

Real-world workloads are usually mixed I/O types. Synthetic benchmark tools such as IOmeter allow building complex workloads that mimic real-life workloads.


 I/O Size: The size of the I/O used to measure the performance metrics. Avago standard runs use I/Os from 0.5 KB to 1 MB or 4 MB. 0.5 KB, 4 KB, or 16 KB are better candidates to check the maximum IOPS; 256 KB and 1 MB are better candidates to check MB/s limits.

The I/O size influences performance in different ways because the storage controllers might have features to optimize performance for specific I/Os. For example:

 Small I/Os may be coalesced to achieve higher performance.
 Large I/Os may be broken into smaller ones because of the maximum I/O size limitation of the controller. MegaRAID controllers support up to 252 KB natively; I/Os above 252 KB are split into multiple I/Os, which causes overhead, so performance might drop slightly for larger I/Os.
 A storage controller may use Big Block Bypass, a feature that keeps larger I/Os from reaching the cache, to optimize cache usage and gain higher MB/s.

 I/O Direction: The direction in which the data flows. The direction can be read, write, or a mix of reads and writes. Benchmarks usually let you modify the percentage of reads and writes to define the I/O direction. For HDDs, write performance usually reaches higher maximum levels than read performance. For SSDs, write performance is lower because it can involve additional erasing and garbage collection.

 Ramp Time: The duration when I/Os are sent but no measurement is made. This time is important to avoid the transients that can occur at the start of the test. Allow sufficient ramp time before each actual test run.

 Run Time: The duration when I/Os are sent and performance metrics are measured. Too short a run time affects the consistency of the performance results; too long a run time increases the overall measurement time, so there is a trade-off in choosing the right run time. SSDs might need a longer run time. Tests that involve cache influence need longer ramp and run times.

 Scaling: Stepping a certain parameter in either a serial or a parallel fashion. The parameter that scales may be the number of drives, the number of workers, Qd, and so on. An example of serial scaling is adding drives in steps, irrespective of workers. An example of parallel scaling is adding one drive per worker at every step; if there are 8 workers, drives are added to all the workers at each step. The steps can be linear or exponential. For example:

 In a 60-drive configuration, for a specific I/O such as 64-K SW, the performance can be monitored for drive scaling with a linear step of 5 drives at a time. This helps ensure all the drives are used well, the system scales well up to 60 drives, and the system performs optimally for any number of drives.
 Single-drive performance may be monitored for Qd scaling with exponential scaling from 2 to 256 in steps of powers of 2. This helps choose the right Qd for each physical drive for the actual performance runs.

 Outliers: Outliers in performance are always possible. An outlier is a sample that behaves differently compared to the other measurements. The outlier could show itself between runs, or between different tests of the same I/O. Outliers usually indicate an issue with the device design or an unaccounted variable during the measurement. Repeat the tests for the same configuration, or scale the performance for different I/O sizes, drives, Qd, and so on, to find such outliers.


5.2 Iometer for Windows

Iometer is an I/O generator, measurement, and characterization tool for single and clustered systems. Iometer uses an easy-to-use GUI that provides command line options and the option to run in batch mode. Iometer is not a real end-user application, but is a tool with which to probe storage performance. You can run benchmarking on a local system or from a remote client over a network. This section discusses Iometer capabilities relative to storage performance only.

Avago uses the latest version of Iometer 1.1.0, available at http://sourceforge.net/projects/iometer/, because it provides varied entropy for the I/O data pattern. Entropy (randomness in data patterns) can impact some SSDs. Some SSDs tend to give very high performance when the same data is written again and again. Therefore, it is important to test with high entropy so the measured performance better represents real-world performance.

Iometer works based on a client-server model with two parts: Dynamo and IOmeter GUI. When you start the IOmeter GUI, dynamo starts.

After you build your topology, you have Managers (Dynamo instances) with one or more Workers (threads), and each worker is assigned a specific number of targets (physical drives or logical volumes). An I/O profile (access specifications), which can be saved as a configuration file (*.icf), is run on these targets to obtain results (results.csv) in comma-separated-value format. These results also appear in the GUI while the tests run.

In batch or command line mode, the command to obtain results might look like

iometer /c iometer.icf /r results.csv /t 100

Check everything in the GUI mode first, and create and save the configuration file before you run the tests in batch mode.

5.2.1 Run Iometer

Prerequisites

Verify that the controller or expander is plugged in and functioning with the system, and that Iometer is installed on the same system as the controller or expander.

Use the following steps and the Iometer User’s Guide to set up and run an Iometer test.

1. Verify that the driver for the controller is installed.

2. Set up your storage controller in a topology of your interest, including the necessary drives.

3. Go to Start > Control Panel > Device Manager > Disk Drives and verify that all your drives are listed.

4. Go to Start > Control Panel > Administrative Tools > Computer Management > Disk Management.

Some of your drives might be listed as Unknown/Not Initialized. Right-click the drive and select Initialize Disk. If not selected by default, select all the drives that you need to initialize. Use MBR (Master Boot Record) as your partition style; GPT (GUID Partition Table) is intended for Itanium-based systems or for disks larger than 2 TB.

You can run I/Os on GPT and raw partitions. If Iometer does not detect the drives by default when it starts, you can manually start Dynamo with the /force_raw command line option, which forces Dynamo to report all raw disks regardless of the partitions they contain.

Now your disks should be listed as Basic and Online. When you reboot your system, you might have to reinitialize some drives. Make sure all your drives are Basic and Online before you start your tests.
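If you prefer a scriptable alternative to the Disk Management GUI, the diskpart utility can also initialize drives. A minimal sketch; the disk number is an example, online disk is needed only if the disk is offline, and clean erases the existing partition table:

diskpart
list disk
select disk 1
online disk
clean
convert mbr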

5. Run iometer.exe.

6. Use the steps in Iometer User’s Guide to operate the GUI.

The following list calls out specific choices made by Avago during the Iometer set up:


— On the Disk Targets tab, set the number of outstanding I/Os in the # of Outstanding I/Os field. This value corresponds to the queue depth (Qd).
— Leave the Maximum Disk Size field at 0 so that all the virtual drive capacity is exercised.
— Do not enter changes in the Network Targets tab.
— To run small sequential read I/Os, choose the 512B; 100% Read; 0% random option from the Global Access Specifications box and click Add to move the test into the left panel, Assigned Access Specifications. You can also add additional workloads of 4 KB or 16 KB so that, for example, you have three workloads for your test. You can create custom access specifications by using the New, Edit, or Edit Copy options; for example, OLTP might require 70% reads and 30% writes, and such workloads are not defined by default.
— In the Test Setup tab, set the proper Run Time (30 to 60 seconds) and Ramp Up Time (10 to 20 seconds). The aim is to wait a sufficient amount of time for your results to stabilize and average out, which compensates for the transients that occur while switching between tests.
— In the Test Setup tab, keep the Cycling Options choice as Normal -- run all selected targets for all workers, or choose any other option that best suits your need.

7. Click Start Tests, the green-flag button, to start the test.

8. In the Access Specifications tab, the currently running test shows in green. The test number and remaining time are listed at the bottom-right of the Iometer GUI. If you must skip to a specific test, click Stop on the prior tests.

If you see errors, check your drives, cables, and connectors. Swap bad components from your setup with good components before you run the actual tests.

5.2.2 Iometer Tips and Tricks

 Obtain an .icf file from your Avago FAE as a starting point for your Iometer testing.
 Always set the Ramp and Test times, because if you leave the default 0/0 setting, your tests will not progress. Only the first test will run, and it runs until manually stopped.
 Ramp time should be sufficient for the test that follows. For example, SSDs need preconditioning, after which you might need to set the Ramp Time to one minute and the Run Time to two minutes.
 Do not assign the same target to multiple workers. This action can give unexpected results when you run sequential I/Os. When multiple workers send parallel I/Os sequentially, the end result might look similar to random I/Os from one worker.
 Queue depth (Qd) is the maximum number of outstanding I/Os that can be queued for each drive. Qd is usually set at eight for each drive. You can increase the Qd to determine the optimum Qd for your test.
 For RAID volumes, set the Qd based on the number of drives present in the volume. For example, for four drives set Qd = 32 to compare the performance with four drives at Qd = 8 in JBOD mode.
 With IOmeter 2006, systems with a CPU clock speed of 2 GHz and higher do not report accurate performance metrics. Always use the most recent version, IOmeter 1.1.0 or newer.

NOTE Avago does not report the results from the following test option in the final test results. Use the following test to test your adapters, not for actual performance reporting.

 Set Max Disk Size to the disk cache size to give an advantage to a specific I/O size and gain maximum performance. For example, Seagate Savvio® 15K.3 HDDs tend to give higher performance (about 400 MB/s versus about 190 MB/s normally) with 256-KB sequential read I/Os when the Max Disk Size is 1000 sectors. This setting can be handy when you need more performance from fewer drives during initial setup or troubleshooting. The I/Os are handled completely from the cache, not from the media. Do not treat these performance numbers as actual HDD performance.


5.2.3 Interpret Iometer Results

Iometer displays the data in real time on the GUI and can save the data to a file in comma-separated values (CSV) format. This CSV file can be post-processed to harvest the required information. Because CSV is an ASCII format, it can be viewed with any standard text editor; however, the raw CSV file can be difficult to understand. The following example is a section from an Iometer CSV output file.

Figure 25 Iometer CSV Output Example

You can import CSV files into a Microsoft Excel worksheet for faster and easier post-processing. Use the following steps to import a CSV file into Microsoft Excel and format it for easy consumption.

1. In Microsoft Excel, select File > Open.

2. Locate the CSV file and click Open. If you do not see your CSV file, try changing the file type to All Files or Text Files.

3. Select the row with the column headers and select Data > Filter to add filter options to the headers.


Figure 26 Filter Options Example

4. Filter the first column by MANAGER, as shown in the following figure.


Figure 27 Column Filter Example

5. Hide any columns that are of no interest to better organize the data.

In this example, only 10 columns with the most important data appear in the worksheet. The column names are self-explanatory.


Figure 28 Iometer Worksheet Example

Continue to filter or review the worksheet. Save the file as an Excel file when you finish.

5.2.4 Iometer References

 Download IOmeter 1.1.0, Most Recent Version: http://sourceforge.net/projects/iometer/
 IOmeter User Guide: http://sourceforge.net/p/iometer/svn/HEAD/tree/trunk/IOmeter/Docs/Iometer.pdf

5.3 Vdbench

The Vdbench tool is a command line program that generates disk I/O workloads to validate direct-attached storage performance and data integrity. Vdbench can also provide detailed latency information via a histogram and can generate workloads with varying intensities. When testing SSDs, Vdbench permits accurate specification of data entropy (randomness of the data pattern).

NOTE VDbench does not properly perform high-queue-depth, small-block sequential I/O. Instead, VDbench begins to randomize the I/Os and reports very low performance if you test multiple devices in JBOD mode where high performance, such as 1M IOPs, is expected. RAID testing is not affected. This behavior is not a bug; separate issues prevent Linux from reaching the high level of Windows’ small-block sequential performance.

Oracle develops and maintains the program. Use the following resources:


Package download, User Guide, source code, and discussion forum: http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html

SNIA™ Vdbench slide deck overview: http://snia.org/sites/default/files/Emerald%20Training%20-%20VDBENCH%20Overview.pdf

Avago recommends using Vdbench 5.04, or no older than 5.03 rc11.

5.3.1 Install Vdbench

Use the following steps to install the Vdbench program.

1. Download the Vdbench package from the following location, http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html, into an empty folder.

2. Unzip the package.

3. Move the entire unzipped package to the system on which you will run the benchmark.

4. Select the folder that matches your operating system and copy the contents into the folder that contains all the Vdbench files. These are the Vdbench system and operating-system specific components.

5. Install a Java® Runtime Environment on the system on which the benchmark will be run. You can download JREs for various operating systems from many sources. For example, at: http://www.oracle.com/technetwork/java/javase/downloads/jre7-downloads-1880261.html

6. Get the JRE 7 SE package.

7. Test the installation by entering the java -version command using one of the following prompts:

— In a Windows operating system, open a CMD window.
— In Linux, open a Term window.

A returned version number confirms that the JRE is installed.

8. Next, move to the folder in which Vdbench is installed. In the command or term window, type vdbench -test.

If Vdbench installed correctly, a short test runs with output similar to the following output:

Vdbench distribution: vdbench503rc9
For documentation, see 'vdbench.pdf'.
17:09:00.028 input argument scanned: '-f/tmp/parmfile'
17:09:00.090 Starting slave: /root/Desktop/FB_Vdb_2E512-4x/vdbench SlaveJvm -m localhost -n localhost-10-011231-17.08.59.982 -l localhost-0 -p 5570
17:09:00.429 All slaves are now connected
17:09:01.002 Starting RD=rd1; I/O rate: 100; elapsed=5; For loops: None

Dec 31, 2001  interval    i/o     MB/sec   bytes   read    resp    read    write   resp    resp     queue   cpu%    cpu%
                          rate    1024**2  i/o     pct     time    resp    resp    max     stddev   depth   sys+u   sys
17:09:02.085         1    87.00      0.08   1024   54.02   0.008   0.006   0.011   0.019    0.003     0.0     4.0    0.6
17:09:03.014         2   100.00      0.10   1024   52.00   0.008   0.005   0.010   0.018    0.004     0.0     1.9    1.1
17:09:04.053         3    71.00      0.07   1024   45.07   0.008   0.005   0.010   0.021    0.003     0.0     0.8    0.1
17:09:05.053         4   108.00      0.11   1024   52.78   0.008   0.005   0.010   0.019    0.003     0.0     0.3    0.1
17:09:06.058         5    92.00      0.09   1024   57.61   0.006   0.005   0.009   0.013    0.003     0.0     0.3    0.0
17:09:06.083   avg_2-5    92.75      0.09   1024   52.29   0.007   0.005   0.010   0.021    0.003     0.0     0.8    0.3
17:09:07.646 Vdbench execution completed successfully. Output directory: /root/Desktop/FB_Vdb_2E512-4x/output


If both tests succeed, Vdbench is properly installed.

5.3.2 Run Vdbench

Prior to running any benchmark, you must be certain that the target drives of the test are not drives that contain critical system information, such as the OS.

1. Verify that VdBench is properly installed.

2. Verify the system is set up for the environment that you wish to test.

3. Create the <test_name> folder.

4. Run Vdbench.

You can run VDbench from the command line, but use of a parameter file enables you to build complex workload and test descriptions.

Tests show that for extremely high IOPs testing with VDBench, you can achieve the best performance by binding VDBench to a single CPU socket (not core) with the following execution line:

numactl --cpunodebind=x vdbench...

You can also force VDBench to use more JVMs than required to avoid a CPU core bottleneck. For example, use the recommended JVM count of 8 by entering -m 8 in the VDBench execution line.
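For example, a combined invocation that pins Vdbench to NUMA node 0 and forces eight JVMs might look like the following sketch (the node number and JVM count are illustrative and depend on your system):

numactl --cpunodebind=0 ./vdbench -f 4K_Parm.prm -o 4K_Out -m 8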

Vdbench creates many files as a result of its execution. The Vdbench Users Guide and Section 5.3.4, Interpret Vdbench Results provide details.

5.3.3 Sample Vdbench Script

To run 4-KB sequential read I/O to two disks for a 60 second test in Linux, run

./vdbench -f 4K_Parm.prm -o 4K_Out

where the parameter file, 4K_Parm.prm contains the following lines:

sd=s1,lun=/dev/sdb,openflags=o_direct
wd=wd1,sd=s1,xfersize=4K,rdpct=100,seekpct=0
rd=rd1,wd=wd1,iorate=max,forthreads=32,elapsed=60,interval=1

For this example, the results go to the 4K_Out folder.
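As a variation, a sketch of a parameter file for 4-KB random reads (seekpct=100) that steps the thread count exponentially, in line with the Qd scaling approach described in Section 5.1, might look like the following; the same /dev/sdb target is assumed:

sd=s1,lun=/dev/sdb,openflags=o_direct
wd=wd2,sd=s1,xfersize=4K,rdpct=100,seekpct=100
rd=rd2,wd=wd2,iorate=max,forthreads=(2-256,d),elapsed=60,interval=1

The (2-256,d) for-loop doubles the thread count at each step: 2, 4, 8, and so on up to 256.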

5.3.4 Interpret Vdbench Results

Vdbench creates many files as a result of its execution. The flatfile.html and histogram.html output files are the most important. flatfile.html contains the throughput, latency, and test condition information for each output sample. You can import the file into Excel for post processing. histogram.html contains a latency histogram for each test which is useful in analyzing components of the average and maximum latency metrics.

The other output files provide additional runtime and debug information.

5.4 Jetstress

Microsoft distributes the Jetstress tool that simulates a Microsoft Exchange database workload without a full Exchange installation. Jetstress verifies the performance and stability of a system prior to a full installation of Microsoft Exchange in a production environment. Avago uses Jetstress as a system-level performance benchmark tool in the Microsoft Windows environment to help simulate a real-world workload more closely than synthetic benchmark tools such as IOMeter. Jetstress does only one kind of I/O, simulating an Exchange email environment, so it is not a general-purpose tool.

Jetstress is available on the Microsoft website for free. Microsoft releases a new version of Jetstress with each major release of Microsoft Exchange Server. Jetstress 2013 is the current version, to coincide with Microsoft Exchange Server 2013. Each Jetstress version includes changes unique to the Microsoft Exchange Server version with which it releases. Make sure to use the appropriate Jetstress version for the deployed Microsoft Exchange version in your production environment.

Refer to the Jetstress 2013 Field Guide for Jetstress details, including installation details: http://gallery.technet.microsoft.com/Jetstress-2013-Field-Guide-2438bc12

5.4.1 Install Jetstress

1. Obtain the required Extensible Storage Engine (ESE) binaries. Use an installation or CD for Microsoft Exchange Server, or download a trial version from the Microsoft website to get the necessary ESE binaries. Jetstress requires the ESE binaries from its respective Microsoft Exchange Server install package.

Jetstress requires the following ESE binaries:

— ESE.DLL

— ESEPERF.DLL

— ESEPERF.HXX

— ESEPERF.INI

— ESEPERF.XML

2. Run the Jetstress.msi installer. Notice where Jetstress is actually installed on the system.

3. Follow the installation dialogs. Use the recommended default options that each step includes.

4. After the installation completes, copy the five ESE binary files from step 1 into the Jetstress installation folder.

5. Run the Jetstress tool so Jetstress can configure the performance library, objects, and counters.

This Jetstress configuration occurs on the first-run only.

6. Close Jetstress, then restart Jetstress to run your performance benchmarks.

7. Choose a test type. The remainder of this document focuses on the Disk Subsystem Throughput Test.

— Disk Subsystem Throughput Test. Determines the maximum performance for a storage solution when the disks are filled close to capacity. Use for performance testing.

— Exchange Mailbox Profile Test. Determines whether a given storage solution can meet or exceed the requirements of a given Exchange mailbox profile specified in terms of users, IOPs per mailbox, and quota size. Use to reproduce a specific customer scenario.

Use the following sections and the Jetstress Field Guide to create your Jetstress test.

5.4.2 Create your Jetstress Test

5.4.2.1 Select Capacity and Throughput

Capacity and throughput control how much back-end storage is used and the intensity of the Exchange workload applied to the storage.

 Storage capacity: The percentage of the backend storage that supports the Exchange database. Using at least 85 percent of the storage capacity for a valid throughput test permits full-stroking of the backend storage.


 Throughput: The throughput capacity percentage used to achieve a target IOPs. It is recommended to leave this value at 100 percent to obtain the maximum IOPs possible from the storage subsystem.

Jetstress 2013 autotunes the benchmark for the maximum IOPs possible within the acceptable response time limits. The application is tuned by varying the number of threads applying a workload to the storage subsystem. Additional threads mean greater throughput, but come at the expense of increased latency metrics that might not be acceptable for Microsoft Exchange. You have the option of not using the auto-tuning and manually specifying the number of threads.

It is recommended that you use the auto-tuning feature the first time you run Jetstress to let the benchmark estimate an optimal thread count. Subsequent runs of Jetstress can then use a manual thread count based on these results and whether a need exists to change the throughput level.

5.4.2.2 Select Test Type

Jetstress offers three test types:

 Performance (recommended)
 Database backup
 Soft recovery

You can enable or disable the following additional options:

 Multi-host test: Only select the multihost test if you are on a shared storage platform with multiple servers.

 Run background database maintenance: It is recommended to enable this option. The background database maintenance is an additional sequential workload operating on the databases in addition to the Exchange operations from the worker threads. Enable this option so Jetstress can more closely replicate a live Exchange deployment performing similar background maintenance at all times.

 Continue the test run despite encountering disk errors: If this option is enabled, the test report includes any disk errors.

5.4.2.3 Define Test Run

You can define how long to run the test. Use the following scenarios to determine your test length:

 When you attempt to adjust the thread count manually or by using the auto-tune feature, the test should be at least 30 minutes (specified as .50).
 When you execute a performance test, the test should be a minimum of 2 hours, with 8 hours recommended.
 When you validate an Exchange deployment, perform a separate test of 24 hours.

5.4.2.4 Configure Databases

The database configuration that Jetstress uses should match the target Exchange deployment. If no target Exchange deployment exists, adhere to the following recommendations to achieve the maximum performance in Jetstress for the storage subsystem:

 Number of databases: Match the number of databases to the number of volumes presented from the backend storage, with each database residing on a unique volume.

 Number of copies per database: It is expected that any live Exchange configuration has multiple copies of the Exchange database that reside on the storage subsystem. The exact number of copies depends on the specific Exchange configuration. Modifying this value in Jetstress only simulates additional log I/O to mimic log shipping activity between active and passive databases. It does not actually copy the logs. It is recommended to use at least three copies for each database in Jetstress.

 Database and log file location: Assign paths for each database and log file. Place the database and its respective log file in the same location unless the target Exchange deployment is configured differently.

5.4.2.5 Select Database Source

You can select from the following three options when you select a database source:

 Create new databases
 Attach existing databases
 Restore backup database

Typically, the first run with a given configuration requires Jetstress to create new databases. Creating a new database takes a long time; according to the Jetstress 2013 Field Guide, expect approximately 24 hours for each 10 TB of data. Subsequent runs can attach existing databases because the databases are saved between individual Jetstress runs. However, a chance exists that performance might degrade on a database with each additional run; therefore, create a new database for each run if time permits.

5.4.3 Start the Test

Prerequisites

Complete Section 5.4.2.1, Select Capacity and Throughput through Section 5.4.2.5, Select Database Source.

On the Review & Execute Test dialog within Jetstress, take the following steps:

1. Click Save test.

This step saves the test parameters into an XML file so you can use or review the configuration for future Jetstress tests.

2. Click Prepare test.

This step creates and initializes the databases if they do not yet exist, or checksums existing databases before you use them for testing. The database initialization process can be lengthy depending on the database size. According to the Jetstress 2013 Field Guide, expect approximately 24 hours for each 10 TB of data that must be initialized.

3. On the screen that appears, click Execute test.

This step executes the test according to the specified configuration and stores the results in the specified output directory.

5.4.3.1 Characterize the Jetstress Workload

Jetstress differs from other performance-oriented benchmarks like IOMeter in that Jetstress does not produce the best performance numbers. Jetstress replicates an Exchange-type workload on a storage subsystem at a given intensity level. The following distribution is the default for Exchange database operations in Jetstress:

 40% insert
 35% read
 20% delete
 5% update

The SluggishSessions variable also affects the workload. The SluggishSessions variable adds an additional pause between each Jetstress task and permits additional tuning of the intensity level beyond the thread count. The default SluggishSessions value is 1. Increase this value to decrease the number of IOPs achieved with the same thread count.


You can modify all Jetstress parameters in the XML configuration file. Unless you have a very specific reason to manually modify this file, do not do so.

Several streams exist in Jetstress during a test, each exercising a different workload pattern. While it is difficult to duplicate the workload patterns exactly, you can approximate them. Database operations consist of 32-KB random reads and writes using a mix of approximately two database reads for each database write. The circular log operations are 4-KB sequential writes with 256-KB sequential reads to replicate the logs for each database instance. Background database maintenance is a separate 256-KB sequential read workload.

5.4.4 Interpret Jetstress Results

The output of a Jetstress test comes in several different files. For performance analysis, the Performance_<date>.html file provides an easy way to read the test status in a single report. The report is divided into the following sections:

 Test Summary
 Test Issues
 Database Sizing and Throughput
 Transactional I/O Performance
 Background Database Maintenance I/O Performance
 Log Replication I/O Performance

5.4.4.1 Transactional I/O Performance

The Transactional I/O Performance section displays the performance numbers for the Transactional I/O workloads going to each Microsoft Exchange database instance. The following parameters are important:

 I/O Database Reads Average Latency (ms)
 I/O Database Writes Average Latency (ms)
 I/O Database Reads/sec
 I/O Database Writes/sec
 I/O Log Writes Average Latency (ms)

Note how close the actual latency metrics are to the latency requirements. Even if a test passes by meeting the latency requirements, it can still be a concern if the latency metrics are so close to the requirements that rerunning the test could easily fail the criteria.

The sum of the I/O Database Reads/sec and Writes/sec adds up to the Achieved Transactional IOPs. If the storage subsystem is allocated evenly between the databases, the read and write performance across databases should be similar to one another. Uneven performance might indicate a performance issue that requires further investigation beyond Jetstress.

5.4.4.2 Background Database Maintenance I/O Performance

This section displays the background database maintenance for each database instance, but is not a factor in determining whether a Jetstress test passes or fails. The Database Maintenance I/O Reads/sec value should be greater than 0, which indicates that the Background Database Maintenance was active during the test.

5.5 fio for Linux

fio is an open source I/O tool from the Linux community for benchmarking and system stress tests. fio simulates various I/O workload types with support for multiple I/O engines and system-level optimizations. fio interacts with the Linux I/O layers, resulting in complex tuning methods that are constantly under improvement. The fio user interface is a command line, so fio is not as visual as Iometer.

Download fio from http://freecode.com/projects/fio, which points you to the latest fio version. The tool is free and is offered to the public under the GPLv2 license.

 Online fio Linux man page: http://linux.die.net/man/1/fio
 fio project Freecode site: http://freecode.com/projects/fio

5.5.1 Get Started with fio

Complete the following steps to get started with fio.

1. Install the libaio and libaio-devel libraries.

fio requires these libraries before fio is compiled in the following steps. Failure to install libaio and libaio-devel libraries can cause fio to function incorrectly even if fio compiles cleanly.

2. Verify that you have a C compiler such as GCC with the necessary base libraries already installed and configured on your machine.

3. Run ./configure, make, and make install to build and install fio.

4. Use the following guidelines when you create a job file.

— How you enter devices into the FIO job file is crucial. You must adhere to the following format, because placing multiple devices on the same filename= line prevents FIO from properly distributing I/O and results in lower performance than expected.

[job1]
filename=/dev/sda

[job1]
filename=/dev/sdb

— Include the time_based parameter so FIO adheres properly to the input run time (very important for long run times, such as those required for SSD preconditioning).
— For random I/O testing on SSDs, include the following global parameters, because the FIO random generator repeats LBAs and remembers LBA locations, which falsely doubles the performance.

norandommap
use_os_rand=1
randrepeat=0

— Place the readwrite parameter in the global section if the same pattern goes to all devices.
— To achieve the high performance required for many small-block I/O request size tests, you must execute FIO with the numactl command. See Section 5.3.2, Run Vdbench.
— The numactl command removes the need for the cpus_allowed option and is required for the high performance. The use of both options is not recommended.

The following sample job file example shows a basic job file used to issue 4-KB sequential write workloads to two volumes. This sample demonstrates some important features that fio can control outside of the workload pattern, including I/O engine and CPU affinity. The full available parameter set is covered in the fio man page, available at http://linux.die.net/man/1/fio for reference. The package also includes a HOWTO file. Verify the parameters against the man page for your specific fio version.

[global]
numjobs=1
bs=4k
ramp_time=15
runtime=45
direct=1
iodepth=32
readwrite=write
ioengine=libaio
group_reporting

[job1]
filename=/dev/sda

[job1]
filename=/dev/sdb

[job1]
filename=/dev/sdc
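Assuming the job file is saved as, for example, seq_write.fio (a hypothetical name), a typical invocation captures the results to a file:

fio seq_write.fio --output=seq_write.log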

5.5.2 fio Performance-Related Parameters

fio includes two performance-related parameter categories. The first category is workload parameters similar to other tools like IOMeter. The second category is unique to fio and lets you optimize fio for the system on which you run. The following workload parameters are important:

readwrite, rw

Determines the workload pattern as either read/write/mix and random or sequential.

blocksize

Specifies the block size for each I/O request.

ramp_time

The amount of time, in seconds, to run the workload before logging any performance numbers.

runtime

The amount of time to run the workload and log performance numbers.

iodepth

The number of I/O requests to keep in flight against a target. This value equals the queue depth parameter in other tools.

thinktime

The amount of time, in microseconds, between issuing individual I/O requests.

The following parameters are unique to fio:

ioengine

Defines which I/O library issues I/O requests. The I/O engine can largely impact the performance measured under fio. The recommended Linux I/O library is libaio, which is the Linux-native asynchronous I/O. If you want to simulate synchronous I/O, use either sync or vsync. vsync coalesces adjacent I/Os into a single request, which might affect the measured performance. You may use other I/O engines, but only on a case-by-case basis, and they should be fully understood before being implemented in any test.

direct

Determines whether fio uses non-buffered I/O. This parameter is the equivalent of the O_DIRECT flag when opening a file in Linux.


fsync

Sets how many I/Os to perform before flushing the dirty data to the drive. By default this parameter is disabled and no syncing occurs.

cpus_allowed, cpumask

These variables control which CPUs can be used for a job.

zero_buffers, refill_buffers, scramble_buffers

These settings determine what data is actually written to the targets during the test. Even if the data itself is meaningless for a performance test, some drives (SSDs in particular) might use data patterns for compression. The default setting is to fill the buffers with random data and scramble them. It is recommended not to adjust these settings except for a specific purpose.

rate, rate_iops

fio can cap the workload intensity based on the bandwidth or IOPs specified.
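For example, a hypothetical global section that caps each job at 5000 IOPs (the value is illustrative):

[global]
# hold each job to 5000 IOPs regardless of how fast the device could go
rate_iops=5000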

5.5.3 Interpret fio Output

fio outputs the results of a performance run to stdout in Linux. Capture this output in a text file by using the > operator on the command line in Linux or the --output option on the command line when you run fio.

Additionally, you can specify group_reporting with the other workload parameters in the job file or on the command line. If you set the group_reporting option, the results display on a per-group basis rather than on a per-job basis. It is recommended not to enable group_reporting, because it hides the individual results of each job, which might be useful for debugging purposes later. Use the minimal option to keep results on a per-job basis, but semicolon-delimited. This option enables you to import the data into a spreadsheet.

Extract the following metrics from a fio output:

 bw: The bandwidth measured during the test. This value can be expressed as KB/s or MB/s.
 iops: The IOPs measured during the test.
 slat, clat, lat: The submission, completion, and overall latency, respectively, expressed in terms of minimum, maximum, average, and standard deviation.
 cpu: CPU use in terms of percent use by the user and system.

The following sample fio output is for a test with one device with an 8-KB random read and write mixed workload at an iodepth of 8. The metrics are separated for the read and write components of the workload.

/dev/sdb: (g=0): rw=randrw, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
fio 1.58
Starting 1 process

/dev/sdb: (groupid=0, jobs=1): err= 0: pid=5608
  read : io=38246MB, bw=130546KB/s, iops=16318, runt=300001msec
    slat (usec): min=3, max=174, avg=4.83, stdev=1.49
    clat (usec): min=35, max=20358, avg=416.31, stdev=411.26
    lat (usec): min=51, max=20363, avg=421.82, stdev=411.26
    bw (KB/s): min=41456, max=146416, per=100.03%, avg=130586.26, stdev=11257.34
  write: io=16382MB, bw=55916KB/s, iops=6989, runt=300001msec
    slat (usec): min=3, max=171, avg=5.22, stdev=1.56
    clat (usec): min=21, max=20315, avg=147.51, stdev=152.05
    lat (usec): min=46, max=20320, avg=153.40, stdev=151.98
    bw (KB/s): min=19072, max=63520, per=100.05%, avg=55940.49, stdev=4872.83
  cpu : usr=11.82%, sys=13.98%, ctx=2584617, majf=0, minf=18
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete: 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued r/w/d: total=4895500/2096847/0, short=0/0/0
    lat (usec): 40=0.02%, 60=7.29%, 80=2.61%, 100=1.65%, 200=9.31%
    lat (usec): 400=61.86%, 600=9.75%, 800=2.34%, 1000=1.19%
    lat (msec): 2=2.46%, 4=1.50%, 6=0.02%, 8=0.01%, 10=0.01%
    lat (msec): 20=0.01%, 40=0.01%

Run status group 0 (all jobs):
   READ: io=38246MB, aggrb=130546KB/s, minb=133679KB/s, maxb=133679KB/s, mint=300001msec, maxt=300001msec
  WRITE: io=16382MB, aggrb=55915KB/s, minb=57257KB/s, maxb=57257KB/s, mint=300001msec, maxt=300001msec

Disk stats (read/write):
  sdb: ios=4895030/2096620, merge=0/0, ticks=1997589/293869, in_queue=2290517, util=100.00%

5.6 Verify Benchmark Results for Validity

After you gather the results from your benchmark tools, verify the results for validity. Some performance results might look as expected, but anomalies might occur during the performance runs, and not all results might be valid. In such cases, it is a good practice to rerun the tests. You can also run tests in multiple sets, take the average, and check the standard deviation. Comparing your results between runs helps you identify variables that change during the run or between runs. Use the following guidelines to verify your results:

 Look for errors. Valid results do not contain any errors.
— Check the operating system log files and controller log files for errors. Clear the errors and logs before any performance test so you can easily check for errors after the test.
— Benchmark tools such as IOMeter provide an Errors metric, which should be zero. If the value is non-zero, find the reason for the errors and rerun your test after you resolve the cause.
 When running multiple sets of tests, look for outliers that affect the average and standard deviation.
 CPU usage is a good indicator of other applications that burden the processors. The CPU use should not be too high.
— If you suspect any issues, use Start Task Manager > Performance to view the CPU Usage graphs. Rerun your tests and make sure the CPUs are loaded uniformly and none of the CPUs approaches 100%.
— The number of workers used for the test might be too low. For example, when the test is for drive scaling, the performance might not scale after a specific number of drives, which usually overloads some of the CPUs and keeps the others unused.
 A change in the number of workers or threads might cause low or inconsistent performance. This problem is common with IOMeter because the number of workers assigned in a saved configuration file (*.icf) can change when you reload that file for the next run. This change can occur if your topology or volume configuration changes, so the saved parameters do not match the new configuration.
 A sudden drop in performance for a short interval, or performance that stays at a lower level for a long time, indicates background operations or transient errors during your tests. For example, a drive failure might cause a rebuild and decreased performance. Or, an unsupported command that always fails on a drive or an expander can reduce performance if the same command is issued periodically.
 Insufficient ramp time can cause decreased and inconsistent performance. Some configurations that involve cache can have longer transient times, so a longer ramp time or pause between tests is important. Without such an interval, the average performance is lower and inconsistent across runs.


Chapter 6: Compare Measured Results with Expected Results

After you verify the results as valid you can compare the results with the expected results, calculated with the help of Chapter 2. You can also compare the results against what the product vendor publishes. The following section provides example results for a few configurations that Avago usually uses in its Performance Lab for its regression runs.

Example best-case performance results for various configuration options in the Avago performance tuning lab are presented in the following sections. Your results might differ.

NOTE The values in this chapter are examples. Your configurations might not exactly match and you might need to evaluate the numbers for your configurations to compare with your actual results. Contact your FAE for specific and recent product performance results.

6.1 Performance Result Examples for MegaRAID

The following topologies were used as examples:

 8 drive direct attached
 24 drive expander attached
 40 drive expander attached

The following are additional test configuration inputs for these examples:

 Avago tests each topology with SAS and SATA HDDs and SSDs.
 Only R0, R1, R10, R5, and R6 configurations are tested. R50 and R60 volumes are not tested. Avago expects that the R5 and R6 results are representative of R50 and R60 results, respectively.
— HDDs use the following RAID settings: 256-KB stripe size, Write Back, Read Ahead, Direct IO.
— SSDs use the following RAID settings: 64-KB stripe size, Write Through, No Read Ahead, Direct IO.
 The results come from Iometer 1.1.0 under Windows Server 2008 Enterprise.
 The 8 SAS SSD data uses 12Gb/s SAS SSDs.
 Configurations with 8 and 24 drives are generally disk-limited and do not maximize all performance metrics.
 The 24 and 40 drive configurations use an LSISAS3x48 expander with DataBolt enabled.
 Configurations with 40 drives have sufficient drives to enable saturation of all throughput metrics with large I/O.
 To provide a realistic test point, the 24 and 40 drive configurations use two RAID volumes with the total drives evenly split between the volumes.
 4-KB I/O size is selected to showcase the maximum IOPS.
 256-KB I/O size is selected to showcase the maximum MB/s.

6.1.1 Eight Drive Direct Attached Example Results

The following table gives maximum throughput results for eight-drive configurations that use 256-KB I/O. At 256-KB I/O, it is possible to demonstrate maximum throughput if enough drives are present.



Table 12  LSISAS3108 Performance Results for One RAID Volume in MB/s
(MegaRAID 6.4; 256 Q per volume; 8 drives, one RAID volume. HDD columns use Write Back; SSD columns use Write Through. SR = sequential read, SW = sequential write, RR = random read, RW = random write.)

                    HDD        HDD        HDD        HDD        SSD        SSD        SSD        SSD
                    256 KB SR  256 KB SW  256 KB RR  256 KB RW  256 KB SR  256 KB SW  256 KB RR  256 KB RW
RAID 0,  8x SAS     1,511      1,494      366        373        5,920      3,575      5,878      1,211
RAID 0,  8x SATA    1,385      1,363      167        187        3,501      2,792      3,396      1,071
RAID 10, 8x SAS     1,234      747        375        202        3,895      1,722      3,340      607
RAID 10, 8x SATA    1,049      674        170        94         3,616      1,398      3,545      551
RAID 5,  8x SAS     1,074      1,301      367        116        5,882      2,423      5,862      553
RAID 5,  8x SATA    1,216      1,173      167        58         3,615      2,062      3,525      549
RAID 6,  8x SAS     998        1,116      366        72         5,955      2,051      5,838      440
RAID 6,  8x SATA    877        991        165        38         3,618      1,919      3,418      473

The following table gives maximum throughput results for eight-drive configurations that use 4-KB I/O.

Table 13  LSISAS3108 Controller Performance Results for One RAID Volume in K IOPs
(MegaRAID 6.4; 256 Q per volume; 8 drives, one RAID volume. HDD columns use Write Back; SSD columns use Write Through.)

                    HDD    HDD    HDD    HDD    HDD       SSD    SSD    SSD    SSD    SSD
                    4K SR  4K SW  4K RR  4K RW  4K R67a   4K SR  4K SW  4K RR  4K RW  4K R67a
RAID 0,  8x SAS     379    381    3.8    5.3    4.2       415    382    319    145    270
RAID 0,  8x SATA    335    348    1.4    2.2    1.6       392    389    294    129    238
RAID 10, 8x SAS     190    102    3.9    2.8    3.4       407    314    277    72     186
RAID 10, 8x SATA    174    170    1.5    1.3    1.3       400    295    599    64     162
RAID 5,  8x SAS     331    333    3.8    1.5    2.5       415    190    282    37     109
RAID 5,  8x SATA    291    299    1.5    0.7    1         407    272    284    37     104
RAID 6,  8x SAS     281    286    3.8    0.9    2.3       416    285    281    22     74
RAID 6,  8x SATA    218    256    1.5    0.5    1.2       394    287    271    25     65

a. Refers to a 4-KB random 67% read, 33% write I/O sequence.

6.1.2 Twenty-four Drive Expander Attached Example Results

The following table gives maximum throughput results for 24-drive configurations that use 256-KB I/O. At 256-KB I/O, it is possible to demonstrate maximum throughput if enough drives are present.

Table 14  LSISAS3108 Performance Results for Two RAID Volumes in MB/s
(MegaRAID 6.2; 256 Q per volume; 24 drives, two RAID volumes. HDD columns use Write Back; SSD columns use Write Through.)

                     HDD        HDD        HDD        HDD        SSD        SSD        SSD        SSD
                     256 KB SR  256 KB SW  256 KB RR  256 KB RW  256 KB SR  256 KB SW  256 KB RR  256 KB RW
RAID 0,  24x SAS     4,444      4,421      1,021      998        5,266      5,267      4,311      3,544
RAID 0,  24x SATA    4,031      3,918      485        453        5,210      5,686      4,187      3,188
RAID 10, 24x SAS     3,669      1,928      1,053      524        5,992      2,798      5,912      1,927
RAID 10, 24x SATA    3,072      1,843      496        261        4,633      2,067      4,485      1,465
RAID 5,  24x SAS     3,815      3,008      1,020      296        4,914      2,410      2,948      533
RAID 5,  24x SATA    3,964      2,482      484        144        4,789      2,386      3,871      513
RAID 6,  24x SAS     3,728      2,817      1,023      220        4,909      1,768      3,985      538
RAID 6,  24x SATA    3,353      1,974      483        107        4,803      1,725      3,887      503

The following table gives maximum throughput results for 24-drive configurations that use 4-KB I/O.

Table 15  LSISAS3108 Controller Performance Results for Two RAID Volumes in K IOPs
(MegaRAID 6.2; 256 Q per volume; 24 drives, two RAID volumes. HDD columns use Write Back; SSD columns use Write Through.)

                     HDD    HDD    HDD    HDD    HDD       SSD    SSD    SSD    SSD    SSD
                     4K SR  4K SW  4K RR  4K RW  4K R67a   4K SR  4K SW  4K RR  4K RW  4K R67a
RAID 0,  24x SAS     713    678    10.3   13.7   11.3      689    686    488    415    485
RAID 0,  24x SATA    600    578    4.3    5.9    4.6       669    666    466    383    456
RAID 10, 24x SAS     564    443    10.8   7.5    9.4       745    644    468    93     238
RAID 10, 24x SATA    450    487    4.5    3.2    3.8       586    644    372    94     238
RAID 5,  24x SAS     735    713    10.3   3.9    6.7       716    170    539    38     111
RAID 5,  24x SATA    591    601    4.2    1.8    2.7       714    162    477    36     104
RAID 6,  24x SAS     729    689    10.3   2.7    5.4       694    195    534    13     39
RAID 6,  24x SATA    586    593    4.2    1.2    2.1       690    172    367    13     38

a. Refers to a 4-KB random 67% read, 33% write I/O sequence.

6.1.3 Forty Drive Expander Attached Example Results

The following table gives maximum throughput results for 40-drive configurations that use 256-KB I/O. At 256-KB I/O, it is possible to demonstrate maximum throughput if enough drives are present.

Table 16  LSISAS3108 Performance Results for Two RAID Volumes in MB/s for 256-KB I/O
(MegaRAID 6.4; Write Back; 256 Q per volume; 40 drives, two RAID volumes; HDD)

                     256 KB SR  256 KB SW  256 KB RR  256 KB RW
RAID 0,  40x SAS     5,808      6,343      1,545      1,535
RAID 0,  40x SATA    5,825      5,358      740        695
RAID 10, 40x SAS     5,792      3,092      1,606      810
RAID 10, 40x SATA    5,019      2,665      774        378
RAID 5,  40x SAS     5,789      3,111      1,543      450
RAID 5,  40x SATA    5,842      3,051      736        221
RAID 6,  40x SAS     5,787      2,912      1,541      326
RAID 6,  40x SATA    5,610      2,825      739        163

The following table gives maximum throughput results for 40-drive configurations that use 4-KB I/O.

Table 17  LSISAS3108 Controller Performance Results for Two RAID Volumes in K IOPs
(MegaRAID 6.2; Write Back; 256 Q per volume; 40 drives, two RAID volumes; HDD)

                     4K SR  4K SW  4K RR  4K RW  4K R67a
RAID 0,  40x SAS     692    673    14.9   20.3   16.7
RAID 0,  40x SATA    588    580    6.2    8.9    6.8
RAID 10, 40x SAS     731    633    16     11.5   14.1
RAID 10, 40x SATA    577    523    6.7    4.9    5.8
RAID 5,  40x SAS     704    703    14.9   6.2    10.0
RAID 5,  40x SATA    599    612    6.2    2.7    4.1
RAID 6,  40x SAS     693    697    14.9   4.2    8.1
RAID 6,  40x SATA    591    608    6.0    1.9    3.3

a. Refers to a 4-KB random 67% read, 33% write I/O sequence.

6.2 Performance Results Examples for IT Controllers

The following topologies were used as examples:

- 8-drive direct attached
- 24-drive expander attached
- 40-drive expander attached

The following are additional test configuration inputs for these examples:

- Each topology implements SAS and SATA HDDs and SSDs.
- The results come from Iometer 1.1.0 under Windows Server 2008 Enterprise.
- The 8 SAS SSD data uses 12Gb/s SAS SSDs.
- The 24 and 40 drive configurations use an LSISAS3x48 expander with DataBolt enabled.
- The firmware phase is 5.0.0.0.

6.2.1 Eight Drive Direct Attached Example Results

The following table gives maximum latency performance results for 8-drive configurations.

Table 18  LSISAS3008 Controller Maximum Latency Performance Results in msec
(4 Q; 8 drives)

                  HDD                          SSD
                  4K SR  4K SW  4K RR  4K RW   4K SR  4K SW  4K RR  4K RW
JBOD 8x SAS       5.5    25.2   167    35      2.2    16.3   2      13
JBOD 8x SATA      50.5   96.5   332    111     11.6   13.5   2      16

The following tables give maximum throughput results for 8-drive configurations.

Table 19  LSISAS3008 Controller Maximum Throughput Performance Results in K IOPs
(32 Q; 8 drives)

                  HDD                              SSD
                  0.5K SR  0.5K SW  4K SR  4K SW   4K SR  4K SW  4K RR  4K RW
JBOD 8x SAS       1,286    1,023    394    394     1,018  767    921    147
JBOD 8x SATA      588      542      359    350     512    413    502    134

Table 20  LSISAS3008 Controller Maximum Throughput Performance Results in MB/s
(32 Q; 8 drives)

                  HDD                SSD
                  256K SR  256K SW   256K SR  256K SW  256K RR  256K RW
JBOD 8x SAS       1,541    1,538     5,870    3,551    5,875    1,193
JBOD 8x SATA      1,404    1,391     3,824    2,913    3,814    1,089

6.2.2 Twenty-four Drive Expander Attached Example Results

The following table gives maximum latency performance results for 24-drive configurations.

Table 21  LSISAS3008 Controller Maximum Latency Performance Results in msec
(4 Q; 24 drives)

                   HDD                                SSD
                   4K SR  4K SW   4K RR   4K RW      4K SR  4K SW  4K RR  4K RW
JBOD 24x SAS       5.25   13.93   186.15  34.51      2.24   13.84  11.63  17.21
JBOD 24x SATA      58.23  101.85  363.55  138.32     2.12   5.33   2.06   15.01

The following tables give maximum throughput results for 24-drive configurations.

Table 22  LSISAS3008 Controller Maximum Throughput Performance Results in K IOPs
(32 Q; 24 drives)

                   HDD                                SSD
                   0.5K SR  0.5K SW  4K SR  4K SW     4K SR  4K SW  4K RR  4K RW
JBOD 24x SAS       1,118    1,048    1,146  1,033     994    1,000  1,097  440
JBOD 24x SATA      817      790      618    612       639    637    649    403

Table 23  LSISAS3008 Controller Maximum Throughput Performance Results in MB/s
(32 Q; 24 drives)

                   HDD                 SSD
                   256K SR  256K SW    256K SR  256K SW  256K RR  256K RW
JBOD 24x SAS       4,596    4,586      5,865    6,535    5,866    3,591
JBOD 24x SATA      4,197    4,167      5,844    6,095    5,825    3,255

6.2.3 Forty Drive Expander Attached Example Results

The following table gives maximum latency performance results for 40-drive configurations.

Table 24  LSISAS3008 Controller Maximum Latency Performance Results in msec
(4 Q; 40 drives; HDD)

                   4K SR  4K SW  4K RR   4K RW
JBOD 40x SAS       36.33  15.27  166     36.53
JBOD 40x SATA      57.69  77.47  393.89  196.57

The following tables give maximum throughput results for 40-drive configurations.

Table 25  LSISAS3008 Controller Maximum Throughput Performance Results in K IOPs
(32 Q; 40 drives; HDD)

                   0.5K SR  0.5K SW  4K SR  4K SW
JBOD 40x SAS       1,113    1,015    1,008  1,022
JBOD 40x SATA      925      890      786    773

Table 26  LSISAS3008 Controller Maximum Throughput Performance Results in MB/s
(32 Q; 40 drives; HDD)

                   256K SR  256K SW
JBOD 40x SAS       5,467    6,527
JBOD 40x SATA      5,850    6,340


Chapter 7: Troubleshoot Performance Issues

If the measured performance test results do not match the expected results, many parameters could be the cause. This chapter presents a few such parameters in accordance with the best practices and guidelines discussed earlier in this document.

Understand the Issue

When you see a discrepancy in the results, you must first understand it. Doing so requires asking additional questions and running debugging tests. Ask the following questions:

- Is the issue repeatable? Are the results reliable?
- Does the issue vary over time? Does it improve or worsen?
- Do system reboots affect the issue?
- Is the issue because of drive scaling?
  — Are results as expected with a lower number of drives, but not when you add drives?
  — Are results as expected with an even number of drives, but not with an odd number?
- Is the issue an effect of Qd variation?
  — Does increasing or decreasing the Qd affect the issue?
- Is the issue an effect of another operation occurring in parallel?
  — Errors because of signal integrity?
  — Links going up and down, causing discovery and affecting performance?
  — Background operations running?
  — Do any controller logs show errors?
  — Do any operating system logs show errors?
  — Are other devices using significant CPU resources?
- Is the issue an effect of cache?
  — Does the issue vary over time instead of staying consistent?
  — Does the issue go away if you change any read/write cache setting?
  — Does performance return to normal after running the same I/O for a longer duration?
  — Does changing the I/O order solve the issue or result in different behavior?
- Is the issue an effect of process affinity?
  — Are performance results inconsistent, with the difference between runs always the same?
  — Does running on a specific processor, or a set of processors, change the performance?
- Is the issue an effect of an uninitialized volume?
  — Does the issue go away after volume initialization?
- Is the issue a protocol bottleneck?
  — Refer to Chapter 2 and re-evaluate your bottlenecks to see whether any of the SAS, PCIe, or DDR bottleneck values match your performance result maximum. (A bandwidth-ceiling sketch follows this list.)
- Is the issue a link width issue?
  — Change the slot or cable. Is the issue resolved?
- Is the issue a benchmarking tool issue?
  — Does changing your benchmark tool to a different tool or a different version resolve the issue?
  — Does changing benchmark parameters resolve the issue? For example, change the number of threads/workers, the sampling interval, the ramp time, and the run time.
- Is the issue an effect of insufficient preconditioning?
  — Rerun the tests after longer preconditioning, or run the same test for a longer duration. Does the performance improve?


- Is the issue a thermal issue?
  — Was the server lid open? Does closing the lid resolve the issue?
  — Are any server components too hot?

- Is the issue an effect of a bug in a specific software, hardware, or firmware version?
  — Update or roll back the system BIOS. Is the issue resolved?
  — Update or roll back the controller BIOS, firmware, and driver. Is the issue resolved?
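As a quick sanity check for the protocol-bottleneck question above, the following sketch shows the arithmetic for two common ceilings: a 12Gb/s SAS x4 wide port (8b/10b encoding) and a PCIe 3.0 x8 slot (128b/130b encoding). The link parameters and the measured value are illustrative assumptions; substitute your own configuration's numbers from Chapter 2.

    # Sketch: compare a measured maximum against common protocol ceilings.
    # Link parameters below are example assumptions; substitute your own.

    def sas_ceiling_mbps(gbps_per_lane=12, lanes=4):
        """SAS uses 8b/10b encoding: 10 line bits carry 8 payload bits,
        so 12 Gb/s per lane is roughly 1200 MB/s of payload."""
        return gbps_per_lane / 10 * 1000 * lanes

    def pcie3_ceiling_mbps(lanes=8):
        """PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding
        (~985 MB/s of payload per lane)."""
        return 8_000 * 128 / 130 / 8 * lanes

    measured_mbps = 4600.0  # illustrative 256-KB sequential read maximum
    for name, ceiling in (("SAS 12Gb/s x4", sas_ceiling_mbps()),
                          ("PCIe 3.0 x8", pcie3_ceiling_mbps())):
        pct = measured_mbps / ceiling * 100
        print(f"{name}: ceiling {ceiling:.0f} MB/s, measured is {pct:.0f}% of it")

A measured maximum that sits near 100 percent of one of these ceilings points at that link as the bottleneck; a maximum well below every ceiling points elsewhere in the list above.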

Troubleshooting with such questions helps you identify the issue, making it easier to understand and then resolve. After you resolve the issue, rerun your tests to make sure the results meet expectations. Fixing one bottleneck might advance you to another hurdle, but the troubleshooting continues until you reach the expected results.


Appendix A: Performance Testing Checklist

The following checklist highlights the configuration and run-time settings to consider before you start a performance run. See related prior sections of this document for details on each item.


Expected Results

- Maximum MB/s = ________  Are there enough targets to support it?
- Maximum MB/s = ________  Is there enough PCIe bandwidth to support it?
- Maximum IOPs = ________  Are there enough targets to support it?
- Maximum IOPs = ________  Is there enough CPU capacity to support it?
- Latency = ________  Is the desired cache setting used to support the desired latency? (See the sketch after this list.)
- Scaling = ________  Are all the targets the same model and capacity for good scaling?
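One way to sanity-check the latency entry against the MB/s and IOPs entries: by Little's law, average latency equals outstanding I/Os divided by IOPS. A minimal sketch with illustrative numbers, not measurements from this guide:

    # Sketch: Little's law for storage queues: latency = queue depth / IOPS.

    def expected_latency_ms(queue_depth, iops):
        """Average I/O latency in milliseconds implied by a queue depth and IOPS."""
        return queue_depth / iops * 1000.0

    # Illustrative: 32 outstanding I/Os completing at 400K IOPS imply ~0.08 ms,
    # while the same queue depth at 4K IOPS (an HDD random workload) implies ~8 ms.
    print(expected_latency_ms(32, 400_000))
    print(expected_latency_ms(32, 4_000))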

System

- Slot, PCIe revision, and width = ___________________
- Disable BIOS power saving mode options.
- PCIe slot is attached to the processor on which the controller driver will run.
- Enable BIOS maximum cooling options.
- PCIe slot supports the maximum expected MB/s.
- BIOS settings are set for maximum PCIe performance (speed, write packet size, burst).

Operating System

- Operating system is a known version and revision.
- Targets appear in the storage management tool.
- Install necessary performance patches.
- Targets are all online, active, and initialized.
- Use the correct file system. For maximum performance, use no file system (use the raw device).
- Event Log or Error Log is clear of I/O-related messages.
- Turn off any unnecessary background tasks.

Controller

- Controller is installed in the desired PCIe slot.
- Controller heartbeat LED is flashing.
- I/O controller chip revision is correct.
- Controller has a unique WWID visible from the query tool.

Targets

- All targets are visible and initialized via the configuration tool.
- Each target is negotiated to the desired SAS speed.
- All targets are the same model number.
- All targets use the same firmware revision.
- Linux: Set each target parameter (scheduler, queue, and so on) as desired. (See the sketch after this list.)
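For the Linux item above, per-device block-queue settings live under sysfs. A minimal sketch, assuming a device named sda; the scheduler and queue-size values are examples only, and root privileges are required to write these files:

    # Sketch: read and set Linux block-queue parameters via sysfs (run as root).
    from pathlib import Path

    def queue_attr(dev: str, attr: str) -> Path:
        """Path to one block-queue attribute for the given device."""
        return Path(f"/sys/block/{dev}/queue/{attr}")

    dev = "sda"  # assumed device name; substitute your target
    print("scheduler:  ", queue_attr(dev, "scheduler").read_text().strip())
    print("nr_requests:", queue_attr(dev, "nr_requests").read_text().strip())

    # Example-only values: pick the scheduler and queue size your test plan calls for.
    queue_attr(dev, "scheduler").write_text("noop")
    queue_attr(dev, "nr_requests").write_text("256")

Recording the values you read before changing them makes it easy to restore each target after the run.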

Driver

- Driver phase matches the controller firmware phase.
- Driver is loaded and running. The controller is visible from the OS storage configuration tool.
- Set the coalescing depth as desired (default is 4).
- Set the coalescing timeout as desired (default is 10).
- Set the maximum outstanding I/O as desired (default is ________).
- All MSI-X vectors are visible.

Firmware

- Firmware version is the latest GCA, or as desired.
- Firmware is loaded and running (heartbeat).
- Firmware phase matches the driver phase.

Benchmark

- Benchmark tool is the recommended revision.
- Benchmark tool can see all targets or volumes.
- Set the I/O size as desired and consistently across all targets.
- Set run and ramp times as appropriate for the test and target type.
- Set the Qd or thread count as desired and consistently across all targets.

Configuration Information

- Gather and record basic configuration information: firmware revision, driver revision, target model, target firmware revision, controller model, operating system name, operating system revision, benchmark tool name, benchmark tool revision, PCIe slot number, PCIe slot width, PCIe slot speed, CPU model, number of CPUs, number of active cores, CPU frequency, and so on.
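A small sketch of one way to record that information alongside each run so results stay comparable later; the field names and output file are illustrative, not a format this guide defines:

    # Sketch: record the test configuration next to each set of results.
    import json
    import platform
    from datetime import datetime, timezone

    config = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "os": f"{platform.system()} {platform.release()}",
        # Illustrative fields; fill these in from your configuration tools.
        "controller_model": "",
        "firmware_revision": "",
        "driver_revision": "",
        "target_model": "",
        "benchmark_tool": "",
        "pcie_slot": {"number": "", "width": "", "speed": ""},
        "cpu": {"model": "", "count": "", "active_cores": "", "frequency": ""},
    }

    with open("test_config.json", "w") as f:
        json.dump(config, f, indent=2)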


Revision History

Version 1.0, October 2014

The following document changes were made.

- Updated access information for the MegaCLI and StorCLI general debug tools.
- Implemented clarifications throughout the document.
- Added MegaRAID driver support of Windows Server 2012 R2.
- Updated SCSI Queue Depth with specific setting information regarding MegaRAID.
- Added the Nomerges Setting section.
- Added VMware operating system optimization information.
- Fixed values and units in the Throughput Snapshot for Drives table.
- Updated terminology in the MegaRAID FastPath section.
- Updated the tools used to configure RAID virtual drives.
- Reorganized the document.

Advance, Version 0.1, March 2014

Initial document release.
