
HPE Reference Architecture for ProLiant DL580 Gen9 and Microsoft SQL Server 2016 OLTP database consolidation

Technical white paper


Contents

Executive summary
Introduction
Solution overview
  HPE ProLiant DL580 Gen9 server
  HPE NVMe PCIe SSDs
  HPE NVMe PCIe Workload Accelerator
Solution components
  Hardware
  Software
  Application software
Best practices and configuration guidance for the solution
  HPE ProLiant DL580 Gen9
  NVMe PCIe Workload Accelerator configuration
  SQL Server configuration guidance
Capacity and sizing
  Workload description and test methodology
  NVMe characterization
  SQL memory characterization
  HPE ProLiant DL580 Gen9 system and SQL settings characterization results
  SQL backup tests
  SQL index rebuild tests
  Processor comparison (Broadwell versus Haswell)
  Analysis and recommendations
Summary
Appendix A: Bill of materials
Appendix B: FIO tool command line configuration for disk I/O benchmarking
Appendix C: Understanding hardware and software NUMA in SQL Server 2016 and how to set affinity on Broadwell
Appendix D: Creating a custom soft-NUMA configuration
Resources and additional links


Executive summary

Demands on database implementations continue to escalate. Faster transaction processing speeds, scalable capacity, and increased flexibility are required to meet the needs of today's business. Many businesses are still running older versions of Microsoft® SQL Server on older infrastructure. As hardware ages and older SQL Server versions reach end-of-life (EOL), SQL admins often run into performance and scalability issues. To solve these issues, customers often refresh their platforms and consolidate databases to improve performance and resource utilization. When designed properly, this also reduces operating costs and accommodates future data growth.

SQL Server database consolidation can be achieved in three ways: virtualization, instance consolidation, and database consolidation. The HPE ProLiant DL580 Gen9 server is an ideal choice for demanding, data-intensive workloads, addressing key technology trends such as in-memory computing for accelerated data analytics, in-server flash storage and co-processors/GPUs for accelerated data processing, and high-density memory and I/O scalability for application consolidation. In addition, the DL580 Gen9 includes system Reliability, Availability, and Serviceability (RAS) features, such as HPE Advanced Error Recovery, HPE Memory Quarantine, and HPE Advanced Error Containment, which Microsoft SQL Server 2016 supports.

This Reference Architecture demonstrates a highly optimized configuration for mid-range, multiple-database consolidation with Microsoft SQL Server 2016 and Windows Server® 2016 on an HPE ProLiant DL580 Gen9 server. Using a hybrid storage solution of write-intensive NVMe PCIe SSDs and Workload Accelerators, the configuration achieved almost 62K batch requests per second with an OLTP workload, a 26% performance gain over a default, untuned SQL Server configuration.

This RA also addresses customer challenges with slow SQL backups and index rebuilds performed during daily maintenance windows. We analyze database performance using specific SQL command switches to improve throughput, shortening the duration of these important tasks whether the business databases are offline or online.

Microsoft SQL Server 2016 realizes additional performance and scalability benefits by upgrading to the latest processors. The testing showed a 20% gain in performance for an OLTP workload on Intel® Xeon® v4 (Broadwell) over v3 (Haswell) processors in an HPE ProLiant DL580 Gen9 server.

Target audience: This Hewlett Packard Enterprise Reference Architecture white paper is designed for IT professionals who use, program, manage, or administer large databases that require high availability and performance. Specifically, this information is intended for those who evaluate, recommend, or design new IT high-performance architectures.

This white paper describes testing completed in September 2016.

Document purpose: The purpose of this document is to describe a Reference Architecture, highlighting benefits and key implementation details to technical audiences.

Introduction

As the demands for higher processing and scale-up capabilities grow, older platforms have reached their limits in scalability and performance. The purpose of this Reference Architecture is to provide an example platform for customers to use when designing a high-performance SQL Server database server to support database (DB) consolidation initiatives or new business requests that require high performance, transactional database support. We will discuss the results and analysis for several key decision points that SQL architects must consider in designing their SQL Server database environment. These include the following:

• Perform storage characterization of two HPE NVMe PCIe offerings to determine the best storage layout for the OLTP database files.

• Determine the least amount of memory required by SQL Server while maintaining reasonable performance and reducing initial server cost.

• Compare optimized performance results of a multi-database, single instance consolidation scenario over a default installation of SQL Server.

• Characterize workload performance during concurrent backup jobs.

• Characterize workload performance during concurrent online index rebuilds.

• Compare performance of an Intel E7 v4 processor (Broadwell) based Gen9 server against its Intel E7 v3 (Haswell) predecessor.


Solution overview

HPE ProLiant DL580 Gen9 server

The HPE ProLiant DL580 Gen9 server offers a great platform for high-performance OLTP database workloads. It is the Hewlett Packard Enterprise four-socket (4S) enterprise standard x86 server, offering commanding performance, rock-solid reliability and availability, and compelling consolidation and virtualization efficiencies. Key features of the HPE ProLiant DL580 Gen9 server include:

• Commanding performance key features and benefits:

– Processors – achieve performance with up to four Intel Xeon E7-4800/8800 v4/v3 processors with up to 96 cores per server.

– Memory – HPE SmartMemory prevents data loss and downtime with enhanced error handling. Achieve maximum memory configuration using performance-qualified HPE SmartMemory DIMMs, populating 96 DDR4 DIMM slots, with up to 6 TB maximum memory.

– I/O expansion – adapt and grow to changing business needs with nine PCIe 3.0 slots and a choice of HPE FlexibleLOM or PCIe adapters for 1, 10, or 25 GbE, or InfiniBand adapters.

– HPE Smart Array controllers – faster access to your data with the redesigned HPE Flexible Smart Array and HPE Smart SAS HBA controllers that allow you the flexibility to choose the optimal 12 Gb/s SAS controller most suited to your environment. Additional support for HPE NVMe Mixed Use and Write Intensive PCIe Workload Accelerators

– Storage – standard with 5 SFF hot-plug HDD/SSD drive bays. Additional 5 HDD/SSD or NVMe drive bay support requires an optional backplane kit.

• Compelling agility and efficiencies for scale-up environments:

– The HPE ProLiant DL580 Gen9 server supports improved ambient temperature ASHRAE A3 and A4 standards helping to reduce your cooling expenses.

– High efficiency, redundant HPE Common Slot Power Supplies, up to 4x 1500W, provide up to 94% efficiency (Platinum Plus), infrastructure power efficiencies with -48VDC input voltages and support for HPE Power Discovery Services.

– Customer-inspired and easily accessible features include: front access processor/memory drawer for ease of serviceability, hot pluggable fans and drives, optional Systems Insight Display (SID) for health and monitoring of components and Quick Reference code for quick access to product information.

• Agile infrastructure management for accelerating IT service delivery:

– With HPE ProLiant DL580 Gen9 server, HPE OneView (optional) provides infrastructure management for automation simplicity across servers, storage and networking.

– Online personalized dashboard for converged infrastructure health monitoring and support management with HPE Insight Online.

– Configure in Unified Extensible Firmware Interface (UEFI) boot mode, provision local and remote with Intelligent Provisioning and Scripting Toolkits.

– Embedded management to deploy, monitor, and support your server remotely, out of band, with HPE iLO; optimize firmware and driver updates and reduce downtime with Smart Update, consisting of SUM (Smart Update Manager) and SPP (Service Pack for ProLiant).

Figure 1. HPE ProLiant DL580 Gen9 server, front view


HPE NVMe PCIe SSDs

The introduction of the Non-Volatile Memory Express (NVMe) interface architecture propelled disk drive technology into the next era of extremely high-performing storage products. With significantly higher bandwidth/IOPS and very low latency, HPE NVMe PCIe SSD products are designed to provide efficient access to your business data. The HPE SSD portfolio offers three broad categories of SSDs based on target workloads: Read Intensive, Write Intensive, and Mixed Use. The Write Intensive NVMe SSDs selected for this Reference Architecture are designed for the highest write performance, which is best suited to online transaction processing environments. The table below shows relative performance data between drives in the SSD portfolio. See the "Resources and additional links" section for the full HPE SSD data sheet.

Table 1. Write performance comparison example between HPE Write Intensive (WI) 800GB SSDs

Write test                 HPE WI NVMe SSD   HPE WI 12G SAS SSD   HPE WI 6G SAS SSD
Sequential writes (MB/s)   1,700             580                  370
Random writes (IOPS)       99,000            68,000               46,500

HPE NVMe PCIe Workload Accelerator

The Workload Accelerator platform provides consistent microsecond-latency access for mixed workloads, multiple gigabytes per second of bandwidth, and hundreds of thousands of IOPS from a single product. The optimized Workload Accelerator architecture allows for nearly symmetrical read and write performance with excellent low-queue-depth performance, making the Workload Accelerator platform ideal across a wide variety of real-world, high-performance enterprise environments.

Figure 2. HPE NVME PCIe Workload Accelerator


The logical diagram below illustrates the major solution components, which include four Intel Xeon v4 processors, 6TB of RAM, Write Intensive NVMe PCIe SSDs, and Write Intensive NVMe PCIe Workload Accelerators.

[Figure 3 block diagram summary: HPE ProLiant DL580 Gen9 CTO (793161-B21) in a 4-socket NUMA architecture with four 24-core Intel Xeon E7-8890 v4 processors (2.2GHz/60MB/165W, 816643-L21/B21); eight DL580 Gen9 12-DIMM DDR4 memory cartridges (788360-B21) populated with 96 x 64GB QRx4 DDR4-2400 CAS-17 LRDIMMs (805358-B21) for 1.5TB of RAM per socket and 6TB total; Smart Array P830i 12Gb/s SAS RAID with FBWC; nine PCIe 3.0 x16 full-height/full-length slots distributed across the four CPUs; FlexibleLOM 10Gb adapter (556FLR-SFP+, 732456-B21); dedicated iLO 4 management port; four redundant 1200W Common Slot Platinum Plus hot-plug power supplies (656364-B21); two 400GB 6G SATA ME SFF SC EM SSDs (691866-B21) for the OS; a DL580 Gen9 NVMe 5 SSD Express Bay kit (788359-B21) with four 800GB NVMe PCIe WI SFF SC2 SSDs (736939-B21); and two 1.6TB NVMe WI HH/HL PCIe Workload Accelerators (803197-B21).]

Figure 3. Solution logical diagram


Solution components

Hardware

HPE ProLiant DL580 Gen9 server configuration
This HPE ProLiant DL580 Gen9 database server configuration is based on internal, direct-attached storage. We evaluated different storage options and used a hybrid internal storage configuration to demonstrate the performance characteristics of the different disk options and to establish an optimum hybrid configuration using different models working together. In addition, the testing evaluated two server processor family options for the HPE ProLiant DL580 Gen9 server. Both the Intel Xeon E7 v3 and v4 processor families were tested. This testing demonstrated the relative performance between the processor models and the overall benefits of deploying newer processor architectures.

The HPE ProLiant DL580 Gen9 Broadwell-based server was configured with the following components:

• Four 24-core Intel Xeon E7-8890 v4 processors at 2.20GHz

• 6TB memory (96 x 64GB HPE DDR4 SmartMemory LRDIMMs)

• 2 x 400GB 6G SATA ME 2.5in SC EM SSD (OS)

• 2 HPE 1.6TB NVMe Write-Intensive PCIe Workload Accelerators in RAID1

• 4 x 800GB NVMe PCIe Write-intensive SFF SC2 SSD in RAID10

The HPE ProLiant DL580 Gen9 Haswell-based server was configured with the following components:

• Four 18-core Intel Xeon E7-8890 v3 processors at 2.50 GHz

• 6TB memory (96 x 64GB HPE DDR4 SmartMemory LRDIMMs)

• 2 x 400GB 6G SATA ME 2.5in SC EM SSD (OS)

• 2 HPE 1.6TB NVMe Write-Intensive PCIe Workload Accelerators in RAID1

• 4 x 800GB NVMe PCIe Write-intensive SFF SC2 SSD in RAID10

Software

• Microsoft Windows Server 2016 RTM

• Microsoft SQL Server 2016 RTM/CU1

Application software

The SQL Server version used for this testing was Microsoft SQL Server 2016 (RTM) – 13.0.1601.5 with Cumulative Update 1 – 13.00.2149.0. This version was installed on the DL580 Gen9 using Microsoft Windows Server 2016 RTM.

Best practices and configuration guidance for the solution

HPE ProLiant DL580 Gen9

Initial setup for the DL580 Gen9 server consisted of various BIOS and SQL settings. The following BIOS settings were configured, derived from the HPE best practices white paper for DL580 Gen9 servers.

• Hyper-Threading – Enabled

• Intel Turbo Boost – Enabled

• HPE Power Profile – Maximum Performance

• NUMA Group Size Optimization – Clustered (default)

• QPI Snoop configuration – Cluster-on-Die


NVMe PCIe Workload Accelerator configuration

• Accelerator cards in a RAID set should both be installed in slots that belong to the same socket to benefit from NUMA-based hardware performance features.

• Figure 4 shows the location of the installed NVMe PCIe Workload Accelerator cards in a RAID1 set, highlighted in red. Because the cards are in a RAID set, we installed them in adjacent slots attached to a single socket to localize the mirroring I/O traffic.

Figure 4. HPE ProLiant DL580 Gen9 high-level block diagram. Workload Accelerator cards are installed on Slot 4 and Slot 5

SQL Server configuration guidance

• T834 flag / large-page allocation – Enabled

• Max Degree of Parallelism (MAXDOP) – set to 1
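The RA's exact configuration scripts are not reproduced here; as a minimal Transact-SQL sketch of applying these settings, MAXDOP can be set with sp_configure, while trace flag 834 is a startup parameter (-T834) that also requires the SQL Server service account to hold the Lock pages in memory right:

-- Minimal sketch (assumed approach, not the RA's exact script):
-- set MAXDOP to 1 for this consolidation workload.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
-- Trace flag 834 (large-page buffer pool allocations) cannot be enabled at run time;
-- add -T834 to the SQL Server startup parameters and restart the instance.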

Capacity and sizing

Workload description and test methodology

The OLTP databases are part of a stock trading application emulator, in which clients connect to the databases and perform trade buy, sell, and market orders and reports. The workload used for characterizing the HPE ProLiant DL580 Gen9 server consisted of eight 100GB OLTP databases. Each database is spread across eight 12.5GB database files.

For each test the workload was run long enough to warm the SQL buffer pool and was deemed warm when the average bytes per read Windows® Performance Monitor counter dropped from 64K to 8K bytes. Once the buffer pool was warm a measurement period of approximately 15 minutes was used to record steady state performance.

Metrics were collected using the Windows Performance Monitor tool and included counters such as CPU Utilization, Physical Disk counters, and SQL Batch requests per second.
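Most counters were captured with Performance Monitor; as a complementary check, batch requests per second can also be sampled from inside SQL Server. A minimal sketch (an assumed illustration; the counter is cumulative, so two samples taken a known interval apart must be differenced):

-- Sample the cumulative Batch Requests/sec counter from the SQL Server DMVs.
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec'
  AND object_name LIKE '%SQL Statistics%';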


The figure below illustrates our test layout, showing the DL580 Gen9 server hosting 800GB of OLTP databases, connected to the workload engine VMs over a 10Gb network.

Figure 5. OLTP workload test layout

The following sections will describe the test results and analysis for several key decision points in the design of our SQL Server environment.

• NVMe characterization – Characterize SSDs and Workload Accelerator cards to determine the layout of SQL database files.

• SQL memory characterization – Analyze the least amount of memory required by SQL Server while maintaining reasonable performance.

• HPE ProLiant DL580 Gen9 system and SQL settings characterization results – Configure DL580 Gen9 BIOS settings to deliver optimal performance.

• SQL backup tests – Analyze workload performance during concurrent backup jobs.

• SQL index rebuild tests – Analyze workload performance during concurrent online index rebuilds.

• Processor comparison (Broadwell versus Haswell) – Performance comparison between Broadwell and Haswell processors

NVMe characterization

As part of this Reference Architecture, we evaluated two different internal media options to determine their best use in SQL deployments. The environment contains both a RAID10 array of four 800GB NVMe SSDs and a RAID1 array of two Workload Accelerators.

The Flexible IO Tester (FIO) tool was used to measure IOPS, throughput, and latency. Tests were run to measure random read, random write, sequential write, and random read-write measurements, sweeping through different Queue Depths to find the optimal test drive point with reasonable latencies that would mirror the requirements of a transaction DB server.

Read tests were performed at 8K byte reads, while sequential writes were set at 64K bytes. Duration for each data collection run was set to 5 minutes. A 12.5GB test file similar in size to our actual database data file was used. See Appendix B for the FIO tool configuration.



In Figure 6 below, SSDs have better read performance with 16.3% more IOPS than the Workload Accelerators during random read tests.

Figure 6. NVMe characterization – Random Read results at optimal Queue Depth

Figure 7 shows that SSDs had better read-write performance with 21.8% more IOPS than the Workload Accelerators during random read-write tests.

Figure 7. NVMe characterization – Random Read-Write results at optimal Queue Depth

[Figure 6 chart data: Accelerators (RAID 1) random read 102,157 IOPS; SSDs (RAID 10) random read 118,763 IOPS – RAID10 SSD random reads outperform the PCIe flash cards by 16.3%. Figure 7 chart data: Accelerators (RAID 1) random read-write 71,815 IOPS; SSDs (RAID 10) random read-write 87,540 IOPS – RAID10 SSD random read-writes outperform the PCIe flash cards by 21.8%.]


Figure 8 shows that SSDs had better write performance with 9.7% more IOPS than the Workload Accelerators during random write tests.

Figure 8. NVMe characterization – Random Write results at optimal Queue Depth

The bar graph below shows that the SSDs offer more than twice the sequential write throughput of the Workload Accelerators, at 3.5GB/sec. The SSDs were chosen over the Workload Accelerators for transaction log files due to the higher bandwidth.

Figure 9. NVMe characterization – Sequential Write results at optimal queue depth of 256

[Figure 8 chart data: Accelerators (RAID 1) random write 84,521 IOPS; SSDs (RAID 10) random write 92,695 IOPS – RAID10 SSD random writes outperform the PCIe flash cards by 9.7%. Figure 9 chart data: Accelerators (RAID 1) sequential write 1,634 MB/sec; SSDs (RAID 10) sequential write 3,484 MB/sec – the SSDs provide 2x the sequential write throughput.]


One of the greatest benefits of NVMe PCIe-based SSDs is very low latency. The graph below shows latencies at queue depth of 32.

Figure 10. NVMe characterization – Latency results at multiple queue depth test points

Based on our NVMe characterization results, we recommend the NVMe SSDs (RAID10) over the Workload Accelerators (RAID1) for the transaction logs, where they give the best transactional throughput. Measured IOPS on the two NVMe products are within 16% of each other, while sequential write throughput is over 100% higher on the NVMe SSDs (RAID10), at about 3.5GB/sec.

With our analysis, we recommend the following:

• NVMe PCIe Workload Accelerators for data where reasonable read/write (60/40) performance is sustained.

• NVMe PCIe SSDs using four disks for logs where high sequential transactional write access is essential.

[Figure 10 chart data, latency in microseconds at queue depth 32: RAID 10 random write 203, RAID 1 random write 225, RAID 10 random read 195, RAID 1 random read 226, RAID 10 random read-write 321, RAID 1 random read-write 455, RAID 10 sequential write 615, RAID 1 sequential write 1,258 – the RAID10 SSDs offer a 51% reduction in sequential write latency.]


SQL memory characterization

By default, SQL Server dynamically adjusts its memory usage based on available system resources. With 6TB of RAM in the DL580 Gen9 server, our SQL Server instance has ample memory available for large database workloads. The following tests show that, below a certain point, too little memory has a negative impact on performance, and that memory sizing is key to achieving high performance.

For our workload, we limited the memory that SQL Server can use by setting the maximum server memory. To determine the least amount appropriate for our workload, we performed a memory ramp-down test that started with 1.6TB of max memory and then stepped the amount down 30-60 minutes after the database was warm. Measurements were taken at max memory settings of 1.6TB, 800GB, 600GB, 400GB, 200GB, 100GB, and 50GB.
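As a minimal sketch of the ramp-down mechanism (an assumed illustration, not the RA's exact script; the value is in MB, so 800GB corresponds to 819200):

-- Cap SQL Server's buffer pool for the next measurement step (here 800GB).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 819200;
RECONFIGURE;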

Figure 11 shows the actual Windows Performance Monitor graph. Performance is optimal at a 1:1 memory-to-database ratio (800GB); as we step down the SQL Server maximum memory, disk reads (blue line) increase and transactional performance (green line) is slightly reduced. At 200GB and below, workload performance decreased significantly.

Figure 11. SQL Server performance under varied SQL maximum memory setting



Figure 12 below shows the charted results of our memory test. There are three key points to this test:

• Memory sizing for the 800GB database set benefits from 400GB or more of RAM. Smaller memory configurations resulted in drastic performance reductions. Larger amounts allow for database growth and workload surges.

• Batch requests per second (BRPS) drops by 33% from 47K to 31.5K when SQL Server max memory is reduced to a 1:4 ratio (200GB).

• Disk reads are minimal at 206 reads/sec when SQL Server max memory is set to 800GB, compared to disk reads of 8.3K, 39K, and 65K at max mem of 600GB, 400GB, and breaking point of 200GB respectively.

We chose an 800GB max memory setting, which gives us over 51K batch requests per second and roughly 200 read IOPS for the simulated workload.

Figure 12. SQL max. memory requirement. SQL memory ramp-down test results

SQL OLTP workload test

Once the SQL databases are laid out optimally on NVMe storage and the SQL max memory is set, our OLTP SQL environment is ready for SQL OLTP workload performance testing. The purpose of the OLTP test was to show a stable running OLTP workload within the compute capabilities of the HPE ProLiant DL580 Gen9 server. Several BIOS and SQL Server settings were modified to establish which settings provide the best performance for this particular workload. By doing so, this Reference Architecture serves as optimization guidance for customers deploying SQL Server on HPE ProLiant servers.

Microsoft Windows Server 2016 RTM and SQL Server 2016 RTM/CU1 were the foundation for these tests.

Eight 100GB databases were used to simulate a large-scale environment with high CPU and I/O utilization. Each 100GB database was spread into eight 12.5GB data files, which were located on Drive D. Logs were kept on Drive F.
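As an illustrative sketch of this file layout for one of the eight databases (the database name, file names, and log size are hypothetical; the RA's actual creation scripts are not shown):

-- One 100GB OLTP database spread across eight 12.5GB data files on Drive D, with the log on Drive F.
CREATE DATABASE TradeDB1
ON PRIMARY
    (NAME = TradeDB1_data1, FILENAME = 'D:\SQLData\TradeDB1_data1.mdf', SIZE = 12800MB),
    (NAME = TradeDB1_data2, FILENAME = 'D:\SQLData\TradeDB1_data2.ndf', SIZE = 12800MB),
    (NAME = TradeDB1_data3, FILENAME = 'D:\SQLData\TradeDB1_data3.ndf', SIZE = 12800MB),
    (NAME = TradeDB1_data4, FILENAME = 'D:\SQLData\TradeDB1_data4.ndf', SIZE = 12800MB),
    (NAME = TradeDB1_data5, FILENAME = 'D:\SQLData\TradeDB1_data5.ndf', SIZE = 12800MB),
    (NAME = TradeDB1_data6, FILENAME = 'D:\SQLData\TradeDB1_data6.ndf', SIZE = 12800MB),
    (NAME = TradeDB1_data7, FILENAME = 'D:\SQLData\TradeDB1_data7.ndf', SIZE = 12800MB),
    (NAME = TradeDB1_data8, FILENAME = 'D:\SQLData\TradeDB1_data8.ndf', SIZE = 12800MB)
LOG ON
    (NAME = TradeDB1_log, FILENAME = 'F:\SQLLogs\TradeDB1_log.ldf', SIZE = 16384MB);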

The buffer pool was warmed up before each measurement was taken. The buffer pool is considered warm when initial warmup read ahead activity ends. This occurs when the average bytes per read counter drops from 64K to 8K, along with total CPU utilization in a steady, leveled state. Transactional throughput was measured with SQL Server Performance monitor counter “Batch requests per second”.

[Figure 12 chart: batch requests/sec and disk reads/sec plotted against SQL max memory settings of 1600, 800, 600, 400, 200, 100, and 50GB.]


A baseline test with a workload drive point of 28 users per database was used to achieve an overall system CPU utilization at about 80%. The baseline consisted of the following BIOS and SQL Server settings:

• Maximum Performance

• Hyper-Threading disabled

• Power Profile – Maximum Performance

• Power Regulator – High Static Performance mode

• NUMA Group Size Optimization – Clustered

• QPI Snoop configuration – Cluster-on-Die

• Hardware NUMA / CPU affinity testing

• SQL Server – Soft-NUMA disabled

• No Trace Flags

Each setting was tested to evaluate its impact on performance:

• Hyper-Threading enabled vs. Hyper-Threading disabled

• With T834 flag (large-page allocation) vs. without T834 flag

• SQL soft-NUMA enabled vs. soft-NUMA disabled

• NUMA Group Size Optimization and QPI Snoop configuration settings:

– Clustered / Home Snoop

– Flat / Home Snoop

– Clustered / Cluster-on-Die

– Flat / Cluster-on-Die

• CPU-affinity vs. No CPU-affinity testing

Details for the above settings are described in the next section.

Once our baseline test was complete, our first comparison was with T834 flag versus no T834 flag. We kept the setting with the better performance for the rest of the tests.

In order to minimize reboots and shorten test cycles (taking advantage of a warm buffer pool), we did several tests together. In addition, to minimize test harness configuration changes, we kept all test configurations in three main groups – 4 NUMA nodes, 8 NUMA nodes, and no Port to CPU affinity.

Note
This characterization shows the impact and gains measured for the OLTP workload used. Every workload is different and should be evaluated for optimum settings separately.


HPE ProLiant DL580 Gen9 system and SQL settings characterization results

Hyper-Threading
When enabled, Hyper-Threading allows a physical processor core to run two threads of execution. The Windows operating system then sees double the logical CPUs per NUMA node, which often results in an increase in performance. However, two logical processors sharing the same physical resources can increase resource contention and processor overhead, so some workloads can experience decreased performance. HPE recommends testing with Hyper-Threading whenever possible to see whether it is better for your workload.

CPU affinity was used to align different workload databases with specific CPU (NUMA) nodes, so that each database workload ran on a specific NUMA node and benefited from local NUMA memory access. With CPU affinity, enabling Hyper-Threading boosted performance by 15%, from 53.9K BRPS to 61.8K BRPS. With no CPU affinity, enabling Hyper-Threading also increased performance, by 18%, from 48.7K to 57.4K BRPS. In addition, having Hyper-Threading enabled allowed us to increase the user workload from 28 to 42 users per database, 50% more users, while keeping CPU utilization at the same 80%.

Table 2. Batch requests on Broadwell-based DL580 Gen9. Key results for Hyper-Threading and CPU affinity options

Key test points with Hyper-Threading and CPU affinity BRPS CPU_Total % Latency / MS (Data) Latency / MS (Logs)

HT enabled @ 28 users – with affinity 54569 66.43% 0.000 0.000

HT enabled @ 42 users – with affinity 61790 80.16% 0.001 0.000

HT enabled @ 42 users – no affinity 57420 86.36% 0.000 0.000

HT disabled @ 28 users – with affinity 53854 80.70% 0.001 0.000

HT disabled @ 28 users – no affinity 48725 83.59% 0.001 0.000

NUMA Group Size Optimization and QPI Snoop configuration

The NUMA Group Size Optimization option on Gen9 servers configures how the system ROM reports the number of logical processors in a NUMA node. This option can be set to Flat or Clustered (default). When set to Clustered, the physical socket serves as the boundary for the NUMA processor group. When set to Flat, Windows adjusts the processor groups to minimize the number of groups and balance their size. This setting works in conjunction with the QPI Snoop configuration, which determines the snoop mode used by the QPI bus. In Home Snoop mode, the number of hardware NUMA nodes reported in BIOS is the same as the number of NUMA nodes reported in Windows. When Cluster-on-Die is selected, however, two NUMA nodes per socket are reported to the Windows operating system. See Appendix D for a pictorial representation of how NUMA nodes are mapped in Windows and within SQL Server.

These two BIOS settings can greatly impact performance of SQL workloads. For our workload, we cycled through the NUMA/QPI combinations to find the optimal setting to yield the best performance.

Table 3 shows performance results for the NUMA/QPI settings. For our workload environment, only a 1-4% difference separates the measured batch requests when toggling among the four combinations of Flat / Clustered and Home Snoop / Cluster-on-Die, across variations of with/without CPU affinity and Hyper-Threading enabled/disabled.

With CPU affinity, we experienced only a 1% difference in batch requests per second, with the four possible combinations of NUMA/QPI settings. Meanwhile, we saw a 4% difference in batch requests with no CPU affinity. In most cases, the highest batch requests were achieved when set to Flat / Cluster-on-Die for our workload while testing with the variations of with/without CPU affinity and Hyper-Threading enabled/disabled.

While our testing with the NUMA / QPI settings in BIOS showed minimal performance differences for our workload, as shown in Table 3, we kept the Flat / Cluster-on-Die setting to get the most batch requests per second. As your workload may experience a wider swing in performance, HPE recommends testing these BIOS settings to get the best performance possible for your workload.

Table 3. Batch requests on Broadwell-based DL580 Gen9. NUMA Group / QPI Link optimization Hyper-Threading enabled with CPU affinity

BRPS CPU_Total % Latency / MS (Data) Latency / MS (Logs)

Clustered / Cluster-on-Die 61267 76.81% 0.000 0.000

Clustered / Home Snoop 61494 85.15% 0.000 0.000

Flat / Cluster-on-Die 61790 80.16% 0.001 0.000

Flat / Home Snoop 61329 78.48% 0.000 0.000


Trace flag T834 / Large-page allocation

Trace flag T834 can be enabled to have SQL Server use large-page allocations for the memory buffer pool. This flag can improve performance by increasing the efficiency of the TLB in the CPU. T834 applies only to 64-bit versions of SQL Server, and the SQL Server service account must have the Lock pages in memory user right for trace flag T834 to take effect.

Our test shows that enabling this flag gave a gain of 1.2% in batch requests per second, from 53.1K to 53.7K. The benefit of large pages in memory is workload dependent, and we recommend testing it to establish the value for each particular workload.
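Because large pages are allocated at instance startup, T834 is a startup-only trace flag; a minimal sketch of verifying it (an assumed check, not part of the RA's scripts):

-- Returns the global status of trace flag 834; enable it via the -T834 startup parameter and a restart.
DBCC TRACESTATUS(834, -1);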

Table 4. Batch requests on Broadwell-based DL580 Gen9. T834/large-page allocation vs. no T834/large-page allocation result

Setting BRPS CPU_Total % Latency / MS (Data) Latency / MS (Logs)

No Trace flag T834 / Large-page allocation 53077 80.90% 0.000 0.000

Trace flag T834 / Large-page allocation 53710 80.25% 0.002 0.000

SQL Server – Automatic soft-NUMA

Soft-NUMA allows SQL Server to further group CPUs into nodes at the software level. Enabled by default, automatic soft-NUMA sub-divides hardware NUMA nodes into smaller groups.

Enabling automatic soft-NUMA increases the number of soft-NUMA nodes reported within SQL Server. When using affinity, make sure to verify the number of NUMA nodes in the SQL Server logs to correctly set up your affinitized workload.

Automatic soft-NUMA can be beneficial on servers with or without hardware NUMA. However, this further division of CPUs can cause contention and may actually degrade performance, depending on the workload, as the division of processors is spread across multiple sockets. In Table 5, we see a 2-3% decrease in batch requests when automatic soft-NUMA is enabled. Alternatively, you can manually configure how SQL Server divides the logical processors into processor groups by entering node configuration in the registry to give a custom CPU mask and processor group.

Table 5. Batch requests on Broadwell-based DL580 Gen9. Automatic Soft-NUMA enabled vs. disabled result with Hyper-Threading disabled

BRPS CPU_Total % Latency / MS (Data) Latency / MS (Logs)

Soft-NUMA Enabled 52357 79.73% 0.001 0.000

Soft-NUMA Disabled 53854 80.77% 0.001 0.000

We created a custom soft-NUMA affinity with 8 nodes so that each database connects to its own node. Each node consists of half of the logical processors from a single socket. Table 6 shows the performance gain of custom soft-NUMA over automatic soft-NUMA; we saw a 2% increase in batch requests, from 51790 to 52830. Setting the CPU mask and processor group is done in the registry. Appendix D shows how to set up manual soft-NUMA.

HPE recommends testing with automatic or manual soft-NUMA to optimize your workload.
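In SQL Server 2016, automatic soft-NUMA can also be toggled with ALTER SERVER CONFIGURATION (an instance restart is required for the change to take effect), and the resulting node layout can be checked from the DMVs. A minimal sketch, not the RA's exact procedure, which used the registry for the custom layout:

-- Turn automatic soft-NUMA off (or ON), then verify the node layout after a restart.
ALTER SERVER CONFIGURATION SET SOFTNUMA OFF;

SELECT node_id, node_state_desc, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';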

Note
SQL Server databases can have affinity either to SQL soft-NUMA nodes, or to the underlying hardware NUMA nodes when soft-NUMA is disabled.

Table 6. Batch requests on Haswell-based DL580 Gen9. Results at key test points with best settings and CPU affinity.

Haswell Server – 30 Users BRPS CPU_Total % Latency / MS (Data) Latency / MS (Logs)

HT Enabled – Soft-NUMA disabled 51441 81.93% 0.000 0.000

HT Enabled – Automatic soft-NUMA enabled 51790 78.79% 0.000 0.000

HT Enabled – Custom soft-NUMA 52830 77.79% 0.000 0.000


Hardware NUMA / CPU affinity

SQL Server is a NUMA-aware application that requires no special configuration to take advantage of NUMA-based hardware. On NUMA hardware, processors are grouped together along with their own memory and I/O channels; each group is called a NUMA node. A NUMA node can access its local memory and can also access memory from other NUMA nodes. However, access to local memory is faster than access to remote memory on other NUMA nodes, hence the name Non-Uniform Memory Access architecture.

With an affinitized workload versus a non-affinitized workload, we experienced a gain of 8%-11% in batch requests, depending on the Hyper-Threading setting. With Hyper-Threading enabled, we measured an 8% improvement; with Hyper-Threading disabled, an 11% improvement, based on the data in Table 2.
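Server-level NUMA affinity can be expressed in Transact-SQL as a reference point; the per-database affinity used in these tests was implemented through port-to-NUMA-node mapping (the "Port to CPU affinity" grouping mentioned earlier; see Appendix C). A minimal, hedged sketch of the server-wide form only:

-- Bind the SQL Server process to all eight NUMA nodes exposed under Cluster-on-Die
-- (the node count is an assumption based on 2 nodes per socket x 4 sockets).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 7;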

Overall performance gain with optimized BIOS and SQL settings

After identifying optimum BIOS and SQL settings and setting proper database affinity to hardware NUMA nodes, we measured a cumulative 26% improvement in performance: an 18% gain from enabling Hyper-Threading and a further 8% gain from setting database-level affinity. It is very important to review the system and SQL Server settings during deployment and proof-of-concept testing to take full advantage of the performance capabilities of the HPE ProLiant DL580 Gen9 server.

Figure 13. Cumulative performance gain with Hyper-Threading enabled and with database to CPU affinity

SQL backup tests

In addition to evaluating our primary database transactional workload and identifying optimum system settings for performance, the following backup tests are intended to show the system's ability to handle a secondary surge workload with no or minimal interference with the performance of the primary workload.

Performing backups using the SQL Server Management Studio (SSMS) GUI is fairly easy and user-friendly; however, GUI-invoked backups do not expose certain backup command options that are only available through command-line scripting. Running backups in Transact-SQL unlocks backup switch options that allow businesses to perform offline backups at faster speeds, or online backups at slightly reduced SQL performance, during the regular overnight maintenance periods.

Our backup test plan was based on a 100GB OLTP database; we ran backups in two sets, each with three scenarios:

• Set #1 – No Load

– Default backup as found in SSMS, without Compression

– A backup with Compression as found in SSMS



– A compressed backup with switch options (a Transact-SQL sketch of this command follows the list below):

blocksize=65536

maxtransfersize=4194304

buffercount=300

• Set #2 – With OLTP workload running

– Default backup as found in SSMS, without Compression

– A backup with Compression as found in SSMS

– A compressed backup with switch options:

blocksize=65536

maxtransfersize=4194304

buffercount=300
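As a hedged Transact-SQL sketch of the custom compressed backup used in these tests (the database name and backup target path are illustrative, not taken from the RA):

-- Compressed backup with tuned transfer options, mirroring the switch values listed above.
BACKUP DATABASE TradeDB1
TO DISK = 'E:\Backups\TradeDB1_full.bak'
WITH COMPRESSION,
     BLOCKSIZE = 65536,
     MAXTRANSFERSIZE = 4194304,
     BUFFERCOUNT = 300,
     STATS = 10;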

With no load (Table 7), our tests yielded two key points:

• The backup rate increased by 79%, from 2430MB/sec to 4361MB/sec, when using compression with the additional switch options.

• The time spent on the backup decreased by 44%, from 46 seconds to just 26 seconds, for a physical database size of 112.38GB.

To put this in a real-life scenario, imagine a business with a 1TB SQL database: a full backup would take only about 4 minutes and 33 seconds using the tuned compression command script.

Table 7. No Load – Compression / No compression backup test results (backup performed with no active OLTP workload; database size = 112.38GB)

Backup type                      MB/sec   % Difference   Duration (sec)   % Difference
Backup with default settings     2430     --             46               --
Backup with plain compression    362      -85%           311              572%
Backup with custom compression   4361     79%            26               -44%

With the workload running (Table 8), our tests yielded three key points:

• The backup rate increased by 108%, from 746MB/sec to 1549MB/sec, when using compression with the additional switch options.

• The time spent on the backup decreased by 53%, from 159 seconds to just 76 seconds, for a physical database size of 112.38GB.

• Most important, performance of the running workload decreased by only 13.5% during the backup.

Table 8. With running workload – Compression / No compression backup test results (backup performed during active OLTP workload; database size = 112.38GB)

Backup type                      BRPS before   BRPS during   % Diff    MB/sec     % Diff   Duration (sec)   % Diff
Backup with default settings     48852         48362         -1.0%     745.88     --       159              --
Backup with plain compression    48762         48096         -1.4%     326.23     -56%     364              128%
Backup with custom compression   48953         42322         -13.5%    1548.63    108%     76               -53%


SQL index rebuild tests

The max degree of parallelism (MAXDOP) option determines the maximum number of processors used during an index rebuild operation. With the default value of zero (0), the server determines the number of CPUs used for the index operation based on the actual available processors and the current system workload. However, you can manually configure the number of processors used for index operations by specifying a value; doing so may impact performance positively or negatively during an online index rebuild. To find the optimal value for our workload, we measured workload performance during concurrent online index rebuilds with varying MAXDOP values.

Our index rebuild test plan consisted of running several degree-of-parallelism scenarios, measuring both the performance of the OLTP workload and the duration of the indexing during online index rebuilds. We used a single TPC-E-like database with an un-partitioned index of 207,360,000 rows, 3,018,933 data pages, and 23.585GB in size. The first performance test was done with no parallelism (MAXDOP set to 1). The next tests were done with MAXDOP values of 96, 64, 32, and the default of 0.
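A minimal Transact-SQL sketch of one such online rebuild test point (the index and table names are illustrative, not from the RA):

-- Online index rebuild; MAXDOP was varied across 1, 96, 64, 32, and 0 (default) in the tests.
ALTER INDEX IX_TradeHistory ON dbo.TradeHistory
REBUILD WITH (ONLINE = ON, MAXDOP = 0);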

The table below shows that leaving MAXDOP at zero (the default) yields the best online performance, with only a 23% (average) decrease in batch requests per second while the index rebuild took just 52 seconds. With the default of zero, the server chooses the number of CPUs for the index operation based on the available processors and the current system workload; fixing MAXDOP to a high value can over-subscribe CPUs and leave insufficient resources for other applications and database operations for the duration of the index operation, causing performance to decline.

For our workload, leaving MAXDOP index option at default (zero) is recommended during online index rebuild.

Table 9. Performance results during online index rebuild with varying MAXDOP values (index rebuild on 20GB index table on 96-core Broadwell)

MAXDOP      BRPS before rebuild   BRPS (avg) during rebuild   % Diff   BRPS (min) during rebuild   % Diff   Duration    % Diff
MAXDOP=1    61849                 41688                       -33%     23944                       -61%     01:19:03    52%
MAXDOP=96   61849                 46354                       -25%     33004                       -47%     00:58:09    12%
MAXDOP=64   61849                 35117                       -43%     18442                       -70%     01:55:57    121%
MAXDOP=32   61849                 34351                       -44%     16166                       -74%     02:04:13    138%
MAXDOP=0    61849                 47567                       -23%     34407                       -44%     00:52:48    --

Processor comparison (Broadwell versus Haswell)
The HPE ProLiant DL580 Gen9 supports Intel Xeon E7-4800/8800 v3 and v4 processors. Selecting the right processor for the SQL Server workload is key to the design of the SQL Server environment, so the next step was to compare the newer Broadwell processor with the Haswell processor. With the best BIOS and SQL Server settings, and with CPU affinity on Broadwell (as shown in Table 4), the Flat/Cluster-on-Die combination yielded 61,790 batch requests per second. We took the same settings and ran the same workload with the same NVMe PCIe storage on a Haswell-based DL580 Gen9 server and compared results. Keep in mind that the Haswell-based server also has four processors, but with only 18 cores each running at a higher frequency.

The table and figure below show the test results: a 20% performance increase when upgrading from Haswell to Broadwell processors.

Table 10. Performance comparison between Broadwell and Haswell processors with optimal BIOS and SQL settings

| DL580 server – 30 users | BRPS | CPU total % | Latency ms (data) | Latency ms (logs) |
| --- | --- | --- | --- | --- |
| DL580 Gen9 with Haswell processor | 51441 | 81.93% | 0.000 | 0.000 |
| DL580 Gen9 with Broadwell processor | 61790 | 80.16% | 0.001 | 0.000 |


Figure 14. Broadwell versus Haswell performance results

Analysis and recommendations
• When using localized storage, NVMe PCIe storage products in your SQL Server solution deliver benefits not seen with conventional SSDs and HDDs. Whether in SFF drive form factor or as Workload Accelerator cards, these NVMe devices, when configured and used properly in the SQL environment, can dramatically improve disk access for your workload.

• Memory sizing is important. Memory allocations below a 1:2 memory-to-database-size ratio can result in performance degradation; wherever possible, use a 1:1 ratio or better.

• Setting the SQL Server max server memory for each SQL instance and enabling locked, large-page memory for the buffer pool can improve workload performance (see the configuration sketch after this list).

• Our testing sought the best configuration for our workload across several important server and SQL Server settings. Taking all tests into consideration, two best practices emerged: use CPU affinity, and prefer Cluster-on-Die over soft-NUMA.

• Hyper-Threading can positively impact your workload. It enabled SQL Server to handle more load (an increase from 28 to 42 users in our workload) while improving performance with an 18% gain in batch requests per second.

• CPU affinity can greatly improve performance. With an 11% performance improvement on our workload, CPU affinity lets each database instance be assigned to its own NUMA nodes, giving it exclusive logical processors, memory, and I/O per node (see the sketch after this list). When using affinity, verify the number of NUMA nodes reported in the operating system and in SQL Server to ensure affinitized workloads are working properly.

• While each workload is different, HPE recommends testing the NUMA Group Size Optimization and QPI Snoop configuration settings to find the best performance for your workload.

• For Gen9 servers with Broadwell processors, we recommend using hardware-NUMA with Cluster-on-Die versus using SQL soft-NUMA when implementing a NUMA affinitized workload.
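The max server memory and CPU affinity recommendations above can be expressed in T-SQL. The sketch below is a minimal example; the 256GB memory cap and the NUMA node range are illustrative assumptions and may differ from the values used in the tested configuration.

```sql
-- Minimal sketch; the memory cap (256GB) and NUMA node range are illustrative.
-- Locked/large-page buffer pool memory additionally requires the Windows
-- 'Lock pages in memory' user right (and trace flag 834 for large pages).

-- Cap the instance's memory usage.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 262144;  -- 256GB
RECONFIGURE;

-- Bind the instance's schedulers to specific NUMA nodes (CPU affinity).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 1;
```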

[Figure 14 chart: batch requests per second – Haswell 51,441 versus Broadwell 61,790, a 20% performance increase with Broadwell CPUs.]


Summary
The HPE ProLiant DL580 Gen9 server is a powerful and versatile platform for Microsoft SQL Server consolidation deployments.

• The large number of storage options and PCI expansion slots provides the flexibility needed to deploy high-performance, scalable SQL Server environments.

• Intel Broadwell processors improve upon prior generations, providing a 20% gain in our OLTP tests compared to Haswell processors.

• When the DL580 Gen9 with Broadwell processors was configured optimally, we measured a cumulative gain of 26% compared to default settings.

• The backup and index maintenance job testing showed minimal impact on the primary workload.

• The HPE ProLiant DL580 Gen9 with Broadwell-based Intel CPUs delivered almost 62K batch requests per second, compared to roughly 51K with Haswell.

Our testing shows that this Reference Architecture provides exceptional performance, making it ideal for consolidation efforts during hardware refresh cycles. Highly optimized workloads maintained their performance during secondary maintenance operations, giving confidence in the platform's production behavior as a primary database server.

Because of new hardware, new applications, and new policies and practices, consolidation is an ongoing, never-ending effort. Our Reference Architecture provides an example platform whose performance and scalability make future consolidation easier and support the company's growth.

Appendix A: Bill of materials

Note
Part numbers are current as of the time of testing and are subject to change. The bill of materials does not include complete support options or other rack and power requirements. If you have questions regarding ordering, please consult with your HPE Reseller or HPE Sales Representative for more details. hpe.com/us/en/services/consulting.html

Table 11. Bill of materials. Broadwell server

HPE ProLiant DL580 Gen9 server

| Qty | Part number | Description |
| --- | --- | --- |
| 1 | 793161-B21 | HPE DL580 Gen9 CTO Server |
| 1 | 816643-L21 | HPE DL580 Gen9 Intel Xeon E7-8890v4 (2.2GHz/24-core/60MB/165W) FIO Kit |
| 3 | 816643-B21 | HPE DL580 Gen9 Intel Xeon E7-8890v4 (2.2GHz/24-core/60MB/165W) Kit |
| 96 | 805358-B21 | HPE 64GB (1x64GB) Quad Rank x4 DDR4-2400 CAS-17-17-17 Load Reduced Memory |
| 8 | 788360-B21 | HPE DL580 Gen9 12 DIMMs Memory Cartridge |
| 1 | 788359-B21 | HPE DL580 Gen9 NVMe 5 SSD Express Bay Kit |
| 2 | 691866-B21 | HPE 400GB 6G SATA ME 2.5in SC EM SSD |
| 2 | 803197-B21 | HPE 1.6TB NVMe WI HH PCIe Accelerator |
| 4 | 736939-B21 | HPE 800GB NVMe PCIe WI SFF SC2 SSD |
| 1 | 732456-B21 | HPE Flex Fbr 10Gb 2P 556FLR-SFP+FIO Adptr |
| 1 | 758836-B21 | HPE 2GB FIO Flash Backed Write Cache |
| 4 | 656364-B21 | HPE 1200W CS Plat PL HtPlg Pwr Supply Kit |
| 1 | BD505A | HPE iLO Adv incl 3yr TSU 1-Svr Lic |


Appendix B: FIO tool command line configuration for disk I/O benchmarking
Running the FIO tool with a job file minimizes run starts and stops, and multiple tests can be combined to simplify testing. A global section sets the defaults for the jobs (or tests) described in the file, which removes redundant options that would otherwise be repeated for each test, such as file name, test duration, or number of threads. The example below performs four disk I/O tests – random read, random write, sequential write, and mixed random read/write – each with one request outstanding. The three random I/O tests use an 8K block size, while the sequential write test uses a 64K block size. The global section configures every test to run with one thread for 5 minutes against fio_testfile, using non-buffered I/O.

```
[global]
ioengine=windowsaio
size=12500MB
direct=1
time_based
runtime=300
directory=/fio
filename=fio_testfile
thread=1
new_group

[rand-read-1]
iodepth=1
bs=8k
rw=randread
stonewall

[seq-write-1]
iodepth=1
bs=64k
rw=write
stonewall

[rand-write-1]
iodepth=1
bs=8k
rw=randwrite
stonewall

[rand-40/60-1]
iodepth=1
bs=8k
rw=randrw
rwmixread=40
```

For more information about using this tool, visit: http://bluestop.org/fio/


Appendix C: Understanding hardware and software NUMA in SQL Server 2016 and how to set affinity on Broadwell
Figure 15 shows how NUMA nodes are mapped in Windows and within SQL Server, and the resulting TCP/IP port-to-NUMA-node bitmask for each configuration.

[Figure 15 diagram: NUMA node configuration on a 4-socket, 8-core-per-socket system, shown for four configurations:
• Default configuration – QPI Snoop: Home Snoop, SQL Server automatic soft-NUMA enabled: SQL Server ignores hardware NUMA and overlays its own nodes; 4 hardware nodes are seen in Resource Monitor; TCP/IP uses an 8-bit bitmask (11111111).
• QPI Snoop: Home Snoop, SQL Server automatic soft-NUMA disabled: 4 hardware nodes are seen in Resource Monitor and in SQL Server; TCP/IP uses a 4-bit bitmask (1111).
• QPI Snoop: Cluster-on-Die, SQL Server automatic soft-NUMA disabled: 8 hardware nodes are seen in Resource Monitor and in SQL Server; TCP/IP uses an 8-bit bitmask (11111111).
• QPI Snoop: Cluster-on-Die, SQL Server automatic soft-NUMA enabled: 8 hardware nodes are seen in Resource Monitor, and SQL Server splits them into 16 soft-NUMA nodes; TCP/IP uses a 16-bit bitmask (1111111111111111).]

Figure 15. NUMA node configuration on 4-socket 8-core system


Appendix D: Creating a custom soft-NUMA configuration
This example is for a 4-socket server with 18-core processors and Hyper-Threading enabled, running an eight-database workload. General instructions can be found at https://msdn.microsoft.com/en-us/library/ms345357.aspx

• First, we split the 36 logical cores of each socket into two 18-core halves.

• With a programmer’s calculator, enter 18 adjacent ones to represent 18 adjacent logical cores, and convert to hex or decimal.

11 1111 1111 1111 1111 = 3FFFF or 262143(dec)

• Then enter 18 ones to represent the next 18 adjacent logical cores, followed by 18 zeros for the first 18 cores used previously.

1111 1111 1111 1111 1100 0000 0000 0000 0000 = FFFFC0000 or 68719214592(dec)

• Table 12 shows the CPUMask and group values for each node.

Table 12. Custom CPUMask and group for a manual soft-NUMA configuration that will position each node within half a socket

|  | CPUMask (hex) | CPUMask (dec) | Group |
| --- | --- | --- | --- |
| Node 0 | 0x3FFFF | 262143 | 0 |
| Node 1 | 0xFFFFC0000 | 68719214592 | 0 |
| Node 2 | 0x3FFFF | 262143 | 1 |
| Node 3 | 0xFFFFC0000 | 68719214592 | 1 |
| Node 4 | 0x3FFFF | 262143 | 2 |
| Node 5 | 0xFFFFC0000 | 68719214592 | 2 |
| Node 6 | 0x3FFFF | 262143 | 3 |
| Node 7 | 0xFFFFC0000 | 68719214592 | 3 |
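As a quick sanity check of the mask arithmetic above, the hexadecimal CPU masks can be converted to their decimal equivalents directly in T-SQL. This is an illustrative check, not part of the original procedure.

```sql
-- Convert the two half-socket CPU masks from Table 12 to decimal.
-- 0x03FFFF is 0x3FFFF written with an even number of hex digits.
SELECT
    CAST(0x03FFFF     AS bigint) AS lower_half_mask_dec,  -- 262143
    CAST(0x0FFFFC0000 AS bigint) AS upper_half_mask_dec;  -- 68719214592
```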

• Using the registry, we create a NodeConfiguration key under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\130, and under it create eight keys, one per node, populated with the CPUMask and Group values from Table 12.

• Make sure automatic soft-NUMA is disabled, then restart SQL Server.
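In SQL Server 2016, automatic soft-NUMA can be turned off with ALTER SERVER CONFIGURATION; a minimal example is shown below. The setting takes effect after the instance restarts.

```sql
-- Disable automatic soft-NUMA so the manual registry-based node layout is used.
-- Requires a SQL Server restart to take effect.
ALTER SERVER CONFIGURATION SET SOFTNUMA OFF;
```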


• Verify that the new soft-NUMA configuration is in effect by checking the SQL Server error log, which reports the node configuration at instance startup.

• Verification can also be done with sys.dm_os_nodes:
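A query along the following lines lists the nodes SQL Server has created; the column list shown is a typical choice for checking node layout, not necessarily the exact query used in our testing.

```sql
-- List the NUMA nodes visible to SQL Server, excluding the dedicated admin (DAC) node.
SELECT node_id,
       node_state_desc,
       processor_group,
       cpu_affinity_mask,
       online_scheduler_count
FROM   sys.dm_os_nodes
WHERE  node_state_desc <> N'ONLINE DAC';
```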

• To affinitize each node to a database, we map TCP/IP ports to each node. For more information, visit https://msdn.microsoft.com/en-us/library/ms345346.aspx

TCP/IP Ports - 1500[1],1501[2],1502[4],1503[8],1504[16],1505[32],1506[64],1507[128]

• Each database listens on a specific port and exclusively accesses 18 logical cores within half a physical socket.
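Once the port-to-node mapping is in place, a session can confirm which TCP port (and therefore which node) it landed on. A minimal check, assuming a TCP connection, is sketched below.

```sql
-- Show the local TCP port of the current connection; it should match the port
-- assigned to the intended NUMA node (for example, 1500 for node 0).
SELECT local_tcp_port
FROM   sys.dm_exec_connections
WHERE  session_id = @@SPID;
```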


Sign up for updates

© Copyright 2016 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

Microsoft, Windows Server, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

4AA6-8301ENW, October 2016

Resources and additional links
HPE ProLiant DL580 Gen9 server hpe.com/servers/dl580

HPE Reference Architectures hpe.com/info/ra

HPE Servers hpe.com/servers

HPE Storage hpe.com/storage

HPE Networking hpe.com/networking

HPE Technology Consulting Services hpe.com/us/en/services/consulting.html

HPE SSD Data Sheet http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4AA4-7186ENW

Best practices configuring the HPE ProLiant DL560 and DL580 Gen9 Servers with Windows Server http://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=4AA5-1110ENW

To help us improve our documents, please provide feedback at hpe.com/contact/feedback.