26
Enterprise Storage RAS augmented by native Intel Platform Storage Extensions (PSE) Tanveer Alam Intel Corporation

Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

Embed Size (px)

Citation preview

Page 1: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Enterprise Storage RAS augmented by native Intel Platform Storage

Extensions (PSE)

Tanveer Alam Intel Corporation

Page 2: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Objective A successfully managed, RAS (Reliability, Availability, Service-ability) capable enterprise and cloud storage system, are the keys to building a highly predictable and resilient storage strategy for any enterprise or mid-sized business. However, to implement this strategy, it takes the right hardware and software building blocks. Intel ® Xeon ® Processor based processors comprise of some of the key fundamental hardware technology extensions that enables enhanced storage system resiliency when used in combination with other higher domain hardware or software such as RAID6 and other compatible scalable redundancy solutions. Two of the most salient Intel Platform Storage Extensions (PSE) include PCIe based Intel Non-Transparent Bridge (NTB) and Intel Asynchronous DRAM Refresh (ADR), Intel® QuickData technology natively built into the Intel Xeon processors.

2

Intel Corporation

Page 3: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Agenda

RAS Practical Definitions and Overview RAS Enabling Framework RAS Features for Storage Intel Platform Storage Extension (PSE) technologies

augmenting system RAS: Non-transparent Bridge (NTB) Asynchronous DRAM Refresh (ADR) Intel® QuickData Technology (chipset DMA acceleration)

RAS System Visibility Final Thoughts

3

Intel Corporation

Page 4: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

What is RAS? – Practical definitions RAS** = Reliability, Availability, Service-ability : 99.999% UPTIME (Five Nines Reliability) Reliability = Engineering/Architectural Decision

Error Detection and Self-Healing Minimizes outage Correct certain bits (therefore data) at all times

Availability = Keep running despite problems Reduce frequency and duration of outages Self-Diagnosing, Self-healing (auto-workaround problems) 24/7 operation – business never stops or slows down

Serviceability = Minimized outage/downtime Avoid redundant failures and provide accurate failure diagnostics Concurrent repair (heal) of higher failure items Easy repair/replace/upgrade. 4

**Some excerpts from https://lenovopress.com/lp0554-lenovo-x6-server-ras-features Intel Corporation

Page 5: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS: One Size Doesn’t Fit All

5

Cloud HPC

Need Fault Handling Needs fault handling

Check pointing is not used Check pointing heavily used

Apps can tolerate single system failure

Apps can not tolerate single system failure

Fault mgmt. for improving TCO

Fault mgmt. for improving Uptime

Ex: Automated techniques to identify failed DIMM

Ex:HW/FW based self-healing

Different applications may require different RAS features

Intel Corporation

Page 6: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS Needed in Mission Critical Systems

6

**Source: http://itic-corp.com/blog/2013/02/ibm-dell-fujitsu-stratus-get-highest-marks-in-itic-reliability-survey/

Mission Critical Systems Require Advanced RAS features

How to prevent /minimize downtime

Intel Corporation

Page 7: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS Needed in Mission Critical Systems

7

References: 1.Bianca Schroeder, Garth A. Gibson. (2007). Understanding failures in petascale computers. Journal of Physics: Conference Series, Vol.78, No. 1. (2007) 2.Urs Holzle and Luiz Andre Barroso. (2009). The Datacenter as a Computer, Google. 3.Best Practices for Continuous Application Availability, Gartner Data Center Conference 2008

Sources of service disruption

HW Fault Types Definition

Soft Errors High energy particle strike in caches and memory

Transient Errors Electrical noise induced faults in links

Stuck-at Faults Degradation of single memory cell, single lane on a link

Hard Failure Failure of entire device (DRAM, memory Buffer, CPU)

Intel Corporation

Page 8: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS Enabling Framework

8 Hardware Fault Handling Throughout the Entire System (HW+SW+FW)

Intel Corporation

Page 9: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS: Storage Systems

9

Application/Software responsible for 40% of Storage Server Failures

Sources of Service Disruption

Intel Corporation

Page 10: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

General RAS Features for Storage RAID – Storage Array Controllers (RAID –

mirroring, striping, parity – Data Protection, Redundancy and Failover)

ECC – Error Correcting Code implementation (Memory and In-flight data protection)

Memory Scrubbing (Memory protection – prevents, corrects multi bit failures over time)

DIMM sparing (Dynamic backup and Failover) Thermal Overload – prevention and mitigation Security Features 10

Intel Corporation

Page 11: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS for Storage with NTB, ADR + Intel®

QuickData Technology

NTB = Non-Transparent Bridge Dynamic memory data redundancy, failover, protection and security. Two or more

nodes having mirror copies of the data yet blind to each other.

ADR = Asynchronous DRAM Refresh Native sudden power failure or link loss protection including in-flight data. Data is

protected in self-refreshed DRAM and/or copied to non-volatile storage (NVRAM, SSD or HDD) at the conclusion of the cycle.

Intel® QuickData Technology Allows rapid DMA of mirrored or lockstepped memory locations and NTB/PCIe

endpoints across nodes without processor interrupts.

11

PLATFORM STORAGE EXTENSIONS (PSE)

Intel Corporation

Page 12: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Intel Server/Storage RAS Feature Summary

12

System Reliability (Fault Tolerance) System Serviceability/Diagnosability (Fault Management)

ECC, Cache corruption containment Error Reporting, BIST, MCA Bank Error Reporting etc.

DRAM/Memory CMD/ADDR parity, Rank sparing, mirroring DDR4 CRC, Memory corrected error reporting

Intel® UPI CRC, link data retry

PCI Express link retrain, CRC checking, ECRC

PCI Express “stop and scream”, card hotplug/hotswap

Virtual partitioning

Failed DIMM isolation, enhance SMM, error injection capability

CPU Memory

Intel® UPI PCIe

System

Color Key:

Intel Corporation

Page 13: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS Areas Covered

NTB + Intel® QuickData and ADR Enhancements in RAS

Snapshots

Data Locality

Cloning/Mirroring

Data Tiering

Data Availability

Quality of Service (QoS)

Resilience

Compression

Erasure Coding

Deduplication

13

Intel Corporation

Page 14: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS Enhancements for Storage with NTB PCIe NTB – Non-Transparent Bridge Operation – Typical Storage Use

Case: Direct PCIe connection to another node over internal backplane for

ideal redundancy and failover.

High bandwidth instant data interlink (via PCIe 3.0)

Maintain outstanding Quality of Service (QoS) for uninterrupted critical transaction with multi-node failover configuration.

Supports PCIe (3.0) Dual Cast**- Maximizes memory channel bandwidth with the capability to perform a single write transaction to two targets with PCIe dual-cast

Allows host input/output controller (IOC) to write data directly to PCIe NTB and local memory

14

** PCIe 3.0 dual-cast are available on select processors and platforms. See datasheet for details. Intel Corporation

Page 15: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS Enhancements Storage with NTB NTB – Non-Transparent Bridge Operation opens high-bandwidth window into a remote

duplicate storage node (redundancy and failover)

• “Window” is open to Remote Controller’s memory

NTB

CPU + Intel QuickData

CPU+ Intel QuickData

BAR

Translate

Translate BAR

Window 0101

Memory Map

• BARs define Base, Size, and Limit of aperture

• BARs claim transactions and forward to peer

• Translate Registers define where transactions land in peer’s memory

NTB Allows Fast Redundancy Over the PCIe NTB Port

Intel Corporation

Presenter
Presentation Notes
Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. After data gets “older” it is sent into permanent storage ADR helps preserve this DRAM cache in case of power failure The back to back failover allows for high availability of the system, key in enterprise environments IO write request CPU/CBDMA read data from DIMM of node A CPU/CBDMA calculate the RAID parity (XOR or GF multiply) & write back data to the DIMM Mirror the cache of node A to node B using CBDMA through NTB. Use Fencing mechanism to make sure data reached to the other side. Node A acknowledge to “client” that IO write is complete RAID Cache is kept in the DIMM to improve system efficiency. In case of sudden power loss ADR protects RAID Cache in the DIMMs Eventually Data and parity written to disk. The basic flow of a RAID write starts with a client making a write request to the “front end” devices (typically Ethernet or fiber channel) of one node designated the “primary node” or Node A for this example. Next the client write data is pushed from the front end devices (DMA) to the system memory of the primary Sandy Bridge ending in an interrupt to the Node A Sandy Bridge device. In response to the interrupt, the Sandy Bridge device executes the RAID calculation. The RAID calculation involves reading all of the RAID data and calculating “parity” information such as “P+Q”. This calculation may be done by the CPU or the CB3/RAID acceleration unit. The original data + other protection data such as the P+Q, is then mirrored from Node A to the redundant Sandy Bridge, Node B using the NTB path. This move is done by the CPU/software or by the CBDMA engine. How Node A knows when the in-flight mirrored writes are safely in Node B is the subject of ADR. Once mirroring is successfully completed, the Node A can acknowledge the client that the write was successful and complete. Figure details the RAID data flow for two nodes in a redundant failover configuration. RAID, as implemented by the nodes, protects against data loss due to drive failures. The redundant node concept is designed also to protect against data loss and provides high availability by protecting against single node failure.
Page 16: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

DIMM

MC

Cores

PCIe

NTB

DIMM

MC

Cores

PCIe

NTB

NIC NIC

Intel PCH Intel PCH

Intel QuickData

Intel QuickData

2

3

1

4

6

System A System B

Typical Enterprise/Storage Use Case Data Flow

2

3

CPU CPU

4

DATA

Parity

DATA Data

Parity

4

7 7

5

RAS Enhancements Storage with NTB NTB – Non-Transparent Bridge Operation – Typical Storage Use Case

Direct memory access (DMA) to remote mirrored node (redundancy and failover)

Bandwidth upto 16GBs*

*available on 5th Generation Intel® Xeon ® Processor based platforms+

Intel Corporation

Presenter
Presentation Notes
Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. After data gets “older” it is sent into permanent storage ADR helps preserve this DRAM cache in case of power failure The back to back failover allows for high availability of the system, key in enterprise environments IO write request CPU/CBDMA read data from DIMM of node A CPU/CBDMA calculate the RAID parity (XOR or GF multiply) & write back data to the DIMM Mirror the cache of node A to node B using CBDMA through NTB. Use Fencing mechanism to make sure data reached to the other side. Node A acknowledge to “client” that IO write is complete RAID Cache is kept in the DIMM to improve system efficiency. In case of sudden power loss ADR protects RAID Cache in the DIMMs Eventually Data and parity written to disk. The basic flow of a RAID write starts with a client making a write request to the “front end” devices (typically Ethernet or fiber channel) of one node designated the “primary node” or Node A for this example. Next the client write data is pushed from the front end devices (DMA) to the system memory of the primary Sandy Bridge ending in an interrupt to the Node A Sandy Bridge device. In response to the interrupt, the Sandy Bridge device executes the RAID calculation. The RAID calculation involves reading all of the RAID data and calculating “parity” information such as “P+Q”. This calculation may be done by the CPU or the CB3/RAID acceleration unit. The original data + other protection data such as the P+Q, is then mirrored from Node A to the redundant Sandy Bridge, Node B using the NTB path. This move is done by the CPU/software or by the CBDMA engine. How Node A knows when the in-flight mirrored writes are safely in Node B is the subject of ADR. Once mirroring is successfully completed, the Node A can acknowledge the client that the write was successful and complete. Figure details the RAID data flow for two nodes in a redundant failover configuration. RAID, as implemented by the nodes, protects against data loss due to drive failures. The redundant node concept is designed also to protect against data loss and provides high availability by protecting against single node failure.
Page 17: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Why Do I need NTB in a storage system?

17

Problem: Need for fast backup node with real-time data mirroring.

Solution Implementation Cost Off-site/remote backup Copy and duplicate all data to off-site or remote backup

site. High (Capital Equipment) High (Network Bandwidth)

Non-Transparent Bridge (NTB)

Data is backed up seamlessly via fast PCIe interface without need to for any CPU computing resources over NTB link.

Low (built-in) No network bandwidth required.

Intel Corporation

Page 18: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Asynchronous DRAM Refresh (ADR)

18

In the event of sudden power failure, ADR preserves key data in the battery backed DRAM or NVRAM.

Example ADR sequence looks like the following: Force flush transient data from protected ‘write buffers’ in the processor Flush out to the battery DIMMs Place DIMMs in battery backed self-refresh DIMM content can be preserved by a backup battery or copying DIMM content to

Flash or using a NVRAM.

Intel Xeon CPU

Cores MC Buff

DDR4

ADR

VDDR1 VDDR2

Un Core Buff

1.2V_BATT 1.2V

Intel Corporation

Page 19: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

ADR in Action

19

Enterprise Storage System Instance

DRAM

Write Buffers Write Buffers

Write Buffers Write Buffers

Sudden Power Failure

STOP

Pending Requests

In-flight Data NVM/Flash Device

C2F

Power Restored

Intel Corporation

Page 20: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Why do I need ADR in a storage system?

20

Solution Implementation Cost Uninterrupted Power Supply (UPS)

In case of power failure – continuous, always on power for a limited period of time.

High

Remote Duplication/Off-site backup

Use real-time failover mechanisms that duplicate transient data real-time and transfer to a remote site for redundancy.

High

Asynchronous DRAM Refresh (ADR)

In the event of sudden power failure, all transient and in-flight data in local buffers is stored in battery backed memory or to persistent flash device.

Low (built-in)

Problem: Sudden Loss of AC Power, all transient data is lost.

Intel Corporation

Page 21: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

High Speed Inter and Intra-node DMA Hardware Assist (Intel® QuickData Technology) Intel® QuickData Technology enables fast mirroring and lockstep of

multi nodal systems and memory by concurrently duplicating data between primary and backup/failover nodes using a dedicated chipset hardware DMA engine while the processor continues to run other critical tasks.

Intel® QuickData technology can work in conjunction with PCIe NTB (non-

transparent bridge) to rapidly duplicate or mirror critical memory space regions.

21

Intel Corporation

Page 22: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Intel® QuickData Technology Descriptors

Desc1 Desc2 Desc3-4 Desc5-6 Desc7 …

DMA FILL DMA-DIF DMA NULL

Submit Descriptors to a channel of Intel® QuickData Technology

status update on the final memory write/interrupt

22

Software creates “chain” of descriptors describing

• source • destination • operation

Intel® QuickData Technology Channel completes it with no CPU intervention.

Intel Corporation

Presenter
Presentation Notes
Each link in chain can have different operation Can submit multiple descriptors at once Interrupt when chain is complete
Page 23: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

PCIe Dual-Cast + NTB + Intel® QuickData Implementation Example

23

NTB Primary

Secondary (Failover Module)

Secondary (Failover Module)

NTB Primary

NTB Primary

Secondary (Failover Module)

NTB Primary Secondary (Failover Module)

MC

MC

MC

MC

DR

AM

DRAM

DRAM

DRAM

DRAM

Storage System Instance

Dual-Cast

Dual-Cast

Dual-Cast

Dual-Cast

NTB

NTB

NTB

NTB

NTB

NTB

NTB

NTB

DR

AM

D

RA

M

DR

AM

PCIe dual-casting allows simultaneous writes to DRAM and a secondary mirrored system over NTB

Intel Corporation

Page 24: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Why Do I Need Intel® QuickData Technology? Problem: Need to move large amounts of data concurrently to many nodes (and memory) at once without interrupting the processor cores.

24

Solution Implementation Cost Use CPU for data movement Processor uses its resources to perform numerous

calculations in order to move large data constantly between PCIe, I/O and Memory.

High (Very CPU resource intensive)

Intel® QuickData Technology

Intel QuickData technology moves data directly to/from memory and I/O devices offloading CPU resources completely. Also implements comprehensive RAS features like DIF and CRC error checking for data in-flight.

Low (CPU not used)

Intel Corporation

Presenter
Presentation Notes
Which workloads benefit from CBDMA?
Page 25: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

RAS System Visibility Processor

Machine Check Architecture (Error Logging Registers) IERR MCERR

Platform Controller Hub (PCH) Memory Error Corrections

Parity Scrubbing SDDC

Advanced Error Reporting (AER) – e.g. PCI Express ERR0 ERR1 ERR2

End-to-End CRC (ECRC)

Thermal Monitoring, Mitigation and Status

25

Intel Corporation

Page 26: Enterprise Storage RAS augmented by native Intel … Corporation . Emphasis on DRAM cache: it is kept in memory, because more recent data is more likely to be used again. \爀䄀昀琀攀爀

2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.

Final thoughts on Storage RAS RAS features both at the platform as well as the system level are

expanding rapidly with every generation. More granular visibility is being added into the systems in the next gen.

Storage specific RAS features are also getting more expansive and granular over time – providing even more detailed look at the system level fault detection, recovery and prevention.

With the growth of cloud storage and software defined storage – higher level RAS features for storage are becoming more predictive, preventative and self-healing in nature covering the entirety of the networked storage system.

Intel’s Platform Storage Extension (PSE) technologies have evolved significantly over time and remain the key drivers for providing comprehensive storage RAS features on Intel storage platforms.

26

Intel Corporation