Mission Critical Computing on x86 Rob Kypriotakis Datacenter Solutions Architect Intel Corporation

Mission Critical Computing on x86 · 2018. 10. 29. · High Application Availability* Minimal Application State or Data Cluster Deployment with Mission Critical Servers Redundant



Mission Critical Computing on x86

Rob Kypriotakis

Datacenter Solutions Architect

Intel Corporation


Mission Critical Workloads

• 1,000s – 1,000,000+ online users

• Support large transactional databases

• 24 x 7 operation

• Enable all users

• Complex queries

• Multiple data sources

• Large data warehouse

• Large scalable enterprise databases

• No single point of failure

• Extremely fast operational speed

Transaction Processing | Database | Business Intelligence and Analytics

An Hour Of Downtime Can Mean Millions In Lost Revenue


The Evolving Mission Critical Data Center

Today | Mid Term | Long Term

Infrastructure Silos Trapped in Legacy IT

Cloud Infrastructure for Mainstream Enterprise

Dedicated Infrastructure for

Mission Critical Legacy

RISC

Legacy Mainframe

x86

Robust Cloud for All Workloads

Plans to adopt Cloud Infrastructure in the long term

Plans to scale resources for the deluge of data


Intel® Platforms Deliver Advanced Reliability

[Charts: server availability survey results of 99.9976%, 99.9973%, 99.9971%, and 99.9962% (all >4 nines), with callouts "11X improve" and "2.4X improve"]

Sources: ITIC Jul2009 and Nov2011 Global Server Hardware & Server OS Reliability Surveys

IT Managers Report comparable results between IBM Power & x86 on unplanned downtime
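To put the availability percentages above in more tangible terms, the following minimal Python sketch converts an availability figure into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; the survey itself may define and measure downtime differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the expected unavailable minutes per year for an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Percentages taken from the slide, plus four and five nines for reference.
    for pct in (99.99, 99.9962, 99.9971, 99.9973, 99.9976, 99.999):
        print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} minutes/year")
```

Under these assumptions, 99.9976% availability corresponds to roughly 12.6 minutes of downtime per year, and five nines (99.999%) to roughly 5.3 minutes.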


Why the large improvement in x86 server reliability results?

• Significant new OS releases with mission critical capabilities, including reliability

• Intel Xeon platform refreshes with advanced RAS

• Greater datacenter discipline for x86 servers (modeled after RISC)

• More x86-based mission critical deployments with higher uptime requirements

• Better IT tracking of x86 downtime

Causes of unplanned downtime: Operator error 40%; Application failures 40%; Hardware, OS & power 20%

Source: Best Practices for Continuous Application Availability, Gartner Data Center Conference 2008

Improvements due to a combination of New Platforms And Greater IT Focus


Tectonic Shift in Mission-Critical

Today Future

MISSION-CRITICAL

EPIC/RISC

x86

MAINFRAME

MISSION-CRITICAL

x86 EPIC/RISC

MAINFRAME

INDUSTRY STANDARD

MISSION-CRITICAL

ALLOWING GREATER FLEXIBILITY

STILL EXCEEDING SLAs


Industry is innovating on Intel Architecture

Innovation first, and sometimes only on Intel Architecture… Follow the Innovation

Software Partners

HP Systems A few as example


Intel Xeon Processor E7 Family

Architected for scalable Windows and Linux performance with advanced reliability

Family of Mission Critical Processors

Intel Itanium Processor 9300 Series

Architected for Mission Critical UNIX with mainframe resiliency and scalability

Hardened OS | OEM System Capability | Application Availability | OEM Service & Support


Common Platform Strategy

Common Ingredients: Chipset | Interconnects | Buffers | Memory

Xeon Volume Economics to Itanium

Itanium RAS Capabilities to Xeon

OpenVMS

NonStop


Machine Check Architecture (MCA) Recovery

Supports recovery from otherwise fatal system errors*

Allows the software layers (OS, VMM, DBMS) to cooperate with the silicon layer to recover from uncorrectable data errors

Previously seen only in RISC, mainframe, and Itanium®-based systems

Silicon

Systems

Software

Integrated RAS Capabilities For Highly Available Deployments

*Errors detected using Patrol Scrub or Explicit Write-back from cache


Intel® Xeon® E7 Family RAS Philosophy

Repair Failing Data Connections

Recover From Uncorrectable Errors

Minimize Planned Downtime

Predict Failures

Monitor

Heal

Detect & Correct Errors

Contain Uncorrected Errors

Continuous Self Monitoring and Self Healing


Machine Check Architecture Recovery: How It Works

Normal Status With Error Prevention

Error Detected*

HW Correctable Errors: Error Corrected

HW Un-correctable Errors: Error Contained

Bad memory location flagged so data will not be used by OS or applications

Error information passed to SW layer

System Recovery with SW

System works in conjunction with OS, VMM, or DBMS to recover or restart processes and continue normal operation

MCA Recovery

*Errors detected using Patrol Scrub or Explicit Write-back from cache

Allows Recovery From Otherwise Fatal System Errors
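The recovery flow on this slide can be sketched as a small Python model. This is only an illustration of the sequence the slide describes (correct, or contain, flag, hand off to software); the class names, the `recover` callback, and the toy policy in `restart_owner` are hypothetical and do not represent any real OS or firmware interface.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set, Tuple


class Severity(Enum):
    CORRECTABLE = "correctable"
    UNCORRECTABLE = "uncorrectable"


@dataclass
class ErrorEvent:
    page: int           # hypothetical identifier of the affected memory location
    severity: Severity
    owner: str          # hypothetical name of the process, VM, or buffer using it


@dataclass
class Platform:
    flagged_pages: Set[int] = field(default_factory=set)
    log: List[Tuple[str, object]] = field(default_factory=list)

    def handle(self, event: ErrorEvent, recover: Callable[[ErrorEvent], bool]) -> str:
        # Correctable errors are fixed in hardware; software is not involved.
        if event.severity is Severity.CORRECTABLE:
            self.log.append(("corrected", event.page))
            return "corrected"
        # Contain: flag the bad location so its data is not used again.
        self.flagged_pages.add(event.page)
        self.log.append(("contained", event.page))
        # Pass the error information to the software layer for recovery.
        if recover(event):
            self.log.append(("recovered", event.owner))
            return "recovered"
        self.log.append(("fatal", event.owner))
        return "fatal"


def restart_owner(event: ErrorEvent) -> bool:
    """Toy software policy (assumption): anything except the kernel can be restarted."""
    return event.owner != "kernel"


if __name__ == "__main__":
    platform = Platform()
    print(platform.handle(ErrorEvent(7, Severity.UNCORRECTABLE, "app"), restart_owner))
    print(platform.log)
```

The point of the sketch is the division of labor: hardware detects and contains, while the software layer decides whether the affected process, VM, or buffer can be restarted or reloaded.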


Intel® Xeon® E7 Family Protection With Advanced Memory RAS

Repair Failing Data Connections

Recover From Uncorrectable Errors

Monitor

Heal

Detect & Correct Errors

Contain Uncorrected Errors

• ECC (cache, memory)

• Memory Address Parity Protection

• Memory Demand & Patrol Scrub

• Corrupt Data Containment Mode

• Memory Thermal Throttling

• Enhanced DRAM Double Device Data Correction (DDDC+1)

• Enhanced DRAM Single Device Data Correction (SDDC+1)

• Fine Grained Memory Mirroring

• Memory sparing & migration

• Intel® SMI Lane failover

• Intel® SMI Clock Fail Over

• Intel® SMI Packet Retry

• Machine Check Architecture (MCA) Recovery

• Failed DIMM Identification

• Memory Hot Add

• Corrected Machine Check Interrupt (CMCI) for Preventive Failure Analysis

Xeon E7 Extensive Memory RAS

Minimize Planned Downtime

Predict Failures


Ensuring High Availability Eliminating Single Sources for Failure

Socket Redundancy & Failover

• Dynamic OS Assisted Processor Socket Migration*

Memory Redundancy & Failover

• Fine-Grained Memory Mirroring

• Intel® SMI Lane Failover

• Intel® SMI Clock Fail Over

• Intel® SMI Packet Retry

• Memory DIMM and Rank Sparing

• Dynamic Memory Migration

• Enhanced DRAM Double Device Data Correction (DDDC+1)

• Enhanced DRAM Single Device Data Correction (SDDC+1)

Intel® QPI Redundancy & Failover

• QPI Self-Healing

• QPI Clock Fail Over

• Intel QPI Packet Retry

[Diagram: four Xeon® E7 processors, four memory blocks, and two IOHs with PCI Express* 2.0, connected by Intel® QPI]

Intel® QPI = Intel® QuickPath Interconnect

Intel® SMI = Intel® Scalable Memory Interconnect

Built-In Redundancy, Failover, & Self-Healing


Intel® QuickPath Interconnect (QPI) Self-Healing

Intel® QPI Self-Healing maintains system availability in the event of persistent interconnect errors

On detecting persistent errors the QPI port automatically reduces to half the current width and keeps operating at a reduced level

The system administrator sets the threshold at which to go into self-healing mode

[Diagram: sockets, IOHs, and node controllers connected by Intel® QPI, with DDR3 memory and PCI Express* 2.0 Technology; a QPI port is shown at full width and at half width]
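The self-healing behavior described on this slide can be modeled with a short Python sketch: a port counts persistent errors and halves its width once an administrator-set threshold is reached. The 20-lane full width, the 5-lane floor, and all names below are assumptions chosen for illustration; they are not taken from the slide or from Intel documentation.

```python
from dataclasses import dataclass


@dataclass
class QpiPort:
    width: int = 20            # lanes at full width (assumed value)
    error_threshold: int = 3   # administrator-set threshold (assumed value)
    min_width: int = 5         # narrowest width the model allows (assumed value)
    persistent_errors: int = 0

    def report_persistent_error(self) -> int:
        """Record one persistent error and return the resulting port width."""
        self.persistent_errors += 1
        can_halve = self.width // 2 >= self.min_width
        if self.persistent_errors >= self.error_threshold and can_halve:
            # Self-healing: drop to half the current width and keep operating.
            self.width //= 2
            self.persistent_errors = 0
        return self.width


if __name__ == "__main__":
    port = QpiPort()
    for _ in range(3):
        port.report_persistent_error()
    print(port.width)
```

The model keeps the link running at reduced bandwidth instead of taking it down, which is the trade-off the slide highlights.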


Intel® Scalable Memory Interconnect (SMI) Lane Failover

Intel® SMI allows the memory interconnect to automatically fail over and recover from partial link failures, maintaining availability and performance

Intel® SMI provides an additional interconnect lane in each direction (memory write & read)

If a single lane failure is detected, the failed lane is automatically mapped out by the CPU and the spare lane is enabled

[Diagram, memory read example: a processor and memory buffer linked with a spare lane; after a lane failure is detected, the failed lane is disabled and the spare lane is enabled]
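Lane failover as described above can be sketched in Python as a link with one spare lane per direction. The lane count used in the example and the class and method names are hypothetical; the sketch only shows the mapping-out idea, not the actual SMI protocol.

```python
from typing import List


class SmiLink:
    """One direction of a memory link with a single spare lane (illustrative model)."""

    def __init__(self, data_lanes: int) -> None:
        self.active: List[int] = list(range(data_lanes))  # lane ids carrying traffic
        self.spare: int = data_lanes                       # id of the spare lane
        self.spare_used: bool = False
        self.disabled: List[int] = []

    def lane_failed(self, lane: int) -> bool:
        """Map out a failed lane; return True if the spare lane took over."""
        if lane not in self.active:
            raise ValueError(f"lane {lane} is not active")
        if self.spare_used:
            # Only one spare exists, so a second failure cannot be mapped out.
            return False
        position = self.active.index(lane)
        self.active[position] = self.spare  # spare lane enabled in its place
        self.disabled.append(lane)          # failed lane disabled
        self.spare_used = True
        return True


if __name__ == "__main__":
    link = SmiLink(data_lanes=4)  # 4 lanes is an arbitrary example size
    link.lane_failed(2)
    print(link.active, link.disabled)
```

Because the spare replaces the failed lane one for one, the link keeps its full width after a single lane failure, which is why the slide claims performance is maintained.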


Static / Physical Partitioning Xeon® E7 Reliability Features

Allows a system to be divided into multiple machines, each capable of running its own OS and applications, by enabling/disabling links

Max number of partitions is determined by number of IOHs in the platform

Isolation among partitions is guaranteed by hardware

Repartitioning requires shutting down the system, reconfiguring, and a system-wide reboot.

Requires no OS support, little or no FW support, and some BMC support

[Diagrams: two views of a four-CPU platform (CPU1 to CPU4) with IOHs, ICHs, I/O devices, a BMC, and a Partition Manager]
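The partitioning constraints listed on this slide can be expressed as a small validation routine. The Python sketch below is a hypothetical model: it assumes every partition needs at least one CPU and one IOH of its own (consistent with the IOH count limiting the number of partitions) and that no resource may appear in two partitions. Resource names and the data layout are invented for illustration.

```python
from typing import Dict, List, Set


def validate_partitions(
    partitions: List[Dict[str, Set[str]]], cpus: Set[str], iohs: Set[str]
) -> List[str]:
    """Return a list of problems found in a static partition plan (empty if valid)."""
    problems: List[str] = []
    if len(partitions) > len(iohs):
        problems.append("more partitions than IOHs")
    seen: Set[str] = set()
    for index, part in enumerate(partitions):
        if not part["cpus"]:
            problems.append(f"partition {index} has no CPU")
        if not part["iohs"]:
            problems.append(f"partition {index} has no IOH")
        unknown = (part["cpus"] - cpus) | (part["iohs"] - iohs)
        if unknown:
            problems.append(f"partition {index} uses unknown resources: {sorted(unknown)}")
        members = part["cpus"] | part["iohs"]
        shared = members & seen
        if shared:
            # Hardware isolation means a CPU or IOH belongs to exactly one partition.
            problems.append(f"partition {index} shares {sorted(shared)}")
        seen |= members
    return problems


if __name__ == "__main__":
    plan = [
        {"cpus": {"CPU1", "CPU2"}, "iohs": {"IOH1"}},
        {"cpus": {"CPU3", "CPU4"}, "iohs": {"IOH2"}},
    ]
    print(validate_partitions(plan, {"CPU1", "CPU2", "CPU3", "CPU4"}, {"IOH1", "IOH2"}))
```

A check like this would run before the shutdown and reboot that repartitioning requires, since a bad plan cannot be corrected while the system is up.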


Machine Check Architecture Recovery: Extensible Error Recovery Architecture

Growing Software Industry Support for MCA Recovery

Operating system

• Uncorrectable data errors isolated and corrected by OS¹

• Affected application may require restart

• System remains up and running

Windows Server* 2008 R2

RHEL* 6

U8+

SLES11* SP1

Virtual machine monitor

• Uncorrectable data error isolated to a single VM / guest OS¹

• Affected VM may require restart

• System and all other VMs remain up and running

vSphere* 5.0

RHEL* 6 - KVM

DBMS

• Uncorrectable data error isolated within the DBMS buffer pool¹

• Affected (non-critical) buffer is transparently reloaded from disk

• System and DBMS remain up and running

HANA In-Memory DBMS

¹ Errors detected using Patrol Scrub or Explicit Write-back from cache

* Other names and brands may be claimed as the property of others


Database Related Innovation on Intel® Xeon® Processor

Oracle* Exa-Series

• Portfolio of OLTP / OLAP Appliances

• Offered on Intel® Xeon® Processors

"Across the stack, increases of 50 percent in both core count and cache drive up performance on the Intel® Xeon® processor 5600 series." – Marie-Anne Neimat, VP Embedded Databases, Oracle*

Exa-Series Portfolio

Exadata Exalogic

Exalytics

SAP HANA

• Appliance offered on Intel® Xeon® processor 7500 from key OEMs

• Instant response times to real-time events

“Intel and SAP, through joint engineering, have optimized SAP HANA…enabling greater business agility and innovative usage models that let customers respond to changing conditions in real time.” - Press Announcement, December 2010

[Chart: HANA, time in seconds (lower is better)]

Source: SAP HANA Benchmark Study

* Other names and brands may be claimed as the property of others. Copyright © 2012, Intel Corporation.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: see Notes section. For more information go to http://www.intel.com/performance

System

Performance1

IBM DB2 pureScale

• DB2 Mainframe clustering capability made available by IBM on Xeon based platforms!

“Two complementary advances make it possible to get high volume, low cost computing for online transaction processing: X5 servers based on the Intel® Xeon® Processor 7500 series and IBM pureScale”

Sal Vella, VP DB2 Development, IBM


Where Does Platform RAS Matter?

Individual System Resiliency: Lower | Higher

Basic Server RAS

Advanced Server RAS

* Other names and brands may be claimed as the property of others

High Application Availability

Minimal Application State or Data

Cluster Deployment with Mission Critical Servers

Redundant Deployment Model

Standard or Mission Critical Servers

Distributed Workload | Centralized Workload

* 99.999% System Availability

State is Critical

Data Intensive

Time Critical


Business Criticality and Application Type Determine RAS Requirements

Intel Xeon E7 Intel Xeon E5


Advanced Reliability Starts with the Right Silicon

Intel® Xeon® Processor E7 Family Reliability, Availability, and Serviceability (RAS) Features

Repair Failing Data Connections

• Memory Thermal Throttling

• Enhanced DRAM Device Data Correction

• Fine-Grained Memory Mirroring

• Memory Sparing & Migration

Detect & Correct Errors

• ECC (cache, memory)

• Memory Address Parity Protection

• Memory Demand & Patrol Scrub

• Corrected Machine Check Interrupt (CMCI)

Enhance Responsiveness

• Intel® QuickPath Interconnect

• QPI Packet Retry

• QPI Protocol Protection via CRC

Increase Availability

• Machine Check Architecture (MCA) recovery

• Physical CPU Hot Add/Replace

• OS CPU On-lining

Minimize Planned Downtime

• Physical IOH Hot Add

• OS IOH On-lining

• PCI-E Hot Plug

INTEL CONFIDENTIAL

Over 20 major features added in the last two processor generations …and committed to continue to focus on RAS


• Advanced error detection, correction, & containment

• Intel Machine Check Architecture Recovery (MCA Recovery)

• Partial memory mirroring

• Advanced error logging and management

• Control groups

• Advanced error reporting

• PCI hot plug

• Multipath I/O

• Hardware-based checksumming

• KVM hypervisor

Comprehensive Mission-Critical RAS

Reliability Availability Serviceability


Mission Critical Your Way

Expanding your mission critical systems with Intel, Red Hat


Future Direction


Strong Server Platform Roadmap: Sustained Server Microprocessor Leadership

Intel Xeon roadmap (Tick / Tock)

• 45nm: Penryn (Tick), Nehalem (Tock)

• 32nm: Westmere (Tick), Sandy Bridge (Tock)

• 22nm: Ivy Bridge (Tick), Haswell (Tock)

Intel Itanium roadmap: Tukwila, Poulson, Kittson

Other chart labels: Tock, 65nm, 32nm, 22nm, Future, 15nm


System Failures


Latest Intel Manufacturing Advances

65nm (2005), 45nm (2007), 32nm (2009), 22nm (2011*), 15nm (2013*), 11nm (2015*), 8nm (2017*), 2019+

MANUFACTURING | DEVELOPMENT | RESEARCH

*projected

Intel’s process roadmap visibility is out at least 10 years


Poulson: The Most Significant Intel Itanium® Processor

Key Highlights

• 2x the cores, 2x instruction throughput

• >2x the performance, 2x memory density

• New RAS and performance features

• And still completely compatible with existing software; no recompilation required

Poulson continues Itanium advances and is on track for later this year


Database Solution Cost Comparison: Intel® Xeon® E7 family vs. Power* with Oracle* database

Systems compared: 8-socket DL980 vs. 8-socket IBM Power 770*

Hardware cost: $99,269 (DL980) | $304,530 (Power 770*)

Oracle EE CPU cost ($47,500 per license): 80C x 0.5 = 40 core licenses, $1,900,000 (DL980) | 64C x 1.0 = 64 core licenses, $3,040,000 (Power 770*)

Total Hardware/Software Acquisition Cost: $2.00M (DL980) | $3.34M (Power 770*)

Source: Hardware Cost- HP Alinean TCO tool (http://h71028.www7.hp.com/enterprise/us/en/migrate-to-hp/tco-challenge.html ); Software- Oracle.com

Intel solution provides ~40% lower Total Cost of Acquisition
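The arithmetic behind this comparison can be reproduced with a few lines of Python. The hardware prices, core counts, core factors, and license price are the figures quoted on the slide; the function and variable names are made up for this sketch, and the calculation covers acquisition cost only, as the slide does.

```python
LICENSE_PRICE = 47_500  # Oracle EE price per core license, as quoted on the slide


def acquisition_cost(hardware_cost: float, cores: int, core_factor: float) -> float:
    """Hardware cost plus database licenses, where licenses = cores x core factor."""
    licenses = cores * core_factor
    return hardware_cost + licenses * LICENSE_PRICE


# Figures from the slide: 8-socket DL980 (Xeon E7) vs. 8-socket IBM Power 770.
xeon_total = acquisition_cost(99_269, cores=80, core_factor=0.5)
power_total = acquisition_cost(304_530, cores=64, core_factor=1.0)
savings = 1 - xeon_total / power_total

if __name__ == "__main__":
    print(f"Xeon E7 total:   ${xeon_total:,.0f}")
    print(f"Power 770 total: ${power_total:,.0f}")
    print(f"Savings:         {savings:.1%}")
```

The totals come to $1,999,269 and $3,344,530, a difference of about 40%, which matches the slide. Most of the gap comes from the licensing core factor rather than the hardware price.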


Legal Disclaimer

• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

• Intel may make changes to specifications and product descriptions at any time, without notice.

• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

• Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance

• Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.

• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

• Intel Turbo Boost Technology requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost

• Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.

• 64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

• Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Copyright © 2012 Intel Corporation. All rights reserved.