25
NoCOUG Fall Conference 2005 High Availability & Disaster Recovery: A 360 o View Alok Pareek

NoCOUG Fall Conference 2005

  • Upload
    talor

  • View
    35

  • Download
    0

Embed Size (px)

DESCRIPTION

NoCOUG Fall Conference 2005. High Availability & Disaster Recovery: A 360 o View Alok Pareek. Agenda. Introduction High Availability - 2005 Industry Shift from MTTF to MTTR Understanding transaction failure models Evaluating HA technologies Real-World HA Examples About GoldenGate - PowerPoint PPT Presentation

Citation preview

Page 1: NoCOUG Fall Conference 2005

NoCOUG Fall Conference 2005

High Availability & Disaster Recovery: A 360o View Alok Pareek

Page 2: NoCOUG Fall Conference 2005

Agenda

IntroductionHigh Availability - 2005Industry Shift from MTTF to MTTRUnderstanding transaction failure modelsEvaluating HA technologies Real-World HA ExamplesAbout GoldenGateQuestions & Answers

Page 3: NoCOUG Fall Conference 2005

Speaker Introduction/Background

GoldenGate Software ArchitectTransactional Data Management and High Availability Solutions

10 years in Oracle kernel developmentData Storage TechnologiesRecovery and High Availability GroupHigh Speed Data Movement

Key FeaturesCross platform/Heterogeneous transportable tablespaces (10g)Multi threaded redo generation (9.2)Implementation of multiple block size caches (9.0)(TSPITR) Tablespace point in time recovery (8.0)Owner of LGWR process (redo generation) algorithms (8i

10.2)

Page 4: NoCOUG Fall Conference 2005

High Availability, Disaster Recovery

DefinitionRatio of system uptime to sum of uptime and downtime

Availability = MTTF/(MTTF+MTTR)

ChallengesAddressing Performance vs. Reliability in computer systemsHardware Faults, Software Bugs, Human errors are realities in any

complex system deploymentEnterprise applications need to function 24x7Disasters are no longer a distant threatInadequate planning to handle outages

Shift in industry (and academic research) focusFrom fault tolerance to reducing MTTR

Page 5: NoCOUG Fall Conference 2005

HA/DR – Systematic View

Unplanned outage

System Failure

Data Failure

System Changes

Data Changes

Active

Database

Planned outage

•Physical Media•Logical corruption

•Node death•Power failure

•Upgrades•Migrations

•Maintenance

1

2 3

Page 6: NoCOUG Fall Conference 2005

1) High Availability Concerns – No Outage

Active

Database

1

• Performance

• DSS vs. OLTP conflicting requirments

• Mixed workload

• Data validation

• Data Transformation

Page 7: NoCOUG Fall Conference 2005

2) Availability Concerns – Unplanned

Unplanned outage

System failure

Data failure

Database

•Database Restore/Recovery

•RAID

•Shared Disk Clusters

•Standby database

Common Approaches

Page 8: NoCOUG Fall Conference 2005

Understanding Database Failures

Failure pointsStatementProcessInstanceDatabaseSite

Failure typesPhysical (Media, corruption, inconsistency amongst redundant

copies)Logical (Incorrect DML, out-of-synch, accidental table drop)

Failure HandlingAutomaticManual

Page 9: NoCOUG Fall Conference 2005

3) Availability Concerns – Planned Outage

System Changes

Data Changes

Database

Planned Outage

•Selected windows of downtime

•Phased approach to maintenance

Page 10: NoCOUG Fall Conference 2005

Key HA Evaluation Criteria

Availability Is the failover/DR solution available for real use?

MTTR (RTO) In the event of a failure, how soon can the data be recovered?

Performance Speed and support for high transaction volumes

Data Loss (RPO) What is the impact of an unplanned outage in terms of lost data?

Manageability Configuration, Install, Monitoring

Impact on Deployed Systems

How intrusive? What is the impact on data itself?

Cost Licensing, Maintenance, Deployment

Page 11: NoCOUG Fall Conference 2005

Differentiating HA Technologies

Conventional Backup/Recovery

RAID multiple hard disks behaving as a single large fast drive (mirrors/stripes/duplexing/parity)

Snapshots

Block Level Database Replication

Change Level Database Replication

Remote Mirroring Solutions

Transactional Data Management

Roll Forward / File Protection High Availability and Disaster Recovery

Page 12: NoCOUG Fall Conference 2005

HA Technologies & Tradeoffs

Block based database replication Standby kept in constant recovery (inactive) mode

Useful for strict disaster recovery only, not HACannot be used for reporting in recovery modeNo write access for distributed load balancingApplication response times suffer after failoverCannot address availability across heterogeneous systems

Change based database replication Trigger or log based

Not optimized for real time performanceIntrusive, ComplexCannot address availability across heterogeneous systems

Page 13: NoCOUG Fall Conference 2005

HA Technologies & Tradeoffs

Remote mirroring solutions Volume managers maintain mirrors of local writes on a set of remote volumes

Useful for file protectionPhysical distance to remote volumes is a critical limitationNo protection from logical corruption, or storage stack corruption

Message based logical writes sent by primary host over IP to remote hosts (synchronously/asynchronously)

Write ordering must be maintained by primary hostRemote volumes are standby-only, applications cannot access themNo protection from logical corruption

Hardware basedStorage arrays propagate IOs to storage arrays at a secondary siteSecondary arrays are inaccessible during replicationNo protection from logical corruptionOnly useful for block availability during DR

Page 14: NoCOUG Fall Conference 2005

HA Technologies & Tradeoffs

Transactional Data Management Addresses low-latency in HA hybrid computing environments (built on 1 Safe)Management of transactional streams -- captures, transforms, routes, delivers and verifies data transactions in real time

Real time, heterogeneous, data integrity, low impactUse cases for HA, DR, data integration, distributed computingNot for file-level replication

Log-based data

capture

(can be selective)

RouteTransform (if

needed)

Source Target

Apply

Sub-second latency

Page 15: NoCOUG Fall Conference 2005

HA/DR – Solution Examples

Unplanned outage

System Failure

Data Failure

System Changes

Data Changes

Active

Database

Planned outage

•Physical Media•Logical corruption

•Node death•Power failure

•Upgrades•Migrations

•Maintenance

1

2 3

Page 16: NoCOUG Fall Conference 2005

Real-World HA Configuration (Active)

Primary “Master”

Secondary “Master”

• Uni/Bi-directional configuration – if one system fails, the other is immediately available and ready to take over.

• For…• Highest Availability• Maximized ROI on hardware (transaction balancing)•Data Centers with Active/Passive

• Example areas:• 24x7 (ATMs, Online Banking)• Online Retail•Real-time Analytics, Warehousing

• Uni/Bi-directional configuration – if one system fails, the other is immediately available and ready to take over.

• For…• Highest Availability• Maximized ROI on hardware (transaction balancing)•Data Centers with Active/Passive

• Example areas:• 24x7 (ATMs, Online Banking)• Online Retail•Real-time Analytics, Warehousing

Active

Live/Active

Page 17: NoCOUG Fall Conference 2005

Real-World HA Configuration (Transaction Off-Loading)

Processing Commit Transactions

Lookup Query Handling

• Improve availability and performance of transaction processing by offloading query load to lower-cost DBs/platforms

• For…• Horizontal scalability• Improved performance

• Example areas:• Online Reservations• Online Lookups

• Improve availability and performance of transaction processing by offloading query load to lower-cost DBs/platforms

• For…• Horizontal scalability• Improved performance

• Example areas:• Online Reservations• Online Lookups

Active

Live/Active

Page 18: NoCOUG Fall Conference 2005

Real-World DR Configuration (Failover)

Unplanned outage

System failure

Data failure

Database

Active

Failover Database

• An HA implementation that captures and applies data to a failover system in real time.

• For…•Fast failover (No restore)•Do root-cause analysis later!•Surgical Repair (Dynamic,

Selective undo)

•Example areas:•24x7/mission-critical

applications•Strict SLA requirements

• An HA implementation that captures and applies data to a failover system in real time.

• For…•Fast failover (No restore)•Do root-cause analysis later!•Surgical Repair (Dynamic,

Selective undo)

•Example areas:•24x7/mission-critical

applications•Strict SLA requirements

Page 19: NoCOUG Fall Conference 2005

Real-World Configuration (Switchover)

Current Database

Planned Outage

System Changes

Data Changes

• Zero-Downtime Migrations• Rolling Upgrades• Zero-Downtime Maintenance

• For…•24x7 availability•Reduced windows for system

maintenance

• Example areas:•Can’t afford downtime to do in-

place upgrade

• Zero-Downtime Migrations• Rolling Upgrades• Zero-Downtime Maintenance

• For…•24x7 availability•Reduced windows for system

maintenance

• Example areas:•Can’t afford downtime to do in-

place upgrade “Switchover” Database

Page 20: NoCOUG Fall Conference 2005

About GoldenGate Software

Established, Loyal Customer Base

Leading Industry Solutions

250 customers... 1500+ solutions implemented… in 35 countries

18,000 Node ATM Network with 24/7

Availability

Saving $ millions with real-time DW

and zero downtime migrations.

Database tiering handles average of

300,000 updates/hour, peaks at 800,000/hour

Achieving paperless enterprise for this

visionary healthcareprovider

GoldenGate Software is a privately held software company that offers Transactional Data

Management solutions.

Page 21: NoCOUG Fall Conference 2005

GoldenGate & Transactional Data Management

Provides guaranteed capture, routing, transformation, delivery, and verification of data transactions across heterogeneous environments in real time.

TDM must be:

Real timeMoves with sub-second latency

Heterogeneous Moves transactions across different databases and platforms

TransactionalMaintains transaction integrity

GoldenGate differentiates on:

PerformanceHandles thousands of transactions per second with very low impact on IT systems

Extensibility & FlexibilityOpen architecture to meet demanding customer needs and data environments

ReliabilitySupports continuous operations and availability

Page 22: NoCOUG Fall Conference 2005

How GoldenGate Works

Trail files: Stages and queues data for routing.

Delivery: Applies transactional data with guaranteed integrity.

Route: Data is compressed, encrypted for routing to targets.

Capture: Committed changes are captured as they occur by reading the transaction logs.

GoldenGate Veridata™: Reports on data discrepancies between in-use DBs

Page 23: NoCOUG Fall Conference 2005

TDM & HA Evaluation Criteria

AvailabilityStandby available for reporting, logical repair, load balancing, failover with warm cache

MTTRNo restore requiredProportional to runtime cost of transactions in last log write (~ seconds)

PerformanceNear zero time latencyShip only committed transactions

Data Protection / Loss

Redo validation using SQL ApplyNo Loss (read access to last IO in current log)

ManageabilityDirector GUI, CLI, STATS

ImpactLow impact on deployed systemsMetadata outside the database

Page 24: NoCOUG Fall Conference 2005

Questions & Answers

Contact Information:Alok Pareek

www.goldengate.com(415) 777-0200

[email protected]

301 Howard Street, Suite 2100, San Francisco CA 94105

Page 25: NoCOUG Fall Conference 2005

Lag points address Logical failures

One-to-One One-to-Many

SourceSource

Target

Target

(1 day lag)

Target

(1 hour lag)

Target

(current)