54
Introduction to the Service Availability Forum .

Introduction to the Service Availability Forum.. Contents Introduction Quick AIS Specification overview AIS Dependability services AIS Communication

Embed Size (px)

Citation preview

Page 1: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Introduction to the Service Availability Forum

.

Page 2: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Contents

Introduction Quick AIS Specification overview AIS Dependability services AIS Communication services Programming model DEMO

Page 3: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Design of dependable services

Two parts of functionality Business logic

Implements the service

Common functionality Communication, management, fault tolerance…

This is where SA Forum AIS comes into picture

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.5

Page 4: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Construction of a service

Define the functionality Plan the architecture

Components, roles Integrate fault management

Service state monitoring Management

Plan recovery Communication between components

Copyright© 2006 Service Availability™ Forum, Inc6

Page 5: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Construction of a service

Define the functionality Plan the architecture (AMF)

Components, roles Integrate fault management (AMF)

Service state monitoring Management

Plan recovery (CKPT) Communication between components (MSG, EVT)

Copyright© 2006 Service Availability™ Forum, Inc7

Page 6: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2007 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.8

Proprietary Platform

Proprietary Hardware

Proprietary Operating System

Proprietary Middleware

Applications

Standard Hardware

COTS-Based Platform

Carrier-GradeOperating System

Integrated COTSMiddleware

Applications

COTS adoption is accelerating transition from vertical to horizontal industry model

Vertically integratedplatform by individual vendors

Vendor D

Vendor C

Vendor B

Vendor A

The Transition

Platform enabled byThe COTS eco-system

Proprietary Platform

Proprietary Hardware

Proprietary Operating System

Proprietary Middleware

Applications

Vendor A Vendor D

Page 7: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2007 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.9

SA Forum Standard Interface Specifications

Communications Fabric

Hardware Platform

Carrier Grade Operating System

SM

I (S

A F

oru

m)

HPI (SA Forum)

Service AvailabilityMiddleware

AIS (SA Forum)

Other Middlewareand ApplicationServices

Highly Available Applications

SystemsManagementInterfaces(SMI)

HardwarePlatformInterface(HPI)

ApplicationInterfaceSpecification (AIS)

Page 9: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

THE AIS SPECIFICATIONUsing SA Forum’s specifications to build highly available services

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.11

Page 10: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc12

Application Interface Specification The Application Interface Specification (AIS) is a set of

open standard interface specifications

The AIS specifies the Application Programming Interface (API) for HA middleware

services service entities and their behavior (life cycle, administrative

operations, functionality)

Example for specification/implementation HW interface: SATA HW implementation: manufactured mainboard HW user: the HDD (Samsung, Western Digital, Seagate, Hitachi…)

Page 11: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc13

Application Interface Specification

Managed Hardware Platform

Carrier Grade Operating System

HPI Middleware AIS Middleware

Other Middleware and Application Services

HA ApplicationsManagement

SA ForumAIS APIs,

AMFCLM IMM CKPTLOGNTF LCKEVTMSG

Boards, fans, power, etc.

Web server, Music broadcast

Page 12: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Application Interface Specification

The AIS is divided into the following parts or areas

Administration Software Management Framework Information Model Management Service Cluster Membership Service Notification Service

Dependability Availability Management Framework Checkpoint Service

Communication Event Service Message Service

Miscellaneous Lock Service Log Service

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.14

Page 13: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc 16

MessageService

Process A

Node U

MSG library

MessageService

Process B

Node V

MSG library

MessageService

Process C

Node W

MSG library

MessageService

Node YProcess E

MSG library

priority area 0

priority area 1

priority area 2priority area 3

Queue Q

Message m1priority 0sent to queue Q

Message m2priority 1sent to queue Q

saMsgMessageSend() saMsgMessageSend()

Getting a messagefrom a queue

removes itfrom the queue

saMsgMessageGet()

saMsgMessageGet()

Example service - Message Queues

Page 14: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

THE AIS SPECIFICATION

Dependability Services

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.17

Page 15: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

SA Forum’s approach tomaking applications HA

Availability Management Framework principles Integrate best practices into the middleware, Provide means for the software developer to influence

behavior Let the middleware control the application (like a

Marionette)

Page 16: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

SA Forum’s approach tomaking applications HA

Fault prevention Reduce the probability of system failure to an acceptably

low value Fault tolerance

Provide service in spite of faults Fault removal

Diagnostics, monitoring, repair at development and runtime

Fault forecasting / prediction Estimating failures and their effects E.g. software aging

Page 17: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Server Clustering

Notions node link cluster partition

Functionalities of cluster middleware Error detection Error handling Notification Reconfiguration

N1

N3

N4N2

N5

Page 18: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Server Clustering 2.

Categories of clusters HA clusters

Improve the availability of services

Load-balancing clusters Share the workload among the nodes

High-Performance Clusters (HPC) Scientific computing

Grid computing Many independent jobs

Page 19: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

HW and SW based Fault Tolerance

HW Redundant power supply

SW Process replicas, failover, switchover…

Clusters – hybrid solutions Process replicas on different nodes

Page 20: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

AMF Redundancy Patterns

Provided models 2N

Simple failover pair

N+M Load balanced active servers and hot-standby units

N-Way Load balanced active servers which are able to overtake the each other’s

services

N-Way Active No standby redundancy is needed but processing has to be carried out in

parallel by N service units

No redundancy Non-critical services of the system

Page 21: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc24

Node U Node V Node W Node X

AMF System Model Example

Service Instance B Service Instance D

ServiceGroupSG2Service Unit S3 Service Unit S4 Service Unit S5

Service Instance A

ServiceGroupSG1Service Unit S1 Service Unit S2

Service group SG1 supports asingle service instance A, andservice group SG2 supportstwo service instances B and D

Active Active Active

Standby StandbyStandby

On behalf of A, service unit S1is assigned the active HA stateand service unit S2 is assignedthe standby HA state

Service unit S1 contains twocomponents C1 and C2, andservice unit S2 contains twocomponents C3 and C4

ComponentC1

ComponentC2

ComponentC3

ComponentC4

ComponentC5 Component

C6

ComponentC7

Similarly, for service group SG2

Page 22: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

The SA Forum Information Model

Page 23: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Fault Management Flow Chart

The error is detected

Hypothetical cause of error is

located

The defective component is removed

from service The system is reconfigured or

restarted to function properly

A faulty system component is replaced

An alternate form of fault detection, including built-

in diagnosis

Page 24: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Error detection methods

Mechanisms Interface check (e.g. illegal instruction, illegal

parameter, insufficient access rights) Ad-hoc methods

Acceptability testing Error checking codes Timing checks (watchdog) Diagnostic analysis (in idle time) Substitute back (integrate derive)

Page 25: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

AMF error detection mechanisms

Passive monitoring Using OS functionality Currently only crash of a process

External active monitoring External entity is used to monitor

Internal active monitoring Healthchecks Types

Pull: AMF invoked

Push: Component invoked

Needs the changeof the component

Page 26: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Planning recovery - checkpointing

Define the variables to be saved State variables

Normal operation Open checkpoint Save variables regularly

On failure Read checkpoint Continue normal operation

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.29

Page 27: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc30

ActiveComponent

Checkpoint Service

Example of checkpointing, and restoration from a checkpoint after a fault, for a collocated checkpoint

Node U

Service Unit S

ActiveComponent

ServiceGroup

Section xyz

Section abcCheckpoint C

Node U

Service Unit S

StandbyComponent

Section xyz

Section abcCheckpoint CCheckpoint

Service

openwritesynchronize read

open Activereplica

Activereplica

Standbyreplica

Page 28: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Methods of recovery

Goal: Recovery from errors before failures. Trivial methods

Retry operation Restart the application (cold, warm) Restart the system

Techniques Forward recovery Backward recovery Compensation

Page 29: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Methods of recovery 2.

S2

S1

Backward recovery

Forward recovery

Compensation

Page 30: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Rollback / backward recovery

Preconditions State restoration is possible Correct saved state Logged interactions with the environment Deterministic restart from a given state (distributed

systems)

Mechanisms State based rollback Operation based rollback Hybrid rollback (state + operation)

Page 31: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

State based rollback

Checkpoint operations Create Discard Rollback recovery

Timing / triggers Time interval, number of messages sent, etc.

Recovery region Active checkpoint exists Recovery regions can overlap

Time: Recovery pointData: Checkpoint

C1 C2 C3 C4

Checkpointactiveness

Page 32: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

State based rollback

Error

Error detected

Failure

C2 is corrupt, we have to

use C1

C1 C2 C3 C4

Errors between state saves 2 phase checkpointing (2 buffers) Factors that influence # of overlapping rec. regions

Error detection latency Cost of buffers Coordination in distributed systems

Page 33: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Service continuity

Goal: maintain the service continuity

Type of error Related actions

Transient •Ignore (recovery handles them)

Permanent •Reconfiguration•Failover•Switchover•Fail-back•“Graceful degradation”

Page 34: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Actions

Failover The service is removed from the failed entity and assigned

to a healthy entity Switchover

The service is removed from a healthy entity to another healthy entity

Fail-back The failed entity is repaired and service is assigned to it

again Graceful degradation

Service is carried on in a degraded state, probably with degraded functionality

Page 35: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

System level error handling patterns

Fail-silent If failed then no messages are sent out until repaired E.g. in distributed systems

Fail-stop Stop operation on failure E.g. train traffic control systems (turn all lights red if

failure happens)

Francisco V. Brasileiro et. al. Implementing Fail-Silent Nodes for Distributed Systems (1996 )

Normal Silent

Failing

Page 36: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Requirements in today’s systems

Cost efficiency Flexibility Dependable services

Highly Available Reliable

Availability Outage duration per year

0.99999 ~5 mins

0.9999 ~52 mins

0.999 ~8 hrs

0.99 ~3 days

Page 37: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Dependability and performance vs.costs

dependability performance

cost

Page 38: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Cost o

f con

struc

tion

Degree of fault tolerance

Cos

t

FT vs. CostInternational SpaceStation

Desktop ComputerHW / SW redundancy

Algorithms

Advanced techniques

Page 39: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

THE AIS SPECIFICATION

Communication Services

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.42

Page 40: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc43

Application Interface Specification

Managed Hardware Platform

Carrier Grade Operating System

HPI Middleware AIS Middleware

Other Middleware and Application Services

HA ApplicationsManagement

SA ForumAIS APIs,

AMFCLM IMM CKPTLOGNTF LCKEVTMSG

Boards, fans, power, etc.

Web server, Music broadcast

Page 41: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Communication

Message (MSG) Recommended for many to one scheme Example: collect sensor data

Event (EVT) Many to Many (publish – subscribe) scheme Example: Sentinels and villagers

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.44

Page 42: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc 45

MessageService

Process A

Node U

MSG library

MessageService

Process B

Node V

MSG library

MessageService

Process C

Node W

MSG library

MessageService

Node YProcess E

MSG library

priority area 0

priority area 3

priority area 1

priority area 2

Message Queue Q

Message m1priority 0sent to queue Q

Message m2priority 1sent to queue Q

saMsgMessageSend() saMsgMessageSend()

Getting a messagefrom a queue

removes itfrom the queue

saMsgMessageGet()

saMsgMessageGet()

Message Service Message Service, Message Service (MSG) Library, and

Message Queue

Page 43: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Copyright© 2006 Service Availability™ Forum, Inc46

SubscriberEVT library Subscriber

EVT library

Publisher

EVT library

Publisher

EVT library

SubscriberEVT library

Publisher

EVT library

SubscriberEVT library

Event Service

Example of the Event Service, with publishers and subscribers on two nodes

Node U Node V

EventService

Filter Filter Filter Filter

Event Channel X

Event Channel Y

Page 44: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

THE AIS SPECIFICATION

Programming model

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.47

Page 45: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Programming model

Library life cycle Initialize, finalize, dispatch events

Service specific APIs Callbacks Request functions

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.48

Page 46: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Programming model - AMF

Library life cycle saAmfInitialize(), saAmfFinalize(), saAmfDispatch()

Service specific APIs Callbacks

saAmfCSISetCallbackT()

Request functions saAmfHAStateGet()

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.49

Page 47: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Programming model – how to use it?

Initialize service handler saAmfInitialize()

Set callbacks

Get selection object saAmfSelectionObjectGet()

Start main loop do {

select(amfSelectionObject,…)

saAmfDispatch()

} while (1)

Copyright© 2004 Service Availability™ Forum, Inc. - Other names and brands are properties of their respective owners.50

Page 48: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Demo

Highly Available / Fault TolerantMusic Broadcast Application

Page 49: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Architecture

Broadcast Server

Source Server 2 Source Server 1Internet

IcecastInput buffer

Music stream source

application

Music stream source

application

The source application is

made HA

Clients

Page 50: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Online radio streaming system

Page 51: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Music Source App

Debian 1 Debian 2

MSource SG

MSource SU 1

MSource SU 2

Mcomp 2Mcomp 1

MSource SI 1

MCSI

AMF configuration

Tolerated failures Component Node One communication

channel

Page 52: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Where should I continue?

OS

Middleware

MP3 streamer

OS

Middleware

MP3 streamer

Position

STATE

Active StandbyActiveFaulty

Page 53: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

Acknowledgements

András Kövi

Zoltán Micskei,

István Majzik

András Pataricza

Page 54: Introduction to the Service Availability Forum.. Contents  Introduction  Quick AIS Specification overview  AIS Dependability services  AIS Communication

References

Service Availability Forum http://saforum.org Application Interface Specification

http://www.saforum.org/specification/AIS_Information/

Education Material http://www.saforum.org/education/

Fault Tolerance Research Group – (BME MIT) http://www.inf.mit.bme.hu/FTSRG/