Cloud - UST Global's Approach To Appplication Resilience · level and developing a solution with tolerance to those failure scenarios. When multiple failures occur at the infrastructure

APPLICATION RESILIENCEThe UST Global Approach

UST Global Inc,April 2018

USTGlobal ®

Table of Contents

Overview of Application Resilience 3

Application Resiliency through Infrastructure Components 4

Application Resilience through Software Components 6

APPLICATION RESILIENCE I APRIL 2018 2Copyright © 2018 UST Global Inc

Application Level Failure Recovery 8

Recovery from Data Failures 9

UST Workload Analysis and Topology Improvements 10

APPLICATION RESILIENCE I APRIL 2018 3Copyright © 2018 UST Global Inc

OVERVIEW OF APPLICATION RESILIENCEWith today’s trends in technology revolution in the digital era, applications and associated

business services are expected to operate at 24x7 with zero tolerance on availability and

performance. It took tremendous effort and resources to build such a system in the past;

however, the availability of building blocks such as containerization, API based resource

management, predictive analytics and machine learning makes it possible to build highly

available and self-healing applications.

Application non-availability is due to several causes:

It is fairly easy and efficient to improve resilience while building such systems from the

ground up. However, in most cases, legacy applications were built for a specific purpose

and contextual situation at that point in time during development and deployment, and

these applications continue to provide value to an enterprise with business critical

functions.

Planned outage which constitutes 90% of the time

Unplanned outage which constitutes about 9%

Disaster which constitutes about 1% of the failures

9% of applicationfailures are due to unplannedoutages ofinfrastructure

Unplanned Outage

Planned Outage

Disaster

Cause of application non-availability

Legacy Issues

However, most of the enterprises are looking at modernization of applications ensuring

applications, ensuring that application resilience is one of the key factors for the

transformation.

Modernization

Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 4

APPLICATION RESILIENCY THROUGH INFRASTRUCTURE

COMPONENTSAt the infrastructure level, resiliency is mostly achieved through redundant

components. In the physical environment, server clusters are configured in either

active/active or active/passive modes with a heartbeat mechanism to ensure continuity.

When the heartbeat fails, the remaining active node or passive node takes over.

In virtual environments, the resiliency is achieved through continuous provisioning of

redundant components with failure detection mechanisms. When a failure occurs, a

replacement component is provisioned from the pool of existing components and the

operational state is established.

In virtual operations such as private or public cloud, multiple approaches are used, and

each has the key objective of identifying the single points of failure at the infrastructure

level and developing a solution with tolerance to those failure scenarios. When multiple

failures occur at the infrastructure level, it can be considered a disaster and the

established disaster recovery mechanism is triggered for recovery.

Infrastructure components such as Server, Storage, Network, Power, and Physicalenvironment

Software components such as Operating System, Middleware components such as Web Server, Application Server, Database Server, and Messaging Server

Data components such as File system, Shared memory, Object storage, or Block storage

User connectivity or End User computing environment failures or degradations

Applicationresiliencydepends oninfrastructure,middleware,data andend usercomputing

In this document, we look at various considerations starting from the infrastructure

platform, software components and application architecture, that would make applications

highly resilient from failures or degradation of the application itself and/or from the

dependent layers of infrastructure components.

Purpose

Disaster recovery is implemented with a recovery time objective (RTO) and a recovery

point objective (RPO). Not all applications need the same level of RTO and RPO. More

stringent RTO and RPO parameters are more expensive to implement; hence the RTO and

RPO parameters should be tied to the business criticality of the application.

For a lower RTO closer to 0 hours and lower RPO closer to 0 minutes, the infrastructure

components should be configured in active/active configuration, using a global load

Disaster Recovery RTO and RPO

The key dependent layers for an application to be resilient in the normal mode of

operation are:

Key Dependencies


balancer to distribute the traffic based on the response time, location affinity, application

affinity, and other factors. This requires duplication of infrastructure components (server,

storage, and network) along with deploying the entire application stack in a highly

available mode across multiple data centers which are geographically separated with high

speed, low latency network connections. The network connection between the data

centers should be redundant to avoid single point failures. In the event of one regional

data center going down or multiple failures occurring in a single data center, the other

data center will take up and respond to requests based on timeout parameters or

triggered responses to network alerts.

An example of an active/active deployment of an application environment across two

data centers with high speed connectivity for data replication is shown below. The two

data centers are connected to a global intelligent DNS.

For a moderate RTO and RPO, wherein there is enough time to recover the application

and the associated data, the infrastructure is duplicated similar to active-active

configuration except that the secondary site will generally have a scaled down

configuration and the primary site will provide the application services as depicted below.

Not allapplicationsneed the samelevel of RTOand RPO

Note - •• The example below can be implemented in any public cloud today as well

as private cloud environments

Active/Active configuration

Active MirroringReplication

Amazon Route 53

AMI Copy(ami-996 63410)

awsdrdemo.comfailover.awsdrdemo.com

Active Passive

US East (N. Virginia)

WebServer

WebServer

DBServer

Data Volume

WebServer

WebServer

DBServer

Data Volume

Amazon Route 53

User or System

Data Mirroring/Replication

Low Capacity

AWS

Warm standby configuration

Web AppServer

Primary

AWS

VPS

US Wast (N. California)

FailoverApp

Secondary DB

AWS

VPS

Active ELB:DRDemoPrimary ELB-52152634 us-east-felbamazonaws.com

Web Servers:I36af5751

VPC ID - vpc5f9ef53e

Subnet IDs-subnet-440c786csubnet-289ef549subnet-2c9ef54d

VPC ID - vpc-a412efoc

Subnet IDs-subnet-bbf2efd3subnet-894b01oesubnet-bef2efd6

Primary Database Server:(i-026 aad 65)Private IP174.168.1.11

DR ELB:Created on Failover

Failover App Instance:1-55oide0e Elastic IP 54.215.157.25

Web Servers-Created on Failover

Secondary Database Server:(i-3b266960) Private IP 174.168.1.11


APPLICATION RESILIENCE THROUGH SOFTWARE

COMPONENTSIn multi-tier application architectures, the services at each tier can be architected to scale

horizontally, which increases the resiliency of the application. The horizontal scalability not

In the event of failure at the primary site, the secondary site will scale up in minutes

and provide the full set of application services. The data is replicated between the

primary and secondary data copies based on the RPO parameter values.

For non-critical business applications which have tolerance for less stringent RTO

and RPO levels, the secondary site can be implemented in an active/passive

configuration where the secondary site will be in the shutdown state, as illustrated

in the below diagram.

The data will be replicated from the primary to the secondary site but only the

database server will continue to run in the secondary site. In the event of failure at

the primary site, the rest of the secondary application environment will start and

the services will be provided from the secondary site as illustrated in the below

diagram.

Active-Active, Warm standby and Cold standby are the three DR configurations for various RTOs and RPOs

Cold standby configuration will meet most of the enterprise application needs for RTO

WebServer

WebServer

DBServer

Data Volume

WebServer

WebServer

Primary

Data Volume

Amazon Route 53

User or System


Not Running

Smaller Instance

AWS

Cold standby mode with normal operation mode

WebServer

ApplicationServer

DatabaseServer

Data Volume

WebServer

ApplicationServer

DatabaseServer

Data Volume

Amazon Route 53

User or System


Start in minutes

Resize to prod capacity

AWS

Cold standby mode with DR operation


only provides higher availability but also meets the performance goals even when

the load on the application increases.

The below diagram illustrates the horizontally scalable web servers attached to the

load balancer.

The number of web servers will be increased or decreased depending upon the

load. When one of the web servers fails, the traffic is automatically routed to the

remaining operational web servers ensuring the availability of the application.

However, in the above architecture, the hardware load balancer becomes the single

point of failure. If the load balancer fails, the entire application stack also fails. To

increase the resiliency of the application, the load balancers themselves in this

configuration must be implemented with scalability. The below diagram is an

example of implementing a cloud based scalable, highly available load balancer.

The loadbalancers areusuallydeployedoutside of thedatacenters toincrease theresilience, and load balancers themselves arehorizontallyscalable

Web Server A

Load Balancing Cluster

Hardware Load Balancer

FirewallInternet

Web Server B

Web Server C

Multi-tier resiliency through horizontal scaling

Scalable, highly available load balancer for resilience

World Wide Web

Cloud LoadBalancer

Request

Cloud Servers

Directed Traffic


The cloud-based load balancer can be implemented in public as well as private

environments.

Theapplicationsare developedwith patterndetectionmechanism torecover fromthe failures

Implementation of horizontally scalable load balancers, with fault tolerant configurations

for web servers, application servers and database servers increases the resiliency of the

application many times over.

An example of achieving horizontal scalability at both application server and database

server tiers is shown below.

Note - The database servers shown above can be parallelized and distributed or

partitioned depending upon the application use cases. If the databases are indexed and

distributed across multiple servers with the key domain tables populated in each database

server, the query can be parallelized to improve performance as well as availability of the

database server which also improves resiliency of the overall application.

Full Redundancy and Scalability

Transient failures can occur in the network connecting two applications or application

components. In such situations, the applications should be implemented following the

retry pattern. This approach involves retrying the same API or service call after a suitable

timeout. The timeout selection is very important; if the timeout is too short, the API or

Retry Pattern and Timeouts

DNS

Databaseproxy

LoadBalancer 1

LoadBalancer 2

ApplicationServer 1

ApplicationServer 2

ApplicationServer 3

ApplicationServer m

DatabaseServer 1

DatabaseServer 2

DatabaseServer 3

DatabaseServer m

Distributed Database Optimization

APPLICATION LEVEL FAILURE RECOVERYWhile the infrastructure and middleware services can be implemented as described

above for higher resiliency and throughput, it is very important to enable the applications

themselves to recover from the various component failures.

Below are some of the examples for application-level recovery mechanisms that increase

application resiliency.


The data can be replicated both at the block level as well as file level to increase the speed and integrity

service calls may fail erroneously and if the timeout is too long, the performance of the

application recovery will be compromised. Selecting the timeout parameter values should

be based on the latency of the network and application response time requirements.

When applications fail repeatedly due to transient failures or non-availability of a

component or service, the application should adapt by declaring the API or service is

non-viable and handle the exception by taking an exception path. During the exception

path utilization, the application should periodically check the availability of the API or

service in a non-blocking fashion and restore the normal operation path once the

exception clears. An adaptive or learning routine is used to dynamically use the circuit

breaker parameter values for optimum performance.

Circuit Breaker Pattern

In horizontally scalable applications, each copy of the application can be a logical leader in

adding an additional element. When the logical leader fails, one of the remaining

operational servers is elected as a leader. The mechanism of electing the leader must

cope with failures such as connectivity failures, API call failures or service failures. Other

server instances monitor the logical leader through the heartbeat mechanism. If the

designated leader terminates unexpectedly, or a network failure renders the leader

inaccessible by the other server instances, it will invoke a protocol to elect a new leader

based on round robin or random selection.

Leader Election Pattern

The applications can be instrumented with learning algorithms to detect the degradation

of the infrastructure components as well as to predict the failure of such components

through machine learning algorithms. These machine learning algorithms will detect

anomalous events and respond to prevent the failures. Even if there are failures, the

applications can call the underlying infrastructure APIs and newly provision the required

infrastructure component(s) to recover from such failures. This results in not only

increasing the resiliency of the application but also reducing the number of manual tickets

and support costs.

Infrastructure level API

RECOVERY FROM DATA FAILURESEnhancing availability, reliability and durability of data results in increasing application

resilience. Data loss can be prevented by replicating the data across multiple sites. The

replication of data can be implemented on an object basis (file, data, block, etc.,) in near

real-time by replicating at the lowest unit of change for the object (for example, at the

record level for database files).

The example below illustrates the replication of data at both the database layer as well as

the block storage level. Conducted jointly, the data replication increases the resiliency of

the application.


Automated analytics and migration tools are available to help enterprises increase application resiliency

UST WORKLOAD ANALYSIS AND TOPOLOGY

IMPROVEMENTSUST Global utilizes an application analyzer tool to automate application topology

discovery; the analyzer results provide multiple options for the designer to optimize the

topology with respect to:

Load Balancer

WEB HOST 1 WEB HOST 2

APP HOST 1 APP HOST 2

OID HOST 1 OID HOST 2 OID HOST 1 OID HOST 2

DB HOST 1 DB HOST 2 DB HOST 1 DB HOST 2

APP HOST 1 APP HOST 2

WEB HOST 1 WEB HOST 2

Load Balancer

Load Balancer Load Balancer

PRODSTOR

FirewallIntranet (Data Tier)

Firewall DMZSecure Zone(Application Tier)

Firewall DMZPublic Zone (Web Tier)

Web Application

DATABASE DATABASE

Security

STBYSTOR

Web Application Security

WAN

PRODUCTION/ACTIVE SITE STANDBY/PASSIVE SITE

Data replications at block and file leves

INTERNET

OraceData Guard

DiscReplication

DatabaseCluster

SharedStorageSystem

SecurityCluster

ApplicationCluster

WebHosts

WebHosts

ApplicationCluster

SharedStorageSystem

SecurityCluster

DatabaseCluster

High Availability

Disaster recovery

Multi server deployment

Horizontal scalability


UST GlobalApplicationDiscovery,Analyzer andMigrationaccelerate theApplicationResilienceDesign &Implementation

The below workflow describes the UST Global application resiliency enhancement

methodology from workload analysis through topology planning and implementation.

The scanner discovers the existing application workload topology and generates maps of

the logical distribution. Depending on the target application resiliency requirements, the

target application architecture can be modified to (i) distribute the workloads across

multiple servers to form a distributed environment, or (ii) increase the availability through

horizontally scalable architecting, or (iii) increase performance through database

partitioning and parallelization., or (iv) combinations of the above.

Additionally, an automated migration engine is used to migrate applications to the target

environment once the topology mapping is complete.

Author - Dr. Pandian Angaiyan [email protected]

Application level analysis

SCANNERScan Apps onCloud on-prem

WORKLOAD MAPPERMicro-workloads mappedand planned

TOPOLOGY PLANNINGMap to different architectures

Architectures mapped to VMs

MAPPING TO TARGETARCHITECTURE

UST Global® is a fast-growing digital technology company that provides advanced computing and

digital services to large private and public enterprises around the world. Driven by a larger purpose of

Transforming Lives and the philosophy of “fewer Clients, more Attention”, we bring in the entrepreneuri-

al spirit that seeks the fastest path to value in today’s digital economy. Our innovative technology

services and pioneering social programs make us stand apart.

UST Global is headquartered in Aliso Viejo, California and operates in 21 countries. Our clients include

Fortune 500 companies in Banking and Financial Services, Healthcare, Insurance, Retail, High Technol-

ogy, Manufacturing, Shipping, and Telecom. UST Global believes in building long-lasting, strategic

business relationships through agile and client-centric global engagement models that combine local

experts and resources with cost, scale, and quality advantages of global operations.

For more information, please visit: www.ust-global.com

ABOUT UST GLOBAL®

Corporate Office: UST Global ®

5 Polaris Way, Second Floor, Aliso Viejo, CA 92656Tel: (949) 716-8757 Fax: (949) 716-8396

www.ust-global.comFor further information contact: [email protected]

/USTGlobal /USTGlobal /ustglobalweb /company/ust-global

USTGlobal ®

Documents

Cloud - UST Global's Approach To Appplication Resilience · level and developing a solution with tolerance to those failure scenarios. When multiple failures occur at the infrastructure