Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
APPLICATION RESILIENCEThe UST Global Approach
UST Global Inc,April 2018
USTGlobal ®
Table of Contents
Overview of Application Resilience 3
Application Resiliency through Infrastructure Components 4
Application Resilience through Software Components 6
APPLICATION RESILIENCE I APRIL 2018 2Copyright © 2018 UST Global Inc
Application Level Failure Recovery 8
Recovery from Data Failures 9
UST Workload Analysis and Topology Improvements 10
APPLICATION RESILIENCE I APRIL 2018 3Copyright © 2018 UST Global Inc
OVERVIEW OF APPLICATION RESILIENCEWith today’s trends in technology revolution in the digital era, applications and associated
business services are expected to operate at 24x7 with zero tolerance on availability and
performance. It took tremendous effort and resources to build such a system in the past;
however, the availability of building blocks such as containerization, API based resource
management, predictive analytics and machine learning makes it possible to build highly
available and self-healing applications.
Application non-availability is due to several causes:
It is fairly easy and efficient to improve resilience while building such systems from the
ground up. However, in most cases, legacy applications were built for a specific purpose
and contextual situation at that point in time during development and deployment, and
these applications continue to provide value to an enterprise with business critical
functions.
Planned outage which constitutes 90% of the time
Unplanned outage which constitutes about 9%
Disaster which constitutes about 1% of the failures
9% of applicationfailures are due to unplannedoutages ofinfrastructure
Unplanned Outage
Planned Outage
Disaster
Cause of application non-availability
Legacy Issues
However, most of the enterprises are looking at modernization of applications ensuring
applications, ensuring that application resilience is one of the key factors for the
transformation.
Modernization
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 4
APPLICATION RESILIENCY THROUGH INFRASTRUCTURE
COMPONENTSAt the infrastructure level, resiliency is mostly achieved through redundant
components. In the physical environment, server clusters are configured in either
active/active or active/passive modes with a heartbeat mechanism to ensure continuity.
When the heartbeat fails, the remaining active node or passive node takes over.
In virtual environments, the resiliency is achieved through continuous provisioning of
redundant components with failure detection mechanisms. When a failure occurs, a
replacement component is provisioned from the pool of existing components and the
operational state is established.
In virtual operations such as private or public cloud, multiple approaches are used, and
each has the key objective of identifying the single points of failure at the infrastructure
level and developing a solution with tolerance to those failure scenarios. When multiple
failures occur at the infrastructure level, it can be considered a disaster and the
established disaster recovery mechanism is triggered for recovery.
Infrastructure components such as Server, Storage, Network, Power, and Physicalenvironment
Software components such as Operating System, Middleware components such as Web Server, Application Server, Database Server, and Messaging Server
Data components such as File system, Shared memory, Object storage, or Block storage
User connectivity or End User computing environment failures or degradations
Applicationresiliencydepends oninfrastructure,middleware,data andend usercomputing
In this document, we look at various considerations starting from the infrastructure
platform, software components and application architecture, that would make applications
highly resilient from failures or degradation of the application itself and/or from the
dependent layers of infrastructure components.
Purpose
Disaster recovery is implemented with a recovery time objective (RTO) and a recovery
point objective (RPO). Not all applications need the same level of RTO and RPO. More
stringent RTO and RPO parameters are more expensive to implement; hence the RTO and
RPO parameters should be tied to the business criticality of the application.
For a lower RTO closer to 0 hours and lower RPO closer to 0 minutes, the infrastructure
components should be configured in active/active configuration, using a global load
Disaster Recovery RTO and RPO
The key dependent layers for an application to be resilient in the normal mode of
operation are:
Key Dependencies
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 5
balancer to distribute the traffic based on the response time, location affinity, application
affinity, and other factors. This requires duplication of infrastructure components (server,
storage, and network) along with deploying the entire application stack in a highly
available mode across multiple data centers which are geographically separated with high
speed, low latency network connections. The network connection between the data
centers should be redundant to avoid single point failures. In the event of one regional
data center going down or multiple failures occurring in a single data center, the other
data center will take up and respond to requests based on timeout parameters or
triggered responses to network alerts.
An example of an active/active deployment of an application environment across two
data centers with high speed connectivity for data replication is shown below. The two
data centers are connected to a global intelligent DNS.
For a moderate RTO and RPO, wherein there is enough time to recover the application
and the associated data, the infrastructure is duplicated similar to active-active
configuration except that the secondary site will generally have a scaled down
configuration and the primary site will provide the application services as depicted below.
Not allapplicationsneed the samelevel of RTOand RPO
Note - •• The example below can be implemented in any public cloud today as well
as private cloud environments
Active/Active configuration
Active MirroringReplication
Amazon Route 53
AMI Copy(ami-996 63410)
awsdrdemo.comfailover.awsdrdemo.com
Active Passive
US East (N. Virginia)
WebServer
WebServer
DBServer
Data Volume
WebServer
WebServer
DBServer
Data Volume
Amazon Route 53
User or System
Data Mirroring/Replication
Low Capacity
AWS
Warm standby configuration
Web AppServer
Primary
AWS
VPS
US Wast (N. California)
FailoverApp
Secondary DB
AWS
VPS
Active ELB:DRDemoPrimary ELB-52152634 us-east-felbamazonaws.com
Web Servers:I36af5751
VPC ID - vpc5f9ef53e
Subnet IDs-subnet-440c786csubnet-289ef549subnet-2c9ef54d
VPC ID - vpc-a412efoc
Subnet IDs-subnet-bbf2efd3subnet-894b01oesubnet-bef2efd6
Primary Database Server:(i-026 aad 65)Private IP174.168.1.11
DR ELB:Created on Failover
Failover App Instance:1-55oide0e Elastic IP 54.215.157.25
Web Servers-Created on Failover
Secondary Database Server:(i-3b266960) Private IP 174.168.1.11
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 6
APPLICATION RESILIENCE THROUGH SOFTWARE
COMPONENTSIn multi-tier application architectures, the services at each tier can be architected to scale
horizontally, which increases the resiliency of the application. The horizontal scalability not
In the event of failure at the primary site, the secondary site will scale up in minutes
and provide the full set of application services. The data is replicated between the
primary and secondary data copies based on the RPO parameter values.
For non-critical business applications which have tolerance for less stringent RTO
and RPO levels, the secondary site can be implemented in an active/passive
configuration where the secondary site will be in the shutdown state, as illustrated
in the below diagram.
The data will be replicated from the primary to the secondary site but only the
database server will continue to run in the secondary site. In the event of failure at
the primary site, the rest of the secondary application environment will start and
the services will be provided from the secondary site as illustrated in the below
diagram.
Active-Active, Warm standby and Cold standby are the three DR configurations for various RTOs and RPOs
Cold standby configuration will meet most of the enterprise application needs for RTO
WebServer
WebServer
DBServer
Data Volume
WebServer
WebServer
Primary
Data Volume
Amazon Route 53
User or System
Data Mirroring/Replication
Not Running
Smaller Instance
AWS
Cold standby mode with normal operation mode
WebServer
ApplicationServer
DatabaseServer
Data Volume
WebServer
ApplicationServer
DatabaseServer
Data Volume
Amazon Route 53
User or System
Data Mirroring/Replication
Start in minutes
Resize to prod capacity
AWS
Cold standby mode with DR operation
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 7
only provides higher availability but also meets the performance goals even when
the load on the application increases.
The below diagram illustrates the horizontally scalable web servers attached to the
load balancer.
The number of web servers will be increased or decreased depending upon the
load. When one of the web servers fails, the traffic is automatically routed to the
remaining operational web servers ensuring the availability of the application.
However, in the above architecture, the hardware load balancer becomes the single
point of failure. If the load balancer fails, the entire application stack also fails. To
increase the resiliency of the application, the load balancers themselves in this
configuration must be implemented with scalability. The below diagram is an
example of implementing a cloud based scalable, highly available load balancer.
The loadbalancers areusuallydeployedoutside of thedatacenters toincrease theresilience, and load balancers themselves arehorizontallyscalable
Web Server A
Load Balancing Cluster
Hardware Load Balancer
FirewallInternet
Web Server B
Web Server C
Multi-tier resiliency through horizontal scaling
Scalable, highly available load balancer for resilience
World Wide Web
Cloud LoadBalancer
Request
Cloud Servers
Directed Traffic
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 8
The cloud-based load balancer can be implemented in public as well as private
environments.
Theapplicationsare developedwith patterndetectionmechanism torecover fromthe failures
Implementation of horizontally scalable load balancers, with fault tolerant configurations
for web servers, application servers and database servers increases the resiliency of the
application many times over.
An example of achieving horizontal scalability at both application server and database
server tiers is shown below.
Note - The database servers shown above can be parallelized and distributed or
partitioned depending upon the application use cases. If the databases are indexed and
distributed across multiple servers with the key domain tables populated in each database
server, the query can be parallelized to improve performance as well as availability of the
database server which also improves resiliency of the overall application.
Full Redundancy and Scalability
Transient failures can occur in the network connecting two applications or application
components. In such situations, the applications should be implemented following the
retry pattern. This approach involves retrying the same API or service call after a suitable
timeout. The timeout selection is very important; if the timeout is too short, the API or
Retry Pattern and Timeouts
DNS
Databaseproxy
LoadBalancer 1
LoadBalancer 2
ApplicationServer 1
ApplicationServer 2
ApplicationServer 3
ApplicationServer m
DatabaseServer 1
DatabaseServer 2
DatabaseServer 3
DatabaseServer m
Distributed Database Optimization
APPLICATION LEVEL FAILURE RECOVERYWhile the infrastructure and middleware services can be implemented as described
above for higher resiliency and throughput, it is very important to enable the applications
themselves to recover from the various component failures.
Below are some of the examples for application-level recovery mechanisms that increase
application resiliency.
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 9
The data can be replicated both at the block level as well as file level to increase the speed and integrity
service calls may fail erroneously and if the timeout is too long, the performance of the
application recovery will be compromised. Selecting the timeout parameter values should
be based on the latency of the network and application response time requirements.
When applications fail repeatedly due to transient failures or non-availability of a
component or service, the application should adapt by declaring the API or service is
non-viable and handle the exception by taking an exception path. During the exception
path utilization, the application should periodically check the availability of the API or
service in a non-blocking fashion and restore the normal operation path once the
exception clears. An adaptive or learning routine is used to dynamically use the circuit
breaker parameter values for optimum performance.
Circuit Breaker Pattern
In horizontally scalable applications, each copy of the application can be a logical leader in
adding an additional element. When the logical leader fails, one of the remaining
operational servers is elected as a leader. The mechanism of electing the leader must
cope with failures such as connectivity failures, API call failures or service failures. Other
server instances monitor the logical leader through the heartbeat mechanism. If the
designated leader terminates unexpectedly, or a network failure renders the leader
inaccessible by the other server instances, it will invoke a protocol to elect a new leader
based on round robin or random selection.
Leader Election Pattern
The applications can be instrumented with learning algorithms to detect the degradation
of the infrastructure components as well as to predict the failure of such components
through machine learning algorithms. These machine learning algorithms will detect
anomalous events and respond to prevent the failures. Even if there are failures, the
applications can call the underlying infrastructure APIs and newly provision the required
infrastructure component(s) to recover from such failures. This results in not only
increasing the resiliency of the application but also reducing the number of manual tickets
and support costs.
Infrastructure level API
RECOVERY FROM DATA FAILURESEnhancing availability, reliability and durability of data results in increasing application
resilience. Data loss can be prevented by replicating the data across multiple sites. The
replication of data can be implemented on an object basis (file, data, block, etc.,) in near
real-time by replicating at the lowest unit of change for the object (for example, at the
record level for database files).
The example below illustrates the replication of data at both the database layer as well as
the block storage level. Conducted jointly, the data replication increases the resiliency of
the application.
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 10
Automated analytics and migration tools are available to help enterprises increase application resiliency
UST WORKLOAD ANALYSIS AND TOPOLOGY
IMPROVEMENTSUST Global utilizes an application analyzer tool to automate application topology
discovery; the analyzer results provide multiple options for the designer to optimize the
topology with respect to:
Load Balancer
WEB HOST 1 WEB HOST 2
APP HOST 1 APP HOST 2
OID HOST 1 OID HOST 2 OID HOST 1 OID HOST 2
DB HOST 1 DB HOST 2 DB HOST 1 DB HOST 2
APP HOST 1 APP HOST 2
WEB HOST 1 WEB HOST 2
Load Balancer
Load Balancer Load Balancer
PRODSTOR
FirewallIntranet (Data Tier)
Firewall DMZSecure Zone(Application Tier)
Firewall DMZPublic Zone (Web Tier)
Web Application
DATABASE DATABASE
Security
STBYSTOR
Web Application Security
WAN
PRODUCTION/ACTIVE SITE STANDBY/PASSIVE SITE
Data replications at block and file leves
INTERNET
OraceData Guard
DiscReplication
DatabaseCluster
SharedStorageSystem
SecurityCluster
ApplicationCluster
WebHosts
WebHosts
ApplicationCluster
SharedStorageSystem
SecurityCluster
DatabaseCluster
High Availability
Disaster recovery
Multi server deployment
Horizontal scalability
Copyright © 2018 UST Global Inc APPLICATION RESILIENCE I APRIL 2018 11
UST GlobalApplicationDiscovery,Analyzer andMigrationaccelerate theApplicationResilienceDesign &Implementation
The below workflow describes the UST Global application resiliency enhancement
methodology from workload analysis through topology planning and implementation.
The scanner discovers the existing application workload topology and generates maps of
the logical distribution. Depending on the target application resiliency requirements, the
target application architecture can be modified to (i) distribute the workloads across
multiple servers to form a distributed environment, or (ii) increase the availability through
horizontally scalable architecting, or (iii) increase performance through database
partitioning and parallelization., or (iv) combinations of the above.
Additionally, an automated migration engine is used to migrate applications to the target
environment once the topology mapping is complete.
Author - Dr. Pandian Angaiyan [email protected]
Application level analysis
SCANNERScan Apps onCloud on-prem
WORKLOAD MAPPERMicro-workloads mappedand planned
TOPOLOGY PLANNINGMap to different architectures
Architectures mapped to VMs
MAPPING TO TARGETARCHITECTURE
UST Global® is a fast-growing digital technology company that provides advanced computing and
digital services to large private and public enterprises around the world. Driven by a larger purpose of
Transforming Lives and the philosophy of “fewer Clients, more Attention”, we bring in the entrepreneuri-
al spirit that seeks the fastest path to value in today’s digital economy. Our innovative technology
services and pioneering social programs make us stand apart.
UST Global is headquartered in Aliso Viejo, California and operates in 21 countries. Our clients include
Fortune 500 companies in Banking and Financial Services, Healthcare, Insurance, Retail, High Technol-
ogy, Manufacturing, Shipping, and Telecom. UST Global believes in building long-lasting, strategic
business relationships through agile and client-centric global engagement models that combine local
experts and resources with cost, scale, and quality advantages of global operations.
For more information, please visit: www.ust-global.com
ABOUT UST GLOBAL®
Corporate Office: UST Global ®
5 Polaris Way, Second Floor, Aliso Viejo, CA 92656Tel: (949) 716-8757 Fax: (949) 716-8396
www.ust-global.comFor further information contact: [email protected]
/USTGlobal /USTGlobal /ustglobalweb /company/ust-global
USTGlobal ®