Scalability and Reliability in the Cloud

HIGH SCALABILITY AND RELIABILITY IN THE CLOUD

GREG THOMPSON

HEAD OF ARCHITECTURE, APPS ENABLEMENT

ALCATEL-LUCENT

[email protected]

@gmthomps

About This Session

- Target audience is backend application developers deploying infrastructure into a cloud environment.
- Will cover concepts for scalability and reliability, with the goal of helping application developers understand some key considerations when designing and building the backend.

Design Time Decisions

When first building your application backend, consider a few important questions:

- How fast should the application be recovered if a failure occurs?
- What kind of downtime is acceptable?
- Is the application maintaining stateful data?
- What kind of information needs to be shared across multiple instances?

Scalability

What is Scalability?

Scalability describes how well the application handles increasing traffic volume.

Scalability – Factors to Consider

- Horizontal vs. vertical
- Stateless vs. stateful
- Understanding limitations
- Connection management
- Segmentation of traffic
- Segmentation of responsibility (distributed architecture)
- Clustering
- Messaging

What Type of Scalability? Vertical vs. Horizontal

Vertical: scaling up a single node
- Physical limitations: instances are very powerful but still have finite limits
- Resources such as the number of sockets can only go so high

Horizontal: scaling out across multiple nodes
- Ability to distribute traffic over a number of nodes
- Allows for more flexibility over time

Will the App Maintain State? Stateless Applications

- Application does not persist information about transactions
- Each transaction is independent and atomic

[Diagram: a request goes to the application and a response comes back]

Will the App Maintain State? Stateful Applications

- Application needs to maintain data about transactions in progress
- Requires storage
- Persistence may also be required, depending on the reliability model

[Diagram: the first request and subsequent requests reach the application, which is backed by a DB]
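To make the stateless vs. stateful distinction concrete, here is a minimal sketch. The names `SessionStore`, `handle_stateless`, and `handle_stateful` are hypothetical, and the in-memory dict only stands in for the DB; a real deployment would likely use an external store shared by all instances.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """Stand-in for the shared DB (assumption: an in-memory dict for illustration)."""
    _data: dict = field(default_factory=dict)

    def load(self, session_id: str) -> dict:
        # Unknown sessions start with an empty item list.
        return self._data.get(session_id, {"items": []})

    def save(self, session_id: str, state: dict) -> None:
        self._data[session_id] = state


def handle_stateless(price: float, quantity: int) -> float:
    # Everything needed is in the request itself, so any instance can answer.
    return price * quantity


def handle_stateful(store: SessionStore, session_id: str, item: str) -> list:
    # State for the transaction in progress lives in the store, not in the
    # application instance, so a subsequent request may hit a different node.
    state = store.load(session_id)
    state["items"] = state["items"] + [item]
    store.save(session_id, state)
    return state["items"]
```

Keeping the state outside the instance is one common way to let a stateful application still scale horizontally.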

Understanding Limitations

- Thorough testing is key to understanding bottlenecks
- Test real-world scenarios, including latency
- Push the system to the max to understand how it behaves

Connection Management: Mobile Device Connections

- Mobile devices don’t always behave like you expect
- Connectivity is often very dynamic
- Devices move between 4G, 3G, 2G, no G, and Wi-Fi
- Not all TCP events will get reported, and sockets can remain open
- If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
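One possible way to defuse the "sockets remain open" problem is to track when each connection was last heard from and reap the ones that have gone quiet. The sketch below assumes an application-level heartbeat calls `touch`; `ConnectionTracker` and its method names are hypothetical, and closing the underlying socket is left to the caller.

```python
import time


class ConnectionTracker:
    """Tracks last activity per connection and reports idle ones."""

    def __init__(self, idle_timeout: float, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def touch(self, conn_id: str) -> None:
        # Call on every message or heartbeat received from the device.
        self._last_seen[conn_id] = self._clock()

    def reap(self) -> list:
        # Return (and forget) connections idle for longer than the timeout.
        now = self._clock()
        stale = [conn_id for conn_id, seen in self._last_seen.items()
                 if now - seen > self.idle_timeout]
        for conn_id in stale:
            del self._last_seen[conn_id]
        return stale
```

Running `reap` on a timer keeps dead mobile connections from slowly exhausting the socket limit of a node.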

Segmenting Traffic

Once the application can be scaled out, traffic can be segmented in different ways:

- Location (e.g. east coast vs. west coast)
- Pre-assigned criteria: user ID, IP, or other dynamic criteria
- Load balanced
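As an illustration of segmenting by user ID, the sketch below hashes the ID and maps it onto a list of nodes, so the same user always lands on the same node. The function name `pick_node` and the node names are assumptions made for this example.

```python
import hashlib


def pick_node(user_id: str, nodes: list) -> str:
    """Map a user ID to one node, deterministically."""
    if not nodes:
        raise ValueError("at least one node is required")
    # A stable hash (unlike Python's built-in hash()) gives the same answer
    # on every instance and across restarts.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Note that simple modulo hashing remaps most users when the number of nodes changes; consistent hashing is a common alternative when nodes are added and removed often.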

Segmenting Responsibility

Segmenting responsibility allows for a distributed architecture:

- Each component can be scaled independently
- Allows for more flexibility in scaling
- Adds more complexity and potential messaging overhead

Clustering

Clustering is the concept of having a group of nodes working together to provide the same capability.

- Nodes are typically co-located
- Common data is shared as needed across the cluster
- Communication may be needed between nodes

[Diagram: four App Nodes connected to Shared Data]

Messaging

Once a clustered and/or distributed architecture is used, messaging will be needed between the various components and/or nodes.

Types of messaging:
- JMS
- Open source MQ packages
- Custom designed
- Use of APIs
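The basic pattern behind most of these options is a queue between a producer and a consumer. The sketch below uses Python's in-process `queue.Queue` purely as a stand-in for a real broker (JMS, an MQ package, and so on); `run_worker` and `send_all` are hypothetical names, and upper-casing the message stands in for real processing.

```python
import queue
import threading


def run_worker(inbox: queue.Queue, results: list) -> None:
    # Consumer: process messages until the sentinel (None) arrives.
    while True:
        message = inbox.get()
        if message is None:
            inbox.task_done()
            break
        results.append(message.upper())  # placeholder for real work
        inbox.task_done()


def send_all(messages: list) -> list:
    # Producer: hand messages to the worker through the queue.
    inbox = queue.Queue()
    results = []
    worker = threading.Thread(target=run_worker, args=(inbox, results))
    worker.start()
    for message in messages:
        inbox.put(message)
    inbox.put(None)   # tell the worker to stop
    worker.join()
    return results
```

The producer never calls the consumer directly, which is what lets each side be scaled or replaced independently.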

Example of Scaled Architecture

[Diagram: Site 1 and Site 2 each contain two load balancers, two web servers, two instances each of Component 1 and Component 2, and a database]

Reliability/Availability

What is Reliability/Availability?

- Availability is typically measured by the amount of downtime your application has in a given year
  - Unplanned downtime and planned downtime are both considered
- Reliability is described by the likelihood of failure based on actual measurements
- We’ll focus more on availability
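To relate an availability percentage to downtime per year, a small worked calculation helps. This is a sketch assuming a 365-day year; the function name `downtime_hours_per_year` is made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)
```

Under this definition, 99% availability allows about 87.6 hours of downtime per year, 99.9% about 8.76 hours, and 99.99% about 53 minutes. Planned downtime counts against the same budget.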

Reliability/Availability: Factors to Consider

- Cost vs. need
- Problem detection
- Automation for recovery
- Active/standby, active/active, hot standby vs. cold standby
- Local and geo-redundancy
- Multi-zone, multi-cloud
- Test until you break the system

Reliability Requirements

Cost considerations:
- Number of instances
- Bandwidth requirements between sites
- Complexity of software
- Monitoring

Need:
- User experience
- Customer requirements
- Negative publicity

Problem Detection

Effective monitoring of the application is key to minimizing downtime.

- Event reporting in the software
- External monitoring: test for successful behavior
- Auto detection and alerting to minimize the cost of operations personnel
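External monitoring can be sketched as a probe that exercises the application and an alert that fires after several consecutive failures. `HealthMonitor`, the `probe` callable, and the threshold of three are all assumptions for illustration; in practice the probe might issue a real request and check the response.

```python
from typing import Callable


class HealthMonitor:
    """Runs a probe and signals an alert after repeated failures."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True when an alert should be raised."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False  # a probe that blows up counts as a failure
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Alert once, at the moment the threshold is reached.
        return self.consecutive_failures == self.failure_threshold
```

Requiring several failures in a row is one way to avoid paging operations staff for a single dropped request.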

Automation for Recovery

Recovering a failed component quickly increases reliability.

- Automatic detection and automatic recovery
- Automated installation is key for minimizing setup time during recovery
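Automatic recovery often comes down to a retry loop around whatever restarts the component. The sketch below is one possible shape, assuming a hypothetical `start` callable that returns True on success; the exponential backoff and the default of five attempts are illustrative choices, not recommendations from the talk.

```python
import time
from typing import Callable


def recover(start: Callable[[], bool], max_attempts: int = 5,
            base_delay: float = 1.0, sleep=time.sleep) -> int:
    """Try to restart a failed component; return the attempt that succeeded.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if start():
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"component did not recover after {max_attempts} attempts")
```

If `start` wraps an automated installation script, the same loop covers rebuilding a node from scratch.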

Availability Models

- N = number of nodes required for normal processing
- N+1 = one additional node to provide redundancy in case of failure
- N+K = K additional nodes to provide redundancy

[Diagram: node groups illustrating N, N+1, and N+K]
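The N and N+K models translate directly into arithmetic. In this sketch, `peak_load` and `capacity_per_node` are assumed to be in the same unit (for example requests per second); both function names are hypothetical.

```python
import math


def nodes_required(peak_load: float, capacity_per_node: float) -> int:
    """N: the smallest node count that can carry the peak load."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    return math.ceil(peak_load / capacity_per_node)


def nodes_to_deploy(peak_load: float, capacity_per_node: float, k: int = 1) -> int:
    """N+K: nodes for normal processing plus K redundant nodes."""
    if k < 0:
        raise ValueError("k must not be negative")
    return nodes_required(peak_load, capacity_per_node) + k
```

For a peak of 1000 requests per second on nodes that each handle 300, N is 4, N+1 is 5, and N+2 is 6. With N+K, up to K nodes can fail while the remaining nodes still carry the peak load.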

Redundancy Models

- Active/Cold Standby: backup site is booted up when needed
- Active/Hot Standby: backup site is running and ready to take over
- Active/Active: both sites are active and processing traffic

[Diagram: site pairs illustrating the three redundancy models]
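The routing difference between the standby and active/active models can be sketched in a few lines. `route`, the mode strings, and the site names are assumptions for this example; the sketch does not model booting a cold standby, only which healthy sites receive traffic.

```python
def route(sites: list, healthy: set, mode: str) -> list:
    """Return the sites that should receive traffic.

    `sites` is in priority order: primary first, then backups.
    """
    up = [site for site in sites if site in healthy]
    if mode == "active/active":
        return up        # every healthy site shares the traffic
    if mode == "active/standby":
        return up[:1]    # only the highest-priority healthy site
    raise ValueError(f"unknown mode: {mode}")
```

With a hot standby, the backup is already in the healthy set and takes over immediately; with a cold standby, it only becomes healthy after it has been booted.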

Local and Geo-Redundancy

Local
- Backup instances are available within the same location
- Use of availability zones within a region is very similar

Geographic
- Backup instances are available in another geographic location
- Typically in a separate region to account for events such as natural disasters

Availability to the Max

Multi-Zone/Multi-Region
- Multi-zone typically provides instances running in different physical locations, but in the same region
- Multi-region provides different geographic regions of availability

Multi-Cloud
- For applications that require the maximum possible availability
- Run in different cloud providers in different regions

Test Until You Break the System

- Push the system to the max and observe the breaking points
- Fix the problem, repeat
- The best way to find the problems that cause unplanned downtime is to test thoroughly with a mindset to break the system
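The "push, observe, fix, repeat" loop can be partly automated by ramping load until errors appear. In the sketch below, `error_rate_at` is a hypothetical callable that drives the system at a given load and returns the observed error rate; the 1% threshold is only an example.

```python
from typing import Callable, Optional


def find_breaking_point(error_rate_at: Callable[[int], float],
                        start: int, step: int, max_load: int,
                        max_error_rate: float = 0.01) -> Optional[int]:
    """Ramp the load and return the first level whose error rate is too high.

    Returns None if the system holds up through max_load.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    load = start
    while load <= max_load:
        if error_rate_at(load) > max_error_rate:
            return load
        load += step
    return None
```

After each fix, rerunning the same ramp shows whether the breaking point moved and what the next bottleneck is.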

Q&A

Greg [email protected]@alcatel-lucent.com

THANK YOU!
