HIGH SCALABILITY AND RELIABILITY IN THE CLOUD GREG THOMPSON HEAD OF ARCHITECTURE, APPS ENABLEMENT ALCATEL-LUCENT [email protected] @gmthomps

Scalability and Reliability in the Cloud


DESCRIPTION

From AT&T Bootstrap Week: This session focuses on architecture and design concepts to ensure scalability and maximize reliability for server-based applications running in the cloud environment. The session will discuss techniques to consider for achieving scalability and reliability and tradeoffs to consider such as time vs. cost based on the needs for different types of applications.


Page 1: Scalability and Reliability in the Cloud

HIGH SCALABILITY AND RELIABILITY IN THE CLOUD
GREG THOMPSON
HEAD OF ARCHITECTURE, APPS ENABLEMENT
ALCATEL-LUCENT
[email protected] @gmthomps

Page 2: Scalability and Reliability in the Cloud

About This Session
- Target audience is backend application developers deploying infrastructure into a cloud environment
- Will cover concepts for scalability and reliability, with the goal of helping application developers understand some key considerations when designing and building the backend

Page 3: Scalability and Reliability in the Cloud

Design Time Decisions
- When first building your application backend, consider a few important questions:
  - How fast should the application be recovered if a failure occurs?
  - What kind of downtime is acceptable?
  - Is the application maintaining stateful data?
  - What kind of information needs to be shared across multiple instances?

Page 4: Scalability and Reliability in the Cloud

Scalability

Page 5: Scalability and Reliability in the Cloud

What is Scalability?
- Scalability is a term used to describe how the application will handle increased traffic volume

Page 6: Scalability and Reliability in the Cloud

Scalability – Factors to Consider
- Horizontal vs. Vertical
- Stateless vs. Stateful
- Understanding Limitations
- Connection Management
- Segmentation of traffic
- Segmentation of responsibility (distributed arch)
- Clustering
- Messaging

Page 7: Scalability and Reliability in the Cloud

What Type of Scalability? Vertical vs. Horizontal
Vertical
- Scaling up a single node
- Physical limitations: instances are very powerful but still have finite limits
- Resources such as number of sockets can only go so high
Horizontal
- Scaling out across multiple nodes
- Ability to distribute traffic over a number of nodes
- Allows for more flexibility over time

Page 8: Scalability and Reliability in the Cloud

Will the App Maintain State? Stateless Applications
- Application does not persist information about transactions
- Each transaction is independent and atomic
[Diagram: Request -> Application -> Response]

Page 9: Scalability and Reliability in the Cloud

Will the App Maintain State? Stateful Applications
- Application needs to maintain data about transactions in progress
- Requires storage
- Persistence may also be required, depending on the reliability model
[Diagram: First Request and Subsequent Request -> Application <-> DB]
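
The difference between the two models can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: `handle_stateless`, `handle_stateful`, and `SessionStore` are hypothetical names, and the in-memory dictionary only stands in for the DB in the diagram.

```python
from dataclasses import dataclass, field


def handle_stateless(request: dict) -> dict:
    """Stateless: everything needed is in the request; nothing is remembered."""
    return {"total": sum(request["items"])}


@dataclass
class SessionStore:
    """Hypothetical stand-in for the DB; a real deployment would use shared storage."""
    _data: dict = field(default_factory=dict)

    def load(self, session_id: str) -> list:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, items: list) -> None:
        self._data[session_id] = list(items)


def handle_stateful(store: SessionStore, session_id: str, item: int) -> dict:
    """Stateful: a subsequent request depends on data kept from earlier requests."""
    items = store.load(session_id)
    items.append(item)
    store.save(session_id, items)
    return {"total": sum(items)}
```

Because `handle_stateful` keeps its data in the store rather than in the process, any instance that can reach the same store could serve the subsequent request.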

Page 10: Scalability and Reliability in the Cloud

Understanding Limitations
- Thorough testing is key to understanding bottlenecks
- Test real-world scenarios, including latency
- Push the system to the max to understand how it behaves

Page 11: Scalability and Reliability in the Cloud

Connection Management: Mobile Device Connections
- Mobile devices don't always behave like you expect
- Connectivity is often very dynamic
  - Devices move between 4G/3G/2G/no G/Wi-Fi
  - Not all TCP events will get reported and sockets can remain open
- If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
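
One common way to deal with sockets that stay open after a device silently disappears is an application-level idle timeout. The sketch below is an assumption about how that could look, not something shown in the session: `IdleConnectionTracker` is a hypothetical helper that only tracks connection IDs, and the caller would still have to close the real sockets it returns.

```python
import time


class IdleConnectionTracker:
    """Remembers when each connection was last active and reports stale ones."""

    def __init__(self, idle_timeout_s: float, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # connection id -> last activity time

    def touch(self, conn_id: str) -> None:
        """Call whenever data (or a heartbeat) arrives on a connection."""
        self._last_seen[conn_id] = self._clock()

    def reap(self) -> list:
        """Forget and return connections idle for longer than the timeout."""
        now = self._clock()
        stale = [c for c, seen in self._last_seen.items()
                 if now - seen > self.idle_timeout_s]
        for conn_id in stale:
            del self._last_seen[conn_id]
        return stale
```

Running `reap()` on a timer bounds how long a dead connection can hold a socket, regardless of whether the TCP stack ever reports the disconnect.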

Page 12: Scalability and Reliability in the Cloud

Segmenting Traffic
- Once the application is able to be scaled out, traffic can be segmented in different ways
  - Location (e.g. east coast vs. west coast)
  - Pre-assigned criteria: User ID, IP, or other dynamic criteria
  - Load balanced
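
Two of the options above can be sketched briefly. This is an illustrative example under assumptions: the node names are made up, `node_for_user` uses a simple hash-modulo scheme as one possible "pre-assigned criteria" rule, and `round_robin` is the most basic form of load balancing.

```python
import hashlib
from itertools import cycle


def node_for_user(user_id: str, nodes: list[str]) -> str:
    """Pre-assigned criteria: the same user ID always maps to the same node."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def round_robin(nodes: list[str]):
    """Load balanced: hand out nodes in rotation."""
    return cycle(nodes)


nodes = ["node-1", "node-2", "node-3"]   # hypothetical node names
rr = round_robin(nodes)
first_four = [next(rr) for _ in range(4)]
```

With hash-modulo assignment, changing the number of nodes remaps most users; consistent hashing is a common alternative when that matters.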

Page 13: Scalability and Reliability in the Cloud

Segmenting Responsibility
- Segmenting responsibility allows for a distributed architecture
  - Each component can be scaled independently
  - Allows for more flexibility in scaling
  - Adds more complexity and potential messaging overhead

Page 14: Scalability and Reliability in the Cloud

Clustering
- Clustering is the concept of having a group of nodes working together to provide the same capability
  - Nodes typically co-located
  - Common data shared as needed across the cluster
  - Communication may be needed between nodes
[Diagram: four App Nodes connected to Shared Data]

Page 15: Scalability and Reliability in the Cloud

Messaging
- Once a clustered and/or distributed architecture is used, messaging will be needed between various components and/or nodes
- Types of Messaging
  - JMS
  - Open Source MQ packages
  - Custom Designed
  - Use of APIs
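
The basic pattern behind all of these options is that one component puts messages on a queue and another consumes them. The sketch below uses Python's standard `queue.Queue` inside a single process purely as a stand-in; it is not JMS or an MQ package, and `run_pipeline` and `component_2` are hypothetical names.

```python
import queue
import threading


def run_pipeline(requests: list) -> list:
    """Component 1 (the caller) sends messages; component 2 processes them."""
    inbox = queue.Queue()
    results = []

    def component_2():
        while True:
            msg = inbox.get()
            if msg is None:              # sentinel: no more messages
                break
            results.append(msg.upper())  # placeholder for real processing

    worker = threading.Thread(target=component_2)
    worker.start()
    for request in requests:
        inbox.put(request)               # the sender does not wait for processing
    inbox.put(None)
    worker.join()
    return results
```

In a distributed deployment the queue would live in a broker or behind an API, which is where the messaging overhead comes from.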

Page 16: Scalability and Reliability in the Cloud

Example of Scaled Architecture
[Diagram: Site 1 and Site 2, each containing Load Balancer, Web Server, Component 1, Component 2, and Database tiers]

Page 17: Scalability and Reliability in the Cloud

Reliability/Availability

Page 18: Scalability and Reliability in the Cloud

What is Reliability/Availability?
- Availability is typically measured by the amount of downtime your application has in a given year
  - Unplanned downtime and planned downtime are both considered
- Reliability is described by the likelihood of failure based on actual measurements
- We'll focus more on Availability
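
Availability targets are often quoted as percentages, and converting them to downtime per year is simple arithmetic. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
def downtime_minutes_per_year(availability_percent: float,
                              days_per_year: int = 365) -> float:
    """Minutes of downtime per year allowed by a given availability target."""
    minutes_per_year = days_per_year * 24 * 60
    return minutes_per_year * (1 - availability_percent / 100)


three_nines = downtime_minutes_per_year(99.9)    # roughly 526 minutes
four_nines = downtime_minutes_per_year(99.99)    # roughly 53 minutes
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is a useful way to frame the cost vs. need discussion.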

Page 19: Scalability and Reliability in the Cloud

Reliability/Availability: Factors to Consider
- Cost vs. Need
- Problem detection
- Automation for recovery
- Active/standby, active/active, hot standby vs. cold standby
- Local and Geo-redundancy
- Multi-zone, multi-cloud
- Test Until You Break the System

Page 20: Scalability and Reliability in the Cloud

Reliability Requirements
Cost Considerations
- Number of instances
- Bandwidth requirements between sites
- Complexity of software
- Monitoring
Need
- User Experience
- Customer requirements
- Negative Publicity

Page 21: Scalability and Reliability in the Cloud

Problem Detection
- Effective monitoring of the application is key to minimizing downtime
  - Event reporting in the software
  - External monitoring: test for successful behavior
  - Auto detection and alerting to minimize cost of operations personnel
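
External monitoring with automatic alerting can be sketched as a probe that is run repeatedly, with an alert raised after several consecutive failures. This is an assumed design, not the one used in the session: `HealthMonitor`, `probe`, `alert`, and the threshold of three are all hypothetical.

```python
class HealthMonitor:
    """Runs a probe and alerts once after N consecutive failures."""

    def __init__(self, probe, alert, failure_threshold: int = 3):
        self.probe = probe                  # callable returning True on success
        self.alert = alert                  # callable taking a message string
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self._alerted = False

    def check_once(self) -> bool:
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False                      # a crashing probe counts as a failure
        if ok:
            self.consecutive_failures = 0
            self._alerted = False
        else:
            self.consecutive_failures += 1
            if (self.consecutive_failures >= self.failure_threshold
                    and not self._alerted):
                self.alert(f"{self.consecutive_failures} consecutive failures")
                self._alerted = True        # avoid repeating the same alert
        return ok
```

The probe should test for successful behavior (for example, a real request that returns the expected result) rather than only checking that a process exists.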

Page 22: Scalability and Reliability in the Cloud

Automation for Recovery
- The faster a failed component recovers, the less downtime accumulates
  - Automatic detection and automatic recovery
  - Automated installation is key for minimizing setup time during recovery
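
The effect of recovery time can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The terms MTBF (mean time between failures) and MTTR (mean time to recover) are not used in the slides; they are introduced here only for illustration, and the numbers are made up.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the component is up, given failure and recovery times."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


manual_recovery = steady_state_availability(mtbf_hours=999, mttr_hours=1)
automated_recovery = steady_state_availability(mtbf_hours=999, mttr_hours=0.1)
```

With the same failure rate, cutting recovery time from an hour to a few minutes raises availability, which is the case for automating detection, recovery, and installation.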

Page 23: Scalability and Reliability in the Cloud

Availability Models
- N = number of nodes required for normal processing
- N+1 = one additional node to provide redundancy in case of failure
- N+K = K additional nodes provide redundancy
[Diagram: N nodes; N nodes plus 1; N nodes plus K]
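
Sizing for these models is straightforward arithmetic. The sketch below assumes every node has the same capacity and that load is spread evenly; the function names and the numbers are illustrative only.

```python
import math


def nodes_required(peak_load: float, per_node_capacity: float) -> int:
    """N: nodes needed for normal processing at peak load."""
    return math.ceil(peak_load / per_node_capacity)


def provision(peak_load: float, per_node_capacity: float, k: int) -> int:
    """N+K: nodes to deploy so that K failures can be absorbed."""
    return nodes_required(peak_load, per_node_capacity) + k


def survives(total_nodes: int, failed: int,
             peak_load: float, per_node_capacity: float) -> bool:
    """True if the remaining nodes can still carry the peak load."""
    return (total_nodes - failed) * per_node_capacity >= peak_load


n = nodes_required(peak_load=950, per_node_capacity=200)
n_plus_1 = provision(peak_load=950, per_node_capacity=200, k=1)
```

Here N is 5 and N+1 is 6: one failure still leaves enough capacity, while two failures do not.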

Page 24: Scalability and Reliability in the Cloud

Redundancy Models
- Active/Cold Standby
  - Backup site is booted up when needed
- Active/Hot Standby
  - Backup site is running and ready to take over
- Active/Active
  - Both sites active and processing traffic
[Diagram: Active + Cold Standby; Active + Active Standby; Active + Active]
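
From a traffic-routing point of view the models differ in which sites receive requests. The following is a simplified sketch with hypothetical names (`traffic_targets`, "primary", "secondary"); it ignores the time a cold standby needs to boot, which is the main practical difference between cold and hot standby.

```python
def traffic_targets(model: str, primary_up: bool, secondary_up: bool) -> list:
    """Return the sites that should receive traffic under a redundancy model."""
    if model == "active/active":
        # Both sites take traffic; a failed site simply drops out.
        sites = (("primary", primary_up), ("secondary", secondary_up))
        return [name for name, up in sites if up]
    if model in ("active/cold-standby", "active/hot-standby"):
        # The standby only takes traffic after the active site has failed.
        if primary_up:
            return ["primary"]
        return ["secondary"] if secondary_up else []
    raise ValueError(f"unknown redundancy model: {model}")
```

Active/active uses the backup capacity all the time, but each site must be sized so that the survivor can carry the full load alone.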

Page 25: Scalability and Reliability in the Cloud

Local and Geo-Redundancy
- Local
  - Backup instances are available within the same location
  - Use of availability zones within a region is very similar
- Geographic
  - Backup instances are available in another geographic location
  - Typically in a separate region to account for events such as natural disasters

Page 26: Scalability and Reliability in the Cloud

Availability to the Max
- Multi-Zone/Multi-Region
  - Multi-zone typically provides instances running in different physical locations, but in the same region
  - Multi-region provides different geographic regions of availability
- Multi-Cloud
  - If your application requires the maximum possible availability
  - Run in different cloud providers in different regions

Page 27: Scalability and Reliability in the Cloud

Test Until You Break the System
- Push the system to the max and observe the breaking points
- Fix the problem, repeat
- The best way to find problems and prevent unplanned downtime is to test thoroughly with a mindset to break the system
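
The "push, observe, fix, repeat" loop can be automated as a stepped load ramp. This is a minimal sketch under assumptions: `handle_load` is a hypothetical callable that drives the system at a given load and returns True if it stayed healthy, and the lambda below only simulates a system that breaks above 350 requests per second.

```python
def find_breaking_point(handle_load, start: int, step: int, limit: int):
    """Increase load until the system fails; return (last_ok, first_failed)."""
    last_ok = None
    load = start
    while load <= limit:
        if not handle_load(load):
            return last_ok, load
        last_ok = load
        load += step
    return last_ok, None    # no failure observed up to the limit


# Simulated system under test (assumption for illustration only).
last_ok, first_failed = find_breaking_point(lambda rps: rps <= 350,
                                            start=100, step=100, limit=1000)
```

After fixing whatever broke at the failing load level, the same ramp can be rerun to find the next bottleneck.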

Page 28: Scalability and Reliability in the Cloud

Q&A

Page 29: Scalability and Reliability in the Cloud

Greg [email protected]@alcatel-lucent.com

THANK YOU!