From AT&T Bootstrap Week: This session focuses on architecture and design concepts to ensure scalability and maximize reliability for server-based applications running in the cloud environment. The session will discuss techniques to consider for achieving scalability and reliability and tradeoffs to consider such as time vs. cost based on the needs for different types of applications.
HIGH SCALABILITY AND RELIABILITY IN THE CLOUD
GREG THOMPSON
HEAD OF ARCHITECTURE, APPS ENABLEMENT
ALCATEL-LUCENT
[email protected] @gmthomps
About This Session
- Target audience is backend application developers deploying infrastructure into a cloud environment
- Will cover concepts for scalability and reliability, with the goal of helping application developers understand some key considerations when designing and building the backend
Design Time Decisions
When first building your application backend, consider a few important questions:
- How fast should the application be recovered if a failure occurs?
- What kind of downtime is acceptable?
- Is the application maintaining stateful data?
- What kind of information needs to be shared across multiple instances?
Scalability
What is Scalability?
- Scalability is a term used to describe how the application will handle increased traffic volume
Scalability – Factors to Consider
- Horizontal vs. Vertical
- Stateless vs. Stateful
- Understanding Limitations
- Connection Management
- Segmentation of traffic
- Segmentation of responsibility (distributed architecture)
- Clustering
- Messaging
What Type of Scalability? Vertical vs. Horizontal
Vertical
- Scaling up a single node
- Physical limitations – instances are very powerful but still have finite limits
- Resources such as number of sockets can only go so high
Horizontal
- Scaling out across multiple nodes
- Ability to distribute traffic over a number of nodes
- Allows for more flexibility over time
Will the App Maintain State? Stateless Applications
- Application does not persist information about transactions
- Each transaction is independent and atomic
[Diagram: Request -> Application -> Response]
Will the App Maintain State? Stateful Applications
- Application needs to maintain data about transactions in progress
- Requires storage
- Persistence may also be required, depending on the reliability model
[Diagram: First Request and Subsequent Request -> Application -> DB]
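To make the stateless/stateful distinction concrete, here is a minimal sketch. The function names, the `SessionStore` class, and the shopping-cart scenario are all hypothetical; the in-memory dictionary only stands in for the DB shown on the slide.

```python
from dataclasses import dataclass, field


def stateless_convert(celsius: float) -> float:
    """Stateless: the response depends only on the request itself."""
    return celsius * 9 / 5 + 32


@dataclass
class SessionStore:
    """Hypothetical stand-in for the DB on the stateful slide."""
    data: dict = field(default_factory=dict)

    def load(self, session_id: str) -> list:
        return self.data.get(session_id, [])

    def save(self, session_id: str, items: list) -> None:
        self.data[session_id] = items


def stateful_add_to_cart(store: SessionStore, session_id: str, item: str) -> list:
    """Stateful: a subsequent request needs what earlier requests stored."""
    items = store.load(session_id) + [item]
    store.save(session_id, items)
    return items


store = SessionStore()
stateful_add_to_cart(store, "s1", "book")        # first request
cart = stateful_add_to_cart(store, "s1", "pen")  # subsequent request
```

Any node can serve `stateless_convert`, while `stateful_add_to_cart` only works across nodes if every node can reach the same store.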
Understanding Limitations
- Thorough testing is key to understanding bottlenecks
- Test real-world scenarios, including latency
- Push the system to the max to understand how it behaves
Connection Management: Mobile Device Connections
- Mobile devices don’t always behave like you expect
- Connectivity is often very dynamic
  - Devices move from 4G/3G/2G/no G/Wi-Fi
- Not all TCP events will get reported and sockets can remain open
- If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
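One common way to cope with sockets that stay open after a device silently disappears is an idle timeout. The sketch below is an illustration only: the `ConnectionRegistry` class, the 60-second timeout, and the connection ids are assumptions, not something the slides prescribe.

```python
import time
from typing import Optional


class ConnectionRegistry:
    """Tracks last activity per connection so silent ones can be reaped."""

    def __init__(self, idle_timeout_s: float):
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = {}  # connection id -> timestamp of last activity

    def touch(self, conn_id: str, now: Optional[float] = None) -> None:
        """Record activity (data or a heartbeat) for a connection."""
        self.last_seen[conn_id] = time.monotonic() if now is None else now

    def reap_idle(self, now: Optional[float] = None) -> list:
        """Drop connections idle longer than the timeout; return their ids."""
        now = time.monotonic() if now is None else now
        stale = [c for c, t in self.last_seen.items()
                 if now - t > self.idle_timeout_s]
        for conn_id in stale:
            del self.last_seen[conn_id]  # real code would also close the socket
        return stale


registry = ConnectionRegistry(idle_timeout_s=60)
registry.touch("phone-a", now=0)    # last heard from at t=0
registry.touch("phone-b", now=50)   # last heard from at t=50
reaped = registry.reap_idle(now=90)  # phone-a has been silent for 90 s
```

The explicit `now` arguments exist only to make the example deterministic; a server would call `touch` on every read and run `reap_idle` periodically.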
Segmenting Traffic
- Once the application is able to be scaled out, traffic can be segmented in different ways
  - Location (e.g. east coast vs. west coast)
  - Pre-assigned criteria - User ID, IP, or other dynamic criteria
  - Load Balanced
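Segmenting by a pre-assigned criterion such as user ID can be sketched with a stable hash, so the same user always lands on the same node. The node names and the `node_for_user` helper are hypothetical.

```python
import hashlib

NODES = ["east-1", "east-2", "west-1"]  # hypothetical node names


def node_for_user(user_id: str, nodes: list) -> str:
    """Map a user ID to a node with a stable hash (same user, same node)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


assigned = node_for_user("alice", NODES)
```

A plain modulo like this remaps most users when the node count changes; consistent hashing is a common alternative when nodes are added or removed often.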
Segmenting Responsibility
- Segmenting responsibility allows for a distributed architecture
- Each component can be scaled independently
- Allows for more flexibility in scaling
- Adds more complexity and potential messaging overhead
Clustering
- Clustering is the concept of having a group of nodes working together to provide the same capability
- Nodes typically co-located
- Common data shared as needed across the cluster
- Communication may be needed between nodes
[Diagram: four App Nodes and Shared Data]
Messaging
- Once a clustered and/or distributed architecture is used, messaging will be needed between the various components and/or nodes
Types of Messaging
- JMS
- Open Source MQ packages
- Custom Designed
- Use of APIs
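Whichever messaging option is chosen, the basic pattern is that one component enqueues a message and another consumes it independently. The sketch below uses Python's standard `queue.Queue` purely as an in-process stand-in for a broker; a real deployment would put JMS or an MQ package between separate processes or nodes, and the message contents here are invented.

```python
import queue
import threading

messages = queue.Queue()  # in-process stand-in for a broker queue
processed = []


def consume() -> None:
    """Consume messages until a None sentinel arrives."""
    while True:
        msg = messages.get()
        if msg is None:
            break
        processed.append(msg.upper())


consumer = threading.Thread(target=consume)
consumer.start()
for text in ["order placed", "order shipped"]:
    messages.put(text)  # producer side: enqueue and move on
messages.put(None)      # tell the consumer to stop
consumer.join()
```

The producer never waits for the consumer to finish its work, which is what lets the two sides scale independently.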
Example of Scaled Architecture
[Diagram: Site 1 and Site 2, each containing Load Balancers, Web Servers, Component 1 instances, Component 2 instances, and a Database]
Reliability/Availability
What is Reliability/Availability?
- Availability is typically measured by the amount of downtime your application has in a given year
  - Unplanned downtime and planned downtime are both considered
- Reliability is described by the likelihood of failure based on actual measurements
- We’ll focus more on Availability
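Since availability is measured as downtime per year, converting between the two is simple arithmetic. This sketch assumes a 365-day year; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_pct(downtime_hours: float) -> float:
    """Availability as a percentage, given total downtime in a year."""
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


def allowed_downtime_hours(target_pct: float) -> float:
    """Yearly downtime budget for a target availability percentage."""
    return HOURS_PER_YEAR * (1 - target_pct / 100.0)


budget = allowed_downtime_hours(99.9)  # about 8.76 hours per year
```

Planned maintenance windows count against the same budget as outages, which is why the slide considers both.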
Reliability/Availability: Factors to Consider
- Cost vs. Need
- Problem detection
- Automation for recovery
- Active/standby, active/active, hot standby vs. cold standby
- Local and Geo-redundancy
- Multi-zone, multi-cloud
- Test Until You Break the System
Reliability Requirements
Cost Considerations
- Number of instances
- Bandwidth requirements between sites
- Complexity of software
- Monitoring
Need
- User Experience
- Customer requirements
- Negative Publicity
Problem Detection
- Effective monitoring of the application is key to minimizing downtime
  - Event reporting in the software
  - External monitoring – test for successful behavior
- Auto detection and alerting to minimize cost of operations personnel
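External monitoring with automatic alerting can be sketched as a probe that is run repeatedly and raises an alert after several consecutive failures. The `monitor` function, its parameters, and the failure threshold are assumptions for illustration; a real monitor would issue an actual request as the probe and wait between checks.

```python
from typing import Callable, List


def monitor(probe: Callable[[], bool], checks: int, max_failures: int,
            alert: Callable[[str], None]) -> None:
    """Run the probe `checks` times; alert on `max_failures` failures in a row."""
    consecutive = 0
    for _ in range(checks):
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises counts as a failure
        consecutive = 0 if ok else consecutive + 1
        if consecutive == max_failures:
            alert(f"probe failed {consecutive} times in a row")


# Simulated probe results: one success, three failures, then recovery.
results = iter([True, False, False, False, True])
alerts: List[str] = []
monitor(lambda: next(results), checks=5, max_failures=3, alert=alerts.append)
```

Requiring several consecutive failures before alerting is a design choice that trades a little detection delay for fewer false alarms.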
Automation for Recovery
- The faster a failed component recovers, the better the reliability
  - Automatic detection and automatic recovery
- Automated installation is key for minimizing setup time during recovery
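Automatic recovery can be as simple as a pass that restarts anything found not running. The `Service` class below is a toy, hypothetical component; in practice the restart step would invoke the automated installation or a process supervisor.

```python
class Service:
    """Toy component that can fail and be restarted (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.running = False
        self.starts = 0

    def start(self) -> None:
        self.running = True
        self.starts += 1


def recover(services: list) -> list:
    """Restart every service that is not running; return the names restarted."""
    restarted = []
    for svc in services:
        if not svc.running:
            svc.start()
            restarted.append(svc.name)
    return restarted


api, jobs = Service("api"), Service("jobs")
api.start()
jobs.start()
jobs.running = False            # simulate a crash
restarted = recover([api, jobs])
```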
Availability Models
- N = number of nodes required for normal processing
- N+1 = one additional node to provide redundancy in case of failure
- N+K = K additional nodes to provide redundancy
[Diagram: node groups labeled N, N+1, and N+K]
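The N+K model translates directly into a sizing calculation. This sketch assumes every node has the same capacity and that load can be spread evenly; the load and capacity figures are made up.

```python
import math


def nodes_required(peak_load: float, per_node_capacity: float, k: int) -> int:
    """N+K sizing: N nodes to carry the peak load plus K spares."""
    n = math.ceil(peak_load / per_node_capacity)
    return n + k


def survives(total_nodes: int, failed: int, peak_load: float,
             per_node_capacity: float) -> bool:
    """True if the remaining nodes can still carry the peak load."""
    return (total_nodes - failed) * per_node_capacity >= peak_load


# 900 requests/s at 250 requests/s per node: N = 4, so N+1 = 5 nodes.
total = nodes_required(peak_load=900, per_node_capacity=250, k=1)
```

With these numbers an N+1 deployment tolerates one node failure but not two.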
Redundancy Models
- Active/Cold Standby
  - Backup site is booted up when needed
- Active/Hot Standby
  - Backup site is running and ready to take over
- Active/Active
  - Both sites active and processing traffic
[Diagram: Active + Cold Standby; Active + Active Standby; Active + Active]
Local and Geo-Redundancy
- Local
  - Backup instances are available within the same location
  - Use of availability zones within a region is very similar
- Geographic
  - Backup instances are available in another geographic location
  - Typically in a separate region to account for events such as natural disasters
Availability to the Max
- Multi-Zone/Multi-Region
  - Multi-zone typically provides instances running in different physical locations, but in the same region
  - Multi-region provides different geographic regions of availability
- Multi-Cloud
  - If your application requires the maximum possible availability
  - Run in different cloud providers in different regions
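The gain from running in several zones, regions, or clouds can be estimated with a simple probability calculation. The sketch assumes that sites fail independently and that any single site can carry the full load; both are idealizations, and the 99% figure is only an example.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of `sites` redundant sites, assuming independent failures."""
    return 1 - (1 - site_availability) ** sites


two_sites = combined_availability(0.99, 2)  # roughly 0.9999
```

Correlated failures, such as a shared provider outage, break the independence assumption, which is part of the case for multi-cloud.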
Test Until You Break the System
- Push the system to the max and observe the breaking points
- Fix the problem, repeat
- The best way to find problems and prevent unplanned downtime is to test thoroughly with a mindset to break the system
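Pushing the system to its breaking point can be sketched as a load ramp that stops at the first failure. The `find_breaking_point` helper and the simulated 450 requests/s limit are hypothetical; in a real test, `handle` would drive a load generator and check error rates or latency.

```python
from typing import Callable, Optional


def find_breaking_point(handle: Callable[[int], bool], start: int,
                        step: int, limit: int) -> Optional[int]:
    """Ramp load until handle(load) fails; return the last load that passed."""
    last_ok = None
    load = start
    while load <= limit:
        if not handle(load):
            return last_ok
        last_ok = load
        load += step
    return last_ok


# Hypothetical system that copes with up to 450 requests/s.
last_good = find_breaking_point(lambda rps: rps <= 450,
                                start=100, step=100, limit=1000)
```

After fixing whatever failed first, rerunning the same ramp shows where the next limit is.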
Q&A
Greg [email protected]@alcatel-lucent.com
THANK YOU!