COMP5348 Enterprise Scale Software Architecture Semester 1, 2010 Lecture 11. Solutions for Scalability and Availability Based on material by Paul Greenfield, Alan Fekete, Uwe Roehm and from textbook by Gorton

Page 1: COMP5348

COMP5348 Enterprise Scale Software Architecture

Semester 1, 2010

Lecture 11. Solutions for Scalability and Availability

Based on material by Paul Greenfield, Alan Fekete, Uwe Roehm and from textbook by Gorton

Page 2: COMP5348

Outline

› Scalability

› Availability

› Scale-up versus Scale-out

Page 3: COMP5348

Scalability

› “How well a solution to some problem will work when the size of the problem increases.”

› 4 common scalability issues in IT systems:

- Request load

- Connections

- Data size

- Deployments

Page 4: COMP5348

Scalability – Request Load

› How does a 100 tps application behave when simultaneous request load grows? E.g.

- From 100 to 1000 requests per second?

› If there isn’t any additional hardware capacity:

- Queueing on the saturated resource means: as the load increases, throughput remains constant and response time per request increases only linearly

- Reality: thrashing may appear, throughput drops!
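
The ideal queueing case above can be illustrated with a minimal sketch. It assumes a closed system in which every client issues its next request as soon as the previous one completes (no think time) and a single resource saturates at a fixed capacity. The function name, its parameters and the model itself are illustrative assumptions rather than part of the lecture, and thrashing is deliberately not modelled.

```python
def saturated_behaviour(clients: int, capacity_tps: float, service_time_s: float):
    """Return (throughput_tps, response_time_s) for an idealised closed system.

    Assumptions (illustrative only): zero think time, one saturable resource,
    no thrashing. Below saturation each request takes service_time_s.
    Beyond saturation, Little's law (N = X * R) gives R = N / X with the
    throughput X pinned at capacity_tps, so response time grows linearly in N.
    """
    unsaturated_tps = clients / service_time_s
    if unsaturated_tps <= capacity_tps:
        return unsaturated_tps, service_time_s
    return capacity_tps, clients / capacity_tps


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, saturated_behaviour(n, capacity_tps=100.0, service_time_s=0.01))
```

Under these assumptions, doubling the number of clients on a saturated system leaves throughput unchanged and doubles the response time. A real system that starts thrashing does worse than this model predicts.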

Page 5: COMP5348

Scalability – J2EE example

[Figure: throughput (TPS, 0 to 2500) against No. of Clients (0 to 1200), one curve each for WAS SB, JBoss SB, IAS SB, SS SB, WLS SB and BES SB]

I.Gorton, A Liu, Performance Evaluation of Alternative Component Architectures for Enterprise JavaBean Applications, in IEEE Internet Computing, vol.7, no. 3, pages 18-23, 2003.

Page 6: COMP5348

Scalability - Connections

› What happens if the number of simultaneous connections to an application increases?

- If each connection consumes a resource?

- Exceed maximum number of connections?

› ISP example:

- Each user connection spawned a new process

- Virtual memory on each server exceeded at 2000 users

- Needed to support 100Ks of users

- Tech crash ….

Page 7: COMP5348

Scalability – Data Size

› How does an application behave as the data it processes increases in size?

- Chat application sees average message size double?

- Database table size grows from 1 million to 20 million rows?

- Image analysis algorithm processes images of 100MB instead of 1MB?

› Can application/algorithms scale to handle increased data requirements?

Page 8: COMP5348

Scalability - Deployment

› How does effort to install/deploy an application increase as installation base grows?

- Install new users?

- Install new servers?

› Solutions typically revolve around automatic download/installation

- E.g. downloading applications from the Internet

Page 9: COMP5348

Defining System Capacity

› How many clients can you support?

- Name an acceptable response time

- Average 95% under 2 secs is common

- And what is ‘average’?

- Plot response time vs # of clients

› Great if you can run benchmarks

- Reason for prototyping and proving proposed architectures before leaping into full-scale implementation
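
The question "what is 'average'?" matters because a mean can hide a slow tail. The following sketch compares the mean with a 95th percentile computed by the nearest-rank method. The helper names, the 2-second limit as a default and the sample data are assumptions made for illustration.

```python
import math
from statistics import mean


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(samples, pct: int = 95, limit_s: float = 2.0) -> bool:
    """True if pct percent of the response times are at or under limit_s."""
    return percentile(samples, pct) <= limit_s


if __name__ == "__main__":
    # Hypothetical measurements: 18 fast requests and 2 very slow ones.
    response_times = [0.5] * 18 + [10.0] * 2
    print("mean:", mean(response_times))
    print("95th percentile:", percentile(response_times, 95))
    print("meets target:", meets_target(response_times))
```

With this made-up data the mean is under 2 seconds while the 95th percentile is not, so the two readings of "average" give opposite answers about capacity.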

Page 10: COMP5348

System Capacity

[Figure "RI 100x 1-100": throughput (tps) and response time (rt) plotted against number of threads (0 to 1400)]

Page 11: COMP5348

Outline

› Scalability

› Availability

› Scale-up versus Scale-out

Page 12: COMP5348

Availability

› Key requirement for most IT applications

› Measured by the proportion of the required time it is useable. E.g.

- 100% available during business hours

- No more than 2 hours scheduled downtime per week

- 24x7x52 (close to 100% availability)

› Related to an application’s reliability

- Unreliable applications suffer poor availability

Page 13: COMP5348

Availability

› Period of loss of availability determined by:

- Time to detect failure

- Time to correct failure

- Time to restart application

- Also time for scheduled maintenance (unless done while on-line)

› Recoverability (e.g. a database)

- the capability to reestablish performance levels and recover affected data after an application or system failure
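
The components of an outage listed above can be combined into a rough availability estimate. This sketch uses the standard steady-state formula, uptime / (uptime + downtime), which is an assumption brought in for illustration and not stated on the slide. All function names and figures are hypothetical.

```python
def outage_seconds(detect_s: float, correct_s: float, restart_s: float) -> float:
    """Length of one unplanned outage: detect, then correct, then restart."""
    return detect_s + correct_s + restart_s


def steady_state_availability(mean_uptime_s: float, mean_outage_s: float) -> float:
    """Long-run fraction of time the system is usable.

    Assumes failures recur with the given mean uptime between them and that
    scheduled maintenance is done on-line (otherwise add it to the outage).
    """
    return mean_uptime_s / (mean_uptime_s + mean_outage_s)


if __name__ == "__main__":
    outage = outage_seconds(detect_s=30, correct_s=600, restart_s=90)
    week = 7 * 24 * 3600
    # One failure per week of uptime, 12 minutes lost each time.
    print(round(steady_state_availability(week, outage), 5))
```

The sketch makes the leverage visible: shortening detection or restart time improves availability just as reducing the failure rate does.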

Page 14: COMP5348

Availability

› Strategies for high availability:

- Eliminate single points of failure

- Replication and failover

- Automatic detection and restart

› Redundancy is the key to availability

- No single points of failure

- Spare everything

- Disks, disk channels, processors, power supplies, fans, memory, ..

- Applications, databases, …

- Hot standby, quick changeover on failure
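
Why redundancy helps can be shown with a small calculation. It assumes that replicas fail independently and that failover is instantaneous, which real systems only approximate. The function name and the numbers are illustrative.

```python
def redundant_availability(single_availability: float, replicas: int) -> float:
    """Availability of a component with `replicas` interchangeable copies.

    The component is down only when every copy is down at the same time.
    Assumes independent failures and instant changeover (both optimistic).
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - single_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, redundant_availability(0.9, n))
```

A shared dependency such as a common power supply breaks the independence assumption, which is why the slide asks for spares of everything and not only of servers.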

Page 15: COMP5348

Available System

Web Clients

Web Server farm Load balanced using WLB

App Servers farm using COM+ LB

Database installed on cluster for high availability

Page 16: COMP5348

Availability

› Often a question of application design

- Stateful vs stateless

- What happens if a server fails?

- Can requests go to any server?

- Synchronous method calls or asynchronous messaging?

- Reduce dependency between components

- Failure tolerant designs

- And manageability decisions to consider

Page 17: COMP5348

Redundancy=Availability

› Passive or active standby systems

- Re-route requests on failure

- Continuous service (almost)

- Recover failed system while alternative handles workload

- May be some hand-over time (db recovery?)

- Active standby & log shipping reduce this

- At the expense of 2x system cost…

› What happens to in-flight work?

- State recovers by aborting in-flight ops & doing db recovery but …

Page 18: COMP5348

Transaction Recovery

› Could be handled by middleware

- Persistent queues of accepted requests

- Still a failure window though

› Large role for client apps/users

- Did the request get lost on failure?

- Retry on error?

› Large role for server apps

- What to do with duplicate requests?

- Try for idempotency (repeated txns OK)

- Or track and reject duplicates
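
One way a server can make client retries safe is to track request identifiers. The sketch below is a variant of "track and reject duplicates": a retried request is not applied again, and the caller receives the result of the first execution. The class, the `request_id` parameter and the in-memory table are assumptions for illustration. A real server would have to persist the table in the same transaction as the update.

```python
class DedupingAccountService:
    """Toy account service whose deposits are safe to retry."""

    def __init__(self) -> None:
        self.balance = 0
        # request_id -> balance returned by the first execution
        self._completed = {}

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._completed:
            # Duplicate (for example, a client retry after a lost reply):
            # do not apply the deposit again.
            return self._completed[request_id]
        self.balance += amount
        self._completed[request_id] = self.balance
        return self.balance


if __name__ == "__main__":
    service = DedupingAccountService()
    print(service.deposit("req-1", 50))  # applied
    print(service.deposit("req-1", 50))  # retry, not applied again
    print(service.balance)
```

The client's part of the bargain is to reuse the same identifier when it retries, which is what lets the server tell a retry from a new request.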

Page 19: COMP5348

Outline

› Scalability

› Availability

› Scale-up versus Scale-out

Page 20: COMP5348

Scalability - with increasing hardware

› Adding more hardware should improve performance:

- More resources to do the extra work

- Ideal: constant throughput and constant response time, if hardware grows proportionally with workload

- And scaling must be achieved without modifications to application architecture

› Reality as always is different!

- There are overhead costs

- to manage each request, get it to the resources etc

- Eventually, system hits the limits of that design

Page 21: COMP5348

Scalability – Add more hardware …

[Diagram: scale-up versus scale-out]

Scale-out: Application replicated on different machines

Scale-up: Single application instance is executed on a multiprocessor machine

Page 22: COMP5348

Hardware for scalability

› Scale-up or…

- Use bigger and faster systems

› … Scale-out

- Systems working together to handle load

- Server farms

- Clusters

› Implications for application design

- Especially state management

- And availability as well

Page 23: COMP5348

Scale-up

› Could be easy to manage

- One box, one supplier, uniform system

- Add processors, memory, …. as needed

- SMP (symmetric multiprocessing)

› Runs into limits eventually

› Hardware costs more

- A commodity system costs less than 1/n of a special niche-market system which is n times larger

› Could be less available

- What happens on failures? Redundancy?

Page 24: COMP5348

Scale-up

› eBay example

- Server farm of Windows boxes (scale-out)

- Single database server (scale-up)

- 64-processor SUN box (max at time)

- More capacity needed?

- Easily add more boxes to Web farm

- Faster DB box? (not available)

- More processors? (not possible)

- Split DB load across multiple DB servers?

- See eBay presentation…

Page 25: COMP5348

Scaling Out

› More boxes at every level

- Web servers (handling user interface)

- App servers (running business logic)

- Database servers (perhaps… a bit tricky?)

- Just add more boxes to handle more load

› Spread load out across boxes

- Load balancing at every level

- Partitioning or replication for database?

- Impact on application design?

- Impact on system management

- All have impacts on architecture & operations

Page 26: COMP5348

Scaling Out

UI tier Business tier Data tier

Page 27: COMP5348

‘Load Balancing’

› A few different but related meanings

- Distributing client bindings across servers or processes

- Needed for stateful systems

- Static allocation of client to server

- Balancing requests across server systems or processes

- Dynamically allocating requests to servers

- Normally only done for stateless systems

Page 28: COMP5348

Static Load Balancing

[Diagram: three clients, a name server and two server processes. Server processes advertise their service to the name server; a client requests a server reference, the name server returns it, and the client then gets a server object reference and calls the server object's methods.]

Load balancing across application process instances within a server

Page 29: COMP5348

Load Balancing in CORBA

› Client calls on name server to find the location of a suitable server

- CORBA terminology for object directory

› Name server can spread client objects across multiple servers

- Often ‘round robin’

› Client is bound to server and stays bound forever

- Can lead to performance problems if server loads are unbalanced
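
The static binding described above can be sketched as follows. This is not the CORBA Naming Service API: `NameServer`, `advertise`, `resolve` and `Client` are hypothetical names chosen to mirror the slide, and server references are plain strings.

```python
class NameServer:
    """Toy directory that hands out server references round robin."""

    def __init__(self) -> None:
        self._servers = []
        self._handed_out = 0

    def advertise(self, server_ref: str) -> None:
        """Called by a server process during its initialisation."""
        self._servers.append(server_ref)

    def resolve(self) -> str:
        """Called by a client to obtain a server reference."""
        if not self._servers:
            raise LookupError("no servers advertised")
        ref = self._servers[self._handed_out % len(self._servers)]
        self._handed_out += 1
        return ref


class Client:
    def __init__(self, name_server: NameServer) -> None:
        # Bound once at start-up; never re-resolved afterwards.
        self.server = name_server.resolve()

    def call(self, request: str) -> str:
        return f"{self.server} handled {request}"


if __name__ == "__main__":
    ns = NameServer()
    ns.advertise("server-A")
    ns.advertise("server-B")
    clients = [Client(ns) for _ in range(3)]
    print([c.server for c in clients])
```

Round robin balances the number of bound clients, not the work they generate. If the clients bound to one server happen to be the busy ones, that server stays overloaded because nothing ever rebinds them.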

Page 30: COMP5348

Name Servers

› Server processes call name server as part of their initialisation

- Advertising their services/objects

› Clients call name server to find the location of a server process/object

- Up to the name server to match clients to servers

› Client then directly calls server process to create or link to objects

- Client-object binding usually static

Page 31: COMP5348

Dynamic Stateful?

› Dynamic load balancing with stateful servers/objects?

- Clients can throw away server objects and get new ones every now and again

- In application code or middleware

- Have to save & restore state

- Or object replication in middleware

- Identical copies of objects on all servers

- Replication of changes between servers

- Clients have references to all copies

Page 32: COMP5348

BEA WLS Load Balancing

[Diagram: Clients; EJB cluster of EJB server instances on Machine A and Machine B; heartbeat via multicast backbone; DBMS]

Page 33: COMP5348

Threaded Servers

› No need for load-balancing within a single system

- Multithreaded server process

- Thread pool servicing requests

- All objects live in a single process space

- Any request can be picked up by any thread
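
A minimal sketch of the thread pool model, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The handler, the shared dictionary and the pool size are illustrative assumptions. The point is that all worker threads see the same objects, so any thread can service any request.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Objects shared by every worker thread in the single server process.
shared_objects = {"requests_served": 0}
shared_lock = threading.Lock()


def handle_request(request_id: int) -> str:
    """Run by whichever pool thread picks the request up."""
    with shared_lock:  # shared state still needs synchronisation
        shared_objects["requests_served"] += 1
    return f"request {request_id} done"


def serve(request_ids, pool_size: int = 4):
    """Dispatch all requests to a fixed-size pool and collect the replies."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(handle_request, request_ids))


if __name__ == "__main__":
    print(serve(range(5)))
    print(shared_objects)
```

No load balancer is needed inside the process because the pool's work queue does the job, but the shared object space means contention moves into locks like the one above.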

› Used by modern app servers

Page 34: COMP5348

Threaded Servers

[Diagram, e.g. COM+: clients call a single COM+ process containing a thread pool, a shared object space and application code (App DLL)]

Page 35: COMP5348

Dynamic Load Balancing

› Dynamically balance load across servers

- Requests from a client can go to any server

› Requests dynamically routed

- Often used for Web Server farms

- IP sprayer (Cisco etc)

- Network Load Balancer etc

› Routing decision has to be fast & reliable

- Routing in main processing path

› Applications normally stateless
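
Per-request routing can be sketched with a "fewest requests in flight" policy. The slide does not name a policy, so this choice, the class and the server names are all assumptions. The sketch works only because the servers are taken to be stateless, so any of them can handle any request.

```python
class DynamicBalancer:
    """Routes each request to the server with the fewest requests in flight."""

    def __init__(self, servers) -> None:
        self.in_flight = {server: 0 for server in servers}

    def route(self) -> str:
        # A cheap decision, since it sits on the main processing path.
        # Ties go to the server listed first.
        server = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[server] += 1
        return server

    def complete(self, server: str) -> None:
        """Record that a request routed to `server` has finished."""
        self.in_flight[server] -= 1


if __name__ == "__main__":
    lb = DynamicBalancer(["web1", "web2"])
    first, second = lb.route(), lb.route()
    lb.complete(second)
    print(first, second, lb.route())
```

Unlike the static name server binding, consecutive requests from the same client may land on different servers here.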

Page 36: COMP5348

Web Server Farms

› Web servers are highly scalable

- Web applications are normally stateless

- Next request can go to any Web server

- State comes from client or database

- Just need to spread incoming requests

- IP sprayers (hardware, software)

- Or >1 Web server looking at same IP address with some coordination

Page 37: COMP5348

Scaling State Stores?

Scaling stateless logic is easy

…but how are state stores scaled?

› Bigger, faster box (if this helps at all)

- Could hit lock contention or I/O limits

› Replication

- Multiple copies of shared data

- Apps access their own state stores

- Change anywhere & send to everyone
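
"Change anywhere & send to everyone" can be sketched as below. The sketch is deliberately naive: propagation is synchronous, and there is no conflict resolution or failure handling, all of which a real replicated store must address. The class and function names are hypothetical.

```python
class Replica:
    """One copy of the shared state, local to an application instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data = {}
        self.peers = []

    def write(self, key, value) -> None:
        """Apply a change locally, then send it to every other replica."""
        self.data[key] = value
        for peer in self.peers:
            peer.apply(key, value)

    def apply(self, key, value) -> None:
        """Receive a change that originated on another replica."""
        self.data[key] = value

    def read(self, key):
        """Reads are served from the local copy only."""
        return self.data.get(key)


def connect(replicas) -> None:
    """Make every replica a peer of every other replica."""
    for replica in replicas:
        replica.peers = [p for p in replicas if p is not replica]


if __name__ == "__main__":
    a, b, c = Replica("a"), Replica("b"), Replica("c")
    connect([a, b, c])
    b.write("price", 42)
    print(a.read("price"), c.read("price"))
```

Reads scale with the number of replicas, but every write costs one message per peer, so write-heavy workloads gain little from this approach.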

Page 38: COMP5348

Scaling State Stores

› Partitioning

- Multiple servers, each looking after a part of the state store

- Separate customers A-M & N-Z

- Split customers according to state

- Preferably transparent to apps

- e.g. SQL Server partitioned views

› Or combination of these approaches
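
The A-M / N-Z split can be sketched as a routing function that an application or a data-access layer might call before touching the store. The partition names are hypothetical, and hiding this decision from the application (as partitioned views do) is preferable where available.

```python
def partition_for(customer_name: str) -> str:
    """Pick the state store partition from the first letter of the name."""
    first = customer_name.strip()[:1].upper()
    if not first.isalpha():
        raise ValueError(f"cannot partition name {customer_name!r}")
    # Hypothetical partition identifiers.
    return "db_a_to_m" if first <= "M" else "db_n_to_z"


if __name__ == "__main__":
    for name in ("Fekete", "Greenfield", "Roehm"):
        print(name, "->", partition_for(name))
```

A fixed alphabetical split is easy to reason about but can be uneven if names cluster on a few letters. Any query that spans both ranges has to visit both servers.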

Page 39: COMP5348

Scale-out operational consequences

› Purchases done incrementally, so hard to manage

- Machines in the system are not all the same

- May even be from different manufacturers, certainly with different clock speeds, memory capacities etc

- Probably different versions of software

Page 40: COMP5348

Scaling Out Summary

[Diagram: Web server farm (Network Load Balancing) in the UI tier; application farm (Component Load Balancing) in the business tier; database servers (Cluster Services and partitioning) in the data tier, partitioned into Districts 1-10 and Districts 11-20]

Page 41: COMP5348

Clusters

› A group of independent computers acting like a single system

- Shared disks

- Single IP address

- Single set of services

- Fail-over to other members of cluster

- Load sharing within the cluster

- DEC, IBM, MS, …

Page 42: COMP5348

Clusters

[Diagram: Client PCs; Server A and Server B, linked by a heartbeat; Disk cabinet A and Disk cabinet B; cluster management]

Page 43: COMP5348

Clusters

› Address scalability

- Add more boxes to the cluster

- Replication or shared storage

› Address availability

- Fail-over

- Add & remove boxes from the cluster for upgrades and maintenance

› Can be used as one element of a highly-available system
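
Heartbeat-driven fail-over can be sketched as follows. The class, the timeout and the preference order are assumptions for illustration. Time is passed in explicitly to keep the example deterministic, and real cluster software must also deal with split-brain situations and shared-disk ownership, which are ignored here.

```python
class FailoverCluster:
    """Chooses the first member whose heartbeat is recent enough."""

    def __init__(self, members, timeout_s: float) -> None:
        self.members = list(members)  # in order of preference
        self.timeout_s = timeout_s
        self.last_heartbeat = {member: 0.0 for member in self.members}

    def heartbeat(self, member: str, now: float) -> None:
        """Record that `member` was alive at time `now` (in seconds)."""
        self.last_heartbeat[member] = now

    def active_member(self, now: float) -> str:
        """Return the member that should currently own the services."""
        for member in self.members:
            if now - self.last_heartbeat[member] <= self.timeout_s:
                return member
        raise RuntimeError("no live cluster members")


if __name__ == "__main__":
    cluster = FailoverCluster(["server-A", "server-B"], timeout_s=5.0)
    print(cluster.active_member(now=3.0))   # both members still fresh
    cluster.heartbeat("server-B", now=8.0)  # server-A has gone silent
    print(cluster.active_member(now=10.0))
```

The timeout sets the detection component of an outage: a shorter timeout fails over sooner but risks treating a slow member as a dead one.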