Best Practices for Large-Scale Web Sites

Lessons from eBay

Brian Ko

• 276,000,000 registered users

• stores over 2 Petabytes of data

• over 1 billion page views per day

• 113 million items for sale in over 50,000 categories

• 2 billion photos

• 300+ features per quarter

• Rolls 100,000+ lines of code every two weeks

• In 39 countries, in 7 languages, 24x7x365

• 48 Billion SQL executions/day!

• Figures as of 2008

Design Goals

• Scalability

– Resource usage should increase linearly (or better!) with load

– Design for 10x growth in data, traffic, users, etc.

• Availability

– Resilience to failure

– Graceful degradation

– Recoverability from failure

Design Goals (continued)

• Latency

– User experience, data latency

• Manageability

– Simplicity, maintainability

– Provide diagnostics

• Cost

– Development effort and complexity

– Operational cost (TCO)

Architecture Considerations

• Partition everything

– “You eat an elephant only one bite at a time”

• Asynchrony everywhere

– “Good things come to those who wait”

• Automate everything

– “Automation will save time and eliminate human errors…”

• Assume everything fails

– “Be prepared”

Partition Everything: Split

• Split every problem into manageable chunks

– “If you can’t split it, you can’t scale it”

– By data, load, and/or usage pattern

– For example, there are thousands of databases

Partition Everything: Motivation

• Scalability: can scale horizontally and independently

• Availability: can isolate failures

• Manageability: can decouple different segments and functional areas

• Cost: can use less expensive hardware

Partition Everything: Databases

• Functional Segmentation

– Segment databases into functional areas: user, item, transaction, product, account, feedback

– Over 1000 logical databases on over 400 physical hosts

• Horizontal Split

– Split (or “shard”) databases horizontally along the primary access path
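A horizontal split along the primary access path can be sketched as follows. This is a minimal illustration, not eBay's actual scheme: the shard names, the `shard_for` function, and the in-memory `ShardedUserStore` are all hypothetical, and hash-modulo routing is just one common way to map a key to a shard.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database hosts.
SHARDS = ["user_db_0", "user_db_1", "user_db_2", "user_db_3"]


def shard_for(user_id: str, shards=SHARDS) -> str:
    """Pick a shard from a stable hash of the primary access key."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


class ShardedUserStore:
    """In-memory stand-in for a set of horizontally split databases."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        # One dict per shard plays the role of one physical database.
        self.tables = {name: {} for name in self.shards}

    def put(self, user_id: str, record: dict) -> None:
        self.tables[shard_for(user_id, self.shards)][user_id] = record

    def get(self, user_id: str):
        return self.tables[shard_for(user_id, self.shards)].get(user_id)
```

Every read and write for a given user touches exactly one shard, so shards can be added to spread load. Note that plain modulo routing moves most keys when the shard count changes; a lookup table or consistent hashing would be a likely refinement.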

No Database Transactions

• eBay’s transaction policy

– Absolutely no client-side transactions, two-phase commit, etc.

– Auto-commit for the vast majority of DB writes

• Consistency is not always required or possible

– To guarantee availability and partition tolerance, we are forced to trade off consistency (Brewer’s CAP theorem)

• Consistency without transactions

– Careful ordering of DB operations

– Eventual consistency through asynchronous events or reconciliation batches
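The two techniques named above, careful ordering and a reconciliation batch, might look like the sketch below. Everything here is an assumption made for illustration: the `Store` class stands in for an auto-committing database, and the bid/item example and the `place_bid` and `reconcile` functions are hypothetical, not eBay code.

```python
class Unavailable(Exception):
    """Raised when a simulated database write fails."""


class Store:
    """In-memory stand-in for one auto-committing database."""

    def __init__(self):
        self.rows = {}
        self.fail_next_write = False  # test hook to simulate an outage

    def write(self, key, value):
        if self.fail_next_write:
            self.fail_next_write = False
            raise Unavailable("write failed")
        self.rows[key] = value


def place_bid(bids: Store, items: Store, bid_id, item_id, amount):
    # Step 1: write the source of truth first. If this fails, nothing changed.
    bids.write(bid_id, {"item_id": item_id, "amount": amount})
    # Step 2: update the derived "current price". If this fails, the bid is
    # still recorded and the price is merely stale until reconcile() runs.
    current = items.rows.get(item_id, 0)
    items.write(item_id, max(current, amount))


def reconcile(bids: Store, items: Store) -> int:
    """Batch job: recompute each item's price from its bids; return repairs."""
    highest = {}
    for bid in bids.rows.values():
        item_id = bid["item_id"]
        highest[item_id] = max(highest.get(item_id, 0), bid["amount"])
    repaired = 0
    for item_id, amount in highest.items():
        if items.rows.get(item_id) != amount:
            items.rows[item_id] = amount
            repaired += 1
    return repaired
```

The ordering is chosen so that a failure between the two writes leaves a state that is detectable and repairable rather than corrupt, which is what lets the system do without a transaction spanning both databases.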

Partition Everything: Application Tier

• Over 17,000 application servers in 220 pools

• Functional Segmentation

– Segment functions into separate application pools

– Allows for parallel development, deployment, and monitoring

– Minimizes DB / resource dependencies

• Horizontal Split

– Within a pool, all application servers are created equal

Partition Everything: Application Tier (continued)

• User session flow moves through multiple application pools

• Absolutely no session state

• Transient state maintained by

– URL, cookie, scratch database
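One way to carry transient state in a cookie or URL parameter, so that no application server holds session state, is a signed token. The sketch below is an assumption about how this could be done, not a description of eBay's implementation; `SECRET_KEY`, `encode_state`, and `decode_state` are hypothetical names, and the payload is signed but not encrypted.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret; every server in every pool would need the same key.
SECRET_KEY = b"example-only-secret"


def encode_state(state: dict, key: bytes = SECRET_KEY) -> str:
    """Serialize state into a cookie/URL-safe token with an HMAC signature."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_state(token: str, key: bytes = SECRET_KEY):
    """Return the state dict, or None if the token is malformed or tampered with."""
    payload, sep, signature = token.rpartition(".")
    if not sep:
        return None
    payload_bytes = payload.encode("utf-8")
    expected = hmac.new(key, payload_bytes, hashlib.sha256).hexdigest()
    # Constant-time comparison; decode the payload only after it verifies.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload_bytes))
```

Because any server can verify and read the token, a user's flow can move between application pools without sticky sessions. State too large or too sensitive for a cookie would go to the scratch database instead.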

Async Everywhere

• Prefer Asynchronous Processing

– Where possible, integrate disparate components asynchronously

• Motivations

– Scalability: can scale components independently

– Availability: can decouple availability state; can retry operations

– Latency: can significantly improve user-experience latency at the cost of data/execution latency; can allocate more time to processing than a user would tolerate

– Cost: can spread peak load over time
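The decoupling and retry motivations can be illustrated with a queue between the request path and a slower downstream component. This is a single-process sketch under stated assumptions: `FlakyIndexer`, `submit_listing`, and `drain` are hypothetical names, and a production system would use a durable message queue and separate worker processes rather than `queue.Queue`.

```python
import queue


class FlakyIndexer:
    """Hypothetical downstream component that fails its first N calls."""

    def __init__(self, failures: int):
        self.failures = failures
        self.indexed = []

    def index(self, item: str) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("indexer unavailable")
        self.indexed.append(item)


def submit_listing(events: queue.Queue, item: str) -> str:
    """Request path: enqueue the slow work and return to the user immediately."""
    events.put({"item": item, "attempts": 0})
    return "accepted"


def drain(events: queue.Queue, indexer: FlakyIndexer, max_attempts: int = 3):
    """Worker path: process queued events, retrying failures up to max_attempts.

    Returns the events that exhausted their retries.
    """
    dead = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return dead
        try:
            indexer.index(event["item"])
        except ConnectionError:
            event["attempts"] += 1
            if event["attempts"] < max_attempts:
                events.put(event)  # retry later, behind other queued work
            else:
                dead.append(event)
```

The user sees "accepted" even while the indexer is down; the cost is that the listing appears in the index later, which is the data-latency trade-off the slide describes.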

Async Everywhere: Batch

• Scheduled offline batch processes are appropriate for

– Infrequent, periodic, or scheduled processing

– Non-incremental computation (a.k.a. “full table scan”)

• Examples

– Import data (catalogs, currency, etc.)

– Generate recommendations (items, products, searches, etc.)

– Process items at end of auction

Automate Everything: Motivation

• Scalability

– Can scale with machines, not humans

• Availability / Latency

– Can adapt to a changing environment more rapidly

• Cost

– Machines are far less expensive than humans

– Can learn / improve / adjust over time without manual effort

Automate Everything: Deployment

• Challenge

– Need to deploy the application to over 17,000 application servers at the same time

• Solution

– Deploy the application in advance with the new feature switch turned off

– Turn on the switch through an automated process on the target date

– Makes rollback easier
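The deploy-dark, switch-on-later approach can be sketched with a date-driven feature switch. The `FeatureSwitches` class, the `new_checkout` feature name, and `render_checkout` are hypothetical; in practice the switch values would live in shared configuration that all servers read, which is assumed rather than shown here.

```python
from datetime import datetime, timezone


class FeatureSwitches:
    """Maps a feature name to its activation time; None means switched off."""

    def __init__(self):
        self._activate_at = {}

    def register(self, name: str, activate_at=None) -> None:
        # Code ships with the switch registered; activate_at is set by an
        # automated process rather than by a redeploy.
        self._activate_at[name] = activate_at

    def turn_off(self, name: str) -> None:
        # Rollback: flip the switch instead of redeploying old code.
        self._activate_at[name] = None

    def is_on(self, name: str, now=None) -> bool:
        activate_at = self._activate_at.get(name)
        if activate_at is None:
            return False
        if now is None:
            now = datetime.now(timezone.utc)
        return now >= activate_at


def render_checkout(switches: FeatureSwitches, now=None) -> str:
    """Both code paths are deployed; the switch chooses between them."""
    if switches.is_on("new_checkout", now):
        return "new checkout flow"
    return "old checkout flow"
```

Because the new code is already on every server before the target date, turning the feature on or off is a configuration change that takes effect everywhere without a simultaneous rollout.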

Assume Everything Fails

• Build all systems to be tolerant of failure

– Assume every operation will fail and every resource will be unavailable

– Rapid failure detection and recovery

– Do as much as possible during failure

• Motivation

– Availability

Assume Everything Fails (continued)

• Rollback

– Absolutely no changes to the site which cannot be undone (!)

• Failure Detection

– Real-time application state monitoring: exceptions and operational alerts

– “Resource slow” is often far more challenging than “resource down”

Assume Everything Fails: Graceful Degradation

• Application “marks down” the resource

– Stops making calls to it and sends an alert

• Non-critical functionality is removed or ignored

• Critical functionality is retried or deferred

– Fail over to an alternate resource

– Defer processing to an async event

• Explicit “markup”

– Allows the resource to be restored and brought online in a controlled way
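The markdown / markup cycle resembles what is often called a circuit breaker, with the difference that restoring the resource is an explicit step. The sketch below assumes a simple consecutive-failure threshold; the `Resource` class, its `failure_threshold` parameter, and `fetch_with_failover` are hypothetical and not taken from the talk.

```python
class MarkedDown(Exception):
    """Raised when a call is attempted on a marked-down resource."""


class Resource:
    def __init__(self, name, call, failure_threshold=3, alert=print):
        self.name = name
        self._call = call
        self._alert = alert
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.is_marked_down = False

    def invoke(self, *args):
        if self.is_marked_down:
            # Stop making calls to a resource that is known to be failing.
            raise MarkedDown(self.name)
        try:
            result = self._call(*args)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.is_marked_down = True
                self._alert(f"{self.name} marked down")
            raise
        self.consecutive_failures = 0
        return result

    def mark_up(self):
        """Explicit markup: an operator or tool restores the resource."""
        self.is_marked_down = False
        self.consecutive_failures = 0


def fetch_with_failover(primary: Resource, alternate: Resource, *args):
    """Critical path: fall back to the alternate resource if the primary fails."""
    try:
        return primary.invoke(*args)
    except Exception:
        return alternate.invoke(*args)
```

Non-critical callers could instead catch `MarkedDown` and skip the feature entirely. A "resource slow" condition would need a timeout around the call to be counted as a failure, which this sketch leaves out.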

Summary

• Partition everything

• Asynchrony everywhere

• Automate everything

• Assume everything fails

The End

Questions
