Humans by the hundred

Humans By The Hundred: Scaling Big Data for Big Team Growth

$ whoami
● SRE Manager at Yelp
● CWRU Alum
● Pittsburgh native
● <3 Web Operations
● Just a dude

Yelp’s Mission: Connecting people with great local businesses.

Yelp Stats: As of Q2 2015

83M    32    68%    83M

What is Yelp?
● Many sites: www, m, biz, api
● Mobile apps
● Partner platform
● Hundreds of developers
● Thousands of servers

Why Am I Here?

DATA

This talk is about people

The Goal

Iterate as fast as possible

Regardless of how many people are participating

Deployment

How It Starts

Deployment: the early days
● Get a few people together in slack/irc/etc.
● Merge up the code
● Run the tests
● Manually test it in stage
● Cross your fingers

Things get slower...
● Tests take longer to run
● More hosts = longer downloads
● More developers = more eyeballs
● More features = more code

The Problem: Humans Are Fallible

“…oh @$#&”

The Problem, With Math
Assume:
● Every change has a chance of success: 98%
  ● That means no test failures, no reverts, etc.
● Every deploy has a number of changes: n
● Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a successful deployment: p

The Problem, With Math
● Only you: p = .98 (98%)
● You and a friend: p = .98 * .98 = .96 (96%)
● You and nine co-workers: p = .98 * .98 * .98 * … * .98 = .82 (82%)

The Problem, With Math

p = (.98)^n

exponential decay!
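
To make the decay concrete, here is a minimal Python sketch that evaluates p = (.98)^n for a few deploy sizes and finds the point where a deploy is more likely to fail than succeed. It assumes, as the slides do, that changes succeed independently at 98%. The function names and the sample values of n are illustrative and not from the talk.

```python
import math

PER_CHANGE_SUCCESS = 0.98  # the talk's assumed per-change success rate


def deploy_success_probability(n, per_change=PER_CHANGE_SUCCESS):
    """Probability that all n independent changes in a deploy succeed."""
    return per_change ** n


def changes_until_below(threshold, per_change=PER_CHANGE_SUCCESS):
    """Smallest n for which the deploy success probability is below threshold."""
    # Solve per_change ** n < threshold for n, then round up to a whole change.
    return math.floor(math.log(threshold) / math.log(per_change)) + 1


if __name__ == "__main__":
    for n in (1, 2, 10, 25, 50, 100):  # sample sizes chosen for illustration
        print(f"n = {n:3d}  p = {deploy_success_probability(n):.2f}")
    print("worse than a coin flip at n =", changes_until_below(0.5))
```

Under these assumptions a deploy with 35 changes already fails more often than it succeeds.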

This doesn’t scale!
● More developers = more changes
● More changes = longer deploys
● Longer deploys = less time to develop
● Less time to develop = slower to iterate
● Slower to iterate != the goal

Mitigating Exponential Decay

p = (.98)^n

One lever is the base: raise the per-change success rate.

Making it harder to screw up
● Write more tests
● Write better tests
● Get better code reviews
● Get better infrastructure
● Switch programming languages
● Use better tools

Just write better software and stop making mistakes!

PROBLEM SOLVED

The Real World
● Testing builds confidence in our changes
● Testing does not protect you from failure
● Better tools, tests, and infrastructure can raise our success rates

Mitigating Exponential Decay

p = (.98)^n

The other lever is the exponent: reduce n, the number of changes per deploy.

Service-Oriented Architecture
● Large monolith → smaller services
● Services communicate over network
  ● Usually HTTP, but you can do RPC, SOAP, etc.
● Service = independent code base
● Independent deployments

Service-Oriented Architecture: Benefits
● Smaller code bases = upper bound to n
● Failure domains become isolated
● Technology independence
● Federated responsibility
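
A rough sketch of why an upper bound on n helps, using the same 98% per-change success rate and assuming changes are spread evenly across independently deployed services. The figures of 50 pending changes and 5 services are hypothetical.

```python
PER_CHANGE_SUCCESS = 0.98  # per-change success rate assumed earlier in the talk


def deploy_success_probability(n, per_change=PER_CHANGE_SUCCESS):
    """Probability that a deploy containing n independent changes succeeds."""
    return per_change ** n


def per_service_probability(total_changes, services):
    """Success probability of one service's deploy when changes are split evenly."""
    changes_per_service = -(-total_changes // services)  # ceiling division
    return deploy_success_probability(changes_per_service)


if __name__ == "__main__":
    total = 50  # hypothetical number of changes waiting to ship
    print(f"monolith:   p = {deploy_success_probability(total):.2f}")
    print(f"5 services: p = {per_service_probability(total, 5):.2f} per deploy")
```

The chance that every one of the 50 changes is good does not improve. What changes is that a bad change invalidates only its own service's deploy of about 10 changes instead of the whole batch.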

Service-Oriented Architecture: Drawbacks
● Everything becomes decoupled
● Function calls start looking like HTTP requests
● Versioning can be a nightmare
● Tracking dependencies is hard
● Data consistency becomes challenging
● End-to-end testing becomes hard(er), if not impossible

SOA scales people, not code.

Conquering SOA
With the monolith, it’s easy to focus on mean time between failures (MTBF)

Conquering SOA
In a SOA, focus on mean time to recovery (MTTR)
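
One common way to see why MTTR deserves the attention is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is not from the talk, and the hour values are hypothetical. It compares a system that fails rarely but recovers slowly with one that fails ten times as often but recovers in minutes.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical numbers: fails about once a month, takes 4 hours to fix.
    slow_recovery = availability(mtbf_hours=720, mttr_hours=4)
    # Hypothetical numbers: fails about every 3 days, recovers in 6 minutes.
    fast_recovery = availability(mtbf_hours=72, mttr_hours=0.1)
    print(f"rare failures, slow recovery:     {slow_recovery:.4f}")
    print(f"frequent failures, fast recovery: {fast_recovery:.4f}")
```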

Conquering SOA
● Fail fast
● Anticipate failure
● Leverage iteration speed to recover fast

Conquering SOA
● Treat everything as distributed
  ● That means everything will fail
  ● Use timeouts, retries
  ● Find ways to degrade gracefully
● Fail fast & isolated
  ● Don’t rely on synchronous processes
  ● Prepare for eventual consistency
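
The timeout, retry, and graceful-degradation advice above can be sketched as a small Python helper. This is an illustration under stated assumptions, not Yelp's implementation: `call_with_retries`, `fetch_top_businesses`, and the service URL are all hypothetical names, and only the standard library is used.

```python
import time
import urllib.request


def call_with_retries(request, attempts=3, backoff_seconds=0.1, fallback=None):
    """Call request() up to `attempts` times, then degrade to `fallback`.

    `request` is any zero-argument callable that enforces its own timeout
    and raises OSError on failure (socket timeouts, connection errors and
    urllib's URLError are all OSError subclasses).
    """
    for attempt in range(attempts):
        try:
            return request()
        except OSError:
            # Back off exponentially, but do not sleep after the last attempt.
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of raising.
    return fallback


def fetch_top_businesses():
    """Hypothetical call to another service, failing fast after 0.5 seconds."""
    url = "http://businesses.internal.example/top"  # assumed URL, not a real service
    with urllib.request.urlopen(url, timeout=0.5) as response:
        return response.read()
```

A caller would write `call_with_retries(fetch_top_businesses, fallback=b"[]")`, so the page still renders with an empty list when the dependency is down.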

Reaping the Benefits
● Smaller failure domains
● Fewer people & changes to manage
● Deploys get smaller
● Deploys get faster
● Deploys become continuous

Reaping the Benefits
Smaller changes
● means smaller code reviews
● means faster validation
● means smaller blast radius
● means faster iteration

Continuous Delivery
● Everyone works against master branch
● Master is deployed when commits are added
  ● Deployment gated by tests
● Monitoring knows something is wrong before you do!

PROBLEM SOLVED

Testing

Tests are hard to get right.

How can we do better?

“Not Recommended” Tests
If a test fails on master:
● a feature is broken on the live website, or
● your test sucks and you should ditch it
In either case, we disable it
● Ticket is created
● Developers can fix it later or just bin it and start fresh
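
The slides do not say how the disabling is implemented. As a minimal sketch of the idea, the Python below marks a test as not recommended with the standard `unittest.skip` decorator and records the ticket in the skip reason. The `not_recommended` helper, the test case, and the ticket id are all hypothetical.

```python
import unittest


def not_recommended(ticket):
    """Disable a test that failed on master, recording its tracking ticket."""
    return unittest.skip(f"not recommended, see {ticket}")


class SearchTests(unittest.TestCase):  # hypothetical test case
    def test_reliable(self):
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])

    @not_recommended("TICKET-123")  # hypothetical ticket id
    def test_unreliable(self):
        self.fail("fails intermittently on master")


def run_suite():
    """Run the example tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SearchTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    for test, reason in result.skipped:
        print(f"disabled: {test.id()} ({reason})")
```

The disabled test no longer blocks a deploy, and the skip reason keeps the ticket visible in every test report until someone fixes or deletes the test.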

Reliable tests >> test coverage.

Don’t always run all the tests!

Tests of external services should be monitoring

Define your boundaries.

yelp.com/dataset_challenge
● 61K businesses
● 61K checkin-sets
● 481K business attributes
● 1.6M reviews
● 366K users
● 2.8M edge social-graph
● 495K tips

Your academic project, research or visualizations, submitted by Dec 31, 2015 = $5,000 prize + $1,000 for publication + $500 for presenting*

*See full terms on website

Academic dataset from 10 cities in 4 countries!

@YelpEngineering

YelpEngineers

engineeringblog.yelp.com

github.com/yelp

yelp.com/careers

Questions?