DESCRIPTION
The world of Scalability is glamorous and magical: we get to use cool technologies with sexy names and magical things happen, we have a tough problem and we throw a Scalability Technology at it, and results flow quickly and easily...or so it seems from the outside. The experience of the Whitepages Data Group has been quite different. We have significant, challenging problems of scale, both in the size of our data artifacts, and in the number of updates we have to process to keep them accurate and up to date. But our day-to-day lives are much more mundane than the glamorous world where Hadoop meets NoSQL meets Scala, and results flow smoothly, and bigger problems are solved by bolting on a node or two. While we are heavy users of Hadoop, we are typically running EC2 instances in the hundreds and are exploring three or four alternatives to Postgres. We have found no "silver bullet" technology, cutting-edge or otherwise. Rather, for us, scaling problems tend to be insidious and move around a lot, and we feel like firefighters (or worse, like mole-whackers) at least as often as we feel like Data Scientists. I will provide some case studies, examples and rants, most of which don't involve Hadoop or other magic bullets.
Scalability Sucks (working title)
June 6, 2013
Steve Hanks, Principal Data Scientist, WhitePages.com
Scalability Perception Versus (Our) Reality
• Perception
– Scalability is about technology, and adopting the right technology gives you scalability
o You want to believe it (the technology is fun)
o Sales people want you to believe it
• Reality
– Problems are complex and solutions are inter-related
– Scalability problems are rarely isolated to one facet of the solution
o A solution to one symptom tends to push the problem somewhere else (one thing leads to another)
– Scaling problems are rarely known at inception
o Tipping (over) points
WhitePages Confidential
Brooks: “No Silver Bullet – Essence and Accidents of Software Engineering” (1986)
• Separating “essential difficulties” from “accidental difficulties”
– Technologies address the latter, but at best free us to work on the problem features that are inherently difficult
• The mistake we make is thinking that technologies that address accidental difficulties in any sense solve the harder problems
– Then, Compilers, IDEs
– Now, Distributed Databases, NoSQL, MapReduce
• The message is the same: a technology can (help) solve your problem if
– It’s a simple problem and the technology is exactly the right tool, or
– Applying the technology can effectively solve one piece of a complex problem
Case Study: Scale, and More Scaling
The WhitePages Data Ecosystem
Search/API
Data Build(s)
Purchased Data
Internally Sourced Data
Core Data
Data Size (approximate)
• 300M Persons
• 150M Addresses
• 135M Businesses
• 120M Telephones
• 1.4B Address Links
• 400M Telephone Links
Volume (monthly)
• 50M Unique Users (website)
• 35M Mobile Downloads (total)
• 1B API calls
• 600M Mobile-Related events
• 1.2B Data Inputs (purchased + internal)
Scalability Challenges (Ours. Actually. Recently.)
Search/API
Data Build(s)
Purchased Data
Internally Sourced Data
Core Data
The Q4/Q1 Scalability Storm
• Address History
• Close the Loop With our Internal Applications
– Phone Meta-Data
o Track more phones, especially mobile phones (e.g. carrier, call or search velocity)
» It’s very interesting to know whether a phone number is active, and when a number gets ported, for example
– Person/Phone links
o We get information from internal clients about phone numbers, calls, and names
o Sometimes we can infer a new link between a Person and a Phone
» Subject to some very strict privacy constraints
• Challenges Perceived at the Time
– Address History
o Size of the Core Database grows
o Time to do the build increases (by 25 to 50%?)
– Internal Data
o No big scaling problem anticipated, but need a new component to process the event queue and stage the data for the build (Data Normalization)
Data Normalization – Quick Way to Catch Internal Events
[Diagram: internal apps (Current, PPMatcher) deliver events (searches, phone calls) via ActiveMQ; Normalization stages CNAM and Phone Attributes data for the Data Build]
End of Q4
• Stable in production. Processing ~ 100K records per hour
• Clearing the backlog in about 13 minutes
• Nice general solution for tasks that need certain libraries or read/write to DBs (Hadoop doesn’t work well for these)
• Well integrated into our build process
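The pattern on this slide (events delivered over a message queue, normalized, then staged for the build) can be sketched roughly as follows. This is an illustrative stand-in, not WhitePages code: the event source is simulated with a stdlib queue rather than ActiveMQ, and `normalize_phone` is an invented placeholder for the real normalization logic.

```python
import queue
import re

def normalize_phone(raw: str) -> str:
    """Hypothetical normalizer: strip punctuation, keep the last 10 digits (US)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else digits

def drain_and_stage(events: queue.Queue) -> list:
    """Drain the event queue, normalize each record, and stage it for the build."""
    staged = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        staged.append({"phone": normalize_phone(event["phone"]),
                       "source": event["source"]})
    return staged

# Simulated events; the real ones arrived from internal apps over ActiveMQ
q = queue.Queue()
q.put({"phone": "(206) 555-0100", "source": "Current"})
q.put({"phone": "+1 206-555-0199", "source": "PPMatcher"})
print(drain_and_stage(q))
```

The appeal of this shape, per the slide, is that each worker is an ordinary process that can link arbitrary libraries and talk to databases, which batch frameworks like Hadoop handle poorly.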
Scaling Tipping Points Q4
• Historical address information is partially on board
– Increases size of CoreDB by 25%, reliability suffers due to stress
• IT finally looks carefully at our Q4 equipment asks, and is appalled
• Normalization is getting strained
– Greater adoption of the Current app results in a 5x increase in call records
– New data providers and other applications increase load on Normalization by 10x
– DB size (phones) increases beyond 20M; local Postgres is getting cranky
– Bottom line: finishing 1 hour worth of inputs in 12 hours. Not encouraging.
• General mood is good. Multiple scaling issues, but it’s OK because we have a silver bullet: AWS
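The arithmetic behind that Normalization bullet is worth spelling out: clearing 1 hour of input every 12 wall-clock hours means the backlog grows by 11 input-hours for every 12 hours the system runs. A back-of-envelope sketch (the rates are illustrative, taken from the slide):

```python
def backlog_after(wall_hours: float, input_rate: float, clear_rate: float) -> float:
    """Unprocessed input (in input-hours) after `wall_hours` of wall-clock time,
    given arrival and clearing rates in input-hours per wall-clock hour."""
    return max(0.0, wall_hours * (input_rate - clear_rate))

# 1 hour of inputs takes 12 hours to finish => clearing rate of 1/12
clear_rate = 1 / 12
print(round(backlog_after(24, 1.0, clear_rate), 1))  # backlog after one day (~22 input-hours)
```

The point of the exercise: once the clearing rate drops below the arrival rate, the backlog grows without bound, and no amount of catch-up time helps until throughput itself is fixed.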
Q1/Q2 – Biting the Silver Bullet (or vice versa)
• Transition our whole build to EMR
– Looks easy, it’s all just Pig!
– Will save a ton of money
– No need for cluster administrator
– Scaling problems are solved, forever!
• Move our core database (contact graph) to RDS
– Looks easy, it’s just an RDB!
– Will save a ton of money
– No need for a DBA
– Scaling problems are solved, forever!
• Move normalization to the cloud by using hundreds of EC2 instances
– Looks easy, they’re just Linux boxes
– Will save a ton of money
– Deployment problems are solved
– Scaling problems are solved, forever!
Q2/Q3 – One Thing Leads to Another
• Expected and unexpected overall adoption problems with AWS
– Credentials, regions, etc.
– Carefully estimate time/cost to get data in and out
• EMR transition fairly smooth
• Normalization transition less smooth (to 200+ workers)
– Need new debugging paradigms (e.g. image changes, viewing local state)
– Shared resources (shared filesystems, databases)
– Dealing with race conditions and bad actors
– Downstream implications of 200+ workers
– Upstream implications of 200+ workers
• Unexpected new requirements
– For accounting purposes need to log all calls to external data providers
• Database transition less smooth
– Too many conflicting use cases for a clear-choice technology
– Uneasy peace with Redshift
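One standard defense against the hung-worker and bad-actor problems above is to give every call to a shared or remote resource a hard deadline, so a single stuck call cannot stall the rest of a 200-worker fleet. A minimal sketch using only the Python standard library (`remote_lookup` is an invented stand-in for an external data provider, not an actual WhitePages interface):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def remote_lookup(record_id: int) -> str:
    """Invented stand-in for a call to an external provider; record 2 hangs."""
    if record_id == 2:
        time.sleep(2)  # simulates a hung remote process
    return f"result-{record_id}"

def process_batch(record_ids, timeout_s: float = 0.5) -> dict:
    """Run lookups in parallel, abandoning any call that misses its deadline.

    Note: result(timeout=...) stops waiting but does not kill the worker
    thread; a real system would also log the failure and retry or
    quarantine the record."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {rid: pool.submit(remote_lookup, rid) for rid in record_ids}
        for rid, fut in futures.items():
            try:
                results[rid] = fut.result(timeout=timeout_s)
            except TimeoutError:
                results[rid] = None  # hung call: move on instead of blocking
    return results

print(process_batch([1, 2, 3]))
```

The same deadline-and-quarantine idea applies to long database locks: bound every wait, and treat a timeout as data about a misbehaving participant rather than as something to wait out.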
Q4 – Stabilizing
• As We Speak
– Transition to EMR is complete; the full build works smoothly and takes about 50% of the time
o Some combination of the technology and our re-engineering
– Transition to Redshift is complete, though the story isn’t over
o We’re about to make our peace with the fact that we need multiple data representations, and all that implies
– Normalization is still in flux, still under stress
o Not having anticipated the scaling implications for our phone database continues to bedevil us
» Exploring NoSQL solutions
– Reliably stabilized and at full scale by end of year
Wrap-Up
• We did a lot of scaling. It was messy, and took longer than we thought
– Because we were dumb?
– Because we didn’t plan well enough?
– Because it’s inherently messy, and Brooks was necessarily right about incremental development?
• We accomplished really awesome things, and I still don’t have a single cool or insightful thing to say about specific technologies!
• Things we did well
– Maintained focus on problems, not technologies
– Designed well enough that there were options for solving problems when they arose
• Things we could have done better
– Don’t combine a lot of scaling with a lot of re-architecting (?)
o Don’t change too many things at once
– Anticipate second-order problems (120M phone numbers on a single Postgres installation was never going to work)
– Don’t believe that throwing 200 x at any problem will work, for any x
– Pay more attention from the beginning to standard problems in distributed processing (hung remote processes, long locks, race conditions)
– Know more about Postgres
Conclusion
• Scaling really does suck
• It really isn’t primarily about the technology
• You need
– Understanding of and focus on the problem, not on solution technologies
– At the same time, knowledge of the possible technology tools
– To be diligent about standard best practices in design and engineering