DESCRIPTION
The world of Scalability is glamorous and magical: we get to use cool technologies with sexy names and magical things happen, we have a tough problem and we throw a Scalability Technology at it, and results flow quickly and easily...or so it seems from the outside. The experience of the Whitepages Data Group has been quite different. We have significant, challenging problems of scale, both in the size of our data artifacts, and in the number of updates we have to process to keep them accurate and up to date. But our day-to-day lives are much more mundane than the glamorous world where Hadoop meets NoSQL meets Scala, and results flow smoothly, and bigger problems are solved by bolting on a node or two. While we are heavy users of Hadoop, we are typically running EC2 instances in the hundreds and are exploring three or four alternatives to Postgres. We have found no "silver bullet" technology, cutting-edge or otherwise. Rather, for us, scaling problems tend to be insidious and move around a lot, and we feel like firefighters (or worse, like mole-whackers) at least as often as we feel like Data Scientists. I will provide some case studies, examples and rants, most of which don't involve Hadoop or other magic bullets.
Scalability Sucks (working title)
June 6, 2013
Steve Hanks, Principal Data Scientist, WhitePages.com
Scalability Perception Versus (Our) Reality
• Perception
– Scalability is about technology, and adopting the right technology gives you scalability
o You want to believe it (the technology is fun)
o Sales people want you to believe it
• Reality
– Problems are complex and solutions are inter-related
– Scalability problems are rarely isolated to one facet of the solution
o A solution to one symptom tends to push the problem somewhere else (one thing leads to another)
– Scaling problems are rarely known at inception
o Tipping (over) points
WhitePages Confidential
Brooks: “No Silver Bullet – Essence and Accidents of Software Engineering” (1986)
• Separating “essential difficulties” from “accidental difficulties”
– Technologies address the latter, but at best free us to work on the problem features that are inherently difficult
• The mistake we make is thinking that technologies that address accidental difficulties in any sense solve the harder problems
– Then, Compilers, IDEs
– Now, Distributed Databases, NoSQL, MapReduce
• The message is the same: a technology can (help) solve your problem if
– It’s a simple problem and the technology is exactly the right tool, or
– Applying the technology can effectively solve one piece of a complex problem
Case Study: Scale, and More Scaling
The WhitePages Data Ecosystem
Search/API
Data Build(s)
Purchased Data
Internally Sourced Data
Core Data
Data Size (approximate)
• 300M Persons
• 150M Addresses
• 135M Businesses
• 120M Telephones
• 1.4B Address Links
• 400M Telephone Links
Volume (monthly)
• 50M Unique Users (website)
• 35M Mobile Downloads (total)
• 1B API calls
• 600M Mobile-Related events
• 1.2B Data Inputs (purchased + internal)
Scalability Challenges (Ours. Actually. Recently.)
Search/API
Data Build(s)
Purchased Data
Internally Sourced Data
Core Data
The Q4/Q1 Scalability Storm
• Address History
• Close the Loop With our Internal Applications
– Phone Meta-Data
o Track more phones, especially mobile phones (e.g. carrier, call or search velocity)
» It’s very interesting to know whether a phone number is active, and when a number gets ported, for example
– Person/Phone links
o We get information from internal clients about phone numbers, calls, and names
o Sometimes we can infer a new link between a Person and a Phone
» Subject to some very strict privacy constraints
• Challenges Perceived at the Time
– Address History
o Size of the Core Database grows
o Time to do the build increases (by 25 to 50%?)
– Internal Data
o No big scaling problem anticipated, but need a new component to process the event queue and stage the data for the build (Data Normalization)
Data Normalization – Quick Way to Catch Internal Events
[Diagram: internal apps (Current, PPMatcher) deliver events (searches, phone calls) via ActiveMQ; Normalization stages CNAM and Phone Attributes data for the Data Build]
End of Q4
• Stable in production. Processing ~ 100K records per hour
• Clearing the backlog in about 13 minutes
• Nice general solution for tasks that need certain libraries or read/write to DBs (Hadoop doesn’t work well for these)
• Well integrated into our build process
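The pattern on this slide (events delivered over a message queue, normalized, then staged for the build) can be sketched roughly as follows. This is an illustrative stand-in, not WhitePages code: the event source is simulated with a stdlib queue rather than ActiveMQ, and `normalize_phone` is an invented placeholder for the real normalization logic.

```python
import queue
import re

def normalize_phone(raw: str) -> str:
    """Hypothetical normalizer: strip punctuation, keep the last 10 digits (US)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else digits

def drain_and_stage(events: queue.Queue) -> list:
    """Drain the event queue, normalize each record, and stage it for the build."""
    staged = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        staged.append({"phone": normalize_phone(event["phone"]),
                       "source": event["source"]})
    return staged

# Simulated events; the real ones arrived from internal apps over ActiveMQ
q = queue.Queue()
q.put({"phone": "(206) 555-0100", "source": "Current"})
q.put({"phone": "+1 206-555-0199", "source": "PPMatcher"})
print(drain_and_stage(q))
```

The appeal of this shape, per the slide, is that each worker is an ordinary process that can link arbitrary libraries and talk to databases, which batch frameworks like Hadoop handle poorly.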
Scaling Tipping Points Q4
• Historical address information is partially on board
– Increases size of CoreDB by 25%, reliability suffers due to stress
• IT finally looks carefully at our Q4 equipment asks, and is appalled
• Normalization is getting strained
– Greater adoption of the Current app results in a 5x increase in call records
– New data providers and other applications increase load on Normalization by 10x
– DB size (phones) increases beyond 20M; local Postgres is getting cranky
– Bottom line: finishing 1 hour worth of inputs in 12 hours. Not encouraging.
• General mood is good. Multiple scaling issues, but it’s OK because we have a silver bullet: AWS
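The arithmetic behind that Normalization bullet is worth spelling out: clearing 1 hour of input every 12 wall-clock hours means the backlog grows by 11 input-hours for every 12 hours the system runs. A back-of-envelope sketch (the rates are illustrative, taken from the slide):

```python
def backlog_after(wall_hours: float, input_rate: float, clear_rate: float) -> float:
    """Unprocessed input (in input-hours) after `wall_hours` of wall-clock time,
    given arrival and clearing rates in input-hours per wall-clock hour."""
    return max(0.0, wall_hours * (input_rate - clear_rate))

# 1 hour of inputs takes 12 hours to finish => clearing rate of 1/12
clear_rate = 1 / 12
print(round(backlog_after(24, 1.0, clear_rate), 1))  # backlog after one day (~22 input-hours)
```

The point of the exercise: once the clearing rate drops below the arrival rate, the backlog grows without bound, and no amount of catch-up time helps until throughput itself is fixed.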
Q1/Q2 – Biting the Silver Bullet (or vice versa)
• Transition our whole build to EMR
– Looks easy, it’s all just Pig!
– Will save a ton of money
– No need for cluster administrator
– Scaling problems are solved, forever!
• Move our core database (contact graph) to RDS
– Looks easy, it’s just an RDB!
– Will save a ton of money
– No need for a DBA
– Scaling problems are solved, forever!
• Move normalization to the cloud by using hundreds of EC2 instances
– Looks easy, they’re just Linux boxes
– Will save a ton of money
– Deployment problems are solved
– Scaling problems are solved, forever!
Q2/Q3 – One Thing Leads to Another
• Expected and unexpected overall adoption problems with AWS
– Credentials, regions, etc.
– Carefully estimate time/cost to get data in and out
• EMR transition fairly smooth
• Normalization transition less smooth (to 200+ workers)
– Need new debugging paradigms (e.g. image changes, viewing local state)
– Shared resources (shared filesystems, databases)
– Dealing with race conditions and bad actors
– Downstream implications of 200+ workers
– Upstream implications of 200+ workers
• Unexpected new requirements
– For accounting purposes need to log all calls to external data providers
• Database transition less smooth
– Too many conflicting use cases for a clear-choice technology
– Uneasy peace with Redshift
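One standard defense against the hung-worker and bad-actor problems above is to give every call to a shared or remote resource a hard deadline, so a single stuck call cannot stall the rest of a 200-worker fleet. A minimal sketch using only the Python standard library (`remote_lookup` is an invented stand-in for an external data provider, not an actual WhitePages interface):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def remote_lookup(record_id: int) -> str:
    """Invented stand-in for a call to an external provider; record 2 hangs."""
    if record_id == 2:
        time.sleep(2)  # simulates a hung remote process
    return f"result-{record_id}"

def process_batch(record_ids, timeout_s: float = 0.5) -> dict:
    """Run lookups in parallel, abandoning any call that misses its deadline.

    Note: result(timeout=...) stops waiting but does not kill the worker
    thread; a real system would also log the failure and retry or
    quarantine the record."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {rid: pool.submit(remote_lookup, rid) for rid in record_ids}
        for rid, fut in futures.items():
            try:
                results[rid] = fut.result(timeout=timeout_s)
            except TimeoutError:
                results[rid] = None  # hung call: move on instead of blocking
    return results

print(process_batch([1, 2, 3]))
```

The same deadline-and-quarantine idea applies to long database locks: bound every wait, and treat a timeout as data about a misbehaving participant rather than as something to wait out.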
Q4 – Stabilizing
• As We Speak
– Transition to EMR is complete; the full build works smoothly and takes about 50% of the time
o Some combination of the technology and our re-engineering
– Transition to Redshift is complete, though the story isn’t over
o We’re about to make our peace with the fact that we need multiple data representations, and all that implies
– Normalization is still in flux, still under stress
o Not having anticipated the scaling implications for our phone database continues to bedevil us
» Exploring NoSQL solutions
– Reliably stabilized and at full scale by end of year
Wrap-Up
• We did a lot of scaling. It was messy, and took longer than we thought
– Because we were dumb?
– Because we didn’t plan well enough?
– Because it’s inherently messy, and Brooks was necessarily right about incremental development?
• We accomplished really awesome things, and I still don’t have a single cool or insightful thing to say about specific technologies!
• Things we did well
– Maintained focus on problems, not technologies
– Designed well enough that there were options for solving problems when they arose
• Things we could have done better
– Don’t combine a lot of scaling with a lot of re-architecting (?)
o Don’t change too many things at once
– Anticipate second-order problems (120M phone numbers on a single Postgres installation was never going to work)
– Don’t believe that throwing 200 x at any problem will work, for any x
– Pay more attention from the beginning to standard problems in distributed processing (hung remote processes, long locks, race conditions)
– Know more about Postgres
Conclusion
• Scaling really does suck
• It really isn’t primarily about the technology
• You need
– Understanding of and focus on the problem, not on solution technologies
– At the same time, knowledge of the possible technology tools
– To be diligent about standard best practices in design and engineering