The Past, Present, and Future of Hadoop at LinkedIn


GSK 2014

The Past, Present, and Future of Hadoop @ LinkedIn

Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn

The (Not So) Distant Past

PYMK (People You May Know)

First version implemented in 2006

6-8 million members

Ran on Oracle (foreshadowing!)

Found various overlaps: school, work, etc.

Used common connections

Triangle closing (?)

-Since "People You May Know" is long, we call it PYMK at LinkedIn.

-The original version ran on Oracle.

-It worked by trying to find overlaps between any pair of people. Did they attend the same school? Did they work at the same company?

-One big indicator was common connections, and we used something called triangle closing.

Triangle Closing


-Triangle closing is an easy concept to follow

-Mary knows Dave and Steve

-We make a guess that Dave may also know Steve. This is essentially what this feature does: we close that triangle.

-Additionally, if Dave and Steve share more than one connection, we can be more confident in our guess.
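The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the original Oracle or Hadoop implementation: the function name `triangle_closing`, the in-memory dictionary, and the extra member "Anna" are assumptions made for the example. It counts, for every pair of members who are not yet connected, how many connections they share; a higher count means a more confident suggestion.

```python
from collections import Counter
from itertools import combinations


def triangle_closing(connections):
    """Count shared connections for each pair of members not yet connected.

    `connections` maps a member to the set of members they know and is
    assumed to be symmetric (if Mary knows Dave, Dave knows Mary).
    """
    scores = Counter()
    for member, friends in connections.items():
        # Every pair of this member's friends forms an open or closed triangle.
        for a, b in combinations(sorted(friends), 2):
            if b not in connections.get(a, set()):
                # a and b are not connected yet: `member` is one shared connection.
                scores[(a, b)] += 1
    return scores


# Hypothetical graph: Mary and Anna both know Dave and Steve.
graph = {
    "Mary": {"Dave", "Steve"},
    "Anna": {"Dave", "Steve"},
    "Dave": {"Mary", "Anna"},
    "Steve": {"Mary", "Anna"},
}
suggestions = triangle_closing(graph)
```

Here `suggestions[("Dave", "Steve")]` is 2, because Dave and Steve share two connections. Enumerating every pair of friends for every member is what makes this expensive at scale: the work grows with the square of each member's connection count.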

PYMK Problems

By 2008, 40-50 million members

Still running on Oracle

Failed often

Infrequent data refresh: 6 weeks, at worst 6 months!

-Three years later, LinkedIn was growing fast, to 40-50 million members. I joined about this time as a member of the data products group.

-We still used Oracle to create PYMK data, and it may not surprise people to hear that we had scalability problems.

-In fact it failed often and required a lot of manual intervention. When it succeeded, it took about 6 weeks to produce new results, by which time the data was most likely stale.

-At its worst, PYMK had so many problems that no new data appeared on our site for 6 months.

Humble Beginnings Back in '08

Success! (circa 2009)

Apache Hadoop 0.20

20 node cluster (repurposed hardware)

PYMK in 3 days!

-We tried other solutions. I won't name them, even though some of them were well known. None of them could solve our scale problem.

-So we started a 20 node Hadoop cluster, pretty much on bad hardware that we stole or repurposed from our research and development servers without anyone really knowing.

-We really didn't know what we were doing, and our cluster was misconfigured.

-But it solved PYMK in 3 days.

-So everything was good... well, we'll see.

The Present

Hadoop @ LinkedIn Circa 2016

> 10 Clusters

> 10,000 Nodes

> 1,000 Users

Thousands of workflows, datasets, and ad-hoc queries

MR, Pig, Hive, Gobblin, Cubert, Scalding, Tez, Spark, Presto,

Two Types of Scaling Challenges


People and Processes

Scaling Machines

Some Tough Talk About HDFS

Conventional wisdom holds that HDFS

Scales to > 4k nodes without federation*

Scales to > 8k nodes with federation*

What's been our experience?

Many Apache releases won't scale past a couple thousand nodes

Vendor distros usually aren't much better

Why?

Scale testing happens after the release, not before

Most vendors have only a handful of customers with clusters larger than 1k nodes

* Heavily dependent on NN RPC workload, block size, average file size, average container size, etc, etc

March 2015 Was Not a Good Month

What Happened?

We rapidly added 500 nodes to a 2000 node cluster (don't do this!)

NameNode RPC queue length and wait time skyrocketed

Jobs crawled to a halt

What Was the Cause?

A subtle performance/scale regression was introduced upstream

The bug was included in multiple releases

Increased time to allocate a new file

The more nodes you had, the worse it got

How We Used to Do Scale Testing

1. Deploy the release to a small cluster (num_nodes = 100)
2. See if anything breaks
3. If no, then deploy to the next largest cluster and go to step 2
4. If yes, figure out what went wrong and fix it

Problems with this approach

Expensive: developer time + hardware

Risky: sometimes you can't roll back!

Doesn't always work: overlooks non-linear regressions
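To make the last point concrete, here is a rough Python sketch of the staged rollout described above. Everything in it is hypothetical: `staged_rollout`, the cluster sizes, and the quadratic cost model are invented for illustration and do not describe the actual upstream bug. It shows how a regression whose cost grows faster than linearly can pass every small cluster and only surface on the largest one.

```python
def staged_rollout(cluster_sizes, is_healthy):
    """Deploy to clusters from smallest to largest, stopping at the first failure.

    Returns (sizes deployed successfully, size where a problem first appeared),
    with None as the second element if every cluster stayed healthy.
    """
    deployed = []
    for size in sorted(cluster_sizes):
        if not is_healthy(size):
            return deployed, size
        deployed.append(size)
    return deployed, None


def healthy_under_quadratic_regression(num_nodes, budget=4_000_000):
    """Hypothetical model: overhead grows with the square of the node count."""
    return num_nodes ** 2 <= budget


deployed, failed_at = staged_rollout(
    [100, 500, 1000, 2500], healthy_under_quadratic_regression
)
```

Under these made-up numbers the release looks fine at 100, 500, and 1000 nodes, and the problem first appears at 2500 nodes, which is the most expensive cluster to break and the hardest to roll back.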

HDFS Dynamometer

Scale testing and performance investigation tool for HDFS

High fidelity in all the dimensions that matter

Focused on the NameNode

Completely black-box

Accurately fakes thousands of DNs on a small fraction of the hardware

More details in forthcoming blog post

Scaling People and Processes


Hadoop Performance Tuning

Why Are Most User Jobs Poorly Tuned?

Too many dials!

Lots of frameworks: each one is slightly different.

Performance can change over time. Tuning requires constant monitoring and maintenance!

* Tuning decision tree from Hadoop In Practice


Dr Elephant: Running Light Without Overbyte

Automated Performance Troubleshooting for Hadoop Workflows

Detects Common MR and Spark Pathologies:

Mapper Data Skew

Reducer Data Skew

Mapper Input Size

Mapper Speed

Reducer Time

Shuffle & Sort

More!

Explains Cause of Disease

Guided Treatment Process
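As an illustration of what a data skew check might look like, here is a deliberately simple Python sketch. It is not Dr Elephant's actual heuristic: the function `looks_skewed`, the max-to-median ratio, and the threshold of 4.0 are all assumptions chosen for the example. The idea is only that a job is suspicious when one task reads far more input than a typical task.

```python
from statistics import median


def looks_skewed(task_input_bytes, ratio_threshold=4.0):
    """Flag a job whose largest task reads far more input than the median task.

    `task_input_bytes` is a list with one input size per mapper or reducer task.
    The threshold is an arbitrary illustrative value, not a tuned constant.
    """
    if len(task_input_bytes) < 2:
        return False  # a single task cannot be skewed relative to others
    typical = median(task_input_bytes)
    if typical == 0:
        return max(task_input_bytes) > 0
    return max(task_input_bytes) / typical >= ratio_threshold


# Hypothetical per-task input sizes in MB.
balanced = [128, 130, 125, 127]
skewed = [128, 130, 125, 2048]
```

With these sample inputs, `looks_skewed(balanced)` is false and `looks_skewed(skewed)` is true, since one task reads roughly sixteen times the median.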

Grab the source

Read the blog

Dr Elephant is Now Open Source

Upgrades are Hard

A totally fictional story:

The Hadoop team pushes a new Pig upgrade

The next day thirty flows fail with ClassNotFoundExceptions

Angry users riot

Property damage exceeds $30mm

What happened?

The flows depended on a third-party UDF that depended on a transitive dependency provided by the old version of Pig, but not the new version of Pig

Bringing Shading Out of the Shadows

What most people think it is

Package artifact and all dependencies in the same JAR + rename some or all of the package names

What it really is

Static linking for Java

Unfairly maligned by many people

We built an improved Gradle plugin that makes shading easier for inexperienced users

Byte-Ray: X-Ray Goggles for JAR Files

Audit Hadoop flows for incompatible and unnecessary dependencies.

Predict failures before they happen by scanning for dependencies that won't be satisfied post-upgrade.

Proved extremely useful during Hadoop2 migration
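The kind of check described here can be sketched as a set difference. This is not how Byte-Ray is implemented; the function `unsatisfied_after_upgrade` and the class names under `org.example` and `com.example` are hypothetical, and a real tool would extract these sets by scanning JAR files rather than taking them as literals. The sketch mirrors the Pig story: a class the flow needs was supplied transitively by the old platform and is gone after the upgrade.

```python
def unsatisfied_after_upgrade(flow_needs, flow_bundles, platform_provides):
    """Return classes a flow references that nothing on its classpath supplies.

    flow_needs:        classes referenced by the flow and its UDFs
    flow_bundles:      classes packaged inside the flow's own JARs
    platform_provides: classes supplied by the platform (for example, Pig)
    """
    return sorted(set(flow_needs) - set(flow_bundles) - set(platform_provides))


# Hypothetical class sets; the old Pig release happened to ship a utility class.
old_pig = {"org.apache.pig.EvalFunc", "org.example.util.Strings"}
new_pig = {"org.apache.pig.EvalFunc"}

flow_needs = {
    "org.apache.pig.EvalFunc",
    "org.example.util.Strings",  # reached only through the third-party UDF
    "com.example.udf.Clean",
}
flow_bundles = {"com.example.udf.Clean"}

missing_before = unsatisfied_after_upgrade(flow_needs, flow_bundles, old_pig)
missing_after = unsatisfied_after_upgrade(flow_needs, flow_bundles, new_pig)
```

Before the upgrade nothing is missing; after it, `missing_after` contains `org.example.util.Strings`, which is the class that would surface at runtime as a ClassNotFoundException. Shading that dependency into the flow's own JAR would move it into `flow_bundles` and make the flow independent of what the platform ships.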

Byte-Ray in Action

SoakCycle: Real World Integration Testing

The Future?

Dali

2015 was the year of the table

We want to make 2016 the year of the view

Learn more at the Dali talk tomorrow

2014 LinkedIn Corporation. All Rights Reserved.