The Past, Present, and Future of Hadoop @ LinkedIn

Carl Steinbach
Senior Staff Software Engineer
Data Analytics Infrastructure Group
LinkedIn

The (Not So) Distant Past

PYMK (People You May Know)

First version implemented in 2006
6-8 million members
Ran on Oracle (foreshadowing!)
Found various overlaps: school, work, etc.
Used common connections: triangle closing

Triangle Closing

[Diagram: Dave is connected to both Mary and Steve; a dashed "?" edge asks whether Mary and Steve know each other.]
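Triangle closing in one sentence: if Dave is connected to both Mary and Steve, there is a fair chance Mary and Steve know each other too. Below is a minimal Kotlin sketch of the idea on an in-memory graph; the production PYMK ran as distributed batch jobs, not like this.

fun peopleYouMayKnow(
    member: String,
    connections: Map<String, Set<String>>  // member -> direct connections
): List<Pair<String, Int>> {
    val direct = connections[member] ?: emptySet()
    val counts = mutableMapOf<String, Int>()

    // Every two-hop neighbor closes a triangle with us; count how
    // many common connections support each candidate.
    for (friend in direct) {
        for (candidate in connections[friend] ?: emptySet()) {
            if (candidate != member && candidate !in direct) {
                counts.merge(candidate, 1, Int::plus)
            }
        }
    }
    // Rank candidates by the number of triangles they would close.
    return counts.entries.sortedByDescending { it.value }.map { it.key to it.value }
}

fun main() {
    val graph = mapOf(
        "Dave" to setOf("Mary", "Steve"),
        "Mary" to setOf("Dave"),
        "Steve" to setOf("Dave")
    )
    println(peopleYouMayKnow("Mary", graph)) // [(Steve, 1)]
}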

PYMK Problems

By 2008, 40-50 million members
Still running on Oracle
Failed often
Infrequent data refresh: 6 weeks to 6 months!

Humble Beginnings Back in ’08

Success! (circa 2009)

Apache Hadoop 0.20
20-node cluster (repurposed hardware)
PYMK in 3 days!

The Present

Hadoop @ LinkedIn Circa 2016

> 10 clusters
> 10,000 nodes
> 1,000 users

Thousands of workflows, datasets, and ad-hoc queries

MR, Pig, Hive, Gobblin, Cubert, Scalding, Tez, Spark, Presto, …

Two Types of Scaling Challenges

Machines

People and Processes

Scaling Machines

Some Tough Talk About HDFS

Conventional wisdom holds that HDFS:
Scales to > 4k nodes without federation*
Scales to > 8k nodes with federation*

What’s been our experience?
Many Apache releases won’t scale past a couple thousand nodes
Vendor distros usually aren’t much better

Why?
Scale testing happens after the release, not before
Most vendors have only a handful of customers with clusters larger than 1k nodes

* Heavily dependent on NN RPC workload, block size, average file size, average container size, etc.

March 2015 Was Not a Good Month

What Happened?

We rapidly added 500 nodes to a 2,000-node cluster (don’t do this!)

NameNode RPC queue length and wait time skyrocketed

Jobs crawled to a halt

What Was the Cause?

A subtle performance/scale regression was introduced upstream

The bug was included in multiple releases

Increased time to allocate a new file

The more nodes you had, the worse it got

How We Used to Do Scale Testing

1. Deploy the release to a small cluster (num_nodes = 100)
2. See if anything breaks
3. If no, deploy to the next largest cluster and go to step 2
4. If yes, figure out what went wrong and fix it

Problems with this approach:
Expensive: developer time + hardware
Risky: sometimes you can’t roll back!
Doesn’t always work: overlooks non-linear regressions


HDFS Dynamometer

• Scale testing and performance investigation tool for HDFS
• High fidelity in all the dimensions that matter
• Focused on the NameNode
• Completely black-box
• Accurately fakes thousands of DNs on a small fraction of the hardware
• More details in a forthcoming blog post
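Dynamometer’s details are in the forthcoming post; as a much simpler, hypothetical illustration of black-box NameNode stress testing, the sketch below drives metadata-only RPCs from many client threads through the standard Hadoop FileSystem API and times each file allocation (the operation that regressed in March 2015). The cluster URI, thread counts, and thresholds are made up.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import kotlin.system.measureNanoTime

// Crude black-box NameNode stress sketch (not Dynamometer itself):
// many clients issuing metadata-only create() calls, watching latency.
fun main() {
    val conf = Configuration()
    conf.set("fs.defaultFS", "hdfs://test-namenode:8020") // hypothetical test cluster

    val threads = 64                 // illustrative sizing
    val filesPerThread = 1_000
    val pool = Executors.newFixedThreadPool(threads)

    repeat(threads) { t ->
        pool.execute {
            val fs = FileSystem.newInstance(conf)    // separate client per thread
            for (i in 0 until filesPerThread) {
                val path = Path("/loadtest/t$t/f$i")
                // Time the file-allocation path; a zero-byte create is
                // almost pure NameNode metadata work.
                val nanos = measureNanoTime { fs.create(path, true).close() }
                if (nanos > 100_000_000) {           // > 100 ms: flag it
                    println("slow create: $path took ${nanos / 1_000_000} ms")
                }
            }
            fs.close()
        }
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
}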

Scaling People and Processes


Hadoop Performance Tuning


Why Are Most User Jobs Poorly Tuned?

Too many dials!
Lots of frameworks: each one is slightly different.
Performance can change over time.
Tuning requires constant monitoring and maintenance!

* Tuning decision tree from “Hadoop in Practice”
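To make “too many dials” concrete: even a single MapReduce job exposes dozens of knobs through the standard Configuration API. A few real ones appear in the Kotlin sketch below; the values are arbitrary placeholders, not tuning advice.

import org.apache.hadoop.conf.Configuration

// A handful of the dials a single MapReduce job exposes.
// Values are arbitrary placeholders, not recommendations; the right
// settings depend on input size, skew, and cluster shape, and they
// drift as all of those change over time.
fun main() {
    val conf = Configuration()
    conf.setInt("mapreduce.map.memory.mb", 2048)          // mapper container size
    conf.set("mapreduce.map.java.opts", "-Xmx1638m")      // heap must fit the container
    conf.setInt("mapreduce.reduce.memory.mb", 4096)       // reducer container size
    conf.setInt("mapreduce.job.reduces", 200)             // reducer parallelism
    conf.setInt("mapreduce.task.io.sort.mb", 512)         // map-side sort buffer
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 20) // shuffle fetchers
    // ...and that is one framework; Pig, Hive, Spark, and the rest
    // each layer their own knobs on top.
}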


Dr. Elephant: Running Light Without Overbyte
Automated Performance Troubleshooting for Hadoop Workflows

● Detects Common MR and Spark Pathologies:
  ○ Mapper Data Skew
  ○ Reducer Data Skew
  ○ Mapper Input Size
  ○ Mapper Speed
  ○ Reducer Time
  ○ Shuffle & Sort
  ○ More!
● Explains Cause of Disease
● Guided Treatment Process
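As a hypothetical sketch of what one of these checks boils down to (not Dr. Elephant’s actual code), a mapper data-skew heuristic can compare each task’s input size against the average:

// Hypothetical mapper data-skew heuristic in the spirit of Dr. Elephant
// (not its actual code): flag a job when one mapper reads far more
// input than its peers, since that straggler dominates the runtime.
data class SkewReport(val skewed: Boolean, val maxBytes: Long, val avgBytes: Long)

fun checkMapperSkew(mapperInputBytes: List<Long>, ratio: Double = 4.0): SkewReport {
    val avg = mapperInputBytes.average().toLong().coerceAtLeast(1L)
    val max = mapperInputBytes.maxOrNull() ?: 0L
    return SkewReport(skewed = max > ratio * avg, maxBytes = max, avgBytes = avg)
}

fun main() {
    // Made-up per-mapper input sizes in bytes: three even splits, one giant.
    val inputs = listOf(128_000_000L, 130_000_000L, 125_000_000L, 1_900_000_000L)
    println(checkMapperSkew(inputs)) // SkewReport(skewed=true, ...)
}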

Upgrades Are Hard

A totally fictional story:
The Hadoop team pushes a new Pig upgrade
The next day thirty flows fail with ClassNotFoundExceptions
Angry users riot
Property damage exceeds $30mm

What happened? The flows depended on a third-party UDF that depended on a transitive dependency provided by the old version of Pig, but not the new version of Pig.

Bringing Shading Out of the Shadows

What most people think it is:
Package an artifact and all of its dependencies in the same JAR, renaming some or all of the package names

What it really is:
Static linking for Java

Unfairly maligned by many people

We built an improved Gradle plugin that makes shading easier for inexperienced users
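That plugin is internal; as a generic illustration of the same technique, here is how the widely used open-source Shadow plugin expresses shading and relocation in a build.gradle.kts (the plugin version, dependency, and package names are just examples):

// build.gradle.kts -- generic shading example using the open-source
// Shadow plugin, not LinkedIn's internal plugin. Relocation renames
// packages inside the fat JAR, which is why shading is really static
// linking for Java: the UDF keeps its own private copy of Guava no
// matter which version a future Pig happens to ship (or drop).
plugins {
    java
    id("com.github.johnrengelman.shadow") version "8.1.1" // example version
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("com.google.guava:guava:20.0") // example pinned dependency
}

tasks.shadowJar {
    // Rename Guava's packages so they cannot collide with (or depend on)
    // whatever copy the cluster's Pig provides.
    relocate("com.google.common", "myudf.shaded.com.google.common")
}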


Byte-Ray: “X-Ray Goggles for JAR Files”

Audit Hadoop flows for incompatible and unnecessary dependencies
Predict failures before they happen by scanning for dependencies that won’t be satisfied post-upgrade
Proved extremely useful during the Hadoop 2 migration
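As a deliberately simplified, hypothetical sketch of the core idea (not Byte-Ray’s actual implementation), one can index every class packaged in a flow’s JARs with java.util.jar and flag classes that appear in more than one artifact:

import java.io.File
import java.util.jar.JarFile

// Hypothetical, much-simplified cousin of the Byte-Ray idea: build an
// index of class name -> containing JARs, then report duplicates.
// Which duplicate wins at runtime depends on classpath order, which is
// exactly the kind of surprise that bites after an upgrade.
fun main(args: Array<String>) {
    val classOwners = mutableMapOf<String, MutableList<String>>()

    for (jarPath in args) {
        JarFile(File(jarPath)).use { jar ->
            for (entry in jar.entries()) {
                if (entry.name.endsWith(".class")) {
                    val className = entry.name.removeSuffix(".class").replace('/', '.')
                    classOwners.getOrPut(className) { mutableListOf() }.add(jarPath)
                }
            }
        }
    }

    classOwners.filterValues { it.size > 1 }.forEach { (cls, jars) ->
        println("DUPLICATE $cls -> $jars")
    }
}

The harder half of the problem, scanning for classes a flow references that no JAR will provide post-upgrade, is what made the failure predictions valuable during the Hadoop 2 migration.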

Byte-Ray in Action

SoakCycle: Real World Integration Testing

The Future?

Dali

2015 was the year of the table

We want to make 2016 the year of the view

Learn more at the Dali talk tomorrow

©2014 LinkedIn Corporation. All Rights Reserved.
