Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)


DESCRIPTION

At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use a scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop itself generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, decide when to buy new nodes, calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets), and more.


Adam Kawa, Data Engineer @ Spotify

Hadoop Operations

Powered By … Hadoop

1. How many times has Coldplay been streamed this month?

2. How many times was “Get Lucky” streamed during first 24h?

3. Who was the most popular artist in NYC last week?

Labels, Advertisers, Partners

1. What song to recommend Jay-Z when he wakes up?

2. Is Adam Kawa bored with Coldplay today?

3. How to get Arun to subscribe to Spotify Premium?

Data Scientists

(Big) Data At Spotify
■ Data generated by +24M monthly active users … and for users!

- 2.2 TB of compressed data from users per day
- 64 TB of data generated in Hadoop each day (triplicated)

Data Infrastructure At Spotify
■ Apache Hadoop YARN
■ Many other systems, including

- Kafka, Cassandra, Storm, Luigi in production
- Giraph, Tez, Spark under evaluation

■ Probably the largest commercial Hadoop cluster in Europe!

- 694 heterogeneous nodes
- 14.25 PB of data consumed
- ~12,000 jobs each day

Apache Hadoop

March 2013: Tricky questions were asked!

1. How many servers do you need to buy to survive one year?

2. What will you do to use them efficiently?

3. If we agree, don’t come back to us this year! OK?

Finance Department

■ One of the Data Engineers responsible for answering these questions!

Adam Kawa

■ Examples of how to analyze various metrics, logs and files

- generated by Hadoop
- using Hadoop
- to understand Hadoop
- to avoid guesstimates!

The Topic Of This Talk

■ This knowledge can be useful to
- measure how fast HDFS is growing
- define an empirical retention policy
- measure the performance of jobs
- optimize the scheduler
- and more

What To Use It For

1. Analyzing HDFS
2. Analyzing MapReduce and YARN

Agenda

HDFS: Garbage Collection On The NameNode

“We don’t have any full GC pauses on the NN. Our GC stops the NN for less than 100 msec, on average! :)”

Adam Kawa @ Hadoop User Mailing List, December 16th, 2013

“Today, between 12:05 and 13:00, we had 5 full GC pauses on the NN. They stopped the NN for 34min 47sec in total! :(”

Adam Kawa @ Spotify office, Stockholm, January 13th, 2014

What happened between 12:05 and 13:00?

The NameNode was receiving the block reports from all the DataNodes

Quick Answer!


1. We started the NN when the DNs were running
2. 502 DNs immediately registered to the NN
■ Within 1.2 sec (based on logs from the DNs)
3. 502 DNs started sending the block reports
■ dfs.blockreport.initialDelay = 30 minutes
■ 17 block reports per minute (on average)
■ +831K blocks in each block report (on average)
4. This generated high memory pressure on the NN
■ The NN ran into Full GC !!!

Detailed Answer

Hadoop told us everything!

■ Enable GC logging for the NameNode
■ Visualize it, e.g. with GCViewer
■ Analyze memory usage patterns, GC pauses and misconfiguration (see the parsing sketch after this slide)

Collecting The GC Stats
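As a rough illustration of the "analyze" step, here is a minimal sketch that summarizes Full GC pauses from a NameNode GC log. It assumes the NN was started with the usual -verbose:gc -XX:+PrintGCDetails flags; the regex and the way a pause is estimated are simplifications for illustration, not Spotify's production tooling.

```python
#!/usr/bin/env python
# Rough sketch: summarize Full GC pauses from a NameNode GC log.
# Assumes the NN runs with -verbose:gc -XX:+PrintGCDetails, whose Full GC lines
# contain durations such as "..., 34.5678901 secs]".
import re
import sys

SECS = re.compile(r'([\d.]+) secs')

def summarize(gc_log_path):
    pauses = []
    with open(gc_log_path) as log:
        for line in log:
            if 'Full GC' not in line:
                continue
            durations = [float(s) for s in SECS.findall(line)]
            if durations:
                # Take the longest duration on the line as a rough pause estimate
                pauses.append(max(durations))
    print('Full GC events : %d' % len(pauses))
    print('Total stop time: %.1f sec' % sum(pauses))
    if pauses:
        print('Longest pause  : %.1f sec' % max(pauses))

if __name__ == '__main__':
    summarize(sys.argv[1])  # e.g. ./gc_summary.py namenode-gc.log
```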

[GCViewer chart: the blue line shows the heap used by the NN over time. Annotations: loading FsImage; start replaying edit logs; first block report processed; 25, 40 and 131 block reports processed; 5min 39sec of Full GC; further Full GC pauses; CMS collector starts at 98.5% of heap …]

We fixed that!

What happened in HDFS between mid-December 2013 and mid-January 2014?

HDFS: HDFS Metadata

■ A persistent checkpoint of HDFS metadata
■ It contains information about files + directories
■ A binary file

HDFS FsImage File

■ Converts the content of FsImage to text formats
- e.g. a tab-separated file or XML

■ Output is easily analyzed by any tool
- e.g. Pig, Hive (see the sketch after this slide)

HDFS Offline Image Viewer
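A rough sketch of the kind of per-user analysis this enables, written against a tab-separated dump produced by the Offline Image Viewer's Delimited processor. The column positions and the 1 MB "small file" threshold below are assumptions; double-check them against the OIV output of your Hadoop version.

```python
#!/usr/bin/env python
# Sketch: count files and "small" files per user from an OIV Delimited dump.
# Assumed column order (legacy Delimited processor): path, replication,
# modification_time, access_time, block_size, blocks_count, file_size,
# ns_quota, ds_quota, permissions, username, groupname
import sys
from collections import defaultdict

SMALL_FILE_BYTES = 1024 * 1024  # arbitrary threshold for a "small" file

def main(dump_path):
    files, small = defaultdict(int), defaultdict(int)
    with open(dump_path) as dump:
        for line in dump:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 12:
                continue                  # skip malformed lines
            try:
                blocks_count, file_size = int(fields[5]), int(fields[6])
            except ValueError:
                continue                  # header or directory rows without numbers
            if blocks_count <= 0:
                continue                  # assume entries without blocks are directories
            user = fields[10]
            files[user] += 1
            if file_size < SMALL_FILE_BYTES:
                small[user] += 1
    for user in sorted(files, key=files.get, reverse=True):
        print('%-20s files=%-10d small_files=%d' % (user, files[user], small[user]))

if __name__ == '__main__':
    main(sys.argv[1])  # e.g. ./fsimage_stats.py fsimage.tsv
```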

50% of the data was created during the last 3 months

Anything interesting?

1. NO data added that day
2. Many more files added after the migration to YARN

Where did the small files come from?

■ An interactive visualization of data in HDFS

Twitter's HDFS-DU

/app-logs: avg. file size = 253 KB, no. of dirs = 595K, no. of files = 60.6M

■ Statistics broken down by user/group name
■ Candidates for duplicate datasets
■ Inefficient MapReduce jobs
- Small files
- Skewed files

More Uses Of FsImage File

■ You can analyze FsImage to learn how fast HDFS grows (see the projection sketch after this slide)
■ You can combine it with “external” datasets
- number of daily/monthly active users
- total size of logs generated by users
- number of queries / day run by data analysts

Advanced HDFS Capacity Planning
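One way to turn those numbers into a date: fit a linear trend to daily "DFS used" samples and see when it crosses the raw capacity. Below is a minimal sketch, assuming you have exported (date, bytes_used) pairs to a CSV file, e.g. from Ganglia; the file name and the capacity constant are placeholders, not real figures.

```python
#!/usr/bin/env python
# Sketch: project when HDFS fills up from daily "DFS used" samples.
# Input: CSV lines of "YYYY-MM-DD,bytes_used" (e.g. exported from Ganglia).
import csv
import sys
from datetime import date

CLUSTER_CAPACITY_BYTES = 20 * 10**15  # placeholder: put your raw HDFS capacity here

def main(csv_path):
    days, used = [], []
    with open(csv_path) as f:
        for day_str, bytes_used in csv.reader(f):
            days.append(date(*map(int, day_str.split('-'))).toordinal())
            used.append(float(bytes_used))
    n = float(len(days))
    mean_x, mean_y = sum(days) / n, sum(used) / n
    # Ordinary least squares: used ~= growth_per_day * day + offset
    growth_per_day = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used)) / \
                     sum((x - mean_x) ** 2 for x in days)
    offset = mean_y - growth_per_day * mean_x
    full_day = (CLUSTER_CAPACITY_BYTES - offset) / growth_per_day
    print('Growth: %.2f TB/day' % (growth_per_day / 10**12))
    print('Projected full around: %s' % date.fromordinal(int(full_day)))

if __name__ == '__main__':
    main(sys.argv[1])  # e.g. ./fill_date.py dfs_used_daily.csv
```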

■ You can also use the “trend” button in Ganglia

Simplified HDFS Capacity Planning

If we do NOTHING, we might fill the cluster in September ...

What will we do to survive longer than September?

HDFS: Retention

Question: How many days after creation is a dataset no longer accessed?

Retention Policy


Possible Solution
■ You can use modification_time and access_time from FsImage (see the sketch after this slide)

Empirical Retention Policy
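A simplified sketch of that approach, again over a Delimited FsImage dump: take the earliest modification_time under a dataset as its creation time and the latest access_time as its last read. The column layout, the timestamp format and the "first two path components define a dataset" rule are all assumptions made for illustration.

```python
#!/usr/bin/env python
# Sketch: empirical retention per dataset from an OIV Delimited FsImage dump.
# For each dataset (here: the first two path components, e.g. /logs/endsong),
# report how many days after creation it was last accessed.
import sys
import time
from collections import defaultdict

def parse_ts(text):
    # Assumes timestamps like "2014-01-18 15:16"; adjust to your dump's format
    return time.mktime(time.strptime(text, '%Y-%m-%d %H:%M'))

def dataset_of(path):
    parts = path.strip('/').split('/')
    return '/' + '/'.join(parts[:2]) if len(parts) >= 2 else path

def main(dump_path):
    created = defaultdict(lambda: float('inf'))   # earliest modification time
    last_read = defaultdict(float)                # latest access time
    with open(dump_path) as dump:
        for line in dump:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 12:
                continue
            try:
                mod_ts, acc_ts = parse_ts(fields[2]), parse_ts(fields[3])
            except ValueError:
                continue                          # rows without parsable timestamps
            ds = dataset_of(fields[0])
            created[ds] = min(created[ds], mod_ts)
            last_read[ds] = max(last_read[ds], acc_ts)
    for ds in sorted(created):
        days = max(0.0, (last_read[ds] - created[ds]) / 86400.0)
        print('%-40s last accessed %.0f days after creation' % (ds, days))

if __name__ == '__main__':
    main(sys.argv[1])
```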

■ Logs and core datasets are accessed even many years after creation
■ Many reports are not accessed even an hour after creation
■ Most intermediate datasets are needed for less than a week
■ 10% of the data has not been accessed for a year

Our Retention Facts

HDFS: Hot Datasets

■ Some files/directories will be accessed more often than others, e.g.:
- fresh logs, core datasets, dictionary files

Idea
■ To process it faster, increase its replication factor while it’s “hot”
■ To save disk space, decrease its replication factor when it becomes “cold” (see the sketch after this slide)

Hot Dataset
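Changing the replication factor itself is just the standard HDFS shell command; below is a tiny sketch of how a periodic job might apply the idea. The paths and replication factors are made-up examples, not Spotify's actual settings.

```python
#!/usr/bin/env python
# Sketch: raise/lower replication factors as datasets become "hot"/"cold",
# using the standard "hdfs dfs -setrep" command.
import subprocess

HOT_REPLICATION = 5    # illustrative values only
COLD_REPLICATION = 2

def set_replication(path, factor):
    # -R applies the change recursively to every file under the path
    subprocess.check_call(['hdfs', 'dfs', '-setrep', '-R', str(factor), path])

if __name__ == '__main__':
    set_replication('/logs/endsong/2014-01-18', HOT_REPLICATION)   # hypothetical fresh logs
    set_replication('/logs/endsong/2013-01-18', COLD_REPLICATION)  # hypothetical old logs
```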

How to find them?

■ Logs all filesystem access requests sent to the NN
■ Easy to parse and aggregate
- a tab-separated line for each request (example and a counting sketch below)

HDFS Audit Log

2014-01-18 15:16:12,023 INFO FSNamesystem.audit: allowed=true ugi=kawaa (auth:SIMPLE) ip=/10.254.28.4 cmd=open src=/metadata/artist/2013-11-27/part-00061.avro dst=null perm=null
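Since each line is just tab-separated key=value pairs after the timestamp, counting reads per top-level directory is a few lines of scripting. A standalone sketch is below; in practice this kind of aggregation would rather run as a Pig or MapReduce job over the audit logs stored in HDFS.

```python
#!/usr/bin/env python
# Sketch: count "open" requests per top-level directory in an HDFS audit log.
# Reads audit log lines (like the example above) from stdin.
import sys
from collections import Counter

def main():
    opens = Counter()
    for line in sys.stdin:
        fields = dict(part.split('=', 1)
                      for part in line.rstrip('\n').split('\t') if '=' in part)
        if fields.get('cmd') != 'open':
            continue
        src = fields.get('src', '')
        top_level = '/' + src.strip('/').split('/')[0] if src else '(unknown)'
        opens[top_level] += 1
    for path, count in opens.most_common(20):
        print('%-30s %d opens' % (path, count))

if __name__ == '__main__':
    main()  # e.g. cat hdfs-audit.log | ./hot_datasets.py
```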

■ JAR files stored in HDFS and used by Pig scripts
■ A dictionary file with metadata about log messages
■ Core datasets: playlists, users, top tracks

Our Hot Datasets

YARN: MapReduce Jobs Autotuning

■ There are jobs that we schedule regularly
- e.g. top lists for each country

Idea
■ Before submitting a job the next time, use statistics from its previous executions
- To learn about its historical performance
- To tweak its configuration settings

Recurring MapReduce Jobs

We implemented
■ A pre-execution hook that automatically sets
- Maximum size of an input split
- Number of Reduce tasks
(a simplified sketch follows this slide)

■ More settings can be tweaked
- Memory
- Combiner

Jobs Autotuning
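Our hook lives inside internal job-submission tooling, but the core idea fits in a few lines: given statistics from the previous run, derive mapreduce.input.fileinputformat.split.maxsize and mapreduce.job.reduces so that tasks land near the 10-minute target mentioned on the next slide. The stats dictionary and its field names below are hypothetical simplifications, not the production code.

```python
#!/usr/bin/env python
# Sketch of a pre-execution hook for a recurring MapReduce job: derive the
# maximum split size and the number of reducers from the previous run's stats.

TARGET_TASK_SECONDS = 10 * 60          # aim for roughly 10-minute tasks
GB = 1024.0 ** 3

def tuned_settings(prev):
    """prev: hypothetical stats from the last execution, e.g.
    {'map_seconds_per_gb': ..., 'reduce_input_bytes': ..., 'reduce_seconds_per_gb': ...}"""
    # How many GB can one map task process in ~10 minutes?
    split_gb = TARGET_TASK_SECONDS / prev['map_seconds_per_gb']
    # How many reducers keep each reduce task near the target as well?
    reduce_gb = TARGET_TASK_SECONDS / prev['reduce_seconds_per_gb']
    reducers = max(1, int(round(prev['reduce_input_bytes'] / (reduce_gb * GB))))
    return {
        'mapreduce.input.fileinputformat.split.maxsize': int(split_gb * GB),
        'mapreduce.job.reduces': reducers,
    }

if __name__ == '__main__':
    previous_run = {                   # made-up numbers for illustration
        'map_seconds_per_gb': 90.0,
        'reduce_input_bytes': 120 * 1024 ** 3,
        'reduce_seconds_per_gb': 150.0,
    }
    for key, value in tuned_settings(previous_run).items():
        print('-D%s=%s' % (key, value))  # pass as -D flags on job submission
```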

■ Here, the goal is that a task runs approx. 10 min, on average

- Inspired by LinkedIn at Hadoop Summit 2013
- Helpful in extreme cases (short/long running tasks)

A Small PoC ;)

Another Example - Job Optimized Over Time

Even perfect manual settings may become outdated when an input dataset grows!

YARN: MapReduce Statistics

■ Extracts the statistics from historical MapReduce jobs
- Supports MRv1 and YARN

■ Stores them as Avro files
- Enables easy analysis using e.g. Pig and Hive

■ Similar projects
- Replephant, hRaven

Zlatanitor = Zlatan + Monitor

Zlatanitor

A Slow Node
- 40% lower throughput than the average
- NIC negotiated 100MbE instead of 1GbE

According to Facebook
■ ”Small percentage of machines are responsible for large percentage of failures”
- Worse performance
- More alerts
- More manual intervention

Repeat Offenders

Adding nodes to the cluster increases performance. Sometimes, removing (crappy) nodes does too!

Fixing slow and failing tasks as well!

YARN: Application Logs

■ YARN
- Application logs can be moved to HDFS
- They are stored as TFiles … :(
- Small and many of them!

Location Of Application Logs

■ Frequent exceptions and bugs
- Just looking at the last line of stderr shows a lot! (see the tally sketch after the examples)

■ Possible optimizations
- Memory and size of the map input buffer

What Might Be Checked

a) AttributeError: 'int' object has no attribute 'iteritems'
b) ValueError: invalid literal for int() with base 10: 'spotify'
c) ValueError: Expecting , delimiter: line 1 column 3257 (char 3257)
d) ImportError: No module named db_statistics
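A rough sketch of how such a report could be produced: pull the aggregated logs with the standard `yarn logs -applicationId ...` command and tally exception class names that appear in them. The parsing is deliberately naive and the script is an illustration, not our actual tooling.

```python
#!/usr/bin/env python
# Sketch: tally exception types seen in the aggregated logs of YARN applications,
# using the standard "yarn logs -applicationId <id>" command.
import re
import subprocess
import sys
from collections import Counter

EXCEPTION = re.compile(r'\b([A-Za-z_][A-Za-z0-9_]*(?:Error|Exception))\b')

def exceptions_for(app_id):
    output = subprocess.check_output(['yarn', 'logs', '-applicationId', app_id])
    return EXCEPTION.findall(output.decode('utf-8', 'replace'))

def main(app_ids):
    counts = Counter()
    for app_id in app_ids:
        counts.update(exceptions_for(app_id))
    for name, count in counts.most_common():
        print('%-40s %d' % (name, count))

if __name__ == '__main__':
    main(sys.argv[1:])  # e.g. ./top_exceptions.py application_1389000000000_0001
```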

YARN: The Capacity Scheduler

■ We specified capacities and elasticity based on a combination of

- “some” data
- intuition
- desire to shape future usage (!)

Our Initial Capacities

■ Basic information is available on the Scheduler Web UI
■ Take screenshots!

- Otherwise, you will lose the history of what you saw :(

Overutilization And Underutilization

■ Capacity Scheduler exposes these metrics via JMX
■ Ganglia does NOT display the metrics related to utilization of queues (by default)

Visualizing Utilization Of Queue

■ It collects JMX metrics from Java processes
■ It can send metrics to multiple destinations
- Graphite, cacti/rrdtool, Ganglia
- tab-separated text file
- STDOUT
- and more (see the JMX-polling sketch after this slide)

Jmxtrans
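jmxtrans handles the continuous collection; for ad-hoc checks the same queue metrics can be read straight from the ResourceManager's /jmx servlet. A minimal sketch follows, assuming the RM web port 8088 and the default CapacityScheduler QueueMetrics bean and attribute names; adjust the hostname, port and attribute names to your cluster.

```python
#!/usr/bin/env python
# Sketch: print a few CapacityScheduler queue metrics from the ResourceManager's
# /jmx servlet (the same metrics jmxtrans would ship to Graphite/Ganglia).
import json
import urllib2  # Python 2; on Python 3 use urllib.request instead

RM_JMX_URL = ('http://resourcemanager.example.com:8088/jmx'
              '?qry=Hadoop:service=ResourceManager,name=QueueMetrics,*')

def main():
    beans = json.load(urllib2.urlopen(RM_JMX_URL))['beans']
    for bean in beans:
        print('%-30s apps_running=%-4s pending_containers=%-6s allocated_containers=%s' % (
            bean.get('tag.Queue', '?'),
            bean.get('AppsRunning'),
            bean.get('PendingContainers'),
            bean.get('AllocatedContainers')))

if __name__ == '__main__':
    main()
```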

■ Our Production queue often borrows resources
- Usually from the Queue3 and Queue4 queues

Overutilization And Underutilization

The Best Time For The Downtime?

Three Crowns

Three Crowns = Sweden

BONUS: Some Cool Stuff From The Community

■ Aggregates and visualizes Hadoop cluster utilization across users

LinkedIn's White Elephant

■ Collects run-time statistics from MR jobs
- Stores them in HBase

■ Does not provide a built-in visualization layer
- The picture on this slide comes from Twitter's blog

Twitter's hRaven

That’s all!

■ Analyzing Hadoop is also a “business” problem
- Save money
- Iterate faster
- Avoid downtimes

Summary

Thank you!

■ To my awesome colleagues for great technical review:

Piotr Krewski, Josh Baer, Rafal Wojdyla, Anna Dackiewicz, Magnus Runesson, Gustav Landén, Guido Urdaneta, Uldis Barbans

More Thanks


Questions?

Check out spotify.com/jobs or @Spotifyjobs for more information

kawaa@spotify.com
Check out my blog: HakunaMapData.com

Want to join the band?

Backup

■ Tricky question!
■ Use production jobs that represent your workload
■ Use a metric that is independent of the size of the data you process
■ Optimize one setting at a time

Benchmarking


Recommended