January 2013
Hadoop at Spotify
Friday, January 25, 13
- Introduction
- Use cases
- Some stats
- Infrastructure overview
- Map/Reduce in different languages
- Scheduling
- Scaling Hadoop
- Lessons learned
Agenda
- 3+ years of experience with Hadoop
- Working with a team of 12 engineers to keep our data flowing
- Responsible for the data architecture
Don’t hesitate to ask questions during this talk!
Wouter de Bie - Team lead for Analytics & Data Infrastructure
Who am I and what am I doing here?!
Introduction
- Too much data to fit into an RDBMS (or too costly)
- Hadoop provides good value for money
- Runs on "commodity" hardware
- Scales pretty well
Millions of daily active users == LOTS OF DATA
Why do we use Hadoop?
Introduction
- Reporting
- Business analysis
- Operational analysis
- In-client features
Spotify is a Data-Driven company, so data is used practically everywhere..
Where do we use Hadoop?
Use cases
- Reporting to labels, licensors, partners and advertisers from day 1
- In 2008 an RDBMS would have worked for a while, but labels wanted very granular data, so we went with Hadoop instead
This is where it all started..
Reporting
- Analyzing growth, user behavior, sign-up funnels, etc.
- Company KPIs
- A/B testing
- NPS analysis
- Segmentation analysis
We have so much data, so let's use it!
Business Analytics
An analysis I can share
Listening behavior in Sweden
[Chart: share of listening per hour of the week (y-axis 0.00%–1.60%), Monday 00:00 through Sunday 22:00, one line per Swedish age group: SE 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60-150]
Listening behavior in Spain
[Chart: share of listening per hour of the week (y-axis 0.00%–1.40%), Monday 00:00 through Sunday 22:00, one line per Spanish age group: ES 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60-150]
- Root cause analysis
- Latency analysis
- Better capacity planning (servers, people, bandwidth)
Things get interesting when you combine business metrics with operational data
Operational Metrics
- Recommendations (better than external parties, because of the amount of data)
- Radio
- Top lists
Leverage data to create better product features
Product features
- 200 GB of compressed data from users per day
- 100 GB of data from services per day
- 60+ GB of data generated in Hadoop each day
- 190-node Hadoop cluster (24-core CPUs, 32 GB RAM, 20 TB disk space)
- 4 PB of storage capacity
When I say "A LOT", I mean:
Some geeky numbers
Our data infrastructure
Spotify's data infrastructure

[Architecture diagram, components: Backend services, HDFS, Map/Reduce jobs, Luigi scheduler, Hive, operational databases, analytical databases, reporting, dashboards, product features]
Python with Hadoop Streaming
- Pros: fast development, many Spotify libraries available
- Cons: slower than Java, no access to the Hadoop API

Java
- Pros: fast, access to the Hadoop API
- Cons: verbose language, not many Spotify libs available

Pig
- Pros: very small scripts, faster than streaming
- Cons: yet another language to learn, not many Spotify libs available

Hive
- Pros: SQL-like syntax (easy for non-programmers) and relational data model
- Cons: more moving parts (not well suited for a whole pipeline)
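To make the streaming option concrete, here is a toy sketch of a streaming-style job counting plays per track. The data and names (`mapper`, `reducer`, `song_a`) are hypothetical; in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading lines from stdin and writing tab-separated `key\tvalue` lines to stdout, with Hadoop doing the sort between them.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (track_id, 1) for each playback log line "user_id\ttrack_id".
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            yield fields[1], 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sum the counts per track.
    for track_id, group in groupby(pairs, key=itemgetter(0)):
        yield track_id, sum(count for _, count in group)

if __name__ == "__main__":
    # Simulate the map -> shuffle/sort -> reduce pipeline locally.
    logs = ["u1\tsong_a", "u2\tsong_b", "u3\tsong_a"]
    for track, plays in reducer(sorted(mapper(logs))):
        print(f"{track}\t{plays}")
```

The local `sorted()` call stands in for Hadoop's shuffle phase, which is what guarantees the reducer sees each key's values grouped together.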
We use multiple languages to run Map/Reduce jobs
Map/Reduce languages
- Java, Pig, Hive and Python for writing MapReduce 1.0 jobs
- Apache Kafka for log collection (in the process of replacing a batched transfer system)
- Apache Cassandra (no HBase ;))
- PostgreSQL
- Sqoop for database import/export
- Avro as storage format
Technology we use in combination with Hadoop:
Hadoop feels lonely without friends..
Some technical details
- Nothing suitable out there...
- https://github.com/spotify/luigi
- Written in Python
- Generic scheduler and dependency system that supports Python M/R, Pig and Hive
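The core idea behind this style of scheduler can be sketched in a few lines. This is not Luigi's actual API, just a simplified illustration with hypothetical task names: each task declares its dependencies and an output target, and the runner executes a task only when its output is missing, which makes re-runs idempotent.

```python
import os
import tempfile

WORKDIR = tempfile.mkdtemp()

class Task:
    """Bare-bones stand-in for a Luigi-style task: dependencies + output target."""
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def complete(self):
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

class ExtractPlays(Task):
    def output(self):
        return os.path.join(WORKDIR, "plays.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("song_a\nsong_b\nsong_a\n")

class CountPlays(Task):
    def requires(self):
        return [ExtractPlays()]
    def output(self):
        return os.path.join(WORKDIR, "counts.txt")
    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as f:
            for line in f:
                track = line.strip()
                counts[track] = counts.get(track, 0) + 1
        with open(self.output(), "w") as f:
            for track, n in sorted(counts.items()):
                f.write(f"{track}\t{n}\n")

def build(task):
    # Depth-first: satisfy dependencies first, then run the task only if
    # its output target does not exist yet.
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

build(CountPlays())
```

Chaining jobs then becomes a matter of declaring `requires()` edges; the scheduler walks the dependency graph instead of you hand-ordering a pipeline of map/reduce steps.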
We wrote and open-sourced our own scheduler: Luigi
Map/Reduce is simple, but a single map/reduce job is limited. You need to chain jobs.
Scheduling
Scaling Hadoop
- Started with a small (scrap-metal) cluster of 37 servers
- Moved to Amazon Elastic MapReduce (EMR) and S3 to scale quickly
- Built an in-house cluster of 60 nodes because of EMR costs
- Capacity planning every 6 months; grown to 190 nodes today
- Put a data-retention policy and data archive in place
Our journey:
Scaling Hadoop
- Hadoop has brought us very far; we would never be able to handle the current volume with a "cheap" RDBMS
- "Commodity hardware" doesn't mean cheap hardware
- Hadoop isn't a silver bullet
- Hadoop is a complex system that needs love and care
- You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
- Hadoop is fun!
4+ years of Hadoop at Spotify taught us:
Lessons learned
January 2013
Any questions? [email protected] / @xinit
Shameless plug: Spotify is hiring! Talk to me or someone with a Spotify badge if you’re interested!
Thank you!