January 2013
Hadoop at Spotify
Friday, January 25, 13
- Introduction
- Use cases
- Some stats
- Infrastructure overview
- Map/Reduce in different languages
- Scheduling
- Scaling Hadoop
- Lessons learned
Agenda
- 3+ years of experience with Hadoop
- Working with a team of 12 engineers to keep our data flowing
- Responsible for the data architecture
Don’t hesitate to ask questions during this talk!
Wouter de Bie - Team lead for Analytics & Data Infrastructure
Who am I and what am I doing here?!
Introduction
- Too much data to fit into an RDBMS (or too costly)
- Hadoop provides good value for money
- Runs on "commodity" hardware
- Scales pretty well
Millions of daily active users == LOTS OF DATA
Why do we use Hadoop?
Introduction
- Reporting
- Business analysis
- Operational analysis
- In-client features
Spotify is a Data-Driven company, so data is used practically everywhere..
Where do we use Hadoop?
Use cases
- Reporting to labels, licensors, partners and advertisers from day 1
- In 2008 an RDBMS would have worked for a while, but labels wanted very granular data, so we went with Hadoop instead
This is where it all started..
Reporting
- Analyzing growth, user behavior, sign-up funnels, etc.
- Company KPIs
- A/B testing
- NPS analysis
- Segmentation analysis
We have so much data, so let's use it!
Business Analytics
An analysis I can share
Listening behavior in Sweden
[Chart: share of listening per hour of the week (y-axis 0.00%–1.60%), Monday 00:00 through Sunday 22:00, one line per Swedish age group: SE 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60-150]
Listening behavior in Spain
[Chart: share of listening per hour of the week (y-axis 0.00%–1.40%), Monday 00:00 through Sunday 22:00, one line per Spanish age group: ES 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60-150]
- Root cause analysis
- Latency analysis
- Better capacity planning (servers, people, bandwidth)
Things get interesting when you combine business metrics with operational data
Operational Metrics
- Recommendations (better than external parties, because of the amount of data)
- Radio
- Top lists
Leverage data to create better product features
Product features
- 200 GB of compressed data from users per day
- 100 GB of data from services per day
- 60+ GB of data generated in Hadoop each day
- 190-node Hadoop cluster (24-core CPUs, 32 GB RAM, 20 TB disk space)
- 4 PB of storage capacity
When I say "A LOT", I mean:
Some geeky numbers
Our data infrastructure
Spotify's data infrastructure

[Architecture diagram, components: Backend services, HDFS, Map/Reduce jobs, Luigi scheduler, Hive, operational databases, analytical databases, reporting, dashboards, product features]
Python with Hadoop Streaming
- Pros: fast development, many Spotify libraries available
- Cons: slower than Java, no access to the Hadoop API

Java
- Pros: fast, access to the Hadoop API
- Cons: verbose language, not many Spotify libs available

Pig
- Pros: very small scripts, faster than streaming
- Cons: yet another language to learn, not many Spotify libs available

Hive
- Pros: SQL-like syntax (easy for non-programmers) and relational data model
- Cons: more moving parts (not well suited for a whole pipeline)
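To make the streaming option concrete, here is a toy sketch of a streaming-style job counting plays per track. The data and names (`mapper`, `reducer`, `song_a`) are hypothetical; in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading lines from stdin and writing tab-separated `key\tvalue` lines to stdout, with Hadoop doing the sort between them.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (track_id, 1) for each playback log line "user_id\ttrack_id".
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            yield fields[1], 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sum the counts per track.
    for track_id, group in groupby(pairs, key=itemgetter(0)):
        yield track_id, sum(count for _, count in group)

if __name__ == "__main__":
    # Simulate the map -> shuffle/sort -> reduce pipeline locally.
    logs = ["u1\tsong_a", "u2\tsong_b", "u3\tsong_a"]
    for track, plays in reducer(sorted(mapper(logs))):
        print(f"{track}\t{plays}")
```

The local `sorted()` call stands in for Hadoop's shuffle phase, which is what guarantees the reducer sees each key's values grouped together.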
We use multiple languages to run Map/Reduce jobs
Map/Reduce languages
- Java, Pig, Hive and Python for writing MapReduce 1.0 jobs
- Apache Kafka for log collection (in the process of replacing a batched transfer system)
- Apache Cassandra (no HBase ;))
- PostgreSQL
- Sqoop for database import/export
- Avro as storage format
Technology we use in combination with Hadoop:
Hadoop feels lonely without friends..
Some technical details
- Nothing suitable out there...
- https://github.com/spotify/luigi
- Written in Python
- Generic scheduler and dependency system that supports Python M/R, Pig and Hive
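The core idea behind this style of scheduler can be sketched in a few lines. This is not Luigi's actual API, just a simplified illustration with hypothetical task names: each task declares its dependencies and an output target, and the runner executes a task only when its output is missing, which makes re-runs idempotent.

```python
import os
import tempfile

WORKDIR = tempfile.mkdtemp()

class Task:
    """Bare-bones stand-in for a Luigi-style task: dependencies + output target."""
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def complete(self):
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

class ExtractPlays(Task):
    def output(self):
        return os.path.join(WORKDIR, "plays.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("song_a\nsong_b\nsong_a\n")

class CountPlays(Task):
    def requires(self):
        return [ExtractPlays()]
    def output(self):
        return os.path.join(WORKDIR, "counts.txt")
    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as f:
            for line in f:
                track = line.strip()
                counts[track] = counts.get(track, 0) + 1
        with open(self.output(), "w") as f:
            for track, n in sorted(counts.items()):
                f.write(f"{track}\t{n}\n")

def build(task):
    # Depth-first: satisfy dependencies first, then run the task only if
    # its output target does not exist yet.
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

build(CountPlays())
```

Chaining jobs then becomes a matter of declaring `requires()` edges; the scheduler walks the dependency graph instead of you hand-ordering a pipeline of map/reduce steps.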
We wrote and open-sourced our own scheduler: Luigi
Map/Reduce is simple, but a single map/reduce job is limited. You need to chain jobs.
Scheduling
Scaling Hadoop
- Started with a small (scrap-metal) cluster of 37 servers
- Moved to Amazon Elastic MapReduce (EMR) and S3 to scale quickly
- Built an in-house cluster of 60 nodes because of EMR costs
- Capacity planning every 6 months; grown to 190 nodes today
- Put a data-retention policy and data archive in place
Our journey:
Scaling Hadoop
- Hadoop has brought us very far; we would never be able to handle the current volume with a "cheap" RDBMS
- "Commodity hardware" doesn't mean cheap hardware
- Hadoop isn't a silver bullet
- Hadoop is a complex system that needs love and care
- You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
- Hadoop is fun!
4+ years of Hadoop at Spotify taught us:
Lessons learned
January 2013
Any questions? [email protected] / @xinit
Shameless plug: Spotify is hiring! Talk to me or someone with a Spotify badge if you’re interested!
Thank you!