Audiotopsy

Audiotopsy: Finding insights and trends from music data

Goal

Ingest the Million Song Dataset

Provide an option for ad-hoc querying

Enable really fast access to data

Where does the data come from?

1,000,000 songs / files

273 GB of data

44,745 unique artists

515,576 dated tracks starting from 1922

Data Pipeline!!

[Pipeline diagram; labeled components: Batch Processing, Pig, Real Time Queries, REST API, End User]

HBase Schema

Key          Column Family

2008019123   Artist: Adele; Song: Rolling in the deep

2009017241   Artist: Gotye; Song: Somebody that I used to know

2009032523   Artist: Bruno Mars; Song: Locked out of heaven

Key: 2009 017 123

2009 = Year, 017 = Inverted Hotttnesss Factor, 123 = Song Id
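The key layout (year, then inverted hotttnesss factor, then song id) could be built roughly as follows. This is a minimal sketch, not the author's code: the function name `make_row_key`, the assumption that hotttnesss lies in [0, 1], the scaling to 0..999, and the 3-digit field widths are all illustrative guesses based on the example key.

```python
def make_row_key(year: int, hotttnesss: float, song_id: int) -> str:
    """Build a fixed-width row key: year, inverted hotttnesss, song id.

    Assumptions (not from the slides): hotttnesss is in [0, 1], the
    inverted factor is scaled to 0..999, and song ids fit in 3 digits.
    """
    if not 0.0 <= hotttnesss <= 1.0:
        raise ValueError("hotttnesss is assumed to lie in [0, 1]")
    # Invert the score so that hotter songs get SMALLER numbers and
    # therefore sort FIRST in a store that orders keys ascending.
    inverted = round((1.0 - hotttnesss) * 999)
    # Zero-padding keeps string (byte) order equal to numeric order.
    return f"{year:04d}{inverted:03d}{song_id:03d}"


if __name__ == "__main__":
    print(make_row_key(2009, 0.983, 241))
```

Because every field is zero-padded to a fixed width, comparing two keys as strings gives the same result as comparing (year, inverted factor, song id) as numbers, which is what a lexicographically sorted key space needs.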

Getting the top songs for the year 2009

Perform a partial scan on the keys

Can avoid client side sorting :)
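Why a partial scan avoids client-side sorting can be illustrated without a cluster. The sketch below simulates a lexicographically sorted key space with a plain Python list and `bisect`; it does not call HBase, and `prefix_scan` is a hypothetical helper. The rows are the three example rows from the schema slide.

```python
import bisect

# Example rows from the schema slide: key -> (artist, song).
ROWS = {
    "2008019123": ("Adele", "Rolling in the deep"),
    "2009017241": ("Gotye", "Somebody that I used to know"),
    "2009032523": ("Bruno Mars", "Locked out of heaven"),
}

# A sorted key list stands in for the store's sorted key order.
SORTED_KEYS = sorted(ROWS)


def prefix_scan(sorted_keys, prefix, limit=None):
    """Return keys starting with `prefix`, in stored (sorted) order."""
    # Jump straight to the first key >= prefix instead of reading all rows.
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # left the prefix range; nothing further can match
        result.append(key)
        if limit is not None and len(result) >= limit:
            break
    return result


if __name__ == "__main__":
    # Rows for 2009 come back with the smallest inverted factor first,
    # so no extra sort is needed on the client.
    for key in prefix_scan(SORTED_KEYS, "2009"):
        print(key, ROWS[key])
```

The scan touches only the contiguous range of keys that share the year prefix, and a `limit` turns it into a "top N for the year" query.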

Insights/Challenges

Compression really helps! (360.601 sec with compression vs 885.129 sec without)

Getting all components to talk to each other

Dealing with noisy data

Finding a sweet-spot for precision of Geohash
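The Geohash precision trade-off can be explored with a small encoder. The sketch below implements the standard public Geohash encoding (interleaved longitude/latitude bits, base-32 alphabet); how the project actually used geohashes is not stated in the slides, and the coordinates are an arbitrary illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int) -> str:
    """Encode a latitude/longitude pair as a Geohash of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    # Shorter hashes cover larger cells; each extra character narrows the cell.
    for precision in (3, 5, 7):
        print(precision, geohash_encode(57.64911, 10.40744, precision))
```

A shorter hash is always a prefix of a longer one for the same point, so lower precision groups more locations into one cell while higher precision separates them; the "sweet spot" is the length at which cells are neither too coarse nor too sparse for the queries being served.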

About Me – Denny Abraham Cheriyan