What we do
• Streaming video service
• > 5.5 million subscribers
• > 20 million unique visitors/month
• > 1 billion ads/month
It all begins with beacons
Living room device (Roku, Xbox, etc.)
Mobile device (Android, iPhone, etc.)
Web (hulu.com)
Beacon collection service
What’s in a beacon
80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333&contentid=5003673…
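As a rough illustration of how such a line could be taken apart, here is a minimal Python sketch. It assumes the layout seen in the sample: an opaque leading field, a timestamp, then a request path with a query string. The function name `parse_beacon`, the `prefix` label and the regular expression are hypothetical, not part of the collection service.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit, parse_qsl

# Assumed layout: <prefix> <YYYY-MM-DD HH:MM:SS> <path>?<query>
# The whitespace before the path is optional so fused lines still parse.
BEACON_RE = re.compile(r"^(\S+) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*(/\S*)$")


def parse_beacon(line):
    """Split one beacon line into prefix, timestamp, path and parameters."""
    match = BEACON_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised beacon line: %r" % line)
    prefix, stamp, request = match.groups()
    url = urlsplit(request)
    return {
        "prefix": prefix,  # meaning of this field is not stated in the slides
        "timestamp": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "path": url.path,
        "params": dict(parse_qsl(url.query, keep_blank_values=True)),
    }


# Shortened version of the sample beacon shown above.
sample = "80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime"
beacon = parse_beacon(sample)
```

Here `beacon["path"]` identifies the beacon type and `beacon["params"]` holds the key/value payload as strings.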
Reporting platform (RP2)
Find Metrics & Dimensions
Design and execute reports
The pipeline
Devices
Beacon collection service
HDFS
Hive
RDBMS
LogCollector/Flume
MapReduce jobs/JobScheduler
Harpy – continuous aggregation
Reporting(RP2)
Monitoring(metstat)
Developers
Business
HDFS
Files bucketed by beacon type and partitioned by hour
Devices
Load balancer
Log Collection
Log Collection machine #1
…
Log Collection machine #11
Directory hierarchy on HDFS
/user/hadoop/t2
  201401010000/
    playback/
      201401010100_playback_1.seq
      201401010100_playback_2.seq
      …
    revenue/
  201401010100
    playback/
    revenue/
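The layout above suggests that a partition directory is derived from the hour and the beacon type. A minimal Python sketch of that mapping follows. It assumes the hour directory is a `YYYYMMDDHHMM` stamp truncated to the hour, as the listing appears to show. The helper `partition_dir` is hypothetical, and the file names inside each directory are not modelled.

```python
from datetime import datetime

ROOT = "/user/hadoop/t2"  # root directory shown in the listing


def partition_dir(beacon_type, when, root=ROOT):
    """Return the assumed directory for a beacon type and hour."""
    # Truncate to the top of the hour: one directory per hour.
    hour = when.replace(minute=0, second=0, microsecond=0)
    return f"{root}/{hour:%Y%m%d%H%M}/{beacon_type}/"


example = partition_dir("playback", datetime(2014, 1, 1, 0, 42, 7))
```

Any timestamp within the same hour maps to the same directory, which is what lets downstream jobs process one hour of one beacon type at a time.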
MapReduce - going from beacons to basefacts
computerguid EA8FA1000232B8F6986C3E0BE55E9333
userid 5238518
video_id 289696
content_partner_id 398
distribution_partner_id 602
distro_platform_id 14
is_on_hulu 0
…
hourid 383149
watched 76426
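The step from beacons to a basefact row like the one above can be pictured as a group-by over the dimension columns followed by a sum. The sketch below imitates that in plain Python rather than on Hadoop. The choice of `DIMENSIONS`, the function names and the `watched` amounts in the sample input are all illustrative assumptions.

```python
from collections import defaultdict

# Subset of the dimension columns from the example row (illustrative choice).
DIMENSIONS = ("computerguid", "userid", "video_id", "hourid")


def map_beacon(beacon):
    """Map phase: emit (dimension tuple, watched amount) for one beacon."""
    key = tuple(beacon[d] for d in DIMENSIONS)
    yield key, beacon["watched"]


def reduce_watched(key, values):
    """Reduce phase: sum every watched amount that shares a key."""
    row = dict(zip(DIMENSIONS, key))
    row["watched"] = sum(values)
    return row


def run(beacons):
    """In-memory stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for beacon in beacons:
        for key, value in map_beacon(beacon):
            groups[key].append(value)
    return [reduce_watched(key, values) for key, values in groups.items()]


# Two hypothetical beacons from the same viewer, video and hour.
base = {
    "computerguid": "EA8FA1000232B8F6986C3E0BE55E9333",
    "userid": 5238518,
    "video_id": 289696,
    "hourid": 383149,
}
rows = run([dict(base, watched=10), dict(base, watched=20)])
```

Both input beacons collapse into a single row whose `watched` value is their sum.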
If a program manipulates a large amount of data, it does so in a small number of ways - Alan Perlis
The BeaconSpec compiler
Definitions of beacons and base-facts
BeaconSpec compiler
Java MapReduce code that can run on the cluster
What does our language look like?
basefact playback_watched_uniques from playback/(position|end) {
  dimension harpyhour.id as hourid;
  dimension computerguid as computerguid;
  dimension userid as userid;
  required dimension video.id as video_id;
  required dimension contentPartner.id as content_partner_id;
  …
  dimension siteSessionId.chosen as site_session_id;
  dimension facebook.isfacebookconnected as is_facebook_connected;
  fact sum(watched.out) as watched;
}
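One way to read a definition like this is as data: a pattern selecting beacon types, a list of dimension mappings, and a fact to aggregate. The Python sketch below interprets a trimmed-down version of the definition for a single beacon. The real compiler emits Java MapReduce code, so this is only an illustration. The assumption that a beacon missing a `required` dimension is dropped is a guess at the semantics, and `BASEFACT` and `to_row` are hypothetical names.

```python
import re

# Trimmed-down, hand-written stand-in for the basefact definition above.
BASEFACT = {
    "name": "playback_watched_uniques",
    "beacon_types": re.compile(r"playback/(position|end)"),
    # (source expression, output column, required?)
    "dimensions": [
        ("computerguid", "computerguid", False),
        ("userid", "userid", False),
        ("video.id", "video_id", True),
    ],
    # (source expression, output column); summed later across rows
    "fact": ("watched.out", "watched"),
}


def to_row(beacon_type, fields, spec=BASEFACT):
    """Project one beacon onto the basefact columns, or return None."""
    if not spec["beacon_types"].fullmatch(beacon_type):
        return None  # beacon type is not a source for this basefact
    row = {}
    for source, column, required in spec["dimensions"]:
        value = fields.get(source)
        if value is None and required:
            return None  # assumed semantics: drop beacons lacking a required dimension
        row[column] = value
    fact_source, fact_column = spec["fact"]
    row[fact_column] = fields.get(fact_source, 0)
    return row


kept = to_row("playback/end", {"video.id": 289696, "watched.out": 5})
wrong_type = to_row("playback/start", {"video.id": 289696, "watched.out": 5})
missing_required = to_row("playback/position", {"userid": 5238518})
```

Only `kept` yields a row. The other two calls return `None`, one because the beacon type does not match and one because `video.id` is absent.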
FAQ: Why didn’t we just use Pig?
The superior [program] cultivates itself so as to give rest to [programmers] - Confucius, The Way of the Superior Man
Scheduling jobs
JobScheduler Interface
Outside world
Logmanager databases
JobScheduler
Checks databases for jobs that are ready to run and whether dependencies are met
JobMonitor
MapReduce job
JobMonitor
MapReduce job
JobMonitor
MapReduce job
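The readiness check described above can be sketched as a filter over job records. In this Python sketch the job names, the `status` values and the `depends_on` field are hypothetical stand-ins for whatever the Logmanager databases actually store.

```python
def ready_jobs(jobs, completed):
    """Return names of pending jobs whose dependencies have all completed."""
    return sorted(
        name
        for name, job in jobs.items()
        if job["status"] == "pending"
        and all(dep in completed for dep in job["depends_on"])
    )


# Hypothetical job table: name -> status and upstream dependencies.
jobs = {
    "collect_playback": {"status": "done", "depends_on": []},
    "playback_basefacts": {"status": "pending", "depends_on": ["collect_playback"]},
    "daily_rollup": {"status": "pending", "depends_on": ["playback_basefacts"]},
}
completed = {name for name, job in jobs.items() if job["status"] == "done"}
runnable = ready_jobs(jobs, completed)
```

A scheduler loop would hand each runnable job to its own monitor, then repeat the check as jobs complete.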
JobScheduler technology
• The actor model of concurrency
– Communication through async messaging
– Completely encapsulated state
Actor creation
Message passing
Central idea: Treat local objects as if they are distributed, as opposed to treating distributed objects as if they are local
Fault-tolerance – let it crash!
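The ideas on these slides (a mailbox, private state, and restart on failure) can be sketched in a few lines of Python with a thread and a queue. This is a stand-in for illustration only. The slides do not name the actor library the JobScheduler uses, and `CounterActor` and its message names are hypothetical.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, reachable only through its mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # encapsulated state, never touched by callers
        self.restarts = 0
        self._thread = threading.Thread(target=self._supervise, daemon=True)
        self._thread.start()

    def tell(self, message, reply_to=None):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put((message, reply_to))

    def join(self, timeout=None):
        self._thread.join(timeout)

    def _supervise(self):
        # "Let it crash": no defensive handling inside the actor; a failure
        # ends the current run and the supervisor restarts with fresh state.
        while True:
            try:
                self._run()
                return  # clean stop
            except Exception:
                self._count = 0
                self.restarts += 1

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                return
            if message == "get":
                reply_to.put(self._count)
            elif message == "inc":
                self._count += 1
            else:
                raise ValueError("unknown message: %r" % (message,))


actor = CounterActor()
replies = queue.Queue()
actor.tell("inc")
actor.tell("inc")
actor.tell("boom")  # unknown message: the actor crashes and is restarted
actor.tell("inc")
actor.tell("get", replies)
result = replies.get(timeout=5)
actor.tell("stop")
actor.join(timeout=5)
```

The crash wipes the two earlier increments, so only the increment sent after the restart is counted. Callers never see the exception, only the actor's replies.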
Harpy – continuous aggregations
HDFS NFS
Metadata
Output DBs
Harpy
DataSync
Publishing
HoldingDB
HoldingSweeper Agg
Scheduler
Queue Processor
Hive
RP2
• Reporting Portal for pulling Metrics + Dimensions
• Quick ‘Demo’
Let’s reexamine the pipeline:
Devices
Beacon collection service
HDFS
Hive
RDBMS
LogCollector/Flume
MapReduce jobs/JobScheduler
Harpy – continuous aggregation
Reporting(RP2)
Monitoring(metstat)
Developers
Business
Metstat
• Python Django app
• Tasks on Celery + RabbitMQ
• jQuery
• Tracks status, status changes and statistics
• Gets data directly from various sources (databases, HDFS)
FAQ: Why didn’t we just use Pig?
• Dataflow language – runs on Hadoop
• Pig philosophy (taken from the Apache website)
– Pigs eat anything
– Pigs live anywhere
– Pigs are domestic animals
– Pigs fly
Beaconspec
Beware of the Turing tar-pit where everything is possible but nothing of interest is easy - Alan Perlis
REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
Beaconspec
FAQ: What is open sourced?
• Slickint – database interface generation for Scala
– github.com/zenbowman/slickint
• Local filesystem caching for Hadoop
– github.com/ZenBowman/luna