An overview of Hulu’s metrics platform


Tristan Reid tristan.reid@hulu.com

Prasan Samtani prasan.samtani@hulu.com

What we do

• Streaming video service
• > 5.5 million subscribers
• > 20 million unique visitors/month
• > 1 billion ads/month

It all begins with beacons

Living room device (Roku, Xbox, etc.)

Mobile device (Android, iPhone, etc.)

Web (hulu.com)

Beacon collection service

What’s in a beacon

80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333&contentid=5003673…
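A beacon like the one above can be split into a timestamp, a request path, and a set of key/value parameters. The sketch below is only an illustration in Python: the layout "prefix, date, time, path?query" is inferred from the single example line, the meaning of the leading `80` is not stated in the slides (it is kept as an opaque `prefix`), and `parse_beacon` is a hypothetical helper name, not part of Hulu's code.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: "<prefix> <YYYY-MM-DD> <HH:MM:SS> <path>?<query>".
# The whitespace before the path is optional so that a line with the
# path fused onto the time still parses.
BEACON_RE = re.compile(
    r"^(\S+) (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})\s*(/\S*)$"
)


def parse_beacon(line):
    """Split one raw beacon line into its parts (hypothetical helper)."""
    match = BEACON_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised beacon line: %r" % line)
    prefix, date, time, url = match.groups()
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "prefix": prefix,               # meaning not given in the slides
        "timestamp": date + " " + time,
        "path": parts.path,             # e.g. /v3/playback/start
        "params": params,
    }


if __name__ == "__main__":
    # Shortened sample built from fields visible in the slide.
    sample = ("80 2013-04-01 00:00:00 /v3/playback/start"
              "?bitrate=650&cdn=Akamai&channel=Anime")
    print(parse_beacon(sample))
```

Keeping the parameters as a plain dictionary means later stages can pick out only the dimensions they need.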

Reporting platform (RP2)

Find Metrics & Dimensions

Design and execute reports

The pipeline

[Diagram: components of the pipeline]
• Devices (multiple)
• Beacon collection service
• LogCollector/Flume
• HDFS
• MapReduce jobs/JobScheduler
• Hive
• Harpy – continuous aggregation
• RDBMS
• Reporting (RP2)
• Monitoring (metstat)
• Users: Developers, Business

Log Collection

[Diagram: components of log collection]
• Devices (multiple)
• Load balancer
• Log Collection machines #1 … #11
• HDFS: files bucketed by beacon type and partitioned by hour

Directory hierarchy on HDFS

/user/hadoop/t2
    201401010000/
        playback/
            201401010100_playback_1.seq
            201401010100_playback_2.seq
            …
        revenue/
    201401010100/
        playback/
        revenue/
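The layout above (one directory per hour, then one per beacon type) can be expressed as a small path-building function. This is a sketch only: it assumes the twelve-digit directory names are `YYYYMMDDHHMM` with the minutes always `00`, which matches the two names on the slide but is not stated there, and `hour_bucket` and `partition_dir` are hypothetical names.

```python
import posixpath
from datetime import datetime

ROOT = "/user/hadoop/t2"  # root directory as shown on the slide


def hour_bucket(ts):
    """Name of the hourly partition containing ts (assumed YYYYMMDDHH + '00')."""
    return ts.strftime("%Y%m%d%H") + "00"


def partition_dir(beacon_type, ts, root=ROOT):
    """Directory holding beacons of one type for the hour containing ts."""
    return posixpath.join(root, hour_bucket(ts), beacon_type)


if __name__ == "__main__":
    print(partition_dir("playback", datetime(2014, 1, 1, 1, 23, 45)))
    print(partition_dir("revenue", datetime(2014, 1, 1, 0, 5, 0)))
```

With this layout, a job that needs one hour of one beacon type reads a single directory rather than scanning the whole log.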

MapReduce - going from beacons to basefacts

computerguid EA8FA1000232B8F6986C3E0BE55E9333
userid 5238518
video_id 289696
content_partner_id 398
distribution_partner_id 602
distro_platform_id 14
is_on_hulu 0
…
hourid 383149
watched 76426
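The step from beacons to a basefact row like the one above is a group-by on the dimension columns followed by a sum over the fact. The following Python sketch imitates the map, shuffle, and reduce phases in memory. It is not the Java MapReduce code that runs on the cluster: the choice of dimensions is a shortened subset, the input beacons are assumed to be already-parsed dictionaries, and the function names are hypothetical.

```python
from collections import defaultdict

# Shortened, illustrative subset of the dimensions in the example row.
DIMENSIONS = ("computerguid", "userid", "video_id", "hourid")


def map_beacon(beacon):
    """Map phase: emit (dimension key, watched amount) for one beacon."""
    key = tuple(beacon.get(name) for name in DIMENSIONS)
    yield key, int(beacon.get("watched", 0))


def reduce_watched(key, values):
    """Reduce phase: sum the fact for one combination of dimensions."""
    row = dict(zip(DIMENSIONS, key))
    row["watched"] = sum(values)
    return row


def run_job(beacons):
    """Group mapped pairs by key (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for beacon in beacons:
        for key, value in map_beacon(beacon):
            grouped[key].append(value)
    return [reduce_watched(key, values) for key, values in grouped.items()]


if __name__ == "__main__":
    beacons = [
        {"computerguid": "A", "userid": 1, "video_id": 7, "hourid": 100, "watched": 10},
        {"computerguid": "A", "userid": 1, "video_id": 7, "hourid": 100, "watched": 20},
        {"computerguid": "B", "userid": 2, "video_id": 7, "hourid": 100, "watched": 5},
    ]
    for row in run_job(beacons):
        print(row)
```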

"If a program manipulates a large amount of data, it does so in a small number of ways." - Alan Perlis

The BeaconSpec compiler

[Diagram: compiler inputs and outputs]
• Input: definitions of beacons and base-facts
• Beaconspec compiler
• Output: Java MapReduce code that can run on the cluster

What does our language look like?

basefact playback_watched_uniques from playback/(position|end) {
    dimension harpyhour.id as hourid;
    dimension computerguid as computerguid;
    dimension userid as userid;
    required dimension video.id as video_id;
    required dimension contentPartner.id as content_partner_id;
    …
    dimension siteSessionId.chosen as site_session_id;
    dimension facebook.isfacebookconnected as is_facebook_connected;
    fact sum(watched.out) as watched;
}
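One way to read a definition like this is as data that drives a generic aggregation: each `dimension` becomes a group-by column, and the `fact sum(...)` becomes the summed value. The Python sketch below interprets a hand-written, shortened version of the definition. Everything in it is an assumption made for illustration: the real compiler emits Java MapReduce code rather than interpreting anything, the `SPEC` dictionary is a hypothetical in-memory form, and treating `required` as "skip the beacon if the field is missing" is a guess at the semantics, not something the slides state.

```python
from collections import defaultdict

# Hypothetical in-memory form of a shortened basefact definition.
# Each dimension is (source field, output column, required?).
SPEC = {
    "name": "playback_watched_uniques",
    "dimensions": [
        ("computerguid", "computerguid", False),
        ("userid", "userid", False),
        ("video.id", "video_id", True),
    ],
    "fact": ("watched.out", "watched"),  # summed per group
}


def build_basefacts(spec, beacons):
    """Aggregate beacons into basefact rows as described by spec."""
    dimensions = spec["dimensions"]
    source_fact, output_fact = spec["fact"]
    totals = defaultdict(int)
    for beacon in beacons:
        # Assumed meaning of "required": skip beacons lacking the field.
        if any(req and beacon.get(src) is None for src, _, req in dimensions):
            continue
        key = tuple(beacon.get(src) for src, _, _ in dimensions)
        totals[key] += beacon.get(source_fact, 0)
    columns = [out for _, out, _ in dimensions]
    rows = []
    for key, total in totals.items():
        row = dict(zip(columns, key))
        row[output_fact] = total
        rows.append(row)
    return rows


if __name__ == "__main__":
    beacons = [
        {"computerguid": "A", "userid": 1, "video.id": 7, "watched.out": 10},
        {"computerguid": "A", "userid": 1, "video.id": 7, "watched.out": 15},
        {"computerguid": "A", "userid": 1, "watched.out": 99},  # no video.id
    ]
    print(build_basefacts(SPEC, beacons))
```

The point of the declarative form is that adding a new basefact means writing a new definition, not a new job.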

FAQ: Why didn’t we just use Pig?

"The superior [program] cultivates itself so as to give rest to [programmers]." - Confucius, The Way of the Superior Man

Scheduling jobs

[Diagram: components of job scheduling]
• Outside world
• JobScheduler Interface
• Logmanager databases
• JobScheduler: checks databases for jobs that are ready to run and whether dependencies are met
• JobMonitor and MapReduce job (three pairs shown)
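The scheduler's core check, which jobs are ready to run and have all their dependencies met, can be sketched in a few lines of Python. The job records here are plain dictionaries standing in for rows in the Logmanager databases; the field names (`status`, `depends_on`), the status values, and the job names are all assumptions made for the example.

```python
def ready_jobs(jobs):
    """Return names of pending jobs whose dependencies have all finished.

    jobs maps a job name to {"status": ..., "depends_on": [names]}.
    """
    done = {name for name, job in jobs.items() if job["status"] == "done"}
    return sorted(
        name
        for name, job in jobs.items()
        if job["status"] == "pending" and set(job["depends_on"]) <= done
    )


if __name__ == "__main__":
    jobs = {
        "collect_logs": {"status": "done", "depends_on": []},
        "playback_basefacts": {"status": "pending", "depends_on": ["collect_logs"]},
        "daily_rollup": {"status": "pending", "depends_on": ["playback_basefacts"]},
    }
    print(ready_jobs(jobs))  # daily_rollup must wait for playback_basefacts
```

In the design on the slide, each job that passes this check would be handed to its own JobMonitor, which watches the corresponding MapReduce job.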

JobScheduler technology

• The actor model of concurrency
  – Communication through async messaging
  – Completely encapsulated state

Actor creation

Message passing

Central idea: Treat local objects as if they are distributed, as opposed to treating distributed objects as if they are local

Fault-tolerance – let it crash!
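The ideas on these slides (asynchronous messages, fully encapsulated state, and "let it crash") can be illustrated with a toy actor built from a thread and a queue. This is a teaching sketch in Python, not the technology JobScheduler uses, which the slides do not name. In a real actor system a separate supervisor decides how to restart a failed actor; here, as a simplification, the actor's own loop resets its state after a failure.

```python
import queue
import threading

_STOP = object()  # private sentinel that ends the actor's loop


class Actor:
    """Owns its state; the only way to affect it is to send a message."""

    def __init__(self, handler, initial_state):
        self._handler = handler
        self._initial_state = initial_state
        self._state = initial_state
        self._mailbox = queue.Queue()
        self.restarts = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                return
            try:
                self._state = self._handler(self._state, message)
            except Exception:
                # "Let it crash": drop the possibly corrupt state and
                # start again from a known-good initial state.
                self._state = self._initial_state
                self.restarts += 1


def counter(state, message):
    """Example behaviour: ("add", n) or ("get", reply_queue)."""
    kind, payload = message
    if kind == "add":
        return state + payload
    if kind == "get":
        payload.put(state)  # reply through a queue, never shared state
        return state
    raise ValueError("unknown message: %r" % (kind,))


if __name__ == "__main__":
    actor = Actor(counter, 0)
    reply = queue.Queue()
    actor.send(("add", 2))
    actor.send(("add", 3))
    actor.send(("get", reply))
    print(reply.get(timeout=2))
    actor.stop()
```

Because messages are processed one at a time from the mailbox, the handler never needs a lock around the actor's state.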

Harpy – continuous aggregations

[Diagram: components of Harpy]
• HDFS
• NFS
• Metadata
• Output DBs
• Harpy
• DataSync
• Publishing
• HoldingDB
• HoldingSweeper
• Agg Scheduler
• Queue Processor
• Hive

RP2

• Reporting Portal for pulling Metrics + Dimensions

• Quick ‘Demo’

Let’s re-examine the pipeline:

[Diagram: the same pipeline as under "The pipeline", from Devices and the Beacon collection service through LogCollector/Flume, HDFS, MapReduce jobs/JobScheduler, Hive, Harpy and the RDBMS to Reporting (RP2) and Monitoring (metstat), used by Developers and Business]

Metstat

• Python Django app
• Tasks on Celery + RabbitMQ
• jQuery
• Tracks status, status changes and statistics
• Gets data directly from various sources (databases, HDFS)

FAQ: Why didn’t we just use Pig?

• Dataflow language – runs on Hadoop
• Pig philosophy (taken from the Apache website)
  – Pigs eat anything
  – Pigs live anywhere
  – Pigs are domestic animals
  – Pigs fly

Beaconspec

"Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy." - Alan Perlis

REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;

Beaconspec

FAQ: What is open sourced?

• Slickint – database interface generation for Scala
  – github.com/zenbowman/slickint
• Local filesystem caching for Hadoop
  – github.com/ZenBowman/luna
