Luigi presentation OA Summit

OA NYC Summit

Data Workflows at Foursquare using Luigi

Foursquare

•  35 million users

•  Nearly 4 billion check-ins

•  More than 5 million check-ins per day

•  Database of 50 million points of interest

•  100s of GB of log data per day

Tools We Use

•  Hive
   o  Ad hoc analytics, data dumping ground

•  Raw MapReduce
   o  100s of MapReduce jobs in our codebase

•  Pig
   o  Fits between structured Hive and free-form MapReduce

•  Vertica
   o  Low-latency analytics

Cron

E.g.

0 0 * * * ./hadoop-script-1.sh

# Wait two hours for that job to finish...

0 2 * * * ./hadoop-script-2.sh

# And on and on and on

Cron - Problems

•  Brittle

•  Hard to reason about / visualize

•  Spend a lot of time waiting

•  Difficult to tell what succeeded or failed

•  No one likes writing Bash scripts

Oozie

XML-based Workflow Engine, with support for Hadoop, Hive, and Pig

Workflows specify computations in a DAG, e.g. "Run this Hive query, then run these two MapReduce jobs in parallel"

Coordinators launch recurring workflows at a given frequency, once the data they depend on is available
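
To make the DAG idea concrete, here is a minimal Python sketch of the quoted workflow. It is not Oozie syntax (Oozie workflows are written in XML); the step names `hive_query`, `mapreduce_a`, and `mapreduce_b` and the helper `execution_waves` are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a workflow engine's ordering logic.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on (hypothetical names).
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock later steps
    return waves


waves = execution_waves(workflow)
for number, steps in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(steps)}")
```

The Hive query lands in the first wave on its own, and both MapReduce jobs land in the second wave because they depend only on the query.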

Oozie - Example

Oozie - Problems

•  Workflows are all-or-nothing
   o  Cannot just run the step that failed
   o  Very little code reuse

•  Little to no extensibility

•  Limited control flow

•  Extremely verbose

•  Difficult to test

•  No one likes writing XML

Luigi

•  Python framework for batch processing jobs

•  Created by Spotify, open-sourced Sept. 2012

•  Tasks are units of work that produce Targets

•  Tasks can depend on one or more other Tasks

•  A Task is only run once all of the Tasks it depends on are done

•  Tasks are idempotent
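
The Task/Target model in the bullets above can be sketched with the standard library alone. This is a toy model under stated assumptions, not Luigi's implementation: `ToyTask`, `build`, and the file names are hypothetical, and "done" is taken to mean "the output file exists".

```python
import os
import tempfile


class ToyTask:
    """Hypothetical stand-in for a workflow task; not Luigi's Task class."""

    def __init__(self, name, out_dir, requires=()):
        self.name = name
        self.requires = list(requires)
        self.target = os.path.join(out_dir, name + ".out")

    def complete(self):
        # A task counts as done when its target (an output file) exists.
        return os.path.exists(self.target)

    def run(self):
        with open(self.target, "w") as f:
            f.write("output of " + self.name + "\n")


def build(task, log):
    """Run dependencies first, then the task, skipping finished work."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, log)
    task.run()
    log.append(task.name)


out_dir = tempfile.mkdtemp()
extract = ToyTask("extract", out_dir)
report = ToyTask("report", out_dir, requires=[extract])

first_run, second_run = [], []
build(report, first_run)   # runs extract, then report
build(report, second_run)  # nothing to do: both targets already exist
print(first_run, second_run)
```

Because completeness is derived from the targets, building the same task twice is harmless, which is the practical meaning of idempotence here.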

Luigi - Example Task

Luigi - Running the Task

$ python word-count.py WordCount --date 2013-06-01

Luigi - Scheduler

Central scheduler ensures each Task is only run by a single worker.

A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)

Will retry failed Tasks after a configured timeout

Emails someone when a Task fails
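
The scheduler behaviour described above can be sketched as a toy in-memory model. This is an illustration under assumptions, not Luigi's scheduler: `task_id`, `ToyScheduler`, `claim`, `report`, and the 60-second `retry_delay` are all hypothetical, and failure emails are left out.

```python
import time


def task_id(class_name, **params):
    """Build an identifier such as 'WordCount(date=2013-06-01)'."""
    args = ", ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"{class_name}({args})"


class ToyScheduler:
    """Hypothetical central scheduler; not Luigi's implementation."""

    def __init__(self, retry_delay=60.0):
        self.retry_delay = retry_delay
        self.status = {}     # task id -> "RUNNING", "DONE", or "FAILED"
        self.failed_at = {}  # task id -> time of the last failure

    def claim(self, tid, now=None):
        """Return True if the calling worker should run the task."""
        now = time.time() if now is None else now
        state = self.status.get(tid)
        if state in ("RUNNING", "DONE"):
            return False  # another worker has it, or it already finished
        if state == "FAILED" and now - self.failed_at[tid] < self.retry_delay:
            return False  # failed recently; wait before retrying
        self.status[tid] = "RUNNING"
        return True

    def report(self, tid, ok, now=None):
        """Record the outcome of a run."""
        now = time.time() if now is None else now
        self.status[tid] = "DONE" if ok else "FAILED"
        if not ok:
            self.failed_at[tid] = now


scheduler = ToyScheduler(retry_delay=60)
tid = task_id("WordCount", date="2013-06-01")
print(tid, scheduler.claim(tid, now=0), scheduler.claim(tid, now=1))
```

Keying on class name plus parameters means two workers asking for the same task share one entry, so only the first `claim` succeeds until the task fails and the retry delay has passed.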

Luigi - Visualizer

Luigi - Advantages over Cron

•  Explicit dependencies

•  No wasted time waiting

•  Easy to tell what has failed

•  Avoid duplicate work / partial failures

Luigi - Advantages over Oozie

•  Explicit dependencies between workflows

•  Easier to write

•  Vastly more extensible

•  Code reuse

•  Can easily re-run individual steps

Thank you!

Check out Luigi: https://github.com/spotify/luigi

Drop me a line: Joe Ennever jennever@foursquare.com