Scaling out big-data computation & machine learning using Pig, Python and Luigi
Ron Reiter, VP R&D, Crosswise
AGENDA § The goal
§ Data processing at Crosswise
§ The basics of prediction using machine learning
§ The “big data” stack
§ An introduction to Pig
§ Combining Pig and Python
§ Workflow management using Luigi and Amazon EMR
THE GOAL 1. Process huge amounts of data points
2. Allow data scientists to focus on their research
3. Adjust production systems according to research conclusions quickly, without duplicating logic between research and production systems
DATA PROCESSING AT CROSSWISE
§ We are building a graph of devices that belong to the same user, based on users’ browsing data
DATA PROCESSING AT CROSSWISE
§ Interesting facts about our data processing pipeline:
§ We process 1.5 trillion data points from 1 billion devices
§ 30TB of compressed data
§ Cluster with 1600 cores running for 24 hours
DATA PROCESSING AT CROSSWISE
§ Our constraints
§ We are dealing with massive amounts of data, and we have to go for a solid, proven and truly scalable solution
§ Our machine learning research team uses Python and sklearn
§ We are in a race against time (to market)
§ We do not want the overhead of maintaining two separate processing pipelines, one for research and one for large-scale prediction
PREDICTING AT SCALE
[Diagram: two phases. Model building phase (small / large scale): Labeled Data → Train Model → Evaluate Model → Model. Prediction phase (massive scale): Unlabeled Data + Model → Predict → Output.]
PREDICTING AT SCALE § Steps
§ Training & evaluating the model (iterations on training and evaluation are done until the model’s performance is acceptable)
§ Predicting using the model at massive scale
§ Assumptions
§ Distributed learning is not required
§ Distributed prediction is required
§ Distributed learning can be achieved, but not all machine learning models support it, and not all infrastructures know how to do it
THE “BIG DATA” STACK
[Diagram: the layers of the “big data” stack]
§ Workflow Management: Oozie, Luigi, Azkaban
§ High Level Language: Pig, Hive, Scalding, Spark Program, GraphLab Script
§ Computation Framework: MapReduce, Tez, Spark, GraphLab
§ Resource Manager: YARN, Mesos
PIG
§ Pig is a high-level, SQL-like language which runs on Hadoop
§ Pig also supports User Defined Functions written in Java and Python
HOW DOES PIG WORK?
§ Pig converts SQL-like queries to MapReduce iterations
§ Pig builds a work plan based on a DAG it calculates
§ Newer versions of Pig know how to run on different computation engines, such as Apache Tez and Spark, which offer a higher level of abstraction than MapReduce
[Diagram: the Pig runner compiles the script into a chain of Map → Reduce stages]
PIG DIRECTIVES The most common Pig directives are:
§ LOAD/STORE – load and save data sets
§ FOREACH – map function which constructs a new row for each row in a data set
§ FILTER – filters in/out rows that meet a certain criterion
§ GROUP – groups rows by a specific column / set of columns
§ JOIN – joins two data sets based on a specific column
And many more functions: http://pig.apache.org/docs/r0.14.0/func.html
PIG CODE EXAMPLE customers = LOAD 'customers.tsv' USING PigStorage('\t') AS (customer_id, first_name, last_name);
orders = LOAD 'orders.tsv' USING PigStorage('\t') AS (customer_id, price);
aggregated = FOREACH (GROUP orders BY customer_id) GENERATE group AS customer_id, SUM(orders.price) AS price_sum;
joined = JOIN customers BY customer_id, aggregated BY customer_id;
STORE joined INTO 'customers_total.tsv' USING PigStorage('\t');
COMBINING PIG & PYTHON
COMBINING PIG AND PYTHON
§ Pig gives you the power to scale and process data conveniently with an SQL-like syntax
§ Python is easy and productive, and has many useful scientific packages available (sklearn, nltk, numpy, scipy, pandas)
MACHINE LEARNING IN PYTHON USING SCIKIT-LEARN
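A minimal sketch of the train/predict cycle with scikit-learn (the file and column names here are illustrative, not from the original slides):

    import pandas
    from sklearn import linear_model

    # Train on a small labeled sample
    df = pandas.read_csv('labeled.csv')
    clf = linear_model.SGDClassifier()
    clf.fit(df[["a", "b", "c"]].values, df["class"].values)

    # Predict labels for new, unlabeled rows
    unlabeled = pandas.read_csv('unlabeled.csv')
    predictions = clf.predict(unlabeled[["a", "b", "c"]].values)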
PYTHON UDF
§ Pig provides two Python UDF (user-defined function) engines: Jython (JVM) and CPython
§ Mortar (mortardata.com) added support for CPython UDFs, which support scientific packages (numpy, scipy, sklearn, nltk, pandas, etc.)
§ A Python UDF is a function with a decorator that specifies the output schema (since Python is dynamic, the input schema is not required)
    from pig_util import outputSchema

    @outputSchema('value:int')
    def multiply_by_two(num):
        return num * 2
USING THE PYTHON UDF
§ Register the Python UDF:

    REGISTER 'udfs.py' USING streaming_python AS udfs;

§ If you prefer speed over package compatibility, use Jython:

    REGISTER 'udfs.py' USING jython AS udfs;

§ Then, use the UDF within a Pig expression:

    processed = FOREACH data GENERATE udfs.multiply_by_two(num);
CONNECT PIG AND PYTHON JOBS
§ In many common machine learning scenarios, a classifier can be trained with a simple Python script
§ Using the classifier we trained, we can now predict on a massive scale using a Python UDF
§ Requires a higher-level workflow manager, such as Luigi
[Diagram: a Python job trains and pickles the model to S3 (s3://model.pkl); a Pig job then loads it in a Python UDF to predict]
WORKFLOW MANAGEMENT
[Diagram: data flow — tasks are chained by REQUIRES (Task A, Task B, Task C); each task OUTPUTS a target, which the task that requires it USES; targets can live on S3, HDFS, SFTP, a local file, or a DB]
WORKFLOW MANAGEMENT WITH LUIGI
§ Unlike Oozie and Azkaban, which are heavy workflow managers, Luigi is more of a Python package.
§ Luigi works by resolving dependencies, similar to a Makefile (or SCons)
§ Luigi defines an interface of “Tasks” and “Targets”, which we use to connect tasks via dependencies; a minimal illustration follows.
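A minimal illustration of the Task/Target interface (the task and file names are made up for the example):

    import luigi

    class RawLogs(luigi.ExternalTask):
        # An input that already exists, like a prerequisite file in a Makefile
        def output(self):
            return luigi.LocalTarget('logs.tsv')

    class CountLines(luigi.Task):
        def requires(self):
            return RawLogs()

        def output(self):
            return luigi.LocalTarget('line_count.txt')

        def run(self):
            with self.input().open('r') as inp, self.output().open('w') as out:
                out.write(str(sum(1 for _ in inp)))

    # Luigi checks output targets and only runs tasks whose targets are missing
    if __name__ == '__main__':
        luigi.build([CountLines()], local_scheduler=True)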
[Diagram: the per-date dependency graph — for each day (e.g. 2014-01-01, 2014-01-02), the labeled logs feed the trained model, and the trained model plus the unlabeled logs feed the output]
EXAMPLE - TRAIN MODEL LUIGI TASK
§ Let’s see how it’s done:
    import pickle

    import luigi
    import pandas
    from luigi.s3 import S3Target
    from sklearn import linear_model

    class TrainModel(luigi.Task):
        target_date = luigi.DateParameter()

        def requires(self):
            return LabelledLogs(self.target_date)

        def output(self):
            return S3Target('s3://mybucket/model_%s.pkl' % self.target_date)

        def run(self):
            clf = linear_model.SGDClassifier()
            # Read the labelled logs produced by the required task
            with self.input().open('r') as logs:
                df = pandas.read_csv(logs)
            clf.fit(df[["a", "b", "c"]].values, df["class"].values)
            # Pickle the fitted model into the S3 output target
            with self.output().open('w') as fd:
                fd.write(pickle.dumps(clf))
PREDICT RESULTS LUIGI TASK
§ We predict using a Pig task which has access to the pickled model:
    import luigi
    from luigi.s3 import S3Target

    class PredictResults(PigTask):
        PIG_SCRIPT = """
        REGISTER 'predict.py' USING streaming_python AS udfs;
        data = LOAD '$INPUT' USING PigStorage('\t');
        predicted = FOREACH data GENERATE $0 AS user_id, udfs.predict_results(*);
        STORE predicted INTO '$OUTPUT' USING PigStorage('\t');
        """
        PYTHON_UDF = 'predict.py'
        target_date = luigi.DateParameter()

        def requires(self):
            # Both the unlabelled logs and the trained model must exist first
            return {'logs': UnlabelledLogs(self.target_date),
                    'model': TrainModel(self.target_date)}

        def output(self):
            return S3Target('s3://mybucket/results_%s.tsv' % self.target_date)
PREDICTION PIG USER-DEFINED FUNCTION (PYTHON)
§ We can then generate a custom UDF, replacing $MODEL with an actual model file.
§ The model is loaded when the UDF is initialized (this happens on every map/reduce task using the UDF)
    from pig_util import outputSchema
    import numpy
    import pickle

    # $MODEL is substituted with the real model path when the UDF is generated
    clf = pickle.load(download_s3('$MODEL'))

    @outputSchema('value:int')
    def predict_results(feature_vector):
        return clf.predict(numpy.array(feature_vector))[0]
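One way the template substitution might look (a sketch; the template file name and model path are illustrative):

    # Render the UDF template into the final predict.py, baking in the model path
    with open('predict_template.py') as template:
        udf_source = template.read().replace('$MODEL', 's3://mybucket/model_2014-01-01.pkl')

    with open('predict.py', 'w') as udf_file:
        udf_file.write(udf_source)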
PITFALLS
§ For the classifier to work on your Hadoop cluster, you have to install the required packages (numpy, sklearn, etc.) on all of your Hadoop nodes
§ Sending arguments to a UDF is tricky; there is no way to initialize a UDF with arguments. To load a classifier into a UDF, you should generate the UDF from a template with the model you wish to use, as sketched above
CLUSTER PROVISIONING WITH LUIGI
§ To conserve resources, we use clusters only when needed, so we created the StartCluster task (diagram and sketch below):
§ With this mechanism in place, we also have a cron job that kills idle clusters and saves even more money.
§ We use both EMR clusters and clusters provisioned by Xplenty, which provides us with their Hadoop provisioning infrastructure.
[Diagram: PigTask REQUIRES StartCluster; StartCluster OUTPUTS a ClusterTarget, which PigTask USES]
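A minimal sketch of this pattern (ClusterTarget and the provisioning helpers are illustrative, not Crosswise’s actual code):

    import luigi

    class ClusterTarget(luigi.Target):
        # The target "exists" once a cluster with this name is up and reachable
        def __init__(self, cluster_name):
            self.cluster_name = cluster_name

        def exists(self):
            return find_running_cluster(self.cluster_name) is not None  # hypothetical helper

    class StartCluster(luigi.Task):
        cluster_name = luigi.Parameter()

        def output(self):
            return ClusterTarget(self.cluster_name)

        def run(self):
            provision_cluster(self.cluster_name)  # hypothetical: calls the EMR API

A PigTask can then list StartCluster in its requires() and read the cluster details from the target.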
USING LUIGI WITH OTHER COMPUTATION ENGINES
§ Luigi acts like the “glue” of data pipelines, and we use it to interconnect Pig and GraphLab jobs
§ Pig is very convenient for large scale data processing, but it is very weak when it comes to graph analysis and iterative computation
§ One of the main disadvantages of Pig is that it has no conditional statements, so we need to use other tools to complete our arsenal
[Diagram: a pipeline of Pig task → Pig task → GraphLab task]
GRAPHLAB AT CROSSWISE § We use GraphLab to run graph processing at scale – for
example, to run connected components and create “users” from a graph of devices that belong to the same user
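A minimal sketch with GraphLab Create (the input file and column names are illustrative):

    import graphlab

    # Hypothetical edge list: pairs of device ids believed to belong to the same user
    edges = graphlab.SFrame.read_csv('device_pairs.csv')
    g = graphlab.SGraph().add_edges(edges, src_field='device_a', dst_field='device_b')

    # Each connected component of the device graph becomes one "user"
    cc = graphlab.connected_components.create(g)
    print(cc['component_id'])  # maps every device to its component (user) id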
PYTHON API
§ Pig is a “data flow” language, not a full programming language. Its abilities are limited: there are no conditional blocks or loops. Loops are required when trying to reach “convergence”, such as when finding connected components in a graph. To overcome this limitation, a Python API has been created:

    from org.apache.pig.scripting import Pig

    P = Pig.compile(
        "A = LOAD '$input' AS (name, age, gpa);" +
        "STORE A INTO '$output';")
    Q = P.bind({'input': 'input.csv', 'output': 'output.csv'})
    result = Q.runSingle()
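For example, a sketch of looping a Pig job until convergence (the script body and iteration cap are illustrative):

    from org.apache.pig.scripting import Pig

    # One propagation step of an iterative algorithm, re-run until it converges
    P = Pig.compile(
        "edges = LOAD '$input' AS (src, dst);" +
        "STORE edges INTO '$output';")

    input_path = 'graph_0'
    for i in range(10):  # cap the number of iterations
        output_path = 'graph_%d' % (i + 1)
        result = P.bind({'input': input_path, 'output': output_path}).runSingle()
        if not result.isSuccessful():
            raise RuntimeError('iteration %d failed' % i)
        input_path = output_path
        # a real job would check a convergence metric here and break early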
CROSSWISE HADOOP SSH JOB RUNNER
STANDARD LUIGI WORKFLOW
§ Standard Luigi Hadoop tasks need a correctly configured Hadoop client to launch jobs.
§ This can be a pain when working against an automatically provisioned Hadoop cluster (e.g. an EMR cluster).
[Diagram: Luigi uses a configured Hadoop client to talk to the NameNode and Job Tracker on the Hadoop master node, which drives the slave nodes]
LUIGI HADOOP SSH RUNNER
§ At Crosswise, we implemented a Luigi task for running Hadoop JARs (e.g. Pig) remotely, just as the Amazon EMR API enables.
§ Instead of launching steps through the EMR API, we implemented our own runner to enable running steps concurrently (a sketch follows the diagram below).
[Diagram: Luigi connects over API / SSH to Hadoop client instances, which submit jobs to the cluster master node and its EMR slave nodes]
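A minimal sketch of the remote-runner idea using paramiko (the host, user, and Pig invocation are illustrative):

    import paramiko

    def run_pig_step(master_host, script_path):
        # Connect to the cluster master node and launch the Pig script there,
        # so no locally configured Hadoop client is needed
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(master_host, username='hadoop')
        try:
            stdin, stdout, stderr = client.exec_command('pig -f %s' % script_path)
            exit_status = stdout.channel.recv_exit_status()  # block until the job ends
            if exit_status != 0:
                raise RuntimeError(stderr.read())
        finally:
            client.close()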
WHY RUN HADOOP JOBS EXTERNALLY?
Working with the EMR API is convenient, but Luigi expects to run jobs from the master node rather than through the EMR job submission API
Advantages:
§ Doesn’t require a locally configured Hadoop client
§ Allows provisioning clusters as a task (using Amazon EMR’s API, for example)
§ The same Luigi process can utilize several Hadoop clusters at once
NEXT STEPS AT CROSSWISE
§ We are planning to move to Apache Tez, since MapReduce has high overhead for complicated processes and the framework is hard to tweak and utilize properly
§ We are also investigating Dato’s distributed data processing, training and prediction capabilities at scale (using GraphLab Create)
QUESTIONS?
THANK YOU!