Python in an Evolving Enterprise System (PyData SV 2013)


DESCRIPTION

Video can be found here: https://vimeo.com/63253563



PYTHON IN AN EVOLVING ENTERPRISE SYSTEM EVALUATING INTEGRATION SOLUTIONS WITH HADOOP

DAVE HIMROD

STEVE KANNAN

ANGELICA PANDO

Building today’s most powerful, open, and customizable advertising technology platform.

Ad is served in <100 milliseconds.

[Diagram: a 300x250 ad slot triggers an AUCTION REQUEST; ADVERTISER 1 bids $2.50, ADVERTISER 2 bids $3.25, ADVERTISER 3 bids $4.10; APPNEXUS OPTIMIZATION selects the WINNING BID and returns the AD RESPONSE.]

Evolution of AppNexus

•  PEOPLE: 20 → 350 → 430

•  AD REQUESTS: 100M → 39B → 45B

•  Data from MySQL, Hadoop/HBase, Aerospike, Netezza, Vertica

•  5000+ SERVERS

•  38+ TB OF DATA EVERY DAY

•  UPTIME: 99.99%

Evolution of AppNexus

ENGINEERING HQ IN NYC

ENG OFFICES IN PORTLAND & SF

Data-Driven Decisioning (D3)

[Diagram: DATA PIPELINE → D3 PROCESSING → BIDDERS (Bidder × 3)]

Python at AppNexus

Python enables us to scale our team and rapidly iterate and prototype technologies.

•  1 PB across several clusters

•  862 nodes

•  40 billion log records daily

•  5.6 billion log records/hour at peak

Hadoop at AppNexus

Hadoop enables us to do aggregations for reporting and other data pipeline jobs.

Data modeling today

[Diagram: Tasks → logs → VERTICA → CACHE]

HADOOP

DATA SERVICES

Σ DATA DRIVEN DECISIONING

BIG DATA: TBs/hour · MEDIUM DATA: GBs/hour

To enable the next generation of data modeling, we need to leverage our Hadoop cluster

What are we trying to do?

Access the data on Hadoop

Continue to use Python to model

→ No consensus on the best solution

So we conducted our own research to evaluate integration options

The budget problem

We have thousands of bidders buying billions of ads per hour in real-time auctions.

We need to create a model that can manipulate how our bidders spend their budgets and purchase ads.

Data modeling today

[Diagram: Tasks → logs → VERTICA → CACHE]

HADOOP

DATA SERVICES

Σ DATA DRIVEN DECISIONING

BIG DATA: TBs/hour · MEDIUM DATA: GBs/hour


Test problem: Budget aggregation

SCENARIO: Each auction creates a row in a log.

timestamp, auction_id, object_type, object_id, method, value

We need to aggregate and model to update bidders.

Method: Budget aggregation

STEP 1: De-duplicate records on KEY: (object_type, object_id, method, auction_id)

STEP 2: Aggregate value on KEY: (object_type, object_id, method)
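The two steps can be sketched as a plain-Python reference implementation (a toy version for clarity, not the benchmarked code; the row layout follows the log schema above):

```python
from collections import defaultdict

def aggregate_budget(rows):
    """Toy reference implementation of the two-step budget aggregation.

    Each row follows the log schema:
    (timestamp, auction_id, object_type, object_id, method, value).
    """
    seen = set()                       # 4-field de-dup keys already processed
    totals = defaultdict(float)        # 3-field budget key -> summed value
    for timestamp, auction_id, object_type, object_id, method, value in rows:
        dedup_key = (object_type, object_id, method, auction_id)
        if dedup_key in seen:          # STEP 1: drop duplicate log records
            continue
        seen.add(dedup_key)
        totals[(object_type, object_id, method)] += float(value)  # STEP 2
    return dict(totals)

rows = [
    ("2013-03-06 19:00:01", "a1", "campaign", "7", "spend", "2.50"),
    ("2013-03-06 19:00:01", "a1", "campaign", "7", "spend", "2.50"),  # duplicate
    ("2013-03-06 19:00:02", "a2", "campaign", "7", "spend", "1.25"),
]
print(aggregate_budget(rows))  # {('campaign', '7', 'spend'): 3.75}
```

The in-memory version fits on one machine only for small inputs; the rest of the talk is about expressing exactly this computation on Hadoop.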

HARDWARE

•  300 GB of log data

•  5 nodes running Scientific Linux 6.3 (Carbon)

•  Intel Xeon CPU @ 2.13 GHz, 4 cores

•  2 TB Disk

•  CDH4

•  45 map, 35 reduce tasks at a time

Research: Potential solutions

1. Native Java

2. Streaming ‒ no framework

3. mrjob

4. Happy / Jython / PyCascading

5. Pig + Jython UDF

6. Pydoop ‒ prohibitive installation

7. Disco ‒ a separate framework (we are evaluating Hadoop)

8. Hadoopy / dumbo ‒ similar to mrjob

9. Hipy ‒ effectively an ORM for Hive

Research: Criteria

1. Usability

2. Performance

3. Versatility / Flexibility

Research: Native Java

Benchmark for comparison, using the new Hadoop Java API.

[Code slides: BudgetAgg.java Mapper class and Reducer class]

Research: Native Java

USABILITY:
›  Not straightforward for analysts to implement, launch, or tweak

PERFORMANCE:
›  Fastest implementation
›  Can further enhance by overriding comparators for grouping and sorting

Research: Native Java

VERSATILITY / FLEXIBILITY:

›  Ability to customize pretty much everything

›  Custom Partitioner, Comparator, Grouping Comparator in our implementation

›  Can use complex objects as keys or values

Research: Streaming

Supplies an executable to Hadoop that reads from stdin and writes to stdout

[Code slides: mapper.py and reducer.py]

Research: Streaming

USABILITY:
›  Key/value detection has to be done by the user
›  Still, straightforward for relatively simple jobs

hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
  -D stream.num.map.output.key.fields=4 \
  -D num.key.fields.for.partition=3 \
  -D mapred.reduce.tasks=35 \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer_nongroup.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input /logs/log_budget/v002/2013/03/06/19/ \
  -output bidder_logs/streaming_output

Research: Streaming

PERFORMANCE:
›  ~50% slower than Java

VERSATILITY / FLEXIBILITY:
›  Inputs in reducer are iterated line-by-line
›  Straightforward to get de-duplication and agg to work in a single step
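The talk elides the contents of mapper.py and reducer.py, but a minimal local sketch of what they could look like is below (hypothetical code, not the slides' scripts; sorted() stands in for Hadoop's shuffle/sort between the two phases):

```python
from itertools import groupby

def map_line(line):
    """mapper.py body: emit a tab-separated record keyed on the 4 sort fields.

    Hadoop sorts on the first 4 output fields (stream.num.map.output.key.fields=4)
    and partitions on the first 3 (num.key.fields.for.partition=3), so every
    record for one budget key reaches the same reducer with duplicates adjacent.
    """
    ts, auction_id, object_type, object_id, method, value = line.rstrip("\n").split(",")
    return "\t".join([object_type, object_id, method, auction_id, value])

def reduce_sorted(lines):
    """reducer.py body: iterate sorted input line by line, de-duplicating on
    auction_id and summing value per (object_type, object_id, method) in one pass."""
    records = (l.rstrip("\n").split("\t") for l in lines)
    for budget_key, group in groupby(records, key=lambda f: f[:3]):
        total, last_auction = 0.0, None
        for fields in group:
            if fields[3] == last_auction:   # duplicate record for this auction
                continue
            last_auction = fields[3]
            total += float(fields[4])
        yield "\t".join(budget_key + [repr(total)])

# Local simulation of the map -> sort -> reduce pipeline:
mapped = sorted(map_line(l) for l in [
    "2013-03-06 19:00:01,a1,campaign,7,spend,2.50",
    "2013-03-06 19:00:01,a1,campaign,7,spend,2.50",  # duplicate
    "2013-03-06 19:00:02,a2,campaign,7,spend,1.25",
])
print(list(reduce_sorted(mapped)))  # ['campaign\t7\tspend\t3.75']
```

Because the shuffle delivers duplicates adjacently, de-dup and aggregation collapse into a single reduce pass, matching the "single step" point above.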

Research: mrjob

Open-source Python framework that wraps Hadoop Streaming

USABILITY:
›  “Simplified Java”
›  Great docs, actively developed

python budget_agg.py -r hadoop --hadoop-bin /usr/bin/hadoop \
  --jobconf stream.num.map.output.key.fields=4 \
  --jobconf num.key.fields.for.partition=3 \
  --jobconf mapred.reduce.tasks=35 \
  --partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -o hdfs:///user/apando/budget_logs/mrjob_output \
  hdfs:///logs/log_budget/v002/2013/03/06/19/

Research: mrjob

PERFORMANCE:
›  Not much slower than Streaming if only using RawValueProtocol

Research: mrjob

PERFORMANCE:
›  Involving objects or multiple steps slows it down a lot

VERSATILITY / FLEXIBILITY:

›  Can define Input / Internal / Output protocols

Research: Happy / Jython

HAPPY:
›  Full access to Java MapReduce API
›  Happy project is deprecated
›  Depends on Hadoop 0.17

JYTHON:
›  Doesn’t work easily out of the box
›  Relies on deprecated Jython compiler in Jython 2.2
›  Limited to Jython implementation of Python
›  NumPy/SciPy and Pandas unavailable

Research: PyCascading

Python wrapper around Cascading framework for data processing workflow.

Uses Jython as high level language for defining workflows.

Research: PyCascading

USABILITY:
›  Relatively new project
›  Cascading API is simple and intuitive
›  Job Planner abstracts details of MapReduce

PERFORMANCE:
›  Abstraction makes performance tuning challenging
›  Does not support the Combiner operation
›  Dev time was fast, runtime was slow

Research: PyCascading

VERSATILITY / FLEXIBILITY:
›  Allows Jython UDFs
›  Rich set of built-in functions: GroupBy, Join, Merge

Research: Pig

Provides a high-level language for data analysis which is compiled into a sequence of MapReduce operations.


Research: Pig

USABILITY:
›  Powerful debugging and optimization tools (e.g. explain, illustrate)
›  Automatically optimizes MapReduce operations:
   ›  Applies Combiner operations where applicable
   ›  Reorders and conflates data flow for efficiency

Research: Pig

PERFORMANCE:
›  Pig compiler produces performant code
›  Complex operations might require manual optimization
›  Budget aggregation required implementing a User Defined Function in Jython to eliminate an unnecessary MapReduce step

Research: Pig

VERSATILITY / FLEXIBILITY (USING PIG + JYTHON UDF):

›  PigLatin is expressive and can capture most use cases

›  Define custom data operations in Jython called UDFs

›  UDFs can implement custom loaders, partitioners, and other advanced features
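A UDF of the kind described, collapsing de-dup and aggregation into one step, might look roughly like this (hypothetical names; shown as plain Python so the logic is runnable here — in an actual Pig script it would be registered and decorated as sketched in the comments):

```python
# In a real Pig script this would be wired up roughly as:
#   REGISTER 'budget_udfs.py' USING jython AS budget;
#   totals = FOREACH grouped GENERATE group, budget.dedup_sum($1);
# with the function decorated by @outputSchema('total:double') from pig_util.

def dedup_sum(bag):
    """Sum value over a bag of (auction_id, value) tuples, counting each
    auction_id only once -- de-dup and aggregation in a single operation,
    avoiding the extra MapReduce pass mentioned above."""
    seen, total = set(), 0.0
    for auction_id, value in bag:
        if auction_id not in seen:
            seen.add(auction_id)
            total += float(value)
    return total

print(dedup_sum([("a1", "2.50"), ("a1", "2.50"), ("a2", "1.25")]))  # 3.75
```

Pushing the de-dup into the UDF lets Pig's GROUP BY do a single shuffle instead of one pass to DISTINCT and a second to aggregate.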

Research: Summary

[Chart: Running Time (minutes) and Lines of Code for each implementation ‒ Java, Streaming, MRJob, PyCascading, Pig]

Research: Recommendations

•  Pig and PyCascading enable complex pipelines to be expressed simply

•  Pig is more mature and the most viable option for ad-hoc analysis

QUESTIONS?


pydata@appnexus.com