Python in an Evolving Enterprise System (PyData SV 2013)


DESCRIPTION

Video can be found here: https://vimeo.com/63253563

Page 2: Python in an Evolving Enterprise System (PyData SV 2013)

PYTHON IN AN EVOLVING ENTERPRISE SYSTEM EVALUATING INTEGRATION SOLUTIONS WITH HADOOP

DAVE HIMROD

STEVE KANNAN

ANGELICA PANDO

Page 3: Python in an Evolving Enterprise System (PyData SV 2013)

Building today’s most powerful, open, and customizable advertising technology platform.

Page 4: Python in an Evolving Enterprise System (PyData SV 2013)

Ad is served in <100 milliseconds

300x250

AUCTION REQUEST

AD RESPONSE BID: $2.50

ADVERTISER 1

BID: $3.25

ADVERTISER 2

BID: $4.10

ADVERTISER 3

APPNEXUS OPTIMIZATION

WINNING BID

Page 5: Python in an Evolving Enterprise System (PyData SV 2013)

Evolution of AppNexus

PEOPLE: 350, 430, 20

AD REQUESTS: 45B, 39B, FROM 100M

MYSQL, HADOOP/HBASE, AEROSPIKE, NETEZZA, VERTICA

5000+ SERVERS

38+ TB OF DATA EVERY DAY

UPTIME: 99.99%

Page 6: Python in an Evolving Enterprise System (PyData SV 2013)

Evolution of AppNexus

ENGINEERING HQ IN NYC

ENG OFFICES IN PORTLAND & SF

Page 7: Python in an Evolving Enterprise System (PyData SV 2013)

Data-Driven Decisioning (D3)

DATA PIPELINE

D3 PROCESSING

BIDDERS

Page 8: Python in an Evolving Enterprise System (PyData SV 2013)

Python at AppNexus Python enables us to scale our team and rapidly iterate and prototype technologies.

Page 9: Python in an Evolving Enterprise System (PyData SV 2013)

Hadoop at AppNexus

1 PB CLUSTER

862 NODES ACROSS SEVERAL CLUSTERS

40 BILLION LOG RECORDS DAILY

5.6 BILLION LOG RECORDS/HOUR AT PEAK

Hadoop enables us to do aggregations for reporting and other data pipeline jobs

Page 10: Python in an Evolving Enterprise System (PyData SV 2013)

Data modeling today

[Diagram components: tasks writing logs, Hadoop, Vertica, cache, data services, aggregation (Σ), Data-Driven Decisioning]

BIG DATA: TBS/HOUR

MEDIUM DATA: GBS/HOUR

Page 11: Python in an Evolving Enterprise System (PyData SV 2013)

To enable the next generation of data modeling, we need to leverage our Hadoop cluster

Page 12: Python in an Evolving Enterprise System (PyData SV 2013)

What are we trying to do

Access the data on Hadoop

Continue to use Python to model

→ No consensus on the best solution

So we conducted our own research to evaluate integration options

Page 13: Python in an Evolving Enterprise System (PyData SV 2013)

The budget problem

We have thousands of bidders buying billions of ads per hour in real-time auctions.

We need to create a model that can manipulate how our bidders spend their budgets and purchase ads.

Page 14: Python in an Evolving Enterprise System (PyData SV 2013)

Data modeling today

[Diagram components: tasks writing logs, Hadoop, Vertica, cache, data services, aggregation (Σ), Data-Driven Decisioning]

BIG DATA: TBS/HOUR

MEDIUM DATA: GBS/HOUR

Page 15: Python in an Evolving Enterprise System (PyData SV 2013)

Test problem: Budget aggregation

SCENARIO: Each auction creates a row in a log.

timestamp, auction_id, object_type, object_id, method, value

We need to aggregate and model to update bidders.

Page 16: Python in an Evolving Enterprise System (PyData SV 2013)

Method: Budget aggregation

STEP 1: De-duplicate records where

KEY: object_type, object_id, method, auction_id

STEP 2: Aggregate value where

KEY: object_type, object_id, method
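The two steps above can be sketched in plain Python on a handful of rows. This is a minimal in-memory illustration, not the code used in the evaluation. It assumes comma-separated rows in the column order shown on the test-problem slide, a numeric value field, and that the first of several duplicate records is the one to keep. The sample rows and the names parse_row and budget_aggregate are hypothetical.

```python
from collections import defaultdict

# Column order taken from the log layout on the test-problem slide.
FIELDS = ("timestamp", "auction_id", "object_type", "object_id", "method", "value")


def parse_row(line):
    """Split one comma-separated log line into a dict keyed by FIELDS."""
    parts = [p.strip() for p in line.split(",")]
    row = dict(zip(FIELDS, parts))
    row["value"] = float(row["value"])  # assumption: value is numeric
    return row


def budget_aggregate(lines):
    """De-duplicate on the 4-field key, then sum value on the 3-field key."""
    seen = set()                 # STEP 1: de-duplication keys already counted
    totals = defaultdict(float)  # STEP 2: running sums per aggregation key
    for line in lines:
        row = parse_row(line)
        agg_key = (row["object_type"], row["object_id"], row["method"])
        dedup_key = agg_key + (row["auction_id"],)
        if dedup_key in seen:
            continue  # duplicate record: only the first occurrence counts
        seen.add(dedup_key)
        totals[agg_key] += row["value"]
    return dict(totals)


# Hypothetical sample rows, made up for illustration.
sample = [
    "1362596400, a1, campaign, 42, spend, 2.50",
    "1362596401, a1, campaign, 42, spend, 2.50",  # duplicate auction_id for the same key
    "1362596402, a2, campaign, 42, spend, 1.25",
    "1362596403, a3, advertiser, 7, spend, 4.00",
]
print(budget_aggregate(sample))
```

On the sample rows the duplicate for auction a1 is dropped, so the campaign key sums to 3.75 and the advertiser key to 4.0. Holding every de-duplication key in a set only works when the data fits in memory, which is why the evaluated solutions push this work into MapReduce.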

Page 17: Python in an Evolving Enterprise System (PyData SV 2013)

HARDWARE

•  300 GB of log data

•  5 nodes running Scientific Linux 6.3 (Carbon)

•  Intel Xeon CPU @ 2.13 GHz, 4 cores

•  2 TB Disk

•  CDH4

•  45 map, 35 reduce tasks at a time

Page 18: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Potential solutions

1. Native Java

2. Streaming ‒ no framework

3. mrjob

4. Happy / Jython / PyCascading

5. Pig + Jython UDF

6. Pydoop

7. Disco

8. Hadoopy / dumbo

9. Hipy

evaluating Hadoop

Effectively ORM for Hive

similar to mrjob

prohibitive installation

Page 19: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Criteria

1. Usability

2. Performance

3. Versatility / Flexibility

Page 20: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Native Java

Benchmark for comparison, using new Hadoop Java API

BudgetAgg.java Mapper class

BudgetAgg.java Reducer class

Page 21: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Native Java

USABILITY:

›  Not straightforward for analysts to implement, launch, or tweak

PERFORMANCE:

›  Fastest implementation

›  Can be further enhanced by overriding comparators for grouping and sorting

Page 22: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Native Java

VERSATILITY / FLEXIBILITY:

›  Ability to customize pretty much everything

›  Custom Partitioner, Comparator, Grouping Comparator in our implementation

›  Can use complex objects as keys or values

Page 23: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Streaming

Supplies an executable to Hadoop that reads from stdin and writes to stdout

mapper.py reducer.py
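The slide's mapper.py is not reproduced in this transcript. As a rough sketch of the stdin/stdout contract, a streaming mapper for the budget log could look like the following. It assumes tab-separated input in the column order of the test problem and moves the four key fields to the front of each output line. The function names map_lines and run are hypothetical.

```python
import io
import sys


def map_lines(lines):
    """Yield 'object_type, object_id, method, auction_id, value' joined by tabs."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip malformed rows instead of failing the whole task
        timestamp, auction_id, object_type, object_id, method, value = fields
        # Key fields first, so Hadoop can partition and sort on them.
        yield "\t".join((object_type, object_id, method, auction_id, value))


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read log lines from stdin and write mapped lines to stdout."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")


# Local demonstration with in-memory streams; a real mapper script
# would simply call run() and let Hadoop supply stdin and stdout.
demo_in = io.StringIO("1362596400\ta1\tcampaign\t42\tspend\t2.50\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Because the script only touches stdin and stdout, it can be tested locally by piping a log sample through it before submitting a job.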

Page 24: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Streaming

USABILITY:

›  Key/value detection has to be done by the user

›  Still, straightforward for relatively simple jobs

hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
  -D stream.num.map.output.key.fields=4 \
  -D num.key.fields.for.partition=3 \
  -D mapred.reduce.tasks=35 \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer_nongroup.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input /logs/log_budget/v002/2013/03/06/19/ -output bidder_logs/streaming_output
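As we read the options in the command above, the first four output fields form the sort key while only the first three decide which reducer receives a record. The sketch below imitates that behaviour to show why it matters: every auction_id for one (object_type, object_id, method) lands on the same reducer. The crc32 hash is a stand-in chosen for determinism and is not the hash KeyFieldBasedPartitioner uses. The function name and sample keys are hypothetical.

```python
import zlib


def partition(key_fields, num_partition_fields=3, num_reducers=35):
    """Pick a reducer index from the leading key fields only."""
    prefix = "\t".join(key_fields[:num_partition_fields]).encode("utf-8")
    # Stand-in hash: only the determinism matters for this illustration.
    return zlib.crc32(prefix) % num_reducers


# Hypothetical keys: same aggregation key, different auction_id.
first = partition(("campaign", "42", "spend", "a1"))
second = partition(("campaign", "42", "spend", "a2"))
print(first == second)
```

Since the fourth field is ignored by the partitioner but still part of the sort key, duplicates of one auction arrive at a single reducer next to each other.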

Page 25: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Streaming

PERFORMANCE:

›  ~50% slower than Java

VERSATILITY / FLEXIBILITY:

›  Inputs in reducer are iterated line-by-line

›  Straightforward to get de-duplication and aggregation to work in a single step
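A line-by-line reducer can do both steps in one pass when its input is sorted on all four key fields. The following is a sketch under that assumption, not the slide's reducer. It expects tab-separated lines with the key fields first and the value last, and the name reduce_lines is hypothetical.

```python
def reduce_lines(lines):
    """Consume sorted lines and yield one tab-separated total per 3-field key."""
    current_key = None   # (object_type, object_id, method) being summed
    last_auction = None  # last auction_id seen within current_key
    total = 0.0
    for line in lines:
        object_type, object_id, method, auction_id, value = line.rstrip("\n").split("\t")
        key = (object_type, object_id, method)
        if key != current_key:
            if current_key is not None:
                yield "\t".join(current_key + (str(total),))
            current_key, last_auction, total = key, None, 0.0
        if auction_id == last_auction:
            continue  # sorted input puts duplicates next to each other
        last_auction = auction_id
        total += float(value)
    if current_key is not None:
        yield "\t".join(current_key + (str(total),))  # flush the final key


# Hypothetical, already sorted reducer input.
sorted_input = [
    "advertiser\t7\tspend\ta3\t4.00",
    "campaign\t42\tspend\ta1\t2.50",
    "campaign\t42\tspend\ta1\t2.50",
    "campaign\t42\tspend\ta2\t1.25",
]
for out in reduce_lines(sorted_input):
    print(out)
```

The reducer keeps only the current key, the last auction_id, and a running total, so its memory use stays constant no matter how many records it sees.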

Page 26: Python in an Evolving Enterprise System (PyData SV 2013)

Research: mrjob

Open-source Python framework that wraps Hadoop Streaming

USABILITY:

›  “Simplified Java”

›  Great docs, actively developed

python budget_agg.py -r hadoop --hadoop-bin /usr/bin/hadoop \
  --jobconf stream.num.map.output.key.fields=4 \
  --jobconf num.key.fields.for.partition=3 \
  --jobconf mapred.reduce.tasks=35 \
  --partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -o hdfs:///user/apando/budget_logs/mrjob_output \
  hdfs:///logs/log_budget/v002/2013/03/06/19/

Page 27: Python in an Evolving Enterprise System (PyData SV 2013)

Research: mrjob

PERFORMANCE: ›  Not much slower than Streaming if only using RawValueProtocol

Page 28: Python in an Evolving Enterprise System (PyData SV 2013)

Research: mrjob

PERFORMANCE: ›  Involving objects or multiple steps slow it down a lot

VERSATILITY / FLEXIBILITY:

›  Can define Input /Internal / Output protocols
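A protocol decides how each line is decoded into a key and value and encoded back. The sketch below uses only the standard library to contrast a raw pass-through with a JSON encoding. The classes are hypothetical stand-ins written for this illustration and are not mrjob's own protocol classes. They suggest one plausible reason object-based protocols cost more: every record is parsed and serialized at each step.

```python
import json


class RawValueSketch:
    """Pass lines through untouched: the key is None, the value is the raw line."""

    def read(self, line):
        return None, line

    def write(self, key, value):
        return value


class JSONSketch:
    """Encode key and value as JSON documents separated by a tab."""

    def read(self, line):
        raw_key, raw_value = line.split("\t", 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        return json.dumps(key) + "\t" + json.dumps(value)


key = ["campaign", "42", "spend"]  # hypothetical aggregation key
line = JSONSketch().write(key, 3.75)
print(line)
print(JSONSketch().read(line))
```

With the raw variant the job code splits fields itself, which is more manual but avoids a parse and a dump per record.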

Page 29: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Happy / Jython

HAPPY:

›  Full access to Java MapReduce API

›  Happy project is deprecated

›  Depends on Hadoop 0.17

JYTHON:

›  Doesn’t work easily out of the box

›  Relies on deprecated Jython compiler in Jython 2.2

›  Limited to Jython implementation of Python

›  NumPy/SciPy and Pandas unavailable

Page 30: Python in an Evolving Enterprise System (PyData SV 2013)

Research: PyCascading

Python wrapper around Cascading framework for data processing workflow.

Uses Jython as high level language for defining workflows.

Page 31: Python in an Evolving Enterprise System (PyData SV 2013)

Research: PyCascading

USABILITY:

›  Relatively new project

›  Cascading API is simple and intuitive

›  Job Planner abstracts details of MapReduce

PERFORMANCE:

›  Abstraction makes performance tuning challenging

›  Does not support Combiner operation

›  Dev time was fast, runtime was slow

Page 32: Python in an Evolving Enterprise System (PyData SV 2013)

Research: PyCascading

VERSATILITY / FLEXIBILITY:

›  Allows Jython UDFs

›  Rich set of built-in functions: GroupBy, Join, Merge

Page 33: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Pig

Provides a high-level language for data analysis which is compiled into a sequence of MapReduce operations.

USABILITY:

Page 34: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Pig

USABILITY:

›  Powerful debugging and optimization tools (e.g. explain, illustrate)

›  Automatically optimizes MapReduce operations:

›  Applies Combiner operations where applicable

›  Reorders and conflates data flow for efficiency
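For readers unfamiliar with combiners, the idea is to pre-aggregate map output on each node before it crosses the network. The sketch below shows the concept for a plain sum. It is a generic illustration with hypothetical names and data, not what Pig generates.

```python
from collections import defaultdict


def combine(pairs):
    """Sum values per key on the map side before anything is shuffled."""
    partial = defaultdict(float)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())


# Hypothetical map output from a single node: three records collapse to two.
map_output = [("a", 1.0), ("b", 2.0), ("a", 3.0)]
print(combine(map_output))
```

A sum can be combined safely because partial sums can be summed again in the reducer. Operations without that property, such as de-duplication across nodes, need more care.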

Page 35: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Pig

PERFORMANCE:

›  Pig compiler produces performant code

›  Complex operations might require manual optimization

›  Budget aggregation requires implementing a User Defined Function in Jython to eliminate an unnecessary MapReduce step

Page 36: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Pig

VERSATILITY / FLEXIBILITY: USING PIG + JYTHON UDF

›  PigLatin is expressive and can capture most use cases

›  Define custom data operations in Jython called UDFs

›  UDFs can implement custom loaders, partitioners, and other advanced features

Page 37: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Summary

[Bar chart: Running Time / Lines of Code for Implementations. Two series, running time (minutes) and lines of code, shown for Java, Streaming, MRJob, PyCascading, and Pig on an axis from 0 to 300.]

Page 38: Python in an Evolving Enterprise System (PyData SV 2013)

Research: Recommendations

•  Pig and PyCascading enable complex pipelines to be expressed simply

•  Pig is more mature and the most viable option for ad-hoc analysis

Page 39: Python in an Evolving Enterprise System (PyData SV 2013)

QUESTIONS
