Video can be found here: https://vimeo.com/63253563
[Slide art: AppNexus logo]
PYTHON IN AN EVOLVING ENTERPRISE SYSTEM: EVALUATING INTEGRATION SOLUTIONS WITH HADOOP
DAVE HIMROD
STEVE KANNAN
ANGELICA PANDO
Building today’s most powerful, open, and customizable advertising technology platform.
Ad is served in <100 milliseconds
300x250
AUCTION REQUEST
AD RESPONSE BID: $2.50
ADVERTISER 1
BID: $3.25
ADVERTISER 2
BID: $4.10
ADVERTISER 3
APPNEXUS OPTIMIZATION
WINNING BID
Evolution of AppNexus
PEOPLE 350 430 20
AD REQUESTS 45B 39B 100M FROM
MYSQL, HADOOP/HBASE, AEROSPIKE, NETEZZA, VERTICA
5000+ SERVERS
38+ TB OF DATA EVERY DAY
UPTIME 99.99%
Evolution of AppNexus
ENGINEERING HQ IN NYC
ENG OFFICES IN PORTLAND & SF
Data-Driven Decisioning (D3)
DATA PIPELINE
D3 PROCESSING
Bidder Bidder Bidder BIDDERS
Python at AppNexus Python enables us to scale our team and rapidly iterate and prototype technologies.
1PB CLUSTER, 862 NODES ACROSS SEVERAL CLUSTERS
40 BILLION LOG RECORDS DAILY
5.6 BILLION LOG RECORDS/HOUR AT PEAK
Hadoop enables us to do aggregations for reporting and other data pipeline jobs
Hadoop at AppNexus
Data modeling today
Task Task Task Task logs logs logs logs VERTICA CACHE
HADOOP
DATA SERVICES
Σ DATA DRIVEN DECISIONING
BIG DATA: TBS/HOUR MEDIUM DATA: GBS/HOUR
To enable the next generation of data modeling, we need to leverage our Hadoop cluster
What are we trying to do?
Access the data on Hadoop
Continue to use Python to model
→ No consensus on the best solution
So we conducted our own research to evaluate integration options
The budget problem
We have thousands of bidders buying billions of ads per hour in real-time auctions.
We need to create a model that can manipulate how our bidders spend their budgets and purchase ads.
Test problem: Budget aggregation
SCENARIO: Each auction creates a row in a log.
timestamp, auction_id, object_type, object_id, method, value
We need to aggregate and model to update bidders.
Method: Budget aggregation
STEP 1: De-duplicate records where
KEY: object_type, object_id, method, auction_id
STEP 2: Aggregate value where
KEY: object_type, object_id, method
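The two steps above can be sketched in plain Python. The field layout follows the log format shown earlier; the sample rows are invented for illustration.

```python
# Sketch of the two-step budget aggregation in plain Python.
# Rows follow the log layout: timestamp, auction_id, object_type,
# object_id, method, value (sample rows here are made up).
rows = [
    ("2013-03-06 19:00:01", "a1", "campaign", 7, "spend", 2.50),
    ("2013-03-06 19:00:01", "a1", "campaign", 7, "spend", 2.50),  # duplicate
    ("2013-03-06 19:00:02", "a2", "campaign", 7, "spend", 4.10),
]

# STEP 1: de-duplicate on (object_type, object_id, method, auction_id),
# keeping the first record seen for each key.
deduped = {}
for ts, auction_id, obj_type, obj_id, method, value in rows:
    deduped.setdefault((obj_type, obj_id, method, auction_id), (ts, value))

# STEP 2: aggregate value on (object_type, object_id, method).
totals = {}
for (obj_type, obj_id, method, _), (_, value) in deduped.items():
    key = (obj_type, obj_id, method)
    totals[key] = totals.get(key, 0.0) + value

print(round(totals[("campaign", 7, "spend")], 2))  # 6.6
```

On a cluster the same two steps become the shuffle key (step 1) and the reduce-side sum (step 2), which is what each framework below has to express.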
HARDWARE
• 300 GB of log data
• 5 nodes running Scientific Linux 6.3 (Carbon)
• Intel Xeon CPU @ 2.13 GHz, 4 cores
• 2 TB Disk
• CDH4
• 45 map, 35 reduce tasks at a time
Research: Potential solutions
1. Native Java
2. Streaming ‒ no framework
3. mrjob
4. Happy / Jython / PyCascading
5. Pig + Jython UDF
6. Pydoop ‒ prohibitive installation
7. Disco ‒ evaluating Hadoop
8. Hadoopy / dumbo ‒ similar to mrjob
9. Hipy ‒ effectively an ORM for Hive
Research: Criteria
1. Usability
2. Performance
3. Versatility / Flexibility
Research: Native Java
Benchmark for comparison, using new Hadoop Java API
BudgetAgg.java Mapper class
BudgetAgg.java Reducer class
Research: Native Java
USABILITY: › Not straightforward for analysts to implement, launch, or tweak
PERFORMANCE: › Fastest implementation. › Can further enhance by overriding comparators for grouping and sorting
Research: Native Java
VERSATILITY / FLEXIBILITY:
› Ability to customize pretty much everything
› Custom Partitioner, Comparator, Grouping Comparator in our implementation
› Can use complex objects as keys or values
Research: Streaming
Supplies an executable to Hadoop that reads from stdin and writes to stdout
mapper.py reducer.py
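The mapper.py / reducer.py pair might look roughly like the sketch below (a hedged reconstruction, not the talk's actual scripts): the mapper emits tab-separated key fields so Streaming's sort brings duplicates together, and the reducer de-duplicates and sums. A real job reads sys.stdin; here the shuffle is simulated by sorting.

```python
# Hedged sketch of a Hadoop Streaming mapper/reducer pair for the
# budget aggregation (field order follows the log layout shown earlier:
# timestamp, auction_id, object_type, object_id, method, value).

def map_lines(lines):
    """Emit tab-separated records; the key fields come first so the
    Streaming sort groups duplicate (type, id, method, auction) keys."""
    for line in lines:
        ts, auction_id, obj_type, obj_id, method, value = \
            line.rstrip("\n").split("\t")
        yield "\t".join((obj_type, obj_id, method, auction_id, value))

def reduce_lines(sorted_lines):
    """De-duplicate on the full key, then sum value per (type, id, method)."""
    prev_key, totals = None, {}
    for line in sorted_lines:
        obj_type, obj_id, method, auction_id, value = \
            line.rstrip("\n").split("\t")
        full_key = (obj_type, obj_id, method, auction_id)
        if full_key == prev_key:
            continue  # duplicate record, skip
        prev_key = full_key
        agg_key = (obj_type, obj_id, method)
        totals[agg_key] = totals.get(agg_key, 0.0) + float(value)
    for (obj_type, obj_id, method), total in sorted(totals.items()):
        yield "%s\t%s\t%s\t%.2f" % (obj_type, obj_id, method, total)

# Invented sample input; in a real job each script reads sys.stdin and
# Hadoop performs the sort between the two phases.
sample = [
    "2013-03-06 19:00:01\ta1\tcampaign\t7\tspend\t2.50",
    "2013-03-06 19:00:01\ta1\tcampaign\t7\tspend\t2.50",
    "2013-03-06 19:00:02\ta2\tcampaign\t7\tspend\t4.10",
]

if __name__ == "__main__":
    for out in reduce_lines(sorted(map_lines(sample))):
        print(out)  # campaign	7	spend	6.60
```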
Research: Streaming
USABILITY: › Key/value detection has to be done by the user › Still, straightforward for relatively simple jobs
hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
  -D stream.num.map.output.key.fields=4 \
  -D num.key.fields.for.partition=3 \
  -D mapred.reduce.tasks=35 \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer_nongroup.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input /logs/log_budget/v002/2013/03/06/19/ \
  -output bidder_logs/streaming_output
Research: Streaming
PERFORMANCE: › ~50% slower than Java
VERSATILITY / FLEXIBILITY: › Inputs in reducer are iterated line-by-line › Straightforward to get de-duplication and agg to work in a single step
Research: mrjob
Open-source Python framework that wraps Hadoop Streaming
USABILITY: › “Simplified Java” › Great docs, actively developed

python budget_agg.py -r hadoop --hadoop-bin /usr/bin/hadoop \
  --jobconf stream.num.map.output.key.fields=4 \
  --jobconf num.key.fields.for.partition=3 \
  --jobconf mapred.reduce.tasks=35 \
  --partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -o hdfs:///user/apando/budget_logs/mrjob_output \
  hdfs:///logs/log_budget/v002/2013/03/06/19/
Research: mrjob
PERFORMANCE: › Not much slower than Streaming if only using RawValueProtocol
Research: mrjob
PERFORMANCE: › Involving objects or multiple steps slow it down a lot
VERSATILITY / FLEXIBILITY:
› Can define Input / Internal / Output protocols
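An mrjob protocol boils down to a read/write pair that converts between raw lines and (key, value) objects, which is why RawValueProtocol is nearly free while JSON-style protocols add per-record encode/decode cost. The classes below are a standalone plain-Python illustration that mirrors mrjob's naming; they are not mrjob's actual implementation.

```python
import json

# Standalone illustration of mrjob-style protocols (not mrjob code):
# a protocol is just a read/write pair converting between raw lines
# and (key, value) objects.

class RawValueProtocol(object):
    """Passes each line through untouched: key is None, value is the line."""
    def read(self, line):
        return None, line
    def write(self, key, value):
        return value

class JSONProtocol(object):
    """Encodes key and value as JSON, tab-separated; flexible but adds
    encode/decode work on every record, which slows multi-step jobs."""
    def read(self, line):
        raw_key, raw_value = line.split("\t", 1)
        return json.loads(raw_key), json.loads(raw_value)
    def write(self, key, value):
        return "%s\t%s" % (json.dumps(key), json.dumps(value))

line = JSONProtocol().write(["campaign", 7, "spend"], 6.6)
print(JSONProtocol().read(line))  # (['campaign', 7, 'spend'], 6.6)
```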
Research: Happy / Jython
HAPPY: › Full access to Java MapReduce API › Happy project is deprecated
› Depends on Hadoop 0.17
JYTHON: › Doesn’t work easily out of the box
› Relies on deprecated Jython compiler in Jython 2.2 › Limited to Jython implementation of Python
› Numpy/SciPy and Pandas unavailable
Research: PyCascading
Python wrapper around Cascading framework for data processing workflow.
Uses Jython as high level language for defining workflows.
Research: PyCascading
USABILITY: › Relatively new project › Cascading API is simple and intuitive › Job Planner abstracts details of MapReduce
PERFORMANCE: › Abstraction makes performance tuning challenging › Does not support Combiner operation › Dev time was fast, runtime was slow
Research: PyCascading
VERSATILITY / FLEXIBILITY: › Allows Jython UDFs › Rich set of built-in functions: GroupBy, Join, Merge
Research: Pig
Provides a high-level language for data analysis which is compiled into a sequence of MapReduce operations.
Research: Pig
USABILITY: › Powerful debugging and optimization tools (e.g. explain, illustrate)
› Automatically optimizes MapReduce operations: › Applies Combiner operations where applicable › Reorders and conflates data flow for efficiency
Research: Pig
PERFORMANCE: › Pig compiler produces performant code › Complex operations might require manual optimization › Budget aggregation required implementing a User Defined Function in Jython to eliminate an unnecessary MapReduce step
Research: Pig
VERSATILITY / FLEXIBILITY: USING PIG + JYTHON UDF
› Pig Latin is expressive and can capture most use cases
› Define custom data operations in Jython called UDFs
› UDFs can implement custom loaders, partitioners, and other advanced features
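A Jython UDF for Pig is just an ordinary Python function. The sketch below is hypothetical (the function name and the Pig snippet in the comments are illustrative, not from the talk); in real use the function would carry Pig's @outputSchema decorator so Pig knows its return type.

```python
# Hypothetical Jython UDF for Pig (illustrative, not from the talk).
# In a Pig script it would be registered and called roughly like:
#   REGISTER 'budget_udfs.py' USING jython AS udfs;
#   keyed = FOREACH logs GENERATE
#       udfs.dedup_key(object_type, object_id, method, auction_id);
# In real use the function carries an @outputSchema('key:chararray')
# decorator so Pig knows the return type.

def dedup_key(object_type, object_id, method, auction_id):
    """Build the composite de-duplication key used in budget aggregation."""
    return "%s|%s|%s|%s" % (object_type, object_id, method, auction_id)

print(dedup_key("campaign", 7, "spend", "a1"))  # campaign|7|spend|a1
```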
Research: Summary
[Chart: running time (minutes) and lines of code for the Java, Streaming, mrjob, PyCascading, and Pig implementations]
Research: Recommendations
• Pig and PyCascading enable complex pipelines to be expressed simply
• Pig is more mature and the most viable option for ad-hoc analysis
QUESTIONS