
eBay Experimentation Platform on Hadoop


Page 1: eBay Experimentation Platform on Hadoop

Experimentation Platform on Hadoop

Tony Ng, Director, Data Services

Padma Gopal, Manager, Experimentation

Page 2: eBay Experimentation Platform on Hadoop

Agenda

• Experimentation 101
• Reporting Workflow
• Why Hadoop?
• Framework Architecture
• Challenges & Learnings
• Q & A

Page 3: eBay Experimentation Platform on Hadoop

Experimentation 101

• What is A/B Testing?
• Why is it important?
• Intuition vs. Reality
• eBay Wins

Page 4: eBay Experimentation Platform on Hadoop


What is A/B Testing?

• A/B Testing compares two versions of a page or process to see which one performs better

• Variations could be: UI components, content, algorithms, etc.

• Measures: financial metrics, click rate, conversion rate, etc.

Control – the current design. Treatment – a variation of the current design. (A bucketing sketch follows.)
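To make the control/treatment split concrete, here is a minimal sketch of deterministic bucketing in Scala — hash the user's GUID together with the experiment id so the same user always sees the same variant. This illustrates the general technique only; it is not EP's actual assignment logic.

import java.security.MessageDigest

// Deterministically assign a user to bucket 0 (control) or 1..n (treatments).
// Hashing guid+experimentId keeps assignment stable across sessions and
// independent across experiments.
def assign(guid: String, experimentId: String, treatments: Int): Int = {
  val digest = MessageDigest.getInstance("MD5")
    .digest(s"$experimentId:$guid".getBytes("UTF-8"))
  // Interpret the first 4 bytes as a non-negative Int, then mod into buckets:
  val n = ((digest(0) & 0x7f) << 24) | ((digest(1) & 0xff) << 16) |
          ((digest(2) & 0xff) << 8)  |  (digest(3) & 0xff)
  n % (treatments + 1)
}

// assign("user1", "exp42", 2) always returns the same value in {0, 1, 2}.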


Page 5: eBay Experimentation Platform on Hadoop


How is A/B Testing done?


Page 6: eBay Experimentation Platform on Hadoop


Why is it important?

• Intuition vs. Reality
  – Intuition, especially on novel ideas, should be backed up by data
  – Demographics and preferences vary

• Data driven; not based on opinion

• Reduce risk

Page 7: eBay Experimentation Platform on Hadoop


Increased prominence of the BIN button compared to Watch leads to faster checkouts.


Page 8: eBay Experimentation Platform on Hadoop


Merch placements perform much better when title and price information is provided upfront.


Page 9: eBay Experimentation Platform on Hadoop


New sign-in design effectively pushed more new users to use guest checkout


Page 10: eBay Experimentation Platform on Hadoop


What do we support?


Page 11: eBay Experimentation Platform on Hadoop

Experimentation Reporting

• How does EP work?
• Workflow
• DW Challenges

Page 12: eBay Experimentation Platform on Hadoop


Experiment Lifecycle

Page 13: eBay Experimentation Platform on Hadoop


[Data-flow diagram] User behavior & transactional data (4 billion rows, 4 TB) — e.g. User1 Homepage, User1 Search for iPhone6, User1 View Item1, User2 Search for Coach bag, User2 View Item2, User2 Bid — is joined with experiment metadata to produce treatment-tagged detail records (each event repeated per treatment, e.g. Treatment 1 / Treatment 2), which roll up through intermediate tables into summaries: 100+ metrics, 20 x dimensions, and 10 data insights per treatment. (A toy sketch of this fan-out follows.)
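As an illustration of the detail-building step above (a toy sketch, not EP's implementation): each raw event fans out to one detail row per treatment the user was exposed to, according to the experiment metadata.

case class Event(guid: String, action: String)

// exposures: guid -> treatments the user was assigned to (from experiment metadata).
// Each event is repeated once per treatment, yielding treatment-tagged detail rows.
def toDetail(events: Seq[Event], exposures: Map[String, Seq[String]]): Seq[(String, Event)] =
  events.flatMap(e => exposures.getOrElse(e.guid, Nil).map(t => (t, e)))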

Page 14: eBay Experimentation Platform on Hadoop


Metric categories: Transactional, Activity, Acquisition, Ad, Email, Seller, Engagement

Statistical treatments (a sketch of the lift/confidence-interval computation follows this list):

• Absolute – actual numbers/counts
• Normalized – weighted mean (by GUID/UID)
• Lift – difference between treatment and control
• Standard deviation – weighted standard deviation
• Confidence interval – range within which the treatment effect is likely to lie
• P-values – statistical significance
• Outlier capped – trimmed tail values
• Post-stratified – adjustment method to reduce variance

Data insights: reported daily, weekly, and cumulatively, by Browser, OS, Device, Site/Country, Category, Segment, and Geo
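For example, lift and its confidence interval boil down to a difference of weighted means with a normal approximation. A minimal sketch (illustrative, not EP's production code; assumes a 95% z-interval):

case class Stats(mean: Double, variance: Double, n: Long)

// Lift = treatment mean - control mean; its standard error combines the
// per-arm variances, and a 95% interval uses z = 1.96.
def liftWithCI(t: Stats, c: Stats): (Double, (Double, Double)) = {
  val lift = t.mean - c.mean
  val se   = math.sqrt(t.variance / t.n + c.variance / c.n)
  (lift, (lift - 1.96 * se, lift + 1.96 * se))
}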

Page 15: eBay Experimentation Platform on Hadoop

Hadoop Migration

• Why Hadoop• Tech Stack• Architecture Overview

Page 16: eBay Experimentation Platform on Hadoop


Why Hadoop?

• Design & Development flexibility

• Store large amounts of data without schema constraints

• System to support complex data transformation logic

• Code base reduction

• Configurability

• Code not tied to environment & easier to share

• Support for complex structures

Page 17: eBay Experimentation Platform on Hadoop


Physical Architecture

[Architecture diagram] A scheduler/client submits job workflows to the Hadoop cluster (Hive, Scoobi, and Spark as a PoC; Avro and ORC file formats), which reads user behavior data; an ETL bridge agent loads results into the RDBMS (MySQL DW) that backs BI & presentation.

Page 18: eBay Experimentation Platform on Hadoop


Tech Stack - Scoobi

• Scoobi
  – Written in Scala, a functional programming language
  – Supports object-oriented designs
  – Abstracts away low-level MR framework code
  – Supports typical dataset operations like map, flatMap, filter, groupBy, sort, orderBy, partition
  – DList (Distributed List): jobs are submitted as a series of "steps" representing granular MR jobs
  – Enables developers to write more concise code compared to Java MR code

Page 19: eBay Experimentation Platform on Hadoop


Word Count in Java M/R

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Page 20: eBay Experimentation Platform on Hadoop


Word Count in Scoobi

import Scoobi._, Reduction._

val lines = fromTextFile("hdfs://in/...")

val counts = lines.mapFlatten(_.split(" "))
                  .map(word => (word, 1))
                  .groupByKey
                  .combine(Sum.int)

counts.toTextFile("hdfs://out/...", overwrite=true).persist(ScoobiConfiguration())

Page 21: eBay Experimentation Platform on Hadoop


Tech Stack - File Format

• Avro
  – Supports rich and complex data structures such as maps and unions
  – Self-describing data files enable portability (schema co-exists with data)
  – Supports schema dynamicity using generic records (see the sketch after this list)
  – Supports backward compatibility for data files w.r.t. schema changes

• ORC (Optimized Row Columnar)
  – A single file as the output of each task, which reduces the NameNode's load
  – Metadata stored using Protocol Buffers, which allows addition and removal of fields
  – Better query performance (bounds the amount of memory needed for reading or writing)
  – Lightweight indexes stored within the file
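A minimal sketch of the schema-dynamicity point: with Avro's GenericRecord, a record is built against a schema parsed at runtime, with no generated classes (the schema and field names here are illustrative):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

// Parse the schema at runtime; Avro also embeds it in the data file,
// which is what makes Avro files self-describing.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |{"name":"guid","type":"string"},
    |{"name":"treatment","type":"int"}]}""".stripMargin)

val record = new GenericData.Record(schema)
record.put("guid", "user1")
record.put("treatment", 2)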

Page 22: eBay Experimentation Platform on Hadoop


Tech Stack - Hive

• Efficient joins for large datasets

• UDFs for use cases like median and percentile calculations

• Hive optimizer joins (a sketch of the map-join idea follows this list):
  – The smaller table is loaded into memory as a hash table and the larger table is scanned
  – Map joins are automatically picked by the optimizer

• Ad-hoc analysis, data reconciliation use cases, and testing
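The map-join idea in miniature (a toy sketch over in-memory collections, not Hive's implementation): build a hash table from the small side once, then stream the large side past it, so the large table is never shuffled.

// Small side -> in-memory hash table; large side is only scanned.
def mapJoin[K, A, B](small: Seq[(K, B)], large: Iterator[(K, A)]): Iterator[(K, (A, B))] = {
  val hashTable: Map[K, Seq[B]] =
    small.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
  large.flatMap { case (k, a) => hashTable.getOrElse(k, Nil).map(b => (k, (a, b))) }
}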

Page 23: eBay Experimentation Platform on Hadoop


Fun Facts of EP Processing

• We read more than 200 TB of data for processing daily.

• We run 350 M/R jobs daily.

• We perform more than 30 joins using M/R & Hive, including the ones with heavy data skew.

• We use 40 TB of YARN memory at peak time on a 170 TB Hadoop cluster.

• We can run 150+ concurrent experiments daily.

• Report generation takes around 18 hours.

Page 24: eBay Experimentation Platform on Hadoop


Logical Architecture

[Component diagram] EP Reporting Services move data through Detail → Intermediate 1 → Intermediate 2 → Summary stages. Configuration wires together filters, data providers, processors, calculators, and metric providers, and defines the output (columns, metrics, dimensions). Framework components include the reporting context, cache, utils/helpers, command line, and input/output conduit; ancillary services cover alerts, shell scripts, the processed data store, tools, and logging & monitoring.

Page 25: eBay Experimentation Platform on Hadoop

CHALLENGES & LEARNINGS

• Joins
• Job Optimization
• Data Skew


Page 26: eBay Experimentation Platform on Hadoop


Key Challenges

• Performance
  – Job runtimes are subject to SLAs & heavily tied to resources

• Data Skew (long-tail data distribution)
  – May cause unrecoverable runtime failures
  – Poor performance
  – Mitigated via joins and combiners (solutions below)

• Job Resiliency
  – Auto remediation
  – Alerts and monitoring

Page 27: eBay Experimentation Platform on Hadoop


Solution to Key Challenge - Performance

– Tuned the Hadoop job parameters; a few of them are listed below (a configuration sketch follows this list):

  • -Dmapreduce.input.fileinputformat.split.minsize and -Dmapreduce.input.fileinputformat.split.maxsize
    – Job run times were reduced in the range of 9% to 35%

  • -Dscoobi.mapreduce.reducers.bytesperreducer
    – Adjusting this parameter helped optimize the number of reducers used; job run times were reduced by up to 50% in some cases

  • -Dscoobi.concurrentjobs
    – Setting this parameter to true enables multiple steps of a Scoobi job to run concurrently

  • -Dmapreduce.reduce.memory.mb
    – Tuning this parameter helped relieve memory pressure
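A minimal sketch of applying such settings programmatically on a Hadoop Configuration (the values are illustrative assumptions, not EP's tuned numbers; in practice these were passed as -D flags):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Bound input split sizes so we get neither too many tiny map tasks
// nor too few huge ones (128 MB - 512 MB here is only an example):
conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 512L * 1024 * 1024)
// Target bytes per reducer so Scoobi picks a sensible reducer count:
conf.setLong("scoobi.mapreduce.reducers.bytesperreducer", 1L << 30)
// Let independent steps of a Scoobi job run concurrently:
conf.setBoolean("scoobi.concurrentjobs", true)
// More reducer memory to relieve pressure:
conf.setInt("mapreduce.reduce.memory.mb", 4096)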

Page 28: eBay Experimentation Platform on Hadoop


Solution to Key Challenge - Performance

– Implemented a data cache for objects
  • Achieved a cache hit ratio of over 99% per job
  • Runtime performance improved in the range of 18% to 39% depending on the job

– Redesigned/refactored jobs and job schedules
  • Extracted logic from existing jobs into their own jobs
  • Optimized the job workflow for better parallelism

– Dedicated Hadoop queue with more than 50 TB of YARN memory
  • The shared Hadoop cluster resulted in long waiting times; a dedicated queue solved the resource crunch

Page 29: eBay Experimentation Platform on Hadoop


Joins

– Data skew in one or both datasets: Scoobi's block join divides the skewed data into blocks and joins the data one block at a time (the idea is sketched below).

– Multiple joins in a process: rewrote a process that joined 11 datasets, whose sizes varied from 49 TB to a few megabytes, in Hive; it was taking 6+ hours in Scoobi and came down to 3 hours in Hive.

– Other join solutions: also looked into Hive's bucket join, but the cost to sort and bucket the datasets was more than a regular join.
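The block-join idea in miniature (a toy sketch on in-memory collections, not Scoobi's API): spread the skewed side's records for each key across B blocks and replicate the other side to every block, so no single reducer has to hold an entire hot key.

def blockJoin[K, A, B](skewed: Seq[(K, A)], other: Seq[(K, B)], blocks: Int): Seq[(K, (A, B))] = {
  // Skewed records are spread round-robin across (key, block) sub-keys:
  val saltedLeft = skewed.zipWithIndex.map { case ((k, a), i) => ((k, i % blocks), a) }
  // The other side is replicated once per block so every sub-key can join locally:
  val saltedRight = other.flatMap { case (k, b) => (0 until blocks).map(blk => ((k, blk), b)) }
  val rightIndex = saltedRight.groupBy(_._1)
  // In MR terms the grouping is the shuffle; each (key, block) group stays small.
  saltedLeft.flatMap { case (kb, a) =>
    rightIndex.getOrElse(kb, Nil).map { case ((k, _), b) => (k, (a, b)) }
  }
}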

Page 30: eBay Experimentation Platform on Hadoop


Combiner

To relieve reducer memory pressure and prevent OOM:

Solution – emit part-values of the complete operation for the same key using combiners.

– Calculating the mean
  • Mean = (X1 + X2 + X3 + … + Xn) / (1 + 1 + 1 + … + 1)
  • The formula is composed of two parts, and the mapper emits the two part-values, combining records for the same key (see the sketch below).
  • The reducer receives far fewer records after combining and applies the two parts from each mapper to the actual mean formula.
  • The concept can be applied to other complex formulas such as variance, as long as the formula can be reduced to parts that are commutative and associative.
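A minimal sketch of this decomposition (illustrative, not EP's code): the mapper emits (sum, count) parts, the combiner merges parts associatively, and only the final step computes the division.

case class MeanParts(sum: Double, count: Long) {
  // Merging is commutative and associative, so it is safe in a combiner:
  def merge(that: MeanParts): MeanParts = MeanParts(sum + that.sum, count + that.count)
  def mean: Double = sum / count // applied once, at the very end
}

// Mapper side: one part per record; combiner/reducer side: reduce by merge.
val parts = Seq(1.0, 2.0, 6.0).map(x => MeanParts(x, 1L)).reduce(_ merge _)
// parts.mean == 3.0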

Page 31: eBay Experimentation Platform on Hadoop


Job Resiliency

– Auto-remediation
  • Auto-restart in case of job failure due to intermittent cluster issues

– Monitoring & alerting for Hadoop jobs
  • Continuous monitoring, with an email alert generated when a long-running job or a failure is detected

– Monitoring & alerting for data quality
  • Daily monitoring of data trends set up for key metrics, with an email alert on any anomaly or violation detected

– Recon scripts
  • Checks and alerts set up for intermediate data

– Daily data backup
  • Daily data backup with distcp to a secondary cluster, and the ability to restore

Page 32: eBay Experimentation Platform on Hadoop


Next - Evaluate Spark

Current problems:
– Data processing through MapReduce is slow for a complex DAG, as data is persisted to disk at each step. Multiple stages in the pipeline are chained together, making the overall process very complex.
– Massive joins against very large datasets are slow.
– Expressing every piece of complicated business logic in Hadoop MapReduce is a problem.

Alternatives:
– Apache Spark is expressive and has wide adoption, industry backing, and thriving community support.
– Apache Spark claims 10x to 100x speed improvements in comparison to traditional M/R jobs (word-count comparison below).
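For comparison with the word-count examples above, a minimal sketch of the same job on Spark's RDD API (paths illustrative): intermediate results stay in memory across stages instead of being written to disk between M/R steps.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
sc.textFile("hdfs://in/...")
  .flatMap(_.split(" "))       // tokenize
  .map(word => (word, 1))      // pair each word with a count of 1
  .reduceByKey(_ + _)          // sum counts per word (combines map-side)
  .saveAsTextFile("hdfs://out/...")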

Page 33: eBay Experimentation Platform on Hadoop


Summary

• Hadoop is ideal for large-scale data processing and provides a highly scalable storage platform.

• The Hadoop ecosystem is still evolving, and we have to face issues around software that is still under development.

• Moving to Hadoop helped free up huge capacity in the DW for deep-dive analysis.

• Huge cost reduction for a business like ours with exploding data sets.

Page 34: eBay Experimentation Platform on Hadoop

Q & A