Patchwork Data at Etsy Matt Walker

DESCRIPTION

Big data at Etsy began in early 2010 and has since grown to power applications as diverse as ETL, A/B testing, recommender systems, and search indexing. Join us at this talk for an amusing tour through the history of big data at Etsy, going back to the roots of our mission-critical A/B testing approach, followed by a dive into a selection of the technologies that power such applications today.

Page 1: Patchwork Data at Etsy

Patchwork Data at Etsy

Matt Walker

Page 2: Patchwork Data at Etsy
Page 3: Patchwork Data at Etsy
Page 4: Patchwork Data at Etsy

2005 2007 2009 2011 2013

June

Etsy

Page 5: Patchwork Data at Etsy

What happened?

Page 6: Patchwork Data at Etsy

We don’t like to talk about it

Page 7: Patchwork Data at Etsy

Okay, we do

• http://codeascraft.etsy.com

• https://www.etsy.com/codeascraft/talks

• http://kongscreenprinting.com

Page 8: Patchwork Data at Etsy

Catch Phrases

• Continuous deployment

• Blameless postmortems

• Measure everything

• Continuous experimentation

Page 9: Patchwork Data at Etsy

Metrics-Driven Development

• Ganglia

• StatsD/Graphite

• Splunk
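
As a rough illustration of the "measure everything" idea, the sketch below formats counter and timer lines in the StatsD plain-text wire format (`name:value|type`) and sends them over UDP. The metric names, host, and port are placeholders chosen for this example, not Etsy's actual configuration.

```python
import socket

def format_counter(name, value=1):
    # StatsD counter line: "<name>:<value>|c"
    return f"{name}:{value}|c"

def format_timer(name, millis):
    # StatsD timer line: "<name>:<millis>|ms"
    return f"{name}:{millis}|ms"

def send_metric(line, host="127.0.0.1", port=8125):
    # Fire-and-forget UDP datagram; host and port are assumed defaults.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()

# Hypothetical usage:
#   send_metric(format_counter("checkout.completed"))
#   send_metric(format_timer("search.latency", 320))
```

Because the transport is UDP, instrumenting a code path adds very little overhead, which is part of what makes measuring everything practical.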

Page 10: Patchwork Data at Etsy

Scaling a Traditional RDBMS

• Sharded MySQL

• memcached

• Object-relational mapping in PHP

Page 11: Patchwork Data at Etsy

2005 2007 2009 2011 2013

December

Page 12: Patchwork Data at Etsy

Adtuitive

• Online advertising network

• Match forum posts with rich product advertisements

• Unafraid of scaling across Etsy sellers

Page 13: Patchwork Data at Etsy

Adtuitive

• Amazon Web Services

• JRuby

• Rails

Page 14: Patchwork Data at Etsy
Page 15: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS

• MapReduce

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig

• Oozie

• Avro

• Zookeeper

http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/

Page 17: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Page 18: Patchwork Data at Etsy

Powered by MapReduce

• ETL

• Analytics

• A/B testing

• Recommenders

• Search

Page 19: Patchwork Data at Etsy

Applications

• Log ETL

• Database snapshotter

• TasteTest

• Facebook Gift Recommender

• Complementary/similar listings

• Funnel Cake

• Feature Funnel

• A/B Analyzer

• Catapult

• Distributed search indexing

• Fast Game (search index)

• Search autosuggest

• SearchAds

• SCRAM ETL (fraud detection)

Page 21: Patchwork Data at Etsy

Catapult

• End-to-end success story

• Extremely valuable for a web shop

Page 22: Patchwork Data at Etsy

2005 2007 2009 2011 2013

January

Relevancy Thursdays

Page 23: Patchwork Data at Etsy

Relevancy Thursdays

• Switch default sort order to relevance

• Each Thursday in January

Page 24: Patchwork Data at Etsy

Relevancy Thursdays

• Default search order was recency

• Relisting was our equivalent of advertising

• $0.20 updated your listing’s timestamp

Page 25: Patchwork Data at Etsy

Relevancy Thursdays

• Recency was meant to support “freshness” in search results

• Search originated as PostgreSQL query

• Converted to Solr to scale

Page 26: Patchwork Data at Etsy

What happens if we switch to relevance?

Page 27: Patchwork Data at Etsy

Relevancy Thursdays

• No A/B testing framework

• No event logs

• Limping along with Google Analytics

Page 28: Patchwork Data at Etsy
Page 29: Patchwork Data at Etsy
Page 30: Patchwork Data at Etsy

2005 2007 2009 2011 2013

February

First Log Analysis

Page 31: Patchwork Data at Etsy

First Log Analysis

• Raw web access logs

• URL- and ref tag-based

• Regex parser
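
A minimal sketch of what a regex-based parser over raw access logs might look like. The Apache "combined" log layout, the `ref` query parameter name, and the sample line are all assumptions made for illustration; the slides do not show the actual parser.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed Apache-style log layout; the real log format is not given in the talk.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    parts = urlsplit(m.group("url"))
    query = parse_qs(parts.query)
    return {
        "ip": m.group("ip"),
        "time": m.group("time"),
        "path": parts.path,
        "status": int(m.group("status")),
        # "ref" is a hypothetical name for the ref tag query parameter.
        "ref": query.get("ref", [None])[0],
    }

# Invented sample line, for illustration only.
sample = ('203.0.113.7 - - [10/Feb/2010:13:55:36 -0500] '
          '"GET /search?q=scarf&ref=home_search HTTP/1.1" 200 5120')
record = parse_line(sample)
```

Keying the analysis on the URL path and ref tag means it needs nothing beyond what the web server already logs, which matters when no event logging exists yet.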

Page 32: Patchwork Data at Etsy
Page 33: Patchwork Data at Etsy
Page 34: Patchwork Data at Etsy

Heyday of Tooling

• A/B framework

• Front-end event logger

• Database snapshotter

• Barnum and Bailey

• Custom operator library

• Loaders

Page 35: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume

• JDBC/ODBC

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Page 36: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume → Akamai

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie → Barnum

• Avro → TupleSerialization

• Zookeeper

Page 37: Patchwork Data at Etsy

A/B Framework

• Ramp-ups + A/B testing

• Feature flag development
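
One common way to combine ramp-ups and A/B tests behind a feature flag is to hash a user ID deterministically into a bucket. The sketch below assumes that approach. The config shape, function names, and test name are hypothetical and are not taken from Etsy's framework.

```python
import hashlib

def bucket(test_name, user_id, buckets=100):
    # Hash test name + user id so each test gets an independent, stable split.
    digest = hashlib.md5(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def variant(test_name, user_id, config):
    """config maps variant name -> percent of traffic; the remainder gets 'control'."""
    b = bucket(test_name, user_id)
    upper = 0
    for name, percent in config.items():
        upper += percent
        if b < upper:
            return name
    return "control"

# A ramp-up is only a config change: users enabled at 1% stay enabled at 50%.
ramp_1 = {"new_search_ranking": 1}    # hypothetical flag name
ramp_50 = {"new_search_ranking": 50}
```

Under this scheme the same mechanism serves both purposes: a flag at 100% is a launched feature, and a flag held at 50% is an A/B test.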

Page 38: Patchwork Data at Etsy

Self-service analytics for any A/B test on the site

Page 39: Patchwork Data at Etsy

2005 2007 2009 2011 2013

A/B Framework

June

Page 40: Patchwork Data at Etsy

2005 2007 2009 2011 2013

A/B Analyzer

November

Page 41: Patchwork Data at Etsy

Why did it take so long?

• Non-web developers learning the PHP stack

• Failed experiments with “easier to use” MapReduce tools

• Realizing self-service analytics was what Etsy needed

Page 42: Patchwork Data at Etsy
Page 43: Patchwork Data at Etsy
Page 44: Patchwork Data at Etsy

2005 2007 2009 2011 2013

February

Catapult

Page 45: Patchwork Data at Etsy

Catapult

• A/B Analyzer + Launch Calendar

• Full product lifecycle

Page 46: Patchwork Data at Etsy
Page 47: Patchwork Data at Etsy
Page 48: Patchwork Data at Etsy
Page 49: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS → S3

• MapReduce (Elastic)

• HBase

• Hive

• Flume → Akamai

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie → Barnum

• Avro → TupleSerialization

• Zookeeper

Page 50: Patchwork Data at Etsy

LAMP Stack for Big Data

• HDFS

• MapReduce

• HBase

• Hive → Vertica

• Flume → logrotate

• JDBC/ODBC → snapshotter/loaders

• Hue

• Pig → Cascading

• Oozie

• Avro → TupleSerialization

• Zookeeper

Page 51: Patchwork Data at Etsy

Computation Models

• Batch

• Interactive

• Streaming

Page 52: Patchwork Data at Etsy
Page 53: Patchwork Data at Etsy

Batch

Page 54: Patchwork Data at Etsy

Cascading

Page 55: Patchwork Data at Etsy

RDBMS                      Cascading

SQL                        cascading.jruby

Query Planner/Optimizer    Cascading

Execution Engine           MapReduce

Storage                    HDFS

Page 56: Patchwork Data at Etsy

cascading.jruby

Page 57: Patchwork Data at Etsy

cascading.jruby

• Productivity: no compile

• Reuse: factor out structure

• Efficiency: no JRuby runtime

• Optimization: move aggregations map-side
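
To illustrate what moving an aggregation map-side means, the following sketch simulates a count-by-key job in plain Python. Each mapper pre-aggregates its own input split so that fewer records cross the shuffle. It is a toy model with made-up data, not cascading.jruby or Cascading code.

```python
from collections import Counter

def map_naive(split):
    # Emit one (key, 1) pair per input record.
    return [(key, 1) for key in split]

def map_with_partial_aggregation(split):
    # Pre-aggregate within the split: one (key, partial_count) pair per distinct key.
    return list(Counter(split).items())

def reduce_counts(shuffled_pairs):
    # Sum the counts for each key, whether they are 1s or partial counts.
    totals = Counter()
    for key, count in shuffled_pairs:
        totals[key] += count
    return dict(totals)

# Two made-up input splits of event names.
splits = [["view", "view", "cart", "view"], ["cart", "view", "purchase"]]

naive_pairs = [p for s in splits for p in map_naive(s)]
partial_pairs = [p for s in splits for p in map_with_partial_aggregation(s)]
```

Both variants reduce to the same totals, but the pre-aggregated mappers emit 5 pairs instead of 7 here; on real, heavily repeated keys the reduction in shuffled data is far larger.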

Page 58: Patchwork Data at Etsy

A nice constructor

Page 59: Patchwork Data at Etsy

cascading.jruby

Page 60: Patchwork Data at Etsy

Productivity

• Job templates

• Reloader

• Cascading local mode

• Sampled data

Page 61: Patchwork Data at Etsy

Reuse

Page 62: Patchwork Data at Etsy

Reuse

Page 63: Patchwork Data at Etsy

Field Names

Page 64: Patchwork Data at Etsy
Page 65: Patchwork Data at Etsy

Efficiency

• Just a constructor

• Calls into Cascading API

• No JRuby runtime on cluster

Page 66: Patchwork Data at Etsy

Optimization

Page 67: Patchwork Data at Etsy

Tuple Data Model

Page 68: Patchwork Data at Etsy

UDFs

Page 69: Patchwork Data at Etsy

Scalding

• Distributed collections

• Function literals replace UDFs

Page 70: Patchwork Data at Etsy
Page 71: Patchwork Data at Etsy
Page 72: Patchwork Data at Etsy
Page 73: Patchwork Data at Etsy

Interactive

Page 74: Patchwork Data at Etsy

Vertica

Page 75: Patchwork Data at Etsy

Sharded MySQL

• Borrowed from Flickr

• Works

Page 76: Patchwork Data at Etsy

Thou Shalt Not Join

Page 77: Patchwork Data at Etsy

2005 2007 2009 2011 2013

Hive

January

Page 78: Patchwork Data at Etsy

2005 2007 2009 2011 2013

Hive Turned Off

April

Page 79: Patchwork Data at Etsy

Hive

• Slow

• Sensitive

• Operational burden

• Educational burden

Page 80: Patchwork Data at Etsy

Vertica

• Offline copy of shards, master, auxiliary databases

• Joins are easy

• Reasonable latency

Page 81: Patchwork Data at Etsy

2005 2007 2009 2011 2013

Vertica

November

Page 82: Patchwork Data at Etsy

Vertica

• Game changer at Etsy

• High demand for joins

• Rapid prototyping of data pipelines

Page 83: Patchwork Data at Etsy
Page 84: Patchwork Data at Etsy

RDBMS                      Cascading

SQL                        cascading.jruby

Query Planner/Optimizer    Cascading

Execution Engine           MapReduce

Storage                    HDFS

Page 85: Patchwork Data at Etsy

Back to MapReduce

• Event logs

• Schedule

• Load data in prod

• Scale

Page 86: Patchwork Data at Etsy

Vertica

• Not Hive, Impala, Shark, etc.

• May change our minds

Page 87: Patchwork Data at Etsy

Streaming

Page 88: Patchwork Data at Etsy

Not Powered by MapReduce

• Activity Feed

• Shop Stats

Page 89: Patchwork Data at Etsy

Etsyweb

• memcached

• Gearman

• Sharded MySQL

Page 90: Patchwork Data at Etsy

Use cases

• Trending

• Fraud detection

• ?

Page 91: Patchwork Data at Etsy
Page 92: Patchwork Data at Etsy

Turns out people don’t make product decisions in real time

http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics

Page 93: Patchwork Data at Etsy

Summing Up

• Be glad you’re living in the future

• Automated tools for the common case

• Don’t be afraid to experiment

Page 94: Patchwork Data at Etsy

Image Credits

• http://kongscreenprinting.com/what-we-do-showcase

• http://animal.discovery.com

• http://www.rallyrace.com/turning-over-the-stone-event-production-basics/

• http://www.flickr.com/photos/bbalaji/2443820505/

• http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg

• http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/

• http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx

• http://www.theculturemap.com/scream-time-edvard-munch-museum/

• http://www.repentamerica.com/webelieve.html

• https://soundcloud.com/tearland/tl-hive

• http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head

Page 95: Patchwork Data at Etsy

Contact / Reference

• Matt Walker

• @data_daddy

• http://codeascraft.etsy.com/

• http://www.etsy.com/codeascraft/talks