Slides for Eventbrite's data platform talk at SF data mining meetup.
Data Platform
Vipul Sharma – vipul@eventbrite.com
A social event ticketing and discovery platform
$1B total sales
68M tickets sold
1.4M events hosted
0.5M organizers served
23M attendees served
12 countries
Event Lifecycle
Frictionless is the mantra!
Data Platform and Discovery
Analytics
• Ad hoc queries by analysts
Fraud and Spam
Data Platform
Hadoop Cluster
• 30 persistent EC2 High-Memory Instances
• 30TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase
Infrastructure
• Search
  • Solr
  • Incremental updates, moving towards event-driven
• Recommendation/Graph
  • Hadoop
  • Native Java MapReduce
  • Bash for workflow
• Social
  • Cassandra
  • Denormalized view
• Persistence
  • MySQL
  • HDFS
  • HBase
  • MongoDB (moving to Cassandra)
Infrastructure
• Stream
  • RabbitMQ
  • Internal firehose
  • Storm
• Offline
  • MapReduce
  • Streaming
  • Hive
  • Hue
Discovery
Social, Interest, Local
Categorization - Prism
Tech
Music
Conference
Sports
Prism - Features
• Supervised Learning
• Logistic Regression using MLE
• Pairwise classification into 20 categories
• High precision, lower recall
• Use MapReduce for feature extraction
• Use for clustering as well
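As a rough illustration of "Logistic Regression using MLE", here is a minimal sketch that fits one binary classifier by stochastic gradient ascent on the log-likelihood over sparse feature vectors. The function names, learning rate, epoch count, and toy data are all assumptions made for the example; the slides do not say which optimizer or library Prism uses.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    rows   : list of sparse vectors, each a dict {feature_index: value}
    labels : list of 0/1 labels (1 = event belongs to the category)
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(weights.get(i, 0.0) * v for i, v in x.items())
            err = y - sigmoid(z)  # derivative of the log-likelihood w.r.t. z
            bias += lr * err
            for i, v in x.items():
                weights[i] = weights.get(i, 0.0) + lr * err * v
    return weights, bias

def predict_proba(weights, bias, x):
    z = bias + sum(weights.get(i, 0.0) * v for i, v in x.items())
    return sigmoid(z)

# Toy data (hypothetical): feature 1 marks "conference" events,
# feature 2 marks other events, feature 3 is uninformative.
rows = [{1: 1}, {1: 1, 3: 1}, {2: 1}, {2: 1, 3: 1}]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(rows, labels)
```

Training one such model per category gives the "Conference vs. not Conference" style of classifier the deck describes.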
Prism – Training Data
• Binary classification for each category
• Training data needed for positive and negative
  • Conference and not Conference
  • Sports and not Sports
• Samasource and Crowdflower
• Stem words to create initial set
• Positive, negative, negative with stem words
Prism - Features
• Convert Event and Organizer data into a feature vector
  • Event details, Organizer details, Ticket details
• Boolean representation of predefined attributes
  • Words – tf-idf, dictionaries
  • Phrases
  • Domains
  • Rules – regular expressions
  • Functions – business logic, e.g. ticket price between $10 and $20
  • Compounds – boolean combination of features using & and || rules
    – <COMPOUND1>:techcrunch & disrupt & techcrunch.com
    – <COMPOUND2>:COMPOUND2 && after && party
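The feature types listed above can be pictured with a small sketch that turns one event into boolean features. Everything here is hypothetical: the word list, the `HAS_YEAR` rule, the `PRICE_10_20` name, and the function signature are invented for illustration, and tf-idf weighting is left out to keep the example short.

```python
import re

# Hypothetical feature definitions, loosely modeled on the slide's examples.
WORDS = {"techcrunch", "disrupt", "after", "party"}
DOMAINS = {"techcrunch.com"}
RULES = {"HAS_YEAR": re.compile(r"\b(19|20)\d{2}\b")}

def extract_features(title, description, organizer_url, ticket_price):
    """Return a dict of boolean features for one event."""
    text = f"{title} {description}".lower()
    tokens = set(re.findall(r"[a-z0-9']+", text))
    feats = {}
    for word in WORDS:                      # word features
        feats[word] = word in tokens
    for domain in DOMAINS:                  # domain features
        feats[domain] = domain in organizer_url.lower()
    for name, pattern in RULES.items():     # regular-expression rules
        feats[name] = bool(pattern.search(text))
    # Function feature: business logic on the ticket price.
    feats["PRICE_10_20"] = 10 <= ticket_price <= 20
    # Compound feature: boolean AND of other features.
    feats["COMPOUND1"] = (
        feats["techcrunch"] and feats["disrupt"] and feats["techcrunch.com"]
    )
    return feats

example = extract_features(
    "TechCrunch Disrupt 2012", "Join us", "http://techcrunch.com/events", 15.0
)
```

Keeping every feature boolean means a compound is just an `and`/`or` over entries already in the dict.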
Prism - Features
• Each feature is represented in various contexts
  • Event Title, Event Description, Organizer Title, Organizer Description
• Each feature has meta info – Termclass
  • <LANG_EN>, <CONF_LANG_EN>, <ADULT_LANG_EN>
  • <SPORTS_LANG_EN>:<EVENT_TITLE>ball
• Feature vector is represented as a sparse vector
+1 391158:1 401814:1 410526:1 411489:1 411606:2 413910:1 427659:1 438369:1 449735:1 449736:2 455478:1 456741:1 463188:1
693|||||warrior spirit's 3rd annual fundraising auction|||||1:<DESC>again,1:<NAME>annual,1:<DESC>annual,2:<DESC>approaching,2:<NAME>auction,4:<DESC>auction,2:<DESC>auctions,2:<DESC>bring
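The two sample lines above look like (1) a label followed by `index:value` pairs, similar to the LIBSVM text format, and (2) a readable form with an id, a title, and `count:<CONTEXT>term` entries separated by `|||||`. Assuming that reading is right, a minimal parser for each might look like this; the function names are made up, and terms containing commas would need extra handling.

```python
import re

def parse_sparse_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = int(parts[0])  # '+1' or '-1'
    vector = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        vector[int(idx)] = float(val)
    return label, vector

FEATURE = re.compile(r"(\d+):<(\w+)>(.+)")

def parse_readable_line(line):
    """Parse 'id|||||title|||||count:<CONTEXT>term,...' into its parts."""
    event_id, title, feats = line.split("|||||")
    terms = []
    for item in feats.split(","):
        match = FEATURE.fullmatch(item)
        if match:
            terms.append((int(match.group(1)), match.group(2), match.group(3)))
    return event_id, title, terms

label, vector = parse_sparse_line("+1 391158:1 401814:1 411606:2")
```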
Prism - Training
• Binary classifier
  • Multiclass less accurate
• Each event gets classified into 20 categories
• MapReduce for creating sparse matrix
• MapReduce for batch classification
  • Distributed cache for feature set and models
• We can use the same sparse matrix for clustering
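Batch classification with one binary model per category can be sketched as below: every model scores the event, and only categories above a threshold are kept, which trades recall for precision. The model weights, the 0.9 threshold, and the three category names are illustrative assumptions, not values from the talk; in a MapReduce job the `models` dict would be what each mapper loads from the distributed cache.

```python
import math

def score(weights, bias, x):
    """Probability from one binary logistic model on sparse vector x."""
    z = bias + sum(weights.get(i, 0.0) * v for i, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))

def classify(models, x, threshold=0.9):
    """Run every per-category model; keep categories at or above threshold."""
    hits = {cat: score(w, b, x) for cat, (w, b) in models.items()}
    return sorted(cat for cat, p in hits.items() if p >= threshold)

# Hypothetical models: category -> (sparse weights, bias).
models = {
    "conference": ({1: 4.0, 3: 0.5}, -1.0),
    "sports": ({2: 5.0}, -2.0),
    "music": ({4: 3.0}, -1.0),
}
```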
Attendee
• What are your interests? – Prism
• Who are your friends? – Explicit and Implicit
• What are the interests of your friends? – Prism
• Which of your friends share your interests? – IBG
• Location of users and events
  • Location of purchased events
  • Facebook location
  • Our database
  • Other signals – IP, mobile app, etc.
You will like to attend this event
Item Hierarchy (You bought camera so you need batteries - Amazon)
Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)
Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, Linkedin)
Interest Graph Based (Your friends who, like you, like rock music are attending an Eric Clapton event - Eventbrite)
Recommendation Engines
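The interest-graph approach in the list above can be illustrated with a toy sketch: suggest events that a friend is attending when the event's category is an interest the user shares with that friend. The data structures and the simple vote count are assumptions made for the example; the talk describes the real system as a ranking problem over feature vectors.

```python
def recommend(user, friends, interests, attending, event_category):
    """Suggest events that friends with a shared interest are attending."""
    scores = {}
    already = set(attending.get(user, ()))
    for friend in friends.get(user, ()):
        shared = interests.get(user, set()) & interests.get(friend, set())
        for event in attending.get(friend, ()):
            if event_category.get(event) in shared and event not in already:
                scores[event] = scores.get(event, 0) + 1
    # Most-voted events first; ties broken by event id for determinism.
    return sorted(scores, key=lambda e: (-scores[e], e))

# Hypothetical toy data.
friends = {"U1": ["U2", "U3"]}
interests = {"U1": {"music"}, "U2": {"music", "tech"}, "U3": {"sports"}}
attending = {"U2": ["E1", "E2"], "U3": ["E3"]}
event_category = {"E1": "music", "E2": "tech", "E3": "sports"}
picks = recommend("U1", friends, interests, attending, event_category)
```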
Why Interest?
Events are Social
Events are Interest
Dense Graph is Irrelevant
Interests are Changing
How do we know your Interest?
• We ask you
• Based on your activity
  • Events Attended
  • Events Browsed (In Future)
• Facebook Interests
  • User Interest has to match Event category
  • Static
• Prism
Model Based vs Clustering
Building Social Graph is Clustering Step
Social Graph Recommendation is a Ranking Problem
Item-Item vs User-User
Implicit Social Graph
[Diagram: users U1 to U5 connected through events E1 to E4]
Mixed Social Graph
[Diagram: users U1 to U5 connected through events E1 to E3 and through FB and LI]
23M * 260 * 260 = 1.5 Trillion Edges
6 Billion edges ranked
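The edge-count arithmetic can be checked directly. The sketch below assumes the factor 260 is a per-user neighbor count applied over two hops; the slides give the numbers but not that interpretation. The exact product is about 1.55 trillion, which the slide states as 1.5 trillion.

```python
users = 23_000_000
fanout = 260  # assumed meaning: neighbors per user, applied over two hops

candidate_edges = users * fanout * fanout
ranked_edges = 6_000_000_000

# Share of candidate edges that actually get ranked.
ranked_fraction = ranked_edges / candidate_edges

print(f"candidate edges: {candidate_edges:,}")
print(f"ranked fraction: {ranked_fraction:.4%}")
```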
Each node is a feature vector representing a User
Each edge is a feature vector representing a Relationship
Feature Generation
• Mixed Features
• A series of MapReduce jobs
• Output on HDFS in flat files; input to subsequent jobs
• Orders = Event Attendees
  • MAP: eid: uid
  • REDUCE: eid:[uid]
• Attendees Social Graph
  • Input: eid:[uid]
  • MAP: uid_i:[uid]
  • REDUCE: uid:[neighbors]
• Interest-based features, user-specific, graph mining, etc.
• Upload feature values to HBase
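The two jobs above (orders to attendee lists, then attendee lists to a neighbor graph) can be simulated in memory to show the data flow. This is a single-process sketch, not Hadoop code; the function names and sample orders are invented for the example.

```python
from collections import defaultdict

def group_attendees(orders):
    """Job 1. MAP emits (eid, uid); REDUCE collects eid -> [uid]."""
    by_event = defaultdict(set)
    for eid, uid in orders:
        by_event[eid].add(uid)
    return {eid: sorted(uids) for eid, uids in by_event.items()}

def build_neighbors(attendees_by_event):
    """Job 2. MAP emits (uid_i, co-attendees); REDUCE merges uid -> [neighbors]."""
    neighbors = defaultdict(set)
    for uids in attendees_by_event.values():
        for uid in uids:
            neighbors[uid].update(u for u in uids if u != uid)
    return {uid: sorted(ns) for uid, ns in neighbors.items()}

# Hypothetical orders as (event id, user id) pairs.
orders = [("E1", "U1"), ("E1", "U2"), ("E1", "U3"), ("E2", "U2"), ("E2", "U4")]
attendees = group_attendees(orders)
graph = build_neighbors(attendees)
```

In this toy input U2 attended both events, so it ends up connected to U1, U3, and U4, which is the implicit social graph the deck refers to.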
HBase
• Why HBase?
  • To process 6B edges, look up features for each node and each edge
  • 6B / 1000 / 86400 = 70 days!!
  • 1M/sec = 1.5 hrs
  • Processing 1.3 TB of data with MapReduce
• Collect data from multiple MapReduce jobs
• Stores entire social graph
• Features for each node and edge
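The throughput argument is simple division, reproduced here under the assumption that 1000 and 1M are edge lookups per second. The first figure comes out at roughly 69 days, matching the slide's 70; the second comes out nearer 1.7 hours than the 1.5 quoted, so the slide appears to round or to use a slightly different edge count.

```python
edges = 6_000_000_000

def processing_seconds(lookups_per_second):
    """Seconds needed to look up features for every edge once."""
    return edges / lookups_per_second

slow_days = processing_seconds(1_000) / 86_400       # seconds per day
fast_hours = processing_seconds(1_000_000) / 3_600   # seconds per hour

print(f"at 1K lookups/sec: {slow_days:.1f} days")
print(f"at 1M lookups/sec: {fast_hours:.2f} hours")
```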
Data Model
Rowkey    U             UU
uid1      f1  f2  f3    uid2:f4  uid2:f5  uid3:f4

rowid     neighbors  events  featureX
2718282   101        3       0.3678795

rowid     314159:n  314159:e  314159:fx  161803:n  161803:e  161803:fx
2718282   31        1         0.3183     83        2         0.618
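One way to read the layout above is: the row key is a user id, `U` holds that user's node features, and `UU` holds edge features under qualifiers of the form `neighbor_uid:feature`. The sketch below models that reading with plain dicts rather than an HBase client; treating `U` and `UU` as column families is an assumption, and the helper names are hypothetical.

```python
def make_row(uid, node_features, edge_features):
    """Build one row: 'U' for node features, 'UU' for per-neighbor edge features."""
    row = {"U": dict(node_features), "UU": {}}
    for neighbor, feats in edge_features.items():
        for name, value in feats.items():
            row["UU"][f"{neighbor}:{name}"] = value
    return uid, row

def edge_features_for(row, neighbor):
    """Collect all 'UU' values whose qualifier starts with 'neighbor:'."""
    prefix = f"{neighbor}:"
    return {q[len(prefix):]: v for q, v in row["UU"].items() if q.startswith(prefix)}

rowkey, row = make_row(
    2718282,
    {"neighbors": 101, "events": 3, "featureX": 0.3678795},
    {
        314159: {"n": 31, "e": 1, "fx": 0.3183},
        161803: {"n": 83, "e": 2, "fx": 0.618},
    },
)
```

Storing a user's node and edge features in one row means a single row read returns everything needed to score that user's edges.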
[Diagram: users U1, U2, U3 in HBase]
Hadoop Tips & Tricks
• Joins
  • Distributed cache
  • Hive map-side joins
• Hive
  • Nice set of statistical functions
  • Lots of Hive queries
• HBase
  • Lots of memory
  • WAL
  • LZO
  • Proper configs
  • Avoid hot region servers
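The join tip above relies on a common pattern: ship the small table to every mapper (via the distributed cache), load it into memory, and join each record as it streams past, so no shuffle or reduce phase is needed. The sketch below imitates that in plain Python; the tab-separated file format and the function names are assumptions for illustration.

```python
def load_small_table(lines):
    """Stand-in for reading a small cached file: 'key<TAB>value' per line."""
    table = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        table[key] = value
    return table

def map_side_join(records, small_table):
    """Join each (eid, uid) record against the in-memory table."""
    for eid, uid in records:
        category = small_table.get(eid)
        if category is not None:  # inner join: drop unmatched records
            yield uid, eid, category

categories = load_small_table(["E1\tmusic", "E2\tconference"])
joined = list(map_side_join([("E1", "U1"), ("E3", "U2"), ("E2", "U3")], categories))
```

This only works when one side of the join fits comfortably in each mapper's memory.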
Hadoop Tips & Tricks
• Combiners did not work
• Shuffle and Merge
More Innovation
• Rethink everything
• Add social to search
• Add time series features
• Real-time updates using firehose and Storm
• Various sorts of data
Developers! Developers! Developers!
• Interested in scaling, messaging, data, machine learning, mobile, services
• We will continue to push the boundaries of hard problems
• jobs@eventbrite.com
• vipul@eventbrite.com
Storm at Eventbrite
Tuesday August 21, 2012 at Eventbrite HQ
How we are using Storm for real time processing of our data
Andrew Whang whang@eventbrite.com
http://www.eventbrite.com/event/4010290888
Questions?