3/22/2011 Machine Learning Meetup Justin Moore - @injust
Matthew Rathbone - @rathboma
Big Data @ foursquare
Infrastructure, Analytics, Prediction, and Beyond
Overview
• What is foursquare
• Analytics and Data
• Machine Learning, Recommendations
What is Foursquare?
• Location-based startup; an application that helps you explore your city and discover new places
• Visit places, check in, earn rewards, stay connected with your friends
• Game elements: single-player, multi-player
What is Foursquare? (cont.)
• 7M+ users, 15M+ venues, 500M+ check-ins
• Large reach (every country, North Pole, Space, Everest)
• Native app for almost every smartphone; also available on SMS, web, mobile-web
Explore
• Our new social-recommendation engine
• Real-time suggestions based on your social graph.
Data Model
Users
Venues
Tips/To-dos
Check-ins
Shouts
Analytics @ Foursquare
I’m going to talk about:
• Why production DBs are bad for analytics
• What we do to make it better (hint: Hadoop)
• Our custom Dashboard
• Usage examples
• Thoughts about the Hadoop/Hive experience
Our Data: Problems using the Production Databases
Our data: So we turn to our friends
Our reporting / analytics / data mining stack is thanks to open source software
Our data: What we do instead
Log Files
About Hadoop and Hive
Hadoop:
• Distributed data processing framework (map-reduce).
• Written in Java
Hive:
• SQL layer on top of Hadoop
• Lets us do "select count(1) from checkins" instead of having to write our own map-reduce Java classes.
Image from ibm.com
About Hive
• Create/Drop/Insert/Select etc.
• Table Joins
• Aggregation Functions
• Date Functions
• URL parsing functions
• Cool n-gram functions
• Just now getting database drivers for popular languages (Java)
About Hive
    SELECT * FROM x;
    SELECT count(1) FROM x;
    SELECT sum(x.price) FROM x;
    SELECT a, sum(price) FROM x GROUP BY a;
    SELECT a FROM x WHERE datediff('2011-01-01', d) = 0;
    DROP TABLE x;
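Hadoop vs Hive
Hive:
    SELECT created_date, country, count(1)
    FROM checkins
    GROUP BY created_date, country
VS
Hadoop (Ruby mapper and reducer):
    # mapper: emit "date,country" for each check-in line
    $stdin.each do |line|
      date, country, id = line.split
      puts date + "," + country
    end
    # reducer: count how often each "date,country" key appears
    counts = Hash.new(0)
    $stdin.each do |line|
      counts[line.chomp] += 1
    end
    counts.each { |key, total| puts key + "," + total.to_s }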
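The mapper/reducer pair can be tried locally without a cluster by chaining map, sort, and reduce in one process. This is a minimal sketch, assuming tab-separated input lines of date, country, and check-in id; the field layout and the names `map_line`, `reduce_keys`, and `count_checkins` are illustrative and not from the talk.

```ruby
# Local simulation of a map -> shuffle (sort) -> reduce pipeline.
# Assumption: each input line is "date<TAB>country<TAB>checkin_id".

# Mapper: turn one check-in line into a "date,country" key.
def map_line(line)
  date, country, _id = line.chomp.split("\t")
  "#{date},#{country}"
end

# Reducer: count how many times each key was emitted.
def reduce_keys(keys)
  counts = Hash.new(0)
  keys.each { |key| counts[key] += 1 }
  counts
end

# Sorting between the two phases mimics the shuffle step.
def count_checkins(lines)
  reduce_keys(lines.map { |line| map_line(line) }.sort)
end

sample_lines = [
  "2011-03-21\tUS\t1",
  "2011-03-21\tUS\t2",
  "2011-03-21\tCH\t3",
  "2011-03-22\tUS\t4"
]

count_checkins(sample_lines).each do |key, total|
  puts "#{key},#{total}"
end
```

Even in this toy form the job needs three functions and an explicit sort, which is the overhead the one-statement Hive query avoids.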
Our Hadoop Infrastructure
• We use clusters generated through Amazon’s Elastic MapReduce
• That means we store all of our data in flat files in Amazon S3 (which keeps things simple)
• We export data from both MongoDB and HTTP proxy log files
• We manage everything using a custom Ruby on Rails dashboard
"rake cluster:start[30]" => starts a 30-node cluster, just like that
Our Dashboard
• Define and schedule reports through it
• Allows ad-hoc access for (internal) users
• Controls data imports into S3 from Mongo/log files
• Provides an intermediate DB layer for rollup data caching (experimental at the moment)
• Allows you to do a bunch of cool stuff with queries after they’ve run
Example: Importing Data
Example: Query Walkthrough
Find top 20 venues in Switzerland
venuename                       | city           | total
zurich airport (zrh)            | kloten         | 3746
geneva-cointrin airport (gva)   | grand-saconnex | 3012
zurich hauptbahnhof             | zurich         | 1780
sony ericsson football hotspot  | basel          | 773
basel bahnhof sbb               | basel          | 761
gare de cornavin                | geneva         | 760
bern hauptbahnhof               | bern           | 736
gare de lausanne                | lausanne       | 672
apple store                     | zurich         | 670
bahnhof luzern                  | luzern         | 477
terminal e                      | kloten         | 458
bellevueplatz                   | zurich         | 457
terminal a                      | kloten         | 455
bahnhof oerlikon                | zurich         | 453
bahnhof stadelhofen             | zurich         | 444
sihlcity                        | zurich         | 400
zurich flughafen bahnhof        | zurich         | 400
bahnhof olten                   | olten          | 391
bahnhof winterthur              | winterthur     | 379
bahnhof hardbrücke              | zurich         | 369
Walkthrough: Start the query
Walkthrough: Get the results in email
Walkthrough: Top Venues
Walkthrough
If we want to schedule something to run daily/weekly/monthly, we can do that too.
Reports are represented as ActiveRecord models.
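As a rough illustration of what scheduling could look like, here is a minimal sketch of a report with a daily/weekly/monthly frequency and a check for whether it is due. A plain `Struct` stands in for the ActiveRecord model so the example runs without Rails; the field names, the `due?` method, and the 30-day approximation of a month are all assumptions, not details from the talk.

```ruby
require "date"

# Days between runs for each schedule (a month is approximated as 30 days).
INTERVAL_DAYS = { daily: 1, weekly: 7, monthly: 30 }.freeze

# Hypothetical stand-in for the ActiveRecord report model.
Report = Struct.new(:name, :query, :frequency, :last_run_on) do
  # A report is due if it has never run or its interval has elapsed.
  def due?(today)
    return true if last_run_on.nil?
    (today - last_run_on).to_i >= INTERVAL_DAYS.fetch(frequency)
  end
end

# Select the reports a scheduler would run today.
def due_reports(reports, today)
  reports.select { |report| report.due?(today) }
end

today = Date.new(2011, 3, 22)
reports = [
  Report.new("daily check-ins", "SELECT count(1) FROM checkins", :daily, Date.new(2011, 3, 21)),
  Report.new("weekly venues", "SELECT count(1) FROM venues", :weekly, Date.new(2011, 3, 20)),
  Report.new("monthly users", "SELECT count(1) FROM users", :monthly, nil)
]

due_reports(reports, today).each { |report| puts report.name }
```

In a Rails app the same check would more likely be a model method called from a periodic rake task, with `last_run_on` stored as a column.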
Walkthrough: Reports feed our dashboards
Walkthrough: queries allow data exploration
Stats on the Stats Stack
• 25-machine clusters
• Reports on check-in data (joining venues and/or users) usually take 5-15 minutes to run
• Reports on log data usually take 10-20 minutes to run
• We run 10-30 reports a day
• Most data goes into a Google spreadsheet for people to look at.
Thoughts on Amazon’s EMR
• The API has very low rate limits
• Everything is an HTTP GET request (even creating a cluster)
• The Ruby library/client is unusable as a client library (we shell out to it in order to capture the resulting JSON)
Thoughts on Hive
• Generally good
• Sometimes it will act crazy
• Partitioning data is harder than it looks
• The JSON SerDe makes all sorts of weird stuff happen when you’re joining tables
• Always join LAST!
Working With Hive
OK:
    SELECT v.venuename, count(*)
    FROM checkins c JOIN venues v ON c.venueid = v.id
    GROUP BY v.venuename
BETTER (aggregate first, join last):
    SELECT v.venuename, c.total
    FROM (SELECT venueid, count(1) AS total FROM checkins GROUP BY venueid) c
    JOIN venues v ON c.venueid = v.id
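Why joining last helps can be seen on a toy dataset: both strategies give the same totals, but aggregating first shrinks the input to the join from one row per check-in to one row per venue. The sketch below simulates this in plain Ruby, assuming in-memory hashes in place of Hive tables and unique venue names; the function names and sample data are made up for illustration.

```ruby
# "OK": join every check-in to its venue first, then group and count.
def join_then_count(checkins, venue_names)
  joined = checkins.map { |checkin| venue_names.fetch(checkin[:venueid]) }
  counts = Hash.new(0)
  joined.each { |name| counts[name] += 1 }
  { counts: counts, joined_rows: joined.size }
end

# "BETTER": count per venueid first, then join the much smaller result.
def count_then_join(checkins, venue_names)
  totals = Hash.new(0)
  checkins.each { |checkin| totals[checkin[:venueid]] += 1 }
  joined = totals.map { |venueid, total| [venue_names.fetch(venueid), total] }
  { counts: joined.to_h, joined_rows: joined.size }
end

# Toy data (assumed): four check-ins across two venues.
checkins = [{ venueid: 1 }, { venueid: 1 }, { venueid: 2 }, { venueid: 1 }]
venue_names = { 1 => "zurich hauptbahnhof", 2 => "apple store" }

ok = join_then_count(checkins, venue_names)
better = count_then_join(checkins, venue_names)

puts "same counts: #{ok[:counts] == better[:counts]}"
puts "rows joined: #{ok[:joined_rows]} vs #{better[:joined_rows]}"
```

With hundreds of millions of check-ins and tens of millions of venues, that difference in join input size is what the "always join LAST" advice is about.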
Our Data: End
• Hadoop + Hive > Mongo + Scripts
• Simple ruby dashboard == super useful
• Lots of data == fun charts
QUESTIONS?
foursquare 3.0: Explore
Engineering an Online Recommendation System
Engineering cont.
Goals:
• “Here and now”
• No new signals
• Use all of our textual data
• 100ms per query
Engineering cont.
Pain points:
• Geo indexes, compound geo indexes
• Limiting queries in minimally impactful ways
• Cached datastores (building rollup collections)
• Geo indexes
Computing a Similarity Matrix
• Analyzing similarity functions is OK on a single machine
• 10M+ venues = 100 trillion element sparse matrix
  – Compute without visiting every element
  – Parallelize across machines
Compute Similarity Matrix, cont.
• Leverage Mahout’s library of similarity functions, easy to extend
• Job system controls execution of sequential dependent M-R tasks
• Hadoop: easily scalable to large commodity machine clusters; Elastic MapReduce makes increasing cluster size trivial
Compute Similarity Matrix, cont.
Series of “Jobs,” each doing a Map-Reduce:
1. Convert input flat file dumped from Hive to binary sparse vector representation
2. Compute pairwise co-occurrences
3. Compute column-based weights (column normalization), retrieve all vectors with co-occurrences
4. Compute pairwise similarities, store in sparse matrix format
5. Flatten sparse matrix to text format that we can load into DB
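The five steps can be sketched on a single machine to show how the sparse structure is exploited: only venue pairs that share at least one visitor are ever touched, so most of the 100 trillion cells are never visited. This is a minimal sketch, assuming input lines of the form "user<TAB>venue" and cosine similarity over binary visit vectors; the talk only says Mahout's similarity functions were used, so the choice of cosine, the input layout, and all function names here are assumptions.

```ruby
require "set"

# Step 1: parse "user<TAB>venue" lines into user => set of venues
# (a binary sparse vector per user).
def venues_by_user(lines)
  visits = Hash.new { |hash, user| hash[user] = Set.new }
  lines.each do |line|
    user, venue = line.chomp.split("\t")
    visits[user] << venue
  end
  visits
end

# Step 2: count co-occurrences, visiting only pairs that share a user.
def cooccurrences(visits)
  counts = Hash.new(0)
  visits.each_value do |venues|
    venues.to_a.sort.combination(2) { |pair| counts[pair] += 1 }
  end
  counts
end

# Step 3: per-venue weight, here the number of distinct visitors.
def visitor_counts(visits)
  totals = Hash.new(0)
  visits.each_value { |venues| venues.each { |venue| totals[venue] += 1 } }
  totals
end

# Step 4: cosine similarity for binary vectors: shared / sqrt(n_a * n_b).
def similarities(visits)
  totals = visitor_counts(visits)
  cooccurrences(visits).map do |(a, b), shared|
    [a, b, shared / Math.sqrt(totals[a] * totals[b])]
  end
end

# Step 5: flatten to tab-separated text for loading into a database.
def flatten(rows)
  rows.map { |a, b, sim| format("%s\t%s\t%.4f", a, b, sim) }
end

lines = ["u1\tcafe", "u1\tbar", "u2\tcafe", "u2\tbar", "u3\tcafe", "u3\tpark"]
puts flatten(similarities(venues_by_user(lines)))
```

In the distributed version each function would roughly correspond to one Map-Reduce job, with the per-user pair generation in step 2 being the part that parallelizes naturally.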
The Value of Why
• Show people which friends visited, which places are co-visited (not the same as “similar”?)
• Lowers the bar for precision
  – Allows users to choose for themselves among recs
  – Increases propensity to check in (sales pitch for the venue)
• Mix with the social, story-telling aspects of product
• Collaborative filtering allows for easy description
Case Study: Defining “Interesting”
• Need to show ranked venues for “cold-start”
• Various influencing factors in what makes a place “interesting”
  – Number of users checked in
  – Average visits per user
  – Tips left, to-dos done
  – How people check in (broadcast to T/FB, off-the-grid?)
  – Trending direction (more popular lately?)
• Measuring raw popularity poses problems
  – Places open just for lunch, smaller dining rooms, longer meal times
  – Been in system longer, opened recently
  – Differences between categories (coffee shops != burger joints)
Defining “Interesting” cont.
[Chart: Visits Per User (axis ticks 0 to 7) against Unique Users, with regions labeled “Local Favorite” and “Must See”]
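One possible reading of the chart is that venues with many repeat visits per user are "Local Favorites", while venues with many distinct visitors who rarely return are "Must Sees". The sketch below computes both measures from raw check-ins and applies that reading. The thresholds, the rule order, and the function names are hypothetical; the talk does not give cut-off values or a formula.

```ruby
# Assumption: check-ins arrive as [user, venue] pairs.
# Returns venue => { unique_users:, visits_per_user: }.
def venue_stats(checkins)
  visits = Hash.new { |hash, venue| hash[venue] = Hash.new(0) }
  checkins.each { |user, venue| visits[venue][user] += 1 }
  visits.map do |venue, per_user|
    unique_users = per_user.size
    visits_per_user = per_user.values.sum.to_f / unique_users
    [venue, { unique_users: unique_users, visits_per_user: visits_per_user }]
  end.to_h
end

# Hypothetical thresholds; real cut-offs would be tuned per city and category.
def classify_venue(stats, min_users: 4, min_visits_per_user: 2.0)
  return "Local Favorite" if stats[:visits_per_user] >= min_visits_per_user
  return "Must See" if stats[:unique_users] >= min_users
  nil
end

checkins =
  [["alice", "diner"]] * 3 + [["bob", "diner"]] * 3 +
  %w[ann ben cat dan].map { |user| [user, "landmark"] } +
  [["alice", "kiosk"]]

venue_stats(checkins).each do |venue, stats|
  puts "#{venue}: #{classify_venue(stats).inspect}"
end
```

Normalizing by users rather than counting raw check-ins is one way to address the popularity problems listed earlier, since a small lunch-only place can still score highly on visits per user.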
Future Directions
• Still a big unknown; collect user feedback to drive development
• Scale beyond just co-occurrences, improve prediction in new territory
• Planning mode (beyond the here and now)
• Joint recommendations (where do I go with this set of friends?)
Help us get there
foursquare is hiring
www.foursquare.com/jobs
Justin Moore
@injust [email protected]
Matthew Rathbone @rathboma