Data Analysis at Facebook
Jeff Hammerbacher, Ding Zhou*
Facebook Inc.
Outline
• How does Facebook work
• Managing Big Data
• Data Analysis for Business Intelligence
• Data Analysis for “Artificial Intelligence”
• Questions
How does Facebook work?
Profile page - content generation portal
Newsfeed page - content consumption portal
Friends page - social graph portal
App page - social app platform
Facebook Data
▪ Social Graph Data
  ▪ The Nodes:
    ▪ 100m+ users; 100+ dimensions per user (numerical, text, categorical);
    ▪ 350k registrations daily;
  ▪ The Edges:
    ▪ 200+ friends per user (median);
    ▪ 20 categories of edges (fb friends, co-workers, family, etc.);
▪ Social Behavior Data
  ▪ Social Interactions: interactions among users, via 100+ interaction types;
  ▪ Social Actions: between users and 33k+ Facebook apps, via 200+ action types;
▪ Social Content Data
  ▪ Content of Posts, Notes, Photos, Video, etc.
Managing Big Data
▪ Data scale [backend]:
  ▪ Over 1.3 PB raw capacity in largest cluster;
  ▪ Nearly 2 TB uncompressed data per day;
  ▪ Over 20 TB read/write per day;
▪ Distributed Data management:
  ▪ HDFS/Hadoop (MapReduce in Java);
  ▪ MetaStore (MetaData management);
  ▪ Hive QL (Query language on Hadoop+MetaStore);
▪ Usage:
  ▪ at least 50 engineers have run hadoop jobs
  ▪ 3,514 Jobs weekly
  ▪ 821 Projections, 152 Joins, 800 Aggregates, 600 Loaders weekly
Hadoop - MapReduce in Java
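MapReduce Execution Flow [Dean, J. and Ghemawat, S., 2004]
[Figure: word count as a MapReduce job]
Input: facebook data team uses hadoop for data analysis
Map output (one group per input split): facebook:1 data:1 team:1 | uses:1 hadoop:1 for:1 | data:1 analysis:1
Shuffle/sort (one group per reducer): analysis:1 data:1 data:1 facebook:1 | for:1 hadoop:1 team:1 uses:1
Reduce output: analysis:1 data:2 facebook:1 for:1 hadoop:1 team:1 uses:1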
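The word-count flow on this slide can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce steps only, not Hadoop's Java API; the function names and the three-way split of the input sentence are assumptions chosen to mirror the figure.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(splits):
    """Run map over every split, shuffle by word, then reduce in key order."""
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(w, c) for w, c in sorted(grouped.items()))


# The slide's sentence, cut into three splits (assumed split boundaries).
splits = ["facebook data team", "uses hadoop for", "data analysis"]
result = word_count(splits)
print(result)
```

In a real cluster each call to map_phase and reduce_phase would run on a different machine; only the shuffle step moves data between them.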
Data Analysis for Business Intelligence
Data for Business Intelligence
▪ General Goal:
  ▪ support growth and monetization strategies, and product decisions
▪ User Behavior Studies
  ▪ NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive of engagement;
  ▪ Identity*: Unsupervised learning over user session data to identify common usage patterns. Techniques employed include K-Means, PageRank, dimension reduction methods;
▪ Experimentation Platform
  ▪ Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
  ▪ Columbus++*: A/B testing for impact of site change on site health metrics;
▪ Reporting System
  ▪ ad-hoc analysis done by Hive queries
* projects that Ding Zhou participates in (underlined on the original slide);
[Figure: Columbus. Geographical bird's-eye view of growth by country; comparison between user groups]
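The slides do not say which statistical test Columbus++ applies when it compares a site change against a control group. As one common possibility, a two-proportion z-test on a binary site health metric can be sketched as follows; the metric, the counts, and the function name are all hypothetical.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: users who returned the next day, out of 10,000
# users in the control group (A) and 10,000 in the changed-site group (B).
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Drilling down by user group, as Columbus does, amounts to running the same comparison once per country, age band, or gender, which raises the usual multiple-comparison caveats.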
Data Analysis for “Artificial Intelligence”: predicting user social behavior
who the user will interact with
• predict interactions between friends
• features are user profile and browsing history
• tried linear models and tree models
• applied for search, newsfeed, etc
who the user hasn’t found yet
• missing edge prediction problem
• observations are friend/non-friend pairs
• features include profile and local graph info
• profile info more informative
• graph info supplemental if profile incomplete
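The "local graph info" features are not listed on the slide. A minimal sketch, assuming two standard choices (common-neighbor count and Jaccard overlap of friend sets), shows how non-friend pairs could be scored on a toy graph. The user names and helper functions are hypothetical, and a real model would combine these scores with profile features.

```python
def common_neighbors(graph, u, v):
    """Number of friends that u and v share."""
    return len(graph[u] & graph[v])


def jaccard(graph, u, v):
    """Shared friends divided by the size of the combined friend set."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def candidate_pairs(graph):
    """Yield non-friend pairs that share at least one friend."""
    users = sorted(graph)
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if v not in graph[u] and graph[u] & graph[v]:
                yield u, v


def rank_missing_edges(graph):
    """Sort candidate pairs by Jaccard score, then by shared-friend count."""
    scored = [
        (jaccard(graph, u, v), common_neighbors(graph, u, v), u, v)
        for u, v in candidate_pairs(graph)
    ]
    return sorted(scored, reverse=True)


# Toy undirected friendship graph: user -> set of friends (hypothetical data).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
ranked = rank_missing_edges(graph)
for score, shared, u, v in ranked:
    print(f"{u}-{v}: jaccard={score:.2f}, shared friends={shared}")
```

Restricting candidates to pairs with a shared friend keeps the search local; scoring every possible pair among 100m+ users would not be feasible.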
what applications the user may like*
* projects that Ding Zhou participates in;
• 33k apps, only 0.1% of them used;
• a different recommendation problem;
• prediction model not applicable, user preference unavailable;
• build a prediction model to infer “user ratings”;
• user-based + item-based recommendation
• how to combine profile, social graph, ratings?
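The item-based half of the recommender can be illustrated with cosine similarity between apps over inferred ratings. This is a sketch under assumptions: the slide does not give the similarity measure or the form of the inferred "user ratings", so the 0/1 ratings, app names, and function names below are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


def item_vectors(ratings):
    """Transpose user -> {app: rating} into app -> {user: rating}."""
    items = defaultdict(dict)
    for user, apps in ratings.items():
        for app, rating in apps.items():
            items[app][user] = rating
    return items


def recommend(ratings, user, top_n=2):
    """Rank unseen apps by their similarity to the apps the user already has."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(items[candidate], items[app]) * rating
            for app, rating in seen.items()
        )
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda app: (-scores[app], app))[:top_n]


# Hypothetical inferred ratings (1.0 means the model infers the user likes the app).
ratings = {
    "u1": {"chess": 1.0, "quiz": 1.0},
    "u2": {"chess": 1.0, "quiz": 1.0, "poker": 1.0},
    "u3": {"poker": 1.0, "farm": 1.0},
    "u4": {"chess": 1.0},
}
suggestions = recommend(ratings, "u4")
print(suggestions)
```

A user-based variant swaps the roles: it compares users by the apps they share and recommends what similar users have.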
what content is interesting*
* projects that Ding Zhou participates in;
• newsfeed as the main content distribution channel
• stories generated by 100s of social actions: on the site, platform, or the Web
• <0.1% of possible stories are shown
• predictions built on story features, and user browsing history
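Showing fewer than 0.1% of possible stories means scoring many candidates and keeping only the top few. The sketch below assumes a logistic model over story features, which the slides do not confirm; the feature names, weights, and story records are hypothetical.

```python
import heapq
import math


def score_story(weights, features):
    """Logistic score: estimated probability that the user engages with the story."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def select_stories(weights, candidates, k):
    """Keep the k highest-scoring stories without sorting the full candidate list."""
    return heapq.nlargest(
        k, candidates, key=lambda story: score_story(weights, story["features"])
    )


# Hypothetical model weights and candidate stories.
weights = {"bias": -1.0, "from_close_friend": 2.0, "is_photo": 0.5, "hours_old": -0.1}
candidates = [
    {"id": "a", "features": {"bias": 1, "from_close_friend": 1, "is_photo": 1, "hours_old": 2}},
    {"id": "b", "features": {"bias": 1, "from_close_friend": 0, "is_photo": 1, "hours_old": 1}},
    {"id": "c", "features": {"bias": 1, "from_close_friend": 1, "is_photo": 0, "hours_old": 10}},
    {"id": "d", "features": {"bias": 1, "from_close_friend": 0, "is_photo": 0, "hours_old": 24}},
]
top = select_stories(weights, candidates, k=2)
print([story["id"] for story in top])
```

Using heapq.nlargest keeps selection close to linear in the number of candidates when k is small, which matters when the candidate pool is far larger than what is shown.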
Challenges in Data
- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. small app has no user data)
- prediction requirements
  ▪ models regularly applied for 10 billion novel samples
  ▪ models used on-the-fly for 100k samples in 50 ms
Special Machine Learning Problems
- use machine learning to predict user behavior
  ▪ labels: insufficient; inferred implicitly; imbalanced;
  ▪ features: high-dimensional; strongly correlated; noisy;
- scale requires distributed algorithms
  ▪ in-house implementation of tree ensemble methods (bagging predictors)
  ▪ larger training sets grant performance improvements
- speed and accuracy improvements underway
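Bagging trains each model on a bootstrap resample of the training set and combines predictions by majority vote. The in-house distributed implementation is not described on the slides, so the sketch below is a single-machine stand-in that uses one-level decision trees (stumps) as the base learner; the toy data and all names are assumptions.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-level tree: pick the (feature, threshold) split with the fewest errors."""
    best = None
    n_features = len(samples[0][0])
    for f in range(n_features):
        for threshold in sorted({x[f] for x, _ in samples}):
            left = [y for x, y in samples if x[f] <= threshold]
            right = [y for x, y in samples if x[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(
                y != right_label for y in right
            )
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda x: left_label if x[f] <= threshold else right_label


def train_bagging(samples, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))
    return models


def predict(models, x):
    """Majority vote over the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: label is 1 when the first feature exceeds 5; the second is noise.
samples = [((float(i), float(i % 3)), int(i > 5)) for i in range(12)]
models = train_bagging(samples)
print(predict(models, (0.0, 0.0)), predict(models, (11.0, 2.0)))
```

Because every bootstrap model is trained independently, the training loop parallelizes naturally across machines, which is one reason bagging suits the scale described here.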
tip of the iceberg
Questions?
(c) 2004-2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.