Integrating Social with Search
Edward Chang
Director, Google Research, Beijing
WISE Keynote 1 2010-12-13
Related Papers
• AdHeat (Social Ads):
– AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads, H.J. Bao and E. Y. Chang, WWW 2010 (best paper candidate), April 2010.
– Parallel Spectral Clustering in Distributed Systems, Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and E. Y. Chang, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010.
• UserRank:
– Confucius and its Intelligent Disciples, X. Si, E. Y. Chang, Z. Gyongyi, VLDB, September 2010.
– Topic-dependent User Rank, Xiance Si, Z. Gyongyi, E. Y. Chang, and M.S. Sun, Google Technical Report.
• Large-scale Collaborative Filtering:
– PLDA+: Parallel Latent Dirichlet Allocation for Large-Scale Applications, ACM Transactions on Internet Technology, 2011.
– Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior, W.-Y. Chen, J. Chu, E. Y. Chang, WWW 2009: 681-690.
– Combinational Collaborative Filtering for Personalized Community Recommendation, W.-Y. Chen, E. Y. Chang, KDD 2008: 115-123.
– PSVM: Parallelizing SVMs on distributed machines, E. Y. Chang, et al., NIPS 2007.
Web 1.0
[Figure: collection of web files (.htm, .jpg, .doc, .msg)]
Web 2.0 --- Web with People
[Figure: collection of web files (.htm, .jpg, .doc, .xls, .msg)]
Outline
• Search + Social Synergy
• Search Social
• Social Search
• Scalability
Google Q&A (Confucius)
• Developed from 2007; all now in Beijing
• Launched in more than 60 countries
– Russia
– HK
– Southeast Asia
– Arab World
– Sub-Saharan Africa (Baraza)
Query: What are must-see attractions at Yellowstone
Query: What are must-see attractions at Yosemite
Query: What are must-see attractions at Beijing
Hotel ads
Search Quality at Stake.
Strategic Issue
17 June 2008 - Knowledge Search Strategy
[Charts: % of Referral Traffic From 1st Page Sent to Yahoo / Baidu; % of First Result Page with >=1 Q&A Result from Yahoo or Baidu]
In 61 countries, Q&A sites or advanced forums are among the top 10 most-clicked destinations (out of 115 countries with more than 1M sessions)
SNS & Mobile Also Need Q&A
• Social Networks
– Difficult to find user intent to match ads
– Q&A is a perfect app to learn users' problems
• Mobile Search
– Voice is the most convenient user interface
– Succinct search result (or rich snippets) is desirable
Confucius: Google Q&A
Providing High-Quality Answers in a Timely Fashion
• Trigger a discussion/question session during search
• Provide labels to a post (semi-automatically)
• Given a post, find similar posts (automatically)
• Evaluate quality of a post, relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide most relevant, high-quality content for Search to index
• Generate answers using NLP
[Diagram: users (U), Search (S), Community (C), CCS/Q&A, Discussion/Question, and Differentiated Content (DC)]
Confucius: Google Q&A
Providing High-Quality Answers in a Timely Fashion
• Trigger a discussion/question session during search
• Provide labels to a question (semi-automatically)
• Given a question, find similar questions (automatically)
• Evaluate quality of an answer, relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide most relevant, high-quality content for Search to index
• Generate answers using NLP
Q&A Uses Machine Learning
• Label suggestion using the LDA algorithm.
• Real-time topic-to-topic (T2T) recommendation using the LDA algorithm.
• Suggests related high-quality links to previous questions before human answers appear.
Collaborative Filtering
[Figure: sparse binary membership matrix, Questions vs. Labels/Qs]
Based on membership so far, and memberships of others, predict further membership
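The membership-prediction idea on this slide can be sketched as a tiny neighbourhood-based collaborative filter. This is only an illustration under stated assumptions: the question IDs, label names, the cosine similarity on label sets, and the function names are all hypothetical and are not taken from the talk or from Confucius itself.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two label sets (binary membership vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def predict_memberships(memberships, target, top_n=2):
    """Rank labels the target lacks, weighted by similarity to other questions."""
    own = memberships[target]
    scores = {}
    for other, labels in memberships.items():
        if other == target:
            continue
        sim = cosine(own, labels)
        if sim == 0.0:
            continue
        # Every label the neighbour has and the target lacks gets a vote.
        for label in labels - own:
            scores[label] = scores.get(label, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked[:top_n]]


# Hypothetical toy data: question -> labels it already belongs to.
memberships = {
    "q1": {"travel", "yellowstone"},
    "q2": {"travel", "yellowstone", "national-park"},
    "q3": {"travel", "hotel"},
    "q4": {"cooking", "apple-pie"},
}
suggested = predict_memberships(memberships, "q1")
print(suggested)
```

Here `q2` shares more labels with `q1` than `q3` does, so its extra label is ranked ahead of `q3`'s.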
FIM-based Recommendation
Distributed Latent Dirichlet Allocation (LDA)
• Other Collaborative Filtering Apps
– Recommend Users → Users
– Recommend Music → Users
– Recommend Ads → Users
– Recommend Answers → Q
• Predict the ? in the light-blue cells
[Figure: matrix of Users/Music/Ads/Answers vs. Users/Music/Ads/Question]
• Search
– Construct a latent layer for better semantic matching
• Example:
– iPhone crack
– Apple pie
[Figure: documents and user queries mapped to topic distributions. Documents: "1 recipe pastry for a 9 inch double crust, 9 apples, 2/1 cup, brown sugar"; "How to install apps on Apple mobile phones?". User queries: iPhone crack, Apple pie]
Latent Dirichlet Allocation [D. Blei, M. Jordan 04]
• α: uniform Dirichlet prior for the per-document (d) topic distribution (corpus-level parameter)
• β: uniform Dirichlet prior for the per-topic (z) word distribution φ (corpus-level parameter)
• θ_d is the topic distribution of document d (document level)
• z_dj is the topic of the jth word in d; w_dj is the specific word (word level)
[Figure: LDA plate diagram with nodes α, θ, z, w, β, φ and plates Nm, M, K]
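For reference, the standard LDA generative process that these symbols describe can be written out as follows. The index ranges are assumptions made for readability: K topics, M documents, and N_d words in document d (the plate diagram labels the latter Nm).

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,M \\
z_{dj} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & j &= 1,\dots,N_d \\
w_{dj} \mid z_{dj}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{dj}})
\end{align*}
```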
Combinational Collaborative Filtering Model (CCF) [W.-Y. Chen, et al, KDD2008]
[Figure: CCF model relating Communities to users and to descriptions (Word 1, Word 2, Word 3)]
Confucius: Google Q&A
• Trigger a discussion/question session during search
• Provide labels to a post (semi-automatically)
• Given a post, find similar posts (automatically)
• Evaluate quality of a post, relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide most relevant, high-quality content for Search to index
• NLQA
UserRank
• Rank users by quantity (number of links) and quality (weights on links) of contributions
• Quality includes:
– Relevance. Is an answer relevant to the Q? Measured by KL divergence between latent-topic vectors of A and Q
– Coverage. Compared among different answers
– Originality. Detect potential plagiarism and spam
– Promptness. Time between the posting times of Q and A
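The relevance measure can be illustrated with a small KL-divergence computation over latent-topic vectors. This is a sketch under assumptions: the slide does not say in which direction the divergence is taken or how zero probabilities are smoothed, so the direction KL(Q || A), the `eps` floor, and the three-topic vectors below are all hypothetical.

```python
from math import log


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # eps guards against log(pi / 0); it is an assumed smoothing choice.
            total += pi * log(pi / max(qi, eps))
    return total


# Hypothetical topic vectors (each sums to 1).
question = [0.7, 0.2, 0.1]
on_topic_answer = [0.6, 0.3, 0.1]
off_topic_answer = [0.1, 0.1, 0.8]

print(kl_divergence(question, on_topic_answer))
print(kl_divergence(question, off_topic_answer))
```

A lower divergence means the answer's topic mix is closer to the question's, so the on-topic answer scores as more relevant.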
Outline
• Search + Social Synergy
• Social Search
• Search Social
• Scalability
Social?
• Connecting to friends
• Knowing what friends are up to
• Connecting to strangers
– Dating, Gaming
– Shopping
• Making recommendations based on activities
User Latent Model
• α: uniform Dirichlet prior for the per-user (u) interest distribution (population-level parameter)
• β: uniform Dirichlet prior for the per-interest (z) activity distribution φ (population-level parameter)
• θ_u is the interest distribution of user u (user level)
• z_uj is the interest of the jth activity in u; w_uj is the specific activity (activity level)
[Figure: plate diagram with nodes α, θ, z, w, β, φ and plates Nm, M, K]
Combinational Collaborative Filtering Model (CCF)
[Figure: CCF model relating Users to Keywords and to Videos/Photos]
Interest Networks
Outline
• Search + Social Synergy
• Social Search
– Mobilize users to improve search quality
– Google Q&A, Facebook Like
• Search Social
– Use query log to help social
• Activities Interests Social
• Groupcom
• Scalability
LDA Gibbs Sampling: Inputs & Outputs
Inputs:
1. training data: users as bags of words
2. parameter: the number of topics
Outputs:
1. model parameters: a co-occurrence matrix of topics and words.
2. by-product: a co-occurrence matrix of topics and users.
[Figure: users-by-words input matrix; words-by-topics and topics-by-users output matrices]
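To make these inputs and outputs concrete, here is a minimal single-machine collapsed Gibbs sampler. It is a sketch under assumptions, not the sampler used in the talk: words are assumed to be integer IDs, and the hyperparameter values, iteration count, function name, and toy bags are all hypothetical.

```python
import random


def lda_gibbs(bags, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Return (topic_word, user_topic) co-occurrence count matrices."""
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # topics x words
    user_topic = [[0] * num_topics for _ in bags]               # users x topics
    topic_total = [0] * num_topics
    assignments = []

    def add(u, w, z, delta):
        # Keep all three count tables consistent with one assignment change.
        topic_word[z][w] += delta
        user_topic[u][z] += delta
        topic_total[z] += delta

    # Start from a random topic for every word occurrence.
    for u, bag in enumerate(bags):
        row = []
        for w in bag:
            z = rng.randrange(num_topics)
            row.append(z)
            add(u, w, z, +1)
        assignments.append(row)

    topics = range(num_topics)
    for _ in range(iters):
        for u, bag in enumerate(bags):
            for j, w in enumerate(bag):
                # Remove the current assignment, then resample it from the
                # conditional distribution given all other assignments.
                add(u, w, assignments[u][j], -1)
                weights = [
                    (user_topic[u][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                z = rng.choices(topics, weights=weights)[0]
                assignments[u][j] = z
                add(u, w, z, +1)
    return topic_word, user_topic


# Hypothetical toy input: four users, vocabulary of four word IDs.
bags = [[0, 1, 0, 1, 0], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 3]]
topic_word, user_topic = lda_gibbs(bags, num_topics=2, vocab_size=4)
print(topic_word)
print(user_topic)
```

The two returned matrices correspond to the two outputs listed on the slide: topic-word counts are the model, and user-topic counts are the by-product.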
Parallel Gibbs Sampling
Inputs:
1. training data: users as bags of words
2. parameter: the number of topics
Outputs:
1. model parameters: a co-occurrence matrix of topics and words.
2. by-product: a co-occurrence matrix of topics and users.
[Figure: users-by-words input matrix; words-by-topics and topics-by-users output matrices]
Observations
[Diagram: Master Node]
• Bottleneck: Communication
• Amdahl's law caps speedup
• Words in a bag have no order
• Words on a computer node can be reordered
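The cap imposed by Amdahl's law can be made concrete with a few lines of arithmetic. The serial fractions below are illustrative assumptions, not measurements from the talk.

```python
def amdahl_speedup(serial_fraction, machines):
    """Speedup predicted by Amdahl's law for a given non-parallelizable fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / machines)


# Assumed serial fractions, evaluated on 2,000 machines.
for serial_fraction in (0.0, 0.001, 0.01, 0.1):
    print(serial_fraction, round(amdahl_speedup(serial_fraction, 2000), 1))
```

Even a 1% serial portion (for example, communication through a master node) keeps the speedup below 100x regardless of machine count, which is why the communication bottleneck matters.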
Example Bags / Node A
• Bag #1: w1, w2, w3, w1, w2, w3, w1, w2, w3
• Bag #2: w1, w2, w1, w2, w1, w2, w1, w2
• Bag #3: w3, w1, w3, w1, w3, w1, w3, w1

• Bundle #1: w1, w1, w1, w1, w1, w1, …
• Bundle #2: w2, w2, w2, …
• Bundle #3: w3, w3, w3, …
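Because words in a bag have no order, the occurrences held on one node can be regrouped by word. A minimal sketch of that regrouping, using the three bags from the slide; the function name and the choice to record the owning bag for each occurrence are assumptions made for illustration.

```python
def bundle_words(bags):
    """Map each word to the list of bag IDs that own its occurrences."""
    bundles = {}
    for bag_id, bag in bags.items():
        for word in bag:
            # The owning bag is kept so per-bag topic counts can still be updated.
            bundles.setdefault(word, []).append(bag_id)
    return bundles


bags = {
    1: ["w1", "w2", "w3"] * 3,
    2: ["w1", "w2"] * 4,
    3: ["w3", "w1"] * 4,
}
bundles = bundle_words(bags)
sizes = {word: len(owners) for word, owners in bundles.items()}
print(sizes)
```

Each bundle now holds every occurrence of one word on the node, so a node can work through one word at a time.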
Two Nodes
Node A | Node B
W1     | W2
W2     | W3
W3     | W1
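The table reads as a rotation: at each step the two nodes work on different word bundles, then shift to the next one. A small sketch that generates such a schedule, assuming zero-based bundle indices (0 for W1, 1 for W2, 2 for W3); the function name is hypothetical.

```python
def rotation_schedule(num_nodes, num_bundles):
    """schedule[step][node] is the bundle index that node processes at that step."""
    if num_nodes > num_bundles:
        raise ValueError("need at least as many bundles as nodes")
    return [
        [(node + step) % num_bundles for node in range(num_nodes)]
        for step in range(num_bundles)
    ]


schedule = rotation_schedule(num_nodes=2, num_bundles=3)
for step, row in enumerate(schedule):
    print(step, ["W%d" % (b + 1) for b in row])
```

No two nodes touch the same bundle in the same step, so their updates to that word's topic counts do not collide.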
Parallel Gibbs Sampling
Inputs:
1. training data: documents as bags of words
2. parameter: the number of topics
Outputs:
1. model parameters: a co-occurrence matrix of topics and words.
2. by-product: a co-occurrence matrix of topics and documents.
[Figure: docs-by-words input matrix; words-by-topics and topics-by-docs output matrices]
PLDA -- enhanced parallel LDA
• Take advantage of bag-of-words modeling: each Pw machine processes vocabulary in a word order
• Pipelining: fetching the updated topic distribution matrix while doing Gibbs sampling
Speedup 1,500x using 2,000 machines
Lessons Learned
• Bottleneck Matters
• Inter-iteration Matters
MapReduce
[Diagram: input data split into Data Blocks 1-5, each fed to a map task; a Reduce task combines the map outputs into Results]
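The map and reduce stages in the diagram can be imitated in a single process. This word-count sketch is only an illustration of the programming model: the function names and sample blocks are hypothetical, and a real MapReduce run would distribute the blocks across machines.

```python
from collections import defaultdict


def map_block(block):
    """Map stage: emit a (word, 1) pair for every word in one data block."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce stage: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


blocks = ["search social search", "social ads", "search ads ads"]
result = reduce_counts(shuffle(map_block(block) for block in blocks))
print(result)
```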
Parallel Programming Models
Criterion | MapReduce | Project + | MPI
GFS/IO and task rescheduling overhead between iterations | Yes | No +1 | No +1
Flexibility of computation model | AllReduce only +0.5 | ? +1 | Flexible +1
Efficient AllReduce | Yes +1 | Yes +1 | Yes +1
Recover from faults between iterations | Yes +1 | Yes +1 | Apps
Recover from faults within each iteration | Yes +1 | Yes +1 | Apps
Final Score for scalable machine learning | 3.5 | 5 | 4
SVM Bottlenecks
• Time consuming: 1M dataset, 8 days
• Memory consuming: 1M dataset, 10G
Matrix Factorization Alternatives
[Figure: exact vs. approximate factorization alternatives]
Parallelizing SVM [E. Chang, et al, NIPS 07]
[Diagram: pipeline boxes Raw Data, Kernel Matrix, ICF, Matrix Multiplication; incremental boxes Incremental Data, Incremental Kernel Matrix, Incremental ICF, Incremental Matrix Multiplication, Incremental Linear System Solving]
Incomplete Cholesky Factorization (ICF)
(n x n) ≈ (n x p) (p x n)
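A single-machine sketch of pivoted incomplete Cholesky factorization shows how an n x n positive semi-definite matrix is approximated by an n x p factor times its transpose. This is a generic textbook-style version written for illustration; it is not the parallel column-based ICF of PSVM, and the function name, tolerance, and toy matrix are assumptions.

```python
from math import sqrt


def incomplete_cholesky(K, p, tol=1e-12):
    """Return H (n x rank, rank <= p) such that K is approximately H * H^T."""
    n = len(K)
    H = [[0.0] * p for _ in range(n)]
    residual_diag = [K[i][i] for i in range(n)]  # diagonal of K - H * H^T
    rank = 0
    for k in range(p):
        # Pivot on the row with the largest remaining diagonal entry.
        pivot_row = max(range(n), key=lambda r: residual_diag[r])
        if residual_diag[pivot_row] <= tol:
            break  # K is already reproduced to within the tolerance
        pivot = sqrt(residual_diag[pivot_row])
        for r in range(n):
            dot = sum(H[r][c] * H[pivot_row][c] for c in range(k))
            H[r][k] = (K[r][pivot_row] - dot) / pivot
        for r in range(n):
            residual_diag[r] -= H[r][k] ** 2
        rank = k + 1
    return [row[:rank] for row in H]


# Toy kernel matrix of rank 2 (Gram matrix of the points (1,0), (0,1), (1,1)).
K = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
H = incomplete_cholesky(K, p=2)
approx = [[sum(a * b for a, b in zip(H[i], H[j])) for j in range(3)] for i in range(3)]
max_error = max(abs(K[i][j] - approx[i][j]) for i in range(3) for j in range(3))
print(max_error)
```

Because the toy matrix has rank 2, two columns reproduce it up to rounding; for a real kernel matrix, p much smaller than n trades accuracy for memory and time.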
PSVM
[Diagram: pipeline boxes Raw Data, Kernel Matrix, ICF, Matrix Multiplication; incremental boxes Incremental Data, Incremental Kernel Matrix, Incremental ICF, Incremental Matrix Multiplication, Incremental Linear System Solving]
Matrix Product
(p x n) (n x p) = (p x p)
PSVM [E. Chang, et al, NIPS 07]
• Column-based ICF
– Slower than row-based on single machine
– Parallelizable on multiple machines
• Changing IPM computation order to achieve parallelization
Overheads
Speedup
Scalability
• Computation
– Parallelization
– Approximation
• File Systems
– Latency
– Recovery
• Power Management
Sample Platforms
Sample Hierarchy
• Server
– 16GB DRAM; 160MB Flash; 5 x 1TB disk
• Rack
– 40 servers
– 48 port Gigabit Ethernet switch
• Warehouse
– 10,000 servers (250 racks)
– 2K port Gigabit Ethernet switch
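The per-server figures scale up to rack and warehouse totals by simple multiplication. The sketch below only does that arithmetic on the numbers from the slide; the function and key names are hypothetical, and the totals ignore replication and file-system overhead.

```python
def hierarchy_totals(servers_per_rack=40, racks=250, disks_per_server=5,
                     disk_tb=1, dram_gb_per_server=16):
    """Aggregate raw disk and DRAM capacity at each level of the hierarchy."""
    servers = servers_per_rack * racks
    disk_tb_per_server = disks_per_server * disk_tb
    return {
        "servers": servers,
        "rack_disk_tb": disk_tb_per_server * servers_per_rack,
        "warehouse_disk_tb": disk_tb_per_server * servers,
        "rack_dram_gb": dram_gb_per_server * servers_per_rack,
        "warehouse_dram_gb": dram_gb_per_server * servers,
    }


totals = hierarchy_totals()
print(totals)
```

With these inputs a rack holds 200 TB of raw disk and 640 GB of DRAM, and the warehouse holds 50,000 TB (about 50 PB) of raw disk.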
Storage --- One Server
Storage --- One Rack
Storage --- One Center
Google File System (GFS)
• Master manages metadata
• Data transfers happen directly between clients/chunkservers
• Files broken into chunks (typically 64 MB)
• Chunks triplicated across three machines for safety
• See SOSP '03 paper at http://labs.google.com/papers/gfs.html
[Diagram: GFS master with replicas, clients, and chunkservers holding chunks C0, C1, C3, C4, C5]
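The chunking and triplication figures translate into a quick capacity estimate. This is back-of-the-envelope arithmetic only: the function name and the 1,000 MB example file are assumptions, and no GFS API is involved.

```python
from math import ceil

CHUNK_MB = 64   # typical chunk size from the slide
REPLICAS = 3    # each chunk is stored on three machines


def chunk_estimate(file_mb):
    """Estimate chunk count, replica count, and raw storage for one file."""
    chunks = ceil(file_mb / CHUNK_MB)
    return {
        "chunks": chunks,
        "chunk_replicas": chunks * REPLICAS,
        "raw_storage_mb": file_mb * REPLICAS,
    }


estimate = chunk_estimate(1000)
print(estimate)
```

A 1,000 MB file needs 16 chunks, so 48 chunk replicas are spread across chunkservers and roughly 3,000 MB of raw disk is consumed.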
Win in Scale
• Google Translate
• Voice
• Trend Prediction
– An example that benefits society
H1N1 United Nations Report
Concluding Remarks
• Search + Social
• Increasing quantity and complexity of data demand scalable solutions
• Have parallelized key subroutines for mining massive data sets
– Spectral Clustering [ECML 08]
– Frequent Itemset Mining [ACM RS 08]
– PLSA [KDD 08]
– LDA [WWW 09, AAIM 09]
– UserRank [Google TR 09]
– Support Vector Machines [NIPS 07]
• Launched Google Q&A (Confucius) in 60+ countries
• Relevant papers
– http://infolab.stanford.edu/~echang/
• Open Source PSVM, PLDA
– http://code.google.com/p/psvm/
– http://code.google.com/p/plda/
Models of Innovation
• Ivory tower
– Only consider theory but not application
• Build it and they will come
– Scientists drive product development
• "Research for sale"
– Research funded by product groups or customers
• Research & development as equals
– Research "sells" innovation
– Product "requests" innovation
• Google-style innovation
[Diagram: relationship between Research (R) and Development (D) for each model]