
Page 1: Introduction to Machine Learning for Information Retrieval

Introduction to Machine Learning for Information Retrieval

Xiaolong Wang

Page 2: Introduction to Machine Learning for Information Retrieval

What is Machine Learning

• In short, tricks of maths
• Two major tasks:
  – Supervised Learning: a.k.a. regression, classification…
  – Unsupervised Learning: a.k.a. data manipulation, clustering…

Page 3: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Label: usually manually labeled
• Data: data representation, usually as a vector
• Prediction function: selecting, from a predefined family of functions, the one that gives the best prediction

[Figure: examples of classification and regression]

Page 4: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Two formulations:
  – F1: Given a set of (Xi, Yi), learn a function f
• Yi:
  – Binary: spam vs. non-spam
  – Numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1)
• Xi:
  – Number of words, occurrence of each word, …
• f:
  – Usually a linear function
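As a concrete sketch of F1 with a linear f (the features and weights below are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Hypothetical features: [num_words, count("free"), count("meeting")]
x_i = np.array([120.0, 4.0, 0.0])

# Hypothetical learned weights; positive score => predict spam
w = np.array([0.001, 0.9, -0.7])

score = w @ x_i                   # f(x) = w^T x
y_hat = 1 if score > 0 else -1    # binary decision: spam (+1) vs. non-spam (-1)
print(score, y_hat)
```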

Page 5: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Two formulations:
  – F2: Given a set of (Xi, Yi), learn a function f: X → Y such that f(Xi) ≈ Yi
• Yi: more complex label than binary or numeric
  – Multiclass learning: entertainment vs. sports vs. politics…
  – Structural learning: syntactic parsing

[Figure: f maps X to Y; this formulation is more general than F1]

Page 6: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Training
  – Optimization
• Loss: difference b/w true label Yi and predicted label wᵀXi
  – Squared loss (regression): (Yi − wᵀXi)²
  – Hinge loss (classification): max(0, 1 − Yi·wᵀXi)
  – Logistic loss (classification): log(1 + exp(−Yi·wᵀXi))
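A minimal sketch of these three losses in Python, assuming Yi ∈ {−1, +1} for the two classification losses as the formulas above require:

```python
import numpy as np

def squared_loss(y, score):
    """Squared loss for regression: (y - w^T x)^2."""
    return (y - score) ** 2

def hinge_loss(y, score):
    """Hinge loss for classification, y in {-1, +1}: max(0, 1 - y * w^T x)."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    """Logistic loss for classification, y in {-1, +1}: log(1 + exp(-y * w^T x))."""
    return np.log1p(np.exp(-y * score))

w, x = np.array([0.5, -1.0]), np.array([2.0, 1.0])
score = w @ x  # predicted value w^T x = 0.0
print(squared_loss(1.0, score), hinge_loss(1, score), logistic_loss(1, score))
```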

Page 7: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Training
  – Optimization
• Regularization: penalize complex models, e.g. by adding λ||w||² to the loss

Without regularization: overfitting

Page 8: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Training
  – Optimization
• Regularization:

Large margin ⇔ small ||w|| (the margin is proportional to 1/||w||)
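Putting loss and regularizer together gives the usual regularized training objective; a minimal sketch, assuming an L2 penalty on the hinge loss (the λ value is an illustrative choice):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """L2-regularized hinge loss: (1/n) * sum max(0, 1 - y_i w^T x_i) + lam * ||w||^2.

    Minimizing this trades off fitting the data (small loss) against
    model complexity (small ||w||, i.e. large margin).
    """
    margins = 1.0 - y * (X @ w)
    hinge = np.maximum(0.0, margins).mean()
    return hinge + lam * np.dot(w, w)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
print(svm_objective(np.array([0.5, 0.5]), X, y, lam=0.1))
```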

Page 9: Introduction to Machine Learning for Information Retrieval

Supervised Learning

• Optimization: the art of maximization
  • Unconstrained:
    – First order: gradient descent
    – Second order: Newton's method
    – Stochastic: stochastic gradient descent (SGD); see the sketch below
  • Constrained:
    – Active set method
    – Interior point method
    – Alternating Direction Method of Multipliers (ADMM)
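A minimal SGD sketch for the squared loss from Page 6 (the learning rate and epoch count are illustrative choices):

```python
import numpy as np

def sgd_squared_loss(X, y, lr=0.01, epochs=100, seed=0):
    """Stochastic gradient descent on (1/n) * sum (y_i - w^T x_i)^2.

    Each step uses the gradient of a single example:
    grad_i = -2 * (y_i - w^T x_i) * x_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            residual = y[i] - w @ X[i]
            w += lr * 2.0 * residual * X[i]
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(sgd_squared_loss(X, y))  # approaches the least-squares solution [1, 2]
```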

Page 10: Introduction to Machine Learning for Information Retrieval

Unsupervised Learning

• Clustering and dimensionality reduction:
  – PCA
  – k-means
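A minimal k-means sketch (a standard formulation, not code from the slides; k and the iteration count are illustrative):

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Plain k-means: alternate assigning points to the nearest
    centroid and recomputing each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, k=2))
```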

Page 11: Introduction to Machine Learning for Information Retrieval

Machine Learning for Information Retrieval

• Learning to Rank
• Topic Modeling

Page 12: Introduction to Machine Learning for Information Retrieval

Learning to Rank

http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf

Page 13: Introduction to Machine Learning for Information Retrieval

Learning to Rank

• X = (q, d)
  – Features: e.g. matching between query and document
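A sketch of what such query–document matching features might look like (these particular features are illustrative assumptions, not taken from the slides):

```python
def matching_features(query, doc):
    """Toy query-document features: term overlap counts and ratios."""
    q_terms, d_terms = query.lower().split(), doc.lower().split()
    overlap = sum(1 for t in q_terms if t in d_terms)
    return [
        overlap,                          # number of query terms in the doc
        overlap / max(len(q_terms), 1),   # fraction of query covered
        len(d_terms),                     # document length
    ]

print(matching_features("machine learning", "machine learning for information retrieval"))
```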

Page 14: Introduction to Machine Learning for Information Retrieval

Learning to Rank

Page 15: Introduction to Machine Learning for Information Retrieval

Learning to Rank

• Labels:
  – Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1
  – Pairwise: doc A > doc B, doc C > doc D
  – Listwise: permutation
• Acquisition:
  – Expert annotation
  – Clickthrough: a clicked result is preferred over the results skipped above it

Page 16: Introduction to Machine Learning for Information Retrieval

Learning to Rank

Page 17: Introduction to Machine Learning for Information Retrieval

Learning to Rank

• Prediction function:
  – Extract Xq,d from (q, d)
  – Rank documents by sorting wᵀXq,d
• Loss function:
  – Pointwise
  – Pairwise
  – Listwise
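A minimal sketch of the prediction step: score each candidate document with wᵀXq,d and sort (the weights and feature values are made up for illustration):

```python
import numpy as np

def rank(w, docs_features):
    """Return document indices ordered by descending score w^T x."""
    scores = docs_features @ w
    return np.argsort(-scores), scores

w = np.array([1.0, 0.5])                             # hypothetical learned weights
X = np.array([[2.0, 0.0], [1.0, 3.0], [0.0, 1.0]])   # one feature row per doc
order, scores = rank(w, X)
print(order, scores)  # doc 1 (score 2.5) first, then doc 0 (2.0), doc 2 (0.5)
```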

Page 18: Introduction to Machine Learning for Information Retrieval

Learning to Rank

• Pointwise:
  – Regression: squared loss
• Pairwise:
  – Classification: (q, d1) > (q, d2) => positive example Xq,d1 − Xq,d2
• Listwise:
  – Optimization: NDCG@j

  NDCG@j = N_j · Σ_{i=1..j} r_i / log₂(i + 1)

  where r_i is the relevance (0/1) of the document at rank i, 1/log₂(i + 1) is the discount of rank i, the sum over ranks is the cumulative gain, and N_j normalizes by the ideal ranking's cumulative gain so that a perfect ranking scores 1.
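A sketch of NDCG@j with the slide's 0/1 relevance and log discount (one common variant; other definitions use the gain 2^r − 1 instead of r):

```python
import numpy as np

def dcg_at_j(relevances, j):
    """Cumulative gain with log2 discount: sum_{i=1..j} r_i / log2(i + 1)."""
    r = np.asarray(relevances[:j], dtype=float)
    ranks = np.arange(1, len(r) + 1)
    return float(np.sum(r / np.log2(ranks + 1)))

def ndcg_at_j(relevances, j):
    """DCG normalized by the ideal (sorted) ranking, so a perfect ranking scores 1."""
    ideal_dcg = dcg_at_j(sorted(relevances, reverse=True), j)
    return dcg_at_j(relevances, j) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_j([1, 0, 1, 0], j=3))  # relevant docs at ranks 1 and 3 => ~0.92
```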

Page 19: Introduction to Machine Learning for Information Retrieval

Topic Modeling

• Topic Modeling
  – Factorization of the words × documents matrix
• Clustering of documents
  – Project documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics)
• What is a topic?
  – A linear combination of words
  – Nonnegative weights that sum to 1 => a probability distribution

Page 20: Introduction to Machine Learning for Information Retrieval

Topic Modeling

• Generative models: story-telling
  – Latent Semantic Analysis (LSA)
  – Probabilistic Latent Semantic Analysis (PLSA)
  – Latent Dirichlet Allocation (LDA)

Page 21: Introduction to Machine Learning for Information Retrieval

Topic Modeling

• Latent Semantic Analysis (LSA):
  – Deerwester et al. (1990)
  – Singular Value Decomposition (SVD) applied to the words × documents matrix
  – How to interpret negative values?
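A minimal LSA sketch via truncated SVD (the tiny words × documents matrix is made up for illustration):

```python
import numpy as np

# Toy words x documents count matrix: rows = terms, columns = documents
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                      # number of latent "topics" to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # each document in k-dim topic space

print(doc_vectors)  # entries can be negative, hence the interpretation issue
```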

Page 22: Introduction to Machine Learning for Information Retrieval

Topic Modeling

• Probabilistic Latent Semantic Analysis (PLSA):
  – Thomas Hofmann (1999)
  – How words/documents are generated (as described by probability)

[Figure: generated (document, word) pairs, e.g. (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip), …; the words × documents probability matrix factors into (words × topics) times (topics × documents)]

Maximum likelihood: maximize Σ_{d,w} n(d, w) · log Σ_z P(w | z) P(z | d), where n(d, w) is the count of word w in document d and z ranges over topics.
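A sketch of the EM updates that maximize this likelihood (a standard PLSA formulation under the factorization P(w|d) = Σ_z P(w|z)P(z|d); matrix sizes and iteration count are illustrative):

```python
import numpy as np

def plsa(counts, n_topics=2, iters=50, seed=0):
    """EM for PLSA on a words x documents count matrix.

    E-step: P(z | d, w) ∝ P(w | z) P(z | d)
    M-step: P(w | z) ∝ Σ_d n(d, w) P(z | d, w)
            P(z | d) ∝ Σ_w n(d, w) P(z | d, w)
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    p_w_z = rng.random((n_words, n_topics))    # P(w | z)
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, n_docs))     # P(z | d)
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)

    for _ in range(iters):
        # E-step: responsibilities, shape (words, topics, docs)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        weighted = joint * counts[:, None, :]  # n(d, w) * P(z | d, w)
        # M-step: re-normalize the expected counts
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d

counts = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 4.0]])  # toy 3 words x 2 docs
p_w_z, p_z_d = plsa(counts)
print(p_w_z, p_z_d, sep="\n")
```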

Page 23: Introduction to Machine Learning for Information Retrieval

Topic Modeling

• Latent Dirichlet Allocation (LDA)
  – David Blei et al. (2003)
  – PLSA with a Dirichlet prior

• What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian? Tossing a coin:

  posterior ∝ likelihood × prior:  p(r | data) ∝ p(data | r) · g(r)

  where r is the parameter to be estimated, g(r) is the prior, p(data | r) is the likelihood, and p(r | data) is the posterior probability.

• Canonical maximum likelihood (frequentist) is a special case of Bayesian maximum a posteriori (MAP) when g(r) is a uniform prior
• Bayesian as an inference method:
  – Estimate r: posterior mean, or MAP
  – Estimate the probability that a new toss is heads: ∫ r · p(r | data) dr, the posterior mean
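A small worked sketch of the coin-toss example with the conjugate Beta prior (the prior parameters and toss counts are illustrative):

```python
# Coin tossing with a conjugate Beta(alpha, beta) prior on r = P(heads).
# The posterior after h heads and t tails is Beta(alpha + h, beta + t).
alpha, beta = 1.0, 1.0      # uniform prior: MAP then matches maximum likelihood
h, t = 7, 3                 # observed tosses

posterior_mean = (alpha + h) / (alpha + beta + h + t)   # also P(next toss = heads)
map_estimate = (alpha + h - 1) / (alpha + beta + h + t - 2)
mle = h / (h + t)           # frequentist maximum likelihood

print(posterior_mean, map_estimate, mle)  # 0.666..., 0.7, 0.7
```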

Page 24: Introduction to Machine Learning for Information Retrieval

Topic Modeling

• Latent Dirichlet Allocation (LDA)
  – David Blei et al. (2003)
  – PLSA with a Dirichlet prior

• What additional info do we know about the parameters?
  – Sparsity:
    • each topic has nonzero probability on few words;
    • each document has nonzero probability on few topics;

The Dirichlet distribution defines a probability on the simplex.

[Figure: the words × documents matrix factors into (words × topics) times (topics × documents), as in PLSA]

The parameters of a multinomial are nonnegative and sum to 1, i.e. they lie on a simplex; a Dirichlet prior over that simplex can encourage sparsity.
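A quick sketch of how a Dirichlet with concentration below 1 encourages sparsity (the dimension and α values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each sample is a point on the 10-dimensional probability simplex
# (nonnegative, sums to 1), i.e. a valid multinomial parameter.
sparse = rng.dirichlet(alpha=[0.1] * 10)   # concentration < 1: mass on few entries
dense = rng.dirichlet(alpha=[10.0] * 10)   # concentration > 1: mass spread evenly

print(np.round(sparse, 3))  # a few large entries, most near zero
print(np.round(dense, 3))   # entries all close to 0.1
```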