46
Machine Learning Meets Web Search Hang Li Microsoft Research Asia MLA’08 Nanjing Nov. 7, 2008

Machine Learning Meets Web Search - Home - LAMDA

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Machine Learning Meets Web Search - Home - LAMDA

Machine Learning Meets Web Search

Hang Li

Microsoft Research Asia

MLA’08 Nanjing

Nov. 7, 2008

Page 2: Machine Learning Meets Web Search - Home - LAMDA

Web Search is Part of Our Life

Page 3: Machine Learning Meets Web Search - Home - LAMDA

Web Users Heavily Rely on Search Engineshttp://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf

Page 4: Machine Learning Meets Web Search - Home - LAMDA

Physically Search System is Data Center

Page 5: Machine Learning Meets Web Search - Home - LAMDA

Statistical Learning

Large Scale Distributed

Computing

Advanced Web Search Technologies

Are Used…

Page 6: Machine Learning Meets Web Search - Home - LAMDA

Overview on Web Search System

Ranker

Crawler Indexer

Index

User

Interface

Web

User

Relevance ranking

Learning to Rank

Document understanding

Query understanding

Search result presentation

Search log data mining

Importance ranking

Anti-spam

Page 7: Machine Learning Meets Web Search - Home - LAMDA

Component Technologies for Web Search

• Relevance Ranking

• Importance Ranking

• Document Understanding

• Query Understanding

• Crawling

• Indexing

• Search Result Presentation

• Anti-Spam

• Learning to Rank

• Search Log Data Mining

• Evaluation and User Study

Page 8: Machine Learning Meets Web Search - Home - LAMDA

Statistical Learning Plays Key Role!

Page 9: Machine Learning Meets Web Search - Home - LAMDA

Talk Outline

• Statistical Learning is Important for Web Search

• Statistical Learning in Web Search

– Importance Ranking: BrowseRank

– Anti-Spam: Temporal Classifier

– Query Understanding: CRF for Query Reformulation

– Web Page Understanding: HyperText Topic Model

– Result Presentation: Context Aware Query Suggestion

– Learning to Rank: Listwise Approach and Global

Ranking

Page 10: Machine Learning Meets Web Search - Home - LAMDA

Importance Ranking:

BrowseRank

(SIGIR 2008 Best Student Paper)

Page 11: Machine Learning Meets Web Search - Home - LAMDA

General Model for Importance Ranking

q

),()(

),()(

),()(

22

11

nn dqfdg

dqfdg

dqfdg

query (or question)

documents

importance and relevance

scores for ranking

)(

),(

dg

dqf

ndddD ,,, 21

Page 12: Machine Learning Meets Web Search - Home - LAMDA

Page Rank

ndL

dPdP

ij dMd j

j

i

1)1(

)(

)()(

)(

Page 13: Machine Learning Meets Web Search - Home - LAMDA

Building User Browsing Graph

Vertex: Web page

Edge: Transition

Edge weight wij:

Number of

transitions

Staying time Ti:

Time spend on

page i

Reset probability :

Normalized frequencies

as first page of session

A directed graph with rich meta

data.

Vertex weight Ci:

Number of visit

for page i

Page 14: Machine Learning Meets Web Search - Home - LAMDA

BrowseRank: Continuous-time

Markov Model

Hard

Calculating

Computing the stationary

distribution of a discrete-

time Markov chain (called

embedded Markov chain)

Estimating

staying time

distribution of

each state

Page 15: Machine Learning Meets Web Search - Home - LAMDA

Anti-Spam:

Temporal Classifier

(ICDM 2006)

Page 16: Machine Learning Meets Web Search - Home - LAMDA

Anti-Spam

• Spam

– Manipulate relevance and importance

– Not ethical, if to be ranked higher beyond real value

– “Cheating” search engines

• Spam Type

– Content Spam

– Link Spam

– Cloaking

Page 17: Machine Learning Meets Web Search - Home - LAMDA

Link Spam

• Link from Blog, Forum, etc

• Link Exchange

• Link Farm

Page 18: Machine Learning Meets Web Search - Home - LAMDA

Inlink May Increase Drastically at Spam Page

timet0 t1

Page 19: Machine Learning Meets Web Search - Home - LAMDA

Detection Using Temporal Information

Page 20: Machine Learning Meets Web Search - Home - LAMDA

Web Page Understanding:

Topic Model for Hypertext

(EMNLP 2008)

Page 21: Machine Learning Meets Web Search - Home - LAMDA

Web Page Understanding

• Web Information Extraction

– Block Analysis

– Metadata Extraction (Title, Date, etc)

– Text Information Extraction

– Wrapper Generation

• Web Page Classification

– Based on Semantics

– Based on Type (Homepage, Spec, etc)

Page 22: Machine Learning Meets Web Search - Home - LAMDA

0

0.5

1

Topic Dist.

0

0.5

1

Topic Dist.

0

1

Topic Dist.

0

1

1 2 3 4

Word Dist.

Topic1

Web Page Web Page Web Page

0

1

1 2 3 4

Word Dist.

Topic2

LDA (Latent Dirichlet Allocation)

Page 23: Machine Learning Meets Web Search - Home - LAMDA

Citing Page

Cited PageCited Page Cited Page

0

0.5

1

Topic Dist.

0

0.5

1

Topic Dist.

0

0.5

1

Topic Dist.

0

1

Topic Dist.

0

1

1 2 3 4

Word Dist.

Topic

HTM: Topic Model for Hyper Text

Page 24: Machine Learning Meets Web Search - Home - LAMDA

LDA vs HTM

Page 25: Machine Learning Meets Web Search - Home - LAMDA

Query Understanding:

Query Refinement

(SIGIR 2008)

Page 26: Machine Learning Meets Web Search - Home - LAMDA

Query Understanding

• Spelling Error Correction

• Query Refinement

– E.g., “ny times” “new york times”

• Query Classification

– Based on Semantics (Sport, etc)

– Based on Type (Name Query, etc)

• Query Segmentation

– E.g., “harry porter book” “[harry porter] book”

Page 27: Machine Learning Meets Web Search - Home - LAMDA

Mismatching between Query Term

and Document Term

I want to search “myspace” myspace

space

my

my space

27

Page 28: Machine Learning Meets Web Search - Home - LAMDA

Understanding the Intent and

Solving the Mismatch

I want to search “myspace” myspace

space

my

my space myspaceQuery

Understan

ding

28

Page 29: Machine Learning Meets Web Search - Home - LAMDA

Structured Prediction Problem

windows onecare

window onecar

Observed “noisy” word

sequence

“Ideal” word sequence

original query

word

sequence

“ideal” query

word

sequence

Page 30: Machine Learning Meets Web Search - Home - LAMDA

Conditional Random Fields for Query

Refinement

Introducing Refinement Operations

oi-1 oi oi+1

yi-1 yi yi+1

xi-1 xi xi+1

OperationsSpelling: insertion, deletion, substitution, transposition, …

Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, …

Page 31: Machine Learning Meets Web Search - Home - LAMDA

Result Presentation:

Query Suggestion

(KDD 2008 Best Application Paper)

Page 32: Machine Learning Meets Web Search - Home - LAMDA

Result Presentation

• Snippet Generation

• Query Suggestion

• Result Clustering and Classification

Page 33: Machine Learning Meets Web Search - Home - LAMDA

Query Suggestion

Page 34: Machine Learning Meets Web Search - Home - LAMDA

Search Intent and Context

• Suppose a user raises a query “gladiator”

• If we know the user raises query “beautiful mind” before “gladiator”• User is likely to be interested in the film

• User is likely to be searching the films played by Russell Crowe.

History?

People? Film?

Page 35: Machine Learning Meets Web Search - Home - LAMDA

Framework

• Offline part: model learning– Summarizing queries into concepts by clustering click-through

bipartite

– Mining frequent patterns from session data and building a concept sequence suffix tree

• Online part: query suggestion

Page 36: Machine Learning Meets Web Search - Home - LAMDA

Learning to Rank:

ListWise and Global Ranking

(ICML 2008, NIPS 2008, etc)

Page 37: Machine Learning Meets Web Search - Home - LAMDA

Learning to Rank

11 ,1,1

2,12,1

1,11,1

1

nn yd

yd

yd

q

mm nmnm

mm

mm

m

yd

yd

yd

q

,,

2,2,

1,1,

Learning

System

Ranking

System

1mq

),(

),(

),(

11 ,11,1

2,112,1

1,111,1

mm nmmnm

mmm

mmm

dqfd

dqfd

dqfd

),( dqf

Page 38: Machine Learning Meets Web Search - Home - LAMDA

Data Labeling Learning

)(xfy

Feature

Extraction

1,1

2,1

1,1

1

nd

d

d

q

mnm

m

m

m

d

d

d

q

,

2,

1,

11 ,1,1

2,12,1

1,11,1

1

nn yd

yd

yd

q

mm nmnm

mm

mm

m

yd

yd

yd

q

,,

2,2,

1,1,

11 ,1,1

2,12,1

1,11,1

nn yx

yx

yx

mm nmnm

mm

mm

yx

yx

yx

,,

2,2,

1,1,

Learning Process

Page 39: Machine Learning Meets Web Search - Home - LAMDA

Previous Approach: Pairwise

Approach

• Transforming ranking to classification

– Ranking SVM (Herbrich et al, 2000)

– RankBoost (Freund et al, 2003)

– Ranknet (Burges et al, 2005)

– IR-SVM (Cao et al, 2005)

– Frank (Tsai et al, 2006)

Page 40: Machine Learning Meets Web Search - Home - LAMDA

Our Proposal = Listwise Approach

• Probabilistic Model– ListNet (Cao, et al 2007)

– ListMLE (Xia, et al 2008)

• Optimizing Upper Bounds of IR Measures– AdaRank (Xu & Li, 2007)

– SVM-MAP (Yue, et al, 2007)

– PermuRank (Xu, et al 2008)

• Approximation of IR Measures– SoftRank (Taylor 2007)

– ApproxRank (Qin, et al, to appear)

Page 41: Machine Learning Meets Web Search - Home - LAMDA

Global Ranking Problem

q

query (or question)

retrieved documents

Page 42: Machine Learning Meets Web Search - Home - LAMDA

Global Ranking Using

Continuous CRF

Page 43: Machine Learning Meets Web Search - Home - LAMDA

Summary

• Statistical Learning is Important for Web Search

• Statistical Learning in Web Search

– Importance Ranking: BrowseRank

– Anti-Spam: Temporal Classifier

– Query Understanding: CRF for Query Reformulation

– Web Page Understanding: HyperText Topic Model

– Result Presentation: Context Aware Query Suggestion

– Learning to Rank: Listwise Approach and Global

Ranking

Page 44: Machine Learning Meets Web Search - Home - LAMDA

References• Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He,

Hang Li, BrowseRank: Letting Users Vote for Page Importance, Proc. of

SIGIR 2008, 451-458. SIGIR’08 Best Student Paper Award.

• Jiafeng Guo, Gu Xu, Hang Li, Xueqi Cheng, A Unified and Discriminative

Model for Query Refinement, Proc. of SIGIR 2008, 379-386.

• Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhohng Chen,

Hang Li, Context-Aware Query Suggestion by Mining Click-Through and

Session Data, Proc. of KDD 2008, 875-883, SIGKDD’08 Best Application

Paper Award.

• Guoyang Shen, Bin Gao, Tie-Yan Liu, Guang Feng, Shiji Song, and Hang Li,

Detecting Link Spam using Temporal Information, Prof. of ICDM-2006, 1049-

1053.

• Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, Hang Li, Listwise

Approach to Learning to Rank –Theory and Algorithm, Proc. of ICML 2008,

1192-1199.

• Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, De-Sheng Wang, Hang Li, Global

Ranking Using Continuous Conditional Random Fields, Proc. of NIPS 2008,

to appear.

Page 45: Machine Learning Meets Web Search - Home - LAMDA

Pao-Lu Hsu Lecture on Statistical

Machine Learning and Applications

Speaker: Prof. Trevor Hastie

Talk: Regularization Paths

Time: 10:00am-12:00am, December 3, 2008

Place: Peking University Hall (北大百年讲堂会议厅)

Web Site: http://iria.pku.edu.cn/PMSIT/

Contact: [email protected]

Page 46: Machine Learning Meets Web Search - Home - LAMDA

Thank You!

[email protected]