Machine learning for the Web: Applications and challenges Soumen Chakrabarti Center for Intelligent Internet Research Computer Science and Engineering

Machine learning for the Web: Applications and challenges

Soumen Chakrabarti

Center for Intelligent Internet Research, Computer Science and Engineering

IIT Bombay

http://www.cse.iitb.ernet.in/~soumen

2

Traditional supervised learning

Training instance: $(x_1, x_2, \ldots, x_n;\ y)$
Test instance: $(x_1, x_2, \ldots, x_n)$
Independent variables $x$: mostly continuous, maybe categorical
Predicted variable $y$: discrete (classification) or continuous (regression)

[Diagram: training instances → Learner → statistical models, inference rules, or separators; test instance → Learner → prediction $y$]

3

Traditional unsupervised learning

No training / testing phases
Input is a collection of records with independent attributes alone: $(x_1, x_2, \ldots, x_n)$
Measure of similarity
Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity”
Hierarchical partitions

[Diagram: clusters with large self-similarity and small cross-similarity]
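As a rough illustration of the two quantities named above, the sketch below computes the average pairwise similarity inside one cluster and across two clusters. Cosine similarity, the function names, and the toy vectors are assumptions made for this example; the slide does not fix a particular similarity measure.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_similarity(cluster):
    """Average similarity over all pairs inside one cluster (needs >= 2 points)."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def cross_similarity(cluster_1, cluster_2):
    """Average similarity over all pairs drawn from two different clusters."""
    total = sum(cosine(u, v) for u in cluster_1 for v in cluster_2)
    return total / (len(cluster_1) * len(cluster_2))

# Toy data (assumed): two well-separated groups of 2-d points.
cluster_a = [(1.0, 0.1), (0.9, 0.2)]
cluster_b = [(0.1, 1.0), (0.2, 0.8)]
```

A good partition in the sense of the slide is one where `self_similarity` is large for each cluster and `cross_similarity` is small for each pair of clusters.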

4

Learning hypertext models

Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
Transformed into simple models and relations
• Vector space/bag-of-words: occurs(term, page, cnt)
• Hyperlink graph: cites(page, page)
• Topic directories: is-a(topic, topic), example(topic, page)
• Discrete time series
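To make the relational view concrete, the sketch below turns raw page text into occurs(term, page, cnt) tuples and a list of links into cites(page, page) tuples. The page identifiers, the lowercase alphanumeric tokenizer, and the function names are assumptions made for this example, not part of the talk.

```python
import re
from collections import Counter

def occurs_relation(pages):
    """Build (term, page, cnt) tuples from a {page_id: text} mapping."""
    rows = []
    for page_id, text in pages.items():
        # Assumed tokenizer: lowercase runs of letters and digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for term, cnt in sorted(counts.items()):
            rows.append((term, page_id, cnt))
    return rows

def cites_relation(links):
    """Return deduplicated (source_page, target_page) tuples, sorted."""
    return sorted(set(links))

# Toy input (assumed): two pages and one repeated hyperlink.
pages = {"p1": "Web mining and Web search", "p2": "search engines"}
occurs = occurs_relation(pages)
cites = cites_relation([("p1", "p2"), ("p1", "p2")])
```

Each `occurs` row is one cell of the bag-of-words matrix, so word order and markup are already gone at this point.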

5

Challenges

Large feature space in raw data
• Structured data sets: 10s to 100s of features
• Text (Web): 50 to 100 thousand features
Most features not completely useless
• Feature elimination / selection not perfect
• Beyond linear transformations?
Models used today are simplistic
• Good accuracy on simple labeling tasks
• Lose a lot of detail present in hypertext to fit known learning techniques

6

Challenges

Complex, interrelated objects
• Not a structured tuple-like entity
• Explicit and implicit connections
  • Document markup sub-structure
  • Site boundaries and hyperlinks
  • Placement in popular directories like Yahoo!
Traditional distance measures are noisy
• How to combine diverse features? (Or, a link is worth a ? words)
• Unreliable clustering results

7

This session

Semi-supervised clustering (Rich Caruana)
• Enhanced clustering via user feedback
Kernel methods (Nello Cristianini)
• Modular learning systems for text and hypertext
Reference matching (Andrew McCallum)
• Recovering and cleaning implicit citation graphs from unstructured data

8

This talk: Two examples

Learning topics of hypertext documents
• Semi-supervised learning scenario
• Unified model of text and hyperlinks
• Enhanced accuracy of topic labeling
Segmenting hierarchical tagged pages
• Topic distillation (hubs and authorities)
• Minimum description length segmentation
• Better focused topic distillation
• Extract relevant fragments from pages

9

Classifying interconnected entities

Early examples:
• Some diseases have complex lineage dependency
• Robust edge detection in images
How are topics interconnected in hypertext?
Maximum likelihood graph labeling with many classes

[Figure: finding edge pixels in a differentiated image]
[Figure: graph with unlabeled (“?”) nodes; example class probabilities .3 red / .7 blue, with values 0.6, 0.4, 0.3, 0.7]

10

Naïve Bayes classifiers

Decide topic; topic $c$ is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
Each $c$ has parameters $\theta(c,t)$ for terms $t$
Coin with face probabilities $\theta(c,t)$; $\sum_t \theta(c,t) = 1$
Fix document length $n(d)$ and toss the coin $n(d)$ times
Naïve yet effective; can use other algorithms
Given $c$, probability of document is

$$\Pr[d \mid n(d), c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
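The multinomial document model above can be sketched in a few lines. This is a minimal illustration, assuming the parameters are already estimated and strictly positive for every term in the document (no smoothing is shown); the class names, term probabilities, and function names are made up for the example.

```python
import math
from collections import Counter

def log_multinomial_coeff(counts):
    """log of n(d)! / prod_t n(d,t)!, computed with lgamma for stability."""
    n = sum(counts.values())
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts.values())

def log_prob_doc(counts, theta_c):
    """log Pr[d | n(d), c] for term counts n(d,t) and class parameters theta(c,t)."""
    return log_multinomial_coeff(counts) + sum(
        cnt * math.log(theta_c[term]) for term, cnt in counts.items()
    )

def classify(counts, prior, theta):
    """Pick the class c maximizing log prior(c) + log Pr[d | n(d), c]."""
    return max(
        prior,
        key=lambda c: math.log(prior[c]) + log_prob_doc(counts, theta[c]),
    )

# Toy parameters (assumed): two classes over a three-term vocabulary.
prior = {"sports": 0.5, "tech": 0.5}
theta = {
    "sports": {"ball": 0.7, "chip": 0.1, "game": 0.2},
    "tech": {"ball": 0.1, "chip": 0.7, "game": 0.2},
}
doc = Counter({"chip": 2, "game": 1})
label = classify(doc, prior, theta)
```

Working in log space avoids underflow on long documents. The multinomial coefficient is the same for every class, so it does not affect which class wins.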

11

Enhanced models for hypertext

c = class, d = text, N = neighbors
Text-only model: Pr(d | c)
Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
Better recursive model: Pr(d, c(N) | c)
Relaxation labeling over Markov random fields
Or, EM formulation

[Diagram: page with unknown topic (“?”) linked to its neighbors]
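One simple way to picture the recursive model is an iterative update in which each page's class belief combines its own text likelihood with the current beliefs of its neighbors. The sketch below is a simplified variant written for illustration, not the speaker's exact algorithm: the coupling table `coupling[c][c2]` (how likely a page of class c is to link to a page of class c2), the synchronous update, and all names and numbers are assumptions.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def relaxation_labeling(text_lik, neighbors, coupling, iters=10):
    """Iteratively re-estimate class beliefs from text and neighbor beliefs."""
    classes = list(coupling)
    # Start from text-only beliefs, i.e. Pr(d | c) normalized over classes.
    beliefs = {v: normalize(lik) for v, lik in text_lik.items()}
    for _ in range(iters):
        updated = {}
        for v, lik in text_lik.items():
            scores = {}
            for c in classes:
                s = lik[c]
                for u in neighbors.get(v, ()):
                    # Expected compatibility of class c with neighbor u's belief.
                    s *= sum(coupling[c][cu] * beliefs[u][cu] for cu in classes)
                scores[c] = s
            updated[v] = normalize(scores)
        beliefs = updated
    return beliefs

# Toy graph (assumed): page "x" has ambiguous text but two clearly "red" neighbors.
text_lik = {
    "a": {"red": 0.9, "blue": 0.1},
    "b": {"red": 0.9, "blue": 0.1},
    "x": {"red": 0.45, "blue": 0.55},
}
neighbors = {"x": ["a", "b"], "a": ["x"], "b": ["x"]}
coupling = {
    "red": {"red": 0.8, "blue": 0.2},
    "blue": {"red": 0.2, "blue": 0.8},
}
beliefs = relaxation_labeling(text_lik, neighbors, coupling)
```

In this toy case the text alone would label "x" as blue, while the link-aware update moves it to red, which is the kind of correction the slide attributes to using neighbors' classes.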

12

Hyperlink modeling boosts accuracy

9600 patents from 12 classes marked by USPTO
Patents have text and prior art links
Expand test patent to include neighborhood
‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)

[Figure: % error (axis 0 to 40) vs. % neighborhood known (0 to 100) for Text, Link, and Text+Link]

13

Hyperlink Induced Topic Search

$a = E^T h$, $h = E a$
‘Hubs’ and ‘authorities’

[Diagram: query → keyword search engine → response → radius-1 expanded graph with hub (h) and authority (a) nodes]
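The two update equations can be iterated directly on an edge list. The sketch below is a minimal version under stated assumptions: the graph is given as (source, target) pairs, scores start at 1, and each vector is rescaled to sum to 1 after every round (the slide does not say which normalization is used). The node names are made up.

```python
def hits(edges, iters=50):
    """Iterate a = E^T h and h = E a on a list of (source, target) edges."""
    nodes = sorted({u for edge in edges for u in edge})
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # a = E^T h: a page's authority is the sum of hub scores pointing to it.
        auth = {u: 0.0 for u in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # h = E a: a page's hub score is the sum of authority scores it points to.
        hub = {u: 0.0 for u in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Assumed normalization: rescale each vector to sum to 1.
        auth_total = sum(auth.values())
        hub_total = sum(hub.values())
        auth = {u: s / auth_total for u, s in auth.items()}
        hub = {u: s / hub_total for u, s in hub.items()}
    return hub, auth

# Toy expanded graph (assumed): three hub pages and two authority pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```

Here "a1" ends up as the top authority because every hub cites it, and "h1" is the top hub because it cites both authorities.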

14

“Topic drift” and various fixes

Some hubs have ‘mixed’ content
Authority ‘leaks’ through mixed hubs from good to bad pages
Clever: match query with anchor text to favor some edges
B&H: eliminate outlier documents

[Diagram (B&H): vector-space document model with centroid (×) and cut-off radius]
[Diagram (Clever): query term inside an activation window; ‘thick’ links]

15

Document object model (DOM)

Hierarchical graph model for semi-structured data
Can extract reasonable DOM from HTML
A fine-grained view of the Web
Valuable because page boundaries are less meaningful now

<html>
  <head><title>Portals</title></head>
  <body>
    <ul>
      <li><a href="…">Yahoo</a></li>
      <li><a href="…">Lycos</a></li>
    </ul>
  </body>
</html>

[Diagram: DOM tree]
html
  head
    title
  body
    ul
      li
        a
      li
        a
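A tag tree like the one on this slide can be recovered with a small parser. The sketch below uses Python's standard `html.parser` module; the dictionary-based node layout and the helper names are assumptions made for illustration. It does not special-case void elements such as `<br>`, so it is only meant for tidy markup like the slide's example.

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Build a nested {"tag": ..., "children": [...]} tree from HTML tags."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

def outline(node, depth=0):
    """Return the tree as a list of indented tag names."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines

# The slide's example page; the href values are elided in the original.
html = ('<html><head><title>Portals</title></head><body><ul>'
        '<li><a href="…">Yahoo</a></li><li><a href="…">Lycos</a></li>'
        '</ul></body></html>')
builder = DomBuilder()
builder.feed(html)
tree_lines = outline(builder.root["children"][0])
```

Joining `tree_lines` with newlines gives the same html / head / body / ul / li / a outline as the slide's tree.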

16

A model for hub generation

Global hub score distribution $\Theta_0$ w.r.t. given query
Authors use DOM nodes to specialize $\Theta_0$ into local $\Theta_I$
At a certain ‘cut’ in the DOM tree, local distribution directly generates hub scores

[Diagram: global distribution, progressive ‘distortion’, model frontier, other pages]

17

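Optimizing a cost measure

Reference distribution $\Theta_0$; local distribution $\Theta_v$ at DOM node $v$, covering hub scores $H_v$
Data encoding cost is roughly $-\sum_{h \in H_v} \log \Pr(h \mid \Theta_v)$
Distribution distortion cost is a KL divergence between $\Theta_0$ and $\Theta_v$, which has a closed form for the Poisson distribution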
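The trade-off between the two costs can be sketched numerically. The code below is an illustration under several assumptions: hub scores are treated as integer counts drawn from a Poisson distribution, each distribution is summarized by its mean, costs are measured in nats, and the distortion term uses the standard identity KL(Poisson(a) || Poisson(b)) = a log(a/b) + b − a with the reference mean as the first argument. The direction of the divergence and all function names are choices made for this example, not taken from the talk.

```python
import math

def poisson_log_pmf(k, mu):
    """log Pr(k | mu) for a Poisson distribution with mean mu > 0."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def data_cost(hub_counts, mu_v):
    """Cost of encoding the hub counts under the local mean mu_v."""
    return -sum(poisson_log_pmf(k, mu_v) for k in hub_counts)

def poisson_kl(mu_a, mu_b):
    """KL(Poisson(mu_a) || Poisson(mu_b)), standard closed form."""
    return mu_a * math.log(mu_a / mu_b) + mu_b - mu_a

def total_cost(hub_counts, mu_0, mu_v):
    """Data encoding cost plus distortion of the local mean from the reference."""
    return data_cost(hub_counts, mu_v) + poisson_kl(mu_0, mu_v)

# Toy subtree (assumed): hub counts far above the reference mean.
hub_counts = [8, 9, 10, 9]
mu_0 = 2.0
keep_reference = total_cost(hub_counts, mu_0, mu_0)
specialize = total_cost(hub_counts, mu_0, 9.0)
```

In this toy case specializing the mean to 9 pays a distortion cost but saves far more on encoding the data, so the specialized model is cheaper overall; for counts close to the reference mean the comparison would go the other way.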

18

Modified topic distillation algorithm

Will this (non-linear) system converge?
Will segmentation help in reducing drift?

Initialize DOM graph
Let only root set authority scores be 1
Repeat until reasonable convergence:
    Authority-to-hub score propagation
    MDL-based hub score smoothing
    Hub-to-authority score propagation
    Normalization of authority scores
Segment and rank micro-hubs
Present annotated results

19

Convergence

28 queries used in Clever and by B&H
366k macro-pages, 10M micro-links
Rank converges within 15 iterations

[Figure: mean auth score change (log scale, 1.00E-07 to 1.00E-01) vs. iterations (0 to 10)]

20

Effect of micro-hub segmentation

‘Expanded’ implies authority diffusion arrested
As nodes outside rootset start participating in the distillation…
• #Expanded increases
• #Pruned decreases
Prevents authority leaks via mixed hubs

[Figure: smoothing statistics (0 to 2500) vs. iterations (1 to 9) for Expanded and Pruned]

21

Rank correlation with B&H

Positively correlated
Some negative deviations
Pseudo-authorities downgraded by our algorithm
These were earlier favored by mixed hubs

[Figure: our authority score (0 to 0.025) vs. authority score B&H (0 to 0.015); axes not to same scale]

22

Conclusion

Hypertext and the Web pose new modeling and algorithmic challenges
Locality exists in many guises
Diverse sources of information: text, links, markup, usage
Unifying models needed
Anecdotes suggest that synergy can be exploited
