Machine learning for the Web: Applications and challenges
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
IIT Bombay
www.cse.iitb.ernet.in/~soumen
2
Traditional supervised learning
Independent variables x mostly continuous, maybe categorical
Predicted variable y discrete (classification) or continuous (regression)
[Figure: a training instance $(x_1, x_2, \ldots, x_n; y)$ goes to the learner, which produces statistical models, inference rules, or separators; a test instance $(x_1, x_2, \ldots, x_n)$ goes to the learner, which outputs the prediction $y$]
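A minimal sketch of the train-then-predict flow, assuming a nearest-centroid learner and toy two-dimensional data; the slide does not name a learner, so `train`, `predict`, and the sample instances are illustrative only.

```python
from collections import defaultdict

def train(instances):
    """Learn one centroid per class from (x, y) training instances."""
    sums, counts = {}, defaultdict(int)
    for x, y in instances:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(model, x):
    """Return the class whose centroid is closest to test instance x."""
    def sq_dist(centroid):
        return sum((ci - xi) ** 2 for ci, xi in zip(centroid, x))
    return min(model, key=lambda y: sq_dist(model[y]))

# Hypothetical training instances (x_1, x_2; y)
training = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
            ((5.0, 5.0), "b"), ((4.8, 5.2), "b")]
model = train(training)
label = predict(model, (0.9, 1.1))  # test instance has no y
```

Here the "model" is just a table of centroids; any classifier with the same two-phase interface would fit the diagram.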
3
Traditional unsupervised learning
No training / testing phases
Input is a collection of records with independent attributes alone: $(x_1, x_2, \ldots, x_n)$
Measure of similarity
Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity”
Hierarchical partitions
[Figure labels: “Large self-similarity”, “Small cross-similarity”]
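One way to make “self-similarity” and “cross-similarity” concrete is to average pairwise similarity within and across clusters. The sketch below assumes cosine similarity and a hypothetical two-cluster partition; the slide does not commit to a particular measure.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_and_cross_similarity(clusters):
    """Average pairwise similarity within clusters and across clusters."""
    within, across = [], []
    labeled = [(i, x) for i, cluster in enumerate(clusters) for x in cluster]
    for (i, u), (j, v) in combinations(labeled, 2):
        (within if i == j else across).append(cosine(u, v))
    return sum(within) / len(within), sum(across) / len(across)

# Hypothetical partition of four records into two clusters
clusters = [[(1.0, 0.1), (0.9, 0.2)], [(0.1, 1.0), (0.2, 0.9)]]
self_sim, cross_sim = self_and_cross_similarity(clusters)
```

A good partition pushes `self_sim` up and `cross_sim` down; the noisier the similarity measure, the less that gap can be trusted.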
4
Learning hypertext models
Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
Transformed into simple models and relations
• Vector space/bag-of-words
• Hyperlink graph
• Topic directories
• Discrete time series
occurs(term, page, cnt)
cites(page, page)
is-a(topic, topic)
example(topic, page)
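The four relations can be held as plain tuples. The sketch below assumes whitespace tokenization and made-up page identifiers and topics; only the relation names and argument order come from the slide.

```python
from collections import Counter

def occurs(pages):
    """Yield occurs(term, page, cnt) tuples from raw page text."""
    for page, text in pages.items():
        for term, cnt in Counter(text.lower().split()).items():
            yield (term, page, cnt)

# Hypothetical pages, links, and topic directory entries
pages = {"p1": "web mining web search", "p2": "search engines"}
occurs_rel = sorted(occurs(pages))                  # bag-of-words model
cites_rel = [("p1", "p2")]                          # hyperlink graph
is_a_rel = [("web mining", "computer science")]     # topic directory edges
example_rel = [("web mining", "p1")]                # labeled example pages
```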
5
Challenges
Large feature space in raw data
• Structured data sets: 10s to 100s
• Text (Web): 50 to 100 thousand
Most features not completely useless
• Feature elimination / selection not perfect
• Beyond linear transformations?
Models used today are simplistic
• Good accuracy on simple labeling tasks
• Lose a lot of detail present in hypertext to fit known learning techniques
6
Challenges
Complex, interrelated objects
• Not a structured tuple-like entity
• Explicit and implicit connections
• Document markup sub-structure
• Site boundaries and hyperlinks
• Placement in popular directories like Yahoo!
Traditional distance measures are noisy
• How to combine diverse features? (Or, a link is worth a ? words)
• Unreliable clustering results
7
This session
Semi-supervised clustering (Rich Caruana)
• Enhanced clustering via user feedback
Kernel methods (Nello Cristianini)
• Modular learning systems for text and hypertext
Reference matching (Andrew McCallum)
• Recovering and cleaning implicit citation graphs from unstructured data
8
This talk: Two examples
Learning topics of hypertext documents
• Semi-supervised learning scenario
• Unified model of text and hyperlinks
• Enhanced accuracy of topic labeling
Segmenting hierarchical tagged pages
• Topic distillation (hubs and authorities)
• Minimum description length segmentation
• Better focused topic distillation
• Extract relevant fragments from pages
9
Classifying interconnected entities
Early examples:
• Some diseases have complex lineage dependency
• Robust edge detection in images
How are topics interconnected in hypertext?
Maximum likelihood graph labeling with many classes
[Figure: finding edge pixels in a differentiated image; a graph of nodes with unknown labels (“?”), annotated with the values .3 red / .7 blue and the table 0.6 0.4 / 0.3 0.7]
10
Naïve Bayes classifiers
Decide topic; topic $c$ is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
Each $c$ has parameters $\theta(c,t)$ for terms $t$
Coin with face probabilities $\theta(c,t)$; $\sum_t \theta(c,t) = 1$
Fix document length $n(d)$ and toss coin
Naïve yet effective; can use other algos
Given $c$, probability of document is
$$\Pr[d \mid c, n(d)] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
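The generative story above (pick a class by its prior, then toss a term “coin” n(d) times) gives a multinomial document probability. A minimal sketch in log space follows, assuming every term in the document has a non-zero probability under each class (no smoothing) and using hypothetical classes and parameters.

```python
import math
from collections import Counter

def log_multinomial(counts):
    """log of n(d)! / prod_t n(d,t)! computed with lgamma."""
    n = sum(counts.values())
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts.values())

def log_prob_doc(doc_terms, term_probs):
    """log Pr[d | c, n(d)] for one class's term probabilities."""
    counts = Counter(doc_terms)
    return log_multinomial(counts) + sum(
        k * math.log(term_probs[t]) for t, k in counts.items())

def classify(doc_terms, prior, theta):
    """Pick the class maximizing log prior + log document probability."""
    return max(prior, key=lambda c: math.log(prior[c])
               + log_prob_doc(doc_terms, theta[c]))

# Hypothetical priors and per-class term probabilities (each row sums to 1)
prior = {"sports": 0.5, "tech": 0.5}
theta = {"sports": {"ball": 0.6, "game": 0.3, "code": 0.1},
         "tech":   {"ball": 0.1, "game": 0.2, "code": 0.7}}
topic = classify(["code", "code", "game"], prior, theta)
```

The multinomial coefficient is the same for every class, so it does not affect the argmax; it is kept here only so that `log_prob_doc` matches the formula term by term.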
11
Enhanced models for hypertext
c = class, d = text, N = neighbors
Text-only model: Pr(d | c)
Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
Better recursive model: Pr(d, c(N) | c)
Relaxation labeling over Markov random fields
Or, EM formulation
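A simplified sketch of relaxation labeling in the spirit of the recursive model Pr(d, c(N) | c): each unlabeled node repeatedly re-estimates its class distribution from its own text estimate and its neighbors' current class beliefs. The independence assumption across neighbors, the `compat` table, and all numbers are illustrative assumptions, not the talk's exact formulation.

```python
def relaxation_labeling(text_prob, neighbors, compat, known, rounds=10):
    """text_prob[node][c]: text-only class estimate for every node.
    neighbors[node]: adjacent nodes. known[node]: given class labels.
    compat[c][c2]: assumed chance that a class-c page neighbors a class-c2 page."""
    classes = list(compat)
    belief = {}
    for node in text_prob:
        if node in known:
            belief[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            belief[node] = dict(text_prob[node])
    for _ in range(rounds):
        updated = {}
        for node in belief:
            if node in known:
                updated[node] = belief[node]  # known labels stay fixed
                continue
            score = {}
            for c in classes:
                s = text_prob[node][c]
                for nb in neighbors.get(node, []):
                    # Marginalize over the neighbor's current class belief
                    s *= sum(belief[nb][c2] * compat[c][c2] for c2 in classes)
                score[c] = s
            total = sum(score.values())
            updated[node] = {c: s / total for c, s in score.items()}
        belief = updated
    return belief

# Hypothetical graph: page "u" has ambiguous text and two neighbors known to be red
text_prob = {"u": {"red": 0.5, "blue": 0.5},
             "v1": {"red": 0.9, "blue": 0.1},
             "v2": {"red": 0.8, "blue": 0.2}}
neighbors = {"u": ["v1", "v2"]}
compat = {"red": {"red": 0.8, "blue": 0.2}, "blue": {"red": 0.3, "blue": 0.7}}
belief = relaxation_labeling(text_prob, neighbors, compat, known={"v1": "red", "v2": "red"})
```

With text alone, "u" is a coin flip; the neighbors' labels are what tip it toward red.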
12
Hyperlink modeling boosts accuracy
9600 patents from 12 classes marked by USPTO
Patents have text and prior art links
Expand test patent to include neighborhood
‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)
[Figure: % error (0 to 40) versus % neighborhood known (0 to 100) for the Text, Link, and Text+Link series]
13
Hyperlink Induced Topic Search
[Figure: a query goes to a keyword search engine; the response is grown into a radius-1 expanded graph of hub (h) and authority (a) nodes]
$a = E^T h$, $h = E a$
‘Hubs’ and ‘authorities’
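The two updates, a = E^T h and h = E a, can be iterated directly on an edge list. This is a minimal sketch on a hypothetical four-edge graph, assuming L2 normalization after each round; the node names are made up.

```python
import math

def hits(edges, iterations=50):
    """Iterate a = E^T h and h = E a on a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a = E^T h: a page's authority is the sum of hub scores pointing to it
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # h = E a: a page's hub score is the sum of authority scores it points to
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize both vectors so the scores stay bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values()))
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical expanded graph: three hubs, two candidate authorities
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```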
14
“Topic drift” and various fixes
Some hubs have ‘mixed’ content
Authority ‘leaks’ through mixed hubs from good to bad pages
Clever: match query with anchor text to favor some edges
B&H: eliminate outlier documents
[Figure labels: vector-space document model, centroid, cut-off radius; query term, activation window, ‘thick’ links]
15
Document object model (DOM)
Hierarchical graph model for semi-structured data
Can extract reasonable DOM from HTML
A fine-grained view of the Web
Valuable because page boundaries are less meaningful now

<html>
  <head><title>Portals</title></head>
  <body>
    <ul>
      <li><a href="…">Yahoo</a></li>
      <li><a href="…">Lycos</a></li>
    </ul>
  </body>
</html>

html
  head
    title
  body
    ul
      li
        a
      li
        a
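A small sketch of extracting an element tree from the Portals example with Python's standard `html.parser`. The `DomBuilder` class and `outline` helper are hypothetical names; void elements such as `<br>` are not special-cased here, so this is a rough DOM rather than a standards-compliant one.

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Build a nested dict tree of element tags, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating sloppy HTML
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines

builder = DomBuilder()
builder.feed('<html><head><title>Portals</title></head><body><ul>'
             '<li><a href="…">Yahoo</a></li><li><a href="…">Lycos</a></li>'
             '</ul></body></html>')
tree = outline(builder.root["children"][0])
```

Each `li` subtree is a candidate “micro-hub”: a fragment finer than the page that can carry its own score.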
16
A model for hub generation
Global hub score distribution $\Theta_0$ w.r.t. given query
Authors use DOM nodes to specialize $\Theta_0$ into local $\Theta_I$
At a certain ‘cut’ in the DOM tree, local distribution directly generates hub scores
[Figure labels: global distribution, progressive ‘distortion’, model frontier, other pages]
17
Optimizing a cost measure
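Reference distribution $\Theta_0$
Data encoding cost is roughly $-\sum_{h \in H_v} \log \Pr(h \mid \Theta_v)$
Distribution distortion cost is the KL divergence between the reference distribution $\Theta_0$ and the local distribution $\Theta_v$ (closed form for the Poisson distribution)
[Figure labels: node $v$, hub score set $H_v$]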
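A sketch of the two cost terms for a Poisson model, in nats. It assumes hub scores have been discretized to non-negative integer counts, uses the standard closed form KL(Poisson(p) || Poisson(q)) = p log(p/q) - p + q, and takes the divergence from the reference to the local distribution; the direction, the discretization, and the function names are assumptions rather than details stated on the slide.

```python
import math

def poisson_log_pmf(k, mu):
    """log Pr(k | Poisson(mu)) for a non-negative integer count k."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def data_encoding_cost(hub_scores, mu_local):
    """Negative log-likelihood of the hub scores under the local distribution."""
    return -sum(poisson_log_pmf(h, mu_local) for h in hub_scores)

def poisson_kl(mu_p, mu_q):
    """KL(Poisson(mu_p) || Poisson(mu_q)), standard closed form."""
    return mu_p * math.log(mu_p / mu_q) - mu_p + mu_q

def total_cost(hub_scores, mu_ref, mu_local):
    """Distortion cost of specializing the reference plus data encoding cost."""
    return poisson_kl(mu_ref, mu_local) + data_encoding_cost(hub_scores, mu_local)

# Hypothetical subtree whose hub scores sit well above the reference mean of 1.0
scores = [5, 6, 4]
keep_reference = total_cost(scores, mu_ref=1.0, mu_local=1.0)
specialize = total_cost(scores, mu_ref=1.0, mu_local=5.0)
```

When the subtree's scores differ enough from the reference, paying the distortion cost for a local distribution is cheaper overall, which is the trade-off that decides where the ‘cut’ falls.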
18
Modified topic distillation algorithm
Will this (non-linear) system converge?
Will segmentation help in reducing drift?

Initialize DOM graph
Let only root set authority scores be 1
Repeat until reasonable convergence:
    Authority-to-hub score propagation
    MDL-based hub score smoothing
    Hub-to-authority score propagation
    Normalization of authority scores
Segment and rank micro-hubs
Present annotated results
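The loop structure can be sketched as follows. This works on a plain page-level edge list rather than a DOM graph, and the MDL-based smoothing step is left as a pluggable `smooth` function that defaults to the identity; both are simplifying assumptions, as are the tolerance and the sample graph.

```python
def distill(edges, root_set, smooth=lambda hub: hub, tol=1e-8, max_iter=100):
    """Iterate authority -> hub -> (smooth) -> authority until scores settle."""
    nodes = {n for edge in edges for n in edge}
    # Let only root set authority scores be 1
    auth = {n: (1.0 if n in root_set else 0.0) for n in nodes}
    for iteration in range(1, max_iter + 1):
        # Authority-to-hub score propagation
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Placeholder for MDL-based hub score smoothing (identity by default)
        hub = smooth(hub)
        # Hub-to-authority score propagation
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # Normalization of authority scores (guard against an all-zero vector)
        total = sum(new_auth.values()) or 1.0
        new_auth = {n: s / total for n, s in new_auth.items()}
        # Mean authority score change, used as the convergence test
        change = sum(abs(new_auth[n] - auth[n]) for n in nodes) / len(nodes)
        auth = new_auth
        if change < tol:
            break
    return auth, iteration

# Hypothetical graph in which every node belongs to the root set
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
auth, iterations = distill(edges, root_set={"h1", "h2", "a1", "a2"})
```

With the identity `smooth` this reduces to the linear hub/authority iteration; a real smoothing step makes the system non-linear, which is why convergence is posed as an open question.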
19
Convergence
28 queries used in Clever and by B&H
366k macro-pages, 10M micro-links
Rank converges within 15 iterations
[Figure: mean auth score change (log scale, 1.00E-07 to 1.00E-01) versus iterations (0 to 10)]
20
Effect of micro-hub segmentation
‘Expanded’ implies authority diffusion arrested
As nodes outside rootset start participating in the distillation…
• #Expanded increases
• #Pruned decreases
Prevents authority leaks via mixed hubs
[Figure: smoothing statistics (0 to 2500) versus iterations (1 to 9), with series Expanded and Pruned]
21
Rank correlation with B&H
Positively correlated
Some negative deviations
Pseudo-authorities downgraded by our algorithm
These were earlier favored by mixed hubs
[Figure: our authority score (0 to 0.025) versus B&H authority score (0 to 0.015); axes not to same scale]
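One common way to quantify agreement between two rankings is Spearman's rank correlation. The sketch below assumes no tied scores and uses made-up score lists; the slide does not say which correlation statistic was used.

```python
def ranks(scores):
    """Rank positions (1 = highest score), assuming no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical authority scores for the same four pages under two algorithms
ours = [0.020, 0.015, 0.004, 0.010]
theirs = [0.012, 0.009, 0.008, 0.005]
rho = spearman(ours, theirs)
```

Because only ranks enter the statistic, the differing axis scales of the two score sets do not matter.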
22
Conclusion
Hypertext and the Web pose new modeling and algorithmic challenges
Locality exists in many guises
Diverse sources of information: text, links, markup, usage
Unifying models needed
Anecdotes suggest that synergy can be exploited