Learning structured outputs
www-connex.lip6.fr
Université Pierre et Marie Curie – Paris – FR
NATO ASI
Mining Massive Data Sets for Security
Outline
- Motivation and examples
- Approaches for structured learning: generative models, discriminant models, search models
Machine learning and structured data
Different types of problems:
- model, classify, cluster structured data
- predict structured outputs
- learn to associate structured representations
Structured data and applications arise in many domains: chemistry, biology, natural language, the Web, social networks, databases, etc.
Sequence labeling: POS tagging

This/DT Workshop/NN brings/VBZ together/RB scientists/NNS and/CC engineers/NNS
interested/VBN in/IN recent/JJ developments/NNS in/IN exploiting/VBG Massive/JJ
data/NP sets/NP

(DT = determiner, NN = noun, VBZ = verb 3rd person singular, RB = adverb, NNS = plural noun, CC = coordinating conjunction, VBN = past participle, IN = preposition, JJ = adjective, VBG = verb gerund, NP = proper noun)
PENN tag set
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential "there"
5. FW Foreign word
6. IN Preposition / subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO "to"
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund / present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
37. # Pound sign
38. $ Dollar sign
39. . Sentence-final punctuation
40. , Comma
41. : Colon, semi-colon
42. ( Left bracket character
43. ) Right bracket character
44. " Straight double quote
45. ` Left open single quote
46. `` Left open double quote
47. ' Right close single quote
48. '' Right close double quote
Segmentation + labeling: syntactic chunking (Washington Univ. tagger)

[NP This Workshop] [VP brings] [ADVP together] [NP scientists and engineers]
[VP interested] [IN in] [NP recent developments] [PNP in exploiting Massive data sets]

(NP = noun phrase, VP = verb phrase, ADVP = adverbial phrase, PNP = prepositional noun phrase)
Segmentation + labeling: Named Entity Recognition
- Entities: locations, persons, organizations
- Time expressions: dates, times
- Numeric expressions: $ amounts, percentages
NEW YORK (Reuters) - Goldman Sachs Group Inc. agreed on Thursday to pay $9.3 million to settle charges related to a former economist …. Goldman's GS.N settlement with securities regulators stemmed from charges that it failed to properly oversee John Youngdahl, a one-time economist …. James Comey, U.S. Attorney for the Southern District of New York, announced on Thursday a seven-count indictment of Youngdahl for insider trading, making false statements, perjury, and other charges. Goldman agreed to pay a $5 million fine and disgorge $4.3 million from illegal trading profits.
Information extraction

Example: the NATO ASI web page.

NATO Advanced Study Institute on Mining Massive Data Sets for Security
September 10-21, 2007, Villa Cagnola, Gazzada, Italy

NATO ASI Announcement: This Workshop brings together scientists and engineers interested in recent developments in exploiting Massive Data Sets. Emphasis is placed on available techniques and their application to security-critical applications...

Lecturers: C. Best, L. Bottou, R. Feldman, F. Fogelman-Soulié, P. Gallinari, E. Glover, L. Giles, A. Gionis, I. Guyon, D. Hand, G. Hébrail, F. Provost, N. Tishby, V. Vapnik, D. Wilkinson

Objective: Today our world is awash in data and we live in an Information Society where every action leaves a trace, generating massive amounts of data. Recent scientific developments provide technologies to exploit these huge amounts of data and extract critical information from them...

Directors: Clive Best (JRC, IT), Françoise Fogelman Soulié (Kxen, FR), Patrick Gallinari (Université Paris 6, FR), Naftali Tishby (Hebrew University, IL)

Important dates:
- Deadline for submission of the application form: June 24, 2007 (extended)
- Notification of acceptance: June 30, 2007 (new)
- Deadline for the accommodation form: July 1, 2007
- NATO ASI MMDSS: September 10-21, 2007
Syntactic parsing (Stanford Parser)

[Figure: a parse tree produced by the Stanford Parser.]
Document mapping problem
- Problem: querying heterogeneous XML databases or collections requires knowing the correspondence between the structured representations, which is usually established by hand
- Goal: learn the correspondence between the different sources
Labeled tree mapping problem:

<Restaurant>
  <Name>La cantine</Name>
  <Address>65 rue des pyrénées, Paris, 19ème, FRANCE</Address>
  <Specialities>Canard à l'orange, Lapin au miel</Specialities>
</Restaurant>

<Restaurant>
  <Name>La cantine</Name>
  <Address>
    <City>Paris</City>
    <Street>pyrénées</Street>
    <Num>65</Num>
  </Address>
  <Dishes>Canard à l'orange</Dishes>
  <Dishes>Lapin au miel</Dishes>
</Restaurant>
Other applications
- Taxonomies
- Social networks
- Adversarial computing: Web spam, blog spam, ...
- Translation
- Biology
- ...
Is structure really useful? Can we make use of structure?
- Yes: there is evidence from many domains and applications, and structure is mandatory for many problems (e.g. a classification problem with 10K classes)
- Yes, but: complex or long-term dependencies often correspond to rare events, and practical evidence on large-size problems shows that simple models sometimes offer competitive results (information retrieval, speech recognition, etc.)
Structured learning
- X, Y: input and output spaces
- Structured output: y ∈ Y decomposes into parts of variable size, y = (y1, y2, ..., yT)
- Dependencies: relations between the parts of y; local, long-term, global
- Cost functions:
  - 0/1 loss: ∆(y*, ŷ) = 1 if y* ≠ ŷ, else 0
  - Hamming loss: ∆(y*, ŷ) = Σ_{i=1..T} 1[y*_i ≠ ŷ_i]
  - F-score, BLEU, etc.
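A minimal Python sketch of these two losses (illustrative, not from the original slides):

```python
def zero_one_loss(y_star, y_hat):
    """0/1 loss: 1 if the whole structured output differs, else 0."""
    return int(y_star != y_hat)

def hamming_loss(y_star, y_hat):
    """Hamming loss: number of parts y_i that differ."""
    return sum(a != b for a, b in zip(y_star, y_hat))

# Under the 0/1 loss, one wrong part costs as much as a fully wrong output:
print(zero_one_loss(("DT", "NN", "VBZ"), ("DT", "NN", "VB")))  # -> 1
print(hamming_loss(("DT", "NN", "VBZ"), ("DT", "NN", "VB")))   # -> 1
```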
General approach
- Predictive approach: y* = f(x) = argmax_{y ∈ Y} F(x, y, θ), where F: X × Y → R is a score function used to rank potential outputs
- F is trained to optimize some loss function
- Inference problem: |Y| is sometimes exponential, so the argmax is often intractable. Two common hypotheses make it tractable:
  - decomposability of the score function over the parts of y: y* = f(x) = argmax_{y ∈ Y} Σi F(x, yi, θ)
  - a restricted set of outputs
Structured algorithms differ by:
- feature encoding
- hypothesis on the output structure
- hypothesis on the cost function
Generative models
- Hidden Markov Models
- Probabilistic Context Free Grammars
- Tree labeling model
Usual hypotheses
- Features: a "natural" encoding of the input
- Output structure: local output dependencies, Markov property
- The score decomposes, e.g. as a sum of local costs over the subparts
- Inference: usually dynamic programming
HMMs
- Sequence labeling and segmentation
- Dependencies:
  - outputs, Markov assumption: p(qt | q1, ..., qt−1) = p(qt | qt−1)
  - observations: p(xt | x1, ..., xt−1, q1, ..., qt) = p(xt | qt)
- Decoding and learning: dynamic programming (Viterbi for the argmax, Forward-Backward for learning)
- Decoding complexity: O(n|Q|²) for a sequence of length n and |Q| states
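To make the O(n|Q|²) decoding concrete, here is a minimal Viterbi sketch in Python (the array layout and log-space computation are illustrative choices, not from the slides):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state sequence of an HMM, in O(n |Q|^2).
    pi: initial probabilities (|Q|,); A: transition matrix (|Q|, |Q|);
    B: emission matrix (|Q|, |V|); obs: sequence of observation indices."""
    n, Q = len(obs), len(pi)
    delta = np.zeros((n, Q))            # best log-prob of a path ending in q at t
    psi = np.zeros((n, Q), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(A)   # (from state, to state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]                 # backtrack from best end state
    for t in range(n - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```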
Consider a simple HMM with a Start state.

[Figure: the state space (trellis) unrolled for an input sequence of size 3.]
Probabilistic Context Free Grammars (after Manning & Schütze)
- A set of terminals {w1, ..., wv}
- A set of non-terminals {N1, ..., Nn}; N1 is the start symbol
- A set of rules {Ni → ζi}, with ζi a sequence of terminals and non-terminals
- To each rule is associated a probability P(Ni → ζi)
- Special case, Chomsky Normal Form grammars: ζi = wj or ζi = Nk Nm
Example grammar:
S → NP VP    1.0      NP → NP PP        0.4
PP → P NP    1.0      NP → astronomers  0.1
VP → V NP    0.7      NP → ears         0.18
VP → VP PP   0.3      NP → saw          0.04
P → with     1.0      NP → stars        0.18
V → saw      1.0      NP → telescopes   0.1

The two parses of "astronomers saw stars with ears":
(S (NP astronomers) (VP (VP (V saw) (NP stars)) (PP (P with) (NP ears))))
(S (NP astronomers) (VP (V saw) (NP (NP stars) (PP (P with) (NP ears)))))
Notations
- Sentence: Wp,q = wp wp+1 ... wq
- Ni dominates the sequence Wp,q if Ni may rewrite to wp wp+1 ... wq
Assumptions:
- Context-freeness: the probability of a subtree does not depend on the words outside the subtree
- Independence from ancestors: the probability does not depend on the nodes of the derivation outside the subtree

[Figure: a node Nj dominating the span wp ... wq.]
Inside and outside probabilities
As with the forward-backward variables in HMMs, two probabilities may be defined:
- Inside: the probability of generating wk ... wl starting from Nj,
  βj(k, l) =def p(Wk,l | Nj over span k..l)
- Outside: the probability of generating Nj and all the words outside wk ... wl,
  αj(k, l) =def p(W1,k−1, Nj over span k..l, Wl+1,n)
Probability of a sentence: the CKY algorithm
- Probability of the sentence: p(w1,n) = β1(1, n)
- Initialization: βj(k, k) = p(Nj → wk)
- Left-to-right induction on the sequence: for k = 1..n, for l = k+1..n, compute
  βj(k, l) = Σ_{p,q} Σ_{m=k..l−1} P(Nj → Np Nq) βp(k, m) βq(m+1, l)
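A small Python sketch of this inside computation, using the CNF grammar of the earlier slide (the dictionary-based grammar encoding is an illustrative assumption):

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """p(sentence) for a PCFG in Chomsky Normal Form.
    lexical: {(A, word): prob} for rules A -> word;
    binary:  {(A, B, C): prob} for rules A -> B C."""
    n = len(words)
    beta = defaultdict(float)                 # beta[(A, k, l)] spans words[k:l]
    for k, w in enumerate(words):             # initialization: length-1 spans
        for (A, word), p in lexical.items():
            if word == w:
                beta[(A, k, k + 1)] += p
    for span in range(2, n + 1):              # induction: longer spans
        for k in range(n - span + 1):
            l = k + span
            for (A, B, C), p in binary.items():
                for m in range(k + 1, l):     # split point
                    beta[(A, k, l)] += p * beta[(B, k, m)] * beta[(C, m, l)]
    return beta[(start, 0, n)]

binary = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3}
lexical = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18, ("NP", "saw"): 0.04,
           ("NP", "stars"): 0.18, ("NP", "telescopes"): 0.1,
           ("P", "with"): 1.0, ("V", "saw"): 1.0}
print(inside_probability("astronomers saw stars with ears".split(),
                         lexical, binary))    # sums over both parses
```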
Inference and learning
- Inference: similar to the probability of a sentence, with max in place of Σ
- Complexity: O(m³ n³), with n the length of the sentence and m the number of non-terminals in the grammar
- Learning: inside-outside; each step is O(m³ n³)
Tree generative models
- Classification / clustering of structured documents (Denoyer et al. 2004)
- Document annotation / conversion (Wisniewski et al. 2006)
Context: XML semi-structured documents

[Figure: an XML document tree with nodes <article>, <hdr>, <bdy>, <sec>, <st>, <p>, <fig>, <fgc> and text leaves.]
Document model
A document d = (sd, td) has a structure sd and a content td:
P(D = d) = P(S = sd, T = td) = P(S = sd) P(T = td | S = sd)
where P(S = sd) is the structural probability and P(T = td | S = sd) the content probability.
Scalability is the key concern!
Document model: structure (belief networks)

[Figure: an example document ("Document title"; a first section, "Section title", containing two paragraphs; a second section containing no paragraphs) and the corresponding label tree: Document → Intro, Section, Section; Section → Paragraph, Paragraph.]

Three structural models of increasing complexity:
- independent node labels: P(sd) = Π_{i=1..|d|} P(sd_i)
- conditioned on the parent's label: P(sd) = Π_{i=1..|d|} P(sd_i | label(parent(n_i)))
- conditioned on the parent's and the preceding sibling's labels: P(sd) = Π_{i=1..|d|} P(sd_i | label(parent(n_i)), label(previous(n_i)))
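A minimal sketch of the parent-conditioned variant (the tree encoding and names are illustrative assumptions):

```python
import math

def structural_log_prob(tree, cond):
    """log P(s_d) under P(s_d) = prod_i P(label_i | label(parent_i)).
    tree: (label, [children]) pairs; cond: {(label, parent_label): prob};
    the root label is taken as given."""
    def walk(node, parent_label):
        label, children = node
        lp = math.log(cond[(label, parent_label)])
        return lp + sum(walk(c, label) for c in children)
    root_label, children = tree
    return sum(walk(c, root_label) for c in children)

doc = ("Document", [("Intro", []),
                    ("Section", [("Paragraph", []), ("Paragraph", [])])])
cond = {("Intro", "Document"): 0.3, ("Section", "Document"): 0.7,
        ("Paragraph", "Section"): 0.9}
print(structural_log_prob(doc, cond))
```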
Document model: content
- A model for each node, with a first-order dependency: td = (td_1, ..., td_|d|)
- P(td | sd) = Π_{i=1..|d|} P(td_i | sd), with P(td_i | sd) = P(td_i | sd_i)
- A local generative model is used for each label
Final network

[Figure: the belief network Document → (Intro, Section, Section), Section → (Paragraph, ...), with text leaves attached to the nodes: T1 = "This document is an example of a tree-structured document" (Intro), T2 = "This is the first section of the document" (Section), T3 = "The first paragraph" (Paragraph), T4 = "The second paragraph" (Paragraph), T5 = "The second section" (Section), T6 = "The third paragraph" (Paragraph).]

P(d) = P(Intro | Document) P(Section | Document)² P(Paragraph | Section)³
     × P(T1 | Intro) P(T2 | Section) P(T3 | Paragraph)
     × P(T4 | Paragraph) P(T5 | Section) P(T6 | Paragraph)
Different learning techniques
- Likelihood maximization:
  L = Σ_{d ∈ D_TRAIN} log P(d | θ)
    = Σ_{d ∈ D_TRAIN} log P(sd) + Σ_{d ∈ D_TRAIN} Σ_{i=1..|d|} log P(td_i | sd_i)
    = L_structure + L_content
- Discriminant learning: a logistic function on top of the generative log-likelihoods, e.g. for two classes P(c1 | x) = 1 / (1 + e^{log P(x | c2) − log P(x | c1)}), with log P(x | c) = Σ_{i=1..n} log θ_c(xi, pa(xi)); error minimization
- Fisher kernel
Document mapping problem
- Problem: learn from examples how to map heterogeneous sources onto a predefined target schema, preserving the document semantics
- Sources: semi-structured, HTML, PDF, flat text, etc.
Labeled tree mapping problem, different instances:
- flat text to XML
- HTML to XML
- XML to XML
- ...

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l'orange, Lapin au miel</Spécialités>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l'orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>
Document mapping problem
- Central issue: complexity. Large collections; large feature spaces (10³ to 10⁶ features); large, exponential search space
- Approach: learn generative models of the XML target documents from a training set, then decode unknown sources according to the learned model
Problem formulation
Given:
- ST, a target format
- din, an input document
find the most probable target document:
dST = argmax_{d'ST} P(d'ST | din)
(decoding with a learned transformation model)
General restructuring model
Decompose the transformation into structure and content:
d' = argmax_{d'} P(sd' | sd) P(td' | sd', sd, td)
Example: HTML to XML (tree annotation)
Hypotheses:
- Input document: HTML tags are mostly there for visualization; remove the tags and keep only the segmentation (the leaves)
- Annotation: the leaves are the same in the HTML and in the XML document
- Target document model: a node label depends only on its local context (content, left sibling, father)
Model and training
- Probability of the target tree: P(dST | din) = P(dT_1, ..., dT_|dT|) = Π_i P(ni | ci, sib(ni), father(ni)), where ni is a node label, ci its content, sib(ni) its left sibling and father(ni) its father
- Solve dST = argmax_{d'ST} P(d'ST | din)
- Exact dynamic programming decoding: O(|leaf nodes|³ · |tags|)
- Approximate solution with LASO (Hal Daumé, ICML 2005): O(|leaf nodes| · |tags| · |tree nodes|)
Experiments: HTML to XML
- IEEE collection / INEX corpus: 12K documents; on average 500 leaf nodes, 200 internal nodes, 139 tags
- Movie DB: 10K movie descriptions (IMDB); on average 100 leaf nodes, 35 internal nodes, 28 tags
- Shakespeare: 39 plays; few documents, but on average 4,100 leaf nodes, 850 internal nodes, 21 tags
- Mini-Shakespeare: 60 randomly chosen scenes from the plays; 85 leaf nodes, 20 internal nodes, 7 tags
- For all collections: ½ train, ½ test
Performance

[Figure: performance results on these collections.]
Summary
- 30 years of generative models: hierarchical HMMs, factorial HMMs, etc.
- Local dependency hypotheses, on the outputs and on the inputs
- Inference and learning often use dynamic programming, which is prohibitive for many problems; other methods include loopy propagation and search (e.g. A*)
- Cost function: the joint likelihood, which decomposes
Discriminant models
- Structured Perceptron (Collins 2002)
- Large margin methods (Tsochantaridis et al. 2004, Taskar et al. 2004)
Usual hypotheses
- Joint representation of input and output, Φ(x, y), encoding potential dependencies among and between input and output: e.g. the histogram of state transitions observed in the training set, the frequency of (xi, yj) pairs, POS tags, etc.
- Large feature sets (10² to 10⁴)
- Linear score function: F(x, y, θ) = ⟨θ, Φ(x, y)⟩
- Decomposability of the feature set (over the outputs) and of the loss function
Structured Perceptron (Collins 2002)
- Discriminant model based on a Perceptron variant for sequence labeling
- Initially proposed for POS tagging and chunking; extends to other structured output tasks
- Inference: Viterbi
- Encodes (local) input and output dependencies
- Simple
Algorithm
Training:
- initialize θ = 0
- repeat n times over all training examples (x, y):
  - ŷ = argmax_{s ∈ Y} ⟨θ, Φ(x, s)⟩
  - if ŷ ≠ y, update the parameters: θ ← θ + Φ(x, y) − Φ(x, ŷ)
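The whole training procedure fits in a few lines of Python (a sketch; phi and argmax stand for the joint feature map Φ and a Viterbi-style inference routine, which are assumed to be supplied):

```python
import numpy as np

def structured_perceptron(data, phi, argmax, dim, epochs=5):
    """Collins-style structured perceptron.
    data: list of (x, y) pairs; phi(x, y) -> np.ndarray of size dim;
    argmax(x, theta) -> highest-scoring output y under theta."""
    theta = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            y_hat = argmax(x, theta)        # inference (e.g. Viterbi)
            if y_hat != y:                  # mistake-driven update
                theta += phi(x, y) - phi(x, y_hat)
    return theta
```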
- Inference: dynamic programming
- Restricted to the 0/1 cost
- Convergence and generalization bounds (Freund & Schapire, 1999): the number of mistakes depends only on the margin, not on the size of the output space (the number of potential candidates)
Extension of large margin methods
Two problems:
- generalize the max-margin principle to loss functions other than the 0/1 loss
- the number of constraints is proportional to |Y|, i.e. potentially exponential
SVM ISO (Tsochantaridis et al. 2004)
- Extension of multi-class SVMs
- Principle (example: 0/1 loss for classification, linearly separable problem):
  - we get 0 error if ∀ i = 1..N: ⟨w, Φ(xi, yi)⟩ > max_{y ∈ Y\yi} ⟨w, Φ(xi, y)⟩
  - the problem amounts to solving N non-linear inequalities
  - equivalent problem: solve N × (card(Y) − 1) linear inequalities, ∀ i = 1..N, ∀ y ∈ Y\yi: ⟨w, Φ(xi, yi)⟩ − ⟨w, Φ(xi, y)⟩ > 0
SVM formulation: non-linearly separable case, 0/1 cost, one slack variable per non-linear constraint (Crammer & Singer, 2001):

QP: min_{w, ξ ≥ 0} ½ ‖w‖² + (C/N) Σ_{i=1..N} ξi
with constraints: ∀ i, ∀ y ∈ Y\yi: ⟨w, Φ(xi, yi)⟩ − ⟨w, Φ(xi, y)⟩ ≥ 1 − ξi
Problem 1: extension to a ∆ loss
- For each constraint, penalize examples according to their loss: rescale the slack variables according to the loss incurred in each linear constraint
- New constraints: ∀ i, ∀ y ∈ Y\yi: ⟨w, Φ(xi, yi)⟩ − ⟨w, Φ(xi, y)⟩ ≥ 1 − ξi / ∆(yi, y)
Problem 2: limit the number of constraints
- Done implicitly via the training algorithm: it finds a polynomial number of "active" constraints such that the solution of the QP problem with these constraints alone fulfills all the constraints up to a given precision ε
- The algorithm requires solving an argmax problem at each iteration: Viterbi for sequences, CKY for parsing (see the sketch below)
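A Python sketch of this constraint-generation loop (not the authors' code; phi, loss, loss_aug_argmax and solve_qp are assumed helpers: the joint feature map, the task loss ∆, loss-augmented inference such as Viterbi or CKY, and a QP solver over the current working set; the margin-rescaling check is shown for concreteness):

```python
import numpy as np

def working_set_train(data, phi, loss, loss_aug_argmax, solve_qp,
                      dim, eps=1e-3, max_iters=100):
    """Grow a working set of 'active' constraints until every example's
    most violated constraint is satisfied up to precision eps."""
    constraints = [[] for _ in data]        # working set, one list per example
    w, xi = np.zeros(dim), np.zeros(len(data))
    for _ in range(max_iters):
        added = 0
        for i, (x, y) in enumerate(data):
            y_hat = loss_aug_argmax(x, y, w)          # most violated output
            margin = w @ (phi(x, y) - phi(x, y_hat))
            if loss(y, y_hat) - margin > xi[i] + eps: # violated beyond slack
                constraints[i].append(y_hat)
                added += 1
        if added == 0:                                # eps-feasible: done
            break
        w, xi = solve_qp(constraints)                 # re-optimize the QP
    return w
```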
M3N (Taskar et al. 2003)
- Combines a probabilistic model (a Markov network) with a max-margin formulation
- Solution to problem 1: margin rescaling
- Solution to problem 2: the structure of the Markov network limits the number of constraints (e.g. a chain network for sequences)
Summary: discriminant approaches
- Hypotheses: local dependencies for the output, decomposability of the loss function; long-term dependencies in the input are allowed
- Nice convergence properties and bounds
- Complexity: learning often does not scale
Incremental learning: learning to search solution spaces
- Incremental parsing (Collins 2004)
- SEARN (Daumé et al. 2006)
- Reinforcement learning (Maes et al. 2007)
General ideas
- Incremental construction of the output ŷ through actions
- Decisions: choose a subset of actions
- Learn how to explore the state space of the problem
Incremental parsing (Collins & Roark, 2004)
- Build the parse tree of a sentence incrementally; inference is a greedy beam algorithm (sketched below)
- Read the sentence from left to right; step i corresponds to the i-th word
- At step i, candidate partial parse trees are generated for the first i words, then scored and ranked; a subset of the candidates is selected and used to generate the next set of candidates
- The final output is the best-scored parse tree at the end of the sentence
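A Python sketch of this beam inference (illustrative; extend and score stand for the candidate generator and the learned ranking function F):

```python
def beam_parse(sentence, extend, score, beam_width=10):
    """Greedy beam inference for incremental parsing.
    extend(tree, word) yields partial trees extended with the next word
    (None denotes the empty parse); score(prefix, tree) ranks candidates."""
    beam = [None]                                    # start from the empty parse
    for i, word in enumerate(sentence):
        candidates = [t for tree in beam for t in extend(tree, word)]
        candidates.sort(key=lambda t: score(sentence[:i + 1], t), reverse=True)
        beam = candidates[:beam_width]               # keep the best-scored subset
    return beam[0]                                   # best full parse tree
```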
More details
- Ranking function F = ⟨Φ(x, y), θ⟩, learned from a training set with an adaptive algorithm (Perceptron)
- Input at step i: a joint vectorial representation of the input sentence and the partial parse tree
- No need for dynamic programming
How to build the sequence of partial trees, from Y(i) to Y(i+1): at step i,
- consider word xi
- action = try to attach each chain ending with xi to any attachment site
- the grammar G includes some constraints:
  - allowable chains: only derivation chains appearing in the data set are allowed
  - attachment sites: places in the partial tree where a chain can be attached (also inferred from the data)
[Figure: incremental construction of partial parse trees for "astronomers saw stars": from the empty tree ε, attach the NP chain for "astronomers", then the V/VP chain for "saw", then extend with "stars".]
SEARN (Daumé et al. 2006)
- Introduced the idea of learning how to explore a search space
- Hypothesis: the structured output can be built incrementally, ŷ = (ŷ1, ŷ2, ..., ŷT), and it will be built via machine learning
- Loss: C = E_X[∆(y, ŷ)]
- Goal: construct ŷ incrementally so as to minimize the loss; learn how to search the solution space
Example: sequence labelling
- Two labels, R and B
- Search space: (input sequence, {sequences of labels})
- For a sequence of size 3, x = x1 x2 x3, a node represents a state in the search space
Example: expected loss

[Figure: the full search tree for a sequence of size 3 and a given target; each transition carries a local cost C ∈ {0, 1} and each leaf a total cost CT ∈ {0, 1, 2, 3}.]

The loss does not always separate!
Example: state-space exploration guided by local costs

[Figure: the same search tree; exploration follows the transitions with local cost C = 0, reaching the leaf with total cost CT = 0.]

Goal: generalize to unseen situations.
Inference
- Suppose we have a policy function F which decides at each step which action to take
- Inference is performed by computing ŷ1 = F(x, ·), ŷt = F(x, ŷ1, ..., ŷt−1), ..., ŷT = F(x, ŷ1, ..., ŷT−1); the output is ŷ = (ŷ1, ..., ŷT)
- No dynamic programming needed
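In code, this inference loop is a few lines of Python (a sketch; policy stands for the learned F):

```python
def greedy_inference(x, policy, length):
    """Build y_hat part by part with a learned policy; no DP involved."""
    y_hat = []
    for _ in range(length):
        y_hat.append(policy(x, tuple(y_hat)))   # one decision per step
    return tuple(y_hat)
```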
Training
- F is implemented with a classifier, F: {states} → {actions}
- Learn a classifier F such that, at each step, F takes the "optimal" decision
- Incremental algorithm:
  - the first classifier F1 learns from the optimal path, assumed known at training time (on its own, a bad solution for generalization)
  - at iteration i, classifier Fi learns from the decisions of Fi−1
- Let Fi be the current classifier. For input x, at each state st there are 2 possible actions; compute the expected cost associated with actions a1 and a2
- The best action is labeled 0, the other 1: these are the targets for the classifier
- When the final state is reached, we get a set of training examples for Fi+1

[Figure: a trajectory s0 → s1 → s2 → s3; at each state st, the two actions a1 and a2 carry expected costs CF(st, a1) and CF(st, a2).]
Training algorithm
- Initialize F with a good initial policy
- For each example x:
  - compute its prediction ŷ = (ŷ1, ŷ2, ..., ŷT(x)) using the current F, and let (s1, s2, ..., sT(x)) be the corresponding states
  - for each state st, compute the state representation Φ(st, x) and, for each possible action a, the expected loss cF(st, a)
  - this yields a set of training examples for the policy classifier: for each x and each st, {(a1, cF(st, a1)), ..., (a|A|, cF(st, a|A|))}
- Train a classifier F' to predict the best action, update the current classifier F with F', and iterate
SEARN hypotheses
- Decomposability of y: y can be built from successive parts
- Decomposability of the training loss
- An optimal policy is available at training time
Remarks:
- No Markov assumption and no need for DP: ŷ is built by successive applications of the policy
- Fast
- Can accommodate a large number of cost functions
Reinforcement-learning search (Maes 2007)
- Formalizes the search-based ideas as a Markov Decision Process and a reinforcement learning problem
- Provides a general framework for this approach; many RL algorithms can be used for training
Reinforcement learning
- An agent A is in an environment; at time t, A is in state st and takes action at; it receives a reward rt from the environment and moves to state st+1
- The environment, often stochastic, is modeled as a finite-state Markov Decision Process (MDP)
- Goal of A: maximize some long-term reward; there is no notion of a correct input-output pair
- A is "myopic" and must explore the environment in order to estimate its reward
- Typical situations: robotics, two-player games, planning, etc.
Markov Decision Process
An MDP is a tuple (S, A, P, R):
- S is the state space and A is the action space
- P is a transition function describing the dynamics of the environment: P(s, a, s') = P(st+1 = s' | st = s, at = a)
- R is a reward function: R(s, a, s') = E[rt | st+1 = s', st = s, at = a]
Policy
- Distribution: π(s, a) = P(at = a | st = s)
- Immediate reward: when action a is chosen in state st, the agent receives an immediate reward rt
- Cumulative reward from t: Rt = Σ_{k≥0} γ^k r_{t+k}
- Goal: find the policy that maximizes R0
- If P and R are known, the problem is usually solved using DP; when only S and A are known: reinforcement learning
- Direct approach: for each possible policy, sample the reward r; choose the policy with the highest reward
- Value function approach: Bellman equation, E[R | st] = rt + γ E[R | st+1]; use estimates of E[R | st], the value function, and learn a policy that maximizes them
Reinforcement learning: value functions
- Vπ(s) = Eπ[Rt | st = s] and Qπ(s, a) = Eπ[Rt | st = s, at = a] measure "how good it is" to be in state s, or to choose action a when in state s
- A policy π is better than π' if Vπ(s) ≥ Vπ'(s) for all s
- In order to improve π, learn to improve V or Q
Most RL algorithms use the following scheme. Iterate:
- evaluate the utility function V or Q for the current policy
- improve the policy by increasing V or Q
Remark: V and Q are often stored in tables, which is unfeasible for large problems. Instead, use approximate values obtained by regression: Q̂(s, a) = ⟨θ, Φ(s, a)⟩, with Φ(s, a) a vectorial description of the state-action couple.
Prototype RL algorithm
- Initialize θ
- Repeat until convergence:
  - choose an initial state s and an action a (stochastically)
  - while the final state is not reached:
    - take action a, observe the reward r and the next state s'
    - learn θ to improve Q from this feedback
    - s ← s', a ← a'
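A concrete instance of this scheme is SARSA with the linear approximation Q̂(s, a) = ⟨θ, Φ(s, a)⟩ from the previous slide (a sketch; the environment interface is an assumption):

```python
import random
import numpy as np

def sarsa_linear(env, phi, actions, dim, episodes=1000,
                 alpha=0.1, gamma=0.9, epsilon=0.1):
    """SARSA with Q(s, a) = <theta, phi(s, a)>.
    env.reset() -> state; env.step(state, action) -> (reward, next_state, done)."""
    theta = np.zeros(dim)
    q = lambda s, a: theta @ phi(s, a)
    def pick(s):                                    # epsilon-greedy policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))
    for _ in range(episodes):
        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            r, s2, done = env.step(s, a)
            a2 = pick(s2) if not done else None
            target = r + (gamma * q(s2, a2) if not done else 0.0)
            theta += alpha * (target - q(s, a)) * phi(s, a)   # TD update
            s, a = s2, a2
    return theta
```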
Structured outputs as an MDP
- State st: input x + partial output ŷt; initial state: (x, ∅)
- Actions: task dependent. POS: a new tag for the current word; XML: insert a new path in the partial tree
- Reward: final, R(y, ŷ), or heuristic, r(y, ŷt)
Inference: apply the learned policy to x
Learning: SARSA here (other RL algorithms would work as well)
Example: sequence labeling
- Left-to-right model; actions: label
- Order-free model; actions: label + position
- Loss: Hamming cost or F-score
- Tasks: named entity recognition (shared task at CoNLL 2002; 8,000 train, 1,500 test), noun-phrase chunking (CoNLL 2002), handwritten word recognition (5,000 train, 1,000 test)
- Complexity of inference: O(sequence size × number of labels)
Dependency parsing
- Action: (target word, label)
- Cost function: "labeled attachment" score (number of correct target word + label pairs)
- CoNLL 2007: 10 languages
XML structuring
- Action: attach the path ending with the current leaf to a position in the current partial tree
- Φ(·, ·) encodes a series of potential (state, action) pairs
- Loss: F-score for trees
Example: HTML input and XML target

Input document (HTML tree):
- HTML → HEAD → TITLE: "Example"
- HTML → BODY → IT: "Francis MAES"
- HTML → BODY → H1: "Title of the section"
- HTML → BODY → P: "Welcome to INEX"
- HTML → BODY → FONT: "This is a footnote"

Target document (XML tree):
- DOCUMENT → TITLE: "Example"
- DOCUMENT → AUTHOR: "Francis MAES"
- DOCUMENT → SECTION → TITLE: "Title of the section"
- DOCUMENT → SECTION → TEXT: "Welcome to INEX"

[Figures: a sequence of slides builds the target tree incrementally, one action per input leaf: attach "Example" under TITLE, "Francis MAES" under AUTHOR, "Title of the section" under SECTION/TITLE, "Welcome to INEX" under SECTION/TEXT, and the footnote leaf last.]
Results
Summary on search methods
- Learn to explore the state space of the problem
- An alternative to DP or to classical search algorithms
- Can be used with any decomposable cost function
Conclusion
Other approaches:
- Y. LeCun (2006): energy-based models
- J. Weston (2007): regression
- Cohen (2006): stacking
- ...