Learning structured outputs
www-connex.lip6.fr
Université Pierre et Marie Curie – Paris – FR
NATO ASI
Mining Massive Data Sets for Security
Outline
- Motivation and examples
- Approaches for structured learning: generative models, discriminant models, search models
Machine learning and structured data
Different types of problems:
- model, classify, cluster structured data
- predict structured outputs
- learn to associate structured representations
Structured data and applications arise in many domains: chemistry, biology, natural language, the Web, social networks, databases, etc.
Sequence labeling: POS tagging

This/DT Workshop/NN brings/VBZ together/RB scientists/NNS and/CC engineers/NNS
interested/VBN in/IN recent/JJ developments/NNS in/IN exploiting/VBG Massive/JJ
data/NP sets/NP

(DT = determiner, NN = noun, VBZ = verb 3rd person singular, RB = adverb, NNS = plural noun, CC = coordinating conjunction, VBN = past participle, IN = preposition, JJ = adjective, VBG = verb gerund, NP = proper noun)
PENN tag set
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential "there"
5. FW Foreign word
6. IN Preposition / subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO "to"
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund / present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
37. # Pound sign
38. $ Dollar sign
39. . Sentence-final punctuation
40. , Comma
41. : Colon, semi-colon
42. ( Left bracket character
43. ) Right bracket character
44. " Straight double quote
45. ` Left open single quote
46. `` Left open double quote
47. ' Right close single quote
48. '' Right close double quote
Segmentation + labeling: syntactic chunking (Washington Univ. tagger)

[NP This Workshop] [VP brings] [ADVP together] [NP scientists and engineers]
[VP interested] [IN in] [NP recent developments] [PNP in exploiting Massive data sets]

(NP = noun phrase, VP = verb phrase, ADVP = adverbial phrase, PNP = prepositional noun phrase)
Segmentation + labeling: Named Entity Recognition
- Entities: locations, persons, organizations
- Time expressions: dates, times
- Numeric expressions: $ amounts, percentages
NEW YORK (Reuters) - Goldman Sachs Group Inc. agreed on Thursday to pay $9.3 million to settle charges related to a former economist …. Goldman's GS.N settlement with securities regulators stemmed from charges that it failed to properly oversee John Youngdahl, a one-time economist …. James Comey, U.S. Attorney for the Southern District of New York, announced on Thursday a seven-count indictment of Youngdahl for insider trading, making false statements, perjury, and other charges. Goldman agreed to pay a $5 million fine and disgorge $4.3 million from illegal trading profits.
Information extraction

Example: the NATO ASI web page.

NATO Advanced Study Institute on Mining Massive Data Sets for Security
September 10-21, 2007, Villa Cagnola, Gazzada, Italy

NATO ASI Announcement: This Workshop brings together scientists and engineers interested in recent developments in exploiting Massive Data Sets. Emphasis is placed on available techniques and their application to security-critical applications...

Lecturers: C. Best, L. Bottou, R. Feldman, F. Fogelman-Soulié, P. Gallinari, E. Glover, L. Giles, A. Gionis, I. Guyon, D. Hand, G. Hébrail, F. Provost, N. Tishby, V. Vapnik, D. Wilkinson

Objective: Today our world is awash in data and we live in an Information Society where every action leaves a trace, generating massive amounts of data. Recent scientific developments provide technologies to exploit these huge amounts of data and extract critical information from them...

Directors: Clive Best (JRC, IT), Françoise Fogelman Soulié (Kxen, FR), Patrick Gallinari (Université Paris 6, FR), Naftali Tishby (Hebrew University, IL)

Important dates:
- Deadline for submission of the application form: June 24, 2007 (extended)
- Notification of acceptance: June 30, 2007 (new)
- Deadline for the accommodation form: July 1, 2007
- NATO ASI MMDSS: September 10-21, 2007
Syntactic parsing (Stanford Parser)

[Figure: a parse tree produced by the Stanford Parser.]
Document mapping problem
- Problem: querying heterogeneous XML databases or collections requires knowing the correspondence between the structured representations, which is usually established by hand
- Goal: learn the correspondence between the different sources
Labeled tree mapping problem:

<Restaurant>
  <Name>La cantine</Name>
  <Address>65 rue des pyrénées, Paris, 19ème, FRANCE</Address>
  <Specialities>Canard à l'orange, Lapin au miel</Specialities>
</Restaurant>

<Restaurant>
  <Name>La cantine</Name>
  <Address>
    <City>Paris</City>
    <Street>pyrénées</Street>
    <Num>65</Num>
  </Address>
  <Dishes>Canard à l'orange</Dishes>
  <Dishes>Lapin au miel</Dishes>
</Restaurant>
Other applications
- Taxonomies
- Social networks
- Adversarial computing: Web spam, blog spam, ...
- Translation
- Biology
- ...
Is structure really useful? Can we make use of structure?
- Yes: there is evidence from many domains and applications, and structure is mandatory for many problems (e.g. a classification problem with 10K classes)
- Yes, but: complex or long-term dependencies often correspond to rare events, and practical evidence on large-size problems shows that simple models sometimes offer competitive results (information retrieval, speech recognition, etc.)
Structured learning
- X, Y: input and output spaces
- Structured output: y ∈ Y decomposes into parts of variable size, y = (y1, y2, ..., yT)
- Dependencies: relations between the parts of y; local, long-term, global
- Cost functions:
  - 0/1 loss: ∆(y*, ŷ) = 1 if y* ≠ ŷ, else 0
  - Hamming loss: ∆(y*, ŷ) = Σ_{i=1..T} 1[y*_i ≠ ŷ_i]
  - F-score, BLEU, etc.
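A minimal Python sketch of these two losses (illustrative, not from the original slides):

```python
def zero_one_loss(y_star, y_hat):
    """0/1 loss: 1 if the whole structured output differs, else 0."""
    return int(y_star != y_hat)

def hamming_loss(y_star, y_hat):
    """Hamming loss: number of parts y_i that differ."""
    return sum(a != b for a, b in zip(y_star, y_hat))

# Under the 0/1 loss, one wrong part costs as much as a fully wrong output:
print(zero_one_loss(("DT", "NN", "VBZ"), ("DT", "NN", "VB")))  # -> 1
print(hamming_loss(("DT", "NN", "VBZ"), ("DT", "NN", "VB")))   # -> 1
```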
General approach
- Predictive approach: y* = f(x) = argmax_{y ∈ Y} F(x, y, θ), where F: X × Y → R is a score function used to rank potential outputs
- F is trained to optimize some loss function
- Inference problem: |Y| is sometimes exponential, so the argmax is often intractable. Two common hypotheses make it tractable:
  - decomposability of the score function over the parts of y: y* = f(x) = argmax_{y ∈ Y} Σi F(x, yi, θ)
  - a restricted set of outputs
Structured algorithms differ by:
- feature encoding
- hypothesis on the output structure
- hypothesis on the cost function
Generative models
- Hidden Markov Models
- Probabilistic Context Free Grammars
- Tree labeling model
Usual hypotheses
- Features: a "natural" encoding of the input
- Output structure: local output dependencies, Markov property
- The score decomposes, e.g. as a sum of local costs over the subparts
- Inference: usually dynamic programming
HMMs
- Sequence labeling and segmentation
- Dependencies:
  - outputs, Markov assumption: p(qt | q1, ..., qt−1) = p(qt | qt−1)
  - observations: p(xt | x1, ..., xt−1, q1, ..., qt) = p(xt | qt)
- Decoding and learning: dynamic programming (Viterbi for the argmax, Forward-Backward for learning)
- Decoding complexity: O(n|Q|²) for a sequence of length n and |Q| states
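To make the O(n|Q|²) decoding concrete, here is a minimal Viterbi sketch in Python (the array layout and log-space computation are illustrative choices, not from the slides):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state sequence of an HMM, in O(n |Q|^2).
    pi: initial probabilities (|Q|,); A: transition matrix (|Q|, |Q|);
    B: emission matrix (|Q|, |V|); obs: sequence of observation indices."""
    n, Q = len(obs), len(pi)
    delta = np.zeros((n, Q))            # best log-prob of a path ending in q at t
    psi = np.zeros((n, Q), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(A)   # (from state, to state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]                 # backtrack from best end state
    for t in range(n - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```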
Consider a simple HMM with a Start state.

[Figure: the state space (trellis) unrolled for an input sequence of size 3.]
Probabilistic Context Free Grammars (after Manning & Schütze)
- A set of terminals {w1, ..., wv}
- A set of non-terminals {N1, ..., Nn}; N1 is the start symbol
- A set of rules {Ni → ζi}, with ζi a sequence of terminals and non-terminals
- To each rule is associated a probability P(Ni → ζi)
- Special case, Chomsky Normal Form grammars: ζi = wj or ζi = Nk Nm
Example grammar:
S → NP VP    1.0      NP → NP PP        0.4
PP → P NP    1.0      NP → astronomers  0.1
VP → V NP    0.7      NP → ears         0.18
VP → VP PP   0.3      NP → saw          0.04
P → with     1.0      NP → stars        0.18
V → saw      1.0      NP → telescopes   0.1

The two parses of "astronomers saw stars with ears":
(S (NP astronomers) (VP (VP (V saw) (NP stars)) (PP (P with) (NP ears))))
(S (NP astronomers) (VP (V saw) (NP (NP stars) (PP (P with) (NP ears)))))
Notations
- Sentence: Wp,q = wp wp+1 ... wq
- Ni dominates the sequence Wp,q if Ni may rewrite to wp wp+1 ... wq
Assumptions:
- Context-freeness: the probability of a subtree does not depend on the words outside the subtree
- Independence from ancestors: the probability does not depend on the nodes of the derivation outside the subtree

[Figure: a node Nj dominating the span wp ... wq.]
Inside and outside probabilities
As with the forward-backward variables in HMMs, two probabilities may be defined:
- Inside: the probability of generating wk ... wl starting from Nj,
  βj(k, l) =def p(Wk,l | Nj over span k..l)
- Outside: the probability of generating Nj and all the words outside wk ... wl,
  αj(k, l) =def p(W1,k−1, Nj over span k..l, Wl+1,n)
Probability of a sentence: the CKY algorithm
- Probability of the sentence: p(w1,n) = β1(1, n)
- Initialization: βj(k, k) = p(Nj → wk)
- Left-to-right induction on the sequence: for k = 1..n, for l = k+1..n, compute
  βj(k, l) = Σ_{p,q} Σ_{m=k..l−1} P(Nj → Np Nq) βp(k, m) βq(m+1, l)
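A small Python sketch of this inside computation, using the CNF grammar of the earlier slide (the dictionary-based grammar encoding is an illustrative assumption):

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """p(sentence) for a PCFG in Chomsky Normal Form.
    lexical: {(A, word): prob} for rules A -> word;
    binary:  {(A, B, C): prob} for rules A -> B C."""
    n = len(words)
    beta = defaultdict(float)                 # beta[(A, k, l)] spans words[k:l]
    for k, w in enumerate(words):             # initialization: length-1 spans
        for (A, word), p in lexical.items():
            if word == w:
                beta[(A, k, k + 1)] += p
    for span in range(2, n + 1):              # induction: longer spans
        for k in range(n - span + 1):
            l = k + span
            for (A, B, C), p in binary.items():
                for m in range(k + 1, l):     # split point
                    beta[(A, k, l)] += p * beta[(B, k, m)] * beta[(C, m, l)]
    return beta[(start, 0, n)]

binary = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3}
lexical = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18, ("NP", "saw"): 0.04,
           ("NP", "stars"): 0.18, ("NP", "telescopes"): 0.1,
           ("P", "with"): 1.0, ("V", "saw"): 1.0}
print(inside_probability("astronomers saw stars with ears".split(),
                         lexical, binary))    # sums over both parses
```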
Inference and learning
- Inference: similar to the probability of a sentence, with max in place of Σ
- Complexity: O(m³ n³), with n the length of the sentence and m the number of non-terminals in the grammar
- Learning: inside-outside; each step is O(m³ n³)
Tree generative models
- Classification / clustering of structured documents (Denoyer et al. 2004)
- Document annotation / conversion (Wisniewski et al. 2006)
Context: XML semi-structured documents

[Figure: an XML document tree with nodes <article>, <hdr>, <bdy>, <sec>, <st>, <p>, <fig>, <fgc> and text leaves.]
Document model
A document d = (sd, td) has a structure sd and a content td:
P(D = d) = P(S = sd, T = td) = P(S = sd) P(T = td | S = sd)
where P(S = sd) is the structural probability and P(T = td | S = sd) the content probability.
Scalability is the key concern!
Document model: structure (belief networks)

[Figure: an example document ("Document title"; a first section, "Section title", containing two paragraphs; a second section containing no paragraphs) and the corresponding label tree: Document → Intro, Section, Section; Section → Paragraph, Paragraph.]

Three structural models of increasing complexity:
- independent node labels: P(sd) = Π_{i=1..|d|} P(sd_i)
- conditioned on the parent's label: P(sd) = Π_{i=1..|d|} P(sd_i | label(parent(n_i)))
- conditioned on the parent's and the preceding sibling's labels: P(sd) = Π_{i=1..|d|} P(sd_i | label(parent(n_i)), label(previous(n_i)))
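A minimal sketch of the parent-conditioned variant (the tree encoding and names are illustrative assumptions):

```python
import math

def structural_log_prob(tree, cond):
    """log P(s_d) under P(s_d) = prod_i P(label_i | label(parent_i)).
    tree: (label, [children]) pairs; cond: {(label, parent_label): prob};
    the root label is taken as given."""
    def walk(node, parent_label):
        label, children = node
        lp = math.log(cond[(label, parent_label)])
        return lp + sum(walk(c, label) for c in children)
    root_label, children = tree
    return sum(walk(c, root_label) for c in children)

doc = ("Document", [("Intro", []),
                    ("Section", [("Paragraph", []), ("Paragraph", [])])])
cond = {("Intro", "Document"): 0.3, ("Section", "Document"): 0.7,
        ("Paragraph", "Section"): 0.9}
print(structural_log_prob(doc, cond))
```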
Document model: content
- A model for each node, with a first-order dependency: td = (td_1, ..., td_|d|)
- P(td | sd) = Π_{i=1..|d|} P(td_i | sd), with P(td_i | sd) = P(td_i | sd_i)
- A local generative model is used for each label
Final network

[Figure: the belief network Document → (Intro, Section, Section), Section → (Paragraph, ...), with text leaves attached to the nodes: T1 = "This document is an example of a tree-structured document" (Intro), T2 = "This is the first section of the document" (Section), T3 = "The first paragraph" (Paragraph), T4 = "The second paragraph" (Paragraph), T5 = "The second section" (Section), T6 = "The third paragraph" (Paragraph).]

P(d) = P(Intro | Document) P(Section | Document)² P(Paragraph | Section)³
     × P(T1 | Intro) P(T2 | Section) P(T3 | Paragraph)
     × P(T4 | Paragraph) P(T5 | Section) P(T6 | Paragraph)
Different learning techniques
- Likelihood maximization:
  L = Σ_{d ∈ D_TRAIN} log P(d | θ)
    = Σ_{d ∈ D_TRAIN} log P(sd) + Σ_{d ∈ D_TRAIN} Σ_{i=1..|d|} log P(td_i | sd_i)
    = L_structure + L_content
- Discriminant learning: a logistic function on top of the generative log-likelihoods, e.g. for two classes P(c1 | x) = 1 / (1 + e^{log P(x | c2) − log P(x | c1)}), with log P(x | c) = Σ_{i=1..n} log θ_c(xi, pa(xi)); error minimization
- Fisher kernel
Document mapping problem
- Problem: learn from examples how to map heterogeneous sources onto a predefined target schema, preserving the document semantics
- Sources: semi-structured, HTML, PDF, flat text, etc.
Labeled tree mapping problem, different instances:
- flat text to XML
- HTML to XML
- XML to XML
- ...

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l'orange, Lapin au miel</Spécialités>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l'orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>
Document mapping problem
- Central issue: complexity. Large collections; large feature spaces (10³ to 10⁶ features); large, exponential search space
- Approach: learn generative models of the XML target documents from a training set, then decode unknown sources according to the learned model
Problem formulation
Given:
- ST, a target format
- din, an input document
find the most probable target document:
dST = argmax_{d'ST} P(d'ST | din)
(decoding with a learned transformation model)
General restructuring model
Decompose the transformation into structure and content:
d' = argmax_{d'} P(sd' | sd) P(td' | sd', sd, td)
Example: HTML to XML (tree annotation)
Hypotheses:
- Input document: HTML tags are mostly there for visualization; remove the tags and keep only the segmentation (the leaves)
- Annotation: the leaves are the same in the HTML and in the XML document
- Target document model: a node label depends only on its local context (content, left sibling, father)
Model and training
- Probability of the target tree: P(dST | din) = P(dT_1, ..., dT_|dT|) = Π_i P(ni | ci, sib(ni), father(ni)), where ni is a node label, ci its content, sib(ni) its left sibling and father(ni) its father
- Solve dST = argmax_{d'ST} P(d'ST | din)
- Exact dynamic programming decoding: O(|leaf nodes|³ · |tags|)
- Approximate solution with LASO (Hal Daumé, ICML 2005): O(|leaf nodes| · |tags| · |tree nodes|)
Experiments: HTML to XML
- IEEE collection / INEX corpus: 12K documents; on average 500 leaf nodes, 200 internal nodes, 139 tags
- Movie DB: 10K movie descriptions (IMDB); on average 100 leaf nodes, 35 internal nodes, 28 tags
- Shakespeare: 39 plays; few documents, but on average 4,100 leaf nodes, 850 internal nodes, 21 tags
- Mini-Shakespeare: 60 randomly chosen scenes from the plays; 85 leaf nodes, 20 internal nodes, 7 tags
- For all collections: ½ train, ½ test
Performance

[Figure: performance results on these collections.]
Summary
- 30 years of generative models: hierarchical HMMs, factorial HMMs, etc.
- Local dependency hypotheses, on the outputs and on the inputs
- Inference and learning often use dynamic programming, which is prohibitive for many problems; other methods include loopy propagation and search (e.g. A*)
- Cost function: the joint likelihood, which decomposes
Discriminant models
- Structured Perceptron (Collins 2002)
- Large margin methods (Tsochantaridis et al. 2004, Taskar et al. 2004)
Usual hypotheses
- Joint representation of input and output, Φ(x, y), encoding potential dependencies among and between input and output: e.g. the histogram of state transitions observed in the training set, the frequency of (xi, yj) pairs, POS tags, etc.
- Large feature sets (10² to 10⁴)
- Linear score function: F(x, y, θ) = ⟨θ, Φ(x, y)⟩
- Decomposability of the feature set (over the outputs) and of the loss function
Structured Perceptron (Collins 2002)
- Discriminant model based on a Perceptron variant for sequence labeling
- Initially proposed for POS tagging and chunking; extends to other structured output tasks
- Inference: Viterbi
- Encodes (local) input and output dependencies
- Simple
Algorithm
Training:
- initialize θ = 0
- repeat n times over all training examples (x, y):
  - ŷ = argmax_{s ∈ Y} ⟨θ, Φ(x, s)⟩
  - if ŷ ≠ y, update the parameters: θ ← θ + Φ(x, y) − Φ(x, ŷ)
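The whole training procedure fits in a few lines of Python (a sketch; phi and argmax stand for the joint feature map Φ and a Viterbi-style inference routine, which are assumed to be supplied):

```python
import numpy as np

def structured_perceptron(data, phi, argmax, dim, epochs=5):
    """Collins-style structured perceptron.
    data: list of (x, y) pairs; phi(x, y) -> np.ndarray of size dim;
    argmax(x, theta) -> highest-scoring output y under theta."""
    theta = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            y_hat = argmax(x, theta)        # inference (e.g. Viterbi)
            if y_hat != y:                  # mistake-driven update
                theta += phi(x, y) - phi(x, y_hat)
    return theta
```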
- Inference: dynamic programming
- Restricted to the 0/1 cost
- Convergence and generalization bounds (Freund & Schapire, 1999): the number of mistakes depends only on the margin, not on the size of the output space (the number of potential candidates)
Extension of large margin methods
Two problems:
- generalize the max-margin principle to loss functions other than the 0/1 loss
- the number of constraints is proportional to |Y|, i.e. potentially exponential
SVM ISO (Tsochantaridis et al. 2004)
- Extension of multi-class SVMs
- Principle (example: 0/1 loss for classification, linearly separable problem):
  - we get 0 error if ∀ i = 1..N: ⟨w, Φ(xi, yi)⟩ > max_{y ∈ Y\yi} ⟨w, Φ(xi, y)⟩
  - the problem amounts to solving N non-linear inequalities
  - equivalent problem: solve N × (card(Y) − 1) linear inequalities, ∀ i = 1..N, ∀ y ∈ Y\yi: ⟨w, Φ(xi, yi)⟩ − ⟨w, Φ(xi, y)⟩ > 0
SVM formulation: non-linearly separable case, 0/1 cost, one slack variable per non-linear constraint (Crammer & Singer, 2001):

QP: min_{w, ξ ≥ 0} ½ ‖w‖² + (C/N) Σ_{i=1..N} ξi
with constraints: ∀ i, ∀ y ∈ Y\yi: ⟨w, Φ(xi, yi)⟩ − ⟨w, Φ(xi, y)⟩ ≥ 1 − ξi
Problem 1: extension to a ∆ loss
- For each constraint, penalize examples according to their loss: rescale the slack variables according to the loss incurred in each linear constraint
- New constraints: ∀ i, ∀ y ∈ Y\yi: ⟨w, Φ(xi, yi)⟩ − ⟨w, Φ(xi, y)⟩ ≥ 1 − ξi / ∆(yi, y)
Problem 2: limit the number of constraints
- Done implicitly via the training algorithm: it finds a polynomial number of "active" constraints such that the solution of the QP problem with these constraints alone fulfills all the constraints up to a given precision ε
- The algorithm requires solving an argmax problem at each iteration: Viterbi for sequences, CKY for parsing (see the sketch below)
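A Python sketch of this constraint-generation loop (not the authors' code; phi, loss, loss_aug_argmax and solve_qp are assumed helpers: the joint feature map, the task loss ∆, loss-augmented inference such as Viterbi or CKY, and a QP solver over the current working set; the margin-rescaling check is shown for concreteness):

```python
import numpy as np

def working_set_train(data, phi, loss, loss_aug_argmax, solve_qp,
                      dim, eps=1e-3, max_iters=100):
    """Grow a working set of 'active' constraints until every example's
    most violated constraint is satisfied up to precision eps."""
    constraints = [[] for _ in data]        # working set, one list per example
    w, xi = np.zeros(dim), np.zeros(len(data))
    for _ in range(max_iters):
        added = 0
        for i, (x, y) in enumerate(data):
            y_hat = loss_aug_argmax(x, y, w)          # most violated output
            margin = w @ (phi(x, y) - phi(x, y_hat))
            if loss(y, y_hat) - margin > xi[i] + eps: # violated beyond slack
                constraints[i].append(y_hat)
                added += 1
        if added == 0:                                # eps-feasible: done
            break
        w, xi = solve_qp(constraints)                 # re-optimize the QP
    return w
```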
M3N (Taskar et al. 2003)
- Combines a probabilistic model (a Markov network) with a max-margin formulation
- Solution to problem 1: margin rescaling
- Solution to problem 2: the structure of the Markov network limits the number of constraints (e.g. a chain network for sequences)
Summary: discriminant approaches
- Hypotheses: local dependencies for the output, decomposability of the loss function; long-term dependencies in the input are allowed
- Nice convergence properties and bounds
- Complexity: learning often does not scale
Incremental learning: learning to search solution spaces
- Incremental parsing (Collins 2004)
- SEARN (Daumé et al. 2006)
- Reinforcement learning (Maes et al. 2007)
General ideas
- Incremental construction of the output ŷ through actions
- Decisions: choose a subset of actions
- Learn how to explore the state space of the problem
Incremental parsing (Collins & Roark, 2004)
- Build the parse tree of a sentence incrementally; inference is a greedy beam algorithm (sketched below)
- Read the sentence from left to right; step i corresponds to the i-th word
- At step i, candidate partial parse trees are generated for the first i words, then scored and ranked; a subset of the candidates is selected and used to generate the next set of candidates
- The final output is the best-scored parse tree at the end of the sentence
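A Python sketch of this beam inference (illustrative; extend and score stand for the candidate generator and the learned ranking function F):

```python
def beam_parse(sentence, extend, score, beam_width=10):
    """Greedy beam inference for incremental parsing.
    extend(tree, word) yields partial trees extended with the next word
    (None denotes the empty parse); score(prefix, tree) ranks candidates."""
    beam = [None]                                    # start from the empty parse
    for i, word in enumerate(sentence):
        candidates = [t for tree in beam for t in extend(tree, word)]
        candidates.sort(key=lambda t: score(sentence[:i + 1], t), reverse=True)
        beam = candidates[:beam_width]               # keep the best-scored subset
    return beam[0]                                   # best full parse tree
```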
More details
- Ranking function F = ⟨Φ(x, y), θ⟩, learned from a training set with an adaptive algorithm (Perceptron)
- Input at step i: a joint vectorial representation of the input sentence and the partial parse tree
- No need for dynamic programming
How to build the sequence of partial trees, from Y(i) to Y(i+1): at step i,
- consider word xi
- action = try to attach each chain ending with xi to any attachment site
- the grammar G includes some constraints:
  - allowable chains: only derivation chains appearing in the data set are allowed
  - attachment sites: places in the partial tree where a chain can be attached (also inferred from the data)
[Figure: incremental construction of partial parse trees for "astronomers saw stars": from the empty tree ε, attach the NP chain for "astronomers", then the V/VP chain for "saw", then extend with "stars".]
SEARN (Daumé et al. 2006)
- Introduced the idea of learning how to explore a search space
- Hypothesis: the structured output can be built incrementally, ŷ = (ŷ1, ŷ2, ..., ŷT), and it will be built via machine learning
- Loss: C = E_X[∆(y, ŷ)]
- Goal: construct ŷ incrementally so as to minimize the loss; learn how to search the solution space
Example: sequence labelling
- Two labels, R and B
- Search space: (input sequence, {sequences of labels})
- For a sequence of size 3, x = x1 x2 x3, a node represents a state in the search space
Example: expected loss

[Figure: the full search tree for a sequence of size 3 and a given target; each transition carries a local cost C ∈ {0, 1} and each leaf a total cost CT ∈ {0, 1, 2, 3}.]

The loss does not always separate!
Example: state-space exploration guided by local costs

[Figure: the same search tree; exploration follows the transitions with local cost C = 0, reaching the leaf with total cost CT = 0.]

Goal: generalize to unseen situations.
Inference
- Suppose we have a policy function F which decides at each step which action to take
- Inference is performed by computing ŷ1 = F(x, ·), ŷt = F(x, ŷ1, ..., ŷt−1), ..., ŷT = F(x, ŷ1, ..., ŷT−1); the output is ŷ = (ŷ1, ..., ŷT)
- No dynamic programming needed
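In code, this inference loop is a few lines of Python (a sketch; policy stands for the learned F):

```python
def greedy_inference(x, policy, length):
    """Build y_hat part by part with a learned policy; no DP involved."""
    y_hat = []
    for _ in range(length):
        y_hat.append(policy(x, tuple(y_hat)))   # one decision per step
    return tuple(y_hat)
```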
Training
- F is implemented with a classifier, F: {states} → {actions}
- Learn a classifier F such that, at each step, F takes the "optimal" decision
- Incremental algorithm:
  - the first classifier F1 learns from the optimal path, assumed known at training time (on its own, a bad solution for generalization)
  - at iteration i, classifier Fi learns from the decisions of Fi−1
- Let Fi be the current classifier. For input x, at each state st there are 2 possible actions; compute the expected cost associated with actions a1 and a2
- The best action is labeled 0, the other 1: these are the targets for the classifier
- When the final state is reached, we get a set of training examples for Fi+1

[Figure: a trajectory s0 → s1 → s2 → s3; at each state st, the two actions a1 and a2 carry expected costs CF(st, a1) and CF(st, a2).]
Training algorithm
- Initialize F with a good initial policy
- For each example x:
  - compute its prediction ŷ = (ŷ1, ŷ2, ..., ŷT(x)) using the current F, and let (s1, s2, ..., sT(x)) be the corresponding states
  - for each state st, compute the state representation Φ(st, x) and, for each possible action a, the expected loss cF(st, a)
  - this yields a set of training examples for the policy classifier: for each x and each st, {(a1, cF(st, a1)), ..., (a|A|, cF(st, a|A|))}
- Train a classifier F' to predict the best action, update the current classifier F with F', and iterate
SEARN hypotheses
- Decomposability of y: y can be built from successive parts
- Decomposability of the training loss
- An optimal policy is available at training time
Remarks:
- No Markov assumption and no need for DP: ŷ is built by successive applications of the policy
- Fast
- Can accommodate a large number of cost functions
Reinforcement-learning search (Maes 2007)
- Formalizes the search-based ideas as a Markov Decision Process and a reinforcement learning problem
- Provides a general framework for this approach; many RL algorithms can be used for training
Reinforcement learning
- An agent A is in an environment; at time t, A is in state st and takes action at; it receives a reward rt from the environment and moves to state st+1
- The environment, often stochastic, is modeled as a finite-state Markov Decision Process (MDP)
- Goal of A: maximize some long-term reward; there is no notion of a correct input-output pair
- A is "myopic" and must explore the environment in order to estimate its reward
- Typical situations: robotics, two-player games, planning, etc.
Markov Decision Process
An MDP is a tuple (S, A, P, R):
- S is the state space and A is the action space
- P is a transition function describing the dynamics of the environment: P(s, a, s') = P(st+1 = s' | st = s, at = a)
- R is a reward function: R(s, a, s') = E[rt | st+1 = s', st = s, at = a]
Policy
- Distribution: π(s, a) = P(at = a | st = s)
- Immediate reward: when action a is chosen in state st, the agent receives an immediate reward rt
- Cumulative reward from t: Rt = Σ_{k≥0} γ^k r_{t+k}
- Goal: find the policy that maximizes R0
- If P and R are known, the problem is usually solved using DP; when only S and A are known: reinforcement learning
- Direct approach: for each possible policy, sample the reward r; choose the policy with the highest reward
- Value function approach: Bellman equation, E[R | st] = rt + γ E[R | st+1]; use estimates of E[R | st], the value function, and learn a policy that maximizes them
Reinforcement learning: value functions
- Vπ(s) = Eπ[Rt | st = s] and Qπ(s, a) = Eπ[Rt | st = s, at = a] measure "how good it is" to be in state s, or to choose action a when in state s
- A policy π is better than π' if Vπ(s) ≥ Vπ'(s) for all s
- In order to improve π, learn to improve V or Q
Most RL algorithms use the following scheme. Iterate:
- evaluate the utility function V or Q for the current policy
- improve the policy by increasing V or Q
Remark: V and Q are often stored in tables, which is unfeasible for large problems. Instead, use approximate values obtained by regression: Q̂(s, a) = ⟨θ, Φ(s, a)⟩, with Φ(s, a) a vectorial description of the state-action couple.
Prototype RL algorithm
- Initialize θ
- Repeat until convergence:
  - choose an initial state s and an action a (stochastically)
  - while the final state is not reached:
    - take action a, observe the reward r and the next state s'
    - learn θ to improve Q from this feedback
    - s ← s', a ← a'
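A concrete instance of this scheme is SARSA with the linear approximation Q̂(s, a) = ⟨θ, Φ(s, a)⟩ from the previous slide (a sketch; the environment interface is an assumption):

```python
import random
import numpy as np

def sarsa_linear(env, phi, actions, dim, episodes=1000,
                 alpha=0.1, gamma=0.9, epsilon=0.1):
    """SARSA with Q(s, a) = <theta, phi(s, a)>.
    env.reset() -> state; env.step(state, action) -> (reward, next_state, done)."""
    theta = np.zeros(dim)
    q = lambda s, a: theta @ phi(s, a)
    def pick(s):                                    # epsilon-greedy policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))
    for _ in range(episodes):
        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            r, s2, done = env.step(s, a)
            a2 = pick(s2) if not done else None
            target = r + (gamma * q(s2, a2) if not done else 0.0)
            theta += alpha * (target - q(s, a)) * phi(s, a)   # TD update
            s, a = s2, a2
    return theta
```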
Structured outputs as an MDP
- State st: input x + partial output ŷt; initial state: (x, ∅)
- Actions: task dependent. POS: a new tag for the current word; XML: insert a new path in the partial tree
- Reward: final, R(y, ŷ), or heuristic, r(y, ŷt)
Inference: apply the learned policy to x
Learning: SARSA here (other RL algorithms would work as well)
Example: sequence labeling
- Left-to-right model; actions: label
- Order-free model; actions: label + position
- Loss: Hamming cost or F-score
- Tasks: named entity recognition (shared task at CoNLL 2002; 8,000 train, 1,500 test), noun-phrase chunking (CoNLL 2002), handwritten word recognition (5,000 train, 1,000 test)
- Complexity of inference: O(sequence size × number of labels)
Dependency parsing
- Action: (target word, label)
- Cost function: "labeled attachment" score (number of correct target word + label pairs)
- CoNLL 2007: 10 languages
XML structuring
- Action: attach the path ending with the current leaf to a position in the current partial tree
- Φ(·, ·) encodes a series of potential (state, action) pairs
- Loss: F-score for trees
Example: HTML input and XML target

Input document (HTML tree):
- HTML → HEAD → TITLE: "Example"
- HTML → BODY → IT: "Francis MAES"
- HTML → BODY → H1: "Title of the section"
- HTML → BODY → P: "Welcome to INEX"
- HTML → BODY → FONT: "This is a footnote"

Target document (XML tree):
- DOCUMENT → TITLE: "Example"
- DOCUMENT → AUTHOR: "Francis MAES"
- DOCUMENT → SECTION → TITLE: "Title of the section"
- DOCUMENT → SECTION → TEXT: "Welcome to INEX"

[Figures: a sequence of slides builds the target tree incrementally, one action per input leaf: attach "Example" under TITLE, "Francis MAES" under AUTHOR, "Title of the section" under SECTION/TITLE, "Welcome to INEX" under SECTION/TEXT, and the footnote leaf last.]
Results
Summary on search methods
- Learn to explore the state space of the problem
- An alternative to DP or to classical search algorithms
- Can be used with any decomposable cost function
Conclusion
Other approaches:
- Y. LeCun (2006): energy-based models
- J. Weston (2007): regression
- Cohen (2006): stacking
- ...