CGMIL 2008 - Hyderabad - India

An Italian-English dependency parser and its [possible] application to Hindi

Leonardo Lesmo (lesmo@di.unito.it)

Natural Language Processing Group (Dip. Informatica – Univ. Torino)

(http://www.di.unito.it/gull)

OUTLINE

The Turin University Parser
Performances
The Turin University Treebank (TUT)
Mapping between TUT and AnnCorra
Current activities and the future

THE PARSER

[Diagram of the parser architecture. Labels: Dictionary, Morphology, Lexical access, Tagging rules, POS tagging, Chunking rules, Chunking, Analysis of Conjunctions, Segmentation, Verbal subcategories, Verbal frames, Verbal Attachment, Post-processing]

AN EXAMPLE

When the man that you mentioned sent me that beautiful message, I fell in love with him

chunking:
When [the man] that you mentioned sent me [that beautiful message], I fell [in love] [with him]

segmentation:
{{When [the man] {that you mentioned} sent me [that beautiful message]}, I fell [in love] [with him] }

caseframing
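The bracket notation of this example can be read mechanically. The following is a minimal sketch, assuming that "[ ]" marks a chunk and "{ }" marks a segment (here called a clause); the function name `parse_brackets` and the node layout are hypothetical and are not taken from the parser itself.

```python
import re

# One token is either a single bracket or a run of non-space, non-bracket characters.
TOKEN_RE = re.compile(r"[\[\]{}]|[^\s\[\]{}]+")
CLOSER = {"[": "]", "{": "}"}
# Assumption: square brackets come from chunking, curly brackets from segmentation.
KIND = {"[": "chunk", "{": "clause"}


def parse_brackets(text):
    """Turn the bracketed string into nested lists of words and nodes."""
    root = []
    stack = [root]      # stack of item lists currently open
    expected = []       # closing brackets still awaited, innermost last
    for tok in TOKEN_RE.findall(text):
        if tok in CLOSER:
            node = {"kind": KIND[tok], "items": []}
            stack[-1].append(node)
            stack.append(node["items"])
            expected.append(CLOSER[tok])
        elif tok in ("]", "}"):
            if not expected or expected.pop() != tok:
                raise ValueError(f"unbalanced {tok!r}")
            stack.pop()
        else:
            stack[-1].append(tok)
    if expected:
        raise ValueError("missing closing bracket")
    return root


SEGMENTED = ("{{When [the man] {that you mentioned} sent me "
             "[that beautiful message]}, I fell [in love] [with him] }")
tree = parse_brackets(SEGMENTED)
```

The outer clause contains the subordinate clause headed by "When", which in turn contains the relative clause "that you mentioned"; chunks stay as flat word lists.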

THE FINAL RESULT

[Figure: dependency tree of the example sentence, rooted in "to fall". Nodes: to fall, I, when, to send, the, man, that, you, to mention, me, that, beautiful, message, in, love, with, him. Arc labels: verb-subj, verb-obj, verb-indobj, verb+fin-rmod-time, conj-arg, det+def-arg, adjc+qualif-rmod, verb-rmod+relcl, prep-arg, rmod, verb-indcomp*locut]

THE ACTUAL FORMAT

1 When (WHEN CONJ SUBORD TIME) [14;VERB+FIN-RMOD-TIME]
2 the (THE ART DEF ALLVAL ALLVAL) [7;VERB-SUBJ]
3 man (MAN NOUN COMMON M SING) [2;DET+DEF-ARG]
4 that (THAT PRON RELAT ALLVAL ALLVAL LSUBJ+OBL) [6;VERB-OBJ]
5 you (YOU PRON PERS ALLVAL ALLVAL 2 LSUBJ+LOBJ+LIOBJ+OBL) [6;VERB-SUBJ]
6 mentioned (MENTION VERB MAIN IND PAST ALLVAL ALLVAL) [3;VERB-RMOD-RELCL]
7 sent (SEND VERB MAIN IND PAST ALLVAL ALLVAL) [1;CONJ-ARG]
8 me (I PRON PERS ALLVAL SING 1 LOBJ+LIOBJ+OBL) [7;VERB-INDCOMPL-THEME]
9 that (THAT ADJ DEMONS ALLVAL SING) [7;VERB-OBJ]
10 beautiful (BEAUTIFUL ADJ QUALIF ALLVAL ALLVAL) [11;ADJC+QUALIF-RMOD]
11 message (MESSAGE NOUN COMMON N SING) [9;DET+DEF-ARG]
12 , (#\, PUNCT) [14;SEPARATOR]
13 I (I PRON PERS ALLVAL SING 1 LSUBJ) [14;VERB-SUBJ]
14 fell (FALL VERB MAIN IND PAST ALLVAL ALLVAL) [0;TOP-VERB]
15 in (IN PREP MONO) [14;PREP-RMOD]
16 love (LOVE NOUN COMMON N SING) [15;PREP-ARG]
17 with (WITH PREP MONO) [14;PREP-RMOD]
18 him (HE PRON PERS M SING 3 LOBJ+LIOBJ+OBL) [17;PREP-ARG]
19 . (#\. PUNCT) [14;END]
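Each line of this format carries a position, the word form, a parenthesised list of lexical features, and a bracketed "head;LABEL" pair. A minimal sketch of a reader for such lines follows; the `Token` type and the function `parse_line` are hypothetical names, and the sketch assumes plain integer positions (it does not try to handle any other kind of index the treebank may use, for instance for traces).

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    index: int        # position in the sentence
    form: str         # word as it appears in the text
    lemma: str        # first item of the feature list
    category: str     # second item of the feature list (POS)
    features: tuple   # remaining items (subtype, gender, number, ...)
    head: int         # position of the governing word, 0 for the root
    label: str        # arc label


LINE_RE = re.compile(r"^(\d+)\s+(\S+)\s+\((.*)\)\s+\[(\d+);([^\]]+)\]$")


def parse_line(line):
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    index, form, feats, head, label = match.groups()
    lemma, category, *rest = feats.split()
    return Token(int(index), form, lemma, category, tuple(rest), int(head), label)


SAMPLE = [
    "13 I (I PRON PERS ALLVAL SING 1 LSUBJ) [14;VERB-SUBJ]",
    "14 fell (FALL VERB MAIN IND PAST ALLVAL ALLVAL) [0;TOP-VERB]",
    "15 in (IN PREP MONO) [14;PREP-RMOD]",
    "16 love (LOVE NOUN COMMON N SING) [15;PREP-ARG]",
]
tokens = [parse_line(line) for line in SAMPLE]
root = next(tok for tok in tokens if tok.head == 0)
```

In the sample, `root` is the token for "fell", the only one whose head is 0.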

Results: Evalita 2007

LAS     UAS     LAS2    Participant
86.94   90.90   91.59   UniTo_Lesmo
77.88   88.43   83.00   UniPi_Attardi
75.12   85.81   82.05   IIIT_Mannem
74.85   85.88   81.59   UniStuttIMS_Schielen
*       85.46   *       UPenn_Champollion
47.62   62.11   54.90   UniRoma2_Zanzotto

LAS: Labeled Attachment Score
UAS: Correct Attachment Score
LAS2: Correct Label Score
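The three scores can be illustrated on a toy sentence. This is a minimal sketch, assuming that every token counts (the slides do not say whether the Evalita scorer excluded punctuation or traces); `attachment_scores` is a hypothetical helper, not the official evaluation script.

```python
def attachment_scores(gold, predicted):
    """Compare two analyses given as lists of (head, label) pairs, one per token."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("analyses must be non-empty and of equal length")
    total = len(gold)
    # UAS: the head is right, whatever the label.
    heads_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS2: the label is right, whatever the head.
    labels_ok = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    # LAS: both head and label are right.
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return {
        "LAS": 100.0 * both_ok / total,
        "UAS": 100.0 * heads_ok / total,
        "LAS2": 100.0 * labels_ok / total,
    }


# "I fell in love", with positions renumbered from 1.
gold = [(2, "VERB-SUBJ"), (0, "TOP-VERB"), (2, "PREP-RMOD"), (3, "PREP-ARG")]
predicted = [(2, "VERB-SUBJ"), (0, "TOP-VERB"), (2, "VERB-OBJ"), (2, "PREP-ARG")]
scores = attachment_scores(gold, predicted)
```

Here one token has the right head but the wrong label and another the right label but the wrong head, so UAS and LAS2 are both 75.0 while LAS drops to 50.0.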

Comparison with CoNLL

                       LAS              UAS
Participant            CoNLL   EPT      CoNLL   EPT
UniPi_Attardi          81.34   77.88    85.54   88.43
IIIT_Mannem            78.67   75.12    82.91   85.81
UniStuttIMS_Schielen   80.46   74.85    84.54   85.88

CoNLL: International contest for dependency parsers (multilanguage)

The Turin University Treebank (TUT)

• Current size:
Italian: 2200 sentences, 62445 tokens (4635 traces; 6704 punctuation)
English: 150 sentences, 4250 tokens (253 traces; 513 punctuation)
English not yet online (under test)

Parts of Speech (and “subtypes”)

1. ADJ (adjectives)
- DEITT (deictic) next
- DEMONS (demonstrative) such, this, that
- EXCLAM (exclamative)
- INDEF (indefinite) numerous, certain, few
- INTERR (interrogative) what, which
- ORDIN (ordinal) first, twentieth, last
- ORDINSUFF (ordinal suffixes) nd, rd, th, st
- POSS (possessive) my, your, their
- QUALIF (qualificative) nice, big, English

2. ADV (adverbs)
- ADFIRM (affirmative)
- ADVERS (adversative) although, though
- COMPAR (comparative) less, more
- CONCESS (concessive) also
- DOUBT (doubt) perhaps
- EXPLIC (explicative) that_is
- INTERJ (interjections) at_any_rate
- INTERR (interrogative) how, where, when, why
- LIMIT (limit) just, only
- LOC (locative) there, within, below, here
- MANNER (manner) aloud, alright, well
- NEG (negation) not
- QUANT (quantification) little, rather, too
- REASON (motivation) in_fact
- STRENG (strengthening) even, moreover
- SUPERL (superlative) most
- TIME (time) sometime, afterward, already

Parts of Speech (and “subtypes”) 2

3. ART (articles)
- DEF (definite) the
- INDEF (indefinite) a, another
- GENITIVE (genitive) 's

4. CONJ (conjunctions)
- COORD (coordinative) and, but, or, neither, nor
- SUBORD (subordinative) since, that, to, unless
- COMPAR (comparative) than

5. DATE (dates) 08/06/2008
6. INTERJ (interjections) alas
7. MARKER (markers)
8. NOUN (nouns)
- COMMON house, boy, chair
- PROPER Mary, Italia, Italy, England

9. NUM (numbers) zero, twenty, 127, 3.14
10. PHRAS (phrasals) yes, no
11. PREDET (predeterminers) all, both
12. PREP (prepositions)
- MONO of, to, from, in
- POLI during, above, under, in front of

Parts of Speech (and “subtypes”) 3

13. PRON (pronouns)
- DEMONS (demonstrative) this, that
- EXCLAM (exclamative) what
- INDEF (indefinite) everything, nobody, something
- INTERR (interrogative) what, who
- LOC (locative) I: ne, ci, vi
- ORDIN (ordinals) first, second, fiftieth
- PERS (personal) I, you, we, her
- POSS (possessive) mine, yours
- REFL-IMPERS (reflexive-impersonal) ci, vi, si, se
- RELAT (relative) that, who, which, where

14. PUNCT (punctuation)
15. SPECIAL (special)
16. VERB (verbs)
- MAIN (all standard verbs) go, eat, give, be (in “to be intelligent”)
- AUX (auxiliaries) be (in “to be kissed”)
- MOD (modals) must, can, will

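The labelling scheme

[Diagram: hierarchy of dependency labels. Nodes: Top, Dependent, Function, Nofunction, Arg, Modifier, Apposition, Rmod. Argument labels: adjc-arg, advb-arg, conj-arg, noun-arg, verb-arg. Verbal argument labels: verb-subj, verb-obj, verb-indobj, verb-indcompl, verb-predcompl]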

The NOFUNCTION labels

Nofunction
- Aux: Aux+passive, Aux+tense, Aux+progressive
- Contin: Contin+denom, Contin+locut, Contin+prep
- Coordinator: Coordantec, Coord, Coord2nd
- Emptycompl
- Interjection
- Separator
- Visitor
- Verb-expletive

Some examples

Auxiliaries

Aux+progressive: I am looking for …

Aux+tense: … the debate has – to quite some extent - suffered from …

Aux+passive: … whose historical experience is not marked by …

Continuations

Contin+locut: … convinced of the feasibility … in order to reinforce …

Contin+prep: … grown out of the millenniums …

Contin+denom: Samuel Alexander asserted …

Visitors (and traces)

The question of what we might consider to be an adequate …

[Figure: dependency tree of the phrase above. Nodes: the, question, of, what, we, might, consider, to, be, an, and four traces. Arc labels: det+def-arg, prep-rmod, prep-arg, verb-rmod+relcl, verb-subj, verb-obj, verb+modal-indcompl, verb-predcompl+obj, visitor]

Coordination

base: … is tautologous and without ontologic commitment …
coord+base, coord2nd+base

compar: … were more like mythical heroes than like the omnipotent God …
coord+compar, coord2nd+compar, coordantec+compar

correlat: … neither John nor his friends …
coord2nd+correlat, coord+correlat, coordantec+correlat

… and “word” traces

compar: … Samuel asserted that mentality emerged … and then t_asserted t_Samuel that …
coord+base, coord2nd+base

The AnnCorra scheme

• It is chunk-based (some elementary subtrees are left unanalysed)

• It involves 28 relations (arc labels) and 25 different POS (table below)

• There are some non-dependency labels (e.g. ccof for coordination)

• Some POS are merged (e.g. Demonstratives include both Adj and Pron)

Mapping category labels

AnnCorra category   Tag    TUT
Common Noun         NN     NOUN (common)
Proper Noun         NNP    NOUN (proper)
Location, Time      NST    ADV (time), ADV (loc)
Pronoun             PRP    PRON except the ones in Demonstrative and Question
Adjective           JJ     ADJ except the ones in Demonstrative and Question
Adverb              RB     ADV (with some exceptions)
Demonstrative       DEM    PRON (demons), ADJ (demons)
Question Words      WQ     ADJ (interr), ADV (interr), PRON RELAT????, PRON (interr)
Main verb           VM     VERB (main)
Verb Aux            VAUX   VERB (aux), VERB (mod)
Post position       PSP    PREP
Particles           RP     None
Conjuncts           CC     CONJ
Quantifiers         QF     DET, PREDET
Cardinal numb       QC     NUM
Ordinal numb        QO     ADJ (ordin), PRON (ordin)
Classifier          CL     None
Intensifier         INTF   ADV (quant)
Interjection        INJ    INTERJ
Negation            NEG    ADV (neg)
Quotative           UT     None
Sym                 SYM    SPECIAL or PUNCT
Compounds           *C     None
Reduplicative       RDP    None
Echo                ECH    None
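Read from the TUT side, the table amounts to a lookup on (category, subtype) with a per-category default. The following is a minimal sketch of that reading; the function name `tut_to_anncorra` and the two-table layout are hypothetical, the DET entry of the Quantifiers row is left out because DET is not among the TUT categories listed earlier, and PRON RELAT is treated as undecided because the table marks it with question marks.

```python
# Specific (category, subtype) pairs, as given in the table.
SPECIFIC = {
    ("NOUN", "COMMON"): "NN", ("NOUN", "PROPER"): "NNP",
    ("ADV", "TIME"): "NST", ("ADV", "LOC"): "NST",
    ("PRON", "DEMONS"): "DEM", ("ADJ", "DEMONS"): "DEM",
    ("ADJ", "INTERR"): "WQ", ("ADV", "INTERR"): "WQ", ("PRON", "INTERR"): "WQ",
    ("VERB", "MAIN"): "VM", ("VERB", "AUX"): "VAUX", ("VERB", "MOD"): "VAUX",
    ("ADJ", "ORDIN"): "QO", ("PRON", "ORDIN"): "QO",
    ("ADV", "QUANT"): "INTF", ("ADV", "NEG"): "NEG",
}

# Defaults used when no specific pair applies.
DEFAULT = {
    "PRON": "PRP", "ADJ": "JJ", "ADV": "RB", "PREP": "PSP", "CONJ": "CC",
    "PREDET": "QF", "NUM": "QC", "INTERJ": "INJ", "SPECIAL": "SYM", "PUNCT": "SYM",
}

# Pairs the table leaves open; this sketch returns None for them.
UNDECIDED = {("PRON", "RELAT")}


def tut_to_anncorra(category, subtype=None):
    """Return the AnnCorra tag for a TUT category and subtype, or None."""
    key = (category.upper(), subtype.upper() if subtype else None)
    if key in UNDECIDED:
        return None
    if key in SPECIFIC:
        return SPECIFIC[key]
    return DEFAULT.get(key[0])
```

For example, `tut_to_anncorra("ADV", "MANNER")` falls back to "RB", while `tut_to_anncorra("ADV", "NEG")` gives "NEG". The opposite direction is not a function, since AnnCorra tags such as RP, CL or UT have no TUT counterpart.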

Mapping arc labels

The argument (karaka) labels

k1 (karta): the primary (or “most independent”) participant in the action (similar to agent). TUT: VERB-SUBJ
k2 (karma): the secondary participant (often, the patient). TUT: VERB-OBJ
k3 (karana): the instrument. TUT: VERB-INDCOMPL-MEANSMANNER
k4 (sampradana): the recipient or the beneficiary of an action. TUT: VERB-INDOBJ
k5 (apadana): the stationary element in a separation. TUT: ????
k7 (adhikarana): the locus (spatial or temporal or abstract) of karta or karma. It is tagged as k7p, k7t or k7 depending on the type of location. TUT: VERB-INDCOMPL-LOC

Mapping the structure

TUT dependency tree:
must
  I (verb-subj)
  read (verb+modal-indcompl)
    t (verb-subj)
    the (verb-obj)
      book (det+def-arg)

Chunk-based structure of AnnCorra:
(must read)
  I (k1)
  (the book) (k2)
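The step from the word-level TUT tree to the chunk-level AnnCorra structure can be sketched for this one sentence. The sketch below rests on assumptions that the slides do not state: arcs labelled verb+modal-indcompl and det+def-arg are taken to be chunk-internal, traces are simply dropped, and only the verb-subj, verb-obj and verb-indobj arcs are renamed to karaka labels. All names (`Node`, `to_chunk_relations`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    index: int
    form: str
    head: int             # 0 for the root
    label: str
    is_trace: bool = False


# Assumption: these arcs join two words of the same chunk.
MERGED_ARCS = {"verb+modal-indcompl", "det+def-arg"}
# Subset of the arc-label mapping given for the karaka labels.
KARAKA = {"verb-subj": "k1", "verb-obj": "k2", "verb-indobj": "k4"}


def to_chunk_relations(nodes):
    """Return (governing chunk, relation, dependent chunk) triples."""
    by_index = {n.index: n for n in nodes}

    def chunk_head(node):
        # Climb chunk-internal arcs until the top word of the chunk is reached.
        while node.label in MERGED_ARCS:
            node = by_index[node.head]
        return node.index

    chunks = {}
    for node in nodes:
        if not node.is_trace:
            chunks.setdefault(chunk_head(node), []).append(node.form)

    relations = []
    for head_index, forms in chunks.items():
        head = by_index[head_index]
        if head.head == 0:
            continue  # the root chunk has no governor
        parent = chunk_head(by_index[head.head])
        relation = KARAKA.get(head.label, head.label)
        relations.append((" ".join(chunks[parent]), relation, " ".join(forms)))
    return relations


SENTENCE = [
    Node(1, "I", 2, "verb-subj"),
    Node(2, "must", 0, "top-verb"),
    Node(3, "read", 2, "verb+modal-indcompl"),
    Node(4, "t", 3, "verb-subj", is_trace=True),
    Node(5, "the", 3, "verb-obj"),
    Node(6, "book", 5, "det+def-arg"),
]
relations = to_chunk_relations(SENTENCE)
```

For this input the result links the chunk "must read" to "I" with k1 and to "the book" with k2, which is the AnnCorra structure shown on the slide.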

Current activities and the future

A word about semantics: DTS

[Figure: dependency tree over the words "two", "students", "heard", "three", "difficult", "theorems" (arcs: verb-subj, verb-obj, det+quantif-arg, det+indef-arg, adjc+qualif-rmod), next to the corresponding DTS graph with predicates student', theorem', difficult', study', variables x and y, and the functions quant(x), quant(y), restr(x), restr(y)]

Disambiguation: Semdep arcs

[Figures: three copies of the DTS graph (study' with arguments x and y, student' on x, theorem' and difficult' on y), each with a context node CTX]

2x [ student’(x) ∧ 3y [ theorem’(y) ∧ study’(x,y) ] ]
3y [ theorem’(y) ∧ 2x [ student’(x) ∧ study’(x,y) ] ]

Any more reading?

??? Branching Quantification (Independent Set)

Current activities and the future

Practical semantic interpretation based on ontological knowledge for DB access

Extension of the treebank with semantic annotation (in cooperation with Johan Bos)

Development of a graphical interface with an online server (Java implementation and socket-based connection with a Lisp server)

Automatic analysis of legal texts for extracting information about rule amendments (date, modified text, new text)

The future (last but not least)

Morphological analysis of Hindi (mid-way)

Development and testing of a Hindi parser and of mapping rules from Hindi to English and vice versa

In cooperation with IIIT Hyderabad

More on Parsing 1

Function:

[Diagram: a word sequence w1 w2 ….. wi-1 wi wi+1 wi+2 ….. wn with HEAD = wi; the arcs from the head to the surrounding words are marked "?"]

Structure:

(head-category head-subcategory (dependent-position (dependent-category (dependent-constraints))) ARC-LABEL)

More on Parsing 2

Examples:

(ART DEF (before (PREDET (agree))) PDETMOD)

i (cat=ART, subcat=DEF, gender=m, number=pl): "the"
tutti (cat=PREDET, gender=m, number=pl): "all"
Arc label: PDETMOD

(NOUN COMMON (chunk-follows (ADJ (agree) (subcat qualif))) ADJCMOD-QUALIF)

bello (cat=ADJ, subcat=QUALIF, gender=m, number=sing): "nice"
giardino (cat=NOUN, gender=m, number=sing): "garden"
molto (cat=ADV): "very"
Arc label: ADJCMOD-QUALIF
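The first rule can be tried out on "tutti i". The sketch below is only an illustration of the rule structure (head category, head subcategory, dependent position, dependent category, constraints, arc label); it assumes that "before" means the dependent immediately precedes the head and that "agree" means equal gender and number. The class and function names are hypothetical and the parser's own rule interpreter is not shown in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    cat: str
    subcat: str = ""
    gender: str = ""
    number: str = ""


@dataclass(frozen=True)
class Rule:
    head_cat: str
    head_subcat: str
    position: str     # only "before" and "after" are handled in this sketch
    dep_cat: str
    agree: bool
    label: str


def agrees(a, b):
    # Assumption: agreement is identity of gender and number.
    return a.gender == b.gender and a.number == b.number


def apply_rule(rule, words, head_pos):
    """Return (dependent, label, head) if the rule fires at head_pos, else None."""
    head = words[head_pos]
    if (head.cat, head.subcat) != (rule.head_cat, rule.head_subcat):
        return None
    dep_pos = head_pos - 1 if rule.position == "before" else head_pos + 1
    if not 0 <= dep_pos < len(words):
        return None
    dep = words[dep_pos]
    if dep.cat != rule.dep_cat:
        return None
    if rule.agree and not agrees(head, dep):
        return None
    return (dep.form, rule.label, head.form)


# (ART DEF (before (PREDET (agree))) PDETMOD)
PDETMOD = Rule("ART", "DEF", "before", "PREDET", True, "PDETMOD")

words = [
    Word("tutti", "PREDET", gender="m", number="pl"),
    Word("i", "ART", "DEF", "m", "pl"),
]
arc = apply_rule(PDETMOD, words, 1)
```

With the head at position 1 ("i"), the rule attaches "tutti" with the label PDETMOD; a predeterminer that disagrees in gender would leave the rule unfired.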

More on Parsing 3

Verb subcategorization classes:

[Diagram: hierarchy of subcategorization classes rooted in "verbs". Nodes: nosubj-verbs, subj-verbs, obj-verbs, indobj-verbs, ssubj-inf-verbs, basic-trans, empty-modal, modal, trans, trans-indobj. Dictionary entries linked to the classes: bisognare (need), camminare (walk), dovere (must), potere (can)]

More on Parsing 4

Transformations:

basic class (e.g. trans) → transformed classes (e.g. trans, trans+passivization, trans+infinitivization, trans+prodrop, trans+passivization+infinitivization, …..)

Example transformation:

(infinitivization replacing (subj-verbs) (is-inf-form tr-verb v-casefr) (cancel-case s-subj))
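The effect of the example transformation, cancelling the surface subject, can be sketched as a function from one class to a derived one. This is a minimal sketch under stated assumptions: a class is reduced to a name plus a tuple of surface cases, the case name "s-obj" is hypothetical (the slide only shows "s-subj"), and the applicability tests of the original rule (the subj-verbs restriction and the is-inf-form condition) are not modelled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VerbClass:
    name: str
    cases: tuple      # surface cases of the class, e.g. ("s-subj", "s-obj")


def cancel_case(case):
    """Build an action that removes one surface case from a case tuple."""
    def action(cases):
        return tuple(c for c in cases if c != case)
    return action


@dataclass(frozen=True)
class Transformation:
    name: str
    action: Callable

    def apply(self, verb_class):
        # Derived classes are named basic+transformation, as on the slide.
        return VerbClass(f"{verb_class.name}+{self.name}",
                         self.action(verb_class.cases))


# (infinitivization ... (cancel-case s-subj))
INFINITIVIZATION = Transformation("infinitivization", cancel_case("s-subj"))

TRANS = VerbClass("trans", ("s-subj", "s-obj"))
derived = INFINITIVIZATION.apply(TRANS)
```

The derived class is named "trans+infinitivization" and keeps only the object case. Chaining further transformations in the same way would give names such as trans+passivization+infinitivization, but what those other transformations do to the case frame is not specified in the slides.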