INRIA

ATOLL - Software Tools for Natural Language Processing

Éric de la Clergerie
[email protected]
http://atoll.inria.fr

Evaluation Seminar SYM C : Management and processing of language and data
Dourdan, November 15-16th 2005

INRIA É. de la Clergerie ATOLL SymC 2005/11/15 1 / 39


Outline

1 Generalities

2 Thematics & Contributions

3 Applications

4 Actions

5 Collaborations

6 Conclusions


ATelier d’Outils Logiciels pour le Langage naturel

Creation : 1997 – Computer Science NLP

ATOLL objectives :

to develop tools and techniques, theoretical or applied, that help to access, process and use documents in natural language.

INRIA scientific challenges :

To design new applications using the Web and multimedia databases

Keywords : Computational Linguistics ; Natural Language Processing (NLP) ; Linguistic Engineering ; Parsing ; Syntactic Formalisms ; Linguistic resources


ATOLL’s composition (2002 – 2005)

Scientific leader
  Éric de la Clergerie (CR)

Permanents
  Bernard Lang (DR)
  Pierre Boullier (DR)
  Philippe Deschamp (CR)
  François Thomasset (DR)

Exteriors & Temporaries
  Areski Nait Abdallah (Pr, Univ. Brest)
  Alexis Nasr (Prof., Del. Paris 7)
  François Barthélemy (MdC, CNAM)
  Lionel Clément (PostDoc RLT, Ing. RNIL)
  Guillaume Rousse (Ing. Biotim)
  Stéphane Laurière (Ing. e-COTS)

PhD
  Benoît Sagot


Parsing ?

Parsing : identifying the relationships between words (and groups of words)
Grammar : packed sets of relationships + combination rules

Tree Adjoining Grammars [TAG]

Substitution : NP(John) + S(NP↓ VP(V(sleeps))) ⇒ S(NP(John) VP(V(sleeps)))
Adjunction : V(⋆V Adv(a lot)) adjoined at V ⇒ S(NP(John) VP(V(V(sleeps) Adv(a lot))))

Problems :
◮ No consensus on the best linguistic formalism
◮ Capturing all syntactic constructions
◮ Capturing word usages (lexicon & statistics)
◮ Handling ambiguities
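
The two TAG operations shown on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class and the `substitute`, `find_foot` and `adjoin` helpers are hypothetical names introduced here, not part of DyALog or any ATOLL tool, and the sketch only adjoins below the root of the target tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False   # substitution site, written NP↓ on the slide
    foot: bool = False    # foot node of an auxiliary tree, written ⋆V

    def show(self) -> str:
        """Bracketed rendering, e.g. S(NP(John) VP(V(sleeps)))."""
        if not self.children:
            return self.label
        return f"{self.label}({' '.join(c.show() for c in self.children)})"


def substitute(tree: Node, initial: Node) -> bool:
    """Replace the first substitution site labelled like `initial` (in place)."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(tree: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, if any."""
    if tree.foot:
        return tree
    for child in tree.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin `aux` at the first inner node labelled like its root (in place)."""
    for i, child in enumerate(tree.children):
        if child.label == aux.label and child.children and not child.subst:
            foot = find_foot(aux)
            if foot is None:
                raise ValueError("auxiliary tree has no foot node")
            # The excised subtree is re-attached under the foot node.
            foot.children, foot.foot = child.children, False
            tree.children[i] = aux
            return True
        if adjoin(child, aux):
            return True
    return False


s = Node("S", [Node("NP", subst=True), Node("VP", [Node("V", [Node("sleeps")])])])
substitute(s, Node("NP", [Node("John")]))
print(s.show())   # S(NP(John) VP(V(sleeps)))
adjoin(s, Node("V", [Node("V", foot=True), Node("Adv", [Node("a lot")])]))
print(s.show())   # S(NP(John) VP(V(V(sleeps) Adv(a lot))))
```

Both operations modify the target tree in place, which keeps the sketch short; a real parser would rather share structure, as discussed under tabulation and shared forests.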


ATOLL : Thematics

Formal Language Theory
  Formalisms, Automata & Tabulation

(Open) Tools
  Parser Compilers : DYALOG, SYNTAX, RCG
  Pre Parsing : SXPIPE
  Ling. Dev. Environment : MGCOMP, MGTOOLS, TAG_UTILS, FOREST_UTILS, MRCG2RCG, . . .

(Open) Ling. Resources
  Lexicon : LEFFF
  Grammar : SXLFG
  MetaGrammar : FRMG

Applications
  Evaluation : EASy
  Info. Extraction : Biotim
  Ling. Knowledge Acquisition

Corpora
  Grid techniques

Normalization
  Normalangue
  ISO TC37 SC4

Free Software
  Bernard Lang


ATOLL’s positioning

Balance between theory, development and experimentation

NLP requires many tools and resources, but linguistic resources (for French) are difficult to access and exploit
◮ ⇒ dev. effort + investigation of methodologies to speed up dev.
◮ favor reuse and distribution ⇒ normalization & open source
  Software Eng. practices : Versioning + Packaging + Catalog on line
◮ favor the emergence of resources ⇒ LexSynt action, collaborations

Search for comprehension of the mechanisms of language, but also see language as a cultural artifact :
◮ collaboration with linguists & use of linguistic theories (and formalisms)
◮ exploitation of corpora to capture language usage

NLP is an experimental field
◮ need to play at real scale : large coverage grammars, large lexica, real documents, large corpora
◮ need feedback : evaluation, statistics
◮ real scale applications


Syntactic formalisms

Exploration of a wide range of syntactic formalisms

[Figure : formalisms laid out along two axes, vocabulary complexity (unification) and derivation complexity (combining structures). Formalisms shown : CFG, Datalog, DCG, LFG, HPSG, λ-Prolog, MCS, TAG, LIG, LCFRS, RCG [Boullier], Feature TAG, Meta-RCG [Sagot]. Example labels : NP, V(sing), S(gap(np)) ; example elementary tree : N(A↓ ⋆N).]

Meta-Grammars : abstract level of grammar description based on hierarchies of classes grouping constraints and requiring/providing functionalities.
⇒ [MG compilation] Generation of target grammars (TAGs, LFGs)


Linguistic Resources : (Meta-)Grammars

No easily available wide coverage French grammars ⇒ development of grammars

SXLFG : wide coverage French LFG grammar [Clément, Sagot, Boullier]
  very efficient exploitation with SYNTAX

FRMG : wide coverage French Meta-Grammar [Clergerie]
  MG + factorization operators ⇒ generates a very compact TAG grammar :
  126 trees, with only 27 verb-anchored trees (compared with the usual 2-6K trees)

Development environment for grammars
◮ Editing
◮ Visualization
◮ Statistics (coverage, time, ambiguity) on test suites (EUROTRA & TSNLP)


Linguistic Resources : Lexicon

Development of the French lexicon LEFFF [Clément, Sagot]
Over 400 000 forms, with the following lemma distribution :

  verbs   common nouns   proper nouns   adj     adv
  6788    37183          52938          10024   2127

Verb morphology automatically learned on corpus (+ manual checking)

Syntactic information on verbs (subcategorization, control, . . . )

  promet (promises)
    v [pred=’promettre_1<subj|ssubj|vsubj,(obj|scomp),(à -obj)>’,cat=v,@SCompInd,@P3s]
    v [...]

Multiple inheritance-based architecture (∼ MG)

  @promettre {
    < @verbe_ditransitif_à_svp
    < objet_phrastique_possible
    < complétive_indicatif
    | ... }

Still incomplete and with errors, but corpus parsing is used to track errors (error mining)
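
The multiple-inheritance organisation of the lexicon can be illustrated with a small Python sketch that collects every class a lexical entry inherits from. The class names under `@promettre` are copied from the slide; the `PARENTS` table layout, the `ancestors` function and the extra parent `@verbe_transitif` are assumptions made for this example and do not reflect the actual LEFFF format.

```python
from typing import Dict, List

# Class names under "@promettre" come from the slide. The parent given to
# "@verbe_ditransitif_à_svp" is hypothetical, added to show transitivity.
PARENTS: Dict[str, List[str]] = {
    "@promettre": [
        "@verbe_ditransitif_à_svp",
        "objet_phrastique_possible",
        "complétive_indicatif",
    ],
    "@verbe_ditransitif_à_svp": ["@verbe_transitif"],  # hypothetical
}


def ancestors(cls: str, parents: Dict[str, List[str]]) -> List[str]:
    """Return every class inherited by `cls`, depth first, without duplicates."""
    seen: List[str] = []

    def visit(name: str) -> None:
        for parent in parents.get(name, []):
            if parent not in seen:
                seen.append(parent)
                visit(parent)

    visit(cls)
    return seen


for name in ancestors("@promettre", PARENTS):
    print(name)
```

A lexical entry would then receive the union of the properties attached to these classes; how conflicts between parents are resolved is not described on the slide and is left out here.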


Parsing techniques

Parsing remains an algorithmic challenge :

Ambiguity handling & representation (Shared forests)
  Push-Down Automata ; Dynamic Programming techniques

  Formalisms                Automata           Tabulation     Notes
  CFG                       PDA                O(n^3)         Lang
  DCG / Logic Programming   Logical PDA        completeness   Lang & Clergerie
  TAG / LIG                 2-Stack Automata   O(n^6)         Clergerie & Pardo
  MC-TAG/osRCG/MCS          Thread Automata    O(n^k)         Clergerie

Guiding techniques (e.g., supertagging [Boullier] & chunking [Sagot])
  multi-pass parsing where the shared forest of a level guides the next level

Algorithms on shared forests (e.g., disambiguation)

Robustness (e.g., partial parsing, error correction techniques)
  handling “ill formed” sentences, unknown words and constructions

Scaling issues (wide coverage grammars, large lexica)
  grammar factorization (FRMG) & many algorithmic issues
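
To make the O(n^3) tabulation bound for CFGs concrete, here is a minimal CKY-style sketch in Python. It is not the automata-based technique used in DyALog or SYNTAX: it swaps in the textbook CKY algorithm, and the toy grammar, the lexicon and the function names `cky_counts` and `count_parses` are all assumptions made for illustration. The chart stores, for each span and category, the number of derivations, which is the information a shared forest packs without enumerating trees.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy grammar in Chomsky normal form (not taken from the slides).
BINARY: Dict[Tuple[str, str], List[str]] = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("P", "NP"): ["PP"],
    ("Det", "N"): ["NP"],
}
LEXICON: Dict[str, List[str]] = {
    "John": ["NP"], "sees": ["V"], "a": ["Det"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def cky_counts(words: List[str]) -> Dict[Tuple[int, int, str], int]:
    """Tabulate, for each span (i, j) and category, the number of derivations."""
    n = len(words)
    chart: Dict[Tuple[int, int, str], int] = defaultdict(int)
    for i, word in enumerate(words):
        for cat in LEXICON.get(word, []):
            chart[(i, i + 1, cat)] += 1
    # Three nested loops over positions give the cubic bound in sentence length.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (left, right), parents in BINARY.items():
                    ways = chart.get((i, k, left), 0) * chart.get((k, j, right), 0)
                    if ways:
                        for parent in parents:
                            chart[(i, j, parent)] += ways
    return chart


def count_parses(sentence: str, start: str = "S") -> int:
    """Number of distinct parse trees rooted in `start` for the sentence."""
    words = sentence.split()
    return cky_counts(words).get((0, len(words), start), 0)


print(count_parses("John sees a man with the telescope"))  # 2
```

The example sentence is ambiguous between attaching the prepositional phrase to the verb phrase or to the noun phrase, so the count is 2 while each chart cell is still computed only once.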


DyALog - Exploring unification-based grammars

An environment (compiler dyacc + abstract machine) :

For compiling tabular parsers
  based on stack automata & dynamic programming
  ⇒ computation sharing & loop detection
  ⇒ extraction of shared forests

Also a logic programming environment
  ⇒ power of logic + possibility of escaping within grammars

Strongly NLP oriented
◮ Ease of grammar design : TFS, finite domains, . . .
◮ Multiple grammatical formalisms : DCG, BMG, TAG & TIG, RCG
◮ Functionalities and customization of parsers : multiple parsing strategies, forests, word lattices, lexicalization, robustness

Used for
◮ a robust Portuguese grammar (bidir. head-driven DCG+BMG) [GLINT]
  ⇒ dev. of a similar Spanish grammar [COLE]
◮ MG compiler MGCOMP
◮ French wide coverage TIG/TAG grammar FRMG [ATOLL]
◮ French TAG grammar with semantic interface [LORIA]


Syntax - From CFGs to LFGs and RCGs

SYNTAX [Boullier]

Originally developed to parse programming languages with CFGs (+ attributes)

Extended with tabular techniques to cover all CFGs

Recently extended for Lexical Functional Grammars
  2 passes : CFG + computation of decorations on shared parse forests
  + disambiguation phase
  ⇒ used for the large coverage LFG grammar SXLFG [Clément, Sagot, Boullier]

Extremely efficient on CFGs and on LFGs
  + 300K sent. journalistic corpus (Monde Diplomatique) : 3/4 of the sentences in < 0.1s

RCG [Boullier]

Derived from the technology behind SYNTAX

Very powerful and also efficient


Linguistic infrastructure : SxPipe

French Morpho-Syntactic chain SXPIPE [Sagot, Boullier, . . . ] :

Word and sentence segmentation, including multi-words

Named entities (Proper Nouns, Dates, Addresses, URL, :-) , . . . )

Spelling corrections

Returns a word lattice (DAG) as input for parsing

[Figure : word lattice over states 0 to 7, each edge labelled “{original tokens} form”. Edge labels : Jean ; {Jean} jean ; {abite} habite ; en ; outre ; {en outre} en_outre ; {au} à ; {au} le ; {1 , rue de la Pompe} _ADRESSE]
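
A word lattice of this kind is a directed acyclic graph whose paths are the alternative readings handed to the parser. The Python sketch below enumerates those paths. The edge labels come from the slide, but the assignment of edges to states, the `EDGES` layout and the `build_lattice` and `paths` functions are assumptions made for this illustration, not SXPIPE's actual output format.

```python
from typing import Dict, Iterator, List, Tuple

# Edges as (source state, target state, form). The state numbering is an
# assumption: the slide only shows states 0..7 and the edge labels.
EDGES: List[Tuple[int, int, str]] = [
    (0, 1, "Jean"), (0, 1, "jean"),
    (1, 2, "habite"),                       # spelling correction of "abite"
    (2, 3, "en"), (3, 4, "outre"),
    (2, 4, "en_outre"),                     # multi-word reading
    (4, 5, "à"), (5, 6, "le"),              # "au" split into two forms
    (6, 7, "_ADRESSE"),                     # named entity
]

Lattice = Dict[int, List[Tuple[int, str]]]


def build_lattice(edges: List[Tuple[int, int, str]]) -> Lattice:
    """Index the outgoing edges of each state."""
    lattice: Lattice = {}
    for src, dst, form in edges:
        lattice.setdefault(src, []).append((dst, form))
    return lattice


def paths(lattice: Lattice, state: int, final: int) -> Iterator[List[str]]:
    """Yield every sequence of forms leading from `state` to `final`."""
    if state == final:
        yield []
        return
    for nxt, form in lattice.get(state, []):
        for rest in paths(lattice, nxt, final):
            yield [form] + rest


lattice = build_lattice(EDGES)
for reading in paths(lattice, 0, 7):
    print(" ".join(reading))
```

With two readings for the first token and two segmentations of "en outre", the lattice encodes four readings. A tabular parser consumes the lattice directly instead of enumerating the paths as done here.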


Normalization

Motivation :
◮ Reusability of resources
◮ Interoperability of tools in processing chains

Participation (with Laurent Romary and Langue & Dialogue [LORIA]) in
◮ INRIA consortium SYNTAX
◮ French Normalangue/RNIL action
◮ ISO TC 37 SC4 on “Linguistic Resource Management”
  ⇒ head of French delegation to ISO meetings
◮ Partially involved in new European project LIRICS

Participation in emerging standards :

  MAF     Morphosyntactic Annotation Framework (Project leader)
  FSR     Feature Structure Representation (+ future companion FSD)
  DCR     Data Category Registry (registering ling. terminology)
  LMF     Lexical Markup Framework (lexicons)
  SynAF   Syntactic Annotation Framework


Environment of development

Principles :

Coordinating small tools
  LINGPIPE, TAG_UTILS, FOREST_UTILS, PARSERD, . . .

Use of XML intermediary formats with multiple views (XSL)
  Grammars (TAGML), MetaGrammars, Morpho-Syntax (MAF), Shared Forests (Derivations or Dependencies)

Visualization tools (web services)
◮ MAF (MAFD) : http://atoll.inria.fr/mafdemo
◮ Parsing (PARSERD) : http://atoll.inria.fr/parserdemo
◮ Grammar (FRMG) : http://atoll.inria.fr/perl/frmg/tree.pl

Example : il a voulu en promettre une à Paul.
(He has wanted to promise one of them to Paul)

[Figure : dependency view of the parse of this sentence. Node labels include promettre:v:107, vouloir:v:63, a:aux:54, pro:pro:50, à:prep:41, Paul:np:51 ; edge labels include subject, object, vmod, preparg, Infl, V, clg, cll, PP, N2, S.]


French Parsing Evaluation Campaign EASy

Evaluation protocol : complex (formalisms, deep vs shallow, ambiguous or not, . . . )
  ⇒ evaluation on shallow constituents and a small set of dependencies

[Dec. 2004] Participation of FRMG & SXLFG (out of 14 parsers)
  ∼ 40K sentences, with ∼ 4200 manually annotated, several styles

Use of our deep parsers to produce non-ambiguous shallow information
  ⇒ robustness (full and partial parsing)
  ⇒ disambiguation heuristics + conversion

EASy campaign tests not just parsing but a full processing chain
  LEFFF, SXPIPE + parser + post-parsing

Example annotation :

  Forms : Jean (F1) abite (F2) en (F3) outre (F4) au (F5) 1 (F6) , (F7) rue (F8) de (F9) la (F10) Pompe (F11)
  Chunks : GN1 = Jean ; NV2 = abite ; GR3 = en outre ; GP4 = au 1 , rue de la Pompe
  Relations :
    subject / verb : GN1, NV2
    compl. / verb : GP4, NV2
    modifier / verb : GR3, NV2

Note : Still waiting for definitive and complete results
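
Evaluating on a small set of dependencies amounts to comparing the relations produced by a parser with the reference annotation. The Python sketch below uses the standard definitions of precision, recall and F-measure over sets of relations. It is an assumption that these standard definitions apply: the slides do not describe the official EASy scoring, and the `prf` function and the `predicted` set are hypothetical.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (relation name, first chunk, second chunk)


def prf(gold: Set[Relation], predicted: Set[Relation]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of predicted relations against gold ones."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Reference relations from the slide's example annotation.
gold = {
    ("subject", "GN1", "NV2"),
    ("compl", "GP4", "NV2"),
    ("modifier", "GR3", "NV2"),
}
# Hypothetical parser output that misses the modifier relation.
predicted = {
    ("subject", "GN1", "NV2"),
    ("compl", "GP4", "NV2"),
}

p, r, f = prf(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

In this toy case every predicted relation is correct but one reference relation is missing, so precision is perfect while recall and F-measure drop.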


EASy as a starting point

Using EASy expertise and resources to continue evaluations, get feedback,compare FRMG & SXLFG (and acquire statistics)

Corpus distribution and results per kind of corpus (%)

  kind of corpus   #sentences   series 1   series 2   series 3
  general          755 (18%)    84.0       67.4       68.1
  litteraire       881 (21%)    81.9       76.5       76.0
  mail             852 (20%)    73.7       63.5       64.0
  medical          554 (13%)    80.5       70.9       71.1
  oral_delic       522 (12%)    72.2       63.4       62.7
  oral_elda        502 (12%)    77.0       70.8       71.5
  questions        203 (5%)     86.1       83.5       83.4
  total                         77.5       68.7       68.7

Chart legend : %precision, %recall, %fmeasure (which series matches which measure was lost with the chart layout)

⇒ NEW : results already used for a very accurate CFG-based chunker [Sagot]


BIOTIM : from books to knowledge bases

ACI “Datawarehouse” BIOTIM : Processing botanical descriptions (flora)

Corpus “Flore du Cameroun” (1963 – 2001)

  Volumes   Pages   Av. Pages/Vol.   Words   Taxons
  31        9466    305              1.5M    ∼ 2400

Tasks :

Corpus preparation : spelling correction (OCR), logical structuring

Preliminary linguistic processing : morpho-syntactic processing

Terminology extraction & first experiments with the “governor-governee” relationship

“Ontology” extraction : use of parsing to extract syntactic dependencies
  + Harris hypothesis : similar syntactic contexts hint at semantic similarities
  ⇒ lancéolé (adj) : leaf shape

Text Mining : getting the properties of each taxon
  parsing + disambiguation through the ontology
  knowledge bases : Description Logics (RDF / OWL)
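
The Harris hypothesis mentioned above can be made concrete with a small distributional sketch in Python: each word is described by the syntactic contexts it occurs in, and two words are compared with cosine similarity. The dependency triples below are invented for illustration (they are not extracted from the “Flore du Cameroun” corpus), and `context_vectors` and `cosine` are hypothetical helper names, not part of the Biotim tool chain.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Dependency = Tuple[str, str, str]  # (governor, relation, dependent)


def context_vectors(deps: Iterable[Dependency]) -> Dict[str, Counter]:
    """Describe each dependent by the (governor, relation) contexts it fills."""
    vectors: Dict[str, Counter] = {}
    for governor, relation, dependent in deps:
        vectors.setdefault(dependent, Counter())[(governor, relation)] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse context vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy dependencies, in the style of botanical descriptions.
DEPS = [
    ("feuille", "mod", "lancéolé"), ("limbe", "mod", "lancéolé"),
    ("feuille", "mod", "ovale"), ("limbe", "mod", "ovale"),
    ("fleur", "mod", "jaune"), ("pétale", "mod", "jaune"),
]

vectors = context_vectors(DEPS)
print(cosine(vectors["lancéolé"], vectors["ovale"]))
print(cosine(vectors["lancéolé"], vectors["jaune"]))
```

In this toy data “lancéolé” and “ovale” modify the same nouns and come out as highly similar, which is the kind of evidence that would group them under a common property such as leaf shape, while “jaune” shares no context with them.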


Linguistic Resource Acquisition

Motivation : Bootstrap

Using resources in tools (parsers, taggers, . . . ) ⇒ Validation
  error mining techniques (Van Noord, Clergerie)

Using tools to get resources from (raw) corpora
◮ Learning morphology and lemmas (done for French & Slovak [Sagot])
◮ Learning syntactic information (sub-categorization, support verbs, . . . )
◮ Learning semantic classes and selection restrictions
◮ Learning probabilities (for disambiguation)
◮ Generic idea : learning from contexts coming from dependencies

Reducing human cost ⇒ Free Linguistic Resources

Adapting resources to needs (evolution)
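
The idea behind error mining can be sketched as follows: a form that occurs mostly in sentences the parser fails on is a likely source of lexicon or grammar errors. The Python sketch below uses the simplest possible ratio and is only an assumption about the general approach, not the algorithm actually used by the team; the `suspicion` function, its `min_count` parameter and the toy parse results are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def suspicion(results: Iterable[Tuple[List[str], bool]],
              min_count: int = 2) -> Dict[str, float]:
    """Rate each form by the share of its sentences that failed to parse.

    `results` pairs the forms of a sentence with a flag telling whether the
    sentence received a full parse. Rare forms (seen in fewer than `min_count`
    sentences) are ignored to avoid noisy ratios.
    """
    total: Counter = Counter()
    failed: Counter = Counter()
    for forms, parsed in results:
        for form in set(forms):
            total[form] += 1
            if not parsed:
                failed[form] += 1
    return {form: failed[form] / n for form, n in total.items() if n >= min_count}


# Invented parse results: (forms of the sentence, parsed successfully?).
RESULTS = [
    (["il", "dort"], True),
    (["il", "abite", "ici"], False),
    (["Jean", "abite", "ici"], False),
    (["Jean", "dort", "ici"], True),
]

scores = suspicion(RESULTS)
for form, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{form}\t{score:.2f}")
```

Here the misspelled form “abite” only appears in failed sentences and gets the highest score, while forms that merely co-occur with it receive intermediate scores. Sharing the blame between the forms of a sentence more carefully is what more elaborate error mining methods address.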


Actions : ARC RLT (2001 – 2002)

ARC “Linguistic resource acquisition and representation for TAGs”

Participants : ATOLL (coordinator), Langue et Dialogue, Calligramme, TALaNa (Univ. Paris 7)

Objectives :
◮ Semi-automatic acquisition of a French TAG lexicon
◮ XML representation for TAGs
◮ Emergence of the notion of Meta-Grammars

[Figure : processing architecture. Components : Corpus, Pre Parsing, Parser, Annotated corpus, Tree bank, Preferences, Supervised validation, Acquisition, Lexicon, Grammar, Metagrammar ; arrows labelled “compilation” (twice).]


Actions (cont’d)

Biotim : ACI “Masses de Données” (2003 – 2006)
◮ Participants : IRD, LIFO (Orléans), IMEDIA, Vertigo (CNAM), INRA
◮ Objectives (ATOLL) : text mining of botanical corpora

EASy/EVALDA : French Technolangue action for the evaluation of parsing systems (2003 – 2005)
◮ Participants : ELDA, LIMSI, LLF, ATILF, DIAM-STIM, DELIC, GREYC, L&D, LPL, Synapse Développement, Systal-Pertimm, Xerox, LIC2M, LATL, EPFL, FT R&D, Tagmatica, VALORIA, ERSS
◮ Objectives (ATOLL) : participation with FRMG & SXLFG

RNIL/Normalangue : Technolangue action for the normalization of linguistic resources (2003 – 2005)
◮ Participants : L&D, AFNOR, ATILF, LLF Jussieu, IRIN, LIMSI, CLIPS, RESO, CEA, XRCE, EDF R&D, SYSTRAN, France Telecom R&D, Systems & Defense Electronics, SOFTISSIMO, SINEQUA, LUCID-ID, J-WAY
◮ Objectives (ATOLL) : project leader on MAF + head of French delegation for ISO TC37 SC4


Actions (cont’d)

LexSynt : ILF-funded action (2005 – ? ? ?)
◮ Coordination : Sylvain Kahane (MoDyCo), Susanne Salmon-Alt (ATILF), Éric de la Clergerie (Projet ATOLL, INRIA)
◮ Participants : ATILF, ERSS, IGM, LPL, Lattice, MoDyCo, ATOLL, Calligramme, L&D, Signes, ATV (K.U. Leuven), OLST (Univ. of Montreal), Normalangue/RNIL
◮ Objectives : to design & exploit a reference syntactic-semantic lexicon for French.

GENI : ARC on Generation and Inference (2002 – 2003)
◮ Participants : L&D (coord.), Orpailleur, TALaNa, IRIT
◮ Objectives (ATOLL) : TAG, generation & tabulation, lexical semantics

e-COTS : RNTL action (2001 – 2002, extended 2003), Bernard Lang
◮ Participants : Thomson-CSF, EDF and Bull
◮ Objectives : set up an open and cooperative Web portal to manage information about software components.

MOPROSCO : ARC (2005 – 2006), participation of Areski Nait-Abdallah


INRIA Collaborations

Langue et Dialogue (LORIA) : TAG ; Meta-Grammars ; Linguistic Infrastructure ; Normalization ; Lexica
  ARC RLT and GENI ; Normalangue ; EASy ; LexSynt

Calligramme (LORIA) : MG ; ARC RLT ; LexSynt

Signes (Futurs, Bordeaux) : LexSynt
  increased collaboration with the arrival at Bordeaux of Lionel Clément

IMEDIA (INRIA Rocquencourt) : Biotim

Orpailleur (LORIA) : text mining & knowledge extraction (ARC GENI)


French Collaborations

Lattice/TALaNa (University Paris 7) :
◮ TAG, MG (RLT & GENI), lexica (LexSynt), . . .
◮ people (Lionel Clément & Alexis Nasr)
◮ co-supervision of Sagot’s PhD with Laurence Danlos
◮ discussions towards a common structure

MoDyCo (University Paris 10 - Nanterre & CNRS) with Sylvain Kahane : linguistic formalisms ; LexSynt

LIFO (Orléans) : NLP and Machine learning ; Biotim

BIODIVAL (IRD, Orléans) : Biotim

+ Potential collaborations with
◮ IGM (Marne-La-Vallée) [LADL tables, LexSynt],
◮ ERSS (Toulouse) [Acquisition on corpus],
◮ LIS (Paris 6) [Knowledge rep. and use ; Biotim+],
◮ . . .


International Collaborations

COLE (La Coruña & Vigo, Spain) – Manuel Vilares Ferro
◮ TAGs ; parsing techniques ; grammars ; use of DyALog ; information retrieval and extraction applications
◮ French-Spanish “Programme d’Actions Intégrées” [PAI] PICASSO
  ⇒ visits (several-month student visits)
  ⇒ organization of 2 TAPD editions (Paris 1998, Vigo 2000)
◮ Co-supervised PhD of Miguel Alonso Pardo (on 2SA and TAGs)
  ⇒ many common papers

GLINT* (New Univ. of Lisbon) – Gabriel Pereira Lopes
◮ Use of DyALog and of machine learning techniques for NLP
◮ French-Portuguese programs RELING, ICTII and PAI PESSOA ⇒ visits

XTAG (Univ. of Pennsylvania) – Aravind Joshi
◮ TAG parsing and Meta-Grammars ; PhD student A. Kinyon
◮ Potential NSF–INRIA collaboration

RIADI (Univ. of La Manouba, Tunisia) – Mohamed Ben Ahmed
  Preliminary contacts to set up a cooperation on the processing of Arabic


Visibility

(New) Invitation to deliver a course at the ESSLLI’06 summer school ; invitation of Bernard Lang to many events.

Editorial board of the French journal “Traitement Automatique des Langues”.
  Guest Editor for the T.A.L. issue on “Evolutions in Parsing” (2003)

Program Committees of 11 national and int. conferences and workshops

9 national and int. PhD juries, including 3 reviews

Standardization committee of ISO TC37 SC4 (head of French delegation)

Consulting and project reviews for actions Technolangue and ACI ; Bernard Lang consulted by companies, administrations, and government.

Paper reviews for journals (T.A.L., JoLLI, TCS) and conferences/workshops (TALN, EACL, IWPT, ACL, ICLP, PPDP, COLING, ESSLLI, IJCNLP, ICALP, POPL, Formal Grammars, MOL)


Publications

ATOLL’s publications are available on line at http://atoll.inria.fr/biblio

Journals : T.A.L., TCS, Document numérique

Conferences & Workshops : ACL, NAACL, COLING, TALN, IWPT, TAG+, FG, MOL, LACL, LREC, . . .

                           02    03      04    05
  Journal                  1BL   2+1BL   1BL   2+3S+4BL
  Conference proceedings   4     6       6     11+4P+2S+2BL
  Total                    5     11      10    29

Rows with fewer entries (year columns lost in extraction) :
  PhD Thesis [Sagot] : 1TBF
  H.D.R [Clergerie] : 1TBF
  Book chapter : 1BL, 2, 1
  Book (edited) : 1
  Technical report : 1, 1


About the last 4 years (2002 – 2005)

Difficulties in recruitment (no CR since 1994)
  but ATOLL has nevertheless welcomed brilliant (temporary) members : Lionel Clément, Benoît Sagot, Guillaume Rousse

Fulfilled most of the planned objectives
  syntactic formalisms ; parsing techniques ; robustness ; lexica ; linguistic infrastructure ; evaluation ; applications

Got involved in unexpected but natural actions : EASy and Normalangue
  ⇒ (EASy) fast development pace for a real scale processing chain

Considerable broadening of our thematics
  morpho-syntax, lexica, corpora, grammar design, . . .

Now working at a much larger scale
  wide coverage grammars, large lexicon, processing of large corpora


Scientific evolutions (2006 – 2009)

Better syntactic formalisms and syntactic descriptions
  MGs, MC-TAGs, Meta-RCGs, dependencies, constraints
  ⇒ ARC proposal MOSAÏQUE (coordinator Signes)

Increased parsing robustness and efficiency
◮ acquisition and exploitation of stochastic information (Nasr)
◮ algorithms for n-best beam disambiguation on shared derivation forests
◮ guiding techniques & cascade-based parsing
◮ error correction techniques
◮ integration of emerging techniques (e.g., HPSG’s quick check filtering)

Guiding + Tabular techniques + Stochastic methods + Evaluation
  ⇒ [Ambition] marry deep and shallow parsing, ambiguous or not, to get efficient and accurate parsing systems


Scientific evolutions (cont’d)

Linguistic knowledge acquisition & lexica
◮ use of EASy references and Paris 7 treebank
◮ acquisition on raw corpora
◮ acquisition and exploitation of lexical semantic info.

Information extraction applications (Biotim follow-ups)
  possibly with question answering and (multilingual) generation

Opening to multilingualism (Arabic ?)
  application of our tools and acquisition methodologies to new languages


Organizational aspects

Real need to renew the composition of ATOLL
  planned departures ⇒ impossible to continue without new member(s)
◮ ? maintenance & development of tools & resources
◮ ? conducting large scale experiments & exploiting results
◮ ? covering enough computational linguistics sub-fields
◮ ? following collaborations and actions on ATOLL’s side

Discussions with TALaNa towards a common structure

Reinforcing collaborations through INRIA Action d’Envergure for NLP ?
◮ exchanging resources, tools & expertise ;
◮ sharing dev. effort ;
◮ better coverage of NLP sub-fields

Thank you !
