INRIA
ATOLL : Software Tools for Natural Language Processing
Éric de la Clergerie
http://atoll.inria.fr
Evaluation Seminar
SYM C : Management and processing of language and data
Dourdan, November 15-16th 2005
INRIA É. de la Clergerie ATOLL SymC 2005/11/15 1 / 39
Outline
1 Generalities
2 Thematics & Contributions
3 Applications
4 Actions
5 Collaborations
6 Conclusions
ATelier d’Outils Logiciels pour le Langage naturel
Creation : 1997 – Computer Science NLP
ATOLL objectives :
to develop tools and techniques, theoretical or applied, that help to access, process and use documents in natural language.
INRIA scientific challenges :
To design new applications using the Web and multimedia databases
Keywords : Computational Linguistics ; Natural Language Processing (NLP) ; Linguistic Engineering ; Parsing ; Syntactic Formalisms ; Linguistic Resources
ATOLL’s composition
Timeline : 2002 – 2005
Scientific leader : Éric de la Clergerie (CR)
Permanents : Bernard Lang (DR), Pierre Boullier (DR), Philippe Deschamp (CR), François Thomasset (DR)
Exteriors & Temporaries : Areski Nait Abdallah (Pr, Univ. Brest), Alexis Nasr (Prof., Del. Paris 7), François Barthélemy (MdC, CNAM), Lionel Clément (PostDoc RLT, Ing. RNIL), Guillaume Rousse (Ing. Biotim), Stéphane Laurière (Ing. e-COTS)
PhD : Benoît Sagot
Parsing ?
Parsing : identifying the relationships between words (and groups of words)
Grammar : packed sets of relationships + combination rules
Tree Adjoining Grammars [TAG]
[NP John] + [S NP↓ [VP [V sleeps]]]  subst ⇒  [S [NP John] [VP [V sleeps]]]
+ [V ⋆V [Adv a lot]]  adj ⇒  [S [NP John] [VP [V [V sleeps] [Adv a lot]]]]
Problems :
◮ No consensus on the best linguistic formalism
◮ Capturing all syntactic constructions
◮ Capturing word usages (lexicon & statistics)
◮ Handling ambiguities
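The two TAG operations shown on this slide, substitution and adjoining, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the function names and the bracketed rendering are hypothetical and are not taken from DyALog or any ATOLL tool, and the sketch never adjoins at the root of a tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Node of a TAG elementary or derived tree (illustrative structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # substitution site, written NP↓ on the slide
    foot: bool = False   # foot node of an auxiliary tree, written ⋆V

    def show(self) -> str:
        """Render the tree in bracketed notation."""
        if not self.children:
            return self.label
        inner = " ".join(child.show() for child in self.children)
        return f"[{self.label} {inner}]"


def substitute(tree: Node, initial: Node) -> bool:
    """Plug `initial` into the first matching substitution site below `tree`."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(tree: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, if any."""
    if tree.foot:
        return tree
    for child in tree.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin `aux` at the first inner node below `tree` with the same label."""
    for i, child in enumerate(tree.children):
        if child.label == aux.label and child.children and not child.subst:
            foot = find_foot(aux)
            if foot is None:
                raise ValueError("auxiliary tree has no foot node")
            foot.children = child.children  # excised subtree hangs under the foot
            foot.foot = False
            tree.children[i] = aux
            return True
        if adjoin(child, aux):
            return True
    return False


def demo() -> List[str]:
    """Rebuild the slide's example: John + sleeps, then adjoin 'a lot'."""
    john = Node("NP", [Node("John")])
    sleeps = Node("S", [Node("NP", subst=True),
                        Node("VP", [Node("V", [Node("sleeps")])])])
    a_lot = Node("V", [Node("V", foot=True), Node("Adv", [Node("a lot")])])
    substitute(sleeps, john)
    after_subst = sleeps.show()
    adjoin(sleeps, a_lot)
    return [after_subst, sleeps.show()]
```

Calling `demo()` returns the derived tree after substitution and then after adjoining, in the same bracketed form as the trees above.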
2 Thematics & Contributions
ATOLL : Thematics
Formal Language Theory
◮ Formalisms, Automata & Tabulation
(Open) Tools
◮ Parser Compilers : DYALOG, SYNTAX, RCG
◮ Pre Parsing : SXPIPE
◮ Ling. Dev. Environment : MGCOMP, MGTOOLS, TAG_UTILS, FOREST_UTILS, MRCG2RCG . . .
(Open) Ling. Resources
◮ Lexicon : LEFFF
◮ Grammar : SXLFG
◮ MetaGrammar : FRMG
Applications
◮ Evaluation : EASy
◮ Info. Extraction : Biotim
◮ Ling. Knowledge Acquisition
Corpora
◮ Grid techniques
Normalization
◮ Normalangue
◮ ISO TC37 SC4
Free Software
◮ Bernard Lang
ATOLL’s positioning
Balance between theory, development and experimentation
NLP requires many tools and resources, and linguistic resources are difficult to access and exploit (for French)
◮ ⇒ dev. effort + investigation of methodologies to speed up dev.
◮ favor reuse and distribution ⇒ normalization & open source ; Software Eng. practices : Versioning + Packaging + Catalog on line
◮ favor the emergence of resources ⇒ LexSynt action, collaborations
Search for an understanding of the mechanisms of language, but also see language as a cultural artifact :
◮ collaboration with linguists & use of linguistic theories (and formalisms)
◮ exploitation of corpora to capture language usage
NLP is an experimental field
◮ need to play at real scale : large coverage grammars, large lexica, real documents, large corpora
◮ need feedback : evaluation, statistics
◮ real scale applications
Syntactic formalisms
Exploration of a wide range of syntactic formalisms
[Figure : map of syntactic formalisms along two axes, vocabulary complexity (unification) and derivation complexity (combining structures). Formalisms shown : CFG, Datalog, DCG, LFG, HPSG, λ-Prolog, MCS, TAG, LIG, LCFRS, RCG [Boullier], Feature TAG, Meta-RCG [Sagot]. Sample symbols : NP, V(sing), S(gap(np)) ; sample tree : [N A↓ ⋆N].]
Meta-Grammars : abstract level of grammar description based on hierarchies of classes grouping constraints and requiring/providing functionalities.
⇒ [MG compilation] Generation of target grammars (TAGs, LFGs)
Linguistic Resources : (Meta-)Grammars
No easily available wide coverage French grammars ⇒ development of grammars
SXLFG : wide coverage French LFG grammar [Clément, Sagot, Boullier]
◮ very efficient exploitation with SYNTAX
FRMG : wide coverage French Meta-Grammar [Clergerie]
◮ MG + factorization operators ⇒ generates a very compact TAG grammar
◮ 126 trees with only 27 verb-anchored trees (compared with the usual 2-6K trees)
Development environment for grammars
◮ Editing
◮ Visualization
◮ Statistics (coverage, time, ambiguity) on test suites (EUROTRA & TSNLP)
Linguistic Resources : Lexicon
Development of French lexicon LEFFF [Clément, Sagot]
Over 400 000 forms, with the following lemma distribution :
verbs    common nouns    proper nouns    adj      adv
6788     37183           52938           10024    2127
Verb morphology automatically learned on corpus (+ manual checking)
Syntactic information on verbs (subcategorization, control, . . . )
promet (promises)
    v [pred=’promettre_1<subj|ssubj|vsubj,(obj|scomp),(à -obj)>’,cat=v,@SCompInd,@P3s]
    v [...]
Multiple Inheritance-based architecture (∼ MG)
    @promettre {
        < @verbe_ditransitif_à_svp
        < objet_phrastique_possible
        < complétive_indicatif
        | ... }
Still incomplete and with errors, but corpus parsing is used to track errors (error mining)
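To make the subcategorization frame of the `promettre` entry easier to read, here is a small Python sketch that splits such a `pred` value into a lemma and its argument slots. The reading of the notation is an assumption made for illustration (`|` separates alternative realizations, parentheses mark an optional slot), and `parse_pred` is a hypothetical helper, not part of the LEFFF tooling.

```python
import re
from typing import Dict, List


def parse_pred(pred: str) -> Dict[str, object]:
    """Split 'lemma<slot,slot,...>' into a lemma and a list of slots.

    Assumed conventions (not confirmed by the slides):
    '|' separates alternative realizations, '(...)' marks an optional slot.
    """
    match = re.fullmatch(r"(?P<lemma>[^<]+)<(?P<args>.*)>", pred.strip())
    if match is None:
        raise ValueError(f"not a pred value: {pred!r}")
    slots: List[Dict[str, object]] = []
    for raw in match.group("args").split(","):
        raw = raw.strip()
        optional = raw.startswith("(") and raw.endswith(")")
        if optional:
            raw = raw[1:-1]
        # Drop stray spaces such as the one in "à -obj".
        realizations = [r.replace(" ", "") for r in raw.split("|")]
        slots.append({"optional": optional, "realizations": realizations})
    return {"lemma": match.group("lemma"), "slots": slots}


frame = parse_pred("promettre_1<subj|ssubj|vsubj,(obj|scomp),(à -obj)>")
```

Under these assumptions, `frame` describes one obligatory subject slot with three possible realizations, followed by two optional slots.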
Parsing techniques
Parsing remains an algorithmic challenge :
Ambiguity handling & representation (Shared forests)
Push-Down Automata ; Dynamic Programming techniques
Formalisms                 Automata            Tabulation      Notes
CFG                        PDA                 O(n^3)          Lang
DCG / Logic Programming    Logical PDA         completeness    Lang & Clergerie
TAG / LIG                  2-Stack Automata    O(n^6)          Clergerie & Pardo
MC-TAG/osRCG/MCS           Thread Automata     O(n^k)          Clergerie
Guiding techniques (e.g., supertagging [Boullier] & chunking [Sagot])
◮ multi-pass parsing where the shared forest of a level guides the next level
Algorithms on shared forests (e.g., disambiguation)
Robustness (e.g., partial parsing, error correction techniques)
◮ handling “ill formed” sentences, unknown words and constructions
Scaling issues (wide coverage grammars, large lexica)
◮ grammar factorization (FRMG) & many algorithmic issues
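As a reminder of what cubic-time tabulation looks like for CFGs, the sketch below implements a plain CKY recognizer in Python. It is a swapped-in textbook technique, not the PDA-based tabulation cited in the table and not the algorithm of DyALog or SYNTAX ; the toy grammar (in Chomsky normal form) and the token `a_lot` are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def cky_recognize(words: List[str],
                  lexical: Dict[str, Set[str]],
                  binary: Dict[Tuple[str, str], Set[str]],
                  start: str = "S") -> bool:
    """Return True if `words` is derivable from `start` (grammar in CNF)."""
    n = len(words)
    # chart[i][j] holds the nonterminals that derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):             # three nested loops over
        for i in range(0, n - span + 1):     # positions: O(n^3) overall
            j = i + span
            for k in range(i + 1, j):
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return n > 0 and start in chart[0][n]


# Hypothetical toy grammar: S -> NP VP, VP -> V Adv, plus lexical rules.
LEXICAL = {"John": {"NP"}, "sleeps": {"V", "VP"}, "a_lot": {"Adv"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "Adv"): {"VP"}}
```

Each chart cell is computed once and reused, which is the computation sharing that tabular parsing relies on ; a full parser would store back-pointers in the cells to build a shared forest.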
DyALog - Exploring unification-based grammars
An environment (compiler dyacc + abstract machine) :
For compiling tabular parsers
◮ based on : stack automata & dynamic programming
◮ ⇒ computation sharing & loop detection
◮ ⇒ extraction of shared forests
Also a logic programming environment
◮ ⇒ power of logic + possibility of escaping within grammars
Strongly NLP oriented
◮ Ease grammar design : TFS, finite domains, . . .
◮ Multiple grammatical formalisms : DCG, BMG, TAG & TIG, RCG
◮ Functionalities and customization of parsers : multiple parsing strategies, forests, word lattice, lexicalization, robustness
Used for
◮ a robust Portuguese grammar (bidir. head-driven DCG+BMG) [GLINT] ⇒ dev. of a similar Spanish Grammar [COLE]
◮ MG compiler MGCOMP
◮ French wide coverage TIG/TAG grammar FRMG [ATOLL]
◮ French TAG grammar with semantic interface [LORIA]
Syntax - From CFGs to LFGs and RCGs
SYNTAX [Boullier]
Originally developed to parse Programming Languages with CFG (+attributes)
Extended with tabular techniques to cover all CFGs
Recently extended for Lexical Functional Grammars
◮ 2 passes : CFG + computation of decorations on shared parse forests + disambiguation phase
◮ ⇒ used for large coverage LFG grammar SXLFG [Clément, Sagot, Boullier]
Extremely efficient on CFGs and on LFGs
◮ + 300K sent. journalistic corpus (Monde Diplomatique) : 3/4 of sent. < 0.1s
RCG [Boullier]
Derived from technology behind SYNTAX
Very powerful and also efficient
Linguistic infrastructure : SxPipe
French Morpho-Syntactic chain SXPIPE [Sagot, Boullier, . . . ] :
Word and sentence segmentations, including multi-words
Named entities (Proper Nouns, Dates, Addresses, URL, :-) , . . . )
Spelling corrections
Returns a word lattice (DAG) as input for parsing
[Figure : word lattice over positions 0 to 7, with edges Jean, {Jean} jean, {abite} habite, en, outre, {en outre} en_outre, {au} à, {au} le, {1 , rue de la Pompe} _ADRESSE]
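A word lattice of this kind can be represented as a list of labelled edges in a DAG, and every path from the first to the last node is one candidate token sequence for the parser. The Python sketch below illustrates that idea ; the node numbers attached to each edge are an assumption reconstructed from the figure labels, and the data layout is hypothetical rather than SXPIPE's actual output format.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# (source node, surface form, normalized token, target node)
Edge = Tuple[int, str, str, int]

# Node numbering is assumed, not read off the original figure.
EDGES: List[Edge] = [
    (0, "Jean", "Jean", 1),
    (0, "Jean", "jean", 1),           # proper noun vs common noun reading
    (1, "abite", "habite", 2),        # spelling correction
    (2, "en", "en", 3),
    (3, "outre", "outre", 4),
    (2, "en outre", "en_outre", 4),   # multi-word unit
    (4, "au", "à", 5),                # "au" split into "à" + "le"
    (5, "au", "le", 6),
    (6, "1 , rue de la Pompe", "_ADRESSE", 7),  # named entity
]


def paths(edges: List[Edge], start: int, end: int) -> Iterator[List[str]]:
    """Enumerate the token sequences along every path from start to end."""
    outgoing: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for src, _form, token, dst in edges:
        outgoing[src].append((token, dst))

    def walk(node: int, prefix: List[str]) -> Iterator[List[str]]:
        if node == end:
            yield list(prefix)
            return
        for token, dst in outgoing[node]:
            prefix.append(token)
            yield from walk(dst, prefix)
            prefix.pop()

    yield from walk(start, [])


ALL_PATHS = list(paths(EDGES, 0, 7))
```

With these edges the lattice encodes four readings (two for "Jean", two for "en outre"). A tabular parser would consume the DAG directly instead of enumerating paths, which grows exponentially with the number of local ambiguities.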
Normalization
Motivation :
◮ Reusability of resources
◮ Interoperability of tools in Processing Chains
Participation (with Laurent Romary and Langue & Dialogue [LORIA]) in
◮ INRIA consortium SYNTAX
◮ French Normalangue/RNIL action
◮ ISO TC 37 SC4 on “Linguistic Resource Management” ⇒ head of French delegation to ISO meetings
◮ Partially involved in new European project LIRICS
Participation in emerging standards :
◮ MAF : Morphosyntactic Annotation Framework (Project leader)
◮ FSR : Feature Structure Representation (+ future companion FSD)
◮ DCR : Data Category Registry (registering ling. terminology)
◮ LMF : Lexical Markup Framework (lexicons)
◮ SynAF : Syntactic Annotation Framework
Environment of development
Principles :
◮ Coordinating small tools : LINGPIPE, TAG_UTILS, FOREST_UTILS, PARSERD, . . .
◮ Use of XML intermediary formats with multiple views (XSL) : Grammars (TAGML), MetaGrammars, Morpho-Syntax (MAF), Shared Forests (Derivations or Dependencies)
◮ Visualization tools (web services)
    MAF (MAFD) : http://atoll.inria.fr/mafdemo
    Parsing (PARSERD) : http://atoll.inria.fr/parserdemo
    Grammar (FRMG) : http://atoll.inria.fr/perl/frmg/tree.pl
Example : il a voulu en promettre une à Paul. (He has wanted to promise one of them to Paul)
[Figure : dependency view of the shared forest for this sentence ; nodes include promettre:v:107, vouloir:v:63, a:aux:54, à:prep:2, à:prep:41, Paul:np:51 ; edge labels include subject, object, vmod, preparg, Infl, V, S, clg, cll, N2, PP, void.]
3 Applications
French Parsing Evaluation Campaign EASy
Evaluation protocol : complex (formalisms, deep vs shallow, ambiguous or not, . . . )
◮ ⇒ evaluation on shallow constituents and a small set of dependencies
[Dec. 2004] Participation of FRMG & SXLFG (out of 14 parsers)
◮ ∼ 40K sentences, with ∼ 4200 manually annotated, several styles
Use of our deep parsers to produce non-ambiguous shallow information
◮ ⇒ robustness (full and partial parsing) ⇒ disambiguation heuristics + conversion
EASy campaign tests not just parsing but a full processing chain
◮ LEFFF, SXPIPE + parser + post-parsing
Example annotation :
◮ forms : F1 Jean, F2 abite, F3 en, F4 outre, F5 au, F6 1, F7 “,”, F8 rue, F9 de, F10 la, F11 Pompe
◮ chunks : GN1 [Jean] NV2 [abite] GR3 [en outre] GP4 [au 1 , rue de la Pompe]
◮ dependencies : subject–verb (GN1, NV2) ; compl.–verb (GP4, NV2) ; modifier–verb (GR3, NV2)
Note : Still waiting for definitive and complete results
EASy as a starting point
Using EASy expertise and resources to continue evaluations, get feedback,compare FRMG & SXLFG (and acquire statistics)
Corpus distribution (#sentences)
kind of corpus    share    #sentences
general           18%      755
litteraire        21%      881
mail              20%      852
medical           13%      554
oral_delic        12%      522
oral_elda         12%      502
questions          5%      203
[Figure : bar chart of %precision, %recall and %fmeasure (scale 0 to 100) per kind of corpus and in total. Bar values in extraction order : 84.0 81.9 73.7 80.5 72.2 77.0 86.1 77.5 67.4 76.5 63.5 70.9 63.4 70.8 83.5 68.7 68.1 76.0 64.0 71.1 62.7 71.5 83.4 68.7]
⇒ NEW : results already used to build a very accurate CFG-based chunker [Sagot]
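The three measures plotted above can be computed from a set of reference relations and a set of relations produced by a parser. The Python sketch below assumes the standard set-based definitions ; the exact EASy scoring rules are not given in these slides, and the "predicted" relations are invented for the example.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (relation name, source chunk, target chunk)


def precision_recall_f(gold: Set[Relation],
                       predicted: Set[Relation]) -> Tuple[float, float, float]:
    """Standard set-based precision, recall and balanced F-measure."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


# Reference relations for "Jean abite en outre au 1 , rue de la Pompe".
GOLD = {("subject", "GN1", "NV2"),
        ("compl", "GP4", "NV2"),
        ("modifier", "GR3", "NV2")}

# Hypothetical parser output: GP4 wrongly attached as a modifier.
PREDICTED = {("subject", "GN1", "NV2"),
             ("modifier", "GR3", "NV2"),
             ("modifier", "GP4", "NV2")}

P, R, F = precision_recall_f(GOLD, PREDICTED)
```

Here two of the three predicted relations are correct and two of the three reference relations are found, so all three measures come out at 2/3.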
BIOTIM : from books to knowledge bases
ACI “Datawarehouse” BIOTIM : Processing botanical descriptions (flora)
Corpus “Flore du Cameroun” (1963 – 2001)
Volumes    Pages    Av. Pages    Words    Taxons
31         9466     305          1.5M     ∼ 2400
Tasks :
◮ Corpus preparation : spelling correction (OCR), logical structuring
◮ Preliminary Linguistic Processing : morpho-syntactic processing
◮ Terminology extraction & first experiments with “governor-governee” relationship
◮ “Ontology” extraction : use of parsing to extract syntactic dependencies + Harris hypothesis : similar syntactic contexts hint at semantic similarities ⇒ lancéolé (adj) : leaf shape
◮ Text Mining : getting the properties of each taxon ; parsing + disambiguation through ontology ; knowledge bases : Description Logics (RDF / OWL)
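The Harris hypothesis mentioned in the “Ontology” extraction task can be illustrated with a small Python sketch : each word is described by the syntactic contexts it occurs in, and words with similar context vectors are candidates for the same semantic class. The dependency triples below are invented for the example, and cosine similarity is one possible measure, not necessarily the one used in Biotim.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (governor, relation, dependent)


def context_vectors(triples: Iterable[Triple]) -> Dict[str, Counter]:
    """Describe each dependent by the (relation, governor) contexts it fills."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for governor, relation, dependent in triples:
        vectors[dependent][(relation, governor)] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical dependencies, as a parser might extract them from a flora.
TRIPLES = [
    ("feuille", "mod", "lancéolé"),
    ("limbe", "mod", "lancéolé"),
    ("feuille", "mod", "ovale"),
    ("limbe", "mod", "ovale"),
    ("fleur", "mod", "jaune"),
]

VECTORS = context_vectors(TRIPLES)
SIM_SHAPES = cosine(VECTORS["lancéolé"], VECTORS["ovale"])
SIM_MIXED = cosine(VECTORS["lancéolé"], VECTORS["jaune"])
```

In this toy data "lancéolé" and "ovale" modify the same nouns and get a similarity close to 1, while "lancéolé" and "jaune" share no context and score 0.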
Linguistic Resource Acquisition
Motivation : Bootstrap
Using resources in tools (parsers, taggers, . . . ) ⇒ Validation
◮ error mining techniques (Van Noord, Clergerie)
Using tools to get resources from (raw) corpora
◮ Learning morphology and lemma (done for French & Slovak [Sagot])
◮ Learning syntactic information (sub-categorization, support verbs . . . )
◮ Learning semantic classes and selection restrictions
◮ Learning probabilities (for disambiguation)
◮ Generic idea : learning from contexts coming from dependencies
Reducing human cost ⇒ Free Linguistic Resources
Adapting resources to needs (evolution)
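The validation loop based on error mining can be pictured with a deliberately simple heuristic : a form that appears mostly in sentences the parser fails on is suspected of being badly described in the lexicon. The Python sketch below computes that failure ratio on an invented corpus ; it is a simplification for illustration, and the techniques cited on the slide are more elaborate than this ratio.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def suspicion_rates(corpus: Iterable[Tuple[List[str], bool]],
                    min_count: int = 2) -> Dict[str, float]:
    """Share of failed sentences among the sentences containing each form.

    `corpus` yields (tokens, parsed_ok) pairs ; forms seen in fewer than
    `min_count` sentences are ignored as unreliable.
    """
    seen: Counter = Counter()
    failed: Counter = Counter()
    for tokens, parsed_ok in corpus:
        for form in set(tokens):   # count each form once per sentence
            seen[form] += 1
            if not parsed_ok:
                failed[form] += 1
    return {form: failed[form] / seen[form]
            for form in seen if seen[form] >= min_count}


# Hypothetical parsing results: (sentence tokens, did the parse succeed?)
CORPUS = [
    (["Jean", "promet", "un", "livre"], True),
    (["Jean", "abite", "ici"], False),
    (["Paul", "dort", "ici"], True),
    (["Marie", "abite", "Paris"], False),
]

RATES = suspicion_rates(CORPUS)
```

On this toy corpus the misspelled form "abite" only occurs in failed sentences and gets the highest rate, which is the kind of signal that points a lexicon developer to a missing or wrong entry.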
4 Actions
Actions : ARC RLT (2001 – 2002)
ARC “Linguistic resource acquisition and representation for TAGs”
Participants : ATOLL (coordinator), Langue et Dialogue, Calligramme, TALaNa (Univ. Paris 7)
Objectives :
◮ Semi-automatic acquisition of a French TAG lexicon
◮ XML representation for TAGs
◮ Emergence of the notion of Meta-Grammars
[Figure : diagram with boxes Preferences, Tree bank, Corpus, Annotated corpus, Grammar, Metagrammar, Lexicon, Pre Parsing, Supervised validation, Parser and Acquisition ; two arrows labelled compilation]
Actions (cont’d)
Biotim : ACI “Masse de Donnée” (2003 – 2006)
◮ Participants : IRD, LIFO (Orléans), IMEDIA, Vertigo (CNAM), INRA
◮ Objectives (ATOLL) : text mining of botanical corpora
EASy/EVALDA : French Technolangue action for the evaluation of parsing systems (2003 – 2005)
◮ Participants : ELDA, LIMSI, LLF, ATILF, DIAM-STIM, DELIC, GREYC, L&D, LPL, Synapse Développement, Systal-Pertimm, Xerox, LIC2M, LATL, EPFL, FT R&D, Tagmatica, VALORIA, ERSS
◮ Objectives (ATOLL) : participation with FRMG & SXLFG
RNIL/Normalangue : Technolangue action for the normalization of linguistic resources (2003 – 2005)
◮ Participants : L&D, AFNOR, ATILF, LLF Jussieu, IRIN, LIMSI, CLIPS, RESO, CEA, XRCE, EDF R&D, SYSTRAN, France Telecom R&D, Systems & Defense Electronics, SOFTISSIMO, SINEQUA, LUCID-ID, J-WAY
◮ Objectives (ATOLL) : project leader on MAF + head of French delegation for ISO TC37 SC4
Actions (cont’d)
LexSynt : ILF-funded action (2005 – ???)
◮ Coordination : Sylvain Kahane (Modyco), Susanne Salmon-Alt (ATILF), Éric de la Clergerie (Projet ATOLL, INRIA)
◮ Participants : ATILF, ERSS, IGM, LPL, Lattice, MoDyCo, ATOLL, Calligramme, L&D, Signes, ATV (K.U. Leuven), OLST (Univ. of Montreal), Normalangue/RNIL
◮ Objectives : to design & exploit a reference syntactic-semantic lexicon for French
GENI : ARC on Generation and Inference (2002 – 2003)
◮ Participants : L&D (coord.), Orpailleur, TALaNa, IRIT
◮ Objectives (ATOLL) : TAG, generation & tabulation, lexical semantics
e-COTS : RNTL action (2001 – 2002, extended 2003), Bernard Lang
◮ Participants : Thomson-CSF, EDF and Bull
◮ Objectives : set up an open and cooperative Web portal to manage information about software components
MOPROSCO : ARC (2005 – 2006), participation of Areski Nait-Abdallah
5 Collaborations
INRIA Collaborations
Langue et Dialogue (LORIA) : TAG ; Meta-Grammars ; Linguistic Infrastructure ; Normalization ; Lexica
◮ ARC RLT and GENI ; Normalangue ; EASy ; LexSynt
Calligramme (LORIA) : MG ; ARC RLT ; LexSynt
Signes (Futurs, Bordeaux) : LexSynt
◮ increased collaboration with the arrival at Bordeaux of Lionel Clément
IMEDIA (INRIA Rocquencourt) : Biotim
Orpailleur (LORIA) : text mining & knowledge extraction (ARC GENI)
French Collaborations
Lattice/TALaNa (University Paris 7) :
◮ TAG, MG (RLT & GENI), lexica (LexSynt), . . .
◮ people (Lionel Clément & Alexis Nasr)
◮ co-supervision of Sagot’s PhD with Laurence Danlos
◮ discussions towards a common structure
MoDyCo (University Paris 10 - Nanterre & CNRS) with Sylvain Kahane : linguistic formalisms ; LexSynt
LIFO (Orléans) : NLP and Machine learning ; Biotim
BIODIVAL (IRD, Orléans) : Biotim
+ Potential collaborations with
◮ IGM (Marne-La-Vallée) [LADL tables, LexSynt]
◮ ERSS (Toulouse) [Acquisition on corpus]
◮ LIS (Paris 6) [Knowledge rep. and use ; Biotim+]
◮ . . .
International Collaborations
COLE (La Coruña & Vigo, Spain) – Manuel Vilares Ferro
◮ TAGs ; Parsing techniques ; grammars ; use of DyALog ; information retrieval and extraction applications
◮ French-Spanish “Programme d’Actions Intégrées” [PAI] PICASSO ⇒ visits (several-months student visits) ⇒ organization of 2 TAPD editions (Paris 1998, Vigo 2000)
◮ Co-supervised PhD of Miguel Alonso Pardo (on 2SA and TAGs) ⇒ many common papers
GLINT* (New Univ. of Lisbon) – Gabriel Pereira Lopes
◮ Use of DyALog and of machine learning techniques for NLP
◮ French-Portuguese programs RELING, ICTII and PAI PESSOA ⇒ visits
XTAG (Univ. of Pennsylvania) – Aravind Joshi
◮ TAG parsing and Meta-Grammars ; PhD student A. Kinyon
◮ Potential NSF–INRIA collaboration
RIADI (Univ. of La Manouba, Tunisia) – Mohamed Ben Ahmed
◮ Preliminary contacts to set up a cooperation on the processing of Arabic
Visibility
(New) Invitation to deliver a course at the ESSLLI’06 summer school ; invitation of Bernard Lang to many events
Editorial board of French journal “Traitement Automatique des Langues” ; Guest Editor for the T.A.L. issue on “Evolutions in Parsing” (2003)
Program Committees of 11 national and int. conferences and workshops
9 national and int. PhD Juries, including 3 reviews
Standardization committee of ISO TC37 SC4 (head of French delegation)
Consulting and project reviews for actions Technolangue and ACI ; Bernard Lang consulted by companies, administrations, and government
Paper reviews for journals (T.A.L., JoLLI, TCS) and conferences/workshops (TALN, EACL, IWPT, ACL, ICLP, PPDP, COLING, ESSLLI, IJCNLP, ICALP, POPL, Formal Grammars, MOL)
Publications
ATOLL’s publications are available on line at http://atoll.inria.fr/biblio
Journals : T.A.L., TCS, Document numérique
Conferences & Workshops : ACL, NACL, COLING, TALN, IWPT, TAG+, FG, MOL, LACL, LREC, . . .
Publications per year (columns : 02, 03, 04, 05)
PhD Thesis [Sagot] : 1TBF
H.D.R [Clergerie] : 1TBF
Journal : 1BL, 2+1BL, 1BL, 2+3S+4BL
Conference proceedings : 4, 6, 6, 11+4P+2S+2BL
Book chapter : 1BL 2 1
Book (edited) : 1
Technical report : 1 1
Total : 5, 11, 10, 29
6 Conclusions
About the last 4 years (2002 – 2005)
Difficulties in recruitment (no CR since 1994), but ATOLL has nevertheless welcomed brilliant (temporary) members : Lionel Clément, Benoît Sagot, Guillaume Rousse
Fulfilled most of the planned objectives
◮ syntactic formalisms ; parsing techniques ; robustness ; lexica ; linguistic infrastructure ; evaluation ; applications
Got involved in unexpected but natural actions : EASy and Normalangue
◮ ⇒ (EASy) fast development pace for a real scale processing chain
Large opening of our thematics
◮ morpho-syntax, lexica, corpora, grammar design, . . .
Now working at a much larger scale
◮ wide coverage grammars, large lexicon, processing of large corpora
Scientific evolutions (2006 – 2009)
Better syntactic formalisms and syntactic descriptions
◮ MGs, MC-TAGs, Meta-RCGs, dependencies, constraints
◮ ⇒ ARC proposal MOSAÏQUE (coordinator Signes)
Increased parsing robustness and efficiency
◮ acquisition and exploitation of stochastic information (Nasr)
◮ algorithms for n-best beam disambiguators on shared derivation forests
◮ guiding techniques & cascade-based parsing
◮ error correction techniques
◮ integration of emerging techniques (e.g., HPSG’s quick check filtering)
Guiding + Tabular techniques + Stochastic methods + Evaluation
◮ ⇒ [Ambition] marry deep and shallow parsing, ambiguous or not, to get efficient and accurate parsing systems
Scientific evolutions (cont’d)
Linguistic knowledge acquisition & lexica
◮ use of EASy references and Paris 7 treebank
◮ acquisition on raw corpora
◮ acquisition and exploitation of lexical semantic info.
Information extraction applications (Biotim follow-ups)
◮ possibly with question answering and (multilingual) generation
Opening to multilingualism (Arabic ?)
◮ application of our tools and acquisition methodologies to new languages
Organizational aspects
Real need to renew the composition of ATOLL
planned departures ⇒ impossible to continue without new member(s)
◮ ? maintenance & development of tools & resources
◮ ? conducting large scale experiments & exploiting results
◮ ? covering enough computational linguistics sub-fields
◮ ? following collaborations and actions on ATOLL’s side
Discussions with TALaNa towards a common structure
Reinforcing collaborations through INRIA Action d’Envergure for NLP ?
◮ exchanging resources, tools & expertise
◮ sharing dev. effort
◮ better coverage of NLP sub-fields
Thank you !