Syntactic annotation in CGN: lessons learned and to be learned
Ineke Schuurman
Centre for Computational Linguistics
Katholieke Universiteit Leuven
15-11-2011 Paris 2
This talk ...
• Why CGN: Spoken Dutch Corpus?
• At that time …
• Other layers
– Orthographic transcription
– PoS tagging
• Syntactic annotation
– Dependencies and categories
• Spoken language
– “standard” language, disfluencies
• LASSY/SoNaR: Written Dutch Corpus
• What to take into account when planning a ‘spoken treebank’
Why CGN?
Dutch Language Union
• Dutch/Flemish organization taking care of the common language
• 1997-8: report on the state of the art wrt Language & Speech Technology
• 1998: Spoken Dutch Corpus, 5 years, 2/3 Netherlands - 1/3 Flanders, balanced
– 1000 hours, +/- 10M words
– 1M words with syntactic annotation
• Both research purposes and services (EU) / industry
At that time
This talk: focus on textual aspects!
--------------------------------------------------------
• No taggers or parsers that could be reused
• Existing grammars cover(ed) the northern variant of Dutch
• No ‘formal’ grammar
►start from scratch
Other layers
• Relevant for syntax:
– Orthographic transcription
– PoS tagging
• All layers in parallel, but per fragment: layer A finished before the start of layer B (except for errors)
• Reason: time
• But: gave us the opportunity to express wishes/needs wrt other layers
• Example: handling of specific types of words.
Transcription and PoS
An example:
Specific types of words
*v words in another language (not 'adopted' in Dutch)
*a not fully realized words (gaan probe instead of gaan proberen)
*x words that could not be (fully) understood (also xxx, ggg)
*u mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
*d dialectal words
One or more words?
zo’n vs zo ‘n (such a): one token!
But hebde*d (lit. have you) realized as hebt*d de*d: two tokens
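The marker convention above can be read off a token mechanically. The following is a minimal sketch, not part of the CGN tooling: the name `split_marker` and the descriptions in `MARKERS` are illustrative, and it assumes a marker is a single letter from the list above, attached to the end of a token with `*`.

```python
import re

# Marker letters as listed on the slide; the descriptions are paraphrased.
MARKERS = {
    "v": "word in another language",
    "a": "not fully realized word",
    "x": "word that could not be (fully) understood",
    "u": "mispronounced word",
    "d": "dialectal word",
}

# A token is a word form, optionally followed by '*' and one marker letter.
_TOKEN = re.compile(r"^(?P<form>.+?)(?:\*(?P<marker>[vaxud]))?$")


def split_marker(token):
    """Return (form, marker); marker is None for unmarked tokens."""
    match = _TOKEN.match(token)
    return match.group("form"), match.group("marker")


# hebde*d realized as two tokens, each carrying its own marker.
for tok in "hebt*d de*d".split():
    form, marker = split_marker(tok)
    print(form, MARKERS.get(marker, "unmarked"))
```

Keeping the word form and the marker apart makes it easy for later layers (PoS, syntax) to use the marker as a feature without touching the transcription itself.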
Syntactic analysis: goal CGN
• Annotation in theory-neutral format in order to be useful for as many people as possible
• Categories: NP, PP, …
• Functions/dependencies: subject, object1, …
• As automatic as possible:
– Tool from NEGRA-corpus: Annotate
– for German
– same desiderata as CGN (contrary to Dutch AMAZON-parser)
Annotate
• Developed for NEGRA-project (Saarbrücken)
– Oliver Plaehn, Thorsten Brants
• Semi-automatic annotation
– Works with tagger and parser
– Suggests structures
• Combined with Cascaded Markov Models (Brants)
– Bootstrapping approach possible
Annotate screen
Annotate ‘correction’ format
Annotate export format
Principles of syntactic annotation
• Structures as flat as possible
• Only new level when there is a new head
• No branching when just one node is involved
• No duplication of functions (1 SU, 1 OBJ1, …)
• In principle just non-branching heads
• Allowed:
– multiple branching
– crossing dependencies
• Input: simplified PoS.
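Two of the principles above (no branching for a single node, no duplicated functions) are easy to check automatically. The sketch below is only an illustration under stated assumptions: the `Node` class, the `violations` function and the example labels (`S`, `WW`, `LID`, `N`, `DET`) are hypothetical and do not reproduce the Annotate data model, and the set of functions that must be unique is limited to the two the slide names.

```python
from collections import Counter
from dataclasses import dataclass, field

# Functions assumed to occur at most once per node (the slide names SU and OBJ1).
UNIQUE_FUNCTIONS = {"SU", "OBJ1"}


@dataclass
class Node:
    cat: str                                      # category (NP, PP, ...) or tag of a leaf
    children: list = field(default_factory=list)  # (function, Node) pairs


def violations(node, path="top"):
    """Report unary branching and duplicated unique functions, top-down."""
    found = []
    if len(node.children) == 1:
        found.append(f"{path}: {node.cat} has a single daughter")
    counts = Counter(func for func, _ in node.children)
    for func in sorted(UNIQUE_FUNCTIONS):
        if counts[func] > 1:
            found.append(f"{path}: {func} occurs {counts[func]} times under {node.cat}")
    for func, child in node.children:
        found.extend(violations(child, f"{path}/{func}"))
    return found


good = Node("S", [
    ("SU", Node("VNW1")),
    ("HD", Node("WW")),
    ("OBJ1", Node("NP", [("DET", Node("LID")), ("HD", Node("N"))])),
])
bad = Node("S", [
    ("SU", Node("VNW1")),
    ("SU", Node("VNW1")),
    ("HD", Node("NP", [("HD", Node("N"))])),
])
print(violations(good))
print(violations(bad))
```

Functions that may legitimately repeat under one node, such as modifiers, are deliberately left out of `UNIQUE_FUNCTIONS`.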
Fewer PoS tags
Simplified PoS
• PoS: over 300 tags
– Over 100 for pronouns
– Not problematic at all, often unique token/tag combinations
• Not all details necessary for SA
• Example full tagset
– T501a VNW(pers,pron,nomin,vol,1,ev) ik (I)
– T501o VNW(pers,pron,nomin,vol,3,ev,masc) hij (he)
• Example simplified tagset
– VNW1 VNW(pers,pron) personal pronoun
– In graph: both T501a and VNW1
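The relation between the full and the simplified tag in the pronoun example can be sketched as truncating the feature list. This is an assumption made for illustration only: the real simplified tagset is a designed mapping (VNW1 and so on), and the rule "keep the first two features" as well as the name `simplify_tag` are hypothetical.

```python
def simplify_tag(tag, keep=2):
    """Shorten 'VNW(pers,pron,nomin,vol,1,ev)' to its first `keep` features."""
    head, paren, rest = tag.partition("(")
    if not paren:
        return tag  # no feature list to shorten
    features = rest.rstrip(")").split(",")
    return f"{head}({','.join(features[:keep])})"


full_tags = {
    "ik": "VNW(pers,pron,nomin,vol,1,ev)",
    "hij": "VNW(pers,pron,nomin,vol,3,ev,masc)",
}
for word, tag in full_tags.items():
    # Both pronouns end up with the same simplified tag.
    print(word, tag, "->", simplify_tag(tag))
```

Since the graph shows both tags, the simplification loses nothing: the full tag stays available next to the simplified one.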
Syntactic simplifications
Other simplifications
• Obj2 – indirect object (dative)
– meewerkend voorwerp (recipient): Ik geef hem een boek / een boek aan hem (I give him a book)
– belanghebbend voorwerp (beneficiary): Ik koop hem een boek / een boek voor hem (I buy him a book)
• Bepaling van gesteldheid (~predicative complement)
– Hij verft de deur blauw (he paints the door blue)
– Hij vindt het boek leuk (he likes the book)
– Hij nam het boek lachend aan (laughing, he accepted the book)
Results
Even then:
• Annotate did most NPs and PPs very well, but often failed for the more complex parts
• In some sense surprising, as the results for German were much better.
However:
• In that case written language was involved.
Training for spoken language is much harder!
Details CGN corpus
Balanced corpus:
• Types of documents (next slide)
• Speaker characteristics
– Sex
– Age
– Geographic region
– Socio-economic class
– Level of education
• 2/3 Netherlands, 1/3 Belgium (Flanders)
• Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN)
Details CGN corpus
►many types of documents
• Read-aloud written: Literature read aloud (library for the blind)
• Written to be spoken:
– News broadcasts
– Lectures
• Spoken (spontaneous)
– Interviews
– Phone calls
– Debates
– Spontaneous conversations with x people (over lunch etc).
Variation
To some extent differences in written language, much more in spoken variants, esp. in spontaneous speech
• Separable verbs
– NL dat ze hem op wilde bellen (that she wanted to call him)
– VL dat ze hem wilde opbellen
• Other choice of auxiliaries
– NL Ze is het komen brengen (she came and brought it)
– VL Ze heeft het komen brengen
• Other words for same concept, same words for different concepts
– Pompbak-gootsteen (sink), namiddag (afternoon-late afternoon)
Grammars/dictionaries: mostly northern written variant
Disfluencies
Partially realized words
hilari*a instead of hilarisch (EN hilarious)
Analyzed as if realized
***
Ik doe West- en Oost-Vlaanderen
I’ll take care of West- and Oost-Vlaanderen
Short for: West-Vlaanderen en Oost-Vlaanderen
Completely regularly analyzed as conjunction (CONJ)
Disfluencies
When too little of a token is realized, such a token is ignored
awel genen TV meer en genen boe*a gene voetbal meer .
EN: So no more tv and no more football
Example of disfluency (repetition)
Disfluencies
Mixed repetition/correction
Ze was bijna hileri*a hilari*a
She was almost hilarious
hileri*a is corrected as hilari*a, only the corrected form is included in the analysis
Die verd*a die vervl*a die krankzinnige hond
That damn*, that cursed*, that crazy dog
Only last 3 words (that crazy dog) included in graph
Disfluencies
Wrong pronunciation
Dat is een serieus plobleem*u
Dat is een serieus probleem
That’s a serious problem
Analyzed as if the ‘correct’ word was involved
***
Words in foreign language
In spoken and written language:
Words in another language, and not found in a Dutch dictionary:
umbrella*v, plus*v de*v temps*v, à la carte
not: rendez-vous, cinema, cognac (in Dutch dictionaries)
• Single words: just like their Dutch counterpart
• Strings: only ‘top’ label presented
• Sentences: not analyzed.
Pros and cons of markings
Markings (*a, etc) have proven to be useful for PoS and SA.
But:
they should have been removed afterwards, i.e. all information should have been contained in the tags, and the orthographic level should contain only orthography
Problem: other groups wanted them at orthographic level for speech recognition purposes
Solution: add a field without markings
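The solution of an extra field without markings can be sketched as follows. This is a minimal illustration, not the CGN format: the field names `orthography` and `plain` and the function `add_plain_field` are hypothetical, and markings are assumed to be `*` plus one of the letters v, a, x, u, d at the end of a token.

```python
import re

# '*' plus one marker letter at the end of a token (letters taken from the slides).
_MARKER = re.compile(r"\*[vaxud]$")


def add_plain_field(record):
    """Return a copy of the record with an extra 'plain' field without markings."""
    tokens = record["orthography"].split()
    plain = " ".join(_MARKER.sub("", tok) for tok in tokens)
    return {**record, "plain": plain}


record = {"orthography": "Dat is een serieus plobleem*u"}
enriched = add_plain_field(record)
print(enriched["orthography"])  # marked version, kept for speech recognition
print(enriched["plain"])        # orthography only
```

Both versions stay available, so groups that need the markings at the orthographic level and groups that want pure orthography can each read their own field.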
Syntactic annotation
Missing and superfluous words
There are no ‘ungrammatical’ sentences, all sentences are to be analyzed!
• Missing elements: just accept it
• Superfluous elements: just accept it
BUT there are some exceptions:
repetition
‘accidental’ sentences
Not analyzed parts
Sometimes parts of a ‘sentence’ are ‘ignored’:
• Repairs
Ik zie hem morg*a overmorgen
I’ll see him the day after tomorrow
• Repetitions
Hij is in in vergadering
He is in a meeting
Or not connected:
• ‘accidental’ sentences/units
Ik heb nooit ik ben lerares
I have never I am a teacher
• Uh-insertion (hesitation marker)
Ze heeft uh zeven dochters
She has seven daughters.
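Two of the cases above, immediate repetitions and uh-insertion, can be pre-flagged with a simple heuristic before a corrector looks at the sentence. The sketch below is an assumption-laden illustration rather than the CGN procedure: the function name `classify` and the labels `analysed`, `ignored` and `unconnected` are hypothetical, the choice to drop the first member of a repetition is an assumption, and legitimate doubled words would be flagged wrongly, so a human decision remains necessary.

```python
HESITATIONS = {"uh"}  # hesitation marker from the slide; extend as needed


def classify(tokens):
    """Label each token as 'analysed', 'ignored' (repetition) or 'unconnected' (hesitation)."""
    labels = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in HESITATIONS:
            labels.append("unconnected")
        elif i + 1 < len(tokens) and low == tokens[i + 1].lower():
            # First member of an immediate repetition; the second one is kept.
            labels.append("ignored")
        else:
            labels.append("analysed")
    return list(zip(tokens, labels))


for sentence in ("Hij is in in vergadering", "Ze heeft uh zeven dochters"):
    print(classify(sentence.split()))
```

Repairs such as morg*a overmorgen and ‘accidental’ units are not covered: recognizing them needs more than a comparison of neighbouring tokens.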
Examples
More of the same
Asyndetic conjunction
Discourse phenomena
Some examples of ‘discourse’ within a sentence
Accidental unit
‘Accidental’ unit, discourse
parts not connected
Syntactic annotation
sentence
vs
discourse
Atypical ‘sentences’
Often: discourse
Complicating factors
No punctuation apart from full stop, question mark, ellipsis
• ‘wrong order’ of sentences when more people are talking at the same time!
►Tricky wrt coreference, temporal reasoning etc
Spelling: incorrect (but correct with other meaning)
• U zij de glorie (Thine be the glory)
• U zei de glorie (‘zei’ meaning ‘said’)
• Ik zal haar eraan houden (houden aan: to keep a promise)
• Ik zal haar er aanhouden (aanhouden: to arrest)
►context, recordings
Written corpus: Lassy/SoNaR
STEVIN programme (Flemish/Dutch - 2004-2011)
D-Coi / LASSY / (SoNaR)
1M SA written text, manually corrected, plus
1.500M SA automatically
ALPINO parser (Groningen)
Largely inspired by CGN, based on HPSG
Some differences
• Mentioning of ‘hidden’ subjects, objects
– Hij heeft een boek gekocht (He has bought a book)
Alpino
• Alpino grammar: HPSG-based
• ‘Constructional’ approach:
– rich lexical representations
– many detailed, construction specific lexical rules (+/- 600)
• Grammar based parsing very efficient, esp when combined with specific rules
• Large lexicon (100.000+ entries, 200.000+ NEs)
– Stored as perfect hash finite automaton (Daciuk)
• Crucial: Integrated tagger (=/= CGN tagger!)
• Left corner parser
Alpino (as is) and CGN
Parsing the CGN-corpus with Alpino
• very bad results
• reason might be: it uses a ‘wrong’ grammar, inadequate lexicon etc etc
As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy-format. There are, however, still differences in the way a few phenomena are handled.
Lassy vs CGN
• Subject/direct objects wrt infinitives and participles
• Partitives (one of them said …): in CGN separate label PART, in Lassy combination of HD and MOD
• LASSY: head always lexically anchored
• In LASSY SBAR-complement always VC-label, in CGN either OBJ1 or VC
• …
Analyses not fully identical, but 99% are!
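One of the differences listed above, the label of SBAR complements, shows what a ‘translation’ step from CGN to the Lassy convention might look like. This is a sketch of a single rule under stated assumptions, not the actual conversion: the function name `cgn_to_lassy_function` is hypothetical, and the other differences (partitives, hidden subjects and objects, lexically anchored heads) are not handled.

```python
def cgn_to_lassy_function(function, category):
    """Relabel a CGN dependency so that SBAR complements always carry VC."""
    if category == "SBAR" and function == "OBJ1":
        return "VC"
    return function  # everything else is left unchanged in this sketch


for function, category in [("OBJ1", "SBAR"), ("VC", "SBAR"), ("OBJ1", "NP")]:
    print(function, category, "->", cgn_to_lassy_function(function, category))
```

Rules of this kind are cheap to apply across a corpus, which fits the observation that only a few phenomena are still handled differently.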
Syntactic annotation: Lassy
Syntactic annotation: CGN
To be taken into account
In general:
• Take care of IPR
• Be prepared to consult other layers
• Use a flexible bug reporting system
• “Spoken language”: grammar/system should be very flexible
• Alignment may be very time consuming
Be aware that, as far as consistency is concerned, the really hard cases are not the most important ones. What matters more are the cases the correctors do not recognize as problematic (because in those cases they do not consult others).
GOOD LUCK!