Syntactic annotation in CGN: lessons learned and to be learned

Ineke Schuurman
Centre for Computational Linguistics
Katholieke Universiteit Leuven

15-11-2011 Paris 2

This talk ...

• Why CGN: Spoken Dutch Corpus?
• At that time …
• Other layers
  – Orthographic transcription
  – PoS tagging
• Syntactic annotation
  – Dependencies and categories
• Spoken language
  – “standard” language, disfluencies
• LASSY/SoNaR: Written Dutch Corpus
• What to take into account when planning a ‘spoken treebank’

Why CGN?

Dutch Language Union
• Dutch/Flemish organization taking care of the common language
• 1997-8: report on the state of the art wrt Language & Speech Technology
• 1998: Spoken Dutch Corpus, 5 years, 2/3 Netherlands - 1/3 Flanders, balanced
  – 1000 hours, +/- 10M words
  – 1M words with syntactic annotation
• Both research purposes and services (EU) / industry

At that time

This talk: focus on textual aspects!

--------------------------------------------------------

• No taggers or parsers that could be reused
• Existing grammars cover(ed) the northern variant of Dutch
• No ‘formal’ grammar

► start from scratch

Other layers

• Relevant for syntax:
  – Orthographic transcription
  – PoS tagging
• All layers in parallel, but per fragment: layer A finished before start of layer B (except for errors)
• Reason: time
• But: gave us the opportunity to express wishes/needs wrt other layers
• Example: handling of specific types of words.

Transcription and PoS

An example:

Specific types of words

*v  words in another language (not 'adopted' in Dutch)
*a  not fully realized words (gaan probe instead of gaan proberen)
*x  words that could not be (fully) understood (also xxx, ggg)
*u  mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
*d  dialectal words

One or more words?
zo’n vs zo ‘n (such a): one token!
But hebde*d (lit. have you) realized as hebt*d de*d: two tokens
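
A minimal Python sketch of how such markings could be split from the word form when processing the transcription. The regular expression and the names MARKERS and split_marking are assumptions made for illustration; they are not part of the CGN tooling.

```python
import re

# Marker letters as listed on the slide; the descriptions are paraphrased.
MARKERS = {
    "v": "word in another language",
    "a": "not fully realized word",
    "x": "word that could not be (fully) understood",
    "u": "mispronounced word",
    "d": "dialectal word",
}

# Assumption: a marking is a trailing '*' followed by one marker letter.
_MARKED = re.compile(r"^(?P<form>.+)\*(?P<mark>[vaxud])$")


def split_marking(token):
    """Return (form, marker); marker is None for unmarked tokens."""
    match = _MARKED.match(token)
    if match is None:
        return token, None
    return match.group("form"), match.group("mark")


if __name__ == "__main__":
    for token in ["hebt*d", "de*d", "zo’n", "om-uh-dat*u"]:
        form, mark = split_marking(token)
        print(form, MARKERS.get(mark, "unmarked"))
```

Keeping the marker separate from the form means later layers can decide per marker type how to treat the token.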

Syntactic analysis: goal CGN

• Annotation in theory-neutral format in order to be useful for as many people as possible

• Categories: NP, PP, …
• Functions/dependencies: subject, object1, …
• As automatic as possible:
  – Tool from NEGRA-corpus: Annotate
    – for German
    – same desiderata as CGN (contrary to Dutch AMAZON-parser)

Annotate

• Developed for NEGRA-project (Saarbrücken)
  – Oliver Plaehn, Thorsten Brants
• Semi-automatic annotation
  – Works with tagger and parser
  – Suggests structures
• Combined with Cascaded Markov Models (Brants)
  – Bootstrapping approach possible

Annotate screen

Annotate ‘correction’ format

Annotate export format

Principles of syntactic annotation

• Structures as flat as possible
• Only new level when there is a new head
• No branching when just one node is involved
• No duplication of functions (1 SU, 1 OBJ1, …)
• In principle just non-branching heads
• Allowed:
  – multiple branching
  – crossing dependencies
• Input: simplified PoS.
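
Two of these principles (no branching for a single node, no duplicated functions) lend themselves to an automatic consistency check. The Python sketch below is an illustration only: the Node class, the violations function and the set UNIQUE_FUNCTIONS are hypothetical, and which functions must be unique per node is an assumption extrapolated from "1 SU, 1 OBJ1, …".

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumption: functions that may occur at most once under one node.
UNIQUE_FUNCTIONS = {"HD", "SU", "OBJ1", "OBJ2"}


@dataclass
class Node:
    function: str   # dependency label, e.g. "SU"
    category: str   # phrase category or PoS, e.g. "NP"
    children: list = field(default_factory=list)


def violations(node, path="top"):
    """Collect violations of the two principles in a (sub)tree."""
    found = []
    # No branching when just one node is involved.
    if len(node.children) == 1:
        found.append(f"{path}: {node.category} has a single child")
    # No duplication of functions under the same node.
    counts = Counter(child.function for child in node.children)
    for function, count in counts.items():
        if function in UNIQUE_FUNCTIONS and count > 1:
            found.append(f"{path}: {count} x {function} under {node.category}")
    for index, child in enumerate(node.children):
        found.extend(violations(child, f"{path}/{child.function}[{index}]"))
    return found


if __name__ == "__main__":
    tree = Node("--", "SMAIN", [
        Node("SU", "VNW"),
        Node("HD", "WW"),
        Node("OBJ1", "NP", [Node("HD", "N")]),  # unary NP: should be flagged
    ])
    for message in violations(tree):
        print(message)
```

A check like this only catches formal violations; it says nothing about whether the chosen labels are linguistically right.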

Fewer PoS tags

Simplified PoS

• PoS: over 300 tags
  – Over 100 for pronouns
  – Not problematic at all, often unique token/tag combinations
• Not all details necessary for syntactic annotation (SA)
• Example full tagset
  – T501a VNW(pers,pron,nomin,vol,1,ev) ik (I)
  – T501o VNW(pers,pron,nomin,vol,3,ev,masc) hij (he)
• Example simplified tagset
  – VNW1 VNW(pers,pron) personal pronoun
  – In graph: both T501a and VNW1
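
The relation between the two tagsets can be sketched in Python as below. This is a sketch under stated assumptions: the rule "keep the first two features of a VNW tag" is inferred from the single example above, whereas the real simplified tagset was presumably defined as a table; simplify and KEPT_FEATURES are hypothetical names.

```python
import re

# A full tag looks like POS(feature,feature,...).
_TAG = re.compile(r"^(?P<pos>[A-Z]+)\((?P<features>[^)]*)\)$")

# Assumption: number of leading features kept per part of speech.
# Only the VNW entry is backed by the example on the slide.
KEPT_FEATURES = {"VNW": 2}


def simplify(tag):
    """Reduce a full PoS tag to a coarser tag for syntactic annotation."""
    match = _TAG.match(tag)
    if match is None:
        return tag  # leave anything unexpected untouched
    pos = match.group("pos")
    features = [f for f in match.group("features").split(",") if f]
    kept = features[: KEPT_FEATURES.get(pos, 0)]
    return f"{pos}({','.join(kept)})"


if __name__ == "__main__":
    print(simplify("VNW(pers,pron,nomin,vol,1,ev)"))
    print(simplify("VNW(pers,pron,nomin,vol,3,ev,masc)"))
```

Both full tags from the example collapse to the same simplified tag, which is the point of the simplification: the parser sees fewer distinctions, while the full tag stays available in the graph.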

Syntactic simplifications

Other simplifications

• Obj2 – indirect object (dative)
  – meewerkend voorwerp (recipient)
    • Ik geef hem een boek / een boek aan hem (I give him a book)
  – belanghebbend voorwerp (beneficiary)
    • Ik koop hem een boek / een boek voor hem (I buy him a book)
• Bepaling van gesteldheid (~predicative complement)
  – Hij verft de deur blauw (he paints the door blue)
  – Hij vindt het boek leuk (he likes the book)
  – Hij nam het boek lachend aan (laughing, he accepted the book)

Results

Even then:

• Annotate did most NPs and PPs very well, but often failed for the more complex parts

• In some sense surprising as the results for German were much better.

However:
• In that case written language was involved.

Training for spoken language is much harder!

Details CGN corpus

Balanced corpus:
• Types of documents (next slide)
• Speaker characteristics
  – Sex
  – Age
  – Geographic region
  – Socio-economic class
  – Level of education
• 2/3 Netherlands, 1/3 Belgium (Flanders)
• Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN)

Details CGN corpus

► many types of documents
• Read-aloud written: literature read aloud (library for the blind)
• Written to be spoken:
  – News broadcasts
  – Lectures
• Spoken (spontaneous)
  – Interviews
  – Phone calls
  – Debates
  – Spontaneous conversations with x people (over lunch etc).

Variation

To some extent differences in written language, much more in spoken variants, esp. in spontaneous speech

• Separable verbs
  – NL dat ze hem op wilde bellen (that she wanted to call him)
  – VL dat ze hem wilde opbellen
• Other choice of auxiliaries
  – NL Ze is het komen brengen (she came and brought it)
  – VL Ze heeft het komen brengen
• Other words for same concept, same words for different concepts
  – Pompbak-gootsteen (sink), namiddag (afternoon-late afternoon)

Grammars/dictionaries: mostly northern written variant

Disfluencies

Partially realized words

hilari*a instead of hilarisch (EN hilarious)

Analyzed as if realized

***

Ik doe West- en Oost-Vlaanderen

I’ll take care of West and East Flanders

Short for: West-Vlaanderen en Oost-Vlaanderen

Completely regularly analyzed as conjunction (CONJ)

Disfluencies

When too little of a token is realized, such a token is ignored

awel genen TV meer en genen boe*a gene voetbal meer .

EN: So no more tv and no more football

Example of disfluency (repetition)

Disfluencies

Mixed repetition/correction

Ze was bijna hileri*a hilari*a

She was almost hilarious

hileri*a is corrected as hilari*a, only the corrected form is included in the analysis

Die verd*a die vervl*a die krankzinnige hond

That damn*, that cursed*, that crazy dog

Only last 3 words (that crazy dog) included in graph

Disfluencies

Wrong pronunciation

Dat is een serieus plobleem*u

Dat is een serieus probleem

That’s a serious problem

Analysed as if the ‘correct’ word was involved

Words in foreign language

In spoken and written language:

Words in another language, and not found in a Dutch dictionary:

umbrella*v, plus*v de*v temps*v, à la carte
not: rendez-vous, cinema, cognac (in Dutch dictionaries)

• Single words: just like their Dutch counterpart
• Strings: only ‘top’ label presented
• Sentences: not analyzed.

Pros and cons of markings

Markings (*a, etc.) have proven to be useful for PoS and SA.

But:

They should have been removed afterwards, i.e. all information should have been contained in the tags; the orthographic level should contain only orthography.

Problem: other groups wanted them at orthographic level for speech recognition purposes

Solution: add a field without markings
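
One way to picture this solution is a token record that carries both versions side by side. The Python sketch below is an assumption about what such a record could look like; the class TranscribedToken, its field names and the helper with_plain_field are invented for illustration and do not describe the actual CGN data format.

```python
import re
from dataclasses import dataclass

# Assumption: a marking is a trailing '*' plus one of the letters v, a, x, u, d.
_MARKING = re.compile(r"\*[vaxud]$")


@dataclass(frozen=True)
class TranscribedToken:
    marked: str  # form as transcribed, marking included (speech recognition use)
    plain: str   # same form with the marking stripped (orthography only)


def with_plain_field(token):
    """Build a record holding the token with and without its marking."""
    return TranscribedToken(marked=token, plain=_MARKING.sub("", token))


if __name__ == "__main__":
    for token in "Dat is een serieus plobleem*u".split():
        print(with_plain_field(token))
```

Both user groups are served this way: nothing is thrown away, and tools that need clean orthography read only the plain field.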

Syntactic annotation

Lacking and superfluous words

There are no ‘ungrammatical’ sentences, all sentences are to be analyzed!

• Lacking elements: just accept it
• Superfluous elements: just accept it

BUT there are some exceptions:

• repetition
• ‘accidental’ sentences

Not analyzed parts

Sometimes parts of a ‘sentence’ are ‘ignored’:

• Repairs
  Ik zie hem morg*a overmorgen
  I’ll see him the day after tomorrow
• Repetitions
  Hij is in in vergadering
  He has a meeting

Or not connected:

• ‘accidental’ sentences/units
  Ik heb nooit ik ben lerares
  I have never I am a teacher
• Uh-insertion (hesitation marker)
  Ze heeft uh zeven dochters
  She has seven daughters.
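
The two simplest cases above, verbatim repetitions and uh-insertion, can be pre-flagged mechanically before a human annotator looks at the sentence. The Python sketch below is a heuristic written for illustration, not the CGN procedure: the function flag_unattached and the set HESITATIONS are hypothetical, repairs such as morg*a overmorgen are deliberately not covered, and ‘accidental’ units need human judgement.

```python
# Assumption: "uh" is the only hesitation marker handled here.
HESITATIONS = {"uh"}


def flag_unattached(tokens):
    """Return one boolean per token: True if the token would be left
    out of (or not connected in) the syntactic graph.

    Heuristic: hesitation markers are flagged, and of two identical
    adjacent tokens the first one is flagged as a repetition.
    """
    flags = []
    for index, token in enumerate(tokens):
        following = tokens[index + 1] if index + 1 < len(tokens) else None
        is_hesitation = token.lower() in HESITATIONS
        is_repetition = following is not None and token.lower() == following.lower()
        flags.append(is_hesitation or is_repetition)
    return flags


if __name__ == "__main__":
    for sentence in ["Hij is in in vergadering", "Ze heeft uh zeven dochters"]:
        tokens = sentence.split()
        kept = [t for t, skip in zip(tokens, flag_unattached(tokens)) if not skip]
        print(" ".join(kept))
```

Note that a blind rule like this would also flag legitimate doubled words, so its output is a suggestion for the corrector rather than a decision.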

Examples

More of the same

Asyndetic conjunction

Discourse phenomena

Some examples of ‘discourse’ within a sentence

Accidental unit

‘Accidental’ unit, discourse

parts not connected

Syntactic annotation

sentence

vs

discourse

Atypical ‘sentences’

Often: discourse

Complicating factors

No punctuation apart from full stop, question mark, ellipsis
• ‘wrong order’ of sentences when more people are talking at the same time!

► Tricky wrt coreference, temporal reasoning etc

Spelling: incorrect (but correct with other meaning)
• U zij de glorie (Thine be the glory)
• U zei de glorie (‘zei’ meaning ‘said’)
• Ik zal haar eraan houden (houden aan: to keep a promise)
• Ik zal haar er aanhouden (aanhouden: to arrest)

► context, recordings

Written corpus: Lassy/SoNaR

STEVIN programme (Flemish/Dutch - 2004-2011)

D-Coi / LASSY / (SoNaR)

1M SA written text, manually corrected, plus

1.500M SA automatically

ALPINO parser (Groningen)

Largely inspired by CGN, based on HPSG

Some differences
• Mentioning of ‘hidden’ subjects, objects
  – Hij heeft een boek gekocht (he has bought a book)

Alpino

• Alpino grammar: HPSG-based
• ‘Constructional’ approach:
  – rich lexical representations
  – many detailed, construction specific lexical rules (+/- 600)
• Grammar based parsing very efficient, esp when combined with specific rules
• Large lexicon (100.000+ entries, 200.000+ NEs)
  – Stored as perfect hash finite automaton (Daciuk)
• Crucial: Integrated tagger (=/= CGN tagger!)
• Left corner parser

Alpino (as is) and CGN

Parsing the CGN-corpus with Alpino
• very bad results
• reason might be: it uses a ‘wrong’ grammar, inadequate lexicon etc etc

As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy-format. There are, however, still differences in the way a few phenomena are handled.

Lassy vs CGN

• Subject/direct objects wrt infinitives and participles
• Partitives (one of them said …): in CGN separate label PART, in Lassy combination of HD and MOD
• LASSY: head always lexically anchored
• In LASSY SBAR-complement always VC-label, in CGN either OBJ1 or VC
• …

Analyses not fully identical, but 99% are!
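
As an illustration of what ‘translating’ CGN into the Lassy format involves, the SBAR difference above can be written as a single relabeling rule. The Python sketch below covers only that one rule and rests on assumptions: the function name to_lassy_label is invented, and the actual conversion had to deal with more phenomena (partitives, hidden subjects and objects, and so on).

```python
def to_lassy_label(category, function):
    """Map the function label of a CGN node to its Lassy counterpart.

    Only one difference is modelled: an SBAR complement labelled OBJ1
    in CGN gets the VC label in Lassy. Everything else is passed through.
    """
    if category == "SBAR" and function == "OBJ1":
        return "VC"
    return function


if __name__ == "__main__":
    for category, function in [("SBAR", "OBJ1"), ("SBAR", "VC"), ("NP", "OBJ1")]:
        print(category, function, "->", to_lassy_label(category, function))
```

Rules of this kind only go in one direction: once OBJ1 and VC are merged for SBARs, the original CGN distinction cannot be recovered from the Lassy label alone.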

Syntactic annotation: Lassy

Syntactic annotation: CGN

To be taken into account

In general:

• Take care of IPR
• Be prepared to consult other layers
• Use a flexible bug reporting system
• “Spoken language”: grammar/system should be very flexible
• Alignment may be very time consuming

Be aware that, as far as consistency is concerned, the most important cases are not the really hard ones, but rather those that the correctors don’t realize are problematic (because in those cases they don’t consult others)

GOOD LUCK!