
Page 1: Colloquia Linguistica

Colloquia Linguistica

Part II: The development of Automated Syntactic Taggers

Leif Grönqvist

Göteborg University

Page 2: Colloquia Linguistica

Overview

• Some basic things about corpora (quick)
  – What is a corpus
  – What can we do with it

• Part-of-speech tagging (slower)
  – What is the problem
  – Some common approaches
    • A rule-based tagger
    • A statistical tagger

• Corpus tools
  – Different tools
  – Demonstration of Multitool

Page 3: Colloquia Linguistica

What is a corpus for a computational linguist?

• Various properties are important but the word ‘corpus’ is just Latin for ‘body’

• These properties should be considered:
  – Representativeness
  – Size
  – Form (annotation standard)
  – Standard reference

Page 4: Colloquia Linguistica

Representativeness

• A corpus used for analyzing spoken Swedish should ideally contain all utterances of Swedish ever spoken

• But this is impossible, so there are at least two strategies depending on purpose:
  – Try to collect various dialogue types in sizes proportional to the “complete corpus”
  – Collect big enough portions of each type to make sure to find all wanted phenomena

• Regardless of which strategy you use, it is important to select the samples from each type carefully, preferably using random sampling

Page 5: Colloquia Linguistica

Corpus size: how big should it be?

• Depends on purpose!

• Some strategies:
  – Monitor corpus: as big as possible
    • Bank of English > 500 million tokens
    • Used for lexicography
  – Finite size, big enough for the current task
    • POS tagging, ~100 tags: 1 million tokens
    • Language model for automatic speech recognition: 100 million tokens

Page 6: Colloquia Linguistica

Machine readable form

• Corpora have been used in linguistics for more than 100 years.

• Now: “corpus” implies machine readable

• The annotations should be designed so that extraction of the wanted features is as simple as possible

Page 7: Colloquia Linguistica

Standard reference (quick)

• Typical content of a research article: “We used the corpus XX, took 90% for training, and 10% for testing with our new algorithm. We then got 97.2% correctness, which is a significant improvement over the old tagger at the 99% level”
  – Exactly this corpus XX must be available to other research groups

Page 8: Colloquia Linguistica

What to do with a corpus

• Check our linguistic intuition

• Annotate interesting features manually

• Use it to train taggers and parsers
  – Annotate new data automatically

• But, be careful! A corpus is not the complete language

Page 9: Colloquia Linguistica

Text encoding

• Various encoding schemes around
  – Text based
    • Human and machine readable
    • Could be difficult to check for validity
  – Word processor based
    • Only human readable
    • Rarely used in computational linguistics
  – XML/SGML based
    • Machine readable
    • May be transformed to human readable form using XSLT
    • Formalisms and tools for free, well, more or less free
    • Limitations of XML may be annoying sometimes

Page 10: Colloquia Linguistica

Some important properties (skip)

Important properties according to Geoffrey Leech:
• Possibility to extract the original corpus
• Possibility to separate the annotations
• Based on well-defined guidelines
• Make clear how the annotations were done
• Make clear that there may be errors in the corpus
• Widely agreed, theory-neutral annotation scheme
• No annotation scheme is the a priori standard scheme

Page 11: Colloquia Linguistica

Some annotation standards

TEI (Text Encoding Initiative)
• Huge standard for all types of texts and corpora, developed by the TEI Consortium since 1987
• SGML based in the beginning, but now XML

(X)CES (XML Corpus Encoding Standard)
• Highly inspired by the TEI
• Not as complicated, but only in beta version

ISLE (International Standards for Language Engineering)
• Developed by three working groups (lexicon, multimodality and evaluation)

CDIF (Corpus Document Interchange Format)
• Used by the British National Corpus
• A lot in common with the TEI

Page 12: Colloquia Linguistica

Some typical results directly extracted from a corpus

• Concordances (KWIC)

• Frequency lists

• N-gram statistics

• Probabilities

Page 13: Colloquia Linguistica

Concordances

rer, matematiker och dataloger i Göteborgsregionen,
 bandavskrifter och dataloggar, skriver Feldt.|Si
, bandavskrifter och dataloggar.|Men den nya Palme
 Ahlberg, forskare i datalogi på Chalmers.|Av PER-
 Ahlberg, forskare i datalogi på Chalmers.|SIDAN 4
und blir professor i datalogi vid Umeå universitet
und blir professor i datalogi vid Umeå universitet
a fyra olika kurser: datalogi, pedagogik, teknik o
ybjer och Jan Smith, datalogi.|Sektionen för maski
atorer eller pluggar datalogi.|Så på fritiden leke
r det gäller trådlös datalogistik, nu kommer det ö
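
As an illustration of how such KWIC lines are produced, here is a minimal Python sketch (not from the slides; the toy corpus string and window width are invented):

```python
# Minimal KWIC sketch: print every occurrence of a keyword prefix with a
# fixed-width window of context on each side. The corpus is invented.
def kwic(text, keyword, width=25):
    lower = text.lower()
    start = 0
    while True:
        i = lower.find(keyword, start)
        if i == -1:
            break
        left = text[max(0, i - width):i]
        right = text[i:i + len(keyword) + width]
        print(f"{left:>{width}}{right}")   # right-align the left context
        start = i + 1

corpus = ("Han är forskare i datalogi på Chalmers. "
          "På fritiden pluggar hon också datalogi.")
kwic(corpus, "datalog")
```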

Page 14: Colloquia Linguistica

Frequency lists

74556 de        77810 det       90304 .
48104 ja        36843 är        56075 ,
39947 e         35471 och       40438 och
34342 å         32404 ja        33978 i
25694 så        30439 att       26358 att
25639 att       28628 jag       25634 det
22378 va        26059 så        21830 en
19134 som       19205 som       21333 som
18679 vi        18681 inte      19743 på
18084 inte      18469 har       15754 är
17611 på        18421 vi        14333 med
17214 man       17719 på        13837 för
16870 i         17377 man       13683 av
16846 då        17343 då        13547 jag
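
A frequency list like those above is just token counts sorted in descending order; a minimal Python sketch over an invented toy text:

```python
# Frequency list sketch: count tokens, sort by descending frequency.
from collections import Counter

text = "ja det är det ja det var så det var ja"   # invented toy text
for word, freq in Counter(text.split()).most_common():
    print(freq, word)
```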

Page 15: Colloquia Linguistica

N-gram statistics

3395 det är         42 i stället för att
2913 för att        36 för några år sedan
2451 det var        35 men det är inte
1560 att det        34 en stor del av
1351 är det         33 på samma sätt som
1278 i en           32 det var som om
1174 att han        31 att det är en
1003 i den          30 är en av de
 966 som en         30 men det var inte
 920 men det        28 vad är det för
 889 på en          28 det är svårt att
 884 att jag        27 det är som om
 882 är en          27 att det inte var
 882 med en         26 för ett år sedan
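
N-gram statistics are counts over every contiguous sequence of n tokens; a minimal Python sketch (whitespace tokenization, invented toy text):

```python
# N-gram sketch: count every contiguous sequence of n tokens.
from collections import Counter

def ngrams(tokens, n):
    # zip over n staggered copies of the token list
    return zip(*(tokens[i:] for i in range(n)))

tokens = "det är som om det är som om det var".split()
for gram, freq in Counter(ngrams(tokens, 2)).most_common(3):
    print(freq, " ".join(gram))
```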

Page 16: Colloquia Linguistica

Part-of-speech tagging

• We want to assign the right part-of-speech (just as an example) to each word in a corpus

• Input is a tokenized corpus

• The tagset is determined in advance

• The word types in the corpus have various properties in the training data:
  – Some are unambiguous
  – Some are ambiguous (typically 2-7 POS each)
  – Some are unknown (not there)

Page 17: Colloquia Linguistica

An example

Tagset: noun, verb, pron, art, infmrk, prep
In:  $A: you have to book a chair on deck
Out: pron verb infmrk verb art noun prep noun

• But, “book” and “chair” may be either verb or noun - the tagger has to disambiguate!

• Several approaches to do this, all based on patterns and regularities in the language

Page 18: Colloquia Linguistica

Terms used in tagging

• Tagging: put the right label (i.e. word class) on each token

• Tagset: all possible labels (word classes)

• Tokenizing: divide the corpus into tokens (words, sentence boundaries); see the sketch after this list

• Training: find the rules or probabilities needed by the tagger
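
As a sketch of the tokenizing step, a naive Python tokenizer (real tokenizers also handle abbreviations, numbers, clitics, etc.):

```python
# Naive tokenizer sketch: word tokens plus sentence-boundary punctuation.
import re

def tokenize(text):
    return re.findall(r"\w+|[.!?]", text)

print(tokenize("You have to book a chair. On deck!"))
# ['You', 'have', 'to', 'book', 'a', 'chair', '.', 'On', 'deck', '!']
```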

Page 19: Colloquia Linguistica

Various approaches

• Rule based tagging
  – Constraint-based tagging (SweTwol, EngTwol by Lingsoft)
  – Transformation-based tagging (Eric Brill)

• Stochastic tagging (HMM)
  – Calculate the most probable tag sequence
  – Using maximum likelihood estimation
  – Or some bootstrap based training

Page 20: Colloquia Linguistica

Constraint-based tagging

• Basic idea:
  – Assign all possible tags to each word
  – Remove tags according to a set of rules of the type: “if word+1 is an adj, adv or quantifier and the following is a sentence boundary and word-1 is not a verb like ‘consider’ then eliminate non-adv else eliminate adv”
  – Continue until no rule is applicable, but never remove the last tag on a word

• Typically more than 1000 hand-written rules, but they may also be machine learned
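
A minimal Python sketch of the elimination idea (this is not Lingsoft's actual rule formalism; the lexicon and the two rules are invented for the running example):

```python
# Constraint-based elimination sketch: start from all possible tags per
# word, let rules intersect the tag sets, and never delete a word's
# last remaining tag.
LEXICON = {"you": {"pron"}, "have": {"verb"}, "to": {"infmrk"},
           "book": {"noun", "verb"}, "a": {"art"},
           "chair": {"noun", "verb"}, "on": {"prep"}, "deck": {"noun"}}

def after_infmrk_keep_verb(tags, i):      # invented rule
    if i > 0 and tags[i - 1] == {"infmrk"}:
        return tags[i] & {"verb"}
    return tags[i]

def after_art_keep_noun(tags, i):         # invented rule
    if i > 0 and tags[i - 1] == {"art"}:
        return tags[i] & {"noun"}
    return tags[i]

words = "you have to book a chair on deck".split()
tags = [set(LEXICON[w]) for w in words]
for i in range(len(words)):
    for rule in (after_infmrk_keep_verb, after_art_keep_noun):
        reduced = rule(tags, i)
        if reduced:                       # never remove the last tag
            tags[i] = reduced
print(list(zip(words, tags)))             # book -> verb, chair -> noun
```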

Page 21: Colloquia Linguistica

The example: Constraint grammar

• Tagset: nn, vb, pron, art, infmrk, prep

• First: look up all possible classes for each word

• Rules will then remove unwanted tags

Word    Step 1
you     pron
have    verb
to      infmrk
book    noun, verb
a       art
chair   noun, verb
on      prep
deck    noun

Page 22: Colloquia Linguistica

Transformation-based tagging

• Basic idea:
  – Set the most probable tag for each word as a start value
  – Change tags according to rules of the type: “if a word is tagged as a verb and the word before is an article, then change the tag to noun”. Perform the rules in a specific order!

• Training is done using a tagged corpus:
  1. Write a set of rule templates of the type: “if word-1 or word+1 is an X then change the tag for word to Y”
  2. Among the set of possible rules, find the one with the highest score, apply it, and add it to the rule list
  3. Continue from 2 until the score falls below a threshold
  4. Keep the ordered set of rules

• Rules will make errors that are corrected by later rules
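
A minimal Python sketch of the tagging phase only (training is not shown; the most-frequent-tag lexicon and the single rule are invented for the running example):

```python
# Brill-style tagging sketch: start from each word's most frequent tag,
# then apply an ordered list of learned context rules.
MOST_FREQUENT = {"you": "pron", "have": "verb", "to": "infmrk",
                 "book": "noun", "a": "art", "chair": "noun",
                 "on": "prep", "deck": "noun"}

# (old_tag, previous_tag, new_tag), applied in training order
RULES = [("noun", "infmrk", "verb")]      # "to book": book becomes a verb

words = "you have to book a chair on deck".split()
tags = [MOST_FREQUENT[w] for w in words]
for old, prev, new in RULES:
    for i in range(1, len(tags)):
        if tags[i] == old and tags[i - 1] == prev:
            tags[i] = new
print(list(zip(words, tags)))             # book is retagged as verb
```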

Page 23: Colloquia Linguistica

The example: Transformation-based learning

• Tagset: nn, vb, pron, art, infmrk, prep

• First: look up the most common tag for each word

• Rules will then change to the right tags

Word    Step 1
you     pron
have    verb
to      infmrk
book    noun
a       art
chair   noun
on      prep
deck    noun

Page 24: Colloquia Linguistica

An HMM tagger: uses statistics (brief)

• The problem may be formulated as:

  $\hat{t}_{1..n} = \arg\max_{t_{1..n}} P(t_{1..n} \mid w_{1..n})$

• Which may be reformulated, using Bayes’ rule, as:

  $\hat{t}_{1..n} = \arg\max_{t_{1..n}} \frac{P(w_{1..n} \mid t_{1..n})\, P(t_{1..n})}{P(w_{1..n})}$

• But the denominator is constant over tag sequences and may be removed, and we get:

  $\hat{t}_{1..n} = \arg\max_{t_{1..n}} P(w_{1..n} \mid t_{1..n})\, P(t_{1..n})$

Page 25: Colloquia Linguistica

HMM tagger, cont. (brief)

The Markov assumption (for n=3) and the chain rule give us:

  $P(t_{1..n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})$

  $P(w_{1..n} \mid t_{1..n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

What we need now is the tag trigram probabilities $P(t_i \mid t_{i-2}, t_{i-1})$ and the lexical probabilities $P(w_i \mid t_i)$

Page 26: Colloquia Linguistica

The example: HMM

Word    Seq. 1    Seq. 2    Seq. 3    Seq. 4
you     pron      pron      pron      pron
have    verb      verb      verb      verb
to      infmrk    infmrk    infmrk    infmrk
book    noun      noun      verb      verb
a       art       art       art       art
chair   noun      verb      noun      verb
on      prep      prep      prep      prep
deck    noun      noun      noun      noun

Select the sequence with the highest probability!
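
To make this selection step concrete, here is a brute-force Python sketch that scores candidate tags for the two ambiguous words given their fixed left-context tags; it uses bigram transitions rather than the trigrams above, and every probability is invented for illustration:

```python
# Brute-force sketch: enumerate candidate tag sequences for the ambiguous
# words and keep the one with the highest product of transition and
# lexical probabilities. All numbers are invented.
from itertools import product

TRANS = {("infmrk", "verb"): 0.9, ("infmrk", "noun"): 0.1,
         ("art", "noun"): 0.8, ("art", "verb"): 0.2}
LEX = {("book", "verb"): 0.3, ("book", "noun"): 0.7,
       ("chair", "noun"): 0.9, ("chair", "verb"): 0.1}

def score(words, tags, prev_tags):
    p = 1.0
    for w, t, prev in zip(words, tags, prev_tags):
        p *= TRANS[(prev, t)] * LEX[(w, t)]
    return p

words = ["book", "chair"]
prev_tags = ["infmrk", "art"]             # "to" and "a" from the example
best = max(product(["noun", "verb"], repeat=2),
           key=lambda tags: score(words, tags, prev_tags))
print(best)                               # ('verb', 'noun') with these numbers
```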

Page 27: Colloquia Linguistica

Training of an HMM tagger

• The best way is Maximum Likelihood Estimation, but it requires a hand-tagged corpus

• A fancy name for a simple principle: expect the new data to look like the training data, and count the events there:
  – P(c) = freq(c) / Ntok
  – P(w,c) = freq(w,c) / Ntok
  – P(w|c) = P(w,c) / P(c)
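
The formulas translate directly into counting; a minimal Python sketch over an invented hand-tagged toy corpus:

```python
# MLE sketch following the formulas above, over a toy corpus of
# (word, tag) pairs.
from collections import Counter

tagged = [("book", "noun"), ("a", "art"), ("book", "verb"),
          ("a", "art"), ("book", "noun")]
n_tok = len(tagged)
tag_freq = Counter(t for _, t in tagged)
pair_freq = Counter(tagged)

def p_tag(c):                 # P(c) = freq(c) / Ntok
    return tag_freq[c] / n_tok

def p_word_given_tag(w, c):   # P(w|c) = P(w,c) / P(c)
    return (pair_freq[(w, c)] / n_tok) / p_tag(c)

print(p_tag("noun"))                      # 0.4
print(p_word_given_tag("book", "noun"))   # 1.0
```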

Page 28: Colloquia Linguistica

Evaluation (skip)

• The result is compared with the so-called “gold standard” (manually coded)
  – Typically, accuracy reaches 96-97%
  – This may be compared with the result for a baseline tagger, for example a tagger not using context at all
  – Similarity between two gold standards may be verified with the kappa measure

• Important to note that 100% is impossible even for human annotators
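
A minimal Python sketch of accuracy against a gold standard, with Cohen's kappa as one concrete instance of the kappa measure (the tag sequences are invented):

```python
# Sketch: tagger accuracy against a gold standard, plus Cohen's kappa.
def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def kappa(a, b):
    # observed agreement corrected for expected chance agreement
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(t) / n) * (b.count(t) / n) for t in set(a) | set(b))
    return (p_obs - p_exp) / (1 - p_exp)

gold   = ["pron", "verb", "infmrk", "verb", "art", "noun"]
tagger = ["pron", "verb", "infmrk", "noun", "art", "noun"]
print(accuracy(tagger, gold))             # 0.833...
print(kappa(tagger, gold))                # ~0.79
```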

Page 29: Colloquia Linguistica

Problems (quick)

• Words and sequences are missing in the training data. This is cured using smoothing:
  – Additive: add one occurrence to each event frequency
  – Good-Turing estimation: try to estimate the number of unseen events to get a better estimate of their probabilities
  – Back-off and linear interpolation
  – Morphology may help (-arity, -s)
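
The additive strategy is the simplest to show in code; a minimal Python sketch with invented counts and a vocabulary where “stol” is unseen:

```python
# Additive (add-one) smoothing sketch: give every vocabulary item one
# extra count so unseen words get nonzero probability.
from collections import Counter

counts = Counter({"det": 5, "är": 3, "bok": 1})
vocab = ["det", "är", "bok", "stol"]      # "stol" is unseen in training
n = sum(counts.values())

def p_add_one(word):
    return (counts[word] + 1) / (n + len(vocab))

for w in vocab:
    print(w, p_add_one(w))                # probabilities sum to 1
```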

Page 30: Colloquia Linguistica

The Viterbi algorithm (quick)

• Calculating the probabilities of all possible tag sequences would take too long

• The Viterbi algorithm finds the most probable path in time linear in the length of the text and quadratic in the number of states, using dynamic programming
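
A minimal bigram Viterbi sketch in Python, with invented probabilities (a trigram tagger works the same way, with tag pairs as states):

```python
# Viterbi sketch: for each position and tag, keep only the best-scoring
# path so far (dynamic programming), instead of enumerating all tag
# sequences.
def viterbi(words, tags, p_trans, p_lex, p_start):
    # best[t] = (probability, path) of the best path ending in tag t
    best = {t: (p_start(t) * p_lex(words[0], t), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max(((p * p_trans(prev, t) * p_lex(w, t), path + [t])
                        for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])

TRANS = {("infmrk", "verb"): 0.9, ("infmrk", "noun"): 0.1}
LEX = {("to", "infmrk"): 1.0, ("book", "verb"): 0.3, ("book", "noun"): 0.7}

prob, path = viterbi(["to", "book"], ["infmrk", "verb", "noun"],
                     lambda a, b: TRANS.get((a, b), 0.01),
                     lambda w, t: LEX.get((w, t), 0.001),
                     lambda t: 1.0 if t == "infmrk" else 0.01)
print(path, prob)                         # ['infmrk', 'verb']
```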

Page 31: Colloquia Linguistica

Example of corpus tools at the linguistics department in Göteborg

• The Corpus Browser
  – A tool for searching (for words and expressions) and browsing in our transcriptions

• TraSA
  – A tool that counts things like number of words, utterances, overlaps, vocabulary richness, etc.

• Multitool
  – A tool for browsing and coding a transcription, with audio and video available at the same time
  – Demonstration?

Page 32: Colloquia Linguistica

Thank you!

• Thank you for listening!

• Well, do we have any time left for questions?