
LIN3022 Natural Language Processing

Lecture 3
Albert Gatt


Reminder: Non-deterministic FSA

• An FSA where there can be multiple paths for a single input (tape).

• Two basic approaches for recognition:
  1. Either take a ND machine, convert it to a D machine, and then do recognition with that.
  2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).


Example

Input tape: b a a a !

States visited: q0 q1 q2 q2 q3 q4


Non-Deterministic Recognition: Search

• Given an ND FSA representing some language, and given an input string:
  – If the input string belongs to the language, the ND FSA will contain at least one path for that string;
  – If the input string does not belong to the language, there will be no path;
  – Not all paths through the machine for an accepted string lead to an accept state.

• A recognition algorithm succeeds if a path is found for the input string; fails otherwise.
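The search-based approach (option 2 above) can be sketched in a few lines of Python. This is a minimal sketch: the transition table encodes the machine from the example above (input b a a a !, states q0–q4), and the table format and depth-first search strategy are illustrative choices, not part of the lecture.

```python
# Minimal sketch of non-deterministic FSA recognition as state-space search.
# The machine is a dict mapping (state, symbol) -> set of possible next states.

def nd_recognize(transitions, start, accept, tape):
    # Agenda of (state, position) search nodes; explored depth-first here.
    agenda = [(start, 0)]
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape):
            if state in accept:
                return True          # found one accepting path
            continue                 # dead end: tape consumed, not accepting
        # Follow every possible transition on the current symbol.
        for nxt in transitions.get((state, tape[pos]), set()):
            agenda.append((nxt, pos + 1))
    return False                     # no path leads to an accept state

# The machine from the example above (non-deterministic on 'a' in q2):
trans = {
    ('q0', 'b'): {'q1'},
    ('q1', 'a'): {'q2'},
    ('q2', 'a'): {'q2', 'q3'},
    ('q3', '!'): {'q4'},
}
print(nd_recognize(trans, 'q0', {'q4'}, 'baaa!'))  # True
print(nd_recognize(trans, 'q0', {'q4'}, 'baa'))    # False
```

The agenda makes the search order explicit: swapping the stack for a queue gives breadth-first search over the same state space.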


Words

• Finite-state methods are particularly useful in dealing with a lexicon.

• Many devices need access to large lists of words:
  – Spell checkers
  – Syntactic parsers
  – Language generators
  – Machine translation systems

• What sort of knowledge should a computational lexicon contain?
  – What we’re mainly concerned with today is morphology (inflection and derivation)


Computational tasks involving morphology

• Morphological analysis (parsing):
  – ommijiet → omm + PL

• Morphological (lexical) generation:
  – omm + PL → ommijiet

• Stemming (map from a word to its stem):
  – ommijiet → omm
  – ommijiethom → omm
  – ommna → omm


Computational tasks involving morphology

• Lemmatisation
  – Convert words to their “basic” form:
    • ommijiet → omm
    • ommijiethom → omm
    • ...

• Tokenisation
  – Split running text into individual tokens:
    • Qtilt il-kelb → qtilt, il-, kelb
    • Xrobt l-ilma → xrobt, l-, ilma


Regular, irregular and messy

• The problem is to cover everything, of course:
  – mouse/mice, goose/geese, ox/oxen (PLURAL)
  – go/went, fly/flew (PAST)
  – solid/solidify, mechanical/mechanise (NOMINALISATION)


Inflection vs Derivation

• Inflection is typically fairly straightforward
  – Relatively easy to identify regular and irregular cases

• Derivational morphology is much messier.
  – Quasi-systematicity
  – Irregular meaning change
  – Changes of word class


Derivational Examples

• Verbs and adjectives to nouns:
  – -ation: computerise → computerisation
  – -ee: appoint → appointee
  – -er: kill → killer
  – -ness: fuzzy → fuzziness


Derivational Examples

• Nouns and verbs to adjectives:
  – -al: computation → computational
  – -able: embrace → embraceable
  – -less: clue → clueless


Example: Compute

• Many paths are possible…

• Start with compute:
  – computer → computerise → computerisation
  – computer → computerise → computerisable

• But not all paths/operations are equally good:
  – clue → *clueable


Morphology, recognition and FSAs

• We’d like to use the machinery provided by FSAs to capture these facts about morphology:
  – Accept strings that are in the language
  – Reject strings that are not
  – ... in an efficient way:
    • Without listing every word in the language
    • Capturing some generalisations
    • Enabling fast search


What does a morphological parser or recogniser need?

• Lexicon
  – Often not feasible to just list all the words:
    • Maltese has literally thousands of forms for some verbs...
    • Some morphological processes are productive; we’re likely to meet completely new formations.

• Morphotactics (order of morphemes)
  – E.g. the English plural morpheme comes after the noun stem
  – E.g. Maltese “accusative” -l- before dative pronominal suffixes (e.g. qatilulna)
  – E.g. English -ise before -ation (formalisation)

• Orthographic rules
  – Needed to handle variations in the spelling of the stem
  – E.g. English nouns ending in -y change to -i (city → cities)


Start Simple

• Regular singular nouns are ok
• Regular plural nouns have an -s on the end
• Irregulars are ok as is (i.e. treat as atomic for now)


Simple Rules


Substitute words for word classes

• Idea is to be able to use this kind of FSA for recognition.
• We’ve replaced classes like “reg-noun” with the actual words.
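A word-level machine of this kind can be emulated directly as a recogniser. This is a sketch with a tiny hand-picked lexicon; the word lists are illustrative, not the lecture's actual machine.

```python
# Sketch of the noun FSA with actual words substituted for word classes.
# The word lists below stand in for reg-noun / irreg-sg-noun / irreg-pl-noun.
REG_NOUNS = {'cat', 'dog', 'fox'}
IRREG_SG = {'goose', 'mouse'}
IRREG_PL = {'geese', 'mice'}

def accepts(word):
    # Irregulars are treated as atomic: accepted as-is, no affix machinery.
    if word in REG_NOUNS or word in IRREG_SG or word in IRREG_PL:
        return True
    # Regular plural: a regular noun stem followed by -s.
    return word.endswith('s') and word[:-1] in REG_NOUNS

print(accepts('cats'))    # True
print(accepts('geese'))   # True
print(accepts('mouses'))  # False: mouse is not a regular noun
```

Note that the spelling complication handled later in the lecture already shows up here: the machine wrongly rejects foxes, because fox + s needs an inserted e.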


Derivational Rules

If everything is an accept state, how do things ever get rejected?


Lexicons

• A lexicon can be stored as an FSA.

• A base lexicon (with baseforms) can be plugged into a larger FSA to capture morphological rules and morphotactics.


Part 2

• Finite state transducers

• Morphological parsing

• Morphological generation


Parsing/Generation vs. Recognition

• We can now run strings through these machines to recognise strings in the language.

• But recognition is usually not quite what we need:
  – Often, if we find some string in the language, we might like to assign a structure to it (parsing)
  – Or we might have some structure and we want to produce a surface form for it (production/generation)


Finite State Transducers

• The simple story:
  – Add another tape
  – Add extra symbols to the transitions
  – On one tape we read “cats”, on the other we write “cat +N +PL”


What is a finite-state transducer?

• A transducer is a machine that takes input of a certain form and outputs something of a different form.
  – We can think of morphological analysis as transduction (from a word to stem+features):
    • ommijietna → omm + PL + POSS.1PL
  – So is morphological generation (from a stem+feature combination to a word):
    • omm + PL + POSS.1PL → ommijietna


Structure of an FST

• The easiest way to think of an FST is as a variation on the classic FSA.

• A simple FSA has states and transitions, and recognises something on an input tape.

• An FST has states and transitions, but works with two tapes.– One corresponds to the input.– The other to the output.


The uses of an FST

• Recognition:
  – Take a pair of strings (on two tapes) as input; output accept or reject

• Generation:
  – Output a pair of strings from the language

• Translation:
  – Read one string and output another

• Relation between two sets:
  – A machine that computes the relation between the set of possible input strings and the set of possible output strings


FSTs


Morphological parsing

• Morphological analysis or parsing can either be:
  • An important stand-alone component of many applications (spelling correction, information retrieval)
  • Or simply a link in a chain of further linguistic analysis

• Interestingly, FSTs are bidirectional, i.e. they can be used for both parsing and generation.


Transitions

• c:c means read a c on one tape and write a c on the other
• +N:ε means read a +N symbol on one tape and write nothing on the other
• +PL:s means read +PL and write an s

• Note the convention: x:y represents an input symbol x and an output symbol y.

c:c  a:a  t:t  +N:ε  +PL:s
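The transition sequence above can be sketched as a list of input:output pairs. This is a minimal sketch in the generation direction (reading lexical symbols, writing surface letters); the single hard-coded path for cat is just for illustration, and a real FST would branch over a whole lexicon.

```python
# Minimal sketch of the cat +N +PL transducer as a list of input:output pairs.
EPS = ''  # stands for the ε (empty) output symbol

# The (input, output) pairs along the single path shown on the slide:
TRANSITIONS = [('c', 'c'), ('a', 'a'), ('t', 't'), ('+N', EPS), ('+PL', 's')]

def transduce(symbols):
    """Read lexical-level symbols, write the surface string (generation)."""
    if len(symbols) != len(TRANSITIONS):
        return None                  # reject: wrong number of symbols
    out = []
    for sym, (inp, outp) in zip(symbols, TRANSITIONS):
        if sym != inp:
            return None              # reject: symbol doesn't match the path
        out.append(outp)
    return ''.join(out)

print(transduce(['c', 'a', 't', '+N', '+PL']))  # 'cats'
```

Because every transition is an input:output pair, swapping the two halves of each pair runs the same machine in the parsing direction, which is what makes FSTs bidirectional.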


Typical Uses

• Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).

• And we’ll write to the second tape using the other symbols on the transitions.


Ambiguity

• Recall that in non-deterministic recognition, multiple paths through a machine may lead to an accept state.
  – It didn’t matter which path was actually traversed

• In FSTs the path to an accept state does matter, since different paths represent different parses, and different outputs will result.


Ambiguity

• What’s the right parse (segmentation) for unionisable?
  • union-ise-able
  • un-ion-ise-able

• Each represents a valid path through the derivational morphology machine.

• Unlike in an FSA, the differences matter! (Some are not legal parses in English.)


Ambiguity

• There are a number of ways to deal with this problem:
  • Simply take the first output found
  • Find all the possible outputs (all paths) and return them all (without choosing)
  • Bias the search so that only one or a few likely paths are explored


The Gory Details

• Of course, it’s not as easy as:
  • “cat +N +PL” <-> “cats”

• As we saw earlier, there are geese, mice and oxen.

• But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes:
  • cats vs dogs
  • fox and foxes


Multi-Tape Machines

• To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next.
  – This gives us a cascade of FSTs.

• So to handle irregular spelling changes, we’ll add intermediate tapes with intermediate symbols.


Multi-Level Tape Machines

• We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes down to the surface tape.

• ^ marks a morpheme boundary
• # marks a word boundary


Lexical to Intermediate Level


Intermediate to Surface

• The “add an e” rule, as in fox^s# <-> foxes#
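The e-insertion rule can be approximated with a rewrite over intermediate-level strings. This regex sketch is only an approximation of the two-level rule: it assumes (as the slide's fox example suggests) that the rule fires when the stem ends in x, s or z, and it simply deletes the boundary symbols everywhere else.

```python
import re

# Sketch of the e-insertion spelling rule at the intermediate level:
# insert an 'e' between a stem ending in x, s or z and the -s suffix,
# i.e. fox^s# -> foxes#.  '^' marks a morpheme boundary, '#' a word boundary.
def e_insertion(intermediate):
    surface = re.sub(r'([xsz])\^s#', r'\1es#', intermediate)
    # Elsewhere the boundary symbols are simply deleted.
    return surface.replace('^', '').replace('#', '')

print(e_insertion('fox^s#'))   # 'foxes'
print(e_insertion('bird^s#'))  # 'birds'
```

The second call shows the point made on a later slide: the machine must also do the right thing for inputs the rule does not apply to, passing bird^s# through unchanged apart from the boundary symbols.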


Foxes


Note

• A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn’t really apply. So...
  – fox → foxes, but bird → birds


Overall Scheme

• We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
  – Lexical level to intermediate forms

• We have a larger set of machines that capture orthographic/spelling rules.
  – Intermediate forms to surface forms


Overall Scheme


Cascades

• This is a common architecture.
• Overall processing is divided up into distinct rewrite steps.
• The output of one layer serves as the input to the next.
• The intermediate tapes may or may not wind up being useful in their own right.


Overall Plan


Part 3

• A brief look at stemming...

• A brief look at tokenisation


Stemming

• Stemming is the process of stripping affixes from words to reduce them to their stems.
  – NB: the stem is not necessarily the base form
  – Example: strip “ing” from all word endings:
    • going → go
    • stripping → stripp
    • ...


Uses of stemming

• Often used in Information Retrieval
  – The basic technology underlying search engines
  – Task: given a query (e.g. keywords), retrieve documents which match it
  – Stemming is useful because it increases the likelihood of matches
  – E.g. a search for kangaroos returns documents containing kangaroo or kangaroos


The Porter stemmer

• Built by Martin Porter in 1980
• Still widely used
• A very simple FST-based stemmer
• No lexicon, just rules

• View a demo online here: http://qaa.ath.cx/porter_js_demo.html


Rules in the Porter Stemmer

• ATIONAL → ATE
  – E.g. relational → relate
  – But what about rational?

• ING → ε
  – E.g. shivering → shiver

• These can be viewed as an FST, but without a lexical layer.
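A toy version of such suffix rules can be written in a few lines. This is illustrative only: just two rules are shown, and the real Porter stemmer conditions each rule on a measure of the remaining stem, which this sketch omits, hence the overapplication to rational.

```python
# Sketch of Porter-style suffix rewrite rules: no lexicon, just rules.
# (suffix, replacement) pairs, tried in order; first match wins.
RULES = [('ational', 'ate'),   # relational -> relate
         ('ing', '')]          # shivering -> shiver

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word                # no rule applies: leave the word alone

print(stem('relational'))  # 'relate'
print(stem('shivering'))   # 'shiver'
print(stem('rational'))    # 'rate' -- without the stem-measure condition,
                           # the ATIONAL rule overapplies
```

The real stemmer avoids rational → rate by requiring a minimum stem measure before ATIONAL → ATE can fire, which is exactly the kind of exception-missing behaviour the next slide discusses.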


Errors in the Porter stemmer

• Simple rules like the ones used by the Porter stemmer are error-prone (they miss exceptions).

• Errors of commission:
  – rational → ration

• Errors of omission:
  – European → Europe


Tokenisation

• Defined as the task of splitting running text into component tokens.

• Related task: sentence segmentation (splitting running text into sentences).

• Simplest technique:
  – Just split on whitespace

• But what about:
  – Punctuation
  – Clitics
  – ...


Tokenisation & Sentence segmentation

• Often go hand in hand!
• Example:
  – A full stop can be a sentence boundary or an intra-word boundary
  – “Punctuation” marks in numbers:
    • 30,000
    • 20.2

• Can get a long way using simple regular expressions, at least in Indo-European languages.


Example from Maltese

• Treat numbers as tokens, with or without decimals:
  – \d+(\.\d+)?

• Honorifics shouldn’t be broken up at full stops or apostrophes:
  – sant['’]|(onor|sra|nru|dott|kap|mons|dr|prof)\.

• Definite articles shouldn’t be broken up at hyphens:
  – i?[dtlrnsxzżċ]-

• ...[other exceptions]

• General rule:
  – A word is just a sequence of characters: \w+\s
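Rules like these can be combined into a single priority-ordered regular expression: earlier, more specific alternatives win over the general \w+ fallback. This is a sketch, not the lecture's actual tokeniser; the sant['’] pattern and proper word-boundary anchoring are omitted for simplicity, and a leftover-punctuation alternative is added so stray marks become tokens too.

```python
import re

# Sketch of a priority-ordered regex tokeniser in the style of the
# Maltese rules above.  Alternatives are tried left to right at each
# position, so the specific rules take precedence over the \w+ fallback.
TOKEN_RE = re.compile(
    r"\d+(?:\.\d+)?"                              # numbers, +/- decimals
    r"|(?:onor|sra|nru|dott|kap|mons|dr|prof)\."  # honorifics keep their dot
    r"|i?[dtlrnsxzżċ]-"                           # articles keep their hyphen
    r"|\w+"                                       # general rule: a word
    r"|[^\w\s]",                                  # any leftover punctuation
    re.IGNORECASE)

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise('Qtilt il-kelb.'))  # ['Qtilt', 'il-', 'kelb', '.']
print(tokenise('Xrobt 20.2'))      # ['Xrobt', '20.2']
```

Note that all the groups are non-capturing so that `findall` returns whole matches, and that the article pattern is listed before \w+ so that il-kelb splits as il- plus kelb rather than as one word.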


Chinese

• Words in Chinese are composed of hanzi characters.

• Each character usually represents one morpheme.

• Words tend to be quite short (ca. 2.4 characters long on average)

• There is no whitespace between words!


A simple algorithm for segmenting Chinese

• Maximum Matching (MaxMatch)
  – Requires a lexicon.
  – Start by pointing at the beginning of the input string.
  – Choose the longest item in the lexicon that matches the input at the pointer, and move the pointer past the match.
  – If no word matches, move the pointer one symbol forward.


MaxMatch example

• Imagine an English string with no whitespace:
  – The table down there
  – thetabledownthere

• The first item found by MaxMatch is theta
  – Because the algorithm tries to match the longest portion of the input
  – Then: bled, own, there

• Result: theta bled own there

• Luckily, it works better for Chinese, because words there tend to be shorter than in English!
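The procedure traced above can be sketched directly. The toy lexicon here is illustrative, chosen just to reproduce the thetabledownthere example; a real segmenter needs a full wordlist.

```python
# Sketch of the MaxMatch segmentation algorithm with a toy lexicon.
LEXICON = {'the', 'theta', 'table', 'bled', 'down', 'own', 'there'}
MAX_LEN = max(len(w) for w in LEXICON)   # longest possible match

def maxmatch(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible match at the pointer first.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j                    # move the pointer past the match
                break
        else:
            # No word matches: emit one symbol and move the pointer on.
            tokens.append(text[i])
            i += 1
    return tokens

print(maxmatch('thetabledownthere'))  # ['theta', 'bled', 'own', 'there']
```

Because theta is longer than the, the greedy longest-match choice commits to the wrong segmentation immediately, exactly as on the slide.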