COMPUTATIONAL MORPHOLOGY Radhika Mamidi Winter workshop/school on Machine Learning and Text Analytics 17 December, 2013


COMPUTATIONAL

MORPHOLOGY

Radhika Mamidi

Winter workshop/school on Machine Learning and Text Analytics

17 December, 2013

Outline

What is Computational Morphology?

Morphology concepts

Computational approaches

Morphological Analysis

Morphological Generator

Computational Approaches

Paradigm based

Finite State based

Statistical based

Issues to be handled

Computational Morphology

Linguistics: Scientific study of language at these levels

Sound – Phonetics and Phonology

Word - Morphology

Sentence - Syntax

Discourse – Discourse Analysis

Computational Linguistics: Processing and producing language computationally.

Computational Morphology: Processing and producing language computationally at WORD LEVEL

Why are we studying morphology?

The knowledge of words will help us process language computationally at word level.

Knowledge of words includes word structure and word formation rules.

This knowledge will help us in developing NLP tools like “Morphological Analyzers” and “Morphological Generators”

These are important in Information Retrieval, Machine Translation, Spell checking, etc.

What’s Morphology?

The study of word structure

The study of the mental dictionary:

How are words stored in the mind? What is a possible word?

Example: (i) At the supermarket, the girls bought pink cheeriots and the boys blue fistings.

(ii) When their mother signaled, the girls barried home unhappily.

The words ‘cheeriots’, ‘fistings’ and ‘barried’ do not exist. However, assuming they are valid words of English, we ‘guess’ the meaning by context and the position of the word in the given sentence.

We do this using our general knowledge and linguistic knowledge.

At the supermarket, the girls bought pink cheeriots and the boys blue fistings.

Part of speech

= nouns [comes after adjectives]

[-s ending]

= more than one

Meaning

= some objects that have color

[clue: supermarket]

= some object that is sold; perhaps a toy

The word forms are more like toys, balls, ribbons.

When their mother signaled, the girls barried home unhappily.

Part of Speech = verb [position]

[ends in –ed] = past tense

Base form = barry

Meaning = go

The word form is more like carried, married.

What’s the Longest Word of English?

Could it be ismestablishmentariandisanti?

Why not, when antidisestablishmentarianism is possible?

Possible words with anti and missile: anti-missile (adjective)

anti-missile missile: a missile used for anti-missile purposes

anti- anti-missile missile missile: a missile used against anti-missile missiles

anti^N missile^(N+1), where N can go until….
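The pattern above can be sketched in a few lines of Python. This is only an illustration of the recursion; the function name `anti_missile` is made up for this example.

```python
def anti_missile(n: int) -> str:
    """Build the word with n copies of 'anti-' and n + 1 copies of 'missile'."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # n prefixes are stacked on the first 'missile'; the remaining n heads follow.
    return "anti-" * n + " ".join(["missile"] * (n + 1))


for n in range(3):
    print(n, anti_missile(n))
```

Since n has no upper limit, there is no longest word of this kind.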

There is a systematic way of word formation.

Eg: happy, -ness, un-

This is called Morphotactics.

Morphemes

Have a sound [form] and a meaning:

Example: “cats”

/kæt/ “four-legged animal”

/-s/ “plural number”

Even though /-s/ has a sound and a meaning, it can’t mean “plural” by itself…

It has to attach to a noun.

“A morpheme is the smallest unit of word form that has meaning”

Examples:

cats = cat + -s

girlish = girl + -ish

unfriendly = un- + friend + -ly

cat, -s, girl, -ish, un-, friend, -ly are morphemes

Even Bush knows morphology

(…though he may use it differently than the rest of us)

The war on terrorism has transformationed the US-Russia Relationship

We’re working to help Russia securitize the dismantled warheads

The explorationists are only willing to help move equipment during the winter

This case has had full analyzation and has been looked at a lot

Compositionality

“Explorationists”

explore: to spend an extended effort looking around a particular area - X

-ation: can attach to Verbs, the process of Xing - Y

-ist: can attach to Nouns, one who performs an action Y - Z

-s: attaches to Nouns, more than one Z

explorationists: a compositional word

Fully compositional: the meaning of the word is based on its parts

Eg: computer, impossible, headache

Non-compositionality

“Inflect”

Is inflect morphologically complex?

It contains more than one morpheme.

What do in- and flect mean?

This is a case of a non-compositional meaning.

In explorationists, if you know the meaning of the parts, you know the meaning of the whole. Not necessarily so for inflect.

Non-compositional meaning cannot be derived from its parts.

Lexical/Content words

Words which are not function words are called content words or lexical words: these include nouns, verbs, adjectives, and most adverbs, though some adverbs are function words (e.g. then, why).

They belong to the open class.

Dictionaries define the specific meanings of content words, but can only describe the general usages of function words.

By contrast, grammars describe the use of function words in detail, but have little interest in lexical words.

Function words

Function words or grammatical words are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence.

Function words may be prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles or particles, all of which belong to the group of closed class words.

To know about morphemes, we should know about….

Free morphemes vs. Bound morphemes

Lexical morphemes vs. Functional morphemes

Null/Zero morpheme

Inflectional morphemes vs. Derivational morphemes

Root morphemes vs. Affix morphemes

Free vs Bound morphemes

electr- and tox- have isolable meanings in electric, electrify, toxic, (de-)toxify

But they cannot be pronounced on their own: they are bound morphemes

girl and book have isolable meanings in girls, girlish, books, booked, booking

They can occur on their own: they are free morphemes

Are prefixes and suffixes bound morphemes?

Lexical morpheme

free morphemes: apple, smart, book, slow, eat, write

They can exist on their own as independent words.

bound morphemes: -ceive, -ject, cran-, -ship, un-, dis-

They cannot be used independently.

They need another morpheme [free or bound] to form a word.

Eg: re-ceive, con-ceive, sub-ject, pro-ject, cran-berry, scholar-ship, fellow-ship, un-kind, dis-obey

Functional morphemes

free morphemes:

of, with, she, it, and, although, however, because, then

bound morphemes:

-s, -ed, -ing,

Four-way contrasts

Lexical, Free: Nouns, Verbs, Adj, Adv

cat, town, call, house, hall, smart, fast

Lexical, Bound: including derivational affixes

rasp- [raspberry], cran- [cranberry] , -ceive [conceive,

receive], un-, re-, pre-

Functional, Free: Prepositions, Articles, Conj

with, at, and, an, the, because

Functional, Bound: inflectional affixes

-s, -ed, -ing [eats, walked, laughing]

Exercise 1: Identify the free and bound morphemes in

the following words

walked, talked, danced, arrived

playhouse, watchdog, football player

drinking, playing, eating

import, export, transport

raspberry, cranberry

invert, convert, divert

books, pens, boards

writer, caretaker, rider, fighter

Can the following words be decomposed?

delight, news, traitor, bed, evening

Exercise 2: Identify the lexical and functional morphemes in

the following words. Mention if they are free or bound.

politically

beautiful

between

writing

raspberries

unable

nationalization

Inflection vs. Derivation

Derivational affixes: allow us to make new words with an altered meaning.

There is an error in the computation. [computeV – computationN] {Nominalisation}

It is a computational approach. [computationN – computationalAdj] {Adjectivization}

Inflectional suffixes: required in order to make the sentence grammatical

Inflected words belong to the same class

*Yesterday I walk to class [walkV – walkedV]

*I like all my student [studentN – studentsN]

Inflectional Morphology

Examples: [the POS remains the same]

VERBS

EAT = eat, eats, ate, eaten, eating

DRINK = drink, drinks, drank, drunk, drinking

PLAY = play, plays, played, played, playing

0, -s, -ed, -en, -ing are inflectional morphemes

NOUNS

PLAY = play, plays

GIRL = girl, girls

SHEEP = sheep, sheep

-s, 0 are inflectional morphemes

Derivational morphology

Two types:

May change the category {N, V, A, Adv}

driveV + er = driverN

eatV + able = eatableAdj

girlN + ish = girlishAdj

disturbV + ance = disturbanceN

Doesn’t have to change the category

un + doV = undoV

re + fryV = refryV

un + happyAdj = unhappyAdj

Derivational – more examples

Verbs

eat – eatable [adj], eatables [noun]

drink – drinking [noun]

play – player [noun]

-able, -ing, -er are derivational morphemes

Nouns

play – playful [adj], replay [verb]

girl – girlish [adj], girlhood [noun]

sheep – sheepish[adj]

-ful, re-, -ish, -hood are derivational morphemes

Exercise 3

Each of the words below contains two morphemes – a root and a derivational affix. Decide if the derivational affix changes only the meaning or the class of the root as well.

rewrite, hopeless, happily, national

unclear, creation, happiness

unhappy, helpful, undo

Null/Zero morpheme

A null morpheme is a morpheme that is realized by a phonologically null affix (an empty string of phonological segments)

A null morpheme is an "invisible" affix.

It's also called zero morpheme; the process of adding a null morpheme is called null affixation.

Examples

cat = cat + -0 = ROOT("cat") + SINGULAR

cats = cat + -s = ROOT("cat") + PLURAL

sheep = sheep + -0 = ROOT("sheep") + SINGULAR

sheep = sheep + -0 = ROOT("sheep") + PLURAL

More examples

darken[verb] = dark [adj] + -en

Meaning = make more ‘Adjective’

redden [verb] = red + -en [make more Red]

yellow [verb] = yellow + 0 [make more yellow]

brown [verb] = brown + 0 [make more brown]

blacken [verb] = black + -en [make more black]

Root Morphemes vs Affix morphemes

Root morphemes are morphemes around which larger words are built.

Root morphemes are free or bound.

Affixes are additional morphemes added to roots to create multi- or poly-morphemic words.

Affixes are always bound.

Rats: Root = rat [free morpheme]

Affix = -s [bound morpheme]

Project: Root = -ject [bound morpheme]

Affix = pro- [bound morpheme]

Mice: Root = mouse [free morpheme]

Affix = -s [bound morpheme, realized irregularly]

Ate: Root = eat [free morpheme]

Affix = -ed [bound morpheme, realized irregularly]

Disgracefulness: Root = grace [free morpheme]

Affixes = dis-, -ful, -ness [bound morphemes]

Affixes

Morphemes added to free forms to make other free forms are called affixes.

Mainly four kinds of affixes:

1. Prefixes (at beginning) – “un-” in “unable”

2. Suffixes (at end) – “-ed” in “walked”

3. Circumfixes (at both ends) – “en—en” in enlighten

4. Infixes (in the middle) – “-um-” in kumilad [‘to be red’], fumikas [‘to be strong’]

[kilad = ‘red’, fikas = ‘strong’ in the Bontoc language]

Affixes are bound morphemes.

Prefixes

As a rule, an English prefix does not determine the category of a complex word:

Eg: unhappy, unhappiness, unhappily, undo

What does un- mean when it attaches to adjectives?

unkind, unhappy

What does un- mean when it attaches to verbs?

undo, untie

Suffixes

We can represent the fact that the rightmost suffix determines the category of a word for triplets like:

Eg: rational, rationalize, rationalization

rational = adjective

rationalize = verb

rationalization = noun

Allomorph

An allomorph is a variant form of a morpheme.

The meaning remains the same, while the sound can vary.

Example: the different forms of the past tense morpheme /-ed/ [as we hear]

barked, hissed [t]

raised, smelled [d]

added, trotted [ed]

Allomorphs of /-s/ for nouns

Example: the different forms of plural morpheme /-s/ are: [as we read]

-s --- cats, dogs, boys, girls

-es – watches, churches

-0 – sheep

/-s/, /-es/ and 0 are allomorphs of /-s/

{If pronunciation is considered, then /-s/, /-z/, /-iz/ and 0 are allomorphs of /-s/ in the above examples}

Lexemes and Word-forms

Lemma/Lexeme: Base form of a word. Includes derived forms. It is different from root.

Word-forms: The inflected forms of a word.

Eg: happy, unhappy, unhappiness, happiness, happier, unhappier, happiest, unhappiest

Lexemes: happy, unhappy, unhappiness, happiness

Word-forms:

Of lexeme HAPPY: happy, happier, happiest

Of lexeme UNHAPPY: unhappy, unhappier, unhappiest

Analysing a word

Look at the word ACTORS

Three morphemes: act, -or, -s

Root: act

Suffixes: -or, -s

Derivational suffix: -or

Inflectional suffix: -s

Lexeme: ACTOR

Word-forms: actor, actors

Hierarchical Structure within Words

the word unlockable is ambiguous

[[un + lock] able]: able to be unlocked

[un [lock +able]]: not able-to-be-locked
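The two bracketings can be written down as nested tuples. This is a small sketch; the names `ANALYSES`, `flatten` and `bracket` are made up for this example. Both structures spell out the same surface word.

```python
# Each analysis is a nested pair, bracketed as in the two readings above.
ANALYSES = {
    "able to be unlocked": (("un", "lock"), "able"),
    "not able-to-be-locked": ("un", ("lock", "able")),
}


def flatten(tree) -> str:
    """Concatenate the morphemes of a bracketed structure from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(flatten(part) for part in tree)


def bracket(tree) -> str:
    """Render the structure in square-bracket notation."""
    if isinstance(tree, str):
        return tree
    return "[" + " + ".join(bracket(part) for part in tree) + "]"


for meaning, tree in ANALYSES.items():
    print(flatten(tree), bracket(tree), "=", meaning)
```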

Similar to this structure:

French History Teacher

Old Ladies Hostel

Old Bombay Highway

Disgraceful: [[dis + grace]Noun + ful]Adj

Ungraceful: [un + [grace + ful]Adj]Adj

Exercise 4

Give the hierarchical structure of the following words

unwanted

disfigurement

Interchangeable

united

actors

retries

unhappiness

KEY POINTS

Words and word structure

Root morphemes vs. Affix morphemes

Lexical morphemes

  Free morphemes

  Bound morphemes (including derivational morphemes)

Functional morphemes

  Free morphemes

  Bound morphemes (inflectional morphemes)

Null/Zero morpheme

Morphological Analysis

Morphology = The study of word structure

Morphological Analysis = Analysing words

How do we do it?

Is the process opposite of Morph generation?

Why is it important?

Look at the following words and break them up into smaller units

useless

copilot

psychology

sickness

communism

socialism

artist

dentist

curable

impossible

microstructure

macrostructure

departure

Did you do it this way?

Comparing with other words?

useless – endless, meaningless – “without”

copilot – co-author, co-editor – “with”

psychology – physiology, biology – “study of”

sickness – dizziness, happiness – “state of being”

communism – socialism, feminism – “belief, doctrine”

artist – activist, tourist – “one who”

impossible – impatient, immoral – “not”

microstructure – microscope, microbiology – “small”

macrostructure – macroeconomics, macro - “large”

What about coalition, departure, important, dentist?

Now do the same for this data:

Mtoto amefika

Mtoto anafika

Mtoto atafika

Watoto wamefika

Watoto wanafika

Watoto watafika

Mtu amelala

Mtu analala

Mtu atalala

Watu wamelala

Watu wanalala

Watu watalala

Here are some clues!

Mtoto amefika - "The child has arrived."

Mtoto anafika - "The child is arriving."

Mtoto atafika - "The child will arrive."

Watoto wamefika - "The children have arrived."

Watoto wanafika - "The children are arriving."

Watoto watafika - "The children will arrive."

Mtu amelala - "The man has slept."

Mtu analala - "The man is sleeping."

Mtu atalala - "The man will sleep."

Watu wamelala - "The men have slept."

Watu wanalala - "The men are sleeping."

Watu watalala - "The men will sleep."

Did you do it this way?

Comparing with other words?

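M-toto a-me-fika - "The child has arrived."

M-toto a-na-fika - "The child is arriving."

M-toto a-ta-fika - "The child will arrive."

Wa-toto wa-me-fika - "The children have arrived."

Wa-toto wa-na-fika - "The children are arriving."

Wa-toto wa-ta-fika - "The children will arrive."

M-tu a-me-lala - "The man has slept."

M-tu a-na-lala - "The man is sleeping."

M-tu a-ta-lala - "The man will sleep."

Wa-tu wa-me-lala - "The men have slept."

Wa-tu wa-na-lala - "The men are sleeping."

Wa-tu wa-ta-lala - "The men will sleep."

m- / wa- = singular / plural on the noun; a- / wa- = singular / plural subject marker on the verb

-me- = "has/have ...ed", -na- = "is/are ...ing", -ta- = "will ..."

toto = child, tu = man, fika = arrive, lala = sleep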
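The comparison method can be sketched as a tiny table-driven generator and analyser. The morpheme tables below are read off the twelve sentences only; the names (`NOUN`, `TENSE`, `generate`, `analyse`) and the feature labels are made up for this example, and the sketch covers nothing beyond this data.

```python
from itertools import product

# Morpheme tables inferred from the twelve example sentences only.
NOUN_PREFIX = {"sg": "m", "pl": "wa"}
SUBJECT = {"sg": "a", "pl": "wa"}
NOUN = {"child": "toto", "man": "tu"}
TENSE = {"perfect": "me", "progressive": "na", "future": "ta"}
VERB = {"arrive": "fika", "sleep": "lala"}


def generate(noun: str, number: str, tense: str, verb: str) -> str:
    """Concatenate the morphemes: noun prefix + noun, subject + tense + verb."""
    subject_word = NOUN_PREFIX[number] + NOUN[noun]
    verb_word = SUBJECT[number] + TENSE[tense] + VERB[verb]
    return f"{subject_word} {verb_word}".capitalize()


def _key(sentence: str) -> str:
    # Ignore case and spacing so that both spellings of the data are accepted.
    return sentence.lower().replace(" ", "")


# Analysis is the inverse lookup over everything the generator can produce.
ANALYSES = {
    _key(generate(n, num, t, v)): (n, num, t, v)
    for n, num, t, v in product(NOUN, NOUN_PREFIX, TENSE, VERB)
}


def analyse(sentence: str):
    """Return (noun, number, tense, verb), or None if the form is unknown."""
    return ANALYSES.get(_key(sentence))


print(generate("child", "sg", "perfect", "sleep"))
print(analyse("Watu wamefika"))
```

Because the language is regular in this data set, the same tables serve for both analysis and generation.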

Exercise 5

What is the equivalent of…?

The child has slept.

The children are sleeping.

The men have arrived.

The man will arrive.

Exercise 5

What is the equivalent of…?

The child has slept.

Mtoto amelala

The children are sleeping.

Watoto wanalala

The men have arrived.

Watu wamefika

The man will arrive.

Mtu atafika

Quite easy?!

Easy to identify morphemes if the language is regular and consistent.

Languages are of the following types (but no language belongs to one type solely)

a. Isolating/Analytic (low morpheme-per-word ratio)

Eg: Chinese, Vietnamese, English

Context and syntax more important than morphology

b. Synthetic

Fusional (many features fused in one morpheme)

Eg: Most IE languages

Agglutinating (joining many morphemes together)

Eg: Finnish, Korean, Turkish, Japanese, Telugu

Isolating languages

Isolating languages do not (usually) have any bound morphemes

Eg: Mandarin Chinese

Gou bu ai chi qingcai (dog not like eat vegetable)

This can mean one of the following

(depending on the context)

The dog doesn’t like to eat vegetables

The dog didn’t like to eat vegetables

The dogs don’t like to eat vegetables

The dogs didn’t like to eat vegetables.

Dogs don’t like to eat vegetables.

Other Examples

English

nationalisation

Turkish

uygar+laş+tır+ama+dık+larımız+dan+mış+sınız+casına

Behaving as if you are among those whom we could not cause to become civilized

Telugu

pagalagoTTabOyADu

He was about to break it.

Hindi

jA_sakatA_hE

He/It can go.

A bit of warm-up first: Write a small text in your language

Manipuri

Tamil

Languages are not so regular….

English verbs

Eat – ate – eaten

Heat – heated – heated

Teach – taught – taught

Preach – preached – preached

Cry – cried - cried

English nouns

Box – boxes

Ox – oxen

House – houses

Mouse – mice

Child - children

Spelling changes occur at morpheme boundaries.

Cases of suppletion are often seen.

Vowel change also occurs.

There is always a default regular word formation.

Orthographic rules identified

Name | Description of rule | Example
Consonant doubling | 1-letter consonant doubled before -ing/-ed | beg/begging
E deletion | Silent e dropped before -ing and -ed | make/making
E insertion | e added after -s, -z, -x, -ch, -sh before -s | watch/watches
Y replacement | -y changes to -ie before -s, -i before -ed | try/tries
K insertion | Verbs ending with vowel + -c add -k | panic/panicked

J&M (2000)
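The five rules can be sketched as ordinary string operations. This is a rough approximation, not the rule formalism of J&M: the function names are made up for this example, consonant doubling is limited to stems with a single vowel letter (an assumption made here to avoid forms like "visitting"), and irregular verbs are not handled.

```python
VOWELS = "aeiou"


def is_consonant(ch: str) -> bool:
    return ch.isalpha() and ch not in VOWELS


def add_s(stem: str) -> str:
    """Attach -s, applying E insertion and Y replacement."""
    if stem.endswith(("s", "z", "x", "ch", "sh")):
        return stem + "es"                      # watch -> watches
    if len(stem) > 1 and stem.endswith("y") and is_consonant(stem[-2]):
        return stem[:-1] + "ies"                # try -> tries
    return stem + "s"


def add_ed_or_ing(stem: str, suffix: str) -> str:
    """Attach -ed or -ing, trying the remaining rules in a fixed order."""
    if suffix not in ("ed", "ing"):
        raise ValueError("suffix must be 'ed' or 'ing'")
    if len(stem) > 1 and stem.endswith("c") and stem[-2] in VOWELS:
        return stem + "k" + suffix              # panic -> panicked
    if len(stem) > 1 and stem.endswith("e") and is_consonant(stem[-2]):
        return stem[:-1] + suffix               # make -> making
    if (suffix == "ed" and len(stem) > 1 and stem.endswith("y")
            and is_consonant(stem[-2])):
        return stem[:-1] + "ied"                # try -> tried
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    if (one_vowel and len(stem) >= 3 and is_consonant(stem[-1])
            and stem[-1] not in "wxy" and stem[-2] in VOWELS
            and is_consonant(stem[-3])):
        return stem + stem[-1] + suffix         # beg -> begging
    return stem + suffix


print(add_s("watch"), add_s("try"), add_ed_or_ing("beg", "ing"))
```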

Suppletion

• Replacement of the entire wordform.

• Example: am, is, are, was, were – variants of ‘be’

ate – past tense of ‘eat’

In Hindi: vaha jA rahA hE (“He is going.”).

vaha gayA (“He went”)

In Telugu:

tinnADA? (“Has he eaten?”). tinalEdu (“(He) Has not eaten.”)

vaccADA? (“Has he come?”). rAlEdu (“(He) Has not come.”)

Assimilation

Change of one sound into another because of the influence of neighbouring sounds.

Example: peN + kaL → peNgaL

manishi + lu → manushulu (“vowel harmony”)

Reduplication

Some or all of a word is duplicated to mark a

morphological process

Indonesian

orang (man) ⇒ orangorang (men)

Bambara

wulu (dog) ⇒ wuluowulu (whichever dog)

Turkish

mavi (blue) ⇒ masmavi (very blue)

kırmızı (red) ⇒ kıpkırmızı (very red)

koşa koşa (by running)

Let’s explore your language

Lexical, Free morphemes

Avu {“cow”}, pustakam {“book”}, pilla {“girl”}

Lexical, Bound morphemes

a-, su- {adharmam “in-justice”, suhAsini “good-smile”}

Functional, Free morphemes

Ame {“she”}, mariyu {“and”}, kAni {“but”}

Functional, Bound morphemes

-lo, -ki {inTilO “in-house”, chetiki “to-hand”}

What else is happening?

Is there a null morpheme in your language?

pilli pAlu tAgiMdi {“pilli” in nominative case}

“The cat drank milk”

pilli tOka guburugA uMdi {“pilli” in genitive case}

“The cat’s tail is bushy”

Some changes in lexemes

In Telugu:

illu → inTi before locative case marker “lO”, “ki”

ceyyi → cEti before locative case marker “lO”, “ki”

In Hindi:

laDakA → laDake before ergative case marker “ne”

laDake → laDakOM before the ergative case marker “ne”

What about causative and other constructions? Observe the words.

Exercise 6:

Translate in the language you speak at home.

Ram killed Ravan. Ravan died.

The baby ate. The nanny fed him.

Sita made the nanny feed the baby.

Ravi broke the vase. The vase broke.

Anil broke the lock open.

Ravish axed the goat to death.

Bina nailed the picture on the wall.

In Kashmiri

Number (singular, plural)

The plural form of a noun can be obtained by adding a suffix, by changing a vowel, or without any change.

Example:

Type | Singular | Plural
Suffix addition | kangir “fire pot” | kangri
 | laej “pot” | laeji
Vowel change | kuTh “room” | kueTh
 | gagur “rat” | gagar
Invariants | kAv “crow” | kAv
 | kan “ear” | kan

In Kashmiri

Gender (masculine, feminine)

Example:

Type | Masculine | Feminine
Different roots for each gender | ladake “boy” | kUr “girl”
 | dAnd “ox” | gAv “cow”
Same root but changes not regular | batuk “duck” | batich “fem duck”
 | tsAvul “goat” | tsAvij “fem goat”
Same root with regular changes | dAndur “greengrocer” | dAndren
 | kAndur “baker” | kAndren

Let’s look at Telugu verbs

Agglutinative language

Many suffixes with different features concatenating

Derived words from pagili:

pagiliMdi – pagilipOyiMdi – pagalabOyiMdi – pagalledu – pagalagoTTADu – pagalagoTTabOyADu

Morph analysis = Identifying morphemes and analysing them

boys → boy + s → boy +noun +plural marker

Root = boy

POS = noun

Number = plural

Avulu → Avu +noun +plural

pustakAlu → pustakam +noun +plural

tinnADu → tinu + A + Du → tinu +verb +past +masc.sg

cEsADu → ceyyi + A + Du → ceyyi +verb +past +masc.sg

OdikoMDiddavana -> the one (masculine) who was reading

Odu + i + koLLu + MD + u + iru + dd + a + avanu + a

Root + VBP + AUXV + PST + VBP + AUXV + PST + RP + PRON-3SM + ACC

Exercise:

Analyse every word you used in your text

Ambiguous words: multiple analyses

Plants

Grounded

Ground

Leaves

Found

Can

Morphological Analysis and

Generation

Bidirectional

Example:

Ate ↔ eat + verb + past

Surface level: tinnADu “he ate”

|

Intermediary level: tinn + A + Du

|

Lexical level: tinu + verb + past + 3rd per. sg. masc.

Morphological Analyzers

They are tools to automatically decompose a word into its root and affixes and give related features.

Example:

1st stage – identifying morphemes

ate: root = eat

suffix = ed

2nd stage – analyzing morphemes

ate: root = eat

tense = past

Morph Analyser

Language: Hindi

Input (word): ladake

Output (analysis):

a) root=ladakA, cat=n, gen=m, num=sg, case=obl

b) root=ladakA, cat=n, gen=m, num=pl, case=dir

Morph Generator

Language: Hindi

Input (analysis):

a) root=ladakA, cat=n, gen=m, num=sg, case=obl

b) root=ladakA, cat=n, gen=m, num=pl, case=dir

Output (word): ladake
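The two directions can be sketched with one toy table. The table holds only the two analyses shown for "ladake"; the class and function names are made up for this example and are not the interface of any real Hindi analyser.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    root: str
    cat: str
    gen: str
    num: str
    case: str


# Toy table: only the two analyses given above for the word "ladake".
FORMS = {
    Analysis("ladakA", "n", "m", "sg", "obl"): "ladake",
    Analysis("ladakA", "n", "m", "pl", "dir"): "ladake",
}


def generate(analysis: Analysis) -> str:
    """Morph generator: analysis -> word."""
    return FORMS[analysis]


def analyse(word: str) -> list:
    """Morph analyser: word -> every analysis that generates it."""
    return [a for a, form in FORMS.items() if form == word]


for analysis in analyse("ladake"):
    print(analysis)
```

Generation is a function, while analysis returns a list, because one surface word can have several analyses.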

Machine learning of morphological rules

The supervised approach requires training data; rules are extracted from the training data.

Rules = orthographic, suppletion, assimilation, vowel harmony

Morphological analyzers can be built by using a lexical database with morphological information to build the rules.

A good example of such a lexical database is CELEX, which contains information about lemma, wordform, abbreviation, corpus tagging etc.

Some Applications

Machine Translation

Speech Processing

Information Retrieval

Machine Translation

A POS tagger gives only the part of speech. More information is needed to translate a word correctly.

This includes the tense, aspect and mood of the verbs, and the gender, number and person of the nouns.

Example: [Eng → Hindi translation]

ENGLISH: She went home.

HINDI: vaha ghar gayi.

ENGLISH: He went home.

HINDI: vaha ghar gayaa.

The gender of the pronoun is essential for the translation in Hindi.

The morph analyzer will give the gender information.

Example: [Hindi → Eng translation]

In Hindi ‘vaha’ can have different senses – ‘he’, ‘she’ or ‘that’.

“vaha ghar gayaa”

If we were to translate this, then the extra information on the verb will help us to translate the above sentence correctly as

“He went home”

The ‘yaa’ indicates past tense as well as singular number and masculine gender.

The morph analyzer will give this information.

Information Retrieval

Stemming: A process of reducing an inflected or derived word to its stem. Makes search effective.

Example:

Wordforms = play, plays, played, playing → PLAY

Wordforms = child, children, childish → CHILD

Stemmers are useful but have limitations.

Eg. marketing → market

speaker → speak
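A deliberately naive suffix-stripping stemmer shows both the benefit and the limitations. The suffix list and the function name are made up for this example; real stemmers use far richer rule sets.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("plays", "played", "playing", "marketing", "speaker", "children"):
    print(word, "->", naive_stem(word))
```

The forms of PLAY collapse as intended, but "marketing" is reduced to "market" and "speaker" to "speak" although the meanings differ, and "children" is not related to "child" at all.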

Speech Processing

In Text to Speech tools, too, a Morph Analyzer is essential along with Part of Speech information.

With extra information on the words, the efficiency increases.

The intonation, the pause, the stress etc. can be close to the way humans speak.

This additional information is given by morph analyzers.

(POS taggers help in pronouncing ambiguous words the right way: wind, wound, minute etc.)

Linguistic models for describing morphology

Item and Arrangement (morpheme based)

Item and Process (lexeme based)

Word and Paradigm (word based)

Approaches

Paradigm based

Finite State based

Combination of both – fast and efficient

PARADIGM BASED

MORPHOLOGICAL ANALYZERS

Requirements for building paradigm based Morph Analyzers

Knowledge of Lexeme and Word forms

Root and Affix dictionaries

Paradigm Table

Paradigm Class

The lexemes are stored in the dictionaries and the word forms as paradigms.

Lexeme and Word form

APPLE: apple, apples

CHURCH: church, churches

BOY: boy, boys

WATCH: watch, watches

SPY: spy, spies

The word in upper case is called LEXEME and the inflected forms are WORD FORMS.

Lexemes are the headwords in a dictionary.

Lexeme and Word form

Another example:

played is a word form of the lexeme PLAY

plays is a word form of the lexeme PLAY(1)

plays is a word form of the lexeme PLAY(2)

where PLAY(1) is a verb and PLAY(2) is a noun.

PLAY(1) and PLAY(2) are two different lexemes.

Exercise 7

Give the lexeme of the following word forms:

ate

played

manufactured

glasses

players

bites

Exercise 8

“manufactured” can be a verb in past tense or an adjective. So it belongs to two different lexemes – MANUFACTURE and MANUFACTURED.

Which of the following words belong to more than one lexeme?

ate, wanted, wrote, written, finished

Root and Affix dictionaries

Root dictionary contains a list of roots or the base forms - the lexemes.

It is stored usually with its part of speech.

Affix dictionary contains a list of all the affixes in a language.

The features of the affixes are stored here.

The features are stored as attribute value pairs.

Example entries in a dictionary

Root dictionary

eat <root=‘eat’, category=‘verb’>

book <root=‘book’, category=‘verb’>

book <root=‘book’, category=‘noun’>

Affix dictionary

+s <tense = ‘present’>

+ed <tense = ‘past’>

+en <aspect = ‘perfective’>

+ing <aspect = ‘progressive’>
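The entries above map naturally onto attribute-value dictionaries. This is a sketch only: the variable names and the `analyse` helper are made up for this example, and the affixes are restricted to verb roots because the four listed affixes carry verbal features.

```python
# Root dictionary: one list of attribute-value entries per root spelling.
ROOT_DICTIONARY = {
    "eat": [{"root": "eat", "category": "verb"}],
    "book": [
        {"root": "book", "category": "verb"},
        {"root": "book", "category": "noun"},
    ],
}

# Affix dictionary: the features contributed by each affix.
AFFIX_DICTIONARY = {
    "s": {"tense": "present"},
    "ed": {"tense": "past"},
    "en": {"aspect": "perfective"},
    "ing": {"aspect": "progressive"},
}


def analyse(root: str, affix: str) -> list:
    """Merge the features of a verb root with the features of an affix."""
    return [
        {**entry, **AFFIX_DICTIONARY[affix]}
        for entry in ROOT_DICTIONARY.get(root, [])
        if entry["category"] == "verb"
    ]


print(analyse("eat", "en"))
print(analyse("book", "ed"))
```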

Paradigm table

A paradigm table represents the inflected forms of a particular lexeme.

It includes the conjugation of verbs and declensions of nouns, adjectives, pronouns etc.

Example:

APPLE: apple, apples

EAT: eat, eats, ate, eaten, eating

SMART: smart, smarter, smartest

Conjugation of English verbs

play plays played played playing

eat eats ate eaten eating

look looks looked looked looking

dance dances danced danced dancing

push pushes pushed pushed pushing

Declension of English nouns

apple, apples

boy, boys

church, churches

watch, watches

spy, spies

Declension of English adjectives

smart, smarter, smartest

tall, taller, tallest

Exercise 9

Give the paradigm table for 5 different nouns and 5 different verbs in English.

Paradigm Class

A paradigm class contains the classes of lexemes i.e. the prototypical root and all the roots that fall in its class including the given root.

Those words which decline or conjugate in exactly the same way fall into one paradigm class.

The English verbs ‘PLAY’ and ‘LOOK’ have the following paradigm:

play plays played played playing

look looks looked looked looking

So they belong to the same class.

But ‘PUSH’ since it differs in its present tense form i.e. it has ‘-es’ and not ‘- s’ falls in another class. Its paradigm is as follows:

push pushes pushed pushed pushing

The English nouns ‘PLAY’ and ‘BOY’ have the following paradigm:

play plays

boy boys

So they belong to the same class.

But ‘SPY’ falls in another class. Its paradigm is as follows:

spy spies

Paradigm class is represented by one member of the class.

Representative | Category | Members
eat | V | eat
play | V | play, talk, walk, train
push | V | push, fish
play | N | play, boy, day
spy | N | spy, sky
church | N | church, watch

Exercise 10

Which of the following verbs belong to the same paradigm class?

mince ride walk speak

shake play dance take

Which of the following nouns belong to the same paradigm class?

girl house dish book

mouse beach flower pencil

Identify paradigm classes in your own language.

Avu = {Avu, bassu, bomma, chekka}

Add-Delete Rules for Generation and Analysis of words

• Most of the morphological analyzers handle only inflectional morphology.

• The rules given here are for such inflectional processes only.

• An add-delete rule would look like: [n1, n2, xyz]

where

n1 = number of characters to delete from the end

n2 = number of characters to add at the end

xyz = the characters to add

Rules to generate wordforms in a paradigm table for a given paradigm class.

Eg. play | Eg. eat
play[0,0,0] = play | eat[0,0,0] = eat
play[0,1,s] = plays | eat[0,1,s] = eats
play[0,2,ed] = played | eat[3,3,ate] = ate
play[0,2,ed] = played | eat[0,2,en] = eaten
play[0,3,ing] = playing | eat[0,3,ing] = eating
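An add-delete rule is easy to apply mechanically. The sketch below writes the slide's 0 for "nothing to add" as an empty string; the names `apply_rule`, `paradigm`, `PLAY_RULES` and `EAT_RULES` are made up for this example.

```python
def apply_rule(word: str, rule: tuple) -> str:
    """Apply [n1, n2, xyz]: delete n1 characters from the end, then add xyz."""
    delete, add_count, chars = rule
    if len(chars) != add_count:
        raise ValueError("n2 must equal the number of characters to add")
    if delete > len(word):
        raise ValueError("cannot delete more characters than the word has")
    return word[: len(word) - delete] + chars


# One rule per cell of the paradigm table, in the order used above.
PLAY_RULES = [(0, 0, ""), (0, 1, "s"), (0, 2, "ed"), (0, 2, "ed"), (0, 3, "ing")]
EAT_RULES = [(0, 0, ""), (0, 1, "s"), (3, 3, "ate"), (0, 2, "en"), (0, 3, "ing")]


def paradigm(root: str, rules: list) -> list:
    """Generate the full paradigm table row for a root."""
    return [apply_rule(root, rule) for rule in rules]


print(paradigm("play", PLAY_RULES))
print(paradigm("look", PLAY_RULES))  # 'look' is in the same paradigm class
print(paradigm("eat", EAT_RULES))
```

The rules are stored once per paradigm class, so every member of the class (play, look, talk, walk) reuses the same list.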

Exercise 11

Write similar rules to generate paradigm tables of the verbs ‘dance’, ‘cry’, ‘sleep’, ‘drink’, ‘write’ and nouns ‘book’, ‘mouse’, ‘church’, ‘sheep’.

In the same way, these rules can be used to find the root of an inflected word.

For example, the root of ‘playing’ is ‘play’ – we get it by deleting 3 characters and adding nothing.

playing [3,0,0] = play | eating [3,0,0] = eat
played [2,0,0] = play | eaten [2,0,0] = eat
played [2,0,0] = play | ate [3,3,eat] = eat
plays [1,0,0] = play | eats [1,0,0] = eat
play [0,0,0] = play | eat [0,0,0] = eat
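For analysis, the rules are tried one after another and a result is kept only if it is a known root. This is a sketch under the assumption of a toy root dictionary with just ‘play’ and ‘eat’; the names `ROOTS`, `ANALYSIS_RULES`, `strip_rule` and `find_roots` are made up for this example.

```python
ROOTS = {"play", "eat"}  # toy root dictionary

# Analysis rules from the table above; "" stands for the slide's 0.
ANALYSIS_RULES = [(3, 0, ""), (2, 0, ""), (1, 0, ""), (0, 0, ""), (3, 3, "eat")]


def strip_rule(word_form: str, rule: tuple):
    """Apply [n1, n2, xyz] to a word form; return None if it cannot apply."""
    delete, add_count, chars = rule
    if len(chars) != add_count:
        raise ValueError("n2 must equal the number of characters to add")
    if delete > len(word_form):
        return None
    return word_form[: len(word_form) - delete] + chars


def find_roots(word_form: str) -> set:
    """Try every rule and keep only results found in the root dictionary."""
    candidates = (strip_rule(word_form, rule) for rule in ANALYSIS_RULES)
    return {c for c in candidates if c in ROOTS}


for form in ("playing", "played", "ate", "eaten"):
    print(form, "->", find_roots(form))
```

Note that a rule stated only in character counts does not check what is deleted: [3,3,eat] maps any three-letter word to ‘eat’. A practical analyser would also verify the deleted characters.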

Exercise 12

Write similar rules to find the root of the verbs ‘kept’, ‘cried’, ‘sat’, ‘stood’, ‘written’ and nouns ‘mice’, ‘legs’, ‘spies’, ‘uncles’, ‘houses’.

Let’s look at the paradigms in some ILs

Hindi Morph Analyzer demo

http://sampark.iiit.ac.in/hindimorph/

Input text::

लड़का आ गया। Output text:

Input text::

लड़के आ गये। Output text:

Note : Features explanation of a word

1. root : Root of the word (e.g. ladZake word has root ladZakA)

2. cat : Category of the word (e.g. Noun=n, Pronoun=pn, Adjective=adj, verb=v, adverb=adv, post-position=psp and avyaya=avy)

3. gen : Gender of the word (e.g. Masculine=m, Feminine=f, Neuter=n, mf, mn, fn, and any)

4. num : Number of the word (e.g. Singular=sg, Plural=pl, dual, and any )

5. per : Person of the word (e.g. 1st Person=1, 2nd Person=2, 3rd Person=3, and any)

6. case : Case of the word (e.g. direct=d, oblique=o and any)

7. tam : Case marker for noun or Tense Aspect Mood(TAM) for verb of the word

8. suff : Suffix of the word

agr_gen: Agreeing gender of the following noun

agr_num: Agreeing number of the following noun

agr_cas: Agreeing case of the following noun

emph: Emphatic feature of the word

hon: Honorific feature of the word

derive_root: Derived root

derive_lcat: Derived lexical category

derive_gen: Derived gender

derive_suff: Derived suffix

FINITE STATE TRANSDUCERS (FST)

BASED MORPHOLOGICAL ANALYZERS

Finite State Automaton/Transducers

Computational model to handle the morphology analysis/generation

Used in morphology, phonology, TTS, data mining.

Popular, since mathematically extremely well understood.

Easy to implement.

Many off-the-shelf tools available for direct use.

Optimisation: finding a machine with the minimum number of states that performs the same function

Simple extensions to Markov Models useful for POS tagging

Finite State Automaton

A Finite State Automaton is a machine composed of:

An input tape

A finite number of states, with one initial and one or more accepting states

Actions in terms of transitions from one state to the other, depending on the current state and the input.

The input tape has a sequence of symbols written on it. The machine is initially in the initial state. The reader reads one symbol at a time, and depending on the current state and input symbol, a transition to another state takes place.

Finite state automaton (FSA)

Morphotactic Models

English nominal inflection

[Figure: FSA with states q0, q1, q2. Arcs: q0 → q1 on reg-n; q1 → q2 on plural (-s); q0 → q2 on irreg-sg-n; q0 → q2 on irreg-pl-n]

•Inputs: cats, goose, geese

•reg-n: regular noun

•irreg-pl-n: irregular plural noun

•irreg-sg-n: irregular singular noun
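The morphotactic FSA above can be sketched in a few lines of Python. This is only an illustration: the mini-lexicon is hypothetical, the machine reads morphemes rather than letters, and treating q1 and q2 as accepting states is an assumption (the slide does not mark them).

```python
# Hypothetical mini-lexicon: morpheme -> morpheme class used on the arcs.
LEXICON = {
    "cat": "reg-n",
    "goose": "irreg-sg-n",
    "geese": "irreg-pl-n",
    "-s": "plural",
}

# (current state, morpheme class) -> next state
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural"): "q2",
}

# Assumption: a bare regular noun (q1) and any complete form (q2) are accepted.
ACCEPTING = {"q1", "q2"}


def accepts(morphemes):
    """Return True if the morpheme sequence is a legal noun form."""
    state = "q0"
    for morpheme in morphemes:
        label = LEXICON.get(morpheme)  # None for unknown morphemes
        state = TRANSITIONS.get((state, label))
        if state is None:  # no arc for this class from this state
            return False
    return state in ACCEPTING


for form in (["cat", "-s"], ["goose"], ["geese"], ["goose", "-s"]):
    print(form, accepts(form))
```

The last input, goose + -s, is rejected because no plural arc leaves the state reached by an irregular noun.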

Derivational morphology: adjective fragment

[Figure: FSA with states q0 to q5. Arc labels: un-; adj-root1 (twice); adj-root2; -er, -ly, -est; -er, -est]

• Adj-root1: clear, happy, real

• Adj-root2: big, red
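One way to see what such an FSA licenses is to enumerate every morpheme sequence it accepts. The sketch below does that for the adjective fragment. The wiring of the states and the choice of accepting states are assumptions reconstructed from the arc labels, not taken from the figure itself.

```python
# Assumed wiring: un- is optional before adj-root1 only; adj-root1 takes
# -er/-ly/-est, adj-root2 takes -er/-est.
ARCS = {
    "q0": [("un-", "q1"), ("adj-root1", "q2"), ("adj-root2", "q3")],
    "q1": [("adj-root1", "q2")],
    "q2": [("-er", "q5"), ("-ly", "q5"), ("-est", "q5")],
    "q3": [("-er", "q4"), ("-est", "q4")],
}
ACCEPTING = {"q2", "q3", "q4", "q5"}  # assumption: bare roots are words too

# Root classes as listed on the slide.
ROOTS = {
    "adj-root1": ["clear", "happy", "real"],
    "adj-root2": ["big", "red"],
}


def generate(state="q0", prefix=()):
    """Yield every morpheme sequence accepted from the given state."""
    if state in ACCEPTING:
        yield prefix
    for label, target in ARCS.get(state, []):
        # A class label expands to its roots; affixes stand for themselves.
        for morpheme in ROOTS.get(label, [label]):
            yield from generate(target, prefix + (morpheme,))


forms = [" ".join(sequence) for sequence in generate()]
print(len(forms))
print(forms[:8])
```

The output is a list of morpheme sequences such as `un- clear -ly`, not surface spellings: turning `happy -er` into `happier` or `big -er` into `bigger` needs separate spelling rules. Sequences like `un- big` or `red -ly` are never produced, which is the point of splitting the roots into two classes.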

Parsing with Finite State Transducers

cats → cat +N +PL

Kimmo Koskenniemi’s two-level morphology

Words represented as correspondences between the lexical level (the morphemes) and the surface level (the orthographic word)

Morphological parsing: building mappings between the lexical and surface levels

Lexical: c a t +N +PL

Surface: c a t s
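The two levels can be held together as a list of symbol pairs, one pair per position. A minimal sketch follows. Aligning +N with nothing and +PL with the surface s is an assumption made for illustration, and the function names are hypothetical.

```python
# One correspondence for "cats" as (lexical symbol, surface symbol) pairs.
# An empty string means the symbol has no counterpart on that level.
PAIRS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]


def lexical_side(pairs):
    """Read off the lexical level, with a space before each feature tag."""
    return "".join(
        " " + lex if lex.startswith("+") else lex for lex, _ in pairs
    )


def surface_side(pairs):
    """Read off the surface level (the orthographic word)."""
    return "".join(surf for _, surf in pairs)


print(lexical_side(PAIRS))
print(surface_side(PAIRS))
```

Parsing amounts to recovering the lexical side from the surface side, and generation is the same pairing read in the other direction.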

Finite State Transducers

FSTs map between one set of symbols and another using an FSA whose alphabet is composed of pairs of symbols from the input and output alphabets

In general, FSTs can be used as:

a translator (Hello : vanakkam)

a parser/generator (Hello : How may I help you?)

a mapping between the lexical and surface levels of Kimmo’s 2-level morphology

FST for a 2-level Lexicon

Example

Reg-n: c a t

Irreg-pl-n: g o:e o:e s e

Irreg-sg-n: g o o s e

[Figure: two transducer paths. Regular noun: q0 → q1 → q2 → q3 on c, a, t. Irregular noun: q0 → q1 → q2 → q3 → q4 → q5 on g, e:o, e:o, s, e]

FST for English Nominal Inflection

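[Figure: transducer with states q0 to q7. Arcs out of q0: reg-n, irreg-n-sg, irreg-n-pl. Each noun class is followed by a +N:ε arc and then by a number arc into q7. Number arc labels shown: +PL:^s#, +PL:-s#, +SG:-#]

Combining (by cascade or composition) this FSA with FSAs for each noun type replaces, e.g., reg-n with every regular noun representation in the lexicon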
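A small nondeterministic transducer along these lines can be sketched as follows. The assumptions are stated up front: stems are treated as single symbols (as if the lexicon had already been composed in), the state numbering and the +SG:# and +PL:# outputs for irregular nouns follow the usual textbook version of this machine rather than the slide, and the two-noun lexicon is hypothetical.

```python
# (state, input symbol) -> list of (output string, next state).
# "^" marks a morpheme boundary and "#" a word boundary.
ARCS = {
    ("q0", "cat"): [("cat", "q1")],                      # reg-n
    ("q0", "goose"): [("goose", "q2"), ("geese", "q3")],  # irreg-n-sg / irreg-n-pl
    ("q1", "+N"): [("", "q4")],
    ("q2", "+N"): [("", "q5")],
    ("q3", "+N"): [("", "q6")],
    ("q4", "+SG"): [("#", "q7")],
    ("q4", "+PL"): [("^s#", "q7")],
    ("q5", "+SG"): [("#", "q7")],
    ("q6", "+PL"): [("#", "q7")],
}
FINAL = {"q7"}


def transduce(symbols, state="q0", output=""):
    """Return every output string for the input symbols (possibly none)."""
    if not symbols:
        return [output] if state in FINAL else []
    results = []
    for out, next_state in ARCS.get((state, symbols[0]), []):
        results.extend(transduce(symbols[1:], next_state, output + out))
    return results


for lexical in ("cat +N +PL", "cat +N +SG", "goose +N +PL", "goose +N +SG"):
    print(lexical, "->", transduce(lexical.split()))
```

For goose the machine tries both stem arcs and keeps only the path whose number arc matches, so the plural comes out as `geese#` without any suffix. Spelling rules would then turn `cat^s#` into the surface form.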

FST for Telugu verb

Examples: tinnADu, tinaledu; vinnADu, vinaledu; cEsADu, cEyaledu

[Figure: transducer with states q0 to q8. Arc labels: tin : tinu, verb; cEs : ceyyi, verb; cEy : ceyyi, verb; A : past; a; lEdu : neg; Du : sg, masc, 3p]

Issues – mainly due to ambiguity

When shot at, the dove dove into the bushes.

The insurance was invalid for the invalid.

They were too close to the door to close it.

The buck does funny things when the does are present.

Upon seeing the tear in the painting I shed a tear.

After seeing the wound, the doctor wound his watch.

Such words with multiple analyses affect MT and other NLP applications.

Examples

veyyi = verb or noun?

“put”, “thousand”

perugu = verb or noun?

“grow”, “yogurt”

nAku = verb or pronoun in k4

“to lick”, “for/to me”

(icecream) nAkivvu = nAku + ivvu (“Give it to me”)

or nAki + ivvu (“Lick and then give”)
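Both kinds of ambiguity can be made concrete with a toy lookup. The sketch below is illustrative only: the lexicon holds just the words from this slide, the gloss for nAki is inferred from the translation given above, and the joining rule (drop the first word's final vowel before a vowel-initial word) is a simplifying assumption, not a full account of Telugu sandhi.

```python
# Toy lexicon: wordform -> list of (category, gloss) analyses.
LEXICON = {
    "veyyi": [("verb", "put"), ("noun", "thousand")],
    "perugu": [("verb", "grow"), ("noun", "yogurt")],
    "nAku": [("verb", "to lick"), ("pronoun", "for/to me")],
    "nAki": [("verb", "lick and then")],  # gloss inferred, see lead-in
    "ivvu": [("verb", "give")],
}

VOWELS = "aAiIuUeEoO"


def join(first, second):
    """Simplified sandhi (assumption): drop a final vowel before a vowel."""
    if first[-1] in VOWELS and second[0] in VOWELS:
        return first[:-1] + second
    return first + second


def segmentations(surface):
    """Return every pair of lexicon words that joins to the surface form."""
    return [
        (first, second)
        for first in LEXICON
        for second in LEXICON
        if join(first, second) == surface
    ]


print(LEXICON["veyyi"])
print(segmentations("nAkivvu"))
```

A single wordform such as veyyi returns two analyses, and nAkivvu returns two segmentations. An analyser can list these alternatives, but choosing between them needs context.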

Tools

Several off-the-shelf FST tools are available that support the Word and Paradigm model:

1. XFST (Xerox Finite State Tool)

2. SFST (Stuttgart Finite State Transducer Tools)

3. Lttoolbox (part of Apertium)

To wind up…

Knowledge of words and word structure is important.

A good linguistic analysis of a language will help in formulating rules of word formation, paradigm tables and paradigm classes.

Morphological generation is the reverse process of morphological analysis.

A thorough study of one’s language at word level will help build a morphological analyser (MA) using any approach.

Phonological changes at morpheme boundaries and agglutinative and fusional types of suffixing make analysis more challenging.

Selecting the right analysis when a wordform has more than one analysis is another challenging task, especially in Machine Translation systems.

References

Akshar Bharati, Vineet Chaitanya, Rajeev Sangal. 1995. Chapter 3: Words and Their Analyzer. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.

Akshar Bharati, Rajeev Sangal, Dipti M Sharma, Radhika Mamidi. 2004. Generic Morphological Analysis Shell. In Proceedings of LREC. Lisbon, Portugal. 24th-30th May 2004.

Dan Jurafsky and J.H. Martin. 2000. Speech and Language Processing. Prentice-Hall.

Monis Khan, Radhika Mamidi, Umar Hamid Shah. 2005. Morphological Analyzer for Kashmiri. LSI.

Amba Kulkarni, G Uma Maheshwar Rao. Building Morphological Analyzers and Generators for Indian Languages using FST. Tutorial for ICON 2010.

Viswanathan S et al. 2003. A Tamil Morphological Analyser. Recent Advances in NLP. 31-39. Mysore: Central Institute of Indian Languages.

Parameshwari K. 2011. An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil. Language in India. 11: 5.

R. Veerappan, P J Antony, S Saravanan, K P Soman. 2011. A Rule based Kannada Morphological Analyzer and Generator using Finite State Transducer. International Journal of Computer Applications (0975 – 8887) Volume 27, No. 10, August 2011.

Thank you!