COMPUTATIONAL
MORPHOLOGY
Radhika Mamidi
Winter workshop/school on Machine Learning and Text Analytics
17 December, 2013
Outline
What is Computational Morphology?
Morphology concepts
Computational approaches
Morphological Analysis
Morphological Generator
Computational Approaches
Paradigm based
Finite State based
Statistical based
Issues to be handled
Computational Morphology
Linguistics: Scientific study of language at these levels
Sound – Phonetics and Phonology
Word - Morphology
Sentence - Syntax
Discourse – Discourse Analysis
Computational Linguistics: Processing and producing language computationally.
Computational Morphology: Processing and producing language computationally at WORD LEVEL
Why are we studying morphology?
The knowledge of words will help us process language computationally at word level.
Knowledge of words includes word structure and word formation rules.
This knowledge will help us in developing NLP tools like “Morphological Analyzers” and “Morphological Generators”
These are important in Information Retrieval, Machine Translation, Spell checking, etc.
What’s Morphology?
The study of word structure
The study of the mental dictionary:
How are words stored in the mind? What is a possible word?
Example:
(i) At the supermarket, the girls bought pink cheeriots and the boys blue fistings.
(ii) When their mother signaled, the girls barried home unhappily.
The words ‘cheeriots’, ‘fistings’ and
‘barried’ do not exist. However, assuming
they are valid words of English, we
‘guess’ the meaning by context and the
position of the word in the given sentence.
We do this using our general knowledge
and linguistic knowledge.
At the supermarket, the girls bought pink cheeriots and the boys blue fistings.
Part of speech
= nouns [comes after adjectives]
[-s ending]
= more than one
Meaning
= some objects that have color
[clue: supermarket]
= some object that is sold; perhaps a toy
The word forms are more like toys, balls, ribbons.
When their mother signaled, the girls barried home unhappily.
Part of Speech = verb [position]
[ends in –ed] = past tense
Base form = barry
Meaning = go
The word form is more like carried, married.
What’s the Longest Word of English?
Could it be ismestablishmentariandisanti?
Why not, when antidisestablishmentarianism is possible?
Possible words with anti and missile: anti-missile (adjective)
anti-missile missile: a missile used for anti-missile purposes
anti- anti-missile missile missile: a missile used against anti-missile missiles
anti^N missile^(N+1), where N can go until….
There is a systematic way of word formation.
Eg: happy, -ness, un-
This is called Morphotactics.
Morphemes
Have a sound [form] and a meaning:
Example: “cats”
/kaet/ “four-legged animal”
/-s/ “plural number”
Even though /-s/ has a sound and a meaning, it can’t mean “plural” by itself…
It has to attach to a noun.
“A morpheme is the smallest unit of word form that has meaning”
Examples:
cats = cat + -s
girlish = girl + -ish
unfriendly = un- + friend + -ly
cat, -s, girl, -ish, un-, friend, -ly are morphemes
Even Bush knows morphology
(…though he may use it differently than the rest of us)
The war on terrorism has transformationed the US-Russia Relationship
We’re working to help Russia securitize the dismantled warheads
The explorationists are only willing to help move equipment during the winter
This case has had full analyzation and has been looked at a lot
Compositionality
“Explorationists”
explore: to spend an extended effort looking around a particular area - X
-ation: can attach to Verbs, the process of Xing - Y
-ist: can attach to Nouns, one who performs an action Y - Z
-s: attaches to Nouns, more than one Z
explorationists: a compositional word
Fully compositional meaning is based on its parts
Eg: computer, impossible, headache
Non-compositionality
“Inflect”
Is inflect morphologically complex?
It contains more than one morpheme.
What do in- and flect mean?
This is a case of a non-compositional meaning.
In explorationists, if you know the meaning of the parts, you know the meaning of the whole. Not necessarily so for inflect.
Non-compositional meaning cannot be derived from its parts.
Lexical/Content words
Words which are not function words are called content words or lexical words: these include nouns, verbs, adjectives, and most adverbs, though some adverbs are function words (e.g. then, why).
They belong to open class. Dictionaries define the specific meanings of content
words, but can only describe the general usages of function words.
By contrast, grammars describe the use of function words in detail, but have little interest in lexical words.
Function words
Function words or grammatical words are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence.
Function words may be prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles or particles, all of which belong to the group of closed class words.
To know about morpheme we should know about….
Free morphemes vs. Bound morphemes
Lexical morphemes vs. Functional morphemes
Null/Zero morpheme
Inflectional morphemes vs. Derivational morphemes
Root morphemes vs. Affix morphemes
Free vs Bound morphemes
electr- and tox- have isolable meanings in electric, electrify, toxic, (de-)toxify
But they cannot be pronounced on their own: they are bound morphemes
girl and book have isolable meanings in girls, girlish, books, booked, booking
They can occur on their own: they are free morphemes
Are prefixes and suffixes bound morphemes?
Lexical morpheme
free morphemes: apple, smart, book, slow, eat, write
They can exist on their own as independent words.
bound morphemes: -ceive, -ject, cran-, -ship, un-, dis-
They cannot be used independently.
They need another morpheme [free or bound] to form a word.
Eg: re-ceive, con-ceive, sub-ject, pro-ject, cran-berry, scholar-ship, fellow-ship, un-kind, dis-obey
Functional morphemes
free morphemes:
of, with, she, it, and, although, however, because, then
bound morphemes:
-s, -ed, -ing,
Four-way contrasts
Lexical, Free: Nouns, Verbs, Adj, Adv
cat, town, call, house, hall, smart, fast
Lexical, Bound: including derivational affixes
rasp- [raspberry], cran- [cranberry] , -ceive [conceive,
receive], un-, re-, pre-
Functional, Free: Prepositions, Articles, Conj
with, at, and, an, the, because
Functional, Bound: inflectional affixes
-s, -ed, -ing [eats, walked, laughing]
Exercise 1: Identify the free and bound morphemes in
the following words
walked, talked, danced, arrived
playhouse, watchdog, football player
drinking, playing, eating
import, export, transport
raspberry, cranberry
invert, convert, divert
books, pens, boards
writer, caretaker, rider, fighter
Can the following words be decomposed?
delight, news, traitor, bed, evening
Exercise 2: Identify the lexical and functional morphemes in
the following words. Mention if they are free or bound.
politically
beautiful
between
writing
raspberries
unable
nationalization
Inflection vs. Derivation
Derivational affixes: allow us to make new words that alter the meaning.
There is an error in the computation. [compute (V) → computation (N)] {Nominalisation}
It is a computational approach. [computation (N) → computational (Adj)] {Adjectivization}
Inflectional suffixes: required in order to make the sentence grammatical
Inflected words belong to the same class
*Yesterday I walk to class [walk (V) → walked (V)]
*I like all my student [student (N) → students (N)]
Inflectional Morphology
Examples: [the POS remains the same]
VERBS
EAT = eat, eats, ate, eaten, eating
DRINK = drink, drinks, drank, drunk, drinking
PLAY = play, plays, played, played, playing
0, -s, -ed, -en, -ing are inflectional morphemes
NOUNS
PLAY = play, plays
GIRL = girl, girls
SHEEP = sheep, sheep
-s, 0 are inflectional morphemes
Derivational morphology
Two types:
May change the category {N, V, A, Adv}
drive (V) + -er = driver (N)
eat (V) + -able = eatable (Adj)
girl (N) + -ish = girlish (Adj)
disturb (V) + -ance = disturbance (N)
Doesn’t have to change the category
un- + do (V) = undo (V)
re- + fry (V) = refry (V)
un- + happy (Adj) = unhappy (Adj)
Derivational – more examples
Verbs
eat – eatable [adj], eatables [noun]
drink – drinking [noun]
play – player [noun]
-able, -ing, -er are derivational morphemes
Nouns
play – playful [adj], replay [verb]
girl – girlish [adj], girlhood [noun]
sheep – sheepish[adj]
-ful, re-, -ish, -hood are derivational morphemes
Exercise 3
Each of the words below contains two morphemes – a root and a derivational affix. Decide if the derivational affix changes only the meaning or the class of the root as well.
rewrite hopeless happily national
unclear creation happiness
unhappy helpful undo
Null/Zero morpheme
A null morpheme is a morpheme that is realized by a phonologically null affix (an empty string of phonological segments)
A null morpheme is an "invisible" affix.
It's also called zero morpheme; the process of adding a null morpheme is called null affixation.
Examples
cat = cat + -0 = ROOT("cat") + SINGULAR
cats = cat + -s = ROOT("cat") + PLURAL
sheep = sheep + -0 = ROOT(“sheep") + SINGULAR
sheep = sheep + -0 = ROOT(“sheep") + PLURAL
More examples
darken[verb] = dark [adj] + -en
Meaning = make more ‘Adjective’
redden [verb] = red + -en [make more Red]
yellow [verb] = yellow + 0 [make more yellow]
brown [verb] = brown + 0 [make more brown]
blacken [verb] = black + -en [make more black]
Root Morphemes vs Affix morphemes
Root morphemes are morphemes around which larger
words are built.
Root morphemes are free or bound.
Affixes are additional morphemes added to roots to
create multi- or poly-morphemic words.
Affixes are always bound.
Rats
Root = rat [free morpheme]
Affix = -s [bound morpheme]
Project
Root = -ject [bound morpheme]
Affix = pro- [bound morpheme]
Mice
Root = mouse [free morpheme]
Affix = -s [bound morpheme]
Ate
Root = eat [free morpheme]
Affix = -ed [bound morpheme]
Disgracefulness
Root = grace [free morpheme]
Affixes = dis-, -ful, -ness [bound morphemes]
Affixes
Morphemes added to free forms to make other free
forms are called affixes.
Mainly four kinds of affixes:
1. Prefixes (at beginning) – “un-” in “unable”
2. Suffixes (at end) – “-ed” in “walked”
3. Circumfixes (at both ends) – “en—en” in enlighten
4. Infixes (in the middle) – “-um-” in kumilad [‘to be red’], fumikas
[‘to be strong’]
[ kilad = ‘red’, fikas = ‘strong’ in Bontoc language]
Affixes are bound morphemes.
Prefixes
No prefix can determine the category of a complex
word:
Eg: unhappy, unhappiness, unhappily, undo
What does un- mean when it attaches to adjectives?
unkind, unhappy
What does un- mean when it attaches to verbs?
undo, untie
Suffixes
We can represent the fact that the rightmost suffix
determines the category of a word for triplets like -
Eg: rational, rationalize, rationalization
rational = adjective
rationalize = verb
rationalization = noun
Allomorph
An allomorph is a variant form of a morpheme.
The meaning remains the same, while the sound can
vary.
Example: the different forms of past tense
morpheme /-ed/ [as we hear]
barked, hissed [t]
raised, smelled [d]
added, trotted [ed]
Allomorphs of /-s/ for nouns
Example: the different forms of plural morpheme /-s/ are: [as we read]
-s --- cats, dogs, boys, girls
-es – watches, churches
-0 – sheep
/-s/, /-es/ and 0 are allomorphs of /-s/
{If pronunciation is considered, then /-s/, /-z/, /-iz/ and 0 are allomorphs of /-s/ in the above examples}
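As a minimal sketch of the written allomorphs listed above, the function below picks -s, -es or the zero allomorph from the spelling of the noun. The function name, the list of -es triggers and the one-word zero-plural list are illustrative assumptions, not a complete description of English.

```python
# Nouns that take the zero allomorph (assumed, tiny illustrative list).
ZERO_PLURAL = {"sheep"}

# Spellings after which the plural is written -es (assumed simplification).
ES_TRIGGERS = ("s", "z", "x", "ch", "sh")


def plural_allomorph(noun):
    """Return the written allomorph of the plural morpheme for a noun."""
    if noun in ZERO_PLURAL:
        return ""        # zero allomorph: sheep -> sheep
    if noun.endswith(ES_TRIGGERS):
        return "es"      # watch -> watches
    return "s"           # cat -> cats


def pluralize(noun):
    """Attach the selected allomorph to the noun."""
    return noun + plural_allomorph(noun)
```

For example, pluralize("church") gives "churches" and pluralize("sheep") gives "sheep". Irregular nouns such as mouse/mice are outside this sketch.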
Lexemes and Word-forms
Lemma/Lexeme: Base form of a word. Includes derived
forms. It is different from root.
Word-forms: The inflected forms of a word.
Eg: happy, unhappy, unhappiness, happiness, happier,
unhappier, happiest, unhappiest
Lexemes: happy, unhappy, unhappiness, happiness
Wordforms:
Of lexeme HAPPY: happy, happier, happiest
Of lexeme UNHAPPY: unhappy, unhappier, unhappiest
Analysing a word
Look at the word ACTORS
Three morphemes: act, -or, -s
Root: act
Suffixes: -or, -s
Derivational suffix: -or
Inflectional suffix: -s
Lexeme: ACTOR
Word-forms: actor, actors
Hierarchical Structure within Words
the word unlockable is ambiguous
[[un + lock] able]: able to be unlocked
[un [lock +able]]: not able-to-be-locked
Similar to this structure:
French History Teacher
Old Ladies Hostel
Old Bombay Highway
Disgraceful: [Adj [Noun [Prefix dis] [Noun grace]] [Suffix ful]]
Ungraceful: [Adj [Prefix un] [Adj [Noun grace] [Suffix ful]]]
Exercise 4
Give the hierarchical structure of the following words
unwanted
disfigurement
Interchangeable
united
actors
retries
unhappiness
KEY POINTS
Words and word structure
Root morphemes vs. Affix morphemes
Lexical morphemes
Free morphemes
Bound morphemes: Derivational morphemes
Functional morphemes
Free morphemes
Bound morphemes: Inflectional morphemes
Null/Zero morpheme
Morphological Analysis
Morphology = The study of word structure
Morphological Analysis = Analysing words
How do we do it?
Is the process opposite of Morph generation?
Why is it important?
Look at the following words and break
them up into smaller units
useless
copilot
psychology
sickness
communism
socialism
artist
dentist
curable
impossible
microstructure
macrostructure
departure
Did you do it this way?
Comparing with other words?
useless – endless, meaningless – “without”
copilot – co-author, co-editor – “with”
psychology – physiology, biology – “study of”
sickness – dizziness, happiness – “state of being”
communism – socialism, feminism – “belief, doctrine”
artist – activist, tourist – “one who”
impossible – impatient, immoral – “not”
microstructure – microscope, microbiology – “small”
macrostructure – macroeconomics, macro - “large”
What about coalition, departure, important, dentist?
Now do the same for this data:
Mtoto amefika
Mtoto anafika
Mtoto atafika
Watoto wamefika
Watoto wanafika
Watoto watafika
Mtu amelala
Mtu analala
Mtu atalala
Watu wamelala
Watu wanalala
Watu watalala
Here’s some clue!
Mtoto amefika - "The child has arrived."
Mtoto anafika - "The child is arriving."
Mtoto atafika - "The child will arrive."
Watoto wamefika - "The children have arrived."
Watoto wanafika - "The children are arriving."
Watoto watafika - "The children will arrive."
Mtu amelala - "The man has slept."
Mtu analala - "The man is sleeping."
Mtu atalala - "The man will sleep."
Watu wamelala - "The men have slept."
Watu wanalala - "The men are sleeping."
Watu watalala - "The men will sleep."
Did you do it this way?
Comparing with other words?
M-toto a-me-fika - "The child has arrived."
M-toto a-na-fika - "The child is arriving."
M-toto a-ta-fika - "The child will arrive."
Wa-toto wa-me-fika - "The children have arrived."
Wa-toto wa-na-fika - "The children are arriving."
Wa-toto wa-ta-fika - "The children will arrive."
M-tu a-me-lala - "The man has slept."
M-tu a-na-lala - "The man is sleeping."
M-tu a-ta-lala - "The man will sleep."
Wa-tu wa-me-lala - "The men have slept."
Wa-tu wa-na-lala - "The men are sleeping."
Wa-tu wa-ta-lala - "The men will sleep."
Exercise 5
What is the equivalent of…?
The child has slept.
The children are sleeping.
The men have arrived.
The man will arrive.
Exercise 5 (answers)
What is the equivalent of…?
The child has slept. - Mtoto amelala
The children are sleeping. - Watoto wanalala
The men have arrived. - Watu wamefika
The man will arrive. - Mtu atafika
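The regularity of this data can be sketched as simple concatenation of morphemes. The tables below are inferred only from the twelve sentences given; the English keys, the dictionary names and the function name are hypothetical, and the noun and the verb are written as separate words.

```python
# Morphemes inferred from the data set (keys are hypothetical English glosses).
NOUN_STEM = {"child": "toto", "man": "tu"}
NOUN_PREFIX = {"sg": "m", "pl": "wa"}        # number marking on the noun
SUBJECT_PREFIX = {"sg": "a", "pl": "wa"}     # agreement marking on the verb
TENSE = {"perfect": "me", "present": "na", "future": "ta"}
VERB_ROOT = {"arrive": "fika", "sleep": "lala"}


def build_sentence(noun, number, tense, verb):
    """Concatenate prefix + stem for the noun and prefix + tense + root for the verb."""
    subject = NOUN_PREFIX[number] + NOUN_STEM[noun]
    predicate = SUBJECT_PREFIX[number] + TENSE[tense] + VERB_ROOT[verb]
    return (subject + " " + predicate).capitalize()
```

For instance, build_sentence("child", "sg", "perfect", "sleep") yields "Mtoto amelala", matching the first answer of Exercise 5.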
Quite easy?!
Easy to identify morphemes if the language is regular and consistent.
Languages are of following types (but no language belongs to one type solely)
a. Isolating/Analytic (low morpheme-per-word ratio)
Eg: Chinese, Vietnamese, English
Context and syntax more important than morphology
b. Synthetic
Fusional (many features fused in one morpheme)
Eg: Most IE languages
Agglutinating (joining many morphemes together)
Eg: Finnish, Korean, Turkish, Japanese, Telugu
Isolating languages
Isolating languages do not (usually) have any bound morphemes
Eg: Mandarin Chinese
Gou bu ai chi qingcai (dog not like eat vegetable)
This can mean one of the following
(depending on the context)
The dog doesn’t like to eat vegetables
The dog didn’t like to eat vegetables
The dogs don’t like to eat vegetables
The dogs didn’t like to eat vegetables.
Dogs don’t like to eat vegetables.
Other Examples
English
nationalisation
Turkish
uygar+laş+tır+ama+dık+larımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
Telugu
pagalagoTTabOyADu
He was about to break it.
Hindi
jA_sakatA_hE
He/It can go.
Languages are not so regular….
English verbs
Eat – ate – eaten
Heat – heated – heated
Teach – taught – taught
Preach – preached – preached
Cry – cried - cried
English nouns
Box – boxes
Ox – oxen
House – houses
Mouse – mice
Child - children
Spelling changes occur at morpheme boundaries.
Cases of suppletion are often seen.
Vowel change also occurs.
There is always a default regular word formation.
Orthographic rules identified
Name | Description of rule | Example
Consonant doubling | 1-letter consonant doubled before -ing/-ed | beg/begging
E deletion | Silent e dropped before -ing and -ed | make/making
E insertion | e added after -s, -z, -x, -ch, -sh before -s | watch/watches
Y replacement | -y changes to -ie before -s, -i before -ed | try/tries
K insertion | Verbs ending with vowel + -c add -k | panic/panicked
Source: J&M (2000)
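The five spelling rules can be sketched as a single suffix-attachment function. This is a rough approximation under stated assumptions: "silent e" is approximated as a final e after a consonant, and consonant doubling uses a consonant-vowel-consonant heuristic that overapplies to longer stems such as "visit". The names attach and _is_cvc are hypothetical.

```python
VOWELS = "aeiou"


def _is_cvc(stem):
    """Heuristic: stem ends in consonant + vowel + one consonant (not w, x, y)."""
    return (
        len(stem) >= 3
        and stem[-1] not in VOWELS + "wxy"
        and stem[-2] in VOWELS
        and stem[-3] not in VOWELS
    )


def attach(stem, suffix):
    """Attach 's', 'ed' or 'ing' to a stem, applying the spelling rules."""
    if suffix not in ("s", "ed", "ing"):
        raise ValueError("unsupported suffix: " + suffix)
    prev_is_vowel = len(stem) >= 2 and stem[-2] in VOWELS
    prev_is_consonant = len(stem) >= 2 and stem[-2] not in VOWELS

    if suffix == "s":
        if stem.endswith(("s", "z", "x", "ch", "sh")):
            return stem + "es"                    # E insertion
        if stem.endswith("y") and prev_is_consonant:
            return stem[:-1] + "ies"              # Y replacement before -s
        return stem + "s"

    # From here on the suffix is 'ed' or 'ing'.
    if stem.endswith("c") and prev_is_vowel:
        return stem + "k" + suffix                # K insertion
    if stem.endswith("e") and prev_is_consonant:
        return stem[:-1] + suffix                 # E deletion
    if suffix == "ed" and stem.endswith("y") and prev_is_consonant:
        return stem[:-1] + "ied"                  # Y replacement before -ed
    if _is_cvc(stem):
        return stem + stem[-1] + suffix           # Consonant doubling
    return stem + suffix
```

The function applies spelling rules only; it knows nothing about irregular verbs, so attach("make", "ed") mechanically produces "maked" rather than "made".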
Suppletion
• Replacement of the entire wordform.
• Example: am, is, are, was, were – variants of ‘be’
ate – past tense of ‘eat’
In Hindi: vaha jA rahA hE (“He is going.”).
vaha gayA (“He went”)
In Telugu:
tinnADA? (“Has he eaten?”). tinalEdu (“(He) Has not eaten.”)
vaccADA? (“Has he come?”). rAlEdu (“(He) Has not come.”)
Assimilation
Change of one sound into another because of the influence of neighbouring sounds.
Example: peN + kaL → peNgaL
manishi + lu → manushulu (“vowel harmony”)
Reduplication
Some or all of a word is duplicated to mark a
morphological process
Indonesian
orang (man) ⇒ orangorang (men)
Bambara
wulu (dog) ⇒ wuluowulu (whichever dog)
Turkish
mavi (blue) ⇒ masmavi (very blue)
kırmızı (red) ⇒ kıpkırmızı (very red)
koşa koşa (by running)
Let’s explore your language
Lexical, Free morphemes
Avu {“cow”}, pustakam {“book”}, pilla {“girl”}
Lexical, Bound morphemes
a-, su- {adharmam “in-justice”, suhAsini “good-smile”}
Functional, Free morphemes
Ame {“she”}, mariyu {“and”}, kAni {“but”}
Functional, Bound morphemes
-lo, -ki {inTilO “in-house”, chetiki “to-hand”}
What else is happening?
Is there a null morpheme in your language?
pilli pAlu tAgiMdi {“pilli” in nominative case}
“The cat drank milk”
pilli tOka guburugA uMdi {“pilli” in genitive case}
“The cat’s tail is bushy”
Some changes in lexemes
In Telugu:
illu → inTi before locative case marker “lO”, “ki”
ceyyi → cEti before locative case marker “lO”, “ki”
In Hindi:
laDakA → laDake before ergative case marker “ne”
laDake → laDakOM before the ergative case marker “ne”
What about causative and other
constructions? Observe the words.
Exercise 6:
Translate in the language you speak at home.
Ram killed Ravan. Ravan died.
The baby ate. The nanny fed him.
Sita made the nanny feed the baby.
Ravi broke the vase. The vase broke.
Anil broke the lock open.
Ravish axed the goat to death.
Bina nailed the picture on the wall.
In Kashmiri
Number (singular, plural)
The plural form of a noun can be obtained by adding a suffix, by changing a vowel, or without any change.
Example:
Type | Singular | Plural
Suffix addition | kangir “fire pot” | kangri
Suffix addition | laej “pot” | laeji
Vowel change | kuTh “room” | kueTh
Vowel change | gagur “rat” | gagar
Invariant | kAv “crow” | kAv
Invariant | kan “ear” | kan
In Kashmiri
Gender (masculine, feminine)
Example:
Type | Masculine | Feminine
Different roots for each gender | ladake “boy” | kUr “girl”
Different roots for each gender | dAnd “ox” | gAv “cow”
Same root but changes not regular | batuk “duck” | batich “fem duck”
Same root but changes not regular | tsAvul “goat” | tsAvij “fem goat”
Same root with regular changes | dAndur “greengrocer” | dAndren
Same root with regular changes | kAndur “baker” | kAndren
Let’s look at Telugu verbs
Agglutinative language
Many suffixes with different features concatenating
Derived words from pagili:
pagiliMdi – pagilipOyiMdi – pagalabOyiMdi –
pagalledu - pagalagoTTADu –
pagalagoTTabOyADu
Morph analysis = Identifying morphemes and
analysing them
boys → boy + s → boy +noun +plural marker
Root = boy
POS = noun
Number = plural
Avulu → Avu +noun +plural
pustakAlu → pustakam +noun +plural
tinnADu → tinu + A + Du → tinu +verb +past +masc.sg
cEsADu → ceyyi + A + Du → ceyyi +verb +past +masc.sg
OdikoMDiddavana -> the one (masculine) who was reading
Odu + i + koLLu +MD+ u + iru + dd + a + avanu + a
Root + VBP + AUXV + PST + VBP + AUXV + PST + RP + PRON-3SM + ACC
Exercise:
Analyse every word you used in your text
Morphological Analysis and
Generation
Bidirectional
Example:
Ate ↔ eat + verb + past
Surface level: tinnADu “he ate”
|
Intermediary level: tinn + A + Du
|
Lexical level: tinu + verb + past + 3rd per. sg. masc.
Morphological Analyzers
They are tools to automatically decompose a word
into its root and affixes and give related features.
Example:
1st stage – identifying morphemes
ate: root = eat
suffix = ed
2nd stage – analyzing morphemes
ate: root = eat
tense = past
Morph Analyser
Language: Hindi
Input (word): ladake
Output (analysis):
a) root=ladakA, cat=n, gen=m, num=sg, case=obl
b) root=ladakA, cat=n, gen=m, num=pl, case=dir
Morph Generator
Language: Hindi
Input (analysis):
a) root=ladakA, cat=n, gen=m, num=sg, case=obl
b) root=ladakA, cat=n, gen=m, num=pl, case=dir
Output (word): ladake
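The analyser and generator examples are two directions of the same mapping. A minimal sketch, assuming the mapping is stored as a small table of (word, analysis) pairs; the table holds only the two Hindi entries shown, and the names TABLE, analyze and generate are hypothetical.

```python
from collections import defaultdict

# Each entry: (surface word, (root, cat, gen, num, case)).
TABLE = [
    ("ladake", ("ladakA", "n", "m", "sg", "obl")),
    ("ladake", ("ladakA", "n", "m", "pl", "dir")),
]

ANALYSES = defaultdict(list)   # word -> list of analyses
SURFACE = {}                   # analysis -> word
for word, features in TABLE:
    ANALYSES[word].append(features)
    SURFACE[features] = word


def analyze(word):
    """Analysis direction: one word may have several analyses."""
    return list(ANALYSES.get(word, []))


def generate(features):
    """Generation direction: one analysis maps to one word (None if unknown)."""
    return SURFACE.get(features)
```

Analysis of "ladake" returns both feature bundles, while generation from either bundle returns the single form "ladake". A table lookup like this does not scale to a full lexicon; it only illustrates the bidirectional relation.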
Machine learning of
morphological rules
Supervised approach
requires training data and rules are extracted from the training data.
Rules = orthographic, suppletion, assimilation, vowel harmony
The morphological analyzers can be built by making use of lexical database with morphological information for building rules.
A good example of such lexical database
is CELEX that contains information about lemma, wordform, abbreviation, corpus tagging etc
Machine Translation
Pos tagger gives only part of speech. More
information is needed to translate a word correctly.
More information like tense, aspect and mood of
the verbs, gender, number and person of the nouns.
Example: [Eng → Hindi translation]
ENGLISH: She went home.
HINDI: vaha ghar gayi.
ENGLISH: He went home.
HINDI: vaha ghar gayaa.
The gender of the pronoun is essential for the translation in Hindi.
The morph analyzer will give the gender information.
Example: [Hindi → Eng translation]
In Hindi ‘vaha’ can have different senses – ‘he’, ‘she’ or
‘that’.
“vaha ghar gayaa”
If we were to translate this, then the extra information on
the verb will help us to translate the above sentence
correctly as
“He went home”
The ‘yaa’ indicates past tense as well as singular
number and masculine gender.
The morph analyzer will give this information.
Information Retrieval
Stemming: A process of reducing an inflected or derived word to its stem. Makes search effective.
Example:
Wordforms = play, plays, played, playing → PLAY
Wordforms = child, children, childish → CHILD
Stemmers are useful but have limitations.
Eg. marketing → market
speaker → speak
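A naive suffix-stripping stemmer shows both the usefulness and the limitations. This sketch is not any published stemming algorithm; the suffix list, the minimum stem length of three characters and the function name are assumptions made for illustration.

```python
# Suffixes tried in order; the first match is stripped (assumed list).
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

The forms plays, played and playing all reduce to "play", which helps search. But "marketing" is reduced to "market" and "speaker" to "speak" even when the meanings differ, and "children" is not related to "child" at all because no suffix in the list applies.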
Speech Processing
In Text to Speech tools too, a Morph Analyzer is essential along with a Part of Speech tagger.
With extra information on the words, the efficiency increases.
The intonation, the pauses, the stress, etc. can be close to the way humans speak.
This additional information is given by morph analyzers.
(POS taggers help in pronouncing ambiguous words the right way: wind, wound, minute, etc.)
Linguistic models for describing morphology
Item and Arrangement (morpheme based)
Item and Process (lexeme based)
Word and Paradigm (word based)
Requirement for building paradigm based
Morph Analyzers
Knowledge of Lexeme and Word forms
Root and Affix dictionaries
Paradigm Table
Paradigm Class
The lexemes are stored in the dictionaries and the
word forms as paradigms.
Lexeme and Word form
APPLE: apple, apples
CHURCH: church, churches
BOY: boy, boys
WATCH: watch, watches
SPY: spy, spies
The word in upper case is called LEXEME and the inflected forms are WORD FORMS.
Lexemes are the headwords in a dictionary.
Lexeme and Word form
Another example:
played is a word form of the lexeme PLAY
plays is a word form of the lexeme PLAY(1)
plays is a word form of the lexeme PLAY(2)
where PLAY(1) is a verb and PLAY(2) is a noun.
PLAY(1) and PLAY(2) are two different lexemes.
Exercise 7
Give the lexeme of the following word forms:
ate
played
manufactured
glasses
players
bites
Exercise 8
“manufactured” can be a verb in past tense or an adjective. So it belongs to two different lexemes – MANUFACTURE and MANUFACTURED.
Which of the following words belong to more than one lexeme?
ate, wanted, wrote, written, finished
Root and Affix dictionaries
Root dictionary contains a list of roots or the base forms - the lexemes.
It is stored usually with its part of speech.
Affix dictionary contains a list of all the affixes in a language.
The features of the affixes are stored here.
The features are stored as attribute value pairs.
Example entries in a dictionary
Root dictionary
eat <root=‘eat’, category=‘verb’>
book <root=‘book’, category=‘verb’>
book <root=‘book’, category=‘noun’>
Affix dictionary
+s <tense = ‘present’>
+ed <tense = ‘past’>
+en <aspect = ‘perfective’>
+ing <aspect = ‘progressive’>
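The two dictionaries can be sketched as Python dictionaries of attribute-value pairs, with a toy analyser that strips a suffix and looks up the remainder. The entries mirror the examples above; restricting the affixes to verb roots and the function name analyze are assumptions, and no spelling rules are applied.

```python
# Root dictionary: a root may have several entries (e.g. 'book').
ROOTS = {
    "eat": [{"root": "eat", "category": "verb"}],
    "book": [
        {"root": "book", "category": "verb"},
        {"root": "book", "category": "noun"},
    ],
}

# Affix dictionary: features stored as attribute-value pairs.
AFFIXES = {
    "s": {"tense": "present"},
    "ed": {"tense": "past"},
    "en": {"aspect": "perfective"},
    "ing": {"aspect": "progressive"},
}


def analyze(word):
    """Return all analyses obtained from the two dictionaries."""
    results = [dict(entry) for entry in ROOTS.get(word, [])]
    for affix, features in AFFIXES.items():
        if not word.endswith(affix):
            continue
        stem = word[: -len(affix)]
        for entry in ROOTS.get(stem, []):
            # Assumption: the affixes listed here attach to verbs only.
            if entry["category"] == "verb":
                results.append({**entry, **features})
    return results
```

For example, analyze("eating") returns the verb entry for "eat" merged with the progressive aspect feature, and analyze("book") returns both the verb and the noun entry.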
Paradigm table
A paradigm table represents the inflected forms of a
particular lexeme.
It includes the conjugation of verbs and declensions of
nouns, adjectives, pronouns etc.
Example:
APPLE: apple, apples
EAT: eat, eats, ate, eaten, eating
SMART: smart, smarter, smartest
Conjugation of English verbs
play plays played played playing
eat eats ate eaten eating
look looks looked looked looking
dance dances danced danced dancing
push pushes pushed pushed pushing
Declension of English nouns
apple, apples
boy, boys
church, churches
watch, watches
spy, spies
Declension of English adjectives
• smart, smarter, smartest
• tall, taller, tallest
Paradigm Class
A paradigm class contains the classes of lexemes i.e.
the prototypical root and all the roots that fall in its
class including the given root.
Those words which decline or conjugate in exactly
the same way, fall into one paradigm class.
The English verbs ‘PLAY’ and ‘LOOK’ have the following paradigm:
play plays played played playing
look looks looked looked looking
So they belong to the same class.
But ‘PUSH’ since it differs in its present tense form i.e. it has ‘-es’ and not ‘- s’ falls in another class. Its paradigm is as follows:
push pushes pushed pushed pushing
The English nouns ‘PLAY’ and ‘BOY’ have the following
paradigm:
play plays
boy boys
So they belong to the same class.
But ‘SPY’ falls in another class. Its paradigm is as follows:
spy spies
Paradigm class is represented by one member of the
class.
eat V eat
play V play, talk, walk, train
push V push, fish
play N play, boy, day
spy N spy, sky
church N church, watch
Exercise 10
Which of the following verbs belong to the same paradigm class?
mince ride walk speak
shake play dance take
Which of the following nouns belong to the same paradigm class?
girl house dish book
mouse beach flower pencil
Identify paradigm classes in your own language.
Avu = {Avu, bassu, bomma, chekka}
Add-Delete Rules for Generation and Analysis
of words
•Most of the morphological analyzers handle only
inflectional morphology.
•The rules given here are for such inflectional
processes only.
•An add-delete rule would look like: [n1, n2, xyz]
where
n1 = number of characters to delete from the end
n2 = number of characters to add at the end
xyz = the characters to add
Rules to generate wordforms in a paradigm table for a
given paradigm class.
Eg. play Eg. eat
play[0,0,0] = play eat[0,0,0] = eat
play[0,1,s] = plays eat[0,1,s] = eats
play[0,2,ed] = played eat[3,3,ate] = ate
play[0,2,ed] = played eat[0,2,en] = eaten
play[0,3,ing] = playing eat[0,3,ing] = eating
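The generation rules can be sketched as follows. Each rule is a tuple (n1, n2, xyz) as defined above, with the empty string standing in for the 0 that the slides write when nothing is added. The names apply_rule, PARADIGM_RULES and paradigm_table are hypothetical.

```python
def apply_rule(word, rule):
    """Delete n1 characters from the end of word, then add the n2 characters xyz."""
    n_delete, n_add, chars = rule
    if len(chars) != n_add:
        raise ValueError("n2 must equal the number of characters to add")
    if n_delete > len(word):
        raise ValueError("cannot delete more characters than the word has")
    return word[: len(word) - n_delete] + chars


# One rule list per paradigm class, named after its representative member.
PARADIGM_RULES = {
    "play": [(0, 0, ""), (0, 1, "s"), (0, 2, "ed"), (0, 2, "ed"), (0, 3, "ing")],
    "eat": [(0, 0, ""), (0, 1, "s"), (3, 3, "ate"), (0, 2, "en"), (0, 3, "ing")],
}


def paradigm_table(root, paradigm_class):
    """Generate all word forms of a root using the rules of its paradigm class."""
    return [apply_rule(root, rule) for rule in PARADIGM_RULES[paradigm_class]]
```

Because 'look' belongs to the class of 'play', paradigm_table("look", "play") produces look, looks, looked, looked, looking without any rule written specifically for 'look'.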
Exercise 11
Write similar rules to generate paradigm tables of the
verbs ‘dance’, ‘cry’, ‘sleep’, ‘drink’ ‘write’ and nouns ‘book’,
‘mouse’, ‘church’, ‘sheep’.
In the same way, these rules can be used to find the root of an
inflected word.
For example, the root of ‘playing’ is ‘play’ – we get it by deleting 3
characters and adding nothing.
playing [3,0,0] = play eating [3,0,0] = eat
played [2,0,0] = play eaten [2,0,0] = eat
played [2,0,0] = play ate [3,3,eat] = eat
plays [1,0,0] = play eats [1,0,0] = eat
play [0,0,0] = play eat [0,0,0] = eat
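Applying the reverse rules blindly can overgenerate: the rule [3,3,eat] turns any three-letter word into 'eat'. The sketch below therefore adds a check that is an assumption of this example, not something the rules above specify: a candidate root is accepted only if it is in the root dictionary and its paradigm really contains the input word. All names are hypothetical.

```python
def apply_rule(word, rule):
    """Delete n1 characters from the end of word, then add the characters xyz."""
    n_delete, _n_add, chars = rule
    return word[: len(word) - n_delete] + chars


GENERATION_RULES = {
    "play": [(0, 0, ""), (0, 1, "s"), (0, 2, "ed"), (0, 2, "ed"), (0, 3, "ing")],
    "eat": [(0, 0, ""), (0, 1, "s"), (3, 3, "ate"), (0, 2, "en"), (0, 3, "ing")],
}

# Reverse rules taken from the analysis examples above.
ANALYSIS_RULES = [(0, 0, ""), (1, 0, ""), (2, 0, ""), (3, 0, ""), (3, 3, "eat")]

# Root dictionary: root -> paradigm class (small assumed sample).
ROOT_CLASS = {"play": "play", "look": "play", "eat": "eat"}


def find_roots(word):
    """Return the roots whose paradigm table contains the given word form."""
    roots = []
    for rule in ANALYSIS_RULES:
        if rule[0] > len(word):
            continue
        candidate = apply_rule(word, rule)
        paradigm_class = ROOT_CLASS.get(candidate)
        if paradigm_class is None or candidate in roots:
            continue
        forms = [apply_rule(candidate, r) for r in GENERATION_RULES[paradigm_class]]
        if word in forms:          # verification by generation
            roots.append(candidate)
    return roots
```

With this check, find_roots("ate") returns ["eat"], while find_roots("cat") returns an empty list even though the rule [3,3,eat] would map it to 'eat'.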
Exercise 12
Write similar rules to find the root of the verbs ‘kept’, ‘cried’, ‘sat’,
‘stood’, ‘written’ and nouns ‘mice’, ‘legs’, ‘spies’, ‘uncles’, ‘houses’.
Hindi Morph Analyzer demo
http://sampark.iiit.ac.in/hindimorph/
Note : Features explanation of a word
1. root : Root of the word (e.g. ladZake word has root ladZakA)
2. cat : Category of the word (e.g. Noun=n, Pronoun=pn, Adjective=adj, verb=v,
adverb=adv post-position=psp and avvya=avy)
3. gen : Gender of the word (e.g. Masculine=m, Feminine=f, Neuter=n, mf , mn , fn,
and any)
4. num : Number of the word (e.g. Singular=sg, Plural=pl, dual, and any )
5. per : Person of the word (e.g. 1st Person=1, 2nd Person=2, 3rd Person=3, and any)
6. case : Case of the word (e.g. direct=d, oblique=o and any)
7. tam : Case marker for noun or Tense Aspect Mood(TAM) for verb of the word
8. suff : Suffix of the word
agr_gen: Agreeing gender of the following noun
agr_num: Agreeing number of the following noun
agr_cas: Agreeing case of the following noun
emph: Emphatic feature of the word
hon: Honorific feature of the word
derive_root: Derived root
derive_lcat: Derived lexical category
derive_gen: Derived gender
derive_suff: Derived suffix
Finite State Automaton/Transducers
Computational model to handle the morphology analysis/generation
Used in morphology, phonology, TTS, data mining.
Popular, since mathematically extremely well understood.
Easy to implement.
Many off-the-shelf tools available for direct use.
Optimisation: finding a machine with the minimum number of states that performs the same function
Simple extensions to Markov Models useful for POS tagging
Finite State Automaton
A Finite State Automaton is a machine composed of:
An input tape
A finite number of states, with one initial and one or more accepting states
Actions in terms of transitions from one state to the other, depending on
the current state and the input.
The input tape has a sequence of symbols written on it. The machine starts in the initial state. The reader reads one symbol at a time and, depending on the current state and the input symbol, a transition to another state takes place.
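A deterministic finite state automaton of this kind can be simulated in a few lines. The toy machine below, which accepts the strings "cat" and "cats" one letter at a time, is an invented illustration; the state numbers and names are assumptions.

```python
def accepts(transitions, start, accepting, symbols):
    """Run a deterministic FSA over the input symbols and report acceptance."""
    state = start
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False           # no transition defined: reject
        state = transitions[key]
    return state in accepting      # accept only if we stop in an accepting state


# Toy automaton: states 0..4, reading one letter per transition.
CAT_TRANSITIONS = {
    (0, "c"): 1,
    (1, "a"): 2,
    (2, "t"): 3,
    (3, "s"): 4,
}
CAT_ACCEPTING = {3, 4}
```

Here accepts(CAT_TRANSITIONS, 0, CAT_ACCEPTING, "cats") is True, while the inputs "ca" and "catss" are rejected: the first stops in a non-accepting state and the second has no transition for the extra symbol.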
Morphotactic Models
English nominal inflection
[Figure: FSA with states q0, q1, q2; arcs labeled reg-n, plural (-s), irreg-sg-n, irreg-pl-n]
•Inputs: cats, goose, geese
•reg-n: regular noun
•irreg-pl-n: irregular plural noun
•irreg-sg-n: irregular singular noun
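A morphotactic model of this kind can be sketched as an automaton whose arcs are labeled with morpheme classes rather than letters. The arc layout below is an assumption (a regular noun may be followed by plural -s; irregular singular and plural nouns lead directly to the final state), and the tiny lexicon and all names are hypothetical.

```python
# Assumed sample lexicon for each morpheme class.
LEXICON = {
    "reg-n": {"cat", "dog"},
    "irreg-sg-n": {"goose", "mouse"},
    "irreg-pl-n": {"geese", "mice"},
}

# Assumed arcs: (state, morpheme class) -> next state.
ARCS = {
    ("q0", "reg-n"): "q1",
    ("q1", "plural"): "q2",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
}
ACCEPTING = {"q1", "q2"}


def segmentations(word):
    """Yield candidate sequences of morpheme classes for a word."""
    for morpheme_class, stems in LEXICON.items():
        if word in stems:
            yield [morpheme_class]
        if word.endswith("s") and word[:-1] in stems:
            yield [morpheme_class, "plural"]


def recognize(word):
    """Accept the word if some segmentation is a valid path through the FSA."""
    for labels in segmentations(word):
        state = "q0"
        for label in labels:
            state = ARCS.get((state, label))
            if state is None:
                break
        else:
            if state in ACCEPTING:
                return True
    return False
```

The inputs cats, goose and geese are accepted, while "gooses" is rejected because no plural arc leaves the state reached after an irregular singular noun.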
Derivational morphology: adjective fragment
[Figure: FSA with states q0 to q5; arcs labeled un-, adj-root1, adj-root2; suffix arcs -er, -ly, -est after adj-root1 and -er, -est after adj-root2]
• Adj-root1: clear, happy, real
• Adj-root2: big, red
Parsing with Finite State Transducers
cats → cat +N +PL
Kimmo Koskenniemi’s two-level morphology
Words represented as correspondences between lexical
level (the morphemes) and surface level (the
orthographic word)
Morphological parsing :building mappings between the
lexical and surface levels
c a t +N +PL
c a t s
Finite State Transducers
FSTs map between one set of symbols and another
using an FSA whose alphabet is composed of
pairs of symbols from input and output alphabets
In general, FSTs can be used for
Translator (Hello: vanakkam)
Parser/generator (Hello:How may I help you?)
To map between the lexical and surface levels of
Kimmo’s 2-level morphology
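The idea of an alphabet made of symbol pairs can be sketched without building states at all, by listing each complete path as a sequence of (lexical, surface) pairs. This is a simplification for illustration, not how FST toolkits work internally; the empty string stands for an empty surface symbol, and the three paths and all names are assumptions.

```python
EPS = ""  # empty surface symbol

# Each path is a list of (lexical symbol, surface symbol) pairs.
PATHS = [
    [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+SG", EPS)],
    [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")],
    [("g", "g"), ("o", "e"), ("o", "e"), ("s", "s"), ("e", "e"),
     ("+N", EPS), ("+PL", EPS)],
]


def generate(lexical):
    """Lexical level to surface level, e.g. c a t +N +PL -> cats."""
    for path in PATHS:
        if [lex for lex, _ in path] == list(lexical):
            return "".join(surf for _, surf in path)
    return None


def parse(surface):
    """Surface level to lexical level; returns every matching lexical sequence."""
    results = []
    for path in PATHS:
        if "".join(surf for _, surf in path) == surface:
            results.append([lex for lex, _ in path])
    return results
```

The same table serves both directions: generate(["c", "a", "t", "+N", "+PL"]) gives "cats", and parse("geese") recovers g o o s e +N +PL through the o:e pairs.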
FST for a 2-level Lexicon
Example
Reg-n: c a t
Irreg-pl-n: g o:e o:e s e
Irreg-sg-n: g o o s e
[Figure: FSTs for these lexicon entries, with one arc per symbol or symbol pair]
FST for English Nominal Inflection
[Figure: FST with states q0 to q7; arcs from q0 labeled reg-n, irreg-n-sg and irreg-n-pl; remaining arc labels: +N:, +SG:-#, +PL:-s#, +PL:^s#]
Combining (cascade or composition) this FSA with FSAs for each noun type replaces e.g. reg-n with every regular noun representation in the lexicon
FST for Telugu verb
Words covered: tinnADu, tinaledu; vinnADu, vinaledu; cEsADu, cEyaledu
[Figure: FST with states q0 to q8; arc labels: tin: tinu, verb; cEs: ceyyi, verb; cEy: ceyyi, verb; A: past; a; lEdu: neg; Du: sg, masc, 3p]
Issues – mainly due to ambiguity
When shot at, the dove dove into the bushes.
The insurance was invalid for the invalid.
They were too close to the door to close it.
The buck does funny things when the does are present.
Upon seeing the tear in the painting I shed a tear.
After seeing the wound, the doctor wound his watch.
Such words with multiple analyses affect MT and other NLP applications.
Examples
veyyi = verb or noun?
“put”, “thousand”
perugu = verb or noun?
“yogurt”, “grow”
nAku = verb or pronoun in k4
“to lick”, “for/to me”
(icecream) nAkivvu → nAku + ivvu (“Give it to me”)
→ nAki + ivvu (“Lick and then give”)
Tools
Several off-the-shelf tools available for FST which
support Word and Paradigm model
1. XFST (Xerox Finite State Tool)
2. SFST (Stuttgart Finite State Transducer Tools)
3. Lttoolbox (part of Apertium)
To wind up…
Knowledge of words and word structure is important.
A good linguistic analysis of a language will help in formulating rules of word formation, paradigm table and paradigm classes
Morph generation is the reverse process of morph analysis.
A thorough study of one’s language at word level will help build a morphological analyser (MA) using any approach.
Phonological changes at morpheme boundaries and agglutinative and fusional types of suffixing make analysis more challenging.
Selecting the right analysis when a wordform has more than one analysis is another challenging task, especially in Machine Translation systems.
References
Akshar Bharati, Vineet Chaitanya, Rajeev Sangal. 1995. Chapter 3: Words and Their Analyzer. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.
Akshar Bharati, Rajeev Sangal, Dipti M Sharma, Radhika Mamidi. 2004. Generic Morphological Analysis Shell. In Proceedings of LREC. Lisbon, Portugal. 24th-30th May 2004.
Dan Jurafsky and J.H. Martin. 2000. Speech and Language Processing. Prentice-Hall.
Monis Khan, Radhika Mamidi, Umar Hamid Shah. 2005. Morphological Analyzer for Kashmiri. LSI.
Amba Kulkarni, G Uma Maheshwar Rao. Building Morphological Analyzers and Generators for Indian Languages using FST. Tutorial for ICON 2010.
Viswanathan S et al. 2003. A Tamil Morphological Analyser. Recent Advances in NLP. 31-39. Mysore: Central Institute of Indian Languages.
Parameshwari K. 2011. An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil. Language in India. 11: 5.
R. Veerappan, P J Antony, S Saravanan, K P Soman. 2011. A Rule based Kannada Morphological Analyzer and Generator using Finite State Transducer. International Journal of Computer Applications (0975 – 8887) Volume 27 – No.10, August 2011.