Abstract
• Identification of reduplication is a subtask of multi-word expression (MWE) identification.
• Reduplication is a very productive process at both the grammatical and the semantic level in Bengali.
• Here, reduplications are identified in a Bengali corpus of writings by the noted Indian Nobel laureate Rabindranath Tagore.
• A rule-based approach is used, consisting of two phases: identification of reduplication and semantic analysis.
What is Reduplication?
• Repetition of any linguistic unit such as a phoneme, morpheme, word, phrase, clause, or the utterance as a whole.
Example in English: ha-ha, blah-blah.
Example in Bengali: abal-tabal (incoherent).
• Bengali is the richest Indian language in the onomatopoeic and ideophonic category of reduplication, with 2,400 such words (Chaudhuri et al., 2005).
• Reduplication carries various semantic meanings and helps to identify the mental state of the speaker.
• Two coarse-grained categories:
(a) repetition at the expression level.
(b) repetition at the content or semantic (sense) level.
General Classification
• Onomatopoeic expression: khat khat (knock knock)
• Complete reduplication: bara-bara (big big)
• Partial reduplication: thakur-thukur (God)
• Semantic reduplication:
  Synonym: matha-mundu (head)
  Antonym: din-rat (day and night)
  Class representative: cha-paani (snacks)
• Correlative reduplication: maramari (fighting)
Expression Level Classification
• Non-sound symbolic words
  • Nouns and pronouns: bari bari (from one house to another)
  • Adjectives: lal lal phul (red flowers)
  • Verbs: bolte bolte (speaking) [mandatory]; bhebe chinte (thinking) [optional]
  • Adverbs: dhere dhere (slowly)
• Sound words: chal chal (sound of water falling)
Sense Level Reduplication
• Sense of repetition: bachar bachar (every year)
• Sense of plurality: bara bara bari (many big houses)
• Sense of emphatic meaning: lal-lal phul (deep red flowers)
• Sense of completion: kheye deye jabo (after eating)
• Sense of hesitation or softness: hasi-hasi mukh (laughing face)
• Sense of incompleteness of the verb: kotha bolte bolte (while talking)
• Sense of corresponding correlative words: maramari (fighting)
• Sense of onomatopoeia: khat khat (knock knock)
System Design
Phase 1: Identifying Reduplications
Identify five main cases of reduplication: onomatopoeic, complete, partial, semantic, and correlative.
Phase 2: Semantic Analysis
Extract the meaning associated with each reduplication, that is, its sense-level category.
Phase 1: System Architecture
[Figure: block diagram with the components Bengali Corpus, Tokenizer, Rule-based Identifier, Classifier, Set of Inflections, and Dictionary]
Components of the Architecture
Corpus
Writings (novels, stories, dramas) of Rabindranath Tagore [http://www.rabindra-rachanabali.nltr.org]
Tokenizer
Separates words on blank space or special symbols (such as the hyphen or exclamation mark) to identify two consecutive words.
Rule-based Identifier
Consecutive tokens are passed to it to verify, using different algorithms, whether or not they are reduplicated words.
Classifier
Classifies reduplications at the expression level.
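The tokenizer and its hand-off of consecutive tokens can be sketched roughly as follows. This is only an illustration: the exact separator set, the function names `tokenize` and `consecutive_pairs`, and the use of romanized input are assumptions, not details given on the slides.

```python
import re

# ASSUMPTION: separators are whitespace plus a few special symbols
# (hyphen, exclamation mark, common punctuation, Bengali danda).
SEPARATORS = re.compile(r"[\s\-!,;:?.।]+")


def tokenize(text):
    """Split text on the separators and drop empty strings."""
    return [token for token in SEPARATORS.split(text) if token]


def consecutive_pairs(tokens):
    """Return every pair of adjacent tokens, the unit checked by the identifier."""
    return list(zip(tokens, tokens[1:]))


tokens = tokenize("din-rat khat khat!")
pairs = consecutive_pairs(tokens)
```

Each pair in `pairs` would then be handed to the rule-based identifier; pairs that are not reduplications (here, `("rat", "khat")`) are simply rejected by its rules.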
Dictionary
It includes the lexicon and the associated semantics. The system uses both Bengali-to-Bengali (monolingual) and Bengali-to-English (bilingual) dictionaries.
Set of inflections
[List of Bengali inflectional suffixes; the Bengali script is not recoverable here.]
Brief Classification Algorithms
• Complete: the two words are compared for exact equality.
• Partial: three cases: (i) the first vowel (matra) attached to the first consonant changes, (ii) the consonant in first position itself changes, or (iii) both the matra and the consonant change.
  Exception: abal-tabal (incoherent). [Solution: only four specific consonants can be produced by the change (S. K. Chattopadhyay, 1992).]
• Onomatopoeic: after the inflection is removed, the word is divided into two equal halves and the halves are compared.
• Correlative: the formative affixes -a and -i are added to the root to form the first and second words respectively, and the two are agglutinated.
• Semantic: a dictionary-based approach using the set of inflections mentioned above.
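A minimal sketch of the complete, partial, onomatopoeic, and correlative checks listed above. It is not the authors' implementation: it assumes romanized words as transliterated on the slides (the original system operates on Bengali script and matras), the suffix list `INFLECTIONS` is a hypothetical stand-in for the real inflection set, and all function names are illustrative. The dictionary-based semantic check is omitted.

```python
import re

# ASSUMPTION: romanized input; vowels stand in for Bengali matras.
INFLECTIONS = ("e", "er", "ke")  # hypothetical stand-in for the inflection set
_HEAD = re.compile(r"^([^aeiou]*)([aeiou]*)(.*)$")


def strip_inflection(word):
    """Drop the longest listed suffix, keeping a non-empty stem."""
    for suffix in sorted(INFLECTIONS, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def is_complete(first, second):
    """Complete reduplication: the two words are exactly equal."""
    return first == second


def is_partial(first, second):
    """Same tail; only the initial consonant and/or the first vowel differ."""
    onset1, vowel1, rest1 = _HEAD.match(first).groups()
    onset2, vowel2, rest2 = _HEAD.match(second).groups()
    return bool(rest1) and rest1 == rest2 and (onset1, vowel1) != (onset2, vowel2)


def is_onomatopoeic(word):
    """Closed form: strip the inflection, halve the stem, compare the halves."""
    stem = strip_inflection(word)
    half, odd = divmod(len(stem), 2)
    return half > 0 and odd == 0 and stem[:half] == stem[half:]


def is_correlative(word):
    """Closed form built as root + 'a' followed by root + 'i'."""
    half, odd = divmod(len(word), 2)
    first, second = word[:half], word[half:]
    return (odd == 0 and len(first) > 1 and first.endswith("a")
            and second.endswith("i") and first[:-1] == second[:-1])
```

Under these assumptions, `is_partial("thakur", "thukur")` and `is_correlative("maramari")` both hold, while `is_partial("din", "rat")` does not. The abal-tabal exception also passes `is_partial` here because an empty initial consonant is treated as a changed consonant; the slide's restriction to a fixed set of resulting consonants is not modeled.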
Phase 2: Semantic (Sense) Analysis
• Correspondence between general and sense-level reduplications:
Reduplication          Semantics (sense)
Onomatopoeic           onomatopoeia
Semantic or partial    completion
Correlative            corresponding correlative words
Complete               repetition / hesitation, softness
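The correspondence above amounts to a lookup from the Phase 1 class to one or more candidate senses. A small sketch, with the dictionary name and helper functions chosen here purely for illustration:

```python
# Candidate senses per Phase 1 class, transcribed from the table above.
SENSE_BY_CLASS = {
    "onomatopoeic": ["onomatopoeia"],
    "semantic": ["completion"],
    "partial": ["completion"],
    "correlative": ["corresponding correlative words"],
    "complete": ["repetition", "hesitation or softness"],
}


def candidate_senses(reduplication_class):
    """Return the senses associated with a class, or an empty list if unknown."""
    return SENSE_BY_CLASS.get(reduplication_class.lower(), [])


def needs_disambiguation(reduplication_class):
    """A class with more than one candidate sense needs context to resolve."""
    return len(candidate_senses(reduplication_class)) > 1
```

Only the complete class maps to more than one sense, which is why it is the one that requires context.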
Problem for sense disambiguation of complete reduplication: multiple senses are possible depending on the context.
• The system identifies some related words such as kara (to do), bhaba (to think), mato (like), and laga (to feel) for disambiguation.
• These are not enough to disambiguate the sense of the phrase.
Experimental Results
• The collected corpus contains 14,810 tokens, covering 3,675 distinct word forms at the root level.
• Metrics:
  • IR metrics: precision, recall, F-score.
  • Frequency of each class.
  • Counts of hyphenated and closed forms.
• Evaluation:
Reduplication   Precision   Recall   F-score
Onomatopoeic    99.85       99.77    99.79
Complete        99.98       99.92    99.95
Partial         79.15       75.80    77.44
Semantic        85.20       82.26    83.71
Correlative     99.91       99.73    99.82
System          92.82       91.50    92.15
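Assuming the F-score reported here is the balanced F1 measure (the harmonic mean of precision and recall), which the slides do not state explicitly, it can be computed as in this short sketch. The function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (same units as inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Partial row of the table: precision 79.15, recall 75.80.
partial_f = f_score(79.15, 75.80)
```

For the Partial row this gives a value that rounds to the reported 77.44.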
Error Analysis
[Figure: bar chart of Precision, Recall, and F-Score on a 0 to 100 scale]
• The partial and semantic evaluation scores are not satisfactory because of some incorrect tagging by the shallow parser.
• Some synonymous reduplications (dhire-susthe, slowly and steadily) carry a sense similar to that of the first word without being its exact synonym. These words are not identified properly due to the lack of a Bengali lexical resource such as WordNet.
Frequency hypothesis
• Frequency is an important indication of whether a compound is an MWE.
• 8.52% of the reduplications are hyphenated.
• 33.09% of the reduplications are closed; most of these are onomatopoeic, correlative, and semantic reduplications.
• 100% of the correlative reduplications and most of the onomatopoeic reduplications are closed.
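The hyphenated and closed-form percentages above could be obtained with a count like the one sketched below. The three-way split into hyphenated, closed, and "open" (space-separated) forms, the function names, and the sample list are assumptions for illustration only; the sample is not the corpus data.

```python
from collections import Counter


def surface_form(expression):
    """Label a reduplication by how its two parts are written."""
    if "-" in expression:
        return "hyphenated"
    if any(ch.isspace() for ch in expression.strip()):
        return "open"  # ASSUMPTION: name for space-separated forms
    return "closed"


def form_percentages(expressions):
    """Return the percentage of each surface form among the expressions."""
    counts = Counter(surface_form(e) for e in expressions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: 100.0 * n / total for form, n in counts.items()}


# Illustrative sample only, taken from the examples on the slides.
sample = ["din-rat", "maramari", "khat khat", "bara bara"]
percentages = form_percentages(sample)
```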
Frequency Analysis
[Figure: pie chart of the frequency of each class (Onomatopoeic, Complete, Partial, Semantic, Correlative); values shown: 8.51, 51.06, 26.6, 12.7, 18.08]
Conclusion
• Reduplication is mainly used for emphasis, generality, or intensity, or to show continuation of an act.
• The semantics of reduplicated words call for some form of sense disambiguation that cannot be handled by rule-based analysis alone.
• Further research could address stylometric analysis of authors or plagiarism detection.