Abstract - SourceForge (multiword.sourceforge.net/download/Presentations_MWE2010/...)


  • Abstract

    - Identification of reduplication is a subtask of multi-word expression (MWE) identification.

    - Reduplication is a very productive process in Bengali at both the grammatical and the semantic levels.

    - Here, reduplications are identified in a Bengali corpus of the writings of the noted Indian Nobel laureate Rabindranath Tagore.

    - A rule-based approach is used, consisting of two phases: identification of reduplication and semantic analysis.

  • What is Reduplication?

    - Repetition of any linguistic unit such as a phoneme, morpheme, word, phrase, clause, or the utterance as a whole.

      Example: in English, ha-ha, blah-blah, etc.; in Bengali, abal-tabal (incoherent).

    - Bengali is the richest Indian language in the onomatopoeic and ideophonic category of reduplication, with 2400 words (Chaudhuri et al., 2005).

    - Reduplication carries various semantic meanings and helps to identify the mental state of the speaker.

    - Two coarse-grained categories:

      (a) repetition at the expression level.

      (b) repetition at the content or semantic (sense) level.

  • General Classification

    - Onomatopoeic expression: khat khat (knock knock)

    - Complete reduplication: bara-bara (big big)

    - Partial reduplication: thakur-thukur (God)

    - Semantic reduplication:

      Synonym: matha-mundu (head)

      Antonym: din-rat (day and night)

      Class representative: cha-paani (snacks)

    - Correlative reduplication: maramari (fighting)

  • Expression-Level Classification

    - Non-sound symbolic words

      Nouns and pronouns: bari bari (one house to other)

      Adjectives: lal lal phul (red flowers)

      Verbs: bolte bolte (speaking) [mandatory]; bhebe chinte (thinking) [optional]

      Adverbs: dhere dhere (slowly)

    - Sound words: chal chal (sound of water falling)

  • Sense-Level Reduplication

    - Sense of repetition: bachar bachar (every year)

    - Sense of plurality: bara bara bari (many big houses)

    - Sense of emphatic meaning: lal-lal phul (deep red flowers)

    - Sense of completion: kheye deye jabo (after eating)

    - Sense of hesitation or softness: hasi-hasi mukh (laughing face)

    - Sense of incompleteness of the verbs: kotha bolte bolte (talking about)

    - Sense of corresponding correlative words: maramari (fighting)

    - Sense of onomatopoeia: khat khat (knock knock)

  • System Design

    Phase 1: Identifying reduplications

    Identify five main cases of reduplication: onomatopoeic, complete, partial, semantic, and correlative.

    Phase 2: Semantic analysis

    Extract the associated meaning (semantics) of each reduplication, i.e. its sense-level category.

  • Phase 1: System Architecture

    [Diagram components: Bengali Corpus, Tokenizer, Rule-based Identifier, Classifier, Set of Inflections, Dictionary]

  • Components of the Architecture

    - Corpus: articles (novels, stories, dramas) of Rabindranath Tagore [http://www.rabindra-rachanabali.nltr.org]

    - Tokenizer: separates words based on blank space or special symbols (such as the hyphen, the exclamation mark, etc.) to identify two consecutive words.

    - Rule-based Identifier: consecutive tokens are passed to it to verify, using different algorithms, whether or not they are reduplicated words.

    - Classifier: classifies reduplications at the expression level.
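As a rough illustration of the tokenizer step, the sketch below splits text on whitespace and a few punctuation marks and yields consecutive word pairs, recording whether a hyphen joined them. The separator set, the function names `tokenize` and `candidate_pairs`, and the use of romanized input are assumptions made for this example, not details taken from the slides.

```python
import re

# Assumed separator set: whitespace, hyphen, common punctuation, and the
# Bengali danda. The hyphen is kept as its own token so that hyphenated
# pairs can be told apart from space-separated ones.
_TOKEN_RE = re.compile(r"[^\s\-!?,;:.।]+|-")


def tokenize(text):
    """Return word tokens and standalone '-' tokens in reading order."""
    return _TOKEN_RE.findall(text)


def candidate_pairs(text):
    """Return (first, second, hyphenated) for every two consecutive words.

    Simplification: pairs are also formed across punctuation, which a
    fuller tokenizer would probably want to block.
    """
    pairs = []
    prev_word = None
    hyphen_pending = False
    for tok in tokenize(text):
        if tok == "-":
            hyphen_pending = prev_word is not None
            continue
        if prev_word is not None:
            pairs.append((prev_word, tok, hyphen_pending))
        prev_word = tok
        hyphen_pending = False
    return pairs


if __name__ == "__main__":
    print(candidate_pairs("din-rat lal lal phul"))
```

Each pair would then be handed to the rule-based identifier; the `hyphenated` flag is kept because the slides later count hyphenated forms separately.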

  • Components of the Architecture (continued)

    - Dictionary: includes the lexicon and the associated semantics. The system uses both Bengali-to-Bengali (monolingual) and Bengali-to-English (bilingual) dictionaries.

    - Set of inflections:

      [list of Bengali inflectional suffixes, beginning with the zero inflection (0)]

  • Brief Classification Algorithms

    - Complete: the two words are checked for complete equality.

    - Partial: three cases: (i) change of the first vowel attached to the first consonant, (ii) change of the consonant itself in the first position, or (iii) change of both the matra (vowel sign) and the consonant.

      Exception: abal-tabal (incoherent). [Solution: only a restricted set of consonants can be produced by the change (S. K. Chattopadhyay, 1992).]

    - Onomatopoeic: after removing the inflection, the word is divided into two equal halves, which are then compared.

    - Correlative: the formative affixes are added to the root to form the first and second words respectively, and the two words are agglutinated (e.g. maramari).

    - Semantic: a dictionary-based approach using the set of inflections mentioned above.
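The rules above can be sketched on romanized word forms as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it works on Latin transliterations rather than Bengali script, the vowel set, the function names, and the example suffix `"e"` are hypothetical, and the correlative affixes `-a` and `-i` are inferred from the maramari example. The restriction on which consonants may result from a partial change (Chattopadhyay, 1992) is not modeled, so `is_partial` over-generates.

```python
VOWELS = set("aeiou")  # assumed vowel letters of the romanization


def is_complete(w1, w2):
    """Complete reduplication: both words are exactly equal."""
    return w1 == w2


def split_onset(word):
    """Split into (initial consonants, first vowel run, remainder)."""
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    j = i
    while j < len(word) and word[j] in VOWELS:
        j += 1
    return word[:i], word[i:j], word[j:]


def is_partial(w1, w2):
    """Partial reduplication: same remainder, but the first consonant,
    the first vowel, or both differ."""
    if w1 == w2:
        return False
    c1, v1, rest1 = split_onset(w1)
    c2, v2, rest2 = split_onset(w2)
    return rest1 != "" and rest1 == rest2 and (c1 != c2 or v1 != v2)


def strip_inflection(word, inflections):
    """Remove the longest matching suffix from the given inflection set."""
    for suffix in sorted(inflections, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def is_closed_onomatopoeic(word, inflections=()):
    """Closed onomatopoeic form: the stem splits into two equal halves."""
    stem = strip_inflection(word, inflections)
    half, odd = divmod(len(stem), 2)
    return odd == 0 and half > 0 and stem[:half] == stem[half:]


def is_correlative(word, affixes=("a", "i")):
    """Correlative form: root + first affix + root + second affix."""
    first, second = affixes
    if not word.endswith(second):
        return False
    body = word[: -len(second)]          # root + first affix + root
    root_chars = len(body) - len(first)
    if root_chars <= 0 or root_chars % 2:
        return False
    root = body[: root_chars // 2]
    return body == root + first + root


if __name__ == "__main__":
    print(is_complete("bara", "bara"), is_partial("thakur", "thukur"))
    print(is_closed_onomatopoeic("khatkhat"), is_correlative("maramari"))
```

Semantic reduplication is left out because it depends on dictionary lookups rather than on string shape.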

  • Phase 2: Semantic (Sense) Analysis

    - Correspondence between general and sense-level reduplications:

      Reduplication          Semantics (sense)
      onomatopoeic           onomatopoeia
      semantic or partial    completion
      correlative word       corresponding correlative words
      complete               repetition / hesitation, softness

    - Problem for sense disambiguation of complete reduplication: multiple senses depending on the context.

    - The system identifies some related words such as kara (to do), bhaba (to think), mato (like), and laga (to feel) for disambiguation.

    - These are not enough for disambiguating the sense of the phrase.
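The correspondence table can be read as a lookup from the Phase 1 class to one or more candidate senses. The sketch below encodes it that way; the names `SENSE_BY_CLASS`, `CUE_WORDS`, `candidate_senses`, and `has_disambiguation_cue` are hypothetical, and the cue words are given in romanized form. Since the slides do not say which sense each cue word selects, the helper only reports whether a cue is present.

```python
# Class-to-sense correspondence as listed on the slide.
SENSE_BY_CLASS = {
    "onomatopoeic": ["onomatopoeia"],
    "semantic": ["completion"],
    "partial": ["completion"],
    "correlative": ["corresponding correlative words"],
    # Complete reduplication is ambiguous and needs context.
    "complete": ["repetition", "hesitation or softness"],
}

# Related words the system looks for when disambiguating (romanized).
CUE_WORDS = {"kara", "bhaba", "mato", "laga"}


def candidate_senses(redup_class):
    """Return the possible senses for a reduplication class."""
    return SENSE_BY_CLASS[redup_class]


def is_ambiguous(redup_class):
    """True when the class maps to more than one sense."""
    return len(candidate_senses(redup_class)) > 1


def has_disambiguation_cue(context_tokens):
    """True when any known cue word occurs in the surrounding tokens."""
    return any(token in CUE_WORDS for token in context_tokens)


if __name__ == "__main__":
    print(candidate_senses("complete"), is_ambiguous("complete"))
    print(has_disambiguation_cue(["hasi", "hasi", "mato"]))
```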

  • Experimental Results

    - The collected corpus includes 14,810 tokens for 3675 distinct word forms at the root level.

    - Metrics:

      IR metrics: precision, recall, F-score.

      Frequency measurements of each class.

      Hyphen and closed-form count.

    - Evaluation:

      Reduplication   Precision   Recall   F-score
      Onomatopoeic    99.85       99.77    99.79
      Complete        99.98       99.92    99.95
      Partial         79.15       75.80    77.44
      Semantic        85.20       82.26    83.71
      Correlative     99.91       99.73    99.82
      System          92.82       91.50    92.15
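For readers who want to relate the three columns, the sketch below recomputes an F-score from the reported precision and recall. It assumes the balanced F1 measure (the harmonic mean of precision and recall), which the slides do not state explicitly; the function name `f_score` is illustrative. Values recomputed from the rounded table entries need not match the reported column exactly.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from the evaluation table.
REPORTED = {
    "Onomatopoeic": (99.85, 99.77, 99.79),
    "Complete": (99.98, 99.92, 99.95),
    "Partial": (79.15, 75.80, 77.44),
    "Semantic": (85.20, 82.26, 83.71),
    "Correlative": (99.91, 99.73, 99.82),
    "System": (92.82, 91.50, 92.15),
}

if __name__ == "__main__":
    for name, (p, r, reported) in REPORTED.items():
        print(f"{name:13s} reported={reported:6.2f} "
              f"recomputed={f_score(p, r):6.2f}")
```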

  • Error Analysis

    [Figure: bar chart of Precision, Recall, and F-Score on a 0-100 scale]

    - Partial and semantic evaluation scores are not satisfactory because of some wrong tagging by the shallow parser.

    - In some synonymous reduplications (dhire-susthe, slowly and steadily), the second word has a sense close to that of the preceding word without being its exact synonym. These words are not identified properly because Bengali lacks lexical resources like WordNet.

  • Frequency Hypothesis

    - Frequency is an important indication of whether a compound is an MWE.

    - 8.52% of reduplications are hyphenated.

    - 33.09% of reduplications are closed (written as one word); most of these are onomatopoeic, correlative, and semantic reduplications.

    - 100% of correlative reduplications and most onomatopoeic reduplications are closed.

    [Figure: Frequency Analysis chart over the Onomatopoeic, Complete, Partial, Semantic, and Correlative classes; values shown: 8.51, 51.06, 26.6, 12.7, 18.08]

  • Conclusion

    - Reduplication is mainly used for emphasis, generality, intensity, or to show continuation of an act.

    - The semantics of reduplicated words call for sense disambiguation that cannot be handled by rule-based analysis alone.

    - Further research: application to stylometric analysis of authors or to plagiarism detection.