DCU meets MET: Bengali and Hindi Morpheme Extraction Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones CNGL, School of Computing, Dublin City University,

DCU meets MET: Bengali and Hindi Morpheme Extraction
Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland

Outline
- Motivation
- Task Description
- Bengali Stemming Approach
- Hindi Stemming Approach
- Results
- Conclusions and Future Work

Motivation
- Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
  - Example: company, companies → company; hopeful → hope
- For information retrieval, indexing surface forms would lead to many mismatches between query terms and the index terms extracted from documents
- Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms

Task Description
- Morpheme Extraction Task (MET): investigate the effect of morphological analysis, lemmatization, and stemming on information retrieval (IR) performance for Indian languages
- Subtasks:
  - Subtask 1: manual evaluation of morpheme extraction
  - Subtask 2: IR evaluation using the proposed morpheme representation as index terms; the evaluation metric is mean average precision (MAP)
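For readers unfamiliar with the Subtask 2 metric, the following is a minimal sketch of how average precision and MAP are commonly computed. The function names and the 0/1 relevance-array input format are assumptions made for illustration; this is not the evaluation code used by the task organizers.

```c
#include <stddef.h>

/* Average precision for one query.
 * rel[i] is 1 if the document at rank i+1 is relevant, 0 otherwise.
 * n_relevant is the total number of relevant documents for the query
 * (including any that were not retrieved). */
double average_precision(const int *rel, size_t n_ranked, size_t n_relevant)
{
    if (n_relevant == 0)
        return 0.0;

    double sum = 0.0;
    size_t hits = 0;
    for (size_t i = 0; i < n_ranked; i++) {
        if (rel[i]) {
            hits++;
            /* precision at the rank of this relevant document */
            sum += (double)hits / (double)(i + 1);
        }
    }
    return sum / (double)n_relevant;
}

/* MAP: the mean of the per-query average precision values. */
double mean_average_precision(const double *ap, size_t n_queries)
{
    if (n_queries == 0)
        return 0.0;

    double sum = 0.0;
    for (size_t q = 0; q < n_queries; q++)
        sum += ap[q];
    return sum / (double)n_queries;
}
```

For a ranking whose relevance pattern is 1, 0, 1, 0 with two relevant documents in total, the average precision is (1/1 + 2/3) / 2.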

Stemming Approaches
- Light vs. aggressive stemming
- Rule-based vs. corpus-based stemming
  - manually created rules vs. clusters of related words
- Iteratively remove word suffixes
- Problems:
  - overstemming, i.e. the removed suffix is too long, e.g. international/intern; news/new
  - understemming, i.e. the removed suffix is too short, e.g. forgetfulness/forgetful
  - irregular forms, e.g. feet/foot; women/woman
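The overstemming and understemming examples above can be reproduced with a deliberately naive suffix stripper. This is only an illustrative sketch: the function names and the three-entry English suffix list are assumptions chosen to trigger the examples on the slide, not rules from any real stemmer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Remove `suffix` from the end of `word` in place.
 * Returns true only if the suffix matched and a non-empty stem remains. */
static bool strip_suffix(char *word, const char *suffix)
{
    size_t wlen = strlen(word);
    size_t slen = strlen(suffix);

    if (slen == 0 || slen >= wlen)
        return false;
    if (strcmp(word + wlen - slen, suffix) != 0)
        return false;
    word[wlen - slen] = '\0';
    return true;
}

/* Repeatedly strip the first matching suffix until no suffix applies.
 * Suffixes are tried in list order, so longer ones should come first. */
void naive_stem(char *word, const char *const *suffixes, size_t n)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            if (strip_suffix(word, suffixes[i])) {
                changed = true;
                break;
            }
        }
    }
}
```

With the hypothetical list {"ational", "ness", "s"}, "international" becomes "intern" and "news" becomes "new" (overstemming), while "forgetfulness" stops at "forgetful" (understemming). Irregular pairs such as feet/foot are not touched at all, which is why they need separate handling.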

Our Bengali Stemming Approach
- Rule-based stemmer created by a native speaker
- Focus on nouns (most important for IR)
- Four categories of suffixes [Bhattacharya et al. 2005]:
  - Title markers added as suffixes to proper nouns, e.g. (Mrs.), (sir)
  - Classifiers for plurality and specificity/gender of a noun, e.g. (pictures), (the picture), (female student)
  - Case markers for possessive or accusative relations, e.g. (family's)
  - Emphasizers to emphasize the current word, e.g. (only a picture), (only this picture)

Bengali Stemmer
- Drop emphasizers (iteratively)
- Drop classifiers and case markers
- Drop title markers
- Drop plural suffixes
- Drop derivational suffixes
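The staged design above can be sketched as a sequence of suffix lists, assuming the stages run in the order listed. All suffix lists and function names below are illustrative assumptions, not the authors' rule set: only a few common Bengali suffixes are filled in, and the title-marker and derivational lists are left empty because the slide's examples are not available. The strings are treated as UTF-8 bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative (assumed) suffix lists, each terminated by NULL. */
static const char *const EMPHASIZERS[]      = {"ই", "ও", NULL};
static const char *const CLASSIFIERS_CASE[] = {"টা", "টি", "কে", "এর", NULL};
static const char *const TITLE_MARKERS[]    = {NULL};  /* not filled in */
static const char *const PLURALS[]          = {"গুলো", "গুলি", "দের", NULL};
static const char *const DERIVATIONAL[]     = {NULL};  /* not filled in */

/* Strip the first suffix in the list that matches the end of `word`,
 * provided a non-empty stem remains. Returns true if something was cut. */
static bool strip_any(char *word, const char *const *suffixes)
{
    size_t wlen = strlen(word);
    for (size_t i = 0; suffixes[i] != NULL; i++) {
        size_t slen = strlen(suffixes[i]);
        if (slen < wlen && strcmp(word + wlen - slen, suffixes[i]) == 0) {
            word[wlen - slen] = '\0';
            return true;
        }
    }
    return false;
}

/* Apply the stages in slide order; only the first stage is iterated. */
void bengali_stem_sketch(char *word)
{
    while (strip_any(word, EMPHASIZERS))
        ;  /* emphasizers are dropped repeatedly */
    strip_any(word, CLASSIFIERS_CASE);
    strip_any(word, TITLE_MARKERS);
    strip_any(word, PLURALS);
    strip_any(word, DERIVATIONAL);
}
```

Under these assumed lists, ছবিটাই first loses the emphasizer ই and then the classifier টা, leaving ছবি. Keeping each category in its own list makes the stage order explicit and lets a category be extended without touching the others.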

Our Hindi Stemming Approach
- Hindi has less complex inflectional morphology → fewer stemming rules
- Rule-based stemmer
- Stemming rules manually created by a native Hindi speaker

Hindi Stemmer
- Iteratively remove Hindi vowels, matras, anusvara, and the character ya (य) from the right of a string until the first consonant is encountered
- Drop derivational suffixes, e.g. (to boys) → (boy); (to girls) → (girl)
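The first step above can be sketched on UTF-8 input as follows. The function names and the exact Unicode ranges used for "vowels, matras, anusvara, ya" are assumptions of this sketch rather than the authors' implementation, and the derivational-suffix step is not covered. The guard that keeps at least one character is also an addition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True for the code points treated as removable in this sketch.
 * Ranges are from the Unicode Devanagari block (assumed mapping). */
static bool is_strippable(unsigned cp)
{
    return (cp >= 0x0905 && cp <= 0x0914)   /* independent vowels */
        || (cp >= 0x093E && cp <= 0x094C)   /* matras (vowel signs) */
        || cp == 0x0902                     /* anusvara */
        || cp == 0x092F;                    /* ya */
}

/* Strip removable characters from the end of a UTF-8 string in place,
 * stopping at the first character that is not removable (for example a
 * consonant). At least one character is always kept. */
void hindi_strip_trailing(char *word)
{
    size_t len = strlen(word);

    while (len > 3) {
        const unsigned char *p = (const unsigned char *)word + len - 3;

        /* Devanagari (U+0900..U+097F) always encodes as E0 A4|A5 xx. */
        if (p[0] != 0xE0 || (p[1] != 0xA4 && p[1] != 0xA5)
            || (p[2] & 0xC0) != 0x80)
            break;

        unsigned cp = ((unsigned)(p[0] & 0x0F) << 12)
                    | ((unsigned)(p[1] & 0x3F) << 6)
                    |  (unsigned)(p[2] & 0x3F);
        if (!is_strippable(cp))
            break;

        len -= 3;
        word[len] = '\0';
    }
}
```

Under these assumptions both लड़कों and लड़कियों are reduced to लड़क, because the trailing matras, anusvara, and ya are removed until the consonant क is reached.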

MET Experiments
- Experiments for Bengali and Hindi
- Stemmers implemented in C
- Submission as source code
- Stemmed forms are used for retrieval with Terrier

Results

Team        Language   MAP
Baseline    Bengali    0.2740
JU          Bengali    0.3307 (+20.69%)
DCU         Bengali    0.3300 (+20.44%)
IIT-KGP     Bengali    0.3225 (+17.70%)
CVPR-Team   Bengali    0.3159 (+15.29%)
ISM         Bengali    0.3103 (+13.25%)
Baseline    Hindi      0.2821
DCU         Hindi      0.2963 (+5.03%)
ISM         Hindi      0.2793 (-0.99%)
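The percentages in parentheses are consistent with the relative change in MAP against the baseline for the same language. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
/* Relative change of `score` against `baseline`, in percent.
 * The caller must pass a non-zero baseline. */
double relative_change_percent(double baseline, double score)
{
    return (score - baseline) / baseline * 100.0;
}
```

For example, the Bengali baseline of 0.2740 and the JU score of 0.3307 give (0.3307 - 0.2740) / 0.2740 × 100, which rounds to +20.69%.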

Conclusions

- Bengali stemmer: second-best performance (MAP 0.3300, +20.44% over the baseline)
- Hindi stemmer: best performance (MAP 0.2963, +5.03% over the baseline)
- Both stemmers have also been used successfully in previous ad-hoc IR experiments for FIRE

Future Work
- Explore the use of exclusion lists for irregular cases
- Extend the rule set (i.e. handle verbs)
- Compare to other stemmers for Bengali/Hindi, e.g. the Indian language stemmers in version 4 of Lucene; stemmers from Jacques Savoy's web page on cross-language IR
- Investigate the morphology of named entities

Thank+s for your attention
Any question+s?
