DCU meets MET: Bengali and Hindi Morpheme Extraction
Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland
Outline
  Motivation
  Task Description
  Bengali Stemming Approach
  Hindi Stemming Approach
  Results
  Conclusions and Future Work
Motivation
  Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
    Example: company, companies → company; hopeful → hope
  For information retrieval, indexing surface forms would lead to many mismatches between query terms and index terms extracted from documents
  Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms
Task Description
  Morpheme Extraction Task (MET): investigate the effect of morphological analysis, lemmatization, and stemming on information retrieval (IR) performance for Indian languages
  Subtasks:
    Subtask 1: manual evaluation of morpheme extraction
    Subtask 2: IR evaluation using the proposed morpheme representation as index terms; the evaluation metric is mean average precision (MAP)
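Since Subtask 2 is scored by MAP, a minimal sketch of how that metric is commonly computed may help. The function names and the `runs`/`qrels` dictionary layout are assumptions made for illustration; the slides do not say which evaluation tool the task organizers used.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list for one query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(query, []), relevant)
                for query, relevant in qrels.items())
    return total / len(qrels)
```

A better stemmer raises MAP when it moves relevant documents toward the top of the ranking, because each relevant document is credited with the precision at its own rank.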
Stemming Approaches
  Light vs. aggressive stemming
  Rule-based vs. corpus-based stemming
    manually created rules vs. clusters of related words
  Iteratively remove word suffixes
  Problems:
    overstemming, i.e. the removed suffix is too long
      e.g. international/intern; news/new
    understemming, i.e. the removed suffix is too short
      e.g. forgetfulness/forgetful
    irregular forms
      e.g. feet/foot; women/woman
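The over- and understemming cases above can be reproduced with a toy suffix stripper. This is an illustrative sketch only: the English suffix list, the `min_stem` length guard, and the function name are assumptions chosen to match the slide's examples, not rules from the DCU stemmers.

```python
from typing import Iterable

# Hypothetical suffix list, chosen only to reproduce the examples on the slide.
SUFFIXES = ("ational", "ness", "ful", "s")


def strip_suffixes(word: str, suffixes: Iterable[str] = SUFFIXES,
                   iterative: bool = True, min_stem: int = 3) -> str:
    """Remove the longest matching suffix, once or repeatedly."""
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Keep at least min_stem characters so short words survive.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                changed = True
                break
        if not iterative:
            break
    return word


print(strip_suffixes("international"))                   # overstemming
print(strip_suffixes("news"))                            # overstemming
print(strip_suffixes("forgetfulness", iterative=False))  # understemming
print(strip_suffixes("forgetfulness"))                   # iterative removal
print(strip_suffixes("feet"))                            # irregular form is untouched
```

A single pass leaves "forgetful", while iterative removal continues to "forget"; no suffix rule can map "feet" to "foot", which is why irregular forms need separate handling.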
Our Bengali Stemming Approach
  Rule-based stemmer created by a native speaker
  Focus on nouns (most important for IR)
  Four categories [Bhattacharya et al. 2005]:
    Title markers added as suffixes to proper nouns
      e.g. (Mrs.), (sir)
    Classifiers for plurality and specificity/gender of a noun
      e.g. (pictures), (the picture), (female student)
    Case markers for possessive or accusative relations
      e.g. (family's)
    Emphasizers to emphasize the current word
      e.g. (only a picture), (only this picture)
Bengali Stemmer
  Drop emphasizers (iteratively)
  Drop classifiers and case markers
  Drop title markers
  Drop plural suffixes
  Drop derivational suffixes
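The ordering of these steps can be sketched as a small pipeline. Everything concrete below is an assumption for illustration: the suffix strings are a handful of common Bengali endings picked by the editor, the title-marker and derivational lists are left empty because the slides do not preserve them, and the `min_stem` guard is hypothetical. The original stemmer was written in C with its own rule set.

```python
import unicodedata
from typing import Sequence

# (step name, hypothetical suffixes, apply repeatedly?) in the order on the slide.
STEPS = [
    ("emphasizers", ("ই", "ও"), True),
    ("classifiers and case markers", ("গুলি", "গুলো", "টি", "টা", "কে", "ের"), False),
    ("title markers", (), False),          # not preserved in the slides
    ("plural suffixes", ("রা",), False),
    ("derivational suffixes", (), False),  # not preserved in the slides
]


def drop_suffix(word: str, suffixes: Sequence[str], min_stem: int = 2) -> str:
    """Remove the longest matching suffix, or return the word unchanged."""
    normalized = (unicodedata.normalize("NFC", s) for s in suffixes)
    for suffix in sorted(normalized, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def bengali_stem(word: str) -> str:
    """Apply the five steps in order; only flagged steps repeat."""
    word = unicodedata.normalize("NFC", word)
    for _name, suffixes, iterative in STEPS:
        stripped = drop_suffix(word, suffixes)
        while iterative and stripped != word:
            word, stripped = stripped, drop_suffix(stripped, suffixes)
        word = stripped
    return word
```

With these assumed lists, `bengali_stem("ছবিটিই")` first drops the emphasizer and then the classifier, leaving ছবি (picture). Removing emphasizers first matters because they attach outside the other suffixes.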
Our Hindi Stemming Approach
  Hindi has less complex inflectional morphology
    fewer stemming rules
  Rule-based stemmer
  Stemming rules manually created by a native Hindi speaker
Hindi Stemmer
  Iteratively remove Hindi vowels, matras, anusvara, and य (character ya) from the right of a string until the first consonant is encountered
  Drop derivational suffixes
    e.g. (to boys) → (boy); (to girls) → (girl)
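The first rule can be sketched directly from its description. The Unicode ranges chosen for "vowels" and "matras", the `min_length` guard, and the function name are assumptions; the original C implementation may define these character classes differently, and the derivational-suffix rule is not covered here.

```python
# Character classes assumed from the Devanagari Unicode block.
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0905, 0x0915)}  # अ .. औ
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}              # ा .. ौ
ANUSVARA = "\u0902"                                             # ं
YA = "\u092F"                                                   # य

STRIPPABLE = INDEPENDENT_VOWELS | MATRAS | {ANUSVARA, YA}


def strip_trailing_vowel_signs(word: str, min_length: int = 1) -> str:
    """Remove strippable characters from the right until a consonant is hit.

    min_length is a hypothetical guard so that a word made up entirely of
    strippable characters is not reduced to the empty string.
    """
    end = len(word)
    while end > min_length and word[end - 1] in STRIPPABLE:
        end -= 1
    return word[:end]


print(strip_trailing_vowel_signs("लड़कों"))    # oblique plural of "boy"
print(strip_trailing_vowel_signs("लड़कियों"))  # oblique plural of "girl"
```

Under these assumptions both inputs reduce to the same string, लड़क, which shows how aggressive this single rule is: distinct nouns can be conflated to one index term.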
MET Experiments
  Experiments for Bengali and Hindi
  Stemmers implemented in C
  Submission as source code
  Stemmed forms are used for retrieval with Terrier
Results
  Team        Language  MAP
  Baseline    Bengali   0.2740
  JU          Bengali   0.3307 (+20.69%)
  DCU         Bengali   0.3300 (+20.44%)
  IIT-KGP     Bengali   0.3225 (+17.70%)
  CVPR-Team   Bengali   0.3159 (+15.29%)
  ISM         Bengali   0.3103 (+13.25%)
  Baseline    Hindi     0.2821
  DCU         Hindi     0.2963 (+5.03%)
  ISM         Hindi     0.2793 (-0.99%)
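The percentages in parentheses are relative changes in MAP over the unstemmed baseline for the same language. A short sketch of that arithmetic, using the scores from the table (the function and variable names are illustrative):

```python
def relative_change(map_score: float, baseline: float) -> float:
    """Relative MAP change over the baseline, in percent."""
    return (map_score - baseline) / baseline * 100.0


BASELINES = {"Bengali": 0.2740, "Hindi": 0.2821}
RUNS = [
    ("JU", "Bengali", 0.3307),
    ("DCU", "Bengali", 0.3300),
    ("IIT-KGP", "Bengali", 0.3225),
    ("CVPR-Team", "Bengali", 0.3159),
    ("ISM", "Bengali", 0.3103),
    ("DCU", "Hindi", 0.2963),
    ("ISM", "Hindi", 0.2793),
]

for team, language, score in RUNS:
    change = relative_change(score, BASELINES[language])
    print(f"{team:<10} {language:<8} {score:.4f} ({change:+.2f}%)")
```

For example, the DCU Bengali run gives (0.3300 - 0.2740) / 0.2740, which is about +20.44%.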
Conclusions
  Bengali stemmer: 2nd best performance
  Hindi stemmer: best performance
  Both have also been used successfully in previous ad-hoc IR experiments for FIRE
Future Work
  Explore the use of exclusion lists for irregular cases
  Extend the rule set (i.e. handle verbs)
  Compare to other stemmers for Bengali/Hindi
    e.g. the Indian language support in version 4 of Lucene; stemmers from Jacques Savoy's web page on cross-language IR
  Investigate the morphology of named entities
Thank+s for your attention
Any question+s?