IIIT Hyderabad’s CLIR experiments for FIRE-2008
Sethuramalingam S & Vasudeva Varma
IIIT Hyderabad, India
Outline
• Introduction
• Related Work in Indian Language IR
• Our CLIR experiments
• Evaluation & Analysis
• Future Work
Introduction
• Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia)
• Information – text, audio, video, speech, geographical information, etc.
CLIR – Indian languages (IL) scenario
[Figure: Indian-language CLIR scenario – a query in one language retrieving documents in Tamil (தமிழ்), Hindi (हिन्दी), Telugu (తెలుగు), Bangla (বাংলা), Marathi (मराठी), etc. Modified from D. Oard's Cross-Language IR presentation]
• To retrieve documents written in any IL when the user queries in one language
Why CLIR for IL?
• Internet user growth in India between 2000 and 2008: 1,100% (Source: www.internetworldstats.com)
• Growth in Indian-language content on the web between 2000 and 2007: 700%
So, CLIR for IL becomes mandatory!
RELATED WORK IN INDIAN LANGUAGE IR
Related Work in ILIR
• ACM TALIP, 2003 – the surprise language exercises
– Task was to build a CLIR system for English to Hindi and Cebuano
“The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003
Related Work in ILIR
• CLEF 2006
– Ad-hoc bilingual track including two Indian languages, Hindi and Telugu
– Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task
“Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF 2006.
Related Work in ILIR
• CLEF 2007
– Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
– Five teams including ours participated
– Hindi and Telugu to English CLIR
“IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF 2007.
Related Work in ILIR
• Google's CLIR system for 34 languages including Hindi
OUR CLIR EXPERIMENTS
Our CLIR experiments
• Ad-hoc cross-lingual runs: Hindi to English and English to Hindi
• Ad-hoc monolingual runs in Hindi and English
• 12 runs in total were submitted for the above 4 tasks
Problem statement
• The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language. An example Hindi topic (topic 28) is shown below.
<top lang="hi"><num>28</num><title> ईरा�न का� पराम�णु� का�र्य�क्रम</title><desc> ईरा�न का� का�र्य�क्रम औरा उसका� पराम�णु� न�हि� का� बा�रा� म� हि�श्व का�रा�र्य।</desc><narr> ईरा�न का� पराम�णु� न�हि� औरा ऐस� का�र्य�क्रम का� हि�रुद्ध ईरा�न परा र्य#एसए का�
हिनरा%�रा दीबा�� औरा धमका� का� बा�रा� म� स#चन� स%बा%धिध� प्रले�ख म� रा�न� च�हि�ए। पराम�णु� न�हि� का� समझौ-�� का� लिलेए ईरा�न औरा र्य#रा/प�र्य स%घ का� बा�च ����� औरा हि�श्व दृधि2 भी�
रुलिचकारा �/गी�</narr></top>
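The topics follow a TREC-style <top> layout. A minimal parsing sketch, assuming each <top> element is well-formed XML; this is illustrative only, not the topic reader used for the submitted runs, and the sample desc/narr strings are placeholders:

```python
import xml.etree.ElementTree as ET

def parse_topic(topic_xml: str) -> dict:
    """Extract num, title, desc and narr from a single <top> element."""
    top = ET.fromstring(topic_xml)
    return {field: (top.findtext(field) or "").strip()
            for field in ("num", "title", "desc", "narr")}

# Placeholder topic in the same layout (not the real FIRE topic text).
sample = """<top lang="hi">
<num>28</num>
<title>ईरान का परमाणु कार्यक्रम</title>
<desc>Short description of the information need.</desc>
<narr>Longer narrative spelling out what counts as relevant.</narr>
</top>"""

print(parse_topic(sample))
```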
CLIR System architecture
• Query Processing module
– Named entity identification
– Query translation using lexicons
– Transliteration (mapping-based)
– Query scoring
• Indexing & Ranking module
– Stop-word removal
– A typical indexer using Lucene
Named Entity Identification
• Used to identify the named entities present in the queries for transliteration (see the sketch after this slide)
• We used
– Our CRF-based NER system (as a binary classifier) for Hindi queries
– The Stanford English NER system for English queries
• Identifies Person, Organization and Location names
"Experiments in Telugu NER: A Conditional Random Field Approach“, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.
Query translation
• Using bilingual lexicons (a lookup sketch is given below)
– "Shabdanjali", an English-Hindi dictionary containing 26,633 entries
– IIT Bombay Hindi Wordnet
– A manually collected Hindi-English dictionary with 6,685 entries
Shabdanjali - http://www.shabdkosh.com/shabdanjali
Hindi Wordnet - http://www.cfilt.iitb.ac.in/wordnet/webhwn/
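A minimal sketch of the lexicon lookup step, assuming the three resources above have been merged into a single word-to-translations mapping; the merging and the entries shown are illustrative, not the actual dictionary contents:

```python
# Illustrative merged bilingual lexicon: source word -> candidate translations.
# In practice this would be built from Shabdanjali, the IIT Bombay Hindi
# Wordnet and the manually collected Hindi-English dictionary listed above.
LEXICON = {
    "परमाणु": ["nuclear", "atomic"],
    "कार्यक्रम": ["programme", "program"],
}

def translate_word(word):
    """Return all candidate translations for a query word; words with no
    entry fall through to the transliteration step on the next slides."""
    return LEXICON.get(word, [])

for w in ["परमाणु", "कार्यक्रम", "ईरान"]:
    candidates = translate_word(w)
    print(w, "->", candidates if candidates else "(no entry: transliterate)")
```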
Transliteration
• Mapping-based approach
• For a given named entity in the source language
– Derive its Compressed Word Format (CWF), e.g. academia → cdm, abullah → bll
– Generate the list of named entities and their CWFs on the target language side
– Search for and map the source-language NE's CWF to the target-language equivalent within the minimum modified edit distance
(A simplified sketch of this mapping is given below.)
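A simplified sketch of the CWF mapping, assuming the rewrite/remove rules are just vowel and 'h' deletion (enough to reproduce the two examples above) and using plain Levenshtein distance in place of the modified edit distance described in the paper:

```python
def cwf(word: str) -> str:
    """Compressed Word Format with a deliberately simplified rule set
    (drop vowels and 'h'), enough to reproduce the slide's examples:
    academia -> 'cdm', abullah -> 'bll'.  The actual system applies a
    richer set of heuristic rewrite and remove rules."""
    return "".join(c for c in word.lower() if c not in "aeiouh")

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (the paper uses a modified variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

def map_entity(source_ne_cwf: str, target_entities):
    """Pick the target-side named entity whose CWF is closest to the
    source NE's CWF."""
    return min(target_entities,
               key=lambda ne: edit_distance(source_ne_cwf, cwf(ne)))

# Toy target-side NE list; the real list is mined from the English corpus.
targets = ["Abdullah", "Academia", "Ahmadinejad"]
print(map_entity("bll", targets))   # -> 'Abdullah'
```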
Transliteration
• Implementation
– Named entities present in the Hindi and English corpora are identified and listed
– Their CWFs are generated using a set of heuristic rewrite and remove rules
– The CWFs are added to the list of NEs
“Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM-2008.
Query Scoring
• We generate a Boolean OR query with scored query words
• Query scoring is based on
– The position of occurrence of the word in the topic
– The number of occurrences of the word
– Numbers and years are given greater weights
(A toy version of this weighting is sketched below.)
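A toy version of such a weighting scheme; the field weights and the boost for numbers/years below are made up for illustration and are not the values used in the submitted runs:

```python
import re
from collections import Counter

def score_query(title_words, desc_words):
    """Assign a weight to every query word: earlier fields and repeated
    words score higher, numbers/years get an extra boost (all illustrative)."""
    field_weight = {"title": 2.0, "desc": 1.0}        # earlier field -> higher weight
    counts = Counter(title_words + desc_words)
    scores = {}
    for field, words in (("title", title_words), ("desc", desc_words)):
        for w in words:
            base = field_weight[field] * counts[w]    # position + frequency
            if re.fullmatch(r"\d{2,4}", w):           # numbers / years boosted
                base *= 2.0
            scores[w] = max(scores.get(w, 0.0), base)
    return scores

scores = score_query(["iran", "nuclear", "programme"],
                     ["iran", "nuclear", "policy", "2006"])
# Boolean OR query with per-term boosts, in Lucene-style syntax:
print(" OR ".join(f"{w}^{s:.1f}" for w, s in scores.items()))
```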
Indexing module
• For the English corpus, stop words are removed and words are stemmed using Lucene
• For the Hindi corpus, a stop-word list of 246 words is generated from the given corpus based on frequency (see the sketch below)
• Documents are indexed using the Lucene Indexer and ranked using the BM-25 algorithm in Lucene
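A minimal sketch of deriving a frequency-based stop-word list from a corpus, along the lines of the 246-word Hindi list above; the toy documents and the whitespace tokenisation are illustrative:

```python
from collections import Counter

def build_stopword_list(documents, n=246):
    """Return the n most frequent tokens in the corpus as the stop-word list.
    The real list was built from the full FIRE Hindi corpus, not a toy sample."""
    counts = Counter(token for doc in documents for token in doc.split())
    return [word for word, _ in counts.most_common(n)]

docs = ["के लिए और में यह", "और के में भारत", "के में समाचार और"]
print(build_stopword_list(docs, n=5))
```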
EVALUATION & ANALYSIS
Evaluation
• English-Hindi cross-lingual run
Run MAP GMAP R-Prec Bpref
Title + Desc 0.1538 0.0093 0.1687 0.1905
Title + Narr 0.1516 0.0229 0.1871 0.1918
Title + Desc + Narr 0.1432 0.0215 0.1793 0.1886
Evaluation
• Hindi-English cross-lingual run
Run MAP GMAP R-Prec Bpref
Title + Desc 0.0907 0.0197 0.1291 0.1408
Title + Narr 0.1204 0.0366 0.1718 0.1734
Title + Desc + Narr 0.1112 0.0287 0.1541 0.1723
Evaluation
• Hindi-Hindi monolingual run
Run MAP GMAP R-Prec Bpref
Title + Desc 0.2579 0.0427 0.2797 0.2964
Title + Narr 0.2652 0.0534 0.2845 0.3023
Title + Desc + Narr 0.2472 0.0525 0.2558 0.2773
Evaluation
• English-English monolingual run
Run MAP GMAP R-Prec Bpref
Title + Desc 0.4416 0.3437 0.4579 0.4889
Title + Narr 0.4863 0.3989 0.4894 0.5218
Title + Desc + Narr 0.4690 0.3841 0.4707 0.5167
English-Hindi Vs Hindi-Hindi
Hindi-English Vs English-English
Evaluation
• Summary
– Our English-Hindi CLIR performance was 58% of the monolingual run
– Our Hindi-English CLIR performance was 25% of the monolingual run
– Our Hindi-Hindi monolingual run retrieved 52% of the total relevant documents
– Our English-English monolingual run retrieved 91% of the total relevant documents
Analysis
• Our English-Hindi CLIR performance can be attributed to factors like
– Exact matching of English named entities
– Good coverage of English words in our lexicons
• The relatively lower performance on Hindi-English CLIR is due to
– Low dictionary coverage
– Query formulation that was not complex enough
FUTURE WORK
Future Work
• Error analysis on a per-topic basis
• Work on more complex query formulations
• Work on other possible query translation techniques like
– Building dictionaries from parallel corpora
– Using the web
– Using Wikipedia
THANK YOU!!!