Page 1: AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil-English

AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil-English

Pattabhi R. K. Rao and Sobha L

AU-KBC Research Centre, MIT Campus, Chennai

Page 2: FIRE 2008 – Tamil–English CLIR

• Problem definition
  – Ad-hoc cross-lingual document retrieval task of FIRE.
  – The task is to retrieve relevant documents in English for a given Indian-language query.
  – We worked on a Tamil–English cross-lingual information retrieval system.

Page 3: Our Approach

• The main components in our CLIR system are:
  – Query Language Analyser
  – Named Entity Recognizer
  – Query Translation Engine
  – Query Expansion
  – Ranking

Page 4: Query Language Analyser – Tamil Morphological Analyser

• The morphological analyser analyses each word to give the morphs of the word.
  – E.g., patiwwAnY → pati (V) + ww (Past) + AnY (3SM)
• For nouns, the inflections mark case, such as dative or accusative.
• For verbs, the inflections carry person, number, gender, tense, aspect, and modal information.
• Uses a paradigm-based approach.
• Implemented as a finite state machine (an illustrative sketch follows below).
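The slides give no implementation details for the analyser, so the following is only a minimal sketch of paradigm-based suffix analysis, using a hypothetical two-entry suffix table in the same transliteration scheme as the patiwwAnY example above.

```python
# Minimal sketch of paradigm-based morphological analysis. The suffix table
# below is a hypothetical stand-in, not the authors' actual paradigm data.

# Map verb suffixes to their grammatical features, longest suffix first so
# the greedy match always picks the fullest analysis.
VERB_SUFFIXES = [
    ("wwAnY", ["ww(Past)", "AnY(3SM)"]),  # past tense + 3rd person sing. masc.
    ("wwAL",  ["ww(Past)", "AL(3SF)"]),   # past tense + 3rd person sing. fem.
]

def analyse(word: str):
    """Strip the longest matching suffix and return root plus morph features."""
    for suffix, features in VERB_SUFFIXES:
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            return [root + "(V)"] + features
    return [word]  # no analysis found; return the word unchanged

print(analyse("patiwwAnY"))  # ['pati(V)', 'ww(Past)', 'AnY(3SM)']
```

A real paradigm-based analyser would compile tables like this, one per paradigm class, into the finite state machine the slide mentions.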

Page 5: Named Entity Recognizer (NER)

• Generic engine uses Conditional Random Fields (CRFs) (a toy setup is sketched below).
• Trained on a 100,000-word corpus from various domains.
• Uses a hierarchical tagset.
• Performs with 80% recall and 89% precision.
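The slides do not name a CRF toolkit, so as an illustration only, here is how a token-level CRF NER tagger of this kind could be set up with the sklearn-crfsuite library; the feature template and the toy tags are assumptions, not the authors' configuration.

```python
# Sketch of a CRF-based NER tagger using sklearn-crfsuite (an assumed
# toolkit; the authors' implementation and feature set are not specified).
import sklearn_crfsuite

def word_features(sent, i):
    """Simple per-token features: the word, a suffix, and its neighbours."""
    word = sent[i]
    return {
        "word": word,
        "suffix3": word[-3:],
        "prev_word": sent[i - 1] if i > 0 else "<BOS>",
        "next_word": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data; the real system is trained on a 100,000-word corpus.
train_sents = [["Holi", "is", "celebrated", "in", "India"]]
train_tags = [["B-FESTIVAL", "O", "O", "O", "B-LOCATION"]]

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, train_tags)
print(crf.predict(X_train))
```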

Page 6: Query Translation

• Uses a bilingual dictionary-based approach (a pipeline sketch follows below).
• The Tamil–English bilingual dictionary has about 150K entries.
• For named entities that require transliteration, a transliteration engine is used.
• Tamil-to-English transliteration is a tough task: Tamil has few consonants.
• Transliteration is done using a statistical system based on an n-grams approach.
• The statistical system works with an accuracy of 81%.
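Below is a rough sketch of the dictionary-plus-transliteration pipeline described above; the dictionary entries, the romanized Tamil terms, and the transliterate stub are placeholders, not the authors' actual resources.

```python
# Sketch of dictionary-based query translation with a transliteration
# fallback for named entities. All resources here are placeholders.

# Toy Tamil->English dictionary; the real one has about 150K entries.
BILINGUAL_DICT = {
    "uwavi": "assistance",  # hypothetical romanized Tamil entries
    "pOr": "war",
}

def transliterate(term: str) -> str:
    """Stand-in for the statistical n-gram transliteration engine."""
    return term  # a real system would map Tamil graphemes to English ones

def translate_query(terms, named_entities):
    """Translate dictionary terms; transliterate NEs and unknown terms."""
    out = []
    for term in terms:
        if term in named_entities:
            out.append(transliterate(term))
        else:
            out.append(BILINGUAL_DICT.get(term) or transliterate(term))
    return out

print(translate_query(["uwavi", "cunAmi"], named_entities={"cunAmi"}))
```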

Page 7: Query Expansion

• The query terms are expanded using:
  – Thesaurus
  – Ontology
• Query expansion is done at two stages:
  – Before query translation
  – After query translation
• Synonyms are obtained using WordNet (a sketch follows below).
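The slides say only that synonyms come from WordNet; the minimal sketch below uses NLTK's WordNet interface, and the choice of NLTK is our assumption.

```python
# Sketch of synonym-based query expansion via WordNet, using NLTK
# (an assumed interface; the slides only say WordNet is used).
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def expand_term(term: str):
    """Collect synonyms of a query term from all of its WordNet synsets."""
    synonyms = {term}
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return sorted(synonyms)

print(expand_term("assistance"))  # includes e.g. 'aid' and 'help'
```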

Page 8: Query Expansion (2)

• Ontology is used to obtain more world knowledge, e.g. a hierarchy of festivals:

  Festivals
  ├── Hindu: Holi, Diwali, Dussera
  ├── Muslim: Ramazan
  └── Christian: Christmas

Page 9: What Is in the Ontology

• Descriptions of each entity:
  – E.g., Holi: festival of colours, good over evil
  – Deepavali: festival of lights, crackers, etc.
• We have an ontology of this type for 100 entities: festivals, sports, countries, natural calamities, person names, etc. (a toy sketch follows below).
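Below is a toy sketch of how such an ontology could drive expansion; the data structure and entries are illustrative guesses built from the festival example on the previous slide, not the actual 100-entity ontology.

```python
# Toy ontology for query expansion. Structure and entries are illustrative;
# the actual ontology covers about 100 entities with textual descriptions.
ONTOLOGY = {
    "Festivals": {
        "Hindu": ["Holi", "Diwali", "Dussera"],
        "Muslim": ["Ramazan"],
        "Christian": ["Christmas"],
    },
}

DESCRIPTIONS = {
    "Holi": ["festival of colours", "good over evil"],
    "Diwali": ["festival of lights", "crackers"],
}

def expand_entity(entity: str):
    """Expand an entity with its description phrases and sibling concepts."""
    expansion = list(DESCRIPTIONS.get(entity, []))
    for category in ONTOLOGY.values():
        for group in category.values():
            if entity in group:
                expansion += [e for e in group if e != entity]
    return expansion

# ['festival of colours', 'good over evil', 'Diwali', 'Dussera']
print(expand_entity("Holi"))
```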

Page 10: Ranking

• Here the standard Okapi BM25 ranking algorithm is used, customized to suit our needs.
• A parameter called the boost factor is introduced into the standard score calculation.
• The NEs in the query are given a boost factor of 1.5, and the original query terms a boost factor of 1.25.

Page 11: Ranking (2)

• The boost factor parameter expresses the extra weight given to particular terms in the query (a sketch follows below).
• NEs get more weight than other terms: they are given 0.5 times more weight.
• Original query terms are given 0.25 times more weight, to retain the importance of the user-given query terms.
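The slides describe the customization in words only; the sketch below shows one plausible way a per-term boost factor could enter the standard BM25 score. Multiplying each term's contribution by its boost is our reading of the slides, not a confirmed implementation detail.

```python
# Sketch of Okapi BM25 with a per-term boost factor. The standard BM25
# form is well known; where exactly the boost enters is an assumption.
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               boosts, k1=1.2, b=0.75):
    """Score one document; boosts maps a term to its boost factor."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        # NEs get boost 1.5, original query terms 1.25, other terms 1.0.
        score += boosts.get(term, 1.0) * idf * norm
    return score

boosts = {"Tsunami": 1.5, "assistance": 1.25}  # NE vs. original query term
print(bm25_score(["Tsunami", "assistance", "relief"],
                 doc_tf={"Tsunami": 3, "relief": 1}, doc_len=120,
                 avg_doc_len=100, df={"Tsunami": 40, "relief": 200},
                 n_docs=10000, boosts=boosts))
```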

Page 12: Experiments – Results (1)

• We submitted two runs.
• For query 29, “assistance after Tsunami”, expanding the terms “assistance” and “Tsunami” yields “financial assistance, relief material, manpower help, rebuilding infrastructure, government assistance, non-governmental organizations assistance, relief fund, natural calamity, Tsunami, high sea waves”.
• This expansion of the query helped increase recall; the MAP score for this query is 0.46.
• For query IDs 27 and 59 the system did not perform well.

Page 13: Experiments – Results (2)

• Query 27, “Sino-Indian relationship”, is too broad, and the query expansion is not done well due to a lack of knowledge in the ontology; what constitutes a “relationship” needs to be defined there.
• Query 59, “American citizens fight against Iraq war”, is too specific, and the document collection has more documents on the Iraq war in general than on this particular topic. The terms “Iraq war” get more weight than the terms “fight against”.

Page 14: Experiments – Results (3)

Overall results of the Tamil–English cross-lingual information retrieval system:

  MAP     R-prec   P@5     P@10    P@20    Recall
  0.4821  0.4862   0.7280  0.6960  0.6360  0.8912

Page 15: MAP Score for Each Query

[Bar chart: MAP score (y-axis, 0 to 1) per query, query IDs 1–49 on the x-axis.]

Page 16: Recall for Each Query

[Bar chart: recall percentage (y-axis, 0 to 120) per query, query IDs 1–49 on the x-axis.]

Page 17: Conclusion

• A query language analyser is used.
• The MAP scores of the two runs differ: 0.3921 and 0.4821.
• The use of the query expansion module helps increase recall.
• The results obtained are encouraging:
  – MAP: 0.4821
  – P@10: 0.6960
  – Recall: 0.8912

Page 18: References

• Mohammad Afraz and Sobha L (2008), “English to Dravidian Language Machine Transliteration: A Statistical Approach Based on N-grams”, in Proceedings of the International Seminar on Malayalam and Globalization, Thiruvananthapuram, India.
• Genesereth, M. R. and Nilsson, N. (1987), Logical Foundations of Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA.
• Vijayakrishna R and Sobha L (2008), “Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields”, in Proceedings of the International Joint Conference on Natural Language Processing Workshop on NER for South and South East Asian Languages, Hyderabad, India, pp. 59–66.
• S. Viswanathan, S. Ramesh Kumar, B. Kumara Shanmugam, S. Arulmozi and K. Vijay Shanker (2003), “A Tamil Morphological Analyser”, in Proceedings of the International Conference on Natural Language Processing, 2003, Mysore.

Page 19: Thank you!