IIIT Hyderabad’s CLIR experiments for FIRE-2008
Sethuramalingam S & Vasudeva Varma, IIIT Hyderabad, India


Outline

• Introduction
• Related Work in Indian Language IR
• Our CLIR experiments
• Evaluation & Analysis
• Future Work


Introduction

• Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia)

• The information can be text, audio, video, speech, geographical information, etc.


CLIR – Indian languages (IL) scenario

[Figure: Indian languages shown in the CLIR scenario: தமிழ் (Tamil), हिन्दी (Hindi), తెలుగు (Telugu), বাংলা (Bengali), मराठी (Marathi). Modified from source: D. Oard’s Cross-Language IR presentation]

Goal: to retrieve documents written in any IL when the user queries in one language


Why CLIR for IL?


• Internet user growth in India between 2000 and 2008: 1,100.0% (source: www.internetworldstats.com)

• Growth in Indian-language content on the web between 2000 and 2007: 700%

So, CLIR for IL becomes mandatory!


RELATED WORK IN INDIAN LANGUAGE IR


Related Work in ILIR

• ACM TALIP, 2003
– The surprise language exercises
– Task was to build CLIR systems for English to Hindi and Cebuano

“The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003


Related Work in ILIR

• CLEF 2006
– Ad-hoc bilingual track including two Indian languages, Hindi and Telugu
– Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task

“Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF 2006.


Related Work in ILIR

• CLEF 2007
– Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
– Five teams including ours participated
– Hindi and Telugu to English CLIR

“IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF 2007.


Related Work in ILIR

• Google’s CLIR system for 34 languages including Hindi


OUR CLIR EXPERIMENTS


Our CLIR experiments

• Ad-hoc cross-lingual runs: Hindi to English, and English to Hindi
• Ad-hoc monolingual runs in Hindi and English
• 12 runs in total were submitted for the above 4 tasks


Problem statement

• The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language


<top lang="hi">
<num>28</num>
<title>ईरान का परमाणु कार्यक्रम</title>
<desc>ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc>
<narr>ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए के निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी</narr>
</top>
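A topic in this format can be read with a few lines of Python. This is only a minimal sketch: the function names `parse_topic` and `build_query_text` are hypothetical, and the field combinations simply mirror the run names used later in the evaluation (Title + Desc, Title + Narr, Title + Desc + Narr).

```python
import xml.etree.ElementTree as ET


def parse_topic(xml_text):
    """Parse one <top> element into a plain dictionary (illustrative helper)."""
    top = ET.fromstring(xml_text)
    return {
        "lang": top.get("lang"),
        "num": int(top.findtext("num")),
        "title": (top.findtext("title") or "").strip(),
        "desc": (top.findtext("desc") or "").strip(),
        "narr": (top.findtext("narr") or "").strip(),
    }


def build_query_text(topic, fields=("title", "desc")):
    """Concatenate the chosen topic fields into one query string."""
    return " ".join(topic[f] for f in fields if topic[f])


if __name__ == "__main__":
    # Toy English stand-in for a topic; not taken from the FIRE topic set.
    sample = (
        '<top lang="en"><num>28</num><title> Iran nuclear programme </title>'
        "<desc> world opinion </desc><narr> talks and pressure </narr></top>"
    )
    topic = parse_topic(sample)
    print(build_query_text(topic, ("title", "narr")))
```

Choosing the fields per run keeps the rest of the pipeline unchanged while the query source varies.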


CLIR System architecture

• Query Processing module
– Named Entities identification
– Query translation using lexicons
– Transliteration
– Query Scoring
• Indexing module
– Stop-word remover
– A typical Indexer using Lucene


Named entities Identification

• Used for identifying the named entities present in the queries for transliteration

• We used
– Our CRF-based NER system (as a binary classifier) for Hindi queries
– Stanford English NER system for English queries

• Identifies Person, Organization and Location names

“Experiments in Telugu NER: A Conditional Random Field Approach”, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.


Query translation

• Using bilingual lexicons
– “Shabdanjali”, an English-Hindi dictionary containing 26,633 entries
– IIT Bombay Hindi Wordnet
– Manually collected Hindi-English dictionary with 6,685 entries


Shabdanjali - http://www.shabdkosh.com/shabdanjali

Hindi Wordnet - http://www.cfilt.iitb.ac.in/wordnet/webhwn/
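Lexicon-based query translation can be pictured as a dictionary lookup per query word, with named entities and out-of-vocabulary words set aside for transliteration. The sketch below is an assumption about how such a step might look, not the system's actual code; `translate_query` and the romanised toy lexicon are hypothetical and are not entries from Shabdanjali or the Hindi Wordnet.

```python
def translate_query(tokens, lexicon, named_entities=frozenset()):
    """Split query tokens into lexicon translations and leftovers.

    Returns (translated, untranslated). Named entities and words missing
    from the lexicon are left untranslated so that a later transliteration
    step can handle them.
    """
    translated, untranslated = [], []
    for token in tokens:
        if token in named_entities:
            untranslated.append(token)
        elif token in lexicon:
            # Keep every candidate translation; scoring can weight them later.
            translated.extend(lexicon[token])
        else:
            untranslated.append(token)
    return translated, untranslated


if __name__ == "__main__":
    # Hypothetical romanised Hindi-English entries, for illustration only.
    toy_lexicon = {
        "parmanu": ["nuclear", "atomic"],
        "karyakram": ["programme", "program"],
    }
    print(translate_query(["iran", "parmanu", "karyakram"], toy_lexicon, {"iran"}))
```

Keeping all candidate translations trades precision for recall, which matters when dictionary coverage is low.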


Transliteration

• Mapping-based approach
• For a given named entity in the source language
– Derive the Compressed Word Format (CWF)
   E.g. academia – cdm
   E.g. abullah – bll
– Generate the list of named entities & their CWFs on the target language side
– Search and map the CWF of the source language NE to the CWF of the right target language equivalent within the minimum modified edit distance


Transliteration

• Implementation
– Named entities present in the Hindi and English corpora are identified and listed
– Their CWFs are generated using a set of heuristic, rewrite and remove rules
– CWFs are added to the list of NEs

“Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM-2008.
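The mapping idea can be sketched in Python under two loud assumptions. First, the CWF rule here simply drops vowels and the letter h, which happens to reproduce the two slide examples but is not the authors' full set of heuristic, rewrite and remove rules. Second, plain Levenshtein distance stands in for the "modified" edit distance, whose details the slides do not give. The sketch also works only on romanised strings; the names `cwf`, `edit_distance` and `best_match` are hypothetical.

```python
DROPPED = set("aeiouh")  # assumed simplification of the CWF rules


def cwf(word):
    """Simplified Compressed Word Format: keep letters that are not dropped."""
    return "".join(ch for ch in word.lower() if ch.isalpha() and ch not in DROPPED)


def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for the modified edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def best_match(source_ne, target_nes, max_distance=1):
    """Return the target NE whose CWF is closest to the source CWF, or None."""
    src = cwf(source_ne)
    scored = [(edit_distance(src, cwf(t)), t) for t in target_nes]
    if not scored:
        return None
    dist, best = min(scored)
    return best if dist <= max_distance else None


if __name__ == "__main__":
    targets = ["academia", "abullah", "hyderabad"]
    print(cwf("academia"), cwf("abullah"))
    print(best_match("abdulla", targets))
```

The `max_distance` threshold is an illustrative parameter; a real system would tune it, since a loose threshold maps unrelated names onto each other.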


Query Scoring

• We generate a Boolean OR query with scored query words

• Query scoring is based on
– Position of occurrence of the word in the topic
– Number of occurrences of the word
– Numbers and years are given greater weights
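One way to picture the scored Boolean OR query is the sketch below. The weights are invented for illustration, and "position" is interpreted here as the topic field a word occurs in, which is an assumption; the slides do not give the actual scoring formula. The output uses the `term^boost` notation of the classic Lucene query syntax.

```python
from collections import Counter

# Hypothetical weights: earlier topic fields count more, numbers get a bonus.
FIELD_WEIGHTS = {"title": 3.0, "desc": 2.0, "narr": 1.0}
NUMBER_BONUS = 2.0


def score_terms(topic):
    """Sum a per-occurrence weight for every whitespace-separated token."""
    scores = Counter()
    for field, text in topic.items():
        for token in text.lower().split():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            if token.isdigit():  # numbers and years are boosted
                weight *= NUMBER_BONUS
            scores[token] += weight
    return dict(scores)


def to_boolean_or_query(scores):
    """Render scored terms as a boosted OR query, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return " OR ".join(f"{term}^{weight:.1f}" for term, weight in ranked)


if __name__ == "__main__":
    topic = {"title": "iran nuclear programme", "desc": "iran programme 2008"}
    print(to_boolean_or_query(score_terms(topic)))
```

Because every occurrence adds weight, a word repeated across fields rises in the ranking, which covers the "number of occurrences" criterion without a separate counter.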


Indexing module

• For the English corpus, stop words are removed and the remaining words are stemmed using Lucene

• For the Hindi corpus, a stop-word list of 246 words is generated from the given corpus based on frequency

• Documents are indexed using the Lucene Indexer and ranked using the BM25 algorithm in Lucene
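A frequency-based stop-word list like the one described for Hindi can be sketched as follows. This is a minimal illustration assuming whitespace tokenisation and a simple "most frequent words" cut-off; only the list size of 246 comes from the slides, and the function names are hypothetical.

```python
from collections import Counter


def build_stopword_list(documents, size=246):
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return [word for word, _ in counts.most_common(size)]


def remove_stopwords(text, stopwords):
    """Drop stop words from a text and return the remaining tokens."""
    stop = set(stopwords)
    return [token for token in text.split() if token not in stop]


if __name__ == "__main__":
    # Romanised toy corpus, for illustration only.
    docs = ["ka ka ka iran", "ka aur iran", "aur parmanu"]
    stopwords = build_stopword_list(docs, size=1)
    print(stopwords, remove_stopwords("iran ka parmanu", stopwords))
```

A purely frequency-based list can also catch frequent content words, so in practice such a list is usually reviewed by hand before use.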


EVALUATION & ANALYSIS


Evaluation

• English-Hindi cross-lingual run

Run                   MAP     GMAP    R-Prec  Bpref
Title + Desc          0.1538  0.0093  0.1687  0.1905
Title + Narr          0.1516  0.0229  0.1871  0.1918
Title + Desc + Narr   0.1432  0.0215  0.1793  0.1886


Evaluation

• Hindi-English cross-lingual run

Run                   MAP     GMAP    R-Prec  Bpref
Title + Desc          0.0907  0.0197  0.1291  0.1408
Title + Narr          0.1204  0.0366  0.1718  0.1734
Title + Desc + Narr   0.1112  0.0287  0.1541  0.1723


Evaluation

• Hindi-Hindi monolingual run

Run                   MAP     GMAP    R-Prec  Bpref
Title + Desc          0.2579  0.0427  0.2797  0.2964
Title + Narr          0.2652  0.0534  0.2845  0.3023
Title + Desc + Narr   0.2472  0.0525  0.2558  0.2773


Evaluation

• English-English monolingual run

Run                   MAP     GMAP    R-Prec  Bpref
Title + Desc          0.4416  0.3437  0.4579  0.4889
Title + Narr          0.4863  0.3989  0.4894  0.5218
Title + Desc + Narr   0.4690  0.3841  0.4707  0.5167

[Figure: English-Hindi vs Hindi-Hindi]

[Figure: Hindi-English vs English-English]

Evaluation

• Summary
– Our English-Hindi CLIR performance was 58% of the monolingual run
– Our Hindi-English CLIR performance was 25% of the monolingual run
– Our Hindi-Hindi monolingual run retrieved 52% of total relevant documents
– Our English-English monolingual run retrieved 91% of total relevant documents
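The 58% and 25% figures appear consistent with dividing the best cross-lingual MAP by the best monolingual MAP for the same target language, using the values in the tables. That reading is an inference from the numbers, not something the slides state; the short check below uses only the reported MAP values. The 52% and 91% recall figures cannot be reproduced from the tables shown.

```python
# Best MAP per task, taken from the evaluation tables.
best_map = {
    "en-hi": 0.1538,  # English-Hindi cross-lingual, Title + Desc
    "hi-hi": 0.2652,  # Hindi-Hindi monolingual, Title + Narr
    "hi-en": 0.1204,  # Hindi-English cross-lingual, Title + Narr
    "en-en": 0.4863,  # English-English monolingual, Title + Narr
}


def relative_performance(cross_map, mono_map):
    """Cross-lingual MAP as a percentage of the monolingual MAP."""
    return 100.0 * cross_map / mono_map


en_hi = relative_performance(best_map["en-hi"], best_map["hi-hi"])
hi_en = relative_performance(best_map["hi-en"], best_map["en-en"])
print(f"{en_hi:.0f}% {hi_en:.0f}%")  # prints "58% 25%"
```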


Analysis

• Our English-Hindi CLIR performance can be attributed to factors like
– Exact matching of English named entities
– Good coverage of English words in our lexicons

• The relatively lower performance on Hindi-English CLIR is due to
– Low dictionary coverage
– Query formulation was not complex enough


FUTURE WORK


Future Work

• Error analysis on a per-topic basis
• Work on more complex query formulations
• Work on other possible query translation techniques like
– Building dictionaries from parallel corpora
– Using the web
– Using Wikipedia


THANK YOU!!!
