INFORMATION SEARCHING AND RETRIEVAL (MLS 712) PREPARED FOR: ASSOC. PROF. HAJAH FUZIAH MOHD NADZAR PREPARED BY: ASYURA BINTI AMINORDIN (2012482362) MOHD IQBAL AL-FARABI B YAHYA (2012253658) DATE: DECEMBER 17, 2012 Cross Language Information Retrieval (CLIR)

Cross language information retrieval (clir)slide


Page 1: Cross language information retrieval (clir)slide

INFORMATION SEARCHING AND RETRIEVAL (MLS 712)

PREPARED FOR: ASSOC. PROF. HAJAH FUZIAH MOHD NADZAR

PREPARED BY: ASYURA BINTI AMINORDIN (2012482362)

MOHD IQBAL AL-FARABI B YAHYA (2012253658)

DATE: DECEMBER 17, 2012

Cross Language Information Retrieval

(CLIR)

Page 2: Cross language information retrieval (clir)slide

Introduction

Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. For example, a user may pose their query in English but retrieve relevant documents written in French.

http://en.wikipedia.org/wiki/Cross-language_information_retrieval

Page 3: Cross language information retrieval (clir)slide

CLIR Purpose

Researchers in Cross-Language Information Retrieval (CLIR) seek to support the process of finding documents written in one natural language with automated systems that can accept queries expressed in other languages.

Page 4: Cross language information retrieval (clir)slide

English-Chinese Information Retrieval System (ECIRS)

ECIRS is a web-based English-Chinese information retrieval system. It provides a cross-language platform that helps people retrieve Chinese information without entering a Chinese query. The web-based client-server architecture allows more users to access ECIRS over the Internet.

Page 5: Cross language information retrieval (clir)slide

Cont'd…

ECIRS consists of a client side and a server side. The client side is a web-based user interface. The server side includes bilingual dictionaries, content-based document index files, a Chinese search engine and Chinese document collections.

Page 6: Cross language information retrieval (clir)slide

Cont'd…

Client side: Allows a user to input a query in English and send it to the server side; the returned result is a list of relevant documents in Chinese.

Server side: An English-Chinese dictionary and a Chinese-English dictionary are used to translate the user's query from English into Chinese keywords.
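
The dictionary-based translation step described above can be sketched roughly as follows. This is a minimal illustration, not the ECIRS implementation: the dictionary entries, the documents, and the function names `translate_query` and `search` are all assumptions made for the example.

```python
# Toy bilingual dictionary; the entries are illustrative, not taken from ECIRS.
EN_ZH = {
    "computer": ["计算机", "电脑"],
    "network": ["网络"],
}


def translate_query(query, dictionary=EN_ZH):
    """Replace each English term with all of its Chinese candidates."""
    keywords = []
    for term in query.lower().split():
        # Terms missing from the dictionary are kept untranslated.
        keywords.extend(dictionary.get(term, [term]))
    return keywords


def search(keywords, documents):
    """Return the ids of documents that contain at least one keyword."""
    return [doc_id for doc_id, text in documents.items()
            if any(k in text for k in keywords)]


# Hypothetical Chinese document collection.
DOCS = {
    "d1": "计算机网络基础",
    "d2": "中国历史",
}

print(search(translate_query("Computer"), DOCS))  # ['d1']
```

Keeping every translation candidate, rather than picking one, is a common simple choice when the dictionary gives no way to disambiguate.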

Page 7: Cross language information retrieval (clir)slide

English - Chinese Information retrieval

Screen shot of English Chinese Information retrieval System Layout: http://www.cs.nmsu.edu/~sliu/main_frame.html

Page 8: Cross language information retrieval (clir)slide

English - Chinese Information retrieval

Sidebar of the system, where the user can choose any of the buttons provided. Example: the On-line English-Chinese Dictionary allows the user to translate an English word into Chinese.

Page 9: Cross language information retrieval (clir)slide

English - Chinese Information retrieval

Screen shot of English Chinese Information retrieval System Layout: http://www.cs.nmsu.edu/~sliu/main_frame.html

In the screenshot above, we enter the keyword we want to search for. Example: computer

Keyword : computer

Page 10: Cross language information retrieval (clir)slide

English - Chinese Information retrieval

Screen shot of English Chinese Information retrieval System Layout: http://www.cs.nmsu.edu/~sliu/main_frame.html

Translation from English into Chinese

Page 11: Cross language information retrieval (clir)slide

English Chinese Information retrieval

On-Line Chinese Information Retrieval System: the database holding all documents or information related to the information need, which here is “computer”.

Page 12: Cross language information retrieval (clir)slide

English Chinese Information retrieval

Screen shot of English Chinese Information retrieval System Layout: http://www.cs.nmsu.edu/~sliu/main_frame.html

The list of documents related to “computer”. There were 294 results.

Page 13: Cross language information retrieval (clir)slide

English Chinese Information retrieval

Screen shot of English Chinese Information retrieval System Layout: http://www.cs.nmsu.edu/~sliu/main_frame.html

Page 14: Cross language information retrieval (clir)slide

Big 5 - GB

Big 5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for Traditional Chinese characters.

GB (Guojia Biaozhun 国家标准, "national standard") is the prefix of the official character set standards of the People's Republic of China; GB 2312 is the registered internet name for a key official character set, used for Simplified Chinese characters.
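
As a small illustration of why the system offers both encodings, the sketch below encodes the same text in Big 5 and GB 2312 using Python's built-in `big5` and `gb2312` codecs. The helper name `transcode` and the sample string are assumptions made for the example.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source_encoding).encode(target_encoding)


text = "中文"  # written the same way in Traditional and Simplified Chinese

big5_bytes = text.encode("big5")
gb_bytes = text.encode("gb2312")

# The same characters are stored as different byte sequences.
print(big5_bytes.hex(), gb_bytes.hex())

# Converting Big 5 data to GB means decoding first, then re-encoding.
converted = transcode(big5_bytes, "big5", "gb2312")
print(converted == gb_bytes)
```

Transcoding only changes the byte representation; it does not convert Traditional characters into their Simplified forms, and a Traditional-only character may raise `UnicodeEncodeError` when encoded as `gb2312`.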

Page 15: Cross language information retrieval (clir)slide

Cross Language Information Retrieval

Layout of a website that people use to book hotels and flights for travel.

Page 16: Cross language information retrieval (clir)slide

Cont'd…

Users can choose any language. Example: Japanese

Page 17: Cross language information retrieval (clir)slide

Cont'd…

As we can see, the wording in the layout changes to Japanese.

Page 18: Cross language information retrieval (clir)slide

Cont'd…

Google Translate allows users to identify the meaning of the Japanese words.

EXAMPLE: MALAY-to-JAPANESE

Page 19: Cross language information retrieval (clir)slide

Cont'd…

Enter the translated word from Google Translate into the search engine of www.easytobook.com

Page 20: Cross language information retrieval (clir)slide

Cont'd…

The result list shows 131 available hotels; the wording is still displayed in Japanese.

Click any result

Page 21: Cross language information retrieval (clir)slide

Cont'd…

The description of the hotel in Kuala Lumpur is written in Japanese.

Page 22: Cross language information retrieval (clir)slide

CLIR WEBSITE EXAMPLE

http://www.cs.nmsu.edu/~sliu/main_frame.html

http://www.easytobook.com/

Page 23: Cross language information retrieval (clir)slide

CINDOR (Conceptual Interlingua Document Retrieval)

A cross-language text retrieval system capable of accepting a user's query stated in their native language and then seamlessly searching, retrieving, relevance ranking and displaying documents written in a variety of foreign languages.

CINDOR allows users of the system to state queries in any of the supported languages (currently English, French, Spanish, and Japanese) and search and retrieve documents from any of the supported languages.

Adopts a ‘Conceptual Interlingua’: a unique approach to cross-language information management based on a language-independent conceptual representation.

Page 24: Cross language information retrieval (clir)slide

CINDOR

‘Conceptual’ resource of our conceptual interlingua

The concept of “elasticity: the tendency of a body to return to its original shape after it has been stretched or compressed”, which has the label 131186, is instantiated in English and French:

131186 spring, give, springiness
131186 élasticité, flexibilité, moëlleux
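
A rough sketch of how a language-independent concept label can connect a query and a document written in different languages is given below. The concept label 131186 and its terms come from the slide; the documents, the whitespace tokenisation, and the function names `to_concepts` and `retrieve` are assumptions, and this is not CINDOR's actual implementation.

```python
# Concept label and terms are from the slide; everything else is illustrative.
CONCEPTS = {
    131186: {
        "en": ["spring", "give", "springiness"],
        "fr": ["élasticité", "flexibilité", "moëlleux"],
    },
}

# Inverted lookup: (language, term) -> concept label.
TERM_TO_CONCEPT = {
    (lang, term): concept_id
    for concept_id, by_lang in CONCEPTS.items()
    for lang, terms in by_lang.items()
    for term in terms
}


def to_concepts(text, lang):
    """Map the words of a text to the set of concept labels they express."""
    return {TERM_TO_CONCEPT[(lang, w)]
            for w in text.lower().split()
            if (lang, w) in TERM_TO_CONCEPT}


def retrieve(query, query_lang, documents):
    """Return ids of documents sharing at least one concept with the query."""
    query_concepts = to_concepts(query, query_lang)
    return [doc_id for doc_id, (lang, text) in documents.items()
            if query_concepts & to_concepts(text, lang)]


# Hypothetical French documents.
DOCS = {
    "fr1": ("fr", "la flexibilité du matériau"),
    "fr2": ("fr", "le prix du billet"),
}

print(retrieve("springiness", "en", DOCS))  # ['fr1']
```

Because matching happens on concept labels rather than words, the English query never has to be translated into French directly.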

Page 25: Cross language information retrieval (clir)slide

The Eurovision St Andrews Photographic Collection

The site presents the collection in a variety of ways: full-text search, or browsing a list of 999 pre-defined index terms organised alphabetically and hierarchically via a categories page.

The St Andrews Collection (SAC) consists of 28,133 thumbnail images (around 120x76 pixels), larger versions of these images (around 368x234 pixels), and associated captions, giving a total of 84,399 files in the main body of the collection.

Page 26: Cross language information retrieval (clir)slide

Eurovision

Photograph metadata:

(1) a unique record number,
(2) a short title,
(3) a full title,
(4) a textual description of the image content,
(5) the date when the photograph was taken (most frequently with the day, month and year),
(6) the originator, i.e. the name of an individual or company to which the photograph is attributed,
(7) the location of the photograph (e.g. the county and the country), and
(8) a line for notes to offer additional information about the photograph.
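
The eight metadata fields listed above could be modelled as a simple record type. The sketch below is illustrative only: the field names, the choice of which fields feed full-text search, and the sample record are all assumptions, not the collection's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class PhotoRecord:
    """One caption record; field names are hypothetical."""
    record_number: str
    short_title: str
    full_title: str
    description: str
    date: str
    originator: str
    location: str
    notes: str = ""

    def searchable_text(self):
        """Join the free-text fields that a full-text search might index."""
        parts = [self.short_title, self.full_title, self.description,
                 self.location, self.notes]
        return " ".join(p for p in parts if p)


# Made-up example record, for illustration only.
record = PhotoRecord(
    record_number="EX-0001",
    short_title="Harbour view",
    full_title="View of the harbour at low tide",
    description="Fishing boats moored beside the pier",
    date="circa 1900",
    originator="Unknown photographer",
    location="Fife, Scotland",
)

print("harbour" in record.searchable_text().lower())  # True
```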

Page 27: Cross language information retrieval (clir)slide

Eurovision

The St Andrews collection has been used for bilingual ad-hoc retrieval, where queries typical of this kind of historic collection were generated in English and translated into other languages, including a range of Indo-European, Asian and Romance languages.

Challenges include:

Captions that are short in length, increasing the likelihood of vocabulary mismatch.
Captions with text not directly associated with the visual content of an image (e.g. expressing something in the background).
The use of colloquial and domain-specific language in the captions (i.e. British English).

Page 28: Cross language information retrieval (clir)slide

The web interface to the St Andrews collection

Page 29: Cross language information retrieval (clir)slide

The web interface to the St Andrews collection

Page 30: Cross language information retrieval (clir)slide

CLIR University of Indonesia

Query expansion technique: pseudo relevance feedback

It assumes that the top few documents initially retrieved are indeed relevant to the query, and so they must contain other terms that are also relevant to the query.

To choose the relevant terms from the top ranked documents, we used the tf*idf term weighting formula.

We added a certain number of noun terms that have the highest weight scores.
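
The pseudo relevance feedback steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the slide does not give the exact tf*idf variant, so the sketch uses raw term frequency summed over the top documents and idf = log(N / df); it also skips the noun filter, since that would need a part-of-speech tagger. The function name `expand_query`, its parameters, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, collection, top_k=2, n_terms=1):
    """Add the n_terms highest tf*idf terms from the top_k ranked documents.

    ranked_docs: tokenised documents in initial retrieval order.
    collection:  all tokenised documents (must include the ranked ones).
    """
    n_docs = len(collection)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in collection for term in set(doc))
    # Term frequency summed over the assumed-relevant top documents.
    tf = Counter(term for doc in ranked_docs[:top_k] for term in doc)
    scores = {t: tf[t] * math.log(n_docs / df[t])
              for t in tf if t not in query_terms}
    # Highest score first; ties are broken alphabetically.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]
    return list(query_terms) + best


# Toy tokenised collection, made up for the example.
COLLECTION = [
    ["hotel", "booking", "booking", "kuala"],
    ["hotel", "room", "booking", "price"],
    ["flight", "ticket", "price"],
    ["football", "match", "ticket"],
    ["weather", "report"],
]

# Pretend the first two documents were ranked highest for the query "hotel".
print(expand_query(["hotel"], COLLECTION[:2], COLLECTION))
# ['hotel', 'booking']
```

The expanded query is then run again, in the hope that the added terms pull in relevant documents the original query missed.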

Page 31: Cross language information retrieval (clir)slide

Interface and program demo

Page 32: Cross language information retrieval (clir)slide

Interface and program demo

Page 33: Cross language information retrieval (clir)slide

INFOMAP

Chinese question classification is the process that analyzes a question and labels it based on its question type and expected answer type.

The INFOMAP inference engine is adopted to support the knowledge-based approach for Chinese questions, which can be formulated as templates, and SVM (Support Vector Machines) is used as the machine learning approach for large collections of labeled Chinese questions.

INFOMAP is a knowledge representation framework that extracts important concepts from a natural language text.

A key feature of INFOMAP is its capability to represent and match complicated template structures, such as hierarchical matching, regular expressions, semantic template matching, frame (non-linear relations) matching, and graph matching.

Using INFOMAP, we can identify the question category from a Chinese question.

Page 34: Cross language information retrieval (clir)slide

Example

Question: (In which city were the Olympics held in 2004?)

In INFOMAP this question can be formulated as a rule or template with four elements (denoted as "HAS-PART" in this rule):

"[5 Time]:[3 Organization]:[7 Q_Location]:([9 LocationRelatedEvent])"
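
A very simplified sketch of template-based question classification is shown below. It is not INFOMAP itself. The slot names come from the template on the slide, but the lexicon, the pre-segmented token list (English glosses given in Chinese word order), the reading of the parenthesised slot as optional, and the function names are all assumptions; the numbers in the template are ignored.

```python
import re

# Hypothetical lexicon mapping tokens to the semantic categories in the template.
LEXICON = {
    "2004": "Time",
    "Olympics": "Organization",
    "which-city": "Q_Location",
    "held": "LocationRelatedEvent",
}

# Ordered slots; a trailing "?" marks a slot assumed to be optional.
TEMPLATE = ["Time", "Organization", "Q_Location", "LocationRelatedEvent?"]


def build_pattern(template):
    """Compile the slot sequence into a regex over space-joined category names."""
    parts = []
    for i, slot in enumerate(template):
        optional = slot.endswith("?")
        piece = ("" if i == 0 else " ") + re.escape(slot.rstrip("?"))
        parts.append(f"(?:{piece})?" if optional else piece)
    return re.compile("".join(parts))


TEMPLATE_RE = build_pattern(TEMPLATE)


def classify(tokens):
    """Return the question category if the tokens fit the template, else None."""
    categories = [LEXICON[t] for t in tokens if t in LEXICON]
    if TEMPLATE_RE.fullmatch(" ".join(categories)):
        return "Q_Location"
    return None


print(classify(["2004", "Olympics", "which-city", "held"]))  # Q_Location
```

A question whose categories do not follow the slot order, such as `["Olympics", "held"]`, returns `None` and would fall through to other templates or to the SVM classifier.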

Page 35: Cross language information retrieval (clir)slide

Searching Demo

Page 36: Cross language information retrieval (clir)slide

Searching demo

Page 37: Cross language information retrieval (clir)slide

Searching demo

Page 38: Cross language information retrieval (clir)slide

Thank You