
Concept Search in Urdu

Kashif Riaz
University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, MN 55455
[email protected]

ABSTRACT
This paper describes a thesis proposal to do concept search in non-English, non-European languages. Urdu is chosen as the example language because of its unique nature, its morphology, and its large number of speakers. Despite its importance, Urdu does not have adequate language resources to support research in Information Retrieval (IR). It is shown that the methods used for concept searching in English are inadequate for Urdu, and some novel approaches to concept searching are presented. Pre-processing IR tasks such as stop word identification and stemming require substantial research for a morphologically rich language like Urdu. Named-entity identification is hypothesized to be useful in determining the concept a user is seeking, and the research plan includes an implementation of named-entity identification for Urdu. An Urdu language toolkit will be made available to the IR community for Urdu language processing. Finally, TREC-like evaluation criteria are presented, with relevance judgments, a test collection, and queries for Urdu IR.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic Processing, Dictionaries, Indexing methods.

General Terms
Algorithms, Experimentation, Languages.

Keywords
Urdu, Concept Search, Evaluation, Stemming.

1. INTRODUCTION
There have been tremendous advances in the Information Retrieval (IR) community since the field emerged about forty years ago [17]. Unfortunately, until the last decade almost all research was done in English. Since then the community has made some advances in other European and Asian languages, but there has still been little effort devoted to languages written from right to left (Arabic is an exception because of the unfortunate events of September 11, 2001). In the last two years there has been a strong surge of work on Indic languages in India, mostly in the area of natural language processing, but Urdu is not included in most of these initiatives [31]. I assert that there is a need to study Information Retrieval in languages that, although spoken by a large portion of the world's population, have not received attention in the IR community, primarily because of the lack of language resources such as corpora and expertise; right-to-left languages are an example. Most of these languages were, and still are, transliterated on the Web or represented as images because of missing Unicode support in operating systems or presentation software.

An Urdu word can be transliterated in many ways. For example, the world's most common first name, Mohammad, can be transliterated in at least a dozen different ways: Muhammad, Mohamed, and Mohd, to list a few. When searching for a non-English word like Mohammad, one has to query a search engine multiple times with different spellings to retrieve all the relevant material. But Mohammad can be written only one way in Urdu, Farsi, and Arabic. If such languages were indexed in their native orthography, the search experience for non-European languages would improve manyfold.

Despite the advances in search engine technology, most retrieval today is keyword based. Google, for example, represents the state of the art in search engines today, yet if one searches for the word car on Google, a relevant document that uses only the term automobile will generally not appear among the top results. Although this is a trivial example with no consequences, failing to find a relevant document can become very serious in a medical or legal domain. Consider an example from the legal domain: most of us know about the McDonald's hot coffee case, in which the McDonald's fast food chain was sued for serving coffee that was dangerously hot. An attorney representing either party will be interested in legal documents where any hot or cold beverage has caused an injury to a person, not only hot coffee. It is evident from our Web searching experience that no freely available search engine provides this functionality. This is called Concept Search. There has been considerable research in the IR community to solve this problem using bag-of-words approaches, but it has not resulted in any breakthrough search engine technology.

For my research I intend to focus on the Concept Search problem for non-English, non-European languages; within that, I will focus on languages written right to left, or languages that share their grammar with a right-to-left language (Hindi, for example, shares its grammar with Urdu). I choose Urdu as the example language because it shares aspects of vocabulary and grammar with Hindi, Arabic, and Persian; because it is the least studied of these languages; and, most importantly, because I am fluent in it.


1.1 Research Vision
The current vision of the research is to build a concept search engine with cross-language retrieval capability, where results are returned in Urdu when queried in English. While building the system, a number of retrieval and mining subtasks will be implemented: stop word identification, stemming, tokenization, named entity recognition, text classification, frequency distribution, and so on. An Urdu language processing toolkit will be built and made available to the community through the Web. Most of the literature suggests that concept search is most successful with the use of an ontology like WordNet [25] [26]. Currently no such ontology is available for Urdu; although it will be a daunting task, I intend to investigate the challenges in creating an Urdu WordNet, a goal that may be accomplished with the help of other researchers in the community. On this quest, I will compare and contrast the state of the art for concept searching and other search engine enabling algorithms (stop lists, stemmers, etc.) developed for English with their performance on an Urdu collection. In my view, one of the most important and intellectually rewarding results of this work will be to understand which features and characteristics of non-English languages, in this case Urdu, require changes in IR tasks and their algorithms in order to perform well.

2. URDU
In this section I briefly introduce some right-to-left languages and a few characteristics of Urdu. Urdu is the national language of Pakistan and one of the major languages of India. It is estimated that there are about 300 million speakers of Urdu, most of whom live in Pakistan, India, the UAE, the U.K., and the USA. Recently there has been considerable interest in right-to-left language processing in the IR community, particularly in the intelligence community and other organizations working for the government in the United States. Most of this interest has focused on Arabic (a right-to-left language). There are other right-to-left languages, such as Urdu, Persian (Farsi), Dari, Punjabi, and Pashto, that are mostly spoken in South Asia. Arabic is a Semitic language, while these other languages belong to the Indo-Iranian branch; Arabic shares only a script and some vocabulary with them. Therefore, language-specific work done for Arabic is not applicable to these languages; for example, stemming algorithms developed for Arabic will not work for a language like Urdu. The next section introduces Urdu computational processing in terms of right-to-left language processing.

2.1 Urdu Processing
Among the languages mentioned above, Urdu is unique in that it shares its grammar with Hindi; the differences lie in some vocabulary and in the writing system, as Hindi is written in the Devanagari script. Because of these similarities, Hindi and Urdu are considered one language for linguistic purposes. Urdu is a quite complex language because its grammar and morphology combine elements of many languages: Sanskrit, Arabic, Farsi, English, and Turkish, to name a few. This aspect of Urdu becomes quite a challenge when doing morphological analysis to build a stemmer; the stemming challenges are detailed in Section 4.1.2. Urdu is sometimes referred to as Hindustani, but this usage has not been seen in the language literature for years.

Because of its rich morphology and word-borrowing characteristics, Urdu is widely considered the language of poets. From the 1700s to the 1900s it was a prerequisite to learn Urdu in order to be considered a reputable poet or an intellectual. Urdu's descriptive power is quite high, meaning that a concept can be expressed in Urdu in many different ways and forms. For example, the words Pachem and Maghreb are both used for the direction west; Pachem has its ancestry in Sanskrit, while Maghreb has its roots in Arabic. Because of this high descriptive power, it is difficult for state-of-the-art concept searching algorithms like LSA and its variants to provide good precision. Urdu is considered the lingua franca of business in Pakistan and of the South Asian community in the U.K. [3]; personally, when travelling to the U.K. I rarely speak English in big cities like London or Glasgow while out and about.

Urdu has the property of accepting lexical features and vocabulary from other languages, most notably English. In linguistics this is called code-switching: it is not uncommon to see a right-to-left flow interrupted by a word written in English (left to right), followed by a continuation of the right-to-left flow. For example: وہ ميرا laptop ہے [That is my laptop]. (In the original manuscript, Microsoft Word did not handle the English embedding within the Urdu sentence and displayed it improperly; in electronic processing, however, the tokenization is done correctly [2].) Code-switching becomes an issue for stemming algorithms during prefix, infix, or suffix stripping, and Urdu stemming requires all of these stripping processes. In order to process Urdu and other right-to-left languages, Unicode encoding and proper font usage are necessary, more specifically little-endian Unicode. Becker and Riaz discuss Urdu Unicode encoding in detail in [2].
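To illustrate the tokenization point, here is a minimal sketch (my own illustration, not from the paper) that separates embedded Latin-script tokens from Urdu tokens by Unicode range; bidirectional rendering remains a separate, display-level concern:

```python
def is_arabic_script(token):
    """True if any character falls in the Unicode Arabic block
    (U+0600-U+06FF), which covers Urdu's character set."""
    return any('\u0600' <= ch <= '\u06FF' for ch in token)

def split_code_switched(sentence):
    """Partition whitespace tokens into Urdu vs embedded-English runs."""
    urdu, english = [], []
    for tok in sentence.split():
        (urdu if is_arabic_script(tok) else english).append(tok)
    return urdu, english

urdu, english = split_code_switched("وہ ميرا laptop ہے")
print(urdu)     # ['وہ', 'ميرا', 'ہے']
print(english)  # ['laptop']
```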

2.2 Hurdles of Urdu Processing
In this section I highlight a number of hurdles that can arise when computationally processing a less-studied language like Urdu. A number of natural language processing tools are available; the Natural Language Toolkit (NLTK), for example, is a Python suite of modules for language processing. Unfortunately, such tools are not equipped to process a language like Urdu. What follows is a list of the hurdles that I anticipate or have already encountered in my research.

2.2.1 Urdu Corpus Construction
A robust corpus is a prerequisite for running experiments and evaluating algorithms. Although standard corpora for various languages are available through the Linguistic Data Consortium (LDC), when I started doing research in Urdu no corpus was available. Therefore, a new corpus was built in order to do named entity extraction [1]; the details of the corpus are available in [2]. Currently, only two Urdu corpora are known to be available to the community. One is the EMILLE Lancaster Corpus, in which Urdu is one language among many; it is the more comprehensive of the two [15]. The other is the Becker-Riaz Urdu Corpus, consisting strictly of BBC Urdu news articles [2]. Both corpora are based on the Corpus Encoding Standard and are in XML format [11].

2.2.2 Current Progress
A 6-million-word Becker-Riaz corpus and a 7-million-word EMILLE corpus are available for running our algorithms and for building corpus processing tools, e.g., to see whether Zipf's law holds on Urdu data. The Becker-Riaz corpus is constantly being updated with new words and documents, while the EMILLE corpus is static.
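As an example of such a corpus-processing check, here is a minimal sketch of a Zipf's-law test (my own illustration; the `tokens` list is a toy stand-in for output from an Urdu tokenizer run over either corpus):

```python
from collections import Counter

def zipf_table(tokens, top_n=20):
    """Rank words by frequency and report rank * frequency, which
    Zipf's law predicts to be roughly constant across ranks."""
    counts = Counter(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        print(f"{rank:>4}  {word:<15} freq={freq:<8} rank*freq={rank * freq}")

# Hypothetical usage: tokens would come from a tokenizer run over
# the Becker-Riaz or EMILLE corpus; this is toy stand-in data.
tokens = ["خبر", "کی", "کی", "خبر", "کی", "ہے"]
zipf_table(tokens)
```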

2.2.3 Language Engineering Issues for Unicode Processing
As mentioned earlier, Urdu is a right-to-left language and therefore cannot be represented electronically in the ASCII character set; such languages are represented with character sets encoded in Unicode. A detailed discussion of Unicode is beyond the scope of this paper; for more information see http://www.unicode.org. Since most language processing tools are not Unicode compliant, new tools need to be developed to work with right-to-left languages, and processing Urdu requires familiarity with the Unicode standards for Arabic orthography. Not all programming languages support Unicode and bidirectional text; Java and C# in the .NET framework are well equipped to process Urdu. Operating systems also need built-in Unicode support: Windows XP and Vista support Unicode natively, whereas not all versions of Linux do. For convenient programming, debugging, and quick prototyping, a good Integrated Development Environment (IDE) is a necessity. Although Java supports Unicode, the most widely used Java IDE, Eclipse, does not support Unicode in its display console by default, and once the configuration is set up to display Urdu, it is lost after each execution of the program; this causes delay and distraction during coding. A text editor that supports Unicode and Urdu fonts is also required in order to view Urdu characters. Not all fonts support Urdu characters, and an editor that supports Arabic does not necessarily support Urdu, because Urdu has letters that are not part of Arabic or Farsi; for example, the letter corresponding to the dd in larddki (girl) is not part of either Arabic or Farsi (the same retroflex sound occurs in south Indian languages such as Tamil). UltraEdit is one text editor that supports Urdu fonts and characters.
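To illustrate why Arabic support does not imply Urdu support, here is a minimal sketch (my own illustration; the letter inventory is drawn from the Unicode Arabic block) that flags Urdu-specific letters in a string:

```python
# Urdu letters with no counterpart in Arabic, which is why an
# Arabic-capable editor or font is not automatically Urdu-capable.
URDU_ONLY = {
    '\u0679': 'tteh (retroflex t)',
    '\u0688': 'ddal (retroflex d)',
    '\u0691': 'rreh (retroflex flap, the "dd" in larddki)',
    '\u06BA': 'noon ghunna',
    '\u06D2': 'yeh barree',
}

def urdu_only_letters(text):
    """Return the Urdu-specific letters found in the text."""
    return [(ch, URDU_ONLY[ch]) for ch in text if ch in URDU_ONLY]

print(urdu_only_letters("لڑکی"))  # finds the rreh in larddki (girl)
```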

3. CONCEPT SEARCH
3.1 What is Concept Search?
The notion of a concept in IR suffers from the same dilemma as the exact definition of data mining in the data mining community: there is no clear definition. I assert that concept searching is more than keyword-based search, and that it satisfies the user's need by reducing information seeking time during the search process. State-of-the-art search engines like Google and Yahoo are keyword based. There are very few Urdu keyword-based search engines, and they simply use word matching, without any stemming, stop word removal, or language identification [33]. Keyword-based engines score and rank documents based on the presence of the query terms along with other criteria, such as PageRank in earlier versions of Google. Keyword searching increases information seeking time for the user, and this manifests itself in two ways. First, keyword search retrieves non-relevant documents that contain a keyword; e.g., the query term plane could mean airplane or coordinate plane. Second, keyword search misses documents that are relevant but do not contain the query term, as with automobile and car.

One of the biggest challenges of traditional Information Retrieval is that the terms used in the user query are not the same as the terms used in a document explaining the same concept, so the query terms may not match anything the search engine has indexed for that document. This happens because of the language phenomenon called synonymy: there is more than one way to name and describe a concept. Synonymy causes low recall. The other language phenomenon that hinders good IR performance is polysemy: two or more different concepts can be represented by the same term (homograph). An example is a term like chip, which can mean a semiconductor chip, a casino chip, or a chip in a bowl. Polysemy yields high recall but very poor precision. It has been observed that people use the same keyword for a well-defined object less than 20% of the time [5].

For the purpose of this research, a concept is defined to be more specific, more granular, than the theme or central idea of a document. The goal of concept search is to retrieve documents that fit the searcher's information need even when they do not contain the exact keywords of the query. It is very important to note that we want to satisfy both of these goals, with more importance given to the user's information need; unfortunately, that most important measure is somewhat subjective. Besides user judgments, the standard IR measures of recall, precision, and F-measure are used to assess retrieval performance.
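For reference, the standard definitions of these measures (textbook formulas, included here for completeness):

```latex
\[
\mathrm{Precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|},
\qquad
\mathrm{Recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|},
\qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
```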

3.2 State of the Art in Concept Search
One of the basic intellectual steps in this research is to understand the state of the art for concept searching in English and then to compare how those techniques perform on Urdu; the initial conjecture is that the translation of techniques to Urdu will not be smooth. Concept searching has been attempted in Information Retrieval time and time again with mixed results, with techniques ranging from manual processing to sophisticated mathematical models [4], [5], [6], [7], [21].

3.2.1 Latent Semantic Analysis
By far the most studied technique for conceptual searching is Latent Semantic Analysis (LSA) and its variants. (Latent Semantic Analysis and Latent Semantic Indexing, LSI, are used interchangeably here.) The heart of LSA is its use of Singular Value Decomposition (SVD), a technique for dimensionality reduction of the concept space; the utility of SVD for concept searching is usually examined from the linear algebra perspective. LSA can also be examined as a special case of knowledge induction, as a Bayesian regression model, and as term co-occurrence analysis [8]. LSA and its probabilistic cousin PLSA have been baked into many commercial products [21]. A number of commercially available search engines claim to be doing concept searching, and most of these products use LSA as their base along with some other technology. There are good reasons for using LSA as a starting point: primarily, the engines do not have to do any syntactic parsing in order to build the indices, and LSA claims to be language independent because it is a 'bag of words' approach that does not incorporate linguistic features or knowledge to generate the concepts.

I studied two state-of-the-art commercial concept search engines, A and B (product names have been removed to preserve anonymity). A uses Probabilistic Latent Semantic Analysis (PLSA), a form of LSA that incorporates probabilities, while B combines LSA with other, undisclosed technology. B does not support right-to-left languages; it supports only languages written in roman script. Neither program is Web based: each must be installed on the user's site, on the user's own hardware and memory, because of the high-powered computation needed for the dimensionality reduction algorithm. Experiments with these commercial engines on a large collection of English data showed unsatisfactory precision, and neither engine supported Urdu data.

3.2.1.1 LSA on Urdu Data
Since most experiments in the literature show good results for LSA, I handpicked an initial space of a small number of documents from the Becker-Riaz corpus that were related, but not obviously so; the selected documents exhibit both polysemy and synonymy. Below is the list of translated significant keywords for each document.

Doc1: prime minister, Shaukat, Aziz, Citibank, Pakistan, Musharraf, leader, meeting
Doc2: leader, America, Hindustan, Delhi, Sonia, country
Doc3: Pathan, ball, wicket, cricket, Lahore
Doc4: cricket, school, Shaukat, Lahore
Doc5: Musharraf, Pearl, Sheikh, Bharat, parliament, Pakistan
Doc6: Afghanistan, America, Sheikh, Al-Qaeda, Israel
Doc7: Shaukat, medical, study, leader, school, dean

In the term-document matrix created from these documents, Hindustan and Bharat (both names for the country India) represent synonymy. Shaukat represents polysemy because it refers both to Shaukat Aziz, the prime minister of Pakistan, and to Shaukat, the dean of Lahore's medical school, as the articles indicate. Noise enters the data through the connection of Doc 7 and Doc 1 via the term leader; it is important to note that different Urdu terms for leader were used in Doc 7 and Doc 1. Cell (i, j) contained the value 0 or 1, representing the absence or presence of keyword i in document j. After running LSA, the matching documents for the query prime minister did not rank highly. We then modified the matrix to contain the frequencies of the terms found in each document, but adding term frequencies did not make the results encouraging either; note that the query vector still contained only 1 or 0 for the presence or absence of each term. The results started to look better once weights were added to the query. This is an interesting phenomenon, because most of the literature does not mention what criteria are good for weighting the query: frequency is obviously a bad criterion, since query terms are usually never repeated, and yet that criterion improved the results considerably. When the Vector Space Model was used on queries rich in unique terms, the results were much better and the relevant document was always retrieved.
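To make the procedure concrete, here is a minimal sketch of LSA via SVD with query fold-in and query weighting (my own Python/NumPy illustration; the 5-term, 7-document 0/1 matrix below is a toy reconstruction of a few of the keywords listed above, not the full experimental data):

```python
import numpy as np

# Rows = terms, columns = Doc1..Doc7; entries are 0/1 for presence.
terms = ["prime_minister", "Shaukat", "leader", "cricket", "Lahore"]
A = np.array([
    [1, 0, 0, 0, 0, 0, 0],   # prime_minister: Doc1
    [1, 0, 0, 1, 0, 0, 1],   # Shaukat: Doc1, Doc4, Doc7
    [1, 1, 0, 0, 0, 0, 1],   # leader: Doc1, Doc2, Doc7
    [0, 0, 1, 1, 0, 0, 0],   # cricket: Doc3, Doc4
    [0, 0, 1, 1, 0, 0, 0],   # Lahore: Doc3, Doc4
], dtype=float)

k = 2                                    # reduced concept dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def fold_in(query_vec):
    """Project a query vector into the k-dimensional concept space:
    q_k = q^T U_k diag(1/s_k)."""
    return query_vec @ Uk @ np.diag(1.0 / sk)

def rank_documents(query_vec):
    """Cosine similarity between the folded-in query and each document."""
    q = fold_in(query_vec)
    docs = Vtk.T * sk                    # document coordinates, scaled
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)             # document indices, best first

# Weighted query for "prime minister" (query weighting improved results).
q = np.array([2.0, 0, 0, 0, 0])
print(rank_documents(q))
```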

3.2.2 Further Research in Concept Search for Urdu
3.2.2.1 Exploring Language Modeling
For Information Retrieval purposes, language models (LMs) have proved quite successful. LM-based retrieval can be viewed in terms of a query generation process rather than as a search over a collection for relevant documents: query generation is viewed as randomly sampling the queries that could be generated from a document. In other words, what is the probability that this document could have generated this query? The documents with the highest probability of generating the query are ranked high in the result list. Because of the high descriptive power of Urdu, I hypothesize that the language modeling approach to document retrieval could prove useful for concept searching. My research in this area is in its initial stages.
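As a concrete illustration of the query-generation view, here is a minimal sketch of query-likelihood scoring with Dirichlet smoothing (a standard LM formulation, not the author's implementation; the Urdu tokens in the usage example are toy stand-ins):

```python
import math
from collections import Counter

def lm_score(query_tokens, doc_tokens, collection_counts, collection_len, mu=2000):
    """log P(query | document) under a Dirichlet-smoothed unigram model:
    P(w|d) = (c(w,d) + mu * P(w|C)) / (|d| + mu)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w in query_tokens:
        p_wc = collection_counts.get(w, 0) / collection_len   # collection model P(w|C)
        p_wd = (doc_counts.get(w, 0) + mu * p_wc) / (doc_len + mu)
        if p_wd > 0:                                          # skip words unseen everywhere
            score += math.log(p_wd)
    return score

# Toy usage: rank two tiny "documents" for the query "وزیر اعظم" (prime minister).
collection = ["خبر", "وزیر", "اعظم", "کرکٹ", "خبر", "وزیر"]
cc = Counter(collection)
docs = [["وزیر", "اعظم", "خبر"], ["کرکٹ", "خبر"]]
query = ["وزیر", "اعظم"]
ranking = sorted(range(len(docs)), key=lambda i: -lm_score(query, docs[i], cc, len(collection)))
print(ranking)  # document 0 should rank first
```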

3.2.2.2 Use of an Ontology/Thesaurus for Query Expansion
I plan to use an ontology or a simple thesaurus to enhance the standard query expansion techniques that are used to increase recall. Generalized query expansion does not produce good results for concept searching, for the following reasons. The first involves a type of synonymy: e.g., a phone is a device, i.e., phone is a hyponym (hierarchical relationship) of device. This relationship is difficult to capture without creating thesauri, and if it is not captured then precision will be too low, since the query will also retrieve computers [32]. The second involves polysemy: e.g., the term cell in a query can have several meanings, such as jail cell, cell phone, or cell of a body. Regional variants of term usage can also become a problem; for example, the word cell means battery in my lexicon. Other examples are tube and subway, pop and soda, etc. Since there is no Urdu WordNet, the plan is to create a cluster-based thesaurus as suggested by [9], or to use other innovative query expansion techniques, such as addressing query-document term mismatch as suggested by [30]. A WordNet for Hindi was recently created, but it cannot be used for Urdu tasks: using it requires adeptness in Hindi orthography and more than a passing knowledge of Hindi reading, parts of speech, and grammar nomenclature, whereas the nomenclature of Urdu grammar is derived from Persian and Arabic.
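Here is a minimal sketch of the intended thesaurus-based expansion (my own illustration; the two-entry thesaurus is a hypothetical placeholder for the planned cluster-based Urdu thesaurus, and down-weighting expansion terms is one common way to limit the precision loss discussed above):

```python
# Hypothetical thesaurus entries for the Pachem/Maghreb example
# (both words mean "west"; spellings here are illustrative).
THESAURUS = {
    "پچھم": ["مغرب"],
    "مغرب": ["پچھم"],
}

def expand_query(tokens, thesaurus, weight=0.5):
    """Return (term, weight) pairs: original terms at weight 1.0,
    thesaurus synonyms at a reduced weight to limit precision loss."""
    expanded = [(t, 1.0) for t in tokens]
    for t in tokens:
        expanded.extend((syn, weight) for syn in thesaurus.get(t, []))
    return expanded

print(expand_query(["مغرب"], THESAURUS))
```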

3.3 Named Entity Extraction
A large number of query terms contain names of people, geographic locations, and many other proper nouns; the IR and NLP communities call these proper nouns named entities. Although named entities are typically not associated with concept search, recognizing such entities can increase precision considerably. Names are often not indexed because they receive high IDF values due to their rarity in the corpus; e.g., I know of a premier commercial search engine that does not index any word whose IDF exceeds a threshold, because it deems such a word to be a misspelling or a name (the name of this search engine is omitted to preserve anonymity). Over the years there has been considerable research in this area, much of it with interesting results, but mostly in English and European languages. Most commercial engines and available software use capitalization rules and statistical methods. There is no reported research in Urdu named entity recognition except [1], which only suggests some approaches. Urdu has no capitalization rules or proper noun markers, so named-entity extraction is quite challenging. I plan to use a rule-based approach to recognize names.
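As a flavor of the planned rule-based approach, here is a minimal sketch (my own illustration; the cue words and the single context rule are hypothetical) that uses title words to mark candidate names, since Urdu offers no capitalization cues:

```python
# Hypothetical cue words: titles that often precede a person's name.
NAME_CUES = {"صدر", "جنرل", "ڈاکٹر"}   # president, general, doctor

def tag_names(tokens):
    """Mark a token as a candidate name if it follows a cue word."""
    names = []
    for i, tok in enumerate(tokens):
        if i > 0 and tokens[i - 1] in NAME_CUES:
            names.append(tok)
    return names

print(tag_names("صدر مشرف نے کہا".split()))  # ['مشرف'] ("President Musharraf said")
```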

4. ENABLING TECHNOLOGIES
In order to accomplish our goals and research milestones, a number of enabling technologies need to be in place; I briefly discuss them below along with their progress. First of all, a search engine needs to be built. This search engine will provide Boolean keyword search, ranked retrieval, and a concept searching mechanism; these implementations are required in order to properly benchmark different search algorithms. Search engine technology requires several subcomponents, such as a clean standard corpus, corpus processing and mining tools, a stop word identification algorithm, a stemming algorithm, an indexing mechanism, and preferably a UI. A high-level pictorial of these subcomponent tasks is shown below.
[Figure: high-level view of the search engine subcomponents.]

4.1 Status of Enabling Technologies for Urdu
In this section the research status of some of the perceived enabling technologies is explained. The goal was to use freely available tools, such as stop lists and stemmers like the Porter Stemmer, but unfortunately no such tools exist for Urdu, either free or commercial. I anticipate devising both a stop word list and an Urdu stemmer.

4.1.1 Stop Word Identification
A natural language is composed of two types of words: content words, which have meaning associated with them, and function words, which do not. A stop word list, also called a negative list, is used to identify function words that do not need to be indexed because no one uses them as query words. More precisely, stop words are words that, if used in a query, would return a very large number of documents, possibly the whole corpus; if too many documents are returned, no Information Retrieval is accomplished. In other words, the keywords chosen for building the inverted index should discriminate between documents by occurring neither too often nor too seldom; this is called resolving power in the IR literature [9]. Stop word identification is treated as an IR task because these words do not need to be indexed, which saves disk space. The research goal is to automatically produce a list of stop words from the Becker-Riaz and EMILLE corpora. I assert that stop words are words that are uniformly distributed across all documents, and it is instructive to compare this phenomenon between English and Urdu corpora. Using set intersection across sample English documents, stop words such as at, of, and the were generated. The same algorithm was run on a subset of the Becker-Riaz Urdu corpus and the EMILLE Urdu written corpus: the Becker-Riaz set of documents did not yield a single stop word (i.e., not a single word is common to all documents), and the set of documents from the EMILLE corpus returned only the word کی (feminine possessive). A TF-IDF strategy was then used to identify the words that are distributed uniformly across the corpus; this algorithm achieved 87.5% precision on the Becker-Riaz corpus and 91% precision on the EMILLE corpus. The details of the algorithm, and the stop words for a subset of the Becker-Riaz and EMILLE corpora, can be found in [28].
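Here is a minimal sketch of the two strategies just described: set intersection, and a document-frequency proxy for uniform distribution (my own illustration with toy data and a hypothetical threshold; the actual TF-IDF-based algorithm is detailed in [28]):

```python
from collections import Counter

def stopwords_by_intersection(docs):
    """Words common to every document in the sample."""
    common = set(docs[0])
    for d in docs[1:]:
        common &= set(d)
    return common

def stopwords_by_df(docs, df_threshold=0.9):
    """Words whose document frequency exceeds a threshold, i.e.
    words distributed (nearly) uniformly across the corpus."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    return {w for w, c in df.items() if c / n >= df_threshold}

# Toy tokenized documents standing in for corpus samples.
docs = [["کی", "خبر", "ہے"], ["کی", "کرکٹ", "ہے"], ["کی", "وزیر"]]
print(stopwords_by_intersection(docs))  # {'کی'}
print(stopwords_by_df(docs, 0.6))       # {'کی', 'ہے'}
```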

4.1.2 Stemming
Stemming is a standard Information Retrieval task. One of the major utilities of a stemmer is to enhance the recall of a search engine; it is a necessity for any search engine, but crucial for an engine that does concept searching, where it also helps precision. A stemmer is a computer algorithm that reduces words to their stem, base, or root form, and stemming has been a long-standing problem in Information Retrieval. The process, often called conflation, is useful in search engines, natural language processing, and other word processing problems. For example, a stemming algorithm reduces the words fishing, fished, fish, and fisher to the root word fish. Conflation helps improve recall because someone looking for fishing is most probably also interested in fish. The most commonly used English stemmer is the Porter Stemmer, which is available for many European languages but not for Urdu. There have been some attempts to analyze Urdu computationally, but all of them target Urdu sentence structure. The Computing Research Laboratory (CRL) has an Urdu morphological analyzer that is based on Arabic [29]; my research indicates that the CRL analyzer does not analyze Urdu words properly. Since Urdu is a right-to-left language with Arabic orthography and has loan words from languages such as Punjabi, Hindi, Farsi, English, and Turkish, building an Urdu stemmer is quite challenging: one needs to be aware of the morphology of every language from which Urdu borrows. There are two ways to build a stemmer for Urdu, statistical and rule-based; at this point only the rule-based approach has been explored. The morphological issues in building an Urdu stemmer are detailed in [28]. A small prototype has been built that incorporates a few rules for processing plurals and possessives; it also has a rule for leaving a word unstemmed. One very interesting phenomenon was observed while implementing the plural rule: there are root forms in Urdu that are shared among the indigenous form, the Persian form, and the Arabic form, and a word may have to be processed by each of the rules (not just one of them) to reach the proper root form. A side effect of this process is that recall increases but precision decreases considerably, because polysemy is introduced. The following example illustrates the phenomenon.

If we derive the root form of the Urdu word khabr خبر (meaning news in English), it could be arrived at from many different source words with different meanings; the unstemmed word could be a loan word from another language or indigenous to Urdu. For example:

اخبار (akhbar): newspaper in Urdu and Persian, but the plural of news in Arabic

خبر (khabr): news in Urdu, Arabic, and Persian – the root form

خبريں (khabrain): plural of news in indigenous Urdu and Hindi

اخبارات (akhbarat): plural of newspaper in Urdu and Persian
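Here is a minimal sketch of rules that reach the root خبر from each surface form above (my own illustration; the rule set and its guard conditions are hypothetical simplifications of the morphology detailed in [28]). Note that اخبارات must pass through several rules in sequence, which is exactly the phenomenon described above:

```python
def strip_suffix(word, suffixes=("يں", "ات")):
    """Strip a plural suffix: khabrain -> khabr, akhbarat -> akhbar."""
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s) + 2:
            return word[: -len(s)]
    return word

def strip_broken_plural(word):
    """Arabic af'aal broken-plural pattern: a leading alif plus an
    infixed alif, e.g. اخبار (akhbar) -> خبر (khabr)."""
    if word.startswith("ا") and "ا" in word[1:]:
        body = word[1:]
        i = body.rindex("ا")
        return body[:i] + body[i + 1:]
    return word

def stem(word):
    """Apply every rule in sequence (not just the first that fires)."""
    return strip_broken_plural(strip_suffix(word))

for w in ["اخبار", "خبر", "خبريں", "اخبارات"]:
    print(w, "->", stem(w))   # each form reduces to خبر
```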

I anticipate building a fully functional Urdu stemmer for my research.


5. EVALUATION
One of the major hurdles for Urdu processing is the lack of a baseline evaluation mechanism for results. The evaluation hurdle has two aspects. The first concerns the usual mathematical measures such as recall, precision, and F-measure: in this section I address creating a gold standard for Urdu of the kind that is typically available for other languages; e.g., TREC is a well-known institution that provides gold standards and relevance judgments for IR evaluation. Information retrieval is fundamentally a user-driven task in which the search engine tries to satisfy users' information needs; this is evident from users reissuing a query after unsatisfactory results. Unfortunately, there are very few researchers in the IR community who know both Urdu and IR concepts. This was evident from reviews of papers submitted to different conferences, where one reviewer was a native speaker but did not understand the meaning of stop words, even though it was explained in the paper. Although this is a hurdle, it is not insurmountable: it can be addressed by explaining the results and the IR concepts in reports and papers. Yet when this strategy was used, another reviewer commented that they knew about stemming and that explaining it wasted space in the paper; it looks like a catch-22. Similarly, when I wanted to have the Becker-Riaz and EMILLE corpora tagged by a third party for identification of stop words, I spent a great deal of time explaining to the native speakers what stop words are and how they should be tagged. The second aspect of evaluation is that no baseline is available for mining tasks such as named entity extraction or text classification. A great deal of time must be spent building baselines for mining and retrieval tasks by manually tagging relevant features so that recall, precision, and similar measures can be calculated accurately.

5.1 Creating an Urdu Baseline for Evaluation
I have created a TREC-like evaluation test bed for Urdu on a much smaller scale. The test bed contains a corpus of 200 Becker-Riaz documents, information requests (queries), and relevance judgments created by university students in Pakistan and life-long news readers in the United States. The judges varied in age, life experience, and demographics, and their information needs were diverse; the kappa statistic is used to check the agreement between them. We chose 200 documents from the much larger Becker-Riaz corpus because it is extremely hard to create relevance judgments and topics for thousands of documents with no funding. Besides measures like recall and precision, I intend to use the F-measure, novelty ratio, and coverage ratio [20]. We evaluated a home-grown Boolean retrieval engine and the Apache Lucene [22], Lemur [23], and Terrier [24] search engines. Lemur (a C++-based engine) had encoding and XML processing issues: when documents were transformed into TREC format, Lemur indexed the Urdu correctly, but at retrieval time it did not recognize the query. Transforming the XML documents into a TREC-like format is undesirable in any case, because a lot of useful metadata is lost that way. I found Lucene to be the most Unicode-, XML-, and metadata-aware search engine. Eventually, TREC-like pooling will be used to find the common relevant documents across Lucene, Terrier, and the home-grown engine.
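For agreement between a pair of judges, Cohen's kappa can be computed as in this minimal sketch (standard formula; the toy labels below are stand-ins, not the actual judgments):

```python
from collections import Counter

def cohen_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

judge1 = [1, 1, 0, 1, 0, 0, 1, 0]   # 1 = relevant, 0 = not relevant
judge2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohen_kappa(judge1, judge2), 3))  # 0.5 for this toy data
```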

6. CONCLUSION
As the reader may have realized, I have set myself a tall order in attempting concept searching in Urdu. To alleviate some of the difficulty, I plan to use freely available tools where they exist and to hand-craft some of the language resources, such as dictionaries for query expansion. The goal of this research is to examine the characteristics of IR algorithms and IR subcomponents across different languages. I believe the field of Information Retrieval will benefit from learning about the challenges of building IR tasks for a morphologically rich language like Urdu.

7. REFERENCES
[1] D. Becker, B. Bennett, E. Davis, D. Panton, and K. Riaz. "Named Entity Recognition in Urdu: A Progress Report". Proceedings of the 2002 International Conference on Internet Computing, June 2002.
[2] D. Becker and K. Riaz. "A Study in Urdu Corpus Construction". Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization at the 19th International Conference on Computational Linguistics, August 2002.
[3] P. Baker, A. Hardie, T. McEnery, and B. D. Jayaram. "Corpus Data for South Asian Language Processing". Proceedings of the 10th Annual Workshop for South Asian Language Processing, EACL 2003.
[4] M. Berry, S. Dumais, and G. O'Brien. "Using Linear Algebra for Intelligent Information Retrieval". Technical Report, University of Tennessee, Knoxville, 1994.
[5] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[6] T. Landauer, P. Foltz, and D. Laham. "An Introduction to Latent Semantic Analysis". Discourse Processes, 25, pp. 259-284, 1998.
[7] H. Haav and T. Lubi. "A Survey of Concept-Based Information Retrieval Tools on the Web". Fifth East-European Conference on Advances in Databases and Information Systems, 2002.
[8] R. Story. "An Explanation of the Effectiveness of Latent Semantic Indexing by Means of a Bayesian Regression Model". Information Processing and Management, vol. 32, no. 3, pp. 329-344, 1996.
[9] R. K. Belew. "Finding Out About". Cambridge University Press, 2000.
[10] C. Fox. "Lexical Analysis and Stoplists". In Information Retrieval: Data Structures & Algorithms, pp. 102-130, Prentice Hall, 1992.
[11] N. Ide and C. Brew. "Requirements, Tools, and Architectures for Annotated Corpora". Proceedings of Data Architectures and Software Support for Large Corpora, European Language Resources Association, Paris, 2000.
[12] R. Lo, B. He, and I. Ounis. "Automatically Building a Stopword List for an Information Retrieval System". 5th Dutch-Belgian Information Retrieval Workshop (DIR), 2005.
[13] J. Savoy. "A Stemming Procedure and Stopword List for General French Corpora". Journal of the American Society for Information Science, 1999.
[14] Z. Xiao, A. McEnery, P. Baker, and A. Hardie. "Developing Asian Language Corpora: Standards and Practice". Proceedings of the 4th Workshop on Asian Language Resources, Sanya, China, March 25, 2004.
[15] W0038: The EMILLE Lancaster Corpus. [cited 2005 July 15]. Available: http://www.elda.org/catalogue/en/text/W0038.html
[16] R. Schmidt. "Urdu: An Essential Grammar". Routledge Publishing, 2005.
[17] A. Singhal. "Modern Information Retrieval: A Brief Overview". IEEE Data Engineering Bulletin, 2001.
[18] K. Taghva, R. Beckley, and M. Sadeh. "A Stemming Algorithm for the Farsi Language". Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), 2005.
[19] A. Mokhtaripour and S. Jahanpour. "Introduction to a New Farsi Stemmer". CIKM 2006.
[20] R. Baeza-Yates and B. Ribeiro-Neto. "Modern Information Retrieval". Addison Wesley, 1999.
[21] M. Deniston. "An Overview and Discussion of Concept Search Models and Technologies". Engenium's Semetric (White Paper), 2003.
[22] Lucene. http://lucene.apache.org/ (July 2008)
[23] The Lemur Toolkit for Language Modeling and Information Retrieval. http://www.lemurproject.org/ (July 2008)
[24] Terrier. http://ir.dcs.gla.ac.uk/terrier/ (July 2008)
[25] T. Pedersen, S. Patwardhan, and J. Michelizzi. "WordNet::Similarity - Measuring the Relatedness of Concepts". Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pp. 1024-1025, San Jose, CA, July 25-29, 2004.
[26] X. Song, C.-Y. Lin, and M.-T. Sun. "Speech-Based Visual Concept Learning Using WordNet". IEEE International Conference on Multimedia and Expo (ICME 2005), vol. 6, pp. 1138-1141, July 2005.
[27] K. Riaz. "Challenges in Urdu Stemming". Future Directions in Information Access, Glasgow, August 2007.
[28] K. Riaz. "Stop Word Identification in Urdu". Conference on Language and Technology, Bara Gali, Pakistan, August 2007.
[29] http://crl.nmsu.edu/Resources/lang_res/urdu.html (July 2008)
[30] T. Custis and K. Al-Kofahi. "A New Approach for Evaluating Query Expansion: Query-Document Term Mismatch". SIGIR '07, Amsterdam, Netherlands, 2007.
[31] P. Majumder, M. Mitra, S. K. Parui, and P. Bhattacharyya. The First International Workshop on Evaluating Information Access (EVIA 2007), Tokyo, Japan, May 15, 2007.
[32] I. Moulinier and P. Jackson. "Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization", 2nd Edition, John Benjamins Publishing Company, 2007.
[33] http://hamariweb.com/ (July 2008)