
Master of Science in Informatics at Grenoble
Master Mathématiques Informatique - spécialité Informatique

option Artificial Intelligence and Web

Word Embeddings for Information Retrieval

Bhaskar Chatterjee
24/06/2016

Research project performed at MRIM, GETALP, LIG Lab

Under the supervision of:
Jean Pierre Chevallet and Christophe Servan, LIG

Defended before a jury composed of:
Prof. James L. Crowley

Prof. Edmond Boyer
Prof. Dominique Vaufreydaz
Prof. Jean-Sebastien Franco

Prof. Laurence Nigay
Prof. Thomas Ropars

Prof. Cyril Labbe

June 2016


Abstract

Recent research on word embeddings learned by deep neural networks has gained a lot of attention in the natural language processing domain. These word embeddings not only provide a good word representation but also capture rich similarities between words based on their context. This work presents a state-of-the-art word embedding learning technique, word2vec, and studies how to improve textual retrieval effectiveness by using the rich semantic similarities between words. In addition, we discuss the use of word embeddings in language models, a state-of-the-art approach for matching queries and documents, and propose some tests to validate the effectiveness of word embeddings in textual information retrieval.


Contents

1 Introduction

2 State of the art
  2.1 Techniques that attempt to tackle term mismatch
  2.2 Techniques for computation of term relations from raw data
  2.3 Language Modelling approach in IR
  2.4 Related work

3 Proposed approach
  3.1 Retrieval toolkit and Collection used
  3.2 Experimental setup
  3.3 Evaluation metrics
  3.4 Retrieval models and results
  3.5 Results comparison

4 Conclusion and future work

Bibliography


1 Introduction

Information Retrieval (IR) is concerned with finding, in a collection, the documents that best match a user's needs as expressed by the user's queries. In IR, relevance is defined as how well a set of retrieved documents matches the information need of the user. Relevance can be concerned with retrieval time or the novelty of the result; this type of relevance is called user relevance. System relevance can be defined over the documents which the retrieval engine retrieves to satisfy the user's need, usually against some query. The fundamental goal in IR is to make the system relevance match the user relevance as closely as possible. Classical retrieval models retrieve documents based on exact matching between queries and documents: only the query terms are matched against the document. Exact matching of terms between queries and documents is not able to capture the semantic relationships between terms. When users formulate a query with terms different from those contained in the documents, even though they convey the same meaning, retrieval suffers from the term-mismatch problem. For example, a user interested in Obama's visit to Europe and the treaties signed fires the query "USA president visit in Europe and political consequences", but a French newspaper contains documents like "Obama visits Paris and new laws for import are signed", and a British newspaper contains documents like "London and Washington signed new visa reforms after Obama's visit". In both cases the user will not be able to retrieve these documents, since none of the query words occur in them, even though "US President" signifies "Obama", "Paris" and "London" signify Europe, and "import laws" and "visa reforms" signify "political consequences".

Several techniques have been proposed to overcome the problem of term mismatch. Some of the notable techniques are relevance feedback [35], dimension reduction techniques like LDA [36], and integrating term similarity into retrieval models. It is easy for humans to judge similar words after seeing the context; by context we mean the sentence or set of terms surrounding the target term. For retrieval engines, however, it is a difficult task to guess or understand the context. Semantic information about the relatedness of words can be obtained from an external knowledge source1. But there are some challenges with these external knowledge sources. Not only are they expensive both in terms of time and money, they also need constant updating as language evolves over time. Another problem is that choosing similar words from these (quite general-purpose) resources sometimes brings noise: words can be similar yet unrelated in the context2. For example, if a user issues a query which means "health benefits of milk", the relevant documents should include the health benefits of milk and maybe of other dairy products like yogurt, cheese, etc., but not documents about "cows".

1. External knowledge can be a thesaurus or an external lexical database like WordNet.
2. A context in text usually means the sentence or window of words surrounding the target word.

To overcome the shortcomings of these external resources we propose to use word2vec [6], a neural-network-based approach to learn word embeddings. Word embeddings are language modelling techniques where words are mapped to real-valued vectors in a space of low dimension relative to the vocabulary size, so that words exhibit certain features (semantic relatedness, word association, etc.) in this low-dimensional space. These learned word embeddings – as mentioned by the author [6] – contain semantic similarities between words. Earlier techniques for computing word embeddings through dimensionality reduction, such as matrix factorization, suffered from major problems: they were computationally expensive, even with good hardware (say a 32-core system with 128 gigabytes of memory) they could take days to complete, and for very large corpora (roughly 5-10 gigabytes) computing the embeddings was impossible. The word embeddings proposed by Tomas Mikolov [5] are fast to train: it took us 5 hours to build the knowledge resource on a 16-core system with 128 gigabytes of RAM. We assume these word embeddings should reduce the noise when choosing similar words, since they are trained on the collection itself and are learned from the contexts of words.

In this thesis we propose two approaches to exploit the semantic space of the word embeddings: one is to use this similarity in a classical probabilistic IR model, and for the second approach we propose a vector space model. Due to time constraints we were not able to build the system for our second approach, but we provide our proposition. For our first approach, to integrate the similarities into retrieval, we have chosen ALMasri and Chevallet's [18] probabilistic language model. Their model is based on extensive document matching, where each query term is matched against all documents, unlike the classical approach of matching query terms only against the documents that contain them. This thesis discusses in detail the impact of integrating the semantic features of neural-network-based word embeddings into information retrieval. To this end we propose a series of experiments to check the usefulness of the semantic space of the word embeddings.



2 State of the art

An information retrieval system, in the most bare-bones terms, returns a ranked list of documents from a collection that match a given query in the best possible way. Queries are either short or long depending on the domain of retrieval (e.g. medical and legal domains may contain long queries whereas web searches contain really short queries). Usually the number of words contained in a document is much higher than in a query, so queries provide shallow information while documents contain a lot of information. The fundamental problem in information retrieval is how to capture maximum information from the query and match it with documents that contain the related information. In English, many different words convey similar meanings, and most of the time words in queries are similar to different words in the documents. For example, take the query "Obama visits Europe" and a document containing sentences like "The president of the United States of America arrived in Germany to discuss rising environmental concerns". Even though the query terms are semantically related to the document, probabilistic retrieval methods will fail to capture the similarities. This issue is known as term mismatch, and it has an impact on the recall of most information retrieval systems. Recent research formally defined the term mismatch probability and showed that, on average, a query term mismatches (fails to appear in) 40% to 50% of the documents relevant to the query [2]. The situation gets worse when several query terms do not match the terms in a document; in this case the number of relevant documents retrieved degrades quite fast. Even when search engines do not require all query terms to appear in result documents, including a query term that is likely to mismatch relevant documents can still cause problems. The retrieval model will penalize the relevant documents that do not contain the term and at the same time favour documents (false positives) that happen to contain the term but are irrelevant. Since the number of false positives is typically much larger than the number of relevant documents for a topic, these false positives can appear throughout the ranked list, burying the truly relevant results.
This thesis discusses this vocabulary mismatch problem and our proposed way to overcome it. There are various techniques that attempt to solve the problem of mismatch.


2.1 Techniques that attempt to tackle term mismatch

2.1.1 Relevance feedback and pseudo relevance feedback

In relevance feedback the user is involved in the retrieval process. The underlying idea of relevance feedback is that it is difficult to produce a good query if the user does not have an idea of the collection. Relevance feedback is an iterative process where the user first gives an initial query, then judges the retrieved documents, and then reformulates the query for a better retrieval. In such cases, relevance feedback can be effective in understanding the user's own information needs: after seeing some documents, the user can refine his or her own understanding of the information being sought. The Rocchio algorithm [28] is the classic algorithm for implementing explicit feedback; it enables the user to select the relevant documents in order to reformulate the original query.

Relevance feedback is not effective in real-world scenarios, where users do not like to iterate over writing queries and checking documents in order to obtain a good query formulation.

Pseudo relevance feedback is also called blind relevance feedback, since it performs the iterative process of query refinement through local analysis. By local we mean without the knowledge of an external resource or the user's involvement. The method first does a normal retrieval to find an initial set of relevant documents, then the top k documents are selected. Terms are selected from these documents and the query is reformulated; the new query is then used to retrieve the relevant documents. Pseudo relevance feedback has been shown to give good results in the past: the TREC-3 adhoc competition was dominated by such expansion techniques, where it performed quite well [20]. There is also a problem with this technique: if the top retrieved documents are not relevant, the original query's topic might drift in an unintended direction.

2.1.2 Dimension reduction

The main idea behind dimension reduction is minimizing the semantic gap between the query and the document. Some of the notable techniques are stemming, Latent Semantic Indexing (LSI), and conceptual indexing. These techniques try to increase the chance that query and document represent the same topic or concept even when they use different terms. Ahmed [29] performs a stemming method according to the context of the query, which helped to improve the accuracy and the performance of their retrieval. Stemming is a process of reducing a word to its root or base form. For example, browse or browsing becomes brows after stemming. Sometimes after stemming the original word is lost, and sometimes different word forms take the same root: for example "accusation" and "accustom" both take the root "accus" after stemming with the Porter stemmer.

Deerwester [9] proposed to solve query mismatch by representing the query and the document in a latent semantic space, in which each term is grouped with its similar terms and similar words tend to share the same space. LSI uses singular value decomposition, a mathematical technique for matrix factorization, on a matrix of terms and documents. This matrix is very large; its size is usually the number of words in the collection times the number of documents. The method is very computationally intensive and practically impossible to perform on large collections.

The success of dimension reduction techniques depends on the application domain and the characteristics of the studied collection. Also, reducing the dimension can result in an oversimplified term space that may harm the expressiveness of the language and could lead to unrelated terms being incorrectly grouped together.

2.1.3 Query expansion with external source

To improve the results that best reflect the user's interest, a query is expanded with additional relevant terms and the terms in the expanded query are re-weighted. This way one can retrieve additional relevant documents. The additional terms can be obtained from a thesaurus, a lexical database like WordNet, an automatically generated thesaurus, word embeddings, etc.

There are problems with using a manual thesaurus or a lexical database: these vocabulary resources are very expensive to construct in both time and money. They are also very difficult to keep up to date, since new words are invented all the time, and for certain languages no such vocabulary resource may exist. One way to avoid these problems is to use fast computational methods that infer term associations from data. Automatic thesaurus generation and word embeddings are two ways to compute the relationships between terms.

2.2 Techniques for computation of term relations from raw data

2.2.1 Automatic generation of thesaurus

As an alternative to a manual thesaurus we can compute a thesaurus automatically, in a cost-effective way, from a large document set. There are two ways to compute the thesaurus: one simply exploits the word co-occurrence matrix by counting text statistics for similar words; the other exploits the grammatical properties of the language by performing grammatical analysis to find grammatical relationships. The idea behind such approaches is that words which occur in similar contexts have semantic similarities; for example grass, cattle, herbivores, milk, red meat, etc. will all relate to cows. A thesaurus generated using word co-occurrences is more robust, while a thesaurus generated using grammatical relationships is more accurate. The simplest example of such a thesaurus would be one containing counts of the words that follow the target words; this way we obtain probabilities of word pairs.

2.2.2 Latent semantic indexing and latent dirichlet allocation

Most retrieval methods in IR systems are based on the assumption of term independence. For example, the vector space model (VSM) assumes that documents are embedded in a mutually orthogonal term space, while probabilistic models such as BM25 or the language model (LM) assume that terms are sampled independently from documents. Standard approaches in IR take term association into account in two ways: one involves a global analysis over the whole collection of documents (i.e. independent of the queries), while the other takes into account local co-occurrence information of terms in the top-ranked documents retrieved in response to a query. Our approach is mostly based on the global analysis over the collection. LDA (Latent Dirichlet Allocation) [10] and LSI (Latent Semantic Indexing) [9] are approaches that allow us to compute term associations over the collection, but they do so at the document level and do not take the local context of words (a sentence or just the n surrounding words) into account: the context in LSI and LDA is the document. In latent semantic analysis, documents are represented in a term space of reduced dimensionality so as to account for inter-term dependencies. In simple terms, LSI takes a term-document matrix, or bag-of-words model, and applies singular value decomposition (SVD) to this matrix so as to extract the term dependencies. LDA represents term dependencies by assuming that each term is generated from a set of latent variables called topics [10]. One of the issues with this kind of technique is that it takes term dependencies into account at the document level, whereas with word embedding techniques we can have fine-grained relations between words that take the local windows (contexts) of the words into account.

Another way of computing a thesaurus is with neural networks. This brings us to the approach of using word embeddings, and in particular the recent word embedding learning technique word2vec [5] [6], a shallow two-layer neural network trained to reconstruct the linguistic contexts of words: the network is shown a word and must guess which words occurred in adjacent positions in an input text.

2.2.3 Brief about word embeddings

Word embedding is a technique for representing a word in a vector space so as to extract some features, by mapping words or phrases from the vocabulary to vectors of real numbers in a space of low dimension relative to the vocabulary size. There are various ways to learn these word embeddings, such as neural networks or dimensionality reduction of the word co-occurrence matrix. We now discuss a neural-network-based technique called word2vec to learn the embeddings.

Word2vec. Word2vec is a set of two algorithms – CBOW and Skipgram – that produce vector representations of words in a latent space of N dimensions (N is the size of the vectors). The whole intuition behind word2vec is that words occurring in similar contexts have similar meanings. For example: "There are a lot of x in the park. x are eating grass." From the context of these sentences we can easily infer that x can be cow, sheep, etc. Vector representations of words date back at least to 2003, as proposed by Yoshua Bengio [24]; what word2vec provides is a very fast way to compute vectors that encode the contextual information of the words. Qualitatively speaking, according to the author [5] these vector representations produce good results on analogy tasks [6]; with some linear vector operations the author states that we can obtain new relationships, e.g.

vector(”Paris”)− vector(”France”)+ vector(”Italy”) = vector(”Rome”)
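As an illustration only (not part of the original experiments), this kind of analogy query can be reproduced with the gensim library once a set of word2vec vectors is available; the vector file name below is hypothetical.

```python
# Minimal sketch: querying analogies over pre-trained word2vec vectors with gensim.
# "vectors.txt" is a hypothetical file of word vectors in word2vec text format.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# vector("Paris") - vector("France") + vector("Italy") should rank "Rome" near the top
print(kv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
```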

Word2vec uses a fully connected neural network with a single hidden layer, as shown below. The neurons in the hidden layer are all linear neurons. The input layer has as many neurons as there are words in the training vocabulary. The hidden layer size is set to the dimensionality of the resulting word vectors, and the size of the output layer is the same as that of the input layer. Figure 2.1 is representative of this neural architecture.


Figure 2.1 – A simple CBOW model with only one word in the context [8]

The idea behind word2vec is that, when the network is shown a word, it tries to predict the surrounding words. The input to the network is encoded using a "1-out-of-V" representation, where V is the size of the vocabulary, meaning that only one input line is set to one and the rest of the input lines are set to zero. Word2vec uses two models for the prediction: CBOW and skipgram. CBOW is trained to predict the target word t from the contextual words c that surround it, i.e. the goal is to maximize P(t|c) over the training set. Skipgram, on the other hand, predicts the contextual words from the target word. Skipgram turns out to learn finer-grained vectors when one trains on more data. The two models are presented below.

Figure 2.2 – CBOW and Skipgram [5]

The main focus of Mikolov's paper was the skipgram model. Given a corpus of words w and their contexts c, the aim is to set the parameters θ of P(c|w;θ) so as to maximize the corpus probability:

argmax_θ ∏_{w ∈ corpus} ∏_{c ∈ C(w)} P(c|w;θ)    (2.1)

Here C(w) is the set of contexts of the word w. We can also write this equation as

argmax_θ ∏_{(w,c) ∈ D} P(c|w;θ)    (2.2)

where D is the set of all word-context pairs in the text. One way to model this probability is the softmax function, which comes from neural networks:

P(c|w;θ) = e^{v_c · v_w} / Σ_{c′ ∈ C} e^{v_{c′} · v_w}    (2.3)

where v_c is the vector representation of the context word c, v_w is the vector representation of the word w, and C is the set of all available contexts. The parameters θ are v_{c_i}, v_{w_i} for w ∈ V (the vocabulary), c ∈ C, i ∈ 1,...,p (a total of |C| × |V| × p parameters). To maximize this we take the logarithm, and the objective becomes

max_θ log ∏_{(w,c) ∈ D} P(c|w;θ) = Σ_{(w,c) ∈ D} ( log e^{v_c · v_w} − log Σ_{c′} e^{v_{c′} · v_w} )    (2.4)

In the author's notation [5], where v_w and v′_w are the input and output vectors of the word w and W is the size of the vocabulary, equation 2.3 becomes

P(w_O|w_I) = e^{v′_{w_O}ᵀ v_{w_I}} / Σ_{w=1}^{W} e^{v′_wᵀ v_{w_I}}    (2.5)

Maximizing equation 2.4 should result in good vectors (embeddings) v_w for all w ∈ V, assuming that similar words have similar vectors. Equation 2.4 is very computationally expensive because of the term log Σ_{c′} e^{v_{c′} · v_w}, which sums over all possible contexts, and there can be thousands of contexts for a word. Mikolov [5] presented two different cost functions to make it computationally efficient: hierarchical softmax and negative sampling. The two approaches are very different from each other.

In hierarchical softmax only about log2 |V| output nodes are evaluated instead of the whole vocabulary as in the full softmax. The hierarchical softmax uses a binary tree where all the vocabulary words are at the leaves and every node defines the probability of visiting its child nodes; concretely, each word can be reached from the root of the tree. Let n(w, j) be the j-th node on the path from the root to w and let L(w) be the length of this path, so that n(w,1) = root and n(w,L(w)) = w. For an inner node n, let ch(n) be an arbitrary fixed child of n, and let [x] be 1 if x is true and −1 otherwise. The hierarchical softmax defines P(w_O|w_I) as follows:

P(w|w_I) = ∏_{j=1}^{L(w)−1} σ( [n(w, j+1) = ch(n(w, j))] · v′_{n(w,j)}ᵀ v_{w_I} )    (2.6)


where σ(x) = 1 / (1 + e^{−x}).

The idea of negative sampling is more straightforward: instead of updating all the output vectors, only a sample of them is updated. With negative sampling the objective function (equation 2.4) is replaced by a new one [5]:

log σ(v′_{w_O}ᵀ v_{w_I}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−v′_{w_i}ᵀ v_{w_I}) ]    (2.7)

In his paper Mikolov [5] recommends using the skip-gram model with negative sampling (SGNS), as it outperformed the other variants on analogy tasks. More heuristics are taken into account; for example, frequent words are sub-sampled with the following formula:

P(w_i) = 1 − sqrt( t / f(w_i) )    (2.8)

P(w_i) is the probability with which each word w_i in the training set is discarded, f(w_i) is the frequency of the word, and t is a threshold, generally around 10^{−5}. According to the paper this formula was chosen heuristically. We choose this method of obtaining term relationships because, according to the author, the vectors contain semantic similarities, and in our approach we will try to exploit this similarity.
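To make the two formulas concrete, the following sketch (ours, not Mikolov's code) evaluates the negative-sampling objective for a single (word, context) pair and the subsampling discard probability; all vector values and counts are toy numbers.

```python
# Toy numpy sketch of eq. 2.7 (SGNS objective for one pair) and eq. 2.8 (subsampling).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_objective(v_w_in, v_c_out, negative_vectors):
    """log sigma(v'_c . v_w) + sum_i log sigma(-v'_wi . v_w) over k negative samples."""
    positive = np.log(sigmoid(v_c_out @ v_w_in))
    negative = sum(np.log(sigmoid(-v_n @ v_w_in)) for v_n in negative_vectors)
    return positive + negative

def discard_probability(word_count, total_tokens, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clipped at 0 for rare words."""
    f = word_count / total_tokens
    return max(0.0, 1.0 - np.sqrt(t / f))

rng = np.random.default_rng(0)
v_w = rng.normal(size=200)                          # input vector of the target word
v_c = rng.normal(size=200)                          # output vector of the context word
negs = [rng.normal(size=200) for _ in range(5)]     # k = 5 negative samples
print(sgns_pair_objective(v_w, v_c, negs))
print(discard_probability(word_count=1_000_000, total_tokens=171_000_000))
```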

Things to keep in mind. Word2vec does not distinguish between the various senses a word can take: each word receives a single vector, learned over all of its contexts. We will not explore this direction any further, since we are not dealing with term disambiguation.

2.3 Language Modelling approach in IR

The language modelling approach to information retrieval was proposed by Ponte and Croft [1]. This approach models the idea that a document is a good match to a query if the document model is likely to generate the query, which happens if the query words are contained in the document. The language modelling approach builds a probabilistic language model Md for each document d, and ranks documents by the probability of the model generating the query, P(q|Md), where q is the query for which documents are to be retrieved.

2.3.1 Unigram, bigram, N-gram models

So what does it mean for a document model to generate a query? A traditional generative model of language, of the kind familiar from formal language theory, can be used to recognize or generate strings. For this we can build probabilities over the terms in the document. These probabilities can be independent or conditioned depending on the model chosen; for example, the simplest model, the unigram model, throws away all conditioning context and computes the probability of each term independently. The unigram model for three query terms given the document is

Punigram(t1t2t3) = P(t1)P(t2)P(t3) (2.9)


If term dependence is taken into account, the chain rule can be used to decompose the probability of a sequence of events into the probability of each successive event conditioned on the earlier events. For example, if we want to model how much each term depends on the previous term, we can use bigram modelling:

PBi−gram(t1t2t3) = P(t1)P(t2|t1)P(t3|t2) (2.10)

We can continue this way up to n terms; such an approach is called n-gram modelling using conditional probabilities. There are more complex language models based on the grammar of a language, called probabilistic context-free grammars.

2.3.2 The query likelihood model

In the query likelihood model, a language model Md is constructed for each document d. The approach is to rank documents by P(d|q), where the probability of the document is estimated by the likelihood of the document being relevant to the query.

P(d|q) = P(q|d) P(d) / P(q)    (2.11)

P(d|q) is decomposed using Bayes' rule. An important thing to note is that P(d) and P(q) can be computed beforehand and treated as constants. The rank of a document with respect to the query can therefore be estimated as the likelihood that the document generates that query. We can say that the query likelihood model attempts to model the process of query generation: documents are ranked by the probability that the query would be observed as a random sample from the document model. For this one can use a multinomial unigram language model, which is the same as the naive Bayes model; each document is considered as a class, which can be thought of as a separate language.

P(q|Md) = Kq ∏_{t∈q} P(t|Md)    (2.12)

Here Kq is the multinomial coefficient for query q and is constant for the query.

2.3.3 Estimating the query from the document

The probability of the query given the language model Md of a document d, estimated with maximum likelihood under the unigram assumption, is

P(q|Md) = ∏_{t∈q} Pmle(t|Md) = ∏_{t∈q} tf_{t,d} / Ld    (2.13)

Here Md is the language model of the document d, tf_{t,d} is the raw count of the term t in the document, and Ld is the length of the document, i.e. the total number of terms in d. The equation above is a very classical model for estimating the query generation probability from the document. The problem with this approach is that terms are sparse in the document: it is possible that words occurring in the query are not in the document, in which case the model assigns zero probability to the query even if some of the other query words do occur in the document. Clearly this is a big problem in information retrieval, both for ranking and for matching: we only get a non-zero value when all the query terms are present in the document, which is not always the case, since the user does not have an exact idea of the distribution of the words; he or she only has an idea of the concepts in the document, so not all the query words may be contained in it. In order to solve the problem of zero probabilities and of the significance of words in the distribution, a technique called smoothing [11] assigns probability weights to terms according to their distribution in the collection.

Smoothing and Extension of language model

The idea of smoothing is that a term not occurring in the document could still be possible in the query, but its probability should be close to, and not more than, its likelihood of occurrence in the whole collection. That is, if tf_{t,d} = 0,

P(t|Md) ≤ cf_t / T    (2.14)

Here cf_t is the total count of the term in the collection and T is the total number of terms in the collection. In 1980, Frederick Jelinek and Robert L. Mercer [4] proposed a mixture between a document-specific multinomial distribution and a multinomial distribution obtained from the entire collection:

P(t|d) = λ Pmle(t|Md) + (1−λ) Pmle(t|Mc)    (2.15)

where λ is between 0 and 1 and Mc is the language model of the entire collection. The value of λ has a big impact on the performance of the model. Another smoothing method is based on Dirichlet priors [7], where a language model built from the entire collection is used as the prior distribution in a Bayesian updating process:

P(t|d) = ( tf_{t,d} + µ Pmle(t|Mc) ) / ( Ld + µ )    (2.16)
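A minimal sketch of how a query could be scored with the two smoothed estimates above (eqs. 2.15 and 2.16) under the unigram assumption; the statistics are plain dictionaries and a small floor avoids log(0) for terms unseen in the collection. This is an illustration only, not the Terrier implementation used later.

```python
import math

def p_jelinek_mercer(t, doc_tf, doc_len, coll_tf, coll_len, lam=0.15):
    # eq. 2.15: lambda * P_mle(t|M_d) + (1 - lambda) * P_mle(t|M_c)
    p_doc = doc_tf.get(t, 0) / doc_len
    p_coll = coll_tf.get(t, 0) / coll_len
    return lam * p_doc + (1 - lam) * p_coll

def p_dirichlet(t, doc_tf, doc_len, coll_tf, coll_len, mu=2500):
    # eq. 2.16: (tf_{t,d} + mu * P_mle(t|M_c)) / (L_d + mu)
    p_coll = coll_tf.get(t, 0) / coll_len
    return (doc_tf.get(t, 0) + mu * p_coll) / (doc_len + mu)

def log_query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, estimator):
    # sum of log probabilities over query terms (unigram assumption)
    return sum(math.log(max(estimator(t, doc_tf, doc_len, coll_tf, coll_len), 1e-12))
               for t in query_terms)
```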

Like the query likelihood model, there is another language modelling technique, called the document likelihood model, where a language model is built from the query and the probability of the query generating the document is computed. Such an approach is less appealing, since there is a lot less text in queries than in documents, so the language model of the query will be poorly estimated and will need to be smoothed from other language models. Another option is to build a language model from both document and query and calculate how different these two models are from each other. John Lafferty and Zhai [16] developed a risk minimization principle for document retrieval. They suggested modelling the risk of returning a document d as relevant to a query q using the Kullback-Leibler (KL) divergence between the two models. KL divergence measures how bad the probability distribution Mq is at modelling Md:

R(d;q) = KL(Md|Mq) = Σ_{t∈V} P(t|Mq) log( P(t|Mq) / P(t|Md) )    (2.17)

Lafferty and Zhai stated in their paper that the model comparison outperformed both querylikelihood and document likelihood models.


2.4 Related work

Language modelling techniques do not take the problem of synonymy into account. Various approaches have been proposed to tackle this problem by extending the language models. Fabio Crestani [17] proposed frameworks to address the term mismatch problem. He observed that probabilistic retrieval models can be written as a dot product between query and document:

RSV(d,q) = Σ_{t∈q} wd(t) · wq(t)    (2.18)

where wd(t) is the weight of the term in document d and wq(t) is the weight of the term in query q. Crestani used a similarity function Sim: Sim(ti, tj) = 1 if ti = tj, meaning the similarity is maximal (usually for the same term), and Sim(ti, tj) = 0 if ti and tj have no semantic relation. He then added this similarity function to the above equation in two ways. First, in the case of a mismatch, i.e. ti ∈ q and ti ∉ d, he finds the term tj in the document which is closest to the query word ti, i.e. has the maximum similarity. The extended RSV for a document and query then becomes

RSVmax(d,q) = Σ_{ti∈q} Sim(ti, tj) wd(tj) · wq(ti)    (2.19)

Instead of scoring only the single most similar term, he also proposed to compute the contribution of all related document terms for a non-matched query term:

RSVtot(d,q) = Σ_{ti∈q} [ Σ_{tj∈d} Sim(ti, tj) wd(tj) ] · wq(ti)    (2.20)
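For illustration, a small sketch (ours) of the two extensions: RSV_max keeps only the most similar document term for each query term, while RSV_tot sums over all document terms; the functions sim, w_d and w_q are assumed to be supplied by the retrieval system.

```python
# Sketch of Crestani's extended RSV (eqs. 2.19 and 2.20).
def rsv_max(query_terms, doc_terms, w_q, w_d, sim):
    score = 0.0
    for ti in query_terms:
        tj = max(doc_terms, key=lambda t: sim(ti, t))   # most similar document term
        score += sim(ti, tj) * w_d(tj) * w_q(ti)
    return score

def rsv_tot(query_terms, doc_terms, w_q, w_d, sim):
    return sum(w_q(ti) * sum(sim(ti, tj) * w_d(tj) for tj in doc_terms)
               for ti in query_terms)
```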

Taking inspiration from Crestani's [17] work on similarity, ALMasri and Chevallet [18] proposed a new extended language model that takes into account the similarity of a query word to a document word in case of a mismatch. They proposed to modify the document index according to the query and external knowledge about term relations: the document d is expanded by the query terms that are semantically related to at least one document term. The idea is to maximize the coordination of document and query and, in the process, maximize the probability of retrieving the relevant documents for a given query. Figure 2.3 gives a pictorial description of the process.

Formally, the modified document dq is defined by

dq = d ∪ F(q/d, K, d)    (2.21)

where d is the original document, K is the knowledge source, and F(q/d, K, d) is the transformation of q/d, the set of query terms that do not occur in d. K provides the semantic similarity Sim(t, t′) between terms t and t′. For an unmatched query term t, they look for the term t∗ in the document which has the maximum similarity to t:

t∗ = argmax_{t′∈d} Sim(t, t′)    (2.22)

The occurrences of the query term t in the modified document dq then rely on the occurrences of the most similar term, freq(t∗; d); the pseudo occurrences of t are

freq(t, dq) = freq(t∗; d) · Sim(t, t∗)    (2.23)


Figure 2.3 – Expand the document d using the knowledge K. [18]

These pseudo occurrences of the term t are then included in the modified document dq. The transformation function F becomes

F(q/d, K, d) = { t | t ∈ q/d, ∃t∗ ∈ d, t∗ = argmax_{t′∈d} Sim(t, t′) }    (2.24)

so dq becomes

dq = d ∪ { t | t ∈ q/d, ∃t∗ ∈ d, t∗ = argmax_{t′∈d} Sim(t, t′) }    (2.25)

The length of dq is calculated in the following way:

|dq| = |d| + Σ_{t∈q/d} freq(t∗; d) · Sim(t, t∗)    (2.26)

They then used this modified document dq instead of d, assuming that the new probability estimation would be more accurate than the ordinary language model.

They then took two language models, with Dirichlet smoothing and Jelinek-Mercer smoothing, and extended them with the similarity function and the modified document.

Dirichlet smoothing:

Pµ(t|d) = ( tf_{t,d} + µ Pmle(t|Mc) ) / ( |d| + µ )    (2.27)

They extended the above equation to include the similarity:

Pµ(t|dq) = ( tf_{t,d} + µ Pmle(t|Mc) ) / ( |dq| + µ ),                      if t ∈ d
Pµ(t|dq) = ( tf_{t∗,d} · Sim(t, t∗) + µ Pmle(t|Mc) ) / ( |dq| + µ ),        if t ∉ d    (2.28)


If all the query terms occur in the document then |dq| = |d|, so Pµ(t|dq) = Pµ(t|d).

Similarly to the Dirichlet smoothing, they extended the Jelinek-Mercer smoothing model.

Jelinek-Mercer:

Pλ(t|d) = (1−λ) P(t|d) + λ P(t|C)    (2.29)

Extended Jelinek-Mercer smoothing:

Pλ(t|dq) = (1−λ) freq(t; d) / |dq| + λ P(t|C),                     if t ∈ d
Pλ(t|dq) = (1−λ) freq(t∗; d) · Sim(t, t∗) / |dq| + λ P(t|C),       if t ∉ d    (2.30)
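A hedged sketch of the extended Dirichlet estimate (eq. 2.28): a query term absent from the document borrows the frequency of its most similar document term, scaled by the similarity. The helper names (d_tf, p_coll, sim) are ours, not ALMasri and Chevallet's.

```python
def extended_dirichlet(t, d_tf, dq_len, p_coll, sim, mu=2500):
    """Eq. 2.28: d_tf maps document terms to raw counts, dq_len is |d_q|,
    p_coll(t) is P_mle(t|M_c) and sim(t1, t2) returns a similarity in [0, 1]."""
    if t in d_tf:                                    # query term occurs in the document
        pseudo_tf = d_tf[t]
    else:                                            # mismatch: back off to t* (eq. 2.22)
        t_star = max(d_tf, key=lambda u: sim(t, u))
        pseudo_tf = d_tf[t_star] * sim(t, t_star)    # pseudo occurrences (eq. 2.23)
    return (pseudo_tf + mu * p_coll(t)) / (dq_len + mu)
```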

For the similarity function Sim(t, t′), ALMasri and Chevallet assume that the term t is semantically related to a term t′ if t′ is a descendant of t in the term hierarchy of an external knowledge source K. For a term t from the query and a term t′ from the document, both from the vocabulary V, they define Sim : V × V → [0,1] such that

∀t, t′ ∈ V,  0 ≤ Sim(t, t′) ≤ 1    (2.31)

1. Sim(t, t′) = 0 if t and t′ are not semantically related and t ≠ t′.
2. Sim(t, t′) ≤ 1 if t′ is a descendant of the term t in the term hierarchy of K and t ≠ t′.
3. Sim(t, t′) = 1 if t and t′ are the same, that is t′ = t.

The similarity of terms is computed as the inverse of the distance between them.

Sim(t, t′) = 1 / distance(t, t′),   distance(t, t′) > 0    (2.32)

For their experiments they used CLEF1 corpora (medical domain corpora) with UMLS2 as an external knowledge base. UMLS is a multi-source knowledge base for the medical domain. Instead of using words to index documents they used concepts; UMLS provides the concepts, and documents and queries are mapped to them using MetaMap [14]. They experimented using X-IOTA [19] and compared their results with smoothed language models and with statistical translation models based on Jelinek-Mercer and Dirichlet smoothing. ALMasri and Chevallet state that their model performed better than the smoothed language models and the translation models, with considerable gains. Their results are shown below.

1. http://www.clef-initiative.eu/
2. Unified Medical Language System (http://www.nlm.nih.gov/research/umls/).


Figure 2.4 – MAP of Extended Dirichlet smoothing and Extended Jelinek-Mercer smoothing after integrating concept similarity. The gain is the improvement obtained by their approach over ordinary language models. † indicates a statistically significant improvement over ordinary language models using Fisher's randomization test with p < 0.05. [18]


3 Proposed approach

Taking our motivation from the work of ALMasri and Chevallet [18], we decided to check the validity of the semantic features of the embeddings in information retrieval against their model. We chose their language model because we assume that similarity can best be exploited at the document level rather than by adding similar words to the query: in case of the non-occurrence of a query word in a document, a new word should be chosen from the document which best matches the query term. This way every document can be evaluated, even in the case of a complete mismatch between document and query terms. We assume that, given good word embeddings, even in the case of a total mismatch (when none of the query terms occur in the document) we can retrieve relevant documents if the documents and the query share semantic relatedness in the semantic space. To put this hypothesis to the test we propose some experiments designed to give an understanding of the usefulness of the semantic features of the word embeddings in information retrieval.

Our second approach is based on the vector space model. In the classic vector space model proposed by Salton [28], the weight of a word in a document is a combination of the document frequency of the term and its global (collection) frequency. In the original paper documents are represented by ~vd = (w1,d, w2,d, ..., wN,d). The importance of each word is represented by its weight, computed as

wt,d = tf_{t,d} · log( |D| / |{d′ ∈ D : t ∈ d′}| )    (3.1)

Here tf_{t,d} is the term frequency in the document d and log( |D| / |{d′ ∈ D : t ∈ d′}| ) is the inverse document frequency, where D represents the collection and d′ ranges over the documents containing the term t. The similarity of the query q to a document dm is calculated by cosine similarity:

Sim(~dm, ~q) = (~dm · ~q) / (|dm| |q|) = Σ_{i=1}^{N} wi,m wi,q / ( sqrt(Σ_{i=1}^{N} wi,m²) · sqrt(Σ_{i=1}^{N} wi,q²) )    (3.2)

Here wi,q and wi,m are the weights of term i in the query and in the document dm respectively.

In our approach we propose a simpler weighting scheme where only the document frequencies are used; we choose to omit the idf value of the term, as we assume that since the word vectors are trained on the whole collection, every term already carries information about its occurrence with respect to the other term vectors. We also propose a different document vector from Salton's. We represent the document vector as

Page 24: Word Embedding In IR

~dm = Σ_{i=1}^{m} tf_i · ~wi,m    (3.3)

The assumption behind this approach is that, in the semantic space, a combination of words may represent a concept which is close to the query vector if they represent similar concepts. We obtain the query vector in the same way as the document vector:

~qj = Σ_{i=1}^{j} tf_i · ~wi,j    (3.4)

The similarity between the query and the document vector can then be computed in the same way as in Salton's paper, using cosine similarity:

Sim(~dm, ~q) = (~dm · ~qj) / (|dm| |qj|)    (3.5)

One of the key differences between the vector space model proposed by Salton and ours is that in Salton's document vector space each term of the document is a dimension of the document vector, whereas in our case each dimension of the document vector is the weighted sum of the corresponding dimension of the word vectors. Our document vector length does not depend on the number of words contained in the document but on the dimensionality of the word vectors, which we obtain from word2vec. With our model all vectors, whether for documents or queries, have the same length; a minimal sketch of this representation is given below.
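A minimal sketch, assuming the trained embeddings are held in a gensim KeyedVectors object kv: document and query vectors are term-frequency-weighted sums of word vectors (eqs. 3.3 and 3.4) and ranking uses cosine similarity (eq. 3.5).

```python
import numpy as np
from collections import Counter

def text_vector(tokens, kv):
    """tf-weighted sum of word vectors for a tokenized document or query."""
    vec = np.zeros(kv.vector_size)
    for term, tf in Counter(tokens).items():
        if term in kv:                 # skip terms without a trained vector
            vec += tf * kv[term]
    return vec

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# ranking sketch: score every document vector against the query vector
# scores = {doc_id: cosine(d_vec, text_vector(query_tokens, kv))
#           for doc_id, d_vec in document_vectors.items()}
```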

Since we did not have time to build the system, we keep this approach as a future proposition. For our first approach we propose some experiments, and the following sections relate to this first approach only. We start by explaining the corpus we used, the task, the retrieval toolkit, and the existing baselines on the task. We then discuss our experimental setup, and finally present the results obtained and our conclusions.

3.1 Retrieval toolkit and Collection used

3.1.1 TREC collection

The TREC (Text Retrieval Conference) collection consists of three parts: the documents, the questions or topics, and the relevance judgments or 'right answers'. TREC documents are distributed on CD-ROMs with approximately 1 Gbyte of text on each, compressed to fit onto the disks. The documents are contained on 5 disks. Disks 1-3 are called the TIPSTER collection and disks 4-5 the TREC collection. The contents of each disk are described below1.

1. Disk 1: Includes material from the Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
2. Disk 2: Includes material from the Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988) and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.

1. http://www.nist.gov/tac/data/data desc.html


3. Disk 3: Includes material from the San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis.
4. Disk 4: Includes material from the Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Disk 5: Includes material from the Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).

Below are some document statistics

Figure 3.1 – Some data statistics from Disk 1-3 [20]

There is a range of document lengths in the collection: there are short documents like DOE and very long documents like FR. The variance of document lengths also differs; for example AP is quite uniform (in median terms and number of terms per record) whereas ZIFF, WSJ and FR have significantly wider variance in lengths. The documents are formatted in SGML with a DTD. Fig. 3.2 shows the document structure.

Page 26: Word Embedding In IR

Figure 3.2 – Document structure in the collection

The topics in TREC 1 and 2 (topics 51-150) have a long and complex structure. These topics were designed to mimic real user needs and were written by people who are real users of retrieval systems. TREC 1 and 2 topics include a concepts field which adds to the description of the user's information need. The topics of TREC 3 (topics 151-200) are much shorter than in TREC 1 and 2, and the concepts field was removed to mimic a more general user, giving the system no information besides the query itself. The structure of topics from TREC 1-2 and TREC 3 is shown below.


(a) TREC 1 Topic example

(b) TREC 3 Topic example

Figure 3.3 – TREC Topic example

Relevance judgement file: To evaluate the performance of a retrieval system it is necessary to have a list of relevant and non-relevant documents; the relevance judgement files contain this information, and this list should be as comprehensive as possible. All three TRECs used a pooling method based on the work of Jones and van Rijsbergen [21] to create the relevance assessments. In this method a pool of candidate documents is generated by taking the top X documents from the various participating systems; this sample is then shown to human judges.

3.1.2 The Task

The adhoc task [12] investigates the performance of systems that search a static set of documents using new topics. This task is similar to how a researcher might use a library: the collection is known, but the questions likely to be asked are not. Fig. 3.4 depicts how the adhoc task is run in TREC. Participants are given a document collection consisting of approximately 2 Gbytes of text and 50 new topics. The set of relevant documents for these topics is not known at the time the participants receive the topics. Participants produce a new query set, Q3, from the adhoc topics and run those queries against the adhoc documents. The output of this run is the test result for the adhoc task.


Figure 3.4 – The Ad-hoc task for retrieval in TREC [12]

3.1.3 Retrieval Toolkit

For retrieval we use an open source tool called Terrier (Terabyte Retriever)2 [22] [23] [25] [26]. Terrier is written in Java and is developed at the School of Computing Science, University of Glasgow. Terrier is designed as a tool to evaluate, test and compare models and ideas, and to build systems for large-scale IR. Since it is an open source platform we decided to test our methods and run our experiments with it. Information retrieval in Terrier is done in three stages: first the collection for the experiment is indexed, then a matching model is chosen which Terrier uses for retrieval, and the third step is the evaluation. For indexing, the corpus of documents is handled by a Collection plugin, which generates a stream of Document objects. Each Document generates a stream of terms, which are transformed by a series of Term Pipeline components, after which the Indexer writes to disk.

2. http://terrier.org/


Figure 3.5 – Indexing Structure of Terrier [25]

For retrieval, the application communicates with the Manager, which in turn runs the desired Matching module. Matching assigns scores to the documents using a combination of a weighting model and score modifiers. Terrier provides many weighting models3. We extended the Terrier platform with the Jelinek-Mercer smoothing model, the extended Jelinek-Mercer smoothing model and the extended Dirichlet smoothing model for our experiments.

Figure 3.6 – The retrieval architecture of Terrier. [25]

For the evaluation on TREC data, Terrier takes the relevance judgement file provided by us and computes the scores. More details are given in the following sections.

3. http://terrier.org/docs/v4.1/configure retrieval.html


3.2 Experimental setup

3.2.1 Word Embedding Parameters and Data Pre-processing

Our dataset contains 741K documents with roughly 171M tokens. Since this data is in raw format it needs to be tokenized first; for this we used the Moses [37] tokenizer as a standard way to tokenize the raw dataset. Since the bulk of the useful data is contained in the "TEXT" tag, we decided to process only that part; the other tags in the documents do not contain semantically useful information. In the next step we stemmed the data so as to have simpler word forms and to increase the number of contexts for each word. We learn the word vectors on this processed dataset. The data is fed through a neural network with one hidden layer of size 200, which corresponds to the dimension of the output vectors. The learning rate α is chosen as 0.05. We learned the word embeddings with the skipgram model, a window size of 10, and negative sampling with 5 negative samples per example; Mikolov [5] describes the optimal value as 5-20 for small training sets and 2-5 for large ones. Stopwords like "the", "in", etc., which occur very frequently in the corpus, do not provide useful semantic meaning; to limit their influence we subsampled frequent words with a subsampling threshold of 1e−4 (0.0001). Misspelled or very rare words also do not provide much information about their contexts, so we did not train tokens whose frequency is less than 20. The window size of 10 is chosen because we assume it to be an ideal size to capture the semantic regularities.
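As an illustration of how these parameters fit together, the following sketch uses gensim's word2vec implementation (the actual embeddings were trained with the word2vec tool; the gensim call, the toy corpus and the output file name are our assumptions, with gensim >= 4 parameter names).

```python
from gensim.models import Word2Vec

# `corpus` stands in for the tokenized, stemmed TEXT fields of the 741K documents;
# a tiny toy corpus is used here so the snippet runs on its own.
corpus = [["obama", "visit", "europ", "treati", "sign"],
          ["presid", "arriv", "pari", "discuss", "import", "law"]]

model = Word2Vec(
    sentences=corpus,
    vector_size=200,   # hidden layer size = embedding dimension
    alpha=0.05,        # initial learning rate
    window=10,         # context window size
    sg=1,              # skipgram model
    negative=5,        # negative samples per example
    sample=1e-4,       # subsampling threshold for frequent words
    min_count=1,       # the thesis used 20; 1 here so the toy corpus is not emptied
    workers=16,        # one worker per core on the 16-core machine
)
model.wv.save_word2vec_format("trec_vectors.txt")
```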

3.2.2 Terrier

We use only the "TEXT" tags of the documents to build the Terrier indices. From the topics only the "topic" tag is used to fetch the query terms. While building the indices stopwords are removed and stemming is applied. Terrier ships with matching models such as BM25, DirichletLM, PL2, etc., but for our experiments we implemented some new models, namely Jelinek-Mercer, extended Jelinek-Mercer and extended DirichletLM. As classical retrieval baselines we have chosen BM25, the Jelinek-Mercer smoothed language model and the Dirichlet language model. We chose these models to test whether including similarity in the language models increases the retrieval performance of the system. We tested with different parameter values for these probabilistic models; the results and the parameter values are given in Section 3.4.

3.3 Evaluation metrics

The classical retrieval evaluation metrics are precision, recall, mean average precision and precision at index. Let ReldocRetrieved be the total number of relevant documents retrieved by the system, ReldocNotRetrieved the total number of relevant documents not retrieved by the system, and NonReldocRetrieved the number of non-relevant documents retrieved by the system. Precision is then defined by

ReldocRetrieved / ( ReldocRetrieved + NonReldocRetrieved )    (3.6)

Similarly recall is defined by


ReldocRetrieved / ( ReldocRetrieved + ReldocNotRetrieved )    (3.7)

Precision takes all retrieved documents into account, but if we want to evaluate only the top-ranked documents we generally use precision at index. This measure is called precision at n, where n is a cut-off rank; the most common is precision at 10, or P@10. Mean average precision, or MAP, is the mean of the average precisions over the queries. Formally MAP is defined as

( Σ_{q=1}^{Q} AvgP(q) ) / Q    (3.8)

where Q is the number of queries. Recall-oriented systems are good, but if their precision at high ranks is low the system will probably not perform well in real-life scenarios: users do not have time to read hundreds of documents and instead want relevant documents at the top of the ranking.
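For concreteness, a short sketch of these measures: precision at 10 and average precision for a single query, and MAP over a set of queries. Here ranked is a ranked list of document ids and relevant is the set of judged-relevant ids from the relevance file; the function names are ours.

```python
def precision_at_k(ranked, relevant, k=10):
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank            # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query (eq. 3.8)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```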

3.4 Retrieval models and results

As baselines we have taken the results from the official TREC-3 adhoc competition [20]. Below we give a short description of the best performers in the competition, their results, and their methodology in brief.

Citya1 [27]: They used a probabilistic term weighting scheme with topic expansion of up to 40 terms, with dynamic passage retrieval in addition to whole-document retrieval.
INQ101 [30]: They used probabilistic weighting with an inference net. They also used topic expansion and passage retrieval in addition to whole-document matching, with an external thesaurus built by themselves.
CrnlEA: Their method was based on the SMART vector space model with term weighting. They used Rocchio relevance feedback to expand terms; no topic expansion or phrase retrieval was done.
westp1 [32]: Their method follows the lines of INQ101, with documents and phrases being used for ranking, but topic expansion was minimal.
pircs1 [33]: Their method was based on a spreading activation model on parts of documents (550 words). Topic expansion was done using the top 6 documents together with the terms in the original topic; the top 30 expansion terms were then chosen.
ETH002 [34]: They used a combination of a vector space model, a passage retrieval model using hidden Markov chains, and topic expansion using document links.


Figure 3.7 – Trec-3 Adhoc results [20]

For the Jelinek-Mercer language model and the extended Jelinek-Mercer model we used a smoothing parameter λ of 0.15, as we tested other values and found this one to be optimal in terms of recall and average precision. For the Dirichlet and extended Dirichlet language models we found the optimal smoothing parameter to be 2500. Since the extended Jelinek-Mercer and extended Dirichlet language models rely on extensive document matching (every document must be matched against the query, and in case of a mismatch against the similar words), the process becomes expensive both in computation and in time, taking around 2 days. This is not feasible for experiments or for real-life scenarios, so we decided to follow a heuristic: to reduce the matching time we take only those documents which contain at least one query word. Formally, a document d is kept only if

∃t ∈ q : t ∈ d,   ∀ document d    (3.9)

This heuristic cut our matching time from 48 hours to 12 hours. We also set a similarity threshold of 0.7 for words, to limit the noise arising from too many similar words. Lastly, to gain more on matching time, we precomputed the similarities of words. Since keeping the similarities of all word pairs in memory would require roughly 1 terabyte, we only computed the similarities of the query terms to all other words in the collection, which reduced the memory requirement to about 1 gigabyte. The retrieval results of our experiments are given below. There were a total of 9805 relevant documents for the 50 queries. For each query 1000 documents were retrieved, and MAP and precision at 10 (P@10) are computed accordingly.
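A sketch (ours) of the pre-computation just described: for every query term, only the vocabulary words whose cosine similarity exceeds 0.7 are kept, so the extended models never need the full word-by-word similarity matrix at matching time; kv again stands for the trained gensim vectors.

```python
def similar_terms(query_terms, kv, threshold=0.7, topn=1000):
    """Map each query term to its {word: cosine} neighbours above the threshold."""
    table = {}
    for t in query_terms:
        if t not in kv:
            continue
        neighbours = kv.most_similar(t, topn=topn)   # (word, cosine) pairs
        table[t] = {w: s for w, s in neighbours if s >= threshold}
    return table
```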


TREC-3 adhoc run

Model                     MAP      Total relevant retrieved   Precision@10
Jelinek-Mercer            0.2129   4823                       0.4340
Extended Jelinek-Mercer   0.1623   3928                       0.3720
Dirichlet LM              0.2282   5091                       0.4940
Extended Dirichlet LM     0.1898   4557                       0.4520

Figure 3.8 – TREC-3 adhoc results

3.5 Results comparison

None of our models beat the baseline. We therefore inspected our method and found a major problem with our toolkit. Terrier internally tokenizes words in a peculiar way: it removes all punctuation, even from inside words, and then removes terms with fewer than 3 characters. It therefore removes all 2-character abbreviations. For example "U.S." becomes "us" and is then removed from the index, as well as from the queries. Many of our queries contain such abbreviations, especially "U.S.", so we assume we gain nothing from that term and we also lose the context of the country; our retrieval can suffer from that. This can be one factor, but to be sure we decided to dig deeper into the queries which retrieved the fewest relevant documents relative to the number of relevant documents present in the collection. Query 152, "Accusations of Cheating by Contractors on U.S. Defense Projects", did not retrieve many relevant documents: only 95 out of 538 with the extended Dirichlet language model. For this query the document "FR88928-0019" seemed quite relevant to us, as it contained information about employees of contractors who use illegal drugs on defense projects and about the Department of Defense passing policies to curb this; yet this document was deemed irrelevant in the relevance file. Relevance judgements appear to be another factor, since not all documents had been judged by the human assessors.

To check our hypothesis that the removal of abbreviations hurts performance, we replaced all instances of the term "U.S." with "America" and re-indexed the whole collection. We then did the same for the query mentioned above and obtained a completely different set of results with the same extended Dirichlet language model: the number of relevant documents retrieved increased from 95 to 178. It even beat the classical Dirichlet model, both in mean average precision and in total relevant documents retrieved.

Figure 3.9 – Trec-3 Adhoc results for query 152

Another important document (WSJ870715-0135), which was highly ranked in the retrieval, speaks about Japanese companies acting as defense contractors, one of which, "Toshiba", also built for Russia and was banned afterwards. The document contains statements such as "These statements that accuse Japan of being a leaky sieve of high technology – that doesn't help the situation at all." It too seemed quite relevant to us but was marked irrelevant. What was surprising is that all the top documents concerned the contractors' role in defense contracts and their inability to cope with the situation. All of them spoke negatively of the contractors.


Even the documents judged irrelevant were semantically close to the query, which was precisely our goal. Relevance seemed quite subjective to us.

We ran our test with the Extended Dirichlet language model on another query on which we gained nothing, and on closer inspection we found the problem. Query 179 reads "U.S. Restaurants in Foreign Lands". The top results contained information such as Japanese companies buying stakes in American companies and spoke negatively of a Japanese corporate invasion. The second-ranked relevant document did not show any semantic similarity either; it contained information about foreign shipping practices of the U.S. Federal Maritime Commission. Our intuition is that for such a query to work there must be term dependence among the query terms. For example, instead of finding similar words for "foreign" in isolation (which can yield "alien", "different", etc.), "foreign" should be treated as dependent on the term "America", and similar words should be searched in that joint context without altering the term "America" itself. We also ran our experiments after the necessary changes in the toolkit and datasets, but we could not finish them at the time of writing this thesis. We could only obtain retrieval results for the first 21 queries, for which we show the results of both the Extended Dirichlet language model and the Dirichlet language model.
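One cheap way to approximate such a dependence with word2vec (not the method used in our experiments) is to ask for neighbours of several query terms jointly: gensim's most_similar averages the vectors of the positive terms, which biases the neighbours towards the joint context. The terms and file name below are illustrative and assumed to be in the vocabulary:

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("trec_vectors.bin", binary=True)

# Neighbours of "foreign" alone may drift towards "alien", "different", ...
print(vectors.most_similar("foreign", topn=5))

# Querying with both terms averages their vectors, pulling the neighbours
# towards the joint context "foreign + america" while the query term
# "america" itself is left unchanged in the query.
print(vectors.most_similar(positive=["foreign", "america"], topn=5))
```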

Figure 3.10 – Retrieval results

We can see that both retrieval models perform along similar lines; we do not gain much better results.


4 Conclusion and future work

Our experiments did not yield the intended results. We did not beat the baselines, but the experiments still reveal some interesting facts about including embedding similarities in retrieval. For query 152, all the top documents were in the same semantic space; they all captured the essence of the query, namely that all of them spoke negatively about the contractors in U.S. defense projects. This supports our assumption that embeddings can capture some form of semantic similarity between documents and queries. It is hard to single out one reason why our retrieval did not produce good results. One possible reason is that, since not all documents were judged by human annotators, some documents retrieved by our system may in fact be relevant but are marked irrelevant for lack of a judgment. Relevance is also a rather subjective parameter: some of the retrieved results seemed relevant to us but were marked irrelevant in the relevance file. The removal of abbreviations also hurt our retrieval performance on occasion. A careful analysis of query 179 revealed a glaring loophole in our assumption: simply substituting individual query terms with similar terms does not retrieve relevant documents. The analysis showed that there is an inherent relationship among the terms of a query, and exploiting this relationship together with similarity might yield better results. Another possible reason is that the similar words contain a lot of noise. For instance, for a term $t_i$ the terms $t_1$, $t_2$, $t_3$ may be the most similar, in that order, while in the given context only $t_3$ is actually appropriate. One possible approach is to take the n most similar terms for each query term and run retrieval with this expanded query, without computing the similarity of every word in each document; this way we might be able to remove the noise.
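A minimal sketch of this expansion idea, again assuming word2vec vectors trained on the collection and loaded with gensim (the file name and the cutoff n are illustrative):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("trec_vectors.bin", binary=True)

def expand_query(query_terms, n=3):
    """Expand each query term with its n most similar terms from the embedding space."""
    expanded = list(query_terms)
    for term in query_terms:
        if term in vectors:
            expanded.extend(word for word, _ in vectors.most_similar(term, topn=n))
    return expanded

# The expanded query is then fed to a standard retrieval model instead of
# computing similarities against every word of every document at query time.
print(expand_query(["restaurants", "foreign", "america"]))
```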

The language models are based on the assumption that each term is independent of the other terms in the term space. We would like to explore further along the lines of works where term dependence is taken into account; one possible way to exploit it is through the grammatical properties of the language. Secondly, to exploit the true nature of the semantic space of the embeddings, we would like to test the vector space model proposed in chapter 3. The language model we used only exploited the cosine similarity of the embeddings, whereas we believe the vector space model might exploit the embedding space more fully.


Bibliography

[1] Jay M. Ponte and W. Bruce Croft. A language modelling approach to information retrieval. SIGIR '98.

[2] L. Zhao and J. Callan. Term necessity prediction. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM '10), 2010.

[3] L. Zhao and J. Callan. Automatic term mismatch diagnosis for selective query expansion. SIGIR 2012.

[4] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May 1980.

[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013.

[6] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space.

[7] D. MacKay and L. Peto. A hierarchical Dirichlet language model. Natural Language Engineering 1(3):289-307, 1995.

[8] word2vec Parameter Learning Explained.

[9] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391-407, 1990.

[10] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, March 2003.

[11] ChengXiang Zhai and John Lafferty. A Study of Smoothing Methods for Language Models Applied to Information Retrieval. April 2004.

[12] Ellen M. Voorhees and Donna Harman. Overview of the Sixth Text REtrieval Conference (TREC-6).

[13] J. J. Rocchio. Relevance feedback in information retrieval. 1971.

[14] Alan R. Aronson. MetaMap: mapping text to the UMLS Metathesaurus. 2006.


[15] Chris Buckley, Gerard Salton, and James Allan. The effect of adding relevance information in a relevance feedback environment. Proc. SIGIR, pp. 292-300. ACM Press, 1994.

[16] John Lafferty and Chengxiang Zhai. Document language models, query models, and risk minimization for information retrieval. SIGIR 2001.

[17] Fabio Crestani. Exploiting the Similarity of Non-Matching Terms at Retrieval Time. September 1999.

[18] Mohannad ALMasri, KianLam Tan, Catherine Berrut, Jean-Pierre Chevallet, and Philippe Mulhem. Integrating Semantic Term Relations into Information Retrieval Systems Based on Language Models.

[19] Jean-Pierre Chevallet. X-IOTA: an open XML framework for IR experimentation. Proceedings of AIRS '04.

[20] D. K. Harman. Overview of the Third Text REtrieval Conference (TREC-3). 1996.

[21] K. Sparck Jones and C. J. Van Rijsbergen. Report on the need for and provision of an ideal information retrieval test collection. Technical Report 5266, Computer Laboratory, University of Cambridge, 1975.

[22] Craig Macdonald, Richard McCreadie, Rodrygo Santos, and Iadh Ounis. From Puppy to Maturity: Experiences in Developing Terrier. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval.

[23] Iadh Ounis, Christina Lioma, Craig Macdonald, and Vassilis Plachouras. Research Directions in Terrier: a Search Engine for Advanced Retrieval on the Web. In Novatica/UPGRADE Special Issue on Next Generation Web Search, 8(1):49–56, 2007.

[24] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A Neural Probabilistic Language Model. 2003.

[25] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Christina Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of the ACM SIGIR '06 Workshop on Open Source Information Retrieval (OSIR 2006), 10 August 2006, Seattle, Washington, USA.

[26] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. Terrier Information Retrieval Platform. In Proceedings of the 27th European Conference on Information Retrieval (ECIR 05).

[27] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. 1996.

[28] G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 1975, vol. 18, nr. 11, pages 613-620.

[29] Fuchun Peng, Nawaaz Ahmed, Xin Li, and Yumao Lu. Context sensitive stemming for web search. 2007.


[30] John Broglio, James P. Callan, W. Bruce Croft, and Daniel W. Nachbar. Document Retrieval and Routing Using the INQUERY System. 1994.

[31] Automatic Query Expansion Using SMART: TREC 3.

[32] Paul Thompson, Bokyung Yang, and James Flood. TREC-3 Ad Hoc Retrieval and Routing Experiments using the WIN System. 1995.

[33] K. L. Kwok, L. Grunfeld, and D. D. Lewis. TREC-3 Ad-Hoc, Routing Retrieval and Thresholding Experiments using PIRCS. 1995.

[34] Daniel Knaus, Elke Mittendorf, and Peter Schauble. Improving a Basic Retrieval Method by Links and Passage Level Evidence. 1995.

[35] Victor Lavrenko and W. Bruce Croft. Relevance based language models.

[36] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003) 993-1022.

[37] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. "Moses: Open Source Toolkit for Statistical Machine Translation". Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.