Keyword-Based File Sorting for Information Retrieval
Balmain Beckford
Senior Thesis
Department of Computer Science
Minnesota State University, Mankato
December 27, 2010
Abstract
Keyword-based file sorting is the aggregation of related files into clusters based
on a similarity evaluation between files and the representatives within the clusters.
Keywords are the discriminating features of a file. These discriminating features are
based on the frequency of the keywords along with their weight value. Information
retrieval (IR) is a field of computer science that deals with processing text found in
files in order to more efficiently retrieve files that are the closest match to a user's
query. Files can be indexed based on the words they contain as well as words that
match these words’ concept from a thesaurus. Keyword-based IR by itself displays
certain deficiencies in retrieving files. This thesis shows how incorporating the proper
indexing and clustering techniques will improve the quality and performance of IR
based on keywords.
1 Introduction
Keyword-based information retrieval is a technology that has been around for quite some time
but is still very useful in various search applications today. For effectiveness and simplicity,
most keyword-based information retrieval systems rely on extracting keywords from tags
associated with the files they are trying to retrieve. These tags can be generated manually by a
set of users or collaboratively through social tagging systems [27, 2]. Popular applications
that utilize collaborative social tagging include YouTube, which allows users to upload and
tag videos; Flickr, which allows users to upload and tag photos [27]; and Amazon, which
allows users to tag almost any item they want to sell. Other applications that utilize
collaborative tagging are blogs and wikis [27, 35]. This method of tagging has proven more
intuitive than other tagging alternatives, because it is easier for users to select a previously
used tag that matches their item [27]. An advantage
of collaborative tagging is its ability to adapt to new vocabulary and word trends [27]. A
challenge that may arise from collaborative tagging is having a wide variety of tags or tag
redundancy and ambiguity due to unsupervised tagging [27]. Ambiguity may affect the
information retrieval process by causing files to be retrieved that were mistakenly placed in
the wrong category.
Keyword-based information retrieval can also be used in other areas to retrieve files
or documents. It can be used in software development to determine the traceability of a
particular piece of code [9, 32, 29]. Traceability is the ability to describe and follow a
requirement of a particular software design in both a forward and backward direction [9],
from the inception stage, where the design of the software is conceived, throughout
the entire lifecycle of the design process. Information is retrieved from keywords associated
with objects in the software [9]. An object is a real world entity that is modeled by software.
In this field the similarity between files can be calculated by using a similarity index.
In addition, information retrieval based on keywords has been used in a number of other
specialized fields. Content-based information retrieval has been used in the medical field to
retrieve medical images based on the tags and description related to that image [10, 20, 24].
There has also been success in using this technology in journalism. It has been used in
the design of systems that retrieve multimedia news content from a database [26, 1, 16]. A
task that is usually done prior to or during the retrieval process in most keyword-based file
retrieval systems is sorting these files into categories based on similarity or relatedness.
This improves the accuracy of the retrieval system by helping it return results that
are most relevant to a user's query. This can be accomplished by using various clustering
methods like k-means clustering, hierarchical clustering and clustering by committee (CBC)
[31, 28, 27, 37, 33]. There is also a novel clustering algorithm called domain similarity clustering
by committee (DSCBC), which is an extension of CBC [31, 33]. It is an approach that handles
the ambiguities that arise when dealing with adjectives and nouns that are polysemous.
Polysemous words are words that have different meanings or word senses when used in
different contexts and hence belong in different categories given a particular context.
Google1 has implemented a number of techniques available to users to improve their
search results. One of their techniques is allowing users to enter commands within their
searches. An example of this is placing a tilde before a word whose synonyms should
also be included in the search. For example, the search query ∼fast food
will display results that have words that are related to “fast food” as opposed to fast, meaning
quick, and any type of food. Results may include “junk, burgers, or french fries.” An n-gram
system is used to compare the probability of which word is most likely to appear with the
search phrase “fast food”. This allows users’ queries to be easily processed and more effective
in returning the related results.
Google has also updated their search algorithm to use location based semantic retrieval
1 http://www.google.com/
to give results to the user about a particular query which is location sensitive. For example,
if a user searches for West Indian food and does not specify the location they may get results
that are highly ranked but are not relevant to the user’s query. For example, West Indian
cuisines from New York or Jamaica may show up in the search results when the user lives
in Mankato, Minnesota. Here, semantic information is used to cluster the documents with
the location the query is done from in order to return more appropriate results.
Clustering is used to save time for users. Recently, Facebook has implemented keyword-
based clustering, where it clusters similar postings within the news feed that were posted by
different users at different times. This saves space in displaying similar postings in the news
feed and avoids redundancy. Also, website aggregator digg.com gathers news stories from
different websites based on keywords found within their content or description. Users can
retrieve news stories given a search term. However, digg.com searches mainly by keywords
and not with semantics; the retrieval process could therefore be improved by incorporating
semantic information.
In terms of web searching, the precision of the results can be limited by a number of
factors. Bahatia and Khalid explain a number of factors that may hinder the user from
getting the correct results [6, 8]. Chief among these is the large amount of information present
on the web: keyword-based web search usually has low recall because of this volume of
information, and low precision because of the difficulty of determining relatedness.
Additionally, it is hard for search engines to return information based on user’s personal
preference or context of the words used in a query [6, 8]. In other words, it is hard to tell
what is on the user’s mind.
Although keyword-based sorting and information retrieval have been used for a long time,
improvements can still be made to the precision of the information retrieved.
There have been attempts to reduce the effects of ambiguity by giving users the
option of categorizing their tags [27, 4, 12]. An example of this is Flickr, where a user who
enters the word "apple" can choose from a number of categories such as
computers, fruits or city. The primary goal of this thesis is to apply a combination of information
retrieval and clustering techniques to improve the accuracy of a query with little user involve-
ment in pre-determining the categories or context of the query. This implies that approaches
to keyword-based file sorting with strategies to handle polysemous words that may induce
ambiguity will improve the accuracy of results from a query.
This thesis discusses some of the areas in which keyword-based file sorting and information
retrieval have been used. It describes some of the tools that are useful in the
process of information retrieval as well as the specific tools used to generate results in our
experiments. It closes with a discussion of the results and suggestions of possible future
work.
2 Related Work
Important to this thesis is the use of algorithms to extract keywords from tags associated
with files and use them to cluster these files based on relatedness. Latent semantic indexing
(LSI) is an approach that compares documents using vector representations of those
documents [36, 13, 11, 5]. Further improvements were introduced by others to make LSI more
efficient by giving it a stronger statistical foundation [17]. This gave rise to a novel approach
to indexing called probabilistic latent semantic indexing (PLSI). PLSI generates higher
performance gains in indexing documents because it has the capability of handling the am-
biguity that arises while indexing files that have keywords that are synonyms or polysemes.
Keyword-based information retrieval has been used to trace requirements throughout soft-
ware development [9]. Cleland-Huang et al. [9] also show how a threshold is used to determine
whether a document should be retrieved. Weights are assigned to keywords, and a keyword
with a weight above the threshold makes its document eligible for retrieval.
IR is evaluated using precision and recall, defined as:

    precision = (number of relevant documents retrieved) / (total number of documents retrieved)

and

    recall = (number of relevant documents retrieved) / (total number of relevant documents).
Precision measures the fraction of retrieved documents that are relevant to a user's
query. Recall measures the fraction of all relevant documents that are actually retrieved.
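As an illustration, both measures can be computed from a retrieved set and a set of relevance judgments. This is a minimal sketch; the document identifiers and set contents are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant to the query
    """
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of 4 retrieved documents are relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d5", "d6", "d7"})
# p = 3/4 = 0.75, r = 3/6 = 0.5
```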
A rule of thumb used when extracting keywords is that words appearing in fewer
documents carry more weight as distinguishing features of a file when determining
whether that file matches a user's query. Research shows that if
users interactively try to disambiguate their queries by placing them into categories based
on topics, it will improve both categorization and the retrieval of information [28]. Hierarchical
clustering has also been used to derive categories from keywords in [28].
Clustering by domain specific similarity has been proven to handle ambiguity [31, 30].
More specifically, it handles ambiguity that arises from words, especially adjectives, that
have multiple word senses. For example, the word “Hot” which is an adjective can be used
to describe temperature and also could be used to describe something that is trendy. An
extension of CBC called domain similarity clustering by committee (DSCBC) was introduced
and is used to handle polysemous words, specifically adjectives. Table 1 shows results of
DSCBC and CBC [31]. CBC places more words in its committees and also puts words like "face"
in categories where they do not really belong.
Mobile devices also incorporate keyword-based searching [22]. Unlike traditional naviga-
tional search that was used in the mobile devices to exactly match the words users input to
produce results, semantic search tries to understand the user’s input by providing more of
what the user might be looking for. This improves the quality and quantity of the search by
Table 1: Examples of committees created from CBC and DSCBC

Algorithm   Committees
CBC         taste, smell, scent, aroma
            beauty, appearance, look, light, color, thing
            amount, number, time, information, system
            tone, attitude, voice, word
            day, sound, face, light, word, thing
DSCBC       smell, aroma
            appearance, look
            quality, amount, number
            attitude, countenance, nature
            day, shade, room, light, face, color
returning more relevant results [22].
Keyword-based and semantic searching methods are both used to retrieve infor-
mation [22, 15]. Keyword-based searching attempts to capture the essence of a document
through the words or phrases that are present in it. This can be done either by manually
analyzing the document or by automatically subject indexing it [22]. In previous research,
the Apache Lucene library, a high-performance text search engine library that supports
boolean (AND, OR, NOT), fuzzy, proximity and wildcard searches, was used to retrieve
information from a database [22]; retrieval is performed through an index searcher object.
Semantic search for mobile devices uses a "five sense" multimedia ontology, which reflects
real-world information connected semantically to locations [22]. When a word is searched,
words belonging to a similar class have similar semantic relationships with each other. The
authors of [22] implement this through term mapping, a query graph, and SPARQL (SPARQL
Protocol and RDF Query Language).
Case-based reasoning (CBR) systems incorporate information retrieval techniques to dis-
play the results of a user’s query. These systems use mainly two techniques to produce query
results. They use keyword/syntactical IR and semantic retrieval. In the keyword IR, infor-
mation is retrieved simply on the basis of spotting keywords while the semantic IR offers
a more comprehensive display of results in terms of relatedness to other documents. This
allows a CBR system to more effectively return results that match a user's query and puts less
pressure on the user to enter specific keywords in his/her search.
In the area of medicine, indexing of medical images is an ongoing task. Latent semantic
indexing has been used to index images in a system in order to retrieve them efficiently
[7]. They use probability to handle the case of words that may belong to more than one
cluster.
Learning and semantics have been very important in the area of multimedia information
retrieval (MIR) according to [26]. Previous work has emphasized incorporating classification
into MIR in order to improve the results given to users. Journalism has employed IR tech-
niques to retrieve multimedia content from databases. This content is indexed and clustered
using semantics and various clustering techniques to improve the accuracy of search results.
The method that is usually used in indexing media content is referred to as topic labeling
or topic clustering [16]. Both these methods cluster documents based on terms found within
the tags or description of a file. Topic clustering will be used in this project to cluster files
based of their tags.
3 Steps in Information Retrieval
This section explains the important steps involved in the information retrieval
process: how keywords are extracted, how documents are clustered based on relatedness, and
the techniques used to cluster documents.
Figure 1 shows the flow of data through a simple information retrieval system [26]. A
corpus of data is collected and then stored in a database. When a user makes a query, their
query is compared against the data within the database. If the information in the database
is relevant to the user’s query then the results are returned back to the user.
Figure 1: Diagram of an Information Retrieval System
3.1 Extracting keywords
Starting with the keyword-based document clustering algorithm described in [19], we modified
the algorithm to retrieve keywords from the tags associated with each file to be clustered.
The document vector depends on the term frequency and document frequency. Since
we are extracting keywords from tags, the process is easier because there are fewer words,
and these terms can be considered preprocessed descriptions of the files.
The document vectors represent a document, or in this case a file, and 〈term,weight〉 pairs
are the unique elements of the document vector. The weight value of a term is a ranking
value which can be used to determine whether the term is a keyword or a stopword in the
document. The weighting function w(t) can be calculated as follows:
    w(t) = 0  if t is a stopword
           1  if t is a keyword
           a  otherwise, where 0 ≤ a ≤ 1
When adding weights to terms, there are two criteria that must be met to represent a
document:

1. Discriminative value that distinguishes or characterizes the document from others

2. A measure that separates keywords from stopwords
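The weighting function w(t) can be sketched as a simple lookup. This is a minimal illustration; the stopword list, keyword list, and the intermediate weight a = 0.5 are hypothetical choices, not values from the thesis.

```python
STOPWORDS = {"a", "an", "the", "of", "is"}      # hypothetical stopword list
KEYWORDS = {"football", "election", "comedy"}   # hypothetical keyword list

def w(t, a=0.5):
    """Weight of term t: 0 for stopwords, 1 for keywords, a (0 <= a <= 1) otherwise."""
    if t in STOPWORDS:
        return 0.0
    if t in KEYWORDS:
        return 1.0
    return a

# w("the") -> 0.0, w("football") -> 1.0, w("video") -> 0.5
```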
There are two approaches that can be used to add weights to terms. These approaches are
frequency-based term weighting (FBW) and keyword-based term weighting (KBW). FBW
is a statistical measure of terms in an inter-document relationship [19]. This approach is
efficient in distinguishing and characterizing a document from others which makes it useful
in document clustering for information retrieval purposes. However, we cannot rely on frequency
alone to characterize a document, because frequency statistics are the only evaluation
measure available in frequency-based weighting [19].
KBW is an approach that is based on keyword importance factors in a document. It
analyzes the content of a document to retrieve keywords from it. It calculates the weight
values for keywords using keyword-weighting factors, with the terms ordered by a keyword
ranking score. The ranking score is found from the keyword analysis results of the document.
It can be used with FBW to efficiently weight keywords and eliminate the deficiencies of
FBW.
The keyword ranking of a term depends on the document type and on the location
and role of the word in the sentence or paragraph. Thematic words are the representative
terms for a document [19]. These words can be extracted from text by analyzing its contents.
In the case of this project, our keywords are found within XML tags, but in more common
cases they could be found in the bodies or titles of documents. In these more common
cases, keywords can be classified by different features [19].
They can be classified by word-level features, such as part-of-speech information, or by
sentence-level features, such as the type of phrase or the location and type of the sentence.
Different weights are given to terms that occur in different parts of a sentence. For example,
a term in the subject clause of an English sentence may carry more weight than a term
in an auxiliary clause.
3.2 Topic Clustering and Topic Labeling
Topic clustering is the grouping of items into topics or subjects. It can be done by using
commonly appearing words or phrases to sort items into categories. This is done by first
finding the phrase or words that appear in each item while ignoring stop words. After these
key phrases are found, they are used to put items into related clusters.
Topic clustering is used in search engines to display results from searches [3]. It is
particularly useful when the item searched for belongs to a number of different categories.
Then, topic clustering can be used to display the results that are related to each individual
category. Typical approaches for clustering are hierarchical clustering, which is distance-
based clustering that can be agglomerative or divisive, and k-means clustering, which is also
distance-based and creates clusters by selecting a central file and building the cluster around
it based on relatedness [27].
Topic labeling is very similar to topic clustering. The difference is that topic labeling
usually uses previously defined data against which to compare the items. Topic labeling is
frequently applied using the k-nearest neighbor strategy. The data is split into training and
evaluation data. The training data has been manually assigned topics and is then used in
the evaluation phase for comparison with the evaluation data, or incoming data, for similarities
to the set of topics in the training set [16, 1]. New documents or files are placed into the topic
cluster that occurs most frequently among their nearest neighbors.
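The k-nearest-neighbor strategy described above can be sketched as follows. The training items, the choice of k = 3, and the use of keyword-overlap size as the similarity measure are all illustrative assumptions, not the thesis implementation.

```python
from collections import Counter

def knn_label(item_keywords, training, k=3):
    """Assign the majority topic among the k training items most similar to the new item.

    training: list of (keyword_set, topic) pairs with manually assigned topics.
    Similarity is the size of the keyword overlap.
    """
    ranked = sorted(training, key=lambda t: len(item_keywords & t[0]), reverse=True)
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical manually labeled training data
training = [
    ({"funny", "laugh"}, "Comedy"),
    ({"humor", "funny"}, "Comedy"),
    ({"goal", "football"}, "Sports"),
    ({"election", "vote"}, "Politics"),
]
print(knn_label({"funny", "humor", "laugh"}, training))  # -> Comedy
```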
3.3 Clustering the file
After the keywords are extracted from the tags of files, the files are placed into clusters
using a clustering algorithm we modified from [19]. Let C be the set of all clusters and let
n be the number of clusters in C. Then C contains the clusters C_1, C_2, C_3, . . . , C_n:

C = {C_1, C_2, C_3, . . . , C_n}    (1)
Each cluster C_i is initialized with a file f that is not assigned to any existing cluster
[19]. File f is considered the seed file of C_i. Every time a new cluster is created, a sequence
of steps is taken to expand and reduce the cluster until it reaches a stable state from the
start state. In each evolution step of cluster C_i, C_i^j denotes the jth state of the initialized
cluster C_i:

C_i^j : the jth state of a cluster C_i    (2)
The characteristic vector of a cluster is a set of 〈keyword, weight〉 pairs that represents
the cluster. If K_F is the keyword set of a file F and K_Ci is the keyword set of a cluster C_i,
then K_Ci^j is the jth state of the keyword set of cluster C_i. Given the keyword sets for each
file, cluster C_i is created by the self-expanding algorithm.
3.4 Initializing the cluster
According to Kang [19], the first step of the clustering algorithm is the creation and
initialization of a new cluster. A file F that is not yet assigned to any cluster is randomly
selected from the pile of documents. It is assigned to a new cluster C_i^0 that is the initial
state of cluster C_i:

C_i^0 = {F}    (3)

Because this file F is the first file in the new cluster, it is called the initialization file or
the seed file. The keyword set K_F of a file F is the set of keywords k_1, k_2, . . . , k_n that are
extracted from file F. The initial state of the keyword set, K_Ci^0, is initialized by K_F:

K_Ci^0 = K_F    (4)

K_F = {k | k is a keyword that is extracted from F}

Algorithm 1 shows the steps for keyword-based clustering.
Algorithm 1 Keyword-based clustering algorithm

C_i^0 = {F}
K_Ci^0 = K_F
C_i^1 = {F_x | document F_x where k ∈ K_Fx} for all k such that k ∈ K_Ci^0
j = 1
do {
    K_Ci^j = ∪ K_Fx where F_x ∈ C_i^j
    C_i^(j+1) = C_i^j
    for all F_x ∈ C_i^j
        s = sim(F_x, K_Ci^j)
        if (s ≤ threshold)
            C_i^(j+1) = C_i^(j+1) − F_x
    end for
    j = j + 1
} while (isDeleteDocument())
C_i = C_i^j
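A runnable sketch of this expand-and-prune loop is given below. The similarity function (Jaccard overlap between a file's keywords and the cluster's keyword set), the threshold value, and the sample files are simplifications and assumptions for illustration, not the thesis implementation.

```python
def sim(file_keywords, cluster_keywords):
    """Hypothetical similarity measure: Jaccard overlap between two keyword sets."""
    return len(file_keywords & cluster_keywords) / len(file_keywords | cluster_keywords)

def build_cluster(seed, files, threshold=0.4):
    """Grow one cluster from a seed file, then prune low-similarity files.

    files: dict mapping file name -> set of keywords extracted from its tags.
    Returns the stable cluster and its final keyword set.
    """
    K = set(files[seed])                            # K_Ci^0 = K_F (seed keywords)
    C = {f for f, kws in files.items() if kws & K}  # C_i^1: files sharing a seed keyword
    while True:
        K = set().union(*(files[f] for f in C))     # keyword expansion
        pruned = {f for f in C if sim(files[f], K) > threshold}
        if pruned == C:                             # stable state: nothing was deleted
            return C, K
        C = pruned                                  # cluster reduction

files = {
    "f1": {"funny", "sports", "bloopers"},
    "f2": {"funny", "comedy"},
    "f3": {"election", "debate"},
}
cluster, keywords = build_cluster("f1", files)
# cluster -> {"f1", "f2"}; f3 shares no keyword with the seed and is never added
```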
3.5 Adding files to a Cluster
After the cluster is initialized as C_i^0, it can be expanded by adding more files that are
related to the seed file, adding their keywords to the keyword set at the same time. The
files that contain any keyword of K_Ci^0 (the keywords extracted from the seed file) form
C_i^1, the next state of the cluster C_i. The cluster is expanded as follows:

C_i^1 = {F_x | k ∈ F_x, k ∈ K_Ci^0}    (5)
The cluster is expanded by a number of iterations consisting of keyword expansion and
cluster expansion. Additional files are added to the cluster by a similarity evaluation between
the keyword set and the file. When a new file f is added to the cluster, its keywords are
also added to the set of keywords K of that cluster. In the first expansion of the cluster,
the keywords of the seed file are used. In the second expansion, the new keyword set is
used, which now consists of the keywords from the seed file and the keywords from the
added files. Therefore, the ith expansion of the cluster is performed using the (i−1)th state
of the keyword set.
The total number of iterations is determined by the size of the dataset. If a cluster is
expanded from C_i^0 to C_i^1, the keyword set K_Ci^0 is also expanded to a new keyword set
K_Ci^1 containing the keywords that appear across all files of cluster C_i^1. The keyword set
K_Ci^j of C_i^j is the union of the keyword sets of the files in C_i^j.
The keyword set K_Ci^j of the cluster C_i^j is used to calculate the characteristic vector of
the cluster. The characteristic vector consists of weight values calculated from the term
frequency (TF) and inverse document frequency (IDF) of the keywords, and it is used to
calculate the similarity measure between a file and the cluster.
3.6 Reducing and completing the cluster
This phase of the algorithm produces a complete cluster by removing files that do not
belong to it. For the cluster C_i^j, files with low similarity to the cluster, as determined by
the similarity computation with C_i^j, are removed. This filters out the unrelated files, and
the cluster C_i^(j+1) is generated as the next state of the cluster C_i^j. Ultimately the cluster
C_i is completed, and the next cluster C_(i+1) is created through the same process. Clustering
terminates when all documents are clustered or no more clusters can be created.
<video>
<title>Funny Sports Bloopers</title>
<category>Comedy</category>
<tags>Funny, Sports, Bloopers</tags>
<id>1796OXXdVzs</id>
</video>
Figure 2: Example of XML file used to create the dataset.
<video>
<tags>Funny, Sports, Bloopers</tags>
</video>
Figure 3: Example of the contents of the XML file
4 Tools for Information Retrieval
This section describes the dataset collected for this project, how words are processed for
information retrieval, indexing and the Lemur toolkit used to implement this project. The
significance of each tool to this project is also mentioned in this section.
4.1 Dataset
The dataset used for this project is information gathered by a YouTube video miner created
to retrieve video metadata from the site in XML format [23]. This information was placed in
a file that was parsed to identify and extract the tags associated with the videos using Perl's
built-in XML parser. An example of an extracted tag file is shown in Figure 2. A Perl script was
used to separate each XML video representation into individual files. The contents of the file
consisted of only the tags from a video in XML format as shown in Figure 3. The categories
used were comedy, music, politics, gardening, auto and sports. The dataset consisted of a
total of 300 XML video files with 50 in each category as shown in Table 2.
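The original extraction was done with Perl; an equivalent sketch in Python, using the standard-library XML parser on a record in the format of Figure 2, might look like this:

```python
import xml.etree.ElementTree as ET

# One video record in the format of Figure 2
xml_record = """<video>
  <title>Funny Sports Bloopers</title>
  <category>Comedy</category>
  <tags>Funny, Sports, Bloopers</tags>
  <id>1796OXXdVzs</id>
</video>"""

video = ET.fromstring(xml_record)
category = video.findtext("category")                          # "Comedy"
tags = [t.strip() for t in video.findtext("tags").split(",")]  # split the tag list
# tags -> ['Funny', 'Sports', 'Bloopers']
```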
Table 2: XML video count for each category

Topic      XML count
Comedy     50
Music      50
Politics   50
Gardening  50
Auto       50
Sports     50
Table 3: Examples of related words

Topic     Related Words
Comedy    fun, laugh, humor, hilarious
Sports    sport, football, goal, win, lose
Politics  election, democrat, republican, politician
4.2 Processing Words
This section describes synonyms, related words, parts-of-speech tagging, stop word removal
and stemming used to process words for this project.
4.2.1 Synonyms and Related Words
Synonyms and related words were used as a test set to compare to the useful keywords that
were taken from tags as a way to initiate the clustering process. These words were placed
into different categories based on the controlled categories for this experiment. This is to
provide a domain specific concept thesaurus that will be used to determine the concepts that
certain keywords relate to. Table 3 shows examples of related words within different topics.
Table 4: Sample of stopwords used in the stopword removal process

Type          Words
Pronouns      I, it
Determiners   a, an, that, the, this
Prepositions  about, by, for, from, of, on, to
Verbs         are, is
4.2.2 Part-of-Speech Tagging
Part-of-speech tagging was used to extract nouns and adjectives from tags in order to select
useful words for categorizing tags. Nouns and adjectives are used because they are more
useful as distinguishing features of a document. For example, a noun or an adjective will carry
more information than a determiner or a preposition in a sentence. The Brill tagger2 and the
Stanford tagger3 are popular part-of-speech tagging tools that are useful for processing the
tags associated with each video. The Stanford tagger was used for this experiment because
of its simplicity in generating part-of-speech tags.
4.2.3 Stop Words
Stop words are words that occur frequently in a document but have no significant value as
distinguishing features of a document. They are usually prepositions, determiners, pronouns
and simple verbs. The removal of these words from the document or tagset saves processing
time during document indexing. Table 4 shows some common stop words.
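Stop word removal can be sketched as a simple set-membership filter. The stop word list here is a small sample drawn from Table 4, not the full list used by the toolkit.

```python
STOP_WORDS = {"i", "it", "a", "an", "that", "the", "this",
              "about", "by", "for", "from", "of", "on", "to", "are", "is"}

def remove_stop_words(terms):
    """Drop terms that appear in the stop word list (case-insensitive)."""
    return [t for t in terms if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "history", "of", "football"]))
# -> ['history', 'football']
```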
4.2.4 Stemming
Stemming is the process of finding the stem word by reducing a word to its base form.
This saves processing time by reducing the number of words to be processed and allows
2 http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html
3 http://nlp.stanford.edu/software/tagger.shtml
<auto><doc1><2>
<ball><doc2><1>
<car><doc1><2>
.
.
.
Figure 4: Example of inverted file structure. Field 1 is the word. Field 2 is the document
the word is in and field 3 is the number of times the word is seen in the document.
the document to be converted into an inverted file format for efficient storage as shown
in Figure 4. The first position in the inverted file represents a term that can be found
within that document. The second position is the name of the document and the third
position represents frequency of the term within that document. The Porter stemmer4 is a
common tool used to automatically find stem words [34]. Another stemmer is the Krovetz
stemmer5, which uses a dictionary to find the stem words. The Krovetz stemmer usually
produces a larger output file [21]. It also avoids some of the errors that the Porter stemmer
might produce by outputting meaningless stem words.
in stemmer application with the option of choosing between the Porter and the Krovetz
stemmer.
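Building the inverted file of Figure 4 amounts to counting, for each term, how often it occurs in each document. A minimal sketch follows; the terms are assumed to have already been stemmed and stop-word filtered.

```python
from collections import defaultdict, Counter

def build_inverted_index(docs):
    """Map each term to {document: frequency}, as in the Figure 4 layout.

    docs: dict mapping document name -> list of (already stemmed) terms.
    """
    index = defaultdict(Counter)
    for name, terms in docs.items():
        for term in terms:
            index[term][name] += 1
    return index

# Hypothetical documents matching the Figure 4 example
docs = {"doc1": ["auto", "car", "auto", "car"], "doc2": ["ball"]}
index = build_inverted_index(docs)
# index["auto"]["doc1"] -> 2, index["ball"]["doc2"] -> 1
```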
4.3 Indexing
Indexing is the process of collecting and storing data for efficient retrieval. Indexing allows
for the retrieval of relevant documents based on a search query. If documents were not
indexed, all the documents in a corpus would have to be searched in order to return the
desired result to the user.
4 http://tartarus.org/~martin/PorterStemmer/
5 http://www.comp.lancs.ac.uk/computing/research/stemming/general/krovetz.htm
4.3.1 Latent Semantic Indexing
Latent semantic indexing (LSI) uses vector semantics to represent the relationship between
files. LSI has been shown to have better performance than term matching approaches [18, 11,
5]. It outperforms other approaches mainly when a higher recall is needed, text descriptions
are short, or when text is noisy, i.e. when unwanted words are present [18].
Adding weights to elements in an index matrix is typically done by term frequency-inverse
document frequency (TF-IDF). An element of the matrix grows with the number of times a
term appears in a document and shrinks with the number of documents containing the term;
terms that occur in fewer documents are usually more important and as a result carry more
weight.
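A standard TF-IDF formulation can be sketched as follows; the thesis does not specify the exact variant used, so this is one common choice, with a hypothetical corpus.

```python
import math

def tf_idf(term, doc_terms, corpus):
    """TF-IDF weight of a term in one document.

    doc_terms: list of terms in the document; corpus: list of such term lists.
    """
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for d in corpus if term in d)      # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["funny", "sports"], ["funny", "comedy"], ["election", "debate"]]
# "sports" appears in 1 of 3 documents, so it outweighs "funny" (2 of 3)
```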
After the documents are represented in vector form we are able to do the following:
• Check the relatedness of documents j and q in the concept space by comparing vectors
d̂j and d̂q. This can be done by cosine similarity and gives us a clustering of documents.
• Compare two different terms i and p by comparing the vectors t̂i and t̂p. This will give
you a cluster of the terms in the concept space.
• View a user's query as a mini-document and compare it to other documents
in the concept space. For this to be possible, we must first translate the user's query
into the concept space, which involves transforming it into a vector representation.
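The first of these comparisons, cosine similarity between two document vectors, can be sketched as follows; the three-dimensional concept-space vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-dimensional concept-space vectors
d_j, d_q = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(cosine_similarity(d_j, d_q))  # parallel vectors, cosine ≈ 1.0
```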
Even though LSI is more efficient than other similar approaches, there are still some
drawbacks. The first is that LSI is difficult to debug and analyze, because its concept space
cannot be easily understood by humans. The second is the performance cost of the singular
value decomposition (SVD), which is O(N^2 k^3), where N is the number of documents and
k is the number of dimensions in the concept space. Because N grows continuously, this
makes LSI almost impractical for a large dynamic dataset [18]. Another performance
drawback arises when new documents are added, because the SVD has to be recomputed.
4.3.2 Probabilistic Latent Semantic Indexing
Probabilistic latent semantic indexing (PLSI) is a statistical technique used to analyze
co-occurrence data. It is an extension of LSI that adds a more solid probabilistic model
to document indexing. Comparing the two, LSI uses singular value decomposition [36, 14],
while PLSI is based on a mixture decomposition derived from a latent class model, which
gives it its solid statistical foundation [17, 14].
PLSA models the probability of each co-occurrence of a word and a document (w, d) as a mixture of conditionally independent multinomial distributions:

    P(w, d) = ∑_c P(c) P(d|c) P(w|c) = P(d) ∑_c P(c|d) P(w|c).    (6)

The first form, ∑_c P(c)P(d|c)P(w|c), is a symmetric formulation in which the conditional probabilities P(d|c) and P(w|c) generate d and w from the latent class c. The second form, P(d) ∑_c P(c|d)P(w|c), is an asymmetric formulation: for each document d, a latent class c is chosen conditionally on the document via P(c|d), and a word w is then generated from that class via P(w|c).
The aspect model of PLSA suffers from overfitting, because the number of parameters grows linearly with the number of documents. PLSA is a generative model of the documents in the collection it is trained on, but it is not a generative model of new documents.
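The asymmetric formulation in Equation (6) is typically fit with the EM algorithm. The following is a rough sketch on made-up counts, not the implementation used in this project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document-word count matrix n(d, w): 4 documents, 5 words.
n_dw = np.array([[3., 1., 0., 0., 0.],
                 [2., 2., 1., 0., 0.],
                 [0., 0., 1., 3., 2.],
                 [0., 1., 0., 2., 3.]])
D, W = n_dw.shape
C = 2  # number of latent classes

# Random initialization of P(c|d) and P(w|c), rows normalized to sum to 1.
p_c_d = rng.random((D, C)); p_c_d /= p_c_d.sum(axis=1, keepdims=True)
p_w_c = rng.random((C, W)); p_w_c /= p_w_c.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: responsibility P(c | d, w) proportional to P(c|d) P(w|c).
    resp = p_c_d[:, :, None] * p_w_c[None, :, :]    # shape (D, C, W)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate P(w|c) and P(c|d) from expected counts.
    expected = n_dw[:, None, :] * resp              # n(d, w) * P(c|d, w)
    p_w_c = expected.sum(axis=0)
    p_w_c /= p_w_c.sum(axis=1, keepdims=True)
    p_c_d = expected.sum(axis=2)
    p_c_d /= p_c_d.sum(axis=1, keepdims=True)
```

After convergence, P(c|d) clusters the documents over the latent classes, which is how PLSI was used for clustering in this project.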
4.4 Lemur Toolkit
The Lemur toolkit6 is a natural language processing toolkit created for the purpose of running information retrieval experiments [25]. It has a number of built-in applications that are useful for building an index over data and testing algorithms. In this project, the toolkit was used to test indexing techniques such as PLSI along with clustering techniques such as k-means clustering.
6http://www.lemurproject.org/
The Lemur toolkit also has the capacity to do pre-processing tasks like word stemming
and filtering stop words. It has a built-in stop word removal application that uses its own
standard stop word list.
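These two pre-processing steps can be illustrated with a toy sketch. This is not Lemur's code; the stop word list and the suffix rules below are invented stand-ins for a real stop word list and a real stemmer such as Porter's:

```python
# Invented stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def strip_suffix(word: str) -> str:
    """Crude suffix stripping standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on whitespace, drop stop words, strip suffixes."""
    tokens = text.lower().split()
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats chased a ball"))  # -> ['cat', 'chas', 'ball']
```

The over-aggressive stem "chas" shows why production systems use a carefully tuned stemmer rather than naive suffix removal.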
5 Methodology
After the XML video files were generated by the YouTube miner, they were combined into one XML file per category. A Perl script was then used to separate each video block into an individual XML file. This was the collection of videos to be indexed. The Lemur toolkit was then used to run the BuildIndex application to generate an index from the collection of files.
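The splitting step was done with a Perl script in this project; a rough Python equivalent, under the hypothetical assumption that each record in the combined file is a <video>...</video> element, might look like:

```python
from pathlib import Path

def split_videos(combined_xml: str, out_dir: str) -> int:
    """Split a concatenated XML file into one file per <video> block.

    The element name and layout are assumptions; the actual miner
    output may differ. Returns the number of files written.
    """
    text = Path(combined_xml).read_text(encoding="utf-8")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    start = 0
    while True:
        open_tag = text.find("<video>", start)
        if open_tag == -1:
            break
        close_tag = text.find("</video>", open_tag)
        if close_tag == -1:
            break
        block = text[open_tag:close_tag + len("</video>")]
        (out / f"video_{count:05d}.xml").write_text(block, encoding="utf-8")
        count += 1
        start = close_tag + len("</video>")
    return count
```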
After the index was generated, a number of single-word queries were run on it using the Lemur application IndriRunQuery. The results were used as the baseline for this experiment. The main focus of the rest of the project was to improve upon the baseline performance and to determine whether applying clustering techniques to indexed data would improve the information retrieval part of this project.
k-means clustering was applied to the index, and the single-word queries were run again on the same index. PLSI was then applied to the index, and another set of single-word queries was made to compare the results. The results were tabulated based on the precision and recall of each query.
This process was then repeated with queries having multiple words.
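The per-query precision and recall used to tabulate these results (as defined in Section 2) can be computed along these lines; the file ids below are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved: set of file ids the system returned.
    relevant:  set of file ids judged relevant to the query.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall({"v1", "v2", "v3", "v4"}, {"v1", "v2", "v5"})
# 2 of the 4 retrieved files are relevant -> precision 0.5;
# 2 of the 3 relevant files were found  -> recall 2/3.
```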
6 Results
The results from the experiment were evaluated using precision and recall as defined in Sec-
tion 2 to compare the different performance measures of keyword-based information retrieval
Table 5: Results for queries consisting of one word

                        Precision   Recall
    Regular index          0.75      0.66
    k-means clustering     0.85      0.82
    PLSI                   0.88      0.88
before and after k-means clustering and PLSI were applied. A total of 12 queries were made: 6 single-word queries and 6 multiple-word queries. The results were tabulated separately for queries consisting of one word and queries consisting of multiple words, and are shown in Tables 5 and 6.
When a query consisting of one word was run on the regular index, precision was 0.75 and recall was 0.66. When k-means clustering was applied to the index, precision rose to 0.85 and recall to 0.82. When PLSI was applied, precision increased to 0.88 and recall to 0.88, which were increases of 17.3% in precision and 33.3% in recall over a query run on the regular index. The best precision and recall were obtained when PLSI was applied.
When a query consisting of two words was run on the regular index, precision fell to 0.67 while recall rose to 0.82. When k-means clustering was applied, precision increased to 0.82 and recall decreased slightly to 0.80. When PLSI was applied, precision increased to 0.86 and recall to 0.88, increases of 28.4% in precision and 7.3% in recall over the regular index, and of 4.9% and 10% respectively over k-means clustering. For two-word queries, the best precision and recall were again obtained with PLSI, supporting the results from the one-word case.
Table 6: Results for queries consisting of multiple words

                        Precision   Recall
    Regular index          0.67      0.82
    k-means clustering     0.82      0.80
    PLSI                   0.86      0.88
7 Conclusion and Future Work
As television moves to the Internet with Google TV and sites like YouTube and Hulu, the typical viewer's habits may change. Channel surfing, or searching by category or program title on a local cable network, may soon be over. Television sets may gain access to a much broader search space on the Internet, where viewers will be able to perform more specific searches by genre, actor, context, or even specific scene. The system will then retrieve results based on a certain context or word sense that satisfies the user.
Improved information retrieval can also benefit Internet radio, where users can get more accurate results based on their listening preferences. Instead of relying on keyword-based searches or on the previous patterns of other users to determine the next song to play, such services could use a more probabilistic approach to determining what the user wants.
In this project, the hypothesis was supported: the results showed that when clustering and indexing techniques were applied to a collection of data, information retrieval improved. Even though the improvements were not large, they were consistent; the small difference may be due to the small size of the dataset. This project could be improved by using larger datasets or tags with larger sets of keywords. More tests could also be run with queries of more than two words. Next, I would incorporate these findings into designing a system that can index and cluster multimedia files without predefined tags, for more efficient information retrieval.
References
[1] Najaf Ali Shah. Topic-based clustering of news articles. Proceedings of the ACM Southeast Regional Conference (ACM SE '04), pages 412–413, 2004.
[2] Emilia Apostolova, Sean Neilan, Gary An, Noriko Tomuro, and Steven Lytinen. Djan-
gology: A light-weight web-based tool for distributed collaborative text annotation.
Proceedings of the Seventh Conference on International Language Resources and Eval-
uation (LREC’10), 2010.
[3] Grigory Begelman, Philipp Keller, and Frank Smadja. Automated tag clustering: Improving search and exploration in the tag space. Proceedings of the Collaborative Web Tagging Workshop, 2006.
[4] Ron Bekkerman and James Allan. Using bigrams in text categorization. Center of
Intelligent Information Retrieval, UMass Amherst, IR-408:1–10, 2004.
[5] Jerome Bellegarda. Exploiting latent semantic information in statistical language mod-
eling. Proceedings of the IEEE, 88:1279–1296, 2000.
[6] MPS Bhatia and Akshi Kumar Khalid. Information retrieval and machine learning:
Supporting technologies for web mining research and practice. Webology, 5, 2008.
[7] Mike Brown, Christiane Fortsch, and Dieter Wißmann. Combining information retrieval and case-based reasoning for middle ground text retrieval problems. AAAI Technical Report WS-98-12, pages 3–7, 1998.
[8] H. Chen and Susan Dumais. Bringing order to the web: Automatically categorizing
search results. Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pages 145–152, 2000.
[9] Jane Cleland-Huang, Raffaella Settimi, and Oussama Benkhadra. Goal-centric traceability for managing non-functional requirements. Proceedings of the 27th International Conference on Software Engineering, pages 362–371, 2005.
[10] Grace Dasovich, Robert Kim, Daniela S. Raicu, and Jacob Furst. A model for the
relationship between semantic and content based similarity using LIDC. Proceedings of
Medical Imaging 2010: Computer-Aided Diagnosis Conference, 2010.
[11] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and
Richard Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391–407, 1990.
[12] Carlotta Domeniconi and Muna Al-Razgan. Weighted cluster ensembles: Methods and
analysis. ACM Transactions on Knowledge Discovery from Data, 2:3–40, 2009.
[13] Susan T. Dumais. Latent semantic indexing. Proceedings of the Text Retrieval Confer-
ence, 1995.
[14] Ayman Farahat and Francie Che. Improving probabilistic latent semantic analysis with
principal component analysis. Eleventh Conference of the European Chapter of the
Association for Computational Linguistics (EACL -2006), pages 105–112, 2006.
[15] William B. Frakes and Ricardo A. Baeza-Yates, editors. Information Retrieval: Data
Structures & Algorithms. Prentice-Hall, 1992.
[16] Alexander Hauptmann. Topic labeling of multilingual broadcast news in the Informedia digital video library. Proceedings of the Ninth ACM International Conference on Multimedia, pages 1–6, 1999.
[17] Thomas Hofmann. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, 1999.
[18] Jason Hong. An overview of latent semantic indexing. Website, 2000. http://www.cs.
berkeley.edu/~jasonh/classes/sims240/sims-240-final-paper-lsi_files/sims.
[19] Seung-Shik Kang. Keyword-based document clustering. Proceedings of the Sixth Inter-
national Workshop on Information Retrieval with Asian Languages, 11:132–137, 2003.
[20] Robert Kim, Grace Dasovich, Runa Bhaumik, Richard Brock, Jacob D. Furst, and
Daniela S. Raicu. An investigation into the relationship between semantic and content
based similarity using LIDC. Proceedings of the International Conference on Multimedia
Information Retrieval, pages 185–192, 2010.
[21] Bob Krovetz. Viewing morphology as an inference process. Proceedings of 16th ACM
SIGIR Conference, 1993.
[22] Tae-Hoon Lee, Jung-Hyun Kim, Hyeong-Joon Kwon, and Kwang-Seok Hong. Keyword-
based semantic retrieval system using location information in a mobile environment.
Proceedings of the 2009 International Symposium on Web Information Systems and
Applications, 2009.
[23] Brian McMahan. Personal communication, October 2010. http://ytminer.braingineer.net.
[24] Prakash Nadkarni. An introduction to information retrieval: Applications in genomics.
The Pharmacogenomics Journal, 2:96–102, 2001.
[25] Paul Ogilvie and Jamie Callan. Experiments using the Lemur Toolkit. Proceedings of
the TREC, 2001.
[26] Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based mul-
timedia information retrieval: State of the art and challenges. ACM Transactions on
Multimedia Computing, Communications and Applications, 2:1–19, 2006.
[27] Andriy Shepitsen, Jonathan Gemmell, Bamshad Mobasher, and Robin Burke. Personal-
ized recommendation in social tagging systems using hierarchical clustering. Proceedings
of the 2008 ACM Conference on Recommender Systems, 2008.
[28] Ahu Sieg, Bamshad Mobasher, Steve Lytinen, and Robin Burke. Using concept hierarchies to enhance user queries in web-based information retrieval. Proceedings of the International Conference on Artificial Intelligence and Applications, 2004.
[29] Vijayan Sugumaran and Veda C. Storey. A semantic-based approach to component
retrieval. The DATA BASE for Advances in Information Systems-Summer 2003, 34:8–
24, 2003.
[30] Noriko Tomuro and Steve Lytinen. Polysemy in lexical semantics: Automatic discovery of polysemous senses and their regularities. NYU Symposium on Semantic Knowledge Discovery, Organization and Use, 2008.
[31] Noriko Tomuro, Steven Lytinen, Kyoko Kazaki, and Hitoshi Isahara. Clustering using
feature domain similarity to discover word sense for adjectives. International Conference
on Semantic Computing, pages 370–377, 2007.
[32] Paolo Tonella, Christian Girardi, and Emanuele Pianta. An empirical study on keyword-
based web site clustering. Proceedings 12th IEEE International Workshop on Program
Comprehension (IWPC’04), pages 204–213, 2004.
[33] Hwee Tou Ng and Hian Beng Lee. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47, 1996.
[34] Cornelis van Rijsbergen, S.E. Robertson, and M.F. Porter. New models in probabilistic
information retrieval. British Library Research and Development Report, (5587), 1980.
[35] Pu Wang, Carlotta Domeniconi, and Jian Hu. Cross-domain text classification using
wikipedia. IEEE Intelligent Informatics Bulletin, 9:5–17, 2008.
[36] Peter Wiemer-Hastings. Latent semantic analysis. Proceedings of the 16th International
Joint Conference on Artificial Intelligence, pages 1–14, 2004.
[37] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273, 2003.