The Use of Information Retrieval for a
Searchable Database of Audio Teachings
Adedayo Ologunde
A Thesis Submitted in Partial Fulfillment of the
Requirements for the Degree of
Bachelor of Science
in
Computer Science
Minnesota State University, Mankato
Mankato, Minnesota
December 2011
Abstract

Traditional search engines match key words or phrases based on word frequency,
location, similarity, and update-time. The ranking of the results based on these criteria does not
necessarily reflect the trends of a user’s interest and the large number of results returned forces
the user to sort them by hand. In order to solve the problem of ranking relevant results of interest
to a user, vector-based representations of collections of files or documents are created, filtered
through reduction techniques and ranked. In this work, a database containing transcriptions of
audio recordings was created for a small, non-profit organization and augmented with a simple
tagging and search engine in order to provide a basic framework for semantic indexing and
problem domain specific search. The goal was more accurate search results for their audio
recordings. Many of the audio files had brief summaries of the audio with some keywords, while
others did not. As the set of audio data grows, it is important for the file system to adapt the
incoming data to the users’ typical word associations. By asking users to enter keywords when
uploading new files, these associations can be created. Semantic indexing can then be used to
return more accurate search results to the user. This approach, implemented on a desktop system
to ensure data privacy, provided better relatedness and faster retrieval of the desired files and
demonstrated how a small organization can benefit from easy search and indexing of their large
audio library.
Contents

1. Introduction
2. Background
2.1. Document Indexing in IR
2.2. Document Indexing Models
2.2.1. Boolean Models
2.2.2. Vector Space Models
2.3. Index Filtering
2.4. Document Similarity Measures
2.5. Available Tools
2.6. Evaluation Metrics
3. Dataset
4. Software Design and Use
4.1. Development Process
4.2. Software Use
5. Results
5.1. Quantitative Software Evaluation
5.2. User Testing
6. Discussion and Future Work
6.1. Discussion
6.2. Future Work
Bibliography
Appendix A User Testing Materials
1. Introduction

Information retrieval (IR) systems can be useful for recovering specific information from
stored records. While storing and retrieving information is useful, humans often need to use the
retrieved data and apply it in order to answer questions and study, compare and rank topics
beyond simple file retrieval. While presenting meaningful results to a user is a challenging
multifaceted process, several solutions to retrieving meaningful information exist. These include
full text search, which becomes useful for locating phrases, information retrieval systems and
user tagging, and structured query search (Manning, Raghavan, Schütze, 2008).
A small non-profit organization had a sizeable collection of audio files associated with
documents containing some keywords. The organization was looking for a solution to organize
their collection and group related transcriptions with one another. In order to solve the problem
of retrieving specific information from their collection, an information retrieval system was
developed. Traditional search engines use key based systems to match words with large amounts
of unrelated results based on word frequency, location, similarity, or update-time, but they do not
necessarily reflect the trends of a user’s interest (Wu et al., 2009). The objective of this project
was to create a simple, inexpensive vector-based indexing system and search engine to return
reasonable results to users in a manner that fits their collection of documents, allows them to
upload new documents, and allows them to control user access based on topic choice. Free and
commercial alternatives that handle information retrieval problems exist such as Google Desktop
Search for individual users, and IBM's InfoSphere Warehouse for enterprise. However, such
options might not provide utility for the express purpose of searching through a moderate-sized
collection of documents.
In order to create a system that provides meaningful results for a user, there are multiple
approaches for document indexing and retrieval. Clustering is one such approach that groups
files into categories based on relatedness (Beckford, 2010). Others include the probabilistic
model, which statistically quantifies data within a collection of documents, and the vector space
model, which treats queries and documents as vectors (Soboroff). This work examines two
methods for information retrieval, Boolean models and vector space models, and presents the
necessary steps to implement examples of each model, with a discussion of their role in this
work. Further information is given on tools used for this project and its evaluation based on its
performance and ability to process and return results to the user.
2. Background
Users looking for information on computer systems typically either browse through the
available collection of data for the information that they need or rely on assistance from a
personal information management system (PIM) to locate it for them (Whittaker, 2008;
Manning, Raghavan, Schütze, 2008). In Internet search, they may be assisted by a system with
the ability to scale to large collections, while a personal computer’s desktop search might be able
to provide detailed results such as file location and file previews with results. Enterprise
computer applications, such as database-supported web applications, may also need search
(e.g., Smiley, Text Search, your Database or Solr). Across these search options, the common
goal is to provide relevant results and present them to the user in a coherent manner. Figure 1
illustrates the various steps taken to translate a collection of document files into a searchable
index. These include filtering words contained within a collection for word type,
analyzing the filtered words for the number of appearances within the text, storing the collected
words in an index, and applying transformations to the index, which is used to increase the speed
of search. In information retrieval, documents refer to the units used to build the system, terms
refer to the data, typically words or short phrases contained within the documents to be processed
or retrieved, and collections refer to the sets of documents, which may also be referred to as
corpora (Manning, Raghavan, Schütze, 2009). This section presents information about
document indexing, approaches, filtering, available tools and evaluation methods.
Figure 1 Document Preparation and Information Retrieval Process for use with the Boolean model or for use with the vector space model
2.1. Document Indexing in IR

Indexing systems can be used to increase the speed of search. An information retrieval
system used to collect meaningful information for a user typically makes use of an indexing step
that allows the system to examine documents’ terms before retrieval to arrange or associate the
documents with queries in a significant manner. Several models and approaches are available to
represent a user’s query and a system’s response. The Boolean model provides simple
implementation, and allows logical operators. Probabilistic models, along with vector space
models allow for ranked retrieval by relevance, at the cost of complexity.
2.2. Document Indexing Models

This section introduces two models of document indexing, the vector space model and
the Boolean model. Along with the vector space model that might use indexing techniques such
as latent semantic indexing, the Boolean model relies on text queries to return results as shown in
figure 2. This model works to return matching results from queries to documents via Boolean
operators such as AND, OR, and NOT.
Figure 2 Boolean and Vector Space Models
2.2.1. Boolean Models

The Boolean model has been used in dialog systems as well as web search engines that
implement extended Boolean models (Manning, Raghavan, Schütze, 2009). The Boolean model
indexes a set of words, checks if the term is present within a document, and applies Boolean
operators AND, OR, and NOT over the set of documents (Soboroff). Strict Boolean models can
only retrieve exact matches, and weight each term equally, making it impossible to rank
documents, but work quickly and simply in comparison to vector space models. Complex or
extended models add keyword search, phrase search and add chronological ranking (Kobayashi,
Takeda, 2000; Manning, Raghavan, Schütze, 2009; Soboroff).
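To make the mechanics concrete, here is a minimal sketch of strict Boolean retrieval over an inverted index, with AND, OR, and NOT implemented as set operations. The index structure, document IDs, and sample text are illustrative only, and are not taken from the system described later in this work.

```java
import java.util.*;

// A sketch of strict Boolean retrieval over an inverted index.
// Document IDs and the postings map are illustrative placeholders.
public class BooleanIndex {
    // term -> set of IDs of documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> allDocs = new HashSet<>();

    public void add(int docId, String text) {
        allDocs.add(docId);
        for (String term : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    private Set<Integer> docs(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public Set<Integer> and(String a, String b) {   // intersection of postings
        Set<Integer> r = new HashSet<>(docs(a));
        r.retainAll(docs(b));
        return r;
    }

    public Set<Integer> or(String a, String b) {    // union of postings
        Set<Integer> r = new HashSet<>(docs(a));
        r.addAll(docs(b));
        return r;
    }

    public Set<Integer> not(String term) {          // complement over the collection
        Set<Integer> r = new HashSet<>(allDocs);
        r.removeAll(docs(term));
        return r;
    }

    public static void main(String[] args) {
        BooleanIndex idx = new BooleanIndex();
        idx.add(1, "meditation posture instructions");
        idx.add(2, "meditation retreat");
        System.out.println(idx.and("meditation", "retreat")); // prints [2]
    }
}
```

Every document either matches or does not, with no ranking among the matches; this is the limitation that motivates the vector space model described next.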
2.2.2. Vector Space Models

Implementing a ranking system that takes more into account than order of appearance
requires additional features. This model allows for document comparison, but does so at the cost
of speed and complexity. In a vector space model, an index is used to collect terms in documents.
Steps are taken to collect the documents to be indexed, tokenize the text, linguistically process
the tokens through sets of rules within the bodies of text, and index the documents that contain
each term (Manning, Raghavan, Schütze, 2009).
Within documents, frequently occurring terms may skew search results towards certain
documents in a collection. To address this issue, vector space models make use of term
frequency (TF) and inverse document frequency (IDF), shown in equations 1 and 2, to create a
TF*IDF score representing the number of times a term appears in a document multiplied
by the log of the number of documents in the database over the number of documents containing
the term. Term frequency weighting refers to the method of scoring these term counts. Inverse
document frequency reduces the importance of commonly occurring terms over a collection of
documents (Garcia). Term frequency is a count of the number of times a word is repeated
within a document, which can be used to quantify its importance relative to other terms within
the document, as shown in equation 1.
!"! =!!!
Equation 1
where N represents the number of words in the document and Ni is the number of times term i
appears in the document. The inverse document frequency is shown in equation 2.
!"# = log !!"!
Equation 2
where D represents the number of documents in the collection, and dfi represents the number of
documents containing the term i. Terms that occur in no document (df_i = 0) are assigned a
small fixed value to avoid division by zero. By multiplying the term frequency and inverse
document frequency, a
distance score is created that can measure the importance of terms within documents in a
collection. The term weight, wi, is shown in equation 3.
!! = !"! ∗ !"# Equation 3
In this model, documents can be ranked highly against a query if the terms are rare within a
collection, but common within a document (Spoerri). Along with TF-IDF weighting, the process
of indexing a document in a collection might involve normalization to ensure that terms within
documents and the documents themselves are fairly promoted (Singhal, et al., 1995).
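As a worked illustration of equations 1 through 3, the sketch below computes TF*IDF weights for one document against a collection. The simple non-word-character tokenization and the floor on df are simplifying assumptions, not the recognizer chain used in this project.

```java
import java.util.*;

// A sketch of the TF*IDF weighting in equations 1-3. Whitespace/non-word
// tokenization stands in for a real tokenizer, and df is floored at 1 to
// avoid division by zero, per the note above on df_i = 0.
public class TfIdf {
    public static Map<String, Double> weights(String doc, List<String> collection) {
        String[] tokens = doc.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);

        int n = tokens.length;       // N: number of words in the document
        int d = collection.size();   // D: number of documents in the collection
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = e.getValue() / (double) n;         // eq. 1: tf_i = N_i / N
            long df = collection.stream()                  // df_i: docs containing term i
                    .filter(c -> Arrays.asList(c.toLowerCase().split("\\W+"))
                                       .contains(e.getKey()))
                    .count();
            double idf = Math.log((double) d / Math.max(df, 1)); // eq. 2
            weights.put(e.getKey(), tf * idf);             // eq. 3: w_i = tf_i * idf_i
        }
        return weights;
    }
}
```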
Vector space models make use of tokenization, or splitting documents into terms. Word
stemming, or using the root of a word, idea, or synonym can be used to increase the occurrence
of rare terms. Tokenization during part-of-speech tagging often delimits the tokens by
whitespace, or spaces between tokens; however, hyphenated phrases such as "soft-spoken" may
be treated as a single term to return more accurate results to the user (Manning, Raghavan,
Schütze, 2009). It is possible to identify the part of speech of each token within a collection of
documents. These parts include nouns, verbs, pronouns, prepositions, adverbs, conjunctions,
participles, and articles, based on information given about the word and its neighbors
(Jurafsky, Martin, 2008). Tagging of parts of speech can allow for more streamlined results by
reducing noise in a term document matrix while providing a means for the user to identify useful
terms related to their topic of interest. For example, filtering can isolate nouns and proper nouns
and ignore prepositions, conjunctions and articles.
2.3. Index Filtering

Function words such as "and" are frequently found in documents, increasing the
document size and potentially increasing search time, but they have little intrinsic effect on the
results returned to the user. In the set of rules for creating an index, a step is often taken to filter
out these most frequently occurring words, or stop words (Manning, Raghavan, Schütze, 2009; Jurafsky, Martin,
2008). The terms selected for the stop words can be adjusted to the collection of documents and
might be different for part-of-speech groups. The use of a stop word list in information retrieval
can reduce space in terms of the index size, as well as reduce the amount of time devoted to
indexing terms.
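A minimal sketch of this filtering step, assuming a hand-picked stop list, might look as follows; it is illustrative only.

```java
import java.text.BreakIterator;
import java.util.*;

// A sketch of stop-word filtering during tokenization, using Java's
// built-in BreakIterator for word boundaries. The stop list is a tiny
// illustrative sample, not a list tuned to any particular collection.
public class StopWordFilter {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is", "to"));

    public static List<String> contentWords(String text) {
        List<String> kept = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance();
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                 start = end, end = words.next()) {
            String token = text.substring(start, end).toLowerCase().trim();
            // keep only word tokens (containing a letter) that are not stop words
            if (!token.isEmpty() && token.chars().anyMatch(Character::isLetter)
                    && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```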
Stemming, or recognizing and grouping the roots of words, is also used to reduce the size
of an index (Manning, Raghavan, Schütze, 2009). In the processing phase of creating an index,
each word contained within a document is given a weight based on its frequency, and the list of
unique words is associated with a vector (Jurafsky, Martin, 2008). The weights within the
documents are represented within a document vector. In the vector space model, when a user
searches through the document collection, a query vector represents their query.
2.4. Document Similarity Measures

Before a user is presented with search results, cosine similarity is applied to compare the
vectors representing documents within the collection to the vector representing the user's query.
Several methods of measuring similarity between documents exist, such as Jaccard
similarity and abstract similarity (Pal, 2008a). In this project, cosine similarity was used. This
process measures the cosine angle between two vectors, A and B as shown in equation 4
(Garcia).
!"# !,! = !"#$%& ! = !·!! !
Equation 4
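A direct implementation of equation 4 is short. The sketch below assumes the two term-weight vectors share the same term ordering.

```java
// A direct reading of equation 4: the cosine of the angle between two
// term-weight vectors, assuming both use the same term ordering.
public class Cosine {
    public static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];     // A · B
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // an all-zero vector matches nothing
        return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // (A · B) / (|A| |B|)
    }
}
```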
Latent semantic indexing (LSI) is an application of singular value decomposition (SVD),
allowing for document scoring and ranking. LSI projects queries and documents into a space
with latent semantic dimensions, and then SVD finds the optimal projection to a low dimensional
space (Rosario, 2000). LSI can provide optimal matches of variation to documents, but becomes
costly at large dimensions or collections with large amounts of variance between terms.
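In matrix terms, LSI keeps only the k largest singular values of the term-document matrix A; this is the standard rank-k SVD truncation, with k a tuning parameter (this formulation is standard background, not specific to this project):

\[ A \approx A_k = U_k \Sigma_k V_k^{T} \]

A query vector q can then be folded into the same k-dimensional latent space before the cosine comparison of equation 4:

\[ \hat{q} = \Sigma_k^{-1} U_k^{T} q \]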
2.5. Available Tools

The nonprofit organization had concerns about security, so the use of web-related
technologies was avoided. Several solutions to desktop-based search were considered. For
example, Rapid-I provides a widely used software tool for data mining, RapidMiner, which is
graphical and provides extendibility, but offers limited support for inexpensive versions
(RapidMiner). Several open source projects were available to process text. SemanticVectors, for
example, creates "WordSpace" models from natural language text (SemanticVectors), but offers
limited extendibility and support. Lucene was used throughout the project to compare the quality
of search results and for a word suggestion feature through Lucene’s SpellChecker class
(Lucene). Lucene provided fast and scalable indexing with various sized collections (Lucene).
However, Lucene does not rank documents, which is critical for sorting meaningful results. The
S-Space package provides support for latent semantic analysis, and was designed for scalability
(Jurgens). The S-Space package might have been useful in quickly processing the collection of
files, but it was not considered until late in the development process and was not used. The Java Text Mining
Toolkit provided part-of-speech tagging capabilities, similarity measures and the ability to rank
documents for a reasonable starting point to build upon (Pal, 2008a). Much of the indexing and
search processing was built upon the JTMT project.
2.6. Evaluation Metrics

Two evaluation metrics used in information retrieval are recall and precision, as shown in
equations 5 and 6 (Manning, Raghavan, Schütze, 2009; Jurafsky, Martin, 2008; Kobayashi,
2000). Precision is the ratio of relevant documents retrieved to the total number of documents
retrieved, and recall is the proportion of relevant documents retrieved compared to the relevant
documents in the database (Kobayashi, 2000). In order to achieve high precision, the IR system
would have to return only relevant responses to the user's query.
\[ \text{Recall} = \frac{\text{correct responses found by search}}{\text{relevant documents in the database}} \]
Equation 5

\[ \text{Precision} = \frac{\text{correct responses found by search}}{\text{number of results found by search}} \]
Equation 6
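Given the set of documents a system retrieved and a ground-truth set of relevant documents, equations 5 and 6 reduce to a few set operations, as in this illustrative sketch:

```java
import java.util.*;

// A sketch of equations 5 and 6. "retrieved" is the set of documents the
// system returned; "relevant" is the ground-truth set for the query.
public class Metrics {
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0;
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);                        // correct responses found
        return hits.size() / (double) retrieved.size();  // eq. 6
    }

    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0;
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size() / (double) relevant.size();   // eq. 5
    }
}
```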
Another method of evaluating the system was through user testing. Because this project
has expected clients, they were asked for feedback. The feedback provided in user testing can be
used to improve software systems as well as the likelihood of their adoption by the target user
populations. The methods used here are described in Appendix A.
3. Dataset

The data used for this project was a sample of 126 documents in plain text format. The
number of unique terms in the corpus was roughly 5,000-6,000. The format of the documents was
generally a collection of words, with a date associated with the document name. In addition to
English words, the collection contained Portuguese words, numbers, and symbols that were
filtered out of the results. A sample of a file is shown in Figure 3, demonstrating proper nouns,
and special characters contained within the collection, such as the “>” character. When
encountered, characters such as these could not be treated as a proper noun, noun, or verb and
were filtered out of the indexed collection.
20091120-04-Fri
Amithaba one arms-length on the top of our head Chenrezig central channel* heart chakra lotus, tikle luminous as a candle. seed-syllable straight as an arrow - lung (not recorded) - meditation posture instructions 2009- P´howa 20091120.04-Fri > little finger´s width P`howa retreat Lama Tsering Everest Odsal Ling temple Translator- Priscila
Figure 3 A sample file in the collection of documents
4. Software Design and Use

Section 4 explains the various open source and freely available software components that
were used in creating this project. These components include the JTMT project, which was used
throughout the indexing and search steps, the Java Swing Widget Toolkit platform, which was
used for the graphical user interface, and the Lucene project, which was used for spelling
suggestion. This is illustrated in figure 4. This section also explains how content filtering was
used in this project, along with the design choices for its graphical user interface.
Figure 4 A block diagram displaying the various tools used to create the project.
4.1. Development Process

Open source and freely available software from the Java Text Mining Toolkit from Sujit
Pal (2008a) was used for much of the indexing process. The components used were part-of-
speech tagging, TF-IDF indexing, and cosine similarity. After a user selects a directory
containing files, the files in the directory are processed through a set of recognizers, which
recognize properties such as punctuation, phrases, and abbreviations. Next, the files are processed
in a rule-based tagger, which categorizes parts of speech using the MIT Java Wordnet Interface
(Pal, 2008b). Nouns, verbs, adjectives and adverbs can be tagged as an allowable part of speech
through the Wordnet interface after its “sense” or its unique word ID has been determined
(About Wordnet, 2011). Next, the word is tagged as TokenType.Content_WORD. Verbs,
adjectives and adverbs were omitted to save space in the index. Punctuation, numbers,
abbreviations, and spaces are also omitted from the text using a break iterator based on the
ICU4j's RuleBasedBreakIterator (Pal, 2008b). For example, the sentence "- These are the 3
reasons why Buddha said life is suffering." is first captured as a unit, ["- These are the 3 reasons
why Buddha said life is suffering."]. The software then tags the dash and the period as
[UNKNOWN], the number 3 as [NUMBER], and the remaining words as [word]; apostrophes
and hyphens are included with words. Nouns, verbs, and adjectives are identified, for instance
[self-centeredness (noun) WID-04779645], and then tagged as [CONTENT_WORD] as an
allowable part of speech. The content words are stored in a list with their IDs. A stop word list
was used to filter out commonly used words. Finally, the content words in the collection are
assigned frequencies based upon the number of times they appear within the documents, and the
documents are mapped to a matrix.

Users might make mistakes while entering input and may not receive precise results. In
order to address this, a spell checking system prompts the user when no matching documents are
found and suggests alternative terms according to a dictionary list.

Java Swing components were used to create a basic graphical user interface. Java Swing
was chosen because of the ability to run Java programs on Microsoft Windows, Mac OS X, and
Linux, which would suit the user's varied needs. The GUI was created with the intent to allow
the user to select files and search through their contents, with the visual guidelines of each
operating system taken into consideration. A screenshot of the program is shown in Figure 5,
demonstrating a customized file chooser for Mac OS X.

Figure 5 A comparison of Mac OS X's native file chooser on the left, and Java's default file chooser on the right

The benefits of conforming to the user interface's human interaction
guidelines provided consistency with the user’s expectations of the behavior of applications that
run on the platform.
Much of the work for this project was produced with the intent of taking an existing,
cost-effective indexing and searching system and modifying it to work with the non-profit
organization's collection. Therefore, the JTMT project was modified to work with collections of
documents instead of paragraphs within a single document, and to operate through a GUI instead
of a command prompt for user friendliness. Additionally, the output of the results was modified
for clarity, and the ability to save an index to the disk was added to save time in later searches,
along with changes such as using SQLite instead of MySQL in the interest of portability, since
the database is stored on the host machine and does not require a server (Obbayi, 2010). Also,
word suggestions were added through Lucene when no reasonable results to a user's query were
returned.
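As an illustration of that SQLite choice, the sketch below persists term frequencies through plain JDBC. The table layout is hypothetical and is not the schema used in this project; it assumes the Xerial sqlite-jdbc driver is on the classpath.

```java
import java.sql.*;

// A sketch of persisting term frequencies in an embedded SQLite file via
// JDBC. The table layout is hypothetical, not the project's actual schema;
// assumes the Xerial sqlite-jdbc driver is available.
public class IndexStore {
    public static void save(String dbPath, String doc, String term, int freq)
            throws SQLException {
        // SQLite keeps the whole database in one local file: no server required.
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:" + dbPath)) {
            try (Statement st = c.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS term_freq "
                         + "(doc TEXT, term TEXT, freq INTEGER)");
            }
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO term_freq (doc, term, freq) VALUES (?, ?, ?)")) {
                ps.setString(1, doc);
                ps.setString(2, term);
                ps.setInt(3, freq);
                ps.executeUpdate();
            }
        }
    }
}
```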
4.2. Software Use

The program starts by allowing the user to select a directory containing plain text files. The
files are indexed by the program, and a generated map of the document names, words and their
frequencies is created. Through the JTMT toolkit, a simple linear algebra package is used to
normalize the term frequencies (Pal, 2008b). LSI and cosine similarity are applied to the matrix.
To search through the documents, a query vector representing the terms in the query is treated as
an instance of a document vector allowing query terms to be compared against the matrix.
After the user has installed Java on their computer, they select the Java jar file for the project.
They select a directory containing the files that they wish to index. A screenshot demonstrating
matching results returned from the index is shown in Figure 6. The spelling
suggestion feature used Lucene’s SpellChecker, allowing the accuracy of matches to be set based
on edit distance, which was set to 0.75, producing results reasonably close to the given input
(Moreira).
Figure 6 A screenshot of the program displaying the results of a query.
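The suggestion step can be sketched roughly as follows, in the style of the Lucene 3.x SpellChecker API; exact signatures vary across Lucene versions, and the index and dictionary paths here are placeholders.

```java
import java.io.File;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

// A sketch of the suggestion step in the style of the Lucene 3.x
// SpellChecker API; signatures vary across Lucene versions, and the
// paths used here are placeholders.
public class Suggester {
    public static String[] suggest(String term) throws Exception {
        SpellChecker checker =
                new SpellChecker(FSDirectory.open(new File("spellindex")));
        // Build the spelling index from a plain-text word list, one word per line.
        checker.indexDictionary(new PlainTextDictionary(new File("dictionary.txt")));
        checker.setAccuracy(0.75f);              // only fairly close matches, as above
        return checker.suggestSimilar(term, 3);  // up to three candidates per term
    }
}
```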
5. Results

This section presents the evaluation used to determine the correctness and performance of the
program. Results are presented in two ways. An analysis of software performance is followed by
user testing results. The section begins with a description of the test setup and evaluation
methods.
5.1. Quantitative Software Evaluation

In order for the software to be useful for the users, several tests were conducted to
measure the quality of the results returned to the user, and the speed of the program. These tests
were critical, since a user waiting for results for their query to appear, or a user searching
through an ineffectual list of results might forego using the software altogether. A test setup
consisting of a computer running Microsoft Windows 7 with a 2.8 GHz processor, 4 GB of RAM,
and a 64 GB SSD executed the program. Tests were conducted for three components of the
program: index creation time, precision and recall of search results, and search result
suggestion. The index creation time test consisted of pseudorandomly selected files from the non-
profit organization’s document collection, scaled from nearly the size of the collection to about a
tenth of it. The index creation test included the time required to process the files by applying a
chain of recognizers to each file and storing the processed files as tokens in a matrix, as well as
performing latent semantic indexing and cosine similarity on the matrices.
Additional tests were performed on precision and recall of search results by running the
program multiple times with various search terms. Words and phrases were selected from the
document collection and compared to the results returned from the search application. The
results of the suggestion system were also measured for precision and recall, which is discussed
in section 5.2.
Table 1 Index Creation Time

Initial Collection Size    Time to Create Index (In Minutes)
10 Files                   1.07
25 Files                   2.56
50 Files                   5.05
75 Files                   6.71
100 Files                  8.33
125 Files                  12.37

The index creation time test was conducted three times for each scaled collection and an
average time to index the collection was recorded as shown in Table 1. Based on observation,
results indicate near linear growth in index time as the number of terms is increased, as
demonstrated in Figure 7. While a collection of 10 documents corresponded to 500 terms or a
5000 element matrix, a collection of 125 documents on average corresponded to 2700 terms.
With the size of the indices taken into account, the increase in index time can be attributed to the
matrix size used for LSI. A matrix of 125*2700 or 337,500 elements requires 2.7 MB of memory,
and large matrix sizes increase the amount of time because a hundred documents in the matrix
correspond to a hundred points in the term space. The cosine angle of each document must be
calculated against the other documents in the term space, which increased the length of time
required to finish indexing. The increase in the number of files also corresponded to an increase
in index creation time as shown in Figure 7, since additional files added more terms to the
collection.
Figure 7 Index Creation Time as the Number of Files Increases
A single document was indexed twice to test for similarity measures. When individual
terms were searched for, they yielded identical results, suggesting high precision. The second
document made of identical terms was then reduced to half its size, and a single term matching
test was applied to both documents. While the results produced a match for the first document
which contained the term, they also produced a match for the second document which did not
contain the term found in the first document. For example, Document A, which contained the
phrase, “activity is attained with amplification” was compared to Document B, containing the
phrase, "adjust activity by expression." The comparison produced equivalent scores for both
documents when the term "attained" was searched for. Latent semantic indexing was used to
compare the two documents, so the term “attained” had a frequency of one within the first
document, 0 within the second document, and 1 within the query vector. After matrix
decomposition was applied to the matrix, the rows or scores of the term remained unequal, and a
similar result occurred when applying cosine similarity between the query and the matrix. The
similarity between documents is represented as shown in equation 7
\[ q^{T} A = \begin{bmatrix} d_1 & d_2 \end{bmatrix} = \begin{bmatrix} 0.426 & 0.851 \end{bmatrix} \]
Equation 7
where q represents a query vector or a user’s query, A represents the term-document matrix, d1
and d2 are two documents, and T is the transpose of the vector. By multiplying the transpose of
the query vector by the term-document matrix, a similarity measure is calculated. The dissimilar
results between documents 1 and 2 suggest that the results returned from the system reflect the
similarity of the documents instead of terms, which might provide high recall but does not
necessarily deliver the level of precision to distinguish one highly related document from
another. The suggestion system was set to return three possible matches from the program’s
dictionary for each term in the query. The system returned reasonably close matches, but at the
cost of precision. One possible solution to correct the low precision of the matches could be to
compare the query to the terms in individual documents. Since matches are provided by
document similarity, an additional term search would filter out documents where the term did not
appear.
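That proposed fix amounts to a simple post-filter over the ranked list. A sketch, assuming the raw document text is available by ID (all names here are illustrative):

```java
import java.util.*;

// A sketch of the post-filtering step proposed above: after similarity
// ranking, drop any candidate whose text does not actually contain the
// query term. All names here are illustrative.
public class TermPostFilter {
    public static List<String> filter(List<String> rankedDocIds,
                                      Map<String, String> docTextById,
                                      String term) {
        List<String> kept = new ArrayList<>();
        for (String id : rankedDocIds) {
            String text = docTextById.getOrDefault(id, "").toLowerCase();
            if (text.contains(term.toLowerCase())) {
                kept.add(id); // keep only documents in which the term appears
            }
        }
        return kept;
    }
}
```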
5.2. User Testing

To test the usability of the software, a knowledgeable user was asked to search for typical
terms in the dataset using the software, while noting their experience in a short survey. The
methods and documentation for user testing are in Appendix A. The user noted the effectiveness
of the software in finding a result within the collection, the ease of use, and their likelihood of
using the software solution. The user indicated the software system was easy to navigate, but
they did not try to add or upload files to the collection. The user thought that the files returned made
sense, but expressed frustration with too many false positives returned in response to their
queries, and also noted that some of the returned files were essentially empty. These results
indicate that the system provided high recall, but also may have problems with low precision.
The user also gave the suggestion to implement a monitor for indexing progress, and a method to
link directly to the files in their collection. The user expressed satisfaction with the term
suggestion feature. In response to the user feedback, links were added to the file list in the results
returned and a progress monitor reflecting the stage of the indexing progress was implemented.
6. Discussion and Future Work

This project involved investigating a viable system for multiple-user document
organization and retrieval. This work involved researching existing methods of information
retrieval, selecting a model that fit the organization's needs, selecting software components from
an open-source information retrieval project, adjusting them to operate with large, multiple-
document collections, and considering features such as saving indices to disk and a graphical user
interface. This section describes the results of the testing and draws conclusions about the
effectiveness of the software, discussing the advantages and disadvantages of using the vector
space model in this project. Possibilities for continuing the project are also
discussed in the future work section.
6.1. Discussion

Large data sets result in large indices, which are computationally intensive to create. The
system developed in this work provided acceptable results and document relatedness at the cost
of system responsiveness. While using a vector based model for a simple search program allows
the user to distinguish which files are similar to their query, the index time required is costly
unless preprocessing allows for storing a matrix to the user’s disk. During the testing process for
this application, small collections of documents with similar terms were processed quickly, while
larger collections with a greater number of unique terms required a lengthy amount of time. Some
issues with low accuracy were found, as some search terms returned more documents than
necessary as demonstrated in figure 6. Given the startup time required to create an index, the user
could find the wait unacceptable and might prefer to filter through unranked results retrieved by
Boolean methods. In a growing collection of files, it could prove more useful to rely on a
combination of Boolean retrieval, full text search and relevance feedback, which collects data
from the user to further improve results. Improvements to the index time can be realized by
reducing the number of terms indexed through selecting stop words particular to the data set.
Other issues include the complexity of retrieving data from a vector representation of
documents. For example, the term document matrix, while providing relatedness, does not
preserve the sentence structure of the original documents, which could be useful for a graphical
representation of the search results. A solution to this might include assigning an additional value
to the column and row indicating the line number a term occurred on. A normalized matrix could
be stored, but SVD would have to be applied to the matrix each time a document is added to the
collection. Query expansion, which expands the user's query to include related terms, could be
used to eliminate low-precision matches from the system's results (Jurafsky, Martin, 2008;
Tsatsaronis, Varlamis, Vazirgiannis, 2010). Clustering creates sets of clusters for documents
based on similarity. Clustering vectors and calculating distances could also reduce the need for
matrix transformations. Further part-of-speech tagging could be applied for more
specific tags, such as tagging only proper nouns to provide the user with an index of names.
6.2. Future Work

Future work might focus on measures to reduce the size of the matrix to increase
performance. As WordNet allows for noun, verb, adverb, and adjective relations, the WordNet
database itself might provide for semantic results such as presenting the user with filtering
options. Because of the time required to calculate LSI and cosine similarity on a large matrix, a
system that examines part of the matrix in memory might be more effective in reducing the term
space as well as the index time. An example of such a solution is the S-Space package, which
became available after work on this project began and provides support for Latent Semantic
Analysis and sparse matrices in semantic spaces (Jurgens, 2010). Since S-Spaces are reusable,
the package might prove a better choice for processing large collections of documents. The user
interface can be improved given the feedback from the users.
The quality of term matches can be improved by allowing the user to input a list of words
that they would like to be included in the collection, such as names, initials, or other proper
nouns. Since the organization the software was created for is multilingual, Unicode support
should also be incorporated and thoroughly tested. The project could benefit from a persistent
database and a user interface to allow for the inclusion of new files. To fit the structure of the
organization, the ability to add new files without recreating an index, as well as the ability to
restrict access to some files, could be included.
Bibliography

"About WordNet." 2011. Princeton University. 31 Dec. 2011 <http://wordnet.princeton.edu>.

Beckford, Balmain. "Keyword-Based File Sorting for Information Retrieval." Undergraduate Thesis. 2010. Department of Computer Science, Minnesota State University, Mankato. Print.

Bergman, Ofer, Ruth Beyth-Marom, Rafi Nachmias, Noa Gradovitch, and Steve Whittaker. "Improved Search Engines and Navigation Preference in Personal Information Management." ACM Transactions on Information Systems 26.4 (2008): 1-24. Print.

Fang, Hui, Tao Tao, and Chengxiang Zhai. "Diagnostic Evaluation of Information Retrieval Models." ACM Transactions on Information Systems 29.2 (2011). Print.

Garcia, E. "SVD and LSI Tutorial 1: Understanding SVD and LSI." miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html. Web. 26 Jun. 2011.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Pearson Prentice Hall, 2008. Print.

Jurgens, David. "The S-Space Package: A Scalable Software Library for Semantic Spaces." code.google.com/p/airhead-research/. Web. 16 Aug. 2011.

Kobayashi, Mei, and Koichi Takeda. "Information Retrieval on the Web." ACM Digital Library. 2 June 2000. Web. 11 Aug. 2011.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. New York: Cambridge UP, 2008. Print.

Moreira, Leandro R. "Implementing Google's 'Did You Mean' Feature in Java." infoq.com/articles/lucene-did-you-mean. Web. 16 Aug. 2011.

Obbayi, Steve. "SQLite vs MySQL: How To Decide Which To Use." Sobbayi's Tech Labs. 30 May 2010. Web. 24 Dec. 2011. <http://blog.sobbayi.com/2010/05/sqlite-vs-mysql-how-to-decide-which-to-use/>.

Pal, Sujit. (2008a). "IR Math with Java: Similarity Measures." Salmon Run. 27 Sept. 2008. Web. 22 Dec. 2011. <http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html>.

Pal, Sujit. (2008b). "IR Math with Java: TF, IDF and LSI." Salmon Run. 20 Sept. 2008. Web. 22 Dec. 2011.

RapidMiner. Rapid-I. 10 Oct. 2011 <http://rapid-i.com/content/view/181/190/lang,en/>.

Rosario, Barbara. "Latent Semantic Indexing: An Overview." 2000. University of California, Berkeley. Web. 11 Aug. 2011.

Singhal, Amit, Gerard Salton, Mandar Mitra, and Chris Buckley. "Document Length Normalization." 1995. Cornell University. Print.

Soboroff, Ian. "IR Models: The Boolean Model." UMBC Computer Science and Electrical Engineering. Web. 9 Oct. 2011.

Spoerri, A. "Information Retrieval Models." comminfo.rutgers.edu/~aspoerri/. Web. 26 Jun. 2011.

Tsatsaronis, George, Iraklis Varlamis, and Michalis Vazirgiannis. "Text Relatedness Based on a Word Thesaurus." Journal of Artificial Intelligence Research 37 (2010). Print.

Wu, Lihua, JianPing Feng, and Yunfen Luo. "A Personalized Intelligent Web Retrieval System Based on the Knowledge-Base Concept and Latent Semantic Indexing Model." Software Engineering Research, Management and Applications (2009): 45-50. Print.
Appendix A User Testing Materials

This section consists of forms related to the user testing process of this project. The user was
given a brief usability survey in order to assess the utility of the program. The user was also
given a consent form describing the uses of the data collected from the survey, as well as the
overall procedure, the risks and the benefits. Finally, a test script has been included which
outlines how the user interview process was conducted.
Minnesota State University, Mankato
Department of Computer Science

Purpose
The goal of this study is to gather information about the usability of a searchable database for audio teachings. Information will be used to assess and improve the software.

Procedures and Duration
The survey will only take about 10-15 minutes to complete, depending on the length of your responses. It may take you longer to become familiar with the software. You can ask any questions you have and the researcher will guide you through the software use. When you are comfortable, you can complete the survey on your own and email it back to the researcher, or the researcher will ask you the questions and record your answers. You are free to choose which approach you prefer. If you decide to participate in this study, your participation is completely voluntary and you are free to skip any question. At any point you can choose to end your participation. Again, if you decide to participate, you are free to stop your participation at any time without penalty.

Risks and Discomforts
While you are not at any physical risk from participating in this research, there may be a risk of feeling discomfort if any of the questions are sensitive to you.

Benefits and/or Compensation
Only people familiar with the Tsound Project at Odsal Ling Temple and interested in its improvement will be surveyed. There is no other compensation.

Confidentiality
Any information about you will be kept private. Participants will be assigned a numeric code number and the code key will be kept in a cabinet in the primary researcher's office. The information you provide in the survey will be matched to the numeric code number and will not be associated with your name. If the survey is given orally, notes will be taken electronically but no names will be associated with any comment or quote. The results of your participation will be stored electronically with paper backups and will not be available to any other person or organization other than the primary researcher or research assistants. Any information collected will be destroyed at the end of one year.

Publications Associated with this Research
The results of this research may appear in publications but individual participants will not be identified.

Offer to Answer Questions
Before you begin the survey, please acknowledge your acceptance of the terms of the survey. If you have any questions about the rights and treatment of human subjects, please contact: the IRB Administrator, College of Graduate Studies and Research, Minnesota State University, Mankato, 115 Alumni/Foundation Center, Mankato, MN 56001. Phone: (507) 389-2321, FAX: (507) 389-5974. If you have any questions concerning the research, please contact:

Dr. Rebecca Bates
Minnesota State University
Department of Computer Science
[email protected]
507-389-5587

Please include this text in an email to [email protected] or [email protected] and include the statement: "I agree to participate in the Usability Study for A Searchable Database of Audio Teachings" followed by your name and the date.
Your Name________________________ Researcher Name:_________________
Date_____________________________ Date ____________________________
Usability Testing Script

FACILITATOR: Hello, my name is Adedayo Ologunde and I will be walking you through this usability testing session. I have asked you to participate in the testing of the user interface and your experience with a database for retrieval of audio teachings. The purpose of the test is to judge the effectiveness of the user interface and will not test you, the user. If you are unable to proceed with the course of the system, please let me know and do not feel as if you are not contributing. Your participation will allow for a more intuitive design of the system and its ability to retrieve files.

I will be observing how you interact with the system through a Skype screencast so that I can help you more easily if you have questions. I will be writing things down, but your information and responses will be kept confidential. All of the data gathered from this session will be labeled with a number. The only place in which your name will be associated with the corresponding number will be a key code locked in Dr. Bates' office (Wissink Hall 231). The consent forms will be located in the same place. You can end this session at any time and there will be no penalty associated with this. I would like to reiterate this point again: you have the choice to end this session at any time and it will not influence your relationship with Minnesota State University, Mankato or any entities associated with the university or any entities associated with the audio retrieval system.

I will now guide you through the beginning of the system and then observe the rest of the interaction.

1. Start up the software by clicking on the icon (described by tester). Let me know if you have any trouble.
2. Start using the system by putting the desired search terms into the box designated "search terms".
3. Select the most relevant (if any) entry related to the search term.
4. After the search term is selected, select any related terms in the "related terms" form.
5. Open the audio transcription file.
6. Close the file and try another search term. (Repeat until the user has located 4-5 files and has tried multiple types of file selection.)

Now that you are done with the system I will ask you to do one more thing. Please fill out this short questionnaire that reflects your experience with the audio retrieval system. If you want, I can read you the questions and collect the answers.
Thank you for helping me with my project. I appreciate your contributions.
A Searchable Database of Audio Teachings: Usability Survey

Please choose how much you agree or disagree with the following statements.
(1 = Agree, 3 = Neutral, 5 = Disagree)

                                                         1    2    3    4    5
It was easy to navigate the system.                      o    o    o    o    o
I had no trouble uploading files.                        o    o    o    o    o
The files returned made sense given the search terms.    o    o    o    o    o
I was confused by the returned documents.                o    o    o    o    o

What should be changed?

What could improve your experience?

What functions are not helpful?