The Use of Information Retrieval for a
Searchable Database of Audio Teachings
Adedayo Ologunde
A Thesis Submitted in Partial Fulfillment of the
Requirements for the Degree of
Bachelor of Science
in
Computer Science
Minnesota State University, Mankato
Mankato, Minnesota
December 2011
Abstract

Traditional search engines match key words or phrases based on word frequency,
location, similarity, and update-time. The ranking of the results based on these criteria does not
necessarily reflect the trends of a user’s interest and the large number of results returned forces
the user to sort them by hand. In order to solve the problem of ranking relevant results of interest
to a user, vector-based representations of collections of files or documents are created, filtered
through reduction techniques and ranked. In this work, a database containing transcriptions of
audio recordings was created for a small, non-profit organization and augmented with a simple
tagging and search engine in order to provide a basic framework for semantic indexing and
problem domain specific search. The goal was more accurate search results for their audio
recordings. Many of the audio files had brief summaries of the audio with some keywords, while
others did not. As the set of audio data grows, it is important for the file system to adapt the
incoming data to the users’ typical word associations. By asking users to enter keywords when
uploading new files, these associations can be created. Semantic indexing can then be used to
return more accurate search results to the user. This approach, implemented on a desktop system
to ensure data privacy, provided better relatedness and faster retrieval of the desired files and
demonstrated how a small organization can benefit from easy search and indexing of their large
audio library.
Contents

1. Introduction
2. Background
2.1. Document Indexing in IR
2.2. Document Indexing Models
2.2.1. Boolean Models
2.2.2. Vector Space Models
2.3. Index Filtering
2.4. Document Similarity Measures
2.5. Available Tools
2.6. Evaluation Metrics
3. Dataset
4. Software Design and Use
4.1. Development Process
4.2. Software Use
5. Results
5.1. Quantitative Software Evaluation
5.2. User Testing
6. Discussion and Future Work
6.1. Discussion
6.2. Future Work
Bibliography
Appendix A User Testing Materials
1. Introduction

Information retrieval (IR) systems can be useful for recovering specific information from
stored records. While storing and retrieving information is useful, humans often need to use the
retrieved data and apply it in order to answer questions and study, compare and rank topics
beyond simple file retrieval. While presenting meaningful results to a user is a challenging
multifaceted process, several solutions to retrieving meaningful information exist. These include
full text search, which becomes useful for locating phrases, information retrieval systems and
user tagging, and structured query search (Manning, Raghavan, Schütze, 2008).
A small non-profit organization had a sizeable collection of audio files associated with
documents containing some keywords. The organization was looking for a solution to organize
their collection and group related transcriptions with one another. In order to solve the problem
of retrieving specific information from their collection, an information retrieval system was
developed. Traditional search engines use key based systems to match words with large amounts
of unrelated results based on word frequency, location, similarity, or update-time, but they do not
necessarily reflect the trends of a user’s interest (Wu et al., 2009). The objective of this project
was to create a simple, inexpensive vector-based indexing system and search engine to return
reasonable results to users in a manner that fits their collection of documents, allows them to
upload new documents, and allows them to control user access based on topic choice. Free and
commercial alternatives that handle information retrieval problems exist such as Google Desktop
Search for individual users, and IBM's InfoSphere Warehouse for enterprise. However, such
options might not provide utility for the express purpose of searching through a moderate-sized
collection of documents.
In order to create a system that provides meaningful results for a user, there are multiple
approaches for document indexing and retrieval. Clustering is one such approach that groups
files into categories based on relatedness (Beckford, 2010). Others include the probabilistic
model, which statistically quantifies data within a collection of documents, and the vector space
model, which treats queries and documents as vectors (Soboroff). This work examines two
methods for information retrieval, Boolean models and vector space models, and presents the
necessary steps to implement examples of each model, with a discussion of their role in this
work. Further information is given on tools used for this project and its evaluation based on its
performance and ability to process and return results to the user.
2. Background
Users looking for information on computer systems typically either browse through the
available collection of data for the information that they need or rely on assistance from a
personal information management system (PIM) to locate it for them (Whittaker, 2008;
Manning, Raghavan, Schütze, 2008). In Internet search, they may be assisted by a system with
the ability to scale to large collections, while a personal computer’s desktop search might be able
to provide detailed results such as file location and file previews with results. Enterprise
computer applications, such as database-supported web applications, may also need search
(e.g., Smiley, Text Search, your Database or Solr). Across these search options, the common
goal is to provide relevant results and present them to the user in a coherent manner. Figure 1
illustrates the various steps taken to translate a collection of document files into a searchable
index. These include filtering words contained within a collection for word type,
analyzing the filtered words for the number of appearances within the text, storing the collected
words in an index, and applying transformations to the index, which is used to increase the speed
of search. In information retrieval, documents refer to the units used to build the system, terms
refer to the data, typically words or short phrases contained within the documents to be processed
or retrieved, and collections refer to the sets of documents, which may also be referred to as
corpora (Manning, Raghavan, Schütze, 2009). This section presents information about
document indexing, approaches, filtering, available tools and evaluation methods.
Figure 1 Document Preparation and Information Retrieval Process for use with the Boolean model or for use with the vector space model
2.1. Document Indexing in IR

Indexing systems can be used to increase the speed of search. An information retrieval
system used to collect meaningful information for a user typically makes use of an indexing step
that allows the system to examine documents’ terms before retrieval to arrange or associate the
documents with queries in a significant manner. Several models and approaches are available to
represent a user’s query and a system’s response. The Boolean model provides simple
implementation, and allows logical operators. Probabilistic models, along with vector space
models allow for ranked retrieval by relevance, at the cost of complexity.
2.2. Document Indexing Models

This section introduces two models of document indexing, the vector space model and
the Boolean model. Along with the vector space model that might use indexing techniques such
as latent semantic indexing, the Boolean model relies on text queries to return results as shown in
figure 2. This model works to return matching results from queries to documents via Boolean
operators such as AND, OR, and NOT.
Figure 2 Boolean and Vector Space Models
2.2.1. Boolean Models

The Boolean model has been used in dialog systems as well as web search engines that
implement extended Boolean models (Manning, Raghavan, Schütze, 2009). The Boolean model
indexes a set of words, checks if the term is present within a document, and applies Boolean
operators AND, OR, and NOT over the set of documents (Soboroff). Strict Boolean models can
only retrieve exact matches, and weight each term equally, making it impossible to rank
documents, but work quickly and simply in comparison to vector space models. Complex or
extended models add keyword search, phrase search and add chronological ranking (Kobayashi,
Takeda, 2000; Manning, Raghavan, Schütze, 2009; Soboroff).
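To make the mechanics concrete, here is a minimal sketch of strict Boolean retrieval over an inverted index, with AND, OR, and NOT implemented as set operations. The index structure, document IDs, and sample text are illustrative only, and are not taken from the system described later in this work.

```java
import java.util.*;

// A sketch of strict Boolean retrieval over an inverted index.
// Document IDs and the postings map are illustrative placeholders.
public class BooleanIndex {
    // term -> set of IDs of documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> allDocs = new HashSet<>();

    public void add(int docId, String text) {
        allDocs.add(docId);
        for (String term : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    private Set<Integer> docs(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public Set<Integer> and(String a, String b) {   // intersection of postings
        Set<Integer> r = new HashSet<>(docs(a));
        r.retainAll(docs(b));
        return r;
    }

    public Set<Integer> or(String a, String b) {    // union of postings
        Set<Integer> r = new HashSet<>(docs(a));
        r.addAll(docs(b));
        return r;
    }

    public Set<Integer> not(String term) {          // complement over the collection
        Set<Integer> r = new HashSet<>(allDocs);
        r.removeAll(docs(term));
        return r;
    }

    public static void main(String[] args) {
        BooleanIndex idx = new BooleanIndex();
        idx.add(1, "meditation posture instructions");
        idx.add(2, "meditation retreat");
        System.out.println(idx.and("meditation", "retreat")); // prints [2]
    }
}
```

Every document either matches or does not, with no ranking among the matches; this is the limitation that motivates the vector space model described next.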
2.2.2. Vector Space Models

Implementing a ranking system that takes more into account than order of appearance
requires additional features. This model allows for document comparison, but does so at the cost
of speed and complexity. In a vector space model, an index is used to collect terms in documents.
Steps are taken to collect the documents to be indexed, tokenize the text, linguistically process
the tokens through sets of rules within the bodies of text, and index the documents that contain
each term (Manning, Raghavan, Schütze, 2009).
Within documents, frequently occurring terms may skew search results towards certain
documents in a collection. To address this issue, vector space models make use of term
frequency (TF) and inverse document frequency (IDF), shown in equations 1 and 2, to create a
TF*IDF score representing the number of times a term appears in a document multiplied
by the log of the number of documents in the database over the number of documents containing
the term. Term frequency weighting refers to the method of scoring these term counts. Inverse
document frequency reduces the importance of commonly occurring terms over a collection of
documents (Garcia). Term frequency is a count of the number of times a word is repeated
within a document, which can be used to quantify its importance relative to other terms within
the document, as shown in equation 1.
!"! =!!!
Equation 1
where N represents the number of words in the document and Ni is the number of times term i
appears in the document. The inverse document frequency is shown in equation 2.
!"# = log !!"!
Equation 2
where D represents the number of documents in the collection, and dfi represents the number of
documents containing the term i. Terms that occur in no document (df_i = 0) are assigned a
small fixed value to avoid division by zero. By multiplying the term frequency and inverse
document frequency, a
distance score is created that can measure the importance of terms within documents in a
collection. The term weight, wi, is shown in equation 3.
!! = !"! ∗ !"# Equation 3
In this model, documents can be ranked highly against a query if the terms are rare within a
collection, but common within a document (Spoerri). Along with TF-IDF weighting, the process
of indexing a document in a collection might involve normalization to ensure that terms within
documents and the documents themselves are fairly promoted (Singhal, et al., 1995).
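As a worked illustration of equations 1 through 3, the sketch below computes TF*IDF weights for one document against a collection. The simple non-word-character tokenization and the floor on df are simplifying assumptions, not the recognizer chain used in this project.

```java
import java.util.*;

// A sketch of the TF*IDF weighting in equations 1-3. Whitespace/non-word
// tokenization stands in for a real tokenizer, and df is floored at 1 to
// avoid division by zero, per the note above on df_i = 0.
public class TfIdf {
    public static Map<String, Double> weights(String doc, List<String> collection) {
        String[] tokens = doc.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);

        int n = tokens.length;       // N: number of words in the document
        int d = collection.size();   // D: number of documents in the collection
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = e.getValue() / (double) n;         // eq. 1: tf_i = N_i / N
            long df = collection.stream()                  // df_i: docs containing term i
                    .filter(c -> Arrays.asList(c.toLowerCase().split("\\W+"))
                                       .contains(e.getKey()))
                    .count();
            double idf = Math.log((double) d / Math.max(df, 1)); // eq. 2
            weights.put(e.getKey(), tf * idf);             // eq. 3: w_i = tf_i * idf_i
        }
        return weights;
    }
}
```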
Vector space models make use of tokenization, or splitting documents into terms. Word
stemming, or using the root of a word, idea, or synonym can be used to increase the occurrence
of rare terms. Tokenization during part-of-speech tagging often delimits the tokens by
whitespace, or spaces between tokens; however, hyphenated phrases such as "soft-spoken" may
be treated as a single term to return more accurate results to the user (Manning, Raghavan,
Schütze, 2009). It is possible to identify the part of speech of each token within a collection of
documents. These parts include nouns, verbs, pronouns, prepositions, adverbs, conjunctions,
participles, and articles, based on information given about the word and its neighbors
(Jurafsky, Martin, 2008). Tagging of parts of speech can allow for more streamlined results by
reducing noise in a term document matrix while providing a means for the user to identify useful
terms related to their topic of interest. For example, filtering can isolate nouns and proper nouns
and ignore prepositions, conjunctions and articles.
2.3. Index Filtering

Function words such as "and" are frequently found in documents, increasing the
document size and potentially increasing search time, but they have little intrinsic effect on the
results returned to the user. In the set of rules for creating an index, a step is often taken to filter
out these most frequently occurring words, or stop words (Manning, Raghavan, Schütze, 2009; Jurafsky, Martin,
2008). The terms selected for the stop words can be adjusted to the collection of documents and
might be different for part-of-speech groups. The use of a stop word list in information retrieval
can reduce space in terms of the index size, as well as reduce the amount of time devoted to
indexing terms.
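A minimal sketch of this filtering step, assuming a hand-picked stop list, might look as follows; it is illustrative only.

```java
import java.text.BreakIterator;
import java.util.*;

// A sketch of stop-word filtering during tokenization, using Java's
// built-in BreakIterator for word boundaries. The stop list is a tiny
// illustrative sample, not a list tuned to any particular collection.
public class StopWordFilter {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is", "to"));

    public static List<String> contentWords(String text) {
        List<String> kept = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance();
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                 start = end, end = words.next()) {
            String token = text.substring(start, end).toLowerCase().trim();
            // keep only word tokens (containing a letter) that are not stop words
            if (!token.isEmpty() && token.chars().anyMatch(Character::isLetter)
                    && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```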
Stemming, or recognizing and grouping the roots of words, is also used to reduce the size
of an index (Manning, Raghavan, Schütze, 2009). In the processing phase of creating an index,
each word contained within a document is given a weight based on its frequency, and the list of
unique words is associated with a vector (Jurafsky, Martin, 2008). The weights within the
documents are represented within a document vector. In the vector space model, when a user
searches through the document collection, a query vector represents their query.
2.4. Document Similarity Measures

Before a user is presented with search results, cosine similarity is applied to compare the
vectors representing documents within the collection to the vector representing the user's query.
Several methods of measuring similarity between documents exist, such as Jaccard
similarity and abstract similarity (Pal, 2008a). In this project, cosine similarity was used. This
process measures the cosine angle between two vectors, A and B as shown in equation 4
(Garcia).
!"# !,! = !"#$%& ! = !·!! !
Equation 4
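A direct implementation of equation 4 is short. The sketch below assumes the two term-weight vectors share the same term ordering.

```java
// A direct reading of equation 4: the cosine of the angle between two
// term-weight vectors, assuming both use the same term ordering.
public class Cosine {
    public static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];     // A · B
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // an all-zero vector matches nothing
        return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // (A · B) / (|A| |B|)
    }
}
```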
Latent semantic indexing (LSI) is an application of singular value decomposition (SVD),
allowing for document scoring and ranking. LSI projects queries and documents into a space
with latent semantic dimensions, and then SVD finds the optimal projection to a low dimensional
space (Rosario, 2000). LSI can provide optimal matches of variation to documents, but becomes
costly at large dimensions or collections with large amounts of variance between terms.
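In matrix terms, LSI keeps only the k largest singular values of the term-document matrix A; this is the standard rank-k SVD truncation, with k a tuning parameter (this formulation is standard background, not specific to this project):

\[ A \approx A_k = U_k \Sigma_k V_k^{T} \]

A query vector q can then be folded into the same k-dimensional latent space before the cosine comparison of equation 4:

\[ \hat{q} = \Sigma_k^{-1} U_k^{T} q \]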
2.5. Available Tools

The nonprofit organization had concerns about security, so the use of web-related
technologies was avoided. Several solutions to desktop-based search were considered. For
example, Rapid-I provides a widely used software tool for data mining, RapidMiner, which is
graphical and provides extendibility, but offers limited support for inexpensive versions
(RapidMiner). Several open source projects were available to process text. SemanticVectors, for
example, creates "WordSpace" models from natural language text (SemanticVectors), but offers
limited extendibility and support. Lucene was used throughout the project to compare the quality
of search results and for a word suggestion feature through Lucene’s SpellChecker class
(Lucene). Lucene provided fast and scalable indexing with various sized collections (Lucene).
However, Lucene does not rank documents, which is critical for sorting meaningful results. The
S-Space package provides support for latent semantic analysis, and was designed for scalability
(Jurgens). The S-Space package might have been useful in quickly processing the collection of
files, but it was not considered until late in the development process and was not used. The Java Text Mining
Toolkit provided part-of-speech tagging capabilities, similarity measures and the ability to rank
documents for a reasonable starting point to build upon (Pal, 2008a). Much of the indexing and
search processing was built upon the JTMT project.
2.6. Evaluation Metrics

Two evaluation metrics used in information retrieval are recall and precision, as shown in
equations 5 and 6 (Manning, Raghavan, Schütze, 2009; Jurafsky, Martin, 2008; Kobayashi,
2000). Precision is the ratio of relevant documents retrieved to the total number of documents
retrieved, and recall is the proportion of relevant documents retrieved compared to the relevant
documents in the database (Kobayashi, 2000). In order to achieve high precision, the IR system
would have to return only relevant responses to the user's query.
\[ \text{Recall} = \frac{\text{correct responses found by search}}{\text{relevant documents in the database}} \]
Equation 5

\[ \text{Precision} = \frac{\text{correct responses found by search}}{\text{number of results found by search}} \]
Equation 6
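Given the set of documents a system retrieved and a ground-truth set of relevant documents, equations 5 and 6 reduce to a few set operations, as in this illustrative sketch:

```java
import java.util.*;

// A sketch of equations 5 and 6. "retrieved" is the set of documents the
// system returned; "relevant" is the ground-truth set for the query.
public class Metrics {
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0;
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);                        // correct responses found
        return hits.size() / (double) retrieved.size();  // eq. 6
    }

    public static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0;
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size() / (double) relevant.size();   // eq. 5
    }
}
```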
Another method of evaluating the system was through user testing. Because this project
has expected clients, they were asked for feedback. The feedback provided in user testing can be
used to improve software systems as well as the likelihood of their adoption by the target user
populations. The methods used here are described in Appendix A.
3. Dataset

The data used for this project was a sample of 126 documents in plain text format. The
number of unique terms in the corpus was roughly 5,000-6,000. The format of the documents was
generally a collection of words, with a date associated with the document name. In addition to
English words, the collection contained Portuguese words, numbers, and symbols that were
filtered out of the results. A sample of a file is shown in Figure 3, demonstrating proper nouns,
and special characters contained within the collection, such as the “>” character. When
encountered, characters such as these could not be treated as a proper noun, noun, or verb and
were filtered out of the indexed collection.
20091120-04-Fri
Amithaba one arms-length on the top of our head Chenrezig central channel* heart chakra lotus, tikle luminous as a candle. seed-syllable straight as an arrow - lung (not recorded) - meditation posture instructions 2009- P´howa 20091120.04-Fri > little finger´s width P`howa retreat Lama Tsering Everest Odsal Ling temple Translator- Priscila
Figure 3 A sample file in the collection of documents
4. Software Design and Use

Section 4 explains the various open source and freely available software components that
were used in creating this project. These components include the JTMT project, which was used
throughout the indexing and search steps, the Java Swing Widget Toolkit platform, which was
used for the graphical user interface, and the Lucene project, which was used for spelling
suggestion. This is illustrated in figure 4. This section also explains how content filtering was
used in this project, along with the design choices for its graphical user interface.
Figure 4 A block diagram displaying the various tools used to create the project.
4.1. Development Process

Open source and freely available software from the Java Text Mining Toolkit from Sujit
Pal (2008a) was used for much of the indexing process. The components used were part-of-
speech tagging, TF-IDF indexing, and cosine similarity. After a user selects a directory
containing files, the files in the directory are processed through a set of recognizers, which
recognize properties such as punctuation, phrases, and abbreviations. Next, the files are processed
in a rule-based tagger, which categorizes parts of speech using the MIT Java Wordnet Interface
(Pal, 2008b). Nouns, verbs, adjectives and adverbs can be tagged as an allowable part of speech
through the Wordnet interface after its “sense” or its unique word ID has been determined
(About Wordnet, 2011). Next, the word is tagged as TokenType.Content_WORD. Verbs,
adjectives and adverbs were omitted to save space in the index. Punctuation, numbers,
abbreviations, and spaces are also omitted from the text using a break iterator based on the
ICU4j's RuleBasedBreakIterator (Pal, 2008b). For example, the sentence "- These are the 3
reasons why Buddha said life is suffering." is first captured as a unit, ["- These are the 3 reasons
why Buddha said life is suffering."]. The software then tags the dash and the period as
[UNKNOWN], the number 3 as [NUMBER], and the remaining words as [word]; apostrophes
and hyphens are included with words. Nouns, verbs, and adjectives are identified, for instance
[self-centeredness (noun) WID-04779645], and then tagged as [CONTENT_WORD] as an
allowable part of speech. The content words are stored in a list with their IDs. A stop word list
was used to filter out commonly used words. Finally, the content words in the collection are
assigned frequencies based upon the number of times they appear within the documents, and the
documents are mapped to a matrix.

Users might make mistakes while entering input and may not receive precise results. In
order to address this, a spell checking system prompts the user when no matching documents are
found and suggests alternative terms according to a dictionary list.

Java Swing components were used to create a basic graphical user interface. Java Swing
was chosen because of the ability to run Java programs on Microsoft Windows, Mac OS X, and
Linux, which would suit the user's varied needs. The GUI was created with the intent to allow
the user to select files and search through their contents, with the visual guidelines of each
operating system taken into consideration. A screenshot of the program is shown in Figure 5,
demonstrating a customized file chooser for Mac OS X.

Figure 5 A comparison of Mac OS X's native file chooser on the left, and Java's default file chooser on the right

The benefits of conforming to the user interface's human interaction
guidelines provided consistency with the user’s expectations of the behavior of applications that
run on the platform.
Much of the work for this project was produced with the intent of taking an existing,
cost-effective indexing and searching system and modifying it to work with the non-profit
organization's collection. Therefore, the JTMT project was modified to work with collections of
documents instead of paragraphs within a single document, and to operate through a GUI instead
of a command prompt for user friendliness. Additionally, the output of the results was modified
for clarity, and the ability to save an index to the disk was added to save time in later searches,
along with changes such as using SQLite instead of MySQL in the interest of portability, since
the database is stored on the host machine and does not require a server (Obbayi, 2010). Also,
word suggestions were added through Lucene when no reasonable results to a user's query were
returned.
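As an illustration of that SQLite choice, the sketch below persists term frequencies through plain JDBC. The table layout is hypothetical and is not the schema used in this project; it assumes the Xerial sqlite-jdbc driver is on the classpath.

```java
import java.sql.*;

// A sketch of persisting term frequencies in an embedded SQLite file via
// JDBC. The table layout is hypothetical, not the project's actual schema;
// assumes the Xerial sqlite-jdbc driver is available.
public class IndexStore {
    public static void save(String dbPath, String doc, String term, int freq)
            throws SQLException {
        // SQLite keeps the whole database in one local file: no server required.
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:" + dbPath)) {
            try (Statement st = c.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS term_freq "
                         + "(doc TEXT, term TEXT, freq INTEGER)");
            }
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO term_freq (doc, term, freq) VALUES (?, ?, ?)")) {
                ps.setString(1, doc);
                ps.setString(2, term);
                ps.setInt(3, freq);
                ps.executeUpdate();
            }
        }
    }
}
```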
4.2. Software Use

The program starts by allowing the user to select a directory containing plain text files. The
files are indexed by the program, and a generated map of the document names, words and their
frequencies is created. Through the JTMT toolkit, a simple linear algebra package is used to
normalize the term frequencies (Pal, 2008b). LSI and cosine similarity are applied to the matrix.
To search through the documents, a query vector representing the terms in the query is treated as
an instance of a document vector allowing query terms to be compared against the matrix.
After the user has installed Java on their computer, they select the Java jar file for the project.
They select a directory containing the files that they wish to index. A screenshot demonstrating
matching results returned from the index is shown in Figure 6. The spelling
suggestion feature used Lucene’s SpellChecker, allowing the accuracy of matches to be set based
on edit distance, which was set to 0.75, producing results reasonably close to the given input
(Moreira).
Figure 6 A screenshot of the program displaying the results of a query.
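The suggestion step can be sketched roughly as follows, in the style of the Lucene 3.x SpellChecker API; exact signatures vary across Lucene versions, and the index and dictionary paths here are placeholders.

```java
import java.io.File;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

// A sketch of the suggestion step in the style of the Lucene 3.x
// SpellChecker API; signatures vary across Lucene versions, and the
// paths used here are placeholders.
public class Suggester {
    public static String[] suggest(String term) throws Exception {
        SpellChecker checker =
                new SpellChecker(FSDirectory.open(new File("spellindex")));
        // Build the spelling index from a plain-text word list, one word per line.
        checker.indexDictionary(new PlainTextDictionary(new File("dictionary.txt")));
        checker.setAccuracy(0.75f);              // only fairly close matches, as above
        return checker.suggestSimilar(term, 3);  // up to three candidates per term
    }
}
```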
5. Results

This section presents the evaluation used to determine the correctness and performance of the
program. Results are presented in two ways. An analysis of software performance is followed by
user testing results. The section begins with a description of the test setup and evaluation
methods.
5.1. Quantitative Software Evaluation

In order for the software to be useful for the users, several tests were conducted to
measure the quality of the results returned to the user, and the speed of the program. These tests
were critical, since a user waiting for results for their query to appear, or a user searching
through an ineffectual list of results might forego using the software altogether. A test setup
consisting of a computer running Microsoft Windows 7 with a 2.8 GHz processor, 4 GB of RAM,
and a 64 GB SSD executed the program. Tests were conducted for three components of the
program: index creation time, precision and recall of search results, and search result
suggestion. The index creation time test consisted of pseudorandomly selected files from the non-
profit organization’s document collection, scaled from nearly the size of the collection to about a
tenth of it. The index creation test included the time required to process the files by applying a
chain of recognizers to each file and storing the processed files as tokens in a matrix, as well as
performing latent semantic indexing and cosine similarity on the matrices.
Additional tests were performed on precision and recall of search results by running the
program multiple times with various search terms. Words and phrases were selected from the
document collection and compared to the results returned from the search application. The
results of the suggestion system were also measured for precision and recall, which is discussed
in section 5.2.
Table 1 Index Creation Time

Initial Collection Size    Time to Create Index (In Minutes)
10 Files                   1.07
25 Files                   2.56
50 Files                   5.05
75 Files                   6.71
100 Files                  8.33
125 Files                  12.37

The index creation time test was conducted three times for each scaled collection and an
average time to index the collection was recorded as shown in Table 1. Based on observation,
results indicate near linear growth in index time as the number of terms is increased, as
demonstrated in Figure 7. While a collection of 10 documents corresponded to 500 terms or a
5000 element matrix, a collection of 125 documents on average corresponded to 2700 terms.
With the size of the indices taken into account, the increase in index time can be attributed to the
matrix size used for LSI. A matrix of 125*2700 or 337,500 elements requires 2.7 MB of memory,
and large matrix sizes increase the amount of time because a hundred documents in the matrix
correspond to a hundred points in the term space. The cosine angle of each document must be
calculated against the other documents in the term space, which increased the length of time
required to finish indexing. The increase in the number of files also corresponded to an increase
in index creation time as shown in Figure 7, since additional files added more terms to the
collection.
Figure 7 Index Creation Time as the Number of Files Increases
A single document was indexed twice to test for similarity measures. When individual
terms were searched for, they yielded identical results, suggesting high precision. The second
document made of identical terms was then reduced to half its size, and a single term matching
test was applied to both documents. While the results produced a match for the first document
which contained the term, they also produced a match for the second document which did not
contain the term found in the first document. For example, Document A, which contained the
phrase, “activity is attained with amplification” was compared to Document B, containing the
phrase, "adjust activity by expression." The comparison produced equivalent scores for both
documents when the term "attained" was searched for. Latent semantic indexing was used to
compare the two documents, so the term “attained” had a frequency of one within the first
document, 0 within the second document, and 1 within the query vector. After matrix
decomposition was applied to the matrix, the rows or scores of the term remained unequal, and a
similar result occurred when applying cosine similarity between the query and the matrix. The
similarity between documents is represented as shown in equation 7
\[ q^{T} A = \begin{bmatrix} d_1 & d_2 \end{bmatrix} = \begin{bmatrix} 0.426 & 0.851 \end{bmatrix} \]
Equation 7
where q represents a query vector or a user’s query, A represents the term-document matrix, d1
and d2 are two documents, and T is the transpose of the vector. By multiplying the transpose of
the query vector by the term-document matrix, a similarity measure is calculated. The dissimilar
results between documents 1 and 2 suggest that the results returned from the system reflect the
similarity of the documents instead of terms, which might provide high recall but does not
necessarily deliver the level of precision to distinguish one highly related document from
another. The suggestion system was set to return three possible matches from the program’s
dictionary for each term in the query. The system returned reasonably close matches, but at the
cost of precision. One possible solution to correct the low precision of the matches could be to
compare the query to the terms in individual documents. Since matches are provided by
document similarity, an additional term search would filter out documents where the term did not
appear.
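That proposed fix amounts to a simple post-filter over the ranked list. A sketch, assuming the raw document text is available by ID (all names here are illustrative):

```java
import java.util.*;

// A sketch of the post-filtering step proposed above: after similarity
// ranking, drop any candidate whose text does not actually contain the
// query term. All names here are illustrative.
public class TermPostFilter {
    public static List<String> filter(List<String> rankedDocIds,
                                      Map<String, String> docTextById,
                                      String term) {
        List<String> kept = new ArrayList<>();
        for (String id : rankedDocIds) {
            String text = docTextById.getOrDefault(id, "").toLowerCase();
            if (text.contains(term.toLowerCase())) {
                kept.add(id); // keep only documents in which the term appears
            }
        }
        return kept;
    }
}
```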
5.2. User Testing

To test the usability of the software, a knowledgeable user was asked to search for typical
terms in the dataset using the software, while noting their experience in a short survey. The
methods and documentation for user testing are in Appendix A. The user noted the effectiveness
of the software in finding a result within the collection, the ease of use, and their likelihood of
using the software solution. The user indicated the software system was easy to navigate, but
they did not try to add or upload files to the collection. The user thought that the files returned made
sense, but expressed frustration with too many false positives returned in response to their
queries, and also noted that some of the returned files were essentially empty. These results
indicate that the system provided high recall, but also may have problems with low precision.
The user also gave the suggestion to implement a monitor for indexing progress, and a method to
link directly to the files in their collection. The user expressed satisfaction with the term
suggestion feature. In response to the user feedback, links were added to the file list in the results
returned and a progress monitor reflecting the stage of the indexing progress was implemented.
6. Discussion and Future Work

This project involved investigating a viable system for multiple-user document
organization and retrieval. This work involved researching existing methods of information
retrieval, selecting a model that fit the organization's needs, selecting software components from
an open-source information retrieval project, adjusting them to operate with large, multiple-
document collections, and considering features such as saving indices to disk and a graphical user
interface. This section describes the results of the testing and draws conclusions about the
effectiveness of the software, discussing the advantages and disadvantages of using the vector
space model in this project. Possibilities for continuing the project are also
discussed in the future work section.
6.1. Discussion

Large data sets result in large indices, which are computationally intensive to create. The
system developed in this work provided acceptable results and document relatedness at the cost
of system responsiveness. While using a vector based model for a simple search program allows
the user to distinguish which files are similar to their query, the index time required is costly
unless preprocessing allows for storing a matrix to the user’s disk. During the testing process for
this application, small collections of documents with similar terms were processed quickly, while
larger collections with a greater number of unique terms required a lengthy amount of time. Some
issues with low accuracy were found, as some search terms returned more documents than
necessary as demonstrated in figure 6. Given the startup time required to create an index, the user
could find the wait unacceptable and might prefer to filter through unranked results retrieved by
Boolean methods. In a growing collection of files, it could prove more useful to rely on a
combination of Boolean retrieval, full text search and relevance feedback, which collects data
from the user to further improve results. Improvements to the index time can be realized by
reducing the number of terms indexed through selecting stop words particular to the data set.
Other issues include the complexity of retrieving data from a vector representation of
documents. For example, the term document matrix, while providing relatedness, does not
preserve the sentence structure of the original documents, which could be useful for a graphical
representation of the search results. A solution to this might include assigning an additional value
to the column and row indicating the line number a term occurred on. A normalized matrix could
be stored, but SVD would have to be applied to the matrix each time a document is added to the
collection. Query expansion, which expands the user's query to include related terms, could be
used to eliminate low-precision matches from the system's results (Jurafsky, Martin, 2008;
Tsatsaronis, Varlamis, Vazirgiannis, 2010). Clustering creates sets of clusters for documents
based on similarity. Clustering vectors and calculating distances could also reduce the need for
matrix transformations. Further part-of-speech tagging could be applied for more
specific tags, such as tagging only proper nouns to provide the user with an index of names.
6.2. Future Work

Future work might focus on measures to reduce the size of the matrix to increase
performance. As WordNet allows for noun, verb, adverb, and adjective relations, the WordNet
database itself might provide for semantic results such as presenting the user with filtering
options. Because of the time required to calculate LSI and cosine similarity on a large matrix, a
system that examines part of the matrix in memory might be more effective in reducing the term
space as well as the index time. An example of such a solution is the S-Space package, which
became available after work on this project began and provides support for Latent Semantic
Analysis and sparse matrices in semantic spaces (Jurgens, 2010). Since S-Spaces are reusable,
the package might prove a better choice for processing large collections of documents. The user
interface can be improved given the feedback from the users.
The quality of term matches can be improved by allowing the user to input a list of words
that they would like to be included in the collection, such as names, initials, or other proper
nouns. Since the organization the software was created for is multilingual, Unicode support
should also be incorporated and thoroughly tested. The project could benefit from a persistent
database and a user interface to allow for the inclusion of new files. To fit the structure of the
organization, the ability to add new files without recreating an index, as well as the ability to
restrict access to some files, could be included.
Bibliography

"About WordNet." 2011. Princeton University. 31 Dec. 2011 <http://wordnet.princeton.edu>.

Beckford, Balmain. "Keyword-Based File Sorting for Information Retrieval." Undergraduate Thesis. 2010. Department of Computer Science, Minnesota State University, Mankato. Print.

Bergman, Ofer, Ruth Beyth-Marom, Rafi Nachmias, Noa Gradovitch, and Steve Whittaker. "Improved Search Engines and Navigation Preference in Personal Information Management." ACM Transactions on Information Systems 26.4 (2008): 1-24. Print.

Fang, Hui, Tao Tao, and Chengxiang Zhai. "Diagnostic Evaluation of Information Retrieval Models." ACM Transactions on Information Systems 29.2 (2011). Print.

Garcia, E. "SVD and LSI Tutorial 1: Understanding SVD and LSI." miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html. Web. 26 Jun. 2011.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Pearson Prentice Hall, 2008. Print.

Jurgens, David. "The S-Space Package: A Scalable Software Library for Semantic Spaces." code.google.com/p/airhead-research/. Web. 16 Aug. 2011.

Kobayashi, Mei, and Koichi Takeda. "Information Retrieval on the Web." ACM Digital Library. 2 June 2000. Web. 11 Aug. 2011.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. New York: Cambridge UP, 2008. Print.

Moreira, Leandro R. "Implementing Google's 'Did You Mean' Feature in Java." infoq.com/articles/lucene-did-you-mean. Web. 16 Aug. 2011.

Obbayi, Steve. "SQLite vs MySQL: How To Decide Which To Use." Sobbayi's Tech Labs. 30 May 2010. Web. 24 Dec. 2011. <http://blog.sobbayi.com/2010/05/sqlite-vs-mysql-how-to-decide-which-to-use/>.

Pal, Sujit. (2008a). "IR Math with Java: Similarity Measures." Salmon Run. 27 Sept. 2008. Web. 22 Dec. 2011. <http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html>.

Pal, Sujit. (2008b). "IR Math with Java: TF, IDF and LSI." Salmon Run. 20 Sept. 2008. Web. 22 Dec. 2011.

RapidMiner. Rapid-I. 10 Oct. 2011 <http://rapid-i.com/content/view/181/190/lang,en/>.

Rosario, Barbara. "Latent Semantic Indexing: An Overview." 2000. University of California, Berkeley. Web. 11 Aug. 2011.

Singhal, Amit, Gerard Salton, Mandar Mitra, and Chris Buckley. "Document Length Normalization." 1995. Cornell University. Print.

Soboroff, Ian. "IR Models: The Boolean Model." UMBC Computer Science and Electrical Engineering. Web. 9 Oct. 2011.

Spoerri, A. "Information Retrieval Models." comminfo.rutgers.edu/~aspoerri/. Web. 26 Jun. 2011.

Tsatsaronis, George, Iraklis Varlamis, and Michalis Vazirgiannis. "Text Relatedness Based on a Word Thesaurus." Journal of Artificial Intelligence Research 37 (2010). Print.

Wu, Lihua, JianPing Feng, and Yunfen Luo. "A Personalized Intelligent Web Retrieval System Based on the Knowledge-Base Concept and Latent Semantic Indexing Model." Software Engineering Research, Management and Applications (2009): 45-50. Print.
Appendix A User Testing Materials

This section consists of forms related to the user testing process of this project. The user was
given a brief usability survey in order to assess the utility of the program. The user was also
given a consent form describing the uses of the data collected from the survey, as well as the
overall procedure, the risks and the benefits. Finally, a test script has been included which
outlines how the user interview process was conducted.
Minnesota State University, Mankato
Department of Computer Science

Purpose
The goal of this study is to gather information about the usability of a searchable database for audio teachings. Information will be used to assess and improve the software.

Procedures and Duration
The survey will only take about 10-15 minutes to complete, depending on the length of your responses. It may take you longer to become familiar with the software. You can ask any questions you have and the researcher will guide you through the software use. When you are comfortable, you can complete the survey on your own and email it back to the researcher, or the researcher will ask you the questions and record your answers. You are free to choose which approach you prefer. If you decide to participate in this study, your participation is completely voluntary and you are free to skip any question. At any point you can choose to end your participation. Again, if you decide to participate, you are free to stop your participation at any time without penalty.

Risks and Discomforts
While you are not at any physical risk from participating in this research, there may be a risk of feeling discomfort if any of the questions are sensitive to you.

Benefits and/or Compensation
Only people familiar with the Tsound Project at Odsal Ling Temple and interested in its improvement will be surveyed. There is no other compensation.

Confidentiality
Any information about you will be kept private. Participants will be assigned a numeric code number and the code key will be kept in a cabinet in the primary researcher's office. The information you provide in the survey will be matched to the numeric code number and will not be associated with your name. If the survey is given orally, notes will be taken electronically but no names will be associated with any comment or quote. The results of your participation will be stored electronically with paper backups and will not be available to any other person or organization other than the primary researcher or research assistants. Any information collected will be destroyed at the end of one year.

Publications Associated with this Research
The results of this research may appear in publications but individual participants will not be identified.

Offer to Answer Questions
Before you begin the survey, please acknowledge your acceptance of the terms of the survey. If you have any questions about the rights and treatment of human subjects, please contact: the IRB Administrator, College of Graduate Studies and Research, Minnesota State University, Mankato, 115 Alumni/Foundation Center, Mankato, MN 56001. Phone: (507) 389-2321, FAX: (507) 389-5974. If you have any questions concerning the research, please contact:

Dr. Rebecca Bates
Minnesota State University
Department of Computer Science
[email protected]
507-389-5587

Please include this text in an email to [email protected] or [email protected] and include the statement: "I agree to participate in the Usability Study for A Searchable Database of Audio Teachings" followed by your name and the date.
Your Name________________________ Researcher Name:_________________
Date_____________________________ Date ____________________________
Usability Testing Script

FACILITATOR: Hello, my name is Adedayo Ologunde and I will be walking you through this usability testing session. I have asked you to participate in the testing of the user interface and your experience with a database for retrieval of audio teachings. The purpose of the test is to judge the effectiveness of the user interface and will not test you, the user. If you are unable to proceed with the course of the system, please let me know and do not feel as if you are not contributing. Your participation will allow for a more intuitive design of the system and its ability to retrieve files.

I will be observing how you interact with the system through a Skype screencast so that I can help you more easily if you have questions. I will be writing things down, but your information and responses will be kept confidential. All of the data gathered from this session will be labeled with a number. The only place in which your name will be associated with the corresponding number will be a key code locked in Dr. Bates' office (Wissink Hall 231). The consent forms will be located in the same place. You can end this session at any time and there will be no penalty associated with this. I would like to reiterate this point again: you have the choice to end this session at any time and it will not influence your relationship with Minnesota State University, Mankato or any entities associated with the university or any entities associated with the audio retrieval system.

I will now guide you through the beginning of the system and then observe the rest of the interaction.

1. Start up the software by clicking on the icon (described by tester). Let me know if you have any trouble.
2. Start using the system by putting the desired search terms into the box designated "search terms".
3. Select the most relevant (if any) entry related to the search term.
4. After the search term is selected, select any related terms in the "related terms" form.
5. Open the audio transcription file.
6. Close the file and try another search term. (Repeat until the user has located 4-5 files and has tried multiple types of file selection.)

Now that you are done with the system I will ask you to do one more thing. Please fill out this short questionnaire that reflects your experience with the audio retrieval system. If you want, I can read you the questions and collect the answers.
Thank you for helping me with my project. I appreciate your contributions.
A Searchable Database of Audio Teachings: Usability Survey

Please choose how much you agree or disagree with the following statements.
(1 = Agree, 3 = Neutral, 5 = Disagree)

                                                         1    2    3    4    5
It was easy to navigate the system.                      o    o    o    o    o
I had no trouble uploading files.                        o    o    o    o    o
The files returned made sense given the search terms.    o    o    o    o    o
I was confused by the returned documents.                o    o    o    o    o

What should be changed?

What could improve your experience?

What functions are not helpful?