The Use of Information Retrieval for a Searchable Database of Audio Teachings Adedayo Ologunde A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science Minnesota State University, Mankato Mankato, Minnesota December 2011

The Use of Information Retrieval for a

Searchable Database of Audio Teachings

Adedayo Ologunde

A Thesis Submitted in Partial Fulfillment of the

Requirements for the Degree of

Bachelor of Science

in

Computer Science

Minnesota State University, Mankato

Mankato, Minnesota

December 2011

Abstract

Traditional search engines match key words or phrases based on word frequency,

location, similarity, and update-time. The ranking of the results based on these criteria does not

necessarily reflect the trends of a user’s interest and the large number of results returned forces

the user to sort them by hand. In order to solve the problem of ranking relevant results of interest

to a user, vector based representations of collections of files or documents are created, filtered

through reduction techniques and ranked. In this work, a database containing transcriptions of

audio recordings was created for a small, non-profit organization and augmented with a simple

tagging and search engine in order to provide a basic framework for semantic indexing and

problem domain specific search. The goal was more accurate search results for their audio

recordings. Many of the audio files had brief summaries of the audio with some keywords, while

others did not. As the set of audio data grows, it is important for the file system to adapt the

incoming data to the users’ typical word associations. By asking users to enter keywords when

uploading new files, these associations can be created. Semantic indexing can then be used to

return more accurate search results to the user. This approach, implemented on a desktop system

to ensure data privacy, provided better relatedness and faster retrieval of the desired files and

demonstrated how a small organization can benefit from easy search and indexing of their large

audio library.

Contents

1. Introduction
2. Background
2.1. Document Indexing in IR
2.2. Document Indexing Models
2.2.1. Boolean Models
2.2.2. Vector Space Models
2.3. Index Filtering
2.4. Document Similarity Measures
2.5. Available Tools
2.6. Evaluation Metrics
3. Dataset
4. Software Design and Use
4.1. Development Process
4.2. Software Use
5. Results
5.1. Quantitative Software Evaluation
5.2. User Testing
6. Discussion and Future Work
6.1. Discussion
6.2. Future Work
Bibliography
Appendix A User Testing Materials

1. Introduction

Information retrieval (IR) systems can be useful for recovering specific information from

stored records. While storing and retrieving information is useful, humans often need to use the

retrieved data and apply it in order to answer questions and study, compare and rank topics

beyond simple file retrieval. While presenting meaningful results to a user is a challenging

multifaceted process, several solutions to retrieving meaningful information exist. These include

full text search, which becomes useful for locating phrases, information retrieval systems and

user tagging, and structured query search (Manning, Raghavan, Schütze, 2008).

A small non-profit organization had a sizeable collection of audio files associated with

documents containing some keywords. The organization was looking for a solution to organize

their collection and group related transcriptions with one another. In order to solve the problem

of retrieving specific information from their collection, an information retrieval system was

developed. Traditional search engines use keyword-based matching, which can return large numbers of unrelated results ranked by word frequency, location, similarity, or update time, and these rankings do not necessarily reflect the trends of a user’s interest (Wu et al., 2009). The objective of this project was to create a simple, inexpensive vector-based indexing system and search engine that returns reasonable results to users in a manner that fits their collection of documents, allows them to upload new documents, and allows them to control user access based on topic choice. Free and commercial alternatives that handle information retrieval problems exist, such as Google Desktop Search for individual users and IBM’s InfoSphere Warehouse for enterprises. However, such options might not provide utility for the expressed purpose of searching through a moderately sized collection of documents.

Multiple approaches to document indexing and retrieval can be used to create a system that provides meaningful results for a user. Clustering is one such approach that groups

files into categories based on relatedness (Beckford, 2010). Others include the probabilistic

model, which statistically quantifies data within a collection of documents, and the vector space

model, which treats queries and documents as vectors (Soboroff). This work examines two

methods for information retrieval, Boolean models and vector space models, and presents the

necessary steps to implement examples of each model, with a discussion of their role in this

work. Further information is given on tools used for this project and its evaluation based on its

performance and ability to process and return results to the user.

2. Background

Users looking for information on computer systems typically choose between browsing through the available collection of data themselves and relying on a personal information management (PIM) system to locate it for them (Whittaker, 2008;

Manning, Raghavan, Schütze, 2008). In Internet search, they may be assisted by a system with

the ability to scale to large collections, while a personal computer’s desktop search might be able

to provide detailed results such as file location and file previews with results. Enterprise

computer applications also may have the need for search such as database supported web

applications (e.g. Smiley, Text Search, your Database or Solr). These search options share the common goal of providing relevant results and presenting them to the user in a coherent manner. Figure 1 illustrates the various steps taken to translate a collection of document files into a searchable index. These include filtering words contained within a collection for word type,

analyzing the filtered words for the number of appearances within the text, storing the collected

words in an index and applying transformations to the index, which is used to increase the speed

of search. In information retrieval, documents refer to the units used to build the system, terms refer to the data, typically words or short phrases, contained within the documents to be processed or retrieved, and a collection refers to the set of documents, which may also be referred to as a corpus (Manning, Raghavan, Schütze, 2009). This section presents information about

document indexing, approaches, filtering, available tools and evaluation methods.

 

Figure 1 Document Preparation and Information Retrieval Process for use with the Boolean model or for use with the vector space model

2.1. Document Indexing in IR

Indexing systems can be used to increase the speed of search. An information retrieval

system used to collect meaningful information for a user typically makes use of an indexing step

that allows the system to examine documents’ terms before retrieval to arrange or associate the

documents with queries in a significant manner. Several models and approaches are available to

represent a user’s query and a system’s response. The Boolean model provides simple

implementation, and allows logical operators. Probabilistic models, along with vector space

models allow for ranked retrieval by relevance, at the cost of complexity.

2.2. Document Indexing Models

This section introduces two models of document indexing, the vector space model and the Boolean model. Both the vector space model, which might use indexing techniques such as latent semantic indexing, and the Boolean model rely on text queries to return results, as shown in Figure 2. The Boolean model returns the documents that match a query built with Boolean operators such as AND, OR, and NOT.

Figure 2 Boolean and Vector Space Models

2.2.1. Boolean Models

The Boolean model has been used in dialog systems as well as web search engines that

implement extended Boolean models (Manning, Raghavan, Schütze, 2009). The Boolean model

indexes a set of words, checks if the term is present within a document, and applies Boolean

operators AND, OR, and NOT over the set of documents (Soboroff). Strict Boolean models can

only retrieve exact matches, and weight each term equally, making it impossible to rank

documents, but work quickly and simply in comparison to vector space models. Complex or

extended models add keyword search, phrase search, and chronological ranking (Kobayashi,

Takeda, 2000; Manning, Raghavan, Schütze, 2009; Soboroff).
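The strict Boolean model described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the function names (`build_inverted_index`, `boolean_and`, `boolean_or`, `boolean_not`) and the three sample documents are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, a, b):
    """Documents containing both terms."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Documents containing at least one of the terms."""
    return index.get(a, set()) | index.get(b, set())


def boolean_not(index, all_ids, a):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(a, set())


# Hypothetical sample collection, used only for illustration.
docs = {
    "d1": "meditation posture instructions",
    "d2": "meditation retreat",
    "d3": "temple retreat schedule",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, "meditation", "retreat")))
```

Each query returns an unordered set, which reflects the limitation noted above: every match counts equally, so the strict model gives no basis for ranking.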

2.2.2. Vector Space Models

Implementing a ranking system that takes more into account than order of appearance requires additional features. The vector space model allows for document comparison, but does so at the cost of speed and complexity. In a vector space model, an index is used to collect terms in documents. Steps are taken to collect the documents to be indexed, tokenize the text, linguistically process the tokens through sets of rules within the bodies of text, and index the documents that contain each term (Manning, Raghavan, Schütze, 2009).

Within documents, frequently occurring terms may skew search results towards certain documents in a collection. To address this issue, vector space models make use of term frequency (TF) and inverse document frequency (IDF) to create a TF*IDF score: the number of times a term appears in a document multiplied by the log of the number of documents in the database over the number of documents containing the term. Term frequency weighting refers to the method of scoring these term counts. Inverse document frequency reduces the importance of terms that occur commonly across a collection of documents (Garcia). Term frequency refers to a count of the number of times a word is repeated within a document, which can be used to quantify its importance relative to other terms within a document, as shown in equation 1.

$$TF_i = \frac{N_i}{N}$$

Equation 1

where $N$ represents the number of words in the document and $N_i$ is the number of times term $i$ appears in the document. The inverse document frequency is shown in equation 2.

$$IDF_i = \log \frac{D}{df_i}$$

Equation 2

where $D$ represents the number of documents in the collection, and $df_i$ represents the number of documents containing the term $i$. Terms that do not occur in any document of the collection ($df_i = 0$) are assigned a small fixed value. By multiplying the term frequency and inverse document frequency, a score is created that can measure the importance of terms within documents in a collection. The term weight, $w_i$, is shown in equation 3.

$$w_i = TF_i \cdot IDF_i$$

Equation 3

In this model, documents can be ranked highly against a query if the terms are rare within a

collection, but common within a document (Spoerri). Along with TF-IDF weighting, the process

of indexing a document in a collection might involve normalization to ensure that terms within

documents and the documents themselves are fairly promoted (Singhal, et al., 1995).
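Equations 1 through 3 can be worked through on a toy collection. The sketch below is illustrative only: the function names, the sample token lists, the use of the natural logarithm, and the `UNSEEN_IDF` constant standing in for the "small fixed value" are all assumptions, not details taken from the project.

```python
import math
from collections import Counter

# Assumed placeholder for the "small fixed value" used when df_i = 0.
UNSEEN_IDF = 1e-6


def term_frequency(term, doc_tokens):
    """Equation 1: occurrences of the term divided by the document length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, collection):
    """Equation 2: log(D / df_i). The natural log is an assumption."""
    df = sum(1 for doc_tokens in collection if term in doc_tokens)
    if df == 0:
        return UNSEEN_IDF
    return math.log(len(collection) / df)


def tf_idf(term, doc_tokens, collection):
    """Equation 3: the term weight is the product of TF and IDF."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, collection)


# Hypothetical tokenized documents.
collection = [
    ["lotus", "meditation", "lotus", "retreat"],
    ["meditation", "retreat"],
    ["meditation", "temple"],
]
for term in ("lotus", "meditation"):
    print(term, round(tf_idf(term, collection[0], collection), 4))
```

In this toy collection "meditation" appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing to the ranking, while "lotus" is frequent in the first document and absent elsewhere, so it receives the highest weight.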

Vector space models make use of tokenization, or splitting documents into terms. Word stemming, or using the root of a word, idea, or synonym, can be used to increase the occurrence of rare terms. Tokenization during part-of-speech tagging often delimits the tokens by whitespace, or spaces between tokens; however, hyphenated phrases such as “soft-spoken” may be treated as a single term to return more accurate results to the user (Manning, Raghavan, Schütze, 2009). It is possible to identify the part of speech of each token within a collection of documents. Parts of speech such as nouns, verbs, pronouns, prepositions, adverbs, conjunctions, participles and articles are identified based on information about the word and its neighbors (Jurafsky, Martin, 2008). Tagging of parts of speech can allow for more streamlined results by

reducing noise in a term document matrix while providing a means for the user to identify useful

terms related to their topic of interest. For example, filtering can isolate nouns and proper nouns

and ignore prepositions, conjunctions and articles.

2.3. Index Filtering

Function words such as “and” are frequently found in documents, increasing the document size and potentially increasing search time, but have little intrinsic effect on the results returned to the user. In the set of rules for creating an index, a step is often taken to filter out the most frequently occurring words, or stop words (Manning, Raghavan, Schütze, 2009; Jurafsky, Martin, 2008). The terms selected for the stop words can be adjusted to the collection of documents and

might be different for part-of-speech groups. The use of a stop word list in information retrieval

can reduce space in terms of the index size, as well as reduce the amount of time devoted to

indexing terms.
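Stop word filtering can be illustrated with a short sketch. The stop word list, the regular expression, and the function names below are assumptions chosen for the example; a real list would be tuned to the collection, as noted above. The tokenizer keeps hyphenated words such as "soft-spoken" as single terms.

```python
import re

# Illustrative stop word list; an actual list would be adjusted to the collection.
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}

# Letters, optionally joined by internal hyphens, so "soft-spoken" stays whole.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The soft-spoken teacher and the students")
print(remove_stop_words(tokens))
```

Filtering before indexing means the removed words never occupy rows in the term-document matrix, which is where the savings in index size and indexing time come from.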

Stemming, or recognizing and grouping the roots of words, is also used to reduce the size

of an index (Manning, Raghavan, Schütze, 2009). In the processing phase of creating an index,

each word contained within a document is given a weight based on its frequency and the list of

unique words associated with a vector (Jurafsky, Martin, 2008). The weights within the

documents are represented within a document vector. In the vector space model, when a user

searches through the document collection, a query vector represents their query.

2.4. Document Similarity Measures

Before a user is presented with search results, cosine similarity is applied to compare the vectors that represent documents within the collection with the vector that represents the user’s query. Several methods of measuring similarity between documents exist, such as Jaccard similarity and abstract similarity (Pal, 2008a). In this project, cosine similarity was used. This process measures the cosine of the angle between two vectors, A and B, as shown in equation 4 (Garcia).

$$\mathrm{sim}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Equation 4
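As a rough illustration of equation 4, the following sketch computes cosine similarity between a query vector and a few document vectors and ranks the documents by score. The vectors, the document names, and the function names are hypothetical; in the project the vector entries would be term weights from the index.

```python
import math


def cosine_similarity(a, b):
    """Equation 4: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, doc_vectors):
    """Return document names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in doc_vectors.items()]
    return [name for score, name in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Hypothetical three-term vocabulary and weights.
doc_vectors = {
    "d1": [0.5, 0.0, 0.1],
    "d2": [0.0, 0.4, 0.0],
    "d3": [0.2, 0.2, 0.0],
}
query = [1.0, 0.0, 0.0]
print(rank_documents(query, doc_vectors))
```

Because the score depends only on the angle between vectors, a long document and a short one with the same proportions of terms receive the same similarity to a query.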

Latent semantic indexing (LSI) is an application of singular value decomposition (SVD),

allowing for document scoring and ranking. LSI projects queries and documents into a space

with latent semantic dimensions, and then SVD finds the optimal projection to a low dimensional

space (Rosario, 2000). LSI can provide optimal matches of variation to documents, but becomes

costly at large dimensions or collections with large amounts of variance between terms.
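The projection that LSI performs can be written out in its standard textbook form. The symbols here are assumptions introduced for illustration and are not defined in the original text: $A$ is the term-document matrix, $k$ is the number of latent dimensions kept, and $q$ is the query vector over the same terms.

$$A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad \hat{q} = \Sigma_k^{-1} U_k^{T} q$$

Under this formulation, each document is represented by a row of $V_k$, and the projected query $\hat{q}$ is compared against those rows with the cosine similarity of equation 4. Keeping only the $k$ largest singular values is what reduces the dimensionality, and computing the decomposition is the step that becomes costly as the matrix grows.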

2.5. Available Tools

The non-profit organization had concerns about security, so the use of web-related technologies was avoided. Several solutions for desktop-based search were considered. For example, Rapid-I provides a widely used software tool for data mining, RapidMiner, which is graphical and extensible, but offers limited support for inexpensive versions (RapidMiner). Several open source projects were available to process text. SemanticVectors, for example, creates “WordSpace” models from natural language text (semanticvectors), but offers limited extensibility and support. Lucene was used throughout the project to compare the quality of search results and for a word suggestion feature through Lucene’s SpellChecker class (Lucene). Lucene provided fast and scalable indexing with various sized collections (Lucene). However, Lucene does not rank documents, which is critical for sorting meaningful results. The S-Space package provides support for latent semantic analysis, and was designed for scalability (Jurgens). S-Space might have been useful in quickly processing the collection of files, but was not considered until late in the development process and was not used. The Java Text Mining Toolkit (JTMT) provided part-of-speech tagging capabilities, similarity measures and the ability to rank

documents, which gave a reasonable starting point to build upon (Pal, 2008a). Much of the indexing and search processing was built upon the JTMT project.

2.6. Evaluation Metrics

Two evaluation metrics used in information retrieval are precision and recall, as shown in equations 5 and 6 (Manning, Raghavan, Schütze, 2009; Jurafsky, Martin, 2008; Kobayashi, 2000). Precision is the ratio of relevant documents retrieved to the total number of documents retrieved, and recall is the proportion of relevant documents retrieved compared to the relevant documents in the database (Kobayashi, 2000). In order to achieve high precision, the IR system would have to return only relevant responses to the user’s query.

$$\text{Recall} = \frac{\text{correct responses given by system}}{\text{total possible correct answers in text}}$$

Equation 5

$$\text{Precision} = \frac{\text{correct responses given by system}}{\text{number of answers given by system}}$$

Equation 6
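A small worked example may make the two metrics concrete. The document identifiers and the function names below are hypothetical and serve only to show how the ratios are computed.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical query outcome: four documents returned, three truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
print(precision(retrieved, relevant), recall(retrieved, relevant))
```

Here two of the four returned documents are relevant, giving a precision of 0.5, and two of the three relevant documents were found, giving a recall of about 0.67. Returning every document in the collection would push recall to 1 while lowering precision, which is why the two are reported together.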

Another method of evaluating the system was through user testing. Because this project

has expected clients, they were asked for feedback. The feedback provided in user testing can be

used to improve software systems as well as the likelihood of their adoption by the target user

populations. The methods used here are described in Appendix A.

3. Dataset

The data used for this project was a sample of 126 documents in plain text format. The number of unique terms in the corpus was roughly 5,000 to 6,000. The format of the documents was

generally a collection of words, with a date associated with the document name. In addition to

English words, the collection contained Portuguese words, numbers, and symbols that were

filtered out of the results. A sample of a file is shown in Figure 3, demonstrating proper nouns,

and special characters contained within the collection, such as the “>” character. When

encountered, characters such as these could not be treated as a proper noun, noun, or verb and

were filtered out of the indexed collection.

20091120-04-Fri

Amithaba one arms-length on the top of our head Chenrezig central channel* heart chakra lotus, tikle luminous as a candle. seed-syllable straight as an arrow - lung (not recorded) - meditation posture instructions 2009- P´howa 20091120.04-Fri > little finger´s width P`howa retreat Lama Tsering Everest Odsal Ling temple Translator- Priscila

Figure 3 A sample file in the collection of documents

4. Software Design and Use

Section 4 explains the various open source and freely available software components that

were used in creating this project. These components include the JTMT project, which was used

throughout the indexing and search steps, the Java Swing Widget Toolkit platform, which was

used for the graphical user interface, and the Lucene project, which was used for spelling

suggestion. This is illustrated in figure 4. This section also explains how content filtering was

used in this project, along with the design choices for its graphical user interface.

Figure 4 A block diagram displaying the various tools used to create the project.

4.1. Development Process

Open source and freely available software from the Java Text Mining Toolkit by Sujit Pal (2008a) was used for much of the indexing process. The components used were part-of-speech tagging, TF-IDF indexing, and cosine similarity. After a user selects a directory

containing files, the files in the directory are processed through a set of recognizers, which

recognize properties such as punctuation, phrases, and abbreviation. Next, the files are processed

in a rule-based tagger, which categorizes parts of speech using the MIT Java Wordnet Interface

(Pal, 2008b). Nouns, verbs, adjectives and adverbs can be tagged as an allowable part of speech

through the Wordnet interface after its “sense” or its unique word ID has been determined

(About Wordnet, 2011). Next, the word is tagged as TokenType.Content_WORD. Verbs,

adjectives and adverbs were omitted to save space in the index. Punctuation, numbers,

abbreviation, and spaces are also omitted from the text using a break iterator based on the

ICU4j's RuleBasedBreakIterator (Pal, 2008b). For example, the sentence, “- These are the 3

reasons why Buddha said life is suffering.” is tagged as [“- These are the 3 reasons why Buddha

said life is suffering.”]. The software tags the dash and the period in the form [UNKNOWN], the

number 3 as [NUMBER] and remaining words as [word]; apostrophes and hyphens are included with words. Nouns, verbs, and adjectives are identified, for instance, [self-centeredness (noun) WID-04779645] and then tagged as [CONTENT_WORD] as an allowable part of speech. The content words are stored in a list with their IDs. A stop word list was used to filter out commonly used words. Finally, the content words in the collection are assigned frequencies based upon the number of times they appear within the documents, and the documents are mapped to a matrix.

Users might make mistakes while entering input and may not receive precise results. In order to address this, a spell checking system prompts the user when no matching documents are found and suggests alternative terms according to a dictionary list.

Java Swing components were used to create a basic graphical user interface. Java Swing was chosen because

of the ability to run Java programs on Microsoft Windows, Mac OS X, and Linux, which would

suit the user’s varied needs. The GUI was created with the intent to allow the user to select files

and search through their contents, with the visual guidelines of each operating system taken into consideration. A screenshot of the program is shown in Figure 5, demonstrating a customized file chooser for Mac OS X. Conforming to each platform’s human interface guidelines provided consistency with the user’s expectations of the behavior of applications that run on the platform.

Figure 5 A comparison of Mac OS X's native file chooser on the left, and Java's default file chooser on the right

Much of the work for this project was produced with the intent of taking an existing,

cost-effective indexing and searching system and modifying it to work with the non-profit

organization’s collection. Therefore, the JTMT project was modified to work with collections of documents instead of paragraphs within a single document, and to operate through a GUI instead of a command prompt for user friendliness. Additionally, the output of the results was modified for clarity, and the ability to save an index to disk was added to save time in later searches. SQLite was used instead of MySQL in the interest of portability, since the database is stored on the host machine and does not require a server (Obbayi, 2010). Also, word

suggestions were added through Lucene when no reasonable results to a user’s query were

returned.
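The token tagging described in this section, where the dash and period are tagged [UNKNOWN], the number 3 is tagged [NUMBER], and the remaining words are tagged as words, can be approximated with a short sketch. This is a regular-expression stand-in written for illustration; it is not JTMT's recognizer chain or its WordNet-based tagger, and the names `TOKEN_PATTERN`, `WORD_PATTERN`, and `tag_tokens` are hypothetical.

```python
import re

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

# Split text into words, digit runs, or single leftover symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+|[^\sA-Za-z\d]")


def tag_tokens(sentence):
    """Return (token, tag) pairs using WORD, NUMBER, or UNKNOWN tags."""
    tagged = []
    for token in TOKEN_PATTERN.findall(sentence):
        if WORD_PATTERN.fullmatch(token):
            tag = "WORD"
        elif token.isdigit():
            tag = "NUMBER"
        else:
            tag = "UNKNOWN"
        tagged.append((token, tag))
    return tagged


sentence = "- These are the 3 reasons why Buddha said life is suffering."
for token, tag in tag_tokens(sentence):
    print(token, tag)
```

In the project itself, the word tokens would go on to the part-of-speech step, where only allowable parts of speech are kept as content words; the sketch stops before that step because it depends on the WordNet lookup.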

4.2. Software Use

The program starts by allowing the user to select a directory containing plain text files. The files are indexed by the program, and a map of the document names, words and their frequencies is generated. Through the JTMT toolkit, a simple linear algebra package is used to

normalize the term frequencies (Pal, 2008b). LSI and cosine similarity are applied to the matrix.

To search through the documents, a query vector representing the terms in the query is treated as

an instance of a document vector allowing query terms to be compared against the matrix.

After the user has installed Java on their computer, they select the Java jar file for the project.

They select a directory containing the files that they wish to index. A screenshot demonstrating

the matching results returned from the index is shown in Figure 6. The spelling suggestion feature used Lucene’s SpellChecker, which allows the minimum accuracy of matches to be set based on edit distance; the accuracy was set to 0.75, producing results reasonably close to the given input

(Moreira).
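The effect of an accuracy threshold of 0.75 can be illustrated with a plain Levenshtein distance. This sketch does not call Lucene; it assumes the score is a normalized edit-distance similarity, 1 minus the distance divided by the longer word's length, and the function names and the small dictionary are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical words, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def suggest(word, dictionary, min_accuracy=0.75):
    """Dictionary words scoring at least min_accuracy, best match first."""
    scored = [(similarity(word, candidate), candidate) for candidate in dictionary]
    kept = [pair for pair in scored if pair[0] >= min_accuracy]
    return [candidate for score, candidate in sorted(kept, key=lambda p: (-p[0], p[1]))]


# Hypothetical dictionary and misspelled query term.
print(suggest("meditaton", ["meditation", "mediation", "temple"]))
```

Raising the threshold toward 1 admits only near-identical spellings, while lowering it returns more, and looser, suggestions.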

 

Figure 6 A screenshot of the program displaying the results of a query.

5. Results

This section presents the evaluation used to determine the correctness and performance of the

program. Results are presented in two ways. An analysis of software performance is followed by

user testing results. The section begins with a description of the test setup and evaluation

methods.

5.1. Quantitative Software Evaluation

To determine whether the software would be useful to its users, several tests were conducted to measure the quality of the results returned to the user, and the speed of the program. These tests

were critical, since a user waiting for results for their query to appear, or a user searching

through an ineffectual list of results might forego using the software altogether. A test setup

consisting of a computer running Microsoft Windows 7 with a 2.8 GHz processor, 4 GB of RAM, and a 64 GB SSD executed the program. Tests were conducted for three components of the program: index creation time, precision and recall of search results, and search result suggestion. The index creation time test used pseudorandomly selected files from the non-profit organization’s document collection, in subsets scaled from nearly the size of the collection to about a

tenth of it. The index creation test included the time required to process the files by applying a

chain of recognizers to each file and storing the processed files as tokens in a matrix, as well as

performing latent semantic indexing and cosine similarity on the matrices.

Additional tests were performed on precision and recall of search results by running the

program multiple times with various search terms. Words and phrases were selected from the

document collection and compared to the results returned from the search application. The

results of the suggestion system were also measured for precision and recall, which is discussed

in section 5.2.
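For reference, precision and recall for a single query can be computed as in the sketch below. The file names and relevance judgments are hypothetical and do not come from the test collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four files returned, of which two are actually relevant.
retrieved = ["a.txt", "b.txt", "c.txt", "d.txt"]
relevant = ["a.txt", "b.txt"]
precision, recall = precision_recall(retrieved, relevant)
```

In this toy case every relevant file is returned (recall of 1.0), but half of the returned files are false positives (precision of 0.5).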

Table 1 Index Creation Time

Initial Collection Size    Time to Create Index (Minutes)
10 files                   1.07
25 files                   2.56
50 files                   5.05
75 files                   6.71
100 files                  8.33
125 files                  12.37

The index creation time test was conducted three times for each scaled collection and an

average time to index the collection was recorded as shown in Table 1. Based on observation,

results indicate near-linear growth in index time as the number of terms increases, as

shown in Figure 7. While a collection of 10 documents corresponded to 500 terms, or a

5,000-element matrix, a collection of 125 documents corresponded on average to 2,700 terms.

With the size of the indices taken into account, the increase in index time can be attributed to the

size of the matrix used for LSI. A matrix of 125 × 2,700, or 337,500, elements requires about 2.7 MB

of memory (at 8 bytes per element), and larger matrices take longer to process because each

document in the matrix corresponds to a point in the term space. The cosine of the angle between

each document and every other document in the term space must be calculated, which increased the

time required to finish indexing. An increase in the number of files also corresponded to an increase

in index creation time, as shown in Figure 7, since additional files added more terms to the

collection.

 

Figure 7 Index Creation Time as the Number of Files Increases
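The matrix sizes quoted in this section can be checked with simple arithmetic. The sketch below assumes 8 bytes per matrix element (for example, a double-precision value), which is an assumption consistent with the 2.7 MB figure rather than a detail taken from the program; the function names are illustrative.

```python
def matrix_elements(documents, terms):
    """Number of cells in a dense term-document matrix."""
    return documents * terms

def matrix_megabytes(documents, terms, bytes_per_element=8):
    """Approximate dense storage in megabytes (assumes 8-byte elements by default)."""
    return matrix_elements(documents, terms) * bytes_per_element / 1_000_000

def pairwise_comparisons(documents):
    """Number of distinct document pairs whose cosine must be computed."""
    return documents * (documents - 1) // 2

small = matrix_elements(10, 500)         # 10 documents, 500 terms
large = matrix_elements(125, 2700)       # 125 documents, 2,700 terms
large_mb = matrix_megabytes(125, 2700)   # dense storage for the larger matrix
pairs = pairwise_comparisons(125)        # document pairs in the larger collection
```

The number of cells grows with both documents and terms, while the number of document pairs grows quadratically with the number of documents, so both factors contribute to longer index times for larger collections.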

A single document was indexed twice to test for similarity measures. When individual

terms were searched for, they yielded identical results, suggesting high precision. The second

document made of identical terms was then reduced to half its size, and a single term matching

test was applied to both documents. While the results produced a match for the first document,

which contained the term, they also produced a match for the second document, which did not

contain the term. For example, Document A, which contained the

phrase “activity is attained with amplification,” was compared to Document B, which contained the

phrase “adjust activity by expression.” The search returned both documents as matches

when the term “attained” was searched for. Latent semantic indexing was used to

compare the two documents; the term “attained” had a frequency of 1 within the first

document, 0 within the second document, and 1 within the query vector. After matrix

decomposition was applied to the matrix, the scores in the term’s row remained unequal, and a

similar result occurred when applying cosine similarity between the query and the matrix. The

similarity between documents is represented as shown in Equation 7

$$q^{T}A = \begin{array}{cc} d_1 & d_2 \\ 0.426 & 0.851 \end{array} \qquad \text{(Equation 7)}$$

where q represents a query vector or a user’s query, A represents the term-document matrix, d1

and d2 are two documents, and T is the transpose of the vector. By multiplying the transpose of

the query vector by the term-document matrix, a similarity measure is calculated. The dissimilar

results between documents 1 and 2 suggest that the results returned from the system reflect the

similarity of the documents instead of terms, which might provide high recall but does not

necessarily deliver the level of precision to distinguish one highly related document from

another. The suggestion system was set to return three possible matches from the program’s

dictionary for each term in the query. The system returned reasonably close matches, but at the

cost of precision. One possible solution to the low precision of the matches would be to

compare the query to the terms in each individual document. Since matches are provided by

document similarity, an additional term search would filter out documents where the term did not

appear.
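The effect described above, and the proposed term filter, can be illustrated with a small sketch. It uses a rank-1 truncated approximation of the term-document matrix, computed by power iteration, as the simplest stand-in for the dimensionality reduction behind LSI. It omits stop-word removal and the rest of the program's processing chain, so its scores are not expected to match Equation 7; the function names are assumptions.

```python
import math

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [text.lower().split() for text in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = [[words.count(term) for words in tokenized] for term in vocabulary]
    return vocabulary, matrix

def dominant_right_singular_vector(matrix, iterations=200):
    """Power iteration on A^T A; returns the unit vector v of the rank-1 approximation."""
    n_docs = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n_docs)]
            for i in range(n_docs)]
    v = [1.0 / math.sqrt(n_docs)] * n_docs
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n_docs)) for i in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def rank1_scores(term, vocabulary, matrix):
    """Compute q^T A_1 for a one-term query, where A_1 = (A v) v^T."""
    v = dominant_right_singular_vector(matrix)
    row = matrix[vocabulary.index(term)]
    projection = sum(a * b for a, b in zip(row, v))  # entry of A v for this term
    return [projection * x for x in v]

def filter_by_term(term, documents, scores):
    """Keep only the scored documents whose text actually contains the term."""
    return [(index, score) for index, score in enumerate(scores)
            if term in documents[index].lower().split()]

documents = ["activity is attained with amplification",  # Document A
             "adjust activity by expression"]             # Document B
vocabulary, matrix = term_document_matrix(documents)
scores = rank1_scores("attained", vocabulary, matrix)
filtered = filter_by_term("attained", documents, scores)
```

In this sketch Document B receives a nonzero score for “attained” because it shares the term “activity” with Document A, and the reduced space no longer separates the two. The filter then drops Document B, since the term never appears in it.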

5.2. User Testing

To test the usability of the software, a knowledgeable user was asked to search for typical

terms in the dataset using the software, while noting their experience in a short survey. The

methods and documentation for user testing are in Appendix A. The user noted the effectiveness

of the software in finding a result within the collection, the ease of use, and their likelihood of

using the software solution. The user indicated the software system was easy to navigate, but

did not try to add or upload files to the collection. The user thought that the files returned made

sense, but expressed frustration with too many false positives returned in response to their

queries, and also noted that some of the returned files were essentially empty. These results

indicate that the system provided high recall, but also may have problems with low precision.

The user also gave the suggestion to implement a monitor for indexing progress, and a method to

link directly to the files in their collection. The user expressed satisfaction with the term

suggestion feature. In response to the user feedback, links were added to the file list in the results

returned and a progress monitor reflecting the stage of the indexing progress was implemented.

6. Discussion and Future Work

This project involved investigating a viable system for multiple user document

organization and retrieval. This work involved researching existing methods of information

retrieval, selecting a model that fit the organization’s needs, selecting software components from

an open-source information retrieval project, adjusting them to operate with multiple large

document collections, and considering features such as saving indices to disk and a graphical user

interface. This section describes the results of the testing and draws conclusions about the

effectiveness of the software, discussing the advantages and disadvantages of using the vector

space model in this project. Possibilities for continuing the project are

discussed in Section 6.2.

6.1. Discussion

Large data sets result in large indices, which are computationally intensive to create. The

system developed in this work provided acceptable results and document relatedness at the cost

of system responsiveness. While using a vector based model for a simple search program allows

the user to distinguish which files are similar to their query, the index time required is costly

unless preprocessing allows for storing a matrix to the user’s disk. During the testing process for

this application, small collections of documents with similar terms were processed quickly, while

larger collections with a greater number of unique terms required a lengthy amount of time. Some

issues with low accuracy were found, as some search terms returned more documents than

necessary, as demonstrated in Figure 6. Given the startup time required to create an index, the user

could find the wait unacceptable and might prefer to filter through unranked results retrieved by

Boolean methods. In a growing collection of files, it could prove more useful to rely on a

combination of Boolean retrieval, full text search and relevance feedback, which collects data

from the user to further improve results. Improvements to the index time can be realized by

reducing the number of terms indexed through selecting stop words particular to the data set.
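One way to select stop words particular to a data set is to treat terms that occur in nearly every document as uninformative. The sketch below is an illustration under that assumption; the 0.8 cutoff, the sample documents, and the function names are hypothetical and were not part of the program.

```python
def document_frequencies(documents):
    """Count, for every term, how many documents contain it at least once."""
    frequencies = {}
    for text in documents:
        for term in set(text.lower().split()):
            frequencies[term] = frequencies.get(term, 0) + 1
    return frequencies

def collection_stop_words(documents, max_fraction=0.8):
    """Terms that appear in more than max_fraction of the documents."""
    frequencies = document_frequencies(documents)
    cutoff = max_fraction * len(documents)
    return {term for term, count in frequencies.items() if count > cutoff}

def vocabulary(documents, stop_words=frozenset()):
    """Distinct terms in the collection, excluding any stop words."""
    return {term for text in documents for term in text.lower().split()
            if term not in stop_words}

# Hypothetical transcription snippets.
documents = [
    "the teaching on compassion",
    "the teaching on patience",
    "the teaching on wisdom",
    "the retreat on generosity",
]
stop_words = collection_stop_words(documents)
before = len(vocabulary(documents))
after = len(vocabulary(documents, stop_words))
```

Each removed term eliminates one row from the term-document matrix, which reduces the amount of work for every later matrix operation.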

Other issues include the complexity of retrieving data from a vector representation of

documents. For example, the term document matrix, while providing relatedness, does not

preserve the sentence structure of the original documents, which could be useful for a graphical

representation of the search results. A solution to this might include assigning an additional value

to the column and row indicating the line number a term occurred on. A normalized matrix could

be stored, but SVD would have to be applied to the matrix each time a document is added to the

collection. Query expansion, which expands the user’s query to include related terms, could be

used to eliminate low-precision matches from the system’s results (Jurafsky, Martin, 2008;

Tsatsaronis, Varlamis, Vazirgiannis, 2010). Clustering groups documents into sets

based on similarity. Clustering vectors and calculating distances could also reduce the need for

matrix transformations. Further, part-of-speech tagging could be applied for more

specific tags, such as tagging only proper nouns to provide the user with an index of names.
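Query expansion, as described above, can be sketched as a lookup that appends related terms to the user's query. The related-terms map here is a hard-coded, hypothetical stand-in for a thesaurus such as WordNet, and the function name is illustrative.

```python
def expand_query(query, related_terms):
    """Return the query terms followed by any related terms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in related_terms.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

# Hypothetical thesaurus entries; a real system might draw these from WordNet.
related_terms = {
    "attained": ["achieved", "reached"],
    "activity": ["action"],
}
expanded = expand_query("attained activity", related_terms)
```

The expanded term list would then be turned into a query vector in the usual way, so that documents using a related word can still be scored against the query.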

6.2. Future Work

Future work might focus on measures to reduce the size of the matrix to increase

performance. As WordNet allows for noun, verb, adverb, and adjective relations, the WordNet

database itself might provide for semantic results such as presenting the user with filtering

options. Because of the time required to calculate LSI and cosine similarity on a large matrix, a

system that examines part of the matrix in memory might be more effective in reducing the term

space as well as the index time. An example of such a solution is the S-Space package, which

became available after work on this project began and provides support for Latent Semantic

Analysis and sparse matrices in semantic spaces (Jurgens, 2010). Since S-Spaces are reusable,

the package might prove a better choice for processing large collections of documents. The user

interface can be improved given the feedback from the users.
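The benefit of the sparse matrices mentioned above can be illustrated by storing only the nonzero term counts for each document. This sketch does not use or reproduce the S-Space package's API; the dictionary-based representation and the function names are assumptions chosen for brevity.

```python
import math

def sparse_vector(text):
    """Map each term in the text to its count; absent terms are implicitly zero."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return counts

def sparse_cosine(u, v):
    """Cosine similarity that only visits terms stored in the smaller vector."""
    if len(u) > len(v):
        u, v = v, u
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def stored_entries(vectors):
    """Total number of nonzero cells actually kept in memory."""
    return sum(len(vector) for vector in vectors)

doc_a = sparse_vector("activity is attained with amplification")
doc_b = sparse_vector("adjust activity by expression")
similarity = sparse_cosine(doc_a, doc_b)
stored = stored_entries([doc_a, doc_b])
dense_cells = len(set(doc_a) | set(doc_b)) * 2
```

For these two short documents the sparse form keeps 9 entries where a dense matrix would hold 16 cells; the gap widens as the vocabulary grows, because most documents contain only a small fraction of the collection's terms.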

The quality of term matches can be improved by allowing the user to input a list of words

that they would like to be included in the collection, such as names, initials, or other proper

nouns. Since the organization the software was created for is multilingual, Unicode support

should also be incorporated and thoroughly tested. The project could benefit from a persistent

database and a user interface to allow for the inclusion of new files. To fit the structure of the

organization, the ability to add new files without recreating an index as well as the ability to

restrict access to some files could be included.

Bibliography

“About WordNet.” 2011. Princeton University. 31 Dec. 2011 <http://wordnet.princeton.edu>.

Beckford, Balmain. “Keyword-Based File Sorting for Information Retrieval.” Undergraduate Thesis. 2010. Department of Computer Science, Minnesota State University, Mankato. Print.

Bergman, Ofer, Ruth Beyth-Marom, Rafi Nachmias, Noa Gradovitch, and Steve Whittaker. “Improved Search Engines and Navigation Preference in Personal Information Management.” ACM Transactions on Information Systems 26.4 (2008): 1-24. Print.

Fang, Hui, Tao Tao, and Chengxiang Zhai. “Diagnostic Evaluation of Information Retrieval Models.” ACM Transactions on Information Systems 29.2 (2011). Print.

Garcia, E. “SVD and LSI Tutorial 1: Understanding SVD and LSI.” miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html. Web. Sun, 26 Jun. 2011.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing: an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Pearson Prentice Hall, 2008. Print.

Jurgens, David. “The S-Space Package: A scalable software library for semantic spaces.” code.google.com/p/airhead-research/. Web. Tue, 16 Aug. 2011.

Kobayashi, Mei, and Koichi Takeda. “Information Retrieval on the Web.” ACM Digital Library. 2 June 2000. Web. 11 Aug. 2011.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. New York: Cambridge UP, 2008. Print.

Moreira, Leandro R. “Implementing Google’s ‘Did you mean’ Feature In Java.” Infoq.com/articles/lucene-did-you-mean. Web. 16 Aug. 2011.

Obbayi, Steve. “SQLite vs MySQL: How To Decide Which To Use « Sobbayi's Tech Labs.” Sobbayi's Tech Labs. 30 May 2010. Web. 24 Dec. 2011. <http://blog.sobbayi.com/2010/05/sqlite-vs-mysql-how-to-decide-which-to-use/>.

Pal, Sujit. (2008a). “IR Math with Java : Similarity Measures.” Salmon Run. 27 Sept. 2008. Web. 22 Dec. 2011. <http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html>.

Pal, Sujit. (2008b). “IR Math with Java : TF, IDF and LSI.” Salmon Run. 20 Sept. 2008. Web. 22 Dec. 2011. <http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html>.

RapidMiner. Rapid - I. 10 Oct. 2011 <http://rapid-i.com/content/view/181/190/lang,en/>.

Rosario, Barbara. “Latent Semantic Indexing: An Overview.” Undergraduate Thesis. 2000. University of California-Berkeley. Web. 11 Aug. 2011.

Singhal, Amit, Gerard Salton, Mandar Mitra, and Chris Buckley. “Document Length Normalization.” Undergraduate Thesis. 1995. Cornell University. Print.

Soboroff, Ian. “IR Models: The Boolean Model.” UMBC Computer Science and Electrical Engineering | Inspiring Innovation. http://comminfo.rutgers.edu/~aspoerri/. Web. 9 Oct. 2011.

Spoerri, A. “Information Retrieval Models.” miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html. Web. Sun, 26 Jun. 2011.

Tsatsaronis, George, Iraklis Varlamis, and Michalis Vazirgiannis. “Text Relatedness Based on a Word Thesaurus.” Journal of Artificial Intelligence Research 37 (2010). Print.

Wu, Lihua, JianPing Feng, and Yunfen Luo. “A Personalized Intelligent Web Retrieval System Based on the Knowledge-Base Concept and Latent Semantic Indexing Model.” Software Engineering Research, Management and Applications (2009): 45-50. Print.

Appendix A User Testing Materials

This section consists of forms related to the user testing process of this project. The user was

given a brief usability survey in order to assess the utility of the program. The user was also

given a consent form describing the uses of the data collected from the survey, as well as the

overall procedure, the risks and the benefits. Finally, a test script has been included which

outlines how the user interview process was conducted.  

Minnesota State University, Mankato
Department of Computer Science

Purpose

The goal of this study is to gather information about the usability of a searchable database for audio teachings. Information will be used to assess and improve the software.

Procedures and Duration

The survey will only take about 10-15 minutes to complete, depending on the length of your responses. It may take you longer to become familiar with the software. You can ask any questions you have and the researcher will guide you through the software use. When you are comfortable, you can complete the survey on your own and email it back to the researcher, or the researcher will ask you the questions and record your answers. You are free to choose which approach you prefer. If you decide to participate in this study, your participation is completely voluntary and you are free to skip any question. At any point you can choose to end your participation. Again, if you decide to participate, you are free to stop your participation at any time without penalty.

Risks and Discomforts

While you are not at any physical risk from participating in this research, there may be a risk of feeling discomfort if any of the questions are sensitive to you.

Benefits and/or Compensation

Only people familiar with the Tsound Project at Odsal Ling Temple and interested in its improvement will be surveyed. There is no other compensation.

Confidentiality

Any information about you will be kept private. Participants will be assigned a numeric code number and the code key will be kept in a cabinet in the primary researcher’s office. The information you provide in the survey will be matched to the numeric code number and will not be associated with your name.

If the survey is given orally, notes will be taken electronically but no names will be associated with any comment or quote. The results of your participation will be stored electronically with paper backups and will not be available to any other person or organization other than the primary researcher or research assistants. Any information collected will be destroyed at the end of one year.

Publications Associated with this Research

The results of this research may appear in publications but individual participants will not be identified.

Offer to Answer Questions

Before you begin the survey, please acknowledge your acceptance of the terms of the survey. If you have any questions about the rights and treatment of human subjects, please contact: the IRB Administrator, College of Graduate Studies and Research, Minnesota State University, Mankato, 115 Alumni/Foundation Center, Mankato, MN 56001. Phone: (507) 389-2321 FAX: (507) 389-5974

If you have any questions concerning the research, please contact:

Dr. Rebecca Bates
Minnesota State University
Department of Computer Science
[email protected]
507-389-5587

Please include this text in an email to [email protected] or [email protected] and include the statement: “I agree to participate in the Usability Study for A Searchable Database of Audio Teachings” followed by your name and the date.

Your Name________________________       Researcher Name:_________________

Date_____________________________       Date ____________________________

 

Usability Testing Script

FACILITATOR: Hello, my name is Adedayo Ologunde and I will be walking you through this usability testing session. I have asked you to participate in the testing of the user interface and your experience with a database for retrieval of audio teachings. The purpose of the test is to judge the effectiveness of the user interface and will not test you, the user. If you are unable to proceed with the course of the system, please let me know and do not feel as if you are not contributing. Your participation will allow for a more intuitive design of the system and its ability to retrieve files. I will be observing how you interact with the system through a Skype screencast so that I can help you more easily if you have questions. I will be writing things down, but your information and responses will be kept confidential. All of the data gathered from this session will be labeled with a number. The only place in which your name will be associated with the corresponding number will be a key code locked in Dr. Bates' office (Wissink Hall 231). The consent forms will be located in the same place.

You can end this session at any time and there will be no penalty associated with this. I would like to reiterate this point: you have the choice to end this session at any time and it will not influence your relationship with Minnesota State University, Mankato or any entities associated with the university or any entities associated with the audio retrieval system.

I will now guide you through the beginning of the system and then observe the rest of the interaction.

1. Start up the software by clicking on the icon (described by tester). Let me know if you have any trouble.
2. Start using the system by putting the desired search terms into the box designated “search terms”.
3. Select the most relevant (if any) entry related to the search term.
4. After the search term is selected, select any related terms in the “related terms” form.
5. Open the audio transcription file.
6. Close the file and try another search term. (Repeat until user has located 4-5 files and has tried multiple types of file selection.)

Now that you are done with the system I will ask you to do one more thing. Please fill out this short questionnaire that reflects your experience with the audio retrieval system. If you want, I can read you the questions and collect the answers.

Thank you for helping me with my project. I appreciate your contributions.

A Searchable Database of Audio Teachings: Usability Survey

Please choose how much you agree or disagree with the following statements.

                                                        Agree     Neutral     Disagree
                                                          1    2     3     4     5
It was easy to navigate the system.                       o    o     o     o     o
I had no trouble uploading files.                         o    o     o     o     o
The files returned made sense given the search terms.     o    o     o     o     o
I was confused by the returned documents.                 o    o     o     o     o

What should be changed?

What could improve your experience?

What functions are not helpful?