
Class 9 - Search and Retrieval

Exercise overview

In classes 1-7 we explored information systems with a focus on the role of structured and semi-structured data (e.g. web pages, metadata schemas and relational databases) in supporting the information lifecycle. We identified different types of metadata that support aspects of the lifecycle, including descriptive (e.g. discovery phase), administrative (e.g. acquisition, appraisal phases), technical (e.g. preservation phase) and structural (e.g. ingest/management phases). We found that these types of metadata are essential in enabling long-term management and preservation of resources.

While we learned about the value of metadata, we also found that manual metadata creation techniques are not always ideal and may not scale as needed in certain situations. In addition to this problem of scale, there is a widespread school of research and practice that asserts that descriptive metadata is poorly suited for certain types of discovery, also called "Information Retrieval" or IR. In this class we explore these alternative approaches to IR from both the systems-processing perspective and the user-engagement perspective. In doing so we will build on our understanding of information seeking and use models by thinking about new types of discovery.

Suggested readings

1. Mitchell, E. (2015). Chapter 7 in Metadata Standards and Web Services in Libraries, Archives, and Museums. Libraries Unlimited. Santa Barbara, CA.

2. Watch: How Search Works: http://www.youtube.com/watch?v=BNHR6IQJGZs

3. Watch: The evolution of search: http://www.youtube.com/watch?v=mTBShTwCnD4

4. Kernighan, B. (2011). D is for Digital. Chapter 4: Algorithms

5. Read: How Search Works. http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/

6. Pirolli, Peter. (2009). Powers of Ten. http://www.parc.com/content/attachments/powers-of-ten.pdf

Optional readings

7. Read/Skim: Michael Lesk, The Seven Ages of Information Retrieval, Conference for the 50th Anniversary of As We May Think, 1995. http://archive.ifla.org/VI/5/op/udtop5/udtop5.htm


8. Baeza-Yates, Ricardo and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley Longman, 1999, Chapter 1

9. Watch: Kenning Arlitsch talk about SEO in libraries - http://www.oclc.org/en-US/events/2015/CI_SFPL_Feb_2015.html (last talk in the first block of speakers)

10. Explore: http://www.wikidata.org/wiki/Wikidata:Main_Page

11. Read/skim: http://news.cnet.com/8301-17938_105-57619807-1/the-web-at-25-i-was-a-teenage-dial-up-addict/


What is Information Retrieval?

This week we are watching a short video on Google Search, browsing a companion Google Search website, learning about algorithms and finding out more about alternatives to metadata-based search. Let's start by watching the "How Search Works" video (http://www.youtube.com/watch?v=BNHR6IQJGZs) and The Evolution of Search (http://www.youtube.com/watch?v=mTBShTwCnD4), and by browsing the "How Search Works" website: http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/.

Question 1. Using these resources as well as your own Google searches, define the following terms:

a. Computer Index

b. PageRank

c. Algorithm

d. Free-text Search

e. Universal Search


f. Real-time or Instant Search

While indexes exist in database design and are very important for database performance, we did not discuss them in depth in Class 7. It turns out that indexes come in multiple forms and serve multiple goals, but in short they all exist to facilitate access to a large dataset. Indexes accomplish this by taking a slice of data and re-sorting it (e.g. indexing all of the occurrences of a word in a document, or sorting words in alphabetical order). Indexes and approaches to indexing are one of the building blocks of IR. On top of indexes, IR systems use search algorithms to find, process, and display information in ways tailored to the searcher.
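To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is only an illustration of the technique; the tiny document set and variable names are invented for this example and do not come from any particular system.

    from collections import defaultdict

    documents = {
        1: "the quick brown fox jumped over the fence",
        2: "the lazy dog slept under the fence",
    }

    # Build an inverted index: each word maps to the (document, position) pairs where it occurs.
    inverted_index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            inverted_index[word].append((doc_id, position))

    # A lookup now touches only the posting list for the word, not every document.
    print(inverted_index["fence"])  # [(1, 7), (2, 6)]

Because the data has been re-sorted by word rather than by document, the system can answer "which documents contain this term?" without scanning the full text of every document.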

IR is a broad field that includes the entire process of document processing, index creation, algorithm application and document presentation, situated within the context of a search. The following figure shows a sample model of IR and the relationships between resources, the search process and the document presentation process. This process is broken into three broad areas: Predict, Nominate and Choose.

Predict

In the Predict cycle, documents are processed and indexes are created that help an IR system make informed guesses about which documents are related to one another. This can be accomplished via algorithms like PageRank, Term Frequency/Inverse Document Frequency (TF-IDF) or N-Gram indexing (more on these methods later), or by other means. The prediction process is largely a back-office and pre-processing activity (e.g. systems complete this process in anticipation of a search).
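To give one of these predictive measures some shape, here is a minimal TF-IDF sketch in Python. The toy corpus, the function name and the exact weighting formula are illustrative assumptions; real systems use much larger corpora and more refined variants.

    import math

    documents = {
        "doc1": "tom sawyer and huckleberry finn",
        "doc2": "the adventures of tom sawyer",
        "doc3": "life on the mississippi",
    }
    tokenized = {doc_id: text.split() for doc_id, text in documents.items()}
    num_docs = len(tokenized)

    def tf_idf(term, doc_id):
        """Weight a term highly if it is frequent in this document but rare across the corpus."""
        words = tokenized[doc_id]
        tf = words.count(term) / len(words)
        docs_with_term = sum(1 for doc in tokenized.values() if term in doc)
        idf = math.log(num_docs / (1 + docs_with_term))  # +1 avoids division by zero
        return tf * idf

    for doc_id in tokenized:
        print(doc_id, round(tf_idf("mississippi", doc_id), 3))

A term like "mississippi" scores well only for the document in which it is concentrated, which is exactly the kind of signal a system can compute ahead of time, before any search arrives.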

Nominate

The Nominate process is comparable to the search process that we have explored in previous classes. Using a combination of words, images, vocabularies or other search inputs, the system applies algorithms to determine the best match for a user's query. This process likely involves a relevance ranking process in which the system predicts the most relevant documents and pushes them towards the top of the results. Google implements ranking using a number of factors, including personal/social identity (e.g. Google will show results related to you or your friends first), resource ranking with PageRank (e.g. the main website of UMD gets listed first because it is a central hub of links) and timeliness (e.g. using real-time search, Google prioritizes news and other current results).
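As a rough illustration of the link-based ranking idea, here is a tiny PageRank-style power-iteration sketch in Python. The link graph, page names and damping factor are made up for the example and are not Google's actual data or implementation.

    # Toy link graph: each page lists the pages it links to.
    links = {
        "umd.edu": ["cs.umd.edu", "lib.umd.edu"],
        "cs.umd.edu": ["umd.edu"],
        "lib.umd.edu": ["umd.edu"],
        "blog.example": ["umd.edu"],
    }

    damping = 0.85
    pages = list(links)
    rank = {page: 1 / len(pages) for page in pages}

    # Repeatedly redistribute rank along links; pages with many in-links accumulate rank.
    for _ in range(50):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank

    for page, score in sorted(rank.items(), key=lambda item: -item[1]):
        print(page, round(score, 3))

The page that everything links to ends up with the highest score, mirroring the intuition that a central hub of links is treated as a more authoritative result.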

Relevance ranking has evolved quickly over the last twenty years and is increasingly the preferred results display method. At the same time, relevance is not always the best sorting process.

Question 2. Can you think of some areas in which relevance ranking is not the ideal approach to sorting search results?

Choose

In the resource selection process, users are highly engaged with the system: scanning results, engaging in SenseMaking and other information seeking behaviors to evaluate the fit of a resource with their information need, and ultimately selecting documents for use. Here too the information product delivered may vary even when using the same source documents. For example, if we are seeking information about books we want to be presented with a list of texts, while if we are seeking information about an idea and its presence across multiple texts we may want to see a concordance that shows our search terms in context (also known as Keyword in Context or KWIC).
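To make the KWIC idea concrete, here is a minimal concordance sketch in Python. The sample sentence, function name and window size are arbitrary illustrations, not part of any particular system.

    def kwic(text, term, window=3):
        """Print each occurrence of the term with a few words of surrounding context."""
        words = text.split()
        for i, word in enumerate(words):
            if word.lower() == term.lower():
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                print(f"{left:>25} [{word}] {right}")

    kwic("The quick brown fox jumped over the fence while the lazy dog watched the fox", "fox")

Each line of output centers the search term among its neighbors, which is often more useful than a plain list of matching documents when the question is about how an idea is used across texts.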

Structured data vs. full-text or digital object Information Retrieval

In the structured data world, search and retrieval is based on the idea that our data is highly structured, predictable and conforms to well-understood boundaries. For example, if we are supporting search of books and other library materials using traditional library metadata (e.g. MARC), we can expect that our subject headings will conform to LCSH and can look to other metadata fields (e.g. title, author, publication date) to support specific search functions. If, in contrast, we decided to find books based solely on the full content of the text, with no supporting metadata, we would not be able to make such assumptions.
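To make the contrast concrete, here is a small Python sketch comparing a fielded metadata search with a naive full-text search. The records, field names and subject strings are invented for illustration and are not real catalog data.

    records = [
        {"title": "The Adventures of Tom Sawyer", "author": "Twain, Mark",
         "subjects": ["Boys--Fiction", "Mississippi River--Fiction"],
         "text": "tom appeared on the sidewalk with a bucket of whitewash..."},
        {"title": "Life on the Mississippi", "author": "Twain, Mark",
         "subjects": ["Mississippi River--Description and travel"],
         "text": "the mississippi is well worth reading about..."},
    ]

    # Structured search: we can rely on a specific field and a controlled vocabulary.
    fielded = [r["title"] for r in records if "Mississippi River--Fiction" in r["subjects"]]

    # Full-text search: we can only look for the string somewhere in the raw text.
    full_text = [r["title"] for r in records if "mississippi" in r["text"].lower()]

    print(fielded)    # ['The Adventures of Tom Sawyer']
    print(full_text)  # ['Life on the Mississippi']

The two approaches return different books for what feels like the same question, which is exactly the kind of difference to watch for as you work through the systems below.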

Let's build our understanding of different approaches to indexing and IR by exploring several discovery systems, three of them Google-based. Our overarching question is: "Which of Mark Twain's books proved to be most popular over time? How can we measure this popularity? Have these rankings changed over time? Which book is most popular today?" For feasibility we will limit our search to the following books: Innocents Abroad, The Adventures of Huckleberry Finn, The Adventures of Tom Sawyer, A Connecticut Yankee in King Arthur's Court, Roughing It, Letters from the Earth, The Prince and the Pauper and Life on the Mississippi.

In order to answer these questions we are going to explore several information systems. As you explore each information system you should look for answers to these questions and think about what type of index was required to facilitate the search and whether or not the index is based on metadata or "free text." You should also take note of the best index or search engine for this information.

Types of IR systems

Google Search (http://google.com)

The regular Google search interface indexes the web. There is a lot to say about this resource, but I expect we are largely familiar with it. Try a few searches in Google related to each question. A pro-tip for Google: look for the "Search Tools" button at the top of the screen just under the search box. These search tools give you access to some filtering options.

Question 3. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 4. What search terms or strategies proved to be most useful in this database?

Google Books (http://books.google.com)

Google Books is an index created by a large-scale scanning and metadata harvesting operation initiated by Google in the early 2000s (http://www.google.com/googlebooks/library/). Google Books indexes both metadata (e.g. title, author, publication date) and the full text of books. It uses a page-preview approach to show users where in each book their search terms occur.


Question 5. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 6. What search terms or strategies proved to be most useful in this database?

HathiTrust (http://catalog.hathitrust.org)

The HathiTrust is a library-run cooperative organization that shares all of the scanned books and OCR data from Google. The main objective of HathiTrust is to provide an archive of scanned books for libraries. One product of this archive is a searchable faceted-index discovery system. In some cases (e.g. when a book is out of copyright) the digital full text is made available.

Question 7. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 8. What search terms or strategies proved to be most useful in this database?

GoodReads (http://www.goodreads.com/search)

GoodReads is a social book cataloging and reading platform. GoodReads aggregates bibliographic metadata and social recommendations by readers and serves as both a resource discovery and community engagement platform.

Question 9. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 10. What search terms or strategies proved to be most useful in this database?


Google Ngram Viewer (https://books.google.com/ngrams/)

The Google Ngram Viewer is a specialized slice, or index, of the Google Books project. An n-gram is an index structure that refers to a combination of words that are related by their proximity to one another. The letter "N" refers to a variable that is any whole number (e.g. 1, 2, 3). N-grams are often referred to according to the number of words that are indexed together. For example, in the sentence "The quick brown fox jumped over the fence" an index of two-word combinations or "bi-grams" would include "The quick," "quick brown," "brown fox," "fox jumped" and so forth. Tri-grams are indexes of three words together (e.g. "The quick brown"). N-gram indexes are a new take on phrase searching as applied to full-text resources at a large scale.
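Here is a minimal sketch of n-gram generation in Python; the sample sentence and function name are simply illustrative.

    def ngrams(text, n):
        """Return every run of n consecutive words from the text."""
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    sentence = "The quick brown fox jumped over the fence"
    print(ngrams(sentence, 2))  # bi-grams:  ['The quick', 'quick brown', 'brown fox', ...]
    print(ngrams(sentence, 3))  # tri-grams: ['The quick brown', 'quick brown fox', ...]

The Ngram Viewer is built on indexes of exactly these word runs, computed across the Google Books corpus and tallied by year of publication.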

Step 1: Searching the Google Ngram Viewer can be conceptually somewhat difficult, so I recommend you follow the short tutorial below:

a. Go to the Google Ngram Viewer (https://books.google.com/ngrams/).

b. In the Search box type (without the quotes) "Adventures of Huckleberry Finn, Adventures of Tom Sawyer."

c. Check the case-insensitive box and click the "Search lots of books" button.

d. You will see a graph displayed (see below) that shows the relative occurrence of these n-grams across the entire corpus of books.

e. You should notice that we searched for quad-grams (e.g. 4-word phrases), but you can mix and match n-grams in a single search. You should also notice that we separate our n-grams with commas. One technical detail: the maximum phrase length you can search for is 5 words, so you may need to keep this in mind as you search Google.

Question 11. What type of index or IR system is most prevalent in this discovery environment (e.g. metadata or full-text based)?

Question 12. What search terms or strategies proved to be most useful in this database?


Searching and finding

Using these indexing systems, try your hand at answering our questions. Don't be shy about looking for documentation or other sites!


For each of the questions below, note the type of index involved (e.g. free-text / metadata), the best search and resource to answer the question, and your findings.

Which of Mark Twain's books proved to be most popular over time?

How are rankings of popularity different (e.g. what do they measure, what data sources do they use)?

Which book is most popular today?

Where can you get an electronic copy of each book?


Evaluating search results: Precision vs. Recall

In deciding which systems worked best for the questions we were asking, you likely made qualitative judgments. You may have decided that a system was not useful based on the initial page of results you looked at, or you may have ultimately found that specialized or unique search strategies helped you identify better results. This process of evaluating relevance is often expressed as "precision vs. recall" in the IR community.

Broadly stated, precision is related to whether or not the result you wanted was retrieved as part of a search process. In other words, precision helps you ask the question "How much of what was found is relevant?" An example of a high-precision search is a known-item search in an online catalog by title. In this case you know the title of the book and the index to use (e.g. the title index). The search results are highly precise - the catalog either has the book or it does not. In a well-structured and perfect search environment, high precision probably best fulfills your information need.

In contrast to precision, recall pertains to how many of the relevant documents in a collection were retrieved during a search. The Google web search is an example of a high-recall search: result sets often contain tens of thousands of results! Where precision asks how much of what was found is relevant, recall asks the question "How much of what was relevant was found?" High recall helps users find the best resource that fits a fuzzy information need. A good example of this is a search for a website or product on the web where you may remember qualities of the product (like its function, color or price) but not its name. In a fuzzy search world where we do not always know what we are looking for, high recall is most likely preferred over high precision, as we want to expand the number of records we look at.

Precision and recall can be thought of in terms of two intersecting sets of documents: the set of relevant documents in an index and the set of documents retrieved by a search. The intersection of these two sets represents the relevant documents that were retrieved. Precision and recall are each presented as a ratio with a minimum value of 0 and a maximum value of 1. This means that we can also think about precision and recall in terms of a percentage (e.g. 100%).


Precision and recall can be expressed mathematically as:

1. Precision = # of relevant records retrieved / (# relevant retrieved + # irrelevant records retrieved)

2. Recall = # of relevant records retrieved / (# relevant retrieved + # relevant not retrieved)
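As a quick way to check these formulas, here is a small Python sketch; the counts used in the example call are arbitrary and are not the answers to the questions below.

    def precision(relevant_retrieved, irrelevant_retrieved):
        """How much of what was found is relevant?"""
        return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)

    def recall(relevant_retrieved, relevant_not_retrieved):
        """How much of what was relevant was found?"""
        return relevant_retrieved / (relevant_retrieved + relevant_not_retrieved)

    # Arbitrary example: 8 relevant and 12 irrelevant documents retrieved,
    # while 2 relevant documents were missed.
    print(precision(8, 12))  # 0.4
    print(recall(8, 2))      # 0.8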

Ideally, information systems are high recall (e.g. all relevant results retrieved) as well as high precision (e.g. a high ratio of relevant to irrelevant results). In the real world this is difficult if not impossible. In practice, as precision approaches 1 (or 100%), recall tends to approach 0. Conversely, as recall approaches 1 (or 100%), precision tends to approach 0.

Question 13. You have an index containing 100 documents, 10 of which are relevant to a given search and 90 of which are not. Your search produces 5 good documents and 30 bad documents. Calculate the precision and recall ratios for this search.


Question 14. Suppose you tweaked your indexing or your search and managed to retrieve all 10 relevant documents but at the same time returned 50 irrelevant documents. Calculate your precision and recall.

Question 15. Assuming that you would rather return all of the relevant documents than miss any, what techniques might your IR system need to implement to make the results more useful?

Recall and precision are just two of the measures used in system evaluation. In addition there are a number of affective measures, such as user satisfaction or happiness, and user-generated measures, such as rate of re-use, the number of searches needed to locate a resource, or user judgments about the "best" resource.

Summary

In this class we explored types of indexing and information retrieval systems as we considered the differences between metadata-based, free-text and social/real-time information retrieval systems. We learned about a key pair of measures in IR - precision and recall - and became acquainted with how to calculate both. In doing so we just scratched the surface of IR. If you were intrigued by some of the search features in Google, you may want to try out their "PowerSearching" course at http://www.google.com/insidesearch/landing/powersearching.html. If other aspects of IR intrigued you, I encourage you to explore the optional readings for this week and explore more information retrieval systems.
