
September 7, 2000 Information Organization and Retrieval

Introduction to Information Retrieval

Ray Larson & Marti Hearst

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Search and Retrieval
Outline of Part I of SIMS 202

• The Search Process
• Information Retrieval Models
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
  – Precision/Recall
  – Relevance
  – User Studies
• System and Implementation Issues
• Web-Specific Issues
• User Interface Issues
• Special Kinds of Search

Today

• Finding Out About

• Introduction to Information Retrieval

• Introduction to the Boolean Model and Boolean Queries

Finding Out About
(This discussion is drawn from R. Belew’s manuscript)

• Three phases:
  – Asking of a question (the Information Need)
  – Construction of an answer (IR proper)
  – Assessment of the answer (Evaluation)
• Part of an iterative process

Question Asking

• Person asking = “user”
  – In a frame of mind, a cognitive state
  – Aware of a gap in their knowledge
  – May not be able to fully define this gap
• Paradox of FOA:
  – If user knew the question to ask, there would often be no work to do.
  – “The need to describe that which you do not know in order to find it” Roland Hjerppe
• Query
  – External expression of this ill-defined state

Question Answering

• Say question answerer is human.
  – Can they translate the user’s ill-defined question into a better one?
  – Do they know the answer themselves?
  – Are they able to verbalize this answer?
  – Will the user understand this verbalization?
  – Can they provide the needed background?
• Say answerer is a computer system.

Assessing the Answer

• How well does it answer the question?
  – Complete answer? Partial?
  – Background Information?
  – Hints for further exploration?
• How relevant is it to the user?

IR is an Iterative Process

[Figure: diagram with elements labeled Repositories, Workspace, and Goals]

IR is a Dialog

– The exchange doesn’t end with the first answer
– User can recognize elements of a useful answer
– Questions and understanding change as the process continues.

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

[Figure: the searcher’s path through a sequence of queries Q0, Q1, Q2, Q3, Q4, Q5]

Berry-picking model (cont.)

• The query is continually shifting
• New information may yield new ideas and new directions
• The information need
  – is not satisfied by a single, final retrieved set
  – is satisfied by a series of selections and bits of information found along the way.

Information Seeking Behavior

• Two parts of the process:
  – search and retrieval
  – analysis and synthesis of search results

Search Tactics and Strategies

• Search Tactics
  – Bates 79
• Search Strategies
  – Bates 89
  – O’Day and Jeffries 93

Tactics vs. Strategies

• Tactic: short term goals and maneuvers
  – operators, actions
• Strategy: overall planning
  – link a sequence of operators together to achieve some end

• Later in the course:
  – Search Process and Strategies
  – User interfaces to improve IR process
  – Incorporation of Content Analysis into better systems

Restricted Form of the IR Problem

• The system has available only pre-existing, “canned” text passages.

• Its response is limited to selecting from these passages and presenting them to the user.

• It must select, say, 10 or 20 passages out of millions or billions!

Information Retrieval

• Revised Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries.

• This set of assumptions underlies the field of Information Retrieval.

Some IR History

– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from mid 1950’s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed (’60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical advances (’70s)
  • Refinements and Advances in application (’80s)
  • User Interfaces, Large-scale testing and application (’90s)

Structure of an IR System

[Figure: Information Storage and Retrieval System. Adapted from Soergel, p. 19]

• Search Line: Interest profiles & Queries → Formulating query in terms of descriptors → Storage of profiles → Store1: Profiles/Search requests
• Storage Line: Documents & data → Indexing (Descriptive and Subject) → Storage of Documents → Store2: Document representations
• Comparison/Matching → Potentially Relevant Documents
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)

Relevance (introduction)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely.
    • Who is buried in Grant’s tomb? Grant.
  – Partially answer question.
    • Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
    • What is lymphedema? Look in this Medical Dictionary.
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...

Query Languages

• A way to express the question (information need)
• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)

Simple query language: Boolean

– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT

Boolean Queries

• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
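
One way to see what these queries retrieve is to treat each term as the set of documents that contain it, so that AND becomes set intersection and OR becomes set union. The sketch below is only an illustration under assumptions: the toy collection, the document IDs, and the helper names `build_index` and `t` are invented here and do not come from the course materials.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> text (invented for illustration).
DOCS = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "collar leash",
    5: "cat dog collar leash",
    6: "bird cage",
}


def build_index(docs):
    """Map each lowercased term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


INDEX = build_index(DOCS)


def t(term):
    """Posting set for a term (empty set if the term never occurs)."""
    return INDEX.get(term.lower(), set())


# AND is intersection (&); OR is union (|).
cat_or_dog = t("Cat") | t("Dog")
cat_and_dog = t("Cat") & t("Dog")
q5 = (t("Cat") & t("Dog")) | t("Collar")
q6 = (t("Cat") & t("Dog")) | (t("Collar") & t("Leash"))
q7 = (t("Cat") | t("Dog")) & (t("Collar") | t("Leash"))

for name, result in [
    ("Cat OR Dog", cat_or_dog),
    ("Cat AND Dog", cat_and_dog),
    ("(Cat AND Dog) OR Collar", q5),
    ("(Cat AND Dog) OR (Collar AND Leash)", q6),
    ("(Cat OR Dog) AND (Collar OR Leash)", q7),
]:
    print(f"{name}: {sorted(result)}")
```

Moving the parentheses changes the answer: on this toy data the last two queries return different document sets even though they use the same four terms.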

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:
    • Cat x x x x
    • Dog x x x x x
    • Collar x x x x
    • Leash x x x x

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:
    • Cat x x
    • Dog x x
    • Collar x x
    • Leash x x
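
The working and non-working combinations for (Cat OR Dog) AND (Collar OR Leash) can be enumerated mechanically. The following is a minimal sketch that walks all 16 presence/absence patterns of the four terms; the function name `matches` and the list names are chosen here for illustration.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")


def matches(cat, dog, collar, leash):
    """(Cat OR Dog) AND (Collar OR Leash) for one pattern of term presence."""
    return (cat or dog) and (collar or leash)


working, failing = [], []
for combo in product([False, True], repeat=len(TERMS)):
    # Names of the terms that are present in this pattern.
    present = tuple(term for term, flag in zip(TERMS, combo) if flag)
    (working if matches(*combo) else failing).append(present)

print(f"{len(working)} combinations satisfy the query:")
for present in working:
    print("  ", " + ".join(present))

print(f"{len(failing)} combinations do not:")
for present in failing:
    print("  ", " + ".join(present) or "(no terms)")
```

A document needs at least one term from each parenthesized group, so a document with only Cat and Dog fails while one with only Dog and Leash succeeds.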

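Boolean Logic

[Figure: Venn diagrams of two sets A and B, with shaded regions C defined in terms of A and B]

DeMorgan’s Law:

$$\overline{A \cap B} = \overline{A} \cup \overline{B}$$

$$\overline{A \cup B} = \overline{A} \cap \overline{B}$$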

Boolean Queries

– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
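
These rules can be checked by brute force over every truth assignment. The sketch below assumes a nested-tuple representation of queries, which is a choice made here for illustration and not a format used by any particular IR system: NOT takes exactly one operand, while AND and OR accept any number.

```python
from itertools import product


def evaluate(expr, env):
    """Evaluate a nested-tuple Boolean expression against term truth values.

    Expression forms (a hypothetical representation for this sketch):
      "a"                  -> a term name looked up in env
      ("NOT", x)           -> unary prefix operator
      ("AND", x, y, ...)   -> n-ary conjunction
      ("OR", x, y, ...)    -> n-ary disjunction
    """
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    if op == "NOT":
        (arg,) = args  # NOT takes exactly one operand
        return not evaluate(arg, env)
    if op == "AND":
        return all(evaluate(a, env) for a in args)
    if op == "OR":
        return any(evaluate(a, env) for a in args)
    raise ValueError(f"unknown operator: {op!r}")


def equivalent(left, right, names):
    """True if two expressions agree under every assignment to the names."""
    for values in product([False, True], repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(left, env) != evaluate(right, env):
            return False
    return True


# The three rules listed above.
rule1 = equivalent(("AND", ("NOT", "a"), ("NOT", "b")),
                   ("NOT", ("OR", "a", "b")), ["a", "b"])
rule2 = equivalent(("OR", ("NOT", "a"), ("NOT", "b")),
                   ("NOT", ("AND", "a", "b")), ["a", "b"])
rule3 = equivalent(("NOT", ("NOT", "a")), "a", ["a"])
print(rule1, rule2, rule3)

# ((a AND b) OR (c AND (NOT b))) for one document's term presence.
query = ("OR", ("AND", "a", "b"), ("AND", "c", ("NOT", "b")))
print(evaluate(query, {"a": True, "b": False, "c": True}))
```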

Boolean Logic

[Figure: Venn diagram of three terms t1, t2, t3 over documents D1 to D11, divided into eight regions m1 to m8. Each region is a minterm: a conjunction of t1, t2, and t3 in which each term appears either plain or negated.]
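
With three terms there are 2^3 = 8 possible presence/absence patterns, and every document falls into exactly one of them. The sketch below groups documents by pattern and then answers a query as a union of regions. The document-to-term assignments are hypothetical (they are not the slide's data), and the regions are keyed by their patterns without any claim about which pattern the slide numbers m1, m2, and so on.

```python
from itertools import product

TERMS = ("t1", "t2", "t3")

# Hypothetical term assignments for a few documents (not the slide's data).
DOC_TERMS = {
    "D1": {"t1"},
    "D2": {"t1", "t2"},
    "D3": {"t1", "t2", "t3"},
    "D4": {"t2", "t3"},
    "D5": set(),
    "D6": {"t1", "t2"},
}


def minterm_of(doc_terms):
    """Presence pattern for (t1, t2, t3) as a tuple of booleans."""
    return tuple(term in doc_terms for term in TERMS)


def label(pattern):
    """Readable form of a pattern, e.g. 't1 AND t2 AND NOT t3'."""
    return " AND ".join(
        term if present else f"NOT {term}"
        for term, present in zip(TERMS, pattern)
    )


# All eight regions; every document lands in exactly one of them.
regions = {pattern: [] for pattern in product([True, False], repeat=len(TERMS))}
for doc_id, terms in DOC_TERMS.items():
    regions[minterm_of(terms)].append(doc_id)

for pattern, docs in regions.items():
    print(f"{label(pattern)}: {docs}")

# Any Boolean query is a union of regions. Here: t1 AND t2.
answer = sorted(
    doc
    for pattern, docs in regions.items()
    if pattern[0] and pattern[1]
    for doc in docs
)
print("t1 AND t2 ->", answer)
```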

Boolean Searching

“Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Figure: Venn diagram of four overlapping sets: Cracks, Beams, Width measurement, Prestressed concrete]

Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)
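
The relaxed query accepts any three of the four concepts. A minimal sketch of how such a query could be generated for "any k of n" terms follows; the helper names `relaxed_query` and `matches_relaxed` are hypothetical, and the clauses may come out in a different order than on the slide.

```python
from itertools import combinations

# C = Cracks, B = Beams, W = Width_measurement, P = Prestressed_concrete
CONCEPTS = ["C", "B", "W", "P"]


def relaxed_query(terms, k):
    """Build query text that requires any k of the given terms."""
    clauses = ["(" + " AND ".join(group) + ")" for group in combinations(terms, k)]
    return " OR ".join(clauses)


def matches_relaxed(doc_terms, terms, k):
    """Equivalent test: the document contains at least k of the terms."""
    return sum(term in doc_terms for term in terms) >= k


print(relaxed_query(CONCEPTS, 3))
print(matches_relaxed({"C", "B", "P"}, CONCEPTS, 3))
print(matches_relaxed({"C", "B"}, CONCEPTS, 3))
```

With k equal to the number of terms, the same function reproduces the strict formal query as a single AND clause.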

Pseudo-Boolean Queries

• A new notation, from web search
  – +cat dog +collar leash
• Does not mean the same thing!
• Need a way to group combinations.
• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”

[Figure: IR process diagram with boxes labeled Information need, text input, Parse, Query, Collections, Pre-process, Index, and Rank]
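
To illustrate why `+cat dog +collar leash` differs from a grouped Boolean query, here is a sketch under one common reading of the notation: terms marked with `+` are required, and unmarked terms are optional and only influence ordering. That reading, the scoring rule, and the toy documents are assumptions; real web engines varied, and quoted phrases are not handled here.

```python
def parse_plus_query(query):
    """Split a query like '+cat dog +collar leash' into (required, optional)."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional


def search(query, docs):
    """IDs of documents with every required term, most optional matches first."""
    required, optional = parse_plus_query(query)
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(term in words for term in required):
            score = sum(term in words for term in optional)
            hits.append((-score, doc_id))  # negative score sorts best first
    return [doc_id for _, doc_id in sorted(hits)]


DOCS = {  # hypothetical documents
    1: "cat collar",
    2: "cat dog collar leash",
    3: "dog leash",
    4: "cat collar leash",
}

print(parse_plus_query("+cat dog +collar leash"))
print(search("+cat dog +collar leash", DOCS))
```

Document 3 ("dog leash") would satisfy (Cat OR Dog) AND (Collar OR Leash), but it is excluded here because it lacks the required terms.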

Result Sets

• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
• Example: Dialog query
  • (Redford AND Newman)
  • -> S1 1450 documents
  • (S1 AND Sundance)
  • -> S2 898 documents

[Figure: IR process diagram with boxes labeled Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Reformulated Query, and Re-Rank]
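
The numbered result sets in the Dialog example can be modeled as saved sets that later queries refer to by name. This is a sketch only: the `SearchSession` class and its methods are invented for illustration and are not Dialog's actual interface, and the postings are toy data rather than the counts quoted above.

```python
class SearchSession:
    """Keeps numbered result sets (S1, S2, ...) so later queries can reuse them."""

    def __init__(self, index):
        self.index = index  # term -> set of document IDs
        self.sets = {}      # "S1" -> set of document IDs

    def lookup(self, name):
        """Resolve a saved set name such as 'S1', or else an index term."""
        if name in self.sets:
            return self.sets[name]
        return self.index.get(name.lower(), set())

    def run_and(self, *names):
        """AND together terms and/or saved sets; save and return the new set."""
        result = set.intersection(*(set(self.lookup(n)) for n in names))
        label = f"S{len(self.sets) + 1}"
        self.sets[label] = result
        return label, result


# Hypothetical postings (the real Dialog counts are not reproduced here).
INDEX = {
    "redford": {1, 2, 3, 4},
    "newman": {2, 3, 4, 5},
    "sundance": {3, 4, 6},
}

session = SearchSession(INDEX)
s1_label, s1 = session.run_and("Redford", "Newman")
s2_label, s2 = session.run_and("S1", "Sundance")
print(s1_label, sorted(s1))
print(s2_label, sorted(s2))

# Running the reformulated query on the entire collection gives the same set.
whole = INDEX["redford"] & INDEX["newman"] & INDEX["sundance"]
print(whole == s2)
```

For pure AND refinement the two choices return the same documents; searching within the result set just avoids repeating work on the whole collection.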

Ordering of Retrieved Documents

• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?
• Fancier methods have been investigated
  – p-norm is most famous
    • usually impractical to implement
    • usually hard for user to understand
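
The question of "one of each term or many of one term" shows up directly when two simple orderings are compared. The sketch below contrasts ordering by total hits with ordering by the number of distinct query terms matched. Both scoring functions and the toy documents are illustrative assumptions, and neither is the p-norm method.

```python
from collections import Counter


def hit_counts(text, query_terms):
    """Number of occurrences of each query term in the text."""
    words = Counter(text.lower().split())
    return {term: words[term] for term in query_terms}


def total_hits(text, query_terms):
    """Total occurrences of all query terms."""
    return sum(hit_counts(text, query_terms).values())


def distinct_terms_matched(text, query_terms):
    """How many different query terms occur at least once."""
    return sum(1 for count in hit_counts(text, query_terms).values() if count > 0)


DOCS = {  # hypothetical documents
    "A": "cat cat cat cat cat",
    "B": "cat dog collar",
    "C": "dog dog collar",
}
QUERY = ["cat", "dog", "collar"]

# Ties are broken by document ID so the order is deterministic.
by_total = sorted(DOCS, key=lambda d: (-total_hits(DOCS[d], QUERY), d))
by_coverage = sorted(
    DOCS,
    key=lambda d: (
        -distinct_terms_matched(DOCS[d], QUERY),
        -total_hits(DOCS[d], QUERY),
        d,
    ),
)
print("by total hits:    ", by_total)
print("by terms matched: ", by_coverage)
```

Document A repeats a single term and tops the total-hits ordering, yet it drops to last place once coverage of the query terms is what counts.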

Boolean

• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial systems until the WWW

Faceted Boolean Query

• Strategy: break query into facets (polysemous with earlier meaning of facets)
  – conjunction of disjunctions
      (a1 OR a2 OR a3) AND
      (b1 OR b2) AND
      (c1 OR c2 OR c3 OR c4)
  – each facet expresses a topic
      (“rain forest” OR jungle OR amazon) AND
      (medicine OR remedy OR cure) AND
      (Smith OR Zhou)

Faceted Boolean Query

• Query still fails if one facet missing
• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match
• Problem: Facets still undifferentiated
• Alternative: assign weights to facets
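
Coordination level ranking can be sketched as counting how many facets a document satisfies, with optional per-facet weights for the weighted alternative. Everything below is illustrative: the documents and weights are made up, and facet terms are matched by plain substring search so that a phrase such as "rain forest" works, which would also match inside longer words in real text.

```python
# Facets from the rain forest example, lowercased for matching.
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]


def facet_satisfied(text, facet):
    """A facet (a disjunction) is satisfied if any of its terms occurs."""
    lowered = text.lower()
    return any(term in lowered for term in facet)


def coordination_level(text, facets, weights=None):
    """Sum of weights of satisfied facets (each weight defaults to 1)."""
    weights = weights or [1] * len(facets)
    return sum(w for facet, w in zip(facets, weights) if facet_satisfied(text, facet))


def rank(docs, facets, weights=None):
    """(score, doc_id) pairs for documents matching any facet, best first."""
    scored = [
        (coordination_level(text, facets, weights), doc_id)
        for doc_id, text in docs.items()
    ]
    return sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: (-pair[0], pair[1]),
    )


DOCS = {  # hypothetical documents
    1: "Smith studies jungle remedy plants",
    2: "A cure found in the Amazon",
    3: "Zhou on urban planning",
    4: "Gardening tips",
}

print(rank(DOCS, FACETS))             # every facet counts equally
print(rank(DOCS, FACETS, [1, 1, 3]))  # author facet weighted more heavily
```

The strict faceted query would return only document 1. The ranked version still puts it first, but it also surfaces the partial matches instead of failing on a missing facet.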

Proximity Searches

• Proximity: terms occur within K positions of one another
  – pen w/5 paper
• A “Near” function can be more vague
  – near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
• Phrase Variants
  – “retrieval of information” “information retrieval”
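
Proximity and phrase matching both depend on knowing where each word occurs. The sketch below records word positions and tests whether two terms fall within K positions of each other, optionally in a fixed order. The function names are hypothetical, and "within K positions" is read here as a position difference of at most K; actual systems define operators like `w/5` in their own ways.

```python
from collections import defaultdict


def positions(text):
    """Map each word to the list of 0-based positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return index


def within(text, first, second, k, ordered=False):
    """True if the two terms occur within k positions of one another.

    With ordered=True, `first` must come before `second`.
    """
    index = positions(text)
    for p in index.get(first, []):
        for q in index.get(second, []):
            gap = q - p
            if ordered and 0 < gap <= k:
                return True
            if not ordered and 0 < abs(gap) <= k:
                return True
    return False


def phrase(text, words):
    """True if the words occur consecutively and in order."""
    tokens = text.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == list(words) for i in range(len(tokens) - n + 1))


print(within("the pen is next to the paper", "pen", "paper", 5))
print(within("the pen is next to the paper", "pen", "paper", 4))
print(within("paper under the pen", "pen", "paper", 5, ordered=True))
print(phrase("the united nations met", ["united", "nations"]))
```

Phrase variants such as "retrieval of information" and "information retrieval" fail an exact phrase test against each other but pass an unordered proximity test with a small K.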

Filters

• Filters: Reduce set of candidate docs
• Often specified simultaneously with query
• Usually restrictions on metadata
  – restrict by:
    • date range
    • internet domain (.edu .com .berkeley.edu)
    • author
    • size
    • limit number of documents returned
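
A filter of this kind can be sketched as a pass over candidate documents that checks metadata only and never looks at the query terms. The `Doc` record, its field names, and the sample records below are all hypothetical and chosen just for this illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: int
    author: str
    domain: str       # host name the document came from
    published: date
    size_bytes: int


def apply_filters(docs, *, start=None, end=None, domain_suffix=None,
                  author=None, max_size=None, limit=None):
    """Keep documents that pass every filter that was supplied."""
    kept = []
    for doc in docs:
        if start is not None and doc.published < start:
            continue
        if end is not None and doc.published > end:
            continue
        if domain_suffix is not None and not doc.domain.endswith(domain_suffix):
            continue
        if author is not None and doc.author != author:
            continue
        if max_size is not None and doc.size_bytes > max_size:
            continue
        kept.append(doc)
    return kept[:limit]  # limit=None keeps everything


CANDIDATES = [  # hypothetical records
    Doc(1, "lee", "sims.berkeley.edu", date(2000, 9, 7), 1200),
    Doc(2, "kim", "sims.berkeley.edu", date(1999, 5, 1), 800),
    Doc(3, "park", "example.com", date(2000, 1, 15), 5000),
    Doc(4, "lee", "cornell.edu", date(2000, 3, 2), 300),
]

recent_edu = apply_filters(CANDIDATES, start=date(2000, 1, 1), domain_suffix=".edu")
print([doc.doc_id for doc in recent_edu])

first_only = apply_filters(CANDIDATES, author="lee", limit=1)
print([doc.doc_id for doc in first_only])
```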

Next Week

• Statistical Properties of Text

• Preparing information for search: Lexical analysis

• Introduction to the Vector Space model of IR.