The Structure of Information Retrieval Systems LBSC 708A/CMSC 838L Douglas W. Oard and Philip Resnik Session 1: September 4, 2001


The Structure of Information Retrieval Systems

LBSC 708A/CMSC 838L

Douglas W. Oard and Philip Resnik

Session 1: September 4, 2001

Agenda

• 5:30-6:00 Introductions

• 6:00-6:15 What is “Information Retrieval?”

• 6:15-6:40 Applications

6:40-6:55 (Break)

• 6:55-7:20 Applications

• 7:20-7:50 System design

7:50-7:55 (Stretch break)

• 7:55-8:15 Course outline

Introductions

• Pair up with a partner from another department

• Get to know each other in 5 minutes (total)

• Tell us about your partner in 30 seconds
  – Name
  – Degree program (department, Master’s/Ph.D./Visitor)
  – Information retrieval experience
  – One thing they would like to learn

What Do We Mean by “Information?”

• How is it different from “data”?
  – Information is data in context
    • Databases contain data and produce information
    • IR systems contain and provide information

• How is it different from “knowledge”?
  – Knowledge is a basis for making decisions
    • Many “knowledge bases” contain decision rules

What Do We Mean by “Retrieval?”

• Find something that you want
  – The information need may or may not be explicit

• Known item search
  – Find the class home page

• Answer seeking
  – Is Lexington or Louisville the capital of Kentucky?

• Directed exploration
  – Who makes videoconferencing systems?

Global Internet User Population

[Figure: global Internet user population by language, 2000 vs. 2005; labeled languages include English and Chinese. Source: Global Reach]

IR is More Than Web Searching!

• Form into four groups to discuss:

1: A system to search a collection of oral history interviews

2: Construction of a personalized newspaper

3: Software to find music recordings in an online CD store

4: Searching all the Xerox copies ever made at an office

What To Do

• Form 2-3 person subgroups to discuss: (10 min)
  – How would you describe what to search for?
    • What makes one object “better” than another?
  – How would you recognize when you have found it?
    • How would you explain the way you made a choice?
  – What kind of technology might be helpful?
    • Speech recognition, optical character recognition, …

• Get together to discuss what you learned (10 min)
  – Build a 5-minute PowerPoint presentation

Supporting the Search Process

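[Figure: search process flow diagram. Stages: Source Selection → Query Formulation → (Query) → Search, inside the IR System → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Additional labels: Nominate, Choose, Predict.]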

Supporting the Search Process

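[Figure: search process flow diagram. Stages: Source Selection → Query Formulation → (Query) → Search, inside the IR System → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery. The IR System side adds Acquisition → (Collection) → Indexing → (Index), which feeds Search.]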
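The system side of the diagram (Acquisition → Collection → Indexing → Index → Search → Ranked List) can be sketched in a few lines. This is only an illustrative toy, not the course's implementation: the three documents, the function names, and the term-overlap scoring are all assumptions made for the example.

```python
from collections import defaultdict


def acquire():
    """Acquisition: gather a collection (here, a hypothetical in-memory one)."""
    return {
        "d1": "oral history interviews with veterans",
        "d2": "music recordings in an online store",
        "d3": "history of music recordings",
    }


def build_index(collection):
    """Indexing: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search: rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable ranked list.
    return sorted(scores, key=lambda d: (-scores[d], d))


collection = acquire()
index = build_index(collection)
ranked_list = search(index, "music history")
print(ranked_list)
```

Selection, examination, and delivery stay with the user in this sketch; the system only nominates a ranked list.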

Design Strategies

• Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses

• Divide-and-conquer
  – Divide task into stages with well-defined interfaces
  – Continue dividing until problems are easily solved

• Co-design related components
  – Iterative process of joint optimization

Human-Machine Synergy

• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time

• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”

• Both are pretty bad at:
  – Mapping consistently between words and concepts

Divide and Conquer

• Strategy: use encapsulation to limit complexity

• Approach:
  – Define interfaces (input and output) for each component
    • Query interface: input terms, output representation
  – Define the functions performed by each component
    • Remove common words, weight rare terms higher, …
  – Repeat the process within components as needed

• Result: a hierarchical decomposition
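As a rough sketch of this decomposition, the query component below takes raw terms as input and returns a weighted representation as output, built from two smaller functions (remove common words, weight rare terms higher). The stopword list, the document frequencies, the function names, and the choice of log(N/df) as the weight are all assumptions for illustration.

```python
import math

# Hypothetical stopword list; a real system would use a much longer one.
STOPWORDS = {"the", "of", "a", "in", "and", "to"}


def tokenize(text):
    """Split raw query text into lowercase terms."""
    return text.lower().split()


def remove_common_words(terms):
    """Drop terms that appear in the stopword list."""
    return [t for t in terms if t not in STOPWORDS]


def idf_weights(terms, doc_freq, num_docs):
    """Weight rare terms higher via log(N / df), one common IDF form.

    Terms missing from doc_freq are treated as df = 1 (an assumption).
    """
    return {t: math.log(num_docs / doc_freq.get(t, 1)) for t in terms}


def query_interface(text, doc_freq, num_docs):
    """Input: raw query terms. Output: a weighted-term representation."""
    terms = remove_common_words(tokenize(text))
    return idf_weights(terms, doc_freq, num_docs)


# Hypothetical document frequencies in a 100-document collection.
doc_freq = {"capital": 10, "kentucky": 2}
representation = query_interface("The capital of Kentucky", doc_freq, num_docs=100)
print(representation)
```

Each helper can be replaced or subdivided without touching the others, as long as the input and output of `query_interface` stay the same.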

Co-design

• Design of one component may affect another
  – Effect may be direct or indirect

• Approach:
  – Develop alternatives for each interacting component
  – Assess the effects of each practical combination
    • on efficiency, effectiveness, and usability
  – Repeat the process until a suitable combination is found
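One way to picture "assess each practical combination" is to enumerate the alternatives for two interacting components and measure every pairing. The toy below is a sketch under heavy assumptions: the two documents, the query, the tokenizer and stoplist alternatives, and the two proxies (index size for efficiency, "relevant document ranked first" for effectiveness) are all hypothetical, and usability is not measured at all.

```python
from itertools import product

DOCS = {
    "d1": "The Class Home Page",
    "d2": "Home of the Oral History Project",
}
QUERY, RELEVANT = "class home page", "d1"

# Alternatives for two interacting components (all hypothetical).
TOKENIZERS = {
    "case-sensitive": lambda text: text.split(),
    "lowercased": lambda text: text.lower().split(),
}
STOPLISTS = {
    "no stoplist": frozenset(),
    "small stoplist": frozenset({"the", "of", "is", "in"}),
}


def assess(tokenize, stoplist):
    """Return (index size, whether the relevant document is ranked first)."""
    def terms(text):
        return {t for t in tokenize(text) if t not in stoplist}

    doc_terms = {doc_id: terms(text) for doc_id, text in DOCS.items()}
    index_size = sum(len(ts) for ts in doc_terms.values())  # efficiency proxy
    query_terms = terms(QUERY)
    overlap = {d: len(query_terms & ts) for d, ts in doc_terms.items()}
    best = max(sorted(overlap), key=overlap.get)
    found = overlap[best] > 0 and best == RELEVANT  # effectiveness proxy
    return index_size, found


results = {
    (tok_name, stop_name): assess(tok, stop)
    for (tok_name, tok), (stop_name, stop) in product(
        TOKENIZERS.items(), STOPLISTS.items()
    )
}
for combo, (size, found) in results.items():
    print(combo, "index size:", size, "relevant doc first:", found)
```

The interaction is indirect: the lowercase stoplist cannot remove "The" when the tokenizer preserves case, so the stoplist's benefit depends on the tokenizer chosen alongside it.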

Some Examples of Codesign in IR

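[Figure: search process flow diagram. Stages: Source Selection → Query Formulation → (Query) → Search, inside the IR System → (Ranked List) → Selection → (Document) → Examination → (Document) → Delivery. The IR System side adds Acquisition → (Collection) → Indexing → (Index), which feeds Search.]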

Course Goals

• Appreciate IR system capabilities and limitations

• Understand IR system design & implementation
  – For a broad range of applications and media

• Evaluate IR system performance

• Identify current IR research problems

Course Design

• Text/readings provide background and detail
  – At least one recommended reading is required

• Class provides organization and direction
  – We will not cover every important detail

• Assignments and project provide experience
  – The TA can help CLIS students with the project

• Final exam helps focus your effort

Grading

• Assignments (30% total)
  – Mastery of concepts and experience using tools
  – 708A: “homework,” 838L: “programming”

• Term project (30%)
  – 3 options, described on course Web page

• Final exam (40%)
  – Two different in-class exams

Handy Things to Know

• Classes will be videotaped
  – Available in the CLIS library if you miss class

• Office hours are by appointment
  – Send an email or ask after class

• Everything is on the web
  – At http://www.clis.umd.edu/courses/708a/

• We are most easily reached by email
  – [email protected], [email protected]

Do This Week

• Do the reading before class
  – Don’t fall behind!

• Start on assignment 1
  – Due in 2 weeks!

• Explore the Web site
  – Start thinking about the term project