TP6084 CAPAIAN MAKLUMAT INFORMATION RETRIEVAL (IR)

Introduction

What will be covered today…

• Course overview

• Introduction to IR

What this course is about

• Search engines
  – What are they?
  – How to build one?
  – How to evaluate one?
  – What are the models?
  – How does Google rank results?
  – etc.
• Models?
• What is the research in this area?
• What about multimedia data?
• What about the semantic web?
• etc.

Course Overview

• What this course is about
  – How people search for and find information.
  – How computers store and retrieve information.
  – How computer systems are designed to help people find the information they need.

Course Overview

• The course will emphasize
  – Understanding of
    • Theories
    • Tools
    • Algorithms, and
    • Evaluations
    for information retrieval systems
  – Viewing the web search engine as a practical application of an IR system

Course Content (subject to change)

• Introduction
• IR and Search Engine
• Architecture of Search Engine
• Text processing
• Indexing and Ranking
• Queries & Interface
• Retrieval Models
• Evaluation
• Classification & Clustering
• Social Search

References

• The textbook for this course:
  – Croft, W.B., Metzler, D. & Strohman, T. 2009. Search Engines: Information Retrieval in Practice. New York: Addison Wesley.
• Other recommended books:
  – Grossman, D.A. & Frieder, O. 2004. Information Retrieval: Algorithms & Heuristics, 2nd Edition. Berlin: Springer.
  – Baeza-Yates, R. & Ribeiro-Neto, B. 1999. Modern Information Retrieval. New York: Addison Wesley.
  – Manning, C., Raghavan, P. & Schütze, H. 2008. Introduction to Information Retrieval. New York: Cambridge University Press.
• For general reading on search engines, you must read:
  – Battelle, J. 2005. The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York: Portfolio Hardcover.
• A list of related journal and proceedings articles will be announced from time to time during class.

Assessment

• Exam – 50%
• Project/Assignments – 50%
• Lectures:
  – Monday (11 am – 12 noon) BK8
  – Thursday (10 am – 12 noon) BK8

Any problems?

• Dr. Shereena Arif (PhD)
• Room H-2-8, IT School, Faculty of Information Science & Technology, UKM Bangi.
• E-mail: [email protected] OR [email protected]
• Website/blog: shereenarif.wordpress.com
• Blog dedicated to this course: tp6084.wordpress.com
• Any preferred medium for communication?

Shall we start ………

What is IR?

• Finding relevant information in large collections of data
• In such a collection you may want to find:
  – ‘Give me information on the history of Tun Razak’: an article about Tun Razak (text retrieval)
  – ‘What does a brain tumor look like on a CT scan?’: a picture of a brain tumor (image retrieval)
  – ‘It goes like this: I do, I do, I do, I do do do do do . . .’: a certain song (music retrieval)

What is IR?

• IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. [System Centered]

• IR involves helping users find information that matches their information needs. [User Centered]

Text Retrieval

• Online library catalogs (OPAC)
• Internet search engines, such as
  – AltaVista, Google, Ilse
• Specialized systems (aka vendors):
  – MEDLINE (medical articles)
  – Lexis-Nexis (legal, business, academic, . . . )
  – Westlaw (legal articles)
  – Dialog (business information)

Retrieval vs. Browsing

• Popular Web directories:
  – Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  – The user has to adapt to the designers' conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
  – The user can specify their information need

IR vs. Database Querying

• IR is not the same thing as querying a database.
• Database querying assumes that the data is in a standardized format.
• Transforming all information, news articles, and web sites into a database format is difficult, and impossible for large data collections.
• Text retrieval can work with plain, unformatted data.

Data Retrieval vs. Information Retrieval

                      Data retrieval       Information retrieval
Content               Data                 Information
Data object           Table                Document
Matching              Exact match          Partial match, best match
Items wanted          Matching             Relevant
Query language        SQL (artificial)     Natural
Query specification   Complete             Incomplete
Model                 Deterministic        Probabilistic
Structure             Highly structured    Less structured
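
The contrast between exact match and best match can be made concrete with a small sketch. The records, the query, and the function names `exact_match` and `best_match` below are hypothetical and chosen only for illustration; counting shared terms is one simple way to rank partial matches, not the method of any particular system.

```python
# Hypothetical toy collection; the records and query are illustrative only.
records = [
    {"id": 1, "title": "history of tun razak"},
    {"id": 2, "title": "tun razak biography"},
    {"id": 3, "title": "brain tumor ct scan"},
]


def exact_match(records, title):
    """Data retrieval: return only records whose field equals the query exactly."""
    return [r["id"] for r in records if r["title"] == title]


def best_match(records, query):
    """Information retrieval: rank records by how many query terms they share."""
    query_terms = set(query.lower().split())
    scored = []
    for r in records:
        overlap = len(query_terms & set(r["title"].split()))
        if overlap > 0:
            scored.append((overlap, r["id"]))
    # Highest overlap first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


exact_hits = exact_match(records, "tun razak history")   # nothing matches exactly
ranked_hits = best_match(records, "tun razak history")   # partial matches, ranked
```

The exact-match lookup finds nothing because no title equals the query string, while the best-match ranking still surfaces the two partially matching records.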

Relevance as Similarity

• A fundamental idea within IR is: ‘A document is relevant to a query if they are similar’
• Similarity can be defined as:
  – string matching/comparison
  – similar vocabulary
  – same meaning of text
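
The first two definitions of similarity can be sketched in a few lines. This is a minimal sketch: the function names and the sample document are assumptions made for illustration, and Jaccard overlap is just one possible way to measure "similar vocabulary". The third definition (same meaning) needs semantic knowledge and is not covered here.

```python
def string_match(query, document):
    """Similarity as string matching: 1.0 if the query occurs verbatim, else 0.0."""
    return 1.0 if query.lower() in document.lower() else 0.0


def vocabulary_similarity(query, document):
    """Similarity as shared vocabulary: Jaccard overlap of the two word sets."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


# Hypothetical document text, used only to exercise the two functions.
doc = "an article about the history of tun razak"

verbatim = string_match("history of tun razak", doc)
reordered = string_match("tun razak history", doc)
overlap = vocabulary_similarity("tun razak history", doc)
```

Reordering the query words defeats string matching but still yields a non-zero vocabulary score, which is why the choice of similarity definition matters.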

The Ubiquity of IR

• Search engines
• Information filtering
  – E-mail routing
  – Text categorization
• Detecting information structure
  – Hyperlink generation
  – Topic/Information detection/Screening
  – Portal development and maintenance
  – Digital libraries
• Question Answering

“Web brings IR to the Center of the Stage”

IR has become a center of focus in the Web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.

Challenges of IR

[Diagram labels: User; Information; Search/select; Info. needs; Queries; Stored information]

• Translating information needs into queries
• Matching queries to stored information
• Query result evaluation: does the information found match the user’s information needs?

Data and Information

• Data
  – Strings of symbols associated with objects, people, and events
  – Values of an attribute
    • Data need not have meaning to everyone
    • Data must be interpreted with associated attributes.
• Information
  – The meaning of the data as interpreted by a person or a system
  – Data that changes the state of a person or system that perceives it.
  – Data that reduces uncertainty.
    • If data contain no uncertainty, there is no information in the data.
    • Examples: ‘It snows in the winter.’ ‘It does not snow this winter.’
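
One common way to quantify "data that reduces uncertainty", which the slide itself does not name, is Shannon self-information: an event with probability p carries log2(1/p) bits. The sketch below applies it to the two snow examples. The probabilities are assumed values picked purely for illustration, and `self_information` is a hypothetical helper name.

```python
import math


def self_information(probability):
    """Shannon self-information in bits: log2(1/p) for an event of probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)


# Assumed probabilities, chosen only for illustration.
p_snow_in_winter = 1.0         # treated as certain: no uncertainty to reduce
p_no_snow_this_winter = 0.125  # treated as a rare event

certain_bits = self_information(p_snow_in_winter)
surprising_bits = self_information(p_no_snow_this_winter)
```

Under these assumptions the expected statement carries zero bits, while the unexpected one carries more, matching the intuition that only surprising data is informative.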

Information and Knowledge

• Knowledge
  – Structured information
    • through structuring, information becomes understandable
  – Processed information
    • through processing, information becomes meaningful and useful
  – Information shared and agreed upon within a community

[Diagram: data, information, knowledge]

Text

• Strings of ASCII or Unicode symbols
  – structured by the author
  – indexed by information service providers
• Representation of the natural languages people use
  – To convey meanings
  – To communicate between readers and authors.
• Data or information?
  – If it can be understood, it’s information.
    • By whom? A person or a system?

Documents

• Logical unit of text
  – articles, books,
  – links, web pages
• Other components that come with the text
  – figures, charts, graphics
  – multimedia

Textual Data

• Repository of human intellect
  – Rich and diverse resource for all answers.
    • If it is written, it is there (in text)
  – Meaningful and understandable (to users).
• Simple ASCII representation
• Free of pre-formatted structures
  – continuous
  – separated into documents
• Easy to process by computer
  – Machine intensive (not labor intensive)

Problems with Text

• Massive
  – Any IR system needs the capability for large-scale data processing.
  – Indexes and various representations are required.
• Inconsistent
  – It’s a human language
    • Syntactic and semantic variance
  – The same information is expressed in different ways.
  – Different information is expressed in similar ways.
• Incomplete
  – It relies on common knowledge.
  – It’s an open system.
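
To illustrate why indexes help with massive text, here is a minimal sketch of an inverted index, a widely used index structure that maps each term to the documents containing it. The slides do not say which index they have in mind, so this is an assumption; the three-document collection and the function names are hypothetical, and tokenization is reduced to lowercase whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical three-document collection.
documents = {
    1: "information retrieval systems",
    2: "database systems and query languages",
    3: "retrieval of text documents",
}
index = build_inverted_index(documents)
hits = search_all_terms(index, "retrieval systems")
```

Once the index is built, answering a query only touches the posting lists of the query terms instead of scanning every document.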

Retrieval

• Retrieval
  – What do we retrieve?
    • Data
    • Information
    • Knowledge
  – We retrieve documents that contain text, which carries information.
    • Information can be anywhere: in the text, in the links, in the processing of the text.

Information Retrieval

• Are they the same?
  – Text retrieval
  – Document retrieval
  – Information retrieval

Information Retrieval

• Conceptually, information retrieval is used to cover all related problems in finding needed information

• Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit

• Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

IR Systems

• IR systems contain three components:
  – System
  – People
  – Documents (information items)

[Diagram: User, Systems (browsing, retrieval), Documents (database)]

Basic Overview of Retrieval Process

Detailed Overview of Retrieval Process

Historical Summary

• 1960’s
  – Basic advances in retrieval and indexing techniques
    • 1950: Calvin N. Mooers coins the term ‘Information Retrieval’
    • 1959: Luhn describes statistical retrieval
    • 1960: Maron and Kuhns define a probabilistic model of IR
    • 1966: Cranfield project defines evaluation measures
    • 1968: Gerard Salton's first book about the SMART retrieval system

Historical Summary

• 1990’s and 2000’s
  – Large-scale, full-text IR and filtering experiments and systems
  – Dominance of ranking
  – Many Web-based retrieval engines
  – Interfaces and browsing
  – Multimedia and multilingual
  – Machine learning techniques
  – Question answering (factoids)
• The Future
  – IR in context (the right answer for you, now, here)
  – Logic-based IR?
  – NLP?
  – Integration with other functionality
  – Distributed, heterogeneous database access

End of Topic 1