
Techniques of information retrieval


Page 1: Techniques of information retrieval
Page 2: Techniques of information retrieval

Techniques of Information Retrieval
Tariq Hassan & Sabahat

Page 3: Techniques of information retrieval

Road Map:
• What is IR?
• Why & How it works?
• Evaluation Techniques
• Global & Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
4. Rocchio Algorithm
5. Linear Classifiers
6. Naïve Bayes Text Classification

Question & Discussion

Page 4: Techniques of information retrieval

What is IR? Why & How?

• Retrieving the information needed to satisfy the user.

• Why? Because data comes in different formats.
• How?
  – Stop list
  – Stemming
  – Inverse document frequency
  – Word counts
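
The first steps listed above can be sketched in a few lines of Python. This is only an illustration, not the presenters' implementation: the stop list `STOP_LIST` and the suffix-stripping `crude_stem` are hypothetical stand-ins for a real stop list and a real stemmer, and the sketch covers the stop list, stemming, and word counts only.

```python
from collections import Counter

# Hypothetical stop list; real systems use much longer lists.
STOP_LIST = {"the", "a", "an", "of", "is", "are", "and", "to", "in"}


def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters of the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_LIST]


def word_counts(text):
    """Count the stemmed, non-stop-word terms of a text."""
    return Counter(preprocess(text))


counts = word_counts("The cats are chasing the cat in the garden.")
print(counts)
```

Here "cats" and "cat" collapse to the same term, which is the point of stemming before counting.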

Page 5: Techniques of information retrieval

What is IR? Why & How?

Generally, IR is used in 3 scenarios:
1. Web search
2. Personal IR (text classification)
3. Enterprise level

Page 6: Techniques of information retrieval

Evaluation Techniques

• Why?
• How? Relevant & non-relevant documents

Precision and recall methods:

$P = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})}$

$R = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}$
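
As a small worked example of the two measures, the sketch below computes precision and recall from a retrieved list and a set of relevant items. The function name `precision_recall` and the document IDs are made up for illustration, and returning 0.0 for an empty set is an assumption of this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / retrieved items
    recall    = relevant items retrieved / relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumption: an empty denominator yields 0.0 instead of an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents are truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Two of the four retrieved documents are relevant, so precision is 2/4; two of the three relevant documents were found, so recall is 2/3.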

Page 7: Techniques of information retrieval

Methods:
1. Global methods: query reformulation

2. Local methods: relative to the initial results returned for a query

Page 8: Techniques of information retrieval

Local Methods

1. Relevance Feedback: feedback given by the user about the relevance of the documents in the initial set of results.

2. Probabilistic Relevance Feedback (PRF): implemented by building a classifier.

3. Indirect Relevance Feedback: works without user intervention.
   1. By using user actions
   2. By using user histories or logs

Page 9: Techniques of information retrieval

Conclusion : Relevance Feedback

Assumption: the user has initial knowledge

Issues:
• Misspelling
• Cross-language retrieval
• Vocabulary mismatch

Page 10: Techniques of information retrieval

Rocchio Algorithm
• Incorporates the relevance feedback mechanism into the vector space model.
• Also uses the cosine similarity function
• Euclidean mechanism
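
A minimal sketch of the standard Rocchio update, q_new = α·q + β·centroid(relevant) − γ·centroid(non-relevant), together with cosine similarity. The weights α = 1.0, β = 0.75, γ = 0.15 are commonly quoted defaults and are an assumption here, as are the toy three-term vectors; the slides do not state which values were used.

```python
import math


def centroid(vectors, dim):
    """Mean of a list of equal-length vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    dim = len(query)
    c_rel = centroid(relevant, dim)
    c_non = centroid(non_relevant, dim)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, c_rel, c_non)]
    # Negative term weights are usually clipped to zero.
    return [max(0.0, w) for w in updated]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical term-weight vectors over a three-term vocabulary.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]

q1 = rocchio(q0, relevant, non_relevant)
print(q1)
print(cosine(q0, relevant[1]), cosine(q1, relevant[1]))
```

After the update, the query gains weight on the second term, so its cosine similarity to the second relevant document rises from zero.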

Page 11: Techniques of information retrieval

Example

Page 12: Techniques of information retrieval

Outcome
• Relevance feedback plays an important role in understanding the user's requirements.

• The Rocchio algorithm is not the best method, but it is a practical choice because of its simplicity and good results.

• Both are of significant importance to content-based systems.

Page 13: Techniques of information retrieval

Classification Problems
• Given:
  – A document d
  – A fixed set of categories: Sports, Informatics, Literature, Medical, Entertainment
  – A training set of documents, each labeled with its class
• Determine:
  – A learning method or algorithm that enables us to learn a classifier
  – For a test document dT, its category

Page 14: Techniques of information retrieval

Classification Techniques

• Manual (a.k.a. Knowledge Engineering)

– typically, rule-based expert systems

• Machine Learning

– Naïve Bayes (Probabilistic)

– Decision Trees (Decision Structures)

– Support Vector Machines (Linear Classification)

Page 15: Techniques of information retrieval

Document Representation

• Binary Representation
• Frequency Representation
• TF*IDF Representation
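
The three representations can be compared on a toy corpus. This is a sketch under stated assumptions: the documents are invented, and IDF is taken as log(N / df) with the natural logarithm, which is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus.
docs = [
    ["football", "match", "football"],
    ["match", "report"],
    ["medical", "report", "report"],
]

# Fixed term order: ['football', 'match', 'medical', 'report']
vocab = sorted({t for d in docs for t in d})


def binary_vector(doc):
    """1 if the term occurs in the document, else 0."""
    return [1 if t in doc else 0 for t in vocab]


def frequency_vector(doc):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]


def idf(term):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)


def tfidf_vector(doc):
    """Term frequency multiplied by inverse document frequency."""
    counts = Counter(doc)
    return [counts[t] * idf(t) for t in vocab]


print(binary_vector(docs[0]))
print(frequency_vector(docs[0]))
print(tfidf_vector(docs[0]))
```

The TF*IDF vector weights "football" more heavily than "match" for the first document, because "football" is both more frequent there and rarer across the corpus.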

Page 16: Techniques of information retrieval

Naïve Bayes document classification example

• Probabilistic
  – Prior vs. posterior
• Bernoulli Model
  – Feature vector with binary elements
• Multinomial Model
  – Integers representing frequency of words
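
A minimal sketch of the multinomial model: class priors are estimated from the training labels, word likelihoods from word frequencies, and the class with the highest posterior score wins. Add-one (Laplace) smoothing and the tiny training set are assumptions of this sketch, not details given on the slides.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Estimate class priors and per-class term counts from (tokens, label) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    priors = {c: n / total for c, n in class_docs.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Return the class maximizing log P(c) + sum of log P(t | c)."""
    scores = {}
    for c, prior in priors.items():
        # Add-one smoothing: every vocabulary term gets one extra count.
        denom = sum(term_counts[c].values()) + len(vocab)
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[c][t] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical training set using two of the categories.
training = [
    (["football", "goal", "match"], "sports"),
    (["match", "team", "goal"], "sports"),
    (["doctor", "patient", "hospital"], "medical"),
]
model = train(training)

print(classify(["goal", "team"], *model))
print(classify(["patient", "doctor"], *model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.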

Page 17: Techniques of information retrieval
Page 18: Techniques of information retrieval

Classify the document

Page 19: Techniques of information retrieval

Naïve Bayes classification

• Very fast learning and testing
  – Why?
• Low storage requirements
• Very good in domains with many equally important features
• More robust to irrelevant features than many learning methods

Page 20: Techniques of information retrieval

Linear Classification

• Documents as labeled vectors
• Documents in the same class form a contiguous region of space
• Documents from different classes don’t overlap (much)
• Learning a classifier: build surfaces to delineate classes in the space

Page 21: Techniques of information retrieval

Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data

Page 22: Techniques of information retrieval

Support Vector Machines

• One Possible Solution

[Figure: decision boundary B1]

Page 23: Techniques of information retrieval

Support Vector Machines

• Another possible solution

[Figure: decision boundary B2]

Page 24: Techniques of information retrieval

Support Vector Machines

• Other possible solutions

[Figure: further decision boundaries, labeled B2]

Page 25: Techniques of information retrieval

Support Vector Machines

• Which one is better? B1 or B2?
• How do you define better?

[Figure: decision boundaries B1 and B2]

Page 26: Techniques of information retrieval

Support Vector Machines

• Find the hyperplane that maximizes the margin

[Figure: decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; the margin is marked]

Page 27: Techniques of information retrieval

Support Vector Machines

[Figure: decision boundaries B1 and B2 with margin hyperplanes b11, b12 and b21, b22; the margin and the support vectors are marked]

Page 28: Techniques of information retrieval

Support Vector Machines

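[Figure: decision boundary B1 with margin hyperplanes b11 and b12]

Decision boundary B1: $\mathbf{w} \cdot \mathbf{x} + b = 0$

Margin hyperplanes b11 and b12: $\mathbf{w} \cdot \mathbf{x} + b = -1$ and $\mathbf{w} \cdot \mathbf{x} + b = +1$

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$$

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$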
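
The decision rule and the margin on this slide can be sketched as follows: a point is scored by w · x + b, and the distance between the hyperplanes w · x + b = −1 and w · x + b = +1 is 2 / ‖w‖. The weight vector and bias are hypothetical values picked by hand rather than learned, and classifying by the sign of the score (including points inside the margin) is an assumption of this sketch.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assumption: predict by the sign of the score, even inside the margin."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w):
    """Distance between the hyperplanes w . x + b = -1 and w . x + b = +1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hand-picked parameters, not the result of training.
w, b = [1.0, 1.0], -3.0

print(decision_value(w, b, [2.0, 2.0]), classify(w, b, [2.0, 2.0]))
print(decision_value(w, b, [1.0, 1.0]), classify(w, b, [1.0, 1.0]))
print(margin(w))
```

The points (2, 2) and (1, 1) lie exactly on the two margin hyperplanes, so in a trained SVM with these parameters they would be support vectors.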

Page 30: Techniques of information retrieval

Questions & Discussion

Page 31: Techniques of information retrieval

Bottom Line
• Which classifier do I use for a given document classification problem?
  Answer: it depends.
  – How much training data is available?
  – How simple/complex is the problem?
  – How noisy is the data?
  – How stable is the problem over time?

• For an unstable problem, it's better to use a simple and robust classifier.