Introduction to Information Retrieval Ricardo Campos Inverted Index LIC ITM Tecnologias Avançadas de Programação Abrantes, Portugal, 2019


Introduction to Information Retrieval

Ricardo Campos

Inverted Index

Instituto Politécnico de Tomar

LIC ITM – Tecnologias Avançadas de Programação Abrantes, Portugal, 2019

What is Information Retrieval?

This presentation was developed by Ricardo Campos, Professor of ICT at the Polytechnic Institute of Tomar and researcher at LIAAD - INESC TEC. Some of the slides were adapted from presentations found on the internet and from the reference bibliography:

• Jaime Arguello (University of North Carolina)

• Vitor Mangaravite (INESC TEC / UFMG)

Please refer to the following when using this presentation:

Campos, Ricardo. (2019).

A .ppt version of this presentation can be provided upon request by sending an email to [[email protected]]

AGENDA: What is this talk about?

1 Overview
2 Inverted Index
3 Data Structure
4 Challenges
5 Map Reduce
6 Toolkits
7 Summary
8 Q&A

Data about keywords, documents and, especially, the occurrences of keywords in documents needs to be stored in an appropriate data structure. This process is called indexing: transforming items (documents) into a searchable data structure.

Goal: facilitate efficient access to and processing of stored data.

Is the term-document matrix the solution? No: it is a very sparse matrix, so most of its cells would only store zeros.

docid  Content
1      Pease porridge hot, pease porridge cold,
2      Pease porridge in the pot,
3      Nine days old.
4      Some like it hot, some like it cold,
5      Some like it in the pot,
6      Nine days old.

Term      doc1  doc2  doc3  doc4  doc5  doc6
pease     2     1     0     0     0     0
porridge  2     1     0     0     0     0
hot       1     0     0     1     0     0
cold      1     0     0     1     0     0
in        0     1     0     0     1     0
the       0     1     0     0     1     0
pot       0     1     0     0     1     0
nine      0     0     1     0     0     1
days      0     0     1     0     0     1
old       0     0     1     0     0     1
some      0     0     0     2     1     0
like      0     0     0     2     1     0
it        0     0     0     2     1     0

Size: |V| x |D|

● |V|: number of unique terms in the vocabulary;

● |D|: number of indexed documents.

E.g. Reuters Corpus Volume I (RCV1):

● |V| = ~391k

● |D| = ~800k

|V| x |D| = ~313G cells
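To make the sparsity argument concrete, the sketch below builds the dense matrix for the six-line rhyme and counts how many cells are actually non-zero. The tokenizer (lowercasing and keeping alphabetic runs only) and the names `docs`, `tokenize` and `matrix` are illustrative assumptions, not part of the original slides.

```python
import re
from collections import Counter

docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

def tokenize(text):
    # Assumed normalization: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

counts = {d: Counter(tokenize(t)) for d, t in docs.items()}
vocab = sorted({term for c in counts.values() for term in c})

# Dense term-document matrix: one row per term, one column per document.
matrix = [[counts[d][term] for d in sorted(docs)] for term in vocab]

total_cells = len(vocab) * len(docs)
nonzero_cells = sum(1 for row in matrix for cell in row if cell)
print(total_cells, nonzero_cells)  # 78 26

# Same product for the RCV1 figures quoted above.
rcv1_cells = 391_000 * 800_000
print(rcv1_cells)  # 312800000000
```

Even in this tiny collection two thirds of the cells are zero; the proportion of zeros grows quickly as the vocabulary and the collection get larger.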

The solution is called indexing: the process of transforming items (documents) into a searchable data structure (the “implementation model” used in practice to store the information captured in the matrix representation).

While building an inverted index does require extra processing up front, taking the time to do so can greatly reduce the time it takes to find something.

Imagine entering a keyword and letting the engine crawl the Internet and build a list of pages to return to you. Such a query would take an extremely long time to complete.

By building an inverted index, the search engine knows all the web pages related to a keyword ahead of time and these results are simply displayed to the user.

These indexes are often ingested into a database for fast query responses.

Can you imagine what a non-inverted file would look like?
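One way to picture a non-inverted (forward) file: each document maps to the terms it contains, so answering a query means scanning every document. The following is only a minimal sketch; `forward_index` and `scan_search` are hypothetical names chosen for illustration.

```python
# Forward ("non-inverted") file: docid -> terms found in that document.
forward_index = {
    1: ["pease", "porridge", "hot", "pease", "porridge", "cold"],
    2: ["pease", "porridge", "in", "the", "pot"],
    3: ["nine", "days", "old"],
}

def scan_search(term):
    # Every document has to be visited, so the cost grows with the
    # size of the whole collection, not with the number of matches.
    return [docid for docid, terms in forward_index.items() if term in terms]

print(scan_search("porridge"))  # [1, 2]
```

An inverted index turns this mapping around (term to documents), so the documents for a term are found with a single dictionary lookup.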

https://www.youtube.com/watch?v=KyCYyoGusqs

Types

There are three main types of inverted index:

• DocId index

• Frequency index

• Positional index

The details of what data needs to be stored depend on the required functionality of the application.

DocId Index

Data stored in an inverted index:

• DocId index: {term : [df, [doc1, doc2, …]]}

Term      DF  Postings (docid)
pease     2   (1, 2)
porridge  2   (1, 2)
hot       2   (1, 4)
cold      2   (1, 4)
in        2   (2, 5)
the       2   (2, 5)
pot       2   (2, 5)
nine      2   (3, 6)
days      2   (3, 6)
old       2   (3, 6)
some      2   (4, 5)
like      2   (4, 5)
it        2   (4, 5)
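A DocId index in the {term : [df, [docids]]} layout can be sketched in a few lines. This is an illustration under assumptions: the regular-expression tokenizer and the function name `build_docid_index` are not from the slides.

```python
import re

docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

def build_docid_index(docs):
    postings = {}
    # Visiting docids in ascending order keeps every posting list sorted.
    for docid in sorted(docs):
        # set(): a document is listed once per term, however often it occurs.
        for term in set(re.findall(r"[a-z]+", docs[docid].lower())):
            postings.setdefault(term, []).append(docid)
    # Store each entry as [df, postings]; df is the posting-list length.
    return {term: [len(p), p] for term, p in postings.items()}

index = build_docid_index(docs)
print(index["hot"])   # [2, [1, 4]]
print(index["some"])  # [2, [4, 5]]
```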

DocId Index: easy retrieval

Documents with: some AND like AND it

Result: doc4, doc5
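A Boolean AND query reduces to intersecting posting lists. The sketch below uses the classic linear merge of two sorted lists; the function names `intersect` and `and_query` and the small hand-written `index` are assumptions made for the example.

```python
def intersect(p1, p2):
    # Linear merge of two docid lists sorted in ascending order.
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def and_query(index, terms):
    # Unknown terms get an empty posting list, so the result is empty.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    # Starting from the shortest list keeps intermediate results small.
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

index = {"some": [4, 5], "like": [4, 5], "it": [4, 5], "hot": [1, 4]}
print(and_query(index, ["some", "like", "it"]))  # [4, 5]
print(and_query(index, ["some", "hot"]))         # [4]
```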

Frequency Index

Data stored in an inverted index:

• DocId index: {term : [df, [doc1, doc2, …]]}

• Frequency index: {term : [df, {doc1 : tf, doc2 : tf, …}]}

Term      DF  TotF  Postings (docid, tf)
pease     2   3     (1,2), (2,1)
porridge  2   3     (1,2), (2,1)
hot       2   2     (1,1), (4,1)
cold      2   2     (1,1), (4,1)
in        2   2     (2,1), (5,1)
the       2   2     (2,1), (5,1)
pot       2   2     (2,1), (5,1)
nine      2   2     (3,1), (6,1)
days      2   2     (3,1), (6,1)
old       2   2     (3,1), (6,1)
some      2   3     (4,2), (5,1)
like      2   3     (4,2), (5,1)
it        2   3     (4,2), (5,1)

Frequency Index

Documents with: pease porridge

Result: doc1, doc2, with more relevance given to doc1, since both words appear there more frequently.

Term or document frequency information is not used for Boolean queries (these are just set operations performed on hit lists). Statistical ranking algorithms, however, can take full advantage of this structure.
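As an illustration of how term frequencies can drive ranking, the sketch below scores each document by the sum of the raw tf values of the query terms. This scoring rule is a deliberately naive assumption made for the example, not the ranking function of any particular system; `freq_index` and `rank` are hypothetical names.

```python
# term -> {docid: tf}, a fragment of the frequency index for the rhyme.
freq_index = {
    "pease": {1: 2, 2: 1},
    "porridge": {1: 2, 2: 1},
    "hot": {1: 1, 4: 1},
}

def rank(query_terms):
    scores = {}
    for term in query_terms:
        for docid, tf in freq_index.get(term, {}).items():
            scores[docid] = scores.get(docid, 0) + tf
    # Highest score first; ties are broken by ascending docid.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(rank(["pease", "porridge"]))  # [(1, 4), (2, 2)]
```

Both documents match, but doc1 comes first because its summed term frequency is higher.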

Positional Index

Data stored in an inverted index:

• DocId index: {term : [df, [doc1, doc2, …]]}

• Frequency index: {term : [df, {doc1 : tf, doc2 : tf, …}]}

• Positional index: {term : [df, TotalFreq, {doc1 : [tf, [offsets]], doc2 : [tf, [offsets]], …}]}

Term      DF  TotF  Postings (docid, tf, [offsets])
pease     2   3     (1,2, [0,3]), (2,1, [0])
porridge  2   3     (1,2, [1,4]), (2,1, [1])
hot       2   2     (1,1, [2]), (4,1, [3])
cold      2   2     (1,1, [5]), (4,1, [7])
in        2   2     (2,1, [2]), (5,1, [3])
the       2   2     (2,1, [3]), (5,1, [4])
pot       2   2     (2,1, [4]), (5,1, [5])
nine      2   2     (3,1, [0]), (6,1, [0])
days      2   2     (3,1, [1]), (6,1, [1])
old       2   2     (3,1, [2]), (6,1, [2])
some      2   3     (4,2, [0,4]), (5,1, [0])
like      2   3     (4,2, [1,5]), (5,1, [1])
it        2   3     (4,2, [2,6]), (5,1, [2])
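The positional layout {term : [df, TotalFreq, {doc : [tf, [offsets]]}]} can be produced with one pass over the tokens of each document. A minimal sketch, assuming word offsets start at 0 and the same simple tokenizer as before; `build_positional_index` is a hypothetical name.

```python
import re

docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

def build_positional_index(docs):
    offsets = {}  # term -> {docid: [offsets]}
    for docid in sorted(docs):
        tokens = re.findall(r"[a-z]+", docs[docid].lower())
        for offset, term in enumerate(tokens):
            offsets.setdefault(term, {}).setdefault(docid, []).append(offset)
    index = {}
    for term, by_doc in offsets.items():
        # tf is simply the number of recorded offsets in that document.
        per_doc = {d: [len(offs), offs] for d, offs in by_doc.items()}
        total_freq = sum(tf for tf, _ in per_doc.values())
        index[term] = [len(per_doc), total_freq, per_doc]
    return index

index = build_positional_index(docs)
print(index["some"])  # [2, 3, {4: [2, [0, 4]], 5: [1, [0]]}]
```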

Positional Index

Documents with phrases: “some like it hot”

Result: doc4 (makes use of offsets)

A positional index is 2–4 times as large as a non-positional index, but it enables queries to contain proximity information, e.g. “some like” versus some AND like.
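A phrase query can be answered by checking that the query terms occur at consecutive offsets in the same document. The sketch below assumes a simplified index of the form term -> {docid: [offsets]} (tf omitted for brevity) with offsets copied from the rhyme; `phrase_query` is a hypothetical helper.

```python
def phrase_query(index, phrase):
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for docid in sorted(candidates):
        # A start offset matches if term k occurs at offset start + k.
        for start in index[terms[0]][docid]:
            if all(start + k in index[t][docid] for k, t in enumerate(terms)):
                hits.append(docid)
                break
    return hits

# term -> {docid: [offsets]}
index = {
    "some": {4: [0, 4], 5: [0]},
    "like": {4: [1, 5], 5: [1]},
    "it": {4: [2, 6], 5: [2]},
    "hot": {1: [2], 4: [3]},
}
print(phrase_query(index, "some like it hot"))  # [4]
print(phrase_query(index, "some like"))         # [4, 5]
```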

The inverted index of a document collection is basically a data structure that associates each distinct term with a list of all documents that contain the term. It may be implemented through a dictionary (hash table), which supports lookup in O(1) on average.

Most retrieval systems keep the dictionary in memory and the postings on disk.

Web search engines frequently keep both in memory.

The set of keywords is called the dictionary/vocabulary.

The list of document identifiers associated with a given keyword is called a posting list.

What should we store as a key? The term (e.g. “ETH”), or an id of the term?

For each term, you get a hit list consisting of:

• document frequency

• total frequency

• document ID

• term frequency

• position of term in doc

Term        DF  TotF
legionella  41  60

DocId  TF  Positions
2      1   119
3      4   2, 7, 93, 148

The word ‘legionella’ occurs in 41 different documents, with a total of 60 occurrences. In the document with DocId 2 it occurs only once, at position 119.

1. Initialization

   A. Create an empty dictionary structure S

2. Collect term appearances

   A. For each document D_i in the collection, scan D_i (parse it into index terms)

   B. For each index term t:

      • Let f_{d,t} be the frequency of term t in document d

      • Search S for t

      • If t is not in S, insert it

      • Append a node storing (d, f_{d,t}) to t’s inverted list
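The steps above can be sketched in Python as follows. The tokenizer, the use of a plain `dict` for S and the function name `build_index` are assumptions made for the example; the comments map each line back to the numbered steps.

```python
import re
from collections import Counter

def build_index(collection):
    S = {}  # 1.A: empty dictionary structure
    for d, text in collection.items():  # 2.A: scan each document
        terms = re.findall(r"[a-z]+", text.lower())
        # 2.B: for each index term t, f_dt is its frequency in document d.
        for t, f_dt in Counter(terms).items():
            # setdefault covers "search S for t; if t is not in S, insert it".
            S.setdefault(t, []).append((d, f_dt))
    return S

collection = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
}
S = build_index(collection)
print(S["pease"])  # [(1, 2), (2, 1)]
```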

Postings Size: Zipf’s Law

In other words:

• A few elements occur very frequently (thus only a few terms have a large posting list; stopwords are the worst case, and long posting lists also lead to poor search time)

• Many elements occur very infrequently (thus there are many terms in the dictionary, but most of them have a small posting list)
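For reference, Zipf's law is commonly stated as an inverse relation between the frequency of a term and its rank. In the form below, f(r) is the collection frequency of the r-th most frequent term, and C and s are collection-dependent constants; these symbols are introduced here for illustration and do not appear in the slides.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
```

With s close to 1, the second most frequent term occurs about half as often as the first, the third about a third as often, and so on, which is why posting-list lengths are so unevenly distributed.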

Index is much larger than memory (RAM)

Collection is larger than disk space (e.g. web)

Build index for new docs, merge new with old index
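Merging a small index built for new documents into an existing one can be sketched as below. The sketch assumes DocId-style posting lists and that every new docid is larger than every old one, so simple concatenation keeps the lists sorted; `merge_indexes` is a hypothetical name.

```python
def merge_indexes(old, new):
    # Assumption: every docid in `new` is larger than every docid in `old`,
    # so appending keeps each posting list in ascending order.
    merged = {term: list(postings) for term, postings in old.items()}
    for term, postings in new.items():
        merged.setdefault(term, []).extend(postings)
    return merged

old_index = {"pease": [1, 2], "hot": [1]}
new_index = {"hot": [4], "some": [4, 5]}
print(merge_indexes(old_index, new_index))
# {'pease': [1, 2], 'hot': [1, 4], 'some': [4, 5]}
```

The inputs are left untouched, which makes it possible to keep serving queries from the old index while the merged one is being built.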

The indexing problem

• Scalability is paramount

• Must be relatively fast, but need not be real time

• Fundamentally a batch operation

The retrieval problem

• Must have sub-second response time

• For the web, only need relatively few results

Suggested reading: http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch4.pdf
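Because indexing is a batch operation, it maps naturally onto a map/shuffle/reduce pipeline. The sketch below imitates the three phases in a single Python process; it does not use any real MapReduce framework, and `map_phase`, `shuffle` and `reduce_phase` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def map_phase(docid, text):
    # Emit one (term, docid) pair per token.
    for term in text.lower().split():
        yield term, docid

def shuffle(pairs):
    # Group values by key, as a framework would do between the two phases.
    groups = defaultdict(list)
    for term, docid in pairs:
        groups[term].append(docid)
    return groups

def reduce_phase(term, docids):
    # Build the posting list: sorted, duplicate-free docids.
    return term, sorted(set(docids))

docs = {1: "nine days old", 2: "some like it hot", 3: "nine days hot"}
pairs = [pair for docid, text in docs.items() for pair in map_phase(docid, text)]
index = dict(reduce_phase(t, ds) for t, ds in shuffle(pairs).items())
print(index["nine"], index["hot"])  # [1, 3] [2, 3]
```

In a distributed setting the map calls would run in parallel over partitions of the collection and each reducer would receive all pairs for a subset of the terms.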

https://www.youtube.com/watch?v=CPjSvanPl7s

All modern search engines are based on inverted indexes;

The simplest form of an inverted list stores just the documents where a given word appears;

In an inverted index that contains only document information, the features are binary (1 if the document contains the term, 0 otherwise);

This is too coarse to find the best few documents when there are many possible matches. Word counts are therefore a powerful predictor of document relevance.

When looking for matches for a query like “San Francisco”, the location of the words in the document is an important factor.
