Information Retrieval in Practice

Husni

Stages of Building an IR System

• Crawling
  • Collecting documents
  • Separating text and URLs

• Preprocessing
  • Linguistic processing of the documents, including tokenization, case folding, normalization, and weighting

• Indexing
  • Building an (inverted) list that maps each term to document numbers

• Querying
  • Finding the documents that are relevant to the query

Preprocessing

Indexing: Basic Concepts

• Inverted index: a word-oriented mechanism for indexing a text collection to speed up the searching task

• The inverted index structure is composed of two elements: the vocabulary and the occurrences

• The vocabulary is the set of all different words in the text

• For each word in the vocabulary, the index stores the documents that contain that word (inverted index)


• Term-document matrix: the simplest way to represent the documents that contain each word of the vocabulary


• The main problem of this simple solution is that it requires too much space

• As this is a sparse matrix, the solution is to associate a list of documents with each word

• The set of all those lists is called the occurrences


• Basic inverted index
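As a rough illustration of the vocabulary and the occurrences, here is a minimal Python sketch. The function name `build_basic_index`, the whitespace tokenization, and the sample documents are assumptions made for the example, not part of the slides.

```python
from collections import defaultdict

def build_basic_index(docs):
    """Map each word to the sorted list of IDs of the documents that contain it."""
    occurrences = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive whitespace tokenization
            occurrences[word].add(doc_id)
    # The keys form the vocabulary; the values are the occurrence lists.
    return {word: sorted(ids) for word, ids in occurrences.items()}

# Hypothetical three-document collection
docs = {1: "to be or not to be", 2: "to do is to be", 3: "do be do"}
index = build_basic_index(docs)
print(index["be"])
print(index["not"])
```

Each word is stored once in the vocabulary, and only the documents that actually contain it are listed, which is what avoids the space cost of the full term-document matrix.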

Full Inverted Indexes

• The basic index is not suitable for answering phrase or proximity queries

• Hence, we need to add the positions of each word in each document to the index (full inverted index)


• In the case of multiple documents, we need to store one occurrence list per term-document pair
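A full inverted index can be sketched as follows. This is only an illustration under simple assumptions (whitespace tokens, positions counted in words); `build_full_index` and `phrase_query` are hypothetical names.

```python
from collections import defaultdict

def build_full_index(docs):
    """Map each word to {doc_id: [word positions]}: one list per term-document pair."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return IDs of documents in which the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must occur at position start + k.
            if all(start + k in index[w].get(doc_id, ()) for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical sample collection
docs = {1: "to be or not to be", 2: "not to be late", 3: "be not afraid"}
full_index = build_full_index(docs)
print(phrase_query(full_index, "not to be"))
```

Without the stored positions, the index could only say that documents 1 and 2 contain the three words, not that they occur next to each other.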

• The space required for the vocabulary is rather small

• Heaps’ law: the vocabulary grows as O(n^β), where
  • n is the collection size
  • β is a collection-dependent constant between 0.4 and 0.6

• For instance, in the TREC-3 collection, the vocabulary of 1 gigabyte of text occupies only 5 megabytes

• This may be further reduced by stemming and other normalization techniques
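To get a feel for how slowly the vocabulary grows under Heaps’ law, the sketch below evaluates V = k · n^β. The constant k = 30 and the exponent β = 0.5 are illustrative assumptions only; real values depend on the collection.

```python
def heaps_vocabulary_size(n_words, k=30.0, beta=0.5):
    """Estimate vocabulary size as k * n**beta (k and beta are illustrative values)."""
    return k * n_words ** beta

small = heaps_vocabulary_size(1_000_000)
large = heaps_vocabulary_size(4_000_000)
growth = large / small  # with beta = 0.5, a 4x larger text gives a 2x larger vocabulary
print(small, large, growth)
```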

• The occurrences demand much more space

• The extra space will be O(n) and is around
  • 40% of the text size if stopwords are omitted
  • 80% when stopwords are indexed

• Document-addressing indexes are smaller, because only one occurrence per file must be recorded, for a given word

• Depending on the document (file) size, document-addressing indexes typically require 20% to 40% of the text size


• To reduce space requirements, a technique called block addressing is used

• The documents are divided into blocks, and the occurrences point to the blocks where the word appears

• The Table below presents the projected space taken by inverted indexes for texts of different sizes

• The blocks can be of fixed size or they can be defined using the division of the text collection into documents

• The division into blocks of fixed size improves efficiency at retrieval time
  • This is because blocks of uneven size include some very large blocks, and larger blocks match queries more frequently and are more expensive to traverse

• This technique also profits from locality of reference
  • That is, the same word will be used many times in the same context and all the references to that word will be collapsed in just one reference
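Block addressing can be sketched like this, assuming fixed-size blocks measured in words. The names `build_block_index` and `find_positions` and the sample text are hypothetical. The index only narrows the search to candidate blocks; exact positions are recovered by scanning those blocks.

```python
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the sorted list of fixed-size blocks that contain it."""
    words = text.lower().split()
    blocks = defaultdict(set)
    for pos, word in enumerate(words):
        blocks[word].add(pos // block_size)  # repeated words in a block collapse to one entry
    return {w: sorted(b) for w, b in blocks.items()}, words

def find_positions(index, words, block_size, term):
    """Use the index to pick candidate blocks, then scan only those blocks."""
    term = term.lower()
    positions = []
    for block in index.get(term, []):
        start = block * block_size
        for pos in range(start, min(start + block_size, len(words))):
            if words[pos] == term:
                positions.append(pos)
    return positions

text = "this is a text a text has many words words are made from letters"
block_index, words = build_block_index(text, block_size=4)
print(block_index["words"])
print(find_positions(block_index, words, 4, "words"))
```

The two adjacent occurrences of "words" fall in the same block and cost a single index entry, which is the locality-of-reference saving mentioned above.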

Single Word Queries

• The simplest type of search is that for the occurrences of a single word

• The vocabulary search can be carried out using any suitable data structure
  • Ex: hashing, tries, or B-trees

• The first two provide O(m) search cost, where m is the length of the query

• We note that the vocabulary is in most cases sufficiently small so as to stay in main memory

• The occurrence lists, on the other hand, are usually fetched from disk
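The O(m) vocabulary lookup can be illustrated with a small trie. This is a sketch only: `TrieNode`, `trie_insert`, and `trie_lookup` are hypothetical names, and the occurrence lists are kept in memory here for simplicity rather than fetched from disk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.postings = None  # set only on nodes that end a vocabulary word

def trie_insert(root, word, postings):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.postings = postings

def trie_lookup(root, word):
    """Follow one edge per character: O(m) steps for a query of length m."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.postings or []

root = TrieNode()
trie_insert(root, "text", [1, 3])
trie_insert(root, "term", [2])
print(trie_lookup(root, "text"))
```

The cost of the lookup depends on the length of the query word, not on the number of words in the vocabulary.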

Multiple Word Queries

• If the query has more than one word, we have to consider two cases:
  • conjunctive (AND operator) queries
  • disjunctive (OR operator) queries

• Conjunctive queries require searching for all the words in the query, obtaining one inverted list for each word

• Next, we have to intersect all the inverted lists to obtain the documents that contain all these words

• For disjunctive queries the lists must be merged

• The first case is popular on the Web due to the size of the document collection
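The two cases can be sketched with Python sets. This is a simplification: real systems work on sorted lists, and the `index` dictionary below is a hypothetical example.

```python
def conjunctive(index, words):
    """AND query: documents that appear in the inverted list of every word."""
    lists = [set(index.get(w, [])) for w in words]
    return sorted(set.intersection(*lists)) if lists else []

def disjunctive(index, words):
    """OR query: documents that appear in the inverted list of at least one word."""
    return sorted(set().union(*(index.get(w, []) for w in words)))

# Hypothetical index: word -> list of document IDs
index = {"life": [1, 2], "learning": [1, 3], "game": [1]}
print(conjunctive(index, ["life", "learning"]))
print(disjunctive(index, ["life", "learning"]))
```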

List Intersection

• The most time-demanding operation on inverted indexes is the merging of the lists of occurrences
  • Thus, it is important to optimize it

• Consider one pair of lists of sizes m and n respectively, stored in consecutive memory, that needs to be intersected

• If m is much smaller than n, it is better to do m binary searches in the larger list to do the intersection

• If m and n are comparable, the double binary search algorithm devised by Baeza-Yates can be used
  • It is O(log n) if the intersection is trivially empty
  • It requires fewer than m + n comparisons on average
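The first strategy (m binary searches in the larger list) might look like the sketch below. It is not the Baeza-Yates double binary search; it is the simpler approach for m much smaller than n, and the function name is hypothetical. Both lists are assumed to be sorted.

```python
from bisect import bisect_left

def intersect_by_binary_search(short, long):
    """Intersect two sorted lists with one binary search per element of the short list."""
    result = []
    lo = 0
    for doc_id in short:
        # Both lists are sorted, so the search can resume from the previous position.
        lo = bisect_left(long, doc_id, lo)
        if lo == len(long):
            break
        if long[lo] == doc_id:
            result.append(doc_id)
    return result

short_list = [3, 50, 98]
long_list = list(range(0, 100, 2))  # 50 even document IDs
print(intersect_by_binary_search(short_list, long_list))
```

The cost is about m · log n comparisons, which beats a linear merge of m + n steps when m is small relative to n.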

List Intersection

• When there are more than two lists, there are several possible heuristics depending on the list sizes

• If intersecting the two shortest lists gives a very small answer, it might be better to intersect that result with the next shortest list, and so on

• The algorithms are more complicated if lists are stored non-contiguously and/or compressed
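One such heuristic, processing the lists from shortest to longest, can be sketched as follows. The function name is hypothetical, and set membership is used in place of a tuned list-intersection routine to keep the example short.

```python
def intersect_many(lists):
    """Intersect several posting lists, starting with the shortest ones."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = list(ordered[0])
    for other in ordered[1:]:
        if not result:
            break  # the intersection is already empty, no need to read longer lists
        members = set(other)
        result = [doc_id for doc_id in result if doc_id in members]
    return result

print(intersect_many([[1, 2, 3, 4, 5, 6], [2, 4], [2, 3, 4, 8]]))
```

Starting with the shortest lists keeps the intermediate result small, so the longest lists are only checked against a handful of candidates.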

Inverted Indexing: Data Structures for IR

• Efficient data structures are needed to process large document collections quickly

• How do we store documents in order to maximize retrieval performance?
  • We must avoid linear scans of text (e.g. grep command) at query time

• We must index documents in advance

Inverted Indexing: Term-document incidence matrix

• Naïve data structure: term-document incidence matrix

• Is it feasible for large document collections?
  • Consider N = 10^6 documents, each with about 1K terms
  • Avg. 6 bytes/term including spaces/punctuation
  • 6 GB of data in the documents
  • Suppose there are M = 500K distinct terms among the documents

Inverted Indexing: Term-document incidence matrix

• 500K × 1M matrix has half-a-trillion 0’s and 1’s

• But it has no more than one billion 1’s

• The matrix is extremely sparse
  • What’s a better representation?

• We only record the 1’s positions → inverted index
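The arithmetic behind these figures can be checked with a few lines of Python. The variable names are chosen for the example; the numbers are the ones given on the slides.

```python
n_docs = 10**6          # N documents
terms_per_doc = 1_000   # about 1K terms each
bytes_per_term = 6      # average, including spaces/punctuation
n_terms = 500_000       # M distinct terms

collection_bytes = n_docs * terms_per_doc * bytes_per_term  # size of the raw text
matrix_cells = n_terms * n_docs                             # cells in the incidence matrix
max_ones = n_docs * terms_per_doc                           # upper bound on the number of 1's
density = max_ones / matrix_cells                           # fraction of cells that can be 1

print(collection_bytes, matrix_cells, max_ones, density)
```

At most 0.2% of the cells can hold a 1, so storing only the positions of the 1's is far cheaper than storing the whole matrix.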

Indexing: Inverted index

• For each term, we have a list that records in which documents the term occurs
  • Each item in the list is conventionally called a posting

• A posting is a tuple of the form (t_i, d_j), where t_i is a term identifier and d_j is a document identifier

• The list is called a postings list (or inverted list)

• All the postings lists taken together are referred to as the postings

Indexing: Inverted index

• Inverted indexes are independent from the adopted IR model (Boolean model, vector space model, etc.)

• Each posting usually contains:
  • The identifier of the linked document
  • The frequency of appearance of the term in the document
  • The position of each occurrence of the term in the document (optional)
    • Expressed as number of words from the beginning of the document, number of bytes, etc.
    • A.k.a. positional posting

• For each term, the dictionary usually also stores the frequency of the term, typically the number of documents that contain it
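The content of a posting can be sketched as a small record type. The `Posting` class, its field names, and the helper `postings_for` are hypothetical; positions are counted in words from the beginning of the document.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    doc_id: int                                    # identifier of the linked document
    term_frequency: int                            # occurrences of the term in that document
    positions: list = field(default_factory=list)  # optional word offsets (positional posting)

def postings_for(term, docs):
    """Build the postings list of `term`, sorted by document identifier."""
    result = []
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()
        positions = [i for i, tok in enumerate(tokens) if tok == term]
        if positions:
            result.append(Posting(doc_id, len(positions), positions))
    return result

docs = {1: "the game of life is a game of everlasting learning", 2: "never stop learning"}
game_postings = postings_for("game", docs)
print(game_postings)
print(len(postings_for("learning", docs)))  # number of documents containing the term
```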

Indexing: Building the Inverted Index

Steps:

1. Collect the documents to be indexed
  • “Friends, Romans, countrymen…”
  • “So let it be with Caesar…”
  • …

2. Tokenize the text, turning each document into a list of tokens
  • |Friends|Romans|countrymen|So|…

3. Linguistic preprocessing. Each document is represented as a list of normalized tokens
  • |friends|romans|countrymen|so|…

4. Index the documents in which each token occurs; the index consists of a dictionary and postings

Indexing: Building the Inverted Index

• Within the document collection, each document is assumed to have a unique serial number, the document identifier (docID)

• Indexing process:
  1. Input: the list of normalized tokens for each document
  2. Sort the terms alphabetically
  3. Merge multiple occurrences of the same term
  4. Record the frequency of the term in the document
    • not needed by the Boolean model
    • used by the vector space model
  5. Group instances of the same term and split the result into the dictionary and the postings

Inverted Indexing: Example

1. Sequence of (token, docID) pairs

2. Sort the terms

3. Merge multiple entries of the same term

4. Add frequency information

5. Split the result into dictionary and postings

Each posting can store additional information, such as the frequency of the term in each document and the positions of the term within that document
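The five steps can be sketched in Python as follows. This is a minimal in-memory illustration; `build_index`, the whitespace tokenization, and the sample documents are assumptions made for the example.

```python
from itertools import groupby

def build_index(docs):
    # Step 1: sequence of (term, docID) pairs from the normalized tokens
    pairs = [(token, doc_id) for doc_id, text in docs.items()
             for token in text.lower().split()]
    # Step 2: sort by term, then by docID
    pairs.sort()
    dictionary, postings = {}, {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        # Steps 3-4: merge repeated entries and record the frequency per document
        frequency = {}
        for _, doc_id in group:
            frequency[doc_id] = frequency.get(doc_id, 0) + 1
        # Step 5: the dictionary keeps the document frequency,
        # the postings keep (docID, term frequency) pairs
        dictionary[term] = len(frequency)
        postings[term] = sorted(frequency.items())
    return dictionary, postings

docs = {1: "to be or not to be", 2: "to do"}
dictionary, postings = build_index(docs)
print(dictionary["to"], postings["to"])
```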

Inverted Indexing: Processing Boolean queries

• Consider processing the Boolean query q = t_1 ∧ t_2
  • Locate t_1 in the dictionary
  • Retrieve its postings
  • Locate t_2 in the dictionary
  • Retrieve its postings
  • Intersect the two postings lists

• The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms
  • If the list lengths are n_1 and n_2, then it takes O(n_1 + n_2) operations

• Assumption: postings sorted by docID


• Walk through the two postings simultaneously, in time linear in the total number of postings entries
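The simultaneous walk through two postings lists sorted by docID can be sketched as below. The function name and the sample lists are chosen for the example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. This only works because both lists are sorted by docID.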

• We would like to be able to extend the intersection operation to process more complicated queries like

• This is accomplished by means of query optimization
  • How to organize the work of answering a query so that the least amount of work needs to be done by the system

Query-Document Similarity

• Dokumen_01: The game of life is a game of everlasting learning

• Dokumen_02: The unexamined life is not worth living

• Dokumen_03: Never stop learning

• Query: life learning

Step 1: Term Frequency (TF)

• TF Dokumen_01

Term            the  game  of  life  is  a  everlasting  learning
Term Frequency  1    2     2   1     1   1  1            1

• TF Dokumen_02

Term            the  unexamined  life  is  not  worth  living
Term Frequency  1    1           1     1   1    1      1

• TF Dokumen_03

Term            never  stop  learning
Term Frequency  1      1     1

TF Normalization: by Number of Terms

• Normalized TF = number of occurrences of the term / total number of terms in the document

• Normalized TF Dokumen_01

Term           the  game  of   life  is   a    everlasting  learning
Normalized TF  0.1  0.2   0.2  0.1   0.1  0.1  0.1          0.1

• Normalized TF Dokumen_02

Term           the       unexamined  life      is        not       worth     living
Normalized TF  0.142857  0.142857    0.142857  0.142857  0.142857  0.142857  0.142857

• Normalized TF Dokumen_03

Term           never     stop      learning
Normalized TF  0.333333  0.333333  0.333333

Python Code: Computing Normalized TF

    def termFrequency(term, document):
        # Normalized TF: occurrences of the term divided by the number of tokens
        normalizeDocument = document.lower().split()
        return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))

Step 2: Inverse Document Frequency (IDF)

IDF for the term game

IDF(game) = 1 + log_e(total number of documents / number of documents containing the term game)

There are 3 documents in total: Dokumen_01, Dokumen_02, Dokumen_03

The term game appears only in Dokumen_01

IDF(game) = 1 + log_e(3 / 1)

= 1 + 1.098612289

= 2.098612289

e = 2.718281828

IDF of All Terms

Terms        IDF
the          1.405465108
game         2.098612289
of           2.098612289
life         1.405465108
is           1.405465108
a            2.098612289
everlasting  2.098612289
learning     1.405465108
unexamined   2.098612289
not          2.098612289
worth        2.098612289
living       2.098612289
never        2.098612289
stop         2.098612289

Python Code: Computing IDF

    from math import log

    def inverseDocumentFrequency(term, allDocuments):
        # allDocuments maps a document name to its text
        numDocumentsWithThisTerm = 0
        for doc in allDocuments:
            if term.lower() in allDocuments[doc].lower().split():
                numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1
        if numDocumentsWithThisTerm > 0:
            # IDF = 1 + ln(total documents / documents containing the term)
            return 1.0 + log(float(len(allDocuments)) / numDocumentsWithThisTerm)
        else:
            return 1.0

Step 3: TF * IDF

• Query: life learning

          Dokumen_01   Dokumen_02   Dokumen_03
life      0.140546511  0.200780730  0
learning  0.140546511  0            0.468488369
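The TF * IDF weights of the query terms can be recomputed with the sketch below. It assumes the normalized TF and the 1 + ln(N/df) form of IDF used in the previous steps; the helper names `tf`, `idf`, and `tf_idf` are chosen for the example.

```python
from math import log

docs = {
    "Dokumen_01": "The game of life is a game of everlasting learning",
    "Dokumen_02": "The unexamined life is not worth living",
    "Dokumen_03": "Never stop learning",
}

def tf(term, text):
    """Normalized TF: occurrences divided by the number of tokens."""
    tokens = text.lower().split()
    return tokens.count(term.lower()) / len(tokens)

def idf(term, documents):
    """IDF = 1 + ln(N / df); 1.0 if the term occurs in no document."""
    df = sum(term.lower() in text.lower().split() for text in documents.values())
    return 1.0 + log(len(documents) / df) if df else 1.0

def tf_idf(term, name):
    return tf(term, docs[name]) * idf(term, docs)

for term in ("life", "learning"):
    print(term, [round(tf_idf(term, name), 6) for name in docs])
```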

Step 4: VSM - Cosine Similarity

Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + … + d1[n] * d2[n]

||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)

||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)

https://janav.files.wordpress.com/2013/10/cosinesimilarity.jpg
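The cosine similarity formula translates almost directly into Python. The sketch below assumes both vectors have the same length; returning 0.0 for a zero vector is a choice made here to avoid a division by zero, not something the slides specify.

```python
from math import sqrt

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: a zero vector is treated as not similar to anything
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 1.0], [2.0, 2.0]))
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why a short query can match a long document.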

TF*IDF for the Query

          TF   IDF          TF*IDF
life      0.5  1.405465108  0.702732554
learning  0.5  1.405465108  0.702732554

Cosine Similarity of the Query with Dokumen_01

Cosine Similarity(Query, Dokumen_01) = Dot product(Query, Dokumen_01) / (||Query|| * ||Dokumen_01||)

Dot product(Query, Dokumen_01)

= ((0.702732554) * (0.140546511) + (0.702732554) * (0.140546511))

= 0.197533217

||Query|| = sqrt((0.702732554)^2 + (0.702732554)^2) = 0.993813909

||Dokumen_01|| = sqrt((0.140546511)^2 + (0.140546511)^2) = 0.198762782

Cosine Similarity(Query, Dokumen_01) = 0.197533217 / ((0.993813909) * (0.198762782))

= 0.197533217 / 0.197533217

= 1

Similarity of the Documents to the Query

                   Dokumen_01  Dokumen_02   Dokumen_03
Cosine Similarity  1           0.707106781  0.707106781

https://janav.files.wordpress.com/2013/10/cosinesimiarlity.jpeg
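The whole ranking can be reproduced with the sketch below. As in the worked example, it assumes the vectors are restricted to the query terms; the helper names are chosen for illustration.

```python
from math import log, sqrt

docs = {
    "Dokumen_01": "The game of life is a game of everlasting learning",
    "Dokumen_02": "The unexamined life is not worth living",
    "Dokumen_03": "Never stop learning",
}
query = "life learning"

def tf(term, text):
    tokens = text.lower().split()
    return tokens.count(term) / len(tokens)

def idf(term):
    df = sum(term in text.lower().split() for text in docs.values())
    return 1.0 + log(len(docs) / df) if df else 1.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

terms = query.lower().split()
query_vec = [tf(t, query) * idf(t) for t in terms]  # TF*IDF vector of the query
scores = {name: cosine(query_vec, [tf(t, text) * idf(t) for t in terms])
          for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores)
```

Dokumen_01 is ranked first because it contains both query terms with equal weight, so its vector points in the same direction as the query vector.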

TF Weighting Variants

IDF Variants

Alternatives to Cosine Similarity

• Jaccard distance

• Kullback-Leibler divergence

• Euclidean distance

Example: IR for a Book Collection

• Corpus: 10,000 books

• 5 books:
  • Analytics and Big-Data
  • The Hanging Tree
  • Broken Dreams
  • Blessed kid
  • Girl with a Dragon Tattoo

• Query: Book for Analytics newbie

• Which book best matches the query?

Term Frequency (TF) Matrix

https://www.analyticsvidhya.com/wp-content/uploads/2015/04/table1.png

TF Normalization

Normalized TF = 1 + log(TF) if TF > 0

Normalized TF = 0 if TF = 0

https://www.analyticsvidhya.com/wp-content/uploads/2015/04/table2.png
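The logarithmic normalization can be written as a small function. The slide does not state the base of the logarithm; base 10 is assumed here for illustration, and the function name is hypothetical.

```python
from math import log10

def log_normalized_tf(raw_tf):
    """Return 1 + log10(TF) for TF > 0, and 0 for TF = 0 (base 10 is an assumption)."""
    return 1 + log10(raw_tf) if raw_tf > 0 else 0.0

for raw in (0, 1, 10, 100):
    print(raw, log_normalized_tf(raw))
```

The logarithm dampens large counts: a term that occurs 100 times gets a weight of 3, not 100 times the weight of a term that occurs once.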

Relevance of Documents to the Query

Document 1 : 1.7 + 3.1 + 2.8 + 1 = 8.6

Document 2 : 2.3 + 3.0 + 0 + 2 = 7.3

Document 3 : 2.5 + 3.0 + 0 + 2 = 7.5

Document 4 : 2.6 + 3.0 + 0 + 2.3 = 7.9

Document 5 : 2.3 + 3.0 + 0 + 2.5 = 7.8

Inverse Document Frequency (IDF) Matrix

IDF = log (N/DF)

https://www.analyticsvidhya.com/wp-content/uploads/2015/04/IDF.png
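For the 10,000-book corpus, IDF = log(N/DF) can be evaluated as below. The base of the logarithm (10) and the document frequencies used in the calls are assumptions for illustration; the actual DF values are in the table image.

```python
from math import log10

def idf(n_docs, doc_freq):
    """IDF = log10(N / DF); base 10 is an assumption."""
    return log10(n_docs / doc_freq)

N = 10_000  # corpus size from the example
print(idf(N, 10))      # hypothetical rare term found in 10 books
print(idf(N, 10_000))  # hypothetical term found in every book
```

A term that appears in every book gets an IDF of 0, so it contributes nothing to the ranking, while rare terms are weighted up.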

TF-IDF Matrix

https://www.analyticsvidhya.com/wp-content/uploads/2015/04/TFIDF.png