Information Retrieval in Practice

Husni


Stages of Building an IR System

• Crawling
  • Collect the documents
  • Separate text from URLs

• Preprocessing
  • Linguistic processing of the documents, including tokenization, case folding, normalization, and weighting

• Indexing
  • Build the (inverted) list that maps each term to document numbers

• Querying
  • Find the documents relevant to the query


Preprocessing


Indexing: Basic Concepts

• Inverted index: a word-oriented mechanism for indexing a text collection to speed up the searching task

• The inverted index structure is composed of two elements: the vocabulary and the occurrences

• The vocabulary is the set of all different words in the text

• For each word in the vocabulary the index stores the documents which contain that word (inverted index)


Indexing: Basic Concepts

• Term-document matrix: the simplest way to represent the documents that contain each word of the vocabulary


Indexing: Basic Concepts

• The main problem of this simple solution is that it requires too much space

• As this is a sparse matrix, the solution is to associate a list of documents with each word

• The set of all those lists is called the occurrences
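The contrast between the term-document matrix and the occurrence lists can be sketched in a few lines of Python. This is only an illustration: the three-document toy collection and the names `docs`, `matrix`, and `occurrences` are assumptions, not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy collection: docID -> text (not from the slides).
docs = {
    1: "to do is to be",
    2: "to be is to do",
    3: "do be do be do",
}
doc_ids = sorted(docs)

# Vocabulary: the set of all different words in the collection.
vocabulary = sorted({word for text in docs.values() for word in text.split()})

# Term-document matrix: one row per word, one 0/1 cell per document.
matrix = {
    word: [1 if word in docs[d].split() else 0 for d in doc_ids]
    for word in vocabulary
}

# Occurrences: for each word, keep only the documents that contain it.
occurrences = defaultdict(list)
for d in doc_ids:
    for word in sorted(set(docs[d].split())):
        occurrences[word].append(d)

print(matrix["is"])       # [1, 1, 0]
print(occurrences["is"])  # [1, 2]
```

The matrix always stores one cell per (word, document) pair, while the lists store only the pairs that actually occur, which is where the space saving on a sparse matrix comes from.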


Indexing: Basic Concepts

• Basic inverted index


Full Inverted Indexes

• The basic index is not suitable for answering phrase or proximity queries

• Hence, we need to add the positions of each word in each document to the index (full inverted index)


Full Inverted Indexes

• In the case of multiple documents, we need to store one occurrence list per term-document pair
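A minimal sketch of a full (positional) inverted index, assuming whitespace tokenization and word offsets as positions. The function names and the two sample documents are hypothetical; the two-word phrase lookup is one simple way to use the stored positions, not necessarily the one the slides had in mind.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {docID: [word positions]}, one list per term-document pair."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, first, second):
    """Return the docIDs where `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample documents.
docs = {1: "never stop learning", 2: "stop never learning to stop"}
index = build_positional_index(docs)
print(index["stop"])                         # {1: [1], 2: [0, 4]}
print(phrase_query(index, "never", "stop"))  # [1]
```

Both documents contain "never" and "stop", so a basic index cannot tell them apart; only the positions show that the phrase occurs in document 1 alone.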


• The space required for the vocabulary is rather small

• Heaps’ law: the vocabulary grows as $O(n^{\beta})$, where
  • $n$ is the collection size
  • $\beta$ is a collection-dependent constant between 0.4 and 0.6

• For instance, in the TREC-3 collection, the vocabulary of 1 gigabyte of text occupies only 5 megabytes

• This may be further reduced by stemming and other normalization techniques
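To get a feel for how slowly the vocabulary grows under Heaps' law, the estimate can be written as k * n^beta. In the sketch below, `k = 10.0` is an arbitrary illustrative constant (the slides give no value for it), and `beta = 0.5` is simply a value inside the stated 0.4 to 0.6 range.

```python
def heaps_vocabulary_size(n_words, k=10.0, beta=0.5):
    """Estimate the number of distinct words as k * n_words**beta (Heaps' law).

    k and beta are collection dependent; the defaults are illustrative assumptions.
    """
    return k * n_words ** beta

for n in (10**6, 10**8):
    print(n, round(heaps_vocabulary_size(n)))
```

With beta = 0.5, making the collection 100 times larger makes the estimated vocabulary only 10 times larger.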


• The occurrences demand much more space

• The extra space will be O(n) and is around
  • 40% of the text size if stopwords are omitted
  • 80% when stopwords are indexed

• Document-addressing indexes are smaller, because only one occurrence per file must be recorded, for a given word

• Depending on the document (file) size, document-addressing indexes typically require 20% to 40% of the text size


Full Inverted Indexes

• To reduce space requirements, a technique called block addressing is used

• The documents are divided into blocks, and the occurrences point to the blocks where the word appears
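Block addressing can be sketched as follows, assuming fixed-size blocks measured in words. The function names and the choice of four words per block are hypothetical. Because the index only records block numbers, exact positions are recovered by scanning the candidate blocks.

```python
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the ascending list of blocks (block_size words each) containing it."""
    words = text.lower().split()
    index = defaultdict(list)
    for position, word in enumerate(words):
        block = position // block_size
        # Repeated occurrences inside one block collapse into a single entry.
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return words, index

def find_positions(words, index, word, block_size):
    """Scan only the candidate blocks to recover the exact word positions."""
    positions = []
    for block in index.get(word, []):
        start = block * block_size
        for offset, candidate in enumerate(words[start:start + block_size]):
            if candidate == word:
                positions.append(start + offset)
    return positions

text = "the game of life is a game of everlasting learning"
words, index = build_block_index(text, block_size=4)
print(index["game"])                            # [0, 1]
print(find_positions(words, index, "game", 4))  # [1, 6]
```

The index gets smaller as blocks get larger, at the price of more text to scan per matching block.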


• The Table below presents the projected space taken by inverted indexes for texts of different sizes


• The blocks can be of fixed size or they can be defined using the division of the text collection into documents

• The division into blocks of fixed size improves efficiency at retrieval time
  • This is because larger blocks match queries more frequently and are more expensive to traverse

• This technique also profits from locality of reference
  • That is, the same word will be used many times in the same context and all the references to that word will be collapsed in just one reference


Single Word Queries

• The simplest type of search is that for the occurrences of a single word

• The vocabulary search can be carried out using any suitable data structure
  • Ex: hashing, tries, or B-trees

• The first two provide O(m) search cost, where m is the length of the query

• We note that the vocabulary is in most cases sufficiently small so as to stay in main memory

• The occurrence lists, on the other hand, are usually fetched from disk


Multiple Word Queries

• If the query has more than one word, we have to consider two cases:
  • conjunctive (AND operator) queries
  • disjunctive (OR operator) queries

• Conjunctive queries require searching for all the words in the query, obtaining one inverted list for each word

• We then have to intersect all the inverted lists to obtain the documents that contain all these words

• For disjunctive queries the lists must be merged

• The first case is popular on the Web due to the size of the document collection
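For the disjunctive case, the merge of two inverted lists can be sketched as below, assuming each list holds docIDs in ascending order. The function name `union_sorted` is hypothetical.

```python
def union_sorted(a, b):
    """Merge two ascending docID lists into one, keeping each docID once (OR query)."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    # At most one of the two lists still has unread entries.
    result.extend(a[i:])
    result.extend(b[j:])
    return result

print(union_sorted([1, 3, 5], [2, 3, 8]))  # [1, 2, 3, 5, 8]
```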


List Intersection

• The most time-demanding operation on inverted indexes is the merging of the lists of occurrences
  • Thus, it is important to optimize it

• Consider one pair of lists of sizes m and n respectively, stored in consecutive memory, that needs to be intersected

• If m is much smaller than n, it is better to do m binary searches in the larger list to do the intersection

• If m and n are comparable, Baeza-Yates devised a double binary search algorithm
  • It is O(log n) if the intersection is trivially empty
  • It requires less than m + n comparisons on average
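The first strategy, m binary searches in the larger list, can be sketched with the standard `bisect` module. This is the simple O(m log n) approach for the case where m is much smaller than n, not the double binary search algorithm attributed to Baeza-Yates. The function name is hypothetical, and both lists are assumed to be sorted in ascending order.

```python
from bisect import bisect_left

def intersect_by_binary_search(short, long):
    """Intersect two ascending docID lists using one binary search per element of `short`."""
    result = []
    lo = 0  # `short` is ascending, so each search can start where the previous one ended
    for doc_id in short:
        lo = bisect_left(long, doc_id, lo)
        if lo == len(long):
            break  # everything left in `short` is larger than all of `long`
        if long[lo] == doc_id:
            result.append(doc_id)
    return result

evens = list(range(0, 200, 2))
print(intersect_by_binary_search([3, 50, 97, 198, 500], evens))  # [50, 198]
```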


List Intersection

• When there are more than two lists, there are several possible heuristics depending on the list sizes

• If intersecting the two shortest lists gives a very small answer, it might be better to intersect that with the next shortest list, and so on

• The algorithms are more complicated if lists are stored non-contiguously and/or compressed
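One of the simplest heuristics for more than two lists is to process them from shortest to longest, so the intermediate answer stays as small as possible. The sketch below assumes uncompressed in-memory lists and, as a simplification, uses a Python `set` for the membership test rather than a merge or binary search. The function name is hypothetical.

```python
def intersect_many(lists):
    """Intersect several docID lists, starting from the shortest one."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate answer cannot grow again
        members = set(postings)
        result = [doc_id for doc_id in result if doc_id in members]
    return result

print(intersect_many([[1, 2, 3, 4, 5, 6], [2, 4], [2, 3, 4, 9]]))  # [2, 4]
```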


Inverted Indexing: Data Structures for IR

• Efficient data structures needed to process large document collections quickly

• How do we store documents in order to maximize retrieval performance?
  • We must avoid linear scans of text (e.g. grep command) at query time

• We must index documents in advance


Inverted Indexing: Term-document incidence matrix

• Naïve data structure: term-document incidence matrix

• Is it feasible for large document collections?
  • Consider $N = 10^6$ documents, each with about 1K terms
  • Avg. 6 bytes/term including spaces/punctuation
  • 6 GB of data in the documents
  • Suppose there are $M$ = 500K distinct terms among the documents


Inverted Indexing: Term-document incidence matrix

• 500K × 1M matrix has half-a-trillion 0’s and 1’s

• But it has no more than one billion 1’s

• Matrix is extremely sparse
  • What’s a better representation?

• We only record the 1’s positions → inverted index
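The sizes quoted on these two slides follow from simple arithmetic, taking N as one million documents (as the 500K × 1M matrix implies). The variable names below are illustrative.

```python
n_docs = 10**6          # N: number of documents
terms_per_doc = 1_000   # about 1K terms per document
bytes_per_term = 6      # average, including spaces and punctuation
n_terms = 500_000       # M: distinct terms in the collection

collection_bytes = n_docs * terms_per_doc * bytes_per_term
matrix_cells = n_terms * n_docs
# Upper bound on the 1's: every token of every document marks at most one cell.
max_ones = n_docs * terms_per_doc
density = max_ones / matrix_cells

print(collection_bytes)  # 6000000000   (6 GB)
print(matrix_cells)      # 500000000000 (half a trillion cells)
print(max_ones)          # 1000000000   (one billion)
print(density)           # 0.002
```

At most 0.2% of the cells can hold a 1, which is why recording only those positions is so much cheaper than storing the matrix.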


Indexing: Inverted index

• For each term, we have a list that records in which documents the term occurs
  • Each item in the list is conventionally called a posting

• A posting is a tuple of the form $\langle t_i, d_j \rangle$, where $t_i$ is a term identifier and $d_j$ is a document identifier

• The list is called a posting list (or inverted list)

• All the posting lists taken together are referred to as the postings


Indexing: Inverted index

• Inverted indexes are independent from the adopted IR model (Boolean model, vector space model, etc.)

• Each posting usually contains:
  • The identifier of the linked document
  • The frequency of appearance of the term in the document
  • The position of the term for each document (optional)
    • Expressed as number of words from the beginning of the document, number of bytes, etc.
    • A.k.a. positional posting

• For each term, the frequency of appearance of the term is usually also stored in the dictionary


Indexing: Building an Inverted Index

Steps:

1. Collect the documents to be indexed
  • “Friends, Romans, countrymen…”
  • “So let it be with Caesar…”
  • …

2. Tokenize the text, turning each document into a list of tokens
  • |Friends|Romans|countrymen|So|…

3. Linguistic preprocessing. Each document is represented as a list of normalized tokens
  • |friends|romans|countrymen|so|…

4. Index the documents in which each token occurs, producing a dictionary and postings


Indexing: Building an Inverted Index

• Within a document collection, each document is assumed to have a unique serial number, the document identifier (docID)

• Indexing process:
  1. Input: the list of normalized tokens for each document
  2. Sort the terms alphabetically
  3. Merge multiple occurrences of the same term
  4. Record the frequency of the term in the document
    • not needed by the Boolean model
    • used by the vector space model
  5. Group instances of the same term and split the result into dictionary and postings
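The indexing steps above can be sketched as a small sort-based construction. Several things here are assumptions for illustration: the function name, the choice to keep the document frequency in the dictionary, and the third sample document. The first two documents reuse the normalized tokens shown earlier.

```python
from itertools import groupby

def build_inverted_index(normalized_docs):
    """normalized_docs maps docID -> list of normalized tokens."""
    # Step 1: sequence of (term, docID) pairs.
    pairs = [(term, doc_id)
             for doc_id, tokens in normalized_docs.items()
             for term in tokens]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    dictionary, postings = {}, {}
    # Steps 3-5: merge duplicates, record frequencies, split dictionary and postings.
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        doc_ids = [doc_id for _, doc_id in group]
        postings[term] = [(d, doc_ids.count(d)) for d in sorted(set(doc_ids))]
        dictionary[term] = len(postings[term])  # number of documents containing the term
    return dictionary, postings

docs = {
    1: ["friends", "romans", "countrymen"],
    2: ["so", "let", "it", "be", "with", "caesar"],
    3: ["caesar", "was", "ambitious", "caesar"],  # hypothetical extra document
}
dictionary, postings = build_inverted_index(docs)
print(dictionary["caesar"])  # 2
print(postings["caesar"])    # [(2, 1), (3, 2)]
```

Each posting is a (docID, term frequency) pair. A Boolean model would only need the docIDs, while a vector space model also uses the frequencies.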


Inverted Indexing: Example

1. Sequence of (token, docID) pairs

Inverted Indexing: Example

2. Sort the terms

3. Merge multiple entries of the same term

4. Add frequency information

Inverted Indexing: Example

5. Split the result into dictionary and postings

Each posting can store other information, such as the frequency of the term in each document and the positions of the term within that document


Inverted Indexing: Processing Boolean queries

• Consider processing the Boolean query: $q = t_1 \land t_2$
  • Locate $t_1$ in the dictionary
  • Retrieve its postings
  • Locate $t_2$ in the dictionary
  • Retrieve its postings
  • Intersect the two postings


Inverted Indexing: Processing Boolean queries

• The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms
  • If the list lengths are $n_1$ and $n_2$, then it takes $O(n_1 + n_2)$ operations.

• Assumption: postings sorted by docID


Inverted Indexing: Processing Boolean queries

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

• We would like to be able to extend the intersection operation to process more complicated queries like

• This is accomplished by means of query optimization
  • how to organize the work of answering a query so that the least amount of work needs to be done by the system
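The simultaneous walk through two postings lists can be sketched as a standard two-pointer merge. It relies on the stated assumption that postings are sorted by docID. The function name and the sample lists are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two docID lists sorted ascending, in time linear in their total length."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# Hypothetical postings for two terms.
postings_a = [1, 2, 4, 11, 31, 45, 173, 174]
postings_b = [2, 31, 54, 101]
print(intersect(postings_a, postings_b))  # [2, 31]
```

Each loop iteration advances at least one pointer, so the loop runs at most len(p1) + len(p2) times.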


Query-Document Similarity

• Document_01: The game of life is a game of everlasting learning

• Document_02: The unexamined life is not worth living

• Document_03: Never stop learning

• Query: life learning


Step 1: Term Frequency (TF)

• TF of Document_01

Term            the  game  of  life  is  a  everlasting  learning
Term Frequency  1    2     2   1     1   1  1            1

• TF of Document_02

Term            the  unexamined  life  is  not  worth  living
Term Frequency  1    1           1     1   1    1      1

• TF of Document_03

Term            never  stop  learning
Term Frequency  1      1     1


TF Normalization: Number of Terms

• Normalized TF of Document_01

Term           the  game  of   life  is   a    everlasting  learning
Normalized TF  0.1  0.2   0.2  0.1   0.1  0.1  0.1          0.1

• Normalized TF of Document_02

Term           the       unexamined  life      is        not       worth     living
Normalized TF  0.142857  0.142857    0.142857  0.142857  0.142857  0.142857  0.142857

• Normalized TF of Document_03

Term           never     stop      learning
Normalized TF  0.333333  0.333333  0.333333


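Python Code: Computing Normalized TF

def termFrequency(term, document):
    # Lowercase the document and split it on whitespace into words
    normalizeDocument = document.lower().split()
    # Occurrences of the term divided by the total number of words
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))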


Step 2: Inverse Document Frequency (IDF)

IDF for the term game

IDF(game) = 1 + log_e(total number of documents / number of documents containing the term game)

There are 3 documents in total: Document_01, Document_02, Document_03

The term game appears only in Document_01

IDF(game) = 1 + log_e(3 / 1)

= 1 + 1.098726209

= 2.098726209

e = 2.718281828


IDF of All Terms

Term         IDF
the          1.405507153
game         2.098726209
of           2.098726209
life         1.405507153
is           1.405507153
a            2.098726209
everlasting  2.098726209
learning     1.405507153
unexamined   2.098726209
not          2.098726209
worth        2.098726209
living       2.098726209
never        2.098726209
stop         2.098726209


Python Code: Computing IDF

from math import log

def inverseDocumentFrequency(term, allDocuments):
    # allDocuments is indexed by document key and yields each document's text
    numDocumentsWithThisTerm = 0
    for doc in allDocuments:
        if term.lower() in allDocuments[doc].lower().split():
            numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1
    if numDocumentsWithThisTerm > 0:
        # Natural logarithm, as in the worked example for the term game
        return 1.0 + log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0


Step 3: TF * IDF

• Query: life learning

Term      Document_01  Document_02  Document_03
life      0.140550715  0.200786736  0
learning  0.140550715  0            0.468502384


Step 4: VSM - Cosine Similarity

Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + … + d1[n] * d2[n]

||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)

||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)
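A direct transcription of these formulas into Python could look like the sketch below. The function name is hypothetical, and returning 0.0 for an all-zero vector is an assumption, since the formula itself is undefined in that case.

```python
from math import sqrt

def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length weight vectors."""
    dot_product = sum(x * y for x, y in zip(d1, d2))
    norm_d1 = sqrt(sum(x * x for x in d1))
    norm_d2 = sqrt(sum(y * y for y in d2))
    if norm_d1 == 0 or norm_d2 == 0:
        return 0.0  # assumption: an all-zero vector is treated as not similar to anything
    return dot_product / (norm_d1 * norm_d2)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Vectors that point in the same direction score 1 regardless of their lengths, which is why the measure is insensitive to document length.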


TF*IDF for the Query

Term      TF   IDF          TF*IDF
life      0.5  1.405507153  0.702753576
learning  0.5  1.405507153  0.702753576


Cosine Similarity of the Query and Document_01

Cosine Similarity(Query, Document_01) = Dot product(Query, Document_01) / (||Query|| * ||Document_01||)

Dot product(Query, Document_01)

= (0.702753576 * 0.140550715) + (0.702753576 * 0.140550715)

= 0.197545035151

||Query|| = sqrt(0.702753576^2 + 0.702753576^2) = 0.993843638185

||Document_01|| = sqrt(0.140550715^2 + 0.140550715^2) = 0.198768727354

Cosine Similarity(Query, Document_01) = 0.197545035151 / (0.993843638185 * 0.198768727354)

= 0.197545035151 / 0.197545035151

= 1


Similarity of the Documents to the Query

                   Document_01  Document_02  Document_03
Cosine Similarity  1            0.707106781  0.707106781
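The whole worked example can be reproduced end to end with a short script. This is a sketch under the same conventions as the steps above (whitespace tokenization, length-normalized TF, IDF = 1 + ln(N / df), vectors restricted to the query terms). The helper names `tf`, `idf`, and `cosine` are illustrative.

```python
from math import log, sqrt

documents = {
    "Document_01": "The game of life is a game of everlasting learning",
    "Document_02": "The unexamined life is not worth living",
    "Document_03": "Never stop learning",
}
query = "life learning"

def tf(term, text):
    """Occurrences of the term divided by the number of words in the text."""
    words = text.lower().split()
    return words.count(term) / len(words)

def idf(term):
    """1 + ln(N / df), or 1.0 when no document contains the term."""
    containing = sum(1 for text in documents.values() if term in text.lower().split())
    return 1.0 + log(len(documents) / containing) if containing else 1.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

terms = query.lower().split()
query_vector = [tf(t, query) * idf(t) for t in terms]
scores = {
    name: cosine(query_vector, [tf(t, text) * idf(t) for t in terms])
    for name, text in documents.items()
}
for name, score in scores.items():
    print(name, round(score, 6))
```

Document_01 contains both query terms with equal weight, so its vector is parallel to the query vector and scores 1. The other two documents each match only one of the two terms and score about 0.707.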


TF Weighting Variants


IDF Variants


Alternatives to Cosine Similarity

• Jaccard distance

• Kullback-Leibler divergence

• Euclidean distance
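For reference, the three measures listed above can be sketched using their textbook definitions. The slide names them without giving formulas, so the exact variants below are assumptions: Jaccard distance over term sets, Kullback-Leibler divergence with the natural logarithm over probability vectors, and Euclidean distance over weight vectors.

```python
from math import log, sqrt

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)

def kl_divergence(p, q):
    """Sum of p_i * ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(jaccard_distance({"life", "learning"}, {"never", "stop", "learning"}))  # 0.75
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))                             # 5.0
```

Unlike cosine similarity, these are distances or divergences, so smaller values mean a closer match. Kullback-Leibler divergence is also not symmetric in its two arguments.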


Example: IR for a Book Collection

• Corpus: 10,000 books

• 5 books:
  • Analytics and Big-Data
  • The Hanging Tree
  • Broken Dreams
  • Blessed kid
  • Girl with a Dragon Tattoo

• Query: Book for Analytics newbie

• Which book best matches the query?


TF Normalization

Normalized TF = 1 + log(TF)  if TF > 0
Normalized TF = 0            if TF = 0
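As a small sketch, this logarithmic normalization can be written as a single function. The slide does not state the base of the logarithm; base 10 is assumed here, and the function name is hypothetical.

```python
from math import log10

def log_normalized_tf(raw_tf):
    """1 + log10(TF) for TF > 0, else 0 (base 10 is an assumption)."""
    return 1.0 + log10(raw_tf) if raw_tf > 0 else 0.0

for raw_tf in (0, 1, 10, 100):
    print(raw_tf, log_normalized_tf(raw_tf))
```

A term that occurs 100 times gets a weight of 3 rather than 100, so very frequent terms do not dominate the score.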


Relevance of the Documents to the Query

Document 1: 1.7 + 3.1 + 2.8 + 1 = 8.6

Document 2: 2.3 + 3.0 + 0 + 2 = 7.3

Document 3: 2.5 + 3.0 + 0 + 2 = 7.5

Document 4: 2.6 + 3.0 + 0 + 2.3 = 7.9

Document 5: 2.3 + 3.0 + 0 + 2.5 = 7.8


Inverse Document Frequency (IDF) Matrix

IDF = log (N/DF)
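A minimal sketch of this formula applied to the 10,000-book corpus follows. The base of the logarithm is not stated on the slide, so base 10 is assumed, and the two document frequencies (10 and 5,000) are hypothetical values chosen only to show the effect.

```python
from math import log10

def idf(n_docs, doc_freq):
    """IDF = log10(N / DF); assumes doc_freq > 0 and base 10."""
    return log10(n_docs / doc_freq)

N = 10_000  # corpus size from the book example
print(idf(N, 10))               # 3.0   (rare term, found in 10 books)
print(round(idf(N, 5_000), 3))  # 0.301 (common term, found in half the books)
```

Rare terms receive a much higher weight than common ones, so they contribute more to the relevance score.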