IR Homework #2
By J. H. Wang
Mar. 25, 2008
Programming Exercise #2: Term Weighting
• Goal: to assign TF-IDF weights for each index term in inverted files
• Input: inverted index files
  – (the output of HW#1)
• Output: term weighting files
  – (exact format to be described later)
Input: Inverted Index
• Two files
  – Vocabulary file: a sorted list of words (each word in a separate line)
  – Occurrences file: for each word, a list of occurrences in the original text
    • [word#] [term freq.] [ (doc#, char#) pairs]
    • 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
    • 2 2 (3, 44) (8, 72)
    • …
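A minimal sketch of reading one line of the occurrences file, assuming the line layout is exactly `[word#] [term freq.]` followed by `(doc#, char#)` pairs as in the example above. The function names `parse_occurrence_line` and `per_document_tf` are hypothetical and not part of the assignment.

```python
import re
from collections import Counter

# Matches one "(doc#, char#)" pair; whitespace inside the pair is tolerated.
PAIR_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_occurrence_line(line):
    """Split one occurrences-file line into (word_id, term_freq, pairs)."""
    # Everything before the first "(" holds the two leading integers.
    head = line.split("(", 1)[0].split()
    word_id, term_freq = int(head[0]), int(head[1])
    pairs = [(int(doc), int(pos)) for doc, pos in PAIR_RE.findall(line)]
    return word_id, term_freq, pairs


def per_document_tf(pairs):
    """Count how many times the word occurs in each document."""
    return Counter(doc for doc, _ in pairs)


word_id, term_freq, pairs = parse_occurrence_line(
    "1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)"
)
tf = per_document_tf(pairs)   # per-document term frequencies
doc_freq = len(tf)            # number of distinct documents containing the word
```

The per-document counts and the number of distinct documents are the two quantities the weighting step needs for each word.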
TF-IDF Weighting
• Term-document matrix (N*M)
  – Each row i contains the TF-IDF term weights w_ij for term t_i in document d_j
    • 0.3 0.7 0.0 0.2 0.9 0.0 0.0
    • 0.1 0.1 0.9 0.0 0.4 0.1 0.0
    • …
  – N: # of terms, M: # of documents
    • Ex: 20k words * 400 docs = 8M entries! But many of them are 0’s!
  – Sparse matrix: how can we store it efficiently?
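One possible answer to the storage question, sketched under assumptions: keep only the nonzero weights in a nested dictionary keyed by term number and document number. The example weights above are read here as two rows of seven columns each, which is an interpretation of the slide layout, and `to_sparse` is a hypothetical helper name.

```python
def to_sparse(dense_rows):
    """Keep only nonzero weights as {term#: {doc#: weight}} with 1-based ids."""
    sparse = {}
    for i, row in enumerate(dense_rows, start=1):
        entries = {j: w for j, w in enumerate(row, start=1) if w != 0.0}
        if entries:
            sparse[i] = entries
    return sparse


# The example weights, assumed to be two rows of seven columns.
dense = [
    [0.3, 0.7, 0.0, 0.2, 0.9, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.0, 0.4, 0.1, 0.0],
]
sparse = to_sparse(dense)
stored = sum(len(row) for row in sparse.values())  # entries actually kept
```

Here 9 of the 14 entries are kept. On a 20k * 400 matrix where most entries are zero, the saving would be much larger.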
Output Format
• w_ij = tf_ij * log (M / df_i)
  – tf_ij: frequency of term t_i in document d_j; df_i: # of documents containing t_i; M: # of documents
  – We only keep entries with nonzero tf_ij
• Similar to occurrences file
  – For each word, a list of nonzero entries in the term-document matrix
    • [word#] [doc freq.] [ (doc# (j), w_ij) pairs]
    • 1 4 (1, 0.3) (2, 0.7) (4, 0.2) (5, 0.9)
    • 2 5 (1, 0.1) (2, 0.1) (3, 0.9) (5, 0.4) (6, 0.1)
    • …
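A sketch of producing one output line from per-document term frequencies. It makes several assumptions: the numerator inside the logarithm is the total number of documents, the logarithm is natural (the slides do not specify a base), and weights are printed with four decimals. The name `tfidf_line` is hypothetical.

```python
import math


def tfidf_line(word_id, tf_by_doc, num_docs):
    """Format one output line: [word#] [doc freq.] (doc#, weight) pairs.

    tf_by_doc maps doc# -> term frequency and must be non-empty,
    since only documents with nonzero tf are listed.
    """
    df = len(tf_by_doc)                 # document frequency of the word
    idf = math.log(num_docs / df)       # natural log: an assumption
    pairs = " ".join(
        f"({doc}, {tf * idf:.4f})" for doc, tf in sorted(tf_by_doc.items())
    )
    return f"{word_id} {df} {pairs}"


# Word 2 from the occurrences example: once in doc 3 and once in doc 8,
# in a collection assumed to hold 400 documents.
line = tfidf_line(2, {3: 1, 8: 1}, num_docs=400)
print(line)
```

A word that appears in every document gets weight 0 under this formula, so it still appears in the file with zero-valued pairs unless it is filtered out separately.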
Implementation Issues
• You will need both TF (term frequency) and DF (document frequency) factors for each term
• You can calculate the term frequencies and document frequencies at the same time when you build the index
  – That is, you can combine HW#2 into HW#1 if necessary
• You may want to remove stopwords to further reduce the number of rows in the matrix
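The points above can be sketched as a single pass that collects both TF and DF while reading the documents, with optional stopword removal. The whitespace tokenizer, the tiny stopword list, and the name `count_tf_df` are illustrative assumptions only. A real submission would reuse the tokenizer from HW#1.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative list only


def count_tf_df(docs, remove_stopwords=True):
    """One pass over {doc#: text}; returns (tf, df).

    tf[word][doc#] is the term frequency; df[word] is the document frequency.
    """
    tf = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():       # naive whitespace tokenizer
            if remove_stopwords and word in STOPWORDS:
                continue
            tf[word][doc_id] += 1
    # A word's document frequency is the number of documents it occurs in.
    df = {word: len(per_doc) for word, per_doc in tf.items()}
    return tf, df


docs = {1: "the cat and the hat", 2: "a cat in a cat suit"}
tf, df = count_tf_df(docs)
```

With stopword removal on, only "cat", "hat" and "suit" remain as rows, which shows how stopwords reduce the size of the matrix.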
Optional Features
• Optional functionalities
  – Other weighting schemes, such as: probabilistic weighting
  – Stopword removal
  – Dimension reduction strategies, such as Latent Semantic Indexing (or SVD)
  – It should be possible to turn each of them off with a parameter
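A sketch of how optional features might be switched by parameters, using Python's standard `argparse`. The flag names are hypothetical. The "probabilistic" formula shown, log((M - df) / df) clamped at zero, is one common variant from the IR literature and may not be the scheme the assignment has in mind.

```python
import argparse
import math


def idf_standard(num_docs, df):
    """Plain IDF: log(M / df)."""
    return math.log(num_docs / df)


def idf_probabilistic(num_docs, df):
    """One common probabilistic IDF variant, clamped so it is never negative."""
    if df >= num_docs:
        return 0.0
    return max(0.0, math.log((num_docs - df) / df))


IDF_SCHEMES = {"tfidf": idf_standard, "prob": idf_probabilistic}


def build_parser():
    parser = argparse.ArgumentParser(description="HW#2 term weighting (sketch)")
    parser.add_argument("--weighting", choices=sorted(IDF_SCHEMES),
                        default="tfidf", help="IDF scheme to use")
    parser.add_argument("--remove-stopwords", action="store_true",
                        help="drop stopwords before weighting (off by default)")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--weighting", "prob"])
idf = IDF_SCHEMES[args.weighting]
```

Keeping the default at plain TF-IDF with no extra processing means the required output is unchanged unless a flag is given explicitly.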
Submission
• Your submission should include
  – The source code (and optionally your executable file)
  – A one-page description that includes the following
    • Major features in your work (e.g., high efficiency, low storage, able to deal with multiple formats, …)
    • Major difficulties encountered
    • Special requirements for execution environments (e.g., Java Runtime Environment)
    • For team work, the name and responsibilities of each member should be clearly identified
• Due: three weeks (Apr. 16, 2008)
Evaluation
• The TF-IDF weighting files generated by your program will be checked for correctness
• Optional features such as probabilistic weighting and latent semantic indexing will be counted as a bonus
• You may be required to give a demo if the TA is unable to run the submitted program
Questions?