FYP Progress
Sudhanshu Khemka
Outline
CIKM
Implementation of Smoothing techniques on the GPU
Re-running experiments using the wt2g collection
The Future
CIKM
Began my FYP by revamping my UROP report and submitting it for publication to CIKM. Learnt:
- the importance of succinct writing
- the importance of re-drawing images
- the importance of re-writing equations
Comments by the reviewers:
- The definition of the language modeling approach is not clear.
- A standard dataset should be used.
We also noticed that we need to improve our smoothing model.
Direction for my FYP
Implemented smoothing models:
The Good-Turing smoothing algorithm
The Kneser-Ney smoothing algorithm
Good-Turing Smoothing
Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.
Smoothed count: $c^* = (c+1)\frac{N_{c+1}}{N_c}$
Smoothed probability: $P(c^*) = \frac{c^*}{N}$
In the above definitions, $N_c$ is the number of N-grams that occur c times, and $N$ is the total number of N-gram occurrences observed.
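To make the formulas concrete, here is a minimal pure-Python sketch (the function name and return shape are mine, not the project's code); the example table below supplies counts to feed it:

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing smoothed counts: c* = (c+1) * N_{c+1} / N_c."""
    nc = Counter(counts)             # N_c: number of N-grams occurring exactly c times
    total = sum(counts)              # N: total number of N-gram occurrences
    smoothed = {}
    for c in nc:
        if nc.get(c + 1, 0) > 0:     # c* is only defined where N_{c+1} > 0
            smoothed[c] = (c + 1) * nc[c + 1] / nc[c]
    return smoothed, total

# Doc 1 counts from the table below: [1, 0, 2, 1]
smoothed, total = good_turing([1, 0, 2, 1])
print(smoothed, total)               # {1: 1.0, 0: 2.0} 4 ; P(c*) = c*/N
```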
Example bigram counts:

Bigram  | Doc 1 | Doc 2
a shoe  |   1   |   1
a cat   |   0   |   2
foo bar |   2   |   0
a dog   |   1   |   2
The GPU implementation applies these formulas in two phases:
1) Calculate the N_c values
2) Smooth the counts
Calculating the N_c values (Doc 1):
Doc1 counts: [1, 0, 2, 1]
Sort: [0, 1, 1, 2]
Positions: [0, 1, 2, 3]
Stream compaction keeps each distinct count and the position where it first appears:
Distinct counts: [0, 1, 2] at start positions [0, 1, 3]
Differencing consecutive start positions (with the array length, 4, closing the last run) gives the counts of counts:
Doc1: N_0 = 1, N_1 = 2, N_2 = 1
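A CPU analogue of this phase, with NumPy's sort and boolean masking standing in for the GPU sort and stream-compaction primitives (a sketch under that assumption, not the project's actual kernels):

```python
import numpy as np

def nc_values(counts):
    """Compute the N_c table via sort + positions + stream compaction."""
    s = np.sort(counts)                                    # Sort: [0, 1, 1, 2]
    # Stream compaction: keep positions where a new distinct count starts.
    starts = np.flatnonzero(np.r_[True, s[1:] != s[:-1]])  # [0, 1, 3]
    values = s[starts]                                     # distinct counts: [0, 1, 2]
    # N_c = distance to the next start; the array length closes the last run.
    lengths = np.diff(np.r_[starts, len(s)])               # [1, 2, 1]
    return dict(zip(values.tolist(), lengths.tolist()))

print(nc_values(np.array([1, 0, 2, 1])))                   # {0: 1, 1: 2, 2: 1}
```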
Smoothing the N-gram counts (Doc 1):
Doc1 counts: [1, 0, 2, 1], assigned to Thread 0, Thread 1, Thread 2, Thread 3
Let one thread compute the smoothed count for each N-gram, applying the formulas above.
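And a sketch of the work a single thread does in this phase, assuming the N_c table from phase 1 is available to every thread (plain Python standing in for one GPU thread per count):

```python
def smooth_one(i, counts, nc, total):
    """Body of thread i: smooth counts[i] using c* = (c+1) * N_{c+1} / N_c."""
    c = counts[i]
    if nc.get(c + 1, 0) == 0:        # no higher-count bucket: leave the count as-is
        return c
    c_star = (c + 1) * nc[c + 1] / nc[c]
    return c_star                    # smoothed probability would be c_star / total

counts = [1, 0, 2, 1]                # Doc 1
nc = {0: 1, 1: 2, 2: 1}              # N_c table from phase 1
print([smooth_one(i, counts, nc, sum(counts)) for i in range(len(counts))])
# [1.0, 2.0, 2, 1.0]
```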
Experimental results
[Chart: running time (ms) against number of elements (1,000 to 100,000,000), comparing the GPU and CPU implementations.]
Kneser-Ney Discounting
Intuition: assume we are smoothing bigram counts. To smooth the count for a bigram $w_{i-1}w_i$, do the following:
If $C(w_{i-1}w_i) > 0$, subtract a fixed discount $D$ from the count:
$P_{KN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})}$
If $C(w_{i-1}w_i) = 0$, base the estimate on the number of different contexts the word $w_i$ has appeared in:
$P_{KN}(w_i \mid w_{i-1}) = \alpha(w_{i-1}) \frac{|\{w' : C(w'w_i) > 0\}|}{\sum_{w} |\{w' : C(w'w) > 0\}|}$
where $\alpha(w_{i-1})$ is a normalising backoff weight.
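To make the two cases concrete, here is a minimal dict-based sketch; the discount value, the helper layout, and the simplified normalisation (the backoff weight is dropped) are my assumptions rather than the project's implementation:

```python
def kneser_ney(bigrams, w_prev, w, D=0.75):
    """Two-case Kneser-Ney estimate of P(w | w_prev) over a bigram count dict."""
    c_bigram = bigrams.get((w_prev, w), 0)
    if c_bigram > 0:
        # Seen bigram: discounted maximum-likelihood estimate.
        c_prev = sum(c for (a, _), c in bigrams.items() if a == w_prev)
        return (c_bigram - D) / c_prev
    # Unseen bigram: continuation probability, i.e. how many distinct
    # contexts w follows, normalised by the number of bigram types.
    contexts_of_w = sum(1 for (_, b) in bigrams if b == w)
    return contexts_of_w / len(bigrams)   # backoff weight alpha omitted for brevity

bigrams = {("a", "shoe"): 1, ("a", "dog"): 1, ("foo", "bar"): 2}
print(kneser_ney(bigrams, "a", "shoe"))   # (1 - 0.75) / 2 = 0.125
print(kneser_ney(bigrams, "foo", "dog"))  # continuation estimate: 1/3
```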
Experimental results
[Chart: running time (ms) against number of elements (1,000 to 20,000,000), comparing the GPU and CPU implementations.]
Outline
CIKM
Implementation of Smoothing techniques on the GPU
Re-running experiments using the wt2g collection
The Future
Re-run experiments using the wt2g collection
Provided by the University of Glasgow
Cost: 350 pounds. Size: 2 GB
Pipeline: Webpage → [HTML parser] → Text → [LM indexer] → Inverted index → Re-run experiments → Results!
HTML Parser and LM Indexer
Both written in Python.
Used the lxml API for HTML parsing: from lxml import html
Do not use the HTML parser built into Python; it cannot handle broken HTML very well while extracting text.
Beautiful Soup is also a good option.
Used the nltk library for stemming (nltk.stem.porter) and tokenization (nltk.word_tokenize).
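A minimal sketch of this extract-stem-index step, assuming a simple in-memory postings dict (the helper name and index layout are illustrative, not the project's code):

```python
from lxml import html
from nltk.stem.porter import PorterStemmer
from nltk import word_tokenize   # requires the nltk 'punkt' tokenizer data

def index_page(doc_id, raw_html, postings):
    """Extract text from (possibly broken) HTML, stem it, and add it to an inverted index."""
    text = html.fromstring(raw_html).text_content()   # lxml tolerates malformed markup
    stemmer = PorterStemmer()
    for token in word_tokenize(text):
        term = stemmer.stem(token.lower())
        postings.setdefault(term, set()).add(doc_id)

postings = {}
index_page("doc1", "<html><body><p>Dogs running <b>fast</p></body>", postings)
print(postings)
```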
Outline
CIKM
Implementation of Smoothing techniques on the GPU
Re-running experiments using the wt2g collection
The Future: implementing Ponte and Croft's model; re-running experiments using the TREC GOV2 collection
Future Work
Modify the code to implement Ponte and Croft's model
Re-run experiments using the TREC GOV2 collection.
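For reference, a heavily simplified sketch of the query-likelihood idea behind Ponte and Croft's model, with hypothetical term-frequency dicts; their full model also includes a risk-adjusted estimate and a factor for query-absent terms, both of which this sketch omits:

```python
import math

def query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len):
    """Simplified log P(Q | M_d): document-model term probabilities,
    backed off to collection statistics for terms unseen in the document."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len                         # ML estimate from the document
        p = p_doc if p_doc > 0 else coll_tf.get(t, 0) / coll_len   # simple backoff, not P&C's exact estimator
        score += math.log(p) if p > 0 else float("-inf")
    return score

doc_tf = {"gpu": 3, "smoothing": 1}
coll_tf = {"gpu": 40, "smoothing": 10, "kernel": 5}
print(query_likelihood(["gpu", "kernel"], doc_tf, 100, coll_tf, 10000))
```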
Source | Dataset
Using Graphics Processors for High Performance IR Query Processing | TREC GOV2 (25.2 million pages)
Optimized Top-k Processing with Global Page Scores | TREC GOV2 (426 GB)
On Efficient Posting List Intersection with Multicore Processors | AltaVista query log (250 GB)
Improving Search Engines Performance on Multithreading Processors | TodoCL database of pages from Chile (1.5 GB)
Faster Top-k Document Retrieval Using Block-Max Indexes | TREC GOV2
Batch Query Processing for Web Search Engines | Subset of 10 million pages crawled by the PolyBot web crawler
A Language Modeling Approach to Information Retrieval | TREC, topics 202-250 and topics 51-100
Improved Index Compression Techniques for Versioned Document Collections | Wikipedia dataset (8 million documents) + Internet Archive (1.06 million documents)
Ranking Web Pages Using Collective Knowledge | Wikipedia + TREC
Combining the Language Model and Inference Network Approaches to Retrieval | TREC 4, 6, 7, and 8 queries