FYP Progress
Sudhanshu Khemka
Outline
CIKM
Implementation of Smoothing techniques on the GPU
Re-running experiments using the wt2g collection
The Future
CIKM
Began my FYP by revamping my UROP report and submitting it for publication to CIKM. Learnt:
- the importance of succinct writing
- the importance of re-drawing images
- the importance of re-writing equations
Comments by the reviewers:
- The definition of the language modeling approach is not clear.
- A standard dataset should be used.
We also noticed that we need to improve our smoothing model.
Direction for my FYP
Implemented smoothing models:
The Good-Turing smoothing algorithm
The Kneser-Ney smoothing algorithm
Good-Turing Smoothing
Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.
Smoothed count: $c^* = (c+1)\frac{N_{c+1}}{N_c}$
Smoothed probability: $P(c^*) = \frac{c^*}{N}$
In the above definitions, $N_c$ is the number of N-grams that occur c times, and $N$ is the total number of N-gram occurrences observed.
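To make the formulas concrete, here is a minimal pure-Python sketch (the function name and return shape are mine, not the project's code); the example table below supplies counts to feed it:

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing smoothed counts: c* = (c+1) * N_{c+1} / N_c."""
    nc = Counter(counts)             # N_c: number of N-grams occurring exactly c times
    total = sum(counts)              # N: total number of N-gram occurrences
    smoothed = {}
    for c in nc:
        if nc.get(c + 1, 0) > 0:     # c* is only defined where N_{c+1} > 0
            smoothed[c] = (c + 1) * nc[c + 1] / nc[c]
    return smoothed, total

# Doc 1 counts from the table below: [1, 0, 2, 1]
smoothed, total = good_turing([1, 0, 2, 1])
print(smoothed, total)               # {1: 1.0, 0: 2.0} 4 ; P(c*) = c*/N
```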
Example bigram counts:

Bigram  | Doc 1 | Doc 2
a shoe  |   1   |   1
a cat   |   0   |   2
foo bar |   2   |   0
a dog   |   1   |   2
The GPU implementation applies these formulas in two phases:
1) Calculate the N_c values
2) Smooth the counts
Calculating the N_c values (Doc 1):
Doc1 counts: [1, 0, 2, 1]
Sort: [0, 1, 1, 2]
Positions: [0, 1, 2, 3]
Stream compaction keeps each distinct count and the position where it first appears:
Distinct counts: [0, 1, 2] at start positions [0, 1, 3]
Differencing consecutive start positions (with the array length, 4, closing the last run) gives the counts of counts:
Doc1: N_0 = 1, N_1 = 2, N_2 = 1
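A CPU analogue of this phase, with NumPy's sort and boolean masking standing in for the GPU sort and stream-compaction primitives (a sketch under that assumption, not the project's actual kernels):

```python
import numpy as np

def nc_values(counts):
    """Compute the N_c table via sort + positions + stream compaction."""
    s = np.sort(counts)                                    # Sort: [0, 1, 1, 2]
    # Stream compaction: keep positions where a new distinct count starts.
    starts = np.flatnonzero(np.r_[True, s[1:] != s[:-1]])  # [0, 1, 3]
    values = s[starts]                                     # distinct counts: [0, 1, 2]
    # N_c = distance to the next start; the array length closes the last run.
    lengths = np.diff(np.r_[starts, len(s)])               # [1, 2, 1]
    return dict(zip(values.tolist(), lengths.tolist()))

print(nc_values(np.array([1, 0, 2, 1])))                   # {0: 1, 1: 2, 2: 1}
```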
Smoothing the N-gram counts (Doc 1):
Doc1 counts: [1, 0, 2, 1], assigned to Thread 0, Thread 1, Thread 2, Thread 3
Let one thread compute the smoothed count for each N-gram, applying the formulas above.
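And a sketch of the work a single thread does in this phase, assuming the N_c table from phase 1 is available to every thread (plain Python standing in for one GPU thread per count):

```python
def smooth_one(i, counts, nc, total):
    """Body of thread i: smooth counts[i] using c* = (c+1) * N_{c+1} / N_c."""
    c = counts[i]
    if nc.get(c + 1, 0) == 0:        # no higher-count bucket: leave the count as-is
        return c
    c_star = (c + 1) * nc[c + 1] / nc[c]
    return c_star                    # smoothed probability would be c_star / total

counts = [1, 0, 2, 1]                # Doc 1
nc = {0: 1, 1: 2, 2: 1}              # N_c table from phase 1
print([smooth_one(i, counts, nc, sum(counts)) for i in range(len(counts))])
# [1.0, 2.0, 2, 1.0]
```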
Experimental results
[Chart: running time (ms) against number of elements (1,000 to 100,000,000), comparing the GPU and CPU implementations.]
Kneser-Ney Discounting
Intuition: assume we are smoothing bigram counts. To smooth the count for a bigram $w_{i-1}w_i$, do the following:
If $C(w_{i-1}w_i) > 0$, subtract a fixed discount $D$ from the count:
$P_{KN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}w_i) - D}{C(w_{i-1})}$
If $C(w_{i-1}w_i) = 0$, base the estimate on the number of different contexts the word $w_i$ has appeared in:
$P_{KN}(w_i \mid w_{i-1}) = \alpha(w_{i-1}) \frac{|\{w' : C(w'w_i) > 0\}|}{\sum_{w} |\{w' : C(w'w) > 0\}|}$
where $\alpha(w_{i-1})$ is a normalising backoff weight.
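To make the two cases concrete, here is a minimal dict-based sketch; the discount value, the helper layout, and the simplified normalisation (the backoff weight is dropped) are my assumptions rather than the project's implementation:

```python
def kneser_ney(bigrams, w_prev, w, D=0.75):
    """Two-case Kneser-Ney estimate of P(w | w_prev) over a bigram count dict."""
    c_bigram = bigrams.get((w_prev, w), 0)
    if c_bigram > 0:
        # Seen bigram: discounted maximum-likelihood estimate.
        c_prev = sum(c for (a, _), c in bigrams.items() if a == w_prev)
        return (c_bigram - D) / c_prev
    # Unseen bigram: continuation probability, i.e. how many distinct
    # contexts w follows, normalised by the number of bigram types.
    contexts_of_w = sum(1 for (_, b) in bigrams if b == w)
    return contexts_of_w / len(bigrams)   # backoff weight alpha omitted for brevity

bigrams = {("a", "shoe"): 1, ("a", "dog"): 1, ("foo", "bar"): 2}
print(kneser_ney(bigrams, "a", "shoe"))   # (1 - 0.75) / 2 = 0.125
print(kneser_ney(bigrams, "foo", "dog"))  # continuation estimate: 1/3
```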
Experimental results
[Chart: running time (ms) against number of elements (1,000 to 20,000,000), comparing the GPU and CPU implementations.]
Outline
CIKM
Implementation of Smoothing techniques on the GPU
Re-running experiments using the wt2g collection
The Future
Re-run experiments using the wt2g collection
Provided by the University of Glasgow
Cost: 350 pounds. Size: 2 GB
Pipeline: Webpage → [HTML parser] → Text → [LM indexer] → Inverted index → Re-run experiments → Results!
HTML Parser and LM Indexer
Both written in Python.
Used the lxml API for HTML parsing: from lxml import html
Do not use the HTML parser built into Python; it cannot handle broken HTML very well while extracting text.
Beautiful Soup is also a good option.
Used the nltk library for stemming (nltk.stem.porter) and tokenization (nltk.word_tokenize).
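A minimal sketch of this extract-stem-index step, assuming a simple in-memory postings dict (the helper name and index layout are illustrative, not the project's code):

```python
from lxml import html
from nltk.stem.porter import PorterStemmer
from nltk import word_tokenize   # requires the nltk 'punkt' tokenizer data

def index_page(doc_id, raw_html, postings):
    """Extract text from (possibly broken) HTML, stem it, and add it to an inverted index."""
    text = html.fromstring(raw_html).text_content()   # lxml tolerates malformed markup
    stemmer = PorterStemmer()
    for token in word_tokenize(text):
        term = stemmer.stem(token.lower())
        postings.setdefault(term, set()).add(doc_id)

postings = {}
index_page("doc1", "<html><body><p>Dogs running <b>fast</p></body>", postings)
print(postings)
```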
Outline
CIKM
Implementation of Smoothing techniques on the GPU
Re-running experiments using the wt2g collection
The Future: implementing Ponte and Croft's model; re-running experiments using the TREC GOV2 collection
Future Work
Modify the code to implement Ponte and Croft's model
Re-run experiments using the TREC GOV2 collection.
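For reference, a heavily simplified sketch of the query-likelihood idea behind Ponte and Croft's model, with hypothetical term-frequency dicts; their full model also includes a risk-adjusted estimate and a factor for query-absent terms, both of which this sketch omits:

```python
import math

def query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len):
    """Simplified log P(Q | M_d): document-model term probabilities,
    backed off to collection statistics for terms unseen in the document."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len                         # ML estimate from the document
        p = p_doc if p_doc > 0 else coll_tf.get(t, 0) / coll_len   # simple backoff, not P&C's exact estimator
        score += math.log(p) if p > 0 else float("-inf")
    return score

doc_tf = {"gpu": 3, "smoothing": 1}
coll_tf = {"gpu": 40, "smoothing": 10, "kernel": 5}
print(query_likelihood(["gpu", "kernel"], doc_tf, 100, coll_tf, 10000))
```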
Source | Dataset
Using Graphics Processors for High Performance IR Query Processing | TREC GOV2 (25.2 million pages)
Optimized Top-k Processing with Global Page Scores | TREC GOV2 (426 GB)
On Efficient Posting List Intersection with Multicore Processors | AltaVista query log (250 GB)
Improving Search Engines Performance on Multithreading Processors | TodoCL database of pages from Chile (1.5 GB)
Faster Top-k Document Retrieval Using Block-Max Indexes | TREC GOV2
Batch Query Processing for Web Search Engines | Subset of 10 million pages crawled by the PolyBot web crawler
A Language Modeling Approach to Information Retrieval | TREC, topics 202-250 and topics 51-100
Improved Index Compression Techniques for Versioned Document Collections | Wikipedia dataset (8 million documents) + Internet Archive (1.06 million documents)
Ranking Web Pages Using Collective Knowledge | Wikipedia + TREC
Combining the Language Model and Inference Network Approaches to Retrieval | TREC 4, 6, 7, and 8 queries