Assignment 2 314

7/29/2019 Assignment 2 314


CE314 - Natural Language Engineering

Assignment 2: Indexing for Web Search

Udo Kruschwitz

15th November 2010

Plagiarism

You are reminded that this work is for credit towards the composite mark in CE314, and that the work you submit must therefore be your own. Any material you make use of, whether it be from textbooks, the Web or any other source, must be acknowledged as a comment in the program, and the extent of the reference clearly indicated.

The Problem

A typical search engine consists of three parts:

A crawler that collects pages over the Web

An indexing component which processes those pages, selects the important keywords and writes them to a database

A front end for querying the index database.

A good indexing component is essential to give good answers to a user query. The main problem of the indexing step is to decide which words in the text are interesting and which ones can be ignored.

Most Web search engines do not do very sophisticated language processing in order to build their indexes. They simply delete some stopwords and pass the remaining words to the query system.
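The stopword-only approach just described can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed solution: the stopword list is a tiny hand-picked assumption (real lists contain hundreds of entries), and the function name baseline_index is hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "for", "on", "that", "which"}


def baseline_index(text):
    """Return the lowercased non-stopword tokens of text, without duplicates,
    in order of first appearance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in STOPWORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


if __name__ == "__main__":
    print(baseline_index("A crawler collects the pages of the Web and the index is built."))
```

Everything that is not a stopword ends up in the index, which is why such a baseline keeps many uninformative words; the stages below are meant to improve on it.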

The Task

Your task is to build a simple indexing component for a Web search system. Your system should take HTML pages as input, process them using the kind of techniques that we have been looking at in the module, and output an index consisting of a list of keywords. The ideal system would begin by deleting markup and replacing HTML special symbols (such as &amp;) with their ASCII equivalents (&). It would then part-of-speech tag the input, use the POS tags to decide which parts of the text to keep as keywords (e.g. only choose nouns), and apply a stemmer (or a tool for baseform reduction).
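As a rough illustration of the first two steps (deleting markup and replacing special symbols), the sketch below uses the HTMLParser class from Python's standard library. It is one possible approach under stated assumptions, not the required one: the class name TextExtractor and the function html_to_text are hypothetical, and the choice to skip script and style content and to collect the keywords meta tag is a design decision of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the contents of <meta name="keywords">."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn &amp; into & for us.
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.meta_keywords = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "keywords":
                content = attributes.get("content") or ""
                self.meta_keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def html_to_text(page):
    """Return (plain_text, meta_keywords) for an HTML string."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.text_parts), parser.meta_keywords


if __name__ == "__main__":
    sample = ('<html><head><meta name="keywords" content="news, sport"></head>'
              '<body><p>Fish &amp; chips</p></body></html>')
    print(html_to_text(sample))
```

Using a parser rather than deleting everything between angle brackets means attribute values such as meta keywords remain available to the later stages.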

This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages. You might also implement a system that does not strictly follow the stages but will work in the same way. The stages are as follows:

Input/Output (10%) The system must be able to read some input (for example from a file) and produce appropriately formatted output, which could be a simple list of words.

Deleting Markup (20%) Before the text can be analyzed it is necessary to get rid of the HTML tags. The result will be plain text. You could use finite state methods for this. Note, however, that if you simply delete all HTML tags, you will lose information such as meta tag keywords. Therefore, I strongly suggest that you use some tool to perform this task.


Pre-processing: Sentence Splitting, Tokenization and Normalization (10%) The next step should be to transform the input text into a normal form of your choice.

Part-of-Speech Tagging (10%) The input should be tagged with a part-of-speech tagger (e.g. OpenNLP, QTag or the Brill tagger), so that the result can then be processed in the next steps.

Selecting Keywords (20%) One aim of your system is to identify the words or phrases in the text that are most useful for indexing purposes. Your system should remove words which are not useful, such as very frequent words or stopwords. You should develop a selection method, possibly using POS tags (e.g. nouns and noun phrases are usually good indices) in combination with statistical/frequency information.

Stemming or Morphological Analysis (10%) Writing word stems to the database rather than words allows various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.

Engineering a Complete System (10%) The final system should have control over all the individual components so that there is a single call and all the above steps will be performed.
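To make the idea of a single call concrete, here is a deliberately simplified sketch that chains several of the stages above. Everything in it is an assumption made for illustration: the function names (strip_markup, tokenize, crude_stem, build_index) are hypothetical, markup is removed with crude regular expressions rather than a proper tool, part-of-speech tagging is left out because it needs an external tagger, keywords are selected by raw frequency only, and crude_stem merely strips a final plural "s" and is not the Porter algorithm.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "for", "on", "that", "which"}


def strip_markup(page):
    """Crude markup removal: drop script/style blocks, then all tags,
    then decode entities such as &amp;."""
    page = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", page)
    page = re.sub(r"<[^>]+>", " ", page)
    return html.unescape(page)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a final plural 's' (a stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def build_index(page, top_n=5):
    """Single entry point: HTML string in, list of the top_n keyword stems out."""
    tokens = [t for t in tokenize(strip_markup(page))
              if t not in STOPWORDS and len(t) > 2]
    counts = Counter(crude_stem(t) for t in tokens)
    # Sort by descending frequency, then alphabetically so the output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [stem for stem, _ in ranked[:top_n]]


if __name__ == "__main__":
    sample = ("<html><body><h1>Search engines</h1>"
              "<p>A search engine indexes pages. Engines rank pages &amp; links.</p>"
              "</body></html>")
    print(build_index(sample, top_n=3))
```

Keeping each stage in its own function means any of them can be swapped for a real tool (a tagger, the Porter stemmer) without changing the single call that drives the system.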

The Report

You will have noticed that the percentages above only add up to 90%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 10% of your mark will come from this. You should submit:

A description of your implementation: what the code does, and the software you used

Clearly commented code

Unedited output from a run of the submitted code using this Web page: http://news.bbc.co.uk/ (feel free to submit other runs as well, e.g. using Web pages of your own choice)

Commented output.

You may work in pairs. If you do, you only need to submit one report. Both members of a pair will get the same mark unless there is reason to do otherwise.

Software

You can implement your system either on the Linux or the Windows machines. Perl, Java, Python, C/C++, and shell scripts are good choices for this project (you may even use Prolog for some of the processing steps), but you are by no means restricted to those languages. You can use any of the software discussed in the labs, or any additional software you find appropriate. On the Windows machines, besides Perl and Java, you can use QTag (or other software installed in the labs such as Connexor, NLTK or GATE). On the Linux machines you can use shell scripts and the Brill tagger, which can be accessed from the command line. If you want to use an existing stemmer, the Porter stemmer (briefly discussed in the lectures) would be a candidate. The algorithm is described in the textbook by Jurafsky and Martin. A Web site that provides Java, Perl and C implementations (as well as many others) is the following:

http://www.tartarus.org/~martin/PorterStemmer/

Submission

The assignment, which counts for 20% of the overall mark, should be submitted via the electronic submission system by Friday, 17 December 2010, 11:59 (mid-day) (see the submission guidelines provided for Assignment 1). The guidelines about late assignments are explained in the handbook. The assignments will be marked by 17 January 2011.
