
1

Corpus-Based Work

Chapter 4

Foundations of statistical natural language processing

2

Introduction

• Requirements of NLP work
– Computers
– Corpora
– Application/software

• This section covers the formats of raw data and the problems encountered in dealing with it

• Low-level processing before the actual work
– Word/sentence extraction

3

Getting Set Up

• Computers
– Memory requirements for large corpora
– Statistical NLP methods involve counts that must be accessed speedily

• Corpora
– “A corpus is a special collection of textual material collected according to a certain set of criteria”
– Licensing
– Most of the time, free sources are not linguistically marked up

4

• Corpora
– Representative sample
• What we find for the sample should also hold for the general population
– Balanced corpus
• Each subtype of text matches a predetermined criterion of importance

• Importance in statistical NLP
– A representative corpus
– The type/domain of the corpus should be reported along with results

5

• Software
– Text editors
• TextPad, Emacs, BBEdit
• Regular expressions
– Patterns expressed as a regular language
– Programming languages
• C/C++ widely used (efficient)
• Perl for text preparation and formatting
• Prolog’s built-in database and easy handling of complicated structures make it important
• Java, as a purely object-oriented language, gives automatic memory management
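Regular expressions do much of the day-to-day work in corpus processing. A minimal sketch in Python (the sample text and the pattern are illustrative assumptions, not from the book):

```python
import re
from collections import Counter

# Hypothetical sample text showing variant word forms.
text = "Whether to co-operate or cooperate is unclear; e-mail and email both occur."

# One pattern matches plain and hyphenated word forms alike.
word = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
counts = Counter(m.group(0).lower() for m in word.finditer(text))

print(counts["co-operate"], counts["cooperate"])  # 1 1
```

A few lines like these already give word-frequency counts over raw text, which is why scripting languages with strong regex support are popular for text preparation.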

6

Looking at Text

• Text comes either raw or marked up
– ‘Markup’ means putting codes into the data file that give information about the text

• Issues in automatic processing
– Junk formatting/content (corpus cleaning)
– Case: should everything be lower-cased?
1. Proper nouns?
2. Stress through capitalization
• Lower-casing loses contextual information

7

• Tokenization
– Text is divided into units called ‘tokens’
– How should punctuation marks be treated?

• What is a word?
– Graphic word (Kučera and Francis 1967)
• “A string of contiguous alphanumeric characters with white space on either side”
• This is not a practical definition, even for Latin-script text
• News corpora in particular contain odd entries, e.g. Micro$oft, C|net
• Apart from these oddities there are other issues
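The graphic-word definition can be sketched directly as a regex (a deliberately naive illustration, not Kučera and Francis’s exact procedure), which also exposes the oddities mentioned above:

```python
import re

def graphic_words(text):
    """Split text into maximal runs of alphanumeric characters,
    approximating the 'graphic word' definition."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(graphic_words("Micro$oft shares rose 2%."))
# The '$' splits 'Micro$oft' into two tokens: one of the odd cases above.
```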

8

• Periods
– Words are not always bounded by white space (commas, semicolons, and periods attach to them)
– Periods occur at the end of a sentence, and also at the end of abbreviations
– In an abbreviation, the period should stay attached to the word (Wash. vs. wash)
– When an abbreviation occurs at the end of a sentence, only one period is present, performing both functions
• Within morphology, this phenomenon is referred to as ‘haplology’

9

• Single apostrophes
– Difficulties in dealing with constructions such as I’ll or isn’t
– By the basic definition the graphic-word count is 1, but these should be counted as 2 words
• 1. Splitting matches the syntax: S → NP VP (“I” is the NP, “’ll …” the VP)
• 2. But if we split, some funny word forms (’ll, n’t) occur in the collection
– A single apostrophe can also mark the end of a quotation
– And the possessive form of words ending in ‘s’ or ‘z’
• Charles’ law, Muaz’ book
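Clitic splitting can be sketched as below. The contraction table is a small assumption for illustration; real tokenizers such as the Penn Treebank tokenizer use a longer rule list:

```python
import re

def split_clitics(token):
    """Split common English contractions into two word-like units."""
    if token.endswith("n't") and len(token) > 3:
        return [token[:-3], "n't"]          # isn't -> is + n't
    m = re.match(r"(.+)('ll|'re|'ve|'s|'d|'m)$", token)
    if m:
        return [m.group(1), m.group(2)]     # I'll -> I + 'll
    return [token]

print(split_clitics("isn't"))  # ['is', "n't"]
print(split_clitics("I'll"))   # ['I', "'ll"]
```

Note that the output contains exactly the “funny word forms” (’ll, n’t) the slide mentions.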

10

• Hyphenation
– Does a sequence of letters with a hyphen in between count as one word or two?
– Line-ending hyphens
• Remove the hyphen at the end of the line and join the two parts together
• But what if a lexical hyphen (as in text-based) happens to fall at the line end? The single hyphen then serves both functions, like the haplology of periods above
– Electronic text mostly has no line-breaking hyphens, but there are some other issues…

11

• Some things with hyphens are clearly treated as one word
– E-mail, A-1-plus, and co-operate
• Other cases are arguable
– Non-lawyer, pro-Arabs, and so-called
– The hyphens here are called lexical hyphens
– Inserted before or after small word formatives, sometimes to split a vowel sequence
• A third class of hyphens is inserted to indicate correct grouping
– A text-based medium
– A final take-it-or-leave-it offer
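Line-end dehyphenation can be sketched with a whitelist of known lexical hyphens (the whitelist here is a hypothetical stand-in for a real lexicon):

```python
# Hypothetical lexicon of forms whose hyphen is lexical, not line-breaking.
LEXICAL_HYPHENS = {"text-based", "so-called", "take-it-or-leave-it"}

def rejoin(first_half, second_half):
    """Join a word broken at a line-ending hyphen.

    Keep the hyphen only if the hyphenated form is a known lexical item."""
    hyphenated = first_half + "-" + second_half
    if hyphenated.lower() in LEXICAL_HYPHENS:
        return hyphenated
    return first_half + second_half

print(rejoin("text", "based"))   # 'text-based'
print(rejoin("oper", "ating"))   # 'operating'
```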

12

• Inconsistencies in hyphenation
– Cooperate vs. co-operate
– So we can have multiple forms of what is treated as either one word or two

• Lexemes
– A single dictionary entry with a single meaning

• Homographs
– Two lexemes whose written forms overlap
• saw (the tool vs. the past tense of see)

13

• Word segmentation in other languages
– Some languages (e.g. Chinese) are written without white space between words

• The opposite issue: white space that is not a word boundary
– “the New York-New Haven railroad”
– “I couldn’t work the answer out”
– In spite of, in order to, because of

• Variant coding of information of a certain semantic type
– Phone numbers, e.g. 42-111-128-128
• A problem for information extraction

14

• Speech corpora issues
– More contractions
– Various phonetic representations
– Pronunciation variants
– Sentence fragments
– Filler words

• Morphology
– Keep the various forms separate, or collapse them? e.g. sit, sits, sat
– Grouping them together and working with lexemes initially looks easier

15

• Stemming
– Strips off affixes

• Lemmatization
– Extracts the lemma or lexeme from an inflected form

• Empirical research within IR shows that stemming does not help performance (for English):
1. Information loss (operating → operate)
2. Closely related tokens are grouped into chunks, which are more useful
3. The result may not carry over to morphologically rich languages
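A toy suffix-stripping stemmer makes the information-loss point concrete. This is a deliberately naive sketch, not the Porter algorithm:

```python
SUFFIXES = ["ing", "ed", "es", "s"]  # checked longest-first

def naive_stem(word):
    """Strip one common English suffix, if present and the stem stays long enough."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(naive_stem("operating"))  # 'operat' (the operating/operate distinction is gone)
print(naive_stem("sits"))       # 'sit'
```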

16

• Sentences
– What is a sentence?
– In English, something ending with ‘.’, ‘?’ or ‘!’
– Abbreviation issues (periods that do not end sentences)

• Other issues
– “You reminded me,” she remarked, “of your mother.”
– Nested units such as the quoting clause are classified as ‘clauses’
– Quotation marks can come after the punctuation
• The ‘.’ is not the sentence boundary in that case

17

• Sentence boundary (SB) detection, a simple heuristic
– Place a tentative SB after all occurrences of . ? !
– Move the boundary after a quotation mark, if any
– Disqualify a period boundary if it is
• Preceded by an abbreviation that does not occur at sentence end and is usually capitalized, e.g. Prof., Dr.
• Or preceded by an abbreviation not usually followed by a capitalized word, e.g. etc., Jr., when no capitalized word follows
– Disqualify a boundary with ? or !
• If followed by a lower-case letter
– Regard all others as correct SBs
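The heuristic above can be sketched in Python. The abbreviation lists are tiny illustrative assumptions; a real system needs much larger ones:

```python
import re

ABBREV_TITLES = {"Prof.", "Dr.", "Mr."}   # assumed: never sentence-final
ABBREV_OTHER = {"etc.", "Jr."}            # assumed: may be sentence-final

def sentence_boundaries(text):
    """Return offsets just past each likely sentence boundary."""
    boundaries = []
    for m in re.finditer(r"[.?!][\"']?", text):
        end = m.end()                          # boundary moves past a quote mark
        prev = text[:end].split()
        word = prev[-1] if prev else ""
        follower = text[end:].lstrip()
        if m.group(0)[0] == ".":
            if word in ABBREV_TITLES:
                continue                       # e.g. Prof. Smith: no boundary
            if word in ABBREV_OTHER and not follower[:1].isupper():
                continue                       # e.g. etc. followed by lower case
        elif follower[:1].islower():
            continue                           # ? or ! followed by lower case
        boundaries.append(end)
    return boundaries

print(sentence_boundaries("Dr. Smith arrived. He left early!"))  # [18, 33]
```

The first period (after Dr.) is correctly disqualified; the other two punctuation marks survive as boundaries.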

18

• Riley (1989) used classification trees for SB detection
– Tree features included the case and length of the words preceding and following a period, and the probabilities of words occurring before and after a sentence boundary
– It required a large quantity of labeled data

• Palmer and Hearst used the POS of such words, implemented with neural networks (98–99% accurate)

• What about other languages?

19

• Marked-up data
– Some sort of code is used to provide information (mostly SGML, XML)
– Markup can be added automatically, manually, or by a mixture of both (semi-automatically)
– Some texts mark up just sentence and paragraph boundaries
– Others mark up more than this basic information
• e.g. the Penn Treebank (full syntactic structure)
– The most common markup is POS tagging
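A tiny illustration of sentence and paragraph markup in XML. The element names `p` and `s` are arbitrary assumptions here; actual schemes vary by corpus:

```python
import xml.etree.ElementTree as ET

# Build a minimally marked-up document: one paragraph, sentence boundaries tagged.
p = ET.Element("p")
for sent in ["The dog barked.", "It then slept."]:
    s = ET.SubElement(p, "s")
    s.text = sent

print(ET.tostring(p, encoding="unicode"))
# <p><s>The dog barked.</s><s>It then slept.</s></p>
```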

20

• Grammatical tagging
– Generally done with conventional POS categories such as noun, verb, etc.
– Plus some information about the nature of the words, such as the plurality of nouns or superlative forms of adjectives

• Tag sets
– The most influential tag sets have been the ones used to tag the American Brown Corpus and the Lancaster-Oslo-Bergen corpus

21

• Sizes of tag sets (total tags)
– Brown: 87
– Penn: 45
– CLAWS1: 132

• The Penn tag set is the most widely used in computational work

• Tags differ between tag sets
– Larger tag sets obviously make finer-grained distinctions
– The level of detail depends on the domain of the corpus

22

• The design of a tag set
– The grammatical class of the word
– Features that indicate the behavior of the word

• Part of speech can be determined on
– Semantic grounds
– Syntactic distributional grounds
– Morphological grounds

• Splitting tags into finer categories gives more information but makes classification harder

• There is no simple relationship between tag set size and tagger performance
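For illustration, a sentence hand-tagged in the common word/TAG notation. DT, NN, and VBD are genuine Penn Treebank tags; the tagging here is done by hand, not produced by a tagger:

```python
# A hand-tagged example: (word, tag) pairs in Penn Treebank style.
tagged = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")]

# Render in the conventional word/TAG notation.
print(" ".join(f"{w}/{t}" for w, t in tagged))
# The/DT dog/NN barked/VBD ./.
```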