
Natural Language Processing >> Tokenisation


Page 1: Natural Language Processing >> Tokenisation

Natural Language Processing >> Tokenisation <<

Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Sciences, Darmstadt, Germany
www.fbi.h-da.de/~harriehausen

[email protected]

winter / fall 2010/2011

Page 2: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 3: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 4: Natural Language Processing >> Tokenisation


Tokenisation

• It is often useful to represent a text as a list of tokens.

• The process of breaking a text up into its constituent tokens is known as tokenisation.

• Tokenisation can occur at a number of different levels: a text could be broken up into paragraphs, sentences, words, syllables, or phonemes. And for any given level of tokenisation, there are many different algorithms for breaking up the text.

• For example, at the word level, it is not immediately clear how to treat such strings as "can't," "$22.50," "New York," and "so-called."

definition
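As a first, hedged illustration (not from the original slides): the JDK's java.text.BreakIterator already performs word-level tokenisation, and its output shows how unclear the word level is.

import java.text.BreakIterator;
import java.util.Locale;

/* Sketch: word-level tokenisation with the JDK's BreakIterator. */
public class WordTokens {
    public static void main(String[] args) {
        String text = "It costs $22.50 in New York, but you can't leave.";
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end).trim();
            // "can't" comes out as one token, "New York" as two --
            // which answer is right depends on the application.
            if (!token.isEmpty()) System.out.println(token);
        }
    }
}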

Page 5: Natural Language Processing >> Tokenisation


• Breaking a text into smaller units – each unit having some particular meaning.

• In most cases, into separate words and sentences.

• Carried out by finding the word boundaries: the points where one word ends and another begins.

• Tokens/lexemes: the words identified by the process of tokenisation.

• Word segmentation
  • Tokenisation in languages where no word boundaries are explicitly marked, e.g. where whitespace is not used to signify word boundaries.
  • Examples: Chinese, Thai

definition

Page 6: Natural Language Processing >> Tokenisation


Programming languages
• Part of the lexical analysis phase in the compilation of programming-language source code.
• Programming languages are designed to be unambiguous – both with regard to lexemes and syntax.

Natural languages
• The same character can serve many different functions.
• The syntax is not as strict as in programming languages.

differences…

Page 7: Natural Language Processing >> Tokenisation


• “Clairson International Corp. said it expects to report a net loss for its second quarter ended March 26 and doesn’t expect to meet analysts’ profit estimates of $3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending Sept. 24.”

• The period is used in three different ways. When is a period part of a token and when not?

• The apostrophe (’) is used in two different ways.

Is tokenisation easy?
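A minimal Java sketch of why the naive approach fails here (not part of the slides; the sentence is abridged):

import java.util.Arrays;

/* Sketch: naive whitespace tokenisation of the example above. */
public class NaiveTokens {
    public static void main(String[] args) {
        String text = "Clairson International Corp. said it doesn't expect "
                + "to meet analysts' profit estimates of $3.9 to $4 million, "
                + "or 76 cents a share, for its year ending Sept. 24.";
        Arrays.stream(text.split("\\s+"))
                // Stripping trailing punctuation wrongly turns the
                // abbreviations "Corp." and "Sept." into "Corp" and "Sept".
                .map(t -> t.replaceAll("[.,!?]+$", ""))
                .forEach(System.out::println);
    }
}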

Page 8: Natural Language Processing >> Tokenisation


Abbreviations

Abbreviations need to be recognised.

• “The overt involvement of Mr. Obama’s team … they have tried to ease Gov. David A. Paterson …” (New York Times, 22.09.2009)

• Dr. (doctor? drive? – i.e. language dependent!)

Multiword expressions

In some cases, a sequence of tokens needs to be treated as one token.
• in spite of; on the other hand; 26. mars (Norwegian, “26 March”)

Tokenisation; examples of problems

Page 9: Natural Language Processing >> Tokenisation


Tokenisation; examples of problems from last semester’s project

Page 10: Natural Language Processing >> Tokenisation


Relevant fields:
• Forename, Surname
• Job Title, Academic Title
• Address
• Telephone, Telefax, Mobile Phone
• E-mail, Web Address

Tokenisation; examples of problems from last semester’s project

Page 11: Natural Language Processing >> Tokenisation


Telephone:
Indicator word(s): Telefon, Tel, TEL, Phone, phone, tel, t, T, fon, Durchwahl, Residence, Privat
Patterns: +(area code, 2 characters)(6-20 numbers and special characters)
Characters: 0-9, -, (, ), ., [, ], +, /
Lookup: table of area codes

(Telefon|Tel|TEL|Phone|phone|tel|t|T)?[\s]*[\.]?[\s]*[:]?[\s]*[\(]?[\s]*[+]?[\s]*[\(]?[\s]*\d{0,6}[\s]*[\)]?[\s]*[\.\-]?[\s]*[\(\)\d\.\-]{6,20}

Tokenisation; complexity of “easy” tokens
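As a hedged sketch, the pattern above can be tried out directly with java.util.regex (the sample numbers are invented):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Sketch: applying the telephone pattern from the slide. */
public class PhoneMatch {
    private static final Pattern PHONE = Pattern.compile(
            "(Telefon|Tel|TEL|Phone|phone|tel|t|T)?[\\s]*[\\.]?[\\s]*[:]?[\\s]*[\\(]?"
            + "[\\s]*[+]?[\\s]*[\\(]?[\\s]*\\d{0,6}[\\s]*[\\)]?[\\s]*[\\.\\-]?[\\s]*"
            + "[\\(\\)\\d\\.\\-]{6,20}");

    public static void main(String[] args) {
        String[] samples = { "Tel.: +49(6151)16-8433", "Phone +1 401.863.1000" };
        for (String s : samples) {
            Matcher m = PHONE.matcher(s);
            // find() locates the first substring the pattern accepts.
            if (m.find()) System.out.println("matched: " + m.group());
        }
    }
}

Note that the final character class does not allow blanks, so a number written with internal spaces is only matched up to the first space.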

Page 12: Natural Language Processing >> Tokenisation


Telefax:
Indicator word(s): Telefax, Fax, FAX, fax, f, F, PC-Fax
Patterns: +(area code, 2 characters)(6-20 numbers and special characters)
Characters: 0-9, -, (, ), ., [, ], +, /
Lookup: table of area codes

Tokenisation; complexity of “easy” tokens

Page 13: Natural Language Processing >> Tokenisation


Mobile:
Indicator word(s): Mobile, mobile, MOBIL, MOBILE, m, M, Handy, Mob
Patterns: +(area code, 2 characters)(mobile network code, 3 characters)(numbers and special characters)
Characters: 0-9, -, (, ), ., [, ], +, /
Lookup: table of area codes

Tokenisation; complexity of “easy” tokens

Page 14: Natural Language Processing >> Tokenisation


E-mail:
Indicator word(s): e-mail, E-mail, Email, E, Mail, eMail
Patterns: (numbers and characters)@(numbers and characters).(top-level domain)
Characters: a-z, 0-9, ., -
Lookup: list of top-level domains

Tokenisation; complexity of “easy” tokens
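The slide lists only the building blocks; below is a hedged sketch of one possible rendering as a regex (the project's actual expression is not shown on the slide):

import java.util.regex.Pattern;

/* Sketch only: one possible rendering of the e-mail pattern above. */
public class EmailMatch {
    private static final Pattern EMAIL = Pattern.compile(
            "[a-z0-9.\\-]+@[a-z0-9.\\-]+\\.[a-z]{2,6}",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        // example.com is a reserved placeholder domain.
        System.out.println(EMAIL.matcher("info@example.com").matches());
    }
}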

Page 15: Natural Language Processing >> Tokenisation


Web Address:
Indicator word(s): Web, URL, Homepage, Internet, web
Patterns: (http://)(www.)(numbers and (special) characters).(top-level domain)
Characters: -
Lookup: -

Tokenisation; complexity of “easy” tokens

Page 16: Natural Language Processing >> Tokenisation


Forename and Surname:
Indicator word(s): -
Patterns: (Forename Middle name Surname)
• Forename and surname can be composed of two names, divided by "-"
• Middle name can be abbreviated
Characters: -
Lookup: names database

Tokenisation; complexity of “easy” tokens

Page 17: Natural Language Processing >> Tokenisation


Company name:
Indicator word(s): GmbH, Ltd, LTD, Inc., company, AG, Aktiengesellschaft, Fachhochschule, Universität, Beratung, e.K., Versicherungsdienst, technik, Co., & Partner, Co.KG, &, Solutions, System, Systems

Patterns: -
Characters: -
Lookup: company keywords database
Cross-check with the domain name of the web and e-mail addresses.
The company name often occurs multiple times, because of recognised logos etc.

Tokenisation; complexity of “easy” tokens

Page 18: Natural Language Processing >> Tokenisation


German Address:
Zip: 5-digit number (zip identification sequence), followed by the city name and preceded by either DE-, D-, or nothing.

City: string of symbols that is preceded by the zip. Can be identified by cross-matching the zip with the city.

Street: consists of letters and special symbols, followed by digits and letters; preceded by D- or DE- in combination with the zip. Can be identified by cross-matching the zip with the street, or the city with the street.

Tokenisation; complexity of “easy” tokens

Page 19: Natural Language Processing >> Tokenisation


US Address:
Indicator word(s): PASEO, PLAZA, PASAJE, CARR, PARQUE, VEREDA, VISTA, VIA, CALLEJON, PATIO, BLVD, CAMINO

A ZIP code can have the format 99999 or 99999-9999.

Characters: -
Lookup: ZIP code database
See also http://www.usps.com/ncsc/addressstds/deliveryaddress.htm

Tokenisation; complexity of “easy” tokens
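The ZIP code format translates directly into a small pattern; a sketch:

import java.util.regex.Pattern;

/* Sketch: US ZIP codes of the form 99999 or 99999-9999. */
public class ZipMatch {
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    public static void main(String[] args) {
        System.out.println(ZIP.matcher("02912").matches());      // true
        System.out.println(ZIP.matcher("02912-1234").matches()); // true
        System.out.println(ZIP.matcher("2912").matches());       // false
    }
}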

Page 20: Natural Language Processing >> Tokenisation


• A token is a categorized block of text.

• The block of text corresponding to the token is known as a lexeme.

Example
• “.”, “?”, “!” may all be categorized as punctuation tokens. However, they are all different lexemes.

• “321.56”, “12”, “19.9” may all be categorized as number tokens. However, they are all different lexemes.

Lexemes vs. tokens
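In code, the distinction is simply a pair of category and text; a sketch using a Java 16 record (the type names are invented for illustration):

/* Sketch: a token is a categorized block of text; the text is the lexeme. */
record Token(String type, String lexeme) { }

class LexemeDemo {
    public static void main(String[] args) {
        Token[] tokens = {
                new Token("PUNCT", "."), new Token("PUNCT", "?"),
                new Token("NUMBER", "321.56"), new Token("NUMBER", "12"),
        };
        // Same token category, different lexemes.
        for (Token t : tokens)
            System.out.println(t.type() + " <- " + t.lexeme());
    }
}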

Page 21: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 22: Natural Language Processing >> Tokenisation


• Breaking a text into sentences.

• Requires an understanding of the various uses of punctuation characters in a language.

• The boundaries between sentences need to be recognised.

• The boundaries occur between words. (“Sentence boundary detection”)

• At first sight, this seems simple:
  • Can’t we just search for “.”, “?”, “!”?
  • And sometimes “:”, “;”?

Sentence segmentation

Page 23: Natural Language Processing >> Tokenisation


Counter examples that make segmentation difficult:

• Direct speech: “Oh my goodness!”, he said, while listening to the bad news.

• Abbreviations:
  • Dogs, cats, budgies, fish, etc. all belong to the category of ‘pets’.
  • “The contemporary viewer may simply ogle the vast wooded vistas rising up from the Saguenay River and Lac St. Jean, standing in for the St. Lawrence River.”
  • “The firm said it plans to sublease its current headquarters at 55 Water St. A spokesman declined to elaborate.”

• Ellipsis: text…text

Sentence segmentation … BUT…

Page 24: Natural Language Processing >> Tokenisation


Is a simple rule not sufficient?

delim = “.” | “!” | “?”
IF (right context = delim + space + capital letter OR
    delim + quote + space + capital letter OR
    delim + space + quote + capital letter)
THEN sentence boundary

Sentence segmentation … rules
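A sketch of the same rule with java.util.regex, assuming plain ASCII quotes:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Sketch: the IF/THEN rule above as a single regular expression. */
public class BoundaryRule {
    // delim + space + capital | delim + quote + space + capital
    //                         | delim + space + quote + capital
    private static final Pattern BOUNDARY =
            Pattern.compile("[.!?](?:\" | \"| )(?=[A-Z])");

    public static void main(String[] args) {
        String text = "He left. \"Why?\" No one knew. The end.";
        Matcher m = BOUNDARY.matcher(text);
        while (m.find())
            System.out.println("boundary after offset " + m.start());
    }
}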

Page 25: Natural Language Processing >> Tokenisation


• If a period preceding a space is used as an indication of sentence boundaries, then one can recognise about 90% of the periods which end a sentence in the Brown corpus (http://en.wikipedia.org/wiki/Brown_Corpus).
• One can get quite far by using simple regular expressions, without using a list of abbreviations.
• Let us assume three kinds of abbreviations in English:

  A., B., C.         [A-Za-z]\.
  U.S., m.p.h.       [A-Za-z]\.([A-Za-z]\.)+
  Mr., St., Assn.    [A-Z][bcdfghj-np-tvxz]+\.

• By using these two simple methods one can correctly recognise about 98% of the sentence boundaries in the Brown corpus.

A simple sentence segmentation
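The three patterns combine into a single abbreviation test; a sketch:

import java.util.regex.Pattern;

/* Sketch: the three abbreviation classes from the slide as one pattern. */
public class AbbrevTest {
    private static final Pattern ABBREV = Pattern.compile(
            "[A-Za-z]\\."                        // A., B., C.
            + "|[A-Za-z]\\.([A-Za-z]\\.)+"       // U.S., m.p.h.
            + "|[A-Z][bcdfghj-np-tvxz]+\\.");    // Mr., St., Assn.

    public static void main(String[] args) {
        for (String s : new String[] { "A.", "U.S.", "Mr.", "Assn.", "Boston." })
            System.out.println(s + " -> " + ABBREV.matcher(s).matches());
    }
}

A period that closes such an abbreviation is then not counted as a sentence boundary on its own.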

Page 26: Natural Language Processing >> Tokenisation


• The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kucera and W. Nelson Francis at Brown University, Providence, Rhode Island, as a general corpus (text collection) in the field of corpus linguistics.
• The initial Brown Corpus had only the words themselves, plus a location identifier for each.
• Over the following several years part-of-speech tags were applied. The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the basis for many later corpora such as the Lancaster-Oslo/Bergen Corpus. The tagged corpus enabled far more sophisticated statistical analysis.

(Brown Corpus (http://en.wikipedia.org/wiki/Brown_Corpus))

Page 27: Natural Language Processing >> Tokenisation


Part-of-speech tags used

Tag   Definition
.     sentence closer (. ; ? *)
(     left paren
)     right paren
*     not, n't
--    dash
,     comma
:     colon
ABL   pre-qualifier (quite, rather)
ABN   pre-quantifier (half, all)
ABX   pre-quantifier (both)
AP    post-determiner (many, several, next)
AT    article (a, the, no)
BE    be
BED   were

(Brown Corpus)

Page 28: Natural Language Processing >> Tokenisation


Part-of-speech tags used

Tag   Definition
BEDZ  was
BEG   being
BEM   am
BEN   been
BER   are, art
BEZ   is
CC    coordinating conjunction (and, or)
CD    cardinal numeral (one, two, 2, etc.)
CS    subordinating conjunction (if, although)
DO    do
DOD   did
DOZ   does
DT    singular determiner/quantifier (this, that)
DTI   singular or plural determiner/quantifier (some, any)
DTS   plural determiner (these, those)

(Brown Corpus)

Page 29: Natural Language Processing >> Tokenisation


Part-of-speech tags used

Tag   Definition
DTX   determiner/double conjunction (either)
EX    existential there
FW    foreign word (hyphenated before regular tag)
HV    have
HVD   had (past tense)
HVG   having
HVN   had (past participle)

etc.

(Brown Corpus)

Page 30: Natural Language Processing >> Tokenisation


Tag-Sets

• A tag-set is the inventory of tags that a tagger can assign. Tag-sets differ considerably in size:

Tag-Set                                   Number of Tags
• Brown Corpus                            87
• Lancaster-Oslo/Bergen                   135
• Lancaster UCREL                         165
• London-Lund Corpus of Spoken English    197
• Penn Treebank                           36 + 12
• IBM / Microsoft                         8

(Different taggers)

Page 31: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 32: Natural Language Processing >> Tokenisation


• A lexical analyser is a program which breaks a text into lexemes (tokens).

• A program which generates a lexical analyser is called a lexical analyser generator.

• Examples: Lex/Flex/JFlex (http://jflex.de/)

• The user defines a set of regular expression patterns.
• The program generates finite-state automata.
• The automata are used to recognise tokens.

Lexical Analyser

Page 33: Natural Language Processing >> Tokenisation


%% A finite-state automaton recognising (a|b)*abb

%public
%class Simple
%standalone
%unicode

%{
  String str = "Found: ";
%}

Pattern = (a|b)*abb

%%

{Pattern}   { System.out.println(str + " " + yytext()); }
.           { ; }

JFlex example (http://jflex.de/manual.html)
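Assuming the specification is saved as Simple.flex, running jflex Simple.flex generates Simple.java; because of the %standalone directive the generated class has a main method, so after compiling with javac it can be run directly on an input file (java Simple input.txt) and prints every match of (a|b)*abb.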

Page 34: Natural Language Processing >> Tokenisation


%% A good tokeniser for English?

%public
%class EngGood
%standalone
%unicode

%{
%}

WhiteSpace = [ \t\f\n]
Lower      = [a-z]
Upper      = [A-Z]
EngChar    = {Upper}|{Lower}
EngWord    = {EngChar}+

%%

{WhiteSpace}  { ; }
{EngWord}     { System.out.println(yytext()); }
.             { System.out.println(yytext()); }

JFlex example
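Note that this spec answers its own question in the negative: since {EngWord} only covers [A-Za-z]+, an input such as "can't" is emitted as the three tokens "can", "'" and "t", and "$22.50" falls apart character by character in the same way.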

Page 35: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 36: Natural Language Processing >> Tokenisation


Various Unix tools exist which simplify the tokenisation and processing of texts:

• grep (global regular expression print)
• tr (translate characters)
• sed (stream editor) (http://en.wikipedia.org/wiki/Sed)

• Programming languages suited to text searching:
  • Perl

Unix/Linux tools

Page 37: Natural Language Processing >> Tokenisation


grep is a command line text search utility originally written for Unix. The name is taken from the first letters in global / regular expression / print.A backronym of the unusual name also exists in the form of Generalized Regular Expression Parser. The grep command searches files or standard input globally for lines matching a given regular expression, and prints them to the program's standard output.

This is an example of a common grep usage:

grep apple fruitlist.txt

In this case, grep prints all lines containing apple from the file fruitlist.txt, regardless of word boundaries; therefore lines containing pineapple or apples are also printed. The grep command is case sensitive by default, so this example's output does not include lines containing Apple (with a capital A) unless they also contain apple.

Grep (http://www.panix.com/~elflord/unix/grep.html)
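To match Apple as well, grep's -i option makes the search case-insensitive:

grep -i apple fruitlist.txt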

Page 38: Natural Language Processing >> Tokenisation


To search all .txt files in a directory for apple in a shell that supports globbing, use an asterisk in place of the file name:

grep apple *.txt

Regular expressions can be used to match more complicated queries. The following prints all lines in the file that begin with the letter a, followed by any one character, then the letters ple.

grep ^a.ple fruitlist.txt

Grep (http://www.panix.com/~elflord/unix/grep.html)

Page 39: Natural Language Processing >> Tokenisation


NAME
     tr - translate characters

SYNOPSIS
     tr [ -cds ] [ string1 [ string2 ] ]

DESCRIPTION
     tr copies the standard input to the standard output with substitution or deletion of selected characters (runes).

     Input characters found in string1 are mapped into the corresponding characters of string2. When string2 is short it is padded to the length of string1 by duplicating its last character. Any combination of the options -cds may be used:

     -c   Complement string1: replace it with a lexicographically ordered list of all other characters.

     -d   Delete from input all characters in string1.

     -s   Squeeze repeated output characters that occur in string2 to single characters.

tr (http://linux.die.net/man/1/tr) or

(http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=%2Frzahz%2Ftr.htm)

Page 40: Natural Language Processing >> Tokenisation


EXAMPLES

Replace all upper-case ASCII letters by lower-case.

tr A-Z a-z <mixed >lower

Create a list of all the words in file1, one per line, in file2, where a word is taken to be a maximal string of alphabetic characters. String2 is given as a quoted newline.

tr -cs A-Za-z '\n' <file1 >file2

tr (http://linux.die.net/man/1/tr) or

(http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=%2Frzahz%2Ftr.htm)
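As an aside, this one-word-per-line output is a classic starting point for corpus statistics; piped through sort and uniq -c it yields a word-frequency list:

tr -cs A-Za-z '\n' <file1 | sort | uniq -c | sort -rn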