
Natural Language Processing >> Tokenisation


Page 1: Natural Language Processing >> Tokenisation

Natural Language Processing >> Tokenisation <<

Prof. Dr. Bettina Harriehausen-Mühlbauer
Univ. of Applied Sciences, Darmstadt, Germany
www.fbi.h-da.de/~harriehausen

[email protected]

winter / fall 2010/2011

Page 2: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 3: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 4: Natural Language Processing >> Tokenisation


Tokenisation

• It is often useful to represent a text as a list of tokens.

• The process of breaking a text up into its constituent tokens is known as tokenisation.

• Tokenisation can occur at a number of different levels: a text could be broken up into paragraphs, sentences, words, syllables, or phonemes. And for any given level of tokenisation, there are many different algorithms for breaking up the text.

• For example, at the word level, it is not immediately clear how to treat such strings as "can't," "$22.50," "New York," and "so-called."

definition
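As a first, hedged illustration (not from the original slides): the JDK's java.text.BreakIterator already performs word-level tokenisation, and its output shows how unclear the word level is.

import java.text.BreakIterator;
import java.util.Locale;

/* Sketch: word-level tokenisation with the JDK's BreakIterator. */
public class WordTokens {
    public static void main(String[] args) {
        String text = "It costs $22.50 in New York, but you can't leave.";
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end).trim();
            // "can't" comes out as one token, "New York" as two --
            // which answer is right depends on the application.
            if (!token.isEmpty()) System.out.println(token);
        }
    }
}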

Page 5: Natural Language Processing >> Tokenisation


• Breaking a text into smaller units – each unit having some particular meaning.

• In most cases, into separate words and sentences.

• Carried out by finding the word boundaries: the points where one word ends and another begins.

• Tokens/lexemes: the words identified by the process of tokenisation.

• Word segmentation
  • Tokenisation in languages where no word boundaries are explicitly marked, e.g. where whitespace is not used to signify word boundaries.
  • Examples: Chinese, Thai

definition

Page 6: Natural Language Processing >> Tokenisation


Programming languages
• Part of the lexical analysis phase in the compilation of programming-language source code.
• Programming languages are designed to be unambiguous – both with regard to lexemes and syntax.

Natural languages
• The same character can serve many different functions.
• The syntax is not as strict as in programming languages.

differences…

Page 7: Natural Language Processing >> Tokenisation


• “Clairson International Corp. said it expects to report a net loss for its second quarter ended March 26 and doesn’t expect to meet analysts’ profit estimates of $3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending Sept. 24.”

• The period is used in three different ways. When is a period part of a token and when not?

• The apostrophe (’) is used in two different ways.

Is tokenisation easy?
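A minimal Java sketch of why the naive approach fails here (not part of the slides; the sentence is abridged):

import java.util.Arrays;

/* Sketch: naive whitespace tokenisation of the example above. */
public class NaiveTokens {
    public static void main(String[] args) {
        String text = "Clairson International Corp. said it doesn't expect "
                + "to meet analysts' profit estimates of $3.9 to $4 million, "
                + "or 76 cents a share, for its year ending Sept. 24.";
        Arrays.stream(text.split("\\s+"))
                // Stripping trailing punctuation wrongly turns the
                // abbreviations "Corp." and "Sept." into "Corp" and "Sept".
                .map(t -> t.replaceAll("[.,!?]+$", ""))
                .forEach(System.out::println);
    }
}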

Page 8: Natural Language Processing >> Tokenisation


Abbreviations

Abbreviations need to be recognised.

• “The overt involvement of Mr. Obama’s team … they have tried to ease Gov. David A. Paterson …” (New York Times, 22.09.2009)

• Dr. (doctor? drive? – i.e. language dependent!)

Multiword expressions

In some cases, a sequence of tokens needs to be treated as one token.
• in spite of; on the other hand; 26. mars (Norwegian, “26 March”)

Tokenisation; examples of problems

Page 9: Natural Language Processing >> Tokenisation


Tokenisation; examples of problems from last semester’s project

Page 10: Natural Language Processing >> Tokenisation


Relevant fields:
• Forename, Surname
• Job Title, Academic Title
• Address
• Telephone, Telefax, Mobile Phone
• E-mail, Web Address

Tokenisation; examples of problems from last semester’s project

Page 11: Natural Language Processing >> Tokenisation


Telephone:
Indicator word(s): Telefon, Tel, TEL, Phone, phone, tel, t, T, fon, Durchwahl, Residence, Privat
Patterns: +(area code, 2 characters)(6-20 numbers and special characters)
Characters: 0-9, -, (, ), ., [, ], +, /
Lookup: table of area codes

(Telefon|Tel|TEL|Phone|phone|tel|t|T)?[\s]*[\.]?[\s]*[:]?[\s]*[\(]?[\s]*[+]?[\s]*[\(]?[\s]*\d{0,6}[\s]*[\)]?[\s]*[\.\-]?[\s]*[\(\)\d\.\-]{6,20}

Tokenisation; complexity of “easy” tokens
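As a hedged sketch, the pattern above can be tried out directly with java.util.regex (the sample numbers are invented):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Sketch: applying the telephone pattern from the slide. */
public class PhoneMatch {
    private static final Pattern PHONE = Pattern.compile(
            "(Telefon|Tel|TEL|Phone|phone|tel|t|T)?[\\s]*[\\.]?[\\s]*[:]?[\\s]*[\\(]?"
            + "[\\s]*[+]?[\\s]*[\\(]?[\\s]*\\d{0,6}[\\s]*[\\)]?[\\s]*[\\.\\-]?[\\s]*"
            + "[\\(\\)\\d\\.\\-]{6,20}");

    public static void main(String[] args) {
        String[] samples = { "Tel.: +49(6151)16-8433", "Phone +1 401.863.1000" };
        for (String s : samples) {
            Matcher m = PHONE.matcher(s);
            // find() locates the first substring the pattern accepts.
            if (m.find()) System.out.println("matched: " + m.group());
        }
    }
}

Note that the final character class does not allow blanks, so a number written with internal spaces is only matched up to the first space.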

Page 12: Natural Language Processing >> Tokenisation


Telefax:
Indicator word(s): Telefax, Fax, FAX, fax, f, F, PC-Fax
Patterns: +(area code, 2 characters)(6-20 numbers and special characters)
Characters: 0-9, -, (, ), ., [, ], +, /
Lookup: table of area codes

Tokenisation; complexity of “easy” tokens

Page 13: Natural Language Processing >> Tokenisation


Mobile:
Indicator word(s): Mobile, mobile, MOBIL, MOBILE, m, M, Handy, Mob
Patterns: +(area code, 2 characters)(mobile network code, 3 characters)(numbers and special characters)
Characters: 0-9, -, (, ), ., [, ], +, /
Lookup: table of area codes

Tokenisation; complexity of “easy” tokens

Page 14: Natural Language Processing >> Tokenisation


E-mail:
Indicator word(s): e-mail, E-mail, Email, E, Mail, eMail
Patterns: (numbers and characters)@(numbers and characters).(top-level domain)
Characters: a-z, 0-9, ., -
Lookup: list of top-level domains

Tokenisation; complexity of “easy” tokens
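The slide lists only the building blocks; below is a hedged sketch of one possible rendering as a regex (the project's actual expression is not shown on the slide):

import java.util.regex.Pattern;

/* Sketch only: one possible rendering of the e-mail pattern above. */
public class EmailMatch {
    private static final Pattern EMAIL = Pattern.compile(
            "[a-z0-9.\\-]+@[a-z0-9.\\-]+\\.[a-z]{2,6}",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        // example.com is a reserved placeholder domain.
        System.out.println(EMAIL.matcher("info@example.com").matches());
    }
}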

Page 15: Natural Language Processing >> Tokenisation


Web Address:
Indicator word(s): Web, URL, Homepage, Internet, web
Patterns: (http://)(www.)(numbers and (special) characters).(top-level domain)
Characters: -
Lookup: -

Tokenisation; complexity of “easy” tokens

Page 16: Natural Language Processing >> Tokenisation


Forename and Surname:
Indicator word(s): -
Patterns: (Forename Middle name Surname)
• Forename and surname can be composed of two names, divided by "-"
• Middle name can be abbreviated
Characters: -
Lookup: names database

Tokenisation; complexity of “easy” tokens

Page 17: Natural Language Processing >> Tokenisation


Company name:
Indicator word(s): GmbH, Ltd, LTD, Inc., company, AG, Aktiengesellschaft, Fachhochschule, Universität, Beratung, e.K., Versicherungsdienst, technik, Co., & Partner, Co.KG, &, Solutions, System, Systems

Patterns: -
Characters: -
Lookup: company keywords database
Cross-check with the domain name of the web and e-mail addresses.
The company name often occurs multiple times, because of recognised logos etc.

Tokenisation; complexity of “easy” tokens

Page 18: Natural Language Processing >> Tokenisation


German Address:
Zip: 5-digit number (zip identification sequence), followed by the city name and preceded by either DE-, D-, or nothing.

City: string of symbols that is preceded by the zip. Can be identified by cross-matching the zip with the city.

Street: consists of letters and special symbols, followed by digits and letters; preceded by D- or DE- in combination with the zip. Can be identified by cross-matching the zip with the street, or the city with the street.

Tokenisation; complexity of “easy” tokens

Page 19: Natural Language Processing >> Tokenisation


US Address:
Indicator word(s): PASEO, PLAZA, PASAJE, CARR, PARQUE, VEREDA, VISTA, VIA, CALLEJON, PATIO, BLVD, CAMINO

A ZIP code can have the format 99999 or 99999-9999.

Characters: -
Lookup: ZIP code database
See also http://www.usps.com/ncsc/addressstds/deliveryaddress.htm

Tokenisation; complexity of “easy” tokens
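The ZIP code format translates directly into a small pattern; a sketch:

import java.util.regex.Pattern;

/* Sketch: US ZIP codes of the form 99999 or 99999-9999. */
public class ZipMatch {
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    public static void main(String[] args) {
        System.out.println(ZIP.matcher("02912").matches());      // true
        System.out.println(ZIP.matcher("02912-1234").matches()); // true
        System.out.println(ZIP.matcher("2912").matches());       // false
    }
}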

Page 20: Natural Language Processing >> Tokenisation


• A token is a categorized block of text.

• The block of text corresponding to the token is known as a lexeme.

Example
• “.”, “?”, “!” may all be categorized as punctuation tokens. However, they are all different lexemes.

• “321.56”, “12”, “19.9” may all be categorized as number tokens. However, they are all different lexemes.

Lexemes vs. tokens
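In code, the distinction is simply a pair of category and text; a sketch using a Java 16 record (the type names are invented for illustration):

/* Sketch: a token is a categorized block of text; the text is the lexeme. */
record Token(String type, String lexeme) { }

class LexemeDemo {
    public static void main(String[] args) {
        Token[] tokens = {
                new Token("PUNCT", "."), new Token("PUNCT", "?"),
                new Token("NUMBER", "321.56"), new Token("NUMBER", "12"),
        };
        // Same token category, different lexemes.
        for (Token t : tokens)
            System.out.println(t.type() + " <- " + t.lexeme());
    }
}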

Page 21: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 22: Natural Language Processing >> Tokenisation


• Breaking a text into sentences.

• Requires an understanding of the various uses of punctuation characters in a language.

• The boundaries between sentences need to be recognised.

• The boundaries occur between words. (“Sentence boundary detection”)

• At first sight, this seems simple:
  • Can’t we just search for “.”, “?”, “!”?
  • And sometimes “:”, “;”?

Sentence segmentation

Page 23: Natural Language Processing >> Tokenisation


Counter examples that make segmentation difficult:

• Direct speech: “Oh my goodness!”, he said, while listening to the bad news.

• Abbreviations:
  • Dogs, cats, budgies, fish, etc. all belong to the category of ‘pets’.
  • “The contemporary viewer may simply ogle the vast wooded vistas rising up from the Saguenay River and Lac St. Jean, standing in for the St. Lawrence River.”
  • “The firm said it plans to sublease its current headquarters at 55 Water St. A spokesman declined to elaborate.”

• Ellipsis: text…text

Sentence segmentation … BUT…

Page 24: Natural Language Processing >> Tokenisation


Is a simple rule not sufficient?

delim = “.” | “!” | “?”
IF (right context = delim + space + capital letter OR
    delim + quote + space + capital letter OR
    delim + space + quote + capital letter)
THEN sentence boundary

Sentence segmentation … rules
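A sketch of the same rule with java.util.regex, assuming plain ASCII quotes:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Sketch: the IF/THEN rule above as a single regular expression. */
public class BoundaryRule {
    // delim + space + capital | delim + quote + space + capital
    //                         | delim + space + quote + capital
    private static final Pattern BOUNDARY =
            Pattern.compile("[.!?](?:\" | \"| )(?=[A-Z])");

    public static void main(String[] args) {
        String text = "He left. \"Why?\" No one knew. The end.";
        Matcher m = BOUNDARY.matcher(text);
        while (m.find())
            System.out.println("boundary after offset " + m.start());
    }
}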

Page 25: Natural Language Processing >> Tokenisation


• If a period preceding a space is used as an indication of sentence boundaries, then one can recognise about 90% of the periods which end a sentence in the Brown corpus (http://en.wikipedia.org/wiki/Brown_Corpus).
• One can get quite far by using simple regular expressions, without using a list of abbreviations.
• Let us assume three kinds of abbreviations in English:

  A., B., C.         [A-Za-z]\.
  U.S., m.p.h.       [A-Za-z]\.([A-Za-z]\.)+
  Mr., St., Assn.    [A-Z][bcdfghj-np-tvxz]+\.

• By using these two simple methods one can correctly recognise about 98% of the sentence boundaries in the Brown corpus.

A simple sentence segmentation
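The three patterns combine into a single abbreviation test; a sketch:

import java.util.regex.Pattern;

/* Sketch: the three abbreviation classes from the slide as one pattern. */
public class AbbrevTest {
    private static final Pattern ABBREV = Pattern.compile(
            "[A-Za-z]\\."                        // A., B., C.
            + "|[A-Za-z]\\.([A-Za-z]\\.)+"       // U.S., m.p.h.
            + "|[A-Z][bcdfghj-np-tvxz]+\\.");    // Mr., St., Assn.

    public static void main(String[] args) {
        for (String s : new String[] { "A.", "U.S.", "Mr.", "Assn.", "Boston." })
            System.out.println(s + " -> " + ABBREV.matcher(s).matches());
    }
}

A period that closes such an abbreviation is then not counted as a sentence boundary on its own.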

Page 26: Natural Language Processing >> Tokenisation


• The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kucera and W. Nelson Francis at Brown University, Providence, Rhode Island, as a general corpus (text collection) in the field of corpus linguistics.
• The initial Brown Corpus had only the words themselves, plus a location identifier for each.
• Over the following several years part-of-speech tags were applied. The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the basis for many later corpora such as the Lancaster-Oslo/Bergen Corpus. The tagged corpus enabled far more sophisticated statistical analysis.

(Brown Corpus (http://en.wikipedia.org/wiki/Brown_Corpus))

Page 27: Natural Language Processing >> Tokenisation


Part-of-speech tags used

Tag   Definition
.     sentence closer (. ; ? *)
(     left paren
)     right paren
*     not, n't
--    dash
,     comma
:     colon
ABL   pre-qualifier (quite, rather)
ABN   pre-quantifier (half, all)
ABX   pre-quantifier (both)
AP    post-determiner (many, several, next)
AT    article (a, the, no)
BE    be
BED   were

(Brown Corpus)

Page 28: Natural Language Processing >> Tokenisation


Part-of-speech tags used

Tag   Definition
BEDZ  was
BEG   being
BEM   am
BEN   been
BER   are, art
BEZ   is
CC    coordinating conjunction (and, or)
CD    cardinal numeral (one, two, 2, etc.)
CS    subordinating conjunction (if, although)
DO    do
DOD   did
DOZ   does
DT    singular determiner/quantifier (this, that)
DTI   singular or plural determiner/quantifier (some, any)
DTS   plural determiner (these, those)

(Brown Corpus)

Page 29: Natural Language Processing >> Tokenisation


Part-of-speech tags used

Tag   Definition
DTX   determiner/double conjunction (either)
EX    existential there
FW    foreign word (hyphenated before regular tag)
HV    have
HVD   had (past tense)
HVG   having
HVN   had (past participle)

etc.

(Brown Corpus)

Page 30: Natural Language Processing >> Tokenisation


Tag-Sets

• A tag-set is the inventory of tags that a tagger can assign. Tag-sets differ considerably in size:

Tag-Set                                   Number of Tags
• Brown Corpus                            87
• Lancaster-Oslo/Bergen                   135
• Lancaster UCREL                         165
• London-Lund Corpus of Spoken English    197
• Penn Treebank                           36 + 12
• IBM / Microsoft                         8

(Different taggers)

Page 31: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 32: Natural Language Processing >> Tokenisation


• A lexical analyser is a program which breaks a text into lexemes (tokens).

• A program which generates a lexical analyser is called a lexical analyser generator.

• Examples: Lex/Flex/JFlex (http://jflex.de/)

• The user defines a set of regular expression patterns.
• The program generates finite-state automata.
• The automata are used to recognise tokens.

Lexical Analyser

Page 33: Natural Language Processing >> Tokenisation


%% A finite-state automaton recognising (a|b)*abb

%public
%class Simple
%standalone
%unicode

%{
  String str = "Found: ";
%}

Pattern = (a|b)*abb

%%

{Pattern}   { System.out.println(str + " " + yytext()); }
.           { ; }

JFlex example (http://jflex.de/manual.html)
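Assuming the specification is saved as Simple.flex, running jflex Simple.flex generates Simple.java; because of the %standalone directive the generated class has a main method, so after compiling with javac it can be run directly on an input file (java Simple input.txt) and prints every match of (a|b)*abb.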

Page 34: Natural Language Processing >> Tokenisation


%% A good tokeniser for English?

%public
%class EngGood
%standalone
%unicode

%{
%}

WhiteSpace = [ \t\f\n]
Lower      = [a-z]
Upper      = [A-Z]
EngChar    = {Upper}|{Lower}
EngWord    = {EngChar}+

%%

{WhiteSpace}  { ; }
{EngWord}     { System.out.println(yytext()); }
.             { System.out.println(yytext()); }

JFlex example
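Note that this spec answers its own question in the negative: since {EngWord} only covers [A-Za-z]+, an input such as "can't" is emitted as the three tokens "can", "'" and "t", and "$22.50" falls apart character by character in the same way.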

Page 35: Natural Language Processing >> Tokenisation


1 Tokenisation

2 Sentence segmentation

3 Lexical analyser

4 Unix/Linux tools

content

Page 36: Natural Language Processing >> Tokenisation


Various Unix tools exist which simplify the tokenisation and processing of texts:

• grep (global regular expression print)
• tr (translate characters)
• sed (stream editor) (http://en.wikipedia.org/wiki/Sed)

• Programming languages suited to text searching:
  • Perl

Unix/Linux tools

Page 37: Natural Language Processing >> Tokenisation


grep is a command line text search utility originally written for Unix. The name is taken from the first letters in global / regular expression / print.A backronym of the unusual name also exists in the form of Generalized Regular Expression Parser. The grep command searches files or standard input globally for lines matching a given regular expression, and prints them to the program's standard output.

This is an example of a common grep usage:

grep apple fruitlist.txt

In this case, grep prints all lines containing apple from the file fruitlist.txt, regardless of word boundaries; therefore lines containing pineapple or apples are also printed. The grep command is case sensitive by default, so this example's output does not include lines containing Apple (with a capital A) unless they also contain apple.

Grep (http://www.panix.com/~elflord/unix/grep.html)
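To match Apple as well, grep's -i option makes the search case-insensitive:

grep -i apple fruitlist.txt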

Page 38: Natural Language Processing >> Tokenisation


To search all .txt files in a directory for apple in a shell that supports globbing, use an asterisk in place of the file name:

grep apple *.txt

Regular expressions can be used to match more complicated queries. The following prints all lines in the file that begin with the letter a, followed by any one character, then the letters ple.

grep ^a.ple fruitlist.txt

Grep (http://www.panix.com/~elflord/unix/grep.html)

Page 39: Natural Language Processing >> Tokenisation


NAME
     tr - translate characters

SYNOPSIS
     tr [ -cds ] [ string1 [ string2 ] ]

DESCRIPTION
     tr copies the standard input to the standard output with substitution or deletion of selected characters (runes).

     Input characters found in string1 are mapped into the corresponding characters of string2. When string2 is short it is padded to the length of string1 by duplicating its last character. Any combination of the options -cds may be used:

     -c   Complement string1: replace it with a lexicographically ordered list of all other characters.

     -d   Delete from input all characters in string1.

     -s   Squeeze repeated output characters that occur in string2 to single characters.

tr (http://linux.die.net/man/1/tr) or

(http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=%2Frzahz%2Ftr.htm)

Page 40: Natural Language Processing >> Tokenisation


EXAMPLES

Replace all upper-case ASCII letters by lower-case.

tr A-Z a-z <mixed >lower

Create a list of all the words in file1, one per line, in file2, where a word is taken to be a maximal string of alphabetic characters. String2 is given as a quoted newline.

tr -cs A-Za-z '\n' <file1 >file2

tr (http://linux.die.net/man/1/tr) or

(http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=%2Frzahz%2Ftr.htm)
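As an aside, this one-word-per-line output is a classic starting point for corpus statistics; piped through sort and uniq -c it yields a word-frequency list:

tr -cs A-Za-z '\n' <file1 | sort | uniq -c | sort -rn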