Basic Text Processing · 2018. 8. 12. · Dan Jurafsky

Basic Text Processing

Word tokenization

Dan Jurafsky

Text Normalization

•  Every NLP task needs to do text normalization:
   1.  Segmenting/tokenizing words in running text
   2.  Normalizing word formats
   3.  Segmenting sentences in running text

How many words?

•  I do uh main- mainly business data processing
   •  Fragments, filled pauses

•  Seuss’s cat in the hat is different from other cats!
   •  Lemma: same stem, part of speech, rough word sense
      •  cat and cats = same lemma
   •  Wordform: the full inflected surface form
      •  cat and cats = different wordforms

How many words?

they lay back on the San Francisco grass and looked at the stars and their

•  Type: an element of the vocabulary.
•  Token: an instance of that type in running text.
•  How many?
   •  15 tokens (or 14)
   •  13 types (or 12) (or 11?)
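The 15-token, 13-type reading can be reproduced with a short pipeline built from the same UNIX tools these slides use. This is a minimal sketch that assumes a token is any whitespace-separated string; the variable names are illustrative only. The alternative counts (14, 12, 11) depend on decisions this sketch does not make: treating San Francisco as one token, or counting they and their as one lemma.

```shell
text='they lay back on the San Francisco grass and looked at the stars and their'

# Put one whitespace-separated token on each line, then count the lines.
# (tr -d ' ' strips the padding some wc implementations add.)
n_tokens=$(printf '%s\n' "$text" | tr -s ' ' '\n' | wc -l | tr -d ' ')

# sort -u keeps one copy of each distinct token, i.e. one line per type.
n_types=$(printf '%s\n' "$text" | tr -s ' ' '\n' | sort -u | wc -l | tr -d ' ')

echo "tokens: $n_tokens, types: $n_types"
```

Only "the" and "and" repeat in this sentence, which is why the type count is two lower than the token count.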

How many words?

N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

                                   Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million

Church and Gale (1990): |V| > O(N^½)
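As a rough sanity check, the three corpora in the table can be compared against the square root of N. The sketch below is an illustration only: the function name vocab_vs_sqrt is hypothetical, the figures are the rounded ones from the table, and "Google-N-grams" is hyphenated just so awk reads it as one field. Three data points cannot establish a growth rate; they only show that |V| sits well above sqrt(N) in each case.

```shell
# Read lines of the form: corpus-name  N (tokens)  |V| (types)
# and print sqrt(N) next to |V| together with their ratio.
vocab_vs_sqrt() {
  awk '{ printf "%s: sqrt(N) = %.0f, |V| = %d, ratio = %.1f\n", $1, sqrt($2), $3, $3 / sqrt($2) }'
}

vocab_vs_sqrt <<'EOF'
Switchboard 2400000 20000
Shakespeare 884000 31000
Google-N-grams 1000000000000 13000000
EOF
```

In every row the ratio is greater than 1, consistent with the vocabulary growing faster than the square root of the number of tokens.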

Simple Tokenization in UNIX

•  (Inspired by Ken Church’s UNIX for Poets.)
•  Given a text file, output the word tokens and their frequencies

    tr -sc 'A-Za-z' '\n' < shakes.txt |   # Change all non-alpha to newlines
      sort |                              # Sort in alphabetical order
      uniq -c                             # Merge and count each type

    1945 A
      72 AARON
      19 ABBESS
       5 ABBOT
     ... ...

      25 Aaron
       6 Abate
       1 Abates
       5 Abbess
       6 Abbey
       3 Abbot
     ... ...

The first step: tokenizing

    tr -sc 'A-Za-z' '\n' < shakes.txt | head

    THE
    SONNETS
    by
    William
    Shakespeare
    From
    fairest
    creatures
    We
    ...

The second step: sorting

    tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

    A
    A
    A
    A
    A
    A
    A
    A
    A
    ...

More counting

•  Merging upper and lower case

    tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

•  Sorting the counts

    tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

    23243 the
    22225 i
    18618 and
    16339 to
    15687 of
    12780 a
    12163 you
    10839 my
    10005 in
     8954 d

What happened here?
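One likely answer can be checked directly. The apostrophe is not in A-Za-z, so tr -sc turns it into a newline, and an elided form such as belov'd is split into two tokens. The sketch below wraps the slide's two tr commands in a helper; the name tokenize and the sample word are assumptions made for this example.

```shell
# Lowercase the input, then replace every run of non-letters with a newline,
# exactly as in the pipeline above.
tokenize() {
  tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n'
}

# The apostrophe counts as a non-letter, so the trailing d becomes its own token.
printf '%s\n' "Belov'd" | tokenize
```

Run over a whole corpus with many such contractions, the stray d tokens add up, which would explain a bare d among the most frequent "words".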

Issues in Tokenization

•  Finland’s capital → Finland Finlands Finland’s ?
•  what’re, I’m, isn’t → What are, I am, is not
•  Hewlett-Packard → Hewlett Packard ?
•  state-of-the-art → state of the art ?
•  Lowercase → lower-case lowercase lower case ?
•  San Francisco → one token or two?
•  m.p.h., PhD. → ??

Tokenization: language issues

•  French
   •  L'ensemble → one token or two?
      •  L ? L’ ? Le ?
      •  Want l’ensemble to match with un ensemble

•  German noun compounds are not segmented
   •  Lebensversicherungsgesellschaftsangestellter
   •  ‘life insurance company employee’
   •  German information retrieval needs compound splitter

Tokenization: language issues

•  Chinese and Japanese have no spaces between words:
   •  莎拉波娃现在居住在美国东南部的佛罗里达。
   •  莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
   •  Sharapova now lives in US southeastern Florida

•  Further complicated in Japanese, with multiple alphabets intermingled
   •  Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Scripts intermingled in this example: Katakana, Hiragana, Kanji, Romaji

End-user can express query entirely in hiragana!

Word Tokenization in Chinese

•  Also called Word Segmentation
•  Chinese words are composed of characters
   •  Characters are generally 1 syllable and 1 morpheme.
   •  Average word is 2.4 characters long.
•  Standard baseline segmentation algorithm:
   •  Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation Algorithm

•  Given a wordlist of Chinese, and a string.
   1)  Start a pointer at the beginning of the string
   2)  Find the longest word in dictionary that matches the string starting at pointer
   3)  Move the pointer over the word in string
   4)  Go to 2
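The four steps above can be sketched as a small shell function that delegates the string handling to awk. This is a minimal illustration, not a reference implementation: the function name max_match, the toy wordlist in WORDS, and the fallback of emitting a single character when no dictionary word matches are all assumptions made for the example. It uses unspaced English input rather than Chinese so that the result does not depend on how a given awk handles multibyte characters.

```shell
# Hypothetical toy dictionary, space-separated.
WORDS='the theta table bled own down there cat in hat'

# max_match STRING WORDLIST: greedy longest-match segmentation.
max_match() {
  awk -v text="$1" -v words="$2" 'BEGIN {
    n = split(words, w, " ")
    for (k = 1; k <= n; k++) dict[w[k]] = 1

    i = 1                                  # step 1: pointer at start of string
    out = ""
    while (i <= length(text)) {            # step 4: repeat until the end
      # step 2: try the longest candidate first; if nothing in the
      # dictionary matches, len ends at 1 (single-character fallback)
      for (len = length(text) - i + 1; len > 1; len--)
        if (substr(text, i, len) in dict) break
      out = out (out == "" ? "" : " ") substr(text, i, len)
      i += len                             # step 3: move pointer past the word
    }
    print out
  }'
}

max_match thecatinthehat "$WORDS"
max_match thetabledownthere "$WORDS"
```

With this wordlist the first call prints "the cat in the hat", while the second prints "theta bled own there": the greedy choice of the longest word at each position commits to "theta" and never reconsiders it.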

Max-match segmentation illustration

•  Thecatinthehat → the cat in the hat
•  Thetabledownthere → the table down there (intended)
                       theta bled own there (max-match result)

•  Doesn’t generally work in English!

•  But works astonishingly well in Chinese
   •  莎拉波娃现在居住在美国东南部的佛罗里达。
   •  莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达

•  Modern probabilistic segmentation algorithms even better
