Basic Text Processing
Word tokenization
Dan Jurafsky
Text Normalization
• Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
How many words?
• I do uh main- mainly business data processing
  • Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms
How many words?
they lay back on the San Francisco grass and looked at the stars and their
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)
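The counts above can be reproduced with a few lines of Python. This is a minimal sketch, assuming plain whitespace tokenization; the `San_Francisco` merge is only an illustrative way to model the "one token or two" choice and is not part of the slides. The third count (11 types) would additionally need lemmatization, for example treating they and their as one lemma, which is not attempted here.

```python
from collections import Counter

sentence = "they lay back on the San Francisco grass and looked at the stars and their"

# Whitespace tokenization: every space-separated string is one token.
tokens = sentence.split()
types = set(tokens)

# Alternative convention (an assumption for illustration): treat the
# place name "San Francisco" as a single token.
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)

print(len(tokens), len(types))                # 15 13
print(len(merged_tokens), len(merged_types))  # 14 12

# Types that occur more than once in the running text.
repeated = {w: n for w, n in Counter(tokens).items() if n > 1}
print(repeated)
```

The gap between tokens and types comes entirely from the repeated words "the" and "and".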
How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

Corpus                             Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million

Church and Gale (1990): |V| > O(N^½)
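As a rough sanity check, the rounded figures in the table can be compared against the square root of N. This is only a sketch: the bound is asymptotic, so comparing three data points with the constant factor ignored illustrates the claim rather than proving it.

```python
import math

# (N tokens, |V| types), using the rounded figures from the table above.
corpora = {
    "Switchboard phone conversations": (2_400_000, 20_000),
    "Shakespeare": (884_000, 31_000),
    "Google N-grams": (1_000_000_000_000, 13_000_000),
}

# For each corpus, compare the vocabulary size with sqrt(N).
results = {}
for name, (n_tokens, n_types) in corpora.items():
    root = math.sqrt(n_tokens)
    results[name] = n_types > root
    print(f"{name}: sqrt(N) = {root:,.0f}, |V| = {n_types:,}, |V| > sqrt(N): {n_types > root}")
```

In all three corpora the vocabulary is more than ten times larger than the square root of the token count.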
Simple Tokenization in UNIX
• (Inspired by Ken Church’s UNIX for Poets.)
• Given a text file, output the word tokens and their frequencies

    tr -sc 'A-Za-z' '\n' < shakes.txt |   # Change all non-alpha to newlines
      sort |                              # Sort in alphabetical order
      uniq -c                             # Merge and count each type

Output (two excerpts):

    1945 A
      72 AARON
      19 ABBESS
       5 ABBOT
     ...

      25 Aaron
       6 Abate
       1 Abates
       5 Abbess
       6 Abbey
       3 Abbot
     ...
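For readers more comfortable in Python, the same three stages can be sketched as follows. This is an approximation under stated assumptions, not a replacement for the shell pipeline: `word_counts` is a hypothetical helper name, and the sample string stands in for `shakes.txt`.

```python
import re
from collections import Counter

def word_counts(text):
    """Count alphabetic tokens, mirroring tr -sc 'A-Za-z' '\\n' | sort | uniq -c."""
    # Every maximal run of ASCII letters is one token; everything else separates.
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(tokens)

# Stand-in for the contents of shakes.txt (an assumption for illustration).
sample = "THE SONNETS by William Shakespeare. From fairest creatures we desire increase, THE"

counts = word_counts(sample)

# sorted() orders by code point, so uppercase words come before lowercase ones,
# which matches the ordering visible in the output excerpts above.
for word, n in sorted(counts.items()):
    print(f"{n:7d} {word}")
```

`Counter` does the merging and counting in one pass, so the explicit sort is needed only for display.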
The first step: tokenizing
    tr -sc 'A-Za-z' '\n' < shakes.txt | head

    THE
    SONNETS
    by
    William
    Shakespeare
    From
    fairest
    creatures
    We
    ...
The second step: sorting

    tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

    A
    A
    A
    A
    A
    A
    A
    A
    A
    ...
More counting
• Merging upper and lower case

    tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

• Sorting the counts

    tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

    23243 the
    22225 i
    18618 and
    16339 to
    15687 of
    12780 a
    12163 you
    10839 my
    10005 in
     8954 d

What happened here?
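One way to investigate the unexpected "d" is to run the same letters-only rule on a few contracted forms. The sketch below assumes the cause is the apostrophe being treated as a separator; the sample line is invented for illustration and is not taken from `shakes.txt`.

```python
import re
from collections import Counter

def lowercase_counts(text):
    """Lowercase, then split on anything that is not an ASCII letter."""
    # Roughly mirrors: tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Invented sample with three apostrophe forms (an assumption for illustration).
sample = "We'd thought he'd banish'd them"

counts = lowercase_counts(sample)

# most_common() plays the role of the final sort -n -r.
for word, n in counts.most_common():
    print(f"{n:7d} {word}")
```

Every apostrophe splits its word in two, so the trailing "d" of each contraction is counted as a word of its own.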
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
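To make these choices concrete, the sketch below contrasts two regular-expression tokenizers on a few of the examples. Both patterns are illustrative assumptions, not recommended rules: `LETTERS_ONLY` splits at every non-letter, while `KEEP_INTERNAL` keeps apostrophes and hyphens that sit between letters.

```python
import re

# Policy 1: a token is any maximal run of ASCII letters.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")

# Policy 2: additionally allow an apostrophe (straight or curly) or a hyphen
# inside a token, provided letters follow it.
KEEP_INTERNAL = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

def tokenize(pattern, text):
    """Return all non-overlapping matches of pattern in text, left to right."""
    return pattern.findall(text)

for text in ["Finland’s capital", "isn’t", "Hewlett-Packard", "state-of-the-art"]:
    print(text)
    print("  letters only :", tokenize(LETTERS_ONLY, text))
    print("  keep internal:", tokenize(KEEP_INTERNAL, text))
```

Neither policy settles the harder cases such as San Francisco or m.p.h., which need a lexicon or task-specific rules.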
Tokenization: language issues
• French
  • L'ensemble → one token or two?
    • L ? L’ ? Le ?
    • Want l’ensemble to match with un ensemble
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • ‘life insurance company employee’
  • German information retrieval needs compound splitter
Tokenization: language issues
• Chinese and Japanese no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats
  • フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  • Scripts mixed in this example: Katakana, Hiragana, Kanji, Romaji
• End user can express query entirely in hiragana!
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
  • Maximum Matching (also called Greedy)
Maximum Matching Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2
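The four steps above can be sketched in Python as follows. This is a minimal illustration, not a production segmenter: `max_match` is a hypothetical name, the wordlists are tiny hand-made examples, and emitting a single character when no dictionary word matches is an added assumption, since the steps do not say what to do in that case.

```python
def max_match(text, wordlist):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = set(wordlist)
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0  # step 1: pointer at the beginning of the string
    while i < len(text):
        # Step 2: try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                break
        else:
            # Assumption: no dictionary word starts here, so emit one character.
            j = i + 1
        tokens.append(text[i:j])
        i = j  # step 3: move the pointer over the word; step 4: repeat
    return tokens

# Tiny illustrative wordlists (assumptions, not real lexicons).
english = ["the", "theta", "table", "bled", "own", "down", "there", "cat", "in", "hat"]
chinese = ["莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"]

print(max_match("thecatinthehat", english))
print(max_match("thetabledownthere", english))
print(max_match("莎拉波娃现在居住在美国东南部的佛罗里达", chinese))
```

With these wordlists the second English string comes out as theta, bled, own, there, because the greedy step prefers "theta" over "the" and never backtracks.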
Max-match segmentation illustration
• Thecatinthehat → the cat in the hat
• Thetabledownthere → intended: the table down there; max-match output: theta bled own there
• Doesn’t generally work in English!
• But works astonishingly well in Chinese
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms even better