Problem 1: Word Segmentation

whatdoesthisreferto

what does this refer to

Application: Chinese Text

Application: Internet Domain Names

www.visitbritain.com

Visit Britain

Statistical Machine Learning

Best segmentation = the one with the highest probability

Probability of a segmentation = P(first word) × P(rest of segmentation)

P(word) = estimated by counting

Statistical Machine Learning

choosespain

Choose Spain

Chooses pain

P(“Choose Spain”) > P(“Chooses Pain”)

Example

segment(“nowisthetime…”) = the best of these candidates (Pf = P(first word), Pr = P(rest of segmentation)):

Pf(“n”) × Pr(“owisthetime…”)

Pf(“no”) × Pr(“wisthetime…”)

Pf(“now”) × Pr(“isthetime…”)

Pf(“nowi”) × Pr(“sthetime…”)

……
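
The recursion in this example can be sketched in a few lines of Python. This is a minimal illustration, not the program from the talk: the `COUNTS` table holds made-up numbers standing in for real corpus counts, and the penalty for unseen words in `p_word` and the `MAX_WORD_LEN` cap are assumptions added so the sketch runs on its own.

```python
from functools import lru_cache

# Hypothetical toy counts, for illustration only; a real system would
# count words in a large corpus.
COUNTS = {
    "now": 50, "is": 100, "the": 200, "time": 40, "no": 30,
    "choose": 10, "spain": 5, "chooses": 2, "pain": 4,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # assumed cap on the length of the first word


def p_word(word):
    """P(word), estimated by counting."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    # Assumption: unseen words get a tiny probability that shrinks with length.
    return 1.0 / (TOTAL * 10 ** len(word))


@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, words) for the most probable segmentation."""
    if not text:
        return 1.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        p_rest, rest_words = segment(rest)  # cached, so each suffix is solved once
        candidates.append((p_word(first) * p_rest, (first,) + rest_words))
    return max(candidates)


print(segment("nowisthetime")[1])  # ('now', 'is', 'the', 'time')
print(segment("choosespain")[1])   # ('choose', 'spain')
```

With these toy counts, “choose spain” beats “chooses pain” because 10 × 5 > 2 × 4. For long inputs, summing log probabilities instead of multiplying raw probabilities would avoid floating-point underflow.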

The Complete Program

Performance

Accuracy = 98%

Trained on 1.7B words (English)

Typical errors:

baseratesoughtto → base rate sought to

smallandinsignificant → small and in significant

ginormousego → g in or mouse go

Some Results

whorepresents.com → [“who”, “represents”]

therapistfinder.com → [“therapist”, “finder”]

expertsexchange.com → [“experts”, “exchange”]

speedofart.net → [“speed”, “of”, “art”]

penisland.com → error: expected [“pen”, “island”]

Problem 2: Spelling Correction

Mehran Salami

Typical word processor: Tehran Salami

But Google can …

Statistical Machine Learning

Best correction = the one with the highest probability

Probability of a spelling correction c = P(c as a word) × P(original is a typo for c)

P(c as a word) = estimated by counting

P(original is a typo for c) = estimated from the number of changes (fewer changes = more likely)
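
The model above can be sketched as follows. This is a minimal illustration, not the program from the talk: the `COUNTS` table is made up, candidates are limited to at most two single-character changes, and the factor `P_CHANGE` applied per change is an assumed constant rather than an estimate from real typo data.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical toy counts, for illustration only.
COUNTS = {"the": 500, "spelling": 30, "tehran": 40, "mehran": 3}
TOTAL = sum(COUNTS.values())
P_CHANGE = 0.01  # assumed probability factor per single-character change


def edits1(word):
    """All strings one change away: delete, transpose, replace, or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the known word c maximising P(c as a word) x P(word is a typo for c)."""
    one = edits1(word)
    two = {e2 for e1 in one for e2 in edits1(e1)}
    changes = {}
    # Fill in from most to fewest changes so the smallest count wins.
    for n, pool in ((2, two), (1, one), (0, {word})):
        for candidate in pool:
            if candidate in COUNTS:
                changes[candidate] = n
    if not changes:
        return word  # nothing known within two changes; leave the word alone
    return max(changes, key=lambda c: COUNTS[c] / TOTAL * P_CHANGE ** changes[c])


print(correct("speling"))  # spelling
print(correct("teh"))      # the
print(correct("mehran"))   # mehran
```

Because “mehran” appears in the toy counts, it is left unchanged: zero changes to a rare word (3/573) outweighs one change to a more common word (40/573 × 0.01). A dictionary without that entry would return “tehran” instead.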

The Complete Program

Problem 3: Speech Recognition

An informal, incomplete grammar of the English language runs over 1,700 pages.

Invariably, simple models and a lot of data trump more elaborate models based on less data.

Problem 3: Speech Recognition

If you have a lot of data, memorisation is a good policy.

For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.

Problem 3: Speech Recognition

“Every time I fire a linguist, the performance of our speech recognition system goes up.”

--- Fred Jelinek

Problem 4: Machine Translation

Conclusion

(Statistical) [Machine] Learning Is

The Ultimate Agile Development Tool

Peter Norvig (Director of Research, Google)