Problem 1: Word Segmentation

whatdoesthisreferto

what does this refer to

Application: Chinese Text

Application: Internet Domain Names

www.visitbritain.com

Visit Britain

Statistical Machine Learning

Best segmentation = one with highest probability

Probability of a segmentation = P(first word) × P(rest of segmentation)

P(word) = estimated by counting
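Written out, the three statements above amount to the following. This is a sketch that assumes each word is independent of its neighbours (a unigram model). The symbols w_i, N, and count are notation introduced here, not taken from the slides.

```latex
\begin{align*}
\text{best}(\text{text}) &= \operatorname*{arg\,max}_{w_1 \dots w_n \,:\, w_1 w_2 \cdots w_n = \text{text}} P(w_1, \dots, w_n) \\
P(w_1, \dots, w_n) &= P(w_1) \times P(w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i) \\
P(w) &\approx \frac{\mathrm{count}(w)}{N}
\end{align*}
```

Here N is the total number of words in the training text and count(w) is how often w appears in it.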

Statistical Machine Learning

choosespain

Choose Spain

Chooses Pain

P(“Choose Spain”) > P(“Chooses Pain”)

Example

segment(“nowisthetime…”)

Pf(“n”) × Pr(“owisthetime…”)

Pf(“no”) × Pr(“wisthetime…”)

Pf(“now”) × Pr(“isthetime…”)

Pf(“nowi”) × Pr(“sthetime…”)

……

(Pf = P(first word), Pr = P(rest of segmentation))
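The enumeration above can be sketched in Python as follows. This is a minimal illustration, not the program shown in the talk. The COUNTS table is a hypothetical toy stand-in for counts from a large corpus, and the penalty for unseen words (shrinking by a factor of 10 per letter) is an assumption made here so that unknown strings are not ruled out entirely.

```python
from functools import lru_cache

# Hypothetical toy counts; a real model would be counted from a large corpus.
COUNTS = {
    "no": 60, "now": 50, "is": 100, "the": 200, "time": 40, "he": 30,
    "me": 20, "i": 80, "choose": 10, "chooses": 3, "spain": 8, "pain": 6,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def p_word(word):
    """P(word), estimated by counting; unseen words get a length-based penalty."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))  # assumption, not from the slides


@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, words) for the most probable segmentation of text."""
    if not text:
        return 1.0, ()
    candidates = []
    # Try every split into a first word and a remainder, as in the example above.
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        p_rest, rest_words = segment(rest)  # memoized, so each suffix is solved once
        candidates.append((p_word(first) * p_rest, (first,) + rest_words))
    return max(candidates)


if __name__ == "__main__":
    print(segment("nowisthetime")[1])
    print(segment("choosespain")[1])
```

With these toy counts, "choose" and "spain" are each more frequent than "chooses" and "pain", so the product for the first split is larger. For long inputs, summing log probabilities instead of multiplying raw probabilities would avoid floating-point underflow.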

Example

segment(“nowisthetime…”)

The Complete Program

Performance

Accuracy = 98%

Trained on 1.7B words (English)

Typical errors:

baseratesoughtto: base rate sought to

smallandinsignificant: small and in significant

ginormousego: g in or mouse go

Some Results

whorepresents.com [“who”, “represents”]

therapistfinder.com [“therapist”, “finder”]

expertsexchange.com [“experts”, “exchange”]

speedofart.net [“speed”, “of”, “art”]

penisland.com error: expected [“pen”, “island”]

Problem 2: Spelling Correction

Mehran Salami

Typical word processor: Tehran Salami

But Google can …

Statistical Machine Learning

Best correction = one with highest probability

Probability of a spelling correction c = P(c as a word) × P(original is a typo for c)

P(c as a word) = estimated by counting

P(original is a typo for c) = estimated from the number of changes (fewer changes, higher probability)
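The scoring rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program shown in the talk. COUNTS is a hypothetical toy table, and P_EDIT (a fixed penalty per single-character change) is an assumed stand-in for a real typo model. Candidates are limited to known words within two changes of the original.

```python
import string

# Hypothetical toy counts standing in for counts from a large corpus.
COUNTS = {"the": 500, "they": 80, "then": 60, "spelling": 5, "spewing": 1, "correct": 9}
TOTAL = sum(COUNTS.values())
P_EDIT = 0.01  # assumed probability factor for each single-character change


def edits1(word):
    """All strings one change away: delete, transpose, replace, or insert a letter."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word):
    """Map each known word within two changes to its smallest number of changes."""
    found = {}
    if word in COUNTS:
        found[word] = 0
    one_away = edits1(word)
    for c in one_away:
        if c in COUNTS:
            found.setdefault(c, 1)
    for e in one_away:
        for c in edits1(e):
            if c in COUNTS:
                found.setdefault(c, 2)
    return found


def correct(word):
    """Pick the candidate c maximizing P(c as a word) * P(original is a typo for c)."""
    cands = candidates(word)
    if not cands:
        return word  # no known word nearby: leave the input unchanged
    return max(cands, key=lambda c: (COUNTS[c] / TOTAL) * P_EDIT ** cands[c])


if __name__ == "__main__":
    print(correct("speling"))
    print(correct("teh"))
```

In this toy table both "spelling" and "spewing" are one change away from "speling", so the word count decides between them. For "teh", the single transposition to "the" beats the two-change candidates "then" and "they".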

The Complete Program

Problem 3: Speech Recognition

An informal, incomplete grammar of the English language runs over 1,700 pages.

Invariably, simple models and a lot of data trump more elaborate models based on less data.

Problem 3: Speech Recognition

If you have a lot of data, memorisation is a good policy.

For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.

Problem 3: Speech Recognition

“Every time I fire a linguist, the performance of our speech recognition system goes up.”

--- Fred Jelinek

Problem 4: Machine Translation

Conclusion

(Statistical) [Machine] Learning Is The Ultimate Agile Development Tool

Peter Norvig (Director of Research, Google)
