
INF1-CG 2013 Lecture 22: Learning from data: Noisy channels and Bayes' theorem

Henry S. Thompson

12 March 2013

1. The rise of statistical models

Practical speech and language processing systems have made huge advances over the last 15 years or so

• But not in the way many people had expected, say, 40 years ago

Rather than building computer systems based on linguistic and psychological theory

• For problems such as
  ◦ Speech recognition
  ◦ Machine translation
  ◦ Text summarisation
  ◦ Information extraction

Successful systems have been based on statistical models

The systems depend on

• Sophisticated applied mathematics
• Large corpora (machine-readable collections) of real language data

2. Rationalism versus empiricism revisited

The practical success of this approach has influenced cognitive science

• And on a number of fronts new alternatives to 'rule-based' or 'symbolic' approaches are being explored

A literal interpretation of existing practical systems doesn't make sense

• The size of the datasets required is implausible
• For example, a good statistical MT system is likely to have been trained on 100s of millions of words of language data

But the principles have proved inspiring

We're going to look at one widely-used technique

• Noisy-channel decoding using Bayes' theorem

Applied to one linguistic phenomenon

• Word segmentation

in order to get an idea of how this works


3. The problem: finding word boundaries

Many ancient writing systems gave no indication of where one word stopped and the next began:

This file has been identified as being free of known restrictions under copyright law, including all related and neighboring rights. Reproduction in Edward Maunde Thompson, An Introduction to Greek and Latin Paleography (Oxford: Clarendon, 1912). Web source: Paul Halsall, Byzantine Paleography


This file has been identified as being free of known restrictions under copyright law, including all related and neighboring rights. Reproduction in Edward Maunde Thompson, An Introduction to Greek and Latin Paleography (Oxford: Clarendon, 1912). Web source: Paul Halsall, Byzantine Paleography, word boundaries added by HST

Courtesy of BlindMadDog


Courtesy of BlindMadDog, word boundaries added by HST

Some modern writing systems don't either

Courtesy of Mainichi Shimbun


Courtesy of Mainichi Shimbun, word boundaries added by HST

Users of writing systems which do show word boundaries find this challenging

• And may even think it's unnatural

But in fact most language is like that

• Spoken language, that is

4. Speech is not what you think

Just as our experience of vision is not what is there to see

• So our experience of hearing is not what is there to hear
• Particularly where language is concerned

Our perception of speech is hugely misleading

• We hear distinct words, as if there were breaks between each one
• But this is not actually the case at the level of the actual sound

For example, here's a display of the sound corresponding to a short phrase


How many words do you think there are?

5. Segmenting speech, cont'd

Here's an audio example for you to listen to

• Try to count the number of words: Some speech
• Ah, that was backwards -- easier this way? Some speech, properly this time

Here's another example, closer to home

• First, backwards: Some more speech
• Now, forwards: Some more speech, properly this time

Despite this inconvenient property of normal spoken language

• It is evidently the case that
  ◦ people can easily wreak a nice beach
  ◦ Sorry, . . . recognise speech

• As long as it's what they were expecting

6. The noisy channel model

When we have a sequence of observations

• Which are evidence for some original, underlying, sequence

And we want to reconstruct the original

• So, in the case of spelling correction, we have the mis-spelled observations (a sequence of letters)

• And we want the original correct sequence

Or, in the case of speech recognition

• We have the acoustic waveform
• And we want the original word sequence

We apply the noisy channel model

Originally this was meant literally:


courtesy of Henry S. Thompson

7. Probability: concepts and notations

Using the noisy channel model is all about working with probabilities

• So we need to introduce some concepts and notations

We will write P(X) for the probability of some event or state of affairs X

• Probabilities are always expressed as numbers between 0 (never happen/be the case) to 1 (certain to happen/be the case)

• Probability estimates are often based on frequency observations
  ◦ If we observe some process N times, and a particular outcome occurs n times, then the relative frequency of occurrence of that outcome, n/N, is also an estimate of the probability of that outcome on another occasion (see the sketch below)
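As an aside, here is a minimal Python sketch (not part of the original lecture; the weather data are made up) of estimating probabilities as relative frequencies:

```python
from collections import Counter

def relative_frequency(observations):
    """Estimate each outcome's probability as its relative frequency n/N."""
    counts = Counter(observations)
    total = len(observations)
    return {outcome: n / total for outcome, n in counts.items()}

# 365 days of (made-up) weather observations
days = ["snow"] * 10 + ["no snow"] * 355
print(relative_frequency(days))   # {'snow': 0.0274..., 'no snow': 0.9726...}
```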

8. Joint and conditional probability

We can distinguish two important cases involving two (or more) events or states of affairs:

joint probability
  The probability of two (or more) events/states of affairs both occurring/being the case, written P(X,Y) (or P(X,Y,...))

conditional probability
  The probability of some event/state of affairs X given that we 'already' know one other event/state of affairs Y (or more) to have occurred/be the case, written P(X|Y) (or P(X|Y,...))

For example

• The probability of it snowing in Edinburgh and of the temperature in Edinburgh being greater than 0°C

• The probability of it snowing in Edinburgh given that the temperature is greater than 0°C

Those two states of affairs are clearly related

• In general we need to distinguish independent from dependent events/states of affairs

The joint probability of two events, P(X,Y), is just P(X)P(Y|X) (or, by symmetry, P(Y)P(X|Y))

• So our two examples are related: P(snowInEdin,aboveFreezingInEdin) = P(snowInEdin)P(aboveFreezingInEdin|snowInEdin)

If X and Y are independent


• P(Y|X) is just P(Y)
• the joint probability is just the product P(X)P(Y)
• which is what we expect for independent events

9. Joint and conditional probability, example

Let's see how this works for our snow in Edinburgh example

Here's the (approximate) data for 2011:

                   Temperature greater than 0?
                   No      Yes     Row total
Snow?   No         20      335     355
        Yes        8       2       10
Column total       28      337     365

Evidently, P(snow) = 10/365, P(warm) = 337/365 and P(snow,warm) = 2/365

We get conditional probabilities by dividing row entries by row totals:

• E.g. P(warm|snow) = 2/10

Or column entries by column totals:

• E.g. P(snow|warm) = 2/337

So, does P(snow,warm) = P(snow)P(warm|snow) ?

• 2/365 = 10/365 ⋅ 2/10

What about P(snow,warm) = P(warm)P(snow|warm) ?

• 2/365 = 337/365 ⋅ 2/337
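As a quick check, here is a minimal Python sketch (not part of the original lecture) that recomputes these quantities from the 2011 table above and confirms both factorisations of the joint probability:

```python
from fractions import Fraction

# 2011 contingency table: (snow?, temperature > 0?) -> number of days
counts = {("snow", "warm"): 2, ("snow", "cold"): 8,
          ("no snow", "warm"): 335, ("no snow", "cold"): 20}
total = sum(counts.values())                                     # 365

p_snow = Fraction(counts[("snow", "warm")] + counts[("snow", "cold")], total)        # 10/365
p_warm = Fraction(counts[("snow", "warm")] + counts[("no snow", "warm")], total)     # 337/365
p_snow_and_warm = Fraction(counts[("snow", "warm")], total)                          # 2/365

p_warm_given_snow = Fraction(counts[("snow", "warm")],
                             counts[("snow", "warm")] + counts[("snow", "cold")])    # 2/10
p_snow_given_warm = Fraction(counts[("snow", "warm")],
                             counts[("snow", "warm")] + counts[("no snow", "warm")]) # 2/337

assert p_snow_and_warm == p_snow * p_warm_given_snow    # P(X,Y) = P(X)P(Y|X)
assert p_snow_and_warm == p_warm * p_snow_given_warm    # P(X,Y) = P(Y)P(X|Y)
```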

10. Joint and conditional probability, example 2

For example, consider how my geek friend Hector chooses his clothing for the week:

shirts
  He has three: blue, black and brown

trousers
  He has three pairs: blue, black and brown

socks
  He has four: two black and two brown

Every Monday morning, he picks one shirt, one pair of trousers and two socks, at random

Every weekend he washes what he wore that week, and puts them back in the closet

11. Independent probabilities, example

What are the chances Hector turns up to work on Monday with matching shirt and trousers?

• Three cases: all in black, all in blue, all in brown


• What odds the first case?
  ◦ The choice of shirt is independent from the choice of trousers, so the answer is 1/3 * 1/3 == 1/9
• So the overall probability is 1/3
  ◦ 1/9 for all black plus 1/9 for all blue plus 1/9 for all brown (see the sketch below)
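A minimal Python sketch (not from the lecture) that enumerates Hector's nine equally likely shirt/trouser combinations and confirms the 1/3:

```python
from itertools import product
from fractions import Fraction

shirts = ["blue", "black", "brown"]
trousers = ["blue", "black", "brown"]

outcomes = list(product(shirts, trousers))          # 9 equally likely (shirt, trousers) pairs
matching = [(s, t) for s, t in outcomes if s == t]  # all-blue, all-black, all-brown
print(Fraction(len(matching), len(outcomes)))       # 1/3
```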

12. Conditional probability, example

Now for the socks. What are the chances he turns up to work on Monday with matching socks?

• Two cases: both black or both brown
• What odds the first case?
  ◦ The choice of second sock is not independent from the choice of first sock (why not?), so the answer is not 1/2 * 1/2 == 1/4
  ◦ We have to use the full joint probability formula
    ▪ P(X)P(Y|X)
    ▪ That is, in this case, P(firstBlack)P(secondBlack|firstBlack)
  ◦ P(firstBlack) is 1/2
  ◦ But P(secondBlack|firstBlack) is only 1/3
    ▪ Because one of the two black socks has already been picked, so only one black and two brown remain
  ◦ So the probability for two black socks is 1/2 * 1/3 == 1/6
• And since the story for two browns is the same, the overall answer is 1/6 + 1/6 == 1/3 (see the sketch below)
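Again, a minimal Python sketch (not from the lecture) that enumerates the ordered draws of two socks from the drawer and confirms the 1/3:

```python
from itertools import permutations
from fractions import Fraction

socks = ["black1", "black2", "brown1", "brown2"]

draws = list(permutations(socks, 2))                        # 12 equally likely ordered pairs
matching = [(a, b) for a, b in draws if a[:-1] == b[:-1]]   # same colour (strip the digit)
print(Fraction(len(matching), len(draws)))                  # 1/3
```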

13. The noisy channel model, cont'd

Early telegraphy, radio and telephony were noisy

The basic shape of the problem is the same:

• Given a sequence of observations
• What is the most probable sequence of source events?

Or, more formally, given o_1^n, a sequence of observations, for what sequence of source events s_1^n is P(s_1^n | o_1^n) the greatest?

This is usually written argmax_{s_1^n} P(s_1^n | o_1^n)

Claude Shannon broke this problem down into two parts:

• The probability of the original sequence itself


• The probability of the transformation caused by the channel
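To make this decomposition concrete, here is a minimal Python sketch (not from the lecture; the candidate sources and all probability values are made-up toy numbers) of a decoder that picks the source s maximising P(s) * P(o|s), which by Bayes' theorem is the same s that maximises P(s|o):

```python
# Toy noisy-channel decoder: argmax over sources s of P(s) * P(o|s)
# (the same s that maximises P(s|o), since P(o) does not depend on s)

source_prob = {"a nice beach": 0.7, "recognise speech": 0.3}   # P(s): source model (toy values)
channel_prob = {                                               # P(o|s): channel model (toy values)
    ("noisy acoustics", "a nice beach"): 0.1,
    ("noisy acoustics", "recognise speech"): 0.6,
}

def decode(observation):
    """Return the source sequence s maximising P(s) * P(o|s)."""
    return max(source_prob,
               key=lambda s: source_prob[s] * channel_prob.get((observation, s), 0.0))

print(decode("noisy acoustics"))   # 'recognise speech'
```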