INF1-CG 2013 Lecture 22: Learning from data: Noisy channels and Bayes' theorem
Henry S. Thompson
12 March 2013
1. The rise of statistical models
Practical speech and language processing systems have made huge advances over the last 15 years or so
• But not in the way many people had expected, say, 40 years ago
Rather than building computer systems based on linguistic and psychological theory
• For problems such as
  ◦ Speech recognition
  ◦ Machine translation
  ◦ Text summarisation
  ◦ Information extraction
Successful systems have been based on statistical models
The systems depend on
• Sophisticated applied mathematics
• Large corpora (machine-readable collections) of real language data
2. Rationalism versus empiricism revisited
The practical success of this approach has influenced cognitive science
• And on a number of fronts new alternatives to 'rule-based' or 'symbolic' approaches are being explored
A literal interpretation of existing practical systems doesn't make sense
• The size of the datasets required is implausible
• For example, a good statistical MT system is likely to have been trained on 100s of millions of words of language data
But the principles have proved inspiring
We're going to look at one widely-used technique
• Noisy-channel decoding using Bayes' theorem
Applied to one linguistic phenomenon
• Word segmentation
in order to get an idea of how this works
3. The problem: finding word boundaries
Many ancient writing systems gave no indication of where one word stopped and the next began:
Reproduction from Edward Maunde Thompson, An Introduction to Greek and Latin Paleography (Oxford: Clarendon, 1912); web source: Paul Halsall, Byzantine Paleography
The same image, word boundaries added by HST
Courtesy of BlindMadDog
Courtesy of BlindMadDog, word boundaries added by HST
Some modern writing systems don't either
Courtesy of Mainichi Shimbun
Courtesy of Mainichi Shimbun, word boundaries added by HST
Users of writing systems which do show word boundaries find this challenging
• And may even think it's unnatural
But in fact most language is like that
• Spoken language, that is
4. Speech is not what you think
Just as our experience of vision is not what is there to see
• So our experience of hearing is not what is there to hear
• Particularly where language is concerned
Our perception of speech is hugely misleading
• We hear distinct words, as if there were breaks between each one
• But this is not actually the case at the level of the actual sound
For example, here's a display of the sound corresponding to a short phrase
How many words do you think there are?
5. Segmenting speech, cont'd
Here's an audio example for you to listen to
• Try to count the number of words: Some speech
• Ah, that was backwards -- easier this way? Some speech, properly this time
Here's another example, closer to home
• First, backwards: Some more speech
• Now, forwards: Some more speech, properly this time
Despite this inconvenient property of normal spoken language
• It is evidently the case that
  ◦ people can easily wreck a nice beach
  ◦ Sorry, . . . recognise speech
• As long as it's what they were expecting
6. The noisy channel model
When we have a sequence of observations
• Which are evidence for some original, underlying, sequence
And we want to reconstruct the original
• So, in the case of spelling correction, we have the mis-spelled observations (a sequence of letters)
• And we want the original correct sequence
Or, in the case of speech recognition
• We have the acoustic waveform
• And we want the original word sequence
We apply the noisy channel model
Originally this was meant literally:
courtesy of Henry S. Thompson
7. Probability: concepts and notations
Using the noisy channel model is all about working with probabilities
• So we need to introduce some concepts and notations
We will write P(X) for the probability of some event or state of affairs X
• Probabilities are always expressed as numbers between 0 (never happen/be the case) and 1 (certain to happen/be the case)
• Probability estimates are often based on frequency observations
  ◦ If we observe some process N times, and a particular outcome occurs n times, then the relative frequency of occurrence of that outcome, n/N, is also an estimate of the probability of that outcome on another occasion
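The relative-frequency estimate above can be sketched in a few lines of Python; the coin-toss data here is a made-up illustration, not from the lecture:

```python
# Estimating a probability from observed frequencies: if an outcome
# occurs n times in N observations, n/N estimates its probability.
from collections import Counter

observations = list("HTHHTHTTHH")  # ten observations of a coin-toss process
counts = Counter(observations)
N = len(observations)

p_heads = counts["H"] / N  # relative frequency of "H": 6/10
print(p_heads)             # 0.6
```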
8. Joint and conditional probability
We can distinguish two important cases involving two (or more) events or states of affairs:
joint probability
  The probability of two (or more) events/states of affairs both occurring/being the case, written P(X,Y) (or P(X,Y,...))
conditional probability
  The probability of some event/state of affairs X given that we 'already' know one other event/state of affairs Y (or more) to have occurred/be the case, written P(X|Y) (or P(X|Y,...))
For example
• The probability of it snowing in Edinburgh and of the temperature in Edinburgh being greater than 0C
• The probability of it snowing in Edinburgh given that the temperature is greater than 0C
Those two states of affairs are clearly related
• In general we need to distinguish independent from dependent events/states of affairs
The joint probability of two events, P(X,Y), is just P(X)P(Y|X) (or, by symmetry, P(Y)P(X|Y))
• So our two examples are related:
  P(snowInEdin,aboveFreezingInEdin) = P(snowInEdin)P(aboveFreezingInEdin|snowInEdin)
If X and Y are independent
• P(Y|X) is just P(Y)
• the joint probability is just the product P(X)P(Y)
• which is what we expect for independent events
9. Joint and conditional probability, example
Let's see how this works for our snow in Edinburgh example
Here's the (approximate) data for 2011:
                  Temperature greater than 0?
                  No    Yes   Row total
Snow?   No        20    335   355
        Yes        8      2    10
Column total      28    337   365
Evidently, P(snow) = 10/365, P(warm) = 337/365 and P(snow,warm) = 2/365
We get conditional probabilities by dividing row entries by row totals:
• E.g. P(warm|snow) = 2/10
Or column entries by column totals:
• E.g. P(snow|warm) = 2/337
So, does P(snow,warm) = P(snow)P(warm|snow) ?
• 2/365 = 10/365 ⋅ 2/10
What about P(snow,warm) = P(warm)P(snow|warm) ?
• 2/365 = 337/365 ⋅ 2/337
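The check above can be replayed mechanically from the table's counts; this is a minimal sketch using the 2011 figures given in the table:

```python
# Counts from the 2011 Edinburgh snow/temperature table (365 days)
counts = {("snow", "warm"): 2, ("snow", "cold"): 8,
          ("no_snow", "warm"): 335, ("no_snow", "cold"): 20}
total = sum(counts.values())  # 365

# Marginal and joint probabilities
p_snow = (counts[("snow", "warm")] + counts[("snow", "cold")]) / total     # 10/365
p_warm = (counts[("snow", "warm")] + counts[("no_snow", "warm")]) / total  # 337/365
p_snow_and_warm = counts[("snow", "warm")] / total                         # 2/365

# Conditional probabilities: cell count divided by the relevant marginal count
p_warm_given_snow = counts[("snow", "warm")] / 10    # 2/10
p_snow_given_warm = counts[("snow", "warm")] / 337   # 2/337

# Both factorisations of the joint probability agree
assert abs(p_snow_and_warm - p_snow * p_warm_given_snow) < 1e-12
assert abs(p_snow_and_warm - p_warm * p_snow_given_warm) < 1e-12
```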
10. Joint and conditional probability, example 2
For example, consider how my geek friend Hector chooses his clothing for the week:
shirts
  He has three: blue, black and brown
trousers
  He has three pairs: blue, black and brown
socks
  He has four: two black and two brown
Every Monday morning, he picks one shirt, one pair of trousers and two socks, at random
Every weekend he washes what he wore that week, and puts them back in the closet
11. Independent probabilities, example
What are the chances Hector turns up to work on Monday with matching shirt and trousers?
• Three cases: all in black, all in blue, all in brown
• What odds the first case?
  ◦ The choice of shirt is independent from the choice of trousers, so the answer is 1/3 * 1/3 == 1/9
• So the overall probability is 1/3
  ◦ 1/9 for all black plus 1/9 for all blue plus 1/9 for all brown
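Because the shirt and trousers choices are independent, we can confirm the 1/3 by brute-force enumeration of the equally likely outcomes; a small sketch:

```python
from itertools import product

colours = ["blue", "black", "brown"]

# Enumerate every equally likely (shirt, trousers) pair: 3 * 3 = 9 outcomes
outcomes = list(product(colours, colours))
matching = [pair for pair in outcomes if pair[0] == pair[1]]

p_match = len(matching) / len(outcomes)
print(p_match)  # 3 matching pairs out of 9 -> 0.3333...
```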
12. Conditional probability, example
Now for the socks. What are the chances he turns up to work on Monday with matching socks?
• Two cases: both black or both brown
• What odds the first case?
  ◦ The choice of second sock is not independent from the choice of first sock (why not?), so the answer is not 1/2 * 1/2 == 1/4
  ◦ We have to use the full joint probability formula
    ▪ P(X)P(Y|X)
    ▪ That is, in this case, P(firstBlack)P(secondBlack|firstBlack)
  ◦ P(firstBlack) is 1/2
  ◦ But P(secondBlack|firstBlack) is only 1/3
    ▪ Because one of the two black socks has already been picked, so only one black and two brown remain
  ◦ So the probability for two black socks is 1/2 * 1/3 == 1/6
• And since the story for two browns is the same, the overall answer is 1/6 + 1/6 == 1/3
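The dependence between the two draws can again be checked by enumerating every equally likely ordered draw of two distinct socks; a small sketch, labelling the individual socks:

```python
from itertools import permutations

socks = ["black1", "black2", "brown1", "brown2"]

# Every equally likely ordered draw of two distinct socks: 4 * 3 = 12
draws = list(permutations(socks, 2))
matching = [d for d in draws if d[0][:5] == d[1][:5]]  # same colour prefix

p_match = len(matching) / len(draws)
print(p_match)  # 4 matching draws out of 12 -> 0.3333...
```

Drawing without replacement is what makes the second choice conditional on the first: of the 12 ordered draws, only 4 match in colour, giving 1/3 rather than the 1/2 that independent draws would yield.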
13. The noisy channel model, cont'd
Early telegraphy, radio and telephony were noisy
The basic shape of the problem is the same:
• Given a sequence of observations
• What is the most probable sequence of source events?
Or, more formally, given o_1^n, a sequence of observations, for what sequence of source events s_1^n is P(s_1^n | o_1^n) the greatest?
This is usually written argmax_{s_1^n} P(s_1^n | o_1^n)
Claude Shannon broke this problem down into two parts:
• The probability of the original sequence itself
• The probability of the transformation caused by the channel
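Shannon's two-part decomposition follows from Bayes' theorem: since P(s|o) = P(s)P(o|s)/P(o) and P(o) is fixed for a given observation, argmax over s of P(s|o) equals argmax of P(s)P(o|s). A minimal sketch for a spelling-correction channel; all the candidate words and probability values here are made-up illustrative numbers, not from the lecture:

```python
# Toy noisy-channel decoder: choose the source word s maximising
# P(s | o), which is proportional to P(s) * P(o | s) by Bayes' theorem.

source_prior = {"the": 0.05, "then": 0.01, "than": 0.008}  # P(s): source model
channel = {                                                # P(o | s): channel model
    ("teh", "the"): 0.02,
    ("teh", "then"): 0.001,
    ("teh", "than"): 0.0005,
}

def decode(observation):
    # argmax over candidate sources s of P(s) * P(observation | s)
    return max(source_prior,
               key=lambda s: source_prior[s] * channel.get((observation, s), 0.0))

print(decode("teh"))  # "the": 0.05 * 0.02 = 0.001 beats both alternatives
```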