Albert Gatt Corpora and Statistical Methods. Probability distributions Part 2
- Slide 1
- Albert Gatt Corpora and Statistical Methods
- Slide 2
- Probability distributions Part 2
- Slide 3
- Example 1: Book publishing. Case: a publishing house is
considering whether to publish a new textbook on statistical NLP.
Considerations include: production cost, expected sales, and net
profits (given cost). Problem: to publish or not to publish? This
depends on expected sales and profits. If published, how many
copies? This depends on demand and cost.
- Slide 4
- Example 1: Demand & cost figures. Suppose the book costs 35, of
which the publisher gets 25, the bookstore gets 6, and the author
gets 4. To make a decision, the publisher needs to estimate profits
as a function of the probability of selling n books, for different
values of n:

  profit = (25 * n) - overall production cost
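A minimal sketch of this calculation in code (the function name is
ours; the example figures are taken from the table on slide 7):

```python
def profit(n_copies, production_cost, revenue_per_copy=25):
    """Publisher's profit: per-copy revenue times copies sold,
    minus the overall production cost."""
    return revenue_per_copy * n_copies - production_cost

print(profit(5_000, 275_000))   # -150000: a loss
print(profit(20_000, 350_000))  # 150000: a profit
```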
- Slide 5
- Terminology. Random variable: in this example, the expected
profit from selling n books is our random variable; it takes on
different values, depending on n. We use uppercase (e.g. X) to
denote the random variable. Distribution: the different values of X
(denoted x) form a distribution. If each value x can be assigned a
probability (the probability of making a given profit), then we can
plot each value x against its likelihood.
- Slide 6
- Definitions. Random variable: a variable whose numerical value is
determined by chance. Formally, a function that returns a unique
numerical value determined by the outcome of an uncertain situation.
It can be discrete (our exclusive focus) or continuous. Probability
distribution: for a discrete random variable X, the probability
distribution p(x) gives the probabilities for each value x of X. The
probabilities p(x) of all possible values of X sum to 1. The
distribution tells us how much of the overall probability space (the
probability mass) each value x takes up.
- Slide 7
- Tabulated probability distribution:

  | No. copies sold | Prod. cost | Profit (X) | Probability P(x) |
  |-----------------|------------|------------|------------------|
  | 5,000           | 275,000    | -150,000   | .20              |
  | 10,000          | 300,000    | -50,000    | .40              |
  | 20,000          | 350,000    | 150,000    | .25              |
  | 30,000          | 400,000    | 350,000    | .10              |
  | 40,000          | 450,000    | 550,000    | .05              |
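A sketch of how this table can be represented in code, with a sanity
check that the probabilities sum to 1, as any probability
distribution must:

```python
# Profit values (X) mapped to their probabilities P(x), from the table.
distribution = {
    -150_000: 0.20,
    -50_000: 0.40,
    150_000: 0.25,
    350_000: 0.10,
    550_000: 0.05,
}

# The probabilities of all possible values of X must sum to 1.
assert abs(sum(distribution.values()) - 1.0) < 1e-9
```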
- Slide 8
- Plotting the distribution
- Slide 9
- Uses of a probability distribution. Computation of: mean: the
expected value of X in the long run, based on the specific values of
X and their probabilities. NB: NOT interpreted as a value in a
sample of data, but as the expected (future) value based on the
sample. Standard deviation & variance: the extent to which actual
values of X will differ from the mean. Skewness: the extent to which
our distribution is balanced, i.e. whether it is symmetrical.
- Slide 10
- In graphical terms: Mean: expected value in the long run. SD &
variance: how much actual values deviate from the mean overall.
Skewness: symmetry or tail of our distribution.
- Slide 11
- Measures of expectation and variation
- Slide 12
- The expected value (mean). The expected value of a discrete
random variable X, denoted E[X] or μ, is a weighted average of the
values of X: weighted, because not all values x will have the same
probability. It is estimated by summing, over all values x of X, the
product of x and its probability p(x):

  $E[X] = \mu = \sum_x x\,p(x)$
- Slide 13
- More on expected value. The mean or expected value tells us that,
in the long run, we can expect X to have the value μ. E.g. in our
example, our book publisher can expect long-term profits of:

  (-150,000 * .2) + (-50,000 * .4) + (150,000 * .25) +
  (350,000 * .1) + (550,000 * .05) = 50,000
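The same computation as a code sketch, using the figures from the
table on slide 7:

```python
# E[X]: sum of x * p(x) over all values x of X.
distribution = {-150_000: 0.20, -50_000: 0.40, 150_000: 0.25,
                350_000: 0.10, 550_000: 0.05}

mean = sum(x * p for x, p in distribution.items())
print(mean)  # ~50000, matching the calculation above
```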
- Slide 14
- Variance. The mean is the expected value of X, E[X]. Variance
(σ²) reflects the extent to which the actual outcomes deviate from
expectation (i.e. from E[X]):

  $\sigma^2 = E[(X - \mu)^2] = \sum_x (x - \mu)^2\,p(x)$

i.e. the weighted sum of squared deviations. Reasons for squaring:
it eliminates the distinction between +ve and -ve deviations, and it
makes the measure exponential, so that larger deviations are given
more importance: e.g. one deviation of 10 counts as much as 4
deviations of 5.
- Slide 15
- Standard deviation. Variance gives overall dispersion or
variation. Standard deviation (σ) is the dispersion of possible
outcomes; it indicates how spread out the distribution is. It is
estimated as the square root of the variance:

  $\sigma = \sqrt{\sigma^2}$
- Slide 16
- The book publishing example again. Recall that for our new book
on stat NLP, the expected profit is 50,000. What's the standard
deviation? We need to estimate (x - 50,000)² for all x, multiply by
p(x) in each case, sum the results, and take the square root. This
is left as an exercise.
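One way to carry out the exercise in code (a sketch reusing the
figures from slide 7):

```python
distribution = {-150_000: 0.20, -50_000: 0.40, 150_000: 0.25,
                350_000: 0.10, 550_000: 0.05}

# mean = E[X] = 50,000, as computed above
mean = sum(x * p for x, p in distribution.items())

# variance: the weighted sum of squared deviations from the mean
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = variance ** 0.5
print(variance, std_dev)  # ~3.6e10 and ~189,736.7
```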
- Slide 17
- Skewness. The mean gives us the centre of a distribution.
Standard deviation gives us dispersion. Skewness (denoted γ) is a
measure of the symmetry of the outcomes:

  $\gamma = \frac{E[(X - \mu)^3]}{\sigma^3} = \frac{\sum_x (x - \mu)^3\,p(x)}{\sigma^3}$
- Slide 18
- Skewness, continued. The formula divides the average value of
cubed deviations by the standard deviation cubed. Why cubed? The
cube of a positive deviation is itself positive; that of a negative
deviation is itself negative. We want both, since we want to know
about deviations both to the left (-ve) and to the right (+ve) of
the mean. Like the variance estimation, this emphasises large
deviations in either direction (it is exponential). If the outcomes
are symmetrical around the mean, then +ve and -ve deviations are
balanced, and skewness is 0.
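A sketch of the skewness computation for the book example (figures
from slide 7 again):

```python
distribution = {-150_000: 0.20, -50_000: 0.40, 150_000: 0.25,
                350_000: 0.10, 550_000: 0.05}

mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = variance ** 0.5

# skewness: weighted sum of cubed deviations, divided by sigma cubed
skewness = sum((x - mean) ** 3 * p
               for x, p in distribution.items()) / std_dev ** 3
print(skewness)  # ~1.05: positive, so the tail goes right
```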
- Slide 19
- Graphical display of skewness. Positive skewness: tail going
right. Negative skewness: tail going left.
- Slide 20
- Skewness and language. By Zipf's law (next week), word
frequencies do not cluster around the mean. There are a few highly
frequent words (making up a large proportion of overall word
frequency), and there are many highly infrequent words (f = 1 or
f = 2). So the Zipfian distribution is highly skewed. We will hear
more on the Zipfian distribution in the next lecture.
- Slide 21
- The concept of information
- Slide 22
- What is information? Main ingredient: an information source,
which transmits symbols from a finite alphabet S. Every symbol is
denoted s_i, and we call a sequence of such symbols a text. We
assume a probability distribution s.t. every s_i has probability
p(s_i). Example: a die is an information source; every throw yields
a symbol from the alphabet {1,2,3,4,5,6}, and 6 successive throws
yield a text of 6 symbols.
- Slide 23
- Quantifying information. Intuition: the more probable a symbol
is, the less information it yields; something seen very often is not
very surprising. So information is based on the inverse probability
of the symbol:

  $I(s) = \log_b \frac{1}{p(s)} = -\log_b p(s)$

for some b > 1. Usually we use base 2. Another term for I(s) is
surprisal.
- Slide 24
- Properties of I: 1. Non-negative. 2. If p(s) = 1, then I(s) = 0.
3. If two events s_1, s_2 are independent, then I(s_1 s_2) = I(s_1)
+ I(s_2). 4. Monotonic: slight changes in probability result in
slight changes in I.
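A sketch of the surprisal function, illustrating properties 2 and 3
(the function name is ours; base 2 is assumed, as usual):

```python
import math

def surprisal(p):
    """I(s) = log2(1 / p(s)): the information, in bits, carried by a
    symbol with probability p."""
    return math.log2(1 / p)

print(surprisal(1.0))  # 0.0: a certain event carries no information
# Additivity for independent events: I(s1 s2) = I(s1) + I(s2)
print(surprisal(0.5 * 0.25))             # 3.0 bits
print(surprisal(0.5) + surprisal(0.25))  # 3.0 bits
```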
- Slide 25
- Aggregate measure of information. What is the information content
of a text (sequence of symbols)? This is the same as finding the
average information of a random variable; the measure is called
entropy, denoted H. 1. Define X as a random variable over the
symbols in our alphabet, with P(s) = P(X=s) for all s in our
alphabet. 2. Estimate H(P).
- Slide 26
- Entropy. The entropy (or information) of a probability
distribution P is:

  $H(P) = -\sum_s p(s)\,\log_2 p(s)$

Entropy is the expected value (mean) of the surprisal, and its value
is interpreted as a number of bits of information.
- Slide 27
- Entropy example. Source = an 8-sided die. Alphabet S =
{1,2,3,4,5,6,7,8}, and every s_i has p = 1/8, so:

  $H(P) = -\sum_{i=1}^{8} \frac{1}{8}\log_2\frac{1}{8} = \log_2 8 = 3$ bits
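The same example as a code sketch:

```python
import math

def entropy(probs):
    """H(P) = -sum of p * log2(p): the expected surprisal, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair 8-sided die: each of the 8 outcomes has probability 1/8.
print(entropy([1/8] * 8))  # 3.0 bits
```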
- Slide 28
- Interpretation of entropy. The information contained in the
distribution P (the more unpredictable the outcomes, the higher the
entropy); equivalently, the message length if the message was
generated according to P and coded optimally.
- Slide 29
- Interpretation cont/d. For the 8-sided die example, the result
H(P) = 3 tells us we need 3 bits on average to transmit the result
of rolling an 8-sided die; we can't do it in fewer than 3 bits. An
optimal code assigns a distinct 3-bit string to each outcome:

  1 → 001, 2 → 010, 3 → 011, 4 → 100,
  5 → 101, 6 → 110, 7 → 111, 8 → 000
- Slide 30
- Entropy for multiple variables. So far we have dealt with a
single random variable. The joint entropy of a pair of RVs X and Y
is:

  $H(X,Y) = -\sum_x \sum_y p(x,y)\,\log_2 p(x,y)$
- Slide 31
- Conditional Entropy. Given X and Y, how much uncertainty about Y
remains once we know X? A version of entropy using conditional
probability:

  $H(Y|X) = -\sum_x \sum_y p(x,y)\,\log_2 p(y|x)$
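A sketch of both quantities over a small joint distribution (the
distribution itself is invented purely for illustration):

```python
import math

def joint_entropy(joint):
    """H(X,Y) = -sum over (x,y) of p(x,y) * log2 p(x,y)."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def conditional_entropy(joint):
    """H(Y|X) = -sum over (x,y) of p(x,y) * log2 p(y|x),
    where p(y|x) = p(x,y) / p(x)."""
    px = {}  # marginal distribution of X
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return -sum(p * math.log2(p / px[x])
                for (x, y), p in joint.items() if p > 0)

joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}
print(joint_entropy(joint), conditional_entropy(joint))
```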
- Slide 32
- Mutual information
- Slide 33
- Just as probability can change based on posterior knowledge, so
can information. Suppose our distribution gives us the probability
P(a) of observing the symbol a, and suppose we first observe the
symbol b. If a and b are not independent, this should alter our
information state with respect to the probability of observing a:
i.e. we can compute p(a|b).
- Slide 34
- Mutual info between two symbols. The change in our information
about a on observing b is:

  $I(a;b) = \log_2 \frac{p(a|b)}{p(a)} = \log_2 \frac{p(a,b)}{p(a)\,p(b)}$

If a and b are completely independent, I(a;b) = 0.
- Slide 35
- Averaging mutual information. We want to average mutual
information over all values of a random variable A, and similarly
over all values of a random variable B, weighting each pair (a,b) by
its probability.
- Slide 36
- Combining the two. Thus, mutual information involves taking the
joint probability and dividing by the individual probabilities, i.e.
a comparison of the likelihood of observing a and b together vs.
separately:

  $I(A;B) = \sum_a \sum_b p(a,b)\,\log_2 \frac{p(a,b)}{p(a)\,p(b)}$
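A code sketch of the averaged quantity, reusing the invented joint
distribution from the conditional-entropy example:

```python
import math

def mutual_information(joint):
    """I(A;B) = sum over (a,b) of p(a,b) * log2( p(a,b) / (p(a)p(b)) )."""
    pa, pb = {}, {}  # marginal distributions of A and B
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}
print(mutual_information(joint))  # > 0: A and B are not independent
```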
- Slide 37
- Mutual Information: summary. Gives a measure of the reduction in
uncertainty about a random variable X, given knowledge of Y; it
quantifies how much information about X is contained in Y.
- Slide 38
- Some more on I(X;Y). In statistical NLP, we often calculate
pointwise mutual information: the mutual information between two
points on a distribution, I(x;y) rather than I(X;Y). It is used for
some applications in lexical acquisition.
- Slide 39
- Mutual Information -- example. Suppose we're interested in the
collocational strength of two words x and y, e.g. bread and butter.
Mutual information quantifies the likelihood of observing x and y
together (in some window). If there is no interesting relationship,
knowing about bread tells us nothing about the likelihood of
encountering butter; in that case P(x,y) = P(x)P(y) and I(x;y) = 0.
This is the Church and Hanks (1991) approach. NB: the approach uses
pointwise MI.
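A minimal sketch of pointwise MI from corpus counts (the counts here
are invented, and the windowing scheme for co-occurrence is left
unspecified, as on the slide):

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """I(x;y) = log2( p(x,y) / (p(x)p(y)) ), with probabilities
    estimated as relative frequencies over n tokens/windows."""
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))

# 'bread' and 'butter' co-occur far more often than chance predicts,
# so their pointwise MI is large and positive (~7.2 bits here).
print(pmi(count_xy=30, count_x=500, count_y=400, n=1_000_000))
```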