Slides Advanced Statistics Summer Term 2011 (April 5, 2011 – May 17, 2011) Tuesdays, 14.15 – 15.45 and 16.00 – 17.30 Room: J 498 Prof. Dr. Bernd Wilfling Westfälische Wilhelms-Universität Münster




  • Slides

    Advanced Statistics

    Summer Term 2011 (April 5, 2011 – May 17, 2011)

    Tuesdays, 14.15 – 15.45 and 16.00 – 17.30, Room: J 498

    Prof. Dr. Bernd Wilfling

    Westfälische Wilhelms-Universität Münster

  • Contents

    1 Introduction
    1.1 Syllabus
    1.2 Why Advanced Statistics?

    2 Random Variables, Distribution Functions, Expectation, Moment Generating Functions
    2.1 Basic Terminology
    2.2 Random Variable, Cumulative Distribution Function, Density Function
    2.3 Expectation, Moments and Moment Generating Functions
    2.4 Special Parametric Families of Univariate Distributions

    3 Joint and Conditional Distributions, Stochastic Independence
    3.1 Joint and Marginal Distribution
    3.2 Conditional Distribution and Stochastic Independence
    3.3 Expectation and Joint Moment Generating Functions
    3.4 The Multivariate Normal Distribution

    4 Distributions of Functions of Random Variables
    4.1 Expectations of Functions of Random Variables
    4.2 Cumulative-distribution-function Technique
    4.3 Moment-generating-function Technique
    4.4 General Transformations

    5 Methods of Estimation
    5.1 Sampling, Estimators, Limit Theorems
    5.2 Properties of Estimators
    5.3 Methods of Estimation
    5.3.1 Least-Squares Estimators
    5.3.2 Method-of-moments Estimators
    5.3.3 Maximum-Likelihood Estimators

    6 Hypothesis Testing
    6.1 Basic Terminology
    6.2 Classical Testing Procedures
    6.2.1 Wald Test
    6.2.2 Likelihood-Ratio Test
    6.2.3 Lagrange-Multiplier Test


  • References and Related Reading

    In German:

    Mosler, K. und F. Schmid (2008). Wahrscheinlichkeitsrechnung und schließende Statistik (3. Auflage). Springer Verlag, Heidelberg.

    Schira, J. (2009). Statistische Methoden der VWL und BWL – Theorie und Praxis (3. Auflage). Pearson Studium, München.

    Wilfling, B. (2010). Statistik I. Skript zur Vorlesung Deskriptive Statistik im Wintersemester 2010/2011 an der Westfälischen Wilhelms-Universität Münster.

    Wilfling, B. (2011). Statistik II. Skript zur Vorlesung Wahrscheinlichkeitsrechnung und schließende Statistik im Sommersemester 2011 an der Westfälischen Wilhelms-Universität Münster.

    In English:

    Chiang, A. (1984). Fundamental Methods of Mathematical Economics, 3. edition. McGraw-Hill, Singapore.

    Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. John Wiley & Sons, New York.

    Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. John Wiley & Sons, New York.

    Garthwaite, P.H., Jolliffe, I.T. and B. Jones (2002). Statistical Inference, 3. edition. Oxford University Press, Oxford.

    Mood, A.M., Graybill, F.A. and D.C. Boes (1974). Introduction to the Theory of Statistics, 3. edition. McGraw-Hill, Tokyo.


  • 1. Introduction

    1.1 Syllabus

    Aim of this course:

    Consolidation of probability calculus and statistical inference (on the basis of previous Bachelor courses)

    Preparatory course to Econometrics, Empirical Economics

    1

  • Web-site:

    http://www1.wiwi.uni-muenster.de/oeew/ → Study Courses → summer term 2011 → Advanced Statistics

    Style:

    Lecture is based on slides

    Slides are downloadable as PDF-files from the web-site

    References:

    See Contents

    2

  • How to get prepared for the exam:

    Courses: Class in Advanced Statistics (Thu, 14.00 – 16.00 and 16.00 – 18.00, J 498, April 7, 2011 – May 19, 2011)

    Auxiliary material to be used in the exam:

    Pocket calculator (non-programmable)

    All course-slides and solutions to class-exercises

    No textbooks

    3

  • Class teacher:

    Dipl.-Mathem. Marc Lammerding (see personal web-site)

    4

  • 1.2 Why Advanced Statistics?

    Contents of the BA course Statistics II:

    Random experiments, events, probability

    Random variables, distributions

    Samples, statistics

    Estimators

    Tests of hypothesis

    Aim of the BA course Statistics II:

    Elementary understanding of statistical concepts (sampling, estimation, hypothesis-testing)

    5

  • Now:

    Course in Advanced Statistics (probability calculus and mathematical statistics)

    Aim of this course:

    Better understanding of distribution theory

    How can we find good estimators?

    How can we construct good tests of hypothesis?

    6

  • Preliminaries:

    BA courses: Mathematics, Statistics I, Statistics II

    The slides for the BA courses Statistics I+II are downloadable from the web-site (in German)

    Later courses based on Advanced Statistics:

    All courses belonging to the three modules Econometrics and Empirical Economics (Econometrics I+II, Analysis of Time Series, ...)

    7

  • 2. Random Variables, Distribution Functions, Expectation, Moment Generating Functions

    Aim of this section:

    Mathematical definition of the concepts

    random variable

    (cumulative) distribution function

    (probability) density function

    expectation and moments

    moment generating function

    8

  • Preliminaries:

    Repetition of the notions

    random experiment

    outcome (sample point) and sample space

    event

    probability

    (see Wilfling (2011), Chapter 2)

    9

  • 2.1 Basic Terminology

    Definition 2.1: (Random experiment)

    A random experiment is an experiment

    (a) for which we know in advance all conceivable outcomes thatit can take on, but

    (b) for which we do not know in advance the actual outcomethat it eventually takes on.

    Random experiments are performed in controllable trials.

    10

  • Examples of random experiments:

    Drawing of lottery numbers

    Roulette, tossing a coin, rolling a die

    Technical experiments (testing the hardness of lots from steel production etc.)

    In economics:

    Random experiments (according to Def. 2.1) are rare (historical data, trials are not controllable)

    Modern discipline: Experimental Economics

    11

  • Definition 2.2: (Sample point, sample space)

    Each conceivable outcome of a random experiment is called a sample point. The totality of conceivable outcomes (or sample points) is defined as the sample space and is denoted by Ω.

    Examples:

    Random experiment of rolling a single die: Ω = {1, 2, 3, 4, 5, 6}

    Random experiment of tossing a coin until HEAD shows up: Ω = {H, TH, TTH, TTTH, TTTTH, . . .}

    Random experiment of measuring tomorrow's exchange rate between the euro and the US-$: Ω = [0, ∞)

    12

  • Obviously:

    The number of elements in Ω can be either (1) finite, (2) infinite but countable, or (3) infinite and uncountable

    Now:

    Definition of the notion Event based on mathematical sets

    Definition 2.3: (Event)

    An event of a random experiment is a subset A of the sample space Ω. We say the event A occurs if the random experiment has an outcome ω ∈ A.

    13

  • Remarks:

    Events are typically denoted by A, B, C, . . . or A1, A2, . . .

    A = Ω is called the sure event (since for every sample point ω we have ω ∈ A)

    A = ∅ (empty set) is called the impossible event (since for every ω we have ω ∉ A)

    If the event A is a subset of the event B (A ⊆ B) we say that the occurrence of A implies the occurrence of B (since for every ω ∈ A we also have ω ∈ B)

    Obviously:

    Events are represented by mathematical sets ⟹ application of set operations to events

    14

  • Combining events (set operations):

    Intersection: A1 ∩ A2 ∩ · · · ∩ An occurs, if all Ai occur

    Union: A1 ∪ A2 ∪ · · · ∪ An occurs, if at least one Ai occurs

    Set difference: C = A\B occurs, if A occurs and B does not occur

    Complement: C = Ω\A ≡ Ā occurs, if A does not occur

    The events A and B are called disjoint, if A ∩ B = ∅ (both events cannot occur simultaneously)

    15

  • Now:

    For any arbitrary event A we are looking for a number P(A) which represents the probability that A occurs

    Formally: P : A ↦ P(A)

    (P(·) is a set function)

    Question:

    Which properties should the probability function (set function) P(·) have?

    16

  • Definition 2.4: (Kolmogorov-axioms)

    The following axioms for P () are called Kolmogorov-axioms:

    Nonnegativity: P(A) ≥ 0 for every A

    Standardization: P(Ω) = 1

    Additivity: For two disjoint events A and B (i.e. for A ∩ B = ∅) P(·) satisfies

    P(A ∪ B) = P(A) + P(B)

    17

  • Easy to check:

    The three axioms imply several additional properties and rules when computing with probabilities

    Theorem 2.5: (General properties)

    The Kolmogorov-axioms imply the following properties:

    Probability of the complementary event: P(Ā) = 1 − P(A)

    Probability of the impossible event: P(∅) = 0

    Range of probabilities: 0 ≤ P(A) ≤ 1

    18

  • Next:

    General rules when computing with probabilities

    Theorem 2.6: (Calculation rules)

    The Kolmogorov-axioms imply the following calculation rules (A, B, C are arbitrary events):

    Addition rule (I): P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

    (probability that A or B occurs)

    19

  • Addition rule (II):

    P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C)

    (probability that A or B or C occurs)

    Probability of the difference event:

    P(A\B) = P(A ∩ B̄) = P(A) − P(A ∩ B)

    20

  • Notice:

    If B implies A (i.e. if B ⊆ A) it follows that P(A\B) = P(A) − P(B)

    21

  • 2.2 Random Variable, Cumulative Distribution Function, Density Function

    Frequently: Instead of being interested in a concrete sample point ω itself, we are rather interested in a number depending on ω

    Examples:

    Profit in euro when playing roulette

    Profit earned when selling a stock

    Monthly salary of a randomly selected person

    Intuitive meaning of a random variable: rule translating the abstract ω into a number

    22

  • Definition 2.7: (Random variable [rv])

    A random variable, denoted by X or X(ω), is a mathematical function of the form

    X : Ω → ℝ, ω ↦ X(ω).

    Remarks:

    A random variable relates each sample point ω ∈ Ω to a real number

    Intuitively: A random variable X characterizes a number that is a priori unknown

    23

    When the random experiment is carried out, the random variable X takes on the value x

    x is called realization or value of the random variable X after the random experiment has been carried out

    Random variables are denoted by capital letters, realizations are denoted by small letters

    The rv X describes the situation ex ante, i.e. before carrying out the random experiment

    The realization x describes the situation ex post, i.e. after having carried out the random experiment

    24

  • Example 1:

    Consider the experiment of tossing a single coin (H = Head, T = Tail). Let the rv X represent the Number of Heads

    We have Ω = {H, T}

    The random variable X can take on two values:

    X(T ) = 0, X(H) = 1

    25

  • Example 2:

    Consider the experiment of tossing a coin three times. Let X represent the Number of Heads

    We have Ω = {(H,H,H), (H,H,T), . . . , (T,T,T)} (8 sample points ω1, . . . , ω8)

    The rv X is defined by

    X(ω) = number of H in ω

    Obviously: X relates distinct ω's to the same number, e.g.

    X((H,H,T)) = X((H,T,H)) = X((T,H,H)) = 2

    26

  • Example 3:

    Consider the experiment of randomly selecting 1 person from a group of people. Let X represent the person's status of employment

    We have Ω = {employed, unemployed} ≡ {ω1, ω2}

    X can be defined as X(ω1) = 1, X(ω2) = 0

    27

  • Example 4:

    Consider the experiment of measuring tomorrow's price of a specific stock. Let X denote the stock price

    We have Ω = [0, ∞), i.e. X is defined by X(ω) = ω

    Conclusion:

    The random variable X can take on distinct values with specific probabilities

    28

  • Question:

    How can we determine these specific probabilities and how can we calculate with them?

    Simplifying notation: (a, b, x ∈ ℝ)

    P(X = a) ≡ P({ω|X(ω) = a})
    P(a < X < b) ≡ P({ω|a < X(ω) < b})
    P(X ≤ x) ≡ P({ω|X(ω) ≤ x})

    Solution:

    We can compute these probabilities via the so-called cumulative distribution function of X

    29

  • Intuitively:

    The cumulative distribution function of the random variable X characterizes the probabilities according to which the possible values x are distributed along the real line (the so-called distribution of X)

    Definition 2.8: (Cumulative distribution function [cdf])

    The cumulative distribution function of a random variable X, denoted by FX, is defined to be the function

    FX : ℝ → [0,1], x ↦ FX(x) = P({ω|X(ω) ≤ x}) = P(X ≤ x).

    30

  • Example:

    Consider the experiment of tossing a coin three times. Let X represent the Number of Heads

    We have Ω = {(H,H,H), (H,H,T), . . . , (T,T,T)} (8 sample points ω1, . . . , ω8)

    For the probabilities of X we find

    P(X = 0) = P({(T,T,T)}) = 1/8
    P(X = 1) = P({(T,T,H), (T,H,T), (H,T,T)}) = 3/8
    P(X = 2) = P({(T,H,H), (H,T,H), (H,H,T)}) = 3/8
    P(X = 3) = P({(H,H,H)}) = 1/8

    31

    Thus, the cdf is given by

    FX(x) =
    0.000 for x < 0
    0.125 for 0 ≤ x < 1
    0.5   for 1 ≤ x < 2
    0.875 for 2 ≤ x < 3
    1     for x ≥ 3
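This step-function cdf can be cross-checked with a few lines of Python (an illustrative sketch, not part of the course material); the sample space is enumerated exactly as on the slide:

```python
from itertools import product
from fractions import Fraction

# Enumerate the sample space of tossing a coin three times;
# each of the 8 outcomes has probability 1/8.
omega = list(product("HT", repeat=3))
X = {w: w.count("H") for w in omega}   # rv: Number of Heads

def F_X(x):
    """cdf F_X(x) = P(X <= x): sum the probabilities of all
    outcomes w with X(w) <= x."""
    return sum(Fraction(1, 8) for w in omega if X[w] <= x)

print(F_X(-1), F_X(0), F_X(1.5), F_X(3))   # 0 1/8 1/2 1
```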

    Remarks:

    In practice, it will be sufficient to only know the cdf FX of X

    In many situations, it will appear impossible to exactly specify the sample space Ω or the explicit function X : Ω → ℝ. However, often we may derive the cdf FX from other factual considerations

    32

  • General properties of FX:

    FX(x) is a monotone, nondecreasing function

    We have lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1

    FX is continuous from the right; that is, lim_{z→x, z>x} FX(z) = FX(x)

    33

  • Summary:

    Via the cdf FX(x) we can answer the following question:

    What is the probability that the random variable X takes on a value that does not exceed x?

    Now:

    Consider the question:

    What is the value which X does not exceed with a prespecified probability p ∈ (0,1)?

    ⟹ quantile function of X

    34

  • Definition 2.9: (Quantile function)

    Consider the rv X with cdf FX. For every p ∈ (0,1) the quantile function of X, denoted by QX(p), is defined as

    QX : (0,1) → ℝ, p ↦ QX(p) = min{x|FX(x) ≥ p}.

    The value of the quantile function xp = QX(p) is called the pth quantile of X.

    Remarks:

    The pth quantile xp of X is defined as the smallest number x satisfying FX(x) ≥ p

    In other words: The pth quantile xp is the smallest value that X does not exceed with probability at least p

    35

  • Special quantiles:

    Median: p = 0.5

    Quartiles: p = 0.25, 0.5, 0.75

    Quintiles: p = 0.2, 0.4, 0.6, 0.8

    Deciles: p = 0.1, 0.2, . . . , 0.9
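For a discrete rv the minimum in Definition 2.9 can be taken over the (finite) support, since FX only jumps there. A small Python sketch for the three-coin rv from slide 31 (the function names are ours, not the course's):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))     # three coin tosses

def F_X(x):
    # cdf of X = Number of Heads (see slide 32)
    return sum(Fraction(1, 8) for w in outcomes if w.count("H") <= x)

def Q_X(p):
    """Quantile function: smallest x with F_X(x) >= p; searching the
    support suffices because F_X is a step function."""
    support = sorted({w.count("H") for w in outcomes})
    return min(x for x in support if F_X(x) >= p)

print(Q_X(Fraction(1, 2)), Q_X(Fraction(9, 10)))   # 1 3  (the median is 1)
```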

    Now:

    Consideration of two distinct classes of random variables (discrete vs. continuous rvs)

    36

  • Reason:

    Each class requires a specific mathematical treatment

    Mathematical tools for analyzing discrete rvs:

    Finite and infinite sums

    Mathematical tools for analyzing continuous rvs:

    Differential- and integral calculus

    Remarks:

    Some rvs are partly discrete and partly continuous

    Such rvs are not treated in this course

    37

  • Definition 2.10: (Discrete random variable)

    A random variable X will be defined to be discrete if it can take on either

    (a) only a finite number of values x1, x2, . . . , xJ or

    (b) an infinite, but countable number of values x1, x2, . . .

    each with strictly positive probability; that is, if for all j = 1, . . . , J, . . . we have

    P(X = xj) > 0 and Σj P(X = xj) = 1.

    38

  • Examples of discrete variables:

    Countable variables (X = Number of . . .)

    Encoded qualitative variables

    Further definitions:

    Definition 2.11: (Support of a discrete random variable)

    The support of a discrete rv X, denoted by supp(X), is defined to be the totality of all values that X can take on with a strictly positive probability:

    supp(X) = {x1, . . . , xJ} or supp(X) = {x1, x2, . . .}.

    39

  • Definition 2.12: (Discrete density function)

    For a discrete random variable X the function

    fX(x) = P (X = x)

    is defined to be the discrete density function of X.

    Remarks:

    The discrete density function fX(·) takes on strictly positive values only for elements of the support of X. For realizations of X that do not belong to the support of X, i.e. for x ∉ supp(X), we have fX(x) = 0:

    fX(x) =
    P(X = xj) > 0 for x = xj ∈ supp(X)
    0 for x ∉ supp(X)

    40

  • The discrete density function fX(·) has the following properties:

    fX(x) ≥ 0 for all x

    Σ_{xj ∈ supp(X)} fX(xj) = 1

    For any arbitrary set A ⊆ ℝ the probability of the event {ω|X(ω) ∈ A} = {X ∈ A} is given by

    P(X ∈ A) = Σ_{xj ∈ A} fX(xj)

    41

  • Example:

    Consider the experiment of tossing a coin three times and let X = Number of Heads (see slide 31)

    Obviously: X is discrete and has the support supp(X) = {0, 1, 2, 3}

    The discrete density function of X is given by

    fX(x) =
    P(X = 0) = 0.125 for x = 0
    P(X = 1) = 0.375 for x = 1
    P(X = 2) = 0.375 for x = 2
    P(X = 3) = 0.125 for x = 3
    0 for x ∉ supp(X)

    42
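The density values 0.125/0.375 can be reproduced by brute-force enumeration (a quick Python cross-check, not course material):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Count how many of the 8 equally likely outcomes give each value of
# X = Number of Heads, then divide by 8.
counts = Counter(w.count("H") for w in product("HT", repeat=3))
f_X = {x: Fraction(c, 8) for x, c in counts.items()}

assert f_X == {0: Fraction(1, 8), 1: Fraction(3, 8),
               2: Fraction(3, 8), 3: Fraction(1, 8)}
assert sum(f_X.values()) == 1   # the density sums to one
```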

    The cdf of X is given by (see slide 32)

    FX(x) =
    0.000 for x < 0
    0.125 for 0 ≤ x < 1
    0.5   for 1 ≤ x < 2
    0.875 for 2 ≤ x < 3
    1     for x ≥ 3

    Obviously:

    The cdf FX(·) can be obtained from fX(·):

    FX(x) = P(X ≤ x) = Σ_{xj ∈ supp(X), xj ≤ x} fX(xj)

    43

  • Conclusion:

    The cdf of a discrete random variable X is a step function with steps at the points xj ∈ supp(X). The height of the step at xj is given by

    FX(xj) − lim_{x→xj, x<xj} FX(x) = fX(xj)

  • Now:

    Definition of continuous random variables

    Intuitively:

    In contrast to discrete random variables, continuous random variables can take on an uncountable number of values (e.g. every real number on a given interval)

    In fact:

    Definition of a continuous random variable is quite technical

    45

  • Definition 2.13: (Continuous rv, probability density function)

    A random variable X is called continuous if there exists a function fX : ℝ → [0, ∞) such that the cdf of X can be written as

    FX(x) = ∫_{−∞}^{x} fX(t) dt for all x ∈ ℝ.

    The function fX(x) is called the probability density function (pdf) of X.

    Remarks:

    The cdf FX(·) of a continuous random variable X is a primitive function (antiderivative) of the pdf fX(·)

    FX(x) = P(X ≤ x) is equal to the area under the pdf fX(·) between the limits −∞ and x

    46

  • Cdf FX(·) and pdf fX(·)

    [Figure: pdf fX(t) plotted against t; the shaded area to the left of x equals P(X ≤ x) = FX(x)]

    47

  • Properties of the pdf fX(·):

    1. A pdf fX(·) cannot take on negative values, i.e. fX(x) ≥ 0 for all x ∈ ℝ

    2. The area under a pdf is equal to one, i.e. ∫_{−∞}^{+∞} fX(x) dx = 1

    3. If the cdf FX(x) is differentiable we have fX(x) = F′X(x) ≡ dFX(x)/dx

    48

  • Example: (Uniform distribution over [0,10])

    Consider the random variable X with pdf

    fX(x) = 0.1 for x ∈ [0,10], 0 for x ∉ [0,10]

    Derivation of the cdf FX: For x < 0 we have

    FX(x) = ∫_{−∞}^{x} fX(t) dt = ∫_{−∞}^{x} 0 dt = 0

    49

    For x ∈ [0,10] we have

    FX(x) = ∫_{−∞}^{x} fX(t) dt = ∫_{−∞}^{0} 0 dt + ∫_{0}^{x} 0.1 dt = [0.1 · t]_0^x = 0.1 · x − 0.1 · 0 = 0.1 · x

    50

    For x > 10 we have

    FX(x) = ∫_{−∞}^{x} fX(t) dt = ∫_{−∞}^{0} 0 dt + ∫_{0}^{10} 0.1 dt + ∫_{10}^{x} 0 dt = 0 + 1 + 0 = 1
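The piecewise cdf just derived can be sanity-checked numerically; the Python sketch below (illustrative, not course material) compares the closed form with a crude midpoint-rule integral of the pdf:

```python
def f_X(x):
    # pdf of the uniform distribution over [0, 10]
    return 0.1 if 0 <= x <= 10 else 0.0

def F_X(x):
    # cdf in closed form, as derived on slides 49-51
    if x < 0:
        return 0.0
    if x <= 10:
        return 0.1 * x
    return 1.0

def numeric_cdf(x, n=100_000):
    # midpoint-rule approximation of the integral of f_X from
    # below the support up to x
    a = -1.0
    h = (x - a) / n
    return sum(f_X(a + (i + 0.5) * h) for i in range(n)) * h

print(abs(numeric_cdf(5.0) - F_X(5.0)) < 1e-3)   # True
```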

    51

  • Now:

    Interval probabilities, i.e. (for a, b ∈ ℝ, a < b) P(X ∈ (a, b]) = P(a < X ≤ b)

    We have

    P(a < X ≤ b) = P({ω|a < X(ω) ≤ b})
    = P({ω|X(ω) > a} ∩ {ω|X(ω) ≤ b})
    = 1 − P(({ω|X(ω) > a} ∩ {ω|X(ω) ≤ b})ᶜ) (complement)
    = 1 − P({ω|X(ω) ≤ a} ∪ {ω|X(ω) > b})

    52

  • = 1 − [P(X ≤ a) + P(X > b)]
    = 1 − [FX(a) + (1 − P(X ≤ b))]
    = 1 − [FX(a) + 1 − FX(b)]
    = FX(b) − FX(a)
    = ∫_{−∞}^{b} fX(t) dt − ∫_{−∞}^{a} fX(t) dt
    = ∫_{a}^{b} fX(t) dt

    53

    53

    Interval probability between the limits a and b

    [Figure: pdf fX(x); the shaded area between a and b equals P(a < X ≤ b)]

    54

  • Important result for a continuous rv X:

    P(X = a) = 0 for all a ∈ ℝ

    Proof:

    P(X = a) = lim_{b→a} P(a < X ≤ b) = lim_{b→a} ∫_{a}^{b} fX(x) dx = ∫_{a}^{a} fX(x) dx = 0

    Conclusion:

    The probability that a continuous random variable X takes on a single explicit value is always zero

    55

    Probability of a single value

    [Figure: pdf fX(x); as b1, b2, b3 approach a, the area over (a, bi] shrinks to zero]

    56

  • Notice:

    This does not imply that the event {X = a} cannot occur

    Consequence:

    Since for continuous random variables we always have P(X = a) = 0 for all a ∈ ℝ, it follows that

    P(a < X < b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = P(a < X ≤ b) = FX(b) − FX(a)

    (when computing interval probabilities for continuous rvs, it does not matter if the interval is open or closed)

    57

  • 2.3 Expectation, Moments and Moment Generating Functions

    Repetition:

    Expectation of an arbitrary random variable X

    Definition 2.14: (Expectation)

    The expectation of the random variable X, denoted by E(X), is defined by

    E(X) = Σ_{xj ∈ supp(X)} xj · P(X = xj), if X is discrete

    E(X) = ∫_{−∞}^{+∞} x · fX(x) dx, if X is continuous

    58

  • Remarks:

    The expectation of the random variable X is approximately equal to the sum of all realizations each weighted by the probability of its occurrence

    Instead of E(X) we often write μX

    There exist random variables that do not have an expectation (see class)

    59

  • Example 1: (Discrete random variable)

    Consider the experiment of tossing two dice. Let X represent the absolute difference of the two dice. What is the expectation of X?

    The support of X is given by supp(X) = {0, 1, 2, 3, 4, 5}

    60

    The discrete density function of X is given by

    fX(x) =
    P(X = 0) = 6/36 for x = 0
    P(X = 1) = 10/36 for x = 1
    P(X = 2) = 8/36 for x = 2
    P(X = 3) = 6/36 for x = 3
    P(X = 4) = 4/36 for x = 4
    P(X = 5) = 2/36 for x = 5
    0 for x ∉ supp(X)

    This gives

    E(X) = 0 · 6/36 + 1 · 10/36 + 2 · 8/36 + 3 · 6/36 + 4 · 4/36 + 5 · 2/36 = 70/36 = 1.9444
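Summing x · P(X = x) over the support is exactly what the following Python fragment does, enumerating all 36 dice outcomes instead of tabulating the density first (an illustrative sketch):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of tossing two dice.
rolls = list(product(range(1, 7), repeat=2))

# E(X) for X = absolute difference of the two dice:
# each outcome contributes |i - j| * 1/36.
E_X = sum(Fraction(abs(i - j), 36) for i, j in rolls)
print(E_X)   # 35/18, i.e. 70/36 = 1.9444...
```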

    61

  • Example 2: (Continuous random variable)

    Consider the continuous random variable X with pdf

    fX(x) = x/4 for 1 ≤ x ≤ 3, 0 elsewise

    To calculate the expectation we split up the integral:

    E(X) = ∫_{−∞}^{+∞} x · fX(x) dx = ∫_{−∞}^{1} x · 0 dx + ∫_{1}^{3} x · (x/4) dx + ∫_{3}^{+∞} x · 0 dx

    62

  • = 31

    x2

    4dx =

    14[13 x3

    ]31

    =14(273 13

    )=

    2612

    = 2.1667

    Frequently:

    Random variable X plus discrete density or pdf fX is known

    We have to find the expectation of the transformed random variable Y = g(X)

    63

  • Theorem 2.15: (Expectation of a transformed rv)

    Let X be a random variable with discrete density or pdf fX(·). For any Baire-function g : ℝ → ℝ the expectation of the transformed random variable Y = g(X) is given by

    E(Y) = E[g(X)] = Σ_{xj ∈ supp(X)} g(xj) · P(X = xj), if X is discrete

    E(Y) = E[g(X)] = ∫_{−∞}^{+∞} g(x) · fX(x) dx, if X is continuous

    64

  • Remarks:

    All functions considered in this course are Baire-functions

    For the special case g(x) = x (the identity function) Theorem 2.15 coincides with Definition 2.14

    Next:

    Some important rules for calculating expected values

    65

  • Theorem 2.16: (Properties of expectations)

    Let X be an arbitrary random variable (discrete or continuous), c, c1, c2 ∈ ℝ constants and g, g1, g2 : ℝ → ℝ functions. Then:

    1. E(c) = c.

    2. E[c · g(X)] = c · E[g(X)].

    3. E[c1 · g1(X) + c2 · g2(X)] = c1 · E[g1(X)] + c2 · E[g2(X)].

    4. If g1(x) ≤ g2(x) for all x ∈ ℝ then E[g1(X)] ≤ E[g2(X)].

    Proof: Class

    66

  • Now:

    Consider the random variable X (discrete or continuous) and the explicit function g(x) = [x − E(X)]² ⟹ variance and standard deviation of X

    Definition 2.17: (Variance, standard deviation)

    For any random variable X the variance, denoted by Var(X), is defined as the expected quadratic distance between X and its expectation E(X); that is

    Var(X) = E[(X − E(X))²].

    The standard deviation of X, denoted by SD(X), is defined to be the (positive) square root of the variance:

    SD(X) = +√Var(X).

    67

  • Remark:

    Setting g(X) = [X − E(X)]² in Theorem 2.15 (on slide 64) yields the following explicit formulas for discrete and continuous random variables:

    Var(X) = E[g(X)] = Σ_{xj ∈ supp(X)} [xj − E(X)]² · P(X = xj), if X is discrete

    Var(X) = E[g(X)] = ∫_{−∞}^{+∞} [x − E(X)]² · fX(x) dx, if X is continuous

    68

  • Example: (Discrete random variable)

    Consider again the experiment of tossing two dice with X representing the absolute difference of the two dice (see Example 1 on slide 60). The variance is given by

    Var(X) = (0 − 70/36)² · 6/36 + (1 − 70/36)² · 10/36 + (2 − 70/36)² · 8/36 + (3 − 70/36)² · 6/36 + (4 − 70/36)² · 4/36 + (5 − 70/36)² · 2/36 = 2.05247
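The value 2.05247 can be verified in exact arithmetic and, at the same time, checked against the shortcut Var(X) = E(X²) − [E(X)]² of Theorem 2.18 (an illustrative Python sketch):

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)                                        # each outcome
mean = sum(abs(i - j) * p for i, j in rolls)               # 70/36

# definition: expected squared distance from the mean
var = sum((abs(i - j) - mean) ** 2 * p for i, j in rolls)
# shortcut: E(X^2) - [E(X)]^2
var_short = sum(abs(i - j) ** 2 * p for i, j in rolls) - mean ** 2

assert var == var_short == Fraction(665, 324)   # = 2.05246...
```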

    Notice:

    The variance is an expectation per definitionem ⟹ rules for expectations are applicable

    69

  • Theorem 2.18: (Rules for variances)

    Let X be an arbitrary random variable (discrete or continuous) and a, b ∈ ℝ real constants; then

    1. Var(X) = E(X²) − [E(X)]².

    2. Var(a + b · X) = b² · Var(X).

    Proof: Class

    Next:

    Two important inequalities dealing with expectations and transformed random variables

    70

  • Theorem 2.19: (Chebyshev inequality)

    Let X be an arbitrary random variable and g : ℝ → ℝ₊ a nonnegative function. Then, for every k > 0 we have

    P[g(X) ≥ k] ≤ E[g(X)] / k.

    Special case:

    Consider g(x) = [x − E(X)]² and k = r² · Var(X) (r > 0)

    Theorem 2.19 implies

    P{[X − E(X)]² ≥ r² · Var(X)} ≤ Var(X) / (r² · Var(X)) = 1/r²

    71

  • Now:

    P{[X − E(X)]² ≥ r² · Var(X)} = P{|X − E(X)| ≥ r · SD(X)} = 1 − P{|X − E(X)| < r · SD(X)}

    It follows that

    P{|X − E(X)| < r · SD(X)} ≥ 1 − 1/r²

    (specific Chebyshev inequality)

    72

  • Remarks:

    The specific Chebyshev inequality provides a minimal probability of the event that any arbitrary random variable X takes on a value from the following interval:

    [E(X) − r · SD(X), E(X) + r · SD(X)]

    For example, for r = 3 we have

    P{|X − E(X)| < 3 · SD(X)} ≥ 1 − 1/3² = 8/9

    which is equivalent to

    P{E(X) − 3 · SD(X) < X < E(X) + 3 · SD(X)} ≥ 0.8889

    or

    P{X ∈ (E(X) − 3 · SD(X), E(X) + 3 · SD(X))} ≥ 0.8889

    73
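The bound can be checked against a concrete distribution, e.g. the dice-difference rv from slide 60 (a Python sketch; any distribution would do):

```python
from fractions import Fraction
from itertools import product
from math import sqrt

rolls = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)
mean = sum(abs(i - j) * p for i, j in rolls)
sd = sqrt(sum((abs(i - j) - mean) ** 2 * p for i, j in rolls))

def prob_within(r):
    """Exact P(|X - E(X)| < r * SD(X)) for X = dice difference."""
    return float(sum(p for i, j in rolls if abs(abs(i - j) - mean) < r * sd))

# the specific Chebyshev bound 1 - 1/r^2 must hold for every r > 0
for r in (1.5, 2.0, 3.0):
    assert prob_within(r) >= 1 - 1 / r ** 2
```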

  • Theorem 2.20: (Jensen inequality)

    Let X be a random variable with mean E(X) and let g : ℝ → ℝ be a convex function, i.e. for all x we have g″(x) ≥ 0; then

    E[g(X)] ≥ g(E[X]).

    Remarks:

    If the function g is concave (i.e. if g″(x) ≤ 0 for all x) then Jensen's inequality states that E[g(X)] ≤ g(E[X])

    Notice that in general we have E[g(X)] ≠ g(E[X])

    74

  • Example:

    Consider the random variable X and the function g(x) = x²

    We have g″(x) = 2 ≥ 0 for all x, i.e. g is convex

    It follows from Jensen's inequality that

    E[g(X)] = E(X²) ≥ g(E[X]) = [E(X)]²

    i.e. E(X²) − [E(X)]² ≥ 0

    This implies

    Var(X) = E(X²) − [E(X)]² ≥ 0

    (the variance of an arbitrary rv cannot be negative)
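The chain E(X²) ≥ [E(X)]² is easy to confirm for any finite distribution; the sketch below uses the Number-of-Heads density from slide 42:

```python
from fractions import Fraction

# density of X = Number of Heads in three coin tosses (slide 42)
dist = {0: Fraction(1, 8), 1: Fraction(3, 8),
        2: Fraction(3, 8), 3: Fraction(1, 8)}

E_X  = sum(x * p for x, p in dist.items())        # E(X)   = 3/2
E_X2 = sum(x ** 2 * p for x, p in dist.items())   # E(X^2) = 3

# Jensen with the convex function g(x) = x^2
assert E_X2 >= E_X ** 2      # hence Var(X) = E_X2 - E_X**2 >= 0
```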

    75

  • Now:

    Consider the random variable X with expectation E(X) = μX, the integer number n ∈ ℕ and the functions

    g1(x) = xⁿ

    g2(x) = [x − μX]ⁿ

    Definition 2.21: (Moments, central moments)

    (a) The n-th moment of X, denoted by μ′n, is defined as μ′n ≡ E[g1(X)] = E(Xⁿ).

    (b) The n-th central moment of X about μX, denoted by μn, is defined as μn ≡ E[g2(X)] = E[(X − μX)ⁿ].

    76

  • Relations:

    μ′1 = E(X) = μX (the 1st moment coincides with E(X))

    μ1 = E[X − μX] = E(X) − μX = 0 (the 1st central moment is always equal to 0)

    μ2 = E[(X − μX)²] = Var(X) (the 2nd central moment coincides with Var(X))

    77

  • Remarks:

    The first four moments of a random variable X are important measures of the probability distribution (expectation, variance, skewness, kurtosis)

    The moments of a random variable X play an important role in theoretical and applied statistics

    In some cases, when all moments are known, the cdf of a random variable X can be determined

    78

  • Question:

    Can we find a function that gives us a representation of all moments of a random variable X?

    Definition 2.22: (Moment generating function)

    Let X be a random variable with discrete density or pdf fX(·). The expected value of e^{tX} is defined to be the moment generating function of X if the expected value exists for every value of t in some interval −h < t < h, h > 0. That is, the moment generating function of X, denoted by mX(t), is defined as

    mX(t) = E[e^{tX}].

    79

  • Remarks:

    The moment generating function mX(t) is a function in t

    There are rvs X for which mX(t) does not exist

    If mX(t) exists it can be calculated as

    mX(t) = E[e^{tX}] = Σ_{xj ∈ supp(X)} e^{t·xj} · P(X = xj), if X is discrete

    mX(t) = E[e^{tX}] = ∫_{−∞}^{+∞} e^{t·x} · fX(x) dx, if X is continuous

    80

  • Question:

    Why is mX(t) called the moment generating function?

    Answer:

    Consider the nth derivative of mX(t) with respect to t:

    dⁿ/dtⁿ mX(t) = Σ_{xj ∈ supp(X)} (xj)ⁿ · e^{t·xj} · P(X = xj) for discrete X

    dⁿ/dtⁿ mX(t) = ∫_{−∞}^{+∞} xⁿ · e^{t·x} · fX(x) dx for continuous X

    81

    Now, evaluate the nth derivative at t = 0:

    dⁿ/dtⁿ mX(0) = Σ_{xj ∈ supp(X)} (xj)ⁿ · P(X = xj) for discrete X

    dⁿ/dtⁿ mX(0) = ∫_{−∞}^{+∞} xⁿ · fX(x) dx for continuous X

    = E(Xⁿ) = μ′n

    (see Definition 2.21(a) on slide 76)

    82

  • Example:

    Let X be a continuous random variable with pdf

    fX(x) = 0 for x < 0, λ·e^{−λx} for x ≥ 0

    (exponential distribution with parameter λ > 0)

    We have

    mX(t) = E[e^{tX}] = ∫_{−∞}^{+∞} e^{t·x} · fX(x) dx = ∫_{0}^{+∞} λ·e^{−(λ−t)·x} dx = λ/(λ − t) for t < λ

    83

    It follows that

    m′X(t) = λ/(λ − t)² and m″X(t) = 2λ/(λ − t)³

    and thus

    m′X(0) = E(X) = 1/λ and m″X(0) = E(X²) = 2/λ²
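These moment formulas can be verified numerically. The sketch below (λ = 2 is an arbitrary choice) approximates mX(t) by integrating e^{tx}·λe^{−λx} with the midpoint rule and recovers E(X) = 1/λ from a central finite difference at t = 0:

```python
from math import exp

lam = 2.0   # arbitrary rate parameter lambda > 0

def m_X(t, n=100_000, upper=40.0):
    """mgf of Exp(lam) via midpoint-rule integration over [0, upper];
    the tail beyond `upper` is negligible for t < lam."""
    h = upper / n
    return sum(exp(t * x) * lam * exp(-lam * x)
               for x in (h * (i + 0.5) for i in range(n))) * h

# closed form lam / (lam - t), valid for t < lam
assert abs(m_X(0.5) - lam / (lam - 0.5)) < 1e-3

# m'_X(0) ~ E(X) = 1/lam via a central difference
eps = 1e-4
d1 = (m_X(eps) - m_X(-eps)) / (2 * eps)
assert abs(d1 - 1 / lam) < 1e-3
```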

    Now:

    Important result on moment generating functions

    84

  • Theorem 2.23: (Identification property)

    Let X and Y be two random variables with densities fX(·) and fY(·), respectively. Suppose that mX(t) and mY(t) both exist and that mX(t) = mY(t) for all t in the interval −h < t < h for some h > 0. Then the two cdfs FX(·) and FY(·) are equal; that is, FX(x) = FY(x) for all x.

    Remarks:

    Theorem 2.23 states that there is a unique cdf FX(x) for a given moment generating function mX(t)

    ⟹ if we can find mX(t) for X then, at least theoretically, we can find the distribution of X

    We will make use of this property in Section 4

    85

  • Example:

    Suppose that a random variable X has the moment generating function

    mX(t) = 1/(1 − t) for −1 < t < 1

    Then the pdf of X is given by

    fX(x) = 0 for x < 0, e^{−x} for x ≥ 0

    (exponential distribution with parameter λ = 1)

    86

  • 2.4 Special Parametric Families of Univariate Distributions

    Up to now:

    General mathematical properties of arbitrary distributions

    Discrimination: discrete vs continuous distributions

    Consideration of

    the cdf FX(x)

    the discrete density or the pdf fX(x)

    expectations of the form E[g(X)]

    the moment generating function mX(t)

    87

  • Central result:

    The distribution of a random variable X is (essentially) determined by fX(x) or FX(x)

    FX(x) can be determined from fX(x) (cf. slide 46)

    fX(x) can be determined from FX(x) (cf. slide 48)

    Question:

    How many different distributions are known to exist?

    88

  • Answer:

    Infinitely many

    But:

    In practice, there are some important parametric families of distributions that provide good models for representing real-world random phenomena

    These families of distributions are described in detail in all textbooks on mathematical statistics (see e.g. Mosler & Schmid (2008), Mood et al. (1974))

    89

  • Important families of discrete distributions:

    Bernoulli distribution

    Binomial distribution

    Geometric distribution

    Poisson distribution

    Important families of continuous distributions:

    Uniform or rectangular distribution

    Exponential distribution

    Normal distribution

    90

  • Remark:

    The most important family of distributions of all is the normal distribution

    Definition 2.24: (Normal distribution)

    A continuous random variable X is defined to be normally distributed with parameters μ ∈ ℝ and σ² > 0, denoted by X ∼ N(μ, σ²), if its pdf is given by

    fX(x) = 1/(σ·√(2π)) · e^{−(1/2)·((x−μ)/σ)²}, x ∈ ℝ.
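A direct transcription of this pdf into Python (μ = 5 and σ = 3 are chosen to match one of the curves on the next slide) lets one check numerically that the density integrates to one; this is an illustrative sketch, not course material:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=5.0, sigma=3.0):
    """pdf of the N(mu, sigma^2) distribution (Definition 2.24)."""
    return 1.0 / (sigma * sqrt(2.0 * pi)) * exp(-0.5 * ((x - mu) / sigma) ** 2)

# Riemann-sum check that the total area under the pdf is ~ 1
# (the grid [-40, 50] covers mu +/- 15 sigma, so the tails are negligible)
h = 0.001
area = sum(normal_pdf(-40.0 + (i + 0.5) * h) for i in range(90_000)) * h

assert abs(area - 1.0) < 1e-6
assert normal_pdf(5.0) > normal_pdf(8.0)                 # mode at mu
assert abs(normal_pdf(3.3) - normal_pdf(6.7)) < 1e-12    # symmetry about mu
```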

    91

  • PDFs of the normal distribution

    [Figure: pdfs of N(0,1), N(5,1), N(5,3) and N(5,5) plotted against x]

    92

  • Remarks:

    The special normal distribution N(0,1) is called standard normal distribution, the pdf of which is denoted by φ(x)

    The properties as well as calculation rules for normally distributed random variables are important pre-conditions for this course (see Wilfling (2011), Section 3.4)

    93

  • 3. Joint and Conditional Distributions, Stochastic Independence

    Aim of this section:

    Multidimensional random variables (random vectors) (joint and marginal distributions)

    Stochastic (in)dependence and conditional distribution

    Multivariate normal distribution (definition, properties)

    Literature:

    Mood, Graybill, Boes (1974), Chapter IV, pp. 129-174

    Wilfling (2011), Chapter 4

    94

  • 3.1 Joint and Marginal Distribution

    Now:

    Consider several random variables simultaneously

    Applications:

    Several economic applications

    Statistical inference

    95

  • Definition 3.1: (Random vector)

    Let X1, . . . , Xn be a set of n random variables each representing the same random experiment, i.e.

    Xi : Ω → R for i = 1, . . . , n.

    Then X = (X1, . . . , Xn) is called an n-dimensional random variable or an n-dimensional random vector.

    Remark:

    In the literature random vectors are often denoted by X = (X1, . . . , Xn) or, more simply, by X1, . . . , Xn

    96

  • For n = 2 it is common practice to write X = (X,Y ) or (X,Y ) or X,Y

    Realizations are denoted by small letters:

    x = (x1, . . . , xn) ∈ Rn or x = (x, y) ∈ R²

    Now:

    Characterization of the probability distribution of the random vector X

    97

  • Definition 3.2: (Joint cumulative distribution function)

    Let X = (X1, . . . , Xn) be an n-dimensional random vector. The function

    FX1,...,Xn : Rn → [0,1]

    defined by

    FX1,...,Xn(x1, . . . , xn) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn)

    is called the joint cumulative distribution function of X.

    Remark:

    Definition 3.2 applies to discrete as well as to continuous random variables X1, . . . , Xn

    98

  • Some properties of the bivariate cdf (n = 2):

    FX,Y (x, y) is monotone increasing in x and y

    lim_{x→−∞} FX,Y (x, y) = 0

    lim_{y→−∞} FX,Y (x, y) = 0

    lim_{x→+∞, y→+∞} FX,Y (x, y) = 1

    Remark:

    Analogous properties hold for the n-dimensional cdf FX1,...,Xn(x1, . . . , xn)

    99

  • Now:

    Joint discrete versus joint continuous random vectors

    Definition 3.3: (Joint discrete random vector)

    The random vector X = (X1, . . . , Xn) is defined to be a joint discrete random vector if it can assume only a finite (or a countably infinite) number of realizations x = (x1, . . . , xn) such that

    P(X1 = x1, X2 = x2, . . . , Xn = xn) > 0

    and ∑ P(X1 = x1, X2 = x2, . . . , Xn = xn) = 1,

    where the summation is over all possible realizations of X.

    100

  • Definition 3.4: (Joint continuous random vector)

    The random vector X = (X1, . . . , Xn) is defined to be a joint continuous random vector if and only if there exists a nonnegative function fX1,...,Xn(x1, . . . , xn) such that

    FX1,...,Xn(x1, . . . , xn) = ∫_{−∞}^{xn} . . . ∫_{−∞}^{x1} fX1,...,Xn(u1, . . . , un) du1 . . . dun

    for all (x1, . . . , xn). The function fX1,...,Xn is defined to be a joint probability density function of X.

    Example:

    Consider X = (X,Y ) with joint pdf

    fX,Y (x, y) = { x + y , for (x, y) ∈ [0,1] × [0,1]
                 { 0     , elsewise

    101

  • Joint pdf fX,Y (x, y)

    [Figure: surface plot of fX,Y (x, y) = x + y on the unit square]

    102

  • The joint cdf can be obtained by

    FX,Y (x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fX,Y (u, v) du dv

    = ∫_0^y ∫_0^x (u + v) du dv

    = . . .

    = { 0.5 · (x²y + xy²) , for (x, y) ∈ [0,1] × [0,1]
      { 0.5 · (x² + x)    , for (x, y) ∈ [0,1] × [1,∞)
      { 0.5 · (y² + y)    , for (x, y) ∈ [1,∞) × [0,1]
      { 1                 , for (x, y) ∈ [1,∞) × [1,∞)

    (Proof: Class)

    103
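The closed form above can be verified numerically; the sketch below (a numerical check, not part of the slides) compares the [0,1]×[0,1] branch with a midpoint-rule double integral, which is exact for the linear integrand u + v:

```python
def F_closed(x, y):
    # branch of the joint cdf on [0,1] x [0,1] derived above
    return 0.5 * (x * x * y + x * y * y)

def F_numeric(x, y, n=200):
    # midpoint-rule approximation of the double integral of (u + v)
    hx, hy = x / n, y / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * hx
        for j in range(n):
            total += u + (j + 0.5) * hy
    return total * hx * hy

assert abs(F_numeric(0.7, 0.4) - F_closed(0.7, 0.4)) < 1e-9
```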

  • Remarks:

    If X = (X1, . . . , Xn) is a joint continuous random vector, then

    ∂ⁿ FX1,...,Xn(x1, . . . , xn) / (∂x1 · · · ∂xn) = fX1,...,Xn(x1, . . . , xn)

    The volume under the joint pdf represents probabilities:

    P(a1 < X1 ≤ b1, . . . , an < Xn ≤ bn) = ∫_{an}^{bn} . . . ∫_{a1}^{b1} fX1,...,Xn(u1, . . . , un) du1 . . . dun

    104

  • In this course:

    Emphasis on joint continuous random vectors

    Analogous results for joint discrete random vectors (see Mood, Graybill, Boes (1974), Chapter IV)

    Now:

    Determination of the distribution of a single random variable Xi from the joint distribution of the random vector (X1, . . . , Xn) → marginal distribution

    105

  • Definition 3.5: (Marginal distribution)

    Let X = (X1, . . . , Xn) be a continuous random vector with joint cdf FX1,...,Xn and joint pdf fX1,...,Xn. Then

    FX1(x1) = FX1,...,Xn(x1, +∞, +∞, . . . , +∞, +∞)

    FX2(x2) = FX1,...,Xn(+∞, x2, +∞, . . . , +∞, +∞)

    . . .

    FXn(xn) = FX1,...,Xn(+∞, +∞, +∞, . . . , +∞, xn)

    are called marginal cdfs while

    106

  • fX1(x1) = ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} fX1,...,Xn(x1, x2, . . . , xn) dx2 . . . dxn

    fX2(x2) = ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} fX1,...,Xn(x1, x2, . . . , xn) dx1 dx3 . . . dxn

    . . .

    fXn(xn) = ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} fX1,...,Xn(x1, x2, . . . , xn) dx1 dx2 . . . dxn−1

    are called marginal pdfs of the one-dimensional (univariate) random variables X1, . . . , Xn.

    107

  • Example:

    Consider the bivariate pdf

    fX,Y (x, y) = { 40 · (x − 0.5)² · y³ · (3 − 2x − y) , for (x, y) ∈ [0,1] × [0,1]
                 { 0                                   , elsewise

    108

  • Bivariate pdf fX,Y (x, y)

    [Figure: surface plot of the bivariate pdf on the unit square]

    109

  • The marginal pdf of X obtains as

    fX(x) = ∫_0^1 40 · (x − 0.5)² · y³ · (3 − 2x − y) dy

    = 40 · (x − 0.5)² · ∫_0^1 (3y³ − 2xy³ − y⁴) dy

    = 40 · (x − 0.5)² · [ (3/4)·y⁴ − (2x/4)·y⁴ − (1/5)·y⁵ ]_0^1

    = 40 · (x − 0.5)² · (3/4 − 2x/4 − 1/5)

    = −20x³ + 42x² − 27x + 5.5

    110
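The polynomial just derived can be cross-checked against the joint pdf; the sketch below (a numerical check, not part of the slides) integrates the joint density over y with a midpoint rule and compares at a few test points:

```python
def f_joint(x, y):
    # joint pdf from Slide 108
    return 40 * (x - 0.5) ** 2 * y ** 3 * (3 - 2 * x - y)

def f_X(x):
    # marginal pdf derived above
    return -20 * x ** 3 + 42 * x ** 2 - 27 * x + 5.5

# midpoint-rule integration of the joint pdf over y in [0,1]
n = 2000
for x in (0.1, 0.5, 0.9):
    num = sum(f_joint(x, (j + 0.5) / n) for j in range(n)) / n
    assert abs(num - f_X(x)) < 1e-4
```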

  • Marginal pdf fX(x)

    [Figure: plot of fX(x) on [0,1]]

    111

  • The marginal pdf of Y obtains as

    fY (y) = ∫_0^1 40 · (x − 0.5)² · y³ · (3 − 2x − y) dx

    = 40y³ · ∫_0^1 (x − 0.5)² · (3 − 2x − y) dx

    = (10/3) · y³ · (2 − y)

    112

  • Marginal pdf fY (y)

    [Figure: plot of fY (y) on [0,1]]

    113

  • Remarks:

    When considering the marginal instead of the joint distributions, we are faced with an information loss (the joint distribution uniquely determines all marginal distributions, but the converse does not hold in general)

    Besides the respective univariate marginal distributions, there are also multivariate distributions which can be obtained from the joint distribution of X = (X1, . . . , Xn)

    114

  • Example:

    For n = 5 consider X = (X1, . . . , X5) with joint pdf fX1,...,X5

    Then the marginal pdf of Z = (X1, X3, X5) obtains as

    fX1,X3,X5(x1, x3, x5) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} fX1,...,X5(x1, x2, x3, x4, x5) dx2 dx4

    (integrate out the irrelevant components)

    115

  • 3.2 Conditional Distribution and Stochastic Independence

    Now:

    Distribution of a random variable X under the condition that another random variable Y has already taken on the realization y (conditional distribution of X given Y = y)

    116

  • Definition 3.6: (Conditional distribution)

    Let X = (X,Y ) be a bivariate continuous random vector with joint pdf fX,Y (x, y). The conditional density of X given Y = y is defined to be

    fX|Y=y(x) = fX,Y (x, y) / fY (y).

    Analogously, the conditional density of Y given X = x is defined to be

    fY |X=x(y) = fX,Y (x, y) / fX(x).

    117

  • Remark:

    Conditional densities of random vectors are defined analogously, e.g.

    fX1,X2,X4|X3=x3,X5=x5(x1, x2, x4) = fX1,X2,X3,X4,X5(x1, x2, x3, x4, x5) / fX3,X5(x3, x5)

    118

  • Example:

    Consider the bivariate pdf

    fX,Y (x, y) = { 40 · (x − 0.5)² · y³ · (3 − 2x − y) , for (x, y) ∈ [0,1] × [0,1]
                 { 0                                   , elsewise

    with marginal pdf

    fY (y) = (10/3) · y³ · (2 − y)

    (cf. Slides 108-112)

    119

  • It follows that

    fX|Y=y(x) = fX,Y (x, y) / fY (y)

    = [40 · (x − 0.5)² · y³ · (3 − 2x − y)] / [(10/3) · y³ · (2 − y)]

    = 12 · (x − 0.5)² · (3 − 2x − y) / (2 − y)

    120
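A quick sanity check on the conditional density above: for every fixed y it must integrate to one over x. The sketch below (a numerical check, not part of the slides) verifies this with a midpoint rule:

```python
def f_cond(x, y):
    # conditional density of X given Y = y derived above
    return 12 * (x - 0.5) ** 2 * (3 - 2 * x - y) / (2 - y)

# for any fixed y the conditional density must integrate to one over x
n = 2000
for y in (0.2, 0.5, 0.8):
    mass = sum(f_cond((i + 0.5) / n, y) for i in range(n)) / n
    assert abs(mass - 1.0) < 1e-4
```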

  • Conditional pdf fX|Y=0.01(x) of X given Y = 0.01

    [Figure: plot of the conditional density on [0,1]]

    121

  • Conditional pdf fX|Y=0.95(x) of X given Y = 0.95

    [Figure: plot of the conditional density on [0,1]]

    122

  • Now:

    Combine the concepts "joint distribution" and "conditional distribution" to define the notion of stochastic independence (for two random variables first)

    Definition 3.7: (Stochastic independence [I])

    Let (X,Y ) be a bivariate continuous random vector with joint pdf fX,Y (x, y). X and Y are defined to be stochastically independent if and only if

    fX,Y (x, y) = fX(x) · fY (y) for all x, y ∈ R.

    123

  • Remarks:

    Alternatively, stochastic independence can be defined via the cdfs: X and Y are stochastically independent if and only if

    FX,Y (x, y) = FX(x) · FY (y) for all x, y ∈ R.

    If X and Y are independent, we have

    fX|Y=y(x) = fX,Y (x, y) / fY (y) = fX(x) · fY (y) / fY (y) = fX(x)

    fY |X=x(y) = fX,Y (x, y) / fX(x) = fX(x) · fY (y) / fX(x) = fY (y)

    If X and Y are independent and g and h are two continuous functions, then g(X) and h(Y ) are also independent

    124

  • Now:

    Extension to n random variables

    Definition 3.8: (Stochastic independence [II])

    Let (X1, . . . , Xn) be a continuous random vector with joint pdf fX1,...,Xn(x1, . . . , xn) and joint cdf FX1,...,Xn(x1, . . . , xn). X1, . . . , Xn are defined to be stochastically independent if and only if for all (x1, . . . , xn) ∈ Rn

    fX1,...,Xn(x1, . . . , xn) = fX1(x1) · . . . · fXn(xn)

    or

    FX1,...,Xn(x1, . . . , xn) = FX1(x1) · . . . · FXn(xn).

    125

  • Remarks:

    For discrete random vectors we define: X1, . . . , Xn are stochastically independent if and only if for all (x1, . . . , xn) ∈ Rn

    P(X1 = x1, . . . , Xn = xn) = P(X1 = x1) · . . . · P(Xn = xn)

    or

    FX1,...,Xn(x1, . . . , xn) = FX1(x1) · . . . · FXn(xn)

    In the case of independence, the joint distribution results from the marginal distributions

    If X1, . . . , Xn are stochastically independent and g1, . . . , gn are continuous functions, then Y1 = g1(X1), . . . , Yn = gn(Xn) are also stochastically independent

    126

  • 3.3 Expectation and Joint Moment Generating Functions

    Now:

    Definition of the expectation of a function

    g : Rn → R, (x1, . . . , xn) ↦ g(x1, . . . , xn)

    of a continuous random vector X = (X1, . . . , Xn)

    127

  • Definition 3.9: (Expectation of a function)

    Let (X1, . . . , Xn) be a continuous random vector with joint pdf fX1,...,Xn(x1, . . . , xn) and g : Rn → R a real-valued continuous function. The expectation of the function g of the random vector is defined to be

    E[g(X1, . . . , Xn)] = ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} g(x1, . . . , xn) · fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn.

    128

  • Remarks:

    For a discrete random vector (X1, . . . , Xn) the analogous definition is

    E[g(X1, . . . , Xn)] = ∑ g(x1, . . . , xn) · P(X1 = x1, . . . , Xn = xn),

    where the summation is over all realizations of the vector

    Definition 3.9 includes the expectation of a univariate random variable X: Set n = 1 and g(x) = x →

    E(X1) ≡ E(X) = ∫_{−∞}^{+∞} x · fX(x) dx

    Definition 3.9 includes the variance of X: Set n = 1 and g(x) = [x − E(X)]² →

    Var(X1) ≡ Var(X) = ∫_{−∞}^{+∞} [x − E(X)]² · fX(x) dx

    129

  • Definition 3.9 includes the covariance of two variables: Set n = 2 and g(x1, x2) = [x1 − E(X1)] · [x2 − E(X2)] →

    Cov(X1, X2) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} [x1 − E(X1)] · [x2 − E(X2)] · fX1,X2(x1, x2) dx1 dx2

    Via the covariance we define the correlation coefficient:

    Corr(X1, X2) = Cov(X1, X2) / (√Var(X1) · √Var(X2))

    General properties of expected values, variances, covariances and the correlation coefficient → Class

    130

  • Now:

    Expectations and variances of random vectors

    Definition 3.10: (Expected vector, covariance matrix)

    Let X = (X1, . . . , Xn) be a random vector. The expected vector of X is defined to be

    E(X) = (E(X1), . . . , E(Xn))′.

    The covariance matrix of X is defined to be

    Cov(X) = [ Var(X1)      Cov(X1, X2)  . . .  Cov(X1, Xn) ]
             [ Cov(X2, X1)  Var(X2)      . . .  Cov(X2, Xn) ]
             [ ...          ...          . . .  ...         ]
             [ Cov(Xn, X1)  Cov(Xn, X2)  . . .  Var(Xn)     ]

    131

  • Remark:

    Obviously, the covariance matrix is symmetric by definition

    Now:

    Expected vectors and covariance matrices under linear transformations of random vectors

    Let

    X = (X1, . . . , Xn) be an n-dimensional random vector

    A be an (m × n) matrix of real numbers

    b be an (m × 1) column vector of real numbers

    132

  • Obviously:

    Y = AX + b is an (m × 1) random vector:

    Y = [ a11 a12 . . . a1n ]   [ X1 ]   [ b1 ]
        [ a21 a22 . . . a2n ] · [ X2 ] + [ b2 ]
        [ ...           ... ]   [ ...]   [ ...]
        [ am1 am2 . . . amn ]   [ Xn ]   [ bm ]

      = [ a11·X1 + a12·X2 + . . . + a1n·Xn + b1 ]
        [ a21·X1 + a22·X2 + . . . + a2n·Xn + b2 ]
        [ ...                                   ]
        [ am1·X1 + am2·X2 + . . . + amn·Xn + bm ]

    133

  • The expected vector of Y is given by

    E(Y) = [ a11·E(X1) + a12·E(X2) + . . . + a1n·E(Xn) + b1 ]
           [ a21·E(X1) + a22·E(X2) + . . . + a2n·E(Xn) + b2 ]
           [ ...                                            ]
           [ am1·E(X1) + am2·E(X2) + . . . + amn·E(Xn) + bm ]

         = A · E(X) + b

    The covariance matrix of Y is given by

    Cov(Y) = [ Var(Y1)      Cov(Y1, Y2)  . . .  Cov(Y1, Ym) ]
             [ Cov(Y2, Y1)  Var(Y2)      . . .  Cov(Y2, Ym) ]
             [ ...          ...          . . .  ...         ]
             [ Cov(Ym, Y1)  Cov(Ym, Y2)  . . .  Var(Ym)     ]

           = A · Cov(X) · A′

    (Proof: Class)

    134
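The two rules E(Y) = A·E(X) + b and Cov(Y) = A·Cov(X)·A′ can be illustrated empirically; in the sketch below (not part of the slides; A, b, μ and Σ are arbitrary example values chosen here) a normal random vector is sampled and the sample moments are compared with the theoretical ones:

```python
import numpy as np

# Empirical illustration of E(Y) = A E(X) + b and Cov(Y) = A Cov(X) A'
# (A, b, mu and Sigma are arbitrary example values, not from the slides)
rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])            # (2 x 3) matrix
b = np.array([1.0, -2.0])
mu = np.array([0.5, 1.0, 2.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                              # row-wise Y = A x + b

assert np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.05)
assert np.allclose(np.cov(Y.T), A @ Sigma @ A.T, atol=0.15)
```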

  • Remark:

    Cf. the analogous results for univariate variables:

    E(a · X + b) = a · E(X) + b

    Var(a · X + b) = a² · Var(X)

    Up to now:

    Expected values for unconditional distributions

    Now:

    Expected values for conditional distributions (cf. Definition 3.6, Slide 117)

    135

  • Definition 3.11: (Conditional expected value of a function)

    Let (X,Y ) be a continuous random vector with joint pdf fX,Y (x, y) and let g : R² → R be a real-valued function. The conditional expected value of the function g given X = x is defined to be

    E[g(X,Y )|X = x] = ∫_{−∞}^{+∞} g(x, y) · fY |X(y) dy.

    136

  • Remarks:

    An analogous definition applies to a discrete random vector (X,Y )

    Definition 3.11 naturally extends to higher-dimensional distributions

    For g(x, y) = y we obtain the special case E[g(X,Y )|X = x] = E(Y |X = x)

    Note that E[g(X,Y )|X = x] is a function of x

    137

  • Example:

    Consider the joint pdf

    fX,Y (x, y) = { x + y , for (x, y) ∈ [0,1] × [0,1]
                 { 0     , elsewise

    The conditional distribution of Y given X = x is given by

    fY |X(y) = { (x + y)/(x + 0.5) , for (x, y) ∈ [0,1] × [0,1]
              { 0                 , elsewise

    For g(x, y) = y the conditional expectation is given as

    E(Y |X = x) = ∫_0^1 y · (x + y)/(x + 0.5) dy = (1/(x + 0.5)) · (x/2 + 1/3)

    138
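The closed form for E(Y|X = x) can be checked numerically; the sketch below (a numerical check, not part of the slides) evaluates the defining integral with a midpoint rule at a test point:

```python
def cond_mean(x):
    # closed form derived above: E(Y | X = x) = (x/2 + 1/3) / (x + 0.5)
    return (x / 2 + 1.0 / 3) / (x + 0.5)

# midpoint-rule evaluation of the defining integral at x = 0.3
n, x = 2000, 0.3
num = sum(((j + 0.5) / n) * (x + (j + 0.5) / n) / (x + 0.5) for j in range(n)) / n
assert abs(num - cond_mean(x)) < 1e-6
```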

  • Remarks:

    Consider the function g(x, y) = g(y) (i.e. g does not depend on x)

    Denote h(x) = E[g(Y )|X = x]

    We calculate the unconditional expectation of the transformed variable h(X)

    We have

    139

  • E {E[g(Y )|X = x]} = E[h(X)] = ∫_{−∞}^{+∞} h(x) · fX(x) dx

    = ∫_{−∞}^{+∞} E[g(Y )|X = x] · fX(x) dx

    = ∫_{−∞}^{+∞} [ ∫_{−∞}^{+∞} g(y) · fY |X(y) dy ] · fX(x) dx

    = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(y) · fY |X(y) · fX(x) dy dx

    = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(y) · fX,Y (x, y) dy dx

    = E[g(Y )]

    140

  • Theorem 3.12:

    Let (X,Y ) be an arbitrary discrete or continuous random vector. Then

    E[g(Y )] = E {E[g(Y )|X = x]}

    and, in particular,

    E[Y ] = E {E[Y |X = x]} .

    Now:

    Three important rules for conditional and unconditional expected values

    141
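Theorem 3.12 can be checked for the example on Slide 138: with fX,Y(x, y) = x + y the marginal is fX(x) = x + 0.5, and direct computation (done here, not on the slides) gives E(Y) = 7/12. The sketch below verifies that the outer expectation of E(Y|X = x) reproduces this value:

```python
def cond_mean(x):
    # E(Y | X = x) for the density f(x, y) = x + y (Slide 138)
    return (x / 2 + 1.0 / 3) / (x + 0.5)

def f_X(x):
    # marginal of X: integral of (x + y) over y in [0,1]
    return x + 0.5

# outer expectation E{E(Y | X)} via the midpoint rule
n = 2000
outer = sum(cond_mean((i + 0.5) / n) * f_X((i + 0.5) / n) for i in range(n)) / n
assert abs(outer - 7.0 / 12) < 1e-6  # direct computation gives E(Y) = 7/12
```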

  • Theorem 3.13:

    Let (X,Y ) be an arbitrary discrete or continuous random vector and g1(·), g2(·) two unidimensional functions. Then

    1. E[g1(Y ) + g2(Y )|X = x] = E[g1(Y )|X = x] + E[g2(Y )|X = x],

    2. E[g1(Y ) · g2(X)|X = x] = g2(x) · E[g1(Y )|X = x].

    3. If X and Y are stochastically independent we have

    E[g1(X) · g2(Y )] = E[g1(X)] · E[g2(Y )].

    142

  • Finally:

    Moment generating functions for random vectors

    Definition 3.14: (Joint moment generating function)

    Let X = (X1, . . . , Xn) be an arbitrary discrete or continuous random vector. The joint moment generating function of X is defined to be

    mX1,...,Xn(t1, . . . , tn) = E[e^{t1·X1 + . . . + tn·Xn}]

    if this expectation exists for all t1, . . . , tn with −h < tj < h for an arbitrary value h > 0 and for all j = 1, . . . , n.

    143

  • Remarks:

    Via the joint moment generating function mX1,...,Xn(t1, . . . , tn) we can derive the following mathematical objects:

    the marginal moment generating functions mX1(t1), . . . , mXn(tn)

    the moments of the marginal distributions

    the so-called joint moments

    144

  • Important result: (cf. Theorem 2.23, Slide 85)

    For any given joint moment generating function mX1,...,Xn(t1, . . . , tn) there exists a unique joint cdf FX1,...,Xn(x1, . . . , xn)

    145

  • 3.4 The Multivariate Normal Distribution

    Now:

    Extension of the univariate normal distribution

    Definition 3.15: (Multivariate normal distribution)

    Let X = (X1, . . . , Xn) be a continuous random vector. X is defined to have a multivariate normal distribution with parameters

    μ = (μ1, . . . , μn)′ and Σ = [ σ1²  . . .  σ1n ]
                                 [ ...  . . .  ... ]
                                 [ σn1  . . .  σn² ] ,

    if for x = (x1, . . . , xn) ∈ Rn its joint pdf is given by

    fX(x) = (2π)^{−n/2} · [det(Σ)]^{−1/2} · exp{ −(1/2) · (x − μ)′ Σ⁻¹ (x − μ) }.

    146

  • Remarks:

    See Chang (1984, p. 92) for a definition and the properties of the determinant det(A) of the matrix A

    Notation: X ~ N(μ, Σ)

    μ is a column vector with μ1, . . . , μn ∈ R

    Σ is a regular, positive definite, symmetric (n × n) matrix

    Role of the parameters:

    E(X) = μ and Cov(X) = Σ

    147

  • Joint pdf of the multivariate standard normal distribution N(0, In):

    φ(x) = (2π)^{−n/2} · exp{ −(1/2) · x′x }

    Cf. the analogy to the univariate pdf in Definition 2.24, Slide 91

    Properties of the N(μ, Σ) distribution:

    Partial vectors (marginal distributions) of X also have multivariate normal distributions, i.e. if

    X = [ X1 ]  ~  N( [ μ1 ] , [ Σ11 Σ12 ] )
        [ X2 ]        [ μ2 ]   [ Σ21 Σ22 ]

    then

    X1 ~ N(μ1, Σ11) and X2 ~ N(μ2, Σ22)

    148

  • Thus, all univariate variables of X = (X1, . . . , Xn) have univariate normal distributions:

    X1 ~ N(μ1, σ1²), X2 ~ N(μ2, σ2²), . . . , Xn ~ N(μn, σn²)

    The conditional distributions are also (univariately or multivariately) normal:

    X1|X2 = x2 ~ N( μ1 + Σ12 Σ22⁻¹ (x2 − μ2), Σ11 − Σ12 Σ22⁻¹ Σ21 )

    Linear transformations: Let A be an (m × n) matrix, b an (m × 1) vector of real numbers and X = (X1, . . . , Xn) ~ N(μ, Σ). Then

    AX + b ~ N(Aμ + b, AΣA′)

    149

  • Example:

    Consider

    X ~ N(μ, Σ) ≡ N( [ 0 ] , [ 1   0.5 ] )
                     [ 1 ]   [ 0.5  2  ]

    Find the distribution of Y = AX + b where

    A = [ 1 2 ] ,  b = [ 1 ]
        [ 3 4 ]        [ 2 ]

    It follows that Y ~ N(Aμ + b, AΣA′)

    In particular,

    Aμ + b = [ 3 ]  and  AΣA′ = [ 11 24 ]
             [ 6 ]              [ 24 53 ]

    150
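The matrix arithmetic in this example is easy to verify (note that Var(Y1) = Var(X1 + 2X2 + 1) = 1 + 4·2 + 4·0.5 = 11); the sketch below (a numerical check, not part of the slides) computes Aμ + b and AΣA′:

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 2.0])

mean_Y = A @ mu + b       # expected vector of Y = AX + b
cov_Y = A @ Sigma @ A.T   # covariance matrix of Y

assert np.allclose(mean_Y, [3.0, 6.0])
assert np.allclose(cov_Y, [[11.0, 24.0], [24.0, 53.0]])
```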

  • Now:

    Consider the bivariate case (n = 2), i.e.

    X = (X,Y ), E(X) = [ μX ] ,  Σ = [ σX²  σXY ]
                       [ μY ]        [ σY X σY² ]

    We have

    σXY = σY X = Cov(X,Y ) = σX · σY · Corr(X,Y ) = σX · σY · ρ

    The joint pdf follows from Definition 3.15 with n = 2:

    fX,Y (x, y) = 1 / (2π · σX · σY · √(1 − ρ²)) · exp{ −1/(2·(1 − ρ²)) · [ (x − μX)²/σX² − 2ρ·(x − μX)·(y − μY )/(σX·σY ) + (y − μY )²/σY² ] }

    (Derivation: Class)

    151

  • fX,Y (x, y) for μX = μY = 0, σX = σY = 1 and ρ = 0

    [Figure: surface plot of the bivariate normal pdf]

    152

  • fX,Y (x, y) for μX = μY = 0, σX = σY = 1 and ρ = 0.9

    [Figure: surface plot of the bivariate normal pdf]

    153

  • Remarks:

    The marginal distributions are given by X ~ N(μX, σX²) and Y ~ N(μY , σY²)

    Interesting result for the normal distribution:

    If (X,Y ) has a bivariate normal distribution, then X and Y are independent if and only if ρ = Corr(X,Y ) = 0

    The conditional distributions are given by

    X|Y = y ~ N( μX + ρ·(σX/σY )·(y − μY ), σX²·(1 − ρ²) )

    Y |X = x ~ N( μY + ρ·(σY /σX)·(x − μX), σY²·(1 − ρ²) )

    (Proof: Class)

    154

  • 4. Distributions of Functions of Random Variables

    Setup:

    Consider as given the joint distribution of X1, . . . , Xn (i.e. consider as given fX1,...,Xn and FX1,...,Xn)

    Consider k functions

    g1 : Rn → R, . . . , gk : Rn → R

    Find the joint distribution of the k random variables

    Y1 = g1(X1, . . . , Xn), . . . , Yk = gk(X1, . . . , Xn)

    (i.e. find fY1,...,Yk and FY1,...,Yk)

    155

  • Example:

    Consider as given X1, . . . , Xn with fX1,...,Xn

    Consider the functions

    g1(X1, . . . , Xn) = ∑_{i=1}^{n} Xi and g2(X1, . . . , Xn) = (1/n) · ∑_{i=1}^{n} Xi

    Find fY1,Y2 with Y1 = ∑_{i=1}^{n} Xi and Y2 = (1/n) · ∑_{i=1}^{n} Xi

    Remark:

    From the joint distribution fY1,...,Yk we can derive the k marginal distributions fY1, . . . , fYk (cf. Chapter 3, Slides 106, 107)

    156

  • Aim of this chapter:

    Techniques for finding the (marginal) distribution(s) of (Y1, . . . , Yk)

    157

  • 4.1 Expectations of Functions of Random Variables

    Simplification:

    In a first step, we are not interested in the exact distributions, but merely in certain expected values of Y1, . . . , Yk

    Computing the expectation in two ways:

    Consider as given the (continuous) random variables X1, . . . , Xn and the function g : Rn → R

    Consider the random variable Y = g(X1, . . . , Xn) and find the expectation E[g(X1, . . . , Xn)]

    158

  • Two ways of calculating E(Y ):

    E(Y ) = ∫_{−∞}^{+∞} y · fY (y) dy

    or

    E(Y ) = ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} g(x1, . . . , xn) · fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn

    (cf. Definition 3.9, Slide 128)

    It can be proved that both ways of calculating E(Y ) are equivalent

    → choose the most convenient calculation

    159

  • Now:

    Calculation rules for expected values, variances, covariances of sums of random variables

    Setting:

    X1, . . . , Xn are given continuous or discrete random variables with joint density fX1,...,Xn

    The (transforming) function g : Rn → R is given by

    g(x1, . . . , xn) = ∑_{i=1}^{n} xi

    160

  • In a first step, find the expectation and the variance of

    Y = g(X1, . . . , Xn) = ∑_{i=1}^{n} Xi

    Theorem 4.1: (Expectation and variance of a sum)

    For the given random variables X1, . . . , Xn we have

    E( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} E(Xi)

    and

    Var( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} Var(Xi) + 2 · ∑_{i=1}^{n} ∑_{j=i+1}^{n} Cov(Xi, Xj).

    161

  • Implications:

    For given constants a1, . . . , an ∈ R we have

    E( ∑_{i=1}^{n} ai · Xi ) = ∑_{i=1}^{n} ai · E(Xi)

    (why?)

    For two random variables X1 and X2 we have

    E(X1 ± X2) = E(X1) ± E(X2)

    If X1, . . . , Xn are stochastically independent, it follows that Cov(Xi, Xj) = 0 for all i ≠ j and hence

    Var( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} Var(Xi)

    162

  • Now:

    Calculating the covariance of two sums of random variables

    Theorem 4.2: (Covariance of two sums)

    Let X1, . . . , Xn and Y1, . . . , Ym be two sets of random variables and let a1, . . . , an and b1, . . . , bm be two sets of constants. Then

    Cov( ∑_{i=1}^{n} ai · Xi, ∑_{j=1}^{m} bj · Yj ) = ∑_{i=1}^{n} ∑_{j=1}^{m} ai · bj · Cov(Xi, Yj).

    163

  • Implications:

    The variance of a weighted sum of random variables is given by

    Var( ∑_{i=1}^{n} ai · Xi ) = Cov( ∑_{i=1}^{n} ai · Xi, ∑_{j=1}^{n} aj · Xj )

    = ∑_{i=1}^{n} ∑_{j=1}^{n} ai · aj · Cov(Xi, Xj)

    = ∑_{i=1}^{n} ai² · Var(Xi) + ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} ai · aj · Cov(Xi, Xj)

    = ∑_{i=1}^{n} ai² · Var(Xi) + 2 · ∑_{i=1}^{n} ∑_{j=i+1}^{n} ai · aj · Cov(Xi, Xj)

    164

  • For two random variables X1 and X2 we have

    Var(X1 ± X2) = Var(X1) + Var(X2) ± 2 · Cov(X1, X2),

    and if X1 and X2 are independent we have

    Var(X1 ± X2) = Var(X1) + Var(X2)

    Finally:

    Important result concerning the expectation of a product of two random variables

    165

  • Setting:

    Let X1, X2 be both continuous or both discrete random variables with joint density fX1,X2

    Let g : R² → R be defined as g(x1, x2) = x1 · x2

    Find the expectation of

    Y = g(X1, X2) = X1 · X2

    Theorem 4.3: (Expectation of a product)

    For the random variables X1, X2 we have

    E(X1 · X2) = E(X1) · E(X2) + Cov(X1, X2).

    166
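Theorem 4.3 can be checked by simulation; in the sketch below (an illustration, not from the slides) X1 ~ N(0,1) and X2 = 0.5·X1 + ε with independent ε ~ N(0,1), so that E(X1)·E(X2) + Cov(X1, X2) = 0 + 0.5:

```python
import random

# Monte Carlo check of Theorem 4.3 for a constructed example with
# Cov(X1, X2) = 0.5 and E(X1) = E(X2) = 0
random.seed(42)
n = 200_000
mean_prod = 0.0
for _ in range(n):
    x1 = random.gauss(0.0, 1.0)
    x2 = 0.5 * x1 + random.gauss(0.0, 1.0)
    mean_prod += x1 * x2
mean_prod /= n
assert abs(mean_prod - 0.5) < 0.02   # E(X1 X2) = E(X1) E(X2) + Cov(X1, X2)
```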

  • Implication:

    If X1 and X2 are stochastically independent, we have

    E(X1 · X2) = E(X1) · E(X2)

    Remarks:

    A formula for Var(X1 · X2) also exists

    In many cases, there are no explicit formulas for expected values and variances of other transformations (e.g. for ratios of random variables)

    167

  • 4.2 The Cumulative-distribution-function Technique

    Motivation:

    Consider as given the random variables X1, . . . , Xn with joint density fX1,...,Xn

    Find the joint distribution of Y1, . . . , Yk where Yj = gj(X1, . . . , Xn) for j = 1, . . . , k

    The joint cdf of Y1, . . . , Yk is defined to be

    FY1,...,Yk(y1, . . . , yk) = P(Y1 ≤ y1, . . . , Yk ≤ yk)

    (cf. Definition 3.2, Slide 98)

    168

  • Now, for each y1, . . . , yk the event

    {Y1 ≤ y1, . . . , Yk ≤ yk} = {g1(X1, . . . , Xn) ≤ y1, . . . , gk(X1, . . . , Xn) ≤ yk} ,

    i.e. the latter event is an event described in terms of the given functions g1, . . . , gk and the given random variables X1, . . . , Xn

    Since the joint distribution of X1, . . . , Xn is assumed given, presumably the probability of the latter event can be calculated and consequently FY1,...,Yk determined

    169

  • Example 1:

    Consider n = 1 (i.e. consider X1 ≡ X with cdf FX) and k = 1 (i.e. g1 ≡ g and Y1 ≡ Y )

    Consider the function

    g(x) = a · x + b, b ∈ R, a > 0

    Find the distribution of

    Y = g(X) = a · X + b

    170

  • The cdf of Y is given by

    FY (y) = P(Y ≤ y) = P[g(X) ≤ y] = P(a · X + b ≤ y)

    = P( X ≤ (y − b)/a ) = FX( (y − b)/a )

    If X is continuous, the pdf of Y is given by

    fY (y) = F′Y (y) = F′X( (y − b)/a ) = (1/a) · fX( (y − b)/a )

    (cf. Slide 48)

    171

  • Example 2:

    Consider n = 1 and k = 1 and the function

    g(x) = e^x

    The cdf of Y = g(X) = e^X is given by

    FY (y) = P(Y ≤ y) = P(e^X ≤ y) = P[X ≤ ln(y)] = FX[ln(y)]

    If X is continuous, the pdf of Y is given by

    fY (y) = F′Y (y) = F′X[ln(y)] = fX[ln(y)] / y

    172

  • Now:

    Consider n = 2 and k = 2, i.e. consider X1 and X2 with joint density fX1,X2(x1, x2)

    Consider the functions

    g1(x1, x2) = x1 + x2 and g2(x1, x2) = x1 − x2

    Find the distributions of the sum and the difference of two random variables

    Derivation via the two-dimensional cdf-technique

    173

  • Theorem 4.4: (Distribution of a sum / difference)

    Let X1 and X2 be two continuous random variables with joint pdf fX1,X2(x1, x2). Then the pdfs of Y1 = X1 + X2 and Y2 = X1 − X2 are given by

    fY1(y1) = ∫_{−∞}^{+∞} fX1,X2(x1, y1 − x1) dx1 = ∫_{−∞}^{+∞} fX1,X2(y1 − x2, x2) dx2

    and

    fY2(y2) = ∫_{−∞}^{+∞} fX1,X2(x1, x1 − y2) dx1 = ∫_{−∞}^{+∞} fX1,X2(y2 + x2, x2) dx2.

    174

  • Implication: If X1 and X2 are independent, then

    fY1(y1) = ∫_{−∞}^{+∞} fX1(x1) · fX2(y1 − x1) dx1

    fY2(y2) = ∫_{−∞}^{+∞} fX1(x1) · fX2(x1 − y2) dx1

    Example: Let X1 and X2 be independent random variables both with pdf

    fX1(x) = fX2(x) = { 1 , for x ∈ [0,1]
                      { 0 , elsewise

    Find the pdf of Y = X1 + X2 (Class)

    175
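The convolution integral for this example yields the triangular density fY(y) = y on [0,1] and 2 − y on [1,2] (a standard result, derived in class on the slides); the sketch below (an illustration, not from the slides) checks one implied probability by Monte Carlo:

```python
import random

# P(Y <= 0.5) under the triangular density equals the integral of y dy
# from 0 to 0.5, i.e. 0.125; compare with a Monte Carlo estimate
random.seed(0)
n = 500_000
count = sum(1 for _ in range(n) if random.random() + random.random() <= 0.5)
assert abs(count / n - 0.125) < 0.005
```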

  • Now:

    Analogous results for the product and the ratio of two random variables

    Theorem 4.5: (Distribution of a product / ratio)

    Let X1 and X2 be continuous random variables with joint pdf fX1,X2(x1, x2). Then the pdfs of Y1 = X1 · X2 and Y2 = X1/X2 are given by

    fY1(y1) = ∫_{−∞}^{+∞} (1/|x1|) · fX1,X2(x1, y1/x1) dx1

    and

    fY2(y2) = ∫_{−∞}^{+∞} |x2| · fX1,X2(y2 · x2, x2) dx2.

    176

  • 4.3 The Moment-generating-function Technique

    Motivation:

    Consider as given the random variables X1, . . . , Xn with joint pdf fX1,...,Xn

    Again, find the joint distribution of Y1, . . . , Yk where Yj = gj(X1, . . . , Xn) for j = 1, . . . , k

    177

  • According to Definition 3.14, Slide 143, the joint moment generating function of Y1, . . . , Yk is defined to be

    mY1,...,Yk(t1, . . . , tk) = E[e^{t1·Y1 + . . . + tk·Yk}]

    = ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} e^{t1·g1(x1,...,xn) + . . . + tk·gk(x1,...,xn)} · fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn

    If mY1,...,Yk(t1, . . . , tk) can be recognized as the joint moment generating function of some known joint distribution, it will follow that Y1, . . . , Yk has that joint distribution by virtue of the identification property (cf. Slide 145)

    178

  • Example:

    Consider n = 1 and k = 1 where the random variable X1 ≡ X has a standard normal distribution

    Consider the function g1(x) ≡ g(x) = x²

    Find the distribution of Y = g(X) = X²

    The moment generating function of Y is given by

    mY (t) = E[e^{tY}] = E[e^{tX²}] = ∫_{−∞}^{+∞} e^{tx²} · fX(x) dx

    179

  • = ∫_{−∞}^{+∞} e^{tx²} · (1/√(2π)) · e^{−(1/2)·x²} dx

    = . . .

    = (1/2)^{1/2} / (1/2 − t)^{1/2} for t < 1/2

    This is the moment generating function of a gamma distribution with parameters λ = 1/2 and r = 1/2 (see Mood, Graybill, Boes (1974), pp. 540/541)

    → Y = X² ~ Γ(0.5, 0.5)

    180
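The closed-form mgf above simplifies to (1 − 2t)^{−1/2}; the sketch below (a Monte Carlo illustration, not from the slides) compares it with a sample average of e^{tX²} at a test point t = 0.2:

```python
import math
import random

# Monte Carlo check of m_Y(t) = (1 - 2t)^(-1/2) at t = 0.2
random.seed(1)
t, n = 0.2, 200_000
mc = sum(math.exp(t * random.gauss(0.0, 1.0) ** 2) for _ in range(n)) / n
closed = (1.0 - 2.0 * t) ** -0.5   # mgf of the Gamma(1/2, 1/2) distribution
assert abs(mc - closed) < 0.02
```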

  • Now:

    Distribution of sums of independent random variables

    Preliminaries:

    Consider the moment generating function of such a sum

    Let X1, . . . , Xn be independent random variables and let Y = ∑_{i=1}^{n} Xi

    The moment generating function of Y is given by

    mY (t) = E[e^{tY}] = E[e^{t·∑_{i=1}^{n} Xi}] = E[e^{tX1} · e^{tX2} · . . . · e^{tXn}]

    = E[e^{tX1}] · E[e^{tX2}] · . . . · E[e^{tXn}]   [Theorem 3.13(c)]

    = mX1(t) · mX2(t) · . . . · mXn(t)

    181

  • Theorem 4.6: (Moment generating function of a sum)

    Let X1, . . . , Xn be stochastically independent random variables with existing moment generating functions mX1(t), . . . , mXn(t) for all t ∈ (−h, h), h > 0. Then the moment generating function of the sum Y = ∑_{i=1}^{n} Xi is given by

    mY (t) = ∏_{i=1}^{n} mXi(t) for t ∈ (−h, h).

    Hopefully:

    The distribution of the sum Y = ∑_{i=1}^{n} Xi may be identified from the moment generating function of the sum mY (t)

    182

  • Example 1:

    Assume that X1, . . . , Xn are independent and identically distributed exponential random variables with parameter λ > 0

    The moment generating function of each Xi (i = 1, . . . , n) is given by

    mXi(t) = λ / (λ − t) for t < λ

    (cf. Mood, Graybill, Boes (1974), pp. 540/541)

    So the moment generating function of the sum Y = ∑_{i=1}^{n} Xi is given by

    mY (t) = ∏_{i=1}^{n} mXi(t) = ( λ / (λ − t) )^n

    183

  • This is the moment generating function of a Γ(n, λ) distribution (cf. Mood, Graybill, Boes (1974), pp. 540/541)

    → the sum of n independent, identically distributed exponential random variables with parameter λ has a Γ(n, λ) distribution

    184
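The Γ(n, λ) result can be illustrated by simulation; the sketch below (not from the slides; λ = 2 and n = 5 are example values chosen here) checks that the simulated sums match the Gamma mean n/λ and variance n/λ²:

```python
import random

# Monte Carlo check: a sum of n_sum iid Exp(lam) variables should have
# the Gamma(n_sum, lam) mean n_sum/lam and variance n_sum/lam^2
random.seed(2)
lam, n_sum, reps = 2.0, 5, 100_000
sums = [sum(random.expovariate(lam) for _ in range(n_sum)) for _ in range(reps)]
mean = sum(sums) / reps
var = sum((s - mean) ** 2 for s in sums) / reps
assert abs(mean - n_sum / lam) < 0.02       # theoretical mean 2.5
assert abs(var - n_sum / lam ** 2) < 0.05   # theoretical variance 1.25
```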

  • Example 2:

    Assume that X1, . . . , Xn are independent random variables and that Xi ~ N(μi, σi²)

    Furthermore, let a1, . . . , an ∈ R be constants

    Then the distribution of the weighted sum is given by

    Y = ∑_{i=1}^{n} ai · Xi ~ N( ∑_{i=1}^{n} ai · μi, ∑_{i=1}^{n} ai² · σi² )

    (Proof: Class)

    185

  • 4.4 General Transformations

    Up to now:

    Techniques that allow us, under special circumstances, to find the distributions of the transformed variables

    Y1 = g1(X1, . . . , Xn), . . . , Yk = gk(X1, . . . , Xn)

    However:

    These methods do not necessarily hit the mark (e.g. if calculations get too complicated)

    186

  • Resort:

    There are constructive methods by which it is generally possible (under rather mild conditions) to find the distributions of transformed random variables → transformation theorems

    Here:

    We restrict attention to the simplest case where n = 1, k = 1, i.e. we consider the transformation Y = g(X)

    For multivariate extensions (i.e. for n ≥ 1, k ≥ 1) see Mood, Graybill, Boes (1974), pp. 203-212

    187

  • Theorem 4.7: (Transformation theorem for densities)

    Suppose X is a continuous random variable with pdf fX(x). Set D = {x : fX(x) > 0}. Furthermore, assume that

    (a) the transformation g : D → W with y = g(x) is a one-to-one transformation of D onto W ,

    (b) the derivative with respect to y of the inverse function g⁻¹ : W → D with x = g⁻¹(y) is continuous and nonzero for all y ∈ W .

    Then Y = g(X) is a continuous random variable with pdf

    fY (y) = { |d g⁻¹(y)/dy| · fX(g⁻¹(y)) , for y ∈ W
             { 0                          , elsewise.

    188

  • Remark:

    The transformation g : D → W with y = g(x) is called one-to-one, if for every y ∈ W there exists exactly one x ∈ D with y = g(x)

    Example:

    Suppose X has the pdf

    fX(x) = { λ · x^{−λ−1} , for x ∈ [1,+∞)
            { 0            , elsewise

    (Pareto distribution with parameter λ > 0)

    Find the distribution of Y = ln(X)

    We have D = [1,+∞), g(x) = ln(x), W = [0,+∞)

    189

  • Furthermore, g(x) = ln(x) is a one-to-one transformation of D = [1,+∞) onto W = [0,+∞) with inverse function

    x = g⁻¹(y) = e^y

    Its derivative with respect to y is given by

    d g⁻¹(y)/dy = e^y,

    i.e. the derivative is continuous and nonzero for all y ∈ [0,+∞)

    Hence, the pdf of Y = ln(X) is given by

    fY (y) = { e^y · λ · (e^y)^{−λ−1} , for y ∈ [0,+∞)
             { 0                      , elsewise

    = { λ · e^{−λy} , for y ∈ [0,+∞)
      { 0           , elsewise

    190
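The Pareto-to-exponential calculation above can be checked by simulation. The following Python sketch (the value α = 2 and all variable names are illustrative choices, not from the slides) draws X by inversion of the Pareto cdf and verifies that Y = ln(X) has the exponential moments 1/α and 1/α²:

```python
import math
import random

random.seed(1)
alpha = 2.0      # illustrative Pareto parameter (alpha > 0)
n = 200_000

# Inversion: F_X(x) = 1 - x^(-alpha) on [1, oo), so X = (1 - U)^(-1/alpha)
xs = [(1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n)]

# Transformed variable Y = ln(X); by Theorem 4.7 its pdf is alpha * e^(-alpha*y),
# i.e. an exponential distribution with mean 1/alpha and variance 1/alpha^2
ys = [math.log(x) for x in xs]
mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

print(mean_y, var_y)  # both should be close to 0.5 and 0.25
```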

  • 5. Methods of Estimation

    Setting:

    Let X be a random variable (or let X be a random vector) representing a random experiment

    We are interested in the actual distribution of X (or X)

    Notice:

    In practice the actual distribution of X is a priori unknown

    191

  • Therefore:

    Collect information on the unknown distribution by repeatedly observing the random experiment (and thus the random variable X)

    random sample → statistic → estimator

    192

  • 5.1 Sampling, Estimators, Limit Theorems

    Setting:

    Let X represent the random experiment under consideration (X is a univariate random variable)

    We intend to observe the random experiment (i.e. X) n times

    Prior to the explicit realizations we may consider the potential observations as a set of n random variables X1, . . . , Xn

    193

  • Definition 5.1: (Random sample)

    The random variables X1, . . . , Xn are defined to be a random sample from X if

    (a) each Xi, i = 1, . . . , n, has the same distribution as X,

    (b) X1, . . . , Xn are stochastically independent.

    The number n is called the sample size.

    194

  • Remarks:

    We assume that, in principle, the random experiment can be repeated as often as desired

    We call the realizations x1, . . . , xn of the random sample X1, . . . , Xn the observed or the concrete sample

    Considering the random sample X1, . . . , Xn as a random vector, we see that its joint density is given by

    fX1,...,Xn(x1, . . . , xn) = ∏_{i=1}^n fXi(xi)

    (since the Xi's are independent; cf. Definition 3.8, Slide 125)

    195

  • Model of a random sample

    196

    [Figure: the random experiment X with potential observations X1, . . . , Xn (random variables) and their realizations x1 (realization of the 1st experiment), x2 (realization of the 2nd experiment), . . . , xn (realization of the n-th experiment)]

  • Now:

    Consider functions of the sampling variables X1, . . . , Xn → statistic, estimator

    Definition 5.2: (Statistic)

    Let X1, . . . , Xn be a random sample from X and let g : R^n → R be a real-valued function with n arguments that does not contain any unknown parameters. Then the random variable

    T = g(X1, . . . , Xn)

    is called a statistic.

    197

  • Examples:

    Sample mean:

    X̄ = g1(X1, . . . , Xn) = (1/n) ∑_{i=1}^n Xi

    Sample variance:

    S² = g2(X1, . . . , Xn) = (1/n) ∑_{i=1}^n (Xi − X̄)²

    Sample standard deviation:

    S = g3(X1, . . . , Xn) = √[(1/n) ∑_{i=1}^n (Xi − X̄)²]

    198
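The three statistics above can be computed directly from a concrete sample. A minimal Python sketch (the N(5, 2²) sample below is purely illustrative):

```python
import math
import random

random.seed(42)
xs = [random.gauss(5.0, 2.0) for _ in range(1000)]  # illustrative concrete sample
n = len(xs)

x_bar = sum(xs) / n                          # sample mean
s2 = sum((x - x_bar) ** 2 for x in xs) / n   # sample variance (1/n version as on the slide)
s = math.sqrt(s2)                            # sample standard deviation

print(x_bar, s2, s)
```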

  • Remarks:

    All these concepts can be extended to the multivariate case

    The statistic T = g(X1, . . . , Xn) is a function of random variables and hence is itself a random variable → a statistic has a distribution

    (and, in particular, an expectation and a variance)

    Purposes of statistics:

    Statistics provide information on the distribution of X

    Statistics are central tools for

    estimating parameters

    hypothesis-testing on parameters

    199

  • Random samples and statistics

    200

    [Figure: the sample (X1, . . . , Xn) yields, via measurement, the sample realization (x1, . . . , xn); g(X1, . . . , Xn) is the statistic, g(x1, . . . , xn) the realization of the statistic]

  • Now:

    Let X be a random variable with unknown cdf FX(x)

    We may be interested in one or several unknown parameters of X

    Let θ denote this unknown vector of parameters, e.g.

    θ = [E(X), Var(X)]′

    Frequently, the distribution family of X is known, e.g. X ~ N(μ, σ²), but we do not know the specific parameters. Then

    θ = [μ, σ²]′

    We will estimate the unknown parameter vector θ on the basis of statistics from a random sample X1, . . . , Xn

    201

  • Definition 5.3: (Estimator, estimate)

    The statistic θ̂(X1, . . . , Xn) is called estimator (or point estimator) of the unknown parameter vector θ. After having observed the concrete sample x1, . . . , xn, we call the realization of the estimator θ̂(x1, . . . , xn) an estimate.

    Remarks:

    The estimator θ̂(X1, . . . , Xn) is a random variable or a random vector → an estimator has a (joint) distribution, an expected value (or vector) and a variance (or a covariance matrix)

    The estimate θ̂(x1, . . . , xn) is a number (or a vector of numbers)

    202

  • Example:

    Let X ~ N(μ, σ²) with unknown parameters μ and σ²

    The vector of parameters to be estimated is given by

    θ = [μ, σ²]′ = [E(X), Var(X)]′

    Potential estimators of μ and σ² are

    μ̂ = (1/n) ∑_{i=1}^n Xi and σ̂² = (1/(n−1)) ∑_{i=1}^n (Xi − μ̂)²

    → an estimator of θ is given by

    θ̂ = [μ̂, σ̂²]′ = [(1/n) ∑_{i=1}^n Xi, (1/(n−1)) ∑_{i=1}^n (Xi − μ̂)²]′

    203

  • Question:

    Why do we need this seemingly complicated concept of an estimator in the form of a random variable?

    Answer:

    To establish a comparison between alternative estimators of the parameter vector θ

    Example:

    Let θ = Var(X) denote the unknown variance of X

    204

  • Two alternative estimators of θ are

    θ̂1(X1, . . . , Xn) = (1/n) ∑_{i=1}^n (Xi − X̄)²

    θ̂2(X1, . . . , Xn) = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)²

    Question:

    Which estimator is better, and for what reasons? → properties (goodness criteria) of point estimators

    (see Section 5.2)

    205
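The difference between the two estimators can already be seen by repeated sampling. A Python simulation sketch (sample size n = 5 and the N(0, 2²) setup are illustrative assumptions) shows that the 1/n version underestimates θ = Var(X) on average while the 1/(n−1) version does not:

```python
import random

random.seed(0)
true_var = 4.0          # illustrative setup: X ~ N(0, 2^2), so theta = Var(X) = 4
n, reps = 5, 100_000

sum1 = sum2 = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    x_bar = sum(xs) / n
    ss = sum((x - x_bar) ** 2 for x in xs)
    sum1 += ss / n         # theta_hat_1 divides by n
    sum2 += ss / (n - 1)   # theta_hat_2 divides by n - 1

mean1, mean2 = sum1 / reps, sum2 / reps
print(mean1, mean2)  # theoretically (n-1)/n * 4 = 3.2 and exactly 4
```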

  • Notice:

    Some of these criteria qualify estimators in terms of their properties when the sample size becomes large (n → ∞, large-sample properties)

    Therefore:

    Explanation of the concept of stochastic convergence:

    Central-limit theorem

    Weak law of large numbers

    Convergence in probability

    Convergence in distribution

    206

  • Theorem 5.4: (Univariate central-limit theorem)

    Let X be any arbitrary random variable with E(X) = μ and Var(X) = σ². Let X1, . . . , Xn be a random sample from X and let

    X̄n = (1/n) ∑_{i=1}^n Xi

    denote the arithmetic sample mean. Then, for n → ∞, we have

    X̄n appr. ~ N(μ, σ²/n)

    and

    √n (X̄n − μ)/σ →d N(0, 1).
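Theorem 5.4 can be illustrated numerically even for a markedly non-normal X. The Python sketch below (the choice X ~ Exp(1), so that μ = σ = 1, and the repetition count are illustrative assumptions) standardizes many sample means and checks that they have approximately the N(0,1) moments:

```python
import math
import random

random.seed(7)
# Illustrative non-normal X: Exp(1), so mu = 1 and sigma = 1
n, reps = 100, 20_000

zs = []
for _ in range(reps):
    x_bar = sum(random.expovariate(1.0) for _ in range(n)) / n
    zs.append(math.sqrt(n) * (x_bar - 1.0) / 1.0)  # sqrt(n)(X_bar - mu)/sigma

m = sum(zs) / reps
v = sum((z - m) ** 2 for z in zs) / reps
print(m, v)  # close to the N(0,1) moments 0 and 1
```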

    Next:

    Generalization to the multivariate case

    207

  • Theorem 5.5: (Multivariate central-limit theorem)

    Let X = (X1, . . . , Xm)′ be any arbitrary random vector with E(X) = μ and Cov(X) = Σ. Let X1, . . . , Xn be a (multivariate) random sample from X and let

    X̄n = (1/n) ∑_{i=1}^n Xi

    denote the multivariate arithmetic sample mean. Then, for n → ∞, we have

    X̄n appr. ~ N(μ, (1/n)Σ)

    and

    √n (X̄n − μ) →d N(0, Σ).

    208

  • Remarks:

    A multivariate random sample from the random vector X arises naturally by replacing all univariate random variables in Definition 5.1 (Slide 194) by corresponding multivariate random vectors

    Note the formal analogy to the univariate case in Theorem 5.4 (be aware of matrix-calculus rules!)

    Next:

    Famous theorem on the arithmetic sample mean

    209

  • Theorem 5.6: (Weak law of large numbers)

    Let X1, X2, . . . be a sequence of independent and identically distributed random variables with

    E(Xi) = μ and Var(Xi) = σ² < ∞ for all i. Then, for every ε > 0,

    lim_{n→∞} P(|X̄n − μ| ≥ ε) = 0.

    210

  • Remarks:

    Theorem 5.6 is known as the weak law of large numbers

    Irrespective of how small we choose ε > 0, the probability that X̄n deviates by more than ε from its expectation μ tends to zero when the sample size increases

    Notice the analogy between a sequence of independent and identically distributed random variables and the definition of a random sample from X on Slide 194

    Next:

    The first important concept of limiting behaviour

    211

  • Definition 5.7: (Convergence in probability)

    Let Y1, Y2, . . . be a sequence of random variables. We say that the sequence Y1, Y2, . . . converges in probability to the constant c if for any ε > 0 we have

    lim_{n→∞} P(|Yn − c| ≥ ε) = 0.

    We denote convergence in probability by

    plim Yn = c or Yn →p c.

    Remarks:

    Specific case: weak law of large numbers

    plim X̄n = μ or X̄n →p μ

    212

  • Typically (but not necessarily) a sequence of random variables converges in probability to a constant c ∈ R

    For multivariate sequences of random vectors Y1, Y2, . . . Definition 5.7 has to be applied to the respective corresponding elements

    The concept of convergence in probability is important for qualifying estimators

    Next:

    Alternative concepts of stochastic convergence

    213
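Convergence in probability of X̄n (the weak law of large numbers) can be made concrete by a Monte-Carlo sketch. In the Python code below, X ~ Uniform(0,1), ε = 0.1 and the repetition count are illustrative choices; the code estimates P(|X̄n − μ| ≥ ε) for two sample sizes:

```python
import random

random.seed(3)
mu, eps, reps = 0.5, 0.1, 5_000   # X ~ Uniform(0,1) has E(X) = 0.5; eps illustrative

def prob_deviation(n):
    """Monte-Carlo estimate of P(|X_bar_n - mu| >= eps)."""
    hits = 0
    for _ in range(reps):
        x_bar = sum(random.random() for _ in range(n)) / n
        hits += abs(x_bar - mu) >= eps
    return hits / reps

p10, p100 = prob_deviation(10), prob_deviation(100)
print(p10, p100)  # the deviation probability shrinks as n grows
```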

  • Definition 5.8: (Convergence in distribution)

    Let Y1, Y2, . . . be a sequence of random variables and let Z also be a random variable. We say that the sequence Y1, Y2, . . . converges in distribution to the distribution of Z if

    lim_{n→∞} FYn(y) = FZ(y) for any y ∈ R.

    We denote convergence in distribution by

    Yn →d Z.

  • Remarks:

    Specific case: central-limit theorem

    Yn = √n (X̄n − μ)/σ →d U ~ N(0, 1)

    In the case of convergence in distribution, the sequence of rvs always converges to a limiting random variable

    214

  • Theorem 5.9: (Rules for probability limits)

    Let X1, X2, . . . and Y1, Y2, . . . be sequences of random variables with plim Xn = a and plim Yn = b. Then

    (a) plim (Xn ± Yn) = a ± b,

    (b) plim (Xn · Yn) = a · b,

    (c) plim (Xn/Yn) = a/b (for b ≠ 0),

    (d) (Slutsky theorem) If g : R → R is a continuous function in a ∈ R, then

    plim g(Xn) = g(a).

    215

  • Remark:

    There is a property similar to Slutsky's theorem that holds for convergence in distribution

    Theorem 5.10: (Rule for limiting distributions)

    Let X1, X2, . . . be a sequence of random variables and let Z be a random variable such that Xn →d Z. If h : R → R is a continuous function, then

    h(Xn) →d h(Z).

    Next:

    Connection of both convergence concepts

    216

  • Theorem 5.11: (Cramer-Theorem)

    Let X1, X2, . . . and Y1, Y2, . . . be sequences of random variables, let Z be a random variable and a ∈ R a constant. Assume that plim Xn = a and Yn →d Z. Then

    (a) Xn + Yn →d a + Z,

    (b) Xn · Yn →d a · Z.

    Example:

    Let X1, . . . , Xn be a random sample from X with E(X) = μ and Var(X) = σ²

    217

  • It can be shown that

    plim S²n = plim (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)² = σ²

    plim S̃²n = plim (1/n) ∑_{i=1}^n (Xi − X̄n)² = σ²

    For g1(x) = x/σ² Slutsky's theorem yields

    plim g1(S²n) = plim S²n/σ² = g1(σ²) = 1

    plim g1(S̃²n) = plim S̃²n/σ² = g1(σ²) = 1

    218

  • For g2(x) = σ/√x Slutsky's theorem yields

    plim g2(S²n) = plim σ/Sn = g2(σ²) = 1

    plim g2(S̃²n) = plim σ/S̃n = g2(σ²) = 1

    From the central-limit theorem we know that

    √n (X̄n − μ)/σ →d U ~ N(0, 1)

    219

  • Now, Cramér's theorem yields

    g2(S²n) · √n (X̄n − μ)/σ = (σ/Sn) · √n (X̄n − μ)/σ = √n (X̄n − μ)/Sn →d 1 · U = U ~ N(0, 1)

    Analogously, Cramér's theorem yields

    √n (X̄n − μ)/S̃n →d U ~ N(0, 1)

    220
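The practical value of this Cramér-theorem result is that the unknown σ may be replaced by Sn. A Python simulation sketch (the values μ = 2, σ = 3, n = 50 and the repetition count are illustrative assumptions) confirms that the studentized mean √n (X̄n − μ)/Sn behaves approximately like N(0, 1):

```python
import math
import random

random.seed(11)
mu, sigma = 2.0, 3.0   # illustrative choices
n, reps = 50, 20_000

ts = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(xs) / n
    s_n = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    ts.append(math.sqrt(n) * (x_bar - mu) / s_n)  # sigma replaced by S_n

m = sum(ts) / reps
v = sum((t - m) ** 2 for t in ts) / reps
print(m, v)  # roughly the N(0,1) moments 0 and 1
```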

  • 5.2 Properties of Estimators

    Content of Definition 5.3 (Slide 202):

    An estimator is defined to be a statistic (a function of the random sample)

    → there are several alternative estimators of the unknown parameter vector θ

    Example:

    Assume that X ~ N(0, σ²) with unknown variance σ² and let X1, . . . , Xn be a random sample from X

    Alternative estimators of θ = σ² are

    θ̂1 = (1/n) ∑_{i=1}^n (Xi − X̄)²

    and

    θ̂2 = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)²

    221

  • Important questions:

    Are there reasonable criteria according to which we can select a good estimator?

    How can we construct good estimators?

    First goodness property of point estimators:

    Concept of repeated sampling:

    Draw several random samples from X

    Consider the estimator for each random sample

    An average of the estimates should be close to the unknown parameter (no systematic bias)

    → unbiasedness of an estimator

    222

  • Definition 5.12: (Unbiasedness, bias)

    An estimator θ̂(X1, . . . , Xn) of the unknown parameter θ is defined to be an unbiased estimator if its expectation coincides with the parameter to be estimated, i.e. if

    E[θ̂(X1, . . . , Xn)] = θ.

    The bias of the estimator is defined as

    Bias(θ̂) = E(θ̂) − θ.

    Remarks:

    Definition 5.12 easily generalizes to the multivariate case

    The bias of an unbiased estimator is equal to zero

    223

  • Now: Important and very general result

    Theorem 5.13: (Unbiased estimators of E(X) and Var(X))

    Let X1, . . . , Xn be a random sample from X where X may be arbitrarily distributed with unknown expectation μ = E(X) and unknown variance σ² = Var(X). Then the estimators

    μ̂(X1, . . . , Xn) = X̄ = (1/n) ∑_{i=1}^n Xi

    and

    σ̂²(X1, . . . , Xn) = S² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)²

    are always unbiased estimators of the parameters μ = E(X) and σ² = Var(X), respectively.

    224

  • Remarks:

    Proof: Class

    Note that no explicit distribution of X is required

    Unbiasedness does, in general, not carry over to parameter transformations. For example,

    S = √S² is not an unbiased estimator of σ = SD(X) = √Var(X)

    Question:

    How can we compare two alternative unbiased estimators of the parameter θ?

    225

  • Definition 5.14: (Relative efficiency)

    Let θ̂1 and θ̂2 be two unbiased estimators of the unknown parameter θ. θ̂1 is defined to be relatively more efficient than θ̂2 if

    Var(θ̂1) ≤ Var(θ̂2)

    for all possible parameter values of θ and

    Var(θ̂1) < Var(θ̂2)

    for at least one possible parameter value of θ.

    226

  • Example:

    Assume θ = E(X)

    Consider the estimators

    θ̂1(X1, . . . , Xn) = (1/n) ∑_{i=1}^n Xi

    θ̂2(X1, . . . , Xn) = X1/2 + (1/(2(n−1))) ∑_{i=2}^n Xi

    Which estimator is relatively more efficient? (Class)

    Question:

    How can we compare two estimators if (at least) one estimator is biased?

    227
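For the two estimators above one can compute Var(θ̂1) = σ²/n and Var(θ̂2) = σ²/4 + σ²/(4(n−1)) directly from the independence of the Xi. A Python simulation sketch (the N(0,1) setup, n = 10 and the repetition count are illustrative) reproduces these values numerically:

```python
import random

random.seed(5)
n, reps = 10, 50_000   # illustrative: X ~ N(0,1), so sigma^2 = 1

e1, e2 = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    e1.append(sum(xs) / n)                              # theta_hat_1
    e2.append(xs[0] / 2 + sum(xs[1:]) / (2 * (n - 1)))  # theta_hat_2

def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

v1, v2 = variance(e1), variance(e2)
print(v1, v2)  # theory: 1/n = 0.1 and 1/4 + 1/(4(n-1)) which is about 0.278
```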

  • Definition 5.15: (Mean-squared error)

    Let θ̂ be an estimator of the parameter θ. The mean-squared error of the estimator is defined to be

    MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + [Bias(θ̂)]².

    Remarks:

    If an estimator θ̂ is unbiased, then its MSE is equal to the variance of the estimator

    The MSE of an estimator θ̂ depends on the value of the unknown parameter θ

    228
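The decomposition MSE = Var + Bias² can be verified numerically. The Python sketch below (using the biased 1/n variance estimator under an illustrative N(0,1) setup with n = 8) computes the MSE directly and via the decomposition:

```python
import random

random.seed(9)
sigma2 = 1.0           # illustrative: X ~ N(0,1), estimate theta = sigma^2 = 1
n, reps = 8, 100_000

ests = []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    x_bar = sum(xs) / n
    ests.append(sum((x - x_bar) ** 2 for x in xs) / n)  # biased 1/n estimator

mean_est = sum(ests) / reps
var_est = sum((e - mean_est) ** 2 for e in ests) / reps
bias = mean_est - sigma2                                # theoretically -sigma^2/n
mse_direct = sum((e - sigma2) ** 2 for e in ests) / reps

print(mse_direct, var_est + bias ** 2)  # the two MSE computations agree
```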

  • Next:

    Comparison of alternative estimators via their MSEs

    Definition 5.16: (MSE efficiency)

    Let θ̂1 and θ̂2 be two alternative estimators of the unknown parameter θ. θ̂1 is defined to be more MSE efficient than θ̂2 if

    MSE(θ̂1) ≤ MSE(θ̂2)

    for all possible parameter values of θ and

    MSE(θ̂1) < MSE(θ̂2)

    for at least one possible parameter value of θ.

  • Unbiased vs biased estimator

    230

    [Figure: sampling distributions of the two estimators θ̂1(X1, . . . , Xn) and θ̂2(X1, . . . , Xn)]

  • Remarks:

    Frequently two estimators of θ are not comparable with respect to MSE efficiency since their respective MSE curves cross

    There is no general mathematical principle for constructing MSE efficient estimators

    However, there are methods for finding the estimator with uniformly minimum variance among all unbiased estimators

    → restriction to the class of all unbiased estimators

    These specific methods are not discussed here (Rao-Blackwell theorem, Lehmann-Scheffé theorem)

    Here, we consider only one important result

    231

  • Theorem 5.17: (Cramer-Rao lower bound for variance)

    Let X1, . . . , Xn be a random sample from X and let θ be a parameter to be estimated. Consider the joint density of the random sample fX1,...,Xn(x1, . . . , xn) and define the value

    CR(θ) ≡ {E([∂ ln fX1,...,Xn(X1, . . . , Xn)/∂θ]²)}^(−1).

    Under certain (regularity) conditions we have for any unbiased estimator θ̂(X1, . . . , Xn)

    Var(θ̂) ≥ CR(θ).

    232

  • Remarks:

    The value CR(θ) is the minimal variance that any unbiased estimator can take on

    → goodness criterion for unbiased estimators

    If for an unbiased estimator θ̂(X1, . . . , Xn)

    Var(θ̂) = CR(θ),

    then θ̂ is called UMVUE (Uniformly Minimum-Variance Unbiased Estimator)

    233

  • Second goodness property of point estimators:

    Consider an increasing sample size (n → ∞)

    Notation: θ̂n(X1, . . . , Xn) = θ̂(X1, . . . , Xn)

    Analysis of the asymptotic distribution properties of θ̂n

    → consistency of an estimator

    Definition 5.18: ((Weak) consistency)

    The estimator θ̂n(X1, . . . , Xn) is called (weakly) consistent for θ if it converges in probability to θ, i.e. if

    plim θ̂n(X1, . . . , Xn) = θ.

    234

  • Example:

    Assume that X ~ N(μ, σ²) with known σ² (e.g. σ² = 1)

    Consider the following two estimators of μ:

    μ̂n(X1, . . . , Xn) = (1/n) ∑_{i=1}^n Xi

    μ̃n(X1, . . . , Xn) = (1/n) ∑_{i=1}^n Xi + 2/n

    μ̂n is (weakly) consistent for μ (Theorem 5.6, Slide 210: weak law of large numbers)

    235

  • μ̃n is (weakly) consistent for μ (this follows from Theorem 5.9(a), Slide 215)

    Exact distribution of μ̂n:

    μ̂n ~ N(μ, σ²/n)

    (linear transformation of the normal distribution)

    Exact distribution of μ̃n:

    μ̃n ~ N(μ + 2/n, σ²/n)

    (linear transformation of the normal distribution)

    236
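Consistency of the biased estimator μ̃n can be illustrated by simulation. The Python sketch below (σ² = 1, μ = 0 and the repetition count are illustrative choices) shows that both the bias 2/n and the variance σ²/n shrink as n grows:

```python
import random

random.seed(13)
mu, reps = 0.0, 20_000   # X ~ N(0, 1), sigma^2 = 1 known, as on the slide

def tilde_mu_stats(n):
    # Monte-Carlo mean and variance of mu_tilde_n = X_bar_n + 2/n
    ests = [sum(random.gauss(mu, 1.0) for _ in range(n)) / n + 2.0 / n
            for _ in range(reps)]
    m = sum(ests) / reps
    v = sum((e - m) ** 2 for e in ests) / reps
    return m, v

m10, v10 = tilde_mu_stats(10)
m100, v100 = tilde_mu_stats(100)
print(m10, m100)  # biases approach 2/n: about 0.2 and 0.02
print(v10, v100)  # variances approach sigma^2/n: about 0.1 and 0.01
```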

  • Pdfs of the estimator μ̂n for n = 2, 10, 20 (σ² = 1)

    237


  • Pdfs of the estimator μ̃n for n = 2, 10, 20 (σ² = 1)

    238


  • Remarks:

    Sufficient (but not necessary) condition for consistency:

    lim_{n→∞} E(θ̂n) = θ (asymptotic unbiasedness)

    lim_{n→∞} Var(θ̂n) = 0

    Possible properties of an estimator:

    consistent and unbiased

    inconsistent and unbiased

    consistent and biased

    inconsistent and biased

    239

  • Next:

    Application of the central-limit theorem to estimators

    → asymptotic normality of an estimator

    Definition 5.19: (Asymptotic normality)

    An estimator θ̂n(X1, . . . , Xn) of the parameter θ is called asymptotically normal if there exist (1) a sequence of real constants θ1, θ2, . . . and (2) a function V(θ) such that

    √n (θ̂n − θn) →d U ~ N(0, V(θ)).

    240

  • Remarks:

    Alternative notation:

    θ̂n appr. ~ N(θn, V(θ)/n)

    The concept of asymptotic normality naturally extends to multivariate settings

    241

  • 5.3 Methods of Estimation

    Up to now:

    Definitions + properties of estimators

    Next:

    Construction of estimators

    Three classical methods:

    Method of Least Squares (LS)

    Method of Moments (MM)

    Maximum-Likelihood method (ML)

    242

  • Remarks:

    There are further methods (e.g. the Generalized Method of Moments, GMM)

    Here: focus on ML estimation

    243

  • 5.3.1 Least-Squares Estimators

    History: Introduced by

    A.M. Legendre (1752-1833)

    C.F. Gauß (1777-1855)

    Idea: Approximate the (noisy) observations x1, . . . , xn by functions gi(θ1, . . . , θm), i = 1, . . . , n, m < n, such that

    S(x1, . . . , xn; θ) = ∑_{i=1}^n [xi − gi(θ)]² → min_θ

    The LS estimator is then defined to be

    θ̂(X1, . . . , Xn) = argmin_θ S(X1, . . . , Xn; θ)

    244
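In the simplest case gi(θ) = θ for all i, so the criterion S(θ) = ∑ (xi − θ)² is minimized by the sample mean (set dS/dθ = 0). A Python sketch (the data are illustrative) checks this minimizing property numerically:

```python
import random

random.seed(17)
# Simplest LS problem: g_i(theta) = theta for all i; minimizing
# S(theta) = sum (x_i - theta)^2 gives theta_hat = x_bar (set dS/dtheta = 0)
xs = [random.gauss(3.0, 1.0) for _ in range(1000)]  # illustrative data
x_bar = sum(xs) / len(xs)

def S(theta):
    return sum((x - theta) ** 2 for x in xs)

# Numerical check: the criterion is not smaller at nearby parameter values
print(S(x_bar) <= S(x_bar + 0.01) and S(x_bar) <= S(x_bar - 0.01))  # True
```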

  • Remark:

    The LS method is central to the linear regression model (cf. the courses Econometrics I + II)

    245

  • 5.3.2 Method-of-moments Estimators

    History:

    Introduced by K. Pearson (1857-1936)

    Definition 5.20: (Theoretical and sample moments)

    (a) Let X be a random variable with expectation E(X). The theoretical p-th moment of X, denoted by μ′_p, is defined as

    μ′_p = E(X^p).

    The theoretical p-th central moment of X, denoted by μ_p, is defined as

    μ_p = E{[X − E(X)]^p}.

    246

  • (b) Let X1, . . . , Xn be a random sample from X and let X̄ denote the arithmetic sample mean. Then the p-th sample moment, denoted by μ̂′_p, is defined as

    μ̂′_p = (1/n) ∑_{i=1}^n Xi^p.

    The p-th central sample moment, denoted by μ̂_p, is defined as

    μ̂_p = (1/n) ∑_{i=1}^n (Xi − X̄)^p.

    247

  • Remarks:

    The theoretical moments μ′_p and μ_p had already been introduced in Definition 2.21 (Slide 76)

    The sample moments μ̂′_p and μ̂_p are estimators of the theoretical moments μ′_p and μ_p

    The arithmetic sample mean X̄ is the 1st sample moment of X1, . . . , Xn

    The sample variance is the 2nd central sample moment of X1, . . . , Xn

    248

  • General setting:

    Based on the random sample X1, . . . , Xn from X, estimate the r unknown parameters θ1, . . . , θr

    Basic idea of the method of moments:

    1. Express the r theoretical moments as functions of the r unknown parameters:

    μ′_1 = g1(θ1, . . . , θr)
    ...
    μ′_r = gr(θ1, . . . , θr)

    249

  • 2. Express the r unknown parameters as functions of the theoretical moments:

    θ1 = h1(μ′_1, . . . , μ′_r, μ1, . . . , μr)
    ...
    θr = hr(μ′_1, . . . , μ′_r, μ1, . . . , μr)

    3. Replace the theoretical moments by the sample moments:

    θ̂1(X1, . . . , Xn) = h1(μ̂′_1, . . . , μ̂′_r, μ̂1, . . . , μ̂r)
    ...
    θ̂r(X1, . . . , Xn) = hr(μ̂′_1, . . . , μ̂′_r, μ̂1, . . . , μ̂r)

    250

  • Example: (Exponential distribution)

    Let the random variable X have an exponential distribution with parameter λ > 0 and pdf

    fX(x) = λ e^(−λx) , for x > 0

    and fX(x) = 0 elsewise

    The expectation and the variance of X are given by

    E(X) = 1/λ

    Var(X) = 1/λ²

    251

  • Method-of-moments estimator via the expectation:

    1. We know that

    E(X) = μ′_1 = 1/λ

    2. This implies

    λ = 1/μ′_1

    3. Method-of-moments estimator for λ:

    λ̂(X1, . . . , Xn) = 1/((1/n) ∑_{i=1}^n Xi) = 1/X̄

    252
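The three steps can be carried out numerically. The Python sketch below (the true value λ = 2.5 is an illustrative assumption) draws an exponential sample and applies the estimator λ̂ = 1/X̄:

```python
import random

random.seed(21)
lam = 2.5              # illustrative true parameter
n = 100_000

xs = [random.expovariate(lam) for _ in range(n)]
x_bar = sum(xs) / n
lam_hat = 1.0 / x_bar  # method-of-moments estimator via E(X) = 1/lambda

print(lam_hat)  # close to the true value 2.5
```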

  • Method-of-moments estimator via the variance:1. We know that