Probability Distributions and Introduction to...

Probability Distributions and Introduction to Statistical Inference
BIO5312 FALL 2017

STEPHANIE J. SPIELMAN, PHD

Random variable

Random processes produce numerical outcomes:
◦ Number of tails in 50 coin flips
◦ The sum of everyone's heights

Definition: a random variable is a function that maps outcomes of a random process to a numeric value
◦ X is a function (rule) that assigns a number X(s) to each outcome s ∈ S (where s is an outcome in sample space S)

◦ r.v.'s are technically neither random nor variables…
◦ But you can think of them roughly as numerical outcomes of random processes

Discrete vs continuous RV

Discrete random variables can take on (map to) a finite or countably infinite number of values

Continuous random variables can take on (map to) uncountably many values, e.g. any value in an interval

Expressing discrete random variables

Probability mass function (PMF)
◦ Describes the values taken by a discrete r.v. X and its associated probabilities
◦ Function that assigns, to any possible value x of a discrete r.v. X, the probability P(X = x)

[Figure: PMF for rolling a fair die. x-axis: Event (1 to 6); y-axis: P(X = x)]

PMF properties
◦ $0 \le P(X = x) \le 1$
◦ $\sum_x P(X = x) = 1$

PMF is simply a fancier term for a discrete probability distribution

Expressing discrete random variables

Cumulative distribution function (CDF)
◦ Function defined, for a specific value x of a discrete r.v. X, as F(x) = P(X ≤ x)

[Figure: CDF for rolling a fair die, with P(X ≤ 1) and P(X ≤ 4) marked. x-axis: Event (1 to 6)]

CDF properties
◦ $0 \le F(x) \le 1$
◦ CDF functions are non-decreasing

PMF vs CDF
◦ PMF: What is the probability that X takes the value x?
◦ CDF: What is the sum of probabilities for all values ≤ x?
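The PMF and CDF properties above can be checked numerically for the fair die. This is a minimal sketch in base R; the variable names (`faces`, `pmf`, `cdf`) are illustrative choices, not from the slides.

```r
# Fair six-sided die: each face has probability 1/6
faces <- 1:6
pmf <- rep(1 / 6, length(faces))

# PMF properties: every probability lies in [0, 1] and they sum to 1
all(pmf >= 0 & pmf <= 1)
sum(pmf)

# The CDF is the running sum of the PMF, so cdf[x] = P(X <= x)
cdf <- cumsum(pmf)
cdf[faces == 4]   # P(X <= 4) = 4/6

# A CDF never decreases
all(diff(cdf) >= 0)
```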

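Expectation and spread of random variables

The expectation of a r.v. is the probability-weighted average of all possible values (i.e., mean)
◦ $\mathbb{E}[X] = \mu = \sum_i x_i \, p(x_i)$

The variance of a r.v. is defined as
◦ $\mathrm{Var}(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \sum_i \left[ x_i^2 \, p(x_i) \right] - \mu^2$
◦ $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$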
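As a small worked case of these formulas, here is a sketch that computes the expectation and variance of a fair die directly from its PMF. The die example and the variable names are assumptions made for illustration.

```r
# Values and probabilities for a fair die
x <- 1:6
p <- rep(1 / 6, 6)

# Expectation: probability-weighted average of the values
mu <- sum(x * p)                    # 3.5

# Variance, two equivalent ways
var_def   <- sum((x - mu)^2 * p)    # E[(X - mu)^2]
var_short <- sum(x^2 * p) - mu^2    # E[X^2] - E[X]^2

c(mu = mu, var_def = var_def, var_short = var_short)
```

Both variance expressions give 35/12, which is about 2.917.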

Example: The Binomial distribution

The binomial distribution describes the probability of obtaining k successes in n Bernoulli trials, where the probability of success for each trial is constant at p

A Bernoulli trial has a binary outcome (success/fail, true/false, yes/no), and P(success) = p is the same for all realizations of the trial

The BInS conditions

To be binomially distributed, a random variable must satisfy the following:
◦ Binary outcomes
◦ Independent trials (outcomes do not influence each other)
◦ n is fixed before the trials begin
◦ Same probability of success, p, for all trials

Is it binomial?

A bag contains 10 balls, 7 red and 3 green.

Situation 1: You draw 5 balls from the bag, noting the ball color each time and then returning it to the bag.

Situation 2: You draw 5 balls from the bag, retaining each drawn ball for safe-keeping so you can play catch at any moment.

Situation 3: You keep drawing balls, with replacement, until you have drawn 4 red balls. No (n is not fixed before the trials begin).

The binomial distribution

The PMF (probability distribution) for a binomially-distributed random variable:

$P(X = k) = \binom{n}{k} p^k (1 - p)^{(n-k)} = \binom{n}{k} p^k q^{(n-k)}$

The binomial coefficient: $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
◦ read as "n choose k"
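To see that the formula is exactly what R computes, here is a minimal sketch that writes the binomial PMF by hand with `choose()` and compares it to the built-in `dbinom()`. The helper name `binom_pmf` and the values n = 4, p = 0.5 are hypothetical, chosen only for illustration.

```r
# Hand-written binomial PMF: (n choose k) * p^k * (1 - p)^(n - k)
binom_pmf <- function(k, n, p) {
  choose(n, k) * p^k * (1 - p)^(n - k)
}

n <- 4     # illustrative number of trials
p <- 0.5   # illustrative success probability

by_hand  <- binom_pmf(0:n, n, p)
built_in <- dbinom(0:n, size = n, prob = p)

# choose(4, 2) = 6, so P(X = 2) = 6 * 0.25 * 0.25 = 0.375
by_hand[3]
all.equal(by_hand, built_in)
```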

Wikipedia weighs in

The binomial distribution

The expectation for a binomial r.v.
◦ $\mathbb{E}[X] = \mu = np$

The variance for a binomial r.v.
◦ $\mathrm{Var}(X) = \sigma^2 = npq = np(1 - p)$

We write binomially distributed r.v.'s as $X \sim B(n, p)$

Example: Playing with a binomial rv

Each child born to a particular set of parents has a 25% probability of having blood type O. Assume the parents had five children.

Here, n = 5 and p = 0.25, meaning we define Type O as "success" and not Type O as "failure". → X ~ B(5, 0.25)

Tasks:
◦ Compute expectation and variance
◦ Visualize PMF
◦ Visualize CDF
◦ Make some calculations…

Expectation and variance

For X ~ B(5, 0.25):

$\mathbb{E}[X] = \mu = np = 5 \times 0.25 = 1.25$

$\mathrm{Var}(X) = \sigma^2 = npq = np(1 - p) = 5 \times 0.25 \times 0.75 = 0.9375$
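The shortcuts np and np(1 − p) agree with the general definitions of expectation and variance applied to the binomial PMF. A minimal sketch for B(5, 0.25), with illustrative variable names:

```r
n <- 5
p <- 0.25
k <- 0:n

# PMF values P(X = k) for k = 0, ..., 5
pmf <- dbinom(k, size = n, prob = p)

# General definitions applied to the PMF
mu     <- sum(k * pmf)            # probability-weighted mean
sigma2 <- sum(k^2 * pmf) - mu^2   # E[X^2] - E[X]^2

c(mu = mu, np = n * p)                  # both 1.25
c(sigma2 = sigma2, npq = n * p * (1 - p))   # both 0.9375
```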

Visualize the PMF

[Figure: bar plot of the PMF of B(5, 0.25). x-axis: Number of kids (0 to 5); y-axis: Probability Type O. Bar labels: 0.2373046875, 0.3955078125, 0.263671875, 0.087890625, 0.0146484375, 0.0009765625]

> ?distributions

Distributions in the stats package

Description:
Density, cumulative distribution function, quantile function and random variate generation for many standard probability distributions are available in the ‘stats’ package.

Details:
The functions for the density/mass function, cumulative distribution function, quantile function and random variate generation are named in the form ‘dxxx’, ‘pxxx’, ‘qxxx’ and ‘rxxx’ respectively.

For the beta distribution see ‘dbeta’.

For the binomial (including Bernoulli) distribution see ‘dbinom’.

For the Cauchy distribution see ‘dcauchy’.

For the chi-squared distribution see ‘dchisq’.

Distribution functions, generally

Function   Purpose                                           Binomial version
dxxx()     Probability distribution                          dbinom(x, size, prob)
pxxx()     CDF                                               pbinom(q, size, prob)
rxxx()     Generate random numbers from given distribution   rbinom(n, size, prob)
qxxx()     Quantile: inverse of pxxx()                       qbinom(p, size, prob)

Binomial distribution functions

Binomial function       Example                  Output
dbinom(x, size, prob)   dbinom(2, 5, 0.25)       Prob. of obtaining 2 successes in 5 trials, where p = 0.25 → 0.264
pbinom(q, size, prob)   pbinom(2, 5, 0.25)       Prob. of obtaining ≤ 2 successes in 5 trials, where p = 0.25 → 0.896
rbinom(n, size, prob)   rbinom(100, 5, 0.25)     Generate 100 values from this binomial dist. → 100 values from {0, 1, 2, 3, 4, 5}
qbinom(p, size, prob)   qbinom(0.896, 5, 0.25)   Smallest value x where F(x) >= p* → 2

* not the probability of success, just a probability
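The four functions fit together as follows. This is a small sketch for B(5, 0.25); the seed is an arbitrary choice so the random draws are reproducible, and the variable names are illustrative.

```r
n <- 5
p <- 0.25

# d: PMF, P(X = 2)
dbinom(2, n, p)

# p: CDF, P(X <= 2); it equals the running sum of the PMF
cdf_direct   <- pbinom(0:n, n, p)
cdf_from_pmf <- cumsum(dbinom(0:n, n, p))

# q: quantile, the smallest x with F(x) >= 0.896
q <- qbinom(0.896, n, p)

# r: random draws, always integers between 0 and n
set.seed(1)   # arbitrary seed, only for reproducibility
draws <- rbinom(100, n, p)
table(draws)
```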

Making the PMF

> ## Use dbinom() to get the PMF values
> p = 0.25
> n = 5
> k0 <- dbinom(0, 5, 0.25) ## Prob of 0 successes, aka no children are Type O
> k1 <- dbinom(1, 5, 0.25) ## Prob of 1 success, aka only 1 child is Type O

> ## Advanced:
> library(purrr)
> map_dbl(0:5, dbinom, 5, 0.25)
[1] 0.2373046875 0.3955078125 0.2636718750 0.0878906250 0.0146484375
[6] 0.0009765625

Making the PMF

> ## data frame (tibble) of probabilities for PMF, typed in by hand
> library(tibble)
> data.pmf <- tibble(k = 0:5, prob = c(0.236623, 0.396, 0.264, 0.0879, 0.0145, 0.000977))
> data.pmf
# A tibble: 6 x 2
      k     prob
  <int>    <dbl>
1     0 0.236623
2     1 0.396000
3     2 0.264000
4     3 0.087900
5     4 0.014500
6     5 0.000977

> ## Equivalent up to rounding, using the exact values from dbinom()
> data.pmf <- tibble(k = 0:5, prob = map_dbl(0:5, dbinom, 5, 0.25))

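Making the PMF (uses a different *stat*)

> library(ggplot2)
> ggplot(data.pmf, aes(x = k, y = prob)) +
    geom_bar(stat = "identity") +
    xlab("Number of kids") +
    ylab("Probability Type O")

[Figure: bar plot of the PMF. x-axis: Number of kids (default breaks 0, 2, 4); y-axis: Probability Type O]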

Tweaking the x-axis

> ggplot(data.pmf, aes(x = k, y = prob)) +
    geom_bar(stat = "identity") +
    ylab("Probability Type O") +
    scale_x_continuous(name = "Number of kids", breaks = 0:5)

Adding some text

> ggplot(data.pmf, aes(x = k, y = prob)) +
    geom_bar(stat = "identity") +
    ylab("Probability Type O") +
    scale_x_continuous(name = "Number of kids", breaks = 0:5) +
    geom_text(aes(x = k, y = prob + 0.01, label = prob))

[Figure: the same PMF bar plot with each bar labeled by its prob value (0.236623, 0.396, 0.264, 0.0879, 0.0145, 0.000977)]

Visualize the CDF

> binom.sample <- tibble(x = rbinom(1000, 5, 0.25))
> ggplot(binom.sample, aes(x = x)) +
    stat_ecdf() +
    xlab("# Type O kids") +
    ylab("Cumulative probability")

[Figure: empirical CDF of 1000 draws from B(5, 0.25). x-axis: # Type O kids (0 to 5); y-axis: Cumulative probability]
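The plot above is an empirical CDF built from 1000 random draws, so it only approximates the true CDF. As a rough sketch, the exact step heights can be taken from `pbinom()` and compared against a simulated sample; the seed and variable names here are arbitrary assumptions.

```r
k <- 0:5

# Exact CDF values F(k) = P(X <= k) for B(5, 0.25)
exact_cdf <- pbinom(k, size = 5, prob = 0.25)

# Empirical CDF from simulated draws, evaluated at the same points
set.seed(42)   # arbitrary seed, only for reproducibility
draws <- rbinom(1000, 5, 0.25)
empirical_cdf <- ecdf(draws)(k)

# Side-by-side comparison; the two columns should be close
round(cbind(k, exact_cdf, empirical_cdf), 3)
```

With more draws the empirical column moves closer to the exact one.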

Solving for probabilities

Each child born to a particular set of parents has a 25% probability of having blood type O. Assume the parents had five children. X ~ B(5, 0.25)

What is the probability that exactly 2 children were Type O?

> dbinom(2, 5, 0.25)
[1] 0.2636719

By hand, using the binomial PMF:

$P(X = k) = \binom{n}{k} p^k (1 - p)^{(n-k)} = \binom{n}{k} p^k q^{(n-k)}$

$P(X = 2) = \binom{5}{2} \, 0.25^2 \, 0.75^{(5-2)} = 10 \times 0.0625 \times 0.421875 \approx 0.2637$

Solving for probabilities

What is the probability that 2 or fewer children were Type O? X ~ B(5, 0.25)

> pbinom(2, 5, 0.25)
[1] 0.8964844

Equivalently, summing the PMF:

> dbinom(0, 5, 0.25) + dbinom(1, 5, 0.25) + dbinom(2, 5, 0.25)
[1] 0.8964844

Solving for probabilities

What is the probability that more than 2 children (i.e., either 3, 4, or 5) were Type O? X ~ B(5, 0.25)

> 1 - pbinom(2, 5, 0.25)
[1] 0.1035156

Equivalently, summing the PMF:

> dbinom(3, 5, 0.25) + dbinom(4, 5, 0.25) + dbinom(5, 5, 0.25)
[1] 0.1035156
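A third route to the same upper-tail probability is the `lower.tail` argument that R's `p` functions accept. The sketch below compares the three approaches for B(5, 0.25); the variable names are illustrative.

```r
# P(X > 2): ask pbinom() for the upper tail directly
upper <- pbinom(2, size = 5, prob = 0.25, lower.tail = FALSE)

# The two approaches from the slides
complement <- 1 - pbinom(2, 5, 0.25)
by_sum     <- sum(dbinom(3:5, 5, 0.25))

c(upper = upper, complement = complement, by_sum = by_sum)
```

Note that `lower.tail = FALSE` gives P(X > q), a strict inequality, which matters for discrete distributions.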

BREATHE

Expressing continuous random variables

Probability density function (PDF)
◦ Describes the values taken by a continuous r.v. X and its associated probabilities
◦ Function such that the area under the curve between any two points a, b corresponds to the probability that the r.v. falls between a and b
◦ → $P(a \le X \le b) = \int_a^b f(x)\,dx$

PDF properties
◦ Continuous r.v.'s are infinitely precise: $P(X = x) = P(x \le X \le x) = 0$
  ◦ Exactly unlike PMFs
◦ Total area under the PDF equals 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$
◦ Densities (and therefore probabilities) aren't negative: $f(x) \ge 0$
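These properties can be checked numerically with base R's `integrate()`. The sketch below uses the built-in standard normal density `dnorm()` purely as a convenient stand-in for f(x); the interval endpoints a and b are arbitrary.

```r
# Stand-in density f(x): the standard normal
f <- function(x) dnorm(x, mean = 0, sd = 1)

# Total area under the PDF is 1
total_area <- integrate(f, lower = -Inf, upper = Inf)$value

# P(a <= X <= b) is the area under f between a and b ...
a <- -1
b <- 2
area_ab <- integrate(f, lower = a, upper = b)$value

# ... which matches the difference of CDF values F(b) - F(a)
cdf_diff <- pnorm(b) - pnorm(a)

c(total_area = total_area, area_ab = area_ab, cdf_diff = cdf_diff)
```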

Expressing continuous random variables

Cumulative distribution function (CDF)
◦ Function defined, for a specific value x of a continuous r.v. X, as F(x) = P(X ≤ x)
◦ (mostly) the same as for discrete

[Figure: example curves plotted against x (x-axes 0 to 10 and 5 to 25)]

Relationship between PDF and CDF

Jumping right in: Normal distribution

The PDF (probability distribution) for a normally-distributed random variable:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$

We write normally distributed r.v.'s as $X \sim N(\mu, \sigma^2)$

It's gross, everyone knows it, and you will be neither plugging nor chugging with this equation
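For the curious, the equation can be typed out once and checked against R's `dnorm()`. This is a minimal sketch: the helper name `normal_pdf` and the parameter values are hypothetical. One practical point it illustrates is that the notation N(μ, σ²) uses the variance, while `dnorm()` expects the standard deviation.

```r
# Hand-written normal density, parameterized by mean and VARIANCE
normal_pdf <- function(x, mu, sigma2) {
  1 / sqrt(2 * pi * sigma2) * exp(-(x - mu)^2 / (2 * sigma2))
}

mu     <- 10   # illustrative mean
sigma2 <- 4    # illustrative variance, so sd = 2
x      <- c(6, 8, 10, 12, 14)

by_hand  <- normal_pdf(x, mu, sigma2)
built_in <- dnorm(x, mean = mu, sd = sqrt(sigma2))   # sd, not variance

all.equal(by_hand, built_in)
```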

PDF of normal distribution

Example: let's say women's heights (cm) are normally distributed according to $N(165, 64)$
◦ Pop quiz: what is the standard deviation of this distribution?

[Figure: PDF of N(165, 64). x-axis: Value (140 to 180); y-axis: Probability]

Wikipedia weighs in

Making the PDF

Another "interesting" hack:

> plot.range <- tibble(x = c(165 - 32, 165 + 32))
> ggplot(plot.range, aes(x = x)) +
    stat_function(fun = dnorm, args = list(mean = 165, sd = 8))

[Figure: PDF of N(165, 64). x-axis: Value (140 to 180); y-axis: Probability]

Making the CDF

> data.cdf <- tibble(x = rnorm(10000, 165, 8))
> ggplot(data.cdf, aes(x = x)) + stat_ecdf()

[Figure: empirical CDF of the sampled heights. x-axis: x (140 to 180)]

Expectation and variance

Any guesses?

It's in the definition: $X \sim N(\mu, \sigma^2)$

Working with the normal distribution

Types of questions one can ask:
◦ What is the probability that a randomly-chosen woman is taller than 158 cm?
◦ What is the probability that a randomly-chosen woman is between 163 and 170 cm tall?
◦ What is the probability that a randomly-chosen woman is shorter than 167 cm?
◦ What is the probability that a randomly-chosen woman is 168 cm tall?
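Each of these questions maps onto a `pnorm()` call. The sketch below assumes the questions refer to the women's heights example, N(165, 64), so the standard deviation is 8; the variable names are illustrative.

```r
mu    <- 165
sigma <- sqrt(64)   # N(165, 64) gives the variance, so sd = 8

# Taller than 158 cm: upper tail
p_taller_158 <- 1 - pnorm(158, mu, sigma)

# Between 163 and 170 cm: difference of two CDF values
p_between <- pnorm(170, mu, sigma) - pnorm(163, mu, sigma)

# Shorter than 167 cm: the CDF itself
p_shorter_167 <- pnorm(167, mu, sigma)

# Exactly 168 cm: a single point has zero probability for a continuous r.v.
p_exactly_168 <- pnorm(168, mu, sigma) - pnorm(168, mu, sigma)

round(c(p_taller_158, p_between, p_shorter_167, p_exactly_168), 3)
```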


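Properties of the normal distribution
◦ Symmetric around the mean
◦ Mean = median = mode
◦ Inflection points (at μ − σ and μ + σ)

Introducing the standard normal: $X \sim N(0, 1)$

[Figure: PDF of the standard normal with μ = 0 marked. x-axis: x (−5 to 5); y-axis: Probability]

These are Z-scores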

Standard Normal $X \sim N(0, 1)$

The cumulative-distribution function (cdf) for a standard normal distribution is denoted by

$\Phi(x) = \Pr(X \le x)$

where X follows an N(0,1) distribution. This function is shown in Figure 5.10.

The symbol ~ is used as shorthand for the phrase “is distributed as.” Thus X ~ N(0,1) means that the random variable X is distributed as an N(0,1) distribution.

Column A in Table 3 of the Appendix presents Φ(x) for various positive values of x for a standard normal distribution. This cumulative distribution function is illustrated in Figure 5.11. Notice that the area to the left of 0 is .5.

[Figure: standard normal PDF, $f(x) = \frac{1}{\sqrt{2\pi}} e^{-(1/2)x^2}$, with 68% of the area within ±1.00 of 0, 95% within ±1.96, and 99% within ±2.58]

PDF and CDF of $X \sim N(0, 1)$

Furthermore, the area to the left of x approaches 0 as x becomes small and approaches 1 as x becomes large.

The right-hand tail of the standard normal distribution, Pr(X ≥ x), is given in column B of Appendix Table 3.

If X ~ N(0,1), then find Pr(X ≤ 1.96) and Pr(X ≤ 1).

From the Appendix, Table 3, column A,

Φ(1.96) = .975 and Φ(1) = .8413

From the symmetry properties of the standard normal distribution,

$\Phi(-x) = \Pr(X \le -x) = \Pr(X \ge x) = 1 - \Pr(X \le x) = 1 - \Phi(x)$

This symmetry property is depicted in Figure 5.12 for x = 1.

[Figure: standard normal PDF with Pr(X ≤ x) = Φ(x) shown as the area to the left of x; marked cumulative areas include .023, .977, and .9987]


If the shaded grey area = 0.977, what is x?

Standard Normal $X \sim N(0, 1)$

Due to symmetry, P(X ≤ −x) = 1 − P(X ≤ x)

Calculate Pr(X ≤ −1.96) if X ~ N(0,1).

Pr(X ≤ −1.96) = Pr(X ≥ 1.96) = .0250 from column B of Table 3.

Furthermore, for any numbers a, b we have Pr(a ≤ X ≤ b) = Pr(X ≤ b) − Pr(X ≤ a), and thus we can evaluate Pr(a ≤ X ≤ b) for any a, b from Table 3.

Compute Pr(−1 ≤ X ≤ 1.5) if X ~ N(0,1).

Pr(−1 ≤ X ≤ 1.5) = Pr(X ≤ 1.5) − Pr(X ≤ −1)
= Pr(X ≤ 1.5) − Pr(X ≥ 1) = .9332 − .1587 = .7745

Forced vital capacity (FVC), a standard measure of pulmonary function, is the volume of air a person can expel in 6 seconds. Current research looks at potential risk factors, such as cigarette smoking, air pollution, indoor allergies, or the type of stove used in the home, that may affect FVC in grade-school children. One problem is that age, sex, and height affect pulmonary function, and these variables must be corrected for before considering other risk factors. One way to make these adjustments for a particular child is to find the mean μ and standard deviation σ for children of the same age (in 1-year age groups), sex, and height (in 2-in. height groups) from large national surveys and compute a standardized FVC, which is defined as (X − μ)/σ, where X is the original FVC. The standardized FVC then approximately follows an N(0,1) distribution, if the distribution of the original FVC values was bell-shaped. Suppose a child is considered in poor pulmonary health if his or her standardized FVC < −1.5. What percentage of children are in poor pulmonary health?

Pr(X < −1.5) = Pr(X > 1.5) = .0668

Thus about 7% of children are in poor pulmonary health.
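The table lookups in this excerpt can be reproduced with `pnorm()`, which plays the role of Φ for the standard normal. A brief sketch, with illustrative variable names:

```r
# Column A style lookups: Phi(x) = P(X <= x)
phi_196 <- pnorm(1.96)   # about .975
phi_1   <- pnorm(1)      # about .8413

# Symmetry: P(X <= -1.96) equals the upper tail P(X >= 1.96)
left_tail  <- pnorm(-1.96)
right_tail <- 1 - pnorm(1.96)   # about .0250

# Interval probability: P(-1 <= X <= 1.5) = Phi(1.5) - Phi(-1)
p_interval <- pnorm(1.5) - pnorm(-1)   # about .7745

# FVC example: share of children with standardized FVC below -1.5
p_poor_health <- pnorm(-1.5)   # about .0668

round(c(phi_196, phi_1, left_tail, p_interval, p_poor_health), 4)
```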

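[Figure: standard normal PDF illustrating symmetry about 0: the left-tail area Φ(−1) equals the right-tail area 1 − Φ(1)]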

For $X \sim N(0, 1)$, what is the probability P(X ≤ 0.47)?

[Figure: standard normal PDF with the area to the left of 0.47 shaded. x-axis: Z (−5 to 5); y-axis: Probability]

> # CDF: P(X <= 0.47)
> pnorm(0.47)
[1] 0.6808225

Normal distribution functions

Normal function   Meaning
dnorm(x)          Density at X = x
pnorm(q)          P(X <= q)
rnorm(n)          Generate n random draws from N(0,1)
qnorm(p)          Obtain x from given CDF area: qnorm(0.6808225) = 0.47

For $X \sim N(0, 1)$, what is the probability P(−1.32 ≤ X ≤ 0.47)?

[Figure: standard normal PDF with the area between −1.32 and 0.47 shaded. x-axis: Z (−5 to 5); y-axis: Probability]

> # P(X <= 0.47)
> pnorm(0.47)
[1] 0.6808225

> # P(X <= -1.32)
> pnorm(-1.32)
[1] 0.09341751

P(−1.32 ≤ X ≤ 0.47) = 0.6808225 − 0.09341751 = 0.587405

For $X \sim N(0, 1)$, what is the probability P(−1 ≤ X ≤ 1)? AKA the probability of being within 1 standard deviation of the mean?
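This question follows the same pattern as the previous one: subtract the lower CDF value from the upper one. A minimal sketch, which also checks the 95% and 99% cutoffs (1.96 and 2.58) quoted earlier; the variable names are illustrative.

```r
# P(-1 <= X <= 1) for the standard normal
within_1sd <- pnorm(1) - pnorm(-1)

# By symmetry this is also 2 * pnorm(1) - 1
within_1sd_sym <- 2 * pnorm(1) - 1

# The same calculation at the 95% and 99% cutoffs
within_196 <- pnorm(1.96) - pnorm(-1.96)
within_258 <- pnorm(2.58) - pnorm(-2.58)

round(c(within_1sd, within_196, within_258), 3)
```

The three values come out at roughly 0.68, 0.95, and 0.99.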

For $X \sim N(0, 1)$, what is the probability P(X ≥ 2.14)?

> ## Two approaches:
> 1 - pnorm(2.14)
[1] 0.01617738

> pnorm(-2.14)
[1] 0.01617738

For $X \sim N(0, 1)$, the top 8% of the distribution falls above what number?

[Figure: standard normal PDF with the upper-tail area = 0.08 shaded. x-axis: Z (−5 to 5); y-axis: Probability]

> qnorm(1 - 0.08)
[1] 1.405072

> -1 * qnorm(0.08)
[1] 1.405072

Historical consideration of z-scores

Re-scaling to standard normal to compare distributions: $Z = \frac{x - \mu}{\sigma}$
◦ x = distribution value of interest ("raw score")
◦ μ, σ = r.v./population mean, standard deviation

Example: Weight for a population of rabbits follows a normal distribution $N(2.6, 1.1)$

What is the Z-score for a 3 pound rabbit?

$Z = \frac{x - \mu}{\sigma} = \frac{3 - 2.6}{\sqrt{1.1}} = 0.381$

What is the probability a rabbit weighs less than 3 pounds? pnorm(0.381) = 0.648

Does it make sense that this number is positive?

pnorm(3, 2.6, sqrt(1.1)) = 0.648   THE FUTURE IS NOW

Normal distribution functions, revisited

All functions assume the standard normal by default. Provide additional arguments for other normals:

Standard normal             Any normal
pnorm(q) = pnorm(q, 0, 1)   pnorm(q, mean, sd)

Z-scores are most useful for comparing different distributions

Weight for rabbit pop A is distributed $N(2.6, 1.1)$
Weight for rabbit pop B is distributed $N(2.9, 0.17)$

Which of these two rabbits is bigger? Pop A rabbit weighing 2.95 lbs, or pop B rabbit weighing 3.1 lbs?

Population A: $Z = \frac{x - \mu}{\sigma} = \frac{2.95 - 2.6}{\sqrt{1.1}} = 0.334$

Population B: $Z = \frac{x - \mu}{\sigma} = \frac{3.1 - 2.9}{\sqrt{0.17}} = 0.485$
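The comparison can be scripted with a small helper. This is a sketch only: the function name `z_score` is hypothetical, and it takes the variance as its third argument to match the N(μ, σ²) notation used on the slides.

```r
# Z-score from a raw value, a mean, and a VARIANCE (as in N(mu, sigma^2))
z_score <- function(x, mu, sigma2) {
  (x - mu) / sqrt(sigma2)
}

z_a <- z_score(2.95, mu = 2.6, sigma2 = 1.1)    # pop A rabbit
z_b <- z_score(3.1,  mu = 2.9, sigma2 = 0.17)   # pop B rabbit

round(c(z_a = z_a, z_b = z_b), 3)

# Percentile of each rabbit within its own population
round(pnorm(c(z_a, z_b)), 3)
```

The pop B rabbit has the larger Z-score, so it is bigger relative to its own population even though the raw difference from the mean is smaller.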

Putting it all together

The height of European men is distributed as $N(175, 53.3)$
The height of European women is distributed as $N(162.5, 34.8)$

What proportion of men is shorter than 150 cm, aka P(man < 150)?

Using Z-scores

$Z = \frac{x - \mu}{\sigma} = \frac{150 - 175}{\sqrt{53.3}} = -3.424$

> pnorm(-3.424)
[1] 0.0003085331

Skipping Z-scores

> pnorm(150, 175, sqrt(53.3))
[1] 0.0003081516

Putting it all together

Men: $N(175, 53.3)$   Women: $N(162.5, 34.8)$

What proportion of women is taller than 162.5 cm?
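No calculation is really needed here: 162.5 cm is the mean of the women's distribution, and a normal distribution is symmetric around its mean, so half the women are taller. A one-line check in R, with an illustrative variable name:

```r
# P(woman > 162.5) where 162.5 is the mean of N(162.5, 34.8)
p_taller_than_mean <- 1 - pnorm(162.5, mean = 162.5, sd = sqrt(34.8))
p_taller_than_mean   # 0.5
```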

Putting it all together

Men: $N(175, 53.3)$   Women: $N(162.5, 34.8)$

What proportion of women is taller than 170 cm?

Using Z-scores

$Z = \frac{x - \mu}{\sigma} = \frac{170 - 162.5}{\sqrt{34.8}} = 1.2713$

> 1 - pnorm(1.2713)
[1] 0.101811

Skipping Z-scores

> 1 - pnorm(170, 162.5, sqrt(34.8))
[1] 0.1017987

Putting it all together

What is the tallest a woman can be and still be in the bottom 22%?

Using Z-scores

> qnorm(0.22)
[1] -0.7721932

$Z = \frac{x - \mu}{\sigma}$ → $x = Z\sigma + \mu = -0.7722 \times \sqrt{34.8} + 162.5 = 157.9$ cm

Skipping Z-scores

> qnorm(0.22, 162.5, sqrt(34.8))
[1] 157.9447

Putting it all together

What is the shortest a woman can be and still be in the top 22%?

Using Z-scores

> -1 * qnorm(0.22)
[1] 0.7721932

$Z = \frac{x - \mu}{\sigma}$ → $x = Z\sigma + \mu = 0.7722 \times \sqrt{34.8} + 162.5 = 167.05$ cm

Skipping Z-scores

> qnorm(1 - 0.22, 162.5, sqrt(34.8))
[1] 167.0553

Putting it all together

Men: $N(175, 53.3)$   Women: $N(162.5, 34.8)$

What is the probability a randomly chosen man is between 175 and 182 cm tall?

→ P(X < 182) − P(X < 175) = P(X < 182) − 0.5

> pnorm(182, 175, sqrt(53.3)) - 0.5
[1] 0.3311738

Putting it all together

What is the probability a randomly chosen man is either between 175 and 182 cm tall or between 150 and 160 cm tall?

→ P(175 < X < 182) + P(150 < X < 160), since the two ranges do not overlap

> ### First probability
> pnorm(182, 175, sqrt(53.3)) - 0.5
[1] 0.3311738

> ### Second prob.
> pnorm(160, 175, sqrt(53.3)) - pnorm(150, 175, sqrt(53.3))
[1] 0.01965059

> 0.3311738 + 0.01965059
[1] 0.3508244

Putting it all together

I have two randomly-chosen European friends, one man and one woman. What is the probability the man is at least 180 cm and the woman is between 163 and 170 cm?

→ P(man > 180) × P(163 < woman < 170), since the two heights are independent

> ### First probability
> 1 - pnorm(180, 175, sqrt(53.3))
[1] 0.2467138

> ### Second prob.
> pnorm(170, 162.5, sqrt(34.8)) - pnorm(163, 162.5, sqrt(34.8))
[1] 0.3644282

> 0.246713 * 0.3644282
[1] 0.08990917

Putting it all together

Men: $N(175, 53.3)$   Women: $N(162.5, 34.8)$

I have two new randomly-chosen European friends, one man and one woman. What is the probability the man is 180 cm and the woman is 163 cm?

→ P(man = 180) × P(woman = 163)
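Because height is modeled as a continuous r.v., the probability of any exact value is 0, so this product is 0 as well. The sketch below illustrates the idea for the man's height by shrinking an interval around 180 cm; the interval widths and variable names are arbitrary choices.

```r
mu    <- 175
sigma <- sqrt(53.3)

# Probability of landing within a shrinking window centered on 180 cm
widths <- c(1, 0.1, 0.01, 0.001)
p_near_180 <- pnorm(180 + widths / 2, mu, sigma) -
              pnorm(180 - widths / 2, mu, sigma)

# The probabilities shrink toward 0 as the window narrows
data.frame(width_cm = widths, probability = p_near_180)

# dnorm(180, ...) is positive, but it is a density, not a probability
dnorm(180, mu, sigma)
```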

Putting it all together

Men: $N(175, 53.3)$   Women: $N(162.5, 34.8)$

Assume 50.8% of Europeans are women. If a randomly-chosen person is shorter than 155 cm tall, what is the probability the person is a woman?

→ P(woman | <155) = P(<155 | woman) × P(woman) / P(<155)

> ### P(<155 | woman)
> pnorm(155, 162.5, sqrt(34.8))
[1] 0.1017987

So the numerator is 0.102 × 0.508.

Solving the denominator

P(<155) = P(<155 and man) + P(<155 and woman)
        = P(<155 | man) × P(man) + P(<155 | woman) × P(woman)

> ### P(<155 | man)
> pnorm(155, 175, sqrt(53.3))
[1] 0.003076926

P(<155) = 0.0031 × 0.492 + 0.102 × 0.508 = 0.0533

Therefore:

P(woman | <155) = 0.102 × 0.508 / 0.0533 = 0.972
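The whole Bayes calculation can be carried out in one short script without rounding the intermediate values. This is a sketch with illustrative variable names; because nothing is rounded along the way, the final answer differs slightly in the third decimal place from the hand calculation.

```r
# Prior: share of women and men in the population
p_woman <- 0.508
p_man   <- 1 - p_woman

# Likelihoods: probability of being shorter than 155 cm within each group
p_short_given_woman <- pnorm(155, 162.5, sqrt(34.8))
p_short_given_man   <- pnorm(155, 175,   sqrt(53.3))

# Denominator: total probability of being shorter than 155 cm
p_short <- p_short_given_woman * p_woman + p_short_given_man * p_man

# Bayes' rule: P(woman | <155)
p_woman_given_short <- p_short_given_woman * p_woman / p_short

round(c(p_short = p_short, p_woman_given_short = p_woman_given_short), 4)
```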

Statistical inference

[Diagram: Population (parameters μ, σ) → random sampling → Sample (estimates x̄, s) → statistical inference → Population]

Two main flavors of statistical inference

Estimation
◦ Estimate a population parameter from sample data
◦ Point estimates: What is the population mean?
◦ Interval estimates: In what range of values is the population mean likely to fall?

Hypothesis testing
◦ Test whether the value of a population parameter is equal to some specific value
◦ Is there evidence that my sample differs from some underlying population?

The sampling distribution

The probability distribution of values for an estimate that we obtain under sampling

Obtaining a sampling distribution

> genes <- read.csv("genes.csv")
> head(genes)
  nucleotides
1        3785
2        7416
3        2135
4        7682
5        5766
6       11079

> mean(genes$nucleotides)
[1] 2761.039
> sd(genes$nucleotides)
[1] 2037.645

> ggplot(genes, aes(x = nucleotides)) + geom_histogram(fill = "white", color = "black")

[Figure: histogram of gene lengths. x-axis: nucleotides (0 to 15000)]

Obtaining a sampling distribution

> ### the function sample_n draws a random sample of rows
> small.sample <- genes %>% sample_n(25)
> mean(small.sample$nucleotides)
[1] 2151.8

> ggplot(small.sample, aes(x = nucleotides)) +
    geom_histogram() +
    geom_vline(xintercept = 2151.8, color = "blue") +
    geom_vline(xintercept = 2761.039, color = "red")

[Figure: histogram of the sample with vertical lines at the sample mean (blue) and population mean (red). x-axis: nucleotides (0 to 4000)]

Reference lines in ggplot2:
geom_vline(xintercept = …)
geom_hline(yintercept = …)
geom_abline(intercept = …, slope = …)

The sample mean for a random sample of N = 25 is x̄ = 2151.8

Obtaining a sampling distribution

Now imagine we draw 20 samples of N = 25 and compute each of their means:

> head(n20.means)
  sample.mean
1     2584.84
2     2574.12
3     2382.64
4     3143.68
5     2252.56
6     2368.44

[Figure: Sampling distribution of the mean (histogram). x-axis: sample.mean (2000 to 3500)]
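The slides do not show how `n20.means` was built, so the following is only one possible sketch in base R. Since `genes.csv` is not available here, a simulated right-skewed population stands in for the gene lengths; the gamma parameters, the seed, and the population size are all assumptions made for illustration.

```r
set.seed(5312)   # arbitrary seed, only for reproducibility

# Hypothetical stand-in for genes$nucleotides: a right-skewed population
population <- rgamma(20000, shape = 2, scale = 1400)

# Draw 20 samples of size 25 and keep the mean of each one
n20.means <- data.frame(
  sample.mean = replicate(20, mean(sample(population, size = 25)))
)

head(n20.means)

# The sample means scatter around the population mean
mean(population)
range(n20.means$sample.mean)
```

With the real data, `sample(genes$nucleotides, 25)` would take the place of sampling from the simulated population.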

Quantifying the sampling distribution

The standard error is the standard deviation of the sampling distribution of an estimate
◦ Standard error of the mean: $SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, approximated with $\frac{s}{\sqrt{n}}$
◦ SE is not the standard deviation of a sample
◦ Here, n represents the number of samples (not the sample size)

It also quantifies the precision of our estimate, i.e. how far from the population parameter we are

Computing the standard error of the mean

> head(n20.means)
  sample.mean
1     2584.84
2     2574.12
3     2382.64
4     3143.68
5     2252.56
6     2368.44

> sd(n20.means$sample.mean) / sqrt(20)
[1] 93.11888

Several sampling distributions comprised of N samples, each of n = 25

[Figure: histograms of sample.mean for N = 20, 50, 100, 1000, and 10000 samples]

Standard error decreases as N increases

Therefore, mean of sampling distribution approaches population mean 2761.039

N     20        50        100       1000      10000
SE    93.1      58.1      37.9      13.3      4.02
x̄     2780.89   2753.91   2781.51   2777.02   2763.82

The Central Limit Theorem

As sample size increases, the sampling distribution of the mean will be approximately normal regardless of true population distribution

[Figure: two panels. Left: Population distribution (x-axis: nucleotides, 0 to 15000). Right: N = 1e4 sampling distribution]
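A quick simulation can illustrate the theorem. This is a sketch under stated assumptions: the population is a simulated exponential distribution (strongly right-skewed) rather than the gene-length data, and skewness is used as a rough measure of how far a distribution is from the symmetric normal shape. The helper `skewness()` is written here by hand, not taken from a package.

```r
set.seed(2017)   # arbitrary seed, only for reproducibility

# Hypothetical skewed population (not the genes data)
population <- rexp(100000, rate = 1)

# Simple moment-based skewness; values near 0 indicate symmetry
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Sampling distributions of the mean for two different sample sizes
means_n5   <- replicate(5000, mean(sample(population, size = 5)))
means_n100 <- replicate(5000, mean(sample(population, size = 100)))

# Skewness shrinks toward 0 as the sample size grows
c(population = skewness(population),
  n5         = skewness(means_n5),
  n100       = skewness(means_n100))

# Both sampling distributions are centered near the population mean
c(mean(population), mean(means_n5), mean(means_n100))
```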

Next week..

Introduction to hypothesis testing and comparing means

More fun facts on estimation will come later in the semester, to be bundled with *likelihood*
