DATA ANALYSIS Module Code: CA660 Lecture Block 3


Standard Statistical Distributions

Importance:

  • Modelling practical applications
  • Mathematical properties are known
  • Described by few parameters, which have natural interpretations

Bernoulli Distribution. This is used to model a trial (experiment) that gives rise to two outcomes: success/failure, male/female, 0/1, … Let p be the probability that the outcome is one and q = 1 - p the probability that the outcome is zero.

$$E[X] = p(1) + (1 - p)(0) = p$$

$$\mathrm{VAR}[X] = p(1)^2 + (1 - p)(0)^2 - E[X]^2 = p(1 - p)$$

[Figure: Bernoulli probabilities, Prob = 1 - p at X = 0 and Prob = p at X = 1.]


Standard distributions - Binomial

Binomial Distribution. Suppose that we are interested in the number of successes X in n independent repetitions of a Bernoulli trial, where the probability of success in an individual trial is p. Then

$$\mathrm{Prob}\{X = k\} = {}^{n}C_k\, p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n$$

$$E[X] = np, \qquad \mathrm{VAR}[X] = np(1 - p)$$

[Figure: Binomial probabilities for n = 4, p = 0.2, Prob against X = 0, …, 4, with the mean np marked.]

This is the appropriate distribution to model, e.g., preferences expressed between two brands, or the number of recombinant gametes produced by a heterozygous parent for a 2-locus model. The extension to 3 loci (or brands) is the multinomial.
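A minimal sketch of the Binomial formulas, using only the Python standard library. The function name `binomial_pmf` is illustrative and not part of the lecture; the case n = 4, p = 0.2 is the one quoted for the plot. It tabulates the probabilities and checks E[X] = np and VAR[X] = np(1 - p) numerically.

```python
import math

def binomial_pmf(k, n, p):
    """Prob{X = k} = nCk p^k (1-p)^(n-k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 0.2  # the case quoted for the figure
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Moments computed directly from the probabilities
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2

for k, pk in enumerate(pmf):
    print(k, round(pk, 4))
print("mean", round(mean, 4), "np =", n * p)
print("variance", round(var, 4), "np(1-p) =", round(n * p * (1 - p), 4))
```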


Standard distributions - Poisson

Poisson Distribution. The Poisson distribution arises as a limiting case of the binomial distribution, where n → ∞, p → 0 in such a way that np → λ (constant).

$$P\{X = k\} = \exp(-\lambda)\,\lambda^k / k!, \quad k = 0, 1, 2, \ldots$$

$$E[X] = \lambda, \qquad \mathrm{VAR}[X] = \lambda$$

Poisson is used to model the no. of occurrences of a certain phenomenon in a fixed period of time or space, e.g.

  - particles emitted by a radioactive source in a fixed direction for interval T
  - people arriving in a queue in a fixed interval of time
  - genomic mapping functions, e.g. cross over as a random event

[Figure: Poisson probability distribution plotted against X.]
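The limiting argument can be illustrated numerically. The sketch below is an illustration only: the values λ = 2 and k = 3 are arbitrary choices, and the helper names are hypothetical. It holds np = λ fixed while n grows and records how far the Binomial probability is from the Poisson one.

```python
import math

def binomial_pmf(k, n, p):
    """Prob{X = k} for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P{X = k} = exp(-lam) lam^k / k! for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 2.0, 3  # illustrative values, not from the lecture
gaps = []
for n in (10, 100, 1000, 10000):
    p = lam / n  # keep np = lam while n grows and p shrinks
    gap = abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
    gaps.append(gap)
    print(n, gap)
```

The recorded gaps shrink as n increases, which is the sense in which the Poisson is a limiting case of the Binomial.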


Other Standard examples: e.g. Hypergeometric, Exponential….

• Hypergeometric. Consider a population of M items, of which W are deemed to be successes.

• Let X be the number of successes that occur in a sample of size n, drawn without replacement from the finite population; then

$$\mathrm{Prob}\{X = k\} = \frac{{}^{W}C_k \; {}^{M-W}C_{n-k}}{{}^{M}C_n}, \quad k = 0, 1, 2, \ldots$$

• Then

$$E[X] = \frac{nW}{M}, \qquad \mathrm{VAR}[X] = \frac{nW(M - W)(M - n)}{M^2 (M - 1)}$$

• Exponential: special case of the Gamma distribution with n = 1, used e.g. to model the inter-arrival time of customers or the time to arrival of the first customer in a simple queue, or fragment lengths in genome mapping etc.

• The p.d.f. is

$$f(x) = \lambda \exp(-\lambda x), \quad x \ge 0,\ \lambda > 0; \qquad f(x) = 0 \text{ otherwise}$$


Standard p.d.f.’s - Gaussian/ Normal

• A random variable X has a normal distribution with mean μ and standard deviation σ if it has density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right\}, \quad -\infty < x < \infty$$

with $E(X) = \mu$ and $V(X) = \sigma^2$.

• Arises naturally as the limiting distribution of the average of a set of independent, identically distributed random variables with finite variances.

• Plays a central role in sampling theory and is a good approximation to a large class of empirical distributions. The default assumption in many empirical studies is that each observation is approximately N(μ, σ²).

• Note: Statistical tables of the Normal distribution are of great importance in analysing practical data sets. X is said to be a Standardised Normal variable if μ = 0 and σ = 1.


Standard p.d.f.’s : Student’s t-distribution

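• A random variable X has a t-distribution with ν d.o.f. ($t_\nu$) if it has density

$$f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu + 1)/2}, \quad -\infty < t < \infty$$

Symmetrical about the origin, with E[X] = 0 and V[X] = ν/(ν - 2) for ν > 2.

• For small ν, the $t_\nu$ distribution is very flat.
• For ν ≥ 25, the $t_\nu$ distribution ≈ Standard Normal curve.
• Suppose Z is a standard Normal variable, W has a $\chi^2_\nu$ distribution, and Z and W are independent; then the r.v.

$$X = \frac{Z}{\sqrt{W/\nu}}$$

has a $t_\nu$ distribution.

• If x1, x2, … , xn is a random sample from N(μ, σ²), and if we define

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

then

$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$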


Chi-Square Distribution

• A r.v. X has a Chi-square distribution with n degrees of freedom (n a positive integer) if it has a Gamma distribution with shape parameter n/2 and rate λ = 1/2, so its p.d.f. is

$$f(x) = \frac{x^{n/2 - 1}\exp(-x/2)}{2^{n/2}\,\Gamma(n/2)}, \quad x > 0; \qquad f(x) = 0 \text{ otherwise}$$

E[X] = n ; Var[X] = 2n

[Figure: $\chi^2_\nu(x)$ density, Prob against X.]

• Two important applications:

  - If X1, X2, … , Xn are a sequence of independently distributed Standardised Normal random variables, then the sum of squares $X_1^2 + X_2^2 + \ldots + X_n^2$ has a χ² distribution (n degrees of freedom).

  - If x1, x2, … , xn is a random sample from N(μ, σ²), then with

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

    the quantity $(n - 1)s^2/\sigma^2$ has a χ² distribution with n - 1 d.o.f., with the r.v.’s $\bar{x}$ and s² independent.


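F-Distribution

• A r.v. X has an F distribution with m and n d.o.f. if it has a density function whose constant is a ratio of gamma functions,

$$f(x) = \frac{\Gamma\left(\frac{m + n}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n}\right)^{m/2} x^{m/2 - 1} \left(1 + \frac{m x}{n}\right)^{-(m + n)/2} \quad \text{for } x > 0$$

and f(x) = 0 otherwise.

• Moments:

$$E[X] = \frac{n}{n - 2} \ \text{ if } n > 2, \qquad \mathrm{Var}[X] = \frac{2n^2 (m + n - 2)}{m (n - 2)^2 (n - 4)} \ \text{ if } n > 4$$

• For X and Y independent r.v.’s, with X ~ $\chi^2_m$ and Y ~ $\chi^2_n$,

$$F_{m,n} = \frac{X/m}{Y/n}$$

• One consequence: if x1, x2, … , xm (m ≥ 2) is a random sample from N(μ1, σ1²), and y1, y2, … , yn (n ≥ 2) a random sample from N(μ2, σ2²), then

$$\frac{\sum (x_i - \bar{x})^2 / \left((m - 1)\,\sigma_1^2\right)}{\sum (y_i - \bar{y})^2 / \left((n - 1)\,\sigma_2^2\right)} \sim F_{m-1,\,n-1}$$

  which reduces to the ratio of the two sample variances when σ1² = σ2².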


Sampling and Sampling Distributions – Extended Examples: refer to primer

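Central Limit Theorem

If X1, X2, … , Xn are a random sample of r.v. X (mean μ, variance σ²), then, in the limit, as n → ∞, the sampling distribution of the standardised mean, $(\bar{x} - \mu)/(\sigma/\sqrt{n})$, is Standard Normal, N(0,1).

Probabilities for sampling distribution: limits

• for large n, with $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$,

$$P\left\{a \le \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} \le b\right\} \approx P\{a \le U \le b\}$$

U = standardized Normal deviate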
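A small simulation can make the Central Limit Theorem statement concrete. This is a sketch under stated assumptions, not part of the lecture: the parent distribution is taken to be Uniform(0, 1), the sample size is n = 30, and the seed and number of replications are arbitrary. It standardises each sample mean and counts how often it falls in (-1.96, 1.96), which should be close to the Normal value of 0.95.

```python
import math
import random

random.seed(660)  # arbitrary fixed seed so the run is reproducible

n, reps = 30, 2000                    # sample size and number of samples (assumed)
mu, sigma = 0.5, math.sqrt(1 / 12)    # mean and s.d. of the Uniform(0, 1) parent

inside = 0
for _ in range(reps):
    x_bar = sum(random.random() for _ in range(n)) / n
    u = (x_bar - mu) / (sigma / math.sqrt(n))  # standardised sample mean
    if -1.96 <= u <= 1.96:
        inside += 1

coverage = inside / reps
print("proportion of standardised means in (-1.96, 1.96):", coverage)
```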


Large Sample theory

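• In particular

$$P\{|\bar{x} - \mu| < r\} = P\{-r < \bar{x} - \mu < r\} = P\left\{-\frac{r\sqrt{n}}{\sigma} < U < \frac{r\sqrt{n}}{\sigma}\right\} = \Phi\left(\frac{r\sqrt{n}}{\sigma}\right) - \Phi\left(-\frac{r\sqrt{n}}{\sigma}\right)$$

• Φ is the C.D.F. or D.F. of the standard Normal.
• In general, the closer the behaviour of the random variable X is to the Normal, the faster the approximation approaches U. Generally, n ≥ 30 is taken as “Large sample” theory.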


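Attribute and Proportionate Sampling

Recall primer: sample proportion and sample mean are synonymous, $\hat{p} \equiv \bar{x}$.

Probability Statements

If X and Y are independent Binomially distributed r.v.’s with parameters n, p and m, p respectively, then X + Y ~ B(n + m, p).

• So, Y = X1 + X2 + … + Xn ~ B(n, p) for the IID Xi ~ B(1, p).
• Since we know $\mu_Y = np$, $\sigma_Y = \sqrt{npq}$ and, clearly, $Y = n\bar{x}$, then

$$\frac{Y - np}{\sqrt{npq}} \to N(0, 1) \quad \text{as } n \to \infty$$

• and, further, with $\mu_{\bar{x}} = \mu_Y / n = p$ and $\sigma_{\bar{x}} = \sigma_Y / n = \sqrt{pq/n}$,

$$U = \frac{\hat{p} - p}{\sqrt{pq/n}} \sim N(0, 1)$$

  is the sampling distribution of a proportion.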


Differences in Proportions

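• Can use χ²: Contingency table type set-up.
• Can set up as parallel to difference estimate or test of 2 means (independent), so for a 2-sided 100(1 - α)% C.I.

$$(\hat{p}_1 - \hat{p}_2) \pm U_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

  where the square-root term is the S.E., for n1, n2 large. For small samples, use n - 1 in place of n.

• Under H0: P1 – P2 = 0, so can write the S.E. as

$$\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

  for pooled

$$\hat{p} = \frac{X + Y}{n_1 + n_2} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

  where X & Y = No. of successes.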


C.L.T. and Approximations summary

• General form of theorem: an infinite sequence of independent r.v.’s, with means and variances as before; then the approximation by U holds for n large enough. Note: no condition on the form of the distribution of the X’s (the raw data).

• Strictly, approximations of discrete distributions can be improved by considering a correction for continuity, e.g.

$$U = \frac{(X \pm 0.5) - \lambda}{\sqrt{\lambda}}, \quad X \sim \text{Poisson, parameter } \lambda$$

$$U = \frac{(x \pm 0.5) - np}{\sqrt{npq}}, \quad x = \text{No. in sample, so observed sample proportion } \hat{p} = x/n$$


Generalising Sampling Distn. Concept-see primer

• For the sampling distribution of any statistic, a sample characteristic is an unbiased estimator of the parent population characteristic if the mean of the corresponding sampling distribution is equal to the parent characteristic. Also, the sample average and the sample proportion are unbiased estimators of the parent average and the parent proportion:

$$E\{\bar{x}\} = \mu, \qquad E\{\hat{p}\} = P$$

• Sampling without replacement from a finite population gives the Hypergeometric distribution.

  Finite population correction (fpc) = √[(N - n) / (N - 1)], where N, n are the parent population and sample size respectively.

• Above applies to variance also.


Examples

A large-scale 1980 survey in a country showed 30% of the adult population with a given genetic trait. If this is still the current rate, what is the probability that, in a random sample of 1000, the number with the trait will be (a) < 280, (b) 316 or more?

Soln. Let X = no. of successes (with trait) in the sample. So, for an expected proportion of 0.3 in the population, we suppose X ~ B(1000, 0.3).

Since np = 300 and √(npq) = √210 = 14.49, the distn. of X is approximately Normal with mean 300 and standard deviation 14.49.

(a) P{X < 280} or P{X ≤ 279}:

$$P\left\{U \le \frac{279.5 - 300}{14.49}\right\} = P\{U \le -1.415\} = 0.0786$$

(b) P{X ≥ 316}:

$$P\left\{U \ge \frac{315.5 - 300}{14.49}\right\} = P\{U \ge 1.07\} = 1 - 0.8577 = 0.1423$$
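The two tail probabilities in this example can be checked without tables. The sketch below is illustrative only: the helper `phi` is a hypothetical name for the standard Normal C.D.F., built from `math.erf`, and the continuity-corrected cut-offs 279.5 and 315.5 are those used in the solution.

```python
import math

def phi(z):
    """Standard Normal C.D.F., computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 1000, 0.3
mean = n * p                       # np = 300
sd = math.sqrt(n * p * (1 - p))    # sqrt(npq), about 14.49

p_a = phi((279.5 - mean) / sd)         # (a) P{X <= 279} with continuity correction
p_b = 1.0 - phi((315.5 - mean) / sd)   # (b) P{X >= 316} with continuity correction

print("sd:", round(sd, 2))
print("(a):", round(p_a, 4))
print("(b):", round(p_b, 4))
```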


Examples

Auditors are checking whether a certain firm is overstating the value of its inventory items. They decide to randomly select 15 items. For each, they determine the recorded amount (R), the audited (exact) amount (A) and hence the difference between the two, X, the variable of interest. Of particular interest is whether the average difference > 250 Euro.

Data (X, in €): 170 350 310 220 500 420 560 230 270 380 200 250 430 450 210

So n = 15, x̄ = €330 and s = €121.5

H0: μ ≤ €250

H1: μ > €250

Decision Rule: Reject H0 if

$$t = \frac{\bar{x} - 250}{s/\sqrt{n}} > t_{0.05,\,14} = 1.761$$

where the dof = n - 1 = 14.

Value from data:

$$t = \frac{330 - 250}{121.5/\sqrt{15}} = 2.55$$

Since 2.55 > 1.761, reject H0. Also, the p-value is the area to the right of 2.55. It is between 0.01 and 0.025 (so less than α = 0.05), so again reject H0.

The data indicate that the firm is overstating the value of its inventory items by more than €250 on average.
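The summary statistics and the test statistic can be reproduced from the 15 differences. This is a minimal sketch using the standard library; the critical value 1.761 is simply the table value quoted in the example, not computed here, and the variable names are illustrative.

```python
import math
import statistics

# The 15 observed differences (in Euro) from the example
diffs = [170, 350, 310, 220, 500, 420, 560, 230, 270, 380,
         200, 250, 430, 450, 210]
mu0 = 250.0      # hypothesised mean difference
t_crit = 1.761   # t(0.05, 14) as quoted from tables in the text

n = len(diffs)
x_bar = statistics.mean(diffs)
s = statistics.stdev(diffs)                    # sample s.d. with divisor n - 1
t_stat = (x_bar - mu0) / (s / math.sqrt(n))    # one-sample t statistic
reject_h0 = t_stat > t_crit                    # right-tailed decision rule

print("n =", n, "mean =", x_bar, "s =", round(s, 1))
print("t =", round(t_stat, 2), "reject H0:", reject_h0)
```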


Examples contd.

Blood pressure readings were taken before and after 6 months on medication in women students (aged 25-35); sample of 15. Calculate (a) a 95% C.I. for the mean change in B.P.; (b) test at the 1% level of significance (α = 0.01) that the medication reduces B.P.

Data:

Subject     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1st (x)    70  80  72  76  76  76  72  78  82  64  74  92  74  68  84
2nd (y)    68  72  62  70  58  66  68  52  64  72  74  60  74  72  74
d = x - y   2   8  10   6  18  10   4  26  18  -8   0  32   0  -4  10

(a) So

$$\bar{d} = \frac{\sum d_i}{15} = 8.80, \qquad s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{14}} = 10.98$$

and the 95% C. limits are

$$\bar{d} \pm t_{0.025}\,\frac{s_d}{\sqrt{15}}$$


Contd. The value for t0.025 is based on d.o.f. = 14. From the t-table, t0.025 = 2.145.

So, the 95% C.I. is:

$$P\left\{8.80 - 2.145\,\frac{10.98}{\sqrt{15}} \le \mu_D \le 8.80 + 2.145\,\frac{10.98}{\sqrt{15}}\right\} = 0.95$$

i.e. limits are 8.80 ± 6.08 or (2.72, 14.88), so we are 95% confident that there is a mean difference (reduction) in B.P. of between 2.72 and 14.88.

(b) The claim is that μD > 0, so we look at H0: μD = 0 vs H1: μD > 0.

So t-statistic as before, but right-tailed (one-sided only) Rejection Region. For d.o.f. = 14, t0.01 = 2.624. So the calculated value from our data,

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{8.80}{10.98/\sqrt{15}} = 3.10$$

is clearly in the Rejection Region, so H0 is rejected in favour of H1 at α = 0.01. Reduction in B.P. after medication is strongly supported by the data.

[Figure: t14 density, with the Accept region to the left and the Reject region (α = 1%) to the right of t0.01 = 2.624.]
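The paired-sample calculations for both parts can be retraced from the raw readings. The sketch below assumes nothing beyond the data in the example; the t-table values 2.145 and 2.624 are taken as quoted rather than computed, and the variable names are illustrative.

```python
import math
import statistics

# Blood pressure readings for the 15 subjects, before (x) and after (y)
x = [70, 80, 72, 76, 76, 76, 72, 78, 82, 64, 74, 92, 74, 68, 84]
y = [68, 72, 62, 70, 58, 66, 68, 52, 64, 72, 74, 60, 74, 72, 74]
d = [xi - yi for xi, yi in zip(x, y)]   # paired differences d = x - y

n = len(d)
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)               # divisor n - 1 = 14
se = s_d / math.sqrt(n)

t_025, t_01 = 2.145, 2.624              # table values quoted in the text (14 d.o.f.)

# (a) 95% confidence limits for the mean change
ci = (d_bar - t_025 * se, d_bar + t_025 * se)

# (b) right-tailed test of H0: mean change = 0 at the 1% level
t_stat = d_bar / se
reject_h0 = t_stat > t_01

print("d_bar =", round(d_bar, 2), "s_d =", round(s_d, 2))
print("95% C.I.:", tuple(round(v, 2) for v in ci))
print("t =", round(t_stat, 2), "reject H0:", reject_h0)
```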


Examples

Rates of prevalence of CF antibody to P1 virus among children in a given age group. Of 113 boys tested, 34 have the antibody, while of 139 girls tested, 54 have the antibody. Is the evidence strong for a higher prevalence rate in girls?

H0: p1 = p2 vs H1: p1 < p2 (where p1, p2 are the proportions of boys and girls, respectively, with the antibody).

Soln.

$$\hat{p}_1 = \frac{34}{113} = 0.301, \qquad \hat{p}_2 = \frac{54}{139} = 0.388, \qquad \hat{p} = \frac{34 + 54}{113 + 139} = 0.349$$

$$U = \frac{0.301 - 0.388}{\sqrt{(0.349)(0.651)\left(\frac{1}{113} + \frac{1}{139}\right)}} = -1.44$$

Cannot reject H0, since the actual p-value = P{U ≤ -1.44} = 0.0749 exceeds 0.05.
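The pooled two-proportion test can be sketched as follows. The helper `phi` is a hypothetical name for the standard Normal C.D.F. built from `math.erf`. Because the code carries full precision rather than the three-decimal proportions used above, it gives U of about -1.45 and a slightly smaller p-value than the tabulated 0.0749; the conclusion is unchanged.

```python
import math

def phi(z):
    """Standard Normal C.D.F., computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x1, n1 = 34, 113    # boys with antibody, boys tested
x2, n2 = 54, 139    # girls with antibody, girls tested

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion under H0: p1 = p2

# S.E. of the difference under H0, using the pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
u = (p1 - p2) / se
p_value = phi(u)                 # left-tailed, since H1: p1 < p2

print("p1 =", round(p1, 3), "p2 =", round(p2, 3), "pooled =", round(p_pool, 3))
print("U =", round(u, 2), "p-value =", round(p_value, 4))
```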


Developed Examples using Standard Distributions/sampling distributions

Lot Acceptance Sampling in SPC. The Binomial is frequently used. Suppose a shipment of 500 calculator chips arrives at an electronics firm; it is acceptable if a sample of size 10 has no more than one defective chip. What is the probability of accepting the lot if, in fact, (i) 10% (50 chips) are defective, (ii) 20% (100) are defective?

n = 10 trials, each with 2 outcomes: Success = defective; Failure = not defective
p = P{Success} = 0.10 (assume constant for simplicity)
X = no. of successes out of n trials = no. defective out of 10 sampled
i.e. the electronics firm will accept the shipment if X = 0 or 1

(i) P{accept} = P{0 or 1} = P{0} + P{1} = P{X ≤ 1} (cumulative)
From tables: n = 10, p = 0.10, P{0} = 0.349, P{1} = 0.387
So, P{accept} = 0.736, i.e. 73.6% chance

(ii) For p = 0.20, P{0} = 0.107, P{1} = 0.268, so P{accept} = 0.375 or 37.5% chance
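The table look-ups in (i) and (ii) can be reproduced directly from the Binomial formula. In this sketch the function name `accept_prob` and its `max_defective` argument are illustrative, not from the lecture.

```python
import math

def accept_prob(n, p, max_defective=1):
    """P{X <= max_defective} for X ~ B(n, p): the lot is accepted
    when the sample contains at most max_defective defective chips."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(max_defective + 1))

p_accept_10 = accept_prob(10, 0.10)   # case (i): 10% defective
p_accept_20 = accept_prob(10, 0.20)   # case (ii): 20% defective

print("(i)  P{accept} =", round(p_accept_10, 3))
print("(ii) P{accept} =", round(p_accept_20, 3))
```

The unrounded value in case (ii) is slightly above the 0.375 obtained by adding the two three-decimal table entries.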


Example contd.

Suppose we have a shipment of 50 chips, with a similar set-up to before: check for lot acceptance, still selecting a sample of size 10 and assuming 10% defective. Success and Failure as before.

Now, though, p = P{Success on 1st trial} = 5/50 = 0.1 on the first trial, but the later probabilities are conditional:
P{Success on 2nd trial} = 5/49 = 0.102 if the 1st pick is a failure (not defective), OR
P{Success on 2nd trial} = 4/49 = 0.082 if the 1st is defective (success).
Hence Hypergeometric.

Think of two sets in the shipment, one having 5 S’s, the other 45 F’s, and of taking 10 chips randomly from the two sets. If x are selected from the S set, then 10 - x must be selected from the F set, i.e. N = 50, k = 5, n = 10.

So

$$P\{1\ S \text{ and } 9\ F\text{s}\} = P\{1\} = \frac{{}^{5}C_1 \; {}^{45}C_9}{{}^{50}C_{10}} = 0.431$$

and P{0} from a similar expression = 0.31
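A short check of the Hypergeometric probabilities for this shipment. The argument names follow the M (population size) and W (number of successes) notation of the Hypergeometric formula given earlier rather than the N, k labels of this example; the function name is illustrative.

```python
import math

def hypergeom_pmf(k, M, W, n):
    """Prob{X = k}: k successes in a sample of n drawn without replacement
    from a population of M items containing W successes."""
    return math.comb(W, k) * math.comb(M - W, n - k) / math.comb(M, n)

M, W, n = 50, 5, 10    # 50 chips, 5 defective, sample of 10

p0 = hypergeom_pmf(0, M, W, n)
p1 = hypergeom_pmf(1, M, W, n)
p_accept = p0 + p1     # lot accepted with at most one defective

print("P{0} =", round(p0, 3), "P{1} =", round(p1, 3))
print("P{accept} =", round(p_accept, 3))
```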


Example contd.

Approximations: Poisson to Binomial

Suppose shipment = 2500 chips and we want to test 100. Accept the lot if the sample contains no more than one defective. Assume 5% defective. What is the probability of accepting the lot?

Note: n = 100, N = 2500, ratio n/N = 0.04, i.e. < 5%, so can avoid the work for the hypergeometric, as approximately Binomial, n = 100, p ≈ 0.05.
So the Binomial random variable X here = no. of defective chips out of 100.
P{accept lot} = P{X ≤ 1} = P{0} + P{1}

$$P\{\text{accept}\} = {}^{100}C_0\,(0.05)^0 (0.95)^{100} + {}^{100}C_1\,(0.05)^1 (0.95)^{99} = 0.037$$

Lot of work, not tabulated. Alternative: the Poisson approx. to the Binomial works well where n > 20, np ≤ 7, so take the probability from the Poisson table, where

$$\lambda = np = (100)(0.05) = 5$$

$$P\{0\} = 0.0067, \qquad P\{1\} = 0.0337, \qquad P\{X \le 1\} = 0.0067 + 0.0337 = 0.0404$$

close to the result for the Binomial.
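The exact Binomial value and its Poisson approximation can be compared side by side. This is a minimal sketch with illustrative variable names; both expressions are just P{0} + P{1} written out for the two distributions.

```python
import math

n, p = 100, 0.05
lam = n * p    # Poisson parameter, lambda = np = 5

# Exact Binomial: P{X <= 1} = (1-p)^n + n p (1-p)^(n-1)
binom_accept = (1 - p)**n + n * p * (1 - p)**(n - 1)

# Poisson approximation: P{X <= 1} = exp(-lam) (1 + lam)
poisson_accept = math.exp(-lam) * (1 + lam)

print("Binomial:", round(binom_accept, 4))
print("Poisson :", round(poisson_accept, 4))
```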


Example contd.

Approximations: Normal to discrete distribution

Supposing we still want to sample 100 chips, but 10% of chips are expected to be defective. The rule for the Poisson approximation of the Binomial is that n is large and p small, or that np < 7. Now p = 0.10, so np = 10, and the Poisson is not a good approximation.

However, n is large and np = 10, n(1 - p) = 90, both > 5, so can use the Normal approximation. Then X is a binomial r.v. with

$$\mu = np = (100)(0.1) = 10, \qquad \sigma = \sqrt{npq} = \sqrt{np(1 - p)} = \sqrt{9} = 3$$

So have

$$P_{\text{Binomial}}\{X \le 1\} \approx P_{\text{Normal}}\{X \le 1.5\} = P\left\{U \le \frac{1.5 - 10}{3}\right\} = P\{U \le -2.83\} = 0.0023$$

Very small chance of accepting a lot with this many defectives.
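A sketch of the continuity-corrected Normal approximation for this case, with the exact Binomial value computed alongside for comparison. The helper `phi` is a hypothetical name for the standard Normal C.D.F. built from `math.erf`.

```python
import math

def phi(z):
    """Standard Normal C.D.F., computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 100, 0.10
mu = n * p                          # np = 10
sigma = math.sqrt(n * p * (1 - p))  # sqrt(npq) = 3

# Normal approximation with continuity correction: P{X <= 1} ~ P{X_Normal <= 1.5}
normal_accept = phi((1.5 - mu) / sigma)

# Exact Binomial P{X <= 1}, for comparison
exact_accept = (1 - p)**n + n * p * (1 - p)**(n - 1)

print("Normal approx.:", round(normal_accept, 4))
print("Exact Binomial:", exact_accept)
```

Both figures point to the same conclusion: the chance of accepting the lot is very small. The exact value is smaller still, as the Normal approximation is less accurate this far into the tail.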


Developed Examples using Standard Distributions/sampling distributions

RECOMBINANTS, BINOMIAL and MULTINOMIAL

• Binomial: No. of recombinant gametes produced by a heterozygous parent for a 2-locus model, with parameters n and θ = P{gamete recombinant} (= R.F.)

  So for n_r recombinants in a sample of n

$$\hat{\theta} = \frac{n_r}{n} = \text{recombination fraction}$$

• Multinomial: 3-locus model (A, B, C), 4 possible classes of gametes (non-recombinants, AB recombinants, BC recombinants and double recombinants at loci ABC).

  The joint probability distribution for the r.v.’s requires counting the number in each class:

$$P\{X_1 = a, X_2 = b, X_3 = c, X_4 = d\} = \frac{n!}{a!\,b!\,c!\,d!}\, P_1^{a} P_2^{b} P_3^{c} P_4^{d}$$

  where a + b + c + d = n and P1, P2, P3, P4 are the probabilities of observing a member of each of the 4 classes respectively.
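The multinomial joint probability can be evaluated with a few lines of code. In the sketch below the class probabilities and the sample size n = 6 are hypothetical values chosen only for illustration, and the function name is not from the lecture; the loop confirms that the probabilities over all (a, b, c, d) with a + b + c + d = n sum to one.

```python
import math

def multinomial_pmf(counts, probs):
    """P{X1 = a, X2 = b, ...} = n!/(a! b! ...) * P1^a * P2^b * ..."""
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)   # exact integer multinomial coefficient
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob

# Hypothetical class probabilities: non-recombinant, AB, BC, double recombinant
probs = [0.70, 0.15, 0.12, 0.03]
n = 6  # hypothetical number of gametes scored

example = multinomial_pmf((4, 1, 1, 0), probs)

# Sum over every split of n into four class counts
total = sum(multinomial_pmf((a, b, c, n - a - b - c), probs)
            for a in range(n + 1)
            for b in range(n + 1 - a)
            for c in range(n + 1 - a - b))

print("P{4, 1, 1, 0} =", round(example, 6))
print("sum over all outcomes =", round(total, 10))
```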


Developed Examples contd.

Background: Recombinant Interference

Greater physical distance between loci → greater chance to recombine (homologous). Departure from additivity increases with distance, hence mapping.

Example: 2 loci A, B on the same chromosome, segregating for two alleles at each locus (A, a, B, b) → gametes AB, Ab, aB, ab. Parental types AB, ab give Ab and aB recombinants. Simple ratio. Denote the recombinant fraction as R.F. (r):

$$\hat{r} = \frac{n_r}{n} = \text{recombination fraction}$$

Example: For 3 linked loci A, B, C, the relationship is based on simple prob. theory:

$$r_{AC} = r_{AB} + r_{BC} - 2\,r_{AB}\,r_{BC}$$

With 3 possible R.F.s, interference has to be allowed for, so more generally

$$r_{AC} = r_{AB} + r_{BC} - 2\,C^{*}\,r_{AB}\,r_{BC}$$

where C* = Coefft. of Coincidence and Interference = 1 - C*, with

$$C^{*} = \frac{r_{12}}{r_{AB}\,r_{BC}}, \quad \text{where } r_{12} = \text{true double recombinant frequency}$$


Example contd.- LINKAGE/G.M CONSTRUCTION

• Genetic Map: models the linear arrangement of a group of genes / markers (easily identified genetic features, e.g. a change in a known gene, or a piece of DNA with no known function). Map based on homologous recombination during meiosis. If 2 markers are close on a chromosome, their alleles are usually inherited together through meiosis.

• 4 basic steps after marker data obtained:
  (i) Pairwise linkage: all 2-locus combinations (based on observed and expected frequencies of genotypic classes).
  (ii) Grouping markers into Linkage Groups (based on R.F.’s, significance level etc.). If there is good genome coverage (many markers, good data and genetic model), the no. of linkage groups should equal the haploid no. of chromosomes for the organism.
  (iii) Ordering markers within each group (key step, computationally demanding, precision important).
  (iv) Estimation of multipoint R.F. (physical distance, i.e. no. of DNA base pairs between two genes, vs map distance => transformation of R.F.).

• Ultimate Physical map = DNA sequence (restriction map also common)


Example contd.

GENETIC LINKAGE and MAPPING

• Linkage Phase: chromatid associations of alleles of linked loci; same chromosome = coupling, different = repulsion.
• Genetic Recombination: define R.F. (e.g. in terms of gametes or phenotypes); homologous case: the greater the distance between loci, the greater the chance of recombining. High interference = problem for multiple-locus models. R.F. between loci not additive. Need Mapping Function.

• Haldane’s Mapping Function

Assume crossovers occur randomly along the chromosome length and the average number = λ; model as Poisson, so

$$P\{\text{NO crossover}\} = e^{-\lambda} \quad \text{and} \quad P\{\text{Crossover}\} = 1 - e^{-\lambda}$$


Example - continued

• P{recombinant} = 0.5 P{Crossover} (for each pair of homologs, one crossover results in one-half recombinant gametes).
• Define the Expected No. of recombinants in terms of the mapping function (m = 0.5λ) → R.F. r = 0.5(1 - e^{-2m}) (form of Haldane’s M.F.), with inverse m = -0.5 ln(1 - 2r), so converting an estimated R.F. to Haldane’s map distance.
• Thus, for locus order ABC, m_AC = m_AB + m_BC (where m_AB = -0.5 ln(1 - 2 r_AB), etc.).

Substituting for each of these gives us the usual relationship between R.F.’s (for the no-interference situation):

$$1 - 2r_{AC} = (1 - 2r_{AB})(1 - 2r_{BC}) \;\Rightarrow\; r_{AC} = r_{AB} + r_{BC} - 2\,r_{AB}\,r_{BC}$$

• Net Effect: transform to a straight line, i.e. m_AC vs m_AB or m_BC.

• In practice: too simple / only applies to specific conditions; may not relate directly to physical distance (= common Mapping Fn. issue).
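Haldane’s mapping function and its inverse can be sketched in a few lines. The recombination fractions 0.10 and 0.20 are hypothetical values chosen for illustration, and the function names are not from the lecture. The check confirms that adding map distances for locus order ABC and converting back gives the no-interference relationship r_AC = r_AB + r_BC - 2 r_AB r_BC.

```python
import math

def haldane_rf(m):
    """Recombination fraction from map distance: r = 0.5 (1 - exp(-2m))."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))

def haldane_map_distance(r):
    """Inverse mapping function: m = -0.5 ln(1 - 2r), valid for 0 <= r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)

r_ab, r_bc = 0.10, 0.20   # hypothetical R.F.s for adjacent intervals

# Map distances are additive for locus order ABC
m_ac = haldane_map_distance(r_ab) + haldane_map_distance(r_bc)

r_ac = haldane_rf(m_ac)                          # back-transform to an R.F.
r_ac_formula = r_ab + r_bc - 2.0 * r_ab * r_bc   # no-interference relationship

print("m_AC =", round(m_ac, 4))
print("r_AC from map distance:", round(r_ac, 6))
print("r_AC from formula     :", round(r_ac_formula, 6))
```

Note that the R.F.s themselves are not additive here (0.10 + 0.20 = 0.30, whereas r_AC = 0.26), which is why the transformation to map distance is used.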