Problems in Data Analyses
General case in data analysis
• Assumption distortions
• Missing data
General Assumptions of ANOVA
• The error terms are randomly and normally distributed: populations (for each condition) are normally distributed.
• The variances of the different populations are homogeneous (homoscedasticity): populations (for each condition) have equal variances.
• Variances and means of different populations are not correlated (independent).
• The main effects are additive.
CRD ANOVA F-Test Assumptions
• Randomness & normality
• Homogeneity of variance
• Independence of errors
• Additivity
Randomized Block F-Test Assumptions
1. Normality: populations are normally distributed.
2. Homogeneity of variance: populations have equal variances.
3. Independence of errors: independent random samples are drawn.
4. The main effects are additive.
5. No interaction between blocks & treatments.
Random, Independent and Normal Distribution
• The assumption of normality does not affect the validity of the analysis of variance too seriously.
• There are tests for normality, but it is rather pointless to apply them unless the number of samples we are dealing with is fairly large.
• Independence implies that there is no relation between the size of the error terms and the experimental grouping to which they belong.
• It is important to avoid having all plots receiving a given treatment occupy adjacent positions in the field.
• The best insurance against seriously violating the first assumption of the analysis of variance is to carry out the randomization appropriate to the particular design.
Normality
Reason:
• ANOVA is an Analysis of Variance.
• More specifically, it is an analysis of two variances: the ratio of two variances.
• Statistical inference is based on the F distribution, which is given by the ratio of two chi-squared distributions.
• It is then no surprise that each variance in the ANOVA ratio should come from a parent normal distribution.
Calculations can always be derived no matter what the distribution is; they are algebraic properties of separating sums of squares. Normality is only needed for statistical inference.
Diagnosis: Normality
• The points on the normality plot must more or less follow a line to claim "normally distributed".
• There are statistical tests to verify this formally.
• The ANOVA method we learn here is not sensitive to the normality assumption: a mild departure from the normal distribution will not change our conclusions much.
Normality plot: normal scores vs. residuals.
Normality Tests
A wide variety of tests can be performed to test whether the data follow a normal distribution. Mardia (1980) provides an extensive list for both the univariate and multivariate cases, categorized into two types:
• Tests based on properties of the normal distribution (more specifically, its first four moments):
  - Shapiro-Wilk's W (compares the ratio of the standard deviation to the variance, multiplied by a constant, to one)
  - Lilliefors-Kolmogorov-Smirnov test
  - Graphical methods based on the residual errors (residual plots)
• Goodness-of-fit tests:
  - Kolmogorov-Smirnov D
  - Cramér-von Mises W²
  - Anderson-Darling A²
Checking for Normality
Tools:
1. Histogram and/or box-plot of all residuals (e_ij).
2. Normal probability (Q-Q) plot.
3. Formal test for normality.
Reminder: normality of the RESIDUALS is assumed. The original data are assumed normal as well, but each group may have a different mean if Ha is true. The practice is to first fit the model, THEN output the residuals, and then test the residuals for normality. This approach is always correct.
Histogram of Residuals
* Fit the one-way ANOVA and output the residuals for the checks below;
proc glm data=stress;
class sand;
model resistance = sand / solution;
output out=resid r=r_resis p=p_resis ;
title1 'Compression resistance in concrete beams as';
title2 ' a function of percent sand in the mix';
run;
proc capability data=resid;
histogram r_resis / normal;
ppplot r_resis / normal square ;
run;
Formal Tests of Normality
• Kolmogorov-Smirnov test; Anderson-Darling test (both based on the empirical CDF).
• Shapiro-Wilk's test; Ryan-Joiner test (both correlation-based tests, applicable for n < 50).
• D'Agostino's test (n >= 50).
All are quite conservative: they fail to reject the null hypothesis of normality more often than they should.
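In SAS, several of these statistics (Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling) can be requested at once with PROC UNIVARIATE. This is a sketch only, assuming the residual dataset resid and variable r_resis created by the PROC GLM step shown earlier:

* Sketch: formal normality tests on the ANOVA residuals;
proc univariate data=resid normal;
   var r_resis;
   qqplot r_resis / normal(mu=est sigma=est);   * Q-Q plot against the fitted normal;
run;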
Shapiro-Wilk's W Test
Let e_1, e_2, ..., e_n represent the data ranked from smallest to largest.
H0: The population has a normal distribution.
HA: The population does not have a normal distribution.

Test statistic:
W = \frac{1}{d}\left[\sum_{i=1}^{k} a_i \,(e_{n-i+1} - e_i)\right]^2,
where d = \sum_{i=1}^{n} (e_i - \bar{e})^2,
with k = n/2 if n is even and k = (n-1)/2 if n is odd.

R.R.: Reject H0 if W < W_0.05.
The coefficients a_i come from a table; the critical values of W come from a table.
Shapiro-Wilk Coefficients and W Table
(The original slides show the lookup tables of the coefficients a_i and the critical values of W.)
D'Agostino's Test
Let e_1, e_2, ..., e_n represent the data ranked from smallest to largest.
H0: The population has a normal distribution.
Ha: The population does not have a normal distribution.

Test statistic:
D = \frac{\sum_{j=1}^{n}\left[\,j - \tfrac{1}{2}(n+1)\right] e_j}{n^2 s},
\quad\text{where } s = \sqrt{\frac{1}{n}\sum_{j=1}^{n}(e_j - \bar{e})^2},
\qquad Y = \frac{\sqrt{n}\,(D - 0.28209479)}{0.02998598}.

Two-sided test: Reject H0 if Y <= Y_0.025 or Y >= Y_0.975, where Y_0.025 and Y_0.975 come from a table of percentiles of the Y statistic.
The Consequences of Non-Normality
• The F-test is very robust against non-normal data, especially in a fixed-effects model.
• A large sample size will approximate normality by the Central Limit Theorem (recommended sample size > 50).
• Simulations have shown that unequal sample sizes between treatment groups magnify any departure from normality.
• A large deviation from normality leads to hypothesis-test conclusions that are too liberal, and to a decrease in power and efficiency.
Remedial Measures for Non-Normality
• Data transformation. Be aware: transformations may lead to a fundamental change in the relationship between the dependent and the independent variable, and are not always recommended.
• Don't use the standard F-test; instead consider:
  - Modified F-tests (adjust the degrees of freedom)
  - Rank F-test (capitalizes on the F-test's robustness)
  - Randomization test on the F-ratio
  - Other non-parametric tests if the distribution is unknown (one option is sketched below)
  - Making up our own test using a likelihood ratio if the distribution is known
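As a sketch of the non-parametric option, a Kruskal-Wallis rank test could be run with PROC NPAR1WAY, reusing the stress/sand example from the earlier slides:

* Sketch: Kruskal-Wallis rank test as a non-parametric alternative to the F-test;
proc npar1way data=stress wilcoxon;
   class sand;
   var resistance;
run;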
Homogeneity of Variances
Eisenhart (1947) describes the problem of unequal variances as follows:
• The ANOVA model is based on the ratio of the mean squares of the factors to the residual mean square.
• The residual mean square is the unbiased estimator of σ², the variance of a single observation.
• The between-treatment mean square takes into account not only the differences between observations, σ², just like the residual mean square, but also the variance between treatments.
• If there is non-constant variance among treatments, the residual mean square can be replaced with some overall variance, σ_a², and a treatment variance, σ_t², which is some weighted version of σ_a².
• The "neatness" of ANOVA is lost.
Homogeneity of Variances
• The overall F-test is very robust against heterogeneity of variances, especially with fixed effects and equal sample sizes.
• Tests for treatment differences, such as t-tests and contrasts, are severely affected, resulting in inferences that may be too liberal or too conservative.
• Unequal variances can have a marked effect on the level of the test, especially if smaller sample sizes are associated with groups having larger variances.
• Unequal variances will lead to biased conclusions.
Ways to Solve the Problem of Heterogeneous Variances
• The data can be separated into groups such that the variances within each group are homogeneous.
• A more advanced statistical test can be used instead of the analysis of variance.
• The data can be transformed in such a way that they become homogeneous.
Tests for Homogeneity of Variances
• Bartlett's Test.
• Levene's Test: computes a one-way ANOVA on the absolute value (or sometimes the square) of the residuals, |y_ij − ŷ_i|, with t − 1 and N − t degrees of freedom. Considered robust to departures from normality, but too conservative.
• Brown-Forsythe Test: a slight modification of Levene's test, where the median is substituted for the mean (Kuehl (2000) refers to it as the Levene (med) test).
• The Fmax Test (Hartley Test): the ratio of the largest treatment-group variance to the smallest, compared against a critical-value table.
Several of these tests can be requested in SAS, as sketched below.
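A sketch using the HOVTEST= option of the MEANS statement in PROC GLM (one-way models only), again with the stress/sand example from the earlier slides:

* Sketch: homogeneity-of-variance tests for the one-way model;
proc glm data=stress;
   class sand;
   model resistance = sand;
   means sand / hovtest=levene(type=abs);   * Levene on absolute residuals;
   means sand / hovtest=bf;                 * Brown-Forsythe (median-based);
   means sand / hovtest=bartlett;           * Bartlett's test;
run;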
Bartlett's Test
Allows unequal replication, but requires normality.

C = \left[\sum_{i=1}^{t}(n_i - 1)\right]\log s_e^2 \;-\; \sum_{i=1}^{t}(n_i - 1)\log s_i^2,
\quad\text{with pooled variance}\quad
s_e^2 = \frac{\sum_{i=1}^{t}(n_i - 1)\,s_i^2}{\sum_{i=1}^{t}(n_i - 1)}.

If C > \chi^2_{\alpha,\,t-1}, then apply the correction factor
CF = 1 + \frac{1}{3(t-1)}\left[\sum_{i=1}^{t}\frac{1}{n_i - 1} \;-\; \frac{1}{\sum_{i=1}^{t}(n_i - 1)}\right].

Reject H0 if C/CF > \chi^2_{\alpha,\,t-1}.
More work, but better power.
Levene's Test
Let z_ij = |y_ij − ỹ_i|, where ỹ_i is the sample median of the i-th group (this median-based version is the Levene (med) / Brown-Forsythe variant), and let n_T = \sum_{i=1}^{t} n_i.

L = \frac{\sum_{i=1}^{t} n_i\,(\bar{z}_{i\cdot} - \bar{z}_{\cdot\cdot})^2 \,/\, (t-1)}{\sum_{i=1}^{t}\sum_{j=1}^{n_i} (z_{ij} - \bar{z}_{i\cdot})^2 \,/\, (n_T - t)}

Reject H0 if L > F_{\alpha,\,df_1,\,df_2}, with df_1 = t − 1 and df_2 = n_T − t.
This is essentially an ANOVA on the z_ij.
More work, but a powerful result.
Hartley's Test (Fmax)
A logical extension of the F test for t = 2. Requires equal replication, r, among groups, and requires normality.

F_max = s²_max / s²_min

Reject H0 if F_max exceeds the tabled critical value F_max(α; t, n − 1).

Tabachnik and Fidell (2001) use the Fmax ratio more as a rule of thumb than with a table of critical values:
• The Fmax ratio is no greater than 10.
• Sample sizes of the groups are approximately equal (ratio of smallest to largest no greater than 4).
Tests for Homogeneity of Variances
More importantly: variance tests are only for one-way ANOVA.
WARNING: homogeneity-of-variance testing is only available for unweighted one-way models.
Tests for Homogeneity of Variances (Randomized Complete Block Design and/or Factorial Design)
• In a CRD, the variance of each treatment group is checked for homogeneity.
• In a factorial/RCBD, each cell's variance should be checked.
• H0: σ²_ij = σ²_i'j' for all i, j where i ≠ i', j ≠ j'.
Tests for Homogeneity of Variances (Latin-Square / Split-Plot Designs)
• If there is only one score per cell, homogeneity of variances needs to be shown for the marginals of each column and each row:
  - each factor for a Latin square
  - whole plots and subplots for a split-plot
• If there are repetitions, homogeneity is to be shown within each cell, as in an RCBD.
Remedial Measures for Heterogeneous Variances
Studies that do not involve repeated measures:
• If normality is violated, the data transformation necessary to normalize the data will usually stabilize the variances as well.
• If the variances are still not homogeneous, non-ANOVA tests might be an option.
Transformations to Achieve Homoscedasticity
What can we do if the homoscedasticity (equal variances) assumption is rejected?
1. Declare that the ANOVA model is not an adequate model for the data; look for alternative models.
2. Try to "cheat" by forcing the data to be homoscedastic through a transformation of the response variable Y (variance-stabilizing transformations).
Independence
Independent observations:
• No correlation between error terms.
• No correlation between independent variables and errors.
Positively correlated data inflate the standard error:
• The estimation of the treatment means is more accurate than the standard error shows.
Dependence is a special case, and the most common cause, of heterogeneity of variance.
Independence Tests
• If some notion of how the data were collected is available, a check can be made for any autocorrelation.
• The Durbin-Watson statistic looks at the correlation between each value and the value before it:
  - The data must be sorted in the correct order for meaningful results.
  - For example, samples collected over time would be ordered by time if the results are suspected to depend on time.
A sketch of this check is given below.
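One rough way to do this in SAS: sort the residual dataset by a collection-order variable (here a hypothetical variable named order, not part of the earlier example) and request the Durbin-Watson statistic with the DW option of PROC REG. This is a sketch under those assumptions, not the only possible approach:

* Sketch: Durbin-Watson check for serial correlation in the residuals;
* "order" is a hypothetical collection-sequence variable;
proc sort data=resid;
   by order;
run;
proc reg data=resid;
   model r_resis = order / dw;   * DW printed for the residuals of this fit;
run;
quit;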
Independence
• A positive correlation between means and variances is often encountered when there is a wide range of sample means.
• Data that often show a relation between variances and means are data based on counts and data consisting of proportions or percentages.
• Transforming the data can frequently solve these problems.
Remedial Measures for Dependent Data
• The first defense against dependent data is proper study design and randomization.
  - Designs can be implemented that take correlation into account, e.g., a crossover design.
• Look for environmental factors that are unaccounted for.
  - Add covariates to the model if they are causing the correlation, e.g., quantified learning curves.
• If no underlying factors can be found to which the autocorrelation can be attributed:
  - Use a different model, e.g., a random-effects model.
  - Transform the independent variables using the correlation coefficient.
The Main Effects Are Additive
• For each design, there is a mathematical model called the linear additive model.
• It means that the value of an experimental unit is made up of the general mean plus the main effects plus an error term.
• When the effects are not additive, there are multiplicative treatment effects.
• In the case of multiplicative treatment effects, there are again transformations that will change the data to fit the additive model.
Data Transformation
There are two ways in which the ANOVA assumptions can be violated:
1. The data may consist of measurements on an ordinal or a nominal scale.
2. The data may not satisfy at least one of the four requirements.
Two options are available to analyze such data:
1. Use non-parametric data analysis.
2. Transform the data before analysis.
Square Root Transformation
• It is used when we are dealing with counts of rare events.
• The data tend to follow a Poisson distribution.
• If there are counts less than 10, it is better to add 0.5 to the values.

z_i = \sqrt{y_i}, appropriate when \sigma_i^2 = k\,\mu_i with k > 0.

This transformation works when we notice the variance changing as a linear function of the mean.
• Useful for count data (Poisson distributed).
• For small values of Y, use \sqrt{Y + 0.5}.
• The response is positive and continuous.
Typical use: counts of items when the counts are between 0 and 10.
(Plot: sample variance vs. sample mean, showing an approximately linear relationship.)
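A minimal SAS sketch of this transformation; the dataset counts and the count variable y are hypothetical names, not data from these slides:

* Sketch: square-root transformation for count data;
data counts_sqrt;
   set counts;
   z = sqrt(y + 0.5);   * 0.5 added because some counts are below 10 (or zero);
run;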
Logarithmic Transformation
• It is used when the standard deviations of the samples are roughly proportional to the means.
• There is evidence of multiplicative rather than additive effects.
• Data with negative values or zeros cannot be transformed; it is suggested to add 1 before transformation.

Z = \ln(Y), appropriate when \sigma_i^2 = k\,\mu_i^2 with k > 0.

This transformation tends to work when the variance is a linear function of the square of the mean.
• Replace Y by Y + 1 if zeros occur.
• Useful if effects are multiplicative (see later).
• Useful if there is considerable heterogeneity in the data.
• The response is positive and continuous.
Typical uses: 1. growth over time; 2. concentrations; 3. counts when the counts are greater than 10.
(Plot: sample variance vs. sample mean.)
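A corresponding sketch for the log transformation, including re-running the ANOVA on the transformed scale; the dataset growth, response y, and treatment trt are hypothetical names:

* Sketch: log transformation and re-analysis on the transformed scale;
data growth_log;
   set growth;
   z = log(y + 1);   * natural log; 1 added so zero responses can be transformed;
run;
proc glm data=growth_log;
   class trt;
   model z = trt;
run;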
Arcsine (Angular) Transformation
• It is used when we are dealing with counts expressed as percentages or proportions of the total sample.
• Such data generally have a binomial distribution.
• Such data normally show the typical characteristic that the variances are related to the means.

With proportions, the variance is a linear function of the mean times (1 − mean), where the sample mean is the expected proportion.
• Y is a proportion (a decimal between 0 and 1).
• Zero counts should be replaced by 1/4, and N by N − 1/4, before converting to percentages.

Z = \sin^{-1}(\sqrt{Y}) = \arcsin(\sqrt{Y}), appropriate when \sigma_i^2 = k\,\mu_i(1 - \mu_i).

The response is a proportion.
Typical uses: 1. proportion of seeds germinating; 2. proportion responding.
Arcsine Square Root
(Plot: sample variance vs. sample mean for proportions between 0 and 1.)
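A sketch of the angular transformation for proportions, including the suggested adjustments for zero and full counts; the dataset germ and the count y out of n are hypothetical names:

* Sketch: arcsine square-root transformation for proportions;
data germ_arcsine;
   set germ;
   y_adj = y;
   if y = 0 then y_adj = 0.25;       * replace zero counts by 1/4;
   if y = n then y_adj = n - 0.25;   * replace full counts N by N - 1/4;
   p = y_adj / n;                    * proportion between 0 and 1;
   z = arsin(sqrt(p));               * ARSIN returns the angle in radians;
run;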
Reciprocal Transformation
This transformation works when the variance is a linear function of the fourth power of the mean.
• Use Y + 1 if zeros occur.
• Useful if the reciprocal of the original scale has meaning.
• The response is positive and continuous.

Z = 1/Y, appropriate when \sigma_i^2 = k\,\mu_i^4.

Typical use: survival time.
(Plot: sample variance vs. sample mean.)
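And a short sketch of the reciprocal transformation; the dataset surv and survival-time variable y are hypothetical names:

* Sketch: reciprocal transformation, e.g. survival time to a rate-like scale;
data surv_recip;
   set surv;
   z = 1 / (y + 1);   * 1 added in case zeros occur on the original scale;
run;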
Box-Cox Transformations (advanced)
Suggested transformation:

z_i = \frac{y_i^{\lambda} - 1}{\lambda\,\dot{y}^{\,\lambda - 1}} \quad (\lambda \neq 0),
\qquad
z_i = \dot{y}\,\ln(y_i) \quad (\lambda = 0),

where \dot{y} = \exp\!\left[\tfrac{1}{n}\sum_{i=1}^{n}\ln(y_i)\right] is the geometric mean of the original data.

The exponent, λ, is unknown. Hence the model can be viewed as having an additional parameter which must be estimated (choose the value of λ that minimizes the residual sum of squares).
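One way to estimate λ in SAS is PROC TRANSREG, which profiles a grid of candidate values and flags the one minimizing the error sum of squares. This is a sketch with hypothetical dataset and variable names (mydata, y, trt):

* Sketch: Box-Cox lambda estimation over a grid of candidate values;
proc transreg data=mydata;
   model boxcox(y / lambda=-2 to 2 by 0.25) = class(trt);
run;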
General case in data analysis
• Assumption distortions
• Missing data
Missing Data
Reasons for missing data:
• An animal may die.
• An experimental plot may be flooded out.
• A worker may be ill and not turn up on the job.
• A jar of jelly may be dropped on the floor.
• The recorded data may be lost.
Missing observations are observations that were intended to be made but were not.
Since most experiments are designed with at least some degree of balance/symmetry, any missing observations will destroy that balance.
Missing Data
• In the presence of missing data, the research goal remains making inferences that apply to the population targeted by the complete sample; i.e., the goal remains what it would have been had we seen the complete data.
• However, both making inferences and performing the analysis are now more complex.
• It is necessary to make assumptions in order to draw inferences, and then to use an appropriate computational approach for the analysis.
• Consider the causes and pattern of the missing data in order to make appropriate changes to the planned analysis.
Missing Data
• Avoid adopting computationally simple solutions (such as analyzing only the complete data or carrying forward the last observation in a longitudinal study), which generally lead to misleading inferences.
• In a one-factor experiment, the data analysis can be carried out with a good estimated value, but a factorial experiment theoretically cannot be analyzed this way.
• In a one-factor CRD, if there are missing data, the data can be analyzed with different replication numbers.
• In a one-factor RCBD, if 1-2 complete blocks or treatments are missing but at least 2 complete blocks remain, the data analysis can simply proceed.
Missing Data
In a one-factor RCBD/LS experiment, if there are 1-2 missing observations in a block or treatment, the data can be handled by:
a. the appropriate method for unequal frequencies, or
b. estimating the unknown value from the observed data.
The estimate of the missing observation is most frequently the value that minimizes the experimental error sum of squares when the regular analysis is performed.
Missing Values in RCBDs
• Missing values (generally) result in a loss of orthogonality.
• A single missing value can be imputed.
• The missing cell (y_i*j* = x) can be estimated by profile least squares:

Ŷ_ij = (tT + bB − S) / [(t − 1)(b − 1)]

Where:
• t = number of treatments
• b = number of blocks
• T = sum of the observations with the same treatment as the missing observation
• B = sum of the observations in the same block as the missing observation
• S = grand total (sum) of all the observed values
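A minimal data-step sketch of this estimate; the numeric inputs below are arbitrary placeholders, not values taken from these slides:

* Sketch: estimating a single missing value in an RCBD (placeholder inputs);
data rcbd_impute;
   t     = 4;       * number of treatments (placeholder);
   b     = 5;       * number of blocks (placeholder);
   T_trt = 30.2;    * sum of observed values for the treatment with the missing cell;
   B_blk = 25.7;    * sum of observed values in the block with the missing cell;
   S_all = 160.4;   * grand total of all observed values;
   y_hat = (t*T_trt + b*B_blk - S_all) / ((t - 1)*(b - 1));
   put y_hat=;      * imputed value written to the log;
run;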
Imputation
• The error df should be reduced by one, since M was estimated.
• SAS can compute the F statistic, but the p-value will have to be computed separately.
• The method is efficient only when a couple of cells are missing.
• The usual Type III analysis is available, but be careful with the interpretation.
• Little and Rubin use MLE and simulation-based approaches.
• PROC MI in SAS v9 implements the Little and Rubin approaches (sketched below).
Missing Data in a Latin Square
If only one plot is missing, you can use the following formula:

Ŷ_ij(k) = [t(R_i + C_j + T_k) − 2G] / [(t − 1)(t − 2)]

Where:
• R_i = sum of the remaining observations in the i-th row
• C_j = sum of the remaining observations in the j-th column
• T_k = sum of the remaining observations in the k-th treatment
• G = grand total of the available observations
• t = number of treatments
Total and error df must be reduced by 1. The estimate is used only to obtain a valid ANOVA and should not be used in the computation of means.
Relative Efficiency
RE(RCB, CR): the relative efficiency of the randomized complete block design compared to a completely randomized design.
Did blocking increase our precision for comparing treatment means in a given experiment?

RE(RCB, CR) = MSE_CR / MSE_RCB = [(b − 1)MSB + b(t − 1)MSE] / [(bt − 1)MSE]

where MSE and MSB come from the RCB analysis.
If RE(RCB, CR) > 1, then blocking is efficient, because many more observations would be required in a CRD than in the RCB to achieve the same precision.
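A small data-step sketch of this computation; MSB, MSE, b, and t below are placeholders that would be read off a fitted RCB ANOVA table, not values from these slides:

* Sketch: relative efficiency of RCB vs CRD (placeholder inputs);
data re_rcb_cr;
   b   = 6;      * number of blocks (placeholder);
   t   = 4;      * number of treatments (placeholder);
   MSB = 12.8;   * block mean square from the RCB analysis (placeholder);
   MSE = 2.1;    * error mean square from the RCB analysis (placeholder);
   RE  = ((b - 1)*MSB + b*(t - 1)*MSE) / ((b*t - 1)*MSE);
   put RE=;      * RE > 1 means blocking increased precision;
run;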
Relative Efficiency of an LS
To compare with an RCBD using columns as blocks:
RE = [MSR + (t − 1)MSE] / (t MSE)
To compare with an RCBD using rows as blocks:
RE = [MSC + (t − 1)MSE] / (t MSE)
To compare with a CRD:
RE = [MSR + MSC + (t − 1)MSE] / [(t + 1)MSE]