DATA SCREENING Wei-Jiun, Shen Ph. D.


Page 1: Data screening

DATA SCREENING
Wei-Jiun, Shen Ph. D.

Page 2: Data screening

Anything that can go wrong will go wrong

Page 3: Data screening

Why do we need to screen data?

Page 4: Data screening

Purpose

Detect and correct data errors
Detect and treat missing data
Detect and handle insufficiently sampled variables
Conduct transformations and standardizations
Detect and handle outliers

Page 5: Data screening

First concern

Accuracy of data file
  Descriptive statistics
  Graphic representations
Honest correlations
Missing data
  Pattern or amount
  Random or not
Outliers

Page 6: Data screening

MISSING DATA
The “blank” parts of a data set

Page 7: Data screening

Why is missing data a problem?

Systematic problem
  Biased sampling
    Demographic variables
  Inappropriate measuring procedure
    Behavioral items
Insufficient amount for analysis
  Small sample
Misleading research results
  Biased data in, _______ out

Page 8: Data screening

Probability distribution of missingness

Consider the probability of missingness
  Are certain groups more likely to have missing values?
    Are female respondents less likely to report age?
  Are certain responses more likely to be missing?
    Are respondents with high SPA less likely to report anxiety?
Certain analysis methods assume a certain probability distribution of missingness
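The first question above (are certain groups more likely to have missing values?) can be checked by tabulating the missing rate per group. The sketch below is only an illustration: the records, the field names `gender` and `age`, and the helper `missing_rate_by_group` are all hypothetical, and `None` is assumed to mark a missing answer.

```python
# Hypothetical survey records; None marks a missing answer.
records = [
    {"gender": "F", "age": None},
    {"gender": "F", "age": 34},
    {"gender": "F", "age": None},
    {"gender": "F", "age": 29},
    {"gender": "M", "age": 41},
    {"gender": "M", "age": 38},
    {"gender": "M", "age": None},
    {"gender": "M", "age": 25},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: proportion of rows whose value_key is missing}."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        if row[value_key] is None:
            missing[group] = missing.get(group, 0) + 1
    return {group: missing.get(group, 0) / totals[group] for group in totals}


rates = missing_rate_by_group(records, "gender", "age")
print(rates)
```

A clear difference in rates between groups is a hint that missingness is related to the grouping variable, which would argue against MCAR.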

Page 9: Data screening

Missing completely at random (MCAR)

Missing data is independent of any other measured variable (y2) and independent of the variable itself (y1)
E.g., SES=y2; depression=y1. If participants dropped out across the whole range of SES levels, then the missingness on depression would be independent of SES
Little’s MCAR test in MVA indicates whether the data are MCAR or not (want ns, i.e., a nonsignificant result)

Page 10: Data screening

Missing at random (MAR)

Missing data may be dependent on another measured variable (y2), but is independent of the variable itself (y1)
E.g., SES=y2; depression=y1. If only participants from high levels of SES dropped out, then the missingness on depression would be dependent on SES
MAR can be inferred if Little’s test is significant but missingness is predictable from other variables (other than the variable itself), as tested by the Separate Variance Test
MNAR (missing not at random) is indicated if this test reveals missingness related to the DV
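The idea behind a separate variance test can be sketched as follows: split the cases by whether y1 (depression) is missing, then compare the two groups on y2 (SES) with a t statistic that does not pool the variances. This is a minimal illustration with made-up data, not the output of any particular package. The helper `welch_t` is hypothetical, and no p-value is computed because the Python standard library has no t distribution; the statistic would be compared against a t table using the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Separate-variance (Welch) t statistic and its approximate df."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical (SES, depression) pairs; None = depression score missing.
cases = [(2, 14), (3, 12), (4, 15), (5, None), (6, 11),
         (7, None), (8, None), (9, None), (3, 13), (4, 10)]

ses_when_missing = [ses for ses, dep in cases if dep is None]
ses_when_observed = [ses for ses, dep in cases if dep is not None]

t, df = welch_t(ses_when_missing, ses_when_observed)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large |t| here would mean that missingness on depression is related to SES, which is the pattern described for MAR.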

Page 11: Data screening

Treatment for missing data

Deleting cases or variables
  Descriptive statistics
Estimating missing data
Using a missing data correlation matrix
Treating missing data as data
Repeating analyses with and without missing data
Choosing among methods for dealing with missing data
  Pattern or amount

Page 12: Data screening

Deletion or preservation?

Deletion
  <5%
  MCAR/MAR
Preservation
  MNAR
  Small sample
Replacement
  Mean (grand or group)
  Regression (predict the missing value from other IVs)
  Expectation Maximization (form the missing data correlation matrix from an assumed distribution)
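Mean replacement, the simplest of the replacement options above, can be sketched in a few lines. The function names `impute_grand_mean` and `impute_group_mean` and the data are hypothetical, and `None` is assumed to mark a missing score.

```python
from statistics import mean


def impute_grand_mean(values):
    """Replace each missing value with the mean of all observed values."""
    grand = mean(v for v in values if v is not None)
    return [grand if v is None else v for v in values]


def impute_group_mean(values, groups):
    """Replace each missing value with the mean of its own group.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    group_means = {g: mean(vs) for g, vs in observed.items()}
    return [group_means[g] if v is None else v for v, g in zip(values, groups)]


scores = [10, 12, None, 20, 22, None]
groups = ["a", "a", "a", "b", "b", "b"]

print(impute_grand_mean(scores))
print(impute_group_mean(scores, groups))
```

Either form of mean replacement shrinks the variance of the variable, because the filled-in cases sit exactly on a mean. Group means distort the data less when the groups really differ.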

Page 13: Data screening

OUTLIER
Cases with extreme values on variables

Page 14: Data screening

Why are outliers a problem?

Systematic problem
  Biased sampling
  Wrong population
Statistical problem
  ↑ error variance
  ↓ statistical power
  ↑ Type I, II error
  ↓ normality
Misleading research results
  Biased data in, _______ out

Page 15: Data screening

Influence of outlier

Leverage × discrepancy
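One common way to make "leverage × discrepancy" concrete is Cook's distance, which combines a case's leverage (how far it sits from the other cases on the predictor) with its residual (how far it sits from the fitted line). The slide does not name a specific influence measure, so using Cook's distance is an assumption. The sketch below covers simple regression only, with made-up data and a hypothetical helper `influence_measures`.

```python
from statistics import mean


def influence_measures(x, y):
    """Leverage, residuals and Cook's distance for simple linear regression."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated parameters: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverage = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    cooks = [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]
    return leverage, residuals, cooks


# Hypothetical data: the last case is far out on x and off the trend.
x = [1, 2, 3, 4, 5, 12]
y = [2, 3, 4, 5, 6, 4]

leverage, residuals, cooks = influence_measures(x, y)
most_influential = cooks.index(max(cooks))
print(most_influential, round(leverage[most_influential], 3))
```

In this example the last case does not have the largest raw residual, because it pulls the line toward itself, yet its high leverage gives it by far the largest influence.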

Page 16: Data screening

Treatment for outlier

Detecting outliers
  Standardized score (|z| > 2, 2.5, 3)
  Graphical methods (P-P, Q-Q plot)
  Mahalanobis distance (χ² test)
Deletion or transformation
  Critical to analysis or not
  Preservation
Transformation
  Score alteration
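The two numerical detection methods above can be sketched as follows: z-scores for single variables, and squared Mahalanobis distance compared against a χ² critical value for several variables at once. To stay within the standard library this illustration is limited to two variables, where the covariance matrix can be inverted by hand and the χ² critical value with 2 df has the closed form −2·ln(α). The data, the helper names, and the choice of α = .001 are assumptions for the example.

```python
from math import log
from statistics import mean, stdev


def z_scores(values):
    """Standardized scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]


def chi2_critical_df2(alpha=0.001):
    """Upper-tail chi-square critical value for 2 df (closed form)."""
    return -2 * log(alpha)


# Hypothetical data: y tracks x closely, except for the last case (10, 1.0).
xs = list(range(1, 20)) + [10]
ys = [x + (0.2 if x % 2 else -0.2) for x in range(1, 20)] + [1.0]

cutoff = chi2_critical_df2(0.001)
d2 = mahalanobis_sq(xs, ys)
univariate_flags = [
    i for i, (zx, zy) in enumerate(zip(z_scores(xs), z_scores(ys)))
    if abs(zx) > 3 or abs(zy) > 3
]
multivariate_flags = [i for i, d in enumerate(d2) if d > cutoff]
print(univariate_flags, multivariate_flags)
```

The last case is unremarkable on each variable taken alone, but its combination of values is unusual, so only the Mahalanobis check flags it.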

Page 17: Data screening

NORMALITY, LINEARITY & HOMOSCEDASTICITY
Basic assumptions

Page 18: Data screening

Key assumptions in GLM

Normality
Linearity
Homogeneity of variance
Interval level data
Independence of observations

Page 19: Data screening

Normality

Normal distribution

Page 20: Data screening

Test for normality

Skewness & Kurtosis

Page 21: Data screening

Test for normality

T-test for skewness & kurtosis scores
Kolmogorov-Smirnov test (Z) & Shapiro-Wilk test (W)
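A test on the skewness and kurtosis scores divides each statistic by its standard error and treats the result as a z value. The sketch below uses simple moment-based estimates and the large-sample standard errors √(6/n) and √(24/n). Both choices are assumptions made for brevity; statistical packages typically apply small-sample corrections, so their output will differ somewhat. The data and the helper `skew_kurtosis_z` are hypothetical.

```python
from math import sqrt
from statistics import mean


def skew_kurtosis_z(values):
    """Return (skewness, excess kurtosis, z for skewness, z for kurtosis)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3   # excess kurtosis: 0 for a normal distribution
    se_skew = sqrt(6 / n)     # large-sample approximation
    se_kurt = sqrt(24 / n)    # large-sample approximation
    return skew, kurt, skew / se_skew, kurt / se_kurt


# Hypothetical scores with a long right tail.
scores = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
skew, kurt, z_skew, z_kurt = skew_kurtosis_z(scores)
print(f"skew={skew:.2f} (z={z_skew:.2f}), kurtosis={kurt:.2f} (z={z_kurt:.2f})")
```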

Page 22: Data screening

Test for normality

Plotting cumulative distribution function
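Comparing cumulative distribution functions can also be done numerically: the Kolmogorov-Smirnov statistic D is the largest vertical gap between the empirical CDF and the normal CDF. The sketch below fits the normal curve with the sample mean and SD, which is an assumption of this example; when the parameters are estimated from the same data, the standard K-S critical values no longer apply directly and a Lilliefors-type correction is needed. The helper `ks_distance` and both data sets are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ks_distance(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    normal = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal.cdf(x)
        # The empirical CDF steps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical data: one right-skewed sample, one built from normal quantiles.
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10, 25]
normal_like = [NormalDist().inv_cdf((i - 0.5) / 10) for i in range(1, 11)]

d_skewed = ks_distance(skewed)
d_normal = ks_distance(normal_like)
print(round(d_skewed, 3), round(d_normal, 3))
```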

Page 23: Data screening

Test for normality

P-P plot (probability) & Q-Q plot (quantile)
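The coordinates behind both plots are easy to compute, which helps in reading them: a Q-Q plot pairs each observed value with the normal quantile expected at its rank, and a P-P plot pairs cumulative probabilities instead. The sketch assumes the plotting position (i − 0.5)/n and a normal curve fitted with the sample mean and SD; packages may use other plotting positions, so exact coordinates can differ. The function names and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """(expected normal quantile, observed value) pairs for a Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    normal = NormalDist(mean(xs), stdev(xs))
    return [(normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


def pp_points(values):
    """(expected cumulative probability, observed cumulative proportion) pairs."""
    xs = sorted(values)
    n = len(xs)
    normal = NormalDist(mean(xs), stdev(xs))
    return [(normal.cdf(x), (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


# Hypothetical sample.
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 9.9]
qq = qq_points(sample)
pp = pp_points(sample)
for (expected_q, observed), (expected_p, observed_p) in zip(qq, pp):
    print(f"{expected_q:6.2f} {observed:5.1f}   {expected_p:5.3f} {observed_p:5.3f}")
```

If the data are close to normal, both sets of points fall near the diagonal; systematic bowing away from it suggests skewness or heavy tails.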

Page 24: Data screening

Linearity

Straight-line relationship between 2 variables

Page 25: Data screening

Homoscedasticity

Homogeneity of variance
Homogeneity of variance-covariance matrix
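For grouped data, a quick screening check of homogeneity of variance is the ratio of the largest group variance to the smallest, often called Fmax. The slide does not name a specific test, so this choice is an assumption, and the acceptable size of the ratio depends on how equal the group sizes are. The groups and the helper `f_max` are hypothetical.

```python
from statistics import variance


def f_max(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)


# Hypothetical scores for two groups with the same mean but different spread.
group_a = [4, 5, 6, 5, 4, 6]
group_b = [1, 5, 9, 3, 7, 5]

ratio = f_max([group_a, group_b])
print(round(ratio, 2))
```

A ratio near 1 is consistent with equal variances; a large ratio signals heteroscedasticity that deserves a formal test or a transformation.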

Page 26: Data screening

Homoscedasticity

Residual

Page 27: Data screening

COMMON DATA TRANSFORMATIONS

Page 28: Data screening

Data transformations

Direction  Skewness               Treatment
+          Moderate               New X = SQRT(X)
+          Substantial            New X = LG10(X)
+          Substantial with zero  New X = LG10(X + C)
+          Severe                 New X = 1/X
+          L-shaped with zero     New X = 1/(X + C)
-          Moderate               New X = SQRT(K - X)
-          Substantial            New X = LG10(K - X)
-          J-shaped               New X = 1/(K - X)

C = a constant added to each score so that the smallest score is 1.
K = a constant from which each score is subtracted so that the smallest score is 1; usually equal to the largest score + 1.
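The transformations in the table can be sketched in Python as follows. The function names are hypothetical. As an assumption of this example, C is only applied when the data contain scores at or below zero (otherwise it is 0, which gives the plain LG10(X) and 1/X rows), and SQRT(X) is applied without a shift, so it needs non-negative scores.

```python
from math import log10, sqrt


def positive_skew_transforms(values):
    """Square root, log and inverse transforms for positively skewed scores."""
    lowest = min(values)
    c = 1 - lowest if lowest <= 0 else 0  # shift so the smallest score is 1
    return {
        "sqrt": [sqrt(x) for x in values],  # requires x >= 0
        "log10": [log10(x + c) for x in values],
        "inverse": [1 / (x + c) for x in values],
    }


def negative_skew_transforms(values):
    """Reflect scores around K, then apply the same three transforms."""
    k = max(values) + 1  # the largest score becomes 1 after reflection
    return {
        "sqrt": [sqrt(k - x) for x in values],
        "log10": [log10(k - x) for x in values],
        "inverse": [1 / (k - x) for x in values],
    }


# Hypothetical scores: one right-skewed set with a zero, one left-skewed set.
right_skewed = [0, 1, 3, 8, 99]
left_skewed = [1, 92, 97, 99, 100]

pos = positive_skew_transforms(right_skewed)
neg = negative_skew_transforms(left_skewed)
print(pos["log10"])
print(neg["sqrt"])
```

Reflection reverses the order of the scores, so a high transformed value corresponds to a low original value. The inverse transform also reverses order; after reflection plus inverse, the original order is restored.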

Page 29: Data screening

PRACTICE

Page 30: Data screening

Check list

Descriptive statistics
  Range
  Mean & SD
  Skewness & kurtosis
Missing data (missing value analysis)
Normal distribution
  Kolmogorov-Smirnov test (n>50)
  Shapiro-Wilk test (n<50)
  Skewness & kurtosis
  P-P plot
Outliers (univariate/multivariate: z-score/Mahalanobis distance)
Linearity
Homoscedasticity
Multicollinearity

Page 31: Data screening

Report

Try