Multifactor Dimensionality Reduction
Laura Mustavich
Introduction to Data Mining
Final Project Presentation
April 26, 2007
The Inspiration For a Method
The Nature of Complex Diseases
Most common diseases are complex: they are caused by multiple genes, often interacting with one another
This interaction is termed Epistasis
Epistasis
When an allele at one locus masks the effect of an allele at another locus
The Failure of Traditional Methods
Traditional gene-hunting methods are successful for rare Mendelian (single-gene) diseases
They are unsuccessful for complex diseases:
Since many genes interact to cause the disease, the effect of any single gene is too small to detect
They do not take this interaction into account
MDR: The Algorithm
Multifactor Dimensionality Reduction (MDR)
A data mining approach to identify interactions among discrete variables that influence a binary outcome
A nonparametric alternative to traditional statistical methods such as logistic regression
Driven by the need to improve the power to detect gene-gene interactions
MDR Step 0
Divide the data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets for cross-validation
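A minimal sketch of Step 0, assuming a simple random split (the slides do not say how the subsets are drawn, and the function name and seed are hypothetical):

```python
import random

def split_into_folds(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint, roughly equal folds.

    Assumption: a plain random split; the slides do not specify the scheme.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

# 79 is the number of complete records used in the analysis later in the deck.
folds = split_into_folds(79, k=10)
```

Each fold serves once as the testing set while the other nine form the training set.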
MDR Step 1
Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set
MDR Step 2
Create a contingency table for these multilocus genotypes, counting the number of affected and unaffected individuals with each multilocus genotype
MDR Step 3
Calculate the ratio of cases to controls for each multilocus genotype
MDR Step 4
Label each multilocus genotype as “high-risk” or “low-risk”, depending on whether the case-control ratio is above a certain threshold
This is the dimensionality reduction step: it reduces the n-dimensional genotype space to 1 dimension with 2 levels
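Steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the MDR software's code: the threshold of 1.0 and the function name are hypothetical, and the toy genotypes are invented.

```python
from collections import defaultdict

def label_genotypes(genotypes, status, threshold=1.0):
    """Label each observed multilocus genotype 'high' or 'low' risk.

    genotypes: one tuple per individual, e.g. ("A/A", "C/T")
    status:    1 for a case, 0 for a control
    threshold: case-control ratio above which a genotype is high risk
               (1.0 is an assumption; the slides only say "a certain threshold")
    """
    counts = defaultdict(lambda: [0, 0])  # genotype -> [cases, controls]
    for g, s in zip(genotypes, status):
        counts[g][0 if s == 1 else 1] += 1

    labels = {}
    for g, (cases, controls) in counts.items():
        # Equivalent to cases / controls > threshold, but written without
        # division so that cells with zero controls are handled.
        labels[g] = "high" if cases > threshold * controls else "low"
    return labels

# Invented toy data: two loci, six individuals.
genotypes = [("A/A", "C/C"), ("A/A", "C/C"), ("A/A", "C/C"),
             ("A/G", "C/T"), ("A/G", "C/T"), ("G/G", "T/T")]
status = [1, 1, 0, 0, 0, 1]
labels = label_genotypes(genotypes, status)
```

The n-locus genotype of each individual is thereby replaced by a single two-level attribute.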
MDR Step 5
Use labels to classify individuals as cases or controls, and calculate the misclassification rate
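A minimal sketch of Step 5, assuming the risk labels are stored in a dictionary and that a genotype never seen in training is treated as low risk (both are assumptions for illustration; the names are hypothetical):

```python
def misclassification_rate(labels, genotypes, status, default="low"):
    """Fraction of individuals whose predicted class differs from their status.

    labels:  {multilocus genotype: "high" or "low"}
    default: label for genotypes absent from `labels` (an assumption)
    """
    errors = 0
    for g, s in zip(genotypes, status):
        predicted = 1 if labels.get(g, default) == "high" else 0
        errors += predicted != s
    return errors / len(status)

# Invented toy data.
labels = {("A/A", "C/C"): "high", ("A/G", "C/T"): "low"}
genotypes = [("A/A", "C/C"), ("A/A", "C/C"), ("A/G", "C/T"),
             ("A/G", "C/T"), ("G/G", "T/T")]
status = [1, 0, 0, 0, 1]
rate = misclassification_rate(labels, genotypes, status)
```

Here two of the five individuals are misclassified: one control with a high-risk genotype and one case with an unseen genotype.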
Repeat steps 1-5 for:
All possible combinations of n factors
All possible values of n
Across all 10 training and testing sets
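The full loop can be sketched as an exhaustive search. This is a simplified illustration, not the MDR software: the function names, the threshold of 1.0, the low-risk default for unseen genotypes, and the toy data (an XOR pattern between factors 0 and 1, with factor 2 as noise) are all assumptions.

```python
from collections import Counter
from itertools import combinations, product

def fit(rows, status, combo, threshold=1.0):
    """Steps 2-4: label each observed multilocus genotype 1 (high) or 0 (low)."""
    cases, controls = Counter(), Counter()
    for row, s in zip(rows, status):
        key = tuple(row[j] for j in combo)
        (cases if s == 1 else controls)[key] += 1
    return {k: int(cases[k] > threshold * controls[k])
            for k in set(cases) | set(controls)}

def error(model, rows, status, combo):
    """Step 5: fraction misclassified; unseen genotypes default to low risk."""
    wrong = sum(model.get(tuple(row[j] for j in combo), 0) != s
                for row, s in zip(rows, status))
    return wrong / len(rows)

def mdr_search(rows, status, max_order, folds):
    """For every order n and every fold, keep the combination with the lowest
    training error and record its error on the held-out fold."""
    n_factors = len(rows[0])
    results = {}  # order -> list of (best_combo, test_error), one per fold
    for n in range(1, max_order + 1):
        per_fold = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(rows)) if i not in held_out]
            tr_rows = [rows[i] for i in train_idx]
            tr_y = [status[i] for i in train_idx]
            te_rows = [rows[i] for i in test_idx]
            te_y = [status[i] for i in test_idx]
            best = min(combinations(range(n_factors), n),
                       key=lambda c: error(fit(tr_rows, tr_y, c), tr_rows, tr_y, c))
            model = fit(tr_rows, tr_y, best)
            per_fold.append((best, error(model, te_rows, te_y, best)))
        results[n] = per_fold
    return results

# Toy data: status is the XOR of factors 0 and 1; factor 2 carries no signal.
rows = [combo for _ in range(5) for combo in product((0, 1), repeat=3)]
status = [a ^ b for a, b, _ in rows]
folds = [list(range(i, len(rows), 5)) for i in range(5)]
results = mdr_search(rows, status, max_order=2, folds=folds)
```

In this toy example no single factor separates cases from controls, but the pair (0, 1) does, which is the kind of purely epistatic signal the method is meant to detect.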
The Best Model
Minimizes prediction error: the average misclassification rate across all 10 cross-validation subsets
Maximizes cross-validation consistency: the number of cross-validation subsets in which a particular model was the best model
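Both criteria can be computed from per-fold results. The sketch below assumes each fold reports its best model and that model's testing error; averaging the error over the per-fold winners is a simplifying assumption, and the numbers are invented.

```python
from collections import Counter

def summarize(per_fold):
    """per_fold: list of (best_model, test_error) pairs, one per CV subset.

    Returns (most frequent model, its CV consistency, mean prediction error).
    """
    winners = Counter(model for model, _ in per_fold)
    top_model, consistency = winners.most_common(1)[0]
    mean_error = sum(err for _, err in per_fold) / len(per_fold)
    return top_model, consistency, mean_error

# Invented example: one model wins in 6 of 10 subsets.
per_fold = [(("locus1", "locus2"), 0.1)] * 6 + [(("locus1", "locus3"), 0.4)] * 4
top_model, consistency, mean_error = summarize(per_fold)
```

A consistency of 6 here would be reported as 6/10.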
Hypothesis test of best model:
Evaluate the magnitude of the cross-validation consistency and prediction error estimates by permutation testing:
Randomize disease labels
Repeat the MDR analysis several times to get the distributions of cross-validation consistencies and prediction errors
Use these distributions to determine p-values for the actual cross-validation consistencies and prediction errors
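A generic permutation test can be sketched as below. It is an illustration under assumptions: `statistic` stands in for a full MDR run (here replaced by a simple agreement score), larger values are treated as more extreme, and the add-one correction, the 1000 permutations, and the seed are choices made for this example.

```python
import random

def permutation_p_value(statistic, data, status, n_perm=1000, seed=0):
    """Empirical p-value: share of label-shuffled runs whose statistic is at
    least as large as the one observed with the real labels."""
    rng = random.Random(seed)
    observed = statistic(data, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize disease labels
        if statistic(data, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

def agreement(factor, labels):
    """Stand-in statistic: fraction of individuals where factor equals label."""
    return sum(f == y for f, y in zip(factor, labels)) / len(labels)

# Invented data in which the factor predicts the status perfectly.
factor = [0] * 10 + [1] * 10
status = [0] * 10 + [1] * 10
p = permutation_p_value(agreement, factor, status)
```

For prediction error, where smaller is better, the comparison would be reversed.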
Permutation Testing: An illustration
Sample Quantiles: An Example Empirical Distribution
[Figure: histogram of the empirical distribution; x-axis from 0.2 to 1.0, y-axis Frequency from 0 to 10]
0% 0.045754
25% 0.168814
50% 0.237763
75% 0.321027
90% 0.423336
95% 0.489813
99% 0.623899
99.99% 0.872345
100% 1
Observed value: 0.4500
The probability that we would see a result as extreme as, or more extreme than, 0.4500 simply by chance is between 5% and 10%
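The bracketing above can be reproduced from the quantile table. This is a small sketch in which the function name is hypothetical and larger values are assumed to be more extreme:

```python
import bisect

# Quantiles of the example empirical distribution, copied from the slide.
LEVELS = [0.0, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 0.9999, 1.0]
VALUES = [0.045754, 0.168814, 0.237763, 0.321027, 0.423336,
          0.489813, 0.623899, 0.872345, 1.0]

def p_value_bounds(observed):
    """Return (lower, upper) bounds on the upper-tail p-value of `observed`."""
    i = bisect.bisect_right(VALUES, observed)
    if i == 0:
        return (1.0, 1.0)  # below the minimum: every value is at least as extreme
    if i == len(VALUES):
        return (0.0, round(1.0 - LEVELS[-2], 6))  # at or above the maximum
    return (round(1.0 - LEVELS[i], 6), round(1.0 - LEVELS[i - 1], 6))

bounds = p_value_bounds(0.45)
```

The value 0.4500 falls between the 90% quantile (0.423336) and the 95% quantile (0.489813), so the p-value lies between 0.05 and 0.10.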
Strengths
Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multilocus data
Non-parametric: no parameters are estimated
Assumes no particular genetic model
False-positive results due to multiple testing are minimized
Weaknesses
Computationally intensive (especially with >10 loci)
The curse of dimensionality: decreased predictive ability with high dimensionality and small samples, due to cells with no data
MDR Software
The Authors
Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, Moore, 2003.
www.sourceforge.net
Values Calculated by MDR
Measure | Formula / Interpretation
Balanced Accuracy | (Sensitivity + Specificity)/2; fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
Accuracy | (TP + TN)/(TP + TN + FP + FN); proportion of instances correctly classified
Sensitivity | TP/(TP + FN); proportion of actual positives correctly classified
Specificity | TN/(TN + FP); proportion of actual negatives correctly classified
Odds Ratio | (TP*TN)/(FP*FN); compares the odds of a certain event between two groups
χ² | Chi-squared score for the attribute constructed by MDR from this attribute combination
Precision | TP/(TP + FP); proportion of predicted positives that are actual positives
Kappa | 2(TP*TN - FP*FN)/[(TP + FN)(FN + TN) + (TP + FP)(FP + TN)]; a function of total accuracy and random accuracy
F-Measure | 2*TP/(2*TP + FP + FN); a function of sensitivity and precision
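The measures can be computed directly from a confusion matrix. The sketch below uses the standard Cohen's kappa numerator, TP*TN - FP*FN. The counts in the example are hypothetical, chosen to be consistent with the training statistics reported later in the deck for the Ami three-locus model; the function name is also an assumption.

```python
import math

def mdr_measures(tp, fn, fp, tn):
    """Classification measures from confusion-matrix counts.

    Assumes every denominator other than the odds ratio's is nonzero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Reported as infinite when FP*FN is zero.
        "odds_ratio": (tp * tn) / (fp * fn) if fp * fn else math.inf,
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fp * fn)
                 / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts: 14 cases and 16 controls, one control misclassified.
m = mdr_measures(tp=14, fn=0, fp=1, tn=15)
```

With these counts, accuracy (about 0.967) is slightly lower than balanced accuracy (about 0.969) because the single error falls in the larger class.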
Sign Test
n = number of cross-validation intervals
c = number of cross-validation intervals with testing accuracy ≥ 0.5
p = the probability of observing c or more cross-validation intervals with testing accuracy ≥ 0.5 if each case were actually classified randomly
\[ p = \sum_{k=c}^{n} \binom{n}{k} \left(\frac{1}{2}\right)^{n} \]
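The sign test p-value is an upper-tail binomial probability with success probability 1/2. A short sketch (the function name is hypothetical):

```python
from math import comb

def sign_test_p(n, c):
    """P(at least c of n intervals reach testing accuracy >= 0.5 by chance)."""
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n

p_all_ten = sign_test_p(10, 10)   # 1/1024
p_eight = sign_test_p(10, 8)      # 56/1024
```

With n = 10, these evaluate to about 0.0010 and 0.0547, matching the "10 (0.0010)" and "8 (0.0547)" entries in the Ami results table.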
The Problem of Alcoholism
A Case Study
Genes Associated With Alcoholism
Alcohol → Acetaldehyde (via ADH enzymes) → Acetate (via ALDH2 enzyme)
The ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2) genes are:
associated with alcoholism
involved in alcohol metabolism
ADH Genes
[Figure: map of the ADH gene cluster, spanning 370 kb on chromosome 4, drawn 5’ to 3’]
Genes shown: ADH7 (Class IV); ADH1A, ADH1B, ADH1C (Class I); ADH6 (Class V); ADH4 (Class II); ADH5 (Class III)
Taste Receptors and Aversion to Alcohol
PTC (TAS2R38)
Tasters: alcohol tastes bitter, so they drink less alcohol
Non-tasters: alcohol tastes sweet, so they drink more alcohol
• a person must be willing to drink in order to be an alcoholic
• TAS2R38 affects the amount of alcohol a person is willing to drink
• therefore, it is related to alcoholism, although no direct association has been found
• we hope to provide a direct link between TAS2R38 and alcoholism, by demonstrating that it acts epistatically with other genes associated with alcoholism
Actual Analysis
Data
A sample of cases and controls (alcoholics and non-alcoholics) from three East Asian populations: the Ami, Atayal, and Taiwanese
Genotyped for 98 markers within several genes: ALDH2, all ADH genes, and 2 taste receptor genes, TAS2R16 and TAS2R38 (PTC)
Computational Limitations
1. The software package has a problem reading missing data
I was forced to use only complete records, which reduced my (already small) sample to 79 complete records
2. The computation time is too long for higher-order models, especially with a large number of attributes
I was advised to restrict my attributes to markers within ADH1C and the 2 taste receptor genes, which left me with 36 attributes
I considered models only up to order 4
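The effect of these restrictions on the size of the exhaustive search can be estimated by counting attribute combinations. A small sketch (the function name is hypothetical; the count ignores the further factor of 10 from cross-validation):

```python
from math import comb

def n_models(n_attributes, max_order):
    """Number of attribute combinations of order 1 through max_order."""
    return sum(comb(n_attributes, k) for k in range(1, max_order + 1))

restricted = n_models(36, 4)   # 36 attributes, models up to order 4
full = n_models(98, 4)         # all 98 genotyped markers, same orders
```

This gives 66,711 candidate models for 36 attributes, compared with 3,769,227 for all 98 markers, a reduction of more than fifty-fold.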
Summary of Results: All Populations
Order | Model | Training Bal. Acc. | Testing Bal. Acc. | Sign Test (p) | CV Consistency
1 | X.04..ADH1C.dwstrm.Te | 0.6049 | 0.4278 | 0 (1.0000) | 5/10
2 | X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te | 0.7076 | 0.4438 | 3 (0.9453) | 6/10
3 | X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te, X.04..ADH1C.rs3762896 | 0.785 | 0.3186 | 1 (0.9990) | 4/10
4 | X.07..TAS2R16.C_11431, X.07..PTC.C_8876291_1, X.07..PTC.C_8876482_1, X.04..ADH1C.dwstrm.Te | 0.8453 | 0.3564 | 2 (0.9893) | 6/10
Instances: 79 Attributes: 36 Ratio: 1.3235
Summary of Results: Ami
Order | Model | Training Bal. Acc. | Testing Bal. Acc. | Sign Test (p) | CV Consistency
1 | X.07..TAS2R16.C_11431 | 0.7331 | 0.4598 | 5 (0.6230) | 5/10
2 | X.07..TAS2R16.C_11431, X.04..ADH1C.C_2688508 | 0.8284 | 0.3476 | 2 (0.9893) | 3/10
3 | X.07..TAS2R16.C_11431, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508 | 0.9688 | 0.9545 | 10 (0.0010) | 10/10
4 | X.07..TAS2R16.C_11431, X.07..TAS2R16.C_11431.1, X.07..PTC.C_8876467_1, X.04..ADH1C.C_2688508 | 0.9722 | 0.8712 | 8 (0.0547) | 9/10
Instances: 30 Attributes: 36 Ratio: 0.8750
Cross Validation Statistics Set
Measure Training Testing
Balanced Accuracy 0.9688 0.9545
Accuracy 0.9667 0.95
Sensitivity 1 1
Specificity 0.9375 0.9091
Odds Ratio ∞ ∞
χ2 23.6250 (p < 0.0001) 1.6364 (p = 0.2008)
Precision 0.9333 0.9
Kappa 0.9333 0.9
F-Measure 0.9655 0.9474
Sign Test: 10 (p = 0.0010)
Cross-validation Consistency: 10/10
Whole Dataset Statistics:
Training Balanced Accuracy: 0.9688
Training Accuracy: 0.9667
Training Sensitivity: 1.0000
Training Specificity: 0.9375
Training Odds Ratio: ∞
Training χ²: 26.2500 (p < 0.0001)
Training Precision: 0.9333
Training Kappa: 0.9333
Training F-Measure: 0.9655
Graphical Model
Classification Rules
X.07..TAS2R16.C_11431 X.07..PTC.C_8876467_1 X.04..ADH1C.C_2688508 Class
(each row reads: IF genotype 1 AND genotype 2 AND genotype 3 THEN class)
A\A C\G C\C 0
A\A C\G C\T 1
A\A C\G T\T 0
A\A G\G C\C 0
A\A G\G C\T 0
A\A G\G T\T 1
A\G C\C C\T 1
A\G C\G C\C 0
A\G C\G C\T 0
A\G C\G T\T 0
A\G G\G C\C 1
A\G G\G C\T 1
A\G G\G T\T 0
G\G C\G C\T 1
G\G G\G C\C 0
G\G G\G C\T 1
G\G G\G T\T 1
Locus Dendrogram
Future Work
Simulations to calculate the power of MDR, especially in relation to sample size
Comparison of MDR with logistic regression, and other proposed methods to detect epistasis, with respect to the current data set and simulated data
Research how different methods to search the sample space can be incorporated into MDR implementation to improve computational feasibility