Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis Dinu et al, J. Biomedical Informatics 40 (2007) 750-760

view - John Moult's Group HomePage

Pathway/SNP

- A software application that lets its users bring pathway data into the analysis of high-density genomic SNP data derived from disease association studies.

- The purpose is to analyze the underlying etiology of disease through the integration of pathway information using statistical and data mining approaches.

Background:

- Large-scale genome-wide association (GWA) studies are now available to identify genomic mutations associated with a wide range of diseases.

- Complex diseases such as diabetes and hypertension are believed to be caused by the interaction of multiple genes and environmental factors.

- The number of mathematical operations required to assess the association between multiple interacting genomic loci and disease grows exponentially with the number of interacting SNPs.

- Various statistical approaches (e.g., stepwise algorithms, varying parameters) are used to analyze these associations.

- Data mining approaches are used for multi-locus association with traits.

Computational complexity for brute-force ‘full-scan’ interaction analysis between all possible combinations of n genomic markers and a disease is exponential in n.

For the Affymetrix 100K SNP GeneChip, n = 100,000 genomic markers. A full scan requires:

# of interacting markers    # of tests
2                           5.00 x 10^9
3                           1.66 x 10^14
4                           4.16 x 10^18
5                           8.33 x 10^22

The fastest supercomputer can perform ~3.67 x 10^14 floating-point operations per second (flops).
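The test counts above are the binomial coefficients "n choose k" for n = 100,000 markers and k interacting markers. The following Python sketch recomputes them and converts them into a rough lower bound on run time. It assumes, optimistically, that one association test costs a single floating-point operation; the function name and that cost model are illustrative assumptions, not part of the slides.

```python
import math

N_MARKERS = 100_000          # markers on the 100K chip (from the slide)
FLOPS = 3.67e14              # supercomputer speed quoted on the slide
SECONDS_PER_YEAR = 365 * 24 * 3600


def full_scan_tests(n_markers, order):
    """Number of distinct marker combinations of the given interaction order."""
    return math.comb(n_markers, order)


for order in range(2, 6):
    tests = full_scan_tests(N_MARKERS, order)
    # Assumption: one floating-point operation per test (a real test costs far more).
    years = tests / FLOPS / SECONDS_PER_YEAR
    print(f"{order}-way interactions: {tests:.2e} tests, "
          f"at least {years:.2e} years")
```

Even under this generous assumption, the run time grows by roughly four to five orders of magnitude with each additional interacting marker, which is the motivation for restricting the search to pathway-relevant SNPs.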

Conclusion:

- “One model fits all” approach is not optimal.

Pathway/SNP

– Designed as an exploratory tool which integrates pathway information, gene annotation, and SNP location to identify the pathways that are most strongly associated with disease.

Architecture: 3-tier architecture written in Java

1> Presentation tier – written in JavaServer Pages (JSP)

2> Logic tier – statistical and data mining algorithms in Java

3> Data tier – genotype, phenotype and annotation data stored in a heavily indexed relational database.

Biological Data:

- Annotations for 561 pathways: 181 KEGG, 314 BioCarta and 66 GenMAPP human pathways.

- Gene annotation data – from NCBI Entrez Gene

- Affymetrix 100k and 500k GeneChip microarray annotation files are preloaded in the database.

Relevant SNPs:

For a given biological pathway, SNPs located within 10,000 base pairs (bp) of a pathway gene’s location are considered relevant.

Relevant Genes:

The gene list is first extracted from a particular pathway database and then augmented from the literature and Entrez Gene.
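The 10,000 bp relevance rule for SNPs can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation (Pathway/SNP is written in Java and queries a relational database). The `Gene` and `Snp` record types are hypothetical, and the sketch assumes that "within 10,000 bp of a gene's location" means inside the gene's start-to-end span extended by 10,000 bp on each side.

```python
from dataclasses import dataclass

WINDOW_BP = 10_000  # distance threshold from the slide


@dataclass(frozen=True)
class Gene:          # hypothetical record type
    symbol: str
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Snp:           # hypothetical record type
    snp_id: str
    chrom: str
    position: int


def relevant_snps(pathway_genes, snps, window=WINDOW_BP):
    """Return the SNPs lying within `window` bp of any gene in the pathway."""
    relevant = []
    for snp in snps:
        for gene in pathway_genes:
            same_chrom = snp.chrom == gene.chrom
            in_window = gene.start - window <= snp.position <= gene.end + window
            if same_chrom and in_window:
                relevant.append(snp)
                break  # count each SNP once, even if it is near several genes
    return relevant


# Made-up coordinates, for illustration only.
genes = [Gene("GENE1", "1", 100_000, 150_000)]
snps = [
    Snp("snp_a", "1", 95_000),    # 5,000 bp upstream: relevant
    Snp("snp_b", "1", 160_000),   # exactly 10,000 bp downstream: relevant
    Snp("snp_c", "1", 160_001),   # just outside the window
    Snp("snp_d", "2", 120_000),   # different chromosome
]
print([s.snp_id for s in relevant_snps(genes, snps)])
```

A nested loop is adequate for a sketch; with hundreds of thousands of SNPs, an indexed lookup (as the indexed database in the data tier suggests) would be the practical choice.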

Algorithms:

1> Single SNP association with disease

- Chi square and Armitage’s trend test

2> Pathway association with disease

- U-statistics or data mining algorithms

3> Permutation-based statistical significance inference

- Bonferroni adjustment or False discovery rate (FDR)
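Permutation-based inference (item 3) can be illustrated with a generic sketch: shuffle the case/control labels many times, recompute the statistic each time, and report how often the permuted statistic is at least as large as the observed one. The function names, the toy `mean_difference` statistic, and the add-one correction are assumptions made for illustration; the slides do not specify how Pathway/SNP implements this step.

```python
import random


def permutation_p_value(values, labels, statistic, n_permutations=1000, seed=0):
    """Empirical p-value of `statistic` under random relabelling of subjects."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between values and labels
        if statistic(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def mean_difference(values, labels):
    """Toy statistic: absolute difference in means between cases (1) and controls (0)."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Hypothetical per-subject scores: cases tend to score higher than controls.
scores = [2.0, 2.0, 1.0, 2.0, 0.0, 1.0, 0.0, 0.0]
status = [1, 1, 1, 1, 0, 0, 0, 0]
print(permutation_p_value(scores, status, mean_difference))
```

Any of the single-SNP or pathway statistics could be passed in place of `mean_difference`, since the procedure only needs a function of the data and the labels.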

Single SNP association with disease:

1> Chi square test

2> Armitage’s trend test

Chi square test

Allele-based (1 degree of freedom):

           Allele A count    Allele B count
Case
Control

Genotype-based (2 degrees of freedom):

           AA count    AB count    BB count
Case
Control

More preferred
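The two chi square tests above differ only in the contingency table they are applied to. The Python sketch below computes the Pearson chi square statistic for both tables from one set of genotype counts. The counts are made up for illustration, and the helper names are assumptions; the p-value helper covers only the 1 and 2 degree-of-freedom cases needed here, using the closed forms of the chi square tail.

```python
import math


def chi_square_statistic(table):
    """Pearson chi square for a contingency table given as a list of rows.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value(stat, dof):
    """Upper-tail probability for 1 or 2 degrees of freedom."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if dof == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch handles only 1 or 2 degrees of freedom")


def allele_counts(genotype_counts):
    """Convert (AA, AB, BB) genotype counts into [A, B] allele counts."""
    aa, ab, bb = genotype_counts
    return [2 * aa + ab, ab + 2 * bb]


# Hypothetical genotype counts (AA, AB, BB).
case_genotypes = (10, 20, 30)
control_genotypes = (30, 20, 10)

genotype_stat = chi_square_statistic([list(case_genotypes), list(control_genotypes)])
allele_stat = chi_square_statistic([allele_counts(case_genotypes),
                                    allele_counts(control_genotypes)])
print("genotype-based:", genotype_stat, chi_square_p_value(genotype_stat, 2))
print("allele-based:  ", allele_stat, chi_square_p_value(allele_stat, 1))
```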

Armitage’s Trend Test

This test checks whether case vs. control status shows a ‘trend’ across genotypes, under different models of association between a SNP and disease.

Additive model: tests an association that depends additively on the number of risk (minor) alleles: 0 for homozygous non-risk, 1 for heterozygous and 2 for homozygous risk genotypes.

Dominant model: tests the association of carrying at least one risk allele (homozygous risk = 1, heterozygous = 1) vs. carrying none (homozygous non-risk = 0).

Recessive model: tests the association of carrying two risk alleles (homozygous risk = 1) vs. carrying at least one non-risk allele (heterozygous = 0, homozygous non-risk = 0).

Armitage’s trend test statistic has 1 degree of freedom.
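The three models differ only in the weights assigned to the genotypes. A minimal Python sketch of the Cochran-Armitage trend statistic, using the 0/1/2, 0/1/1 and 0/0/1 codings described above, is given here. The function and dictionary names are assumptions, the counts are hypothetical, and genotype counts are expected in the order (homozygous non-risk, heterozygous, homozygous risk).

```python
MODEL_WEIGHTS = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def armitage_trend_statistic(case_counts, control_counts, model="additive"):
    """Cochran-Armitage trend chi square (1 degree of freedom).

    Counts are ordered (homozygous non-risk, heterozygous, homozygous risk).
    Assumes both groups are non-empty and the weighted genotypes vary.
    """
    weights = MODEL_WEIGHTS[model]
    genotype_totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    n = n_cases + n_controls

    sum_w_cases = sum(w * r for w, r in zip(weights, case_counts))
    sum_w_all = sum(w * c for w, c in zip(weights, genotype_totals))
    sum_w2_all = sum(w * w * c for w, c in zip(weights, genotype_totals))

    numerator = n * (n * sum_w_cases - n_cases * sum_w_all) ** 2
    denominator = n_cases * n_controls * (n * sum_w2_all - sum_w_all ** 2)
    return numerator / denominator


# Hypothetical genotype counts.
cases = (10, 20, 30)
controls = (30, 20, 10)
for model in MODEL_WEIGHTS:
    print(model, armitage_trend_statistic(cases, controls, model))
```

With the dominant or recessive coding the weights are binary, so the statistic coincides with the Pearson chi square on the corresponding collapsed 2x2 table.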

U-statistics for pathway association with disease:

- Non-parametric algorithm that can simultaneously test the association of multiple markers with disease, with only a single degree of freedom.

- First computes, within each of the case and control groups, a score over all markers (the set of SNPs) for each pair of subjects. The genetic score for a pair of subjects is given by a “kernel” function, such as recessive, dominant or linear dosage.

- Then compares the average scores between cases and controls using a global statistic with one degree of freedom, instead of the many degrees of freedom implied when many markers are analyzed separately.

- The resulting z-scores can be used to rank pathways and also to calculate an approximate p-value.
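The first two steps (pairwise scoring within each group, then comparing group averages) can be sketched as follows. This is a deliberately partial illustration: it stops at the difference in average scores and omits the variance estimate needed to turn that difference into the z-score described above. The kernel shown is an assumed allele-match (identity-by-state) definition, with genotypes coded as the number of risk alleles (0, 1 or 2); it is not taken from the slides or the paper.

```python
from itertools import combinations


def allele_match_kernel(g1, g2):
    """Assumed kernel: number of alleles two genotypes share (0, 1 or 2)."""
    return 2 - abs(g1 - g2)


def mean_pairwise_score(group, kernel=allele_match_kernel):
    """Average, over all pairs of subjects, of the kernel summed across markers.

    `group` is a list of subjects; each subject is a list of genotypes,
    one per marker. Requires at least two subjects.
    """
    pairs = list(combinations(group, 2))
    total = sum(kernel(a, b)
                for subject1, subject2 in pairs
                for a, b in zip(subject1, subject2))
    return total / len(pairs)


def score_difference(cases, controls, kernel=allele_match_kernel):
    """Difference in mean pairwise score between cases and controls."""
    return mean_pairwise_score(cases, kernel) - mean_pairwise_score(controls, kernel)


# Hypothetical genotypes at two pathway SNPs.
cases = [[2, 2], [2, 2], [2, 1]]
controls = [[0, 2], [2, 0], [1, 1]]
print(score_difference(cases, controls))
```

A positive difference means cases resemble one another at the pathway SNPs more than controls do, which is the kind of signal the global statistic is designed to detect.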

Consider b as risk allele and a as non-risk allele

Data mining for pathway association with disease:

- Data mining classifiers (e.g., SVM, Random Forests, logistic, tree-based) can be used to explore the association between pathways and disease.

- The “percent correct” classification of cases and controls estimated with the genotypes at the pathway SNPs can be used as a statistic for measuring the association between pathways and disease.

- The classifiers are incorporated through the Weka data mining program and are run by default with 10-fold cross-validation.
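The “percent correct” statistic under 10-fold cross-validation can be sketched without Weka. The Python example below is a stand-in, not the tool's code: it uses a simple 1-nearest-neighbour classifier in place of the Weka classifiers, assigns subjects to folds by index rather than at random, and uses made-up genotypes. All function names are assumptions.

```python
def nearest_neighbour(train_x, train_y, x):
    """Stand-in classifier: label of the training subject with the closest genotypes."""
    distances = [sum(abs(a - b) for a, b in zip(row, x)) for row in train_x]
    return train_y[distances.index(min(distances))]


def k_fold_percent_correct(genotypes, labels, classify, k=10):
    """Percent of subjects classified correctly under k-fold cross-validation.

    Subject i is held out in fold i % k; a real analysis would shuffle first.
    """
    n = len(labels)
    correct = 0
    for fold in range(k):
        train_idx = [i for i in range(n) if i % k != fold]
        test_idx = [i for i in range(n) if i % k == fold]
        train_x = [genotypes[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for i in test_idx:
            if classify(train_x, train_y, genotypes[i]) == labels[i]:
                correct += 1
    return 100.0 * correct / n


# Hypothetical data: 10 cases and 10 controls at three pathway SNPs.
genotypes = [[2, 2, 1]] * 10 + [[0, 1, 0]] * 10
labels = [1] * 10 + [0] * 10
print(k_fold_percent_correct(genotypes, labels, nearest_neighbour))
```

Because each subject is classified by a model that never saw it during training, the percentage is a less optimistic measure of association than accuracy on the training data.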

Multiple testing corrections:

- A good test statistic value may have occurred by chance alone. Multiple testing corrections are designed to help guard against this.

Bonferroni adjustments:

- The Bonferroni adjustment multiplies each individual p-value by the number of times that same test was performed (the number of markers tested).

- This adjusted value, which is quite conservative, estimates the probability that the test would have come out this well by chance at least once among all the times it was performed.
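The Bonferroni adjustment amounts to one multiplication per p-value. In the small Python sketch below, capping the result at 1.0 is a common convention added here so that the output remains a valid probability; the slide itself does not mention it. The function name and p-values are illustrative.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical p-values from three single-SNP tests.
print(bonferroni_adjust([0.01, 0.04, 0.5]))
```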

Statistical significance using permutation based FDR:

- The False Discovery Rate (FDR) option calculates the False Discovery Rate for each statistical test selected. This calculation is itself based on the p-values from the original tests.

- The interpretation of the False Discovery Rate is “What would the rate of false discoveries (false positives) be if I accepted ALL of the tests whose p-value is at or below the p-value of this test?”

- The aim of the FDR procedure is to control at a desired level (e.g., 0.05) the proportion of type I errors (false positives) among all significant results.

- Suppose m hypotheses are tested, and R of them are rejected (positive results). Of the rejected hypotheses, suppose that V of them are really null; that is, V is the number of type I errors, or false positive results. The False Discovery Rate is defined as

FDR = E(V/R | R > 0) · P(R > 0),

that is, the expected proportion of false positive findings among all rejected hypotheses times the probability of making at least one rejection.

- This procedure may yield higher statistical power than controlling the family-wise error rate. Pathways with low FDR (e.g., below 0.05) are considered significant.
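As an illustration of an FDR calculation based on p-values, the sketch below implements the standard Benjamini-Hochberg adjustment. This is a swapped-in, widely used procedure chosen for illustration; the slides describe a permutation-based FDR in Pathway/SNP without giving its details, so this should not be read as the tool's method. Each adjusted value estimates the FDR incurred by accepting every test whose p-value is at or below that of the given test.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pathway p-values.
pathway_p = [0.001, 0.5, 0.02, 0.03]
for p, q in zip(pathway_p, benjamini_hochberg(pathway_p)):
    flag = "significant" if q < 0.05 else "not significant"
    print(f"p = {p:<6} FDR = {q:.3f}  {flag}")
```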

Using Pathway/SNP to analyze AMD data set:

- This data set contains 116,204 genome-wide SNPs genotyped with the Affymetrix 100K GeneChip

- Case-control study of 146 Caucasian individuals

- 50 controls and 96 cases with advanced AMD

- 50 patients with wet AMD (severe) and 46 patients with dry AMD.

- Initial analysis identified a mutation in complement factor H (CFH) on chromosome 1 as strongly associated with AMD.

- Identified 46 genes (from KEGG and NCBI genome version 35)

- A total of 94 SNPs are relevant (within 10,000 bp).

- Armitage’s trend test with the additive model, U-statistics with 5 kernels (dominant, recessive, linear, quadratic, allele match) and 4 data mining algorithms (J48, Random Forests, SVM, Naïve Bayes) were run.

- Subjects were grouped into 4 comparisons: control vs. all cases (wet + dry), control vs. wet AMD, control vs. dry AMD, and dry AMD vs. wet AMD.

- Identified two additional pathway genes, C7 and MBL2:

- Explanation of the difference between progressing to dry AMD (the less severe form) vs. wet AMD (the more severe form)

Lessons learned:

- The potential need for high performance computation to support a tool like Pathway/SNP

- The need for permutation testing to evaluate the results of the analysis

- Dealing with different versions of the biological data and knowledge

- Why different analysis algorithms might work better with different data sets and different diseases

- The complexity of the “clinical phenotype”