Single Nucleotide Polymorphism Analysis
Asst. Prof. Vitara Pungpapong, Ph.D.
Department of Statistics
Faculty of Commerce and Accountancy
Chulalongkorn University
Outline
• What is a SNP array?
• Typical SNP analysis
• Challenges
• The ICM/M Method
• Results
Vitara Pungpapong 2
Microarray
• Often called a chip, as in ChIP-chip.
• First publication in 1999.
• Each known gene is one spot on the chip.
• Laser-induced fluorescence (LIF) is used to obtain the color and intensity of each gene.
• Varying colors show varying levels of gene activity.
• A microarray chip can contain 10,000–20,000 genes.
Single Nucleotide Polymorphism
• Usually abbreviated as SNP: a variation at a single nucleotide position in the genome.
(https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)
Microarray (ChIP-chip) vs Sequencing (ChIP-seq)
• Comparison following Ho et al. (2011).
• Microarrays are more suitable for small genomes.
• Microarrays are subject to more bias.
• Sequencing generally produces profiles with a better signal-to-noise ratio and allows detection of more, and narrower, peaks.
• Sequencing generates far more data (> 1 TB), which requires more effort in analysis.
1000 Genomes Project
• http://www.1000genomes.org/
• The 1000 Genomes Project provides the largest public catalog of human genetic variation.
• The project ran from 2008 to 2015.
• The human genome consists of approximately 3 billion DNA base pairs and is estimated to carry around 20,000 protein coding genes.
• The samples for the 1000 Genomes Project are anonymous and have no associated medical or phenotype data.
• The project holds self-reported ethnicity and gender.
• All participants declared themselves to be healthy at the time the samples were collected.
1000 Genomes Project
(http://www.1000genomes.org/)
dbGaP database
• http://www.ncbi.nlm.nih.gov/gap
Maize Genome
• http://www.panzea.org/
Genome-wide Association Study (GWAS)
Goal: Identify genetic variants associated with a phenotype of interest.
Typical (Simple) GWAS Analysis
Preprocessing Data → Univariate Analysis → Controlling FWER or FDR
GWAS Gold Standard: p < 5 × 10⁻⁸
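As a rough illustration of the last step, the sketch below applies a Bonferroni cutoff (FWER control) and the Benjamini-Hochberg step-up rule (FDR control) to a list of per-SNP p-values. The p-values and function names are made up for illustration. The 5 × 10⁻⁸ threshold is commonly motivated as a Bonferroni correction of 0.05 over roughly one million tests.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of tests rejected while controlling the FWER at level alpha."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, q=0.05):
    """Indices of tests rejected while controlling the FDR at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m,
    # then reject the k smallest p-values.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical p-values from five univariate SNP tests.
pvals = [0.001, 0.012, 0.025, 0.045, 0.60]
fwer_hits = bonferroni(pvals)          # single cutoff 0.05 / 5 = 0.01
fdr_hits = benjamini_hochberg(pvals)   # rank-dependent cutoffs 0.01, ..., 0.05
```

With the same nominal level, the FDR rule typically rejects at least as many hypotheses as Bonferroni, which is why it is often preferred when many SNPs are tested.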
Preprocessing Data in GWAS
• SNP Call Rate (98-99%)
• Sample Call Rate (98-99%)
• Data Imputation
• Minor Allele Frequency (remove rare SNPs, e.g., those with frequency < 5%)
• Hardy-Weinberg Equilibrium
• Recode SNPs to the count of minor allele (0, 1, 2)
• For more information, refer to Turner et al. (2011).
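A minimal sketch of these per-SNP checks, assuming genotypes are already coded as allele counts 0/1/2 with `None` for a missing call. The function names and the Hardy-Weinberg p-value cutoff (`min_hwe_p`) are illustrative assumptions, not values taken from the slides.

```python
import math


def call_rate(genotypes):
    """Fraction of non-missing genotype calls."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as allele counts 0/1/2 (None = missing)."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chi2_pvalue(genotypes):
    """1-df chi-square test of Hardy-Weinberg equilibrium."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of the counted allele
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def recode_to_minor_allele(genotypes):
    """Recode so that each entry counts copies of the minor allele."""
    called = [g for g in genotypes if g is not None]
    if sum(called) / (2 * len(called)) > 0.5:
        return [None if g is None else 2 - g for g in genotypes]
    return list(genotypes)


def passes_qc(genotypes, min_call=0.98, min_maf=0.05, min_hwe_p=1e-6):
    """Apply the call-rate, MAF and HWE filters (thresholds are assumptions)."""
    return (call_rate(genotypes) >= min_call
            and minor_allele_frequency(genotypes) >= min_maf
            and hwe_chi2_pvalue(genotypes) >= min_hwe_p)


common_snp = [0] * 49 + [1] * 42 + [2] * 9   # allele frequency 0.3, in HWE
rare_snp = [0] * 98 + [1] * 2                # allele frequency 0.01
```

Here `passes_qc(common_snp)` keeps the SNP, while `passes_qc(rare_snp)` drops it because of the MAF filter.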
Biological Pathways
Databases:
- KEGG (http://www.genome.jp/kegg/pathway.html)
- Ingenuity (http://www.ingenuity.com/)
- etc.
Challenges in GWAS
- Want to incorporate biological pathway information in GWAS
- Want to analyze all SNPs at once
High-dimensional Regression
• Regression with n < p
• Challenges in high-dimensional regression
– Large p small n problem
– Multicollinearity
– Sparsity
Bayesian Model Setup
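Consider a normal regression model:
$$\mathbf{Y} = \mathbf{X}\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)$$
Prior to capture sparsity in the regression coefficients:
$$\beta_j \mid \tau_j \sim (1 - \tau_j)\,\delta_0(\beta_j) + \tau_j\,\gamma_\alpha(\beta_j \mid \sigma)$$
where $\delta_0(\cdot)$ is a Dirac delta function at zero, $\tau_j = 1_{\beta_j \neq 0}$, and
$$\gamma_\alpha(\beta_j \mid \sigma) = \frac{\alpha\sqrt{n-1}}{2\sigma} \exp\left\{ -\frac{\alpha\sqrt{n-1}}{\sigma}\,|\beta_j| \right\}$$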
Bayesian Model Setup
• The Ising model is employed to model the relationships among SNPs.
• The Ising model assumes that these relationships are described by an undirected graph G = (V, E), where V is a set of vertices and E is a set of edges.
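• The Ising prior for $\tau = (\tau_1, \dots, \tau_p)^t$, where $\tau_j = 1_{\beta_j \neq 0}$:
$$P(\tau) = \frac{1}{Z(a, b)} \exp\left\{ a \sum_{j} \tau_j + b \sum_{\langle j,k \rangle \in E} \tau_j \tau_k \right\}$$
[Figure: example undirected graph with nodes $\tau_1, \dots, \tau_5$]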
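One consequence of the Ising prior is that the conditional distribution of each indicator depends only on its neighbors in the graph: the log-odds of τ_j = 1 are a + b times the number of active neighbors. The sketch below evaluates that conditional probability. The 5-node graph, its edges, and the values of a and b are hypothetical and are not taken from the slide's figure.

```python
import math


def ising_conditional(j, tau, neighbors, a, b):
    """P(tau_j = 1 | tau_k for k adjacent to j) under the Ising prior."""
    active = sum(tau[k] for k in neighbors[j])
    return 1.0 / (1.0 + math.exp(-(a + b * active)))


# Hypothetical graph on 5 nodes (0-based indices) given as adjacency lists.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
tau = [1, 1, 0, 0, 0]
a, b = -2.0, 1.5   # a < 0 favors sparsity, b > 0 rewards active neighbors

p_node2 = ising_conditional(2, tau, neighbors, a, b)   # two active neighbors
p_node4 = ising_conditional(4, tau, neighbors, a, b)   # no active neighbors
```

With b > 0, a SNP whose neighbors are already in the model (`p_node2`) gets a higher prior inclusion probability than an isolated one (`p_node4`), which is how pathway structure enters the model.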
The ICM/M Algorithm
• Pungpapong et al. (2015).
• Idea: Each parameter is updated in turn from its conditional distribution given the current values of all the others.
• The ICM/M consists of two main parts:
– Conditional median for each regression coefficient
– Conditional mode for hyperparameters and auxiliary parameters
The ICM/M Algorithm
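• Step 1: Obtain an initial estimate $(\hat\beta, \hat\sigma^2)$.
• Step 2: Obtain $\hat\tau = (\hat\tau_1, \dots, \hat\tau_p)^t$, where $\hat\tau_j = 1_{\hat\beta_j \neq 0}$.
• Step 3: Update the Ising hyperparameters:
$$(\hat a, \hat b) = \mathrm{mode} \prod_{j=1}^{p} P(\hat\tau_j \mid \hat\tau_{-j}; a, b) = \mathrm{mode} \prod_{j=1}^{p} P(\hat\tau_j \mid \{\hat\tau_k : \langle \tau_j, \tau_k \rangle \in E\}; a, b)$$
• Step 4: Update each regression coefficient:
$$\hat\beta_j = \mathrm{median}(\beta_j \mid \mathbf{Y}, \mathbf{X}, \hat\beta_{-j}, \hat\sigma^2, \hat a, \hat b), \quad j = 1, \dots, p,$$
where $\hat\beta_{-j} = (\hat\beta_1, \dots, \hat\beta_{j-1}, \hat\beta_{j+1}, \dots, \hat\beta_p)$.
• Step 5: $\hat\sigma^2 = \mathrm{mode}(\sigma^2 \mid \mathbf{Y}, \mathbf{X}, \hat\beta, \hat a, \hat b)$
• Step 6: Convergence in $\hat\beta$? If yes, stop; otherwise return to Step 2.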
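The update of (a, b) maximizes a product of conditional probabilities, often called a pseudo-likelihood. Under the Ising prior each factor is a logistic function of a + b × (number of active neighbors), so the maximization resembles a small logistic regression. The sketch below covers only this one update, not the full ICM/M algorithm, and uses plain gradient ascent as a stand-in optimizer; the chain graph, the τ vector, and the step size are illustrative assumptions.

```python
import math


def neighbor_sums(tau, edges):
    """s_j = number of active neighbors of node j."""
    s = [0] * len(tau)
    for j, k in edges:
        s[j] += tau[k]
        s[k] += tau[j]
    return s


def log_pseudo_likelihood(a, b, tau, s):
    """sum_j log P(tau_j | neighbors), with logit P(tau_j = 1 | .) = a + b * s_j."""
    total = 0.0
    for t, sj in zip(tau, s):
        eta = a + b * sj
        total += t * eta - math.log1p(math.exp(eta))
    return total


def pseudo_likelihood_mode(tau, edges, step=0.1, n_iter=5000):
    """Gradient ascent on the (concave) log pseudo-likelihood in (a, b)."""
    s = neighbor_sums(tau, edges)
    a = b = 0.0
    for _ in range(n_iter):
        resid = [t - 1.0 / (1.0 + math.exp(-(a + b * sj)))
                 for t, sj in zip(tau, s)]
        a += step * sum(resid)                               # d/da
        b += step * sum(r * sj for r, sj in zip(resid, s))   # d/db
    return a, b


# Hypothetical chain graph on 10 SNPs and a current inclusion vector.
edges = [(i, i + 1) for i in range(9)]
tau = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
a_hat, b_hat = pseudo_likelihood_mode(tau, edges)
```

In this toy example the active indicators cluster along the chain, so the fitted interaction `b_hat` comes out positive.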
Generalized Linear Models (GLMs)
Iteratively Reweighted Least Squares
Extension of the ICM/M to GLMs
• Borrows the idea of iteratively reweighted least squares (IRLS).
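For reference, the sketch below shows plain IRLS for a logistic regression with an intercept and a single SNP, where each iteration solves a weighted least squares problem on a working response. It illustrates IRLS only; how the ICM/M updates are embedded in this loop is not shown in the slides and is not reproduced here. The data and function name are made up for illustration.

```python
import math


def irls_logistic(x, y, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1 * x by iteratively reweighted least squares."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]                 # working weights
        z = [e + (yi - m) / wi                          # working response
             for e, yi, m, wi in zip(eta, y, mu, w)]
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


# Hypothetical case/control data: genotype (0/1/2) and disease status.
genotype = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
status = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
b0_hat, b1_hat = irls_logistic(genotype, status)
```

Because each iteration is an ordinary weighted regression, a method built for the normal linear model can in principle be reused inside the loop, which is the appeal of borrowing IRLS.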
Simulation Studies
• A total of 1,782 SNPs were randomly selected from the Framingham dataset (Cupples et al., 2007).
• 24 human regulatory pathways, involving 1,502 genes, were retrieved from the KEGG database.
• 311 SNPs involved in 5 pathways were assumed to have nonzero effects, with effect sizes randomly generated from Uniform[0.5, 3].
• Phenotypes were simulated from the normal regression model with error variance 5.
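A scaled-down sketch of this kind of simulation is given below. Unlike the study, which used real Framingham genotypes and chose the causal SNPs by pathway membership, this sketch draws independent random genotypes and picks the causal SNPs at random; those choices, the MAF range, and the function name are assumptions made only to keep the example self-contained.

```python
import random


def simulate_phenotype(n, p, n_causal, error_var=5.0, seed=0):
    """Simulate y = X beta + eps with sparse beta and N(0, error_var) noise."""
    rng = random.Random(seed)
    # Genotypes as minor-allele counts (0/1/2); the MAF range is an assumption.
    maf = [rng.uniform(0.05, 0.5) for _ in range(p)]
    X = [[(rng.random() < maf[j]) + (rng.random() < maf[j]) for j in range(p)]
         for _ in range(n)]
    # Causal SNPs chosen at random here (by pathway in the actual study).
    causal = sorted(rng.sample(range(p), n_causal))
    beta = [0.0] * p
    for j in causal:
        beta[j] = rng.uniform(0.5, 3.0)   # effect sizes from Uniform[0.5, 3]
    sd = error_var ** 0.5
    y = [sum(bj * xj for bj, xj in zip(beta, row)) + rng.gauss(0.0, sd)
         for row in X]
    return X, y, beta


X, y, beta = simulate_phenotype(n=200, p=100, n_causal=10)
```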
Simulation Studies
• Results
Method          Prediction Error  False Positive  False Negative
Lasso           30.7 (.41)        .69 (.0004)     .02 (.0004)
Adaptive Lasso  206.2 (.57)       .07 (.0017)     .13 (.0002)
ICM/M           21.7 (.23)        .03 (.0015)     .04 (.0003)
Framingham Data Analysis
• Dataset: Framingham Heart Study (Cupples et al., 2007)
• Phenotype: log transformation of vitamin D level
• Sample size: 952 for the training set and 519 for the test set
• The gene-pathway information relevant to vitamin D level was obtained from the KEGG database.
• There are 84,834 SNPs residing in 2,167 genetic regions across 112 pathways.
• Univariate tests were applied as a screening step, leaving 7,824 SNPs for the analysis.
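The slides do not say which univariate test or cutoff was used for screening. As one plausible version, the sketch below regresses the phenotype on each SNP separately and keeps the SNPs whose slope is significant. The normal approximation to the t distribution, the cutoff `alpha`, and the toy data are all assumptions.

```python
import math


def marginal_t_stat(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def screen_snps(columns, y, alpha=0.05):
    """Keep indices of SNPs whose marginal two-sided p-value is below alpha."""
    kept = []
    for j, x in enumerate(columns):
        t = marginal_t_stat(x, y)
        # Two-sided p-value using a normal approximation (reasonable for large n,
        # crude for the tiny toy sample below).
        p_value = math.erfc(abs(t) / math.sqrt(2))
        if p_value < alpha:
            kept.append(j)
    return kept


# Toy data: 8 individuals, 2 SNPs; only the first SNP tracks the phenotype.
y_toy = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
snp_a = [0, 0, 1, 1, 1, 2, 2, 2]
snp_b = [1, 0, 2, 0, 1, 2, 0, 1]
kept = screen_snps([snp_a, snp_b], y_toy)
```

Screening of this kind only reduces the number of candidate SNPs; the joint model is then fitted on the survivors.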
Framingham Data Analysis
• Prediction errors and no. of identified SNPs
Method          Prediction Error  No. of Identified SNPs
Lasso           .2560             14
Adaptive Lasso  .2085             5
ICM/M           .2121             5
Framingham Data Analysis
Chromosome-SNP            1-3887  4-0894  4-1174  5-2773  8-5143  17-3907  17-9089
β         Lasso           .0412   0       .0355   .0402   0       0        0
          Adaptive Lasso  .1521   0       .0434   .1539   -.0200  0        .0167
          ICM/M           .2417   -.0512  0       .3047   -.0857  .1093    0
P-value*  Lasso           .2694   1       1       .6050   1       1        1
          Adaptive Lasso  .2060   1       1       .0031   1       1        1
          ICM/M           .0837   1       1       .0034   1       1        1
* From the multi-sample split method (Meinshausen et al., 2009)
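The multi-sample split method repeatedly splits the data, selects variables on one half, and computes adjusted p-values on the other half, setting the p-value to 1 whenever a SNP is not selected. The per-split p-values are then combined through a quantile rule. The sketch below shows only that aggregation step for a fixed quantile level `gamma`; the slides do not state which variant was used, and the per-split p-values are made up.

```python
import math


def aggregate_pvalues(split_pvalues, gamma=0.5):
    """Quantile aggregation: min(1, empirical gamma-quantile of {p_b / gamma})."""
    scaled = sorted(p / gamma for p in split_pvalues)
    # Index of the empirical gamma-quantile in the sorted list.
    k = max(math.ceil(gamma * len(scaled)) - 1, 0)
    return min(1.0, scaled[k])


# Hypothetical adjusted p-values over 10 random splits (1 = SNP not selected).
often_selected = [0.001, 0.002, 0.004, 0.001, 0.003, 0.002, 0.005, 1, 1, 1]
rarely_selected = [0.001, 0.002, 0.003, 1, 1, 1, 1, 1, 1, 1]

p_often = aggregate_pvalues(often_selected)
p_rare = aggregate_pvalues(rarely_selected)
```

With `gamma = 0.5`, a SNP that is selected in fewer than half of the splits receives an aggregated p-value of 1, which is consistent with the many entries equal to 1 in the table.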
Parkinson’s Disease Data Analysis
• Data come from 3 different studies on PD:
– Autopsy-Confirmed Parkinson Disease GWAS Consortium (APDGC) (dbGaP Study Accession: phs000394.v1.p1)
– Genome-Wide Association Study of Parkinson Disease: Genes and Environment (dbGaP Study Accession: phs000196.v2.p1)
– NINDS-Genome-Wide Genotyping in Parkinson's Disease: First Stage Analysis and Public Release of Data (n = 1741) (dbGaP Study Accession: phs000089.v3.p2)
• The three data sets were combined, keeping only the overlapping SNPs (n = 6,704, p = 888,398).
Parkinson’s Disease Data Analysis
• Pathways related to PD were retrieved from Ingenuity© IPA.
Parkinson’s Disease Data Analysis
• ICM/M found 46 SNPs with nonzero regression coefficients across 22 chromosomes.
• 8 genes known to be associated with PD were identified (e.g., TLR4, TNF, …).
References
• Cupples, L. A. et al. (2007). The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Medical Genetics, 8(Suppl 1):S1.
• Ho et al. (2011). ChIP-chip versus ChIP-seq: lessons for experimental design and data analysis. BMC Genomics, 12:134.
• Meinshausen et al. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681.
• Pungpapong et al. (2015). Selecting massive variables using an iterated conditional modes/medians algorithm. Electronic Journal of Statistics, 9:1243–1266.
• Turner, S. et al. (2011). Quality control procedures for genome-wide association studies. Curr Protoc Hum Genet 2011;68:1–19.1.18.