Single Nucleotide Polymorphism Analysis (SNPs)


Asst. Prof. Vitara Pungpapong, Ph.D.
Department of Statistics
Faculty of Commerce and Accountancy
Chulalongkorn University


Outline

• What is a SNP array?

• Typical SNP analysis

• Challenges

• The ICM/M Method

• Results


Microarray

• Usually known as ChIP-chip.

• First publication in 1999

• Each known gene is one spot on the chip.

• Laser-induced fluorescence (LIF) is used to obtain the color and intensity of each gene.

• Varying colors show varying levels of gene activity.

• A microarray chip can contain 10,000–20,000 genes.


Single Nucleotide Polymorphism

• Usually called ChIP-seq or SNP.


(https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)

Microarray vs SNPs

• Microarray is more suitable for small genomes

• More bias in microarray

• SNPs generally produce profiles with a better signal-to-noise ratio and allow detection of more peaks and narrower peaks.

• SNPs generate more high-throughput data (> 1 Tb), which requires more effort in analysis.


1000 Genomes Project

• http://www.1000genomes.org/

• The 1000 Genomes Project provides the largest public catalog of human genetic variation.

• The project ran from 2008 and was completed in 2015.

• The human genome consists of approximately 3 billion DNA base pairs and is estimated to carry around 20,000 protein-coding genes.

• The samples for the 1000 Genomes Project are anonymous and have no associated medical or phenotype data.

• The project holds self-reported ethnicity and gender.

• All participants declared themselves to be healthy at the time the samples were collected.


1000 Genomes Project


(http://www.1000genomes.org/)

dbGaP database


• http://www.ncbi.nlm.nih.gov/gap

Maize Genome

• http://www.panzea.org/


Genome-wide Association Study (GWAS)

Goal: Identify genetic variants associated with a phenotype of interest.

Typical (Simple) GWAS Analysis


Preprocessing Data

Univariate Analysis

Controlling FWER or FDR

GWAS gold standard: p-value threshold of 5 × 10^-8
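
The last step of the pipeline can be sketched as follows, assuming the per-SNP p-values from the univariate analysis are already available. The p-values below are made up for illustration, and the function names are hypothetical. Bonferroni controls the FWER and Benjamini-Hochberg controls the FDR; the conventional 5 × 10^-8 threshold corresponds to a Bonferroni correction at level 0.05 over roughly one million tests.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return indices rejected under Bonferroni control of the FWER."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, q=0.05):
    """Return indices rejected under Benjamini-Hochberg control of the FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values from per-SNP univariate tests (illustration only).
pvals = [1e-9, 0.004, 0.012, 0.03, 0.2, 0.5, 0.8, 0.9]
fwer_hits = bonferroni(pvals)          # SNP indices significant under FWER control
fdr_hits = benjamini_hochberg(pvals)   # SNP indices significant under FDR control

# Bonferroni at 0.05 over one million tests gives the conventional threshold.
gwas_threshold = 0.05 / 1_000_000
```

FDR control is less conservative than FWER control, so it never rejects fewer hypotheses than Bonferroni at the same level.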

Preprocessing Data in GWAS

• SNP Call Rate (98-99%)

• Sample Call Rate (98-99%)

• Data Imputation

• Minor Allele Frequency (Remove rare SNPs, e.g., <5% frequency)

• Hardy-Weinberg Equilibrium

• Recode SNPs to the count of minor allele (0, 1, 2)

• For more information, refer to Turner et al. (2011).
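
A minimal sketch of the SNP-level filters listed above, assuming genotypes are stored as allele counts (0, 1, 2) with None for a missing call. The function names and the Hardy-Weinberg p-value cutoff (1e-6) are assumptions for illustration; the call-rate and MAF cutoffs follow the values on the slide. Production analyses would normally rely on dedicated GWAS software rather than hand-written filters.

```python
import math


def call_rate(genotypes):
    """Fraction of non-missing calls; missing genotypes are None."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as counts (0, 1, 2) of one allele."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chi_square_pvalue(genotypes):
    """One-degree-of-freedom chi-square test of Hardy-Weinberg equilibrium."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)  # frequency of the counted allele
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [called.count(k) for k in (0, 1, 2)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def recode_to_minor_allele(genotypes):
    """Flip the coding when needed so that values count the minor allele."""
    called = [g for g in genotypes if g is not None]
    if sum(called) / (2 * len(called)) > 0.5:
        return [None if g is None else 2 - g for g in genotypes]
    return list(genotypes)


def passes_qc(genotypes, min_call_rate=0.98, min_maf=0.05, min_hwe_p=1e-6):
    """Apply the call-rate, MAF, and HWE filters in sequence."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf
            and hwe_chi_square_pvalue(genotypes) >= min_hwe_p)


# Toy SNPs (illustration only): one common variant, one rare variant.
common_snp = [0] * 60 + [1] * 32 + [2] * 6 + [None] * 2
rare_snp = [0] * 97 + [1] * 3
keep_common = passes_qc(common_snp)
keep_rare = passes_qc(rare_snp)
```

Sample call rate can be checked the same way by applying call_rate to each individual's genotypes across SNPs.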


Biological Pathways


Databases:

- KEGG (http://www.genome.jp/kegg/pathway.html)

- Ingenuity (http://www.ingenuity.com/)

- etc.

Challenges in GWAS


- Want to incorporate biological pathway information in GWAS

- Want to analyze all SNPs at once

High-dimensional Regression

• Regression with n < p

• Challenges in high-dimensional regression

– Large p small n problem

– Multicollinearity

– Sparsity


Bayesian Model Setup


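Consider a normal regression model:

$$\mathbf{Y} = \mathbf{X}\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)$$

Prior to capture sparsity in the regression coefficients:

$$\beta_j \mid \tau_j \sim (1 - \tau_j)\,\delta_0(\beta_j) + \tau_j\,\gamma_\alpha(\beta_j \mid \sigma),$$

where $\delta_0(\cdot)$ is a Dirac delta function at zero, $\tau_j = 1(\beta_j \neq 0)$, and

$$\gamma_\alpha(\beta_j \mid \sigma) = \frac{\alpha\sqrt{n-1}}{2\sigma} \exp\left(-\frac{\alpha\sqrt{n-1}}{\sigma}\,|\beta_j|\right).$$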
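
A small sketch of the spike-and-slab prior, written under assumptions: the argument `rate` is a hypothetical name standing in for the constant that multiplies |β_j|/σ in the slab density, and the numbers in the example are arbitrary. A coefficient with τ_j = 0 is exactly zero; a coefficient with τ_j = 1 is drawn from a Laplace (double exponential) slab.

```python
import math
import random


def slab_density(beta, sigma, rate):
    """Laplace slab: rate / (2 sigma) * exp(-rate * |beta| / sigma)."""
    return rate / (2 * sigma) * math.exp(-rate * abs(beta) / sigma)


def sample_beta(tau, sigma, rate, rng):
    """Draw beta_j given tau_j: exactly 0 if tau_j = 0, Laplace otherwise."""
    if tau == 0:
        return 0.0
    # A Laplace draw is a random sign times an exponential magnitude.
    magnitude = rng.expovariate(rate / sigma)
    sign = rng.choice((-1.0, 1.0))
    return sign * magnitude


rng = random.Random(1)
tau = [1, 0, 0, 1, 0]  # hypothetical inclusion indicators
beta = [sample_beta(t, sigma=1.0, rate=2.0, rng=rng) for t in tau]
```

The point mass at zero is what makes the prior sparse: any coefficient whose indicator is 0 contributes nothing to the regression.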




• The Ising model is employed to model the relationships among SNPs.

• The Ising model assumes that the relationship lies in an undirected graph G = (V, E) where V is a set of vertices and E is a set of edges.

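• The Ising prior for $\tau = (\tau_1, \ldots, \tau_p)^t$, where $\tau_j = 1(\beta_j \neq 0)$:

$$P(\tau) = \frac{1}{Z(a, b)} \exp\left( a \sum_{j} \tau_j + b \sum_{\langle j,k \rangle \in E} \tau_j \tau_k \right)$$

[Figure: example undirected graph on the vertices $\tau_1, \ldots, \tau_5$]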
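
A short sketch of what the Ising prior implies for a single indicator, assuming τ_j takes values in {0, 1}. Because the normalizing constant Z(a, b) cancels, the conditional probability that τ_j = 1 given the rest is a logistic function of a plus b times the number of active neighbours. The five-vertex edge list and the values of a and b are hypothetical, not taken from the slide's figure.

```python
import math


def ising_log_weight(tau, edges, a, b):
    """Unnormalized log-probability: a * sum_j tau_j + b * sum_<j,k> tau_j * tau_k."""
    return a * sum(tau) + b * sum(tau[j] * tau[k] for j, k in edges)


def conditional_prob_one(j, tau, edges, a, b):
    """P(tau_j = 1 | tau_-j): logistic in the number of active neighbours."""
    active = sum(tau[k] if u == j else tau[u]
                 for u, k in edges if j in (u, k))
    return 1.0 / (1.0 + math.exp(-(a + b * active)))


def brute_force_conditional(j, tau, edges, a, b):
    """Same probability from the joint weights; Z(a, b) cancels in the ratio."""
    on, off = list(tau), list(tau)
    on[j], off[j] = 1, 0
    w_on = math.exp(ising_log_weight(on, edges, a, b))
    w_off = math.exp(ising_log_weight(off, edges, a, b))
    return w_on / (w_on + w_off)


# Hypothetical undirected graph on five vertices (0-based indices).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
tau = [1, 0, 1, 1, 0]
a, b = -1.0, 0.8
probs = [conditional_prob_one(j, tau, edges, a, b) for j in range(len(tau))]
```

With b > 0, a SNP whose neighbours in the pathway graph are selected becomes more likely to be selected itself, which is how pathway information enters the prior.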

The ICM/M Algorithm

• Pungpapong et al. (2015).

• Idea: each parameter is estimated from its conditional distribution given the others.

• The ICM/M consists of two main parts:

– Conditional median for each regression coefficient

– Conditional mode for hyperparameters and auxiliary parameters


The ICM/M Algorithm


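Obtain initial estimates $(\hat\beta, \hat\sigma^2)$

Obtain $\hat\tau = (\hat\tau_1, \ldots, \hat\tau_p)^t$

$$(\hat a, \hat b) = \operatorname{mode} \prod_{j=1}^{p} P(\hat\tau_j \mid \hat\tau_{-j}; a, b) = \operatorname{mode} \prod_{j=1}^{p} P(\hat\tau_j \mid \{\hat\tau_k : \langle \tau_j, \tau_k \rangle \in E\}; a, b)$$

$$\hat\beta_j = \operatorname{median}(\beta_j \mid \mathbf{Y}, \mathbf{X}, \hat\beta_{-j}, \hat\sigma^2, \hat a, \hat b), \qquad j = 1, \ldots, p,$$

where $\hat\beta_{-j} = (\hat\beta_1, \ldots, \hat\beta_{j-1}, \hat\beta_{j+1}, \ldots, \hat\beta_p)$.

$$\hat\sigma^2 = \operatorname{mode}(\sigma^2 \mid \mathbf{Y}, \mathbf{X}, \hat\beta, \hat a, \hat b)$$

Convergence in $\hat\beta$? If yes, stop; otherwise repeat the updates.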
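
The hyperparameter step can be illustrated in isolation. The sketch below is not the authors' implementation: it assumes τ_j in {0, 1}, treats the product of conditionals as a pseudo-likelihood, and maximizes its logarithm over (a, b) by plain gradient ascent. Under the Ising prior each conditional is logistic in the number of active neighbours, so this amounts to a logistic regression of τ_j on that count. The graph, the indicator vector, the step size, and the function names are all hypothetical.

```python
import math


def neighbour_sums(tau, edges):
    """For each j, the number of neighbours k with tau_k = 1."""
    sums = [0] * len(tau)
    for j, k in edges:
        sums[j] += tau[k]
        sums[k] += tau[j]
    return sums


def log_pseudo_likelihood(a, b, tau, sums):
    """Sum over j of log P(tau_j | neighbours; a, b) under the Ising prior."""
    total = 0.0
    for t, s in zip(tau, sums):
        eta = a + b * s
        total += t * eta - math.log1p(math.exp(eta))
    return total


def fit_ising_hyperparameters(tau, edges, steps=5000, lr=0.05):
    """Gradient ascent on the concave log pseudo-likelihood in (a, b)."""
    sums = neighbour_sums(tau, edges)
    a = b = 0.0
    for _ in range(steps):
        # Residuals tau_j - P(tau_j = 1 | neighbours) drive both gradients.
        resid = [t - 1.0 / (1.0 + math.exp(-(a + b * s)))
                 for t, s in zip(tau, sums)]
        a += lr * sum(resid)
        b += lr * sum(r * s for r, s in zip(resid, sums))
    return a, b


# Hypothetical graph on eight SNP indicators (0-based indices).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (0, 2), (5, 7)]
tau_hat = [1, 1, 1, 0, 0, 1, 0, 1]
a_hat, b_hat = fit_ising_hyperparameters(tau_hat, edges)
```

A fixed step size is used only to keep the sketch short; it can diverge on larger graphs, and if the indicators are perfectly separated by the neighbour counts the maximizer does not exist, so a practical routine would need a safeguarded optimizer.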


Generalized Linear Models (GLMs)


Iteratively Reweighted Least Squares


Extension of the ICM/M to GLMs

• Borrow the idea of iteratively reweighted least squares (IRLS).
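
How IRLS is embedded in the ICM/M is not shown on these slides, so the sketch below illustrates only IRLS itself, for a logistic regression of a binary phenotype on a single SNP with an intercept. Each iteration forms working weights and a working response from the current fit and then solves a weighted least-squares problem. The data and the function name are made up for illustration.

```python
import math


def irls_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by iteratively reweighted least squares."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]            # working weights
        z = [e + (yi - m) / wi                     # working response
             for e, yi, m, wi in zip(eta, y, mu, w)]
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


# Toy data: 10 individuals per genotype with 2, 5, and 8 cases respectively.
x = [0] * 10 + [1] * 10 + [2] * 10
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
b0_hat, b1_hat = irls_logistic(x, y)
```

Each pass is an ordinary weighted linear regression, which is why a method built for the normal model can be reused inside the loop.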


Simulation Studies

• A total of 1,782 SNPs were randomly selected from the Framingham dataset (Cupples et al., 2007).

• 24 human regulatory pathways, involving 1,502 genes, were retrieved from the KEGG database.

• 311 SNPs involved in 5 pathways were assumed to have nonzero effects, with effect sizes randomly generated from Uniform[0.5, 3].

• Phenotypes were simulated from the normal regression model with error variance 5.
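
The phenotype-generation step can be sketched on a toy scale. The effect-size range and error variance follow the setup above; everything else is an assumption for illustration: random genotypes stand in for the Framingham SNPs, the causal SNPs are picked at random rather than through pathways, and the sizes are far smaller than in the study.

```python
import random


def simulate_phenotype(genotypes, causal, rng, low=0.5, high=3.0, error_var=5.0):
    """Simulate y = X beta + eps with eps ~ N(0, error_var)."""
    p = len(genotypes[0])
    beta = [0.0] * p
    for j in causal:
        beta[j] = rng.uniform(low, high)  # nonzero effects from Uniform[low, high]
    sd = error_var ** 0.5
    y = [sum(b * g for b, g in zip(beta, row)) + rng.gauss(0.0, sd)
         for row in genotypes]
    return beta, y


rng = random.Random(2015)
n, p = 50, 20                          # toy sizes only
# Random genotype counts (0, 1, 2) replace the real SNP data here.
X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
causal = rng.sample(range(p), 4)       # hypothetical causal SNP indices
beta, y = simulate_phenotype(X, causal, rng)
```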


Simulation Studies

• Results


| Method | Prediction Error | False Positive | False Negative |
|---|---|---|---|
| Lasso | 30.7 (.41) | .69 (.0004) | .02 (.0004) |
| Adaptive Lasso | 206.2 (.57) | .07 (.0017) | .13 (.0002) |
| ICM/M | 21.7 (.23) | .03 (.0015) | .04 (.0003) |

Framingham Data Analysis

• Dataset: Framingham Heart Study (Cupples et al., 2007)

• Phenotype: log transformation of vitamin D level

• Sample size: 952 for training set and 519 for test set

• The gene-pathway information relevant to vitamin D level is obtained from the KEGG database

• There are 84,834 SNPs residing in 2,167 genetic regions across 112 pathways.

• Univariate tests were applied as a screening step, leaving 7,824 SNPs for the analysis.



• Prediction errors and no. of identified SNPs


| Method | Prediction Error | No. of Identified SNPs |
|---|---|---|
| Lasso | .2560 | 14 |
| Adaptive Lasso | .2085 | 5 |
| ICM/M | .2121 | 5 |


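Chromosome - SNP

| Quantity | Method | 1-3887 | 4-0894 | 4-1174 | 5-2773 | 8-5143 | 17-3907 | 17-9089 |
|---|---|---|---|---|---|---|---|---|
| β | Lasso | .0412 | 0 | .0355 | .0402 | 0 | 0 | 0 |
| β | Adaptive Lasso | .1521 | 0 | .0434 | .1539 | -.0200 | 0 | .0167 |
| β | ICM/M | .2417 | -.0512 | 0 | .3047 | -.0857 | .1093 | 0 |
| P-value* | Lasso | .2694 | 1 | 1 | .6050 | 1 | 1 | 1 |
| P-value* | Adaptive Lasso | .2060 | 1 | 1 | .0031 | 1 | 1 | 1 |
| P-value* | ICM/M | .0837 | 1 | 1 | .0034 | 1 | 1 | 1 |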


* From the multi-sample split method (Meinshausen et al., 2009)
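
The aggregation step of the multi-sample split method can be sketched for a fixed quantile level γ: the per-split adjusted p-values of one SNP are divided by γ, their empirical γ-quantile is taken, and the result is capped at 1. A SNP that is not selected in a given split receives p-value 1 for that split, which is why SNPs selected in only a few splits end up with an aggregated p-value of 1. The per-split values, the quantile convention, and the function name below are assumptions for illustration.

```python
import math


def aggregate_pvalue(split_pvalues, gamma=0.5):
    """Q(gamma) = min(1, empirical gamma-quantile of {p_b / gamma})."""
    scaled = sorted(p / gamma for p in split_pvalues)
    # Empirical quantile: smallest value with at least a gamma share at or below it.
    index = max(math.ceil(gamma * len(scaled)) - 1, 0)
    return min(1.0, scaled[index])


# Hypothetical adjusted p-values of two SNPs over B = 5 random splits.
often_selected = [0.001, 0.002, 0.0015, 1.0, 0.003]
rarely_selected = [1.0, 1.0, 0.01, 1.0, 1.0]
q_often = aggregate_pvalue(often_selected)
q_rare = aggregate_pvalue(rarely_selected)
```

With γ = 0.5 the aggregate is twice the median of the per-split p-values, capped at 1.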

Parkinson’s Disease Data Analysis

• Data come from three different studies on PD:

– Autopsy-Confirmed Parkinson Disease GWAS Consortium (APDGC) (dbGaP Study Accession: phs000394.v1.p1)

– Genome-Wide Association Study of Parkinson Disease: Genes and Environment (dbGaP Study Accession: phs000196.v2.p1)

– NINDS-Genome-Wide Genotyping in Parkinson's Disease: First Stage Analysis and Public Release of Data (n = 1,741) (dbGaP Study Accession: phs000089.v3.p2)

• The three data sets were combined, keeping the overlapping SNPs (n = 6,704, p = 888,398).



• Pathways related to PD were retrieved from Ingenuity© IPA.



• ICM/M found 46 SNPs having nonzero regression coefficients across 22 chromosomes.

• 8 genes known to be associated with PD were identified (e.g., TLR4, TNF, …).


References

• Cupples, L. A. et al. (2007). The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Medical Genetics, 8(Suppl 1):S1.

• Ho et al. (2011). ChIP-chip versus ChIP-seq: lessons for experimental design and data analysis. BMC Genomics, 12:134.

• Meinshausen et al. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681.

• Pungpapong et al. (2015). Selecting massive variables using an iterated conditional modes/medians algorithm. Electronic Journal of Statistics, 9:1243–1266.

• Turner, S. et al. (2011). Quality control procedures for genome-wide association studies. Current Protocols in Human Genetics, 68:1.19.1–1.19.18.


