Genome wide association studies (A Brief Start) Source: PLoS Comput Biol. 2012 Dec; 8(12): e1002822. Published online 2012 Dec 27. doi: 10.1371/journal.pcbi.1002822 PMCID: PMC3531285 Chapter 11: Genome-Wide Association Studies William S. Bush 1,* and Jason H. Moore 2 Fran Lewitter, Editor and Maricel Kann, Editor And Zhiwu Zhang Lecture and Labs


Genome wide association studies (A Brief Start)

Source: PLoS Comput Biol. 2012 Dec; 8(12): e1002822.

Published online 2012 Dec 27. doi: 10.1371/journal.pcbi.1002822

PMCID: PMC3531285

Chapter 11: Genome-Wide Association Studies

William S. Bush and Jason H. Moore

Fran Lewitter, Editor and Maricel Kann, Editor

And Zhiwu Zhang Lecture and Labs


GWAS

• The idea is an epidemiological study of common diseases using the genome.

• Essentially, a GWAS searches the genome for small variations, called single nucleotide polymorphisms (SNPs), that occur more frequently in people with a particular disease than in people without the disease, or vice versa.

• It then applies significance testing to see whether there is any association between the disease and the location of the genetic variation.

• First we need to understand what a SNP is.


SNP

• Most humans have a genome that is very similar, but there are locations on the genome where there are commonly differences between people.

• SNPs are single base-pair changes in the DNA sequence that occur with high frequency in the human genome.

• SNPs are typically used as markers of a genomic region, with the large majority of them having a minimal impact on biological systems.

• SNPs can have functional consequences, causing amino acid changes, changes to mRNA transcript stability, and changes to transcription factor binding affinity.

• SNPs are by far the most abundant form of genetic variation in the human genome.

• SNPs typically have two alleles, meaning within a population there are two commonly occurring base-pair possibilities for a SNP location.


SNP versus Mutation

• The frequency of a SNP is given in terms of the minor allele frequency, that is, the frequency of the less common allele.

• A SNP whose minor allele G has a frequency of 0.35 implies that 35% of the allele copies in a population are G, versus 65% for the more common allele (the major allele).

• Mutations: rare genetic disorders are largely caused by extremely rare genetic variants that ultimately induce a detrimental change to protein function, which leads to the disease state. Variants with such low frequency in the population are sometimes referred to as mutations, though they can be structurally equivalent to SNPs, that is, single base-pair changes in the DNA sequence.

• In the genetics literature, the term SNP is generally applied to common single base-pair changes, and the term mutation is applied to rare genetic variants.
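
As a small illustration of the minor allele frequency, the sketch below counts allele copies from genotype counts at one biallelic SNP. The function name and the genotype counts are hypothetical, chosen so that the G allele comes out at 0.35; each individual is assumed to carry two allele copies.

```python
def minor_allele_frequency(n_aa, n_ag, n_gg):
    """Return (minor_allele, frequency) for a biallelic SNP with alleles A and G.

    n_aa, n_ag, n_gg are counts of individuals with each genotype
    (hypothetical inputs; every individual contributes two allele copies).
    """
    total_alleles = 2 * (n_aa + n_ag + n_gg)
    freq_g = (2 * n_gg + n_ag) / total_alleles  # GG carries two G copies, AG carries one
    freq_a = 1.0 - freq_g
    # The minor allele is whichever allele is less common
    return ("G", freq_g) if freq_g <= freq_a else ("A", freq_a)


# Made-up sample of 1000 people: 700 of the 2000 allele copies are G
allele, maf = minor_allele_frequency(n_aa=425, n_ag=450, n_gg=125)
print(allele, maf)
```

Counting allele copies rather than people is what makes the frequencies of the two alleles sum to 1.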


SNP and GWAS

• GWAS examine SNPs across the genome and represent a promising way to study complex, common diseases in which many genetic variations contribute to a person’s risk.

• This approach has already identified SNPs related to several complex conditions including diabetes, heart abnormalities, Parkinson disease, and Crohn disease.

• There is hope that as we do more studies we will understand more common diseases.


CV/CD hypothesis

• This hypothesis (common disease/common variant) states that common disorders are likely influenced by common genetic variation.

• If common genetic variants influence disease, the effect size for any one variant must be small relative to that found for rare disorders.

• If common disorders show heritability (inheritance in families), then multiple common alleles must influence disease susceptibility. As such, the total genetic risk due to common genetic variation must be spread across multiple genetic factors.

• These two points suggest that traditional family-based genetic studies are not likely to be successful for complex diseases, prompting a shift toward population-based studies.


The HapMap Project

• We need to know where the SNPs occur and with what density.

• We also need to figure out how SNPs differ among populations of different ancestry.

• Hence, the International HapMap Project was launched to catalog common SNPs and how they vary across populations.

• Identified 500,000 SNPs for people of European descent.


LD: Linkage Disequilibrium

• LD is the property of an allele at one SNP being correlated with an allele at another SNP along a contiguous stretch of the genome.

• When all alleles are independent we have linkage equilibrium; when they are dependent, we call it LD.

• Common measures are D′ and r², both defined from allele and haplotype proportions.

• The idea is that causality is almost impossible to prove in these studies because of the small effect sizes and indirect associations. Hence, large-scale studies are required.
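
As a rough illustration of how LD between two SNPs can be quantified, the sketch below computes the disequilibrium coefficient D, its normalized form D′, and r² from allele and haplotype proportions. The function name and the example frequencies are hypothetical, and both SNPs are assumed to be polymorphic (allele frequencies strictly between 0 and 1).

```python
def ld_measures(p_a, p_b, p_ab):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    p_a  : frequency of allele A at the first SNP
    p_b  : frequency of allele B at the second SNP
    p_ab : frequency of the haplotype carrying both A and B
    Assumes 0 < p_a < 1 and 0 < p_b < 1.
    """
    # D is zero when the alleles are independent (linkage equilibrium)
    d = p_ab - p_a * p_b
    # Largest |D| possible given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two alleles
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Made-up frequencies: the AB haplotype is more common than independence predicts
print(ld_measures(p_a=0.5, p_b=0.5, p_ab=0.4))
# Under linkage equilibrium p_ab equals p_a * p_b, so all three measures are zero
print(ld_measures(p_a=0.5, p_b=0.5, p_ab=0.25))
```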


Genotyping Technology

• Two primary platforms have been used for most GWAS: products from Illumina (San Diego, CA) and Affymetrix (Santa Clara, CA).

• The Affymetrix platform prints short DNA sequences as a spot on the chip that recognizes a specific SNP allele. Alleles (i.e. nucleotides) are detected by differential hybridization of the sample DNA.

• Illumina, on the other hand, uses a bead-based technology with slightly longer DNA sequences to detect alleles. The Illumina chips are more expensive to make but provide better specificity.

• A chip with more SNPs and better overall genomic coverage is needed for a study of Africans than for a study of Europeans. This is because African genomes have had more time to recombine and therefore have less LD between alleles at different SNPs. More SNPs are needed to capture the variation across the African genome.

• Next-generation sequencing methods will provide all the DNA sequence variation in the genome. It is time now to retool for this new onslaught of data.


Design

• The most common designs are:

• Case-control (binary response)

• Quantitative (continuous response)

• The quantitative design is easier: it uses ANOVA-like methods on the presence or absence of each SNP allele (the response can be HDL, LDL, or anything else that is measured).

• For yes/no phenotypes we can use 2 by 2 tables with a chi-square test, or logistic regression. This study type asks whether the allele of a genetic variant is found more often than expected in individuals with the phenotype of interest (e.g. with the disease being studied).

• Early calculations on statistical power indicated that this approach could be better than linkage studies at detecting weak genetic effects.


Common Data

• The most common approach in GWA studies is the case-control setup, which compares two large groups of individuals: one healthy control group and one case group affected by a disease.

• For each SNP, it is then investigated whether the allele frequency differs significantly between the case and the control group.

• In such setups, the fundamental unit for reporting effect sizes is the odds ratio.

• If the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency.

• Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test.

• Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.
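
To make the odds ratio concrete, here is a minimal sketch that computes it from allele counts in cases and controls. The function name and the counts are hypothetical. The approximate 95% interval uses the standard error of the log odds ratio (Woolf's method), which is an addition for illustration and not something the slides specify.

```python
import math


def odds_ratio(case_with, case_without, ctrl_with, ctrl_without):
    """Odds ratio for carrying an allele in cases versus controls.

    Returns (odds_ratio, lower_95, upper_95). All four counts must be > 0.
    The interval is the usual large-sample approximation on the log scale.
    """
    or_ = (case_with * ctrl_without) / (case_without * ctrl_with)
    se_log = math.sqrt(1 / case_with + 1 / case_without
                       + 1 / ctrl_with + 1 / ctrl_without)
    lower = math.exp(math.log(or_) - 1.96 * se_log)
    upper = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lower, upper


# Made-up counts: the allele is more frequent among cases, so the OR exceeds 1
print(odds_ratio(case_with=300, case_without=700, ctrl_with=200, ctrl_without=800))
```

An interval that excludes 1 points the same way as a small chi-squared P-value for that SNP.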


Data

• The most common type of data appears to be in the form of 2 by 2 tables.

• Let's say we have two groups, disease and not disease, and we are focusing on the presence and absence of the G allele.

• Essentially, calculate the chi-square test for all the SNPs.

          Disease   Not disease
G         2000      8000
Not G     8000      2000
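
The chi-square test for a single 2 by 2 table can be sketched as follows, using the counts from the table above. The function name is hypothetical. No continuity correction is applied, and the P-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table

                 Disease   Not disease
        G           a          b
        Not G       c          d

    Returns (chi2, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = chi_square_2x2(2000, 8000, 8000, 2000)
print(chi2, p)  # chi2 is 7200.0; p is too small to represent and prints as 0.0
```

In a real GWAS this calculation is repeated for every SNP, so the resulting P-values have to be judged against a threshold that accounts for the number of tests.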


Other types of Data

• Instead of being Disease or Not Disease, the phenotype could be a measure of a trait, such as height or biomass.

• In that case we model the data as a linear model:

• Y = SNP effect + error

• And perform an ANOVA-type analysis.

• However, there are other contributing factors to the model, and Dr. Zhiwu Zhang talked to us about these kinds of models.
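
A minimal sketch of the single-SNP linear model (Y = SNP effect + error) is given below. It assumes an additive coding of the genotype as 0, 1, or 2 copies of an allele, which the slides do not specify, and the function name and data are hypothetical. It returns the estimated SNP effect and its t-statistic; turning the t-statistic into a P-value would need a t distribution, which is left out here.

```python
import math


def snp_linear_test(genotypes, trait):
    """Regress a quantitative trait on one SNP coded 0/1/2 (assumed additive).

    Returns (slope, intercept, t_statistic). Needs at least three
    observations, variation in the genotypes, and a non-perfect fit.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    slope = sxy / sxx                      # estimated SNP effect per allele copy
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, trait))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / se_slope


# Made-up data: the trait rises by about one unit per copy of the allele
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
print(snp_linear_test(genotypes, trait))
```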


Fixed and Random Effect Models

• GLM for GWAS

Y = SNP + Q (or PCs) + e

(SNP and Q are fixed effects)

• MLM for GWAS

Y = SNP + Q (or PCs) + Kinship + e

(SNP and Q are fixed effects; Kinship is a random effect)
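
The fixed-effect part of these models can be sketched with ordinary least squares, here with one SNP and one population-structure covariate standing in for Q (or a PC). Everything named below is hypothetical, and the data are constructed without noise so the known coefficients are recovered. The kinship random effect of the MLM is not modeled; fitting it requires a mixed-model solver that is beyond this sketch.

```python
def ols(X, y):
    """Least-squares coefficients for y = X b + e via the normal equations.

    X is a list of rows (include a column of 1s for the intercept).
    Solved by Gauss-Jordan elimination; assumes X has full column rank.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up data: genotype coded 0/1/2 and one principal-component score
snp = [0, 1, 2, 0, 1, 2, 1, 0]
pc = [0.5, -0.2, 0.1, -0.4, 0.3, -0.3, 0.0, 0.2]
y = [1 + 2 * s + 3 * q for s, q in zip(snp, pc)]  # noise-free for illustration
X = [[1.0, s, q] for s, q in zip(snp, pc)]
print(ols(X, y))  # approximately [1, 2, 3]: intercept, SNP effect, PC effect
```

Including Q or PCs as covariates is what keeps population structure from being mistaken for a SNP effect.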


GLM to GLiM

• The mixed model is a better approach because it lets us model the systematic variation in the data better.

• However, it has been studied in depth only for continuous responses, and much less for binary or categorical responses.

• Hence, the direction is going from general linear mixed models to generalized linear mixed models, using logistic regression.

• logit P(Y=1 | X’s) = SNP + Q + K, where we incorporate a fixed and a random effect in the model.
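
To show what the logistic link does, the sketch below fits only the fixed SNP term of such a model, logit P(Y=1 | x) = b0 + b1·x, by Newton-Raphson on grouped counts. The Q and K terms are omitted, so this is plain logistic regression and not a mixed model; the function name is hypothetical, and the counts are those of the 2 by 2 table on the Data slide (x = 1 for G, x = 0 for Not G).

```python
import math


def fit_logistic_grouped(groups, n_iter=25):
    """Fit logit P(Y=1 | x) = b0 + b1 * x by Newton-Raphson.

    groups: list of (x, cases, total) tuples, one per distinct x value.
    Returns (b0, b1). Only fixed effects; no Q or kinship term.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # information matrix entries
        for x, cases, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = cases - total * p
            w = total * p * (1.0 - p)
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# x = 1 (G): 2000 diseased of 10000; x = 0 (Not G): 8000 diseased of 10000
b0, b1 = fit_logistic_grouped([(1, 2000, 10000), (0, 8000, 10000)])
print(b0, b1, math.exp(b1))  # exp(b1) is the odds ratio for G, about 0.0625
```

With a single binary predictor the fitted slope is the log odds ratio of the 2 by 2 table, which ties the logistic model back to the case-control analysis.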