SNP Genome-Wide Association Tutorial - Golden Helix, Inc › SVS › tutorials › snp_gwas › snp_gwas.pdf · SNP Genome-Wide Association Tutorial, Release 8.7.0 Updated: February

SNP Genome-Wide Association TutorialRelease 8.7.0

Golden Helix, Inc.

Feb 12, 2019

Contents

1. Sample QA - I: Basics 2A. Open Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2B. Calculate Sample Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2B. Filtering Samples with Low Call Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4C. Genotype Gender Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5D. Apply Filtering Results to Genotype Spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2. Sample QA - II: Cryptic Relatedness 9

3. Sample QA - III: Population Stratification 14

4. SNP Quality Assurance 18

5. Genotype Association Analysis 22A. Genotype Association Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22B. Generating Q-Q Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24C. Generating P-Value Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24D. Creating a Manhattan Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27E. Saving Plots as Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6. Genotypic Regression and Correcting for Covariates 33A. The Possibly Confounding Covariate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33B. Genotypic Regression–Full Model Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33C. Genotypic Regression–Full vs. Reduced Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

7. Regressing on Covariate/Column Interactions 42A. The Covariate/Column Interaction Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42B. Other Possible Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

i

SNP Genome-Wide Association Tutorial, Release 8.7.0

Updated: February 12, 2019

Level: Fundamentals

Version: 8.8.3 or higher

Product: SVS

The following tutorial is designed to systematically introduce you to a number of techniques for genome-wide associ-ation studies. It is not meant to replicate all the workflows you might use in a complete analysis, but instead touch ona sampling of the more typical scenarios you may come across in your own studies.

The genotype data included is a portion of a public GWAS dataset from the Gene Expression Omnibus database, aswell as 270 HapMap samples. There is a population spreadsheet that identifies the HapMap subpopulation and thestudy data. Both studies were run on the Affymetrix 500K genotyping array. All phenotype data is simulated.

Requirements

To follow along you will need to download and unzip the following file:

Download

SNP_GWAS_Tutorial.zip

We hope you enjoy the experience and look forward to your feedback.

All open windows (except the Project Navigator) can be closed after each section’s completion.

Contents 1

http://doc.goldenhelix.com/SVS/tutorials/snp_gwas/SNP_GWAS_Tutorial.zip

1. Sample QA - I: Basics

A. Open Project

For genetic association analysis in SVS 8 you will need, at minimum, three types of data: phenotype data, genotypedata, and a genetic marker map for the genotype data. Additional preparatory steps ensure that the analysis runscorrectly for your study.

You need to have a project open before you can import data or perform analyses.

• From the Welcome Screen select File > Open Project. Locate the Golden Helix project fileSNP_GWAS_Tutorial.ghp and select it, then click Open. This will open the Project Navigator.

• Select Tools > Current Project’s Options and confirm that the correct default genome assembly is selected.For the data in this tutorial, the genome assembly should be set to Homo sapiens (Human), GRCh37 (hg19) (Feb2009). This option primarily affects the genomic view and which data sources are available in the integratedGenomeBrowse plotting interface.

There should be four dataset nodes visible in the project: Phenotype, Population, 500K Geno Training Data, and500K Geno HapMap Data.

• Double-click on 500K Geno Training Data - Sheet 1. The spreadsheet contains the genotype information forthe study sample data. There are total of 565 study samples with 498,784 SNPs as indicated in the top rightportion of the window (rows x columns).

• Click on the green map button in the top-left corner of the spreadsheet to view the Marker Map information andother marker information included with the Marker Map file (Figure 1-1).

The following sections lead you through quality assurance procedures performed in genome-wide association studiesto identify samples of poor quality (call rates, heterozygosity, etc.) and those whose identity is of question (mismatchedgender, ethnicity different than intended for the study, cryptically related, etc.). In some cases we’ll automatically filtersamples; in others we’ll just identify samples for possible exclusion.

All of the sample statistics can be calculated with one tool. In the first step you will create a master sample statisticsspreadsheet. Then you will use this spreadsheet to filter out samples based on a variety of undesirable characteristics.

B. Calculate Sample Statistics

Use the Genotype Statistics by Sample tool to calculate sample call rates and heterozygosity rates over the entiregenome and over autosomes only.

• in the 500K Geno Training Data - Sheet 1 choose Genotype > Genotype Statistics by Sample.

2


Figure 1-1. Mapped genotype spreadsheet

B. Calculate Sample Statistics 3


• Under Gender Chromosome Statistics, check Gender inference: and choose X from the dropdown. Leavethe default threshold value. Under Additional Outputs (Verbose Output), check Output count and variantstatistics for each autosomal chromosome. The window should match Figure 1-2. Click Run.

Figure 1-2. Genotype Statistics by Sample dialog

Two output spreadsheets are created containing the statistics. The first spreadsheet, Statistics by Sample, containsthe call rate and the X-chromosome heterozygosity rate. The second spreadsheet, Autosome Statistics by Sample,contains the statistics calculated separately for each autosome.

B. Filtering Samples with Low Call Rate

First filter out samples with low call rates. The sample call rate is defined as the fraction of called SNPs per sampleover the total number of SNPs in the dataset. A standard quality threshold for excluding samples with a low call rate is95%. It is important to not include the Y chromosome when calculating per sample call rates. It is up to the researcherwhether or not the X chromosome should be included in call rate calculations. If the X chromosome is not to beincluded, use the autosome-only call rate to determine which samples should be filtered.

• Open the Statistics by Sample spreadsheet and right-click on the Call Rate (Autosomes) column header (Col-umn 9). Select Activate by Threshold and activate all samples with Threshold Value: >= 0.95. (Leave theother options in the Activate by Threshold dialog box at their default values.) Click OK.

In the top-right corner, the number of active rows is listed at 468 (out of 565 total). This means 97 samples had anautosomal call rate < 95%.

4 1. Sample QA - I: Basics


• Create a row subset spreadsheet by going to Select > Row > Row Subset Spreadsheet. Rename the result-ing spreadsheet by right-clicking on the area at the bottom of the spreadsheet window and selecting Rename.Change the name to: Samples with Call Rate >= 0.95.

Now use the renamed spreadsheet to compare heterozygosity rates.

C. Genotype Gender Check

Use the X chromosome heterozygosity rate to identify those samples whose inferred gender does not agree with theirreported gender. Since the reported gender information is contained in the phenotype spreadsheet, you will first need tojoin the sample statistics spreadsheet with the phenotype information. Then create a histogram of the Heterozygosityrate and filter based on reported gender to detect discrepancies.

• From Samples with Call Rate >= 0.95 choose File > Join or Merge Spreadsheets. Select the Phenotype -Sheet 1 spreadsheet and click OK.

• Choose Current Spreadsheet under Spreadsheet as Child of. Leave all other parameters as the defaults andclick OK.

In the combined spreadsheet, inferred gender is located in the 22nd column (Inferred Gender) and reported gender islocated in the 29th column (Gender).

• Right-click on the Het Rate from All Columns (Chr. X) column header (Column 18) and choose Plot His-togram. There are two distinct distributions. The left one represents males and the right one females.

• To better visualize the distributions, in the plot viewer, click on the Graph 1 node in the Graph Control Interfaceand set the Bin Count parameter to 128.

• Next, click on the Het Rate from All Columns (Chr. X) node in the Graph Control Interface and select theColor tab. Select the By Variable radio button and click Select variable....

• Scroll to Gender, select it and click OK. Then click the blue box next to the Gender == ‘F’ option and changethe color to Yellow. Follow a similar process to change the Gender == ‘M’ to blue. With the plot colored onreported gender, the inconsistencies between inferred and reported gender are evident. More inconsistencies canbe seen by changing the opacity.

• While the Het Rate from All Columns (Chr. X) node is still selected in the Graph Control Interface, clickon the Item tab and move the opacity bar to the left until it is in the center of its range.

In this example, there are four samples who have a reported gender opposite of what is characteristic according to theheterozygosity rate. We will compare the values of the two columns to identify mismatched samples.

• From the spreadsheet Samples with Call Rate >= 0.95 + Phenotype - Sheet 1, select Select > Compare andActivate by Column Agreement. Click Add Columns and select the two previously mentioned categoricalcolumns (Inferred Gender and Gender) in the menu. Click OK.

Note: Be sure that both columns have a C to the left of the column name. These categorical columns use thesame naming conventions, ‘M’ and ‘F’, to identify gender.

• Under Create subset spreadsheet(s) of:, check both Rows with matching data values and Rows with differ-ing data values. This will allow you to easily examine and work with both samples that have consistent gendersand samples that have inconsistent genders.

• Confirm that the options in the window match those in Figure 1-4. Click OK.

C. Genotype Gender Check 5


Figure 1-3. X Heterozygosity Histogram Colored by Reported Gender



Figure 1-4. Compare Columns window

Two subset spreadsheets are created: Rows with matching values in columns Inferred Gender and Gender andRows with differing values in columns Inferred Gender and Gender.

Note: In your own study, you may be able rectify the spreadsheet by verifying that the gender was simply a data entryerror and not a genotyping anomaly.

D. Apply Filtering Results to Genotype Spreadsheet

Now to exclude the filtered samples from the genotype dataset. To do this, use the rows in the Rows with matchingvalues in columns Inferred Gender and Gender spreadsheet to activate their corresponding rows in the 500K GenoTraining Data spreadsheet.

• Open the Rows with matching values in columns Inferred Gender and Gender spreadsheet and chooseSelect > Apply Current Selection to Second Spreadsheet.

• Choose to Apply filtered rows to the following spreadsheet 500K Geno Training Data - Sheet 1, then clickOK.

You’ll notice now in the upper-right portion of the window that there are only 464 rows active out of 565. Create asubset with these samples.

• Choose Select > Row > Row Subset Spreadsheet.

• Rename this spreadsheet in the Project Navigator (right click on the node and select Rename Node) to Subset- Samples with Call Rate >= .95 and Matched Gender.

Note: You can close all open spreadsheets from the project navigator by going to Window > Close All.

D. Apply Filtering Results to Genotype Spreadsheet 7


Note: Hiding child nodes in the project navigator that you will not be actively working with can help make the projecteasier to navigate. Clicking on the arrow next to 500K Geno Training Data - Sheet 1 results in the streamlined projectnavigator view shown in Figure 1-5.

Figure 1-5. Streamlined Project Navigator.


2. Sample QA - II: Cryptic Relatedness

Next, we will find and filter samples determined to be “related” to other samples. Relatedness is often defined asfamily-relatedness, but identity by descent (IBD) estimation can also detect duplicate samples, duplicate samples thatoccur from one of a pair of genotyping chips but not the other, or sample contamination. Before doing IBD, standardpractice is to first prune the SNPs in LD with one another, reducing the number of association tests performed and thusthe effect of multiple testing.

• Open Subset - Samples with Call Rate >= .95 and Matched Gender.

• Choose Genotype > Quality Assurance and Utilities > LD Pruning.

• Input 100 for Window Size, 5 for Window Increment, r^2 for LD Statistic, 0.5 for LD Threshold, and CHMfor LD Computation Method. Click OK.

This will take a few minutes. Upon finishing, a message will appear stating that 286,502 markers have been inactivated.These markers are designated by the grayed-out columns in the spreadsheet. You can also see in the upper-right portionof the spreadsheet window that only 212,282 columns are active out of 498,784 total columns.

To perform IBD, you will be using only the active columns (non-correlated markers) for autosomal chromosomes, sofirst create a column subset spreadsheet and then inactivate the X chromosome before continuing.

• Choose Select > Column > Column Subset Spreadsheet.

• Then from the subset spreadsheet, choose Select > Activate by Chromosomes, uncheck the X chromosomeand click OK.

Note: In datasets with other non-autosomal chromosomes, inactivate these as well.

• In the Project Navigator rename this node to Pruned SNP Subset.

Now you are ready for IBD estimation.

• From the Pruned SNP Subset spreadsheet, choose Genotype > Quality Assurance and Utilities > Identityby Descent Estimation.

• For this exercise:

– Uncheck Output IBS distances ((IBS 2 + 0.5*IBS 1)/# non-missing markers).

– Uncheck Output untransformed estimates of P(Z=0), P(Z=1), and P(Z=2).

– Make sure that Output PI = P(Z=1)/2 + P(Z=2) is checked.

– Make sure that Output transformed estimates P*(Z=0), P*(Z=1), and P*(Z=2) is unchecked.

– Check Output all pairs where PI >= ___ and enter 0 in this blank.

– Click Run.

9


This will take a few minutes. Upon completion, two spreadsheets are output: IBD Estimate: Estimated PI andPairwise IBD Estimates (PI >= 0).

IBD Estimate: Estimated PI gives an N x N table where N is the number of samples in the dataset. By plotting aheatmap of this table we can detect patterns showing relatedness. Pairwise IBD Estimates (PI >= 0) outputs variousIBD statistics for all samples. These values can be sorted or plotted to find samples related to one another or detectsample contamination.

• From the IBD Estimate: Estimated PI, choose Plot > Heat Map (Uniform).

You’ll get the plot in Figure 2-1.

Figure 2-1. Default heat map of IBD PI estimates

By default, the heat map has a three color scheme calculated automatically. We want to define the color schememanually based on a two color scheme where we look for sample pairs with a PI estimate of 0.25 or greater (PI of 0.25

10 2. Sample QA - II: Cryptic Relatedness


= second degree relatives, 0.5 = first degree relatives, and 1 = identical twins (or duplicate samples)).

• Click on the IBD Estimate: Estimated PI node in the Graph Control Interface (upper-left portion of the win-dow).

• On the Color tab choose Manual and then right-click the first option (0), and select Delete.

• Right click on the new first parameter, select Edit, and change it to 0.2. Then click once on the color box nextto the second parameter and select red. Click OK. Your plot should look similar to Figure 2-2.

Figure 2-2. Heat map with two colors

You can begin to see samples along the diagonal line that appear to be related to one another. You can zoom in to seethis more clearly by clicking and dragging a red box in the plot area (Figure 2-3).

In most population-based studies, you’ll typically find a couple of sample pairs that are cryptically related in one wayor another, which you can subsequently remove. In this particular study there are known to be several family trios, and

11


Figure 2-3. Zoomed in area around several related individuals

12 2. Sample QA - II: Cryptic Relatedness


the cluster pattern above indicates this. For the purpose of this tutorial we won’t exclude any sample pairs here, but ifthis were your own study, you would need to decide which member(s) of the trio to keep or discard from the study.

If you need to exclude samples from your study, the Pairwise IBD Estimates (PI >= 0) output can be used. Open thespreadsheet and right-click on the Estimated PI column (Column 7) and select Sort Descending. Those sample pairsthat are highly related will be listed at the top of the spreadsheet. You can then manually exclude either one or bothsamples of each pair from your dataset.

13

3. Sample QA - III: PopulationStratification

The next step is to identify samples that depart from the expected homogenous ethnicity of your study. You can do thisby performing principal component analysis on your data and comparing the first two principal components againstreference samples of known ethnicities.

There are several ways to perform PCA. Some recommend using the pruned set of SNPs (as was done with IBD inthis tutorial), some recommend using a filtered set of SNPs (on minor allele frequency and HWE for example), andsome recommend using the entire SNP set. There are advantages to each. Here, we’ll use the pruned set of SNPs wecreated earlier. First, we need to append the HapMap samples to our study samples.

• Open the Pruned SNP Subset spreadsheet and choose File > Append Spreadsheets.

• Choose the 500K Geno HapMap Data - Sheet 1 and click OK.

• In the Append Spreadsheet window enter Pruned SNPs + HapMap as the New Dataset Name, keep the restof the parameters as the defaults and click OK.

The new spreadsheet should have 734 rows and 207,706 columns.

Note: You don’t want to run PCA on non-autosomal chromosomes. The pruned SNP spreadsheet already had the Xchromosome inactive, which caused this chromosome to be automatically dropped during the append process.

Now run PCA on this spreadsheet.

• From the spreadsheet, select Genotype > Genotype Principal Component Analysis.

• Under Principal Components, enter 5 for Find up to top ___ components.

• Leave the defaults for the rest of the options and click Run.

Two spreadsheets, the Principal Components (Additive Model) spreadsheet and the PC Eigenvalues (AdditiveModel) spreadsheet, result from the analysis. To find out how many principal components are required to explain themajority of the population stratification, a PCA plot will be created and the eigenvalues will be visually inspected.

Look at the PC Eigenvalues (Additive Model) spreadsheet and notice that there is very little change between the third,fourth and fifth eigenvalues, and that these are much smaller than either of the first two eigenvalues. This implies thattwo principal components explain the majority of stratification in the SNP data. This is consistent with there being twodifferent dimensions of stratification between the three major populations in the data: CEU (European), YRI (African),and CHB+JPT (Asian). We shall visualize the population stratification by plotting these first two principal componentsagainst one another, juxtaposing the HapMap samples with the GEO study data.

First, you need to join the Population spreadsheet with the Principal Components spreadsheet.

• Open the Population - Sheet 1 spreadsheet and select File > Join or Merge Spreadsheets.

• Select the Principal Components (Additive Model) spreadsheet.

14


• In the Join or Merge Spreadsheets window select the Current spreadsheet radio button under Spreadsheetas Child of, leave the rest of the parameters as defaults and then click OK. The combined spreadsheet shouldlook like Figure 3-1.

Figure 3-1. Population added to the Principal Components spreadsheet

It is now possible to plot one component against the other and color-code each sample or data point according to itsrespective ethnicity.

• From the Population + Principal Components (Additive Model) - Sheet 1 spreadsheet, select Plot > XYScatter Plots.

The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column(principal component) to represent the independent or X axis. The list view on the right is for selecting a single ormultiple columns (principal components) to represent the dependent or Y axis.

• In the left list box select EV = 35.8391. In the right list box check EV = 15.5366 and click Plot.

If we color each data point according to its respective ethnicity, the clusters become more obvious.

• In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 15.5366.

• Select the Color tab and select the By Variable radio button

• Click Select Variable... and select Population from the list. Click OK.

15


Figure 3-2. PCA plot of 3 HapMap populations and the study population

16 3. Sample QA - III: Population Stratification


The four population groups are separated by color in the plot (Figure 3-2). We can see that our study consisted mostlyof Caucasians (largest cluster), but also had Asians and some Asian-Americans and African-Americans. Four separategraph items, one for each Population category, are displayed in the Graph Control Interface, making it easy tochange the name, color and symbol for each.

• When finished, close the Plot Viewer and rename its associated node in the Project Navigator (under the Popu-lation + Principal Components (Additive Model) - Sheet 1 spreadsheet node) to PCA Plot.

Note: If it is important to exclude or keep samples based on whether or not they are in a population cluster, werecommend using the Numeric > Multidimensional Outlier Detection feature for doing this, rather than visualinspection of the PCA plots. The threshold of which samples are in a population cluster can be ambiguous if it isbased on visual inspection alone.

To use the Numeric > Multidimensional Outlier Detection feature to exclude samples outside of a population cluster,calculate the inter-quartile range (IQR) distance around the centroid (median) of the study population cluster andexclude those samples that are 1.5 IQRs from the third quartile.

See Multidimensional Outlier Detection for examples of this process.

In summary, samples were filtered out if they had a call rate below 95% or their reported and genotypically inferredgenders did not match. We will not use these samples for the rest of the tutorial. We have also identified other samplesthat could be excluded for cryptic relatedness–however, we will still use these other samples for the rest of the tutorial.Lastly, we have identified population stratification in the dataset.

Next, we will filter SNPs based on standard SNP quality assurance metrics.

17

http://doc.goldenhelix.com/SVS/latest/svsmanual/numeric_data_quality.html?highlight=multidimensional#multoutlier

4. SNP Quality Assurance

Because we will be using a case-control phenotype, and because this phenotype will be involved with one of the SNP-wise quality assurance measures, we will merge the phenotype and genotype information now. (Specifically, filteringon Hardy-Weinberg equilibrium (HWE) will be based on controls only.)

Note: If your phenotype is quantitative, you can still merge the phenotype and genotype information at this point–yousimply will not need the phenotype information until you actually start performing association testing.

• Open the Phenotype - Sheet 1 spreadsheet and choose File > Join or Merge Spreadsheets.

• Select the 500K Geno Training Data - Sheet 2 spreadsheet and click OK.

• Leave all the parameters in the Join or Merge Spreadsheet window as the defaults and click OK.

As before, exclude the X chromosome from the following analysis. Also, specify case/control status, because, asstated above, we will only calculate Hardy-Weinberg Equilibrium (HWE) based on control samples.

• Choose Select > Activate by Chromosomes, uncheck the X chromosome, and click OK.

• Next left-click the Phenotype 1 column label header. This will turn the column magenta, denoting the columnas the dependent variable.

Note: If your phenotype is quantitative, do not left-click the header of your phenotype column (yet).

The spreadsheet should look like Figure 4-1.

• From the joined spreadsheet choose Genotype > Genotype Filtering by Marker.

The Genotype Filtering window lets you simultaneously choose thresholds for multiple statistics to filter out SNPsfailing to meet respective quality assurance measures.

• Check the following options and enter the following thresholds:

• Drop if Call Rate < 0.9

• Drop if Minor Allele Frequency (MAF) < 0.01

• Perform HWE filtering based on: Controls

• Drop if Fisher’s exact test for HWE P-Value < 1e-4

• Make sure Inactivate genotype columns that meet above criteria for filtering is checked.

• Make sure Output spreadsheet with marker statistics and ‘Drop?’ columns is checked.

• Make sure Classify alleles by allele frequency is selected at the top of the dialog.

18


Figure 4-1. Joined spreadsheet with case/control status selected

19


• Make sure all other boxes are unchecked.

• Click Run.

Note: If your phenotype is quantitative, you will have to leave Perform HWE filtering based on: set to All, becauseyou won’t be able to change it. This is fine. Keep all the other parameters the same.

Upon completion, SNPs in the Phenotype + 500K Geno Training Data - Sheet 1 not meeting the specified thresholdswill be inactivated. A new spreadsheet, Filtering Results, will also be output with the various marker statistics foreach SNP.

• To see how many SNPs were filtered, go to the Project Navigator and select the Phenotype + 500K GenoTraining Data - Sheet 1 spreadsheet. At the end of the Node Change Log will be an individual log noting,among other things, how many SNPs were filtered out (columns set to inactive).

Assuming all steps were followed correctly to this point, 104,035 SNPs should have been filtered out. Though anyfurther analyses only take active columns and rows into consideration, it is often preferred to first create a subsetspreadsheet of only those that are active.

• From the Phenotype + 500K Geno Training Data - Sheet 1 spreadsheet, go to Select > Subset Active Data.

The new spreadsheet, Phenotype + 500K Geno Training Data - Active Subset, should have 464 rows and 384,262columns.

Note: If your phenotype is quantitative, now will be a good time to make it dependent. Inside this new subsetspreadsheet, left-click on the name header of your quantitative phenotype, turning it magenta.

• Rename this spreadsheet in the Project Navigator to Filtered Data for Association Testing, then add a color tagto the node by right-clicking and selecting Tag Node Green. Colored tags are useful for easily locating specificspreadsheets.

You should now have a filtered set of samples and SNPs for association testing. Your project should look similar toFigure 4-2.

20 4. SNP Quality Assurance


Figure 4-2. Project Navigator view

21

5. Genotype Association Analysis

After assuring the quality of the data, association testing can be performed.

A. Genotype Association Testing

• Open the Filtered Data for Association Testing spreadsheet and make sure the Phenotype 1 Binary columnheader is still set as the dependent variable.

• Choose Genotype > Genotype Association Tests.

• Make sure that, above the Association Test Parameters tab, Classify alleles by allele frequency is selected.

• In the Association Test Parameters tab, under Genetic Model or Tests, make sure the Additive Model: (dd)-> (Dd) -> (DD) radio button is selected.

• Under Test Statistic or Method, check only Correlation/Trend test and Exact form of Cochran-Armitagetest. Leave the other tests unchecked.

• Under Multiple Testing Correction, make sure Bonferroni Adjustment is checked and uncheck the otheroptions.

• Check Output data for P-P/Q-Q Plots.

• Make sure Correct for stratification with PCA is unchecked.

• In the Overall Marker Statistics tab, check Genotype counts under Count Tables and make sure all otheroptions in this tab are unchecked. Click Run.

• Upon completion, a new spreadsheet is created, Association Tests (Additive Model). This spreadsheet displaysseveral association statistics for each SNP (Figure 5-1).

When a SNP has a significant p-value according to the Correlation/Trend test, but the contingency table of Case Statusby Number of Minor Alleles shows at least one count less than 5, the results from the Exact Cochran-Armitage Testshould also be examined, since the assumptions of the Correlation/Trend test have been violated for those cases.

• Right-click on the Corr/Trend -log10 P column and select Sort Descending to bring the results for the SNPsthat have the most significant P-values (according to the Correlation/Trend Test) to the top of the spreadsheetdisplay.

Let us view the contingency table for the most significant marker (according to Corr/Trend), which is SNP_A-2070191.

• Scroll the results spreadsheet out to Column 15. Columns 15 through 28 have all the data to reconstruct thecontingency table for each marker. (Ignore Columns 29 through 31, since, for the additive model, associationtesting always drops samples for which the marker has missing data.)

We see that the contingency table for marker SNP_A-2070191 is as follows:

22


Figure 5-1. Association test results

A. Genotype Association Testing 23


dd Dd DD TotalCase 100 111 19 230Control 147 69 5 221Total 247 180 24 451

This results in an Exact Armitage P-Value of 1.869e-7. We see that this is not that much different from the Corr/Trendresult (of 2.4997e-7) because all cell counts in this contingency table are 5 or higher.

We also see that this marker appears to have the most significant p-value according to the Exact Armitage test (as wellas according to the Correlation/Trend Test).

• To check this, right click on the header of Exact Armitage -log10 P (Column 10) and click Sort Descending.The spreadsheet row for marker SNP_A-2070191 will remain on top.

B. Generating Q-Q Plots

Q-Q plots are generated by plotting the expected chi-squared values against the observed chi-squared values.

• From Association Tests (Additive Model), select Plot > XY Scatter Plots. Two list views will appear.

• Select Corr/Trend expected X^2 (seventh down) in the left list box and Corr/Trend X^2 (fourth down) in theright list box.

• Click Plot.

• In the Graph Control Interface, Select Graph 1.

• Under the Add Item tab select f(x) = m(x) + b and click Add.

This will generate a straight line with a slope of 1 and y-intercept of 0. You should have a Q-Q plot that looks likeFigure 5-2.

• (If you wanted to change the weight and color of this line, you could, in the Graph Control Interface, select itsassociated graph item (f(x) = 1.0(x) + 0.0) and, in the Item tab to the right of Line, choose the color and weightyou would like for the line.)

• When you’ve finished, close the Plot Viewer and rename its associated node in the Project Navigator to Q-QPlot.

Similarly, you could also plot a P-P plot for the Corr/Trend results by using Corr/Trend expected -log10 P on theX axis and Corr/Trend -log10 P on the Y axis, or plot a P-P plot for the Exact Armitage results by using ExactArmitage expected -log10 P on the X axis and Exact Armitage -log10 P on the Y axis.

(For this example, it turns out that the Q-Q plot and both of these P-P plots are quite similar.)

C. Generating P-Value Plots

• From the Association Tests (Additive Model) spreadsheet, right-click on the header of the Corr/Trend -log10P column (Column 2) and select Plot Variable in GenomeBrowse.

Notice the full-domain view now has chromosome bands and the X-axis is represented by chromosome and physicalposition (Figure 5-3).

Note that the most significant p-values are in Chromosome 6. Let us zoom in for a closer look.

• There are many ways to zoom in the GenomeBrowse window:

24 5. Genotype Association Analysis


Figure 5-2. Q-Q plot of association results

C. Generating P-Value Plots 25


Figure 5-3. P-value plot in genome browser



– Double-clicking a chromosome’s diagram (or its number) in the Full Domain View (top plot).

– Manually selecting a chromosome in the chromosome selector dropdown (near the top left of the plotwindow).

– Manually entering a chromosome range (such as 5 - 6) in the chromosome selector dropdown (near the topleft of the plot window), then clicking on the right-arrow icon just to the right of the Genomic LocationBar (at the top of the plot window above the Full Domain Band).

– Manually entering a chromosome and/or position or a gene name in the Genomic Location Bar (at the topof the plot window above the Full Domain Band).

– Using the mouse scroll wheel.

– Using the zoom slider.

– And last of all, you can zoom into any region in the plot by selecting on the zoom toolbar, thenpressing your left mouse button at the left edge of the region and while holding down your left mousebutton dragging the cursor across to the right edge of the region, finally releasing the mouse button tocomplete the zoom.

Right now, double-click the 6 in the Full Domain View.

Zooming displays the karyogram view of chromosome 6. The annotation track provides more information about theSNPs and other genotypes that are in the region on display.

• Zoom further into the peak (click and drag on the x-axis) and left-click on the top-most point in the plot (Figure5-4).

This displays, in the Console (bottom-left pane), the marker name, its -log10 p-value, chromosome, and position,along with other marker map and spreadsheet values.

You can see in what gene(s) this SNP and others in the peak reside by looking at the annotation gene source that isloaded by default into the plot.

• Zoom in even further around the top five or seven SNPs and, if necessary, expand the boundaries around theannotation track to give it more room.

You can also find a listing of information available in each plot by activating the Feature List for that plot. Right-clickanywhere on an active plot and select Feature List (Figure 5-6).

Note: You may find that the Feature List comes out in its own window. If you wish, you may then dock that windowinto the bottom of the main display (which will make the main plot window match Figure 5-6).

D. Creating a Manhattan Plot

Manhattan plots are popular images for publication purposes as they color-coded by chromosome making it easy tosee where significant markers reside.

• First, turn off the Feature List by clicking the X in its upper right corner.

• Then, in the Chromosome selection drop-down (near the upper-left-hand corner of the plot window) select Allto reset the zoom.

• Select either Corr/Trend -log10 P graph item in the Plot Tree.

D. Creating a Manhattan Plot 27


Figure 5-4. Zoomed in around the peak in the p-value plot



Figure 5-5. Observing the gene track

D. Creating a Manhattan Plot 29


Figure 5-6. P-value Plot with Feature List



• Under the Style tab, select Chromosome in the Style By: drop-down.

Figure 5-7. Manhattan plot

This will split the graph into 22 different colors, one for each chromosome (Figure 5-7). You can change the color,symbol choice, and size of each chromosome by selecting its respective option in the Style options box.

E. Saving Plots as Images

You can save all displayed plots in the plot view as an image.

• Choose File > Save as Image from the plot window.

This will bring up a preview window (Figure 5-8). Here you can manipulate various image parameters.

• Click Advanced in the upper-right corner of the display (unless the advanced options are already showing).

E. Saving Plots as Images 31


• Make sure the Domain View option under the Advanced options is unchecked.

• (Under the Advanced options, you can also change the margins of the image.)

• Next, Browse to a folder where you want the image saved, give it the name Manhattan Plot and click Save.

• Click Save again at the bottom of the preview window to save the image.

• Once the image is saved, close the Plot Viewer and rename its associated node in the Project Navigator toManhattan Plot.

Figure 5-8. Save as Image preview window


6. Genotypic Regression and Correcting forCovariates

If you build a model and the response is correlated with one or more variables relating to the samples, those variables(“confounding variables”) can sometimes hide real associations from the results of performing a test, or at least hidethe true magnitude of these associations. We will next examine how to correct for confounding variables as a part ofour association testing.

We will use the Genotypic Regression feature (Genotype > Genotypic Regression Analysis) to do this. This featureis very similar to Numeric > Numeric Regression Analysis, with the exception that genotypic data may be directlyused as input. A new third tab allows selection of genotype-specific parameters.

A. The Possibly Confounding Covariate

Let us first look at the phenotype and at the possibly confounding covariate that we will be dealing with in this sectionof the tutorial. We will be using Phenotype 2 as the phenotype, and we suspect Gender may be a confoundingcovariate.

• Open the Filtered Data for Association Testing spreadsheet, right-click on Phenotype 2, and choose PlotHistogram.

• In the Graph Control Interface, click on the Graph 1 node and, in the Graph tab, set the Bin Count parameterto 32.

• Next, in the Graph Control Interface, click on the Phenotype 2 node and select the Color tab. Select the ByVariable radio button in this tab and click Select variable....

• Scroll to Gender, select it and click OK. Then click the blue box next to the Gender == ‘F’ option and changethe color to yellow. Follow a similar process to change the Gender == ‘M’ to blue.

• While the Phenotype 2 node is still selected in the Graph Control Interface, click on the Item tab and movethe opacity bar to the left until it is in the center of its range.

We see that the peak and the average of the distribution of Phenotype 2 for males are somewhat less than the peakand the average, respectively, of the distribution of Phenotype 2 for females. This indicates that Phenotype 2 mildlycorrelates with Gender. Therefore, we will want to correct for Gender when testing on Phenotype 2.

B. Genotypic Regression–Full Model Only

But before that, as a baseline, let’s test on Phenotype 2 without correcting for covariates.

• Open the Filtered Data for Association Testing spreadsheet and left-click once on Phenotype 1 to inactivatethe column. Now set Phenotype 2 as dependent by left-clicking once on its column header.

33


Figure 6-1. Distributions of Phenotype 2 distinguished by Gender

34 6. Genotypic Regression and Correcting for Covariates


• Select Genotype > Genotypic Regression Analysis.

• Under Selection Parameters, select Regress on each of the 384255 genotypic columns. Make sure all othercheckboxes in this tab are unchecked.

• In the Output Parameters tab, check Output data for P-P/Q-Q plots, Bonferroni adjustment (on N covari-ates), and False Discovery Rate (FDR). Make sure all other checkboxes in this tab are unchecked.

• In the Genotypic Parameters tab, make sure Classify Alleles by major allele (d) vs. minor allele (D) asfound in the data, Additive model: DD=2, Dd=1, dd=0, and Drop samples containing missing genotypevalues are checked. Check Genotype counts and make sure all other boxes in this tab are unchecked.

• Finally, make sure all parameters agree with Figures 6-2, 6-3, and 6-4, then click Run.

This process won’t take that long, even for the 384,255 markers, because this is a process that uses linear regressionrather than logistic regression. (Linear regression is automatically being used because Phenotype 2 is quantitativerather than case/control.)

Let us examine the results.

• First, right-click on the bottom of the results spreadsheet window and select Rename. Rename the spreadsheetto Genotypic Regression Results - Full Model Only.

• Right-click on the header of -log10 Full-Model P and select Plot Variable in GenomeBrowse.

We see significant results in Chromosome 11.

• First, go back to the navigator node corresponding to this plot and rename it to Genotypic Regression Plot.

• Now, in the Genotypic Regression Plot, zoom in to Chromosome 11 by double-clicking on it, then zoom inimmediately around the cluster of SNP’s that are very significant. (There should be 17 very significant SNP’s.)

We see (Figure 6-5) that the most significant SNP has a -log10 p-value of about 10.

C. Genotypic Regression–Full vs. Reduced Model

Now, let us perform our association test while correcting for Gender.

• Open the Filtered Data for Association Testing - Sheet 2 spreadsheet and make sure Phenotype 2 is stilldependent.

• Go to Genotype > Genotypic Regression Analysis.

• Under Selection Parameters, make sure Regress on each of the 384255 genotypic columns is still checked.

• Under Regress on each of the 384255 genotypic columns, check Correct for covariate(s). In the ReducedModel Covariates section, click Add Covariate. Check Gender and click Add.

• The first tab should look like Figure 6-6. The other two tabs should still look like Figures 6-3 and 6-4. ClickRun.

This will take a little longer, since there are two regressions for each marker–the full model (encompassing the full-model covariate and the reduced-model covariates) and the reduced model (encompassing only the reduced-modelcovariates).

• Now, right-click on the bottom of this results spreadsheet window and rename it to Genotypic RegressionResults - Full vs Reduced.

• Scroll this spreadsheet over to the FDR column (Column 20) and right-click on the column header of FDR.Select Sort Ascending.

Notice that there are 17 markers for which the False Discovery Rate (FDR) is less than 0.05.

C. Genotypic Regression–Full vs. Reduced Model 35


Figure 6-2. First tab of the Genotypic Regression dialog



Figure 6-3. Second tab of the Genotypic Regression dialog



Figure 6-4. Third tab of the Genotypic Regression dialog



Figure 6-5. Full-Model-Only plot of the most significant SNPs



Figure 6-6. First tab of the Genotypic Regression dialog for Full vs. Reduced Model



• Scroll the spreadsheet back to Columns 1 and 2, FvR Model P-Value and -log10 FvR Model P.

Observe that these same 17 markers have full-vs-reduced p-values less than 6.5 × 10−7 and -log 10 full-vs-reducedp-values greater than 6. We conclude that these markers are genomewide-significant.

Let us compare these results visually with the baseline results.

• In the Genotypic Regression Plot, go to the Plot Tree and click on the first -log10 Full-Model P node. In theAdd tab under Controls, click Add Plot Item(s).

• Select the Project button (if it is not already selected). Choose the Genotypic Regression Results - Full vsReduced spreadsheet, select -log10 FvR Model P, and click Plot and Close.

• In the Plot Tree, the -log10 FvR Model P node should be selected. In the Style tab under Controls, click onthe blue square and change the color to green. The result should look something like Figure 6-7.

Figure 6-7. Plot of the Full-Model results compared with the Full-vs.-Reduced results of the most significant SNPs

We see that most of these 17 markers now show as being somewhat less significant after having been corrected forGender, with one marker having a corrected p-value 10 times larger than its uncorrected p-value.

Even so, the corrected p-values for all of these 17 markers are still genomewide-significant, as we saw by examiningtheir False Discovery Rate.


7. Regressing on Covariate/ColumnInteractions

We saw in the last section (6. Genotypic Regression and Correcting for Covariates) that when the most significantmarkers resulting from testing against Phenotype 2 were corrected for Gender, their significance lessened somewhat,simply because Phenotype 2 is somewhat correlated with Gender.

However, rumor has it (see Note below) that for this data, “the association signal [of Phenotype 2] is only valid forfemales”. If this is correct, it means that for any marker containing “the association signal”, the value of Phenotype 2will change significantly for subjects that are both female and have a variation in the number of minor alleles in thatmarker. This constitutes an interaction–specifically, between a covariate (Gender) and the marker (column) currentlybeing tested–a “Covariate/Column Interaction”.

We will now test if there is such an interaction taking place.

Note: If you wish, you can click on the first navigator node of the project (SNP_GWAS_Tutorial) and read the UserNotes text box, which will explain how the phenotype data for this project was simulated.

A. The Covariate/Column Interaction Regression

Our original Full-Model-Only test that we have already performed on Phenotype 2 will also be the baseline model forthis test.

We will run this test with Gender as an interaction covariate, rather than as a correcting covariate.

• Open the Filtered Data for Association Testing - Sheet 2 spreadsheet and make sure Phenotype 2 is stilldependent.

• Go to Genotype > Genotypic Regression Analysis.

• This time, select Regress on covariate-column interactions (on 384255 genotypic cols). The upper-right boxon this tab will become enabled and change its appearance slightly.

• In this upper-right box, now called Covariate-Column Interactions (Full Model), click Add Col Interaction.Check Gender and click Add. The first tab should now look like Figure 7-1.

Notice, in Figure 7-1, that the upper-right and lower-right boxes indicate what variables will be in which parts of thisregression. The Full Model will encompass, along with the reduced-model covariates, the interaction of Gender:F (1for female, 0 for male) with the current column’s data. The reduced model will encompass (along with the intercept)both Gender:F by itself and the current column’s data by itself.

• The second and third tabs (Output Parameters and Genotypic Parameters) should still look like Figures 6-3and 6-4. Click Run.

42


Figure 7-1. First tab of the Genotypic Regression dialog for Covariate-Column Interactions

A. The Covariate/Column Interaction Regression 43


This should take a tiny bit longer than the regression that corrected for Gender.

• Right-click on the bottom of the results spreadsheet window and rename it to Genotypic Regression Results -CCI.

• Scroll this spreadsheet over to the FDR column (this time it’s Column 26) and right-click on the column headerof FDR. Select Sort Ascending.

We see that some regressions were not successful, probably because for these markers, all the genotypes for femaleswere homozygous major-allele, and thus, for these markers, the interaction variables were all zeroes.

Note: You can explore the above statement for yourself for the first unsuccessful marker listed, SNP_A-2130115, asfollows:

• Go to the Filtered Data for Association Testing - Sheet 2 spreadsheet and sort on Gender. Note that there are144 females.

• Select Edit > Find and look for SNP_A-2130115.

• Note that all the genotypes in the first 144 rows are G_G except for one missing value (?_?) genotype.

In the additive genetic model (which is being used here–see Figure 6-4), G_G corresponds to zero, since G is themajor allele. So the interaction for every subject, male or female, will either be zero because of being G_G, becauseof being male, or both.

• In the results spreadsheet we have been examining (Genotypic Regression Results - CCI), scroll down to andpast row 84.

We see that there are 17 markers with excellent False Discovery Rate values (less than 0.0002) (with 16 of these havingFDR less than .00001) and 67 more with good False Discovery Rate values. This is promising.

• Scroll the spreadsheet horizontally to Column 1.

We see that all the p-values of these 84 markers look excellent. Let’s see how the best 17 stack up against the p-valuesfrom our other tests.

• Go back to the Genotypic Regression Plot, and in this window, go to the Plot Tree and click on the first -log10Full-Model P node. In the Add tab under Controls, click Add Plot Item(s).

• Select the Project button (if it is not already selected). Choose the Genotypic Regression Results - CCIspreadsheet, select -log10 FvR Model P, and click Plot and Close.

• In the Plot Tree, the first -log10 FvR Model P node should already be selected. In the Style tab under Controls,click on the blue square and change the color to pink. The result should look something like Figure 7-2.

We see that the interactions are obtaining some of the best significance that we have found so far for any markers. (Asa side note, we see that the significance for a number of other (less-significant) markers becomes much greater whentesting them for interaction with Gender.)

• Now, in the Chromosome selection drop-down (near the upper-left-hand corner of the plot window) select Allto reset the zoom.

We see that there are no markers that, by testing their interactions with Gender, become more significant than theinteractions with Gender of the 17 markers we have been studying in detail.

B. Other Possible Studies

You have now performed a cursory genome-wide association study on a case/control phenotype and a second genome-wide association study using regression on a quantitative phenotype. You may wish to try out other, possibly more

44 7. Regressing on Covariate/Column Interactions


Figure 7-2. Plot of the Full-Model results (blue) compared with both the corrected results (green) and the Covari-ate/Column Interaction results (pink)

B. Other Possible Studies 45


challenging, analyses, such as running other association tests, running other variations of regression, or testing on theother phenotypes.

As mentioned in the Note at the beginning of this section, more information about each phenotype in this tutorialproject can be found by clicking on the first node (SNP_GWAS_Tutorial) in the Project Navigator and reading theUser Notes text box.

46 7. Regressing on Covariate/Column Interactions

Documents

SNP Genome-Wide Association Tutorial - Golden Helix, Inc › SVS › tutorials › snp_gwas › snp_gwas.pdf · SNP Genome-Wide Association Tutorial, Release 8.7.0 Updated: February