Goals of this Section
• Become familiar with the basic concepts of quantitative genetics:
  – Traits, phenotypes, genotypes
• Understand the basics of trait mapping
• Understand the conceptual foundations of association studies
• Learn how to perform a genome-wide association study in the iPlant Discovery Environment
  – Obtain genotypes
  – Run a Mixed Linear Model
Phenotype
Observable (measurable) trait (character) of an organism
Trait: eye color
Phenotype: wild type (red), white eyed, orange eyed
http://www.unc.edu/depts/our/hhmi/hhmi-ft_learning_modules/fruitflymodule/phenotypes.html
Donahue, R. P., et al., Probable assignment of the Duffy blood group locus to chromosome 1 in man, Proceedings of the National Academy of Sciences 61, 949-955 (1968).
Co-segregation in Pedigree
Quantitative Traits
• Probably caused by multiple loci
  – Interaction effects
  – Environment
If the mean trait value for individuals with marker state MM is different from the mean
trait value of individuals with marker state mm (i.e. the marker is associated with the
phenotype), then the marker is linked to a quantitative trait locus.
[Figure: grid of markers by individuals, with the trait value of each individual]

Marker      State     Mean Trait Value
Marker #3   Present   99 ± 5
Marker #3   Absent    118 ± 8
Marker #6   Present   110 ± 10
Marker #6   Absent    115 ± 13
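The comparison of mean trait values between marker classes can be sketched in a few lines of Python. This is only an illustration: the function name `marker_effect`, the use of Welch's t statistic, and the data are assumptions made for the example, not part of the original slides.

```python
from math import sqrt
from statistics import mean, variance

def marker_effect(trait_values, marker_present):
    """Compare mean trait values between marker classes (Welch's t statistic)."""
    present = [t for t, m in zip(trait_values, marker_present) if m]
    absent = [t for t, m in zip(trait_values, marker_present) if not m]
    diff = mean(present) - mean(absent)
    # Standard error of the difference, allowing unequal variances
    se = sqrt(variance(present) / len(present) + variance(absent) / len(absent))
    return mean(present), mean(absent), diff / se

# Hypothetical data for illustration only (not the values behind the table)
traits = [98, 101, 97, 100, 117, 120, 116, 119]
marker = [1, 1, 1, 1, 0, 0, 0, 0]
mean_present, mean_absent, t_stat = marker_effect(traits, marker)
print(mean_present, mean_absent, round(t_stat, 2))
```

A large absolute t statistic suggests that the marker classes differ in mean trait value, which is the pattern expected for a marker linked to a QTL.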
Quantitative Genetics
Exploring the Genetic Architecture* Underlying Quantitative Traits
*Genetic Architecture
• How many loci?
• Which location?
• How strong?
Tools for Statistical Genetics in the DE

Tool                              Purpose
Genotype by Sequencing Workflow   Automatic pipeline for extracting SNPs from GBS data (with genome from user or from iPlant database)
UNEAK pipeline                    Automatic pipeline for extracting SNPs from GBS data without reference genomes
MLM workflow                      Automatic workflow for fitting Mixed Linear Model
GLM workflow                      Automatic workflow for fitting General Linear Model
QTLC workflow                     Automatic workflow for composite interval mapping
QTL simulation workflow           Automatic workflow for simulating trait data with given linkage map
PLINK                             PLINK implementation of various association models
Zmapqtl                           Interval mapping and composite interval mapping with the options to perform a permutation test
LRmapqtl                          Linear regression modeling
SRmapqtl                          Stepwise regression modeling
AntEpiSeeker                      Epistatic interaction modeling
Random Jungle                     Random Forest implementation for GWAS
FaST-LMM                          Factored Spectrally Transformed Linear Mixed Modeling
Qxpak                             Versatile mixed modeling
gluH2P                            Convert Hapmap format to Ped format
LD                                Linkage Disequilibrium plot
Structure                         Estimation of population structure
PGDSpider                         Data conversion tool
GLMstrucutre                      GLM with population structure as fixed effect
A Model for Quantitative Traits
P = G + E + GG + GE
P = Phenotype
G = Genotype
E = Environment
GG = Interaction between genotypes
GE = Interaction between genotype and environment

Simplified: P = G + e

[Figure: diagram relating Phenotype to Genotype and Environment]
A Statistical Model for QTLs
P = G + e

yij = β0 + β1 xi + εij

yij  trait value in individual j with genotype i
β0   population average of trait value
β1   effect of marker i on trait value
xi   marker genotype i
εij  error term

General Linear Model (in matrix notation): Y = Xb + e
Note: If errors are not normally distributed, use generalized linear models
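As a rough illustration of the single-marker model, the sketch below fits the intercept and marker effect by ordinary least squares. The function name `fit_single_marker`, the 0/1/2 allele-count coding, and the data are assumptions made for this example. With that coding the fitted intercept is the expected trait value at genotype 0; it equals the population average only if the genotypes are centered first.

```python
def fit_single_marker(x, y):
    """Ordinary least squares for y = b0 + b1*x + e with a single marker."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx           # estimated marker effect
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

# Hypothetical genotypes coded as counts of one allele (0, 1, 2)
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
b0, b1 = fit_single_marker(genotypes, phenotypes)
print(b0, b1)
```

In a genome scan the same fit would be repeated for every marker, with a test of whether the marker effect differs from zero.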
Linkage Mapping (QTL Mapping)
• Designed population
  – F2
  – Recombinant inbred (RIL)
  – Double-Haploid (DH)
  – Back-cross (B2)
Limitation of Linkage Mapping
• Needs large number of related individuals
• Resolution limited (interval contains 100s of genes)
• QTL position and effect are confounded
Association Mapping
• Use random collection of individuals from natural population
• Very dense marker map = very high resolution
Linkage & Recombination
Recombination causes linkage disequilibrium (LD) to decay

Other factors affecting LD:
• Selection (artificial or natural)
• Drift
• Mutations
• Population structure
• Demography
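To make LD and its decay concrete, here is a minimal sketch that computes the disequilibrium coefficient D and r² for two biallelic loci from phased haplotypes, and the expected D after a number of generations of random mating. The function names and the haplotype data are hypothetical, and the decay formula ignores the other factors listed above.

```python
def ld_stats(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1.
    Both loci must be polymorphic, otherwise r_squared is undefined.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

def ld_after(d0, recomb_rate, generations):
    """Expected D after random mating: D_t = D_0 * (1 - c)^t."""
    return d0 * (1 - recomb_rate) ** generations

# Hypothetical sample of ten haplotypes
haps = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0)] + [(0, 1)]
d, r2 = ld_stats(haps)
print(d, r2, ld_after(d, 0.5, 1))
```

Loci that are far apart recombine more often, so their LD decays faster; this is why a dense marker map is needed for association mapping.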
Pitfalls: Population Structure
• Difference in allele frequencies between subpopulations
• Due to neutral or adaptive processes
• Can create spurious association
• Similar effect due to presence of related individuals (esp. in plants)
• Can be accounted for using the data:
  – Estimate number of subpopulations
  – Assign individuals to subpopulation
  – Estimate kinship
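A small constructed example shows how population structure can create a spurious association. In the hypothetical data below the marker has no effect on the trait within either subpopulation, yet pooling the two subpopulations produces a difference in means, because allele frequency and mean trait value both differ between them. All names and values are made up for illustration.

```python
from statistics import mean

def mean_difference(records):
    """Mean trait of marker carriers minus mean trait of non-carriers."""
    carriers = [trait for marker, trait in records if marker == 1]
    others = [trait for marker, trait in records if marker == 0]
    return mean(carriers) - mean(others)

# Hypothetical (marker, trait) records.
# Subpopulation 1: marker common, low trait values.
subpop_1 = [(1, 10), (1, 12), (1, 10), (1, 12), (0, 10), (0, 12)]
# Subpopulation 2: marker rare, high trait values.
subpop_2 = [(0, 20), (0, 22), (0, 20), (0, 22), (1, 20), (1, 22)]

within_1 = mean_difference(subpop_1)          # no effect inside subpopulation 1
within_2 = mean_difference(subpop_2)          # no effect inside subpopulation 2
pooled = mean_difference(subpop_1 + subpop_2) # apparent effect when pooled
print(within_1, within_2, pooled)
```

Including subpopulation membership in the model removes this kind of false signal, which motivates the mixed model approach.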
Accounting for Random Effects: Mixed Linear Models
• "Cost" associated with estimating a parameter
• We are not interested in the value of the parameter, only the variance
• Q-K method (structured association)

y = Xβ + Sα + Qv + Zu + e

Fixed effects:
β  Vector of fixed effects
α  Vector of SNP effects
v  Vector of subpopulation effects

Random effects:
u  Vector of kinship effects
e  Residuals

Q        Matrix of population association (STRUCTURE)
X, S, Z  Incidence matrices
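For reference, the distributional assumptions usually attached to this Q+K model can be written out as follows. This is a sketch under stated assumptions: K denotes the kinship matrix, σ_g² the genetic variance and σ_e² the residual variance, none of which are defined on the slide, and some software absorbs the factor 2 into K.

```latex
\begin{align}
  y &= X\beta + S\alpha + Qv + Zu + e \\
  u &\sim N(0,\, 2K\sigma_g^2), \qquad e \sim N(0,\, I\sigma_e^2) \\
  \operatorname{Var}(y) &= Z\,(2K\sigma_g^2)\,Z^{\top} + I\sigma_e^2
\end{align}
```

Only the two variance components are estimated for the random part, rather than one parameter per individual, which is the sense in which only the variance is of interest.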
MLM Pipeline for GWAS
[Figure: pipeline flowchart with the steps marker, trait, filter, convert, impute, K, GLM, MLM]
http://www.maizegenetics.net/statistical-genetics
Zhang et al. Nature Genetics. 2010; doi:10.1038/ng.546
Ed Buckler (Cornell University)
TASSEL
http://www.maizegenetics.net/tassel/docs/Tassel_User_Guide_3.0.pdf
MLM Input Files
• Hapmap file
• Phenotype data
• Kinship matrix*
• Population structure*

[Figure: phenotype data table with columns "strain" and "traits"]
[Figure: population structure table with a "strain" column and 3 population columns whose values sum to 1]

* Kinship matrix & population structure data can be generated using TASSEL or with the “MLM Workflow” App in DE
Origin
• Hapmap file:
  – Download (e.g. http://triticeaetoolbox.org/)
  – Convert from PLINK (.map/.ped) using Tassel 3 Conversion
  – Impute with NPUTE
  – Transform to numerical format with NumericalTransform
• Phenotype data
• Kinship matrix
  – Generate from hapmap marker data with Kinship
• Population structure
  – Generate using ParallelStructure
  – Convert to matrix with Structure2Tassel
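To give an idea of what a kinship matrix contains, the sketch below computes a simple identity-by-state (IBS) similarity between strains from numerically coded genotypes. This is an assumption-laden illustration, not the algorithm used by the Kinship app or by TASSEL: the function name `ibs_kinship`, the 0/1/2 coding, and the strain data are all hypothetical, and missing values are not handled.

```python
def ibs_kinship(genotypes):
    """Pairwise identity-by-state similarity from numeric genotypes.

    genotypes: dict mapping strain name -> list of allele counts (0, 1, 2),
    one entry per marker, with no missing values.
    """
    names = list(genotypes)
    n_markers = len(genotypes[names[0]])
    kinship = {}
    for a in names:
        for b in names:
            # Each marker contributes 1 (identical), 0.5, or 0 (opposite homozygotes)
            shared = sum(1 - abs(ga - gb) / 2
                         for ga, gb in zip(genotypes[a], genotypes[b]))
            kinship[(a, b)] = shared / n_markers
    return kinship

# Hypothetical strains and genotypes
example = {
    "strain1": [0, 2, 1],
    "strain2": [0, 2, 1],
    "strain3": [2, 0, 1],
}
K = ibs_kinship(example)
print(K[("strain1", "strain2")], K[("strain1", "strain3")])
```

The result is symmetric with ones on the diagonal, so it has the general shape of the kinship matrix that the mixed model takes as input.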
MLM Output
• MLM1.txt
  – Marker
  – “df” degrees of freedom
  – “F” F distribution for test of marker
  – “p” p-value
  – “errordf” df used for denominator of F-test
  – etc.
• MLM2.txt
  – Estimated effect for each allele for each marker
• MLM3.txt
  – The compression results show the likelihood, genetic variance, and error variance for each compression level tested during the optimization process.

See TASSEL manual for details:
http://www.maizegenetics.net/tassel/docs/Tassel_User_Guide_3.0.pdf
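Once MLM1.txt is available, a typical next step is to list the markers that pass a significance threshold. The sketch below assumes the file is tab-delimited and uses the column names "Marker" and "p" listed above. The Bonferroni correction, the function name `significant_markers`, and the sample rows are assumptions for illustration; check the TASSEL manual for the exact file layout.

```python
import csv
import io

def significant_markers(handle, alpha=0.05):
    """Return (marker, p) pairs passing a Bonferroni-corrected threshold."""
    rows = [r for r in csv.DictReader(handle, delimiter="\t")
            if r["p"] not in ("", "NaN")]
    if not rows:
        return []
    threshold = alpha / len(rows)  # Bonferroni: alpha divided by number of tests
    return [(r["Marker"], float(r["p"])) for r in rows
            if float(r["p"]) < threshold]

# Hypothetical excerpt; a real MLM1.txt has more columns and many more rows
example = io.StringIO(
    "Marker\tdf\tF\tp\terrordf\n"
    "m1\t2\t15.2\t0.00001\t200\n"
    "m2\t2\t1.1\t0.33\t200\n"
    "m3\t2\t4.0\t0.02\t200\n"
)
hits = significant_markers(example)
print(hits)
```

For a real file, replace the in-memory example with `open("MLM1.txt", newline="")`.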