29
Gaston Version 1.5.7 Herv´ e Perdry, Claire Dandine-Roulland September 21, 2020 Introduction Gaston offers functions for efficient manipulation of large genotype (SNP) matrices, and state-of-the-art implementation of algorithms to fit Linear Mixed Models, that can be used to compute heritability estimates or to perform association tests. Thanks to the packages Rcpp, RcppParallel, RcppEigen, Gaston functions are mainly written in C++. Many are multithreaded. For better performances, we recommend to build Gaston with clang++. To do so, it is sufficient to create in your home directory a file ~/.R/Makevars containing CXX1X = clang++ or if you use a version of R >= 3.4.0, CXX11 = clang++ In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN (see the corresponding manual pages for a description). Gaston also includes some example files in the extdata folder. Not all options of the functions are described here, but rather their basic usage. The reader is advised to look at the manual pages for details. Note that the package name is written gaston when dealing with R commands, but Gaston (with a capital) in human language. 1 Modifying Gaston’s behaviour with global options 1.1 Number of threads The number of threads used by multithreaded functions can be modified through RcppParallel function setThreadOptions. It is advised to try several values for the number of threads, as using too many threads might be counterproductive due to an important overhead. The default value set by RcppParallel is generally too high. Some functions have a verbose argument, which controls the function verbosity. To mute all functions at once you can use options(gaston.verbose = FALSE). 1

Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

Gaston

Version 1.5.7

Herve Perdry, Claire Dandine-Roulland

September 21, 2020

Introduction

Gaston offers functions for efficient manipulation of large genotype (SNP) matrices, and state-of-the-artimplementation of algorithms to fit Linear Mixed Models, that can be used to compute heritability estimatesor to perform association tests. Thanks to the packages Rcpp, RcppParallel, RcppEigen, Gaston functionsare mainly written in C++. Many are multithreaded.

For better performances, we recommend to build Gaston with clang++. To do so, it is sufficient to createin your home directory a file ~/.R/Makevars containing CXX1X = clang++ or if you use a version of R >=3.4.0, CXX11 = clang++

In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN (see the correspondingmanual pages for a description). Gaston also includes some example files in the extdata folder. Not alloptions of the functions are described here, but rather their basic usage. The reader is advised to look at themanual pages for details.

Note that the package name is written gaston when dealing with R commands, but Gaston (with a capital)in human language.

1 Modifying Gaston’s behaviour with global options

1.1 Number of threads

The number of threads used by multithreaded functions can be modified through RcppParallel functionsetThreadOptions. It is advised to try several values for the number of threads, as using too many threadsmight be counterproductive due to an important overhead. The default value set by RcppParallel is generallytoo high.

Some functions have a verbose argument, which controls the function verbosity. To mute all functions atonce you can use options(gaston.verbose = FALSE).

1

Page 2: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

1.2 Basic statistics computations

Since version 1.4, the behaviour of all functions that output a matrix of genotypes (a bed.matrix, describedin the next section) can be modified by setting options(gaston.auto.set.stats = FALSE). The effects ofthis option is described in section 2.4 below. Note that some examples in the manual pages might not workif you use this option.

1.3 Autosomes, gonosomes, mitochondria

Since version 1.4.7, some functions take into account wether a SNP is autosomal, X or Y linked, or mitochon-drial. The ids of the corresponding chromosomes are defined by options gaston.autosomes, gaston.chr.x,gaston.chr.y, gaston.chr.mt. Default values are c(1:22, 25), 23, 24 and 26, following to Plink conven-tion (in this convention, 25 corresponds to the pseudo-autosomal XY region: we chose to include it in theautosomes).

2 Genotype matrices

An S4 class for genotype matrices is defined, named bed.matrix. Each row correspond to an individual, andeach column to a SNP.

2.1 Reading bed.matrices from files

Bed.matrices be read from files using read.bed.matrix. The function read.vcf reads VCF files; it relies onthe package WhopGenome.

Gaston includes example files that can be used for illustration:

> x <- read.bed.matrix( system.file("extdata", "LCT.bed", package="gaston") )

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/LCT.rds

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/LCT.bed

> x

A bed.matrix with 503 individuals and 607 markers.

snps stats are not set (or incomplete)

ped stats are not set (or incomplete)

The folder extdata/ contains files LCT.bed, LCT.rds, LCT.bim and LCT.fam. The .bed, .bim and .fam filesfollow the PLINK specifications. The .rds file is a R data file; if it is present, the .bim and .fam files areignored. You can ignore the .rds file using option rds = NULL:

> x <- read.bed.matrix( system.file("extdata", "LCT.bed", package="gaston"), rds = NULL )

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/LCT.fam

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/LCT.bim

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/LCT.bed

Page 3: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

ped stats and snps stats have been set.

'p' has been set.

'mu' and 'sigma' have been set.

> x

A bed.matrix with 503 individuals and 607 markers.

snps stats are set

ped stats are set

A bed.matrix can be saved using write.bed.matrix. The default behavior is to write .bed, .bim, .fam and.rds files; see the manual page for more details.

2.2 Conversion from and to R objects

A numerical matrix x containing genotype counts (0, 1, 2 or NA) can be transformed in a bed.matrix withas(x, "bed.matrix"). The resulting object will lack individual and SNP informations (if the rownames andcolnames of x are set, they will be used as SNP and individual ids respectively).

Conversely, a numerical matrix can be retrieved from a bed.matrix using as.matrix.

The function as.bed.matrix allows to provide data frames corresponding to the .fam and .bim files. Theyshould have colnames famid, id, father, mother, sex, pheno, and chr, id, dist, pos, A1, A2 respectively.This function is widely used in the examples included in manual pages.

> data(TTN)

> x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)

> x

A bed.matrix with 503 individuals and 733 markers.

snps stats are set

ped stats are set

2.3 The insides of a bed.matrix

In first approach, a bed.matrix behaves as a ”read-only” matrix containing only 0, 1, 2 and NAs, unless thegenotypes are standardized (use standardize<-). They are stored in a compact form, each genotype beingcoded on 2 bits (hence 4 genotypes per byte).

Bed.matrices are implemented using S4 classes and methods. Let’s have a look on the slots names of thebed.matrix x created above using the dataset LCT.

> data(TTN)

> x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)

> slotNames(x)

[1] "ped" "snps" "bed"

[4] "p" "mu" "sigma"

[7] "standardize_p" "standardize_mu_sigma"

Page 4: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

The slot x@bed is an external pointer, which indicates where the genetic data are stored in memory. It willbe used by the C++ functions called by Gaston.

> x@bed

<pointer: 0x5616dee2b590>

Let’s look at the contents of the slots x@ped and x@snps. The other slots will be commented later.

The slot x@ped gives informations on the individuals. The first 6 columns correspond to the contents of a.fam file, or to the first 6 columns of a .ped file (known as linkage format). The other columns are simpledescriptive stats that are computed by Gaston, unless options(gaston.auto.set.stats = FALSE) was set(see below section 2.4).

> dim(x@ped)

[1] 503 30

> head(x@ped)

famid id father mother sex pheno N0 N1 N2 NAs N0.x N1.x N2.x NAs.x N0.y N1.y

1 HG00096 HG00096 0 0 0 NA 128 82 523 0 0 0 0 0 0 0

2 HG00097 HG00097 0 0 0 NA 109 81 543 0 0 0 0 0 0 0

3 HG00099 HG00099 0 0 0 NA 75 154 503 1 0 0 0 0 0 0

4 HG00100 HG00100 0 0 0 NA 148 86 499 0 0 0 0 0 0 0

5 HG00101 HG00101 0 0 0 NA 18 394 320 1 0 0 0 0 0 0

6 HG00102 HG00102 0 0 0 NA 50 180 503 0 0 0 0 0 0 0

N2.y NAs.y N0.mt N1.mt N2.mt NAs.mt callrate hz callrate.x hz.x callrate.y hz.y

1 0 0 0 0 0 0 1.0000000 0.1118690 NaN NaN NaN NaN

2 0 0 0 0 0 0 1.0000000 0.1105048 NaN NaN NaN NaN

3 0 0 0 0 0 0 0.9986357 0.2103825 NaN NaN NaN NaN

4 0 0 0 0 0 0 1.0000000 0.1173261 NaN NaN NaN NaN

5 0 0 0 0 0 0 0.9986357 0.5382514 NaN NaN NaN NaN

6 0 0 0 0 0 0 1.0000000 0.2455662 NaN NaN NaN NaN

callrate.mt hz.mt

1 NaN NaN

2 NaN NaN

3 NaN NaN

4 NaN NaN

5 NaN NaN

6 NaN NaN

The slot x@snps gives informations on the SNPs. Its first 6 columns corresponds to the contents of a .bim

file. The other columns are simple descriptive stats that are computed by Gaston. Again, if the global optionoptions(gaston.auto.set.stats = FALSE) was set (see below), these columns will be absent (see belowsection 2.4).

Page 5: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

> dim(x@snps)

[1] 733 17

> head(x@snps)

chr id dist pos A1 A2 N0 N1 N2 NAs N0.f N1.f N2.f NAs.f callrate

1 2 rs7571247 0 179200322 C T 5 88 410 0 NA NA NA NA 1

2 2 rs3813253 0 179200714 G A 24 187 292 0 NA NA NA NA 1

3 2 rs6760059 0 179200947 T C 11 139 353 0 NA NA NA NA 1

4 2 rs16866263 0 179201048 T G 2 53 448 0 NA NA NA NA 1

5 2 rs77946091 0 179201380 A G 2 53 448 0 NA NA NA NA 1

6 2 rs77711640 0 179201557 A G 2 54 447 0 NA NA NA NA 1

maf hz

1 0.09741551 0.1749503

2 0.23359841 0.3717694

3 0.16003976 0.2763419

4 0.05666004 0.1053678

5 0.05666004 0.1053678

6 0.05765408 0.1073559

Note: We refer to the allele A1 as the reference allele and to A2 as the alternative allele. The choiceof the reference allele is arbitrary, it doesn’t to be e.g. the major allele. If the matrix is produced byread.bed.matrix, Gaston will keep them in the order they appear in the .bim file, their won”t be anyreordering.

2.4 Basic statistics included in a bed.matrix

Some simple descriptive statistics can be added to a bed.matrix with set.stats. Since version 1.4 ofgaston, this function is called by default by all functions that create a bed.matrix, unless the global optionoptions(gaston.auto.set.stats = FALSE) was set. This option can be used to gain some time in someparticular cases.

We illustrate here the effect of this option:

> options(gaston.auto.set.stats = FALSE)

> data(TTN)

> x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)

> head(x@ped)

famid id father mother sex pheno

1 HG00096 HG00096 0 0 0 NA

2 HG00097 HG00097 0 0 0 NA

3 HG00099 HG00099 0 0 0 NA

4 HG00100 HG00100 0 0 0 NA

5 HG00101 HG00101 0 0 0 NA

6 HG00102 HG00102 0 0 0 NA

> head(x@snps)

chr id dist pos A1 A2

1 2 rs7571247 0 179200322 C T

Page 6: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

2 2 rs3813253 0 179200714 G A

3 2 rs6760059 0 179200947 T C

4 2 rs16866263 0 179201048 T G

5 2 rs77946091 0 179201380 A G

6 2 rs77711640 0 179201557 A G

The reader is invited to compare with what we obtained in the previous section of this document.

The function set.stats can be called to add the missing descriptive statistics to the ped and snps slots:

> x <- set.stats(x)

ped stats and snps stats have been set.

'p' has been set.

'mu' and 'sigma' have been set.

> head(x@ped)

famid id father mother sex pheno N0 N1 N2 NAs N0.x N1.x N2.x NAs.x N0.y N1.y

1 HG00096 HG00096 0 0 0 NA 128 82 523 0 0 0 0 0 0 0

2 HG00097 HG00097 0 0 0 NA 109 81 543 0 0 0 0 0 0 0

3 HG00099 HG00099 0 0 0 NA 75 154 503 1 0 0 0 0 0 0

4 HG00100 HG00100 0 0 0 NA 148 86 499 0 0 0 0 0 0 0

5 HG00101 HG00101 0 0 0 NA 18 394 320 1 0 0 0 0 0 0

6 HG00102 HG00102 0 0 0 NA 50 180 503 0 0 0 0 0 0 0

N2.y NAs.y N0.mt N1.mt N2.mt NAs.mt callrate hz callrate.x hz.x callrate.y hz.y

1 0 0 0 0 0 0 1.0000000 0.1118690 NaN NaN NaN NaN

2 0 0 0 0 0 0 1.0000000 0.1105048 NaN NaN NaN NaN

3 0 0 0 0 0 0 0.9986357 0.2103825 NaN NaN NaN NaN

4 0 0 0 0 0 0 1.0000000 0.1173261 NaN NaN NaN NaN

5 0 0 0 0 0 0 0.9986357 0.5382514 NaN NaN NaN NaN

6 0 0 0 0 0 0 1.0000000 0.2455662 NaN NaN NaN NaN

callrate.mt hz.mt

1 NaN NaN

2 NaN NaN

3 NaN NaN

4 NaN NaN

5 NaN NaN

6 NaN NaN

The columns N0, N1, N2 and NAs give for each individual the number of autosomal SNPs with a genotypeequal to 0, 1, 2 (corresponding to A1A1, A1A2 and A2A2) and missing, respectively. The homologous columnswith names ending with .x, .y and .mt give the same for SNPs on the X, Y, and mitochondria (this is whyyou get only 0s here).

The columns callrate, callrate.x, callrate.y, callrate.mt give the individual callrate for autosomal,X, Y, mitochondrial SNPs, and similarly, hz, hz.x, hz.y, hz.mt give the individual heterozygosity.

> head(x@snps)

chr id dist pos A1 A2 N0 N1 N2 NAs N0.f N1.f N2.f NAs.f callrate

Page 7: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

1 2 rs7571247 0 179200322 C T 5 88 410 0 NA NA NA NA 1

2 2 rs3813253 0 179200714 G A 24 187 292 0 NA NA NA NA 1

3 2 rs6760059 0 179200947 T C 11 139 353 0 NA NA NA NA 1

4 2 rs16866263 0 179201048 T G 2 53 448 0 NA NA NA NA 1

5 2 rs77946091 0 179201380 A G 2 53 448 0 NA NA NA NA 1

6 2 rs77711640 0 179201557 A G 2 54 447 0 NA NA NA NA 1

maf hz

1 0.09741551 0.1749503

2 0.23359841 0.3717694

3 0.16003976 0.2763419

4 0.05666004 0.1053678

5 0.05666004 0.1053678

6 0.05765408 0.1073559

The columns N0, N1, N2 and NAs contain the number of genotypes 0, 1, 2 and NA for each SNP. For X-linkedSNPs only, the homologous columns with names ending with .f give the same, taking only women intoaccount. The callrate is the proportion of non-missing genotypes. In the snps slot, maf is the minor allelefrequency, and hz it the heterozygosity rate. Note that the callrate for Y linked SNPs taking only men intoaccount, while the heterozygosity rate for X linked SNPs is computed only on women.

The default is too also also update the slots x@p, x@mu and x@sigma:

> str(x@p)

num [1:733] 0.903 0.766 0.84 0.943 0.943 ...

> str(x@mu)

num [1:733] 1.81 1.53 1.68 1.89 1.89 ...

> str(x@sigma)

num [1:733] 0.421 0.587 0.512 0.33 0.33 ...

p contains the alternative allele (A2) frequency; mu is equal to 2*p and is the expected value of the genotype(coded in 0, 1, 2); and sigma is the genotype standard error. If the Hardy-Weinberg equilibrium holds, sigmashould be close to sqrt(2*p(1-p)). This is illustrated on the figure below.

> plot(x@p, x@sigma, xlim=c(0,1))

> t <- seq(0,1,length=101);

> lines(t, sqrt(2*t*(1-t)), col="red")

Page 8: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

●●●

●●

●●●

●●●

●●●

●●

●●●●●

●●

●●

●●●

●●

●●●

●●●

●●●●

●●●

●●●●

●● ●

●●

●●

●●

●●

●●●● ●●

●● ●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●

●●●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●● ●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

0.0 0.2 0.4 0.6 0.8 1.0

0.3

0.4

0.5

0.6

0.7

x@p

x@si

gma

There are also functions set.stats.snps and set.stats.ped to update only x@snps and x@ped.

The option gaston.auto.set.stats can be useful when extracting individuals from bed.matrices (seen nextsection): if gaston.auto.set.stats = FALSE, the slots @p, @mu and @sigma are left unaltered; in contrast,the default behavior is to call set.stats which recomputes the values of these slots. Note that when usingset.stats “manually”, one can use options to alter its behavior, allowing to update only some slots. Thefollowing code illustrates this (remember that here gaston.auto.set.stats = FALSE) :

> head(x@p)

[1] 0.9025845 0.7664016 0.8399602 0.9433400 0.9433400 0.9423459

> y <- x[1:30,]

> head(y@p)

[1] 0.9025845 0.7664016 0.8399602 0.9433400 0.9433400 0.9423459

> y <- set.stats(y, set.p = FALSE)

ped stats and snps stats have been set.

'mu' and 'sigma' have been set.

> head(y@p)

[1] 0.9025845 0.7664016 0.8399602 0.9433400 0.9433400 0.9423459

> y <- set.stats(y)

ped stats and snps stats have been set.

'p' has been set.

'mu' and 'sigma' have been set.

> head(y@p)

[1] 0.8833333 0.8000000 0.8166667 0.9833333 0.9833333 0.9833333

Hardy-Weinberg Equilibrium can be tested for all SNPs simply by x <- set.hwe(x). This command addsto the data frame x@snps a column hwe, containing the p-value of the test (both χ2 and “exact” tests areavailable, see the man page).

Note that when writting a bed.matrix x to disk with write.bed.matrix, the .rds file contains all the slots

Page 9: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

of x, and thus contains the additionnal variables added by set.stats or set.hwe, which are not saved inthe .bim and .fam files.

For the remaining of this vignette, we restore the default behaviour:

> options(gaston.auto.set.stats = TRUE)

Page 10: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

2.5 Subsetting bed.matrices

It is possible to subset bed.matrices just as normal matrices, writing e.g. x[1:100,] to extract the first 100individuals, or x[1:100,10:19] to extract the SNPs 10 to 19 for these 100 individuals:

> x[1:100,]

A bed.matrix with 100 individuals and 733 markers.

snps stats are set

ped stats are set

> x[1:100,10:19]

A bed.matrix with 100 individuals and 10 markers.

snps stats are set

ped stats are set

The use of logical vectors for subsetting is allowed too. The following code extracts the SNPs with minorallele frequency > 0.1:

> x[,x@snps$maf > 0.1]

A bed.matrix with 503 individuals and 501 markers.

snps stats are set

ped stats are set

For this kind of selection, Gaston provides the functions select.inds and select.snps, which have a nicersyntax. Hereafter we use the same condition on the minor allele frequency, and we introduce also a conditionon the Hardy-Weinberg Equilibrium p-value:

> x <- set.hwe(x)

Computing HW chi-square p-values

> select.snps(x, maf > 0.1 & hwe > 1e-3)

A bed.matrix with 503 individuals and 497 markers.

snps stats are set

ped stats are set

The conditions in select.snps can use any of the variables defined in the data frame x@snps, as well asvariables defined in the user session.

The function select.inds is similar, using variables of the data frame x@ped. One can for example selectindividuals with callrate greater than 95% with select.inds(x, callrate > 0.5).

These functions should make basic Quality Control easy.

Note: If options(gaston.auto.set.stats = FALSE) was set, subsetting will erase some or all statisticsadded by set.stats, except the p, mu and sigma slots.

Page 11: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

2.6 Genomic sex

When dealing with genome wide data, the function set.genomic.sex can be use to add a x@ped$genomic.sexcolumn, determined by clustering on individuals X heterozygosity rate and Y callrate.

x <- select.inds(x, sex == genomic.sex) will then allow to keep only individuals whose sex and ge-nomic sex are concordant.

An other possibility is to use x@ped$sex <- x@ped$genomic.sex; x <- set.stats(x) to erase the originalsex indication and use genomic sex in subsequent analyses.

2.7 Merging bed.matrices

Gaston defines methods rbind and cbind which can be used to merge several bed.matrices. cbind checksthat individuals famids and ids are identical ; all remaining columns are taken from the first argument,without further control.

> data(AGT)

> x1 <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim)

> x1

A bed.matrix with 503 individuals and 361 markers.

snps stats are set

ped stats are set

> data(LCT)

> x2 <- as.bed.matrix(LCT.gen, LCT.fam, LCT.bim)

> x2

A bed.matrix with 503 individuals and 607 markers.

snps stats are set

ped stats are set

> x <- cbind(x1,x2)

> x

A bed.matrix with 503 individuals and 968 markers.

snps stats are set

ped stats are set

rbind similarly checks whether the SNP ids are identical, and also if the reference alleles are identical.If needed, it will perform reference allele inversions (changing a SNP A/G to G/A) or allele strand flips(changing A/G to T/C) or both.

> x3 <- x[1:10, ]

> x4 <- x[-(1:10), ]

> rbind(x3,x4)

A bed.matrix with 503 individuals and 968 markers.

snps stats are set

ped stats are set

Page 12: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

3 Standardized matrices

Gaston allows to “standardize” a genotype matrix, replacing each genotype Xij (i is the individual index, jis the SNP index) by

Xij − µjσj

(1)

where µj = 2pj is the mean of the 0,1,2-coded genotype (pj being the alternative allele frequency), andσj is either its standard error, or the expected standard error under Hardy-Weinberg Equilibrium, that is√

2pj(1− pj).

Consider the TTN data set. The unscaled matrix contains 0,1,2 values:

> x <- as.bed.matrix(TTN.gen, TTN.fam, TTN.bim)

> X <- as.matrix(x)

> X[1:5,1:4]

rs7571247 rs3813253 rs6760059 rs16866263

HG00096 2 2 2 2

HG00097 2 2 2 2

HG00099 1 2 2 2

HG00100 2 0 0 2

HG00101 2 2 2 2

To scale it using the standard error, use standardize(x) <- "mu_sigma" (or standardize(x) <- "mu", asthe function uses match.arg):

> standardize(x) <- "mu"

> as.matrix(x)[1:5, 1:4]

rs7571247 rs3813253 rs6760059 rs16866263

HG00096 0.4629595 0.7953656 0.6254622 0.3437939

HG00097 0.4629595 0.7953656 0.6254622 0.3437939

HG00099 -1.9132512 0.7953656 0.6254622 0.3437939

HG00100 0.4629595 -2.6094762 -3.2827052 0.3437939

HG00101 0.4629595 0.7953656 0.6254622 0.3437939

The result is similar to what would be obtained from the base function scale:

> scale(X)[1:5,1:4]

rs7571247 rs3813253 rs6760059 rs16866263

HG00096 0.4629595 0.7953656 0.6254622 0.3437939

HG00097 0.4629595 0.7953656 0.6254622 0.3437939

HG00099 -1.9132512 0.7953656 0.6254622 0.3437939

HG00100 0.4629595 -2.6094762 -3.2827052 0.3437939

HG00101 0.4629595 0.7953656 0.6254622 0.3437939

To use√

2pj(1− pj) instead of the standard error, use standardize(x) <- "p"; a similar result can againbe obtained from scale, with the right options:

Page 13: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

> standardize(x) <- "p"

> as.matrix(x)[1:5, 1:4]

rs7571247 rs3813253 rs6760059 rs16866263

HG00096 0.4646063 0.7807675 0.6173047 0.3465926

HG00097 0.4646063 0.7807675 0.6173047 0.3465926

HG00099 -1.9200567 0.7807675 0.6173047 0.3465926

HG00100 0.4646063 -2.5615820 -3.2398911 0.3465926

HG00101 0.4646063 0.7807675 0.6173047 0.3465926

> scale(X, scale = sqrt(2*x@p*(1-x@p)))[1:5,1:4]

rs7571247 rs3813253 rs6760059 rs16866263

HG00096 0.4646063 0.7807675 0.6173047 0.3465926

HG00097 0.4646063 0.7807675 0.6173047 0.3465926

HG00099 -1.9200567 0.7807675 0.6173047 0.3465926

HG00100 0.4646063 -2.5615820 -3.2398911 0.3465926

HG00101 0.4646063 0.7807675 0.6173047 0.3465926

To go back to the 0,1,2-coded genotypes, use standardize(x) <- "none".

Note: In standardized matrices, the NA values are replaced by zeroes, which amount to impute the missinggenotypes by the mean genotype.

3.1 Matrix multiplication

Standardized matrices can be multiplied with numeric vectors or matrices with the operator %*%.

This feature can be used e.g. to simulate quantitative phenotypes with a genetic component. The followingcode simulates a phenotype with an effect of the SNP #350:

> y <- x %*% c(rep(0,350),0.25,rep(0,ncol(x)-351)) + rnorm(nrow(x), sd = 1)

Page 14: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

3.2 Genetic Relationship Matrix and Principal Component Analysis

If Xs is a standardized n× q genotype matrix, a Genetic Relationship Matrix (GRM) of the individuals canbe computed as

GRM =1

q − 1XsX

′s

where q is the number of SNPs. This computation is done by the function GRM.

NOTE: Since version 1.4.7, GRM has an argument which.snps which allows to give a logical vector specifyingwhich SNPs to use in the computation. The default is to use only autosomal SNPs (the above formula makeslittle to no sense for X linked SNPs, unless there are only women in the sample).

The Principal Component Analysis (PCA) of large genomic data sets is used to retrieve population strat-ification. It can be obtained from the eigen decomposition of the GRM. To illustrate this, we included inthe extdata folder a dataset extracted from the 1000 genomes project. This data set comprehend the 503individuals of european ascent, with 10025 SNPs on the chromosome 2. These SNPs have been selected withthe function LD.thin so that they have very low LD with each other (r2 < 0.05). We added in the dataframe x@ped a variable population which is a factor with levels CEU, FIN, GBR, IBS and TSI.

We can load this data set as follows:

> x <- read.bed.matrix( system.file("extdata", "chr2.bed", package="gaston") )

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/chr2.rds

Reading /tmp/Rtmp6g175f/Rinst3ab040bdfa1a/gaston/extdata/chr2.bed

> x

A bed.matrix with 503 individuals and 10025 markers.

snps stats are not set (or incomplete)

ped stats are not set (or incomplete)

> head(x@ped)

famid id father mother sex pheno population

1 HG00096 HG00096 0 0 0 -9 GBR

2 HG00097 HG00097 0 0 0 -9 GBR

3 HG00099 HG00099 0 0 0 -9 GBR

4 HG00100 HG00100 0 0 0 -9 GBR

5 HG00101 HG00101 0 0 0 -9 GBR

6 HG00102 HG00102 0 0 0 -9 GBR

> table(x@ped$population)

CEU FIN GBR IBS TSI

99 99 91 107 107

We compute the Genetic Relationship Matrix, and its eigen decomposition (we don’t need to standardize x

explicitely: is x isn’t standardized, GRM will use standardize(x) = "p"):

> K <- GRM(x)

>

> eiK <- eigen(K)

> # deal with a small negative eigen value

Page 15: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

> eiK$values[ eiK$values < 0 ] <- 0

The eigenvectors are normalized. The Principal Components (PC) can be computed by multiplying them bythe square root of the associated eigenvalues. Here we plot the projection of the 503 individuals on the firsttwo PCs, with colors corresponding to their population.

> PC <- sweep(eiK$vectors, 2, sqrt(eiK$values), "*")

> plot(PC[,1], PC[,2], col=x@ped$population)

> legend("bottomleft", pch = 1, legend = levels(x@ped$population), col = 1:5)

●●●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●●

● ●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●●

−0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10

−0.

15−

0.10

−0.

050.

000.

050.

10

PC[, 1]

PC

[, 2]

CEUFINGBRIBSTSI

As K can be written

K =

(1√q − 1

Xs

)(1√q − 1

Xs

)′,

the PCs are the left singular vectors of 1√q−1Xs. The vectors of loadings are the right singlar vectors of

this matrix. The vector of loadings corresponding to a PC u is the vector v with unit norm, such thatu = 1√

q−1Xsv.

They can be retrieved with the function bed.loadings:

> # one can use PC[,1:2] instead of eiK£vectors[,1:2] as well

> L <- bed.loadings(x, eiK$vectors[,1:2])

> dim(L)

[1] 10025 2

> head(L)

[,1] [,2]

rs113106463 -0.0059549225 0.0064476340

rs13390778 -0.0099678228 -0.0125720968

rs75011129 -0.0001122798 -0.0001181848

rs4637157 -0.0151318033 -0.0148076809

rs62116661 -0.0002977576 0.0048268361

Page 16: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

rs10170011 0.0185198431 -0.0019553340

We verify that the loadings have unit norm:

> colSums(L**2)

[1] 1 1

And that the PCs are retrieved by right multiplyingXs by the loadings (here we need to explicitely standardizex):

> standardize(x) <- 'p'

> head( (x %*% L) / sqrt(ncol(x)-1) )

[,1] [,2]

HG00096 0.021410879 -0.12418404

HG00097 -0.014005020 -0.07378044

HG00099 0.001384442 -0.09708206

HG00100 0.020319167 -0.08920824

HG00101 0.002596090 -0.08925960

HG00102 0.010155044 -0.08968427

> head( PC[,1:2] )

[,1] [,2]

[1,] 0.021410879 -0.12418405

[2,] -0.014005017 -0.07378046

[3,] 0.001384438 -0.09708205

[4,] 0.020319161 -0.08920823

[5,] 0.002596081 -0.08925958

[6,] 0.010155038 -0.08968427

Page 17: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

3.3 Linkage Disequilibrium

Doing the crossproduct in the reverse order produces a moment estimate of the Linkage Disequilibrium (LD):

LD =1

n− 1X ′sXs

where n is the number of individuals. This computation is done by the function LD (usually, only parts ofthe whole LD matrix is computed). The fonction LD can compute a square symmetric matrix giving the LDfor a given region, measured by r2 (the default), r or D. It can also compute the LD between two differentregions.

> data(AGT)

> x <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim)

>

> ld.x <- LD(x, c(1,ncol(x)))

>

> LD.plot(ld.x, snp.positions = x@snps$pos, write.ld = NULL,

+ max.dist = 20e3, write.snp.id = FALSE, draw.chr = FALSE,

+ pdf = "LD_AGT.pdf")

pdf

2

This method is also used by LD.thin to extract a set of SNPs in low linkage disequilibrium (it is oftenrecommended to perform this operation before computing the GRM). We illustrate this function on the AGTdata set:

> y <- LD.thin(x, threshold = 0.4, max.dist = 500e3)

> y

A bed.matrix with 503 individuals and 48 markers.

snps stats are set

ped stats are set

The argument max.dist = 500e3 is to specify that the LD won’t be computed for SNPs more than 500 kbappart. We verify that there is no SNP pair with LD r2 > 0.4 (note that the LD matrix has ones on thediagonal):

> ld.y <- LD( y, lim = c(1, ncol(y)) )

> sum( ld.y > 0.4 )

[1] 48

Page 18: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

4 Linear Mixed Models

Linear Mixed Models are usually written under the form

Y = Xβ + Z1u1 + · · ·+ Zkuk + ε

where Y ∈ Rn is the response vector, and X ∈ Rn×p, Z1 ∈ Rn×q1 , . . . , Zk ∈ Rn×qk are design matrices.The vector β ∈ Rp is the vector of fixed effects; the random effects are drawn in normal distributions,u1 ∼ N (0, τ1Iq1), . . . , uk ∼ N (0, τkIqk), as the residual error ε ∼ N (0, σ2In).

Here we will use the equivalent form

Y = Xβ + ω1 + . . .+ ωk + ε

where Ki = ZiZ′i (i = 1, . . . , k), the random terms are ωi ∼ N(0, τiKi) for i ∈ 1, . . . , k and ε ∼ N(0, σ2In).

Note that using the R function model.matrix can help you to rewrite a model under this form.

Gaston provides two functions for estimating the model parameters when the model is written in the secondform. We are going to illustrate these functions on simulated data.

4.1 Data simulation

We will use the above notations. First generate some (random) design matrices:

> set.seed(1)

> n <- 100

> q1 <- 20

> Z1 <- matrix( rnorm(n*q1), nrow = n )

> X <- cbind(1, 5*runif(n))

Then the vector of random effects u1 and a vector y under the model Y = X(1, 2)′ + Zu1 + ε with u1 ∼N (0, 2Iq1) and ε ∼ N (0, 3In):

> u1 <- rnorm(q1, sd = sqrt(2))

> y <- X %*% c(1,2) + Z1 %*% u1 + rnorm(n, sd = sqrt(3))

To illustrate the case where there are several random effects vectors, we generate an other matrix Z2, thecorresponding vector of random effects u2, and a vector y2 under the model Y = X(1, 2)′ + Zu1 + Z2u2 + ε.

> q2 <- 10

> Z2 <- matrix( rnorm(n*q2), nrow = n )

> u2 <- rnorm(q2, sd = 1)

> y2 <- X %*% c(1,2) + Z1 %*% u1 + Z2 %*% u2 + rnorm(n, sd = sqrt(3))

4.2 Model fitting with the AIREML algorithm

lmm.aireml is a function for linear mixed models parameter estimation and BLUP computations.

Page 19: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

4.2.1 One random effects vector

Let’s start with the simple case (only one random effects vector). As lmm.aireml uses the second form ofthe linear mixed model, and we give it the matrix K1 = Z1Z

′1.

> K1 <- tcrossprod(Z1)

> fit <- lmm.aireml(y, X, K = K1, verbose = FALSE)

> str(fit)

List of 11

$ sigma2 : num 3.37

$ tau : num 1.65

$ logL : num -150

$ logL0 : num -231

$ niter : int 8

$ norm_grad : num 1.92e-08

$ Py : num [1:100] 0.509 -0.236 0.11 0.328 -0.291 ...

$ BLUP_omega: num [1:100] -0.441 -12.618 -1.724 5.052 5.831 ...

$ BLUP_beta : num [1:2] 1.4 1.79

$ varbeta : num [1:2, 1:2] 0.1398 -0.0403 -0.0403 0.0167

$ varXbeta : num 7.61

The result is a list giving all kind of information. Here we see that τ is estimated to 1.65 and σ2 to 3.37.The Best Linear Unbiased Predictor (BLUP) for β is in the component BLUP_beta and here it is (1.4, 1.79).

The component BLUP_omega holds the BLUP for ω1 = Zu1. The BLUP for u1 can be retrieved by theformula u1 = τ1Z

′1Py, the value of Py being in the component Py. The plots below compare the true values

of ω1 and u1 with their BLUPs.

> par(mfrow = c(1, 2))

> plot(Z1 %*% u1, fit$BLUP_omega); abline(0, 1, lty = 2, col = 3)

> BLUP_u1 <- fit$tau * t(Z1) %*% fit$Py

> plot(u1, BLUP_u1); abline(0, 1, lty = 2, col = 3)

●●

●●

● ●

●●

● ●

● ●

●●

●●

●●

● ●

−10 −5 0 5 10 15 20

−10

−5

05

1015

20

Z1 %*% u1

fit$B

LUP

_om

ega

−3 −2 −1 0 1 2

−3

−2

−1

01

2

u1

BLU

P_u

1

Page 20: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

4.2.2 Several random effects vectors

It is sufficient to give to lmm.aireml a list with the two matrices K1 and K2:

> K2 <- tcrossprod(Z2)

> fit2 <- lmm.aireml(y2, X, K = list(K1, K2), verbose = FALSE)

> str(fit2)

List of 10

$ sigma2 : num 3.04

$ tau : num [1:2] 1.81 0.821

$ logL : num -164

$ logL0 : num -244

$ niter : int 13

$ norm_grad : num 3.17e-07

$ Py : num [1:100] -0.353 -0.199 0.879 0.472 0.968 ...

$ BLUP_omega: num [1:100] 0.342 -15.406 6.239 4.559 4.361 ...

$ BLUP_beta : num [1:2] 1.1 1.98

$ varXbeta : num 9.33

The component tau now holds the two values τ1 and τ2. Note that there is only one BLUP_omega vector. Itcorresponds to the BLUP for ω1 + ω2. The BLUPs for each ωi can be retrieved using ωi = τiKiPy:

> par(mfrow = c(1, 2))

> omega1 <- fit2$tau[1] * K1 %*% fit2$Py

> omega2 <- fit2$tau[2] * K2 %*% fit2$Py

> plot(Z1 %*% u1, omega1); abline(0, 1, lty = 2, col = 3)

> plot(Z2 %*% u2, omega2); abline(0, 1, lty = 2, col = 3)

●●

●●●

●●

● ●

−10 −5 0 5 10 15 20

−10

−5

05

1015

20

Z1 %*% u1

omeg

a1 ●

●● ●

●●

● ●

● ●

●●●

●●

−5 0 5

−10

−5

05

Z2 %*% u2

omeg

a2

The BLUPs for u1 and u2 can as above be retrieved using ui = τiZ′iPy.

Page 21: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

4.3 Model fitting with the diagonalization trick

In the case where there is only one vector of random effects (k = 1), the diagonalization trick uses the eigendecomposition of K1 to speed up the computation of the restricted likelihood, which allows to use a genericalgorithm to maximize it. Of course, the eigen decomposition of K1 needs to be computed beforehand, whichinduces an overhead.

The trick relies on a transformation of the data which leads to a diagonal variance matrix for Y . Thecomputation of a the restricted likelihood involves the computation of the inverse of this matrix, which isthen easy to compute. Write the eigen decomposition of K1 as K1 = UΣ2U ′ with Σ2 a diagonal matrix ofpositive eigenvalues, and U an orthogonal matrix. If we let Y = U ′Y , X = U ′X, ω1 = U ′ω1, and ε = U ′ε,we have

Y = Xβ + ω1 + ε

with ω1 ∼ N(0, τΣ2

)and ε ∼ N(0, σ2In). As stated above, var

(Y)

= τΣ2 + σ2In is diagonal.

We fit the first model again:

> eiK1 <- eigen(K1)

> fit.d <- lmm.diago(y, X, eiK1)

Optimization in interval [0, 1]

Optimizing with p = 0

[Iteration 1] Current point = 0 df = 4301.78

[Iteration 2] Current point = 0.00597137 df = 2202.74

[Iteration 3] Current point = 0.0183343 df = 1073.77

[Iteration 4] Current point = 0.041009 df = 501.592

[Iteration 5] Current point = 0.0785099 df = 224.008

[Iteration 6] Current point = 0.134165 df = 94.4939

[Iteration 7] Current point = 0.206173 df = 36.1315

[Iteration 8] Current point = 0.279987 df = 10.9812

[Iteration 9] Current point = 0.326676 df = 1.82058

[Iteration 10] Current point = 0.337786 df = 0.0676512

[Iteration 11] Too many iterations, using Brent algorithm

[Iteration 11] Brent gives 0.338232

> str(fit.d)

List of 9

$ sigma2 : num 3.23

$ tau : num 1.65

$ Py : num [1:100] 0.531 -0.243 0.116 0.34 -0.304 ...

$ BLUP_omega: num [1:100] -0.443 -12.629 -1.73 5.059 5.836 ...

$ BLUP_beta : num [1:2] 1.4 1.79

$ varbeta : num [1:2, 1:2] 0.134 -0.0387 -0.0387 0.016

$ Xbeta : num [1:100] 3.09 5.93 1.65 5.85 9.89 ...

$ varXbeta : num 7.51

$ p : int 0

You can check that the result is similar to the result obtained earlier with lmm.aireml.

The likelihood computation with the diagonalization trick is fast enough to plot the likelihood:

Page 22: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

> TAU <- seq(0.5,2.5,length=50)

> S2 <- seq(2.5,4,length=50)

> lik <- lmm.diago.likelihood(tau = TAU, s2 = S2, Y = y, X = X, eigenK = eiK1)

> lik.contour(TAU, S2, lik, heat = TRUE, xlab = "tau", ylab = "sigma^2")

0.5 1.0 1.5 2.0 2.5

2.6

2.8

3.0

3.2

3.4

3.6

3.8

4.0

tau

sigm

a^2

−160.627

−157.851

−152.972 −151.658

−151.658

−151.229

−151.229

−150.974

−150.718

−150.457

4.4 Genomic heritability estimation with Gaston

Heritability estimates based on Genetic Relationship Matrices (GRM) are computed under the followingmodel:

Y = β + ω + ε

where ω ∼ N (0, τK) and ε ∼ N (0, σ2In), K being a GRM computed on whole genome data (e.g. by thefunction GRM). The heritability is h2 = τ

τ+σ2 .

Note that as K = 1q−1XsX

′s with Xs a standardized genotype matrix, letting Z = (q − 1)−

12Xs, this model

is equivalent toY = β + Zu+ ε

with u ∼ N (0, τIq).

The function random.pm generates random positive matrices with eigenvalues roughly similar to those typi-cally observed when using whole genome data. It outputs a list with a member K: a positive matrix, and amember eigen: its eigen decomposition (as the base function eigen would output it).

The function lmm.simu can be used simulate data under the linear mixed model. It uses the eigen decompo-sition of K. Hereafter we use τ = 1, σ2 = 2, hence a 33.3% heritability.

> set.seed(1)

> n <- 2000

> R <- random.pm(n)

> y <- 2 + lmm.simu(tau = 1, sigma2 = 2, eigenK = R$eigen)$y

Page 23: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

We can use lmm.diago to estimates the parameters of the model.

> fit <- lmm.diago(y, eigenK = R$eigen)

Optimization in interval [0, 1]

Optimizing with p = 0

[Iteration 1] Current point = 0 df = 3.32616

[Iteration 2] Current point = 0.099573 df = 0.0543536

[Iteration 3] Current point = 0.101252 df = 1.2657e-05

We have τ = 0.298 and σ2 = 2.65, hence h2 is estimated to 10.1%.

The function lmm.diago.likelihood allows to compute a profile likelihood with parameter h2 = ττ+σ2 . It

can be useful to get confidence intervals. Here, we simply plot it:

> H2 <- seq(0,1,length=51)

> lik <- lmm.diago.likelihood(h2 = H2, Y = y, eigenK = R$eigen)

> plot(H2, exp(lik$likelihood-max(lik$likelihood)), type="l", yaxt="n", ylab="likelihood")

0.0 0.2 0.4 0.6 0.8 1.0

H2

likel

ihoo

d

It is often advised to include the first few (10 or 20) Principal Components (PC) in the model as fixedeffects, to account for a possible population stratification. The function lmm.diago has an argument p forthe number of PCs to incoporate in the model with fixed effects. We simulate a trait with a large effect ofthe two first PCs:

> PC <- sweep(R$eigen$vectors, 2, sqrt(R$eigen$values), "*")

> y1 <- 2 + PC[,1:2] %*% c(5,5) + lmm.simu(tau = 1, sigma2 = 2, eigenK = R$eigen)$y

Here are the heritability estimates with p = 0 (the default) and p = 10.

> fit0 <- lmm.diago(y1, eigenK = R$eigen)

Optimization in interval [0, 1]

Page 24: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

Optimizing with p = 0

[Iteration 1] Current point = 0 df = 8.47139

[Iteration 2] Current point = 0.239713 df = 0.364915

[Iteration 3] Current point = 0.2509 df = 0.000461187

[Iteration 4] Current point = 0.250914 df = 7.18147e-10

> fit0$tau/(fit0$tau+fit0$sigma2)

[1] 0.2509142

> fit10 <- lmm.diago(y1, eigenK = R$eigen, p = 10)

Optimization in interval [0, 1]

Optimizing with p = 10

[Iteration 1] Current point = 0 df = 4.45003

[Iteration 2] Current point = 0.143356 df = 0.0854252

[Iteration 3] Current point = 0.146209 df = 2.38332e-05

> fit10$tau/(fit10$tau+fit10$sigma2)

[1] 0.1462102

As expected, the estimate is inflated when no PCs are incorporated in the model.

Page 25: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

5 Association tests

The function association.test performs genome-wide association tests with either (generalized) linearmodels, or with generalized) linear (mixed) models, for quantitative or binary (0/1) traits.

5.1 Quantitative trait

When the function association.test is called with parameter method = "lm" it performs a test of associ-ation between SNPs and a quantitative trait Y under the model

Y = (X|PC)α+Gβ + ε

where X is the matrix of covariables with fixed effects (including a column of ones for the intercept), G isthe vector of genotypes at the SNP under consideratione. A few PCs can be included in the model with fixedeffects to correct for population stratification (the parameter p gives the number of PCs to include). TheWald test is performed to test for β = 0.

When the function is called with parameter method = "lmm" the following mixed model is used instead:

Y = (X|PC)α+Gβ + ω + ε

where X and G are as above, and ω ∼ N (0, τK) where K is a Genetic Relationship Matrix (computed onthe whole genome).

Three tests are proposed for β = 0: a score test, a Wald test on β, or a Likelihood Ratio Test.

We illustrate this function on a simple example, built on the AGT data set:

> data(AGT)

> x <- as.bed.matrix(AGT.gen, AGT.fam, AGT.bim)

> standardize(x) <- 'p'

As the whole genome is not available, we generate a random positive matrix to play the role of the GRM:

> set.seed(27)

> R <- random.pm(nrow(x))

And we simulate a phenotype with an effect of the SNP #351, and a polygenic component:

> y <- 2 + x %*% c(rep(0,350),0.35,rep(0,ncol(x)-351)) +

+ lmm.simu(tau = 0.3, sigma2 = 1, eigenK=R$eigen)$y

The following code performs the association test with a score test and a wald test, and compares the resultingp-values:

Page 26: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

> t_wald <- association.test(x, y, K = R$K, method = "lm", test = "wald")

> t_wald_mixed <- association.test(x, y, eigenK = R$eigen, method = "lmm", test = "wald")

> plot( t_wald$p, t_wald_mixed$p, log = "xy", xlab = "lm (wald)", ylab = "lmm (wald)")

> abline(0,1,lty=3)

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●●●●●

●●●

●●

●●●●●●●

●●●●●●

●●

●●●

●●●●●●●●

●●●●

●●●

●●●●●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●●●

●●

●●

●●

●●●

●●●

●●

●●●

●●●

●●

●●

●●●

●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

1e−10 1e−07 1e−04 1e−01

1e−

101e

−07

1e−

041e

−01

lm (wald)

lmm

(w

ald)

We can display the results under the form of a (mini) Manhattan plot:

> manhattan(t_wald_mixed, col = ifelse(1:ncol(x) == 351, "red", "black"))

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●●

●●●

●●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●●●●

●●

●●●●●●

●●●●●

●●●●●●●

●●●●●●●●●

●●●

●●●●●●●●

●●●●

●●●

●●●●●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●● ●

●●●

●●

●●● ●

●●●

●●

●●●

●●●

●●

●●●

●●●

●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●●●●

●●

●●

●●

●●

●●●

02

46

810

Position

−lo

g 10(p

)

230800000 230820000 230840000 230860000 230880000 230900000

We colored the point corresponding to the SNP #351 in red. Note that when used on genome-wide data,the function manhattan will draw a classical Manhattan plot, alternating colors between chromosomes.

There are other SNPs with low association p-values: these are the SNPs in LD with SNP #351. We canconfirm this by plotting the p-values (again on log scale) as a function of the LD (measured by r2):

Page 27: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

> lds <- LD(x, 351, c(1,ncol(x)))

> plot(lds, -log10(t_wald_mixed$p), xlab="r^2", ylab="-log(p)")

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●●

●●

● ●

●●

●●●

●●

●●●

●●

●●

●●●●

●●

● ●●●●●

●●●●●

●●●●●●●

●●●●●●●●●

●●●

●●●●●●●●

●●●●

●●●

●●●●●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●●●

●●●

●●

●●

●●●

●●●

●●

●● ●

●●●

●●

●●●

●●●

●●●●

●●

●●

● ●●

●●

● ●

●●

●●

●●

●●●●

●●

●●●

●●

●● ●●●●

●●

● ●

●●

●●

●●●

0.0 0.2 0.4 0.6 0.8

02

46

810

r^2

−lo

g(p)

5.2 Binary phenotype

We use the quantitative phenotype previously generated to construct a binary phenotype, and we performthe association test with a logistic model (Wald test) or logistic mixed model (score test).

> y1 <- as.numeric(y > mean(y))

> t_binary <- association.test(x, y1, K = R$K, method = "lm", response = "binary", test="wald")

> t_binary_mixed <- association.test(x, y1, K = R$K, method = "lmm", response = "binary")

[Initialization] beta = -0.802666

[Initialization] tau = 0.1

[Iteration 1] tau = 0.0999517

[Iteration 1] beta = -0.797314

(d1 = inf)

(d = 0.243753)

[Iteration 2] gr = -1.17655

[Iteration 2] tau = 0.0203148

[Iteration 2] beta = -0.802054

(d1 = 1.042e+304)

(d = 0.248665)

[Iteration 3] gr = -0.00651935

[Iteration 3] tau = 0.0198994

[Iteration 3] beta = -0.801601

(d1 = 9.89056e+303)

(d = 0.248691)

[Iteration 4] gr = -1.49208e-06

[Iteration 4] tau = 0.0198993

Page 28: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

[Iteration 4] beta = -0.8016

> plot( t_binary$p, t_binary_mixed$p, log = "xy", xlab = "lm (wald)", ylab = "lmm (score)" )

> abline(0,1,lty=3)

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●●

●●

●●

●●

●●●

●●

●●●

●●●

●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●●●●●●●●

●●●

●●●●●●●●

●●

●●●

●●●●●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

1e−11 1e−08 1e−05 1e−02

1e−

111e

−08

1e−

051e

−02

lm (wald)

lmm

(sc

ore)

Page 29: Gaston - R · In this vignette, we illustrate Gaston using the included data sets AGT, LCT, and TTN(see the corresponding manual pages for a description). Gaston also includes some

6 What you can hope for in later releases

A few things that may be added to the package in the near future:

• Reading .ped files

• Maximum Likelihood Linkage Disequilibrium estimation

• LMM models fitting written in equation form

• More functions and models for association testing

• Anything you asked the maintainer for