Genome-wide association study of an immune disease (IBD) · Immune signatures of recovery after surgery (hip replacement)

Scientific scenarios


Genome-wide association study of an immune disease (IBD)

Immune signatures of recovery after surgery (hip replacement)

Drug repositioning for an immune disease (IBD)

Network analysis of a brain disease (Alzheimer’s Disease)

Scientific scenarios


Genome wide association studies Covariates Meta-analysis Population stratification

Principal Component Analysis Enrichment analysis (pathways, Gene Ontology, published datasets)

CyTOF Hierarchical clustering Test of differences in mean Correlation analysis

Prediction of clinical outcomes Regression analysis Covariates

Gene expression Connectivity maps Enrichment analysis (Gene Ontology)

Test for differences in mean

Gene expression Coexpression analysis Gene networks Hierarchical clustering Gene signatures

Enrichment analysis (pathways) Regression analysis Principal Component Analysis

Paper 1


Goal

Identify and characterize genes associated with IBD (ulcerative colitis & Crohn’s disease)


Dataset


CASES CONTROLS

http://atlasofscience.org

Ulcerative Colitis

Crohn’s Disease

IBD

Workflow

Identify loci associated with the phenotype

Associate loci to genes

Annotate genes/loci


1. Identification Dataset


Genotype matrix (axes: Samples and SNPs) with entries 0, 1 or 2, each giving the number of minor alleles for a certain SNP

1. Identification Single test for association


QUESTION: do people carrying a certain genotype have an increased probability of having the disease?

Logistic regression

Other variables of interest, such as ethnicity, age and sex, can be added to the model as covariates
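A minimal sketch of the single-SNP test, assuming an additive genotype coding (0, 1 or 2 minor alleles) and one covariate (sex). The counts in `groups` are made up for illustration, and the Newton-Raphson fit is a plain textbook version rather than the software used in the study.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(rows, y, n_iter=25):
    """Fit logit P(disease) = b0 + b . x by Newton-Raphson; intercept is returned first."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))      # predicted disease probability
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Hypothetical counts: (genotype, sex, number of cases, number of controls)
groups = [(0, 0, 5, 20), (0, 1, 6, 19), (1, 0, 10, 15),
          (1, 1, 11, 14), (2, 0, 15, 10), (2, 1, 16, 9)]
rows, y = [], []
for genotype, sex, n_cases, n_controls in groups:
    rows += [(genotype, sex)] * (n_cases + n_controls)
    y += [1] * n_cases + [0] * n_controls

beta = fit_logistic(rows, y)
odds_ratio_per_allele = math.exp(beta[1])   # effect of one extra minor allele
```

The coefficient on the genotype column is the per-allele log odds ratio; a value above zero means carriers have a higher estimated probability of disease after adjusting for the covariate.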

1. Identification Multiple tests in the same cohort

Test all SNPs for association independently

Correct for multiple testing


Instead of using 0.05 as the significance threshold, divide it by the total number of independent tests (5×10⁻⁸ for genome-wide studies, which corresponds to about one million independent tests)
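The correction above is the Bonferroni adjustment. A short sketch, assuming one million independent tests; the SNP names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def genome_wide_hits(p_values, alpha=0.05, n_tests=1_000_000):
    """Return the SNPs whose p-value passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return sorted(snp for snp, p in p_values.items() if p < threshold)

# Hypothetical association p-values
p_values = {"snp_a": 3e-9, "snp_b": 1e-6, "snp_c": 0.04}
hits = genome_wide_hits(p_values)
```

Only `snp_a` passes here: `snp_c` would be significant at 0.05 in a single test but not after correcting for the number of SNPs tested.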

1. Identification Tests across cohorts


15 published GWAS 1.23 million imputed SNPs

ImmunoChip 196,524 SNPs

Meta-Analysis

1. Identification Population structure


Non European samples excluded from the analysis

Computed using all samples Computed using controls only

Clear population stratification in the Immunochip datasets. The first principal components were included in the logistic regression model to account for this.
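A rough sketch of how the first principal component can expose population structure, using power iteration on a centered genotype matrix. The toy genotypes below are invented so that two groups of samples differ in allele frequency; real analyses would use a dedicated PCA implementation and many more SNPs.

```python
import math

def first_pc_scores(genotypes, n_iter=200):
    """Score of each sample on the first principal component (power iteration)."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    v = [1.0 / (j + 1) for j in range(m)]          # arbitrary non-uniform start vector
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]

# Hypothetical genotypes (rows: samples, columns: SNPs); the first four samples
# and the last four come from populations with different allele frequencies.
toy_genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1], [2, 2, 1, 0],
    [0, 0, 2, 2], [0, 1, 2, 2], [1, 0, 2, 1], [0, 0, 1, 2],
]
pc1 = first_pc_scores(toy_genotypes)
```

In this toy example the sign of `pc1` separates the two groups, which is why adding the leading components as covariates to the association model can absorb ancestry differences.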

1. Identification Meta-analysis

Cohort 1 (cases, controls): association analysis (GWAS) gives a P value and an effect estimate for each SNP

Cohort 2 (cases, controls): association analysis (GWAS) gives a P value and an effect estimate for each SNP

Combine effect sizes (estimate)

1. Identification Meta analysis


GOAL: estimate the overall effect of a SNP across multiple cohorts

ISSUE: studies with larger sample sizes should be considered more reliable

SOLUTION: weight the effect size in each cohort by the inverse of its variance (roughly proportional to the sample size)
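The weighting scheme above corresponds to a fixed-effect, inverse-variance meta-analysis. A small sketch with hypothetical per-cohort estimates (log odds ratios) and standard errors:

```python
import math

def inverse_variance_meta(estimates, standard_errors):
    """Pool per-cohort effect sizes, weighting each by 1 / variance."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical cohorts: the first has the smaller standard error (larger sample)
pooled, pooled_se = inverse_variance_meta([0.2, 0.4], [0.1, 0.2])
z_score = pooled / pooled_se
```

With these numbers the weights are 100 and 25, so the pooled estimate (0.24) sits much closer to the estimate from the more precise cohort.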

1. Identification Results

193 statistically independent signals of association at genome-wide significance (p lower than 5×10⁻⁸) for UC, CD and IBD, for a total of 163 regions (71 not previously reported)

Explained variance increased from 8.2% to 13.6% for Crohn’s disease and from 4.1% to 7.5% for ulcerative colitis

2. Gene locus association


1. Most associated SNPs are in non-coding regions
2. Only 29 IBD-associated SNPs were in strong linkage disequilibrium with a missense variant

How do we link SNPs to genes?

2. Gene locus association


SNP to gene

Expression Quantitative Trait Loci (eQTL): is the SNP known to affect gene expression?

Gene Relationships Across Implicated Loci (GRAIL): are genes around the locus connected to other genes in other loci (text mining)?

Disease Association Protein-Protein Link Evaluator (DAPPLE): are genes around the loci involved in direct or indirect protein-protein interaction with genes in other loci?

3. Gene/locus annotation Overlap with published results


IMD: immune-mediated diseases; MSMD: Mendelian susceptibility to mycobacterial disease; PID: primary immunodeficiencies (including fungal and bacterial infections)

“Genes implicated in this overlap correlate with reduced levels of circulating T cells (ADA, CD40, TAP1, TAP2, NBN, BLM, DNMT3B) or specific subsets such as Th17, memory, or regulatory T cells”

3. Gene/locus annotation Selection analysis


Do IBD SNPs show signs of selective pressure?

1. Strong overlap with leprosy genes
2. Strong overlap with Mendelian susceptibility to mycobacterial disease
3. Infectious organisms known to be agents of natural selection

3. Gene/locus annotation Selection analysis

Directional selection: the allele frequency is being pushed to one extreme due to evolutionary pressure favouring the resulting genotype

Balancing selection: both alleles are maintained

3. Gene/locus annotation Gene Ontology enrichment


* Immune system processes (p = 3.5×10⁻²⁶)
* Regulation of cytokine production (p = 2.7×10⁻²⁴)
* Lymphocyte activation (p = 1.8×10⁻²³)
* Response to molecules of bacterial origin (p = 2.4×10⁻²⁰)


3. Gene/locus annotation Cell type specificity


Notably several of these cell types express genes near our IBD associations much more specifically when stimulated

1. Use cell-specific expression data to rate how specific each gene is to a cell type
2. Identify genes in the SNP region
3. Score SNPs based on the percentile of the most specifically expressed gene in the region (for each cell type)
4. Score each cell type by taking the log average of the cell type score across all IBD SNPs
5. Choose random sets of SNPs to assess statistical significance
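The five steps can be sketched as follows. This is only one reading of the procedure: it assumes specificity is stored as a percentile in (0, 1] where lower means more specific, interprets the "log average" as the mean of -log(percentile), and uses invented gene names, regions and percentiles.

```python
import math
import random

def snp_score(genes_in_region, percentile, cell_type):
    """Step 3: percentile of the most specifically expressed gene in the region."""
    return min(percentile[g][cell_type] for g in genes_in_region)

def cell_type_score(snp_regions, percentile, cell_type):
    """Step 4 (assumed form): mean of -log(percentile) across SNPs; larger is more specific."""
    scores = [snp_score(genes, percentile, cell_type) for genes in snp_regions]
    return sum(-math.log(s) for s in scores) / len(scores)

def empirical_p(disease_regions, background_regions, percentile, cell_type,
                n_perm=1000, seed=0):
    """Step 5: compare the observed score with scores of random SNP sets."""
    rng = random.Random(seed)
    observed = cell_type_score(disease_regions, percentile, cell_type)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(background_regions, len(disease_regions))
        if cell_type_score(sample, percentile, cell_type) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical specificity percentiles (step 1) and genes per SNP region (step 2)
percentile = {
    "gene_a": {"t_cell": 0.01}, "gene_b": {"t_cell": 0.50}, "gene_c": {"t_cell": 0.02},
    "gene_d": {"t_cell": 0.90}, "gene_e": {"t_cell": 0.80}, "gene_f": {"t_cell": 0.70},
}
disease_regions = [["gene_a", "gene_b"], ["gene_c"]]
background_regions = disease_regions + [["gene_d"], ["gene_e"], ["gene_f"], ["gene_b", "gene_d"]]
p_value = empirical_p(disease_regions, background_regions, percentile, "t_cell")
```

Adding 1 to both the numerator and the denominator keeps the empirical p-value strictly above zero, which is a common convention for permutation tests.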

Take home messages

Multiple GWAS studies can be combined to identify stable associations across multiple cohorts

Differences between and within cohorts need to be taken into account (Principal Component Analysis)

Testing millions of SNPs implies looking for very low p-values because of multiple testing (large sample sizes are required)

Many GWAS hits are in non-coding regions

Genetics explains only a portion of the total variance

Multiple strategies exist to validate and characterize identified genes: overlap with known diseases, GO enrichment, cell type specificity, selection analysis


Paper 2


Goal

Characterize the phenotypic and functional immune response to surgical trauma


Dataset


Whole-blood

Immune cells

Hip replacement surgery; timeline labels: -1h, 1h, 24h, 72h, 6wks

26 patients who underwent hip surgery

Single-cell proteomics (21 cell surface proteins, phosphoepitopes of 10 intracellular proteins) as measured by CyTOF

Workflow

Identify cell populations in all samples

Compare cell population frequencies and signaling profiles before and after surgery

Identify population features that can predict recovery from surgery


1. Identification of cell populations Manual gating

Identification of cell populations using hierarchical 2D scatter plots

13 distinct cell populations were identified using the available markers


1. Identification of cell populations Hierarchical clustering


1. Assign each item to its own cluster
2. Find the closest (most similar) pair of clusters and merge them into a single cluster
3. Compute distances (similarities) between the new cluster and each of the old clusters
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N

Item: single cell
Cluster: population of phenotypically similar cells
Distance: Euclidean distance based on the expression of all markers (proteins)
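A compact sketch of the agglomerative procedure. The slides do not say how the distance between clusters is defined, so average linkage is assumed here; the marker values are invented, and the loop stops at a requested number of clusters instead of building the full tree.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two cells described by their marker expression."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, items):
    """Average-linkage distance between two clusters (lists of item indices)."""
    total = sum(euclidean(items[i], items[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(items, n_clusters=1):
    clusters = [[i] for i in range(len(items))]            # step 1: one cluster per item
    while len(clusters) > n_clusters:                       # step 4: repeat
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], items))
        merged = clusters[a] + clusters[b]                  # step 2: merge the closest pair
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        # step 3: distances to the new cluster are recomputed on the next pass
    return clusters

# Hypothetical cells measured on two markers
cells = [(0.1, 0.2), (0.0, 0.3), (5.0, 5.1), (5.2, 4.9), (0.2, 0.1)]
populations = agglomerative(cells, n_clusters=2)
```

Recomputing every pairwise distance on each pass is simple but slow; it is adequate for a handful of cells, whereas real CyTOF data would call for an optimized library routine.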

2. Identification of cell populations Automated gating


Unsupervised hierarchical clustering

Distance based on all surface markers

2. Comparison Cell frequencies (manual gating)


Axes: time; frequency fold change from baseline

2. Comparison Cell frequencies (automated gating)


2. Comparison Cell frequencies (automated gating)


3. Comparison Signaling responses


Intensity fold change from baseline

samples

Signaling molecules



Early and concurrent activation of major signaling pathways in innate and adaptive immune cell compartments

Parenthesis Signaling responses


Correlation between proteins can be computed within or between time points



Signaling correlation network in a specific cell subset

Clustering of the correlation matrix

4. Prediction Outcome measurements


4. Prediction Demographic and clinical variables


QUESTION: do demographic and clinical variables correlate with recovery parameters?

Six parameters were tested: sex, age, body mass index (BMI), type of anesthesia, duration of surgery, and estimated blood loss

Only sex was significantly related to a clinical recovery parameter (postoperative fatigue)

4. Prediction Cell frequencies and signaling responses


QUESTION: do cell frequencies or signaling responses correlate with recovery parameters?

4. Prediction Cell frequencies and signaling responses


These correlations remained significant and unchanged when accounting for potential confounders (including sex, age, body mass index (BMI), type of anesthesia, duration of surgery, and estimated blood loss)

Cluster C

Cluster B

Cluster A

Take home messages


Manual gating

Automated gating

Samples gated separately

Samples pooled together

Comparison between groups

Correlation with clinical phenotypes

Cell populations

Percentages

Marker median intensity

Multi-dimensional single-cell CyTOF dataset

Paper 3


Goal

Reconstruct gene-regulatory networks in late-onset Alzheimer’s disease (LOAD)


Dataset


Gene A Gene B Gene A Gene B

LOAD 376 patients

Healthy 173 individuals

Dorsolateral prefrontal cortex

Visual cortex

Cerebellum

39,579 expression traits

Workflow

Infer gene-regulatory networks in LOAD and HD

Compare networks

Rank-order networks for relevance to LOAD pathology


1. Network inference


Using only top variable genes

1. Network inference


Top variable genes

Correlation Matrix

Adjacency Matrix

Overlap Matrix

Hierarchical Clustering

Cut to identify modules

111 modules for LOAD, 89 modules for HD


overlap = 0: genes i and j are unlinked and do not share any neighbors
overlap = 1: genes i and j are linked and share all neighbors

adjacency = |r(i,j)|^β

Pairwise Pearson correlation r(i,j)
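A small sketch of the correlation, adjacency and overlap steps. The exponent β = 6 and the expression values are assumptions made for illustration, and the overlap uses the standard weighted topological overlap formula, which reproduces the two limiting cases stated above.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def adjacency_matrix(expression, beta=6):
    """adjacency(i, j) = |r(i, j)| ** beta; expression has one row per gene."""
    n = len(expression)
    return [[abs(pearson(expression[i], expression[j])) ** beta for j in range(n)]
            for i in range(n)]

def topological_overlap(adj):
    """Overlap of each gene pair: own link plus shared neighbors, normalized."""
    n = len(adj)
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]   # connectivity
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
                tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom

# Hypothetical expression values (rows: genes, columns: samples)
expression = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
adjacency = adjacency_matrix(expression, beta=6)
overlap = topological_overlap(adjacency)
```

Raising the correlation to a power shrinks weak correlations toward zero while keeping strong ones, so the resulting network emphasizes tightly co-expressed genes; one minus the overlap can then serve as the distance for hierarchical clustering.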

1. Network comparison


Axes: genes that are part of the module

Color represents the topological overlap (connectivity) between each pair of genes

Top right is the connectivity in LOAD; bottom left is the connectivity in Normal

Number is the gain (>1) or loss (<1) of connectivity

1. Network comparison


Gain in connectivity (GOC)

Loss in connectivity (LOC)

54% of all modules showed GOC

4.5% of all modules showed LOC

1. Network characterization


Canonical pathways and biological processes enrichment (Fisher’s test)
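For enrichment, the one-sided Fisher's exact test reduces to a hypergeometric tail probability. A sketch with made-up set sizes; the function name and its arguments are illustrative only.

```python
from math import comb

def enrichment_p(module_size, pathway_size, overlap, background_size):
    """P(at least `overlap` pathway genes in a random module of the same size)."""
    total = comb(background_size, module_size)
    upper = min(module_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical sizes: a 50-gene module sharing 12 genes with a 200-gene pathway,
# drawn from a background of 10,000 genes
p_enrichment = enrichment_p(module_size=50, pathway_size=200, overlap=12,
                            background_size=10_000)
```

By chance only about one pathway gene would be expected in such a module, so an overlap of 12 gives a very small p-value. When many pathways are tested, the resulting p-values still need a multiple-testing correction.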



LOAD module

Neuropathologic trait

1. Identify a module by the first principal component (PC1) of all its genes

2. Correlate PC1 with neuropathology traits (Pearson correlation)
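The two steps above can be sketched with power iteration for PC1 followed by a Pearson correlation. The expression values and trait scores are hypothetical, and genes are centered but not scaled, which is a simplification.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def module_pc1(expression, n_iter=200):
    """Per-sample score on the first principal component of a module.

    expression: one row per sample, one column per gene in the module.
    """
    n, m = len(expression), len(expression[0])
    means = [sum(row[j] for row in expression) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in expression]
    v = [1.0 / (j + 1) for j in range(m)]          # arbitrary start vector
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]

# Hypothetical module of three genes measured in five samples
module_expression = [
    [1.0, 2.1, -0.9], [2.1, 3.9, -2.0], [2.9, 6.0, -3.1],
    [4.0, 8.1, -3.9], [5.1, 9.9, -5.0],
]
trait = [1, 2, 3, 4, 5]                            # hypothetical neuropathology scores
pc1 = module_pc1(module_expression)
r = pearson(pc1, trait)
```

The sign of a principal component is arbitrary, so the magnitude of `r` is what indicates how strongly the module tracks the trait.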



AT: atrophy; WMAT: white matter atrophy; EL: enlargement



Significance of enrichment Number of correlated neuropathology traits

Module ranking

Take home messages

Gene expression can be used to build modules of correlated genes

These modules can be built independently for LOAD and HD and compared to identify loss or gain of connectivity

Modules differentially connected in LOAD versus HD were enriched for multiple functional categories

Modules differentially connected in LOAD versus HD were correlated with neuropathological traits associated with LOAD

Modules can be ranked based on both enrichment and association with relevant traits


The paper contains many more analyses that have not been described in this presentation

Paper 4


Goal

Identify possible drug therapies for IBD using publicly available datasets


Dataset


IBD Healthy

http://atlasofscience.org

Gene A Gene B Gene A Gene B Publicly available dataset

Gene signature

Differentially expressed genes

Workflow

Identify differentially expressed genes (DEG) in IBD compared to controls

Compare this signature with publicly available drug signatures

Identify inversely correlated signatures

Validate top hits in vivo


2. Comparison with drug signatures


The derived IBD upregulated and downregulated genes.

164 drugs



Figure labels: interesting candidate; known treatment

A negative score indicates that the drug induces an expression pattern opposite to that of the disease
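To illustrate the idea of a negative score, the sketch below ranks drugs by the Pearson correlation between their expression changes and the disease signature. This is a simplified stand-in, not the connectivity-map scoring used in the study, and the gene names, drug names and fold changes are all hypothetical.

```python
import math

def signature_correlation(disease, drug):
    """Pearson correlation of fold changes over the genes present in both signatures."""
    genes = sorted(set(disease) & set(drug))
    x = [disease[g] for g in genes]
    y = [drug[g] for g in genes]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_drugs(disease, drug_signatures):
    """Most negative score first: drugs whose profile opposes the disease."""
    scores = {name: signature_correlation(disease, sig)
              for name, sig in drug_signatures.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical log fold changes (disease versus control, treated versus untreated)
disease_signature = {"gene_a": 2.0, "gene_b": 1.0, "gene_c": -1.5, "gene_d": -0.5}
drug_signatures = {
    "drug_x": {"gene_a": -2.0, "gene_b": -1.0, "gene_c": 1.5, "gene_d": 0.5},
    "drug_y": {"gene_a": 1.0, "gene_b": 0.5, "gene_c": -0.75, "gene_d": -0.25},
}
ranking = rank_drugs(disease_signature, drug_signatures)
```

Here `drug_x` reverses every disease change and ranks first, while `drug_y` mimics the disease and ranks last.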



Gene ontology enrichment (Fisher’s test)

Negative correlation between topiramate gene signature and IBD signature

The correlation is not very strong

3. Validation


Vehicle only

TNBS + vehicle

TNBS + prednisolone

TNBS + topiramate

Topiramate reduces IBD symptoms

3. Validation


8 genes were randomly selected for qPCR validation

Only 2 genes were differentially expressed between treatment groups

3. Validation


Topiramate works even better than the approved drug prednisolone

Take home messages

Gene signatures of diseases can be computed by looking at differentially expressed genes between cases and controls

Disease gene signatures can be compared to drug profiles (connectivity maps)

Drugs whose signatures are anti-correlated with disease gene signatures could potentially be used as treatments

Hypotheses need to be validated in in vivo models


Computational Immunology Focus areas


Visualization
Get a preliminary feeling of the data: outliers, population stratification
Summarize results

Dimensionality reduction/Aggregation
Reduce the burden of statistical testing
Make results more interpretable
Make results easier to visualize

Association with phenotype
Link molecular/clinical features with phenotypes: disease status, gene signatures, quantitative clinical parameters

Statistical validation (not widely covered in this course)
Compare results in real datasets with results in randomized datasets (internal)
Provide support for results using available resources (external)

Computational Immunology Examples


Visualization: scatter plots, heatmaps, hierarchical trees, barplots/boxplots, Venn diagrams

Dimensionality reduction/Aggregation: Principal Component Analysis, gene modules, protein modules

Association with phenotype: tests for difference in mean, linear regression, logistic regression, correlation between gene signatures

Statistical validation (not widely covered in this course): data randomization (internal), overlap of results with other published results (external), validation of results in independent cohorts (external)

Acknowledgments

BRC Bioinformatics Team

Emanuele de Rinaldis

Alan Todd

Venu Pullabathla

Helen Alexander

Jonathan Smith

Prodromos Chatzikyriakou

Dr. Filipe Gracio

