Scientific scenarios
Genome-wide association study of an immune disease (IBD)
Immune signatures of recovery after surgery (hip replacement)
Drug repositioning for an immune disease (IBD)
Network analysis of a brain disease (Alzheimer’s Disease)
Scientific scenarios
Genome-wide association studies; covariates; meta-analysis; population stratification; Principal Component Analysis; enrichment analysis (pathways, Gene Ontology, published datasets)
CyTOF; hierarchical clustering; test for differences in mean; correlation analysis; prediction of clinical outcomes; regression analysis; covariates
Gene expression; connectivity maps; enrichment analysis (Gene Ontology); test for differences in mean
Gene expression; coexpression analysis; gene networks; hierarchical clustering; gene signatures; enrichment analysis (pathways); regression analysis; Principal Component Analysis
Paper 1
Goal
Identify and characterize genes associated with IBD (ulcerative colitis & Crohn’s disease)
Dataset
CASES CONTROLS
http://atlasofscience.org
Ulcerative Colitis
Crohn’s Disease
IBD
Workflow
Identify loci associated with the phenotype
Associate loci to genes
Annotate genes/loci
1. Identification Dataset
[Figure: genotype matrix of samples by SNPs; each entry (0, 1 or 2) is the number of minor alleles for a certain SNP]
1. Identification Single test for association
QUESTION: do people carrying a certain genotype have an increased probability of having the disease?
Logistic regression
Other variables of interest, such as ethnicity, age and sex, can be added to the model as covariates
1. Identification Multiple tests in the same cohort
Test all SNPs for association independently
Correct for multiple testing
Instead of using 0.05 as the significance threshold, divide it by the total number of independent tests (giving 5×10^-8 for genome-wide studies)
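The arithmetic behind the genome-wide threshold can be sketched as follows. The figure of one million independent tests is an assumption (it is the number that reproduces 5×10^-8 from 0.05), and the SNP names and p-values are hypothetical.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

# Assumption: roughly one million independent tests across the genome
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical association results
p_values = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.04}
significant = sorted(snp for snp, p in p_values.items() if p < genome_wide)
print(genome_wide, significant)
```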
1. Identification Tests across cohorts
15 published GWAS: 1.23 million imputed SNPs
ImmunoChip: 196,524 SNPs
Meta-Analysis
1. Identification Population structure
Non-European samples excluded from the analysis
[Figure: principal components computed using all samples and computed using controls only]
Clear population stratification in the Immunochip datasets
First principal components included in the logistic regression model to account for this
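As an illustration of how principal components can expose population structure, the sketch below simulates two hypothetical populations with different allele frequencies and computes the top components with NumPy. The simulated frequencies, sample sizes and the function name are assumptions made for the example only.

```python
import numpy as np

def top_principal_components(genotypes, n_components=2):
    """Project samples onto the top principal components.

    genotypes: samples x SNPs array of minor-allele counts (0, 1 or 2).
    """
    centered = genotypes - genotypes.mean(axis=0)  # center each SNP
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical data: 20 samples per population, 200 SNPs,
# minor-allele frequency 0.1 in one population and 0.6 in the other
rng = np.random.default_rng(0)
pop_a = rng.binomial(2, 0.1, size=(20, 200))
pop_b = rng.binomial(2, 0.6, size=(20, 200))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

pcs = top_principal_components(genotypes)
print(pcs[:3, 0], pcs[-3:, 0])  # the two populations fall on opposite sides of PC1
```

The columns of `pcs` are what would be added as covariates to the per-SNP regression model.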
1. Identification Meta-analysis
Cohort 1 (cases, controls) -> association analysis (GWAS) -> per-SNP P value and effect estimate
Cohort 2 (cases, controls) -> association analysis (GWAS) -> per-SNP P value and effect estimate
Combine effect sizes (estimate)
1. Identification Meta analysis
GOAL: estimate the overall effect of a SNP across multiple cohorts
ISSUE: studies with larger sample sizes should be considered more reliable
SOLUTION: weight the effect size in each cohort by the inverse of its variance (this weight is roughly proportional to the sample size)
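A minimal sketch of inverse-variance weighting, assuming a fixed-effect model in which every cohort estimates the same underlying effect. The per-cohort estimates and standard errors below are hypothetical.

```python
import math

def inverse_variance_meta(estimates, standard_errors):
    """Fixed-effect meta-analysis: pooled estimate, its standard error, two-sided p-value."""
    weights = [1.0 / se ** 2 for se in standard_errors]  # weight = 1 / variance
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value

# Hypothetical log odds ratios for one SNP in two cohorts;
# the second cohort is larger, hence its smaller standard error
pooled, pooled_se, p_value = inverse_variance_meta([0.20, 0.10], [0.10, 0.05])
print(pooled, pooled_se, p_value)
```

With weights of 100 and 400, the pooled estimate sits closer to the larger cohort's value.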
1. Identification Results
193 statistically independent signals of association at genome-wide significance (p lower than 5×10^-8) across UC, CD and IBD, for a total of 163 regions (71 not previously reported)
Explained variance increased from 8.2% to 13.6% for Crohn’s disease and from 4.1% to 7.5% for ulcerative colitis
2. Gene locus association
1. Most associated SNPs are in non-coding regions
2. Only 29 IBD-associated SNPs were in strong linkage disequilibrium with a missense variant
How do we link SNPs to genes?
2. Gene locus association
SNP -> Gene
Expression Quantitative Trait Loci (eQTL): is the SNP known to affect gene expression?
Gene Relationships Across Implicated Loci (GRAIL): are genes around the locus connected to other genes in other loci (text-mining)?
Disease Association Protein-Protein Link Evaluator (DAPPLE): are genes around the loci involved in direct or indirect protein-protein interaction with genes in other loci?
3. Gene/locus annotation Overlap with published results
IMD: immune-mediated diseases
MSMD: Mendelian susceptibility to mycobacterial disease
PID: primary immunodeficiencies (including fungal and bacterial infections)
“Genes implicated in this overlap correlate with reduced levels of circulating T cells (ADA, CD40, TAP1, TAP2, NBN, BLM, DNMT3B) or specific subsets such as Th17, memory, or regulatory T cells”
3. Gene/locus annotation Selection analysis
Do IBD SNPs show signs of selective pressure?
1. Strong overlap with leprosy genes
2. Strong overlap with Mendelian susceptibility to mycobacterial disease
3. Infectious organisms are known to be agents of natural selection
3. Gene/locus annotation Selection analysis
Directional selection: the allele frequency is being pushed to one extreme due to evolutionary pressure favouring the resulting genotype
Balancing selection: both alleles are maintained
3. Gene/locus annotation Gene Ontology enrichment
* Immune system processes (p = 3.5×10^-26)
* Regulation of cytokine production (p = 2.7×10^-24)
* Lymphocyte activation (p = 1.8×10^-23)
* Response to molecules of bacterial origin (p = 2.4×10^-20)
3. Gene/locus annotation Cell type specificity
Notably several of these cell types express genes near our IBD associations much more specifically when stimulated
1. Use cell-specific expression data to rate how specific each gene is to a cell type
2. Identify genes in the SNP region
3. Score SNPs based on the percentile of the most specifically expressed gene in the region (for each cell type)
4. Score each cell type by taking the log average of the cell type score across all IBD SNPs
5. Choose random sets of SNPs to assess statistical significance
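The five steps above can be sketched for a single cell type as follows. This is an illustrative reading, not the paper's implementation: it assumes that a smaller percentile means more specific expression, that the "log average" is the mean of -log(percentile), and that random SNP sets are drawn uniformly from a background list. All names and data are hypothetical.

```python
import math
import random

def snp_score(genes_in_region, percentile):
    """Step 3: percentile of the most specifically expressed gene in the region."""
    return min(percentile[g] for g in genes_in_region)

def cell_type_score(snps, snp_to_genes, percentile):
    """Step 4: average of -log(SNP score) over a set of SNPs (assumed 'log average')."""
    scores = [snp_score(snp_to_genes[s], percentile) for s in snps]
    return sum(-math.log(s) for s in scores) / len(scores)

def empirical_p_value(disease_snps, background_snps, snp_to_genes, percentile,
                      n_random=1000, seed=0):
    """Step 5: fraction of random SNP sets scoring at least as high as the disease set."""
    rng = random.Random(seed)
    observed = cell_type_score(disease_snps, snp_to_genes, percentile)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(background_snps, len(disease_snps))
        if cell_type_score(sample, snp_to_genes, percentile) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)

# Hypothetical data: 50 SNPs, one gene per region, specificity percentiles in (0, 1]
background_snps = [f"snp{i}" for i in range(50)]
snp_to_genes = {f"snp{i}": [f"gene{i}"] for i in range(50)}
percentile = {f"gene{i}": (i + 1) / 50 for i in range(50)}  # steps 1 and 2, precomputed
disease_snps = background_snps[:5]  # the SNPs next to the most specific genes

p_value = empirical_p_value(disease_snps, background_snps, snp_to_genes, percentile)
print(p_value)
```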
Take home messages
Multiple GWAS can be combined to identify stable associations across multiple cohorts
Differences between and within cohorts need to be taken into account (Principal Component Analysis)
Testing millions of SNPs implies looking for very low p-values because of multiple testing (large sample sizes are required)
Many GWAS hits are in non-coding regions
Genetics only explains a portion of the total variance
Multiple strategies exist to validate and characterize identified genes:
* Overlap with known diseases
* GO enrichment
* Cell type specificity
* Selection analysis
Paper 2
Goal
Characterize the phenotypic and functional immune response to surgical trauma
Dataset
26 patients who underwent hip replacement surgery
Whole-blood immune cells sampled at -1h, 1h, 24h, 72h and 6 weeks relative to surgery
Single-cell proteomics (21 cell surface proteins, phosphoepitopes of 10 intracellular proteins) as measured by CyTOF
Workflow
Identify cell populations in all samples
Compare cell population frequencies and signaling profiles before and after surgery
Identify population features that can predict recovery from surgery
1. Identification of cell populations Manual gating
Identification of cell populations using hierarchical 2D scatter plots
13 distinct cell populations were identified using the available markers
1. Identification of cell population Hierarchical clustering
1. Assign each item to its own cluster
2. Find the closest (most similar) pair of clusters and merge them into a single cluster
3. Compute distances (similarities) between the new cluster and each of the old clusters
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N
Item: single cell
Cluster: population of phenotypically similar cells
Distance: Euclidean distance based on the expression of all markers (proteins)
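The four steps can be sketched directly in code. The slides do not say how the distance between two clusters is defined, so this example assumes average linkage; it also stops at a requested number of clusters instead of always merging down to one. The marker values are hypothetical, and the naive search is only practical for small inputs.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two marker-expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between the members of two clusters (assumed linkage)."""
    dists = [euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b]
    return sum(dists) / len(dists)

def agglomerative_clustering(points, n_clusters=1):
    clusters = [[i] for i in range(len(points))]  # step 1: one cluster per item
    merges = []
    while len(clusters) > n_clusters:  # step 4: repeat until few enough clusters remain
        # Step 2: find the closest pair of clusters
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        # Step 3: distances to the new cluster are recomputed on the next pass
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical cells measured on two markers, forming two obvious populations
cells = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (4.9, 5.0)]
clusters, merges = agglomerative_clustering(cells, n_clusters=2)
populations = sorted(sorted(c) for c in clusters)
print(populations)
```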
2. Identification of cell populations Automated gating
Unsupervised hierarchical clustering
Distance based on all surface markers
2. Comparison Cell frequencies (manual gating)
[Figure: frequency fold change from baseline over time]
2. Comparison Cell frequencies (automated gating)
3. Comparison Signaling responses
[Figure: intensity fold change from baseline, by sample and signaling molecule]
3. Comparison Signaling responses
Early and concurrent activation of major signaling pathways in innate and adaptive immune cell compartments
Parenthesis Signaling responses
Correlation between proteins can be computed within or between time points
3. Comparison Signaling responses
Signaling correlation network in a specific cell subset
Clustering of the correlation matrix
4. Prediction Outcome measurements
4. Prediction Demographic and clinical variables
QUESTION: do demographic and clinical variables correlate with recovery parameters?
Six parameters were tested: sex, age, body mass index (BMI), type of anesthesia, duration of surgery, and estimated blood loss
Only sex was significantly related to a clinical recovery parameter (postoperative fatigue)
4. Prediction Cell frequencies and signaling responses
QUESTION: do cell frequencies or signaling responses correlate with recovery parameters?
4. Prediction Cell frequencies and signaling responses
These correlations remained significant and unchanged when accounting for potential confounders (including sex, age, body mass index (BMI), type of anesthesia, duration of surgery, and estimated blood loss)
[Figure panels: Cluster A, Cluster B, Cluster C]
Take home messages
Manual gating
Automated gating
Samples gated separately
Samples pooled together
Comparison between groups
Correlation with clinical phenotypes
Cell populations
Percentages
Marker median intensity
Multi-dimensional single-cell CyTOF dataset
Paper 3
Goal
Reconstruct gene-regulatory networks in late-onset Alzheimer’s disease (LOAD)
Dataset
LOAD 376 patients
Healthy 173 individuals
Dorsolateral prefrontal cortex
Visual cortex
Cerebellum
39,579 expression traits
Workflow
Infer gene-regulatory networks in LOAD and HD
Compare networks
Rank-order networks for relevance to LOAD pathology
1. Network inference
Using only top variable genes
1. Network inference
Top variable genes
Correlation Matrix
Adjacency Matrix
Overlap Matrix
Hierarchical Clustering
Cut to identify modules
111 modules for LOAD, 89 modules for HD
Pairwise Pearson correlation r(i,j)
adjacency = |r(i,j)|^β
overlap = 0: genes i and j are unlinked and do not share any neighbor
overlap = 1: genes i and j are linked and share all neighbors
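The correlation, adjacency and overlap matrices can be sketched with NumPy as below. The slides do not give the overlap formula, so this example assumes the topological overlap measure commonly used in weighted coexpression network analysis, TOM(i,j) = (sum_u a(i,u) a(u,j) + a(i,j)) / (min(k_i, k_j) + 1 - a(i,j)), where k_i is the connectivity of gene i. The value β = 6 and the simulated expression data are also assumptions.

```python
import numpy as np

def adjacency_and_overlap(expr, beta=6):
    """expr: samples x genes expression matrix. Returns (adjacency, topological overlap)."""
    r = np.corrcoef(expr, rowvar=False)      # pairwise Pearson correlation r(i, j)
    adj = np.abs(r) ** beta                  # adjacency = |r(i, j)|^beta
    np.fill_diagonal(adj, 0.0)               # no self-links
    k = adj.sum(axis=1)                      # connectivity of each gene
    shared = adj @ adj                       # strength of shared neighbours
    tom = (shared + adj) / (np.minimum.outer(k, k) + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return adj, tom

# Hypothetical data: genes 0 and 1 follow a common signal, genes 2 to 4 are noise
rng = np.random.default_rng(0)
base = rng.normal(size=30)
expr = np.column_stack([
    base + 0.1 * rng.normal(size=30),
    base + 0.1 * rng.normal(size=30),
    rng.normal(size=30),
    rng.normal(size=30),
    rng.normal(size=30),
])
adj, tom = adjacency_and_overlap(expr)
print(np.round(tom, 2))
```

Hierarchical clustering would then be run on a dissimilarity such as 1 - overlap, and the tree cut to define modules.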
1. Network comparison
[Figure: heatmap over the genes that are part of the module (both axes)]
Color represents the topological overlap between each pair of genes (connectivity)
Top right is the connectivity in LOAD
Bottom left is the connectivity in Normal
Number is the gain (>1) or loss (<1) of connectivity
1. Network comparison
Gain in connectivity (GOC)
Loss in connectivity (LOC)
54% of all modules showed GOC
4.5% of all modules showed LOC
1. Network characterization
Canonical pathways and biological processes enrichment (Fisher’s test)
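An enrichment test of this kind can be sketched as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. The gene counts below are hypothetical, and the choice of gene universe is an assumption that matters in practice.

```python
from math import comb, isclose

def enrichment_p_value(n_universe, n_pathway, n_module, n_overlap):
    """P(overlap >= n_overlap) when n_module genes are drawn at random from the universe."""
    upper = min(n_pathway, n_module)
    favourable = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_module)

# Hypothetical example: 20 genes measured, 5 in the pathway,
# 5 in the module, 3 of the module genes belong to the pathway
p_value = enrichment_p_value(n_universe=20, n_pathway=5, n_module=5, n_overlap=3)
print(p_value)
```

When many pathways are tested per module, the resulting p-values need a multiple-testing correction.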
1. Network characterization
LOAD module
Neuropathologic trait
1. Represent each module by the first principal component (PC1) of all its genes
2. Correlate PC1 with neuropathology traits (Pearson correlation)
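The two steps can be sketched with NumPy: PC1 is obtained from a singular value decomposition of the centered module expression matrix and then correlated with a trait. The simulated module, the trait and the function names are hypothetical.

```python
import numpy as np

def module_pc1(expr):
    """First principal component scores of a samples x genes module matrix."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

def pearson(x, y):
    """Pearson correlation between two vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical module of 8 genes whose expression tracks a neuropathology trait
rng = np.random.default_rng(1)
trait = rng.normal(size=50)
expr = trait[:, None] * np.ones(8) + 0.3 * rng.normal(size=(50, 8))

pc1 = module_pc1(expr)
r = pearson(pc1, trait)
print(r)  # the sign of a principal component is arbitrary, so |r| is what matters
```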
1. Network characterization
AT: atrophy
WMAT: white matter atrophy
EL: enlargement
1. Network characterization
Module ranking
Significance of enrichment
Number of correlated neuropathology traits
Take home messages
Gene expression can be used to build modules of correlated genes
These modules can be built independently for LOAD and HD and compared to identify loss or gain of connectivity
Modules differentially connected in LOAD versus HD were enriched for multiple functional categories
Modules differentially connected in LOAD versus HD were correlated with neuropathological traits associated with LOAD
Modules can be ranked based on both enrichment and association with relevant traits
The paper contains many more analyses that have not been described in this presentation
Paper 4
Goal
Identify possible drug therapies for IBD using publicly available datasets
Dataset
IBD Healthy
http://atlasofscience.org
Publicly available dataset
Gene signature
Differentially expressed genes
Workflow
Identify differentially expressed genes (DEG) in IBD compared to controls
Compare this signature with publicly available drug signatures
Identify inversely correlated signatures
Validate top hits in vivo
2. Comparison with drug signatures
The derived IBD upregulated and downregulated genes.
164 drugs
2. Comparison with drug signatures
[Figure annotations: interesting candidate, known treatment]
A negative score indicates that the drug exhibits an expression pattern that is opposite to that of the disease
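To make the sign convention concrete, here is a deliberately simplified score. It is not the connectivity-map statistic used in the study: it assumes only a drug profile ranked from most up-regulated to most down-regulated gene, and compares where the disease up- and down-regulated genes fall in that ranking. The gene names and function name are hypothetical.

```python
from math import isclose

def reversal_score(drug_ranking, disease_up, disease_down):
    """Simplified score in [-1, 1]; negative means the drug opposes the disease signature.

    drug_ranking: genes ordered from most up-regulated to most down-regulated by the drug.
    """
    n = len(drug_ranking)
    # Map each rank to a position from +1 (most up-regulated) to -1 (most down-regulated)
    position = {gene: 1.0 - 2.0 * i / (n - 1) for i, gene in enumerate(drug_ranking)}
    up = [position[g] for g in disease_up if g in position]
    down = [position[g] for g in disease_down if g in position]
    return (sum(up) / len(up) - sum(down) / len(down)) / 2.0

# Hypothetical drug profile and disease signature
drug_ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
score = reversal_score(drug_ranking, disease_up=["g5", "g6"], disease_down=["g1", "g2"])
print(score)
```

Here the drug down-regulates the genes that are up in disease and up-regulates those that are down, so the score is negative.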
2. Comparison with drug signatures
Gene Ontology enrichment (Fisher’s test)
Negative correlation between topiramate gene signature and IBD signature
The correlation is not very strong
3. Validation
Vehicle only
TNBS + vehicle
TNBS + prednisolone
TNBS + topiramate
Topiramate reduces IBD symptoms
3. Validation
8 genes were randomly selected for qPCR validation
Only 2 genes were differentially expressed between treatment groups
3. Validation
Topiramate works even better than the approved drug prednisolone
Take home messages
Gene signatures of diseases can be computed by looking at differentially expressed genes between cases and controls
Disease gene signatures can be compared to drug profiles (connectivity maps)
Drug signatures anti-correlated to disease gene signatures could potentially be used as treatments
Hypotheses need to be validated in in vivo models
Computational Immunology Focus areas
Visualization
* Get a preliminary feeling of the data
* Outliers
* Population stratification
* Summarize results
Dimensionality reduction/Aggregation
* Reduce the burden of statistical testing
* Make results more interpretable
* Make results easier to visualize
Association with phenotype
* Link molecular/clinical features with phenotypes
* Disease status
* Gene signatures
* Quantitative clinical parameters
Statistical validation (not widely covered in this course)
* Compare results in real datasets with results in randomized datasets (internal)
* Provide support for results using available resources (external)
Computational Immunology Examples
Visualization
* Scatter plots
* Heatmaps
* Hierarchical trees
* Barplots/Boxplots
* Venn diagram
Dimensionality reduction/Aggregation
* Principal Component Analysis
* Gene modules
* Protein modules
Association with phenotype
* Tests for difference in mean
* Linear regression
* Logistic regression
* Correlation between gene signatures
Statistical validation (not widely covered in this course)
* Data randomization (internal)
* Overlap of results with other published results (external)
* Validation of results in independent cohorts (external)
Acknowledgments
BRC Bioinformatics Team
Emanuele de Rinaldis
Alan Todd
Venu Pullabathla
Helen Alexander
Jonathan Smith
Prodromos Chatzikyriakou
Dr. Filipe Gracio