GeneExpressionDataAnalysis - University of...

Preview:

Citation preview

Gene Expression Data Analysis

Qin Ma, Ph.D.December 10, 2017

1

Bioinformatics

• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.

– Temple Smith, Current Topics in Computational Molecular Biology (2002)

2

Bioinformatics

GenomicsTranscriptomicsMetabolomicsMetagenomicsEpigenomicsProteomics

Interactomics…

Omics data

Systems biology

Characteristics of Biological Big Data

Big Small Data

v.s.

Small Big Data

3

• 36.8 million transactions per day on Amazon

Next Generation Sequencing Data

• Biomedical Data (behavioral outcomes in observational study)

The Hierarchical Structure of Computational Techniques

4

Models

Algorithms

Programs Tools Software

Central Dogma

� DNA à RNA à Protein

Intro to gene expression (central dogma). (n.d.). Retrieved November 05, 2017, from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/central-dogma-transcription/a/intro-to-gene-expression-central-dogma

5/46

6

Information derivable from gene expression data

Inference: genes with similar expression patterns might be functionally related, e.g., working in the same pathway or co-regulated

co-expression -> co-regulation

Inference: genes x, y are highly expressed under conditions W while genes a, b are not expressed

genome sequence

Inference: gene X is significantly more highly expressed in diseased cell than in normal cell; hence gene X could potentially serve s a marker of the disease – differentially expressed genes

genome sequence

Control

Treatment

Gene Expression Measurement

7

Read quality check (FastQC)

RNA-seq read mapping (BWA, Bowtie)

RNA-seq Assemblywith reference genome

(Cufflinks)

RNA-seq Assemblywithout reference genome (Trinity: De-novo assembly)

Microarray (GEO)

RNA-seq (SRA)

$ $

RNA-seq

� Process

� Purpose� Analysis of Big Genomic Data� Gene Expression Estimation

� Variations� Differential Gene Expression

Analysis� Functional Enrichment Analysis� Network Analysis

Forde, B. M., & O’Toole, P. W. (2013). Next-generation sequencing technologies and their impact on microbial genomics. Briefings in functional genomics, 12(5), 440-453.

8/46

Non-trivial RNA-seq Analysis Pipeline

9

Quality Check

Read Mapping

GeneRead Count

DifferentialExpression Analysis

Functional Enrichment Analysis

RNA-seq Reads

Data Trimming

De-novo (Bi)-Clustering

Network Analysis & Modeling

(De-novo) Assembly

Operon Prediction

Non-trivial RNA-seq Analysis Tools

10

FastQC

HISAT

HtSeq

EdgeR/DeSeq DAVID/GO

RNA-seq Reads

Btrim

MCL/QUBIC NCA/GtrieScanner

CufflinksTrinity

DOORSeqTU

Existing RNA-seq Pipeline Tools

2009

2010

2011

2012

2013

2014

2015

2016

2017

HISAT2BridgerHtSeq

RSEMCufflinksCutadapt

NovoalignBowtie 2

BWABowti

eTopHa

tGNUM

ap

TopHat2STARTrinity

DESeq2

GSNAPedgeRFastQCFastX

kallisto

sleuth

11/46

ViDGER

� Tool to assist in interpreting and analyzing count matrices

� PCA, MDS, Clustering

� DGEA� Visualizations

� Basic R package � Shiny

implementation

12/46

ViDGERCompatibility� Count & condition

matrix

� Popular DGE tools by citation count

� Cuffdiff*� edgeR� DESeq2� DEGseq� limma� sleuth*

1202164%

520028%

1607 8%

DGE & Visualization Visualization Only None

13/46

Shiny Input� Count Matrix

� Generates basic figures from matrix

� Initial Analyses� PCA

� MDS

14/46

Differential Gene Expression� Select DGE tool to

analyze data

� Interactive results table

� DGE results visualizations for improved interpretation

� Interactivity between table & figures

15/46

Pitfall I: Popularity ≠High Performance

MapSplice2CRAC GSNAPNovoalignTopHat2

27

Human(97.8%)(86.1%)(98.9%)(90.3%)(12.5%)

Pitfall II: Gene expression estimation

28

Quality Check

Read Mapping

GeneRead Count

DifferentialExpression Analysis

Functional Enrichment Analysis

RNA-seq Reads

Data Trimming

De-novo (Bi)-Clustering

Network Analysis & Modeling

(De-novo) Assembly

Operon Prediction

Mapping uncertainty!

Pitfall II: Gene expression estimation

RNA-seq reads mapping uncertainty

29

Mapping Uncertainty Occurrences

� Plants� Highly duplicative nature of genome

� Animals� Alternative splicing

� Metagenomics� Sequencing of entire microbial communities simultaneously� Identical genes across different species� Similar, mutated or evolved genes� Currently other issues compounding mapping uncertainty

19/46

Pitfall II: How Serious?

30

Diploid plants Polyploid plants

Species Arabidopsisthaliana Vitis vinifera Solanum

lycopersicumSolanum

tuberosumTriticum aestivum

Unique-mapped 77%~89% 55%~82% 49%~87% 55%~69% 62%~69%

Multi-mapped 8%~17% 10%~25% 6%~34% 18%~26% 18%~25%

Un-mapped 2%~5% 8%~23% 5%~44% 12%~19% 9%~18%

Similar things happen in Human (transcript) and Metagenome

Diploid plants Polyploid plants Animal

TotalSpecies Arabidopsis

thaliana Vitis vinifera SolanumLycopersicum

PanicumVirgatum

TriticumAestivum

HumanGenome

HumanTranscriptome

Mus musculusGenome

Mus musculusTranscriptome

Datasets 10 10 10 10 13 11 11 10 10 95

Size(G) 153.7 152.3 151.8 385.7 348.1 249.9 249.9 129.9 129.9 1951

Unique-Mapped 69%~89% 55%~82% 52%~88% 47%~66% 61%~69% 55%-65% 10%~15% 40%~70% 11%~27% 55%

Multi-Mapped 8%~17% 9%~25% 5%~34% 17%~33% 17%~25% 21%-28% 23%-31% 10%~38% 9%~42% 22%

Un-mapped 2%~17% 8%~23% 4%~16% 13%~25% 9%~18% 12%-21% 55%-65% 3%~31% 43%~67% 23%

(Multi-mapped)/(Total mapped) 8%-18% 10%-31% 6%-39% 22%-39% 21%-28% 25%-33% 61%-72% 13%~48% 29%~77% 29%

Mapping Uncertainty in Real Data

21/46

Mapping Uncertainty in Plant Data

22/46

Mapping Uncertainty in Animal Data

23/46

Pitfall II: How to Proceed?

31

a) Ignore them: only consider unique mapping– 30%-70% of reads are discarded from further analysis in plants

b) Random mapping: If multiple equally best matches, choose one at random– TopHat

c) Report all: try to keep more information– Cufflinks: distribute these multiple mapping reads uniformly or

based on the expression level of unique mapping reads.

Pitfall II: How to Proceed?

32

It is an OPEN and challenge problem!

Quantifying Mapping Uncertainty

� Gene Expression Quality Check (GeneQC)� Computational program collecting relevant information from

datasets� Interprets information in meaningful way to provide quantification

of mapping uncertainty

� Two levels of observations� Genomic level: Sequence Similarity between two genomic locations� Transcriptomic level: Proportion of shared ambiguous reads

26/46

GeneQC

97

0.5

A B

C

C D

27/46

D-score� Allows for comparable

metric of mapping uncertainty

� Combines three statistics� Maximum proportion

of shared ambiguous reads

� Maximum base-pair similarity

� Number of gene pair interactions

� Normalized between 0 and 1 for each dataset

𝒊

0.07

0.18

0.65

0.95

0.84

28/46

Variables: 𝐷.

� 𝐷.: Sequence Similarity * Match Length� max

2{𝑠𝑠5,2 ∗ 𝑙5,2}

� 𝑠𝑠5,2 = 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠𝑖𝑚𝑖𝑙𝑖𝑟𝑡𝑦𝑜𝑓𝑔𝑒𝑛𝑒𝑖𝑎𝑛𝑑𝑔𝑒𝑛𝑒𝑦� 𝑙5,2 = 𝑚𝑎𝑡𝑐ℎ𝑙𝑒𝑛𝑔𝑡ℎ

� Additional Constraints for 𝐷.� e-value < 10KL

� SS*Match Length > 100� Mismatch < 5� Gap < 5

𝑔𝑒𝑛𝑒𝑦.: 𝑠𝑠5,. = 65%; 𝑙5,. = 100

𝑔𝑒𝑛𝑒𝑦P: 𝑠𝑠5,P = 85%; 𝑙5,. = 200

𝑔𝑒𝑛𝑒𝑖

𝑔𝑒𝑛𝑒𝑦R: 𝑠𝑠5,R = 85%; 𝑙5,R = 200

𝑔𝑒𝑛𝑒𝑦S: 𝑠𝑠5,S = 85%; 𝑙5,S = 350

29/46

Variables: 𝐷P

� 𝐷P: Max MMR percentage�

UV∩XUV

� 𝐺5 = 𝑟𝑒𝑎𝑑𝑠𝑎𝑙𝑖𝑔𝑛𝑒𝑑𝑡𝑜𝑔𝑒𝑛𝑒𝑖� 𝑋 = argmax

]|𝐺5 ∩ 𝑌|

𝐺5𝑋

𝑌P

𝑌.

𝐺5∩ 𝑋

30/46

Variables: 𝐷S

� 𝐷S: Degree weight� log.b 𝑆5 ∪ 𝑀5 + 1 � 𝑆5 = {𝑔𝑒𝑛𝑜𝑚𝑖𝑐𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛𝑠𝑤ℎ𝑒𝑟𝑒𝐷. > 0}� 𝑀5 = {𝑔𝑒𝑛𝑜𝑚𝑖𝑐𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛𝑠𝑤ℎ𝑒𝑟𝑒𝐷P > 0}

� Separated into two populations� 𝐷P = 0� 𝐷P ≠ 0

31/46

Variables by Species

32/46

D-score Development

� 𝐷., 𝐷P, 𝐷S combined into one distinct value

� Regression-based approach to optimize effect of each parameter

𝐷 = 𝛼.𝐷. + 𝛼P𝐷P + 𝛼S𝐷S + 𝛼R𝐷.𝐷P + 𝛼j𝐷.𝐷S + 𝛼L𝐷P𝐷S + 𝛼k𝐷.𝐷P𝐷S

𝑆𝐷 = 𝐷S(𝛼.𝐷. + 𝛼P𝐷P)

� 𝐷∗ used as dependent variable to represent mapping uncertainty� 𝐺5 = 𝑟𝑒𝑎𝑑𝑠𝑚𝑎𝑝𝑝𝑒𝑑𝑡𝑜𝑔𝑒𝑛𝑒𝑖 (All matches)� 𝑈5 = 𝑟𝑒𝑎𝑑𝑠𝑢𝑛𝑖𝑞𝑢𝑒𝑙𝑦𝑚𝑎𝑝𝑝𝑒𝑑𝑡𝑜𝑔𝑒𝑛𝑒𝑖 (Unique mapping)� Real alignment falls somewhere between

� |𝑈5| ≤ |𝑅5| ≤ |𝐺5|

� 𝐷∗ = UV K rVUV

= 1 − rVUV= 1 − UV t uV

P UV

�.P≤ 𝐷∗ ≤ 1

� 𝐷∗ regressed upon (𝐷., 𝐷P, 𝐷S) to determine optimized coefficients for each dataset

� Interpretations for each set of coefficients can be used to understand biological mechanisms behind species-specific mapping uncertainty

33/46

D-scores

34/46

Simplified D-score

35/46

Simplified D-score Distributions� Density plots appear to

show mixture distributions

� Individual distributions can help indicate categorizations for mapping uncertainty

36/46

Level of Mapping Uncertainty from D-scores

� Mixture model distributions fit to set of D-scores� Indicates level of mapping uncertainty for each annotated gene� Normal & Gamma distribution fitting� Variable number of distributions

� Mixture Model Fitting using Expectation-Maximization Algorithm

� 𝑃 𝑋 𝜃 = ∑ 𝛽z𝑌z 𝑋 𝜃z�z

� 𝑋 = 𝑥., 𝑥P, … , 𝑥~ represent the set of D-scores� 𝛽z represent the weight for the 𝑘�� component with ∑ 𝛽z�

z = 1� 𝑌z(𝑋|𝜃z) represent the distribution of the 𝑘�� component

� 𝜃z is the set of parameters for the 𝑘�� component

37/46

Mixture Model Fitting: Initialization

� Assume 𝑌z(𝑋, 𝜃z) = 𝑁(𝑋; 𝜇z, 𝜎zP)

� Initial parameterization� K-means clustering to separate into k components� 𝜃z, 𝛽zcalculated for each component using MLE based on 𝑁z

� 𝑀𝐿𝐸(𝜇z) =∑ ��,����

��

� 𝑀𝐿𝐸 𝜎zP =∑ ��,�K��

����

��

� 𝛽z =���

, with 𝑁z = 𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑑𝑎𝑡𝑎𝑝𝑜𝑖𝑛𝑡𝑠𝑖𝑛𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡𝑘 & ∑ 𝑁z�z = 𝑁

𝑘 = 4

38/46

Mixture Model Fitting: Expectation & Maximization

� Posterior Probability of containment within each component for each D-score is calculated

𝑃 𝑥� ∈ 𝑘5 𝑥� =𝑃 𝑥� 𝑥� ∈ 𝑘5 𝑃 𝑘5

𝑃 𝑥�=𝑁 𝑥� 𝜇z, 𝜎z

𝑁z𝑁

∑ 𝛽z𝑁 𝑥� 𝜇z, 𝜎z�z

=𝛽z𝑁 𝑥� 𝜇z𝜎z

∑ 𝛽z𝑁 𝑥� 𝜇z𝜎z�z

� Parameters for each component calculated after Expectation Step

𝜇z =∑ 𝑃 𝑥� ∈ 𝑘5 𝑥� 𝑥����.

∑ 𝑃 𝑥� ∈ 𝑘5 𝑥����.

𝜎zP = ∑ 𝑃 𝑥� ∈ 𝑘5 𝑥� 𝑥� − 𝜇z

P���.

∑ 𝑃 𝑥� ∈ 𝑘5 𝑥����.

𝛽z =∑ 𝑃 𝑥� ∈ 𝑘5 𝑥����.

𝑁

39/46

Mixture Model Fitting: Optimization

� Expectation and Maximization steps repeated until no significant improvement achieved after each iteration

� log likelihood fails to substantially increase

� Implementation in R with 𝑘 ∈ {1, … , 9}� Best model fitting determined by lowest Bayesian Information

Criterion (BIC)

𝑘 = 4

40/46

Mixture Model Fitting

𝑘 = 4The four distributions provide criteria for separating genes into 4 categorizations based on mapping uncertainty level

41/46

Addressing Mapping Uncertainty

� Co-expression Modules (CEMs)� Genes typically co-expressed at certain rates with other genes

forming co-expression modules� Can use expression levels for known co-expressed genes (CEGs) to

predict likely expression levels for the gene locations� This information can be in turn used to determine which location is

most likely for any particular ambiguous read

� Can use existing information to gain insight into the likelihood of the correct location for alignment

� If no prior CEMs are available, biclustering of data can provide dataset-specific CEMs.

42/46

Pitfall III: T-test for differentially expression analysis

Wilcoxon (nonparametric) test has better performance than T-test

(parametric)

Bioinformatics. 2002 Nov;18(11):1454-61.Cited by 308

P-value < 0.0134

Pitfall IV: co-expression correlation

chip1 chip2 chip3 chip4 chip5 chip6 chip7 chip8 Chip9 chip10

Gene1 7.6 6.0 10.8 8.3 9.1 8.7 7.4 6.4 10.2 6.5

Gene2 8.1 7.2 7.0 8.4 8.9 8.8 6.5 10.4 6.9 7.5

Pearson Spearman

• Pearson benchmarks linear relationship• Spearman’s rank correlation benchmarks monotonic relationship

Pearson or Spearman?

35

45

Pitfall V: Co-expression in LARGE data set

Genes are not necessarily co-expressed under all experimental conditions,when we have a large data set!

Gen

esConditions

One dimensional clustering (genes or conditions)

Bi-clustering (genes & conditions)more data!!

Computer Lab Requirement

• Recent version of following software– R– RStudio– MiKTeX (or TeXLive)

• Install the following R packages on yourpersonal computer– EdgeR– QUBIC– sand

46

Final Report Presentation

• 12 teams, 3 person/team

• For each team, 15 mins team presentation– 12 mins presentation– 3 mins question-and-answer

• One score per team

47

Recommended