GeneExpressionDataAnalysis - University of...

Gene Expression Data Analysis

Qin Ma, Ph.D.December 10, 2017

Bioinformatics

• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.

– Temple Smith, Current Topics in Computational Molecular Biology (2002)

Bioinformatics

GenomicsTranscriptomicsMetabolomicsMetagenomicsEpigenomicsProteomics

Interactomics…

Omics data

Systems biology

Characteristics of Biological Big Data

Big Small Data

Small Big Data

• 36.8 million transactions per day on Amazon

Next Generation Sequencing Data

• Biomedical Data (behavioral outcomes in observational study)

The Hierarchical Structure of Computational Techniques

Models

Algorithms

Programs Tools Software

Central Dogma

� DNA à RNA à Protein

Intro to gene expression (central dogma). (n.d.). Retrieved November 05, 2017, from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/central-dogma-transcription/a/intro-to-gene-expression-central-dogma

Information derivable from gene expression data

Inference: genes with similar expression patterns might be functionally related, e.g., working in the same pathway or co-regulated

co-expression -> co-regulation

Inference: genes x, y are highly expressed under conditions W while genes a, b are not expressed

genome sequence

Inference: gene X is significantly more highly expressed in diseased cell than in normal cell; hence gene X could potentially serve s a marker of the disease – differentially expressed genes

genome sequence

Control

Treatment

Gene Expression Measurement

Read quality check (FastQC)

RNA-seq read mapping (BWA, Bowtie)

RNA-seq Assemblywith reference genome

(Cufflinks)

RNA-seq Assemblywithout reference genome (Trinity: De-novo assembly)

Microarray (GEO)

RNA-seq (SRA)

RNA-seq

� Process

� Purpose� Analysis of Big Genomic Data� Gene Expression Estimation

� Variations� Differential Gene Expression

Analysis� Functional Enrichment Analysis� Network Analysis

Forde, B. M., & O’Toole, P. W. (2013). Next-generation sequencing technologies and their impact on microbial genomics. Briefings in functional genomics, 12(5), 440-453.

Non-trivial RNA-seq Analysis Pipeline

Quality Check

Read Mapping

GeneRead Count

DifferentialExpression Analysis

Functional Enrichment Analysis

RNA-seq Reads

Data Trimming

De-novo (Bi)-Clustering

Network Analysis & Modeling

(De-novo) Assembly

Operon Prediction

Non-trivial RNA-seq Analysis Tools

FastQC

EdgeR/DeSeq DAVID/GO

RNA-seq Reads

MCL/QUBIC NCA/GtrieScanner

CufflinksTrinity

DOORSeqTU

Existing RNA-seq Pipeline Tools

HISAT2BridgerHtSeq

RSEMCufflinksCutadapt

NovoalignBowtie 2

BWABowti

eTopHa

TopHat2STARTrinity

DESeq2

GSNAPedgeRFastQCFastX

kallisto

sleuth

ViDGER

� Tool to assist in interpreting and analyzing count matrices

� PCA, MDS, Clustering

� DGEA� Visualizations

� Basic R package � Shiny

implementation

ViDGERCompatibility� Count & condition

matrix

� Popular DGE tools by citation count

� Cuffdiff*� edgeR� DESeq2� DEGseq� limma� sleuth*

1202164%

520028%

1607 8%

DGE & Visualization Visualization Only None

Shiny Input� Count Matrix

� Generates basic figures from matrix

� Initial Analyses� PCA

� MDS

Differential Gene Expression� Select DGE tool to

analyze data

� Interactive results table

� DGE results visualizations for improved interpretation

� Interactivity between table & figures

Pitfall I: Popularity ≠High Performance

MapSplice2CRAC GSNAPNovoalignTopHat2

Human(97.8%)(86.1%)(98.9%)(90.3%)(12.5%)

Pitfall II: Gene expression estimation

Quality Check

Read Mapping

GeneRead Count

DifferentialExpression Analysis

Functional Enrichment Analysis

RNA-seq Reads

Data Trimming

De-novo (Bi)-Clustering

Network Analysis & Modeling

(De-novo) Assembly

Operon Prediction

Mapping uncertainty!

Pitfall II: Gene expression estimation

RNA-seq reads mapping uncertainty

Mapping Uncertainty Occurrences

� Plants� Highly duplicative nature of genome

� Animals� Alternative splicing

� Metagenomics� Sequencing of entire microbial communities simultaneously� Identical genes across different species� Similar, mutated or evolved genes� Currently other issues compounding mapping uncertainty

Pitfall II: How Serious?

Diploid plants Polyploid plants

Species Arabidopsisthaliana Vitis vinifera Solanum

lycopersicumSolanum

tuberosumTriticum aestivum

Unique-mapped 77%~89% 55%~82% 49%~87% 55%~69% 62%~69%

Multi-mapped 8%~17% 10%~25% 6%~34% 18%~26% 18%~25%

Un-mapped 2%~5% 8%~23% 5%~44% 12%~19% 9%~18%

Similar things happen in Human (transcript) and Metagenome

Diploid plants Polyploid plants Animal

TotalSpecies Arabidopsis

thaliana Vitis vinifera SolanumLycopersicum

PanicumVirgatum

TriticumAestivum

HumanGenome

HumanTranscriptome

Mus musculusGenome

Mus musculusTranscriptome

Datasets 10 10 10 10 13 11 11 10 10 95

Size(G) 153.7 152.3 151.8 385.7 348.1 249.9 249.9 129.9 129.9 1951

Unique-Mapped 69%~89% 55%~82% 52%~88% 47%~66% 61%~69% 55%-65% 10%~15% 40%~70% 11%~27% 55%

Multi-Mapped 8%~17% 9%~25% 5%~34% 17%~33% 17%~25% 21%-28% 23%-31% 10%~38% 9%~42% 22%

Un-mapped 2%~17% 8%~23% 4%~16% 13%~25% 9%~18% 12%-21% 55%-65% 3%~31% 43%~67% 23%

(Multi-mapped)/(Total mapped) 8%-18% 10%-31% 6%-39% 22%-39% 21%-28% 25%-33% 61%-72% 13%~48% 29%~77% 29%

Mapping Uncertainty in Real Data

Mapping Uncertainty in Plant Data

Mapping Uncertainty in Animal Data

Pitfall II: How to Proceed?

a) Ignore them: only consider unique mapping– 30%-70% of reads are discarded from further analysis in plants

b) Random mapping: If multiple equally best matches, choose one at random– TopHat

c) Report all: try to keep more information– Cufflinks: distribute these multiple mapping reads uniformly or

based on the expression level of unique mapping reads.

Pitfall II: How to Proceed?

It is an OPEN and challenge problem!

Quantifying Mapping Uncertainty

� Gene Expression Quality Check (GeneQC)� Computational program collecting relevant information from

datasets� Interprets information in meaningful way to provide quantification

of mapping uncertainty

� Two levels of observations� Genomic level: Sequence Similarity between two genomic locations� Transcriptomic level: Proportion of shared ambiguous reads

GeneQC

D-score� Allows for comparable

metric of mapping uncertainty

� Combines three statistics� Maximum proportion

of shared ambiguous reads

� Maximum base-pair similarity

� Number of gene pair interactions

� Normalized between 0 and 1 for each dataset

Variables: 𝐷.

� 𝐷.: Sequence Similarity * Match Length� max

2{𝑠𝑠5,2 ∗ 𝑙5,2}

� 𝑠𝑠5,2 = 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠𝑖𝑚𝑖𝑙𝑖𝑟𝑡𝑦𝑜𝑓𝑔𝑒𝑛𝑒𝑖𝑎𝑛𝑑𝑔𝑒𝑛𝑒𝑦� 𝑙5,2 = 𝑚𝑎𝑡𝑐ℎ𝑙𝑒𝑛𝑔𝑡ℎ

� Additional Constraints for 𝐷.� e-value < 10KL

� SS*Match Length > 100� Mismatch < 5� Gap < 5

𝑔𝑒𝑛𝑒𝑦.: 𝑠𝑠5,. = 65%; 𝑙5,. = 100

𝑔𝑒𝑛𝑒𝑦P: 𝑠𝑠5,P = 85%; 𝑙5,. = 200

𝑔𝑒𝑛𝑒𝑖

𝑔𝑒𝑛𝑒𝑦R: 𝑠𝑠5,R = 85%; 𝑙5,R = 200

𝑔𝑒𝑛𝑒𝑦S: 𝑠𝑠5,S = 85%; 𝑙5,S = 350

Variables: 𝐷P

� 𝐷P: Max MMR percentage�

UV∩XUV

� 𝐺5 = 𝑟𝑒𝑎𝑑𝑠𝑎𝑙𝑖𝑔𝑛𝑒𝑑𝑡𝑜𝑔𝑒𝑛𝑒𝑖� 𝑋 = argmax

]|𝐺5 ∩ 𝑌|

𝐺5𝑋

𝐺5∩ 𝑋

Variables: 𝐷S

� 𝐷S: Degree weight� log.b 𝑆5 ∪ 𝑀5 + 1 � 𝑆5 = {𝑔𝑒𝑛𝑜𝑚𝑖𝑐𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛𝑠𝑤ℎ𝑒𝑟𝑒𝐷. > 0}� 𝑀5 = {𝑔𝑒𝑛𝑜𝑚𝑖𝑐𝑙𝑜𝑐𝑎𝑡𝑖𝑜𝑛𝑠𝑤ℎ𝑒𝑟𝑒𝐷P > 0}

� Separated into two populations� 𝐷P = 0� 𝐷P ≠ 0

Variables by Species

D-score Development

� 𝐷., 𝐷P, 𝐷S combined into one distinct value

� Regression-based approach to optimize effect of each parameter

𝐷 = 𝛼.𝐷. + 𝛼P𝐷P + 𝛼S𝐷S + 𝛼R𝐷.𝐷P + 𝛼j𝐷.𝐷S + 𝛼L𝐷P𝐷S + 𝛼k𝐷.𝐷P𝐷S

𝑆𝐷 = 𝐷S(𝛼.𝐷. + 𝛼P𝐷P)

� 𝐷∗ used as dependent variable to represent mapping uncertainty� 𝐺5 = 𝑟𝑒𝑎𝑑𝑠𝑚𝑎𝑝𝑝𝑒𝑑𝑡𝑜𝑔𝑒𝑛𝑒𝑖 (All matches)� 𝑈5 = 𝑟𝑒𝑎𝑑𝑠𝑢𝑛𝑖𝑞𝑢𝑒𝑙𝑦𝑚𝑎𝑝𝑝𝑒𝑑𝑡𝑜𝑔𝑒𝑛𝑒𝑖 (Unique mapping)� Real alignment falls somewhere between

� |𝑈5| ≤ |𝑅5| ≤ |𝐺5|

� 𝐷∗ = UV K rVUV

= 1 − rVUV= 1 − UV t uV

�.P≤ 𝐷∗ ≤ 1

� 𝐷∗ regressed upon (𝐷., 𝐷P, 𝐷S) to determine optimized coefficients for each dataset

� Interpretations for each set of coefficients can be used to understand biological mechanisms behind species-specific mapping uncertainty

D-scores

Simplified D-score

Simplified D-score Distributions� Density plots appear to

show mixture distributions

� Individual distributions can help indicate categorizations for mapping uncertainty

Level of Mapping Uncertainty from D-scores

� Mixture model distributions fit to set of D-scores� Indicates level of mapping uncertainty for each annotated gene� Normal & Gamma distribution fitting� Variable number of distributions

� Mixture Model Fitting using Expectation-Maximization Algorithm

� 𝑃 𝑋 𝜃 = ∑ 𝛽z𝑌z 𝑋 𝜃z�z

� 𝑋 = 𝑥., 𝑥P, … , 𝑥~ represent the set of D-scores� 𝛽z represent the weight for the 𝑘�� component with ∑ 𝛽z�

z = 1� 𝑌z(𝑋|𝜃z) represent the distribution of the 𝑘�� component

� 𝜃z is the set of parameters for the 𝑘�� component

Mixture Model Fitting: Initialization

� Assume 𝑌z(𝑋, 𝜃z) = 𝑁(𝑋; 𝜇z, 𝜎zP)

� Initial parameterization� K-means clustering to separate into k components� 𝜃z, 𝛽zcalculated for each component using MLE based on 𝑁z

� 𝑀𝐿𝐸(𝜇z) =∑ ��,��

��

� 𝑀𝐿𝐸 𝜎zP =∑ ��,�K��

��

� 𝛽z =��

, with 𝑁z = 𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑑𝑎𝑡𝑎𝑝𝑜𝑖𝑛𝑡𝑠𝑖𝑛𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡𝑘 & ∑ 𝑁z�z = 𝑁

𝑘 = 4

Mixture Model Fitting: Expectation & Maximization

� Posterior Probability of containment within each component for each D-score is calculated

𝑃 𝑥� ∈ 𝑘5 𝑥� =𝑃 𝑥� 𝑥� ∈ 𝑘5 𝑃 𝑘5

𝑃 𝑥�=𝑁 𝑥� 𝜇z, 𝜎z

𝑁z𝑁

∑ 𝛽z𝑁 𝑥� 𝜇z, 𝜎z�z

=𝛽z𝑁 𝑥� 𝜇z𝜎z

∑ 𝛽z𝑁 𝑥� 𝜇z𝜎z�z

� Parameters for each component calculated after Expectation Step

𝜇z =∑ 𝑃 𝑥� ∈ 𝑘5 𝑥� 𝑥��.

∑ 𝑃 𝑥� ∈ 𝑘5 𝑥��.

𝜎zP = ∑ 𝑃 𝑥� ∈ 𝑘5 𝑥� 𝑥� − 𝜇z

P��.

∑ 𝑃 𝑥� ∈ 𝑘5 𝑥��.

𝛽z =∑ 𝑃 𝑥� ∈ 𝑘5 𝑥��.

Mixture Model Fitting: Optimization

� Expectation and Maximization steps repeated until no significant improvement achieved after each iteration

� log likelihood fails to substantially increase

� Implementation in R with 𝑘 ∈ {1, … , 9}� Best model fitting determined by lowest Bayesian Information

Criterion (BIC)

𝑘 = 4

Mixture Model Fitting

𝑘 = 4The four distributions provide criteria for separating genes into 4 categorizations based on mapping uncertainty level

Addressing Mapping Uncertainty

� Co-expression Modules (CEMs)� Genes typically co-expressed at certain rates with other genes

forming co-expression modules� Can use expression levels for known co-expressed genes (CEGs) to

predict likely expression levels for the gene locations� This information can be in turn used to determine which location is

most likely for any particular ambiguous read

� Can use existing information to gain insight into the likelihood of the correct location for alignment

� If no prior CEMs are available, biclustering of data can provide dataset-specific CEMs.

Pitfall III: T-test for differentially expression analysis

Wilcoxon (nonparametric) test has better performance than T-test

(parametric)

Bioinformatics. 2002 Nov;18(11):1454-61.Cited by 308

P-value < 0.0134

Pitfall IV: co-expression correlation

chip1 chip2 chip3 chip4 chip5 chip6 chip7 chip8 Chip9 chip10

Gene1 7.6 6.0 10.8 8.3 9.1 8.7 7.4 6.4 10.2 6.5

Gene2 8.1 7.2 7.0 8.4 8.9 8.8 6.5 10.4 6.9 7.5

Pearson Spearman

• Pearson benchmarks linear relationship• Spearman’s rank correlation benchmarks monotonic relationship

Pearson or Spearman?

Pitfall V: Co-expression in LARGE data set

Genes are not necessarily co-expressed under all experimental conditions,when we have a large data set!

esConditions

One dimensional clustering (genes or conditions)

Bi-clustering (genes & conditions)more data!!

Computer Lab Requirement

• Recent version of following software– R– RStudio– MiKTeX (or TeXLive)

• Install the following R packages on yourpersonal computer– EdgeR– QUBIC– sand

Final Report Presentation

• 12 teams, 3 person/team

• For each team, 15 mins team presentation– 12 mins presentation– 3 mins question-and-answer

• One score per team

GeneExpressionDataAnalysis - University of...

Documents

Christine Vogel NIH Public Access Edward M. Marcotte ...csbl.bmb.uga.edu/mirrors/JLU/DragonStar/2016/winter/download... · Emerging evidence is changing our view of the role for the

Mirrors and Lenses. Flat Mirrors Images Formed by Spherical Mirrors Concave Mirrors and Sign Conventions Thin Lenses Mirrors and Lenses Sections 1-4

Metastasis: cancer cell s escape from oxidative stresscsbl.bmb.uga.edu/mirrors/JLU/DragonStar2015/download... · 2014. 6. 18. · coordinated action of tyrosine kinases (such as Src

Curved Mirrors, Ray Diagrams, and Simulations...Curved Mirrors, Ray Diagrams, and Simulations Background Information Spherical mirrors may be concave or convex. Concave mirrors have

Mirrors & Mirrors

Spherical Mirrors Alfano I: Year 4. Concave Mirrors Concave mirrors are a type of spherical mirrors. We see them used as make-up mirrors or shaving mirrors,

14-3: Curved Mirrors. Curved Mirrors What are some examples of curved mirrors?

Jlu Lusaka Final

Catalogue Reflect+ 2015 - exclusive mirrors by Deknudt Mirrors

Curved Mirrors Concave and Convex Mirrors Concave and convex mirrors are curved mirrors similar to portions of a sphere. light rays Concave mirrors reflect

Leading Edge Review - University of Georgiacsbl.bmb.uga.edu/mirrors/JLU/DragonStar2016/download/The... · 2016-12-03 · Leading Edge Review Hallmarks of Cancer: The Next Generation

PRAGMA Institute on Implementation @ PRAGMA 19 Wilfred W. Li, Ph.D., UCSD, USA Xiaohui Wei, Ph.D., JLU, PRC Hosted by JLU Changchun, Jilin, PRC, Sept 13,

Functionalization Electronic Supplementary Information ... · S-3 Fig. S3 PXRD patterns of JLU-Liu33F for simulated, as-synthesized and solvent-exchanged samples. JLU-Liu33F was stable

Olmec mirrors: an example of archaeological American mirrors … · 2007-01-30 · Olmec mirrors: an example of archaeological American mirrors José J. Lunazzi Universidade Estadual

Interchain Interactions in Organic - JLU

Alexander Stankovski/Mirrors Within Mirrors

EVO Manufacturing EVO-3035 JLU Rocker Steps · EVO Manufacturing EVO-3035 JLU Rocker Steps READ BEFORE INSTALLATION: 1. Safely park vehicle on level ground. 2. Start by preassembling

ePlace:’Electrostacs’Based’Placement Using’Nesterov’s’Method’cseweb.ucsd.edu/~jlu/papers/eplace-dac14/slides.pdf · ePlace:’Electrostacs’Based’Placement Using’Nesterov’s’Method’

J.M. Gabrielse. Outline Reflection Mirrors Plane mirrors Spherical mirrors Concave mirrors Convex mirrors Refraction Lenses Concave lenses Convex lenses

Solar Mirrors: How Engineers Measure Mirrors