Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD

Bioinformatics for Stem Cell: Lecture 2

Debashis Sahoo, PhD

Outline

• Lecture 1 Recap
• Multivariate analysis
• Microarray data analysis
• Boolean analysis
• Sequencing data analysis

MULTIVARIATE ANALYSIS

Identify Markers of Human Colon Cancer and Normal Colon

Piero Dalerba, Tomer Kalisky

Single Cell Analysis of Normal Human Colon Epithelium

Hierarchical Clustering

• Cluster 3.0
  – http://bonsai.hgc.jp/~mdehoon/software/cluster/
• Distance metric
  – Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson's correlation
• Linkage
  – Single, complete, average, median, centroid
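To make the roles of the distance metric and the linkage concrete, here is a minimal sketch. It uses SciPy rather than Cluster 3.0, and the toy matrix is invented for illustration; it is not data from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix (rows = cells, columns = genes); values are invented.
X = np.array([
    [1.0, 1.1, 0.9],
    [1.2, 0.9, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
])

# Average linkage on Euclidean distances. SciPy names for the other metrics
# on the slide: "sqeuclidean", "cityblock" (Manhattan), "chebyshev" (maximum),
# "cosine", and "correlation" (1 - Pearson's r).
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Swapping `method` for "single" or "complete" changes how the distance between two clusters is derived from the pairwise distances. In SciPy, "median" and "centroid" linkage are only defined for Euclidean distances.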

Multivariate Analysis - PCA

X = data matrix
V = loading matrix
U = scores matrix

Principal Component Analysis

Fundamentals of PCA

• Reduces dimensions of the data

• PCA uses an orthogonal linear transformation

• First principal component has the largest possible variance.

• Exploratory tool to uncover unknown trends in the data
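The properties listed above can be checked numerically. The following is a sketch only, assuming PCA is computed by a singular value decomposition of the centered data; the function name `pca` and the synthetic data are illustrative, and here the singular values are folded into the scores, a convention that varies between texts.

```python
import numpy as np

def pca(X):
    """PCA of X (rows = samples, columns = variables) via SVD."""
    Xc = X - X.mean(axis=0)                    # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                             # sample coordinates on the components
    loadings = Vt.T                            # columns are orthonormal directions
    explained_var = s ** 2 / (X.shape[0] - 1)  # variance along each component
    return scores, loadings, explained_var

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                   # synthetic data, not from the lecture
scores, loadings, explained_var = pca(X)

# Dimension reduction: keep only the first two components.
reduced = scores[:, :2]
print(explained_var)
```

The variances come out in decreasing order, so the first component carries the largest variance, and `loadings.T @ loadings` is the identity matrix, which is the orthogonality property.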

PCA Analysis

HIGH-THROUGHPUT DATA ANALYSIS

MICROARRAY ANALYSIS

Microarray

• Spotted vs. in situ
• Two channel vs. one channel
• Probe vs. probeset vs. gene

Quantile Normalization

[Figure: arrays #1, #2 and #3 are each sorted, then averaged across arrays at each rank to give SortedAvg]

Val(Probe_i) = SortedAvg[Rank(Probe_i)]
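The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched in a few lines. This is an illustrative implementation with invented numbers; ties are broken by position here, which is a simplifying assumption (some implementations average tied ranks instead).

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X (rows = probes, columns = arrays)."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)                 # sort order within each array
    ranks = np.argsort(order, axis=0)             # rank of each probe within its array
    sorted_avg = np.sort(X, axis=0).mean(axis=1)  # SortedAvg: mean across arrays at each rank
    return sorted_avg[ranks]                      # Val(Probe_i) = SortedAvg[Rank(Probe_i)]

# Three arrays (#1, #2, #3) with four probes each; values are invented.
X = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 6.0, 6.0],
    [4.0, 2.0, 8.0],
])
Q = quantile_normalize(X)
print(Q)
```

After normalization every array has exactly the same set of values, so the arrays share one distribution while each probe keeps its rank within its own array.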

Invariant Set Normalization

[Figure: panels "Before Normalization" and "After Normalization", with the invariant set marked]

Good to Check the Image

SAM Two-Class Unpaired

1. Assign experiments to two groups. For example, in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.

[Figure: expression matrix with rows Gene 1 to Gene 6 and columns Exp 1 to Exp 6]

2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?

[Figure: the same matrix with its columns arranged into Group A and Group B]

SAM Two-Class Unpaired

Permutation tests

i) For each gene, compute the d-value (analogous to a t-statistic). This is the observed d-value for that gene.

ii) Rank the genes in ascending order of their d-values.

iii) Randomly shuffle the values of each gene between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B, respectively. Compute the d-value for each randomized gene.

[Figure: Gene 1 across Exp 1 to Exp 6, shown with the original grouping and with a randomized grouping into Group A and Group B]

SAM Two-Class Unpaired

iv) Rank the permuted d-values of the genes in ascending order

v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-values. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.

vi) Plot the observed d-values vs. the expected d-values
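Steps i) to v) can be sketched as follows. This is an illustration under stated assumptions, not the SAM software itself: the d-value is taken to be a pooled-variance t-like statistic with a small constant `s0` added to the denominator, the value `s0=0.1` is arbitrary, the shuffling is done by permuting the group labels of the experiments, and the data are synthetic.

```python
import numpy as np

def d_values(X, in_b, s0=0.1):
    """t-like d-value per gene (rows of X); in_b marks the group-B columns."""
    a, b = X[:, ~in_b], X[:, in_b]
    na, nb = a.shape[1], b.shape[1]
    ss = ((a - a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
        + ((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt(ss / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    # Positive d means mean(B) > mean(A); s0 guards against tiny variances.
    return (b.mean(axis=1) - a.mean(axis=1)) / (s + s0)

def expected_d_values(X, in_b, n_perm=200, s0=0.1, seed=0):
    """Average the sorted d-values over random regroupings (steps iii to v)."""
    rng = np.random.default_rng(seed)
    total = np.zeros(X.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(in_b)       # group sizes are preserved
        total += np.sort(d_values(X, shuffled, s0))
    return total / n_perm

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))                   # 50 synthetic genes, 6 experiments
in_b = np.array([False, False, True, True, False, True])  # Exps 3, 4, 6 in group B
X[0, in_b] += 100.0                            # make gene 0 strongly higher in B

observed = d_values(X, in_b)
observed_sorted = np.sort(observed)            # step ii
expected_sorted = expected_d_values(X, in_b)   # steps iii to v
```

Plotting `observed_sorted` against `expected_sorted` gives the plot described in step vi).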

SAM Two-Class Unpaired

Significant positive genes (i.e., mean expression of group B > mean expression of group A)

Significant negative genes (i.e., mean expression of group A > mean expression of group B)

“Observed d = expected d” line

The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Moving in the positive or negative direction along the x-axis, find the first gene whose observed d-value differs from its expected d-value by at least delta; that gene and every gene beyond it in that direction are considered significant.
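The delta rule can be sketched as below. This is a simplified reading of the rule, not the SAM implementation: the function name is hypothetical, the split between the positive and negative directions is taken at an expected d-value of zero, and the numbers are invented.

```python
import numpy as np

def call_significant(observed_sorted, expected_sorted, delta):
    """Flag genes (in ascending d-value order) as significant positive or negative."""
    observed_sorted = np.asarray(observed_sorted, dtype=float)
    expected_sorted = np.asarray(expected_sorted, dtype=float)
    diff = observed_sorted - expected_sorted
    positive = np.zeros(diff.size, dtype=bool)
    negative = np.zeros(diff.size, dtype=bool)

    # Positive side: first gene with observed - expected >= delta, and all beyond it.
    up = np.flatnonzero((expected_sorted >= 0) & (diff >= delta))
    if up.size:
        positive[up[0]:] = True

    # Negative side: first gene (moving left) with expected - observed >= delta.
    down = np.flatnonzero((expected_sorted < 0) & (diff <= -delta))
    if down.size:
        negative[:down[-1] + 1] = True
    return positive, negative

observed = [-5.0, -1.0, 0.0, 1.0, 6.0]   # invented, already sorted
expected = [-2.0, -1.0, 0.0, 1.0, 2.0]
pos, neg = call_significant(observed, expected, delta=1.5)
print(pos, neg)
```

A larger delta calls fewer genes; in practice delta is chosen by looking at the false discovery rate estimated from the same permutations.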

GenePattern: http://genepattern.broadinstitute.org/

AutoSOME: http://jimcooperlab.mcdb.ucsb.edu/autosome/

Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117

Aaron Newman

Gene Set Analysis

Cell Cycle
Transcription factor
TGF-beta Signaling Pathway
Wnt-signaling Pathway
Protein-protein interaction network

Your Gene Set

Compute enrichment in pathways and networks

Tools: GSEA, DAVID, ToppFun, MSigDB, and STRING
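One common way to compute enrichment of a gene set in a pathway is a hypergeometric (one-sided Fisher) test on the overlap. The sketch below assumes that approach; it is not the specific statistic used by any of the tools listed (GSEA, for instance, uses a rank-based score), and the gene names are placeholders.

```python
from scipy.stats import hypergeom

def enrichment_p(gene_set, pathway, universe):
    """P(overlap >= observed) if the gene set were drawn at random from the universe."""
    universe = set(universe)
    gene_set = set(gene_set) & universe
    pathway = set(pathway) & universe
    k = len(gene_set & pathway)
    # sf(k - 1) is the upper tail P(X >= k) of the hypergeometric distribution.
    return hypergeom.sf(k - 1, len(universe), len(pathway), len(gene_set))

universe = [f"g{i}" for i in range(100)]   # placeholder gene names
pathway = universe[:10]                    # a 10-gene pathway
my_genes = universe[5:15]                  # shares 5 genes with the pathway
unrelated = universe[50:60]                # shares none

p_hit = enrichment_p(my_genes, pathway, universe)
p_miss = enrichment_p(unrelated, pathway, universe)
print(p_hit, p_miss)
```

When many pathways are tested, the resulting p-values need a multiple-testing correction before they are interpreted.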

BOOLEAN ANALYSIS

Boolean Implication

• Analyze pairs of genes.
• Analyze the four different quadrants.
• Identify sparse quadrants.
• Record the Boolean relationships.
  – If ACPP high, then GABRB1 low
  – If GABRB1 high, then ACPP low

[Figure: ACPP (x-axis) vs. GABRB1 (y-axis), 45,000 Affymetrix microarrays]

[Sahoo et al. Genome Biology 08]

Threshold Calculation

• A threshold is determined for each gene.

• The arrays are sorted by gene expression.

• StepMiner is used to determine the threshold

[Figure: CDH expression (y-axis) vs. sorted arrays (x-axis), with the threshold and the High, Intermediate and Low regions marked]

[Sahoo et al. 07]
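The idea of fitting a step to the sorted expression values can be sketched as follows. This is a simplified stand-in for StepMiner, not the published algorithm: it fits a single step by least squares and takes the midpoint of the two fitted levels as the threshold. The `margin` of 0.5 around the threshold that defines the intermediate band is an assumption here, and the values are invented.

```python
import numpy as np

def step_threshold(values):
    """Fit one low-to-high step to the sorted values; return the step midpoint."""
    v = np.sort(np.asarray(values, dtype=float))
    best_sse, best_k = np.inf, 1
    for k in range(1, len(v)):                 # candidate step position
        low, high = v[:k], v[k:]
        sse = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    return (v[:best_k].mean() + v[best_k:].mean()) / 2.0

def classify(value, threshold, margin=0.5):
    """Label one expression value as high, low, or intermediate."""
    if value > threshold + margin:
        return "high"
    if value < threshold - margin:
        return "low"
    return "intermediate"

expression = [1.0, 1.2, 0.9, 1.1, 9.0, 9.1, 8.9, 9.2]  # invented values for one gene
thr = step_threshold(expression)
print(thr, [classify(x, thr) for x in expression])
```

Arrays that fall in the intermediate band are typically left out when the quadrants of a gene pair are counted, so that noise near the threshold does not fill a sparse quadrant.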

BooleanNet Statistics

[Sahoo et al. Genome Biology 08]

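Quadrant counts for a gene pair (A, B); the first index refers to A and the second to B, with 0 = low and 1 = high:

          B low   B high
A low     a00     a01
A high    a10     a11

$$n_{A\mathrm{low}} = a_{00} + a_{01}, \qquad n_{B\mathrm{low}} = a_{00} + a_{10}$$

$$\mathrm{total} = a_{00} + a_{01} + a_{10} + a_{11}, \qquad \mathrm{observed} = a_{00}$$

$$\mathrm{expected} = \left(\frac{n_{A\mathrm{low}}}{\mathrm{total}} \cdot \frac{n_{B\mathrm{low}}}{\mathrm{total}}\right) \cdot \mathrm{total}$$

$$\mathrm{statistic} = \frac{\mathrm{expected} - \mathrm{observed}}{\sqrt{\mathrm{expected}}}$$

$$\text{error rate} = \frac{1}{2}\left(\frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}}\right)$$

Boolean Implication = (statistic > 3, error rate < 0.1)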
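The sparsity test for the low-low quadrant translates directly into code. The sketch below follows the formulas on this slide; the function names and the example counts are illustrative, and it assumes neither gene has zero "low" arrays (otherwise the error rate is undefined).

```python
import math

def low_low_sparsity(a00, a01, a10, a11):
    """Statistic and error rate for the low-low quadrant of a gene pair (A, B)."""
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    total = a00 + a01 + a10 + a11
    observed = a00
    expected = (n_a_low / total) * (n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    return statistic, error_rate

def is_sparse(statistic, error_rate, stat_cutoff=3.0, err_cutoff=0.1):
    """Apply the thresholds from the slide: statistic > 3 and error rate < 0.1."""
    return statistic > stat_cutoff and error_rate < err_cutoff

# Invented counts: an empty low-low quadrant vs. a well-populated one.
stat1, err1 = low_low_sparsity(0, 50, 50, 100)
stat2, err2 = low_low_sparsity(12, 38, 38, 112)
print(is_sparse(stat1, err1), is_sparse(stat2, err2))
```

The same test can be applied to the other three quadrants by relabeling low and high for A, for B, or for both.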

Six Boolean Implications

[Sahoo et al. Genome Biology 08]

MiDReG Algorithm

[Sahoo et al. PNAS 2010]

MiDReG = Mining Developmentally Regulated Genes

B Cell Genes

[Sahoo et al. PNAS 2010]

CD19

KIT

Boolean Implications

Jun Seita

[Seita, Sahoo et al. PLoS ONE, 2012]

http://gexc.stanford.edu

SEQUENCING DATA ANALYSIS

Sequencing Data Format

FASTQ

@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba

FASTA

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

FASTQ quality score encodings:

S - Sanger: Phred+33, range (0, 40)
X - Solexa: Solexa+64, range (-5, 40)
I - Illumina 1.3+: Phred+64, range (0, 40)
J - Illumina 1.5+: Phred+64, range (3, 40)
L - Illumina 1.8+: Phred+33, range (0, 41)
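To show how the four-line FASTQ record and the quality offsets fit together, here is a minimal sketch. The parser assumes unwrapped four-line records, the function names are illustrative, and decoding the example read with an offset of 64 is an assumption based on the characters in its quality string, not something stated on the slide.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        yield header[1:], seq, qual

def phred_scores(qual, offset):
    """Decode a quality string: score = ASCII code minus the encoding offset."""
    return [ord(c) - offset for c in qual]

record = """@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba
"""

name, seq, qual = next(parse_fastq(record))
scores = phred_scores(qual, offset=64)   # assumed Phred+64 for this read
print(name, len(seq), scores[:5])
```

Decoding the same string with an offset of 33 would give scores far above the ranges listed for the Phred+33 encodings, which is one quick way to guess the encoding of an unlabeled file.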

Mapping

Mapping Software

• Long reads
  – BLAST, HMMER, SSEARCH
• Short reads
  – BLAT
  – Bowtie, BWA, Partek, SOAP, TopHat, Olego, BarraCUDA

Visualizations

• UCSC Genome Browser
• GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2
• Integrative Genomics Viewer (IGV)

Quantification

• Peak calling
  – QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
• Expression quantification
  – Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
• SNP calling
  – samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

Peak Discovery

[Pepke et al. Nature Methods 2009]

Transcript Quantification

[Pepke et al. Nature Methods 2009]

RPKM, FPKM
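RPKM stands for reads per kilobase of transcript per million mapped reads; FPKM uses the same formula with fragments (read pairs) in place of reads. A minimal sketch of the arithmetic, with invented numbers:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# 1,000 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
value = rpkm(1000, 2000, 10_000_000)
print(value)
```

The factor of 1e9 combines the two scalings: 1e3 for kilobases and 1e6 for millions of reads.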

SNP Calling

Typical RNA-seq Workflow

[Trapnell et al. Nature Biotech 2010]
