Bioinformatics for Stem Cell: Lecture 2
Debashis Sahoo, PhD
Outline
• Lecture 1 Recap
• Multivariate analysis
• Microarray data analysis
• Boolean analysis
• Sequencing data analysis
MULTIVARIATE ANALYSIS
Identify Markers of Human Colon Cancer and Normal Colon
Piero Dalerba, Tomer Kalisky
Single Cell Analysis of Normal Human Colon Epithelium
Hierarchical Clustering
• Cluster 3.0
  – http://bonsai.hgc.jp/~mdehoon/software/cluster/
• Distance metric
  – Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation
• Linkage
  – Single, complete, average, median, centroid
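To make the distance and linkage choices concrete, here is a minimal pure-Python sketch of agglomerative clustering with average linkage. It is not how Cluster 3.0 is implemented; the function names (`euclidean`, `pearson_distance`, `average_linkage`) and the toy profiles are assumptions made for illustration, and the naive search is only practical for small inputs.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def average_linkage(points, dist=euclidean):
    """Naive agglomerative clustering; returns merges as (id_a, id_b, distance)."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                d = sum(dist(points[p], points[q]) for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

# Toy data: two pairs of similar profiles.
profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [9.0, 8.0, 7.0], [9.2, 7.9, 7.1]]
merges = average_linkage(profiles)
```

Passing `dist=pearson_distance` groups profiles by shape rather than magnitude, which is often the reason to prefer a correlation-based metric for expression data.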
Multivariate Analysis - PCA
X = data matrix
V = loading matrix
U = scores matrix
Principal Component Analysis
Fundamentals of PCA
• Reduces dimensions of the data
• PCA uses orthogonal linear transformation
• First principal component has the largest possible variance.
• Exploratory tool to uncover unknown trends in the data
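As a rough illustration of these points, the sketch below finds the first principal component of a small data matrix by power iteration on the sample covariance matrix. Production code would normally use an SVD routine instead; the function name, the starting vector, and the toy data are assumptions for this example. In the notation above, the returned direction corresponds to one column of the loading matrix V and the projections to one column of the scores matrix U.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) for the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration; the starting vector is an arbitrary choice and must
    # not be orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores: projection of each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores

# Toy data lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
loading, scores = first_principal_component(rows)
```

For this toy data the loading vector points roughly along (1, 2), the direction in which the samples vary most.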
PCA Analysis
HIGH-THROUGHPUT DATA ANALYSIS
MICROARRAY ANALYSIS
Microarray
• Spotted vs. in situ
• Two channel vs. one channel
• Probe vs. probeset vs. gene
Quantile Normalization
Sort each array (#1, #2, #3), then average across arrays at each rank to obtain SortedAvg.
Val(Probe_i) = SortedAvg[Rank(Probe_i)]
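The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched in a few lines of Python. This is a minimal version that assumes every array has the same number of probes and breaks ties by probe order; the function name and the toy values are illustrative only.

```python
def quantile_normalize(arrays):
    """arrays: one list of probe values per array, all of equal length."""
    n = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # SortedAvg[r]: mean across arrays of the r-th smallest value.
    sorted_avg = [sum(col[r] for col in sorted_cols) / len(arrays) for r in range(n)]
    result = []
    for col in arrays:
        # Probe indices ordered from lowest to highest value in this array.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = sorted_avg[rank]  # Val(Probe_i) = SortedAvg[Rank(Probe_i)]
        result.append(out)
    return result

# Three toy arrays (#1, #2, #3) with three probes each.
arrays = [[5, 2, 3], [4, 1, 7], [3, 9, 5]]
normalized = quantile_normalize(arrays)
# normalized == [[7.0, 2.0, 4.0], [4.0, 2.0, 7.0], [2.0, 7.0, 4.0]]
```

After normalization every array contains exactly the same set of values (here 2, 4, 7), so the arrays share one distribution while each probe keeps its within-array rank.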
Invariant Set Normalization
Before Normalization
After Normalization
Invariant set
Good to Check the Image
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
Group A Group B
SAM Two-Class Unpaired
Permutation tests
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B, respectively. Compute the d-value for each randomized gene.
Original grouping (Gene 1): Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6, split into Group A and Group B
Randomized grouping (Gene 1): Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6, split into Group A and Group B
iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
Significant positive genes (i.e., mean expression of group B > mean expression of group A)
Significant negative genes (i.e., mean expression of group A > mean expression of group B)
“Observed d = expected d” line
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Moving outward from the origin in the +ve or –ve direction along the x-axis, find the first gene whose observed d-value differs from its expected d-value by at least delta; that gene and every gene beyond it in the same direction are considered significant.
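The permutation procedure in steps i) to vi) and the delta rule can be sketched as follows. This is a simplified illustration, not the SAM software itself: the d-value here uses a pooled standard error plus a fixed fudge factor `s0` (an assumed value of 0.1), the number of permutations and the toy matrix are arbitrary, and the function names are hypothetical.

```python
import math
import random

def d_value(a, b, s0=0.1):
    """d = (mean_B - mean_A) / (s + s0), a regularized t-like score."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / na + 1 / nb) * ss / (na + nb - 2))  # pooled standard error
    return (mean_b - mean_a) / (s + s0)

def sam_two_class(matrix, cols_a, cols_b, delta, n_perm=200, seed=0):
    """matrix[g][j] is the expression of gene g in experiment j."""
    rng = random.Random(seed)

    def all_d(ca, cb):
        return [d_value([row[j] for j in ca], [row[j] for j in cb]) for row in matrix]

    observed = all_d(cols_a, cols_b)                                # step i
    order = sorted(range(len(matrix)), key=lambda g: observed[g])   # step ii
    cols = list(cols_a) + list(cols_b)
    expected = [0.0] * len(matrix)
    for _ in range(n_perm):                                         # step v
        rng.shuffle(cols)                                           # step iii
        permuted = sorted(all_d(cols[:len(cols_a)], cols[len(cols_a):]))  # step iv
        for rank, d in enumerate(permuted):
            expected[rank] += d / n_perm                            # average per rank
    n = len(order)
    diff = [observed[order[r]] - expected[r] for r in range(n)]
    # Walk outward from the origin of the expected axis in each direction.
    origin = next((r for r in range(n) if expected[r] >= 0), n)
    up = next((r for r in range(origin, n) if diff[r] >= delta), n)
    down = next((r for r in range(origin - 1, -1, -1) if -diff[r] >= delta), -1)
    return {"positive": order[up:], "negative": order[:down + 1]}

# Toy matrix: gene 0 is clearly higher in group B; genes 1 to 3 are noise.
matrix = [
    [1.0, 1.1, 0.9, 1.0, 1.05, 5.0, 5.1, 4.9, 5.0, 5.05],
    [2.0, 2.1, 2.2, 2.0, 2.1, 2.1, 2.0, 2.2, 2.1, 2.0],
    [3.0, 3.2, 3.1, 3.0, 3.1, 3.1, 3.0, 3.2, 3.1, 3.0],
    [1.5, 1.6, 1.4, 1.5, 1.6, 1.6, 1.5, 1.4, 1.5, 1.6],
]
result = sam_two_class(matrix, cols_a=range(0, 5), cols_b=range(5, 10), delta=3.0)
```

Each permutation reshuffles whole experiments, so all genes are permuted together and the group sizes stay fixed, as step iii) requires. A larger delta yields fewer significant genes.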
GenePattern: http://genepattern.broadinstitute.org/
AutoSOME: http://jimcooperlab.mcdb.ucsb.edu/autosome/
Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117
Aaron Newman
Gene Set Analysis
Cell Cycle
Transcription factor
TGF-beta Signaling Pathway
Wnt-signaling Pathway
Protein-protein interaction network
Your Gene Set
Compute enrichment in pathways and networks
Tools: GSEA, DAVID, Toppfun, MSigDB, and STRING
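One common way to "compute enrichment" of a gene set in a pathway is a hypergeometric over-representation test. The sketch below shows that general idea only; it is not the exact procedure of any tool listed above (GSEA, for instance, uses a rank-based running-sum statistic), and the function name and example counts are assumptions.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) if n_selected genes were drawn at random
    from a universe of n_universe genes containing n_pathway pathway genes."""
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20,000 genes, a 150-gene pathway,
# a 200-gene set of interest, 12 genes in common.
p = hypergeom_enrichment_p(20000, 150, 200, 12)
```

A small p-value suggests the overlap is larger than expected by chance; when many pathways are tested, the p-values need a multiple-testing correction.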
BOOLEAN ANALYSIS
Boolean Implication
• Analyze pairs of genes.
• Analyze the four different quadrants.
• Identify sparse quadrants.
• Record the Boolean relationships.
  – If ACPP high, then GABRB1 low
  – If GABRB1 high, then ACPP low
Scatter plot: ACPP vs. GABRB1 expression
[Sahoo et al. Genome Biology 08]
45,000 Affymetrix microarrays
Threshold Calculation
• A threshold is determined for each gene.
• The arrays are sorted by gene expression.
• StepMiner is used to determine the threshold.
Figure: CDH expression across sorted arrays, with the Threshold and the High, Low, and Intermediate regions marked [Sahoo et al. 07]
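The threshold idea can be sketched as fitting a single step to the sorted expression values. This is a simplified illustration rather than the published StepMiner implementation: it picks the split that minimizes the squared error of a two-level fit and takes the midpoint of the two levels as the threshold. The `margin` of 0.5 around the threshold for the intermediate region is an assumption, as are the function names.

```python
def step_threshold(values):
    """Fit one step to the sorted values; return the midpoint of the two levels."""
    xs = sorted(values)
    n = len(xs)
    best = None
    for k in range(1, n):
        left, right = xs[:k], xs[k:]
        mean_l, mean_r = sum(left) / k, sum(right) / (n - k)
        sse = (sum((x - mean_l) ** 2 for x in left)
               + sum((x - mean_r) ** 2 for x in right))
        if best is None or sse < best[0]:
            best = (sse, (mean_l + mean_r) / 2)
    return best[1]

def discretize(value, threshold, margin=0.5):
    """Label a value as high, low, or intermediate (margin is an assumed width)."""
    if value >= threshold + margin:
        return "high"
    if value <= threshold - margin:
        return "low"
    return "intermediate"

# Toy expression values for one gene across six arrays.
expression = [2.0, 2.2, 1.8, 8.0, 8.2, 7.8]
threshold = step_threshold(expression)
labels = [discretize(v, threshold) for v in expression]
```

Values inside the intermediate band are typically left out when counting arrays per quadrant, so that noise near the threshold does not create spurious high or low calls.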
BooleanNet Statistics
[Sahoo et al. Genome Biology 08]
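a_{ij} = number of arrays in the quadrant where A is i and B is j (0 = low, 1 = high)
\[ n_{A\,\mathrm{low}} = a_{00} + a_{01}, \qquad n_{B\,\mathrm{low}} = a_{00} + a_{10} \]
\[ \mathrm{total} = a_{00} + a_{01} + a_{10} + a_{11}, \qquad \mathrm{observed} = a_{00} \]
\[ \mathrm{expected} = \left( \frac{n_{A\,\mathrm{low}}}{\mathrm{total}} \cdot \frac{n_{B\,\mathrm{low}}}{\mathrm{total}} \right) \cdot \mathrm{total} \]
\[ \mathrm{statistic} = \frac{\mathrm{expected} - \mathrm{observed}}{\sqrt{\mathrm{expected}}} \]
\[ \mathrm{error\ rate} = \frac{1}{2} \left( \frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}} \right) \]
Boolean Implication = (statistic > 3, error rate < 0.1)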
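The sparsity test for the low-low quadrant translates directly into code. The sketch below follows the formulas on this slide; the function names and the example counts are made up for illustration, and it assumes neither gene is high in every array (otherwise a denominator is zero).

```python
import math

def low_low_quadrant_stats(a00, a01, a10, a11):
    """Return (statistic, error_rate) for the quadrant where A and B are both low."""
    total = a00 + a01 + a10 + a11
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    expected = (n_a_low / total) * (n_b_low / total) * total
    observed = a00
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    return statistic, error_rate

def is_sparse(a00, a01, a10, a11, min_statistic=3.0, max_error=0.1):
    """Apply the thresholds statistic > 3 and error rate < 0.1."""
    statistic, error_rate = low_low_quadrant_stats(a00, a01, a10, a11)
    return statistic > min_statistic and error_rate < max_error

# Hypothetical counts: almost no arrays have both genes low.
stat, err = low_low_quadrant_stats(a00=2, a01=98, a10=98, a11=2)
sparse = is_sparse(2, 98, 98, 2)
```

A sparse low-low quadrant corresponds to the implication "if A low, then B high". The other three quadrants can be tested the same way by relabeling the counts.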
Six Boolean Implications
[Sahoo et al. Genome Biology 08]
MiDReG Algorithm
[Sahoo et al. PNAS 2010]
MiDReG = (Mining Developmentally Regulated Genes)
B Cell Genes
[Sahoo et al. PNAS 2010]
CD19
KIT
Boolean Implications
Jun Seita
[Seita, Sahoo et al. PLoS ONE, 2012]
http://gexc.stanford.edu
SEQUENCING DATA ANALYSIS
Sequencing Data Format
FASTQ
    @HWI-EAS209:5:58:5894:21141#ATCACG/1
    TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
    +HWI-EAS209:5:58:5894:21141#ATCACG/1
    efcfffffcfeefffcffffffddf`feed]`]_Ba
FASTA
    >SEQUENCE_1
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
FASTQ quality encodings
    S - Sanger: Phred+33, range (0, 40)
    X - Solexa: Solexa+64, range (-5, 40)
    I - Illumina 1.3+: Phred+64, range (0, 40)
    J - Illumina 1.5+: Phred+64, range (3, 40)
    L - Illumina 1.8+: Phred+33, range (0, 41)
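To show how the four FASTQ lines and the quality offsets fit together, here is a small sketch that splits one record and decodes its quality string. The function names are hypothetical, and the choice of offset 64 for this read is an assumption: its quality characters (such as `e` and `f`) lie above the range a Phred+33 encoding with scores up to 41 would produce.

```python
def parse_fastq_record(lines):
    """Split a four-line FASTQ record into (read_id, sequence, quality)."""
    header, sequence, plus, quality = [ln.strip() for ln in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

def phred_scores(quality, offset):
    """Decode quality characters; offset is 33 or 64 depending on the encoding."""
    return [ord(ch) - offset for ch in quality]

record = [
    "@HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT",
    "+HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "efcfffffcfeefffcffffffddf`feed]`]_Ba",
]
read_id, sequence, quality = parse_fastq_record(record)
scores = phred_scores(quality, offset=64)  # assumed Phred+64 for this read
```

Decoding with the wrong offset shifts every score by 31, so it is worth checking the range of characters in a file before choosing between 33 and 64.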
Mapping
Mapping Software
• Long reads
  – BLAST, HMMER, SSEARCH
• Short reads
  – BLAT
  – Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA
Visualizations
• UCSC Genome Browser
• GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2
• Integrative Genomics Viewer (IGV)
Quantification
• Peak calling
  – QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
• Expression quantification
  – Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
• SNP calling
  – samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH
Peak Discovery
[Pepke et al. Nature Methods 2009]
Transcript Quantification
[Pepke et al. Nature Methods 2009]
RPKM, FPKM
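RPKM (reads per kilobase of transcript per million mapped reads) normalizes a read count by both transcript length and sequencing depth. The sketch below shows the arithmetic; the function name and the example numbers are made up for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical transcript: 500 reads on a 2,000 bp transcript,
# in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)  # 25.0
```

FPKM uses the same formula but counts fragments instead of reads, so both mates of a paired-end read contribute a single count.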
SNP Calling
Typical RNA-seq Workflow
[Trapnell et al. Nature Biotech 2010]