Statistics Tools in GeneSpring
The Center for Bioinformatics
UNC at Chapel Hill
Jianping Jin Ph.D.
Bioinformatics Scientist
Phone: (919)843-6015
E-mail: [email protected]
Fax: (919)966-6821
What Can GeneSpring Do?
• Works with both Affymetrix and two-color data.
• Views data graphically (classification, graph, tree, scatter plot, Venn diagram …)
• Performs statistical analyses.
• Annotates genes (updating from GenBank, LocusLink, UniGene; biochemical pathways).
• ……
What statistical analyses does GS do?
• Clustering:
  • k-means (non-hierarchical)
  • Self-organizing map
  • Gene trees (hierarchical dendrograms)
• Principal component analysis
• T-test analyses (p-values)
• Like a known gene or average of genes
• Like a pattern drawn with the mouse
• Genes with high confidence
• Genes with relative expression in certain ranges
• Pathway analysis finding genes that fit in a certain place in a pathway.
• Sequence analysis to automatically find regulatory sequences.
• Automatic functional annotation of sub-trees in dendrograms.
• …
Tree Clustering
1. Standard correlation
2. Smooth correlation
3. Change correlation
4. Upregulated correlation
5. Pearson correlation
6. Spearman correlation
7. Spearman confidence
8. Two-sided Spearman confidence
9. Distance
Notation for the Formulas
Result: the result of the calculation for genes A and B.
n: the number of samples being correlated over.
a: the vector $(a_1, a_2, a_3, \ldots, a_n)$ of expression values for gene A.
b: the vector $(b_1, b_2, b_3, \ldots, b_n)$ of expression values for gene B.
$a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
$|a| = \sqrt{a \cdot a}$
Standard Correlation
• Equation: $a \cdot b / (|a||b|)$
• Also called “Pearson correlation around zero”.
• Measures the angular separation of the expression vectors for genes A and B.
• Answers the question “do the peaks match up?”
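The equation above can be sketched in a few lines of Python. This is only an illustration of the formula, not GeneSpring code; the function names and the two expression profiles are made up for the example.

```python
import math

def dot(a, b):
    """Dot product a.b = a_1*b_1 + ... + a_n*b_n."""
    return sum(x * y for x, y in zip(a, b))

def standard_correlation(a, b):
    """a.b / (|a||b|): the cosine of the angle between the two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical profiles: gene_b is gene_a scaled by 2, so the peaks line up.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
print(standard_correlation(gene_a, gene_b))  # very close to 1
```

Because only the angle matters, scaling a profile by a positive constant leaves the result unchanged.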
Pearson Correlation
• Equation: $A \cdot B / (|A||B|)$
• Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of the expression vectors.
• A: the vector obtained by subtracting each element of a from the mean of all elements of a, i.e. $A_i = \bar{a} - a_i$.
• Do the same for b to make a vector B.
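A minimal sketch of this calculation, with hypothetical function and variable names. It follows the slide's definition (mean minus each element); subtracting the other way round gives the same result because both vectors flip sign together.

```python
import math

def pearson_correlation(a, b):
    """A.B / (|A||B|), where A and B are a and b measured around their means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [mean_a - x for x in a]
    B = [mean_b - y for y in b]
    a_dot_b = sum(x * y for x, y in zip(A, B))
    norm_a = math.sqrt(sum(x * x for x in A))
    norm_b = math.sqrt(sum(y * y for y in B))
    return a_dot_b / (norm_a * norm_b)

# Hypothetical profiles: same shape, different baseline.
print(pearson_correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))  # very close to 1
```

Unlike the standard correlation, a constant offset added to a profile does not change the result. A completely flat profile has zero length after centering, so the sketch would divide by zero for it.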
Spearman Confidence
• r = the value of the Spearman correlation; SC = 1 - (probability you would get a value of r or higher by chance).
• A measure of similarity, not a correlation.
• The SC value is high when the Spearman correlation is high and the p-value is low.
• Takes account of the number of sub-experiments in your experiment set.
Two-sided Spearman Confidence
• A measure of similarity, very similar to the Spearman confidence.
• A two-sided test of whether the Spearman correlation is significantly greater than or less than zero.
• Answers the question “which genes behave similarly or opposite to a specific gene?”
• Probably not good for k-means or tree clustering.
• Equation: 1 - (probability you would get a Spearman correlation of |r| or higher, or -|r| or lower, by chance).
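Both confidence measures can be illustrated with a small sketch. The slides do not say how GeneSpring computes the chance probability, so this example assumes an exact enumeration of all rank permutations, which is only practical for a handful of samples. It also assumes there are no tied values. All names are hypothetical.

```python
from itertools import permutations

def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def spearman_confidence(a, b, two_sided=False):
    """1 - P(chance value at least as extreme as r), by exact enumeration."""
    r = spearman(a, b)
    base = list(range(1, len(a) + 1))
    hits = total = 0
    for perm in permutations(base):
        r_perm = spearman(base, perm)
        total += 1
        if two_sided:
            hits += abs(r_perm) >= abs(r) - 1e-12
        else:
            hits += r_perm >= r - 1e-12
    return 1 - hits / total

# Hypothetical profiles over five samples, perfectly rank-correlated.
a = [0.1, 0.4, 0.9, 1.3, 2.0]
b = [5.0, 6.0, 7.5, 8.0, 9.9]
print(spearman_confidence(a, b))                  # 1 - 1/120
print(spearman_confidence(a, b, two_sided=True))  # 1 - 2/120
```

With more samples the same correlation yields a higher confidence, which is how the measure takes the number of sub-experiments into account.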
Distance
• A measurement of dissimilarity, not a correlation at all.
• Euclidean distance between the expression profiles (values for each point in N-dimensional space) of genes A and B.
• Distance = $|a - b| / \sqrt{N}$, where N is the number of experiment points.
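As a quick illustration of the distance formula, here is a sketch with hypothetical names and data:

```python
import math

def distance(a, b):
    """Euclidean distance |a - b| divided by the square root of N."""
    n = len(a)
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return euclidean / math.sqrt(n)

# Hypothetical profiles that differ by 2 at every one of the 4 points.
print(distance([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]))  # 2.0
```

Dividing by the square root of N keeps the value comparable between experiments with different numbers of points.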
Special Case Correlations
• Smooth correlation, Change correlation and Upregulated correlation.
• All three are modified versions of the standard correlation.
• They only make sense when the data form a sequence, such as “before”/“after”, a time series, or a drug series.
Smooth Correlation
• Make a new vector A from a by interpolating the average of each consecutive pair of elements of a.
• Insert this new value between the old values.
• Do this for each pair of elements that would be connected by a line in the graph screen.
• Do the same to make a vector B from b.
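The interpolation step can be sketched as follows. The slide does not spell out the final formula, so this example assumes the standard correlation $A \cdot B / (|A||B|)$ is then applied to the smoothed vectors. It also treats every consecutive pair as connected by a line. Function names are hypothetical.

```python
import math

def smooth(a):
    """Insert the average of each consecutive pair between the old values."""
    out = []
    for current, following in zip(a, a[1:]):
        out.append(current)
        out.append((current + following) / 2)
    out.append(a[-1])  # assumes a is not empty
    return out

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def smooth_correlation(a, b):
    """Standard correlation of the smoothed vectors (an assumption, see above)."""
    return standard_correlation(smooth(a), smooth(b))

print(smooth([1.0, 3.0, 5.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```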
Change Correlation
• The opposite of what the smooth correlation looks for: only the change in expression level between adjacent points matters.
• Similar to the standard correlation, but uses an arc tangent transformation of the ratio between adjacent pairs of points to create the expression vector. This is less sensitive to outliers than using the ratio directly.
• The value created between two values $a_i$ and $a_{i+1}$ is $\arctan(a_{i+1}/a_i) - \pi/4$.
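A short sketch of the transformation, assuming positive, non-zero expression values and assuming the standard correlation is applied to the transformed vectors. Names are hypothetical.

```python
import math

def change_vector(a):
    """atan(a[i+1] / a[i]) - pi/4 for each adjacent pair; 0 means no change."""
    return [math.atan(following / current) - math.pi / 4
            for current, following in zip(a, a[1:])]

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def change_correlation(a, b):
    """Standard correlation of the two change vectors (an assumption)."""
    return standard_correlation(change_vector(a), change_vector(b))

# Hypothetical profiles with the same up-then-down pattern at different scales.
print(change_vector([1.0, 2.0, 1.0]))
print(change_correlation([1.0, 2.0, 1.0], [2.0, 4.0, 2.0]))  # very close to 1
```

Since atan(1) equals pi/4, an unchanged value maps to 0, a doubling to a positive number and a halving to the matching negative number. A profile that never changes produces an all-zero vector, for which this sketch would divide by zero.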
Upregulated Correlation
• Very similar to the change correlation, but it only considers positive changes. All negative values for the arc tangent are set to zero.
• Make a new vector A from a by looking at the change between each pair of elements of a.
• The value created between two values $a_i$ and $a_{i+1}$ is $\max(\arctan(a_{i+1}/a_i) - \pi/4,\ 0)$.
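The clipping step can be sketched like this, again assuming positive expression values and assuming the standard correlation is applied to the resulting vectors. Names are hypothetical.

```python
import math

def upregulated_vector(a):
    """max(atan(a[i+1] / a[i]) - pi/4, 0): keep increases, zero out decreases."""
    return [max(math.atan(following / current) - math.pi / 4, 0.0)
            for current, following in zip(a, a[1:])]

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def upregulated_correlation(a, b):
    """Standard correlation of the clipped change vectors (an assumption)."""
    return standard_correlation(upregulated_vector(a), upregulated_vector(b))

# Hypothetical profiles: both rise at the first step and fall at the second.
print(upregulated_vector([1.0, 2.0, 1.0]))  # second entry is 0.0
print(upregulated_correlation([1.0, 2.0, 1.0], [1.0, 3.0, 1.0]))
```

A profile with no increases at all becomes an all-zero vector, which this sketch cannot correlate.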
Algorithm to Build Gene Tree
1. Determine if there is only one gene or subtree left. If yes, go to step five.
2. Find the two closest genes/subtrees.
3. Merge these two into one subtree.
4. Return to step one.
5. Merge together branches where the distance between sub-branches is less than the separation ratio, subject to considering genes with less than the minimum distance apart.
Algorithm to Build Tree
• The minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in a group, making the groups less specific.
• The separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation ratio increases the branchiness of the tree.
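The merge loop of the gene tree algorithm (find the two closest genes or subtrees, merge them, repeat until one is left) can be sketched as below. The slides do not state how the distance between two subtrees is measured, so this example assumes average pairwise Euclidean distance. The final branch-merging pass that uses the separation ratio and minimum distance is left out. All names and data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, profiles):
    """Average pairwise distance between members (average linkage, an assumption)."""
    pairs = [(i, j) for i in c1["genes"] for j in c2["genes"]]
    return sum(euclidean(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

def build_tree(profiles):
    """profiles maps gene name -> expression vector; returns a nested tuple."""
    clusters = [{"genes": [name], "tree": name} for name in profiles]
    while len(clusters) > 1:  # stop when only one subtree is left
        # Find the two closest genes/subtrees.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j], profiles), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge these two into one subtree, then loop again.
        merged = {
            "genes": clusters[i]["genes"] + clusters[j]["genes"],
            "tree": (clusters[i]["tree"], clusters[j]["tree"]),
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]["tree"]

profiles = {"g1": [1.0, 2.0, 3.0], "g2": [1.0, 2.0, 3.1], "g3": [9.0, 1.0, 0.0]}
print(build_tree(profiles))  # ('g3', ('g1', 'g2'))
```

Any of the similarity measures from the tree clustering list could replace the Euclidean distance here, for example one minus a correlation.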
Principal Components Analysis
• Not a clustering method.
• The principal components are the most abundant building blocks of the data: a set of expression patterns.
• The 1st PC is obtained by finding the linear combination of expression patterns that accounts for the most variability in the data. The 2nd PC accounts for the most of the remaining variability, and so on.
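To make the idea of the 1st PC concrete, here is a small sketch that finds the direction of greatest variability by power iteration on the covariance matrix. Power iteration is chosen here only because it fits in a few lines; it is not taken from the slides and is not necessarily what GeneSpring does. Names and data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: one expression profile per row. Returns a unit vector (the 1st PC)."""
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]
    # Sample covariance matrix between columns.
    cov = [[sum(r[i] * r[j] for r in centered) / (len(rows) - 1)
            for j in range(n_cols)] for i in range(n_cols)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the start vector is not orthogonal to the 1st PC.
    v = [1.0] * n_cols
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data in which the second column is always twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print(first_principal_component(rows))  # proportional to (1, 2)
```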
k-Means Clustering
• Divides genes into a user-defined number (k) of equal-sized groups, based on their expression patterns.
• Creates centroids at the average location of each group of genes.
• With each iteration, genes are reassigned to the group with the closest centroid.
• After all of the genes have been reassigned, the locations of the centroids are recalculated.
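The steps above can be sketched as follows. The example reads "equal-sized groups" as the starting partition and splits the gene list in order, which is an assumption. It uses Euclidean distance to the centroid and needs Python 3.8 or later for `math.dist`. Names and data are hypothetical.

```python
import math

def centroid(group):
    """Average location of a group of expression profiles."""
    return [sum(p[d] for p in group) / len(group) for d in range(len(group[0]))]

def k_means(profiles, k, iterations=10):
    """Returns a group label (0..k-1) for each profile."""
    # Start from k (roughly) equal-sized groups, taken in list order.
    labels = [i * k // len(profiles) for i in range(len(profiles))]
    centroids = [None] * k
    for _ in range(iterations):
        # Recalculate centroids; an emptied group keeps its previous centroid.
        for g in range(k):
            members = [p for p, label in zip(profiles, labels) if label == g]
            if members:
                centroids[g] = centroid(members)
        # Reassign every gene to the group with the closest centroid.
        new_labels = [min(range(k), key=lambda g: math.dist(p, centroids[g]))
                      for p in profiles]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

profiles = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
print(k_means(profiles, k=2))  # [0, 1, 0, 1]
```

The starting partition mixes the two obvious groups, and one round of reassignment is enough to separate them.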
Self-Organizing Maps
• Similar to k-means clustering.
• Shows the relationship between groups in a 2-D map.
• Best represents the variability of the data while still maintaining similarity between adjacent nodes, e.g. point 1,2 is one unit away from point 1,3.
What does t-test mean in GS?
• Replicates: one-sample Student’s t-test.
• Comparisons for 2 groups: Student’s two-sample t-test.
• Comparisons for multiple groups: one-way analysis of variance (ANOVA).
• Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).
Filter Genes Analysis Tools
• Global Error Model: filters out genes with large standard deviations or error values.
• Raw data filtering: gets rid of genes too close to the background.
• Sample-to-sample comparison: fold comparison among different samples.
• Statistical Group Comparison: filters out genes that do not vary significantly across different groups.
• Data File Restriction: based on other fields (P/S call, +/- pairs).
Statistical Group Comparison
• Finds genes with a statistically significant difference in mean expression level across all groups.
• For two groups: Student’s two-sample t-test.
• For multiple groups: ANOVA.
• Non-parametric comparison: for each gene, the rank order is used for analysis. Uses the Wilcoxon two-sample test (Mann-Whitney U test) for two groups and the Kruskal-Wallis test for multiple groups.
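For the two-group case, the test statistics can be sketched as below. The example uses the textbook pooled-variance form of Student's two-sample t and the pair-counting form of the Mann-Whitney U statistic. Turning either statistic into a p-value needs a distribution table or a statistics library and is omitted. Names and data are hypothetical.

```python
import math
from statistics import mean, variance

def student_t(x, y):
    """Two-sample Student's t with pooled variance; df = len(x) + len(y) - 2."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))

def mann_whitney_u(x, y):
    """Number of (x, y) pairs in which x is larger; ties count as one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

# Hypothetical expression levels of one gene in two groups of samples.
group_1 = [1.0, 2.0, 3.0]
group_2 = [4.0, 5.0, 6.0]
print(student_t(group_1, group_2))       # about -3.67
print(mann_whitney_u(group_1, group_2))  # 0.0
```

The U statistic depends only on the rank order of the values, which is why it suits the non-parametric comparison.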
Data Normalization
• In two-color experiments, normalize each gene vs. the control channel (green).
• Normalize each sample to itself or to a positive control. This makes different samples comparable to one another.
• Normalize each gene to itself: this removes the differing intensity scales from multiple experiment readings (highly recommended if not using a two-color experiment).
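The three normalizations can be sketched on a small matrix with one row per gene and one column per sample. The slides do not say which summary value each sample or gene is divided by, so the median is used here as an assumption. Names and data are hypothetical.

```python
from statistics import median

def normalize_to_control(signal, control):
    """Two-color data: divide each gene's signal by its control-channel value."""
    return [s / c for s, c in zip(signal, control)]

def normalize_per_sample(matrix):
    """Divide each column (sample) by its own median (assumed summary value)."""
    col_medians = [median(col) for col in zip(*matrix)]
    return [[v / m for v, m in zip(row, col_medians)] for row in matrix]

def normalize_per_gene(matrix):
    """Divide each row (gene) by its own median (assumed summary value)."""
    return [[v / median(row) for v in row] for row in matrix]

# Hypothetical raw data: sample 2 was scanned twice as brightly as sample 1.
raw = [[2.0, 4.0],
       [4.0, 8.0],
       [6.0, 12.0]]
per_sample = normalize_per_sample(raw)
print(per_sample)                      # [[0.5, 0.5], [1.0, 1.0], [1.5, 1.5]]
print(normalize_per_gene(per_sample))  # every value becomes 1.0
```

After per-sample normalization the two samples agree, and per-gene normalization then puts genes with different intensity scales on a common footing.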
NCI-60 cell lines
DrugActivity_AT