Statistics Tools in GeneSpring

The Center for Bioinformatics

UNC at Chapel Hill

Jianping Jin Ph.D.

Bioinformatics Scientist

Phone: (919)843-6015

E-mail: [email protected]

(919)966-6821

What Can GeneSpring Do?

• Works with both Affymetrix and two-color data.
• Views data graphically (classification, graph, tree, scatter plot, Venn diagram …).
• Performs statistical analyses.
• Annotates genes (updating from GenBank, LocusLink, UniGene; biochemical pathways).
• ……

• Clustering:
  • k-means (non-hierarchical)
  • Self-organizing map
  • Gene trees (hierarchical dendrograms)
• Principal component analysis
• t-test analyses (p-values)
• Like a known gene or average of genes
• Like a pattern drawn with the mouse
• Genes with high confidence
• Genes with relative expression in certain ranges
• Pathway analysis: finding genes that fit in a certain place in a pathway
• Sequence analysis to automatically find regulatory sequences
• Automatic functional annotation of sub-trees in dendrograms
• …

What statistical analyses does GS do?

Tree Clustering

1. Standard correlation
2. Smooth correlation
3. Change correlation
4. Upregulated correlation
5. Pearson correlation
6. Spearman correlation
7. Spearman confidence
8. Two-sided Spearman confidence
9. Distance

Notation for the Formulas

Result: the result of the calculation for genes A and B.

n: the number of samples being correlated over.

a: the vector $(a_1, a_2, a_3, \ldots, a_n)$ of expression values for gene A.

b: the vector $(b_1, b_2, b_3, \ldots, b_n)$ of expression values for gene B.

$a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$ and $|a| = \sqrt{a \cdot a}$

Standard Correlation

• Equation: $a \cdot b / (|a||b|)$
• Also called “Pearson correlation around zero”.
• Measures the angular separation of the expression vectors for genes A and B.
• Answers the question “do the peaks match up?”
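The equation above can be sketched in a few lines of Python. This is an illustration of the formula only, not GeneSpring's own code; the function name and the example vectors are made up for this sketch.

```python
import math

def standard_correlation(a, b):
    """Return a.b / (|a||b|), the cosine of the angle between a and b."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("correlation is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical expression profiles over three samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same shape as gene_a at twice the scale
gene_c = [3.0, 0.0, -1.0]  # peaks in different places

print(standard_correlation(gene_a, gene_b))  # peaks match up: close to 1
print(standard_correlation(gene_a, gene_c))  # orthogonal vectors: close to 0
```

Because only the angle matters, scaling a profile by a positive constant leaves the result unchanged.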

Pearson Correlation

• Equation: $A \cdot B / (|A||B|)$
• Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of each expression vector.
• To make the vector A, subtract the mean of all elements in vector a from each element in a.
• Do the same for b to make a vector B.
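A minimal sketch of the mean-centering step followed by the same angle formula. The function name and example profiles are illustrative assumptions, not part of GeneSpring.

```python
import math

def pearson_correlation(a, b):
    """Center each vector on its own mean, then compute A.B / (|A||B|)."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [x - mean_a for x in a]
    B = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(A, B))
    norm_A = math.sqrt(sum(x * x for x in A))
    norm_B = math.sqrt(sum(y * y for y in B))
    if norm_A == 0 or norm_B == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return dot / (norm_A * norm_B)

# Hypothetical profiles: gene_a rises while gene_b falls.
gene_a = [1.0, 2.0, 3.0]
gene_b = [13.0, 12.0, 11.0]

print(pearson_correlation(gene_a, gene_b))  # close to -1: opposite trends
```

For these two profiles the standard correlation (around zero) is still strongly positive, because both vectors point into the same quadrant; centering on the mean is what exposes the opposite trends.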

Spearman Confidence

• With r = the value of the Spearman correlation, SC = 1 - (probability you would get a value of r or higher by chance).
• A measure of similarity, not a correlation.
• The SC value is high when the Spearman correlation is high and the p-value is low.
• Takes account of the number of sub-experiments in your experiment set.

Two-sided Spearman Confidence

• A measure of similarity, very similar to the Spearman confidence.
• A two-sided test of whether the Spearman correlation is significantly greater than or less than zero.
• Answers the question “what genes behave similarly or opposite to a specific gene?”
• Probably not good for k-means or tree clustering.
• Value: 1 - (probability you would get a Spearman correlation of |r| or higher, or -|r| or lower, by chance).
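The one-sided and two-sided confidences can be sketched as follows. The slides do not say how GeneSpring computes the chance probability, so this sketch assumes an exact permutation null (every ordering of the ranks equally likely), which is only practical for a handful of samples. All function names are illustrative.

```python
import math
from itertools import permutations

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dot = sum((p - mx) * (q - my) for p, q in zip(x, y))
    nx = math.sqrt(sum((p - mx) ** 2 for p in x))
    ny = math.sqrt(sum((q - my) ** 2 for q in y))
    if nx == 0 or ny == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return dot / (nx * ny)

def spearman_confidence(a, b, two_sided=False):
    """1 - P(correlation at least as extreme as r), by exact permutation."""
    ra, rb = ranks(a), ranks(b)
    r = pearson(ra, rb)  # Spearman correlation = Pearson on the ranks
    eps = 1e-12          # tolerance for floating-point comparisons
    hits = total = 0
    for perm in permutations(rb):  # n! orderings: keep n small
        total += 1
        rp = pearson(ra, perm)
        if two_sided:
            hits += abs(rp) >= abs(r) - eps
        else:
            hits += rp >= r - eps
    return 1 - hits / total

# Hypothetical profiles that rise together over five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 5.0, 9.0, 11.0]

print(spearman_confidence(gene_a, gene_b))                  # 1 - 1/120
print(spearman_confidence(gene_a, gene_b, two_sided=True))  # 1 - 2/120
```

With only three samples, the same perfect agreement would give a one-sided confidence of just 1 - 1/6, which illustrates how the measure takes the number of sub-experiments into account.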

Distance

• A measure of dissimilarity, not a correlation at all.
• Euclidean distance between the expression profiles (values for each point in N-dimensional space) of genes A and B.
• Distance = $|a - b| / \sqrt{N}$, where N is the number of experiment points.
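A short sketch of the distance formula above; the function name and sample values are illustrative.

```python
import math

def distance(a, b):
    """Euclidean distance between two profiles, divided by sqrt(N)."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared) / math.sqrt(len(a))

# Hypothetical profiles that differ by 2 at every one of four points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [3.0, 4.0, 5.0, 6.0]

print(distance(gene_a, gene_b))  # |a-b| = 4, sqrt(N) = 2, so 2.0
```

Dividing by the square root of N keeps the value comparable between experiments with different numbers of points.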

Special Case Correlations

• Smooth correlation, Change correlation and Upregulated correlation.
• All three are modified versions of the standard correlation.
• They only make sense when the data are in a sequence, such as “before”/“after”, a time series, or a drug series.

Smooth Correlation

• Make a new vector A from a by interpolating the average of each consecutive pair of elements of a.
• Insert this new value between the old values.
• Do this for each pair of elements that would be connected by a line in the graph screen.
• Do the same to make a vector B from b.
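The interpolation step can be sketched as below. Two assumptions are made here: every consecutive pair of points is connected by a line in the graph, and the smoothed vectors A and B are then compared with the standard correlation (the slides describe this measure as a modified standard correlation). Function names are illustrative.

```python
import math

def smooth(a):
    """Insert the average of each consecutive pair between the old values."""
    if not a:
        return []
    out = []
    for left, right in zip(a, a[1:]):
        out.append(left)
        out.append((left + right) / 2)
    out.append(a[-1])
    return out

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def smooth_correlation(a, b):
    # Assumption: standard correlation applied to the smoothed vectors.
    return standard_correlation(smooth(a), smooth(b))

print(smooth([1.0, 3.0, 5.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(smooth_correlation([1.0, 3.0, 5.0], [2.0, 6.0, 10.0]))
```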

Change Correlation

• The opposite of what the smooth correlation looks for: it considers only the change in expression level between adjacent points.
• Similar to the standard correlation, but uses an arctangent transformation of the ratio between adjacent pairs of points to create the expression vector. This is less sensitive to outliers than using the ratio directly.
• The value created between two values $a_i$ and $a_{i+1}$ is $\arctan(a_{i+1}/a_i) - \pi/4$.

Upregulated Correlation

• Very similar to the change correlation, but it considers only positive changes. All negative values of the arctangent term are set to zero.
• Make a new vector A from a by looking at the change between each pair of elements of a.
• The value created between two values $a_i$ and $a_{i+1}$ is $\max(\arctan(a_{i+1}/a_i) - \pi/4,\ 0)$.
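A sketch of the arctangent transformation used by the change and upregulated correlations. It assumes strictly positive expression values (so the ratio is defined) and that the transformed vectors are compared with the standard correlation. Function names are illustrative.

```python
import math

def change_vector(a):
    """atan(a[i+1] / a[i]) - pi/4 for each adjacent pair; 0 means no change."""
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(a, a[1:])]

def upregulated_vector(a):
    """Same as change_vector, with negative changes set to zero."""
    return [max(v, 0.0) for v in change_vector(a)]

def standard_correlation(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def change_correlation(a, b):
    return standard_correlation(change_vector(a), change_vector(b))

def upregulated_correlation(a, b):
    return standard_correlation(upregulated_vector(a), upregulated_vector(b))

# Hypothetical profiles with the same fold changes at different scales.
gene_a = [1.0, 2.0, 4.0, 2.0]
gene_b = [3.0, 6.0, 12.0, 6.0]

print(change_vector(gene_a))       # two increases, then one decrease
print(upregulated_vector(gene_a))  # the decrease is clipped to 0.0
print(change_correlation(gene_a, gene_b))
```

Since atan(1) = pi/4, an unchanged value maps to 0, and the arctangent bounds every entry between -pi/4 and pi/4, which is why a single extreme ratio cannot dominate the result.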

Algorithm to Build Gene Tree

1. Determine if there is only one gene or subtree left. If yes, go to step five.
2. Find the two closest genes/subtrees.
3. Merge these two into one subtree.
4. Return to step one.
5. Merge together branches where the distance between sub-branches is less than the separation ratio, subject to considering genes with less than the minimum distance apart.

Algorithm to Build Tree

• The minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in a group and makes the groups less specific.
• The separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation ratio increases the branchiness of the tree.
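Steps one to four of the tree-building loop can be sketched as below. The slides do not say how the distance between two subtrees is measured, so this sketch assumes average linkage over the Distance measure; step five (the separation ratio and minimum distance post-processing) is not specified in enough detail to reproduce and is left out. All names are illustrative.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def build_gene_tree(profiles):
    """profiles: dict of gene name -> expression vector.
    Returns the tree as nested 2-tuples of gene names."""
    if not profiles:
        raise ValueError("need at least one gene")
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def linkage(m1, m2):
        # Assumption: average pairwise distance between the two subtrees.
        total = sum(distance(profiles[x], profiles[y]) for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:                      # step 1
        best = None
        for i in range(len(clusters)):            # step 2: closest pair
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),   # step 3: merge
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                   # step 4: loop again
    return clusters[0][0]

# Hypothetical profiles: g1/g2 are near-identical, as are g3/g4.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [9.0, 1.0, 5.0],
    "g4": [9.2, 0.8, 5.1],
}
tree = build_gene_tree(profiles)
print(tree)  # (('g1', 'g2'), ('g3', 'g4'))
```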

Principal Components Analysis

• Not a clustering method.
• PCA finds the most abundant building blocks: a set of expression patterns.
• The 1st PC is obtained by finding the linear combination of expression patterns that accounts for the most variability in the data, and so on for the remaining components.
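As a rough illustration of how a first principal component can be found, the sketch below runs power iteration on the covariance matrix. This is a generic textbook approach, not necessarily what GeneSpring does internally; the data layout (one row per gene, one column per condition) and all names are assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the unit vector along which the rows vary the most."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two observations")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data (or an unlucky start vector)
        v = [x / norm for x in w]
    return v

# Hypothetical data: the second condition is always twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(rows)
print(pc1)  # proportional to (1, 2)
```

Later components would be found the same way after removing the variation already explained by the earlier ones.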

k-Means Clustering

• Divides genes into a user-defined number (k) of groups, based on their expression patterns. The starting groups are equal-sized.
• Creates centroids at the average location of each group of genes.
• With each iteration, genes are reassigned to the group with the closest centroid.
• After all of the genes have been reassigned, the location of the centroids is recalculated.
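The loop described above can be sketched as follows. The sketch assumes that the starting groups are formed by cutting the gene list into k equal-sized runs in list order, that closeness is squared Euclidean distance, and that iteration stops when no gene changes group; none of these details are stated on the slides. Names are illustrative.

```python
def k_means(profiles, k, max_iterations=100):
    """Return a group label (0..k-1) for each expression profile."""
    n = len(profiles)
    # Assumption: start from k equal-sized groups in list order.
    labels = [i * k // n for i in range(n)]
    for _ in range(max_iterations):
        # Centroid = average location of each group's genes.
        centroids = []
        for g in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == g]
            if members:
                centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                centroids.append(None)  # an empty group attracts no genes

        def nearest(p):
            candidates = [g for g in range(k) if centroids[g] is not None]
            return min(candidates, key=lambda g: sum(
                (x - c) ** 2 for x, c in zip(p, centroids[g])))

        # Reassign every gene, then recompute centroids on the next pass.
        new_labels = [nearest(p) for p in profiles]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical profiles: four low-expression genes and two high ones.
profiles = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0],
            [0.9, 1.1], [9.1, 8.9], [1.1, 1.0]]
labels = k_means(profiles, k=2)
print(labels)
```

Like any k-means variant, the result depends on the starting groups, so a different gene order can give a different final grouping.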

Self-Organizing Maps

• Similar to k-means clustering.
• Shows the relationship between groups in a 2-D map.
• Best represents the variability of the data while still maintaining similarity between adjacent nodes, e.g. point 1,2 is one unit away from 1,3.

What does t-test mean in GS?

• Replicates: one-sample Student’s t-test.
• Comparisons for 2 groups: Student’s two-sample t-test.
• Comparisons for multiple groups: one-way analysis of variance (ANOVA).
• Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).
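The two t statistics named above can be sketched with the standard textbook formulas. The two-sample version assumes equal variances (the classic pooled Student's test); converting t to a p-value needs the t-distribution and is left out. Function names and sample values are illustrative.

```python
import math
from statistics import mean, variance

def one_sample_t(values, reference):
    """t statistic for the mean of replicate values vs. a reference value."""
    n = len(values)
    return (mean(values) - reference) / math.sqrt(variance(values) / n)

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))

# Hypothetical normalized expression values for one gene.
replicates = [1.8, 2.0, 2.2]
group_1 = [1.0, 2.0, 3.0]
group_2 = [4.0, 5.0, 6.0]

print(one_sample_t(replicates, reference=1.0))  # degrees of freedom: n - 1
print(two_sample_t(group_1, group_2))           # degrees of freedom: nx + ny - 2
```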

Filter Genes Analysis Tools

• Global Error Model: filters out genes with large standard deviations or error values.
• Raw data filtering: gets rid of genes too close to the background.
• Sample-to-sample comparison: fold comparison among different samples.
• Statistical group comparison: filters out genes that do not vary significantly across different groups.
• Data File Restriction: based on other fields (P/S call, +/- pairs).

Statistical Group Comparison

• Finds genes with a statistically significant difference in mean expression level across all groups.
• For two groups: Student’s two-sample t-test.
• For multiple groups: ANOVA.
• Non-parametric comparison: for each gene, the rank order is used for analysis. The Wilcoxon two-sample test (Mann-Whitney U test) is used for two groups, and the Kruskal-Wallis test for multiple groups.
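To illustrate the parametric and rank-based comparisons for a single gene, the sketch below computes the one-way ANOVA F statistic and the Mann-Whitney U statistic from their textbook definitions. Turning either statistic into a p-value is omitted, and the names and values are illustrative.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def mann_whitney_u(x, y):
    """Count pairs (x_i, y_j) with x_i > y_j; ties count one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

# Hypothetical expression values for one gene in three groups.
low, mid, high = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]

print(anova_f([low, mid, high]))  # 27.0
print(anova_f([low, mid]))        # 13.5, the square of the two-sample t
print(mann_whitney_u(low, mid))   # 0.0: every value in low is below mid
```

The U statistic depends only on the ordering of the values, so it is unaffected by outliers that keep their rank.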

Data Normalization

• In two-color experiments, normalize vs. the control channel (green) for each gene.
• Normalize each sample to itself or to a positive control. This makes different samples comparable to one another.
• Normalize each gene to itself: removes the differing intensity scales from multiple experiment readings (highly recommended if not using a two-color experiment).
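A minimal sketch of per-sample and per-gene normalization. It assumes that "normalizing to itself" means dividing by the median of the sample or gene; the slides do not name the statistic, so treat the median as an illustrative choice. The matrix layout (one row per gene, one column per sample) and the function names are also assumptions.

```python
from statistics import median

def normalize_per_sample(matrix):
    """Divide each value by the median of its sample (column)."""
    medians = [median(col) for col in zip(*matrix)]
    return [[v / m for v, m in zip(row, medians)] for row in matrix]

def normalize_per_gene(matrix):
    """Divide each value by the median of its gene (row) across samples."""
    return [[v / median(row) for v in row] for row in matrix]

# Hypothetical raw intensities: sample 2 was scanned twice as bright.
raw = [
    [10.0, 20.0],
    [20.0, 40.0],
    [30.0, 60.0],
]
per_sample = normalize_per_sample(raw)
print(per_sample)                      # both samples now on the same scale
print(normalize_per_gene(per_sample))  # each gene centered on 1.0
```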

NCI-60 cell lines

DrugActivity_AT
