CSCE555 Bioinformatics Lecture 16 Identifying Differentially Expressed Genes from microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun

CSCE555 Bioinformatics

Lecture 16: Identifying Differentially Expressed Genes from Microarray Data

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page: http://www.scigen.org/csce555

University of South Carolina, Department of Computer Science and Engineering

2008 www.cse.sc.edu

Outline

The problem: identifying differentially expressed genes

Statistical methods: t-test

Non-parametric: rank product

Summary

The Biological Problem: Identify Differentially Expressed Genes

[Figure: no treatment vs. treatment]

Which pathways will be affected?

Which genes are involved?

Identify differentially expressed genes

One of the core goals of microarray data analysis is to identify which genes show good evidence of being differentially expressed (DE). This goal has two parts.

1. The first is to select a statistic that ranks the genes in order of evidence for differential expression, from strongest to weakest.

2. The second is to choose a critical value for the ranking statistic, above which any value is considered significant.

k-fold change

1. Differential expression is measured by the ratio of expression levels between two samples.

2. Genes with ratios above a fixed cut-off k, that is, those whose expression underwent a k-fold change, were said to be differentially expressed.

3. This is not a statistical test: there is no associated value that indicates the level of confidence in designating genes as differentially expressed or not differentially expressed.
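The k-fold rule above can be sketched in a few lines. This is only an illustration: the gene names and intensities are made up, and treating a ratio below 1/k as a k-fold decrease is an assumption added here, not something stated on the slide.

```python
def fold_change_calls(treated, control, k=2.0):
    """Flag genes whose treated/control ratio is at least k-fold up or down."""
    calls = {}
    for gene, t in treated.items():
        ratio = t / control[gene]
        if ratio >= k:
            calls[gene] = ("up", ratio)
        elif ratio <= 1.0 / k:  # assumption: symmetric cut-off for decreases
            calls[gene] = ("down", ratio)
    return calls


# Hypothetical single-sample intensities (no replicates).
treated = {"geneA": 800.0, "geneB": 210.0, "geneC": 50.0}
control = {"geneA": 200.0, "geneB": 200.0, "geneC": 400.0}

calls = fold_change_calls(treated, control, k=2.0)
for gene, (direction, ratio) in calls.items():
    print(gene, direction, ratio)
```

Note that the output is only a list of calls: nothing in it says how confident one should be in each call.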

k-fold change

4. Replication is essential in experimental design because it allows an estimate of variability.

5. The ability to assess such variability allows identification of biologically reproducible changes in gene expression levels.

Standard statistical tests

1. More typically, researchers now rely on variants of common statistical tests.

2. These generally involve two parts: calculating a test statistic and determining the significance of the observed statistic.

3. A standard statistical test for detecting significant change between repeated measurements of a variable in two groups is the t-test.

4. This can be generalized to multiple groups via the ANOVA F statistic.
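As a rough sketch of the first part (calculating a test statistic), the classical two-sample t statistic with a pooled variance can be computed for one gene as follows. The replicate values are hypothetical log expression levels, and the equal-variance form is just one choice among several t-test variants.

```python
import math
from statistics import mean, variance


def two_sample_t(x, y):
    """Student's two-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return (mean(x) - mean(y)) / standard_error


# Hypothetical log expression values for one gene, three replicates per group.
treated = [4.0, 5.0, 6.0]
control = [1.0, 2.0, 3.0]

t_value = two_sample_t(treated, control)
print(round(t_value, 3))
```

Determining the significance of the resulting value is a separate step.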

Standard statistical tests

1. For most practical cases, computing a standard t or F statistic is appropriate, although referring to the t or F distributions to determine significance is often not.

2. The main hazard in using such methods occurs when there are too few replicates to obtain an accurate estimate of experimental variances. In such cases, modeling methods that use pooled variance estimates may be helpful.

Standard statistical tests

1. Regardless of the test statistic used, one must determine its significance.

2. Standard interpretations of t-like tests assume that the data are sampled from normal populations with equal variances

3. Expression data may fail to satisfy either or both of these constraints

Standard statistical tests

1. The use of non-parametric rank-based statistics is also common, via both traditional statistical methods and ad hoc ones designed specifically for microarray data.
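One traditional rank-based statistic is the Mann-Whitney U (Wilcoxon rank-sum) statistic. The slide does not name a specific test, so it is picked here purely as an example. The sketch below counts, for one gene, how many treated/control pairs have the treated value larger; only the ordering of the values matters, not their magnitudes.

```python
def mann_whitney_u(x, y):
    """U statistic for x versus y: number of (x, y) pairs with x > y.

    Tied pairs contribute 0.5, the usual convention.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical replicate measurements for one gene.
treated = [4.0, 5.0, 6.0]
control = [1.0, 2.0, 3.0]

u_value = mann_whitney_u(treated, control)
print(u_value)  # maximum possible value is len(treated) * len(control)
```

Because U depends only on ranks, a single extreme outlier changes it no more than any other value on the same side of the comparison.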

RankProd: a non-parametric method to detect differentially regulated genes in replicated experiments

(1) originates from an analysis of biological reasoning; easy to understand

(2) fast, simple and robust to outliers (suitable for noisy data)

(3) provides statistical significance for each gene and allows for control of the overall significance (e.g., false discovery rate)

(4) provides a straightforward way to perform cross-platform meta-analysis (integrates data generated at different laboratories or under different environments into one study, and achieves increased power)

• What does it do? What is the method implemented in the package?

RankProd uses the so-called rank product non-parametric method (Breitling et al., 2004) to identify genes that are up-regulated or down-regulated under one condition against another condition. Rank product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly up-regulated genes in a number of replicate experiments.

• How does it compare to other methods for a similar purpose?

Rank Product

Calculate RP:

Calculate significance
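One common form of the rank product for gene g over k replicates is the geometric mean of its ranks, RP_g = (r_g,1 × r_g,2 × ... × r_g,k)^(1/k), where r_g,i is the rank of gene g in replicate i and rank 1 is the most strongly up-regulated gene. The sketch below computes that quantity and estimates significance by shuffling values within each replicate. The data are hypothetical, and the per-gene permutation p-value is a simplification chosen for illustration, not necessarily the exact procedure used by the RankProd package.

```python
import math
import random


def ranks_descending(values):
    """Rank 1 goes to the largest value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_products(replicates):
    """replicates: one list of per-gene log ratios for each replicate."""
    k = len(replicates)
    per_rep_ranks = [ranks_descending(rep) for rep in replicates]
    n_genes = len(replicates[0])
    # Geometric mean of each gene's ranks across the k replicates.
    return [
        math.prod(r[g] for r in per_rep_ranks) ** (1.0 / k)
        for g in range(n_genes)
    ]


def permutation_pvalues(replicates, n_perm=1000, seed=0):
    """Fraction of null rank products that are <= each observed one."""
    rng = random.Random(seed)
    observed = rank_products(replicates)
    n_genes = len(observed)
    counts = [0] * n_genes
    for _ in range(n_perm):
        # Shuffle values within each replicate to break any consistency.
        shuffled = [rng.sample(rep, len(rep)) for rep in replicates]
        null_rp = rank_products(shuffled)
        for g, obs in enumerate(observed):
            counts[g] += sum(1 for v in null_rp if v <= obs)
    return [c / (n_perm * n_genes) for c in counts]


# Hypothetical log ratios: 4 genes (columns) in 3 replicates (rows).
replicates = [
    [2.1, 0.1, -0.3, 0.4],
    [1.8, -0.2, 0.3, 0.0],
    [2.5, 0.2, 0.1, -0.1],
]
rp = rank_products(replicates)
pvals = permutation_pvalues(replicates)
print(rp)
print(pvals)
```

The first gene is ranked first in every replicate, so it has the smallest possible rank product and the smallest estimated p-value.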

Permutation tests for calculating significance levels

Permutation tests, generally carried out by repeatedly scrambling the samples’ class labels and computing t statistics for all genes in the scrambled data, best capture the unknown structure of the data.
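The label-scrambling procedure described above can be sketched as follows. The gene names, expression values, number of permutations and the pooled-variance t statistic are all assumptions made for the example; the cited papers use their own variants of the statistic and of the p-value calculation.

```python
import math
import random
from statistics import mean, variance


def t_stat(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1.0 / nx + 1.0 / ny))


def permutation_pvalues(data, labels, n_perm=1000, seed=0):
    """data: {gene: values per sample}; labels: 0/1 class label per sample."""
    rng = random.Random(seed)

    def all_t(lab):
        stats = {}
        for gene, vals in data.items():
            group1 = [v for v, l in zip(vals, lab) if l == 1]
            group0 = [v for v, l in zip(vals, lab) if l == 0]
            stats[gene] = t_stat(group1, group0)
        return stats

    observed = all_t(labels)
    exceed = {gene: 0 for gene in data}
    for _ in range(n_perm):
        scrambled = labels[:]
        rng.shuffle(scrambled)  # scramble the samples' class labels
        null = all_t(scrambled)
        for gene in data:
            if abs(null[gene]) >= abs(observed[gene]):
                exceed[gene] += 1
    # Add-one correction so an estimated p-value is never exactly zero.
    return {gene: (exceed[gene] + 1) / (n_perm + 1) for gene in data}


# Hypothetical data: 4 treated samples (label 1) then 4 controls (label 0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = {
    "geneDE": [5.1, 5.3, 4.9, 5.2, 1.0, 1.2, 0.9, 1.1],
    "geneNull": [2.0, 2.3, 1.9, 2.2, 2.1, 1.8, 2.4, 2.05],
}
pvals = permutation_pvalues(data, labels)
print(pvals)
```

With only four samples per class there are just 70 distinct labelings, so the smallest achievable two-sided p-value is about 2/70, which illustrates why the number of replicates limits what a permutation test can show.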

Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116-5121 (2001).

Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999).

Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578 (Department of Statistics, University of California at Berkeley, Berkeley, CA, 2000).

Summary

The problem: identify differentially expressed genes from microarray data

How to identify: t-test and rank product

How to evaluate the significance of identified genes
