Examples of Classifying Expression Data 6.892 / 7.90 Computational Functional Genomics Spring 2002


Examples of Classifying Expression Data

6.892 / 7.90

Computational Functional Genomics

Spring 2002


Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation

Tamayo, Slonim, Mesirov, Zhu, Kitareewan, Dmitrovsky, Lander, Golub

PNAS 96, pp. 2907-2912, March 1999


Hierarchical clustering problems

• Not designed to reflect multiple ways expression patterns can be similar

• Clusters may not be robust or unique

• Points can be clustered based on local decisions that lock in structure


Self-Organizing Maps (SOMs)

• Mathematical space for SOMs
– n genes with k samples define n points in k-dimensional space

• Impose partial structure on the clusters to start
– Choose a geometry of nodes, e.g. 3 x 2 grid
– Mapped into k-dimensional space at random
– Each iteration moves nodes in direction of a randomly selected point
– Closest node is moved the most
– 20,000 – 50,000 iterations later have clustered the genes


Example SOM iteration


Iterative point moving

$f_{i+1}(N) = f_i(N) + L(d(N, N_P), i)\,(P - f_i(N))$

P is the observation used in iteration i to update map points

N is the map point being updated

$N_P$ is the closest point in the map to P

Learning rate L decreases with distance and i

T is the total number of iterations

$L(x, i) = 0.02T / (T + 100i)$ for $x \le p(i)$

$L(x, i) = 0$ otherwise

$p(i)$ decreases linearly with $i$

$p(0) = 3$
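The update rule above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: d(N, N_P) is assumed to be the Euclidean distance between node positions on the grid, p(i) is assumed to fall linearly from 3 to 0 over the T iterations, and all function and parameter names are hypothetical.

```python
import math
import random


def learning_rate(x, i, total_iters, p0=3.0):
    """L(x, i) = 0.02T / (T + 100i) for x <= p(i), else 0."""
    # Assumption: p(i) falls linearly from p0 at i = 0 to 0 at i = T.
    radius = p0 * (1.0 - i / total_iters)
    if x <= radius:
        return 0.02 * total_iters / (total_iters + 100.0 * i)
    return 0.0


def train_som(points, rows, cols, total_iters, seed=0):
    """Return {grid position: node vector} after total_iters updates."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Nodes start at random positions in k-dimensional space.
    nodes = {(r, c): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for i in range(total_iters):
        p = rng.choice(points)  # randomly selected observation P
        # N_P: the node whose current position is closest to P.
        closest = min(nodes, key=lambda g: math.dist(nodes[g], p))
        for grid_pos, vec in nodes.items():
            # Assumption: d(N, N_P) is Euclidean distance on the grid.
            rate = learning_rate(math.dist(grid_pos, closest), i, total_iters)
            for j in range(dim):
                vec[j] += rate * (p[j] - vec[j])
    return nodes


def assign(points, nodes):
    """Label each point with the grid position of its nearest node."""
    return [min(nodes, key=lambda g: math.dist(nodes[g], p)) for p in points]
```

Calling `train_som(points, 3, 2, 20000)` would correspond to the 3 x 2 grid mentioned earlier; `assign` then reads off one cluster per gene.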


Data normalization

• Genes were eliminated if they did not change significantly (eliminate attraction to invariant genes)

• Expression levels are normalized to have mean 0 and variance 1 (focus on shape)

• Yeast data – levels were normalized within each of the two cell cycles

• Human data – expression levels were normalized within the time points
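The two normalization steps above might look like the following. This is a sketch only: the slides do not state the actual variation criterion, so a simple max minus min threshold (`min_range`, a hypothetical parameter) stands in for it, and the population standard deviation is an assumed choice.

```python
import statistics


def passes_variation_filter(levels, min_range):
    """Keep a gene only if its expression changes by at least min_range."""
    # Assumption: "did not change significantly" is modeled as a small
    # max - min range; the slides do not give the real criterion.
    return max(levels) - min(levels) >= min_range


def standardize(levels):
    """Rescale one gene's expression levels to mean 0 and variance 1."""
    mean = statistics.fmean(levels)
    std = statistics.pstdev(levels)  # population standard deviation
    return [(x - mean) / std for x in levels]


def normalize_genes(genes, min_range):
    """Drop near-constant genes, then standardize the remaining ones."""
    return {name: standardize(levels)
            for name, levels in genes.items()
            if passes_variation_filter(levels, min_range)}
```

Filtering first also avoids dividing by a zero standard deviation, provided `min_range` is positive. For the yeast data, `standardize` would be applied separately to the time points of each cell cycle.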


SOM computation

• Computation time is about 1 minute; 20,000 – 50,000 iterations for 416 to 1,036 genes

• Web based interface used to visualize the data

• Average expression pattern is displayed with error bars

• Can also overlay members of a cluster on a single plot

• Yeast cell cycle
– 6 x 5 SOM
– 416 genes
– Computed in 82 seconds


Cluster 29 detail – 76 members exhibiting periodic behavior in late G1


G1, S, G2, and M phase related clusters (C29, C14, C1, C5)


Centroids for groups of genes identified by visual inspection by Cho et al.


PMA treated HL-60 cells SOM

567 genes passing the variation filter were grouped into a 4 x 3 SOM

PMA causes macrophage differentiation (PMA = phorbol 12-myristate 13-acetate.)

Cluster 11 – PMA induced genes


Hematopoietic differentiation across four cell lines

Cell lines: HL-60, U937, Jurkat, NB4

n = 17; 1,036 genes; 6 x 4 SOM


SOM conclusion

• Successful at finding new structure

• Inspection still necessary to find insights

• Able to recover temporal response to perturbation

• Can provide richer topology than linear ordering

• However, topology needs to be provided in advance


Plan

• Overview of classification techniques

• Mixture Model Clustering
– Alon – Colon tumors

• Weighted Voting of Selected Genes
– Golub – Leukemia (ALL, AML)

• Hierarchical Clustering
– Alizadeh – Diffuse large B-cell lymphoma


Statistical Pattern Recognition

• A classifier is an algorithm that assigns an observation to a class

• A class can be a letter (handwriting recognition), a person (face recognition), a type of cell, a diagnosis, or a prognosis

• Data set: data with known classes for training

• Generalize data set knowledge to new observations

• Classification is based on features

• Feature selection is key


Model Complexity

• A model describes a data set and is used to make future decisions

• If a model is too simple it gives a poor fit to the data set

• If a model is too complex, it gives a poor representation of the systematic aspects of the data (overfit to data set)


Types of classifiers

• Discriminative
– No assumptions about underlying model

• Generative
– Assumptions made about form of underlying model (e.g. variables are Gaussian)
– Assumptions cause performance advantages, and disadvantages if the assumptions are incorrect


Mixture Models for Clustering

Alon, U et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS 96, pp. 6745-6750 (June 1999)


Problem Definition

• 40 colon adenocarcinoma biopsy specimens

• 22 normal tissue specimens

• Cell lines derived from colon carcinoma (EB and EB-1)

• Can we tell the cancer specimens from the normal specimens by expression analysis?


Gene Pair Correlations

Dashed line is correlation with data set randomized $10^4$ times

Shaded area: $P < 10^{-3}$

Each gene: 30 genes with significant positive correlation, 10 genes with significant negative correlation
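A randomized baseline of this kind can be sketched as follows. The slides do not say how the data set was randomized, so this example assumes one gene's values are shuffled across tissues; the function names are hypothetical and the Pearson coefficient is an assumed choice of correlation measure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def randomized_correlations(x, y, n_shuffles, seed=0):
    """Correlations of x with shuffled copies of y (the null baseline)."""
    rng = random.Random(seed)
    shuffled = list(y)
    out = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # assumption: shuffle values across tissues
        out.append(pearson(x, shuffled))
    return out


def empirical_p(observed, null_values):
    """Fraction of null correlations at least as extreme as observed."""
    hits = sum(1 for r in null_values if abs(r) >= abs(observed))
    return hits / len(null_values)
```

A gene pair would count as significantly correlated when `empirical_p` falls below the chosen threshold, such as the 10^-3 level marked by the shaded area.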


Mixture Model

• Each gene is represented by a vector that has been normalized so that its sum is 0 and the magnitude is 1

• Mixture model used assumes two distributions with centroids $C_j$

• $P_j(V_k)$ is the probability that $V_k$ is in class $j$

• $C_j = \sum_k V_k P_j(V_k) / \sum_k P_j(V_k)$
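The normalization and the centroid update can be illustrated with a short sketch. The slides do not define how the class probabilities are computed, so the form used here (a softmax over negative squared distance to each centroid, with a hypothetical sharpness parameter `beta`) is an assumption made only so that the example runs.

```python
import math


def normalize(vec):
    """Shift a vector to sum 0 and scale it to magnitude 1."""
    mean = sum(vec) / len(vec)
    centered = [x - mean for x in vec]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]


def memberships(vec, centroids, beta):
    """Class probabilities for one vector (assumed form, see above)."""
    scores = [math.exp(-beta * sum((v - c) ** 2 for v, c in zip(vec, cen)))
              for cen in centroids]
    total = sum(scores)
    return [s / total for s in scores]


def update_centroids(vectors, centroids, beta):
    """Each centroid becomes the probability-weighted mean of the vectors."""
    probs = [memberships(v, centroids, beta) for v in vectors]
    new_centroids = []
    for j in range(len(centroids)):
        weight = sum(p[j] for p in probs)
        new_centroids.append([
            sum(p[j] * v[d] for p, v in zip(probs, vectors)) / weight
            for d in range(len(vectors[0]))
        ])
    return new_centroids
```

Repeating `update_centroids` until the centroids stop moving gives the fitted two-class model; with `beta = 0` every vector belongs equally to both classes and both centroids collapse to the overall mean.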


Mixture Model is used for top down clustering

• At end of iteration, each gene is assigned to the cluster with the highest probability

• Makes hard boundary between clusters

• Repeat process on both subclusters

• Both genes and tissues are clustered using the same algorithm
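The top-down procedure amounts to a recursive binary split. In the sketch below, `prob_first` is a hypothetical callback standing in for the fitted two-class mixture model, and the stopping rules (`min_size`, `max_depth`) are assumptions, since the slides do not say when splitting stops.

```python
def top_down(items, prob_first, min_size=2, depth=0, max_depth=3):
    """Recursively split items into a binary tree of clusters.

    prob_first(items) is a hypothetical callback returning, for each item,
    the fitted probability that it belongs to the first of two subclusters.
    Leaves are lists of items; internal nodes are (first, second) tuples.
    """
    if len(items) < min_size or depth >= max_depth:
        return items  # leaf: too small or too deep to split further
    probs = prob_first(items)
    # Hard boundary: each item goes to the subcluster with higher probability.
    first = [x for x, p in zip(items, probs) if p >= 0.5]
    second = [x for x, p in zip(items, probs) if p < 0.5]
    if not first or not second:
        return items  # the split did not separate anything
    return (top_down(first, prob_first, min_size, depth + 1, max_depth),
            top_down(second, prob_first, min_size, depth + 1, max_depth))
```

Because the function only needs a list of items and a probability callback, the same code can be run once over gene vectors and once over tissue vectors.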


Results of clustering algorithm


Excerpt from ribosomal gene cluster


Expanded view of clustering

Tumor tissues have arrows at left

** are EB and EB-1 colon carcinoma cell lines


Five of 20 most informative genes are muscle genes

Muscle index is normalized average intensity of 17 muscle related ESTs


Sensitivity of clustering to genes used

Genes sorted by t test


Conclusion

• Epithelial origin tumors distinguished from muscle-rich normal tissue samples

• Tumor cell lines distinguished

• Need tissue purity of in vivo samples


Weighted Voting for Classification

Golub,T. et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286, pp. 531-537, October 15, 1999.


Two challenges

• Class discovery – defining previously unrecognized tumor subtypes

• Class prediction – assignment of tumor samples to already defined classes


Data source

• 38 bone marrow samples
– 27 acute lymphoblastic leukemia (ALL)
– 11 acute myeloid leukemia (AML)

• Hybridized to Affymetrix arrays
– 6817 human genes


Classifier architecture


Pick informative feature set


Correlation function

• All variables are first log transformed

• g is a vector of samples [e1 .. en]

• c tells us the class of each sample [1 0 .. 0]

• Thus we can compute $\mu_1(g)$, $\mu_2(g)$, $\sigma_1(g)$, $\sigma_2(g)$

• $P(g,c) = (\mu_1(g) - \mu_2(g)) / (\sigma_1(g) + \sigma_2(g))$

• $N_1(c,r)$: all genes g such that $P(g,c) = r$

• $N_2(c,r)$: all genes g such that $P(g,c) = -r$
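The correlation function P(g,c) is the difference of the class means divided by the sum of the class standard deviations, which can be computed per gene as sketched below. The natural logarithm and the sample standard deviation are assumptions (the slides specify neither), and the function names are hypothetical.

```python
import math
import statistics


def signal_to_noise(expr, classes):
    """P(g, c) = (mu1 - mu2) / (sigma1 + sigma2) for one gene."""
    # Log transform first (assumption: natural log, positive values only).
    logged = [math.log(e) for e in expr]
    class1 = [x for x, c in zip(logged, classes) if c == 1]
    class2 = [x for x, c in zip(logged, classes) if c == 0]
    mu1, mu2 = statistics.fmean(class1), statistics.fmean(class2)
    # Assumption: sample standard deviation within each class.
    sigma1, sigma2 = statistics.stdev(class1), statistics.stdev(class2)
    return (mu1 - mu2) / (sigma1 + sigma2)


def rank_genes(genes, classes):
    """Gene names sorted from most class-1-like to most class-2-like."""
    scores = {name: signal_to_noise(expr, classes)
              for name, expr in genes.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Large positive scores mark genes that are higher in class 1, large negative scores mark genes that are higher in class 2, and an informative feature set can be drawn from both ends of the ranking.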


~1100 genes are informative: number of genes within neighborhoods


Weighted voting for features


Weighted voting

• $v_i = x_i - (\mu_{AML} + \mu_{ALL})/2$

• $w_i = P(g,c)$

• Total votes
– Class 1: sum of all positive $w_i v_i$
– Class 2: sum of all negative $w_i v_i$
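The vote tally can be sketched as below, reading the term subtracted from each expression value as the midpoint of the two class means for that gene. The data layout (a dict mapping gene name to weight and class means) and the function name are hypothetical choices made for illustration.

```python
def vote_totals(sample, genes):
    """Return (class 1 votes, class 2 votes) for one sample.

    genes maps gene name -> (w, mu1, mu2): the weight P(g, c) and the two
    class means, all estimated from the training set.
    """
    v1 = v2 = 0.0
    for name, (w, mu1, mu2) in genes.items():
        # v_i: how far the sample lies from the midpoint of the class means.
        v = sample[name] - (mu1 + mu2) / 2.0
        vote = w * v
        if vote > 0:
            v1 += vote   # positive w_i v_i counts for class 1
        else:
            v2 += -vote  # negative w_i v_i counts for class 2
    return v1, v2
```

For example, with `genes = {"g1": (1.0, 4.0, 2.0), "g2": (-0.5, 1.0, 3.0)}` and `sample = {"g1": 5.0, "g2": 3.0}`, gene g1 casts 2.0 votes for class 1 and gene g2 casts 0.5 votes for class 2.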


Prediction Strength

• PS = (Vwin – Vlose)/(Vwin + Vlose)

• Vwin and Vlose are vote totals for winning and losing classes, respectively

• Gives a “margin of victory”

• Sample assigned to winning class if PS > 0.3
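Prediction strength and the 0.3 threshold translate directly into code. In this sketch, returning `None` for a sample that does not clear the threshold is an illustrative choice, and the function names are hypothetical.

```python
def prediction_strength(v1, v2):
    """PS = (V_win - V_lose) / (V_win + V_lose)."""
    v_win, v_lose = max(v1, v2), min(v1, v2)
    return (v_win - v_lose) / (v_win + v_lose)


def classify(v1, v2, threshold=0.3):
    """Return 1 or 2 for the winning class, or None if PS is too low."""
    if prediction_strength(v1, v2) <= threshold:
        return None  # margin of victory too small to make a call
    return 1 if v1 > v2 else 2
```

With vote totals of 2.0 and 0.5, PS is 1.5 / 2.5 = 0.6 and the sample is assigned to class 1. Both totals being zero would divide by zero, so a caller should handle that case separately.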


Performance of 50 gene predictor – 100% accuracy


Genes most correlated with AML/ALL class distinction


Feature sets

• All predictors that used between 10 and 200 genes were 100% accurate


Using SOM to discover classes


Bayesian perspective

• Assuming
– Class distributions are normal with equal variances

• Weight for a gene is $(\mu_1 - \mu_2) / \sigma^2$
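One way to see where this weight comes from is the log-likelihood ratio for a single gene with expression level x. The short derivation below assumes both classes share the variance σ² and, as an added assumption not stated on the slide, that the two classes have equal prior probability.

```latex
\log\frac{p(x \mid \text{class 1})}{p(x \mid \text{class 2})}
  = -\frac{(x-\mu_1)^2}{2\sigma^2} + \frac{(x-\mu_2)^2}{2\sigma^2}
  = \frac{\mu_1-\mu_2}{\sigma^2}\left(x - \frac{\mu_1+\mu_2}{2}\right)
```

The second factor is the distance of x from the midpoint of the class means, which is the vote used in weighted voting, and the first factor is the per-gene weight under these assumptions.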


Conclusion

• Can classify AML and ALL with as few as 10 genes

• “Many other gene selection metrics could be used; we considered several …. The best performance was obtained with the relative class separation metric defined above”


Discovering new types of cancer

• Alizadeh, A. et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403, pp. 503-511 (February 3, 2000)


Goal

• Discover cause for different disease courses for diffuse large B-cell lymphoma (DLBCL)
– 40% of patients respond to therapy
– 60% succumb to disease

• Provide diagnostic / prognostic tool

• DLBCL is most common subtype of non-Hodgkin’s lymphoma


Questions

• Can we create a molecular portrait of distinct types of B-cell malignancy?

• Can we identify types of malignancy not yet recognized?

• Can we relate malignancy to normal stages in B-cell development and physiology?


Lymphochip

• 17,856 cDNA clones
– 12,069 from germinal B-cell library
– 2,338 from DLBCL, follicular lymphoma (FL), mantle cell lymphoma, and chronic lymphocytic leukaemia (CLL)
– 3,186 genes important to lymphocyte and/or cancer biology
– B- and T-lymphocyte genes that respond to mitogens or cytokines


Data sources

• Rearranged immunoglobulin genes in DLBCL are characteristic of germinal center of secondary lymphoid organs

• 96 normal and malignant lymphocyte samples


Lymphochip cluster


DLBCL subtypes visible


Feature discovery

A: cluster Germinal Center B genes and samples

B: cluster more genes, use A’s sample cluster

C: expanded view of B


DLBCL vs. normal B-lymphocyte differentiation


Distinct DLBCL groups by gene expression profiling

A: gene expression

B: IPI

C: IPI 0-2; gene expression


Summary

• DLBCL groups are still diverse: some members of GC B-like DLBCL group die
– 5 in first 2 years

• May be able to find informative features for more groups

• If constitutive genes can be found in cancers, their upstream regulators can be targeted