The Classification of Microarray Data

Geoff McLachlan
Department of Mathematics & Institute of Molecular Bioscience
University of Queensland

Outline of Talk
  Introduction
  Supervised classification of tissue samples
    selection bias
  Unsupervised classification (clustering) of tissues
    mixture model-based approach

Supervised Classification (Two Classes)
[Figure: expression matrix with rows Gene 1 to Gene p and columns Sample 1 to Sample n; the samples are divided into Class 1 (good prognosis) and Class 2 (poor prognosis)]

Microarray to be used as routine clinical screen
by C. M. Schubert, Nature Medicine 9, 9
The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, May 14, 2002

LINEAR CLASSIFIER
FORM $c(x) = \beta_0 + \beta^T x$ for the production of the group label y of a future entity with feature vector x.

FISHER'S LINEAR DISCRIMINANT FUNCTION
$y = \operatorname{sign} c(x)$, with $\beta = S^{-1}(\bar{x}_1 - \bar{x}_2)$ and $\beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2)$, where $\bar{x}_1$, $\bar{x}_2$ and $S$ are the sample means and pooled sample covariance matrix found from the training data.

SUPPORT VECTOR CLASSIFIER
Vapnik (1995)
$c(x) = \beta_0 + \beta^T x$, where $\beta_0$ and $\beta$ are obtained as follows: minimize $\tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j$ subject to $\xi_j \ge 0$ and $y_j c(x_j) \ge 1 - \xi_j$ $(j = 1, \ldots, n)$, with labels $y_j \in \{-1, +1\}$. The $\xi_j$ relate to the slack variables; $\gamma = \infty$ gives the separable case.

Leo Breiman (2001). Statistical modeling: the two cultures (with discussion).
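Statistical Science 16. Discussants include Brad Efron and David Cox.

GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)
LEUKAEMIA DATA: Only 2 genes are needed to obtain a zero CVE (cross-validated error rate)
COLON DATA: Using only 4 genes, CVE is 2%

Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
Figure 3: Error rates of Fisher's rule with stepwise forward selection procedure using all the colon data
Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues

BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator
$B.632 = .368 \times AE + .632 \times B1$
where AE is the apparent error rate and B1 is the bootstrap error when the rule is applied to a point not in the training sample.
A Monte Carlo estimate of B1 is
$B1 = \frac{1}{n} \sum_{j=1}^{n} E_j$, with $E_j = \sum_{k=1}^{K} I_{jk} Q_{jk} \,/\, \sum_{k=1}^{K} I_{jk}$,
where $I_{jk} = 1$ if $x_j$ is not in the kth bootstrap sample (0 otherwise), $Q_{jk} = 1$ if $x_j$ is misallocated by the rule formed from the kth bootstrap sample (0 otherwise), and K is the number of bootstrap samples.

Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR A(w).
McLachlan (1977) proposed $w = w_o$, where $w_o$ is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups.
The value of $w_o$ was found to range between 0.6 and 0.7, depending on the values of [symbols not preserved].

.632+ estimate of Efron & Tibshirani (1997, JASA)
$B.632+ = (1 - w) \times AE + w \times B1$
where
$w = .632 \,/\, (1 - .368\, r)$,
$r = (B1 - AE) \,/\, (\gamma - AE)$ (relative overfitting rate),
$\gamma = \sum_i p_i (1 - q_i)$ (estimate of no information error rate), with $p_i$ the proportion of observed class-i labels and $q_i$ the proportion of class-i predictions.
If r = 0, w = .632, and so B.632+ = B.632
If r = 1, w = 1, and so B.632+ = B1

Ten-Fold Cross Validation
[Figure: partition of the data into Training and Test portions]

MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.
[Table: No. of genes / Times selected]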
tubulin, alpha, ubiquitous
Cluster Incl N90862
cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4)
DEK oncogene (DNA binding)
Cluster Incl AF
transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase)
benzodiazapine receptor (peripheral)
Cluster Incl D21063
galactosidase, beta 1
high-mobility group (nonhistone chromosomal) protein 2
cold inducible RNA-binding protein
Cluster Incl U79287
BAF53
tubulin, beta polypeptide
thromboxane A2 receptor
H1 histone family, member X
Fc fragment of IgG, receptor, transporter, alpha
sine oculis homeobox (Drosophila) homolog 3
transcriptional intermediary factor 1 gamma
transcription elongation factor A (SII)-like 1
like mouse brain protein E46
minichromosome maintenance deficient (mis5, S. pombe) 6
transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
guanine nucleotide binding protein (G protein), gamma 3, linked
dihydropyrimidinase-like 2
Cluster Incl AI
transforming growth factor, beta receptor II (70-80kD)
protein kinase C-like 1

Breast cancer data set in van 't Veer et al.
(van 't Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415)
These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.

van de Vijver et al. (2002) considered a further 234 breast cancer tumours but have only made available the data for the top 70 genes based on the previous study of van 't Veer et al.
(2002)

[Table columns:
  Number of Genes
  Error Rate for Top 70 Genes (without correction for Selection Bias as Top 70)
  Error Rate for Top 70 Genes (with correction for Selection Bias as Top 70)
  Error Rate for 5422 Genes (with correction for Selection Bias)]

Two Clustering Problems:
  Clustering of genes on basis of tissues
    genes not independent
  Clustering of tissues on basis of genes
    latter is a nonstandard problem in cluster analysis (n
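
The selection bias discussed in the supervised classification part of the talk can be illustrated with a small simulation. The sketch below is only an assumption-laden illustration, not the procedure used in the talk: it swaps in a nearest-centroid rule and a two-sample t-like score in place of the SVM with RFE or Fisher's rule with forward selection, and the names `top_genes`, `nearest_centroid_predict` and `cv_error` are hypothetical. The data are pure noise, so no rule can genuinely beat chance; the point is to compare ten-fold CV when the genes are chosen once from all samples against ten-fold CV when the genes are re-selected inside every training fold.

```python
import numpy as np

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute two-sample t-like score."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(score)[-k:]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class (0 or 1) with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, k, folds, select_inside):
    """Cross-validated error rate over the given test folds.

    select_inside=False: genes are chosen once from ALL samples (biased).
    select_inside=True:  genes are re-chosen from each training fold only.
    """
    genes_all = top_genes(X, y, k)
    errors = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        genes = top_genes(X[train], y[train], k) if select_inside else genes_all
        pred = nearest_centroid_predict(
            X[train][:, genes], y[train], X[test_idx][:, genes]
        )
        errors += int((pred != y[test_idx]).sum())
    return errors / len(y)

# Noninformative data: expression values are independent of the class labels.
rng = np.random.default_rng(0)
n, p, k = 40, 5000, 10                      # illustrative sizes, not from the talk
X = rng.normal(size=(n, p))
y = np.array([0, 1] * (n // 2))
folds = np.array_split(rng.permutation(n), 10)

biased = cv_error(X, y, k, folds, select_inside=False)
corrected = cv_error(X, y, k, folds, select_inside=True)
print(f"CV error, genes selected on all data:     {biased:.3f}")
print(f"CV error, genes selected inside each fold: {corrected:.3f}")
```

With noise-only data, the first figure is typically far too optimistic, while the second should sit much closer to the no-information rate of one half. The exact values depend on the seed and the chosen sizes.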
