INTRODUCTION

GOAL: to provide novel types of interaction between classification systems and MIAME-compliant databases

We present a prototype module aimed at providing graphical interaction between systems for gene-profiling and MIAME-compliant databases.

The prototype has been developed to support outlier analysis and semi-supervised class discovery in microarray data experiments.

The module is designed to integrate the newly developed PostgreSQL port of the GUS/RAD platform [1,2] with a display built automatically in Scalable Vector Graphics (SVG).

The display organizes the graphical outputs from a predictive classification system, supporting query construction and retrieval of MIAME annotation linked to automatically or manually selected curves.

THE PROTOTYPE

This first version provides an interface to sample-tracking curves (profiles of classification errors of single samples as a function of gene panel sizes), as derived from the ERFE-SVM gene ranking system [3].

We automatically cluster these curves according to a Dynamic Time Warping (DTW) metric [4], obtaining hypotheses on the potential presence of outliers and subtypes. The analysis is a by-product of the ERFE-SVM complete cross-validation setup, which runs on an OpenMosix Linux cluster facility.
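As an illustration of the metric, the following is a minimal Python sketch of the textbook dynamic-programming DTW distance between two error curves. It is not the prototype's code: the absolute-difference local cost, the function names `dtw_distance` and `distance_matrix`, and the toy curves are all assumptions made for this example.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences.

    Assumption: the local cost is the absolute difference between points;
    the source does not state which local cost the prototype uses.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def distance_matrix(curves):
    """Pairwise DTW distances for a dict {sample_id: error curve}."""
    names = sorted(curves)
    return names, [[dtw_distance(curves[p], curves[q]) for q in names] for p in names]


# Hypothetical sample-tracking curves (error vs. increasing gene panel size).
curves = {
    "s1": [0.0, 0.0, 0.1, 0.0],
    "s2": [0.0, 0.1, 0.0, 0.0],
    "s3": [0.9, 1.0, 0.9, 1.0],
}
names, dist = distance_matrix(curves)
```

In this toy example, s1 and s2 differ only by a shift of one panel size, so warping aligns them closely, while the persistently misclassified s3 stays far from both. That separation is the kind of signal that would suggest an outlier.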

Scripts based on the trellis (lattice) graphics library of the R computing environment are interfaced to the classification system.

The SVG directives providing the interactive display are also built directly by R, using an adaptation of the RSVG driver package.

FEATURES

The user may pick one or more curves from the display, or follow the indications of unsupervised hierarchical clustering (from the standard R clustering package), and construct specific queries.

In particular, given a potential outlier sample [5], the user may retrieve information on the biomaterial, or on the experimental conditions.
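A query of this kind might be constructed as in the Python sketch below. The table name `sample_annotation`, its column names, and the function `build_annotation_query` are hypothetical and do not reflect the GUS/RAD schema. The `%s` placeholders follow the parameter style of PostgreSQL drivers such as psycopg2, so sample identifiers are passed separately instead of being pasted into the SQL text.

```python
# Hypothetical columns of a hypothetical annotation table (not the RAD schema).
ALLOWED_FIELDS = {"biomaterial", "experimental_conditions", "phenotype"}


def build_annotation_query(sample_ids, fields=("biomaterial", "experimental_conditions")):
    """Return (sql, params) retrieving annotation for the selected samples."""
    if not sample_ids:
        raise ValueError("select at least one sample")
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        # Column names cannot be bound as parameters, so they are whitelisted.
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    columns = ", ".join(("sample_id",) + tuple(fields))
    placeholders = ", ".join(["%s"] * len(sample_ids))
    sql = (
        f"SELECT {columns} FROM sample_annotation "
        f"WHERE sample_id IN ({placeholders}) ORDER BY sample_id"
    )
    return sql, list(sample_ids)


# One potential outlier picked from the display (made-up identifier).
sql, params = build_annotation_query(["s7"])
```

The resulting pair could then be handed to a database cursor's `execute(sql, params)` call.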

We plan to fit the new module within the RAD (RNA Abundance Database) schema and to further support the interaction with the classification setup.

The prototype is currently interfaced to a standalone PostgreSQL database, and a few elementary features have been implemented to relate the selected samples to any phenotype information present in the dataset.

REFERENCES

[1] Manduchi, E., Pizarro, A., Stoeckert, C. (2001). RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE, vol 4266, pp. 68-78.

[2] Manduchi, E. et al. (2004). RAD and the RAD Study-Annotator: an approach to collection, organization, and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics, 20(4):452-459.

[3] Furlanello, C., Serafini, M., Merler, S., and Jurman, G. (2003). Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4:54.

[4] Aach, J. and Church, G. M. (2001). Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495-508.

[5] Furlanello, C., Merler, S., Jurman, G., and Serafini, M. Unsupervised Discovery from Gene Tracking with RFE Classification Systems. ISMB/ECCB 2004.

Interfacing predictive models with MIAME compliant databases

Cesare Furlanello, Maria Serafini, Silvano Paoli, Giuseppe Jurman

ITC-irst, Trento, Italy -- http://mpa.itc.it

MGED 7

September 8-10, 2004

Toronto, ON, Canada

DATA

In this example, the prototype is connected to PostgreSQL data tables.

Microarray data: mouse model of Myocardial Infarction from the Cardiogenomics PGA - Genomics of Cardiovascular Development, Adaptation, and Remodeling - NHLBI Program for Genomic Applications, Harvard Medical School. http://www.cardiogenomics.org

In the final version, GUS/RAD will become the prototype's natural interface to the data.

The development of the PostgreSQL port of GUS is under way.

The MPBA group at ITC-irst is a member of the team involved in the project.

(a) Gene profiling tasks require intensive computational resources. Our E-RFE system for gene profiling [3] is currently implemented on a high-throughput computing facility, the MPA-HTC Linux Cluster.

Discovery of outlier patterns and of potential subtypes, and analysis of gene importance may be derived as a by-product of the computation (e.g. as needed by a complete validation setup to avoid selection bias).

QUESTIONS

1. Interact with the resources (Cluster+Algorithms) for understanding and refining machine learning results

2. Provide access to the gene profiling algorithms and their outcomes through a web service

3. Connect to MIAME-compliant information to support investigation and discovery

Build query

Zoom on plot

Choose the cluster you are interested in and display the curves for the selected cluster

Selection of sample-tracking curves is obtained from DTW-based clustering. Curves from the selected cluster are added to the sample analysis area and are ready for query.
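The selection step can be pictured with the Python sketch below: a plain single-linkage agglomerative clustering over a precomputed distance matrix, followed by copying one cluster into a working area. The prototype relies on R's clustering package, so this is only a stand-in. The linkage rule, the function name `single_linkage`, and the distance values are assumptions.

```python
def single_linkage(dist, k):
    """Merge the two closest clusters until only k clusters remain.

    dist is a symmetric matrix of pairwise distances (e.g. DTW distances);
    clusters are returned as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Made-up pairwise distances between four sample-tracking curves.
names = ["s1", "s2", "s3", "s4"]
dist = [
    [0.0, 0.2, 3.1, 3.0],
    [0.2, 0.0, 2.9, 3.2],
    [3.1, 2.9, 0.0, 0.3],
    [3.0, 3.2, 0.3, 0.0],
]
clusters = single_linkage(dist, k=2)

# The user chooses the second cluster; its curves go to the working area.
working_area = [names[i] for i in clusters[1]]
```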

Query the Database for info on the selected (blue) sample, or for all those listed in the working area or displayed in the image:

Browse through the samples, then select/remove the current curve from the working area

Save in JPG format the selected (blue) curve or all those displayed in the working area

Interface: Profile Browser, Working Area, Query Tools

(b) EXAMPLE: Interfacing to sample-tracking profiles. We study the influence of gene panel sizes on predictive classification error, on a sample-by-sample basis. Errors are accumulated on multiple replicated runs in which the sample is in test, and plotted for increasing panel sizes. Specific sample-tracking profiles may be investigated to discover patterns (potential outliers, subtypes).
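The accumulation described above might look like the following Python sketch. It assumes that each test-set prediction is logged as a `(sample_id, panel_size, misclassified)` record and that the curve plots the fraction of misclassifications. The record layout, the normalization, and the name `sample_tracking_profiles` are all assumptions, since the source does not specify them.

```python
from collections import defaultdict


def sample_tracking_profiles(records):
    """Build per-sample error curves from test-set predictions.

    records: iterable of (sample_id, panel_size, misclassified), one entry for
    each replicated run in which the sample was in the test portion.
    Returns {sample_id: [(panel_size, error_rate), ...]} sorted by panel size.
    """
    errors = defaultdict(int)
    trials = defaultdict(int)
    for sample_id, panel_size, misclassified in records:
        trials[(sample_id, panel_size)] += 1
        errors[(sample_id, panel_size)] += int(misclassified)

    profiles = defaultdict(dict)
    for (sample_id, panel_size), n in trials.items():
        profiles[sample_id][panel_size] = errors[(sample_id, panel_size)] / n
    return {s: sorted(p.items()) for s, p in profiles.items()}


# Made-up records: two replicated runs per sample and panel size.
records = [
    ("s1", 10, True), ("s1", 10, False), ("s1", 100, False), ("s1", 100, False),
    ("s2", 10, True), ("s2", 10, True), ("s2", 100, True), ("s2", 100, True),
]
profiles = sample_tracking_profiles(records)
```

Here s1 is eventually classified correctly as the panel grows, whereas s2 is misclassified at every panel size, a pattern worth inspecting as a potential outlier.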

How can we automate the discovery of patterns and connect the investigation to experimental, biological and clinical data about the microarray?

Automating discovery: DTW-based clustering

Scalable Vector Graphics

SVG is a language for describing two-dimensional graphics and graphical applications in XML.

SVG 1.1 is a W3C Recommendation and forms the core of the current SVG developments.
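Because SVG is plain XML, a display of sample-tracking curves can be generated by any XML-writing tool. The prototype builds its SVG from R. The Python sketch below only illustrates the idea, drawing each curve as a `polyline` element whose `id` carries the sample identifier, so that a selected curve can be mapped back to a database query. The function name `curves_to_svg`, the canvas size, and the styling are assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def curves_to_svg(curves, width=400, height=200):
    """Render {sample_id: [error values in 0..1]} as an SVG document string."""
    svg = ET.Element(
        "svg",
        xmlns=SVG_NS,
        width=str(width),
        height=str(height),
        viewBox=f"0 0 {width} {height}",
    )
    for sample_id, errors in sorted(curves.items()):
        step = width / max(len(errors) - 1, 1)
        # SVG y grows downward: error 0 maps to the bottom, error 1 to the top.
        points = " ".join(
            f"{i * step:.1f},{(1 - e) * height:.1f}" for i, e in enumerate(errors)
        )
        ET.SubElement(
            svg, "polyline", id=sample_id, points=points, fill="none", stroke="grey"
        )
    return ET.tostring(svg, encoding="unicode")


# Two made-up curves, each with two panel sizes.
document = curves_to_svg({"s1": [0.0, 1.0], "s2": [1.0, 1.0]})
```

A viewer script could then recolor the selected `polyline` (blue in the interface described above) and read its `id` to drive the query tools.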