
Advanced Review

Data mining and life sciences applications on the grid
Mario Cannataro,1,2∗ Pietro Hiram Guzzi1 and Alessia Sarica1

Data mining (DM) is increasingly used in the analysis of data generated in life sciences, including biological data produced in several disciplines such as genomics and proteomics, medical data produced in clinical practice, and administrative data produced in health care. The difficulty in mining such data is twofold. First of all, data in life sciences are inherently heterogeneous, spanning from molecular level data to clinical and administrative data. Second, data in life sciences are produced at an increasing rate and data repositories are becoming very large. Thus, the management and analysis of such data is becoming a main bottleneck in biomedical research. The main goal of this paper is to review the main methodologies to mine life sciences data and the ways they are coupled to high-performance infrastructures and systems that result in an efficient analysis. This paper recalls basic concepts of DM, grids, and distributed DM on grids, and reviews main approaches to mine biomedical data on high-performance infrastructures with special focus on the analysis of genomics, proteomics, and interactomics data, and the exploration of magnetic resonance images in neurosciences. The paper can be of interest both to bioinformaticians, who can learn how to exploit high-performance infrastructures to mine life sciences data, and to computer scientists, who can address the heterogeneity and the high volumes of life sciences data at the data management, algorithm, and user interface layers. © 2013 Wiley Periodicals, Inc.

How to cite this article: WIREs Data Mining Knowl Discov 2013, 3: 216–238. doi: 10.1002/widm.1090

INTRODUCTION

Data mining (DM) is the set of methods that aim to extract hidden knowledge from huge quantities of data or databases. DM techniques are used to discover correlations among data (association rules), to create classifiers (e.g., decision trees, neural networks, Bayesian networks) that predict the values of some variable (classification), or to split unknown data into groups with similar characteristics (clustering). Witten and Frank1 define DM as the process of discovering patterns in data, and these patterns must be meaningful to take some advantage of them. In particular, this process must be automatic or semiautomatic.

The authors have declared no conflicts of interest in relation to this article.
∗Correspondence to: [email protected]
1Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, Italy
2ICAR-CNR, Rende, Italy

DOI: 10.1002/widm.1090

Han and Kamber2 list several steps that characterize ‘Knowledge Discovery in Databases’, used as a synonym for DM: (1) data cleaning, to remove noise or irrelevant data; (2) data integration, in case of data coming from multiple sources; (3) data selection, to choose relevant data; (4) data transformation, to transform cleaned data into appropriate forms for mining; (5) DM, the application of techniques for discovering knowledge; (6) pattern evaluation, to identify the truly interesting patterns representing knowledge, based on some interestingness measures; and (7) knowledge presentation, the visualization of the information retrieved.2
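To make these steps concrete, the following minimal sketch (ours, not taken from the paper) chains cleaning, selection, transformation, mining, and evaluation with the open source scikit-learn library; the synthetic dataset and all parameter values are illustrative assumptions.

# Minimal sketch of a KDD pipeline (steps 1, 3-6 above) using scikit-learn.
# Step (2), data integration, is skipped: a single data source is assumed.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer            # (1) data cleaning
from sklearn.feature_selection import SelectKBest   # (3) data selection
from sklearn.preprocessing import StandardScaler    # (4) data transformation
from sklearn.tree import DecisionTreeClassifier     # (5) data mining
from sklearn.model_selection import cross_val_score # (6) pattern evaluation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 200 samples, 50 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # class depends on two features
X[rng.random(X.shape) < 0.05] = np.nan   # simulate noisy/missing values

kdd = Pipeline([
    ("clean", SimpleImputer(strategy="median")),
    ("select", SelectKBest(k=10)),
    ("transform", StandardScaler()),
    ("mine", DecisionTreeClassifier(max_depth=3)),
])

# Evaluate the discovered patterns with 5-fold cross-validation.
print("accuracy:", cross_val_score(kdd, X, y, cv=5).mean())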

DM is used in many application domains and especially in the analysis of data generated in the life sciences domain. This includes biological data produced in the so-called ‘omics’ disciplines (e.g., genomics, proteomics, and interactomics data), medical data produced in medicine (e.g., several types of clinical data, including biomedical images in radiology), clinical and administrative data produced in health care (e.g., data extracted from electronic patient records, epidemiology data, survival data in oncology), and so on.

The difficulty in mining such data is twofold: first of all, data in life sciences are inherently heterogeneous, spanning from molecular level data to clinical and administrative data; second, data in life sciences are produced at an increasing rate and data repositories are becoming very large. Although the availability of high-throughput experimental and diagnostic platforms is producing an increasing volume of heterogeneous data, methods used in the management and analysis of such data are usually sequential and require human intervention, and thus are becoming the real bottleneck in biomedical research.

To face these challenges, recent life sciences projects are using high-performance infrastructures, such as grids and clouds, to store and analyze data, whereas ontologies are more and more used to model biomedical data and processes, making it possible to face the heterogeneity of data and to improve the effectiveness of analysis by taking advantage of existing knowledge. For instance, ‘The Open Biological and Biomedical Ontologies Foundry’ (http://www.obofoundry.org) supports the development of a set of ontologies designed specifically to model the biological and biomedical domains, where ‘Gene Ontology’ (http://www.geneontology.org) models the biological process, molecular function, and cellular component of several biological entities such as genes and proteins. In medicine, ontologies are mainly used to model medical terms such as diseases, whereas in biology, they are mainly used to annotate experimental data; for instance, proteins detected in an experiment may be annotated with terms extracted from the Gene Ontology Annotation database (http://www.ebi.ac.uk/GOA/).

The huge dimensionality of life sciences data is the first reason to implement large distributed data repositories, whereas high-performance computing is necessary to allow the efficient storage and analysis of huge data. Conventional high-performance computers (e.g., clusters) and novel parallel architectures (e.g., General Purpose Graphics Processing Units), programmed with either conventional (e.g., Message Passing Interface) or emerging parallel programming models (e.g., Service Oriented Architecture, MapReduce), are more and more used in life sciences research and may overcome the limits posed by conventional computers to the mining and exploration of large amounts of data.

An emerging approach to face the high complexity of life sciences applications is the modeling of applications as collections of web services composed through workflows. In particular, web services may be used to implement different steps of the biomedical analysis pipeline, whereas workflows are used to combine such web services, forming reusable applications that may be deployed on several distributed or parallel architectures, such as grids or clusters.

Moreover, using parameter sweep technology, a single workflow may be instantiated in various forms to test in parallel different algorithms on some of the steps of the analysis pipeline. For instance, classification of data may be implemented by an ensemble of classifiers run in parallel whose results are eventually integrated.
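A minimal sketch of this pattern (our own illustration, not the paper's implementation) is shown below: the same classification workflow is instantiated with different parameter values, the instances run in parallel, and their predictions are integrated by majority vote. The dataset and the swept parameter are invented for the example.

# Sketch: a parameter sweep that trains several classifiers in parallel
# and integrates their results by majority voting. Illustrative only.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_new = X[:10]  # unseen samples to classify

def train_and_predict(max_depth):
    # One sweep instance: same workflow, different parameter value.
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X, y)
    return clf.predict(X_new)

if __name__ == "__main__":
    sweep = [2, 4, 6, 8]  # parameter values tested in parallel
    with ProcessPoolExecutor() as pool:
        votes = np.array(list(pool.map(train_and_predict, sweep)))
    # Integrate the ensemble: majority vote over the parallel runs.
    ensemble = (votes.mean(axis=0) > 0.5).astype(int)
    print(ensemble)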

This paper reviews main approaches to mine biomedical data on high-performance infrastructures with special focus on the analysis of ‘omics’ data and the exploration of magnetic resonance images in neurosciences. The paper is organized as follows. Section Background on Distributed and Grid-Based Data Mining introduces grids, their programming models, and main grid-based platforms for DM and for specific life sciences applications. This section is mainly devoted to grids because of their maturity and the high number of successful projects involving grids and life sciences.3 Moreover, this section briefly introduces some promising emerging infrastructures, such as clouds, that are starting to be used in several life sciences projects. Section Distributed Processing and Mining of Omics Data describes some recent approaches and bioinformatics tools for the distributed preprocessing and mining of genomics, proteomics, and interactomics data. Section Data Mining in Neuroscience discusses the main workflow for analyzing magnetic resonance images in neurology with special focus on the mining of preprocessed data. Finally, the Conclusions section concludes the paper and underlines current problems and future directions.

BACKGROUND ON DISTRIBUTED AND GRID-BASED DM

High-performance computing infrastructures, such as cluster computers, supercomputers, grids, and clouds, are becoming an invaluable tool for large scale research in scientific computing, life sciences, bioinformatics, and systems biology.

Life sciences applications have to face the overwhelming amount of experimental molecular data produced by novel omics platforms (e.g., next generation sequencing, microarrays, mass spectrometry), clinical data produced by diagnostic tools (e.g., magnetic resonance, computerized tomography, positron emission tomography), and day-by-day clinical practice (e.g., electronic patient records).


High-performance computing technologies, such as grids and clouds, offer scalable solutions combining, in a seamless and economic way, shared storage and computing resources with application level services, improving performance and availability at an affordable price. In particular, grids and more recently cloud infrastructures offer native support for distributed computing, resource sharing, service orientation, and easy access by the user through web or grid portals.

Grid Computing
Grid computing is a mature distributed computing model based on large-scale resource sharing and offering high-performance computing power. A grid is a geographically distributed computing infrastructure comprising large data storage systems and high-performance computers or instruments that users access through a command line interface or through a grid portal, that is, a web-accessible grid interface. Grid resources may belong to different organizations that form virtual organizations, that is, a set of resources and services shared across different institutions.

The management of grid resources and services is performed by the grid middleware, a distributed software layer that plays a role similar to the operating system in centralized computers. The grid middleware manages the following functions: single sign-on for authentication and authorization, security and privacy, information discovery, resource and data management, communication, fault detection, and application deployment.4

The grid middleware is made of the following software layers:

• Grid fabric, which comprises the hardware and resources operated by local operating systems that compose the grid, that is, computers, clusters, storage, networks, remote instruments, and related local services;

• Grid services, which include the core grid middleware services such as authentication, authorization, resource discovery, resource management, grid scheduling, data transfer protocols, monitoring, and fault detection;

• Application toolkits, which are software tools that compose grid services and grid fabric to offer:

  • Data grid services, offering data management, data replication, and replica location functions,

  • Grid portals, allowing to access, submit, and monitor jobs on the grid through a web-based (World Wide Web) interface,

  • Remote computation and visualization,

  • Remote instrumentation and sensors,

  • Collaboration laboratory (Collaboratory), a virtual cooperative research environment that supports the work of researchers in different locations;

• Applications, that is, domain-specific applications that use the grid resources and services, such as high energy physics, climate modeling, aerospace, and chemical engineering.

The main grid middlewares adopting the layered architecture are Globus Toolkit, UNICORE, and gLite, used both in research projects and in production systems. More recently, novel grid middlewares exploiting web services, workflows, ontologies, and DM have been developed; among those, we cite the semantic grid and the knowledge grid.

Finally, important trends related to interoperable access to grids and distributed computing infrastructures, and to the use of scientific workflows in eScience, are reported in the SCI-BUS and SHIWA projects.

The SCI-BUS project (https://www.sci-bus.eu/) provides seamless access to computing, data, and networking infrastructures and services in Europe, including clusters, supercomputers, grids, desktop grids, and academic and commercial clouds. Recently, the SCI-BUS project has signed a Memorandum of Understanding with the Quantitative Biology Center (QBiC), an interdisciplinary core facility of the University of Tübingen, the University Hospital Tübingen, and the Max Planck Institute for Developmental Biology.

The SHIWA (SHaring Interoperable Workflows for large-scale scientific simulations on Available DCIs) project (http://www.shiwa-workflow.eu/) allows running scientific workflows on multiple distributed computing infrastructures in a transparent way with respect to the workflow language.

Globus Toolkit
The Globus Toolkit (http://www.globus.org) is an open source grid middleware allowing to share computing power, databases, and software tools while maintaining local autonomy. The Globus Toolkit (the current version is Globus Toolkit Version 5, GT5) includes services and libraries for resource discovery and management, security, data management, communication, fault detection, and portability. It is offered as a bag of services that can be used to develop applications. The main Globus Toolkit Version 5 modules are:

• Grid security infrastructure provides secure authentication and communication and offers single sign-on. It is based on public key infrastructure technology such as X.509 certificates and Transport Layer Security.

• GridFTP provides a secure, high-performance, reliable data transfer protocol, based on the File Transfer Protocol, optimized for wide-area networks.

• Replica location service manages the replication of data with the aim of moving huge data close to applications to improve their performance. It supports metadata registration and search.

• Grid resource allocation and management allows users to submit, monitor, and cancel jobs on the grid through the coordination of the local schedulers available on grid nodes.

UNICORE
UNICORE (UNiform Interface to COmputing REsources) is an open source grid middleware that provides easy access to distributed computational resources and databases (http://www.unicore.eu). UNICORE is used in many research projects and in many supercomputer centers. UNICORE version 6 comprises three main layers: the client, service, and system layers.

• The client layer makes the following types of clients available to the user: (1) the UNICORE command line client allows to run single jobs and workflows; (2) the Eclipse-based graphical UNICORE rich client supports the design of complex scientific workflows and gives a graphical view of the grid.

• The service layer comprises the following services: (1) the gateway component, which authenticates all incoming requests sent to a UNICORE site; (2) the UNICORE/X server, which submits jobs and manages file transfers; (3) the common information service, which implements the information services of UNICORE, including a geographical representation of the grid; (4) the workflow engine, which manages workflow execution; and (5) the service orchestrator, which runs the individual tasks in a workflow.

• The system layer includes the target system interface, which sits between the local operating system and UNICORE and maps abstract grid commands to local commands.

gLite
gLite (http://glite.cern.ch) is a grid middleware originally used by CERN to support the Large Hadron Collider experiments; it then became the official grid middleware of the Enabling Grids for E-sciencE (EGEE) European project and is now managed by the European Middleware Initiative (EMI) European project. In particular, the EMI middleware will integrate three general-purpose grid middlewares, namely the Advanced Resource Connector (ARC), gLite, and UNICORE, and the dCache storage solution. ARC is an open source grid middleware that allows the submission of jobs to various distributed systems, allowing the building of grids of varying size. It is distributed under the Apache License and has been developed by the NorduGrid collaboration. dCache is a virtual file system that stores and retrieves data among heterogeneous servers.

CONDOR
The Condor project5,6 (http://www.cs.wisc.edu/condor/) from the University of Wisconsin has been used in many different projects on the use of grid resources for the analysis of life science data. From a technical point of view, Condor is not a self-contained grid middleware, but a middleware toolkit that may be used in grid environments as well as in high-performance computing infrastructures (e.g., clusters). The main characteristic of Condor is the possibility to realize distributed applications by providing an efficient workload management system for compute-intensive jobs and by harnessing collections of dedicated or nondedicated hardware under distributed ownership. For instance, Condor is largely used as a queuing system or a job scheduler for resource subsets. The Globus Toolkit distribution includes a Condor job manager; thus, users may run a distributed application using the Condor paradigm through Globus in a transparent way.

The Semantic Grid
The semantic grid7 aims to apply to the grid the methodologies developed for the semantic web and focuses on the systematic adoption of metadata and ontologies to assign to grid resources and services a well-defined meaning, enabling better cooperation among computers and people.

The semantic web is ‘. . . an extension of the current Web in which information is given well defined meaning, better enabling computers and people to work in cooperation . . .’.8 The main goal of the semantic web is to allow web entities (software agents, users, and programs) to interoperate with each other and to dynamically discover and use resources to solve complex problems.

The semantic web comprises three layers: (1) a set of web resources described by metadata and a set of rules for inferring new metadata and knowledge through ontologies; (2) a set of basic services, including semantic search engines, and reasoning and querying over metadata and ontologies; and (3) a set of high-level applications developed using the basic services. At the core of those services are methods and tools for managing the entire life cycle of ontologies and for using them in applications.
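The first two layers can be sketched with the open source rdflib library (our illustration; the namespace, resources, and predicates are invented): resources are described by metadata triples, and a basic service then queries them.

# Sketch of the first two semantic layers: resources described by metadata
# triples, then queried. Namespace, resources, and predicates are invented.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/grid#")
g = Graph()

# Layer 1: describe two (grid) resources with metadata.
g.add((EX.node1, EX.providesService, EX.clustering))
g.add((EX.node1, EX.cpuCount, Literal(64)))
g.add((EX.node2, EX.providesService, EX.classification))

# Layer 2: a basic service, querying the metadata to discover resources.
q = """
SELECT ?node WHERE {
    ?node <http://example.org/grid#providesService>
          <http://example.org/grid#clustering> .
}
"""
for row in g.query(q):
    print("clustering available on:", row.node)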

The semantic grid initiative aims to integrate and bridge the efforts made in the grid and in the semantic web communities.7 The semantic grid vision is to incorporate into the ongoing grid the semantic web approach, that is, the systematic description of resources through metadata and ontologies, and the provision of basic services for reasoning and knowledge extraction. De Roure et al.9 declare ‘As the Semantic Web is to the Web, so is the Semantic Grid to the Grid’, forecasting for the grid an evolution similar to that of the web toward the semantic web.

The semantic grid vision is to improve the grid in ease of use and automation of processes, to facilitate collaborations and computations on a global scale. This is obtained through machine-processable knowledge available on the grid. The semantic grid fully supports the three recognized layers composing a grid, that is, the computation/data layer, the information layer (where data produces information), and the knowledge layer (where knowledge can be used to take decisions).

As reported by De Roure et al.,9 the key characteristics of the semantic grid are: (1) services for resource description, discovery, and use, able to identify content, services, and computational resources on the grid, and enabling the efficient storage and processing of huge volumes of distributed data; (2) process description based on workflows that allow the composition of resources, and distributed process (workflow) enactment; (3) autonomic behavior, that is, a semantic grid system should autoconfigure and ‘self-heal’ in the presence of faults; (4) security and trust, as the presence of multiple organizations requires authentication, encryption, and privacy, as well as proper charging and billing; and (5) annotation, which enriches the description of any digital content and can support provenance, allowing to repeat experiments and reuse the results.

Cloud Computing
Cloud computing allows accessing computers, services, and infrastructures as a utility, using the Internet as the transport layer. A main innovation of cloud computing is the possibility to access services available on the Internet using a pay-per-use model and without the need to buy large and costly hardware.10,11

Cloud computing originated from several ideas and projects developed over the last 30 years: in 1961, John McCarthy suggested the utility business model for using computing power and applications, whereas more recently, grid computing allowed users to access computing power and services on demand, as they obtain electrical power from the electric power grid.

Because cloud computing involves several aspects, such as technology, economics, and business models, many definitions of cloud computing have been produced in these last years. Recently, the National Institute of Standards and Technology (NIST) released the following definition12: ‘Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction’.

NIST also defined the basic components of the cloud computing model:

• five basic characteristics (on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service);

• three service models, that is, Cloud Software as a Service, Cloud Platform as a Service, and Cloud Infrastructure as a Service;

• four deployment models, that is, Private cloud, Community cloud, Public cloud, and Hybrid cloud.

Foster et al.11 define cloud computing as a specialized distributed computing paradigm that presents some unique characteristics, such as scalability, elasticity, and on-demand delivery of services. They do not consider cloud computing as a new computing paradigm, but as an evolution of grid computing in which computing resources are packaged and sold to the users as in public utility services, leveraging economies of scale. In fact, grids are infrastructures that deliver storage and computing, whereas clouds deliver more abstract resources and services following an economy model.

According to Ahmed et al.,10 ‘Cloud computing is a way of leveraging the Internet to consume software or other information technology services on demand’. Thus, it is more a business model than a computing paradigm.

According to Schwiegelshohn et al.,13 the grid concept will remain valid also in the presence of the emerging clouds. Moreover, the authors strongly believe that some future applications will continue to require the grid approach, and thus further research will be required to move grids into reliable, efficient, and user-friendly computing platforms.

Currently, there are both commercial clouds, such as Amazon Elastic Computing Cloud (Amazon EC2, http://aws.amazon.com/ec2/), Windows Azure (http://www.windowsazure.com), Google Cloud (http://cloud.google.com), and GoGrid (http://www.gogrid.com), and some open source software to build and operate private clouds, such as OpenNebula (http://opennebula.org), Eucalyptus (http://www.eucalyptus.com), and Nimbus (http://www.nimbusproject.org).

Parallel and Distributed DM
Even if DM algorithms deal predominantly with simple data formats, there is an increasing interest in mining complex and advanced data types, such as object-oriented, spatial, and temporal data,14 and bioimages. Nowadays, experimental datasets and databases in general are geographically distributed, often far from the applications and from the users; thus, performing DM on these distributed datasets requires new methodologies that leverage parallelism and distributed infrastructures such as grids and clouds. The significant factors14 that have led to the emergence of distributed DM from centralized mining are as follows:

• The need to mine distributed subsets of data, the integration of which is nontrivial and expensive.

• The performance and scalability bottlenecks of DM.

• Distributed DM provides a framework for scalability, which allows the splitting up of larger datasets with high dimensionality into smaller subsets that can be processed individually on separate computational resources.

A new branch of DM, called distributed DM, was born with the aim of supporting the application of mining algorithms to huge, high-dimensional datasets. For compute-intensive applications, parallelization is an obvious means for improving performance and achieving scalability. A variety of techniques may be used to distribute the workload involved in DM over multiple processors. In particular, three kinds of computing technologies are listed14 (a minimal sketch of the distributed style follows the list):

• Parallel computing. Single systems with many processors work on the same problem.

• Distributed computing. Many systems loosely coupled by a scheduler work on different parts of the same problem.

• Grid computing. Many systems tightly coupled by software, perhaps geographically distributed, are made to work together on single problems or on related problems.
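The distributed style can be sketched as follows (our illustration, with an invented toy dataset): each worker mines one partition of the data, here by counting item occurrences toward frequent-itemset support, and a coordinator merges the partial results.

# Sketch of the distributed style: each worker mines one partition of the
# data (counting item occurrences toward frequent-itemset support), and a
# coordinator merges the partial counts. Illustrative only.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

transactions = [
    {"gene_a", "gene_b"}, {"gene_a", "gene_c"},
    {"gene_b", "gene_c"}, {"gene_a", "gene_b", "gene_c"},
] * 100  # a toy dataset of 400 transactions

def count_items(partition):
    # Local mining step executed on one node/processor.
    c = Counter()
    for t in partition:
        c.update(t)
    return c

if __name__ == "__main__":
    n = 4  # number of partitions (one per worker)
    parts = [transactions[i::n] for i in range(n)]
    with ProcessPoolExecutor(max_workers=n) as pool:
        partial = pool.map(count_items, parts)
    # Global integration: sum partial counts, keep items above min support.
    total = sum(partial, Counter())
    min_support = 0.5 * len(transactions)
    print({item: c for item, c in total.items() if c >= min_support})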

The main function of a grid-based DM system is to facilitate the sharing of data, DM programs, processing units, and storage devices in order to improve existing DM applications and enable novel ones.15 In the following sections, we describe two main grid-based DM suites, named knowledge grid16 and data mining grid.17

The Knowledge Grid
Before the appearance of the semantic grid concept, a first evolution of the grid was the so-called data grid, an enhancement of grids that is able to handle large data sets in distributed data-intensive applications and that is specifically designed to store, move, and manage large data sets located in remote sites.

A further advancement of the data grid concept is the development of knowledge-based grids18 that support the process of analysis, inference, and discovery over data available on the grid. The creation of knowledge grids19,20 may allow the development of high-performance knowledge discovery (KD) processes facing the challenges posed by many application domains, including life sciences applications. Knowledge grids offer high-level services for the distributed mining of data repositories available on the grid; thus, they realize the higher layer of the grid architecture. The main issue for the development of the knowledge layer in grids is the ability to synthesize useful knowledge from data through distributed DM, exploiting the grid infrastructure to perform data-intensive large-scale computations.


The knowledge grid is a joint research project of ICAR-CNR, the University of Calabria, and the University of Catanzaro that implemented an environment for geographically distributed high-performance KD applications on the grid.16 The knowledge grid exploits semantic descriptions of services and data through metadata and ontologies and provides several grid-based KD services that allow users to create and manage distributed KD applications. Applications are modeled through workflows that integrate data sets, mining tools, and computing and storage resources on the grid. Users can compose, store, share, and execute these KD workflows and can publish them as new components and services on the grid.

The knowledge grid architecture comprises a core K-grid layer, whose services communicate with the basic grid services, and a high-level K-grid layer that offers to the user a set of services for the design and execution of KD applications. Such architecture includes several knowledge bases that store metadata about resources, workflows of KD applications, and the knowledge models obtained as results of KD applications. VEGA (Visual Environment for Grid Applications) is a software prototype that implements the main modules of the knowledge grid.21

The Data Mining Grid
The DataMiningGrid17 is a software platform for running distributed DM applications on the grid. Similar to the knowledge grid, it uses a layered architecture: the software and hardware resources layer is at the bottom; the Globus Toolkit sits at the middle layer, offering core grid middleware functions; and the DataMiningGrid client is used to design and run distributed DM applications, which are represented by workflows composed using the TRIANA (http://www.trianacode.org/) workflow system.

The DataMiningGrid has been developed according to three main principles: service-oriented architecture (SOA), standardization, and open technology. Among others, it provides the following positive features: (1) user friendliness, supporting different user interfaces; (2) extensibility, as it allows users to quickly grid-enable existing DM applications using a workflow editor; (3) parameter sweep execution, as it allows the definition of multijobs that iterate over application parameters; (4) use of standards such as WSRF (Web Services Resource Framework); and (5) open source implementation, as it is open source and freely available.

Grids and Clouds for Life Sciences
In these last few years, many life sciences applications have been developed on grids (and more recently on clouds), yielding specialized grid systems named BioGrids. BioGrids are high-performance computing infrastructures dedicated to solving biomedical and bioinformatics problems on the grid; they usually implement a virtual collaborative laboratory integrating biological, medical, and bioinformatics tools, and often manage biology and medical ontologies.22 Some relevant BioGrids are reported below22: (1) the SARSGrid initiative supports the global medical emergency related to the SARS disease (http://access.ncsa.illinois.edu/Stories/SARS/); (2) the eScience Diagnostic Mammography National Database (eDiaMoND) is a grid-enabled database of annotated mammograms used for the screening of breast cancer (http://www.ediamond.ox.ac.uk); (3) the Biomedical Informatics Research Network is a grid specialized to study different aspects of neurological diseases (http://www.birncommunity.org); (4) the Genome Grid provides grid-enabled bioinformatics tools for genome analyses (http://gmod.org/wiki/Genome grid).

The book3 describes further applications of the grid to life sciences research; among those, we cite ‘High Performance BLAST Over the Grid’, ‘PheGee@Home: A Grid-Based Tool for Comparative Genomics’, ‘High-Throughput Data Analysis of Proteomic Mass Spectra on the SwissBioGrid’, ‘ProGenGrid: A Grid Problem Solving for Bioinformatics’, the ‘BioSimGrid Biomolecular Simulation Database’, ‘IntegraEPI: Epidemiologic Surveillance on the Grid’, the ‘Health-e-Child Project’, and so on.

The work23 reports the porting of the well-known Snake24 segmentation algorithm to the grid and its application to the analysis of magnetic resonance images in cardiology.

Recently, clouds are also being used in life sciences applications; for instance, the Amazon Elastic Computing Cloud has made available the Annotated Human Genome Database provided by ENSEMBL and the UniGene database provided by the National Center for Biotechnology Information (http://aws.amazon.com/publicdatasets). As reported by Dudley and Butte,25 clouds not only can offer elastic computational power to bioinformatics applications, but they can also make computational analyses more reproducible because of the availability of more instances of bioinformatics applications that are stored and shared in the clouds. A comprehensive list of bioinformatics tools made available through the cloud is reported in Schatz et al.26 For instance, Cloud BioLinux (cloudbiolinux.org) provides high-performance bioinformatics tools on different cloud platforms.27

FIGURE 1 | The Weka explorer interface.

Software Tools for DM
Here, we report two relevant DM suites, Weka and RapidMiner, with a special focus on Weka and its parallel and distributed variants. Recently, Talia and Trunfio28 presented techniques, algorithms, and systems based on the service-oriented paradigm for the distributed mining of data.

Weka and its Parallel and Distributed Versions
A well-known open-source software that implements DM algorithms is Weka (http://www.cs.waikato.ac.nz/ml/weka/), standing for Waikato Environment for Knowledge Analysis.29 More precisely, Weka is a collection of algorithms and techniques for automatic learning entirely written in the Java programming language (see Figure 1).

In the bioinformatics field, it has been used for automated protein annotation by Kretschmann et al.30 and Bazzan et al.,31 for probe selection for gene-expression arrays by Tobler et al.,32 for developing a computational model for frame-shifting sites by Bekaert et al.,33 and for plant genotype discrimination by Taylor et al.34 Many of the algorithms available in Weka are described in Frank et al.35

Thus, Weka can be seen as the de facto standard in the field of machine learning in general. Although Weka supports a wide range of formats for importing data, surprisingly, neuroimage data in the Analyze or NIfTI formats, which are standard formats for MR images in the neuroscience community, are not supported. Pyka et al.36 present a tool that converts functional magnetic resonance imaging (fMRI) data into a form that can be handled by Weka. In particular, this tool allows the user to load, modify, and convert neuroimaging data to the .arff Weka format.

Celis and Musicant37 presented Weka-Parallel, a modification of Weka that performs n-fold cross-validations in parallel. Weka-Parallel expands the original Weka framework by adding parallelism so as to decrease the amount of time necessary to run cross-validation on a dataset using any given classifier.
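The fold-level parallelism of Weka-Parallel can be sketched in Python (Weka itself is Java; this is our own illustration on a standard dataset, not Weka-Parallel's code): each fold of the cross-validation is trained and tested in a separate process.

# Sketch of fold-level parallelism as in Weka-Parallel: each of the n folds
# of a cross-validation is trained and tested concurrently. Illustrative.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

def evaluate_fold(split):
    # Train on the fold's training indices, score on its test indices.
    train_idx, test_idx = split
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    return clf.score(X[test_idx], y[test_idx])

if __name__ == "__main__":
    folds = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X))
    with ProcessPoolExecutor() as pool:  # one fold per worker
        scores = list(pool.map(evaluate_fold, folds))
    print("10-fold CV accuracy:", np.mean(scores))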

Another modification of the original Weka, called Grid Weka, was developed by University College Dublin; the two main components of the system are the Weka server and the Weka client. The former is based on the original Weka, and each machine participating in a Weka Grid runs the server component. The Weka client is responsible for accepting a learning task and input data from a user and for distributing the work on the grid.38

Perez et al.39 proposed a vertical and generic architecture, named DMGA (Data Mining Grid Architecture), in which the main functionalities of every stage are deployed by means of grid services. The same authors adapted the Weka Data Mining Toolkit to a grid-based environment,40 called WekaG. WekaG has a server side, which is responsible for the creation of instances of grid services by using a factory pattern. On the client side, an interface to the users is responsible for communicating with the grid service.

FIGURE 2 | The RapidMiner interface.

The SOA paradigm can be exploited for the implementation of data and knowledge-based applications in distributed environments. WSRF has recently emerged as the standard for the implementation of grid services and applications, and it can be exploited for developing high-level services for distributed DM applications. Weka4WS is an evolution of the knowledge grid that adopts the WSRF technology for running remote DM algorithms and managing distributed computations. The Weka4WS user interface supports the execution of both local and remote DM tasks.41

RapidMiner
YALE, now called RapidMiner (see Figure 2), is an open-source system for DM that models the KD process as an operator tree.42 Similar to programming languages, operator trees allow concepts like loops, conditions, or other meta application schemes. The leaves in the operator tree correspond to simple steps in the modeled process, like learning a prediction model or applying a preprocessing filter. Inner nodes of the tree correspond to more complex or abstract steps in the process; this is often necessary if the children should be applied several times, for example, in loops. In general, inner operator nodes define the data flow through their children. The root of the tree corresponds to the whole experiment.42
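The operator-tree idea can be rendered in a few lines (our sketch with invented class names, not RapidMiner's actual operator classes): leaves wrap simple steps, while inner nodes such as chains and loops define the data flow through their children.

# Sketch of an operator tree: leaves are simple steps, inner nodes define
# the data flow through their children. Class names are invented.
class Leaf:
    def __init__(self, fn):
        self.fn = fn
    def run(self, data):
        return self.fn(data)

class Chain:  # inner node: pass data through its children in sequence
    def __init__(self, *children):
        self.children = children
    def run(self, data):
        for child in self.children:
            data = child.run(data)
        return data

class Loop:  # inner node: apply its child several times
    def __init__(self, child, times):
        self.child, self.times = child, times
    def run(self, data):
        for _ in range(self.times):
            data = self.child.run(data)
        return data

# The root of the tree corresponds to the whole experiment.
experiment = Chain(
    Leaf(lambda d: [x for x in d if x is not None]),  # preprocessing filter
    Loop(Leaf(sorted), times=2),                      # a repeated step
)
print(experiment.run([3, None, 1, 2]))  # -> [1, 2, 3]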

DISTRIBUTED PROCESSING AND MINING OF OMICS DATA

Recently, computational biology has been focusing on the study of living organisms at the molecular scale, using a system-level approach. Consequently, novel disciplines have been emerging, often referred to as the ‘omics’ disciplines. The omics term refers to different biology disciplines such as, for instance, genomics, proteomics, or interactomics (see Box 1).

The suffix ‘-ome’ is used to indicate the objects of study of such disciplines, such as the genome, proteome, or interactome, and usually refers to a totality of some sort. The main common characteristic of such disciplines is the availability of many high-throughput technologies able to produce large amounts of data in a relatively short time. Consequently, the increasing availability of omics data due to the advent of high-throughput technologies poses novel issues of data management and analysis that can be faced by parallel and distributed storage systems and algorithms.

Genomics and Microarray Data


BOX 1: OMICS DATA

The ‘omics’ neologism informally refers to different biology fields whose name ends in omics, such as genomics, proteomics, interactomics, and so on. The suffix ‘-ome’ is used to indicate the objects of study of such disciplines, such as the genome, proteome, or interactome, and usually refers to a totality of some sort. Main omics disciplines, such as genomics, proteomics, and interactomics, respectively refer to the study of the genome, proteome, and interactome of an organism. Such disciplines are gaining an increasing interest in the scientific community because of the availability of novel, high-throughput platforms for the investigation of the cell machinery, such as mass spectrometry, microarray, and next generation sequencing, which are producing an overwhelming amount of experimental omics data. The increased availability of omics data poses new challenges for their efficient storage, preprocessing, and analysis. Moreover, omics data are more and more stored in various databases spread all over the Internet. Thus, managing omics data requires both large data stores as well as tools and infrastructures for data preprocessing, analysis, and sharing.71

Genomics data refer to genes and in general to sequences of nucleic acids forming DNA, RNA, mRNA, or miRNA. A main technique to detect expression levels of genes or to genotype multiple regions of a genome (e.g., to find single nucleotide polymorphisms) in a biological sample is the microarray. A DNA microarray contains a set of DNA spots, each one containing a specific DNA sequence (said probe) that is a short section of a gene or other DNA element used to hybridize a cDNA or cRNA sample (target). The probe-target hybridization is detected and quantified using different techniques, determining the relative abundance of nucleic acid sequences in the target. For instance, Affymetrix microarray files (said CEL files) store the results of the intensity calculations obtained from the pixel values of the raw image files (also referred to as DAT files). In a typical case-control study, microarray is applied to a set of samples belonging to two classes (case and control), producing a set of CEL files that need to be preprocessed before any statistical or DM analysis. The preprocessing of an Affymetrix dataset receives in input a set of CEL files and several chip-specific libraries, and produces as output a matrix whose element (i,j) represents the intensity of the ith gene in the jth sample. Preprocessing can be structured as: (1) background correction and quality control, (2) normalization, (3) summarization, and (4) annotation. Background correction aims to identify the background noise and to remove it. Normalization reduces the bias among chips and within different regions of the same chip, removing nonbiological variability within a dataset. Summarization combines multiple probe intensities into a single expression value; in fact, all arrays employ more than one probe for each gene. Annotation associates to each probe its known annotations, such as gene symbol or gene ontology terms, by matching probes to public databases or knowledge bases. Annotation libraries are provided by the chip manufacturer and contain different levels of annotation.

Proteomics data refer to proteins and in general to sequences of amino acids (e.g., short sequences forming peptides). A main technique to investigate proteins in a qualitative and quantitative way is mass spectrometry, an analytical tool used for measuring the molecular mass of a sample. Mass spectrometry-based proteomics is a powerful technique for identifying molecular targets in different pathological conditions.49 Mass spectrometry output can be represented as a (large) sequence of value pairs, said spectrum. Each pair contains a measured intensity, which depends on the quantity of the detected biomolecule, and a mass to charge ratio (m/z), which depends on the molecular mass of the detected biomolecule. File dimensions range from a few kilobytes per spectrum to a few gigabytes. This variability depends on the type of spectrometer and on the bin dimension, that is, the total number of measurements. Increasing either the resolution of the spectrometer or the number of analyzed biological samples may lead to very huge datasets that require large storage systems and high computing power. Finally, the measurements contained in a spectrum may be affected by noise, so spectra preprocessing aims to correct intensity and m/z values to reduce noise, reduce the amount of data, and make spectra comparable.72 Typical applications in proteomics are biomarker discovery, that is, the identification of spectra peak patterns discriminating case versus control samples; peptide/protein identification, that is, the identification through tandem mass spectrometry of the sequences of peptides/proteins contained in a sample; and finally, protein quantification, that is, the measurement of the relative quantity of proteins using labeling techniques, such as ICAT or SILAC, coupled to tandem mass spectrometry.50

Interactomics data refer to the interactions occurring among biomolecules and in particular among proteins. PPIs are represented as pairs of protein identifiers (Pi, Pj), whereas the complete set of interactions occurring in an organism, named PIN, is the graph obtained by representing all the PPIs of that organism.59 In such networks, nodes are proteins and edges are the pairs of interacting proteins. Protein interactions are detected using different experimental platforms, such as Yeast2Hybrid or mass spectrometry, and are stored in different protein interaction databases such as DIP, BIND, MINT, MIPS, and so on. Main methods for the analysis of protein interaction data comprise: (1) complex prediction, (2) pathway extraction, (3) network alignment, and (4) semantic annotation.


Microarray technology is used in biology and medicine to explore the behavior of genes in biological processes, with the final goal of understanding the modifications of such behavior that may be related to disease progression. Initially, microarray-based experiments were focused on the study of few samples due to economic and technological limitations. Now, thanks to recent developments that resulted in the production of cheaper arrays, the size of experiments is rapidly increasing.

A common study using microarrays involves the recruitment of a set of samples (e.g., serum, tissues, or cell lines) that belong to different classes. Then, the nucleic acids of the samples are extracted and analyzed through microarrays, which enable the investigation of all the genes, showing their level of expression and thus of activity, or alternatively the determination of the sequences of genes (in next generation sequencing experiments) or the determination of mutations related to drug response (e.g., in DMET experiments).43

Raw data (usually binary images) produced by the instrumentation, for example, by Affymetrix platforms, are mined by employing a workflow structured in four main phases: (1) preprocessing, which comprises the translation of images into numerical data and noise removal; (2) annotation, that is, the integration of numerical data and biological information; (3) statistical-DM analysis; and (4) biological interpretation.44,45

The typical dimensionality of a microarray dataset is growing because of two main factors: (1) the increasing dimensionality of the files encoding a single chip, and (2) the growing number of samples, and thus arrays, that are usually produced in a single experiment. Consider, for instance, two common Affymetrix microarray files (also known as CEL files): the older Human 133 chip CEL file has a dimension of 5 MB and covers 20,000 different genes, whereas the newer Human Gene 1.0 ST has a typical dimension of 10 MB and covers 33,000 genes. A single array of the Exon family (e.g., Human Exon or Mouse Exon) can be up to 100 MB in size. Finally, a recent trend in genomics is to perform microarray experiments considering a large number of patients.46

From this scenario arises the need for tools and technologies to process such huge volumes of data in an efficient way. A possible way to achieve efficient preprocessing of microarray data is the parallelization of existing algorithms on multicore architectures.47,48

In such a scenario, the whole computation is distributed onto different processors that perform computations on smaller sets of data, and the results are finally integrated. Such a scenario requires the design of new algorithms for summarization and normalization that take advantage of the underlying parallel architectures. Nevertheless, a first step in this direction can be represented by the replication, on different nodes, of existing preprocessing software running on partitions of the initial datasets.
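The following sketch mimics this replication scheme on a single machine (our illustration with synthetic intensities; it is not affyPara or μ-CS code): every worker applies the same simplified background correction, log transformation, and probe summarization to its own partition of arrays, and the per-partition expression matrices are then joined.

# Sketch of replicated preprocessing on data partitions: every worker runs
# the same simplified pipeline on its subset of arrays. Synthetic data only.
# NOTE: cross-chip normalization (e.g., quantile) needs global information
# and is the step that genuinely requires new distributed algorithms.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

N_PROBES, PROBES_PER_GENE = 1200, 4  # toy chip layout (invented)

def preprocess(arrays):
    # arrays: (n_samples, N_PROBES) raw intensities from one partition
    bg = np.percentile(arrays, 5, axis=1, keepdims=True)
    corrected = np.clip(arrays - bg, 1, None)  # background correction
    logged = np.log2(corrected)                # variance stabilization
    genes = logged.reshape(arrays.shape[0], -1, PROBES_PER_GENE)
    return np.median(genes, axis=2)            # summarization per gene

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cel_like = rng.gamma(2.0, 200.0, size=(40, N_PROBES))  # 40 fake arrays
    parts = np.array_split(cel_like, 4)        # one partition per node
    with ProcessPoolExecutor(max_workers=4) as pool:
        matrices = list(pool.map(preprocess, parts))
    expr = np.vstack(matrices)                 # (40, 300): samples x genes
    print(expr.shape)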

Despite its relevance, the parallel processing of microarray data is a relatively new field in which many projects are currently in their initial stage.

One of the main works is affyPara,47 a Bioconductor package for the parallel preprocessing of Affymetrix microarray data, freely available from the Bioconductor project. Similarly, the μ-CS project44 presents a framework for the analysis of microarray data based on a distributed architecture made of different, internally parallel web services for the annotation and preprocessing of data. Compared with affyPara, such an approach presents three main differences: (1) the possibility to realize more summarization schemes, such as Plier; (2) the easy extension to new SNP arrays; and (3) the fact that it does not require the installation of the Bioconductor platform.

Proteomics and Mass Spectrometry Data
Proteomics regards the qualitative and quantitative study of the proteins expressed in an organism or a cell. The increasing application of proteomics experiments and the increasing resolution offered by technological platforms, especially in mass spectrometry-based high-throughput proteomics,49 make the analysis of proteomics experiments difficult and error prone without efficient algorithms and easy-to-use tools. Thus, a novel field, called computational proteomics, has been introduced, regarding the computational methods, algorithms, databases, and methodologies used to manage, analyze, and interpret the data produced in proteomics experiments.50

Mass spectrometry-based proteomics requires computational methods for analyzing data both in qualitative proteomics (e.g., the identification of the proteins present in a sample analyzed by mass spectrometry) and in quantitative proteomics (e.g., the determination of the quantity of protein expressed).

Distributed approaches of analysis in computational proteomics are often based on the idea of realizing a virtual laboratory,51 that is, a set of distributed laboratories sharing data, algorithms, and tools that perform different steps of data analysis in a coordinated fashion. A main issue in the realization of such distributed laboratories is represented by the data heterogeneity introduced by the several available platforms, as well as by the need to conduct repeatable biomedical experiments on large populations (e.g., for outcome research), which demands suitable standards for the representation, storage, transmission, and sharing of proteomics data among different research centers.

Raw spectra generated by both MALDI–MS and LC–MS instruments are usually stored in flat files encoded in proprietary formats. The heterogeneity of the available formats causes a set of issues, such as data portability and experiment reproducibility, that have been faced with the introduction of standardized data formats. Standard formats can be used by different algorithms, allowing an easy data exchange between geographically disseminated laboratories, and can be stored in different databases for further use.

Two main efforts toward the standardization of MS data formats, both based on XML, have been introduced. The first one, mzXML, is being developed at the Seattle Proteome Center of the Institute for Systems Biology.52 The second one, mzData,53 has been developed by the Proteomics Standard Initiative of the Human Proteome Organization. Currently, there exists a new standard format, namely mzML,54 that merges the two previous standards. Most mass spectrometers do not directly produce mzXML or mzData files, but several tools are available that generate these files from native acquisition files. For instance, an open source project known as Sashimi offers a collection of converter programs for some common mass spectrometric file formats.
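As an illustration of consuming such standard formats, the sketch below reads spectra from an mzML file and applies a simple intensity-threshold denoising plus m/z binning. It assumes the open source pyteomics library; the input file name and all thresholds are hypothetical.

# Sketch: reading spectra from a standard mzML file and applying a simple
# intensity-threshold denoising plus m/z binning, so that spectra of
# different resolution become comparable. Assumes the pyteomics library.
import numpy as np
from pyteomics import mzml

def load_and_bin(path, bin_width=1.0, noise_quantile=0.25):
    binned = []
    with mzml.read(path) as reader:
        for spectrum in reader:
            mz = spectrum["m/z array"]
            inten = spectrum["intensity array"]
            # crude denoising: drop peaks below an intensity quantile
            keep = inten > np.quantile(inten, noise_quantile)
            mz, inten = mz[keep], inten[keep]
            # binning: sum intensities falling into fixed-width m/z bins
            bins = np.arange(mz.min(), mz.max() + bin_width, bin_width)
            hist, _ = np.histogram(mz, bins=bins, weights=inten)
            binned.append(hist)
    return binned

spectra = load_and_bin("sample.mzML")  # hypothetical input file
print(len(spectra), "spectra binned")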

On the other hand, the realization of such virtual laboratories needs the introduction of algorithms, tools, and platforms to manage and analyze proteomics data. In particular, the introduction of high-performance infrastructures for preprocessing and mining spectra data represents one of the most relevant research areas. Such approaches are usually based on the use of the grid infrastructure as computing middleware and of workflow technology for assisting the composition of ‘in silico’ experiments.

BioDCV55 is a distributed computing framework composed of a suite of modules enabling researchers to plan and execute an analysis of both proteomics and genomics data. myGrid56 is a grid-based toolkit for composing and executing in silico experiments. myGrid initially focused on bioinformatics applications, but currently users can define their own web services without any restriction on the type of science. ProGenGrid (Proteomic and Genomic Grid)57 is a project that aims to deploy different bioinformatics tools over the grid. It is accessed through a grid portal that uses the well-known GRB (Grid Resource Broker) middleware, a system used in the Globus architecture to hide the grid middleware details and provide a standardized interface to grid resources. MS-Analyzer58 is a tool for the management and analysis of proteomics data that takes into account such requirements. MS-Analyzer uses ontologies to model the proteomics domain and its data analysis methodologies, and a novel workflow-based application designer to design proteomics experiments, that is, workflows of activities involving different bioinformatics tools (e.g., preprocessing, DM, visualization).

Interactomics and Protein Interaction Networks

Interactomics is the study of the whole set of protein-to-protein interactions (PPIs) within the cells of a living organism, often referred to as protein interaction networks (PINs).59

Modeling and storing PINs is often realized by using graph theory. In this formalism, proteins are represented as nodes, whereas their interactions are the edges connecting them. Graphs representing PINs can be used as input for DM algorithms, where the raw data are binary interactions forming interaction graphs, and analysis algorithms retrieve biologically meaningful properties by investigating graph topology. Once an interaction network is modeled as a graph, biological properties can be studied using graph-based algorithms60,61 and by associating graph properties with biological properties of the modeled PPIs.62 Algorithms for the analysis of local properties of graphs may be used to analyze local properties of PPI networks; for example, a dense distribution of nodes in a small graph region may indicate proteins (nodes) and interactions (edges) relevant to a biological function. Emerging approaches to analyzing PINs also consider the semantic similarity of proteins63 in addition to the topological properties of the graphs. Semantic information is usually formed by annotations extracted from the Gene Ontology and may be used to highlight subgraphs of the PINs where semantic properties, such as biological process or function, are over- or under-represented.64
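To make the graph formalism concrete, the following Python sketch (using the open-source NetworkX library, which is not mentioned in the paper; the proteins and interactions are invented) computes some of the local topological properties discussed above:

```python
import networkx as nx

# Toy PIN: nodes are proteins, edges are binary interactions.
ppi = nx.Graph()
ppi.add_edges_from([
    ("P1", "P2"), ("P1", "P3"), ("P2", "P3"),  # a dense triangle
    ("P3", "P4"), ("P4", "P5"),                # a sparser chain
])

degree = dict(ppi.degree())           # high-degree nodes are candidate hubs
clustering = nx.clustering(ppi)       # local density around each protein
cliques = list(nx.find_cliques(ppi))  # maximal cliques: candidate complexes

print(degree)
print(clustering)
print(cliques)
```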

In such a scenario, many different laboratories produce data by using different experimental techniques. The data can then be modeled as a graph and stored in repositories by using different technical solutions. Finally, the data stored in such databases can be mined to derive novel interactions or to extract functional modules, that is, subgraphs of the PPI network that have a biological meaning.

Distributed Analysis on Interactomics Data

The distributed processing of protein interaction data consequently involves the following activities: (1) sharing and dissemination of PPI data among different databases, (2) collection of data stored in heterogeneous databases, and (3) distributed analysis.

The first activity requires the development of both standards and tools to manage the process of data curation and exchange between interaction databases. Currently, there is an ongoing project, the International Molecular Exchange Consortium (IMEx),65 devoted to building an enabling framework for data exchange. It is based on an existing standard for protein interaction data, the HUPO PSI-MI format.66 Databases that participate in this consortium accept the deposition of interaction data from authors, helping researchers annotate their datasets through a set of ad hoc developed tools.

The second activity requires solving the classical bioinformatics problem of linking identical data identified by different primary keys. The cPath tool62 is open source software for collecting and storing pathways coming from different data sources. From a technological point of view, it is an open source database integrated in a web application capable of collecting data from different data sources and exporting these data through a web service interface.

Finally, the rationale for the third activity lies in the algorithmic nature of problems regarding graphs. A large class of algorithms that mine interaction data can be mapped to the classical problems of graph and subgraph isomorphism, which are computationally hard.67 Hence, the need for highly powerful computational platforms arises. Currently, different software tools that mine PINs are available through web interfaces. For instance, the IMPRECO68 framework is able to detect protein complexes in a PIN by employing a distributed architecture based on web services. Similarly, NetworkBlast69 and Graemlin,70 which allow the comparison of multiple interaction networks, are both available through a web interface. Network alignment algorithms usually employ different heuristics to face the subgraph isomorphism problem. They are nevertheless time consuming, and the size of the input data is still growing, so the development of high-performance architectures will be an important challenge in the future.
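To illustrate why these problems are computationally hard, the following sketch (again using NetworkX, with an invented network and motif) enumerates occurrences of a triangle motif in a PIN via exact subgraph isomorphism; the underlying matching problem grows combinatorially with network size:

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Invented target PIN and a triangle motif to search for.
pin = nx.Graph([("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P3", "P4")])
motif = nx.Graph([("a", "b"), ("b", "c"), ("c", "a")])

matcher = isomorphism.GraphMatcher(pin, motif)
print(matcher.subgraph_is_isomorphic())  # True: the motif occurs in the PIN
for mapping in matcher.subgraph_isomorphisms_iter():
    print(mapping)  # each occurrence; enumeration is exponential in general
```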

DM IN NEUROSCIENCE

The acquisition of brain images with diagnostic techniques such as MRI, diffusion tensor imaging (DTI), or PET, and their further processing, involves a large amount of data from different temporal and spatial scales. The aim of mining data from neuroimages is to provide a basis for decisions, both immediate and long term, especially for the early detection of neurodegenerative diseases such as Alzheimer's disease and its precursor, mild cognitive impairment (MCI), or Parkinson's disease. A machine learning approach has proved necessary where data from neuroimaging are merged with data from multiple sources, as in the case of Alzheimer's disease.73

Volume measures and signal intensity data of anatomical areas of the brain are produced during the segmentation and reconstruction of neuroimages, in a manual, semi-automatic, or automatic manner. Ye et al.73 note that in addition to neuroimaging data, researchers take into account demographic information such as age or gender, genetic markers such as possession of the ApoE4 allele, and other multisource data associated with Alzheimer's disease. Thus, the bottleneck of the early detection of Alzheimer's disease is integrating the latter information with data from different neuroimaging sources. Akil et al.74 emphasize how strong the need for neuroinformatics is, and how neuroinformatics can use a variety of tools to decipher and understand the 'neural choreography' strictly connected to brain functions. Current research focuses on the use of both regions of interest (ROIs) and voxel-based approaches for extracting the features of interest from a neuroimaging technique, for example, from a structural MRI or a PET. The integration of information from ROI- and voxel-based approaches across different neuroimaging sources, together with additional information such as demographic and genetic data, can significantly improve the early detection of Alzheimer's and Parkinson's diseases.73 DM in brain imaging can therefore bring new knowledge and improve decision making by merging 'multisource' data from diagnostic imaging, genomics, and the collection of epidemiological information.

The complete workflow for the discovery of knowledge from neuroimaging, as normally applied in the clinical field, is therefore composed of many steps, shown in Figure 3: (1) acquisition of structural and/or functional bioimages, which produces images in the standard DICOM format; (2) segmentation and reconstruction of the bioimages with Freesurfer and/or FSL, two well-known tools used in medicine, which produce data or intensity measures in numeric format, and tractography, which returns values of fractional anisotropy for each tract; (3) application of statistical techniques and normalization of the data, producing tabular formats (Excel or tab-delimited files); (4) application of mining algorithms using Weka (which does not recognize Excel files and requires a conversion to its only recognized format, ARFF), RapidMiner, or other DM platforms; (5) representation of the extracted knowledge by means of easy-to-understand graphs, for example, the so-called Receiver Operating Characteristic (ROC) curves (see Box 5).

FIGURE 3 | Workflow diagram for knowledge discovery in neuroscience.
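Step 4 above hinges on a format conversion. As a minimal sketch (the file names, column layout, and class labels are hypothetical), the following Python function turns a tab-delimited table of numeric features with a class label in the last column into the ARFF format Weka expects:

```python
import csv

def tsv_to_arff(tsv_path, arff_path, relation="neuro_features",
                classes=("normal", "abnormal")):
    """Convert a tab-delimited feature table into Weka's ARFF format.

    Assumes all columns but the last are numeric features and the
    last column holds the class label.
    """
    with open(tsv_path, newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    header, data = rows[0], rows[1:]

    with open(arff_path, "w") as out:
        out.write(f"@RELATION {relation}\n\n")
        for name in header[:-1]:
            out.write(f"@ATTRIBUTE {name} NUMERIC\n")
        out.write(f"@ATTRIBUTE {header[-1]} {{{','.join(classes)}}}\n")
        out.write("\n@DATA\n")
        for row in data:
            out.write(",".join(row) + "\n")

# Example: tsv_to_arff("hippocampus_volumes.tsv", "hippocampus_volumes.arff")
```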

Neuroimaging Methods

The primary source of information in neuroscience is brain bioimages; there are different imaging methods for acquiring neuroimages, the first of which is MRI. MRI can visualize brain anatomy with a high degree of contrast between different tissue types. Researchers use MRI to measure specific areas of the brain, such as the hippocampus, entorhinal cortex, or amygdala, to detect the abnormal volume changes that characterize neurodegenerative diseases such as Alzheimer's disease.

Magnetic resonance is often divided into structural MRI and fMRI.75 Symms et al.75 state that the division between structural and functional MRI is difficult because the structure and function of the brain are often 'inextricably intertwined'. We can generically say that structural imaging provides static anatomical information, whereas functional imaging provides dynamic physiological information. An appropriate diagnostic technique must be selected according to the neurodegenerative disease the researcher wants to investigate. Volume measurement of the hippocampus can be done using coronal T1 imaging, which provides three-dimensional, high-resolution images. This kind of image is the basis for many cross-sectional and longitudinal studies determining volume changes in the hippocampus over time in hippocampal sclerosis and Alzheimer's disease.75

Functional magnetic resonance comprises a broad variety of techniques, ranging from the BOLD (blood oxygen level dependent) technique, through phase-contrast flow measurement, to cerebrospinal fluid (CSF) pulsation measurements. Regional cerebral blood flow (CBF) reflects neuronal activity,76 and this concept is the basis for all hemodynamic-based brain imaging techniques in use today. Functional MRI has since been the tool of choice for visualizing neural activity in the human brain. fMRI has been extensively used to investigate various brain functions including vision, motor function, language, and cognition, and for these reasons it is a very powerful method to map brain functions with relatively high spatial and temporal resolution.

DTI is based on basic NMR principles that encode molecular diffusion effects in the NMR signal by using bipolar magnetic field gradient pulses.77 In the white matter, diffusion MRI has already shown its potential in diseases such as multiple sclerosis; however, DTI offers more through the separation of mean diffusivity indices, such as the trace of the diffusion tensor, which reflects overall water content, and anisotropy indices, which point toward myelin fiber integrity.

Automated and Semi-Automated Tools for Brain Image Segmentation and Reconstruction

Once neuroimages are collected and before DM techniques are applied, the data need to be processed.78 This phase identifies and normalizes the brain structures that must be stored and subsequently analyzed. In studies of anatomical zones, after the image is collected, each lesion, area, or structure is outlined (segmented) as an ROI, and this is done for each slice in an automatic, semi-automatic, or manual way. The first segmentation methods were based on intensity, but because many structures are not distinguishable based solely on signal intensity, a priori information, such as intensity gradients or spatial density distributions, is integrated. During the preprocessing phase, the image can be segmented using an atlas or a per-voxel probability map. The most widely used open-source software platforms in the field of neuroscience are Freesurfer (http://surfer.nmr.mgh.harvard.edu/) and FSL (http://www.fmrib.ox.ac.uk/fsl/).

Specifically, Freesurfer brings together a set of tools for the reconstruction of the brain cortical surface from structural MRI data. It includes features such as the representation of the cortical surface between the gray and white matter, the representation of the pial surface, the segmentation of white matter from the rest of the brain, and skull stripping. The tools, inputs, and parameters Freesurfer needs are invoked from the command line, and if the user requires the execution of many commands, he/she needs to write scripts. For visualization of the segmented images, two packages are available, tkmedit and tksurfer, which can be invoked only after the reconstruction has been performed.
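Such scripting is typically a thin wrapper around Freesurfer's command-line tools. As a hedged sketch (the subject identifiers and file paths are invented; it assumes Freesurfer is installed and its SUBJECTS_DIR is configured), a batch of reconstructions with Freesurfer's recon-all pipeline could be driven from Python as follows:

```python
import subprocess

# Hypothetical mapping of subject IDs to structural MRI volumes.
subjects = {"subj01": "data/subj01.nii", "subj02": "data/subj02.nii"}

for subj, nifti in subjects.items():
    # recon-all runs the full Freesurfer reconstruction for one subject.
    subprocess.run(
        ["recon-all", "-s", subj, "-i", nifti, "-all"],
        check=True,  # abort the batch if a reconstruction fails
    )
```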

The FSL software includes a library of tools for the analysis of brain images from fMRI, MRI, and DTI. As with Freesurfer, the FSL tools are executed from the command line, and only a few of them come with a GUI (graphical user interface). Freesurfer and FSL are fully compatible and their capabilities are fully integrated. The main purpose of their interoperability is to obtain the anatomical (structural) data via Freesurfer and then to extend them with functional data through FSL. The coupled use of Freesurfer and FSL is therefore a powerful instrument for neurologists, but their use is penalized by the lack of a graphical interface for running the tools.

Statistical Methods for Feature Extraction

DM is heavily dependent on statistical methods for discovering associations and for classification. One of the most commonly used statistical approaches is statistical parametric mapping, which analyzes each voxel independently of the others and builds a map of statistical values for each of them. The significance of each voxel may be verified statistically with a Student's t-test, a Fisher test, a correlation coefficient, or any other parametric univariate test. The result of the t-test is a t-value, which, when indexed with the degrees of freedom, returns a probability value that indicates how likely it is that the observed change occurred at random. The threshold level of significance (alpha value) is typically 0.05 or less, representing a probability of 5% or less of having a 'false positive'. The Student's t-test is the most commonly applied in the medical field and, for two groups, is equivalent to the analysis of variance (ANOVA).
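A voxel-wise two-sample t-test of this kind is straightforward to express with SciPy. The sketch below uses simulated data (the group sizes, voxel counts, and effect size are invented) purely to illustrate the mechanics of statistical parametric mapping:

```python
import numpy as np
from scipy import stats

# Simulated voxel intensities, shape (n_subjects, n_voxels).
rng = np.random.default_rng(0)
patients = rng.normal(0.0, 1.0, size=(20, 1000))
controls = rng.normal(0.3, 1.0, size=(20, 1000))  # small simulated effect

# Voxel-wise two-sample t-test: one t- and p-value per voxel.
t_values, p_values = stats.ttest_ind(patients, controls, axis=0)

# Threshold at alpha = 0.05 (uncorrected) to flag candidate voxels.
significant = p_values < 0.05
print(f"{significant.sum()} of {p_values.size} voxels flagged")
```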

DM on Neuroscience Data

In neuroscience, the application of DM techniques follows similar steps as in other disciplines. The ultimate goal is to classify a subject of study as normal/abnormal or sane/insane, merging multisource data from diagnostic imaging, genomics, and the collection of epidemiological information. A particular approach is image mining, whose goal is to discover significant patterns from images before any other kind of extraction of alphanumeric data is performed.79 Whereas pattern recognition aims to find specific patterns, image mining generates all significant patterns without prior knowledge about them. Researchers investigate the relations between brain areas and brain functions, and conventional statistical methods are not able to discover all the rules involved in these relations. Kakimoto et al.80 successfully identified such rules from brain slices, applying their DM algorithm to functional brain imaging.

DM methods have also been directly applied to neuroimage segmentation, in particular hippocampal segmentation.81 Morra et al.81 compared four automated methods for hippocampal segmentation using different machine learning algorithms: (1) hierarchical AdaBoost, (2) support vector machines (SVMs) with manual feature selection, (3) hierarchical SVM with automated feature selection (Ada-SVM), and (4) the FreeSurfer software. They demonstrated that, on the data tested, their AdaBoost and Ada-SVM algorithms detected disease effects as well as FreeSurfer did. Recently there has been growing interest in SVM methods (see Box 2): SVMs overcome the limits of univariate analyses and have been successfully applied to the individual classification of a variety of neurological conditions.82 Swain et al.83 argue that SVMs have become the dominant way to analyze fMRI data, and they demonstrated the feasibility of training classifiers to distinguish a variety of cognitive states.

The analysis of neuroimaging data with machine learning techniques has strongly influenced the neuroscience community. Predictability of disease risks, therapy success, and genotype–phenotype relations seems to come into reach using fMRI data. However, because of the unfavorable ratio between sample size and feature size, prediction rates can easily be overestimated. In this regard, reporting the prediction rates of various classifiers can help to better evaluate the predictive power of the data.84

Although several classification methods have been applied to neuroscience data, SVM has proved to have the best accuracy (ACC). Maroco et al.85 compared seven nonparametric classifiers, such as SVM and Random Forest, with three traditional classifiers, such as logistic regression, on data from neuropsychological tests on the progression of MCI patients to dementia. The researchers evaluated the sensitivity, specificity (SPC), overall classification ACC, and area under the ROC curve (see Boxes 4 and 5 for an overview of the quality metrics of a classification model) of SVM, Classification Trees, and Random Forest compared with the traditional algorithms. Their results showed that SVM has the highest ACC and SPC values, even though it has a lower sensitivity. Some details about SVMs are reported in Box 2.

SVMs in Neurosciences

Cristianini and Shawe-Taylor87 describe SVMs as learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. This learning strategy, introduced by Vapnik,88–90 has within a few years become a widely applied method in a broad variety of applications.

Meligy and Al-Khatib91 present a grid-based distributed SVM algorithm. They used the Globus Toolkit as grid middleware, exploiting its three functions: resource management, data management, and information services. Users submit instructions to the SVM application and, through Globus, find the suitable resources.91 Lu et al.92 proposed a distributed parallel SVM (DPSVM) training mechanism in a configurable network environment for distributed DM. Their basic idea is to exchange support vectors among a strongly connected network so that multiple servers may work concurrently on distributed data sets with limited communication cost and fast training speed.

BOX 2: SUPPORT VECTOR MACHINES

Classification is the analogue of regression when the variable being predicted is discrete rather than continuous. A classifier is a function that takes the values of various features (independent variables or predictors, in regression terms) in an example (the set of independent variable values) and predicts the class that the example belongs to (the dependent variable).86 In a neuroimaging setting, the features could be voxels and the class could be the type of stimulus the subject was looking at when the voxel values were recorded. Thus, the input/output pairings typically reflect a functional relationship mapping inputs to outputs. When an underlying function from inputs to outputs exists, it is referred to as the target function. The estimate of the target function that is learnt or output by the learning algorithm is known as the solution of the learning problem. In the case of classification, this function is sometimes referred to as the decision function.87

The method used by SVMs is based on kernel functions, namely the representation of data through a matrix defined by the kernel. The value of the kernel can be interpreted as a measure of 'similarity' between two inputs (not necessarily vectors). In neuroimaging, the variables to be compared are usually measures of the volume, size, or intensity of certain anatomical areas of the brain belonging to two different groups, generally labeled as normal/abnormal. Some of the kernel types used are linear, polynomial, and Gaussian (radial basis). The choice of kernel is heavily dependent on prior knowledge; without it, a radial basis kernel and a 10-fold cross-validation sampling scheme are typically used. Ten-fold cross-validation, or more generally k-fold cross-validation, randomly partitions the original sample into k subsamples. A single subsample is used for model validation (test set) and the remaining k − 1 subsamples are used as data for the learning phase (training set).
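These default choices (RBF kernel, 10-fold cross-validation) are easy to reproduce. As a minimal sketch in Python with scikit-learn (not the Weka toolchain used elsewhere in this paper; scikit-learn's bundled breast cancer dataset stands in for neuroimaging features):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# RBF-kernel SVM evaluated with 10-fold cross-validation.
X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```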

SVMs suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, Chang et al.93 have developed a parallel SVM algorithm, which reduces memory use by performing a row-based, approximate matrix factorization, and which loads only the essential data onto each machine to perform parallel computation.
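The support-vector-exchange idea behind approaches such as DPSVM can be simulated on a single machine. The sketch below (a toy cascade under invented data and partition sizes, not the authors' actual protocol) trains local SVMs on partitions, pools only their support vectors, and retrains a final model on the much smaller pool:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
partitions = np.array_split(np.arange(len(X)), 4)  # 4 simulated nodes

sv_X, sv_y = [], []
for idx in partitions:
    local = SVC(kernel="rbf").fit(X[idx], y[idx])
    sv_X.append(X[idx][local.support_])  # keep only local support vectors
    sv_y.append(y[idx][local.support_])

# Final model trained on the pooled support vectors alone.
final = SVC(kernel="rbf").fit(np.vstack(sv_X), np.hstack(sv_y))
print("pooled support vectors:", sum(len(a) for a in sv_X))
```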

Boosting is a general method for improving the ACC of any given learning algorithm,94 and AdaBoost (Adaptive Boosting)95 is one of its best-known implementations (see Box 3).


BOX 3: BOOSTING AND ADABOOST

Boosting is a general method for improving the ACC of any given learning algorithm.94 In theory, boosting can be used to significantly reduce the error of any 'weak' learning algorithm that consistently generates classifiers which need only be a little better than random guessing. Similar to SVM, AdaBoost works by combining several 'votes'. Instead of using support vectors (i.e., important examples), AdaBoost uses weak learners.96 AdaBoost was first introduced by Freund and Schapire95 and solved many of the practical difficulties of the earlier boosting algorithms. AdaBoost is a sequential algorithm that minimizes an upper bound of the empirical classification error by selecting the weak classifiers and their weights. These are 'pursued' one by one, with each one selected to maximally reduce the upper bound of the error. AdaBoost defines a distribution of weights over the data samples. These weights are updated each time a new weak classifier is added, such that samples misclassified by the new weak classifier are given more weight. In this manner, currently misclassified samples are emphasized more during the selection of the subsequent weak classifier.
The margin theory points to a strong connection between boosting and the support vector machines of Vapnik and others. The computation involved in maximizing the margin is mathematical programming, that is, maximizing a mathematical expression given a set of inequalities. The difference between the two methods in this regard is that SVM corresponds to quadratic programming, whereas AdaBoost corresponds only to linear programming.94
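A runnable illustration of this scheme (a sketch with scikit-learn rather than the implementations cited above; the dataset is again scikit-learn's bundled breast cancer data): AdaBoostClassifier boosts depth-1 decision trees ('stumps'), each barely better than chance, by reweighting misclassified samples at every round:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# AdaBoost over decision stumps (the default weak learner).
X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(ada, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```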

Boxes 4 and 5 describe some quality indexes used to evaluate classification models. Box 4 describes metrics-based indexes, whereas Box 5 describes ROC curves.

CONCLUSIONS

The availability of high-throughput experimental platforms (e.g., microarray, mass spectrometry, next generation sequencing) and the extensive digitalization of clinical data and health processes (e.g., imaging, electronic patient records) are leading to big data in life sciences.

Big data in life sciences may be explored by combining large-scale high-performance infrastructures, such as grids and clouds, with distributed DM platforms.

BOX 4: EVALUATING A CLASSIFICATION MODEL BY METRICS

Once a classification model is created, there is always the need to evaluate its performance and to discover how good it is. In medical diagnosis, as in neuroscience, correctly identifying instances labeled as affected by a disease (True Positive Rate, TPR) or not affected (True Negative Rate, TNR) is a primary goal. True and False Positives (TP/FP) refer to the number of positive predictions that are correct/incorrect, and True and False Negatives (TN/FN) are similarly defined as the number of negative predictions that are correct/incorrect. In a binary classification problem, the values of TP, FP, TN, and FN are summarized in a table called the confusion matrix:

Data Class    Classified as Positive    Classified as Negative
Positive      TP                        FN
Negative      FP                        TN

A confusion matrix, generated by Weka by applying SVM (the SMO Weka classifier) to a dataset containing information about breast cancer (the dataset is available for download on the Weka website), with two classes (benign/malignant), is shown in Figure 4. The number of correctly classified instances is given by the sum of the left-to-right diagonal, whereas the other instances are classified incorrectly. For example, we have 12 instances of class a incorrectly classified as belonging to class b, and 9 instances of b wrongly classified as belonging to class a. Several measures exist to quantify the quality of a predictive model using the values of TP, FP, TN, and FN:

• TPR or Recall or Sensitivity is the fraction of examples correctly classified as belonging to class a, among all those that really belong to class a. In the confusion matrix, it is the diagonal element divided by the sum of the elements of the row.

TPR = TP/P = TP/(TP + FN)

Example: 446/(446 + 12) = 0.974 for class a and 232/(232 + 9) = 0.963 for class b.

• False Positive Rate (FPR) is the fraction of examples classified as belonging to class a, but which in reality belong to a different class b, among all those that do not belong to class a. In the matrix, it is the sum of the column less the diagonal element, divided by the sum of the rows of the other classes.

FPR = FP/N = FP/(FP + TN)

Example: 9/(9 + 232) = 0.037 for class a and 12/(12 + 446) = 0.026 for class b.

• TNR or SPC is the fraction of negative cases correctly classified.

SPC = TN/(FP + TN)

Example: 232/(232 + 9) = 0.963 for class a and 446/(446 + 12) = 0.974 for class b (the TPR values with the two classes swapped).

• Positive Predicted Value (PPV) or Precision is the fraction of examples really belonging to class a, among all those classified as a. In the confusion matrix, it is the diagonal element divided by the sum of the corresponding column.

PPV = TP/(TP + FP)

Example: 446/(446 + 9) = 0.98 for class a and 232/(232 + 12) = 0.951 for class b.

• F-Measure or F-Score is a standard measure that summarizes precision and recall:

2 ∗ Precision ∗ Recall/(Precision + Recall)

Example: 2 ∗ 0.98 ∗ 0.974/(0.98 + 0.974) = 0.977 for class a and 2 ∗ 0.951 ∗ 0.963/(0.951 + 0.963) = 0.957 for class b.

• ACC is the proportion of true results among all cases:

ACC = (TP + TN)/(TP + FP + FN + TN)

Example: (446 + 232)/(446 + 9 + 12 + 232) = 0.9699 = 96.99%.
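The arithmetic in these examples is easy to check programmatically. A short sketch computing all of the box's metrics from the same confusion matrix counts (class a taken as the positive class):

```python
# Counts from the Box 4 confusion matrix, with class a as positive.
TP, FN, FP, TN = 446, 12, 9, 232

tpr = TP / (TP + FN)                   # recall / sensitivity: 0.974
fpr = FP / (FP + TN)                   # false positive rate: 0.037
spc = TN / (FP + TN)                   # specificity: 0.963
ppv = TP / (TP + FP)                   # precision: 0.980
f1 = 2 * ppv * tpr / (ppv + tpr)       # F-measure: 0.977
acc = (TP + TN) / (TP + FP + FN + TN)  # accuracy: 0.970

print(tpr, fpr, spc, ppv, f1, acc)
```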

The paper introduced high-performance infrastructures, with special focus on grids and their evolutions, and discussed the main approaches and software tools for mining large volumes of life sciences data on grids. Although high-performance infrastructures and distributed programming environments are almost mature, some research problems still remain open.

BOX 5: EVALUATING A CLASSIFICATION MODEL BY ROC CURVES

ROC graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years they have been used increasingly in machine learning and DM research.97

ROC graphs are two-dimensional graphs in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis. Because TPR is equivalent to sensitivity and FPR is equal to 1 − SPC, the ROC graph is sometimes called the sensitivity versus (1 − SPC) plot. A ROC graph depicts the relative tradeoff between benefits (true positives) and costs (false positives). Fawcett97 states that several points in ROC space are important to note:

• The lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives.

• The opposite strategy, of unconditionally issuing positive classifications, is represented by the upper right point (1, 1).

• The point (0, 1) represents perfect classification.

Figure 5 shows the ROC graph, generated by Weka, for the previous classification of the breast cancer dataset, relative to the benign class. Depending on the position of a point, we can assess how well the model predicts:

• One point in ROC space is better than another if it is to the northwest of the first (TP rate is higher, FP rate is lower, or both).

• Classifiers appearing on the left-hand side of an ROC graph, near the X axis, may be thought of as 'conservative': they make positive classifications only with strong evidence, so they make few false positive errors, but they often have low TPRs as well.

Classifiers on the upper right-hand side of an ROC graph may be thought of as 'liberal': they make positive classifications with weak evidence, so they classify nearly all positives correctly, but they often have high FPRs.
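The (FPR, TPR) points of such a graph are produced by sweeping a threshold over the classifier's scores. A minimal sketch with scikit-learn (again on its bundled breast cancer data, as a stand-in for the Weka experiment above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# decision_function scores are thresholded at every value to trace the curve.
scores = SVC(kernel="rbf").fit(X_tr, y_tr).decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```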

Among them, the structured use of metadata and ontologies to model application domains,98,99,100,101 the extensive use of service-oriented approaches and workflows for the modeling of life sciences applications, and their easy reuse and deployment on cloud infrastructures are certainly some of the problems yet to be faced.

FIGURE 4 | Confusion matrix generated by Weka by applying SVM (the SMO Weka classifier) to the breast cancer dataset available on the Weka website.

FIGURE 5 | ROC graph related to the classification of the breast cancer dataset, generated by Weka.

Recently, computational biology has focused on the study of living organisms using a system-level approach at the molecular scale. Consequently, data about genes, proteins, and the interactions among such entities are constantly produced in geographically distributed laboratories. The paper, after a brief introduction to the main disciplines and data, discussed the main approaches to distributed and parallel data preprocessing and mining. In spite of the existence of different approaches to parallel DM, several problems currently remain open. In particular, the existence of different data formats (in all the omics disciplines), the intrinsic difficulty of parallelizing graph algorithms (for interactomics data), and the lack of easy-to-use tools (the main tools have command-line interfaces) remain the main challenges for future research.

Knowledge discovery in neuroscience has demonstrated how it can support clinicians in medical diagnosis. We listed the steps of the neuroimage mining workflow, starting from the main imaging techniques (MRI, DTI, and so on), through the most widely used open-source software for the preprocessing of bioimages, to the application of algorithms such as SVM and AdaBoost. Each step comes with a huge computational time cost, and the complete workflow takes up to 3 or 4 days per patient to complete. Thus, there is a strong need for parallelization and automation of these procedures, with the aim of reducing the time spent writing scripts and setting processing parameters. An open problem still remains: the computational cost. The latter can be reduced by implementing new parallel algorithms and by using distributed software able to process big data. Another frontier is the choice of the right algorithm depending on the particular neurodegenerative disease investigated and the particular brain area affected.

ACKNOWLEDGMENTS

Mario Cannataro and Pietro Hiram Guzzi wish to thank Domenico Talia and Concettina Guerra for the joint early works on the Knowledge Grid and on Interactomics. Alessia Sarica is supported by the European Committee (F.S.E.) and by Regione Calabria, Italy. This article is cofunded with support from the European Commission, European Social Fund and 'Regione Calabria', Italy. This publication reflects only the views of the authors, so the European Commission and 'Regione Calabria' cannot be held responsible for any use which may be made of the information contained therein. All authors wish to thank the Institute of Neurological Sciences of the Italian National Research Council (ISN-CNR) and the Institute of Neurology of the University of Catanzaro, who introduced them to neurosciences.


REFERENCES

1. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann; 2005.

2. Han J, Kamber M. In: Jim Gray, series editor. Data Mining: Concepts and Techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann Publishers; 2006. ISBN 1-55860-901-6.

3. Cannataro M, ed. Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare. Medical Information Science Reference. Hershey, PA: IGI Global Press; 2009.

4. Foster I, Kesselman C. The Grid 2: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann Publishers; 2003.

5. Thain D, Tannenbaum T, Livny M. Distributed computing in practice: the Condor experience. Concurr Comput Pract Exp 2005, 17:323–356.

6. Couvares P, Kosar T, Roy A, Weber J, Wenger K. Workflow management in Condor. In: Taylor I, Deelman E, Gannon D, Shields M, eds. Workflows for e-Science. Berlin, Germany: Springer Press; 2007. ISBN: 1-84628-519-4.

7. De Roure D, Jennings NR, Shadbolt NR. The semantic grid: past, present, and future. Proc IEEE 2005, 93:669–681.

8. Berners-Lee T, Hendler J, Lassila O. The Semantic Web. New York: Scientific American; 2001.

9. De Roure D, Jennings NR, Shadbolt N. The semantic grid: a future e-Science infrastructure. In: Berman F, Hey AJG, Fox G, eds. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons; 2003, 437–470.

10. Ahmed M, Chowdhury ASMR, Ahmed M, Rafee MMH. An advanced survey on cloud computing and state-of-the-art research issues. Int J Comp Sci 2012, 9:201–207.

11. Foster I, Zhao Y, Raicu I, Lu S. Cloud computing and grid computing 360-degree compared. Grid Comput Environ Workshop; 2008, 1–10.

12. Mell P, Grance T. The NIST definition of cloud computing. Recommendations of the National Institute of Standards and Technology; 2011. Available at: http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf. (Accessed March 13, 2013).

13. Schwiegelshohn U, Badia RM, Bubak M, Danelutto M, Dustdar S, Gagliardi F, Geiger A, Hluchy L, Kranzlmuller D, Laure E, et al. Perspectives on grid computing. Future Gen Comp Syst 2010, 26:1104–1115.

14. Paul S. Parallel and distributed data mining. In: Funatsu K, ed. New Fundamental Technologies in Data Mining. InTech; 2011. ISBN: 978-953-307-547-1. Available at: http://cdn.intechopen.com/pdfs/13261/InTech-Parallel and distributed data mining.pdf. (Accessed March 13, 2013).

15. Stankovski V, Swain M, Kravtsov V, Niessen T, Wegener D, Kindermann J, Dubitzky W. Grid-enabling data mining applications with DataMiningGrid: an architectural perspective. Future Gen Comp Syst 2008, 24:259–279.

16. Cannataro M, Talia D. Knowledge Grid: an architecture for distributed knowledge discovery. Commun ACM 2003, 46:89–93.

17. Stankovski V, Swain M, Kravtsov V, Niessen T, Wegener D, Rohm M, Trnkoczy J, May M, Franke J, Schuster A, Dubitzky W. IEEE Int Comput 2008, 12:69–76.

18. Moore R. Knowledge-based grids. Technical Report, San Diego Supercomputer Center, SDCS TR-2001-2; 2001. Available at: http://legacy.sdsc.edu/techreports/TR-2001-02.doc.pdf. (Accessed March 13, 2013).

19. Berman F. From TeraGrid to Knowledge Grid. CACM 2001, 44:27–28.

20. Johnston WE. Computational and data grids in large-scale science and engineering. Future Gen Comp Syst 2002, 18:1085–1100.

21. Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining on grids: services, tools, and applications. IEEE Trans Syst Man Cybern B 2004, 34:2451–2465.

22. Ellisman M, Brady M, Hart D, Lin FP, Muller M, Smarr L. The emerging role of biogrids. Commun ACM 2004, 47:52–57.

23. Cannataro M, Guzzi PH, Lobosco M, Weber dos Santos R. GridSnake: a grid-based implementation of the Snake segmentation algorithm. 22nd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2009). Albuquerque, NM; 2009, 1–6.

24. Teixeira GM, Pommeranzembaum IR, de Oliveira BL, Lobosco M, dos Santos RW. Automatic segmentation of cardiac MRI using snakes and genetic algorithms. In: Bubak M, van Albada GD, Dongarra J, Sloot PMA, eds. ICCS (3), volume 5103 of Lecture Notes in Computer Science. Berlin, Germany: Springer; 2008, 168–177.

25. Dudley JT, Butte AJ. In silico research in the era of cloud computing. Nat Biotechnol 2010, 28:1181–1185.

26. Schatz MC, Langmead B, Salzberg SL. Cloud computing and the DNA data race. Nat Biotechnol 2010, 28:691–693.

27. Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson KE. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinform 2012, 13:42.

28. Talia D, Trunfio P. Service-Oriented Distributed Knowledge Discovery. London, UK: Chapman & Hall/CRC Data Mining and Knowledge Discovery Series; 2012.

29. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. SIGKDD Explor 2009, 11:1–18.

30. Kretschmann E, Fleischmann W, Apweiler R. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 2001, 17:920–926.

31. Bazzan AL, Engel PM, Schroeder LF, da Silva SC. Automated annotation of keywords for proteins related to mycoplasmataceae using machine learning techniques. Bioinformatics 2002, 18:S35–S43.

32. Tobler JB, Molla MN, Nuwaysir EF, Green RD, Shavlik JW. Evaluating machine learning approaches for aiding probe selection for gene-expression arrays. Bioinformatics 2002, 18:S164–S171.

33. Bekaert M, Bidou L, Denise A, Duchateau-Nguyen G, Forest JP, Froidevaux C, et al. Towards a computational model for -1 eukaryotic frameshifting sites. Bioinformatics 2003, 19:327–335.

34. Taylor J, King RD, Altmann T, Fiehn O. Application of metabolomics to plant genotype discrimination using statistics and machine learning. Bioinformatics 2002, 18:S241–S248.

35. Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinform Appl Note 2004, 20:2479–2481.

36. Pyka M, Balz A, Jansen A, Krug A, Hullermeier E. A WEKA interface for fMRI data. Neuroinformatics 2012, 10:409–413.

37. Celis S, Musicant DR. Weka-parallel: machine learning in parallel. Available at: http://weka-parallel.sourceforge.net/report.pdf. (Accessed March 13, 2013).

38. Khoussainov R, Zuo X, Kushmerick N. Grid-enabled Weka: a toolkit for machine learning on the Grid. ERCIM News 2004, 59:47–48.

39. Perez MS, Sanchez A, Robles V, Herrero P, Pena JM. Design and implementation of a data mining grid-aware architecture. Future Gen Comp Syst 2007, 23:42–47.

40. Perez MS, Sanchez A, Robles V, Herrero P, Pena JM. Adapting the Weka data mining toolkit to a grid based environment. Adv Web Intell Lect Notes Comput Sci 2005, 3528:492–497.

41. Talia D, Trunfio P, Verta O. The Weka4WS framework for distributed data mining in service-oriented Grids. Concurr Comput Pract Exp 2008, 20:1933–1951.

42. Mierswa I, Scholz M, Klinkenberg R, Wurst M, Euler T. YALE: rapid prototyping for complex data mining tasks. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06); 2006.

43. Di Martino MT, Arbitrio M, Leone E, Guzzi PH, Saveria Rotundo M, Ciliberto D, Tomaino V, Fabiani F, Talarico D, Sperlongano P, et al. Single nucleotide polymorphisms of ABCC5 and ABCG1 transporter genes correlate to irinotecan-associated gastrointestinal toxicity in colorectal cancer patients: a DMET microarray profiling study. Cancer Biol Ther 2011, 12:780–787.

44. Guzzi PH, Cannataro M. Parallel preprocessing of Affymetrix microarray data. Proc Euro-Par Workshops, Lect Notes Comp Sci 2010, 6586.

45. Kreil DP, Russell RR. Tutorial section: there is no silver bullet—a guide to low-level data transforms and normalization methods for microarray data. Brief Bioinform 2005, 6:86–97.

46. Sangurdekar D, Srienc F, Khodursky A. A classification based framework for quantitative description of large-scale microarray data. Genome Biol 2006, 7:R32.

47. Schmidberger M, Vicedo E, Mansmann U. affyPara—a Bioconductor package for parallelized preprocessing algorithms of Affymetrix microarray data. Bioinform Biol Insights 2009, 3:83–87.

48. Guzzi PH, Cannataro M. μ-CS: an extension of the TM4 platform to manage Affymetrix binary data. BMC Bioinform 2010, 11:315.

49. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature 2003, 422:198–207.

50. Cannataro M. Computational proteomics: management and analysis of proteomics data. Brief Bioinform 2008, 9:97–101.

51. Veltri P, Cannataro M, Tradigo G. Sharing mass spectrometry data in a grid-based distributed proteomics laboratory. Inf Process Manage 2007, 43:577–591.

52. Pedrioli PG, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R. A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2004, 22:1459–1466.

53. Orchard S, Montecchi-Palazzi L, Hermjakob H, Apweiler R. The use of common ontologies and controlled vocabularies to enable data exchange and deposition for complex proteomic experiments. Pac Symp Biocomput 2005, 186–196.

54. Martens L, Chambers M, Sturm M, Kessner D, Levander F, Shofstahl J, Tang WH, Rompp A, Neumann S, Pizarro AD, et al. mzML—a community standard for mass spectrometry data. Mol Cell Proteomics 2011, 10:R110.000133.


55. Barla A, Jurman G, Riccadonna S, Chierici M, Merler S, Furlanello C. Machine learning methods for predictive proteomics. Brief Bioinform 2008, 9:119–128.

56. Stevens RD, Robinson AJ, Goble CA. myGrid: personalised bioinformatics on the information Grid. Bioinformatics 2004, 19:302–302C.

57. Aloisio G, Cafaro M, Epicoco I, Fiore S, Mirto M. A services oriented system for bioinformatics applications on the grid. Stud Health Technol Inform 2005, 126:174–183.

58. Cannataro M, Guzzi PH, Mazza T, Tradigo G, Veltri P. Using ontologies for preprocessing and mining spectra data on the grid. Future Gen Comp Syst 2007, 23:55–60.

59. Cannataro M, Guzzi PH. Data Management of Protein Interaction Networks. Hoboken, NJ: Wiley-IEEE Computer Society Press, Wiley Book Series on Bioinformatics; 2011.

60. Aittokallio T, Schwikowski B. Graph-based methods for analysing networks in cell biology. Brief Bioinform 2006, 7:243–255.

61. Cannataro M, Guzzi PH, Veltri P. Protein-to-protein interactions: technologies, databases, and algorithms. ACM Comput Surv 2010, 43:1–36.

62. Cerami EG, Bader GD, Gross BE, Sander C. cPath: open source software for collecting, storing, and querying biological pathways. BMC Bioinform 2006, 7:497.

63. Guzzi PH, Mina M, Guerra C, Cannataro M. Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform 2012, 13:569–585.

64. Nassa G, Tarallo R, Ambrosino C, Bamundo A, Ferraro L, Paris O, Ravo M, Guzzi PH, Cannataro M, Baumann M, et al. A large set of estrogen receptor β-interacting proteins identified by tandem affinity purification in hormone-responsive human breast cancer cell nuclei. Proteomics 2011, 11:159–165.

65. Orchard S, Kerrien S, Jones P, Ceol A, Chatr-Aryamontri A, Salwinski L, Nerothin J, Hermjakob H. Submit your interaction data the IMEx way: a step by step guide to trouble-free deposition. Proteomics 2007, 7:28–34.

66. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, et al. The HUPO PSI's molecular interaction format—a community standard for the representation of protein interaction data. Nat Biotechnol 2004, 22:177–183.

67. Ciriello G, Mina M, Guzzi PH, Cannataro M, Guerra C. AlignNemo: a local network alignment method to integrate homology and topology. PLoS ONE 2012, 7:e38107.

68. Cannataro M, Guzzi PH, Veltri P. IMPRECO: distributed prediction of protein complexes. Future Gen Comp Syst 2012, 26:434–440.

69. Kalaev M, Smoot M, Ideker T, Sharan R. NetworkBLAST: comparative analysis of protein networks. Bioinformatics 2008, 24:594–596.

70. Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res 2006, 16:1169–1181.

71. Cannataro M, Guzzi PH, Mazza T, Tradigo G, Veltri P. Preprocessing of mass spectrometry proteomics data on the grid. 18th IEEE International Symposium on Computer-Based Medical Systems (CBMS'05). Dublin, Ireland: Trinity College Dublin; 2005, 549–554.

72. Cannataro M, Guzzi PH. Parallel preprocessing of Affymetrix microarray data. Workshop HIBB 2010 (High Performance Bioinformatics and Biomedicine), held in conjunction with Euro-Par 2010. Ischia: Euro-Par 2010 Workshops, Lecture Notes in Computer Science; 2010, 6586:225–232.

73. Ye J, Wu T, Li J, Chen K. Machine learning approaches for the neuroimaging study of Alzheimer's disease. IEEE Comp Soc 2011, 44:99–101.

74. Akil H, Martone ME, Van Essen DC. Challenges and opportunities in mining neuroscience data. Science 2011, 331:708–712.

75. Symms M, Jager HR, Schmierer K, Yousry TA. A review of structural magnetic resonance neuroimaging. J Neurol Neurosurg Psychiat 2004, 75:1235–1244.

76. Roy CS, Sherrington CS. On the regulation of blood supply of the brain. J Physiol 1890, 1:85–108.

77. Stejskal EO, Tanner JE. Spin diffusion measurements: spin echoes in the presence of a time-dependent field gradient. J Chem Phys 1965, 42:288–292.

78. Megalooikonomou V, Ford J, Shen L, Makedon F. Data mining in brain imaging. Stat Method Med Res 2000, 9:359–394.

79. Hsu W, Li Lee M, Zhang J. Image mining: trends and developments. J Intell Inf Syst 2002, 19:7–23.

80. Kakimoto M, Morita C, Tsukimoto H. Data mining from functional brain images. The Sixth ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, Workshop on Multimedia Data Mining; 2000.

81. Morra JH, Tu Z, Apostolova LG, Green AE, Toga AW, Thompson PM. Comparison of AdaBoost and support vector machines for detecting Alzheimer's disease through automated hippocampal segmentation. IEEE Trans Med Imaging 2010, 29:30–43.

82. Cuingnet R, Chupin M, Benali H, Colliot O. Spatial and anatomical regularization of SVM for brain image analysis. Proc NIPS; 2010, 460–468.


83. Swain R, Jena L, Kamila NK. Cognitive states from brain images: SVM approach. Int J Comput Commun Technol 2010, 2(Special Issue):194–199.

84. Pyka M, Balz A, Jansen A, Krug A, Hullermeier E. A WEKA interface for fMRI data. Neuroinformatics 2012, 10:409–413.

85. Maroco J, Silva D, Rodrigues A, Guerreiro M, Santana I, de Mendoca A. Data mining methods in the prediction of dementia: a real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests. BMC Res Notes 2011, 4:299.

86. Pereira F, Mitchell T, Botvinick M. Machine learning classifiers and fMRI: a tutorial overview. Neuroimage 2009, 45:S199–S209.

87. Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press; 2000.

88. Vapnik V. Estimation of Dependences Based on Empirical Data [in Russian]. Moscow: Nauka; 1979 (English translation: New York: Springer-Verlag; 1982).

89. Vapnik V. The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995.

90. Vapnik V. Statistical Learning Theory. New York: John Wiley and Sons, Inc.; 1998.

91. Meligy A, Al-Khatib M. A grid-based distributed SVM data mining algorithm. Eur J Sci Res 2009, 313–321.

92. Lu Y, Roychowdhury V, Vandenberghe L. Distributed parallel support vector machines in strongly connected networks. IEEE Trans Neural Netw 2008, 19:1167–1178.

93. Chang EY, Zhu K, Wang H, Bai H, Li J, Qiu Z, Cui H. PSVM: parallelizing support vector machines on distributed computers. Beijing, China: Google Research.

94. Freund Y, Schapire RE. A short introduction to boosting. J Japan Soc Artif Intell 1999, 14:771–780.

95. Freund Y, Schapire RE. Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference; 1996.

96. Falaki H. AdaBoost Algorithm. Los Angeles, CA: Computer Science Department, University of California.

97. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett 2006, 27:861–874.

98. Cannataro M, Comito C, Guzzo A, Veltri P. Integrating ontology and workflow in PROTEUS, a grid-based problem solving environment for bioinformatics. Proc Inf Technol: Coding and Comput Conf (ITCC 2004) 2004, 2:90–94.

99. Mastroianni C, Talia D, Trunfio P. Managing heterogeneous resources in data mining applications on grids using XML-based metadata. Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS), Nice, France; 2003.

100. Cannataro M, Veltri P. MS-Analyzer: composing and executing preprocessing and data mining services for proteomics applications on the grid. Concurr Comp Pract Exp 2007, 19:2047–2066.

101. Cannataro M, Guzzi PH, Mazza T, Tradigo G, Veltri P. Managing ontologies for grid computing. Multiagent Grid Syst 2006, 2:29–44.
