
Inf Syst Front, DOI 10.1007/s10796-012-9399-0

A virtual mart for knowledge discovery in databases

Claudia Diamantini · Domenico Potena · Emanuele Storti

© Springer Science+Business Media New York 2013

Abstract The Web has profoundly reshaped our vision of information management and processing, highlighting the power of a collaborative model of information production and consumption. This new vision influences the Knowledge Discovery in Databases domain as well. In this paper we propose a service-oriented, semantic-supported approach to the development of a platform for the sharing and reuse of resources (data processing and mining techniques). The platform enables the management of different implementations of the same technique and is characterized by a community-centered attitude, with functionalities for both resource production and consumption, serving end-users with different skills as well as resource providers with different technical and domain-specific capabilities. We first describe the semantic framework underlying the approach, and then demonstrate how this framework is exploited to deliver different functionalities to users through a presentation of the platform.

1 Introduction

The present paper proposes the Knowledge Discovery in Databases Virtual Mart (KDDVM), a framework and a platform for distributed Knowledge Discovery in Databases (KDD) experiments. As the name suggests, the aim is to provide a virtual environment where distributed and heterogeneous KDD resources (i.e. data processing and mining techniques) can be easily introduced, acquired and exploited.

C. Diamantini · D. Potena · E. Storti
DII, Università Politecnica delle Marche, Ancona, Italy
e-mail: [email protected]

D. Potena
e-mail: [email protected]

E. Storti
e-mail: [email protected]

In order to achieve this goal, in five years of work on the subject we have identified the major issues and requirements of a distributed experimental environment, from which the technological aspects and basic components of a supporting architecture have been derived. We have also developed suitable technologies for the representation of knowledge about resources, and suitable services for its management and exploitation. All these achievements are systematically described in the paper.

The work is in the mainstream of the recent web-based revolution. The Web has profoundly reshaped our vision of information management and processing, of software architectures and design, and of delivery models. All of this can be at least in part explained by the more than linear growth of value in a network of elements, synthesized by Metcalfe's Law. The social aspect emerging from the success of the so-called Web 2.0 shows the strategic value of a collaborative model of information production and consumption (Pedrinaci and Domingue 2010). As a matter of fact, more and more organizations share their data in a way that allows people to process and exploit them. The Semantic Web, and its recent evolution into the Web of Data, through the use of machine-readable languages for the representation of knowledge and semantics, aims at simplifying the discovery, exploitation and recombination of information.

Data processing in the Web era follows a similar trend. In particular, the Service Oriented Architecture (SOA) paradigm is a powerful principle supporting the collaborative production and consumption of computational capabilities, allowing new solutions to be built by reusing and recombining existing ones, and changing the software vision from product to service provision.


Indeed, the use of libraries and modular programming has been in vogue since the very beginning of programming practice. What is different in the modern SOA paradigm is the extensive use of metadata, supporting "reuse-in-the-large", that is, supporting the development of new complex distributed systems from heterogeneous, loosely coupled components. The management of purely syntactic metadata has been extended with the use of semantics, aiming at a uniform and rigorous specification of service capabilities that facilitates automatic service discovery and composition.

The underlying principles of the Semantic Web and Web 2.0 are present in our proposal. As a service-oriented, semantic-based framework, KDDVM enables the sharing and reuse of heterogeneous tools and workflows, giving advanced support for the semantic enrichment of tools through semantic annotation, for the deployment of tools as web services, and for the discovery and use of such services. This leads to a collaborative, natively open environment where traditional as well as the latest techniques can be dynamically added and exploited, and where specific techniques can exceed the boundaries of the domain in which they were born and become available to end-users with diverse knowledge and skills.

Service-oriented, semantic-supported platforms for KDD are not new in the literature. A review of existing systems is given in the next section. What makes KDDVM original with respect to other proposals is that it is natively conceived for a distributed, collaborative environment, which led to a novel semantic framework where resources are represented at three different abstraction levels (algorithm, tool and service), with a one-to-many relationship among levels. This is the fundamental pillar on which the main features of the platform are built: (a) the capability to manage various kinds of heterogeneities, like different implementations, with different characteristics, of the same algorithm, and (b) a community-centered attitude, with functionalities for both resource production and consumption, serving end-users with different skills as well as resource providers with different technical and domain-specific capabilities.

The rest of the paper is organized as follows: in Section 2 we review related work. Section 3 introduces the requirements seen as necessary for effectively supporting KDD experimentations, then it discusses the framework proposed to deal with these requirements. Section 4 describes the system implementing such a framework. Finally, Section 5 reports some concluding remarks and presents possible extensions of the work.

2 Related work

Knowledge Discovery in Databases is a complex process for extracting valuable knowledge from huge amounts of data.

Hence, various systems have been proposed in the literature for supporting users in the design and management of KDD processes. These systems can be analyzed along their historical evolution: while 1st-generation KDD and DM systems provided support for single users in local settings, 2nd-generation systems are asked to address the decentralization of users and tools, which produces more complexity both in managing distributed computation and in supporting cooperative work and knowledge sharing among distributed teams (Park and Kargupta 2003).

From a technical perspective, recent 2nd-generation systems are based on SOA and Grid technologies. The exploitation of the service-oriented paradigm for distributed KDD can be traced back to Sarawagi and Nagaralu (2000), which introduced the idea of Data Mining models as services with the aim of facilitating the use of such techniques among novice users. Such an approach concerns single, ready-to-use models which can be shared, as well as data, tools and skills, and it does not address design issues. Many authors have proposed SOA-based Data Mining frameworks. In Anteater (Guedes et al. 2006) the authors focused their efforts on architectural issues such as communication among services, as well as management and load-balancing of parallel clustered servers, in order to achieve computationally intensive processing on large amounts of data. The platform requires tools to be converted into a filter-stream structure, which allows high scalability but limits extensibility. Furthermore, the platform provides the user with a simple interface that abstracts the algorithms' technical details. A change of perspective is proposed in Kumar et al. (2004), where the authors describe a SOA architecture for KDD in which services are sent to the nodes where datasets reside, instead of transferring data, with the aim of lowering bandwidth consumption and preserving privacy.

Using a different perspective, some authors introduced user-oriented features both to provide support for data processing and to manage a knowledge discovery process. Ali et al. (2005) introduce a SOA architecture in which users can manually build a KDD process made of a set of predefined services from the Weka toolkit (Hall et al. 2009) and from third parties. The generated process is a workflow that can be executed through Triana (Majithia et al. 2004). A similar approach is adopted in Tsai and Tsai (2005), where users are provided with an integrated interface for searching Data Mining services, building BPEL4WS processes and executing them.

Within the Grid community, many proposals deal with architectures allowing Data Mining to be deployed in a distributed environment, often with great benefits for scalability. Among them, Olejnik et al. (2009) aim at solving computational issues by massive parallelism of Data Mining tasks, also introducing algorithms specifically suited for such architectures.


Similarly, in Cheung et al. (2006) the authors propose a Grid architecture for distributed Data Mining, although they mainly address the privacy issue by learning global models from local abstractions, to prevent sensitive data from being sent over the net.

In recent years, the overlap between the goals of Grid computing and of SOA based on Web services has become clear, and new solutions are emerging that mix the benefits of both, namely Service Oriented Grids, often compliant with open standards like the Open Grid Services Architecture (OGSA). Based on the OGSA standard, Perez et al. (2007) focus on a vertical and generic Grid application, which allows the execution of Data Mining workflows, mainly formed by Weka algorithms.

These 2nd-generation systems have focused on issues related to distributed execution, namely tool performance and privacy. Very little work has been done on the definition of KDD-specific support functionalities like choosing the algorithms to use, composing services and appropriately setting algorithm parameters. Hence, a 3rd generation of systems has been introduced with the aim of supporting users in the design of KDD processes in distributed and heterogeneous environments. Proposals of this generation are based on the semantic enrichment of resources.

Some research projects like Discovery Net, GridMiner and Knowledge Grid exploit resource metadata for designing distributed KDD processes over the Grid. Discovery Net (Alsairafi et al. 2003) allows users to collaboratively manage remotely available data analysis software as well as data sources. In Discovery Net, processes can be described through an abstract language which relies on metadata associated to each tool. Once an abstract process is defined, it is translated into a concrete plan to be executed on specific servers, and also made available as a new service. A related recent project is GridMiner (Kickinger et al. 2004), an OGSA-based system which provides a collection of KDD services, together with a set of modules for managing low-level and high-level functionalities on the Grid, such as a service broker, OGSA-DAI based data integration, orchestration and composition. Knowledge Grid, first introduced in Cannataro and Talia (2003) and discussed in detail in Congiusta et al. (2008), represents the first attempt to build a domain-independent KDD environment on the grid. The main focus of the approach is on high-performance parallel distributed computing and on generic high-level services for knowledge management and discovery. While the high-level layer is used to create abstract execution plans, the core layer keeps metadata describing KDD objects (data and results) and manages the mapping between a workflow and the available resources for its execution over the grid. Congiusta et al. (2008) also describe WekaWS, an implementation of the mentioned architecture, in which standard Weka algorithms are available over the Grid as services, and many support services are extended to take grid peculiarities into account, for instance to allow a workflow with parallel execution of services.

Advanced support functionalities have been introduced in various proposals by means of KDD ontologies. In Bernstein et al. (2005), the ontology provides a conceptualization of DM tools' interfaces allowing the automatic composition of valid KDD processes. Since this ontology also contains information about tool performance (e.g. execution speed), the system is able to suggest more efficient workflows. Process composition is also treated in Orange4WS (Podpecan et al. 2010), where a KDD ontology representing tools, datasets and models is introduced to guide a planning algorithm in the process design. The resulting workflows are written to run in the Orange platform (Demsar et al. 2004), a Data Mining framework providing easy-to-use functionalities for the composition and execution of pre-defined data mining tools. A fully annotated Grid platform is proposed in Comito et al. (2006), where any resource is semantically enriched by referring both to ontologies describing the application domain and to the DAMON ontology, which is the first attempt at giving a formal representation of the DM domain. This ontology is mainly a taxonomy of DM tools and is exploited for tool retrieval. Another Grid-based system is proposed in Yu-hua et al. (2006), where the ontology is used to guide the selection of suitable processes; to this end, the ontology describes algorithm sequences and their relations with the specific application domain they are used for. A different perspective is introduced in Panov et al. (2008), where a general-purpose ontology (i.e. one not conceived for achieving specific support functionalities) is proposed; systems based on such an ontology can be used for different activities, but provide less efficient support in each of them.

Many other intelligent data analysis systems, which can be defined as general frameworks for workflow management, data manipulation and knowledge extraction, are described in the survey (Serban et al. 2010).

Analyzing the literature, we find in the e-Science community systems similar to those described above; indeed, the KDD domain can be considered a branch of e-Science. Such systems aim at managing the enormous amounts of data produced during experiments (e.g., in biomedicine or particle physics), and need to integrate different methodologies, tools and algorithms to analyze complex problems (Jurisica and Glasgow 2006). Furthermore, e-Science systems exploit technological architectures like the Grid to achieve efficient and scalable computation, with a strong focus on collaboration in distributed settings. Among the many projects, MyExperiment (De Roure et al. 2009) is trying to combine the computational efforts typical of the Grid with semantic technologies. In detail, it is a middleware for supporting personalized in-silico experiments, where scientists can collaboratively design, edit, annotate, execute and share workflows in the biology domain.



Although the discussed literature shares some aspects with our platform, in those works the design of KDD services and processes is based on tools with homogeneous interfaces, which simplifies process composition and activation. In our proposal we also deal with issues deriving from autonomous distributed design: services to be integrated in the platform often run in various environments, typically more than one service implementing a specific algorithm is available, outputs produced by different services are not always directly compatible, and so forth. We introduce functionalities to build a platform supporting a KDD community, including not only analysts using services, but also developers as well as autonomous organizations hosting services on their machines.

3 The framework

This section is devoted to presenting the KDDVM framework. In its essence, the framework defines and organizes the bulk of metadata and semantic information needed to comply with the requirements of a KDD support system in a collaborative networked environment, and formalizes it by means of suitable technologies.

The design of the framework starts from the observation that there is no standard way to deal with a knowledge discovery problem. This fact leads to three major considerations: (1) novel tools and algorithms for data processing and analysis are continuously developed, many of them in specific domains such as life science; (2) analysts should be provided with the highest possible number of KDD tools in order to perform their work in the most effective way; and (3) analysts should be sufficiently experienced in the (combined) use of these tools. Hence, in this scenario, when a new tool is developed or a new algorithm is proposed, the organization has to invest resources in making the tool interoperable with its own analysis system and in training people in its effective use (or in external consultants). Even when this type of investment can be made, a KDD project typically involves several kinds of users, some of whom neither have a specific background in the data analysis field nor hold enough technical expertise to manage a whole project on their own. Among them, for instance, there are domain experts, who intimately know the problem and are able to assess whether the knowledge extracted at the end of the process is useful to solve it or not. Then, DB administrators have to gather data from databases/data warehouses and possibly perform transformation operations.

Such a scenario depicts a KDD project as a collaborative and distributed work, where several users, possibly from different organizations, make tools available and share knowledge and expertise. Our goal is to support users with such different degrees of skill and expertise, facilitating the adoption of novel tools, the choice of the "right" tools and their composition, and collaboration. Hence, we envisage the following general, non-functional requirements for a system effectively supporting KDD experimentations:

(a) flexibility: the system should adapt easily to any modification in the experiment, like the changing of parameters and tools. It should grow with the organization's needs, providing mechanisms for the seamless integration of new techniques and tools;

(b) transparency: tools with different and heterogeneous interfaces should be managed, hiding localization and execution technicalities from users. Most operations on data should be automated;

(c) ease-of-use: the system should provide KDD-specific support to users with different skills, ranging from KDD experts to novice users and domain experts. In particular, while the principles underlying Data Mining algorithms can be assumed to be known by computer scientists, this cannot be expected of domain experts, who should be supported in choosing the best tools for their goals, in setting up the right parameters, in combining different tools, in the management of the whole process, in sharing domain knowledge and so forth;

(d) reusability: the framework should allow available tools to be used without imposing any modification on their code, and should exploit the available information to the maximum extent.

KDDVM relies on an open and modular architecture. By the term Service Oriented Architecture we refer to "a style of building reliable distributed systems that deliver functionality as services, with the additional emphasis on loose coupling between interacting services" (Treadwell 2005). Thus, like in any SOA, each KDD tool is regarded as a modular service that can be discovered and used by a client, with the following benefits: services can be used independently or integrated to provide more complex functionalities; each KDD tool is provided with a public description of its interface and other information (e.g., supported protocols, capabilities), but its implementation and other internal details are of no concern to clients and remain hidden; clients communicate with services by exchanging messages. Moreover, thanks to loose coupling, the system allows KDD services to be dynamically added and removed, their implementation to be updated, and alternative services providing the same functionalities to be suggested in case of unavailability. Hence, the SOA paradigm partially addresses the previous requirements. With the aim of satisfying to a greater extent the requirements, in particular ease-of-use, we add two more levels of resource description, forming a three-layer architecture as follows:

– at the algorithm level, the resource is seen as a prototypical tool describing capabilities without any implementation detail;

– at the tool level, the specific implementation of an algorithm in a given programming language is represented;

– at the service level, the characteristics of a tool running on a server and offering its interface through standard SOA protocols are considered.

This three-layer architecture possesses a hierarchical structure, in that many services can refer to the same tool and several tools can implement the same algorithm, while, the other way around, each service is the deployment of a specific tool that in turn implements an algorithm. From an informative perspective, this means that all the characteristics defined at one abstraction level are inherited by the lower level(s). In turn, lower levels possess other layer-specific information as well as specifications of the general characteristics of the upper level. To be more specific, an algorithm is described by its inputs and outputs (e.g. a dataset and the mined model), the task it is aimed at (e.g. classification), the method it uses to accomplish its task (e.g. decision tree), and some performance indexes (e.g. complexity). By contrast, a tool has its specific interface, it is written in a programming language, it has an execution path and some specific performance values obtained from executions on specific datasets (e.g., the accuracy value on the Iris dataset (Frank and Asuncion 2010)). Note that the tool inherits the abstract characteristics of the algorithm it implements: for instance, if the algorithm takes a labeled dataset as input, the tool still has the same kind of input, but at this level the format of the input file is also specified (e.g. arff or csv). Also, the tool can have additional inputs or outputs of its own, like e.g. a parameter allowing the verbosity level of the output to be set. Finally, the service level adds information about the URL and peculiar QoS indexes (e.g., availability) which depend on the properties of the server on which the service is executed and on the network status. The information related to each layer is summarized in Fig. 1.
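To make the three-layer organization concrete, the following sketch (in Python, purely illustrative: class and field names are ours, not the platform's actual data model) shows how the description visible at the service level can be resolved by merging layer-specific information with what is inherited from the tool and algorithm layers.

```python
from dataclasses import dataclass

@dataclass
class AlgorithmDesc:          # algorithm layer: abstract capabilities, no implementation detail
    name: str
    task: str
    method: str
    inputs: dict              # abstract I/O interface, e.g. {"LabeledDataset": "dataset"}
    outputs: dict
    indexes: dict             # e.g. computational complexity

@dataclass
class ToolDesc:               # tool layer: one concrete implementation of the algorithm
    implements: AlgorithmDesc
    language: str
    io_formats: dict          # concrete formats, e.g. {"LabeledDataset": "arff"}
    extra_params: dict        # tool-specific inputs, e.g. a verbosity level
    measured: dict            # data-dependent performance values

@dataclass
class ServiceDesc:            # service layer: the tool deployed on a server
    deploys: ToolDesc
    url: str
    qos: dict                 # e.g. availability

def effective_description(svc: ServiceDesc) -> dict:
    """What a consumer sees: layer-specific data plus everything inherited from above."""
    tool, alg = svc.deploys, svc.deploys.implements
    return {
        "task": alg.task, "method": alg.method,
        "inputs": {**alg.inputs, **tool.extra_params},
        "io_formats": tool.io_formats,
        "indexes": {**alg.indexes, **tool.measured, **svc.qos},
        "endpoint": svc.url,
    }

# One algorithm, one of its tools, one deployment of that tool (all values are placeholders):
c45 = AlgorithmDesc("C4.5", "classification", "decision tree",
                    {"LabeledDataset": "dataset"}, {"DecisionTreeModel": "model"}, {})
j48 = ToolDesc(c45, "Java", {"LabeledDataset": "arff"}, {"verbosity": "int"}, {})
svc = ServiceDesc(j48, "http://provider.example.org/j48weka", {"availability": 0.99})
print(effective_description(svc))
```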

Fig. 1 Abstraction layers and related information

The information managed at the upper layers is fundamental to satisfy the ease-of-use requirement. As a matter of fact, giving KDD-specific support calls for the formalization of the KDD principles and practices typically owned by an expert. To make novice users capable of managing services, we formalized KDD knowledge into an ontology (KDDONTO), which allows algorithms and their properties and relations to be described in a formal fashion. The ontology allows us to define a formal common ground on the basis of which most of the support is given. It includes:

– heterogeneity management and service integration;
– enhanced service discovery and composition;
– best practices for service activation.

Details on semantic-based support functionalities are given in Section 4.

On the other hand, conceptual information is not sufficient by itself, since most of the heterogeneities appear at the tool level, when different implementations of an abstract algorithm coexist in the distributed environment. This makes the integration and reuse of tools very time consuming. For instance, before actually using a tool, a user should understand the syntax of the input and adapt his data accordingly. For this reason, in order to allow the usage of several tools in a single framework, we provide each of them with a description in a common, XML-based language. The functionalities enabled by the descriptor are discussed in Section 4.1.

In the rest of this section, a more detailed description of each layer is provided: the ontology describing resources at the algorithm level is summarized in Section 3.1, Section 3.2 introduces the language to annotate resources at the tool layer, and Section 3.3 focuses on resources at the service layer.

3.1 Algorithm layer

KDDONTO is an ontology aimed at representing the main concepts and relations within the KDD domain in a formal fashion. Among the many available methodologies for ontology building, we developed the ontology following a formal approach based on Noy and McGuinness (2002), Gruber (1995) and Fernandez-Lopez et al. (1997), taking into account several quality criteria in order to guarantee clarity, coherence, extensibility and minimality.

Table 1 KDDONTO: main classes

Name | Description | Examples
Algorithm | Algorithm for data analysis | SVM, RemoveMissingValues, PrincipalComponentAnalysis
Method | Technique to extract knowledge | KernelMethod, RandomFill, FeatureExtraction
Phase | Step in a KDD process | Modeling, PreProcessing, FeatureExtraction
Task | Data mining task | Classification, Regression
Data | I/O data | Dataset, LearningRate, Model
Model | Type of I/O data | Any induced classification model
Dataset | Collection of data records | Iris dataset
DataFeature | Characteristics of data | Numeric, Literal, MissingValues, BalancedDataset, NormalizedDataset

The formal definition of concepts allows inferential mechanisms to be supported, finding non-explicit relations among the ontological concepts: for instance, it permits determining whether a certain kind of data is compatible with the input of an algorithm, or automatically assigning an algorithm to the right class on the basis of its relations with other concepts.

The main purpose of KDDONTO is to represent general properties of algorithms, thus differing from other proposed ontologies, which adopt a more applicative perspective, as already discussed in Section 2. Tables 1 and 2 show the main classes and relations. Besides the central concept of algorithm, the ontology describes the task, the KDD phase in which it is commonly used, and the method it implements for achieving its task. Moreover, the I/O interface is described: the data required in input and yielded in output, together with the preconditions that such data must satisfy in order to be actually used.

Table 2 KDDONTO: main relations

Name | Description | Examples
uses(Algorithm, Method) | Relation between an algorithm and the method it uses | uses(BACKPROP, NEURALNETWORK)
specifies task(Method, Task) | Relation between a method and a task | specifies task(NEURALNETWORK, CLASSIFICATION); specifies task(NEURALNETWORK, REGRESSION)
specifies phase(Task, Phase) | Relation linking a task to a phase | specifies phase(CLASSIFICATION, MODELING)
has input(Algorithm∪Method∪Task, Data, DataFeature, strength, is parameter) | An input to an algorithm, method or task. DataFeatures are preconditions on the input Data. strength defines whether the precondition is mandatory (strength=1) or can be relaxed (strength<1). is parameter defines whether the input is a parameter | has input(SOM, UnlabeledDataset, NO MISS VALUE, 1, 0); has input(SOM, VectorQuantizer, FLOAT, 0.4, 0); has input(SOM, LearningRate, null, null, 1); has input(CLASSIFICATION, Data)
has output(Algorithm∪Method∪Task, Data, DataFeature) | An output of an algorithm, method or task. DataFeatures are postconditions resulting from data manipulation by the algorithm | has output(SOM, VectorQuantizer, NO LITERAL); has output(CLASSIFICATION, Data)
is a(Thing, Thing) | The generic subsumption relation between a class and its superclass | is a(ClassificationAlgorithm, Algorithm); is a(Model, Data); is a(Dataset, Data); is a(Parameter, Data)
part of(Data, Data) | The relation between two data, such that the second contains the first as a subcomponent | part of(Label, LabeledDataset); part of(Neuron, MLP)
in module/out module(Algorithm, Algorithm) | Explicit suggestion (best practice) for linking two algorithms together | in module(LVQ, SOM); out module(SOM, DRAWVORONOIREGIONS)
in contrast(DataFeature, DataFeature) | Disjoint properties of data | in contrast(NUMERIC, LITERAL)


Fig. 2 An excerpt of the KDDONTO description of the C4.5 algorithm visualized by the OntoGraf module of Protégé

Finally, computational indexes like complexity and scalability are represented as well. Among the relations, we note that in module, out module and in contrast are introduced to elicit expert knowledge about typical process-composition practices. A peculiar characteristic of KDDONTO is the introduction of the part of relation, motivated by the need to manage structured data. Finally, note that, differing from other proposals, the main concepts of algorithm, method and task are related to each other by many-to-many relations, allowing for instance a neural network method to be associated with classification as well as regression tasks.

To give an example, Fig. 2 shows an excerpt of the ontology describing the C4.5 algorithm, which performs the classification task and generates a decision tree as its predictive model. Thus, in KDDONTO it is defined as an instance of the TreeAlgorithm class. The latter is a subclass of ClassificationAlgorithm, which contains all algorithms that use a method suitable for classification. Indeed, the C4.5 algorithm produces as output a BinaryDecisionTree model, which is a DecisionTreeModel that in turn is a subclass of ClassificationModel. As input it accepts an instance of LabeledDataset. For further details about the structure of the ontology, we refer the interested reader to Diamantini et al. (2009a).
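As a rough illustration of the kind of inference such definitions enable, the following sketch paraphrases the C4.5 excerpt as plain facts (these are not the actual OWL axioms) and uses a transitive closure over the is a relation to conclude that the output of C4.5 is a ClassificationModel.

```python
# Illustrative facts paraphrasing the C4.5 excerpt (not the actual OWL axioms).
is_a = {
    "TreeAlgorithm": "ClassificationAlgorithm",
    "ClassificationAlgorithm": "Algorithm",
    "BinaryDecisionTree": "DecisionTreeModel",
    "DecisionTreeModel": "ClassificationModel",
    "ClassificationModel": "Model",
}
has_output = {"C4.5": "BinaryDecisionTree"}
has_input = {"C4.5": "LabeledDataset"}
instance_of = {"C4.5": "TreeAlgorithm"}

def subsumed_by(concept: str, ancestor: str) -> bool:
    """Follow is_a links upward until the ancestor is found (or the chain ends)."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = is_a.get(concept)
    return False

# C4.5 yields a classification model, hence it can satisfy a classification task:
assert subsumed_by(has_output["C4.5"], "ClassificationModel")
# and it is (transitively) an Algorithm:
assert subsumed_by(instance_of["C4.5"], "Algorithm")
```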

3.2 Tool layer

In order to describe services at the tool level, we introduced the Knowledge Discovery Tool Markup Language (KDTML). It is an XML language aimed at annotating a KDD tool through a set of metadata, in order to describe its details in a structured fashion. Legacy KDD tools, as well as free KDD software produced by third parties and written in any programming language, can be described through a KDTML document in order to allow advanced support for sharing and integration.

The structure of a KDTML document is formed by the following sections:

1. development/execution: the tool's name, programming language and local execution path;

2. I/O interface: the number and type of each I/O datum, and the syntax in which data must be provided;

3. algorithm: the concept and the ontology that describe the algorithm implemented by the tool;

4. tool performances: data-dependent and data-independent performance values; the former describe the tool's behavior w.r.t. previous executions (e.g., accuracy on certain datasets), while the latter refer to intrinsic properties of the tool;

5. publication: the author of the KDTML document and the publication date.
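To make the structure concrete, the following Python sketch assembles a minimal KDTML-like skeleton with these five sections; all element names, the namespace and the ontology URL are illustrative placeholders, since the actual DTD is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Element names and namespace are illustrative placeholders, not the official KDTML DTD.
ET.register_namespace("kdtml", "http://kddvm.example.org/kdtml")  # hypothetical namespace
NS = "{http://kddvm.example.org/kdtml}"

root = ET.Element(NS + "tool")

execution = ET.SubElement(root, NS + "execution")        # 1. development/execution
ET.SubElement(execution, NS + "name").text = "J48weka"
ET.SubElement(execution, NS + "language").text = "Java"
ET.SubElement(execution, NS + "path").text = "/opt/weka/weka.jar"   # example path

interface = ET.SubElement(root, NS + "interface")        # 2. I/O interface
ET.SubElement(interface, NS + "input", type="LabeledDataset", syntax="arff")

algorithm = ET.SubElement(root, NS + "algorithm")        # 3. algorithm
# Simplified stand-in for the RDF triple linking the tool to the C4.5 concept of KDDONTO.
algorithm.set("concept", "C4.5")
algorithm.set("ontology", "http://example.org/kdontology.owl")      # hypothetical URL

ET.SubElement(root, NS + "performances")                 # 4. tool performances
publication = ET.SubElement(root, NS + "publication")    # 5. publication
ET.SubElement(publication, NS + "publisher").text = "..."

print(ET.tostring(root, encoding="unicode"))
```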

To exemplify the use of the main elements, a fragment of a KDTML document is shown in Fig. 3. The described tool, namely J48weka, implements the Weka version of the C4.5 algorithm. We hasten to note the use of RDF triples to semantically annotate KDTML information. This is done in the example for the algorithm tag, in order to express the meaning of the label j48 in terms of the reference concept C4.5 of the kdontology.owl ontology. Similarly, RDF is used to link the publisher to an external URI. Each element in the descriptor is identified by an XML tag, in which the prefix kdtml: as usual denotes the namespace of the KDTML DTD, which defines the syntax of each structural element.


Fig. 3 An example of KDTML Document. WEKA J48 tool

Further information about each KDTML element is available in Diamantini and Potena (2008). Here we only highlight the use of the hidden data structure, which describes an input that the software needs but whose structure is not given explicitly on the standard input. The typical case is that of a file containing the training/testing datasets. Even if the user typically supplies only the name of this file as an input parameter, the tool reads the content of the file according to a fixed, predefined structure that has to be known by the user to format the file correctly. To describe the hidden structure of a parameter, a C-like I/O format is used (an ARFF file in the example) as part of a structured datum, referred to by the value of the par ref tag.

3.3 Service layer

In general terms, services are loosely coupled entities that encapsulate reusable functionalities and are defined by implementation-agnostic interfaces. As previously mentioned, each KDD tool in our platform has to be available on the Net as a service, according to SOA principles. In practice, actual services can be offered by using several implementation technologies, the most important of which are Web Services and Service-Oriented Grids. While the former is a well-known and mature technology aimed at supporting interoperability, the latter is a rather novel form of distributed computing, in which traditional grid concepts (e.g., virtualization, collective management, adaptation, high performance) go together with emerging standards from several segments of the Web Service community. Although our approach can be considered independent of the specific technology, in our platform we choose to rely on Web Services, since they represent a more mature technology and offer a comprehensive set of full-fledged tools and solutions.

The key specifications used by Web Services are XML descriptors written according to the Web Service Description Language (WSDL), representing attributes, interfaces and other properties. However, the WSDL 2.0 W3C Recommendation does not include semantics in the description of Web services. Given that it is widely recognized that resolving such ambiguities in Web service descriptions is an important step toward automating the discovery and composition of Web services, Semantic Annotation for WSDL and XML Schema (SAWSDL) defines mechanisms by which semantic annotations can be added to WSDL components such as I/O message structures, interfaces and operations (Farrel and Lausen 2007). In our framework, each service is described by an extended version of the SAWSDL specification. Extending SAWSDL has been necessary to represent the whole bunch of information available in KDTML. For this reason, we provide each service with an extended SAWSDL document (eSAWSDL), which is fully compatible with the SAWSDL standard and has some additional details, namely the specific syntax of I/O data and performance values, necessary to support tasks such as choosing the best-performing service, or finding a service whose input interface is syntactically compatible with the output interface of the service at hand.
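As a minimal sketch of the annotation mechanism SAWSDL provides, the fragment below attaches the standard sawsdl:modelReference attribute to a schema element, pointing it to a KDDONTO concept; the schema element and the concept URI are illustrative, not taken from an actual eSAWSDL document.

```python
import xml.etree.ElementTree as ET

SAWSDL = "http://www.w3.org/ns/sawsdl"                 # standard SAWSDL namespace
ET.register_namespace("sawsdl", SAWSDL)
ET.register_namespace("xs", "http://www.w3.org/2001/XMLSchema")

# A toy XML Schema element as it could appear inside the types section of a WSDL.
elem = ET.Element("{http://www.w3.org/2001/XMLSchema}element", {"name": "trainingSet"})

# Point the element to the corresponding KDDONTO concept (the URI is illustrative).
elem.set("{%s}modelReference" % SAWSDL,
         "http://example.org/kdontology.owl#LabeledDataset")

print(ET.tostring(elem, encoding="unicode"))
```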


4 KDDVM platform

Given the three description layers introduced in the previous section, each computational unit in our framework is a service. Among the services offered by the platform we distinguish basic services, which provide single Data Mining and KDD functionalities allowing data to be analyzed and transformed and knowledge to be extracted. Moreover, a set of support services is provided, giving both low-level and KDD-specific support functionalities to different categories of users. We classify users on the basis of their role: publishers, providers and consumers. The publisher is responsible for the tool description by means of the KDTML language, referring to concepts of KDDONTO. Notice that the publisher is not necessarily the tool developer, though he should have some knowledge and skills about the published tool. The provider (i.e. a business or organization) takes charge of making the tool available as a web service on his own platform, building the eSAWSDL document, publishing the service in a registry, managing technical issues and ensuring appropriate quality of service. Finally, the consumer is the final user (i.e. domain experts, DBAs, Data Mining specialists) who uses the platform's functionalities to solve specific KDD problems.

Concerning publishers and providers, the platform exposes a set of services supporting the deployment phase, which implement all the functionalities needed for making a given KDD tool available as a basic service. In particular, such services are used to describe the tool at all layers, to make the service available on the host machine, and to publish it in a public registry. Services for the deployment phase are described in detail in Section 4.1.

From the consumers' viewpoint, we divide support services into three main categories, on the basis of the provided functionality: discovery, composition and activation. The first kind of service supports the browsing of basic services in the repository and the retrieval of details about each service. Composition services enable the design of a process at different abstraction levels, ranging from a process of algorithms to an executable workflow of web services. They also support the choice of the most suitable services on the basis of user requirements (e.g., useful for a certain task, executable after the service at hand, providing a certain level of performance on some kinds of problems). Finally, the third category manages the execution of the whole process as well as of a single service, and the data transfer from one service to another. These categories are described in Sections 4.2, 4.3 and 4.4 respectively. In Fig. 4 we show an overview of the platform and its main support services divided by category. Following the loose-coupling principle underlying a Service Oriented Architecture, support services can be composed in order to define more complex functionalities.

Fig. 4 KDDVM: services supporting Discovery, Composition and Activation. The KDDDesigner as access point

Note that, without denying the principle that a Web Service can be accessed by various end-user interfaces (by using the SOAP protocol), the KDDVM platform also provides some consumer-side applications with user-friendly graphical interfaces. Technical details are available on the KDDVM project web site,1 where one can access the platform and use the available applications.

4.1 Deployment

Before using the platform, services have to be published into a repository, from which final users can retrieve them. For this reason, a set of services and consumer-side applications is available for (1) transforming a KDD tool, written by a developer in any programming language, into a web service with a standard interface; (2) deploying such a service in an application server; and (3) publishing it into the common registry. As shown in Fig. 5, these steps are carried out by the following applications:

– Annotation Generator (AnGen): a consumer-side application to support a user in writing the KDTML descriptor for a given tool. By analyzing the KDTML DTD (which defines the syntactical structure), AnGen guides the user in filling in every part, thus simplifying the writing and avoiding syntax typos.

1 http://kddvm.diiga.univpm.it/


Fig. 5 KDDVM: services for the deployment phase

– Automatic Wrapping Service (AWS): a web service capable of transforming a tool written in any programming language into a standard web service (i.e. a basic service) by reading its KDTML descriptor. In detail, AWS encapsulates any tool, even legacy software, by building a wrapper that allows such a tool to be activated via standard web service communication protocols. This wrapper is given as Java code that will be executed on the host machine. Moreover, AWS produces the eSAWSDL descriptor and all the scripts needed for deploying the service on the publisher's server.

– Authorization Service (AuthS): a web service that enables a user both to execute a service and to access files in specific directories, on the basis of the permissions the user has.

– File Transfer Service (FTS): this web service, available on each publisher's server, allows the user to upload the code of the wrapper and the eSAWSDL document into the specific directory. Since this functionality involves writing and reading information on disk, the service makes use of the AuthS service.

– Deployment Service (DS): this web service is responsible for compiling the code of the wrapper as provided by the AWS, and for deploying the service on the server, asking permission from AuthS.

– Broker: a support service to discover and publish services. For the publication, the Broker is responsible for storing information about the service inside the UDDI registry, namely generic details about the service and the publisher, and the eSAWSDL's URL. To go beyond UDDI limits we use some standard UDDI constructs to add specific eSAWSDL details, such as which algorithm is implemented by the service and information about performance values (a preliminary version using an extended WSDL is presented in Diamantini et al. (2007)).

Since AuthS, FTS and DS interact with the operating system of the server, they are the only services needed for hosting a node of our platform. On the contrary, the AWS does not have to run on the server where the basic service will be deployed.

We like to note that these support services facilitate collaborative networking: by assisting the community in sharing tools and resources, they reduce the barriers to entry for new users.

In order to automate the service deployment activities, we provide WSClient, a consumer-side application that, by orchestrating the above services, automatically generates the wrapper (AWS), uploads the needed files (FTS+AuthS), deploys (DS) and publishes the service (Broker). In Fig. 6 these steps are shown for the deployment of the service implementing the J48weka tool. Furthermore, the BrokerClient is made available to interact with the Broker, supporting service publishing and, in particular, service discovery as described in the following.
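The deployment sequence orchestrated by WSClient can be pictured as follows; this is a conceptual sketch only, since the actual endpoints and call signatures of AWS, AuthS, FTS, DS and the Broker are defined by the platform and are not reproduced here.

```python
# Conceptual sketch of the WSClient orchestration. Every call_* function below
# stands for a SOAP invocation of the corresponding platform service; the stubs
# and their signatures are hypothetical and only show the order of the steps.

def call_aws(kdtml: str):                                  # Automatic Wrapping Service
    return "public class Wrapper { /* ... */ }", "<wsdl/>" # wrapper code + eSAWSDL

def call_auths(host: str, credentials: dict) -> str:       # Authorization Service
    return "auth-token"

def call_fts(host: str, token: str, files: list) -> None:  # File Transfer Service
    pass

def call_ds(host: str, token: str, wrapper: str) -> str:   # Deployment Service
    return f"http://{host}/services/J48weka"

def call_broker_publish(url: str, esawsdl: str) -> None:   # Broker (UDDI publishing)
    pass

def deploy_kdd_tool(kdtml: str, host: str, credentials: dict) -> str:
    wrapper, esawsdl = call_aws(kdtml)             # 1. build wrapper and eSAWSDL descriptor
    token = call_auths(host, credentials)          # 2. obtain write/execute permissions
    call_fts(host, token, [wrapper, esawsdl])      # 3. upload files to the provider's server
    url = call_ds(host, token, wrapper)            # 4. compile the wrapper and deploy the service
    call_broker_publish(url, esawsdl)              # 5. publish the new basic service
    return url

print(deploy_kdd_tool("<kdtml/>", "provider.example.org", {"user": "publisher"}))
```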

4.2 Discovery

In the typical scenario, besides the publishing functionalities introduced in the previous subsection, the main goal of a service broker is to help users in obtaining information about services, in particular the location of the service provider hosting their descriptors.

Fig. 6 A screenshot of the WSClient, where the Weka implementation of the C4.5 algorithm is deployed


On the one hand, the UDDI standard allows providers to describe how their services can be accessed, which information they want to make public, and which they choose to keep private. On the other hand, UDDI supplies customers with a set of standard APIs for service discovery, even if only syntactic search is enabled: as a matter of fact, users may search services only by their names (i.e., white pages) or by a plain taxonomy (i.e., yellow pages). Furthermore, although UDDI entities are defined in XML, they lack explicit semantics, thus limiting the support that can be offered to a user in the KDD domain. In order to enhance UDDI's expressiveness, both UDDI structures and standard APIs have been extended to allow the mapping between a service and an abstract algorithm, thus improving searching capabilities. From a technical point of view, such modifications regard the introduction, for each annotated service, of a customized categoryBag structure, which introduces an "algorithm-ontology" pair representing the name of the algorithm implemented by the service and the URL of KDDONTO, where the algorithm is formally defined. A further extension concerns performance indexes and values, which are stored in another categoryBag and allow the search API to be used to find services whose performances are below/above certain thresholds. Since the Broker is available as a web service, its APIs can be called by any external tool such as the BrokerClient or the KDDDesigner introduced in the next subsection. Details about the adopted UDDI structure and Broker APIs are discussed in Diamantini et al. (2007).
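The effect of the extended categoryBag can be sketched with a simplified, in-memory stand-in for the registry (field names are illustrative, not the actual UDDI keyedReference layout): each entry carries the algorithm-ontology pair and performance values, so services can be filtered by implemented algorithm and by performance thresholds.

```python
# A simplified, in-memory stand-in for the extended UDDI registry entries;
# field names and URLs are illustrative placeholders.
registry = [
    {"service": "J48weka", "esawsdl": "http://provider1.example.org/j48.wsdl",
     "algorithm": "C4.5", "ontology": "http://example.org/kdontology.owl",
     "performance": {"availability": 0.99}},
    {"service": "knnService", "esawsdl": "http://provider2.example.org/knn.wsdl",
     "algorithm": "k-NN", "ontology": "http://example.org/kdontology.owl",
     "performance": {"availability": 0.95}},
]

def find_services(algorithm: str, min_perf: dict = None):
    """Return registry entries implementing `algorithm` and meeting the thresholds."""
    min_perf = min_perf or {}
    return [e for e in registry
            if e["algorithm"] == algorithm
            and all(e["performance"].get(k, 0) >= v for k, v in min_perf.items())]

print(find_services("C4.5", {"availability": 0.98}))
```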

By means of these new features, searching capabilities move from purely syntactic to semantic-based. In order to explain what a semantic query gives to the user, let us consider the following case: a user has to build a predictive anti-spam filter, and she has at her disposal a set of previously classified emails (spam/no-spam). The user knows that spam emails are much less frequent than no-spam ones, and that classification algorithms have bad performances on unbalanced data. She therefore decides to preprocess the data by undersampling the most frequent class, but she does not know what services are available in the registry. The use of standard UDDI white pages would be of no help, since she does not know service names. Even the use of yellow pages would lead to incomplete results; for instance, it is likely that she would not be able to find solutions like k-NN. The reason is that k-NN is a classification algorithm that is also used for undersampling (Zhou and Liu 2006). Since in a plain taxonomy only one category can be assigned to each algorithm, the best choice is to define k-NN as a classifier. On the other hand, if the taxonomy is flat and categories contain a larger number of algorithms, it is likely that many algorithms that cannot be used for undersampling are returned. The logical organization of KDDONTO, which uses relationships to link classes, allows a more consistent representation of the reality by assigning k-NN to both the undersampling and classification categories. Thus more complete and accurate results are returned to the user.

After balancing the dataset, the user is interested in finding a classification service suited to her data, which have some nominal attributes and missing values. Again, standard UDDI registries would not give any support to this kind of search. Reasoning over the ontology, the Broker returns the right candidates by looking for algorithms performing the classification task whose inputs satisfy the required properties. Figure 7 shows a screenshot of the BrokerClient application: on the left, a subset of classification algorithms satisfying the user's request (i.e. able to process data with nominal and missing values) is shown, and on the bottom left some details of the service implementing the highlighted algorithm, namely the service wrapping the J48weka tool described in the previous section. In particular, note that the availability is reported as an example of a performance index.

4.3 Composition

The most important problem during process design is to generate a meaningful sequence of tools such that (a) it is able to process the user dataset, (b) it solves the user problem, and (c) it is valid and semantically correct, i.e. the tools can be syntactically interfaced with each other and using a tool's output as input for another tool makes sense in the KDD domain.

We base our composition approach on the definition of matching criteria, which are able to formally evaluate whether (and to what extent) an algorithm interface is compatible with a certain datum, thus allowing us to verify whether (a) an algorithm can accept a certain dataset as input, (b) an algorithm yields a certain model and thus satisfies a certain task, and (c) an algorithm is executable before/after another. These criteria are based on the semantics of algorithms (described in KDDONTO) and on a proper conceptualization of the KDD domain, obtained by annotating each element of a process (user dataset, goal, interfaces) as follows:

(a) the user dataset is annotated with the syntactic description of the format of its content and a set of ontological terms which semantically describe its intensional properties. For instance, a labeled dataset without missing values, balanced with respect to its classes and including only float numbers, will be annotated as a LabeledDataset with the property set no missing values, balanced, float;

(b) the user goal can be specified by the user as one of the KDD tasks available in the ontology (e.g., CLASSIFICATION, CLUSTERING, RULE ASSOCIATION);


Fig. 7 A screenshot of the BrokerClient returning information about services implementing the C4.5 algorithm

(c) service interfaces are fully described, both syntactically and semantically, in the eSAWSDL descriptor, given that each input/output datum's description includes a reference to the corresponding concept in KDDONTO, as explained in Section 3.3.

Each algorithm takes data with certain features as input, performs some operations and returns data as output, which are then used as input for the next algorithm in the process. Thus, two algorithms can be matched if the output of the first is compatible with the input of the second. The check is performed by comparing the data being connected, not only syntactically but also semantically, with the aim of understanding whether they are conceptually equivalent, or whether the output is a subtype/subcomponent of the input. This operation is realized by looking for a path inside KDDONTO between the input and the output, made of the ontological relations sameAs, is a and part of, and by properly weighting each relation. For instance, we can feed an algorithm with a LabeledDataset even if it requires a NotLabeledDataset, because the latter is a part of the former. As regards the weight of a relation, if a concept is a subclass of another it inherits all the properties of its parent, thus using a specialization instead of its generalization does not hamper service composition; on the contrary, if a part of relation exists, a data manipulation operation has to be performed before linking the two services, hence increasing the matching weight.

According to the length of the path and the weights of each link, a matching cost is evaluated. The evaluation of the cost also takes into account the computational complexity of the algorithms and the existence of preconditions on the input datum, which can be completely or partially satisfied by the output. In order to evaluate the matching cost for a given pair of algorithms, the platform includes the MatchMaker Service, which implements the matching criteria.
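A minimal sketch of the matching idea follows; the relation facts and weights are illustrative, not the actual KDDONTO content. The cost of connecting an output concept to an input concept is computed as the cheapest path made of sameAs, is a and part of links.

```python
import heapq

# Illustrative relation facts and weights (not the actual KDDONTO content):
# traversing is_a is cheap (a specialization keeps all parent properties),
# traversing part_of is more expensive (a data-manipulation step is needed).
WEIGHTS = {"sameAs": 0.0, "is_a": 0.1, "part_of": 0.5}
EDGES = {
    "LabeledDataset": [("part_of", "NotLabeledDataset"), ("is_a", "Dataset")],
    "NotLabeledDataset": [("is_a", "Dataset")],
}

def match_cost(output_concept: str, input_concept: str) -> float:
    """Cheapest path from the producer's output concept to the consumer's input concept."""
    queue, best = [(0.0, output_concept)], {}
    while queue:
        cost, concept = heapq.heappop(queue)
        if concept == input_concept:
            return cost
        if concept in best and best[concept] <= cost:
            continue
        best[concept] = cost
        for relation, target in EDGES.get(concept, []):
            heapq.heappush(queue, (cost + WEIGHTS[relation], target))
    return float("inf")   # no path: the two data are not compatible

# A LabeledDataset can feed an algorithm requiring a NotLabeledDataset, at some cost:
print(match_cost("LabeledDataset", "NotLabeledDataset"))   # 0.5
print(match_cost("NotLabeledDataset", "LabeledDataset"))   # inf (not compatible)
```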

The composition functionalities are implemented in KDDDesigner, a web-based, blackboard-like visual tool aimed at supporting users in the complex task of composing a collaborative KDD process out of a set of KDD tools. KDDDesigner has been conceived to be the access point to the whole platform (see Fig. 4). Through a search form, users can query the Broker in order to find services, which can then be dragged and dropped onto the design blackboard. When a connection between two inputs/outputs is established by the user, the MatchMaker service is called to evaluate the semantic cost of their connection, and hence to verify whether the data can actually be connected. As mentioned in Section 4.2, the MatchMaker is also exploited during service discovery. In fact, by selecting a service (and so its corresponding algorithm), it is possible to search for algorithms which are executable before/after the algorithm at hand, and then to look in the UDDI registry for services implementing them. To this end the Broker interacts with the MatchMaker, returning a list of useful services ranked according to their matching cost. This functionality gives users invaluable help during composition, by helping novice users reduce the search space and find only those services that are compatible with the one at hand.


In order to provide advanced support especially for novice users, who have difficulty in choosing effective solutions for their goals, the KDDComposer is introduced, a support service aimed at generating prototypical KDD processes in a semi-automatic fashion. Through the KDDComposer, a user is only asked to provide a dataset and to specify the task to achieve; the service, by iteratively calling the MatchMaker service, will yield a list of possible KDD processes (Diamantini et al. 2009b). In order to improve the quality of the results, two parameters are set: the maximum length of the process in terms of algorithms (max length), which affects the process execution time; and the maximum length of a path in the ontology linking two algorithms (max distance), which affects the quality of a match by avoiding concepts that are too distant. These parameters reduce the search space, hence also improving the efficiency of the whole procedure. The generated processes are prototypical (i.e., abstract) and hence not directly executable, because they are formed of algorithms; their aim is to provide support, describing which possible sequences of algorithms may be used to solve the problem at hand. We note that, despite not being executable, prototype processes are in themselves new patterns, which are valid (they are based on the ontology) and potentially useful; hence, a prototype process represents a KDD outcome in itself, as defined in Fayyad et al. (1996). At present, KDDDesigner and KDDComposer are available as independent prototypes, which will be integrated in future work.
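The composition strategy can be sketched as a bounded search over algorithm descriptions (a simplified illustration: the catalogue and the matching function below are stand-ins for KDDONTO and the MatchMaker, and max distance is folded into the matching stand-in). Starting from the annotated dataset, sequences are extended only with algorithms whose input matches the current output, up to max length steps.

```python
# Simplified illustration of the KDDComposer idea; the catalogue, the matching
# function and the concept names are stand-ins, not the actual platform content.
CATALOGUE = {
    # algorithm: (required input concept, produced output concept)
    "RemoveMissingValues": ("Dataset", "Dataset"),
    "Undersampling":       ("LabeledDataset", "LabeledDataset"),
    "C4.5":                ("LabeledDataset", "ClassificationModel"),
}

def matches(produced: str, required: str) -> bool:
    # Stand-in for the MatchMaker: only exact matches or the trivial
    # LabeledDataset-to-Dataset compatibility are accepted here.
    return produced == required or (produced == "LabeledDataset" and required == "Dataset")

def compose(current: str, goal: str, max_length: int, prefix=()):
    """Yield prototypical processes (algorithm sequences) turning `current` into `goal`."""
    if matches(current, goal):
        yield prefix
    if max_length == 0:
        return
    for alg, (inp, out) in CATALOGUE.items():
        if alg not in prefix and matches(current, inp):
            yield from compose(out, goal, max_length - 1, prefix + (alg,))

for process in compose("LabeledDataset", "ClassificationModel", max_length=3):
    print(" -> ".join(process))
```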

Coming back to our case study, let us suppose that the user does not know how to proceed after combining k-NN and C4.5. In this case, calling the MatchMaker from the KDDDesigner, candidate services are shown to the user, like services for performance evaluation or tree visualization. In case the user does not have the knowledge needed to start building a suitable process, the KDDComposer is the right solution. Figure 8 shows the interface of the KDDComposer, where the user has highlighted the classification task and selected the characteristics of the dataset: nominal attributes, with missing values, and not balanced. The max length and max distance have both been set to five, which has experimentally been shown to produce good performances. Although in the example the dataset annotation is set manually, this activity can be supported by the DBAnalyzer, a support service capable of analyzing a generic dataset and extracting its properties.

An example of a resulting process is shown in Fig. 9, where the KDDComposer suggests using, in cascade: LDA for selecting the most important features, SpaceReduction to reduce the dimensionality of the space in accordance with the outcome of LDA, and MetaCost as the classification algorithm. As one can notice, in this process no algorithm for rebalancing the dataset is used; in fact, MetaCost is able to transform most classification algorithms into cost-sensitive classification algorithms (hence enabling them to manage unbalanced data). As shown in the bottom left panel, MetaCost returns a decision tree as output; in fact our implementation of MetaCost works with C4.5, as suggested in the literature. We would like to note that the KDDComposer has discovered a different approach to the classification of unbalanced data, which is novel and, probably, more effective. In general, the KDDComposer allows users, ranging from novices to the most expert, to perform a more systematic and

Fig. 8 An example of a request to the KDDComposer: processes performing the classification task over a labeled dataset with missing values, nominal attributes and not balanced data


effective exploration of possible solutions, while improving their skills.

As in any experimental workflow, KDD activities are iterative in nature, due to the difficulty of defining a priori the best plan to discover knowledge. This fact is recognized in all existing process models (see e.g. Shearer 2000) by accounting for the need of repeated backtracking to previous steps and repetition of certain actions: the lessons learned during a step can help to recognize errors in previous steps or can give knowledge to enhance their outcomes. Backtracking has to be managed in order to facilitate the comparison of different trials. In order to take this feature into account, the system is equipped with VerMan, a service for the management of different versions of the same process. This service interacts with the KDDDesigner by keeping track of the creation and modification of a process by the design team. VerMan also allows the team to attach an annotation explaining the changes. Every version of the process is stored in a repository hosted at the server providing the VerMan service. Functionalities enabled by VerMan include the browsing of the version tree, the retrieval of a specific version and the loading of the corresponding process in the KDDDesigner. The tree structure underlying VerMan can also be exploited for the synchronization of different copies generated during process co-design, hence enabling a multi-synchronous editing environment (Dourish 1995). However, its definition and development are beyond the goal of the present research work.
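As a rough sketch of the data structure such a service might maintain, the following Python fragment models a version tree in which every saved version records its parent, author and annotation, so that the tree can be browsed, a specific version retrieved, and the lineage of a trial reconstructed for comparison. Class and method names are ours, not VerMan's actual interface.

```python
# Illustrative version-tree model for a VerMan-like service (hypothetical names).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ProcessVersion:
    version_id: str
    process_xml: str                      # serialized process document
    annotation: str                       # note explaining the change
    author: str
    created: datetime = field(default_factory=datetime.utcnow)
    parent_id: str | None = None          # None for the root version

class VersionTree:
    def __init__(self):
        self.versions: dict[str, ProcessVersion] = {}

    def commit(self, version: ProcessVersion) -> None:
        self.versions[version.version_id] = version

    def children(self, version_id: str) -> list[ProcessVersion]:
        return [v for v in self.versions.values() if v.parent_id == version_id]

    def lineage(self, version_id: str) -> list[ProcessVersion]:
        """Path from the root to the requested version, useful for comparing trials."""
        path, current = [], self.versions.get(version_id)
        while current is not None:
            path.append(current)
            current = self.versions.get(current.parent_id) if current.parent_id else None
        return list(reversed(path))
```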

4.4 Activation

Once a process made of KDD services has been produced, the user may want to execute it. A process is a workflow of services, which can be executed by means of a workflow engine able to interpret standard workflow languages (e.g. XPDL, BPMN and BPEL). Processes are currently represented in an internal XML format, and can be exported in a standard workflow language so that their execution can be managed by the WorkFlow Manager (WFM). This service extends existing engines by interpreting semantic annotations and specific KDDVM information. Before the process is executed, each service is assigned to a user who will be responsible for its execution; usually users with some degree of knowledge of the functionalities provided by the service or of the data handled by it. At present we have developed a prototype WFM that interprets processes written in XPDL. Since the KDDDesigner lets users create a process without completely wiring all the services' interfaces, when the execution of a service needs human intervention (e.g. for parameter setting), the WorkFlow Manager calls the ClientFactory. This consumer-side application exploits the semantic descriptions of the eSAWSDL and generates on-the-fly a graphical interface showing the user entrusted with the execution of the service every service parameter, plus other information taken from the descriptor, among which the valid range of values (min-max) for each numeric parameter, or the default value used if the user does not specify any value. Figure 10 provides an example of an interface generated by the ClientFactory for the setting of parameters; in particular, it shows the interface for invoking the service implementing the J48weka tool.
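The following fragment sketches the parameter-resolution logic that such an interface relies on: each numeric parameter carries a valid range and a default taken from the descriptor, user-supplied values are validated against the range, and the default is used when no value is given. The descriptor content shown (confidence factor and minimum instances per leaf, loosely inspired by typical decision-tree options) is illustrative and not taken from the actual eSAWSDL of the service.

```python
# Hypothetical descriptor metadata as the ClientFactory might read it:
# for each numeric parameter, a type, a valid (min, max) range and a default.
descriptor = {
    "confidence_factor": {"type": "float", "min": 0.0, "max": 1.0, "default": 0.25},
    "min_instances_per_leaf": {"type": "int", "min": 1, "max": 100, "default": 2},
}

def resolve_parameters(descriptor, user_values):
    """Merge user-provided values with defaults, enforcing the declared ranges."""
    resolved = {}
    for name, meta in descriptor.items():
        value = user_values.get(name, meta["default"])   # default if unspecified
        if not (meta["min"] <= value <= meta["max"]):
            raise ValueError(f"{name}={value} outside valid range "
                             f"[{meta['min']}, {meta['max']}]")
        resolved[name] = value
    return resolved

# Example: only one parameter explicitly set by the user entrusted with the execution
print(resolve_parameters(descriptor, {"confidence_factor": 0.1}))
```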

The use of the ClientFactory during process execution enables collaborative work, allowing users to design processes at different levels of detail. As a matter of fact, a team can either specify all the details needed to execute the process, or fix just the structure of the process, outsourcing the work of tuning parameters to other users with more specific skills and expertise.

Fig. 9 An example of prototypical process reported by the KDDComposer service


Fig. 10 The interface generated by the ClientFactory for the setting of parameters of the service wrapping the J48weka tool

In the described framework, even a process can be annotated in order to support its storage and retrieval. At present we describe a process by means of its name, its authors, the team working on it, the date and time of the last edit, and textual comments, which can be useful in case of later modifications.
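Purely as an illustration, such process-level metadata could be recorded as a simple structure like the following (all field names and values are invented):

```python
# Hypothetical process annotation record with the fields mentioned above.
process_annotation = {
    "name": "unbalanced-classification-v3",
    "authors": ["alice", "bob"],
    "team": "kdd-lab",
    "last_edit": "2013-01-15T10:42:00Z",
    "comments": "Replaced the rebalancing step with a cost-sensitive MetaCost classifier.",
}
```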

5 Conclusions and future work

This work has presented KDDVM, a novel framework and platform for KDD support in distributed collaborative environments. The framework organizes the core set of information deemed fundamental to manage the complexity of such an environment. The organization of information in abstraction layers allows the most suitable technologies to be exploited for its representation and exploitation; moreover, each layer uses the upper layer as a semantic reference space. The exploitation of semantic information is widely recognized as the state-of-the-art way towards a new generation of systems with advanced intelligent functionalities for collaboration in large heterogeneous, distributed environments.

The KDDVM framework and platform have been conceived to satisfy flexibility, transparency, reusability and ease-of-use requirements. Flexibility is guaranteed by the choice of developing the platform as a Service Oriented Architecture. As a matter of fact, the use of Web Service standards facilitates the integration of new tools into the platform, as well as their discovery, composition, and activation. Furthermore, the semantic representation of data and services enables us to provide KDDVM with a set of support services conceived to give advanced support while hiding technical details, hence satisfying the ease-of-use and transparency requirements. Among others, the Broker provides standard and semantics-driven discovery functionalities, while the MatchMaker and the KDDComposer support the composition of services into valid and effective processes. Reusability is mainly guaranteed by the introduction of the Wrapping Service, which allows users to easily add legacy and third-party tools not specifically designed for the platform, or even for service-oriented environments. Besides support services, a set of client interfaces has been developed to facilitate the use of the platform. Among them, the KDDDesigner represents an access point to all functionalities provided by KDDVM through a blackboard-like graphical interface, and the ClientFactory exploits the semantic descriptions contained in the eSAWSDL of a service to generate on-the-fly a graphical interface explaining the use of the service.

We would like to note that the principles adopted in the KDDVM design greatly improve the perceived quality of an application, for several reasons: greater sharing and accessibility to end users, reduced overhead of software installations, reduced cost for occasional users of the software and increased user mobility. As regards the limits of the adopted framework, it must be pointed out that semantic annotation is critical: the quality of the provided support greatly depends on the quality of the semantic representation of resources. To contain this criticality, it is important that the publisher be a domain expert with skills on the tool at hand. Furthermore, the exploitation of SOA may lead to some critical factors, such as a possible decrease in privacy and security, because data are processed remotely, and the need for sufficient bandwidth to operate with large datasets. Anyway, these aspects can be considered technical issues, which can be properly faced by the adoption of up-to-date standards for secure transmission, and by defining proper access control and hardware sizing.

Finally, the evaluation of the ease-of-use requirement deserves special consideration. If on the one hand the analysis of user feedback can help to evaluate and improve the whole platform, on the other hand the large number of functionalities provided by the platform, and of ways of combining them, requires a systematic usability study to be accurately designed. At present, the use of the platform has been proposed to a group of graduate students in computer science, during their first course on KDD. Positive feedback has been returned about the usability of the semantically enhanced Broker with respect to traditional web service discovery, as it improves the precision and recall of search results. Also, users have greatly appreciated the acceleration given by the ClientFactory and the KDDDesigner to the setting up of an experiment, as they avoid errors and make suggestions in parameter setting and process composition, while providing a unique interface for any service. We plan to perform a systematic usability study as a further refinement step of the research.

In the future we also plan to extend functionalities for the sharing and reuse of workflows, by enriching their semantic description and by applying clustering techniques to organize the workflow repository into groups of similar workflows, so as to enhance search and to discover typical usage patterns. Also, the exploitation of semi-automatic techniques like those proposed in Peng et al. (2007), Bell et al. (2007) to support the semantic annotation of web services, or to compile, check and improve the ontology (Chien et al. 2009), will be taken into consideration. Finally, we plan to launch a widespread experimental campaign to augment the number of services and of different users, so as to stress the heterogeneity of the repository.

References

Ali, A.S., Rana, O.F., Taylor, I.J. (2005). Web services composition for distributed data mining. In Proc. of the International Conference on Parallel Processing Workshops (pp. 11–18).

Alsairafi, S., Ghanem, M., Giannadakis, N., Guo, Y., Kalaitzopoulos, D., Osmond, M., Rowe, A., Syed, J., Wendel, P. (2003). The design of discovery net: Towards open grid services for knowledge discovery. International Journal of High Performance Computing Applications, 17(3), 297–315.

Bell, D., de Cesare, S., Iacovelli, N., Lycett, M., Merico, A. (2007). A framework for deriving semantic web services. Information Systems Frontiers, 9, 69–84.

Bernstein, A., Provost, F., Hill, S. (2005). Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering, 17(4), 503–518.

Cannataro, M., & Talia, D. (2003). The knowledge grid. Communications of the Association for Computing Machinery, 46(1), 89–93.

Cheung, W.K., Zhang, X.F., fai Wong, H., Liu, J., Luo, Z.W., Tong, F.C.H. (2006). Service-oriented distributed data mining. IEEE Internet Computing, 10(4), 44–54.

Chien, B., Hu, C., Ju, M. (2009). Learning fuzzy concept hierarchy and measurement with node labeling. Information Systems Frontiers, 11, 551–559.

Comito, C., Mastroianni, C., Talia, D. (2006). Metadata, ontologies and information models for resource management in grid-based PSE toolkits. International Journal of Web Services Research, 3(4), 52–72.

Congiusta, A., Talia, D., Trunfio, P. (2008). Service-oriented middleware for distributed data mining on the grid. Journal of Parallel and Distributed Computing, 68(1), 3–15.

De Roure, D., Goble, C.A., Stevens, R. (2009). The design and realisation of the myExperiment virtual research environment for social sharing of workflows. Future Generation Computer Systems, 25(5), 561–567.

Demsar, J., Zupan, B., Leban, G., Curk, T. (2004). Orange: From experimental machine learning to interactive data mining. In Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (Eds.), Knowledge discovery in databases: PKDD 2004, LNCS (Vol. 3202, pp. 537–539). Springer.

Diamantini, C., & Potena, D. (2008). Semantic annotation and services for KDD tools sharing and reuse. In Proc. of the 8th IEEE international conference on data mining workshops, 1st int. workshop on semantic aspects in data mining (pp. 761–770). Pisa, Italy.

Diamantini, C., Potena, D., Cellini, J. (2007). UDDI registry for knowledge discovery in databases services. In Proc. of the international symposium on collaborative technologies and systems, IEEE (pp. 321–328). Orlando, FL, USA.

Diamantini, C., Potena, D., Storti, E. (2009a). KDDONTO: An ontology for discovery and composition of KDD algorithms. In Proc. of the ECML/PKDD09 workshop on third generation data mining: Towards service-oriented knowledge discovery (pp. 13–24). Bled, Slovenia.

Diamantini, C., Potena, D., Storti, E. (2009b). Ontology-driven KDD process composition. In Adams, N. (Ed.), Advances in intelligent data analysis VIII, proc. of the 8th international symposium on intelligent data analysis, LNCS (Vol. 5772, pp. 285–296). Lyon, France: Springer.

Dourish, P. (1995). The parting of the ways: Divergence, data management and collaborative work. In Proc. of the fourth European conference on computer-supported cooperative work (pp. 215–230). Stockholm, Sweden.

Farrell, J., & Lausen, H. (2007). Semantic annotations for WSDL and XML Schema, W3C recommendation. http://www.w3.org/TR/sawsdl/.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. (1996). From data mining to knowledge discovery: An overview (pp. 1–34). Menlo Park, CA, USA: American Association for Artificial Intelligence.

Fernandez-Lopez, M., Gomez-Perez, A., Juristo, N. (1997). Methontology: From ontological art towards ontological engineering. In Proc. of the AAAI Spring Symposium Series on Ontological Engineering (pp. 33–40). Stanford, USA.

Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml.

Gruber, T.R. (1995). Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human Computer Studies, 43(5–6), 907–928.

Guedes, D., Meira, W., Ferreira, R. (2006). Anteater: A service-oriented architecture for high-performance data mining. IEEE Internet Computing, 10(4), 36–43.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.

Jurisica, I., & Glasgow, J. (2006). Introduction: Knowledge discovery in high-throughput biological domains. Information Systems Frontiers, 8, 5–7.

Kickinger, G., Hofer, J., Brezany, P., Tjoa, A.M. (2004). Grid knowledge discovery processes and an architecture for their composition. In Proc. of the IASTED international conference on parallel and distributed computing and networks. Innsbruck, Austria.

Kumar, A., Kantardzic, M.M., Ramaswamy, P., Sadeghian, P. (2004). An extensible service oriented distributed data mining framework. In Proc. of the international conference on machine learning and applications (pp. 256–263).

Majithia, S., Shields, M.S., Taylor, I.J., Wang, I. (2004). Triana: A graphical web service composition and execution toolkit. In Proc. of IEEE international conference on web services (pp. 514–521).

Noy, N.F., & McGuinness, D.L. (2002). Ontology development 101: A guide to creating your first ontology. Stanford University.

Olejnik, R., Fortis, T.F., Toursel, B. (2009). Web services oriented data mining in knowledge architecture. Future Generation Computer Systems, 25(4), 436–443.

Panov, P., Dzeroski, S., Soldatova, L.N. (2008). OntoDM: An ontology of data mining. In Proc. of the 8th IEEE int. conf. on data mining workshops, 1st int. workshop on semantic aspects in data mining (pp. 752–760).

Park, B.H., & Kargupta, H. (2003). Distributed data mining: Algorithms, systems, and applications. In Ye, N. (Ed.), The handbook of data mining (pp. 341–358). Routledge.

Pedrinaci, C., & Domingue, J. (2010). Web services are dead. Long live internet services. Tech. rep.

Peng, D., Wang, X., Zhou, A. (2007). VSLattice: A vector-based conceptual index structure for web service retrieval. Information Systems Frontiers, 9, 423–437.

Perez, M.S., Sanchez, A., Robles, V., Herrero, P., Peña, J.M. (2007). Design and implementation of a data mining grid-aware architecture. Future Generation Computer Systems, 23(1), 42–47.

Podpecan, V., Zakova, M., Lavrac, N. (2010). Workflow construction for service-oriented knowledge discovery. In Margaria, T., Steffen, B. (Eds.), Leveraging applications of formal methods, verification, and validation, LNCS (Vol. 6415, pp. 313–327). Springer.

Sarawagi, S., & Nagaralu, S.H. (2000). Data mining models as services on the internet. ACM SIGKDD Explorations Newsletter, 2(1), 24–28.

Serban, F., Kietz, J.U., Bernstein, A. (2010). An overview of intelligent data assistants for data analysis. In Proc. of the 3rd planning to learn workshop at ECAI 2010 (pp. 7–14).

Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13–22.

Treadwell, J. (2005). Open Grid Services Architecture glossary of terms. GGF Document GFD-I.044.

Tsai, C.Y., & Tsai, M.H. (2005). A dynamic web service based data mining process system. In Proc. of the 5th international conference on computer and information technology (pp. 1033–1039).

Yu-hua, L., Zheng-ding, L., Xiao-lin, S., Kun-mei, W., Rui-xuan, L. (2006). Data mining ontology development for high user usability. Wuhan University Journal of Natural Sciences, 11(1), 51–56.

Zhou, Z.H., & Liu, X.Y. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1), 63–77.

Claudia Diamantini is associate professor at the Information Engineering Department, Polytechnic University of Marche, where she leads the Knowledge Discovery & Management research group. She received the PhD degree in Artificial Intelligent Systems from the University of Ancona in 1995. Her research interests are in the areas of business intelligence, data mining and knowledge discovery, and semantic interoperability, with special attention to interdisciplinary relations among them. She has been working on these topics within national and international projects, and is the author of about 80 technical papers in refereed journals and conferences. She is a member of the IEEE and ACM, and a co-founder and scientific responsible of the INTEROP-VLab.It scientific association.

Domenico Potena is an assistant professor at the Information Engineering Department of the Polytechnic University of Marche, Italy. He received the MSc degree in Electronic Engineering from the University of Ancona, Italy, in 2001, and the Ph.D. in Information Systems Engineering from the Polytechnic University of Marche, in 2004. From June 2005 to October 2008, he was a post-doctoral fellow at the same University. His research interests include knowledge discovery in databases, data mining, data warehousing, information systems and service oriented architectures.

Emanuele Storti received a Ph.D. in Information Systems Engineering from the Polytechnic University of Marche, Italy, in 2012, and is currently a postdoctoral fellow at the Information Engineering Department, Polytechnic University of Marche. His research interests include Knowledge Discovery, Semantic technologies, eCollaborative platforms and Business Intelligence systems.