Pattern based feature construction in semantic data mining

 

Agnieszka Ławrynowicz, Poznan University of Technology, Poland Jędrzej Potoniec, Poznan University of Technology, Poland

ABSTRACT

We propose a new method for mining sets of patterns for classification, where patterns are represented as SPARQL queries over RDFS. The method contributes to so-called semantic data mining, a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies, rather than only purely empirical data. We have developed a tool that implements this approach and used it to conduct an experimental evaluation, including a comparison of our method to state-of-the-art approaches to the classification of semantic data, and an experimental study within the emerging subfield of meta-learning called semantic meta-mining. The most important research contributions of the paper to the state of the art are as follows. For pattern mining research, and relational learning in general, the paper contributes a new algorithm for the discovery of a new type of patterns. For Semantic Web research, it theoretically and empirically illustrates how semantic, structured data can be used in traditional machine learning methods through a pattern-based approach to constructing semantic features.

Keywords: pattern discovery, semantic data mining, SPARQL, meta-learning, ontology, intelligent system

INTRODUCTION

Pattern discovery is a fundamental data mining task. It deals with the automatic detection of patterns in data. A pattern is any regularity, relation or structure inherent in some source of data (Shawe-Taylor & Cristianini, 2004). Various methods have been proposed for finding patterns in a variety of forms, such as item sets, association rules, correlations, sequences, and episodes. From the point of view of this paper, we are interested in structured domains, where data is represented in complex forms like relational databases, logic programs, and in particular semantic data such as ontology-based knowledge bases or Linked Open Data (LOD)1.
Relational pattern discovery has been investigated since the development of WARMR (Dehaspe & Toivonen, 1999), an algorithm for mining patterns using the Datalog subset of first-order logic as the representation language for data and patterns. It has been followed by subsequently proposed relational pattern mining algorithms such as FARMER (Nijssen & Kok, 2001) and c-armr (De Raedt & Ramon, 2004). They can all be classified as Inductive Logic Programming (ILP) (Nienhuys-Cheng & Wolf, 1997) methods, since they use subsets of logic programs as the representation language.

1 http://linkeddata.org/

With the rise of the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001), also called the Web of Data, interest has grown in employing the languages and knowledge representation formalisms underpinning the Semantic Web in data mining. This interest is motivated by the growing popularity, number and size of semantic data sources such as LOD (containing billions of pieces of data linked together2), which require statistical approaches able to handle Semantic Web knowledge representation formalisms. These formalisms include logic-based ontology languages such as description logics (DLs) (Baader, Calvanese, McGuinness, Nardi, & Patel-Schneider, 2003), which constitute the formalism underlying the standard ontology language for the Web, the Web Ontology Language (OWL) (McGuinness & van Harmelen, 2004). In this line, (Lisi & Esposito, 2008) laid the foundations of an extension of relational learning, called onto-relational learning, to account for ontologies. (Fanizzi, d'Amato, & Esposito, 2010) propose the term ontology mining for all activities that allow to discover hidden knowledge from ontological knowledge bases, possibly using only a sample of the data. Finally, (Kralj-Novak, Vavpetic, Trajkovski, & Lavrac, 2009) coined the term semantic data mining3 to denote a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies, rather than to mine purely empirical data.
The above-mentioned interest has been reflected in the development of relevant pattern mining algorithms: first onto-relational ones, like SPADA (Lisi & Malerba, 2004), SEMINTEC (Józefowska, Ławrynowicz, & Łukaszewski, 2010) or AL-QuIn (Lisi, 2011), and subsequently algorithms fully based on a description logic ontology language, like Fr-ONT (Ławrynowicz & Potoniec, 2011). In recent years, the topic of using patterns in predictive models has drawn a lot of attention (Bringmann, Nijssen, & Zimmermann, 2009). Especially in complex, structured domains, such as graphs and sequences, pattern mining can be helpful for obtaining models. The main idea is that patterns can be used as features to build a predictive model. For instance, pattern-based classification is the process of learning a classification model where patterns are used as features. According to recent studies, classification models making use of pattern-based features may be more accurate or simpler to understand than models built on the original feature set (Cheng, Yan, Han, & Hsu, 2007). In structured domains, pattern mining may work as a propositionalisation approach that enables using classical propositional data mining/machine learning methods by decoupling the data representation from the learning task.

This paper describes a method for pattern-based classification based on a novel algorithm for pattern mining. The proposed algorithm discovers patterns represented as SPARQL (Prud'hommeaux & Seaborne, 2008) queries over a subset of RDF Schema (RDFS) (Brickley & Guha, 2004) suitable for representing lightweight ontologies. The algorithm takes the semantics of the RDFS vocabulary into account, which enables it to exploit knowledge encoded in ontologies. Through a propositionalisation approach, the patterns are used as features in classification. Subsequently, we describe a tool we have developed to support semantic data mining approaches in general, in which the proposed method is implemented.
The tool, an extension to the leading open source data mining environment RapidMiner (Mierswa, Wurst, Klinkenberg, Scholz, & Euler, 2006), enables building data mining processes (workflows) from small blocks (operators) by connecting their inputs and outputs, and contributes to so-called third generation data mining systems (Piatetsky-Shapiro, 1997; Hilario, Lavrac, Podpecan, & Kok, 2010). Finally, we describe the results of experiments, including an experimental study within the emerging subfield of meta-learning called semantic meta-mining (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011), an ontology-based, process-oriented form of meta-learning which aims to learn over full knowledge discovery processes rather than over individual algorithms.

2 http://lod-cloud.net/state/
3 http://semantic.cs.put.poznan.pl/SDM-tutorial2011/doku.php?id=start

Consequently, the paper presents important research contributions to the state of the art: (i) for pattern mining research, or semantic data mining and relational learning in general, it contributes a new algorithm for the discovery of a new type of patterns – SPARQL queries over RDFS; (ii) for Semantic Web research, it illustrates how semantic, structured data can be used in traditional data mining/machine learning methods through a propositionalisation approach; (iii) for meta-learning research, it contributes a pattern-based method for learning over knowledge discovery processes with the support of ontologies as background knowledge.

The rest of the paper is organized as follows. The next section discusses work related to ours. In the Preliminaries section, we introduce the notions and definitions used in the text. Further sections contain the description of the proposed algorithm, the implemented tool, and the conducted experimental study. In the last section we conclude.

RELATED WORK

The work relevant to ours may be grouped into the following research threads: relational pattern mining, pattern-based classification, and classification methods for Semantic Web data. As already mentioned, relational pattern discovery methods such as WARMR (Dehaspe & Toivonen, 1999), FARMER (Nijssen & Kok, 2001), and c-armr (De Raedt & Ramon, 2004) have been proposed.
All of them employ Datalog as the representation language for data, background knowledge and patterns, and are aimed at frequent pattern discovery. These systems were generally not designed to exploit ontological knowledge, e.g. in the form of taxonomies of classes. In particular, WARMR and FARMER employ a syntactic generality relation (so-called θ-subsumption (Plotkin, 1970)) that does not take background knowledge into account. c-armr uses a semantic generality relation, a kind of generalized subsumption (Buntine, 1988), but it does not fully exploit taxonomic information. More precisely, when c-armr discovers that a pattern containing an atom C(x) is infrequent, this does not stop it from subsequently testing a pattern containing an atom C'(x), where C is more general than C'. In turn, the onto-relational methods SPADA (Lisi & Malerba, 2004), SEMINTEC (Józefowska, Ławrynowicz, & Łukaszewski, 2010), and AL-QuIn (Lisi, 2011) all exploit taxonomies in some way, and all use semantic generality relations such as generalized subsumption or query containment. SPADA (further refined into AL-QuIn) uses a hybrid knowledge representation formalism, AL-log (Donini, Lenzerini, Nardi, & Schaerf, 1998), that combines Datalog with description logic. Patterns in SPADA/AL-QuIn are represented as constrained Datalog clauses, where description logic concepts (that is, ontology classes) are used as constraints in the body. SPADA/AL-QuIn solves a variant of the frequent pattern discovery task, where classes from ontological taxonomies are used in the constraints of the clauses to produce patterns at multiple levels of granularity. These levels of granularity are exploited very systematically, with the drawback that classes from different granularity levels are never mixed in a single clause, which may be a problem in the case of unbalanced class hierarchies.

The SEMINTEC method (Józefowska, Ławrynowicz, & Łukaszewski, 2010) does not have this restriction. It uses so-called DL-safe rules (Motik, Sattler, & Studer, 2005) as the representation language of the knowledge base on which it operates. The formalism of DL-safe rules combines Semantic Web ontologies (represented in description logic) and rules (represented in disjunctive Datalog). Description logic and disjunctive Datalog rules are integrated by allowing description logic concepts and roles (ontology classes and properties) to occur in rules as unary and binary predicates, respectively, forming so-called DL-atoms. Patterns in SEMINTEC are represented as conjunctive (DL-safe) queries over such a knowledge base, and are basically composed of sets of atoms. A more in-depth analysis of the properties of the above-mentioned systems may be found in (Józefowska, Ławrynowicz, & Łukaszewski, 2010).

The Fr-ONT algorithm (Ławrynowicz & Potoniec, 2011) differs from the already described methods in that it uses a variable-free notation of description logic to represent patterns. Patterns in this approach are represented as concepts of the description logic EL++ (corresponding to classes of the OWL 2 EL profile). With our proposed method we go in a different direction than all the already described relational pattern mining methods, namely towards mining patterns from Resource Description Framework (RDF) (Manola & Miller, 2004) graphs of the Semantic Web. Therefore, we explicitly use the Semantic Web query language SPARQL as the language for representing patterns. While it may be argued that SPARQL basically does not go beyond Datalog in expressive power (Angles & Gutierrez, 2008), this argument does not apply to our case. Despite targeting RDF, we also explicitly employ the semantics of a subset of RDFS to properly handle ontological knowledge, and hence we work with SPARQL queries over RDFS. It is also worth mentioning that we handle datatypes (which was not the case, e.g., with the SEMINTEC method). Finally, explicitly using the language of SPARQL (rather than transforming SPARQL queries to, e.g., Datalog) allows us to use specific constructs of the SPARQL language (like FILTER) in the patterns.

This paper assumes that patterns are mined with the goal of subsequently being used as features to train a classification model. A comprehensive study of the problem of mining sets of patterns for classification is presented in (Bringmann, Nijssen, & Zimmermann, 2009). The authors of the study categorize pattern-based classification methods along the following dimensions: (1) whether they post-process an already pre-computed set of patterns or execute a pattern mining algorithm iteratively; (2) whether they select patterns model-independently or whether the pattern selection is guided by a model. Another study, discussing the problem of computing so-called class-sensitive patterns, may be found in (Kralj-Novak, Lavrac, & Webb, 2009). In this paper, we take the task of computing the classification model into account already during pattern construction, which is reflected in the chosen pattern quality measures and search strategy. In general, our pattern-mining method is not dedicated to a single pattern quality measure; indeed, our tool implementation supports several pattern quality measures that may be chosen during pattern mining.

The subject of performing the classification task on Semantic Web data has already been considered in several works, such as DL-FOIL (Fanizzi, D'Amato, & Esposito, 2008), DL-Learner (Lehmann, 2009), SPARQL-ML (Kiefer, Bernstein, & Locher, 2008), and the kernel-based methods proposed by (Bloehdorn & Sure, 2007) and (Loesch, Bloehdorn, & Rettinger, 2012). We observed that those methods are mainly designed to work on two polar knowledge representations: plain RDF data or complex DLs. However, semantic data (such as LOD datasets)

are rarely pure RDF or DL ontologies, but rather a mix of the two. According to a recent survey (Glimm, Hogan, Kroetzsch, & Polleres, 2012), a subset of RDFS corresponding to the ρdf language (Munoz, Perez, & Gutierrez, 2007), the one employed in our method, constitutes the most frequently used vocabulary on the Web of Data. Another observation is that the mentioned works address the classification task via concept learning (DL-FOIL and DL-Learner), statistical relational learning (SPARQL-ML), or kernel methods. In this work, we propose a new method that addresses the classification task on Semantic Web data via pattern mining. Our research hypothesis is that if a sufficiently large, good-quality pattern-based feature set is constructed, then this kind of method is able to outperform the other proposed classification methods, even in cases where only the semantics of ρdf is used instead of complex DLs.

PRELIMINARIES

In this section we present notions and definitions that are used further in the text. First we provide a short overview of the knowledge representation languages employed in this work, namely RDF and RDFS, and of the query language SPARQL. Subsequently, we formulate the problem of mining pattern sets, where patterns are represented as SPARQL queries.

Language of knowledge representation

RDF and RDFS syntax. RDF is a graph-based data format designed to describe resources on the Web, and the properties of those resources, by means of statements in the form of subject-predicate-object structures. To formally define RDF we follow (Munoz, Perez, & Gutierrez, 2007). We consider pairwise disjoint infinite sets U, B, and L which denote, respectively, URI references, blank nodes and literals. An RDF triple is a tuple τ = (s, p, o) ∈ (U ∪ B ∪ L) × U × (U ∪ B ∪ L). In this tuple, s is called the subject, p the predicate, and o the object. An RDF graph G is a set of RDF triples. We will also refer to an RDF graph as an RDF dataset. The universe of G, denoted by universe(G), is the set of elements in U ∪ B ∪ L that occur in the triples of G. The vocabulary of G, denoted by voc(G), is the set universe(G) ∩ (U ∪ L). In this work, we assume a fragment of RDFS, called ρdf (Munoz, Perez, & Gutierrez, 2007), that covers the fundamental features of RDFS. This fragment is defined by the following subset of the RDFS vocabulary: ρdf = {sp, sc, type, dom, range}, where sp stands for rdfs:subPropertyOf, sc for rdfs:subClassOf, type for rdf:type, and dom and range stand for rdfs:domain and rdfs:range, respectively.
Thus, by (p, sp, q) we denote that property p is a subproperty of property q; by (c, sc, d) that class c is a subclass of class d; by (a, type, b) that a is of type b; and by (p, dom, c) and (p, range, c) that the domain of property p is c and that the range of property p is c, respectively.

RDF and RDFS semantics. Following (Munoz, Perez, & Gutierrez, 2007), we now define the semantics of RDF and RDFS. An interpretation I over a vocabulary Voc is a tuple I = ⟨∆Res, ∆P, ∆C, ∆L, P[[·]], C[[·]], ·I⟩, where


∆Res, ∆P, ∆C, ∆L are the interpretation domains of I, which are finite non-empty sets, and P[[·]], C[[·]], ·I are the interpretation functions of I, such that:

1. ∆Res are the resources, called the domain or universe of I;
2. ∆P are the property names (not necessarily disjoint from ∆Res);
3. ∆C ⊆ ∆Res are the classes;
4. ∆L ⊆ ∆Res are the literal values, where ∆L contains all plain literals in L ∩ Voc;
5. P[[·]] is a function P[[·]]: ∆P → 2^(∆Res × ∆Res), a mapping that assigns an extension to each property name;
6. C[[·]] is a function C[[·]]: ∆C → 2^(∆Res), a mapping that assigns a set of resources to every resource denoting a class;
7. ·I: (U ∪ L) ∩ Voc → ∆Res ∪ ∆P is the interpretation mapping that assigns a resource or a property name to each element of (U ∪ L) in Voc, such that ·I is the identity for plain literals and assigns an element in ∆Res to elements in L.

An interpretation I is a model of a graph G, denoted I ╞ G, iff I is an interpretation over the vocabulary ρdf ∪ universe(G) that satisfies the following conditions:

1. Simple: (a) there exists a function A: B → ∆Res such that for each (s, p, o) ∈ G, p^I ∈ ∆P and (s^IA, o^IA) ∈ P[[p^I]], where ·IA is the extension of ·I using A;
2. Subproperty: (a) P[[sp]] is transitive over ∆P; (b) if (p, q) ∈ P[[sp]] then p, q ∈ ∆P and P[[p]] ⊆ P[[q]];
3. Subclass: (a) P[[sc]] is transitive over ∆C; (b) if (c, d) ∈ P[[sc]] then c, d ∈ ∆C and C[[c]] ⊆ C[[d]];
4. Typing I: (a) x ∈ C[[c]] iff (x, c) ∈ P[[type]]; (b) if (p, c) ∈ P[[dom]] and (x, y) ∈ P[[p]] then x ∈ C[[c]]; (c) if (p, c) ∈ P[[range]] and (x, y) ∈ P[[p]] then y ∈ C[[c]];
5. Typing II: (a) for each e ∈ ρdf, e^I ∈ ∆P; (b) if (p, c) ∈ P[[dom]] then p ∈ ∆P and c ∈ ∆C; (c) if (p, c) ∈ P[[range]] then p ∈ ∆P and c ∈ ∆C; (d) if (x, c) ∈ P[[type]] then c ∈ ∆C.

A graph G entails a graph H under ρdf, denoted G ╞ρdf H, iff every model under ρdf of G is also a model under ρdf of H.

SPARQL syntax. A SPARQL query Q is composed of the body of the query, denoted body(Q), and the head of the query, denoted head(Q).
The body of a SPARQL query may be a complex RDF graph pattern expression, including RDF triples with variables, conjunctions, disjunctions, optional parts, and constraints over the values of the variables. The head of the query indicates

how to construct the answer to the query, where the answer can take different forms, such as a yes/no answer, a table of values, or a new RDF graph. In this work we concentrate only on SELECT queries, that is, on answers that are tables of values.

Let V be an infinite set of variables, disjoint from (U ∪ B ∪ L). We assume that elements from V are prefixed by ?. Following (Perez, Arenas, & Gutierrez, 2009), we present the syntax of SPARQL graph patterns in an algebraic way, using the binary operators UNION, AND, OPT, and FILTER. We define a SPARQL graph pattern recursively as follows:

(1) A tuple from (U ∪ B ∪ L ∪ V) × (U ∪ V) × (U ∪ B ∪ L ∪ V) is a graph pattern (a triple pattern).
(2) If P1 and P2 are graph patterns, then the expressions (P1 AND P2), (P1 OPT P2), and (P1 UNION P2) are graph patterns.
(3) If P is a graph pattern and R is a SPARQL built-in condition, then the expression (P FILTER R) is a graph pattern.

In this paper, we use SPARQL built-in conditions that are constructed from elements of the set L ∪ V and the inequality symbols (≤, ≥, <, >). A set of triple patterns (which in our case corresponds to triples connected with the AND operator) is commonly called a Basic Graph Pattern (BGP). A group graph pattern may extend a BGP with a FILTER operator. Optional graph patterns result from the extension with the OPT (OPTIONAL) operator, and alternative graph patterns (where two or more possible patterns are tried) result from the extension with the UNION operator. The algorithm proposed in this paper constructs patterns using only the AND and FILTER operators, and hence in the remainder of the paper we concentrate only on those operators. Given a graph pattern P, by var(P) we denote the set of variables occurring in P, and given a built-in condition R, by var(R) we denote the set of variables occurring in R.

Example 1.
The following is a SPARQL graph pattern that intuitively corresponds to all trains having at least one passenger car with at least fifty seats, but no more than eighty seats: (?x, type, Train) AND (?x, hasPassengerCar, ?y) AND (?y, hasNumberOfSeats, ?z) FILTER(50<=?z && ?z<=80)
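Written out in concrete SPARQL syntax as a complete SELECT query of the kind used later in the paper (one head variable, here named ?key in place of ?x; the ex: namespace is an illustrative assumption, not part of the example), the pattern reads:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/trains#>

SELECT ?key
WHERE {
  ?key rdf:type ex:Train .
  ?key ex:hasPassengerCar ?y .
  ?y ex:hasNumberOfSeats ?z .
  FILTER (50 <= ?z && ?z <= 80)
}
```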

SPARQL semantics. SPARQL is basically a graph-matching query language. The evaluation of a query Q over an RDF graph G is performed in two steps. First, the body of Q is matched against G in every possible way, resulting in a set of bindings for the variables in the body. Subsequently, based on the information in the head of Q, those bindings are processed by applying relational operators such as projection to produce the final answer to the query. In order to formally define the semantics of SPARQL, we first introduce some terminology based on (Perez, Arenas, & Gutierrez, 2009). A mapping µ is a partial function µ: V → U ∪ L. We denote by dom(µ) the domain of µ, which is the subset of V on which µ is defined. For a triple pattern t such that var(t) ⊆ dom(µ), we denote by µ(t) the triple obtained by replacing the variables in t according to µ. Two mappings µ1 and µ2 are compatible when for all ?x ∈ dom(µ1) ∩ dom(µ2) it holds that µ1(?x) = µ2(?x). It is worth noting that two mappings with disjoint domains are always compatible, and that the empty mapping µ∅ (that is, the mapping with the empty domain) is compatible with any mapping. Let Ω1 and Ω2 denote sets of mappings. We define the join of Ω1 and Ω2 as4:

Ω1 ⨝ Ω2 = {µ1 ∪ µ2 | µ1 ∈ Ω1, µ2 ∈ Ω2, and µ1, µ2 are compatible mappings}.

We now define the semantics of the selected graph pattern expressions (those considered in this paper) as a function transforming a graph pattern expression into a set of mappings. The evaluation of a graph pattern over an RDF graph G, denoted by [[·]]G, is defined recursively as follows:

1. [[t]]G = {µ | dom(µ) = var(t) and µ(t) ∈ G};
2. [[(P1 AND P2)]]G = [[P1]]G ⨝ [[P2]]G.

Further, we define the semantics of the selected FILTER expressions (those considered in this paper). Given a mapping µ and a built-in condition R, we say that µ satisfies R, denoted by µ ╞ R, where op ∈ {≤, ≥, <, >}, if:

– R is ?x op c, ?x ∈ dom(µ), and µ(?x) op c;
– R is ?x op ?y, ?x ∈ dom(µ), ?y ∈ dom(µ), and µ(?x) op µ(?y);

Then [[(P FILTER R)]]G = {µ ∈ [[P]]G | µ ╞ R}, that is, [[(P FILTER R)]]G is the set of mappings in [[P]]G that satisfy R.

Since SPARQL basically follows a sub-graph-matching approach, a SPARQL query does not take the predefined semantics of the RDFS vocabulary into account. However, many SPARQL query engines provide some form of RDFS-aware inference. In this paper, we assume that such inference takes place while evaluating SPARQL queries. Thus, we define the semantics of SPARQL over RDFS, where we take into account not only the explicit RDF triples of a graph G, but also those triples that can be inferred from G w.r.t. the semantics of RDFS. The most direct way of defining such semantics is by considering the closure of the original graph. Following (Arenas, Gutierrez, & Pérez, 2008), we define the closure of an RDF graph G, denoted by cl(G), as the graph obtained from G by a successive application of the rules from Table 1 until the graph does not change. Given a SPARQL graph pattern P, the RDFS evaluation of P over G, denoted by [[P]]G^rdfs, is defined as the set of mappings [[P]]cl(G), that is, as the evaluation of P over the closure of G. Given two queries Q1 and Q2 of the same arity, we say that Q1 is contained in Q2 if the set of answers of Q1 is a subset of the set of answers of Q2: [[Q1]]cl(G) ⊆ [[Q2]]cl(G).
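The definitions above (triple-pattern evaluation, the join ⨝ of compatible mappings, and FILTER selection) are compact enough to be executed directly. The following is a minimal, self-contained Python sketch of this semantics, evaluated on a toy graph resembling Example 1; graph and names are illustrative, not the paper's implementation:

```python
# Minimal evaluation of SPARQL-style graph patterns over a set of triples,
# mirroring the mapping semantics defined above (toy data, URIs abbreviated).

def eval_triple(pattern, graph):
    """[[t]]_G: all mappings mu with dom(mu) = var(t) and mu(t) in G."""
    results = []
    for triple in graph:
        mu, ok = {}, True
        for p_term, g_term in zip(pattern, triple):
            if isinstance(p_term, str) and p_term.startswith("?"):
                if p_term in mu and mu[p_term] != g_term:
                    ok = False
                    break
                mu[p_term] = g_term
            elif p_term != g_term:
                ok = False
                break
        if ok:
            results.append(mu)
    return results

def join(omega1, omega2):
    """Omega1 join Omega2: unions of compatible mappings."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2
            if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())]

def eval_and(patterns, graph):
    omega = [{}]  # the empty mapping is compatible with any mapping
    for t in patterns:
        omega = join(omega, eval_triple(t, graph))
    return omega

def eval_filter(omega, condition):
    """(P FILTER R): keep the mappings that satisfy R."""
    return [mu for mu in omega if condition(mu)]

# The body of Example 1 over a toy graph:
G = {("t1", "type", "Train"), ("t1", "hasPassengerCar", "c1"),
     ("c1", "hasNumberOfSeats", 60),
     ("t2", "type", "Train"), ("t2", "hasPassengerCar", "c2"),
     ("c2", "hasNumberOfSeats", 90)}
body = [("?x", "type", "Train"),
        ("?x", "hasPassengerCar", "?y"),
        ("?y", "hasNumberOfSeats", "?z")]
answers = eval_filter(eval_and(body, G), lambda mu: 50 <= mu["?z"] <= 80)
print([mu["?x"] for mu in answers])  # only t1 has between 50 and 80 seats
```

Here eval_and folds the join over the BGP starting from the empty mapping µ∅, which works precisely because µ∅ is compatible with every mapping.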

4 Note that w.r.t. (Perez, Arenas, & Gutierrez, 2009) we omit here the union of and the difference between Ω1 and Ω2, as well as the left outer-join, since we do not consider the UNION and OPT operators in the refinement operator of our algorithm.

Page 9: Pattern based feature construction in semantic data mining ... · meta-learning called semantic meta-mining (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011) which is an ontology-based,

1. Subproperty: (a) (A, sp, B), (B, sp, C) ⟹ (A, sp, C); (b) (A, sp, B), (X, A, Y) ⟹ (X, B, Y)
2. Subclass: (a) (A, sc, B), (B, sc, C) ⟹ (A, sc, C); (b) (A, sc, B), (X, type, A) ⟹ (X, type, B)
3. Typing: (a) (A, dom, B), (X, A, Y) ⟹ (X, type, B); (b) (A, range, B), (X, A, Y) ⟹ (Y, type, B)
Table 1. The system of rules equivalent to the model-theoretic semantics of ρdf.

Problem of pattern-based classification

In the following we formulate the machine learning problem addressed in the paper. Let D = {(xi, yi)}i=1..n be a training dataset composed of instances (xi, yi), where xi is an object and yi is its class label. An object xi is identified by a URI reference. Furthermore, D = E+ ∪ E-, where E+ denotes the instances with class label yi = true (positive examples), and E- denotes the instances with class label yi = false (negative examples). Let P be the set of all possible patterns in the dataset. We will represent the data in the feature space of such patterns. We assume that the feature values of the mined patterns are binary, that is, a given data instance either satisfies a pattern or not, or, in other words, a pattern covers an instance or not. This is denoted by cov(Q, xi), where it is said that Q covers xi if cov(Q, xi) = 1, and cov(Q, xi) = 0 otherwise. Thus, our setting is binary pattern-based classification, in which the feature space is {0, 1}m, where m denotes the number of features.

In order to define the form of patterns, we first introduce the linkedness property.

Definition 1 (Linkedness). A variable ?x is linked in a query Q iff ?x occurs in the head of Q, or there is a triple pattern t in the body of Q that contains the variable ?x and a variable ?y (different from ?x) such that ?y is linked.

Definition 2 (Pattern). Given a ρdf dataset G, a pattern Q is a SPARQL query over G of the following form: (i) it is a SELECT query; (ii) in the head of the query there is one variable, denoted ?key; (iii) built-in conditions are constructed using elements of the set L ∪ V and the inequality symbols (≤, ≥, <, >); (iv) it possesses the linkedness property, that is, each variable in the body of the query is linked to the variable ?key through a path of triple patterns in the query body.
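Since patterns are evaluated over the closure cl(G), it may help to see the rules of Table 1 run to a fixpoint. A possible sketch in Python, using the ρdf keywords as plain strings over a toy graph (all names are illustrative, not the paper's data):

```python
# Fixpoint computation of cl(G) for the rho-df rules of Table 1.
# "sp", "sc", "type", "dom", "range" stand for the RDFS vocabulary terms.

def closure(graph):
    g = set(graph)
    while True:
        new = set()
        for (a, p, b) in g:
            for (c, q, d) in g:
                # rules 1(a) and 2(a): transitivity of sp and sc
                if p == q and p in ("sp", "sc") and b == c:
                    new.add((a, p, d))
            if p == "sp":     # 1(b): (A,sp,B), (X,A,Y) => (X,B,Y)
                new |= {(x, b, y) for (x, q, y) in g if q == a}
            if p == "sc":     # 2(b): (A,sc,B), (X,type,A) => (X,type,B)
                new |= {(x, "type", b) for (x, q, y) in g
                        if q == "type" and y == a}
            if p == "dom":    # 3(a): (A,dom,B), (X,A,Y) => (X,type,B)
                new |= {(x, "type", b) for (x, q, y) in g if q == a}
            if p == "range":  # 3(b): (A,range,B), (X,A,Y) => (Y,type,B)
                new |= {(y, "type", b) for (x, q, y) in g if q == a}
        if new <= g:          # nothing derived: fixpoint reached
            return g
        g |= new

G = {("hasPassengerCar", "dom", "Train"),
     ("ICE1", "hasPassengerCar", "car7"),
     ("Train", "sc", "Vehicle")}
print(("ICE1", "type", "Vehicle") in closure(G))  # True, via rules 3(a), 2(b)
```

The loop terminates because each iteration can only add triples built from the finite universe of G.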

Page 10: Pattern based feature construction in semantic data mining ... · meta-learning called semantic meta-mining (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011) which is an ontology-based,

Now, we may formulate the task of pattern-based classification.

Definition 3 (Task of pattern-based classification). Given a dataset D = {(xi, yi)}i=1..n, the task of pattern-based classification is to find a good feature set of discriminative patterns F = {Q1, Q2, …, Qm} ⊆ P so that D is mapped into the {0, 1}m space to subsequently build a classification model. The training dataset in the {0, 1}m space for building a classification model is denoted by D’ = {(x’i, yi)}i=1..n, where x’i = (x’i1, …, x’im) and x’ij = cov(Qj, xi).
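The mapping of Definition 3 can be sketched as follows; the coverage function, the patterns, and the toy instances below are hypothetical stand-ins for SPARQL query evaluation over a ρdf dataset.

```python
# A minimal sketch of the mapping from Definition 3: given a feature set
# F = {Q1, ..., Qm} and a coverage function cov(Q, x) in {0, 1}, map each
# training instance into the binary feature space {0, 1}^m.

def propositionalise(dataset, patterns, cov):
    """dataset: list of (x, y) pairs; returns list of (x', y) with x' in {0,1}^m."""
    return [([cov(q, x) for q in patterns], y) for x, y in dataset]

# Toy example: 'patterns' are predicates over integer-coded objects.
patterns = [lambda x: int(x % 2 == 0), lambda x: int(x > 2)]
cov = lambda q, x: q(x)
D = [(1, False), (2, False), (4, True)]
D_prime = propositionalise(D, patterns, cov)
# e.g. instance 4 is mapped to the feature vector [1, 1]
```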

ALGORITHM

In this section, we describe our method for mining sets of patterns. The proposed method has the following features: (i) it executes a pattern mining algorithm iteratively (level-wise); and (ii) the selection of patterns is guided by a measure that takes the task of computing a classification model into account. According to the terminology of (Bringmann, Nijssen, & Zimmermann, 2009), it can be classified among model-dependent iterative mining approaches. The constructed set of features (pattern set) is passed to the model induction algorithm. This is illustrated in Figure 1.

Figure 1. General overview of the proposed pattern-based classification method.

The space of patterns is searched systematically from the most general patterns to the most specific ones. The generality relation structuring this space is defined below.

Generality relation

The following definition formalizes the notion of a pattern Q covering an example.

Definition 4. Given a query Q containing a SPARQL graph pattern P and an RDF graph G, Q is said to cover an example (xi, yi) if there is a mapping µ in the evaluation of P over cl(G) that maps the variable ?key from head(Q) to the URI reference identifier of xi.

Checking the cover relation corresponds to RDFS evaluation of the query Q over G (query answering). In this case, checking whether one pattern is more general than another boils down to



checking query containment. Query containment as a generality relation works similarly to generalized subsumption. For the case of Datalog, (De Raedt & Ramon, 2004) pointed out that if the search for patterns is restricted to only one pattern for each equivalence class of clauses, acting as the canonical form, then it closely corresponds to the application of generalized subsumption. Moreover, semantic closures (s-closed patterns) are unique representatives of their equivalence classes, and thus one may search for such representatives instead of searching for all patterns satisfying a given threshold. Unfortunately, in the case of our representation language, it is not easy to enforce the s-closed constraint, as it is not antimonotonic. Furthermore, testing query containment/generalized subsumption is expensive. For reasons of efficiency of subsumption tests, the most popular generality order for relational data has been θ-subsumption (Plotkin, 1970). θ-subsumption, however, does not capture the semantics of background knowledge. Neglecting the presence of background knowledge may lead to inefficiency due to semantic redundancies in discovered patterns (Józefowska, Ławrynowicz, & Łukaszewski, 2008). Taking the above considerations into account, we propose a new generality relation that bridges the gap between the purely syntactic generality relation (θ-subsumption) and the relatively heavy semantic generality relation (query containment/generalized subsumption). We call the proposed relation taxonomical subsumption (or t-subsumption); it generalizes θ-subsumption semantically. Furthermore, we propose a new form of closed patterns, called taxonomically closed patterns. With respect to our language of representation, a taxonomically closed query is a query to which it is not possible, without affecting the semantics, to add more triples of the form (?x type c), where a triple of that form with variable ?x already exists in the query, or of the form (?x p ?y), where a triple of that form with variables ?x and ?y already exists in the query. This corresponds to the possibility of applying Rule 1b and/or Rule 2b from Table 1. More formally, for our representation language ρdf we define a taxonomically closed pattern as follows.

Definition 5 (Taxonomically closed pattern). A pattern Q is taxonomically closed, or t-closed, w.r.t. the background knowledge G if for each triple of the form (?x type c) in Q, Q also contains the transitive closure of (?x type c) w.r.t. G, and for each triple of the form (?x p ?y) that appears in the pattern Q, Q also contains the transitive closure of (?x p ?y) w.r.t. G.
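Under the assumption that the transitive closures mentioned in Definition 5 amount to adding all superclasses (along sc) and all superproperties (along sp), a t-closure computation can be sketched as follows; the hierarchies below are hypothetical.

```python
# A sketch of computing the t-closure of a pattern body (Definition 5): for
# each typing triple we add the superclasses of its class, and for each
# property triple between two variables we add the superproperties of its
# property. 'sc' and 'sp' map each class/property to its direct parents.

def ancestors(node, parents):
    """Transitive closure of a 'direct parents' relation."""
    out, stack = set(), list(parents.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(parents.get(n, ()))
    return out

def t_closure(body, sc, sp):
    closed = list(body)
    for s, p, o in body:
        if p == "type":
            closed += [(s, "type", c) for c in ancestors(o, sc)]
        elif o.startswith("?"):
            closed += [(s, q, o) for q in ancestors(p, sp)]
    # drop duplicates, keep a deterministic order
    return sorted(set(closed))

sc = {"ElectricLocomotive": ["Locomotive"]}
sp = {"hasCar": ["hasPart"]}
body = [("?key", "hasCar", "?y"), ("?y", "type", "ElectricLocomotive")]
closure = t_closure(body, sc, sp)
```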

Definition 6 (Taxonomical subsumption). Given two patterns Q1 and Q2 over a ρdf dataset G, and their t-closures Qt1 and Qt2 respectively, Q1 taxonomically subsumes (t-subsumes) Q2 if and only if there exists a mapping σ such that the set of triple patterns and FILTER expressions from σ(body(Qt1)) is a subset of the set of triple patterns and FILTER expressions from body(Qt2).

From Definition 6 it follows that taxonomical subsumption boils down to θ-subsumption between t-closed patterns.
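Since t-subsumption reduces to θ-subsumption between t-closed patterns, it can be tested by searching for a variable mapping. The brute-force sketch below is ours, is exponential in the number of variables (so only suitable for small patterns), and omits FILTER expressions for brevity.

```python
# A brute-force sketch of the subsumption test from Definition 6: search for
# a mapping sigma such that sigma(body(Q1^t)) is a subset of body(Q2^t).

from itertools import product

def variables(body):
    return sorted({t for tp in body for t in tp if t.startswith("?")})

def theta_subsumes(body1, body2):
    vs1, vs2 = variables(body1), variables(body2)
    # candidate images: variables and constants occurring in body2
    terms2 = vs2 + sorted({t for tp in body2 for t in tp if not t.startswith("?")})
    for image in product(terms2, repeat=len(vs1)):
        sigma = dict(zip(vs1, image))
        mapped = {tuple(sigma.get(t, t) for t in tp) for tp in body1}
        if mapped <= set(body2):
            return True
    return False

general = [("?key", "hasPart", "?y")]
specific = [("?key", "hasPart", "?y"), ("?y", "type", "Car")]
```

Here the more general pattern subsumes the more specific one (via the identity mapping) but not vice versa, matching the intuition that a subsuming pattern covers at least the same examples.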


Definition 7 (Generality relation). Given two patterns Q1 and Q2, defined as SPARQL queries over a ρdf dataset G, we say that Q1 is at least as general as Q2 under taxonomical subsumption w.r.t. G, written Q1 ≽G Q2, iff pattern Q1 taxonomically subsumes (t-subsumes) Q2 w.r.t. G.

Due to the soundness of θ-subsumption, t-subsumption is also sound. Hence, whenever Q1 ≽G Q2, Q1 covers every example that is covered by Q2. The defined generality relation is a reflexive and transitive binary relation, and so it induces a quasi-order on the space of patterns. It has been shown (Nienhuys-Cheng & Wolf, 1997) that any quasi-ordered space may be searched using refinement operators.

Refinement operator

A downward refinement operator ρ is a function that computes a set of specializations of a pattern. Various refinement operators have been proposed in the literature on logical and relational learning (Nienhuys-Cheng & Wolf, 1997) (De Raedt, 2008). Among them, two classes are especially interesting: so-called ideal and optimal operators. The latter are used in complete search algorithms, while the former are suitable for heuristic search algorithms (De Raedt, 2008). In our approach, we assume heuristic search, thus an ideal refinement operator would be more suitable. A refinement operator is called ideal when it is locally finite, that is, it computes only a finite set of specializations of each pattern; complete, that is, every specialization is reachable by a finite number of applications of the operator; and proper, that is, it computes only strictly more specific (non-equivalent) specializations of a pattern. Unfortunately, even for relatively simple languages with θ-subsumption as the generality measure, ideal refinement operators do not exist (Nienhuys-Cheng & Wolf, 1997). They can be approximated by dropping the requirement of properness, or by restricting the language. For practical reasons, keeping the properness requirement is often not desirable, e.g. from the point of view of our application scenarios. For instance, the two SPARQL graph patterns (?x, hasPart, ?y) and (?x, hasPart, ?y1) AND (?x, hasPart, ?y2) yield queries equivalent under θ-subsumption, while it would be interesting to keep the second query for further refinement. Thus, we are interested in a locally finite and complete operator that is not proper. In order to avoid generating several syntactic variants of the same query (queries that are equivalent w.r.t. the generality measure), it is helpful to use a trie data structure, as in (Nijssen & Kok, 2001) and (Józefowska, Ławrynowicz, & Łukaszewski, 2010), to represent the results of the refinement steps. This allows the refinement operator to generate queries only in a canonical form. Every node in the trie is labeled with a triple pattern or FILTER expression. Every path from the root to a node corresponds to a query, and in consequence every node has an associated query. An example of a trie is shown in Figure 2. Symbols on the edges correspond to the refinement rules from Definition 8.


Figure 2. An example of a trie.

In order to define the refinement rules, we assume from now on that all triple patterns and FILTER expressions in the body of a query Q are ordered. By (Q, t) we denote the query Q to whose body the triple pattern or FILTER expression t is concatenated. By last(Q) we denote the last expression of the body of Q (either a triple pattern or a FILTER expression). The variables in t = last(Q) that do not occur in any earlier expression of Q are called the new variables of t in Q. By declarative bias β we denote the tuple (ΗC, ΗP, eqfillers, rangefillers, isabstract, useonlyfillers, maxcounter, basepatterns). This tuple is fully user-defined (w.r.t. the constraints given below) and serves to limit the search space to a manageable size. By ΗC we denote a hierarchy of classes that the refinement operator uses to build specializations of queries, and by ΗP a hierarchy of properties used for the same purpose. We assume that ΗC is consistent with the order induced by the sc relation, that ΗP is consistent with the order induced by the sp relation, and that all the classes belonging to ΗC and all the properties belonging to ΗP also belong to the dataset G. Formally, ΗC and ΗP are directed forests of rooted trees (disjoint unions of rooted trees), ΗC = {TC1, …, TCn} and ΗP = {TP1, …, TPm}. Furthermore, by root(Tj) we denote the root of a tree Tj, and by children(n) we denote the set of all subtrees rooted in the direct descendants of a node n. The tree TCi (TPi) is defined recursively: root(TCi) ∈ ΔC (root(TPi) ∈ ΔP), and every tree T from children(root(TCi)) (children(root(TPi))) fulfills this definition with (root(TCi), sc, root(T)) ((root(TPi), sp, root(T))).
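One possible, purely illustrative in-code rendering of the declarative bias tuple β is given below; all concrete values are hypothetical, and the hierarchies are encoded as forests mapping a root to its child subtrees.

```python
# An illustrative rendering of the declarative bias tuple
# beta = (HC, HP, eqfillers, rangefillers, isabstract, useonlyfillers,
#         maxcounter, basepatterns) described in the text.

from typing import NamedTuple, Callable

class Bias(NamedTuple):
    HC: dict            # class hierarchy as a forest of rooted trees
    HP: dict            # property hierarchy as a forest of rooted trees
    eqfillers: dict     # property -> finite set of constant fillers
    rangefillers: dict  # property -> set of (n, m, d) interval triples
    isabstract: Callable[[str], bool]
    useonlyfillers: Callable[[str], bool]
    maxcounter: dict    # property -> max number of copies of one triple
    basepatterns: list  # (body, new-variable set) pairs

beta = Bias(
    HC={"Car": {"PassengerCar": {}, "GoodsCar": {}}},
    HP={"hasPart": {"hasCar": {}}},
    eqfillers={},
    rangefillers={"hasNumberOfSeats": {(0.0, 100.0, 1.0)}},
    isabstract=lambda p: p != "hasNumberOfSeats",   # datatype property
    useonlyfillers=lambda p: False,
    maxcounter={"hasPart": 1, "hasCar": 1, "hasNumberOfSeats": 1},
    basepatterns=[([], {"?key"})],
)
```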




Figure 3. An illustration of a HC (a) and a HP (b) hierarchy.

eqfillers: ΔP → 2^(L∪U) is a function mapping each property to a finite set of values, which are to be put in the object place of a triple instead of a variable. rangefillers: ΔP → 2^(ℝ×ℝ×ℝ) is a function mapping each property p to a finite set of triples (n, m, d), which are used in FILTER expressions constraining a variable that occurs as the object in a triple with property p. For every p ∈ ΔP, isabstract(p) = 1 iff the object variable in a triple with predicate p can be used as a subject in another triple, and 0 otherwise. If OWL vocabulary is used, then for isabstract(p) = 1 there should be a triple (p, type, owl:ObjectProperty) in the RDF graph against which the query is evaluated, and for isabstract(p) = 0 there should be a triple (p, type, owl:DatatypeProperty). useonlyfillers(p) = 1 iff it is desired not to build triples with property p as a predicate and a new object variable. Although it does not break the refinement operator, the situation where useonlyfillers(p) = 1 and eqfillers(p) ∪ rangefillers(p) = ∅ should not occur, as those two conditions are contradictory. Also, maxcounter: ΔP → ℕ is a function denoting how many times a triple with p as a predicate can be copied. This value should be kept small, as it can easily lead to a combinatorial explosion; values other than 1 should be used only if there is an explicit rationale for them. basepatterns is a set of tuples (body(Q), V), where body(Q) is a valid SPARQL query body and V is a set of variables that are to be considered as new in the first step of the refinement operator.

The downward refinement operator ρ is a function that returns a set of refined queries Q’ ∈ ρ(Q). ρ is defined by the following refinement rules, which extend the body (RDF graph pattern expression) of Q and manipulate a trie data structure.

Definition 8 (Downward refinement operator ρ). Let G be a ρdf dataset. Let T be a trie data structure that imposes an order on the triple patterns and FILTER expressions in a query. Let Q be a query, let B be last(Q), that is, the last expression in the body of Q, and let Bp be the parent of B in T. Let β be a declarative bias (ΗC, ΗP, eqfillers, rangefillers, isabstract, useonlyfillers,



maxcounter, basepatterns). Triple patterns or FILTER expressions are added to the trie T as:

1. syntactically dependent expressions (sharing a variable ?x with last(Q) that was new in last(Q) or, if T is empty, ?x ∈ V, where V is the set of variables from a base pattern). The expressions are added by the following rules:
(a) add (?x, type, ci), where ci ∈ ΔC, ci = root(TCj), and TCj ∈ ΗC;
(b) add (?x, pi, ?y), where pi ∈ ΔP, pi = root(TPj), TPj ∈ ΗP, isabstract(pi) is true, useonlyfillers(pi) is false, and ?y is new;
(c) add (?x, pi, a), where pi ∈ ΔP and a ∈ eqfillers(pi);
(d) add (?x, pi, ?y) FILTER(?y >= n && ?y <= m), where pi ∈ ΔP and (n, m, d) ∈ rangefillers(pi);
2. semantically dependent expressions (refinements that exploit the hierarchies of classes and properties). They are added by the following rules:
(a) if last(Q) = (?x, type, ci), add (?x, type, ck), where ci = root(TCi) and ck = root(T’) for some T’ ∈ children(TCi);
(b) if last(Q) = (?x, pi, ?y), add (?x, pk, ?y), where pi = root(TPi) and pk = root(T’) for some T’ ∈ children(TPi);
3. refined FILTER expressions, which equally divide the interval in the FILTER expression of last(Q) and constitute two new FILTER-based refinements corresponding to this division:
(a) if last(Q) = (?x, pi, ?y) FILTER(?y >= n’ && ?y <= m’) and (n, m, d) ∈ rangefillers(pi) and (m’ − n’)/2 > d, add (?x, pi, ?y) FILTER(?y >= n’ && ?y < (n’ + m’)/2) (and, respectively, add (?x, pi, ?y) FILTER(?y >= (n’ + m’)/2 && ?y <= m’));
4. right brothers of a given node in T (copies of the expressions that have the same parent Bp as the given expression B and are placed to the right of B in Bp’s child list); new variables are renamed so that they are also new in the copy;
5. a copy of last(Q), added if last(Q) is a triple pattern, where counter(copy(last(Q))) = counter(last(Q)) + 1 and counter(last(Q)) < maxcounter(predicate(last(Q))), and where any new variable in last(Q) is given a new name in copy(last(Q)). If not stated otherwise, counter(X) = 1 for every X being a label of a trie node.

The first rule introduces the dependent expressions that could not be added earlier. The second rule introduces semantically dependent expressions in a way that keeps the pattern taxonomically closed (w.r.t. G and the hierarchies from β). By discretization, the third rule also induces a taxonomy (hierarchy) of intervals used in FILTER expressions, and thus conforms to the strategy of generating taxonomically closed patterns. The expressions added by rules 1-3 are brothers of each other in the trie. The fourth rule, the right-brother copying mechanism, ensures that all possible subsets, but only one permutation, out of the set of expressions added by the other rules are considered. The fifth rule must be handled with care, since it may easily lead to an exponential explosion; that is why it needs an additional constraint (e.g. on the maximum number of allowed copies), which we introduce through the declarative bias.
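The interval-splitting refinement of rule 3 can be sketched as the function below; the stopping condition (half-width must exceed the minimal width d from rangefillers) is our reading of the rule.

```python
# A sketch of rule 3 of the refinement operator: a FILTER over [lo, hi]
# is refined into two FILTERs over the halves of the interval, but only
# while the resulting half-width stays above the minimal width d.

def split_filter(lo, hi, d):
    """Return the two half-interval refinements of FILTER(?y>=lo && ?y<=hi),
    or [] if the interval is already too narrow to split."""
    if (hi - lo) / 2 <= d:
        return []
    mid = (lo + hi) / 2
    return [(lo, mid), (mid, hi)]

# Repeated splitting starting from [0, 100] with minimal width d = 10
# induces a hierarchy (taxonomy) of nested intervals:
level1 = split_filter(0, 100, 10)    # halves of [0, 100]
level2 = split_filter(0, 50.0, 10)   # halves of the lower half
too_narrow = split_filter(0, 12.5, 10)  # half-width 6.25 <= 10: no refinement
```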


In the following we prove the properties of ρ: (local) finiteness and completeness.

Theorem 1 (Finiteness of ρ). For every query Q, the set ρ(Q) is finite.

Proof. The number of variables in a query is finite, and there are finitely many rooted trees in ΗC and in ΗP, thus rules 1a and 1b may be applied only finitely many times. The function eqfillers (rule 1c) maps each property to a finite set of values, and the function rangefillers (rule 1d) maps each property to a finite set of triples. The number of classes in HC and the number of properties in HP is finite, thus rule 2 of ρ may be applied only a finite number of times. The number of times a refined FILTER expression may be added (rule 3) is bounded via the minimal interval width d. Rule 4 generates permutations of a finite number of elements. The number of times any triple with p ∈ ΔP as a predicate can be copied (rule 5) is bounded by the function maxcounter. Thus the number of possible new triple patterns and FILTER expressions that may be added to a query by all of the rules of ρ is finite, which completes the proof.

Theorem 2 (Completeness of ρ). Every pattern in the language of taxonomically closed patterns restricted by the declarative bias β is reachable in the trie through a finite number of applications of ρ.

Proof. Given is a trie T, recursively generated by ρ, a declarative bias β that further restricts/specifies the language of taxonomically closed patterns, and a pattern Q which occurs in T. Queries are refined by adding an expression, taking β into account. Then, if there exists an expression B ∉ Q which is a valid refinement of Q, a valid pattern Q' = (Q1, B, Q2) must be reachable in the trie T through a finite number of applications of ρ, for some subdivision of Q into Q1 and Q2 such that Q = (Q1, Q2). As B is a valid refinement of Q, there is a prefix (Qp, Bp) of Q such that B is a dependent expression of Bp (added by rule 1 or rule 2 from Definition 8), a refined FILTER expression (added by rule 3), or a copy of Bp (added by rule 5). If Bp is the last expression of Q, then B, as a dependent expression, refined FILTER expression, or copy of Bp, is generated as a refinement of Q to be added at the end of the query. Hence, query Q' is generated. Let us now assume that Bp is not the last expression and has a different successor Bp+1 in query Q, where Bp+1 is a child of Bp in T. Then let us consider the order of B and Bp+1 in the list of children of Bp in trie T, which is one of the following:

• B occurs before Bp+1; then Bp+1 is a right-hand brother of B. The right-brother copying mechanism (rule 4 from Definition 8) will copy Bp+1 as a child of B; the same operations that created Q will then create query Q' in subsequent steps.

• B occurs after Bp+1; then B is copied as a child of Bp+1. To determine the exact injection place of B, we apply the above arguments recursively to Bp+1 and B.

It follows from the above arguments that query Q' is always generated. This completes the proof.


Searching pattern space

An algorithm for mining discriminative patterns is shown in Alg. 1. The algorithm performs beam search: it starts from a base query Qbase, such that (body(Qbase), Vbase) ∈ basepatterns, and repeatedly generates candidate patterns Q ∈ Q using the refinement operator ρ and tests their quality. The algorithm follows the optimal rule discovery framework (Li, 2006), and thus applies two quality measures. The first measure is used for directly pruning uninteresting patterns, and the second is used for additional pruning of the search space by stopping further refinement of selected patterns. Therefore, there are two quality thresholds that the finally selected patterns meet. Let us denote the first threshold by θ1 and the second by θ2. To apply optimal-rule-discovery-based pruning, we need to select an optimonotone (Le Bras, Lenca, & Lallich, 2012) quality measure for θ2. In Alg. 1, θ1 stands for the lift measure, and θ2 stands for coverage (also known as local support). At most the top-k patterns (sorted by decreasing quality according to the lift measure) are used to generate subsequent candidates at each iteration. The number of iterations is bounded by a user-defined parameter MAXLEVEL. The algorithm employs the declarative bias β. As a result, it computes a set of patterns Ftop.

In the following, we present the complexity analysis of Fr-ONT-Qu. Let us assume a worst-case scenario in which: (i) no patterns are pruned as uninteresting; (ii) the refinement operator always generates x new patterns from a given one. This x should be understood as an upper limit on the magnitude of the set ρ(Q); it is mainly a function of the size of the declarative bias. Under these assumptions, one can conduct the following complexity analysis:

• the body of the top-level loop is repeated MAXLEVEL times;
• the body of the internal while loop is repeated at most size(Fl) ≤ k times, generating at most kx patterns and applying the refinement operator k times;
• the body of the internal foreach loop is repeated for each of the patterns generated in the previous step, i.e. kx times. The complexity of query answering cannot be ignored and is denoted f(G); in practice, it depends on the underlying SPARQL query engine (Horst, 2005);
• selecting the top-k patterns from a set of magnitude kx has O(kx·log(kx)) complexity, due to sorting.

One can observe that each step of the refinement operator consists of iterating over some set and adding triples/patterns. As those sets are already available (appropriate pointers, e.g. to a subtree of classes, are kept with the query), the complexity of generating one pattern is O(1). The overall complexity of the Fr-ONT-Qu algorithm is O(MAXLEVEL·(k + kx·f(G) + kx·log(kx))). One should expect that O(f(G)) > O(log(kx)), and so the worst-case complexity class can be stated as O(MAXLEVEL·k·x·f(G)).

5 Note that we named the algorithm ‘Fr-ONT-Qu’ to maintain continuity with the Fr-ONT algorithm (Ławrynowicz & Potoniec, 2011) for mining frequent description logic concepts; Fr-ONT-Qu, however, is not limited to mining frequent SPARQL queries, but can use an arbitrary pattern quality measure.


Algorithm 1. (Fr-ONT-Qu)
input: D, G, Qbase, β, k, MAXLEVEL
output: Ftop

F1 ← {Qbase}; F’1 ← ∅; l ← 1;
while l < MAXLEVEL do
  Ql+1 ← ∅; Fl+1 ← ∅; F’l+1 ← ∅; i ← 0;
  while i < k and i < size(top-k(Fl)) do
    Ql+1 ← Ql+1 ∪ ρ(Qi), where Qi ∈ top-k(Fl); i ← i + 1;
  end
  foreach Qi ∈ Ql+1 do
    if |{(xj, yj) ∈ E+ : cov(Qi, xj) = 1}| / |E+| < θ2 and |{(xj, yj) ∈ E- : cov(Qi, xj) = 1}| / |E-| < θ2 then
      drop the pattern as uninteresting;
    elseif |{(xj, yj) ∈ E+ : cov(Qi, xj) = 1}| / |E+| < θ2 then
      keep the pattern (it forms a good enough rule for the set E-), but do not refine it: F’l+1 ← F’l+1 ∪ {Qi};
    elseif |{(xj, yj) ∈ E- : cov(Qi, xj) = 1}| / |E-| < θ2 then
      keep the pattern (it forms a good enough rule for the set E+), but do not refine it: F’l+1 ← F’l+1 ∪ {Qi};
    else
      keep the pattern as one that may be refined further, with quality equal to the lift of the better of the two rules the pattern may form, “if cov(Qi, xj) then yj = true” and “if cov(Qi, xj) then yj = false”, that is, lift equal to
      max( |D|·|{(xj, yj) ∈ E+ : cov(Qi, xj) = 1}| / (|E+|·|{(xj, yj) ∈ D : cov(Qi, xj) = 1}|),
           |D|·|{(xj, yj) ∈ E- : cov(Qi, xj) = 1}| / (|E-|·|{(xj, yj) ∈ D : cov(Qi, xj) = 1}|) ):
      Fl+1 ← Fl+1 ∪ {Qi};
  end
  Fl+1 ← top-k(Fl+1); l ← l + 1;
end
Ftop ← F1 ∪ F’1 ∪ ... ∪ FMAXLEVEL ∪ F’MAXLEVEL
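The control flow of the search can be sketched in Python as follows; refine and cov are problem-specific plug-ins, and the toy instantiation at the end (threshold patterns over integers) is only meant to exercise beam search with coverage-based pruning and lift-based ranking, not to mirror the SPARQL machinery.

```python
# A compact sketch of the Fr-ONT-Qu search loop: beam search over patterns,
# pruning by coverage (theta2) and ranking refinable patterns by lift.

def frontqu(D, q_base, refine, cov, k, maxlevel, theta2):
    pos = [x for x, y in D if y]
    neg = [x for x, y in D if not y]

    def coverage(q, ex):
        return sum(cov(q, x) for x in ex) / len(ex) if ex else 0.0

    def lift(q):
        covered = [x for x, _ in D if cov(q, x)]
        if not covered:
            return 0.0
        p_pos = sum(1 for x in covered if x in pos) / len(covered)
        # lift of the better of the two rules "covered -> true/false"
        return max(p_pos / (len(pos) / len(D)), (1 - p_pos) / (len(neg) / len(D)))

    frontier, kept = [q_base], [q_base]
    for _ in range(maxlevel):
        candidates = [q2 for q in frontier for q2 in refine(q)]
        refinable = []
        for q in candidates:
            cp, cn = coverage(q, pos), coverage(q, neg)
            if cp < theta2 and cn < theta2:
                continue                    # uninteresting: drop
            if cp < theta2 or cn < theta2:
                kept.append(q)              # good one-class rule: keep, stop refining
            else:
                refinable.append(q)
        frontier = sorted(refinable, key=lift, reverse=True)[:k]
        kept.extend(frontier)
    return kept

# Toy domain: a pattern is an integer threshold t, covering x iff x >= t.
D = [(1, False), (2, False), (3, True), (4, True)]
F = frontqu(D, 0, refine=lambda t: [t + 1], cov=lambda t, x: x >= t,
            k=1, maxlevel=3, theta2=0.5)
```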


EXPERIMENTAL EVALUATION

In this section, we present the experimental evaluation of our method. We start by describing the tool we have developed to support semantic data mining, which includes the implementation of the Fr-ONT-Qu algorithm. Then we present the experimental results obtained by our algorithm in comparison to state-of-the-art methods on benchmark problems. Finally, we provide a motivating scenario for the real-life experimental study in semantic meta-mining, and describe the experiments within the study and their results.

RMonto: a tool for semantic data mining

Semantic data mining requires tools able to effectively handle ontological background knowledge. To the best of our knowledge, such tools are barely available. There exist software implementations supporting single algorithms, like a Protégé ontology editor6 plugin with an AL-QuIn implementation (Lisi, 2011), or algorithm families, like SDM-Toolkit, a part of the Orange toolkit supporting semantic subgroup discovery (Vavpetic & Lavrac, 2013). Broader and more systematic tool suites, however, are missing. One exception is DL-Learner (Lehmann, 2009), which provides a framework for learning in description logics and OWL 2 and supports different knowledge base formats and reasoner interfaces. However, the currently implemented DL-Learner algorithms mostly address the concept learning task. In order to address the general lack of tools supporting semantic data mining approaches, and the shortcomings of the available implementations, we have developed a tool named RMonto (Potoniec & Ławrynowicz, 2011a) (Potoniec & Ławrynowicz, 2011b), in which the implementation of the Fr-ONT-Qu algorithm is one of the offered services. While developing RMonto, we targeted the requirements posed by so-called third generation data mining systems. It is claimed that a major challenge for these emerging systems is the integration of distributed, and possibly heterogeneous, data and knowledge resources and services, found e.g. on intranets or on the Web. RMonto is a plugin extension to the world-leading open source data mining environment RapidMiner7 (Mierswa, Wurst, Klinkenberg, Scholz, & Euler, 2006). RapidMiner provides data mining and machine learning procedures such as data loading and transformation, data preprocessing and visualization, modelling, evaluation, and deployment. The architecture of RapidMiner enables building data mining processes (workflows) from small blocks (operators) by connecting their inputs and outputs. This architecture enables RMonto’s custom operators to be interconnected with existing RapidMiner operators and provides added value to both. RMonto is implemented in Java, like RapidMiner. The set of operators implemented within RMonto enables working directly on structured, relational data, and as such it enables

6 http://protege.stanford.edu/
7 http://rapid-i.com/


Inductive Logic Programming style applications with Semantic Web data. RMonto also supports the acquisition of data from local files and SPARQL endpoints, so as to consume data from various distributed Semantic Web sources, such as Linked Open Data, as input for data mining experiments. Its custom algorithm implementations may then be combined with the other RapidMiner operators through transformation/extraction from the ontological data to attribute-value data8. RMonto is an easily extendable framework that has been split into several libraries to maintain code reusability and separability. These are: (i) a library, called PutOntoAPI, implementing a common interface between reasoning software (e.g. Pellet9 or OWLIM10) and the (data mining) algorithm implementations; (ii) the actual algorithm implementations, put into separate libraries; and (iii) the extension itself, a bridge between the above layers and RapidMiner. Providing a common architecture for different reasoners has the additional advantage that the actual implementations of those interfaces do not have to be available during software building. Figure 4 presents the operator tree of RMonto, where the nodes corresponding to the operator groups relevant for this paper have been expanded. These are:

• Loading: a group of operators for loading data and ontologies either from local files or from SPARQL endpoints, and for building a knowledge base from the loaded inputs;

• ABox: operators for selecting (via SPARQL) instances to be used in the learning process;
• Pattern Mining: an operator implementing the Fr-ONT-Qu algorithm;
• Data Transformation: operators for performing operations such as propositionalisation, which uses the output of Fr-ONT-Qu and a list of (labeled) examples to build a binary matrix of features;

• Meta mining: support for importing RapidMiner workflows into an ontological format that is based on a background data mining ontology and is suitable for semantic meta-mining.

The operator ‘Fr-ONT-Qu’ implements all the aspects of the described algorithm, including the declarative bias directives. For example, Figure 5 illustrates a dialog for introducing the property hierarchy, and Figure 6 a dialog for introducing base patterns. A screenshot of a workflow containing the ‘Fr-ONT-Qu’ operator is presented further in the paper, in Figure 9. The website http://semantic.cs.put.poznan.pl/RMonto/ contains further information on RMonto. RMonto may also be found at the RapidMiner marketplace11. The current version of RMonto that includes the features described in this paper can be downloaded from http://semantic.cs.put.poznan.pl/fr-ont/.

8 We are aware of the RapidMiner extension rapidminer-semweb (http://code.google.com/p/rapidminer-semweb/). However, to the best of our knowledge, it only provides pre-processing services for extracting an RDF graph from a repository and transforming it into a feature vector, and does not provide any learning algorithms.

9 http://clarkparsia.com/pellet/
10 http://www.ontotext.com/owlim
11 http://marketplace.rapid-i.com/


Figure 4. RapidMiner's operator tree with the RMonto’s operators (under ‘Ontologies' branch).

 

Figure 5. A dialog for entering property hierarchy as part of the Fr-ONT-Qu declarative bias.

 

Figure 6. A dialog for entering base patterns as part of Fr-ONT-Qu declarative bias.


Comparison with state-of-art approaches to classification of semantic data In this section, we describe the results of the experimental evaluation of Fr-ONT-Qu comparing it to the state-of-art approaches to classification of Semantic Web data. In contrast to those approaches, Fr-ONT-Qu is based on pattern mining. Our experimental hypothesis is that our method, that exploits the semantics of ρdf to generate a set of good quality pattern based features, is able to perform better on the same datasets than the state-of-art approaches including the methods that were designed to handle complex OWL constructs. To test it, we performed a set of classification experiments comparing Fr-ONT-Qu with the state-of-art methods, following experimental protocols proposed by their authors. We paid special attention to transparency and repeatability of our experiments. All the experimental data (datasets, workflows) are publicly available what enables to track the provenance of our results. That’s why, to perform experimental comparison with the state-of-art methods, we selected only those problems, where the results were obtained using publicly available datasets or where the software was available so we could perform experiments by ourselves on those datasets. The first dataset we used comes from the Semantic Portal of the institute AIFB. It contains metadata about research groups within this institute, their projects, publications and involved people. Those data are described using SWRC vocabulary12. We used the snapshot from November 2006 - the same as used in the related work. It contains 2547 individuals, of which 177 denote people with known affiliation to one of the four research projects conducted in the institute13. We performed an experiment using the evaluation setting denoted as person2affiliation. It consists of predicting appropriate affiliation for each of those people, that is to perform four class classification task. 
We followed the evaluation setting described in (Bloehdorn & Sure, 2007), (Kiefer, Bernstein, & Locher, 2008), and (Loesch, Bloehdorn, & Rettinger, 2012), and thus we used the leave-one-out method. The results are reported in Table 2. To achieve these results, we used Fr-ONT-Qu with depth 3, beam width 5, lift threshold 0.1 and coverage threshold 0.5. The bias contained all classes and object properties available in the SWRC ontology, except the property whose values were to be predicted. In addition, for the properties worksAtProject, publication and isAbout, sets of fillers were used, populated with all possible values of those properties w.r.t. the considered dataset. To solve the four-class problem, we decided to mine patterns separately for each class (applying a one-versus-all approach) and later join them into a single set of patterns. We used a voting classifier consisting of three k-NN classifiers, with k ∈ {1, 5, 9}. With these settings, Fr-ONT-Qu outperformed the statistical relational classifier SPARQL-ML and the kernel-based classifiers on the F1 measure.
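The voting scheme described above can be sketched in pure Python. This is a minimal illustration, not the RMonto/RapidMiner implementation: examples are represented as boolean pattern-feature vectors, and we assume Hamming distance as the k-NN metric (the paper does not fix the metric), with k ∈ {1, 5, 9} as in our setup.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    # rank training examples by Hamming distance over boolean pattern features
    order = sorted(range(len(train_X)),
                   key=lambda i: sum(a != b for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def voting_knn(train_X, train_y, x, ks=(1, 5, 9)):
    # majority vote of three k-NN classifiers, one per value of k
    votes = Counter(knn_predict(train_X, train_y, x, k) for k in ks)
    return votes.most_common(1)[0][0]
```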

[12] http://ontoware.org/swrc/

[13] In the dataset there is also a single person with an affiliation to a fifth project, but (following the authors of the related papers) we removed it, as it is clearly an error in the data.


 

algorithm         macro F1 measure   error
Fr-ONT-Qu         0.8539             0.1186
Bloehdorn et al.  0.7237             0.0449
SPARQL-ML         0.8368             0.0353
Loesch et al.     0.7488             0.0478

Table 2. The experimental results on the SWRC dataset.

The second employed dataset was OWLS-TC v2.1 [14]. It consists of 578 descriptions of Semantic Web services expressed in the OWL-S language. The descriptions are divided into 7 classes reflecting service categories: weapon, communication, medical, travel, economy, food, education. In Table 3, we present the results of our experiment compared with the best SPARQL-ML results. We used Fr-ONT-Qu with depth 8, beam width 200, lift threshold 0.001 and coverage threshold 0.5. In the bias, the hasInput and hasOutput properties were selected, each with at most three occurrences on the same level, and the parameterType property with a list of fillers consisting of all possible values w.r.t. the considered dataset. We employed the one-versus-all approach and three k-NN classifiers as described above. To maintain credibility of the results, 10-fold cross-validation was used. On the OWLS-TC dataset, Fr-ONT-Qu achieved better results than SPARQL-ML on all measures.

class          SPARQL-ML (RPT with inference)       Fr-ONT-Qu
               FP rate  precision  recall  F1       FP rate  precision  recall  F1
Communication  0.004    0.900      0.600   0.720    0.007    0.857      0.828   0.842
Economy        0.018    0.964      0.889   0.925    0.027    0.953      0.985   0.969
Education      0.090    0.716      0.869   0.786    0.018    0.931      0.800   0.861
Food           0.002    0.960      0.800   0.873    0.000    1.000      0.560   0.718
Medical        0.030    0.688      0.550   0.611    0.017    0.833      0.865   0.849
Travel         0.069    0.744      0.873   0.803    0.051    0.808      0.953   0.874
Weapon         0.002    0.964      0.900   0.931    0.005    0.893      1.000   0.943
average        0.031    0.848      0.783   0.807    0.017    0.897      0.856   0.865

Table 3. The experimental results on the OWLS-TC dataset.

[14] http://projects.semwebcentral.org/projects/owls-tc/

To compare with concept learning algorithms, namely DL-FOIL and DL-Learner, we adapted the methodology described in (Fanizzi, D'Amato, & Esposito, 2008). We used three ontologies represented in OWL: New Testament Names (NTN), the BioPax glycolysis ontology and the Financial ontology. For each of these ontologies, 30 random complex concepts were generated and a standard reasoning service (from Pellet) was used to label all individuals available in a given


ontology w.r.t. the given concept. Each individual was marked as positive if it could be inferred that it belongs to the given concept, and as negative if such a conclusion could not be inferred. To generate concepts, we extended the code of DL-FOIL's authors [15] to make it compatible with our API. In this way, we obtained 30 different labelings for each dataset. Subsequently, for each of these randomly generated concepts, a DL-Learner configuration was prepared using its default parameters and the CELOE variant to solve a binary classification problem. DL-Learner computes neither precision nor recall, thus only the F-measure was taken into account; its values were parsed from the DL-Learner logs and averaged. Following the protocol of DL-FOIL's authors, for each of the labelings we used 10-fold cross-validation and averaged the obtained results. The results are presented in Table 4. In each experiment, we populated the Fr-ONT-Qu bias only with all classes available in the considered KB; the search depth was set to 5, the beam width to 10, and both thresholds to 0.01. A rule induction algorithm similar to RIPPER (Cohen, 1995), provided by the 'Rule induction' operator from RapidMiner, was used as a classifier. With this setting, Fr-ONT-Qu outperformed both concept learning methods, DL-FOIL and CELOE (from DL-Learner), on all datasets and all measures. It is worth noting that we achieved better performance even though the languages used by DL-FOIL and DL-Learner are much more expressive than ours, as they support constructs such as concept negation and union, or cardinality restrictions (DL-Learner).

The experimental results confirm our hypothesis that Fr-ONT-Qu, a classification method based on pattern mining that exploits the semantics of a basic subset of RDFS to generate a set of good-quality features, can perform better on the same datasets even than methods that were specifically designed to handle complex OWL constructs.

It is worth mentioning that the approaches to which we compared Fr-ONT-Qu, those described in (Kiefer, Bernstein, & Locher, 2008) and (Loesch, Bloehdorn, & Rettinger, 2012), were awarded best papers at the Extended Semantic Web Conference in 2008 and 2012, respectively.

Dataset    DL-FOIL                                  Fr-ONT-Qu                                DL-Learner
           precision    recall       F1-measure     precision    recall       F1-measure     F1-measure
BioPax     0.660±0.241  0.765±0.217  0.696±0.210    0.991±0.010  1.000±0.000  0.981±0.090    0.991±0.040
NTN        0.590±0.368  0.649±0.257  0.591±0.300    0.989±0.001  0.997±0.003  0.980±0.005    0.968±0.088
Financial  0.621±0.407  0.648±0.370  0.633±0.391    0.987±0.012  0.989±0.011  0.986±0.012    0.833±0.239

Table 4. The results of the comparison of Fr-ONT-Qu with concept learning algorithms.

[15] http://lacam.di.uniba.it:8000/~nico/research/snippet1.html


Case study: semantic meta-mining

Meta-learning is defined as learning to learn, that is, applying automatic learning techniques to meta-data of past machine learning/data mining experiments in order to improve the performance of the learning process and its results (Jankowski, Duch, & Grabczewski, 2011). Meta-learning research has predominantly focused on the learning (data mining) phase of the knowledge discovery (KDD) process (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), and on the task of learning a mapping from meta-data about the datasets to algorithm performance. However, the performance of a KDD process depends not only on the learning algorithm, but also on its other phases, such as data selection or pre-processing. Therefore, (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011) proposed to extend the meta-learning approach to the full KDD process by taking into account the interactions between different process operations (learning, pre-processing etc.), resulting in an approach named meta-mining. Finally, semantic meta-mining (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011) denotes the meta-mining approach that relies on extensive use of background knowledge about knowledge discovery, represented in a data mining ontology that covers such aspects of data mining as a thorough characterization of algorithms, the assumptions they make, their optimization strategies, and the models and patterns they produce.

Meta-learning is supposed to provide insight into what kinds of algorithms are best suited for a particular data analysis goal. Currently available KDD platforms offer large numbers of algorithm implementations that support various steps of the KDD process, e.g. pre-processing or the actual data mining. For example, RapidMiner offers several hundred algorithm implementations (operators), either implemented by the RapidMiner developers or acquired by implementing wrappers to popular data mining libraries such as Weka [16] and R [17].
The user of such a system has to choose the proper operators, and their combination, to construct a KDD workflow that best addresses his/her goal. Various systems have been proposed to support the user in this process. A recent, comprehensive survey of such systems, called Intelligent Discovery Assistants (IDAs), is presented in (Serban, Vanschoren, Kietz, & Bernstein, 2012). One of the proposed IDA architectures, so-called planning-based data analysis systems (Serban, Vanschoren, Kietz, & Bernstein, 2012), is based on Artificial Intelligence (AI) planning, which is used to construct a set of valid workflow plans (workflow templates), that is, workflow plans fulfilling the user's specific goal, taking the user's dataset into account, and combining operators in such a way that all their pre-conditions and post-conditions are met (Kietz, Serban, Bernstein, & Fischer, 2009). Though all of the workflow plans generated by the planner are valid for the given task, there may be many of them, possibly even billions. Therefore, before the final list of workflows can be presented to the user, an additional step is needed to rank them w.r.t. a given performance measure (e.g., predictive accuracy). The planner-based IDA developed within the EU FP7 e-LICO project [18] exploits the results of meta-mining for this purpose. More precisely, semantic meta-mining methods are used to compute a meta-mined model for ranking the workflows composed by the planner. The general architecture of a planning-based data analysis system, as illustrated in (Serban, Vanschoren, Kietz, & Bernstein, 2012) and extended with a meta-mining module as proposed by (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011), is presented in Figure 7.

[16] http://www.cs.waikato.ac.nz/ml/weka
[17] http://www.r-project.org
[18] http://www.e-lico.eu/


Figure 7. General architecture of a planning-based data analysis system extended with a meta-mining module.

Both planning and ranking require extensive domain knowledge (e.g. about operators' inputs, outputs, preconditions and effects, and about learning algorithms and their characteristics). In the case of the e-LICO project, this knowledge was stored in background ontologies: the Data Mining Work Flow Ontology (DMWF) (Kietz, Serban, Bernstein, & Fischer, 2010) for constructing workflows, and the Data Mining Optimization Ontology (DMOP) (Hilario, Kalousis, Nguyen, & Woznica, 2009) for ranking workflows. The e-LICO project has provided an excellent test-bed for evaluating semantic meta-mining approaches, and the first such approaches have already been proposed and evaluated using the e-LICO infrastructure (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011) (Nguyen, Kalousis, & Hilario, 2011). In our experiments we used the infrastructure mentioned above, including in particular RapidMiner, the AI planner, and the DMOP ontology, as well as our semantic data mining tool RMonto.

Protocol. In order to achieve statistically sound results, we adopted a methodology from (Srinivasan, Muggleton, Sternberg, & King, 1996). Let E be some dataset, such that E = E+ ∪ E−, E+ ∩ E− = ∅. E+ is the set of positive examples and E− the set of negative ones. Experiments were performed conforming to the following protocol:

1. Split E randomly into 10 disjoint, approximately equal-sized folds F_i using stratified sampling, so that in each of those folds |F_i ∩ E+| / |F_i ∩ E−| ≈ |E+| / |E−|.

2. For every i ∈ {1, 2, …, 10}, do the following:
   a. Train = E \ F_i, Test = F_i.
   b. Run the Fr-ONT-Qu algorithm on the set Train, obtaining a set of patterns P_i.


   c. Use the set P_i to perform propositionalisation on the sets Train and Test, obtaining matrices R and S of size, respectively, |Train| × |P_i| and |Test| × |P_i|. R_jk = true iff the example corresponding to the j-th row satisfies the pattern corresponding to the k-th column, and R_jk = false otherwise. The same applies to the matrix S, but with the Test set considered instead of the Train one.

   d. Extend both matrices with additional attributes that are to be used in the learning and testing process.

   e. Remove useless columns from R, i.e. those highly correlated with other columns or having almost all values equal. Note that this step is optional, but it makes the attribute space much smaller without a significant loss in classification quality. The matrix S stays intact, as its size does not affect the speed or feasibility of the learning process.

   f. Learn a theory distinguishing examples from the set Train ∩ E+ from those from the set Train ∩ E−, using any compatible classification learning algorithm (e.g., RIPPER, as in our experimental case study) and the matrix R as input.

   g. Apply that theory to the Test set, using the matrix S as input, obtaining two sets T_i+ and T_i−, such that the first one corresponds to the examples from the Test set which the learned theory recognized as belonging to E+, and the second one corresponds to those recognized as belonging to E−.

3. Build and store the sets T+ = T_1+ ∪ … ∪ T_10+ and T− = T_1− ∪ … ∪ T_10−.
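Step 2c of the protocol can be sketched in a few lines of Python. In the paper each pattern is a SPARQL query evaluated against the knowledge base; here, as a simplification, a pattern is any boolean predicate over an example, and the example representation (dicts) is our own illustrative assumption.

```python
def propositionalise(examples, patterns):
    # Step 2c: build the boolean matrix R with R[j][k] = True iff
    # example j satisfies pattern k.
    return [[pattern(example) for pattern in patterns] for example in examples]
```

For instance, with examples described by attribute dicts and two stand-in patterns, `propositionalise` yields one boolean feature column per pattern, which can then be fed to any propositional learner (step 2f).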

After performing such experiments for the same dataset, but with different Fr-ONT-Qu parameter settings, one can compare the results using McNemar's test of changes. For two different settings, let T_A+ and T_A− be the sets obtained in step 3 of the above procedure for an experiment with the first setting, and let T_B+ and T_B− be the corresponding sets for the second one. From these data, the contingency table presented in Table 5 can be built. The null hypothesis is that the number of examples misclassified by one of the classifiers is the same as the number of examples misclassified by the other one, that is n01 = n10. The test statistic is

(|n01 − n10| − 1)^2 / (n01 + n10)

and has a chi-squared distribution with 1 degree of freedom. We use the correction for continuity to make sure that we do not draw unjustified conclusions.

n00 = |E+ ∩ T_A+ ∩ T_B+| + |E− ∩ T_A− ∩ T_B−|
n01 = |E+ ∩ T_A+ ∩ T_B−| + |E− ∩ T_A− ∩ T_B+|
n10 = |E+ ∩ T_A− ∩ T_B+| + |E− ∩ T_A+ ∩ T_B−|
n11 = |E+ ∩ T_A− ∩ T_B−| + |E− ∩ T_A+ ∩ T_B+|

Table 5. Contingency table for the experimental protocol.

Description of experimental data. To conduct the semantic data mining experiments, we first needed to collect a number of KDD workflows that would constitute a repository of so-called


baseline data mining experiments, on whose meta-data we could subsequently perform the proper experiments. For this purpose we used the e-LICO IDA (Kietz, Serban, Bernstein, & Fischer, 2010) to generate a set of workflows on 11 UCI [19] datasets (listed in Table 6). We obtained 165 workflows, all solving a predictive modelling task (classification). We also computed a set of characteristics of the used datasets with the Data Characteristics Tool (DCT) developed within the EU Metal project [20] (Lindner & Studer, 1999). The list of the characteristics used in the experiments is included in Appendix A. The IDA relies on default values of the learning operators' parameters, such as the number of trees to be generated by the random forest classifier. In order to introduce more variability in the parameter settings, we implemented a generator of workflows that took those generated by the IDA as input and produced a set of additional workflows. The operators, their parameters and the parameter values used by the generator are presented in Table 7. It is important to note that, even though the table is relatively small, the number of workflows is not, because each possible combination of parameters and their values is considered. For example, from a single workflow with the Decision Tree classifier, we generated 12 workflows with different combinations of parameters. In this way, we obtained 1581 RapidMiner workflows. We executed all those workflows and collected the values of their performance measures (such as accuracy). In the next steps, we needed to link the workflows with the background knowledge represented in the DMOP ontology, and to represent the meta-data of the workflows in a form suitable for our semantic data mining approach.
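The combinatorial expansion described above can be sketched with a Cartesian product over a parameter grid. The grid below uses the Decision Tree values from Table 7 (2 × 2 × 3 = 12 combinations, matching the 12 derived workflows mentioned in the text); the dict-based workflow representation is a simplification of the actual RapidMiner workflow files.

```python
from itertools import product

# Parameter grid for the 'Decision Tree' operator (values from Table 7)
DECISION_TREE_GRID = {
    "minimal size for split": [4, 16],
    "minimal leaf size": [2, 8],
    "minimal gain": [0.01, 0.1, 0.8],
}

def generate_workflows(base_workflow, grid):
    # yield one derived workflow per combination of parameter values
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        workflow = dict(base_workflow)
        workflow.update(zip(names, values))
        yield workflow
```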

Dataset name        Nr of attributes  Nr of numeric attributes  Nr of symbolic attributes  Nr of examples  Nr of classes
post-operative      8                 0                         8                          90              4
Golf                5                 2                         2                          14              2
Ionosphere          35                34                        0                          351             2
Labor-Negotiations  17                8                         8                          40              2
tic-tac-toe         9                 0                         9                          958             2
Nursery             9                 1                         7                          11025           5
Ripley_Set          3                 2                         0                          250             2
Sonar               61                60                        0                          208             2
tic-tac-toe         10                0                         9                          958             2
Weighting           7                 6                         0                          500             2
Yeast               9                 8                         0                          1484            10

Table 6. Basic characteristics of the datasets used as IDA input.

[19] http://archive.ics.uci.edu/ml/
[20] http://www.metal-kdd.org/


Operator name   Parameter name          Set of values
Decision Tree   minimal size for split  4, 16
                minimal leaf size       2, 8
                minimal gain            0.01, 0.1, 0.8
k-NN            k                       1, 3, 5, 7, 9
                weighted vote           true, false
Naive Bayes     laplace correction      true, false
SVM             epsilon                 0.001, 0.2
                kernel type             poly, rbf, linear, sigmoid
Rule Induction  criterion               accuracy, information_gain
                sample ratio            0.5, 0.9
                pureness                0.5, 0.9
                minimal prune benefit   0.1, 0.25
Random Forest   number of trees         2, 10
                minimal size for split  4, 16
                minimal leaf size       2, 8
                minimal gain            0.01, 0.1, 0.8
Neural Net      learning rate           0.3, 0.9
                momentum                0.2, 0.6

Table 7. Operators and their parameter settings used by the workflow generator.

DMOP provides an extensive knowledge resource, containing around 700 classes and around 200 properties. The current version of the ontology may be found at the DMO-Foundry [21], a portal designed to support collaborative work on data mining ontologies. In our experiments we used version v5.2 of DMOP, and a subset of DMOP's vocabulary, in particular DMOP's conceptualization of processes and their operations, as well as of the algorithms realized by these operations, which is shortly described below. Processes are ground (executed) workflows. Each process is represented by the DMOP class DM-Experiment. Each DM-Experiment is composed of a set of DM-Operation objects, related to the DM-Experiment by the hasSubprocess property. Both DM-Experiment and DM-Operation are subclasses of DM-Process. The property hasSubprocess is a subproperty of the transitive property hasSubpart. Operations may follow one another, which is represented by the property isFollowedDirectlyBy holding between two operations that are directly connected in the process; it is a subproperty of the transitive property isFollowedBy. DM-Operation is associated with DM-Operator via the property executes, that is, operations execute operators (i.e. algorithm implementations). Each DM-Operator, in turn, implements one or more DM-Algorithms (in the latter case, the runtime parameters determine which one is realized during workflow execution). DMOP contains a deep and highly axiomatized hierarchy of algorithms. Operators have parameters that are set during operator execution, which is reflected in DMOP by the existence of the property

[21] http://www.dmo-foundry.org


hasParameterSetting, relating DM-Operation to the class OpParameterSetting, and by the properties hasValue and setsValueOf, where the latter relates a particular operator parameter setting to the relevant OperatorParameter, whose name may be accessed through the property hasParameterKey. In order to use this conceptualization of experiments (executed workflows), as well as further background knowledge represented in DMOP, concerning e.g. the mapping between operators and the algorithms they implement, and the algorithm hierarchy, we implemented a parser from the RapidMiner workflow format to an RDF file format representing the meta-data of KDD processes. The parser is implemented as the RMonto operator 'Workflow to RDF' (labeled 'ImportWF' in Figure 8), including a wizard for importing workflows from a RapidMiner repository. Though it is pre-configured to import RapidMiner workflows into the DMOP-based format, it is highly parameterized to allow a flexible choice of the knowledge schema used to represent DM experiments. Figure 8 shows a screenshot of the RapidMiner workflow with the 'Workflow to RDF' operator (1), whose aim is to load a repository of RapidMiner workflows, together with their associated performance vectors, into an OWLIM repository via the 'Build knowledge base' operator (2). The dialog appearing in the figure (3) illustrates the possibility of entering a user-defined triple template for each type of component of the RapidMiner workflow format.

Subsequently, the ontology files may be loaded into the same OWLIM repository, also via the 'Build knowledge base' operator and the RMonto operator 'Load file'. The final repository, over which we conducted our experiments, contains over 85 million RDF triples. To prepare training examples for the classification task, we split the set of workflows into two classes, below called good and bad. Let W(D) denote a workflow for a baseline-level dataset D, μacc(W(D)) the mean accuracy of this workflow, and σacc(W(D)) the corresponding standard deviation:

Figure 8. Importing RapidMiner workflows to an RDF format based on the DMOP ontology via the 'Workflow to RDF' operator of RMonto.


good = {W(D) | μacc(W(D)) ≥ max over all W(D) of (μacc(W(D)) − σacc(W(D)))}

There are 306 workflows in the good set and 1275 workflows in the bad set.

Setup. The first step of our experimental protocol was performed by a RapidMiner workflow in which workflow identifiers (their URIs in the knowledge base) with labels were read from a CSV file and split into ten separate folds. Each of those folds was numbered from 1 to 10, and in the end they were collected once more into one dataset. Steps 2a-b were performed using a workflow that, in a loop, first selects the appropriate subset of the dataset with the operator 'Create Train' (1), executes the Fr-ONT-Qu algorithm with the operator 'Fr-ONT-Qu' (2), and subsequently saves the mined patterns in a repository with the operator 'Store patterns' (3). This workflow is illustrated in Figure 9.
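The good/bad split defined above can be computed directly from per-workflow accuracy statistics. This is a minimal sketch under the assumption that each workflow is summarized by a (mean accuracy, standard deviation) pair; the tuple representation is ours, not the paper's data format.

```python
def split_workflows(stats):
    # stats: list of (mean_accuracy, std_accuracy), one pair per workflow.
    # A workflow is 'good' if its mean accuracy reaches the best
    # (mean - std) value over all workflows; otherwise it is 'bad'.
    threshold = max(mu - sigma for mu, sigma in stats)
    good = [s for s in stats if s[0] >= threshold]
    bad = [s for s in stats if s[0] < threshold]
    return good, bad
```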

Figure 9. The RapidMiner workflow that mines a set of patterns using Fr-ONT-Qu. Part (b) illustrates a sub-workflow (the inside) of the 'Loop' dominating operator [22] from part (a).

The experiments on the prepared dataset were conducted with the following declarative bias [23]:

• HC containing all subclasses of dmop:DM-Algorithm and dmop:DM-Operator;

[22] A dominating operator is an operator that can contain sub-workflows.
[23] Further in the text we use the following prefixes:
dmop: http://www.e-lico.eu/ontologies/dmo/DMOP/DMOP.owl
rm: http://www.e-lico.eu/ontologies/dmo/DMOP/RMOperators.owl
ida: http://semantic.cs.put.poznan.pl/IDAoperators
weka: http://www.e-lico.eu/ontologies/dmo/DMOP/WekaOperators.owl



• HP = {dmop:executes, dmop:hasValue, dmop:hasParameterSetting, dmop:setsValueOf, dmop:hasParameterKey, dmop:implements};

• eqfillers(dmop:executes) containing all operators occurring in the collected workflows; eqfillers(dmop:hasValue) = {"gain_ratio", "information_gain", "accuracy", "poly", "rbf", "linear", "sigmoid", "true", "false"}; eqfillers(dmop:hasParameterKey) corresponding to the parameter names in Table 7;

• rangefillers(dmop:hasValue) = {(0, 1, 0.1), (2, 16, 2)};
• isabstract = 1 for dmop:executes, dmop:hasParameterSetting, dmop:setsValueOf and dmop:implements;
• useonlyfillers = 0 for all properties;
• maxcounter = 1 for all properties;
• basepatterns = {(body(Q), {?opex1, ?opex2})}, where body(Q) is

  ?x a dmop:DM-Experiment.
  ?x dmop:hasSubpart ?opex1.
  ?x dmop:hasSubpart ?opex2.
  ?opex1 dmop:isFollowedBy ?opex2.
  filter(?opex1 != ?opex2).

The values of the lift and local support thresholds were determined experimentally. We mined patterns up to level 15, refining only a fixed number of best patterns (w.r.t. the lift measure defined above) at each level. It is vital to note that a set of patterns mined up to some level l makes it possible to perform the later steps of the experiment for every level from 0 up to l. Steps 2c-2g have been modeled as one workflow containing sub-workflows (Figure 10); the most important parts are described below. In step 2d, attributes representing the characteristics of datasets computed by DCT were added to the list of attributes. Figure 11 illustrates the process performing steps 2c, 2e and 2f (the presented workflow represents the training phase, a sub-workflow of the 'X-Validation' operator from Figure 10). Step 2c (w.r.t. the training set only) is realized by the 'Propositionalisation' operator (1). The complex operator 'Remove redundant features' (2) contains a sub-workflow that removes attributes having at least 99% identical values, and removes (in a greedy way) attributes that are correlated with some other attribute with an absolute correlation of at least 0.95. The 'Rule Induction' operator (3) uses the RIPPER algorithm to perform step 2f. By default, this algorithm applies extensive pruning. In our case this is not the desired behavior, as pruning has already been performed on the attribute set during pattern mining and by removing redundant features. Redoing it, especially when one deals with skewed data as in our case, leads to trivial models without much strength. Therefore, we set the parameters of the 'Rule Induction' operator as follows: pureness and minimal prune benefit to 1.0.
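The redundancy-removal step (dropping near-constant columns and greedily pruning highly correlated ones) can be sketched in pure Python. This is an illustration of the 99% / 0.95 thresholds described above, not the 'Remove redundant features' RapidMiner sub-workflow itself; columns are boolean pattern features encoded as 0/1.

```python
def pearson(xs, ys):
    # Pearson correlation of two equal-length numeric sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy)

def prune_columns(R, same_frac=0.99, max_corr=0.95):
    # return indices of columns of R worth keeping
    cols = list(zip(*R))  # column-major view of the matrix
    keep = []
    for j, col in enumerate(cols):
        # drop near-constant columns (>= same_frac identical values)
        if max(col.count(v) for v in set(col)) / len(col) >= same_frac:
            continue
        # greedily drop columns highly correlated with an already kept one
        if any(abs(pearson(cols[i], col)) >= max_corr for i in keep):
            continue
        keep.append(j)
    return keep
```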


Figure 10. The screenshot of the RapidMiner workflow realizing training and testing of the meta-mining model.

Testing the hypothesis, illustrated in Figure 12 by a sub-workflow of the 'X-Validation' operator, consists of performing propositionalisation on the test set with the 'Propositionalisation' operator (1) and applying the hypothesis with 'Apply Model' (2).

Figure 11. Using mined patterns to learn a rule model (training phase of X-validation).

Figure 12. Testing hypothesis (testing phase of X-validation).



Results of experiments. Below we present the results of the experimental evaluation of Fr-ONT-Qu in the meta-mining scenario. In the experiments, we used OWLIM SE (v5.3.5849) as the underlying reasoning engine and semantic store, with the owl2-rl-reduced-optimized ruleset. The choice of this ruleset was motivated by the expressivity of our background knowledge base, e.g. the existence of object property chains. During each cycle of cross-validation, Fr-ONT-Qu discovered around 2000 patterns, and redundant patterns were subsequently pruned. We discuss some of the discovered patterns below (for compactness, denoting by Bd the body of the base pattern used in the experiments). The first example pattern:

Q1 = select distinct ?x where { Bd ∪
  ?opex2 dmop:executes ?front0 .
  ?opex2 dmop:executes rm:RM-Decision_Tree .
  ?opex2 dmop:hasParameterSetting ?front1 .
  ?front0 dmop:executes rm:DM-Operator .
  ?front0 dmop:implements ?front2 .
  ?front2 a dmop:DM-Algorithm .
  ?front2 a dmop:InductionAlgorithm .
  ?front2 a dmop:ModelingAlgorithm .
  ?front2 a dmop:ClassificationModelingAlgorithm .
  ?front2 a dmop:ClassificationTreeInductionAlgorithm .
}

was mined when Fr-ONT-Qu traversed down the algorithm class hierarchy, specializing the variable ?front2. In this way, it is possible to abstract from the level of operators (algorithm implementations) to the level of algorithms and their taxonomy. For instance, both the rm:RM-Decision_Tree and weka:Weka-J48 operators implement a classification tree induction algorithm, and one may generalize over it. Patterns containing class hierarchies provide expressivity similar to that of patterns mined in so-called generalized association rule mining.
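The hierarchy-based generalization can be illustrated with a simple subclass-chain check. The toy taxonomy below is assumed from the class atoms listed in Q1 (each class mapped to a single superclass); the real DMOP hierarchy is richer and a proper RDFS reasoner would be used instead.

```python
# toy fragment of the algorithm taxonomy (subclass -> superclass),
# reconstructed from the class list in Q1; an illustrative assumption
SUPER = {
    "ClassificationTreeInductionAlgorithm": "ClassificationModelingAlgorithm",
    "ClassificationModelingAlgorithm": "ModelingAlgorithm",
    "ModelingAlgorithm": "InductionAlgorithm",
    "InductionAlgorithm": "DM-Algorithm",
}

def instance_of(asserted_class, query_class, hierarchy=SUPER):
    # RDFS-style check: the asserted class or any of its superclasses matches
    c = asserted_class
    while c is not None:
        if c == query_class:
            return True
        c = hierarchy.get(c)
    return False
```

With such a check, an operator asserted only as a classification tree inducer also satisfies a pattern atom that asks for the more general modeling algorithm class, which is exactly what allows generalizing over rm:RM-Decision_Tree and weka:Weka-J48.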

The following pattern covers only those workflows that contain the 'Decision Tree' operator with the parameter minimal size for split set to a value between 2 and 5.5:

Q2 = select distinct ?x where { Bd ∪
  ?opex2 dmop:executes ?front0 .
  ?opex2 dmop:executes rm:RM-Decision_Tree .
  ?opex2 dmop:hasParameterSetting ?front1 .
  ?front0 dmop:executes rm:DM-Operator .
  ?front1 dmop:setsValueOf ?front2 .
  ?front1 dmop:hasValue ?front3 .
  filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 16.000000) .
  ?front2 dmop:hasParameterKey 'minimal_size_for_split' .
  ?front1 dmop:hasValue ?front3 .
  filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 9.000000) .
  ?front1 dmop:hasValue ?front3 .
  filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 5.500000) .
}
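The successive filters in Q2 (2-16, then 2-9, then 2-5.5) are consistent with a bisection-style refinement of numeric ranges; each step splits the current interval at its midpoint and the search keeps one half. Whether Fr-ONT-Qu uses exactly this halving rule is our inference from the endpoints in Q2, so the sketch below is illustrative only.

```python
def refine_interval(lo, hi):
    # split a numeric filter interval [lo, hi] at its midpoint, producing
    # the two candidate sub-intervals a refinement step may choose from
    mid = (lo + hi) / 2
    return (lo, mid), (mid, hi)
```

Starting from the rangefillers entry (2, 16, ...), taking the lower half twice reproduces the endpoints seen in Q2: (2, 16) → (2, 9) → (2, 5.5).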


The following pattern represents a situation where a concrete value of a property is more interesting than information about the used algorithm. In this case it is the gain ratio property value, which may be used by at least two operators, 'Decision Tree' and 'Random Forest':

Q3 = select distinct ?x where { Bd ∪
  ?opex1 dmop:hasParameterSetting ?front0 .
  ?front0 dmop:hasValue 'gain_ratio' .
}

In general, our pattern language enables flexible reference to the workflow topology. For instance, by means of the transitive property dmop:isFollowedBy, one can represent which operators are used in a workflow and how they are placed one after another, possibly indirectly, e.g., even when one of these operators is nested deep in the sub-workflows of some dominating operator, or is a dominating operator itself. Similarly, via the transitive property dmop:hasSubpart and an object property chain linking operations at multiple levels of the process depth, discovered patterns may express that some operations indirectly follow each other. Such a possibility is helpful in process mining, where it may be rare for two processes to have a very similar number of operations and topology.
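What the transitive dmop:isFollowedBy property buys is essentially reachability over direct-successor edges (in SPARQL 1.1 terms, the property path dmop:isFollowedBy+). A minimal sketch, with an invented workflow and edge relation:

```python
from collections import deque

# Hypothetical direct-successor edges of a workflow; the decision-tree
# operator is nested inside the cross-validation operator's sub-workflow.
followed_by = {
    "ReadData": ["Normalize"],
    "Normalize": ["X-Validation"],
    "X-Validation": ["RM-Decision_Tree"],
    "RM-Decision_Tree": ["ApplyModel"],
}

def is_followed_by(a, b, edges):
    """True iff b is reachable from a, i.e. (a, isFollowedBy+, b) holds."""
    seen, queue = {a}, deque([a])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

A pattern can thus require that ApplyModel eventually follows ReadData without committing to the exact number or nesting of intermediate operators.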

We compared models trained on feature sets containing the discovered patterns (‘workflow patterns’) with a baseline corresponding to models that do not use mined patterns. The baseline models use only dataset characteristics and learning operator names, as in the classic meta-learning setup. More precisely, a baseline model was trained on a feature set composed of the dataset characteristics produced by the DCT tool from the Metal project and a feature representing the learning algorithm implementation (RapidMiner operator). Table 8 presents the accuracies and standard deviations of classifiers built on pattern sets generated up to the specified level; by baseline we denote a model that does not use mined patterns at all. Table 9 presents the results of McNemar’s test for pairs of classifiers. The table is symmetric and the diagonal contains 0, so only the upper triangle is shown. The null hypothesis is that a classifier built using the pattern set mined up to the level specified in a column has the same error rate as the classifier built using the pattern set mined up to the level specified in a row; the cells contain the p-values of the test. Taking into account the increase in accuracy at each level of the search, one may conclude that the classifiers trained using workflow patterns perform significantly better than the one trained only on dataset characteristics and the learning operator name. There is no statistically significant difference in performance between the models trained on pattern sets mined at level 5 and above (e.g., there is no statistical difference in performance between levels 5 and 15).

Level                baseline   1        2        3        4        5        15
Accuracy ± std dev   0.890      0.899    0.908    0.913    0.923    0.927    0.926
                     ±0.016     ±0.024   ±0.019   ±0.019   ±0.021   ±0.021   ±0.017

Table 8. Accuracies and standard deviations of pattern-based models.
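The comparison above relies on turning each mined pattern into one binary feature column: a workflow gets a 1 for a pattern if and only if the pattern's SPARQL query returns it. A minimal sketch of this propositionalisation step, with invented match sets:

```python
# Workflows (the examples) and, for each pattern, the set of workflows its
# SPARQL query returned; all values are made up for illustration.
workflows = ["wf1", "wf2", "wf3"]
pattern_matches = {
    "Q1": {"wf1", "wf3"},
    "Q2": {"wf1"},
    "Q3": {"wf2", "wf3"},
}

# One binary column per pattern: a standard propositional learner can now
# consume the resulting table directly.
columns = sorted(pattern_matches)
table = [[int(w in pattern_matches[q]) for q in columns] for w in workflows]
print(columns, table)  # ['Q1', 'Q2', 'Q3'] [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
```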

Classifier 1 \ Classifier 2      1         2         3         4         5         15
baseline                      0.3502    0.0263    0.0033    4.7e-5    7.52e-6   1.55e-5
1                                       0.1255    0.0350    0.0014    0.0002    0.0003
2                                                 0.3911    0.0327    0.0063    0.0122
3                                                           0.1106    0.0339    0.0574
4                                                                     0.5201    0.6442
5                                                                               0.9121

Table 9. The results of experimental evaluation against the data-characteristics-based baseline. An extended description of the experimental evaluation may be found in (Ławrynowicz & Potoniec, 2013). In order to ensure transparency and repeatability of our research, and to make explicit the provenance of our results, we have published the supplementary material to this paper concerning all the experimental data (datasets and workflows) at the myExperiment portal24 at http://www.myexperiment.org/packs/421.html, as well as at the dedicated Fr-ONT-Qu website (http://semantic.cs.put.poznan.pl/fr-ont/).
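For readers unfamiliar with the test behind Table 9, an exact (binomial) form of McNemar's test needs only the two disagreement counts between a pair of classifiers. The sketch below is a generic implementation of that textbook formula, not the paper's tooling, and the counts in the test are invented:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b: examples classifier 1 got right and classifier 2 got wrong;
    c: the reverse. Under H0 the b + c disagreements split 50/50."""
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail up to the smaller count, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(2 * tail, 1.0)
```

For example, with 1 vs 9 disagreements the p-value is about 0.021, so the null hypothesis of equal error rates would be rejected at the 5% level.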

CONCLUSION

In this paper we have proposed a new method for pattern-based classification. The method introduces a new algorithm, named Fr-ONT-Qu, for mining patterns represented as SPARQL queries over RDFS. The patterns are subsequently used as features to learn a classification model. We have implemented the proposed method within a tool we have developed, named RMonto, an ontological extension to RapidMiner that supports semantic data mining approaches. We have experimentally compared our approach to state-of-the-art classification methods for semantic data, including methods that were awarded at a major conference in the field, and achieved better results in this comparative study. Using the proposed method we have also conducted an experimental study within the emerging subfield of meta-learning called semantic meta-mining, an ontology-based, process-oriented form of meta-learning that aims to learn over full knowledge discovery processes rather than over individual algorithms. The study provided a proof of concept for the method.

The primary motivation for our work is the real-world need for data mining approaches that can mine Semantic Web data. This need is already reflected in recently emerging approaches for mining semantic data. The importance of such approaches has been recognized in

24 http://www.myexperiment.org

the field, resulting in best paper awards at a major Semantic Web conference. In this paper, we have illustrated how mining semantic data for classification may be performed through the construction and use of semantic patterns. However, since our approach decouples the data representation from the learning task, our method is not restricted to a single machine learning task. Its output may be used as input to classical, propositional data mining algorithms, thus extending their scope of applicability to semantic data.

REFERENCES

Angles, R., & Gutierrez, C. (2008). The Expressive Power of SPARQL. Proc. of the 7th International Conference on the Semantic Web, ISWC'2008 (pp. 114-129). Karlsruhe: Springer-Verlag.

Arenas, M., Gutierrez, C., & Pérez, J. (2008). An Extension of SPARQL for RDFS. SWDB-ODBIS 2007 (pp. 1-20). Berlin Heidelberg: Springer-Verlag.

Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., & Patel-Schneider, P. F. (2003). The description logic handbook: theory, implementation, and applications. New York, NY, USA: Cambridge University Press.

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American , 284 (5), 34-43.

Bloehdorn, S., & Sure, Y. (2007). Kernel methods for mining instance data in ontologies. Proceedings of the 6th International the Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference. LNCS 4825, (pp. 58-71). Busan, Korea: Springer-Verlag.

Brickley, D., & Guha, R. (2004). RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation 10 February 2004, http://www.w3.org/TR/rdf-schema/.

Bringmann, B., Nijssen, S., & Zimmermann, A. (2009). Pattern-based Classification: A Unifying Perspective. Proceedings of 'From Local Patterns to Global Models': Second ECML PKDD Workshop (LeGo), (pp. 36-50). Bled, Slovenia.

Buntine, W. (1988, September). Generalized subsumption and its applications to induction and redundancy. Artif. Intell., 36 (2), 149-176.

Cheng, H., Yan, X., Han, J., & Hsu, C.-W. (2007). Discriminative Frequent Pattern Analysis for Effective Classification. Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007 (pp. 716-725). IEEE.

Cohen, W. (1995). Fast Effective Rule Induction. In Proc. of the Twelfth International Conference on Machine Learning (pp. 115-123). Tahoe City, USA: Morgan Kaufmann.

De Raedt, L. (2008). Logical and Relational Learning. Berlin Heidelberg: Springer.

De Raedt, L., & Ramon, J. (2004). Condensed representations for inductive logic programming. Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference (KR2004) (pp. 438-446). AAAI Press.

Dehaspe, L., & Toivonen, H. (1999, March). Discovery of Frequent DATALOG Patterns. Data Min. Knowl. Discov., 3 (1), 7-36.

Donini, F. M., Lenzerini, M., Nardi, D., & Schaerf, A. (1998). AL-log: Integrating Datalog and Description Logics. J. Intell. Inf. Syst., 10 (3), 227-252.

Fanizzi, N., D'Amato, C., & Esposito, F. (2008). DL-FOIL Concept Learning in Description Logics. Proceedings of the 18th International Conference on Inductive Logic Programming. LNCS 5194, (pp. 107-121). Prague: Springer-Verlag.

Fanizzi, N., d'Amato, C., & Esposito, F. (2010). Machine Learning Methods for Ontology Mining. In P. C.-Y. Sheu, H. Yu, C. V. Ramamoorthy, A. K. Joshi, & L. A. Zadeh (Ed.), Semantic Computing (pp. 131-153). Wiley/IEEE.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases. AI Magazine , 17, 37-54.

Glimm, B., Hogan, A., Kroetzsch, M., & Polleres, A. (2012). OWL: Yet to arrive on the Web of Data? WWW2012 Workshop on Linked Data on the Web. Lyon, France: CEUR-WS.org.

Hilario, M., Kalousis, A., Nguyen, P., & Woznica, A. (2009). A Data Mining Ontology for Algorithm Selection and Meta-Learning. Proc of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09), (pp. 76-87).

Hilario, M., Lavrac, N., Podpecan, V., & Kok, J. (2010). Proceedings of the 3rd International Workshop on Third-Generation Data Mining: Towards Service-Oriented Knowledge Discovery (SoKD'10), held in conjunction with ECML/PKDD-2010. Barcelona.

Hilario, M., Nguyen, P., Do, H., Woznica, A., & Kalousis, A. (2011). Ontology-Based Meta-Mining of Knowledge Discovery Workflows. In N. Jankowski, W. Duch, & K. Grabczewski (Ed.), Meta-Learning in Computational Intelligence (pp. 273-316). Springer.

ter Horst, H. J. (2005, October). Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. J. Web Sem., 3 (2-3), 79-115.

Jankowski, N., Duch, W., & Grabczewski, K. (Ed.). (2011). Meta-Learning in Computational Intelligence. Springer.

Józefowska, J., Ławrynowicz, A., & Lukaszewski, T. (2008). On Reducing Redundancy in Mining Relational Association Rules from the Semantic Web. Web Reasoning and Rule Systems, Second International Conference, RR 2008. LNCS, (pp. 205-213). Karlsruhe: Springer.

Józefowska, J., Ławrynowicz, A., & Lukaszewski, T. (2010). The role of semantics in mining frequent patterns from knowledge bases in description logics with rules. Theory Pract. Log. Program., 10 (3), 251-289.

Kiefer, C., Bernstein, A., & Locher, A. (2008). Adding data mining support to SPARQL via statistical relational learning methods. Proceedings of the 5th European Semantic Web Conference on the Semantic Web: research and applications. LNCS 5021, (pp. 478-492). Tenerife, Canary Islands, Spain: Springer-Verlag.

Kietz, J.-U., Serban, F., Bernstein, A., & Fischer, S. (2010). Data Mining Workflow Templates for Intelligent Discovery Assistance and Auto-Experimentation. Proc of the ECML/PKDD10 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-10).

Kietz, J.-U., Serban, F., Bernstein, A., & Fischer, S. (2009). Towards cooperative planning of data mining workflows. Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09).

Kralj-Novak, P., Lavrac, N., & Webb, G. I. (2009, June). Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining. J. Mach. Learn. Res., 10, 377-403.

Kralj-Novak, P., Vavpetic, A., Trajkovski, I., & Lavrac, N. (2009). Towards semantic data mining with g-SEGS. Proc. of the 11th International Multiconference Information Society 2009.

Ławrynowicz, A., & Potoniec, J. (2011). Fr-ONT: an algorithm for frequent concept mining with formal ontologies. Proceedings of the 19th international conference on Foundations of intelligent systems (pp. 428-437). Springer-Verlag.

Ławrynowicz, A., & Potoniec, J. (2013). Pattern based feature construction in semantic data mining. Institute of Computing Science, Poznan University of Technology, Technical report RA-2/2013.

Le Bras, Y., Lenca, P., & Lallich, S. (2012, November). Optimonotone measures for optimal rule discovery. Computational Intelligence, 28 (4), 475-504.

Lehmann, J. (2009, December). DL-Learner: Learning Concepts in Description Logics. J. Mach. Learn. Res., 10, 2639-2642.

Li, J. (2006, April). On optimal rule discovery. IEEE Transactions on Knowledge and Data Engineering, 18 (4), 460–471.

Lindner, G., & Studer, R. (1999). AST: Support for Algorithm Selection with a CBR Approach. Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD'99 (pp. 418-423). Springer.

Lisi, F. A. (2011). AL-QuIn: An Onto-Relational Learning System for Semantic Web Mining. International Journal on Semantic Web and Information Systems (IJSWIS), 7 (3), 1-22.

Lisi, F. A., & Esposito, F. (2008). Foundations of Onto-Relational Learning. Inductive Logic Programming, 18th International Conference, ILP 2008, Proceedings (pp. 158-175). Springer.

Lisi, F. A., & Malerba, D. (2004, May). Inducing Multi-Level Association Rules from Multiple Relations. Mach. Learn., 55 (2), 175-210.

Loesch, U., Bloehdorn, S., & Rettinger, A. (2012). Graph kernels for RDF data. Proceedings of the 9th International Conference on the Semantic Web: research and applications. LNCS 7295, (pp. 134-148). Heraklion, Crete, Greece: Springer-Verlag.

Manola, F., & Miller, E. (2004). RDF Primer. W3C Recommendation 10 February 2004, http://www.w3.org/TR/rdf-primer/.

McGuinness, D., & van Harmelen, F. (2004). OWL Web Ontology Language Overview. W3C Recommendation 10 February 2004, http://www.w3.org/TR/owl-features/.

Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., & Euler, T. (2006). YALE: rapid prototyping for complex data mining tasks. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 935-940). New York: ACM.

Motik, B., Sattler, U., & Studer, R. (2005, July). Query Answering for OWL-DL with rules. Journal of Web Semantics, 3 (1), 41-60.

Munoz, S., Perez, J., & Gutierrez, C. (2007). Minimal Deductive Systems for RDF. In Proc. of the 4th European Semantic Web Conference (pp. 53-67). Berlin, Heidelberg: Springer-Verlag.

Nguyen P., Kalousis, A., & Hilario, M. (2011). A meta-mining infrastructure to support KD workflow optimization. Proc of the ECML/PKDD-11 Workshop on Planning to Learn and Service-Oriented Knowledge Discovery (PlanSoKD-2011).

Nienhuys-Cheng, S.-H., & Wolf, R. d. (1997). Foundations of Inductive Logic Programming. Secaucus, NJ, USA: Springer-Verlag.

Nijssen, S., & Kok, J. (2001). Faster association rules for multiple relations. Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2 (pp. 891-896). San Francisco, CA, USA: Morgan Kaufmann.

Perez, J., Arenas, M., & Gutierrez, C. (2009, September). Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34 (3), 16:1-16:45.

Piatetsky-Shapiro, G. (1997). Data Mining and Knowledge Discovery: The Third Generation (Extended Abstract). Foundations of Intelligent Systems, 10th International Symposium, ISMIS '97 (pp. 48-49). Springer.

Plotkin, G. D. (1970). A Note on Inductive Generalization. Machine Intelligence, 5, 153-163.

Potoniec, J., & Ławrynowicz, A. (2011a). RMonto - towards KDD workflows for ontology-based data mining. Planning to Learn and Service-Oriented Knowledge Discovery, Workshop at ECML/PKDD 2011.

Potoniec, J., & Ławrynowicz, A. (2011b). RMonto: Ontological extension to RapidMiner. Poster and Demo Session of the ISWC 2011 - 10th International Semantic Web Conference.

Prud'hommeaux, E., & Seaborne, A. (2008). SPARQL Query Language for RDF. W3C Recommendation 15 January 2008, http://www.w3.org/TR/rdf-sparql-query/.

Serban, F., Vanschoren, J., Kietz, J.-U., & Bernstein, A. (2012). A survey of intelligent assistants for data analysis. ACM Computing Surveys.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel Methods for Pattern Analysis. New York, NY, USA: Cambridge University Press.

Srinivasan, A., Muggleton, S. H., Sternberg, M. J., & King, R. D. (1996, August). Theories for Mutagenicity: A Study in First-Order and Feature-Based Induction. Artificial Intelligence, 85 (1-2), 277-299.

Vavpetic, A., & Lavrac, N. (2013). Semantic Subgroup Discovery Systems and Workflows in the SDM-Toolkit. Comput. J., 56 (3), 304-320.

Acknowledgements: This work was partially supported by the European Union within the FP7 ICT project e-LICO (Grant No 231519). Agnieszka Ławrynowicz acknowledges the support of the Foundation for Polish Science under the PARENT/BRIDGE programme, co-financed by the European Union from the Regional Development Fund. We thank all our colleagues who contributed to the development of the e-LICO infrastructure, tools, and knowledge resources used in the experiments, especially: Simon Fischer, Melanie Hilario, Alexandros Kalousis, Jörg-Uwe Kietz, Phong Nguyen, Raul Palma, Floarea Serban. We thank Veli Bicer for sharing the AIFB dataset.

APPENDIX A: DATA SET CHARACTERISTICS

Following is the list of dataset characteristics computed by the DCT tool (Lindner & Studer, 1999). Only characteristics describing the dataset as a whole were used, because characteristics of single attributes or pairs of attributes are incomparable between different datasets. In the end, the following set of characteristics was used (for some datasets, some of these characteristics were incomputable, and thus missing):

• Value of BHEP test and its critical values for alphas: 0.01, 0.05, 0.1
• Highest eigenvalue
• Number of attributes with outliers
• Number of missing values in the dataset
• Mean absolute skew
• Number of attributes
• Number of numeric attributes
• Bartlett’s test: value, critical value, degrees of freedom
• Probability of an example having missing value
• Class entropy
• Number of classes
• Minimal, average and maximum number of symbolic values in symbolic attribute
• Probability of the largest class
• Relative importance of the largest eigenvalue (as an indication of the importance of the first discriminant function)
• Number of symbolic attributes
• Number of default class
• Number of examples
• Wilks Lambda
• Mean kurtosis of numeric attributes
• Number of examples with missing values
• Canonical correlation (an indicator for the degree of correlation between the most significant discriminant function and the class distribution)
• Number of eigenvalues
• Probability of value being missing
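Two of the listed characteristics, class entropy and the probability of the largest class, are simple enough to compute directly. The sketch below is a minimal stdlib illustration, not the DCT tool itself, and the example labels are invented:

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def largest_class_probability(labels):
    """Relative frequency of the majority class (the default-accuracy baseline)."""
    return max(Counter(labels).values()) / len(labels)

labels = ["pos", "pos", "pos", "neg"]
print(round(class_entropy(labels), 4), largest_class_probability(labels))
# 0.8113 0.75
```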