
XXX International Workshop XXX

The 3rd Workshop on Data Mining Standards,

Services, and Platforms

DM-SSP 2005

Workshop Chairs:

Kurt Thearling, Capital One
Dave Selinger, Overstock.com
Robert Grossman, Open Data Partners & University of Illinois at Chicago
Rick Pechter, MicroStrategy
Stefan Raspl, IBM

August 21, 2005

Chicago, Illinois, USA


Proceedings of the Third Annual Workshop on Data Mining Standards, Services and Platforms

KDD 2005

August 21, 2005

Chicago, IL

Edited by

Robert Grossman

University of Illinois at Chicago

and Open Data Partners

and

Kurt Thearling

Capital One


Table of Contents

Schedule ..................................................................... Page 3

Affiliations .................................................................. Page 4

Preface ....................................................................... Page 5

PMML Models for Detecting Changes, by Robert Grossman ......................... Page 6

Explanation of PMML Models, by Christoph Lingenfelder and Stefan Raspl ........ Page 16

DMX Query and XML for Analysis Standard, by ZhaoHui Tang and Jamie MacLennan .. Page 23


Schedule

9:00 - 10:00    Panel
                Panel Discussion: Future Directions for Data Mining Standards, Services, and Platforms

10:00 - 10:30   Coffee Break

10:30 - 12:00   Session 1
                Rick Pechter, MicroStrategy: Integrating Data Mining Models into the Enterprise Business Intelligence Platform
                ZhaoHui Tang and Jamie MacLennan, Microsoft: DMX Query and XML for Analysis Standard
                Tom Khabaza, SPSS: Model Management and Automation using SPSS Predictive Enterprise Services

12:00 - 1:30    Lunch

1:30 - 2:30     Session 2
                Christoph Lingenfelder and Stefan Raspl, IBM: Explanation of PMML Models
                Robert Grossman, University of Illinois at Chicago and Open Data Partners: PMML Models for Detecting Changes

2:30 - 3:00     Session 3
                Svetlana Levitan, SPSS: Recent Changes in PMML
                David Duling and Wayne Thompson, SAS: Maximizing Data Mining Effectiveness Through More Efficient Model Deployment and Management

3:00 - 3:30     Coffee Break

3:30 - 4:30     Session 4: Panel Discussion
                Panel and Audience Discussion: The Evolution of PMML: Version 4.0 and Beyond

4:30 - 5:00     Break

5:00 - 7:15     KDD Opening and Awards


Affiliations

David Duling, SAS
Robert Grossman, University of Illinois at Chicago & Open Data Partners
Tom Khabaza, SPSS
Svetlana Levitan, SPSS
Christoph Lingenfelder, IBM
Jamie MacLennan, Microsoft Corporation
Rick Pechter, MicroStrategy, Inc.
Stefan Raspl, IBM
ZhaoHui Tang, Microsoft Corporation
Wayne Thompson, SAS


Preface

This year marks the fifth year that there has been a KDD workshop on the Predictive Model Markup Language (PMML) and related areas, and the third year of a broader workshop with the theme of Data Mining Standards, Services, and Platforms.

Using PMML and the abstractions provided by PMML Producers and PMML Consumers, it becomes natural to develop statistical and data mining models on one system or application and to deploy them on another. For large enterprises, this can be quite helpful, since there may be several different deployment environments and systems, and these may change over time. Over time, an enterprise may accumulate enough different PMML models that a model repository is useful. Some of the papers in this year's workshop address these and related matters.

With the broad acceptance of web services, web service-based data mining standards are becoming more and more important, and this year's workshop includes papers about this topic as well. The workshop also includes a panel that broadly considers future directions in data mining standards, services, and platforms.

The Editors


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DM-SSP’05, August 21, Chicago, IL, USA. ISBN: 1-59593-225-9 Copyright 2005 ACM $5.00


PMML Models for Detecting Changes

Robert L. Grossman

Abstract

In this note, we describe how PMML models can be used to monitor data streams, build baselines, and detect deviations from the baselines.

1. Introduction and Background

Common statistical and data mining tasks include prediction, classification, summarization, and detecting changes. The Predictive Model Markup Language (PMML) provides good support for all of these tasks except the last. In this note, we introduce a PMML proposal for change detection models that is currently pending before the PMML Working Group [PMML]. Complex applications may require not just one, but perhaps dozens, hundreds, or even thousands of different change detection models [Bugajski:2005]. In this note, we also introduce a simple mechanism in PMML for working with segmented models that can easily support this requirement.

Here is an outline of this article. In Section 2, we briefly describe an event-based model for data mining that is well suited for change detection applications. In Section 3, we introduce a PMML change detection model in the continuous case. In Section 4, we introduce a PMML change detection model in the discrete case. Section 5 introduces our proposal for segmented models. Section 6 is the conclusion.

2. Event-Based Data Mining Process Model

In this section, we give a brief review of an event-based data mining process model following [Grossman:2004]. An event-based data mining process model is based upon three abstractions, which we now describe. We also describe how these abstractions are represented in PMML [PMML].

• Events. Events contain data that are processed in the learning phase to produce models and in the deployment phase to produce scores. Data fields in events are represented in PMML by mining fields.

• Feature Vectors. Feature vectors are the inputs to models. One or more events are transformed and aggregated to produce feature vectors. Data fields in feature vectors are represented in PMML by mining fields, and the PMML transformation dictionary, built-in functions, and derived fields are used to describe how events are shaped into feature vectors.

• Models. Broadly speaking, models are functions that take feature vectors as inputs and produce one or more outputs. Two models may be composed by considering the output of the first model as events to produce new feature vectors, which can be used as the inputs to the second model. PMML models are used for models, and PMML output and target functions are used to describe the outputs of models.

Using these abstractions, we can define learning as the process that produces a model and deployment as the process that consumes a model to produce scores. In practice, it is often advantageous to have different applications produce models and consume models. For example, producing models may be done with one application running on a large cluster, while deploying models may be done with several different applications or embedded applications, each running on a different system.

With prior data mining process models, for example, [Fayyad:1996], there was not a precise distinction between events and features nor between producing and consuming a model.
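As a concrete, purely illustrative reading of these abstractions, the following Python sketch separates a model producer from a model consumer. The toy feature construction and the toy model are assumptions made for illustration; they are not part of PMML or of any system described in this note.

from dataclasses import dataclass
from typing import Dict, Iterable, List

Event = Dict[str, float]          # raw data fields carried by a single event
FeatureVector = Dict[str, float]  # derived fields fed to the model


def to_feature_vector(events: List[Event]) -> FeatureVector:
    """Transform and aggregate one or more events into a feature vector."""
    # Illustrative aggregation: average each field over a window of events.
    keys = events[0].keys()
    return {k: sum(e[k] for e in events) / len(events) for k in keys}


@dataclass
class MeanModel:
    """Toy model: scores a feature vector by its distance from a learned mean."""
    means: Dict[str, float]

    def score(self, fv: FeatureVector) -> float:
        return sum(abs(fv[k] - m) for k, m in self.means.items())


def produce_model(training_windows: List[List[Event]]) -> MeanModel:
    """Learning phase: a model producer consumes events and emits a model."""
    fvs = [to_feature_vector(window) for window in training_windows]
    keys = fvs[0].keys()
    means = {k: sum(fv[k] for fv in fvs) / len(fvs) for k in keys}
    return MeanModel(means)


def consume_model(model: MeanModel, new_windows: Iterable[List[Event]]) -> List[float]:
    """Deployment phase: a model consumer scores feature vectors built from new events."""
    return [model.score(to_feature_vector(window)) for window in new_windows]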

For our targeted application of change detection, an event-based process model is very important and works as follows. First, in a training phase, statistics for the baseline model (null model) and, if desired, the alternate model are computed. Second, in the deployment phase, for each event processed, the corresponding feature vector is updated, the model is scored, and changes or deviations result when the model's threshold is exceeded.

3. Null and Alternate Models – Continuous Case

One standard approach for detecting changes is to decide whether a stream of events belongs to one of two statistical distributions – one representing the null or expected behavior and another representing alternate behavior. Here is a simple example from [Grossman:2005]. Consider a stream of data events from a network of highway traffic sensors measuring traffic congestion and the problem of determining whether there is a change in traffic from what would normally be expected at that time of day. One approach to this problem is to create a series of baselines. For example, separate baselines could be created for different hours, for different days, and for different road segments. This approach might generate:

37,800 baselines = 18 (hours) x 7 (days) x 300 (road segments)

Given this many measured (normal) baselines and their associated distributions, simple models can be used to determine the parameters of alternate baselines. Assume that for each baseline, the statistical distribution of normal and abnormal conditions is known and that there is a stream of events. The challenge is to determine whether the events belong to the normal or the abnormal distribution, which in general overlap. A common score used for this purpose is the Cumulative Sum, or CUSUM, defined as follows [Basseville:1993]. Let f1(x) and f2(x) be the density functions for the two distributions and let g(x) be the log odds ratio:

g(x) = log( f2(x) / f1(x) ).

Given a stream of events with features x1, x2, x3, …, define the CUSUM score by:

Z[0] = 0

Z[n] = max{ 0, Z[n-1] + g(x[n]) }.

It is easy to capture this in PMML. We simply need to provide:

• the types of distributions, for example, Gaussian, Poisson, Binomial, etc.

• the parameters for the two distributions, for example, the mean and standard deviation for a Gaussian distribution, the mean number of occurrences for a Poisson distribution, etc.

• the statistic for the test, for example, simple threshold, CUSUM, etc., and which field will be used for the statistic

• any parameters required for the test

Figure 1 gives a simple example.
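To make the scoring side concrete, here is a minimal Python sketch of CUSUM scoring against null and alternate Gaussian distributions. The parameter values mirror those in Figure 1, but the code itself is only an illustration, not part of the PMML proposal.

import math

def gaussian_log_density(x: float, mean: float, variance: float) -> float:
    """Log density of a Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)

def cusum_scores(xs, null, alternate, threshold=21.0, reset_value=0.0):
    """Yield (score, change_flag) for each value in the stream.

    `null` and `alternate` are (mean, variance) pairs; a change is flagged
    whenever the cumulative log odds in favor of the alternate distribution
    exceed the threshold, after which the score is reset.
    """
    z = 0.0
    for x in xs:
        g = gaussian_log_density(x, *alternate) - gaussian_log_density(x, *null)
        z = max(0.0, z + g)          # Z[n] = max{ 0, Z[n-1] + g(x[n]) }
        changed = z > threshold
        yield z, changed
        if changed:
            z = reset_value

# Example: baseline traffic around 550, alternate (congested) traffic around 460.
stream = [548.0, 553.1, 470.2, 462.8, 455.0, 549.7]
for score, changed in cusum_scores(stream, null=(550.2, 48.2), alternate=(460.4, 39.2)):
    print(round(score, 2), changed)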


4. Null and Alternate Models – Discrete Case

In the discrete case, instead of working with distributions, we can use tables of counts. As a simple example, consider the following example involving payment cards from [Bugajski:2005]. Suppose we are trying to understand whether the presence of a missing value in a payment field impacts whether the payment is approved or not. In this case, we can use a 2x2 table of counts:

                           Payment Approved    Payment Not Approved
Payment Field Populated    n11                 n12
Payment Field Missing      n21                 n22

We can think of the Payment Approved column as the null or expected behavior and the Payment Not Approved column as the alternate or unexpected behavior. Alternately, for some applications, it is more natural to think of the Payment Field Populated row as the null or expected behavior and the Payment Field Missing row as the alternate or unexpected behavior. More generally, instead of discretizing the distribution into two values (corresponding to the k=2 rows), we could work with a table of counts containing k rows. Even more generally, we could work with a k x j table of counts. There are standard statistical tests, such as the Chi-Squared test, Fisher's Exact Test, etc., that can be used to help determine whether the explanatory factor in the rows is statistically related to the response in the columns. We can handle an event-based approach in essentially the same way. As a simple example, consider first a single row:

                                               Payment Approved    Payment Not Approved
Payment Field Populated – Null Distribution    p11                 p12

representing the null distribution, and a row


                                                      Payment Approved    Payment Not Approved
Payment Field Populated – Alternative Distribution    p21                 p22

representing the alternate distribution that is being filled event by event. In the event case, we assume we are given percentages that sum to one in each row. As each new event arrives, we can use the same tests as before to weigh the evidence in favor of the alternate distribution over the null distribution. Exactly the same framework can be used to handle the case of 2 x 2 tables, and, more generally, k x j tables. As in the continuous case, it is easy to capture this in PMML. We simply need to provide for each table:

• the names of the explanatory and response fields

• the counts for each element in the table

• the statistic for the test, for example, Chi-Squared or Fisher's Exact Test, and which field will be used for the statistic

• any parameters required for the test, for example, the p-value

The PMML proposal also supports the degenerate case of a single table. Figure 2 gives an example.
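The test statistic itself is straightforward to compute. The following sketch, which is illustrative only, calculates the Pearson chi-squared statistic for a 2x2 table of counts, using the counts that appear in Figure 2.

def chi_squared_2x2(n11: int, n12: int, n21: int, n22: int) -> float:
    """Pearson chi-squared statistic for a 2x2 table of counts.

    Rows:    payment field populated / missing
    Columns: payment approved / not approved
    """
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    chi2 = 0.0
    for observed, r, c in ((n11, row1, col1), (n12, row1, col2),
                           (n21, row2, col1), (n22, row2, col2)):
        expected = r * c / total
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Counts from the example in Figure 2; compare against a chi-squared(1) critical value.
print(chi_squared_2x2(41, 81, 12, 30))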


5. Segmented Models

It is common in practice to build separate models for different segments of a population. Today, this is generally done in PMML by using separate PMML files. As the number of segmented models grows, the lack of explicit support for segmented modeling can begin to be a problem. Moreover, since PMML is designed to encapsulate all the information required for scoring within a single XML file, it would be useful for many applications to have available a general mechanism in PMML for segmented modeling. Here is one approach for providing explicit support for segmented modeling that has been proposed to the PMML Working Group. It turns out that many common use cases are captured by the following segmentation methods:

• Regular Partitions. With a regular partition, a field name, the left end point, the right end point, and the number of partitions are specified. Regular partitions in two or more dimensions can be defined by specifying the required data for each field independently.

• Explicit Partitions. With an explicit partition, the field name, the left end point, and the right end point are given for each interval in the partition. Note that with explicit partitions, the intervals may be overlapping. Again, multi-dimensional partitions are defined by defining each dimension independently.

• Implicit Partitions. With an implicit partition, a field name is provided and then each value of the field is used to define a distinct partition. For example, assume that city is a field in a data set that is identified as an implicit partition field. In this case, a separate model would be created for each city.

• Spherical Partitions. With a spherical partition, a center, radius, and distance function are provided. Any feature vector whose distance from the center is less than or equal to the radius is included in the partition. Notice that a feature vector may be in more than one partition.

• Bounding Box. With a bounding box, the coordinates of a two-dimensional bounding box are provided. If a feature vector is within the bounding box, then it is included in the segment.

See Figure 3 for a simple example.
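Applying a segment assignment at scoring time is equally simple. The sketch below is an illustration rather than part of the proposal; it assigns a value to a one-dimensional regular partition using the field and endpoints from Figure 3.

def regular_partition_index(value: float, left: float, right: float, n_partitions: int):
    """Return the index (0..n_partitions-1) of the interval containing `value`,
    or None if the value falls outside [left, right]."""
    if not (left <= value <= right):
        return None
    width = (right - left) / n_partitions
    index = int((value - left) / width)
    return min(index, n_partitions - 1)   # the right endpoint falls in the last interval

# Latitude partitioned into 10 regular intervals over [0, 90], as in Figure 3.
print(regular_partition_index(23.7, 0.0, 90.0, 10))   # -> 2, i.e. the interval [18, 27)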


6. Conclusion

In this article, we introduced change detection models for continuous and discrete distributions. We have also introduced a simple mechanism for working with segmented models, which is particularly useful when a large number of segmented models are used. The proposals described above are currently pending with the Predictive Model Markup Language Working Group.

References

[Aanand:2005] Anushka Aanand, John Chaves, Steve Vejcik, Robert L. Grossman, Michal Sabala, and Pei Zhang, Real Time Change Detection and Alerts from Highway Traffic Data, submitted for publication.

[Basseville:1993] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice Hall, 1993.

[Bugajski:2005] Joseph Bugajski, Robert L. Grossman, Eric Sumner and Zhao Tang, An Event Based Framework for Improving Information Quality That Integrates Baseline Models, Causal Models and Formal Reference Models, Second International ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS 2005), ACM, 2005.

[Fayyad:1996] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, Volume 39, Number 11, 1996, pages 27-34.

[Grossman:2004] Robert L. Grossman, An Event-Based Process Model for Data Mining, submitted for publication.

[PMML] Predictive Model Markup Language (PMML), retrieved from www.dmg.org on July 20, 2005.


<PMML>
  ...
  <MiningModel function="regression">
  <MiningSchema> as usual </MiningSchema>
  ... derived fields as usual ...
  <ChangeDetectionModel modelName="geo-cusum" functionName="classification">
    <NullModel type="Gaussian">
      <TestParameters>
        <Parameter name="mean" value="550.2"/>
        <Parameter name="variance" value="48.2"/>
      </TestParameters>
    </NullModel>
    <AlternateModel type="Gaussian">
      <TestParameters>
        <Parameter name="mean" value="460.4"/>
        <Parameter name="variance" value="39.2"/>
      </TestParameters>
    </AlternateModel>
    <TestField name="cusum-score"/>
    <TestType name="CUSUM"/>
    <TestParameters>
      <Parameter name="threshold" value="21.0"/>
      <Parameter name="reset-value" value="0.0"/>
    </TestParameters>
  </ChangeDetectionModel>
</PMML>

Figure 1. This figure contains a simple change detection model in the continuous case.
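A PMML consumer supporting the proposed element could pull the scoring parameters out of such a document with standard XML handling. The sketch below is illustrative only and assumes a well-formed PMML document that uses the element names shown in Figure 1.

import xml.etree.ElementTree as ET

def read_change_detection_model(pmml_text: str) -> dict:
    """Extract null/alternate parameters and test settings from a document
    containing a ChangeDetectionModel element shaped like Figure 1."""
    root = ET.fromstring(pmml_text)
    model = root.find(".//ChangeDetectionModel")

    def params(element) -> dict:
        # Only the TestParameters that are direct children of `element`.
        return {p.get("name"): float(p.get("value"))
                for p in element.findall("TestParameters/Parameter")}

    return {
        "null": params(model.find("NullModel")),
        "alternate": params(model.find("AlternateModel")),
        "test_type": model.find("TestType").get("name"),
        "test_field": model.find("TestField").get("name"),
        "test_parameters": params(model),  # threshold and reset-value
    }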


<PMML>
  ...
  <MiningModel function="regression">
  <MiningSchema> as usual </MiningSchema>
  ... derived fields as usual ...
  <ChangeDetectionModel modelName="geo-cusum" functionName="classification">
    <CountTable>
      <FieldValueCounts fieldName="merchant_name">
        <FieldValueCount value="merchant_name_populated">
          <TargetValueCounts>
            <TargetValueCount value="payment_accepted" count="41"/>
            <TargetValueCount value="payment_declined" count="81"/>
          </TargetValueCounts>
        </FieldValueCount>
        <FieldValueCount value="merchant_name_missing">
          <TargetValueCounts>
            <TargetValueCount value="payment_accepted" count="12"/>
            <TargetValueCount value="payment_declined" count="30"/>
          </TargetValueCounts>
        </FieldValueCount>
      </FieldValueCounts>
    </CountTable>
    <TestField name="cusum-score"/>
    <TestType name="CUSUM"/>
    <TestParameters>
      <Parameter name="threshold" value="21.0"/>
      <Parameter name="reset-value" value="0.0"/>
    </TestParameters>
  </ChangeDetectionModel>
  ...
</PMML>

Figure 2. This figure contains a fragment for a simple change detection model in the discrete case.


<SegmentAssignment>
  <SegmentAssignmentField name="latitude"/>
  <SegmentAssignmentType name="regular-partition"/>
  <SegmentParameters>
    <Parameter name="left-endpoint" value="0.0"/>
    <Parameter name="right-endpoint" value="90.0"/>
    <Parameter name="number-partitions" value="10"/>
  </SegmentParameters>
</SegmentAssignment>

Here is an example of a model that uses this segment assignment:

<ChangeDetectionModel modelName="geo-cusum" functionName="classification">
  <SegmentAssignmentIdentification>
    <SegmentAssignmentGUID name="39583AF0203A"/>
    <SegmentAssignmentField name="latitude"/>
    <SegmentAssignmentType name="regular-partition"/>
    <SegmentParameters>
      <Parameter name="left-endpoint" value="20.0"/>
      <Parameter name="right-endpoint" value="30.0"/>
    </SegmentParameters>
  </SegmentAssignmentIdentification>
  <NullModel type="Gaussian">
    <TestParameters>
      <Parameter name="mean" value="550.2"/>
      <Parameter name="variance" value="48.2"/>
    </TestParameters>
  </NullModel>
  <AlternateModel type="Gaussian">
    <TestParameters>
      <Parameter name="mean" value="460.4"/>
      <Parameter name="variance" value="39.2"/>
    </TestParameters>
  </AlternateModel>
  ...etc...
</ChangeDetectionModel>

Figure 3. This figure contains a fragment for a simple, segment-based change detection model.


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DM-SSP’05, August 21, Chicago, IL, USA. ISBN: 1-59593-225-9 Copyright 2005 ACM $5.00


Explanation of PMML Models

Christoph Lingenfelder

Stefan Raspl

Abstract

The PMML standard has been largely used as a format to exchange data mining models with the purpose of scoring. Model explanation, be it through visualization or text, has not been of interest yet. In this paper, we give an overview of popular methods of model explanation, illustrate how the current PMML standard caters for these, and take a look at possible solutions to overcome the current limitations.

1 Introduction

The Predictive Model Markup Language is defined by the Data Mining Group (DMG), a vendor-led group of providers and users of data mining software. PMML has been designed to ensure interoperability between vendors by defining a standard format to score data mining models. With the publication of PMML 3.0 [1], all major mining model types are covered, and the requirement for easy model exchange has been met.

However, for many applications, understanding a model is at least as important as deploying the model to new data. Unfortunately, information that is necessary for visualizing and explaining data mining models is in most cases not necessary for the purpose of scoring. And even if useful attributes and elements exist in PMML, they are often not required, so model producers do not feel obligated to write them. As a consequence, PMML is currently only of limited use for visualization.

The result of all this is that while model exchange for scoring works quite well, model exchange for model introspection is next to impossible with few exceptions. Since PMML offers an extension mechanism, where model producers can write arbitrary information in a self-defined format, model producers might be tempted to store information for their visualizers in such extensions. Once established, this might become an obstacle for future attempts to integrate explanatory components into PMML. Also, if PMML only works as a format for scoring but not for model explanation, the standard might be weakened considerably.

[1] http://www.dmg.org/pmmlv30.html


Figure 1: Graphical view of a tree model

2 Popular Explanatory Components

Before we take a look at specific algorithms, let us summarize the common explanatory components that exist.

Explanatory Component                                                     Usage
Graph                                                                     descriptive overview
Other graphics (e.g. scatter plots, histograms)                           descriptive overview
Statistics (distributions, correlations, etc.)                            detailed description, context
Model quality data (confusion matrix, gains chart, error rates, etc.)     model evaluation and comparison
Field importance                                                          data collection
Rationale for prediction                                                  explanation
Natural language description                                              verbalization

Although there are different ways of explaining data mining models - from tree diagrams and three-dimensional scatter plots to descriptive natural language - the underlying information is rather generic for the different data mining functions.

Graphs are a widespread component for model explanation, and most model types have an individual graph for display. Tree models, for instance, are usually displayed as sets of connected nodes arranged as a tree (see figure 1). But graphs are not restricted to tree models. Association rule models are much easier to understand if item sets are depicted as vertexes and rules as edges in a graph. Then, the relationships between an interesting item and the other items can be viewed at a glance (see figure 2).


Figure 2: Graphical view of an association rule model

Clustering models are often represented as graphs, too, in order to highlight the relationships between clusters. This is particularly useful for hierarchical clustering models.

Another common technique to visualize clustering models arranges representative records in three-dimensional scatter plots. This lets the beholder immediately see differences in the cluster extensions, densities and distances from one another.

When users think of model introspection, they usually think of visual explanation first. While this is an important part, especially to display or reveal complex relationships in the model, many components are non-visual. Graphical representations of clustering models can only give an overview. For a more detailed understanding, statistical information, such as field distributions of the clusters or a statistical comparison between cluster and overall distributions, are necessary.

And although distributions are in most cases illustrated via histograms and pie charts, at the core they are just numbers, from which images can be rendered. The same holds true for other statistics such as correlations and distribution parameters. All of these statistics mean a lot of added value for model introspection, but have no functional meaning in terms of scoring. Distribution-based clustering models are an exception, as they make use of distributions in order to compute cluster affiliation.
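As an illustration of the kind of non-visual statistics meant here, the following sketch, which is not tied to any particular PMML element, compares the value distribution of one field inside a cluster with its overall distribution.

from collections import Counter

def field_distribution(records, field):
    """Relative frequency of each value of `field` in `records`."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def compare_cluster_to_overall(records, cluster_ids, cluster, field):
    """Return (cluster distribution, overall distribution) for one field."""
    in_cluster = [r for r, c in zip(records, cluster_ids) if c == cluster]
    return field_distribution(in_cluster, field), field_distribution(records, field)

# Example: which professions are over-represented in cluster 1?
records = [{"profession": p}
           for p in ["engineer", "clerk", "engineer", "nurse", "clerk", "clerk"]]
cluster_ids = [1, 0, 1, 0, 1, 0]
print(compare_cluster_to_overall(records, cluster_ids, 1, "profession"))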

Since the information needed to generate the respective graphs varies between model types, it is hard to make generalizations about their usage for scoring. As we will see later on, some model types already contain rich information for model understanding. But for most types, one would need additional information to generate a meaningful explanation.

The field importance is another very helpful component, as it can be used to decide which data to collect. If, for instance, an input to a model is difficult or costly to obtain, the user wants to be certain of its importance for the model before investing in these values. It may be more efficient to do without a field, either feeding missing values into the scoring engine, or even retraining the model with fewer active fields.

Knowledge about the reasons for high or low predicted values or for the assignment of a particular class label, the rationale for prediction, is obviously unnecessary for scoring. But this information is essential whenever decisions based on predictive models must be explained to management or to those affected. There may even be a legal requirement to make decisions revisable.

Another neglected component for model explanation is description in natural language, which can be of great help to users. This can range from giving clusters in a cluster model meaningful names to textual descriptions of each cluster detailing or summarizing its content.

Working with predictive models, besides purely descriptive information, users often want to know how accurate they can expect predictions to be. Examples for such model quality data are confusion matrices, gains charts and error rates. They are an invaluable source for comparing the performance of different models using the same test data. And they can also be used to observe the behaviour of a model when applied to different test data sets. A model may deteriorate over time, which can be seen by decreasing quality figures for subsequent test data sets. A similar phenomenon may occur when comparing how well a model works for data with different characteristics, for instance data from different regions.

All of the above components have one thing in common: They are very helpful if not essential to model introspection, but have no function in the scoring process [2]. Hence PMML model producers are tempted to omit the respective information, if there exists a place to put it at all.

[2] It may occur to the attentive reader that there is something to be said for making use of model quality information in order to generate accuracy information on a per record basis, though.

3 Areas of Interest - by Function

After summarizing popular components for model explanation, let us now take a look at which elements are most suitable for certain data mining models, by function.

It is common to display rule-based models, for instance association or sequence rule models, as a set of rules, which can be arranged visually as a graph or described in natural language. In addition, detailed statistical information is useful. This will create insight into the overall rule model answering questions such as "How many rules are contained in the model?", or "What is the average number of items in a transaction?". And on the level of individual rules, statistical characteristics, for instance the rule support, allow a better understanding.

Classification models have a common set of model quality components. Regardless of the algorithm, gains charts and confusion matrices can always be produced. As discussed in the previous section, field importance should be provided. In addition, distributions of records assigned to individual class labels help to get an overview.

There are only a few common components for further explanation of regression models. As for classification models, gains charts can be provided, and error values add valuable information to determine model quality. Basic field importance information can simply consist of the correlations of input fields with the target. Analogous to classification, qualitative insight into regression models can be gained in an algorithm-independent fashion: one splits the data into different ranges of the predicted value. Distribution information of the resulting subsets may hold valuable explanation for high or low predictions.
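Both ideas are simple to compute outside the model. The sketch below is illustrative only: it derives a basic field importance figure from the correlation of each input with the target and groups records by ranges of the predicted value.

import math

def pearson_correlation(xs, ys):
    """Plain Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)   # sketch: assumes neither field is constant

def field_importance(records, input_fields, target_field):
    """Absolute correlation of each input field with the target."""
    targets = [r[target_field] for r in records]
    return {f: abs(pearson_correlation([r[f] for r in records], targets))
            for f in input_fields}

def split_by_prediction(records, predictions, boundaries):
    """Group records into ranges of the predicted value (boundaries sorted ascending)."""
    edges = list(boundaries) + [float("inf")]
    groups = {edge: [] for edge in edges}
    for record, pred in zip(records, predictions):
        for edge in edges:
            if pred <= edge:
                groups[edge].append(record)
                break
    return groups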

Cluster models fall into two categories, center-based and distribution-based. But from a user perspective, both require the same explanatory components. Probably the most important one for clustering models is the distribution in the clusters. It can be displayed graphically, for instance using pie and bar charts, but also in tabular form, or even as text. Further statistics such as cluster sizes can add to this. Of special interest are measures of cluster homogeneity and separation.

4 Areas of Interest - by Model Type

Common to all mining functions is the fact that some information is algorithm-independent. A lot of deep insight, however, depends heavily upon the algorithm used for training.

In this section we consider the different types of data mining models supported by PMML and analyze to what degree PMML supports popular explanatory components. Supervised models are discussed first, before we consider unsupervised ones.

Tree models can be used for both classification and regression. In either case, gains charts are frequently used. From tree models, gains chart figures can be computed using the confidence values of the leaf nodes. This approach has several disadvantages: First, it puts the burden of computation on the visualization tool, which would also have to be knowledgeable about the specifics of the algorithm. The tool would have to know about surrogate nodes, complex predicates and other specialized methods, just to present quality information. Then, it would be impossible to describe gains chart data for training and validation data in the same model. And finally, the visualization code would become algorithm-dependent. Apart from that, the tree itself is contained in the PMML model already and can be displayed along with its rules in the nodes. When an application wants to explain a particular prediction, it helps to understand the data in the record's node. Each node is, of course, characterized by the restrictions contained in its predicates. But often, the detailed value distributions in the nodes further help to understand the model. Suitable elements to store distributions can be found in the Statistics section of PMML. But there is currently no entry in PMML tree nodes to capture them. Tree nodes only hold a few statistical values designed for scoring and confidence calculation.
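The computation described here, deriving gains-chart figures from per-leaf statistics, looks roughly like the following sketch; the leaf tuples are illustrative stand-ins for the counts and confidences a tool would read out of the model's leaf nodes.

def gains_chart_points(leaves):
    """Cumulative gains points from tree leaves.

    `leaves` is a list of (confidence, record_count, positive_count) tuples,
    one per leaf node. Leaves are ranked by confidence, and each point gives
    (fraction of population targeted, fraction of positives captured).
    """
    ranked = sorted(leaves, key=lambda leaf: leaf[0], reverse=True)
    total_records = sum(n for _, n, _ in ranked)
    total_positives = sum(p for _, _, p in ranked)
    points, records_so_far, positives_so_far = [(0.0, 0.0)], 0, 0
    for _, n, p in ranked:
        records_so_far += n
        positives_so_far += p
        points.append((records_so_far / total_records,
                       positives_so_far / total_positives))
    return points

# Three leaves: (confidence, records, positives)
print(gains_chart_points([(0.9, 100, 85), (0.6, 200, 110), (0.2, 700, 105)]))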

PMML knows several model types capable of predicting numeric values. As discussed in the previous section, all types of model quality data, in particular gains charts and error values, are of interest for all of these. Further, distributions of particular target value ranges could be useful. As this information is the same for all model types, PMML should have a generic mechanism to capture this sort of data.

Two model types deal with regression models in the classic sense: General Regression and Simple Regression. Both of them can be used to store linear, polynomial and logistic regression models. For all of them, it is essential to support displaying the actual regression formula, which is already stored for scoring purposes. Information depending on the particular algorithmic approach can be used in addition. For linear regression models, ANOVA tables are commonly used. Some algorithms, such as RBF, subdivide the data in order to construct a useful regression function. In this case, distributions of the subsets may be helpful for model interpretation. Some applications use regression models to perform classification between two class labels, often "yes" and "no". In this case, confusion matrices can be useful as well, if a suitable cut-off value is defined that can determine the class from the predicted numeric value. For none of these explanatory items does a standard PMML mechanism exist.

Much of what was said for regression also applies to the other numeric prediction models. But there are some specific considerations. For Neural networks, displaying the actual network layout and weights may be only useful for an expert. But in general, it can be good to get a feeling for the model complexity. Also, in some cases, the importance of certain inputs can be derived from it. Fortunately, the complete network layout is part of the model.

Support Vector Machines are relatively new, and only few display methods exist so far. The most important information is probably the kernel type of the model and its formula, both of which are covered by the standard.

Model display and the actual model go together for Naive Bayes models as well.

Clustering models are divided into two groups: distribution-based and center-based. For both of these, displaying the cluster distributions is of major interest. Since the distribution is essential to distribution-based clustering models, this area is well covered. Center-based models have a center vector for each cluster, and provided that no preprocessing is present in the model, the center vectors can be used to approximate cluster distributions. But this is no substitute for real distribution information in the model. In addition, for Kohonen networks, the layout of the feature map is of interest to users, because it shows information about the relationship between clusters. PMML already offers a place to store this information for up to three dimensions.

Association rule models hold all the rules already, and they are therefore well suited for model explanation. Further statistics for a more comprehensive display of rules are also already present, and there is not much missing here.

The same can be said of sequence models, which comes as no surprise if one considers that they are a simple generalization of association models, taking sequential information into account.

Finally, ruleset models already hold the most essential part for model explanation, the rules. Beyond the classical components for classification and regression models, and similar to tree models, the distribution among the rules supports a better understanding. This is understandable, as ruleset models are largely described as flattened decision tree models with only a single level.

Here is a summary table reflecting the current status as of PMML 3.0:

Model Type                              Support for Explanation
Associations                            complete
Clustering                              some support
General/Simple/Logistic Regression      little support
Naïve Bayes                             no support
Neural Networks                         some support
Ruleset                                 little support
Sequences                               complete
Support Vector Machine                  little support
Text Model                              complete
Tree                                    some support

5 Conclusion

Support for model explanation in PMML exists, but is rather rudimentary. Only a few elements are missing completely, most prominently gains charts and ANOVA tables. But although elements for statistics and distributions exist, they are rarely used, and only in places where they are required to compute scores from a model.

In general, rule models are currently well placed for model explanation in PMML. Regression and classification models are missing common components such as gains charts; adding these to PMML would mean a big step forward. Cluster models are not much better and will routinely need distribution information.

Only a few workarounds exist for some rare cases, as indicated above. Using these might be a temporary solution, but would, over time, reinforce the current notion that PMML has little means for model explanation and drive model producers into placing the information in extensions instead. For those useful elements that do exist in PMML, it is important to raise awareness and make them available in the models at the right places. This can be done without much effort, by educating model producers and convincing them to write more than the minimum required for simple scoring.

However, it is not too late yet. PMML has come a long way; model producers are just starting to pick up on the last major version, released in 2004, which brought to PMML a lot of additional functionality necessary for scoring. Since PMML can by now be regarded as mature in terms of scoring, we propose to shift the focus towards the next challenge: model explanation.

6 Outlook

One huge area which has not been covered here is how to handle the preprocessing that may be present in some models. In PMML, preprocessing is expressed as DerivedFields, which are then fed as input to the core mining model. These DerivedFields are in many cases unknown to the user, as they are often automatically generated, and users might be confused by their names and, worse still, be totally unaware of their meaning. Another challenge is the representation, either in text or graphically, of the respective transformations.

This is a rather general problem, not only for PMML, but for data mining model introspection in general. With model producers taking full advantage of the PMML mechanism for preprocessing, it remains to be seen if the information provided will suffice for application vendors to come up with good applications for model explanation or if there is a need for more details in PMML.


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DM-SSP’05, August 21, Chicago, IL, USA. ISBN: 1-59593-225-9 Copyright 2005 ACM $5.00


DMX Query and XML for Analysis Standard

ZhaoHui Tang and Jamie MacLennan

Data mining as a technology is beginning to mature. However, the industry today is highly fragmented, making it difficult for application software vendors and corporate developers to integrate different knowledge-discovery tools. We can consider the current data mining market similar to the database market before SQL was introduced. Every data mining vendor has its own data mining package, each with its own proprietary interfaces. For example, a customer is interested in the decision tree algorithm from Vendor A and has built their data mining application based on Vendor A's package. Later on, the customer finds that the time series algorithm from Vendor B is also very attractive for prediction tasks. The customer faces a difficult situation, as products A and B are completely dissimilar, and he has to relearn the concepts and interfaces of Vendor B's product from scratch. Without a simple and standard API, most data mining products are difficult to integrate into business applications such as customer care, CRM, ERP, etc. The OLE DB for DM Specification describes a standard language and interface that allow any data mining algorithm to be accessed from a wide variety of programming languages, and thus to be easily embedded into consumer applications.

Another problem with most commercial data mining products is that data extraction from a relational database to an intermediate storage format is necessary. Data porting and transformation are very expensive operations. Why can't we mine data directly in the relational database where most data are stored?

To solve these problems, Microsoft has worked on the OLE DB for Data Mining (DM) Specification with more than 40 ISVs in the business intelligence field since 2000. Its goal is to provide an industry standard for data mining so that different data mining algorithms from various data mining ISVs can be easily plugged into consumer applications. Software packages that provide data mining algorithms are called Data Mining Providers; applications that use data mining features are called Data Mining Consumers. Consumers communicate with providers using DMX (Data Mining Extensions), which defines data mining objects and a SQL dialect for manipulating and querying these objects.

The major concept introduced by OLE DB for DM is that a "data mining model" becomes a database object much like a table. Similar to a relational table, a model has a list of columns with different data types. Some of these columns are input columns while others are predictable columns. However, a data mining model is different from a relational table in that it doesn't store raw data; rather, it stores the patterns discovered by a data mining algorithm. Therefore, in addition to the column list, a mining model specifies the associated data mining algorithm and any optional parameters. Creation of a data mining model uses a syntax similar to table creation in SQL. The following example creates a mining model to predict credit risk level based on customer demographic information using the Microsoft Decision Trees algorithm:

CREATE MINING MODEL CreditRisk (
    CustomerId long key,
    Profession text discrete,
    Income     text discrete,
    Age        long continuous,
    RiskLevel  text discrete predict
) USING [Microsoft Decision Tree]

Upon creation, the model is an empty container. The model is populated during the training stage, when the data mining algorithm analyzes the input cases, learning patterns which are saved into the model. To simplify the data mining process, training data can come from any tabular data source with an OLE DB driver, using the OPENROWSET command. Users are not required to export data from the relational source to any special intermediate storage format. To be consistent with SQL, OLE DB for DM adopts the syntax of a data insertion query. The following sample trains the CreditRisk mining model with the data stored in the Customers table of a SQL Server database.

INSERT INTO CreditRisk (
    CustomerId, Profession, Income, Age, RiskLevel
)
OPENROWSET(
    'sqloledb', 'mylogin'; 'mypass'; '',
    'SELECT CustomerID, Profession, Income, Age, Risk FROM Customers'
)

Once trained, users can browse the mining model to examine the discovered patterns, or use the trained mining model for prediction tasks. OLE DB for DM recognizes that the most important use of a trained model is prediction, or scoring. To facilitate development, OLE DB for DM specifies prediction as a form of SQL query, returning data in the same format as you would expect from a SQL query. To perform a prediction query, you need a trained data mining model and new data from which to predict. The semantics of prediction are very similar to a relational join, except that instead of joining two tables, we join a data mining model with an input table. Thus we introduce a new concept called the Prediction Join. The following example predicts the credit risk for a set of customers and returns each customer's ID along with their predicted risk and the probability (confidence) of that risk.


SELECT Customers.ID, CreditRisk.RiskLevel,
       PredictProbability(CreditRisk.RiskLevel)
FROM CreditRisk
PREDICTION JOIN
    OPENROWSET('...', 'SELECT * FROM Customers')
ON  CreditRisk.Profession = Customers.Profession
AND CreditRisk.Income     = Customers.Income
AND CreditRisk.Age        = Customers.Age

Many consumer applications of data mining require prediction to be done on the fly, using customer input rather than data pulled from a database. DMX allows for real-time predictions using a standard SELECT syntax for describing the input data. These types of prediction queries are called singleton queries. The following example predicts the credit risk and probability for a single customer.

SELECT CreditRisk.RiskLevel,
       PredictProbability(CreditRisk.RiskLevel)
FROM CreditRisk
NATURAL PREDICTION JOIN
    (SELECT 'Administrative/Clerical' AS Profession,
            45000 AS Income,
            35 AS Age) AS t

A list of standard prediction functions can be included in the SELECT clause of the prediction statement. These functions return the probability of the predicted value, histogram information about other possible values and related probabilities, top counts, cluster id, etc. For example, the following query returns the likelihood of each credit risk level for a customer using the PredictHistogram function. The result of the query contains a nested table in the Histogram column, which stores the probability of each risk state for a given customer.

SELECT Customers.ID, CreditRisk.RiskLevel,
       PredictHistogram(CreditRisk.RiskLevel) AS Histogram
FROM CreditRisk
PREDICTION JOIN
    OPENROWSET('...', 'SELECT * FROM Customers')
ON  CreditRisk.Profession = Customers.Profession
AND CreditRisk.Income     = Customers.Income
AND CreditRisk.Age        = Customers.Age

While OLE DB for DM makes great strides in programmability, it is still limited to the OLE DB and MS Windows platform. In order to break this limitation and allow cross-platform data mining applications, Microsoft, SAS, Hyperion and a dozen other BI product companies created the XML for Analysis Council in 2001. The purpose of the council is to define an XML-based API that allows client applications to query DM and OLAP servers from any platform. XML for Analysis (XMLA) is a SOAP-based API standardizing the interaction between clients and analytical data providers. It specifies how to construct SOAP packets that can be sent to an XMLA server to discover metadata and execute queries. The format of the result is a SOAP packet containing a rowset encoded in XML. This allows connection and interaction from any client platform without any specific client components to communicate with the server, which simplifies application deployment and permits cross-platform development. XMLA specifies two commands used to interact with the server: Discover and Execute. The former is used to pull metadata describing the capabilities and the state of the server, and the latter is used to execute queries and commands against the server objects. The Discover command has the following syntax:

Discover (
    [in]  RequestType  As EnumString,
    [in]  Restrictions As Restrictions,
    [in]  Properties   As Properties,
    [out] Result       As Rowset
)

The RequestType parameter indicates the schema that is being requested; the schemas that are supported by the provider can be accessed by first using the DISCOVER_SCHEMA_ROWSETS request type. Restrictions is an array of OLE DB-type restrictions used to limit the data returned from the server; the list of acceptable restrictions is also available from the same DISCOVER_SCHEMA_ROWSETS request. Properties is a collection of properties used to control various aspects of the Discover method, such as the return type. The list of supported properties can be accessed through the DISCOVER_PROPERTIES request type. Required request types and properties are specified in the XML for Analysis 1.1 specification. Finally, the Result is the tabular result of the call, returned in XML format.

(Figure: Architecture of XML for Analysis)

The following example specifies a Discover call to an XML/A server requesting a list of mining models in a database. Note that the response is restricted to those models in the "Foodmart 2000" database and the return format is specified as "Tabular."

<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis"
              SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">


      <RequestType>DMSCHEMA_MINING_MODELS</RequestType>
      <Restrictions>
        <RestrictionList>
          <CATALOG_NAME>FoodMart 2000</CATALOG_NAME>
        </RestrictionList>
      </Restrictions>
      <Properties>
        <PropertyList>
          <DataSourceInfo>Provider=MSOLAP;Data Source=local;</DataSourceInfo>
          <Catalog>Foodmart 2000</Catalog>
          <Format>Tabular</Format>
        </PropertyList>
      </Properties>
    </Discover>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

The following is a truncated sample response from the Discover call.

<?xml version="1.0"?>
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <DiscoverResponse xmlns="urn:schemas-microsoft-com:xml-analysis">
      <return>
        <root>
          <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
            <!-- The XML schema definition of the result comes here -->
            ...
          </xsd:schema>
          <row>
            <CATALOG_NAME>FoodMart2000</CATALOG_NAME>
            <MODEL_NAME>Sales</MODEL_NAME>
            ...
          </row>
          <row>
            <CATALOG_NAME>FoodMart2000</CATALOG_NAME>
            <MODEL_NAME>Warehouse</MODEL_NAME>
            ...
          </row>
          ...
        </root>


      </return>
    </DiscoverResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

The result of an XML/A Discover request is a simple XML structure that can be interpreted on any platform using the XML tools available thereon. The Execute method is very similar to the Discover method, with this syntax:

Execute (
    [in]  Command    As Command,
    [in]  Properties As Properties,
    [out] Result     As Resultset
)

The Command parameter specifies the query to be executed, along with any query parameters. The syntax of the query is that described in the OLE DB for Data Mining specification. The Properties parameter is identical to that of the Discover method. And, of course, the Result is the result as specified in the Properties parameter. The following example shows an Execute call of the same query we described previously using ADO.

<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis"
             SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <Command>
        <Statement>
          SELECT t.CustomerId, CreditRisk.RiskLevel
          FROM CreditRisk
          NATURAL PREDICTION JOIN
            (SELECT 100 AS CustomerId, 'Engineer' AS Profession,
                    50000 AS Income, 30 AS Age) AS t
        </Statement>
      </Command>
      <Properties>
        <PropertyList>
          <DataSourceInfo>Provider=MSOLAP;Data Source=local;</DataSourceInfo>
          <Catalog>Foodmart 2000</Catalog>
          <Format>Tabular</Format>
        </PropertyList>
      </Properties>
    </Execute>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

The result rowset will be formatted as in the Discover call. With XML/A, the effort to develop data mining web services, especially for data mining prediction, is minimal. The SQL Server 2005 data mining component is one of the first commercial products that has native support for XML/A with DMX as its query language. It also provides algorithm plug-in interfaces for third parties to integrate their algorithms seamlessly inside SQL Server. Once integrated, these algorithms automatically benefit from DMX queries, the XML/A API, and .NET object models such as ADO.NET. SQL Server 2005 provides a true platform for data mining providers.
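Because XML/A is SOAP over HTTP, a client on any platform needs little more than an HTTP library and an XML parser. The following Python sketch posts a Discover request similar to the one above to a hypothetical XML/A endpoint; the endpoint URL and the SOAPAction header value are assumptions made for illustration, not part of the specification text quoted here.

import urllib.request
import xml.etree.ElementTree as ET

DISCOVER_REQUEST = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
      <RequestType>DMSCHEMA_MINING_MODELS</RequestType>
      <Restrictions><RestrictionList/></Restrictions>
      <Properties><PropertyList><Format>Tabular</Format></PropertyList></Properties>
    </Discover>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

def discover_mining_models(endpoint_url: str) -> list:
    """POST a Discover call to an XML/A endpoint and return the model names found."""
    request = urllib.request.Request(
        endpoint_url,
        data=DISCOVER_REQUEST.encode("utf-8"),
        headers={
            "Content-Type": "text/xml",
            # Header value is an assumption; check the server's documentation.
            "SOAPAction": "urn:schemas-microsoft-com:xml-analysis:Discover",
        },
    )
    with urllib.request.urlopen(request) as response:
        tree = ET.fromstring(response.read())
    # Collect MODEL_NAME elements regardless of namespace prefix.
    return [el.text for el in tree.iter() if el.tag.endswith("MODEL_NAME")]

# models = discover_mining_models("http://localhost/xmla/msmdpump.dll")  # hypothetical URL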
