
OLE DB for Data Mining Specification

Version 1.0

Microsoft Corporation

July 2000

Contents

1 Introduction to OLE DB for Data Mining (DM)
1.1 Goals of Data Mining
1.2 Data Mining Tasks
1.2.1 Predictive Modeling (Classification)
1.2.2 Segmentation (Clustering)
1.2.3 Association (Data Summarization)
1.2.4 Sequence and Deviation Analysis
1.2.5 Dependency Modeling
1.3 The OLE DB for DM Specification
1.4 The Columns Structure of a Data Mining Model (DMM)
1.4.1 Model Columns
1.4.2 Prediction Columns
2 OLE DB for DM Programmer's Guide
2.1 Connecting to a Data Mining Provider
2.2 Creating New Mining Models
2.2.1 Detecting the Capabilities of the Provider
2.2.2 Defining a New Mining Model
2.2.3 Copying a Mining Model
2.2.4 Creating a Mining Model from Predictive Model Markup Language (PMML)
2.3 Finding Existing Mining Models
2.4 Browsing Model Column Definition
2.4.1 Input Columns
2.4.2 Prediction Columns
2.5 Populating the Mining Model
2.5.1 Inserting Cases
2.5.2 Populating the Column Values

Specification Version 1.0— Microsoft 1

2.6 Source Data
2.6.1 SINGLETON CONSTANT as Source Data
2.6.2 SINGLETON SELECT as Source Data
2.6.3 OPENROWSET as Source Data
2.6.4 SELECT as Source Data
2.6.5 SHAPE as Source Data
2.7 Browsing Mining Model Content
2.8 Browsing All Possible Cases and Distinct Column Values
2.9 Querying - Applying Mining Models on New Data
2.9.1 Components of a Prediction Query
2.9.2 An Example
2.9.3 Prediction Details
2.9.4 Flattening Nested Tables
2.10 Deleting Existing Mining Models
2.11 Refining Mining Models
3 Appendix A: Schema Rowsets
3.1 MINING_MODELS Schema Rowset
3.2 MINING_COLUMNS Schema Rowset
3.3 MINING_MODEL_CONTENT Schema Rowset
3.4 Layout of DISTRIBUTION Chapter in MINING_CONTENT Schema Rowset
3.5 MINING_SERVICES Schema Rowset
3.6 SERVICE_PARAMETERS Schema Rowset
3.7 MODEL_CONTENT_PMML Schema Rowset
4 Appendix B: OLE DB for DM Grammar
4.1 Statements
4.1.1 CREATE MINING MODEL
4.1.2 INSERT INTO
4.1.3 SELECT
4.1.4 DELETE
4.1.5 DROP
4.2 A Sample BNF
4.2.1 CREATE
4.2.2 INSERT
4.2.3 SELECT
4.2.4 DELETE/DROP
4.2.5 RENAME
4.2.6 MISCELLANEOUS


Using OLE DB for Data Mining

5 Appendix C: Functions
5.1 Predict
5.2 PredictSupport
5.3 PredictVariance
5.4 PredictStdev
5.5 PredictProbability
5.6 PredictProbabilityVariance
5.7 PredictProbabilityStdev
5.8 Cluster
5.9 ClusterDistance
5.10 ClusterProbability
5.11 PredictHistogram
5.12 TopCount
5.13 TopSum
5.14 TopPercent
5.15 Sub-SELECT
5.16 RangeMid
5.17 RangeMin
5.18 RangeMax
5.19 PredictScore
5.20 PredictNodeId
6 Appendix D: XML Format for Data Mining Models
6.1 DTD for the DMM Extended PMML
6.2 Example: Tree Model to Predict Credit Risk
7 Appendix E: Provider Support for SHAPE Syntax
8 Appendix F: Provider Support for OPENROWSET Syntax
9 Appendix G: Support for Other Data Mining Algorithms
9.1 Support for Association Algorithm
9.2 Support for Regression Algorithm
Copyright



1 Introduction to OLE DB for Data Mining (DM)

The OLE DB for Data Mining (hereafter referred to as OLE DB for DM) draft specification assumes that the reader has a working knowledge of the following technologies and languages:

OLE DB

SQL (Structured Query Language)

Microsoft® Visual C++®

Data mining theory and practice

1.1 Goals of Data Mining

Data mining is about finding interesting structures in data, which may be interpreted as knowledge about the data or may be used to predict events related to the data. These structures take the form of patterns, which are concise descriptions of the data set. Data mining makes the exploration and exploitation of large databases easy, convenient, and practical for those who have data but not years of training in statistics or data analysis.

The "knowledge" extracted by a data mining algorithm can have many forms and many uses. It can be in the form of a set of rules, a decision tree, a regression model, or a set of associations, among many other possibilities. It may be used to produce summaries of data or to get insight into previously unknown correlations. It also may be used to predict events related to the data—for example, missing values, records for which some information is not known, and so forth. There are many different data mining techniques, most of them originating from the fields of machine learning, statistics, and database programming.

Note Machine learning, as defined here, refers to the computer's ability to improve data mining algorithms automatically through experience. Data training, an important term that will be used in this context throughout this specification, refers to the process where the data mining algorithm analyzes the input data and finds hidden patterns. Using this trained data, these discovered patterns can then be formed into a model and applied to the machine's learning process.


1.2 Data Mining Tasks

Data mining can be applied for a number of different tasks. The major ones are predictive modeling (classification), segmentation (clustering), association, sequence and deviation analysis, and dependency modeling. This section presents a brief description of each of these tasks.

1.2.1 Predictive Modeling (Classification)

Predictive modeling aims to predict one or more fields in the data by using the rest of the fields. When the variable being predicted is categorical (to approve or reject a loan, for example), the problem is called classification. When the variable is continuous (such as expected profit or loss), the problem is referred to as regression. Classification is a traditionally well-studied problem. Methods popular in data mining include decision trees, rules, neural networks (nonlinear regression), radial basis functions, and many others.

For example, based on debt level, income level and employment type, you can use predictive modeling to predict the credit risk of a given customer. The classification algorithm determines the relationship of these attributes to the risk class in a training data set where the risk is known. Decision trees are a common and useful technique for predictive modeling. Figure 1 shows a set of training data that will be used to predict credit risk. Historical information was collected on customers that included their debt level, income level, what type of employment they had and whether they turned out to be a good or bad credit risk. Figure 2 shows a decision tree that might be created from this data.

Customer ID  Debt level  Income level  Employment type  Credit risk
1            High        High          Self-employed    Bad
2            High        High          Salaried         Bad
3            High        Low           Salaried         Bad
4            Low         Low           Salaried         Good
5            Low         Low           Self-employed    Bad
6            Low         High          Self-employed    Good
7            Low         High          Salaried         Good

Figure 1. Sample data



Figure 2. A decision tree

In this trivial example, a decision tree algorithm might decide that the most significant attribute for predicting credit risk is debt level. The first split in the decision tree is therefore made on debt level. One of the two new nodes (debt level = high) is a leaf node, having three bad credit risks and no good credit risks. In this example, a high debt level is a perfect predictor of a bad credit risk. The other node (debt level = low) is still mixed, having three good credit risks and one bad. The decision tree algorithm then chooses employment type as the next most significant predictor of credit risk. The split on employment type gives two leaf nodes. It turns out that self-employed people are a bad credit risk. This is, of course, a completely imaginary and trivial example, but it illustrates how the decision tree can use known attributes of the credit applicants to predict credit risk. In reality, there would be far more attributes for each credit applicant and the numbers of applicants would be very large. When the scale of the problem expands like this, it is very difficult for a person to extract the rules to identify good and bad credit risks. The classification algorithm, on the other hand, can consider hundreds of attributes and millions of records to come up with the decision tree that describes rules for credit risk prediction.


Figure 2 node contents (credit risk counts per node):

All: Good 3, Bad 4
    Debt = Low: Good 3, Bad 1
        Employment Type = Self Employed: Good 0, Bad 1
        Employment Type = Salaried: Good 3, Bad 0
    Debt = High: Good 0, Bad 3
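The choice of debt level as the first split can be checked numerically. The sketch below scores each attribute of the Figure 1 data by information gain, which is one common split criterion. The specification does not say which criterion a provider uses, so the criterion and the helper names (`entropy`, `information_gain`) are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

# Training cases from Figure 1: (debt level, income level, employment type, credit risk).
CASES = [
    ("High", "High", "Self-employed", "Bad"),
    ("High", "High", "Salaried", "Bad"),
    ("High", "Low", "Salaried", "Bad"),
    ("Low", "Low", "Salaried", "Good"),
    ("Low", "Low", "Self-employed", "Bad"),
    ("Low", "High", "Self-employed", "Good"),
    ("Low", "High", "Salaried", "Good"),
]
ATTRIBUTES = {"Debt level": 0, "Income level": 1, "Employment type": 2}
TARGET = 3  # index of the credit risk column


def entropy(cases):
    """Shannon entropy (in bits) of the credit risk labels in `cases`."""
    counts = Counter(case[TARGET] for case in cases)
    total = len(cases)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(cases, column):
    """Entropy reduction obtained by splitting `cases` on `column`."""
    groups = {}
    for case in cases:
        groups.setdefault(case[column], []).append(case)
    remainder = sum(len(g) / len(cases) * entropy(g) for g in groups.values())
    return entropy(cases) - remainder


gains = {name: information_gain(CASES, col) for name, col in ATTRIBUTES.items()}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("First split:", best_attribute)
```

Under this criterion debt level has by far the largest gain, which agrees with the first split described in the text. Other criteria (for example, Gini impurity) could be substituted in `information_gain` without changing the overall structure.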


1.2.2 Segmentation (Clustering)

Segmentation is the task of finding groups (clusters) in the data that consist of similar records. Unlike in predictive modeling, there is no target variable that appears as an attribute in the data. The clustering algorithm determines this new "hidden" attribute (the cluster ID to which each example belongs) by examining the data. Examples include segmenting a customer database into clusters of similar customers, which enables the design of a separate marketing strategy for each segment. There are many methods for clustering data. Popular approaches include the K-Means algorithm, hierarchical agglomerative methods, and mixture modeling using the Expectation-Maximization (EM) algorithm for fitting probabilistic mixture models to data. It is possible for a data record to belong to different clusters with different degrees of membership.

Consider an employee database in which each employee has three attributes—age, salary, and vested amount in a company pension plan. A user may want to issue a query that provides a cross-tabulation of the average ages of employees having pension plans in the ranges 100K–200K, 200K–400K, and 400K–1000K and having salaries in the ranges 50K–100K, 100K–200K, and 200K–300K. For traditional approaches, the problem is that the ranges specified by the user can be arbitrary. In other words, the query hierarchy is dynamic and not pre-discretized along each dimension.

Multidimensional data records can be viewed as points in a multidimensional space. For example, the records of the schema (age, salary) could be viewed as points in a two-dimensional space, with the dimensions of age and salary. Figure 3a shows some data conforming to the above example schema. Figure 3b shows its representation as points in a two-dimensional space.

Figure 3. Clustering sample



Now suppose one is to give a short representation of this simple data set. One could provide the average age and the average salary (and their standard deviations). This would represent the average employee as having a salary of $85.5K (± $35.5K) and an average age of 40 (± 15.5) years. However, imagine inspecting the data further and realizing that there are two groups of employees. The summary of the data would then be as shown in Figure 4.

Group      Age                  Income
           Average   Std Dev.   Average   Std Dev.
Segment 1  26 years  1.0        $54.3K    $4K
Segment 2  54 years  3.6        $116.6K   $15.2K

Figure 4. Clustering result

As Figure 4 illustrates, the data has not only been found to comprise two distinct segments, but the average values are also much more meaningful within each segment. This is evidenced by the much smaller standard deviation associated with each segment.

How does one identify the presence of such segments? This is what a clustering algorithm does. While it may be obvious what these segments should be in two dimensions (as shown in the preceding simple two-dimensional example), finding segments in higher dimensions (for example, four or higher) is much more difficult for humans because simply plotting the data may no longer help. Also, plotting data becomes extremely inconvenient with many data points. However, clustering algorithms automatically find such segments in data. Each segment is represented by its own distribution. The normal distribution was used in this example, but categorical dimensions, such as gender or job description, can also be admitted and can be represented by using the multinomial distribution. A clustering algorithm can deal with both types of attributes and can produce useful groupings for summaries.
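As a rough illustration of how such segments can be found automatically, the sketch below runs a plain K-Means loop over a handful of (age, salary) records. The records are hypothetical and are not the data behind Figure 3 or Figure 4; the function names and the choice of seeding with the youngest and oldest employee are likewise assumptions made to keep the example small and deterministic.

```python
from statistics import mean, stdev

# Hypothetical (age in years, salary in $K) records; not taken from Figure 3.
EMPLOYEES = [(25, 50.0), (26, 54.0), (27, 59.0), (50, 100.0), (54, 117.0), (58, 133.0)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, iterations=10):
    """Plain K-Means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [
            tuple(mean(dim) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters


# Seed with the youngest and oldest employee so the run is deterministic.
segments = k_means(EMPLOYEES, [min(EMPLOYEES), max(EMPLOYEES)])

ages = [age for age, _ in EMPLOYEES]
print(f"All employees: mean age {mean(ages):.1f}, std dev {stdev(ages):.1f}")
for number, segment in enumerate(segments, start=1):
    seg_ages = [age for age, _ in segment]
    print(f"Segment {number}: mean age {mean(seg_ages):.1f}, std dev {stdev(seg_ages):.1f}")
```

The per-segment standard deviations come out far smaller than the overall one, which is the effect the text describes. A mixture model fitted with EM would additionally give each record a degree of membership in every segment rather than a single hard assignment.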

1.2.3 Association (Data Summarization)

Association (data summarization) describes a class of methods that aim to produce summaries of parts of the data, for example by discovering correlations between variables over substantial subsets of the data or by deriving an association between some items and other items. The most common technique in this category of methods is the use of association rules. Sometimes referred to as market basket analysis, the process of finding association rules depends on identifying frequent item sets in transactional data. Frequent item sets consist of sets of items (for example, products) that frequently occur together in the same transaction.



Frequent item sets can be used to summarize the sets of products customers tend to buy together in a supermarket basket. (For another example, to understand how a Web site is used by its visitors, frequent item sets can also be used to find a set of Web pages that will be visited during a Web-browsing session.) Therefore, retailers can use association techniques to do cross-selling by stocking related products together. For example, consider a set of transactions representing checkout baskets in a grocery store. Given a minimum support level (supplied by the analyst), the data mining algorithm can find items in the store that are bought together. Suppose one has a set of baskets shown in the Transaction table in Figure 5a. The Frequent item sets table in Figure 5b shows the respective support levels for the frequent item sets derived from the Transaction table.

Basket ID  Item ID
1          Milk
1          Butter
2          Milk
2          Honey
2          Butter
3          Milk
3          Bread
3          Butter
4          Milk
4          Bread
4          Honey

(a) Transaction table

Support  Item sets found
4        {Milk}
3        {Milk}, {Butter}, {Milk, Butter}
2        {Milk}, {Butter}, {Milk, Butter}, {Honey}, {Bread}, {Honey, Bread}, {Honey, Milk}, {Honey, Butter}, {Bread, Milk}, {Bread, Butter}

(b) Frequent item sets

Figure 5. Association

Note that as the support level decreases, the number of frequent item sets grows monotonically. In general, in real databases—whether storing market baskets, tracking Web-browsing behavior, or monitoring customer uses of a service (for example, a phone service)—the number of item sets having a high support value tends to be very small, and the number of item sets tends to grow exponentially as the support level is decreased.

Once the frequent item sets are derived, they can be used to produce association rules. Association rules are derived by selecting one of the items in a frequent item set as the item to be predicted and then evaluating the remaining items as the conditions of a rule for predicting that item. For example, in the Frequent item sets table in Figure 5b, one may use the set "{Milk, Butter} with support 3" to derive the following association rule:



If a customer buys Milk, that customer also buys Butter.

However, studying the example data set, one also determines that this rule has an accuracy rate of only 75%, because the transaction indicated by Basket ID number 4 does not obey this rule even though it satisfies the first condition.
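The support counts and the 75% figure can be reproduced directly from the Transaction table in Figure 5a. The sketch below is a brute-force enumeration, chosen only because the example has four items; it is not meant to represent the algorithm a provider would use, and the function names are hypothetical. The "accuracy rate" of a rule is computed here as the share of baskets satisfying the condition that also contain the predicted item, a quantity usually called confidence.

```python
from itertools import combinations

# Baskets from the Transaction table in Figure 5a.
BASKETS = {
    1: {"Milk", "Butter"},
    2: {"Milk", "Honey", "Butter"},
    3: {"Milk", "Bread", "Butter"},
    4: {"Milk", "Bread", "Honey"},
}


def support(item_set):
    """Number of baskets that contain every item in `item_set`."""
    return sum(1 for basket in BASKETS.values() if item_set <= basket)


def frequent_item_sets(min_support):
    """Brute-force enumeration; practical only for a handful of items."""
    items = sorted(set().union(*BASKETS.values()))
    found = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support(set(candidate))
            if count >= min_support:
                found[candidate] = count
    return found


def confidence(condition, predicted):
    """Fraction of baskets containing `condition` that also contain `predicted`."""
    return support(condition | predicted) / support(condition)


at_least_three = frequent_item_sets(min_support=3)
rule_confidence = confidence({"Milk"}, {"Butter"})

print(at_least_three)
print(f"Milk -> Butter: {rule_confidence:.0%}")
```

With a minimum support of 3 the enumeration yields {Milk}, {Butter}, and {Milk, Butter}, and the rule "Milk implies Butter" holds in 3 of the 4 baskets that contain Milk.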

1.2.4 Sequence and Deviation Analysis

Sequence and deviation analysis accounts for sequence information and anomalies in the data. In the preceding three categories of data mining techniques (predictive modeling, segmentation, and association), the sequence in which events occurred was ignored and was treated simply as part of one record (the case). For example, on a data set consisting of people visiting a Web site, suppose user U774 first visits the home page (page 0), then page 13, then page 2, and then page 17 on the Web site. This case could simply be flattened into the following statement:

Case: User U774: visited {page 0, page 2, page 13, page 17}

On the other hand, it might be preferable to preserve the sequence information. This means that another user who visited the same pages, but in a different order, will be distinct from U774.
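The difference between the flattened case and the sequence-preserving case can be shown with two tiny data structures. In this sketch the second visitor is hypothetical, introduced only to have a user who sees the same pages in a different order.

```python
# Page visits of user U774, in the order given in the text.
u774_visits = [0, 13, 2, 17]
# A hypothetical second user who visits the same pages in a different order.
other_visits = [0, 2, 13, 17]

# Flattened cases: order is discarded, so the two users look identical.
same_when_flattened = set(u774_visits) == set(other_visits)

# Sequence-preserving cases: order is kept, so the two users are distinct.
same_as_sequences = tuple(u774_visits) == tuple(other_visits)

print(same_when_flattened)  # True
print(same_as_sequences)    # False
```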

Algorithms in this category focus on one of the following objectives:

1. Summarizing frequent sequences or episodes in data

2. Detecting changes in data over time

3. Detecting changes in knowledge (models or patterns) over time

As an example of the first kind of task, summarizing, suppose it is discovered that users visit a particular Web site as follows:

Figure 6. Sequence and deviation analysis

The sequences found in the data may indicate that on a given Web site, 90% of users visit page 0 and 2% enter at page 10. The sequences also may indicate that from page 0, 60% go to page 15, and so forth. The graph in Figure 6 summarizes ordering relationships and gives an idea of the flow. There may be infrequently visited pages between pages 15 and 17, but only the frequent visits are reported.

Deviation analysis focuses on finding anomalies in data. For example, if a user usually visits only pages 0, 1, and 15 and then one day visits page 17, the deviation analysis algorithm flags this particular event. Deviation analysis is a common technique in fraud detection.

1.2.5 Dependency Modeling

Dependency modeling or "density estimation" refers to the estimation of the underlying joint probability distribution or density of the data. If you know what the joint probability distribution is, you can answer any question of interest about the data. Dependency modeling can be used to identify (sometimes novel) dependencies among attributes of cases. Identifying dependencies is one way to gain insight into your data.

An often-used density estimate for a small number of attributes is the histogram. Unfortunately, this technique is not useful when there are many attributes. A simple form of density estimation that can handle a large number of attributes uses the Naïve Bayes model. In this model, it is assumed that all attributes are independent within a class or a cluster. Note that the model does not assume that attributes are globally independent. Another simple example of density estimation is to fit a multivariate-normal distribution to data.
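To make the within-class independence assumption concrete, the sketch below estimates a joint probability from the Figure 1 credit data as P(class) multiplied by the product of P(attribute | class). It uses raw frequency counts with no smoothing, and the function name `naive_bayes_joint` is hypothetical; treat it as an illustration of the factorization rather than a recommended estimator.

```python
from fractions import Fraction

# Cases from Figure 1: (debt level, income level, employment type, credit risk).
CASES = [
    ("High", "High", "Self-employed", "Bad"),
    ("High", "High", "Salaried", "Bad"),
    ("High", "Low", "Salaried", "Bad"),
    ("Low", "Low", "Salaried", "Good"),
    ("Low", "Low", "Self-employed", "Bad"),
    ("Low", "High", "Self-employed", "Good"),
    ("Low", "High", "Salaried", "Good"),
]


def naive_bayes_joint(attributes, risk_class):
    """Estimate P(attributes, class) as P(class) * product of P(attribute_i | class).

    Assumes `risk_class` occurs at least once in CASES (no smoothing is applied).
    """
    in_class = [case for case in CASES if case[-1] == risk_class]
    probability = Fraction(len(in_class), len(CASES))
    for column, value in enumerate(attributes):
        matches = sum(1 for case in in_class if case[column] == value)
        probability *= Fraction(matches, len(in_class))
    return probability


estimate = naive_bayes_joint(("High", "High", "Salaried"), "Bad")
print(estimate, float(estimate))
```

For a high-debt, high-income, salaried applicant with bad credit risk, the factors are 4/7 for the class and 3/4, 2/4, and 2/4 for the three attributes, giving 3/28. Only per-class counts for each attribute are needed, which is why this approach scales to many attributes where a full histogram does not.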

More complex (and more accurate) models for density estimation include mixture models and graphical models. In the mixture-model approach, one fits several distributions to a data set. For example, one may decide a population of users is composed of three distinct subpopulations, each having its own multivariate-normal distribution. Graphical models useful for density estimation include Bayesian networks and dependency networks.

1.3 The OLE DB for DM Specification

OLE DB for DM is an OLE DB extension that supports data mining operations over OLE DB data providers. The goal of this specification is to provide an industry standard for data mining so that different data mining algorithms from various data mining ISVs can be easily plugged into user applications. In this documentation, software packages that provide data mining algorithms are called data mining providers and those applications that use data mining features are called data mining consumers. OLE DB for DM specifies the API between data mining consumers and data mining providers.

OLE DB for DM introduces one new virtual object, referred to as the data mining model (DMM), as well as several new commands for manipulating the DMM. In its characteristics and use, the DMM is very similar to a table and is created with a CREATE statement very similar to the SQL CREATE TABLE statement. It is populated using the INSERT INTO statement, just as a table would be populated. The client uses a SELECT statement to make predictions and explore the DMM.



OLE DB for DM treats a DMM as if it were a special type of table. When you insert the data into the table, it is processed by a DM algorithm and the resulting abstraction (or data mining model) is saved instead of the data itself. Subsequently, the DMM can be browsed, refined, or used to derive predictions.

Data to be mined is represented logically as a collection of tables in a relational database. For instance, a customer database might record customers, demographic data about customers, orders, and order items. A join of the customer orders and order items tables may have many records for one customer (one per order item). This collection of data pertaining to a single entity is often called a case, and the set of all relevant cases is referred to as a case set. To represent these relationships, OLE DB for DM uses nested tables as defined by the Data Shaping Service, which is included with the Microsoft Data Access Components (MDAC) products. Note that the same physical data may be used to generate different case sets for different analysis purposes. For example, if one chooses to mine models or patterns over specific products, each product then becomes a single case and customers become attributes of the case.

The content of a DMM can be thought of as a "truth table" containing a row for every possible combination of the distinct values for each column in the DMM. In other words, it contains every possible case. With this view in mind, a DMM can be used to look up learned values and statistics.
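The "truth table" view can be pictured with the Figure 1 credit data: enumerate every combination of the distinct values in each column and attach a learned statistic to each row. The sketch below uses the number of supporting training cases as that statistic. This is purely a conceptual aid under those assumptions; nothing here implies that a provider materializes such a table, and the variable names are hypothetical.

```python
from collections import Counter
from itertools import product

COLUMNS = ("Debt level", "Income level", "Employment type", "Credit risk")

# Training cases from Figure 1, in the column order above.
CASES = [
    ("High", "High", "Self-employed", "Bad"),
    ("High", "High", "Salaried", "Bad"),
    ("High", "Low", "Salaried", "Bad"),
    ("Low", "Low", "Salaried", "Good"),
    ("Low", "Low", "Self-employed", "Bad"),
    ("Low", "High", "Self-employed", "Good"),
    ("Low", "High", "Salaried", "Good"),
]

# Distinct values observed in each column.
distinct_values = [sorted({case[i] for case in CASES}) for i in range(len(COLUMNS))]

# One row per possible combination of distinct values, with the number of
# training cases that support it stored as the row's statistic.
observed = Counter(CASES)
truth_table = {combo: observed[combo] for combo in product(*distinct_values)}

print(len(truth_table))
print(truth_table[("High", "High", "Salaried", "Bad")])
```

Four columns with two distinct values each give 16 possible cases, of which only 7 were actually seen in training; the remaining rows carry a support of zero. Looking up a row then plays the role of "looking up learned values and statistics".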

A fundamental operation in OLE DB for DM is the training of a data mining model, followed by use of the model to derive predictions. The following is an outline of the process.

The INSERT statement invokes the DM algorithm on the provider to create an abstraction of the data in the form of a DMM. This abstraction represents the patterns the algorithm found in the data; the patterns are saved rather than the training data. Selecting from a PREDICTION JOIN allows new data to be processed through the model to produce predictions.

1. Create an OLE DB data source object and obtain an OLE DB session object. This is the standard mechanism of connecting to data stores via OLE DB.

2. Create the data mining model object. Using an OLE DB command object, the client executes a CREATE statement that is similar to a CREATE TABLE statement.

CREATE MINING MODEL [Age Prediction]
(
    [Customer ID] LONG KEY,
    [Gender] TEXT DISCRETE,
    [Age] DOUBLE DISCRETIZED() PREDICT,
    [Product Purchases] TABLE
    (
        [Product Name] TEXT KEY,
        [Quantity] DOUBLE NORMAL CONTINUOUS,
        [Product Type] TEXT DISCRETE RELATED TO [Product Name]
    )
)
USING [Decision Trees]



3. Insert training data into the model. In a manner similar to populating an ordinary table, the client uses a form of the INSERT INTO statement. Note the use of the SHAPE statement to create the nested table.

INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age],
    [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
)
SHAPE
    {SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID]}
APPEND
    ({SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID]}
     RELATE [Customer ID] To [CustID]) AS [Product Purchases]

4. Use the data mining model to make some predictions. Predictions are made with a SELECT statement that joins the model's set of all possible cases with another set of actual cases. The actual cases can be incomplete. In this example, the value for "Age" is not known. Joining these incomplete cases to the model and selecting the "Age" column from the model will return a predicted "Age" for each of the actual cases.

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
(
    SHAPE
        {SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]}
    APPEND
        ({SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
         RELATE [Customer ID] To [CustID]) AS [Product Purchases]
) AS t
ON  [Age Prediction].Gender = t.Gender
AND [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name]
AND [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]



Note Because the process of combining actual cases with all possible model cases is not as simple as the semantics of a normal SQL JOIN, a new type of join, the PREDICTION JOIN, is introduced in OLE DB for DM. When the schema of the actual case table matches the schema of the model, NATURAL PREDICTION JOIN can be used, obviating the need for the ON clause of the join. Columns from the source query are then matched to columns from the DMM by column name.

Part 2 of this document describes the language for creating and manipulating a DMM in more detail. The complete details of the language and the schema rowsets used when working with a data mining provider (DMP) are described in Appendix A.

1.4 The Columns Structure of a Data Mining Model (DMM)

In usage, the DMM is very similar to a SQL table. The SELECT statement returns columns from the input data, columns from the model, and predictions produced by the model. The DMM definition includes a definition of the columns of data over which the model will be created, including detailed information about the nature of the data and relationships between columns.

1.4.1 Model Columns

The model columns describe all of the information about a specific case. For example, assume that each case in the DMM represents a customer. The columns of the DMM will include all known and desired information about the customer.

The following table illustrates a customer case.

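| Customer ID | Gender | Hair Color | Age | Age Probability | Product Name | Product Quantity | Product Type | Cars Owned | Car Probability |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Male | Black | 35 | 100% | TV | 1 | Electronic | Truck | 100% |
|  |  |  |  |  | VCR | 1 | Electronic | Van | 50% |
|  |  |  |  |  | Ham | 2 | Food |  |  |
|  |  |  |  |  | Beer | 6 | Beverage |  |  |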

As the table indicates, a customer case is not easily describable using simple relational tables. Each case can include not only simple columns but also multiple tables. Each of these tables inside the case can have a variable number of rows and a different number of columns. The meaning of the information contained in the columns can also greatly differ.
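One way to picture such a case outside a relational store is as a record whose table-valued fields hold lists of rows. The following is a minimal sketch only, not part of the specification: the dictionary layout and the helper names nested_tables and flatten are assumptions made for illustration, the values come from the sample table, and the two nested table names are the ones this section uses for them.

```python
# Hypothetical in-memory form of the sample case. Scalar columns map to
# single values; each nested table maps to a list of nested rows.
customer_case = {
    "Customer ID": 1,
    "Gender": "Male",
    "Hair Color": "Black",
    "Age": 35,
    "Age Probability": 1.00,
    "Product Purchases": [
        {"Product Name": "TV", "Product Quantity": 1, "Product Type": "Electronic"},
        {"Product Name": "VCR", "Product Quantity": 1, "Product Type": "Electronic"},
        {"Product Name": "Ham", "Product Quantity": 2, "Product Type": "Food"},
        {"Product Name": "Beer", "Product Quantity": 6, "Product Type": "Beverage"},
    ],
    "Car Ownership": [
        {"Cars Owned": "Truck", "Car Probability": 1.00},
        {"Cars Owned": "Van", "Car Probability": 0.50},
    ],
}


def nested_tables(case):
    """Return the names of the columns whose value is a nested table."""
    return [name for name, value in case.items() if isinstance(value, list)]


def flatten(case, table):
    """Join the case row to one nested table: one output row per nested row."""
    scalars = {k: v for k, v in case.items() if not isinstance(v, list)}
    return [{**scalars, **row} for row in case[table]]
```

Flattening one nested table at a time repeats the scalar columns on every row, which is why a single flat relational table describes such a case poorly once two nested tables of different lengths are involved.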



Note The ability of a case to contain multiple tables of data is a key requirement for most of the data mining algorithms. Although most of the relational data stores today cannot support such table structures, the theoretical notion of nested tables (also known as table columns) already exists in the relational world and is also supported by MDAC. This specification will rely on these data structures with some anticipation of a wider adoption in the relational world in the future.

Some of the columns in the example have a direct one-to-one relationship with the case (such as "Gender" and "Age"), while others have a one-to-many relationship with the case and therefore exist in tables. As noted above, the nested tables are a key element in the basic data structure of the case and therefore have an explicit representation in the case definition. You can easily identify the following two tables contained in the sample case:

"Product Purchases" table containing the columns "Product Name," "Product Quantity," and "Product Type"

"Car Ownership" table containing the columns "Cars Owned" and "Car Probability"

The main row of the case is the case row. Columns in the case row describe the entity of the case. For example, in the case illustrated in the preceding table, the "Age" column contains the age of the customer whose Customer ID is 1. Rows inside nested tables are referred to as nested rows. Columns in nested rows describe the entity of the nested row as it relates to the case row. For example, the "Product Quantity" column represents the quantity of the product indicated in the "Product Name" column; therefore, 2 is the quantity of "Ham" purchased by customer 1.

As the preceding example indicates, each column can represent the following content types:

KEY: the columns that identify a row. For example, "Customer ID" uniquely identifies customer cases, and "Product Name" uniquely identifies a row in the "Product Purchases" table. In the CREATE MINING MODEL command syntax, specifying the type flag KEY in the column definition identifies key columns.

ATTRIBUTE: A direct attribute of the case. This type of column represents some value for the case. For example, the age, gender, or hair-color of the customer or the quantity of a specific product the customer purchased.

RELATION: Information used to classify attributes, other relations, or key columns. For example, "Product Type" classifies "Product Name." A given relation value must always be consistent for all of the instance values of the other columns it describes—for example, the product "Ham" must always be shown as "Food" for all cases. In the CREATE MINING MODEL command syntax, relations are identified in the column definition by using a RELATED TO clause to indicate the column being classified.

QUALIFIER: A special value associated with an attribute that has a predefined meaning for the provider, for example, the probability that the attribute is correct. These qualifiers are all optional and apply only if the data has uncertainties attached to it or if the output of previous predictions is being chained as input to a subsequent DMM training step. Following are examples of qualifiers.



Note In the CREATE MINING MODEL command syntax, modifiers are identified by using an OF clause to indicate the attribute column they modify.

PROBABILITY: A number between zero and one that describes the probability of the associated value.

VARIANCE (or Stdev): A number that describes the variance (or standard deviation) of the value of an attribute.

SUPPORT: A float that represents a weight (case replication factor) to be associated with the value.

PROBABILITY_VARIANCE (or Stdev): The variance (or standard deviation) associated with the probability estimator used for PROBABILITY.

ORDER: Specifies the order of a column. (See ORDERED below.)

TABLE: A nested table is represented in the case as a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table. The value of a TABLE type column is in itself a table containing all of the columns for the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.

DISCRETE: The attribute values are discrete. These are the simplest forms of an attribute. Gender is a typical example of such an attribute, where the values describe categories. Even if the values are numeric, no ordering is implied by the values. ("Area Code" is a good example.) The values of a discrete attribute are often called its states.

ORDERED: Columns that define an ordered set of values. Although there is a total ordering, no distance or magnitude semantics are implied. A ranking of skill level (say one through five) is an ordered set, but a skill level of five isn't necessarily five times better than a skill level of one. Attributes with a type flag of ORDERED are also considered to be discrete. There may be an associated "Order Of" column with numeric values that gives the ordering for this attribute type column. The order of column values can be defined before the model training. (See the section "Populating the Column Values.")

CYCLICAL: A set of values that have cyclical ordering. Day of the week is a good example, since day number one follows day number seven. Attributes with a type flag of CYCLICAL are also considered to be ordered and discrete.

CONTINUOUS: Attributes with values that form a continuous curve. Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.



DISCRETIZED: The data that will be inserted into the model is continuous, but it should be transformed into and modeled as a number of ORDERED states by the provider. Some Data Mining algorithms cannot accept CONTINUOUS attributes as input, or they may not be able to predict CONTINUOUS values. For these cases, columns with continuous domains should be made into DISCRETIZED attributes. In the CREATE MINING MODEL command syntax, the DISCRETIZED type flag can take arguments to override default discretization behavior.

SEQUENCE_TIME: A column containing time measurement units. A time column does not have to contain a data type of any particular format. A period number is acceptable. This is typically used to associate a sequence time with individual attribute values such as purchase time.
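The consistency rule stated for RELATION columns above (a given key value must always carry the same relation value) can be checked on training data before it is inserted. The sketch below is illustrative only; the function name inconsistent_relations and the row layout are assumptions, not part of the specification.

```python
def inconsistent_relations(rows, key_column, relation_column):
    """Return {key: set of relation values} for keys seen with more than one value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key_column], set()).add(row[relation_column])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Nested rows gathered from several cases (illustrative values).
rows = [
    {"Product Name": "Ham", "Product Type": "Food"},
    {"Product Name": "Beer", "Product Type": "Beverage"},
    {"Product Name": "Ham", "Product Type": "Food"},
]

# The same rows plus one that classifies "Ham" differently.
bad_rows = rows + [{"Product Name": "Ham", "Product Type": "Electronic"}]
```

An empty result means every key maps to exactly one relation value; any entry in the result names a key that violates the rule.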

A CONTINUOUS attribute's domain may also have a distribution associated with it. This is a hint given to the data mining provider describing the expected distribution of the column values that will be inserted into the model when trained. Specific values may be known to have typical distributions. For some algorithms, it is particularly beneficial to know the distribution ahead of time. If the distribution isn't known or isn't given, the provider may assume whatever distribution it finds convenient. Following are examples of distributions:

NORMAL: A histogram of the continuous values forms a normal Gaussian distribution. Household income values may form this curve.

LOG_NORMAL: A histogram of the continuous values forms a Gaussian distribution with all values greater than 0, with an elongated upper tail, and with a skew toward the low end of the curve. The quantity associated with a product purchase may form this curve if a value of 0 is not explicitly recorded and if most consumers tend to buy smaller quantities of the product.

UNIFORM: The likely occurrence of all values is equal.

There are a number of other distribution models, such as BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION, and so on. A data mining provider may support a subset of these distributions.

All of the preceding column descriptions allow the provider to make some sense of the training data it is given with the INSERT command. Returning to the example, the columns can now be classified as shown in the following table.

| Containing Table | Column | Content Type | Model Hints | Comments |
|---|---|---|---|---|
|  | Customer ID | Key |  | Special column that serves as the case identifier (key) |
|  | Gender | Discrete Attribute |  |  |
|  | Hair Color | Discrete Attribute |  |  |
|  | Age | Continuous Attribute |  |  |
|  | Age Probability | Probability Modifier of Age |  |  |
|  | Customer Loyalty | Ordered Attribute |  | Doesn't exist in the sample case. Added for additional illustration. |
|  | Product Purchases | Table |  |  |
| Product Purchases | Product Name | Key |  | Each distinct key represents the purchase of a product with a "Quantity" attribute. |
| Product Purchases | Product Quantity | Continuous Attribute | Log Normal |  |
| Product Purchases | Product Type | Relation of Product Name |  |  |
| Product Purchases | Month Purchased | Cyclical Attribute |  | Doesn't exist in the sample case. Added for additional illustration. |
| Car Ownership | Cars Owned | Key |  | Has an implicit "Exists" attribute for each distinct key. |
| Car Ownership | Car Probability | Probability Modifier of Implicit "Exists" Attribute |  |  |

Other hints can be given to the data mining provider to help it build good models of the training data. These modeling flags are provider-specific, but following are two examples:

MODEL_EXISTENCE_ONLY: The actual values for an attribute are not nearly as important as the simple existence of the attribute. For example, assume the existence of some general demographic data for a selected group of people, along with a nested table of the television programs and the viewing duration for all of the programs that each person watched. For modeling purposes, the fact that the person watched a particular program may be more important than how long they watched it. In this case, the Duration attribute should be marked as MODEL_EXISTENCE_ONLY.

NOT NULL: The attribute can never contain a null value, and encountering one while training should generate an error.



1.4.2 Prediction Columns

Attribute or Table type columns can be input columns, output columns, or both. The data mining provider will build a data mining model capable of predicting or explaining output column values based on the values of the input columns.

Predictions may convey not only simple information such as "estimated age is 21", but they may also convey additional statistical information such as confidence level and standard deviation. Further, the prediction may actually be a collection of predictions, such as "the set of products that the customer is likely to buy." Each of the predictions in the collection may also include a set of statistics.

A prediction can be expressed as a histogram. A histogram provides multiple possible prediction values, each accompanied by a probability and other statistics. When histogram information is required, each prediction (which by itself can be part of a collection of predictions) may have a collection of possible values that constitutes a histogram.

Since the prediction information may be very rich, it is often necessary to extract only a portion of the predictions. For example, you may want to see only the "best estimate," "top 3 estimates," or "estimates with probability greater than 55%." Not every provider nor every DMM can support all of the possible requests. Therefore, it is necessary for the output column to define whatever information may be extracted out of it.
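The three kinds of extraction mentioned above can be pictured as simple filters over a histogram of (value, probability) pairs. This is a sketch under stated assumptions: the helper names and the sample histogram are invented for illustration and are not the transformation functions defined by OLE DB for DM.

```python
def best_estimate(histogram):
    """Return the single (value, probability) entry with the highest probability."""
    return max(histogram, key=lambda entry: entry[1])


def top_n(histogram, n):
    """Return the n most probable entries, most probable first."""
    return sorted(histogram, key=lambda entry: entry[1], reverse=True)[:n]


def above(histogram, threshold):
    """Return the entries whose probability is strictly greater than threshold."""
    return [entry for entry in histogram if entry[1] > threshold]


# Illustrative histogram for a discretized "Age" prediction (made-up numbers).
age_histogram = [("20-29", 0.60), ("30-39", 0.25), ("40-49", 0.10), ("50+", 0.05)]
```

With this sample, the "best estimate" and the "estimates with probability greater than 55%" both reduce to the single 20-29 bucket, while the "top 3 estimates" drops only the least likely bucket.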

OLE DB for DM defines a set of standard transformation functions on output columns. These functions are discussed in detail in section 2.9, "Querying—Applying Mining Models on New Data," and in Appendix C.



2 OLE DB for DM Programmer's Guide

This section of the specification illustrates how data mining consumers and providers work together. The section will walk you through the following operations:

Connecting to a DMP

Creating a new DMM

Enumerating and exploring existing data mining models

Executing queries and deriving predictions with a DMM

Housekeeping activities

This section is not a formal representation of the interfaces and does not intend to describe every option and variation that the API enables. Instead, all of the interfaces are formally detailed in the appendixes. You should consider this section a tutorial that describes the principles of working with a DMP and introduces application programmers to the new world of DM client development.

2.1 Connecting to a Data Mining Provider

The process of connecting to a DMP is the same as connecting to any other OLE DB provider (whether relational, multidimensional, or any other type). The connection sequence to an OLE DB provider is described in the OLE DB Programmer's Reference.

As with all other OLE DB providers, a DMP supports the data source, session, command, and rowset objects.

Although during the connection sequence a DMP behaves just like any other OLE DB provider, it is still very useful to be able to determine whether a specific provider supports the OLE DB for DM specification. To this end, the constant DBSOURCETYPE_DATASOURCE_DMP is defined and can be used when enumerating providers to locate a provider capable of performing data mining. A single provider may support many data store types. For example, a provider may support both relational operations as well as data mining operations concurrently. Bit operations on the SOURCE_TYPE value can detect whether a provider supports a specific data store type.

Once a session object has been instantiated, the client application can query the provider for information and execute various commands.



2.2 Creating New Mining Models

A new DMM is created with the CREATE MINING MODEL command. This command correlates closely to the common relational database operation CREATE TABLE, which defines a table object structure. As will be shown in following sections, creating and populating a DMM follows the approach taken by relational databases for the management of tables.

The similarities between DMMs and tables are not coincidental. It is widely expected that data mining capabilities will be fully integrated with relational databases in the future. Therefore, the present approach looks at the DMM as a future standard object of an RDBMS, just like a table or a view, and the DMM is indeed represented and accessed to a large degree as if it were a special type of a table.

However, unlike a table, a DMM must announce a predefined goal and analysis technique. Each provider may support many different analysis techniques. It is therefore necessary to be able to identify the provider capabilities.

2.2.1 Detecting the Capabilities of the Provider

The different mining services (or algorithms as they are also known) are exposed through a new schema rowset: the mining services schema rowset. This schema rowset exposes the different algorithms supported by a provider and the way to specify goals for the algorithm.

Many algorithms require a goal—for example, "predict whether the customer's transactions look fraudulent," "predict the sales amount for the customer," "predict the profit for a product," and "predict the sales of each store for next year" all have targeted goals. The algorithm will try to predict something about the case, usually one of the attributes of the case. Most of the algorithms will need to get a training set of cases where the attributes to be predicted are already known, and they will then create a DMM capable of predicting these attributes for cases in which the attribute is unknown.

Different algorithms will be capable of predicting different things. They may also differ in the type of data they are capable of processing. The list of algorithms (or services), their possible goals, their limitations, and their capabilities are all exposed in the mining services schema rowset. This information will be used when defining a new model.

The mining services schema rowset is described in detail in Appendix A. The following table describes some of the important columns that are found in the mining services schema rowset.



| Column Name | Type Indicator | Description |
|---|---|---|
| SERVICE_NAME | DBTYPE_WSTR | The name of the algorithm. Provider-specific. Used with the CREATE MINING MODEL command to specify the algorithm. |
| SERVICE_TYPE_ID | DBTYPE_UI4 | A bitmask that describes mining service types. The list includes known popular mining services, such as the following: DM_SERVICETYPE_CLASSIFICATION (0x0000001), DM_SERVICETYPE_CLUSTERING (0x0000002), DM_SERVICETYPE_ASSOCIATION (0x0000004), DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008), DM_SERVICETYPE_SEQUENCE (0x0000010). |
| PREDICTED_CONTENT | DBTYPE_WSTR | The attribute types that can be predicted. This is a comma-delimited list of content types. |
| PREDICTION_LIMIT | DBTYPE_UI4 | The maximum number of predictions the model and algorithm can provide; 0 means no limit. |
| SUPPORTED_DISTRIBUTION_FLAGS | DBTYPE_WSTR | A comma-delimited list of one or more of the following: NORMAL, LOG_NORMAL, UNIFORM, BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION. Provider-specific flags may also be defined. |
| SUPPORTED_INPUT_CONTENT_TYPES |  | KEY, DISCRETE, CONTINUOUS, DISCRETIZED, ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_VARIANCE, PROBABILITY_STDEV, ORDER, SEQUENCE, TABLE |
| SUPPORTED_PREDICTION_CONTENT_TYPES |  | DISCRETE, CONTINUOUS, DISCRETIZED, ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_VARIANCE, PROBABILITY_STDEV, ORDER, TABLE |
| SUPPORTED_MODELING_FLAGS |  | MODEL_EXISTENCE_ONLY, NOT NULL |
| TRAINING_COMPLEXITY | DBTYPE_I4 | Indication of expected time for training. DM_TRAINING_COMPLEXITY_LOW: running time is proportional to input and is relatively short. DM_TRAINING_COMPLEXITY_MEDIUM: running time may be long but is generally proportional to input. DM_TRAINING_COMPLEXITY_HIGH: running time is long and may grow exponentially in relationship to input. |
| PREDICTION_COMPLEXITY | DBTYPE_I4 | Indication of expected time for prediction. DM_PREDICTION_COMPLEXITY_LOW: running time is proportional to input and is relatively short. DM_PREDICTION_COMPLEXITY_MEDIUM: running time may be long but is generally proportional to input. DM_PREDICTION_COMPLEXITY_HIGH: running time is long and may grow exponentially in relationship to input. |
| EXPECTED_QUALITY | DBTYPE_I4 | Indication of expected quality of model produced with this algorithm: DM_EXPECTED_QUALITY_LOW, DM_EXPECTED_QUALITY_MEDIUM, DM_EXPECTED_QUALITY_HIGH. |
| ALLOW_INCREMENTAL_INSERT | DBTYPE_BOOL | TRUE if additional INSERT INTO statements are allowed after the initial training. |
| ALLOW_DUPLICATE_KEY | DBTYPE_BOOL | TRUE if cases may have duplicate keys. |
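Because SERVICE_TYPE_ID is a bitmask, a client tests it with bitwise operations rather than by equality. The following sketch uses only the five values listed in the table above; the function names are assumptions made for illustration, and a real provider may define additional bits.

```python
# Bit values as listed for SERVICE_TYPE_ID in the mining services schema rowset.
SERVICE_TYPES = {
    0x0000001: "DM_SERVICETYPE_CLASSIFICATION",
    0x0000002: "DM_SERVICETYPE_CLUSTERING",
    0x0000004: "DM_SERVICETYPE_ASSOCIATION",
    0x0000008: "DM_SERVICETYPE_DENSITY_ESTIMATE",
    0x0000010: "DM_SERVICETYPE_SEQUENCE",
}


def decode_service_types(mask):
    """Return the names of all known service types whose bit is set in mask."""
    return [name for bit, name in SERVICE_TYPES.items() if mask & bit]


def supports(mask, bit):
    """Return True if every bit in `bit` is set in mask."""
    return (mask & bit) == bit
```

For example, a service reporting the value 0x3 would be decoded as both a classification and a clustering service.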



2.2.2 Defining a New Mining Model

Defining a new model is done using a CREATE MINING MODEL statement. Similar to the CREATE TABLE statement, the creation of a DMM defines only its structure and properties. It does not define the specific content (the learned graphical structure), which will be created only when the DMM is populated. (See below.)

The CREATE MINING MODEL statement will define the following:

1. The DMM columns

2. The specific algorithm to be used in the DMM

The syntax used to define the DMM columns is similar to the syntax used to define the columns in a table object, as follows:

CREATE MINING MODEL <mining model name> (<Column definitions>) USING <Service>[(<service arguments>)]

However, since the columns of a DMM require a lot of specialized information, some extensions were added to the standard SQL syntax. Following is a statement example that applies to the case structure illustrated in Section 1.3:

CREATE MINING MODEL [Age Prediction]
(
    [Customer ID]       LONG KEY,
    [Gender]            TEXT DISCRETE,
    [Hair Color]        TEXT DISCRETE,
    [Age]               DOUBLE DISCRETIZED() PREDICT,
    [Age Probability]   DOUBLE PROBABILITY OF [Age],
    [Product Purchases] TABLE
    (
        [Product Name]  TEXT KEY,
        [Quantity]      DOUBLE NORMAL CONTINUOUS,
        [Product Type]  TEXT RELATED TO [Product Name]
    ),
    [Car Ownership]     TABLE
    (
        [Car Name]      TEXT KEY,
        [Probability]   DOUBLE PROBABILITY OF [Car Name]
    )
)
USING [Microsoft_Decision_Trees]

As the example shows, the definition includes the following information for each column:

Name (mandatory)

Data type (mandatory)—a special data type exists for tables contained in a case (TABLE)

List of column type flags and modeling flags


Relationship to an attribute column (mandatory only if applies)—indicated by the RELATED TO or OF clauses

Prediction request (that is, indication to the algorithm to predict this column)—indicated by the PREDICT or PREDICT_ONLY string

While a complete BNF for this grammar is given in Appendix B, following are a few interesting points:

The syntax allows for explicit definition of "Table Columns." "Product Purchases" and "Car Ownership" are both columns that contain a full table each.

A potential list of supported data types is as follows: LONG, DOUBLE, TEXT, DATE, BOOL, and TABLE. For a list of the data types supported by the provider, see the PROVIDER_TYPES schema rowset in Appendix B of the OLE DB Programmer's Reference.

The Discretized function cuts the value range of a continuous variable to a number of buckets. The syntax for the Discretized attribute type is as follows: Discretized([method[,n]]). Both arguments are optional, but parentheses are always required and a value must be given for "method" in order to supply a value for "n". The "n" argument is the recommended number of buckets that the discretization method should try to find to divide up the values of the column. Each provider will have a reasonable default. The "method" argument describes the algorithm that the provider should use to find the buckets. All providers should support the method DEFAULT as the default. Other possible provider-specific algorithms could be AUTOMATIC, EQUAL_AREAS, THRESHOLDS, CLUSTERS, and so forth.

A column may have missing values. There are different ways to deal with missing values. The easy way is to ignore them, but sometimes missing values can be informative, and thus it is often beneficial to model the missing state. Users can specify how to deal with missing values in the column definition statement. For example, Gender TEXT DISCRETE NULL IGNORE means to ignore the missing state in the Gender column. The following is a list of possible ways to specify missing value treatment:

NOT NULL: The column should not contain missing values; otherwise it returns an error during the model training stage.

IGNORE NULL: Ignore the missing value.

NULL INFORMATIVE: Data mining algorithm will model the missing state.

The default option is NULL INFORMATIVE. After the column definition, the statement indicates the type of algorithm to be used. Only one of the services listed by the provider in the services schema rowset can be used.
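The argument rule given above for the Discretized type flag (both arguments optional, parentheses always required, and "method" required whenever "n" is supplied) can be captured in a small helper that builds the flag text for a column definition. This is a sketch only: the function name discretized_flag is hypothetical, and EQUAL_AREAS is used merely as one of the provider-specific method names the text mentions.

```python
def discretized_flag(method=None, n=None):
    """Build the DISCRETIZED([method[, n]]) type flag for a column definition.

    `n` (the recommended bucket count) may only be given together with `method`.
    """
    if n is not None and method is None:
        raise ValueError('a value for "method" is required in order to supply "n"')
    if method is None:
        return "DISCRETIZED()"          # provider defaults for both arguments
    if n is None:
        return f"DISCRETIZED({method})"  # provider default bucket count
    return f"DISCRETIZED({method}, {n})"


# Example column definition fragment using the helper.
age_column = f"[Age] DOUBLE {discretized_flag('EQUAL_AREAS', 5)} PREDICT"
```

The resulting fragment can be placed in the column list of a CREATE MINING MODEL statement; whether a given method name is accepted depends on the provider.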

The USING clause can be followed by a PARAMETERS clause containing provider-specific pairs of parameter-value settings. The SERVICE_PARAMETERS schema rowset contains a list of parameters supported by the provider. A full description of this schema rowset is provided in Appendix A. Algorithm providers define the names of their parameters. However, we suggest the following list of parameters, which may be used in many algorithms:



HOLDOUT_PERCENTAGE: The percentage of data that is held out during the training stage. This data may be used in validation or test phase.

HOLDOUT_SEED: The seed used to hold out data.

SAMPLE_PERCENTAGE: The percentage of data that is selected after sampling.

SAMPLE_SEED: The seed used in sampling data.

When a CREATE MINING MODEL statement is executed, the model is created and will appear in the schema rowsets of the provider. However, since data has not been inserted into the model, the model cannot be used for any kind of useful analysis. The client can use the MODEL_STATE column in the mining models schema rowset to get this indication.

2.2.3 Copying a Mining Model

Sometimes you may want to run multiple algorithms against the same source data and model column structure. The OLE DB for DM specification provides a mechanism that allows you to easily create a new model from an existing model.

SELECT * INTO <new model> USING <model type> [( <parameter list> )] FROM <model>

The new model will contain all information from the existing model that is not specific to the actual algorithm. Executing this statement will cause the new model to be trained using the same training query as the existing model. If the existing model is not trained, only the structure of the model will be copied.

2.2.4 Creating a Mining Model from Predictive Model Markup Language (PMML)

Because all of the structure and content of a DMM may be expressed as an XML string in the Predictive Model Markup Language (PMML) format (see Appendix D), it is conceivable that the expert user can use such a string as the basis for the creation of a model. This string could be a modified version of the string retrieved from another model. (See the MODEL_PMML column of the MODEL_CONTENT_PMML schema rowset.) Changes to the XML string will typically allow manipulation of the content nodes. The changes may include pruning the tree, adding other nodes, or changing the rules described in the nodes.

A provider does not have to support initialization based on a PMML document. To discover whether the provider supports this capability, the services schema rowset offers the ALLOW_PMML_INITIALIZATION column.

To create a new model from PMML, use a modified version of the CREATE MINING MODEL statement, as follows:

CREATE MINING MODEL <mining model name> FROM PMML <xml string>



2.3 Finding Existing Mining Models

Data mining models are exposed in the mining models schema rowset. This rowset can be viewed as an enhanced version of the TABLES schema rowset because it contains all of the same types of information. In addition, several DMM-specific columns have been added to the rowset. A complete description of the MINING_MODELS schema rowset can be found in Appendix A; the following table describes some of the interesting columns.


| Column Name | Type Indicator | Description |
|---|---|---|
| MODEL_NAME | DBTYPE_WSTR | Model name. This column cannot contain NULL. |
| SERVICE_TYPE_ID | DBTYPE_UI4 | A bitmask that describes mining service types. The list includes known popular mining services, such as the following: DM_SERVICETYPE_CLUSTERING (0x0000002), DM_SERVICETYPE_ASSOCIATION (0x0000004). |
| SERVICE_NAME | DBTYPE_WSTR | A provider-specific name that describes the algorithm used to generate the model. |
| CREATION_STATEMENT | DBTYPE_WSTR | Optional. The statement used to create the original data mining model. |
| PREDICTION_ENTITY | DBTYPE_WSTR | A comma-delimited list indicating which columns the model can predict. |
| IS_POPULATED | DBTYPE_BOOL | VARIANT_TRUE if the model is populated. VARIANT_FALSE if the model is not populated. An empty model has a defined structure but has not been "trained" with data. |



2.4 Browsing Model Column Definition

Once an interesting DMM has been identified, you may want to explore its structure. The structure of a DMM is similar to the structure of a table that is represented as a set of columns. Like columns of a table, the structure represents the kind of inputs and outputs that the DMM can provide. Like a table, the structure is independent of the specific data instances that were or will be input into it. In fact, the structure of a DMM is described using a schema rowset that is derived from the COLUMNS schema rowset (see Appendix B of the OLE DB Programmer's Reference), with new columns added to support data mining operations.

2.4.1 Input Columns

The structure of the DMM is described by the inputs that are used to describe a case and by the set of possible predictions that can be selected from the model. This structure is described in the MINING_COLUMNS schema rowset. Data mining providers must support all mandatory columns, as defined by the OLE DB for DM specification.

The section on The Columns Structure of a DMM in part one of this document describes the data types, content types, and other interesting flags that describe the columns of a DMM. Several columns in the MINING_COLUMNS schema rowset (the complete description can be found in Appendix A) describe these properties of a model column. The following table describes some interesting columns from that rowset.


| Column Name | Type Indicator | Description |
|---|---|---|
| COLUMN_NAME | DBTYPE_WSTR | The name of the column; this might not be unique. If this cannot be determined, a NULL is returned. |
| DATA_TYPE | DBTYPE_UI2 | The indicator of the column's data type, for example: "TABLE" = DBTYPE_HCHAPTER, "TEXT" = DBTYPE_WCHAR, "LONG" = DBTYPE_I8, "DOUBLE" = DBTYPE_R8, "DATE" = DBTYPE_DATE. |
| DISTRIBUTION_FLAG | DBTYPE_WSTR | One of the following: NORMAL, LOG_NORMAL, UNIFORM, BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION. |
| CONTENT_TYPE | DBTYPE_WSTR | One of the following: KEY, DISCRETE, CONTINUOUS, DISCRETIZED([args]), ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_STDEV, ORDER, SEQUENCE. |
| MODELING_FLAG | DBTYPE_WSTR | A comma-delimited list of flags. The defined flags are: NOT NULL. |
| RELATED_ATTRIBUTE | DBTYPE_WSTR | The name of the target column that the current column either relates to or is a special property of. |
| CONTAINING_COLUMN | DBTYPE_WSTR | Name of the TABLE column containing this column. NULL if the column is not contained in any table. |

2.4.2 Prediction Columns

ATTRIBUTE or TABLE type columns can be input columns, output columns, or both. The data mining provider will build a DMM capable of predicting or explaining output column values based on the values of the input columns. In the CREATE MINING MODEL command syntax, output columns are identified with the PREDICT or the PREDICT_ONLY keyword. Marking a column for prediction (or not) has various implications for usage in the model, as described in the following table.

| Prediction Flag in Command | Input | Output | Description |
|---|---|---|---|
| PREDICT_ONLY | No | Yes | Input column values will be used to predict this column's values. This column's values will not be used to predict other columns. |
| PREDICT | Yes | Yes | Input column values will be used to predict this column's values. This column's values will be used to predict predictable columns. |
| (None mentioned) | Yes | No | This column's values will be used to predict predictable columns. |



The following table lists two additional columns in the MINING_COLUMNS schema rowset that describe the input/output state of a column.


IS_INPUT DBTYPE_BOOL VARIANT_TRUE if this is an input column.

IS_PREDICTABLE DBTYPE_BOOL VARIANT_TRUE if this is an output column.

Any TABLE column containing a predictable column will itself become predictable.
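The relationship between the prediction keyword and the IS_INPUT and IS_PREDICTABLE columns can be summarized in a few lines of Python. This is an illustrative sketch only; the function name column_usage is hypothetical and is not part of OLE DB for DM.

```python
# Hypothetical helper: maps the prediction keyword used in CREATE MINING MODEL
# to the (IS_INPUT, IS_PREDICTABLE) pair reported by MINING_COLUMNS.
def column_usage(prediction_flag=None):
    """Return (is_input, is_predictable) for a column's prediction keyword."""
    usage = {
        "PREDICT_ONLY": (False, True),  # output only
        "PREDICT": (True, True),        # both input and output
        None: (True, False),            # no keyword: input only
    }
    if prediction_flag not in usage:
        raise ValueError(f"unknown prediction flag: {prediction_flag!r}")
    return usage[prediction_flag]
```

A client could use such a mapping to check that the flags it reads from the schema rowset agree with the command that created the model.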

The MINING_COLUMNS schema rowset has additional columns that indicate the kind of additional information that can be found in the prediction of a predictable column and what extraction functions on the predictable column are supported. These additional columns apply only to output columns (that is, when IS_PREDICTABLE is set to TRUE).


PREDICTION_SCALAR_FUNCTIONS DBTYPE_WSTR A comma-delimited list of scalar functions that may be performed on the column.

PREDICTION_TABLE_FUNCTIONS DBTYPE_WSTR A comma-delimited list of functions that may be applied to the column, returning a table. The list has the following format:

<function name>(<column1> [, <column2>], ...)

The format allows the client to determine which columns will be present in the table returned by any given function.
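As an illustration of how a client might read this format, the following Python sketch splits a PREDICTION_TABLE_FUNCTIONS value into function names and their column lists. The function name parse_table_functions and the sample value in the usage note are assumptions made for the example; actual values depend on the provider.

```python
import re

def parse_table_functions(value):
    """Parse 'Name(col1, col2), Other(col3)' into {name: [columns]}."""
    functions = {}
    # Each entry is a function name followed by a parenthesized column list.
    for name, args in re.findall(r"([^\s,()]+)\s*\(([^()]*)\)", value):
        functions[name] = [a.strip() for a in args.split(",") if a.strip()]
    return functions
```

For a hypothetical value such as "PredictHistogram($Probability, $Support), TopCount($Probability)", the result maps each function name to the columns that its returned table would contain.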

2.5 Populating the Mining Model

After the structure of the DMM is defined, you can use the INSERT INTO command to populate the model with training data. This command corresponds closely to the common relational database operation INSERT, which populates a table with data.

The model population stage will run the training data through the data mining algorithm and will generate a predictive model (referred to in this document as the DMM content).

Notice that although massive quantities of data are fed into the DMM, the DMM usually will not store any of the data and will retain only the DMM content and distinct column values after the process is done.

The population step may involve intensive processing of the data and can take a long time. A notification mechanism is available to follow the progress of the algorithm, and the OLE DB asynchronous execution cancellation interfaces are also available. Specifically, for commands that do not return a rowset, the DM provider's command object should return an object that supports the following interfaces: IDBAsynchStatus and IConnectionPointContainer (allowing users to get a connection point for the IDBAsynchNotify interface).

2.5.1 Inserting Cases

The command syntax for populating the DMM with data is identical to the population of a relational table with data in SQL. The basic syntax has the form:

INSERT [INTO] <mining model name> [<mapped model columns>] <source data query>

As is described in the following sections, various syntaxes can be used to specify the <source data query>. Regardless of which syntax is used, the column binding between the target DMM and the source query is done by column order, as is the standard with the INSERT INTO statement, or the command may specify an explicit mapping from source data columns into DMM columns using the <mapped model columns> clause. Because not every <source data query> syntax (for example, the SHAPE syntax) allows complete control over the set of columns that is returned, using the keyword SKIP in the INTO clause indicates columns that must be present in the source data query but have no meaning to the DMM. Once the DMM is populated, the client application can browse its content and perform queries to predict new data points.
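The following Python sketch shows one way a client might assemble such a command, including the SKIP keyword and nested TABLE columns. The names SKIP, render_column, and build_insert are hypothetical, and the sketch assumes that column names do not themselves contain a closing bracket.

```python
SKIP = object()  # marks a source column that has no meaning to the DMM

def render_column(column):
    """Render one entry of the <mapped model columns> clause."""
    if column is SKIP:
        return "SKIP"
    if isinstance(column, tuple):  # (TABLE column name, nested columns)
        name, nested = column
        inner = ", ".join(render_column(c) for c in nested)
        return f"[{name}] ({inner})"
    return f"[{column}]"

def build_insert(model, columns, source_query):
    """Compose INSERT INTO <model> (<mapped model columns>) <source data query>."""
    mapped = ", ".join(render_column(c) for c in columns)
    return f"INSERT INTO [{model}] ({mapped}) {source_query}"
```

Because binding is by column order, the list passed as columns must follow the order of the columns returned by the source data query.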

2.5.2 Populating the Column Values

In general, the DMM will learn the available set of distinct column values while training. However, there are instances when it is preferable or necessary to explicitly train these values independently of the model.

ORDERED or CYCLICAL attributes—The model may depend on a certain order of discrete attribute values; for example, Monday < Tuesday. The training data cannot be guaranteed to introduce the values in that order.

Value hierarchies—Related columns introduce value hierarchies that would have to be described every time the attribute is used. For example, it is not necessary to tell the DMM that "Beer" is of type "Beverage" each time it appears in the training data.

To train a column, OLE DB for DM specifies the following syntax:

INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>) <source data query>

Unlike the model itself, the column values are incrementally trainable. Individual columns can be trained separately and repeatedly to add more values. However, if there are relationships between columns through the RELATED TO clause in the CREATE MINING MODEL statement, these columns must be trained together, as in the following example:



INSERT INTO [Age Prediction].COLUMN_VALUES(Gender)
OPENROWSET('SQLOLEDB', '…', 'SELECT DISTINCT Gender FROM Customers')

INSERT INTO [Age Prediction].COLUMN_VALUES([Product Purchases].[Product Name], [Product Purchases].[Product Type])
OPENROWSET('SQLOLEDB', '…', 'SELECT DISTINCT [Product Name], [Product Type] FROM Sales')

INSERT INTO [Age Prediction].COLUMN_VALUES(SKIP, [Month])
OPENROWSET('SQLOLEDB', '…', 'SELECT MonthID, Month FROM Months ORDER BY MonthID')

When the column values have been trained, the client application can browse those values but cannot yet perform queries or browse model content. Also, since all column-value relationships are now known, all RELATED TO columns can be omitted from the model-training query.

2.6 Source Data

The <source data query> part of the INSERT (see "Populating the Mining Model") and SELECT FROM PREDICTION JOIN (see "Querying—Applying Mining Models on New Data") commands can be any of the sources described by the SUPPORTED_SOURCE_QUERY column of the MINING_SERVICES schema rowset described in Appendix A. The possible values for this column are as follows:

SINGLETON CONSTANT

SINGLETON SELECT

OPENROWSET

SELECT

SHAPE

The meaning of each of these constants is described in more detail in the following sections.

If the data-mining provider is embedded in a relational provider that supports nested tables (also known as table columns), the entire population process could occur under the aegis of a single provider. However, it is expected that at first the DM providers will be separated from the relational providers and that the relational providers usually will not natively support nested tables.

This specification offers suggested ways to overcome these issues. Data mining providers are strongly encouraged to support at least one of the methods discussed in the following sections and must publish which methods they support in the MINING_SERVICES schema rowset.

2.6.1 SINGLETON CONSTANT as Source Data

If the provider supports SINGLETON CONSTANT as a SUPPORTED_SOURCE_QUERY value from the MINING_SERVICES schema rowset, a syntax allowing specification of cases as a set of constant values is supported in place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN commands.



<singleton constant> ::= (<value or set of values> [,<value or set of values>] )

<value or set of values> ::= <value> | (<set of values>)

For example, the following could be a valid syntax to supply a set of values:

('1', 'Male', (('TV', 1), ('VCR', 2)), (('Van'), ('Truck')))
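The recursive structure of this grammar can be illustrated with a short Python sketch that renders a nested tuple as a singleton constant. The function names are hypothetical, and the sketch assumes that string values contain no embedded quotation marks, because the grammar above does not describe escaping.

```python
def render_value(value):
    """Render a single <value>; strings are wrapped in single quotes."""
    if isinstance(value, str):
        return "'" + value + "'"  # assumption: no embedded quotes to escape
    return str(value)

def render_constant(case):
    """Render a <singleton constant> from nested tuples or lists."""
    parts = []
    for item in case:
        if isinstance(item, (list, tuple)):
            parts.append(render_constant(item))  # a nested set of values
        else:
            parts.append(render_value(item))
    return "(" + ", ".join(parts) + ")"
```

Calling render_constant(('1', 'Male', (('TV', 1), ('VCR', 2)), (('Van',), ('Truck',)))) produces the example string shown above.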

Although the syntax is identical, the (<singleton constant list>) used by the INSERT INTO VALUES command syntax is not the same as replacing <source data query> with a singleton constant data source object. (The only syntax difference is the word "VALUES." However, inserting a constant row by using the word VALUES is standard SQL, and accepting a constant list as a general replacement for a table is not.)

2.6.2 SINGLETON SELECT as Source Data

If the provider supports SINGLETON SELECT as a SUPPORTED_SOURCE_QUERY value from the MINING_SERVICES schema rowset, a syntax allowing specification of cases as a selection of constant values is supported in place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN commands.

The syntax has the following form:

<singleton select> ::= <compound constant select> as <alias>

<compound constant select> ::= <constant select> | <compound constant select> UNION <compound constant select>

<constant select> ::= (SELECT <alias constant list>)

<alias constant list> ::= <alias constant element> |<alias constant list>, <alias constant element>

<alias constant element> ::= <CONSTANT> |<CONSTANT> as <alias> |<singleton select>

For example, the following could be valid syntaxes to supply a set of values:

(SELECT 21 as Age, 'Male' as Gender) as Case

(SELECT 21 as Age, 'Male' as Gender, ((SELECT 'ham' as Product, 10 as Qty) UNION (SELECT 'beer' as Product, 1 as Qty)) as Purchases) as Case
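As a sketch of how a client might generate this syntax, the following Python functions render a case held in a dictionary, where a list of dictionaries stands for a nested table. The function names are hypothetical, and string values are assumed to contain no embedded quotation marks.

```python
def render_select(row):
    """Render one <constant select>: (SELECT <constant> as <alias>, ...)."""
    items = []
    for alias, value in row.items():
        if isinstance(value, list):  # nested table: a list of row dictionaries
            union = " UNION ".join(render_select(r) for r in value)
            items.append(f"({union}) as {alias}")
        elif isinstance(value, str):
            items.append(f"'{value}' as {alias}")
        else:
            items.append(f"{value} as {alias}")
    return "(SELECT " + ", ".join(items) + ")"

def render_singleton_select(row, alias):
    """Render a complete <singleton select> with its alias."""
    return f"{render_select(row)} as {alias}"
```

Rendering {"Age": 21, "Gender": "Male"} with the alias Case yields the first example above; adding a "Purchases" entry that holds a list of two rows yields the second.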



2.6.3 OPENROWSET as Source Data

If the provider supports OPENROWSET as a SUPPORTED_SOURCE_QUERY value from the MINING_SERVICES schema rowset, a syntax allowing cases to result from an OPENROWSET of an external command is supported in place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN commands.

Since many of the DM providers will not be embedded within the RDBMS containing the source data, the <source data query> will most likely need to read data from another data source. The OPENROWSET function supports this functionality and has the following basic syntax:

OPENROWSET('provider_name','provider_string','query_syntax')

The 'provider_name' is an OLE DB provider name, the 'provider_string' is the OLE DB connection string for that provider, and the 'query_syntax' is a query that returns a rowset (either simple or using SHAPE). The DM provider will establish a connection to the data source object using the 'provider_name' and 'provider_string' and will execute the query specified in 'query_syntax' to retrieve the source data rowset.

The complete syntax for OPENROWSET is described in Appendix F.

2.6.4 SELECT as Source Data

If the provider supports SELECT as a SUPPORTED_SOURCE_QUERY value from the MINING_SERVICES schema rowset, the standard SQL SELECT command is supported in place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN commands.

2.6.5 SHAPE as Source Data

If the provider supports SHAPE as a SUPPORTED_SOURCE_QUERY value from the MINING_SERVICES schema rowset, a syntax allowing specification of cases as a SHAPE of related queries is supported in place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN commands.

A single query to most popular relational providers cannot return the nested tables shaped result set that is needed for the population of many DMMs. Therefore, multiple queries must be executed in the data source to retrieve all of the data that a case represents. The queries must be shaped into a nested table form to feed them into the DMM.
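Conceptually, shaping attaches to each parent row the child rows whose key matches. The following Python sketch illustrates that idea on in-memory rows; it is not how any provider implements SHAPE, and the function name shape and its parameters are hypothetical.

```python
def shape(parents, children, parent_key, child_key, nested_name):
    """Attach to each parent row the list of child rows with a matching key."""
    by_key = {}
    for child in children:
        by_key.setdefault(child[child_key], []).append(child)
    cases = []
    for parent in parents:
        case = dict(parent)  # copy so the input rows are left unchanged
        case[nested_name] = by_key.get(parent[parent_key], [])
        cases.append(case)
    return cases
```

Each returned dictionary then represents one case, with the nested table stored under nested_name; a parent with no matching child rows receives an empty nested table.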

OLE DB for DM provides a number of alternatives for performing this operation, including the following:

Use of the MDAC Data Shaping Service. The Data Shaping Service is an OLE DB provider that can be layered on top of other providers. In OLE DB for DM, it can be invoked via OPENROWSET as follows:



INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Cars Owned], [Probability])
)
OPENROWSET('MSDataShape', 'Data Provider=SQLOLEDB',
    'SHAPE {
        SELECT [Customer ID], [Gender], [Age], [Age Probability] FROM [Customers]
    }
    APPEND (
        {SELECT [CustID], [Product Name], [Product Type], [Quantity]
         FROM [Customer Product Sales]}
        RELATE [Customer ID] TO [CustID]
    ) AS [Product Purchases],
    (
        {SELECT [CustID], [Car Name], [Probability]
         FROM [Customer Cars]}
        RELATE [Customer ID] TO [CustID]
    ) AS [Car Ownership]'
)

Note: OPENROWSET can be used to direct the query to any provider, so any syntax can be used as long as the relevant provider supports it. At this time, there is no standard SQL syntax to query a nested table. Until such a standard is established, it is likely that different relational database vendors will create unique and incompatible syntaxes.

Integrated support for the SHAPE syntax. Some DM providers may choose to adopt the SHAPE command syntax and provide integrated support for it within the data mining provider. With these providers, the SHAPE command does not need to be executed within the context of an OPENROWSET command:

INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Car Name], [Car Probability])
)
SHAPE {
    OPENROWSET('SQLOLEDB', 'catalog=Sales',
        'SELECT [Customer ID], [Gender], [Age], [Age Probability] FROM [Customers] ORDER BY [Customer ID]')
}
APPEND (
    {OPENROWSET('SQLOLEDB', 'catalog=Sales',
        'SELECT [CustID], [Product Name], [Product Type], [Quantity] FROM [Customer Product Sales] ORDER BY [CustID]')}
    RELATE [Customer ID] TO [CustID]
) AS [Product Purchases],
(
    {OPENROWSET('SQLOLEDB', 'catalog=Sales',
        'SELECT [CustID], [Car Name], [Probability] FROM [Customer Cars] ORDER BY [CustID]')}
    RELATE [Customer ID] TO [CustID]
) AS [Car Ownership]

Note: Appendix E contains more detail on the SHAPE command syntax. Provider support of the SHAPE command will likely depend on the explicit ordering of the input data.

Native support for nested tables. In time, data mining providers may become integrated with relational providers capable of fully supporting nested tables. Such providers might adopt their own syntax for specifying nested tables. OLE DB for DM does not preclude support for such syntax.

2.7 Browsing Mining Model Content

Listing the column structure of a DMM is one kind of browsing; a very different kind is navigating the graphical content of the model. The content of a DMM is learned by the data mining algorithm from a set of input cases. It is the set of rules, formulas, classifications, distributions, nodes, or any other information that was derived from a specific set of data using a data mining technique.

Depending on the specific data mining technique used in the creation of the DMM, the content type may differ from one model to the other. The DMM content of a decision tree–based classification will differ from a segmentation model, which, in turn, is very different from a multiregression DMM.

Browsing the content can provide important insight into the data. In many cases it allows you to understand the patterns and rules that can be used to predict new data points. You must be aware, however, that some DMMs do not support a way to express DMM content.

One of the ways to browse the content of the DMM is to extract an XML description of it. The XML description of the contents can be found in the TABLES schema rowset. The format of the XML string is provided in Appendix D. The XML string provides an easy way to get, store, manipulate, and re-create all of the DMM information. However, this format requires significant expertise from the client application to navigate the content.



The most popular way to express DMM content is by using a directed graph (that is, a tree of nodes). A decision tree is the classic example. Each node in the tree may have relationships to other nodes. A node may have one or more parent nodes and zero or more child nodes. The depth of the graph may vary depending on the specific node.

Tree navigation is already defined in the OLE DB for OLAP specification, and a similar navigation mechanism is adopted for traversing DMM nodes. The MINING_MODEL_CONTENT schema rowset described in Appendix A provides a rich functional set of navigation operations.

Querying the model directly will also return the MINING_MODEL_CONTENT rowset. The following query provides a result table with the exact structure of the MINING_MODEL_CONTENT schema rowset:

SELECT * FROM <mining model>.CONTENT

This allows the relational database to expose the set of DMM nodes without requiring custom OLE DB coding.

2.8 Browsing All Possible Cases and Distinct Column Values

When a mining model is trained, it will encounter in the set of training cases a distinct set of possible values or "states" that the attributes of the model can take on.

For example, consider a DMM with the following columns: Gender, Age and HairColor. After this DMM has been trained, the Gender column should end up knowing about the states "Male," "Female," and "Missing." (For completeness, assume that all attributes, even those with continuous domains, can take on the "Missing" state. This is true even when NULL or missing values are not encountered in the training data.) For HairColor, the DMM sees and remembers the values "Black," "Gray," and "Missing." Although the DMM has seen all of the values for the continuous attribute column Age, it does not remember every distinct value for the column. Instead, it learns the minimum, mean, and maximum values for the column.

If the example model was built to predict the HairColor column from a set of 100 people, browsing the contents of the DMM might show the following tree structure:



The set of all possible cases contained in a DMM has one entry for every possible combination of the distinct values for each attribute. For discrete attributes, this is a list of the distinct values seen in the column (plus the "Missing" state). For continuous attributes, the "Minimum," "Maximum," "Mean," and "Missing" states are reported. For Discretized attributes, the buckets found during discretization are listed, and the value returned for each bucket is the midpoint between its upper and lower bounds. Use of the SELECT command on the DMM reports these possible cases. Along with each possible case, the DMM can report statistics learned for the attributes that it has been built to predict.
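The set of all possible cases is the Cartesian product of the states reported for each column. The following Python sketch enumerates it for the example model, using None to stand for the "Missing" state; the function names are hypothetical, and the sketch says nothing about how a provider stores or generates these cases.

```python
from itertools import product

def possible_cases(states):
    """Return one dictionary per combination of the states of each column."""
    names = list(states)
    combos = product(*(states[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def bucket_midpoint(low, high):
    """Value reported for a discretized bucket: midpoint of its bounds."""
    return (low + high) / 2

# States of the example model; None stands for "Missing".
states = {
    "Gender": ["Male", "Female", None],
    "Age": [2, 91, 45, None],  # minimum, maximum, mean, Missing
    "HairColor": ["Black", "Gray", None],
}
cases = possible_cases(states)
```

With three Gender states, four Age states, and three HairColor states, the example model has 3 × 4 × 3 = 36 possible cases.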

In the example, the following command and results (shown in the following table) are possible:

SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM

Gender Age HairColor P(HairColor)

Male 2 Black .667

Male 2 Gray .267

Male 2 NULL .067

Male 91 Black .300

Male 91 Gray .625

Male 91 NULL .075

Male 45 Black .667

Male 45 Gray .267




Male 45 NULL .067

Male NULL Black .600

Male NULL Gray .350

Male NULL NULL .05

Female 2 Black .933

Female 2 Gray .067

Female 2 NULL .000

Female 91 Black .300

Female 91 Gray .625

Female 91 NULL .075

Female 45 Black .933

Female 45 Gray .067

Female 45 NULL .000

Female NULL Black .600

Female NULL Gray .350

Female NULL NULL .05

NULL 2 Black .800

NULL 2 Gray .167

NULL 2 NULL .033

NULL 91 Black .300

NULL 91 Gray .625

NULL 91 NULL .075

NULL 45 Black .800

NULL 45 Gray .167

NULL 45 NULL .033

NULL NULL Black .600

NULL NULL Gray .350

NULL NULL NULL .05



Providers may support a WHERE clause on this command to filter the resulting set of all possible cases, as shown in the following example and results table:

SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM WHERE Gender = 'Male' AND HairColor = 'Black'


Male 2 Black .667

Male 91 Black .300

Male 45 Black .667

Male NULL Black .600

2.8.1 Finding Distinct Column Values

To find the list of possible values against which a column from a DMM can be compared, use a command with the SELECT DISTINCT syntax from SQL, as in the following example:

SELECT DISTINCT HairColor FROM HairColorPredictDMM

HairColor

Black

Gray

NULL

As expected, selecting distinct combinations of columns will report rows for only the possible combinations of the selected columns' values.

SELECT DISTINCT Gender, HairColor FROM HairColorPredictDMM



Gender HairColor

Male Black

Male Gray

Male NULL

Female Black

Female Gray

Female NULL

NULL Black

NULL Gray

NULL NULL

In theory, you could select TABLE type columns from a DMM that contains nested tables. In practice, however, such an operation is not feasible, because the set of possible values for a table-valued column is the set of all conceivable tables having every possible combination of the keys for that nested table. Although this is the conceptual "truth table" content of the DMM, no provider should be expected to manifest this set of records.

However, selecting distinct column values from a set of all possible nested table cases is often a useful task. Consider the larger example from Section 1.3 that contained a nested table of product purchases. The following command produces a list of the distinct product names that a customer may purchase:

SELECT DISTINCT [Product Purchases].[Product Name] FROM [Age Prediction]

Note that this syntax uses the "." operator to refer to a column from the scope of a nested table.

Furthermore, you can determine relationships between trained column values with a WHERE clause. In the larger example, product names were classified by product type. To find the products of a certain type, consider the following command:

SELECT DISTINCT [Product Purchases].[Product Name] FROM [Age Prediction]
WHERE [Product Purchases].[Product Type] = 'Electronic'

This will return a list of all Product Names with which the model was trained that have a corresponding type of "Electronic."



2.9 Querying—Applying Mining Models on New Data

Prediction queries on a DMM allow you to predict attributes that may be missing from new cases. To perform a query, you need a populated (that is, already trained) DMM and a set of new cases to predict (generally not the cases upon which the DMM was trained).

2.9.1 Components of a Prediction Query

Predictions are retrieved from a DMM with a SELECT command. (The complete syntax for the OLE DB for DM–compliant SELECT statement is presented in Appendix B.)

SELECT [FLATTENED] <SELECT-expressions>
FROM <mining model name> PREDICTION JOIN <source data query>
ON <join condition>
[WHERE <WHERE-expression>]

2.9.1.1 Source Data Query

The <source data query> clause identifies the set of new cases that will have attributes predicted by combining this set with the learned knowledge in the DMM. For information on source data queries, please see the section "Source Data."

2.9.1.2 PREDICTION JOIN

When retrieving predictions from a DMM, the actual cases from <source data query> are matched up with the set of all possible cases from the model (<mining model name>) via a PREDICTION JOIN operation. See "Browsing All Possible Cases and Distinct Column Values" for an explanation of the possible cases contained in a DMM. For the following simple reasons, the matching of source cases to all possible cases with a PREDICTION JOIN does not follow the semantics of a standard relational JOIN:

The DMM cases do not represent every possible value of a continuous column, but a PREDICTION JOIN must match an exact continuous value from the source case to some learned distribution in the DMM. Using the simple example set of all possible cases defined earlier, the following command returns no records because the possible cases for the DMM contain Age column values for only the "Minimum," "Mean," "Maximum," and "Missing" ages (2, 45, 91, "Missing"):

SELECT * FROM HairColorPredictDMM WHERE Gender = 'Male' AND Age = 30

However, a PREDICTION JOIN using the decision tree described for this model finds a distribution on HairColor for a 30-year-old Male of (Black = .667; Gray = .267; Missing = .067).



The DMM cases represent all possible states for a column being predicted, while a user selecting a prediction for a column often expects to get the single "Best" predicted state. Use of the same simple example model produces the following results:

SELECT * FROM HairColorPredictDMM WHERE Gender = 'Male' AND Age = 45

Gender Age HairColor

Male 45 Black

Male 45 Gray

Male 45 NULL

However, selecting HairColor from this model using PREDICTION JOIN to a case for a 45-year-old male would simply report "Black" as the single value for HairColor.

The PREDICTION JOIN may need to make some aggregations and assumptions when confronted with missing values in the source case. To continue the example, a PREDICTION JOIN between the simple model and a case where the person's age is 30 but the gender is unknown would report a hair color of "Black" with a probability of 80%. (As the sample tree indicates, this is a probability which is independent of Gender.)

In general, PREDICTION JOIN will take one case from the input set, and using the conditions in the ON clause, it will find a matching set of cases from the DMM. This set of matching DMM cases is then "collapsed" by the algorithm (in an algorithm-specific way) into one aggregate case that contains the best predictions for all predictable columns in the model. This collapsed case may have prediction-describing statistics that are not directly observable in the set of all possible DMM cases because the statistics are the result of the collapsing process.
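The three behaviors described in this section can be sketched in Python for the simple hair-color example. The leaf distributions below are copied from the table of all possible cases; the age threshold AGE_SPLIT is a hypothetical value chosen for illustration, because this document does not state where the example tree splits on Age. The function names are likewise hypothetical, and None stands for "Missing."

```python
# Assumed split point between the younger and older groups (not from the source).
AGE_SPLIT = 60

OLD = {"Black": .300, "Gray": .625, None: .075}
YOUNG = {
    "Male": {"Black": .667, "Gray": .267, None: .067},
    "Female": {"Black": .933, "Gray": .067, None: .000},
    None: {"Black": .800, "Gray": .167, None: .033},  # gender unknown
}
AGE_UNKNOWN = {"Black": .600, "Gray": .350, None: .05}

def hair_distribution(gender, age):
    """Match a source case to a learned distribution on HairColor."""
    if age is None:
        return AGE_UNKNOWN
    if age >= AGE_SPLIT:
        return OLD  # in the example table, this group does not depend on Gender
    return YOUNG.get(gender, YOUNG[None])

def predict_hair(gender, age):
    """Collapse the distribution to the single best non-missing state."""
    dist = hair_distribution(gender, age)
    best = max((s for s in dist if s is not None), key=dist.get)
    return best, dist[best]
```

Under these assumptions, predict_hair("Male", 30) returns ("Black", .667) and predict_hair(None, 30) returns ("Black", .800), in line with the examples above.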

2.9.1.3 SELECT Expressions

The <SELECT-expressions> clause is a set of comma-separated expressions, each of which can be just a simple column reference or a general expression containing prediction functions that may be connected with various types of operators. (See "Prediction Details.") Columns can be referenced from the DMM or from the source data query. When a name conflict occurs between the DMM and source, the column reference must be prefixed with the model name or the source query's alias.

To validate the accuracy of the learned model, make a prediction on a set of new source cases where the predicted column value is known (for example, a set of cases held out of the set upon which the model was trained). Use SELECT to retrieve both the predicted value of the column from the model and the actual value from the source query, and compare them.
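Once such a query has returned the predicted and actual values side by side, a client can score the model. The following Python sketch computes the fraction of exact matches; the function name accuracy is hypothetical, and exact equality is only one possible measure (a continuous column would normally call for an error measure instead).

```python
def accuracy(rows):
    """rows: (predicted, actual) pairs taken from the prediction query result."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to score")
    hits = sum(1 for predicted, actual in rows if predicted == actual)
    return hits / len(rows)
```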

2.9.1.4 ON and the Join Condition

The key columns on the case row exist mainly for bookkeeping and consistency; the key values from a set of training data may not be used by the DMM, and the DMM does not retain the set of distinct values for these columns. However, because each row from the DMM's set of all possible cases is unique, it can be matched to rows from the source query of actual cases through the <join condition> clause of the ON keyword. The join condition matches columns from the DMM to columns from the source query. The join condition has one "=" expression for each set of columns to be matched, and the expressions are joined with the AND keyword. Column references in the join condition can be simple column names, they can be prefixed with a model or alias name to scope namespaces and resolve name conflicts, and they can have many scope levels to identify columns that are in turn members of table type columns. Consider the following examples:

SELECT … ON HairColorPredictDMM.Gender = T2.Gender AND HairColorPredictDMM.Age = T2.Age

Notice that even though the model has a column for HairColor, the source query may not have this column. In fact, if the SELECT command is predicting the "best" HairColor, the DMM's HairColor column should not be bound to a source column.

SELECT … ON M1.Gender = T2.Sex AND M1.[Product Purchases].[product name] = T2.[Product Purchases].[product name]

The DMM [Age Prediction] has been aliased in the FROM clause as M1, and the source query has been renamed to T2. For both tables, the [product name] column exists in a nested table-valued column called [product purchases].

When the schema of the DMM matches the schema of the input query, the keywords NATURAL PREDICTION JOIN can be used, and the ON clause must then be omitted. Columns from the source query are matched to columns from the DMM by column name.
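The name-based matching can be pictured as generating the equivalent ON clause automatically. The following Python sketch builds that clause for top-level columns only; the function name is hypothetical, and the case-insensitive comparison of names is an assumption rather than something this specification states.

```python
def natural_join_condition(model_alias, model_columns, source_alias, source_columns):
    """Build the ON expressions implied by matching columns on their names."""
    source = {c.lower() for c in source_columns}  # assumption: case-insensitive
    shared = [c for c in model_columns if c.lower() in source]
    return " AND ".join(
        f"{model_alias}.[{c}] = {source_alias}.[{c}]" for c in shared
    )
```

A model column with no counterpart in the source query, such as the column being predicted, simply does not appear in the generated condition.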

2.9.1.5 WHERE Clause

The <WHERE-expression> supports a simplified form of the SQL WHERE clause semantics that can limit the cases returned from a prediction query. Column references in the WHERE expression have the same semantics as column references in the <SELECT-expressions>.

2.9.2 An Example

The following sample query will return the predicted age for a set of new customers where the prediction is more than 80% likely:

SELECT T1.[Customer ID], T1.[Gender], M1.[Age]
FROM [Age Prediction] as M1 PREDICTION JOIN
OPENROWSET('MSDataShape', 'data provider=Microsoft.Jet.OLEDB.4.0;data source=D:\customer.mdb',
    'SHAPE {SELECT [Customer ID], [Gender] FROM [Customers] ORDER BY [Customer ID]}
    APPEND (
        {SELECT [CustID], [Product Name], [Quantity]
         FROM [Customer Product Sales] ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    ) AS [Product Purchases],
    (
        {SELECT [CustID], [Car Name]
         FROM [Customer Cars] ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    ) AS [Car Ownership]') as T1
ON M1.Gender = T1.Gender AND
    M1.[Product Purchases].[Product Name] = T1.[Product Purchases].[Product Name] AND
    M1.[Product Purchases].Quantity = T1.[Product Purchases].Quantity AND
    M1.[Car Ownership].[Car Name] = T1.[Car Ownership].[Car Name]
WHERE PredictProbability(M1.Age) > .8

2.9.3 Prediction Details

Along with the "best" predicted values, prediction queries on DMMs can convey additional information and statistics learned from the training data set. There are no explicit columns in the DMM dedicated to holding this additional information; instead, it can be selected from the DMM by calling the appropriate functions (often a function taking the predicted column as an argument).

Some of these functions report simple scalar values that relay measures of the confidence in a prediction or give fine-grained control over how a prediction is made. Other functions can expand a prediction into a table of details that better explain the prediction.

Also, the value predicted for a nested table (a column of type TABLE that is predictable) will in theory produce a nested table with one row for every distinct value for the key of the nested table. Various functions can operate on this nested table and limit, expand, or reorder the records. These functions are often a shorthand form of a nested SELECT clause. (A SELECT statement operating on the nested table can produce a new version of the nested table. A nested SELECT can be used as an entry in the <SELECT-expressions> list to generate a nested table.)

These functions will be described briefly in the following sections and are fully enumerated in Appendix C.

2.9.3.1 Scalar Functions

Directly selecting a predictable column from a DMM is a shortcut for using the default behavior of the Predict function on the column. It will return the "best" predicted value for the column (that is, the one with highest probability or whatever the provider decides is most appropriate). When a non-TABLE type column is given to the Predict function, the result is a scalar value.

All attributes of a DMM implicitly consider "Missing" as one of the possible values or states that they should model. In general, it is assumed that "Missing" or NULL values should not be returned as predictions, even if they are the most likely states. However, for some domains, a prediction of "Missing" could be informative. For example, consider a data set for the result of a survey that asked for Age, Gender, and Weight. If you are trying to predict Weight when given Age and Gender, for example, you might learn that for a certain segment of the population the average Weight is 135 lbs, but the most likely response to the question is "Missing" (that is, "none of your business!"). An (optional) argument to the Predict function can be the value INCLUDE_NULL, which is used to force the Predict function to return "Missing" as one of the potential prediction values.
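The effect of INCLUDE_NULL can be sketched in Python as a choice of which states are eligible to be the "best" prediction. The function below is hypothetical and assumes that "best" means highest probability, which a provider is free to define differently; None stands for "Missing," and the sample probabilities in the usage note are invented for illustration.

```python
def predict(distribution, include_null=False):
    """Return the most probable state, skipping Missing (None) unless requested."""
    candidates = {
        state: p for state, p in distribution.items()
        if include_null or state is not None
    }
    return max(candidates, key=candidates.get)
```

For a distribution such as {"130-140 lbs": .35, "140-150 lbs": .25, None: .40}, predict returns "130-140 lbs" by default and None when include_null is True.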

Along with the predicted value, other functions can give statistics that describe the prediction. PredictSupport(MyColumn) will return the number of cases in support of the prediction, and PredictProbability will give the likelihood of the returned value amongst the set of possible values for the column.

SELECT [Customer ID], Predict(Age), PredictProbability([Age]) as P …

Customer ID Age P

10001 43 .667

10203 43 .400

In the preceding example, [Age] is the predicted attribute and it is a Discretized attribute, so the predicted value for age will be the midpoint of one of the "buckets" that were found for age values. To get a better description for the range of a predicted bucket, the RangeMin, RangeMax, and RangeMid functions can be called on the prediction for the Discretized column.

However, if instead of Discretized, this model was created with [Age] as a continuous attribute, the reported prediction for Age would be a continuous value (in the domain of Age). This predicted age may be the mean of some local distribution—for example, the average age of people who buy the same products as those purchased by a person in the source case. Using this predicted value alone may be sufficient, but additional pieces of information might also be available. For example, the standard deviation will usually accompany a continuous attribute prediction, as follows:

SELECT [Customer ID], [Age], PredictStdev([Age]) as S …

Customer ID Age S

10001 45 5.2

10203 15 2.1

[Age] returns the mean of the predicted age distribution for the input case. The PredictStdev function returns the standard deviation for the predicted [Age] column. Notice that, unlike the SQL STD function, which is an aggregation function, PredictStdev is a scalar function that may provide a different result for each returned row.

If the DMM supports finding a clustering of records, the cluster membership information for a given input case can be obtained with the Cluster function. It returns the identifier of the cluster that the given input case most likely belongs to. Details about the input case's fit into its cluster are retrieved with the ClusterDistance and ClusterProbability functions.



SELECT [Customer ID], [Gender], Cluster() as C, ClusterProbability() as CP, …

Customer ID Gender C CP

10001 Male 2 .21

10203 Female 7 .32

The list of available functions for each of the prediction columns is found in the MINING_COLUMNS schema rowset of the DMM. Many of the common functions were standardized in this specification and are available in Appendix C. The following table provides a short description of these functions.

Function Return Value Description

Predict(<scalar column reference>, options, …) <column reference> General prediction function to modify behavior of prediction for scalar values, such as including a missing state. Returns the "best" value, given the options, for the specified scalar column.

PredictSupport(<column reference>) Scalar value Count of cases in support of the predicted value.

PredictVariance(<column reference>) Scalar value Variance describing the distribution for which the value of Predict is the mean (generally for continuous attributes).

PredictStdev(<column reference>) Scalar value Square root of PredictVariance.

PredictProbability(<column reference>) Scalar value Likelihood that Predict is the correct value.

PredictProbabilityVariance(<column reference>) Scalar value Expresses certainty in the value of PredictVariance.

PredictProbabilityStdev(<column reference>) Scalar value Square root of PredictProbabilityVariance.

Cluster Scalar value or <cluster column reference> Cluster identifier that the input case belongs to with the highest probability. It also can be used as a <cluster column reference> for a PredictHistogram function.

ClusterDistance([ClusterID_expr]) Scalar value Distance from the center of the cluster that is identified by ClusterID_expr or the highest probability cluster.

ClusterProbability([ClusterID_expr]) Scalar value Probability that the input case belongs to the cluster that is identified by ClusterID_expr or the highest probability cluster.





RangeMid(<column reference>) Scalar value Gives the midpoint of the predicted bucket for a discretized column.

RangeMin(<column reference>) Scalar value Gives the low end of the predicted bucket for a discretized column.

RangeMax(<column reference>) Scalar value Gives the upper end of the predicted bucket for a discretized column.
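As a rough illustration of how the three Range functions relate to a predicted bucket, the following Python sketch models a discretized [Age] column. The bucket boundaries, the half-open bucket convention, and the helper names are assumptions made for illustration only; a provider determines its own buckets.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for a Discretized [Age] column:
# [0, 20), [20, 36), [36, 50), [50, 120)
BOUNDARIES = [0, 20, 36, 50, 120]


def bucket_for(value):
    """Return the (low, high) bucket that contains value."""
    i = bisect_right(BOUNDARIES, value) - 1
    # Clamp so that out-of-range values fall into the first or last bucket.
    i = max(0, min(i, len(BOUNDARIES) - 2))
    return BOUNDARIES[i], BOUNDARIES[i + 1]


def range_min(bucket):
    """Low end of the predicted bucket (analogous to RangeMin)."""
    return bucket[0]


def range_max(bucket):
    """Upper end of the predicted bucket (analogous to RangeMax)."""
    return bucket[1]


def range_mid(bucket):
    """Midpoint of the predicted bucket (analogous to RangeMid)."""
    return (bucket[0] + bucket[1]) / 2


predicted_bucket = bucket_for(43)
print(range_min(predicted_bucket), range_mid(predicted_bucket), range_max(predicted_bucket))
```

Under these assumed boundaries, a predicted age of 43 is the midpoint of the bucket running from 36 to 50.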

2.9.3.2 Expanding Scalar Predictions with PredictHistogram

The additional information on a prediction need not be a simple scalar. For example, when predicting a discrete attribute (such as Gender), a histogram is one possible way to provide the predictions. The histogram will have one entry for each of the possible values that could have been returned for the column. Along with each value are some statistics that describe its likelihood. (The exact format of a histogram is presented in Appendix C.) This histogram is a table, and the PredictHistogram function returns this table as a column with the data type of TABLE (that is, a table column). The nested table has a predefined set of information-containing columns. These columns are $Support, $Variance, $Stdev (standard deviation), $Probability, $ProbabilityVariance, and $ProbabilityStdev.

SELECT [Customer ID], PredictHistogram([Gender]) AS GH …

Customer ID GH

10001
    Gender  $Support  $Probability  …
    Male    621       .621
    Female  379       .379

10203
    Gender  $Support  $Probability  …
    Male    446       .446
    Female  554       .554

Note For simplicity, only a few of the automatic information columns are shown in the preceding example.


The Predict functions are selecting their return values from the table returned by PredictHistogram. From this table, the record with the highest value for $Probability is found and the value for the appropriate column is returned.
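The relationship between the histogram and the scalar prediction functions can be sketched in Python as follows. This is only an illustrative model, not provider code: the dictionary layout, the use of None for the "Missing" state, and the include_null parameter (standing in for the INCLUDE_NULL option) are assumptions.

```python
def predict(histogram, include_null=False):
    """Return the histogram row with the highest $Probability.

    histogram is a list of dicts with the keys "value", "$Support", and
    "$Probability"; a value of None stands for the "Missing" state, which
    is skipped unless include_null is True.
    """
    rows = [r for r in histogram if include_null or r["value"] is not None]
    return max(rows, key=lambda r: r["$Probability"])


# Gender histogram for customer 10001, taken from the example above.
gender_hist = [
    {"value": "Male", "$Support": 621, "$Probability": 0.621},
    {"value": "Female", "$Support": 379, "$Probability": 0.379},
]

# Hypothetical Weight histogram in which "Missing" is the most likely state.
weight_hist = [
    {"value": None, "$Support": 600, "$Probability": 0.6},
    {"value": 135, "$Support": 400, "$Probability": 0.4},
]

best = predict(gender_hist)
print(best["value"], best["$Probability"], best["$Support"])
print(predict(weight_hist)["value"])
print(predict(weight_hist, include_null=True)["value"])
```

In this model, the value, $Probability, and $Support of the selected row correspond to what Predict, PredictProbability, and PredictSupport would report for the column.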



Depending on the capabilities of the underlying DMM, the distribution for a continuous column may have more than one mode. (That is, the distribution graph shows more than one peak.) In this case, users can obtain the statistics (mean, standard deviation, and so on) of each mode by using the PredictHistogram function against a continuous column.

SELECT [Customer ID], PredictHistogram([Age]) AS AH …

Customer ID AH

10001
    Age   $Stdev  $Probability  …
    32.1  17.2    .621
    65.2  6.4     .379

If the DMM supports finding a clustering of records, the Cluster function returns the most likely cluster membership for a given input case. However, the input case may exist with various degrees of probability in many or all of the clusters. Using the PredictHistogram(Cluster) function expands the cluster prediction into a table describing the full cluster membership of the input case.

SELECT [Customer ID], PredictHistogram(Cluster()) AS CH …

Customer ID CH

10001
    Cluster()  $Support  $Probability  …
    1          724       .55
    2          1025      .05
    3          20        .40

By default, the PredictHistogram function will not include "Missing" as one of the reported states. To force the function to return statistics for the attribute's missing state, the argument passed into PredictHistogram should be a call to Predict on the attribute, with the argument to include "Missing" specified, as shown in the following example:

SELECT [Customer ID], PredictHistogram(Predict([Gender], INCLUDE_NULL)) AS GH …

If a column supports the PredictHistogram function, it will be found in the MINING_COLUMNS schema rowset of the DMM. A full description of PredictHistogram can be found in Appendix C. The following table provides a short description:


PredictHistogram(<scalar column reference>) <table> Generates a histogram that contains details of the predictions for the column. The input column reference can be a column-returning function such as Predict or Cluster.

2.9.3.3 Predictions on Table Columns

TABLE type columns may be predicted. The result of selecting such a TABLE type column from a DMM in a PREDICTION JOIN query is a nested table with one row for every distinct value learned for the key of the nested table. Along with each row of the generated nested table will be the "best" predicted value for any predictable columns from the nested table. Directly selecting a TABLE type column by name is a shortcut for using the default behavior of the Predict function on the column. Also, because the column is in itself a table, a nested SELECT statement can be used to return the rows. Using the example schema, where the Gender, Product Purchases, and Quantity columns are predictable, the following three queries are equivalent and will return the same results:

SELECT [Customer ID], [Gender], [Product Purchases] …

SELECT [Customer ID], [Gender], Predict([Product Purchases]) …

SELECT [Customer ID], [Gender], (SELECT * FROM [Product Purchases]) …

Customer ID Gender Product Purchases

10001 Male
    Product Name  Quantity  Product Type
    TV            1         Electronic
    Ham           2         Food
    Beer          6         Beverage

10203 Female
    Product Name  Quantity  Product Type
    TV            2         Electronic
    Ham           1         Food
    Beer          0         Beverage

The input table of actual cases may or may not contain a nested table that matches the nested table being predicted. If not, the interpretation of Predict on the table column is quite natural: predict the membership of this table based on the other factors given for the case. If, however, the input case has a matching nested table, three possible behaviors may be desired. Consider the following example model:

1. A prediction simply could be the complete list of products the store offers, with associated predictions for quantities.

2. The prediction might show what other products a customer is likely to buy based on the products the customer has already bought. The reported list should not include the product from the input case.

3. The prediction might be just the predicted "Quantity" value associated with the products from the input case, or perhaps just the likelihood of each product in the input case. No other products should appear in the nested output table.

To express these three different cases, the user can specify, respectively, one of the following options in the Predict function:

INCLUSIVE, which causes behavior number 1.

EXCLUSIVE (default option), which causes behavior number 2.

INPUT_ONLY, which ensures that the predicted table contains only the rows supplied by the input (behavior number 3).

Each entry in the predicted nested table has some probabilistic measurements for inclusion or ranking in the list. This is different from the probabilities and statistics associated with individual predictable columns within the nested table. Instead, these are statistics that describe what was learned about the mere existence of the record in the nested table. For instance, a model may show an 80% chance that a certain customer will buy beer but only a 40% chance that the beer will be purchased on sale, or a 70% chance that the number of units purchased will be 12. Another value for the option argument of the Predict function appends new statistics-containing columns to the returned nested table (similar to the way the PredictHistogram function creates statistics columns in the nested table it produces). Using the INCLUDE_STATISTICS value adds a $Support and a $Probability column to the resulting nested table, as illustrated in the following example:

SELECT [Customer ID], [Gender], Predict([Product Purchases], INCLUDE_STATISTICS, INPUT_ONLY) …



10001 Male
    Product Name  Quantity  Product Type  $Support  $Probability
    Ham           2         Food          725       .267

10203 Female
    Product Name  Quantity  Product Type  $Support  $Probability
    Ham           1         Food          30        .34
    Beer          0         Beverage      56        .83

Note In the preceding example, the customer 10001 input case contained a Product Purchases subrow only for Ham, and the customer 10203 case contained subrows for Ham and Beer. Because the INPUT_ONLY option was used, only these rows show up in the prediction.

The $Probability column for a nested table contains the probability of existence for the particular subtable entry. No assumptions can be made about the relationships among the sets of probabilities returned for nested table membership. Because they may be derived from independent parts of the DMM, they cannot be added together to make anything meaningful.

One of the more complex forms of a returned prediction results from requesting a histogram for a value column inside a predicted table column. In this case, the prediction may include a histogram for the different statistics of each of the values. The following query will provide such a structure. (For simplicity, only a few of the automatic info columns are shown in this example.)

SELECT [Customer ID], [Gender], (SELECT [Product Name], PredictHistogram([Quantity]) AS [Quantity Histogram]

FROM Predict([Product Purchases], INCLUDE_STATISTICS)) …



10001 Male
    Product Name  Quantity Histogram  $Probability
    TV                                0.23
        Quantity  $Variance  $Probability
        1         1.3        0.60
        2         1.8        0.10
        3         3.2        0.30
    Ham                               0.267
        Quantity  $Variance  $Probability
        1         0.5        0.25
        2         0.7        0.55
        3         3.7        0.20
    Beer                              0.832
        Quantity  $Variance  $Probability
        1         1.1        0.15
        2         0.7        0.15
        3         0.2        0.70

If a TABLE column supports the Predict function, it will be found in the MINING_COLUMNS schema rowset of the DMM. A full description of Predict can be found in Appendix C. The following table provides a short description.

Predict(<TABLE column reference>, options, …) <table column reference> General prediction function to modify default behavior of prediction, for example, including missing records, appending statistics, inclusive/exclusive/input only membership, and so on.

2.9.3.4 Operating on Nested Tables

If a nested table returned as a prediction contains a great number of records (as would be the case if a store sold many, many different items), slogging through the results of the nested table to pick out interesting predictions would be an onerous task for both the provider and the consumer. Even if the nested table contains a relatively small number of records, finding good predictions from the set would be inconvenient. To solve this problem, OLE DB for DM introduces the TopX and BottomX family of functions, which operate on nested tables (including those resulting from PredictHistogram, a nested SELECT, or any other table-returning expression). These functions order the records of the nested table by a specified column's value and then truncate the sorted list to a specified length.

For example, using the TopCount function, the following syntax retrieves the three most probable hair colors (from the learned set of 8 possible) for an input case:

SELECT [Customer ID], TopCount(PredictHistogram([HairColor]), $Probability, 3)…

Or, to get the 10 products (out of 10,000) that a customer is predicted to buy in the largest quantity, the TopCount function could be used as follows:

SELECT [Customer ID], TopCount([Product Purchases], [Quantity], 10) …
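The sort-then-truncate behavior of TopCount can be modeled with a short Python sketch. The nested table is represented here as a list of dictionaries, and the hair color probabilities are invented sample data; neither reflects an actual provider implementation.

```python
def top_count(table, rank_key, n):
    """Order rows by rank_key, highest first, and keep the first n rows."""
    return sorted(table, key=lambda row: row[rank_key], reverse=True)[:n]


# Hypothetical histogram for [HairColor]; values are illustrative only.
hair_hist = [
    {"HairColor": "Brown", "$Probability": 0.30},
    {"HairColor": "Red", "$Probability": 0.45},
    {"HairColor": "Grey", "$Probability": 0.05},
    {"HairColor": "Black", "$Probability": 0.20},
]

top3 = top_count(hair_hist, "$Probability", 3)
print([row["HairColor"] for row in top3])
```

With this sample data, the three rows kept are Red, Brown, and Black, in that order.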

If a nested table contains a large number of columns and only a few are interesting to the prediction, or if using a function that produces information columns (such as PredictHistogram or Predict) and some of the automatic columns are not needed, a nested SELECT can be used on the nested table or function to project out the desired columns. Following are two examples using a nested SELECT:

SELECT [Customer ID], (SELECT [Product Name], Quantity FROM [Product Purchases]) …

or

SELECT [Customer ID], (SELECT HairColor, $Support AS Sup FROM TopCount(PredictHistogram([HairColor]), $Probability, 3)) AS PH …

Customer ID PH

200
    HairColor  Sup
    Red        100
    Brown      57
    Black      13

220
    HairColor  Sup
    Grey       675
    Black      453
    Green      2

Suppose you wanted to get a list of predicted records from a TABLE type column and, along with each nested table record, you wanted additional statistics on a predictable column in the nested table. An earlier example in this document provided this information (and more). This earlier example generated a prediction of product purchases and, along with each prediction, a detailed histogram explaining the prediction for the quantity column. Navigating such a nested rowset may be a bit cumbersome and is also unnecessary if the only information needed is the best prediction of quantity and some other measure of the prediction's strength that is returned from the prediction histogram. The following example shows how to get this result:

SELECT [Customer ID], Gender, (SELECT [Product Name], [Quantity] as [Best Quantity],

PredictStdev(Quantity) AS [Quantity Deviation], $Probability

FROM Predict([Product Purchases], INCLUDE_STATISTICS)), …


10001 Male
    Product Name  Best Quantity  Quantity Deviation  $Probability
    TV            1              1.3                 0.23
    Ham           2              0.7                 0.267
    Beer          3              0.2                 0.832

The sub-SELECT in the preceding example extracts desired columns from the histogram generated by Predict([Product Purchases], INCLUDE_STATISTICS). Note that $Probability is one of the columns that the Predict function automatically creates and is the probability of the record existing in the set, not the probability on the quantity.

A nested SELECT with a WHERE clause can be used to pull out certain records from a nested table. For example, if instead of always getting the "best" prediction for gender a query wanted to get the probability that each customer was "Female," this syntax would work as shown in the following example:

SELECT [Customer ID], (SELECT $Probability FROM PredictHistogram([Gender]) WHERE Gender = 'Female') AS [Female Probability] …

Customer ID Female Probability

10001 .379

10203 .554

Another similar use of the WHERE clause is to limit the records in the prediction on a TABLE type column to some specific entries or set of entries. The following example shows how to get only predictions for the purchase of "Beer" for any customer:

SELECT [Customer ID], (SELECT * FROM [Product Purchases] WHERE [Product Name] = 'Beer') …


Customer ID Product Purchases

10001
    Product Name  Quantity  Product Type
    Beer          6         Beverage

10203
    Product Name  Quantity  Product Type
    Beer          0         Beverage

The same idea applies to limit the scope of nested table predictions to a set of related records as defined by another column that is related to the key of the subtable, as illustrated by the following example:

SELECT [Customer ID], (SELECT * FROM [Product Purchases] WHERE [Product Type] = 'Beverage') …

The list of available functions for a predictable TABLE type column is found in the MINING_COLUMNS schema rowset of the DMM. Many of the common functions were standardized in this specification and are available in Appendix C. The following table provides a short description of these common functions.

TopCount(<table expr>, <rank expr>, <n-items>) <table expr> Return the first <n-items> rows in a decreasing order of <rank expr>.

TopSum(<table expr>, <rank expr>, <sum>) <table expr> Return the first N rows in a decreasing order of <rank expr> such that the sum of the <rank expr> values is at least <sum>.

TopPercent(<table expr>, <rank expr>, <percent>) <table expr> Return the first N rows in a decreasing order of <rank expr> such that the sum of the <rank expr> values is at least the given percentage of the total sum of <rank expr> values.

Sub-SELECT: (SELECT <SELECT-expressions> FROM <table expr> [WHERE <WHERE clause>]) <table expr> Apply a SELECT against <table expr>. <table expr> can be either a table column reference or any table-returning function except a sub-SELECT.
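The TopSum and TopPercent descriptions in the function table can be read as a running-sum cutoff over the sorted rows. The following Python sketch models that reading; it assumes that N is the smallest number of rows whose rank values reach the threshold, which the short description does not state explicitly, and the list-of-dictionaries representation is likewise illustrative.

```python
def top_sum(table, rank_key, target):
    """Keep rows, highest rank first, until their rank values sum to target."""
    result, running = [], 0.0
    for row in sorted(table, key=lambda r: r[rank_key], reverse=True):
        if running >= target:
            break  # threshold already reached; drop the remaining rows
        result.append(row)
        running += row[rank_key]
    return result


def top_percent(table, rank_key, percent):
    """Like top_sum, with the target given as a percentage of the total."""
    total = sum(row[rank_key] for row in table)
    return top_sum(table, rank_key, total * percent / 100.0)


# Quantity histogram for Ham from the earlier PredictHistogram example.
ham_hist = [
    {"Quantity": 1, "$Probability": 0.25},
    {"Quantity": 2, "$Probability": 0.55},
    {"Quantity": 3, "$Probability": 0.20},
]

print([r["Quantity"] for r in top_sum(ham_hist, "$Probability", 0.7)])
print([r["Quantity"] for r in top_percent(ham_hist, "$Probability", 50)])
```

With this data, a sum threshold of 0.7 keeps quantities 2 and 1, and a threshold of 50 percent keeps only quantity 2.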


2.9.3.5 Singleton Queries

In some cases, you may want to make a prediction for a case that is not contained in a table. For example, during a Web site visit, the Web server may need to predict the visitor's preferences based on the current activities recorded. The current activities may not yet be recorded in the RDBMS, and it may be very inefficient to generate a record (or a set of records in multiple tables) only for the purpose of prediction.

To solve this problem, the provider can support a syntax allowing sets of constant values in place of the <source data query> for the SELECT FROM PREDICTION JOIN syntax. See the section "Source Data" for examples of singleton data sources.

2.9.4 Flattening Nested Tables

The nested table is a very useful form of data representation that is well suited to the needs of data mining algorithms. Unfortunately, however, there is currently no widespread support in relational databases for this form of data representation. The way to convert flat relational views to a nested table was discussed earlier, and the SHAPE statement is introduced in Appendix E. This mechanism helps to feed data into the DM provider.

Some data mining clients will not be able to accept result sets in hierarchical format from a DM provider. This may be because the client lacks the ability to handle hierarchy or because the client application needs to store the results in a single relational table. To convert the data from nested tables to flattened tables, it is necessary to request that the query results be flattened. For this, the SELECT syntax provides the FLATTENED option, as in the following example:

SELECT FLATTENED <SELECT-expressions> FROM …

The FLATTENED option turns the SELECT result table from a hierarchical table to a flattened table form. The result set will contain one row for each predicted value, simplifying the processing of the prediction results. If the columns in the <SELECT-expressions> clause come from various levels of a hierarchy of table nesting, the resulting flattened table will not put the prediction results on the same record. Doing so implies a connection between the predictions, and no connection is assumed to exist. For example, a FLATTENED prediction on [Product Purchases] might give the result set shown in the following table.

Customer ID  Product Name  Quantity  Probability
1            TV            1         .25
1            TV            2         .1
1            TV            3         .02
1            Ham           2         .2
1            Ham           1         .05
1            Ham           3         .03



In this result set, each row contains a single prediction of products and the possible quantities. If the columns in the <SELECT-expressions> clause include columns from more than one table column, the results will return the hierarchical shape in a flattened result set. Each row again contains a single prediction, but different rows might contain different types of predictions. For example, if a prediction is made for Gender and Product Purchases, the flattened result set might look like the following table.

Customer ID  Gender  Gender Probability  Product Name  Quantity  Product Quantity Probability
1            Female  .43                 Null          Null      Null
1            Male    .57                 Null          Null      Null
1            Null    Null                TV            1         .25
1            Null    Null                TV            2         .1
1            Null    Null                TV            3         .02
1            Null    Null                Ham           2         .2



Each row contains a single prediction; some rows contain a prediction for Gender while others have a prediction on Product Purchases.
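The shape of such a flattened result can be sketched in Python. The function below is a hypothetical model of the flattening described above, not the provider's algorithm: predictions from different table columns are emitted as separate rows, and the columns that do not apply to a row are filled with None (Null).

```python
def flatten(customer_id, gender_histogram, purchase_predictions):
    """Emit one flat row per prediction; columns that do not apply are None."""
    rows = []
    for gender, probability in gender_histogram:
        rows.append((customer_id, gender, probability, None, None, None))
    for product, quantity, probability in purchase_predictions:
        rows.append((customer_id, None, None, product, quantity, probability))
    return rows


# Values taken from the flattened result set shown above.
rows = flatten(
    1,
    [("Female", 0.43), ("Male", 0.57)],
    [("TV", 1, 0.25), ("TV", 2, 0.1), ("TV", 3, 0.02), ("Ham", 2, 0.2)],
)
for row in rows:
    print(row)
```

Keeping the Gender rows and the Product Purchases rows separate mirrors the point made earlier: placing them on the same record would imply a connection between the predictions that is not assumed to exist.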

2.10 Deleting Existing Mining Models

Following are two ways to perform deletion operations:

1. Delete the DMM object—Remove the object from the system, with both its structure and its content.

2. Clear the DMM content—Clear the object of its content, but leave its structure intact.

These two operations are similar to the operations of dropping a table from the database or clearing all of the table content by using the following statements:

DROP MINING MODEL <model name>: Will delete the DMM from the database. The model will disappear from the namespace.

DELETE FROM <model name>: Will delete the content and the column values of the mining model but will leave the object structure intact. You may now repopulate the DMM with a new set of training data (using the INSERT INTO statement) without having to re-create the DMM structure.

DELETE FROM <model name>.CONTENT: Will delete the content of the mining model but leave the structure and learned column values intact.



2.11 Refining Mining Models

Existing DMMs may also be refined. Refinement refers to modifying the content, or set of rules, by inserting a new set of training cases.

Refining a DMM based on additional cases is limited to certain algorithms that can be updated on an incremental basis. The ALLOW_INCREMENTAL_INSERT column in the MINING_SERVICES schema rowset indicates whether a specific algorithm supports this capability. If the capability is supported, the DMM can be refined by simply executing another INSERT INTO statement with the additional cases.

If the capability is not supported, all of the DMM content will have to be deleted and the DMM must be retrained using the full set of cases (both the old ones and the new ones).
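The choice between refining and retraining can be summarized as a small decision routine. The Python sketch below only assembles the sequence of statements a consumer might issue; the function name is hypothetical, and the angle-bracket placeholders stand for the source data clauses of INSERT INTO, which are deliberately left unspecified here.

```python
def refresh_statements(model_name, allow_incremental_insert):
    """Return the statements needed to add new training cases to a DMM.

    allow_incremental_insert is the value read from the
    ALLOW_INCREMENTAL_INSERT column of the MINING_SERVICES schema rowset.
    """
    if allow_incremental_insert:
        # Incremental refinement: insert only the additional cases.
        return [f"INSERT INTO [{model_name}] <new cases>"]
    # No incremental support: clear the content, then retrain on everything.
    return [
        f"DELETE FROM [{model_name}]",
        f"INSERT INTO [{model_name}] <old and new cases>",
    ]


for statement in refresh_statements("Age Prediction", False):
    print(statement)
```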


3 Appendix A: Schema Rowsets

Schema information in OLE DB is retrieved using predefined schema rowsets; this appendix lists the contents of each schema rowset. Providers can add columns to these standard schema rowsets. We recommend that the names of the columns extended by the provider have the provider name as the prefix.

3.1 MINING_MODELS Schema Rowset

Number of restriction columns: 6

Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, MODEL_TYPE, SERVICE_NAME, SERVICE_TYPE_ID

Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME

Description: Data mining models are exposed in the MINING_MODELS schema rowset. This schema rowset can be viewed as an enhanced form of the TABLES schema rowset for data mining models.

Column Name Type Indicator Description

1 MODEL_CATALOG DBTYPE_WSTR Catalog name. NULL if the provider does not support catalogs.

2 MODEL_SCHEMA DBTYPE_WSTR Unqualified schema name. NULL if the provider does not support schemas.

3 MODEL_NAME DBTYPE_WSTR Model name. This column cannot contain NULL.

4 MODEL_TYPE DBTYPE_WSTR Model type, a provider-specific string—can be NULL.

5 MODEL_GUID DBTYPE_GUID GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.

6 DESCRIPTION DBTYPE_WSTR Human-readable description of the model. NULL if there is no description associated with the model.

7 MODEL_PROPID DBTYPE_UI4 Property ID of the model. Providers that do not use PROPIDs to identify columns should return NULL in this column.

8 DATE_CREATED DBTYPE_DATE Date when the model was created or NULL if the provider does not have this information.

Note 1.x providers do not return this column.

9 DATE_MODIFIED DBTYPE_DATE Date when the model definition was last modified or NULL if the provider does not have this information.

10 SERVICE_TYPE_ID DBTYPE_UI4 A bitmask that describes mining service types. The following list includes known popular mining service values:


DM_SERVICETYPE_CLUSTERING (0x0000002)

DM_SERVICETYPE_ASSOCIATION (0x0000004)



11 SERVICE_NAME DBTYPE_WSTR A provider-specific name that describes the algorithm used to generate the model.

12 CREATION_STATEMENT DBTYPE_WSTR Optional. The statement used to create the original data mining model.

13 PREDICTION_ENTITY DBTYPE_WSTR A comma-delimited list indicating which columns the model can predict.

14 IS_POPULATED DBTYPE_BOOL VARIANT_TRUE if the model is populated; VARIANT_FALSE if the model is not populated. An empty model has a defined structure but has not been trained with data.
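Because SERVICE_TYPE_ID is a bitmask, a consumer would test individual bits rather than compare the whole value. The Python sketch below decodes a mask using only the two flag values that appear in the list above; the helper name is hypothetical, and any other bits a provider sets are simply ignored.

```python
# Flag values as listed for the SERVICE_TYPE_ID column.
DM_SERVICETYPE_CLUSTERING = 0x0000002
DM_SERVICETYPE_ASSOCIATION = 0x0000004

KNOWN_SERVICE_TYPES = {
    "DM_SERVICETYPE_CLUSTERING": DM_SERVICETYPE_CLUSTERING,
    "DM_SERVICETYPE_ASSOCIATION": DM_SERVICETYPE_ASSOCIATION,
}


def service_types(mask):
    """Return the names of the known service type bits set in mask."""
    return [name for name, bit in KNOWN_SERVICE_TYPES.items() if mask & bit]


print(service_types(0x0000006))
```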

3.2 MINING_COLUMNS Schema Rowset

Number of restriction columns: 4

Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, COLUMN_NAME

Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, COLUMN_NAME

Description: The MINING_COLUMNS schema rowset describes the individual columns of all defined data mining models known to the provider. This schema rowset can be viewed as an enhanced form of the COLUMNS rowset for data mining models. Many of the entries are derived from the COLUMNS schema rowset and are optional.

Column Name




3 MODEL_NAME DBTYPE_WSTR Model name. This column cannot contain a NULL.

4 COLUMN_NAME DBTYPE_WSTR The name of the column; this might not be unique. If this cannot be determined, a NULL is returned.

This column, together with the COLUMN_GUID and COLUMN_PROPID columns, forms the column ID. One or more of these columns will be NULL, depending on which elements of the DBID structure the provider uses.

If possible, the resulting column ID should be persistent. However, some providers do not support persistent identifiers for columns.

5 COLUMN_GUID DBTYPE_GUID Column GUID. Providers that do not use GUIDs to identify columns should return NULL in this column.

6 COLUMN_PROPID DBTYPE_UI4 Column property ID. Providers that do not associate PROPIDs with columns should return NULL in this column.

7 ORDINAL_POSITION DBTYPE_UI4 The ordinal of the column. Columns are numbered starting from one. NULL if there is no stable ordinal value for the column.

8 COLUMN_HASDEFAULT DBTYPE_BOOL VARIANT_TRUE—The column has a default value.

VARIANT_FALSE—The column does not have a default value, or it is unknown whether the column has a default value.

9 COLUMN_DEFAULT DBTYPE_WSTR Default value of the column. A provider may expose DBCOLUMN_DEFAULTVALUE but not DBCOLUMN_HASDEFAULT (for SQL-92 tables) in the rowset returned by IColumnsRowset::GetColumnsRowset.

If the default value is the NULL value, COLUMN_HASDEFAULT is VARIANT_TRUE and the COLUMN_DEFAULT column is a NULL value.

10 COLUMN_FLAGS DBTYPE_UI4 A bitmask that describes column characteristics. The DBCOLUMNFLAGS enumerated type specifies the bits in the bitmask. This column cannot contain a NULL value.

11 IS_NULLABLE DBTYPE_BOOL VARIANT_TRUE—The column might be nullable.

VARIANT_FALSE—The column is known not to be nullable.

12 DATA_TYPE DBTYPE_UI2 The indicator of the column's data type—for example:

"TABLE" = DBTYPE_HCHAPTER

"TEXT" = DBTYPE_WCHAR

"LONG" = DBTYPE_I8

"DOUBLE" = DBTYPE_R8

"DATE" = DBTYPE_DATE

13 TYPE_GUID DBTYPE_GUID The GUID of the column's data type. Providers that do not use GUIDs to identify data types should return NULL in this column.

14 CHARACTER_MAXIMUM_LENGTH DBTYPE_UI4 The maximum possible length of a value in the column. For character, binary, or bit columns, this is one of the following:

The maximum length of the column in characters, bytes, or bits, respectively, if the length is defined. For example, a CHAR(5) column in an SQL table has a maximum length of 5.

The maximum length of the data type in characters, bytes, or bits, respectively, if the column does not have a defined length.

Zero (0) if neither the column nor the data type has a defined maximum length.

NULL for all other types of columns.

15 CHARACTER_OCTET_LENGTH DBTYPE_UI4 Maximum length in octets (bytes) of the column, if the type of the column is character or binary. A value of zero means the column has no maximum length. NULL for all other types of columns.

16 NUMERIC_PRECISION DBTYPE_UI2 If the column's data type is of a numeric data type other than VARNUMERIC, this is the maximum precision of the column. The precision of columns with a data type of DBTYPE_DECIMAL or DBTYPE_NUMERIC depends on the definition of the column.

If the column's data type is not numeric or is VARNUMERIC, this is NULL.

17 NUMERIC_SCALE DBTYPE_I2 If the column's type indicator is DBTYPE_DECIMAL, DBTYPE_NUMERIC, or DBTYPE_VARNUMERIC, this is the number of digits to the right of the decimal point. Otherwise, this is NULL.

18 DATETIME_PRECISION DBTYPE_UI4 Datetime precision (number of digits in the fractional seconds portion) of the column if the column is a datetime or interval type. If the column's data type is not datetime, this is NULL.

19 CHARACTER_SET_CATALOG DBTYPE_WSTR Catalog name in which the character set is defined. NULL if the provider does not support catalogs or different character sets.

20 CHARACTER_SET_SCHEMA DBTYPE_WSTR Unqualified schema name in which the character set is defined. NULL if the provider does not support schemas or different character sets.

21 CHARACTER_SET_NAME DBTYPE_WSTR Character set name. NULL if the provider does not support different character sets.

22 COLLATION_CATALOG DBTYPE_WSTR Catalog name in which the collation is defined. NULL if the provider does not support catalogs or different collations.

23 COLLATION_SCHEMA DBTYPE_WSTR Unqualified schema name in which the collation is defined. NULL if the provider does not support schemas or different collations.

24 COLLATION_NAME DBTYPE_WSTR Collation name. NULL if the provider does not support different collations.

25 DOMAIN_CATALOG DBTYPE_WSTR Catalog name in which the domain is defined. NULL if the provider does not support catalogs or domains.

26 DOMAIN_SCHEMA DBTYPE_WSTR Unqualified schema name in which the domain is defined. NULL if the provider does not support schemas or domains.

27 DOMAIN_NAME DBTYPE_WSTR Domain name. NULL if the provider does not support domains.

28 DESCRIPTION DBTYPE_WSTR Human-readable description of the column. For example, the description for a column named Name in the Employee table might be "Employee name." Null if there is no description associated with the column.

29 DISTRIBUTION_FLAG DBTYPE_WSTR One of the following:

"NORMAL"

"LOG_NORMAL"

"UNIFORM"

"BINOMIAL"

"MULTINOMIAL"

"POISSON"

"HEAVYTAIL"

"MIXTURE"


30 CONTENT_TYPE DBTYPE_WSTR One of the following:

"KEY"

"DISCRETE"

"CONTINUOUS"

"DISCRETIZED([args])"

"ORDERED"

"SEQUENCE_TIME"

"CYCLICAL"

"PROBABILITY"

"VARIANCE"

"STDEV"

"SUPPORT"

"PROBABILITY_VARIANCE"

"PROBABILITY_STDEV"

"ORDER"

"SEQUENCE"


31 MODELING_FLAG DBTYPE_WSTR A comma-delimited list of flags. The defined flags are as follows:

"MODEL_EXISTENCE_ONLY"

"NOT NULL"


32 IS_RELATED_TO_KEY DBTYPE_BOOL VARIANT_TRUE if this column is related to the key. If the key is a single column, the RELATED_ATTRIBUTE field optionally may contain its column name.

33 RELATED_ATTRIBUTE DBTYPE_WSTR This is the name of the target column that the current column either relates to or is a special property of.

34 IS_INPUT DBTYPE_BOOL VARIANT_TRUE if this is an input column.

35 IS_PREDICTABLE DBTYPE_BOOL VARIANT_TRUE if the column is predictable.

36 CONTAINING_COLUMN DBTYPE_WSTR Name of the TABLE column containing this column. NULL if the column is not contained in a TABLE column.

37 PREDICTION_SCALAR_FUNCTIONS

DBTYPE_WSTR A comma-delimited list of scalar functions that may be performed on the column.

38 PREDICTION_TABLE_FUNCTIONS

DBTYPE_WSTR A comma-delimited list of functions that may be applied to the column, returning a table. The list has the following format:

<function name>(<column1> [, <column2>], ...)

The format allows the client to determine which columns will be present in the table returned by any given function.
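As an illustration of how a client might read this format, the following Python sketch splits a PREDICTION_TABLE_FUNCTIONS value into function names and the columns each function returns. The sample string and the helper name parse_table_functions are hypothetical; an actual provider defines its own list.

```python
import re

def parse_table_functions(value):
    """Map each function name to the columns of the table it returns."""
    result = {}
    # Each entry looks like name(col1, col2, ...). Commas inside the
    # parentheses separate columns; commas outside separate functions.
    for match in re.finditer(r"\s*([^\s(),]+)\s*\(([^()]*)\)\s*,?", value):
        name, columns = match.group(1), match.group(2)
        result[name] = [c.strip() for c in columns.split(",") if c.strip()]
    return result

# Hypothetical column value, not taken from any real provider.
sample = "PredictHistogram(Gender, $Support, $Probability), Predict(Product, $Probability, $Support)"
functions = parse_table_functions(sample)
print(functions["PredictHistogram"])
```

Splitting the whole string on commas would not work here, because the same delimiter separates both the functions and the columns inside each pair of parentheses.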

39 IS_POPULATED DBTYPE_BOOL VARIANT_TRUE if the column has learned a set of possible values.

VARIANT_FALSE if the column is not populated.

40 PREDICTION_SCORE DBTYPE_R8 The score of the model on the predicting column. Score is used to measure the accuracy of a model.

3.3 MINING_MODEL_CONTENT Schema Rowset

Number of restriction columns: 10

Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, ATTRIBUTE_NAME, NODE_NAME, NODE_UNIQUE_NAME, NODE_TYPE, NODE_GUID, and NODE_CAPTION

Note A tenth restriction, called the tree operation, does not apply to any particular column of the MINING_MODEL_CONTENT rowset; rather, it specifies a tree operator. The consumer specifies a NODE_UNIQUE_NAME restriction together with a tree operator (ANCESTORS, CHILDREN, SIBLINGS, PARENT, DESCENDANTS, or SELF) to obtain the desired set of nodes. The SELF operator includes the row for the node itself in the list of returned rows. The following constants are defined:

DMTREEOP_ANCESTORS 0x00000020

DMTREEOP_CHILDREN 0x00000001

DMTREEOP_SIBLINGS 0x00000002

DMTREEOP_PARENT 0x00000004

DMTREEOP_SELF 0x00000008

DMTREEOP_DESCENDANTS 0x00000010

(These designations comprise a bit mask and may be combined.)
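The following Python sketch shows one way to read the combined bitmask. The constant values are those listed above; the function select_nodes, the sample node names, and the assumption that every node has at most one parent are illustrative only and are not part of this specification.

```python
from enum import IntFlag

class TreeOp(IntFlag):
    CHILDREN = 0x00000001
    SIBLINGS = 0x00000002
    PARENT = 0x00000004
    SELF = 0x00000008
    DESCENDANTS = 0x00000010
    ANCESTORS = 0x00000020

def select_nodes(parent_of, node, op):
    """Return the node names selected by a (possibly combined) tree operator.

    parent_of maps a node's unique name to its parent's unique name
    (None for a root), mirroring NODE_UNIQUE_NAME and PARENT_UNIQUE_NAME.
    """
    selected = set()
    parent = parent_of[node]
    if op & TreeOp.SELF:
        selected.add(node)
    if op & TreeOp.PARENT and parent is not None:
        selected.add(parent)
    if op & TreeOp.CHILDREN:
        selected |= {n for n, p in parent_of.items() if p == node}
    if op & TreeOp.SIBLINGS:
        selected |= {n for n, p in parent_of.items() if p == parent and n != node}
    if op & TreeOp.ANCESTORS:
        current = parent
        while current is not None:
            selected.add(current)
            current = parent_of[current]
    if op & TreeOp.DESCENDANTS:
        frontier = [node]
        while frontier:
            current = frontier.pop()
            kids = [n for n, p in parent_of.items() if p == current]
            selected.update(kids)
            frontier.extend(kids)
    return selected

# Hypothetical model content: a model node, one tree, two interior nodes, one leaf.
parent_of = {"model": None, "tree": "model", "left": "tree", "right": "tree", "leaf": "left"}
print(sorted(select_nodes(parent_of, "left", TreeOp.SELF | TreeOp.CHILDREN)))
```

Combining DMTREEOP_SELF with DMTREEOP_CHILDREN gives the restriction value 0x00000009, which in this sketch selects the node and its immediate children.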

Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, ATTRIBUTE_NAME

Description: The MINING_MODEL_CONTENT schema rowset allows browsing of the content of a data mining model. The user can employ special tree-operation restrictions to navigate the content as a directed acyclic graph.

Column Name


1 MODEL_CATALOG DBTYPE_WSTR The name of the catalog to which this model belongs. NULL if the provider does not support catalogs.

2 MODEL_SCHEMA DBTYPE_WSTR The name of the schema to which this model belongs. NULL if the provider does not support schemas.

3 MODEL_NAME DBTYPE_WSTR Name of the model.

4 ATTRIBUTE_NAME DBTYPE_WSTR Name(s) of the attribute(s) corresponding to this node. For a model node, this would be a list of predictable attributes. For a leaf distribution node, this would be a single attribute that the distribution corresponds to.

5 NODE_NAME DBTYPE_WSTR Name of the node.

6 NODE_UNIQUE_NAME

DBTYPE_WSTR Unique name of the node. For providers that generate unique names by qualification, each component of this name is delimited.

7 NODE_TYPE DBTYPE_I4 The type of the node. Can be one of the following values:

DM_NODE_TYPE_MODEL

DM_NODE_TYPE_TREE

DM_NODE_TYPE_INTERIOR

DM_NODE_TYPE_DISTRIBUTION

DM_NODE_TYPE_CLUSTER

DM_NODE_TYPE_UNKNOWN

8 NODE_GUID DBTYPE_GUID Node GUID. NULL if no GUID.

9 NODE_CAPTION DBTYPE_WSTR A label or a caption associated with the node. Used primarily for display purposes. If a caption does not exist, NODE_NAME is returned.

10 CHILDREN_CARDINALITY

DBTYPE_UI4 Number of children that the node has. This can be an estimate of the number of children. Consumers should not rely on this being the exact count. Providers should return as good an estimate as possible.

11 PARENT_UNIQUE_NAME

DBTYPE_WSTR Unique name of the node's parent. NULL is returned for any nodes at the root level. For providers that generate unique names by qualification, each component of this name is delimited.

12 NODE_DESCRIPTION

DBTYPE_WSTR A human-readable description of the node.

13 NODE_RULE DBTYPE_WSTR An XML description of the rule embedded in the node. The format of the XML string is based on the PMML standard.

14 MARGINAL_RULE DBTYPE_WSTR An XML description of the rule moving to the node from the parent node.

15 NODE_PROBABILITY

DBTYPE_R8 The probability for reaching the node.

16 MARGINAL_PROBABILITY

DBTYPE_R8 The probability of reaching the node from the parent node.

17 NODE_DISTRIBUTION

DBTYPE_HCHAPTER A table containing the probability histogram of the node.

18 NODE_SUPPORT DBTYPE_R8 Number of cases in support of this node.

3.4 Layout of DISTRIBUTION Chapter in MINING_CONTENT Schema Rowset

Number of restriction columns: Not applicable.

Restriction columns: Not applicable.

Default sort order: None.

Description: The DISTRIBUTION column (NODE_DISTRIBUTION in the MINING_MODEL_CONTENT schema rowset) is a nested table, which OLE DB represents as a chapter column. It provides statistical distribution information for the attributes corresponding to the node that the parent row represents. Each attribute will have multiple rows in this table.


1 ATTRIBUTE_NAME DBTYPE_WSTR Name of the attribute.

2 ATTRIBUTE_VALUE

DBTYPE_VARIANT

The attribute value represented as a variant.

3 SUPPORT DBTYPE_R8 The number of cases that support this attribute value.

4 PROBABILITY DBTYPE_R8 Probability of occurrence of this attribute value.

5 VARIANCE DBTYPE_R8 Variance of this attribute value.


6 VALUETYPE DBTYPE_I4 The value type of the attribute. Can be one of the following values:

VALUETYPE_MISSING = 1

VALUETYPE_EXISTING = 2

VALUETYPE_CONTINUOUS = 3

VALUETYPE_DISCRETE = 4

VALUETYPE_DISCRETIZED = 5

VALUETYPE_BOOLEAN = 6

3.5 MINING_SERVICES Schema Rowset

Number of restriction columns: 2

Restriction columns: SERVICE_NAME, SERVICE_TYPE_ID

Default sort order: SERVICE_NAME

Description: The MINING_SERVICES schema rowset exposes the data mining algorithms available from the provider. It can be used to determine the prediction capabilities, complexity, and similar information about the algorithm.

Column Name


1 SERVICE_NAME DBTYPE_WSTR The name of the algorithm. Provider-specific. This will be used as the service identifier in the language. (It is not localizable.)

2 SERVICE_TYPE_ID DBTYPE_UI4 A bitmask that describes mining service types. The following list includes known popular mining service values:

DM_SERVICETYPE_CLASSIFICATION (0x0000001)

DM_SERVICETYPE_CLUSTERING (0x0000002)

DM_SERVICETYPE_ASSOCIATION (0x0000004)



3 SERVICE_DISPLAY_NAME

DBTYPE_WSTR The localizable display name of the algorithm. Provider-specific.

4 SERVICE_GUID DBTYPE_GUID GUID for the algorithm. NULL if no GUID.

5 DESCRIPTION DBTYPE_WSTR Description of the algorithm.

6 PREDICTION_LIMIT DBTYPE_UI4 The maximum number of predictions the model and algorithm can provide; 0 means no limit.

7 SUPPORTED_DISTRIBUTION_FLAGS

DBTYPE_WSTR A comma-delimited list of one or more of the following:

"NORMAL"

"LOG_NORMAL"

"UNIFORM"


8 SUPPORTED_INPUT_CONTENT_TYPES

DBTYPE_WSTR A comma-delimited list of one or more of the following:

"KEY"

"DISCRETE"

"CONTINUOUS"

"DISCRETIZED"

"ORDERED"

"SEQUENCE_TIME"

"CYCLICAL"

"PROBABILITY"

"VARIANCE"

"STDEV"

"SUPPORT"

"PROBABILITY_VARIANCE"

"PROBABILITY_STDEV"

"ORDER"

"SEQUENCE"


9 SUPPORTED_PREDICTION_CONTENT_TYPES

DBTYPE_WSTR A comma-delimited list of one or more of the following:

"DISCRETE"

"CONTINUOUS"

"DISCRETIZED"

"ORDERED"

"SEQUENCE_TIME"

"CYCLICAL"

"PROBABILITY"

"VARIANCE"

"STDEV"

"SUPPORT"

"PROBABILITY_VARIANCE"

"PROBABILITY_STDEV"


10 SUPPORTED_MODELING_FLAGS

DBTYPE_WSTR A comma-delimited list of one or more of the following:

"MODEL_EXISTENCE_ONLY"

"NOT NULL"


11 SUPPORTED_SOURCE_QUERY

DBTYPE_WSTR The <source_data_query> types that the provider supports. This is a comma-delimited list of one or more of the following syntax descriptions that can be used as the source of data for INSERT INTO or that can be PREDICTION JOINED to a DMM for SELECT:

"SINGLETON_CONSTANT"

"SINGLETON_SELECT"

"OPENROWSET"

"SELECT"

"SHAPE"

12 TRAINING_COMPLEXITY

DBTYPE_I4 Indication of expected time for training:

DM_TRAINING_COMPLEXITY_LOW: Running time is proportional to input and is relatively short.

DM_TRAINING_COMPLEXITY_MEDIUM: Running time may be long but is generally proportional to input.

DM_TRAINING_COMPLEXITY_HIGH: Running time is long and may grow exponentially in relationship to input.

13 PREDICTION_COMPLEXITY

DBTYPE_I4 Indication of expected time for prediction:

DM_PREDICTION_COMPLEXITY_LOW: Running time is proportional to input and is relatively short.

DM_PREDICTION_COMPLEXITY_MEDIUM: Running time may be long but is generally proportional to input.

DM_PREDICTION_COMPLEXITY_HIGH: Running time is long and may grow exponentially in relationship to input.

14 EXPECTED_QUALITY

DBTYPE_I4 Indication of expected quality of model produced with this algorithm:

DM_EXPECTED_QUALITY_LOW

DM_EXPECTED_QUALITY_MEDIUM

DM_EXPECTED_QUALITY_HIGH

15 SCALING DBTYPE_I4 Indication of the scalability of the algorithm:

DM_SCALING_LOW

DM_SCALING_MEDIUM

DM_SCALING_HIGH

16 ALLOW_INCREMENTAL_INSERT

DBTYPE_BOOL VARIANT_TRUE if additional INSERT INTO statements are allowed after the initial training.

17 ALLOW_PMML_INITIALIZATION

DBTYPE_BOOL VARIANT_TRUE if the creation of a DMM (including both structure and content) based on an XML string is allowed.

18 CONTROL DBTYPE_I4 One of the following:

DM_CONTROL_NONE

DM_CONTROL_CANCEL

DM_CONTROL_SUSPENDRESUME

DM_CONTROL_SUSPENDWITHRESULT

19 ALLOW_DUPLICATE_KEY

DBTYPE_BOOL VARIANT_TRUE if cases may have duplicate keys.
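Several MINING_SERVICES columns, such as SUPPORTED_SOURCE_QUERY, carry comma-delimited lists. The Python sketch below shows one way a client might test such a value for a given capability. The sample value and the helper name supported_source_queries are hypothetical.

```python
def supported_source_queries(value):
    """Split a comma-delimited SUPPORTED_SOURCE_QUERY value into a set of names."""
    return {item.strip().strip('"') for item in value.split(",") if item.strip()}

# Hypothetical value returned by a provider for one mining service.
row_value = "SINGLETON_SELECT, OPENROWSET, SHAPE"
supported = supported_source_queries(row_value)

print("SHAPE" in supported)
print("SELECT" in supported)
```

Comparing whole list items matters here: a plain substring search for SELECT would also match SINGLETON_SELECT and report a capability that this hypothetical service does not advertise.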

3.6 SERVICE_PARAMETERS Schema Rowset

Number of restriction columns: 2

Restriction columns: SERVICE_NAME, PARAMETER_NAME

Default sort order: SERVICE_NAME, PARAMETER_NAME

Description: The SERVICE_PARAMETERS schema rowset provides a list of parameters that can be supplied when generating a mining model via the CREATE MINING MODEL statement. The client will generally restrict by SERVICE_NAME to obtain the parameters supported by the provider and applicable to the type of mining model being generated.

Column Name


1 SERVICE_NAME DBTYPE_WSTR The name of the algorithm. Provider-specific.

2 PARAMETER_NAME

DBTYPE_WSTR The name of the parameter.

3 PARAMETER_TYPE DBTYPE_WSTR Data type of parameter (DBTYPE).

4 IS_REQUIRED DBTYPE_BOOL If true, the parameter is required.

5 PARAMETER_FLAGS

DBTYPE_UI4 A bitmask that describes parameter characteristics. The following values (or a combination thereof) may be used:

DM_PARAMETER_TRAINING (0x00000001): for training

DM_PARAMETER_PREDICTION (0x00000002): for prediction

6 DESCRIPTION DBTYPE_WSTR Text describing the purpose and format of the parameter.

3.7 MODEL_CONTENT_PMML Schema Rowset

Number of restriction columns: 4

Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, MODEL_TYPE

Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME

Description: The MODEL_CONTENT_PMML schema rowset stores the XML representation of the content of each model. The format of the XML string follows the PMML standard.




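1 MODEL_CATALOG DBTYPE_WSTR Catalog name. NULL if the provider does not support catalogs.

2 MODEL_SCHEMA DBTYPE_WSTR Unqualified schema name. NULL if the provider does not support schemas.

3 MODEL_NAME DBTYPE_WSTR Model name. This column cannot contain NULL.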


4 MODEL_TYPE DBTYPE_WSTR Model type, a provider-specific string. Can be NULL.

5 MODEL_GUID DBTYPE_GUID GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.

6 MODEL_PMML DBTYPE_WSTR An XML representation of the model's content in PMML format.

7 SIZE DBTYPE_UI4 Size of the XML string, in bytes.

8 LOCATION DBTYPE_WSTR The location of the XML file. NULL if the file is stored in the default directory.

4 Appendix B: OLE DB for DM Grammar

4.1 Statements

4.1.1 CREATE MINING MODEL

CREATE MINING MODEL <model> (<column definition list>) USING <algorithm> [(<parameter list>)]

CREATE MINING MODEL <model> FROM PMML <xml string>

Parameters

<model> A unique name for the model.

<column definition list> A comma-separated list of column definitions.

<algorithm> The provider-defined name of a data mining algorithm.

<parameter list> (Optional) A comma-separated list of provider-defined parameters for the algorithm.

<xml string> An XML-encoded model (for advanced use only).

Remarks

The CREATE MINING MODEL statement creates a new mining model based on the column definition list. A column definition is one of the following forms:

<column name> <type> [<content flags>] [<column relation>] [<prediction flag>]

<column name> TABLE [<prediction flag>] (<non-table column definition list>)

<column name> Any valid column identifier.

<type> Any valid SQL type, including LONG, DOUBLE, DATE, TEXT, and TABLE.

<content flags> Content flags are "hints" to the data mining algorithm that provide additional information. Flags appear in the order of the grouping shown here, and flags within the same group cannot appear on the same column.

Distribution Flags

NORMAL The values of the column appear in a normal distribution.

LOG_NORMAL The values of the column appear in a log normal distribution.

UNIFORM The values of the column appear in a uniform distribution.

Type Flags

KEY The column is discrete and is a key. Key columns will not have any other flags except in the case of a nested table with no attribute columns.

CONTINUOUS The column contains values in a continuous range, such as Age or Salary.

DISCRETE The column contains a discrete set of values, such as Gender.

DISCRETIZED The column contains a continuous set of values that should be converted to buckets.

ORDERED The column contains a discrete set of values that are ordered, such as Salary Level.

CYCLICAL The column contains an ordered discrete set of values that are cyclical, such as Day of Week, or Month.

SEQUENCE_TIME The column contains time measurement units.

SEQUENCE The column contains the sorting key of the related columns.

Modeling Flags


MODEL_EXISTENCE_ONLY The column should be modeled as having two states, missing and nonmissing, regardless of the values in the column. This is particularly useful for columns in a nested table, where values are sparse across cases.

NOT NULL The column cannot accept NULL values.

Special Property Flags

These flags indicate a property of another column and do not appear with any other content flags or prediction flags.

PROBABILITY The value in this column is the probability (0–1) of the associated value.

VARIANCE The value in this column is the variance of the associated value.

STDEV The value in this column is the standard deviation of the associated value.


PROBABILITY_VARIANCE The value in this column is the variance of the probability of the associated value.

PROBABILITY_STDEV The value in this column is the standard deviation of the probability of the associated value.

SUPPORT The value in this column is the weight (case replication factor) of the associated value.

<column relation> The column relation appears in two forms: OF <column name> and RELATED TO <column name>.

OF This form is restricted to use for columns with Special Property content flags—for example, ProbGender Double PROBABILITY OF Gender.

RELATED TO This form indicates a value hierarchy. The target of a related to column can be a key column in a nested table, a discretely valued column on the case row, or another column with a RELATED TO clause (indicating a deeper hierarchy). A special target "KEY" is reserved for nested tables with multiple keys and indicates a relation between the value in this column and the composite of all the key columns.

<prediction flag> This flag indicates that the column can be predicted by the model. It takes one of two values.

PREDICT This column can be predicted by the model and it can be supplied in input cases to predict the value of other predictable columns.

PREDICT_ONLY This column can be predicted by the model, but its values cannot be used in input cases to predict the value of other predictable columns.
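As a minimal sketch of how the pieces of a column definition fit together, the following Python code assembles a CREATE MINING MODEL statement as text. The model name, column names, and helper functions are hypothetical; the algorithm name MICROSOFT_DECISION_TREES is the one that appears in the sample BNF in section 4.2. The sketch does not escape a "]" inside a name, which is an additional simplifying assumption.

```python
def column_definition(name, col_type, content=None, relation=None, prediction=None):
    """Assemble one non-table column definition in the documented order:
    <column name> <type> [<content flags>] [<column relation>] [<prediction flag>]
    """
    parts = [f"[{name}]", col_type]
    for optional in (content, relation, prediction):
        if optional:
            parts.append(optional)
    return " ".join(parts)

def create_mining_model(model, columns, algorithm):
    """Build the text of a CREATE MINING MODEL statement."""
    body = ",\n    ".join(columns)
    return f"CREATE MINING MODEL [{model}] (\n    {body}\n) USING {algorithm}"

statement = create_mining_model(
    "Age Prediction",  # hypothetical model name
    [
        column_definition("Customer ID", "LONG", content="KEY"),
        column_definition("Gender", "TEXT", content="DISCRETE"),
        column_definition("Age", "DOUBLE", content="CONTINUOUS", prediction="PREDICT"),
        # A special property column uses the OF form of <column relation>.
        column_definition("Age Probability", "DOUBLE", content="PROBABILITY", relation="OF [Age]"),
    ],
    "MICROSOFT_DECISION_TREES",
)
print(statement)
```

The helper keeps the optional parts in the order given by the syntax above, so a caller cannot accidentally place a prediction flag before a content flag.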

4.1.2 INSERT INTO

INSERT INTO <model> (<mapped model columns>) <source data query>

INSERT INTO <model> (<mapped model columns>) VALUES <constant list>

INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>) <source data query>

Parameters

<model> A model identifier.

<mapped model columns> A comma-separated list of column identifiers and nested identifiers.

<source data query> The source query in the provider-defined format.

Remarks

The INSERT INTO statement inserts training data into the model. The columns from the query are mapped to model columns through the <mapped model columns> section. The keyword SKIP is used to instruct the model to ignore columns that appear in the source data query that are not used in the model.

The INSERT INTO <model>.COLUMN_VALUES form inserts data directly into the model's columns without training the model's algorithm. This allows you to provide column data to the model in a concise, ordered manner, which is useful when dealing with data sets containing hierarchies or ordered columns. The "." operator is used to specify columns that are part of a nested table. When using this form, columns that are part of a relation (either through RELATED TO or by being a KEY in a nested table) cannot be inserted individually and must be inserted together with all the columns in the relation.

The <mapped model columns> section has the following form:

<column identifier> | <table identifier>(<column identifier> | SKIP), …
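The mapped-column form can be illustrated with a short Python sketch that renders a column mapping, including SKIP entries and one nested table. The model name, column names, and helper functions are hypothetical, and the source query is left as a placeholder because its format is provider-defined.

```python
def mapped_columns(columns):
    """Render the <mapped model columns> list.

    Each entry is a column name, the keyword SKIP, or a
    (table name, [nested entries]) pair describing a nested table.
    """
    rendered = []
    for entry in columns:
        if isinstance(entry, tuple):
            table, nested = entry
            rendered.append(f"[{table}]({mapped_columns(nested)})")
        elif entry == "SKIP":
            rendered.append("SKIP")
        else:
            rendered.append(f"[{entry}]")
    return ", ".join(rendered)

def insert_into(model, columns, source_query):
    """Build the text of an INSERT INTO statement for training."""
    return f"INSERT INTO [{model}] ({mapped_columns(columns)}) {source_query}"

statement = insert_into(
    "Age Prediction",
    ["Customer ID", "Gender", "SKIP", "Age",
     ("Product Purchases", ["SKIP", "Product Name"])],
    "<source data query>",  # placeholder for the provider-defined query
)
print(statement)
```

Each SKIP stands in for a source column that the model does not use, so the positions in the mapping continue to line up with the columns returned by the source query.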

4.1.3 SELECT

4.1.3.1 SELECT INTO

SELECT * INTO <new model> USING <algorithm> [(<parameter list>)] FROM <existing model>

Parameters

<new model> A unique name for the new model being created.

<algorithm> The provider-defined name of a data mining algorithm.

<parameter list> (Optional) A comma-separated list of provider-defined parameters for the algorithm.

<existing model> The name of the existing model to be copied.

Remarks

The SELECT INTO statement creates a new mining model by copying schema and other information from an existing mining model. If the existing model is trained, the new model will automatically be trained with the same query; otherwise, the new model will be empty.

4.1.3.2 SELECT FROM CONTENT

SELECT * FROM <model>.CONTENT

Parameters

<model> The name of the model.

Remarks

The SELECT FROM CONTENT statement returns the MINING_MODEL_CONTENT schema rowset for the specified model. See section 3.3 for a description of this rowset.

4.1.3.3 SELECT FROM <MODEL>

SELECT [DISTINCT] <expr list> FROM <model> [WHERE <condition list>]

Parameters


<model> A model identifier.

<expr list> A comma-separated list of related column identifiers or expressions.

<condition list> (Optional) Conditions to restrict the values returned from the column list.

Remarks

The SELECT FROM <model> statement allows you to directly browse the values on which the columns have been trained.

4.1.3.4 SELECT FROM PREDICTION JOIN

SELECT <select expression list> FROM <model> [NATURAL] PREDICTION JOIN <source data query> [ON <join mapping list>] [WHERE <condition expression>]

Parameters

<select expression list> A comma-separated list of column identifiers and other expressions to describe the columns in the results of the query.


<model> A model identifier.

<source data query> The source query in the provider-defined format.

<join mapping list> A logical expression comparing columns from the model to columns from the source query.

<condition expression> (Optional) A condition to restrict the values returned from the column list.

Remarks

The SELECT FROM PREDICTION JOIN syntax allows you to predict columns based on the input data supplied in the PREDICTION JOIN clause. You can use the OLE DB for DM prediction functions, including prediction histograms, prediction probability, sub-SELECT, and so forth, in <select expression list> and <condition expression>. Only the rows that satisfy the condition in the WHERE clause are included in the result.
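The following Python sketch assembles a prediction query as text to show how the clauses fit together. It rests on several assumptions that this section does not state: the source query is given an alias with AS (as in the sample BNF in section 4.2), the join mapping is written as equalities joined by AND, and the NATURAL form is used when no mapping is supplied. All model, column, and helper names are hypothetical.

```python
def prediction_join(select_list, model, source_query, alias, on_pairs=None, where=None):
    """Build the text of a SELECT ... PREDICTION JOIN statement."""
    parts = [f"SELECT {', '.join(select_list)}", f"FROM [{model}]"]
    if on_pairs:
        parts.append(f"PREDICTION JOIN {source_query} AS {alias}")
        # Assumed mapping style: model column = source column, joined by AND.
        mapping = " AND ".join(
            f"[{model}].[{m}] = {alias}.[{s}]" for m, s in on_pairs
        )
        parts.append(f"ON {mapping}")
    else:
        # Assumption: NATURAL lets the provider match columns without an ON clause.
        parts.append(f"NATURAL PREDICTION JOIN {source_query} AS {alias}")
    if where:
        parts.append(f"WHERE {where}")
    return "\n".join(parts)

query = prediction_join(
    ["t.[Customer ID]", "[Age]", "PredictProbability([Age])"],
    "Age Prediction",
    "(<source data query>)",  # placeholder for the provider-defined query
    "t",
    on_pairs=[("Gender", "Gender")],
    where="PredictProbability([Age]) > 0.5",
)
print(query)
```

The WHERE clause in this example filters on a prediction function, which is the kind of condition the remarks above describe.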

4.1.4 DELETE

DELETE * FROM <model>[.CONTENT]

Parameters

<model> A model identifier.


Remarks

Deletes all training data from the model. If CONTENT is specified, only the algorithm training is discarded and the column values are retained.

4.1.5 DROP

DROP MINING MODEL <model>

Parameters

<model> A model identifier.


Remarks

Removes the model and all associated information from the database.

4.2 A Sample BNF

This example BNF is from Microsoft's implementation of an OLE DB for DM provider and does not represent the entire breadth of the grammar described by this document.

<statement> -> <create> | <insert> | <select> | <delete> | <rename>

4.2.1 CREATE

<create> -> <dm_create> | <select_into> | <pmml_create>

<dm_create> -> CREATE MINING MODEL <identifier> ( <col_def_list> ) USING <algorithm> [(<algo_param_list>)]

<pmml_create> -> CREATE MINING MODEL <identifier> FROM PMML <string>

<select_into> -> SELECT * INTO <identifier> USING <algorithm> FROM <identifier>

<col_def_list> -> <col_def> | <col_def_list> , <col_def>

<col_def> -> <col_def_reg> | <col_def_tbl>

<col_def_reg> -> <identifier> <col_type> [<col_distribution>] [<col_binary>] [<col_content>] [<col_content_qual>] [<col_qualif>] [<col_prediction>] [<relation_clause>]

<col_def_tbl> -> <identifier> TABLE <col_prediction> ( <col_def_list> )

<algorithm> -> MICROSOFT_DECISION_TREES | MICROSOFT_CLUSTERING

<algo_param> -> <identifier> = <value>

<algo_param_list> -> <algo_param> | <algo_param>, <algo_param_list>

<col_type> -> LONG | BOOLEAN | TEXT | DOUBLE | DATE

<col_distribution> -> NORMAL | UNIFORM

<col_binary> -> MODEL_EXISTENCE_ONLY | NOT NULL

<col_content> -> DISCRETE | CONTINUOUS | DISCRETIZED( [<disc_method> [, <numeric_const>]] ) | SEQUENCE_TIME

<disc_method> -> AUTOMATIC | EQUAL_AREAS | THRESHOLDS | CLUSTERS

<col_content_qual> -> ORDERED | CYCLICAL

<col_qualif> -> KEY | PROBABILITY | VARIANCE | STDEV | STDDEV | PROBABILITY_VARIANCE | PROBABILITY_STDEV | PROBABILITY_STDDEV | SUPPORT

<col_prediction> -> PREDICT | PREDICT_ONLY

<relation_clause> -> <related_to_clause> | <of_clause>

<related_to_clause> -> RELATED TO <identifier> | RELATED TO KEY

<of_clause> -> OF <identifier> | OF KEY

4.2.2 INSERT

<insert> -> <insert_att> | <insert_reg>

<insert_att> -> INSERT [INTO] <identifier>.COLUMN_VALUES ( <column_ref_list> ) <query>

<insert_reg> -> INSERT [INTO] <identifier> ( <column_ref_list> ) <query>

<query> -> <external_query> | <shape>

<external_query> -> OPENROWSET ( <string>, {<string>|<string>;<string>;<string>}, <string> )

<shape> -> SHAPE { <query> } APPEND <append_list>

<append_list> -> <append> | <append_list> , <append>

<append> -> ( { <query> } RELATE <relate_list> ) AS <identifier>

<relate_list> -> <relate> | <relate_list> , <relate>

<relate> -> <column_ref> TO <column_ref>

4.2.3 SELECT

<column_ref_list> -> <column_ref> | <column_ref_list> , <column_ref>

<column_ref> -> <identifier>
| <identifier>.<column_ref>
| <column_ref> ( <column_ref_list> )
| SKIP
| CLUSTER()
| $SUPPORT | $VARIANCE | $STDEV | $STDDEV
| $PROBABILITY | $PROBABILITY_VARIANCE | $PROBABILITY_STDEV | $PROBABILITY_STDDEV
| $DISTANCE
| PREDICT ( <column_ref> [, <pred_option_list>] )
| <column_ref> AS <identifier>

<pred_option_list> -> <pred_option> | <pred_option_list> , <pred_option>

<pred_option> -> EXCLUDE_NULL | INCLUDE_NULL | INPUT_ONLY | EXCLUSIVE | INCLUSIVE | INCLUDE_STATISTICS

<select> -> <pred_select> | <model_select>

<pred_select> -> SELECT [FLATTENED] <expression_list> FROM <identifier> [NATURAL] PREDICTION JOIN <query> AS <identifier> [ON <on_list>] [<where_clause>]
| SELECT [FLATTENED] <expression_list> FROM <identifier> [NATURAL] PREDICTION JOIN <expression> AS <identifier> [ON <on_list>] [<where_clause>]

<model_select> -> SELECT [DISTINCT] <expression_list> FROM <identifier> [<where_clause>]
| SELECT [DISTINCT] <expression_list> FROM <identifier>.PMML
| SELECT [DISTINCT] <expression_list> FROM <identifier>.CONTENT [<where_clause>]

<expression_list> -> <expression> | <expression_list> , <expression>

<expression> -> <value>
| <column_ref>
| *
| <expression> + <expression> | <expression> - <expression> | <expression> * <expression> | <expression> / <expression>
| -<expression> | +<expression>
| ( <expression> )
| <expression> OR <expression> | <expression> AND <expression> | NOT <expression>
| <expression> = <expression> | <expression> <> <expression>
| <expression> < <expression> | <expression> <= <expression>
| <expression> > <expression> | <expression> >= <expression>
| PREDICTSTDEV ( <column_ref> ) | PREDICTSTDDEV ( <column_ref> )
| PREDICTVARIANCE ( <column_ref> )
| PREDICTSUPPORT ( <column_ref> )
| PREDICTPROBABILITY ( <column_ref> )
| PREDICTPROBABILITYSTDEV ( <column_ref> ) | PREDICTPROBABILITYSTDDEV ( <column_ref> )
| PREDICTPROBABILITYVARIANCE ( <column_ref> )
| CLUSTERDISTANCE ( [<expression>] )
| CLUSTERPROBABILITY ( [<expression>] )
| PREDICTHISTOGRAM ( <column_ref> )
| TOPCOUNT ( <expression>, <column_ref>, <expression> )
| TOPSUM ( <expression>, <column_ref>, <expression> )
| TOPPERCENT ( <expression>, <column_ref>, <expression> )
| BOTTOMCOUNT ( <expression>, <column_ref>, <expression> )
| BOTTOMSUM ( <expression>, <column_ref>, <expression> )
| BOTTOMPERCENT ( <expression>, <column_ref>, <expression> )
| ( SELECT <expression_list> FROM <expression> <where_clause> )
| ( <singleton_list> )
| <expression> AS <identifier>

<singleton_list> -> <singleton> | <singleton_list> UNION <singleton>

<singleton> -> SELECT <expression_list>

<where_clause> -> WHERE <expression>

4.2.4 DELETE/DROP

<delete> -> <delete_reg> | <delete_content>

<delete_reg> -> DELETE * FROM <identifier>

<delete_content> -> DELETE * FROM <identifier>.CONTENT

<drop> -> DROP MINING MODEL <identifier>

4.2.5 RENAME

<rename> -> RENAME MINING MODEL <identifier> TO <identifier>

4.2.6 MISCELLANEOUS

<value> -> <numeric_const> | <string>

<identifier> -> [([^\]]|(\]\]))*] | [a-zA-Z_][a-zA-Z_0-9]*
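The identifier rule can be read as two alternatives: a bracket-delimited name in which a literal "]" is written as "]]", or a plain name that starts with a letter or underscore. The Python sketch below expresses that reading as a regular expression. Treating the outer brackets of the first alternative as literal characters is an interpretation of the rule, and the helper names are hypothetical.

```python
import re

# Bracketed form with ]] as the escape for ], or a plain alphanumeric name.
IDENTIFIER = re.compile(r"\[(?:[^\]]|\]\])*\]|[a-zA-Z_][a-zA-Z_0-9]*")

def is_identifier(text):
    """Return True if the whole string matches the identifier rule."""
    return IDENTIFIER.fullmatch(text) is not None

def unquote(identifier):
    """Strip the delimiters from a bracketed identifier and unescape ]]."""
    if identifier.startswith("["):
        return identifier[1:-1].replace("]]", "]")
    return identifier

for candidate in ["[Products Purchases]", "Age_2", "[a]]b]", "2Age", "[a]b]"]:
    print(candidate, is_identifier(candidate))
```

The bracketed form is what allows names with spaces, such as [Products Purchases], to appear in statements.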

5 Appendix C: Functions

5.1 Predict

Syntax:

Predict(<scalar column reference>, option1, option2, …)

Predict(<table column reference>, option1, option2, …)

Applies To:

Either a scalar column or table column reference.

Return Type:

<scalar column reference> or

<table column reference> depending on which type of column this function is applied to.

Description:

This is a general form of prediction function that modifies the behavior of a prediction (for example, missing value control, association control, and so on). Possible options include EXCLUDE_NULL (default), INCLUDE_NULL, INCLUSIVE, EXCLUSIVE (default), INPUT_ONLY, and INCLUDE_STATISTICS.

Note INCLUSIVE, EXCLUSIVE, INPUT_ONLY, and INCLUDE_STATISTICS apply only to a table column reference, and EXCLUDE_NULL and INCLUDE_NULL apply only to a scalar column reference.

In most cases, the following shorthand will be used:

[Gender] is shorthand for Predict([Gender], EXCLUDE_NULL).

[Products Purchases] is shorthand for Predict([Products Purchases], EXCLUDE_NULL, EXCLUSIVE).

Note The return type of this function is itself regarded as a column reference. This means that this function can be used as an argument in other functions that take a column reference as an argument (except the Predict function itself).

Passing INCLUDE_STATISTICS to a prediction on a TABLE-valued column will add the metacolumns $Probability and $Support to the resulting table. These columns describe the likelihood of existence for the associated nested table record.

5.2 PredictSupport

Syntax:

PredictSupport(<scalar column reference>)

Applies to:

Scalar column

Return Type:

Scalar value

Description:

This function returns the support value for the histogram entry that has the highest probability (the top row in the histogram obtained by PredictHistogram(<column reference>)).

5.3 PredictVariance

Syntax:

PredictVariance(<scalar column reference>)

Applies to:

Scalar column

Return Type:

Scalar value

Description:

This function returns the variance value for the histogram entry that has the highest probability (the top row in the histogram obtained by PredictHistogram(<column reference>)).

5.4 PredictStdev

Syntax:

PredictStdev(<scalar column reference>)

Applies to:

Scalar column

Return Type:

Scalar value

Description:

This function returns the standard deviation for the histogram entry that has the highest probability (the top row in the histogram obtained by PredictHistogram(<column reference>)).

5.5 PredictProbability

Syntax:

PredictProbability(<scalar column reference>)

Applies to:

Scalar column

Return Type:

Scalar value

Description:

This function returns the probability for the histogram entry that has the highest probability (the top row in the histogram obtained by PredictHistogram(<column reference>)).

5.6 PredictProbabilityVariance

Syntax:

PredictProbabilityVariance(<scalar column reference>)

Applies to:

Scalar column

Return Type:

Scalar value

Description:

This function returns the variance of the probability for the histogram entry that has the highest probability (the top row in the histogram obtained by PredictHistogram(<column reference>)).

5.7 PredictProbabilityStdev

Syntax:

PredictProbabilityStdev(<scalar column reference>)

Applies to:

Scalar column

Return Type:

Scalar value

Description:

This function returns the standard deviation of the probability for the histogram entry that has the highest probability (the top row in the histogram obtained by PredictHistogram(<column reference>)).

5.8 Cluster

Syntax:

Cluster

Applies to:

This function does not require any parameter, but it can be used only when the underlying DMM supports clustering.

Return Type:

This function returns a scalar value of cluster identifier. However, if this function is used as an argument of other functions, it must be regarded as a <cluster column reference>.

Description:

This function returns the identifier of the cluster to which the input case belongs with the highest probability. It can also be used as a <cluster column reference> for the PredictHistogram function.

5.9 ClusterDistance

Syntax:

ClusterDistance([<ClusterID expression>])

Applies to:

This function can be used only when the underlying DMM supports clustering.

Return Type:

Scalar value.

Description:

This function returns the distance between the input case and the center of the cluster that has the highest probability. If <ClusterID expression> is given, the cluster is identified by the evaluation of the expression.

5.10 ClusterProbability

Syntax:

ClusterProbability([<ClusterID expression>])

Applies to:

This function can be used only when the underlying DMM supports clustering.

Return Type:

Scalar value.

Description:

This function returns the probability that the input case belongs to the cluster that has the highest probability. If <ClusterID expression> is given, the cluster is identified by the evaluation of the expression.

5.11 PredictHistogram

Syntax:

PredictHistogram(<scalar column reference>)

PredictHistogram(<cluster column reference>)

Applies to:

A scalar or cluster column reference.

Return Type:

<table expression>

Description:

This function returns a table representing a histogram for prediction of the given column. A histogram generates statistics columns. For a <scalar column reference>, a histogram consists of the following seven columns:

The column being predicted

$Support

$Variance

$Stdev (standard deviation)

$Probability

$ProbabilityVariance

$ProbabilityStdev

A histogram for a <cluster column reference> consists of the following columns:

Cluster to represent the cluster identifier

$Distance

$Probability

$Support
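As a rough illustration of the shape of a scalar histogram, the following Python sketch builds only the predicted column, $Support, and $Probability from per-state support counts. Computing $Probability as support divided by total support is an assumption (a provider may smooth or adjust it), and the variance and standard deviation columns are left out because their computation is not defined here. The function name is hypothetical; the supports are taken from one leaf of the credit-risk example in Appendix D.

```python
from typing import Dict, List


def predict_histogram(supports: Dict[str, float], column: str) -> List[dict]:
    """Build histogram rows from per-state support counts.

    Assumption: $Probability = state support / total support.
    """
    total = sum(supports.values())
    rows = []
    for state, support in supports.items():
        rows.append({
            column: state,
            "$Support": support,
            "$Probability": support / total if total else 0.0,
        })
    return rows


# Supports of the leaf "Education ne High School" in the Appendix D example.
histogram = predict_histogram({"Bad": 60.0, "Good": 9.0}, "Credit")
for row in histogram:
    print(row)
```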

5.12 TopCount

Syntax:

TopCount(<table expression>, <rank expression>, <n-items>)

Applies to:

A table-returning expression that includes <table column reference> and functions that return a table.

Return Type:

<table expr>

Description:

This function returns the first <n-items> rows of <table expression> in decreasing order of <rank expression>. For example, suppose that a table expression (here, a sub-SELECT) returns the following table:

(SELECT [Product Name], $Probability AS [Probability] FROM Predict([Product Purchases], INCLUDE_STATISTICS))

Product Name    Probability
Apples          0.4
Kiwi            0.1
Oranges         0.5
Lemons          0.2

If so, the function TopCount((SELECT …), [Probability], 2) returns the following table:

Product Name    Probability
Oranges         0.5
Apples          0.4
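The ranking behavior of TopCount can be modeled with a few lines of Python. This is a sketch of the semantics described above, not provider code: the function name `top_count` and the row representation (a list of dictionaries) are assumptions, and so is the tie-breaking rule, which the specification does not state.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def top_count(rows: List[Row], rank: Callable[[Row], float], n_items: int) -> List[Row]:
    """Return the n_items rows with the largest rank value, in decreasing order."""
    # sorted() is stable, so rows with equal rank keep their input order
    # (an assumption; the specification does not say how ties are broken).
    return sorted(rows, key=rank, reverse=True)[:n_items]


products = [
    {"Product Name": "Apples", "Probability": 0.4},
    {"Product Name": "Kiwi", "Probability": 0.1},
    {"Product Name": "Oranges", "Probability": 0.5},
    {"Product Name": "Lemons", "Probability": 0.2},
]

top_two = top_count(products, lambda r: r["Probability"], 2)
print([r["Product Name"] for r in top_two])
```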

5.13 TopSum

Syntax:

TopSum(<table expression>, <rank expression>, <sum>)

Applies to:


Return Type:

<table expr>

Description:

This function returns the first N rows in decreasing order of <rank expression>, such that the sum of the <rank expression> values is at least <sum>. TopSum returns the smallest number of rows that meets this criterion. For example, a table column named [Products] might contain the following table:

Product Name    Unit Sales
Apples          1200
Kiwi            500
Oranges         1500
Lemons          750

If so, TopSum([Products], [Unit Sales], 2500) would return the following table:

Product Name    Unit Sales
Oranges         1500
Apples          1200
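A minimal Python sketch of the TopSum rule follows: rows are taken in decreasing rank order until the running total reaches the requested sum. The function name and row representation are assumptions, and returning every row when the total never reaches the requested sum is also an assumption, since the specification does not cover that case.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def top_sum(rows: List[Row], rank: Callable[[Row], float], target_sum: float) -> List[Row]:
    """Return the shortest prefix, in decreasing rank order, whose rank total is >= target_sum."""
    selected: List[Row] = []
    running = 0.0
    for row in sorted(rows, key=rank, reverse=True):
        if running >= target_sum:
            break  # the criterion is already met; stop adding rows
        selected.append(row)
        running += rank(row)
    return selected


products = [
    {"Product Name": "Apples", "Unit Sales": 1200},
    {"Product Name": "Kiwi", "Unit Sales": 500},
    {"Product Name": "Oranges", "Unit Sales": 1500},
    {"Product Name": "Lemons", "Unit Sales": 750},
]

result = top_sum(products, lambda r: r["Unit Sales"], 2500)
print([(r["Product Name"], r["Unit Sales"]) for r in result])
```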

5.14 TopPercent

Syntax:

TopPercent(<table expression>, <rank expression>, <percent>)

Applies to:


Return Type:

<table expr>

Description:

This function returns the first N rows in decreasing order of <rank expression>, such that the sum of the <rank expression> values is at least the given percentage of the total sum of <rank expression> values. TopPercent returns the smallest number of rows that meets this criterion.

Using a table column named [Products], as shown here:

Product Name    Unit Sales
Apples          30
Kiwi            10
Oranges         40
Lemons          20

TopPercent([Products], [Unit Sales], 60) would return the following table:

Product Name    Unit Sales
Oranges         40
Apples          30

Note that Apples is selected rather than Lemons. Rows are taken in decreasing order of [Unit Sales], so Apples (30) precedes Lemons (20), even though Oranges and Lemons together would also reach 60 percent.
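The TopPercent rule differs from TopSum only in how the threshold is derived. The following Python sketch, with a hypothetical function name and row representation, converts the percentage into an absolute threshold and then takes rows in decreasing rank order.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def top_percent(rows: List[Row], rank: Callable[[Row], float], percent: float) -> List[Row]:
    """Return the shortest prefix, in decreasing rank order, that covers percent of the total."""
    total = sum(rank(row) for row in rows)
    threshold = total * percent / 100.0
    selected: List[Row] = []
    running = 0.0
    for row in sorted(rows, key=rank, reverse=True):
        if running >= threshold:
            break  # the requested share of the total is already covered
        selected.append(row)
        running += rank(row)
    return selected


products = [
    {"Product Name": "Apples", "Unit Sales": 30},
    {"Product Name": "Kiwi", "Unit Sales": 10},
    {"Product Name": "Oranges", "Unit Sales": 40},
    {"Product Name": "Lemons", "Unit Sales": 20},
]

result = top_percent(products, lambda r: r["Unit Sales"], 60)
print([r["Product Name"] for r in result])
```

Because rows are consumed strictly in rank order, the sketch never skips Apples in favor of Lemons, which matches the example above.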

5.15 Sub-SELECT

Syntax:

(SELECT <SELECT-expressions> FROM <table expression> [WHERE <WHERE-clause>])

Applies to:


Return Type:

<table expr>

Description:

A sub-SELECT selects columns (generally speaking, expressions containing columns) from the given table-returning expression. Users also can specify a WHERE clause to filter out undesired rows.

5.16 RangeMid

Syntax:

RangeMid(<scalar column reference>)

Applies to:

Discretized scalar columns

Return Type:

Scalar value

Description:

This function returns the midpoint of the predicted bucket that was discovered for a discretized column.

5.17 RangeMin

Syntax:

RangeMin(<scalar column reference>)

Applies To:

Discretized scalar columns

Return Type:

Scalar value

Description:

This function returns the lower end of the predicted bucket that was discovered for a discretized column.

5.18 RangeMax

Syntax:

RangeMax(<scalar column reference>)

Applies To:

Discretized scalar columns

Return Type:

Scalar value

Description:

This function returns the upper end of the predicted bucket that was discovered for a discretized column.
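The relationship among RangeMin, RangeMid, and RangeMax can be shown with a small Python sketch. The `Bucket` type, the function names, and the sample bucket are hypothetical, and taking the midpoint as the arithmetic mean of the two ends is an assumption about what "midpoint" means for a provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    """A discretization bucket with a lower and an upper end (hypothetical type)."""
    lower: float
    upper: float


def range_min(bucket: Bucket) -> float:
    """Lower end of the predicted bucket."""
    return bucket.lower


def range_max(bucket: Bucket) -> float:
    """Upper end of the predicted bucket."""
    return bucket.upper


def range_mid(bucket: Bucket) -> float:
    """Midpoint of the predicted bucket, taken here as the arithmetic mean of its ends."""
    return (bucket.lower + bucket.upper) / 2.0


# Hypothetical predicted bucket for a discretized column.
predicted = Bucket(lower=30.0, upper=40.0)
print(range_min(predicted), range_mid(predicted), range_max(predicted))
```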

5.19 PredictScore

Syntax:

PredictScore(<scalar column reference>)

PredictScore(<table column reference>)

Applies To:

Predictable columns

Return Type:

Scalar value

Description:

This function returns the prediction score of the specified column.

5.20 PredictNodeId

Syntax:

PredictNodeId(<scalar column reference>)

Applies To:

Predictable columns (except table columns or predictable columns in nested table).

Return Type:

Scalar value

Description:

This function returns the node id of the tree leaf node in which the case is classified.

6 Appendix D: XML Format for Data Mining Models

DMMs are represented in XML using a variation of the Predictive Model Markup Language (PMML) version 1.0. The additions to PMML 1.0 include the following:

Support for the nested table nature of a DMM through nested Data Dictionaries.

Discretized, Ordered, and Cyclical model variables, in addition to the simple Categorical and Continuous types.

Support for Key columns in nested dictionaries that list instances as categories.

Support for Relation type columns as "hierarchy parents."

Any model variable can describe a missing state, even one with a continuous domain.

The data dictionary is no longer a complete list of all attributes; rather, it is an "attribute factory." Any attribute reference outside the data dictionary must "instantiate" a model variable by locating it in the data dictionary hierarchy.

Because of the previous point, it is no longer sufficient to reference a model variable (called an attribute) as an attribute (in XML terms) of a tag. Instead, the reference must be a property (a nested tag) that describes the variable instance.

Statistics on the global distribution of the model variables have been separated out into a new section.

It is expected that most of these changes will simply become part of PMML version 1.1.

6.1 DTD for the DMM Extended PMML

<?xml encoding="UTF-8"?>

<!ENTITY % predicates "(predicate | compound-predicate | true | false)">

<!ENTITY % NUMBER "NMTOKEN">





<!ELEMENT pmml (head?, statements?, data-dictionary, global-statistics?, (tree-model | segment-model | regression-model)+)>

<!ATTLIST pmml version CDATA #REQUIRED name CDATA #IMPLIED GUID CDATA #IMPLIED Modified-time CDATA #IMPLIED Creation-time CDATA #IMPLIED

>





<!ELEMENT head (application?, annotation*, timestamp?, datasrc?)>

<!ATTLIST head copyright CDATA #REQUIRED description CDATA #IMPLIED>

<!ELEMENT timestamp (#PCDATA)>

<!ELEMENT application EMPTY>

<!ATTLIST application name CDATA #REQUIRED version CDATA #IMPLIED>

<!ELEMENT annotation (#PCDATA)>



<!ELEMENT datasrc EMPTY>

<!ATTLIST datasrc src CDATA #REQUIRED query CDATA #REQUIRED>





<!ELEMENT statements (statement+)>

<!ELEMENT statement EMPTY>

<!ATTLIST statement type CDATA #REQUIRED value CDATA #REQUIRED>





<!ELEMENT data-dictionary (compound-categories? , (categorical | ordinal | continuous | categorical-continuous | data-dictionary | key | hierarchy-parent)+)>

<!ATTLIST data-dictionary name CDATA #IMPLIED>



<!ELEMENT key (category+)><!ATTLIST key name CDATA #REQUIRED ispredict ( true | false ) "false" isinput ( true | false ) "false" datatype CDATA #IMPLIED>



<!ELEMENT hierarchy-parent ((relates-to | category)+)><!ATTLIST hierarchy-parent name CDATA #REQUIRED ispredict ( true | false ) "false" isinput ( true | false ) "false" datatype CDATA #IMPLIED>



<!ELEMENT relates-to EMPTY><!ATTLIST relates-to name CDATA #REQUIRED

>

 <!ELEMENT categorical (category+)><!ATTLIST categorical name CDATA #REQUIRED ispredict ( true | false ) "false" isinput ( true | false ) "false" datatype CDATA #IMPLIED>


<!--

1. Allow the category ELEMENT to contain the parent ELEMENT, which specifies the hierarchical parent(s).

2. Relax the value ATTRIBUTE to be an optional ATTRIBUTE. A missing state does not need a value.

3. Added "uninformative" to the possible states of the missing ATTRIBUTE. A value can be present, missing at random, or missing informative. missing = "true" is equivalent to missing informative.

-->

<!ELEMENT category (parent*)><!ATTLIST category value CDATA #IMPLIED display-value CDATA #IMPLIED proportion CDATA #IMPLIED missing (true | false | uninformative) "false">



<!ELEMENT parent EMPTY><!ATTLIST parent name CDATA #REQUIRED value CDATA #REQUIRED>



<!ELEMENT ordinal (order+)><!ATTLIST ordinal name CDATA #REQUIRED cyclical ( true | false ) "false" timesequence ( true | false ) "false">


<!--

1. Relax the value ATTRIBUTE to be an optional ATTRIBUTE. A missing state does not need a value.

2. Added "uninformative" to the possible states of the missing ATTRIBUTE. A value can be present, missing at random, or missing informative. missing = "true" is equivalent to missing informative.

3. Relax the rank ATTRIBUTE to be an optional ATTRIBUTE. The states are implied to be ordered if rank is not specified for any of them.

-->

<!ELEMENT order EMPTY><!ATTLIST order value CDATA #IMPLIED display-value CDATA #IMPLIED rank CDATA #IMPLIED proportion CDATA #IMPLIED missing (true | false | uninformative) "false">





<!ELEMENT continuous (category?, (%predicates;)*)><!ATTLIST continuous name CDATA #REQUIRED minimum CDATA #IMPLIED maximum CDATA #IMPLIED mean CDATA #IMPLIED median CDATA #IMPLIED standard-deviation CDATA #IMPLIED inter-quartile-range CDATA #IMPLIED ispredict ( true | false ) "false" isinput ( true | false ) "false" datatype CDATA #IMPLIED>



<!ELEMENT categorical-continuous (category?, (%predicates;)*)><!ATTLIST categorical-continuous name CDATA #REQUIRED ispredict ( true | false ) "false" isinput ( true | false ) "false" datatype CDATA #IMPLIED

>


<!--

1. Define the compound-categories ELEMENT, which contains a set of compound-category ELEMENTs listing all the valid combinations of multiple keys.

2. Define the compound-category ELEMENT, which contains a combination of valid keys and its hierarchical parents.

3. Define the categoryref ELEMENT, which refers to an existing key.

-->

<!ELEMENT compound-categories (compound-category+)>

<!ELEMENT compound-category ( categoryref | parent )+>

<!ELEMENT categoryref EMPTY><!ATTLIST categoryref name CDATA #REQUIRED value CDATA #REQUIRED>



<!ELEMENT simple-attribute EMPTY><!ATTLIST simple-attribute name CDATA #REQUIRED>



<!ELEMENT compound-attribute (key-val+ , (simple-attribute | compound-attribute)?)><!ATTLIST compound-attribute name CDATA #REQUIRED>

<!ELEMENT derived-attribute ((simple-attribute | compound-attribute)+)>

<!ATTLIST derived-attribute index CDATA #REQUIRED>

<!ELEMENT key-val EMPTY><!ATTLIST key-val name CDATA #REQUIRED value CDATA #REQUIRED>



<!ENTITY % attribute "(simple-attribute | compound-attribute | derived-attribute)">


<!--

1. Define the global-statistics ELEMENT, which contains a list of data-distribution ELEMENTs.

2. Define the data-distribution ELEMENT, which contains the sufficient statistics for a given attribute.

3. Define the state ELEMENT that specifies the statistics of a given state.

-->



<!ELEMENT global-statistics (data-distribution+)><!ELEMENT data-distribution (%attribute;, state+)><!ELEMENT state EMPTY><!ATTLIST state value CDATA #IMPLIED missing ( true | false | uninformative) "false" minimum CDATA #IMPLIED maximum CDATA #IMPLIED mean CDATA #IMPLIED median CDATA #IMPLIED standard-deviation CDATA #IMPLIED inter-quartile-range CDATA #IMPLIED support CDATA #IMPLIED proportion CDATA #IMPLIED>




<!--

1. Allow the tree-model to contain more than one tree.

2. Relax the criteria so that a tree does not need a model-ID.

-->

<!ELEMENT tree-model (node+)><!ATTLIST tree-model model-id CDATA #IMPLIED>


<!--

1. Allow the node to contain the targets ELEMENT, which specifies the target of the prediction tree.

2. The root node does not need any arriving predicates and contains all of the pertaining information of that tree.

-->

<!ELEMENT node (targets?, (%predicates;)?, info*, node*, score-distribution*, data-distribution*)>

<!ATTLIST node score CDATA #IMPLIED>


<!--

Define the targets ELEMENT.

-->

<!ELEMENT targets ((%attribute;)+)>

<!ELEMENT score-distribution EMPTY><!ATTLIST score-distribution label CDATA #REQUIRED

value CDATA #REQUIRED>

<!ELEMENT info EMPTY><!ATTLIST info name CDATA #REQUIRED value CDATA #REQUIRED>

<!ELEMENT compound-predicate (%predicates;, (%predicates;)+)><!ATTLIST compound-predicate bool-op (or | and | xor | cascade) #REQUIRED>



<!ELEMENT predicate (%attribute;)><!ATTLIST predicate attribute CDATA #IMPLIED op (eq | ne | lt | le | gt | ge) #REQUIRED value CDATA #REQUIRED>

<!ELEMENT true EMPTY>

<!ELEMENT false EMPTY>


<!--

1. Define the segment-model ELEMENT.

2. The segment-model contains a list of nodes, which are the cluster points.

3. The cluster points contain a list of data-distribution for all of the attributes.

-->



<!ELEMENT segment-model (info*, node+)>



<!ELEMENT regression-model (factor-list?, covariate-list?, predictor-to-parameter-correlation-matrix?, parameter-table)> <!ATTLIST regression-model model-id CDATA #REQUIRED response-variable-name CDATA #REQUIRED number-parameters %NUMBER; #REQUIRED model-type (regression | general-linear | log-linear | multinomial-logistic)

#REQUIRED verbose-model-specification CDATA #IMPLIED>

<!ELEMENT factor-list (var-name+)>

<!ELEMENT covariate-list (var-name+)>

<!ELEMENT var-name (#PCDATA)>

<!ELEMENT predictor-to-parameter-correlation-matrix (predictor-to-parameter-cell+)>

<!ELEMENT predictor-to-parameter-cell (#PCDATA)>

<!ATTLIST predictor-to-parameter-cell predictor-name CDATA #REQUIRED parameter-name CDATA #REQUIRED>

<!ELEMENT parameter-table (parameter-cell+)>

<!ELEMENT parameter-cell EMPTY>

<!ATTLIST parameter-cell target-category CDATA #REQUIRED parameter-name CDATA #REQUIRED beta %NUMBER; #REQUIRED std-error %NUMBER; #IMPLIED df %NUMBER; #IMPLIED>

6.2 Example: Tree Model to Predict Credit Risk

<?xml version="1.0"?>
<pmml>
  <statements>
    <statement type = "CREATE" value = "Create Mining Model CreditTree1( ID long key, Credit text discrete predict, Education text discrete, Age text discrete, Pay text discrete) using microsoft_decision_trees"/>
    <statement type = "TRAIN" value = "Insert Into CreditTree1( ID, Credit, Education, Age, Pay) OPENROWSET(&quot;Microsoft.Jet.OLEDB.4.0&quot;, &quot;data source=w:\test\demozero\credit.mdb&quot;, &quot;SELECT ID, Credit, Education, Age , Pay FROM CreditTraining&quot;)"/>
  </statements>
  <data-dictionary name = "CreditTree1" GUID = "{707D31A7-D42A-11D3-8AEF-00C04F68DDCA}">
    <key name = "ID" datatype = "LONG"/>
    <categorical name = "Credit" isinput = "true" ispredict = "true" datatype = "TEXT">
      <category missing = "true"/>
      <category value = "Bad"/>
      <category value = "Good"/>
    </categorical>
    <categorical name = "Education" isinput = "true" datatype = "TEXT">
      <category missing = "true"/>
      <category value = "Bachelor"/>
      <category value = "High School"/>
      <category value = "Graduate"/>
      <category value = "Partial College"/>
      <category value = "Partial High School"/>
    </categorical>
    <categorical name = "Age" isinput = "true" datatype = "TEXT">
      <category missing = "true"/>
      <category value = "Middle Age"/>
      <category value = "Young"/>
      <category value = "Old"/>
    </categorical>
    <categorical name = "Pay" isinput = "true" datatype = "TEXT">
      <category missing = "true"/>
      <category value = "Weekly pay"/>
      <category value = "Monthly salary"/>
    </categorical>
  </data-dictionary>
  <global-statistics>
    <data-distribution>
      <simple-attribute name = "Credit"/>
      <state missing = "true" support = "0."/>
      <state value = "Bad" support = "114."/>
      <state value = "Good" support = "109."/>
    </data-distribution>
    <data-distribution>
      <simple-attribute name = "Education"/>
      <state missing = "true" support = "0."/>
      <state value = "Bachelor" support = "109."/>
      <state value = "High School" support = "24."/>
      <state value = "Graduate" support = "28."/>
      <state value = "Partial College" support = "34."/>
      <state value = "Partial High School" support = "28."/>
    </data-distribution>
    <data-distribution>
      <simple-attribute name = "Age"/>
      <state missing = "true" support = "0."/>
      <state value = "Middle Age" support = "55."/>
      <state value = "Young" support = "126."/>
      <state value = "Old" support = "42."/>
    </data-distribution>
    <data-distribution>
      <simple-attribute name = "Pay"/>
      <state missing = "true" support = "0."/>
      <state value = "Weekly pay" support = "114."/>
      <state value = "Monthly salary" support = "109."/>
    </data-distribution>
  </global-statistics>
  <tree-model>
    <info name = "Scorer" value = "4"/>
    <info name = "Splitter" value = "1"/>
    <info name = "Minimum Leaf Cases" value = "10"/>
    <info name = "Number of ESS" value = "16"/>
    <info name = "Complexity Penalty" value = "0.80000000000000004"/>
    <node>
      <targets>
        <target><simple-attribute name = "Credit"/></target>
      </targets>
      <node missing = "false">
        <predicate op = "eq" value = "Weekly pay"><simple-attribute name = "Pay"/></predicate>
        <node missing = "false">
          <predicate op = "eq" value = "Young"><simple-attribute name = "Age"/></predicate>
          <node missing = "false">
            <predicate op = "eq" value = "High School"><simple-attribute name = "Education"/></predicate>
            <data-distribution>
              <simple-attribute name = "Credit"/>
              <state missing = "true" support = "0."/>
              <state value = "Bad" support = "24."/>
              <state value = "Good" support = "0."/>
            </data-distribution>
          </node>
          <node missing = "false">
            <predicate op = "ne" value = "High School"><simple-attribute name = "Education"/></predicate>
            <data-distribution>
              <simple-attribute name = "Credit"/>
              <state missing = "true" support = "0."/>
              <state value = "Bad" support = "60."/>
              <state value = "Good" support = "9."/>
            </data-distribution>
          </node>
        </node>
        <node missing = "false">
          <predicate op = "ne" value = "Young"><simple-attribute name = "Age"/></predicate>
          <data-distribution>
            <simple-attribute name = "Credit"/>
            <state missing = "true" support = "0."/>
            <state value = "Bad" support = "13."/>
            <state value = "Good" support = "8."/>
          </data-distribution>
        </node>
      </node>
      <node missing = "false">
        <predicate op = "ne" value = "Weekly pay"><simple-attribute name = "Pay"/></predicate>
        <node missing = "false">
          <predicate op = "eq" value = "Young"><simple-attribute name = "Age"/></predicate>
          <data-distribution>
            <simple-attribute name = "Credit"/>
            <state missing = "true" support = "0."/>
            <state value = "Bad" support = "16."/>
            <state value = "Good" support = "17."/>
          </data-distribution>
        </node>
        <node missing = "false">
          <predicate op = "ne" value = "Young"><simple-attribute name = "Age"/></predicate>
          <node missing = "false">
            <predicate op = "eq" value = "Bachelor"><simple-attribute name = "Education"/></predicate>
            <data-distribution>
              <simple-attribute name = "Credit"/>
              <state missing = "true" support = "0."/>
              <state value = "Bad" support = "1."/>
              <state value = "Good" support = "52."/>
            </data-distribution>
          </node>
          <node missing = "false">
            <predicate op = "ne" value = "Bachelor"><simple-attribute name = "Education"/></predicate>
            <data-distribution>
              <simple-attribute name = "Credit"/>
              <state missing = "true" support = "0."/>
              <state value = "Bad" support = "0."/>
              <state value = "Good" support = "23."/>
            </data-distribution>
          </node>
        </node>
      </node>
    </node>
  </tree-model>
</pmml>
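To show how such a tree might be consumed, the following Python sketch parses an excerpt of the example (the "Pay ne Weekly pay" branch only, with the missing-state entries and the missing attributes dropped for brevity) and walks the eq and ne predicates to the leaf that a case falls into. The function names are hypothetical, only simple-attribute predicates with the eq and ne operators are handled, and the standard library `xml.etree.ElementTree` module is used for parsing.

```python
import xml.etree.ElementTree as ET

# Excerpt of the credit-risk tree: the "Pay ne Weekly pay" branch only.
TREE_XML = """
<node>
  <node>
    <predicate op="ne" value="Weekly pay"><simple-attribute name="Pay"/></predicate>
    <node>
      <predicate op="eq" value="Young"><simple-attribute name="Age"/></predicate>
      <data-distribution>
        <simple-attribute name="Credit"/>
        <state value="Bad" support="16."/>
        <state value="Good" support="17."/>
      </data-distribution>
    </node>
    <node>
      <predicate op="ne" value="Young"><simple-attribute name="Age"/></predicate>
      <node>
        <predicate op="eq" value="Bachelor"><simple-attribute name="Education"/></predicate>
        <data-distribution>
          <simple-attribute name="Credit"/>
          <state value="Bad" support="1."/>
          <state value="Good" support="52."/>
        </data-distribution>
      </node>
      <node>
        <predicate op="ne" value="Bachelor"><simple-attribute name="Education"/></predicate>
        <data-distribution>
          <simple-attribute name="Credit"/>
          <state value="Bad" support="0."/>
          <state value="Good" support="23."/>
        </data-distribution>
      </node>
    </node>
  </node>
</node>
"""


def matches(node, case):
    """Check whether the case satisfies the node's arriving predicate, if it has one."""
    predicate = node.find("predicate")
    if predicate is None:
        return True  # the root node has no arriving predicate
    attribute = predicate.find("simple-attribute").get("name")
    op, value = predicate.get("op"), predicate.get("value")
    if op == "eq":
        return case.get(attribute) == value
    if op == "ne":
        return case.get(attribute) != value
    raise ValueError("operator not handled in this sketch: " + str(op))


def classify(node, case):
    """Descend to the leaf reached by the case and return its {state: support} map."""
    for child in node.findall("node"):
        if matches(child, case):
            return classify(child, case)
    distribution = node.find("data-distribution")
    if distribution is None:
        raise ValueError("no branch of this excerpt matches the case")
    return {s.get("value"): float(s.get("support"))
            for s in distribution.findall("state")}


root = ET.fromstring(TREE_XML.strip())
leaf = classify(root, {"Pay": "Monthly salary", "Age": "Old", "Education": "Bachelor"})
print(leaf)
```

The supports in the returned leaf are the raw counts from which a provider could derive a prediction and its statistics.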

7 Appendix E: Provider Support for SHAPE Syntax

The complete syntax of the SHAPE command is documented in the Microsoft Data Access Component SDK. This appendix describes the subset of that syntax needed to shape multiple result sets into a single nested table. Data mining providers should provide support for this subset, at a minimum. Following is the basic syntax:

SHAPE {<master query>}
APPEND ({<child table query>} RELATE <master column> TO <child column>) AS <column table name>
[APPEND ({<child table query>} RELATE <master column> TO <child column>) AS <column table name>
… ]

The SHAPE statement adds table columns to a master query by specifying the child table rows and how each row in <master query> is matched to its child rows in <child table query>.

Using this syntax, you can now read all of the data needed for the cases from multiple queries and shape these into a single table that is fed into the DMM.

The following example illustrates how this is done:

INSERT INTO [Age Prediction] (
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Car Name], [Car Probability])
)
SHAPE
    {select [Customer ID], [Gender], [Age], [Age Probability] from [Customers] order by [Customer ID]}
APPEND
    ({select [CustID], [Product Name], [Product Type], [Quantity] from [Customer Product Sales] order by [CustID]}
        RELATE [Customer ID] TO [CustID]) AS [Product Purchases],
    ({select [CustID], [Car Name], [Probability] from [Customer Cars] order by [CustID]}
        RELATE [Customer ID] TO [CustID]) AS [Car Ownership]

Following are important notes:

The SHAPE statement has a rich syntax, and DM providers are encouraged to support as much of it as possible. At a minimum, DM providers should support the syntax described in this appendix.

The column binding between the target DMM and the source query is done by column order, as is standard with the INSERT INTO statement.

Table columns ("Product Purchases" and "Car Ownership") are listed in the source columns, although they are mapped into whole tables and not to single columns.

The columns in the child query used for the relation (in the RELATE clause) are skipped by using the SKIP keyword in the column map and not mapped into any of the columns contained in the target table-column.

A DM provider may (and usually will) mandate that the relation columns in the child queries be ordered the same as the key column in the master query.
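The effect of shaping can be modeled in Python as grouping child rows under their master row. This is a sketch of the idea only, not of any provider's implementation: the function name `shape`, the in-memory row format, and the sample rows are all hypothetical. Dropping the relation column from each child row plays the role of the SKIP keyword.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


def shape(master_rows: List[Row], master_key: str,
          children: Dict[str, Tuple[List[Row], str]]) -> List[Row]:
    """Attach each child table to its master row as a nested table column.

    children maps a table column name to (child rows, child relation column).
    """
    indexes = {}
    for name, (rows, child_key) in children.items():
        index = defaultdict(list)
        for row in rows:
            # The relation column is used only for matching, like SKIP in the column map.
            index[row[child_key]].append({k: v for k, v in row.items() if k != child_key})
        indexes[name] = index

    shaped = []
    for master in master_rows:
        case = dict(master)
        for name, index in indexes.items():
            case[name] = index.get(master[master_key], [])
        shaped.append(case)
    return shaped


# Hypothetical sample data.
customers = [
    {"Customer ID": 1, "Gender": "Male", "Age": 35},
    {"Customer ID": 2, "Gender": "Female", "Age": 28},
]
purchases = [
    {"CustID": 1, "Product Name": "TV", "Quantity": 1},
    {"CustID": 1, "Product Name": "Beer", "Quantity": 6},
]

cases = shape(customers, "Customer ID", {"Product Purchases": (purchases, "CustID")})
for case in cases:
    print(case)
```

A master row with no matching child rows receives an empty nested table in this sketch; how a provider treats that situation is not stated here.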

8 Appendix F: Provider Support for OPENROWSET Syntax

The complete documentation of the OPENROWSET command is found in the Microsoft SQL Server® Programmer's Toolkit. This appendix provides an abbreviated version of that. Data mining providers should provide support for OPENROWSET to be used for the <source data query> in INSERT INTO and PREDICT commands.

OPENROWSET('provider_name', { 'datasource';'user_id';'password' | 'provider_string' }, { 'query' })

'provider_name'

A character string that represents the friendly name of the OLE DB provider as specified in the registry. provider_name has no default value.

'datasource'

A string constant that corresponds to a particular OLE DB data source object. datasource is the DBPROP_INIT_DATASOURCE property passed to the provider's IDBProperties interface to initialize the provider. Typically, this string includes the name of the database file, the name of a database server, or a name that the provider understands to locate the database(s).

'user_id'

A string constant that is the user name passed to the specified OLE DB provider. user_id specifies the security context for the connection and is passed in as the DBPROP_AUTH_USERID property to initialize the provider.

'password'

A string constant that is the user password passed to the OLE DB provider. password is passed in as the DBPROP_AUTH_PASSWORD property when initializing the provider.

'provider_string'

A provider-specific connection string that is passed in as the DBPROP_INIT_PROVIDERSTRING property to initialize the OLE DB provider. provider_string typically encapsulates all the connection information needed to initialize the provider.

'query'

A string constant that is sent to and executed by the provider. For more information, see SQL Server OLE DB Programmer's Reference.
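A client that assembles the 'provider_string' form of this clause might do so along the following lines. This Python sketch only formats text: the helper names are hypothetical, and doubling embedded single quotes follows the usual SQL string-literal convention, which is an assumption about what a given provider accepts. The sample values are taken from the credit-risk example in Appendix D.

```python
def sql_literal(text: str) -> str:
    """Quote a string as a single-quoted literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def openrowset(provider_name: str, provider_string: str, query: str) -> str:
    """Format an OPENROWSET clause that uses the 'provider_string' form."""
    return "OPENROWSET({}, {}, {})".format(
        sql_literal(provider_name),
        sql_literal(provider_string),
        sql_literal(query),
    )


clause = openrowset(
    "Microsoft.Jet.OLEDB.4.0",
    r"data source=w:\test\demozero\credit.mdb",
    "SELECT ID, Credit, Education, Age, Pay FROM CreditTraining",
)
print(clause)
```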

9 Appendix G: Support for Other Data Mining Algorithms

Although most examples in this document are based on decision tree and clustering algorithms, the purpose of the OLE DB for Data Mining specification is to provide a data mining standard that supports all data mining algorithms. PMML is adopted to present the content of different algorithms. The information is stored in the content schema rowset after the model is trained. This appendix illustrates support for association and regression algorithms, based on the syntax defined in this document.

9.1 Support for Association Algorithm

Association is one of the popular data mining algorithms. It can be applied to market basket analysis, cross-selling, Web site mining, and so forth. The typical problem that the association algorithm solves is this: given a transaction table of the products that customers have bought, which items does a customer tend to buy together?

Suppose there are two tables: Transaction and Purchase. The Transaction table stores information about a transaction, such as transaction ID, time, store, and so on. The Purchase table stores the purchased products for each transaction.

The following statement creates a data mining model that uses an association algorithm to find products that sell together. The model is interested only in rules with at least five items.

Create Mining Model MyAssociationModel (
    Transaction_id long key,
    [Product purchases] table predict ([Product Name] text key)
) Using [My Association Algorithm] (Minimum_size = 5)

Training an association model is exactly the same as training a tree model or a clustering model. The results of the training are stored in the MINING_MODEL_CONTENT schema rowset. In the content schema rowset, there is a column called Rule, which stores the PMML representation of an association rule.

To get all the association rules discovered by the algorithm, run the following statement:

Select * from MyAssociationModel.content

This returns the content schema rowset that contains all the rules. It is also possible to search for some particular rules—for example, all the products associated with "Milk."
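The question of which products are associated with "Milk" can be illustrated with a deliberately simple Python sketch that counts how often pairs of products occur in the same transaction. This is plain pair co-occurrence counting, not the algorithm of any provider, and the function names, the `min_support` threshold, and the sample transactions are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pair_support(transactions: Dict[int, List[str]]) -> Counter:
    """Count, for each unordered product pair, the transactions that contain both."""
    counts: Counter = Counter()
    for items in transactions.values():
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def associated_with(product: str, transactions: Dict[int, List[str]],
                    min_support: int = 2) -> Dict[str, int]:
    """Products that co-occur with the given product in at least min_support transactions."""
    result: Dict[str, int] = {}
    for (first, second), count in pair_support(transactions).items():
        if count >= min_support and product in (first, second):
            other = second if first == product else first
            result[other] = count
    return result


# Hypothetical transactions: transaction ID -> purchased products.
transactions = {
    1: ["Milk", "Bread", "Eggs"],
    2: ["Milk", "Bread"],
    3: ["Beer", "Chips"],
    4: ["Milk", "Eggs", "Beer"],
}

print(associated_with("Milk", transactions))
```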

9.2 Support for Regression Algorithm

Regression is another popular data mining algorithm. It is used to find the relationship between a response variable and several possible predictor variables by means of a mathematical formula. There are several regression methods, such as linear regression, logistic regression, and nonlinear regression.

A linear regression equation is usually written as follows:

Y = a + bX + e

where:

Y is the dependent variable
a is the intercept
b is the slope, or regression coefficient
X is the independent variable
e is the error term
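For a single independent variable, the intercept a and the slope b of this equation can be estimated by ordinary least squares. The following Python sketch uses the standard closed-form estimates; the function name and the sample data are hypothetical, and a provider's regression algorithm may work differently.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Estimate (a, b) in Y = a + bX by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data that lies exactly on Y = 1 + 2X, so every error term e is zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b, residuals)
```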

Suppose there is a loan table containing customer demographic information and the level of risk of each loan. Using a regression algorithm, the following mining model predicts the loan risk level based on age, homeownership, and marital status.

Create Mining Model MyRegressionModel (
    Customer_id long key,
    Age long continuous,
    Homeowner boolean discrete,
    Marital_status boolean discrete,
    Loan_risk_LEVEL continuous predict
) Using [My Regression Algorithm]

Training a regression model is exactly the same as training a tree model or a clustering model. The values of the intercept, the regression coefficients, and the error rate are stored in the Rule column of the MINING_MODEL_CONTENT schema rowset, in PMML format. The following statement returns all the regression coefficients:

Select * from MyRegressionModel.content

Copyright

This is a preliminary document and may be changed substantially prior to final commercial release. This document is provided for informational purposes only and Microsoft makes no warranties, either express or implied, in this document. Information in this document, including URL and other Internet Web site references, is subject to change without notice. The entire risk of the use or the results of the use of this document remains with the user. Unless otherwise noted, the example companies, organizations, products, people and events depicted herein are fictitious and no association with any real company, organization, product, person or event is intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2000 Microsoft Corporation. All rights reserved.

Microsoft, MS-DOS, Windows, Windows NT, SQL Server, and Visual C++ are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
