Machine Learning for PI System and SQL Server Analysis Services OSIsoft vCampus White Paper


  • Machine Learning for PI System

    and SQL Server Analysis

    Services

    OSIsoft vCampus White Paper

  • How to Contact Us

    Email: [email protected]

    Web: http://vCampus.osisoft.com > Contact Us

    OSIsoft, Inc.

    777 Davis St., Suite 250

    San Leandro, CA 94577 USA

    Houston, TX

    Johnson City, TN

    Mayfield Heights, OH

    Phoenix, AZ

    Savannah, GA

    Seattle, WA

    Yardley, PA

    Worldwide Offices

    OSIsoft Australia

    Perth, Australia

    Auckland, New Zealand

    OSIsoft Europe

    Altenstadt, Germany

    OSI Software Asia Pte Ltd.

    Singapore

    OSIsoft Canada ULC

    Montreal, Quebec

    Calgary, Alberta

    OSIsoft, Inc. Representative Office

    Shanghai, People's Republic of China

    OSIsoft Japan KK

    Tokyo, Japan

    OSIsoft Mexico S. De R.L. De C.V.

    Mexico City, Mexico

    Sales Outlets and Distributors

    Brazil

    Middle East/North Africa

    Republic of South Africa

    Russia/Central Asia

    South America/Caribbean

    Southeast Asia

    South Korea

    Taiwan

    WWW.OSISOFT.COM

    OSIsoft, Inc. is the owner of the following trademarks and registered trademarks: PI System, PI ProcessBook, Sequencia, Sigmafine, gRecipe, sRecipe, and RLINK. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Any trademark that appears in this book that is not owned by OSIsoft, Inc. is the property of its owner and use herein in no way indicates an endorsement, recommendation, or warranty of such party's products or any affiliation with such party of any kind.

    RESTRICTED RIGHTS LEGEND Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013

    Unpublished rights reserved under the copyright laws of the United States.

    © 1998-2011 OSIsoft, LLC


    TABLE OF CONTENTS

    Overview ........................................................................................................................................... 1

    About this Document ............................................................................................................................. 1

    What You Need to Start ......................................................................................................................... 1

    What is Machine Learning? ................................................................................................................ 2

    Introduction ............................................................................................................................................ 2

    Applications ............................................................................................................................................ 2

    Connection to the PI System .................................................................................................................. 3

    Machine Learning ................................................................................................................................... 4

    Available Tools ........................................................................................................................................ 5

    Machine Learning with PI Server and SQL Server Analysis Services ...................................................... 6

    Architecture ............................................................................................................................................ 6

    What You Need ...................................................................................................................................... 6

    Creating an Estimation Model ................................................................................................................ 7

    Generating Predictions ......................................................................................................................... 12

    Discussion ............................................................................................................................................. 16

    Conclusion ....................................................................................................................................... 18

    Revision History ............................................................................................................................... 19


    OVERVIEW

    ABOUT THIS DOCUMENT

    This document is exclusive to the OSIsoft Virtual Campus (vCampus) and is available on its

    online Library, located at http://vCampus.osisoft.com/Library/library.aspx.

    Any question or comment related to this document should be posted in the appropriate

    vCampus discussion forum (http://vCampus.osisoft.com/forums) or sent to the vCampus

    Team at [email protected].

    ABOUT THIS WHITE PAPER

    This White Paper introduces the concept of Machine Learning to the PI user

    community and walks through a worked example. The purpose is to derive more value

    from the data collected and archived by the PI infrastructure; in other words, to extract

    more information from the recorded data.

    WHAT YOU NEED TO START

    You must have the following software installed to be able to follow the examples in this

    white paper:

    PI Server

    PI DataLink or the piconfig utility

    Microsoft SQL Server Analysis Services

    Microsoft Excel Data Mining add-in


    WHAT IS MACHINE LEARNING?

    We are drowning in information and starving for knowledge.

    Rutherford D. Roger

    INTRODUCTION

    Machine Learning is a broad term referring to a family of techniques for learning

    information and knowledge from past observations. These methods are usually based on

    statistical procedures. With classical methods, such as linear regression, one fits a

    function of a specific form to the observed variables and optimizes some measure of

    accuracy to find the corresponding unknown parameters. Such statistical methods are

    good at handling modest volumes of data. However, in today's world the number of

    variables and/or observations can be so large that no classical method of function fitting or

    statistical analysis yields an acceptable outcome.

    APPLICATIONS

    The applications of machine learning abound. It is applicable whenever we are interested in

    deriving a model that represents large amounts of observations, whether for future

    predictions or for filling gaps in the measurements.

    In the field of power generation, being able to predict the market price some time into

    the future, say one or a few days ahead, plays a crucial role. The predicted price depends on

    the time of year, day of the week, hour of the day, temperature forecast, other generators'

    availability, and potentially other factors, and it determines the profit-maximizing

    generation point for a generator. On the other hand, sizeable archives of observations

    going back years are available for mining. These observations can be used to build a

    relationship between the predicting parameters and the output (price). Given the number

    of observations and variables, machine learning is a very good fit for the problem.

    As another example, assume we have a device with a number of PI tags/attributes

    attached to it, archiving multiple quantities. Historically, we know when the device failed

    or needed maintenance. The goal is to learn from previous incidents and predict in

    advance when the device is about to fail or needs maintenance. Even though this certainly

    requires intuition and knowledge of the device's operation, deciphering the connection

    between many variables and the outcome can be formidable. This is where a machine

    learning algorithm can take the previous cases of failure, build a model that learns the

    relationship between the variables, and use it for prediction. In other words, we can use

    machine learning to perform predictive maintenance.

    Another application is filling the gaps between data measurements as accurately and

    consistently as possible. Imagine that a variable is supposed to be measured and archived

    regularly. If some measurements are missed for any reason, we can use measurements of

    other relevant variables to fill the gaps, based on historical learning, while conforming to

    the general behavior of the underlying process.

    Web analytics is another application, one with typically millions or billions of

    observations. The goal is to predict the chance of a specific action by a website visitor

    given previous measurements and observations. Meteorologists can use previous

    measurements, together with machine learning, to make more accurate forecasts.

    Another popular application is in genomics, where a single observation includes millions

    or billions of variables and an outcome such as a defect or a trait; the goal is to calculate

    the probability of certain characteristics based on the observed genetic properties.

    In all the cases above, if the number of variables or observations is limited, we can use more

    traditional methods, such as linear or nonlinear regression based on a quadratic cost

    function and least squares, to fit a model to our observations. However, when the

    number of observations and/or predictors exceeds a certain point, such classical

    methods do not converge in reasonable time or deliver acceptable results.

    CONNECTION TO THE PI SYSTEM

    The PI System is the infrastructure to collect, archive, analyze, and serve time-series data

    across multiple industries. Thousands of organizations around the world leverage their

    investment in this infrastructure and gain strategic, actionable insight from the data they

    collect with the PI System.

    Over the years, as organizations expand, the volume of collected data grows larger than

    ever. In almost all cases this valuable collection of information hides much more insight

    than appears on the surface. In other words, those who know how to mine the data can

    improve their return on investment significantly.

    Our goal in this white paper is to take a step in that direction. We showcase some data

    mining methods that use third-party analytical tools to perform such operations on PI

    data. In particular, we focus on machine learning.

    Figure 1: A noisy version of the Sine wave (red) used for learning and to make predictions (green).

    A Decision Tree was used to create the learning model.

    MACHINE LEARNING

    Machine learning refers to a set of analytical and computational tools used to make

    inferences based on previous observations. In supervised learning, there is a measured

    variable that we seek to predict based on previous data, while in unsupervised learning

    the aims of the analysis are patterns, segmentations, or other associations.

    In this paper our attention is devoted to supervised learning. Specifically, we would like to

    make predictions for a single variable based on current measurements of the predicting

    variables, or predictors. The prediction is based upon previous observations of the predictors

    and the output variable. The bulk of a machine learning algorithm deals with the question

    of how to teach the model the association between the predictors and the output variable

    so that some accuracy metric (usually statistical) is optimized.
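    A common choice for that accuracy metric in regression problems is the root-mean-square error (RMSE) between predicted and observed outputs. The sketch below is our own illustration; the function name and sample numbers are not part of any SSAS API:

```python
import math

def rmse(predictions, observations):
    """Root-mean-square error: a common accuracy metric for regression."""
    assert len(predictions) == len(observations)
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, observations)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Predictions close to the observed values give a small RMSE.
print(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # about 0.141
```

    Training a supervised model amounts to choosing the model parameters that minimize a metric like this over the past observations.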

    Our focus here will be on regression algorithms; i.e. the algorithms in which the output

    variable is a real number. However, very similar methodology applies to classification

    algorithms where the output variable is an unordered variable belonging to a set such as

    {Yes, No} or {Faulty, Medium condition, Excellent condition}.


    AVAILABLE TOOLS

    Among the most popular tools providing machine learning solutions are open source

    packages written in R, the MATLAB Data Mining Toolbox, Microsoft SQL Server Analysis Services,

    SAS, and the Google Prediction API.

    Each of the commercial tools mentioned above deploys one or more underlying

    machine learning algorithms. Among the most popular algorithms are Decision Trees and

    Neural Networks; Decision Trees come in several flavors that improve the precision

    or other criteria of the learning and prediction. Here are a number of important factors to

    be aware of when choosing the right algorithm:

    Time to learn: The time each algorithm takes to learn is a function of the number of

    observations and predictors. This is one of the important factors the designers should

    consider.

    Time to predict: Once the learning procedure is over, it is time to use the model for

    prediction purposes. Sometimes, this has to be done as quickly as possible for real-time

    applications while in some other applications time sensitivity is not too high.

    Interpretability: Intuition is a very important feature in prediction algorithms. Imagine an

    algorithm whose every prediction is backed by an easy-to-understand, intuitive, or even

    visually presentable explanation; compare that with a model that only produces a

    number as its output. Even if the latter does a better job of predicting, decision

    makers usually prefer the former over a black box that just spits out a number.

    Accuracy and variance: These are two very important factors one has to consider. While

    accuracy is always desired, sometimes you do not want to be as accurate as possible with

    the given set of observations. The reason is that usually there is a trade-off between

    accuracy and variance. Too much variance in the output variable with respect to the changes

    in the predictors is usually an undesired phenomenon. Different machine learning methods

    vary in this sense.

    Ability to handle missing data: Oftentimes in practice some observations are incomplete. For

    example, if there are 4 predictors and 1 million observations, chances are that not all 4 values

    are available in every single observation.

    Sensitivity to irrelevant data: This is a very important and potentially counter-intuitive

    phenomenon. Sometimes we are not sure about any causality or correlation between a

    variable and the output variable; we just collect, say, five variables as predictors and let the

    model learn the relationship between them and the output. If some of the predictors are

    not correlated with the output, the model is prone to difficulties. One would imagine that

    in the worst case the model would simply gain nothing from an irrelevant predictor;

    however, in some models irrelevant predictors actively deteriorate the prediction.


    MACHINE LEARNING WITH PI SERVER AND

    SQL SERVER ANALYSIS SERVICES

    In this section we will see how we can leverage the machine learning capabilities provided by

    Microsoft SQL Server Analysis Services for data stored in PI Server. In particular we will focus on

    the Excel client, which exposes some of those features in the Office environment by operating

    on an Excel sheet. As an easy-to-use and insightful data mining tool, it can be extremely useful

    for decision makers and operators who use the PI infrastructure.

    ARCHITECTURE

    The PI System is the infrastructure for acquiring, archiving, analyzing, and delivering enterprise

    time-series data. In this section we export the data stored in the PI archives to Microsoft Excel

    using PI DataLink. The machine learning algorithms run inside SQL Server Analysis Services. The

    Data Mining add-in to Microsoft Excel serves as a client: it builds the underlying model inside

    SQL Server, invokes the learning algorithm, and brings the results back to the Excel sheet.

    Figure 2 The data flow of the machine learning procedure

    WHAT YOU NEED

    In order to use the data mining features of MS SQL Server in MS Excel, we need to have SQL

    Server Analysis Services (SSAS) accessible. On top of this we need the Data Mining add-in to

    MS Excel, which serves as a client for SSAS. This add-in adds new features to the MS Excel

    ribbon that we will use later. At the time of writing, the add-in for MS Office 2010

    can be accessed from here for free.

    Note that once the data mining model is built and trained, you can send as many queries

    to it as you need without further training the model. The Data Mining add-in to MS

    Excel is an easy user interface that exposes such features. Also note that SQL Server

    Analysis Services provides a graphical design interface for creating queries, as well as

    a query language called Data Mining Extensions (DMX) that is useful for creating

    custom predictions and complex queries. To build DMX prediction queries, you can start

    with the query builders available in both SQL Server Management Studio and

    Business Intelligence Development Studio. A set of DMX query templates is also provided

    in SQL Server Management Studio. To read more on building queries you can follow this

    link.

    On the PI System side we need a PI Server and PI DataLink available. The idea is to use PI

    DataLink to bring the observations and predictor values into MS Excel.

    CREATING AN ESTIMATION MODEL

    In this section we explain how you can use machine learning features of SSAS through an

    example.

    Note that throughout this example we use data that we create ourselves with an

    explicit formula. This is only because we need to verify the result of the machine

    learning procedure by comparing the predictions against values we know; in most

    real cases this procedure is applied to unstructured and unknown data. In any case,

    no knowledge of the formula is used in the data mining procedure, and the generated

    values are treated as mere numerical observations.

    To set up this example, assume there are two tags called tag1 and tag2. There is a third tag,

    the output of the model, which is being measured and is believed to be a function of tag1

    and tag2; we call it tag3. However, the relationship between the predictors and the output

    tag is unknown to the designer. The idea is to build and train a machine learning model with

    existing measurements and use it for future predictions.

    We generated the data for this example as follows: we created 10,000 values for each

    of the predictors (tag1 and tag2), drawn from a uniform distribution on the range [0, 10].

    We constructed our output to be:

    tag3 = SIN(tag1 + 2*tag2)

    Obviously we don't use this knowledge in our machine learning example. We have

    therefore gathered 10,000 observations of two predictors and one output variable. We

    export the 10,000 triplets, either using PI DataLink or through exporting to a csv file

    (using the piconfig utility), to an Excel sheet:


    Figure 3 The two predictors and the output variable exported to Excel.
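    For readers who want to reproduce the data set without a PI Server at hand, the same 10,000 triplets can be simulated directly. The script below is an illustrative stand-in for the PI-side generation described above; the file name tag_data.csv and the fixed seed are our own choices:

```python
import csv
import math
import random

random.seed(42)  # fixed seed so the run is reproducible

rows = []
for _ in range(10000):
    tag1 = random.uniform(0.0, 10.0)   # predictor 1, uniform on [0, 10]
    tag2 = random.uniform(0.0, 10.0)   # predictor 2, uniform on [0, 10]
    tag3 = math.sin(tag1 + 2 * tag2)   # the output variable
    rows.append((tag1, tag2, tag3))

# Write the triplets in the same three-column layout used in the Excel sheet.
with open("tag_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tag1", "tag2", "tag3"])
    writer.writerows(rows)
```

    The resulting file can be opened in Excel in place of the PI DataLink or piconfig export.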

    The next step is to turn the range containing our data into a table. This is not required for

    the operation we intend to do, but it will make our workspace more manageable down

    the road, and some features of the Data Mining add-in only work on tables. So we select

    the 10,000 rows and 3 columns, click Format as Table on the Home ribbon, and choose

    the desired format. Name the columns to match our naming.

    Figure 4 The three columns of our data formatted as a table.

    The next step is the heart of building the machine learning model to predict new values of

    the output variable tag3. The purpose of this step is to create a machine learning model

    inside SQL Server using the data stored in our table. To start, we select the Estimate

    button on the Data Mining ribbon.


    Figure 5 Choosing the Estimate to create the mining structure.

    This will open a wizard for us to define model parameters. As for the source data we can use

    our table, an Excel range, or an external source. Choose the table we just created.

    Figure 6 Choose the table as the source of data.

    In the next step we check all three columns as inputs to our model, as we will be using all

    the data in the three columns. In general we might be interested in investigating the

    relationship between only a subset of the predictors and the output, either because some

    observed variables carry no relevant information or because we are running relevance

    tests. Since we are interested in predicting tag3, we select tag3 as the Column to analyze.

    Click Next.


    Figure 7 Including all three columns in our model and selecting the output.

    After selecting the columns in the model, the wizard asks what percentage of the data

    will be used for training and what percentage for validation of the model. The typical value

    is 30% for validation and the rest for training. SSAS uses validation to improve and trim the

    learning model: the Decision Tree algorithm in SSAS builds a tree based on the raw data

    designated for training, in this case the 70% of the data randomly picked for training

    purposes. It then applies the resulting tree to the remaining portion of the data to validate

    the predictions against observations. In case the error in prediction exceeds a threshold,

    the tree is improved or trimmed. For more information, read up on cross validation of

    learning models. Click Next to accept the 30%.
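    The hold-out idea SSAS applies here can be sketched in a few lines. This is a conceptual illustration of a random 70/30 split, not the actual sampling code SSAS runs:

```python
import random

def train_validation_split(rows, validation_fraction=0.30, seed=0):
    """Randomly hold out a fraction of the observations for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]  # (train, validation)

data = list(range(1000))                 # stand-in for 1,000 observations
train, validation = train_validation_split(data)
print(len(train), len(validation))       # 700 300
```

    The model is fit on the training portion only; the validation portion is reserved for checking predictions against observations the model has never seen.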

    The last step in creating the model is to name the structure. Name the structure Table1

    Estimate tag3. Click Finish.

    You will see that the model starts being built inside SQL Server. SSAS takes all the

    predictors (in this case two of them) as well as the output variable and builds a Decision

    Tree minimizing the error criterion. This Decision Tree is saved in SSAS for future

    prediction purposes. As you will see later, we will send queries to this structure to

    predict the value of the output variable based on predictor values.


    Figure 8 Naming the structure is the last step in creating the model.

    One point of interest is the time the whole analysis takes. This is a very efficient

    algorithm, capable of handling large amounts of data: in this case, with 10,000

    observations and two predictors, building the tree takes no more than 10 seconds on a

    laptop running Windows 7 with 8 GB of RAM.

    As a result of the analysis, the Data Mining add-in also shows a graphical representation

    of the Decision Tree. Each node of the tree comprises a predictor variable and a

    collection of numbers, or breaking points. A prediction is easily obtained by traversing

    down the tree along the corresponding branches. This is one of the strong points of the

    Decision Tree algorithm in machine learning: not only does it make estimation look-ups

    fast and efficient, it also yields a very intuitive interpretation of the prediction. This

    matters because when a black-box prediction algorithm, such as Neural Networks, offers

    just the prediction value, decision makers would love to see some background on why

    the prediction was made.
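    The exact Decision Tree implementation inside SSAS is proprietary, but its core idea, recursively choosing the predictor and threshold that most reduce the squared error and predicting the mean output at each leaf, can be sketched as a toy model. This version is our own illustration and omits the pruning, validation trimming, and performance work a real implementation needs:

```python
def best_split(rows, feature_count):
    """Find the (feature, threshold) split minimizing total squared error."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)
    best = None
    for f in range(feature_count):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0          # candidate threshold between values
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

def build_tree(rows, feature_count, depth=0, max_depth=6, min_rows=8):
    """Recursively split; each leaf predicts the mean of its outputs."""
    ys = [r[-1] for r in rows]
    mean = sum(ys) / len(ys)
    if depth >= max_depth or len(rows) < min_rows:
        return mean                       # leaf node
    split = best_split(rows, feature_count)
    if split is None:
        return mean
    _, f, t = split
    left = [r for r in rows if r[f] <= t]
    right = [r for r in rows if r[f] > t]
    if not left or not right:
        return mean
    return (f, t,
            build_tree(left, feature_count, depth + 1, max_depth, min_rows),
            build_tree(right, feature_count, depth + 1, max_depth, min_rows))

def predict(tree, x):
    """Walk down the branches until a leaf value is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

# Toy usage: learn a step function of one predictor.
rows = [(float(i), 0.0) for i in range(5)] + [(float(i), 1.0) for i in range(5, 10)]
tree = build_tree(rows, feature_count=1)
print(predict(tree, (2.0,)), predict(tree, (7.0,)))  # 0.0 1.0
```

    The nested tuples mirror the interpretability of the real tree: every prediction can be explained by the sequence of threshold comparisons that led to it.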


    Figure 9 The resulting Decision Tree

    GENERATING PREDICTIONS

    Having built our model from the observations, we move on to using the model to predict tag3

    values. We create a table with a few pairs of values for tag1 and tag2, i.e., observations on the

    predictors. These are potential future values of the predictors that we will use to make predictions.

    Figure 10 Adding new pairs of predictors

    Now it's time to use the structure we built. To do that, click the Query button on the Data

    Mining tab. It opens a wizard for building the query against the machine learning model we built

    in the last section. First we select the structure and model to send the query to, so we choose

    Table1 Estimate tag3. There is only one model under this structure, in this case

    Estimate tag3_1. Select the model and click Next.


    Figure 11 Selecting the model we built in previous section.

    We are now asked to provide the source of data. We have the option to refer to a table, an Excel

    range, or an external data source. We use the table containing our new values as the source of data.


    Figure 12 Pointing to our table for the predictor values.

    Now the wizard wants to know how each column of the table maps to each predictor in the

    previously stored model. In this case we have given each column the same name as in the model,

    namely tag1 and tag2. We don't have any values for tag3 here, because that is the output

    variable to be predicted. Click Next.


    Figure 13 Map the columns to the variables in the model.

    The next step is to define an output by clicking Add Output. In this case tag3 is the output.

    Note that in general this is not straightforward: in many applications, lots of observations lack

    one or more variable values, so it is not always obvious to SSAS which variable should be the

    output of the model.

    Figure 14 Defining the output


    Note that we choose tag3 as the output since we are interested in predicting the values of tag3.

    With other options we can look at the variance of the output variable, or its support, instead; in

    other words, these options allow us to examine some basic statistical characteristics of the

    predicted value rather than the value itself. More variance, for example, means the predicted

    value is prone to bigger changes when the observations change. Click OK and then Next. On the

    next dialog we choose to append the output to the input data, to see the results on the

    spreadsheet along with the predictors. Click Finish. Now you should see the predictions as a

    third column added to the table. We have calculated the actual values of the function

    SIN(tag1 + 2*tag2) as a 4th column of the table for comparison purposes. Comparing tag3

    (prediction) with the actual value, the model has done a very good job of learning the behavior

    of the function being predicted.

    Figure 15 Prediction values for tag3 along with the actual values. The prediction has done a good

    job.

    DISCUSSION

    In short, we have been able to learn the behavior of the function SIN(tag1 + 2*tag2) through

    10,000 observations and apply it for prediction purposes. You can also apply the model to the

    original table, the one used to train it, to see the actual and predicted values side by side.

    The precision of the model depends on many parameters of the algorithm as well as on the

    nature of the problem. As a general rule, the more observations the better. Noisy observations

    contaminate the data to a degree determined by the power of the noise. Another important

    factor is irrelevant data in training: sometimes we don't know for sure whether the output

    depends on a certain variable (tag), and irrelevant data can in fact hurt. Decision Trees are

    among the more robust algorithms when it comes to irrelevant data. Still, knowing the physics

    of the problem at hand plays a key role in making the learning procedure more efficient, and

    in more complex cases we need to run several models with different predictor variables to

    see which one models the problem better.

    Also, it is common that not all observations have all the predictor values in them. Decision Trees

    are among the best algorithms at handling missing samples.

    When the relationship between the predictors and the output variable is nonlinear, Decision

    Trees typically do a better job; when predicting a linear relationship they tend to struggle. This

    is a direct result of the underlying algorithm, which fits a piecewise constant function to the

    observations. Therefore, if we believe the relationship between the predictors and the output is

    linear, we may get better results from other machine learning algorithms, such as neural networks.
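    The piecewise-constant limitation is easy to quantify. In the sketch below (our own illustration, not tied to SSAS), approximating the perfectly linear function f(x) = x with a fixed number of constant segments, as a shallow tree effectively does, leaves a residual error proportional to the segment width, whereas a linear model would be exact:

```python
def piecewise_constant_error(n_segments, lo=0.0, hi=10.0, samples=100):
    """Worst-case error when f(x) = x is approximated by n constant segments."""
    width = (hi - lo) / n_segments
    worst = 0.0
    for s in range(n_segments):
        left = lo + s * width
        mid = left + width / 2           # best constant for f(x) = x on a segment
        for i in range(samples):
            x = left + width * i / samples
            worst = max(worst, abs(x - mid))
    return worst

# Doubling the number of segments (a deeper tree) only halves the error.
print(piecewise_constant_error(4))   # 1.25
print(piecewise_constant_error(8))   # 0.625
```

    A tree must keep splitting to chase a linear trend, while a single linear term captures it with zero error.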


    A very important point is that each problem in machine learning has its own characteristics. Even

    though today's algorithms and their implementations are very powerful, some nuances and fine

    tuning are left to the specific problem at hand. Therefore, good, close knowledge of the

    underlying problem and of the relationship between predictors and the output variable is

    extremely important to a successful machine learning procedure.

    In addition, in almost all machine learning cases the data needs to be prepared before it is

    ready for the algorithms. This includes separating the useful, informative portions of the data

    from spam or hollow samples, enriching the portions where we need more precision in the

    model, and other problem-specific operations.


    CONCLUSION

    In this white paper we discussed an important aspect of data mining for the data stored in a PI

    System. In particular, we focused on how to learn the relationship between several PI tag values

    and an output tag. The machine learning algorithms offered in SQL Server Analysis Services, along

    with its Excel client, provide a very convenient way to perform machine learning on PI data. We

    used PI DataLink to import the data into an Excel sheet and ran the machine learning algorithm

    on it.

    We analyzed several aspects and factors involved in choosing the right tool. Among those are

    the time we can allocate to learning and prediction, robustness to missing data, and robustness

    to irrelevant data. In this paper we focused on the Decision Tree algorithm offered by SSAS.

    We showed how to leverage SSAS for machine learning by walking through an example. We

    generated two random tags and created a third tag, the output tag, defined as a sine function

    of a linear combination of the two predictors. We fed 10,000 observations into SSAS in order to

    learn the behavior of the output, then used the resulting structure to predict the value of the

    output at arbitrary pairs of predictor values.


    REVISION HISTORY

    16-Aug-2011 Initial draft by Ahmad Fattahi