Machine Learning for PI System and SQL Server Analysis Services OSIsoft vCampus White Paper


  • Machine Learning for PI System

    and SQL Server Analysis

    Services

    OSIsoft vCampus White Paper

  • How to Contact Us

    Email: [email protected]

    Web: http://vCampus.osisoft.com > Contact Us

    OSIsoft, Inc.

    777 Davis St., Suite 250

    San Leandro, CA 94577 USA

    Houston, TX

    Johnson City, TN

    Mayfield Heights, OH

    Phoenix, AZ

    Savannah, GA

    Seattle, WA

    Yardley, PA

    Worldwide Offices

    OSIsoft Australia

    Perth, Australia

    Auckland, New Zealand

    OSIsoft Europe

    Altenstadt, Germany

    OSI Software Asia Pte Ltd.

    Singapore

    OSIsoft Canada ULC

    Montreal, Quebec

    Calgary, Alberta

    OSIsoft, Inc. Representative Office

    Shanghai, People's Republic of China

    OSIsoft Japan KK

    Tokyo, Japan

    OSIsoft Mexico S. De R.L. De C.V.

    Mexico City, Mexico

    Sales Outlets and Distributors

    Brazil

    Middle East/North Africa

    Republic of South Africa

    Russia/Central Asia

    South America/Caribbean

    Southeast Asia

    South Korea

    Taiwan

    WWW.OSISOFT.COM

    OSIsoft, Inc. is the owner of the following trademarks and registered trademarks: PI System, PI ProcessBook, Sequencia, Sigmafine, gRecipe, sRecipe, and RLINK. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Any trademark that appears in this book that is not owned by OSIsoft, Inc. is the property of its owner and use herein in no way indicates an endorsement, recommendation, or warranty of such party's products or any affiliation with such party of any kind.

    RESTRICTED RIGHTS LEGEND Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013

    Unpublished rights reserved under the copyright laws of the United States.

    © 1998-2011 OSIsoft, LLC


    TABLE OF CONTENTS

    Overview ........................................................................................................................................... 1

    About this Document ............................................................................................................................. 1

    What You Need to Start ......................................................................................................................... 1

    What is Machine Learning? ................................................................................................................ 2

    Introduction ............................................................................................................................................ 2

    Applications ............................................................................................................................................ 2

    Connection to the PI System .................................................................................................................. 3

    Machine Learning ................................................................................................................................... 4

    Available Tools ........................................................................................................................................ 5

    Machine Learning with PI Server and SQL Server Analysis Services ...................................................... 6

    Architecture ............................................................................................................................................ 6

    What You Need ...................................................................................................................................... 6

    Creating an Estimation Model ................................................................................................................ 7

    Generating Predictions ......................................................................................................................... 12

    Discussion ............................................................................................................................................. 16

    Conclusion ....................................................................................................................................... 18

    Revision History ............................................................................................................................... 19


    OVERVIEW

    ABOUT THIS DOCUMENT

    This document is exclusive to the OSIsoft Virtual Campus (vCampus) and is available on its

    online Library, located at http://vCampus.osisoft.com/Library/library.aspx.

    Any question or comment related to this document should be posted in the appropriate

    vCampus discussion forum (http://vCampus.osisoft.com/forums) or sent to the vCampus

    Team at [email protected].

    ABOUT THIS WHITE PAPER

    This White Paper introduces the concept of Machine Learning to the PI user

    community and walks through a worked example. The purpose is to derive more value

    from the data collected and archived by the PI infrastructure; in other words, to extract

    more information from the recorded data.

    WHAT YOU NEED TO START

    You must have the following software installed to be able to follow the examples in this

    white paper:

    PI Server

    PI DataLink or the piconfig utility

    Microsoft SQL Server Analysis Services

    Microsoft Excel Data Mining add-in


    WHAT IS MACHINE LEARNING?

    We are drowning in information and starving for knowledge.

    Rutherford D. Roger

    INTRODUCTION

    Machine Learning is a broad term referring to a family of techniques for learning

    information and knowledge from past observations. These methods are usually based on

    statistical procedures. With classical methods, such as linear regression, one fits a

    function of a specific form to the observed variables and optimizes some measure of

    accuracy to find the corresponding unknown parameters. Such statistical methods are

    good at handling modest volumes of data. However, in today's world the number of

    variables and/or observations can be so large that no classical method of function fitting or

    statistical analysis yields an acceptable outcome.

    APPLICATIONS

    The applications of machine learning abound. It is applicable whenever we are interested in

    deriving a model that represents large amounts of observations, whether for future

    predictions or for filling gaps in the measurements.

    In the field of power generation, being able to predict the market price some time into

    the future, say one or a few days ahead, plays a crucial role. The predicted price depends on

    the time of year, day of the week, hour of the day, temperature forecast, other generators'

    availability, and potentially other factors, and it determines the profit-maximizing

    generation point for a generator. On the other hand, sizeable archives of observations

    going back years are available for mining. These observations can be used to build a

    relationship between the predicting parameters and the output (price). Given the number

    of observations and variables, machine learning is a very good fit for the problem.

    As another example, assume we have a device with a number of PI tags/attributes

    attached to it, archiving multiple quantities. Historically, we know when the device failed

    or needed maintenance. The goal is to learn from previous incidents and predict in

    advance when the device is about to fail or needs maintenance. Even though this certainly

    requires intuition and knowledge of the device's operation, deciphering the connection

    between many variables and the outcome can be formidable. This is where a machine

    learning algorithm can take the previous cases of failure, build a model that learns the

    relationship between the variables, and use it for prediction. In other words, we can use

    machine learning to perform predictive maintenance.

    Another application is filling the gaps between data measurements as accurately and

    consistently as possible. Imagine that a variable is supposed to be measured and archived

    regularly. If some measurements are missed for any reason, we can use measurements of

    other relevant variables to fill the gaps, based on historical learning, while conforming to

    the general behavior of the underlying process.

    Web analytics is another application, one with typically millions or billions of

    observations. The goal is to predict the chance of a specific action by a website visitor

    given previous measurements and observations. Meteorologists can use previous

    measurements, together with machine learning, to make more accurate forecasts.

    Another popular application is in genomics, where a single observation includes millions

    or billions of variables and an outcome such as a defect or a trait; the goal is to calculate

    the probability of certain characteristics based on the observed genetic properties.

    In all the cases above, if the number of variables or observations is limited, we can use more

    traditional methods, such as linear or nonlinear regression based on a quadratic cost

    function and least squares, to fit a model to our observations. However, when the

    number of observations and/or predictors exceeds a certain point, such classical

    methods do not converge in reasonable time or deliver acceptable results.

    CONNECTION TO THE PI SYSTEM

    The PI System is the infrastructure to collect, archive, analyze, and serve time-series data

    across multiple industries. Thousands of organizations around the world leverage their

    investment in this infrastructure and gain strategic, actionable insight from the data they

    collect with the PI System.

    Over the years, as organizations expand, the volume of collected data grows larger than

    ever. In almost all cases this valuable collection of information hides much more insight

    than appears on the surface. In other words, those who know how to mine the data can

    improve their return on investment significantly.

    Our goal in this white paper is to take a step in that direction. We showcase some data

    mining methods that use third-party analytical tools to perform such operations on PI

    data. In particular, we focus on machine learning.

    Figure 1: A noisy version of the Sine wave (red) used for learning and to make predictions (green).

    A Decision Tree was used to create the learning model.

    MACHINE LEARNING

    Machine learning refers to a set of analytical and computational tools used to make

    inferences based on previous observations. In supervised learning, there is a measured

    variable that we seek to predict based on previous data, while in unsupervised learning

    the aims of the analysis are patterns, segmentations, or other associations.

    In this paper our attention is devoted to supervised learning. Specifically, we would like to

    make predictions for a single variable based on current measurements of the predicting

    variables, or predictors. The prediction is based upon previous observations of the predictors

    and the output variable. The bulk of a machine learning algorithm deals with the question

    of how to teach the model the association between the predictors and the output variable

    so that some accuracy metric (usually statistical) is optimized.
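    A common choice for that accuracy metric in regression problems is the root-mean-square error (RMSE) between predicted and observed outputs. The sketch below is our own illustration; the function name and sample numbers are not part of any SSAS API:

```python
import math

def rmse(predictions, observations):
    """Root-mean-square error: a common accuracy metric for regression."""
    assert len(predictions) == len(observations)
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, observations)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Predictions close to the observed values give a small RMSE.
print(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # about 0.141
```

    Training a supervised model amounts to choosing the model parameters that minimize a metric like this over the past observations.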

    Our focus here will be on regression algorithms; i.e. the algorithms in which the output

    variable is a real number. However, very similar methodology applies to classification

    algorithms where the output variable is an unordered variable belonging to a set such as

    {Yes, No} or {Faulty, Medium condition, Excellent condition}.


    AVAILABLE TOOLS

    Among the most popular tools providing machine learning solutions are open source

    packages written in R, the MATLAB Data Mining Toolbox, Microsoft SQL Server Analysis Services,

    SAS, and the Google Prediction API.

    Each of the commercial tools mentioned above deploys one or more underlying

    machine learning algorithms. Among the most popular algorithms are Decision Trees and

    Neural Networks; Decision Trees come in several flavors that improve the precision

    or other criteria of the learning and prediction. Here are a number of important factors to

    be aware of when choosing the right algorithm:

    Time to learn: The time each algorithm takes to learn is a function of the number of

    observations and predictors. This is one of the important factors the designers should

    consider.

    Time to predict: Once the learning procedure is over, it is time to use the model for

    prediction purposes. Sometimes, this has to be done as quickly as possible for real-time

    applications while in some other applications time sensitivity is not too high.

    Interpretability: Intuition is a very important feature in prediction algorithms. Imagine an

    algorithm whose every prediction is backed by an easy-to-understand, intuitive, or even

    visually presentable explanation; compare that with a model that only produces a

    number as its output. Even if the latter does a better job of predicting, decision

    makers usually prefer the former over a black box that just spits out a number.

    Accuracy and variance: These are two very important factors one has to consider. While

    accuracy is always desired, sometimes you do not want to be as accurate as possible with

    the given set of observations. The reason is that usually there is a trade-off between

    accuracy and variance. Too much variance in the output variable with respect to the changes

    in the predictors is usually an undesired phenomenon. Different machine learning methods

    vary in this sense.

    Ability to handle missing data: Oftentimes in practice some observations are incomplete. For

    example, if there are 4 predictors and 1 million observations, chances are that not all 4 values

    are available in every single observation.

    Sensitivity to irrelevant data: This is a very important and potentially counter-intuitive

    phenomenon. Sometimes we are not sure about any causality or correlation between a

    variable and the output variable; we just collect, say, five variables as predictors and let the

    model learn the relationship between them and the output. If some of the predictors are

    not correlated with the output, the model is prone to difficulties. One would imagine that

    in the worst case the model would simply gain nothing from an irrelevant predictor;

    however, in some models irrelevant predictors actively deteriorate the prediction.


    MACHINE LEARNING WITH PI SERVER AND

    SQL SERVER ANALYSIS SERVICES

    In this section we will see how we can leverage the machine learning capabilities provided by

    Microsoft SQL Server Analysis Services for data stored in PI Server. In particular we will focus on

    the Excel client, which exposes some of those features in the Office environment by operating

    on an Excel sheet. As an easy-to-use and insightful data mining tool, it can be extremely useful

    for decision makers and operators who use the PI infrastructure.

    ARCHITECTURE

    The PI System is the infrastructure for acquiring, archiving, analyzing, and delivering enterprise

    time-series data. In this section we export the data stored in the PI archives to Microsoft Excel

    using PI DataLink. The machine learning algorithms run inside SQL Server Analysis Services. The

    Data Mining add-in to Microsoft Excel serves as a client: it builds the underlying model inside

    SQL Server, invokes the learning algorithm, and brings the results back to the Excel sheet.

    Figure 2 The data flow of the machine learning procedure

    WHAT YOU NEED

    In order to use the data mining features of MS SQL Server in MS Excel, we need to have SQL

    Server Analysis Services (SSAS) accessible. On top of this we need the Data Mining add-in to

    MS Excel, which serves as a client for SSAS. This add-in adds new features to the MS Excel

    ribbon that we will use later. At the time of writing, the add-in for MS Office 2010

    can be accessed from here for free.

    Note that once the data mining model is built and trained, you can send as many queries

    to it as you need without further training the model. The Data Mining add-in to MS

    Excel is an easy user interface that exposes such features. Also note that SQL Server

    Analysis Services provides a graphical design interface for creating queries, as well as

    a query language called Data Mining Extensions (DMX) that is useful for creating

    custom predictions and complex queries. To build DMX prediction queries, you can start

    with the query builders available in both SQL Server Management Studio and

    Business Intelligence Development Studio. A set of DMX query templates is also provided

    in SQL Server Management Studio. To read more on building queries you can follow this

    link.

    On the PI System side we need a PI Server and PI DataLink available. The idea is to use PI

    DataLink to bring the observations and predictor values into MS Excel.

    CREATING AN ESTIMATION MODEL

    In this section we explain how you can use machine learning features of SSAS through an

    example.

    Note that throughout this example we use data that we create ourselves with an

    explicit formula. This is only because we need to verify the result of the machine

    learning procedure by comparing the predictions against values we know; in most

    real cases this procedure is applied to unstructured and unknown data. In any case,

    no knowledge of the formula is used in the data mining procedure, and the generated

    values are treated as mere numerical observations.

    To set up this example, assume there are two tags called tag1 and tag2. There is a third tag,

    the output of the model, which is being measured and is believed to be a function of tag1

    and tag2; we call it tag3. However, the relationship between the predictors and the output

    tag is unknown to the designer. The idea is to build and train a machine learning model with

    existing measurements and use it for future predictions.

    We generated the data for this example as follows: we created 10,000 values for each

    of the predictors (tag1 and tag2), drawn from a uniform distribution on the range [0, 10].

    We constructed our output to be:

    tag3 = SIN(tag1 + 2*tag2)

    Obviously we don't use this knowledge in our machine learning example. We have

    therefore gathered 10,000 observations of two predictors and one output variable. We

    export the 10,000 triplets, either using PI DataLink or through exporting to a csv file

    (using the piconfig utility), to an Excel sheet:


    Figure 3 The two predictors and the output variable exported to Excel.
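    For readers who want to reproduce the data set without a PI Server at hand, the same 10,000 triplets can be simulated directly. The script below is an illustrative stand-in for the PI-side generation described above; the file name tag_data.csv and the fixed seed are our own choices:

```python
import csv
import math
import random

random.seed(42)  # fixed seed so the run is reproducible

rows = []
for _ in range(10000):
    tag1 = random.uniform(0.0, 10.0)   # predictor 1, uniform on [0, 10]
    tag2 = random.uniform(0.0, 10.0)   # predictor 2, uniform on [0, 10]
    tag3 = math.sin(tag1 + 2 * tag2)   # the output variable
    rows.append((tag1, tag2, tag3))

# Write the triplets in the same three-column layout used in the Excel sheet.
with open("tag_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tag1", "tag2", "tag3"])
    writer.writerows(rows)
```

    The resulting file can be opened in Excel in place of the PI DataLink or piconfig export.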

    The next step is to turn the range containing our data into a table. This is not required for

    the operation we intend to do, but it will make our workspace more manageable down

    the road, and some features of the Data Mining add-in only work on tables. So we select

    the 10,000 rows and 3 columns, click Format as Table on the Home ribbon, and choose

    the desired format. Name the columns to match our naming.

    Figure 4 The three columns of our data formatted as a table.

    The next step is the heart of building the machine learning model to predict new values of

    the output variable tag3. The purpose of this step is to create a machine learning model

    inside SQL Server using the data stored in our table. To start, we select the Estimate

    button on the Data Mining ribbon.


    Figure 5 Choosing the Estimate to create the mining structure.

    This will open a wizard for us to define model parameters. As for the source data we can use

    our table, an Excel range, or an external source. Choose the table we just created.

    Figure 6 Choose the table as the source of data.

    In the next step we check all three columns as inputs to our model, as we will be using all

    the data in the three columns. In general we might be interested in investigating the

    relationship between only a subset of the predictors and the output, either because some

    observed variables carry no relevant information or because we are running relevance

    tests. Since we are interested in predicting tag3, we select tag3 as the Column to analyze.

    Click Next.


    Figure 7 Including all three columns in our model and selecting the output.

    After selecting the columns in the model, the wizard asks what percentage of the data

    will be used for training and what percentage for validation of the model. The typical value

    is 30% for validation and the rest for training. SSAS uses validation to improve and trim the

    learning model: the Decision Tree algorithm in SSAS builds a tree based on the raw data

    designated for training, in this case the 70% of the data randomly picked for training

    purposes. It then applies the resulting tree to the remaining portion of the data to validate

    the predictions against observations. In case the error in prediction exceeds a threshold,

    the tree is improved or trimmed. For more information, read up on cross validation of

    learning models. Click Next to accept the 30%.
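    The hold-out idea SSAS applies here can be sketched in a few lines. This is a conceptual illustration of a random 70/30 split, not the actual sampling code SSAS runs:

```python
import random

def train_validation_split(rows, validation_fraction=0.30, seed=0):
    """Randomly hold out a fraction of the observations for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]  # (train, validation)

data = list(range(1000))                 # stand-in for 1,000 observations
train, validation = train_validation_split(data)
print(len(train), len(validation))       # 700 300
```

    The model is fit on the training portion only; the validation portion is reserved for checking predictions against observations the model has never seen.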

    The last step in creating the model is to name the structure. Name the structure Table1

    Estimate tag3. Click Finish.

    You will see that the model starts being built inside SQL Server. SSAS takes all the

    predictors (in this case two of them) as well as the output variable and builds a Decision

    Tree minimizing the error criterion. This Decision Tree is saved in SSAS for future

    prediction purposes. As you will see later, we will send queries to this structure to

    predict the value of the output variable based on predictor values.


    Figure 8 Naming the structure is the last step in creating the model.

    One point of interest is the time the whole analysis takes. This is a very efficient

    algorithm, capable of handling large amounts of data: in this case, with 10,000

    observations and two predictors, building the tree takes no more than 10 seconds on a

    laptop running Windows 7 with 8 GB of RAM.

    As a result of the analysis, the Data Mining add-in also shows a graphical representation

    of the Decision Tree. Each node of the tree comprises a predictor variable and a

    collection of numbers, or breaking points. A prediction is easily obtained by traversing

    down the tree along the corresponding branches. This is one of the strong points of the

    Decision Tree algorithm in machine learning: not only does it make estimation look-ups

    fast and efficient, it also yields a very intuitive interpretation of the prediction. This

    matters because when a black-box prediction algorithm, such as Neural Networks, offers

    just the prediction value, decision makers would love to see some background on why

    the prediction was made.
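    The exact Decision Tree implementation inside SSAS is proprietary, but its core idea, recursively choosing the predictor and threshold that most reduce the squared error and predicting the mean output at each leaf, can be sketched as a toy model. This version is our own illustration and omits the pruning, validation trimming, and performance work a real implementation needs:

```python
def best_split(rows, feature_count):
    """Find the (feature, threshold) split minimizing total squared error."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)
    best = None
    for f in range(feature_count):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0          # candidate threshold between values
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

def build_tree(rows, feature_count, depth=0, max_depth=6, min_rows=8):
    """Recursively split; each leaf predicts the mean of its outputs."""
    ys = [r[-1] for r in rows]
    mean = sum(ys) / len(ys)
    if depth >= max_depth or len(rows) < min_rows:
        return mean                       # leaf node
    split = best_split(rows, feature_count)
    if split is None:
        return mean
    _, f, t = split
    left = [r for r in rows if r[f] <= t]
    right = [r for r in rows if r[f] > t]
    if not left or not right:
        return mean
    return (f, t,
            build_tree(left, feature_count, depth + 1, max_depth, min_rows),
            build_tree(right, feature_count, depth + 1, max_depth, min_rows))

def predict(tree, x):
    """Walk down the branches until a leaf value is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

# Toy usage: learn a step function of one predictor.
rows = [(float(i), 0.0) for i in range(5)] + [(float(i), 1.0) for i in range(5, 10)]
tree = build_tree(rows, feature_count=1)
print(predict(tree, (2.0,)), predict(tree, (7.0,)))  # 0.0 1.0
```

    The nested tuples mirror the interpretability of the real tree: every prediction can be explained by the sequence of threshold comparisons that led to it.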


    Figure 9 The resulting Decision Tree

    GENERATING PREDICTIONS

    Having built our model from the observations, we move on to using the model to predict tag3

    values. We create a table with a few pairs of values for tag1 and tag2, i.e., observations on the

    predictors. These are potential future values of the predictors that we will use to make predictions.

    Figure 10 Adding new pairs of predictors

    Now it's time to use the structure we built. To do that, click the Query button on the Data

    Mining tab. It opens a wizard for building the query against the machine learning model we built

    in the last section. First we select the structure and model to send the query to, so we choose

    Table1 Estimate tag3. There is only one model under this structure, in this case

    Estimate tag3_1. Select the model and click Next.


    Figure 11 Selecting the model we built in previous section.

    We are now asked to provide the source of data. We have the option to refer to a table, an Excel

    range, or an external data source. We use the table containing our new values as the source of data.


    Figure 12 Pointing to our table for the predictor values.

    Now the wizard wants to know how each column of the table maps to each predictor in the

    previously stored model. In this case we have given each column the same name as in the model,

    namely tag1 and tag2. We don't have any values for tag3 here, because that is the output

    variable to be predicted. Click Next.


    Figure 13 Map the columns to the variables in the model.

    The next step is to define an output by clicking Add Output. In this case tag3 is the output.

    Note that in general this is not straightforward: in many applications, lots of observations lack

    one or more variable values, so it is not always obvious to SSAS which variable should be the

    output of the model.

    Figure 14 Defining the output


    Note that we choose tag3 as the output since we are interested in predicting the values of tag3.

    With other options we can look at the variance of the output variable, or its support, instead; in

    other words, these options allow us to examine some basic statistical characteristics of the

    predicted value rather than the value itself. More variance, for example, means the predicted

    value is prone to bigger changes when the observations change. Click OK and then Next. On the

    next dialog we choose to append the output to the input data, to see the results on the

    spreadsheet along with the predictors. Click Finish. Now you should see the predictions as a

    third column added to the table. We have calculated the actual values of the function

    SIN(tag1 + 2*tag2) as a 4th column of the table for comparison purposes. Comparing tag3

    (prediction) with the actual value, the model has done a very good job of learning the behavior

    of the function being predicted.

    Figure 15 Prediction values for tag3 along with the actual values. The prediction has done a good

    job.

    DISCUSSION

    In short, we have been able to learn the behavior of the function SIN(tag1 + 2*tag2) through

    10,000 observations and apply it for prediction purposes. You can also apply the model to the

    original table, the one used to train it, to see the actual and predicted values side by side.

    The precision of the model depends on many parameters of the algorithm as well as on the

    nature of the problem. As a general rule, the more observations the better. Noisy observations

    contaminate the data to a degree determined by the power of the noise. Another important

    factor is irrelevant data in training: sometimes we don't know for sure whether the output

    depends on a certain variable (tag), and irrelevant data can in fact hurt. Decision Trees are

    among the more robust algorithms when it comes to irrelevant data. Still, knowing the physics

    of the problem at hand plays a key role in making the learning procedure more efficient, and

    in more complex cases we need to run several models with different predictor variables to

    see which one models the problem better.

    Also, it is common that not all observations have all the predictor values in them. Decision Trees

    are among the best algorithms at handling missing samples.

    When the relationship between the predictors and the output variable is nonlinear, Decision

    Trees typically do a better job; when predicting a linear relationship they tend to struggle. This

    is a direct result of the underlying algorithm, which fits a piecewise constant function to the

    observations. Therefore, if we believe the relationship between the predictors and the output is

    linear, we may get better results from other machine learning algorithms, such as neural networks.
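    The piecewise-constant limitation is easy to quantify. In the sketch below (our own illustration, not tied to SSAS), approximating the perfectly linear function f(x) = x with a fixed number of constant segments, as a shallow tree effectively does, leaves a residual error proportional to the segment width, whereas a linear model would be exact:

```python
def piecewise_constant_error(n_segments, lo=0.0, hi=10.0, samples=100):
    """Worst-case error when f(x) = x is approximated by n constant segments."""
    width = (hi - lo) / n_segments
    worst = 0.0
    for s in range(n_segments):
        left = lo + s * width
        mid = left + width / 2           # best constant for f(x) = x on a segment
        for i in range(samples):
            x = left + width * i / samples
            worst = max(worst, abs(x - mid))
    return worst

# Doubling the number of segments (a deeper tree) only halves the error.
print(piecewise_constant_error(4))   # 1.25
print(piecewise_constant_error(8))   # 0.625
```

    A tree must keep splitting to chase a linear trend, while a single linear term captures it with zero error.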


    A very important point is that each problem in machine learning has its own characteristics. Even

    though today's algorithms and their implementations are very powerful, some nuances and fine

    tuning are left to the specific problem at hand. Therefore, good, close knowledge of the

    underlying problem and of the relationship between predictors and the output variable is

    extremely important to a successful machine learning procedure.

    In addition, in almost all machine learning cases the data needs to be prepared before it is

    ready for the algorithms. This includes separating the useful, informative portions of the data

    from spam or hollow samples, enriching the portions where we need more precision in the

    model, and other problem-specific operations.


    CONCLUSION

    In this white paper we discussed an important aspect of data mining for the data stored in a PI

    System. In particular, we focused on how to learn the relationship between several PI tag values

    and an output tag. The machine learning algorithms offered in SQL Server Analysis Services, along

    with its Excel client, provide a very convenient way to perform machine learning on PI data. We

    used PI DataLink to import the data into an Excel sheet and ran the machine learning algorithm

    on it.

    We analyzed several aspects and factors involved in choosing the right tool. Among those are

    the time we can allocate to learning and prediction, robustness to missing data, and robustness

    to irrelevant data. In this paper we focused on the Decision Tree algorithm offered by SSAS.

    We showed how to leverage SSAS for machine learning by walking through an example. We

    generated two random tags and created a third tag, the output tag, defined as a sine function

    of a linear combination of the two predictors. We fed 10,000 observations into SSAS in order to

    learn the behavior of the output, then used the resulting structure to predict the value of the

    output at arbitrary pairs of predictor values.


    REVISION HISTORY

    16-Aug-2011 Initial draft by Ahmad Fattahi