SIMCA-P+ 11 Tutorial

Tutorial SIMCA-P, SIMCA-P+

Version 11.0

By Umetrics AB

© 1992-2005 Umetrics AB

Information in this document is subject to change without notice and does not represent a commitment on the part of Umetrics AB. The software, which includes information contained in any databases, described in this document is furnished under a license agreement or non-disclosure agreement and may be used or copied only in accordance with the terms of the agreement. It is against the law to copy the software except as specifically allowed in the license or non-disclosure agreement. No part of this manual may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose, without the express written permission of Umetrics AB.

SIMCA is a registered trademark of Umetrics; Windows is a trademark of Microsoft Corporation.

Covers products: SIMCA-P, SIMCA-P+

Manual edition date: May 16, 2005

UMETRICS AB

Box 7960
S-907 19 Umeå
Sweden

Tel. +46 (0)90 184800
Fax. +46 (0)90 184899
Email: [email protected]
Home page: www.umetrics.com


Contents

How to get started with SIMCA
  Regular Project (non-Batch)
    General
    The Analysis cycle
    Import the primary data, create a new project
    View
    Pre-processing the data (Dataset menu)
    Prepare the data (Workset menu)
    Develop the model (Analysis menu)
    Fit the model
    Review the fit (Analysis menu)
    Predictions (Predictions menu)
    Plots/Lists
  Road map to SIMCA-P
  Batch Projects (SIMCA-P+ 10)
    General
    The Analysis cycle

Introduction
  General
    Plots and Lists

Foods
  Data
    Data table
  Objective
    Analysis Outline
  Define project
  Workset Wizard
  Analysis
    Scores and Loadings
    Third Component
    Summary

Mineral sorting at LKAB
  Introduction
    Data description
    Data table
  Objective
    Analysis outline
  Create the project
  Prepare the data
    Workset Wizard
  Analysis
    PC of Y
    Scores and Loadings
    PLS MODELING
    Refining the model
    Excluding observation 208 using the interactive tool box
    Removing some observations for a test set
    Observation Risk
    Predictions
    Summary

NIR
  Introduction
  Data
    Variables
    Observations
  Objective
  Analysis Outline
    The steps to follow in SIMCA-P are:
  Create the project
  Prepare the data
    Default Workset
    Transform the variables
  Analysis
    PLS model of all the samples
    Excluding sample 32
    Separate PLS models for the Sphagnum and Carex
    Sphagnum Model, class 2
    Model class 1 (Carex peat)
  Predictions
    Making a prediction Set
    Cooman's Plot
    Summary
    Plots and Lists

Hierarchical Models
  Introduction
  Data
  Objective
  Analysis Outline
    The steps to follow in SIMCA-P are:
  Create the project
  Summarizing the feed
    Workset
    Analysis
  Summarizing the reactor
    Workset
    Analysis
    Scores t1 vs. t2
    Loadings p1 and p3 (the 2 most important components)
  Summarizing the purification
    Workset
  Summarizing the less important Y's
    Workset
  Preparing for the hierarchical model
    Workset of the top level model
    Analysis
    The score plot (t1 vs. t2) of the top level model
    The w*c plot
    Coefficients
    Variable Importance (VIP)
    Observed vs. Predicted
  Predictions
    DModXPS
    Scores tPS1 vs. tPS2 colored by test set and training set
    Cusum Chart
  Conclusion

Spectral Filtering and Compression, including OPLS
  Introduction
  Data
  Objective
  Analysis Outline
    The steps to follow in SIMCA-P are:
  Create the project
  Plotting the Spectra
  Prepare the Data
    Workset
  Analysis
    PLS model
  Validating the Model 1
  Orthogonal Signal Correction and Wavelets Compression
  Model with the Signal corrected and compressed data
    Summary of the preprocessed project
    Change the default Scaling
  Validating the Model 2
  Conclusion OSC-Wavelets
  OPLS (Orthogonal PLS)
  Conclusions

Batch Modelling with SIMCA-P+
  Introduction
  Data
  Objectives
  Analysis Outline
    The steps in SIMCA-P are:
  Create the observation level project
  Analysis
    Batch Control charts (Training set)
  Monitoring new batches
    Import the secondary data set with the new batches
    Control Charts for new batches
    Prediction | Batch Control Charts | DModX
  Creating and Modelling the batch level project
    Analysis: Autofit
    Analysis: Scores
    Analysis | Batch Control Charts | Batch Variable Importance
  Predicting the quality of the new batches
    Predictions: T Predicted
    Predictions: Contribution Scores for batch 51
  Conclusion

Modelling of a Batch Digester
  Introduction
  Data
  Objectives
  Analysis Outline
    The steps in SIMCA-P are:
  Create the observation level project
  Specify the Workset
  Analysis
    Fitting All the Class models
    Scores Line plot of t1, t2 and t3
    Loadings p1, p2 and p3
    Batch Control charts (Training set)
  Monitoring new batches
    Creating the Prediction set: Complement of Workset
    Batch Control Chart of the Prediction set
    OOC plot
    Group Contribution plot
    Variable control chart
    Prediction | Batch Control Charts | DModX
  Creating and Modelling the batch level project
    Analysis: Autofit
    Analysis: Scores
    Analysis | Batch Variable Importance
  Predicting the quality of the prediction set batches
    Predictions: T Predicted
    Predictions: Contribution Scores for batch 28
    Predictions: Distance to the Model (DModX)
    Contribution Plot
  Conclusion


How to get started with SIMCA

Regular Project (non-Batch)

General

SIMCA-P is organized into projects. A project is a folder containing the results of the analysis (an unlimited number of models) of a primary dataset. You start a new project by importing its data (the primary dataset). SIMCA-P implicitly creates an unfitted model when you specify a Workset or, with an existing Workset, when you select an Active Model Type. At the very beginning of a project, the default Workset consists of all data, with all variables centered, scaled to unit variance and considered as X, and the model is a principal components (PC) model of X.

The project window displays, for every model, one line summarizing the model results. The active model, the one you are working with, is also listed in a list box to the left of the gray area (status bar) just beneath the command menu bar. To open a model, double click on it in the project window. This opens a model window with the details (one line per component) of the model results. Another way to activate a model, if several are available, is to select its name from the list box (upper left).

The Analysis cycle

1. Pre-processing and selection of data (Dataset and Workset menus).

2. The Dataset menu allows you to trim / winsorize your data, generate new variables, and perform spectral filtering or wavelet compression of the data. A model is developed from a Workset. The default Workset, to start, is the whole dataset with all variables as X and scaled to unit variance. This is also obtained by Workset | New. The Workset menu allows you to modify the starting Workset.

3. Specifying and fitting the model (Analysis menu).

4. Reviewing the results and performing diagnostics (Analysis menu).

5. Using the model for predictions (Predictions menu).

Import the primary data, create a new project

File: New
Select to import data from file or databases. SIMCA-P imports files with the following format types:

    DIF: Data interchange format (many applications can export DIF files).
    TXT: Standard delimited text file (one observation per line).
    TXT: Free format text, with or without header.
    MAT: Matlab version 4.0 files (binary).
    XLS: All versions of EXCEL files.
    Lotus 1-2-3: *.wk1 files
    JCAMP-DX: *.jcm, *.dx, *.jdx
    ANDI: Chromatography AIA files
    NSAS: files
    GRAM: Galactic *.spc files
    Others (refer to chapter 4), including old SIMCA-P file types.

Select the Source file

    Source Directory: The directory that contains the data file.
    Name: Locate the source file, e.g., ENVIRO.DIF. Double click on the name of the source file.
    Destination Directory: The directory in which to store the project, e.g., C:\SIMDATA\ENVIRO. You may also change the project directory (destination), if you wish. By default SIMCA-P uses the source directory as the destination directory.

Indicate file contents

Specify Primary and as many Secondary identifiers as desired for both variables and observations.
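To make the layout of a delimited text file concrete, here is a minimal Python sketch, not part of SIMCA-P. It assumes a file with one observation per line, the primary variable identifiers in the first row and the primary observation identifiers in the first column. The function name read_delimited and the sample values are hypothetical, and missing values are not handled.

```python
import csv
import io

def read_delimited(text, delimiter=","):
    """Parse a delimited table: first row = variable IDs, first column = observation IDs."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    variable_ids = rows[0][1:]                      # skip the corner cell
    observation_ids = [row[0] for row in rows[1:]]  # one observation per line
    data = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    return observation_ids, variable_ids, data

# Hypothetical two-observation, two-variable file content.
sample = "ID,Temp,Pressure\nobs1,20.5,1.01\nobs2,21.0,0.99\n"
observation_ids, variable_ids, data = read_delimited(sample)
```

Secondary identifiers would simply be further leading rows or columns of labels, which this sketch does not attempt to detect.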

Secondary datasets

Later you may import additional data (secondary datasets) for use in predictions. You do that in the menu File | Import Secondary Dataset.


View

Customize your display and specify the project level options and the general options.

Pre-processing the data (Dataset menu)

Plotting variables or observations from the dataset
Mark the variables or observations you want to plot, right click on the marked objects and select the desired plots. To plot all the X observations as a line plot, just right click on the dataset and select Plot | Xobs.

Use the Dataset menus to view or modify a SIMCA-P dataset as follows:

    Quick Info: Interactive plots tied to the dataset displaying variables or observations in the time or frequency domain.
    Trimming / Winsorizing: Single, or all variables.
    Edit dataset: General Edit commands.
    Generate new variables: Generate new variables as functions of existing ones or from model results.
    Spectral filter the dataset with:
        Orthogonal Signal Correction (OSC)
        Multiplicative Scatter Correction (MSC)
        Standard Normal Variates (SNV)
        1st and 2nd Derivatives
        Wavelet transform and compression
        PLS wavelet transform of time series
        Decimation of time series
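As an illustration of what some of these pre-processing operations do to the numbers, here is a minimal Python sketch, not SIMCA-P code. It assumes winsorizing means clamping values to fixed limits, SNV means centering each spectrum and scaling it to unit standard deviation, and the first derivative is a plain successive difference. The function names are hypothetical, and the limits and derivative method SIMCA-P actually uses may differ.

```python
import statistics

def winsorize(values, lower, upper):
    """Clamp every value into [lower, upper]; trimming would remove such values instead."""
    return [min(max(v, lower), upper) for v in values]

def snv(spectrum):
    """Standard Normal Variate: center one spectrum (one row) and scale it to unit std."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in spectrum]

def first_derivative(spectrum):
    """First derivative approximated by successive differences (assumption)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

readings = [1.0, 2.0, 3.0, 4.0, 100.0]           # 100.0 is an extreme value
clipped = winsorize(readings, lower=1.5, upper=4.5)
corrected = snv([2.0, 4.0, 6.0])
slope = first_derivative([1.0, 4.0, 9.0])
```

Note that SNV works along each observation (spectrum), whereas scaling in the Workset works along each variable.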

Prepare the data (Workset menu)

The default Workset, at the project start, is the whole dataset with variables defined as X's and Y's as specified at import, and scaled to unit variance. The associated model (unfitted) is listed in the active area. You are ready to fit a PLS model (default), or a PC model of the X's or Y's, with all the data of the primary dataset. If this is what you want, you can go directly to the Analysis menu. To fit a different model, for example with excluded variables, transformations, or different scaling, you must first modify the Workset. SIMCA-P generates an unfitted model when you specify the Workset (select a starting Workset with New or New As Model).


Workset

    New: Uses the whole original primary dataset with X's and Y's as defined at import.
    New As Model: Uses the Workset of a selected model as the starting point.

Modify the Workset as follows:

    Observations: Include / exclude observations or group them into classes for classification.
    Variables: Define X/Y variables, transformations, scaling, etc.
        Transform: Transform variables.
        Lag: Create lagged variables (SIMCA-P only).
        Variables/Block: Select variables, and specify roles. To select variables as X, Y or excluded, mark the variables as X, Y or excluded and click on the Set button.
        Expand: Expand the X matrix with cross terms, squares or cubes.
        Scale: Select scaling base type (UV = centered, unit variance; Par = centered and Pareto; etc.). A modifier can be selected (default = 1.0) that changes the scaling of a variable relative to its base weight. Block scaling can also be specified.
    Trim / Winsorize variables: Trimming / Winsorizing the workset does not affect the dataset but just that particular workset (refer to the workset chapter).
    Options: Specify the model level options.
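The Scale options above can be illustrated with a minimal Python sketch, not SIMCA-P code. It assumes UV divides each centered variable by its standard deviation, Par (Pareto) divides by the square root of the standard deviation, and the modifier simply multiplies the base weight. The function name scale_column is hypothetical, and the multiplicative reading of the modifier is an assumption.

```python
import math
import statistics

def scale_column(values, base="UV", modifier=1.0):
    """Center one variable and scale it with the chosen base weight times a modifier."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    if base == "UV":
        weight = 1.0 / std             # unit variance
    elif base == "Par":
        weight = 1.0 / math.sqrt(std)  # Pareto
    else:
        raise ValueError(f"unknown scaling base: {base}")
    return [(v - mean) * weight * modifier for v in values]

column = [2.0, 4.0, 6.0]             # mean 4, standard deviation 2
uv = scale_column(column, "UV")      # variance becomes 1
par = scale_column(column, "Par")    # variance becomes equal to the original std
```

With UV scaling every variable gets the same variance and hence the same chance to influence the model; Pareto scaling leaves variables with larger variation somewhat more influential.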

Develop the model (Analysis menu)

Select model type
The default model is a PCX model if all your variables are defined as X's, or a PLS model if you have defined both X's and Y's at import. You can change the model type, and when the Workset specification allows it, you can select among:

    PCX: PC model of the X's.
    PCY: PC model of the Y's.
    PCAll: PC model of all included variables, X and Y.
    PC Class: PC of a selected class when your observations are divided into classes.
    PLS: PLS analysis of X and Y.
    PLS Class: PLS of a selected class when your observations are divided into classes.
    PLSDA: PLS discriminant analysis when your observations are divided into classes.

Fit the model

    Autofit: Rule based fitting.
    2 First Components: Calculate two components directly, often used to get a quick overview of the data.
    Next Component: Calculate one component at a time. Here it is possible to force components to be calculated regardless of significance rules.
    Remove Component: Remove the last component.
    Autofit Class Models: Autofit all class models, or fit each of them with the number of components you specify.
    Specify Hierarchical Models: Specify a model to be a Base or Top hierarchical model.
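Next Component above calculates one component at a time. One common way to do this for a PC model is the NIPALS iteration, sketched below in plain Python. Whether SIMCA-P uses exactly this procedure is not stated here, so treat it as an assumption. The function name nipals_component and the small data matrix are hypothetical, and X is assumed to be centered already.

```python
import math

def nipals_component(X, max_iter=500, tol=1e-12):
    """Extract one principal component from X (list of rows, columns already centered)."""
    n, k = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    start = max(range(k), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[start] for row in X]
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project the columns of X onto the scores, then normalize.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the rows of X onto the loadings.
        t_new = [sum(X[i][j] * p[j] for j in range(k)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change <= tol * sum(v * v for v in t):
            break
    # Deflate: what is left of X after removing this component.
    residual = [[X[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
    return t, p, residual

X = [[2.0, 1.0, 0.0], [-2.0, -1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
t1, p1, E1 = nipals_component(X)    # first component
t2, p2, E2 = nipals_component(E1)   # "next component", fitted on the residual
```

Each further component is fitted on the residual left by the previous ones, which is why components can be added or removed one at a time.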

Review the fit (Analysis menu)

After a fit, the whole spectrum of plots and lists is available for model interpretation.

Summary of fit

1. Model Overview
2. X/Y overview: Cumulative Fit of all variables (Y only in PLS).
3. X/Y/Comp: The Fit of a Variable by Component.
4. Component Contribution: The contribution of a model component to the Fit.
5. Scores: t1 vs. t2, t1 vs. u1, etc.
6. Loadings: p1 vs. p2, w*c1 vs. w*c2, etc.
7. Coefficients (PLS)
8. VIP (PLS): Variable influence on projection
9. DMod (X or Y): Distance to the model (X or Y)
10. Observed vs. predicted (PLS)
11. Residual plots: Normal probability plot (for selected Y's)
12. Observation risk

Note: By default, in the Analysis menu, all plots and lists are displayed for the last component. To select a different component for display in the plots, and/or a different variable, click on the right mouse button and select from the available options.
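The fit and distance-to-model quantities listed above are built from sums of squares of the residuals. The Python sketch below is a simplified illustration, not SIMCA-P's exact computation. It assumes X is the centered and scaled workset data and residual is what remains after the fitted components. The function names are hypothetical, and the distance shown is a plain residual standard deviation per observation, without any further normalization or correction SIMCA-P may apply.

```python
import math

def r2_cumulative(X, residual):
    """Fraction of the total sum of squares of X explained by the model."""
    ss_total = sum(v * v for row in X for v in row)
    ss_resid = sum(e * e for row in residual for e in row)
    return 1.0 - ss_resid / ss_total

def r2_per_variable(X, residual):
    """Explained fraction for each variable (column) separately."""
    result = []
    for j in range(len(X[0])):
        ss_total = sum(row[j] ** 2 for row in X)
        ss_resid = sum(row[j] ** 2 for row in residual)
        result.append(1.0 - ss_resid / ss_total)
    return result

def residual_distance(residual, n_components):
    """Residual standard deviation of each observation (simplified distance to model)."""
    k = len(residual[0])
    return [math.sqrt(sum(e * e for e in row) / (k - n_components)) for row in residual]

# Hypothetical centered data and its residual after one component.
X = [[2.0, 1.0, 0.0], [-2.0, -1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
E = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
fit = r2_cumulative(X, E)
fit_by_variable = r2_per_variable(X, E)
distances = residual_distance(E, n_components=1)
```

In this example the first two variables are fully explained while the third is not explained at all, and the last two observations lie furthest from the model.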

Select a New Model Type

You can, after fitting the model, select a new model type. SIMCA-P then creates a new unfitted model with the selected model type. For example, if you have defined your Workset variables as X's and Y's, you can first fit a PCY (PC of the responses), then change the model type to PLS and fit a PLS model (another model) to the same data.

Predictions (Predictions menu)

Building the Prediction Set
Use the menu Predictions | Specify Prediction set to build your prediction set from observations belonging to the primary dataset or to any secondary dataset that you have imported. You can display the prediction set as a spreadsheet or just plot or list results. When you do not specify a prediction set, the prediction set is by default the primary dataset with all the data. You can also enter data in the prediction set through the keyboard when you build the prediction set in the spreadsheet.

Displaying the predictions

All the prediction results (scores, y-values, etc.), computed with the active model, are displayed as plots or lists.
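To show what computing predicted scores with the active model amounts to, here is a minimal Python sketch for a PC model, not SIMCA-P code. It assumes the model stores the mean and standard deviation of each variable from the workset together with orthonormal loadings, one list per component. The function name predict_scores and all numbers are hypothetical. Predictions from a PLS model also involve weights and regression coefficients, which are not shown.

```python
def predict_scores(x_new, means, stds, loadings):
    """Scale a new observation like the workset, then project it on each component."""
    e = [(x - m) / s for x, m, s in zip(x_new, means, stds)]
    scores = []
    for p in loadings:
        t = sum(ej * pj for ej, pj in zip(e, p))    # predicted score for this component
        scores.append(t)
        e = [ej - t * pj for ej, pj in zip(e, p)]   # remove what the component explains
    return scores, e                                # e is the residual of the observation

means, stds = [4.0, 10.0], [2.0, 5.0]    # hypothetical workset statistics
loadings = [[0.6, 0.8], [0.8, -0.6]]     # hypothetical orthonormal loadings
scores, residual = predict_scores([6.0, 15.0], means, stds, loadings)
```

The residual returned for the new observation is the basis for its distance to the model, so scores and distance can be judged together.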


Plots/Lists Under this menu you can find general plot and list routines. Here you can plot and list any data and any results from the analysis. The available plot types are scatter, line, column, 3D scatter, histogram, contour, response surface, normal probability, wavelet, control chart and batch control chart.

Note: Click on the right mouse button to display available properties for an active plot or list. You can generate lists from plots and plots from lists.

Road map to SIMCA-P

1. Start a project (File | New): Read the data file; specify label columns and rows.
2. Look at the data (Dataset): Quick Info for variables or observations.
3. Prepare a work copy (Workset): Select variables and observations.
4. Fit the model (Analysis): Autofit or fast button.
5. Plot results (Analysis): Scores, loadings, distance to model.
6. Outliers in scores (Polish data): Prepare a new workset, graphically or via Workset.
   No outliers in scores (Continue): Interpret the model (plots); relate to the objective.
7. New data (Predictions): Select the prediction set (observations); T_pred, Y_pred, DModX, etc.

Batch Projects (SIMCA-P+ 10)

General A SIMCA-P Batch project consists of two or more linked projects: (a) the Observation level project, with several observations per batch and the variables measured during the evolution of the batch, and (b) the Batch level project(s), consisting of the completed batches, with one batch being one observation (matrix row). The variables of the Batch level project are the scores, or original variables, of the observation level at every time point folded out side-wise. Batches may be divided into phases.
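The side-wise unfolding described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not SIMCA-P's implementation: the function name unfold_batches, the dictionary keys batch and time, and the example variables temp and ph are all hypothetical, and every batch is assumed to have the same time points.

```python
from collections import defaultdict

def unfold_batches(rows, variables):
    """Turn observation-level rows into one row per batch.

    rows: list of dicts with the keys 'batch', 'time' and one key per variable
    (hypothetical layout). Returns (column_names, table), where table maps
    each batch id to its unfolded list of values.
    """
    by_batch = defaultdict(dict)
    for row in rows:
        by_batch[row["batch"]][row["time"]] = row

    # Use the time points of the first batch as the reference grid.
    times = sorted(next(iter(by_batch.values())))
    columns = [f"{var}_t{t}" for t in times for var in variables]

    table = {}
    for batch, per_time in by_batch.items():
        if sorted(per_time) != times:
            raise ValueError(f"batch {batch} has different time points")
        # Lay the variables of each time point side by side.
        table[batch] = [per_time[t][var] for t in times for var in variables]
    return columns, table

observations = [
    {"batch": "B1", "time": 0, "temp": 20.0, "ph": 7.0},
    {"batch": "B1", "time": 1, "temp": 25.0, "ph": 6.8},
    {"batch": "B2", "time": 0, "temp": 21.0, "ph": 7.1},
    {"batch": "B2", "time": 1, "temp": 26.0, "ph": 6.9},
]
columns, batch_level = unfold_batches(observations, ["temp", "ph"])
print(columns)
print(batch_level)
```

Each batch ends up as a single matrix row, which is what allows the Batch level project to be treated as a regular project.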


Observation level project With Batch data, you start by importing the Observation level data and creating the Observation level project. In the data, you must have a Batch identifier, indicating the start and end of the batch, and if phases are present, also a phase identifier. You may also have a variable indicating the evolution of the batch or phase and its end point. This variable can be Time or Maturity. You can have different Maturity variables for different phases. Unfitted batch models are implicitly created by SIMCA-P. When batches have phases, these are one PLS class model with Time or Maturity as Y for each phase. By default all variables in a phase are scaled to unit variance. The project window displays, for every model, one line summarizing the model results. When batches have phases, the PLS Batch class models (one for every phase) are grouped under an umbrella called MBxx, where xx is a sequential number. You can display the results of the analysis of the training set batches in Control Charts, either as scores, DModX, predicted time or maturity, or as individual variables. Secondary datasets can be imported with new batches. These can also be displayed in Control Charts in the same way.

Batch Level Project The Batch level project is based on scores or original variables for completed batches, obtained from the observation level project. The Batch level project is a regular SIMCA-P project. Batch initial conditions and quality variables, when present, are automatically added to the batch level dataset. You can change the default model type (PCA) to any desired model type allowed by the workset specification.

The Analysis cycle

Observation level project
1. Pre-processing and selection of data (Dataset and Workset menus). The Dataset menu allows you to trim / Winsorize your data, generate new variables, and perform spectral filtering or wavelet compression of the data. A model is developed from the default Workset. The default Workset consists of PLS Batch class models, one for every phase.
2. Fitting the Observation level model (Analysis menu).
3. Reviewing the results and performing diagnostics (Analysis menu).
4. Batch Control Charts for training set batches (Analysis menu).
5. Importing a Secondary dataset with new batches and using the model to display the new batches in the Control Charts (Prediction | Batch Control Chart).

Batch Level Project
6. Creating the Batch level project (File | Create Batch Level project).
7. Fitting the Batch level project.
8. Interpretation using score plots, loading plots, DModX, contribution plots, etc.
9. Predicting and interpreting results for new whole batches.


Introduction

General This tutorial is a brief introduction to using SIMCA-P on selected data sets. The user is advised to go through the different phases of modeling (importing data, PC and PLS modeling) and to look at the results in graphs and lists. For a more detailed description of how to use SIMCA-P, the USER’S GUIDE and the ON-LINE HELP system (identical in content) are recommended. There are seven examples in this tutorial. The first example shows the strength of using projection methods on food data. The second example is from a real process at a mineral sorting plant. The third example is a multivariate calibration of the kind often performed in analytical chemistry. The fourth example illustrates hierarchical modeling. The fifth demonstrates the use of spectral filtering. Examples six and seven show how to handle batch data, without and with phases. The tutorial covers only the main functionality and plots in SIMCA-P. We recommend that you continue with your own data, and use the Manual for details. The Help system contains the same information as the Manual, but organized in a different way.

Plots and Lists You can display the results of SIMCA-P in numerous graphs and lists. From the Analysis and the Predictions menus, results of the active model are available as plots and lists. The Plot/List menu lets you plot or list the raw data and every computed value from every model. You can even plot vectors from different models against each other. Auto and Cross Correlation plots as well as Power Spectrum are available for all vectors. In Dataset you can preprocess the data by trimming and winsorizing. Quick info plots are available with all spreadsheets.
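As a rough illustration of the difference between trimming and winsorizing, the sketch below applies both to one variable. It assumes that winsorizing replaces values beyond a limit with the limit itself, that trimming replaces them with missing values (None here), and that the limits are nearest-rank percentiles. The function names and the percentile rule are illustrative choices, not SIMCA-P's documented behavior.

```python
import math

def percentile_limits(values, lower_pct, upper_pct):
    """Return (low, high) limits using the nearest-rank percentile rule."""
    ordered = sorted(values)
    n = len(ordered)

    def nearest_rank(pct):
        rank = max(1, math.ceil(pct * n / 100))
        return ordered[rank - 1]

    return nearest_rank(lower_pct), nearest_rank(upper_pct)

def winsorize(values, lower_pct=5, upper_pct=95):
    """Pull values outside the limits back to the nearest limit."""
    low, high = percentile_limits(values, lower_pct, upper_pct)
    return [min(max(v, low), high) for v in values]

def trim(values, lower_pct=5, upper_pct=95):
    """Replace values outside the limits with None (missing)."""
    low, high = percentile_limits(values, lower_pct, upper_pct)
    return [v if low <= v <= high else None for v in values]

readings = [-50, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values with two extremes
print(winsorize(readings, 20, 80))
print(trim(readings, 20, 80))
```

Winsorizing keeps the number of values unchanged, while trimming leaves gaps that the model then has to treat as missing data.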


Foods

Data Collected data are often presented as data tables, which are almost useless when it comes to extracting information. A data table is much better presented graphically. The example below illustrates the principles of projection. The data in this example describe the consumption of different food items in several European countries.

Variables The selection of the variables reflects the different traditions and cultural behavior of the countries.

Observations 16 European countries have been selected.

Data table

1 2 3 4 5 6 7 8 9 10

Grain_Coffee Inst_Coffee Tea Sweet Bisc Pa_Soup Ti_Soup In_Potat Fro_Fish Fro_Veg

1 Germany 90 49 88 19 57 51 19 21 27 21

2 Italy 82 10 60 2 55 41 3 2 4 2

3 France 88 42 63 4 76 53 11 23 11 5

4 Holland 96 62 98 32 62 67 43 7 14 14

5 Belgium 94 38 48 11 74 37 23 9 13 12

6 Luxembourg 97 61 86 28 79 73 12 7 26 23

7 England 27 86 99 22 91 55 76 17 20 24

8 Portugal 72 26 77 2 22 34 1 5 20 3

9 Austria 55 31 61 15 29 33 1 5 15 11

10 Switzerland 73 72 85 25 31 69 10 17 19 15

11 Sweden 97 13 93 31 43 43 39 54 45

12 Denmark 96 17 92 35 66 32 17 11 51 42

13 Norway 92 17 83 13 62 51 4 17 30 15

14 Finland 98 12 84 20 64 27 10 8 18 12

15 Spain 70 40 40 62 43 2 14 23 7

16 Ireland 30 52 99 11 80 75 18 2 5 3


11 12 13 14 15 16 17 18 19 20

Apples Orang Ti_Fruit Jam Garlic Butter Margarine Olive_Oil Youg Crisp_Bread

1 Germany 81 75 44 71 22 91 85 74 30 26

2 Italy 67 71 9 46 80 66 24 94 5 18

3 France 87 84 40 45 88 94 47 36 57 3

4 Holland 83 89 61 81 15 31 97 13 53 15

5 Belgium 76 76 42 57 29 84 80 83 20 5

6 Luxembourg 85 94 83 20 91 94 94 84 31 24

7 England 76 68 89 91 11 95 94 57 11 28

8 Portugal 22 51 8 16 89 65 78 92 6 9

9 Austria 49 42 14 41 51 51 72 28 13 11

10 Switzerland 79 70 46 61 64 82 48 61 48 30

11 Sweden 56 78 53 75 9 68 32 48 2 93

12 Denmark 81 72 50 64 11 92 91 30 11 34

13 Norway 61 72 34 51 11 63 94 28 2 62

14 Finland 50 57 22 37 15 96 94 17 64

15 Spain 59 77 30 38 86 44 51 91 16 13

16 Ireland 57 52 46 89 5 97 25 31 3 9

Objective The objective of this study is to understand how the variation in food consumption among a number of industrialized countries is related to culture and tradition, and hence to find the similarities and dissimilarities among the countries. Data have therefore been collected on 20 variables for 16 countries. The data show the percentage of households that regularly use each of the 20 food items.

Analysis Outline The steps to follow in SIMCA-P are:

• Import the data set.
• Prepare the data (Workset menu).
• Fit a PC model and review the fit (Analysis menu).
• Interpret the results (Analysis menu).

Define project Start SIMCA-P and create a new project from FILE | NEW


Select the type of data (XLS) or All Supported Files (the default) and find the data set (FOODS.XLS). Data can be imported from your hard disk or from a network drive, and in different formats; select the appropriate format or All Supported Files. In this example the data are in an XLS file created in Excel. If the data set is on a floppy disk, we recommend that you first copy the file to the hard disk.


If you want to keep the current project open, remove the check mark from the box Close Current Project.

Note: The data set to import can be located in any accessible directory. It does not have to be located in the destination directory you have defined.

When you click on Open, SIMCA-P opens the Import Wizard. With SIMCA-P+, mark the radio button SIMCA-P normal project.

SIMCA-P has recognized that this example has observation numbers and names and variable names, and has correctly color coded them.


When you click on Next, the Project specification page opens. You can change the project name and a destination directory. Mark the check box Use workset wizard and click on Finish.

Workset Wizard The workset wizard opens to guide you through the creation of the workset and the fitting of the model.


On the Variable page, select which variables are X or Y and which variables to exclude. If you mark variables and press Transform, the software checks whether a transformation is needed and applies it (Log Transform) when it is. For this example, all variables are X and no transformation is needed; click on Next.

In this page you exclude/include observations or set observations into classes. The Set class from ObsID uses a selected part of any observation ID to set classes automatically.


This example is a PCA to get an overview of the data table; all observations are included and no classes are specified. Click on Next to display a summary of the specifications and then click on Finish to fit the model with cross validation.

Analysis The plot with the summary of the fit of the model is displayed with R2X(cum) (fraction of the variation of the data explained after each component) and Q2(cum) (cross validated R2X(cum)). Double click on model summary line. The summary of the fit of the model is displayed with R2X (fraction of the variation of the data explained by each component) and cumulative R2X(cum), Q2 and Q2(cum) (cross validated R2X and R2X(cum)) as well as the eigenvalues. The food variables are, as expected, correlated, and fairly well summarized by three new variables, the scores, explaining 65% of the variation.
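To make the meaning of R2X and R2X(cum) concrete, here is a minimal Python sketch that extracts principal components one at a time with a NIPALS-style iteration and reports the fraction of the sum of squares each one explains. It uses only three variables (Garlic, Olive_Oil, Crisp_Bread) and six countries from the data table, so the numbers will not match the 20-variable model above. The function names, the convergence settings and the autoscaling step are assumptions made for illustration, and cross validation (Q2) is not included.

```python
import math
import statistics

def autoscale(rows):
    """Center each column and scale it to unit variance (sample std. dev.)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sum_of_squares(matrix):
    return sum(v * v for row in matrix for v in row)

def pca_r2x(rows, n_components, tol=1e-12, max_iter=500):
    """Return R2X per component from a NIPALS-style PCA of autoscaled data."""
    E = autoscale(rows)                      # residual matrix, starts as X
    n, k = len(E), len(E[0])
    ss_total = sum_of_squares(E)
    r2x = []
    for _ in range(n_components):
        # Start from the residual column with the largest sum of squares.
        col_ss = [sum(E[i][j] ** 2 for i in range(n)) for j in range(k)]
        start = col_ss.index(max(col_ss))
        t = [E[i][start] for i in range(n)]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]        # loading vector of unit length
            t_new = [sum(E[i][j] * p[j] for j in range(k)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new                        # score vector
            if change <= tol * sum(v * v for v in t):
                break
        ss_before = sum_of_squares(E)
        # Remove the fitted component from the residuals.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        r2x.append((ss_before - sum_of_squares(E)) / ss_total)
    return r2x

# Garlic, Olive_Oil, Crisp_Bread for Germany, Italy, France, Sweden, Norway, Spain
subset = [
    [22, 74, 26],
    [80, 94, 18],
    [88, 36, 3],
    [9, 48, 93],
    [11, 28, 62],
    [86, 91, 13],
]
r2x = pca_r2x(subset, 3)
cumulative = 0.0
for a, value in enumerate(r2x, start=1):
    cumulative += value
    print(f"component {a}: R2X = {value:.3f}, R2X(cum) = {cumulative:.3f}")
```

With as many components as variables the cumulative value reaches 1, which is why the useful question is how few components give an acceptable R2X(cum).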


Scores and Loadings Scores

Select Analysis | Scores | Scatter Plot or the fast button to display the score plot of t1 vs. t2 (default). In the Label Types page, make sure the secondary identifier Onam is selected.

The ellipse represents the Hotelling T2 with 95% confidence (see statistical appendix). The scores t1 and t2, one vector for components 1 and 2, are new variables computed as linear combinations of all the original variables to provide a good summary. The weights combining the original variables are called loadings (p1 and p2), see below. The score plot shows 3 groups of countries. One group with the Scandinavian countries (the North), the second with countries from the South of Europe, and a third more diffuse with countries from Central Europe.


To color the observations (countries) by the values of a variable, right click, and open the properties. Select color, by categories, and in the combo box choose a variable (here garlic). In the split range window, enter 4.

Change the split range as needed in the text boxes on the right.

Garlic separates clearly Northern Europe from Southern Europe.

Loadings Select Analysis | Loadings | Scatter Plot to display the loadings p1 vs. p2. The loadings are the weights with which the X-variables are combined to form the X-scores, t (see above). This plot shows which variables describe the similarity and dissimilarity between countries.
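Written out, the relation between scores and loadings in a PC model takes the standard form below. The index names are chosen here for illustration: x_ik is the (centered and scaled) value of variable k for observation i, p_ka is the loading of variable k in component a, t_ia is the score of observation i in component a, and E holds the residuals.

```latex
t_{ia} = \sum_{k=1}^{K} x_{ik}\, p_{ka},
\qquad
X = T P^{\mathsf{T}} + E
```

A variable with a loading near zero in a component therefore contributes little to the corresponding score, while variables with large loadings of the same sign pull the observations in the same direction in the score plot.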


Scandinavians eat crisp bread, frozen fish and vegetables, while in southern Europe people use garlic and olive oil, and central Europeans (in particular the French) consume a lot of yogurt.

Third Component Plot the scores (t1 vs. t3) and loadings (p1 vs. p3). The third component explains 13.8% of the variation in the data, and mainly reflects the high consumption of Tea, Jam and canned soups in England and Ireland.

Summary In conclusion, a three-component model of the data summarizes the variation in three major latent variables, describing the main variation of food consumption in the investigated European countries. This example shows simple PC modeling to get an overview of a data table. The user is encouraged to continue to experiment with the data set: take away observations and/or variables, refit new models, and interpret the results.


Mineral sorting at LKAB

Introduction The following example is taken from a mineral sorting plant at LKAB in Malmberget, Sweden. Research engineer Kent Tano at LKAB was responsible for this investigation. In this process, raw iron ore (TON_IN) is divided into finer material (<100 mm, 50% Fe) by passing it through several grinders. After grinding, the material is sorted and concentrated in several steps by magnetic separators. The separation flow is divided into several parallel lines and there are also feedback systems to get as high an Fe concentration as possible. The concentrated material is divided into two products, one (PAR) which is sent to a flotation process and another (FAR, fines) which is sold as is. For both these products high Fe content is important. Twelve process factors were identified. Of these, three important factors were used to set up a statistical design (RSM). The results of each experiment were measured in 6 response variables. Several observations were collected for each design point. The process is equipped with an ABB Master system with a SuperView 900 connected to the process data system. Data were transferred from the ABB system to a personal computer with the SIMCA-P software for modeling. Models were transferred back to the SuperView system for on-line monitoring (predictions, score and loading plots) of the process. The investigation was made in 1992. The multivariate on-line control of the process is still in operation, with very good results concerning the quality of the products.


Data description The following is a description of variables and observations.

Variables Data from 18 variables were collected.

Process variables (X)

No.  Explanation                Abbr.     RSM
1    Total load                 TON_IN    Design
2    Load of grinder 30         KR30_IN
3    Load of grinder 40         KR40_IN
4    PARmull                    PARM
5    Velocity of separator 1    HS_1      Design
6    Velocity of separator 2    HS_2      Design
7    Effect grinder 30          PKR_30
8    Effect grinder 40          PKR_40
9    Ore waste                  GBA
10   Load of separator 3        TON_S3
11   Waste from grinding        KRAV_F
12   Total waste                TOTAVF

Responses (Y)

No.  Explanation                     Abbr.
13   Amount of concentrate type 1    PAR
14   Amount of concentrate type 2    FAR
15   Distribution of type 1 and 2    r_FAR
16   Iron (Fe) in FAR                %Fe_FAR
17   Phosphorus (P) in FAR           %P_FAR
18   Iron (Fe) in raw ore            %Fe_malm

Observations A subset of 231 observations was used for modeling. Each observation has a name referring to the date and time when data were collected.
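The observation names in Table 1 are 16-digit strings such as 1992030512300000. The sketch below decodes them on the assumption that the first 12 digits are year, month, day, hour and minute (YYYYMMDDHHMM); the meaning of the last four digits is not stated in this tutorial, so they are ignored. The helper name parse_obs_id is hypothetical.

```python
from datetime import datetime, timedelta

def parse_obs_id(obs_id):
    """Decode an observation name, assuming it starts with YYYYMMDDHHMM.

    The trailing four digits are ignored because their meaning is not
    documented here.
    """
    return datetime.strptime(obs_id[:12], "%Y%m%d%H%M")

# First three observation names from Table 1
ids = ["1992030512300000", "1992030512310000", "1992030512320000"]
times = [parse_obs_id(obs_id) for obs_id in ids]
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

print(times[0].isoformat())
print(gaps)
```

Decoded this way, consecutive observations are one minute apart, and the jumps in the observation numbers of Table 1 correspond to gaps in the logged time.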

Data table A subset of the data is shown in Table 1.


/ Sovr.XLS Last change 930818/
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Nr ID (logged time) Ton_in KR30_IN KR40_IN PARM HS_1 HS_2 PKR_30 PKR_40 GBA TON_S3 KRAV_F TOTAVF PAR FAR r_FAR %Fe_FAR %P_FAR %Fe_malm
ONUM ONAM Ton_in KR30_IN KR40_IN PARM HS_1 HS_2 PKR_30 PKR_40 GBA TON_S3 KRAV_F TOTAVF PAR FAR r_FAR %Fe_FAR %P_FAR %Fe_malm
91 1992030512300000 1271.81 275.81 190.88 62.98 90.41 79.52 57.44 41.35 163.8 203.94 75.34 383.71 307.88 591.75 65.78 66.2 0.24 47.9
92 1992030512310000 1290.56 278.55 208.58 58.08 90.41 79.52 56.85 43.57 156.38 203.64 79.49 384.28 314.63 601.5 65.66 66.2 0.24 47.9
93 1992030512320000 1267.39 278.55 207.38 63.19 90.41 79.52 51.38 42.2 188.57 200.64 81.99 398.21 312.19 585.06 65.21 66.2 0.24 47.9
94 1992030512330000 1250.44 278.06 204.53 57.48 90.41 79.52 54.02 41.22 155.1 206.27 75.79 384.56 298.63 576.75 65.89 66.2 0.24 47.9
95 1992030512340000 1265.51 279.56 190.43 49.31 90.41 79.52 54.74 42.66 169.35 214.82 79.99 415.38 304.13 591.75 66.05 66.2 0.24 47.9
96 1992030512350000 1268.18 276.11 194.63 62.88 90.41 79.52 52.29 42.66 169.8 212.27 80.69 403.71 310.88 592 65.57 66.2 0.24 47.9
97 1992030512360000 1284.3 272.55 211.28 58.68 90.41 79.52 48.03 41.74 174.83 206.57 80.89 405.33 293.13 575.5 66.25 66.2 0.24 47.9
98 1992030512370000 1284.41 275.4 208.28 50.48 90.41 79.52 59.11 43.77 182.25 208.67 76.49 420.53 304.13 580 65.6 66.2 0.24 47.9
99 1992030512380000 1272.79 274.35 207.53 62.68 90.41 79.52 59 44.36 181.5 201.92 74.53 394.57 300.13 598.25 66.59 66.2 0.24 47.9
100 1992030512390000 1317.11 269.81 192.23 56.18 90.41 79.52 56.25 42.59 185.93 199.44 79.49 409.68 311.13 579.75 65.08 66.2 0.24 47.9
101 1992030512400000 1273.16 264.71 195.38 49.56 90.41 79.52 54.5 42.53 186.15 193.82 79.14 405.47 291.63 585.25 66.74 66.2 0.24 47.9
102 1992030512410000 1239.15 264.86 209.93 56.78 90.41 79.52 62.6 44.88 173.78 207.24 83.94 393.18 300.63 601 66.66 66.2 0.24 47.9
103 1992030512420000 1290.86 272.21 201.08 62.83 90.41 79.52 56.28 42.66 159.38 211.52 79.79 389.36 325.63 605.5 65.03 66.2 0.24 47.9
104 1992030512430000 1272.64 267.6 201.38 65.58 90.41 79.52 53.66 40.89 163.8 209.04 78.89 377.68 304.13 592.75 66.09 66.2 0.24 47.9
105 1992030512440000 1285.58 264.26 203.78 55.43 90.41 79.52 52.33 43.38 168.83 213.17 75.69 400.21 314.44 594.56 65.41 66.2 0.24 47.9
106 1992030512450000 1263.75 267.9 187.13 63.58 90.41 79.52 50.4 42.85 176.55 196.74 81.44 390.11 314.88 587.75 65.12 66.2 0.24 47.9
107 1992030512460000 1289.36 264.86 212.63 53.03 90.41 79.52 52.3 42.72 175.65 190.52 76.59 406.36 300.38 593.5 66.4 66.2 0.24 47.9
108 1992030512470000 1309.05 272.55 194.78 50.93 90.41 79.52 50.01 45.47 172.35 200.64 76.54 392.13 297.88 596.25 66.69 66.2 0.24 47.9
109 1992030512480000 1282.01 271.16 209.48 61.83 90.41 79.52 47.49 46.85 197.94 193.59 75.99 397.38 315.88 576.5 64.6 66.2 0.24 47.9
110 1992030512490000 1288.91 264.26 222.83 51.88 90.41 79.52 60.47 44.36 193.07 193.29 77.19 419.35 305.63 577.5 65.39 66.2 0.24 47.9
111 1992030512500000 1289.96 262.46 210.38 45.11 90.41 79.52 51.09 47.77 195.77 189.24 79.14 419.04 310.13 566.25 64.61 66.2 0.24 47.9
131 1992030513100000 1062.49 217.61 170.63 32.41 94.52 74.7 41.08 40.82 130.71 153.08 59.13 310.49 242.56 477.19 66.3 67.2 0.2 51.2
132 1992030513110000 1024.8 218.06 178.43 43.46 94.52 74.7 38.12 38.27 132.58 152.7 62.28 294.07 261.13 505.5 65.94 67.2 0.2 51.2
133 1992030513120000 1070.74 215.06 165.34 38.51 94.52 74.7 40.92 39.25 140.01 147.88 57.33 304 248.56 501.75 66.87 67.2 0.2 51.2
134 1992030513130000 1054.65 216.08 176.03 31.41 94.52 74.7 44.23 39.19 135.88 148.48 59.33 318.05 249.56 523 67.7 67.2 0.2 51.2
135 1992030513140000 1072.05 214.73 166.24 41.08 94.52 74.7 42.43 37.09 127.71 149.7 61.14 302.51 263.56 514.5 66.13 67.2 0.2 51.2
136 1992030513150000 1056.71 224.63 177.68 35.31 94.52 74.7 46.42 39.25 117.51 143.68 57.73 285.5 252.88 522.25 67.38 67.2 0.2 51.2
137 1992030513160000 1025.7 216.68 174.68 29.61 94.52 74.7 46.51 36.76 117.81 148.93 57.13 294.25 257.63 513.25 66.58 67.2 0.2 51.2
138 1992030513170000 1045.91 215.03 171.23 43.26 94.52 74.7 43.41 39.19 129.58 148.63 52.23 290.73 253.63 499.94 66.34 67.2 0.2 51.2
139 1992030513180000 1044.15 219.08 166.88 36.01 94.52 74.7 47.06 39.58 127.41 136.56 56.43 279.88 237.06 508.25 68.19 67.2 0.2 51.2
140 1992030513190000 1106.14 219.08 175.58 27.11 94.52 74.7 41.08 41.94 130.11 149.38 55.28 312.23 243.81 519.25 68.05 67.2 0.2 51.2
141 1992030513200000 1079.55 222.83 199.58 37.91 94.52 74.7 48.49 38.92 121.41 148.33 56.03 285.54 259.88 521.5 66.74 67.2 0.2 51.2
142 1992030513210000 1024.35 213.23 184.43 36.41 94.52 74.7 45.05 42.33 140.53 136.41 57.68 296.2 257.13 515.5 66.72 67.2 0.2 51.2
143 1992030513220000 1071.49 213.41 174.83 27.91 94.52 74.7 44.67 40.17 125.21 141.9 59.73 300.23 266.88 515.5 65.89 67.2 0.2 51.2
144 1992030513230000 1069.65 220.76 192.38 36.41 94.52 74.7 40.41 39.71 122.76 145.63 55.03 287 254.63 531.25 67.6 67.2 0.2 51.2
145 1992030513240000 1055.14 217.76 182.03 36.96 94.52 74.7 38.3 39.51 128.16 142.71 52.73 284.99 265.38 530.5 66.66 67.2 0.2 51.2
146 1992030513250000 1087.2 225.11 157.88 34.98 94.52 74.7 41.79 38.73 131.03 140.03 56.09 292.16 256.38 527.25 67.28 67.2 0.2 51.2
173 1992030513520000 1059.75 224.33 176.48 6 94.52 48.19 46.19 39.58 104.81 99.11 55.68 249.03 238.31 591.75 71.29 64.3 0.39 50.9
174 1992030513530000 1056.15 229.01 164.93 6.5 94.52 48.19 43.1 41.81 115.03 95.66 54.03 258.22 232.81 557.75 70.55 64.3 0.39 50.9
175 1992030513540000 1032.9 221.33 173.48 0.05 94.52 48.19 46.29 39.25 112.56 95.14 57.43 260.62 230.31 564.75 71.03 64.3 0.39 50.9
176 1992030513550000 1059.04 237.26 173.93 6.5 94.52 48.19 45.42 39.38 115.78 96.56 55.93 263.47 223.31 587.5 72.46 64.3 0.39 50.9
177 1992030513560000 1008 228.11 166.28 9.86 94.52 48.19 42.97 41.22 105.58 102.06 57.34 257.74 248.38 567.75 69.57 64.3 0.39 50.9
178 1992030513570000 1079.29 223.61 170.03 9.65 94.52 48.19 46.19 38.07 119.68 102.11 56.03 263.05 225.31 615.75 73.21 64.3 0.39 50.9
179 1992030513580000 1096.84 225.41 177.68 3.15 94.52 48.19 42.26 40.69 120.28 99.71 57.38 278.64 205.06 600.25 74.54 64.3 0.39 50.9
180 1992030513590000 1057.65 225.11 174.98 11.1 94.52 48.19 43 39.97 109.31 98.36 56.13 252.7 216.56 562.75 72.21 64.3 0.39 50.9
181 1992030514000000 1073.29 223.01 168.04 5.65 94.52 48.19 42.33 37.81 120.88 101.14 55.58 268.34 227.56 604.5 72.65 64.3 0.39 50.9
182 1992030514010000 1094.55 224.03 163.24 2 94.52 48.19 45.65 38.27 106.16 102.26 53.58 260.15 221.56 611.5 73.4 64.3 0.39 50.9
183 1992030514020000 1033.5 219.98 168.98 10.9 94.52 48.19 41.81 39.12 111.49 99.64 53.38 265.02 226.06 576.25 71.82 64.3 0.39 50.9
184 1992030514030000 1061.29 214.73 167.44 4.61 94.52 48.19 42.85 39.05 111.56 101.53 58.43 269.5 223.06 571.81 71.94 64.3 0.39 50.9
185 1992030514040000 1037.4 217.43 167.14 1.25 94.52 48.19 42.2 39.19 100.84 99.79 54.83 254.2 218.81 595 73.11 64.3 0.39 50.9
186 1992030514050000 1048.65 219.98 174.53 8.4 94.52 48.19 42.1 37.29 109.84 102.79 52.23 258.65 220.56 571 72.14 64.3 0.39 50.9
198 1992030514170000 1059.9 221.93 152.63 12.9 86.3 74.7 39.63 36.76 101.89 127.78 50.63 267.39 238.31 579.5 70.86 66.3 0.28 51.9
199 1992030514180000 1043.25 215.48 159.98 20.7 86.3 74.7 37.7 37.02 94.16 124.26 49.06 246.78 227.81 595.75 72.34 66.3 0.28 51.9
200 1992030514190000 1054.8 211.43 168.53 17.2 86.3 74.7 39.45 37.4 94.01 124.78 46.86 256.11 217.81 557.25 71.9 66.3 0.28 51.9
201 1992030514200000 1037.85 219.23 168.04 11.4 86.3 74.7 40.28 39.97 90.41 130.86 47.26 262.16 228.56 590 72.08 66.3 0.28 51.9
202 1992030514210000 1062.49 225.56 155.63 19.7 86.3 74.7 41.25 35.65 105.56 137.61 49.91 273.38 228.81 572.25 71.44 66.3 0.28 51.9
203 1992030514220000 1036.05 208.13 163.43 17.2 86.3 74.7 40.86 37.88 83.29 135.06 50.88 258.99 229.31 570.75 71.34 66.3 0.28 51.9
204 1992030514230000 1059.15 210.98 155.03 11.95 86.3 74.7 37.32 35.5 102.19 135.81 51.78 280.93 227.81 586.75 72.03 66.3 0.28 51.9
205 1992030514240000 1033.99 213.11 169.58 18.15 86.3 74.7 38.06 38.14 102.49 128.68 47.26 259.08 218.81 575.75 72.46 66.3 0.28 51.9
206 1992030514250000 1035.04 215.96 161.03 21.4 86.3 74.7 40.17 38.4 109.01 130.33 50.63 261.36 235.81 577.75 71.01 66.3 0.28 51.9
207 1992030514260000 1029.3 208.73 163.09 20.7 86.3 74.7 40.86 36.83 102.11 127.48 48.36 249.91 228.56 568.5 71.32 66.3 0.28 51.9
208 1992030514270000 0 0 0 1.5 86.3 74.7 20.45 15.52 0 0 0 0 182.06 532 74.5 66.3 0.28 51.9
233 1992030514520000 1491.83 297.86 233.51 1.41 86.3 48.19 59.31 48.68 168.23 140.38 75.79 372.87 305.63 861.88 73.82 62.6 0.42 49.8
234 1992030514530000 1489.58 298.46 254.21 6.8 86.3 48.19 58.97 47.57 141.73 140.76 85.74 359.68 299.38 846.38 73.87 62.6 0.42 49.8
235 1992030514540000 1500.08 295.31 235.91 9.75 86.3 48.19 62.99 47.18 144.73 141.28 89.64 365.9 331.88 825.63 71.33 62.6 0.42 49.8
236 1992030514550000 1467.9 305.4 261.26 3.35 86.3 48.19 59.59 49.67 165.53 134.98 90.89 394.19 313.38 838.88 72.8 62.6 0.42 49.8
237 1992030514560000 1477.58 307.8 247.16 6.8 86.3 48.19 62.85 48.75 145.48 138.51 90.79 363.23 331.63 856.13 72.08 62.6 0.42 49.8
238 1992030514570000 1490.55 306.75 251.66 11.3 86.3 48.19 60.13 50.25 159.75 137.31 81.34 372.12 323.13 833.38 72.06 62.6 0.42 49.8
239 1992030514580000 1500.83 296.81 262.16 12.3 86.3 48.19 63.78 51.11 151.35 135.88 81.99 356.92 318.88 857.63 72.9 62.6 0.42 49.8
240 1992030514590000 1506.04 295.01 255.23 5.45 86.3 48.19 59.47 51.83 159.83 133.5 84.54 375.34 323.88 787.44 70.86 62.6 0.42 49.8
241 1992030515000000 1495.2 307.8 271.46 13.4 86.3 48.19 60.11 54.71 171.08 139.03 89.39 382.27 303.88 835.63 73.33 62.6 0.42 49.8
242 1992030515010000 1493.78 310.8 259.46 7.35 86.3 48.19 59.2 49.34 152.1 133.63 84.99 366.02 339.13 848.38 71.44 62.6 0.42 49.8
251 1992030515100000 1269.71 273.11 215.33 21 90.41 66.24 49.55 46.72 152.48 146.03 74.73 347.11 294.63 650.06 68.81 67.1 0.19 50.1
252 1992030515110000 1270.91 270.41 228.71 18.3 90.41 66.25 50.76 51.04 148.48 144.21 72.78 347.16 274.38 652.81 70.41 67.1 0.19 50.1
253 1992030515120000 1279.91 266.21 232.31 16.5 90.41 66.26 46.13 48.36 153 139.41 71.68 350.78 309.63 638.56 67.35 67.1 0.19 50.1
254 1992030515130000 1280.36 267.86 230.21 25.71 90.41 66.26 50.34 46.39 141.13 141.96 75.54 336.06 291.38 616.75 67.91 67.1 0.19 50.1
255 1992030515140000 1249.46 262.61 213.38 23.6 90.41 66.26 56.84 45.8 149.91 139.93 77.49 339.68 299.88 653.06 68.53 67.1 0.19 50.1
256 1992030515150000 1277.63 264.71 211.43 16.7 90.41 66.26 47.34 49.93 162.08 149.53 72.63 362.66 287.13 655.31 69.53 67.1 0.19 50.1
257 1992030515160000 1306.05 260.81 210.23 25.81 90.41 66.26 50.91 46.65 172.2 140.23 75.69 367.11 306.63 635.06 67.44 67.1 0.19 50.1
258 1992030515170000 1243.24 271.01 216.08 20.7 90.41 66.26 47.08 47.05 165.83 148.48 77.89 362.19 288.88 656.06 69.43 67.1 0.19 50.1
259 1992030515180000 1262.59 277.31 205.58 13.65 90.41 66.26 55.09 45.34 152.7 146.76 74.43 364.57 280.88 654.06 69.96 67.1 0.19 50.1
260 1992030515190000 1282.76 262.76 224.03 23.61 90.41 66.26 52.17 47.5 163.43 137.61 77.29 362.88 289.38 664.06 69.65 67.1 0.19 50.1
261 1992030515200000 1277.18 257.66 221.48 17.95 90.41 66.26 50.37 47.24 156.23 145.33 73.18 356.78 288.13 659.06 69.58 67.1 0.19 50.1
262 1992030515210000 1258.46 263.81 203.63 10.4 90.41 66.26 49.32 45.93 154.95 143.16 77.59 368.96 268.13 653.31 70.9 67.1 0.19 50.1
263 1992030515220000 1236.94 255.86 207.53 20.25 90.41 66.26 48.91 47.57 173.1 143.76 75.39 364.57 284.63 632.06 68.95 67.1 0.19 50.1
323 1992030516220000 1247.89 253.31 230.78 6.3 82.19 66.26 56.59 46.06 190.82 140.08 90.44 411.49 257.13 622.25 70.76 64.6 0.24 44
324 1992030516230000 1262.66 258.26 216.83 13.1 82.19 66.26 51.78 52.81 191.34 139.86 90.64 413.86 275.38 628.06 69.52 64.6 0.24 44
325 1992030516240000 1303.99 259.31 238.31 12.45 82.19 66.26 52.94 50.32 196.14 135.88 96.64 416.21 263.13 622.5 70.29 64.6 0.24 44
326 1992030516250000 1280.74 267.71 229.13 5.25 82.19 66.26 56.69 49.53 203.27 138.88 91.49 428.39 273.13 626.06 69.63 64.6 0.24 44
327 1992030516260000 1259.63 263.81 220.43 11.4 82.19 66.26 54.28 53.73 184.95 136.86 84.39 394.79 285.13 622.75 68.59 64.6 0.24 44
328 1992030516270000 1301.21 262.46 223.13 16.15 82.19 66.26 54.95 48.22 196.82 134.76 89.74 407.26 278.13 628.06 69.31 64.6 0.24 44
329 1992030516280000 1282.58 264.11 222.83 22.8 82.19 66.26 51.83 50.58 191.34 136.56 94.84 411.26 290.13 625 68.3 64.6 0.24 44
330 1992030516290000 1252.13 265.01 226.43 11.2 82.19 66.26 54.12 48.16 191.12 135.51 84.89 390.99 268.13 635.06 70.31 64.6 0.24 44
331 1992030516300000 1248.98 270.11 230.63 17 82.19 66.26 54.58 45.93 190.37 133.11 88.34 386.02 270.88 607.75 69.17 64.6 0.24 44

Table 1


Objective The objective of this study is to investigate the relationship between the process variables and the 6 output variables describing the quality of the final product.

Analysis outline An Overview of the Responses A PC model of the responses is made to understand:

• How the responses relate to each other and to the observations.

• The similarity and dissimilarity between the observations, and if there are outliers.

• The explanatory power of the variables.

Relating the process conditions to the responses

• Understand and interpret the relationship between the process variables and the responses.

• Predict the output of new process conditions.

The steps to follow in SIMCA-P

• Define the project: Import the primary data set.

• Prepare the data (Workset menu). Specify which variables are process variables (X) and which are responses (Y). Expand the X matrix with the squares and cross terms of the 3 designed variables.

• Fit the models, first PC-Y and then PLS, and review the fit (Analysis menu).

• Refine models if necessary by removing outliers (Workset menu).

• Use the PLS model for predictions (Prediction menu).


Create the project Start SIMCA-P and import the data file from FILE | NEW

Find the data set (SOVR.XLS). If you have SIMCA-P+, select the radio button to create a normal SIMCA-P project and click on Next.


Click on Commands and select Create Index Variable to generate variable numbers, and mark them as secondary IDs.

Mark the columns (Variables) PAR to the end, use the arrow on one of the variables, and from the drop down menu, select them as Y’s. This selection becomes the default workset. Click on Next.


The Import wizard opens. In the Project specification page, you can change the project name and destination directory. Make sure the check box Use workset wizard is marked and click on Finish. The workset wizard opens.

Prepare the data

Workset Wizard SIMCA-P's default workset consists of all the observations in the primary data set with all variables, scaled to unit variance and defined as X's or Y’s as specified at import.
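As a small numerical illustration of scaling to unit variance, the sketch below centers one variable and divides it by its standard deviation. It assumes that the scaling includes mean-centering and uses the sample standard deviation; the function name is hypothetical, and the values are TON_IN and PAR for observations 91 to 94 in Table 1.

```python
import statistics

def scale_unit_variance(column):
    """Center a variable and scale it to unit variance (sample std. dev.)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        raise ValueError("a constant variable cannot be scaled to unit variance")
    return [(value - mean) / sd for value in column]

ton_in = [1271.81, 1290.56, 1267.39, 1250.44]   # TON_IN, observations 91-94
par = [307.88, 314.63, 312.19, 298.63]          # PAR, observations 91-94

ton_in_scaled = scale_unit_variance(ton_in)
par_scaled = scale_unit_variance(par)
print([round(v, 3) for v in ton_in_scaled])
print([round(v, 3) for v in par_scaled])
```

After scaling, both variables have the same spread, so a variable measured in large numbers (TON_IN) does not dominate the model simply because of its units.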


To expand the X matrix with squares and/or cross terms, press Use Advanced Mode and click on Expand.

The three variables TON_IN, HS_1 and HS_2 were varied according to a statistical design (RSM) supporting a full quadratic model. We will expand the X matrix with the squares and cross-terms of these 3 variables. Mark TON_IN, HS_1, HS_2. Press the button Sq & Cross and the squares and cross-terms of these 3 variables are displayed in the expanded list. Click on OK to exit the workset menu.
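The expansion can be pictured with the following Python sketch, which adds the three squares and three cross-terms to one observation. It is an illustration only: the function name and the naming of the new terms (for example HS_1*HS_2) are assumptions, and the terms are computed on raw values, whereas how SIMCA-P centers or scales expanded terms is not described here. The values are those of observation 91 in Table 1.

```python
from itertools import combinations

def expand_sq_cross(row, names):
    """Return a copy of row with squares and pairwise cross-terms added."""
    expanded = dict(row)
    for name in names:
        expanded[f"{name}*{name}"] = row[name] ** 2      # square term
    for first, second in combinations(names, 2):
        expanded[f"{first}*{second}"] = row[first] * row[second]  # cross-term
    return expanded

designed = ["TON_IN", "HS_1", "HS_2"]
observation_91 = {"TON_IN": 1271.81, "HS_1": 90.41, "HS_2": 79.52}

expanded = expand_sq_cross(observation_91, designed)
for term, value in expanded.items():
    print(term, round(value, 2))
```

Three designed variables give three linear terms, three squares and three cross-terms, which is the set of terms of a full quadratic model.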


Analysis To first get an overview of the responses, we fit a PC model of the Y variables (PCY).

PC of Y

When you exit the workset window, an unfitted model (M1) is created with model type PLS (The default for a workset with both X's and Y's). Click on Analysis | Active Model Type and select PCY. The model type changes to PC-Y. Click on Analysis | 2 First Components to fit a PC model of the Y's with 2 components. The model overview plot opens.

Click on the model summary line to open a table with the summary of the fit of the model. This table displays R2X (fraction of the variation of the data explained by each component) and cumulative R2X(cum), as well as the eigenvalues and the Q2 and Q2(cum) (cross validated R2). The six Y's are correlated, and are summarized by two new variables, the scores t1 and t2, explaining 70.9% of their variation.


Scores and Loadings

Scores

Select Analysis | Scores | Line Plot to display the score plot of t1 vs. t2 with a line drawn between the points. In Label Types mark Use Identifier Obs ID (Primary).


The scores t1 and t2, one vector for each of dimensions 1 and 2, are new variables computed as linear combinations of the six responses; together they summarize Y. The score plot shows that the observations cluster in different groups. Each group represents a setting of the experimental design. The process ran for a certain time at each of these settings (design points) to reach stability. Measurements on the process (the observations in the score plot) were recorded every minute. No obvious outliers are present.

Loadings

Select Analysis | Loadings | Scatter Plot to display the loadings p1 vs. p2. In Label Types mark Use Identifier Var ID (Primary) and click on Save As Default Options to always display variable names.


The loadings are the weights with which the variables are combined to form the scores, t. The loadings, p, for a selected PC component represent the importance of the variables in that component and show the correlation structure between the variables, here the responses Y. In this plot we see that PAR, FAR and %P_FAR are positively correlated with each other and negatively correlated with %Fe_FAR. r_Far dominates the second component, where it is negatively correlated with PAR and only weakly correlated with the other variables. %Fe-Malm is not correlated with any of these variables in the first two components. Click on Analysis | Next Component to compute a third component, and display the loadings p1 vs. p3. The third component (explaining 22% of the variation of the data) is dominated by %Fe-Malm. In the third component this variable has a small positive correlation with %Fe-FAR, r_FAR and FAR, and little correlation with the others.


Summary of Overview of Responses

No outliers were detected. All of the responses participate in the model, and are correlated to each other, with the exception of %Fe-Malm, which is only slightly correlated to three of them.

PLS MODELING

The main objective is to develop a predictive model, relating the process variables X's to the output measurements (responses) Y. The experimental design in three of the process variables accounts for an important part of the variation of the Y's.

New Model Type

Click on Analysis | Active Model Type and select PLS.

Another unfitted model, M2, is created and you are ready to fit a PLS model.

Autofit

Click on Analysis | Autofit, or the Autofit fast button, to fit a PLS model with cross validation. The Model Overview Plot displays R2Y(cum), the fraction of the variation of Y (all the responses) explained by the model after each component, and Q2(cum), the fraction of the variation of Y that can be predicted by the model according to the cross-validation. Values of R2Y(cum) and Q2Y(cum) close to 1.0 indicate an excellent model.



Double click on the model summary line to display a list of the fit of the model per component. The present model is indeed excellent and explains 80% of the variation of Y, with a predictive ability (Q2) of 76%.
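The difference between R2 (fit) and Q2 (cross-validated prediction) can be sketched as follows. This is only the general idea: ordinary least squares stands in for PLS, a single response is used, the observations are split into seven cross-validation groups as an assumption, and the data are simulated. SIMCA-P computes these quantities per component with its own cross-validation scheme.

```python
import numpy as np


def r2_and_q2(X, y, n_groups=7):
    """R2 from a fit to all data; Q2 from leave-group-out cross-validation."""
    X = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    ss_total = np.sum((y - y.mean()) ** 2)

    # R2: residuals of the model fitted to all observations
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1.0 - np.sum((y - X @ beta) ** 2) / ss_total

    # Q2: each observation is predicted by a model that did not see it
    press = 0.0
    groups = np.arange(len(y)) % n_groups
    for g in range(n_groups):
        test = groups == g
        b, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        press += np.sum((y[test] - X[test] @ b) ** 2)
    q2 = 1.0 - press / ss_total
    return r2, q2


# Simulated data, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
r2_demo, q2_demo = r2_and_q2(X_demo, y_demo)
```

Because the held-out observations are predicted by models that never saw them, Q2 cannot exceed R2 in this sketch; a large gap between the two is a sign of overfitting.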

Summary: X/Y Overview

Click on Analysis | Summary | X/Y Overview | Plot and display the cumulative R2Y and Q2Y for every response. With the exception of %Fe-FAR and %P-FAR, all responses have an excellent R2 and Q2.

Scores t1 vs. t2

Click on Scores | Scatter plot and t1 vs. t2. Use the marker to label the outlying observation. Observation 208 lies far away in the first component.


Scores t1 vs. u1 Right click and in properties select t1 vs. u1, and in Label Types mark ObsID (Primary). We have a good relationship between the first summary of the X's (t1), and the first summary of the Y's (u1), with the exception of observation 208.


Contribution plot

To understand why observation 208 differs from the others in the first score (t1), double click on observation 208 in the t1 vs. u1 plot.

This contribution plot displays the differences, in scaled units, for all the terms in the model, between the outlying observation 208 and the normal (or average) observation, weighted by w1* (the importance of the X-variables in component 1). The raw iron ore (TON_IN) as well as the load on the grinders and the other variables were all far below average. Inspecting the data, we find TON_IN and load on the grinders to be 0 for observation 208, obviously causing a process upset (an outlier) at time 14:27.
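A score contribution of this kind can be sketched as the scaled difference between the outlying observation and the average, multiplied by the component weights. The sketch below is an assumption-laden illustration, not SIMCA-P's exact formula: in particular, using the absolute value of the weights (so that the sign of each bar shows whether the variable is above or below average) is an assumption, and the reference data and weights are invented.

```python
import numpy as np


def score_contribution(x_obs, X_ref, w_star):
    """Scaled deviation of one observation from the average, weighted by |w*|."""
    mean = X_ref.mean(axis=0)
    sd = X_ref.std(axis=0, ddof=1)
    scaled_diff = (x_obs - mean) / sd        # difference in scaled units
    return scaled_diff * np.abs(w_star)      # weight by variable importance


# Invented reference data (rows = normal observations) and weights
X_ref = np.array([[10.0, 1.0, 5.0],
                  [12.0, 2.0, 5.5],
                  [11.0, 3.0, 4.5],
                  [9.0, 2.0, 5.0],
                  [13.0, 2.0, 5.0]])
w1 = np.array([0.8, -0.5, 0.3])
upset = np.array([0.0, 2.0, 5.0])            # first variable has dropped to 0
contrib = score_contribution(upset, X_ref, w1)
```

In this toy case only the first variable deviates from its average, so it is the only one with a non-zero (negative) contribution, mirroring how TON_IN stands out for observation 208.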

Refining the model

We will remove observation 208, set aside a few observations as a Test set, and then refit the PLS model.

Excluding observation 208 using the interactive tool box

In the score plot t1 vs. u1, mark observation 208 and click on the red arrow. SIMCA-P excludes observation 208 from the workset and asks if you want to generate a new unfitted model M3. Say Yes.


The workset bar opens with the workset for model M3. Observation 208 is excluded. When you display the Dockable window Observations, 208 is marked excluded.

Removing some observations for a test set

In the Workset bar, hold the Ctrl key and mark observations 140-146, 173-179, 350-379 and 551-555, then right click and select Exclude. The deleted observations are also marked on the plot.

Autofit

Click on Analysis | Autofit, or the fast button, to refit the PLS model. The Summary | Model Overview plot is updated as the model is fitted. Note the improvement in both R2Y(cum) and Q2(cum).


Summary: X/Y Overview

Click on Analysis | Summary | X/Y Overview | Plot to display the cumulative R2Y and Q2Y for every response.

• The responses PAR, FAR and %FE_malm are very well explained (90% or better) and the others a little less well.

Scores t1 vs. t2 Click on Analysis | Scores | Scatter t1 vs. t2 and display the t1 vs. t2 plot. We see the observations separated in groups, each group representing a setting of the experimental design.


Scores t1 vs. u1 In the Properties change the Scores to t1 vs. u1.

We now have an excellent relationship between t1 and u1 with no outliers.

Loadings w*c1 vs. w*c2 The w*'s are the weights that combine the original X variables (not their residuals in contrast to w) to form the scores t. In the first component w* is equal to w. The w*'s are related to the correlation between the X variables and the Y scores u. X variables with large values of w* (positive or negative) are highly correlated with u (and thereby Y). The c's are the weights used to combine the Y's (linearly) to form the scores u. The c's express the correlation between the Y's and the t's (X-scores).
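The relation between w and w* can be checked numerically. The sketch below runs a plain NIPALS PLS with a single response on simulated data and then forms W* = W (P'W)^-1, the standard relation between the two sets of weights. The function and variable names are illustrative, and the code is not SIMCA-P's implementation.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """NIPALS PLS for one response; returns centred X, scores T, W and W*."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    X0 = X.copy()                       # original (centred) X, before deflation
    W, P, T = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)          # weights w act on the X residuals
        t = X @ w                       # X scores
        p = X.T @ t / (t @ t)           # X loadings
        c = (y @ t) / (t @ t)           # Y weight
        X = X - np.outer(t, p)          # deflate X
        y = y - c * t                   # deflate y
        W.append(w)
        P.append(p)
        T.append(t)
    W, P, T = np.column_stack(W), np.column_stack(P), np.column_stack(T)
    W_star = W @ np.linalg.inv(P.T @ W)  # weights acting on the original X
    return X0, T, W, W_star


# Simulated data, for illustration only
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 5))
y_demo = X_demo @ np.array([1.0, 0.5, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=30)
X0, T, W, W_star = pls1_nipals(X_demo, y_demo, 3)
```

Multiplying the original centred X by W* reproduces all the scores T directly, and the first column of W* coincides with the first column of W, as stated in the text.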

In the first two components, PAR and FAR are positively correlated with all the load variables and negatively correlated with r_PAR, %Fe-FaR and %Fe_Malm. The model is almost linear, except for HS_2 and its squared term, which dominate the second component.

Normal Probability plot of residuals Click on Analysis | Residuals | Normal Probability Plot to display the Normal probability plot of residuals.

Examining this plot, we see that the residuals are close to normally distributed, with no outliers. Right click and, in the Properties page, shift between different Y variables and/or change options.

Coefficients

Click on Analysis | Coefficients | Plot to display the PLS regression coefficients (for scaled and centered data) for PAR, with confidence intervals (the default is 95%). The dominating factors are TON_IN, KR30_in, KR40_in and Ton_S3, all with a positive effect. Use the Property bar to change responses or components.


Variable Importance Click on Analysis | Variable Importance. This plot shows the importance of the terms in the model, as they correlate with Y (all the responses) and approximate X.

Distance to the Model Click on Analysis | Distance to the Model | XBlock to display the distance to the model (how far away an observation is from the model hyper-plane) in the X space. These distances are in normalized units and are the same as the row residual standard deviations.
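As a rough illustration of a normalized distance to the model, the sketch below computes each observation's residual standard deviation after a PCA model with A components and divides it by the pooled residual standard deviation. A PCA model stands in for the X-part of the PLS model, the degrees of freedom used are an assumption, and SIMCA-P applies further corrections that are omitted here. The data are simulated.

```python
import numpy as np


def dmodx(X, n_components):
    """Normalised row residual standard deviation after a PCA model."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    A = n_components
    E = Z - (U[:, :A] * s[:A]) @ Vt[:A]          # X residuals
    N, K = Z.shape
    s_i = np.sqrt(np.sum(E ** 2, axis=1) / (K - A))              # per observation
    s_0 = np.sqrt(np.sum(E ** 2) / ((N - A - 1) * (K - A)))      # pooled
    return s_i / s_0


# Simulated data close to a two-dimensional plane, for illustration only
rng = np.random.default_rng(3)
latent = rng.normal(size=(40, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(40, 6))
d = dmodx(X_demo, n_components=2)
```

Observations with values well above the bulk of the others lie far from the model plane and deserve a closer look.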


Observation Risk Click on Analysis | Observation Risk

This plot displays the Observation Risk for every Y and for the pooled Y’s. Using the zoomer around observation 349 which has a large observation risk we get the following plot:

For the response FAR, observation 349 has a larger Y residual when it is left out of the training set than when it is included in the model; hence its prediction is uncertain (risky).


The following list displays the Y (Far) residual when observation 349 is and is not in the model.

Predictions

We can now use the model to predict the outcome of the process for the Test set observations. Click on Prediction | Specify Prediction set | Specify. Remove all observations from Observations in the Prediction set. In the left window select Workset Complement, click on Select All and use the arrow to move all the observations to the left window. Mark observation 208 and click on Remove to exclude it from the prediction set. Click on Apply and close this dialog.


Click on Predictions | Y Predicted | Scatter plot. The observed vs. predicted plot for PAR is displayed.

For PAR and FAR (select from Properties) we have excellent predictions; they are less good for the other responses. Also look at DModX (under the Predictions menu).

Summary This example shows that statistical design in the dominating process variables gives data with high quality that can be used to develop good predictive process models. With multivariate analysis we extract and display the information in the data.


NIR

Introduction

The following example originates from a research project on peat in Sweden. Peat is formed by an aerobic microbiological decomposition of plants followed by a slow anaerobic chemical degradation. Peat in Sweden (and in the northern hemisphere in general) is mainly formed from two types of plants, Sphagnum mosses and grasses of the Carex type. Within these main groups there is variation among the species, and depending on location, climate, etc., several other plants are involved in the peat-forming process. In the project, many different types of chemical analyses were performed to get detailed information about the material and to investigate differences among peat types. The chemical analyses were performed according to traditional methods (GC, HPLC, etc.), which were often laborious and time-consuming. To speed up the analysis of samples, Near Infrared Spectroscopy (NIR) together with multivariate calibration was introduced. This strategy was found to work very well, and after the calibration phase samples were analyzed in minutes instead of weeks. For this tutorial we selected a subset of samples that represents the typical variation of peat in Sweden.

Data

Variables

Variables 1-19 represent spectra from the NIR instrument, which in this case was a 19-channel filter instrument. Spectra are recorded as Log (Absorbance) and then scatter corrected by an MSC procedure. Variables 20-46 represent different chemical analyses, which the NIR spectra can be calibrated against.


Var. No.  Type  Name           Explanation
1-19      X     NIR            Log Absorbance
20        Y     Rhamnos        Monosaccharide
21        Y     Fucos          Monosaccharide
22        Y     Arabinos       Monosaccharide
23        Y     Xylos          Monosaccharide
24        Y     Mannos         Monosaccharide
25        Y     Galaktos       Monosaccharide
26        Y     Glukos         Monosaccharide
27        Y     Klason l       Klason lignin
28        Y     Bitumen        Bitumen
29        Y     Aspargin       Amino acid
30        Y     Threonin       Amino acid
31        Y     Serin          Amino acid
32        Y     Glutamin       Amino acid
33        Y     Prolin         Amino acid
34        Y     Glycin         Amino acid
35        Y     Alanin         Amino acid
36        Y     valin          Amino acid
37        Y     Methionin      Amino acid
38        Y     Isoleucin      Amino acid
39        Y     leucin         Amino acid
40        Y     Tyrosin        Amino acid
41        Y     Fenylalanin    Amino acid
42        Y     Histidin       Amino acid
43        Y     Lysin          Amino acid
44        Y     Aginin         Amino acid
45        Y     Glucose-amin   Amino sugar
46        Y     Galactos-amin  Amino sugar

Variable 27 (Klason l) is Klason lignin (the residue after hydrolysis) and variable 28 is Bitumen, which represents carbohydrates soluble in acetone.
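The NIR variables in this data set have already been scatter corrected, so no correction is needed in the tutorial. For readers unfamiliar with MSC (multiplicative scatter correction), a minimal sketch of the usual procedure is given below: each spectrum is regressed on a reference spectrum (here the mean spectrum, an assumption) and the fitted offset and slope are removed. The spectra are invented for the example.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # Fit x ~ a + b * ref by least squares (polyfit returns slope first)
        b, a = np.polyfit(ref, x, deg=1)
        corrected[i] = (x - a) / b      # remove offset and slope
    return corrected


# Three invented 19-channel "spectra" that differ only by offset and slope
base = np.linspace(0.2, 1.0, 19) ** 2
spectra = np.vstack([base, 2.0 * base + 1.0, 0.5 * base - 0.3])
corrected = msc(spectra)
```

Because the three example spectra differ only by an additive offset and a multiplicative factor, they become identical after correction; real spectra keep their chemical differences.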

Observations

From a large number of peat samples, 41 were selected, representing the main variation of peat in Sweden. The sample (observation) names are 20 characters long, and each position in the name carries certain information. In the plots, a sub-string of two characters (positions 6 and 7) is often used. Position 6 represents the degree of decomposition: L (low), M (medium) and H (high). Position 7 represents the peat type: S (Sphagnum) and C (Carex).
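The coding of positions 6 and 7 can be illustrated with a few lines of code. The sample name used below is hypothetical (the real observation names are in the data set), and the function is only a sketch of how such a sub-string could be decoded.

```python
def peat_label(name):
    """Decode positions 6 and 7 (1-based) of a 20-character sample name."""
    decomposition = {"L": "low", "M": "medium", "H": "high"}
    peat_type = {"S": "Sphagnum", "C": "Carex"}
    code = name[5:7]    # positions 6-7 in 1-based counting
    return code, decomposition.get(code[0], "unknown"), peat_type.get(code[1], "other")


# Hypothetical 20-character name: highly decomposed Sphagnum peat
example_name = "ABCDEHS0000000000000"
label = peat_label(example_name)
```

A sample whose position 7 is neither S nor C falls into the "other" group in this sketch.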

Objective The objective of this study is to model and predict different constituents of samples of peat directly from their NIR spectra. 41 samples of peat, mainly of two types Sphagnum and Carex, were subjected to NIR spectroscopy. The spectra were recorded at 19 wavelengths (19 filters) with a reflectance instrument (log(abs)) and scatter corrected before the analysis. For this objective, we will now develop a PLS model relating the X variables (NIR spectra) to the Y variables (peat constituent concentrations measured by traditional analysis).

Analysis Outline

• Make a PLS model relating the NIR spectra variables to the peat constituents, in order to understand and interpret the relationship between the spectra (X) and peat composition (Y variables).

• Develop a separate PLS model for each type of peat (Sphagnum and Carex), in order to: 1) increase the precision of the calibration; 2) be able to classify and predict peat types.

The steps to follow in SIMCA-P are:

• Define the project: import the primary data set.

• Prepare the data (Workset menu):

a) Specify which variables are process variables (X) and which are responses (Y).

b) Transform the variables. The responses are concentrations of the chemical constituents of peat and their variation is non-linear, so a Log transformation is warranted: log(Y + 0.1), where 0.1 is added to make sure that all values are positive before the transformation.

c) Group the observations in 2 classes by peat type, Sphagnum and Carex.

• Fit the model, a PLS of all the data (Analysis menu).

• Fit a PLS model for each of the peat types, Sphagnum and Carex.

• Use the PLS model for predictions and classification (Prediction menu).


Create the project

Start a new project. The data set name is NIRKHAM.XLS. Start SIMCA-P and create a new project from FILE | NEW. If you have SIMCA-P+, select normal SIMCA-P project. The import wizard opens.

The first two columns are correctly marked as observation numbers and names, and the first row as variable names. Mark the first row and click on Variable secondary Id’s to have a variable index. Mark the variables from Rhamnos to the end, and from the combo box select Y Variable.


SIMCA-P marks these variables as Y (response) variables. Click on Next to open the Project specification page. You can change, as desired, the destination folder or the project name. Click on Finish; the data set Nirkham is imported.

Prepare the data

Default Workset SIMCA-P’s default workset consists of all the observations in the primary data set with all variables, scaled to unit variance and defined as X's or Y’s as specified at import. This is the starting workset when you select Workset | New.

Transform the variables

Click on Workset | New and select the Transform tab. Mark all the Y variables, select Log with C1=1 and C2=0.1 (some of the concentrations are 0.0), and click on Set.
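The effect of the constants can be seen in a small numerical sketch. It assumes that the Log transformation has the form log(C1*y + C2) with a base-10 logarithm; the base is an assumption here, and the concentrations are invented.

```python
import numpy as np


def log_transform(y, c1=1.0, c2=0.1):
    """log10(c1 * y + c2); c2 > 0 keeps zero concentrations defined."""
    return np.log10(c1 * np.asarray(y, dtype=float) + c2)


# Invented concentrations, including a zero
conc = np.array([0.0, 0.9, 9.9])
transformed = log_transform(conc)
```

Without the constant C2, the zero concentration would give an undefined logarithm.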


Group observations in classes

Select the Observations tab and display the secondary ID's: right click on the Primary ID's and select Observation label.

To group the observations in 2 classes, Carex and Sphagnum, click on From Obs ID and select Obs Sec ID. Then select start position 7 for length 1.

The Carex samples are set to class 1 and the Sphagnum samples to class 2.

One observation, 21, is of neither Sphagnum nor Carex type and is therefore set to class 3. Mark it and set it to no class. Click on OK to exit the Workset window.

Analysis

When you exit the workset window, an unfitted model (M1) is created with model type PLS class (the default for a workset with X's, Y's and classes). Under Analysis, change the model type to PLS. You are ready to fit a PLS model.

PLS model of all the samples

Autofit

Change the model type to PLS and click on Analysis | Autofit. The model overview plot is updated as the model is fitted. This plot displays the cumulative R2Y and the cumulative Q2Y by component. R2Y is the fraction of the variation of Y (all the responses) explained by the model after each component, and Q2Y is the fraction of the variation of Y that can be predicted by the model according to the cross-validation. Values of R2Y(cum) and Q2Y(cum) close to 1.0 indicate an excellent model.

Double click on the Model Summary line to display the corresponding list.

Multivariate calibration with NIR spectra often leads to many components due to the high precision of the data. The present model is indeed excellent and explains 88.2% of the variation of Y, with a predictive ability (Q2) of 73.9%.

Summary: X/Y Overview

Click on Analysis | Summary | X/Y Overview | Plot and display the cumulative R2Y and Q2Y for every response. Use the Properties page to select variable labels and click on Save As Default Options to always have variable names. With the exception of Bitumen, all responses have an excellent R2 and Q2.


Scores t1 vs. t2 Click on Analysis | Scores | t1 vs. t2 plot. Use the Marker to mark the outlying observation, and then use the label button to label it. Observation 32 lies far away in the second component, indicating that sample 32 is different with respect to NIR spectra.

Comparing the spectra of observation 32 and 39 Mark both observations, right click and select Plot Xobs to display the spectra of these 2 observations.


Scores t1 vs. u1 We have a good relationship between the first summary of the X's (t1), and the first summary of the Y's (u1), with some spread in the data.

To display informative labels, select in properties Obs Sec ID, start in position 6 for length 2.


You can now distinguish two groups of observations, S Sphagnum peat and C Carex peat.

Scores u1 vs. u2

The projection of the samples in the Y space (traditional chemical analyses) does not show observation 32 as an outlier, unlike the t1 vs. t2 score plot. NIR spectroscopy can detect very small changes in chemical composition (ppm level), whereas the traditional analyses typically have large measurement errors (3-50%). With NIR spectroscopy one achieves better control of the samples.

Contribution plot To understand why sample 32 differs from the others, double click on observation 32 in the Scores t1 vs. t2.


This contribution plot displays the differences, in scaled units, for all the terms in the model, between the outlying observation 32 and the normal (or average) observation, weighted by w*1 and w*2 (the importance of the X-variables in components 1 and 2). In the plot we see some spectral variables close to 8 standard deviations, indicating some contamination in this sample. We shall remove sample 32.

Loadings w*c1 vs. w*c2 The w*'s are the weights that combine the original X variables (not their residuals in contrast to w) to form the scores t. In the first component w* is equal to w. The w*'s are related to the correlation between the X variables and the Y scores u. X variables with large values of w* (positive or negative) are highly correlated with u (and thereby Y). The c's are the weights used to combine the Y's (linearly) to form the scores u. The c's express the correlation between the Y's and the t's (X-scores).


This plot shows how the different chemical compounds correlate to the different parts of the NIR spectra. Plots displaying the loadings, one component at a time, may be more informative.

Loadings: Column plot w*c1

Click on Analysis | Loadings | Column plot w*c1. This plot shows the importance of different parts of the NIR spectra, in the first component, in explaining the variation among the constituents of the peat.

Excluding sample 32

Display the Score plot (t1 vs. t2), mark observation 32 and click on the red arrow. SIMCA-P excludes this sample from the workset and creates a new unfitted model with model type PLS class(1).


Separate PLS models for the Sphagnum and Carex

Autofit class models

Click on the fast button Autofit class models to fit both classes.

Sphagnum Model, class 2

Note that the autofitted model for class 2, the Sphagnum, has only one component, as the second component was not significant. When we nevertheless compute the second and third components, we find the third significant. We continue adding components until they are no longer significant. The model has 7 significant components.

The present model is excellent and explains 86% of the variation of Y, with a predictive ability (Q2) of 41%.


Summary: X/Y Overview Click on Analysis | Summary | X/Y Overview | Plot to display the cumulative R2Y and Q2Y for every response. All responses have excellent R2 and good Q2 values.

Scores t1 vs. u1 and t1 vs. t2 These plots do not show any outliers.

Model class 1 (Carex peat) The model with 7 components is not as good as the preceding one, and though it explains 84.4% of the variation of Y, it only has a predictive ability (Q2) of 12.6%.

This is mainly due to the fact that the Carex peat is not as rich in carbohydrates as the Sphagnum peat, and the variations in the chemical constituents are small.

Scores t1 vs. u1 and t1 vs. t2 These plots do not show any outliers.

Predictions We now have two good models describing the relation between NIR spectra and Chemical composition of peat and they can be used to classify peat samples as Sphagnum or Carex.


In this tutorial we do not have new peat samples. However, we will use the data set and classify every sample with respect to the two models. We first will want to remove sample 32.

Making a prediction set

By default the prediction set is all of the primary data set.

Coomans' Plot

Exclude sample 32 from the Prediction set (Predictions | Specify Prediction set | Remove observation 32 from prediction set) and display the Coomans' plot. This plot displays the distance to the model of every observation with respect to models M2 and M3, and shows a very good separation between the Sphagnum and the Carex peat samples.

Sample 21 is correctly classified as being neither a Sphagnum sample nor a Carex peat sample.
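The idea behind a plot of this type, the distance of every sample to each of two class models plotted against each other, can be sketched as follows. The sketch uses one-component PCA class models and a plain root-mean-square residual as the distance, both of which are simplifying assumptions (the tutorial's class models are PLS models and SIMCA-P normalizes the distances). The two classes are simulated.

```python
import numpy as np


def fit_class_model(X, n_components):
    """PCA class model: class mean and the first loading vectors."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def residual_distance(X, model):
    """Root-mean-square residual of each row after projection on the model."""
    mean, loadings = model
    Z = X - mean
    E = Z - Z @ loadings.T @ loadings
    return np.sqrt(np.mean(E ** 2, axis=1))


# Two simulated classes lying along different directions, for illustration only
rng = np.random.default_rng(4)
e1, e2 = np.eye(5)[0], np.eye(5)[1]
offset = np.array([0.0, 0.0, 3.0, 0.0, 0.0])
class_a = np.outer(rng.normal(size=15), e1) + 0.01 * rng.normal(size=(15, 5))
class_b = np.outer(rng.normal(size=15), e2) + offset + 0.01 * rng.normal(size=(15, 5))
model_a = fit_class_model(class_a, 1)
model_b = fit_class_model(class_b, 1)
samples = np.vstack([class_a, class_b])
# One point per sample: (distance to model A, distance to model B)
coomans_xy = np.column_stack([residual_distance(samples, model_a),
                              residual_distance(samples, model_b)])
```

Samples of a class lie close to their own model and far from the other, so the two classes end up in different regions of the plot; a sample far from both models belongs to neither class.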

Summary

As a tutorial, this provides just a brief introduction to the main functionality and plots in SIMCA-P. We recommend that you continue with your own data, or perhaps another tutorial, and then look in the Manual for details. The Help system contains the same information as the Manual, but organized in a different way.

Plots and Lists

You can display the results of SIMCA-P in numerous graphs and lists. From the Analysis and Prediction menus, results of the active model are available as quick plots and lists. With the Plot/List menu, you have access to the raw data and every computed value from every model. You can even plot coefficient vectors from different models against each other.


Hierarchical Models

Introduction This example illustrates the use of hierarchical multivariate modeling (PCA and PLS), using a small set of process data. Details of the process are not revealed for proprietary reasons, but a general outline is given below.

Data In this process, raw materials are combined and reacted to give a product with certain properties measured by 8 y-variables. Two of these, y6=impurity level, and y8=yield are the most important. The feed is described by 7 input X-variables (x1-x7), and 18 intermediate process variables from steps such as reaction (x8-x15) and purification (x16-x25) are also available. The data are collected hourly, and comprise 92 observations. The process functioned fairly well to around obs. 79, but then went out of control and was closed down at point 92.

Objective

The objective is to understand the relationship between the two most important y variables (y6 = impurity and y8 = yield) and the three steps of the process: feed (x1-x7), reactor (x8-x15), and purification and work-up (x16-x25). We shall do the following, using obs. 1-79 as a training set:

1. PLS model of X = feed (x1-x7) with y6 and y8 (Block 1)

2. PLS model of X = reactor (x8-x15) with y6 and y8 (Block 2)

3. PLS model of X = purification (x16-x25) with y6 and y8 (Block 3)

4. PCA model of the less important y's (y1 to y7, not including y6) (Block 4)

5. Top level hierarchical model with the scores of Blocks 1-3 as X and the scores of Block 4 plus y6 and y8 as Y.

The objective of Block Models 1 to 4 is to summarize the various steps of the process by scores, which are then used as variables in the top level model.
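The construction of the top level X matrix from block scores can be sketched as follows. For simplicity the sketch summarizes each block by PCA scores, whereas the tutorial uses PLS models for Blocks 1-3; the block sizes follow the description above, the number of components per block (four) is taken as an example, and the data are random numbers.

```python
import numpy as np


def block_scores(X, n_components):
    """Scores of a PCA model of one unit-variance scaled block."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


# Random stand-ins for the three process blocks (79 training observations)
rng = np.random.default_rng(5)
n_obs = 79
feed = rng.normal(size=(n_obs, 7))            # x1-x7
reactor = rng.normal(size=(n_obs, 8))         # x8-x15
purification = rng.normal(size=(n_obs, 10))   # x16-x25
# Top level X: the block scores placed side by side
top_X = np.hstack([block_scores(b, 4) for b in (feed, reactor, purification)])
```

The 25 original process variables are thus replaced by 12 score variables, four per block, which keeps the top level model small and easy to interpret block by block.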


Analysis Outline

The steps to follow in SIMCA-P are:

• Create the project by importing the data set.

• Generate and fit the three PLS models for X-blocks 1-3 (obs. 1-79), and mark them as base hierarchical.

• Generate and fit the PC model for block 4, and mark it as base hierarchical.

• Generate and fit the top level hierarchical model.

• Interpret the hierarchical model.

• Validate the hierarchical model with the test set (obs. 80-92).

Create the project

Start a new project. The data set name is proc1a.dif. Start SIMCA-P and create a new project from FILE | NEW. If you have SIMCA-P+, make sure to select Create a SIMCA-P normal project.

Holding the CTRL key, mark Y6+ and Y8+, click on X variables and select Y variables, to make these 2 variables Y’s.


Click on Command | Create Index | Variable and generate a variable index. Click on Next. Now you can change the project name and destination directory. Click on Finish and the project is imported.

Summarizing the feed

Workset Select Workset | New. In variable blocks, keep x1 to x7 as X's and y6 and y8 as Y's, exclude all other variables. Right click on variables name and mark variable secondary ID's to display variable number.

In Observations, exclude observations 80 to end, for a test set.

Click on OK to exit the workset.


Analysis

Autofit the model; the model window opens and is updated as the model fits. One component is significant. Take 3 more components, as the objective here is to summarize the X block (feed). Double click on the model title and call the model Feed. Double click on the Summary line of the model to display its details.

The model explains 65% of X; hence the scores of model M1 are a good summary of the feed.

Scores t1 vs. t2

Click on the fast button to display the scores t1 vs t2.

Observation 1 is an outlier. Double click on it with the Contribution tool and SIMCA-P displays the Contribution plot.


Position the cursor on x6in to show its value in the data set (-4.54). The trend plot of that variable (double click on it) clearly shows an abnormal value in observation 1. We shall exclude observation 1 using the interactive marker and the red arrow, and refit the model.

Fitting the model without observation 1

Model M2 is very similar to model M1 (make sure you take 4 components) and explains 64% of the variation of the feed.

Loadings p1 The loading p1 is the vector of weights that combine the original X variables to form the scores t1. The first dimension explains 32% of the X, i.e., the feed. You can think of t1 as a new variable, summarizing the feed and explaining 32% of their variation. To display p1, click on Analysis | Loadings | Column plot and p1.


In the first dimension, all the feed variables, with the exception of x4 and x5 are well summarized by t1.

Summarizing the reactor

Workset Prepare a workset with the reactor variables as X (variables 16 to 23), and y6+ and y8+ as Y's. Select only observations 2 to 79. Use the menus Workset | New as model M2, then select the variables in Variable Blocks. The included observations will be 2 to 79. The workset should look like this:



Analysis Click on Autofit and take 2 extra components.

With 4 components, model M3 is a good summary of the reactor, explaining 76% of the variation of X. Call the model reactor.

Scores t1 vs. t2 This plot shows no serious outliers.


Loadings p1 and p3 (the 2 most important components)

Summarizing the purification

Workset Prepare a workset with the purification variables as X (variables 24 to 33), and y6+ and y8+ as Y's. Select only observation 2 to 79. Use the menus Workset | New as model M2, then select the variables in Variable Blocks. The observations will be correct. The workset should look like this:


Click OK to exit the workset and Autofit the model. There are 4 significant components, explaining 65% of X.

The third component explains the largest part of X. Click on Analysis | Loadings | Column and select p3.


Summarizing the less important Y's

Workset In Workset | New, Variable Blocks, select as X variables, variables 8 to 12 and 14. Select only observation 2 to 79. The workset will look like this:

Exit the workset and Autofit the PC model. SIMCA-P will extract 0 components, as none are significant. Take 3 components.


Preparing for the hierarchical model Right click on model M2 and check hierarchical base Model Scores. The scores of model M2 are added to the workset as new variables.

Do the same for models M3 to M5. All these Models are marked B.

Workset of the top level model

Start with Workset | New | As model M2, so that observations 2-79 are selected. In Variable Blocks, start by selecting All and Exclude. Then select as X's all the scores from models M2, M3 and M4, and select as Y's y6+ and y8+ and all the scores of model M5.


(continued, upper window scrolled down)

Exit the workset.

Analysis Autofit the model. There are 4 significant components explaining 53% of the Y's.


The Summary | X/Y Overview, shows that the 2 most important variables y6+ and y8+ are well explained and predicted.

The score plot (t1 vs. t2) of the top level model


This plot is colored by the values of y6+, the side product, in model M6. To display the legend, use Plot Settings | Plot area from the pop-up menu. The process starts up to the right with high values of y6+, moves down to the left with lower values, and then is manipulated to give lower values of y6+ (upper left quadrant). The process then becomes unstable and moves back to the right.

The w*c plot

The important y variable (y6+), the side product, is on the right of the plot. Positively correlated to y6+ are the first component of the feed, and the second component of the reactor and purification. We also see that y6+ is negatively correlated to the first component of model M5 (a summary of the less important Y's). Y8+, the yield is negatively correlated to the first component of the purification. With the contribution tool, double click on any score variable point, and the corresponding loadings opens. This plot shows us the important original variables in that score. For example M3 t2 was positively correlated with y6+. In the loadings of M3 (reactor) second component, we see that variables 2, 3, and 8 are positively correlated with y6+, while variables 1, 4 and 7 are negatively correlated with y6+.


This gives us a zoom in, zoom out picture. In the w*c plot we understand the relationships between the 2 important y's and the sections of the process, i.e. the feed, reactor and purification. In the loadings plots we understand which variables in the feed, reactor and purification dominate, and how they relate to the y's. Click on the other score variables to display their loadings.

Coefficients Click on Analysis | Coefficients.

For y6+ the dominating variables are M2 (the feed) t1 (first component) and M3 (reactor) t2 and t4 (second and fourth components). Use the contribution tool and double click on any of these variables to open the corresponding loading plot. For example, double click on M3 (reactor) t4.


Again we see here the importance of variable 3 as being positively correlated with y6+.

Variable Importance (VIP)

The most important variables, for all the y's, are t2 and t1 of the reactor, and t1 and t2 of the purification. You can use the contribution tool to display the corresponding loadings.


Observed vs. Predicted

Observation 41 is an outlier, and has a large residual. Using the contribution tool, double click on observation 41.

t4 of purification is the culprit variable. Double click on it, to display the contribution with the original variables. This plot points to variable 6 in the purification as being much too low.


The time series plot of this variable (double click on it) in the data set, shows an abnormal value for observation 41.

Predictions Make sure Model M6 (the top level) is the active model. To specify the prediction set click on Predictions | Specify Predictions set | Specify and remove observation 1 (outlier). Remember that obs. 2-79 actually comprise the training set. They are still included below for comparison in the plots.


DModXPS

From observation 80 on the process becomes unstable with DModXPS quickly increasing. The contribution plot for observation 91 (large DModXPS) shows the feed (t2, t3 and t4) as being the problem.


With the Contribution tool, double click on the feed; the contributions in all 3 components point to variable 6.

The trend plot confirms the problem with this variable.


Scores tPS1 vs. tPS2 colored by test set and training set

The model was based on normal operation up to observation 79. The predicted scores from observation 80 on, colored red, clearly show that the process is going out of control.

Cusum Chart Click on Predictions | Control charts | Cusum and select subgroup 1.

A contribution plot around observation 76, shows both the feed and purification were related to the problem.


Double clicking on both the feed and the purification shows the culprit variables.


Conclusion The hierarchical approach to multivariate analysis greatly enhances our ability to understand complex problems. The zoom-in, zoom-out capability allows us first to understand complex relationships in terms of the components of a process, and then to zoom in on a single component to resolve the details in terms of the process variables.


Spectral Filtering and Compression, including OPLS

Introduction This example illustrates the use of spectral filtering and wavelet compression with multivariate calibration. The recently added OPLS approach (Orthogonal PLS) is also demonstrated. The data set of this example was collected at Akzo Nobel, Örnsköldsvik, in Sweden. The raw material for their cellulose derivative process is delivered to the factory in the form of cellulose sheets. Before entering the process the cellulose sheets are controlled by a viscosity measurement, which functions as a steering parameter for that particular batch. In this data set NIR spectra for 180 cellulose sheets were collected after the sheets had been sent through a grinding process. Hence the NIR spectra were measured on the cellulose raw material in powder form. For calculation of a calibration model, 160 sample spectra were used. A selection of 20 spectra was used for model validation.

Data The data consists of: X: 1201 wavelengths in the VIS-NIR region Y: Viscosity of cellulose powder.

Objective The objective of this study is to develop a good calibration model with the 160 samples and validate this model with the test set of 20 samples. We will use orthogonal signal correction (OSC) to improve the calibration model, and we will compress the X matrix, with orthogonal wavelets, for efficiency and fast computation. The results of the model after OSC and wavelets compression will be compared to the results of the model with the original data. Finally OPLS will be run on the same data.


Analysis Outline
• Make a PLS model relating the NIR spectra variables to the viscosity with the original data.
• Review and validate the calibration model with the test samples.
• Apply OSC and wavelet compression to the X matrix.
• Make a PLS model with the OSC and wavelet compressed data.
• Review and validate this model and compare the results to the calibration model made with the original data.
• Finally, OPLS is run and the results are compared to the previous analyses.

The steps to follow in SIMCA-P are:
• Define the project: Import the primary data set.
• Prepare the data (Workset menu).
  a) Specify which variables are process variables (X) and which are responses (Y).
  b) Exclude 20 specific samples from the training set for a test set.
• Fit the calibration model, and review the fit (Analysis menu).
• Validate the model with the test set (Prediction menu).
• Use Spectral Filter, OSC, followed by wavelet compression (Dataset menu).
• Prepare the data (Workset menu).
• Fit a PLS model on the spectral filtered data (Analysis menu).
• Validate this model with the test set and compare the results (Prediction menu).
• Return to the first project (original data) and change to OPLS (Analysis / Change model type), fit the model with Autofit, predict (use the same prediction set), and compare.

Create the project Start a new project with the data set Malyx.mat. Start SIMCA-P and define a new project from FILE: NEW. The file is a Matlab file. Note that there are no variable names and no observation numbers or names.


The import wizard opens. If you are using SIMCA-P+, make sure to select a normal SIMCA-P project. Set the first column as Y: in the top cell of the first column, click on the ↓ arrow and select Y Variable (Viscosity) from the combo box.

Click on Next to open the Project specification page. You can change, as desired, the destination folder or the project name. Click on Finish; the data set (we named it Malyx) is imported.

Plotting the Spectra With the dataset open and active, right click and select Plot Xobs to plot the spectra.


All spectra are plotted together:

[Plot: MALYX.DS1, spectra of all observations vs. Num (variable number, 0 to 1200)]

Prepare the Data

Workset SIMCA-P's default workset consists of all the observations in the primary data set with all variables, scaled to unit variance and defined as X's or Y's as specified at import. This is the starting workset when you select Workset | New.

Workset | New The Workset window opens with the variable names, the variable block X or Y, scaling (default UV), and the observations numbers.


Change the scaling of the X variables to CTR (centered only) Click on Scale, mark variables 2 to 1202, select Ctr under Base and click on Set. The X variables are now only centered, not scaled.
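The difference between the two scaling options can be illustrated outside SIMCA-P. The following is a minimal sketch in Python, not SIMCA-P code; the function names and the absorbance values are made up for illustration. Ctr only subtracts the column mean, whereas UV also divides by the standard deviation.

```python
from statistics import mean, stdev

def center(column):
    """Ctr: subtract the column mean; the spread is left unchanged."""
    m = mean(column)
    return [v - m for v in column]

def unit_variance(column):
    """UV: center, then divide by the sample standard deviation."""
    s = stdev(column)
    return [v / s for v in center(column)]

# Made-up absorbance values for one wavelength (illustration only)
absorbance = [0.42, 0.47, 0.51, 0.44]
ctr = center(absorbance)
uv = unit_variance(absorbance)
print(ctr)
print(uv)
```

With spectra, centering alone keeps the relative size of the absorbance bands, which is a common reason for preferring Ctr over UV for this kind of data.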

Exclude observations for the Training set Click on Observations and mark the following 20 observations: 4-5, 18-20, 30-34, 100-104, and 130-134 and click on Exclude.

All these observations are now excluded from the training set. Click on OK and exit the workset menu.
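As a quick check of the split, the observation ranges listed above can be expanded and counted. This is a small Python sketch, independent of SIMCA-P; the helper name `expand_ranges` is hypothetical.

```python
def expand_ranges(spec):
    """Expand a string such as '4-5, 18-20' into a sorted list of integers."""
    numbers = []
    for part in spec.split(","):
        first, last = (int(p) for p in part.strip().split("-"))
        numbers.extend(range(first, last + 1))
    return sorted(numbers)

# The 20 test set observations named in the tutorial
test_set = expand_ranges("4-5, 18-20, 30-34, 100-104, 130-134")
# The remaining observations out of 180 form the training set
training_set = [i for i in range(1, 181) if i not in test_set]
print(len(test_set), len(training_set))  # 20 160
```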

Analysis When you exit the workset window, an unfitted model (M1) is created with model type PLS (The default for a workset with both X's and Y's). You are ready to fit a PLS model.

PLS model Autofit Click on Analysis | Autofit, or use the fast button.


The model overview plot updates as the model is fitted.

[Plot: MALYX.M1 (PLS), R2Y(cum) and Q2(cum) for components 1 to 7]

Double click in the project window on the model summary line to display the details by component.

R2Y(cum), the fraction of the variation of Y explained by the model after 7 components, is equal to 0.756, and Q2(cum), the fraction of the variation of Y that can be predicted by the model according to the cross-validation, is equal to 0.686. Values of R2Y(cum) and Q2(cum) close to 1.0 indicate an excellent model. For a calibration model, M1 is rather poor.
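To make the two statistics concrete, here is a minimal Python sketch of the general idea: R2 compares the residual sum of squares of the fitted model with the total variation of Y, while Q2 does the same with residuals obtained by cross-validation. The sketch assumes a single-predictor least-squares line and leave-one-out cross-validation on made-up data; SIMCA-P fits PLS components and uses its own cross-validation grouping, so this is an illustration only.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2_and_q2(x, y):
    """Return (R2, Q2): fitted and leave-one-out cross-validated fractions."""
    n = len(y)
    my = sum(y) / n
    ss_total = sum((b - my) ** 2 for b in y)
    slope, intercept = fit_line(x, y)
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    press = 0.0
    for i in range(n):
        # Refit without observation i, then predict it
        s, c = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (s * x[i] + c)) ** 2
    return 1 - rss / ss_total, 1 - press / ss_total

# Made-up data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
r2, q2 = r2_and_q2(x, y)
print(round(r2, 3), round(q2, 3))
```

Q2 is never larger than R2 in this setting, and a large gap between the two suggests a model that fits the training data better than it predicts.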

Scores t1 vs. u1 Click on Scores: t1 vs. u1 to display the t1 vs. u1 plot. The relationship between t1 and u1 is not very good in particular for the cluster of samples 162-165 etc.


[Plot: MALYX.M1 (PLS), t[1] vs. u[1], points labeled with observation numbers; R2X[1] = 0.280355; regression line y = 1*x - 1.063e-008, R2 = 0.4213]

Plotting the Spectra of selected observations Hold down the CTRL key and mark the cluster of observations around observation 168 (lower left) and the cluster around observation 27 (upper right), then right click and select Plot Xobs to plot the spectra in original units.

Zooming in on the plot, one clearly sees a separation between the two groups of spectra.


Loadings Click on Loadings | Line plot

Remove the series w*c1, select under Items w* and all components (*) and click on Add Series, then on OK.


[Plot: MALYX.M1 (PLS), w* for components 1 to 7 vs. Num; R2X[1] = 0.280355, R2X[2] = 0.66347, R2X[3] = 0.0369157, R2X[4] = 0.00567819, R2X[5] = 0.00847137, R2X[6] = 0.0010421, R2X[7] = 0.0013327]

Components 1 to 3 capture almost 60% of the variation of Y. The other components are small corrections. To display the first three loadings, open the properties page, mark series 4 to 7 and click on Remove and Apply.

[Plot: MALYX.M1 (PLS), w*[1], w*[2] and w*[3] vs. Num; R2X[1] = 0.280355, R2X[2] = 0.66347, R2X[3] = 0.0369157]

The regions around 200, 400, 700 to 800, and 900 capture most of the information.

Distance to the Model (DmodX)

[Plot: MALYX.M1 (PLS), DModX[7] (normalized) vs. Num; D-Crit(0.05) = 1.163; 1 - R2X(cum)[7] = 0.002735]


Several samples have a distance to the model larger than the critical distance, indicating data inhomogeneity.

Observed vs. Predicted The predictions are poor particularly for a cluster of samples as in the t1 vs. u1 plot. They can be labeled by marking them and then clicking on the selected item fast button, and selecting labels as primary obs label.

[Plot: MALYX.M1 (PLS), observed (YVar) vs. predicted (YPred[7]) for Var_1; RMSEE = 139.821; regression line y = 1*x - 7.256e-006, R2 = 0.7563; labeled points: 136, 153, 162, 163, 164, 168]

A zoom in on these points shows their labels.

Validating the Model 1 Click on Predictions | Specify Prediction set | Complement Work-set. The prediction list opens. Click on Predictions | Y Predicted | Scatter plot.


[Plot: MALYX.M1 (PLS), prediction set (complement of the workset), observed (YVarPS) vs. predicted (YPredPS[7]) for Var_1; RMSEP = 110.255; points labeled with the 20 test set observation numbers]

The predictions are reasonable, with an RMSEP of 110 compared to the training set RMSEE of 140.
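The two error measures compared here can be sketched as follows. This Python example assumes the commonly used definitions: RMSEE divides the residual sum of squares of the training set by N - 1 - A (A being the number of components), and RMSEP divides that of the prediction set by N. The degrees-of-freedom correction is an assumption about how the figures are computed, and the numbers below are made up.

```python
import math

def rmsee(observed, fitted, n_components):
    """Root mean square error of estimation (training set).

    Assumes a degrees-of-freedom correction of N - 1 - A.
    """
    n = len(observed)
    ss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return math.sqrt(ss / (n - 1 - n_components))

def rmsep(observed, predicted):
    """Root mean square error of prediction (test set)."""
    n = len(observed)
    ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(ss / n)

# Made-up viscosity values (illustration only)
obs = [1000.0, 1200.0, 1500.0, 1700.0]
pred = [1100.0, 1150.0, 1450.0, 1800.0]
print(round(rmsep(obs, pred), 2))
print(round(rmsee(obs, pred, n_components=1), 2))
```

Because of the different denominators, and because the two sets contain different samples, an RMSEP lower than the RMSEE is possible, as in this example.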

Orthogonal Signal Correction and Wavelets Compression This rather poor model may indicate systematic variation in the X block that is not related to the response Y. This is corroborated by the similarities between w*2 and w*3 (see loading plot above). We will apply Orthogonal Signal Correction (OSC) to the X block (the NIR data) to remove the systematic variation in X not related to Y and then for speed and efficiency we will wavelets compress the X block. Click on Dataset | Spectral Filters | Combination | OSC-Wavelet

The first variable is marked as Y. Exclude the test set observations 4-5, 18-20, 30-34, 100-104 and 130-134, and click on Next.


SIMCA starts OSC and extracts one component; click on Next to extract the second component, as two components are usually recommended. The angle of both components was 90 degrees, indicating orthogonality, and the remaining Sum of Squares after the second component is 13%. Hence, 87% of the variation in X was not related to Y and was removed from X. Click on Next to perform the wavelet compression.

The wavelet window opens. Select the Daubechies 10 wavelet. Select Variance as compression method and DWT (Discrete Wavelet Transform), as NIR signals are smooth and DWT is recommended for low frequency signals. Click on Next. The wavelet transform is performed, and SIMCA displays a plot of the percentage of variance explained by the largest coefficients.

We choose to keep 50 coefficients (enter 50 in the box); these 50 coefficients explain 99.93% of the variation of the X matrix. Click on Next.
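The idea of keeping only the wavelet coefficients with the largest variance can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not SIMCA-P's algorithm: it uses the simple Haar wavelet instead of Daubechies 10, tiny made-up spectra of 8 points, and hypothetical function names.

```python
import math

def haar_dwt(signal):
    """Full orthonormal Haar transform; the length must be a power of two."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        coeffs = detail + coeffs
    return approx + coeffs

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def select_by_variance(spectra, n_keep):
    """Transform each spectrum, then keep the n_keep coefficient
    positions with the largest variance across the spectra."""
    transformed = [haar_dwt(s) for s in spectra]
    n_coeffs = len(transformed[0])
    var = [variance([row[j] for row in transformed]) for j in range(n_coeffs)]
    ranked = sorted(range(n_coeffs), key=lambda j: var[j], reverse=True)
    keep = sorted(ranked[:n_keep])
    explained = sum(var[j] for j in keep) / sum(var)
    compressed = [[row[j] for j in keep] for row in transformed]
    return compressed, keep, explained

# Four made-up smooth "spectra" of 8 points each (illustration only)
spectra = [
    [0.30, 0.32, 0.35, 0.40, 0.46, 0.50, 0.52, 0.53],
    [0.31, 0.34, 0.38, 0.44, 0.51, 0.56, 0.58, 0.59],
    [0.29, 0.30, 0.32, 0.36, 0.41, 0.44, 0.46, 0.47],
    [0.33, 0.36, 0.41, 0.48, 0.56, 0.61, 0.64, 0.65],
]
compressed, kept, explained = select_by_variance(spectra, n_keep=2)
print(kept, round(explained, 4))
```

For smooth signals most of the variance ends up in a few coefficients, which is why a small number of them can describe nearly all the variation of the X matrix.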


SIMCA-P creates a new project with the OSC and wavelet compressed data. You can change the default name of the project, and select a different destination directory. The test set (excluded observations) is automatically signal corrected and wavelet compressed, in the same way as the training set, and made into a prediction set. You can change the default name of the prediction set and click on Finish.

You are switched to the new project.

Model with the Signal corrected and compressed data

Summary of the preprocessed project Click on Dataset | Filter Summary to display a summary of the preprocessing done on the project.


Change the default Scaling Click on Workset | Edit model 1, select the Scale tab and mark all the X variables 1 to 50 and change the scaling to Ctr, then exit the workset window. Fit the PLS Model Click on Analysis | Autofit

The first component explains very little of the variation, the second component is highly significant. Together the two components explain 94% of Y, cross-validated to 93%. This is an excellent model.

Scores t2 vs. u2 Display the t2 vs. u2 plot (t1 vs. u1 explained only 11%). This is now a good relationship.


Loading plot w*2

[Plot: MALYX_OSCW.M1 (PLS), w*[2] vs. Num; R2X[2] = 0.0786525]

This plot is reconstructed, by default, from the wavelet domain to the original domain. It shows again that the information in the spectra is located around 400, 700 to 800, and 900, as in the model with the original data.

Observed vs. Predicted


The observed vs. Predicted plot is greatly improved from the previous model based on the unfiltered data.

Validating the Model 2 Click on Predictions | Specify Prediction set | Data set and select your prediction set. Click on Predictions | Y Predicted | Scatter

The predictions for the test set have greatly improved with the OSC treated data. The RMSEP is now 87.


Conclusion OSC-Wavelets This example illustrates how Orthogonal signal correction (OSC) sometimes greatly improves the calibration model when the signal contains large systematic variation not related to Y, such as baseline shifts. Wavelet compression efficiently compresses the signal from 1201 variables to 50 coefficients with very little loss of information.

OPLS (Orthogonal PLS) Return to original project (Malyx), click on Analysis / Change Model Type, and select OPLS

Autofit gives 8 components compared with 7 for PLS:

[Plot: MALYX.M2 (OPLS), R2Y(cum) and Q2(cum) for components 1 to 8; component 1 is marked P, components 2 to 8 are marked O]

Return to the original PLS model and add one component (Analysis / Next component, or the corresponding fast button). Note that the PLS and OPLS models now have the same R2Y and R2X, but the OPLS model shows a higher Q2(cum). The model window now shows the first line as the single Y-related component, and the following lines as the Y-orthogonal ones. The bottom line shows a summary of the model after all components.

Scores u1 vs. t1 The t/u plot now looks much better than the PLS one, because OPLS has rotated the solution to put all Y-related variation into the first component.

Loadings, w1 The first component's PLS weights now look like the spectrum of the ingredient related to y. This is one of the greatest advantages of OPLS: it makes the loadings interpretable.


Predictions The prediction set remains the same unless you have changed something in between. In that case, restore the prediction set to what it was, and continue. Under Predictions / Ypred / Scatter plot, the following is obtained:

The RMSEP (prediction SD) is now 111.3, precisely the same as the original PLS model. This shows that OPLS usually predicts as well as the corresponding PLS model.

Conclusions OSC, Wavelets, and OPLS are tools with additional features beyond ordinary PLS that make them useful. OPLS makes the PLS model easier to interpret: only one Y-related component, and an interpretable loading plot. Wavelets compress the spectra with little loss of information and sometimes, especially in combination with OSC (OSC-Wavelets), even improve the predictions somewhat.


Batch Modelling with SIMCA-P+

Introduction The following example is taken from the article: J.MacGregor and P.Nomikos, “Multivariate SPC Charts for Monitoring Batch Processes”, Technometrics Vol. 37 No. 1 (1995) 41-57 The duration of a batch was 2 hours. During this period, 10 variables were measured every 1.2 minutes, for a total of 100 measurements. A quality variable was measured at the completion of every batch. Data were collected on 55 batches. Batches 40 to 42 and 50 to 55 had their quality variable outside the specification limits. The quality variable of batches 38, 45, 46 and 49 was on the boundary.

Data Variables The following 10 variables were measured at equally spaced intervals during the evolution of a batch.
x1 to x3: Temperature inside the reactor
x6 and x7: Temperature inside the heating-cooling medium
x4, x8 and x9: Pressure variables
x5 and x10: Flow rates of material added to the reactor

Objectives
1. Develop a model of the evolution of good batches (the observation level model), and use it to monitor new batches as they are evolving, in order to detect problems as early as possible.
2. Make a model of the whole batch based on the scores of the observation level model, and use this model to classify the new batches as good or bad ones.


Analysis Outline We will use 18 good batches (1800 observations) to model the evolution of “good batches”. This is done by fitting a PLS model relating Y, the relative batch time, to the 10 measured variables. This observation level model is used to monitor the evolution of the new batches: batches 30 to 33 (good batches) and 49 to 55 (bad batches). We will make a PCA model of the whole batch, with the unfolded scores of the observation level as X-variables.

The steps in SIMCA-P are:
• Create the observation level project, import the primary data set with the 18 good batches.
• Fit the observation level model, a PLS with Y, the relative batch time, and X, the 10 measured variables (Analysis menu).
• Display the control charts of the training set (Analysis | Batch | Control Charts menu).
• Import the secondary data set with the new batches.
• Monitor the evolution of the new batches (Prediction | Batch | Control Charts menu) and use contribution plots to interpret the problems seen.
• Create the whole Batch project and fit a PCA model to the data.
• Classify the new batches as good or bad using the distance to the model (DmodX) and use contribution plots to interpret the results.

Create the observation level project Start SIMCA-P and create a new project from FILE | NEW. The data set for this example is NOM18a.xls.

The import wizard opens. Select the radio button SIMCA-P Batch project and click on Next.


The second column labelled observation names contains the batch identifiers. Both the Batch identifiers and the phase identifiers (when present) can be located in any variable (column) in the spreadsheet. Mark this second column and from the combo box (top of column) select batch identifiers. In this example you do not need to define phase identifiers, as the batch process has only one phase.

The following window opens:


Click OK and Next.

The Batch page displays the list of batches in the dataset with the number of observations in each batch. The Conditional delete allows you to delete batches with fewer observations than a selected number.

In this example we do not use the Conditional Delete. Click on Next to display the project specification page and then click on Finish.


The following message is displayed:

Click on OK.

Analysis The workset M1 has been prepared with all the 10 measured variables specified as X’s and the auto generated variable $Time (relative batch time normalized) specified as Y, and all variables scaled and centered to unit variance (UV). You are ready to fit the PLS Batch model. Click on Autofit.


SIMCA-P takes only 2 components as they explain 85% of X and the third component explains less than 7%. The Model window summarises the fit of the model per component. We have an excellent model with 2 components, explaining 87% of X and 98% of Y.

Scores Line plot of t1 Click on Scores | Line Plot | t1 to display the first summary variable t1, summarizing all the 10 variables.


All the 18 batches are within the 2 standard deviation limit.

Loadings p1 Click on Analysis | Loadings | Column | p1.

With batches we are interested in summarizing the X variables and the loadings p1 are the weights that indicate the importance of the original X’s for t1. We can see here that all the variables participate in forming t1 with the first 3 variables having positive weights while the others have negative weights.

Batch Control charts (Training set) Analysis |Batch |Control Charts | Scores The Batch Control charts show how t1 and t2 vary with time, for good batches. A new good batch should evolve in the same way and its trace should be inside the control limits.
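The principle behind such a chart can be sketched in Python: at each time point, the scores of the good batches give an average trace and a band around it, and a new batch is flagged wherever its trace leaves the band. The limits of plus or minus 3 standard deviations, the function names and the score values are assumptions made for this illustration; SIMCA-P computes its own limits after aligning the batches.

```python
from statistics import mean, stdev

def control_limits(score_traces, n_sigma=3.0):
    """score_traces: one list of scores per good batch, all of equal length.
    Returns the (average, lower, upper) traces over time."""
    average, lower, upper = [], [], []
    for values in zip(*score_traces):  # all batches at one time point
        m, s = mean(values), stdev(values)
        average.append(m)
        lower.append(m - n_sigma * s)
        upper.append(m + n_sigma * s)
    return average, lower, upper

def out_of_control(trace, lower, upper):
    """Indices of the time points where the trace is outside the limits."""
    return [i for i, (v, lo_i, hi_i) in enumerate(zip(trace, lower, upper))
            if v < lo_i or v > hi_i]

# Made-up t1 traces for four good batches at five time points
good = [
    [-2.0, -1.0, 0.1, 1.0, 2.1],
    [-2.1, -0.9, 0.0, 1.1, 2.0],
    [-1.9, -1.1, -0.1, 0.9, 1.9],
    [-2.0, -1.0, 0.0, 1.0, 2.0],
]
avg, lo, hi = control_limits(good)
new_batch = [-3.5, -1.0, 0.0, 1.0, 2.0]
flagged = out_of_control(new_batch, lo, hi)
print(flagged)  # [0]
```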


Use the side arrows to move the stack of displayed batches forward or backward by one batch. You can also use the property bar.

Properties page Use the up arrow to display the control chart of t2.

To display the Control chart in Normalized units, from the Limits and Averages tab, select Remove the average and normalize the values


and click on Apply.

The plot is displayed in normalized units.

Batch Control Charts DModX, Variables, Hotelling T2 and Observed vs. predicted The plots of the distance to the model (DmodX), Hotelling T2, and Observed vs. Predicted time, with their control limits, are also important monitoring charts for new batches. Display univariate Batch Control charts when needed.

Monitoring new batches

Import the secondary data set with the new batches Use the menu File | Import Secondary dataset, and import the file Alpred.xls as a secondary data set.


Mark the 2nd column as the Batch ID’s.

Creating a prediction set with the new batches

Click on Predictions | Specify Predictionset | Dataset | Alpred to select the alpred prediction set.

Control Charts for new batches Predictions | Batch Control Charts | Scores


Click on Predictions |Batch |Control charts | Scores to display the new batches in the control charts with the control limits derived from the training set. Use the Properties page to include batches 50 to 55. Use the Component tab to display the Control chart of t2.

In both of these control charts, batches 50 to 55 are out of the control limits in the first time period (0 - 15). Batches 50 - 55 are also out of the control limits in t1, for the last time period (90 to 100) of the polymerisation process.

Contribution plot Using the Contribution tool, double click in the t1 control chart on one of the outlying batches, 50 for example, at time point 4.

The Contribution plot clearly displays variable V-4 (pressure) as being lower than the average trace.


Control Chart of batch 49 and Contribution plot

Batch 49 is slightly out of the control limits around time period 55- 60. The Contribution plot around time point 59 shows variable V-10 slightly lower than average good batches.


Prediction | Batch Control Charts | DModX

Batches 50 to 55 are clearly out of the control limit for the time period 0-20.

Contribution plot The Contribution plot for any of these batches in that time period shows again variable V4 (pressure) as being lower than in good batches.

The control chart of variable 4 (pressure), opened by double clicking on the variable, clearly shows the problem with the pressure for these batches.


Creating and Modelling the batch level project Select the menu File | Batch |Create batch level project, mark scores, and the check box Bring secondary dataset to the batch level. In the batch level project, each row has the data from one batch and consists of the unfolded scores, from the observation level model, which describe the evolution of each batch. This example has no initial conditions.
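The unfolding described above can be sketched in Python: the scores of each batch, stored per time point, are laid out side by side so that every batch becomes a single row. The data layout, the column naming and the function name are hypothetical and only illustrate the principle; SIMCA-P performs this step itself.

```python
def unfold_scores(scores_by_batch):
    """scores_by_batch: {batch_id: [[t1, t2, ...] for each time point]}.
    Returns (header, rows) with one row of unfolded scores per batch."""
    first = next(iter(scores_by_batch.values()))
    n_components = len(first[0])
    n_times = len(first)
    # Hypothetical column names: component number, then time point
    header = [f"t{c + 1}:{k + 1}"
              for c in range(n_components) for k in range(n_times)]
    rows = {}
    for batch_id, trace in scores_by_batch.items():
        rows[batch_id] = [point[c]
                          for c in range(n_components) for point in trace]
    return header, rows

# Made-up scores: 2 batches, 3 time points, 2 components
scores = {
    "B1": [[0.1, 1.0], [0.2, 1.1], [0.3, 1.2]],
    "B2": [[0.0, 0.9], [0.2, 1.0], [0.4, 1.3]],
}
header, rows = unfold_scores(scores)
print(header)
print(rows["B1"])  # [0.1, 0.2, 0.3, 1.0, 1.1, 1.2]
```

With 2 components and on the order of 100 time points per batch, each batch level row would hold roughly 200 score values.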

Analysis: Autofit Click on Analysis | Autofit to fit a PCA model. SIMCA extracts 4 components.

Analysis: Scores Click on Analysis | Scores | t1 vs t2


The 18 good batches span the space with no outliers.

Analysis |Batch Control Charts | Batch Variable Importance

This plot, by combining the importance of the scores in the batch level model, with the weights w* derived from the observation level model, displays the overall importance of the measured variables in the whole batch model. Here we see that all the 10 variables are important (this is to be expected as the 10 measured variables are highly correlated).

Predicting the quality of the new batches In the menu Predictions | Specify Predictionset | Dataset select the data set alpred as a prediction set. It contains the data for batches 1, 30-33 and 49 to 55, one observation per batch, and the predicted scores of the observation level as x’s.


Predictions: T Predicted

We clearly see that batches 50 to 55 (with the exception of 52) are outside the Hotelling T2 ellipse and are outliers in the second dimension.

Predictions: Contribution Scores for batch 51 Using the Contribution tool double click on batch 51.

Double click on the t2-M1:4 and the score variable is resolved with respect to original variables and displays variable 4 (pressure) as the problem variable.


Predictions: Distance to the Model (DmodX)

Batches 50 to 55 have their distance to the model way above the control limit, and batch 49 is also above the control limit. Clearly these batches are different than the good ones.

Prediction: Contribution | Distance to the model Using the Contribution tool, double click on batch 50.

Double click on the score t2-M1:3; the score variable is resolved with respect to the original variables and displays variable 4 (pressure) as the problem variable.

Conclusion Modelling the evolution of a representative set of good batches allowed us to construct control charts to monitor new batches during their evolution. We detected problems in the evolution of the bad batches and understood why these batches were outside the control limits. The model of the whole batch has allowed us to classify the new batches as good or bad and understand why these batches had an inferior quality.


Modelling of a Batch Digester

Introduction The following example is derived from a batch digester. Batch digesters are used in the pulp and paper industry to produce pulp from wood chips. The batch process has 5 phases: chip, acid, cook, blowback and blow. In the chip phase, the wood chips are fed into the digester and steamed. In the acid phase, the chips are impregnated with an acid. They are then cooked at high temperature and pressure during the cook phase. This is the most important phase, as this is where the delignification takes place. In the blowback phase, the pressure is released and thereby brought back to atmospheric pressure. The temperatures also drop. Finally, in the blow phase, the pulp is blown out of the digester. The duration of a batch varies between 8 and 10 hours and is, on average, around 9.4 hours in the present data set. 27 variables (including the sampling time) were measured every 2 minutes during the batch evolution. Different variables are meaningful in the different phases. Data were collected on 52 batches. Of these, thirty good batches are used to build the training set model.

Data Variables The following variables are meaningful in the following phases:

Chip and Acid phase: State of the acid (2 variables), State of the vent (2 variables), State of Steam1 (2 variables), State of Steam2 (2 variables), Temperature4, Pressure2

Cook phase: Pressure1, Steam, Temperature1, Temperature2, Temperature3, Temperature4, Temperature5, Pressure2, Temperature6, Pump

Blowback phase: Pressure1, Temperature2, Temperature3, Temperature4, Temperature5, Relief valve, Blow1, Blow2, Pressure3, Pressure4, State of Dilution (2 variables), Dilution flow

Objectives
1. To develop a model of the evolution of good batches (the observation level model), and use the model to monitor other batches as they are evolving, in order to detect problems as early as possible.
2. Make a model of the whole batch based on the scores of the observation level model, and use this model to classify other batches as good or bad.

Analysis Outline We will use 30 good batches to develop the model of the evolution of “good batches”. In the analysis, we will combine the chip and acid phase (they are not meaningful alone) and delete the blow phase which has no effect on the quality of the pulp.


We will fit 3 different PLS models relating Y, the relative batch time, to the measured variables in the 3 relevant phases (chip+acid, cook and blowback). These observation level models are used to monitor the evolution of the other batches, in this example those left out of the training set. We will make a PCA model of the good batches at the batch level, with the unfolded scores of the observation level as X-variables.

The steps in SIMCA-P are:

• Create the observation level project, import the primary data set with the 52 batches, merge phases chip and acid and delete the blow phase.

• In menu workset, select 30 specified good batches and select the variables relevant in each phase.

• Fit the observation level models, one for each phase, by PLS with Y= relative batch time, and X = the relevant variables in each phase. (Analysis menu).

• Interpret the scores of the cook phase, and display the control charts of the training set. (Analysis | Batch | Control Charts menu)

• Select the complement of workset (training set) and save it as a secondary data set (Menu Prediction/Prediction Set)

• Monitor the evolution of the batches left out of the training set (Prediction | Batch | Control Charts menu) and use contribution plots to interpret the problems with some of the batches.

• Create the whole Batch project and fit a PCA model to the data. (Menu File/Create Batch Level Project)

• Classify the prediction set batches as good or bad using the distance to the model (DModX), and use contribution plots to interpret the results (Menu Prediction).

Create the observation level project Start SIMCA-P and create a new project from FILE | NEW. The data set for this example is DIGESTER.DIF.


The import wizard opens. Select the radio button SIMCA-P Batch project and click on Next.

The second column labelled observation names contains the batch and phase identifiers. Both the Batch identifiers and the phase identifiers (when present) can be located in the same variable (column) or in separate variables in the spreadsheet.


Mark this second column and from the combo box (top of column) select Batch/Phase identifiers.

The following window opens:


The batch identifiers are sequential numbers from 01 to 52 and the phases are chip, acid, cook, blbk, and blow. Click OK. The batch and phase IDs are now in 2 separate variables.

Mark the last column with the sampling time variable and from the drop down menu, select Y Variable (Time or Maturity) and click on Next. The Phase page displays the list of phases in the dataset with the number of observations and batches in each. Under every phase is the list of variables. Using the CTRL key mark both the chip and the acid phases and click on Merge. Mark the Blow phase and click on Delete.


We now have 3 phases left: chip+acid, cook and blbk. Click on Next. The Batch page opens listing all the batches with their numbers of observations. Listed under every batch are the phases included in the batch. In our example all the batches include all the phases.
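The effect of Merge and Delete on the phase labels can be sketched outside SIMCA-P as follows. This is only an illustration: the row layout (one batch/phase pair per observation) and the function names are assumptions, not part of the software.

```python
# Hypothetical layout: one (batch_id, phase) pair per observation row.
def merge_and_delete(rows, merge, merged_name, delete):
    """Relabel phases listed in `merge` as `merged_name`; drop phases in `delete`."""
    result = []
    for batch, phase in rows:
        if phase in delete:
            continue  # observation belongs to a deleted phase
        result.append((batch, merged_name if phase in merge else phase))
    return result


def phases_in_order(rows):
    """Distinct phase labels in order of first appearance."""
    seen = []
    for _, phase in rows:
        if phase not in seen:
            seen.append(phase)
    return seen


# Made-up observations for one batch, for illustration only.
rows = [("01", p) for p in ("chip", "chip", "acid", "cook", "cook", "blbk", "blow")]
kept = merge_and_delete(rows, merge={"chip", "acid"},
                        merged_name="chip+acid", delete={"blow"})
print(phases_in_order(kept))  # ['chip+acid', 'cook', 'blbk']
```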

The Conditional delete allows you to delete batches or phases or a selected phase with fewer observations than a selected number.


In this example we do not use the Conditional Delete. Click on Next to display the project specification page and then click on Finish.

The following message is displayed and a new variable $Time is created.

Specify the Workset

MB1 is an umbrella model which has been prepared with 3 unfitted models, one for every phase, with all the measured variables specified as X’s and the relative sampling time as Y. All variables are centered and scaled to unit variance (UV). We need to edit MB1 to include only the relevant variables in each phase, and select the 30 good batches.
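As a reminder of what centering and scaling to unit variance does to a single variable, here is a minimal sketch. It assumes the sample standard deviation (n - 1 denominator) and uses made-up values; it is not taken from the SIMCA-P implementation.

```python
import statistics


def uv_scale(column):
    """Center a variable and scale it to unit variance (sample SD, n - 1)."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


# Made-up temperature readings, for illustration only.
temperatures = [100.0, 110.0, 120.0, 130.0]
scaled = uv_scale(temperatures)
```

After this transformation every variable has mean 0 and standard deviation 1, so variables measured in different units carry comparable weight in the model.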


Click on Workset | Edit MB1 and select the Variables tab. Select the first 6 variables, click on the Configure Phases button and assign them to the first phase. Continue and assign the variables to the respective phases as specified in the Variables section. The Variables page should be as follows:

Note that the Y variable, sampling time, will automatically be shifted to start at 0 for every phase and normalized for better alignment. Normalizing the sampling time achieves linear time warping.

Click on the Batch page to select the 30 good batches: 1, 4, 6 to 13, 16, 18, 21, 23, 25, 29, 31, 32, 34, 36 to 38, 40, 42, 43, 46 to 49 and 51. To do this, first press Select All and Exclude. This excludes all batches. Then, using the CTRL key, mark the 30 good batches and click on Include.
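The shifting and normalizing of the sampling time can be illustrated with a small sketch. It assumes that normalizing means dividing by the duration of the phase, so that every phase runs from 0 to 1; the exact normalization SIMCA-P applies may differ.

```python
def normalize_phase_time(times):
    """Shift a phase's sampling times to start at 0, then divide by the phase
    duration (an assumed normalization) so that the phase ends at 1."""
    start = times[0]
    duration = times[-1] - start
    if duration == 0:
        raise ValueError("phase needs at least two distinct time points")
    return [(t - start) / duration for t in times]


# Two made-up batches whose phase lasts 1 hour and 2 hours respectively.
fast = normalize_phase_time([2.0, 2.5, 3.0])
slow = normalize_phase_time([2.0, 3.0, 4.0])
print(fast, slow)  # [0.0, 0.5, 1.0] [0.0, 0.5, 1.0]
```

Both batches end up on the same relative time axis although their phases differ in length, which is the sense in which the normalization acts as a linear time warping.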



Analysis

Fitting All the Class models

Click on Analysis | Autofit All Class Models. The Specify Autofit window opens; click on OK.


The 3 class models are fitted and they all explain more than 80% of X. We will examine the cook phase as it is the most important.

Scores Line plot of t1, t2 and t3

Double click on the cook model to examine its components.

The first three components are the most important, explaining together 68% of the variation of X; t1 explains 47%, t2 13% and t3 7%. Click on Scores | Line Plot | t1 to display the first summary variable t1, summarizing all the variables of the cook phase.

The 30 batches are all within the 3 sigma limit of t1. Select t2 from the component combo box in the properties bar. The 30 batches are within the 3 sigma limit of t2. Select t3.


The score t3 displays more variability, as all of the batches have some time points above the 3 sigma limits.

Loadings p1, p2 and p3

With batches we are interested in summarizing the X variables, and the loadings p1, p2 and p3 are the weights that combine the original X variables to form t1, t2 and t3. To interpret the first three scores t1, t2 and t3 (new variables summarizing all the X variables) we look at the loadings p1, p2 and p3. Click on Analysis | Loadings | Column plot | p1, and then p2 and p3.
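Taking the description above at face value, a score value is a weighted sum of the (scaled) X variables of one observation, with the loadings as weights. The sketch below uses made-up numbers and a hypothetical three-variable observation; it ignores the details of how the PLS algorithm actually derives the weights.

```python
def score(observation, loading):
    """Weighted sum of the scaled X values of one observation."""
    return sum(x * p for x, p in zip(observation, loading))


# Made-up scaled values for (temp1, temp2, pressure1) and a made-up loading.
observation = [1.0, -2.0, 0.5]
p1 = [0.5, 0.5, 0.0]
t1 = score(observation, p1)
print(t1)  # -0.5
```

Variables with a large loading in absolute value dominate the score, which is why the loading plots are used to interpret t1, t2 and t3.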

We can see that t1 consists mainly of the first 5 temperatures and pressure1. The second score t2 is primarily pressure1, the steam and temperature1.


The score t3 is again dominated by the pressures (1 and 2) with steam, temp1 and temp6.

Batch Control charts (Training set)

Analysis | Batch | Control Charts | Scores

All the batches in the training set are aligned to the same length with the same time points. Hence we can now, at each time point, compute the average t1 with its standard deviation. The Batch Control chart of t1 shows how this summary (the temperature trace) varies during the evolution of the cook phase. The green line is the average t1 computed from all good batches. The red limits are the 3 sigma limits computed from the variation of t1 around its average over all good batches. The green line represents the fingerprint of the ideal good batch. All new good batches should evolve in the same way and should be inside the red control limits.
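The computation of the average trace and its 3 sigma limits can be sketched as follows. The sketch assumes the traces are already aligned to common time points and uses the sample standard deviation; the three short traces are made up.

```python
import statistics


def control_limits(traces, n_sigma=3.0):
    """Average trace and +/- n_sigma limits from aligned good-batch traces.

    `traces` is a list of equally long score traces, one per good batch.
    """
    average, lower, upper = [], [], []
    for values in zip(*traces):  # all batches at the same time point
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        average.append(mean)
        lower.append(mean - n_sigma * sd)
        upper.append(mean + n_sigma * sd)
    return average, lower, upper


# Made-up t1 traces of three good batches at three aligned time points.
good = [[0.0, 1.0, 2.0], [0.2, 1.2, 2.2], [-0.2, 0.8, 1.8]]
average, lower, upper = control_limits(good)
```

Here the average corresponds to the green line and the lower and upper lists to the red limits of the control chart.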


Individual batches can be included in this control chart – the first training batch is included as default. More can be included in the stack of displayed batches by using the properties menu (after right click). Use the side arrows to move the stack of displayed batches forward or backward by one batch. You can also use the properties bar to select the batches to display.

Properties page

Right click on the plot and from the pop-up menu open the Properties page.


Mark all the batches you want to display and move them into the selected window. In this case the traces of all the good batches are within the red control limits.

To display the Control chart in Normalized units, from the Limits and Averages tab (under Properties), select Remove the average and Normalize the values, and click on Apply.


The plot is now displayed in normalized units.
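Removing the average and normalizing can be read as subtracting the average trace and dividing by the standard deviation at each time point, so that the 3 sigma limits become the constant lines -3 and +3. A minimal sketch under that assumption, with made-up values:

```python
def to_normalized_units(trace, average, sd):
    """Express a trace as the number of standard deviations from the average."""
    return [(t - a) / s for t, a, s in zip(trace, average, sd)]


# Made-up average, standard deviation and batch trace at three time points.
average = [0.0, 1.0, 2.0]
sd = [0.2, 0.2, 0.2]
trace = [0.2, 1.0, 1.4]
normalized = to_normalized_units(trace, average, sd)
```

In this example the last time point lies about 3 standard deviations below the average, that is, on the lower control limit.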

In the component tab, select the 2nd component from the combo box to display the Control Chart of t2.


Note that this plot is not in Normalized units.

Batch Control Charts DModX, Hotelling T2 and Observed vs. predicted

The plots of the distance to the model (DModX), Hotelling T2, and Observed vs. Predicted time, with their control limits, are also important monitoring charts for new batches. Display univariate Batch Control charts when needed by selecting the Variable Plot.
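As an illustration of one of these monitoring statistics, Hotelling T2 for an observation is commonly computed as the sum over components of the squared score divided by the variance of that score in the training set. The sketch below follows this common definition with made-up scores; it does not compute the control limit and is not claimed to match the SIMCA-P implementation in every detail.

```python
import statistics


def hotelling_t2(scores, score_variances):
    """Sum over components of squared score divided by training-set variance."""
    return sum(t * t / v for t, v in zip(scores, score_variances))


# Made-up training scores for two components.
training_t1 = [-2.0, 0.0, 2.0]
training_t2 = [-1.0, 0.0, 1.0]
variances = [statistics.variance(training_t1), statistics.variance(training_t2)]

# Made-up scores (t1, t2) of a new observation.
t2_value = hotelling_t2([2.0, -1.0], variances)
print(t2_value)  # 2.0
```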


Monitoring new batches

Creating the Prediction set: Complement of Workset

Use the menu Prediction | Specify prediction set | Specify. Remove all batches from the prediction set (the right window), select all batches from the left window (the Complement batches of the Training set), move them to the right window and press OK. From the Prediction menu, save them as a Secondary data set, give it the name Pred1 and click OK.
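The complement of the workset is simply the set of the 52 batches that are not among the 30 good batches selected earlier. A small sketch, with variable names chosen here for illustration:

```python
ALL_BATCHES = list(range(1, 53))  # batches 01 to 52

# The 30 good batches of the training set, as listed in the workset step.
GOOD_BATCHES = [1, 4, *range(6, 14), 16, 18, 21, 23, 25, 29, 31, 32, 34,
                *range(36, 39), 40, 42, 43, *range(46, 50), 51]

good = set(GOOD_BATCHES)
prediction_set = [b for b in ALL_BATCHES if b not in good]
print(len(GOOD_BATCHES), len(prediction_set))  # 30 22
```

The prediction set Pred1 therefore contains 22 batches, including batches 26, 28, 33, 50 and 52 discussed later in this example.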


Batch Control Chart of the Prediction set

For the cook phase, select Prediction | Batch Control Chart | Scores and use the Properties page to include all the batches.

In the Control chart of t1 with the average and 3 sigma computed from the good batches, we can see batch 28 far outside the control limits.

OOC plot

Right click on the control chart and select the OOC plot. This plot displays, for every batch, the area outside the control limits as a percentage of the total area inside the limits of the control chart.


Hence batch 28 has 40% of its area outside the control limits.
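The percentage shown in the OOC plot can be approximated as follows. This is a sketch under assumptions: areas are computed with the trapezoidal rule on the sampled time points, which is only approximate where the trace crosses a limit between two samples, and all numbers are made up.

```python
def trapezoid(ys, xs):
    """Area under a piecewise-linear curve given by points (xs, ys)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))


def ooc_percent(times, trace, lower, upper):
    """Area of the trace outside the limits, as a percentage of the area
    between the lower and upper limits."""
    excess = [max(t - u, 0.0) + max(l - t, 0.0)
              for t, l, u in zip(trace, lower, upper)]
    band = [u - l for l, u in zip(lower, upper)]
    return 100.0 * trapezoid(excess, times) / trapezoid(band, times)


# Made-up control limits and batch trace at four time points.
times = [0.0, 1.0, 2.0, 3.0]
lower = [-1.0, -1.0, -1.0, -1.0]
upper = [1.0, 1.0, 1.0, 1.0]
ooc = ooc_percent(times, [0.0, 2.0, 2.0, 0.0], lower, upper)
```

A batch that stays inside the limits at all time points gets 0%, while a batch such as batch 28 gets a large value.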

Group Contribution plot

Display batch 28, mark the time points outside the 3 sigma and click on the action plot.

The Contribution plot shows pressure1 being 6 standard deviations lower than the average batch for these time points, and temperature2 to temperature5 also being lower than the average at these time points.
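The idea behind such a plot, the deviation of each variable from the average batch expressed in standard deviations, can be sketched as below. This is a simplified, unweighted version: the contribution plots in SIMCA-P can also weight the deviations by the model, which is not reproduced here, and all numbers are made up.

```python
import statistics


def deviations_in_sd(selected_rows, average, sd):
    """Per variable: mean of the selected observations minus the average
    of the good batches, divided by the standard deviation."""
    result = []
    for k in range(len(average)):
        mean_selected = statistics.fmean([row[k] for row in selected_rows])
        result.append((mean_selected - average[k]) / sd[k])
    return result


# Made-up values for two variables (pressure1, temperature2).
average = [10.0, 140.0]
sd = [0.5, 2.0]
selected = [[7.0, 138.0], [7.0, 136.0]]  # marked time points of one batch
deviations = deviations_in_sd(selected, average, sd)
print(deviations)  # [-6.0, -1.5]
```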

Variable control chart

Double click on Pressure1 to display the control chart of that variable.


Prediction | Batch Control Charts | DModX

Batch 28 is clearly out of the control limit for the time period 1 to 2 hrs.

Contribution plot

The Contribution plot for batch 28 in that time period shows that the problem is also associated with pressure2 and temp6 (correlated with pressure2).


Double click on pressure2 to display the control chart.

Creating and Modelling the batch level project

Select the menu File | Batch | Create batch level project, mark Scores and the check box “Bring secondary dataset”, and select the prediction set Pred1. Click on Next, select the batch level name and click OK. In the batch level project, each row has the data from one batch and consists of the unfolded score vectors from the observation level models, which describe the evolution of each batch. This example has no initial conditions.
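Unfolding can be pictured as laying the score values of all aligned time points of a batch side by side in one row. The sketch below assumes a time-major ordering and a hypothetical column naming scheme; the ordering and names used by SIMCA-P may differ.

```python
def unfold(batch_scores):
    """Turn {batch: [[t1, t2, ...] per time point]} into one flat row per batch."""
    return {batch: [value for point in points for value in point]
            for batch, points in batch_scores.items()}


def column_names(n_points, n_components):
    """Hypothetical names for the unfolded columns, e.g. 't1_time0'."""
    return [f"t{a + 1}_time{j}"
            for j in range(n_points) for a in range(n_components)]


# Made-up scores (t1, t2) of two batches at two aligned time points.
scores = {"01": [[1.0, 0.1], [2.0, 0.2]],
          "04": [[1.1, 0.0], [2.1, 0.3]]}
rows = unfold(scores)
names = column_names(n_points=2, n_components=2)
print(names)  # ['t1_time0', 't2_time0', 't1_time1', 't2_time1']
```

Each batch is then a single observation in the batch level model, with as many X-variables as there are time points times components.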


Analysis: Autofit

Click on Analysis | Autofit to fit a PCA model. SIMCA-P extracts 5 components.

Analysis: Scores

Click on Analysis | Scores | t1 vs t2.

Batch 6 is slightly out of the Hotelling T2 confidence interval. Using the Contribution Tool, clicking on batch 6 gives the contribution plot.


The Contribution plot is coloured by phases, and shows that t1 in the cook phase, at time 5.2 hours, is lower than the average by 6.5 standard deviations. With the Contribution tool, double click on this bar to resolve this contribution into the original variables.

Temperature2, around time 5.2 hours, is lower than the average of the good batches at the same time point. Displaying the Control chart of temperature2, by double clicking on it, we can see that temperature2 at time 5.36 hours is equal to 114.9 degrees and is slightly below the control limit. Temperature2 is equal to 141 degrees for the average of the good batches at this time point.


Analysis | Batch Variable Importance

Considering that the different phases have different variables, one must display the Batch Variable Importance separately for each phase. Select the Cook phase, as it is the most important.

This plot, by combining the importance of the scores in the batch level model, with the weights p derived from the observation level model, displays the overall importance of the measured variables for the whole batch model in the cook phase. Here we see that the temperatures, pressure1 and the steam dominate.

Predicting the quality of the prediction set batches

In the menu Predictions | Specify, select both the training set and the prediction set batches in Pred1.


Predictions: T Predicted

Select t1 vs. t3. Batches 28 and 26 are outside the Hotelling T2 ellipse.

Predictions: Contribution Scores for batch 28

Using the Contribution tool, double click on batch 28.

What is causing batch 28 to be an outlier? The problem is clearly in the cook phase. Double click on one of the scores with large deviations, for example t1 at time 1.1 hours, to resolve the contribution into original variables.


The resolved contribution plot shows pressure1 as being much lower than average. The Control chart of pressure1 confirms this and shows the problem with batch 28.


Predictions: Distance to the Model (DModX)

Batches 28, 26, 33, 50 and 52 have the largest DModXPS.

Contribution Plot

Double click with the contribution tool on batch 33 to display the contribution plot.

The problem seems to be in t2 of the cook phase around time 0.4 hours (the beginning of that phase) and also in the chip+acid phase. Double clicking on a large score in the chip and acid phase we see that the problem was with the steam state.


The resolved contribution for the large t2 in the cook phase shows both the pressure1 and the steam lower than average, probably due to the problem with the steam state.

The Control charts for batch 33 of both pressure1 and steam confirm this.


Conclusion

Modelling the evolution of a representative set of good batches allowed us to construct control charts to monitor new batches during their evolution. We detected problems in the evolution of the bad batches and understood why these batches were outside the control limits. The model of the whole batch has allowed us to classify the new batches as good or bad and understand why these batches had an inferior quality.
