Industrial Data Modeling with DataModeler

Mark Kotanchek

Evolved Analytics

mark@evolved-analytics.com

Wolfram Tech Conf 2006 Evolved Analytics LLC 2

Nonlinear Data Modeling: The Bottom Line

• The world is nonlinear
• People time is expensive
• Computing is cheap
• Life doesn’t have to be hard
• Success has been demonstrated in the real world.

The caveat here is that we are (mostly) looking at response surface analysis and modeling of numerical data.

Symbolic Regression

• Algorithmic advances in recent years have resulted in a speed improvement of more than three orders of magnitude in symbolic regression via genetic programming (GP) relative to conventional GP

• This has been coupled with continuing improvements in compute hardware

• Furthermore, symbolic regression is naturally parallelizable

• Symbolic regression features most of the unique nonlinear capabilities

• The net result is that symbolic regression has moved into the forefront of nonlinear modeling technologies for us

Symbolic regression searches for both the expression structure and the associated coefficients that capture the data behavior.
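
The joint search over structure and coefficients can be illustrated with a minimal Python sketch. It is not DataModeler's algorithm: genetic programming evolves expression trees, whereas this sketch simply enumerates a small, hypothetical set of candidate structures and fits the coefficients a and b of y ≈ a·g(x) + b by least squares. The structure set, function names, and synthetic data are all assumptions made for illustration.

```python
import math

# Hypothetical candidate structures: each maps x to a transformed feature g(x).
# The fitted model has the form y = a * g(x) + b.
STRUCTURES = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sqrt(x)": lambda x: math.sqrt(x),
    "log(x)": lambda x: math.log(x),
    "1/x": lambda x: 1.0 / x,
}


def fit_coefficients(g, xs, ys):
    """Least-squares fit of y = a * g(x) + b; returns (a, b, sum of squared errors)."""
    zs = [g(x) for x in xs]
    n = len(xs)
    z_mean = sum(zs) / n
    y_mean = sum(ys) / n
    szz = sum((z - z_mean) ** 2 for z in zs)
    szy = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys))
    a = szy / szz
    b = y_mean - a * z_mean
    sse = sum((a * z + b - y) ** 2 for z, y in zip(zs, ys))
    return a, b, sse


def search(xs, ys):
    """Fit every candidate structure and return the best (name, a, b, sse)."""
    results = []
    for name, g in STRUCTURES.items():
        a, b, sse = fit_coefficients(g, xs, ys)
        results.append((name, a, b, sse))
    return min(results, key=lambda r: r[3])


# Synthetic data generated from y = 3 / x + 2 (an illustrative choice).
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
ys = [3.0 / x + 2.0 for x in xs]
best_name, a, b, sse = search(xs, ys)
print(best_name, round(a, 6), round(b, 6))
```

On this noise-free data the search recovers the generating structure (1/x) along with its coefficients. With real data, several structures would fit comparably well, which is where the complexity-accuracy trade-off comes in.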

What Makes DataModeler Special?

Goal: To Dazzle & Delight
• Dazzle
– Extract value out of data
– Robustness of models
– Provide insight & understanding in the process
• Delight
– Ease & efficiency of model development
– Model lifecycle management

• Automatic variable selection & variable transform identification
• Ability to handle ill-conditioned data sets
• System insight
• Problem insight
• Robust & accurate models
• Trust metric
• Modeling lifecycle tools

Package Case Studies

• Distillation Column Quality Predictor
– Large data set (skinny array: 6929 x 23 variables)
– Multiple data sets (test, train, validate)
– Ensemble of models
– Potential pathologies

• Emissions Inferential Sensor
– Handling correlated data sets
  • Train/test: 251/107 x 8
– Looking at extrapolation

• Process Optimization Emulator
– Working against designed data (320/275 x 10 + 5 responses)
– Goal is to replace a 24 hour optimization

• Blown Film Process Effects
– Interpreting research data
– 20 x 9 inputs, 21 responses
– Applications into combinatorial chemistry

• Balancing Service Price
– Fat array (298 x 48)
– Handle correlated data
– Identify driving variables

Getting the Zen of the Data

Context-free analysis leads to confidently wrong answers

Evolving Models

• Models may be automatically archived
• For convenience, default option sets are defined
• Progress may be monitored at several levels

Pareto Front & Modeling Potential

[Figure: Pareto front plots with the annotations "Hard but potential", "Exploratory Run", and "Useful??? Where is the knee?"]
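
As a rough illustration of what a Pareto front of models is, the sketch below filters a list of (name, complexity, error) triples down to the non-dominated ones, assuming lower is better on both axes. The model names and numbers are invented for the example, and the function is a generic non-dominated filter, not DataModeler code.

```python
def pareto_front(models):
    """Return the models not dominated in (complexity, error); lower is better for both."""
    front = []
    for name, complexity, error in models:
        # A model is dominated if another is at least as good on both
        # objectives and strictly better on at least one.
        dominated = any(
            c <= complexity and e <= error and (c < complexity or e < error)
            for _, c, e in models
        )
        if not dominated:
            front.append((name, complexity, error))
    return sorted(front, key=lambda m: m[1])


# Hypothetical models: (name, complexity, error).
models = [
    ("m1", 3, 0.40),
    ("m2", 5, 0.22),
    ("m3", 5, 0.30),   # dominated by m2
    ("m4", 9, 0.10),
    ("m5", 12, 0.12),  # dominated by m4
    ("m6", 20, 0.09),
]
front = pareto_front(models)
for name, complexity, error in front:
    print(name, complexity, error)
```

In this made-up example the error drops quickly up to complexity 9 and barely improves afterwards, which is the kind of knee the slide asks about. A front with no such knee, or one whose error stays high, suggests the data may not support a useful model.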

Driving Variables

Notice the natural variable selection
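
One plausible way to quantify this natural variable selection is to count how often each input appears across a set of evolved models. The sketch below does that for a hypothetical model set, with each model represented only by the set of variables it uses. The representation and the variable names are assumptions; the slides do not say how DataModeler computes its variable rankings.

```python
from collections import Counter


def variable_presence(models):
    """Fraction of models in which each variable appears."""
    counts = Counter(var for variables in models for var in set(variables))
    return {var: counts[var] / len(models) for var in sorted(counts)}


# Hypothetical evolved models, each reduced to the set of inputs it uses.
models = [
    {"x1", "x3"},
    {"x3"},
    {"x1", "x3", "x7"},
    {"x3", "x7"},
]
presence = variable_presence(models)
for var, fraction in presence.items():
    print(var, fraction)
```

Variables that appear in most of the models are candidates for driving variables; inputs that never appear have effectively been selected out.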

Selecting Models (ad hoc)

Potential Pathologies?

Model Performance

• Visually, our goal is minimal error with an even distribution of errors and no structural error problems as a function of variable value.
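
The visual check can be complemented by a simple numerical one. The sketch below sorts residuals by the value of one input, splits them into a few bins, and reports the mean residual per bin; means that drift away from zero in a pattern point to structural error. The binning scheme, the function name, and the example model are illustrative assumptions.

```python
def binned_mean_residuals(xs, residuals, n_bins=3):
    """Sort residuals by variable value, split into n_bins chunks, return each chunk's mean.

    Assumes there are at least n_bins data points.
    """
    ordered = [r for _, r in sorted(zip(xs, residuals))]
    n = len(ordered)
    means = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        means.append(sum(chunk) / len(chunk))
    return means


# Illustrative case: the true response is x^2, but a hypothetical model
# predicts the straight line 8x - 9, so the residuals are U-shaped in x.
xs = list(range(9))
residuals = [x * x - (8 * x - 9) for x in xs]
means = binned_mean_residuals(xs, residuals)
print(means)
```

Here the bin means are positive at both ends and negative in the middle, which is the signature of a missing curvature term rather than random scatter.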

Trust via Ensembles

• Models with independent error structures may be “stacked” with their consensus forming a trust metric

• Note that the models generally won’t be on the Pareto front

• Also note that for large data sets, the error residuals will be highly correlated (so we need a relaxed definition of uncorrelated)
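
A minimal sketch of the stacking idea follows: the ensemble's consensus is taken as the median prediction and the trust metric as the spread (standard deviation) of the individual predictions. Both choices, and the three toy models, are assumptions for illustration; the slides do not specify how DataModeler forms the consensus or the trust metric.

```python
import statistics


def ensemble_predict(models, x):
    """Return (consensus prediction, spread) for an ensemble at input x."""
    predictions = [model(x) for model in models]
    return statistics.median(predictions), statistics.pstdev(predictions)


# Hypothetical ensemble: the models nearly agree on the range [0, 1]
# (standing in for the training region) and diverge outside it.
models = [
    lambda x: x,
    lambda x: x + 0.05 * x * (x - 1),
    lambda x: x - 0.05 * x * (x - 1),
]
inside = ensemble_predict(models, 0.5)    # within the training range
outside = ensemble_predict(models, 10.0)  # far outside it
print(inside)
print(outside)
```

The spread is small inside the range and large outside it, so a large spread can be read as a warning that the ensemble is extrapolating and its prediction deserves less trust.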

On the (near!) Horizon

• Implement a ConvertModelForExcel[ ] function

• Complete documentation & release the package for sale

Symbolic Regression: Summary Benefits

Compact Nonlinear Models
– Compact empirical models can be suitable for online implementation
– Model(s) can be used as an emulator for coarse system optimization

Driving Variable Selection & Identification
– Identified driving variables may be used as inputs into other modeling tools

Models from Pathological Data Sets
– Appropriate models may be developed from poorly structured data sets (too many variables & not enough measurements)

Metasensor (Variable Transform) Identification
– Identifying variable couplings can give insight into underlying physical mechanisms
– Identified metavariables can enable linearizing transforms to meld symbolic regression and more traditional statistical analysis
– Metavariables can also be used as inputs into other modeling tools

Rapid Data Content Assessment
– Examining the shape of the Pareto front allows us to quickly assess whether viable models can be developed from the available data

Diverse Model Ensembles
– The independent evolutions will produce independent models. Independent (but comparable) models may be stacked into ensembles whose divergence in prediction may be an indicator of extrapolation & model trustworthiness. Extrapolation is an issue in high-dimensional parameter spaces.

Human Insight
– The transparency of the evolved models as well as the explicit identification of the model complexity-accuracy trade-off is very compelling
– Examining an expression can be viewed as a visualization technique for high-dimensional data

Rapid Modeling
– Exploitation of the Pareto front has resulted in an improvement of several orders of magnitude in symbolic regression performance relative to more traditional GP. This greatly increases the range of possible applications.

There are many benefits to symbolic regression. These are enhanced when coupled with other analysis tools and techniques.