Industrial Data Modeling with DataModeler
Mark Kotanchek
Evolved Analytics
mark@evolved-analytics.com
Wolfram Tech Conf 2006 Evolved Analytics LLC 2
Nonlinear Data Modeling: The Bottom Line
• The world is nonlinear
• People time is expensive
• Computing is cheap
• Life doesn’t have to be hard
• Success has been demonstrated in the real world
The caveat here is that we are (mostly) looking at response
surface analysis and modeling of numerical data
Symbolic Regression
• Algorithmic advances in recent years have yielded a speed improvement of more than three orders of magnitude for symbolic regression via genetic programming (GP), relative to conventional GP
• This has been coupled with continuing improvements in compute hardware
• Furthermore, symbolic regression is naturally parallelizable
• Symbolic regression features most of the unique nonlinear capabilities
• The net result is that symbolic regression has moved into the forefront of nonlinear modeling technologies for us
Symbolic regression searches for both the expression structure and the associated coefficients that capture the data behavior
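To make the "structure plus coefficients" idea concrete, here is a minimal sketch. It is not genetic programming and not DataModeler's algorithm: it simply enumerates a small, hand-picked set of candidate structures (an assumption made for illustration), fits the coefficients of each by ordinary least squares, and keeps the structure with the lowest squared error. The data are synthetic.

```python
import math

# Hypothetical candidate structures: each maps x to a feature g(x),
# and the fitted model has the form y = a * g(x) + b.
CANDIDATES = {
    "a*x + b": lambda x: x,
    "a*x^2 + b": lambda x: x * x,
    "a*sqrt(x) + b": lambda x: math.sqrt(x),
    "a*log(x) + b": lambda x: math.log(x),
}

def fit_linear(g, y):
    """Ordinary least squares for y = a*g + b; returns (a, b)."""
    n = len(g)
    mean_g = sum(g) / n
    mean_y = sum(y) / n
    sxx = sum((gi - mean_g) ** 2 for gi in g)
    sxy = sum((gi - mean_g) * (yi - mean_y) for gi, yi in zip(g, y))
    a = sxy / sxx
    return a, mean_y - a * mean_g

def search(xs, ys):
    """Fit every candidate structure and keep the one with the lowest SSE."""
    best = None
    for name, transform in CANDIDATES.items():
        g = [transform(x) for x in xs]
        a, b = fit_linear(g, ys)
        sse = sum((a * gi + b - yi) ** 2 for gi, yi in zip(g, ys))
        if best is None or sse < best[3]:
            best = (name, a, b, sse)
    return best

# Synthetic data generated from y = 3*x^2 + 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0 * x * x + 2.0 for x in xs]
best_name, best_a, best_b, best_sse = search(xs, ys)
print(best_name, round(best_a, 3), round(best_b, 3))
```

A genetic-programming search replaces the fixed candidate list with an evolving population of expressions, but the two-level problem (which structure, which coefficients) is the same.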
What Makes DataModeler Special?
Goal: To Dazzle & Delight
• Dazzle
– Extract value out of data
– Robustness of models
– Provide insight & understanding in the process
• Delight
– Ease & efficiency of model development
– Model lifecycle management
• Automatic variable selection & variable transform identification
• Ability to handle ill-conditioned data sets
• System insight
• Problem insight
• Robust & accurate models
• Trust metric
• Modeling lifecycle tools
Package Case Studies
• Distillation Column Quality Predictor
– Large data set (skinny array: 6929 x 23 variables)
– Multiple data sets (test, train, validate)
– Ensemble of models
– Potential pathologies
• Emissions Inferential Sensor
– Handling correlated data sets
  • Train/test: 251/107 x 8
– Looking at extrapolation
• Process Optimization Emulator
– Working against designed data (320/275 x 10 + 5 responses)
– Goal is to replace a 24 hour optimization
• Blown Film Process Effects
– Interpreting research data
– 20 x 9 inputs, 21 responses
– Applications into combinatorial chemistry
• Balancing Service Price
– Fat array (298 x 48)
– Handle correlated data
– Identify driving variables
Getting the Zen of the Data
Context-free analysis leads to confidently wrong answers
Evolving Models
• Models may be automatically archived
• For convenience, default option sets are defined
• Progress may be monitored at several levels
Pareto Front & Modeling Potential
[Plot annotations: "Exploratory Run"; "Hard but potential"; "Useful??? Where is the knee?"]
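As a rough illustration of what a Pareto front over model complexity and error looks like, the sketch below filters a list of (name, complexity, error) tuples down to the non-dominated ones. The tuple layout and the numbers are invented for the example; how DataModeler measures complexity or error is not stated on the slide.

```python
def pareto_front(models):
    """Return models not dominated in (complexity, error); both are minimized.

    A model is dominated when some other model is no worse on both
    objectives and strictly better on at least one.
    """
    front = []
    for name, c, e in models:
        dominated = any(
            (c2 <= c and e2 <= e) and (c2 < c or e2 < e)
            for _, c2, e2 in models
        )
        if not dominated:
            front.append((name, c, e))
    return sorted(front, key=lambda m: m[1])

# Illustrative values only.
models = [
    ("m1", 5, 0.40),
    ("m2", 12, 0.15),
    ("m3", 14, 0.30),  # dominated by m2
    ("m4", 30, 0.12),
    ("m5", 60, 0.11),
    ("m6", 35, 0.20),  # dominated by m4
]
front = pareto_front(models)
for name, complexity, error in front:
    print(name, complexity, error)
```

In this made-up set, moving from m2 to m5 multiplies complexity by five for a small drop in error, which is the kind of flattening the "knee" question refers to.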
Driving Variables
Notice the natural variable selection
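One simple way to summarize which inputs matter is to count how often each variable appears across a set of evolved models. The sketch below assumes each model has already been reduced to the set of input names it uses; that representation, and the variable names, are hypothetical, and the slides do not say how DataModeler computes its own ranking.

```python
from collections import Counter

def variable_presence(models):
    """Fraction of models in which each variable appears."""
    counts = Counter(v for variables in models for v in set(variables))
    return {v: counts[v] / len(models) for v in counts}

# Hypothetical summary: one set of input names per evolved model.
models = [
    {"x1", "x3"},
    {"x1", "x3", "x7"},
    {"x1"},
    {"x1", "x3"},
]
presence = variable_presence(models)
# Rank by presence (descending), breaking ties by name.
ranked = sorted(presence.items(), key=lambda kv: (-kv[1], kv[0]))
for name, fraction in ranked:
    print(name, fraction)
```

Variables that never appear in any model simply drop out of the table, which mirrors the "natural variable selection" noted above.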
Selecting Models (ad hoc)
Potential Pathologies?
Model Performance
• Visually, our goal is minimal error with an even distribution of errors and no structural error problems as a function of variable value.
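The visual check can be complemented by a numerical one. The sketch below is one possible approach, not something described in the slides: sort the residuals by an input variable, split them into a few bins, and compare the mean residual per bin. Means that swing systematically away from zero suggest structural error. The data and the linear model are synthetic.

```python
def binned_mean_residuals(x, residuals, n_bins=3):
    """Sort by x, split into n_bins nearly equal groups, return the mean residual of each."""
    pairs = sorted(zip(x, residuals))
    n = len(pairs)
    means = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins:(i + 1) * n // n_bins]
        means.append(sum(r for _, r in chunk) / len(chunk))
    return means

# Synthetic case: the true response is quadratic, the model is linear.
x = [float(i) for i in range(9)]
observed = [xi * xi for xi in x]
predicted = [8.0 * xi - 9.0 for xi in x]  # hypothetical linear model
residuals = [o - p for o, p in zip(observed, predicted)]
means = binned_mean_residuals(x, residuals)
print([round(m, 3) for m in means])
```

Here the bin means are positive at both ends of the range and negative in the middle, the signature of a curvature the model has missed.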
Trust via Ensembles
• Models with independent error structures may be “stacked” with their consensus forming a trust metric
• Note that the models generally won’t be on the Pareto front
• Also note that for large data sets, the error residuals will be highly correlated (so we need a relaxed definition of uncorrelated)
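A minimal sketch of the stacking idea follows. The choice of the median as the consensus and the standard deviation across models as the disagreement measure is an assumption made for illustration; the slides do not specify how the trust metric is computed. The three models are invented so that they agree on the interval [0, 1] and drift apart outside it.

```python
import statistics

def ensemble_predict(models, x):
    """Return (consensus, spread): the median prediction and the std. dev. across models."""
    preds = [m(x) for m in models]
    return statistics.median(preds), statistics.pstdev(preds)

# Hypothetical models that nearly agree for x in [0, 1] and diverge elsewhere.
models = [
    lambda x: 1.0 + 2.0 * x,
    lambda x: 1.0 + 2.0 * x + 0.05 * x * (x - 1.0),
    lambda x: 1.0 + 2.0 * x - 0.04 * x * x * (x - 1.0),
]
inside, spread_inside = ensemble_predict(models, 0.5)
outside, spread_outside = ensemble_predict(models, 10.0)
print(round(inside, 4), round(spread_inside, 4))
print(round(outside, 4), round(spread_outside, 4))
```

A small spread means the ensemble members agree and the prediction can be given more trust; a large spread flags a point where the models disagree, for example under extrapolation.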
On the (near!) Horizon
• Implement a ConvertModelForExcel[ ] function
• Complete documentation & release the package for sale
Symbolic Regression: Summary Benefits
Compact Nonlinear Models
– Compact empirical models can be suitable for online implementation
– Model(s) can be used as an emulator for coarse system optimization
Driving Variable Selection & Identification
– Identified driving variables may be used as inputs into other modeling tools
Models from Pathological Data Sets
– Appropriate models may be developed from poorly structured data sets (too many variables & not enough measurements)
Metasensor (Variable Transform) Identification
– Identifying variable couplings can give insight into underlying physical mechanisms
– Identified metavariables can enable linearizing transforms to meld symbolic regression and more traditional statistical analysis
– Metavariables can also be used as inputs into other modeling tools
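The linearizing-transform point can be illustrated with a small synthetic case. Suppose symbolic regression has revealed that the response depends on the ratio x1/x2 (a hypothetical metavariable chosen for this example). Defining z = x1/x2 turns the problem into an ordinary straight-line fit, which a conventional statistical tool can handle.

```python
def fit_line(z, y):
    """Ordinary least squares for y = slope*z + intercept; returns (slope, intercept, sse)."""
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    slope = szy / szz
    intercept = mean_y - slope * mean_z
    sse = sum((slope * zi + intercept - yi) ** 2 for zi, yi in zip(z, y))
    return slope, intercept, sse

# Synthetic data generated from y = 5 * (x1 / x2) - 1.
x1 = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
x2 = [2.0, 1.0, 4.0, 2.0, 3.0, 2.0]
y = [5.0 * a / b - 1.0 for a, b in zip(x1, x2)]

# Fit against the raw input x1, then against the metavariable z = x1 / x2.
_, _, sse_raw = fit_line(x1, y)
z = [a / b for a, b in zip(x1, x2)]
slope, intercept, sse_meta = fit_line(z, y)
print(round(slope, 6), round(intercept, 6), sse_raw > sse_meta)
```

The fit on the raw input leaves unexplained error, while the fit on the metavariable recovers the generating coefficients.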
Rapid Data Content Assessment
– Examining the shape of the Pareto front allows us to quickly assess whether viable models can be developed from the available data
Diverse Model Ensembles
– Independent evolutions will produce independent models. Independent (but comparable) models may be stacked into ensembles whose divergence in prediction may be an indicator of extrapolation & model trustworthiness. Detecting extrapolation is a particular issue in high-dimensional parameter spaces.
Human Insight
– The transparency of the evolved models as well as the explicit identification of the model complexity-accuracy trade-off is very compelling
– Examining an expression can be viewed as a visualization technique for high-dimensional data
Rapid Modeling
– Exploitation of the Pareto front has resulted in an improvement of several orders of magnitude in symbolic regression performance relative to more traditional GP. This greatly increases the range of possible applications.
There are many benefits to symbolic regression. These are enhanced when coupled with other analysis tools and techniques.