STATISTICAL APPROACHES TO V&V AND ADAPTIVE SAMPLING IN M&S APRIL 12, 2021 Jim Wisnowski, PhD [email protected] Jim Simpson, PhD [email protected]


2
Part I. Statistical Methods for M&S V&V and Equivalence Methods
THE BIG PICTURE OF M&S V&V: HOW DOES THIS PROCESS TRADITIONALLY WORK?
[Figure: DOE process wheel – Plan, Design, Execute, Analyze – supported by Science and test Management]
3
Definitions
Modeling and Simulation (M&S): “The discipline that comprises the development and/or use of models and simulations.” (DoDD 5000.59, DoDI 5000.61)
“The use of models, including emulators, prototypes, simulators, and stimulators, either statically or over time, to develop data as a basis for making managerial or technical decisions.” (MSCO M&S Glossary)
“Using computers to imitate, or simulate, the operations of various kinds of real-world facilities or processes (a system).” (Law 2015)
Statistical Methods for Modeling and Simulation V&V – Session 1-1
More Definitions
• Verification: The process of determining that a model or simulation implementation and its associated data accurately represent the developer's conceptual description and specifications. (DoDI 5000.61)
• Validation: The process of determining the degree to which a model or simulation and its associated data are an accurate representation of the real world from the perspective of the intended uses of the model. (DoDI 5000.61)
• Accreditation: The official certification that a model or simulation and its associated data are acceptable for use for a specific purpose. (DoDI 5000.61)
5
• Verification: Was the M&S built correctly, according to what was intended?
• Validation: Does the M&S represent the real world accurately enough for its intended use?
• Accreditation: Are the M&S predictions credible enough to inform a decision?
6
[Figure-only slides: VV&A Process for OT Evaluations]
Equivalence Testing
If the means differ by some value, Δ, or more, we can declare the means inequivalent. Otherwise, we say they are equivalent for our purposes.
The two one-sided hypotheses (TOST) are:
$H_{01}: \mu_1 - \mu_2 \le -\Delta$ vs. $H_{A1}: \mu_1 - \mu_2 > -\Delta$
AND
$H_{02}: \mu_1 - \mu_2 \ge \Delta$ vs. $H_{A2}: \mu_1 - \mu_2 < \Delta$
[Figure: confidence intervals falling entirely inside ±Δ are declared equivalent; intervals extending past either limit are inequivalent]
The corresponding pooled-variance t statistics are
$t_1 = \dfrac{\bar{x}_1 - \bar{x}_2 + \Delta}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} > t_{1-\alpha,\nu}$ and $t_2 = \dfrac{\bar{x}_1 - \bar{x}_2 - \Delta}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} < -t_{1-\alpha,\nu}$
We decide the two group means are equivalent if, and only if, both null hypotheses are rejected.
Confidence Interval for Equivalence Test
A $100(1 - 2\alpha)\%$ CI for the difference in means:
$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha,\nu} \, s_p\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}$
If both one-sided tests are rejected, the CI will be contained in the ±Δ interval, and we decide equivalence.
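To make the TOST mechanics concrete, here is a minimal Python sketch of the two one-sided tests with the matching 100(1 − 2α)% interval. The pooled-variance t statistics follow the formulas above; the margin Δ = 1.5 and the live/simulated miss-distance arrays are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy import stats

def tost_equivalence(x1, x2, delta, alpha=0.05):
    """Two one-sided tests (TOST) for mean equivalence within +/- delta."""
    n1, n2 = len(x1), len(x2)
    diff = np.mean(x1) - np.mean(x2)
    # Pooled standard deviation and standard error of the difference
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    # Test 1: H0: mu1 - mu2 <= -delta  vs  H1: mu1 - mu2 > -delta
    t_lower = (diff + delta) / se
    p_lower = stats.t.sf(t_lower, df)
    # Test 2: H0: mu1 - mu2 >= +delta  vs  H1: mu1 - mu2 < +delta
    t_upper = (diff - delta) / se
    p_upper = stats.t.cdf(t_upper, df)
    # Equivalent only if BOTH one-sided nulls are rejected
    equivalent = (p_lower < alpha) and (p_upper < alpha)
    # Matching 100(1 - 2*alpha)% confidence interval for the difference
    t_crit = stats.t.ppf(1 - alpha, df)
    ci = (diff - t_crit * se, diff + t_crit * se)
    return equivalent, (p_lower, p_upper), ci

# Illustrative data: live vs. simulated miss distances (fabricated for this sketch)
rng = np.random.default_rng(1)
live = rng.normal(10.0, 2.0, 20)
sim = rng.normal(10.4, 2.0, 200)
print(tost_equivalence(live, sim, delta=1.5))
```

If both one-sided p-values fall below α, the returned 100(1 − 2α)% interval lies entirely inside ±Δ, matching the confidence-interval interpretation on the slide.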
Comparing Distributions: Kolmogorov-Smirnov Test
• Nonparametric test that compares the distribution of the live data to the simulated data
• Computes the maximum distance between two cumulative distribution functions F(x) and G(x) with sample sizes m and n
$D_{m,n} = \max_x \left| F_m(x) - G_n(x) \right|$
with critical value
$D_\alpha = c(\alpha)\sqrt{\dfrac{m + n}{m\,n}}$
Reject if $D_{m,n} \ge D_\alpha$
• Best used as a simple quick look to see if the distributions are even close to one another; does not require a specific distribution
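A quick sketch of the two-sample comparison using scipy.stats.ks_2samp; the detection-time arrays below are placeholders rather than the missile-warning data from the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
live_times = rng.gamma(shape=3.0, scale=1.2, size=30)   # placeholder live detection times
sim_times = rng.gamma(shape=3.0, scale=1.3, size=500)   # placeholder simulated detection times

# D_{m,n}: maximum vertical distance between the two empirical CDFs
result = stats.ks_2samp(live_times, sim_times)
print(f"D = {result.statistic:.3f}, p-value = {result.pvalue:.3f}")
# A large D (small p-value) suggests the live and simulated distributions differ
```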
Kolmogorov-Smirnov Test Example
• Detection times in missile warning system
Comparing Distributions: Fisher’s Combined Probability Test
• Scenario: air-to-air missile test with some live tests and many simulated miss distances of same conditions
• Determine tail probabilities of likelihood each live test point falls in contours of simulation (more on this coming up)
Comparing Distributions: Fisher’s Combined Probability Test
• Fisher’s Combined Probability Test
 Null hypothesis: the live data are representative of the simulated reference distributions across the live test space
 Test statistic from the individual tail probabilities $p_i$: $X^2 = -2\sum_{i=1}^{k}\ln(p_i)$
 Reference distribution is Chi-Square with $2k$ degrees of freedom, where $k$ is the number of live tests (here test statistic $X^2 = 19.33$ and overall p-value = 0.25)
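A minimal sketch of Fisher's combined probability test in Python. The eight tail probabilities below are invented for illustration; with real data they would be the per-event tail probabilities of each live point under its simulated reference distribution.

```python
import numpy as np
from scipy import stats

# Tail probability of each live test point under its simulated reference
# distribution (values invented for illustration; k live tests)
p = np.array([0.42, 0.18, 0.55, 0.07, 0.63, 0.31, 0.24, 0.49])

# Fisher's statistic: X^2 = -2 * sum(ln p_i), chi-square with 2k df under H0
x2 = -2.0 * np.sum(np.log(p))
df = 2 * len(p)
p_combined = stats.chi2.sf(x2, df)
print(f"X^2 = {x2:.2f} on {df} df, combined p-value = {p_combined:.3f}")

# scipy provides the same test directly
stat, pval = stats.combine_pvalues(p, method="fisher")
```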
Comparing Distributions: Fisher’s Combined Probability Test
• Fisher’s Combined Probability Test Ideas
 A single low p-value can drive the test statistic
 The test gives no consideration to validating the factor space between live and simulated runs
 The actual test had 16 live test events, so there was an opportunity to learn about the factors
Conceptual Framework for Modeling and Simulation
*figure from VV&A Recommended Practices Guide, DMSCO
WHY – ENHANCING M&S V&V RISK AND COVERING THE OPERATIONAL SPACE
18
The World As it Is . . . Collecting Evidence Every observation collected has two components – signal and noise
We must distinguish between the two using the discipline of statistics – organizations can make better decisions with high-level impact
Random variation (noise) can be reduced by increasing sample size, which also gives greater clarity to the signal component
Ref: Walter Shewhart, 1924 and Lynn Hare, 2007, “10 Quality Basics,” Quality Progress
$y = f(x) + \varepsilon$
19
Test Design Space: Simulation vs. Live
Strategies for selecting the design space and points for live and simulation-based testing:
 Ideal: the live design encompasses the simulation design space, so comparisons between them are interpolations, not extrapolations.
 Limitation in Live Space: often due to practical constraints in live testing. Here the live test domain should span the maximum possible portion of the simulation domain, and regions of extrapolation should be clearly identified in the validation limitations.
[Figure: Ideal vs. Limited Live design space coverage]
20
Controlling Risk
Example: Does clutter degrade Small Diameter Bomb (SDB) II miss distances? Compute the difference between lo/hi clutter misses.
Risks or errors defined: Type I error = α, Type II error = β, Power = 1 − β
Bowl I. SDB II resists clutter: the distribution of differences averages 0. α = probability of drawing samples with big differences between lo/hi clutter conditions anyway.
Bowl II. SDB II fails in clutter: the distribution of differences averages δ. β = probability of drawing samples with small differences between lo/hi clutter conditions anyway.
[Figure: two overlapping distributions of differences, centered at 0 and 0 + δ, with the α and β tail areas shaded]
6 factors here, 6 more to go
How many points? 2^6 = 64 two-level points; 2^12 = 4,096 points; 4^12 > 16 million!
Factors: Target Type, Num Weapons, Target Angle on Nose, Release Altitude, Release Velocity, Release Heading, Target Downrange, Target Crossrange, Impact Azimuth (°), Fuze Point, Fuze Delay, Impact Angle (°)
22
[Figure: the Plan–Design–Execute–Analyze DOE cycle, supported by Science and Management, applied across the test phases]
DOD M&S Example: Missile Performance
• Estimate miss distance and probability of kill for missiles intercepting threats
Image courtesy of US Air Force, approved for public release
24
HOW – ENHANCING M&S V&V HOW TO BEST ADDRESS THE VALIDATION PROCESS
26
M&S and Live and Statistical VV&A
[Diagram: true system behavior decomposed into component models – M1 Aero, M2 Propulsion, M3 Guidance, M4 Seeker, M5 Fuze, M6 Target – each tied to an activity-and-use referent]
[Process diagram: Plan Sequentially for Discovery (factors, responses, and levels) → Execute to Control Uncertainty (analysis plan execution) → Analyze Statistically to Model (performance model, predictions, bounds) → Analysis and Reporting]
28
[Figure: the Plan–Design–Execute–Analyze DOE cycle repeats at every test phase as capability matures across the span of testing]
29
Statistical Methods and DOE
Quality Assurance
31
Example: SDB II Weapon Effectiveness
 Notional live test program considered 7 factors initially in a Resolution IV screening design—16 runs and 4 centers; augmented the 5 significant factors to a face-centered CCD
 Simulated the same experimental runs
 Goal is to see if the two are approximately equal (a design-construction sketch follows)
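As a rough illustration of how such a screening design could be built, the sketch below constructs a 16-run Resolution IV design for 7 two-level factors plus 4 center points in coded units. The generators E = ABC, F = BCD, G = ACD are an assumption made for illustration; the actual SDB II design generators are not given in the slides.

```python
import itertools
import numpy as np

# 16-run base design in factors A-D (full 2^4 in coded -1/+1 units)
base = np.array(list(itertools.product([-1, 1], repeat=4)))
A, B, C, D = base.T

# Assumed Resolution IV generators (the actual SDB II generators are not
# given in the slides): E = ABC, F = BCD, G = ACD
design = np.column_stack([A, B, C, D, A * B * C, B * C * D, A * C * D])

# Append 4 center points, matching the notional 16-run + 4-center screening design
design = np.vstack([design, np.zeros((4, 7))])
print(design.shape)   # (20, 7): 20 runs, 7 factors
```

The five factors found significant at this screening stage would then be carried into a face-centered CCD for the response-surface phase, as the slide describes.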
32
[Plot residue: live vs. simulated comparison plots – Miss Distance, Airspeed, Impact Velocity Error]
34
Summary
Why Statistical Methods in M&S V&V?
 Random variation needs to be recognized and quantified apart from all the factors that truly influence system or subsystem performance
 Brings a discipline that is efficient, covers the operational space better than any other method, and maximizes the knowledge gained from simulation experiments and live testing
How Statistical Methods in M&S V&V?
 Apply experimental design and statistical analyses to live testing (captive, subsystem, end-to-end, live fire, etc.) and to simulation runs for record
 Develop a rigorous approach to simulation V&V at all tiers: model level, subsystem, system
 Don’t neglect culture change: leaders to lead, experts to train, policy & accountability
35
36
Multiple Experiments for a Test: F-15E Radar/Sensor Example
Experiment | Role | Design | Factors / Runs
Vector Search | Excursion | Gen Factorial | 2 / 24
Other EA modes | Excursion | Gen Factorial | 3 / 16
High Off Bore Sight | Excursion | RSM CCD or I-optimal | 5 / 36
CAS Extremely High-Resolution Mapping | Primary | |
Sniper Air-to-Ground Movers | Primary | Fractional Factorial Fraction | 8 / 140
Part II. Equivalence in Curves & Augmenting M&S Designs with Adaptive Sampling Methods
• Often we need to establish that two or more responses over a continuum are equal (e.g., time series, instrumentation data)
• It is possible to take differences at discrete points or summaries (min, max, average, etc.), but these truly miss the functional form
EQUIVALENCE IN CURVES
39
Consider a test of Navigate on Autopilot for an automobile
100 runs on highway testing lane changing functionality
Response is the curve for probability of collision over the 15 second merge period
Input factors: speed, on- coming car speed, aggressiveness (mild to Mad Max), lightness, wetness, and degree of human interaction
40
Prepare/clean the data to create a response that is the probability of collision
Fit a model to the curves via splines, or a Fourier basis if there is seasonal structure
41
Functional Principal Components: eigenfunctions of these basis functions represent different parts of the curves
Note runs 13 and 19 have large positive spikes in collision probability at the end; we represent this with a good dose of (negative) Eigenfunction 4
42
Functional Data Analysis Methodology (3)
Fit a response surface model with the 6 DOE factors to the Functional Principal Component scores
Use generalized regression with the adaptive lasso
Save functional prediction formulas to evaluate the impact of DOE factors over time (a code sketch of this workflow follows)
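The following sketch walks through the same workflow with plain numpy and scikit-learn rather than the functional-data platform the slides appear to use: extract functional principal component scores by SVD of the curve matrix, then fit a penalized regression of each score on the DOE factors. The curve matrix, the factor matrix, and the plain LassoCV (standing in for generalized regression with the adaptive lasso) are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n_runs, n_times = 100, 150               # 100 runs, curve sampled over the 15 s merge
t = np.linspace(0, 15, n_times)
Y = rng.random((n_runs, n_times))        # placeholder: probability-of-collision curves
X = rng.uniform(-1, 1, (n_runs, 6))      # placeholder: 6 coded DOE factors per run

# Functional PCA via SVD of the mean-centered curve matrix
mean_curve = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - mean_curve, full_matrices=False)
eigenfunctions = Vt[:4]                  # first 4 eigenfunctions (shapes of variation)
fpc_scores = U[:, :4] * s[:4]            # FPC scores: one 4-vector per run

# Response-surface model for each FPC score with an L1 (lasso) penalty,
# standing in for the slides' generalized regression with adaptive lasso
models = [LassoCV(cv=5).fit(X, fpc_scores[:, j]) for j in range(4)]

# Reconstruct a predicted curve for a new factor setting x_new
x_new = np.zeros((1, 6))
pred_scores = np.array([m.predict(x_new)[0] for m in models])
pred_curve = mean_curve + pred_scores @ eigenfunctions
```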
43
44
• Cluster analysis of the FPCs can group like curves, with many graphics and metrics
• We have an “Ideal” curve we’d like to use as the reference for an equivalence test
• When FPC 1 is “green”, we have a flat curve with low probability of collision (tan cluster). Note runs 21, 24, 25, and 30 are in this group
Adaptive Sampling in M&S
Propose two important Design of Experiments augmentation methods:
 Gap filling designs
 Automated adaptive sampling at high-gradient regions of the response
Motivate relevance for the NASA/DoD test community
Explain algorithm technical details
Current software augment options typically focus on the factor space
GAP FILLING NEED
• Suppose we want to augment an existing design with runs that fill in gaps in the existing design space with factors X1 and X2. This may be an exaggerated example, but with many factors the “Curse of Dimensionality” produces sparse spaces.
• This could occur either in the context of a true augmentation after the original runs have been collected, or in a phased construction of a single design. For example, we may have started with a D-optimal design and want to add some additional runs.
• 400 runs added (blue); coloring turned off.
CURRENTLY AVAILABLE AUGMENTING OPTION
OUR GAP FILLING APPROACH
• New points are concentrated into the gaps and only occur sparsely throughout already-populated regions, though smaller gaps are also filled in after larger ones.
GAP FILLING METHODOLOGY
1. Overlay the existing design with a space filling design over the allowed space
• Let N be the number of new requested points.
• Currently use Fast Flexible Filling option of Space Filling Design to generate candidate points. Currently generate the maximum of 2000 points or 4*N.
• The “overlay” is done by concatenating the original table with the table of candidate points.
• Create a continuous indicator column(_ind) where the existing points are set to 1 and the candidate points are set to 0.
2. Fit a 10-nearest neighbor predictive model for the _ind indicator column and save the predictions to the combined data table.
3. Delete the original points so that only the new candidate points remain.
4. The points with predicted values of the indicator closest to zero are those that are farthest away from the existing points and will do the best job of filling gaps.
5. Basic solution: take the N points with the smallest value of _ind Predicted, and these are your new gap filling points.
• But this can overemphasize the largest gap. Solution: iterate this process K times (we use K=10), saving N/K new points in each iteration (a Python sketch of these steps follows).
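A rough Python sketch of these five steps, assuming factors coded to [0, 1], a Latin hypercube as the space-filling candidate generator, and scikit-learn's 10-nearest-neighbor regressor; the slides describe the same idea built on JMP's Fast Flexible Filling and KNN platforms.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.neighbors import KNeighborsRegressor

def gap_fill(existing, n_new, k_iter=10, n_candidates=None, seed=0):
    """Pick n_new points that fill gaps left by an existing design (factors coded to [0, 1])."""
    d = existing.shape[1]
    n_candidates = n_candidates or max(2000, 4 * n_new)
    new_points, current = [], existing.copy()
    for it in range(k_iter):
        # 1. Overlay a space-filling candidate set on the existing design
        cand = qmc.LatinHypercube(d=d, seed=seed + it).random(n_candidates)
        # 2. Indicator: existing points = 1, candidates = 0; fit 10-NN and predict
        Xfit = np.vstack([current, cand])
        ind = np.r_[np.ones(len(current)), np.zeros(len(cand))]
        knn = KNeighborsRegressor(n_neighbors=10).fit(Xfit, ind)
        pred = knn.predict(cand)
        # 3-5. Candidates with predictions nearest zero are farthest from the design;
        #      keep N/K of them per iteration so one big gap cannot dominate
        take = cand[np.argsort(pred)[: n_new // k_iter]]
        new_points.append(take)
        current = np.vstack([current, take])
    return np.vstack(new_points)

# Example: fill gaps around an existing 2-factor design (placeholder data that
# deliberately leaves half the design space empty)
rng = np.random.default_rng(4)
existing = rng.random((200, 2)) * np.array([0.5, 1.0])
added = gap_fill(existing, n_new=50)
```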
Gap Filling Methodology
EXAMPLE – GAP IS A HOLLOW CYLINDER: This example adds 1500 runs to fill in a gap that takes the shape of a hollow cylinder in the design space.
Adaptive Sampling Background and Motivation
Explosion of autonomous systems development, test, and production—particularly in DoD
Testing requires complex simulations with many factors and numerous runs requiring near-real time design augmentation
Inherently nonlinear and possibly discontinuous as subtle changes in factor levels in the input space lead to large changes in response values
Desire for new runs to be on or near these boundary conditions; not in the design space where performance is consistent or flat
Best as an iterative process and want new runs to adapt to information learned from previous set
https://www.autonews.com/shift/military-working-make-its-autonomous-technology-smarter https://www.tesla.com/cybertruck
Current Options
 Augment with space filling, hoping random points will fall near boundaries
 Manually augment based on response gradients
 Adaptive sampling algorithms from the Johns Hopkins University Applied Physics Lab Range Adversarial Planning Tool (RAPT): advanced capability, but difficult accessibility and integration for practitioners
 RAPT was the source document for our boundary exploration algorithm development, though we had numerous alternative approaches and enhancements
Example Problem
Consider a Cybertruck’s autonomous navigation system executing a lane change
Nonlinear “plate” response represents likelihood of collision that varies from 2 to 10 (e.g., 20–100%)
Input factors represent scaled vehicle speed and traffic density
Augment with new runs on the performance boundary using Adaptive Sampling
Performance Boundaries
Our solution places new points (asterisks) near the solid red boundary lines
Figure source: The Journal of Systems and Software 137 (2018), p. 206
Boundary Exploration
Augment 500 run design with 200 new points using both the continuous and nominal response
Create extra factor “Random Normal” that is not statistically significant to show predictor variable screening effectiveness
New columns automatically added
PREDICTOR SCREENING
• For each response, the Predictor Screening platform is used to select the factors that are relevant for boundary identification.
• Output hidden by default, can be shown with option to Show Boundary Probability Intermediate Tables and Results.
• No user input is required: the factors with proportion greater than or equal to that specified in the initial GUI are retained.
Relevant GUI Options
Boundary Explorer - Methodology
• For each row (existing design point), the nearest neighbors and their distances are saved from the KNN platform to an intermediate data table.
• This information is used to detect the boundary pairs. A different approach is required for nominal vs. continuous responses.
Boundary Explorer - Methodology
CONTINUOUS RESPONSE (Y)
Score Boundary Probabilities • For each point, calculate the standard
deviation of the response (Y) across the k nearest neighbors (5 in this example).
• Also calculate the mean distance of the 5 nearest neighbors from the parent point.
• Calculate the _Information metric as the product of the standard deviation and the mean distance.
• Points with relatively large values of _Information are either in an area with a steep gradient or in an area farther than typical away from other points (or some combination thereof).
• How do we decide what is a large/medium/small value of _Information?
Intermediate Output
CONTINUOUS RESPONSE (Y) Score Boundary Probabilities
• Fit a Normal 3-Mixture to _Information.
• Use the largest of the three distribution means as the lower cutoff for high probability boundary points.
• Use the intersection of the first and second largest normal densities as the lower cutoff for medium probability boundary points.
Intermediate Output
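A sketch of the continuous-response scoring in Python, assuming k = 5 neighbors and using scikit-learn's GaussianMixture as a stand-in for the Normal 3-Mixture fit; only the high-probability cutoff (the largest mixture mean) is computed here, and the design and response data are placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.mixture import GaussianMixture

def boundary_information(X, y, k=5):
    """Score each point: std of Y among its k nearest neighbors times mean neighbor distance."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                # drop the self-neighbor
    return y[idx].std(axis=1, ddof=1) * dist.mean(axis=1)

def information_cutoff(info):
    """High-probability-boundary cutoff: the largest mean of a 3-component normal mixture."""
    gm = GaussianMixture(n_components=3, random_state=0).fit(info.reshape(-1, 1))
    return gm.means_.ravel().max()

# Placeholder design and response with a sharp performance boundary at x1 = 0.5
rng = np.random.default_rng(5)
X = rng.random((500, 2))
y = np.where(X[:, 0] > 0.5, 10.0, 2.0) + rng.normal(0, 0.2, 500)

info = boundary_information(X, y)
high_cut = information_cutoff(info)
likely_boundary = X[info >= high_cut]   # points flagged as high-probability boundary points
```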
Boundary Explorer - Methodology
True Boundaries and Sections
Create new points with a Random Forest
• The candidate runs with the largest predicted values for the indicator column are those that are closest to known boundary points and are most likely to also represent boundary points. These are sorted in decreasing order and the top runs are taken as RF Method runs.
• While the midpoint and perturbation approaches produce a fixed number of runs, the RF method can be used to generate an arbitrary number of points.
• The RF method considers all responses at once, so any areas that represent boundaries for multiple responses will receive a larger share of additional runs (see the sketch below).
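A sketch of the random-forest step, under the assumption that previously identified boundary points carry an indicator of 1 and that candidates come from a fresh space-filling set; a scikit-learn RandomForestClassifier's predicted probability stands in for the platform's random forest and is used to rank candidates.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import RandomForestClassifier

def rf_boundary_runs(X, is_boundary, n_new, n_candidates=2000, seed=0):
    """Rank space-filling candidates by predicted probability of being a boundary point."""
    # is_boundary must contain both 0s and 1s for the two-class probability below
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, is_boundary)
    cand = qmc.LatinHypercube(d=X.shape[1], seed=seed).random(n_candidates)
    score = rf.predict_proba(cand)[:, 1]      # probability of the "boundary" class
    order = np.argsort(score)[::-1]           # sort candidates in decreasing order
    return cand[order[:n_new]]                # top-ranked runs become the RF-method points

# Usage with the boundary flags from the previous sketch (placeholder threshold):
# new_runs = rf_boundary_runs(X, (info >= high_cut).astype(int), n_new=100)
```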
Boundary Explorer - Methodology
• Green: Random Forest
• This represents the new runs added after one iteration of this tool. It is designed to work in multiple iterations.
• After collecting the response for these new runs, the tool is run again (using both the original space filling runs and the results from the first iteration).
• The wider spread of the perturbation points makes them useful for exploring along previously neglected boundaries.
Other Options
Also an option to add runs near the minimum/maximum/match target (using column property) of continuous responses (fit with a random forest).
Built-in option to add Gap Filling runs
Michalewicz 2D: Initial Space Filling Runs
After Two Iterations
After Two Iterations
Adaptive Sampling Summary
Two new design augmentation tools now available to practitioners
Gap filling offers significant improvement over space filling to target sparse areas of design space
Adaptive sampling is critical in large-scale simulation models and requires judicious choice of new runs; we highly recommend iteratively generating points
Though the motivation came from complex simulation models that need near-real-time augmentation through automation, the methods scale well across many design choices and sizes (factors and runs)
Questions?
Statistical Approaches to V&V and Adaptive Sampling in M&S – April 12, 2021
Part I. Statistical Methods for M&S V&V and Equivalence Methods
The Big Picture of M&S V&V
Definitions
VV&A Process for OT Evaluations*
Equivalence Testing
Comparing Distributions: Kolmogorov-Smirnov Test
Conceptual Framework for Modeling and Simulation
Why – Enhancing M&S V&V
The World As it Is . . . Collecting Evidence
Test Design Space Simulation vs. Live
Controlling Risk
DOD M&S Example: Missile Performance
HOW – Enhancing M&S V&V
Truth & Models & Live Testing
The Process – Statistical M&S V&V
The Span of Testing
Summary
Multiple Experiments for a Test: F-15E Radar/Sensor Example
Part II. Equivalence in Curves & Augmenting M&S Designs with Adaptive Sampling Methods
Equivalence in curves
Analysis Challenge: Temporal Responses
Additional Insights
Gap Filling Need
Adaptive Sampling Background and Motivation
Current Options
Example Problem
Boundary Exploration
Predictor Screening
Other Options
After Two Iterations
After Two Iterations
Adaptive Sampling Summary