STATISTICAL APPROACHES TO V&V AND
ADAPTIVE SAMPLING IN M&S
APRIL 12, 2021
Part I. Statistical Methods for M&S V&V and Equivalence
Methods
THE BIG PICTURE OF M&S V&V: HOW DOES THIS PROCESS TRADITIONALLY WORK?
[Process wheel: Plan → Design → Execute → Analyze, underpinned by Science, DOE, and Test Management]
Definitions
Modeling and Simulation (M&S): “The discipline that comprises
the development and/or
use of models and simulations.” (DoDD 5000.59, DoDI 5000.61)
“The use of models, including emulators, prototypes, simulators,
and stimulators, either statically or over time, to develop data as
a basis for making managerial or technical decisions.” (MSCO
M&S Glossary)
“Using computers to imitate, or simulate, the operations of various
kinds of real-world facilities or processes (a system).” (Law
2015)
Statistical Methods for Modeling and Simulation V&V – Session
1-1
More Definitions
• Verification: The process of determining that a model or
simulation implementation and its associated data accurately
represent the developer's conceptual description and
specifications. (DoDI 5000.61)
• Validation: The process of determining the degree to which a
model or simulation and its associated data are an accurate
representation of the real world from the perspective of the
intended uses of the model. (DoDI 5000.61)
• Accreditation: The official certification that a model or
simulation and its associated data are acceptable for use for a
specific purpose. (DoDI 5000.61)
• Verification: Was the M&S built correctly, according to what was intended?
• Validation: Does the M&S represent the real world accurately enough for its intended use?
• Accreditation: Are the M&S predictions credible enough to inform a decision?
Equivalence Testing
If the means differ by some value, Δ, or more, we can declare the
means inequivalent. Otherwise, we say they are equivalent for our
purposes.
Hypotheses:
H0: μ1 − μ2 ≥ Δ or μ1 − μ2 ≤ −Δ (inequivalent)
HA: μ1 − μ2 < Δ AND μ1 − μ2 > −Δ (equivalent)
[Figure: decision grid: Equivalent only when both one-sided tests reject; Inequivalent otherwise]
Two one-sided test statistics (pooled t, ν degrees of freedom):
t1 = (x̄1 − x̄2 + Δ) / (s_p √(1/n1 + 1/n2)) > t(1 − α, ν)
and
t2 = (x̄1 − x̄2 − Δ) / (s_p √(1/n1 + 1/n2)) < −t(1 − α, ν)
We decide the two group means are equivalent if, and only if, both one-sided null hypotheses are rejected.
Confidence Interval for Equivalence Test
A 100(1 − 2α)% confidence interval for the difference in means:
(x̄1 − x̄2) ± t(1 − α, ν) · SE(x̄1 − x̄2)
If both one-sided tests are rejected, the CI is contained in the ±Δ interval, and we decide equivalence.
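The two one-sided tests can be sketched in Python. This is a minimal illustration with made-up data (the sample values and Δ are hypothetical), assuming equal variances and a pooled t statistic; it is not any particular tool's implementation:

```python
import numpy as np
from scipy import stats

def tost(x1, x2, delta, alpha=0.05):
    """Two one-sided tests for mean equivalence, pooled-variance t statistic."""
    n1, n2 = len(x1), len(x2)
    diff = np.mean(x1) - np.mean(x2)
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    nu = n1 + n2 - 2
    t1 = (diff + delta) / se                 # tests H0: mu1 - mu2 <= -delta
    t2 = (diff - delta) / se                 # tests H0: mu1 - mu2 >= +delta
    p = max(stats.t.sf(t1, nu), stats.t.cdf(t2, nu))
    return diff, p                           # equivalence declared if p < alpha

rng = np.random.default_rng(1)
live = rng.normal(10.0, 1.0, 30)             # hypothetical live results
sim = rng.normal(10.1, 1.0, 30)              # hypothetical simulated results
diff, p = tost(live, sim, delta=1.0)
print(f"diff = {diff:.3f}, TOST p-value = {p:.4f}")
```

The larger of the two one-sided p-values plays the role of the overall TOST p-value, mirroring the "both must reject" rule above.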
Comparing Distributions: Kolmogorov-Smirnov Test
• Nonparametric test that compares the distribution of live data to simulated data
• Computes the maximum distance between the two empirical cumulative distribution functions F(x) and G(x), with sample sizes m and n:
D(m, n) = max over x of |F(x) − G(x)|
with critical value
D_crit = c(α) √((m + n) / (m n))
Reject if D(m, n) ≥ D_crit
• Best used as a simple quick look to see if the distributions are even close to one another; it does not require assuming a specific distribution
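The two-sample test above is available directly in scipy; a minimal sketch with made-up detection-time data (values and sample sizes hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
live = rng.normal(5.0, 1.0, 40)         # e.g., live detection times (m = 40)
simulated = rng.normal(5.0, 1.0, 400)   # e.g., simulated detection times (n = 400)

# D = maximum distance between the two empirical CDFs, plus a p-value
d, p = stats.ks_2samp(live, simulated)
print(f"D = {d:.3f}, p-value = {p:.3f}")
```

A small D (large p) gives no evidence the live and simulated distributions differ; as the slide notes, this is a quick look, not a full validation.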
Kolmogorov-Smirnov Test Example
• Detection times in a missile warning system
Comparing Distributions: Fisher’s Combined Probability Test
• Scenario: air-to-air missile test with some live tests and many simulated miss distances under the same conditions
• Determine tail probabilities: the likelihood each live test point falls within the contours of the simulation (more on this coming up)
Comparing Distributions: Fisher’s Combined Probability Test
• Null hypothesis: the live data are representative of the simulated reference distributions across the live test space
• Test statistic is computed from the individual tail probabilities, p_i: X = −2 Σ ln(p_i)
• Reference distribution is chi-square with 2 × (number of live tests) degrees of freedom (here, test statistic X = 19.33 and overall p-value = 0.25)
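A short sketch of the combined test, using hypothetical tail probabilities (the p_i values below are made up, not the deck's data); scipy also provides the test directly:

```python
import numpy as np
from scipy import stats

p_i = np.array([0.42, 0.77, 0.08, 0.55, 0.31, 0.64, 0.19, 0.90])  # hypothetical tail probs

X = -2 * np.sum(np.log(p_i))      # Fisher's combined test statistic
df = 2 * len(p_i)                 # chi-square degrees of freedom: 2 * (number of tests)
p_overall = stats.chi2.sf(X, df)

X2, p2 = stats.combine_pvalues(p_i, method="fisher")  # same computation via scipy
print(f"X = {X:.2f}, df = {df}, overall p = {p_overall:.3f}")
```

Because X sums −2 ln(p_i), a single very small p_i dominates the sum, which is exactly the caveat raised on the next slide.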
Comparing Distributions: Fisher’s Combined Probability Test
• Fisher’s Combined Probability Test ideas:
A single low p-value can drive the test statistic
There is no consideration of validating the factor space between live and simulated
The actual test had 16 live test events, so there was opportunity to learn about the factors
Conceptual Framework for Modeling and Simulation
*figure from VV&A Recommended Practices Guide, DMSCO
WHY – ENHANCING M&S V&V RISK AND COVERING THE OPERATIONAL
SPACE
The World As it Is . . . Collecting Evidence
• Every observation collected has two components: signal and noise
• We must distinguish between the two using the discipline of statistics, so organizations can make better, higher-impact decisions
• Random variation (noise) can be reduced by increasing sample size, which also gives greater clarity to the signal component
Ref: Walter Shewhart, 1924, and Lynn Hare, 2007, “10 Quality Basics,” Quality Progress
y = f(x) + ε
Test Design Space: Simulation vs. Live
Strategies for selecting the design space and points for live and simulation-based testing:
• Ideal: the live design encompasses the simulation design space, so comparisons between them are interpolations, not extrapolations.
• Limitation in live space: often due to practical constraints that exist in live testing. Here the domain of the live testing should span the maximum possible domain of the simulation experiment, and regions of extrapolation should be clearly identified in the validation limitations.
[Figure panels: Ideal; Limited Live]
Controlling Risk
Example: Does clutter degrade Small Diameter Bomb (SDB) II miss distances? Compute the difference between low/high-clutter misses.
Risks (errors) defined: Type I error = α; Type II error = β; 1 − β = Power
• Bowl I. SDB II resists clutter: the distribution of differences averages 0. α = the probability of drawing samples with big differences between low/high-clutter conditions.
• Bowl II. SDB II fails in clutter: the distribution of differences averages δ. β = the probability of drawing samples with small differences between low/high-clutter conditions.
[Figure: difference distributions centered at 0 and 0 + δ, with shaded α and β regions]
6 factors here, 6 more to go
How many points? 2^6 two-level points = 64; 2^12 points = 4,096; 4^12 > 16 million!
Factors: Target Type, Num Weapons, Target Angle on Nose, Release Altitude, Release Velocity, Release Heading, Target Downrange, Target Crossrange, Impact Azimuth (°), Fuze Point, Fuze Delay, Impact Angle (°)
[Table: factors mapped across the test phases (Plan, Design, Execute, Analyze), underpinned by Science and Test Management]
DOD M&S Example: Missile Performance
• Estimate miss distance and probability of kill for missiles
intercepting threats
Image courtesy of US Air Force, approved for public release
HOW – ENHANCING M&S V&V HOW TO BEST ADDRESS THE VALIDATION
PROCESS
M&S and Live and Statistical VV&A
True System Behavior
Models: M1 Aero, M2 Propulsion, M3 Guidance, M4 Seeker, M5 Fuse, M6 Target
Activity and Use: Referent, Analysis Plan, Execution, Analysis and Reporting
• Analyze statistically to model performance: model, predictions, bounds
• Plan sequentially for discovery: factors, responses, and levels
• Execute to control uncertainty
Capability
[Figure: capability builds across repeated Plan → Design → Execute → Analyze cycles, each supported by Science and DOE]
Statistical Methods and DOE
Quality Assurance
Example: SDB II Weapon Effectiveness
• A notional live test program initially considered 7 factors in a Resolution IV screening design (16 runs and 4 center points), then augmented the 5 significant factors to a face-centered CCD
• Simulated the same experimental runs
• Goal is to see if the two are approximately equal
[Plots: Miss Distance vs. Airspeed; Impact Velocity Error]
Summary
Why statistical methods in M&S V&V?
• Random variation needs to be recognized and quantified apart from all the factors that truly influence system or subsystem performance
• Brings a discipline that is efficient, covers the operational space better than any other method, and maximizes the knowledge gained from simulation experiments and live testing
How?
• Apply experimental design and statistical analyses to live testing (captive, subsystem, end-to-end, live fire, etc.) and to simulation runs for record
• Develop a rigorous approach to simulation V&V at all tiers: model level, subsystem, system
• Don’t neglect culture change: leaders to lead, experts to train, policy and accountability
Multiple Experiments for a Test: F-15E Radar/Sensor Example

Experiment                               Role       Design                   Factors / Runs
Vector Search                            Excursion  General Factorial        2 / 24
Other EA modes                           Excursion  General Factorial        3 / 16
High Off-Boresight                       Excursion  RSM (CCD or I-optimal)   5 / 36
CAS Extremely High-Resolution Mapping    Primary
Sniper Air-to-Ground Movers              Primary    Fractional Factorial     8 / 140
Augmenting M&S Designs with Adaptive Sampling Methods
• Often we need to establish that two or more responses over a continuum are equal (e.g., time series, instrumentation data)
• It is possible to take differences at discrete points (or min, max, average, etc.), but doing so misses the functional form
EQUIVALENCE IN CURVES
Consider a test of Navigate on Autopilot for an automobile:
• 100 runs on a highway, testing lane-changing functionality
• Response is the curve of probability of collision over the 15-second merge period
• Input factors: speed, oncoming-car speed, aggressiveness (mild to Mad Max), lightness, wetness, and degree of human interaction
Prepare/clean data to create the response: probability of collision
Fit a model to the curves via splines, or a Fourier basis if there is seasonal structure
Eigenfunctions of these basis functions represent different parts of the curves
Note runs 13 and 19 have large positive spikes in collision probability at the end; we represent this with a good dose of (negative) Eigenfunction 4
Functional Data Analysis Methodology (3)
Fit response surface model with 6 DOE factors to Functional
Principal Component Scores
Use generalized regression with adaptive lasso
Save functional prediction formulas to evaluate impact of DOE
factors over time
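The three steps above can be sketched in Python. This is a rough stand-in, not the actual tooling: PCA on discretized curves approximates the spline-basis FPCA, LassoCV stands in for the adaptive lasso, and all data (curves, factors, effect sizes) are simulated for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n_runs, n_time = 40, 60
t = np.linspace(0, 15, n_time)                  # 15-second merge period
factors = rng.uniform(-1, 1, size=(n_runs, 6))  # 6 DOE factors

# hypothetical curves: factor 0 shifts the level, factor 1 drives a late spike
curves = (0.2 + 0.1 * factors[:, [0]]
          + 0.3 * factors[:, [1]] * np.exp(-((t - 13.0) ** 2) / 2.0)
          + rng.normal(0, 0.01, size=(n_runs, n_time)))

pca = PCA(n_components=4).fit(curves)   # components play the role of eigenfunctions
scores = pca.transform(curves)          # FPC scores, one row per run

# sparse response-surface fit: first FPC score regressed on the DOE factors
lasso = LassoCV(cv=5).fit(factors, scores[:, 0])
print(np.round(lasso.coef_, 2))
```

The fitted coefficients show which DOE factors drive each functional principal component, which is what the saved prediction formulas evaluate over time.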
• Cluster analysis of FPCs can group like curves, with many graphics and metrics
• We have an “ideal” curve we’d like to use as the reference for an equivalence test
• When FPC 1 is “green,” we have a flat curve with low probability of collision (tan cluster); note runs 21, 24, 25, and 30 are in this group
Adaptive Sampling in M&S
We propose two important design-of-experiments augmentation methods:
• Gap-filling designs
• Automated adaptive sampling at high-gradient regions of the response!
We motivate their relevance for the NASA/DoD test community and explain the algorithms’ technical details. Current software augmentation options typically focus on the factor space.
GAP FILLING NEED
• Suppose we want to augment an existing design with runs that fill in gaps in the existing design space with factors X1 and X2. This may be an exaggerated example, but with many factors the “curse of dimensionality” yields sparse spaces.
• This could occur either in the context of a true augmentation after the original runs have been collected, or in a phased construction of a single design. For example, we may have started with a D-optimal design and want to add some additional runs.
• 400 runs added (blue)
• Coloring turned off.
CURRENTLY AVAILABLE AUGMENTING OPTION
OUR GAP FILLING APPROACH
• New points are concentrated into the gaps and only occur sparsely
throughout
already-populated regions, though smaller gaps are also filled in
after larger
ones.
GAP FILLING METHODOLOGY
1. Overlay the existing design with a space filling design over the
allowed space
• Let N be the number of new requested points.
• We currently use the Fast Flexible Filling option of the Space Filling Design platform to generate candidate points: the larger of 2000 and 4N points.
• The “overlay” is done by concatenating the original table with
the table of candidate points.
• Create a continuous indicator column (_ind) where the existing points are set to 1 and the candidate points are set to 0.
2. Fit a 10-nearest neighbor predictive model for the _ind
indicator column and save the predictions to the combined data
table.
3. Delete the original points so that only the new candidate points
remain.
4. The points with predicted values of the indicator closest to
zero are those that are farthest away from the existing points and
will do the best job of filling gaps.
5. Basic solution: take the N points with the smallest value of
_ind Predicted, and these are your new gap filling points.
• But this can overemphasize the largest gap. Solution: iterate
this process K times (we use K=10), saving N/K new points in each
iteration
Gap Filling Methodology
EXAMPLE: GAP IS A HOLLOW CYLINDER
This example adds 1500 runs to fill in a gap that takes the shape of a hollow cylinder in the design space.
Adaptive Sampling Background and Motivation
Explosion of autonomous systems development, test, and
production—particularly in DoD
Testing requires complex simulations with many factors and numerous
runs requiring near-real time design augmentation
Inherently nonlinear and possibly discontinuous as subtle changes
in factor levels in the input space lead to large changes in
response values
Desire for new runs to be on or near these boundary conditions; not
in the design space where performance is consistent or flat
Best as an iterative process and want new runs to adapt to
information learned from previous set
https://www.autonews.com/shift/military-working-make-its-autonomous-technology-smarter
https://www.tesla.com/cybertruck
Current Options
• Augment with space filling, hoping random points will fall near boundaries
• Manually augment based on response gradients
• Adaptive sampling algorithms from the Johns Hopkins University Applied Physics Lab Range Adversarial Planning Tool (RAPT): advanced capability, but difficult for practitioners to access and integrate. This was the source document for our boundary-exploration algorithm development, though we took numerous alternative approaches and made enhancements.
Example Problem
Consider Cybertruck’s autonomous navigation system executing a lane change. A nonlinear “plate” response represents the likelihood of collision, varying from 2 to 10 (e.g., 20-100%).
Input factors represent scaled vehicle speed and traffic
density
Augment with new runs on the performance boundary using Adaptive
Sampling
Performance Boundaries
Our solution places new points (asterisks) near the solid red boundary lines.
Figure source: The Journal of Systems and Software 137 (2018) pp
206
Boundary Exploration
Augment a 500-run design with 200 new points using both the continuous and nominal responses. Create an extra factor, “Random Normal,” that is not statistically significant, to demonstrate the effectiveness of predictor-variable screening.
New columns automatically added
PREDICTOR SCREENING
• For each response, the Predictor Screening platform is used to
select the factors that are relevant for boundary
identification.
• Output hidden by default, can be shown with option to Show
Boundary Probability Intermediate Tables and Results.
• No user input is required: the factors with proportion greater
than or equal to that specified in the initial GUI are
retained.
Relevant GUI Options
Boundary Explorer - Methodology
• For each row (existing design point), the nearest neighbors and
their distances are saved from the KNN platform to an intermediate
data table.
• This information is used to detect the boundary pairs. A
different approach is required for nominal vs. continuous
responses.
Boundary Explorer - Methodology
CONTINUOUS RESPONSE (Y)
Score Boundary Probabilities • For each point, calculate the
standard
deviation of the response (Y) across the k nearest neighbors (5 in
this example).
• Also calculate the mean distance of the 5 nearest neighbors from
the parent point.
• Calculate the _Information metric as the product of the standard
deviation and the mean distance.
• Points with relatively large values of _Information are either in
an area with a steep gradient or in an area farther than typical
away from other points (or some combination thereof).
• How do we decide what is a large/medium/small value of _Information?
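The _Information metric can be sketched in Python; this is an illustration on a synthetic design with a made-up sharp boundary (NearestNeighbors from scikit-learn stands in for the KNN platform):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 2))            # existing design points
y = np.where(X[:, 0] + X[:, 1] > 0, 10.0, 2.0)   # sharp boundary along x1 + x2 = 0

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: first neighbor is the point itself
dist, idx = nn.kneighbors(X)
sd = y[idx[:, 1:]].std(axis=1, ddof=1)           # std of Y across the 5 nearest neighbors
mean_dist = dist[:, 1:].mean(axis=1)             # mean distance to those neighbors
information = sd * mean_dist                     # large: steep gradient and/or sparse area
print(information.max())
```

Points whose neighborhoods straddle the boundary get nonzero sd and hence large _Information, while points deep inside either flat region score near zero.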
Intermediate Output
CONTINUOUS RESPONSE (Y) Score Boundary Probabilities
• Fit a Normal 3-Mixture to _Information.
• Use the largest of the three distribution means as the lower
cutoff for high probability boundary points.
• Use the intersection of the first and second largest normal
densities as the lower cutoff for medium probability boundary
points.
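The cutoff step can be sketched with scikit-learn's GaussianMixture standing in for the Normal 3-Mixture fit; the _Information scores below are simulated for illustration, and the medium cutoff (density intersection of the two largest components) is omitted for brevity:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# hypothetical _Information scores: many small, some medium, a few large
info = np.concatenate([rng.normal(0.1, 0.02, 200),
                       rng.normal(0.5, 0.05, 60),
                       rng.normal(1.2, 0.10, 20)])

gm = GaussianMixture(n_components=3, random_state=0).fit(info.reshape(-1, 1))
means = np.sort(gm.means_.ravel())
high_cutoff = means[-1]  # scores above the largest component mean: high probability
print(f"component means: {np.round(means, 2)}, high cutoff: {high_cutoff:.2f}")
```

Fitting a mixture instead of picking a fixed threshold lets the cutoffs adapt to however the _Information scores happen to be distributed on a given problem.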
Intermediate Output
Boundary Explorer - Methodology
Boundary Explorer - Methodology
True Boundaries and Sections
True Boundaries and Sections
Create new points with a Random Forest
• The candidate runs with the largest predicted values for the
indicator column are those that are closest to known boundary
points and are most likely to also represent boundary points. These
are sorted in decreasing order and the top runs are taken as RF
Method runs.
• While the midpoint and perturbation approaches produce a fixed
number of runs, the RF method can be used to generate an arbitrary
number of points.
• The RF method considers all responses at once, so any areas that
represent boundaries for multiple responses will receive a larger
share of additional runs.
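The RF method can be sketched as follows; this is a stand-in illustration with a hypothetical boundary indicator (in the tool, the labels come from the detected boundary pairs, and all responses are considered at once):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(11)
X = rng.uniform(-1, 1, size=(400, 2))                          # existing design points
is_boundary = (np.abs(X[:, 0] + X[:, 1]) < 0.15).astype(int)   # hypothetical labels

# indicator model: 1 = near a known boundary point
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, is_boundary)

cand = rng.uniform(-1, 1, size=(2000, 2))      # candidate runs
score = rf.predict_proba(cand)[:, 1]           # predicted boundary likelihood
n_new = 25                                     # arbitrary: the RF method count is flexible
new_runs = cand[np.argsort(score)[::-1][:n_new]]
print(new_runs.shape)
```

Sorting candidates by predicted probability and taking the top n_new is what makes the RF method's run count arbitrary, unlike the fixed-count midpoint and perturbation approaches.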
Boundary Explorer - Methodology
• Green: Random Forest
• This represents the new runs added after one iteration of this
tool. It is designed to work in multiple iterations.
• After collecting the responses for these new runs, the tool is run again (using both the original space-filling runs and the results from the first iteration).
• The wider spread of the perturbation points makes them useful for
exploring along previously neglected boundaries.
Boundary Explorer - Methodology
Other Options
Also an option to add runs near the minimum/maximum/match target
(using column property) of continuous responses (fit with a random
forest).
Built-in option to add gap-filling runs
Michalewicz 2D: Initial Space Filling Runs
After Two Iterations
Adaptive Sampling Summary
Two new design augmentation tools now available to
practitioners
Gap filling offers a significant improvement over space filling for targeting sparse areas of the design space
Adaptive sampling is critical in large-scale simulation models and requires a judicious choice of new runs; we highly recommend generating points iteratively
Though the motivation came from complex simulation models needing near-real-time augmentation through automation, the methods scale well across many design choices and sizes (factors and runs)
Questions?