MITRE

Developing a Common Framework for Model Evaluation

Panel Discussion, 7th Transport and Diffusion Conference
George Mason University, 19 June 2003

Dr. Priscilla A. Glasow, The MITRE Corporation






Topics

- Joint Effects Model evaluation approach
- Model evaluation in the Department of Defense
  - Methods
  - Evaluation method deficiencies
  - Lessons learned in building a common evaluation framework


JEM Model and Evaluation Approach


CBRN Hazard Model Components

[Diagram of CBRN hazard model components, including decay and deposition processes]


System Description

- Models the transport and diffusion of hazardous materials in near real time
  - CBRN hazards
  - Toxic industrial chemicals / materials
- Modeling approach includes
  - Weather
  - Terrain / vegetation / marine environment
  - Interaction with materials
- Depicts the hazard area, concentrations, lethality, etc.
  - Overlays on command and control system maps
- Will deploy on C4I systems and standalone
  - Where JWARN is not needed, but hazard prediction is needed


System Description (continued)

- Legacy systems being replaced
  - HPAC, a DTRA S&T effort
  - VLSTRACK, a NAVSEA S&T effort
  - D2PUFF, a USA S&T effort
- Improved capabilities
  - One model, accredited for all uses currently supported by the 3 Interim Accredited DoD Hazard Prediction Models
  - Warfighter-validated graphical user interface
  - Uses only Authoritative Data Sources
  - Integrated into Service command and control systems
  - Flexible and extensible open architecture
  - Seamless weather data transfer from provider of choice
  - Planned program sustainment and warfighter support
  - Schoolhouse and embedded training
  - Reach back


Calculating Hazard Predictions

How might a hazard probability be calculated?

P_HAZ = P_HAZ/VULN × P_VULN/OCC × P_OCC

where
- P_HAZ = probability of the hazard
- P_HAZ/VULN = prob. of the hazard for a given vulnerability
- P_VULN/OCC = prob. of vulnerability for a given occurrence
- P_OCC = prob. of the occurrence


Example: Calculating Casualties

The probability of casualties can be defined as:

P_CAS = P_CAS/CONT × P_CONT/EXP × P_EXP

where
- P_CAS = prob. of casualties
- P_CAS/CONT = prob. of casualties for a given contamination
- P_CONT/EXP = prob. of contamination for a given exposure
- P_EXP = prob. of exposure


Probability of Casualties

- P_CAS/CONT — probability of casualties for a given contamination
  - Toxicity is already a probability
  - Data for inhalation or contact
- P_CONT/EXP — probability of contamination for a given exposure
  - Can be empirical or computed by JOEF
  - Can account for protection levels (by inverting protection factors, the ratios of concentrations outside a system to inside a system)
- P_EXP — probability of exposure
  - Performed by JEM's Atmospheric Transport and Dispersion algorithms

[Diagram: MET DATA SOURCE, TERMS, T&D PHYSICS]
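The protection-factor inversion mentioned for P_CONT/EXP is simple to illustrate: a protection factor (PF) is the outside-to-inside concentration ratio, so dividing by it scales an outside concentration down to what a protected person actually sees. A minimal sketch with hypothetical numbers (the function name and values are mine, not JEM's):

```python
def inside_concentration(outside_conc, protection_factor):
    """Scale an outside concentration to the inside of a protective system
    by inverting the protection factor:
    PF = C_outside / C_inside, so C_inside = C_outside / PF."""
    if protection_factor <= 0:
        raise ValueError("protection factor must be positive")
    return outside_conc / protection_factor

# Hypothetical: 50 mg/m^3 outside a mask rated at a protection factor of 1000.
print(inside_concentration(50.0, 1000.0))  # 0.05
```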


Falsely and Falsely Not Warned

Adapted from IDA work by Dr. Steve Warner

[Diagram: observed and predicted hazard areas with regions labeled AFW, AOL, FNW]

- Can compute fractions of areas (under curves) of FW and FNW to observed and/or predicted areas.
- Given enough field trial data, means of FW and FNW can be used as probabilities.
- Accuracy is then 1 − FW and 1 − FNW (false positive accuracy and false negative accuracy).
- Propose a tentative ~0.2 FW accuracy standard and a tentative ~0.4 FNW accuracy standard, which 1) HPAC has demonstrated and 2) is apparently accepted by HPAC users.
- Comparisons to ATP-45 (as current JORD language reads) may be problematic for FNW.
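One plausible way to turn the area picture into FW and FNW fractions (an illustrative reading of the IDA-style area comparison, not Dr. Warner's exact formulation): treat FW as the warned-but-clean share of the predicted area and FNW as the contaminated-but-unwarned share of the observed area.

```python
def warning_fractions(area_obs, area_pred, area_overlap):
    """Illustrative FW/FNW fractions from observed vs. predicted hazard areas.

    area_obs     -- area actually contaminated (observed)
    area_pred    -- area the model warned (predicted)
    area_overlap -- area both contaminated and warned
    """
    fw = (area_pred - area_overlap) / area_pred   # falsely warned
    fnw = (area_obs - area_overlap) / area_obs    # falsely not warned
    return fw, fnw

# Hypothetical trial: 10 km^2 observed, 12 km^2 predicted, 9 km^2 overlap.
fw, fnw = warning_fractions(10.0, 12.0, 9.0)
print(round(fw, 3), round(fnw, 3))  # 0.25 0.1
# Compare against the tentative standards (FW <= ~0.2, FNW <= ~0.4):
print(fw <= 0.2, fnw <= 0.4)  # False True
```

Averaging these fractions over many field trials gives the trial-mean FW and FNW that the slide proposes treating as probabilities.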


Transport and Diffusion Statistical Comparisons

- ASTM D 6589-00, Standard Guide for Statistical Evaluation of Atmospheric Dispersion Model Performance
  - Geometric mean, geometric variance, bias (absolute, fractional), normalized mean square error, correlation coefficient, robust highest concentration
- Use all available Validation Database data
- Allows comparison to other T&D models and legacy codes
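The paired performance measures listed above are standard and easy to compute from matched observed/predicted concentrations. A minimal sketch of four of them; the function name is mine, and the correlation coefficient and robust highest concentration are omitted for brevity:

```python
import math

def dispersion_stats(obs, pred):
    """Common dispersion-model evaluation measures for paired, positive
    concentrations (obs and pred must be equal-length sequences):

    FB   = 2 * (mean_obs - mean_pred) / (mean_obs + mean_pred)   fractional bias
    NMSE = mean((obs - pred)^2) / (mean_obs * mean_pred)         normalized mean square error
    MG   = exp(mean(ln(obs / pred)))                             geometric mean bias
    VG   = exp(mean(ln(obs / pred)^2))                           geometric variance
    """
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    log_ratios = [math.log(o / p) for o, p in zip(obs, pred)]
    return {
        "FB": 2.0 * (mean_o - mean_p) / (mean_o + mean_p),
        "NMSE": sum((o - p) ** 2 for o, p in zip(obs, pred)) / (n * mean_o * mean_p),
        "MG": math.exp(sum(log_ratios) / n),
        "VG": math.exp(sum(r * r for r in log_ratios) / n),
    }

# A model that overpredicts every concentration by exactly a factor of 2:
stats = dispersion_stats([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])
print(round(stats["MG"], 3))  # 0.5 (geometric mean bias of a uniform 2x overprediction)
```

A perfect model gives FB = 0, NMSE = 0, MG = 1, and VG = 1, which makes these measures convenient for side-by-side comparison of JEM against the legacy codes on the same validation data.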


Anticipated Approach

- Use both the IDA methodology and the T&D statistics against the validation database
- The IDA metrics make it easy for the warfighter to understand and visualize the threat
- T&D statistics provide a basis for comparison to legacy codes (comparison of observed and predicted means)


Model Evaluation in the Department of Defense


Model Evaluation Methods

- The Department of Defense (DoD) assesses the credibility of models and simulations through a process of Verification, Validation and Accreditation (VV&A)
- The JEM model evaluation approach will follow DoD VV&A policy and procedures
  - DoD Instruction 5000.61, May 2003
  - DoD VV&A Recommended Practices Guide, November 1996


Current Methods Deficiencies: Science View

- Inadequate use of statistical methods for assessing the credibility of models and simulations
- Over-reliance on subjective methods, such as subject matter experts, whose credentials are not documented and who may not be current or unbiased


Current Method Deficiencies: User View

- Insufficient guidance on what techniques are available and how to select suitable techniques for a given application
- Overemphasis on using models for their own sake rather than as tools to support analysis
- Failure to carefully and clearly define the problem for which models are being used to provide insights or solutions
- Not understanding the inherent limitations of models


Is Developing a Common Framework an Achievable Goal?

- The Military Operations Research Society (MORS) sponsored a workshop in October 2002 to address the status and health of VV&A practices in DoD
- An ad hoc committee was also formed by Mr. Walt Hollis, Deputy Undersecretary of the Army (Operations Research), to address general concerns regarding capability and expectations for using M&S in acquisition


Best Practices: An Initial Set of Goals for Developing a Common Framework for Model Evaluation

- Understand the problem
- Determine use of M&S
- Focus accreditation criteria
- Scope the V&V
- Contract to get what you need


Best Practices: Understand the Problem

- Understand the problem/questions so that you can develop sound criteria against which to evaluate possible solutions.
- This first step is frequently disregarded:
  - Too hard to do
  - Assumption that the problem is "clearly evident"
  - No real understanding of what the problem is
- Failure to understand the problem results in:
  - Unfocused use of M&S
  - Unbounded VV&A
  - Unnecessary expenditure of time and money without meaningful results


Best Practices: Determine Use of M&S

- M&S should be used to:
  - help critical thinking
  - generate hypotheses
  - provide insights
  - perform sensitivity analyses to identify logical consequences of assumptions
  - generate imaginable scenarios
  - visualize results
  - crystallize thinking
  - suggest experiments whose importance can only be demonstrated by the use of M&S
- Know up front what you will use M&S for:
  - What part of the problem can be answered by M&S?
  - What requirements do you have that M&S can address?
- Use M&S appropriately: if the model isn't sufficiently accurate vis-à-vis the test issues, it may lead to flawed information for decisions, such as test planning and execution.


Best Practices: Focus Accreditation Criteria

- Focus the accreditation on the criteria; focus the VV&A effort on reducing program risk.
- Accreditation criteria are best developed through collaboration among all stakeholders:
  - Get System Program Office buy-in on the M&S effort
  - Foster collaboration with testers, analysts, and model developers…and key decisionmakers
  - Build the training simulation first to elicit the user's needs and ideas about "key criteria" before building the system itself. If this is not feasible, think "training" and ultimate use to define key criteria.


Best Practices: Scope the V&V

- Focus the accreditation on the criteria, as that will properly focus the verification and validation (V&V) effort.
- Write the accreditation plan so that it scopes and bounds the V&V effort.
- Plan the V&V effort in advance to use test data to validate the model:
  - Need to ensure that the data are used correctly
  - Need to document the conditions under which the data were collected


Best Practices: Contract to Get What You Need

- Include M&S and VV&A requirements in Requests for Proposals to obtain bids and estimated costs:
  - Require that bidders provide a conceptual model that can be used to assess the bidder's understanding of the problem
  - Make M&S and VV&A documentation deliverables under the contract
  - Competition among bidders will generate alternative models and a broader range of ideas
- Require that the Government announce what M&S it will use as part of its source selection
- Ensure that the VV&A team is multidisciplinary and includes experts who understand the mathematical and physical underpinnings of the model
- Ensure that Subject Matter Experts (SMEs) have relevant and current expertise, that these qualifications are documented, and that the rationale of each SME for his or her judgments and recommendations is recorded


Other Goals for Developing a Common Approach for Model Evaluation

- Consensus and commitment to developing and using credible models
- Adequate resources to conduct credible model evaluation efforts:
  - Qualified evaluators
  - Funding
- Provide program incentives to those who are expected to perform model evaluations


Other Goals for Developing a Common Approach for Model Evaluation

- Use independent evaluators, not the model developers, to obtain unbiased evaluation results
- Institutionalize the use and reuse of models that have documented credibility
- Establish standards for model evaluation performance to ensure quality and integrity of evaluation efforts