
Page 1: Introduction to Forecast Verification

Introduction to Forecast Verification

Laurence J. Wilson

MSC Dorval Quebec

Page 2: Introduction to Forecast Verification

Outline

• Why verify: Principles, goodness and goals of verification
• General framework for verification (Murphy-Winkler)
  – Joint and conditional distributions
  – Scores and measures in the context of the framework
  – Murphy’s attributes of forecasts
• Value of forecasts
• Model verification
  – Data matching issues
  – Verification issues
• Verification of probability distributions
• Tricks of the trade
• Conclusion

Page 3: Introduction to Forecast Verification

References

• Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press. Chapter 7.
• Stanski, H.R., L.J. Wilson, and W.R. Burrows, 1990: Survey of common verification methods in meteorology. WMO WWW Technical Report No. 8. Also available on the web – see below.
• Jolliffe, I.T. and D.B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley.
• Murphy, A.H. and R.L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
• Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293.

Web page of the joint WGNE/WWRP Verification Working Group:

http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html

Has lots of links and lots of information (and is changing all the time).

ANNOUNCEMENT: This group is organizing a workshop on verification methods in or near Montreal, Quebec, September 13 to 17, 2004.

Page 4: Introduction to Forecast Verification

Evaluation of forecasts

• Murphy’s “goodness”
  – CONSISTENCY: forecasts agree with the forecaster’s true belief about the future weather [strictly proper scores]
  – QUALITY: correspondence between observations and forecasts [verification]
  – VALUE: increase or decrease in economic or other kind of value to someone as a result of using the forecast [decision theory]

Page 5: Introduction to Forecast Verification

Evaluation of forecast system

• Evaluation of forecast “goodness”
• Evaluation of the delivery system
  – timeliness (are forecasts issued in time to be useful?)
  – relevance (are forecasts delivered to intended users in a form they can understand and use?)
  – robustness (level of errors or failures in the delivery of forecasts)

Page 6: Introduction to Forecast Verification

Principles of (Objective) Verification

• Verification activity has value only if the information generated leads to a decision about the forecast or system being verified.
  – The user of the information must be identified.
  – The purpose of the verification must be known in advance.
• No single verification measure provides complete information about the quality of a forecast product.
• The forecast must be stated in such a way that it can be verified.
  – What does a “chance” of showers mean?
  – What does that gridpoint value really mean?
• Except for specific validation studies, verification should be carried out independently of the issuer of the product.

Page 7: Introduction to Forecast Verification

Goals of Verification

• Administrative
  – Justify the cost of provision of weather services
  – Justify additional or new equipment
  – Monitor the quality of forecasts and track changes
• Scientific
  – To identify the strengths and weaknesses of a forecast product in sufficient detail that actions can be specified that will lead to improvements in the product, i.e., to provide information to direct R&D.

Page 8: Introduction to Forecast Verification

Verification Model (cont’d)

• Predictand types
  – Continuous: the forecast is a specific value of the variable
    • wind
    • temperature
    • upper-air variables
  – Categorical/probabilistic: the forecast is the probability of occurrence of ranges of values of the variable (categories)
    • POP and other weather elements (fog, etc.)
    • precipitation type
    • cloud amount
    • precipitation amount
  – Probability distributions (ensembles)

Page 9: Introduction to Forecast Verification

Framework for Verification (Murphy-Winkler)

• All verification information is contained in the joint distribution of forecasts and observations, p(f, x).
• Factorizations:
  – Calibration-refinement:
    p(f, x) = p(x|f) p(f)
    p(x|f) = conditional distribution of observations given the forecast (calibration/reliability)
    p(f) = marginal distribution of forecasts (refinement)
  – Likelihood-base rate:
    p(f, x) = p(f|x) p(x)
    p(f|x) = conditional distribution of forecasts given the observation (likelihood/discrimination)
    p(x) = marginal distribution of observations (base rate/climatology)
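For concreteness, here is a minimal Python sketch of both factorizations estimated from a sample of binary forecast-observation pairs (the arrays f and x are made-up illustrative data, not from any real verification sample):

import numpy as np

# Minimal sketch: estimate both factorizations of the joint distribution
# from a sample of categorical (binary) forecast-observation pairs.
f = np.array([0, 1, 1, 0, 1, 0, 1, 1])  # forecast category (illustrative)
x = np.array([0, 1, 0, 0, 1, 1, 1, 1])  # observed category (illustrative)

# Joint distribution p(f, x) as a 2x2 table of relative frequencies
joint = np.zeros((2, 2))
for fi, xi in zip(f, x):
    joint[fi, xi] += 1
joint /= joint.sum()

p_f = joint.sum(axis=1)             # marginal p(f): refinement
p_x = joint.sum(axis=0)             # marginal p(x): base rate
p_x_given_f = joint / p_f[:, None]  # calibration-refinement factor p(x|f)
p_f_given_x = joint / p_x[None, :]  # likelihood-base rate factor p(f|x)

# Check: both factorizations recover the joint distribution
assert np.allclose(p_x_given_f * p_f[:, None], joint)
assert np.allclose(p_f_given_x * p_x[None, :], joint)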

Page 10: Introduction to Forecast Verification

Verification Samples

• The joint distribution of forecasts and observations may be:
  – A time series at points
  – One or more spatial fields
  – A combination of these
• In meteorological applications:
  – The events of the sample are usually not even close to independent in the statistical sense
  – Hence the importance of carefully assessed confidence limits on verification results (see the sketch below)
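One common way to get such confidence limits while respecting the serial correlation of the sample is a moving-block bootstrap. A hedged sketch follows; the block length, the score, and the data are illustrative assumptions, not prescriptions:

import numpy as np

rng = np.random.default_rng(42)

def block_bootstrap_ci(fcst, obs, score_fn, block_len=10, n_boot=1000):
    """95% confidence interval for a verification score via a
    moving-block bootstrap (block_len is an illustrative choice)."""
    n = len(fcst)
    starts = np.arange(n - block_len + 1)
    scores = []
    for _ in range(n_boot):
        # Resample whole blocks so temporally correlated pairs stay together
        idx = np.concatenate([
            np.arange(s, s + block_len)
            for s in rng.choice(starts, size=n // block_len)
        ])
        scores.append(score_fn(fcst[idx], obs[idx]))
    return np.percentile(scores, [2.5, 97.5])

# Illustrative synthetic daily sample
mae = lambda f, x: np.mean(np.abs(f - x))
obs = rng.normal(size=365)
fcst = obs + rng.normal(scale=0.8, size=365)
print("MAE 95% CI:", block_bootstrap_ci(fcst, obs, mae))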

Page 11: Introduction to Forecast Verification
Page 12: Introduction to Forecast Verification

Spot Temperature Scatter Plot Exercise

Page 13: Introduction to Forecast Verification

Spot Temperature Scatter Plot Exercise

Page 14: Introduction to Forecast Verification

Contingency Tables – Basic 2 x 2

                              Forecasts
                        Yes                  No
Observations   Yes      HITS                 MISSES                 Total events obs
               No       FALSE ALARMS         CORRECT NEGATIVES      Total non-events obs
                        Total events fcst    Total non-events fcst  Sample size
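A minimal sketch of how the four cells of the table are tallied from yes/no (1/0) forecast and observation arrays (the arrays are made-up illustrative data):

import numpy as np

fcst = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # forecast yes/no (illustrative)
obs  = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # observed yes/no (illustrative)

hits              = np.sum((fcst == 1) & (obs == 1))
false_alarms      = np.sum((fcst == 1) & (obs == 0))
misses            = np.sum((fcst == 0) & (obs == 1))
correct_negatives = np.sum((fcst == 0) & (obs == 0))

# The four cells partition the sample
assert hits + false_alarms + misses + correct_negatives == len(fcst)
print(hits, misses, false_alarms, correct_negatives)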

Page 15: Introduction to Forecast Verification

Verification – A general model

Page 16: Introduction to Forecast Verification

ATTRIBUTE | DEFINITION | RELATED MEASURES

1. Bias | Correspondence between mean forecast and mean observation | bias (mean forecast probability minus sample observed frequency)

2. Association | Strength of linear relationship between pairs of forecasts and observations | covariance, correlation

3. Accuracy | Average correspondence between individual pairs of observations and forecasts | mean absolute error (MAE), mean squared error (MSE), root mean squared error, Brier score (BS)

4. Skill | Accuracy of forecasts relative to the accuracy of forecasts produced by a standard method | Brier skill score, others in the usual format

Page 17: Introduction to Forecast Verification

Skill Scores

• Format:

  Skill = (Sc – SSc) / (PSc – SSc)

• where Sc = score (MAE, Brier, etc.)
• PSc = score for a perfect forecast
• SSc = score for a standard (unskilled) forecast
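In code, the format is a one-liner; the example numbers below are illustrative and use the Brier score, for which a perfect forecast scores 0:

def skill_score(sc, psc, ssc):
    """Generic skill score: improvement of the score over the standard
    forecast, scaled by the improvement a perfect forecast would give."""
    return (sc - ssc) / (psc - ssc)

# Illustrative numbers: Brier score 0.10, climatological standard 0.25,
# perfect score 0.0 -> skill of 0.6
print(skill_score(sc=0.10, psc=0.0, ssc=0.25))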

Page 18: Introduction to Forecast Verification

ATTRIBUTE | DEFINITION | RELATED MEASURES

5. Reliability | Correspondence of conditional mean observation and conditioning forecast, averaged over all forecasts | Reliability component of BS; MAE, MSE of binned data from the reliability table

6. Resolution | Difference between conditional mean observation and unconditional mean observation, averaged over all forecasts | Resolution component of BS

7. Sharpness | Variability of forecasts as described by the distribution of forecasts | Variance of forecasts

8. Discrimination | Difference between conditional mean forecast and unconditional mean forecast, averaged over all observations | Area under ROC; measures of separation of conditional distributions; MAE, MSE of scatter plot binned by observation value

9. Uncertainty | Variability of observations as described by the distribution of observations | Variance of observations
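Reliability, resolution and uncertainty as defined above are exactly the three components of the Brier score decomposition (Murphy, 1973). A hedged Python sketch of that decomposition, with an illustrative binning scheme and made-up data:

import numpy as np

def brier_decomposition(prob, obs, n_bins=10):
    """BS = reliability - resolution + uncertainty, with forecasts
    binned by probability value (equality is exact up to binning)."""
    bins = np.clip((prob * n_bins).astype(int), 0, n_bins - 1)
    xbar = obs.mean()                 # base rate
    rel = res = 0.0
    for k in range(n_bins):
        sel = bins == k
        if not sel.any():
            continue
        nk = sel.sum()
        fk = prob[sel].mean()         # mean forecast in bin
        ok = obs[sel].mean()          # conditional observed frequency
        rel += nk * (fk - ok) ** 2
        res += nk * (ok - xbar) ** 2
    n = len(prob)
    unc = xbar * (1 - xbar)           # uncertainty
    return rel / n, res / n, unc

# Illustrative probabilities and binary outcomes
p = np.array([0.1, 0.7, 0.7, 0.2, 0.9, 0.4])
o = np.array([0,   1,   0,   0,   1,   1  ])
rel, res, unc = brier_decomposition(p, o, n_bins=5)
print(rel - res + unc, np.mean((p - o) ** 2))  # agree up to binning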

Page 19: Introduction to Forecast Verification
Page 20: Introduction to Forecast Verification


Page 21: Introduction to Forecast Verification

ROC – ECMWF Ensemble Forecasts, 24 h POP (> 1 mm)

[Figure: Relative Operating Characteristic for 24 h precipitation > 1 mm against European obs – hit rate vs. false alarm rate for the 96 h, 144 h and 240 h forecasts, with the zero-skill diagonal for reference.]

          Az      Da
  96 h    0.839   1.400
  144 h   0.777   1.077
  240 h   0.709   0.780

[Figures: Likelihood diagrams for the 96 h, 144 h and 240 h precipitation forecasts – number of cases vs. forecast probability (bins 0.05 to 0.85), plotted separately for the pcpn and no-pcpn conditional distributions.]
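A minimal sketch of how the points on such a ROC diagram are computed: threshold the forecast probability and tally hit rate and false alarm rate from the resulting 2 x 2 table (data and thresholds below are illustrative):

import numpy as np

def roc_points(prob, obs, thresholds):
    """(false alarm rate, hit rate) pairs, one per probability threshold."""
    pts = []
    for t in thresholds:
        yes = prob >= t                  # forecast "event" at this threshold
        hits   = np.sum(yes & (obs == 1))
        misses = np.sum(~yes & (obs == 1))
        fa     = np.sum(yes & (obs == 0))
        cn     = np.sum(~yes & (obs == 0))
        hr  = hits / (hits + misses)     # hit rate
        far = fa / (fa + cn)             # false alarm rate
        pts.append((far, hr))
    return pts

# Illustrative probabilities and outcomes
prob = np.array([0.9, 0.1, 0.6, 0.4, 0.8, 0.2])
obs  = np.array([1,   0,   1,   0,   0,   0  ])
print(roc_points(prob, obs, thresholds=[0.25, 0.5, 0.75]))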

Page 22: Introduction to Forecast Verification

Current ROC – Canadian EPS, Spring 2003

Page 23: Introduction to Forecast Verification

Summary

• The choice of verification measure depends on:
  – The purpose of the verification (administrative vs. scientific)
  – The nature of the predictand
  – The attributes of the forecast to be measured

Page 24: Introduction to Forecast Verification


The meaning of ‘Value’

• “Weather forecasts possess no intrinsic value in an economic sense. They acquire value by influencing the behaviour of individuals or organizations (‘users’) whose activities are sensitive to weather.”
  – Allan Murphy, Conference on Economic Benefits of Meteorological and Hydrological Services (Geneva, 1994)

Page 25: Introduction to Forecast Verification


Types of “Value”

• Social value – minimization of hazards to human life and health
  – Value to individual users
• Economic value of forecasts
  – Value to a specific business
  – Value to a weather-sensitive industry
  – Value to a weather-sensitive sector
  – Value to the economy of a country
  – Market value (e.g. futures)
• Environmental value
  – Minimizing risk to the environment
  – Optimal use of resources

Page 26: Introduction to Forecast Verification


Value vs. Quality

• Quality refers only to forecast verification; Value implicates a user

• A perfect forecast may have no value if no one cares about it

• An imperfect forecast will have less value than a perfect forecast

• See Murphy and Ehrendorfer 1987

Page 27: Introduction to Forecast Verification

Measuring Value

• The cost-loss decision model
  – focuses on maximizing gain or loss avoidance
  – requires objective cost information from the user
  – user specific, difficult to generalize
  – economic value to a weather-sensitive operation only
  – easy to evaluate relative value (a sketch follows this list)
• Contingent-valuation method
  – focuses on demand for service and “willingness to pay”
  – requires surveys of users to determine variations in demand as a function of variations in price and/or quality of service
  – less user-specific; a larger cross-section of users/industries can be evaluated in one study
  – measures in terms of perception rather than actual accuracy
  – e.g. evaluation of ATADs, Rollins and Shaykewich, Met. Apps., Mar. 2003
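A hedged sketch of relative value under the simple static cost-loss model: a user pays cost C to protect, or risks loss L if the event occurs unprotected; relative value compares the mean expense when following the forecast with the expenses under climatology and under a perfect forecast. The function name and example numbers are illustrative assumptions:

def relative_value(hr, far, base_rate, cost_loss_ratio):
    """hr, far from the 2x2 table; cost_loss_ratio = C/L."""
    o, r = base_rate, cost_loss_ratio
    # Mean expense (per unit loss) with climatology: protect always or never
    e_clim = min(r, o)
    # Mean expense following the forecast:
    # hits and false alarms incur C; misses incur L
    e_fcst = hr * o * r + far * (1 - o) * r + (1 - hr) * o
    # Mean expense with a perfect forecast: protect only when the event occurs
    e_perf = o * r
    return (e_clim - e_fcst) / (e_clim - e_perf)

# Illustrative numbers -> about 0.56
print(relative_value(hr=0.8, far=0.1, base_rate=0.3, cost_loss_ratio=0.2))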

Page 28: Introduction to Forecast Verification

Model Verification

• Data matching issues
• Verification issues

Page 29: Introduction to Forecast Verification

Model Verification – Data Matching Issues

• Typical situation: gridded model forecasts, observations at points
  – Point in-situ observations undersample the field and contain information on all spatial and most temporal scales (“representativeness error”? Not really.)
• Alternatives (see the sketch below):
  – Model to data: what does the model predict at the obs point?
    • Interpolation – if the model gives point predictions
      – Gives answers at all verification points
    • Nearest gridpoint value – if the model prediction is a grid-box average
      – Verify only those grid boxes where there is at least one obs
    • UPSCALING: estimate the grid-box average using all obs in the grid box
  – Data to model: analysing the point data
    • NOT RECOMMENDED, because it treats networks of point observations as if they contain information only on the scales represented by the grid on which the analysis is done
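An illustrative sketch of two "model to data" matching strategies on a toy 1-D grid; the coordinates, spacing and values are all made-up assumptions:

import numpy as np

grid_x = np.arange(0.0, 10.0, 1.0)      # grid-box centres, unit spacing
model  = np.sin(grid_x)                  # model field at grid boxes
obs_x  = np.array([0.3, 0.4, 2.8, 7.1])  # observation locations
obs    = np.array([0.2, 0.5, 0.4, 0.9])  # observed values

# Nearest gridpoint: pair each obs with the closest grid-box value
nearest = np.abs(obs_x[:, None] - grid_x[None, :]).argmin(axis=1)
print(list(zip(model[nearest], obs)))

# Upscaling: average all obs falling in each grid box, then verify only
# the boxes that contain at least one observation
box = np.round(obs_x).astype(int)
for b in np.unique(box):
    box_mean_obs = obs[box == b].mean()
    print(f"box {b}: model={model[b]:.2f}, upscaled obs={box_mean_obs:.2f}")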

Page 30: Introduction to Forecast Verification

Model Verification – Issues

• Resolution – scale separation
• Spatial verification – object-oriented

Page 31: Introduction to Forecast Verification

Model Verification – Scale Separation

• Mesoscale verification: separating model errors due to resolution from other errors.
• IF high-resolution spatial data are available:
  – Scale separation using wavelets or another method (Mike Baldwin)
  – Repeat the verification on the same dataset at different scales to get a performance curve
  – Data combination: use the high-resolution data to “inform” statistical estimates such as grid-box averages

Page 32: Introduction to Forecast Verification

Spatial Verification

• Object-oriented methods
  – Calculation of displacement and size errors for specific objects (e.g. rain areas, fronts) – Hoffman, 1995; Ebert and McBride, 2000 (CRA method)
  – Decomposition of errors into location, shape and size components
  – Others (Mike Baldwin)
  – The problem is always the matching of the forecast and observed objects.

Page 33: Introduction to Forecast Verification

Verification of Probability Distributions

• Problem:
  – Comparison of a distribution with a single outcome
• Solutions (a CRPS sketch follows below):
  – Verify the density in the vicinity of the observation (Wilson, Burrows and Lanzinger, 1999)
  – Verify the cdf against the observation represented as a cdf (CRPS; Hersbach, 2000)
  – Extract probabilities from the distribution and verify them as probability forecasts (sample several thresholds)
  – Compare parameters of the pooled distribution with the sample climatology (Talagrand diagram)
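A hedged sketch of the CRPS for an ensemble forecast against a single observation, using the kernel form CRPS = E|X − y| − 0.5 E|X − X′|, which for an empirical (step-function) forecast distribution equals the integrated squared difference between the forecast and observed CDFs. The ensemble values are made-up:

import numpy as np

def crps_ensemble(members, obs):
    """CRPS of an ensemble (empirical CDF) against one observation."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Illustrative 5-member ensemble and observation
print(crps_ensemble([1.2, 1.8, 2.0, 2.5, 3.1], obs=2.2))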

Page 34: Introduction to Forecast Verification

Ensemble verification - distribution

Page 35: Introduction to Forecast Verification

CRPS example

[Figure: CDFs of the forecast and of the observation (represented as a step function), probability vs. x; the CRPS integrates the squared difference between the two curves.]

Page 36: Introduction to Forecast Verification

Rank Histogram example
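A minimal sketch of how a rank (Talagrand) histogram is tallied: for each case, find the rank of the observation within the sorted ensemble; a roughly flat histogram suggests the observation is statistically indistinguishable from the members. The ensemble and observations below are synthetic random data, so the histogram should come out roughly flat:

import numpy as np

rng = np.random.default_rng(0)
n_cases, n_members = 500, 10
ens = rng.normal(size=(n_cases, n_members))  # synthetic ensemble
obs = rng.normal(size=n_cases)               # synthetic observations

ranks = np.sum(ens < obs[:, None], axis=1)       # rank 0 .. n_members
hist = np.bincount(ranks, minlength=n_members + 1)
print(hist)  # roughly flat for a well-calibrated ensemble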

Page 37: Introduction to Forecast Verification

Tricks of the Trade

• “How can I get a better (higher) number?”
  – Remove the bias before calculating scores (works really well for quadratic scoring rules) and don’t tell anyone.
  – Claim that the model predicts grid-box averages, even if it doesn’t. Make the boxes as large as possible.
  – Never use observation data; it only contains a lot of “noise”. As an alternative:
    • Verify against an analysis that uses the model being verified as a trial field. Works best in data-sparse areas.
    • Use a shorter-range forecast from the model being verified and call it observation data.
  – Design a new or modified score. Don’t be bothered by restrictions such as strict propriety. Then the values can be as high as desired.
  – Stratify the verification data using a posteriori rules. One can always get rid of pathological cases that bring down the average.
  – When comparing the performance of your model against others, make sure it is your analysis that is used as the verifying one.
  – Always insist on doing the verification of your own products…
• Remember, you already know you have a good product. The goal of the verification is to show it “objectively”.

Page 38: Introduction to Forecast Verification

Conclusions

• A quick survey of verification and value assessment
• A data-oriented perspective