
Forecast verification - did we get it right?

Ian Jolliffe

Universities of Reading, Southampton, Kent, Aberdeen

i.t.jolliffe@reading.ac.uk


Outline of talk

• Introduction – what, why, how?
• Binary forecasts
  – Performance measures, ROC curves
  – Desirable properties
    • Of forecasts
    • Of performance measures
• Other forecasts
  – Multi-category, continuous, (probability)
• Value


Forecasts

• Forecasts are made in many disciplines

– Weather and climate
– Economics
– Sales
– Medical diagnosis


Why verify/validate/assess forecasts?

• Decisions are based on past data but also on forecasts of data not yet observed

• A look back at the accuracy of forecasts is necessary to determine whether current forecasting methods should be continued, abandoned or modified


Two (very different) recent references

• I T Jolliffe and D B Stephenson (eds.) (2003) Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley.

• M S Pepe (2003) The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.


Horses for courses

• Different types of forecast need different methods of verification, for example in the context of weather hazards (TSUNAMI project):
  – binary data – damaging frost: yes/no
  – categorical – storm damage: slight/moderate/severe
  – discrete – how many land-falling hurricanes in a season
  – continuous – height of a high tide
  – probabilities – of tornado

• Some forecasts (wordy/descriptive) are very difficult to verify at all


Binary forecasts

• Such forecasts might be

– Whether temperature will fall below a threshold, damaging crops or forming ice on roads

– Whether maximum river flow will exceed a threshold, causing floods

– Whether mortality due to extreme heat will exceed some threshold (PHEWE project)

– Whether a tornado will occur in a specified area

• The classic Finley Tornado example (next 2 slides) illustrates that assessing such forecasts is more subtle than it looks

• There are many possible verification measures – most have some poor properties


Forecasting tornados

                        Tornado observed   Tornado not observed   Total
Tornado forecast               28                   72              100
Tornado not forecast           23                 2680             2703
Total                          51                 2752             2803


Tornado forecasts

• Correct decisions: 2708/2803 = 96.6%
• Correct decisions by a procedure which always forecasts ‘No Tornado’: 2752/2803 = 98.2%
• It’s easy to forecast ‘No Tornado’ and get it right, but more difficult to forecast when tornadoes will occur
• Correct decisions when a tornado is forecast: 28/100 = 28.0%
• Correct forecasts of observed tornadoes: 28/51 = 54.9%
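
A minimal Python sketch (not part of the original slides) that reproduces these figures from the Finley table above:

```python
# Finley tornado data: a = hits, b = false alarms, c = misses, d = correct rejections
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d                      # 2803 forecasts in total

print((a + d) / n)                     # proportion correct: 2708/2803 = 0.966
print((b + d) / n)                     # 'always forecast No Tornado': 2752/2803 = 0.982
print(a / (a + b))                     # correct when a tornado is forecast: 28/100 = 0.280
print(a / (a + c))                     # correct forecasts of observed tornadoes: 28/51 = 0.549
```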


Forecast/observed contingency table

                        Event observed   Event not observed   Total
Event forecast                a                  b            a + b
Event not forecast            c                  d            c + d
Total                       a + c              b + d            n


Some verification measures for (2 x 2) tables

• a/(a+c)  Hit rate = true positive fraction = sensitivity
• b/(b+d)  False alarm rate = 1 – specificity
• b/(a+b)  False alarm ratio = 1 – positive predictive value
• c/(c+d)  1 – negative predictive value
• (a+d)/n  Proportion correct (PC)
• (a+b)/(a+c)  Bias
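
A short Python sketch (an illustration, not from the slides) computing these measures from a generic (a, b, c, d) table:

```python
def binary_measures(a, b, c, d):
    """Verification measures for a 2x2 forecast/observed contingency table.
    a = hits, b = false alarms, c = misses, d = correct rejections."""
    n = a + b + c + d
    return {
        "hit_rate": a / (a + c),            # true positive fraction / sensitivity
        "false_alarm_rate": b / (b + d),    # 1 - specificity
        "false_alarm_ratio": b / (a + b),   # 1 - positive predictive value
        "one_minus_npv": c / (c + d),       # 1 - negative predictive value
        "proportion_correct": (a + d) / n,  # PC
        "bias": (a + b) / (a + c),          # forecasts of event / observations of event
    }

print(binary_measures(28, 72, 23, 2680))    # Finley tornado data
```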


Skill scores

• A skill score is a verification measure adjusted to show improvement over some unskilful baseline, typically a forecast of ‘climatology’, a random forecast or a forecast of persistence. Usually adjustment gives zero value for the baseline and unity for a perfect forecast.

• For (2x2) tables we know how to calculate ‘expected’ values in the cells of the table under a null hypothesis of no association (no skill) for a χ2 test.


More (2x2) verification measures

• (PC – E)/(1 – E), where E is the expected value of PC assuming no skill – the Heidke (1926) skill score = Cohen’s Kappa (1960), also Doolittle (1885)
• a/(a+b+c)  Critical success index (CSI) = threat score
• Gilbert’s (1884) skill score – a skill score version of CSI
• (ad – bc)/(ad + bc)  Yule’s Q (1900), a skill score version of the odds ratio ad/bc
• a(b+d)/b(a+c); c(b+d)/d(a+c)  Diagnostic likelihood ratios
• Note that neither the list of measures nor the list of names is exhaustive – see, for example, J A Swets (1986), Psychological Bulletin, 99, 100-117
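
As an illustration (not from the slides), three of these can be computed directly from the table counts; E is the expected proportion correct under no association, as in the χ2 test:

```python
def heidke(a, b, c, d):
    """Heidke skill score: (PC - E)/(1 - E), where E is the expected
    proportion correct if forecasts and observations were independent."""
    n = a + b + c + d
    pc = (a + d) / n
    e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (pc - e) / (1 - e)

def csi(a, b, c, d):
    """Critical success index (threat score)."""
    return a / (a + b + c)

def yules_q(a, b, c, d):
    """Yule's Q: a skill-score version of the odds ratio ad/bc."""
    return (a * d - b * c) / (a * d + b * c)

# Finley tornado data: Heidke ~ 0.36, CSI ~ 0.23, Q ~ 0.96
print(heidke(28, 72, 23, 2680), csi(28, 72, 23, 2680), yules_q(28, 72, 23, 2680))
```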


The Relative Operating Characteristic (ROC) curve

• Plots hit rate (proportion of occurrences of the event that were correctly forecast) against false alarm rate (proportion of non-occurrences that were incorrectly forecast) for different thresholds

• Especially relevant if a number of different thresholds are of interest

• There are a number of verification measures based on ROC curves. The most widely used is probably the area under the curve

[Figure: ROC curve plotting hit rate against false alarm rate (both 0 to 1), with points marked for low, moderate and high thresholds]
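
A minimal sketch (an illustration, not from the slides) of how the ROC points and the area under the curve can be obtained by thresholding a set of probability forecasts:

```python
import numpy as np

def roc_points(prob_forecasts, observed, thresholds):
    """(false alarm rate, hit rate) for each probability threshold."""
    p = np.asarray(prob_forecasts, dtype=float)
    obs = np.asarray(observed, dtype=bool)
    points = []
    for t in thresholds:
        warn = p >= t                                   # event forecast at this threshold
        hit_rate = (warn & obs).sum() / obs.sum()
        false_alarm_rate = (warn & ~obs).sum() / (~obs).sum()
        points.append((false_alarm_rate, hit_rate))
    return sorted(points)

def auc(points):
    """Trapezoidal area under the ROC curve, anchored at (0,0) and (1,1)."""
    pts = [(0.0, 0.0)] + points + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# hypothetical forecasts with some genuine skill
rng = np.random.default_rng(0)
p = rng.random(1000)
obs = rng.random(1000) < p
print(auc(roc_points(p, obs, np.linspace(0.1, 0.9, 9))))
```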


Desirable properties of measures: hedging and proper scores

• ‘Hedging’ is when a forecaster gives a forecast different from his/her true belief because he/she believes that the hedged forecasts will improve the (expected) score on a measure used to verify the forecasts. Clearly hedging is undesirable.

• For probability forecasts, a (strictly) proper score is one for which the forecaster (uniquely) maximises the expected score by forecasting his/her true beliefs, so that there is no advantage in hedging.
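
The Brier score is a standard example of a strictly proper score for binary probability forecasts (it is a penalty, so it is minimised rather than maximised). A small sketch (not from the slides) showing that hedging does not pay under it:

```python
def expected_brier(quoted_p, true_p):
    """Expected Brier score (squared-error penalty, smaller is better) when the
    forecaster quotes quoted_p but believes the event has probability true_p."""
    return true_p * (quoted_p - 1) ** 2 + (1 - true_p) * quoted_p ** 2

true_p = 0.3
candidates = [k / 100 for k in range(101)]
best = min(candidates, key=lambda q: expected_brier(q, true_p))
print(best)   # 0.3 -- quoting one's true belief minimises the expected penalty
```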


Desirable properties of measures: equitability

• A score for a probability forecast is equitable if it takes the same expected value (often chosen to be zero) for all unskilful forecasts of the type
  – Forecast the same probability all the time, or
  – Choose a probability randomly from some distribution on the range [0,1].
• Equitability is desirable – if two sets of forecasts are made randomly, but with different random mechanisms, one should not score better than the other.
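
As an aside (not from the slides), a quick simulation can check whether a given score is equitable in this sense. The Brier score, for example, is proper but not equitable: two unskilful forecasting strategies have different expected scores.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.random(100_000) < 0.3           # events with climatological probability 0.3

const = np.full(obs.size, 0.5)            # unskilful: same probability every time
random_p = rng.random(obs.size)           # unskilful: probability drawn uniformly on [0, 1]

brier = lambda p, o: np.mean((p - o) ** 2)
print(brier(const, obs), brier(random_p, obs))   # ~0.25 vs ~0.33: not equal, so not equitable
```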


Desirable properties of measures III

• There are a number of other desirable properties of measures, both for probability forecasts and other types of forecast, but equitability and propriety are most often cited in the meteorological literature.

• Equitability and propriety are incompatible (a new result)


Desirable properties (attributes) of forecasts

• Reliability. Conditionally unbiased. Expected value of the observation, given the forecast, equals the forecast value.
• Resolution. The sensitivity of the expected value of the observation to different forecast values (or more generally the sensitivity of this conditional distribution as a whole).
• Discrimination. The sensitivity of the conditional distribution of forecasts, given observations, to the value of the observation.
• Sharpness. Measures the spread of the marginal distribution of forecasts. Equivalent to resolution for reliable (perfectly calibrated) forecasts.
  – Other lists of desirable attributes exist.


A reliability diagram

• For a probability forecast of an event based on 850hPa temperature. Lots of grid points, so lots of forecasts (16380).

• Plots observed proportion of event occurrence for each forecast probability vs. forecast probability (solid line).

• Forecast probability takes only 17 possible values (0, 1/16, 2/16, … 15/16, 1) because forecast is based on proportion of an ensemble of 16 forecasts that predict the event.

• Because of the nature of the forecast event, 0 or 1 are forecast most of the time (see inset sharpness diagram).
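
A minimal sketch (not the code behind the actual figure) of how such a reliability diagram can be tabulated from ensemble-based probability forecasts:

```python
import numpy as np

def reliability_table(forecast_prob, observed, n_members=16):
    """For each possible forecast probability k/n_members, the observed relative
    frequency of the event and the number of times that probability was issued
    (the counts give the sharpness diagram)."""
    rows = []
    for k in range(n_members + 1):
        p = k / n_members
        mask = np.isclose(forecast_prob, p)
        count = int(mask.sum())
        obs_freq = observed[mask].mean() if count else float("nan")
        rows.append((p, obs_freq, count))
    return rows

# hypothetical example: 16380 forecasts from a 16-member ensemble
rng = np.random.default_rng(1)
true_p = rng.beta(0.2, 0.2, 16_380)               # event probabilities mostly near 0 or 1
fprob = rng.binomial(16, true_p) / 16             # ensemble proportion forecasting the event
obs = rng.random(16_380) < true_p
for p, freq, count in reliability_table(fprob, obs):
    print(f"{p:.3f}  {freq:.3f}  {count}")
```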


Weather/climate forecasts vs medical diagnostic tests

• Quite different approaches in the two literatures
  – Weather/climate. Lots of measures used. Literature on properties, but often ignored. Inference (tests, confidence intervals, power) seldom considered.
  – Medical (Pepe). Far fewer measures. Little discussion of properties. More inference: confidence intervals, complex models for ROC curves.


Multi-category forecasts

• These are forecasts of the form
  – Temperature or rainfall ‘above’, ‘below’ or ‘near’ average (a common format for seasonal forecasts)
  – ‘Very High Risk’, ‘High Risk’, ‘Moderate Risk’, ‘Low Risk’ of excess mortality (PHEWE)

• Different verification measures are relevant depending on whether categories are ordered (as here) or unordered


Multi-category forecasts II

• As with binary forecasts there are many possible verification measures

• With K categories one class of measures assigns scores to each cell in the (K x K) table of forecast/outcome combinations

• Then multiply the proportion of observations in each cell by its score, and sum over cells to get an overall score

• By insisting on certain desirable properties (equitability, symmetry etc) the number of possible measures is narrowed


Gerrity (and LEPS) scores for 3 ordered category forecasts with equal probabilities

• Two possibilities are Gerrity scores or LEPS (Linear Error in Probability Space)

• In the example, LEPS rewards correct extreme forecasts more, and penalises badly wrong forecasts more, than Gerrity (divide the Gerrity values by 24 and the LEPS values by 36 to put them on the same scale, with an expected maximum value of 1)

Gerrity (LEPS) scores; forecast category vs. observed category:

                 Category 1    Category 2    Category 3
Category 1        30 (48)       –6 (–6)      –24 (–42)
Category 2        –6 (–6)       12 (12)       –6 (–6)
Category 3       –24 (–42)      –6 (–6)       30 (48)
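
A small sketch (not from the slides) of the general recipe from the previous slide: multiply the proportion of cases in each cell of the K x K forecast/observed table by that cell’s score and sum over cells, here using the Gerrity values above divided by 24:

```python
import numpy as np

def matrix_score(counts, scores):
    """Overall score: sum over cells of (proportion of cases in cell) x (cell score).
    counts[i][j] = number of cases with forecast category i and observed category j."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum() * np.asarray(scores)).sum()

# Gerrity scoring matrix from the table above, rescaled by 24
gerrity = np.array([[ 30, -6, -24],
                    [ -6, 12,  -6],
                    [-24, -6,  30]]) / 24

# hypothetical table of 300 three-category forecasts
counts = [[60, 25, 15],
          [20, 55, 25],
          [10, 30, 60]]
print(matrix_score(counts, gerrity))
```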


Verification of continuous variables

• Suppose we make forecasts f1, f2, …, fn; the corresponding observed data are x1, x2, …, xn.
• We might assess the forecasts by computing
  – [ |f1 – x1| + |f2 – x2| + … + |fn – xn| ]/n (mean absolute error)
  – [ (f1 – x1)² + (f2 – x2)² + … + (fn – xn)² ]/n (mean square error) – or take its square root
  – Some form of correlation between the f’s and x’s
• Both MSE and correlation can be highly influenced by a few extreme forecasts/observations
• No time here to explore other possibilities
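
In Python (an illustrative sketch with made-up numbers, not from the slides) these summary measures are one-liners:

```python
import numpy as np

f = np.array([21.3, 18.7, 25.1, 19.9, 30.2])   # hypothetical forecasts
x = np.array([20.8, 17.9, 26.0, 22.3, 24.5])   # corresponding observations

mae = np.mean(np.abs(f - x))        # mean absolute error
mse = np.mean((f - x) ** 2)         # mean square error
rmse = np.sqrt(mse)                 # root mean square error
corr = np.corrcoef(f, x)[0, 1]      # correlation between forecasts and observations
print(mae, mse, rmse, corr)
```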


Skill or value?

• Our examples have looked at assessing skill

• Often we really want to assess value

• This needs quantification of the loss/cost of incorrect forecasts in terms of their ‘incorrectness’


Value of Tornado Forecasts

• If wrong forecasts of any sort cost $1K each, then the forecasting system costs $95K, but the naive system (always forecast ‘No Tornado’) costs only $51K
• If a false alarm costs $1K, but a missed tornado costs $10K, then the system costs $302K, but naivety costs $510K
• If a false alarm costs $1K, but a missed tornado costs $1 million, then the system costs $23.07 million, with naivety costing $51 million
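
These figures follow directly from the Finley table: the system gives 72 false alarms and 23 misses, while the naive system misses all 51 tornadoes. A minimal sketch, not from the original slides:

```python
def total_cost(false_alarms, misses, fa_cost, miss_cost):
    """Total cost, in the same units as the per-event costs."""
    return false_alarms * fa_cost + misses * miss_cost

for fa_cost, miss_cost in [(1, 1), (1, 10), (1, 1000)]:       # costs in $K
    system = total_cost(72, 23, fa_cost, miss_cost)
    naive = total_cost(0, 51, fa_cost, miss_cost)             # never forecast a tornado
    print(f"FA ${fa_cost}K, miss ${miss_cost}K: system ${system}K, naive ${naive}K")
```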


Concluding remarks

• Forecasts should be verified
• Forecasts are multi-faceted; verification should reflect this
• Interpretation of verification results needs careful thought
• Much more could be said, for example, on inference, wordy forecasts, continuous forecasts, probability forecasts, ROC curves, value, spatial forecasts etc.


Continuous variables – LEPS scores

• Also, for MSE a difference between forecast and observed of, say, 2°C is treated the same way, whether it is
  – a difference between 1°C above and 1°C below the long-term mean, or
  – a difference between 3°C above and 5°C above the long-term mean
• It can be argued that the second forecast is better than the first because the forecast and observed are closer with respect to the probability distribution of temperature.


LEPS scores II

• LEPS (Linear Error in Probability Space) scores measure distances with respect to position in a probability distribution.

• They start from the idea of using | Pf – Pv |, where Pf, Pv are positions in the cumulative probability distribution of the measured variable for the forecast and observed values, respectively

• This has the effect of down-weighting differences between extreme forecasts and outcomes e.g. a forecast/outcome pair 3 & 4 standard deviations above the mean is deemed ‘closer’ than a pair 1 & 2 SDs above the mean. Hence it gives greater credit to ‘good’ forecasts of extremes.


LEPS scores III

• The basic measure is normalized and adjusted to ensure
  – the score is doubly equitable
  – no ‘bending back’
  – a simple value for unskilful and for perfect forecasts
• We end up with 3(1 – |Pf – Pv| + Pf² – Pf + Pv² – Pv) – 1
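
A direct transcription of this formula (an illustrative sketch, not code from the talk):

```python
def leps(p_f, p_v):
    """LEPS score for a single forecast/observation pair, where p_f and p_v are
    the positions of the forecast and observed values in the climatological
    cumulative distribution (both in [0, 1])."""
    return 3 * (1 - abs(p_f - p_v) + p_f**2 - p_f + p_v**2 - p_v) - 1

print(leps(0.5, 0.5))    # perfect forecast near the median: 0.5
print(leps(0.99, 0.99))  # perfect forecast of an extreme scores higher: ~1.94
print(leps(0.01, 0.99))  # forecast and outcome at opposite extremes: ~ -1.0
```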


LEPS scores IV

• LEPS scores can be used on both continuous and categorical data.

• A skill score, taking values between –100 and 100 (or –1 and 1) for a set of forecasts, can be constructed based on the LEPS score, but it is not doubly equitable.

• Cross-validation (successively leaving out one of the data points and basing the prediction for that point from a rule derived from all the other data points) can be used to reduce the optimistic bias which exists when the same data are used to construct and to evaluate a rule. It has been used in some applications of LEPS, but is relevant more widely.
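
A minimal leave-one-out sketch (a hypothetical example, not from the talk): each point is predicted from a rule fitted to all the other points – here a simple mean-based rule – and the errors are then averaged:

```python
import numpy as np

x = np.array([12.1, 9.8, 14.3, 11.0, 13.5, 10.2, 15.1, 12.8])   # hypothetical observations

errors = []
for i in range(len(x)):
    train = np.delete(x, i)         # leave observation i out
    prediction = train.mean()       # 'fit' the rule on the remaining points
    errors.append(abs(prediction - x[i]))

print(np.mean(errors))              # cross-validated mean absolute error
```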