PostScript version for download
Monthly Temperature Forecasts for Denmark -
Statistical or Dynamical?
Reliability of probability forecasts is an important issue for long-range forecasts. An example is included in this report that illustrates how reliability of probability forecasts can be ensured, if the probability forecasts are derived as the conditional probability density of observed temperature given the raw model forecast.
Monthly forecasts can be computed in two fundamentally different
ways: using empirical/statistical methods, or using
physical/dynamical models. For some time now, DMI has issued
monthly surface air temperature outlooks based on the statistical
relation between the monthly mean temperature in Denmark in one
month and the monthly mean temperature in the following month.
As an example, Table 1 shows the observed relation between
January and February mean temperatures in the period 1890-1990.
For example, a cold January was followed by a cold February in
68% of the years; only in 4% of the years a cold January was
followed by a warm February, i.e. there is a clear tendency for
temperature anomalies to persist during January and February. A
month is here categorised as ``cold'' when it belongs to the
coldest quartile; similarly, a month is ``warm'' when it belongs
to the warmest quartile. Thus, the observed distribution of
temperature in February after a cold January deviates strongly
from the climatological distribution (which is 25% cold, 50%
normal and 25% warm).
Forecasts based on statistical relations, such as the example shown in Table 1, has the advantage that they are reliable per construction (unless the climate has changed after the period that was used to derive relations between monthly mean temperatures), i.e. the forecast probabilities agree with the observed frequencies of cold, normal and warm months.
The main purpose of the present study is to determine whether the simple statistical monthly forecasts described above can be improved by including additional information from dynamical seasonal or medium-range forecasts.
As an example where the statistical monthly outlook certainly
would have benefited from additional forecast information from a
10-day medium-range forecast, consider the monthly outlook for
February 1997. Figure 1 shows daily mean temperatures during
January and February 1997 for the station Tirstrup. At around 13
January there was a jump in temperature which was associated with
a shift in the atmospheric circulation over Europe from blocked
to more zonal flow (Woetmann Nielsen and
Cappelen, 1997) which lead to milder conditions
in Denmark. The relatively mild conditions continued more or
less for the rest of the month, and, except for a couple of days
in the middle of the month, very mild conditions were dominating
After a brief description in Sect. 2 of the data used in the present study, Sect. 3 describes a systematic investigation of monthly predictive skill associated with different predictors, including observed monthly mean temperature for the previous month as well as monthly and 10-day dynamical simulations of surface air temperature. Section 4 discusses ways to express the forecasts in terms of probabilities, and Sect. 5 contains concluding remarks.
The primary source of data for the present study is the ``ECMWF Seasonal Simulations CD-ROM'' set (Becker, 1998) which contains selected data from the ECMWF PROVOST simulations (Brankovic and Palmer, 1999) and from the ECMWF ERA-15 reanalysis project. In the PROVOST project (PRediction Of climate Variations On Seasonal and interannual Timescales) the ECMWF IFS atmospheric model [cycle 13r4, horizontal resolution T63, 31 vertical levels; (Simmons et al., 1988))] was used to generate nine-member ensembles of simulated seasonal climate. Sea surface temperature (SST) was prescribed with observed values. The ensembles are approximately 120 days long, covering the seasons MAMJ, JJAS, SOND and DJFM during the period March 1979 - December 1993. The CD-ROM set contains 10-day means on a global 2.5°×2.5° grid for selected surface and pressure level fields.
The fields that are used in the present study include reanalysed surface temperature (including SST) and simulated max and min 2m temperature. Mean 2m temperature is not included on the CD-ROMs, so when reference in the following is made to simulated monthly mean 2m temperature we actually mean the average of max and min 2m temperature.
The nine members of the ensembles are generated from analysed initial conditions for the last nine days of February, May, August and November. Thus, realistic initial conditions (corresponding to short forecast lead times) are only available for March, June, September and December. For the remaining months of the year the model atmosphere has essentially no memory of the initial conditions (corresponding to forecast lead times of at least one month).
Observed monthly mean 2m temperature in Denmark is taken from the station Tranebjerg (55.9°N, 10.6°E). Homogenised monthly data for Tranebjerg from 1890 - 1990 is available in the ``North Atlantic Climatological Dataset'' (Frich and Coauthors, 1996); monthly mean 2m temperature after 1990 is taken from DMI's climate database.
For statistical climate predictions it is not uncommon to have a large set of potential predictors, e.g. a gridded analysis of monthly surface temperatures, where the temperature in each grid point is a potential predictor. It is in general not difficult to construct a linear regression equation that describes the observed relation between a large set of predictors and one predictand (e.g. an observed temperature time series for a single station). However, unless the regression equation describes a physical mechanism that relates the predictand to the set of predictors, it is very likely that such an equation simply represents fitting of random variations (noise) to the predictand, and that the equation, therefore, is useless for prediction of future observations.
This overfitting problem can be reduced by cutting the dimension of the predictor field. Several methods exist that can do this. One class of methods is known as screening regression. A simple version, the so-called forward selection, where predictors are successively added to a linear regression equation until addition of extra predictors no longer leads to a statistically significant reduction of the residual variance (e.g., Conradsen, 1984), was tested for monthly forecasts, but was found to give unsatisfactory results. Another class of methods is based on expansion of the predictor field in terms of ``dominating'' patterns. Linear regression where the predictors are chosen as the first few principal components of the original predictor field was found to give much better results than screening regression, and the results shown later in this section are based on linear regression of the first three principal components of the predictor field.
The obvious dynamical model based forecast for the 2m temperature at a particular location, would be the simulated surface air temperature in the grid box containing that location. However, local orography which is not properly resolved in the global model, may cause the forecast to be in error. Moreover, as mentioned in the previous section, simulated surface air temperature is not readily available on the CD-ROMs. Therefore, in order to predict monthly mean 2m temperature in Denmark, linear regression is applied to the first three principal components of the average of simulated monthly max and min 2m temperature in a grid covering Northern Europe. That is, the statistical method that is used to predict temperature in Denmark when the predictor is a field of previously observed surface temperature in Northern Europe, is also applied to a predictor field of dynamically simulated surface air temperature.
Thus, in the following the predictand is given by observed monthly mean 2m temperature in Tranebjerg, and the predictors include
The derivation of a statistical prediction model requires a model training period, and verification requires an independent validation period. The statistical model is simply a statistical relationship between predictor(s) and predictand(s), e.g. a linear regression equation, which is based on data from the training period. The predictive skill of the model is estimated from predictions or hindcasts in the validation period.
In order to introduce the cross-validation technique consider the case where the first half of a given period is used to derive a model (the training period), and the second half is used to validate the model (the validation period). This approach can easily be supplemented by swapping training and validation periods, so that another model is derived on data from the second half of the total period and validated on the first half of the period. Altogether, the two models provide hindcasts for each year of the total period, and the total series of hindcasts can be verified against observations. This is a special case of a general procedure where the total data period repeatedly is split into training and validation periods, where each hindcast from the resulting models is saved, and the total series of hindcasts finally is verified against observations. This procedure is known as cross-validation (Michaelsen, 1987).
For very short time series the training period should be kept as long as possible, i.e. a training period would typically consist of data from all time steps but one, and the resulting model would be used to make a hindcast for the one point in time which would not be included in the training period.
When verified against observations this standard procedure yields for each predictand a skill score which is a measure of the predictive skill of the applied statistical method (e.g. linear regression) over the data period in question. No assessment is made of the uncertainty of the skill score.
Cross-validated skill scores for statistical models are not always reliable for future predictions, i.e. in unfortunate cases one may be led to believe that a statistical forecast model is more accurate than it actually is. Two things can seriously affect the reliability of the cross-validated skill score: (i) outliers in the predictor and predictand data sets, and (ii) variability in the predictor and predictand data on the same timescale as the length of the data period.
In order to get a first estimate of the uncertainty of the cross-validated skill score for predicted 2m temperature in Tranebjerg, an ensemble of hindcasts are made for the same target month and year. This is achieved by making several hindcasts based on different, shorter training periods (note that a training ``period'' need not consist of data from consecutive points in time), so for a total period of, e.g., length 15, the standard cross-validation training period would have length 14, but if only 13 points in time were included then the training data could be selected in as many ways as is possible to pick 13 different numbers out of 14, i.e. there would be 14 different possible training periods and hence 14 different hindcasts; if only 12 points in time were included in the training data then there would be 91 different possible training periods, etc.
A skill score can be calculated for each of the many possible combinations of hindcasts to yield a distribution of skill scores, and from this distribution a skill score confidence interval can be derived.
The expected performance is estimated by computing skill scores for a set of hindcasts. Numerous different types of skill scores exist. The choice of skill score depends on what aspects of the forecast is most important. In the following the linear correlation in time between forecast anomalies and observed anomalies is used as skill score. Linear anomaly correlations are simple to compute and widely used as skill scores, but a systematic nonlinear relation between forecasts and observations may not be well reflected in the linear correlation coefficient. Also, the linear correlation is known to be sensitive to outliers in the two data sets that are correlated (Wilks, 1995).
When linear anomaly correlations are computed between cross-validated statistical hindcasts and observations, one should be aware of a couple of potential pitfalls:
Each hindcast and observed anomaly must be with respect to the climatology of the relevant training period.
Cross-validated anomaly correlations underestimate the true forecast skill. The closer to zero the true forecast skill is, the more severe is the underestimation of the cross-validated anomaly correlation. Thus, the cross-validated anomaly correlation can be close to -1 when the true forecast skill is close to zero (Barnston and van den Dool, 1993).
The cross-validation methodology does not apply to persistence hindcasts where the predicted anomaly for the target month is simply the observed anomaly for the preceding month. Therefore, in order to get a fair comparison of hindcast skill scores, persistence hindcasts must be computed in a way that allows cross-validation. So in the following, persistence hindcasts refer to linear regression hindcasts where the predictor is the observed monthly mean temperature for the month preceding the hindcast target month.
The total predictor and predictand data sets comprise observed monthly mean 2m temperatures for Tranebjerg for January and February 1980, Jan. and Feb. 1981, ..., Jan. and Feb. 1993, i.e. for a period of 14 years. 13 different hindcasts for Feb. 1980 are made by deriving a linear regression equation for Feb. temperature based on each possible training data set that includes precisely 12 of the 13 temperatures from Jan. 1981-93. This procedure is repeated for hindcasts for each of the remaining years, 1981-93, leading to a total of 14×13=182 hindcasts. The hindcasts for each year can be combined to give a total of 1314 hindcast time series for the 1980-93 period. A subset of 200 hindcast time series are picked at random, and for each of them the linear correlation with observed Feb. temperatures is computed. A 95%-confidence correlation interval is then simply obtained by finding the 2.5% and 97.5% sample percentiles.
The simplest zero-lead forecasts are those based on persistence
of the monthly mean temperature anomaly. Figure 4 shows as an
example February persistence hindcasts and February observations
in the 1980-93 period. The correlation between the hindcast
ensemble mean anomalies and observed anomalies is 0.68 which is a
high skill score relative to the skill scores obtained for other
months. Tables 2-5 summarise anomaly correlation skill scores
for linear regression hindcasts for the sets of predictors listed
in Sect. 3.1. Simulated temperature in 850 hPa and simulated
mean sea level pressure were also tested as predictors, but they
were not found to add to the skill of the monthly hindcasts
obtained with the predictors listed in Tables 2-5.
|T2m, obs., Tranebjerg stn.||-||-||0.40||[ 0.37; 0.44]||0.68||[ 0.64; 0.71]|
|-||-||-||-||0.71||[ 0.63; 0.76]|
T2m, monthly sim., long lead
|-||-||0.24||[ 0.11; 0.34]||0.49||[ 0.37; 0.58]|
|-||-||0.14||[ 0.04; 0.24]||0.57||[ 0.50; 0.63]|
T2m, monthly sim., short lead
|T2m, 10-day sim., short lead||-||-|
|T2m, obs., Tranebjerg stn.||0.65||[ 0.60; 0.68]||0.28||[ 0.18; 0.39]||-||-|
|0.69||[ 0.64; 0.74]||0.30||[ 0.18; 0.47]||0.36||[ 0.27; 0.43]|
T2m, monthly sim., long lead
|0.27||[ 0.11; 0.41]||-||-||0.34||[ 0.24; 0.41]|
|0.68||[ 0.61; 0.73]||-||-||0.35||[ 0.26; 0.41]|
T2m, monthly sim., short lead
|0.64||[ 0.59; 0.69]|
|T2m, 10-day sim., short lead||0.70||[ 0.67; 0.74]|
|T2m, obs., Tranebjerg stn.||0.59||[ 0.51; 0.65]||0.28||[ 0.16; 0.37]||0.62||[ 0.57; 0.65]|
|0.44||[ 0.37; 0.52]||0.20||[ 0.06; 0.32]||0.48||[ 0.41; 0.54]|
T2m, monthly sim., long lead
|0.58||[ 0.49; 0.62]||0.32||[ 0.18; 0.42]||0.42||[ 0.31; 0.53]|
|0.50||[ 0.42; 0.57]||0.38||[ 0.25; 0.49]||0.49||[ 0.43; 0.54]|
T2m, monthly sim., short lead
|0.34||[ 0.22; 0.47]|
|T2m, 10-day sim., short lead||-||-|
|T2m, obs., Tranebjerg stn.||0.25||[ 0.11; 0.33]||-||-||0.13||[ 0.03; 0.26]|
|0.30||[ 0.16; 0.37]||-||-||-||-|
T2m, monthly sim., long lead
|0.28||[ 0.13; 0.39]||-||-||-||-|
T2m, monthly sim., short lead
|T2m, 10-day sim., short lead||-||-|
Tables 2-5 show that
Several notes to the skill scores in Tables 2-5 are in order:
Although a relatively high skill score is obtained for persistence hindcasts for June, the result is not robust. For other periods than 1979-93 (not shown), the predictive skill is considerably lower. This is not the case for the other predictable months. The relatively wide 95%-confidence interval in Table 4 for the persistence hindcasts for June also gives an indication that the skill score for June is less robust than for the other predictable months.
It is puzzling that the long lead time simulated monthly 2m temperature for June is a better predictor than the short lead time monthly simulation. The difference gives an indication of the level of uncertainty in the estimation of skill scores.
Although the 95%-confidence intervals in the tables give an indication of the skill score uncertainties, they do not take into account all sources of uncertainty and, in particular, they can be poor estimates of skill scores for other periods than for the period for which they are derived.
The dynamical model simulations are carried out with prescribed, observed SST. Over sea the simulated 2m temperature closely resembles the SST, so if simulated 2m temperature over sea grid points were included in the predictor field, the predictor would essentially contain observed temperature for the prediction target month. Consequently, the predictive skill would be largely overestimated when the predictor is simulated 2m monthly mean temperature. In order to reduce this problem, dynamical model simulated 2m temperature is only included from model land points. As the simulated 2m temperature over land can still be affected by SST from adjacent grid points the predictive skill presented in Tables 2-5 for predictors of simulated 2m monthly mean temperature should be regarded as an upper limit.
The short lead time dynamical simulations have not proven more skilful than the statistical predictions, but for the skilful months they provide valuable information about the reliability of the statistical predictions. In order to demonstrate this the predictive skill is re-calculated and shown in Tables 6-7 for each statistical hindcast, but including only those years for which the sign of the hindcast anomaly based on the 10-day simulations agrees (disagrees) with that of the statistical hindcast.
For the March and June hindcasts the skill scores are
substantially increased when the 10-day simulations agree with
the statistical hindcasts, i.e. for these months more reliable
forecasts can be expected when a medium-range forecast agrees
with the statistical monthly outlook. Conversely, the skill
scores are decreased when the 10-day simulations disagree with
the statistical hindcasts. In fact, for March the statistical
hindcasts show no predictive skill at all. Note that the skill
scores in Tables 6-7 should be treated with some caution as some
of them are based on as little as 7 years.
|T2m, obs., Tranebjerg stn.||-||-||0.79||[ 0.75; 0.82]||0.79||[ 0.70; 0.85]||-||-|
|-||-||0.84||[ 0.79; 0.88]||0.19||[ 0.01; 0.29]|
|T2m, obs., Tranebjerg stn.||-||-||0.34||[ 0.24; 0.42]|
|-||-||0.31||[ 0.24; 0.40]|
|Lead time||Predictor month|
Relatively skilful monthly persistence hindcasts with a lead time of one month can be made in winter and spring; for the rest of the year the persistence of monthly temperature anomalies does not extend beyond the month following the predictor month. Due to the longer training period the skill scores in Table 8 are more robust than those in Tables 2-5.
Note that the months indicated in Table 8 are the predictor months so that, e.g., the predictive skill of February hindcasts with zero lead is found for lead time = 0 and predictor month = January. It would perhaps be more natural to show predictive skill for a combination of lead time and predictand month, but by showing the predictive skill as a function of predictor month, as in Table 8, it becomes clear that the predictive skill seems more closely linked to the predictor than to the predictand, so that temperature for certain months is a skilful predictor rather than temperature for certain months being more predictable. For example, for lead times of zero and one month, February is a better predictor month than March, so predictions of April mean temperatures should be based on February mean temperatures (correlation skill 0.56) rather than on March mean temperatures (correlation skill only 0.32).
Monthly predictability in north-western Europe is low compared to many other regions of the globe. The associated forecast uncertainty is most conveniently expressed in terms of forecast probabilities. For dynamical forecast models the probabilities can be derived from an ensemble of forecasts which can be produced by computing forecasts from a set of different initial conditions. For statistical models it is not so clear how an ensemble distribution translates to forecast probabilities when the ensemble is generated by varying the training period as described in Sect. 3.2. For both model types forecast reliability becomes a problem if the forecasts contain systematic biases. The hindcasts in Fig. 4 provide an example where the deviation between ensemble hindcasts and observations cannot simply be explained by sampling problems; there are years in which the ensemble forecast is nowhere near the observed temperature, e.g. in 1983 and 1986.
It is crucial that probability forecasts are reliable, i.e. over time the forecast probabilities should agree with the corresponding observed frequencies (Wilks, 1995). This can be ensured by estimating the joint probability density function for a set of hindcasts and observations and subsequently deriving the conditional probability density of observed temperature given a specific model forecast as illustrated by the following example.
Consider the observations and the ensemble means of the hindcasts shown in Fig. 4. A scatterplot is shown in Fig. 5. Hindcasts and observations are then both divided into categories such that the coldest 25% define the ``cold'' category, the middle 50% define the ``normal'' category, and the warmest 25% define the ``warm'' category. The categories are indicated in Fig. 5. The 25-50-25 division is an arbitrary choice; terciles are probably more widely used.
Note that the thresholds between categories are not the same for
hindcasts and observations. The use of different thresholds is a
way to correct for model bias when the predictions are expressed
in categorical terms. However, in the present example much of
the threshold differences are probably due to the uncertainty
associated with a small sample size (only 14 observations).
The joint distribution of hindcast and observation categories can
be described by a contingency table such as the one shown in
Table 9. Forecast probabilities can easily be derived from the
contingency table. If a statistical forecast for February 1994
was, say, -2°C, i.e. ``cold'' then the probabilities
according to Table 9 would be
26/45 = 58% for ``cold'',
= 42% for ``normal'' and 0% for ``warm''.
A 0% probability that February will be warm is unrealistic. For a larger sample of observations there would almost certainly be no zero entries in the contingency table. The example illustrates that it can be problematic to estimate probabilities from contingency tables that are based on small samples.
If a larger hindcast sample is not available then an alternative approach is to fit the parameters in a two-dimensional parametric probability distribution to the pairs of hindcasts and observations. The natural starting point is to assume a bivariate normal (Gaussian) distribution. Five parameters need to be estimated: mean and variance of hindcasts and observations and the linear correlation between hindcasts and observations, e.g., (Wilks, 1995).
Let T=T0 be a deterministic model forecast. Then a
probability forecast is given by the conditional probability
f(T | Tmodel = T0) which is itself
Gaussian with parameters
For the example shown in Fig. 5 we find for the hindcast ensemble means, and ; for the observations, and ; and the correlation between hindcast ensemble means and observations, .
For a forecast, T0=-2°C, we find the probability density
which is shown in Fig. 6 together with the climatological
probability density, i.e. the probability density for all
observations. ``Cold'', ``normal'' and ``warm'' categories can
be defined by finding the 25% and 75% percentiles in the
climatological distribution. The thresholds are indicated in
Fig. 6. The forecast probabilities are found as the areas under
the forecast probability density to be 54.7% for ``cold'',
42.1% for ``normal'' and 3.2% for ``warm'', which happen to be
very close to the probabilities that resulted from use of Table
The parametric approach is appealing, particularly when the joint probability density function is based on a short verification period. The parametric approach is more robust than the approach that is based on a contingency table, but the underlying assumption about normality has not been justified in the above example, and it may indeed prove difficult to establish which parametric distribution to use, particularly when only a short verification period is available.
Figure 7 shows an example of cross-validated persistence
probability hindcasts assuming a bivariate normal distribution.
In each step of the cross-validation the parameters describing
the joint distribution of predictions and observations have been
computed from data in the training period. The probabilistic
hindcasts are shown as the estimated 80% confidence intervals.
The probability hindcasts appear to be well calibrated as the
hindcast confidence intervals encompass the observations in 11 of
the 14 cases which is very close to the expected 80%.
Hindcasts of monthly mean temperature in Denmark show large seasonal variation in predictive skill. Highest predictability is found in February, March, June and August; no predictability is found in October, November and December. High predictability is closely linked to high persistence which, in turn, is probably linked to the influence of local sea surface temperature.
The results demonstrate that the statistical predictions are at least as skilful as the dynamical simulations; in most months they are more skilful.
One of the main motivations for the present study has been to improve the monthly outlooks that DMI issues at the beginning of every month.
While the simple outlooks based only on the temperature in Denmark in the previous month appear to give the highest skill scores in several months of the year, the use of larger predictor data sets can be used to improve the outlooks for some months, particularly in the transition seasons. The relatively poor skill scores for the dynamical simulations (which can reasonably be taken as an upper estimate for the skill of dynamical predictions) suggest that monthly outlooks given the state of the present generation of GCMs should be based primarily on statistical predictions.
For the highly predictable months an operational medium-range forecast might be consulted as an indicator for the reliability of the actual statistical monthly prediction. If the medium-range forecast points in the same direction as the statistical forecast the latter can be expected to be much more reliable than otherwise.
For longer lead times the predictability drops. For a lead time of one month relatively skilful statistical forecasts can still be made for the winter and spring months, and dynamical model simulations appear to show significant skill for the summer months. For lead times beyond one month the skill is generally very low for both statistical and dynamical predictions.
Seasonal climate prediction is an area of intense research these years, and the quality of coupled ocean-atmosphere GCMs is likely to improve substantially over the next decade or so, so although statistical methods presently perform better than dynamical models, one may expect that dynamical monthly and seasonal predictions eventually will beat the statistical methods.
It is crucial that probability forecasts are reliable, i.e. over time the forecast probabilities should agree with the corresponding observed frequencies. If this is not the case for an ensemble forecast then the forecast should be recalibrated, or if the raw model output cannot be described in probabilistic terms then probabilities should be derived in a way that ensures reliable probability forecasts. For the latter case, if the joint probability distribution of predictions and observations can be confidently described in terms of a parametric distribution (Gaussian or some other distribution), then a probability forecast can be based on this parametric distribution by deriving the conditional probability density of observed temperature given the model prediction.
Recalibration of an ensemble forecast is not so straightforward and has not been dealt with in the present study. The simple approach is to apply the above-mentioned conditional probability methodology to the ensemble mean, but as the spread of dynamical model ensembles is likely to give an indication of the reliability of the model (ensemble) forecast, there is a need to take into account more information about the ensemble when the ensemble forecasts are recalibrated. This problem will be pursued in future work.
Based on the findings of this report the following forecast strategy is recommended:
Re 1. For some months, particularly for one month lead time, the forecasts are expected to have no skill. In these cases one can choose to either not issue a forecast or to let the missing skill be reflected in the forecast probabilities.
Re 2. Zero-lead persistence forecasts for May, and possibly for July, are the only persistence forecasts that do not match the predictive skill of more advanced statistical or dynamical forecasts. The performance of dynamical monthly predictions should, however, be monitored continuously as dynamical models are likely to improve over the next few years.
Re 3. The parametric recalibration described in Sect. 4 is recommended provided that a parametric form of the joint probability distribution can be found (e.g. by transforming the temperature distribution so that the transformed variable follows a Gaussian distribution). Otherwise, contingency tables can be used to derived probabilities.
Re 4. For March and June it has been demonstrated (Table 6) that persistence forecasts are more accurate if there is agreement with medium-range forecasts for the first third of the target month. Consequently, the forecast probability density function should be sharpened if the medium-range forecast agrees with the persistence forecast, and it should be broadened if the medium-range forecast disagrees with the persistence forecast