A Procedure to Analyze Air Quality Data for the Detection of Linear Time Trends


by

Raymond Wong

Air Policy Branch, Alberta Environment

Revised May 26, 2010



Table of Contents

Introduction
    The characteristics of air quality data
    The objective of the present procedure
The model
The estimates
    The trend
    The autoregressive coefficient
The pre-whitening to remove persistence effect
    Trend-free pre-whitening
The test
    The p-value and one or two tailed test
Interpretation of results
References



Introduction:

The detection of linear time trends in air quality data is a common process when assessing the state of the atmospheric environment. It provides an overall description of

a time sequence and a first estimate of the general time variation of the data. In such

situations, the emphasis is not on the details of how the air quality changes over time, but

more in the general direction of change, and the magnitude of the change. In addition,

one would like to say whether the change is statistically significant.

It is important to analyze air quality data based on their special characteristics. Some

classical statistical analysis techniques may not be appropriate as these characteristics

violate the basic assumptions of these tests. Their use may lead to erroneous results. The

present report describes a statistical procedure to analyze air quality trends taking into

consideration the major characteristics of air quality data. The procedure involves a

distribution-free estimate of trend (the Theil-Sen estimate), a distribution-free statistical

test for trend (the Mann-Kendall test) and a recently developed approach to remove the

effect of persistence (trend-free pre-whitening). A step-by-step description of how to analyze air

quality data using this procedure is given.

The characteristics of air quality data:

Air quality data have certain special characteristics that need to be considered when being

analyzed statistically.

(1) The data are not usually obtained under sophisticated experimental designs. There are

no controlled applications of treatments under a carefully administered randomization

scheme. They are more likely obtained through measurements using monitoring

networks or specific monitoring stations.



(2) The data variability is large and there are many factors affecting it. Observations of

air pollutant concentration at a point could be influenced by the emission source

characteristics, emission rate, winds, temperatures, precipitation, solar radiation, terrain,

surface land condition, etc. In addition, different factors may have different importance

at different times or places, and they may often interact.

(3) Skewed data distributions are common. Air quality data often follow heavily skewed

probability distributions. The skewness is usually positive, meaning there are many more

smaller observations and fewer larger ones. This means that the normal distribution

assumption required by many classical statistical methods is often violated. The

possible existence of outlier observations, in the form of unusually large values, also

contributes to significant bias in many classical techniques. Hence many resort to distribution-free methods for analyzing air quality data or model them with heavy-tailed

probability distributions like the lognormal, gamma, Weibull, etc.

(4) Persistence or auto-correlation often exists in the data. Persistence means the data

correlate with themselves at different times. For example, what happens today depends

on what happened yesterday. There are a number of possible causes for persistence in air

quality data. One possible factor is weather conditions, and a typical situation is when a

pollution episode lasts over several observational periods. Persistence means that the

independence assumption of many classical statistical methods is not satisfied. The effect

is that test results may be biased and may lead to erroneous conclusions. One estimate is

that with persistence alone, one can falsely detect trends up to 80 percent of the time when no

trend actually exists in the data.

(5) Monitoring stations provide point measurements. However, for many practical

purposes, one needs to convert these point measurements to spatial estimates. This is the

realm of spatial statistics. An interesting comparison with hydrometric data is that the

point measurement of river discharge, say, is in itself a spatial estimate as it represents

the surplus of the water balance within a watershed. There is no comparable parameter

for an airshed and the definition of an airshed may be more political and administrative



than physical. Precipitation within a watershed also needs to be converted to

spatial estimates; however, the delineation of a watershed is more physical than

political or administrative.
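
The inflation of false detections under persistence, noted in item (4) above, can be illustrated with a small simulation. The sketch below is not part of the procedure itself. It assumes a trend-free AR(1) series with Gaussian innovations, an illustrative coefficient of 0.6, a series length of 50, and the normal approximation to the Mann-Kendall test described later in this report. All function names are hypothetical.

```python
import math
import random


def mann_kendall_z(y):
    """Mann-Kendall Z with continuity correction (no tie correction needed
    here because the simulated values are continuous)."""
    n = len(y)
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            diff = y[j] - y[k]
            s += (diff > 0) - (diff < 0)  # sgn(y_j - y_k)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def ar1_series(n, phi, rng):
    """Trend-free AR(1) series, started from a standard normal draw (no burn-in)."""
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def rejection_rate(phi, n=50, reps=300, seed=1):
    """Fraction of trend-free series declared significant at the two-sided 5% level."""
    rng = random.Random(seed)
    hits = sum(abs(mann_kendall_z(ar1_series(n, phi, rng))) > 1.96 for _ in range(reps))
    return hits / reps


rate_white = rejection_rate(phi=0.0)  # independent data
rate_red = rejection_rate(phi=0.6)    # persistent data, still no trend
print(f"white noise: {rate_white:.2f}, AR(1) with phi = 0.6: {rate_red:.2f}")
```

Under these assumptions the rejection rate for the persistent series should come out well above the nominal 5 percent, while the rate for independent data stays close to it. The exact figures depend on the coefficient, the series length and the seed.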

The objective of the present procedure:

The present procedure aims to estimate and detect linear trends in air quality data, taking

into consideration that the data distribution may be positively skewed and that there may

be persistence in the data time sequence.

A distribution-free trend test is used and a pre-whitening process is implemented when

persistence is indicated by the autocorrelation coefficient. A distribution-free estimate of the trend is also involved. This procedure is for air quality data observed at a point.

Areal estimates and field significance of trends are not within the scope of the present

work and will be discussed elsewhere.

The models:

The model for a linear trend is basically that of a straight line. To draw a straight line

through a number of data points and test to see if the straight line trend is significant is

the statistical problem of simple linear regression. The time trend model in simple linear

regression with autocorrelated errors is:

$$y_t = \alpha_1 + \beta_1 t + \varepsilon_t; \qquad \varepsilon_t = \phi\,\varepsilon_{t-1} + \eta_t \tag{1}$$

where the subscript $t$ represents the time point and $\beta_1$ is therefore the trend of $y_t$. The disturbance term $\eta_t$ is assumed to be i.i.d. (independent and identically distributed) and the error term $\varepsilon_t$ is a first order autoregressive (AR1) process. The AR1 is considered useful as a first approximation in many environmental applications where $\phi$, the autocorrelation coefficient, is positive and less than unity, which also means that the AR1 process is stationary and the linear trend is entirely contained in the $\beta_1 t$ term (Zheng et al., 1997). For a given AR1 coefficient $\phi$, simple algebraic manipulation of Equation (1) will lead to the following model:

$$y_t = \alpha_2 + \beta_2 t + \phi\, y_{t-1} + \eta_t \tag{2}$$

with the relationship

$$\beta_2 = (1 - \phi)\,\beta_1, \qquad \alpha_2 = (1 - \phi)\,\alpha_1 + \phi\,\beta_1 \tag{3}$$

One may consider $\beta_2$ as the persistence adjusted trend, and persistence is explicitly represented by the $y_{t-1}$ term. For convenience, we will refer to Equation (1) as Model 1 and Equation (2) as Model 2. In Model 2, the persistence in the error of Model 1 is transformed into persistence in the variable $y_t$ itself. Model 1 and Model 2 are linked via the relation (3). It means that persistence in data is directly related to persistence in residual errors.
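
The algebra linking Model 1 to Model 2 can be written out explicitly. The following is a sketch of that manipulation, with notation assumed for this illustration: $\alpha_1$ and $\beta_1$ are the intercept and trend of Model 1, $\phi$ is the AR1 coefficient, $\varepsilon_t$ is the AR1 error and $\eta_t$ is the i.i.d. disturbance. Subtracting $\phi$ times the lagged Model 1 equation from the current one gives:

```latex
\begin{align*}
y_t - \phi\, y_{t-1}
  &= \bigl[\alpha_1 + \beta_1 t + \varepsilon_t\bigr]
     - \phi \bigl[\alpha_1 + \beta_1 (t-1) + \varepsilon_{t-1}\bigr] \\
  &= (1-\phi)\,\alpha_1 + \phi\,\beta_1 + (1-\phi)\,\beta_1 t
     + \bigl(\varepsilon_t - \phi\,\varepsilon_{t-1}\bigr) \\
  &= \underbrace{(1-\phi)\,\alpha_1 + \phi\,\beta_1}_{\alpha_2}
     + \underbrace{(1-\phi)\,\beta_1}_{\beta_2}\, t + \eta_t .
\end{align*}
```

Moving $\phi\, y_{t-1}$ to the right-hand side gives the form of Model 2. Because $0 < \phi < 1$, the persistence adjusted trend $\beta_2$ has the same sign as $\beta_1$ but a smaller magnitude.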

Both models are commonly used in detecting trends with persistence, but there is a

difference between the two sets of parameters, as represented by (3), in the interpretation

of trends. A useful description is to consider the deterministic part and the stochastic

part of Model 2. The deterministic term $\beta_2 t$ represents a constant rate and direction of

change, whereas the stochastic term $\phi\, y_{t-1}$ represents random trending and wandering due

to autocorrelation and can lead to increases and decreases (Woodward et al., 1997). The

mechanisms behind the deterministic and stochastic terms can be quite different. For



The estimates:

The trend:

The estimate for trend in the present procedure is a special case of the Hodges-Lehmann estimate, sometimes referred to as the Theil-Sen slope estimate (see Hollander and Wolfe 1973, p.206, and also Sen, 1968). It is related to Kendall's rank correlation and is a form of central tendency of sample slope estimates. From the computational point of view, the Theil-Sen estimate is the median of all pair-wise slope estimates, each computed from a pair of data points in the sequence. This is a commonly used distribution-free alternative to the least squares regression estimate. See Wang and Swail (2001), Hirsch et al. (1982) and Burn and Elnur (2002) for more discussion.

Let $t_1 < t_2 < t_3 < \dots < t_T$ be the sampling times. There are $T(T-1)/2$ distinct pairs of values in this sequence of length $T$, $Y_i$, $i = 1, 2, \dots, T$. Then

$$\hat{\beta} = \operatorname{median}\left\{ \frac{Y_j - Y_i}{t_j - t_i} \right\}, \qquad 1 \le i < j \le T$$

is the Theil-Sen estimate of slope.
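
The median-of-pairwise-slopes calculation can be sketched in a few lines. This is a minimal illustration, assuming equally spaced sampling times 1..T when none are supplied. The function name and the sample values are hypothetical.

```python
from statistics import median


def theil_sen_slope(y, t=None):
    """Median of all pairwise slopes (y[j] - y[i]) / (t[j] - t[i]) for i < j."""
    if t is None:
        t = list(range(1, len(y) + 1))  # assumption: equally spaced times 1..T
    slopes = [
        (y[j] - y[i]) / (t[j] - t[i])
        for i in range(len(y) - 1)
        for j in range(i + 1, len(y))
        if t[j] != t[i]  # skip pairs sharing a sampling time
    ]
    return median(slopes)


# Hypothetical annual means with one unusually large value in the last year.
y = [10.0, 11.0, 12.0, 13.0, 40.0]
print(theil_sen_slope(y))  # 1.0
```

Six of the ten pairwise slopes in this example equal 1, so the median is unaffected by the outlier, whereas a least squares slope would be pulled upward by it.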

The autoregressive coefficient:

There are several ways to estimate the AR1 coefficient $\phi$. Some of the classical methods include Burg's algorithm, the Yule-Walker equations, least absolute deviation and ordinary least squares methods. Preliminary results from our simulation experiments indicate that most of these methods have relatively high variability and the familiar underestimation of $\phi$. An alternative approach is that of lag-one rank correlation. This means the calculation of Spearman's rank correlation between the original data sequence and the same sequence lagged by one time point. This is implemented in the present procedure. Note that in estimating the AR1 coefficient, we always lose one data point.
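
The lag-one rank correlation can be sketched as follows. This is an illustration only, assuming that Spearman's rho is computed as the Pearson correlation of average ranks (so that ties are handled) and that the series is not constant. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lag1_spearman(y):
    """Spearman's rho between y[1:] and y[:-1]; one data point is lost."""
    return pearson(average_ranks(y[1:]), average_ranks(y[:-1]))


# Hypothetical zigzag sequence: each value tends to reverse the previous move.
print(lag1_spearman([1, 5, 2, 6, 3, 7]))
```

In the procedure itself this estimate is applied to the detrended residuals rather than to the raw data, as described in the trend-free pre-whitening steps.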



The pre-whitening to remove persistence effect:

It is known that many trend tests, distribution-free or parametric, are very sensitive to persistence (autocorrelation) effects. The Mann-Kendall test is used in the present procedure. The presence of autocorrelation in the data can seriously bias the Mann-Kendall test results. Much recent research effort has focused on the development of methods to model or remove the persistence effect in testing for trends. The approach adopted in the present procedure is that of trend-free pre-whitening (TFPW). See for example Yue et al. (2002a). Following Wang and Swail (2001), we will invoke TFPW when the AR1 coefficient exceeds a certain level in absolute value (say 0.05). This is to ensure that any indication of persistence effect will be accounted for.

Trend-free pre-whitening:

Pre-whitening means the removal of persistence (autocorrelation) effect. The analogy is

with light where white light means equal contribution from various frequencies in the

visible spectrum. By the same token, pure random data are called white noise, where

there is no predominance from any particular range of frequencies. Persistence in the

form of first order autocorrelation is like having a low frequency predominance in the

spectrum. In terms of the visible light spectrum, the color would tend to be red, and

analysts have termed the first order autoregressive process a red noise process. Hence the

removal of persistence is to remove the redness in the data and make them white again,

and the process to remove persistence before further analysis is called pre-whitening.

Trend-free pre-whitening refers to the process of doing pre-whitening by first removing

the linear trend. This is to ensure minimal interaction between trend and persistence

which often creates problems in both estimation and testing. Persistence (represented by

the AR1 coefficient $\phi$) in the residuals (the data that remain after the removal of trend) is

then estimated and removed by inverse filtering. The removed trend is then added back



onto the pre-whitened data and the Mann-Kendall test is applied to test for significance of

trend. Specifically, the technique involves the following steps:

(1) The linear trend $\beta_1$ is first estimated from the data sequence, using the Theil-Sen

approach mentioned above.

(2) The data sequence is then detrended using the result in (1). This is to remove the

linear trend from the data.

(3) The resulting residuals are used in the estimation of the AR1 autoregressive

coefficient using Spearman's rho.

(4) If the absolute value of the AR1 coefficient is less than or equal to 0.05, the original set

of data is tested for trends using the Mann-Kendall test. However, if it is greater than 0.05,

then the following steps occur.

(5) The same residuals are then inverse-filtered using the AR1 coefficient to remove the

persistence effect. This leaves a prewhitened sequence of residuals.

(6) The prewhitened residuals are then added back to the trend which has been removed

earlier in step (2).

(7) This final sequence, which now has only the original trend but no persistence in the

residuals, is then subjected to the Mann-Kendall test.

The test:

The distribution-free test for trend used in the present procedure is the Mann-Kendall test

(Mann 1945, Kendall 1975). See also Gilbert (1987). This is a test for the significance

of linear trend which handles missing data better than Spearman's rho and has

similar power (Yue et al., 2002b). The null hypothesis to be tested is:



$H_0$: The data sequence $Y_i$, $i = 1, 2, \dots, T$, is a random sample of $T$

independent and identically distributed variables.

If a trend exists, $H_0$ will be rejected at the specified level of significance. The test

statistic for the Mann-Kendall test for trend is:

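$$S = \sum_{k=1}^{T-1} \sum_{j=k+1}^{T} \operatorname{sgn}(Y_j - Y_k)$$

where

$$\operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$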

The distribution of $S$ is symmetric about zero and is normal in the limit as $T$ tends to infinity. However, a good approximation by the normal distribution can be attained at a sample size ($T = n$) of about 40. See for example Gilbert (1987). In fact, Mann (1945) and Kendall (1975, p.55) have documented that the normal approximation can be applied to cases with $n$ as small as about 8 if there are not many ties. With the normal approximation, it is assumed that the expected value and variance of $S$ under $H_0$ are:

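$$E(S) = 0$$

$$V(S) = \frac{n(n-1)(2n+5) - \sum_{m=1}^{n} t_m\, m(m-1)(2m+5)}{18}$$

where $t_m$ is the number of ties of extent $m$. The standardized test statistic $Z$ with continuity correction is computed by:

$$Z = \begin{cases} \dfrac{S - 1}{\sqrt{V(S)}} & \text{if } S > 0 \\ 0 & \text{if } S = 0 \\ \dfrac{S + 1}{\sqrt{V(S)}} & \text{if } S < 0 \end{cases}$$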

The p-value is then computed based on the standard normal distribution:

$$\Pr(Z \ge z) = \frac{1}{\sqrt{2\pi}} \int_{z}^{\infty} \exp\left(-\frac{t^2}{2}\right) dt$$

for positive $Z$, and vice versa for negative $Z$.
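
The statistic, its tie-corrected variance, the continuity-corrected Z and the normal p-value can be computed together. The sketch below is one possible arrangement and assumes a two-sided p-value is wanted (the one-sided value is half of it). The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def mann_kendall(y):
    """Return (S, Var(S), Z, two-sided p) using the normal approximation."""
    n = len(y)
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            diff = y[j] - y[k]
            s += (diff > 0) - (diff < 0)  # sgn(y_j - y_k)
    # Tie correction: each group of m equal values contributes m(m-1)(2m+5);
    # groups of size 1 contribute zero.
    tie_term = sum(m * (m - 1) * (2 * m + 5) for m in Counter(y).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Upper-tail probability of the standard normal beyond |z|.
    p_one_sided = 0.5 * math.erfc(abs(z) / math.sqrt(2.0))
    return s, var_s, z, 2.0 * p_one_sided


# Hypothetical sequence of ten annual values.
s, var_s, z, p = mann_kendall([4.1, 4.3, 4.0, 4.6, 4.8, 4.7, 5.1, 5.0, 5.4, 5.6])
print(s, round(var_s, 2), round(z, 3), round(p, 4))
```

Summing over tie groups is equivalent to the sum over tie extents $m$ weighted by $t_m$ in the variance formula.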

For shorter sequences, the p-values are available from tables. As an alternative, the p-value can also be assessed using a permutation approach. In this case, the null distribution of the test statistic, $S$, is derived by randomly generating a large number of permuted sequences (say 5000) from the original data and calculating the test statistic for each.

The p-value is the probability of the observed S being exceeded under the null

distribution. One can compare this permutation p-value to the p-value from normal

approximation and from tables and find that there is good agreement. The permutation

p-value is used for determining trend significance in all cases.
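
The permutation approach can be sketched as follows. This is an illustration under stated assumptions: the p-value is taken as two-sided, "exceeded" is read as "equalled or exceeded in absolute value", and one is added to numerator and denominator so that the estimate is never exactly zero. None of these choices is specified in the text, and the function names are hypothetical.

```python
import random


def mk_statistic(y):
    """Mann-Kendall S: sum of sgn(y[j] - y[k]) over all pairs k < j."""
    n = len(y)
    return sum(
        (y[j] > y[k]) - (y[j] < y[k])
        for k in range(n - 1)
        for j in range(k + 1, n)
    )


def permutation_p_value(y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the Mann-Kendall statistic S."""
    rng = random.Random(seed)
    observed = abs(mk_statistic(y))
    shuffled = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reordering destroys any time trend
        if abs(mk_statistic(shuffled)) >= observed:
            count += 1
    # Assumed convention: add one to both counts to avoid a p-value of zero.
    return (count + 1) / (n_perm + 1)


# Hypothetical sequence of ten annual values.
print(permutation_p_value([4.1, 4.3, 4.0, 4.6, 4.8, 4.7, 5.1, 5.0, 5.4, 5.6]))
```

Because shuffling preserves the set of observed values, ties are carried through to the null distribution without any separate correction.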

The p-value and one- or two-tailed test:

Statistical significance at a specified level, say 5 percent, means that the probability of

observing such a trend through random chance alone is small (5 percent or less). In other words, one is

saying that the trend exists, and the probability of one being wrong is small. A p-value is

the probability of having a test statistic (in this case S ) that is as or more extreme than the

observed one.



References:

Box, G.E.P. and Jenkins, G.M. 1976. Time series analysis, forecasting and control. Revised edition, Holden-Day, San Francisco. 575pp.

Burn, D.H. and Elnur, M.A.H. 2002. Detection of hydrological trends and variability. J. of Hydrol. 255, 107-122.

Gilbert, R.O. 1987. Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York. 320pp.

Hirsch, R.M., Slack, J.R. and Smith, R.A. 1982. Techniques of trend analysis for monthly water quality data. Water Resources Research. 18(1), 107-121.

Hollander, M. and Wolfe, D.A. 1973. Nonparametric Statistical Methods. Wiley, New York. 503pp.

Kendall, M.G. 1975. Rank Correlation Methods. Griffin, London.

Mann, H.B. 1945. Nonparametric test against trend. Econometrica. 13, 245-259.

Milionis, A.E. and Davis, T.D. 1994. Regression and stochastic models for air pollution I. Review, comments and suggestions. Atmospheric Environment. 28(17), 2801-2810.

Nankervis, J.C. and Savin, N.E. 1996. The level and power of the bootstrap t test in the AR(1) model with trend. J. of Business and Economic Statistics. 14, 161-168.

Park, R.E. and Mitchell, B.M. 1980. Estimating the autocorrelated error model with trended data. J. of Econometrics. 13, 185-201.

Sen, P.K. 1968. Estimates of the regression coefficient based on Kendall's tau. J. of Amer. Statist. Assoc. 63, 1379-1389.

Wang, X.L. and Swail, V.R. 2001. Changes of extreme wave heights in Northern Hemisphere oceans and related atmospheric circulation regimes. J. of Climate. 14, 2204-2221.

Weiss, A. 1990. Least absolute error estimation in the presence of serial correlation. J. of Econometrics. 44, 127-158.

Woodward, W.A., Bottone, S. and Gray, H.L. 1997. Improved tests for trend in time series data. J. of Agricultural, Biological, and Environmental Statistics. 2(4), 403-416.

Yue, S. and Pilon, P. 2003. Interaction between deterministic trend and autoregressive process. Water Resources Research. 39(4), 1077, doi:10.1029/2001WR001210.

Yue, S., Pilon, P., Phinney, B. and Cavadias, G. 2002a. The influence of autocorrelation on the ability to detect trend in hydrological series. Hydrol. Process. 16, 1807-1829.

Yue, S., Pilon, P. and Cavadias, G. 2002b. Power of the Mann-Kendall and Spearman's rho tests for detecting monotonic trends in hydrological series. J. of Hydrol. 259, 254-271.

Zheng, X., Basher, R.E. and Thompson, C.S. 1997. Trend detection in regional-mean temperature series: Maximum, minimum, mean, diurnal range and SST. J. of Climate. 10, 317-326.
