A Procedure to Analyze Air Quality Data for the Detection of Linear Time Trends


  • 8/10/2019 A Procedure to Analyze Air Quality Data for the Detection of Linear Time Trends

    A Procedure to Analyze Air Quality Data for the Detection of Linear Time Trends

    by

    Raymond Wong

    Air Policy Branch, Alberta Environment

    Revised May 26, 2010

    Table of Contents

    Introduction
    The characteristics of air quality data
    The objective of the present procedure
    The model
    The estimates
    The trend
    The autoregressive coefficient
    The pre-whitening to remove persistence effect
    Trend-free pre-whitening
    The test
    The p-value and one or two tailed test
    Interpretation of results
    References

    Introduction:

    The detection of linear time trends in air quality data is a common process when assessing the state of the atmospheric environment. It provides an overall description of a time sequence and a first estimate of the general time variation of the data. In such situations, the emphasis is not on the details of how the air quality changes over time, but more on the general direction of change and the magnitude of the change. In addition, one would like to say whether the change is statistically significant.

    It is important to analyze air quality data based on their special characteristics. Some classical statistical analysis techniques may not be appropriate because these characteristics violate the basic assumptions of the tests, and their use may lead to erroneous results. The present report describes a statistical procedure to analyze air quality trends that takes the major characteristics of air quality data into consideration. The procedure involves a distribution-free estimate of trend (the Theil-Sen estimate), a distribution-free statistical test for trend (the Mann-Kendall test) and a recently developed approach to remove the effect of persistence (trend-free pre-whitening). A step-by-step description of how to analyze air quality data using this procedure is given.

    The characteristics of air quality data:

    Air quality data have certain special characteristics that need to be considered when the data are analyzed statistically.

    (1) The data are not usually obtained under sophisticated experimental designs. There are no controlled applications of treatments under a carefully administered randomization scheme. The data are more likely obtained through measurements from monitoring networks or specific monitoring stations.

    (2) The data variability is large and there are many factors affecting it. Observations of air pollutant concentration at a point could be influenced by the emission source characteristics, emission rate, winds, temperatures, precipitation, solar radiation, terrain, surface land condition, etc. In addition, different factors may have different importance at different times or places, and they may often interact.

    (3) Skewed data distributions are common. Air quality data often follow heavily skewed probability distributions. The skewness is usually positive, meaning there are many more smaller observations and fewer larger ones. This means that the normal distribution assumption required by many classical statistical methods is often violated. The possible existence of outlier observations, in the form of unusually large values, also contributes to significant bias in many classical techniques. Hence many analysts resort to distribution-free methods for analyzing air quality data, or model the data with heavy-tailed probability distributions such as the lognormal, gamma, Weibull, etc.

    (4) Persistence or autocorrelation often exists in the data. Persistence means the data correlate with themselves at different times. For example, what happens today depends on what happened yesterday. There are a number of possible causes for persistence in air quality data. One possible factor is weather conditions, and a possible situation is when a pollution episode lasts over several observational periods. Persistence means that the independence assumption of many classical statistical methods is not satisfied. The effect is that test results may be biased and may lead to erroneous conclusions. By one estimate, persistence alone can cause a test to indicate a trend up to 80 percent of the time when no trend actually exists in the data.

    (5) Monitoring stations provide point measurements. However, for many practical

    purposes, one needs to convert these point measurements to spatial estimates. This is the

    realm of spatial statistics. An interesting comparison with hydrometric data is that the

    point measurement of river discharge, say, is in itself a spatial estimate as it represents

    the surplus of the water balance within a watershed. There is no comparable parameter

    for an airshed and the definition of an airshed may be more political and administrative

    than physical. Precipitation within a watershed also needs to be converted to spatial estimates, but the delineation of a watershed is more physical than political or administrative.

    The objective of the present procedure:

    The present procedure aims to estimate and detect linear trends in air quality data, taking

    into consideration that the data distribution may be positively skewed and that there may

    be persistence in the data time sequence.

    A distribution-free trend test is used, and a pre-whitening process is implemented when persistence is indicated by the autocorrelation coefficient. A distribution-free estimate of the trend is also involved. This procedure is for air quality data observed at a point. Areal estimates and field significance of trends are not within the scope of the present work and will be discussed elsewhere.

    The models:

    The model for a linear trend is basically that of a straight line. To draw a straight line through a number of data points and test whether the straight line trend is significant is the statistical problem of simple linear regression. The time trend model in simple linear regression with autocorrelated errors is:

    $$ y_t = \mu_1 + \beta_1 t + \varepsilon_t ; \qquad \varepsilon_t = \phi\,\varepsilon_{t-1} + \eta_t \qquad (1) $$

    where the subscript $t$ represents the time point and $\beta_1$ is therefore the trend of $y_t$. The disturbance term $\eta_t$ is assumed to be i.i.d. (independent and identically distributed) and the error term $\varepsilon_t$ is a first order autoregressive (AR1) process. The AR1 is considered useful as a first approximation in many environmental applications where $\phi$, the

    autocorrelation coefficient, is positive and less than unity, which also means that the AR1 process is stationary and the linear trend is entirely contained in the $\beta_1 t$ term (Zheng et al., 1997). For a given AR1 coefficient $\phi$, simple algebraic manipulation of Equation (1) leads to the following model:

    $$ y_t = \mu_2 + \beta_2 t + \phi\, y_{t-1} + \eta_t \qquad (2) $$

    with the relationship

    $$ \beta_2 = (1 - \phi)\,\beta_1 ; \qquad \mu_2 = (1 - \phi)\,\mu_1 + \phi\,\beta_1 \qquad (3) $$

    One may consider $\beta_2$ as the persistence-adjusted trend, with persistence explicitly represented by the $\phi\, y_{t-1}$ term. For convenience, we will refer to Equation (1) as Model 1 and Equation (2) as Model 2. In Model 2, the persistence in the error of Model 1 is transformed into persistence in the variable $y_t$ itself. Model 1 and Model 2 are linked via the relation (3). It means that persistence in data is directly related to persistence in residual errors.
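
    The "simple algebraic manipulation" linking Model 1 and Model 2 can be sketched as follows. The notation here is an assumption made for illustration: $\beta_1$ is the trend, $\mu_1$ the intercept, $\phi$ the AR1 coefficient, $\varepsilon_t$ the AR1 error and $\eta_t$ the i.i.d. disturbance of Model 1. Subtracting $\phi$ times the lagged equation from the current one removes the autocorrelated error:

```latex
\begin{aligned}
y_t - \phi\, y_{t-1}
  &= (\mu_1 + \beta_1 t + \varepsilon_t)
     - \phi\,\bigl(\mu_1 + \beta_1 (t-1) + \varepsilon_{t-1}\bigr) \\
  &= (1-\phi)\,\mu_1 + \phi\,\beta_1 + (1-\phi)\,\beta_1\, t
     + (\varepsilon_t - \phi\,\varepsilon_{t-1}) \\
  &= \mu_2 + \beta_2\, t + \eta_t ,
\end{aligned}
\qquad
\beta_2 = (1-\phi)\,\beta_1 , \quad
\mu_2 = (1-\phi)\,\mu_1 + \phi\,\beta_1 .
```

    Moving $\phi\, y_{t-1}$ to the right-hand side gives Model 2. For a positive $\phi$ less than unity, the persistence-adjusted trend $\beta_2$ is therefore smaller in magnitude than $\beta_1$ by the factor $(1-\phi)$.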

    Both models are commonly used in detecting trends with persistence, but there is a difference between the two sets of parameters, as represented by (3), in the interpretation of trends. A useful description is to consider the deterministic part and the stochastic part of Model 2. The deterministic term $\beta_2 t$ represents a constant rate and direction of change, whereas the stochastic term $\phi\, y_{t-1}$ represents random trending and wandering due to autocorrelation and can lead to both increases and decreases (Woodward et al., 1997). The

    mechanisms behind the deterministic and stochastic terms can be quite different. For

    The estimates:

    The trend:

    The estimate for trend in the present procedure is a special case of the Hodges-Lehmann estimate, sometimes referred to as the Theil-Sen slope estimate (see Hollander and Wolfe, 1973, p.206, and also Sen, 1968). It is related to Kendall's rank correlation and is a measure of the central tendency of sample slope estimates. From the computational point of view, the Theil-Sen estimate is the median of all pair-wise slope estimates, each computed from a pair of data points in the sequence. This is a commonly used distribution-free alternative to the least squares regression estimate. See Wang and Swail (2001), Hirsch et al. (1982) and Burn and Elnur (2002) for more discussion.

    Let $t_1 < t_2 < t_3 < \cdots < t_T$ be the sampling times. There are $T(T-1)/2$ distinct pairs of values in this sequence of length $T$, $Y_i$, $i = 1, 2, \ldots, T$. Then

    $$ \hat{\beta}_1 = \operatorname{median}\left\{ \frac{Y_j - Y_i}{t_j - t_i} \right\}, \quad \text{where } 1 \le i < j \le T $$

    is the Theil-Sen estimate of slope.
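
    The median-of-pairwise-slopes calculation can be sketched in a few lines of Python. This is a minimal illustration rather than the report's own implementation; the function name `theil_sen_slope` and the sample data are made up for the example, and the sketch assumes no missing values.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(times, values):
    """Median of all pairwise slopes (Y_j - Y_i) / (t_j - t_i) for i < j."""
    points = list(zip(times, values))
    slopes = [
        (yj - yi) / (tj - ti)
        for (ti, yi), (tj, yj) in combinations(points, 2)
        if tj != ti  # skip pairs that share a sampling time
    ]
    if not slopes:
        raise ValueError("need at least two observations at distinct times")
    return median(slopes)


# Made-up data: a steady rise of about 2 per time step with one large outlier.
t = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 6.0, 50.0, 10.1]
print(theil_sen_slope(t, y))
```

    Because the median is taken over all T(T-1)/2 slopes, the single outlier in the example has little influence on the estimate, which is the property that makes the approach attractive for skewed air quality data.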

    The autoregressive coefficient:

    There are several ways to estimate the AR1 coefficient $\phi$. Some of the classical methods include Burg's algorithm, the Yule-Walker equations, least absolute deviation and ordinary least squares. Preliminary results from our simulation experiments indicate that most of these methods have relatively high variability and show the familiar underestimation of $\phi$. An alternative approach is lag-one rank correlation, that is, Spearman's rank correlation between the original data sequence and the same sequence lagged by one time point. This is the approach implemented in the present procedure. Note that in estimating the AR1 coefficient, we always lose one data point.
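
    A lag-one rank correlation can be sketched as below. This is an illustrative pure-Python version, not the report's code: the helper names are hypothetical, ties receive average ranks, and Spearman's rho is computed as the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("rank correlation is undefined for a constant sequence")
    return sxy / (sxx * syy) ** 0.5


def lag1_spearman(series):
    """Spearman's rho between the series and itself lagged by one time point."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    # Pairing series[:-1] with series[1:] is where one data point is lost.
    return pearson(average_ranks(series[:-1]), average_ranks(series[1:]))


print(lag1_spearman([3.1, 3.4, 3.3, 3.9, 4.2, 4.0, 4.6]))
```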

    The pre-whitening to remove persistence effect:

    It is known that many trend tests, distribution-free or parametric, are very sensitive to the effect of persistence (autocorrelation). The Mann-Kendall test is used in the present procedure, and the presence of autocorrelation in the data can seriously bias its results. Much recent research has focused on the development of methods to model or remove the persistence effect in testing for trends. The approach adopted in the present procedure is that of trend-free pre-whitening (TFPW). See for example Yue et al. (2002a). Following Wang and Swail (2001), we will invoke TFPW when the absolute value of the AR1 coefficient exceeds a certain level (say 0.05). This is to ensure that any indication of persistence effect will be accounted for.

    Trend-free pre-whitening:

    Pre-whitening means the removal of the persistence (autocorrelation) effect. The analogy is with light, where white light means equal contribution from the various frequencies in the visible spectrum. By the same token, purely random data are called white noise, where there is no predominance of any particular range of frequencies. Persistence in the form of first order autocorrelation is like having a low frequency predominance in the spectrum. In terms of the visible light spectrum, the color would tend to be red, and analysts have termed the first order autoregressive process a red noise process. Hence the removal of persistence removes the redness in the data and makes them white again, and the process of removing persistence before further analysis is called pre-whitening.

    Trend-free pre-whitening refers to the process of doing pre-whitening by first removing the linear trend. This is to ensure minimal interaction between trend and persistence, which often creates problems in both estimation and testing. Persistence (represented by the AR1 coefficient $\phi$) in the residuals (the data that remain after the removal of trend) is then estimated and removed by inverse filtering. The removed trend is then added back

    onto the pre-whitened data and the Mann-Kendall test is applied to test for significance of

    trend. Specifically, the technique involves the following steps:

    (1) The linear trend $\beta_1$ is first estimated from the data sequence, using the Theil-Sen approach mentioned above.

    (2) The data sequence is then detrended using the result in (1). This removes the linear trend from the data.

    (3) The resulting residuals are used to estimate the AR1 coefficient $\phi$ using Spearman's rho.

    (4) If the absolute value of the AR1 coefficient is less than or equal to 0.05, the original set of data is tested for trends using the Mann-Kendall test. However, if it is greater than 0.05, the following steps are carried out.

    (5) The same residuals are then inverse-filtered using the AR1 coefficient to remove the persistence effect. This leaves a prewhitened sequence of residuals.

    (6) The prewhitened residuals are then added back to the trend that was removed in step (2).

    (7) This final sequence, which now has only the original trend and no persistence in the residuals, is then subjected to the Mann-Kendall test.
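
    The seven steps above can be sketched as a single Python function. This is a hedged illustration under stated assumptions, not the report's implementation: the names `tfpw` and `threshold` are hypothetical, the inverse filter is taken to be r'(t) = r(t) - phi * r(t-1), and a sequence with constant residuals is treated as having zero persistence. The function returns the sequence that would be passed to the Mann-Kendall test in step (4) or step (7).

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(t, y):
    # Median of all pairwise slopes (assumes distinct sampling times).
    return median((y[j] - y[i]) / (t[j] - t[i])
                  for i, j in combinations(range(len(y)), 2))


def _ranks(v):
    # Average ranks (1-based); tied values share their mean rank.
    s = sorted(v)
    return [s.index(x) + (s.count(x) + 1) / 2 for x in v]


def lag1_spearman(v):
    # Spearman's rho between the sequence and itself lagged by one point.
    a, b = _ranks(v[:-1]), _ranks(v[1:])
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: constant residuals carry no persistence
    return cov / (var_a * var_b) ** 0.5


def tfpw(t, y, threshold=0.05):
    """Return (times, values, slope, phi) ready for the Mann-Kendall test."""
    slope = theil_sen_slope(t, y)                          # step (1)
    resid = [yi - slope * ti for ti, yi in zip(t, y)]      # step (2)
    phi = lag1_spearman(resid)                             # step (3)
    if abs(phi) <= threshold:                              # step (4)
        return list(t), list(y), slope, phi
    # Step (5): inverse filtering; the first point is lost here.
    white = [resid[k] - phi * resid[k - 1] for k in range(1, len(resid))]
    new_t = list(t[1:])
    # Step (6): add the removed trend back onto the prewhitened residuals.
    new_y = [w + slope * ti for w, ti in zip(white, new_t)]
    return new_t, new_y, slope, phi                        # input to step (7)


# Made-up example: a trend of 0.5 per step plus a slowly switching residual.
t = list(range(1, 21))
r = ([1.0] * 5 + [-1.0] * 5) * 2
y = [0.5 * ti + ri for ti, ri in zip(t, r)]
new_t, new_y, slope, phi = tfpw(t, y)
print(slope, phi, len(new_y))
```

    Returning the slope and the AR1 estimate alongside the sequence is a design choice made here for convenience, so that the trend magnitude and the degree of persistence can be reported with the test result.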

    The test:

    The distribution-free test for trend used in the present procedure is the Mann-Kendall test (Mann, 1945; Kendall, 1975). See also Gilbert (1987). This is a test for the significance of linear trend which handles missing data better than Spearman's rho and has similar power (Yue et al., 2002b). The null hypothesis to be tested is:

    $H_0$: The data sequence $Y_i$, $i = 1, 2, \ldots, T$, is a random sample of $T$ independent and identically distributed variables.

    If a trend exists, $H_0$ will be rejected at the specified level of significance. The test statistic for the Mann-Kendall test for trend is:

    $$ S = \sum_{k=1}^{T-1} \sum_{j=k+1}^{T} \operatorname{sgn}(Y_j - Y_k) $$

    where

    $$ \operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases} $$

    The distribution of $S$ is symmetric about zero and is normal in the limit as $T$ tends to infinity. However, a good approximation by the normal distribution can be attained at $T = n$ of about 40. See for example Gilbert (1987). In fact, Mann (1945) and Kendall (1975, p.55) have documented that the normal approximation can be applied to cases with $n \geq 8$ if there are not many ties. With the normal approximation, it is assumed that the expected value and variance of $S$ under $H_0$ are:

    $$ E(S) = 0 $$

    $$ V(S) = \frac{n(n-1)(2n+5) - \sum_{m=1}^{n} t_m\, m(m-1)(2m+5)}{18} $$

    where $t_m$ is the number of ties of extent $m$. The standardized test statistic $Z$ with continuity correction is computed by:

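    $$ Z = \begin{cases} \dfrac{S - 1}{\sqrt{V(S)}} & \text{if } S > 0 \\ 0 & \text{if } S = 0 \\ \dfrac{S + 1}{\sqrt{V(S)}} & \text{if } S < 0 \end{cases} $$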

    The p-value is then computed based on the standard normal distribution:

    $$ \Pr(Z > z) = \frac{1}{\sqrt{2\pi}} \int_{z}^{\infty} \exp\left(-t^2/2\right) dt $$

    for positive $Z$, and vice versa for negative $Z$.
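
    The statistic, its tie-corrected variance, the continuity-corrected Z and the normal tail probability can be computed together, as in the sketch below. The function name `mann_kendall` and the sample data are illustrative assumptions; the tail probability uses the identity Pr(Z > z) = 0.5 erfc(z / sqrt(2)) from the Python standard library.

```python
import math
from collections import Counter


def sign(x):
    """sgn(x): 1 if x > 0, 0 if x == 0, -1 if x < 0."""
    return (x > 0) - (x < 0)


def mann_kendall(y):
    """Return (S, V(S), Z, tail probability) under the normal approximation."""
    n = len(y)
    s = sum(sign(y[j] - y[k]) for k in range(n - 1) for j in range(k + 1, n))
    # Each group of m tied values contributes m(m-1)(2m+5); summing over
    # groups equals summing t_m * m(m-1)(2m+5) over tie extents m.
    tie_term = sum(m * (m - 1) * (2 * m + 5) for m in Counter(y).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Probability of a standard normal value beyond |z| in one tail.
    p_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return s, var_s, z, p_tail


# Made-up annual values with a mild upward drift and two tied observations.
print(mann_kendall([12.1, 11.8, 12.6, 12.6, 13.0, 12.9, 13.4, 13.8]))
```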

    For shorter sequences, the p-values are available from tables. As an alternative, the p-value can also be assessed using a permutation approach. In this case, the null distribution of the test statistic, $S$, is derived by generating a large number (say 5000) of random permutations of the original data sequence and calculating the test statistic for each. The p-value is the probability of the observed $S$ being exceeded under this null distribution. One can compare the permutation p-value with the p-values from the normal approximation and from tables and find that there is good agreement. The permutation p-value is used for determining trend significance in all cases.
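
    A permutation p-value of this kind can be sketched as follows. The details are assumptions made for illustration: the function names are hypothetical, the comparison is two-sided (the absolute value of S), permutations that tie the observed value are counted as exceedances, and a fixed seed is used so that the result is repeatable.

```python
import random


def mk_statistic(y):
    """Mann-Kendall S: sum of sgn(Y_j - Y_k) over all pairs k < j."""
    n = len(y)
    return sum((y[j] > y[k]) - (y[j] < y[k])
               for k in range(n - 1) for j in range(k + 1, n))


def permutation_p_value(y, n_perm=5000, seed=0):
    """Return (observed S, share of shuffles with |S| at least as large)."""
    rng = random.Random(seed)
    observed = mk_statistic(y)
    shuffled = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reordering breaks any time trend
        if abs(mk_statistic(shuffled)) >= abs(observed):
            count += 1
    return observed, count / n_perm


# Made-up sequence with a visible upward drift.
print(permutation_p_value([4.2, 4.0, 4.5, 4.9, 4.7, 5.3, 5.1, 5.8, 6.0, 5.9]))
```

    Because shuffling keeps the observed values and only changes their order, the null distribution automatically reflects any ties and skewness in the data, which is one reason to prefer it over tables for short sequences.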

    The p-value and one- or two-tailed test:

    Statistical significance at a specified level, say 5 percent, means that the probability of observing such a trend by random chance alone is small (5 percent or less). In other words, one is saying that the trend exists, and the probability of being wrong is small. A p-value is the probability of having a test statistic (in this case $S$) that is as extreme as or more extreme than the observed one.
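
    As a small illustration of the difference between the two kinds of test, the sketch below converts a standardized statistic Z into one-tailed and two-tailed p-values under the normal approximation. The function name and the dictionary keys are hypothetical. A one-tailed test assumes the direction of interest (increasing or decreasing) was chosen before looking at the data, whereas a two-tailed test counts extremes in either direction.

```python
import math


def normal_p_values(z):
    """One- and two-tailed p-values for a standard normal statistic z."""
    upper = 0.5 * math.erfc(z / math.sqrt(2))    # Pr(Z >= z): increasing trend
    lower = 0.5 * math.erfc(-z / math.sqrt(2))   # Pr(Z <= z): decreasing trend
    two_tailed = min(1.0, 2.0 * min(upper, lower))
    return {"increasing": upper, "decreasing": lower, "two_tailed": two_tailed}


print(normal_p_values(1.96))
```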

    References:

    Box, G.E.P. and Jenkins, G.M. 1976. Time series analysis, forecasting and control. Revised edition. Holden-Day, San Francisco. 575pp.

    Burn, D.H. and Elnur, M.A.H. 2002. Detection of hydrological trends and variability. J. of Hydrol. 255, 107-122.

    Gilbert, R.O. 1987. Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York. 320pp.

    Hirsch, R.M., Slack, J.R. and Smith, R.A. 1982. Techniques of trend analysis for monthly water quality data. Water Resources Research 18(1), 107-121.

    Hollander, M. and Wolfe, D.A. 1973. Nonparametric Statistical Methods. Wiley, New York. 503pp.

    Kendall, M.G. 1975. Rank Correlation Methods. Griffin, London.

    Mann, H.B. 1945. Nonparametric tests against trend. Econometrica 13, 245-259.

    Milionis, A.E. and Davies, T.D. 1994. Regression and stochastic models for air pollution I. Review, comments and suggestions. Atmospheric Environment 28(17), 2801-2810.

    Nankervis, J.C. and Savin, N.E. 1996. The level and power of the bootstrap t test in the AR(1) model with trend. J. of Business and Economic Statistics 14, 161-168.

    Park, R.E. and Mitchell, B.M. 1980. Estimating the autocorrelated error model with trended data. J. of Econometrics 13, 185-201.

    Sen, P.K. 1968. Estimates of the regression coefficient based on Kendall's tau. J. of Amer. Statist. Assoc. 63, 1379-1389.

    Wang, X.L. and Swail, V.R. 2001. Changes of extreme wave heights in Northern Hemisphere oceans and related atmospheric circulation regimes. J. of Climate 14, 2204-2221.

    Weiss, A. 1990. Least absolute error estimation in the presence of serial correlation. J. of Econometrics 44, 127-158.

    Woodward, W.A., Bottone, S. and Gray, H.L. 1997. Improved tests for trend in time series data. J. of Agricultural, Biological, and Environmental Statistics 2(4), 403-416.

    Yue, S. and Pilon, P. 2003. Interaction between deterministic trend and autoregressive process. Water Resources Research 39(4), 1077, doi:10.1029/2001WR001210.

    Yue, S., Pilon, P., Phinney, B. and Cavadias, G. 2002a. The influence of autocorrelation on the ability to detect trend in hydrological series. Hydrol. Process. 16, 1807-1829.

    Yue, S., Pilon, P. and Cavadias, G. 2002b. Power of the Mann-Kendall and Spearman's rho tests for detecting monotonic trends in hydrological series. J. of Hydrol. 259, 254-271.

    Zheng, X., Basher, R.E. and Thompson, C.S. 1997. Trend detection in regional-mean temperature series: Maximum, minimum, mean, diurnal range and SST. J. of Climate 10, 317-326.