
ABSTRACT

In satellite oceanography, cloud cover is the main obstacle to obtaining a spatially complete dataset for investigations in space and time. For example, investigating the dynamics of algae blooms requires a near-daily temporal resolution and full spatial coverage.

Geostatistical methods like kriging can, to a certain extent, predict missing data values and thus provide the means to achieve a dataset of the desired temporal resolution.

The necessary steps in processing the data are briefly summarized. The range of the kriging variance of the cross-validation is introduced as a measure of reliability.

1. INTRODUCTION

The dynamics of the phytoplankton distribution in the North Sea can be accessed through satellite imagery provided by MERIS and its algal_2 product for coastal waters. In the area of interest the dominant pattern of high chlorophyll concentrations along the southern shores prevails throughout the year, combined with seasonal blooms in spring and autumn.

It remains to be seen whether the MERIS chlorophyll product is precise and stable enough to become the foundation of a time series database that serves not only to study the annual dynamics of the algae distribution, but also to determine an interannual trend due to global change effects.

Because of this interest in interannual trends in the dynamics, it will be necessary to form a dataset with near-daily temporal resolution and full coverage of the North Sea. Monthly means combined from several scenes cover the area completely in most cases, but they lack the temporal resolution to follow the development of algae blooms, which may develop and decay within a few days.

At present, a prediction of missing data based purely on spatial information seems to be the most promising method.

2. METHOD

2.1 Ordinary Kriging

To be able to represent the spatial variance of the data by a semivariogram, the data has to be a random function of second order stationarity (or, less strictly, it has to fulfil the intrinsic hypothesis).

$$E[Z(x)] = m \tag{1}$$

$$E[(Z(x) - m)(Z(x+h) - m)] = \mathrm{Cov}(h) \tag{2}$$

If the data is a second order stationary random function, its expected value is independent of location (Eq. 1). Additionally, the covariance between data points depends only on the (vectorial) distance $h = x - x_0$, with estimation location $x_0$ and data location $x$ (Eq. 2). The semivariance $\gamma$ is thus defined as

$$\gamma(h) = \mathrm{Cov}(0) - \mathrm{Cov}(h) \tag{3}$$

Typically, and ideally, the semivariance shows monotonic behaviour, increasing with distance.

The Ordinary Kriging estimator $\hat{Z}(x)$ is an exact, linear estimator.

$$\hat{Z}(x_0) = \sum_{i=1}^{k} \lambda_i Z(x_i), \qquad \sum_{i=1}^{k} \lambda_i = 1 \tag{4}$$

$$\sigma_{OK}^2(x_0) = \min\left(\mathrm{Var}\left[\hat{Z}(x_0) - Z(x_0)\right]\right) \tag{5}$$

The kriging (or error) variance (Eq. 5) can be rewritten as:

$$\sigma_{OK}^2(x_0) = \sum_{i=1}^{k} \lambda_i \gamma(x_i - x_0) - \mu \tag{6}$$

with the Lagrange multiplier $\mu$ [2].

The weights $\lambda_i$ are calculated by minimising the error variance under the condition of normalised weights. In general, the weights will be larger if the data point is closer to the estimation location, and will decrease rapidly if there are other data points in the line of sight between the data and the prediction location (screening effect).
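The ordinary kriging system behind Eqs. 4 to 6 can be illustrated with a minimal sketch. It is not the processor used in this study (which runs in R); the exponential variogram model, its parameters and all function names are assumptions made for illustration. The sign of the Lagrange multiplier is chosen so that the variance agrees with Eq. 6.

```python
import math

def exponential_semivariance(h, sill=1.0, range_=10.0):
    """Exponential variogram model; sill and range are hypothetical values."""
    return sill * (1.0 - math.exp(-h / range_))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_kriging(points, values, x0, gamma):
    """Return (estimate, kriging variance, weights) at location x0."""
    k = len(points)
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    # Rows 1..k: sum_j lambda_j * gamma(x_i - x_j) - mu = gamma(x_i - x0)
    a = [[gamma(dist(points[i], points[j])) for j in range(k)] + [-1.0]
         for i in range(k)]
    # Last row: the weights are normalised, sum_j lambda_j = 1 (Eq. 4)
    a.append([1.0] * k + [0.0])
    b = [gamma(dist(p, x0)) for p in points] + [1.0]
    solution = solve(a, b)
    weights, mu = solution[:k], solution[k]
    estimate = sum(w * z for w, z in zip(weights, values))
    # Eq. 6: sum_i lambda_i * gamma(x_i - x0) - mu
    variance = sum(w * g for w, g in zip(weights, b[:k])) - mu
    return estimate, variance, weights
```

Under these assumptions, estimating at a data location reproduces the data value with zero kriging variance, which is the exactness property stated above, and a location far from all data receives a larger variance than one surrounded by data.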

ESTIMATION OF ALGAE CONCENTRATION IN CLOUD COVERED SCENES USING GEOSTATISTICAL METHODS

Dagmar Müller

GKSS, Max-Planck-Str. 1, 21502 Geesthacht, Germany, Email: [email protected]

_____________________________________________________

Proc. ‘Envisat Symposium 2007’, Montreux, Switzerland 23–27 April 2007 (ESA SP-636, July 2007)

2.2 The kriging variance

First and foremost, the kriging variance (Eq. 6) reflects the spatial distribution of the data (see Fig. 7). Apart from the Lagrange multiplier, it depends on the semivariance and the applied weights.

In a local kriging approach a constant number of nearest-neighbour data points is chosen for the prediction. Fig. 1 gives an overview of the choice of data in the cross-validation: at the prediction location (red), which is a data point left out, the nearest data points (light blue) are used for the prediction, which is compared to the actual data value. The mean distance between the prediction location and the data is very short, and the data is aligned such that the screening effect will be large: there will be large weights for a few small semivariances and weights close to zero for larger semivariances. In general, the kriging variance derived by cross-validation is therefore minimal.

The range of the kriging variance calculated in the cross-validation defines the minimal error variance and characterizes the interpolation range.

If the estimation location does not lie in the vicinity of the data (Fig. 2), the mean distance becomes rather large and the data is not aligned, so that the weights will be more or less equal (no screening effect). The kriging variance will then be close to the semivariance at the mean distance.

By rejecting any prediction whose kriging variance lies outside the range of the cross-validation kriging variance, one can effectively find those locations that are surrounded by data or close to data, so that their prediction may be treated like an interpolation.
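The rejection rule can be written down in a few lines. The following is a sketch only; the function names and the tuple layout (location, estimate, kriging variance) are assumptions and do not come from the processing chain used here.

```python
def interpolation_range(cv_variances):
    """Range (min, max) of the kriging variances found in cross-validation."""
    return min(cv_variances), max(cv_variances)

def accept_predictions(predictions, cv_variances):
    """Keep predictions whose kriging variance lies inside the cross-validation range.

    predictions: iterable of (location, estimate, kriging_variance) tuples.
    Returns the accepted (location, estimate) pairs.
    """
    low, high = interpolation_range(cv_variances)
    return [(loc, est) for loc, est, var in predictions if low <= var <= high]
```

A prediction with a variance above the largest cross-validation variance is treated as lying outside the interpolation range and is discarded.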

3. PROCEDURE

Satellite data have special properties: a very large number of data points, which are often spatially clustered into subsets of homogeneously spread data. The following data processing is therefore needed to prepare the data for geostatistical analysis (see Fig. 3). All calculations are performed in R [1], with the help of several packages for geostatistics ([3], [4], [5]).

3.1 Projection

The MERIS L2 product algal_2 is projected onto a simple Lat/Lon grid with a pixel size of approx. 1.5 km by 1.5 km in the North Sea area (60°N to 49°N, 5°W to 12°E). Further calculations are carried out on 730 by 815 pixel images.

3.2 Removal of spatial trend

For kriging to be applicable, the data has to fulfil the intrinsic hypothesis (Eq. 1, Eq. 2): the expected value of the data has to be independent of location. This is certainly not true for algae distributions with strong patterns, so removal of the spatial trend is a necessity. A weighted monthly mean of the chlorophyll concentration centred on the day under study is proposed, which takes into account the temporal information available. The shorter the distance in time to the day under study, the larger the weight, based on the assumption of a strong correlation between data that are close in time.

Patterns are sufficiently reduced, and the residuals are used for further analysis. In effect, this is the first restriction on the possible prediction locations: at such a location the trend must be known in order to deduce the actual chlorophyll concentration from the estimated residuals.
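A per-pixel version of the weighted monthly mean might look as follows. The text only states that weights grow as the time distance shrinks; the Gaussian weight function, its scale of 7 days, the window of 15 days on either side, and the inclusion of the centre day itself when it is valid are all assumptions of this sketch.

```python
import math

def temporal_weight(day_offset, scale=7.0):
    """Weight that decreases with time distance (assumed Gaussian shape)."""
    return math.exp(-0.5 * (day_offset / scale) ** 2)

def weighted_monthly_mean(series, centre_day, half_window=15, scale=7.0):
    """Trend value for one pixel.

    series: dict mapping day number -> value, with None for missing (cloud) days.
    Returns None if no valid value lies inside the window, i.e. the trend is unknown.
    """
    num = den = 0.0
    for day, value in series.items():
        offset = day - centre_day
        if value is None or abs(offset) > half_window:
            continue
        w = temporal_weight(offset, scale)
        num += w * value
        den += w
    return num / den if den > 0 else None

def residual(value, trend):
    """Detrended value; undefined (None) where the value or the trend is missing."""
    if value is None or trend is None:
        return None
    return value - trend
```

Pixels for which the function returns None are exactly those excluded as prediction locations, because the trend needed to convert an estimated residual back into a concentration is missing.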

Figure 1. Cross-validation: minimal distances and minimal kriging variance

Figure 2. Estimation location outside the interpolation range: large distances and large kriging variance

3.3 Choosing a subset

Satellite images are rather large datasets with easily more than 100,000 data points, which in general cannot all be used to calculate the variograms. To find a representative subset, a coarse grid is used in which each cell contains five by five pixels of the projection grid; from each cell one pixel is chosen randomly. If this happens to be a valid data point, i.e. data is available there, it is chosen for the calculation of the variogram; if it happens to be a missing data location, no further choice is made for that cell. In this manner the number of data points is reduced and the spatial distribution of the data is preserved.
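The subset selection can be sketched as below. The grid representation (a list of rows with None marking missing data) and the function name are assumptions for illustration.

```python
import random

def choose_subset(grid, block=5, rng=None):
    """Pick at most one pixel per block-by-block cell of the projection grid.

    grid: list of rows; None marks a missing (e.g. cloud) pixel.
    Returns a list of (row, col, value) for the chosen valid pixels.
    """
    rng = rng or random.Random()
    n_rows, n_cols = len(grid), len(grid[0])
    subset = []
    for r0 in range(0, n_rows, block):
        for c0 in range(0, n_cols, block):
            # One random pixel per cell; cells at the image edge may be smaller.
            r = rng.randrange(r0, min(r0 + block, n_rows))
            c = rng.randrange(c0, min(c0 + block, n_cols))
            value = grid[r][c]
            # A missing pixel is not replaced by a second draw.
            if value is not None:
                subset.append((r, c, value))
    return subset
```

Because a missing pixel is not redrawn, cells that are mostly cloud covered contribute fewer points, so the subset keeps roughly the same spatial distribution as the full data.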

3.4 Finding the best variogram model

From the chosen data points the empirical semivariogram is calculated. The semivariogram is a measure of dissimilarity with distance and should be monotonically increasing. Because of the inhomogeneous distribution of the data points, empirical semivariograms with non-monotonic behaviour can occur. To avoid these cases, which cannot be adequately fitted by simple variogram models, one can either resort to spatial clustering of the data or restrict the distance of the variogram model to the part where monotonicity is ensured. This limitation in distance can be justified by the use of a local kriging algorithm, which uses a fixed number of data points in the close vicinity of each prediction location.

Different variogram models (e.g. spherical, exponential, Gaussian) in simple and nested forms are compared for their goodness of fit by the Akaike Information Criterion (AIC).
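As an illustration of this step, the sketch below computes a binned empirical semivariogram and ranks three simple models by AIC. It makes several assumptions that are not taken from this paper: models without nugget, a coarse grid search instead of a proper optimiser, and the least-squares form of the AIC, n ln(RSS/n) + 2p. The R packages cited above provide their own fitting routines.

```python
import math

def empirical_semivariogram(points, values, bin_width, max_dist):
    """Binned semivariance: mean of 0.5 * (z_i - z_j)^2 per distance bin."""
    n_bins = int(max_dist / bin_width)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            b = int(h / bin_width)
            if b < n_bins:
                sums[b] += 0.5 * (values[i] - values[j]) ** 2
                counts[b] += 1
    return [((b + 0.5) * bin_width, sums[b] / counts[b])
            for b in range(n_bins) if counts[b] > 0]

def spherical(h, sill, rng):
    if h >= rng:
        return sill
    r = h / rng
    return sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, sill, rng):
    return sill * (1.0 - math.exp(-h / rng))

def gaussian(h, sill, rng):
    return sill * (1.0 - math.exp(-(h / rng) ** 2))

MODELS = {"spherical": spherical, "exponential": exponential, "gaussian": gaussian}

def aic(rss, n, n_params):
    """Least-squares AIC; the floor avoids log(0) for a perfect fit."""
    return n * math.log(max(rss / n, 1e-300)) + 2 * n_params

def compare_models(variogram, sills, ranges):
    """Fit each model by grid search; return (name, aic, sill, range) sorted by AIC."""
    results = []
    for name, model in MODELS.items():
        best = min(
            (sum((g - model(h, s, r)) ** 2 for h, g in variogram), s, r)
            for s in sills for r in ranges
        )
        results.append((name, aic(best[0], len(variogram), 2), best[1], best[2]))
    return sorted(results, key=lambda item: item[1])
```

With the same number of parameters for every model, the ranking reduces to the residual sum of squares; the penalty term only matters once nested models with more parameters are added to the comparison.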

3.5 Ordinary Kriging (OK)

Different kriging techniques have been tested previously and compared for their accuracy in predicting a known value at a given location. Ordinary kriging has been identified as the most robust method for large areas and inhomogeneously distributed data. This technique is therefore chosen for an automatic kriging processor.

Using the variogram model derived from the subset, the kriging is then performed with all available data, carried out as a local ordinary kriging.

3.6 Cross-validation (CV)

Leave-one-out cross-validation is a well-known standard procedure to test the performance of the variogram model. The data subset from which the variogram has been derived (see Section 3.3) supplies the points that are left out one after another. The statistics of the error variance between actual value and predicted value are further used to identify locations which can be presumed to allow valid predictions.
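The leave-one-out loop with a local neighbourhood can be sketched independently of the predictor. In the sketch below the predictor is passed in as a function; `mean_predictor` is only a placeholder used to keep the example short and is not kriging. In the procedure described here its place would be taken by local ordinary kriging. All names are assumptions.

```python
import math

def nearest_neighbours(points, values, x0, n):
    """Return the n data points (and their values) closest to x0."""
    order = sorted(
        range(len(points)),
        key=lambda i: math.hypot(points[i][0] - x0[0], points[i][1] - x0[1]),
    )[:n]
    return [points[i] for i in order], [values[i] for i in order]

def mean_predictor(points, values, x0):
    """Placeholder predictor (NOT kriging): neighbourhood mean and variance."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var

def leave_one_out(points, values, predictor, n_neighbours):
    """Predict each point from its neighbours with the point itself left out.

    Returns a list of (actual, predicted, variance), one entry per data point.
    """
    results = []
    for i in range(len(points)):
        rest_p = points[:i] + points[i + 1:]
        rest_v = values[:i] + values[i + 1:]
        nb_p, nb_v = nearest_neighbours(rest_p, rest_v, points[i], n_neighbours)
        est, var = predictor(nb_p, nb_v, points[i])
        results.append((values[i], est, var))
    return results
```

The third element of each result is the quantity whose range is later used to decide which prediction locations count as valid.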

4. CASE STUDY

As an example, the MERIS product algal_2 from June 11th, 2006 (Fig. 4) is chosen for its cloudlessness and rather complete coverage of the North Sea area. It consists of 115,494 data points, which are reduced by an overlay of the cloud pixel positions from May 2nd, 2005 to 83,652 data points. These remaining chlorophyll values are corrected for spatial trend, which cannot be calculated everywhere and so reduces the dataset to 29,526 points. There are 77,893 cloud pixels coinciding with trend-corrected data pixels. The data is processed as described in Section 3. The subset used to calculate the variogram consists of 1,228 data points.

To shorten calculation time and enable the inversion of matrices, local kriging is used; the number of data locations is restricted to the one hundred nearest neighbours of each prediction location.

4.1 Validation of kriging estimates against known algae concentration

The error between the prediction and the actual data value (Fig. 5) is rather small over large areas, despite the fact that the cloud cover is quite extensive. This good performance is due to the choice of the trend. A prediction location at a long distance from the data points will receive a prediction that is close to the mean of the residuals, which is zero, plus the weighted monthly mean.

Figure 3. Data processing (flowchart labels: Original MERIS Data; Detrended data; Subset; Cluster; Best Variogram; Cross-validation; Ordinary Kriging; Time Series Database; Projection of Data; Clouds, invalid etc.)

The pattern of the monthly mean is slightly modified by the prediction of residuals, but remains visible and prevents the outcome from being strongly smoothed by the kriging.

In areas with higher variability, i.e. the southern part of the North Sea, the prediction deviates the most.

4.2 Restriction of estimation location by cross validation kriging variance

Taking into account the range of the kriging variance, specified during the cross-validation procedure, only 12,844 of 77,893 cloud locations can be considered valid.

These locations are marked in Fig. 6 by black dots. It is apparent that these locations lie either surrounded by data points or in their direct vicinity.

For a time series database I would accept those valid predictions as an addition to the original dataset, while other locations might be filled, e.g., by the weighted monthly mean.

5. SUMMARY

Kriging can be used to estimate missing satellite chlorophyll data. To match the assumptions of the intrinsic hypothesis, kriging is not performed on the data directly but on the residuals after correcting for spatial trend. Therefore, only locations where the trend is available can be considered in the estimation process.

The range of the kriging variance obtained by cross-validation characterises the data distribution and can help to identify those estimation locations which are either surrounded by data points or lie in the vicinity of data points. The number of reliable estimations is restricted by geometry, i.e. the spatial distribution of data, and is rather small.

A combination of methods like weighted monthly means, temporal pixelwise interpolation and geostatistical methods like kriging will be necessary to create the time series database.

Of course, geostatistical methods cannot ‘create’ new information, but rather spread it from known to unknown regions. The chlorophyll data of different satellite instruments (e.g. SeaWiFS) or bathymetry data could improve the estimations, which is yet to be tested.

6. REFERENCES

1. R Development Core Team (2006). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

2. Olea, R.A. (1999). Geostatistics for engineers and earth scientists, Kluwer Academic Publishers.

3. Paulo J. Ribeiro Jr & Peter J. Diggle (2001). geoR: a package for geostatistical analysis, R-NEWS, 1(2).

4. Pebesma, E.J. (2004). Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30: 683-691.

5. Nychka, D. (2005). fields: Tools for spatial data. R package version 3.04. http://www.image.ucar.edu/GSP/Software/Fields

Figure 4. MERIS algal_2 June 11th, 2006 (panel title: Original log(algal2))

Figure 5. Error between prediction and data (panel title: Diff. Orig. Estim.)

Figure 6. Valid estimation locations (panel title: Estimations; legend: valid estimate)

Figure 7. Kriging variance (panel title: Kriging variance)
