6
ecological modelling 212 ( 2 0 0 8 ) 86–91 available at www.sciencedirect.com journal homepage: www.elsevier.com/locate/ecolmodel A comparative study on predicting algae blooms in Douro River, Portugal Rita Ribeiro a,b,, Lu´ ıs Torgo a,c a LIAAD-INESC Porto L.A., R. Ceuta, 118-6o., 4050-190 Porto, Portugal b FCUP-DCC, University of Porto, R. Campo Alegre, 1021/1055, 4169-007 Porto, Portugal c FEP, University of Porto, R. Dr. Roberto Frias, 4200-464 Porto, Portugal article info Article history: Published on line 19 November 2007 Keywords: Multiple regression Extreme values prediction Model evaluation Algae blooms abstract Algae blooms are ecological events associated with extremely high abundance value of cer- tain algae. These rare events have a strong impact in the river’s ecosystem. In this context, the prediction of such events is of special importance. This paper addresses the problems that result from evaluating and comparing models at the prediction of rare extreme values using standard evaluation statistics. In this context, we describe a new evaluation statis- tic that we have proposed in Torgo and Ribeiro [Torgo, L., Ribeiro, R., 2006. Predicting rare extreme values. In: Ng, W., Kitsuregawa, M., Li, J., Chang, K. (Eds.), Proceedings of the 10th Pacific-Asia Conference on Knowledge Discover and Data Mining (PAKDD’2006). Springer, pp. 816–820 (number 3918 in LNAI)], which can be used to identify the best models for pre- dicting algae blooms. We apply this new statistic in a comparative study involving several models for predicting the abundance of different groups of phytoplankton in water samples collected in Douro River, Porto, Portugal. Results show that the proposed statistic identifies a variant of a Support Vector Machine as outperforming the other models that were tried in the prediction of algae blooms. © 2007 Elsevier B.V. All rights reserved. 1. Introduction In Douro River, as in other european rivers, an excessive growth of some algae is registered occasionally. These phe- nomena, known as algae blooms, degrades water quality as it reduces water clarity and the oxygen levels, causing a massive death of river fish. The early prediction of these blooms is par- ticularly important, especially when potable water is obtained from these rivers, and several authors have addressed this task (e.g. Bobbin and Recknagel, 2001; Ibanez and Conversi, 2002; Lee et al., 2003; Muttil and Lee, 2005). Corresponding author. E-mail addresses: [email protected] (R. Ribeiro), [email protected] (L. Torgo). URLs: www.liaad.up.pt/rpribeiro (R. Ribeiro), www.liaad.up.pt/ltorgo (L. Torgo). The data used in our comparative study were obtained from the public company responsible for potable water col- lection for the metropolitan Porto area, the second largest city of Portugal. We have used the data resulting from the analy- sis of water samples collected at Crestuma-Lever dam in river Douro between 1998 and 2003. The data include the identifica- tion and quantification of different groups of phytoplankton as well as the analysis of physical–chemical and microbiological parameters. The long term objective of the project behind this paper is to develop reliable models that can forecast micro- 0304-3800/$ – see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.ecolmodel.2007.10.018

A comparative study on predicting algae blooms in Douro River, Portugal

Embed Size (px)

Citation preview

Page 1: A comparative study on predicting algae blooms in Douro River, Portugal

e c o l o g i c a l m o d e l l i n g 2 1 2 ( 2 0 0 8 ) 86–91

avai lab le at www.sc iencedi rec t .com

journa l homepage: www.e lsev ier .com/ locate /eco lmodel

A comparative study on predicting algae bloomsin Douro River, Portugal

Rita Ribeiroa,b,∗, Luıs Torgoa,c

a LIAAD-INESC Porto L.A., R. Ceuta, 118-6o., 4050-190 Porto, Portugalb FCUP-DCC, University of Porto, R. Campo Alegre,1021/1055, 4169-007 Porto, Portugalc FEP, University of Porto, R. Dr. Roberto Frias,4200-464 Porto, Portugal

a r t i c l e i n f o

Article history:

Published on line 19 November 2007

Keywords:

Multiple regression

Extreme values prediction

Model evaluation

Algae blooms

a b s t r a c t

Algae blooms are ecological events associated with extremely high abundance value of cer-

tain algae. These rare events have a strong impact in the river’s ecosystem. In this context,

the prediction of such events is of special importance. This paper addresses the problems

that result from evaluating and comparing models at the prediction of rare extreme values

using standard evaluation statistics. In this context, we describe a new evaluation statis-

tic that we have proposed in Torgo and Ribeiro [Torgo, L., Ribeiro, R., 2006. Predicting rare

extreme values. In: Ng, W., Kitsuregawa, M., Li, J., Chang, K. (Eds.), Proceedings of the 10th

Pacific-Asia Conference on Knowledge Discover and Data Mining (PAKDD’2006). Springer,

pp. 816–820 (number 3918 in LNAI)], which can be used to identify the best models for pre-

dicting algae blooms. We apply this new statistic in a comparative study involving several

models for predicting the abundance of different groups of phytoplankton in water samples

collected in Douro River, Porto, Portugal. Results show that the proposed statistic identifies

a variant of a Support Vector Machine as outperforming the other models that were tried in

the prediction of algae blooms.

well as the analysis of physical–chemical and microbiological

1. Introduction

In Douro River, as in other european rivers, an excessivegrowth of some algae is registered occasionally. These phe-nomena, known as algae blooms, degrades water quality as itreduces water clarity and the oxygen levels, causing a massivedeath of river fish. The early prediction of these blooms is par-ticularly important, especially when potable water is obtainedfrom these rivers, and several authors have addressed this task

(e.g. Bobbin and Recknagel, 2001; Ibanez and Conversi, 2002;Lee et al., 2003; Muttil and Lee, 2005).

∗ Corresponding author.E-mail addresses: [email protected] (R. Ribeiro), [email protected]: www.liaad.up.pt/∼rpribeiro (R. Ribeiro), www.liaad.up.pt/∼lto

0304-3800/$ – see front matter © 2007 Elsevier B.V. All rights reserved.doi:10.1016/j.ecolmodel.2007.10.018

© 2007 Elsevier B.V. All rights reserved.

The data used in our comparative study were obtainedfrom the public company responsible for potable water col-lection for the metropolitan Porto area, the second largest cityof Portugal. We have used the data resulting from the analy-sis of water samples collected at Crestuma-Lever dam in riverDouro between 1998 and 2003. The data include the identifica-tion and quantification of different groups of phytoplankton as

.pt (L. Torgo).rgo (L. Torgo).

parameters.The long term objective of the project behind this paper

is to develop reliable models that can forecast micro-

Page 2: A comparative study on predicting algae blooms in Douro River, Portugal

e c o l o g i c a l m o d e l l i n g 2 1 2 ( 2 0 0 8 ) 86–91 87

valu

agsmbeud

eagahitTtmnasmdBetogtd

Fig. 1 – Cyanobacteria abundance

lgae blooms in river Douro. Identification of the differentroups of phytoplankton in water samples requires inten-ive and expensive manual labour, while the analysis ofost physical–chemical and microbiological parameters can

e automated by water probes. Our objective is to avoid thexpensive manual identification of phytoplankton groups bysing models that are able to accurately anticipate their abun-ance based on the values of other parameters.

In this paper we describe the results of a comparativexperiment between several candidate models. Predicting thebundance level of a certain phytoplankton species, wheniven the values of a set of other parameters, can be regardeds a multiple regression problem. Several techniques exist forandling this type of problems. They all revolve around the

ssue of obtaining the model parameters that optimize a cer-ain preference criterion that is estimated on a data sample.he standard error statistics used as preference criteria will

end to favour the model that has good performance on theost frequent cases. However, micro-algae blooms are fortu-

ately rare events that are characterized by extreme values ofbundance of those algae. As such, we claim that the use ofuch preference criteria will introduce a bias into the obtainedodels that goes against the goals of the application, i.e. pre-

icting rare extreme values of the abundance of micro-algae.ased on these observations, we have developed a differentrror statistic (Torgo and Ribeiro, 2006) that is able to iden-ify the models that are more “useful” from the perspective

f accurately predicting the rare extreme values of the tar-et variable. In our comparative experiments we have usedhis statistic to compare the models we have tried in ourata.

Table 1 – Physical–chemical and microbiological parameters

Turbidity Temperature pHNitrates Chlorates SulfatesDissolved oxygen Iron Dissolved iron

Fecal coliforms Total coliforms Fecal streptococcuTotal germs at 37 ◦C Escherichia coli

es registered from 1998 till 2003.

2. Problem description

The data used in this study was collected at Crestuma-Leverdam in river Douro and covers the period from 1998 till 2003. Asour objective is to predict the algae blooms, we have selectedthe information concerning certain groups of phytoplankton,physical–chemical and microbiological parameters.

Unfortunately, though this information respects to watersamples collected from the same place, their periodicity is farfrom being homogeneous. This required large pre-processingsteps to come up with a data set useful for model construction.

We decided to gather all the data by synchronizing it tothe periodicity of the information concerning algae abun-dance. From all the identified groups of micro-algae, weselected the ones for which we have information along all6 years: Cyanobacteria, Cryptophyta, Euglenophyta, Chrys-ophyta, Dinophyta, Chlorophyta and Diatom. The samplingperiodicity for these algae was initially weekly, but becamebiweekly since 2002, as the graph of Fig. 1 shows for theCyanobacteria case. Therefore, this last biweekly periodicitywas established as the periodicity for our data set.

In the original data, the values of physical–chemical andmicrobiological parameters had higher sampling frequency,as so they had to be aggregated. For each parameter, weincluded the minimum, the median and the maximum valuefor the 2 weeks respecting each observation in our dataset.

We left out from our analysis some physical–chemical andmicrobiological parameters based on their scarce number ofmeasurements. Still, for the selected parameters, there weresome unknown values. We have used an imputation strategy

Alkalinity ConductivitySilica OxidabilityTotal suspended solids

s Sulfite reducer clostridia Total germs at 22 ◦C

Page 3: A comparative study on predicting algae blooms in Douro River, Portugal

l i n g

88 e c o l o g i c a l m o d e l

to replace them, namely we used the exponential moving aver-age based on the last 4 observations, which corresponds to,approximately, 2 months. Table 1 summarises the final set ofparameters that were used in our study.

In order to capture the temporal oscillations and the exist-ing mutual influence among the algae abundance values (e.g.Lee et al., 2003), we also included, for each observation, thevalues of the previous 2 weeks of each group of phytoplank-ton and their total value and diversity index. By includingthis last information we intended to facilitate the identifi-cation of algae blooms that we suppose to be characterizedby extremely high total value and a low value of diversity,meaning that a few groups of phytoplankton dominate. Forthe diversity index we chose to use the Shannon index (e.g. Hill,1973).

As a result of these pre-processing steps we get to a datasetwith 72 variables, 7 target variables (one for each selectedgroup of phytoplankton) and 131 cases. The low number ofcases arises from the lack of information about the groups ofphytoplankton during the first semester of 1999 and 2002, aswe can see in Fig. 1.

3. Multiple regression problem

Our overall objective is to obtain models to predict the val-ues of abundance of each group of phytoplankton. This is amultiple regression problem as we want to estimate the valueof a continuous target variable Y based on the values of thepredictor variables X1, X2, . . . , Xp. In our application, the targetvariable is the abundance value of a group of phytoplanktonand the predictor variables are a set of physical–chemical andmicrobiological parameters together with the previous val-ues of the abundance of the groups of phytoplankton, theirtotal value and diversity index. In fact, we are facing 7 dif-ferent multiple regression problems: one for each group ofphytoplankton. Their particularity lies on the fact that we areespecially concerned with the prediction of a subset of valuesof the target variable: the rare extreme values that correspondto the micro-algae blooms. As already mentioned by Bobbinand Recknagel (2001) this implies that the algorithm whichwill conceive the model , has to learn an infrequently occur-ring event from a small number of examples, such as bloomsoccurrence.

In a multiple regression problem the objective is to learnthe best approximation fˇ of the unknown function f usinga data set D = {〈�xi, yi〉}n

i=1. The regression model parameters,ˇ, are estimated according to some error minimization proce-dure. The Mean Squared Error (MSE) and the Mean AbsoluteDeviation (MAD), defined by the Eqs. (1) and (2), are two of themost used error statistics in regression domains.

MSE = 1n

n∑

i=1

(yi − yi)2 (1)

MAD = 1n

n∑

i=1

|yi − yi| (2)

where yi is the prediction made by the model for the ith case.

2 1 2 ( 2 0 0 8 ) 86–91

We argue against the use of these error statistics when thegoal is to predict accurately a small proportion of values, likethe rare extreme values, mainly because they treat equally allthe errors made across the target variable range. Accordingto our objective, we should prefer a model which performsespecially well at the rare extreme values of the target vari-able. In this context, more emphasis should be given to theerrors made at these values, disregarding the errors made atthe “normal” values of the target variable. This is not the casewith MSE and MAD. They favour the models that achieve asmaller average error, independently of the true values asso-ciated to each error. Given the distribution of the values oftarget variable, this best score is obtained by models perform-ing well on the most frequent cases. Unfortunately, theseare not the relevant cases for our application as blooms arerare.

4. Rare extremes error

According to our objective, the errors made in different rangesof the target variable should not be treated equally. Namely,the errors made at rare extreme values should have moreweight than the rest of the errors. A possible approach wouldbe to weight the data set, with larger weights being givento cases with extreme values in the target. However, thiswould only partially meet our evaluation requirements. Whenanalysing the severity of the errors made by the model, it isnot enough to look at the extremeness of the true value anddisregard the extremeness of the predicted values. In fact, ifa model predicts a bloom for a case with a “normal” valueof occurrence of a micro-algal we have a false alarm. Falsealarms should also be avoided, as they can trigger unneces-sary preventive actions. As such, these situations should alsohave more weight. This means that the weight function shoulddepend on both the true and predicted values. In other wordsthe weight function should penalise both false alarms andmissed blooms.

The requirements of our weight function are stronglyrelated to the notion of cost sensitive evaluation (e.g. Turney,2000) used in classification domains, where the target vari-able is discrete. In some classification problems we knownthat making an error in a particular class is more seriousthan making an error in some other class. A frequently usedform of addressing this problem is through a cost matrixwhich defines all the costs for every combination of predictedand true classes. A possible way of addressing our predic-tion problem would be to transform it into a classificationproblem (e.g. Torgo and Gama, 1997) and use one of such cost-sensitive approaches. In particular, we would have to properlyselect the cut-points on the range of each the micro-algalabundance values, so that we could distinguish a non-bloomabundance value from a bloom abundance value in order toreflect our application objectives. This approach was followedfor instance by Bobbin and Recknagel (2001). The resultingclassification problem will inevitably lead to an unbalanced

classification problem (e.g. Provost et al., 1998; Weiss andProvost, 2003a, b). In fact, the classes associated with bloomvalues will have a very low frequency and thus to make amodel more accurate on these classes (that is more interest-
Page 4: A comparative study on predicting algae blooms in Douro River, Portugal

n g 2 1 2 ( 2 0 0 8 ) 86–91 89

id

ptSmscrttasctdc

sE

R

wbd

fm

isdfm

sstcb

bcttaaowrpbw

fie

Table 2 – Cost matrix associated to the Cyanobacteria

Predicted True

0 68 138.1 204.1

0 1 1 1 1468 1 1 1 10

138.1 1 1 1 6204.1 14 10 6 0.1

e c o l o g i c a l m o d e l l i

ng for our application) would involve using evaluation criteriaifferent from standard prediction accuracy.

We claim that the use of classification approaches in thisroblem leads to counter-intuitive situations as a result ofhe necessary discretization process of the abundance values.uppose that this process uses the cut-point 60 for deter-ining a bloom in some micro-algal. In this context a water

ample with a value of 58 would be considered a non-bloomase and two water samples with value 61 or 278 would beegarded as bloom cases. These means that the use of a crisphreshold creates an artificial differentiation between situa-ions that clearly seem more related (the 58 and 61 samples),nd considers similar samples that are clearly of differenteriousness (the 61 and 278 cases). We think this is clearlyounter-intuitive. A possible solution could be to further splithe bloom class into more classes to allow and increasedifferentiation, but that would turn an already unbalancedlassification problem into an even harder task.

In this context, we propose the Rare Extremes Error (RExE)tatistic presented in Torgo and Ribeiro (2006) and defined byq. (3).

ExE = 1n

n∑

i=1

w(yi, yi)L(yi, yi) (3)

here w is the weight function and L is a loss function that cane some error measure (e.g. the squared error or the absoluteeviation).

Inspired by the idea of cost matrix, we proposed a cost sur-ace, w(Y, Y), that can be seen as a continuous version of a cost

atrix.As we are dealing with a regression problem, we are

nterested in having a continuous cost function that variesmoothly along the range of the target variable and thatepends on both predicted and true values. We want this costunction to reflect our preference bias concerning the errors

ade at rare extreme values.In order to facilitate the specification of the continuous cost

urface, we suggest taking some selected points for which wepecify a cost and then use a function approximation methodo interpolate the complete surface. In our concrete appli-ation we want differentiate the value of the cost surfaceetween extreme and “normal” values.

The problem of appropriately specifying a threshold for aloom is not trivial (Ibanez and Conversi, 2002). In our appli-ation, as we did not have domain knowledge for setting thehreshold value defining algae blooms, we have used the dis-ribution properties of the variables. We established the upperdjacent value (adjH), as the limit above which the value ofbundance of a micro-algal is considered a bloom. This thresh-ld represents the largest value of a micro-algal abundancehich is not above the 3rd quartile plus 1.5 of the inter-quartile

ange. We then specify a set of equally spaced interpolationoints between the minimum value and the adjH. The num-er of points specified establishes the granularity with which

e wish to specify our cost surface.

The next step is to find the costs at these selected points toll in the cost matrix. These can either be specified by domainxperts or, when that sort of knowledge in not available, by

Fig. 2 – The cost surface obtained for Cyanobacteria.

the method described in Torgo and Ribeiro (2006). In our studywe have followed this latter alternative. We begin by setting to1, the costs associated to areas of cost surface where no rareextreme values are involved. Then we set the lowest cost of thematrix with a value lying between 0 and 1. This is the cost asso-ciated to correct predictions involving extreme values. Therest of the costs were filled in using an arithmetic progres-sion, according to the magnitude of the error made wheneverrare extreme values are involved. Lets see an example withCyanobacteria, where the adjH is equal to 204.1 cells/mL. Sup-pose we set an arithmetic progression of base 6 and ratio 4, andthe lower cost to be 0.1. This would result on the cost matrixshown in Table 2.

Once we have the cost matrix completely defined, we canuse a function approximation method to obtain a smoothcost surface using the selected points of the cost matrix. Wehave used the same procedure as in Torgo and Ribeiro (2006),where the loess function implemented in the R statistical soft-ware (R Development Core Team, 2005), was used to obtainthe surface.1 This function fits a local polynomial of degree1 taking the points defined in the cost matrix as samplesof points of the cost surface we want to approximate. Theresult is a smooth cost surface that satisfies the requirementof predicting rare extreme values, as shown in Fig. 2 for theCyanobacteria case.

1 Note that any other function approximation method couldhave been used.

Page 5: A comparative study on predicting algae blooms in Douro River, Portugal

l i n g 2 1 2 ( 2 0 0 8 ) 86–91

Table 3 – Estimated performance of the models for eachphytoplankton group

Error estimates: mean ± S.D.

NMAD NRExE

Cyanobacteriarpart 0.85 ± 0.08 0.89 ± 0.15nnet 1.21 ± 0.18 ++ 1.13 ± 0.26 +svm 0.88 ± 0.07 0.78 ± 0.09 −

Cryptophytarpart 0.84 ± 0.18 1.73 ± 0.70nnet 0.92 ± 0.30 3.29 ± 5.23svm 0.75 ± 0.17 − 0.95 ± 0.11 −

Euglenophytarpart 0.91 ± 0.47 11.80 ± 22.93nnet 2.37 ± 1.94 + 114.74 ± 220.01svm 0.55 ± 0.26 − 1.28 ± 0.80

Chrysophytarpart 1.05 ± 0.33 2.90 ± 1.69nnet 1.28 ± 0.23 2.83 ± 1.20svm 0.80 ± 0.16 − 1.16 ± 0.33 −

Dynophytarpart 0.96 ± 0.14 1.25 ± 1.20nnet 1.46 ± 0.41 ++ 1.87 ± 0.82svm 0.85 ± 0.15 − 0.82 ± 0.08

Chlorophytarpart 1.13 ± 0.29 4.60 ± 4.63nnet 1.08 ± 0.34 3.66 ± 4.25svm 0.81 ± 0.19 −− 0.87 ± 0.15 −

Diatomrpart 0.64 ± 0.26 0.57 ± 0.18nnet 0.86 ± 0.20 + 1.08 ± 0.39 +

90 e c o l o g i c a l m o d e l

5. Experimental comparisons betweendifferent predictive models

The goal of our experiments is to compare different modelsfor predicting the value of abundance of each group of phyto-plankton and identify the ones that have the best performanceat forecasting micro-algae blooms.

We have compared three different models: regression trees(Breiman et al., 1984), neural networks (Rumelhart et al.,1986) and support vector machines (Vapnik, 1996). This setof models was chosen because they have quite differentapproaches to regression tasks, thus ensuring some diversityin our analysis. In our experiments, we have used the imple-mentations of these techniques available in R (R DevelopmentCore Team, 2005) through the packages: rpart (Therneau andAtkinson, 2006), nnet (Venables and Ripley, 2002) and e1071

(Dimitriadou et al., 2006) which contains an implementationof svm).

Regarding our experimental methodology, we decided toleave out the last 6 months of data for the final testing of themodels. We used the remaining 4.5 years of data to estimatethe performance of the three different modelling techniquesthrough a Monte Carlo simulation. For each iteration of theMonte Carlo simulation we have randomly selected a datewithin this 4.5 years period ensuring that there were 2 previ-ous years to be used for training, i.e. 48 biweekly observations,and 9 following months, i.e. 18 biweekly observations, to beused for testing. We have repeated this process 10 times.For each repetition, the predictions made by each model forthe 9 months test period were obtained by a sliding-windowtechnique, that moves every two new observations, that is,approximately every month.

The performance of the models was asserted using a nor-malized version of our rare extremes error statistic (cf. Eq. (4))to allow better comparisons across the different micro-algae.

NRExE =

ntest∑

i=1

w(yi, yi) × |yi − yi|

ntest∑

i=1

w(yi, Y) × |yi − Y|(4)

where Y is the training sample median.In our experiments, we have defined the weighting func-

tion w based on a 4 × 4 cost matrix filled in with costs thatfollow an arithmetic progression of base 6 and ratio 4 and thatestablishes 0.1 as the minimum cost. This is a scenario sim-ilar to that shown in Table 2, but with different interpolationpoints according to each micro-algal abundance values.

We have also evaluated the performance of the modelsusing a “standard” error statistic, namely the NormalizedMean Absolute Deviation (NMAD) (c.f. Eq. (5)), to check whetherthe conclusions are different when using different evaluationmetrics.

ntest∑|y − y |

NMAD = i=1

i i

ntest∑

i=1

|yi − Y|(5)

svm 0.83 ± 0.20 ++ 0.80 ± 0.09 +

6. Discussion of the obtained results

We have used the Monte Carlo method to estimate the perfor-mance of each modelling technique according to the two errorstatistics referred above: NMAD and NRExE. For all modellingtechniques, we calculated the mean and standard deviation ofeach statistic in the 10 repetitions. Table 3 shows these resultsfor each group of phytoplankton.

We have asserted the statistical significance of theobserved difference in the 10 repetitions. For this purpose, wehave used the Wilcoxon signed rank test. For each error statis-tic we have asserted the significance of the differences of theresults obtained by nnet and svm against the results of rpart.We have denoted the results using: (+), (++), (−) and (−−). Thepositive signs indicate that the difference is positive, i.e. rpart

had a smaller error and thus a better performance, while neg-ative signs indicate the opposite. The number of signs reflectsthe confidence level of the observed difference: one sign rep-resents a 95% confidence level; two signs represent a 99.9%confidence level. The results of these comparisons for each

group of phytoplankton are also shown in Table 3, besides theerror estimates.

Globally, we can say that the methods that were tried didnot achieve good performance. Namely, there are some cases

Page 6: A comparative study on predicting algae blooms in Douro River, Portugal

n g

wriemaroc

afmih

sepestsFs

stooisa

povp

7

TtuwdpetOt

teiwaa

r

ML-TR-44, Department of Computer Science, RutgersUniversity.

e c o l o g i c a l m o d e l l i

here the error estimate is higher than 1, meaning that theespective model is performing worst than simply predict-ng the median. These high error estimates were somehowxpected as the models are biased to the prediction of theost frequent values and, as such, have difficulty in predicting

ccurately the rare extreme values. Nevertheless, we shouldemark that we did not explore different parameterizationsf these models. We could probably improve the results byarefully adjusting their parameters.

From the observation of Table 3, we can also notice that,ccording to NMAD, svm outperforms the other two techniquesor the prediction of values of abundance of almost all the

icro-algae. Moreover, if we look at NRExE, that gives moremportance to errors on extreme values (blooms), they stillave the best performance.

The significance tests carried out reinforce these conclu-ions. According to NRExE, the svm models have not only betterrror estimates, but also they have, for some groups of phyto-lankton, a significantly better performance in terms of rarextreme values prediction. It is interesting to observe thatome performance differences that were insignificant fromhe perspective of the standard evaluation statistic becomeignificant when using our evaluation statistic and vice versa.or instance, in the Cyanobacteria case, the error estimate ofvm not only decreased according to NRExE, but also becameignificantly better than the other two techniques. On the con-rary, for the Dinophyta case, the apparent good performancef svm revealed to be insignificant in the ranking performancesf the models for the prediction of the blooms. This is relevant

nformation given the main objectives of this application andtresses the dangers of using an evaluation metric that is notdjusted to these goals.

In summary, we have observed that, according to ourroposed evaluation metric, svm generally outperforms thether two techniques in terms of predicting rare extremealues of occurrence of almost all the considered groups ofhytoplankton.

. Conclusions and future work

his paper presents a comparative study of different predic-ion models on the difficult task of predicting algae bloomssing physical–chemical and microbiological parameters ofater samples in river Douro, Portugal. Our study confirms theifficulty of the problem caused mainly by the scarcity of thehenomenon we want to predict accurately. This type of rarevents requires specific evaluation criteria. We have used suchype of metrics to evaluate the performance of several models.ur results indicate a variant of support vector machines as

he most promising model.Future work will include some adjustments to this evalua-

ion measure so as to make it more interpretable for domainxperts in terms of the effectiveness of the models in the

dentification of the algae blooms. As a long-term objective,e also plan to include the proposed evaluation metric intomodelling technique to better tune the model parameters

ccording to the objectives of this type of applications.

2 1 2 ( 2 0 0 8 ) 86–91 91

Acknowledgements

This work was partially supported by FCT project MODAL(POSI/40949/2001) co-financed by POSI and by the Europeanfund FEDER, and by a PhD scholarship of the Portuguese gov-ernment (SFRH/BD/1711/2004) to R. Ribeiro. We would like tothank Aguas do Douro e Paiva, SA, for providing the data usedin this study.

e f e r e n c e s

Bobbin, J., Recknagel, F., 2001. Knowledge discovery for predictionand explanation of blue-green algal dynamics in lakes byevolutionary algorithms. Ecol. Model. 146, 253–262.

Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classificationand Regression Trees. Statistics/Probability Series. Wadsworth& Brooks/Cole Advanced Books & Software.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A.,2006. e1071: Misc Functions of the Department of Statistics(e1071), TU Wien. R package Version 1. pp. 5–13.

Hill, M., 1973. Diversity and evenness: a unifying notation and itsconsequences. Ecology 54,427–473.

Ibanez, F., Conversi, A., 2002. Prediction of missing values anddetection of exceptional events in a chronological planktonicseries: a single algorithm. Ecol. Model. 154, 9–23.

Lee, J., Huang, Y., Dickman, M., Jayawardena, A.W., 2003. Neuralnetwork modelling of coastal algal blooms. Ecol. Model. 179,179–201.

Muttil, N., Lee, J.H.W., 2005. Genetic programming for analysisand real-time prediction of coastal algal blooms. Ecol. Model.189, 363–376.

Provost, F., Fawcett, T., Kohavi, R., 1998. The case against accuracyestimation for comparing induction algorithms. In:Proceedings of 15th International Conference on MachineLearning, Morgan Kaufmann, San Francisco, CA,pp. 445–453.

R Development Core Team, 2005. R: A Language and Environmentfor Statistical Computing. R Foundation for StatisticalComputing, Vienna, Austria, ISBN 3-900051-07-0.

Rumelhart, D., Hinton, G., Williams, R., 1986. Learning InternalRepresentations by Back Propagation. MIT Press.

Therneau, T.M., Atkinson, B., 2006. rpart: Recursive Partitioning, Rpackage version 3.1-29 (R port by Brian Ripley).

Torgo, L., Gama, J., 1997. Regression using classificationalgorithms. Intell. Data Anal. 1, 4.

Torgo, L., Ribeiro, R., 2006. Predicting Rare Extreme Values. In: Ng,W., Kitsuregawa, M., Li, J., Chang, K. (Eds.), Proceedings of the10th Pacific-Asia Conference on Knowledge Discovery andData Mining (PAKDD’2006). Springer, pp. 816–820 (number3918 in LNAI).

Turney, P., 2000. Types of cost in inductive concept learning.Vapnik, V., 1996. The Nature of Statistical Learning Theory.

Springer.Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with

S, fourth ed. Springer, New York, ISBN 0-387-95457-0.Weiss, G., Provost, F., 2003a. The effect of class distribution on

classifier learning: an empirical study. Technical Report

Weiss, G., Provost, F., 2003b. Learning when training data arecostly: the effect of class distribution on tree induction. JAIR19, 315–354.