MACHINE LEARNING METHODS WITHOUT TEARS: A PRIMER FOR ECOLOGISTS

Julian D. Olden
School of Aquatic and Fishery Sciences, University of Washington

Seattle, Washington 98195 USA

e-mail: [email protected]

Joshua J. Lawler
College of Forest Resources, University of Washington

Seattle, Washington 98195 USA

e-mail: [email protected]

N. LeRoy Poff
Department of Biology, Colorado State University

Fort Collins, Colorado 80523 USA

e-mail: [email protected]

keywords
ecological informatics, classification and regression trees, artificial neural networks, evolutionary algorithms, genetic algorithms, GARP, inductive modeling

abstract
Machine learning methods, a family of statistical techniques with origins in the field of artificial intelligence, are recognized as holding great promise for the advancement of understanding and prediction about ecological phenomena. These modeling techniques are flexible enough to handle complex problems with multiple interacting elements and typically outcompete traditional approaches (e.g., generalized linear models), making them ideal for modeling ecological systems. Despite their inherent advantages, a review of the literature reveals only a modest use of these approaches in ecology as compared to other disciplines. One potential explanation for this lack of interest is that machine learning techniques do not fall neatly into the class of statistical modeling approaches with which most ecologists are familiar. In this paper, we provide an introduction to three machine learning approaches that can be broadly used by ecologists: classification and regression trees, artificial neural networks, and evolutionary computation. For each approach, we provide a brief background to the methodology, give examples of its application in ecology, describe model development and implementation, discuss strengths and weaknesses, explore the availability of statistical software, and provide an illustrative example. Although the ecological application of machine learning approaches has increased, there remains considerable skepticism with respect to the role of these techniques in ecology. Our review encourages a greater understanding of machine learning approaches and promotes their future application and utilization, while also providing a basis from which ecologists can make informed decisions about whether to select or avoid these approaches in their future modeling endeavors.

The Quarterly Review of Biology, June 2008, Vol. 83, No. 2
Copyright © 2008 by The University of Chicago. All rights reserved. 0033-5770/2008/8302-0002$15.00

Introduction

PREDICTIVE ABILITY is considered by many to be the ultimate goal in ecology (Peters 1991). Recent decades have witnessed an increasing role of prediction in applied ecology, in large part because of the mounting threats to biological diversity from global environmental change and the resulting need for ecological forecasting (Clark et al. 2001). Such efforts, however, are hindered by the many complexities of ecosystems, including historical legacies, time lags, nonlinearities, interactions, and feedback loops that vary in both time and space (Levin 1998). Accordingly, ecologists are challenged by the need to understand and predict complex ecological processes and patterns.

One promising set of quantitative tools that can help solve such environmental challenges (e.g., global climate change, emerging diseases, biodiversity loss) is currently being researched and developed under the rubric of ecological informatics (Green et al. 2005). Ecological informatics, or eco-informatics, is an interdisciplinary framework that promotes the use of advanced computational technology to reveal ecological processes and patterns across levels of ecosystem complexity (Recknagel 2003). Machine learning (ML) is a rapidly growing area of eco-informatics that is concerned with identifying structure in complex, often nonlinear data and generating accurate predictive models. Recent advances in data collection technology, such as remote sensing and data network centers and archives, have produced large, high-resolution datasets spanning spatial and temporal extents that were, until recently, unattainable. As a result, ecologists have the exciting opportunity to take advantage of ML approaches to model the complex relationships inherent in these large datasets. Applications of ML methods in ecology are diverse, and range from testing biogeographical, ecological, and evolutionary hypotheses to modeling species distributions for conservation and management planning (e.g., Fielding 1999; Recknagel 2001, 2003; Cushing and Wilson 2005; Ferrier and Guisan 2006; Park and Chon 2007).

ML algorithms can be organized according to a diverse taxonomy that reflects the desired outcome of the modeling process. A number of ML techniques have been promoted in ecology as powerful alternatives to traditional modeling approaches. These include supervised learning approaches that attempt to model the relationship between a set of inputs and known outputs, such as artificial neural networks (Lek et al. 1996), cellular automata (Hogeweg 1988), classification and regression trees (De’ath and Fabricius 2000), fuzzy logic (Salski and Sperlbaum 1991), genetic algorithms and programming (Stockwell and Noble 1992), maximum entropy (Phillips et al. 2006), support vector machines (Drake et al. 2006), and wavelet analysis (Cho and Chon 2006). In addition, unsupervised learning approaches are used to reveal patterns in ecological data, including Hopfield neural networks (Hopfield 1982) and self-organizing maps (Kohonen 2001). The growing use of these methods in recent years is the direct result of their ability to model complex, nonlinear relationships in ecological data without having to satisfy the restrictive assumptions required by conventional, parametric approaches (Guisan and Zimmermann 2000; Peterson and Vieglais 2001; Olden and Jackson 2002a; Elith et al. 2006). As a result, ML approaches often exhibit greater power for explaining and predicting ecological patterns. The recent formation of The International Society for Ecological Informatics, as well as the birth of the scientific journal Ecological Informatics, supports the observation that ML has evolved from a field of theoretical demonstrations to one of significant and applied value in ecology.

Perhaps not surprisingly, ML approaches are predominantly used by ecologists with strong computational skills, and they have seen only limited use within the broader scientific community. Why have ML approaches not been widely embraced by ecologists? One reason is that ecologists may lack the fundamental background needed to understand and implement these methods, and they may be unsure about how to select approaches that best suit their needs. At the same time, researchers in the field of ecological informatics continue to advance better and more complex ML algorithms, arguing that more powerful computers and increased availability of large ecological datasets will move them further into the mainstream. Unfortunately, as the technology in ML grows, so does the inaccessibility of these techniques to the majority of ecologists who still require a basic understanding of why, when, where, and how such approaches should be applied.

We argue that there are relatively few examples in the ecological literature that encourage the exploration and promote the application of ML methods. In this paper, we will address this concern by providing a comprehensive review of three ML methods that have recently gained popularity among ecologists: classification and regression trees, artificial neural networks, and evolutionary computation (genetic algorithms and programming); however, we recognize that other statistical approaches, including generalized additive models and multivariate adaptive regression splines, have also illustrated utility in ecology (e.g., Austin 2007; Elith and Leathwick 2007). For each approach, we will provide a brief background to the methodology, give examples of its application in ecology, describe model development and implementation, discuss strengths and weaknesses, and explore the availability of statistical software. In order to more clearly illustrate the basic principles of the ML methodologies, we will apply each method to a common ecological question, namely modeling species richness (dependent variable) as a function of environmental descriptors (independent variables). We stress that our review is not meant to replace previously published texts on ML (e.g., Fielding 1999; Lek and Guegan 2000); rather, it is intended to provide a gentle introduction to ML methods that is more readily accessible to the broad community of ecologists. We accomplish this by favoring written explanation over mathematical formulas and by avoiding statistical jargon that often serves to limit the readership and comprehension of ML methodologies by ecologists. Put simply, our hope is that this paper will encourage a greater understanding and the future application of ML approaches in the ecological sciences.

An Illustrative Example of Machine Learning Methods

In order to illustrate ML methodologies, we will use an empirical example relating fish species richness to environmental characteristics of 8236 north-temperate lakes in Ontario, Canada. Identifying both patterns and drivers of species richness is a long-standing problem in ecology because environmental factors typically interact in nonlinear ways to influence the number of species at any particular site. This example is used solely to demonstrate a common statistical problem in which a researcher is interested in modeling a single dependent variable as a function of multiple independent variables and, thus, is not meant as a comparative analysis of approaches. We chose this relatively simple dataset and straightforward ecological problem for illustrative purposes. Although ML approaches are well-suited for addressing even seemingly simple problems, many of the advantages of these approaches can be brought to bear on much more complex problems as well.

We selected 8 whole-lake descriptors that are related to habitat requirements of temperate fish species of the Ontario region (Minns 1989). Regional climate was represented by the mean monthly air temperature (TEMP, measured in °C) and mean monthly precipitation (PPT, cm) for each lake based on data collected from 1836 recording stations between 1960 and 1989 by the Atmospheric Environment Service of Environment Canada (see Vander Zanden et al. 2004). Whole-lake measures of habitat included lake surface area (AREA, km2), total shoreline perimeter (SHP, km), maximum depth (MAXD, m), elevation (ELEV, m), secchi disc depth (SDD, m), and pH. The primary source of the fish distribution data was the Fish Species Distribution Data System of the Ontario Ministry of Natural Resources. To assess predictive performance of the models, we used 10-fold cross-validation. In this procedure, the original data are partitioned into 10 subsamples, each containing n/10 observations; a single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data. The cross-validation process is then repeated 10 times (hence, folds), with each of the subsamples used exactly once as the validation data. The 10 results are then combined to produce a single set of predictions for all n observations. For general information regarding model selection, model validation, and the assessment of predictive performance (i.e., topics that are not the focus of our study), we refer the reader to Fielding and Bell (1997).
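The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration, not code from the paper or from any particular package; the function names and the toy "model" (which simply predicts the training-set mean richness) are our own assumptions.

```python
import random

def k_fold_predictions(xs, ys, fit, predict, k=10, seed=1):
    """Partition n observations into k folds; for each fold, train on the
    other k-1 folds and predict the held-out observations, so that every
    observation receives exactly one out-of-sample prediction."""
    n = len(ys)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal subsamples
    preds = [None] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = predict(model, xs[i])
    return preds

# Toy illustration: the "model" is simply the training-set mean richness.
fit = lambda X, y: sum(y) / len(y)
predict = lambda model, x: model
richness = [3, 5, 2, 8, 6, 4, 7, 5, 9, 1, 6, 4]
preds = k_fold_predictions(richness, richness, fit, predict, k=4)
```

Combining the per-fold predictions into one vector, as above, is what allows a single measure of predictive performance to be computed over all n observations.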

Classification and Regression Trees (CARTs)

background and ecological applications

Classification and Regression Trees (CARTs), collectively called decision trees, date from the pioneering work of Morgan and Sonquist (1963) in the social sciences, and their use in the statistical literature was rekindled by the seminal monograph of Breiman et al. (1984). Since this time, decision trees have been widely used in a number of applied sciences including medicine, computer science, and psychology (Ripley 1996). Recent years have seen CARTs emerge as powerful statistical tools for analyzing complex ecological datasets because they offer a useful alternative when modeling nonlinear data containing independent variables that are suspected of interacting in a hierarchical fashion (De’ath and Fabricius 2000).

There have been numerous ecological applications of CARTs across a wide range of topics. Decision trees have been used to develop habitat models for threatened birds (O’Connor et al. 1996), tortoise species (Anderson et al. 2000), and endangered crayfishes (Usio 2007). Iverson and Prasad (1998) forecasted potential shifts in tree species distributions resulting from climatic warming, Rollins et al. (2004) quantified the relationship between the frequency and severity of forest fires and landscape structure, and Mercado-Silva et al. (2006) predicted patterns of fish species invasions in the Laurentian Great Lakes. Other applications have involved modeling patterns of variability in PCB concentrations of salmonid species (Lamon and Stow 1999), predicting days postpartum from fatty acids measured in harbor seal milk (Smith et al. 1997), delineating geographic patterns of bottlenose dolphin ecotypes (Torres et al. 2003), and developing models that assessed the vulnerability of the landscape to tsunami damage (Iverson and Prasad 2007).

methodology
CART analysis is a form of binary recursive partitioning where classification and regression trees refer to the modeling of categorical and continuous response variables, respectively (Bell 1999). The general anatomy of a decision tree is presented in Figure 1. The term “binary” implies that each group of observations, represented by a node in a decision tree, is split into two child nodes, a process through which the original node becomes a parent node. The term “recursive” refers to the fact that the binary partitioning process can be applied repetitively. Thus, each parent node can give rise to two child nodes and, in turn, each of these child nodes may themselves be split, forming additional children. The term “partitioning” refers to the fact that the dataset is split into sections, or partitioned. Although there are many different versions of binary recursive partitioning available, each with its own unique details, the overall methodology is consistent regardless of the exact implementation.

CART analysis consists of three basic steps. The first step involves tree building, during which a decision tree is built by repeatedly partitioning the data set into a nested series of mutually exclusive groups, each of them as homogeneous as possible with respect to the response variable. Tree building begins at the root node with the entire dataset, and the algorithm formulates split-defining conditions for each possible value of all the independent variables to create candidate (or surrogate) splits. Next, the algorithm selects the best candidate split, namely the one that minimizes the average “impurity” of the two child nodes. Impurity is based on a goodness-of-fit measure, such as the information (entropy) index or the Gini index for classification trees and sums of squares about group means for regression trees (De’ath and Fabricius 2000); other splitting criteria are also available. The algorithm continues recursively with each of the new child nodes until tree building is stopped.
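The split-selection step just described can be made concrete with a small sketch. We use sums of squares about the group mean as the impurity measure (the regression-tree case); the data are hypothetical, and the helper names are ours, not part of any CART package.

```python
def sse(values):
    """Sum of squares about the group mean (regression-tree impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Scan every candidate threshold 'Is x <= d?' and return the (d, score)
    pair minimizing the combined impurity of the two child nodes."""
    best = None
    for d in sorted(set(x))[:-1]:  # a split must leave both children non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= d]
        right = [yi for xi, yi in zip(x, y) if xi > d]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (d, score)
    return best

# Hypothetical lake areas (km2) and species richness for six lakes:
area = [0.5, 1.0, 1.2, 3.0, 4.5, 6.0]
richness = [2, 3, 3, 8, 9, 10]
threshold, impurity = best_split(area, richness)  # separates small from large lakes
```

Here the exhaustive scan picks the threshold that yields the two most internally homogeneous groups of richness values, which is the essence of each recursive partitioning step.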

The second step consists of stopping the tree building process. The process is stopped when: (1) there are only n observations in each of the child nodes (where n is set by the user); (2) all observations within each child node have the identical distribution of independent variables, making splitting impossible; or (3) an external limit on the number of splits in the tree or minimum purity threshold is achieved. A terminal node or “leaf” is a node that the algorithm cannot partition any further because one of the above criteria is met. In classification trees, each node (even the root node) is assigned a predicted probability for each class (often the class with the greatest probability is assigned to the node). For regression trees, the predicted value for each node is typically defined as the mean or median value of the response variable for the observations in the node.

Figure 1. The General Anatomy of a Classification or Regression Tree

The third step involves tree pruning and optimal tree selection. The most common method used is called “cost-complexity” pruning, which results in the creation of a sequence of progressively simpler trees through the pruning, or cutting, of increasingly important nodes. This method relies on a complexity parameter, denoted α, which is gradually increased during the pruning process. Starting at the terminal nodes, the child nodes are pruned away if the resulting change in the model error is less than α multiplied by the change in tree complexity. Thus, α is a measure of how much additional accuracy a split must add to the entire tree to warrant the additional complexity. As α is increased, more and more nodes of increasing importance are pruned away, resulting in simpler and simpler trees (Bell 1999). The goal in selecting the optimal tree is to find the correct complexity parameter α so that the information in the learning dataset is fit but not overfit, the latter condition occurring when the model successfully describes the specifics of the original data but is unable to predict additional data. It is common practice to determine the optimal size of the decision tree (i.e., the number of terminal nodes) by selecting the tree size with the smallest model error based on repeated cross-validations of the data. Alternatively, the smallest tree whose model error falls within one standard error of the minimum (the 1-SE rule) is also used (see De’ath and Fabricius 2000). For further discussion on validation techniques and their use in identifying ideal tree size, see Breiman et al. (1984), Bell (1999), Hastie et al. (2001), and Sutton (2005).
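The pruning criterion reduces to a simple comparison, sketched below under our own naming; real CART implementations compute α over entire subtrees rather than single splits, so this is only the core idea.

```python
def keep_split(error_with_split, error_without_split, leaves_added, alpha):
    """Under cost-complexity pruning, a split earns its keep only if the
    error reduction it delivers exceeds alpha times the added complexity
    (the extra terminal nodes it introduces)."""
    return (error_without_split - error_with_split) > alpha * leaves_added

# A hypothetical split cutting model error from 10.0 to 9.5 at a cost of one extra leaf:
kept_small_alpha = keep_split(9.5, 10.0, 1, alpha=0.25)  # retained at small alpha
kept_large_alpha = keep_split(9.5, 10.0, 1, alpha=1.0)   # pruned away as alpha grows
```

Sweeping α from 0 upward in this way generates the nested sequence of progressively simpler trees from which cross-validation selects the final model.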

Based on the branching topology of the decision tree, one can interpret the primary splits that represent the most important variables in the prediction process, as well as the best competitive surrogate splits that also show high classification power. To calculate the overall importance of the independent variables in a decision tree, the CART analysis quantifies the improvement measure attributable to each variable in its role as a surrogate to the primary split. The values of these improvements are summed over each node, totaled, and scaled relative to the best performing variable, i.e., expressed as a relative importance on a 0–100% scale (Breiman et al. 1984).
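The final scaling step can be written out directly. The improvement values below are hypothetical numbers chosen for illustration, not results from the case study.

```python
def relative_importance(raw):
    """Scale summed improvement scores so the best predictor reads 100%."""
    top = max(raw.values())
    return {var: 100.0 * val / top for var, val in raw.items()}

# Hypothetical summed improvements over all primary and surrogate splits:
raw = {"AREA": 42.0, "SHP": 33.6, "PPT": 21.0, "TEMP": 8.4}
scores = relative_importance(raw)
```

Because every variable is expressed relative to the best performer, the scores are comparable within a tree but, as the paper notes for Figure 2B, they need not sum to 100.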

strengths and weaknesses
CART analysis has a number of advantages over traditional statistical methods that make it particularly attractive for modeling ecological data. These include the fact that CARTs are: (1) inherently nonparametric and, therefore, not affected by heteroscedasticity or distributional error structures that affect parametric procedures; (2) invariant to monotonic transformations of the data, thus eliminating the need for data transformations; (3) able to handle mixed numerical data including categorical, interval, and continuous variables; (4) able to deal with missing variables by using the surrogate splitting variables in the decision tree; (5) not affected by outliers (outliers are isolated into a node, and have only a minimal effect on splitting); (6) able to detect and reveal interactions in the data set; (7) able to effectively deal with higher dimensionality (i.e., able to identify a reduced set of important variables from a large number of submitted variables); and (8) relatively simple to interpret graphically. In general, of the three ML approaches reviewed here, CART is the most flexible with respect to data requirements and the most transparent when it comes to understanding the modeling process (Table 1).

Despite their many advantages, CARTs have a number of weaknesses that are rarely discussed in the literature. First, identifying “splitting” variables through an exhaustive search of all possibilities can cause computational strain, especially with large data sets (although advances in computer speed have partially eliminated this problem), and this search tends to select a decision tree that has more splits, thus promoting overfitting. Second, deducing rules from a decision tree can be very complicated because it is not based on a probabilistic model; therefore, there is no probability level or confidence interval associated with predictions derived from using a classification or regression tree to classify a new set of data. Third, correlations among independent variables can complicate the identification of important interactions. Fourth, decision trees are typically unstable, i.e., sometimes small changes in the learning sample values can lead to significant changes in the variables used in the splits. As a result, overall variable importance cannot be determined by examining only the final tree; it also requires the examination of all possible surrogate splits. Fifth, and perhaps the greatest weakness of CARTs, the final decision tree is not guaranteed to be the optimal tree. At each splitting decision in the tree growing process, the selected split is the one that results immediately in reduced impurity (for classification) or variation (for regression), yet some other split, which would appear suboptimal at the time, could produce more effective future splits (Sutton 2005). A variety of approaches have been developed to address the latter two problems, including the application of bagging and boosting techniques and the creation of an ensemble tree based on random forests of multiple trees. We refer the reader to De’ath (2007) and Cutler et al. (2007) for an ecological treatment of these topics.

software
Many commercial packages are available to implement CART. This software ranges from tools requiring a fair amount of user design and programming to powerful, user-friendly Windows-based programs with graphical user interfaces. Windows-based programs include CART (www.salford-systems.com), DTREG (www.dtreg.com), KnowledgeSEEKER (www.angoss.com), QUEST (www.stat.wisc.edu/~loh), PolyAnalyst (www.megaputer.com), Random Forests (www.stat.berkeley.edu/users/breiman), Shih Data Miner (www.shih.be), See5/C5.0 (www.rulequest.com), and XpertRule Miner (www.attar.com). Modules and libraries for statistical software packages include AnswerTree for SPSS (www.spss.com/answertree), Multivariate Exploratory Techniques (Classification Trees) for Statistica (www.statsoft.com), Enterprise Miner for SAS (www.sas.com), the Tree library for S-Plus (http://lib.stat.cmu.edu/S), and rpart for the R package (http://cran.r-project.org).

case study
We constructed a regression tree to gain explanatory and predictive insight into the environmental drivers of fish species richness. CART begins with the entire heterogeneous sample of 8236 north-temperate lakes, consisting of both species-poor and species-rich lakes. The goal of CART analysis is to partition the sample according to a “splitting rule” and a “goodness of split” criterion. Splitting rules are questions of the form, “Is the environmental variable less than or equal to some value?” or, put more generally, “Is X ≤ d?,” where X is an independent variable and d is a constant within the range of that variable. Such questions are used to “split” the sample, and a goodness of split criterion compares different splits and determines which of these will produce the most homogeneous subsamples. With respect to our case study, we wanted to disaggregate the lakes into those with similar values of fish species richness. As an example, Figure 2 is based on output produced by CART (Salford Systems, Inc.) when we set the parent node minimum to 1000 lakes and the terminal node minimum to 200 lakes. The first step was to select the optimal size of the regression tree, defined by the number of terminal nodes. Breiman et al. (1984) suggest the 1-SE rule, whereby the best tree is taken as the smallest tree having an estimated error rate that is within one standard error of the minimum value across all trees. Using 10-fold cross-validation, we selected a final tree with 8 terminal nodes that exhibited a predictive performance of R = 0.65 between predicted and observed species richness (Figure 2A).

TABLE 1
Comparison of machine learning approaches according to a number of model characteristics

Characteristic                                              GLM    CART      ANN       EA
Data Requirements
  Accommodate “mixed” data types                            Low    High      Low       Moderate
  Accommodate missing values of predictors                  Low    High      Low       Low
  Insensitive to monotonic transformations of predictors    Low    High      Moderate  Moderate
  Robust to outliers in predictors                          Low    Moderate  Moderate  Moderate
  Insensitive to irrelevant predictors                      Low    High      Moderate  Moderate
Modeling Process
  Automation (i.e., low degree of user involvement)         High   Moderate  Moderate  Low
  Transparency of the modeling process                      High   Moderate  Low       Low
  Ability to model nonlinear relationships                  Low    Moderate  High      High
  Accommodate interactions among predictors                 Low    Moderate  High      High
Model Output
  Explanatory insight and variable interpretability         High   Moderate  Moderate  Low
  Predictive power                                          Low    Moderate  High      High
Software Availability and Ease-of-Use                       High   Moderate  Low       Low

Classification and regression trees (CARTs), artificial neural networks (ANNs), and evolutionary algorithms (EAs) are compared to the family of generalized linear models (GLMs) that are traditionally used in ecology. Comparisons are generalized to include both classification and prediction problems. Values are based on Hastie et al. (2001), peer-reviewed literature, and the personal experiences of the authors.

The final regression tree is shown in Figure 2C. The root node at the top of the tree shows that the 8236 study lakes in the initial sample contain, on average, 6.7 fish species. The first split of the root node is based on lake surface area. For lakes less than or equal to 1.5 km2 in area (“Yes” answer on the left-hand branch of the tree), the average richness is 5.3 (n = 5511 lakes). This group is split based on a shoreline perimeter greater than (“No”) or less than (“Yes”) 3.5 km (average richness = 6.3 vs. 3.9 species). Of the 2322 lakes with smaller shoreline perimeters, the regression tree distinguishes between two terminal nodes based on an annual air temperature threshold of 2.2°C: node A representing 859 lakes with an average of 3.1 species, and node B representing 1463 lakes with an average of 4.5 species. The parent node containing the 3189 lakes with larger shoreline perimeters is split by a number of additional criteria, including mean monthly precipitation and elevation, to produce terminal nodes C through E, which represent the most homogeneous subgroups that can be partitioned with the given independent variables. The same interpretation applies to the right-hand split leading off from the root node. According to all surrogate splits in the regression tree, we find that surface area, shoreline perimeter, and mean monthly precipitation are the most important predictors of fish species richness (Figure 2B).
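A fitted regression tree is ultimately just a cascade of threshold rules, which is why it is simple to interpret graphically. The sketch below writes the left-hand splits described above as nested conditionals; the predicted values are the node means reported in the text, the direction of the 2.2°C split (colder branch to node A) is our assumption, and the right-hand branch of the tree is not reproduced.

```python
def predict_richness(area_km2, shp_km, temp_c):
    """The first few left-hand splits of the case-study tree, as rules.
    Returned values are the node mean richness figures from the text."""
    if area_km2 <= 1.5:          # root split: lake surface area
        if shp_km <= 3.5:        # shoreline perimeter
            if temp_c <= 2.2:    # mean monthly air temperature (direction assumed)
                return 3.1       # terminal node A (859 lakes)
            return 4.5           # terminal node B (1463 lakes)
        return 6.3               # parent node mean; split further (nodes C-E) in the full tree
    return None                  # right-hand branch not reproduced here

small_cold_lake = predict_richness(1.0, 2.0, 1.5)  # follows the node A path
```

Reading predictions off the tree in this way requires no probabilistic model, which is both a convenience and, as noted above, the reason no confidence interval accompanies each prediction.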

Artificial Neural Networks (ANNs)

background and ecological applications
An artificial neural network (ANN) or, more generally, a multilayer perceptron, is a modeling approach inspired by the way biological nervous systems process complex information. The key element of the ANN is the novel structure of the information processing system, which is composed of a large number of highly interconnected elements called neurons, working in unity to solve specific problems. The concept of ANNs was first introduced in the 1940s (McCulloch and Pitts 1943); however, it was not popularized until the development of the back-propagation training algorithm by Rumelhart et al. (1986). The flexibility of this modeling technique has led to its widespread use in many disciplines such as physics, economics, and biomedicine.
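The "interconnected neurons" of an ANN amount to layered weighted sums passed through an activation function. Below is a minimal forward pass through a single-hidden-layer network; the sigmoid activation is standard, but the weights here are arbitrary illustrative numbers, not a trained model, and training by back-propagation is not shown.

```python
import math

def sigmoid(z):
    """Standard logistic activation, squashing any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of incoming signals plus bias, then activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_layer):
    """One pass through a single-hidden-layer network (multilayer perceptron):
    each layer is a list of (weights, bias) pairs, one per neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return [neuron(hidden, w, b) for w, b in output_layer]

# Two inputs -> two hidden neurons -> one output, with arbitrary weights:
hidden_layer = [([0.5, -0.4], 0.1), ([0.9, 0.2], -0.3)]
output_layer = [([1.2, -0.7], 0.05)]
(prediction,) = forward([0.8, 0.3], hidden_layer, output_layer)
```

Training consists of adjusting the (weights, bias) pairs so that such forward passes reproduce the observed outputs, which is what the back-propagation algorithm of Rumelhart et al. (1986) accomplishes.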

Researchers in ecology have also recognized the potential mathematical utility of neural network algorithms for addressing an array of problems. Previous applications include the modeling of species distributions (Mastrorillo et al. 1997; Ozesmi and Ozesmi 1999), species diversity (Guegan et al. 1998; Brosse et al. 2001; Olden et al. 2006b), community composition (Olden et al. 2006a), and aquatic primary and secondary production (Scardi and Harding 1999; McKenna 2005). Cornuet et al. (1996) used a neural network to assign individuals to appropriate taxonomic groups using multilocus genotypes. Spitz and Lek (1999) modeled wildlife damage to farmlands, and Thuiller (2003) assessed the potential impacts of climate change on the distribution of tree species in Europe. Other applications have occurred in the fields of water resources management (Maier and Dandy 2000), invasive species biology (Vander Zanden et al. 2004), and pest management (Worner and Gevrey 2006). A collection of ANN applications in ecology is presented in Lek and Guegan (2000), Recknagel (2003), and Ozesmi et al. (2006), as well as in special issues of Ecological Modelling and Ecological Informatics (e.g., Recknagel 2001; Park and Chon 2007).

Figure 2. Regression Tree and Associated Results for Predicting Fish Species Richness
Results from the regression tree for predicting fish species richness as a function of environmental characteristics for 8236 north-temperate lakes in Ontario, Canada. (A) 10-fold cross-validation (solid circles) and resubstitution (empty circles) relative error for the regression tree. The dashed line represents +1-SE of the relative error for the minimum regression tree (i.e., 15 nodes), and the selected tree under the 1-SE rule is indicated by the arrow. (B) Relative importance of the environmental variables for predicting fish species richness (note that values do not sum to 100). Variables include mean monthly air temperature (TEMP) and precipitation (PPT), lake surface area (AREA), total shoreline perimeter (SHP), maximum depth (MAXD), elevation (ELEV), Secchi disc depth (SDD), and pH. (C) The final regression tree relating fish species richness to lake environmental characteristics. Node precision is indicated by Root-Mean-Squared-Error.

methodology

There are many types of supervised and unsupervised learning methods for ANNs (Bishop 1995). Here we describe the method most frequently used in ecology: the one hidden-layer, supervised, feedforward neural network trained by the back-propagation algorithm. These neural networks are popular in the ecological literature because they are considered to be universal approximators of any continuous function (Hornik et al. 1989). In this section, we will discuss neural network architecture and the back-propagation algorithm used to parameterize the network, and we will describe the various methods available to quantify variable importance.

Network architecture refers to the number and organization of the neurons in the network (see Figure 3 for the general anatomy of a neural network). In the feedforward network, neurons are organized in an input layer, a hidden layer, and an output layer, with each layer containing one or more neurons. Each neuron is connected to all neurons in adjacent layers with an axon; however, neurons within each layer and in nonadjacent layers are not connected. The input layer typically contains p neurons, one neuron representing each of the independent variables x1 through xp. The number of neurons in the hidden layer can be selected arbitrarily or determined empirically by the investigator to minimize the trade-off between bias and variance (Geman et al. 1992). The addition of hidden neurons increases the ability of a network to approximate any underlying relationship among the variables (i.e., resulting in reduced bias), but also increases the variance of predictions due to overfitting the data. Although mathematical derivations exist for selecting an optimal design (see Bishop 1995), in practice it is common to train networks with different numbers of hidden neurons and to use the performance on a test data set to choose the network that performs the best. For continuous and binary response variables, the output layer commonly contains one neuron, but the number of output neurons can be greater than one if there is more than one response variable or if the response variable is categorical (i.e., a separate neuron for classifying observations into each category). Additional bias neurons with a constant output are also added to the hidden and output layers, although this is not mandatory, as these neurons play a role similar to the intercept term in general linear regression.

Figure 3. The General Anatomy of an Artificial Neural Network

Each neuron in the network has an “activity level” that is defined by the value of the incoming signals received from the other neurons connected to it. In turn, each axon in a network is assigned a “connection weight” that reflects the overall intensity of the signal it transmits (i.e., input to hidden or hidden to output). The activity levels of the input neurons are defined by the values of the predictor variables (Figure 3). The state of each hidden neuron is evaluated locally by calculating the weighted sum of the incoming signals from the neurons of the input layer, which is then subjected to an activation function, i.e., a differentiable function of the neuron’s total incoming signal from all input neurons. The same procedure described above is repeated for the axon signals from the hidden layer to the output layer.
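The weighted-sum-and-activation step just described can be sketched in a few lines. This is an illustrative Python fragment, not code from the paper: the logistic (sigmoid) activation is one common choice of differentiable activation function, and the toy weights are our own.

```python
import math

def forward(x, W_ih, b_h, W_ho, b_o):
    """One forward pass through a single-hidden-layer feedforward network.

    x    : list of p input values (the predictor variables)
    W_ih : one row of input-to-hidden connection weights per hidden neuron
    b_h  : bias weights for the hidden neurons
    W_ho : hidden-to-output connection weights (single output neuron)
    b_o  : bias weight for the output neuron
    """
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    # each hidden neuron: weighted sum of incoming signals, then activation
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_ih, b_h)]
    # the same procedure for the hidden-to-output axons
    return sigmoid(sum(w * hi for w, hi in zip(W_ho, h)) + b_o)

y = forward([0.2, 0.7], [[0.5, -0.3], [0.1, 0.8]], [0.0, 0.0], [1.0, -1.0], 0.0)
print(y)   # the sigmoid squashes the output into the interval (0, 1)
```

Note how the bias weights b_h and b_o enter each weighted sum as constants, playing the same role as the intercept in a linear regression.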

Training the neural network typically involves an error back-propagation algorithm that searches for an optimal set of connection weights that produces an output signal with a small error relative to the observed output (i.e., minimizing the fitting criterion). For continuous output variables, the most commonly used criterion is the least-squares error function, whereas for dichotomous output variables, the most commonly used criterion is the cross-entropy error function, which is similar to log-likelihood (Bishop 1995). The algorithm adjusts the connection weights in a backwards fashion, layer by layer, in the direction of steepest descent, thus minimizing the error function (this is also called gradient descent). The training of the network is a recursive process where observations from the training data are entered into the network in turn, each time modifying the input-hidden and hidden-output connection weights. This procedure is repeated with the entire training dataset (i.e., each of the n observations) for a number of iterations or epochs until a stopping rule (e.g., a target error rate) is achieved. Prior to training the network, the independent variables should be standardized (e.g., converted to z-scores or rescaled to a common range such as 0 to 1) in order to equalize the measurement scales of the inputs into the network.
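The recursive weight-update loop can be made concrete with a minimal pure-Python sketch, assuming a logistic activation and the least-squares criterion. The function names, toy data, and parameter values here are our own illustrative choices, not the paper's; real applications would use dedicated software (see the software section below).

```python
import math
import random

def train(data, n_hidden=3, epochs=5000, lr=1.0, seed=1):
    """Sketch of back-propagation (gradient descent) for one hidden layer.

    data: list of (inputs, target) pairs, targets scaled to [0, 1].
    """
    random.seed(seed)
    p = len(data[0][0])
    # small random initial connection weights (the case study uses -0.3 to 0.3)
    W_ih = [[random.uniform(-0.3, 0.3) for _ in range(p)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    W_ho = [random.uniform(-0.3, 0.3) for _ in range(n_hidden)]
    b_o = 0.0
    sig = lambda s: 1.0 / (1.0 + math.exp(-s))
    for _ in range(epochs):
        for x, t in data:                 # one epoch = every observation in turn
            h = [sig(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(W_ih, b_h)]
            y = sig(sum(w * hi for w, hi in zip(W_ho, h)) + b_o)
            d_o = (y - t) * y * (1 - y)   # output error signal (least squares)
            # propagate the error signal backwards to the hidden layer
            d_h = [d_o * W_ho[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
            # adjust all weights in the direction of steepest descent
            for j in range(n_hidden):
                W_ho[j] -= lr * d_o * h[j]
                b_h[j] -= lr * d_h[j]
                for i in range(p):
                    W_ih[j][i] -= lr * d_h[j] * x[i]
            b_o -= lr * d_o
    return W_ih, b_h, W_ho, b_o

def predict(x, net):
    W_ih, b_h, W_ho, b_o = net
    sig = lambda s: 1.0 / (1.0 + math.exp(-s))
    h = [sig(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_ih, b_h)]
    return sig(sum(w * hi for w, hi in zip(W_ho, h)) + b_o)

data = [([0.0], 0.0), ([1.0], 1.0)]   # a trivially separable toy problem
net = train(data)
print(predict([0.0], net), predict([1.0], net))
```

After enough epochs the two predictions diverge toward their respective targets, illustrating how the repeated weight adjustments minimize the fitting criterion.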

Recent efforts have focused on the development of methods for understanding the explanatory contributions of the independent variables in ANNs (Olden and Jackson 2002b). This was, in part, prompted by the fact that neural networks were considered a “black box” approach to modeling ecological data because of the perceived difficulty in understanding their inner workings. Recent studies in the biological sciences have provided a variety of methods for quantifying and interpreting the contributions of the independent variables in neural networks (see Olden and Jackson 2002b; Gevrey et al. 2003; Olden et al. 2004). These approaches utilize the fact that the connection weights between neurons are the linkages between the inputs and the output of the neural network, and, therefore, the relative contribution of each independent variable depends on the magnitude and direction of these connection weights. Input variables with larger connection weights represent greater intensities of signal transfer; they are more important in predicting the output compared to variables with smaller weights. Negative connection weights reduce the intensity or contribution of the incoming signal and negatively affect the output, whereas positive connection weights increase the intensity of the incoming signal and positively affect the output. One method, the connection weight approach, uses the product of the input-hidden and hidden-output connection weights to determine variable importance (Olden et al. 2004). Other approaches include Garson’s algorithm (Garson 1991), partial derivatives (Dimopoulos et al. 1995), a sensitivity analysis to determine the spectrum of input variable contributions in the neural network (Lek et al. 1996), and a number of pruning algorithms (Bishop 1995), including a randomization test to remove small connection weights (Olden and Jackson 2002b). Although these approaches can determine the overall influence of each independent variable, the interpretation of interactions among the variables requires the direct examination of the network connection weights (e.g., Ozesmi and Ozesmi 1999; Olden and Jackson 2001).
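The connection weight approach of Olden et al. (2004) reduces to a simple sum of products. The sketch below is our own illustration of that calculation, with made-up weights; it is not taken from the paper.

```python
def connection_weight_importance(W_ih, W_ho):
    """Connection weight approach to variable importance (after Olden et al. 2004).

    For each input variable, sum over the hidden neurons the product of its
    input-to-hidden weight and that neuron's hidden-to-output weight.  The
    sign gives the direction of the effect on the output; the magnitude
    gives the relative importance of the variable.
    """
    p = len(W_ih[0])
    return [sum(W_ih[j][i] * W_ho[j] for j in range(len(W_ih)))
            for i in range(p)]

# two inputs and two hidden neurons, with illustrative (made-up) weights
W_ih = [[0.8, -0.4],    # hidden neuron 1
        [0.3,  0.9]]    # hidden neuron 2
W_ho = [1.0, -0.5]      # hidden-to-output weights
print(connection_weight_importance(W_ih, W_ho))
# first input: net positive effect; second input: net negative effect
```

Note how a positive input-hidden weight paired with a negative hidden-output weight contributes negatively to the total, matching the sign logic described in the text.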

strengths and weaknesses

There are both advantages and disadvantages to neural networks, and, to discuss this subject properly, we would have to look at each individual type of network. In reference to a back-propagation feedforward approach, ANNs have a number of advantages over traditional parametric approaches, including: (1) the ability to model nonlinear associations; (2) no requirement of specific assumptions concerning the distributional characteristics of the independent variables (i.e., nonparametric); and (3) the accommodation of variable interactions without a priori specification (Table 1). ANNs also provide a much more flexible way of modeling ecological data. Model complexity can be varied by altering the transfer function or the inner architecture of the network, either through an increase in the number of hidden neurons or layers to enhance data fitting, or by increasing the number of output neurons to model multiple ecological response variables, such as multiple species (e.g., Ozesmi and Ozesmi 1999) or entire communities (e.g., Olden 2003; Olden et al. 2006a). It is this flexibility that has likely led to the increased popularity of neural networks in ecology.

Yet ANNs are not without limitations, and there are some specific issues potential users should be aware of. First, neural network models are more complicated to implement, mainly because the optimization of the network architecture is iterative and, thus, time consuming. However, this process can be automated and relies on computer time rather than human time. Second, the performance of a network can be sensitive to the random initial connection weights assigned to the network prior to training. To overcome this, it is recommended that multiple networks based on different initial connection weights be constructed and that a final network deemed representative of the population of models be selected. Third, although ANNs are no longer simplistically viewed as a black box approach to modeling data (Olden and Jackson 2002b), the model-building process is still far less transparent than that of more traditional methods, and the ability to explore direct and interactive variable contributions with ANNs will always be more complicated (Table 1).

software

Until recently, the ability of researchers to use neural networks was limited to those with computer programming experience. This is no longer the case. A number of Windows-based programs, modules for commonly used software packages, and libraries for different programming languages are now available. Windows-based programs include BrainMaker (www.calsci.com), EasyNN-plus (www.easynn.com), NeuroSolutions (www.neurosolutions.com), NeuralWare (www.neuralware.com), and SNNS (http://www-ra.informatik.uni-tuebingen.de/SNNS/). Modules and libraries for statistical software packages include Neural Connection for SPSS (www.spss.com), the Neural Network module for Statistica (www.statsoft.com), NeuroSolutions for Excel (www.neurosolutions.com), NeuroXL Classifier for Excel (www.neuroxl.com), Enterprise Miner for SAS (www.sas.com), NeuroSolutions for MatLab (www.neurosolutions.com), the Neural Network library for S-Plus (http://lib.stat.cmu.edu/S/), and feedforward neural network packages for R (http://cran.r-project.org).

case study

ANN methodology begins with a fully connected network composed of axons with completely random connection weights that link the environmental variables (represented by input neurons) to species richness (represented by the output neuron) via the hidden layer of neurons. In this case study, the back-propagation algorithm modified the connection weights in an iterative fashion to maximize the match between predicted and observed levels of species richness in the study lakes. As an example, Figure 4 is based on output produced by using the Neural Network module in MatLab (The MathWorks, Inc.) when we set the initial random weights to range between -0.3 and 0.3 and the maximum number of iterations for model convergence to 1000. Optimal network configuration (i.e., the optimal number of neurons in the hidden layer of the network) was determined by comparing the performances of different 10-fold cross-validated networks with 1 to 20 hidden neurons in order to choose the number that produced the greatest network performance (Figure 4A). This resulted in a network with 8 input neurons (8 environmental variables), 5 hidden neurons, and 1 output neuron (i.e., species richness), which exhibited a predictive performance of R = 0.70 between predicted and observed species richness.

The final neural network is presented in Figure 4C. In this figure, the relative magnitudes of the connection weights are represented by line thickness (i.e., thicker lines represent greater weights) and line type represents the direction of the weights (i.e., solid lines represent positive signals and dashed lines represent negative signals). Positive effects of input variables are depicted by positive input-hidden and positive hidden-output connection weights, or negative input-hidden and negative hidden-output connection weights. Negative effects of input variables are depicted by positive input-hidden and negative hidden-output connection weights, or by negative input-hidden and positive hidden-output connection weights. Thus, the multiplication of the two connection weight directions (positive or negative) indicates the effect that each input variable has on the response variable. Interactions among predictor variables can be identified as input variables with opposing connection weights entering the same hidden neuron.

Figure 4. Artificial Neural Network and Associated Results for Predicting Fish Species Richness
Results from the artificial neural network predicting fish species richness as a function of environmental characteristics for 8236 north-temperate lakes in Ontario, Canada. (A) 10-fold cross-validation (solid circles, bars represent +/- 1-SE) and resubstitution (empty circles) relative error for the ANN. The selected network is indicated by the arrow. (B) Relative importance of the environmental variables for predicting fish species richness, where solid and empty bars indicate positive and negative relationships, respectively. (C) The artificial neural network relating fish species richness to lake environmental characteristics. Line thickness is proportional to the magnitude of the axon connection weight, and line type indicates the direction of the interaction between neurons: solid line connections are positive (excitators) and dashed line connections are negative (inhibitors). Only statistically significant connection weights are presented (P < 0.05). (Variable codes are presented in the Figure 2 caption.)

Individual and interacting influences of the environmental variables on network predictions were examined after we removed nonsignificant, small connection weights (based on P > 0.05) using the randomization approach of Olden and Jackson (2002b). The resulting network shows that species richness is positively correlated with mean monthly air temperature via hidden neurons A and B, and negatively correlated with mean monthly precipitation via hidden neurons A and D. The neural network also predicts fish species richness to be greatest in large and deep lakes (neuron C) with high shoreline perimeters (neurons B, C, and D). Focusing on hidden neuron B, we see that shoreline perimeter and lake elevation interact such that the positive influence of shoreline perimeter (reflecting greater littoral zone availability) on species richness (i.e., two negative connection weights) decreases with increasing lake elevation. Figure 4C illustrates the utility of eliminating those connection weights that are small and, therefore, contribute little to network predictions. In this case, we removed 23 of the 45 possible connection weights. Summing across all connection weights illustrates that shoreline perimeter (positive), lake area (positive), and mean monthly precipitation (negative) exhibited the strongest influences on predictions of species richness from the neural network (Figure 4B).

Evolutionary Computation: Genetic Algorithms and Genetic Programming

background and ecological applications

Evolutionary computation (EC) includes a number of machine learning approaches that can be classified as stochastic optimization tools. In general, these techniques use an aspect of randomization to search for global model optima. More specifically, EC is based on the process of evolution in natural systems and was inspired by a direct analogy to sexual reproduction and Charles Darwin’s principle of natural selection (Holland 1975; Goldberg 1989). EC approaches include simulated annealing, evolutionary programming, evolutionary strategies, genetic algorithms, and genetic programming. Because they have been more frequently used in ecological studies, we have confined our discussion here to genetic algorithms (GAs) and genetic programming (GP). In a strict interpretation, GAs refer to the general-purpose search algorithms introduced by Holland (1975), which create population-based models that use selection and recombination operators to generate new sample points in a search space (Mitchell 1998). In contrast, GP refers to solutions to the problem in question that take the form of modular computer programs. Running each GP provides a solution to the problem (Koza 1992), whereas, in GAs, the solution is represented by fixed-length character strings (Mitchell 1998). These strings are then interpreted by one or more functions to produce solutions. We will discuss the mechanics of the two approaches in more detail in the following sections.

Ecological applications of GAs and GP have been more limited than those of most ML techniques, but they are growing in popularity in the natural sciences. D’Angelo et al. (1995) used GAs to model the distribution of cutthroat and rainbow trout as a function of stream habitat characteristics in the Pacific Northwest of the United States, and they were also applied by Termansen et al. (2006) to model plant species distributions as a function of both climate and land use variables. Genetic programming was used by McKay (2001) to develop spatial models for marsupial density, by Chen et al. (2000) to analyze fish stock-recruitment relationships, and by Muttil and Lee (2005) to model nuisance algal blooms in coastal ecosystems. Ecology’s recent interest in EC has been driven, in large part, by the introduction of the Genetic Algorithm for Rule-Set Prediction (GARP) for predicting species distributions (Stockwell and Noble 1992). GARP uses several rule-building methods to build heterogeneous rule sets that relate the ecological niche of a species to environmental data (Stockwell and Peters 1999). The resulting niche model is then projected back onto the landscape to generate a prediction of potential distribution. To date, GARP has been used to model the spatial distribution of numerous species, including the habitat suitability of threatened species (e.g., Anderson and Martínez-Meyer 2004) and invasive species (e.g., Peterson and Vieglais 2001; Peterson 2003; Drake and Lodge 2006), as well as the geography of disease transmission (Peterson 2001). Further advances in ecological niche modeling using EC approaches continue to be developed, such as the WhyWhere algorithm advocated by Stockwell (2006) (but see Peterson 2007). Broader applications of EC methods include their use in conservation planning for biodiversity (Sarkar et al. 2006).

methodology

The general anatomies of a genetic algorithm and a genetic program are presented in Figure 5A. In principle, GAs and GP operate on populations of competing solutions to a problem that evolve over time to converge to an optimal solution (Holland 1975). The solutions are loosely represented as “chromosomes” composed of component “genes.” Both approaches involve four steps. First, random potential solutions (chromosomes) to the problem are developed. Second, the potential solutions are altered using the processes of reproduction, mutation, and crossover. Third, the new solutions are evaluated to determine their fitness (i.e., how well they solve the problem). Fourth, the most fit or best solutions are selected. Steps two through four, which can be seen as constituting a “generation of solutions,” are then repeated using the solutions selected in step four until a stopping criterion is reached. In this way, solutions to a problem evolve through the multiple iterations or generations of the modeling process (Haefner 2005).

Although the principles behind GAs and GP are similar, as mentioned above, the structures of the models are fairly different. In GAs, the components of a solution (i.e., model parameters) are represented as genes on single-vector chromosomes (Figure 5B). The chromosomes change such that solutions evolve over subsequent generations in the model. In each generation, there are two opportunities for chromosomal change: recombination via crossover and random mutation. Recombination mimics sexual recombination, in which genetic material from parents is combined to produce related, but different, offspring. First, chromosomes are paired up, often at random. Next, a point along the length of the chromosomes is selected, again at random, and portions of the paired chromosomes are exchanged in an event called crossover (Mitchell 1998). The resulting new chromosomes are regarded as offspring. Further change can occur through mutation events. In each generation of chromosomes, a mutation can occur at any individual bit on a chromosome, changing a value from 1 to 0 or vice versa. Mutation acts as an insurance policy against the permanent loss of any single gene, and it enhances the ability of the genetic algorithm to find the optimal solution.
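The two chromosomal-change operators just described, single-point crossover and bit-flip mutation, can each be written in one or two lines. The Python sketch below is our own illustration (function names and the 5 percent mutation rate are arbitrary choices, not values from the paper).

```python
import random

def crossover(mom, dad):
    """Single-point crossover: swap the chromosome tails beyond a random point."""
    point = random.randrange(1, len(mom))            # random crossover point
    return mom[:point] + dad[point:], dad[:point] + mom[point:]

def mutate(chrom, rate=0.05):
    """Bit-flip mutation: each gene flips 1 <-> 0 with probability `rate`."""
    return [1 - g if random.random() < rate else g for g in chrom]

random.seed(0)
a, b = crossover([1, 1, 1, 1, 1], [0, 0, 0, 0, 0])
print(a, b)   # complementary offspring sharing a single crossover point
```

Because the example parents are all-ones and all-zeros, every gene of one offspring is the complement of the other, which makes the single exchange point easy to see.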

Recombination and mutation act to create new generations of different solutions. Although we have provided a basic description of these two events, their implementation can vary (Mitchell 1998; Haefner 2005). For example, recombination can be a mandatory event for all chromosomes and, thus, all parent chromosomes can be replaced by their offspring, or some proportion of the population can reproduce and be replaced while the rest of the chromosomes remain unchanged. In addition, other events can be used to alter the chromosomes. For example, inversions can be used to reorder portions of a chromosome (Holland 1975). The rates at which mutations and crossover events occur can often be set by the modeler.

Figure 5. The General Anatomy of a Genetic Algorithm and a Genetic Program
(A) Schematic illustrating the process of evolutionary computation. (B) Illustration of a genetic algorithm (GA) where each individual (4 in total) is represented by a chromosome containing a binary string of 5 genes (parameter 0 or 1). F indicates the fitness of each individual. Crossovers occur between underlined sections of the chromosome, and the boxed value represents a mutation event. (C) Illustration of a genetic program (GP) where each individual is represented by a tree-like chromosome of genes representing values and operators. Independent variables 1 and 2 are represented by v1 and v2, respectively. Dashed lines indicate crossover points in the parents, and dashed circles represent mutation events. The underlined statements represent the corresponding regression equations for each individual in the population at t = n+1 (i.e., after crossover and mutation events).

After recombination and mutation are completed, the fitness of each chromosome is determined using a fitness function. The function takes the genes of the chromosomes as inputs and produces a value or condition that is then compared to an optimal condition (Holland 1975). The “distance” between the optimal condition and that produced by the chromosome’s solution represents its fitness or, put another way, its prediction error. A selection event then takes place in which the chromosomes with higher fitness are selected to comprise the next generation.

As a simple example, consider the problem of selecting the parameters for a regression equation with five variables. The chromosomes would each have five genes corresponding to the five parameters. These five genes are represented on the chromosomes as bit strings. For each generation in the model, recombination and mutation events would produce new sets of chromosomes with different parameters for the regression equation. These new sets of parameters could be evaluated by plugging them into a predetermined regression equation and calculating the mean squared error using the data for which the regression was being developed. Chromosomes that produced lower mean squared errors would be considered more fit than chromosomes with higher mean squared errors.
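The four-step cycle can be assembled into a small working algorithm. In the hedged Python sketch below, fitness counts the genes that match a hidden target bit string, a simple stand-in for the regression mean-squared-error criterion described above (a higher match plays the role of a lower error). The population size, rates, elitist truncation selection, and all names are our own illustrative choices.

```python
import random

def run_ga(target, pop_size=30, generations=100, mut_rate=0.05, seed=2):
    """Minimal genetic algorithm illustrating the four steps in the text.

    Fitness counts genes matching a hidden target string; a perfect
    chromosome scores len(target).
    """
    random.seed(seed)
    n = len(target)
    fitness = lambda c: sum(g == t for g, t in zip(c, target))
    # Step 1: develop a random initial population of chromosomes
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: alter the solutions by crossover and mutation
        offspring = []
        while len(offspring) < pop_size:
            mom, dad = random.sample(pop, 2)
            point = random.randrange(1, n)               # single crossover point
            for child in (mom[:point] + dad[point:], dad[:point] + mom[point:]):
                offspring.append([1 - g if random.random() < mut_rate else g
                                  for g in child])       # bit-flip mutation
        # Steps 3 and 4: evaluate fitness and select the fittest to survive
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
best = run_ga(target)
print(best)   # after enough generations, the population converges on the target
```

Keeping the parents in the selection pool (an "elitist" choice) guarantees that the best solution found so far is never lost between generations; as noted in the text, many other replacement schemes are possible.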

The chromosomes in GP are modular computer programs in which each module can be seen as a gene (Koza 1992). These chromosomes have a tree-like structure that is in turn interpreted as an equation or a list of commands (Figure 5C). Each node in the tree has a functional value that requires additional arguments, and each terminal branch has a terminal value or command that does not require additional arguments. As an example, again consider a regression equation. The functional values can be addition, subtraction, or division, and terminal values can be numeric (parameters in the regression model) or variables (explanatory variables). Figure 5C shows the tree-like structure of such a chromosome (program) and the corresponding regression equation. As in GAs, a population of chromosomes is evolved through generations of recombination and mutation events. Mutations can occur at any functional value or any terminal value. Crossover breaks can occur at any node in the tree.
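The distinction between functional nodes and terminal values becomes clearer with a tiny interpreter. This Python sketch is our own illustration: a chromosome is a nested tuple, terminals are variable names (v1 and v2, as in Figure 5C) or constants, and the operator set is an arbitrary subset of those the text mentions.

```python
import operator

# functional values a node may carry (an illustrative subset)
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def evaluate(node, env):
    """Recursively evaluate a GP expression tree against variable bindings."""
    if isinstance(node, tuple):                     # functional node: (op, l, r)
        op, left, right = node
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(node, str):                       # terminal: a variable name
        return env[node]
    return node                                     # terminal: a numeric constant

# the regression equation 2.0 + 0.5*v1 - 1.5*v2 written as a tree
tree = ('-', ('+', 2.0, ('*', 0.5, 'v1')), ('*', 1.5, 'v2'))
print(evaluate(tree, {'v1': 4.0, 'v2': 1.0}))   # 2.0 + 0.5*4 - 1.5*1 = 2.5
```

Crossover in GP would swap whole subtrees between two such structures, and mutation would replace a single operator, constant, or variable name, which is why solution length is free to grow in a way fixed-length GA chromosomes cannot.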

In summary, during successive generations in both GAs and GP, the initial population of chromosomes (i.e., strings of problem solutions in GAs or modular programs in GP) advances toward a fitter population through reproduction among members of the previous generation. Selection of the fittest chromosomes ensures that only the best chromosomes cross over or mutate, thus advancing the opportunity to find the best solution to the problem. We refer the reader to Haefner (2005) for an overview of evolutionary computation approaches, to Goldberg (1989) and Mitchell (1998) for introductory texts on the subject, and to Holland (1975) and Koza (1992) for more advanced treatments of GAs and GP, respectively.

strengths and weaknesses

Evolutionary computation has a number of advantages over traditional ecological modeling approaches. First, because EC approaches perform extensive optimization searches, they are able to model nonlinear data and also may be appropriate for situations involving uneven sampling and small sample sizes. This optimal searching amounts to a bottom-up inductive analysis as opposed to the top-down deductive approach of most traditional predictive ecological modeling techniques. A second and related advantage of EC is that it provides ecologists with the ability to model complex relationships using a broad range of model structures and model-fitting approaches (Table 1). Theoretically, this means that one can have a large amount of control over the design of the models and the execution of the algorithm; however, the ability to exert this flexibility is often limited by the available software. The GARP program illustrates the flexibility of EC approaches well. As mentioned above, GARP uses a set of rules to define species distributions. This rule set may be composed of many different functions or relationships (e.g., logistic functions, Boolean operators), which are assembled in the modeling process. Although the GARP algorithm has a fixed set of rules, in designing a genetic algorithm or genetic program to address a given problem, one can design one’s own set of potential functions or relationships. A third advantage of EC approaches is that they were designed as stochastic optimization tools with a relatively broad application in mind.

Nonetheless, EC approaches are not the best techniques for all problems. First, many readily available statistical techniques perform as well as, if not better than, GAs at developing regression and classification systems. GARP, in particular, has been shown to overpredict species distributions relative to other modeling approaches (Elith et al. 2006; Lawler et al. 2006). Second, there is little theory available to explain GP and little guidance for selecting model parameters. Therefore, the onus, more so than with most statistical approaches, is on the researcher to develop the potential range of model structures. Developing more complex models requires more work on the part of the modeler (Table 1).

There are also a few specific limitations of the two approaches discussed here that are worth mentioning. One specific limitation of genetic algorithms is that they use fixed-length chromosomes, which can limit the potential range of solutions. In our example of the regression equation with five parameters, we were limited in that we had to set the number of parameters in advance. Thus, we were unable to evolve a solution with, for instance, seven or eight parameters. Genetic programming does not have this limitation; the length of a solution is limited only by available computer memory or by the software used to do the modeling. Genetic programming, however, also has a distinct drawback. The solutions obtained from genetic programs are often too complex to interpret easily because they can be long strings of parameters or equations with complicated operators (we will revisit this problem in the case study to follow). Even with some of the more well-developed software, deciphering the computer code that produces the models or understanding the significance of the complex relationships that the model defines can be difficult (Table 1).
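The fixed-length limitation can be made concrete with a minimal genetic algorithm sketch (the data, fitness function, selection scheme, and rates below are invented for illustration and are not how Discipulus or GARP operate). The chromosome is a list of exactly five coefficients, so no mutation or crossover can ever produce a six-parameter model:

```python
import random

random.seed(1)

# Hypothetical five-parameter regression: y = a*x1 + b*x2 + c*x3 + d*x4 + e.
TRUE_PARAMS = [2.0, -3.0, 0.5, 1.0, 4.0]
DATA = []
for _ in range(50):
    x = [random.uniform(-1, 1) for _ in range(4)]
    y = sum(p * v for p, v in zip(TRUE_PARAMS, x)) + TRUE_PARAMS[4]
    DATA.append((x, y))

def fitness(chrom):
    """Negative mean squared error: larger is better."""
    sse = 0.0
    for x, y in DATA:
        pred = sum(c * v for c, v in zip(chrom, x)) + chrom[4]
        sse += (pred - y) ** 2
    return -sse / len(DATA)

def crossover(a, b):
    """Single-point crossover; offspring keep the fixed length of five."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.2, scale=0.5):
    """Perturb each gene with a small probability."""
    return [c + random.gauss(0, scale) if random.random() < rate else c
            for c in chrom]

pop = [[random.uniform(-5, 5) for _ in range(5)] for _ in range(100)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:20]  # truncation selection; keeping parents gives elitism
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(80)]

best = max(pop, key=fitness)
print([round(c, 1) for c in best])  # should approach [2.0, -3.0, 0.5, 1.0, 4.0]
```

A genetic program, by contrast, would evolve the structure of the expression itself, so solution length is not fixed in advance.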

software

Although much of the available evolutionary computation software requires some computer programming skill, a number of more user-friendly tools have been developed. Discipulus, which is used in our case study, is available for Windows platforms (http://www.rmltech.com/). The GARP program for Windows is available at http://www.nhm.ku.edu/desktopgarp/, ECJ is an "evolutionary computation research system" developed at George Mason University (http://cs.gmu.edu/~eclab/projects/ecj/), and Groovy Java Genetic Programming (JGProg) is available from http://jgprog.sourceforge.net/. At least two packages, gafit and genalg, are available for R (http://cran.r-project.org). And finally, there are a number of repositories of computer code for evolutionary computation (e.g., http://www.geneticprogramming.com).

case study

We used genetic programming to model species richness in the north-temperate lakes dataset. The modeling was done with the program Discipulus, which is easy-to-use software designed to efficiently apply genetic programming to regression and classification problems. Discipulus uses both mutations and crossover events within a population of computer programs to evolve solutions. We used the program's default settings, consisting of a population of 500 programs, a mutation rate of 95%, and a crossover frequency of 50%. Each program consists of modules containing both variables and operators. Discipulus allows the use of 11 different types of operators, including addition, multiplication, subtraction, division, trigonometric, and Boolean operators. We used a total of 23 different potential operators to model fish species richness. To build and evaluate the models, we randomly divided the lakes dataset into three roughly equal parts corresponding to training, validation, and applied datasets. The training and validation datasets were used in model building, and the applied dataset was used to provide a semi-independent assessment of model performance.
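The three-way split described above can be sketched as follows (a schematic only: the predictor values are simulated stand-ins, and any learner could be plugged in where the genetic program was used):

```python
import random

random.seed(42)

# Simulated stand-ins for the 8236 lake records: (predictors, species richness)
lakes = [([random.random() for _ in range(4)], random.randint(1, 30))
         for _ in range(8236)]

# Shuffle, then cut into three roughly equal parts
random.shuffle(lakes)
third = len(lakes) // 3
training   = lakes[:third]           # used to fit candidate models
validation = lakes[third:2 * third]  # used to steer model building
applied    = lakes[2 * third:]       # held back for semi-independent assessment

print(len(training), len(validation), len(applied))  # -> 2745 2745 2746
```

Because the applied set never touches model building, performance measured on it is less optimistic than training-set fit, though "semi-independent" is apt: all three parts still come from the same sampled lakes.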

The best performing model converged after 9000 generations of the algorithm (Figure 6A) and had an R² of 0.69 between predicted and observed species richness for the semi-independent applied dataset. The two most influential variables in the top 30 models were lake area and shoreline perimeter, followed by mean monthly precipitation and air temperature (Figure 6B). As discussed previously, one of the disadvantages of using genetic programs is that the models they produce are often large and difficult to interpret. The model produced in our case study was no exception. For that reason, we have not provided any details about the structure of the GP model.
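A ranking of the kind shown in Figure 6B can be reproduced with a simple computation: average each variable's importance over the top models, then rescale relative to the largest average. The scores below are invented for illustration, and Discipulus reports its own importance measures, so treat this purely as a schematic:

```python
# Invented integer importance scores for three top models (the study used 30)
top_models = [
    {"lake_area": 9, "shoreline": 8, "precip": 3, "air_temp": 2},
    {"lake_area": 7, "shoreline": 9, "precip": 4, "air_temp": 3},
    {"lake_area": 8, "shoreline": 6, "precip": 2, "air_temp": 4},
]

# Average each variable across models, then scale so the top variable is 1.0
variables = top_models[0].keys()
average = {v: sum(m[v] for m in top_models) / len(top_models) for v in variables}
peak = max(average.values())
relative = {v: round(average[v] / peak, 2) for v in average}
print(relative)
# -> {'lake_area': 1.0, 'shoreline': 0.96, 'precip': 0.38, 'air_temp': 0.38}
```

Averaging over many good models, rather than inspecting a single winner, guards against reading too much into one idiosyncratic program.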

Conclusion

Machine learning approaches can facilitate greater understanding and prediction in the ecological sciences. The purpose of this review is to broaden the exposure of ecologists to ML by illustrating how CART, ANN, and EC can be used to address complex problems. In doing so, we hope to have introduced ecologists to a set of viable alternatives to more traditional statistical approaches. Although the application of ML approaches in ecology has increased in recent years, the growth has been relatively modest compared to other disciplines, and there remains a good degree of skepticism with respect to their role in quantitative analyses. Currently, the majority of ecologists lack the computational background needed to operate the software that implements these approaches (Fielding 1999), and, as a result, many ecologists may be hesitant to invest their time in learning extensive program code language and syntax. With the increasing popularity of these approaches, however, more user-friendly software is rapidly being developed. Such software (examples of which are listed throughout this review) will increase ML usage and awareness among ecologists and

Figure 6. Genetic Program and Associated Results for Predicting Fish Species Richness
Results from the genetic program predicting fish species richness as a function of environmental characteristics for 8236 north-temperate lakes in Ontario, Canada. (A) Cross-validation relative error for the GP according to mean-squared error. (B) Average relative importance of the environmental variables for predicting fish species richness based on the top 30 GPs. (Variable codes are presented in the Figure 2 caption.)



promote the advancement of these analytical methods.

Machine learning methods are powerful tools for prediction and explanation, and they will enhance our ability to model ecological systems. They are not, however, a solution to all ecological modeling problems. No one ML approach will be best suited to addressing all problems, nor will ML approaches always be preferable to traditional statistical approaches. Although ML methods are generally more flexible with respect to modeling complex relationships and messy datasets, the models they produce are often more difficult to interpret, and the modeling process itself is often far from transparent.

Although ML technologies strengthen our ability to model ecological phenomena, advances in understanding the fundamental processes underlying those phenomena are clearly critical as well. Some argue that ML methods attempt to eliminate the need for ecological intuition during the data analysis process; we strongly disagree. Human intuition cannot be entirely eliminated because the analyst must specify how the data are to be represented and what mechanisms will be used to search for a characterization of the problem. In this respect, ML should be viewed as an attempt to automate parts of the modeling process, not replace it (Olden et al. 2006b). We hope our review of machine learning methods will allow more ecologists to add these tools to their repertoire of quantitative expertise, and that it will provide them with a basis for making informed decisions about applying machine learning approaches or more traditional statistical approaches in their future modeling endeavors.

acknowledgments

We thank the students of the Ecological Informatics class at Colorado State University for motivating us to write this paper. Jodi Whittier and two anonymous referees provided insightful comments on the manuscript. This research was supported in part by David H. Smith Conservation Postdoctoral Scholarships to J. D. Olden and J. J. Lawler. J. D. Olden conceived and developed the idea for the manuscript, J. D. Olden and J. J. Lawler conducted the data analysis, and J. D. Olden, J. J. Lawler, and N. L. Poff wrote the manuscript.

REFERENCES

Anderson R. P., Martínez-Meyer E. 2004. Modeling species' geographic distributions for preliminary conservation assessments: an implementation with the spiny pocket mice (Heteromys) of Ecuador. Biological Conservation 116(2):167–179.

Anderson M. C., Watts J. M., Freilich J. E., Yool S. R., Wakefield G. I., McCauley J. F., Fahnestock P. B. 2000. Regression-tree modeling of desert tortoise habitat in the central Mojave desert. Ecological Applications 10(3):890–900.

Austin M. 2007. Species distribution models and ecological theory: a critical assessment and some possible new approaches. Ecological Modelling 200(1–2):1–19.

Bell J. F. 1999. Tree-based methods. Pages 89–105 in Machine Learning Methods for Ecological Applications, edited by A. H. Fielding. Boston (MA): Kluwer Academic.

Bishop C. M. 1995. Neural Networks for Pattern Recognition. Oxford (UK): Clarendon Press.

Breiman L., Friedman J., Olshen R. A., Stone C. J. 1984. Classification and Regression Trees. Belmont (CA): Wadsworth International Group.

Brosse S., Lek S., Townsend C. R. 2001. Abundance, diversity, and structure of freshwater invertebrates and fish communities: an artificial neural network approach. New Zealand Journal of Marine and Freshwater Research 35(1):135–145.

Chen D. G., Hargreaves N. B., Ware D. M., Liu Y. 2000. A fuzzy logic model with genetic algorithm for analyzing fish stock-recruitment relationships. Canadian Journal of Fisheries and Aquatic Sciences 57(9):1878–1887.

Cho E., Chon T.-S. 2006. Application of wavelet analysis to ecological data. Ecological Informatics 1(3):229–233.

Clark J. S., Carpenter S. R., Barber M., Collins S., Dobson A., Foley J. A., Lodge D. M., Pascual M., Pielke R., Jr., Pizer W., Pringle C., Reid W. V., Rose K. A., Sala O., Schlesinger W. H., Wall D. H., Wear D. 2001. Ecological forecasts: an emerging imperative. Science 293:657–660.

Cornuet J. M., Aulagnier S., Lek S., Franck P., Solignac M. 1996. Classifying individuals among infraspecific taxa using microsatellite data and neural networks. Comptes rendus de l'Académie des sciences, Série III, Sciences de la vie 319(12):1167–1177.

Cushing J. B., Wilson T. 2005. Eco-informatics for decision makers advancing a research agenda. Pages 325–334 in Data Integration in the Life Sciences: Second International Workshop, DILS 2005, San Diego, CA, USA, July 20–22, 2005, Proceedings, Lecture Notes in Computer Science, Volume 3615, edited by B. Ludäscher and L. Raschid. Berlin (Germany): Springer-Verlag.

Cutler D. R., Edwards T. C., Jr., Beard K. H., Cutler A., Hess K. T., Gibson J., Lawler J. J. 2007. Random forests for prediction in ecology. Ecology 88(11):2783–2792.

D'Angelo D. J., Howard L. M., Meyer J. L., Gregory S. V., Ashkenas L. R. 1995. Ecological uses for genetic algorithms: predicting fish distributions in complex physical habitats. Canadian Journal of Fisheries and Aquatic Sciences 52:1893–1908.

De'ath G. 2007. Boosted trees for ecological modeling and prediction. Ecology 88(1):243–251.

De'ath G., Fabricius K. E. 2000. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81(11):3178–3192.

Dimopoulos Y., Bourret P., Lek S. 1995. Use of some sensitivity criteria for choosing networks with good generalization. Neural Processing Letters 2:1–4.

Drake J. M., Lodge D. M. 2006. Forecasting potential distributions of nonindigenous species with a genetic algorithm. Fisheries 31:9–16.

Drake J. M., Randin C., Guisan A. 2006. Modelling ecological niches with support vector machines. Journal of Applied Ecology 43(3):424–432.

Elith J., Graham C. H., Anderson R. P., Dudík M., Ferrier S., Guisan A., Hijmans R. J., et al. 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography 29(2):129–151.

Elith J., Leathwick J. 2007. Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines. Diversity and Distributions 13(3):265–275.

Ferrier S., Guisan A. 2006. Spatial modelling of biodiversity at the community level. Journal of Applied Ecology 43(3):393–404.

Fielding A. H., editor. 1999. Machine Learning Methods for Ecological Applications. Boston (MA): Kluwer Academic Publishers.

Fielding A. H., Bell J. F. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24(1):38–49.

Garson G. D. 1991. Interpreting neural-network connection weights. Artificial Intelligence Expert 6(4):46–51.

Geman S., Bienenstock E., Doursat R. 1992. Neural networks and the bias/variance dilemma. Neural Computation 4(1):1–58.

Gevrey M., Dimopoulos I., Lek S. 2003. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling 160(3):249–264.

Goldberg D. E. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading (MA): Addison-Wesley.

Green J. L., Hastings A., Arzberger P., Ayala F. J., Cottingham K. L., Cuddington K., Davis F., Dunne J. A., Fortin M.-J., Gerber L., Neubert M. 2005. Complexity in ecology and conservation: mathematical, statistical, and computational challenges. BioScience 55(6):501–510.

Guégan J.-F., Lek S., Oberdorff T. 1998. Energy availability and habitat heterogeneity predict global riverine fish diversity. Nature 391:382–384.

Guisan A., Zimmermann N. E. 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135(2–3):147–186.

Haefner J. W. 2005. Modeling Biological Systems: Principles and Applications. Second Edition. New York: Springer.

Hastie T., Tibshirani R., Friedman J. H. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

Hogeweg P. 1988. Cellular automata as a paradigm for ecological modeling. Applied Mathematics and Computation 27(1):81–100.

Holland J. H. 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Ann Arbor (MI): University of Michigan Press.

Hopfield J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA 79(8):2554–2558.

Hornik K., Stinchcombe M., White H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.

Iverson L. R., Prasad A. M. 1998. Predicting abundance of 80 tree species following climate change in the Eastern United States. Ecological Monographs 68(4):465–485.

Iverson L. R., Prasad A. M. 2007. Using landscape analysis to assess and model tsunami damage in Aceh province, Sumatra. Landscape Ecology 22(3):323–331.

Kohonen T. 2001. Self-Organizing Maps. Berlin (Germany) and New York: Springer-Verlag.

Koza J. R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge (MA): MIT Press.

Lamon E. C., III, Stow C. A. 1999. Sources of variability in microcontaminant data for Lake Michigan salmonids: statistical models and implications for trend detection. Canadian Journal of Fisheries and Aquatic Sciences 56(Supplement):71–85.

Lawler J. J., White D., Neilson R. P., Blaustein A. R. 2006. Predicting climate-induced range shifts: model differences and model reliability. Global Change Biology 12(8):1568–1584.

Lek S., Delacoste M., Baran P., Dimopoulos I., Lauga J., Aulagnier S. 1996. Application of neural networks to modelling nonlinear relationships in ecology. Ecological Modelling 90(1):39–52.

Lek S., Guégan J.-F. 2000. Artificial Neuronal Networks: Application to Ecology and Evolution. Berlin (Germany): Springer-Verlag.

Levin S. A. 1998. Ecosystems and the biosphere as complex adaptive systems. Ecosystems 1(5):431–436.

Maier H. R., Dandy G. C. 2000. Neural networks for the prediction and forecasting of water resource variables: a review of modelling issues and applications. Environmental Modelling and Software 15(1):101–124.

Mastrorillo S., Lek S., Dauba F., Belaud A. 1997. The use of artificial neural networks to predict the presence of small-bodied fish in a river. Freshwater Biology 38(2):237–246.

McCulloch W. S., Pitts W. 1943. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5:115–133.

McKay R. I. 2001. Variants of genetic programming for species distribution modelling—fitness sharing, partial functions, population evaluation. Ecological Modelling 146(1–3):231–241.

McKenna J. E., Jr. 2005. Application of neural networks to prediction of fish diversity and salmonid production in the Lake Ontario basin. Transactions of the American Fisheries Society 134(1):28–43.

Mercado-Silva N., Olden J. D., Maxted J. T., Hrabik T. R., Vander Zanden M. J. 2006. Forecasting the spread of invasive rainbow smelt in the Laurentian Great Lakes region of North America. Conservation Biology 20(6):1740–1749.

Minns C. K. 1989. Factors affecting fish species richness in Ontario lakes. Transactions of the American Fisheries Society 118:533–545.

Mitchell M. 1998. An Introduction to Genetic Algorithms. Cambridge (MA): MIT Press.

Morgan J. N., Sonquist J. A. 1963. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58(302):415–434.

Muttil N., Lee J. H. W. 2005. Genetic programming for analysis and real-time prediction of coastal algal blooms. Ecological Modelling 189(3–4):363–376.

O'Connor R. J., Jones M. T., White D., Hunsaker C., Loveland T., Jones B., Preston E. 1996. Spatial partitioning of environmental correlates of avian biodiversity in the conterminous United States. Biodiversity Letters 3(3):97–110.

Olden J. D. 2003. A species-specific approach to modeling biological communities and its potential for conservation. Conservation Biology 17(3):854–863.

Olden J. D., Jackson D. A. 2001. Fish-habitat relationships in lakes: gaining predictive and explanatory insight by using artificial neural networks. Transactions of the American Fisheries Society 130(5):878–897.

Olden J. D., Jackson D. A. 2002a. A comparison of statistical approaches for modelling fish species distributions. Freshwater Biology 47(10):1976–1995.

Olden J. D., Jackson D. A. 2002b. Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling 154(1–2):135–150.

Olden J. D., Joy M. K., Death R. G. 2004. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling 178(3–4):389–397.

Olden J. D., Joy M. K., Death R. G. 2006a. Rediscovering the species in community-wide modeling. Ecological Applications 16(4):1449–1460.

Olden J. D., Poff N. L., Bledsoe B. P. 2006b. Incorporating ecological knowledge into ecoinformatics: an example of modeling hierarchically structured aquatic communities with neural networks. Ecological Informatics 1(1):33–42.

Özesmi S. L., Özesmi U. 1999. An artificial neural network approach to spatial habitat modelling with interspecific interaction. Ecological Modelling 116(1):15–31.

Özesmi S. L., Tan C. O., Özesmi U. 2006. Methodological issues in building, training, and testing artificial neural networks in ecological applications. Ecological Modelling 195(1–2):83–93.

Park Y.-S., Chon T.-S. 2007. Biologically-inspired machine learning implemented to ecological informatics. Ecological Modelling 203(1–2):1–7.

Peters R. H. 1991. A Critique for Ecology. Cambridge (UK): Cambridge University Press.

Peterson A. T. 2001. Predicting species' geographic distributions based on ecological niche modeling. Condor 103(3):599–605.

Peterson A. T. 2003. Predicting the geography of species' invasions via ecological niche modeling. Quarterly Review of Biology 78(4):419–433.

Peterson A. T. 2007. Why not WhyWhere: the need for more complex models of simpler environmental spaces. Ecological Modelling 203(3–4):527–530.

Peterson A. T., Vieglais D. A. 2001. Predicting species invasions using ecological niche modeling: new approaches from bioinformatics attack a pressing problem. BioScience 51(5):363–371.

Phillips S. J., Anderson R. P., Schapire R. E. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190(3–4):231–259.

Recknagel F. 2001. Applications of machine learning to ecological modelling. Ecological Modelling 146(1–3):303–310.

Recknagel F. 2003. Ecological Informatics: Understanding Ecology by Biologically-Inspired Computation. Berlin (Germany) and New York: Springer-Verlag.

Ripley B. D. 1996. Pattern Recognition and Neural Networks. Cambridge (UK): Cambridge University Press.

Rollins M. G., Keane R. E., Parsons R. A. 2004. Mapping fuels and fire regimes using remote sensing, ecosystem simulation, and gradient modeling. Ecological Applications 14(1):75–95.

Rumelhart D. E., Hinton G. E., Williams R. J. 1986. Learning representations by back-propagating errors. Nature 323:533–536.

Salski A., Sperlbaum C. 1991. A fuzzy logic approach to modeling in ecosystem research. Pages 520–527 in Uncertainty in Knowledge Bases, 3rd International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU '90, Paris, France, July 2–6, 1990, Lecture Notes in Computer Science, Volume 521, edited by B. Bouchon-Meunier et al. Berlin (Germany): Springer-Verlag.

Sarkar S., Pressey R. L., Faith D. P., Margules C. R., Fuller T., Stoms D. M., Moffett A., Wilson K. A., Williams K. J., Williams P. H., Andelman S. 2006. Biodiversity conservation planning tools: present status and challenges for the future. Annual Review of Environment and Resources 31:123–159.

Scardi M., Harding L. W., Jr. 1999. Developing an empirical model of phytoplankton primary production: a neural network case study. Ecological Modelling 120(2–3):213–223.

Smith S. J., Iverson S. J., Bowen W. D. 1997. Fatty acid signatures and classification trees: new tools for investigating the foraging ecology of seals. Canadian Journal of Fisheries and Aquatic Sciences 54(6):1377–1386.

Spitz F., Lek S. 1999. Environmental impact prediction using neural network modelling: an example in wildlife damage. Journal of Applied Ecology 36(2):317–326.

Stockwell D. R. B. 2006. Improving ecological niche models by data mining large environmental datasets for surrogate models. Ecological Modelling 192(1–2):188–196.

Stockwell D. R. B., Noble I. R. 1992. Induction of sets of rules from animal distribution data: a robust and informative method of analysis. Mathematics and Computers in Simulation 33(5–6):385–390.

Stockwell D. R. B., Peters D. 1999. The GARP modelling system: problems and solutions to automated spatial prediction. International Journal of Geographical Information Science 13(2):143–158.

Sutton C. D. 2005. Classification and regression trees, bagging, and boosting. Pages 303–329 in Handbook of Statistics: Data Mining and Data Visualization, Volume 24, edited by C. R. Rao et al. Amsterdam (the Netherlands): Elsevier Publishing.

Termansen M., McClean C. J., Preston C. D. 2006. The use of genetic algorithms and Bayesian classification to model species distributions. Ecological Modelling 192(3–4):410–424.

Thuiller W. 2003. BIOMOD—optimizing predictions of species distributions and projecting potential future shifts under global change. Global Change Biology 9(10):1353–1362.

Torres L. G., Rosel P. E., D'Agrosa C., Read A. J. 2003. Improving management of overlapping bottlenose dolphin ecotypes through spatial analysis and genetics. Marine Mammal Science 19(3):502–514.

Usio N. 2007. Endangered crayfish in northern Japan: distribution, abundance and microhabitat specificity in relation to stream and riparian environment. Biological Conservation 134(4):517–526.

Vander Zanden M. J., Olden J. D., Thorne J. H., Mandrak N. E. 2004. Predicting occurrences and impacts of smallmouth bass introductions in north temperate lakes. Ecological Applications 14(1):132–148.

Worner S. P., Gevrey M. 2006. Modelling global insect pest species assemblages to determine risk of invasion. Journal of Applied Ecology 43(5):858–867.

June 2008 MACHINE LEARNING METHODS WITHOUT TEARS 193