9
Exploring Similarities Among Many Species Distributions Scott Simmerman, Jingyuan Wang, James Osborne, Kimberly Shook and Jian Huang University of Tennessee, Knoxville William Godsoe University of Canterbury, New Zealand Theodore Simons U.S. Geological Survey, North Carolina State University ABSTRACT Collecting species presence data and then building models to predict species distribution has been long practiced in the field of ecology for the purpose of improving our under- standing of species relationships with each other and with the environment. Due to limitations of computing power as well as limited means of using modeling software on HPC fa- cilities, past species distribution studies have been unable to fully explore diverse data sets. We build a system that can, for the first time to our knowledge, leverage HPC to support effective exploration of species similarities in distribution as well as their dependencies on common environmental condi- tions. Our system can also compute and reveal uncertainties in the modeling results enabling domain experts to make in- formed judgments about the data. Our work was motivated by and centered around data collection efforts within the Great Smoky Mountains National Park that date back to the 1940s. Our findings present new research opportunities in ecology and produce actionable field-work items for bio- diversity management personnel to include in their planning of daily management activities. Categories and Subject Descriptors D.1.3 [Software]: Concurrent Programming—Parallel Pro- gramming ; I.3.2 [Computing Methodologies]: Computer Graphics—Graphics Systems - Remote Systems Keywords HPC, Species Distribution Modeling, Parallel Processing 1. INTRODUCTION A major challenge in biology is to manage the vast di- versity of living things found across the earth. Most geo- graphic regions contain a large number of distinct species, and many species are poorly known. Given this uncertainty, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. XSEDE12 2012 Chicago, IL USA Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. an important pre-requisite for any biodiversity management plan is to predict the geographic distributions of individ- ual species and to summarize this knowledge across many species. To address these problems scientists have created two ever-growing sources of data: (1) records on the occur- rences of individual species and (2) geo-referenced datasets on environmental conditions. Using these data, ecologists can leverage machine learning algorithms or statistical mod- els to estimate the probability of presence for individual species. These models—hereafter Species Distribution Mod- els or SDMs—constitute a simple scalable approach to use limited information on the ecology of a species to predict its geographic distribution. Unfortunately ecologists still lack the tools to effectively compare, contrast, and explore trends and relationships in the large number of SDMs that are cur- rently being generated. To fill this gap we develop tools for summarizing informa- tion on the relationship between individual SDMs as well as analyzing the uncertainty in the models. These tools play a crucial role for meeting time-critical management needs of national stewardship agencies with missions such as en- suring long-term security of food production and detecting early stages of irreversible damages at ecosystem-scale. A scalable data analytics system is needed for such a chal- lenging application. In such a system, the source of data makes no fundamental difference. The data could be col- lected by experts in the field, as with the ATBI, or by cit- izens as with other recent data collection projects such as eBird and Natureserve 1 . In this work, we focus on the ATBI data for the following reasons: (1) proximity to the domain experts and final users, (2) availability of both species inven- tory data and also high-quality geo-referenced environmen- tal data, and (3) availability of a production-quality model- ing method, maximum entropy, whose results are acceptable to the bio-inventory stewards at the National Park Service (NPS) and US Geological Survey (USGS). Our system uses high performance computing (HPC) and has the following features: (1) reliable employment of SDM models to construct species distribution predictions across a large canvas of species, (2) an ability to test, record and analyze varying environmental responses in each individual model, (3) visualization of data quality in support of user exploration, (4) on-demand hierarchical grouping of species by similar distribution as well as similar environmental re- sponses. We enable a user community unfamiliar with HPC 1 http://ebird.org, http://www.natureserve.org

Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

Exploring Similarities Among Many Species Distributions

Scott Simmerman, Jingyuan Wang, James Osborne, Kimberly Shook and Jian HuangUniversity of Tennessee, Knoxville

William GodsoeUniversity of Canterbury, New

Zealand

Theodore SimonsU.S. Geological Survey, North

Carolina State University

ABSTRACTCollecting species presence data and then building modelsto predict species distribution has been long practiced inthe field of ecology for the purpose of improving our under-standing of species relationships with each other and withthe environment. Due to limitations of computing power aswell as limited means of using modeling software on HPC fa-cilities, past species distribution studies have been unable tofully explore diverse data sets. We build a system that can,for the first time to our knowledge, leverage HPC to supporteffective exploration of species similarities in distribution aswell as their dependencies on common environmental condi-tions. Our system can also compute and reveal uncertaintiesin the modeling results enabling domain experts to make in-formed judgments about the data. Our work was motivatedby and centered around data collection efforts within theGreat Smoky Mountains National Park that date back tothe 1940s. Our findings present new research opportunitiesin ecology and produce actionable field-work items for bio-diversity management personnel to include in their planningof daily management activities.

Categories and Subject DescriptorsD.1.3 [Software]: Concurrent Programming—Parallel Pro-gramming ; I.3.2 [Computing Methodologies]: ComputerGraphics—Graphics Systems - Remote Systems

KeywordsHPC, Species Distribution Modeling, Parallel Processing

1. INTRODUCTIONA major challenge in biology is to manage the vast di-

versity of living things found across the earth. Most geo-graphic regions contain a large number of distinct species,and many species are poorly known. Given this uncertainty,

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.XSEDE12 2012 Chicago, IL USACopyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

an important pre-requisite for any biodiversity managementplan is to predict the geographic distributions of individ-ual species and to summarize this knowledge across manyspecies. To address these problems scientists have createdtwo ever-growing sources of data: (1) records on the occur-rences of individual species and (2) geo-referenced datasetson environmental conditions. Using these data, ecologistscan leverage machine learning algorithms or statistical mod-els to estimate the probability of presence for individualspecies. These models—hereafter Species Distribution Mod-els or SDMs—constitute a simple scalable approach to uselimited information on the ecology of a species to predict itsgeographic distribution. Unfortunately ecologists still lackthe tools to effectively compare, contrast, and explore trendsand relationships in the large number of SDMs that are cur-rently being generated.

To fill this gap we develop tools for summarizing informa-tion on the relationship between individual SDMs as well asanalyzing the uncertainty in the models. These tools playa crucial role for meeting time-critical management needsof national stewardship agencies with missions such as en-suring long-term security of food production and detectingearly stages of irreversible damages at ecosystem-scale.

A scalable data analytics system is needed for such a chal-lenging application. In such a system, the source of datamakes no fundamental difference. The data could be col-lected by experts in the field, as with the ATBI, or by cit-izens as with other recent data collection projects such aseBird and Natureserve1. In this work, we focus on the ATBIdata for the following reasons: (1) proximity to the domainexperts and final users, (2) availability of both species inven-tory data and also high-quality geo-referenced environmen-tal data, and (3) availability of a production-quality model-ing method, maximum entropy, whose results are acceptableto the bio-inventory stewards at the National Park Service(NPS) and US Geological Survey (USGS).

Our system uses high performance computing (HPC) andhas the following features: (1) reliable employment of SDMmodels to construct species distribution predictions acrossa large canvas of species, (2) an ability to test, record andanalyze varying environmental responses in each individualmodel, (3) visualization of data quality in support of userexploration, (4) on-demand hierarchical grouping of speciesby similar distribution as well as similar environmental re-sponses. We enable a user community unfamiliar with HPC

1http://ebird.org, http://www.natureserve.org

Page 2: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

to greatly increase their productivity using their own legacycode, for example burning through 90 CPU hours in just 25minutes (see Table 2).

With these results, we also enable the testing of hypothe-ses fundamental to developing management plans, such as:When managing a specific aspect of an environment, whichspecies are the most affected? For a species central to a man-agement goal, do we already have sufficient, quality data? Ifnot, which parts of the data are the most trustworthy? Toour knowledge, no systems exist to serve these needs, andespecially not on a scale as large as that of GSMNP.

2. BACKGROUND

2.1 Species Distribution ModelingOur system explores one example of an extensive data

set on species occurrences, the All Taxa Biodiversity Inven-tory (ATBI2) of The Great Smoky Mountains National Park(GSMNP). The ATBI project began in 1998 as an effort todocument all forms of life in the Park [8]. To date, the in-ventory has documented 7,391 new species to the Park and922 species new to science. As with many projects that col-lect data about the environment, the raw data will alwaysbe too sparse to capture the full details of the subjects andthe targeted environment. Therefore, it is imperative touse further modeling to make well-informed predictions ofspecies distribution, allowing for more comprehensive studyand development of actionable management plans.

GSMNP spans 2,200 km2 and represents an iconic systemto compare observations on biodiversity. The Park is one ofthe most biologically diverse regions in North America. Forthis reason it is both an International Biosphere Reserveand a World Heritage Site. This region has been a focusof diversity research for decades notably through Robert H.Whittaker’s pioneering work in the 1940’s. In a series of pa-pers, Whittaker argued that many species specialize in loca-tions with similar elevations [14] and that these responsesto elevation varied dramatically from one species to another.

GSMNP management has used these iconic features topioneer creating and maintaining scientifically sound biodi-versity inventories. In their latest ATBI database [8], thePark has amassed data for over 17,000 species in the Park.In addition to the geo-referenced species occurrence data,the Park has collected environmental data such as geology,soil, terrain, vegetation and climate which enables the mod-eling of species distributions across many disparate taxa.

Since Whittaker’s time researchers have developed sev-eral methods to model species’ distributions. Many of thesecan be thought of as estimates of the probability that aspecies is present. SDM methods can be divided into thosefor presence-only data vs. those for presence as well as ab-sence data. Our data is presence-only, and in this cate-gory there are still many modeling options such as envelopesimilarity models, ENFA (ecological niche factor analysis),GARP (genetic algorithms for rule-set prediction), logisticregression models, spatial point-process models and MaxEnt(maximum entropy) models.

In general, predictions made by different SDM methodswill produce quite different patterns. Our application sci-entists’ favorite method is MaxEnt, which works by con-straining the estimated probability of presence to resem-

2http://www.dlia.org/atbi, http://www.nps.gov/grsm

ble the observed probability of presence. To do so, Max-Ent constructs a model consisting of a number of featuressuch as descriptions of how environmental variables mightaffect the probability of presence. It then performs modeloptimization to minimize the information contained in theresiduals. MaxEnt assumes that presence-only samples arerandom with respect to the species of interest, and that co-variates of species occurrence probabilities are distinct andindependently distributed from covariates of species detec-tion probabilities [11].

In this work, we direct our primary attention to the input,output and important controlling arguments of the models,and treat the internals of the modeling process as a “blackbox”. In the following we make no further distinction be-tween SDM methods because our data analytics system isgenerally applicable to all models.

2.2 Complexity of Data Analysis and Visual-ization

Many ecologists have studied the utility of SDM models.However, these efforts still do not allow for a comprehen-sive understanding of the choices necessary to make usefulSDMs. The current limit is our ability to understand theconsequences of the numerous input choices available to ap-ply machine learning algorithms. Some choices are easy tointerpret such as the particular combination of variables in-cluded as model input. Others are much more abstract suchas regularization parameters that penalize overly complexmodels and threshold values that distinguish predicted pres-ences from absences. Moreover these factors may interact inunexpected ways making it important to consider the jointinfluence of each variable. There is a pressing need for ap-propriate ways to compare models and effective visualizationmethods augmented by computation power to rigorously in-vestigate a large parameter space.

Though SDMs are widely applied, the sources of error in-herent in such analyses are poorly understood. Over thelast few years a tremendous amount of environmental datahas become available for analyzing species distributions inthe form of interpolated maps with environmental measure-ments. However, the best way to use this ever-growing bankof data is unclear, as ecologists face two seemingly contradic-tory problems. First, species distributions are an amalgamof many facets of the environment and these facets inter-act non-linearly. For the foreseeable future we will not havemeasurements of many of these processes and as a result wemust interpret SDMs that fail to incorporate much of whatwe know about organisms [3]. Second, we have only a fi-nite amount of data on the distribution of any one species.As a result, there is a serious danger of generating overlycomplex models that do not generalize well. Biodiversitymanagement will remain extremely challenging until toolsbecome available for visualizing uncertainties due to the in-ternals of a model and the quality of input data.

The conclusions from SDMs are also susceptible to sam-pling error. When fitting complex models with multiple in-teracting sources of uncertainty, it is difficult to predict howsensitive conclusions will be due to changes in the data usedto run the analyses. For this reason it is important to deter-mine whether biological conclusions are robust to changesin sampling protocols.

The Park’s natural resource managers need to visualizeand understand advanced biological analyses of virtually all

Page 3: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

of the Park’s species, and, unfortunately, this capability iscurrently lacking. The ATBI project has been in existencesince 1998. No other protected reserve possesses such a com-prehensive set of biodiversity data (though there are spe-cial purpose data repositories at similar scales for birds, forexample). The resource management and science staff atGSMNP need a synthesis of these data to be able to answercritical conservation questions, such as: What is the distri-bution of species in the Park? Where are the most species-rich sites? What natural factors in the environment affectthese distributions? Where do un-natural stressors likelyhave the greatest impact on rare or vulnerable species andbiodiversity in general? These questions are becoming morecritical every day because of the wide variety in the Park’snatural diversity, non-native invasive species, air pollutants,habitat loss, and possibly rapid climatic changes.

We must note that other leading researchers have reportedrecent success on using analytics for other pressing applica-tions, most notably, a powerful birdVis [2] system for theeBird project, analytics for epidemic monitoring [1] and an-alytics for maritime resource allocation [7]. We find the pri-mary differences between our work and theirs to be differentapplication needs. Our needs use HPC to take SDM-basedstudies from tens of species at a time to now hundreds orthousands at a time, and from previously using environmen-tal data at 1 kilometer resolution to now 30-meter resolution.In addition, we work on visualizing similarities between hi-erarchically organized groups as well as the uncertaintiesinvolved, but do not focus on visualizing events, especiallythe temporal aspects of events as in [1, 2, 7]. Our ultimategoal is to create a system for production use by non-technicalend users such as bioinventory managers and park rangers.As future consumers of HPC technology, these users in gen-eral lack the background or the desire to understand thetechnical intricacies. Instead, their daily job function in-volves problem-solving in their own domain. In this respect,they differ from the researchers who are typically the mainusers of scientific gateways which provide a set of tools allow-ing remote computation-intensive study without geographiclimitations [15]. Although we realize this fundamental differ-ence, we do notice that our cyberinfrastructure requirementssuch as central data repositories and data analytic capabil-ities, are similar to other gateway efforts such as geospatialmodeling gateways [5], climate modeling environments [16],and the iPlant Collaborative [6].

3. ADDRESSING LARGE PROBLEMS US-ING HPC

Software tools for biologists and ecologists are just re-cently coming to terms with very large data and parameterspaces that call for parallel processing with HPC. With cur-rent modeling methods which seek to quantify and predictspecies habitat, computing power is often a major limita-tion. The work by Webster et.al in 2008 [13] represents thestate of the art in this popular research direction.

Our collaboration with biologists and ecologists using Max-Ent for species modeling is just one example of a commonoccurrence where a popular “black box” software tool needsto scale up to address a larger problem size. The challengeis to develop quick, simple ways for new users to begin thisscaling process in an HPC environment, using existing, fa-miliar tools. Just leveraging computing power, however, is

Parallel Modelling Runs

Parallel Similarity Analysis

Clustering & Hierarchy

Construction

Data Processing & Visualization Web Interface

Figure 1: Overall workflow of our system incorpo-rating HPC in species distribution modeling.

Bedrock geology categoricalDigital elevation model continuousLeaf on canopy cover categoricalSlope measured in degrees categoricalSolar radiation data continuousSoil organic type categoricalTerrain shape index continuousTopographic convergence index continuousUnderstory density classes categoricalVegetation classes categorical

Table 1: Ten environmental layers provided by theGSMNP.

only the start of the solution. Interpreting the large vol-umes of computation results and creating useful and seman-tically meaningful abstractions to describe groups of coex-isting species presents a challenge of its own.

We address these challenges in our overall workflow asshown in Figure 1. On the left side of the workflow diagram,the components include parallel orchestration of serial SDMmodels for scalability, parallel similarity analysis, and on-the-fly hierarchical abstraction creation through on-demandhierarchical clustering. The results are then amalgamatedthrough parallel data processing and visualization and sub-sequently are delivered through a fully interactive web in-terface to stakeholders that include Park natural resourcemanagers as well as ecology researchers.

3.1 The DataThe current standard for modeling species distributions is

to use coarse resolution (∼1 kilometer) environmental lay-ers when generating models [4]. Superseding the breadthand depth of this standard, GSMNP provided us with tenenvironmental layers uniformly set on a common grid at 30-meter resolution (Fig. 2).

These environmental layers include abiotic factors, such aselevation (Fig. 2a) and soil types (Fig. 2b), as well as bioticlayers on vegetation cover (Fig. 2c). These data are all scalarvariables on a common grid, and each environmental layercan be either a continuous or a categorical variable (Table 1).Each layer amounts to 15–30MB in raw format.

To ensure high quality SDMs we restrict our attention to

Page 4: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

(a) Digital Elevation (b) Soil Classes

(c) Vegetation Classes (d) Sample SDM result

Figure 2: Environmental layers and SDM examplefrom the GSMNP.

species with 30 or more distinct records. Each record con-tains GPS coordinates, species sighted, time of sighting aswell as other metadata, and can be geo-referenced with theenvironmental layers. Since the initial focus on ATBI washeavily directed towards finding rare, special, unique andpreviously unknown species, our selection criteria trimmeddown the species count to around 500, about half of whichcomprise bird and tree species in the Park.

3.2 Modeling With HPCAmong the many alternative SDM methods for presence-

only data, maximum entropy is well understood [11] and hasrecently gained much popularity, to no small extent due tothe availability of a production quality software that imple-ments the method. This software is MaxEnt3 [10], written inJava. The inputs to MaxEnt consist of species records (coor-dinates of observed locations) plus environmental layer data.To control the process of fitting a model, there are severalparameters such as convergence threshold, maximum itera-tions, regularization value (β), selection of included features,and selection of environment layer input.

The MaxEnt software is only available as an executablejarfile; it is not open source but free for academic and re-search use. Originally designed for personal computer plat-forms, MaxEnt employs inherently serial steps, althoughJava threads are used when convenient. There are a few keyissues to address before we can use this “black box” code inan HPC environment.

First of all, large-scale runs of MaxEnt constitute an atyp-ical HPC usage pattern which calls for a tool to aid inscheduling. Most codes on an HPC platform are designed torun in parallel either using message passing (MPI) or threads(OpenMP or pthreads). It is far from practical to requirethat a proven machine learning SDM model be rewritten tooperate with great internal parallel scalability. We have toexplore ways to achieve scalability external to these blackbox models. From this respect, this need is general to al-most all fields that use complex models to interpret dataand make predictions.

For this project we developed a tool called Eden whichachieves coarse-grained parallelism by managing concurrentruns of serial code. Eden is simply a master-worker frame-work that allows the user to submit a list of commands to runon a given number of cores. The problem it solves is quite

3http://www.cs.princeton.edu/ schapire/maxent

general—run a list of N single-processor jobs on P proces-sors, where N is far greater than P . The single-process jobscan have highly varying running time, can fail due to anarray of reasons such as internal bugs, saturating disk I/Ochannels or saturating runtime limits of concurrent threads.

After exploring several existing tools for managing suchconcurrency, we found that most did not fit our specific need.Tools such as Swift 4, the Nimrod toolkit 5, Condor 6, andHadoop 7 are all comprehensive tools designed for runningconcurrent jobs across many heterogeneous computing re-sources. In our case, we need to run many small serial jobson a single system and make this as easy as possible for theuser. Therefore we wanted to avoid any added burden tothe user such as learning a new language (as with Swift)or requiring installations at the system administrator level(as with Condor). Other barriers to using these tools forour project include runtime errors and lack of support forthe Nimrod toolkit, and the dependence on Java for Hadoop.We also explored GNU’s Parallel utility 8 but found its com-mand line interface daunting for inexperienced users.

Eden is light-weight and operates entirely at user level,running with very low privileges and definitely not as anadmin or daemon process. This is for security and stabilityreasons for large-scale systems with high concurrencies. To-talling only 450 lines of code, Eden is entirely script-basedand requires no compilation or building process. Offered asan open source project 9 since July 2011, Eden works withtypical job queuing systems as on Nautilus as well as withany generic multi-processor system or fat node without ajob scheduler. For a user wishing to scale their code, Edenprovides a two-step process: first trying Eden on their ownmultiprocessor system (e.g. a desktop machine) and workingout any issues with concurrent runs; then using Eden withthe same simple interface to scale their code on Nautilus.

Eden’s architecture is shown in Figure 3. It uses mostlyBourne shell with a few tcl scripts to handle socket commu-nication. To use Eden, the user must first provide a job listand specify the number of processors to use (P ). When Edenis launched, the initial eden.sh script configures the run andthen launches adam.sh, typically on the login/master node.The adam.sh script launches P worker processes (abel.sh),one on each of the P processors. On a typical PBS queuingsystem, this involves creating a PBS script and submitting itusing the qsub command. The abel.sh processes then eachretrieve a job from adam.sh via a socket connection, executethe job, and then request a new job, continuing until thecommand list is exhausted. For each command completed,stderr, stdout, and timing information are written to indi-vidual files on disk. Finally, the eve.sh script cleans up anddistills these files into a single summary file.

Once we have a way to efficiently run many MaxEnt in-stances, we next have to deal with I/O issues. MaxEnt bydefault outputs images of the species distribution maps aswell as various other charts and figures relating to the sta-tistical details of the model. This could result in 10 to 12output files for a single MaxEnt run. When running hun-

4http://www.ci.uchicago.edu/swift/main/5http://www.messagelab.monash.edu.au/Nimrod6http://research.cs.wisc.edu/condor/7http://hadoop.apache.org/8http://www.gnu.org/software/parallel/9http://sourceforge.net/projects/rdaveden

Page 5: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

.

.

.

adam.sh

abel.sh

abel.sh

abel.sh

abel.sh

commands PBSscript Torque

params

commandsdone

make_commands.sh

eden logs

summary.csv

eve.sh

eden.sh

qsub

Figure 3: The Eden framework for managing con-current modeling runs.

dreds or thousands of runs concurrently, this deluge of fileswould easily overwhelm the file system. By judicious useof command line flags, we restricted the MaxEnt output tothe minimum allowable, seven files per run, with the mostsubstantial being the SDM overlaid on a raw ASCII gridof 30 meter resolution. But even with these improvements,we generally had to limit the number of concurrent runs toprevent overwhelming the file system.

Of course, by forfeiting the output of images from Max-Ent, we had need of other means of rendering the SDM. Wechose to use a popular scientific data visualization tool VisItto produce custom maps of the models. VisIt can easily bescripted using Python and run in parallel through a batchsystem, optimizing the production of these images from theraw SDM data. In our rendering, the black dots indicate theoriginal observation points, dark brown indicates very lowprobability of presence and green represents high probabilityof presence. Also, a white contour line marks the thresholdvalue suggested by the MaxEnt model that separates pres-ence prediction from absence prediction. A sample of therendered results is shown in Fig. 2d.

Finally, we found the need to replace built-in capabilitieswith custom procedures to increase parallelism. MaxEntprovides a cross-validation capability by performing multi-ple runs for a particular species, with each run holding backa different subset of samples for testing. This allows forevaluation of the fit of the model. Using MaxEnt’s cross-validation feature, unfortunately, led to much longer runtimes per species for a couple of reasons: MaxEnt has no in-herent parallelism (so, for instance, a 10-way cross-validationis equivalent to doing 10 serial runs), and then several sum-mary results are serially created by default. To optimizethe opportunities for parallelism, we externally orchestratedcross-validation runs, which involved creating subsets of thesample data, doing independent MaxEnt runs, and makingsummary plots with our own custom C code.

3.3 Similarity Analysis with HPCNewer and faster capabilities to handle large SDM runs

will not remedy the need for carefully filtering the resultingdata and for analyses of interpretational pitfalls. It is alwaysup to the expert users to use new capabilities properly toadvance our understanding of SDM output and conservation

implications. For this purpose, we use similarity metricsas an indirect way to measure data quality by comparingcross-validation results and also as a direct way to constructgroups of species with known relationships to establish basiccontext and support and guide further exploration.

With SDM models, there are several ways to measure sim-ilarity: (1) Sørensen Index, (2) Jaccard Index, and (3) χ2

Index. Our expert users chose the Sørensen similarity index(SSI) which, when comparing the SDMs of two species, iscalculated by the fraction:

2a

2a+ b+ c

where a is the number of shared presence locations, b is thenumber of presence locations for only species 1, and c isthe number of presence locations for only species 2. TheSSI value will be between 0.0 and 1.0 with 1.0 indicating anexact match. Other measures are plausible, notably Jaccardand the squared distance between observed and expectedcounts [9]. However, many biological questions correspondto determining if two species will encounter one another, andso we have focused our analyses on Sørensen’s similarity.

We use SSI to compare cross-validation runs of a singlespecies and also to compare all possible pairs of speciesSDMs. Figure 4D shows an example of a cross-validationsimilarity matrix computed using SSI as the metric. Thematrix shows similarity between models from 10 cross-validationruns using ATBI data of the Cedar Waxwing (Bombycilla ce-drorum). A traditional “hot-cold” color map is used, wherered indicates an SSI of 1.0 (exactly similar) and blue indi-cates and SSI of 0.0 (no commonality). We can see from thisexample that the cross-validation results moderately varyfrom one another due to the orange and yellow colors.

3.4 Visualization - MDS and Hierarchical Clus-tering

Once we have SSI information on how every species com-pares to every other species we naturally want to see howsimilar species form clusters. One way to show clustering isthrough multi-dimensional scaling (MDS). MDS is a methodof producing a mapping of objects where the relative posi-tions of the objects provide a graphical representation oftheir interrelationship. In our case, the objects are SDMsand their interrelationship is the degree of similarity basedon the SSI metric. The 2D MDS plot provides a view ofsimilarities among all species in one space. We used theVegan package in R to perform the MDS mappings. TheMDS plot appears in the lower left corner of the interface(Figure 4E). The current species is indicated by a red squareand the other points are colored according to the SSI bins.

Hierarchical clustering is another way to examine clustersof similar species. When comparing between two species,there are potentially three metrics.

Metric 1: The pair-wise SSI metric that distinctly stateshow similar the two species are distributed.

Metric 2: The environmental responses. Together witheach MaxEnt prediction, a feature vector is also pro-vided to show the weight given to each environmentallayer. In our case with 10 environmental layers, the en-vironmental response vector is 1 × 10. Exploring thisvector allows users to hypothesize about the environ-ment’s role in modeling a particular distribution.

Page 6: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

Figure 4: The web interface.

Metric 3: Similar coexisting pattern with the species acrossall taxa. Given a particular species and a total ofS species in the study, we can form a 1 × S vectorwhere each element is the SSI value between the cur-rent species and one other species from the study. Bycomparing these vectors among species, scientific userscan hypothesize about different kinds of species co-existence.

Users can decide, on the fly, which metric to use to formthe feature vector. In the subsequent hierarchical clusteringstep we cluster by Euclidean distance between the featurevectors. We start the clustering process with each speciesbeing treated as a single cluster. We recursively cluster to-gether species that are similar (i.e. feature vectors withinsome threshold distance). The threshold value increases ateach level so that clusters are gradually merged together.The stopping criteria is when no new clusters can be foundwithin some predetermined threshold distance. A virtualroot node is always added in the last step so that the forestof cluster hierarchies becomes a tree that is easier to manage.

To ensure biologically meaningful results, we add an ad-ditional stopping criteria—no clusters can have an averageof pair-wise SSI below a given threshold. That is, all of thespecies within a cluster should demonstrate basic similarityas a group. This threshold is typically set at a very looseSSI value of 0.2.

As part of our interface, we visualize cluster hierarchiesas dendrograms and aim to show the detailed merging ofhierarchies among the clusters. In the dendrogram layout,edges simply mean a parent-child relationship in the treehierarchy. Users can choose to see the current grouping ofsimilar species as a list of SDM maps or as a dendrogram.

3.5 Remote Visualization - Web InterfaceModeling and subsequent analysis results are accessed via

a web browser interface. The interface provides immediateaccess to all results from the MaxEnt modeling runs allowingthe user to visually compare SDMs as well as explore clus-

tering results. The stakeholders involved in the mission ofdata-driven natural resource management are geographicallydispersed, and it is difficult to deploy highly synchronous vi-sualizations to managers, scientists, and team leads in thefield. Also, it is unique to this field that a large numberof non-profit groups are collaborating on this mission, ei-ther with each other or directly with the national parks.Given such a dynamic environment, we opted to use asyn-chronous remote visualization and utilize a web browser asthe delivery tool on the user end. Dynamic information vi-sualization is still implemented on the user end, but onlyafter the data has been heavily reduced to a small set (e.g.less than 100 species). Our working website is located athttp://seelab.eecs.utk.edu/alltaxa. It is recommended foruse with Safari, Chrome, or Firefox browsers and has knownissues with Internet Explorer.

When accessing the website, the user first chooses a par-ticular group to explore. Currently, we have species dividedinto 10 taxonomic groups suggested by our expert users.The user is then presented the screen shown in Figure 4.

The sections of the web interface are labeled as follows:

A species picture, scientific name and common name, andtotal number of distinct records in ATBI

B plot of SDM model result – this is an average taken over10 cross-validation runs

C bar chart of environment layer contributions, again aver-aged over 10 runs

D data quality metrics including a matrix comparing SSIvalues over 10 cross-validation runs and a Tukey box-plot of AUC scores for the 10 runs.

E multidimensional scaling (MDS) plot showing species sim-ilarity (Metric 1) over entire group

F ability to view different groups of species based on distri-bution similarity, either by list or dendrogram (Metric1 or 3)

G ability to query for similar environment response (Metric2) and view results as list or dendrogram

For each individual species, the user can view the overallresults from a 10-way cross-validation MaxEnt run. TheSDM is shown as a map (Fig. 4B), averaged over the 10 runs,where dark brown indicates low probability of presence anddark green indicates high probability. The black dots showthe original sample points. Environment contributions areshown in the bar chart below the map (Fig. 4C).

Information relating to data and model quality for an in-dividual species is shown in the top right panel (Fig. 4D).Here, the user can get a sense of the outcome from the 10-way cross-validation runs by viewing an SSI matrix relat-ing to the current species. If the 10×10 matrix containsmostly red, that indicates that there was very little variabil-ity between the 10 SDMs—meaning a consistent model wasproduced no matter what subset of the data was set asidefor testing. On the other hand, lots of variability in thematrix with colors ranging toward orange and yellow indi-cates inconsistency across the 10 cross-validation runs andmight indicate a data quality issue. For example, in Fig. 4D,cross-validation run 4 stands out as having lower SSI valuescompared with the others. This indicates that the 10% of

Page 7: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

the data withheld in that run has a substantial leverage onthe results and calls for closer inspection.

To the right of the SSI matrix (still in Fig. 4D), the AUCside bar shows a Tukey plot. AUC is the “area under curve”metric given by MaxEnt for measuring quality of fit of themodel. The Tukey plot shows min, max, median, and thebottom and top quartile of test AUC scores for the 10 runs.The box surrounding the Tukey plot indicates the full dy-namic range of AUC scores, 0.0 to 1.0.

Besides viewing modeling information for a single species,the interface provides three ways to examine similaritiesamong species through panels E, F and G in Fig. 4. PanelE shows the 2D MDS plot showing overall species similari-ties. The current species is indicated by a red square a littlelarger than the surrounding squares. Individual points arecolored corresponding to the partitioning of the SSI valuesinto five bins. The topmost bin, colored red, denotes SSIvalues between 0.8 and 1.0. The other bins span intervals of0.2 in descending order of orange, green, blue and pink. Thenumber of species falling into each bin is displayed on theright of panel E and is clickable. In the example of Fig. 4, thetop two bins (SSI of 0.8–1.0 and 0.6–0.8) are empty, and 26species have an SSI between 0.4 and 0.6 compared with thecurrent species of Cedar Waxwing (Bombycilla cedrorum).The most similar species, the Hairy Woodpecker (Picoidesvillosus), appears at the top of the list with an SSI of 0.54.

Panel G (Fig. 4G) allows the user to perform queries basedon environmental responses. Users can ask, for example,“Which species models have elevation and vegetation con-tributing at least 60 percent to the model?” Any combina-tion of environment layers can be chosen along with threeoptions for a contribution threshold. Results are shown aseither a sorted list or dynamically generated dendrogram.Fig. 4G shows the dendrogram view.

4. PERFORMANCE AND SCALABILITYFor our computation and analysis, we used the supercom-

puter Nautilus at the NSF Center for Remote Data Anal-ysis and Visualization (RDAV) at the National Institutefor Computational Sciences, University of Tennessee. Nau-tilus10 is an SGI Altix 1000 machine with 1024 cores and 4TB of global shared memory in a single system image. Be-cause of having a single operating system image running ona large number of cores, Nautilus is particularly well-suitedto such atypical HPC jobs as running hundreds of instancesof serial code like MaxEnt. Nautilus allows the use of sharedlibraries so that languages such as Python and Java can beused on the entire machine. Often, large capability HPCsystems such as Kraken or Jaguar do not allow these typesof languages due to restrictions of the light-weight OS in-stances running on compute nodes.

Table 2 lists timings for the various workflow componentsof a representative run involving 530 species. Using theEden tool to manage concurrent runs on 256 cores, we wereable to perform 10-way cross-validation MaxEnt runs on 530species (totaling 5300 individual runs) in 25 minutes walltime. These runs included converting the MaxEnt outputfrom ASCII to binary. Each MaxEnt run takes, on average,61 seconds, and ranges between 37–147 seconds. Summingup the wall clock times of each individual model amounts to90 hours of CPU time. This corresponds to a parallel uti-

10http://rdav.nics.tennessee.edu/nautilus

lization of 84.4%. We also experimented with concurrentlyrunning two Edens using 128 cores each to perform half ofthe SDM runs. The total running time is no different fromusing a single Eden run with 256 cores.

The input for the runs consists of species sample files andenvironmental layer data, both totaling less than 80 MB af-ter compression. Our Eden run automatically decompressesdata before feeding data into MaxEnt. This design hasgreatly reduced the input part of the I/O bottleneck. Theoutput from 5300 MaxEnt runs totals to 167 GB in 37100files. We convert the ASCII output grids to binary, reducingthe overall file output to 90 GB. Due to this large amountof I/O from the runs and the limitation of the filesystemon Nautilus, we were limited to using only a small portionof the machine, usually 256 cores, to reduce the number ofconcurrent reads and writes.

Other steps in our overall workflow shown in Table 2 in-volved using Eden to run scripts in parallel for such thingsas data pre-processing (wall time of 3 minutes), renderingof SDM maps using VisIt with Python (wall time of 4 min-utes), and aggregating cross-validation results and summarystatistics (wall time of 3 minutes). For performing pair-wisecomparisons on the SDMs, we developed custom C code us-ing pthreads. Nautilus was particularly well-suited for thistask because of its large global shared memory, allowing allSDMs to be resident in memory at the same time for com-parisons. Our code could produce an SSI matrix for 530maps (139,920 comparisons of 2899×1302 grids) in just 26seconds wall time using 256 cores on Nautilus. With allof the miscellaneous runs, most of the wall clock time wasspent in reading data. For future studies on continuouslyupdated data of 8,000–10,000 species, we will have to em-ploy advanced techniques in parallel I/O. However, becausethe data is continuously updated by the data collectors, itis very unlikely to be in a format ideal for parallel I/O. Pre-computing to reorganize the input data will be unavoidable.

Finally, hierarchical clustering on 530 species takes lessthan 3 seconds. Dendrograms are computed on the fly usingJavascript in the web browser. Depending on how manyspecies are chosen in a query, these functionalities can rangefrom highly interactive to having a lag of a few seconds.

We considered three kinds of scalability during our designstage: in terms of input data, in terms of model analysis, andin terms of access to the analysis. Not all are related to com-puting power and the priorities would vary significantly fordifferent management purposes. For the scope of this work,which is very basic compared to that of an overall long-termproject in support of natural resource management by theNational Park Service, we could only address the scalabilityof model analysis and the access to the analysis.

In orchestrating large scale modeling, our system is sta-ble and the running time increases linearly as species areadded. Assuming that a monthly run schedule would be suf-ficient and the set of species will number around 10,000 , ourmodeling running time (including cross-validation) would beapproximately 7.8 hours, which is acceptable. The SSI sim-ilarity analysis requires comparing every species pair whichleads to a big-O squared complexity. This corresponds to arunning time of ∼ 3 hours for comparing 10,000 species.

Our Eden framework is a general purpose tool and hasproven to be beneficial for several other science projects onNautilus including tornado simulation analysis (Java andMatlab), building energy efficiency modeling (Python and

Page 8: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

CPU Time Wall Time Type of Parallelism Num. Parallel Utilizationhh:mm:ss hh:mm:ss Cores (%)

Pre-processing Species Records 00:19:05 00:03:17 scripts via Eden 64 9.1MaxEnt Runs 90:01:00 00:25:23 legacy code MaxEnt via Eden 256 84.4Aggregating Cross-validation 01:21:30 00:02:48 scripts/C code via Eden 64 45.5VisIt Rendering of SDMs 01:46:00 00:04:27 VisIt and Python via Eden 128 18.6SDM Comparisons 01:15:42 00:00:26 pthreads 256 68.2Hierarchical Clustering 00:00:03 00:00:03 – 1 –

Table 2: CPU time incurred by each workflow component vs. wall-clock time with parallel acceleration.

(a) Distribution Prediction (b) Cross Valida-tion

Figure 5: (a) SDM distribution prediction for FraserFir and the environment responses, and (b) the cor-responding cross validation result.

Matlab), and ecology parameter sweep studies (Python andR). Using Eden, four different projects on Nautilus haveburned at least 42798 CPU hours over the past year [12].

5. SAMPLE USER EXPLORATION PROCESSWe illustrate the utility of our visualization approach us-

ing an example of a species of conservation concern in GSMNP.Fraser fir (Abies fraseri) is a tree species only found in moun-tains in the southeastern United States. There are extensivepopulations of this species within GSMNP, however over thelast few years it has suffered population declines due to at-tack by the Balsam wooly adelgid (Adelges piceae).

When faced with a declining species, one of the first prob-lems is to understand the extent of its geographic distri-bution. As an example, we query Fraser fir as part of theoverall subset of species including bird and plant taxonomicgroups, as we plan to subsequently compare this species toother well-sampled taxa. We select “Fraser fir” from theinterface species menu and find that this species has beenextensively sampled with 358 records. By examining theSDM (Figure 5a) we note that most of these records (blackdots) correspond to a narrow band in the eastern portion ofthe park. The SDM for this species predicts a high proba-bility of presence in this region of the park (beige and greenportions of the map) and a low probability of presence (darkbrown shading) in other regions.

After noting that our SDM predicts presence in a narrowband of habitat it is useful to determine our confidence inthis conclusion. Examining the 10-way cross-validation re-sults, we note that there is substantive agreement betweeneach of the individual runs (Figure 5b). To see this, notethat within the SSI matrix most of the individual squares arebright red or orange indicating very high similarity betweenruns. The Tukey plot to the right of the matrix indicatesthat the median AUC score from these runs was quite highand that there is little variation among the 10 runs. These

(a) by Distribution (b) by Environment Response

Figure 6: Dendrograms that recursively groupspecies by similar predicted presence vs. similar en-vironmental influences.

results reinforce our conclusion that Fraser fir is restrictedto a narrow band of habitat within GSMNP.

Next we explore the community of species with similardistributions, both to identify species that may be suscep-tible to similar threats and to find species likely to respondto changes in the population size of Fraser fir. We select thelist option to find species with similar SSI scores. We firstset the threshold to SSI of 0.8, indicating the top most sim-ilar species, and find the American Mountain Ash, anothertree species with a similarly restricted distribution. Whenwe expand our list out by lowering the SSI threshold to 0.6,we get a list of seven additional species with similar distri-butions. We then switch to the dendrogram view, which isshown in Figure 6a. The dendrogram includes Fraser Fir,the current species, and also 8 species similar to it with SSIgreater than 0.6. In this list we see Red Spruce, another treesusceptible to the same insect.

Finally, we may investigate the environmental variableswhich predict the distribution of Fraser fir. Note the bar

Page 9: Exploring Similarities Among Many Species Distributions · tory (ATBI2) of The Great Smoky Mountains National Park (GSMNP). The ATBI project began in 1998 as an e ort to document

chart to the lower right of the map listing the percent contri-bution of each environmental variable to the SDM for Fraserfir (Figure 5a). This chart shows that the DEM (Digital Ele-vation Model) is the highest contributor. We can query thisinformation to find species that are likewise best modeled byelevation. To do this we click on the tick box below the bluesquare and the “60%” option to query for SDMs in whichDEM made at least a 60% contribution. From the list optionwe obtain a number of species. We may then switch to thedendrogram view to obtain groups of species best predictedby elevation, with Fraser fir and similar species constitutinga cluster at the bottom of our graph (Figure 6b).

Together, this information gives a concise summary on thecurrent state of knowledge on the Fraser fir (Abies fraseri), aspecies of concern. We know that this species has been wellsampled but that it has a comparatively narrow distributionwithin the park. A number of other species have similarlyrestricted distributions including some that are susceptibleto the same environmental threats. Also, the comparativelynarrow distribution of this species is best predicted by theelevation of a location.

The above describes the process that our ecology co-authorused to study the Fraser fir, a species iconic to the south-east US that is currently in danger. This system has alreadybeen provided to the park managers, but so far it has notbeen put into use for management purposes.

6. CONCLUDING REMARKSThis paper reports early progress on a long-term project

for supporting the decision making process of natural re-source managers. Their needs include finding correlationsas predicted by current data, understanding the uncertain-ties with the predictions, devising plans to improve data orperforming small scale field work to verify predictions, de-signing and carrying out intervention plans, and continuingmonitoring of results. This process is cyclic and comes withtime pressure of many kinds. Our work is in essence onlya pilot study on the way to meeting these goals. Our fu-ture plans include extending the web interface capabilitiesso that student users as well as researchers and biodiversitymanagers can more efficiently conduct their own studies andexperiments based on sophisticated modeling using HPC re-sources. The interface will allow non-technical users to ef-fectively leverage HPC resources to answer key questions intheir field. We hope to fully automate our workflow so thatresearchers can easily configure their own MaxEnt runs tobe performed on Nautilus and then explore the results inter-actively through the web interface. We also plan to extendthe multivariate visual analytic capabilities of the interface.

AcknowledgmentUniversity of Tennessee computer science undergraduate stu-dents James White and Adam Haney worked on reliablyrunning the MaxEnt model in parallel while maintainingsystem-wide stability. We thank Sean Ahern for insightfuldiscussions. We also owe professional gratitude to KeithLangdon and Thomas Colson of the National Park Service,Todd Witcher of Discover Life in America (DLIA), and CharlesParker of USGS for their insights in the application do-main. William Godsoe was supported by a post-doctoralfellowship at the National Institute for Mathematical andBiological Synthesis (NIMBioS), an institute sponsored by

the National Science Foundation, the U.S. Department ofHomeland Security, and the U.S. Department of Agricul-ture through the National Science Foundation Award EF-0832858, with additional support from the University ofTennessee, Knoxville. This work was supported primar-ily by the Office of Cyber Infrastructure of the NSF underAward Number ARRA-NSF-OCI-0906324 for NICS-RDAVcenter and NSF-OCI-1136246 for support of undergraduateresearch in the NICS-RDAV center.

7. REFERENCES[1] S. Afzal, R. Maciejewski, and D. Ebert. Visual analytics

decision support environment for epidemic modeling andresponse evaluation. In IEEE Conf on Visual AnalyticsScience & Technology, 2011.

[2] N. Ferreira, L. Lins, D. Fink, S. Kelling, C. Wood,J. Freire, and C. Silva. Birdvis: Visualizing andunderstanding bird populations. IEEE Transactions onVisualization and Computer Graphics, 17:2374–2383, 2011.

[3] W. Godsoe. I can’t define the niche but I know it when Isee it: a formal link between statistical theory and theecological niche. Oikos, 119:53–60, 2010.

[4] R. Hijmans and C. Graham. Testing the ability of climateenvelope models to predict the effect of climate change onspecies distributions. Global Change Biology, 12:2272–2281,2006.

[5] H. Lee, L. Zhao, G. J. Bowen, C. C. Miller, A. Kalangi,T. Zhang, and J. B. West. Enabling online geospatialisotopic model development and analysis. In Proceedings ofthe 2011 TeraGrid Conference: Extreme Digital Discovery,pages 1–8, 2011.

[6] A. Lenards, N. Merchant, and D. Stanzione. Building anenvironment to faciliatate discoveries for plant sciences. InProceedings from Gateway Computing Environments 2011at Supercomputing11, 2011.

[7] A. Malik, R. Maciejewski, B. Maule, and D. Ebert. A visualanalytics process for maritime resource allocation and riskassessment. In IEEE Conf. on Visual Analytics Science &Technology, 2011.

[8] B. Nichols and K. Langdon. The smokies all taxabiodiversity inventory: History and progress. SoutheasternNaturalist, 6(sp2):27–34, 2007.

[9] A. Peterson, J. Soberon, and V. Sanchez-Cordero.Conservatism of ecological niches in evolutionary time.Science, 285(1265-1267):3–27, 1999.

[10] S. J. Phillips, R. P. Anderson, and R. E. Schapire.Maximum entropy modeling of species geographicdistributions. Ecological Modelling, 190:231–259, 2006.

[11] C. T. Rota, R. J. Flectcher, J. M. Evans, and R. L. Hutto.Does accounting for imperfect detection improve speciesdistribution models. Ecography, 34:659–670, 2011.

[12] A. Szczepanski, T. Baer, Y. Mack, J. Huang, and S. Ahern.Usage of a hpc data analysis and visualization system.IEEE Computer (accepted, in press), ?(?), 2012.

[13] R. A. Webster, K. H. Pollock, and T. R. Simons. Bayesianspatial modeling of data from avian point count surveys. J.Agricultural, Biological and Environmental Statistics,13(2):121–139, 2008.

[14] R. H. Whittaker. A study of summer foliage insectcommunities in Great Smoky Mountains National Park”.Ecological Monographs, 22(1):1–44, 1952.

[15] N. Wilkins-Diehr, D. Gannon, G. Klimeck, S. Oster, andS. Pamidighantam. Teragrid science gateways and theirimpact on science. IEEE Computer, 41(11), 2008.

[16] L. Zhao, C. Song, C. Thompson, H. Zhang,M. Lakshminarayanan, C. DeLuca, S. Murphy, K. Saint,D. Middleton, N. Wilhelmi, E. Nienhouse, and M. Burek.Developing an integrated end-to-end teragrid climatemodeling environment. In Proceedings of the 2011TeraGrid Conference: Extreme Digital Discovery, 2011.