Cliff Notes on Ecological Niche Modeling with RandomForest (ensembles)

Falk Huettmann
EWHALE lab
University of Alaska Fairbanks, AK 99775
Email fhuettmann@alaska.edu
Tel. 907 474 7882

Modeling Ecological Niches

[Figure: two panels. Geographic Space (Longitude vs. Latitude) is the Sampling Space; Ecological Space (Environmental factor a vs. Environmental factor b) is the Model Space => Predictions]

A Super Model

LM, GLM, GAM, CART, MARS, NN, GARP, TN, RF, GDM, Maxent …

=> Ensembles

Linear regression

A starting point…

One formula capturing the data (‘Mean’, SD): y = a + bx

Response Variable ~ Predictor1, i.e. Y ~ X

[Figure: plot of Y against X]
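As a minimal sketch of the y = a + bx starting point, the following fits the intercept a and slope b by ordinary least squares using only the Python standard library. The data values and the function name fit_line are made up for illustration and are not from the slides.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx           # slope
    a = y_bar - b * x_bar   # intercept
    return a, b

# Hypothetical data: one predictor X and one response Y
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

a, b = fit_line(x, y)

def predict(x_new):
    """Apply the single fitted formula to a new predictor value."""
    return a + b * x_new
```

The single pair (a, b) is the whole model here, which is what makes the linear case easy to report and also what limits it when the true relationship is not a straight line.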


Common Ground

A Multiple Regression framework

Response Variable ~ Predictor1 + Predictor2 + Predictor3…


Traditionally, we used 1-5 predictors

But: 1 to 1000s of predictors are possible

‘One single algorithm’ that explains the relationship between response and predictors

The derived relationship can then be used to predict at other locations where the predictors are known

GLM vs CART etc.

Linear (~unrealistic): ‘Mean’, SD => potentially low r2

Non-Linear (driven by data): ‘Mean’ ?, SD ?

CART, TreeNet & RandomForest (there are many other algorithms!)

Our Free Algorithms …

RandomForest (R-Project, Fortran, C …): http://rweb.stat.umn.edu/R/library/randomForest/html/00Index.html

TreeNet: http://salford-systems.com/products.php (free 30 day trial)

Tree/CART - Family

Classification & Regression Tree (CART) => Binary recursive partitioning

Leo Breiman 1984, and others

[Diagram: example tree with YES/NO branches at the splits Temp>15, Precip<100 and Temp<5]

Binary splits: widely used concept

Multiple splits: rarely used, yet

Free of data assumptions! No significances.

Binary split recursive partitioning (same predictor can re-occur elsewhere as a ‘splitter’)

Maximizes Nodes for Homogeneous Variance

Stopping Rules for Number of Branches based on Optimization/Cross-validation

Terminal Nodes show Means (Regression Tree) or Categories (Classification Tree)

[Diagram: Classification Tree with terminal nodes labeled by category (A, B, C, A, B); Regression Tree with terminal nodes labeled by mean (0.3, 3, 0.1, 2, 2.3)]
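The core step of binary recursive partitioning can be sketched as a search over every predictor and every candidate threshold for the split that leaves the two child nodes most homogeneous. The sketch below does this once for a regression tree, scoring homogeneity by the summed squared deviation from each node mean. The predictor names, data values and function names are hypothetical, and a real CART implementation adds recursion, stopping rules and cross-validation on top of this.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, response):
    """Try every predictor and threshold; keep the binary split whose
    two child nodes have the lowest total within-node variation."""
    best = None  # (total_sse, predictor_name, threshold)
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        # candidate thresholds: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, response) if r[name] <= t]
            right = [y for r, y in zip(rows, response) if r[name] > t]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, name, t)
    return best

# Hypothetical cases: two predictors and a continuous response
rows = [
    {"temp": 2, "precip": 80}, {"temp": 4, "precip": 120},
    {"temp": 16, "precip": 90}, {"temp": 18, "precip": 110},
    {"temp": 20, "precip": 85}, {"temp": 3, "precip": 100},
]
response = [0.1, 0.2, 2.0, 2.3, 2.1, 0.3]

total, name, threshold = best_split(rows, response)
```

Applying the same search again inside each child node gives the recursion, and the mean of the response in a node that is not split further is the value a regression tree reports at that terminal node.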

CART Salford (rpart in R)

Nice to interpret (e.g. for small trees, or when following through specific decision rules till the end)

[Figure: Relative Cost (0.70 to 0.90) against Number of Nodes (0 to 500), from withheld test data, with the optimum marked]

Importance Value

DEM        100.00
TAIR_AUG    77.58
PREC_AUG    69.46
HYDRO       54.59
POP         47.39
LDUSE       40.88

ROC curves for accuracy tests

e.g. correctly predicted absence app. 77%

e.g. correctly predicted presence app. 85%

=> Apply to a dataset for predictions

TreeNet (~a sequence of CARTs): ‘boosting’

[Diagram: tree + tree + tree + tree …; each explains remaining variance]

The more nodes … the more detail … the slower

Many trees make for a ‘net of trees’, or ‘a forest’ => Leo Breiman + Data Mining

[Figure: Risk (0.0 to 0.4) against Number of Trees (0 to 110)]

[Figure: ROC, Pct. Class 1 (0 to 100) against Pct. Population (0 to 100)]

Importance Value

Variable    Score
LDUSE      100.00
TAIR_AUG    97.62
HYDRO       94.35
DEM         94.01
PREC_AUG    90.17
POP         82.54
HMFPT       81.46

ROC curves for accuracy tests

e.g. correctly predicted absence app. 97%

e.g. correctly predicted presence app. 92%

=> Apply to a dataset for predictions

Difficult to interpret but good graphs
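The idea that each tree in the sequence explains the variance left over by the trees before it can be sketched with one-split trees (stumps) fitted to residuals. This is a generic least-squares boosting sketch, not TreeNet itself: the learning rate, the number of trees, the data and all function names are assumptions chosen for illustration.

```python
def fit_stump(x, residual):
    """One-split regression tree on a single predictor.
    Returns (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(x, y, n_trees=50, learning_rate=0.3):
    """Add stumps one at a time; each is fitted to what the
    earlier ones left unexplained (the current residuals)."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residual)
        stumps.append((t, lm, rm))
        pred = [p + learning_rate * (lm if xi <= t else rm)
                for xi, p in zip(x, pred)]
        # training error after adding this tree
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return base, stumps, history

# Hypothetical non-linear response that peaks at intermediate predictor values
x = [0, 100, 200, 300, 400, 500, 600, 700]
y = [0.1, 0.2, 0.8, 0.9, 0.9, 0.3, 0.2, 0.1]

base, stumps, history = boost(x, y)
```

Plotting history against the tree index gives an error curve of the same general kind as a risk versus number-of-trees plot, although here it is training error rather than error on withheld data.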

TreeNet: Graphic Output example

[Figure: Response Curve. Bear Occurrence (Partial Dependence, from no to yes) against Distance to Lake (m), annotated ‘?’ or ‘?’]

(the function above is virtually impossible to fit in linear algorithms => misleading coefficients, e.g. from LMs, GLMs)

RandomForest (Prasad et al. 2006, Furlanello et al. 2003, Breiman 2001)

‘Boosting & Bagging’ algorithms (~Ensemble)

[Diagram: data table with rows (cases) 1 to 5 and columns (predictors) DEM, Slope, Aspect, Climate, Land-cover. Random set 1, Random set 2, …: each is a random set of rows (cases) and a random set of columns (predictors)]

Average final tree from e.g. >2000 trees, done by VOTING

Bagging: Optimization based on In-Bag, Out-of-Bag samples

In RF no pruning => Difficult to overfit (robust)

Handles ‘noise’, interactions and categorical data fine!

Difficult to interpret but good graphs
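The random sets of rows and columns, the out-of-bag samples and the final vote can be sketched as follows. To stay short, every tree here is a single-split stump and the random predictor subset is drawn once per tree; the actual RandomForest algorithm grows full unpruned trees and draws a fresh predictor subset at every split. All data values, parameter defaults and function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, predictors):
    """Best single binary split among the given predictors, by misclassification count."""
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[p] <= t]
            right = [l for r, l in zip(rows, labels) if r[p] > t]
            l_vote = Counter(left).most_common(1)[0][0]
            r_vote = Counter(right).most_common(1)[0][0]
            errors = sum(l != l_vote for l in left) + sum(l != r_vote for l in right)
            if best is None or errors < best[0]:
                best = (errors, p, t, l_vote, r_vote)
    if best is None:  # no split possible: always predict the majority class
        vote_all = Counter(labels).most_common(1)[0][0]
        return (None, None, vote_all, vote_all)
    return best[1:]

def predict_stump(stump, row):
    p, t, l_vote, r_vote = stump
    if p is None:
        return l_vote
    return l_vote if row[p] <= t else r_vote

def grow_forest(rows, labels, n_trees=200, n_predictors=2, seed=1):
    rng = random.Random(seed)
    names, n = list(rows[0]), len(rows)
    forest = []
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]   # random set of rows, with replacement
        oob = sorted(set(range(n)) - set(in_bag))       # rows this tree never saw
        cols = rng.sample(names, n_predictors)          # random set of columns
        stump = fit_stump([rows[i] for i in in_bag], [labels[i] for i in in_bag], cols)
        forest.append((stump, oob))
    return forest

def vote(forest, row):
    """Final prediction: the class most trees vote for."""
    return Counter(predict_stump(s, row) for s, _ in forest).most_common(1)[0][0]

def oob_error(forest, rows, labels):
    """Score each case using only the trees that did not train on it."""
    wrong = scored = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        votes = Counter(predict_stump(s, row) for s, oob in forest if i in oob)
        if votes:
            scored += 1
            wrong += votes.most_common(1)[0][0] != label
    return wrong / scored

# Hypothetical cases: (DEM, Slope, Aspect, Climate, Land-cover) and presence/absence
names = ["dem", "slope", "aspect", "climate", "landcover"]
raw = [
    (120, 2, 10, 3.1, 1), (200, 4, 40, 3.5, 1), (150, 3, 25, 2.9, 2),
    (300, 5, 60, 4.0, 2), (250, 6, 35, 3.8, 1),
    (900, 20, 200, 7.5, 4), (1100, 25, 240, 8.1, 5), (950, 21, 190, 7.9, 4),
    (1200, 30, 300, 9.0, 5), (1000, 22, 220, 8.4, 4),
]
rows = [dict(zip(names, r)) for r in raw]
labels = ["absent"] * 5 + ["present"] * 5

forest = grow_forest(rows, labels)
error = oob_error(forest, rows, labels)
```

Because every case is out-of-bag for roughly a third of the trees, the out-of-bag error gives an accuracy estimate without setting aside a separate test set, which is the optimization on in-bag and out-of-bag samples mentioned above.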

RandomForest and GIS: Spatial Modeling

[Diagram: workflow linking GIS Overlays, a Table of Response and Predictors, RandomForest (quantification) with Train & Develop Model and Apply Model, and GIS Visualization of Predictions]

aaahhhhuuhhhh ?! - Makes sense because of... - No, wait a minute, that’s wrong…

RandomForest: Why so good and usable?

Allows for:

Works multivariate (100s of predictors)

Best Possible Predictions

Best Possible Clustering (without a response variable)

Tracking of Complex Interactions

Predictor Ranking

Handling Noisy Data

Fast & convenient applications

Allows for multiple (!) response variables!

Algorithms: RandomForest (R, Fortran, Salford), YAIMPUTE (R), PARTY (R) …

=> Change in World’s Science


What to read, for instance…

http://www.stat.berkeley.edu/~breiman/RandomForests/

Breiman, L. 2001. Statistical modeling: the two cultures. Statistical Science 16(3): 199-231.

Craig, E., and F. Huettmann. (2008). Using “blackbox” algorithms such as TreeNet and Random Forests for data-mining and for finding meaningful patterns, relationships and outliers in complex ecological data: an overview, an example using golden eagle satellite data and an outlook for a promising future. Chapter IV in Intelligent Data Analysis: Developing New Methodologies through Pattern Discovery and Recovery (Hsiao-fan Wang, Ed.). IGI Global, Hershey, PA,USA.

Magness, D.R., F. Huettmann, and J.M. Morton.  (2008).  Using Random Forests to provide predicted species distribution maps as a metric for ecological inventory & monitoring programs.  Pages 209-229 in T.G. Smolinski, M.G. Milanova & A-E. Hassanien (eds.).  Applications of Computational Intelligence in Biology: Current Trends and Open Problems.  Studies in Computational Intelligence,Vol. 22, Springer-Verlag Berlin Heidelberg.  428 pp.

Prasad, A.M., L.R. Iverson, and A. Liaw. 2006. Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction. Ecosystems 9: 181-199.

(and Hastie & Tibshirani, Furlanello et al. 2003, Elith et al. 2006 etc. etc.)

From now on, simply referred to as … A Super Model

LM, GLM, CART, MARS, NN, GARP, TN, RF, GDM, Maxent … => Ensembles

Some Super Models: Ensembles

Find the best model for a given section of your data => the best possible fit & prediction

[Figure: Ivory Gull example. Pres/Abs against Predictors, with different sections of the data fitted by different models (RF, LM, log, poly)]

On Greyboxes, Philosophy and Science

[Diagram: Data; Algorithm with a Known Behavior; (Data Mining) Prediction & Accuracy]

Such a statistical relationship will be found by either CART, TN, RF or LM, GLM

GLMs as a blackbox!? YES. Just think of software implementations, Max-Likelihood, Model Fitting, AIC and Research Design (sensu Keating & Cherry 1994)

[Figure: Model Performance (0% to 100%) over time, GLM -> ANN -> Boosting, Bagging …; improvement increases]


Parsimony, Inference and Prediction ?!

Sole focus on predictions and their accuracy, whereas…

…R2, p-values and traditional inference (variable rankings, AIC) are of lower relevance

Why Parsimony ?

No real need for optimizing the fit and for parsimony when prediction is the goal

Global accuracy metrics, ROC, AUC, kappa, meta analysis …(instead of p-values and significance levels or AIC)
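Two of the accuracy metrics named above can be computed directly from predictions on withheld data. The sketch below computes AUC as the share of presence/absence pairs that the model ranks correctly, and Cohen's kappa as agreement beyond chance. The scores, the 0.5 threshold and the function names are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking: the share of
    (presence, absence) pairs in which the presence gets the higher score.
    Ties count as half a correct ranking."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kappa(predicted, observed):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(observed)
    p_obs = sum(p == o for p, o in zip(predicted, observed)) / n
    classes = set(predicted) | set(observed)
    p_chance = sum((predicted.count(c) / n) * (observed.count(c) / n)
                   for c in classes)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical withheld test data: 1 = presence, 0 = absence
observed = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]   # predicted probability of presence
predicted = [int(s >= 0.5) for s in scores]          # assumed 0.5 cut-off

model_auc = auc(scores, observed)
model_kappa = kappa(predicted, observed)
```

AUC does not depend on the cut-off, whereas kappa does, so the two can rank models differently when the threshold is poorly chosen.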

[Figure: Relative Cost (0.70 to 0.90) against Number of Nodes (0 to 500)]