1. Data Mining to Reveal Biological Invaders. Vojtěch Jarošík (1, 2).
(1) Department of Ecology, Faculty of Science, Charles University,
Prague, and (2) Institute of Botany, Academy of Sciences of the Czech
Republic, Průhonice, the Czech Republic
2. Background on biological invasions: economic impacts; changes
in habitat properties; loss of species diversity; biological
homogenization
3. Economic impacts
4. Economic impacts. Skibbereen 1847, by Cork artist James Mahony.
Emigrants Leave Ireland, engraving by Henry Doyle.
5. Changes in habitat properties
6. Extinction of native species. Brown tree snake on a fence post
in Guam.
7. Biological homogenization
8. Aim. Data-mining tools were originally designed for analyzing
vast databases of often incomplete data, with the aim of finding
financial frauds, suitable candidates for loans, potential customers
and other uncertain outcomes. I will show that searching for potential
invasive species and their traits responsible for invasiveness, or
identifying factors that distinguish invasible communities from
those that resist invasion, are similar risk assessments. This is
perhaps the main reason why CART and related methods are becoming
increasingly popular in the field of invasion biology. Identifying
homogeneous groups with high or low risk and constructing rules for
making predictions about individual cases is, in essence, the same
for financial credit scoring as for pest risk assessment. In both
cases, one searches for rules that can be used to predict uncertain
future events.
9. Basic principles of data mining: main literature sources;
binary recursive partitioning; Classification and Regression Trees
(CART); Random Forests (RF™)
10. Basic principles of data mining: main literature sources
Classification and Regression Trees (CART):
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Belmont, Wadsworth International Group
Steinberg D, Colla P (1995) CART: Tree-structured Non-parametric Data Analysis. Salford Systems, San Diego, USA
Steinberg D, Colla P (1997) CART: Classification and Regression Trees. Salford Systems, San Diego, USA
Steinberg D, Golovnya M (2006) CART 6.0 User's Manual. Salford Systems, San Diego, USA
De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178-3192
Bourg NA, McShea WJ, Gill DE (2005) Putting a cart before the search: successful habitat prediction for a rare forest herb. Ecology, 86, 2793-2804
Jarošík V (2011) CART and related methods. In: Encyclopedia of Biological Invasions (eds Simberloff D, Rejmánek M), pp. 104-108. University of California Press, Berkeley and Los Angeles, USA
Random Forests™:
Breiman L (2001) Random Forests. Machine Learning, 45, 5-32
Breiman L, Cutler A (2004) Random Forests™. An Implementation of Leo Breiman's RF™ by Salford Systems. Salford Systems, San Diego, USA
Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, et al. (2007) Random forests for classification in ecology. Ecology, 88, 2783-2792
Hochachka WM, Caruana R, Fink D, Munson A, Riedewald M, et al. (2007) Data-mining discovery of pattern and process in ecological systems. Journal of Wildlife Management, 71, 2427-2437
TreeNet™:
Friedman JH (1999) Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189-1232
Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367-378
Hastie TJ, Tibshirani RJ, Friedman JH (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York
Jarošík V, Pyšek P, Kadlec T (2011) Alien plants in urban nature reserves: from red-list species to future invaders. NeoBiota, 10, 27-46
11. Basic principles of predictive mining. Binary recursive
partitioning: the data are successively split in two.
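As an illustration of binary recursive partitioning, here is a minimal sketch in plain Python. It assumes the Gini impurity as the splitting criterion (one common choice in CART-style trees; the slide does not name a criterion). The function names and the six toy sites, described by a hypothetical road density and natural-area fraction, are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best binary split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

def grow(rows, labels, min_size=2):
    """Split the data in two, then split each part again, until nodes are pure or small."""
    majority = Counter(labels).most_common(1)[0][0]
    leaf = {"leaf": majority, "n": len(rows)}
    if gini(labels) == 0.0 or len(rows) < min_size:
        return leaf
    split = best_split(rows, labels)
    if split is None:  # all rows identical: nothing left to split on
        return leaf
    j, t, _ = split
    left = [i for i, r in enumerate(rows) if r[j] <= t]
    right = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow([rows[i] for i in left], [labels[i] for i in left], min_size),
        "right": grow([rows[i] for i in right], [labels[i] for i in right], min_size),
    }

# Hypothetical toy data: [road density, natural-area fraction] per site.
rows = [[0.05, 0.95], [0.08, 0.90], [0.02, 0.99],
        [0.30, 0.40], [0.45, 0.20], [0.60, 0.10]]
labels = ["Absent", "Absent", "Absent", "Present", "Present", "Present"]
tree = grow(rows, labels)
print(tree)
```

On these toy data a single split on the first predictor already yields two pure groups, so the recursion stops after one step. With real data the same procedure is repeated within each subgroup.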
12. Basic principles of CART. CART provides graphical, highly
intuitive insight.
Classification tree (class: cases, %):
Root (N = 637): Absent 384 (60.3), Present 253 (39.7); split on Run off
  Run off = None (N = 314): Absent 287 (91.4), Present 27 (8.3); split on Road density outside
    Road density outside <= 0.1: Terminal node 1 (N = 286): Absent 277 (96.9), Present 9 (3.1)
    Road density outside > 0.1: Terminal node 2 (N = 28): Absent 10 (35.7), Present 18 (64.3)
  Run off = High, Medium, Low (N = 323): Absent 97 (30.0), Present 226 (70.0); split on Natural areas outside
    Natural areas outside <= 0.9: Terminal node 3 (N = 217): Absent 34 (15.7), Present 183 (84.3)
    Natural areas outside > 0.9 (N = 106): Absent 63 (59.4), Present 43 (40.6); split on Road present inside
      Road present inside = No: Terminal node 4 (N = 38): Absent 34 (89.5), Present 4 (10.5)
      Road present inside = Yes: Terminal node 5 (N = 68): Absent 29 (42.6), Present 39 (57.4)
From:
The tree is represented (in defiance of gravity) with the root
standing for undivided data at the top.
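A fitted classification tree is simply a set of nested if/else rules. The sketch below encodes the tree on this slide that way. The branch directions are read from the left-to-right order of the split labels and node counts on the slide, so they are an assumption. The function and parameter names are hypothetical. The slide gives no units for the predictors.

```python
# Terminal-node counts as printed on the slide: node -> (absent, present).
TERMINAL_COUNTS = {
    1: (277, 9),
    2: (10, 18),
    3: (34, 183),
    4: (34, 4),
    5: (29, 39),
}

def classify_site(run_off, road_density_outside, natural_areas_outside,
                  road_present_inside):
    """Follow the splits from the root down and return (terminal node, predicted class)."""
    if run_off == "None":
        # Left branch of the root: split on road density outside.
        node = 1 if road_density_outside <= 0.1 else 2
    elif natural_areas_outside <= 0.9:
        # Right branch of the root: split on natural areas outside.
        node = 3
    else:
        # Last split: is a road present inside? (assumed: No -> node 4, Yes -> node 5)
        node = 5 if road_present_inside else 4
    absent, present = TERMINAL_COUNTS[node]
    # Each terminal node predicts its majority class.
    return node, ("Present" if present > absent else "Absent")

def resubstitution_accuracy():
    """Share of the cases that fall in the majority class of their terminal node."""
    correct = sum(max(a, p) for a, p in TERMINAL_COUNTS.values())
    total = sum(a + p for a, p in TERMINAL_COUNTS.values())
    return correct / total

print(classify_site("None", 0.05, 0.5, False))
print(classify_site("High", 0.5, 0.95, True))
print(round(resubstitution_accuracy(), 3))
```

Summing the majority classes of the five terminal nodes gives 551 correctly classified cases out of 637. This is accuracy on the data used to build the tree, not an estimate of predictive performance on new data.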
13. Basic principles of CART. From:
14. Basic principles of Random Forests™. Random forests can be
seen as an extension of classification trees: many sub-trees are
fitted to parts of the dataset, and the predictions from all trees
are then combined. Figure 3. Ranking of importance values (%) for all
invasive species. Ranking is scaled relative to the best-performing
variable, based on the out-of-bag method of random forests. White bars
are predictors from outside Kruger National Park and grey bars
from inside. From:
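The idea of fitting many trees to parts of the dataset and combining their predictions can be sketched in plain Python. This is a deliberately simplified illustration, not Breiman's full algorithm or the Salford implementation. Each "tree" is a one-split stump fitted to a bootstrap sample on one randomly chosen predictor, predictions are combined by majority vote, and the out-of-bag (OOB) cases give an accuracy estimate. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one predictor, minimising misclassified cases."""
    majority = Counter(labels).most_common(1)[0][0]
    values = sorted(set(r[feature] for r in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = Counter(y for r, y in zip(rows, labels) if r[feature] <= t)
        right = Counter(y for r, y in zip(rows, labels) if r[feature] > t)
        l_lab, l_hits = left.most_common(1)[0]
        r_lab, r_hits = right.most_common(1)[0]
        errors = len(rows) - l_hits - r_hits
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # only one distinct value: predict the majority class
        return lambda row: majority
    _, t, l_lab, r_lab = best
    return lambda row: l_lab if row[feature] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit each stump to a bootstrap sample and remember which cases were in the bag."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(len(rows[0]))       # one random predictor per stump
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        forest.append((stump, set(idx)))
    return forest

def predict(forest, row):
    """Combine the predictions of all trees by majority vote."""
    return Counter(stump(row) for stump, _ in forest).most_common(1)[0][0]

def oob_accuracy(forest, rows, labels):
    """Score each case using only the trees that did not see it during fitting."""
    hits = total = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(stump(row) for stump, in_bag in forest if i not in in_bag)
        if votes:
            total += 1
            hits += votes.most_common(1)[0][0] == y
    return hits / total if total else float("nan")

# Hypothetical toy data: [road density, natural-area fraction] for 20 sites.
rows = [[0.01 * (i + 1), 0.80 + 0.02 * i] for i in range(10)] + \
       [[0.80 + 0.02 * i, 0.01 * (i + 1)] for i in range(10)]
labels = ["Absent"] * 10 + ["Present"] * 10
forest = fit_forest(rows, labels)
print(predict(forest, [0.05, 0.90]), predict(forest, [0.90, 0.05]))
print(oob_accuracy(forest, rows, labels))
```

The out-of-bag cases serve as a built-in test set. The importance ranking in the figure is also described as based on the out-of-bag method, but the sketch does not compute variable importance.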
15. Important properties of data mining models: exploratory and
flexible; non-parametric; surrogates; penalization; weighting; scoring;
artificially placing some predictors at the top of a tree
16. Data mining models are exploratory. Unlike classical
linear methods, data mining techniques enable predictions to
be made from the data and identify the most important predictors
by screening a large number of candidate variables, without requiring
the user to make any assumptions about the form of the relationships
between the predictors and the target variable, and without a priori
formulated hypotheses.
17. Data mining models are flexible. These techniques are also
more flexible than traditional statistical analyses because they can
reveal structures in the dataset that are other than linear, and
resolve complex interactions. Unlike linear models, which uncover a
single dominant structure in the data, data mining models are
designed to work with data that might have multiple structures: the
models can use the same explanatory variable in different parts of
the tree, dealing effectively with nonlinear relationships and
higher-order interactions. In fact, provided there are enough
observations, the more complex the data and the more variables that
are available, the better the models will perform compared to
alternative methods. With a complex data set, understandable and
generally interpretable results often can be found only by
constructing data mining models. Data mining models are also excellent
for initial data inspection. Models are often used to select a
manageable number of core measures from databases with hundreds of
variables. A useful subset of predictors from a large set of variables
can then be used in building a formal linear model.
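To make the point about reusing the same explanatory variable concrete, the sketch below uses a hypothetical hump-shaped response: a species is present only at intermediate values of a single predictor. One split cannot separate the classes, but a tree that splits on the same variable twice can. The data, the Gini criterion and the function names are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(xs, ys, depth):
    """Grow a tree on one predictor; a node is (threshold, left, right) or a class label."""
    majority = Counter(ys).most_common(1)[0][0]
    values = sorted(set(xs))
    if depth == 0 or len(set(ys)) == 1 or len(values) < 2:
        return majority
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, t)
    t = best[1]
    lo_part = [(x, y) for x, y in zip(xs, ys) if x <= t]
    hi_part = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t,
            grow([x for x, _ in lo_part], [y for _, y in lo_part], depth - 1),
            grow([x for x, _ in hi_part], [y for _, y in hi_part], depth - 1))

def predict(node, x):
    """Walk down the tree until a class label (leaf) is reached."""
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x <= t else right
    return node

def accuracy(node, xs, ys):
    return sum(predict(node, x) == y for x, y in zip(xs, ys)) / len(xs)

# Hypothetical predictor with a hump-shaped response: present only for 3 <= x <= 6.
xs = list(range(10))
ys = ["Present" if 3 <= x <= 6 else "Absent" for x in xs]

one_split = grow(xs, ys, depth=1)    # a single threshold
two_splits = grow(xs, ys, depth=2)   # the same variable is split again
print(accuracy(one_split, xs, ys), accuracy(two_splits, xs, ys))
```

The one-split tree misclassifies three of the ten cases, whereas the two-level tree, which splits on the same variable at 2.5 and again at 6.5, classifies all of them correctly.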
18. With a complex data set, understandable and generally
interpretable results often can be found only by constructing data
mining models. From:
19. With a complex data set, understandable and generally
interpretable results often can be found only by constructing data
mining models. Table 2. Test statistics and significances of the
explanatory variables and their interactions in the AIC minimal
adequate models for proportion of archaeophytes. Non-significant
variables and their interactions are not shown. R² = 0.82.
Columns (archaeophytes): Explanatory variable, df, Deviance, P, R²
Habitat type 3190521.2