Data mining to reveal biological invaders

1. Data Mining to Reveal Biological Invaders. Vojtěch Jarošík (1, 2). (1) Department of Ecology, Faculty of Science, Charles University, Prague, and (2) Institute of Botany, Academy of Sciences of the Czech Republic, Průhonice, Czech Republic

2. Background on biological invasions: Economic impacts; Changes in habitat properties; Loss of species diversity; Biological homogenization

3. Economic impacts

4. Economic impacts. Skibbereen 1847, by Cork artist James Mahony; Emigrants Leave Ireland, engraving by Henry Doyle

5. Changes of habitat properties

6. Extinction of native species. Brown tree snake on a fence post in Guam

7. Biological homogenization

8. Aim. Data-mining tools were originally designed for analyzing vast databases of often incomplete data, with the aim of finding financial frauds, suitable candidates for loans, potential customers and other uncertain outputs. I will show that searching for potential invasive species and the traits responsible for their invasiveness, or identifying the factors that distinguish invasible communities from those that resist invasion, are similar risk assessments. This is perhaps the main reason why CART and related methods are becoming increasingly popular in invasion biology. Identifying homogeneous groups with high or low risk, and constructing rules for making predictions about individual cases, is in essence the same for financial credit scoring as for pest risk assessment. In both cases, one searches for rules that can be used to predict uncertain future events.

9. Basic principles of data mining: Main literature sources; Binary recursive partitioning; Classification and Regression Trees (CART); Random Forests (RF™)

10. Basic principles of data mining: main literature sources
Classification and Regression Trees (CART):
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth International Group, Belmont
Steinberg D, Colla P (1995) CART: Tree-structured Non-parametric Data Analysis. Salford Systems, San Diego, USA
Steinberg D, Colla P (1997) CART: Classification and Regression Trees. Salford Systems, San Diego, USA
Steinberg D, Golovnya M (2006) CART 6.0 User's Manual. Salford Systems, San Diego, USA
De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178-3192
Bourg NA, McShea WJ, Gill DE (2005) Putting a CART before the search: successful habitat prediction for a rare forest herb. Ecology, 86, 2793-2804
Jarošík V (2011) CART and related methods. In: Encyclopedia of Biological Invasions (eds Simberloff D, Rejmánek M), pp. 104-108. University of California Press, Berkeley and Los Angeles, USA
Random Forests™:
Breiman L (2001) Random Forests. Machine Learning, 45, 5-32
Breiman L, Cutler A (2004) Random Forests™. An Implementation of Leo Breiman's RF™ by Salford Systems. Salford Systems, San Diego, USA
Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, et al. (2007) Random forests for classification in ecology. Ecology, 88, 2783-2792
Hochachka WM, Caruana R, Fink D, Munson A, Riedewald M, et al. (2007) Data-mining discovery of pattern and process in ecological systems. Journal of Wildlife Management, 71, 2427-2437
TreeNet™:
Friedman JH (1999) Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189-1232
Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367-378
Hastie TJ, Tibshirani RJ, Friedman JH (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York
Jarošík V, Pyšek P, Kadlec T (2011) Alien plants in urban nature reserves: from red-list species to future invaders. NeoBiota, 10, 27-46

11. Basic principles of predictive mining: binary recursive partitioning. The data are successively split.
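Binary recursive partitioning can be sketched in a few lines of Python. This is a minimal illustration, not the CART implementation cited in the literature slide: it assumes numeric predictors, uses Gini impurity as the splitting criterion, and stops when a node is pure or no split reduces impurity. The toy data (two predictors loosely named road density and natural-area fraction) and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold, weighted impurity) of the best binary split, or None."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold midway between observed values
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

def grow(rows, labels, min_size=2):
    """Split the data successively; a leaf is the majority class of its cases."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) < min_size:
        return majority
    split = best_split(rows, labels)
    if split is None or split[2] >= gini(labels):
        return majority  # no split improves on the parent node
    j, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], min_size),
        "right": grow([r for r, _ in right], [y for _, y in right], min_size),
    }

def predict(tree, row):
    """Drop one case down the tree until it reaches a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical toy data: (road density, natural-area fraction) -> invader status
rows = [(0.05, 0.2), (0.08, 0.8), (0.02, 0.5), (0.30, 0.3),
        (0.50, 0.4), (0.40, 0.95), (0.60, 0.97)]
labels = ["absent", "absent", "absent", "present", "present", "absent", "absent"]

tree = grow(rows, labels)
print(tree)
```

Each call to `grow` splits one node into two children and then recurses into each child, which is what "successively split" means in practice. Real CART additionally grows an overlarge tree and prunes it back by cross-validation, a step omitted here.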

12. Basic principles of CART. CART provides a graphical, highly intuitive insight into the data. The tree is represented (in defiance of gravity) with the root, standing for the undivided data, at the top.
[Figure: classification tree; each node lists class, cases and %]
Root node, split on Run off (None | High, Medium, Low): Absent 384 (60.3%), Present 253 (39.7%), N = 637
Node, split on Road density outside (=< 0.1 | > 0.1): Absent 287 (91.4%), Present 27 (8.3%), N = 314
Node, split on Natural areas outside (=< 0.9 | > 0.9): Absent 97 (30.0%), Present 226 (70.0%), N = 323
Node, split on Road present inside (No | Yes): Absent 63 (59.4%), Present 43 (40.6%), N = 106
Terminal node 1: Absent 277 (96.9%), Present 9 (3.1%), N = 286
Terminal node 2: Absent 10 (35.7%), Present 18 (64.3%), N = 28
Terminal node 3: Absent 34 (15.7%), Present 183 (84.3%), N = 217
Terminal node 4: Absent 34 (89.5%), Present 4 (10.5%), N = 38
Terminal node 5: Absent 29 (42.6%), Present 39 (57.4%), N = 68
From:
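How such a tree turns into predictions can be illustrated with the terminal-node counts on this slide. The sketch below assigns each terminal node its majority class and counts the cases that rule misclassifies. It assumes the counts describe the sample the tree was built on (so the result is a resubstitution error, not a cross-validated one) and reads the first terminal node's Present row as 9 cases (3.1%); the function name is hypothetical.

```python
# (absent, present) case counts per terminal node, as read from the slide
terminal_nodes = {
    1: (277, 9),
    2: (10, 18),
    3: (34, 183),
    4: (34, 4),
    5: (29, 39),
}

def summarize(nodes):
    """Majority-class prediction per node, plus total misclassified cases."""
    predictions = {}
    errors = 0
    total = 0
    for node_id, (absent, present) in nodes.items():
        predictions[node_id] = "Absent" if absent >= present else "Present"
        errors += min(absent, present)  # minority cases are misclassified
        total += absent + present
    return predictions, errors, total

predictions, errors, total = summarize(terminal_nodes)
print(predictions)
print(f"misclassified: {errors} of {total} ({errors / total:.3f})")
```

For comparison, predicting the root's majority class (Absent) for every case would misclassify all 253 Present cases out of 637.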

13. Basic principles of CART. From:

14. Basic principles of Random Forests™. Random forests can be seen as an extension of classification trees: many sub-trees are fitted to parts of the dataset, and the predictions from all trees are then combined. Figure 3. Ranking of importance values (%) for all invasive species. Ranking is scaled relative to the best performing variable, based on the out-of-bag method of random forests. White bars are predictors from outside Kruger National Park and grey bars from inside. From:
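The idea of fitting many trees to parts of the dataset and combining their predictions can be sketched as follows. This is a deliberately simplified stand-in for Breiman's algorithm, under stated assumptions: each "tree" is a single split (a stump) on one randomly chosen predictor, each is fitted to a bootstrap sample, predictions are combined by majority vote, and the out-of-bag error is estimated from the cases each stump did not see. The data and all names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Best single-threshold rule on one feature, judged by training errors."""
    best = None
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = Counter(y for r, y in zip(rows, labels) if r[feature] <= t)
        right = Counter(y for r, y in zip(rows, labels) if r[feature] > t)
        l_lab = left.most_common(1)[0][0]
        r_lab = right.most_common(1)[0][0]
        errors = (sum(left.values()) - left[l_lab]
                  + sum(right.values()) - right[r_lab])
        if best is None or errors < best[0]:
            best = (errors, feature, t, l_lab, r_lab)
    if best is None:  # feature is constant in this sample: predict the majority
        maj = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), maj, maj)
    return best[1:]

def stump_predict(stump, row):
    feature, t, l_lab, r_lab = stump
    return l_lab if row[feature] <= t else r_lab

def fit_forest(rows, labels, n_trees=200, seed=0):
    """Fit stumps to bootstrap samples; return the forest and its out-of-bag error."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        feature = rng.randrange(p)                  # random predictor for this stump
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        forest.append(stump)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # only cases the stump never saw may vote
                oob_votes[i][stump_predict(stump, rows[i])] += 1
    scored = [(v.most_common(1)[0][0], y) for v, y in zip(oob_votes, labels) if v]
    oob_error = sum(pred != y for pred, y in scored) / len(scored)
    return forest, oob_error

def forest_predict(forest, row):
    """Combine the predictions of all stumps by majority vote."""
    return Counter(stump_predict(s, row) for s in forest).most_common(1)[0][0]

# Hypothetical data: two correlated predictors, invader present when i > 10
rows = [(i / 20, (i + i % 3) / 20) for i in range(20)]
labels = ["present" if i > 10 else "absent" for i in range(20)]

forest, oob_error = fit_forest(rows, labels)
print(forest_predict(forest, (0.9, 0.95)), forest_predict(forest, (0.1, 0.1)))
print(f"out-of-bag error: {oob_error:.2f}")
```

Because every case is left out of roughly a third of the bootstrap samples, the out-of-bag votes give an error estimate without a separate test set. The importance ranking described in the figure caption builds on the same out-of-bag cases, but is not computed in this sketch.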

15. Important properties of data mining models: Exploratory and flexible; Nonparametric; Surrogates; Penalization; Weighting; Scoring; Artificially placing some predictors at the top of a tree

16. Data mining models are exploratory. Unlike classical linear methods, data mining techniques enable predictions to be made from the data and identify the most important predictors by screening a large number of candidate variables, without requiring the user to make any assumptions about the form of the relationships between the predictors and the target variable, and without a priori formulated hypotheses.

17. Data mining models are flexible. These techniques are also more flexible than traditional statistical analyses because they can reveal structures in the dataset that are other than linear, and resolve complex interactions. Unlike linear models, which uncover a single dominant structure in the data, data mining models are designed to work with data that might have multiple structures. The models can use the same explanatory variable in different parts of the tree, dealing effectively with nonlinear relationships and higher-order interactions. In fact, provided there are enough observations, the more complex the data and the more variables that are available, the better the models perform compared with alternative methods. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models. Data mining models are also excellent for initial data inspection. Models are often used to select a manageable number of core measures from databases with hundreds of variables. A useful subset of predictors from a large set of variables can then be used in building a formal linear model.
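Why reusing the same explanatory variable matters can be shown with a hump-shaped response, where a species occurs only at intermediate values of a predictor. The sketch below is an assumption-laden toy, not a tree-growing algorithm: it searches exhaustively for the fewest misclassifications achievable with a given number of thresholds on one variable, with each resulting interval predicting its majority class. Data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def segment_errors(labels):
    """Cases misclassified when a segment predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def min_errors(xs, ys, n_splits):
    """Fewest misclassifications using n_splits thresholds on a single variable."""
    pairs = sorted(zip(xs, ys))
    values = sorted(set(xs))
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = len(ys)
    for chosen in combinations(cuts, n_splits):
        bounds = (float("-inf"),) + chosen + (float("inf"),)
        errs = sum(
            segment_errors([y for x, y in pairs if lo < x <= hi])
            for lo, hi in zip(bounds, bounds[1:])
        )
        best = min(best, errs)
    return best

# Hypothetical hump-shaped response: present only at intermediate x
xs = list(range(10))
ys = ["present" if 3 <= x <= 6 else "absent" for x in xs]

print(min_errors(xs, ys, 0))  # no split: always predict the majority class
print(min_errors(xs, ys, 1))  # one threshold on x
print(min_errors(xs, ys, 2))  # two thresholds on the same x
```

A single threshold cannot isolate the middle of the gradient, whereas two splits on the same variable recover the pattern exactly. A tree reaches the same result by splitting on x once and then again within one of the child nodes.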

18. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models. From:

19. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models. Table 2. Test statistics and significances of the explanatory variables and their interactions in the AIC minimal adequate models for proportion of archaeophytes. Non-significant variables and their interactions are not shown. R2 = 0.82.
Explanatory variable archaeophytes | df | Deviance | P | R2
Habitat type 3190521.2
