1
1
IV. MODELS FROM DATA
1
Data mining
2
Data mining
MODELLING METHODS – Data mining
3
Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in databases (KDD)
2. Data mining
• Data
• Patterns
• Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications:
• Equations
• Decision trees
• Rules
2
4
Knowledge discovery in databases (KDD)
What is KDD?
Frawley et al. (1991): “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
How to find patterns in data?
Data mining (DM) is the central step of the KDD process: it applies computational techniques to actually find the patterns in the data (typically 15-25% of the effort of the overall KDD process). The other steps are:
- step 1: preparing the data for DM (data preprocessing)
- step 3: evaluating the discovered patterns (the results of DM)
5
Knowledge discovery in databases (KDD)
When can patterns be treated as knowledge?
Frawley et al. (1991): “A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge.”
Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).
Condition 2: The patterns should potentially lead to some useful actions (according to user defined utility criteria).
6
Knowledge discovery in databases (KDD)
What may KDD contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology, …)?
ES deal with complex, unpredictable natural systems (e.g. arable, forest and water ecosystems) in order to answer complex questions.
The amount of collected environmental data is increasing exponentially.
KDD was purposely designed to cope with such complex questions about complex systems, for example:
- understanding the domain/system studied (e.g., gene flow, seed bank, life cycle, …)
- predicting future values of system variables of interest (e.g., rate of out-crossing with GM plants at location x at time y, seed bank dynamics, …)
3
7
Data mining (DM)
What is data mining?
Data mining is the process of automatically searching large volumes of data for patterns using algorithms.
Data mining – machine learning
Data mining is the application of machine learning techniques to data analysis problems.
The most relevant notions of data mining are:
1. Data
2. Patterns
3. Data mining algorithms
8
Data mining (DM) - data
1. What is data?
According to Fayyad et al. (1996): “Data is a set of facts, e.g., cases in a database.”
Data in DM is given in a single flat table:
- rows: objects or records (examples in ML)
- columns: properties of objects (attributes, features in ML)
This table is then used as input to a data mining algorithm.
Objects (rows) are described by their properties (columns):

Distance (m)   Wind direction (°)   Wind speed (m/s)   Out-crossing rate (%)
10             123                  3                  8
12             88                   4                  7
14             121                  6                  3
18             147                  2                  4
20             93                   1                  5
22             115                  3                  1
…              …                    …                  …
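The slides do not prescribe any software for this step; purely as an illustration (a minimal sketch assuming Python with pandas, which is not part of the lecture material), the flat table above could be built and inspected like this:

```python
import pandas as pd

# The example flat table from the slide: rows are objects (sampling points),
# columns are attributes (properties) describing each object.
data = pd.DataFrame(
    {
        "distance_m": [10, 12, 14, 18, 20, 22],
        "wind_direction_deg": [123, 88, 121, 147, 93, 115],
        "wind_speed_ms": [3, 4, 6, 2, 1, 3],
        "outcrossing_rate_pct": [8, 7, 3, 4, 5, 1],
    }
)

print(data)  # this table is what a data mining algorithm takes as input
```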
9
Data mining (DM) - pattern
2. What is a pattern?
A pattern is defined as: ”A statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset” (Frawley et al. 1991, Fayyad et al. 1996).
Classes of patterns considered in DM (depend on the data mining task at hand):
1. equations
2. decision trees
3. association, classification, and regression rules
4
10
Data mining (DM) - pattern
1. Equations
Predict the value of a target (dependent) variable as a linear or nonlinear combination of the input (independent) variables.

Linear equations involve:
- two variables: straight lines in a two-dimensional space
- three variables: planes in a three-dimensional space
- more variables: hyperplanes in multidimensional spaces

Nonlinear equations involve:
- two variables: curves in a two-dimensional space
- three variables: surfaces in a three-dimensional space
- more variables: hypersurfaces in multidimensional spaces
11
Data mining (DM) - pattern
2. Decision trees
Predict the value of one or several target (dependent) variables, called the class, from the values of other independent variables (attributes) by means of a decision tree.

A decision tree is a hierarchical structure where:
- each internal node contains a test on an attribute,
- each branch corresponds to an outcome of the test,
- each leaf gives a prediction for the value of the class variable.
12
Data mining (DM) - pattern
A decision tree is called:
• a classification tree if the class value in a leaf is discrete (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, …)
• a regression tree if the class value in a leaf is a constant (from an infinite set of values): e.g., 120, 220, 312, …
• a model tree if a leaf contains a linear model predicting the class value (a piece-wise linear function): e.g., out-crossing rate = 12.3 × distance - 0.123 × wind speed + 0.00123 × wind direction
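As a rough illustration only (a hypothetical Python representation, not taken from the slides), such a tree and its leaves can be sketched as nested nodes; the linear model in the first leaf reuses the out-crossing example above, the second leaf is a constant as in a regression tree:

```python
# A decision tree as nested dictionaries: internal nodes test an attribute,
# branches correspond to test outcomes, leaves hold a prediction.

def model_tree_leaf(x):
    # Model-tree leaf: piece-wise linear prediction (coefficients from the slide).
    return 12.3 * x["distance"] - 0.123 * x["wind_speed"] + 0.00123 * x["wind_direction"]

tree = {
    "test": lambda x: x["distance"] <= 15,   # internal node: test on an attribute
    "yes": {"leaf": model_tree_leaf},        # one outcome: leaf with a linear model
    "no": {"leaf": lambda x: 2.0},           # other outcome: leaf with a constant
}

def predict(node, x):
    if "leaf" in node:
        return node["leaf"](x)
    branch = "yes" if node["test"](x) else "no"
    return predict(node[branch], x)

print(predict(tree, {"distance": 10, "wind_speed": 3, "wind_direction": 123}))
```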
5
13
Data mining (DM) - pattern
3. Rules
Describe relationships among attributes, e.g. associations discovered in the form of association rules.
A rule denotes a pattern of the form: “IF conjunction of conditions THEN conclusion.”
• For classification rules, the conclusion assigns one of the possible discrete values to the class (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, spec. D)
• For predictive (regression) rules, the conclusion gives a prediction for the value of the target (class) variable (an infinite set of values): e.g., 120, 220, 312, …
14
Data mining (DM) - algorithm
3. What is a data mining algorithm?
An algorithm in general: a procedure (a finite set of well-defined instructions) for accomplishing some task that terminates in a defined end-state.
A data mining algorithm: a computational process, definable by a Turing machine (Gurevich et al. 2000), for finding patterns in data.
15
Data mining (DM) - algorithm
What kinds of algorithms do we use for discovering patterns?
It depends on the goals:
1. Equations = Linear and multiple regressions
2. Decision trees = Top/down induction of decision trees
3. Rules = Rule induction
6
16
Data mining (DM) - algorithm
1. Linear and multiple regression
• Bivariate linear regression: the predicted variable (the class C in ML terms, which may be continuous or discrete) is expressed as a linear function of one attribute (A):
C = α + β × A
• Multiple regression: the predicted variable C is expressed as a linear function of a multi-dimensional attribute vector (Ai):
C = Σ_{i=1}^{n} βi × Ai
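A minimal sketch of both fits, assuming Python with NumPy and made-up attribute values (not the lecture's data):

```python
import numpy as np

# Toy data: class variable C and two attributes A1, A2 (hypothetical values).
A = np.array([[10.0, 3.0], [12.0, 4.0], [14.0, 6.0], [18.0, 2.0], [20.0, 1.0]])
C = np.array([8.0, 7.0, 3.0, 4.0, 5.0])

# Bivariate linear regression C = alpha + beta * A1 (intercept column plus one attribute):
X1 = np.column_stack([np.ones(len(A)), A[:, 0]])
alpha, beta = np.linalg.lstsq(X1, C, rcond=None)[0]

# Multiple regression C = sum_i beta_i * A_i (intercept plus all attributes):
Xm = np.column_stack([np.ones(len(A)), A])
coeffs = np.linalg.lstsq(Xm, C, rcond=None)[0]

print(alpha, beta)
print(coeffs)
```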
17
Data mining (DM) - algorithm
2. Top-down induction of decision trees
A decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986).
Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.
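A simplified sketch of this recursion for nominal attributes (not Quinlan's exact ID3/C4.5: attribute selection here uses a plain misclassification count instead of information gain, and examples are assumed to be dicts with a "class" key):

```python
from collections import Counter

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def tdidt(examples, attributes):
    classes = {e["class"] for e in examples}
    if len(classes) == 1 or not attributes:      # stopping criterion
        return majority_class(examples)          # leaf

    # Select the attribute whose split misclassifies the fewest examples.
    def split_error(att):
        groups = {}
        for e in examples:
            groups.setdefault(e[att], []).append(e)
        return sum(len(g) - Counter(x["class"] for x in g).most_common(1)[0][1]
                   for g in groups.values())

    best = min(attributes, key=split_error)
    remaining = [a for a in attributes if a != best]

    # Split the current training set by the values of the selected attribute
    # and build a subtree for each subset.
    groups = {}
    for e in examples:
        groups.setdefault(e[best], []).append(e)
    return {"attribute": best,
            "branches": {value: tdidt(subset, remaining)
                         for value, subset in groups.items()}}
```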
18
Data mining (DM) - algorithm
3. Rule induction
A rule that correctly classifies some examples is constructed first.
The positive examples covered by the rule are then removed from the training set and the process is repeated until no more examples remain.
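A minimal sketch of this covering strategy (hedged: real rule learners such as CN2 search the space of conditions much more thoroughly; here each rule is a single attribute = value condition, and examples are dicts with a "class" key):

```python
def learn_rules(examples, attributes, target_class):
    rules, remaining = [], list(examples)
    while any(e["class"] == target_class for e in remaining):
        # Find the single condition (attribute = value) that best separates
        # examples of the target class from the rest.
        best, best_score = None, -1.0
        for att in attributes:
            for val in {e[att] for e in remaining}:
                covered = [e for e in remaining if e[att] == val]
                pos = sum(e["class"] == target_class for e in covered)
                if pos > 0 and pos / len(covered) > best_score:
                    best, best_score = (att, val), pos / len(covered)
        if best is None:
            break
        rules.append({"if": best, "then": target_class})
        # Remove the positive examples covered by the new rule and repeat.
        remaining = [e for e in remaining
                     if not (e[best[0]] == best[1] and e["class"] == target_class)]
    return rules
```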
7
19
Data mining (DM) - Statistics
Data mining vs. statistics
Common to both approaches:
Reasoning FROM properties of a data sample TO properties of a population.
20
Data mining (DM) – Machine learning – Statistics
Statistics
Hypothesis testing, when certain theoretical assumptions about the data distribution, independence, random sampling, sample size, etc. are satisfied.
Main approach: finding the model that best fits all the available data.
Data mining
Automated construction of understandable patterns and structured models.
Main approach: structuring the data space, heuristic search for decision trees, rules, … covering (parts of) the data space.
2121
DATA MINING – CASE STUDIES
8
2222
Practical implementations
Each class of patterns described above is illustrated with example applications:
1. Equations:
• Algebraic equations
• Differential equations
2. Decision trees:
• Classification trees
• Regression trees
• Model trees
3. Predictive rules
2323
Applications – Algebraic equations
Algebraic equations: CIPER
2424
Applications – Algebraic equations
Dataset
Hydrological conditions (HMS Lendava; monthly data on minimal, average and maximum values):
- Ledava River levels
- groundwater levels

Meteorological conditions (monthly data, HMS Lendava):
- time of solar radiation (h), precipitation (mm), ET (mm)
- number of days with white frost, number of days with snow
- T: max, average, min
- cumulative T > 0 °C, > 5 °C, and > 10 °C
- number of days with: minT > 0 °C, minT < -10 °C, minT < -4 °C, minT > 25 °C, maxT > 10 °C, maxT > 25 °C

Measured radial increments:
- 8 trees, 69 years old

Management data (thinning; m³/year removed from the stand; Forestry Unit Lendava)

• Monthly data + aggregated data (AMJ, MJJ, JJA, MJJA, etc.)
• Σ: 333 attributes; 35 years
Materials and methods
9
2525
Applications – Algebraic equations
• 52 different combinations of attributes were tested.
Σ: 124 models
Experiment   RRSE      # eq. elements
jnj3_2m      0.7282    6
jnj3_3s      0.7599    6
jnj3_1s      0.7614    6
jnj3_4m      0.76455   3
jnj2_2       0.7685    5
jly_4xl      0.7686    6
2626
Applications – Algebraic equations
Model jnj3_2m:
RadialGrowthIncrement =
  - 0.0511025526922 × minL8-10
  - 0.0291795197998 × maxL8-10
  - 0.017479975134 × t-sun4-7
  + 0.0346935385853 × t-sun8-10
  - 1.950606536e-05 × t-sun8-10^2
  - 2.01014710248 × d-wf-4-7
  + 9.35586778387e-05 × minL4-7 × t-sun4-7
  - 0.000179339939732 × minL4-7 × t-sun8-10
  + 6.45688563611e-05 × minL8-10 × t-sun8-10
  + 3.06551434164e-05 × maxL8-10 × t-sun4-7
  + 0.00282485442386 × t-sun4-7 × d-wf-4-7
  - 0.00141078675225 × t-sun8-10 × d-wf-4-7
  + 7.91071710872
Relative Root Squared Error = 0.728229824611
Correlation between average measured (r-aver8) and modelled increments (linear regression): R² = 0.8771
2727
Applications – Algebraic equations
Model jnj3_2m
10
2828
Applications – Algebraic equations
Algebraic equations: Lagramge
2929
Data source:
• Federal Biological Research Centre (BBA), Braunschweig, Germany (2000, 2001)
• Slovenian Agricultural Institute (KIS), Slovenia (2006)
Plants involved:
• BBA:
- transgenic maize (var. “Acrobat”, glufosinate-tolerant line) – donor
- non-transgenic maize field (var. “Anjou”) – receptor
• KIS:
- yellow kernel variety of maize (hybrid Bc462, simulating a transgenic maize variety) – donor
- white kernel variety of maize (variety Bc38W, simulating a non-GM variety) – receptor
Applications – Algebraic equations
3030
Field design 2000 (diagram): a 100 m × 220 m layout with a transgenic (donor) field and a non-transgenic (recipient) field, 2 m access paths, direction of drilling, and sampling points (labelled a–q) at distances of 2, 3, 4.5, 7.5, 13.5, 25.5 and 49.5 m; a coordinate system (e.g. a3) identifies the sampling points; north arrow.
Experiment design: 96 points, 60 cobs – 2500 kernels, % of outcrossing measured on the receptors (non-transgenic maize) relative to the donors (GMO maize).
Applications – Algebraic equations
11
3131
Selected attributes:
• % of outcrossing
• Cardinal direction of the sampling point from the center of donor field
• Visual angle between the sampling point and the donor field
• Distance from the sampling point to the center of the donor field
• The shortest distance between the sampling point and the donor field
• % of appropriate wind direction (exposure time)
• Length of the wind ventilation route
• Wind velocity
Applications – Algebraic equations: outcrossing rate
3232
Applications – Algebraic equations
3333
Applications – Algebraic equations
12
3434
Applications – Algebraic equations
3535
Applications – Algebraic equations
3636
Applications – Algebraic equations
13
3737
Applications – Differential equations
Differential equations: Lagramge
3838
Applications – Differential equations
3939
Applications – Differential equations
14
4040
Applications – Differential equations
(Diagram) Phosphorus: water in-flow and out-flow; respiration; growth
4141
Applications – Differential equations
(Diagram) Phytoplankton: growth, respiration, sedimentation, grazing
4242
Applications – Differential equations
(Diagram) Zooplankton: respiration, mortality; feeds on phytoplankton
15
4343
Applications – Differential equations
4444
Applications – Classification trees: habitat models
Classification trees: J48
4545
Applications – Classification trees: habitat models
Observed locations of brown bears (BBs) (map)
16
4646
Applications – Classification trees: habitat models
The training dataset
• Positive examples:
- Locations of bear sightings (Hunting association; telemetry)
- Females only
- Using home-range (HR) areas instead of “raw” locations
- Narrower HR for optimal habitat, wider for maximal
• Negative examples:
- Sampled from the unsuitable part of the study area
- Stratified random sampling
- Different land cover types equally accounted for
4747
Applications – Classification trees: habitat models
1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,4155.
1,62,37,0,0,2,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,1,53,0,3640,0,0,0,-1347,63,211,11,11,11,83,213,213,11,89,3858.
…
Dataset
Present: 1
Absent: 0
4848
Applications – Classification trees: habitat models
The model for optimal habitat
17
4949
Applications – Classification trees: habitat models
The model for maximal habitat
5050
Applications – Classification trees: habitat models
Map of optimal habitat (13% of SLO territory)
5151
Applications – Classification trees: habitat models
Map of maximal habitat (39% SLO territory)
18
5252
Applications – Multi-target classification : outcrossing rate
Multi-target classification model (Clus):
Marko Debeljak, Claire Lavigne, Sašo Džeroski, Damjan Demšar: Modelling pollen dispersal of genetically modified oilseed rape within the field. In: 90th ESA Annual Meeting [jointly with the] IX International Congress of Ecology, August 7-12, 2005, Montréal, Canada. Abstracts. [S.l.]: ESA, 2005, p. 152.
5353
Experiment design (diagram):
- field for receptors: 90 m × 90 m, with a 3 × 3 m grid = 841 nodes
- donors: MF transgenic oilseed rape “B004.oxy” (10 × 10 m)
- 10 seeds of MS oilseed rape “FU58B004” planted per node
- field planted with MF oilseed rape “B004”
- measured: % MS outcrossing and % MF outcrossing
Applications – Multi-target classification : outcrossing rate
5454
Selected attributes for modelling:
• Rate of outcrossing of MS and MF receptor plants [rate per 1000]
• Cardinal direction of the sampling point from the center of donor field [rad]
• Visual angle between the sampling point and the donor field [rad]
• Distance from the sampling point to the center of the donor field [m]
• The shortest distance between the sampling point and the donor field [m]
Applications – Multi-target classification : outcrossing rate
19
5555
• Number of examples: 817
• Correlation coefficient: MF: 0.821, MS: 0.846 (scatter plots for MS and MF)
Applications – Multi-target classification : outcrossing rate
5656
Applications – Multi-target regression model: soil resilience
5757
The dataset: soil samples taken at 26 locations throughout Scotland (SCO).
The dataset as a flat table: 26 × 18 data entries.
Applications – Multi-target regression trees: soil resilience
20
5858
The dataset:
• physical properties: soil texture: sand, silt, clay
• chemical properties: pH, C, N, SOM (soil organic matter)
• FAO soil classification: Order and Suborder
• physical resilience: resistance to compression: 1/Cc, recovery from compression: Ce/Cc, overburden stress: eg, recovery from overburden stress after two-day cycles: eg2dc
• biological resilience: heat, copper
Applications – Multi-target regression trees: soil resilience
5959
Applications – Multi-target regression trees: soil resilience
Different scenarios and multi-target regression models have been constructed:
A model predicting the resistance and resilience of soils to copper perturbation.
6060
Applications – Multi-target regression trees: soil resilience
Mapping soil functions is increasingly important for advising on land use and environmental management; the goal is to make a map of soil resilience for Scotland.
The models act as filters for existing GIS datasets on the physical and chemical properties of Scottish soils.
21
6161
Applications – Multi-target regression trees: soil resilience
Macaulay Institute (Aberdeen): soils data – attributes and maps:
Approximately 13 000 soil profiles held in database
Descriptions of over 40 000 soil horizons
6262
Application
6363
Application
22
6464
Application
USING RELATIONAL DECISION TREES TO MODEL FLEXIBLE CO-EXISTENCE MEASURES IN A MULTI-FIELD SETTING
Marko Debeljak1, Aneta Trajanov1, Daniela Stojanova1, Florence Leprince2, Sašo Džeroski1
1: Jožef Stefan Institute, Ljubljana, Slovenia
2: ARVALIS – Institut du végétal, Montardon, France
Marko Debeljak WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, 16-19 April, 2007
Introduction
Initial questions:
To what extent will the GM maize grown on Greens genetically interfere with the maize on Yellows?
Will this interference be small enough to allow co-existence?
(Field map, scale 100 m, north arrow: DKC6041 YG, sown 26/04; Pr33A46, 1.4 ha, sown 21/04; Pr34N44 YG, 8 ha, sown 20/04; Pr33A46, 24 rows.)
23
GIS
Relational data preprocessing
Relational data mining – results
Out-crossing rate: Threshold 0.9 %
24
Data
Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009
• 130 sites, monitored every 7 to 14 days for 5 months (2665 samples: 1322 conventional and 1333 HT OSR observations)
• Each sample (observation) described with 65 attributes
• Original data collected by Centre for Ecology and Hydrology, Rothamsted Research and SCRI within Farm Scale Evaluation Program (2000, 2001, 2002)
Results scenario B: Multiple target regression tree
Target: Avg Crop Covers, Avg Weed Covers
Excluded attributes: /
Constraints: MinimalInstances = 64.0; MaxSize = 15
Predictive power: Corr. coef.: 0.8513, 0.3746; RMSE: 16.504, 12.6038; RRMSE: 0.5248, 0.9301
syntactic constraint
Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)
25
Target: Avg Weed Covers (Time Series) Scenario 3.9
Constraints: Syntactic, MinInstances = 32
Predictive power: TSRMSExval: 4.98 TSRMSEtrain: 4.86 ICVtrain: 30.44
Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)
Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)
26
7676
Applications – Rules
7777
Applications – Rules
7878
Applications – Rules
The simulations were run with the first GENESYS version (published 2001, evaluated 2005, studied in sensitivity analyses 2004, 2005)
Only one field plan was used:
- maximising pollen and seed dispersal
27
7979
Applications – Rules
Large-risk field pattern
8080
Applications – Rules
Variables describing simulations
- simulation number
- genetic variables
- for each field (1 to 35), the cultivation techniques of years -3, -2, -1, 0
- for each field (1 to 35), the number of years since the last GM oilseed rape crop
- the number of years since the last non-GM oilseed rape crop
- proportion of GM seeds in non-GM oilseed rape of field 14 at year 0
- TOTAL NUMBER OF VARIABLES: 1899
8181
Applications – Rules
Run of experiment
• simulation started with an empty seed bank
• lasted 25 years,
• but only the last 4 years were kept in the files for data mining
• TOTAL NUMBER of simulations on the field pattern without the borders: 100 000
28
8282
Applications – Rules
Non-aggregated data: CUBIST
Options:
- use 60% of the data for training
- each rule must cover >= 1% of cases
- maximum of 10 rules

Rule 1: [29499 cases, mean 0.0160539, range 0 to 0.9883608, est err 0.0160594]
if
  SowingDateY0F14 > 258
then
  PropGMinField14 = 0.0056903

Rule 2: [12726 cases, mean 0.0297134, range 7.62473e-07 to 0.9883608, est err 0.0388636]
if
  SowingDateY0F14 > 258
  SowingDateY0F14 <= 277
then
  PropGMinField14 = 0.0451188 - 0.0024 YearsSinceLastGMcrop_F14
8383
Applications – Rules
Rule 4: [22830 cases, mean 0.0958018, range 0 to 0.9994726, est err 0.0884937]
if
  SowingDateY0F14 <= 258
  YearsSinceLastGMcrop_F14 > 2
then
  PropGMinField14 = 0.3722531 - 0.0013 SowingDateY0F14 - 0.00024 SowingDensityY0F14
8484
Applications – Rules
Rule 10: [1911 cases, mean 0.5392408, est err 0.2179439]
if
  TillageSoilBedPrepY0F14 in {0, 2}
  SowingDateY0F14 <= 258
  SowingDensityY0F14 <= 55
  YearsSinceLastGMcrop_F14 <= 2
then
  PropGMinField14 = 2.8659117 - 0.3607 YearsSinceLastGMcrop_F14 - 0.0087 SowingDateY0F14 - 0.0032 SowingDensityY0F14 + 0.00073 SowingDateY-1F14 + 0.21 HarvestLossY-1F14 + 0.13 HarvestLossY-2F14 + 0.00017 SowingDensityY-2F14 - 0.09 HarvestLossY-1F8 - 0.07 EfficHerb2Y-3F27 - 7e-05 2cuttingY-3F23 + 7e-05 2cuttingY-2F12 - 0.00027 SowingDateY-1F35 + 0.08 HarvestLossY0F9 + 5e-05 1cuttingY-3F17 + 0.00012 SowingDensityY-2F18 - 6e-05 2cuttingY-2F24 - 6e-05 2cuttingY-2F15 + 6e-05 2cuttingY0F32 - 6e-05 2cuttingY-2F16 + 0.00022 SowingDateY0F11 - 4e-05 1cuttingY0F33 + 0.0001 SowingDensityY-1F7 - 0.05 EfficHerb2Y-3F16 + 0.06 HarvestLossY-1F22 - 5e-05 2cuttingY-2F25 + 0.04 EfficHerb2GMvolunY0F6 + 0.04 EfficHerb1GMvolunY-2F27 + 0.04 fficHerb2GMvolunY0F5
29
8585
Applications – Rules
Non-aggregated data: CUBIST
Options:
- use 60% of the data for training
- each rule must cover >= 1% of cases
- maximum of 10 rules

Target attribute `PropGMinField14'

Evaluation on training data (60000 cases):
Average |error|: 0.0633466
Relative |error|: 0.47
Correlation coefficient: 0.77

Evaluation on test data (40000 cases):
Average |error|: 0.0657057
Relative |error|: 0.49
Correlation coefficient: 0.75
8686
Conclusions
What can data mining do for you?
Knowledge discovered by analyzing data with DM techniques can help:
• Understand the domain studied
• Make predictions/classifications
• Support decision processes in environmental management
8787
Conclusions
What can data mining not do for you?
• The law of information conservation (garbage in, garbage out)
• The knowledge we are seeking to discover has to come from the combination of data and background knowledge
• If we have very little data of very low quality and no background knowledge, no form of data analysis will help
30
8888
Conclusions
Side-effects?
• Discovering problems with the data during analysis
– missing values
– erroneous values
– inappropriately measured variables
• Identifying new opportunities
– new problems to be addressed
– recommendations on what data to collect and how
89
1. Data preprocessing
DATA MINING – Hands-on exercises
90
DATA MINING – data preprocessing
DATA FORMAT
• File extension: .arff
• This is a plain-text format; files should be edited with editors such as Notepad, TextPad or WordPad (which do not add extra formatting information)
• A file consists of:
– Title: @relation NameOfDataset
– List of attributes: @attribute AttName AttType
  • AttType can be ‘numeric’ or a nominal list of categorical values, e.g., ‘{red, green, blue}’
– Data: @data (on a separate line), followed by the actual data in comma-separated value (CSV) format
31
91
DATA MINING – data preprocessing
DATA FORMAT
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
…
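If you want to inspect such a file outside WEKA, a small sketch using SciPy's ARFF reader (assuming the example above is saved as weather.arff and that SciPy and pandas are installed, which the exercises do not require) is:

```python
from scipy.io import arff
import pandas as pd

# Load the ARFF file shown above (assumed to be saved as weather.arff).
data, meta = arff.loadarff("weather.arff")
df = pd.DataFrame(data)

print(meta)       # attribute names and types declared in the @attribute lines
print(df.head())  # the rows listed after @data
```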
92
DATA MINING – data preprocessing
• Excel
• Attributes (variables) in columns and cases (examples) in rows
• Use a decimal POINT and not a decimal COMMA for numbers
• Save the Excel sheet as a CSV file
93
DATA MINING – data preprocessing
• TextPad, Notepad
• Open CSV file
• Delete the quotation marks (“ ”) at the beginning of lines and save (just save, don’t change the format)
• Change all ; to ,
• Numbers must have a decimal point (.) and not a decimal comma (,)
• Save file as CSV file (don't change format)
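The same clean-up can be scripted instead of done by hand; a minimal sketch, assuming the exported file is called data.csv and uses ';' as the field separator and ',' as the decimal separator:

```python
# Clean an exported CSV so WEKA can read it:
# strip surrounding quotes, use ',' as separator and '.' as decimal point.
with open("data.csv", encoding="utf-8") as f:
    lines = [line.strip().strip('"') for line in f]

cleaned = []
for line in lines:
    line = line.replace(",", ".")   # decimal commas become decimal points
    line = line.replace(";", ",")   # field separators become commas
    cleaned.append(line)

with open("data_clean.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned) + "\n")
```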
32
94
2. Data mining
DATA MINING – Hands-on exercises
95
DATA MINING – data preprocessing
• WEKA
• Open the CSV file in WEKA
• Select algorithm and attributes
• Perform data mining
http://www.cs.waikato.ac.nz/ml/weka/index.html
96
How to select the “best” classification tree?
Performance of the classification tree:
A confusion matrix is a matrix showing actual versus predicted classifications.
Classification accuracy:
(correctly classified examples) / (all examples)
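For illustration, with hypothetical actual and predicted class values (not from the exercises), the confusion matrix and the classification accuracy can be computed like this:

```python
from collections import Counter

actual    = ["yes", "yes", "no", "no", "yes", "no"]   # hypothetical class values
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

# Confusion matrix: counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))
for (a, p), n in sorted(confusion.items()):
    print(f"actual={a:3s} predicted={p:3s} count={n}")

# Classification accuracy = correctly classified examples / all examples.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print("accuracy:", accuracy)
```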
(ROC curve: true positive rate vs. false positive rate)
33
97
How to select the “best” classification tree?
Classification trees: J48
the number of instances CORRECTLY classified into this leaf
the number of instances INCORRECTLY classified into this leaf
The leaf label could appear as:
- (13) – no incorrectly classified instances
- (3.5/0.5) – due to missing values (?), instances are split into fractional parts
- (0.0) – a split on a nominal attribute where one or more of the values do not occur in the subset of instances at the node in question

Interpretable size:
- pruned or unpruned
- minimal number of objects per leaf
98
How to select the “best” regression / model tree?
Performance of the regression / model tree:
99
The number of instances that REACH this leaf
Root of the mean squared error (RMSE) of the predictions from the leaf's linear model for the instances that reach the leaf, expressed as a percentage of the global standard deviation of the class attribute (i.e. the standard deviation of the class attribute computed from all the training data). These percentages do not sum to 100%.
The smaller this value, the better.
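A sketch of how such a percentage could be reproduced (WEKA's exact bookkeeping may differ slightly; the function and argument names are illustrative):

```python
import numpy as np

def leaf_relative_rmse(actual_at_leaf, predicted_at_leaf, all_training_targets):
    # RMSE of the leaf's predictions for the instances that reach the leaf ...
    errors = np.asarray(actual_at_leaf) - np.asarray(predicted_at_leaf)
    rmse = np.sqrt(np.mean(errors ** 2))
    # ... expressed as a percentage of the global standard deviation of the class.
    return 100.0 * rmse / np.std(all_training_targets)
```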
How to select the “best” regression / model tree?
The interpretable size:
- pruned or unpruned
- minimal number of objects per leaf
34
100
Accuracy and error
Avoid overfitting the data by tree pruning.
Pruned trees are:
- less accurate (percentage of correct classifications) on training data
- more accurate when classifying unseen data
101
How to prune optimally?
Pre-pruning: stop growing the tree early, e.g., when a data split is not statistically significant or when too few examples remain in a split (minimum number of objects in a leaf).
Post-pruning: grow the full tree, then prune it afterwards (confidence factor for classification trees).
102
Optimal accuracy
10-fold cross-validation is a standard classifier evaluation method used in machine learning:
- Break the data into 10 sets of size n/10.
- Train on 9 of the sets and test on the remaining one.
- Repeat 10 times, each time with a different test set, and take the mean accuracy.
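A minimal sketch of the procedure (the train_and_evaluate helper is hypothetical; any classifier training and evaluation routine could be plugged in):

```python
import random

def cross_validate(examples, train_and_evaluate, k=10, seed=0):
    # Break the data into k folds of (roughly) equal size.
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]

    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        # Train on k-1 folds, test on the remaining one.
        accuracies.append(train_and_evaluate(train, test))

    # Report the mean accuracy over the k repetitions.
    return sum(accuracies) / k
```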