1
1
IV. MODELS FROM DATA
1
Data mining
2
Data mining
MODELLING METHODS – Data mining
3
Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in databases (KDD)
2. Data mining
• Data
• Patterns
• Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications:
• Equations
• Decision trees
• Rules
2
4
Knowledge discovery in databases (KDD)
What is KDD?
Frawley et al. (1991): “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
How to find patterns in data?
Data mining (DM) is the central step of the KDD process: it applies computational techniques to actually find the patterns in the data (typically 15-25% of the effort of the overall KDD process). The other steps are:
- step 1: preparing the data for DM (data preprocessing)
- step 3: evaluating the discovered patterns (the results of DM)
5
Knowledge discovery in databases (KDD)
When can patterns be treated as knowledge?
Frawley et al. (1991): “A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge.”
Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).
Condition 2: The patterns should potentially lead to some useful actions (according to user defined utility criteria).
6
Knowledge discovery in databases (KDD)
What may KDD contribute to environmental sciences (ES) (e.g. agronomy, forestry, ecology, …)?
ES deal with complex, unpredictable natural systems (e.g. arable, forest and water ecosystems) in order to answer complex questions.
The amount of collected environmental data is increasing exponentially.
KDD was purposely designed to cope with such complex questions about complex systems, for example:
- understanding the domain/system studied (e.g., gene flow, seed bank, life cycle, …)
- predicting future values of system variables of interest (e.g., rate of out-crossing with GM plants at location x at time y, seed bank dynamics, …)
3
7
Data mining (DM)
What is data mining?
Data mining is the process of automatically searching large volumes of data for patterns using algorithms.
Data mining – machine learning
Data mining is the application of machine learning techniques to data analysis problems.
The most relevant notions of data mining are:
1. Data
2. Patterns
3. Data mining algorithms
8
Data mining (DM) - data
1. What is data?
According to Fayyad et al. (1996): “Data is a set of facts, e.g., cases in a database.”
Data in DM is given in a single flat table:
- rows: objects or records (examples in ML)
- columns: properties of objects (attributes, features in ML)
This table is then used as input to a data mining algorithm.
Objects (rows) are described by their properties (columns):

Distance (m)   Wind direction (°)   Wind speed (m/s)   Out-crossing rate (%)
10             123                  3                  8
12             88                   4                  7
14             121                  6                  3
18             147                  2                  4
20             93                   1                  5
22             115                  3                  1
…              …                    …                  …
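The slides do not prescribe any software for this step; purely as an illustration (a minimal sketch assuming Python with pandas, which is not part of the lecture material), the flat table above could be built and inspected like this:

```python
import pandas as pd

# The example flat table from the slide: rows are objects (sampling points),
# columns are attributes (properties) describing each object.
data = pd.DataFrame(
    {
        "distance_m": [10, 12, 14, 18, 20, 22],
        "wind_direction_deg": [123, 88, 121, 147, 93, 115],
        "wind_speed_ms": [3, 4, 6, 2, 1, 3],
        "outcrossing_rate_pct": [8, 7, 3, 4, 5, 1],
    }
)

print(data)  # this table is what a data mining algorithm takes as input
```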
9
Data mining (DM) - pattern
2. What is a pattern?
A pattern is defined as: ”A statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset” (Frawley et al. 1991, Fayyad et al. 1996).
Classes of patterns considered in DM (depend on the data mining task at hand):
1. equations
2. decision trees
3. association, classification, and regression rules
4
10
Data mining (DM) - pattern
1. Equations
Predict the value of a target (dependent) variable as a linear or nonlinear combination of the input (independent) variables.

Linear equations involve:
- two variables: straight lines in a two-dimensional space
- three variables: planes in a three-dimensional space
- more variables: hyperplanes in multidimensional spaces

Nonlinear equations involve:
- two variables: curves in a two-dimensional space
- three variables: surfaces in a three-dimensional space
- more variables: hypersurfaces in multidimensional spaces
11
Data mining (DM) - pattern
2. Decision trees
Predict the value of one or several target (dependent) variables, called the class, from the values of other independent variables (attributes) by means of a decision tree.

A decision tree is a hierarchical structure where:
- each internal node contains a test on an attribute,
- each branch corresponds to an outcome of the test,
- each leaf gives a prediction for the value of the class variable.
12
Data mining (DM) - pattern
A decision tree is called:
• a classification tree if the class value in a leaf is discrete (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, …)
• a regression tree if the class value in a leaf is a constant (from an infinite set of values): e.g., 120, 220, 312, …
• a model tree if a leaf contains a linear model predicting the class value (a piece-wise linear function): e.g., out-crossing rate = 12.3 × distance - 0.123 × wind speed + 0.00123 × wind direction
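As a rough illustration only (a hypothetical Python representation, not taken from the slides), such a tree and its leaves can be sketched as nested nodes; the linear model in the first leaf reuses the out-crossing example above, the second leaf is a constant as in a regression tree:

```python
# A decision tree as nested dictionaries: internal nodes test an attribute,
# branches correspond to test outcomes, leaves hold a prediction.

def model_tree_leaf(x):
    # Model-tree leaf: piece-wise linear prediction (coefficients from the slide).
    return 12.3 * x["distance"] - 0.123 * x["wind_speed"] + 0.00123 * x["wind_direction"]

tree = {
    "test": lambda x: x["distance"] <= 15,   # internal node: test on an attribute
    "yes": {"leaf": model_tree_leaf},        # one outcome: leaf with a linear model
    "no": {"leaf": lambda x: 2.0},           # other outcome: leaf with a constant
}

def predict(node, x):
    if "leaf" in node:
        return node["leaf"](x)
    branch = "yes" if node["test"](x) else "no"
    return predict(node[branch], x)

print(predict(tree, {"distance": 10, "wind_speed": 3, "wind_direction": 123}))
```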
5
13
Data mining (DM) - pattern
3. Rules
Describe relationships among attributes, e.g. associations discovered in the form of association rules.
A rule denotes a pattern of the form: “IF conjunction of conditions THEN conclusion.”
• For classification rules, the conclusion assigns one of the possible discrete values to the class (a finite set of nominal values): e.g., (yes, no), (spec. A, spec. B, spec. D)
• For predictive (regression) rules, the conclusion gives a prediction for the value of the target (class) variable (an infinite set of values): e.g., 120, 220, 312, …
14
Data mining (DM) - algorithm
3. What is a data mining algorithm?
An algorithm in general: a procedure (a finite set of well-defined instructions) for accomplishing some task that terminates in a defined end-state.
A data mining algorithm: a computational process, definable by a Turing machine (Gurevich et al. 2000), for finding patterns in data.
15
Data mining (DM) - algorithm
What kinds of algorithms do we use for discovering patterns?
It depends on the goals:
1. Equations = Linear and multiple regressions
2. Decision trees = Top/down induction of decision trees
3. Rules = Rule induction
6
16
Data mining (DM) - algorithm
1. Linear and multiple regression
• Bivariate linear regression: the predicted variable (the class C in ML terms, which may be continuous or discrete) is expressed as a linear function of one attribute (A):
C = α + β × A
• Multiple regression: the predicted variable C is expressed as a linear function of a multi-dimensional attribute vector (Ai):
C = Σ_{i=1}^{n} βi × Ai
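A minimal sketch of both fits, assuming Python with NumPy and made-up attribute values (not the lecture's data):

```python
import numpy as np

# Toy data: class variable C and two attributes A1, A2 (hypothetical values).
A = np.array([[10.0, 3.0], [12.0, 4.0], [14.0, 6.0], [18.0, 2.0], [20.0, 1.0]])
C = np.array([8.0, 7.0, 3.0, 4.0, 5.0])

# Bivariate linear regression C = alpha + beta * A1 (intercept column plus one attribute):
X1 = np.column_stack([np.ones(len(A)), A[:, 0]])
alpha, beta = np.linalg.lstsq(X1, C, rcond=None)[0]

# Multiple regression C = sum_i beta_i * A_i (intercept plus all attributes):
Xm = np.column_stack([np.ones(len(A)), A])
coeffs = np.linalg.lstsq(Xm, C, rcond=None)[0]

print(alpha, beta)
print(coeffs)
```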
17
Data mining (DM) - algorithm
2. Top-down induction of decision trees
A decision tree is induced by the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986).
Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.
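A simplified sketch of this recursion for nominal attributes (not Quinlan's exact ID3/C4.5: attribute selection here uses a plain misclassification count instead of information gain, and examples are assumed to be dicts with a "class" key):

```python
from collections import Counter

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def tdidt(examples, attributes):
    classes = {e["class"] for e in examples}
    if len(classes) == 1 or not attributes:      # stopping criterion
        return majority_class(examples)          # leaf

    # Select the attribute whose split misclassifies the fewest examples.
    def split_error(att):
        groups = {}
        for e in examples:
            groups.setdefault(e[att], []).append(e)
        return sum(len(g) - Counter(x["class"] for x in g).most_common(1)[0][1]
                   for g in groups.values())

    best = min(attributes, key=split_error)
    remaining = [a for a in attributes if a != best]

    # Split the current training set by the values of the selected attribute
    # and build a subtree for each subset.
    groups = {}
    for e in examples:
        groups.setdefault(e[best], []).append(e)
    return {"attribute": best,
            "branches": {value: tdidt(subset, remaining)
                         for value, subset in groups.items()}}
```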
18
Data mining (DM) - algorithm
3. Rule induction
A rule that correctly classifies some examples is constructed first.
The positive examples covered by the rule are then removed from the training set and the process is repeated until no more examples remain.
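A minimal sketch of this covering strategy (hedged: real rule learners such as CN2 search the space of conditions much more thoroughly; here each rule is a single attribute = value condition, and examples are dicts with a "class" key):

```python
def learn_rules(examples, attributes, target_class):
    rules, remaining = [], list(examples)
    while any(e["class"] == target_class for e in remaining):
        # Find the single condition (attribute = value) that best separates
        # examples of the target class from the rest.
        best, best_score = None, -1.0
        for att in attributes:
            for val in {e[att] for e in remaining}:
                covered = [e for e in remaining if e[att] == val]
                pos = sum(e["class"] == target_class for e in covered)
                if pos > 0 and pos / len(covered) > best_score:
                    best, best_score = (att, val), pos / len(covered)
        if best is None:
            break
        rules.append({"if": best, "then": target_class})
        # Remove the positive examples covered by the new rule and repeat.
        remaining = [e for e in remaining
                     if not (e[best[0]] == best[1] and e["class"] == target_class)]
    return rules
```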
7
19
Data mining (DM) - Statistics
Data mining vs. statistics
Common to both approaches:
Reasoning FROM properties of a data sample TO properties of a population.
20
Data mining (DM) – Machine learning – Statistics
Statistics
Hypothesis testing, when certain theoretical assumptions about the data distribution, independence, random sampling, sample size, etc. are satisfied.
Main approach: finding the model that best fits all the available data.
Data mining
Automated construction of understandable patterns and structured models.
Main approach: structuring the data space, heuristic search for decision trees, rules, … covering (parts of) the data space.
2121
DATA MINING – CASE STUDIES
8
2222
Practical implementations
Each class of patterns described above is illustrated with example applications:
1. Equations:
• Algebraic equations
• Differential equations
2. Decision trees:
• Classification trees
• Regression trees
• Model trees
3. Predictive rules
2323
Applications – Algebraic equations
Algebraic equations: CIPER
2424
Applications – Algebraic equations
Dataset
Hydrological conditions (HMS Lendava; monthly data on minimal, average and maximum values):
- Ledava River levels
- groundwater levels

Meteorological conditions (monthly data, HMS Lendava):
- time of solar radiation (h), precipitation (mm), ET (mm)
- number of days with white frost, number of days with snow
- T: max, average, min
- cumulative T > 0 °C, > 5 °C, and > 10 °C
- number of days with: minT > 0 °C, minT < -10 °C, minT < -4 °C, minT > 25 °C, maxT > 10 °C, maxT > 25 °C

Measured radial increments:
- 8 trees, 69 years old

Management data (thinning; m³/year removed from the stand; Forestry Unit Lendava)

• Monthly data + aggregated data (AMJ, MJJ, JJA, MJJA, etc.)
• Σ: 333 attributes; 35 years
Materials and methods
9
2525
Applications – Algebraic equations
• 52 different combinations of attributes were tested.
Σ: 124 models
Experiment   RRSE      # eq. elements
jnj3_2m      0.7282    6
jnj3_3s      0.7599    6
jnj3_1s      0.7614    6
jnj3_4m      0.76455   3
jnj2_2       0.7685    5
jly_4xl      0.7686    6
2626
Applications – Algebraic equations
Model jnj3_2m:
RadialGrowthIncrement =
  - 0.0511025526922 × minL8-10
  - 0.0291795197998 × maxL8-10
  - 0.017479975134 × t-sun4-7
  + 0.0346935385853 × t-sun8-10
  - 1.950606536e-05 × t-sun8-10^2
  - 2.01014710248 × d-wf-4-7
  + 9.35586778387e-05 × minL4-7 × t-sun4-7
  - 0.000179339939732 × minL4-7 × t-sun8-10
  + 6.45688563611e-05 × minL8-10 × t-sun8-10
  + 3.06551434164e-05 × maxL8-10 × t-sun4-7
  + 0.00282485442386 × t-sun4-7 × d-wf-4-7
  - 0.00141078675225 × t-sun8-10 × d-wf-4-7
  + 7.91071710872
Relative Root Squared Error = 0.728229824611
Correlation between average measured (r-aver8) and modelled increments (linear regression): R² = 0.8771
2727
Applications – Algebraic equations
Model jnj3_2m
10
2828
Applications – Algebraic equations
Algebraic equations: Lagramge
2929
Data source:
• Federal Biological Research Centre (BBA), Braunschweig, Germany (2000, 2001)
• Slovenian Agricultural Institute (KIS), Slovenia (2006)
Plants involved:
• BBA:
- transgenic maize (var. “Acrobat”, glufosinate-tolerant line) – donor
- non-transgenic maize field (var. “Anjou”) – receptor
• KIS:
- yellow kernel variety of maize (hybrid Bc462, simulating a transgenic maize variety) – donor
- white kernel variety of maize (variety Bc38W, simulating a non-GM variety) – receptor
Applications – Algebraic equations
3030
Field design 2000 (diagram): a 100 m × 220 m layout with a transgenic (donor) field and a non-transgenic (recipient) field, 2 m access paths, direction of drilling, and sampling points (labelled a–q) at distances of 2, 3, 4.5, 7.5, 13.5, 25.5 and 49.5 m; a coordinate system (e.g. a3) identifies the sampling points; north arrow.
Experiment design: 96 points, 60 cobs – 2500 kernels, % of outcrossing measured on the receptors (non-transgenic maize) relative to the donors (GMO maize).
Applications – Algebraic equations
11
3131
Selected attributes:
• % of outcrossing
• Cardinal direction of the sampling point from the center of donor field
• Visual angle between the sampling point and the donor field
• Distance from the sampling point to the center of the donor field
• The shortest distance between the sampling point and the donor field
• % of appropriate wind direction (exposure time)
• Length of the wind ventilation route
• Wind velocity
Applications – Algebraic equations: outcrossing rate
3232
Applications – Algebraic equations
3333
Applications – Algebraic equations
12
3434
Applications – Algebraic equations
3535
Applications – Algebraic equations
3636
Applications – Algebraic equations
13
3737
Applications – Differential equations
Differential equations: Lagramge
3838
Applications – Differential equations
3939
Applications – Differential equations
14
4040
Applications – Differential equations
(Diagram) Phosphorus: water in-flow and out-flow; respiration; growth
4141
Applications – Differential equations
(Diagram) Phytoplankton: growth, respiration, sedimentation, grazing
4242
Applications – Differential equations
(Diagram) Zooplankton: respiration, mortality; feeds on phytoplankton
15
4343
Applications – Differential equations
4444
Applications – Classification trees: habitat models
Classification trees: J48
4545
Applications – Classification trees: habitat models
Observed locations of brown bears (BBs) (map)
16
4646
Applications – Classification trees: habitat models
The training dataset
• Positive examples:
- Locations of bear sightings (Hunting association; telemetry)
- Females only
- Using home-range (HR) areas instead of “raw” locations
- Narrower HR for optimal habitat, wider for maximal
• Negative examples:
- Sampled from the unsuitable part of the study area
- Stratified random sampling
- Different land cover types equally accounted for
4747
Applications – Classification trees: habitat models
1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,4155.
1,62,37,0,0,2,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,1,53,0,3640,0,0,0,-1347,63,211,11,11,11,83,213,213,11,89,3858.
…
Dataset
Present: 1
Absent: 0
4848
Applications – Classification trees: habitat models
The model for optimal habitat
17
4949
Applications – Classification trees: habitat models
The model for maximal habitat
5050
Applications – Classification trees: habitat models
Map of optimal habitat (13% of SLO territory)
5151
Applications – Classification trees: habitat models
Map of maximal habitat (39% SLO territory)
18
5252
Applications – Multi-target classification : outcrossing rate
Multi-target classification model (Clus):
Marko Debeljak, Claire Lavigne, Sašo Džeroski, Damjan Demšar: Modelling pollen dispersal of genetically modified oilseed rape within the field. In: 90th ESA Annual Meeting [jointly with the] IX International Congress of Ecology, August 7-12, 2005, Montréal, Canada. Abstracts. [S.l.]: ESA, 2005, p. 152.
5353
Experiment design (diagram):
- field for receptors: 90 m × 90 m, with a 3 × 3 m grid = 841 nodes
- donors: MF transgenic oilseed rape “B004.oxy” (10 × 10 m)
- 10 seeds of MS oilseed rape “FU58B004” planted per node
- field planted with MF oilseed rape “B004”
- measured: % MS outcrossing and % MF outcrossing
Applications – Multi-target classification : outcrossing rate
5454
Selected attributes for modelling:
• Rate of outcrossing of MS and MF receptor plants [rate per 1000]
• Cardinal direction of the sampling point from the center of donor field [rad]
• Visual angle between the sampling point and the donor field [rad]
• Distance from the sampling point to the center of the donor field [m]
• The shortest distance between the sampling point and the donor field [m]
Applications – Multi-target classification : outcrossing rate
19
5555
• Number of examples: 817
• Correlation coefficient: MF: 0.821, MS: 0.846 (scatter plots for MS and MF)
Applications – Multi-target classification : outcrossing rate
5656
Applications – Multi-target regression model: soil resilience
5757
The dataset: soil samples taken at 26 locations throughout Scotland (SCO).
The dataset as a flat table: 26 × 18 data entries.
Applications – Multi-target regression trees: soil resilience
20
5858
The dataset:
• physical properties: soil texture: sand, silt, clay
• chemical properties: pH, C, N, SOM (soil organic matter)
• FAO soil classification: Order and Suborder
• physical resilience: resistance to compression: 1/Cc, recovery from compression: Ce/Cc, overburden stress: eg, recovery from overburden stress after two-day cycles: eg2dc
• biological resilience: heat, copper
Applications – Multi-target regression trees: soil resilience
5959
Applications – Multi-target regression trees: soil resilience
Different scenarios and multi-target regression models have been constructed:
A model predicting the resistance and resilience of soils to copper perturbation.
6060
Applications – Multi-target regression trees: soil resilience
Mapping soil functions is increasingly important for advising on land use and environmental management; the goal is to make a map of soil resilience for Scotland.
The models act as filters for existing GIS datasets on the physical and chemical properties of Scottish soils.
21
6161
Applications – Multi-target regression trees: soil resilience
Macaulay Institute (Aberdeen): soils data – attributes and maps:
Approximately 13 000 soil profiles held in database
Descriptions of over 40 000 soil horizons
6262
Application
6363
Application
22
6464
Application
USING RELATIONAL DECISION TREES TO MODEL FLEXIBLE CO-EXISTENCE MEASURES IN A MULTI-FIELD SETTING
Marko Debeljak1, Aneta Trajanov1, Daniela Stojanova1, Florence Leprince2, Sašo Džeroski1
1: Jožef Stefan Institute, Ljubljana, Slovenia
2: ARVALIS – Institut du végétal, Montardon, France
Marko Debeljak WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, 16-19 April, 2007
Introduction
Initial questions:
To what extent will the GM maize grown on Greens genetically interfere with the maize on Yellows?
Will this interference be small enough to allow co-existence?
(Field map, scale 100 m, north arrow: DKC6041 YG, sown 26/04; Pr33A46, 1.4 ha, sown 21/04; Pr34N44 YG, 8 ha, sown 20/04; Pr33A46, 24 rows.)
23
GIS
Relational data preprocessing
Relational data mining – results
Out-crossing rate: Threshold 0.9 %
24
Data
Marko Debeljak ISEM '09, Quebec, Canada, 6-9 October 2009
• 130 sites, monitored every 7 to 14 days for 5 months (2665 samples: 1322 conventional and 1333 HT OSR observations)
• Each sample (observation) described with 65 attributes
• Original data collected by Centre for Ecology and Hydrology, Rothamsted Research and SCRI within Farm Scale Evaluation Program (2000, 2001, 2002)
Results scenario B: Multiple target regression tree
Target: Avg Crop Covers, Avg Weed Covers
Excluded attributes: /
Constraints: MinimalInstances = 64.0; MaxSize = 15
Predictive power: Corr. coef.: 0.8513, 0.3746; RMSE: 16.504, 12.6038; RRMSE: 0.5248, 0.9301
syntactic constraint
Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)
25
Target: Avg Weed Covers (Time Series) Scenario 3.9
Constraints: Syntactic, MinInstances = 32
Predictive power: TSRMSExval: 4.98 TSRMSEtrain: 4.86 ICVtrain: 30.44
Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)
Results scenario D: Constraint predictive clustering trees for time series including TS clusters for crop (CLUS)
26
7676
Applications – Rules
7777
Applications – Rules
7878
Applications – Rules
The simulations were run with the first GENESYS version (published 2001, evaluated 2005, studied in sensitivity analyses 2004, 2005)
Only one field plan was used:
- maximising pollen and seed dispersal
27
7979
Applications – Rules
Large-risk field pattern
8080
Applications – Rules
Variables describing simulations
- simulation number
- genetic variables
- for each field (1 to 35), the cultivation techniques of years -3, -2, -1, 0
- for each field (1 to 35), the number of years since the last GM oilseed rape crop
- the number of years since the last non-GM oilseed rape crop
- proportion of GM seeds in non-GM oilseed rape of field 14 at year 0
- TOTAL NUMBER OF VARIABLES: 1899
8181
Applications – Rules
Run of experiment
• simulation started with an empty seed bank
• lasted 25 years,
• but only the last 4 years were kept in the files for data mining
• TOTAL NUMBER of simulations on the field pattern without the borders: 100 000
28
8282
Applications – Rules
Non-aggregated data: CUBIST
Options:
- use 60% of the data for training
- each rule must cover >= 1% of cases
- maximum of 10 rules

Rule 1: [29499 cases, mean 0.0160539, range 0 to 0.9883608, est err 0.0160594]
if
  SowingDateY0F14 > 258
then
  PropGMinField14 = 0.0056903

Rule 2: [12726 cases, mean 0.0297134, range 7.62473e-07 to 0.9883608, est err 0.0388636]
if
  SowingDateY0F14 > 258
  SowingDateY0F14 <= 277
then
  PropGMinField14 = 0.0451188 - 0.0024 YearsSinceLastGMcrop_F14
8383
Applications – Rules
Rule 4: [22830 cases, mean 0.0958018, range 0 to 0.9994726, est err 0.0884937]
if
  SowingDateY0F14 <= 258
  YearsSinceLastGMcrop_F14 > 2
then
  PropGMinField14 = 0.3722531 - 0.0013 SowingDateY0F14 - 0.00024 SowingDensityY0F14
8484
Applications – Rules
Rule 10: [1911 cases, mean 0.5392408, est err 0.2179439]
if
  TillageSoilBedPrepY0F14 in {0, 2}
  SowingDateY0F14 <= 258
  SowingDensityY0F14 <= 55
  YearsSinceLastGMcrop_F14 <= 2
then
  PropGMinField14 = 2.8659117 - 0.3607 YearsSinceLastGMcrop_F14 - 0.0087 SowingDateY0F14 - 0.0032 SowingDensityY0F14 + 0.00073 SowingDateY-1F14 + 0.21 HarvestLossY-1F14 + 0.13 HarvestLossY-2F14 + 0.00017 SowingDensityY-2F14 - 0.09 HarvestLossY-1F8 - 0.07 EfficHerb2Y-3F27 - 7e-05 2cuttingY-3F23 + 7e-05 2cuttingY-2F12 - 0.00027 SowingDateY-1F35 + 0.08 HarvestLossY0F9 + 5e-05 1cuttingY-3F17 + 0.00012 SowingDensityY-2F18 - 6e-05 2cuttingY-2F24 - 6e-05 2cuttingY-2F15 + 6e-05 2cuttingY0F32 - 6e-05 2cuttingY-2F16 + 0.00022 SowingDateY0F11 - 4e-05 1cuttingY0F33 + 0.0001 SowingDensityY-1F7 - 0.05 EfficHerb2Y-3F16 + 0.06 HarvestLossY-1F22 - 5e-05 2cuttingY-2F25 + 0.04 EfficHerb2GMvolunY0F6 + 0.04 EfficHerb1GMvolunY-2F27 + 0.04 fficHerb2GMvolunY0F5
29
8585
Applications – Rules
Non-aggregated data: CUBIST
Options:
- use 60% of the data for training
- each rule must cover >= 1% of cases
- maximum of 10 rules

Target attribute `PropGMinField14'

Evaluation on training data (60000 cases):
Average |error|: 0.0633466
Relative |error|: 0.47
Correlation coefficient: 0.77

Evaluation on test data (40000 cases):
Average |error|: 0.0657057
Relative |error|: 0.49
Correlation coefficient: 0.75
8686
Conclusions
What can data mining do for you?
Knowledge discovered by analyzing data with DM techniques can help:
• Understand the domain studied
• Make predictions/classifications
• Support decision processes in environmental management
8787
Conclusions
What can data mining not do for you?
• The law of information conservation (garbage in, garbage out)
• The knowledge we are seeking to discover has to come from the combination of data and background knowledge
• If we have very little data of very low quality and no background knowledge, no form of data analysis will help
30
8888
Conclusions
Side-effects?
• Discovering problems with the data during analysis
– missing values
– erroneous values
– inappropriately measured variables
• Identifying new opportunities
– new problems to be addressed
– recommendations on what data to collect and how
89
1. Data preprocessing
DATA MINING – Hands-on exercises
90
DATA MINING – data preprocessing
DATA FORMAT
• File extension: .arff
• This is a plain-text format; files should be edited with editors such as Notepad, TextPad or WordPad (which do not add extra formatting information)
• A file consists of:
– Title: @relation NameOfDataset
– List of attributes: @attribute AttName AttType
  • AttType can be ‘numeric’ or a nominal list of categorical values, e.g., ‘{red, green, blue}’
– Data: @data (on a separate line), followed by the actual data in comma-separated value (CSV) format
31
91
DATA MINING – data preprocessing
DATA FORMAT
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
…
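If you want to inspect such a file outside WEKA, a small sketch using SciPy's ARFF reader (assuming the example above is saved as weather.arff and that SciPy and pandas are installed, which the exercises do not require) is:

```python
from scipy.io import arff
import pandas as pd

# Load the ARFF file shown above (assumed to be saved as weather.arff).
data, meta = arff.loadarff("weather.arff")
df = pd.DataFrame(data)

print(meta)       # attribute names and types declared in the @attribute lines
print(df.head())  # the rows listed after @data
```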
92
DATA MINING – data preprocessing
• Excel
• Attributes (variables) in columns and cases (examples) in rows
• Use a decimal POINT and not a decimal COMMA for numbers
• Save the Excel sheet as a CSV file
93
DATA MINING – data preprocessing
• TextPad, Notepad
• Open CSV file
• Delete the quotation marks (“ ”) at the beginning of lines and save (just save, don’t change the format)
• Change all ; to ,
• Numbers must have a decimal point (.) and not a decimal comma (,)
• Save file as CSV file (don't change format)
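The same clean-up can be scripted instead of done by hand; a minimal sketch, assuming the exported file is called data.csv and uses ';' as the field separator and ',' as the decimal separator:

```python
# Clean an exported CSV so WEKA can read it:
# strip surrounding quotes, use ',' as separator and '.' as decimal point.
with open("data.csv", encoding="utf-8") as f:
    lines = [line.strip().strip('"') for line in f]

cleaned = []
for line in lines:
    line = line.replace(",", ".")   # decimal commas become decimal points
    line = line.replace(";", ",")   # field separators become commas
    cleaned.append(line)

with open("data_clean.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned) + "\n")
```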
32
94
2. Data mining
DATA MINING – Hands-on exercises
95
DATA MINING – data preprocessing
• WEKA
• Open the CSV file in WEKA
• Select algorithm and attributes
• Perform data mining
http://www.cs.waikato.ac.nz/ml/weka/index.html
96
How to select the “best” classification tree?
Performance of the classification tree:
A confusion matrix is a matrix showing actual versus predicted classifications.
Classification accuracy:
(correctly classified examples) / (all examples)
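For illustration, with hypothetical actual and predicted class values (not from the exercises), the confusion matrix and the classification accuracy can be computed like this:

```python
from collections import Counter

actual    = ["yes", "yes", "no", "no", "yes", "no"]   # hypothetical class values
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

# Confusion matrix: counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))
for (a, p), n in sorted(confusion.items()):
    print(f"actual={a:3s} predicted={p:3s} count={n}")

# Classification accuracy = correctly classified examples / all examples.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print("accuracy:", accuracy)
```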
(ROC curve: true positive rate vs. false positive rate)
33
97
How to select the “best” classification tree?
Classification trees: J48
the number of instances CORRECTLY classified into this leaf
the number of instances INCORRECTLY classified into this leaf
The leaf label could appear as:
- (13) – no incorrectly classified instances
- (3.5/0.5) – due to missing values (?), instances are split into fractional parts
- (0.0) – a split on a nominal attribute where one or more of the values do not occur in the subset of instances at the node in question

Interpretable size:
- pruned or unpruned
- minimal number of objects per leaf
98
How to select the “best” regression / model tree?
Performance of the regression / model tree:
99
The number of instances that REACH this leaf
Root of the mean squared error (RMSE) of the predictions from the leaf's linear model for the instances that reach the leaf, expressed as a percentage of the global standard deviation of the class attribute (i.e. the standard deviation of the class attribute computed from all the training data). These percentages do not sum to 100%.
The smaller this value, the better.
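A sketch of how such a percentage could be reproduced (WEKA's exact bookkeeping may differ slightly; the function and argument names are illustrative):

```python
import numpy as np

def leaf_relative_rmse(actual_at_leaf, predicted_at_leaf, all_training_targets):
    # RMSE of the leaf's predictions for the instances that reach the leaf ...
    errors = np.asarray(actual_at_leaf) - np.asarray(predicted_at_leaf)
    rmse = np.sqrt(np.mean(errors ** 2))
    # ... expressed as a percentage of the global standard deviation of the class.
    return 100.0 * rmse / np.std(all_training_targets)
```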
How to select the “best” regression / model tree?
The interpretable size:
- pruned or unpruned
- minimal number of objects per leaf
34
100
Accuracy and error
Avoid overfitting the data by tree pruning.
Pruned trees are:
- less accurate (percentage of correct classifications) on training data
- more accurate when classifying unseen data
101
How to prune optimally?
Pre-pruning: stop growing the tree early, e.g., when a data split is not statistically significant or when too few examples remain in a split (minimum number of objects in a leaf).
Post-pruning: grow the full tree, then prune it afterwards (confidence factor for classification trees).
102
Optimal accuracy
10-fold cross-validation is a standard classifier evaluation method used in machine learning:
- Break the data into 10 sets of size n/10.
- Train on 9 of the sets and test on the remaining one.
- Repeat 10 times, each time with a different test set, and take the mean accuracy.
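A minimal sketch of the procedure (the train_and_evaluate helper is hypothetical; any classifier training and evaluation routine could be plugged in):

```python
import random

def cross_validate(examples, train_and_evaluate, k=10, seed=0):
    # Break the data into k folds of (roughly) equal size.
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]

    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        # Train on k-1 folds, test on the remaining one.
        accuracies.append(train_and_evaluate(train, test))

    # Report the mean accuracy over the k repetitions.
    return sum(accuracies) / k
```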