Page 1: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 1

Machine Learning Algorithms: Theory,

Applications and Software Tools

Lecture 2 Basics of ANN: MLP

Prof. Mikhail Kanevski

Institute of Geomatics and Analysis of Risk,

University of Lausanne

[email protected]

Page 2: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 2

Contents

• Introduction to artificial neural networks

• Multilayer perceptron

• Case studies

Page 3: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 3

Basics of ANN

Artificial neural networks are analytical systems that address problems whose solutions have not been

explicitly formulated.

In this way they contrast with classical computers and computer programs, which are designed to solve problems whose solutions - although they may be extremely complex - have been made explicit.

Page 4: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 4

Basics of ANN

• We can program or train neural networks to store, recognise, and associatively retrieve patterns;

• to filter noise from measurement data;

• to control ill-defined problems;

In summary:
• to estimate sampled functions when we do not know the form of the functions.

Page 5: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 5

Basics of ANN

Unlike statistical estimators, they estimate a function without a mathematical model of how outputs

depend on inputs.

Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data.

Page 6: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 6

Basics of ANN

The major applications of ANN:
• Feature recognition (pattern classification), speech recognition
• Signal processing
• Time-series prediction
• Function approximation and regression, classification
• Data mining
• Intelligent control
• Associative memories
• Optimisation
• And many others

Page 7: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 7

Basics of ANN. Simple biological neuron

Page 8: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 8

Basics of ANN. Simple model of the neuron

Page 9: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 9

Examples of transfer functions:

$$f(x) = \frac{1}{1 + \exp(-x)} \qquad \text{(logistic sigmoid)}$$

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
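As a quick illustration, both transfer functions can be evaluated directly with NumPy; this is a minimal sketch added here, not code from the lecture.

```python
import numpy as np

def logistic(x):
    # f(x) = 1 / (1 + exp(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh_activation(x):
    # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), output in (-1, 1)
    return np.tanh(x)

x = np.linspace(-5, 5, 11)
print(logistic(x))
print(tanh_activation(x))
```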

Page 10: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 10

Basics of ANN

The main parts of ANN:

• Neurones (nodes, cells, units, processing elements)

• Network topology (connections between neurones)

Page 11: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 11

Basics of ANN

In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and connections form a network topology.

Page 12: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 12

Multilayer perceptron

Page 13: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 13

Basics of ANN. ANN learning/training

Supervised learning is the most common training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected, and its samples are presented to the network one by one. For each sample, the output produced by the network, O(Input(i)), is compared with the desired Output(i). After presenting the entire training subset, the weights are updated. This updating is done in such a way that a measure of the error between the network's and desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used.
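To make the notions of epoch and epoch size concrete, here is a minimal illustrative training loop in Python/NumPy; the "network" is reduced to a single linear neuron and the data are synthetic, which are assumptions made only for brevity, not the lecture's setup.

```python
import numpy as np

# Toy supervised training loop illustrating "epoch" and "epoch size":
# the "network" is a single linear neuron y = w.x + b fitted by
# batch gradient descent on squared error (illustrative sketch only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))          # training inputs
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5        # desired outputs
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(200):                        # one pass = one epoch
    idx = rng.choice(len(X), size=50, replace=False)  # epoch size = 50
    y = X[idx] @ w + b                          # network outputs
    err = y - t[idx]                            # network minus desired output
    w -= lr * X[idx].T @ err / len(idx)         # update after the whole subset
    b -= lr * err.mean()

print("learned weights:", w, "bias:", b)        # approaches [3, -2] and 0.5
```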

Page 14: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 14

Basics of ANN. ANN supervised learning.

[Block diagram: Examples → Neural network → Response; the Teacher evaluates the response, and the Learning Algorithm produces modifications to the network.]

Page 15: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 15

Basics of ANN. Feedforward ANN

If there are no feedback or lateral connections, we have a feedforward ANN. The most frequently used model is the so-called multilayer perceptron. The term feedforward means that information flows only in one direction, from the input to the output.

Page 16: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 16

ANN Multi-layer Perceptron (MLP)

• Depends only on the data and its inner structure

• Is able to learn from data and generalise

• Good at modelling non-linearities

• Robust to noise and outliers

[ANN = artificial neurons + connection weights]

Page 17: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 17

Basics of ANN

All of an ANN's knowledge is stored in the synaptic weights between units.

Page 18: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 18

The Universality Property

• A two layer feed-forward neural network

with step activation functions can implement any Boolean function,

provided that the number of hidden

neurons H is sufficiently large.

Page 19: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 19

MLP modelling

Schematically, the output of an MLP with one, two and three hidden layers (summation over the hidden neurons is implied):

$$F_1(t,\mathbf{w}) = w^{out} f(w^1 t + b^1) + b^{out},$$

$$F_2(t,\mathbf{w}) = w^{out} f\big(w^2 f(w^1 t + b^1) + b^2\big) + b^{out},$$

$$F_3(t,\mathbf{w}) = w^{out} f\Big(w^3 f\big(w^2 f(w^1 t + b^1) + b^2\big) + b^3\Big) + b^{out}.$$
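A small NumPy sketch of the corresponding forward passes with one and two hidden layers; the weights, sizes and sigmoid activation are illustrative assumptions, not the lecture's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer: scalar input t, vector of hidden outputs f(w1*t + b1),
# then a linear output neuron (sketch only).
def mlp_one_hidden(t, W1, b1, w_out, b_out):
    z1 = sigmoid(W1 * t + b1)
    return np.dot(w_out, z1) + b_out

# Two hidden layers: the first layer's outputs feed the second.
def mlp_two_hidden(t, W1, b1, W2, b2, w_out, b_out):
    z1 = sigmoid(W1 * t + b1)
    z2 = sigmoid(W2 @ z1 + b2)
    return np.dot(w_out, z2) + b_out

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=5), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
print(mlp_one_hidden(0.3, W1, b1, rng.normal(size=5), 0.1))
print(mlp_two_hidden(0.3, W1, b1, W2, b2, rng.normal(size=4), 0.1))
```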

Page 20: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 20

Backpropagation training

Page 21: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 21

The error function depends on the network's weights (W):

$$E_l(W) = \frac{1}{n}\sum_{j=0}^{n-1}\big\{T_{lj} - Z^{out}_{lj}(W)\big\}^2$$
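In code, this mean-squared-error measure is a one-liner; a minimal sketch with made-up numbers:

```python
import numpy as np

# Mean squared error between targets T and network outputs Z (sketch).
def mse(T, Z):
    T, Z = np.asarray(T), np.asarray(Z)
    return np.mean((T - Z) ** 2)

print(mse([1.0, 0.0, 1.0], [0.9, 0.2, 0.7]))   # small illustrative example
```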

Page 22: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 22

MLP training algorithms

Optimisation algorithms used for MLP training:

• Stochastic

− Annealing

− Genetic algorithm

• Gradient

− Conjugate gradients (slow 1st order gradient algorithm)

− Levenberg-Marquardt (fast 2nd order gradient algorithm)

− BFGS formula – quasi Newton

− Steepest Descent

− RProp – resilient propagation

− BackProp – back propagation

Page 23: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 23

Feedforward ANN: Multilayer

perceptron. Backprop algorithm

• The possibilities and capabilities of multi-layer perceptrons stem from

the non-linearities used within the nodes. An MLP can be trained with a supervised
learning rule: the backpropagation algorithm. The backward error
propagation algorithm for ANN learning/training caused a
breakthrough in the application of multilayer perceptrons.

• The backpropagation algorithm is a supervised learning algorithm. The

backpropagation algorithm is an iterative gradient algorithm

designed to minimise the error measure between the actual output of

the neural network and the desired output. We have to optimise a very

non-linear system consisting of a large number of highly correlated

variables.

Page 24: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 24

Basics of ANN. Backpropagation Algorithm

The backpropagation algorithm follows the next algorithmic steps:

• 1. Initialize weights. Usually it is recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters].

• 2. Present inputs and desired outputs. The vectors (Inputl, Outputl=tl) are presented to the network.

• 3. Calculate the actual output of the ANN.

Page 25: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 25

Basics of ANN. Backpropagation Algorithm

• 4. Calculate error measure and update the weights. Use a recursive algorithm starting at the

output neurons (nodes) and working back to the first hidden layer - it is this backward propagation of output errors that inspired the name for this training algorithm. Update the weights W by

Page 26: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 26

We want to know how to modify weights in order to decrease the error function:

$$w_{ij}(t+1) - w_{ij}(t) \propto -\frac{\partial E(t)}{\partial w_{ij}(t)}$$
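A minimal gradient-descent step illustrating this update; the finite-difference gradient and the toy quadratic error surface are assumptions made only for the sketch.

```python
import numpy as np

# One gradient-descent step on the weights: w(t+1) = w(t) - lr * dE/dw,
# using a numerical (central-difference) gradient of a user-supplied E.
def gradient_step(w, error_fn, lr=0.1, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        grad[i] = (error_fn(w + dw) - error_fn(w - dw)) / (2 * eps)
    return w - lr * grad

E = lambda w: np.sum((w - np.array([1.0, -2.0])) ** 2)   # toy error surface
w = np.zeros(2)
for _ in range(100):
    w = gradient_step(w, E)
print(w)   # approaches the minimum at [1, -2]
```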

Page 27: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 27

Basics of ANN. Backpropagation Algorithm

$$w^{m}_{ij}(n+1) = w^{m}_{ij}(n) + \eta\,\delta^{m}_{i}\,Z^{(m-1)}_{j}$$

where n is the iteration step, η is the rate of learning (0 < η ≤ 1), $Z^{(m-1)}_{j}$ is the output of the j-th neurone in layer (m-1), and the error $\delta^{m}_{i}$ for the output layer is defined by the following equation.

Page 28: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 28

Basics of ANN. Backpropagation Algorithm

For the output layer:

$$\delta^{out}_{i} = Z^{out}_{i}\,(1 - Z^{out}_{i})\,(T_i - Z^{out}_{i})$$

and for a hidden layer h (the sum runs over the neurons j of the following layer):

$$\delta^{h}_{i} = Z^{h}_{i}\,(1 - Z^{h}_{i})\sum_{j}\delta_{j}\,w^{h}_{ij}$$
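The following NumPy sketch applies these delta rules to a small one-hidden-layer MLP with sigmoid units; the data set, architecture and hyperparameters are illustrative assumptions, not the lecture's case study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal backpropagation sketch: 2 inputs, H hidden sigmoid units,
# 1 sigmoid output, batch updates with the delta rules quoted above.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
T = (X[:, 0] * X[:, 1] > 0).astype(float)       # a simple nonlinear target

H, eta = 8, 2.0
W1, b1 = rng.normal(scale=0.5, size=(H, 2)), np.zeros(H)
W2, b2 = rng.normal(scale=0.5, size=H), 0.0

for epoch in range(5000):
    Z1 = sigmoid(X @ W1.T + b1)                 # hidden outputs
    Z2 = sigmoid(Z1 @ W2 + b2)                  # network outputs
    d_out = Z2 * (1 - Z2) * (T - Z2)            # delta for the output layer
    d_hid = Z1 * (1 - Z1) * np.outer(d_out, W2) # delta backpropagated to hidden
    W2 += eta * Z1.T @ d_out / len(X)           # w <- w + eta * delta * Z
    b2 += eta * d_out.mean()
    W1 += eta * d_hid.T @ X / len(X)
    b1 += eta * d_hid.mean(axis=0)

print("training accuracy:", np.mean((Z2 > 0.5) == (T > 0.5)))
```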

Page 29: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 29

Basics of ANN. Backpropagation Algorithm

Other error measures (such as maximum absolute error and

median squared error) have even greater advantages in

many situations. For example, median squared error is useful because unlike the mean the median is a robust

statistic - its value is insensitive to occasional large errors

in the training data. Unfortunately, practical techniques for

implementing these more desirable error measures do not

yet exist. Thus, most neural networks today are tied to

mean squared error measurements.

Page 30: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 30

Basics of ANN. Backpropagation Algorithm

More general error functions can be written that take into account the importance of the samples presented to the network (weighting, declustering, economic criteria, etc.):

$$E_l(W) = \sum_{j=0}^{n-1}\big\{T_{lj} - Z^{out}_{lj}(W)\big\}^2\,\omega_{lj}$$

Page 31: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 31

Gradient descent

[Figure: error surface J(w) over weight w, with the minimum and the direction of the gradient J'(w) indicated.]

Page 32: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 32

Gradient descent

[Figure: gradient-descent steps on J(w) approaching the minimum.]

Page 33: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 33

In reality the situation with error

function and corresponding

optimization problem is much more

complicated:

the presence of multiple local minima!

Page 34: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 34

Gradient descent

Local minima

Page 35: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 35

Simulated Annealing (SA): Illustration

Page 36: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 36

How important are local minima? (Duda et al. 2001)

In computational practice, we do not want our network to be caught in a local minimum having high training error because this usually indicates that key features of the problem have not been learned by the network.

In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the network.

Page 37: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 37

How important are local minima? (Duda et al. 2001)

In many problems, convergence to a nonglobal minimum is acceptable, if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network be converging toward the global minimum or acceptable performance.

Page 38: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 38

In short

The presence of multiple minima does not

necessarily present difficulties in training

nets, and a few simple heuristics can often

overcome such problems (see next slide)

Page 39: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 39

Practical techniques for

improving backpropagation

• Activation function (sigmoid, hyperbolic tangent,..)

• Scaling inputs

• Training with noise (noise injection) - see the sketch after this list

• Initializing weights (simulated annealing)

• Regularization (weight decay)

• Number of hidden layers

• Learning parameters (rates, momentum,..)

• Cost function

• ………………………………….
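As referenced above, a minimal sketch of training with noise injection; `train_one_epoch` is a hypothetical placeholder for any MLP training step, not a function from the lecture.

```python
import numpy as np

# Noise injection (jitter): at each epoch the inputs are perturbed with
# small Gaussian noise, which acts as a simple regularizer (sketch only).
def train_with_noise(X, T, weights, train_one_epoch,
                     n_epochs=100, noise_std=0.05, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        X_noisy = X + rng.normal(scale=noise_std, size=X.shape)
        weights = train_one_epoch(X_noisy, T, weights)  # placeholder step
    return weights
```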

Page 40: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 40

Interpretation of network’s

outputs

Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form:

$$E = \lim_{N\to\infty}\frac{1}{2N}\sum_{n=1}^{N}\sum_{k}\big\{y_k(x^n;w) - t^n_k\big\}^2$$

$$= \frac{1}{2}\sum_{k}\iint\big\{y_k(x;w) - t_k\big\}^2\,p(t_k, x)\,dt_k\,dx$$

Page 41: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 41

Interpretation of network’s

outputs

the network mapping is given by the conditional average of the target data, i.e. the regression of $t_k$ conditioned on $x$:

$$y_k(x; w^*) = \langle t_k\,|\,x\rangle$$

Page 42: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 42

DEMO

Page 43: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 43

MLP and number of layers

• The problem with MLP using single hidden

layer is that the neurons tend to interact with

each other globally. In complex situations,

this interaction makes it difficult to improve

the approximation at one point without

worsening it at some other point.

• On the other hand, with two hidden layers,

the approximation process becomes more

manageable.

Page 44: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 44

Two hidden layers! (Haykin)

1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions.

2. Global features are extracted in the second layer. Specifically, a neuron in the second hidden layer combines the outputs of neurons in the first hidden layer operating on a particular region of the input space and thereby learns the global features for that region and outputs zero elsewhere.

Page 45: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 45

Data Preprocessing

• Machine learning algorithms are data-

driven methods.

• The quality and quantity of data are essential for training and generalization.

[Diagram: Input data → Pre-processing → MLA → Post-processing → Results]

Page 46: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 46

Types of pre-processing:

1. Linear and nonlinear transformations

e.g. input scaling/normalisation, Z-score transform,

square root transform, N-score transform, etc.

2. Dimensionality reduction

3. Incorporate prior knowledge

Invariants, hints,…

4. Feature extraction

linear/nonlinear combination of input variables

5. Feature selection: decide which features to use

Page 47: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 47

Dimensionality reduction

• Two approaches are available to perform

dimensionality reduction:

• Feature extraction: creating a subset of new

features by combinations of the existing features

• Feature selection: choosing a subset of all

the features (the ones more informative)

Page 48: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 48

Feature selection/extraction

Page 49: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 49

Feature selection

• Reducing the feature space by throwing

out some of the features (covariates)

– Also called variable selection

• Motivating idea: try to find a simple,

“parsimonious” model (Occam’s razor!)

Page 50: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 50

Univariate selection may fail

Guyon-Elisseeff, JMLR 2004; Springer 2006

Page 51: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 51

Dimensionality Reduction

Clearly we lose some information, but this can be helpful due to the curse of dimensionality.

We need some way of deciding which dimensions to keep:

1. Random choice

2. Principal components analysis (PCA) - see the sketch after this list

3. Independent components analysis (ICA)

4. Self-organised maps (SOM)
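As referenced in the list, a minimal PCA sketch using only NumPy (illustrative, not a production implementation; the synthetic correlated feature is an assumption for the demo).

```python
import numpy as np

# Minimal PCA: project the centered data onto the k leading eigenvectors
# of the covariance matrix.
def pca(X, k):
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]        # keep the k largest components
    return Xc @ eigvec[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)   # a correlated feature
print(pca(X, 2).shape)                           # (100, 2)
```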

Page 52: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 52

Data transform

• Y = aZ+b

• Y = Log(Z)

• Y = Ind(Z, Zs)

• Normalisation (Z-score): Y = (Z - Zm)/σ

• Box-Cox nonlinear transform:

$$Y(\lambda) = \frac{Z^{\lambda} - 1}{\lambda}\ \ \text{if } \lambda > 0, \qquad Y(\lambda = 0) = \ln(Z)$$
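A short sketch of the Z-score and Box-Cox transforms in NumPy; the lognormal test data is an assumption made for illustration.

```python
import numpy as np

# Simple pre-processing transforms from the list above (sketch).
def zscore(z):
    return (z - z.mean()) / z.std()

def box_cox(z, lam):
    # Y(lambda) = (Z**lambda - 1) / lambda for lambda > 0, ln(Z) for lambda = 0
    z = np.asarray(z, dtype=float)
    return np.log(z) if lam == 0 else (z ** lam - 1.0) / lam

z = np.random.default_rng(0).lognormal(size=1000)   # positive, skewed data
print(zscore(z)[:3])
print(box_cox(z, 0.5)[:3], box_cox(z, 0)[:3])
```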

Page 53: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 53

Model Selection & Model Evaluation

Page 54: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 54

Guillaume d'Occam (1285 - 1349)

“Pluralitas non est ponenda sine

necessitate”

Occam’s razor:

“The simpler explanation of a phenomenon is more likely to be correct.”

Page 55: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 55

Model Assessment and Model Selection:

Two separate goals

Page 56: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 56

Model Selection:

Estimating the performance of different

models in order to choose the

(approximate) best one

Model Assessment:

Having chosen a final model, estimating its

prediction error (generalization error) on

new data

Page 57: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 57

If we are in a data-rich situation, the best solution is to randomly (?) split the data:

Raw Data → Train: 50% + Validation: 25% + Test: 25%

Page 58: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 58

Interpretation

• The training set is used to fit the models

• The validation set is used to estimate prediction error for model selection (tuning hyperparameters)

• The test set is used for assessment of the generalization error of the final chosen model

Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
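A minimal sketch of such a random 50/25/25 split in NumPy; the proportions follow the previous slide, and the helper name `split_data` is illustrative.

```python
import numpy as np

# Random 50/25/25 split into training, validation and test sets (sketch).
def split_data(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.5 * len(X)), int(0.25 * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X, y = np.arange(40).reshape(20, 2), np.arange(20)
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 10 5 5
```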

Page 59: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 59

Bias and Variance.

Model’s complexity

[Plots of model fits illustrating (b) overfitting and (c) underfitting.]

Page 60: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 60

One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples.

This means that the learned function fits the training data very closely; however, it does not generalise well, that is, it cannot model sufficiently well unseen data from the same task.

Solution: balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error.

Page 61: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 61

Bias-Variance Dilemma

Assume that

$$Y = f(X) + \varepsilon, \qquad \text{where}\ \ E(\varepsilon) = 0,\ \ Var(\varepsilon) = \sigma^2_{\varepsilon}$$

Page 62: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 62

We can derive an expression for the expected prediction error of a

regression at an input point X=x0

using squared-error loss:

Page 63: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 63

$$Err(x_0) = E\big[(Y - \hat{f}(x_0))^2 \,\big|\, X = x_0\big]$$

$$= \sigma^2_{\varepsilon} + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2$$

$$= \sigma^2_{\varepsilon} + Bias^2\big(\hat{f}(x_0)\big) + Var\big(\hat{f}(x_0)\big)$$

$$= \text{Irreducible Error} + Bias^2 + Variance$$

Page 64: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 64

• The first term is the variance of the target around its true mean f(x0), and cannot be avoided no

matter how well we estimate f(x0), unless σ_ε² = 0.

• The second term is the squared bias, the amount by which the average of our estimate differs from the true mean

• The last term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.
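A small Monte Carlo sketch of this decomposition at a single point; the true function, noise level and polynomial estimator are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Repeatedly draw noisy training sets, fit a simple estimator, and compare
# the empirical Err(x0) with sigma^2 + bias^2 + variance (sketch only).
rng = np.random.default_rng(0)
f = np.sin                                   # "true" function
x0, sigma, n, deg, runs = 1.0, 0.3, 30, 3, 2000

preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, np.pi, n)
    y = f(x) + rng.normal(scale=sigma, size=n)        # Y = f(X) + eps
    coef = np.polyfit(x, y, deg)                      # the estimator f_hat
    preds[r] = np.polyval(coef, x0)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
err = np.mean((f(x0) + rng.normal(scale=sigma, size=runs) - preds) ** 2)
print("sigma^2 + bias^2 + var =", sigma**2 + bias2 + var, " Err(x0) =", err)
```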

Page 65: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 65

Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001

Page 66: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 66

Page 67: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 67

• A neural network is only as good as the

training data!

• Poor training data inevitably leads to an

unreliable and unpredictable network.

• Exploratory Data Analysis and data

preprocessing are extremely important!!!

Page 68: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 68

MLP modelling. Case Studies.

Original (10 000 points) Training (900 points)

Page 69: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 69

MLP modeling

Original MLP prediction

Train RMSE 1.97

Ro 0.69

Which result do you prefer?

Page 70: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 70

MLP modeling

Original MLP prediction

Train RMSE 1.61

Ro 0.80

Which result do you prefer?

Page 71: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 71

MLP modeling

Original MLP prediction

Train RMSE 1.67

Ro 0.79

Which result do you prefer?

Page 72: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 72

MLP modeling

Original MLP prediction

Train RMSE 1.10

Ro 0.92

Which result do you prefer?

Page 73: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 73

MLP modeling

Original MLP prediction

Train RMSE 0.83

Ro 0.95

Which result do you prefer?

Page 74: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 74

MLP modeling

Original MLP prediction

Train RMSE 0.55

Ro 0.98

Which result do you prefer?

Page 75: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 75

MLP modeling

Training statistics

[Charts: training RMSE and Ro versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20).]

Is model 20-20 the best?

Page 76: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 76

MLP modeling. Training statistics

MLP      RMSE   Ro
5        1.97   0.69
10       1.61   0.80
5-5      1.67   0.79
10-10    1.10   0.92
15-15    0.83   0.95
20-20    0.55   0.98

Page 77: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 77

MLP modeling

Training & Validation statistics

[Charts: training and validation RMSE and Ro versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20).]

Page 78: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 78

MLP modeling

Training & Validation statistics

[Charts repeated from the previous slide: training and validation RMSE and Ro versus MLP architecture.]

Page 79: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 79

MLP modeling. Validation statistics

MLP      RMSE   Ro
5        2.01   0.68
10       1.66   0.80
5-5      1.70   0.79
10-10    1.25   0.89
15-15    1.24   0.89
20-20    1.39   0.88

Page 80: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 80

ANNEX model: Artificial Neural Networks with EXternal drift. Environmental data mapping

Page 81: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 81

Traditional application of ANN to spatial predictions

• Data are available at measurement points: F(xi, yi), for i = 1, …, N

• ANN solution: x, y - 2 inputs, F - output
  - select ANN architecture
  - train with available data
  - after training, use to predict

• Problem: predict F(x, y) at the points without measurements (usually on a regular grid)

Page 82: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 82

If there is additional information (available at both training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN.

ANNEX is similar to the “Kriging with External Drift” model:

Inputs: x, y, + fext(x, y)
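A minimal sketch of how the external information enters the ANNEX inputs; `f_ext` is a hypothetical stand-in (e.g. altitude from a DEM), not the case-study data.

```python
import numpy as np

# ANNEX input construction (sketch): the external drift f_ext(x, y), known
# at both training and prediction points, is appended as a third input
# alongside the spatial coordinates.
def annex_inputs(xy, f_ext):
    xy = np.asarray(xy, dtype=float)
    drift = np.asarray([f_ext(x, y) for x, y in xy])
    return np.column_stack([xy, drift])       # inputs: x, y, f_ext(x, y)

f_ext = lambda x, y: 1000.0 * np.exp(-(x**2 + y**2))   # toy "altitude"
pts = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, -1.0]])
print(annex_inputs(pts, f_ext))               # shape (3, 3): ready for a 3-input MLP
```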

Page 83: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 83

Examples of external information

• Cheap information on a secondary variable
• Physical model of the phenomena
• Remotely sensed images
• GIS data
• DEM data

Page 84: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 84

Kriging with external drift

Kriging with external drift is the model in which trends are limited to

E{F(x,y)} = m(x,y) = λ0 + λ1 fext(x,y)    (1)

where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated.

In general, kriging with an external drift is a simple and efficient algorithm for incorporating a secondary variable into the estimation of the primary variable.

Page 85: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 85

ANNEX model

What relationship should there be between the primary and the external information in the case of ANNEX?

Page 86: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 86

ANNEX model

What does external “related” information bring? (How to measure it: correlation between the variables?)

• Improved accuracy of prediction?
• Reduced uncertainty of prediction?

An important problem is related to the question of the quality of the additional data: there is a dilemma between introducing new information and/or new noise.

Page 87: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 87

Case study: Kazakh Priaralie, monitoring network

1 400 000 km² - 400 monitoring stations

Page 88: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 88

Datasets

• GIS DEM model
• Average long-term air temperatures in June (°C)

Page 89: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 89

Correlation: air temperature vs. altitude

Page 90: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 90

Train and Test datasets

[Maps: Train and Test station locations]

Page 91: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 91

ANN and ANNEX models

Model                         Correlation   RMSE   MAE    MRE
2-7-5-1                       0.917         2.57   1.96   -0.02
3-3-1                         0.989         0.96   0.73   -0.01
3-5-1                         0.99          0.9    0.7    -0.007
3-7-1                         0.991         0.85   0.66   -0.004
3-8-1                         0.991         0.84   0.68   -0.001
3-9-1                         0.991         0.88   0.69   -0.01
3-10-1                        0.99          0.92   0.74   -0.01
Kriging with external drift   0.984         1.19   0.91   -0.03

Page 92: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 92

Scatter plots

[Panels: Kriging, Cokriging, Drift Kriging, ANNEX]

Page 93: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 93

Mapping results

[Maps: Kriging, Cokriging, Drift Kriging, ANNEX]

Page 94: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 94

Modelling noisy “altitude” effect (100 %)

[Maps: before and after adding noise]

Page 95: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 95

Scatter plots between variables (noisy 100 % altitude)

[Panels: Train, Test]

Page 96: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 96

Mapping noise results: ANNEX, air temperature (°C)

Page 97: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 97

Noise resultsNoise resultsModel Correlation RMSE MAE MRE

Kriging 0.874 3.13 2.04 -0.06

Kriging – external drift 0.984 1.19 0.91 -0.03

3-7-1 0.991 0.85 0.66 -0.004

3-8-1 0.991 0.84 0.68 -0.001

3-8-1

(100% noise)0.839 3.54 2.37 -0.13

3-7-1

(10% noise) Test 10.939 2.32 -1.49 -0.003

Kriging – external drift

(10% noise) Test 10.941 2.23 1.54 -0.06

3-7-1

(10% noise) Test 20.899 2.81 1.52 -0.08

Kriging – external drift

(10% noise) Test 20.903 2.81 1.59 -0.103

Page 98: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 98

MLP: real case study

Wind fields in Switzerland

Page 99: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 99

(pp 168-172 of the book)

Monitoring network:

111 stations in Switzerland

(80 training + 31 for validation)

Mapping of daily:

• Mean speed

• Maximum gust

• Average direction

Modeling of wind fields with MLP

using regularization technique

Page 100: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 100

Monitoring network: 111 stations in Switzerland (80 training + 31 for validation)

Mapping of daily:
• Mean speed
• Maximum gust
• Average direction

Input information:
• X, Y geographical coordinates
• DEM (resolution 500 m)
• 23 DEM-based « geo-features »
Total: 26 features

Modeling of wind fields with MLP and regularization technique

Model: MLP 26-20-20-3

Page 101: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 101

Model:

MLP 26-20-20-3

Training:

• Random initialization

• 500 iterations of the

RPROP algorithm

Training of the MLP

Page 102: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 102

Results: naïve approach

Page 103: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 103

Results: Noisy ejection regularization

Page 104: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 104

Results: summary

Noisy ejection regularization

Without regularization (overfitting)

Page 105: CSSS2010-20100803-Kanevski-lecture2

Prof. M. Kanevski 105

Conclusion

• MLP is a universal nonlinear tool for learning from and modeling data. An excellent exploratory tool.

• Its application demands deep expert knowledge and experience.