Kernel Methods for Land Cover Classification and Prediction


Slide 1/66

Beyond Neural Networks: New Algorithms for Classification and Prediction

    MAHESH PAL

    Department of Civil Engineering

    National Institute of Technology

    Kurukshetra, 136119, INDIA

Slide 2/66

Neural networks

Support vector machines

Relevance vector machines

Random forest classifier

Extreme learning machines

Slide 3/66

    3D GEOLOGICAL MODELING: SOLVING AS A CLASSIFICATION PROBLEM

    WITH THE SUPPORT VECTOR MACHINE

    3-D SEISMIC-BASED LITHOLOGY PREDICTION USING IMPEDANCE

    INVERSION AND NEURAL NETWORKS APPLICATION: CASE-STUDY

    FROM THE MANNVILLE GROUP IN EAST-CENTRAL ALBERTA, CANADA

    EVALUATING CLASSIFICATION TECHNIQUES FOR MAPPING VERTICAL

    GEOLOGY USING FIELD-BASED HYPERSPECTRAL SENSORS

    FLOW UNIT PREDICTION WITH LIMITED PERMEABILITY DATA USING

    ARTIFICIAL NEURAL NETWORK ANALYSIS (WVU, PhD, 2002)

    SUBSURFACE CHARACTERIZATION WITH SUPPORT VECTOR

    MACHINES

SUPPORT VECTOR MACHINES FOR DELINEATION OF GEOLOGIC FACIES FROM POORLY DIFFERENTIATED DATA

    SUPERIORITIES OF SUPPORT VECTOR MACHINE IN FRACTURE

    PREDICTION AND GASSINESS EVALUATION

Slide 4/66

    DYNAMICS OF WATER TRANSPORT THROUGH CATCHMENT OF DANUBE

RIVER TRACED BY 3H AND 18O - THE NEURAL NETWORK APPROACH

    A COMBINED STABLE ISOTOPE AND MACHINE LEARNING APPROACH TO

    QUANTIFY AND CLASSIFY NITRATE POLLUTION SOURCES IN WATER

    USING GEOCHEMISTRY AND NEURAL NETWORKS TO MAP GEOLOGY

    UNDER GLACIAL COVER

    POROSITY AND PERMEABILITY ESTIMATION USING NEURAL NETWORK

    APPROACH FROM WELL LOG DATA

    ILLINOIS STATEWIDE MONITORING WELL NETWORK FOR PESTICIDES IN

    SHALLOW GROUNDWATER (AQUIFER SENSITIVITY TO CONTAMINATION

    BY PESTICIDE LEACHING USING NN).

    APPLICATION OF ARTIFICIAL NEURAL NETWORKS IN HYDROGEOLOGY:

    IDENTIFICATION OF UNKNOWN POLLUTION SOURCES IN

    CONTAMINATED AQUIFERS

Slide 5/66

Classification has been a major research area using remote sensing images.

A major input in GIS-based studies.

    Several approaches are used.

Slide 6/66

    Classification Algorithms

    Supervised - requires labelled training data

Unsupervised - searches for natural groups of data, called clusters.

Slide 7/66

    Parametric

    Maximum likelihood classifier

    Nonparametric

Neural network, support vector machines, relevance vector machines, random forest classifier, extreme learning machine

Slide 8/66

For classification/regression, a training sample is made available to the learning algorithm (e.g. neural network, SVM, RVM, random forest, extreme learning machine).

After training, the learning algorithm outputs a model or function, which is called the hypothesis.

This hypothesis can be considered a machine that outputs the prediction for new test data.

Slide 9/66

[Diagram: training samples → learning algorithm → model/function (also called the hypothesis); testing samples → hypothesis → output values.]

The hypothesis can be considered a machine that provides the prediction for test data.

Slide 10/66

    Neural Network

A major research area during 1990-2000 for classification/regression, and still in use.

No assumption about the data distribution.

Works well with different data, including remote sensing data.

Slide 11/66

[Diagram: feed-forward neural network with input layer, hidden layer, and output layer; $w_{ij}$ are the input-to-hidden weights and $w_k$ the hidden-to-output weights.]

Slide 12/66

The interconnecting weights are determined during the training process.

A number of algorithms can be used to adjust the interconnecting weights; back-propagation is the most commonly used method.

The error between actual and predicted values is fed backwards through the network towards the input layer, and the connecting weights change in relation to the magnitude of the error.

Uses an iterative process to minimize the error (see the sketch below).
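To make the weight-update rule concrete, here is a minimal sketch of one back-propagation step for a single-hidden-layer network. It is illustrative only: the sigmoid activation, the array sizes, and the learning rate `lr` are assumptions, not the configuration used in the studies cited here.

```python
# Minimal sketch: one back-propagation step for a 1-hidden-layer network.
# Sizes, sigmoid activation, and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                         # 5 training samples, 4 features
y = rng.integers(0, 2, size=(5, 1)).astype(float)   # binary targets

W1 = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(3, 1))   # hidden -> output weights
lr = 0.5                                  # learning rate (user-defined)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
h = sigmoid(X @ W1)      # hidden activations
out = sigmoid(h @ W2)    # network prediction

# Backward pass: the error is fed back towards the input layer and each
# weight changes in relation to the magnitude of the error.
err_out = (out - y) * out * (1 - out)       # delta at the output layer
err_hid = (err_out @ W2.T) * h * (1 - h)    # delta at the hidden layer
W2 -= lr * (h.T @ err_out)
W1 -= lr * (X.T @ err_hid)
```

Repeating the forward and backward passes drives the error down iteratively, which is the minimisation process described above.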

Slide 13/66

Problems: identifying user-defined parameters:

Number of hidden layers and nodes
Learning rate
Momentum factor
Number of iterations

Local minima, due to the use of a non-convex, unconstrained minimization problem.

Slide 14/66

    http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm

Slide 15/66

    Support Vector Machines (SVM)

Basic theory: 1965. Margin-based classifier: 1992. Support vector network: 1995.

Since 1998 the support vector network has been called the Support Vector Machine (SVM), and is used as an alternative to neural networks.

First application: Gualtieri and Cromp (1998), for hyperspectral image classification.

Slide 16/66

SVM: structural risk minimisation (SRM), from the statistical learning theory proposed in the 1960s by Vapnik and co-workers.

SRM: minimise the probability of misclassifying unknown data drawn randomly.

Neural network: empirical risk minimisation - minimise the misclassification error on the training data.

Slide 17/66

    SVM

Map data from the original input feature space to a very high-dimensional (even infinite-dimensional) feature space.

The data become linearly separable, but the problem becomes computationally difficult to solve.

A kernel function allows the SVM to work in the feature space without knowing the mapping or the dimensionality of the feature space.

Slide 18/66

A Kernel Function:

SVM kernels need to satisfy Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space:

$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$

Linear classification in the new space is equivalent to non-linear classification in the original space (see the sketch below).
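A small numerical check of this equivalence: for the degree-2 polynomial kernel in two dimensions, $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ equals the dot product of explicit mappings into a 3-D feature space, so the mapping $\phi$ never has to be computed. The kernel choice and values below are illustrative.

```python
# Sketch: the kernel trick for a degree-2 polynomial kernel in 2-D.
# K(x, z) = (x . z)^2 equals phi(x) . phi(z) for the explicit mapping
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2); values are illustrative.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel = (x @ z) ** 2        # computed entirely in the input space
explicit = phi(x) @ phi(z)   # computed in the 3-D feature space
assert np.isclose(kernel, explicit)   # both give 16.0
```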

Slide 19/66

    Linearly separable class

Slide 20/66

For a 2-class classification problem, training patterns are linearly separable if:

$\mathbf{w} \cdot \mathbf{x}_i + b \geq 1$ for all $y_i = 1$
$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$ for all $y_i = -1$

$\mathbf{w}$ gives the orientation of the discriminating plane, and $b$ its offset from the origin. The classification function is:

$f_{\mathbf{w},b}(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

Slide 21/66

Slide 22/66

To classify the dataset:

There can be a large number of discriminating planes.

SVM tries to find the plane farthest from both classes.

Assume two supporting planes, and maximise the distance (called the margin) between them.

Slide 23/66

A plane supports a class if all points in that class are on one side of that plane. This gives a convex optimisation problem.

Push the parallel planes apart until they collide with a few data points from each class.

These data points are called support vectors; the other training examples are of no use.

[Diagram: optimal hyperplane with supporting plane w·x + b = 1, normal vector w, the margin, the origin, and support vectors x_i.]

Slide 24/66

The margin is defined by $2/\|\mathbf{w}\|$.

Maximising the margin is equivalent to minimising the following quadratic program:

$\min_{\mathbf{w},b} \ \tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0$

Solved by QP techniques using Lagrange multipliers, giving the dual:

$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ for $\alpha_i \geq 0$

Slide 25/66

Linearly non-separable data

Slide 26/66

New optimisation problem:

$\min_{\mathbf{w},b,\xi_1,\ldots,\xi_k} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{k} \xi_i$

with $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \geq 0$ and $\xi_i \geq 0$ (Cortes and Vapnik, 1995).

$C$ is a positive constant ($C > 0$); a larger $C$ means a higher penalty on errors (see the sketch below).
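A minimal sketch of how C behaves in practice, using scikit-learn's SVC on synthetic data (the dataset and the C values are illustrative assumptions): larger C penalises slack more heavily, typically leaving fewer support vectors and a higher training accuracy.

```python
# Sketch: effect of the penalty C on a soft-margin SVM.
# Dataset and C values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           class_sep=0.8, random_state=1)
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_vectors_), clf.score(X, y))
```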

Slide 27/66

    Nonlinear SVM

Slide 28/66

Final classification function:

$f(\mathbf{x}) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$

obtained by maximising the dual with the kernel in place of the dot product:

$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

Nonlinear classification via linear separation in a higher-dimensional space: http://www.youtube.com/watch?v=9NrALgHFwTo

SVM with polynomial kernel visualization: http://www.youtube.com/watch?v=3liCbRZPrZA

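The final decision function can be checked by hand against a fitted classifier: scikit-learn's SVC stores the products $\alpha_i y_i$ in `dual_coef_` and $b$ in `intercept_`, so summing kernel values over the support vectors reproduces its decision function. The data, kernel, and gamma below are illustrative assumptions.

```python
# Sketch: evaluating f(x) = sum_i alpha_i y_i K(x_i, x) + b manually
# from a fitted SVC; data, kernel, and gamma are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)

K = rbf_kernel(X[:5], clf.support_vectors_, gamma=0.1)   # K(x, x_i)
f = K @ clf.dual_coef_.ravel() + clf.intercept_           # + b
assert np.allclose(f, clf.decision_function(X[:5]))       # sign(f) is the label
```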
Slide 29/66

Advantages

Margin theory suggests no effect of the dimensionality of the input space.

Uses a small number of the training data (called support vectors).

QP solution, so no chance of local minima.

Not many user-defined parameters.

Slide 30/66

    But with real data:

[Figure: classification accuracy (%, 55-95) against the number of features (5-65), for training sets of 8, 15, 25, 50, 75, and 100 pixels per class.]

Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2306.

Slide 31/66

Training set size per class:

                                 8 pixels     15 pixels    25 pixels    50 pixels    75 pixels    100 pixels
Peak accuracy, % (features)      74.79 (35)   81.21 (35)   84.45 (35)   88.47 (40)   91.13 (50)   92.53 (50)
Accuracy with 65 features (%)    69.79        77.05        81.66        87.58        90.63        91.76
Difference in accuracy (%)       5.00         4.16         2.79         0.89         0.50         0.77
Z value                          6.04         5.35         4.02         1.69         1.48         2.22

Slide 32/66

Disadvantages

Designed for two-class problems; different methods are needed to create a multi-class classifier.

Choice of kernel function and kernel-specific parameters.

The kernel function is required to satisfy the Mercer condition.

Choice of parameter C.

Output is not naturally probabilistic.

Slide 33/66

Multiclass results

Multiclass approach            Classification accuracy (%)   Training time
One against one                87.90                         6.4 sec
One against rest               86.55                         30.37 sec
Directed acyclic graph         87.63                         6.5 sec
Bound constrained approach     87.29                         79.6 sec
Crammer and Singer approach    87.43                         347 min 18 sec
ECOC (exhaustive approach)     89.00                         806.6 min
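Two of the approaches in this table can be reproduced directly in scikit-learn, whose SVC trains one-against-one binary machines internally and whose OneVsRestClassifier wraps binary classifiers one-against-rest; the iris dataset here is an illustrative stand-in, not the data behind the table.

```python
# Sketch: building a multiclass classifier from binary SVMs.
# Dataset and kernel are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = SVC(kernel="rbf").fit(X, y)                        # one against one (SVC default)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # one against rest
print(ovo.predict(X[:3]), ovr.predict(X[:3]))
```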

Slide 34/66

    Choice of kernel function

Slide 35/66

Parameter selection

Grid search and trial-and-error methods are the commonly used approaches, but are computationally expensive (a grid-search sketch follows below).

Other approaches: genetic algorithms, particle swarm optimization, and their combination with grid search.
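As mentioned above, a minimal grid-search sketch over the RBF-kernel parameters C and gamma, with cross-validation scoring each grid point; the parameter ranges and the synthetic dataset are illustrative assumptions.

```python
# Sketch: grid search for the SVM parameters C and gamma.
# Parameter ranges and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [1e-3, 1e-2, 1e-1, 1.0]},
    cv=5,  # 5-fold cross-validation at every (C, gamma) grid point
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```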

Slide 36/66

    SVR

Slide 37/66

    http://www.saedsayad.com/support_vector_machine_reg.htm
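As a companion to the figure linked above, a minimal support vector regression sketch: errors smaller than epsilon fall inside the insensitive tube and are ignored, while C penalises larger deviations. The kernel and parameter values are illustrative assumptions.

```python
# Sketch: epsilon-insensitive support vector regression (SVR).
# Kernel, C, and epsilon are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)  # noisy sine

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))
```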

Slide 38/66

Relevance Vector Machines

Slide 39/66

Based on a Bayesian formulation of a linear model (Tipping, 2001).

Produces a sparser solution than the SVM (i.e. fewer relevance vectors).

Ability to use non-Mercer kernels.

Probabilistic output.

No need to define the parameter C.

Slide 40/66

For a 2-class problem, the maximum a posteriori estimate of the weights can be obtained by maximizing the following objective function (log-likelihood plus log-prior):

$f(w_1, w_2, \ldots, w_n) = \sum_{i=1}^{n} \log p(c_i \mid w) + \sum_{i=1}^{n} \log p(w_i \mid \alpha_i)$

http://www.cs.uoi.gr/~tzikas/papers/EURASIP06.pdf
http://www.tristanfletcher.co.uk/RVM%20Explained.pdf

Slide 41/66

RVM

The solution involves calculating the gradient of f with respect to w.

Only those training data having non-zero coefficients w_i (called relevance vectors) contribute to the decision function.

An iterative analysis is followed to find the set of weights that maximizes the objective function.

Slide 42/66

Major difference from SVM

The selected points are anti-boundary (away from the boundary).

Support vectors represent the least prototypical examples (closer to the boundary, difficult to classify).

Relevance vectors are the most prototypical (more representative of the class).

Slide 43/66

Location of the useful training cases for classifications by SVM & RVM

[Figure: two scatter plots of Band 5 against Band 1 for wheat, sugar beet, and oilseed rape training data, showing the locations of the support vectors and the relevance vectors.]

Mahesh Pal and G. M. Foody, Evaluation of SVM, RVM and SMLR for accurate image classification with limited ground data, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5), 2012.

Slide 44/66

Class (number of useful     Mahalanobis distance to class centroid      Difference of two smallest
training cases)             Wheat      Sugar beet   Oilseed rape        Mahalanobis distances

Support vectors
Wheat (4)                   4.8697     15.8246      100.2179            10.9549
Sugar beet (8)              51.9803    3.9906       47.6909             31.0740
Oilseed rape (7)            89.3444    20.9320      6.2782              15.8113

Relevance vectors
Wheat (1)                   12.9498    31.8135      171.6667            18.8637
Sugar beet (2)              68.8468    4.4170       144.2734            64.4298
Oilseed rape (4)            112.0943   35.5128      4.3981              31.1147

Slide 45/66

Disadvantages

Requires a large computational cost in comparison to SVM.

Designed for 2-class problems, similar to SVM.

Choice of kernel.

May have a problem of local minima.

Slide 46/66

    Random forest algorithm

Slide 47/66

A multistage or hierarchical algorithm.

Breaks a complex decision into a union of several simpler decisions.

Uses different subsets of features/data at various decision levels.

A tree-based algorithm.

Slide 48/66

[Diagram: decision tree with a root node, internal nodes, and terminal nodes.]

Slide 49/66

Slide 50/66

A tree-based algorithm requires:

Splitting rules for tree creation (called attribute selection). Most popular are:
a) Gain ratio criterion (Quinlan, 1993)
b) Gini index (Breiman et al., 1984) - see the sketch after this list

Termination rules / pruning rules. Most popular are:
a) Error-based pruning (Quinlan, 1993)
b) Cost-complexity pruning (Breiman et al., 1984)
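As referenced in the list above, a minimal sketch of the Gini index as a splitting rule: a candidate split is scored by the weighted impurity of its child nodes, and the split with the lowest score is chosen. The toy label arrays are illustrative.

```python
# Sketch: Gini index as an attribute-selection (splitting) rule.
# The toy label arrays are illustrative.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Child nodes produced by a candidate split
left, right = np.array([0, 0, 1]), np.array([1, 1, 1, 0])
n = len(left) + len(right)
score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print(score)  # lower weighted impurity = better split
```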

Slide 51/66

Attribute selection measure:   Information gain   Information gain ratio   Gini index   Chi-square measure
Accuracy (%):                  83.70              84.54                    83.90        83.65

    Mahesh Pal and P.M. Mather, 2003, An Assessment of the Effectiveness of Decision Tree Methods for

    Land Cover Classification. Remote Sensing of Environment. 86, 554-565


Slide 52/66

Random forest

An ensemble of tree-based classifiers.

Uses a random set of features (i.e. input variables) for each tree.

Uses a bootstrapped sample of the original data; the bootstrapped sample consists of ~63% of the original data, and the remaining ~37% is left out and called the out-of-bag (OOB) data.

Multiclass, and requires no pruning.

Slide 53/66

Parameters (see the sketch after the figures):

a) Number of trees to grow
b) Number of attributes (features) for each tree

[Figures: test data accuracy (%) against the number of features used (1-6: 87.78, 87.48, 88.37, 88.27, 88.07, 87.92) and against the number of trees (0-14000).]

Mahesh Pal, 2005, Random Forest Classifier for Remote Sensing Classifications. International Journal of Remote Sensing, 26(1), 217-222.
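The two parameters above map directly onto scikit-learn's random forest, which also exposes the out-of-bag accuracy described earlier; the dataset and parameter values here are illustrative assumptions, not those of the cited study.

```python
# Sketch: the two random forest parameters (number of trees and number
# of features per split); values and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,  # number of trees to grow
    max_features=4,    # features drawn at random for each split
    oob_score=True,    # score on the ~37% out-of-bag samples
    random_state=0,
).fit(X, y)
print(rf.oob_score_)
```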

Slide 54/66

Classification Results

Classifier used                 Random forest classifier   Support vector machines
Accuracy (%) and kappa value    88.37 (0.86)               87.9 (0.86)
Training time                   12.98 seconds on P-IV      0.30 minutes on Sun machine

Slide 55/66

Can be used for:

Feature selection
Clustering of data
Outlier detection
Predictions/regression

Can handle categorical data and data with missing values.

Performance comparable to SVM.

Computationally efficient.

Mahesh Pal, 2006, Support Vector Machines Based Feature Selection for land cover classification: a case study with DAIS Hyperspectral Data. International Journal of Remote Sensing, 27(14), 2877-2894.

Slide 56/66

Outliers

[Figure: outlier value (0-21) against sample number (0-3000) for classes 1-7.]

An outlier is an observation that lies at an abnormal distance from the other values in the dataset.


Slide 57/66

Clustering

[Figure: data clusters plotted on the 1st and 2nd scaling coordinates for classes 1-7.]

Slide 58/66

    Extreme Learning Machines

    Comparison of ELM with SVR for reservoir permeability prediction

Modelling permeability prediction using ELM

Slide 59/66

A neural network classifier that uses only one hidden layer.

No parameters except the number of hidden nodes.

Global solution.

Performance comparable to SVM and better than the back-propagation neural network.

Very fast.

Slide 60/66

    http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf

Slide 61/66

HUANG, G.-B., ZHU, Q.-Y. and SIEW, C.-K., 2006, Extreme learning machine: theory and applications, Neurocomputing, 70, 489-501.

The ELM output with $L$ hidden nodes is

$f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x} + b_i)$

where $\mathbf{w}_i$ and $b_i$ are the randomly assigned hidden-node weights and biases, $g$ is the activation function, and $\beta_i$ are the output weights (see the sketch below).
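A minimal sketch of training under the formula above: the hidden weights and biases are assigned randomly and never updated, so the output weights beta follow from a single least-squares solve with the Moore-Penrose pseudoinverse. The sizes and the sigmoid activation g are illustrative assumptions.

```python
# Sketch: extreme learning machine with random hidden weights and a
# one-step least-squares solution for beta; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                        # training samples
T = rng.integers(0, 2, size=(200, 1)).astype(float)   # targets
L = 50                                                # hidden nodes

W = rng.normal(size=(10, L))              # random input weights w_i
b = rng.normal(size=L)                    # random biases b_i
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # hidden output g(w_i . x + b_i)

beta = np.linalg.pinv(H) @ T              # beta = H^+ T, no iteration
predictions = H @ beta                    # f(x) = sum_i beta_i g(...)
```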

Slide 62/66

Disadvantages

Weights are randomly assigned, giving large variation in accuracy with the same number of hidden nodes over different trials.

Difficult to replicate results.

Mahesh Pal, 2009, Extreme learning machine based land cover classification, International Journal of Remote Sensing, 30(14), 3835-3841.

[Figure: classification accuracy (%, 70-90) against the number of nodes in the hidden layer (25-450). Training time: extreme learning machine 1.25 sec; back-propagation neural network 336.20 sec.]


Slide 63/66

Kernelised ELM

A kernel function can be used in place of the hidden layer by modifying the optimization problem. Multiclass.

Can be used for classification and regression (see the sketch below).

The same kernel functions as used with SVM/RVM can be used.

Encouraging results for classification and prediction - better than SVM in terms of accuracy and computational cost.

Huang, G.-B., Zhou, H., Ding, X. and Zhang, R., 2012, Extreme Learning Machine for Regression and Multiclass Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42, 513-529.
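As referenced above, a minimal sketch following Huang et al. (2012): the hidden layer is replaced by a kernel matrix and the output weights come from one regularised linear solve. The RBF kernel, gamma, and C values are illustrative assumptions.

```python
# Sketch: kernelised ELM; the kernel matrix replaces the hidden layer
# and training is a single regularised solve. Values are illustrative.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
T = rng.integers(0, 2, size=(200, 1)).astype(float)
C = 100.0                                    # regularisation parameter

K = rbf_kernel(X, X, gamma=0.1)              # same kernels as SVM/RVM apply
alpha = np.linalg.solve(np.eye(len(X)) / C + K, T)

X_new = rng.normal(size=(3, 10))
f = rbf_kernel(X_new, X, gamma=0.1) @ alpha  # f(x) = k(x, X) @ alpha
print(f.ravel())
```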


Slide 64/66

No Free Lunch Theorem

No algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type (Wolpert and Macready, 1995).

Algorithms must be designed for a particular domain; there is no such thing as a general-purpose algorithm.

Data-dependent nature.

Slide 65/66

    http://www.tristanfletcher.co.uk/SVM%20Explained.pdf

    http://www.youtube.com/watch?v=eHsErlPJWUU

{SVM by Prof. Yaser Abu-Mostafa, Caltech}

    http://www.youtube.com/watch?v=s8B4A5ubw6c

    {SVM by Prof. Andrew Ng, Stanford}

    http://videolectures.net/mlss03_tipping_pp/

    { RVM, Video lecture by Tipping}

    http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf

Slide 66/66

    Questions?