Kernel Methods for Land Cover Classification and Prediction


Slide 1/66

Beyond Neural Networks: New Algorithms for Classification and Prediction

    MAHESH PAL

    Department of Civil Engineering

    National Institute of Technology

    Kurukshetra, 136119, INDIA

Slide 2/66

Neural networks

Support vector machines

Relevance vector machines

Random forest classifier

Extreme learning machines

Slide 3/66

    3D GEOLOGICAL MODELING: SOLVING AS A CLASSIFICATION PROBLEM

    WITH THE SUPPORT VECTOR MACHINE

    3-D SEISMIC-BASED LITHOLOGY PREDICTION USING IMPEDANCE

    INVERSION AND NEURAL NETWORKS APPLICATION: CASE-STUDY

    FROM THE MANNVILLE GROUP IN EAST-CENTRAL ALBERTA, CANADA

    EVALUATING CLASSIFICATION TECHNIQUES FOR MAPPING VERTICAL

    GEOLOGY USING FIELD-BASED HYPERSPECTRAL SENSORS

    FLOW UNIT PREDICTION WITH LIMITED PERMEABILITY DATA USING

    ARTIFICIAL NEURAL NETWORK ANALYSIS (WVU, PhD, 2002)

    SUBSURFACE CHARACTERIZATION WITH SUPPORT VECTOR

    MACHINES

SUPPORT VECTOR MACHINES FOR DELINEATION OF GEOLOGIC FACIES FROM POORLY DIFFERENTIATED DATA

    SUPERIORITIES OF SUPPORT VECTOR MACHINE IN FRACTURE

    PREDICTION AND GASSINESS EVALUATION

Slide 4/66

    DYNAMICS OF WATER TRANSPORT THROUGH CATCHMENT OF DANUBE

RIVER TRACED BY 3H AND 18O - THE NEURAL NETWORK APPROACH

    A COMBINED STABLE ISOTOPE AND MACHINE LEARNING APPROACH TO

    QUANTIFY AND CLASSIFY NITRATE POLLUTION SOURCES IN WATER

    USING GEOCHEMISTRY AND NEURAL NETWORKS TO MAP GEOLOGY

    UNDER GLACIAL COVER

    POROSITY AND PERMEABILITY ESTIMATION USING NEURAL NETWORK

    APPROACH FROM WELL LOG DATA

    ILLINOIS STATEWIDE MONITORING WELL NETWORK FOR PESTICIDES IN

    SHALLOW GROUNDWATER (AQUIFER SENSITIVITY TO CONTAMINATION

    BY PESTICIDE LEACHING USING NN).

    APPLICATION OF ARTIFICIAL NEURAL NETWORKS IN HYDROGEOLOGY:

    IDENTIFICATION OF UNKNOWN POLLUTION SOURCES IN

    CONTAMINATED AQUIFERS

Slide 5/66

Classification has been a major research area using remote sensing images.

A major input in GIS-based studies.

    Several approaches are used.

Slide 6/66

    Classification Algorithms

    Supervised - requires labelled training data

Unsupervised - searches for natural groups of data, called clusters.

Slide 7/66

    Parametric

    Maximum likelihood classifier

    Nonparametric

Neural network, support vector machines, relevance vector machines, random forest classifier, extreme learning machine

Slide 8/66

For classification/regression, a training sample is made available to the learning algorithm (e.g. neural network, SVM, RVM, random forest, extreme learning machine).

After training, the learning algorithm outputs a model or function, which is called the hypothesis.

This hypothesis can be considered a machine that outputs the prediction for new test data.

Slide 9/66

[Diagram: training samples → learning algorithm → model/function (also called the hypothesis); testing samples → hypothesis → output values.]

The hypothesis can be considered a machine that provides the prediction for test data.

Slide 10/66

    Neural Network

A major research area during 1990-2000 for classification/regression, and still in use.

No assumption about the data distribution.

Works well with different data, including remote sensing data.

Slide 11/66

[Diagram: feed-forward neural network with input layer, hidden layer, and output layer; $w_{ij}$ are the input-to-hidden weights and $w_k$ the hidden-to-output weights.]

Slide 12/66

The interconnecting weights are determined during the training process.

A number of algorithms can be used to adjust the interconnecting weights; back-propagation is the most commonly used method.

The error between actual and predicted values is fed backwards through the network towards the input layer, and the connecting weights change in relation to the magnitude of the error.

Uses an iterative process to minimize the error (see the sketch below).
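To make the weight-update rule concrete, here is a minimal sketch of one back-propagation step for a single-hidden-layer network. It is illustrative only: the sigmoid activation, the array sizes, and the learning rate `lr` are assumptions, not the configuration used in the studies cited here.

```python
# Minimal sketch: one back-propagation step for a 1-hidden-layer network.
# Sizes, sigmoid activation, and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                         # 5 training samples, 4 features
y = rng.integers(0, 2, size=(5, 1)).astype(float)   # binary targets

W1 = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(3, 1))   # hidden -> output weights
lr = 0.5                                  # learning rate (user-defined)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
h = sigmoid(X @ W1)      # hidden activations
out = sigmoid(h @ W2)    # network prediction

# Backward pass: the error is fed back towards the input layer and each
# weight changes in relation to the magnitude of the error.
err_out = (out - y) * out * (1 - out)       # delta at the output layer
err_hid = (err_out @ W2.T) * h * (1 - h)    # delta at the hidden layer
W2 -= lr * (h.T @ err_out)
W1 -= lr * (X.T @ err_hid)
```

Repeating the forward and backward passes drives the error down iteratively, which is the minimisation process described above.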

Slide 13/66

Problems: identifying user-defined parameters:

Number of hidden layers and nodes
Learning rate
Momentum factor
Number of iterations

Local minima, due to the use of a non-convex, unconstrained minimization problem.

Slide 14/66

    http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm

Slide 15/66

    Support Vector Machines (SVM)

Basic theory: 1965. Margin-based classifier: 1992. Support vector network: 1995.

Since 1998 the support vector network has been called the Support Vector Machine (SVM), and is used as an alternative to neural networks.

First application: Gualtieri and Cromp (1998), for hyperspectral image classification.

Slide 16/66

SVM: structural risk minimisation (SRM), from the statistical learning theory proposed in the 1960s by Vapnik and co-workers.

SRM: minimise the probability of misclassifying unknown data drawn randomly.

Neural network: empirical risk minimisation - minimise the misclassification error on the training data.

Slide 17/66

    SVM

Map data from the original input feature space to a very high-dimensional (even infinite-dimensional) feature space.

The data become linearly separable, but the problem becomes computationally difficult to solve.

A kernel function allows the SVM to work in the feature space without knowing the mapping or the dimensionality of the feature space.

Slide 18/66

A Kernel Function:

SVM kernels need to satisfy Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space:

$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$

Linear classification in the new space is equivalent to non-linear classification in the original space (see the sketch below).
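A small numerical check of this equivalence: for the degree-2 polynomial kernel in two dimensions, $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ equals the dot product of explicit mappings into a 3-D feature space, so the mapping $\phi$ never has to be computed. The kernel choice and values below are illustrative.

```python
# Sketch: the kernel trick for a degree-2 polynomial kernel in 2-D.
# K(x, z) = (x . z)^2 equals phi(x) . phi(z) for the explicit mapping
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2); values are illustrative.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel = (x @ z) ** 2        # computed entirely in the input space
explicit = phi(x) @ phi(z)   # computed in the 3-D feature space
assert np.isclose(kernel, explicit)   # both give 16.0
```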

Slide 19/66

    Linearly separable class

Slide 20/66

For a 2-class classification problem, training patterns are linearly separable if:

$\mathbf{w} \cdot \mathbf{x}_i + b \geq 1$ for all $y_i = 1$
$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$ for all $y_i = -1$

$\mathbf{w}$ gives the orientation of the discriminating plane, and $b$ its offset from the origin. The classification function is:

$f_{\mathbf{w},b}(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$

Slide 21/66

Slide 22/66

To classify the dataset:

There can be a large number of discriminating planes.

SVM tries to find the plane farthest from both classes.

Assume two supporting planes, and maximise the distance (called the margin) between them.

Slide 23/66

A plane supports a class if all points in that class are on one side of that plane. This gives a convex optimisation problem.

Push the parallel planes apart until they collide with a few data points from each class.

These data points are called support vectors; the other training examples are of no use.

[Diagram: optimal hyperplane with supporting plane w·x + b = 1, normal vector w, the margin, the origin, and support vectors x_i.]

Slide 24/66

The margin is defined by $2/\|\mathbf{w}\|$.

Maximising the margin is equivalent to minimising the following quadratic program:

$\min_{\mathbf{w},b} \ \tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0$

Solved by QP techniques using Lagrange multipliers, giving the dual:

$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ for $\alpha_i \geq 0$

Slide 25/66

Linearly non-separable data

Slide 26/66

New optimisation problem:

$\min_{\mathbf{w},b,\xi_1,\ldots,\xi_k} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{k} \xi_i$

with $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \geq 0$ and $\xi_i \geq 0$ (Cortes and Vapnik, 1995).

$C$ is a positive constant ($C > 0$); a larger $C$ means a higher penalty on errors (see the sketch below).
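A minimal sketch of how C behaves in practice, using scikit-learn's SVC on synthetic data (the dataset and the C values are illustrative assumptions): larger C penalises slack more heavily, typically leaving fewer support vectors and a higher training accuracy.

```python
# Sketch: effect of the penalty C on a soft-margin SVM.
# Dataset and C values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           class_sep=0.8, random_state=1)
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_vectors_), clf.score(X, y))
```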

Slide 27/66

    Nonlinear SVM

Slide 28/66

Final classification function:

$f(\mathbf{x}) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$

obtained by maximising the dual with the kernel in place of the dot product:

$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

Nonlinear classification via linear separation in a higher-dimensional space: http://www.youtube.com/watch?v=9NrALgHFwTo

SVM with polynomial kernel visualization: http://www.youtube.com/watch?v=3liCbRZPrZA

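The final decision function can be checked by hand against a fitted classifier: scikit-learn's SVC stores the products $\alpha_i y_i$ in `dual_coef_` and $b$ in `intercept_`, so summing kernel values over the support vectors reproduces its decision function. The data, kernel, and gamma below are illustrative assumptions.

```python
# Sketch: evaluating f(x) = sum_i alpha_i y_i K(x_i, x) + b manually
# from a fitted SVC; data, kernel, and gamma are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)

K = rbf_kernel(X[:5], clf.support_vectors_, gamma=0.1)   # K(x, x_i)
f = K @ clf.dual_coef_.ravel() + clf.intercept_           # + b
assert np.allclose(f, clf.decision_function(X[:5]))       # sign(f) is the label
```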
Slide 29/66

Advantages

Margin theory suggests no effect of the dimensionality of the input space.

Uses a small number of the training data (called support vectors).

QP solution, so no chance of local minima.

Not many user-defined parameters.

Slide 30/66

    But with real data:

[Figure: classification accuracy (%, 55-95) against the number of features (5-65), for training sets of 8, 15, 25, 50, 75, and 100 pixels per class.]

Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2306.

Slide 31/66

Training set size per class:

                                 8 pixels     15 pixels    25 pixels    50 pixels    75 pixels    100 pixels
Peak accuracy, % (features)      74.79 (35)   81.21 (35)   84.45 (35)   88.47 (40)   91.13 (50)   92.53 (50)
Accuracy with 65 features (%)    69.79        77.05        81.66        87.58        90.63        91.76
Difference in accuracy (%)       5.00         4.16         2.79         0.89         0.50         0.77
Z value                          6.04         5.35         4.02         1.69         1.48         2.22

Slide 32/66

Disadvantages

Designed for two-class problems; different methods are needed to create a multi-class classifier.

Choice of kernel function and kernel-specific parameters.

The kernel function is required to satisfy the Mercer condition.

Choice of parameter C.

Output is not naturally probabilistic.

Slide 33/66

Multiclass results

Multiclass approach            Classification accuracy (%)   Training time
One against one                87.90                         6.4 sec
One against rest               86.55                         30.37 sec
Directed acyclic graph         87.63                         6.5 sec
Bound constrained approach     87.29                         79.6 sec
Crammer and Singer approach    87.43                         347 min 18 sec
ECOC (exhaustive approach)     89.00                         806.6 min
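Two of the approaches in this table can be reproduced directly in scikit-learn, whose SVC trains one-against-one binary machines internally and whose OneVsRestClassifier wraps binary classifiers one-against-rest; the iris dataset here is an illustrative stand-in, not the data behind the table.

```python
# Sketch: building a multiclass classifier from binary SVMs.
# Dataset and kernel are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = SVC(kernel="rbf").fit(X, y)                        # one against one (SVC default)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # one against rest
print(ovo.predict(X[:3]), ovr.predict(X[:3]))
```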

Slide 34/66

    Choice of kernel function

Slide 35/66

Parameter selection

Grid search and trial-and-error methods are the commonly used approaches, but are computationally expensive (a grid-search sketch follows below).

Other approaches: genetic algorithms, particle swarm optimization, and their combination with grid search.
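As mentioned above, a minimal grid-search sketch over the RBF-kernel parameters C and gamma, with cross-validation scoring each grid point; the parameter ranges and the synthetic dataset are illustrative assumptions.

```python
# Sketch: grid search for the SVM parameters C and gamma.
# Parameter ranges and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [1e-3, 1e-2, 1e-1, 1.0]},
    cv=5,  # 5-fold cross-validation at every (C, gamma) grid point
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```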

Slide 36/66

    SVR

Slide 37/66

    http://www.saedsayad.com/support_vector_machine_reg.htm
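As a companion to the figure linked above, a minimal support vector regression sketch: errors smaller than epsilon fall inside the insensitive tube and are ignored, while C penalises larger deviations. The kernel and parameter values are illustrative assumptions.

```python
# Sketch: epsilon-insensitive support vector regression (SVR).
# Kernel, C, and epsilon are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)  # noisy sine

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))
```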

Slide 38/66

Relevance Vector Machines

Slide 39/66

Based on a Bayesian formulation of a linear model (Tipping, 2001).

Produces a sparser solution than the SVM (i.e. fewer relevance vectors).

Ability to use non-Mercer kernels.

Probabilistic output.

No need to define the parameter C.

Slide 40/66

For a 2-class problem, the maximum a posteriori estimate of the weights can be obtained by maximizing the following objective function (log-likelihood plus log-prior):

$f(w_1, w_2, \ldots, w_n) = \sum_{i=1}^{n} \log p(c_i \mid w) + \sum_{i=1}^{n} \log p(w_i \mid \alpha_i)$

http://www.cs.uoi.gr/~tzikas/papers/EURASIP06.pdf
http://www.tristanfletcher.co.uk/RVM%20Explained.pdf

Slide 41/66

RVM

The solution involves calculating the gradient of f with respect to w.

Only those training data having non-zero coefficients w_i (called relevance vectors) contribute to the decision function.

An iterative analysis is followed to find the set of weights that maximizes the objective function.

Slide 42/66

Major difference from SVM

The selected points are anti-boundary (away from the boundary).

Support vectors represent the least prototypical examples (closer to the boundary, difficult to classify).

Relevance vectors are the most prototypical (more representative of the class).

Slide 43/66

Location of the useful training cases for classifications by SVM & RVM

[Figure: two scatter plots of Band 5 against Band 1 for wheat, sugar beet, and oilseed rape training data, showing the locations of the support vectors and the relevance vectors.]

Mahesh Pal and G. M. Foody, Evaluation of SVM, RVM and SMLR for accurate image classification with limited ground data, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5), 2012.

Slide 44/66

Class (number of useful     Mahalanobis distance to class centroid      Difference of two smallest
training cases)             Wheat      Sugar beet   Oilseed rape        Mahalanobis distances

Support vectors
Wheat (4)                   4.8697     15.8246      100.2179            10.9549
Sugar beet (8)              51.9803    3.9906       47.6909             31.0740
Oilseed rape (7)            89.3444    20.9320      6.2782              15.8113

Relevance vectors
Wheat (1)                   12.9498    31.8135      171.6667            18.8637
Sugar beet (2)              68.8468    4.4170       144.2734            64.4298
Oilseed rape (4)            112.0943   35.5128      4.3981              31.1147

Slide 45/66

Disadvantages

Requires a large computational cost in comparison to SVM.

Designed for 2-class problems, similar to SVM.

Choice of kernel.

May have a problem of local minima.

Slide 46/66

    Random forest algorithm

Slide 47/66

A multistage or hierarchical algorithm.

Breaks a complex decision into a union of several simpler decisions.

Uses different subsets of features/data at various decision levels.

A tree-based algorithm.

Slide 48/66

[Diagram: decision tree with a root node, internal nodes, and terminal nodes.]

Slide 49/66

Slide 50/66

A tree-based algorithm requires:

Splitting rules for tree creation (called attribute selection). Most popular are:
a) Gain ratio criterion (Quinlan, 1993)
b) Gini index (Breiman et al., 1984) - see the sketch after this list

Termination rules / pruning rules. Most popular are:
a) Error-based pruning (Quinlan, 1993)
b) Cost-complexity pruning (Breiman et al., 1984)
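As referenced in the list above, a minimal sketch of the Gini index as a splitting rule: a candidate split is scored by the weighted impurity of its child nodes, and the split with the lowest score is chosen. The toy label arrays are illustrative.

```python
# Sketch: Gini index as an attribute-selection (splitting) rule.
# The toy label arrays are illustrative.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Child nodes produced by a candidate split
left, right = np.array([0, 0, 1]), np.array([1, 1, 1, 0])
n = len(left) + len(right)
score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print(score)  # lower weighted impurity = better split
```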

Slide 51/66

Attribute selection measure:   Information gain   Information gain ratio   Gini index   Chi-square measure
Accuracy (%):                  83.70              84.54                    83.90        83.65

    Mahesh Pal and P.M. Mather, 2003, An Assessment of the Effectiveness of Decision Tree Methods for

    Land Cover Classification. Remote Sensing of Environment. 86, 554-565


Slide 52/66

Random forest

An ensemble of tree-based classifiers.

Uses a random set of features (i.e. input variables) for each tree.

Uses a bootstrapped sample of the original data; the bootstrapped sample consists of ~63% of the original data, and the remaining ~37% is left out and called the out-of-bag (OOB) data.

Multiclass, and requires no pruning.

Slide 53/66

Parameters (see the sketch after the figures):

a) Number of trees to grow
b) Number of attributes (features) for each tree

[Figures: test data accuracy (%) against the number of features used (1-6: 87.78, 87.48, 88.37, 88.27, 88.07, 87.92) and against the number of trees (0-14000).]

Mahesh Pal, 2005, Random Forest Classifier for Remote Sensing Classifications. International Journal of Remote Sensing, 26(1), 217-222.
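The two parameters above map directly onto scikit-learn's random forest, which also exposes the out-of-bag accuracy described earlier; the dataset and parameter values here are illustrative assumptions, not those of the cited study.

```python
# Sketch: the two random forest parameters (number of trees and number
# of features per split); values and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,  # number of trees to grow
    max_features=4,    # features drawn at random for each split
    oob_score=True,    # score on the ~37% out-of-bag samples
    random_state=0,
).fit(X, y)
print(rf.oob_score_)
```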

Slide 54/66

Classification Results

Classifier used                 Random forest classifier   Support vector machines
Accuracy (%) and kappa value    88.37 (0.86)               87.9 (0.86)
Training time                   12.98 seconds on P-IV      0.30 minutes on Sun machine

Slide 55/66

Can be used for:

Feature selection
Clustering of data
Outlier detection
Predictions/regression

Can handle categorical data and data with missing values.

Performance comparable to SVM.

Computationally efficient.

Mahesh Pal, 2006, Support Vector Machines Based Feature Selection for land cover classification: a case study with DAIS Hyperspectral Data. International Journal of Remote Sensing, 27(14), 2877-2894.

Slide 56/66

Outliers

[Figure: outlier value (0-21) against sample number (0-3000) for classes 1-7.]

An outlier is an observation that lies at an abnormal distance from the other values in the dataset.


Slide 57/66

Clustering

[Figure: data clusters plotted on the 1st and 2nd scaling coordinates for classes 1-7.]

Slide 58/66

    Extreme Learning Machines

    Comparison of ELM with SVR for reservoir permeability prediction

Modelling permeability prediction using ELM

Slide 59/66

A neural network classifier that uses only one hidden layer.

No parameters except the number of hidden nodes.

Global solution.

Performance comparable to SVM and better than the back-propagation neural network.

Very fast.

Slide 60/66

    http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf

Slide 61/66

HUANG, G.-B., ZHU, Q.-Y. and SIEW, C.-K., 2006, Extreme learning machine: theory and applications, Neurocomputing, 70, 489-501.

The ELM output with $L$ hidden nodes is

$f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x} + b_i)$

where $\mathbf{w}_i$ and $b_i$ are the randomly assigned hidden-node weights and biases, $g$ is the activation function, and $\beta_i$ are the output weights (see the sketch below).
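A minimal sketch of training under the formula above: the hidden weights and biases are assigned randomly and never updated, so the output weights beta follow from a single least-squares solve with the Moore-Penrose pseudoinverse. The sizes and the sigmoid activation g are illustrative assumptions.

```python
# Sketch: extreme learning machine with random hidden weights and a
# one-step least-squares solution for beta; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                        # training samples
T = rng.integers(0, 2, size=(200, 1)).astype(float)   # targets
L = 50                                                # hidden nodes

W = rng.normal(size=(10, L))              # random input weights w_i
b = rng.normal(size=L)                    # random biases b_i
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # hidden output g(w_i . x + b_i)

beta = np.linalg.pinv(H) @ T              # beta = H^+ T, no iteration
predictions = H @ beta                    # f(x) = sum_i beta_i g(...)
```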

Slide 62/66

Disadvantages

Weights are randomly assigned, giving large variation in accuracy with the same number of hidden nodes over different trials.

Difficult to replicate results.

Mahesh Pal, 2009, Extreme learning machine based land cover classification, International Journal of Remote Sensing, 30(14), 3835-3841.

[Figure: classification accuracy (%, 70-90) against the number of nodes in the hidden layer (25-450). Training time: extreme learning machine 1.25 sec; back-propagation neural network 336.20 sec.]


Slide 63/66

Kernelised ELM

A kernel function can be used in place of the hidden layer by modifying the optimization problem. Multiclass.

Can be used for classification and regression (see the sketch below).

The same kernel functions as used with SVM/RVM can be used.

Encouraging results for classification and prediction - better than SVM in terms of accuracy and computational cost.

Huang, G.-B., Zhou, H., Ding, X. and Zhang, R., 2012, Extreme Learning Machine for Regression and Multiclass Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42, 513-529.
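As referenced above, a minimal sketch following Huang et al. (2012): the hidden layer is replaced by a kernel matrix and the output weights come from one regularised linear solve. The RBF kernel, gamma, and C values are illustrative assumptions.

```python
# Sketch: kernelised ELM; the kernel matrix replaces the hidden layer
# and training is a single regularised solve. Values are illustrative.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
T = rng.integers(0, 2, size=(200, 1)).astype(float)
C = 100.0                                    # regularisation parameter

K = rbf_kernel(X, X, gamma=0.1)              # same kernels as SVM/RVM apply
alpha = np.linalg.solve(np.eye(len(X)) / C + K, T)

X_new = rng.normal(size=(3, 10))
f = rbf_kernel(X_new, X, gamma=0.1) @ alpha  # f(x) = k(x, X) @ alpha
print(f.ravel())
```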


Slide 64/66

No Free Lunch Theorem

No algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type (Wolpert and Macready, 1995).

Algorithms must be designed for a particular domain; there is no such thing as a general-purpose algorithm.

Data-dependent nature.

Slide 65/66

    http://www.tristanfletcher.co.uk/SVM%20Explained.pdf

    http://www.youtube.com/watch?v=eHsErlPJWUU

{SVM by Prof. Yaser Abu-Mostafa, Caltech}

    http://www.youtube.com/watch?v=s8B4A5ubw6c

    {SVM by Prof. Andrew Ng, Stanford}

    http://videolectures.net/mlss03_tipping_pp/

    { RVM, Video lecture by Tipping}

    http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf

Slide 66/66

    Questions?