Data Mining


Page 1

Data Mining

Page 2

What's Machine Learning

"Field of study that gives computers the ability to learn without being explicitly programmed."
Arthur Samuel (1959)

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Tom Mitchell (1998)

Machine Learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn based on data-driven resources (sensors, databases). A major focus of machine learning is to automatically learn to recognize complex patterns and make intelligent decisions based on data.

Page 3

ML origins: from Aristotle to Darwin

The Greek philosopher Aristotle was one of the first to attempt to codify "right thinking," that is, irrefutable reasoning processes. His syllogisms provided patterns for argument structures that always yielded correct conclusions when given correct premises. For example: "Socrates is a man; all men are mortal; therefore, Socrates is mortal." These laws of thought were supposed to govern the operation of the mind; their study initiated the field called logic.

By 1965, programs existed that could, in principle, process any solvable problem described in logical notation. The so-called logicist tradition within artificial intelligence hopes to build on such programs to create intelligent systems, and ML theory represents their demonstration discipline. A reinforcement in this direction came from integrating the ML paradigm with statistical principles following Darwin's law of natural evolution.

Page 4

ML supervised paradigm

In supervised ML we have a set of data points or observations for which we know the desired output, expressed in terms of categorical classes, numerical or logical variables, or as a generic observed description of any real problem. The desired output in fact provides some level of supervision, in that it is used by the learning model to adjust parameters or make decisions, allowing it to predict the correct output for new data.

Finally, when the algorithm is able to correctly predict observations we call it a classifier. Some classifiers are also capable of providing results in a more probabilistic sense, i.e. a probability of a data point belonging to a class. We usually refer to such model behavior as probabilistic classification.
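As an illustration (not part of the original slides), here is a minimal Python sketch of a classifier that returns class membership probabilities rather than only crisp labels; the toy data and the nearest-centroid rule are assumptions made for the example:

```python
# Toy sketch: a nearest-centroid classifier that also returns class
# membership probabilities, i.e. a "probabilistic classification".
import numpy as np

def fit_centroids(X, y):
    """Learn one centroid per class from labeled training data."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_proba(centroids, x):
    """Turn (negative) distances to each centroid into softmax probabilities."""
    classes = sorted(centroids)
    d = np.array([np.linalg.norm(x - centroids[c]) for c in classes])
    p = np.exp(-d) / np.exp(-d).sum()
    return dict(zip(classes, p))

X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y)
print(predict_proba(centroids, np.array([0.8, 0.8])))  # higher probability for class 1
```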

Page 5

ML supervised process (1/2)

Pre-processing of data
Build input patterns appropriate for feeding into our supervised learning algorithm. This includes scaling and preparation of data;

Create data sets for training and evaluation
Randomly split the universe of data patterns (a minimal sketch of this split follows these steps). The training set is made of the data used by the classifier to learn the internal feature correlations, whereas the evaluation set is used to validate the already trained model in order to get an error rate (or other validation measures) that can help to identify the performance and accuracy of the classifier. Typically you will use more training data than validation data;

Training of the model
We execute the model on the training data set. The output result consists of a model that (in the successful case) has learned how to predict the outcome when new unknown data are submitted;
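A minimal sketch of the random training/evaluation split described in the steps above (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for a reproducible split

def train_eval_split(X, y, train_fraction=0.8):
    """Randomly split the universe of patterns into training and evaluation sets."""
    idx = rng.permutation(len(X))       # random shuffle of pattern indices
    cut = int(train_fraction * len(X))  # more training data than validation data
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```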

Page 6

ML supervised process (2/2)

Validation
After we have created the model, a test of its performance accuracy, completeness and contamination (or its dual, the purity) is of course required; a sketch of these metrics follows these steps. It is particularly crucial to do this on data that the model has not seen yet. This is the main reason why, in the previous steps, we separated the data set into training patterns and a subset of the data not used for training.

Use
If validation was successful, the model has correctly learned the underlying real problem. From then on we can proceed to use the model to classify/predict new data.

Verify Model
Verify and measure the generalization capabilities of the model. It is very easy to learn every single combination of input vectors and their mappings to the output as observed on the training data, and we can achieve a very low error in doing so, but how do the very same rules or mappings perform on new data that may have different input-to-output mappings? If the classification error on the validation set is higher than the training error, then we have to go back and adjust the model parameters.
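A hedged sketch of the validation figures mentioned above (accuracy, completeness, purity and its dual, contamination) for a binary classifier; the formulas assume the usual true/false positive counts:

```python
import numpy as np

def validation_metrics(y_true, y_pred, positive=1):
    """Accuracy, completeness, purity and contamination for a binary classifier."""
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    accuracy = np.mean(y_true == y_pred)
    completeness = tp / (tp + fn)  # fraction of real positives recovered
    purity = tp / (tp + fp)        # fraction of predicted positives that are real
    return accuracy, completeness, purity, 1.0 - purity  # last value: contamination

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])
print(validation_metrics(y_true, y_pred))
```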

Page 7

Machine Learning - Supervised

[Diagram: the Knowledge Base is split into a Train Set and a Blind Test Set; the Train and Test phases are followed by the analysis of results.]

Page 8

Machine Learning - Supervised

[Diagram: the trained network is applied to new data from the world, producing new knowledge.]

Page 9

Glossary

§ Data can be tables, images, streaming vectors. They may be represented in the form of numbers, percentages, pixel values, literals, strings, probabilities, or any other entity giving information on a physical/conceptual/simulated event or phenomenon of our world.

§ Dataset is a set of samples representing a problem. All samples must be expressed in a uniform way (i.e. same dimensions and representation).

§ Pattern is a sequence of symbols/values identifying a single sample of any dataset.

§ Feature is an atomic element of a pattern, i.e. a number or symbol representing one characteristic of the pattern (carrier of hidden information).

§ Target (supervised dataset) is usually a label (number/symbol) or a set of labels representing the solution (desired/known output) of a single pattern. If it is unknown or missing, the pattern belongs to the unsupervised category of datasets.

§ Base of Knowledge (BoK) or KB (Knowledge Base) is the ensemble of datasets in which the patterns contain the target (known solutions to a real problem). It is always available for supervised ML.

Page 10

Examples of BoK - Astronomy

Page 11

Examples of BoK - Astronomy

UMAG,GMAG,RMAG,IMAG,ZMAG,nuv,fuv,YMAG,JMAG,HMAG,KMAG,w1,w2,w3,w4,zspec
20.38,20.46,20.32,20.09,20.04,0.65,3.21,19.28,18.963,19.286,17.505,16.828,15.238,12.238,8.579,1.824
19.465,19.368,19.193,19.015,0.219,1.397,18.29,17.76,16.97,15.77,14.26,13.2,10.76,8.158,0.459,1.934
17.995,17.934,17.873,1.865,0.132,16.863,16.597,15.902,14.75,13.33,12.28,9.5,7.37,0.478,20.49,2.247
20.13,20.36,1.43,4.22,19.906,19.409,18.427,17.935,17.076,15.589,12.619,8.863,1.4365,8.15,0.45,1.93

Dataset: a set of galaxies observed by a space telescope. Each pattern has 15 features (emission fluxes at different wavelengths) + one target (the redshift of each galaxy, a measure of its recession velocity).

Usually such a BoK is made of hundreds of thousands of galaxies (patterns). The ML problem is to learn to predict the redshift of new objects observed in further space missions.
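A minimal sketch of how such a catalog could be loaded for a regression experiment; the file name "catalog.csv" and the column layout are assumptions made for illustration:

```python
# Load a photometric catalog like the one above into a feature matrix
# (15 magnitudes/fluxes) and a target vector (the spectroscopic redshift).
import numpy as np

data = np.genfromtxt("catalog.csv", delimiter=",", skip_header=1)
X = data[:, :15]  # UMAG ... w4: the 15 photometric features
y = data[:, 15]   # zspec: the redshift target to be learned
```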

Page 12

Examples of BoK - Astronomy

Dataset: a large multi-band image of a nebula, with millions of patterns of galaxies and stars (their spectra). The features are the peaks in the object spectrum and the target is the type of object. ML problem: learn to classify objects (such as star/galaxy separation).

[Images: an example spectrum labeled "star" and an example spectrum labeled "galaxy".]

Page 13

Examples of BoK – Web Traffic

Dataset: a huge amount of TCP/UDP packets over the network, to be classified while respecting privacy and evaluating their impact on the CPU load and transmission frequency.

Page 14

Examples of BoK – fusion reactors

In the core of a tokamak there is the vacuum vessel, where the fusion plasma is confined by means of strong magnetic fields and plasma currents (up to 4 tesla and 5 megaamperes).

Goals:
a) to "surf" JET discharges, finding the list of discharges with the parameters required by the user;
b) to help decision-making;
c) to integrate new and existing diagnostics, fault detection and processing;
d) to monitor and improve data quality and the validation of scientific production;
e) to understand the effective use of a diagnostic, in order to improve the efficiency of data storage and production, but also to identify redundant, unusable or false data;
f) to search for data analyses that can be completely substituted by a simple query (with an easy and quick interface);
g) to develop new diagnostic systems: for example, producing a cross-correlation between measurements of the same physical quantity made by different techniques;
h) to obtain important and unpredictable results using a data mining system that allows the discovery of hidden connections in the data;
i) cost reduction for analysis tools and maintenance.

Page 15

Examples of BoK – facial recognition

Goal: identification of a face among millions of image samples, based on the dimension reduction of the facial parameter space and pattern recognition.

Page 16

Examples of BoK – smoke & fire detection

fast smoke/fire detection, on-line alert

[Block diagram: smoke detection prototype, rel. 0.1. Detection modules: CBS (Chromaticity-Based background Subtraction), SMRD (Slow Moving Regions Detection), SCRD (Smoke Colored Regions Detection), SDR (Shadow Detection and Removal), GRD (Grayed Regions Detection), combined by a CA (Compound Algorithm) under oracle supervision. Each video sample undergoes background model extraction and background subtraction (moving regions); the validated moving regions are cross-checked; the evaluation of the MSE between the decision output and the decision target drives an LMS (Least Mean Square) weight update (on EVENT MATCH the event is output, on ERROR the weights are updated) before a new cycle starts with the next video sample. Inputs: parameter setup, configuration files, decision weight initialization.]

Page 17

Examples of BoK – wine classification

Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

1) Alcohol
2) Malic acid
3) Ash
4) Alkalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

14) Target class:
1) Aglianico
2) Falanghina
3) Lacryma christi

14.23 1.71 2.43 15.6 127 2.8 3.06 .28 2.29 5.64 1.04 3.92 1065 1
13.2 1.78 2.14 11.2 100 2.65 2.76 .26 1.28 4.38 1.05 3.4 1050 1
13.16 2.36 2.67 18.6 101 2.8 3.24 .3 2.81 5.68 1.03 3.17 1185 2
14.37 1.95 2.5 16.8 113 3.85 3.49 .24 2.18 7.8 .86 3.45 1480 3
13.24 2.59 2.87 21 118 2.8 2.69 .39 1.82 4.32 1.04 2.93 735 1
14.2 1.76 2.45 15.2 112 3.27 3.39 .34 1.97 6.75 1.05 2.85 1450 3
14.39 1.87 2.45 14.6 96 2.5 2.52 .3 1.98 5.25 1.02 3.58 1290 2
14.06 2.15 2.61 17.6 121 2.6 2.51 .31 1.25 5.05 1.06 3.58 1295 2
…

Page 18

Examples of BoK - Medicine

Machine learning systems can be used to develop the knowledge bases used by expert systems. Given a set of clinical cases that act as examples, a machine learning system can produce a systematic description of those clinical features that uniquely characterize the clinical conditions. This knowledge can be expressed in the form of simple rules, or often as a decision tree.

It is possible, using patient data, to automatically construct pathophysiological models that describe the functional relationships between the various measurements. For example, a learning system can take real-time patient data obtained during cardiac bypass surgery and then create models of normal and abnormal cardiac physiology. These models might be used to look for changes in a patient's condition if used at the time they are created. Alternatively, if used in a research setting, these models can serve as initial hypotheses that can drive further experimentation.

One particularly exciting development has been the use of learning systems to discover new drugs. The learning system is given examples of one or more drugs that weakly exhibit a particular activity and, based upon a description of the chemical structure of those compounds, suggests which of the chemical attributes are necessary for that pharmacological activity.

Page 19

ML Functionalities

In the DM scenario, the choice of ML model should always be accompanied by the choice of the functionality domain, i.e. the functional context in which the exploration of data is performed. To be more precise, several ML models can be used in the same functionality domain.

Examples of such domains are:

Dimensional reduction;
Classification;
Regression;
Clustering;
Segmentation;
Forecasting;
Data Model Filtering;
Statistical data analysis.

Page 20

The core of Machine Learning

Whatever the functionality or the model of interest in the machine learning context, the key point is always the concept of LEARNING.

More practically, keeping in mind the functional taxonomy described in the previous slide, there are essentially four kinds of learning related to ML for DM:

Learning by association;
Learning by classification;
Learning by prediction;
Learning by grouping (clustering).

Page 21

Learning by association

Learning by association consists of the identification of any structure hidden in the data. It does not mean identifying the membership of patterns in specific classes, but predicting the values of any feature attribute by simply recalling it, i.e. by associating it with a particular state or sample of the real problem.

It is evident that in the case of association we are dealing with very generic problems, i.e. those requiring less precision than in the classification case. In fact, the complexity grows with the range of possible multiple values for the feature attributes, potentially causing a mismatch in the association results.

In practical terms, fixed percentage thresholds are given in order to reduce the occurrence of mismatches for different association rules, based on the experience gained on that problem and the related data. The representation of data for associative learning is thus based on the labeling of features with non-numerical values or by alpha-numeric coding.
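To make the threshold idea concrete, here is a toy sketch (not from the slides) that keeps an association rule only if its confidence exceeds a fixed percentage threshold; the transactions and the rule itself are made up:

```python
# Toy association-rule check: keep the rule only if its confidence
# passes a fixed percentage threshold, as described above.
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]

def confidence(antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    has_a = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_a if consequent <= t]
    return len(has_both) / len(has_a)

THRESHOLD = 0.6  # fixed percentage threshold to reduce mismatches
c = confidence({"bread"}, {"milk"})
print(f"bread -> milk: confidence {c:.2f}, kept: {c >= THRESHOLD}")
```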

Page 22

Learning by classification

Classification learning is often simply named "supervised" learning, because the process of learning the right assignment of a label to a datum, representing its category or "class", is usually done by examples. Learning by examples stands for a training scheme operating under the supervision of an oracle, able to provide the correct, already known, outcome for each training sample. And this outcome is properly a class or category of the examples. Its representation depends on the available Base of Knowledge (BoK) and on its intrinsic nature, but in most cases it is based on a series of numerical attributes, related to the extracted BoK, organized and submitted in a homogeneous way.

The success of classification learning is usually evaluated by trying out the acquired feature description on an independent set of data, having known output but never submitted to the model before.

Page 23

Learning by prediction

Slightly different from the classification scheme is prediction learning. In this case the outcome consists of a numerical value instead of a class label (this is often called REGRESSION). The numeric prediction is obviously related to a quantitative result, because the predicted value is much more interesting than the structure of the concept behind the numerical outcome.

Page 24

Learning by clustering

Whenever there is no class attribution, clustering learning is used to group data that show naturally similar features. Of course the challenge of a clustering experiment is to find these clusters and assign input data to them.

The data can be given in the form of categorical/numerical tables, and the success of a clustering process can be evaluated in terms of human experience on the problem, or a posteriori by means of a second step of the experiment, in which a classification learning process is applied in order to learn an intelligent mechanism for how new data samples should be clustered.

The clustering can be performed top-down (from the largest clusters down to single points) or bottom-up (from single points up to larger clusters). Both types may be represented by dendrograms.
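A minimal sketch of bottom-up (agglomerative) clustering and its dendrogram using SciPy; the two-blob data set is illustrative:

```python
# Bottom-up clustering: repeatedly merge the closest groups, then draw
# the dendrogram that represents the merge hierarchy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),   # first natural group
               rng.normal(2, 0.3, (10, 2))])  # second natural group

Z = linkage(X, method="average")  # agglomerative (bottom-up) merging
dendrogram(Z)
plt.show()
```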

Page 25

How to learn data?

Among the wide variety of possible applications for ML, DM is of course one of the most important, but also the most challenging. Users encounter problems that grow with the size of the data set to be investigated. Finding hidden relationships between multiple features in thousands of patterns is hard, especially considering the limited capacity of the human brain to form a clear picture in a parameter space of more than 3 dimensions.

In order to discuss the learning of data in depth, we recall the paradigm of ML, distinguishing between data where features are provided with known labels (target attributes), defined as supervised learning, and data where features are unlabeled, called unsupervised learning. With these concepts in mind we can discuss in the next sections the wide-ranging issues of both kinds of ML.

Page 26

Machine Learning & Statistical Models

Neural Networks
• Feed Forward: Perceptron; Multi-Layer Perceptron; Radial Basis Functions
• Recurrent / Feedback: Competitive Networks; Hopfield Networks; Adaptive Reasoning Theory

Statistical Models
• Bayesian Networks; Hidden Markov Models; Mixture of Gaussians; Principal Probabilistic Surface; Maximum Likelihood; χ²; Negentropy

Decision Analysis
• Decision Trees; Random Decision Forests; Evolving Trees; Minimum Spanning Trees

Hybrid (Soft Computing)
• Fuzzy Sets; Genetic Algorithms; K-Means; Principal Component Analysis; Support Vector Machine

Page 27

Artificial Neural Networks

An Artificial Neural Network:
- consists of simple, adaptive processing units, called neurons;
- the neurons are interconnected, forming a large network;
- parallel computation, often in layers;
- nonlinearities are used in computations.

An important property of neural networks is learning from input data:
- with a teacher (supervised learning);
- without a teacher (unsupervised learning).

Artificial neural networks have their roots in:
- neuroscience;
- mathematics and statistics;
- computer science;
- engineering.

Neural computing was inspired by computing in human brains.

Page 28

ANN - introduction

Application areas of neural networks:
– modeling
– time series processing
– pattern recognition
– signal processing
– automatic control

Neural networks resemble the brain in two respects:
1. The network acquires knowledge from its environment using a learning process (algorithm).
2. Synaptic weights, which are inter-neuron connection strengths, are used to store the learned information.

[Figure: fully connected 10-4-2 feedforward network with 10 source (input) nodes, 4 hidden neurons, and 2 output neurons.]

Principle of neural modeling:
- The inputs are known or they can be measured.
- The behavior of the outputs is investigated when the input varies.
- All information has to be converted into vector form.

Page 29

Benefits of ANNs

Nonlinearity
- Allows modeling of nonlinear functions and processes.
- Nonlinearity is distributed through the network.
- Each neuron typically has a nonlinear output.
- Using nonlinearities has drawbacks: local minima.

Input-Output Mapping
- In supervised learning, the input-output mapping is learned from training data.
- For example, known prototypes in classification.
- Typically, some statistical criterion is used.
- The synaptic weights (free parameters) are modified to optimize the criterion.

Adaptivity
- Weights (parameters) can be retrained with new data.
- The network can adapt to a non-stationary environment.
- However, the changes must be slow enough.

Evidential Response
Contextual Information
Fault Tolerance
VLSI (Very Large Scale Integration) Implementability
Uniformity of Analysis and Design

Neurobiological Analogy
- Human brains are fast, powerful, fault tolerant, and use massively parallel computing.
- Neurobiologists try to explain the operation of human brains using artificial neural networks.
- Engineers use neural computation principles for solving complex problems.

Page 30

Model of a Neuron

A neuron is the fundamental information processing unit of a neural network. The neuron model consists of three (or four) basic elements:

A set of synapses or connecting links:
- characterized by weights (strengths);
- x_j denotes the signal at the input of synapse j;
- when connected to neuron k, x_j is multiplied by the synaptic weight w_kj;
- weights are usually real numbers.

An adder (linear combiner):
- sums the weighted inputs w_kj x_j.

An activation function:
- applied to the output of a neuron, limiting its value;
- typically a nonlinear function;
- also called a squashing function.

Sometimes a neuron includes an externally applied bias term b_k.

Page 31

Neuron mathematics

Mathematical equations describing neuron k:

$$u_k = \sum_{j=1}^{m} w_{kj}\, x_j \qquad (1)$$

$$y_k = \varphi(u_k + b_k) \qquad (2)$$

Here:
- $u_k$ is the linear combiner output;
- $\varphi(\cdot)$ is the activation function;
- $y_k$ is the output signal of the neuron;
- $x_1, x_2, \ldots, x_m$ are the m input signals;
- $w_{k1}, w_{k2}, \ldots, w_{km}$ are the respective m synaptic weights.

A mathematically equivalent representation: add an extra synapse with input $x_0 = +1$ and weight $w_{k0} = b_k$. The equations are now slightly simpler:

$$u_k = \sum_{j=0}^{m} w_{kj}\, x_j \qquad (1)$$

$$y_k = \varphi(u_k) \qquad (2)$$
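Equations (1) and (2) translate directly into code; the weights, bias and tanh activation below are illustrative choices:

```python
# A single neuron: linear combiner plus activation function.
import numpy as np

def neuron(x, w, b, phi=np.tanh):
    u = np.dot(w, x)   # linear combiner output, eq. (1)
    return phi(u + b)  # activation applied to u_k + b_k, eq. (2)

x = np.array([0.5, -1.0, 2.0])  # input signals
w = np.array([0.1, 0.4, -0.2])  # synaptic weights
print(neuron(x, w, b=0.3))
```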

Page 32

Typical activation functions

Threshold function: $\varphi(u) = 1$ if $u \ge 0$; $\varphi(u) = 0$ if $u < 0$.

Piecewise-linear function: saturates at 1 and 0.

Sigmoid function:
• most commonly used in ANNs;
• the figure shows the logistic function defined by (3);
• the slope parameter a is important;
• as a goes to infinity, the logistic sigmoid approaches the threshold function;
• a continuous balance between linearity and non-linearity;
• tanh(au) allows the activation function to have negative values (indeed, it is one of the most used when the network parameters (weights) are normalized in [−1, +1]).

$$\varphi(u) = \frac{1}{1 + e^{-au}} \qquad (3)$$
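The activation functions above, written out; note how a large slope parameter a makes the logistic function approach the threshold function:

```python
import numpy as np

def threshold(u):
    return np.where(u >= 0, 1.0, 0.0)

def logistic(u, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * u))  # eq. (3), slope parameter a

def tanh_act(u, a=1.0):
    return np.tanh(a * u)                # allows negative values in [-1, +1]

u = np.linspace(-4, 4, 9)
print(logistic(u, a=10.0))  # large slope: close to the threshold function
```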

Page 33

Stochastic model of a neuron

The activation function of the McCulloch-Pitts early neuronal model (1943) is the threshold function.

The neuron is permitted to reside in only two states, say x = +1 and x = −1.

In the stochastic model, a neuron fires (switches its state x) according to a probability:
- the state is x = +1 with probability P(v);
- the state is x = −1 with probability 1 − P(v).

A standard choice for the probability is the sigmoid-type function P(v) = 1 / [1 + exp(−v/T)], where T is a parameter controlling the uncertainty in firing, called pseudotemperature.

Neural networks can be represented in terms of signal-flow graphs. The nonlinearities appearing in a neural network cause two different types of links (branches) to appear:
1. Synaptic links, having a linear input-output relation: $y_k = w_{kj}\, x_j$
2. Activation links, with a nonlinear input-output relation: $y_k = \varphi(x_j)$
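A small sketch of the stochastic neuron: the state is sampled from P(v) = 1 / [1 + exp(−v/T)]; the pseudotemperature value used is illustrative:

```python
# Stochastic neuron: flip to +1 with probability P(v), else -1.
import numpy as np

rng = np.random.default_rng(1)

def stochastic_state(v, T=1.0):
    p = 1.0 / (1.0 + np.exp(-v / T))  # firing probability, pseudotemperature T
    return 1 if rng.random() < p else -1

print([stochastic_state(0.5, T=0.1) for _ in range(10)])  # low T: nearly deterministic
```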

Page 34

Neurons as signal-flow graphs

A signal-flow graph consists of directed branches:
• the branches sum up in nodes;
• each node j has a signal x_j;
• branch kj starts from node j and ends at node k; w_kj is the synaptic weight corresponding to the signal damping.

Three basic rules:
– Rule 1: the signal flows only in the direction of the arrow; the signal strength is multiplied by the strengthening factor w_kj.
– Rule 2: node signal = sum of the incoming signals from the branches.
– Rule 3: the node signal is transmitted to each outgoing branch.

Page 35

Neuron as an architectural graph

Page 36

Single-loop feedback system

Feedback: the output of an element of a dynamic system affects the input of that element.
• Thus in a feedback system there are closed paths.
• Feedback appears almost everywhere in natural nervous systems.
• Important in recurrent networks.
• Signal-flow graph of a single-loop feedback system.

Page 37

ANN are feedback systems

There are three fundamentally different classes of network architectures

Page 38

1 - Single-layer feed-forward network

The simplest form of neural networks.
• The input layer of source nodes projects onto an output layer of neurons (computation nodes).
• The network is strictly of a feedforward or acyclic type, because there is no feedback.
• Such a network is called a single-layer network.

Page 39

Single-Layer: Perceptron

The SLP network that emulates the AND function:
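A hedged sketch of such an SLP with hand-picked weights (one valid choice among many) emulating AND with a threshold activation:

```python
# Single-layer perceptron emulating AND: one neuron, threshold activation.
def slp_and(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

for a in (0, 1):
    for b_ in (0, 1):
        print(a, b_, "->", slp_and(a, b_))  # fires only for (1, 1)
```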

Page 40

2 - Multi-Layer feed-forward networks

In a multilayer network, there are one or more hidden layers.
• Their computation nodes are called hidden neurons or hidden units.
• Hidden neurons can extract higher-order statistics and acquire more global information.
• Typically, the input signals of a layer consist of the output signals of the preceding layer only.

Page 41

Multi-Layer Perceptron

The MLP network that emulates the XOR function:
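A hedged sketch of a minimal MLP emulating XOR; the hidden units compute OR and AND and the output combines them, with weights that are one valid choice among many:

```python
# Two-layer perceptron emulating XOR: XOR(x) = OR(x) AND NOT(AND(x)).
import numpy as np

step = lambda u: (u >= 0).astype(float)  # threshold activation

W1 = np.array([[1.0, 1.0],   # hidden unit 1: OR
               [1.0, 1.0]])  # hidden unit 2: AND
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])   # output: OR minus twice AND
b2 = -0.5

def mlp_xor(x):
    h = step(W1 @ x + b1)
    return step(W2 @ h + b2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", int(mlp_xor(np.array(x, dtype=float))))
```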

Page 42

Multilayer Perceptron (MLP)

$$\bar{x} = (x_1, x_2, \ldots, x_d) \qquad \text{input values}$$

$$a_j^{(1)} = \sum_{i=1}^{d} w_{ji}^{(1)} x_i + b_j^{(1)}, \quad j = 1, \ldots, M \qquad \text{hidden layer neuron input}$$

$$z_j = f\big(a_j^{(1)}\big), \quad j = 1, \ldots, M \qquad \text{hidden layer neuron output}$$

$$a_k^{(2)} = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + b_k^{(2)}, \quad k = 1, \ldots, c \qquad \text{output layer neuron input}$$

$$y_k = a_k^{(2)} \qquad \text{output layer neuron output}$$

The main training algorithm for an MLP is back-propagation. The error function can be sum-of-squares or cross-entropy, depending on the particular type of problem at hand (regression or classification).

MLP neural network
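The forward-pass equations above translate directly into code; the layer sizes and random weights below are illustrative:

```python
# MLP forward pass: d inputs, M hidden units with activation f, c linear outputs.
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, f=np.tanh):
    a1 = W1 @ x + b1  # hidden layer neuron input
    z = f(a1)         # hidden layer neuron output
    a2 = W2 @ z + b2  # output layer neuron input
    return a2         # linear output: y_k = a_k^(2)

rng = np.random.default_rng(0)
d, M, c = 3, 5, 2
W1, b1 = rng.normal(size=(M, d)), np.zeros(M)
W2, b2 = rng.normal(size=(c, M)), np.zeros(c)
print(mlp_forward(rng.normal(size=d), W1, b1, W2, b2))
```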

Page 43

3 – Recurrent Networks

A recurrent neural network has at least one feedback loop.
• In a feedforward network there are no feedback loops.
• Recurrent network with:
  - no self-feedback loops to the "own" neuron;
  - no hidden neurons.
• The feedback loops have a profound impact on the learning capability and performance of the network.
• The unit-delay elements result in nonlinear dynamical behavior if the network contains nonlinear elements.

Page 44

Knowledge representation

Definition: knowledge refers to stored information or models used by a person or machine to interpret, predict, and appropriately respond to the outside world.

• In knowledge representation one must consider:
1. what information is actually made explicit;
2. how the information is physically encoded for subsequent use.

• A well performing neural network must represent the knowledge in an appropriate way.
• A real design challenge, because there are highly diverse ways of representing information.
• A major task for a neural network: learn a model of the world (environment) where it is working.

Two kinds of information about the environment:
1. Prior information = the known facts.
2. Observations (measurements). Usually noisy, but they give examples for training the neural network.
• The examples can be:
- labeled, with a known desired response (target output) to an input signal;
- unlabeled, consisting of different realizations of the input signal.
• A set of pairs, each consisting of an input and the corresponding desired response, forms a set of training data or a training sample.

Page 45

Knowledge representation

An example: handwritten digit recognition.

• Input signal: a digital image with black and white pixels.
• Each image represents one of the 10 possible digits.
• The training sample consists of a large variety of hand-written digits from a real-world situation.
• An appropriate architecture in this case:
- input signals consist of image pixel values;
- 10 outputs, each corresponding to a digit class.
• Learning: the network is trained using a suitable algorithm with a subset of examples.
• Generalization: after this, the recognition performance of the network is tested with data not used in learning.

Page 46

Rules for knowledge representation

The free parameters (synaptic weights and biases) represent knowledge of the surrounding environment.
• Four general rules for knowledge representation.

• Rule 1. Similar inputs from similar classes should produce similar representations inside the network, and they should be classified into the same category.

• Let $x_i$ denote the column vector $x_i = [x_{i1}, x_{i2}, \ldots, x_{im}]^T$.

Page 47

Rules for knowledge representation

• Rule 2: Items to be categorized as separate classes should be given widely different representations in the network.

• Rule 3: If a particular feature is important, there should be a large number of neurons involved in representing it in the network.

• Rule 4: Prior information and invariances should be built into the design of a neural network.

Page 48

How to build invariance in ANNs

Classification systems must be invariant to certain transformations, depending on the problem.
• For example, a system recognizing objects from images must be invariant to rotations and translations.
• At least three techniques exist for making classifier-type neural networks invariant to transformations:

1. Invariance by Structure
- Synaptic connections between the neurons are created so that transformed versions of the same input are forced to produce the same output.
- Drawback: the number of synaptic connections tends to grow very large.
2. Invariance by Training
- The network is trained using different examples of the same object corresponding to different transformations (for example rotations).
- Drawbacks: computational load, generalization mismatch for other objects.
3. Invariant Feature Space
- Try to extract features of the data invariant to transformations.
- Use these instead of the original input data.
- Probably the most suitable technique for neural classifiers.
- Requires prior knowledge of the problem.

Page 49

How to build invariance in ANNs

• Optimization of the structure of a neural network is difficult.
• Generally, a neural network acquires knowledge about the problem through training.
• The knowledge is represented in a distributed and compact form by the synaptic connection weights.
• Neural networks are not able to handle uncertainty, and do not gain from the probabilistic evolution of data samples.
• A possible solution: integrate a neural network and artificial intelligence into a hybrid system.

[Diagram: a hybrid system combining Neural Networks (connectionism, learning, generalization), Fuzzy Logic (uncertainty, incompleteness, approximation) and Genetic Algorithms (optimization, randomness, evolution) to achieve robustness.]

Page 50

MLP – learning – Back Propagation

[Diagram: back propagation flow. A forward phase propagates the input patterns through the layers; a backward phase back-propagates the output error. Callouts: output error, stopping threshold, activation function, law for updating hidden weights, learning rate, and momentum used to jump over the error surface.]

Page 51

Back Propagation learning rule

Other typical problems of the back-propagation algorithm are the speed of convergence and the possibility of ending up in a local minimum of the error function. Back Propagation requires that the activation function used by the artificial neurons (or "nodes") is differentiable. The main formulas are:

• (3) and (4) are the activation functions for a neuron of, respectively, the hidden layer and the output layer. This is the mechanism to process and flow the input pattern signal through the "forward" phase;
• at the end of the "forward" phase the network error is calculated (the inner argument of (5)), to be used during the "backward" or top-down phase to modify (adjust) the neuron weights;
• (5) and (6) are the descent gradient calculations of the "backward" phase, respectively for a generic neuron of the output and hidden layer;
• (7) and (8) are the most important laws of the backward phase. They represent the weight modification laws, respectively between the output and hidden layers (7) and between the hidden and input layers (or hidden and hidden, if more than one hidden layer is present in the network topology) (8). The new weights are adjusted by adding two terms to the old ones:

- $\eta\,\delta_h f(j)$: the descent gradient multiplied by a parameter, defined as the "learning rate", generally chosen sufficiently small in [0, 1[, in order to induce a smooth learning variation at each backward stage during training;
- $\alpha\,\Delta w^{old}_{jh}$: the previous weight variation multiplied by a parameter, defined as the "momentum", generally chosen quite high in [0, 1[, in order to give a large change to the weights to escape "local minima".

$$w^{new}_{jh} = w^{old}_{jh} + \eta\,\delta_h f(j) + \alpha\,\Delta w^{old}_{jh} \qquad (7)$$
$$w^{new}_{hi} = w^{old}_{hi} + \eta\,\delta_i f(h) + \alpha\,\Delta w^{old}_{hi} \qquad (8)$$
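A minimal sketch of the update laws (7)-(8) in code, with the learning-rate term and the momentum term kept separate; the product of δ and f is condensed into a precomputed gradient term for brevity:

```python
# One back-propagation weight update: gradient step plus momentum,
# following the structure of laws (7)-(8).
import numpy as np

def bp_update(w_old, delta_w_old, descent_grad, eta=0.1, alpha=0.9):
    """w_new = w_old + eta * descent_grad + alpha * delta_w_old."""
    delta_w = eta * descent_grad + alpha * delta_w_old  # learning rate + momentum
    return w_old + delta_w, delta_w  # new weights and stored variation

w = np.zeros(3)
dw = np.zeros(3)
w, dw = bp_update(w, dw, descent_grad=np.array([0.2, -0.1, 0.05]))
print(w, dw)
```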

Page 52

BP – Regression error estimation

MLP with BP can be used for regression and classification problems. They are always treated as optimization problems (i.e. minimization of the error function to achieve the best training in a supervised fashion).

For regression problems the basic goal is to model the conditional distribution of the output variables, conditioned on the input variables.

This motivates the use of a sum-of-squares error function. But for classification problems the sum-of-squares error function is not the most appropriate choice.

Page 53

BP – Classification error estimation

By assigning a sigmoidal activation function to the output layer of the neural network, the output can be interpreted as posterior probabilities. In fact, the outputs of a network trained by minimizing a sum-of-squares error function approximate the posterior probabilities of class membership, conditioned on the input vector, under the maximum likelihood principle, assuming that the target data were generated from a smooth deterministic function with added Gaussian noise. For classification problems, however, the targets are binary variables and hence far from having a Gaussian distribution, so they cannot be described by a Gaussian noise model. Therefore a more appropriate choice of error function is needed.

Let us now consider problems involving two classes. One approach would be to use a network with two output units, one for each class. Let us instead discuss an alternative approach, in which we consider a network with a single output y. We would like the value of y to represent the posterior probability $P(C_1|x)$ for class $C_1$. The posterior probability of class $C_2$ will then be given by $P(C_2|x) = 1 - y$.

This can be achieved if we consider a target coding scheme for which t = 1 if the input vector belongs to class $C_1$ and t = 0 if it belongs to class $C_2$. We can combine these into a single expression, so that the probability of observing either target value is the Bernoulli distribution

$$P(t|x) = y^t (1-y)^{1-t}$$

and the likelihood of the whole training set is $\prod_n y_n^{t_n} (1-y_n)^{1-t_n}$. By minimizing the negative logarithm, we get the cross-entropy error function in the form

$$E = -\sum_n \big[\, t_n \ln y_n + (1-t_n) \ln(1-y_n) \,\big]$$
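The cross-entropy error function derived above, written out (the clipping guard is an implementation detail added here to keep the logarithms finite):

```python
# Cross-entropy error for binary targets t in {0, 1} and outputs y in (0, 1).
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1, 0, 1, 1])
y = np.array([0.9, 0.2, 0.7, 0.6])
print(cross_entropy(t, y))
```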

Page 54

MLP heuristic rules – activation function

If there are good reasons to select a particular activation function, then do it:
- Mixture of Gaussians → Gaussian activation function;
- Hyperbolic tangent;
- Arctangent;
- Linear threshold.

General "good" properties of an activation function:
- non-linear;
- saturates at some max and min value;
- continuity and smoothness;
- monotonicity: convenient but nonessential;
- linearity for small values of net.

The sigmoid function has all the good properties:
- centered at zero;
- anti-symmetric: f(−net) = −f(net);
- faster learning;
- the overall range and slope are not important.

Page 55

MLP heuristic rules – activation function

We can also use the bipolar logistic function as the activation function in the hidden and output layers. Choosing an appropriate activation function can also contribute to much faster learning. Theoretically, a sigmoid function with a lower saturation speed will give a better result.

One can manipulate its slope and see how it affects the learning speed. A larger slope will make the weight values move faster towards the saturation region (faster convergence), while a smaller slope will make the weight values move more slowly, but allows a refined weight adjustment.

Page 56

MLP heuristic rules – scaling of data

Scaling input and target values:

Standardize
- A large scale difference means the error depends mostly on the large-scale features;
- shift to zero mean, unit variance;
- needs to be done once, before training;
- needs the full data set.

Target value
- If the target is set at a saturated output value, the output never reaches it during training and full training never terminates;
- the range [−1, +1] is suggested.
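A minimal sketch of the standardization step: fit the means and variances once on the full training set, then apply them to any new data:

```python
# Standardization to zero mean, unit variance per feature.
import numpy as np

def fit_standardizer(X):
    """Compute per-feature mean and standard deviation on the full data set."""
    return X.mean(axis=0), X.std(axis=0)

def standardize(X, mu, sigma):
    """Apply the stored statistics, so new data are scaled consistently."""
    return (X - mu) / sigma

X_train = np.random.default_rng(0).normal(5.0, 2.0, (100, 3))
mu, sigma = fit_standardizer(X_train)
print(standardize(X_train, mu, sigma).mean(axis=0))  # approximately zero
```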

Page 57

MLP heuristic rules – hidden nodes

The number of hidden units governs the expressive power of the net and the complexity of the decision boundary:

- well-separated classes → fewer hidden nodes;
- complicated distribution or large spread over the parameter space → many hidden nodes.

Heuristic rules of thumb:

- More training data yields better results;
- number of weights << number of training data;
- number of weights ≈ (number of training data)/10 (impossible for massive data);
- adjust the number of weights in response to the training data: start with a "large" number of hidden nodes, then decay, prune weights, etc.

Page 58

MLP heuristic rules – hidden layers

In the mathematical theory of ANNs, the universal approximation theorem states:

A standard MLP feed-forward network with a single hidden layer, which contains a finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.

The theorem was proved by George Cybenko in 1989 for a sigmoid activation function, thus it is also called the Cybenko theorem. Kurt Hornik proved in 1991 that it is not the specific activation function, but rather the feed-forward architecture itself, that allows ANNs to be universal approximators.

Then, in 1998, Simon Haykin added the conclusion that a 2-hidden-layer feed-forward ANN has more chances of converging to local minima than a single-hidden-layer network.

Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 2(4), 303-314.
Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks", Neural Networks, 4(2), 251-257.
Haykin, S. (1998). Neural Networks: A Comprehensive Foundation, 2nd edition, Prentice Hall. ISBN 0-13-273350-1.

Page 59

MLP heuristic rules – hidden layers

One or two hidden layers are OK, as long as the activation function is differentiable; but one layer is generally sufficient.

More layers may induce more chances of local minima.

Single hidden layer vs double (multiple) hidden layers:
- a single layer is good for any approximation of a continuous function;
- a double layer may be better sometimes, when the parameter space is widely spread.

Problem-specific reasons for more layers:
- each layer learns different aspects (different levels of non-linearity);
- each layer is a hyperplane performing a separation of the parameter space.

Recently, according to the experimental results discussed in Bengio & LeCun 2007, problems where the data are particularly complex and with a high variation in the parameter space should be treated by "deep" networks, i.e. with more than one computation layer.

Hence, the universal ANN theorem has evident limits! The choice must be driven by experience!!!

Page 60

MLP heuristic rules – weight initialization

- Do not set the weights to zero – no learning would take place;
- select a good seed for fast and uniform learning;
- the weights should reach their final equilibrium values at about the same time.

For standardized data:
- choose randomly from a single distribution;
- give positive and negative values equally: −ω < w < +ω;
- if ω is too small, the net activation is small – a linear model;
- if ω is too large, the hidden units will saturate before learning begins.

The particular initialization values influence the speed of convergence. There are several methods available for this purpose.

The most common is to initialize the weights at random with a uniform distribution inside a certain small range. In the MLP-BP we call this method HARD_RANDOM.

Another, better method is to bound the range as expressed in the equation below. We call this method just RANDOM.

Page 61

MLP heuristic rules – weight initialization

Widely known as a very good weight initialization method is the Nguyen-Widrow method; we call this method NGUYEN. The Nguyen-Widrow weight initialization algorithm can be expressed as the following steps:

Remember that:
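The slide's own step list was a figure lost in extraction; here is a hedged sketch of the Nguyen-Widrow initialization as it is commonly published (scale factor β = 0.7·M^(1/d) for M hidden units and d inputs; random weight vectors rescaled to norm β). Treat the exact constants as assumptions:

```python
# Hedged sketch of Nguyen-Widrow initialization (commonly published form,
# not necessarily identical to the slide's lost step list).
import numpy as np

rng = np.random.default_rng(0)

def nguyen_widrow(d, M):
    beta = 0.7 * M ** (1.0 / d)                       # scale factor
    W = rng.uniform(-0.5, 0.5, size=(M, d))           # random start
    W *= beta / np.linalg.norm(W, axis=1, keepdims=True)  # rescale each row to norm beta
    b = rng.uniform(-beta, beta, size=M)              # biases spread in [-beta, beta]
    return W, b

W, b = nguyen_widrow(d=4, M=10)
print(np.linalg.norm(W, axis=1))  # every hidden unit's weight norm equals beta
```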

Page 62

MLP heuristic rules – parameters

Momentum:
- has the benefit of preventing the learning process from terminating in a shallow local minimum;
- α is the momentum constant; the process converges if 0 ≤ |α| < 1, typical value = 0.9;
- α = 0: standard Back Propagation.

Learning rate:
- a smaller learning-rate parameter makes a smoother path;
- increase the rate of learning while avoiding the danger of instability;
- first choice: η ≈ 0.1;
- suggestion: η inversely proportional to the square root of the number of synaptic connections (m^(-1/2));
- it may change during training.

There is also an adaptive learning rule. The idea is to change the learning rate automatically based on the current error and the previous error. The formula is:

The idea is to observe the last two error rates in the direction that would have reduced the second error. The variables E and E_i are the current and previous errors. Parameter A determines how rapidly the learning rate is adjusted; it should be less than one and greater than zero.

Page 63

MLP heuristic rules – training error

Standard rules to evaluate the learning error are the MSE (Mean Square Error) and RMSE (Root MSE):

$$MSE = \frac{\sum_{k=1}^{NP} \big(t_k - Out(pat_k)\big)^2}{NP} \qquad RMSE = \sqrt{\frac{\sum_{k=1}^{NP} \big(t_k - Out(pat_k)\big)^2}{NP}}$$

$$E_k = \big(t_k - Out(pat_k)\big)^2$$

Sometimes it may happen that a better solution for the MSE is a worse solution for the net. To avoid this problem we can use the so-called convergence tube: for a given radius R, the error within R is set equal to 0, obtaining:

$$E_k = \big(t_k - Out(pat_k)\big)^2 \quad \text{if } \big(t_k - Out(pat_k)\big)^2 > R$$

$$E_k = 0 \quad \text{if } \big(t_k - Out(pat_k)\big)^2 \le R$$

Don’t believe it? Let’s see an example

Page 64

MLP heuristic rules – training error

A simple classification problem with two patterns, one of class 0 and one of class 1.

If the outputs are 0.49 for the class-0 pattern and 0.51 for the class-1 pattern, we have an efficiency of 100% (each pattern correctly classified) and an MSE of 0.24; a solution of 0.0 for class 0 and 0.49 for class 1 (efficiency of 50%) gives back an MSE equal to 0.13, so the algorithm will prefer this second kind of solution:

Class 0 → 0.49, Class 1 → 0.51: MSE = 0.24 and efficiency = 100% (OK)
Class 0 → 0.0, Class 1 → 0.49: MSE = 0.13 and efficiency = 50% (BAD)

But the selected one is the bad one! GASP!

Using the convergence tube with R = 0.25, in the first case we have an MSE equal to 0 and in the second MSE = 0.13, so the algorithm recognizes the first solution as better than the second:

Class 0 → 0.49 (Ek = 0), Class 1 → 0.51 (Ek = 0): MSE = 0.0 and efficiency = 100% (OK)
Class 0 → 0.0 (Ek = 0), Class 1 → 0.49 (Ek = 0.26): MSE = 0.13 and efficiency = 50% (BAD)

Now the selected one is the good one! OK!
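The example above, verified numerically: plain MSE prefers the "bad" solution, while the convergence tube with R = 0.25 prefers the "good" one:

```python
# MSE with an optional convergence tube of radius R.
import numpy as np

def mse(t, out, R=None):
    e = (t - out) ** 2
    if R is not None:
        e = np.where(e <= R, 0.0, e)  # errors inside the tube count as zero
    return e.mean()

t = np.array([0.0, 1.0])          # targets for class 0 and class 1
good = np.array([0.49, 0.51])     # 100% efficiency
bad = np.array([0.0, 0.49])       # 50% efficiency
print(mse(t, good), mse(t, bad))                   # 0.2401 0.1301 -> bad wins
print(mse(t, good, R=0.25), mse(t, bad, R=0.25))   # 0.0    0.1301 -> good wins
```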

Page 65

Radial Basis Functions

A linear model for a function y(x) takes the form

$f(x) = \sum_{j=1}^{m} w_j h_j(x)$

i.e. the model f is expressed as a linear combination of a set of m fixed functions, often called basis functions.

Any set of functions can be used as the basis set, although it helps if they are well behaved (differentiable). Typical choices are combinations of sinusoidal waves (Fourier series) and logistic functions $h(x) = 1/(1 + e^{-x})$, common in ANNs.

Radial functions are a special class of functions: their response decreases (or increases) monotonically with the distance from a central point. The center, the distance scale and the precise shape of the radial function are parameters of the neural model which uses radial functions as neuron activations.
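A one-line Gaussian, the most common radial function, illustrates this (names and values are arbitrary):

import numpy as np

# Gaussian radial function: the response decreases monotonically with the
# distance of x from the centre c, while the radius r sets the distance scale.
def gaussian_rbf(x, c, r):
    return np.exp(-np.sum((x - c) ** 2) / r ** 2)

x = np.array([1.0, 2.0])
print(gaussian_rbf(x, c=np.zeros(2), r=2.0))   # response fades with distance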

Radial Functions

[Slide of figures showing typical radial function shapes.]

Radial Basis Function Networks

Radial basis function networks are a special kind of MLP, where each of the n components of the input vector x feeds forward to m basis functions, whose outputs are linearly combined with weights $w_j$ into the network model output f(x). RBFs are specialized as approximators of functions.

When applied to supervised learning, the least squares principle leads to a particularly easy optimization problem. If the model is

$f(x) = \sum_{j=1}^{m} w_j h_j(x)$

and the training set is $\{(x_p, \hat{y}_p)\}_{p=1}^{n}$, then the least squares recipe is to minimize the sum squared error

$E = \sum_{p=1}^{n} (\hat{y}_p - f(x_p))^2$

If a weight penalty term is added to the sum squared error, then the minimized cost function is

$C = \sum_{p=1}^{n} (\hat{y}_p - f(x_p))^2 + \lambda \sum_{j=1}^{m} w_j^2$
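Because the model is linear in the weights $w_j$, the penalized minimization has a closed-form solution; below is a NumPy sketch assuming Gaussian basis functions with fixed centres (all names and the toy data are illustrative):

import numpy as np

# RBF least-squares training with fixed Gaussian centres: the design matrix H
# holds h_j(x_p); the weights minimizing the penalized sum squared error
# solve the regularized normal equations (H^T H + lambda I) w = H^T y.
def rbf_design_matrix(X, centres, r):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / r ** 2)

def train_rbf(X, y, centres, r, lam=1e-3):
    H = rbf_design_matrix(X, centres, r)
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

# Toy usage: fit y = sin(x) with 10 centres spread over the input range.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, (100, 1))
y = np.sin(X[:, 0])
centres = np.linspace(0, 2 * np.pi, 10)[:, None]
w = train_rbf(X, y, centres, r=1.0)
f = rbf_design_matrix(X, centres, 1.0) @ w      # network output f(x)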

MLPQNA - I

MLPQNA is a traditional MLP that implements, as its training algorithm, the Quasi Newton Approximation (QNA) model.

Newton's method uses the computation of second derivatives, hence of the Hessian, to determine the minimum of the error. In many cases, however, this computation turns out to be computationally too complex.

QNA is an optimization of the training algorithm based on a statistical approximation of the Hessian, through a cyclic computation of the gradient, which is at the basis of the back-propagation method.

Instead of computing the Hessian matrix, QNA performs a series of intermediate steps of lower computational cost, in order to generate a sequence of matrices that become increasingly accurate approximations of H.

TDA: MLP neural network + QNA

MLPQNA – II

The fundamental parameters of MLPQNA are:

• Input neurons
• Hidden layers and number of neurons per layer
• Output neurons
• W-step (error threshold at each approximation)
• Restarts (approximation cycles)
• Decay (decay factor of the weight-update law)
• MaxIts (maximum number of iterations)
• K-fold Cross Validation

In the classification case, the result is a confusion matrix, together with the characterizing parameters of efficiency, completeness, purity and contamination for the respective classes.


The QNA training scheme can be summarized by the following relations:

$\min_w E(w) = \frac{1}{2} \sum_{p=1}^{P} \left( y(x_p; w) - d_p \right)^2, \qquad E(w) = \sum_{p=1}^{P} E_p(w)$

where $E_p$ is a measure of the error related to the p-th pattern;

$w_{k+1} = w_k + \alpha_k d_k, \qquad d_k \in R^N$ (direction of search)

Hessian approx. (QNA): $\nabla^2 E(w_k)\, d_k = -\nabla E(w_k)$

The implemented MLPQNA model uses Tikhonov regularization (AKA weight decay). When the regularization factor is accurately chosen, the generalization error of the trained neural network can be improved, and training can be accelerated. If it is unknown what Decay regularization value to choose (as usual), it is possible to experiment with values within the range from 0.001 (weak regularization) up to 100 (very strong regularization).


REGRESSION ERROR

Least Square error + Tikhonov regularization

$E = \sum_{i=1}^{N_{patterns}} \frac{(y_i - t_i)^2}{2} + \lambda \frac{\|W\|^2}{2}$

where λ is the Decay, y and t are, respectively, the output and target for each pattern, while W is the weight matrix of the MLP.

CLASSIFICATION ERROR

1. Cross entropy enabled: cross-entropy per record, estimated in bits (logarithm):

$E = \sum_{i=1}^{N_{patterns}} \ln\frac{1}{y_i}$

2. Cross entropy disabled: percentage of misclassified patterns at each cycle.
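As an illustration, a minimal Python sketch of both cost functions (function names and toy values are hypothetical):

import numpy as np

# Sketch of the two MLPQNA cost functions above: least-squares error with
# Tikhonov (weight-decay) regularization for regression, and the per-record
# cross-entropy-style cost for classification.
def regression_error(y, t, W, decay):
    # E = sum (y_i - t_i)^2 / 2 + decay * ||W||^2 / 2
    return 0.5 * np.sum((y - t) ** 2) + 0.5 * decay * np.sum(W ** 2)

def classification_error(p_true):
    # p_true: network output assigned to the correct class of each record
    return np.sum(np.log(1.0 / p_true))

y, t = np.array([0.1, 0.9]), np.array([0.0, 1.0])
W = np.array([[0.5, -0.2]])
print(regression_error(y, t, W, decay=0.01))       # 0.01 + 0.00145
print(classification_error(np.array([0.9, 0.8])))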

MLP trained by Quasi Newton rule

The general weight-update scheme is the same for all training rules:

$\min_w E(w) = \frac{1}{2} \sum_{p=1}^{P} \left( y(x_p; w) - d_p \right)^2, \qquad w_{k+1} = w_k + \alpha_k d_k, \qquad d_k \in R^N$ (direction of search)

where $E_p$ is a measure of the error related to the p-th pattern. What changes is how the direction of search is obtained:

• Descent gradient (BP): $d_k = -\nabla E(w_k)$
• Genetic Algorithms (GA): $d_k$ obtained by genetic operators
• Hessian approx. (QNA): $\nabla^2 E(w_k)\, d_k = -\nabla E(w_k)$

MLP-BP Algorithm

By putting the mathematical relations into practice, we can derive the complete standard algorithm for a generic MLP trained by the BP rule as follows (ALG-1). Let us consider a generic MLP with m output and n input nodes, and with $w_{ij}(t)$ the weight between the i-th and j-th neuron at time (t):

1) Initialize all weights $w_{ij}(0)$ with small random values, typically normalized in [-1, 1];
2) Present to the network a new pattern $p = (x_1, \dots, x_n)$ together with the target $c_p = (c_{p1}, \dots, c_{pm})$ as the value expected for the network output;
3) Calculate the output of each neuron j (layer by layer) as $y_{pj} = f(\sum_i w_{ij} x_{pi})$, except for the input neurons (whose output is the input itself);
4) Adapt the weights between neurons at all layers, proceeding in the backward direction from output to input layer, with the rule $w_{ij}(t+1) = w_{ij}(t) + \eta \delta_{pj} y_{pj} + \mu \Delta w(t)$, where η is the gain term (also called learning rate) and μ the momentum factor, both typically in ]0,1[;
5) Go to 2 and repeat the steps for all input patterns of the training set.
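A compact, runnable sketch of ALG-1 on the XOR toy problem (one hidden layer with sigmoid activations; the problem choice and the bias-as-extra-input trick are illustrative assumptions, not part of the slide):

import numpy as np

# Online back-propagation with learning rate eta and momentum mu; biases are
# handled as an extra constant input.
rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # patterns p
C = np.array([[0], [1], [1], [0]], float)               # targets c_p (XOR)

n_hid = 4
W1 = rng.uniform(-1, 1, (n_hid, 3))           # step 1: random init in [-1, 1]
W2 = rng.uniform(-1, 1, (1, n_hid + 1))
dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
eta, mu = 0.5, 0.9

for epoch in range(5000):
    for x, c in zip(X, C):                    # step 2: present a pattern
        xb = np.append(x, 1.0)                # bias input
        h = np.append(sigmoid(W1 @ xb), 1.0)  # step 3: forward, layer by layer
        y = sigmoid(W2 @ h)
        d_out = (c - y) * y * (1 - y)         # step 4: backward deltas
        d_hid = (W2[:, :-1].T @ d_out) * h[:-1] * (1 - h[:-1])
        dW2 = eta * np.outer(d_out, h) + mu * dW2    # momentum term
        dW1 = eta * np.outer(d_hid, xb) + mu * dW1
        W2 += dW2
        W1 += dW1                             # step 5: next pattern / epoch

Hb = np.hstack([sigmoid(np.hstack([X, np.ones((4, 1))]) @ W1.T), np.ones((4, 1))])
print(sigmoid(Hb @ W2.T).ravel())             # should approach [0, 1, 1, 0]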

MLP-QNA Algorithm

Let us consider a generic MLP with $w(t)$ the weight vector at time (t):

1) Initialize all weights $w(0)$ with small random values (typically normalized in [-1, 1]), set the constant ε, set t = 0 and $g(0) = I$;
2) Present to the network the whole training set and calculate $E(w(t))$ as the error function for the current weight configuration;
3) If t = 0 then $d(t) = -\nabla E(t)$ (gradient of the error function), else $d(t) = -g(t-1)\,\nabla E(t-1)$;
4) Calculate $w(t+1) = w(t) - \alpha\, d(t)$, where α is obtained by the line search expression $\alpha(t) = -\frac{d(t)^T g(t)}{d(t)^T H\, d(t)}$;
5) Calculate $g(t+1)$ with the update equation $g(t+1) = g(t) + \frac{p p^T}{p^T \nu} - \frac{g(t)\, \nu\, \nu^T g(t)}{\nu^T g(t)\, \nu} + \nu^T g(t)\, \nu \, u u^T$;
6) If $E(w(t+1)) > \varepsilon$ then t = t+1 and go to 2, else STOP.

The resulting direction of search satisfies the Newton condition $\nabla^2 E(w_k)\, d_k = -\nabla E(w_k)$.

http://dame.dsf.unina.it/documents/DAME_MLPQNA_Model_Mathematics.pdf
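As a hedged illustration, the inverse-Hessian machinery of steps 3)-5) can be delegated to SciPy's BFGS optimizer, which belongs to the same quasi-Newton family; the network shape and data below are illustrative, and this is not the DAME implementation:

import numpy as np
from scipy.optimize import minimize

# E(w) from step 2 is handed to BFGS, whose rank-two inverse-Hessian updates
# play the role of steps 3)-5). Network: 2-4-1 MLP with bias inputs.
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([0.0, 1.0, 1.0, 0.0])

n_hid = 4
n_w1 = n_hid * 3                                 # hidden weights + biases

def error(w):                                    # E(w(t)) of step 2
    W1 = w[:n_w1].reshape(n_hid, 3)
    W2 = w[n_w1:].reshape(1, n_hid + 1)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    H = np.hstack([sigmoid(Xb @ W1.T), np.ones((len(X), 1))])
    y = sigmoid(H @ W2.T).ravel()
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(1)
w0 = rng.uniform(-1, 1, n_w1 + n_hid + 1)        # step 1: random init
res = minimize(error, w0, method="BFGS")         # quasi-Newton iterations
print(res.fun)                                   # final training error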

Machine Learning for control systems

A hybrid solution combines control schemes with NNs (VSPI + NN = NVSPI) to obtain an optimized adaptive control system, able to correct the motorized axis position in case of unpredictable and unexpected position errors.

NVSPI = Neural VSPI: a dataset of reference position trajectories is submitted to an MLP network through the system, in order to teach the NN to recognize fault conditions of the VSPI response.

VSPI = Variable Structure PI

"A Neural Tool for Ground-Based Telescope Tracking control", Brescia M. et al.: AIIA NOTIZIE, Anno XVI, N° 4, pp. 57-65, 2003

Machine Learning & Statistical Models

Neural Networks
  Feed Forward:
  • Perceptron
  • Multi Layer Perceptron
  • Radial Basis Functions
  Recurrent / Feedback:
  • Competitive Networks
  • Hopfield Networks
  • Adaptive Reasoning Theory

Statistical Models
  • Bayesian Networks
  • Hidden Markov Models
  • Mixture of Gaussians
  • Principal Probabilistic Surface
  • Maximum Likelihood
  • χ²
  • Negentropy

Decision Analysis
  • Decision Trees
  • Random Decision Forests
  • Evolving Trees
  • Minimum Spanning Trees

Hybrid
  • Fuzzy Sets
  • Genetic Algorithms
  • K-Means
  • Principal Component Analysis
  • Support Vector Machine
  • Soft Computing

Genetic Algorithms

A class of probabilistic optimization algorithms

Inspired by the biological evolution process

Uses concepts such as “Natural Selection” and “Genetic Inheritance” (Darwin 1859)

Originally developed by John Holland (1975)

A genetic algorithm maintains a population of candidate solutions for theproblem at hand, and makes it evolve by iteratively applying a set of stochasticoperators

GA artificial vs natural

[Diagram comparing Genetic Algorithms with Nature: an initiate & evaluate step feeds the population; evaluation selects parents, modification produces offspring, the evaluated offspring re-enter the population, and discarded members are deleted.]

Genetic operators

crossover (one-point, e.g. cut after position 3):
Before: s1 = 1111010101, s2 = 1110110101
After: s1` = 1110110101, s2` = 1111010101

mutation (random bit flip):
Before: s1 = 1110110100
After: s1` = 1110110101

The most common selection schemes are rank, tournament and roulette wheel.

elitism: maintain the best N solutions in the next population.

tournament: extract k individuals from the population with uniform probability (without re-insertion) and make them play a “tournament”, where the probability for an individual to win is generally proportional to its fitness. The selection pressure is directly proportional to the number k of participants.

roulette wheel: individual i has a probability $p(i) = \frac{f(i)}{\sum_j f(j)}$ of being chosen, where f(i) is its fitness.

All the above operators are quite invariant with respect to the particular problem. What drastically has to change is the fitness function (how to evaluate the population individuals).
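A minimal Python sketch of these operators on bit-string chromosomes (the one-max fitness, k and the mutation rate are illustrative; the tournament winner is taken deterministically as the fittest of the sampled k, a common variant):

import random

# Genetic operators on bit-string chromosomes.
def crossover(s1, s2, point):
    # one-point crossover: swap the tails after the cut position
    return s1[:point] + s2[point:], s2[:point] + s1[point:]

def mutate(s, rate=0.1):
    # flip each bit independently with the given probability
    return "".join(str(1 - int(b)) if random.random() < rate else b for b in s)

def tournament(pop, fitness, k=3):
    # k individuals sampled without re-insertion; the fittest wins
    return max(random.sample(pop, k), key=fitness)

def roulette(pop, fitness):
    # p(i) = f(i) / sum_j f(j)
    return random.choices(pop, weights=[fitness(s) for s in pop], k=1)[0]

def elitism(pop, fitness, n):
    # keep the best n solutions for the next population
    return sorted(pop, key=fitness, reverse=True)[:n]

fitness = lambda s: s.count("1")                  # toy "one-max" fitness
pop = ["1111010101", "1110110101", "0001011100", "1010101010"]
print(crossover(pop[0], pop[1], point=3))         # reproduces the example
child = mutate(tournament(pop, fitness))
next_pop = elitism(pop + [child], fitness, n=4)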

Population

Chromosomes could be:

• Bit strings (0101 ... 1100)
• Real numbers (43.2 -33.1 ... 0.0 89.2)
• Permutations of elements (E11 E3 E7 ... E1 E15)
• Lists of rules (R1 R2 R3 ... R22 R23)
• Program elements (genetic programming)
• ... any data structure ...

Example of genetic evolution

Initial Population

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (4,3,6,2,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

Select Parents

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (4,3,6,2,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

Try to pick the better ones.

Create Offspring – 1 point

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (4,3,6,2,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(3,4,5,6,2)

Create More Offspring

(3,4,5,6,2)

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (4,3,6,2,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(5,4,2,6,3)

Mutate

(3,4,5,6,2) (5,4,2,6,3)

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (4,3,6,2,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

Mutate (result: (4,3,6,2,5) → (2,3,6,4,5))

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (2,3,6,4,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(3,4,5,6,2) (5,4,2,6,3)

Eliminate

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (2,3,6,4,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

Tend to kill off the worst ones.

(3,4,5,6,2) (5,4,2,6,3)

Integrate

(5,3,4,6,2) (2,4,6,3,5)

(2,3,6,4,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(3,4,5,6,2)

(5,4,2,6,3)

Restart

(5,3,4,6,2) (2,4,6,3,5)

(2,3,6,4,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(3,4,5,6,2)

(5,4,2,6,3)

When to use a GA

• Alternate solutions are too slow or overly complicated
• Need an exploratory tool to examine new approaches
• Problem is similar to one that has already been successfully solved by using a GA
• Want to hybridize with an existing solution
• Benefits of the GA technology meet key problem requirements

Benefits of GAs

• Concept is easy to understand
• Modular, separate from application
• Supports multi-objective optimization
• Good for “noisy” environments
• Always an answer; answer gets better with time
• Inherently parallel; easily distributed
• Many ways to improve a GA application as knowledge about the problem domain is gained
• Easy to exploit previous or alternate solutions
• Flexible building blocks for hybrid applications

Soft Computing – MLP with GAs

[Diagram: forward phase propagating the input through the activation functions; backward phase with back-propagation of the output error against a convergence threshold; different weight configurations (populations of neural networks) obtained by genetic evolution.]

If we regard the weight matrix of an MLP as a chromosome, we obtain a population of weight matrices (a population of MLPs) evolved through genetic operators.


Linked genes represent the weight values and threshold of an individual neuron, connecting it to the previous neural network layer. The genetic optimization over genes structured in this way is standard.
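A hedged sketch of this scheme, where each chromosome is the flattened weight vector of a tiny 2-4-1 MLP evolved against the XOR toy problem (all sizes, rates and the fitness choice are assumptions, not the actual implementation):

import numpy as np

# Population of weight-matrix chromosomes evolved by rank selection,
# one-point crossover, Gaussian mutation and elitism; fitness = -MSE.
rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([0.0, 1.0, 1.0, 0.0])
N_W = 4 * 3 + 1 * 5                      # 17 weights (biases as extra inputs)

def forward(w, X):
    W1 = w[:12].reshape(4, 3)
    W2 = w[12:].reshape(1, 5)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    H = np.hstack([sigmoid(Xb @ W1.T), np.ones((len(X), 1))])
    return sigmoid(H @ W2.T).ravel()

def fitness(w):
    return -np.mean((forward(w, X) - t) ** 2)

pop = rng.uniform(-1, 1, (50, N_W))      # population of weight chromosomes
for gen in range(300):
    pop = pop[np.argsort([fitness(w) for w in pop])[::-1]]    # rank
    parents = pop[:25]                                        # selection
    cuts = rng.integers(1, N_W, 25)
    children = np.array([np.concatenate((parents[i][:c],
                                         parents[(i + 1) % 25][c:]))
                         for i, c in enumerate(cuts)])        # crossover
    children += rng.normal(0.0, 0.1, children.shape)          # mutation
    pop = np.vstack([parents, children])                      # elitism
print(forward(pop[0], X))                # best kept individual; should tend
                                         # towards [0, 1, 1, 0]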

GAME (GA Model Experiment)

Given a generic dataset with N features and a target t, let pat be a generic input pattern of the dataset, $pat = (f_1, \dots, f_N, t)$, and g(x) a generic real function. The representation of a generic feature $f_i$ of a generic pattern with a polynomial sequence of degree d is:

$G(f_i) \cong a_0 + a_1 g(f_i) + \dots + a_d g^d(f_i)$

Hence, the k-th pattern ($pat_k$) with N features may be represented by:

$Out(pat_k) \cong \sum_{i=1}^{N} G(f_i) \cong a_0 + \sum_{i=1}^{N} \sum_{j=1}^{d} a_j g^j(f_i) \qquad (1)$

The target $t_k$, associated with the pattern $pat_k$, can be used to evaluate the approximation error of the input pattern with respect to the expected value:

$E_k = (t_k - Out(pat_k))^2$

With NP patterns (k = 1, …, NP), at the end of the “forward” phase (batch) of the GA we have NP expressions (1), which represent the polynomial approximation of the dataset. In order to evaluate the fitness of the patterns, the Mean Square Error (MSE) or Root Mean Square Error (RMSE) may be used:

$MSE = \frac{\sum_{k=1}^{NP} (t_k - Out(pat_k))^2}{NP} \qquad RMSE = \sqrt{\frac{\sum_{k=1}^{NP} (t_k - Out(pat_k))^2}{NP}}$

Cavuoti, S. et al. (2012). Genetic Algorithm Modeling with GPU Parallel Computing Technology. Neural Nets and Surroundings, Smart Innovation, Systems and Technologies, Vol. 19, p. 11, Springer.


$NUM_{CHROMOSOMES} = B \cdot N + 1$

where N is the number of features of the patterns and B is a multiplicative factor that depends on the g(x) function: in the simplest case it is just 1, but it can rise to 3 or 4.

$NUM_{GENES} = d \cdot B + 1$

where d is the degree of the polynomial.

We use the trigonometric polynomial sequence, given by the following expression (hence B = 2):

$g(x) = a_0 + \sum_{m=1}^{n} a_m \cos(mx) + \sum_{m=1}^{n} b_m \sin(mx)$

With 2100 patterns of 11 features each, the expression for the single (k-th) pattern, using (1) with degree 6, will be:

$Out(pat_k) \cong \sum_{i=1}^{11} G(f_i) \cong a_0 + \sum_{i=1}^{11} \sum_{j=1}^{6} a_j \cos(j f_i) + \sum_{i=1}^{11} \sum_{j=1}^{6} b_j \sin(j f_i)$

for k = 1, …, 2100, so that:

$NUM_{CHROMOSOMES} = 2 \cdot 11 + 1 = 23 \qquad NUM_{GENES} = 6 \cdot 2 + 1 = 13$
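A minimal sketch of expression (1) with the trigonometric g(x), N = 11 and d = 6 (pattern and coefficients are random placeholders; coefficients are shared across features, following expression (1) literally):

import numpy as np

# Out(pat) = a0 + sum_i sum_j [ a_j cos(j f_i) + b_j sin(j f_i) ]
N, d = 11, 6
rng = np.random.default_rng(0)

def out_pattern(pat, a0, a, b):
    j = np.arange(1, d + 1)[:, None]          # degrees j = 1..d (column)
    return a0 + np.sum(a[:, None] * np.cos(j * pat) +
                       b[:, None] * np.sin(j * pat))

pat = rng.uniform(0, 1, N)                    # one 11-feature pattern
a0, a, b = 0.1, rng.uniform(-1, 1, d), rng.uniform(-1, 1, d)
print(out_pattern(pat, a0, a, b))

# Bookkeeping of the slide, with B = 2 for the trigonometric g(x):
print(2 * N + 1)   # NUM_CHROMOSOMES = 23
print(d * 2 + 1)   # NUM_GENES = 13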

GAME (GA Model Experiment)

The general-purpose GA has been internally designed for classification and regression problems. It runs serially on multi-core CPUs and in parallel on many-core GPUs: faster execution was needed to become scalable for MDS. Genetic algorithms are embarrassingly parallel (granularity + repetitive operations).

References

Brescia, M.; 2011, New Trends in E-Science: Machine Learning and Knowledge Discovery in Databases, contribution to the volume Horizons in Computer Science Research, Editors: Thomas S. Clary, Series Horizons in Computer Science, ISBN: 978-1-61942-774-7, available at Nova Science Publishers.

Kotsiantis, S. B.; 2007, Supervised Machine Learning: A Review of Classification Techniques, Informatica, Vol. 31, 249-268.

Shortliffe, E. H.; 1993, The adolescence of AI in medicine: will the field come of age in the '90s?, Artif Intell Med. 5(2):93-106. Review.

Hornik, K.; 1989, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, Vol. 2, pp. 359-366, Pergamon Press.

Brescia, M. et al.; 2003, A Neural Tool for Ground-Based Telescope Tracking control, AIIA NOTIZIE, periodico dell'Associazione Italiana per l'Intelligenza Artificiale, Anno XVI, N° 4, pp. 57-65.

Bengio, Y. & LeCun, Y.; 2007, Scaling Learning Algorithms towards AI, to appear in “Large Scale Kernel Machines” Volume, MIT Press.