Introduction to Artificial Intelligence CSCI 3202: The Perceptron Algorithm



Introduction to Artificial Intelligence CSCI 3202:

The Perceptron Algorithm

Greg Grudic


Questions?


Binary Classification

• A binary classifier is a mapping from a set of d inputs to a single output which can take on one of TWO values

• In the most general setting:

$$\text{inputs: } \mathbf{x} \in \mathbb{R}^d, \qquad \text{output: } y \in \{-1, +1\}$$

• Specifying the output classes as -1 and +1 is arbitrary!
  – Often done as a mathematical convenience


A Binary Classifier

$$\mathbf{x} \;\longrightarrow\; \boxed{\text{Classification Model}} \;\longrightarrow\; \hat{y} \in \{-1, +1\}$$

Given learning data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

A model is constructed: $\hat{y} = M(\mathbf{x})$


Linear Separating Hyper-Planes

[Figure: $y=+1$ and $y=-1$ points in the $(x_1, x_2)$ plane, split by a separating line.]

The hyperplane divides the input space into two regions:

$$\beta_0 + \sum_{i=1}^{d} \beta_i x_i > 0 \;\Rightarrow\; y = +1$$

$$\beta_0 + \sum_{i=1}^{d} \beta_i x_i \le 0 \;\Rightarrow\; y = -1$$

with the separating hyperplane itself given by $\beta_0 + \sum_{i=1}^{d} \beta_i x_i = 0$.


Linear Separating Hyper-Planes

• The Model:

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}\right]$$

• Where:

$$\operatorname{sgn}[A] = \begin{cases} 1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$$

• The decision boundary:

$$\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x} = 0$$
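A minimal sketch of this model in Python (NumPy assumed; the function name `predict` is illustrative):

```python
import numpy as np

def predict(beta0, beta, X):
    """y_hat = sgn(beta0 + beta^T x) for each row x of X.

    Following the definition above, sgn returns -1 whenever the
    argument is not strictly positive.
    """
    return np.where(X @ beta + beta0 > 0, 1, -1)
```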


Linear Separating Hyper-Planes

• The model parameters are: $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$

• The hat on the betas means that they are estimated from the data

• Many different learning algorithms have been proposed for determining $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$


Rosenblatt’s Perceptron Learning Algorithm

• Dates back to the 1950s and is the motivation behind Neural Networks

• The algorithm:
  – Start with a random hyperplane $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$
  – Incrementally modify the hyperplane so that misclassified points move closer to the correct side of the boundary
  – Stop when all learning examples are correctly classified


Rosenblatt’s Perceptron Learning Algorithm

• The algorithm is based on the following property:
  – The signed distance of any point $\mathbf{x}$ to the boundary is:

$$\frac{\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}}{\sqrt{\sum_{i=1}^{d} \hat{\beta}_i^2}} \;\propto\; \hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}$$

• Therefore, if $M$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing the following:

$$D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right) = -\sum_{i \in M} y_i \left(\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}_i\right)$$
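A direct transcription of this cost in Python (NumPy assumed; a point counts as misclassified when its signed margin $y_i(\hat{\beta}_0 + \hat{\beta}^T\mathbf{x}_i)$ is not positive):

```python
import numpy as np

def perceptron_cost(beta0, beta, X, y):
    """D(beta0, ..., beta_d) = -sum_{i in M} y_i (beta0 + beta^T x_i).

    M is the set of misclassified examples: those whose signed margin
    y_i (beta0 + beta^T x_i) is <= 0. Every term in the sum is then
    non-positive, so D is always >= 0 and equals 0 when M is empty.
    """
    margins = y * (X @ beta + beta0)
    return -np.sum(margins[margins <= 0])
```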


Rosenblatt’s Minimization Function

• This is classic Machine Learning!

• First define a cost function in model parameter space:

$$D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right) = -\sum_{i \in M} y_i \left(\hat{\beta}_0 + \sum_{k=1}^{d} \hat{\beta}_k x_{ik}\right)$$

• Then find an algorithm that modifies $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$ such that this cost function is minimized

• One such algorithm is Gradient Descent


Gradient Descent

[Figure: the cost surface $D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$ plotted over the $\left(\hat{\beta}_0, \hat{\beta}_1\right)$ parameter plane.]


The Gradient Descent Algorithm

$$\hat{\beta}_i \leftarrow \hat{\beta}_i - \rho \, \frac{\partial D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)}{\partial \hat{\beta}_i}$$

Where the learning rate is defined by: $\rho > 0$
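As a generic sketch in Python (`grad_D` stands for any function returning the vector of partial derivatives $\partial D / \partial \hat{\beta}_i$; the name is illustrative):

```python
import numpy as np

def gradient_descent_step(beta, grad_D, rho):
    """One update of all parameters: beta_i <- beta_i - rho * dD/dbeta_i."""
    return beta - rho * grad_D(beta)   # rho > 0 is the learning rate
```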


The Gradient Descent Algorithm for the Perceptron

The gradient of $D$ has components:

$$\frac{\partial D\left(\hat{\beta}_0, \ldots, \hat{\beta}_d\right)}{\partial \hat{\beta}_0} = -\sum_{i \in M} y_i, \qquad \frac{\partial D\left(\hat{\beta}_0, \ldots, \hat{\beta}_d\right)}{\partial \hat{\beta}_j} = -\sum_{i \in M} y_i x_{ij}, \quad j = 1, \ldots, d$$

Two Versions of the Perceptron Algorithm:

Update one misclassified point $i$ at a time (online); substituting the single-point gradient into the update rule, the two minus signs cancel:

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i x_{i1} \\ \vdots \\ y_i x_{id} \end{pmatrix}$$

Update all misclassified points at once (batch):

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} + \rho \begin{pmatrix} \sum_{i \in M} y_i \\ \sum_{i \in M} y_i x_{i1} \\ \vdots \\ \sum_{i \in M} y_i x_{id} \end{pmatrix}$$
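A runnable sketch of both versions under the conventions above (NumPy assumed; `rho` and `max_epochs` are illustrative choices, and the misclassified set $M$ is recomputed on every pass):

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=1000, online=True):
    """Rosenblatt's perceptron as gradient descent on D.

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    online=True  updates on one misclassified point at a time;
    online=False applies the batch update over all of M at once.
    If the data is not linearly separable, the loop only stops
    because of max_epochs (see the convergence slides below).
    """
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(max_epochs):
        margins = y * (X @ beta + beta0)
        M = margins <= 0                     # the misclassified set
        if not M.any():                      # all examples correct: converged
            break
        if online:
            i = np.flatnonzero(M)[0]         # pick one misclassified point
            beta0 += rho * y[i]
            beta += rho * y[i] * X[i]
        else:
            beta0 += rho * np.sum(y[M])
            beta += rho * (y[M] @ X[M])      # sum_{i in M} y_i x_i
    return beta0, beta
```

Note that with the zero initialization used here, the value of $\rho$ only rescales $(\hat{\beta}_0, \ldots, \hat{\beta}_d)$ and does not change the sequence of decision boundaries visited.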


The Learning Data

• Matrix representation of N learning examples of d-dimensional inputs:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$$

Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
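In code, this layout is just a pair of arrays (the values below are made up purely for illustration):

```python
import numpy as np

# N = 4 examples of d = 2 dimensional inputs: row i of X is x_i
X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.5],
              [3.0, 0.0]])        # shape (N, d)
Y = np.array([-1, -1, +1, +1])    # shape (N,), one label per row of X
```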


The Good Theoretical Properties of the Perceptron Algorithm

• If a solution exists, the algorithm will always converge in a finite number of steps!

• Question: Does a solution always exist?


Linearly Separable Data

• Which of these datasets are separable by a linear boundary?

[Figure: two scatter plots, (a) and (b), each showing a mix of + and − points.]


Linearly Separable Data

• Which of these datasets are separable by a linear boundary?

[Figure: the same two datasets, now with the non-separable one labeled "Not Linearly Separable!"]


Bad Theoretical Properties of the Perceptron Algorithm

• If the data is not linearly separable, the algorithm cycles forever!
  – Cannot converge!
  – This property "stopped" active research in this area between 1968 and 1984…
    • Perceptrons, Minsky and Papert, 1969

• Even when the data is separable, there are infinitely many solutions
  – Which solution is best?

• When data is linearly separable, the number of steps to converge can be very large (it depends on the size of the gap between the classes)


What about Nonlinear Data?

• Data that is not linearly separable is called nonlinear data

• Nonlinear data can often be mapped into a nonlinear basis function space where it is linearly separable


Nonlinear Models

• The Linear Model:

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \sum_{i=1}^{d} \hat{\beta}_i x_i\right]$$

• The Nonlinear (basis function) Model:

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i \phi_i(\mathbf{x})\right]$$

• Examples of Nonlinear Basis Functions:

$$\phi_1(\mathbf{x}) = x_1^2, \quad \phi_2(\mathbf{x}) = x_2^2, \quad \phi_3(\mathbf{x}) = x_1 x_2, \quad \phi_4(\mathbf{x}) = \sin(x_5)$$
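A sketch of one such basis expansion, assuming the four example functions listed above; the perceptron is then trained on $\phi(\mathbf{x}_i)$ in place of $\mathbf{x}_i$:

```python
import numpy as np

def phi(x):
    """Map x (with d >= 5 components) to the k = 4 basis functions above."""
    return np.array([x[0] ** 2,        # phi_1(x) = x_1^2
                     x[1] ** 2,        # phi_2(x) = x_2^2
                     x[0] * x[1],      # phi_3(x) = x_1 x_2
                     np.sin(x[4])])    # phi_4(x) = sin(x_5)
```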


Linear Separating Hyper-Planes In Nonlinear Basis Function Space

[Figure: the same separating-hyperplane picture as before, now drawn in the $(\phi_1, \phi_2)$ plane.]

In the basis function space, the regions are:

$$\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i > 0 \;\Rightarrow\; y = +1$$

$$\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i \le 0 \;\Rightarrow\; y = -1$$

with the separating hyperplane $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i = 0$.


An Example

[Figure: left, the training data in the original $(x_1, x_2)$ space, where the $y=+1$ and $y=-1$ classes are not linearly separable; right, the same data after the mapping $\Phi$ with $\phi_1 = x_1^2$ and $\phi_2 = x_2^2$, where a linear boundary separates the classes.]


Kernels as Nonlinear Transformations

• Polynomial:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \left(\langle \mathbf{x}_i, \mathbf{x}_j \rangle + q\right)^k$$

• Sigmoid:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\kappa \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \theta\right)$$

• Gaussian or Radial Basis Function (RBF):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2} \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2\right)$$
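Direct transcriptions of the three kernels in Python (the default parameter values for `q`, `k`, `kappa`, `theta`, and `sigma` are placeholders, not values from the slides):

```python
import numpy as np

def poly_kernel(xi, xj, q=1.0, k=2):
    """Polynomial: (<xi, xj> + q)^k."""
    return (np.dot(xi, xj) + q) ** k

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    """Sigmoid: tanh(kappa * <xi, xj> + theta)."""
    return np.tanh(kappa * np.dot(xi, xj) + theta)

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian / RBF: exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```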


The Kernel Model

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left[\hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_i K(\mathbf{x}, \mathbf{x}_i)\right]$$

Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

The number of basis functions equals the number of training examples!
  – Unless some of the $\hat{\beta}_i$ get set to zero…
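A minimal sketch of this prediction rule (Python; `kernel` is any of the kernel functions above, and the names are illustrative):

```python
def kernel_predict(x, X_train, beta0, beta, kernel):
    """y_hat = sgn(beta0 + sum_i beta_i * K(x, x_i)).

    One coefficient beta_i per training example x_i; examples whose
    beta_i is exactly zero contribute nothing and could be dropped.
    """
    s = beta0 + sum(b * kernel(x, xi) for b, xi in zip(beta, X_train))
    return 1 if s > 0 else -1
```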


Gram (Kernel) Matrix

$$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}$$

Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

Properties:
• Positive Definite Matrix
• Symmetric
• Positive on diagonal
• N by N
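Building the Gram matrix is a double loop over the training set (a sketch; any kernel function from the previous slide can be passed in, e.g. `gram_matrix(X, rbf_kernel)`):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the N x N matrix K with K[i, j] = kernel(x_i, x_j).

    For a valid kernel the result is symmetric, positive on the
    diagonal, and positive (semi-)definite.
    """
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K
```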


Picking a Model Structure?

• How do you pick the kernels?
  – Kernel parameters: these are called learning parameters or hyperparameters
  – Two approaches to choosing learning parameters:
    • Bayesian
      – Learning parameters must maximize the probability of correct classification on future data, based on prior biases
    • Frequentist
      – Use the training data to learn the model parameters $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$
      – Use validation data to pick the best hyperparameters

• More on learning parameter selection later


Perceptron Algorithm Convergence

• Two problems:
  – No convergence when the data is not separable in basis function space
  – Infinitely many solutions when the data is separable

• Can we modify the algorithm to fix these problems?
