Instructor: Dr. Benjamin Thompson Lecture 15: 3 March 2009scripts.cac.psu.edu/users/c/a/cao5021/ee/456/hw/lecture_15 Dec Bo… · The Oral Presentation Minimum 5 minutes, maximum

Instructor:

Dr. Benjamin Thompson

Lecture 15: 3 March 2009

Quod Erat Demonstrandum

• More Heuristics for Better Learning

  • Momentum

  • Maximizing Information Content

  • The Activation Function

  • Input Normalization

  • Weight Initialization

Et In Saecula Saeculorum…

• Decision Boundaries in Classification Problems

• Matlab Demonstration: Two Moons revisited

• Matlab Demonstration: Five-class problem

• Super-Awesome Really Fun and Amazing Term Project Assignment! Yay!

• Biologically-Inspired Search Algorithms Preview

You’ve gotta make a choice eventually…

Crisp n’ Fuzzy

• Recall: no matter how well-trained the neural network is, it’s always going to produce a continuously-valued output

  • XOR problem: it came very close to producing the exact binary outputs we desired, but not exactly

• For function approximations, this typically isn’t an issue: you’re just trying to approximate some input/output relationship that’s already continuous.

• For classification problems, there is a discrete number of classes, and we must ultimately decide to which class the input belongs

Classification example

• Suppose I have fully trained a neural network to distinguish between a handwritten letter “a” and a handwritten letter “b”

• My input is the features of the handwritten sample; my desired output is either +1 for “a” or -1 for “b”

• In reality, my output for new patterns is going to be in the ballpark of +1 for “a”, and in the ballpark of -1 for “b”.

• A simple threshold may be applied at this point:

  • Above zero, call it an “a”

  • Below zero, call it a “b”
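The thresholding rule for the “a”/“b” example can be sketched in a few lines of Python. This is only an illustration: the function name `classify_letter` and the sample outputs are made up, and sending an output of exactly zero to “a” is an assumption, since the rule above leaves that case open.

```python
def classify_letter(output: float, threshold: float = 0.0) -> str:
    """Map a continuous network output to "a" (target +1) or "b" (target -1)."""
    # An output exactly at the threshold goes to "a" here; that tie-break
    # is an assumption, not something the lecture specifies.
    return "a" if output >= threshold else "b"


# Hypothetical network outputs for four unseen handwriting samples.
outputs = [0.93, -1.08, 0.12, -0.41]
labels = [classify_letter(y) for y in outputs]
print(labels)  # ['a', 'b', 'a', 'b']
```

Note that 0.12 is nowhere near +1, yet it is still called an “a”: only the side of the threshold matters.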

The Decision Boundary

• Given a particular threshold, I may then sample the entire feature space, and determine which sets of points yield the first class, and which sets of points yield the second class

• In the Rosenblatt Perceptron case, this was already done for me, by the hyperplane $w^T x + b = 0$

• Once this is done, the line(s) separating the classes (of which there may be more than two) form my decision boundary
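Sampling the feature space can be sketched as follows. To keep the example self-contained it uses a two-input perceptron as a stand-in for a trained network; the weights `w = (1, -1)`, the bias `b = 0`, the grid limits, and both function names are illustrative assumptions.

```python
def perceptron_class(x1, x2, w=(1.0, -1.0), b=0.0):
    """Return class 1 if w^T x + b >= 0, otherwise class 2."""
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else 2


def sample_feature_space(classifier, lo=-2.0, hi=2.0, steps=9):
    """Label every point of a regular grid; rows run from x2 = hi down to x2 = lo."""
    step = (hi - lo) / (steps - 1)
    grid = []
    for r in range(steps):
        x2 = hi - r * step
        grid.append([classifier(lo + c * step, x2) for c in range(steps)])
    return grid


grid = sample_feature_space(perceptron_class)
for row in grid:
    print("".join(str(label) for label in row))
```

In the printed map, the places where the 1s meet the 2s trace out the decision boundary, here the line x1 = x2. For a trained network, `perceptron_class` would be replaced by the thresholded network output.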

What To Do With More Classes

• Remember, the output in a classification problem just answers the question “to which class does this input correspond?”

• We must supply a numerical value to each possible output in order to train our neural networks!

• When presented with a multiple-class problem, we have several options on how to encode the output:

  • Class-to-scalar mapping

  • Class-to-vector mapping

Class-to-Scalar Mapping

• Inputs from class 1 all map to some number $\alpha_1$

• Inputs from class 2 all map to some number $\alpha_2$

• …

• Inputs from class n all map to some number $\alpha_n$

• Selection of these scalar values presents a design problem:

  • If they vary wildly for adjacent regions, the neural network will have to learn those sharp discontinuities, which neural networks tend to have a hard time learning

  • The overall scale doesn’t matter (you could scale the output weight matrix arbitrarily to accommodate that), but the relative difference between each value does matter

Class-to-Scalar Mapping

• Using 1 2 3 5 6 is much smoother than using 1 -1 2 -2 3

We call this an adjacency problem!
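One rough way to put a number on “smoother” is to add up the jumps between the target values of neighboring classes. The sketch below assumes that the classes are listed in the order in which their regions sit next to each other in the input space; the helper name `total_jump` is made up for illustration.

```python
def total_jump(targets):
    """Sum of absolute differences between the targets of neighboring classes."""
    return sum(abs(b - a) for a, b in zip(targets, targets[1:]))


print(total_jump([1, 2, 3, 5, 6]))    # 5
print(total_jump([1, -1, 2, -2, 3]))  # 14
```

The second assignment forces the network to swing its output almost three times as far while crossing the same class regions.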

Class-to-Scalar Mapping

• Overall output:

  • A decision is made by quantizing the output of the neural network, which forms a set of selection rules.

• In our previous example:

  • IF the output is less than 1.5, it is class 1

  • IF the output is greater than 1.5 AND less than 2.5, it is class 2

  • IF the output is greater than 2.5 AND less than 3.5, it is class 3

  • etc.

• This makes it clear that, for classification problems, we only have to get the answer “in the ballpark” rather than exactly right

  • This can make training easier and faster
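The selection rules above amount to rounding to the nearest class label. Here is a minimal sketch, assuming the class targets are the consecutive integers 1 through n; the function name is hypothetical, and the handling of an output sitting exactly on a boundary such as 1.5 (rounded up here) is an assumption because the rules do not cover it.

```python
import math


def quantize_to_class(output: float, n_classes: int) -> int:
    """Round a scalar network output to the nearest class label in 1..n_classes."""
    nearest = math.floor(output + 0.5)      # boundaries fall at 1.5, 2.5, 3.5, ...
    return min(max(nearest, 1), n_classes)  # clamp outputs beyond the end classes


print([quantize_to_class(y, 5) for y in (0.7, 1.49, 2.2, 3.1, 7.0)])  # [1, 1, 2, 3, 5]
```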

Class-to-Vector Mapping

• Construct your neural network to have as many output neurons as there are input classes (not inputs, but the classes from which the inputs were drawn!)

• The desired output pattern for an input drawn from class n is just a vector of all zeros except for the nth element, which is a 1!

• This also enables one to observe how well the classifier is working for a given class:

  • If many outputs are close to 1, that indicates some level of confusion

  • If only a single output is close to 1, that indicates a level of confidence
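Building the desired output pattern for class-to-vector mapping is a one-liner. The sketch below assumes classes are numbered from 1, as on the slides; the function name `one_hot` is a common label for this encoding rather than something from the lecture.

```python
def one_hot(class_index: int, n_classes: int) -> list:
    """Desired output for class n (1-based): all zeros except a 1 in position n."""
    if not 1 <= class_index <= n_classes:
        raise ValueError("class_index must be between 1 and n_classes")
    return [1.0 if k == class_index else 0.0 for k in range(1, n_classes + 1)]


print(one_hot(3, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```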

Class-to-Vector Mapping

• Class selection: whichever neuron has the largest output corresponds to the class of the input

• Additional strength of this approach: if there are two or more competing/conflicting neurons (that is, two or more have approximately the same maximum value), we may choose to make “no decision”

• That is: the neural network is clearly confused as to which class the input belongs, so we develop an “I don’t know” case to handle this
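Picking the largest output, with an “I don’t know” escape hatch, might look like the sketch below. The `margin` parameter (how far the winner must lead the runner-up) and its default of 0.2 are illustrative assumptions; the lecture only says the top outputs are “approximately the same”.

```python
def select_class(outputs, margin=0.2):
    """Return the 1-based index of the largest output, or None for "I don't know"."""
    # Rank neuron indices by output value, largest first (needs >= 2 outputs).
    ranked = sorted(range(len(outputs)), key=lambda k: outputs[k], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if outputs[best] - outputs[runner_up] < margin:
        return None  # two neurons are competing: make no decision
    return best + 1


print(select_class([0.05, 0.91, 0.10]))  # 2
print(select_class([0.48, 0.52, 0.03]))  # None
```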

Comparison

Class-to-Scalar

• One output neuron implies fewer free parameters, which simplifies and shortens neural network training

• Single output makes interpretation simpler

• Scalar value for each class must be chosen with care

Class-to-Vector

• Multiple output neurons imply more free parameters, which can cause longer training times

• No adjacency problem

• No need for careful selection of output values

• Additional interpretive tools available for confusion or confidence estimates

One-to-Many Hybrid

• As with many things, there are tradeoffs here as well:

• Rather than a single output for many classes, or as many outputs as there are classes, we may “meet in the middle” by cleverly coding the classes

• That is, a particular class maps to a particular set of output values

• The more fine-grained the set of output values (and the fewer neurons it takes to represent them), the closer we are to scalar mapping

• The coarser the set of output values (and the more neurons it takes to represent them), the closer we are to vector mapping

Example: binary coding

• For each class number, the desired output is simply the binary equivalent for that class

  • e.g., suppose we have 4 classes. The possible outputs become:

    • [0,0] for class 1

    • [0,1] for class 2

    • [1,0] for class 3

    • [1,1] for class 4

• Such a coding scheme suffers from some adjacency issues, but not as many as pure scalar coding

  • e.g., if class 1 and class 4 are close to each other in the input space, it may be problematic since their outputs are $\sqrt{2}$ “units” apart in the output space

• This scheme only requires $\lceil \log_2 n \rceil$ outputs, where n is the number of classes.
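The binary coding scheme can be sketched as below. It assumes that class k (numbered from 1) is coded as the bits of k - 1, most significant bit first, which reproduces the four-class listing above; the function name is made up for illustration.

```python
def binary_code(class_index: int, n_classes: int) -> list:
    """Encode a 1-based class index as the bits of (class_index - 1), MSB first."""
    # Number of output neurons: bits needed to count up to n_classes - 1.
    n_bits = max(1, (n_classes - 1).bit_length())
    value = class_index - 1
    return [(value >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


for c in range(1, 5):
    print(c, binary_code(c, 4))
# 1 [0, 0]
# 2 [0, 1]
# 3 [1, 0]
# 4 [1, 1]
```

With five classes the same function returns three-element codes, since two bits can only distinguish four classes.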

Final Note for Classification

• Recall that we said that the overall goal for classification is to get the answer “in the ballpark” rather than exactly right

• Thus, while we still need to use the traditional error metric for backprop to work, a better error metric to use simply for performance evaluation might be:

  • E(n) = % of incorrectly classified patterns on epoch n

• Of course, “correctly classified” requires the calculation of the threshold value(s) used to make the ultimate determination

• This metric gives us a firmer stopping criterion:

  • When we have successfully classified all the input patterns (or reached some acceptably low % of misclassifications), we may stop training!
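For a two-class problem with targets of +1 and -1 and a threshold at zero, the evaluation metric E(n) can be sketched as follows. The sample outputs, the 5% stopping tolerance, and the function name are illustrative assumptions.

```python
def percent_misclassified(outputs, targets, threshold=0.0):
    """E(n): percentage of patterns whose thresholded output disagrees with its target."""
    def decide(y):
        return 1 if y >= threshold else -1

    wrong = sum(1 for y, d in zip(outputs, targets) if decide(y) != d)
    return 100.0 * wrong / len(targets)


# Hypothetical network outputs and desired classes after one epoch.
outputs = [0.8, -0.6, 0.1, -0.2, 0.3]
targets = [1, -1, 1, -1, -1]
error = percent_misclassified(outputs, targets)
print(error)  # 20.0

stop_training = error <= 5.0  # assumed tolerance; 0.0 would demand a perfect epoch
```

Backprop itself would still be driven by the usual error on the continuous outputs; this number is only checked between epochs.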

That’s no two moons revisited! That’s two space stations revisited!

Just a Reminder

More Reminders

• This two-class problem is not linearly separable

  • So Rosenblatt Perceptron would not suffice

• Neural Network classification:

  • Top moon: Class one, should give a “+1” output

  • Bottom moon: Class two, should give a “-1” output

  • Decision threshold: output ≥ 0 for class 1

• Things to remember:

  • Goal of classification is not perfect input-output mapping

  • Rather, goal is to get “close enough” so that the two classes are separable by a simple threshold with minimal overlap

Stay classy, neural networks!

The Classes

[Figure: the five-class data set, with regions labeled Class 1 through Class 5]

The Approach

• We will train a neural network with a single output

  • The outputs simply map to the class number

• Overall goal: to demonstrate the decision boundary for multiple classes

• Thing to note: lots of “white space” where there is no such thing as an “incorrect” classification

  • In other words, generalization only applies over the support of each class

Quick, somebody ask, “Hey Dr. Thompson, what’s ‘support’ mean in this context?”

As in, Terminal Project. As in, it’ll be the death of you…

The Purpose

• The main goal of the term project is to independently exercise one or more of the techniques developed in this class on interesting, real-world data sets

• So really, there are several goals:

  • 1) Learning how to gather data

  • 2) Exercising neural networks or related techniques on real-world data

  • 3) Writing a coherent and lucid report on experimental results

  • 4) Reporting said results in front of a group in a concise and informative manner

The Details

• The project will comprise 35% of your grade for this course

• The project must be done in assigned groups unless explicit permission is granted by the instructor

  • Valid reason: “My project will be based on my own existing research that I am already performing, and for funding reasons I can’t share that data with a partner.”

  • Invalid reason: “My homework partners smell funny.”

More Details

• The project actually consists of three (3) components:

  • 1) Project proposal (5% of your overall course grade)

    • Due 3/24/09

  • 2) Oral presentation (10% of your overall course grade)

    • Given either 4/28/09 or 4/30/09, in-class

  • 3) Written Report (20% of your overall course grade)

    • Due the last day of class, 4/30/09

If this makes you nervous, that whole “picturing the audience in their underwear” trick is a load of bull.

The Proposal

• The proposal for your project topic must be submitted by the beginning of class on 3/24/09

• The proposal must pithily address the following issues:

  • Which day you would prefer to give your presentation (4/28 or 4/30)

    • You are not guaranteed this date, but I will try to accommodate everyone as best I can

  • The problem you plan on solving.

    • Why is the problem important to solve?

  • The technique(s) you plan on applying.

  • The data you plan to use.

    • How will you obtain/have you obtained the data?

  • The programming language you will use.

• The proposal should probably only be 1-2 pages in length.

The Oral Presentation

• Minimum 5 minutes, maximum 8 minutes

  • Exceeding these bounds will lower your grade

  • Allow 2 minutes for questions

• Every member of the group must speak for at least 1 minute

  • Failure to speak for at least 1 minute will lower that individual’s grade

  • Yes, I will be timing you.

• Must be accompanied by some form of illustration

  • Overhead projector slides

  • Powerpoint slides (preferred option)

  • Large-print poster

The Oral Presentation

• The oral presentation must contain the following key components:

  • Problem Statement

  • Description of the data used

  • The Learning Machine(s) that was/were used

  • Training and testing results

  • Conclusions and Future Work

• The Oral Presentation will be graded on the following criteria:

  • Clarity of speaking (practice, practice, practice!)

  • Detail (say something useful!)

  • Interest (make your slides informative and attractive!)

Oral Presentation Caveats

• Each group is solely responsible for ensuring that their presentation will work on the classroom equipment

• Technology failures due to poor planning will count against your grade

• You are welcome, and highly encouraged, to come to class the day before the presentation to test out the slides

• You may email me your Powerpoint slides at least 24 hours before your presentation and I will bring them pre-loaded on my laptop

Courtesy of Dr. David W. Krout

Diligent Q. Student

EE/ESC 456

Problem Statement

• Goal

  • Develop a Neural Network that can classify what sport a person plays based on height and weight

Data Set

• Inputs

  • Height (in), Weight (lbs)

• Outputs

  • Sport Classification

  • Jockeys: 1, Basketball: 2, Soccer: 3, Sumo: 4

• 200 athletes, approximately 50 per sport

Plot of Data Set

[Figure: Data points for Athletes (Height vs. Weight). Axes: Height (inches) and Weight (lbs). Legend: Jockeys, Basketball Players, Soccer Players, Sumo Wrestlers.]

Neural Network

• Multilayer Feed Forward

  • 2/4/3/1

  • Linear transfer function on output layer and Sigmoid for all others

Training

• 80% of data set used for training

• 20% used for testing

Training Results

Trial      Epochs   Time   Ave. Error   % Correct
1            7955   25s    .462         65
2           16681   50s    .450         65
3           20540   55s    .454         65
4            9541   30s    .643         69
Average:    13679   40s    .502         66

Conclusions/Future Work

• Classification NN performed well

• Still might be room for improvement

  • Larger network may be beneficial

  • Another metric would probably improve results greatly

    • 40 meter dash time

    • Ratio of weight and height

    • Annual income

    • Bench press

Written Report

• Your written report must address the following issues:

  • Problem Statement

  • Description of the data used

  • The Learning Machine(s) that was/were used

  • Training and testing results

  • Conclusions and Future Work

• Report will be graded on the following criteria:

  • Clarity of presentation (Be Specific!)

  • Pithiness (don’t use 10 words to say what can be said in 5)

  • Grammar and spelling (yes, this counts!)

  • Depth of knowledge (show me you understand what you’ve done)

  • Results (this is, of course, the big one!)

Example Report

• Anything in the open literature is a good example of what your report should strive to achieve.

If this doesn’t give you a reason to show up on Thursday, nothing will!

Particle Swarm Optimization

• Mimics the motion of a flock of birds converging on a food source to find the global minimum of a search space

• Very very easy to code (I can do the whole algorithm in 6 lines of code)

• Very very easy to modify for better performance

• Simple example of swarm intelligence

• Highly parallelizable and scalable
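As a preview, here is a rough sketch of one common textbook form of particle swarm optimization. It is not the six-line version mentioned above, and everything specific in it is an assumption: the inertia and attraction coefficients, the swarm size, the search span, and the test function `sphere` are illustrative choices only.

```python
import random


def pso_minimize(f, dim, n_particles=20, iterations=100, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5, span=5.0):
    """Minimize f over dim dimensions with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-span, span) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # each particle's best position so far
    best_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position found by the whole flock
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                # Keep some momentum, then pull toward personal and flock bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * rng.random() * (best_pos[i][d] - pos[i][d])
                             + c_global * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def sphere(x):
    """Simple bowl-shaped test function with its global minimum at the origin."""
    return sum(v * v for v in x)


position, value = pso_minimize(sphere, dim=2)
print(position, value)
```

Each particle's update depends only on its own state and the shared flock best, which is why the method parallelizes so readily.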

Genetic Algorithms

• Optimization routine based on evolutionary strategy

• Solutions are encoded as bit-string chromosomes

• New generations (new solutions) are generated via crossover (two parents combine genetic material to form a “child” solution) and mutation (genes are randomly changed with some small probability)
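The two operators can be sketched on bit-string chromosomes as follows. Single-point crossover is only one of several ways for two parents to combine genetic material, so treat that choice, the 5% mutation rate, and the function names as assumptions for illustration.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent joined to the tail of the other."""
    point = rng.randrange(1, len(parent_a))  # cut somewhere strictly inside the string
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


rng = random.Random(0)
mother = [1, 1, 1, 1, 1, 1, 1, 1]
father = [0, 0, 0, 0, 0, 0, 0, 0]
child = mutate(crossover(mother, father, rng), rate=0.05, rng=rng)
print(child)
```

A full genetic algorithm would wrap these in a loop that scores each chromosome and selects the fitter ones as parents; that part is omitted here.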

Simulated Annealing

• Based on metallurgical principle of annealing, wherein metal is slowly heated and cooled to result in the lowest energy state of the atoms in the metal for improved hardness and strength

• Given a particular solution, new solutions are searched for near that solution

• Better solutions (lower error) always accepted, worse solutions accepted with some probability based on an annealing schedule and overall error difference
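The acceptance rule can be sketched as below. The exponential acceptance probability exp(-Δerror / temperature) and the geometric cooling schedule are common choices assumed here, not necessarily the ones this course will use; the step size, starting temperature, and one-dimensional test error are likewise illustrative.

```python
import math
import random


def accept(current_error, candidate_error, temperature, rng):
    """Always take improvements; take worse solutions with a temperature-dependent chance."""
    if candidate_error <= current_error:
        return True
    return rng.random() < math.exp(-(candidate_error - current_error) / temperature)


def anneal(error, start, step=0.5, t_start=1.0, cooling=0.95, iterations=500, seed=0):
    """Minimize a one-dimensional error function by simulated annealing."""
    rng = random.Random(seed)
    current, current_err = start, error(start)
    best, best_err = current, current_err
    temperature = t_start
    for _ in range(iterations):
        candidate = current + rng.uniform(-step, step)  # search near the current solution
        cand_err = error(candidate)
        if accept(current_err, cand_err, temperature, rng):
            current, current_err = candidate, cand_err
            if current_err < best_err:
                best, best_err = current, current_err
        temperature *= cooling                          # assumed geometric schedule
    return best, best_err


best_x, best_err = anneal(lambda x: (x - 2.0) ** 2, start=-3.0)
print(best_x, best_err)
```

Early on, the high temperature lets the search wander past worse solutions; as it cools, the rule behaves more and more like plain downhill search.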
