
Page 1:

Lecture 34 of 42

Wednesday, 19 November 2008

William H. Hsu

Department of Computing and Information Sciences, KSU

KSOL course page: http://snipurl.com/v9v3

Course web site: http://www.kddresearch.org/Courses/Fall-2008/CIS730

Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading for Next Class:

Sections 22.1, 22.6-7, Russell & Norvig 2nd edition

Genetic and Evolutionary Computation, Discussion: GA, GP

Page 2: Hidden Units and Feature Extraction

Training procedure: hidden unit representations that minimize error E

Sometimes backprop will define new hidden features that are not explicit in the input representation x, but which capture properties of the input instances that are most relevant to learning the target function t(x)

Hidden units express newly constructed features

Change of representation to linearly separable D’

A Target Function (Sparse, aka 1-of-C, Coding)

Can this be learned? (Why or why not?)

Learning Hidden Layer Representations

Target function (1-of-8 identity):

  Input                 Output
  1 0 0 0 0 0 0 0   →   1 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0   →   0 1 0 0 0 0 0 0
  0 0 1 0 0 0 0 0   →   0 0 1 0 0 0 0 0
  0 0 0 1 0 0 0 0   →   0 0 0 1 0 0 0 0
  0 0 0 0 1 0 0 0   →   0 0 0 0 1 0 0 0
  0 0 0 0 0 1 0 0   →   0 0 0 0 0 1 0 0
  0 0 0 0 0 0 1 0   →   0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 1   →   0 0 0 0 0 0 0 1

Learned hidden layer representation (8-3-8 network after training):

  Input                 Hidden Values        Output
  1 0 0 0 0 0 0 0   →   0.89 0.04 0.08   →   1 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0   →   0.01 0.11 0.88   →   0 1 0 0 0 0 0 0
  0 0 1 0 0 0 0 0   →   0.01 0.97 0.27   →   0 0 1 0 0 0 0 0
  0 0 0 1 0 0 0 0   →   0.99 0.97 0.71   →   0 0 0 1 0 0 0 0
  0 0 0 0 1 0 0 0   →   0.03 0.05 0.02   →   0 0 0 0 1 0 0 0
  0 0 0 0 0 1 0 0   →   0.22 0.99 0.99   →   0 0 0 0 0 1 0 0
  0 0 0 0 0 0 1 0   →   0.80 0.01 0.98   →   0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 1   →   0.60 0.94 0.01   →   0 0 0 0 0 0 0 1
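
The 8-3-8 identity mapping above can be reproduced with a small feedforward network trained by plain backpropagation. The following is a minimal NumPy sketch (not the course's code); the learning rate, epoch count, and weight initialization are illustrative assumptions, and the hidden encodings it finds will differ from the table above in detail while showing the same effect.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.eye(8)          # eight 1-of-8 input patterns
T = np.eye(8)          # identity target function: output = input

W1 = rng.uniform(-0.1, 0.1, (9, 3))   # input (+ bias) -> 3 hidden units
W2 = rng.uniform(-0.1, 0.1, (4, 8))   # hidden (+ bias) -> 8 output units
eta = 0.5

def forward(X, W1, W2):
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append bias input
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([H, np.ones((len(H), 1))])
    O = sigmoid(Hb @ W2)
    return Xb, H, Hb, O

for epoch in range(20000):                        # batch backprop on squared error
    Xb, H, Hb, O = forward(X, W1, W2)
    delta_o = (O - T) * O * (1 - O)               # output-layer error terms
    delta_h = (delta_o @ W2[:3].T) * H * (1 - H)  # backpropagated hidden error terms
    W2 -= eta * Hb.T @ delta_o
    W1 -= eta * Xb.T @ delta_h

_, H, _, O = forward(X, W1, W2)
print(np.round(H, 2))   # learned 3-unit hidden encodings, one row per input
print(np.round(O, 2))   # should be close to the 8x8 identity after training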

Page 3: Training: Evolution of Error and Hidden Unit Encoding

[Figure: plots of output error errorD(ok) and of the hidden-unit encodings hj(01000000), 1 ≤ j ≤ 3, versus training epoch]

Page 4: Input-to-Hidden Unit Weights and Feature Extraction

Changes in first weight layer values correspond to changes in hidden layer encoding and consequent output squared errors

w0 (bias weight, analogue of threshold in LTU) converges to a value near 0

Several changes in first 1000 epochs (different encodings)

Training: Weight Evolution

[Figure: evolution of the input-to-hidden weights ui1, 1 ≤ i ≤ 8, during training]

Page 5: Convergence of Backpropagation

No Guarantee of Convergence to Global Optimum Solution

Compare: perceptron convergence (to best h ∈ H, provided the target is in H, i.e., D is linearly separable (LS))

Gradient descent to some local error minimum (perhaps not global minimum…)

Possible improvements on backprop (BP)

• Momentum term (BP variant with slightly different weight update rule; see the sketch after this list)

• Stochastic gradient descent (BP algorithm variant)

• Train multiple nets with different initial weights; find a good mixture

Improvements on feedforward networks

• Bayesian learning for ANNs (e.g., simulated annealing) - later

• Other global optimization methods that integrate over multiple networks
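
As a hedged illustration of the momentum variant mentioned above (not course code), the update keeps a fraction α of the previous weight change: Δw(t) = −η ∇E(w) + α Δw(t−1). The function name and defaults below are illustrative.

import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One backprop weight update with a momentum term.

    w          : current weight vector (np.ndarray)
    grad       : gradient of the error E with respect to w
    prev_delta : weight change applied on the previous iteration
    Returns (new_w, delta); feed delta back in as prev_delta next time.
    """
    delta = -eta * grad + alpha * prev_delta   # Delta-w(t) = -eta*gradE + alpha*Delta-w(t-1)
    return w + delta, delta

The momentum term lets the update keep moving through small local ripples in the error surface and speeds progress along shallow ravines.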

Nature of Convergence

Initialize weights near zero

Therefore, initial network near-linear

Increasingly non-linear functions possible as training progresses

Page 6: Overtraining in ANNs

Error versus epochs (Example 2)

Recall: Definition of Overfitting: h′ worse than h on Dtrain, better on Dtest

Overtraining: a type of overfitting due to excessive training iterations

Avoidance: stopping criterion (cross-validation: holdout, k-fold)

Avoidance: weight decay

Error versus epochs (Example 1)

Page 7: Overfitting in ANNs

Other Causes of Overfitting Possible

Number of hidden units sometimes set in advance

Too few hidden units (“underfitting”)

• ANNs with no growth

• Analogy: underdetermined linear system of equations (more unknowns than equations)

Too many hidden units

• ANNs with no pruning

• Analogy: fitting a quadratic polynomial with an approximator of degree >> 2

Solution Approaches

Prevention: attribute subset selection (using pre-filter or wrapper)

Avoidance

• Hold out cross-validation (CV) set or split k ways (when to stop?)

• Weight decay: decrease each weight by some factor on each epoch

Detection/recovery: random restarts, addition and deletion of weights, units
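
A minimal sketch of the two avoidance strategies above, illustrative only: train_epoch, validation_error, and net.weights are assumed helpers (a trainer, a hold-out error measure, and a list of NumPy weight arrays), not course code. Weight decay shrinks every weight each epoch; training stops when hold-out error stops improving.

import copy

def train_with_decay_and_early_stopping(net, train_epoch, validation_error,
                                        decay=1e-4, patience=10, max_epochs=1000):
    """Weight decay plus hold-out early stopping around a generic backprop trainer."""
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_epoch(net)                   # one epoch of backprop on the training set
        for layer in net.weights:          # assumed: list of NumPy weight arrays
            layer *= (1.0 - decay)         # decay each weight by a small factor
        err = validation_error(net)        # error on the held-out CV set
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # stop once hold-out error stops improving
                break
    return best_net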

Page 8: Example: Neural Nets for Face Recognition

90% Accurate Learning of Head Pose, Recognizing 1-of-20 Faces

http://www.cs.cmu.edu/~tom/faces.html

30 x 32 Inputs

Left Straight Right Up

Hidden Layer Weights after 1 Epoch

Hidden Layer Weights after 25 Epochs

Output Layer Weights (including bias weight w0 = θ) after 1 Epoch

Page 9: Example: NETtalk

Sejnowski and Rosenberg, 1987

Early Large-Scale Application of Backprop: learning to convert text to speech

• Acquired model: a mapping from letters to phonemes and stress marks

• Output passed to a speech synthesizer

Good performance after training on a vocabulary of ~1000 words

Very Sophisticated Input-Output Encoding

Input: 7-letter window; determines the phoneme for the center letter, with context on each side; distributed (i.e., sparse) representation: 200 bits

Output: units for articulatory modifiers (e.g., “voiced”), stress, closest phoneme; distributed representation

40 hidden units; 10000 weights total

Experimental Results

Vocabulary: trained on 1024 of 1463 words (informal) and 1000 of 20000 (dictionary)

78% accuracy on the informal corpus, ~60% on the dictionary

http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

Page 10: NeuroSolutions Demo

Page 11: PAC Learning: Definition and Rationale

Intuition

Can’t expect a learner to learn the target concept exactly

• Multiple consistent concepts

• Unseen examples: could have any label (“OK” to mislabel if “rare”)

Can’t always approximate c closely (probability of D not being representative)

Terms Considered

Class C of possible concepts, learner L, hypothesis space H

Instances X, each of length n attributes

Error parameter ε, confidence parameter δ, true error errorD(h)

size(c) = the encoding length of c, assuming some representation

Definition: C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that errorD(h) ≤ ε

Efficiently PAC-learnable: L runs in time polynomial in 1/ε, 1/δ, n, size(c)

Page 12: PAC Learning: Results for Two Hypothesis Languages

Unbiased Learner

Recall: sample complexity bound m ≥ (1/ε)(ln |H| + ln (1/δ))

Sample complexity not always polynomial

Example: for unbiased learner, |H| = 2^|X|

Suppose X consists of n booleans (binary-valued attributes)

• |X| = 2^n, |H| = 2^(2^n)

• m ≥ (1/ε)(2^n ln 2 + ln (1/δ))

• Sample complexity for this H is exponential in n
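
A small sketch (not from the lecture) that plugs numbers into the bound m ≥ (1/ε)(ln |H| + ln (1/δ)); the function name is an illustrative assumption. It contrasts the unbiased learner (|H| = 2^(2^n)) with monotone conjunctions (|H| = 2^n) and shows the exponential blow-up with n.

import math

def sample_bound(ln_H, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) for consistent learners over finite H."""
    return (ln_H + math.log(1.0 / delta)) / eps

eps, delta = 0.1, 0.1
for n in (5, 10, 20):
    ln_H_unbiased = (2 ** n) * math.log(2)    # |H| = 2^(2^n)  =>  ln|H| = 2^n ln 2
    ln_H_monotone = n * math.log(2)           # monotone conjunctions: |H| = 2^n
    print(n, round(sample_bound(ln_H_unbiased, eps, delta)),
             round(sample_bound(ln_H_monotone, eps, delta)))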

Monotone Conjunctions

Target function of the form c = x_{i1} ∧ x_{i2} ∧ … ∧ x_{ik} (a conjunction of k of the n variables)

Active learning protocol (learner gives query instances): n examples needed

Passive learning with a helpful teacher: k examples (k literals in true concept)

Passive learning with randomly selected examples (proof to follow):

m ≥ (1/ε)(ln |H| + ln (1/δ)) = (1/ε)(ln n + ln (1/δ))


Page 13: PAC Learning: Monotone Conjunctions [1]

Monotone Conjunctive Concepts

Suppose c ∈ C (and h ∈ H) is of the form x1 ∧ x2 ∧ … ∧ xm

n possible variables: either omitted or included (i.e., positive literals only)

Errors of Omission (False Negatives)

Claim: the only possible errors are false negatives (h(x) = -, c(x) = +)

Mistake iff (z ∈ h) ∧ (z ∉ c) ∧ (∃ x ∈ Dtest . x(z) = false): then h(x) = −, c(x) = +

Probability of False Negatives

Let z be a literal; let Pr(Z) be the probability that z is false in a positive x ∈ D

z in target concept (correct conjunction c = x1 ∧ x2 ∧ … ∧ xm) ⇒ Pr(Z) = 0

Pr(Z) is the probability that a randomly chosen positive example has z = false (inducing a potential mistake, or deleting z from h if training is still in progress)

error(h) ≤ Σ(z ∈ h) Pr(Z)

[Figure: instance space X with target concept c and hypothesis h, showing positive (+) and negative (−) regions]

Page 14: PAC Learning: Monotone Conjunctions [2]

Bad Literals

Call a literal z bad if Pr(Z) > ε = ε′/n

z does not belong in h, and is likely to be dropped (by appearing with value true in a positive x ∈ D), but has not yet appeared in such an example

Case of No Bad Literals

Lemma: if there are no bad literals, then error(h) ≤ ε′

Proof: error(h) ≤ Σ(z ∈ h) Pr(Z) ≤ Σ(z ∈ h) ε′/n ≤ ε′ (worst case: all n literals appear in h)

Case of Some Bad Literals

Let z be a bad literal

Survival probability (probability that it will not be eliminated by a given example): 1 − Pr(Z) < 1 − ε′/n

Survival probability over m examples: (1 − Pr(Z))^m < (1 − ε′/n)^m

Worst-case survival probability over m examples (n bad literals) = n · (1 − ε′/n)^m

Intuition: more chance of a mistake = greater chance to learn

Page 15: PAC Learning: Monotone Conjunctions [3]

Goal: Achieve An Upper Bound for Worst-Case Survival Probability

Choose m large enough so that the probability of a bad literal z surviving across m examples is less than δ

Pr(z survives m examples) = n · (1 − ε′/n)^m < δ

Solve for m using the inequality 1 − x < e^(−x)

• n · e^(−mε′/n) < δ

• m > (n/ε′)(ln n + ln (1/δ)) examples needed to guarantee the bounds

This completes the proof of the PAC result for monotone conjunctions

Nota Bene: a specialization of m ≥ (1/ε)(ln |H| + ln (1/δ)); here n/ε′ = 1/ε

Practical Ramifications

Suppose δ = 0.1, ε′ = 0.1, n = 100: we need 6907 examples

Suppose δ = 0.1, ε′ = 0.1, n = 10: we need only 460 examples

Suppose δ = 0.01, ε′ = 0.1, n = 10: we need only 690 examples
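
The three figures above follow directly from m > (n/ε′)(ln n + ln (1/δ)); a quick check (the helper name is an illustrative assumption, not course code):

import math

def monotone_conjunction_bound(n, eps_prime, delta):
    """m > (n / eps') * (ln n + ln(1/delta)) examples suffice (PAC, monotone conjunctions)."""
    return (n / eps_prime) * (math.log(n) + math.log(1.0 / delta))

for n, eps_prime, delta in [(100, 0.1, 0.1), (10, 0.1, 0.1), (10, 0.1, 0.01)]:
    print(n, eps_prime, delta, int(monotone_conjunction_bound(n, eps_prime, delta)))
# int(...) truncates the bound, matching the slide's 6907, 460, and 690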

Page 16: PAC Learning: k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF

k-CNF (Conjunctive Normal Form) Concepts: Efficiently PAC-Learnable

Conjunctions of any number of disjunctive clauses, each with at most k literals

c = C1 ∧ C2 ∧ … ∧ Cm; Ci = l1 ∨ l2 ∨ … ∨ lk; ln (|k-CNF|) = ln (2^((2n)^k)) = Θ(n^k)

Algorithm: reduce to learning monotone conjunctions over n^k pseudo-literals Ci

k-Clause-CNF

c = C1 ∧ C2 ∧ … ∧ Ck; Ci = l1 ∨ l2 ∨ … ∨ lm; ln (|k-Clause-CNF|) = ln (3^(kn)) = Θ(kn)

Efficiently PAC learnable? See below (k-Clause-CNF, k-Term-DNF are duals)

k-DNF (Disjunctive Normal Form): disjunctions of any number of conjunctive terms, each with at most k literals

c = T1 ∨ T2 ∨ … ∨ Tm; Ti = l1 ∧ l2 ∧ … ∧ lk

k-Term-DNF: “Not” Efficiently PAC-Learnable (Kind Of, Sort Of…)

c = T1 ∨ T2 ∨ … ∨ Tk; Ti = l1 ∧ l2 ∧ … ∧ lm; ln (|k-Term-DNF|) = ln (k · 3^n) = Θ(n + ln k)

Polynomial sample complexity, not computational complexity (unless RP = NP)

Solution: Don’t use H = C!  k-Term-DNF ⊆ k-CNF (so let H = k-CNF)

Page 17: Consistent Learners

General Scheme for Learning

Follows immediately from definition of consistent hypothesis

Given: a sample D of m examples

Find: some h ∈ H that is consistent with all m examples

PAC: show that if m is large enough, a consistent hypothesis must be close enough to c

Efficient PAC (and other COLT formalisms): show that you can compute the consistent hypothesis efficiently

Monotone Conjunctions

Used an Elimination algorithm (compare: Find-S) to find a hypothesis h that is consistent with the training set (easy to compute)

Showed that with sufficiently many examples (polynomial in the parameters), h is close to c

Sample complexity gives an assurance of “convergence to criterion” for specified m, and a necessary condition (polynomial in n) for tractability
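
The Elimination procedure referenced above has a direct implementation for monotone conjunctions: start with all n variables in the hypothesis and drop any variable that is false in some positive example. This is a sketch under an assumed representation (examples as length-n boolean tuples), not the course's code.

def eliminate(positive_examples, n):
    """Learn a monotone conjunction consistent with the positive examples.

    positive_examples: iterable of length-n boolean tuples labeled positive.
    Returns the set of variable indices retained in the conjunction.
    """
    h = set(range(n))                      # start with x1 AND x2 AND ... AND xn
    for x in positive_examples:
        h = {i for i in h if x[i]}         # drop literals falsified by a positive example
    return h

# Example: target c = x0 AND x2 over n = 4 variables
positives = [(1, 0, 1, 0), (1, 1, 1, 0), (1, 1, 1, 1)]
print(eliminate(positives, 4))             # {0, 2}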

Page 18: VC Dimension: Framework

Infinite Hypothesis Space?

Preceding analyses were restricted to finite hypothesis spaces

Some infinite hypothesis spaces are more expressive than others, e.g.,

• rectangles vs. 17-sided convex polygons vs. general convex polygons

• linear threshold (LT) function vs. a conjunction of LT units

Need a measure of the expressiveness of an infinite H other than its size

Vapnik-Chervonenkis Dimension: VC(H)

Provides such a measure

Analogous to | H |: there are bounds for sample complexity using VC(H)

Page 19: VC Dimension: Shattering a Set of Instances

Dichotomies

Recall: a partition of a set S is a collection of disjoint sets Si whose union is S

Definition: a dichotomy of a set S is a partition of S into two subsets S1 and S2

Shattering

A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S, there exists a hypothesis in H consistent with this dichotomy

Intuition: a rich set of functions shatters a larger instance space

The “Shattering Game” (An Adversarial Interpretation)

Your client selects an S (an instance space X)

You select an H

Your adversary labels S (i.e., chooses a point c from concept space C = 2^X)

You must then find some h ∈ H that “covers” (is consistent with) c

If you can do this for any c your adversary comes up with, H shatters S

Page 20: VC Dimension: Examples of Shattered Sets

Three Instances Shattered

Intervals

Left-bounded intervals on the real axis: [0, a), for a ∈ R, a ≥ 0

• Sets of 2 points cannot be shattered

• Given 2 points, can label so that no hypothesis will be consistent

Intervals on the real axis ([a, b], a, b ∈ R, b > a): can shatter 1 or 2 points, not 3

Half-spaces in the plane (non-collinear): 1? 2? 3? 4?

[Figure: three shattered instances in instance space X; a left-bounded interval [0, a) and an interval [a, b] labeling points + and − on the real line]
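
The interval examples above can be checked by brute force: enumerate every dichotomy of a point set and ask whether some interval hypothesis [a, b] realizes it. A sketch (the function name is an assumption; it relies on the fact that an interval labels exactly a contiguous run of the sorted points positive):

from itertools import product

def interval_shatters(points):
    """Can closed intervals [a, b] on the real line shatter this point set?

    An interval labels a contiguous run (possibly empty) of the sorted points
    positive, so a dichotomy is realizable iff its positives are contiguous.
    """
    pts = sorted(points)
    for labels in product([False, True], repeat=len(pts)):   # every dichotomy
        pos = [i for i, lab in enumerate(labels) if lab]
        contiguous = not pos or pos == list(range(pos[0], pos[-1] + 1))
        if not contiguous:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True: intervals shatter 2 points
print(interval_shatters([1.0, 2.0, 3.0]))   # False: the labeling (+, -, +) is not realizable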

Page 21: Lecture Outline

Readings for Friday

Finish Chapter 20, Russell and Norvig 2e

Suggested: Chapter 1, 6.1-6.5, Goldberg; 9.1 – 9.4, Mitchell

Evolutionary Computation

Biological motivation: process of natural selection

Framework for search, optimization, and learning

Prototypical (Simple) Genetic Algorithm

Components: selection, crossover, mutation

Representing hypotheses as individuals in GAs

An Example: GA-Based Inductive Learning (GABIL)

GA Building Blocks (aka Schemas)

Taking Stock (Course Review)

Page 22: Simple Genetic Algorithm (SGA)

Algorithm Simple-Genetic-Algorithm (Fitness, Fitness-Threshold, p, r, m)

// p: population size; r: replacement rate (aka generation gap width); m: mutation rate

P ← p random hypotheses                          // initialize population
FOR each h in P DO f[h] ← Fitness(h)             // evaluate Fitness: hypothesis → R
WHILE (Max(f) < Fitness-Threshold) DO

1. Select: probabilistically select (1 − r) · p members of P to add to PS, using
   P(hi) = f(hi) / Σ(j=1..p) f(hj)

2. Crossover:
   Probabilistically select (r · p)/2 pairs of hypotheses from P
   FOR each pair <h1, h2> DO PS += Crossover(<h1, h2>)   // PS[t+1] = PS[t] + <offspring1, offspring2>

3. Mutate: invert a randomly selected bit in m · p random members of PS

4. Update: P ← PS

5. Evaluate: FOR each h in P DO f[h] ← Fitness(h)

RETURN the hypothesis h in P that has maximum fitness f[h]
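
A compact Python rendering of the SGA above, as a sketch under simple assumptions (fixed-length bit-string hypotheses, single-point crossover, fitness-proportionate selection); it is not the course's reference implementation, and the defaults are illustrative.

import random

def sga(fitness, fitness_threshold, p=100, r=0.6, m=0.01, length=20, max_gens=500):
    """Simple Genetic Algorithm over fixed-length bit strings.

    fitness: maps a bit string (list of 0/1) to a non-negative real number.
    p: population size, r: replacement rate, m: mutation rate.
    """
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(p)]
    for _ in range(max_gens):
        f = [fitness(h) for h in pop]
        if max(f) >= fitness_threshold:
            break
        def select():                                  # fitness-proportionate (roulette) selection
            return pop[random.choices(range(p), weights=[fi + 1e-9 for fi in f])[0]]
        survivors = [list(select()) for _ in range(int((1 - r) * p))]
        offspring = []
        while len(survivors) + len(offspring) < p:     # crossover on selected pairs
            h1, h2 = select(), select()
            cut = random.randrange(1, length)
            offspring += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
        new_pop = (survivors + offspring)[:p]
        for _ in range(int(m * p)):                    # mutate: flip one bit in m*p random members
            h = random.choice(new_pop)
            i = random.randrange(length)
            h[i] = 1 - h[i]
        pop = new_pop
    return max(pop, key=fitness)

# Example: maximize the number of 1 bits ("one-max")
best = sga(fitness=sum, fitness_threshold=20)
print(best, sum(best))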

Page 23: GA-Based Inductive Learning (GABIL)

GABIL System [DeJong et al., 1993]

Given: concept learning problem and examples

Learn: disjunctive set of propositional rules

Goal: results competitive with those for current decision tree learning algorithms (e.g., C4.5)

Fitness Function: Fitness(h) = (Correct(h))^2

Representation

Rules: IF a1 = T ∧ a2 = F THEN c = T; IF a2 = T THEN c = F

Bit string encoding: a1 [10] . a2 [01] . c [1] . a1 [11] . a2 [10] . c [0] = 10011 11100

Genetic Operators

Want variable-length rule sets

Want only well-formed bit string hypotheses
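
A sketch of the bit-string encoding shown above (an illustrative helper, not GABIL source): each attribute gets one bit per allowed value, followed by the class bit, so the two rules encode to 10011 and 11100.

def encode_rule(constraints, cls, attributes=("a1", "a2"), values=("T", "F")):
    """constraints: dict attribute -> required value (missing attribute = don't care)."""
    bits = ""
    for a in attributes:
        if a in constraints:
            bits += "".join("1" if v == constraints[a] else "0" for v in values)
        else:
            bits += "1" * len(values)      # unconstrained attribute: all values allowed
    return bits + ("1" if cls else "0")

# IF a1 = T AND a2 = F THEN c = T   ;   IF a2 = T THEN c = F
rule1 = encode_rule({"a1": "T", "a2": "F"}, True)
rule2 = encode_rule({"a2": "T"}, False)
print(rule1, rule2)   # -> 10011 11100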

Page 24: Crossover: Variable-Length Bit Strings

Basic Representation

Start with:

        a1 a2 c   a1 a2 c
  h1:   1[0 01 1  11 1]0 0
  h2:   0[1 1]1 0  10 01 0

Idea: allow crossover to produce variable-length offspring

Procedure

1. Choose crossover points for h1, e.g., after bits 1, 8

2. Now restrict crossover points in h2 to those that produce bit strings with well-defined semantics, e.g., <1, 3>, <1, 8>, <6, 8>

Example Suppose we choose <1, 3>

Result

  h3:  11 10 0
  h4:  00 01 1  11 11 0  10 01 0

Page 25: GABIL Extensions

New Genetic Operators

Applied probabilistically

1. AddAlternative: generalize constraint on ai by changing a 0 to a 1

2. DropCondition: generalize constraint on ai by changing every 0 to a 1

New Field

Add fields to bit string to decide whether to allow above operators

  a1  a2  c    a1  a2  c    AA  DC
  01  11  0    10  01  0     1   0

So now learning strategy also evolves!

aka genetic wrapper

Page 26: GABIL Results

Classification Accuracy

Compared to symbolic rule/tree learning methods

C4.5 [Quinlan, 1993]

ID5R

AQ14 [Michalski, 1986]

Performance of GABIL comparable

Average performance on a set of 12 synthetic problems: 92.1% test accuracy

Symbolic learning methods ranged from 91.2% to 96.6%

Effect of Generalization Operators

Result above is for GABIL without AA and DC

Average test set accuracy on 12 synthetic problems with AA and DC: 95.2%

Page 27: Building Blocks (Schemas)

Problem

How to characterize evolution of population in GA?

Goal

Identify basic building block of GAs

Describe family of individuals

Definition: Schema

String containing 0, 1, * (“don’t care”)

Typical schema: 10**0*

Instances of above schema: 101101, 100000, …

Solution Approach

Characterize population by number of instances representing each schema

m(s, t) number of instances of schema s in population at time t

Page 28: Selection and Building Blocks

Restricted Case: Selection Only

f̄(t) ≡ average fitness of population at time t

m(s, t) ≡ number of instances of schema s in population at time t

û(s, t) ≡ average fitness of instances of schema s at time t

Quantities of Interest

Probability of selecting h in one selection step:  P(h) = f(h) / Σ(i=1..n) f(hi)

Probability of selecting an instance of s in one selection step:
  P(h ∈ s) = Σ(h ∈ s ∩ pt) f(h) / (n · f̄(t)) = (û(s, t) · m(s, t)) / (n · f̄(t))

Expected number of instances of s after n selections:  E[m(s, t+1)] = (û(s, t) / f̄(t)) · m(s, t)

Page 29: Schema Theorem

Theorem

m(s, t) number of instances of schema s in population at time t

f̄(t) ≡ average fitness of population at time t

û(s, t) ≡ average fitness of instances of schema s at time t

pc probability of single point crossover operator

pm probability of mutation operator

l length of individual bit strings

o(s) number of defined (non “*”) bits in s

d(s) distance between rightmost, leftmost defined bits in s

Intuitive Meaning

“The expected number of instances of a schema in the population tends toward its relative fitness”

A fundamental theorem of GA analysis and design

E[m(s, t+1)] ≥ (û(s, t) / f̄(t)) · m(s, t) · (1 − pc · d(s) / (l − 1)) · (1 − pm)^o(s)
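
A direct transcription of the bound, as a sketch (the function name and the example numbers are illustrative assumptions):

def schema_theorem_bound(m_s_t, u_hat_s_t, f_bar_t, p_c, p_m, l, o_s, d_s):
    """Lower bound on E[m(s, t+1)] from the Schema Theorem.

    m_s_t    : m(s, t), instances of schema s at time t
    u_hat_s_t: average fitness of instances of s at time t
    f_bar_t  : average fitness of the whole population at time t
    p_c, p_m : crossover and mutation probabilities
    l        : bit-string length; o_s = o(s); d_s = d(s)
    """
    selection = (u_hat_s_t / f_bar_t) * m_s_t
    crossover_survival = 1.0 - p_c * d_s / (l - 1)
    mutation_survival = (1.0 - p_m) ** o_s
    return selection * crossover_survival * mutation_survival

# Schema 10**0* (l = 6, o(s) = 3, d(s) = 4) whose instances are 20% fitter than average:
print(schema_theorem_bound(m_s_t=10, u_hat_s_t=1.2, f_bar_t=1.0,
                           p_c=0.6, p_m=0.001, l=6, o_s=3, d_s=4))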

Page 30: Genetic Programming

Readings / Viewings: View GP videos 1-3

GP1 – Genetic Programming: The Video

GP2 – Genetic Programming: The Next Generation

GP3 – Genetic Programming: Invention

GP4 – Genetic Programming: Human-Competitive

Suggested: Chapters 1-5, Koza

Previously: Genetic and evolutionary computation (GEC)

Generational vs. steady-state GAs; relation to simulated annealing, MCMC

Schema theory and GA engineering overview

Today: GP Discussions

Code bloat and potential mitigants: types, OOP, parsimony, optimization, reuse

Genetic programming vs. human programming: similarities, differences

Page 31: GP Flow Graph

Adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez, http://www.geneticprogramming.com

Page 32: Structural Crossover

Adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez, http://www.geneticprogramming.com

Page 33: Structural Mutation

Adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez, http://www.geneticprogramming.com

Page 34: Terminology

Evolutionary Computation (EC): Models Based on Natural Selection

Genetic Algorithm (GA) Concepts

Individual: single entity of model (corresponds to hypothesis)

Population: collection of entities in competition for survival

Generation: single application of selection and crossover operations

Schema aka building block: descriptor of GA population (e.g., 10**0*)

Schema theorem: representation of schema proportional to its relative fitness

Simple Genetic Algorithm (SGA) Steps

Selection

Proportionate (aka roulette wheel): P(individual) ∝ f(individual)

Tournament: let individuals compete in pairs or tuples; eliminate unfit ones

Crossover

Single-point: 11101001000 × 00001010101 → { 11101010101, 00001001000 }

Two-point: 11101001000 × 00001010101 → { 11001011000, 00101000101 }

Uniform: 11101001000 × 00001010101 → { 10001000100, 01101011001 }

Mutation: single-point (“bit flip”), multi-point
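
The three crossover forms and bit-flip mutation listed above, as a short sketch (function names and default random choices are assumptions; the single-point call at the end reproduces the first example above when the cut point is 5):

import random

def single_point(a, b, point=None):
    """Single-point crossover of two equal-length bit strings."""
    p = point if point is not None else random.randrange(1, len(a))
    return a[:p] + b[p:], b[:p] + a[p:]

def two_point(a, b, p1=None, p2=None):
    """Two-point crossover: swap the middle segment."""
    if p1 is None or p2 is None:
        p1, p2 = sorted(random.sample(range(1, len(a)), 2))
    return a[:p1] + b[p1:p2] + a[p2:], b[:p1] + a[p1:p2] + b[p2:]

def uniform(a, b):
    """Uniform crossover: each bit position swapped independently with probability 0.5."""
    mask = [random.random() < 0.5 for _ in a]
    c1 = "".join(x if keep else y for x, y, keep in zip(a, b, mask))
    c2 = "".join(y if keep else x for x, y, keep in zip(a, b, mask))
    return c1, c2

def bit_flip(a):
    """Single-point mutation: flip one randomly chosen bit."""
    i = random.randrange(len(a))
    return a[:i] + ("1" if a[i] == "0" else "0") + a[i + 1:]

print(single_point("11101001000", "00001010101", point=5))
# -> ('11101010101', '00001001000')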

Page 35: Summary Points

Evolutionary Computation

Motivation: process of natural selection

Limited population; individuals compete for membership

Method for parallel, stochastic search

Framework for problem solving: search, optimization, learning

Prototypical (Simple) Genetic Algorithm (GA)

Steps

Selection: reproduce individuals probabilistically, in proportion to fitness

Crossover: generate new individuals probabilistically, from pairs of “parents”

Mutation: modify structure of individual randomly

How to represent hypotheses as individuals in GAs

An Example: GA-Based Inductive Learning (GABIL)

Schema Theorem: Propagation of Building Blocks

Next Lecture: Genetic Programming, The Movie