
Overview of Today’s Lecture

• Last Time: course introduction
  • Reading assignment posted to class webpage
  • Don’t get discouraged

• Today: introduction to “Supervised Machine Learning”
  • Our first ML algorithm: K-nearest neighbor

• HW 0 out online
  • Create a dataset of “fixed-length feature vectors”
  • Due next Tuesday Sept 19 (4 PM)
  • Instructions for handing in HW0 coming soon

Supervised Learning: Overview

[Pipeline] Real World → (humans select features: HW 0) → Digital Representation (feature space) → (machine constructs classifier: HW 1-2) → classification rules, e.g., “If feature 2 = X then APPLY BRAKE = TRUE”

Supervised Learning: Task Definition

• Given

  • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members)

• Produce
  • A description that covers (includes) all (most) of the positive examples and none (few) of the negative examples
    (which, hopefully, properly categorizes most future examples! ← the key point)

Note: one can easily extend this definition to handle more than two classes

Example

[Figure: a column of positive example figures and a column of negative example figures]

• How does this symbol classify?
• Concept: solid red circle in a regular polygon
• What about?
  • Figures with red solid circles not in a larger red circle
  • Figures on the left side of the page, etc.

HW0 – Your “Personal Concept”

• Step 1: Choose a Boolean (true/false) concept
  • Subjective judgments (can’t articulate)
    • Books I like/dislike
    • Movies I like/dislike
    • WWW pages I like/dislike
  • “Time will tell” concepts
    • Stocks to buy
    • Medical treatment (at time t, predict outcome at time t + ∆t)
  • Sensory interpretation
    • Face recognition (see text)
    • Handwritten digit recognition
    • Sound recognition
  • Hard-to-program functions

HW0 – Your “Personal Concept”

• Step 2: Choose a feature space
  • We will use fixed-length feature vectors (this defines a space; see the sketch after this list)
    • Choose N features
    • Each feature has Vi possible values
    • Each example is represented by a vector of N feature values (i.e., is a point in the feature space), e.g., <red, 50, round> (color, weight, shape)
  • Feature types
    • Boolean
    • Nominal
    • Ordered
    • Hierarchical (we will not use hierarchical features)
• Step 3: Collect examples (“I/O” pairs)
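To make the representation concrete, here is a minimal sketch (in Python; not part of HW0’s official spec) of a dataset of fixed-length feature vectors. The feature names and values are illustrative.

```python
# A minimal sketch: each example is a fixed-length tuple of N feature
# values plus a Boolean label. Feature names/values are made up.
FEATURE_NAMES = ("color", "weight", "shape")   # N = 3 features

# Each example is a point in the feature space, e.g. <red, 50, round>.
dataset = [
    (("red",   50, "round"),  True),    # positive example
    (("blue", 120, "square"), False),   # negative example
]

for features, label in dataset:
    point = dict(zip(FEATURE_NAMES, features))
    print(point, "->", "+" if label else "-")
```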

Standard Feature Types
(for representing training examples – a source of “domain knowledge”)

• Nominal (Boolean is a special case)
  • No relationship among possible values
    e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz)
• Linear (or Ordered)
  • Possible values of the feature are totally ordered
    e.g., size ∈ {small, medium, large} ← discrete
          weight ∈ [0…500] ← continuous
• Hierarchical
  • Possible values are partially ordered in an ISA hierarchy
    e.g., for shape: closed → {polygon, continuous}; polygon → {triangle, square}; continuous → {circle, ellipse}
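The distinction matters when an algorithm must compare feature values. A hedged sketch of one possible encoding: nominal values can only be tested for equality, while linear values support a meaningful numeric difference. The mappings and scales below are invented for illustration.

```python
# Illustrative encodings (one of many possible choices):
# - nominal features: no order, so distance is 0/1 (match or mismatch)
# - linear features: totally ordered, so a numeric difference makes sense
SIZES = {"small": 0, "medium": 1, "large": 2}   # ordered -> map to integers

def nominal_dist(a, b):
    return 0 if a == b else 1                   # e.g. color

def linear_dist(a, b, scale):
    # Works for discrete ordered values (via SIZES) and raw numbers alike.
    return abs(SIZES.get(a, a) - SIZES.get(b, b)) / scale

print(nominal_dist("red", "blue"))          # 1
print(linear_dist("small", "large", 2))     # 1.0
print(linear_dist(100, 350, 500))           # 0.5  (weight in [0...500])
```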

Example Hierarchy (KDD* Journal, Vol 5, No. 1-2, 2001, page 17)

Product
  → 99 product classes (e.g., Pet Foods, Tea)
  → 2302 product subclasses (e.g., Canned Cat Food, Dried Cat Food)
  → ~30k products (e.g., Friskies Liver, 250g)

• This is the structure of one feature!
• “The need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.”
  – from the editors’ introduction to the special issue (on applications) of the KDD journal, Vol 5, 2001

* Officially, “Data Mining and Knowledge Discovery”, Kluwer Publishers
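One simple way to represent such a partially ordered feature is a child-to-parent map with an ISA test. The sketch below uses the product hierarchy above, but the representation itself is an illustrative assumption, not something prescribed by the journal article.

```python
# Hypothetical sketch of one hierarchical feature: the partial (ISA)
# order is just a child -> parent map.
ISA = {
    "Friskies Liver, 250g": "Canned Cat Food",
    "Canned Cat Food": "Pet Foods",
    "Dried Cat Food":  "Pet Foods",
    "Pet Foods":       "Product",
    "Tea":             "Product",
}

def isa(value, ancestor):
    """True if `value` equals `ancestor` or lies below it in the hierarchy."""
    while value is not None:
        if value == ancestor:
            return True
        value = ISA.get(value)    # climb one level; None at the root
    return False

print(isa("Friskies Liver, 250g", "Pet Foods"))  # True
print(isa("Tea", "Pet Foods"))                   # False
```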

Some Famous Examples

• Car steering (Pomerleau): digitized camera image → learned function → steering angle
• Medical diagnosis (Quinlan): medical record (e.g., age = 13, sex = M, wgt = 18) → learned function → ill vs. healthy
• DNA categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing
• WWW page scoring
• Credit application scoring

HW0: Creating your dataset

1. Choose a dataset
   • based on interest/familiarity
   • meets basic requirements:
     • >1000 examples
     • category (function) learned should be binary valued
     • ~500 “true” and ~500 “false” examples

→ Internet Movie Database (IMDb)

Example Database: IMDb

Entities and their attributes:
• Movie: Title, Genre, Year, Opening Weekend, BO receipts, List of actors/actresses, Release season
• Studio: Name, Country, Movies
• Director/Producer: Name, Year of birth, Movies
• Actor: Name, Year of birth, Gender, Oscars, Movies

Relations: Studio –Made→ Movie; Director –Directed→ Movie; Producer –Produced→ Movie; Actor –Acted in→ Movie

HW0: Creating your dataset

Choose a Boolean target function (category)
• Some examples:
  • Opening weekend box office receipts > $2 million
  • Movie is drama? (action, sci-fi, …)
  • Movies I like/dislike (e.g., TiVo)

HW0: Creating your dataset

Create your feature space (see the sketch after this list):
• Movie: average age of actors; number of producers; percent female actors
• Studio: number of movies made; average movie gross; percent movies released in US
• Director/Producer: years of experience; most prevalent genre; number of award-winning movies; average movie gross
• Actor: gender; has previous Oscar award or nominations; most prevalent genre
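As a rough illustration of turning a raw IMDb-style record into feature values, the sketch below computes three of the candidate movie features. The record layout and field names are hypothetical, not IMDb’s actual schema.

```python
# Hedged sketch: computing a few of the candidate features above from a
# hypothetical movie record (field names are invented for illustration).
movie = {
    "actors": [
        {"name": "A", "gender": "F", "age": 31, "oscar_nominated": True},
        {"name": "B", "gender": "M", "age": 45, "oscar_nominated": False},
    ],
    "producers": ["P1", "P2"],
}

n_actors          = len(movie["actors"])
avg_actor_age     = sum(a["age"] for a in movie["actors"]) / n_actors
num_producers     = len(movie["producers"])
pct_female_actors = 100 * sum(a["gender"] == "F" for a in movie["actors"]) / n_actors

feature_vector = (avg_actor_age, num_producers, pct_female_actors)
print(feature_vector)   # (38.0, 2, 50.0)
```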

HW0: Creating your datasetHW0: Creating your dataset

David Jensen’s group at UMass used Naïve Bayes David Jensen’s group at UMass used Naïve Bayes (NB) to predict the following based on attributes they (NB) to predict the following based on attributes they selected and a novel way of sampling from the data:selected and a novel way of sampling from the data:

• Opening weekend box office receipts > $2 Opening weekend box office receipts > $2 millionmillion• 25 attributes25 attributes• Accuracy = 83.3%Accuracy = 83.3%• Default accuracy = 56%Default accuracy = 56%

• Movie is drama?Movie is drama?• 12 attributes12 attributes• Accuracy = 71.9%Accuracy = 71.9%• Default accuracy = 51%Default accuracy = 51%

• http://kdl.cs.umass.edu/proximity/about.htmlhttp://kdl.cs.umass.edu/proximity/about.html


Back to Supervised Learning

One way learning systems differ is in how they represent concepts:

Training Examples →
• Backpropagation → Neural Net
• C4.5, CART → Decision Tree
• AQ, FOIL → Rules (e.g., Φ ← X ∧ Y; Φ ← Z)
• SVMs → numeric thresholds (e.g., If 5x1 + 9x2 – 3x3 > 12 Then +)

Feature Space

If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.

[Figure: a 3-D feature space with axes Size, Color, and Weight; the example <Big, Gray, 2500> is plotted as a query point “?”]

A “concept” is then a (possibly disjoint) volume in this space.

Supervised Learning = Learning from Labeled Examples

• Most common & successful form of ML

[Venn diagram: “+” and “–” labeled points scattered in feature space]

• Examples – points in multi-dimensional “feature space”
• Concepts – “function” that labels points in feature space (as +, –, and possibly ?)

Brief Review

• Conjunctive concept (“and”)
  • Color(?obj1, red) ∧ Size(?obj1, large)
• Disjunctive concept (“or”)
  • Color(?obj2, blue) ∨ Size(?obj2, small)

[Figure: sample instances illustrating the two concepts]
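In code, these two concept forms are just predicates over an instance’s attribute values. A minimal sketch (the attribute names and the dict representation are illustrative):

```python
# Sketch: the two concepts above written as Python predicates over an
# object represented as a dict of attribute values.
def conjunctive(obj):                      # Color(x, red) AND Size(x, large)
    return obj["color"] == "red" and obj["size"] == "large"

def disjunctive(obj):                      # Color(x, blue) OR Size(x, small)
    return obj["color"] == "blue" or obj["size"] == "small"

x = {"color": "red", "size": "small"}
print(conjunctive(x), disjunctive(x))      # False True
```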

Empirical Learning and Venn Diagrams

[Venn diagram: feature space containing many scattered “–” points, with the “+” points confined to two regions labeled A and B]

• Concept = A or B (disjunctive concept)
• Examples = labeled points in feature space
• Concept = a label for a set of points

Aspects of an ML System

• “Language” for representing examples (HW 0)
• “Language” for representing “concepts”
• Technique for producing a concept “consistent” with the training examples
• Technique for classifying new instances

(the latter aspects are the focus of the other HWs)

Each of these limits the expressiveness/efficiency of the supervised learning algorithm.

Nearest-Neighbor Algorithms
(a.k.a. exemplar models, instance-based learning (IBL), case-based learning)

• Learning ≈ memorize training examples
• Problem solving = find most similar example in memory; output its category

[Venn diagram: stored “+” and “–” examples with a query point “?”; the induced decision boundaries form “Voronoi diagrams” (pg 233)]

Sample Experimental Results

Testset correctness:

Testbed             IBL    D-Trees   Neural Nets
Wisconsin Cancer    98%    95%       96%
Heart Disease       78%    76%       ?
Tumor               37%    38%       ?
Appendicitis        83%    85%       86%

A simple algorithm works quite well!

Simple Example – 1-NN
(1-NN ≡ one nearest neighbor)

Training Set
1. a=0, b=0, c=1  +
2. a=0, b=0, c=0  –
3. a=1, b=1, c=1  –

Test Example
• a=0, b=1, c=0  ?

“Hamming distance” to each training example:
• Ex 1 = 2
• Ex 2 = 1
• Ex 3 = 2

So output – (the nearest neighbor, Ex 2, is negative)
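A short Python sketch reproducing this example:

```python
# The slide's 1-NN example: Hamming distance counts mismatched features.
train = [
    ((0, 0, 1), "+"),   # Ex 1
    ((0, 0, 0), "-"),   # Ex 2
    ((1, 1, 1), "-"),   # Ex 3
]
test = (0, 1, 0)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

print([hamming(test, x) for x, _ in train])         # [2, 1, 2]
nearest = min(train, key=lambda ex: hamming(test, ex[0]))
print("output:", nearest[1])                        # output: -
```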

K-NN Algorithm

Collect the K nearest neighbors, then select the majority classification (or somehow combine their classes)

• What should K be?
  • It is probably problem dependent
  • Can use tuning sets (later) to select a good setting for K

[Figure: tuning-set error rate plotted against K = 1, 2, 3, 4, 5; one shouldn’t really “connect the dots” (Why?)]
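A compact sketch of the whole procedure: majority vote over the K nearest neighbors, plus choosing K by its error rate on a tuning set (as in the figure above, picking the K that minimizes the curve rather than fitting it). The distance function, data, and candidate K values are illustrative.

```python
from collections import Counter

def knn_predict(train, x, k, dist):
    """Majority vote over the k training examples nearest to x."""
    neighbors = sorted(train, key=lambda ex: dist(x, ex[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def choose_k(train, tune, candidates, dist):
    """Return the K with the lowest error rate on the tuning set."""
    def error(k):
        wrong = sum(knn_predict(train, x, k, dist) != y for x, y in tune)
        return wrong / len(tune)
    return min(candidates, key=error)

# Toy data: squared Euclidean distance over 2-D numeric features.
euclid2 = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
train = [((0, 0), "-"), ((0, 1), "-"), ((1, 0), "+"), ((1, 1), "+"), ((2, 2), "+")]
tune  = [((0.2, 0.1), "-"), ((1.8, 1.9), "+")]

best_k = choose_k(train, tune, candidates=[1, 3, 5], dist=euclid2)
print(best_k, knn_predict(train, (0.9, 0.2), best_k, euclid2))   # 1 +
```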

Some Common Jargon

• Classification: learning a discrete-valued function
• Regression: learning a real-valued function

IBL is easily extended to regression tasks (and to multi-category classification); see the sketch below.
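The regression extension is essentially a one-line change: average the neighbors’ real-valued outputs instead of taking a majority vote. A minimal sketch with made-up data:

```python
# K-NN regression sketch: predict the mean output of the K nearest examples.
def knn_regress(train, x, k):
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    neighbors = sorted(train, key=lambda ex: dist(x, ex[0]))[:k]
    return sum(y for _, y in neighbors) / k

train = [((0,), 1.0), ((1,), 3.0), ((2,), 5.0), ((3,), 7.0)]
print(knn_regress(train, (1.4,), k=2))   # 4.0 (average of y at x=1 and x=2)
```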

Variations on a Theme
(from Aha, Kibler and Albert in the ML Journal)

• IB1 – keep all examples

• IB2 – keep the next instance if it is incorrectly classified by using the previous instances (sketched below)
  • Uses less storage
  • Order dependent
  • Sensitive to noisy data
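A sketch of the IB2 storage rule as described above: an instance is stored only when the instances kept so far would misclassify it, which is also what makes the method order dependent and noise sensitive. The data and distance function are illustrative.

```python
# IB2 sketch (after Aha, Kibler & Albert): process the training stream in
# order, storing an instance only if the kept set misclassifies it (1-NN).
def ib2(stream, dist):
    kept = []
    for x, y in stream:
        if kept:
            nearest = min(kept, key=lambda ex: dist(x, ex[0]))
            if nearest[1] == y:          # correctly classified: don't store
                continue
        kept.append((x, y))              # first instance, or a mistake
    return kept

dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
stream = [((0, 0), "-"), ((0, 1), "-"), ((5, 5), "+"), ((5, 4), "+"), ((0, 2), "-")]
print(ib2(stream, dist))   # keeps only ((0, 0), '-') and ((5, 5), '+')
```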

Variations on a Theme (cont.)

• IB3 – extend IB2 to more intelligently decide which examples to keep (see article)
  • Better handling of noisy data

• Another idea – cluster groups, keep “examples” from each (median/centroid)

Next time

• Finish K-NN
• Begin linear separators
  • Naïve Bayes