26
Math 285 Classification with Handwritten Digits – First day of class, Spring 2016 Dr. Guangliang Chen Dept. of Math and Statistics San José State University

Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

Math 285 Classification with Handwritten Digits– First day of class, Spring 2016

Dr. Guangliang Chen

Dept. of Math and StatisticsSan José State University

Page 2: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

How did this course originate?• Developed from last fall’s Math 203 CAMCOS (whose theme was

classification)

• Based on a current Kaggle competition: Digit Recognizer

– Kaggle is a Silicon Valley start-up and Kaggle.com is its onlineplatform hosting many data science competitions.

– It uses a crowdsourcing approach that relies on the fact that“there are countless strategies that can be applied to any pre-dictive modelling task and it is impossible to know at the outsetwhich technique or analyst will be most effective”.

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 2/26

Page 3: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

How do Kaggle competitions operate?

• Companies, with the help of Kaggle, post their data as well as adescription of the problem on the website;

• Participants (from all over the world) experiment with different tech-niques and submit their best results to a scoreboard to compete;

• After the deadline passes, the winning team receives a cash reward(which could be as much as several millions) and the company ob-tains "a worldwide, perpetual, irrevocable and royalty-free license”.

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 3/26

Page 4: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Potential benefits of participating in a Kagglecompetition

• Experience with large, complex, interesting, real data

• Learning (new knowledge and skills)

• Become a part of the data science community

• Cash prize

• Could get you a job

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 4/26

Page 5: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

What is classification?

Classificiation is a major machine learning field that studies how to assignlabels to new data (test set) based on labeled data (training set).

• Lots of applications, e.g., spam email detection, digit recognition,face recognition, and document classification

• Classification is an example of supervised learning ; in contrast, clus-tering is unsupervised (focus of last semester’s 285)

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 5/26

Page 6: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

The digit recognition problemGiven a set of training examples, determine what digits the test imagescontain by machine:

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 6/26

Page 7: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 7/26

Page 8: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Our main data set

The MNIST database of handwritten digits, formed by Yann LeCun ofNYU, has a total of 70,000 examples from approximately 250 writers:

• The images are 28 × 28 in size

• The training set contains 60,000 images while the test set has 10,000

• It is a benchmark dataset used by many people

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 8/26

Page 9: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Why MNIST?• Famous

• Simple to understand and use

• Yet difficult enough for classification

– Big data (large size and high dimensionality)

– 10 classes in total (0, 1, . . . , 9)

– Great variability (due to different ways people write)

– Nonlinear separation

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 9/26

Page 10: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

• Well studied (thus lots of resources available)

– The Kaggle competition page(https://www.kaggle.com/c/digit-recognizer)

– Lecun’s page (http://yann.lecun.com/exdb/mnist/)

– CAMCOS course page from last semester(http://www.math.sjsu.edu/~gchen/math203.html)

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 10/26

Page 11: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Representation of the digits

• The original format is matrices (of size 28 × 28)

• They can also be turned into vectors (784 dimensional)

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 11/26

Page 12: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Visualization of the data set1. The “average” writer

2. The full appearance of each digit class0 - 3

-6 -4 -2 0 2 4 6-5

-4

-3

-2

-1

0

1

2

3

4

5

-4 -3 -2 -1 0 1 2 3 4 5

-3

-2

-1

0

1

2

3

4

-4 -2 0 2 4 6

-4

-2

0

2

4

6

-4 -2 0 2 4 6

-4

-3

-2

-1

0

1

2

3

4

5

(cont’d on next page)

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 12/26

Page 13: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

4-6

-5 -4 -3 -2 -1 0 1 2 3 4

-5

-4

-3

-2

-1

0

1

2

3

4

-4 -2 0 2 4 6

-5

-4

-3

-2

-1

0

1

2

3

4

5

-6 -5 -4 -3 -2 -1 0 1 2 3 4

-4

-3

-2

-1

0

1

2

3

4

5

7-9

-6 -4 -2 0 2 4

-5

-4

-3

-2

-1

0

1

2

3

4

-5 -4 -3 -2 -1 0 1 2 3 4 5

-4

-2

0

2

4

6

-6 -5 -4 -3 -2 -1 0 1 2 3 4-6

-5

-4

-3

-2

-1

0

1

2

3

4

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 13/26

Page 14: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

A very first attempt at classification

Assign labels to test images based on the most similar centers.

Call this classifier (global) kmeans.

How well does it do: 18.0% error rate.

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 14/26

Page 15: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

A second attemptThe k nearest neighbors (kNN) classifier assigns class label based on thek closest examples around a test point

k0 2 4 6 8 10 12 14 16 18 20

erro

r ra

te

0.029

0.03

0.031

0.032

0.033

0.034

0.035

0.036

0.037

0.038

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 15/26

Page 16: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Which other classifiers will we cover?• Linear classifiers, such as Logistic regression, LDA, and SVM

• Nonlinear classifiers

• Tree classifiers

• Neural networks

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 16/26

Page 17: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Additional data to be used

• Small toy data sets created by the instructor

• Real data in other databases, such as

– USPS Handwritten digits(http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html),

– UC Irvine Machine Learning Repository(http://archive.ics.uci.edu/ml/datasets.html).

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 17/26

Page 18: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Necessary technical background to take thiscourse

• Strong linear algebra knowledge and skills (129A)

• Know probability and statistics well (163 and 164)

• Excellent programming skills (in at least one of Matlab, R, Python)

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 18/26

Page 19: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Characterwise expectation

• Hard work

• Independent thinking

• Team player

• Active learner

– Eager to explore new things

– Willing to go beyond requirements

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 19/26

Page 20: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Requirements of this course

• About 6 homework assignments (50%)

• A midterm project (20%)

• A final project (30%)

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 20/26

Page 21: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Homework policy

• You may collaborate on homework but you must write independentcodes and solutions.

• You must prepare your homework solutions in presentation formatusing PowerPoint or LaTex.

• You must submit homework on time in order to receive full credit.

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 21/26

Page 22: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Is there a required textbook?

None, but we will cover material from various sources (websites, papers,textbook chapters, instructors notes, etc.) and reading material will beprovided from time to time in class.

Below is an excellent book that you may use as reference:

• The Elements of Statistical Learning: Data Mining, Inference, andPrediction, 2nd edition, by Hastie, Tibshirani, and Friedman, Springer.Freely available at http://statweb.stanford.edu/~tibs/ElemStatLearn/

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 22/26

Page 23: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

The MATLAB License issueUnfortunately, MATLAB could not be installed on the computers in thisclassroom. Here are some alternative options:

• A freely available 30-day trial version of the newest MATLAB withall toolboxes at mathworks.com

• The computer lab in MacQuarrie Hall 221 (limited access)

• Consider buying MATLAB student license (platform+all toolboxes,$99.99/year).

• Talk to me if you still need help.

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 23/26

Page 24: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Some final reminders

• This is an experimental course (subject to change as needed)

• This is a demanding/challenging course

• This is also a highly rewarding course

• This is not a traditional course (essentially projects based)

• This is not an ordinary classroom!

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 24/26

Page 25: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

An Introduction to Math 285 Classification with Handwritten Digits

Assignments

• Install/check software (MATLAB, R, Python).

• Download the MNIST data set from Lecun’s page (and use thescripts that I provide to process them into .mat format)

• Perform very preliminary experiments (e.g., you need to be able toload the data). Try to understand the data as much as you can.

• Explore Kaggle.com and the competition pages

Dr. Guangliang Chen | Mathematics & Statistics, San José State University 25/26

Page 26: Math 285 Classification with Handwritten Digits – First day ...gchen/Math285S16/intro.pdfplatform hosting many data science competitions. – It uses a crowdsourcing approach that

Questions?