CS5350/6350 Machine Learning

Fall 2021

Instructor: Shandian Zhe

Shandian Zhe: Probabilistic Machine Learning

Assistant Professor, School of Computing, University of Utah
zhe@cs.utah.edu

Research topics:
1. Bayesian nonparametrics
2. Bayesian deep learning
3. Probabilistic graphical models
4. Large-scale learning systems
5. Tensor/matrix factorization
6. Embedding learning

Applications:
• Collaborative filtering
• Online advertising
• Physical simulation
• Brain imaging data analysis
• …

Outline

• Machine learning definition, applications, and course content

• Course requirements/policies (homework assignments, projects, final exam, etc.)

• Basic knowledge review (random variables, mean, variance, independence, etc.)

What is (machine) learning?


Let’s play a game


The badges game

Attendees of the 1994 Conference on Computational Learning Theory received conference badges labeled + or –

Only one person (Haym Hirsh) knew the function that generated the labels

The function depended only on the attendee's name

The task for the attendees: look at as many examples as you want during the conference and find the unknown function

Let’s play

Name                Label
Claire Cardie       –
Peter Bartlett      +
Eric Baum           –
Haym Hirsh          –
Shai Ben-David      –
Michael I. Jordan   +

How were the labels generated?

What is the label for my name? Yours?
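To make the "find the unknown function" task concrete, here is a minimal sketch (not part of the original game) of how a player might test a candidate rule against the observed badges; the rule shown, "+ if the second letter of the first name is a vowel", is just one hypothesis:

```python
# Observed badges from the table above.
observed = {
    "Claire Cardie": "-", "Peter Bartlett": "+", "Eric Baum": "-",
    "Haym Hirsh": "-", "Shai Ben-David": "-", "Michael I. Jordan": "+",
}

def candidate_rule(name: str) -> str:
    """One hypothesis: '+' if the second letter of the first name is a vowel."""
    first = name.split()[0].lower()
    return "+" if len(first) > 1 and first[1] in "aeiou" else "-"

# A learning procedure in miniature: count how many examples the rule
# explains, then refine the hypothesis and repeat.
correct = sum(candidate_rule(n) == y for n, y in observed.items())
print(f"{correct}/{len(observed)} badges explained")
```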

Playing the badge game → a typical learning procedure

If the players are machines → it is a machine learning procedure!

AlphaGo! An ML algorithm rather than classic AI

Machine learning is everywhere!

And you are probably already using it…

• Is an email spam?

• Find all the people in this photo

• If I like these three movies, what should I watch next?

• Based on your purchase history, you might be interested in…

• Will a stock price go up or down tomorrow? By how much?

• Handwriting recognition

• What are the best ads to place on this website?

• I would like to read that Dutch website in English

• OK Google, drive this car for me. And fly this helicopter for me.

• Does this genetic marker correspond to Alzheimer's disease?


But what is learning?

Let’s try to define (machine) learning


What is machine learning?

“Field of study that gives computers the ability to learn without being explicitly programmed”

Arthur Samuel (1950s)


From 1959!

Learning as generalization

“Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task (or tasks drawn from the same population) more effectively the next time.”

Herbert Simon (1983)


Economist, psychologist, political scientist, computer scientist, sociologist, Nobel Prize (1978), Turing Award (1975)…

Learning as generalization

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Tom Mitchell (1999)

For example, in spam filtering: T = classifying emails as spam or not, E = a corpus of emails labeled by users, and P = classification accuracy.

Learning = generalization

Motivation: Why study machine learning?

• Build computer programs/systems with new capabilities

• Understand the nature of human learning

• Ultimate goal: develop robots that can learn as human beings do!

Machine learning is the future

• Gives a system the ability to perform a task in situations that have never been encountered before

• Big data: learning allows programs to interact more robustly with messy data

• Already making inroads into end-user-facing applications

This course

Focuses on the underlying concepts and algorithmic ideas in the field of machine learning

This course is not about:
• Using a specific machine learning tool
• Any single learning paradigm, e.g., deep learning

How will you learn?

• Attend classes to learn the models and algorithms

• Finish the homework assignments to deepen your understanding

• Implement the learning models/algorithms yourself!

• Do the course project to apply machine learning techniques to real problems!

Workload

• 6 homework assignments (most including both LaTeX and programming problems)

• Project (report and a lot of programming)

• Final exam

Warning: This course is one of the most challenging courses in the CS department. The workload is heavy; plan on ~20 hours per week (on average).

Be cautious when you make the decision :) Be sure to plan enough time for this course!

Overview of this course

Syllabus / course website: https://www.cs.utah.edu/~zhe/teach/cs6350.html

This course

• The course website contains all the detailed information

• The course website is linked from my homepage: http://www.cs.utah.edu/~zhe/


Canvas

• Feel free to post questions and discuss

• Our TAs will respond as fast as they can


Basic Knowledge Review

• Random events and probabilities

– We use sets to represent random events; each element in the set is an atomic outcome
  • Example: tossing a coin 5 times
  • Event A = {H, H, H, T, T}, B = {T, H, T, H, T}, …

– We use probability to measure the chance that an event happens: $p(A)$, $p(B)$

– Both A and B happen: $A \cap B$
– A or B happens: $A \cup B$
– $p(A \cup B) = p(A) + p(B) - p(A \cap B)$
– What is the general version?
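For reference, the general version is the inclusion-exclusion formula for $n$ events (a standard fact, added here as a hint; it is not spelled out on the original slide):

$$p\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} p(A_i) - \sum_{i<j} p(A_i \cap A_j) + \sum_{i<j<k} p(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1}\, p(A_1 \cap \cdots \cap A_n)$$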

Basic Knowledge Review

• Random variables

– For convenience and rigor of description, we use numbers to represent the sample outcomes; these numbers are called random variables. Events are represented by random variables falling in some region.

– Example: tossing a coin, we introduce a random variable $X$: $X = 1$ for heads, $X = 0$ for tails.

– If we toss a coin 5 times, we have 5 random variables $X_1, X_2, X_3, X_4, X_5$.

– Event "we get fewer than 3 heads": $X_1 + X_2 + X_3 + X_4 + X_5 < 3$, with probability $p(X_1 + X_2 + X_3 + X_4 + X_5 < 3)$.
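As an illustration (my sketch, not on the original slide), this probability can be computed exactly by enumeration and approximated by simulation for a fair coin:

```python
import itertools
import random

# Exact: enumerate all 2^5 equally likely outcomes of 5 fair coin tosses.
outcomes = list(itertools.product([0, 1], repeat=5))  # 1 = heads, 0 = tails
p_exact = sum(sum(o) < 3 for o in outcomes) / len(outcomes)

# Monte Carlo: estimate the same probability by simulating many runs.
trials = 100_000
p_mc = sum(sum(random.randint(0, 1) for _ in range(5)) < 3
           for _ in range(trials)) / trials

print(f"p(X1+...+X5 < 3): exact = {p_exact}, simulated ≈ {p_mc:.3f}")  # 0.5
```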

Basic Knowledge Review

• Independence:
  $p(A, B) = p(A)\,p(B)$
  $p(X, Y) = p(X)\,p(Y)$

• Joint probability and conditional probability:
  $p(A, B) = p(A)\,p(B \mid A) = p(B)\,p(A \mid B)$
  $p(X, Y) = p(X)\,p(Y \mid X) = p(Y)\,p(X \mid Y)$

• Conditional independence:
  $p(A, B \mid C) = p(A \mid C)\,p(B \mid C)$
  $p(X, Y \mid Z) = p(X \mid Z)\,p(Y \mid Z)$

What conclusion can you make?
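As a toy numerical check (my sketch, with a made-up joint distribution), these definitions can be verified cell by cell:

```python
import itertools

# A made-up joint distribution p(X, Y) over two binary variables;
# this particular table factorizes as p(X)p(Y), so X and Y are independent.
p_joint = {(0, 0): 0.32, (0, 1): 0.48, (1, 0): 0.08, (1, 1): 0.12}

# Marginals p(X) and p(Y) by summing out the other variable.
p_x = {x: sum(p_joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(p_joint[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Independence: p(X, Y) = p(X) p(Y) must hold for every cell.
independent = all(abs(p_joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in itertools.product((0, 1), repeat=2))
print("X and Y independent?", independent)  # True

# Product rule: p(Y | X) = p(X, Y) / p(X).
print(p_joint[(0, 1)] / p_x[0])  # p(Y=1 | X=0) = 0.48 / 0.8 = 0.6
```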

Basic Knowledge Review

• Expectation:
  $E[X] = \int x\, p(x)\, dx$
  $E[g(X)] = \int g(x)\, p(x)\, dx$

• Variance:
  $\mathrm{Var}(X) = E[X^2] - (E[X])^2 \ge 0$

  When $X$ is a vector (the covariance matrix is positive semidefinite):
  $\mathrm{Cov}(X) = E[XX^\top] - E[X]\,E[X]^\top \succeq 0$

• Conditional expectation/variance: $E[X \mid Y]$, $\mathrm{Var}(X \mid Y)$
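To make these definitions concrete, here is a short Monte Carlo check (my sketch, assuming NumPy; the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Scalar case: X ~ N(2, 3^2), so E[X] = 2 and Var(X) = 9.
x = rng.normal(loc=2.0, scale=3.0, size=n)
print(x.mean())                     # E[X] ≈ 2.0
print((x**2).mean() - x.mean()**2)  # Var(X) = E[X^2] - (E[X])^2 ≈ 9.0

# Vector case: Cov(X) = E[X X^T] - E[X] E[X]^T is positive semidefinite.
X = rng.multivariate_normal(mean=[0.0, 1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=n)
mu = X.mean(axis=0)
cov = X.T @ X / n - np.outer(mu, mu)
print(np.linalg.eigvalsh(cov))  # both eigenvalues >= 0 (up to noise)
```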

Basic Knowledge Review

• Convex region/set
  – A set $X$ is convex if for any $x, y \in X$ and any $\lambda \in [0, 1]$, the point $\lambda x + (1 - \lambda) y$ also lies in $X$

Basic Knowledge Review

• Convex function $f: X \rightarrow \mathbb{R}$
  – The input domain $X$ is a convex region/set
  – For all $x, y \in X$ and $\lambda \in [0, 1]$: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$

Basic Knowledge Review

• Examples of convex functions:
  – Single variable: $f(x) = e^x$, $\;f(x) = -\log(x)$
  – Multivariable: $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top \mathbf{x}$, $\;f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} + b$

• How to determine whether a function is convex?
  – When differentiable (first-order condition): $f(x) \ge f(y) + \nabla f(y)^\top (x - y)$ for all $x, y$
  – When twice differentiable (second-order condition): $\nabla^2 f(x) \succeq 0$ for all $x$
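A quick numerical sanity check (my sketch, assuming NumPy) of the first-order condition for $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top\mathbf{x}$, whose gradient is $\nabla f(\mathbf{x}) = \mathbf{x}$ and whose Hessian is the identity matrix (hence positive semidefinite):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Convex quadratic f(x) = 0.5 x^T x; gradient = x, Hessian = I."""
    return 0.5 * x @ x

# First-order condition: f(x) >= f(y) + grad f(y)^T (x - y) for all x, y.
# (For this f the gap equals 0.5 * ||x - y||^2, which is always >= 0.)
ok = all(f(x) >= f(y) + y @ (x - y) - 1e-12
         for x, y in ((rng.normal(size=3), rng.normal(size=3))
                      for _ in range(1000)))
print("first-order condition holds on all samples:", ok)  # True

# Second-order condition: the Hessian I has eigenvalues [1, 1, 1] >= 0.
print(np.linalg.eigvalsh(np.eye(3)))
```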

Basic Knowledge Review

• Jensen's inequality (for a convex function $f$, where $X$ is a random variable):

  $f(E[X]) \le E[f(X)]$

  $f(E[g(X)]) \le E[f(g(X)])$
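A short numerical illustration (my sketch, assuming NumPy) with the convex function $f(x) = e^x$ and $X \sim N(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)  # samples of X ~ N(0, 1)

# Jensen: f(E[X]) <= E[f(X)] for convex f; here f = exp.
print(np.exp(x.mean()))  # f(E[X]) ≈ exp(0)   = 1.00
print(np.exp(x).mean())  # E[f(X)] ≈ exp(0.5) ≈ 1.65
```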

Basic Knowledge Review

• Matrix derivative

From Petersen & Pedersen, The Matrix Cookbook, Section 2 (Derivatives): differentiation of a number of expressions with respect to a matrix $X$, where it is always assumed that $X$ has no special structure, i.e., that the elements of $X$ are independent (e.g., not symmetric, Toeplitz, or positive definite). The basic assumption can be written as

$$\frac{\partial X_{kl}}{\partial X_{ij}} = \delta_{ik}\,\delta_{lj} \tag{28}$$

and, for vector forms,

$$\left[\frac{\partial \mathbf{x}}{\partial y}\right]_i = \frac{\partial x_i}{\partial y}, \qquad \left[\frac{\partial x}{\partial \mathbf{y}}\right]_i = \frac{\partial x}{\partial y_i}, \qquad \left[\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right]_{ij} = \frac{\partial x_i}{\partial y_j}$$

The following rules are general and very useful when deriving the differential of an expression:

$$\begin{aligned}
\partial A &= 0 \quad (A \text{ is a constant}) && (29)\\
\partial(\alpha X) &= \alpha\,\partial X && (30)\\
\partial(X + Y) &= \partial X + \partial Y && (31)\\
\partial(\mathrm{Tr}(X)) &= \mathrm{Tr}(\partial X) && (32)\\
\partial(XY) &= (\partial X)Y + X(\partial Y) && (33)\\
\partial(X \circ Y) &= (\partial X) \circ Y + X \circ (\partial Y) && (34)\\
\partial(X \otimes Y) &= (\partial X) \otimes Y + X \otimes (\partial Y) && (35)\\
\partial(X^{-1}) &= -X^{-1}(\partial X)X^{-1} && (36)\\
\partial(\det(X)) &= \det(X)\,\mathrm{Tr}(X^{-1}\partial X) && (37)\\
\partial(\ln(\det(X))) &= \mathrm{Tr}(X^{-1}\partial X) && (38)\\
\partial X^\top &= (\partial X)^\top && (39)\\
\partial X^H &= (\partial X)^H && (40)
\end{aligned}$$

Derivatives of a determinant (general form):

$$\frac{\partial \det(Y)}{\partial x} = \det(Y)\,\mathrm{Tr}\!\left[Y^{-1}\frac{\partial Y}{\partial x}\right] \tag{41}$$

$$\frac{\partial^2 \det(Y)}{\partial x^2} = \det(Y)\left[\mathrm{Tr}\!\left[Y^{-1}\frac{\partial}{\partial x}\frac{\partial Y}{\partial x}\right] + \mathrm{Tr}\!\left[Y^{-1}\frac{\partial Y}{\partial x}\right]\mathrm{Tr}\!\left[Y^{-1}\frac{\partial Y}{\partial x}\right] - \mathrm{Tr}\!\left[\left(Y^{-1}\frac{\partial Y}{\partial x}\right)\left(Y^{-1}\frac{\partial Y}{\partial x}\right)\right]\right] \tag{42}$$

(Petersen & Pedersen, The Matrix Cookbook, Version: November 14, 2008, Page 7)

Hint: use The Matrix Cookbook as your reference!

https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
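As a quick numerical check of rule (38) above, $\partial(\ln\det(X)) = \mathrm{Tr}(X^{-1}\partial X)$ (a sketch of mine, assuming NumPy; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# A random symmetric positive definite matrix, so log det(X) is well defined.
A = rng.normal(size=(4, 4))
X = A @ A.T + 4 * np.eye(4)

dX = rng.normal(size=(4, 4))  # an arbitrary perturbation direction
eps = 1e-6

# Finite-difference directional derivative of ln det(X) along dX.
lhs = (np.linalg.slogdet(X + eps * dX)[1]
       - np.linalg.slogdet(X - eps * dX)[1]) / (2 * eps)

# Rule (38) predicts the same value: Tr(X^{-1} dX).
rhs = np.trace(np.linalg.solve(X, dX))

print(lhs, rhs)  # the two numbers agree to ~1e-6
```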
