
Machine Learning Overview - Columbia University


Page 1: Machine Learning Overview

Overview

COMS 4771 Fall 2019

0 / 24

Machine Learning

Page 2: Machine Learning Overview

This is not Section H02

Page 3: Machine Learning Overview

Overview

• What is machine learning?

• Basic topics/challenges in machine learning

Page 4: Machine Learning Overview

Applications I

• Image classification: Predict bird species depicted in image

Page 5: Machine Learning Overview

Applications II

• Recommender systems: Predict how user would rate a movie (Koren, Bell, and Volinsky, 2009)

Excerpt from the cited article:

"…recommendation. These methods have become popular in recent years by combining good scalability with predictive accuracy. In addition, they offer much flexibility for modeling various real-life situations.

Recommender systems rely on different types of input data, which are often placed in a matrix with one dimension representing users and the other dimension representing items of interest. The most convenient data is high-quality explicit feedback, which includes explicit input by users regarding their interest in products. For example, Netflix collects star ratings for movies, and TiVo users indicate their preferences for TV shows by pressing thumbs-up and thumbs-down buttons. We refer to explicit user feedback as ratings. Usually, explicit feedback comprises a sparse matrix, since any single user is likely to have rated only a small percentage of possible items.

One strength of matrix factorization is that it allows incorporation of additional information. When explicit feedback is not available, recommender systems can infer user preferences using implicit feedback, which indirectly reflects opinion by observing user behavior including purchase history, browsing history, search patterns, or even mouse movements. Implicit feedback usually denotes the presence or absence of an event, so it is typically represented by a densely filled matrix.

A BASIC MATRIX FACTORIZATION MODEL

Matrix factorization models map both users and items to a joint latent factor space of dimensionality f, such that user-item interactions are modeled as inner products in that space. Accordingly, each item i is associated with a vector q_i ∈ R^f, and each user u is associated with a vector p_u ∈ R^f. For a given item i, the elements of q_i measure the extent to which the item possesses those factors, positive or negative. For a given user u, the elements of p_u measure the extent of interest the user has in items that are high on the corresponding factors, again, positive or negative. The resulting dot product, q_i^T p_u, captures the interaction between user u and item i: the user's overall interest in the item's characteristics. This approximates user u's rating of item i, which is denoted by r_ui, leading to the estimate

    r̂_ui = q_i^T p_u.    (1)

The major challenge is computing the mapping of each item and user to factor vectors q_i, p_u ∈ R^f. After the recommender system completes this mapping, it can easily estimate the rating a user will give to any item by using Equation 1.

Such a model is closely related to singular value decomposition (SVD), a well-established technique for identifying latent semantic factors in information retrieval. Applying SVD in the collaborative filtering domain requires factoring the user-item rating matrix. This often raises difficulties due to the high portion of missing values caused by sparseness in the user-item ratings matrix. Conventional SVD is undefined when knowledge about the matrix is incomplete. Moreover, carelessly addressing only the relatively few known entries is highly prone to overfitting.

Earlier systems relied on imputation to fill in missing ratings and make the rating matrix dense [2]. However, imputation can be very expensive as it significantly increases the amount of data. In addition, inaccurate imputation might distort the data considerably. Hence, more recent works [3-6] suggested modeling directly the observed ratings only, while avoiding overfitting through a regularized model. To learn the factor vectors (p_u and q_i), the system minimizes the regularized squared error on the set of known ratings:

    min_{q*, p*} Σ_{(u,i) ∈ K} (r_ui − q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2)    (2)

Here, K is the set of the (u,i) pairs for which r_ui is known (the training set).

The system learns the model by fitting the previously observed ratings. However, the goal is to generalize those previous ratings in a way that predicts future, unknown ratings. Thus, the system should avoid overfitting the observed data by regularizing the learned parameters, whose magnitudes are penalized. The constant λ controls the extent of regularization."

Figure 2. A simplified illustration of the latent factor approach, which characterizes both users and movies using two axes: male versus female and serious versus escapist. [The figure places example movies (The Princess Diaries, Braveheart, Lethal Weapon, Independence Day, Ocean's 11, Sense and Sensibility, Amadeus, The Lion King, Dumb and Dumber, The Color Purple) and example users (Gus, Dave) in this two-dimensional space.]
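To make Equation (2) concrete, here is a minimal sketch of learning the factor vectors by stochastic gradient descent, one common way to minimize this objective. The function name, toy data, and hyperparameters (f, lam, lr, epochs) are illustrative choices, not from the article.

```python
import numpy as np

def factorize(ratings, f=2, lam=0.05, lr=0.05, epochs=500, seed=0):
    """Fit user factors p_u and item factors q_i by SGD on the
    regularized squared error of Equation (2).

    ratings: list of (u, i, r_ui) triples -- the known set K.
    """
    rng = np.random.default_rng(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    P = 0.3 * rng.standard_normal((n_users, f))  # rows are the p_u
    Q = 0.3 * rng.standard_normal((n_items, f))  # rows are the q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            p_old = P[u].copy()
            err = r - Q[i] @ p_old                    # r_ui - q_i^T p_u
            P[u] += lr * (err * Q[i] - lam * p_old)   # gradient step on p_u
            Q[i] += lr * (err * p_old - lam * Q[i])   # gradient step on q_i
    return P, Q

# Toy training set K: (user, item, rating) triples for 2 users, 2 items.
K = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
P, Q = factorize(K)
r_hat = [Q[i] @ P[u] for u, i, _ in K]  # estimates via Equation (1)
```

After training, each estimate r̂_ui = q_i^T p_u should rank a user's liked items above disliked ones, even with the shrinkage induced by λ.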

Page 6: Machine Learning Overview

Applications III

• Machine translation: Predict French translation of English sentence (Google Translate)

Page 7: Machine Learning Overview

Applications IV

• Chess: Predict win probability of a move in given game state (AlphaZero)

Page 8: Machine Learning Overview

Work in ML

• Applied ML
  - Collect/prepare data, build/train models, analyze errors
• ML developer
  - Implement ML algorithms and infrastructure
• ML research
  - Design/analyze models and algorithms

Page 9: Machine Learning Overview

Mathematical and computational prerequisites

• Math
  - Linear algebra, probability, multivariable calculus, reading and writing proofs
• Software/programming
  - Much ML work is implemented in Python with libraries such as numpy and pytorch


Q: Who has learned about the Singular Value Decomposition?
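For anyone answering "no" to that question: as a quick refresher, numpy computes the SVD directly. The matrix below is an arbitrary example chosen for illustration.

```python
import numpy as np

# Every real matrix A factors as A = U @ diag(s) @ Vt, where U and Vt
# are orthogonal and s holds the singular values in nonincreasing order.
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
U, s, Vt = np.linalg.svd(A)

# Multiplying the factors back together recovers A up to rounding error.
A_reconstructed = U @ np.diag(s) @ Vt
```

One quick sanity check: the product of the singular values equals |det A| (here 15).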

Page 10: Machine Learning Overview

Basic setting: supervised learning

• Training data: dataset comprised of labeled examples
  - Labeled example: a pair (input, label)
• Goal: learn a function to predict the label from the input for new examples

Figure 1: Schematic for supervised learning (past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label)
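The schematic can be made concrete with a tiny learner. The choice of 1-nearest-neighbor here, along with the toy "spam"/"ham" data, is purely illustrative; any learning algorithm fits the same interface.

```python
import numpy as np

def learn(examples):
    """Learning algorithm: consume past labeled examples, return a predictor."""
    X = np.array([x for x, _ in examples], dtype=float)
    y = np.array([label for _, label in examples])

    def predictor(x_new):
        # Predict the label of the closest training input (1-nearest neighbor).
        dists = np.linalg.norm(X - np.asarray(x_new, dtype=float), axis=1)
        return y[np.argmin(dists)]

    return predictor

# Past labeled examples: (input, label) pairs.
past = [([0.0, 0.0], "ham"), ([1.0, 1.0], "spam")]
predict = learn(past)        # learned predictor
label = predict([0.9, 1.2])  # predicted label for a new (unlabeled) example
```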

Page 11: Machine Learning Overview

Examples of functions I

• Decision tree

Page 12: Machine Learning Overview

Examples of functions II

• Linear classifier

Page 13: Machine Learning Overview

Examples of functions III

• Neural network

[Figure: network with an input layer, hidden units, and an output]
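The input → hidden units → output pipeline in the figure can be sketched as a one-hidden-layer forward pass. The layer sizes and the ReLU nonlinearity below are illustrative choices, not anything specified on the slide.

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity applied at the hidden units.
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)  # input -> hidden units
    return W2 @ h + b2     # hidden units -> output

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)  # 2 inputs, 3 hidden units
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)  # 3 hidden units, 1 output
out = forward(np.array([1.0, -2.0]), W1, b1, W2, b2)
```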

Page 14: Machine Learning Overview

Types of prediction problems

• Binary classification
  - Given an email, is it spam or not?
  - (What's the probability that it is spam?)
• Multi-class classification
  - Given an image, what animal is depicted?
  - (Or which animals are depicted?)
• Regression
  - Given clinical measurements, what is the level of tumor antigens?
  - (In absolute level? Log-scale?)
• Structured output prediction
  - Given a sentence, what is its grammatical parse tree?
  - (Or dependency tree?)
• ...

Page 15: Machine Learning Overview

Beyond supervised learning

• Unsupervised learning / probabilistic modeling

• Online learning

• Reinforcement learning

Page 16: Machine Learning Overview

Challenges in supervised learning

• Might not have the right data
• Might pick a bad model
• Might not fit training data well (under-fitting)
• Might fit the training data too well (over-fitting)
• Training data could be noisy / corrupted (robustness)
• ...

Page 17: Machine Learning Overview

Example: over-fitting

• Which polynomial degree to use?
• Truth: y = 0 · x + noise
• Red points: training data
• Green points: unseen data

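The over-fitting example can be reproduced numerically: fit polynomials of increasing degree to noisy training data drawn from y = 0 · x + noise, and compare errors on the training points versus fresh unseen points. The sample sizes, noise level, and degrees below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Truth: y = 0 * x + noise
    x = rng.uniform(-1.0, 1.0, n)
    return x, 0.0 * x + 0.1 * rng.standard_normal(n)

x_train, y_train = sample(10)   # the "red points" (training data)
x_test, y_test = sample(100)    # the "green points" (unseen data)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (0, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)

# Degree 9 interpolates the 10 noisy training points (tiny training error)
# but generally does far worse on unseen data than the low-degree fit.
```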

Page 25: Machine Learning Overview

Example: the right data

• Given a college applicant, will they graduate if admitted?
• What is appropriate training data?
  - input = past applicant; label = admitted or not
  - input = past admit; label = graduated or not
  - input = past applicant; label = graduated or not

Page 26: Machine Learning Overview

Overview of the rest of the course

• Non-parametric methods
  - Simple and flexible methods for prediction
• Prediction theory
  - Statistical model for studying prediction problems
• Regression
  - Models and methods for predicting real-valued outcomes
  - Inductive bias, features, kernels
• Classification
  - Models and methods for predicting discrete-valued outcomes
  - Surrogate losses, margins, cost-sensitive risk, fairness, ensemble methods
• Optimization
  - Convex optimization and neural network training
• Unsupervised learning
  - Methods for clustering and matrix approximation

Page 27: Machine Learning Overview

What to expect in the class

1. How to use sklearn, pytorch, tensorflow, …
   [We expect you will be able to learn these software tools on your own!]

2. Algorithmic and statistical principles
   - Well-weathered models + methods (e.g., logistic reg., gradient descent)
   - Not the latest nonsense that shows up on arXiv

3. Programming and proofs
   - No need to be a professional hacker or mathematician
   - But, need to exercise your critical thinking!

Page 28: Machine Learning Overview

More about the course

1. Website: http://www.cs.columbia.edu/~djhsu/coms4771-f19/

2. Course assistants (CAs): Achille, Arjun, Dinko, Erica, Miles, …
   - Office hours start *next week*.

3. Piazza, Courseworks, Gradescope
   - Use Piazza to ask (and answer!) questions about the course
   - E-mail instructor for administrative issues (e.g., sign drop form)
   - Do not e-mail CAs. All regrade requests handled through Gradescope.

4. Syllabus: see website
   - Reading + homework assignments: 34% of grade
   - Exams on October 16 and December 18: 66% of grade

5. Disability services: https://health.columbia.edu/
   - If needed, please make arrangements with disability services by 9/18.

Page 29: Machine Learning Overview

Academic rules of conduct

1. Cheating: Don't do it.
   - See syllabus for policies & consequences
   - If you are unsure if something constitutes cheating, ask the instructor!

2. Cheating out of desperation is still cheating.
   - Instead: Get help early! Go to office hours, find a tutor, etc.
   - We are here to help.

Page 30: Machine Learning Overview

Homework 0

• Due Monday 9/9 at 2:30 PM

• Instructions (also on website):
  - Create an account on Gradescope (if you don't have one already)
  - Add this course using the Entry Code M3D5EX
  - Use your UNI as your "Student ID"
  - Solve simple problems about mathematical prerequisites

• REQUIRED. Cannot pass the class without completing it.

• Not used to determine enrollment / waiting list status.

• HWx for x>0 will generally not be “multiple choice”.

Page 31: Machine Learning Overview

Questions?