Lecture 15 – Part 01 Supervised and Unsupervised Learning


Page 1:

Lecture 15 – Part 01: Supervised and Unsupervised Learning

[Figure: scatter plot of Waiting Time vs. Eruption Time]

Page 2:

Supervised Learning

▶ We tell the machine the “right answer”.
▶ There is a ground truth.

▶ Data set: $\{(\vec{x}^{(i)}, y_i)\}$.

▶ Goal: learn the relationship between features $\vec{x}^{(i)}$ and labels $y_i$.

▶ Examples: classification, regression.

Page 3:

Unsupervised Learning

▶ We don’t tell the machine the “right answer”.
▶ In fact, there might not be one!

▶ Data set: $\{\vec{x}^{(i)}\}$ (usually no test set).

▶ Goal: learn the structure of the data itself.
▶ To discover something, for compression, or to use as a feature later.

▶ Example: clustering

Page 4:

Example

▶ We gather measurements $\vec{x}^{(i)}$ of a bunch of flowers.

▶ Question: how many species are there?

▶ Goal: cluster the similar flowers into groups.

Page 5:

Example

[Figure: scatter plot of PetalWidthCm vs. PetalLengthCm]

Page 6:

Example

[Figure: scatter plot of Weight vs. Height]

Page 7:

Clustering and Dimensionality

▶ Groups emerge with more features.

▶ But too many features, and groups disappear.
▶ Curse of dimensionality.

▶ Also: we can’t visualize the data when $d > 3$.

Page 8:

Ground Truth

▶ If we don’t have labels, we can’t measure accuracy.

▶ Sometimes, labels don’t exist.

▶ Example: cluster customers into types by previous purchases.

Page 9:

Lecture 15 – Part 02: K-Means Clustering

[Figure: scatter plot of Waiting Time vs. Eruption Time]

Page 10:

Learning

▶ Goal: turn clustering into an optimization problem.

▶ Idea: clustering is like compression.

[Figure: scatter plot of PetalWidthCm vs. PetalLengthCm]

Page 11:

K-Means Objective

▶ Given: data $\{\vec{x}^{(i)}\} \subset \mathbb{R}^d$ and a parameter $k$.

▶ Find: $k$ cluster centers $\vec{\mu}^{(1)}, \ldots, \vec{\mu}^{(k)}$ so that the average squared distance from a data point to its nearest cluster center is small.

▶ The k-means objective function:

$$\text{Cost}(\vec{\mu}^{(1)}, \ldots, \vec{\mu}^{(k)}) = \frac{1}{n} \sum_{i=1}^{n} \min_{j \in \{1, \ldots, k\}} \left\|\vec{x}^{(i)} - \vec{\mu}^{(j)}\right\|^2$$
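As a quick illustration (not part of the slides), this objective is only a few lines of NumPy; here X is a hypothetical (n, d) data matrix and centers a hypothetical (k, d) array of cluster centers.

import numpy as np

def kmeans_cost(X, centers):
    # X: (n, d) data matrix; centers: (k, d) array of cluster centers.
    # dists[i, j] = squared Euclidean distance from point i to center j.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Average, over the points, of the squared distance to the nearest center.
    return dists.min(axis=1).mean()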

Page 12:

Optimization

▶ Goal: find $\vec{\mu}^{(1)}, \ldots, \vec{\mu}^{(k)}$ minimizing the $k$-means objective function.

▶ Problem: this is NP-hard.

▶ We use a heuristic instead of solving exactly.

Page 13:

Lloyd’s Algorithm for K-Means

▶ Initialize the centers $\vec{\mu}^{(1)}, \ldots, \vec{\mu}^{(k)}$ somehow.

▶ Repeat until convergence:
  ▶ Assign each point $\vec{x}^{(i)}$ to its closest center.
  ▶ Update each $\vec{\mu}^{(j)}$ to be the mean of the points assigned to it.
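The slides state the algorithm in words; a minimal NumPy sketch, assuming random initialization from the data and hypothetical names X, k, and n_iters, could look like this.

import numpy as np

def lloyds_algorithm(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (empty clusters keep their old center in this sketch).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

In practice one would run this several times from different starting centers and keep the result with the lowest cost, since the outcome depends on the initialization.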

Page 14:

Example

Page 15:

Theory

▶ Each iteration reduces cost.

▶ This guarantees convergence to a local min.

▶ Initialization is very important.

Page 16:

Example

Page 17:

Initialization Strategies

▶ Basic Approach: Pick 𝑘 data points at random.

▶ Better Approach: k-means++:
  ▶ Pick the first center at random from the data.
  ▶ Let $C = \{\vec{\mu}^{(1)}\}$ (the centers chosen so far).
  ▶ Repeat $k - 1$ more times:
    ▶ Pick a random data point $\vec{x}$ according to the distribution
      $$\mathbb{P}(\vec{x}) \propto \min_{\vec{\mu} \in C} \|\vec{x} - \vec{\mu}\|^2$$
    ▶ Add $\vec{x}$ to $C$.
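A sketch of this seeding step in NumPy, assuming X is an (n, d) data array (the function name and signature are mine, not from the slides):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: a uniformly random data point.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to that distance.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)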

Page 18:

Picking k

▶ How do we know how many clusters the data contains?

Page 19:

Plot of K-Means Objective
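Presumably this slide plots the objective as a function of $k$; a common heuristic is to look for an “elbow” where the cost stops dropping quickly. A sketch of how such a plot could be produced with scikit-learn (KMeans's inertia_ is the sum of squared distances, so dividing by the number of points gives the slides' average):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_objective_vs_k(X, k_max=10):
    ks = list(range(1, k_max + 1))
    # Average squared distance to the nearest center, for each value of k.
    costs = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ / len(X) for k in ks]
    plt.plot(ks, costs, marker="o")
    plt.xlabel("k")
    plt.ylabel("k-means objective")
    plt.show()

Note that the objective always decreases as $k$ grows (with $k = n$ it is zero), so the smallest cost alone can't be used to pick $k$.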

Page 20:

Applications of K-Means

▶ Discovery

▶ Vector Quantization
  ▶ Find a finite set of representatives of a large (possibly infinite) set.

Page 21:

Example #1

▶ Cluster animal descriptions.

▶ 50 animals: grizzly bear, dalmatian, rabbit, pig, …

▶ 85 attributes: long neck, tail, walks, swims, …

▶ 50 data points in $\mathbb{R}^{85}$. Run $k$-means with $k = 10$.
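The slides don't show code for this experiment, but it could be run roughly as follows with scikit-learn; since the actual 50 × 85 attribute matrix isn't reproduced here, random binary data stands in for it.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the real data: 50 "animals", each with 85 binary attributes.
X = rng.integers(0, 2, size=(50, 85)).astype(float)

km = KMeans(n_clusters=10, n_init=10).fit(X)
for j in range(10):
    # Indices of the animals assigned to cluster j.
    print(f"Cluster {j}:", np.where(km.labels_ == j)[0])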

Page 22:

Results

Page 23:

Example #2

▶ How do we represent images of different sizes as fixed-length feature vectors for use in classification tasks?

Page 24:

Visual Bags-of-Words

▶ Idea: build a “dictionary” of image patches.

▶ Extract all $\ell \times \ell$ image patches from all training images.

▶ Cluster them with $k$-means.
  ▶ Each cluster center is now a dictionary “word”.

▶ Represent an image as a histogram over $\{1, 2, \ldots, k\}$ by associating each patch with its nearest center.
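A sketch of the representation step, assuming grayscale images stored as 2-D NumPy arrays and a centers array (the dictionary) already learned by running k-means on training patches; all names here are illustrative.

import numpy as np

def bag_of_visual_words(image, centers, patch_size):
    # Extract every patch_size x patch_size patch and flatten it.
    h, w = image.shape
    patches = np.array([
        image[r:r + patch_size, c:c + patch_size].ravel()
        for r in range(h - patch_size + 1)
        for c in range(w - patch_size + 1)
    ])
    # Assign each patch to its nearest dictionary "word".
    dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    # Normalized histogram over the k words: a fixed-length feature vector.
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()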

Page 25:

Online Learning

▶ What if the dataset is huge?
  ▶ It doesn’t even fit in memory.

▶ What if we’re continuously getting new data?
  ▶ We don’t want to retrain with every new point.

▶ We can update the model online.

Page 26:

Sequential k-Means

▶ Set the centers $\vec{\mu}^{(1)}, \ldots, \vec{\mu}^{(k)}$ to be the first $k$ points.

▶ Set the counts to $n_1 = n_2 = \ldots = n_k = 1$.

▶ Repeat:
  ▶ Get the next data point, $\vec{x}$.
  ▶ Let $\vec{\mu}^{(j)}$ be the closest center.
  ▶ Update $\vec{\mu}^{(j)}$ and $n_j$:

$$\vec{\mu}^{(j)} = \frac{n_j \vec{\mu}^{(j)} + \vec{x}}{n_j + 1}, \qquad n_j = n_j + 1$$
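A small Python sketch of this update rule (the class and its interface are mine, not from the slides); each center is simply a running mean of the points assigned to it so far.

import numpy as np

class SequentialKMeans:
    def __init__(self, first_k_points):
        # The first k points seed the centers; each starts with count 1.
        self.centers = np.array(first_k_points, dtype=float)
        self.counts = np.ones(len(self.centers))

    def update(self, x):
        # Find the closest center and fold the new point into its running mean.
        j = int(np.argmin(((self.centers - x) ** 2).sum(axis=1)))
        self.centers[j] = (self.counts[j] * self.centers[j] + x) / (self.counts[j] + 1)
        self.counts[j] += 1
        return j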

Page 27:

K-Means

▶ Perhaps the most popular clustering algorithm.

▶ Fast, easy to understand.

▶ Assumes spherical clusters.

Page 28:

Example