Data-driven Clustering via Parameterized Lloyds Families

Travis Dick
Joint work with Maria-Florina Balcan and Colin White

Carnegie Mellon University
NeurIPS 2018

Data-driven Clustering

• Clustering aims to divide a dataset into self-similar clusters.
• Goal: find some unknown natural clustering.
• However, most clustering algorithms minimize a clustering cost function.
• Hope that low-cost clusterings recover the natural clusters.
• There are many algorithms and many objectives.

How do we choose the best algorithm for a specific application?

Can we automate this process?
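
As one concrete example of a clustering cost function, the k-means cost that Lloyd's method targets later in the talk can be written as below. The symbols (dataset S, points x, centers c_j) are notation chosen here for illustration, not taken from the slides, and other objectives such as k-median would fit the same bullet equally well.

```latex
\mathrm{cost}(c_1, \dots, c_k) \;=\; \sum_{x \in S} \; \min_{1 \le j \le k} \lVert x - c_j \rVert^2
```

A clustering with low cost under this objective groups points tightly around their centers, which is not necessarily the same as recovering the unknown natural clustering.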

Learning Model

• An unknown distribution $\mathcal{D}$ over clustering instances.
• Given a sample $S_1, \dots, S_N \sim \mathcal{D}$ annotated by their target clusterings.
• Find an algorithm $A$ that produces clusterings similar to the target clusterings.
• Want $A$ to also work well for new instances from $\mathcal{D}$!
• In this work:
  1. Introduce a large parametric family of clustering algorithms, $(\alpha, \beta)$-Lloyds.
  2. Efficient procedures for finding best parameters on a sample.
  3. Generalization: optimal parameters on sample are nearly optimal on $\mathcal{D}$.
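
The slides do not say how similarity to a target clustering is measured. One common choice, assumed here purely for illustration, is the Hamming distance: the fraction of points placed in the wrong cluster under the best matching of cluster labels. The function name and the brute-force search over relabelings are choices made for this sketch.

```python
from itertools import permutations


def hamming_distance(labels, target, k):
    """Fraction of points whose cluster differs from the target,
    minimized over all relabelings of the k predicted clusters."""
    n = len(labels)
    best = n
    # Cluster names are arbitrary, so try every mapping of predicted
    # labels onto target labels and keep the one with fewest mismatches.
    for perm in permutations(range(k)):
        mismatches = sum(perm[a] != b for a, b in zip(labels, target))
        best = min(best, mismatches)
    return best / n


predicted = [0, 0, 1, 1, 1]
target = [1, 1, 0, 0, 1]
distance = hamming_distance(predicted, target, k=2)
print(distance)  # 0.2: one of five points disagrees after swapping labels
```

The loop visits all k! relabelings, so this form is only practical for a small number of clusters.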

Lloyd's Method

• Maintains $k$ centers $c_1, \dots, c_k$ that define clusters.
• Performs local search to improve the $k$-means cost of the centers.

1. Assign each point to nearest center.
2. Update each center to be the mean of assigned points.
3. Repeat until convergence.
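
The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name, the iteration cap, and the rule that an empty cluster keeps its old center are assumptions of the sketch, not details from the slides.

```python
import math


def lloyd(points, centers, max_iters=100):
    """Run Lloyd's method from the given initial centers.

    points and centers are lists of equal-length tuples of floats.
    Returns (centers, labels), where labels[i] indexes the center of points[i].
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 3: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center (a choice made here)
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = lloyd(points, centers=[(0.0, 0.0), (1.0, 1.0)])
print(centers, labels)
```

Because each pass only moves centers toward the points currently assigned to them, the result depends on where the centers start.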

Initial Centers are Important!

• Lloyd’s method can get stuck if initial centers are chosen poorly.
• Initialization is a well-studied problem with many proposed procedures (e.g., $k$-means++).
• Best method will depend on properties of the clustering instances.

The $(\alpha, \beta)$-Lloyds Family

Initialization: Parameter $\alpha$
• Use $d^\alpha$-sampling (generalizing $d^2$-sampling of $k$-means++).
• Choose initial centers from dataset $S$ randomly.
• Probability that point $x \in S$ is center $c_i$ is proportional to $d(x, \{c_1, \dots, c_{i-1}\})^\alpha$.

$\alpha = 0$: random initialization
$\alpha = 2$: $k$-means++
$\alpha = \infty$: farthest first

Local search: Second parameter $\beta$ tweaks the local search. Details in paper.

Question: For a distribution $\mathcal{D}$ over tasks, what parameters give best performance?
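
The initialization rule can be sketched as follows. This is an illustrative standard-library version, not the authors' code: the function name is hypothetical, the first center is drawn uniformly (as in k-means++, assumed here), and an infinite alpha is handled as a deterministic farthest-first step.

```python
import math
import random


def d_alpha_seeding(points, k, alpha, rng=None):
    """Pick k initial centers from points by d^alpha-sampling.

    alpha = 0 gives uniform sampling, alpha = 2 gives k-means++ seeding,
    and alpha = math.inf is treated as farthest-first traversal.
    """
    rng = rng or random.Random()
    # The first center is drawn uniformly (an assumption of this sketch).
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Distance from each point to its nearest already-chosen center.
        dists = [min(math.dist(p, c) for c in centers) for p in points]
        if math.isinf(alpha):
            # Limit alpha -> infinity: always take the farthest point.
            nxt = points[max(range(len(points)), key=dists.__getitem__)]
        else:
            weights = [d ** alpha for d in dists]
            if sum(weights) == 0:  # every point coincides with a chosen center
                weights = [1.0] * len(points)
            nxt = rng.choices(points, weights=weights, k=1)[0]
        centers.append(nxt)
    return centers


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
rng = random.Random(0)
kmeanspp_centers = d_alpha_seeding(points, k=2, alpha=2, rng=rng)
farthest_centers = d_alpha_seeding(points, k=2, alpha=math.inf, rng=rng)
print(kmeanspp_centers, farthest_centers)
```

With alpha = 0 every point has weight 1, including points already chosen, so this sketch samples uniformly with replacement. Larger alpha pushes probability mass toward points far from the current centers.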

Results

Efficient Tuning on Sample:
• Efficient algorithm for finding parameters on sample with best agreement to targets.
• “Algorithmically feasible to tune parameters on sample.”

Generalization Guarantee:
• Analyze the intrinsic complexity of $(\alpha, \beta)$-Lloyds.
• Show that need only roughly $\tilde{O}\left(\frac{k \log n}{\epsilon^2}\right)$ clustering instances to ensure empirical cost for all parameters within $\epsilon$ of expected cost.
• “Parameters tuned on the sample will work well for new instances!”

Experiments: Evaluate $(\alpha, \beta)$-Lloyds family on real and synthetic data (CIFAR-10, MNIST, Mixture of Gaussians, CNAE-9).
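
To make the tuning objective concrete, here is a naive grid search over the initialization parameter alpha on a labeled sample. This is explicitly not the efficient procedure from the paper, which the slides do not describe. It ignores the second parameter, assumes Hamming distance as the loss, and all function names and the toy data are hypothetical.

```python
import math
import random
from itertools import permutations


def seed(points, k, alpha, rng):
    """d^alpha-sampling of k initial centers (finite alpha only)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        w = [min(math.dist(p, c) for c in centers) ** alpha for p in points]
        if sum(w) == 0:  # all points coincide with chosen centers
            w = [1.0] * len(points)
        centers.append(rng.choices(points, weights=w, k=1)[0])
    return centers


def lloyd_labels(points, centers, iters=20):
    """Run a fixed number of Lloyd iterations; return the last assignment."""
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def loss(labels, target, k):
    """Hamming distance to the target under the best relabeling."""
    return min(sum(perm[a] != b for a, b in zip(labels, target))
               for perm in permutations(range(k))) / len(labels)


def tune_alpha(sample, k, alphas, trials=20, rng=None):
    """Return (best_alpha, avg_losses) over a grid of candidate alphas."""
    rng = rng or random.Random(0)
    avg = {}
    for alpha in alphas:
        total = 0.0
        for points, target in sample:
            # Seeding is random, so average the loss over several runs.
            for _ in range(trials):
                labels = lloyd_labels(points, seed(points, k, alpha, rng))
                total += loss(labels, target, k)
        avg[alpha] = total / (len(sample) * trials)
    return min(avg, key=avg.get), avg


# Toy sample: each instance is (points, target clustering).
sample = [
    ([(0.0,), (0.2,), (5.0,), (5.2,)], [0, 0, 1, 1]),
    ([(1.0,), (1.1,), (9.0,), (9.3,), (9.1,)], [0, 0, 1, 1, 1]),
]
best_alpha, avg_losses = tune_alpha(sample, k=2, alphas=[0, 1, 2, 4])
print(best_alpha, avg_losses)
```

A grid only checks the listed values of alpha, whereas the slides claim an efficient algorithm that finds the best parameters on the sample. The sketch is meant only to show what "best agreement to targets" means operationally.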