

Esteban García-Cuesta – Computer Science Department

Scalable Machine Learning Algorithms and Applications

PhD. Esteban García-Cuesta

Associate Professor & Head of Data Science Laboratory

Universidad Europea de Madrid


Professor and Researcher at Universidad Europea de Madrid

Head of Data Science Lab Research Group

• Machine Learning and data mining
• Affective computing
• Dimensionality reduction and latent spaces
• Social mining

Contact information

Email: [email protected]: egarciacuesta
Tel: +34 912115163

PhD in Computer Science (Artificial Intelligence) from Universidad Carlos III de Madrid


CANOPY ALS SLMVP

END


CANOPY Clustering


High Dimensional Data

• Given a cloud of data points we want to understand its structure


Clustering Images

• Image segmentation

• Goal: break up the images into meaningful or perceptually similar regions

Luis Pedro Coelho, Aabid Shariff, and Robert F. Murphy. "Nuclear segmentation in microscope cell images: A hand-segmented dataset and comparison of algorithms". DOI: 10.1109/ISBI.2009.5193098


Clustering Problem: Galaxies (SkyCat)

• A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)

• Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.

• Sloan Digital Sky Survey


Clustering is a hard problem!


Why is it hard?


• Clustering in two dimensions looks easy

• Clustering small amounts of data looks easy

• And in most cases, looks are not deceiving

• Many applications involve not 2, but 10 or 10,000 dimensions

• High-dimensional spaces look different: Almost all pairs of points are at about the same distance
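The last point can be checked numerically. The sketch below is only an illustration, not part of the original slides: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name `relative_spread` is hypothetical. It measures how far apart the largest and smallest pairwise distances are relative to the mean distance; the value shrinks as the number of dimensions grows.

```python
import math
import random


def relative_spread(dim, n_points=100, seed=0):
    """Return (max - min) / mean over all pairwise Euclidean distances
    between n_points drawn uniformly from the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    # Every unordered pair (i < j) exactly once.
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))


if __name__ == "__main__":
    for dim in (2, 10, 1000):
        # A small value means all pairs are at about the same distance.
        print(dim, round(relative_spread(dim), 3))
```

In two dimensions some pairs are almost touching while others sit at opposite corners, so the spread is large; in 1,000 dimensions the distances concentrate around their mean.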


Canopy clustering

• A preliminary step that reduces the number of operations to be performed by k-means

• Suitable for large data sets (large number of samples)

• Results are similar to those provided by k-means itself
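One way canopies can reduce the work of k-means is to compare each point only with the centroids that share a canopy with it. The sketch below illustrates a single assignment step under that assumption. The function name `assign_with_canopies`, the fallback to all centroids, and the use of the same Euclidean distance for canopy membership are choices made here for illustration (McCallum et al. use a cheap approximate distance to build the canopies).

```python
import math


def assign_with_canopies(points, centroids, canopy_centers, t1):
    """One k-means assignment step restricted by canopies.

    A point is compared only with centroids that share at least one canopy
    with it. Returns (labels, n_dist), where n_dist counts the
    point-to-centroid distances evaluated during the assignment."""
    def canopies_of(x):
        # Indices of the canopies whose center lies within t1 of x.
        return {i for i, c in enumerate(canopy_centers) if math.dist(x, c) < t1}

    centroid_canopies = [canopies_of(c) for c in centroids]
    labels, n_dist = [], 0
    for p in points:
        shared = canopies_of(p)
        candidates = [j for j, cc in enumerate(centroid_canopies) if cc & shared]
        if not candidates:
            # Assumed fallback: no shared canopy, compare with every centroid.
            candidates = list(range(len(centroids)))
        n_dist += len(candidates)
        labels.append(min(candidates, key=lambda j: math.dist(p, centroids[j])))
    return labels, n_dist
```

With two well-separated canopies and one centroid in each, every point is compared with a single centroid instead of both, which halves the point-to-centroid distance computations of the plain assignment step.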


Assigning points to canopies


Canopy as initial step for k-means


Summary (Canopy Algorithm)

• Start with a list of data points and two distances T1 > T2

1. Select any point (at random) from the list to form a canopy center

2. Calculate its distance to all the other points in the list

3. Put all the points which fall within the distance threshold T1 into a canopy

4. Remove from the main dataset list all the points which fall within the threshold T2. These points are excluded from being the center of, or forming, new canopies.

5. Repeat steps 1 to 4 until the original list is empty

Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). ACM, New York, NY, USA, 169-178. DOI=http://dx.doi.org/10.1145/347090.347123
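The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, strict "less than" comparisons against T1 and T2, and, following the wording of the slide, measures distances only to the points still in the list. The function name `canopy_clustering` and its parameters are hypothetical.

```python
import math
import random


def canopy_clustering(points, t1, t2, distance=math.dist, seed=0):
    """Return a list of (center, members) canopies. Requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:                       # Step 5: repeat until the list is empty
        # Step 1: pick a random remaining point as the canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        survivors = []
        for p in remaining:
            d = distance(center, p)        # Step 2: distance to each listed point
            if d < t1:
                members.append(p)          # Step 3: within T1, joins the canopy
            if d >= t2:
                survivors.append(p)        # Step 4: only points within T2 leave the list
        remaining = survivors
        canopies.append((center, members))
    return canopies
```

Because points between T2 and T1 stay in the list, a point can belong to more than one canopy; only points within T2 of a center are guaranteed not to start a canopy of their own.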


Canopy clustering (Parallelization Summary)

The processing is done in the following M/R (MapReduce) steps:

1. The data is massaged into a suitable input format

2. Each mapper performs canopy clustering on the points in its input set and outputs its canopies' centers

3. The reducer clusters the canopy centers to produce the final canopy centers

4. The points are then clustered into these final canopies
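The map, reduce and final assignment phases can be simulated on a single machine to see how the pieces fit together. The sketch below runs no MapReduce framework; plain Python lists stand in for the input splits. All function names are hypothetical, and for reproducibility each canopy center is taken as the first point left in the list rather than a random one. It also assumes the reducer reuses the same T2 threshold as the mappers.

```python
import math


def canopy_centers(points, t2):
    """Greedy canopy centers: take the first point left in the list as a
    center, then drop every point closer than t2 to it."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining[1:] if math.dist(center, p) >= t2]
    return centers


def map_phase(partitions, t2):
    """Each 'mapper' clusters its own partition and emits its canopy centers."""
    return [canopy_centers(part, t2) for part in partitions]


def reduce_phase(mapper_outputs, t2):
    """The single 'reducer' clusters all emitted centers into the final centers."""
    all_centers = [c for out in mapper_outputs for c in out]
    return canopy_centers(all_centers, t2)


def assign_phase(points, centers, t1):
    """Put each point into every final canopy whose center is within t1."""
    canopies = {c: [] for c in centers}
    for p in points:
        for c in centers:
            if math.dist(p, c) < t1:
                canopies[c].append(p)
    return canopies


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.2, 0.0), (5.0, 5.0)],
                  [(0.1, 0.1), (5.1, 5.0), (5.0, 5.2)]]
    final_centers = reduce_phase(map_phase(partitions, t2=1.0), t2=1.0)
    all_points = [p for part in partitions for p in part]
    print(assign_phase(all_points, final_centers, t1=2.0))
```

Only the canopy centers travel from the mappers to the reducer, so the data exchanged between the two phases is small compared with the full data set.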


Canopy (How to Choose K?)

[Figure: cost function versus number of clusters. Marked on the curve: the optimal number of clusters, the rule of thumb k = (n/2)^0.5, and a better approximation.]
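The rule of thumb on this slide is simple arithmetic: for n samples, k = (n/2)^0.5. The helper below is a small sketch of that formula; the function name `rule_of_thumb_k`, the rounding to the nearest integer, and the lower bound of one cluster are assumptions added here.

```python
import math


def rule_of_thumb_k(n_samples):
    """Rule-of-thumb number of clusters, k = sqrt(n / 2).

    Rounded to the nearest integer and never below 1 (both are assumptions)."""
    if n_samples < 1:
        raise ValueError("n_samples must be positive")
    return max(1, round(math.sqrt(n_samples / 2)))


if __name__ == "__main__":
    for n in (200, 5000, 1_000_000):
        print(n, rule_of_thumb_k(n))
```

For example, 200 samples give k = 10 and 5,000 samples give k = 50. The rule ignores the structure of the data, so it is only a starting point.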


Canopy (How to Choose K?)

• Check how good the clusters are for the application at hand, e.g. Portugal market segmentation

[Figure: two scatter plots of weight versus height.]



Copyright

Note to the users of the provided slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message.