Machine learning:
• Supervised vs Unsupervised.
– Supervised learning - an outcome variable is available to guide the learning process.
• there must be a training data set in which the solution is already known.
– Unsupervised learning - the outcomes are unknown.
• cluster the data to reveal meaningful partitions and hierarchies
Clustering:
• Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure
Figure: samples gathered into clusters/groups
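As a rough illustration of what a "dissimilarity measure" can look like, the sketch below computes two common distances between samples. The function names and the sample points are hypothetical, chosen only for this example; the slides do not prescribe a particular measure here.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 2-D samples, invented for illustration.
p, q, r = (0.0, 0.0), (3.0, 4.0), (0.5, 0.5)

print(euclidean(p, q))                    # 5.0
print(manhattan(p, q))                    # 7.0
print(euclidean(p, r) < euclidean(p, q))  # True: r is less dissimilar to p than q is
```

A smaller distance means two samples are more alike, so a clustering method would tend to place p and r in the same group before p and q.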
Clustering:
• What is clustering good for?
– Market segmentation - group customers into different market segments
– Social network analysis - Facebook "smartlists"
– Organizing computer clusters and data centers for network layout and location
– Astronomical data analysis - Understanding galaxy formation
Galaxy Clustering:
• Multi-wavelength data obtained for galaxy clusters
– Aim: determine robust criteria for the inclusion of a galaxy in a galaxy cluster
– Note: the physical parameters of the galaxy cluster can be heavily influenced by wrongly included candidates
Credit: HST
Clustering Algorithms:
• Hierarchical methods
– a statistical method that builds clusters by arranging elements at various levels
Dendrogram:
• Each level then represents a possible cluster.
• The height in the dendrogram shows the level of similarity at which any two clusters are joined
• The closer to the bottom they are joined, the more similar the clusters are
• Finding groups from a dendrogram is not simple and is very often subjective
• Partitioning methods
– make an initial division of the database and then use an iterative strategy to further divide it into sections
– here each object belongs to exactly one cluster
Credit: Legodi, 2014
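To make the hierarchical approach and the reading of a dendrogram more concrete, here is a minimal sketch of agglomerative clustering. It assumes single linkage (distance between the closest members of two clusters) and Euclidean distance, neither of which is specified in the slides; the function names `single_linkage` and `cut` and the sample points are hypothetical. Each recorded merge height corresponds to the height at which two branches join in a dendrogram.

```python
import math

def single_linkage(points):
    """Agglomerative clustering; returns merges as
    (height, members_of_cluster_a, members_of_cluster_b)."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per sample
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

def cut(points, merges, height):
    """Group labels obtained by applying only merges at or below `height`.
    Relies on single-linkage merge heights being non-decreasing."""
    label = list(range(len(points)))
    for d, left, right in merges:
        if d <= height:
            target, old = label[left[0]], label[right[0]]
            label = [target if l == old else l for l in label]
    return label

# Hypothetical samples: two tight pairs and one isolated point.
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = single_linkage(pts)
print([round(h, 3) for h, _, _ in merges])  # [1.0, 1.0, 6.403, 7.071]
print(cut(pts, merges, 2.0))                # [0, 0, 2, 2, 4]
print(cut(pts, merges, 7.0))                # [0, 0, 0, 0, 4]
```

The two cuts give different groupings from the same hierarchy, which is one way to see why extracting groups from a dendrogram involves a subjective choice of height.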
K-means algorithm:
1. Given n objects, initialize k cluster centers
2. Assign each object to its closest cluster center
3. Update the center for each cluster
4. Repeat steps 2 and 3 until no cluster center changes
• Experiment: Pack of cards, dominoes
• Apply the K-means algorithm to the Shapley data
– Change the number of potential clusters and see how the clustering differs
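The four K-means steps listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: Euclidean distance, initial centers taken naively as the first k points, and a small set of hypothetical points standing in for real data (it is not the Shapley data). The names `kmeans` and `wcss` are invented for this example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """K-means; `centers` is the initial list of k centers (step 1)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Step 2: assign each object to its closest cluster center.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 3: update each center to the mean of its assigned objects.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        # Step 4: stop once no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def wcss(points, centers, labels):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]

# Vary the number of clusters and compare the results.
for k in (1, 2, 3):
    centers, labels = kmeans(points, points[:k])  # naive init: first k points
    print(k, labels, round(wcss(points, centers, labels), 2))
```

With k = 2 the two groups are recovered as labels `[0, 0, 0, 1, 1, 1]`. Comparing the within-cluster sum of squares across values of k is one simple way to see how the clustering changes with the number of clusters; the result can also depend on the initial centers.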
K Nearest Neighbors (k-NN):
• One of the simplest of all machine learning classifiers
• Differs from other machine learning techniques in that it doesn't produce a model.
• It does, however, require a distance measure and a choice of K.
• First, the K training data points nearest to the new observation are identified.
• These K points determine the class of the new observation.
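A minimal sketch of the k-NN procedure described above is given below. It assumes Euclidean distance and a simple majority vote among the K neighbors; the slides only say that the K points "determine the class", so the voting rule is an assumption. The function name `knn_predict` and the training data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs."""
    # No model is fitted: the training data itself is searched at prediction time.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training points in two classes.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

print(knn_predict(train, (2.0, 2.0), k=3))  # A
print(knn_predict(train, (6.0, 6.5), k=3))  # B
```

Because nothing is learned in advance, every prediction scans the full training set, which is the practical cost of not producing a model.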
1-NN: Aspects of an Instance-Based Learner
1. A distance metric
– Euclidean
2. How many nearby neighbors to look at?
– One
3. A weighting function (optional)
– Unused
4. How to fit with the local points?
– Just predict the same output as the nearest neighbor.
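The four choices listed for 1-NN translate almost directly into code. The sketch below uses Euclidean distance, looks at exactly one neighbor, applies no weighting, and returns that neighbor's output unchanged. The name `one_nn_predict` and the one-dimensional training pairs are hypothetical.

```python
import math

def one_nn_predict(train, query):
    """Return the output of the single nearest training point (Euclidean distance)."""
    _, output = min(train, key=lambda item: math.dist(item[0], query))
    return output

# Hypothetical (input, output) pairs with a one-dimensional input.
train = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 2.0)]

print(one_nn_predict(train, (0.4,)))  # 1.0
print(one_nn_predict(train, (1.6,)))  # 2.0
```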