18
CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania October 22, 2015 Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 1 / 18

CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

CIS192 Python ProgrammingMachine Learning in Python

Robert Rand

University of Pennsylvania

October 22, 2015

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 1 / 18

Page 2: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Outline

1 Machine Learning SoftwareNumpy and ScipyMatplotlibScikit Learn

2 Unsupervised LearningK-Means

3 Supervised Learning

4 General Tips

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 2 / 18

Page 3: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Installation

Recommended:pip install -U numpy scipy ipython scikit-learnYou should install pip first if you don’t have it.

Packages like Anaconda and Canopy also include the relevantlibraries.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 3 / 18

Page 4: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Numpy and Scipy

Libraries for sophisticated mathematics and mathematicalcomputing in Python.Include libraries for linear algebra.Optimized for efficient machine learning.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 4 / 18

Page 5: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Matplotlib

Library for plotting datasets.Use it to look at data before attempting machine learningtechniques.Interfaces well with the IPython (an alternative shell forPython development).Should come bundled with scikit-learn, otherwise addmatplotlib to the pip command on slide 1.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 5 / 18

Page 6: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Scikit Learn

A machine learning library for Python.Uses numpy and scipy.Comes with a broad array of built in machine learningalgorithms

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 6 / 18

Page 7: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Outline

1 Machine Learning SoftwareNumpy and ScipyMatplotlibScikit Learn

2 Unsupervised LearningK-Means

3 Supervised Learning

4 General Tips

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 7 / 18

Page 8: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Coats and Bars

Suppose you didn’t know the meaning of “boat” or “car” but youhad a stack of photographs of boats and cars.

Could you somehow sort them into two stacks, whichcorresponded to boats and cars?

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 8 / 18

Page 9: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

K-Means

cluster.KMeans()I Parameter: n_clusters - specifies the number of clusters

desired.

Randomly assign initial position for each cluster.Repeat until stable:

I Assign every point to its closest cluster ci .I Move ci to the center of points that are assigned to it.

Every point is labeled with its cluster.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 9 / 18

Page 10: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Other Models

Gaussian Mixture Models general K-Means - K-means assumesclusters look like circle, where GMM can handle arbitrary ellipticclusters.Affinity Propagation ((cluster.AffinityPropagation()) identifiesexemplars - points that can stand in for a given cluster. It may varythe number of clusters depending on the exemplars it finds.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 10 / 18

Page 11: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Outline

1 Machine Learning SoftwareNumpy and ScipyMatplotlibScikit Learn

2 Unsupervised LearningK-Means

3 Supervised Learning

4 General Tips

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 11 / 18

Page 12: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Is that a squirrel?

Often, we do have labels. But learning is still a problem. MachineLearning is about generalizing from the example’s you’ve seen(say, 100 pictures of squirrels) to things you haven’t yet seen.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 12 / 18

Squirrel Squirrel Squirrel?

Page 13: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

K Nearest Neighbors

neighbors.KNeighborsClassifierI Parameter: n_neighbors - specifies the number of neighbors k .

What it sounds like - looks for the k points most similar to agiven point in the data and returns the most common labelof those points.Where k = 1 matches the closest point (high variance).Where k = n always returns the most common label (highbias).Slow: Compares points to every point in the training data.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 13 / 18

Page 14: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Decision Trees

tree.DecisionTreeClassifierI Parameter: max_depth - specifies the maximum tree depth (we

can alternatively specify max_leaf_nodes) .

Each nodes splits the data according to a specific feature.Greater tree height -> more variance.Smaller tree height -> more bias.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 14 / 18

Page 15: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Support Vector Machine

eg. svm.SVC, svm.LinearSVCI Parameter: kernel - a function that specifies the form of the

separation

Carves up the space using support vectors - lines ofmaximum thickness that divide the space into two.Primarily for binary decision problems but can also be usedin stages for nary classification.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 15 / 18

Page 16: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Naive Bayes

eg. naive_bayes.GaussianNBI Parameter: max_depth - specifies the maximum tree depth (we

can alternatively specify max_leaf_nodes) .

A powerful and efficient algorithm that assumesindependence between features.User specifies the assumed underlying distribution -Gaussian, Bernoulli etc.Classifies points using Bayes’ Theorem.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 16 / 18

Page 17: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Outline

1 Machine Learning SoftwareNumpy and ScipyMatplotlibScikit Learn

2 Unsupervised LearningK-Means

3 Supervised Learning

4 General Tips

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 17 / 18

Page 18: CIS192 Python Programming - University of Pennsylvaniacis192/fall2015/files/lec/lec8.pdf · CIS192 Python Programming Machine Learning in Python Robert Rand University of Pennsylvania

Tips and Tricks

Visualize the data before running an algorithm on it - whatkind of approach is appropriate to the problem.Partition your data (randomly!) into 3 sets of data:

I Training ( 80%): The core training data.I Cross Validation ( 10%): Data left out, test the model with different

parameters on it.I Test Data ( 10%): Also withheld, used to determine the ultimate

accuracy of the model.

You’ll generally find that your model has either too muchbias or too much variance - try to find a sweet spot.Avoid overfitting - this is generally the point where accuracyon your training set is substantially better than on yourcross-validation set.Don’t become too attached to one algorithm - no amount oftweaking will make the wrong algorithm into the right one.

Robert Rand (University of Pennsylvania) CIS 192 October 22, 2015 18 / 18