Predict oscars (4:17)

Predicting the Oscars with Data Science

http://bit.ly/tf-predict-oscars

About me

• Jasjit Singh

• Self-taught developer

• Worked in finance & tech

• Co-Founder Hotspot

• Thinkful General Manager

About us

Thinkful prepares students for web development & data science jobs with 1-on-1 mentorship programs

What’s your background?

• I have a software background

• I have a math or stats background

• None of the above

Data Science Process

• Frame the question.

• Collect the raw data.

• Process the data.

• Explore the data.

• Communicate results.

Frame the question

• Who will win the Oscar for Best Picture?

Collect the Data

• What kind of data do we need?

• Financial data (Budget, box office…)

• Reviews, ratings and scores.

• Awards and nominations.

Process the data

• How’s the data “dirty” and how can we fix it?

• User input, redundancies, missing data…

• Formatting: adapt the data to meet certain specifications.

• Cleaning: detecting and correcting corrupt or inaccurate records.

Explore the data

• What are the meaningful patterns in the data?

• How meaningful is each data point for our predictions?

Goals

• Introduction to a data scientist's tools and methods:

• Jupyter notebooks, numpy, pandas, sklearn…

• Overview of basic machine learning concepts:

• Data formatting and cleaning, Decision trees, Overfitting, Random Forests…

Jupyter Notebooks

• One of a data scientist’s everyday tools.

• Find the links in our classroom tool.

• Contains cells with code.

NumPy

• The fundamental package for scientific computing with Python.

• Provides powerful multi-dimensional array objects.

• Many methods for fast operations on arrays.
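A minimal sketch of what "fast operations on arrays" looks like in practice. The budget and box-office figures are made up for illustration.

```python
import numpy as np

# Hypothetical budgets and box-office takings (millions of dollars) for three films.
budget = np.array([30.0, 15.0, 100.0])
box_office = np.array([90.0, 60.0, 150.0])

# Element-wise arithmetic runs over whole arrays at once, with no explicit loop.
return_ratio = box_office / budget

# A two-dimensional array: one row per film, one column per measurement.
table = np.column_stack([budget, box_office])

print(return_ratio)
print(table.shape)
```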

Pandas

• Fundamental high-level building block for doing practical, real world data analysis in Python.

• Built on top of NumPy.

• Offers data structures and operations for manipulating numerical tables and time series.
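As a rough illustration, a DataFrame is a table with named columns that you can filter and aggregate. The films and numbers below are placeholders, not the workshop's data.

```python
import pandas as pd

# Made-up rows for illustration only.
films = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C"],
    "year": [2001, 2001, 2002],
    "nominations": [5, 11, 8],
})

# Filter rows with a boolean condition on one column.
many_nominations = films[films["nominations"] > 6]

# Aggregate per group: total nominations per year.
per_year = films.groupby("year")["nominations"].sum()

print(many_nominations)
print(per_year)
```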

Scikit-learn

• Python module for machine learning.

• Provides tools for classification, regression, clustering, model selection, and data preprocessing, built on top of NumPy and SciPy.

Initial imports and loading data with Pandas
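A sketch of this step, assuming the data comes as a CSV file. The column names and rows are hypothetical stand-ins for the workshop's dataset, and the text is read from memory so the example runs without any file.

```python
import io

import numpy as np   # imported up front; used in later steps
import pandas as pd

# Stand-in for the workshop's CSV file; columns and rows are hypothetical.
csv_text = """film,year,rate,nominations,winner
Film A,2001,PG,5,0
Film B,2001,R,11,1
Film C,2002,PG-13,8,0
"""

# pd.read_csv accepts a file path, a URL, or a file-like object such as StringIO.
movies = pd.read_csv(io.StringIO(csv_text))

print(movies.shape)
print(movies.dtypes)
```

With a real file you would pass its path or URL to pd.read_csv instead of the StringIO object.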

Understanding your data

• .head(n) method: Returns first n rows.

• .value_counts() method: Returns how many times each unique value appears in a column (a Series).
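Both methods in action on a small made-up table (the column names are assumptions):

```python
import pandas as pd

# Hypothetical data for illustration.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG", "R", "R", "PG-13"],
})

# First two rows of the table.
first_two = movies.head(2)

# How often each rate appears in the "rate" column.
rate_counts = movies["rate"].value_counts()

print(first_two)
print(rate_counts)
```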

Formatting your Data

• The rate values are stored as text, not numbers. We need to assign each rate a unique integer so that scikit-learn can work with the information.

• With the .ix indexer you select a subset of rows and assign a value to one column of that subset. Note that .ix has been removed from recent pandas versions; .loc does the same job.
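A sketch of the row-subset assignment, written with .loc because .ix is no longer available in recent pandas releases. The rate labels and the integer codes chosen for them are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG", "R", "R", "PG-13"],
})

# Arbitrary integer code for each rate (an assumption, not the workshop's mapping).
rate_codes = {"PG": 0, "PG-13": 1, "R": 2}

# Start with a placeholder, then fill in one subset of rows at a time.
movies["rate_code"] = -1
for rate, code in rate_codes.items():
    movies.loc[movies["rate"] == rate, "rate_code"] = code

print(movies)
```

For a plain lookup like this, movies["rate"].map(rate_codes) gives the same result in one line.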

Cleaning your Data
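One common cleaning step is filling in missing values. The slides do not say which strategy the workshop uses, so the median fill below is an assumption, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing budget.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [30.0, np.nan, 100.0, 20.0],
})

# Count the gaps before fixing them.
missing_before = int(movies["budget"].isna().sum())

# Replace missing budgets with the median of the known ones.
median_budget = movies["budget"].median()
movies["budget"] = movies["budget"].fillna(median_budget)

print(missing_before)
print(movies)
```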

Decision Trees

• It breaks down a dataset into smaller and smaller subsets.

• The final result is a model with a tree structure that has:

• Decision nodes: ask a question and have two or more branches.

• Leaf nodes: represent a classification or decision.

Classification vs Regression

• Classification: predict categories.

• Identifying group membership.

• Regression: predict values.

• Involves estimating or predicting a response.

Classification

Creating your first Decision Tree

You will use the scikit-learn and NumPy libraries to build your first decision tree. To do so, you need two things:

• target: A one-dimensional NumPy array containing the target from the training data.

• features: A multidimensional NumPy array containing the features/predictors from the training data.
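A minimal sketch of building target and features and fitting a tree. The training table, its column names, and the choice of features are hypothetical; only the target/features split comes from the slide.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: 1 in "winner" marks a Best Picture winner.
train = pd.DataFrame({
    "nominations": [2, 11, 5, 9, 3, 12, 4, 10],
    "rating": [6.5, 8.4, 7.0, 8.1, 6.8, 8.6, 7.2, 8.0],
    "winner": [0, 1, 0, 1, 0, 1, 0, 1],
})

# target: one-dimensional array of labels.
target = train["winner"].values

# features: two-dimensional array, one row per film and one column per predictor.
features = train[["nominations", "rating"]].values

# Fit the tree; random_state makes the result reproducible.
tree = DecisionTreeClassifier(random_state=1)
tree = tree.fit(features, target)

print(target.shape, features.shape)
```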


Importances and Score

• .feature_importances_ attribute: tells us how important the features are for the final result.

• .score() method: returns the mean accuracy of the model's predictions on the data and labels you pass in.
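Both can be read off a fitted tree, as sketched here on made-up data with assumed column names. Scoring on the training data itself, as done below, tends to look better than the model really is.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data for illustration.
train = pd.DataFrame({
    "nominations": [2, 11, 5, 9, 3, 12, 4, 10],
    "rating": [6.5, 8.4, 7.0, 8.1, 6.8, 8.6, 7.2, 8.0],
    "winner": [0, 1, 0, 1, 0, 1, 0, 1],
})
feature_names = ["nominations", "rating"]
features = train[feature_names].values
target = train["winner"].values

tree = DecisionTreeClassifier(random_state=1).fit(features, target)

# One importance value per feature; the values sum to 1.
importances = tree.feature_importances_
for name, value in zip(feature_names, importances):
    print(name, round(float(value), 3))

# Mean accuracy on the data passed in (here: the training data).
train_accuracy = tree.score(features, target)
print(train_accuracy)
```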


Predicting
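A sketch of the prediction step: a fitted tree is given films it has not seen before. All values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data for illustration.
train = pd.DataFrame({
    "nominations": [2, 11, 5, 9, 3, 12, 4, 10],
    "rating": [6.5, 8.4, 7.0, 8.1, 6.8, 8.6, 7.2, 8.0],
    "winner": [0, 1, 0, 1, 0, 1, 0, 1],
})
features = train[["nominations", "rating"]].values
target = train["winner"].values
tree = DecisionTreeClassifier(random_state=1).fit(features, target)

# Two unseen films, with columns in the same order as the training features.
new_films = np.array([
    [12, 8.5],   # many nominations, high rating
    [3, 6.9],    # few nominations, lower rating
])

# One predicted class per row: 1 = winner, 0 = not a winner.
predictions = tree.predict(new_films)
print(predictions)
```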

Pretty bad results :(

Let’s improve it!

Modify the feature list

Run the prediction again

Overfitting

• The resulting model is too closely tied to the training set.

• It doesn’t generalize to new data, and generalizing is the whole point of prediction.
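Overfitting shows up as a gap between accuracy on the training data and accuracy on held-out data. The sketch below uses synthetic, deliberately noisy data from scikit-learn rather than the Oscar dataset, and the depth and leaf-size limits are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y randomly reassigns 20% of the labels to add noise.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unrestricted tree keeps splitting until it has memorized the training set.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
deep_train = deep_tree.score(X_train, y_train)
deep_test = deep_tree.score(X_test, y_test)

# Limiting depth and leaf size is one common way to rein the tree in.
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_train, y_train)
small_train = small_tree.score(X_train, y_train)
small_test = small_tree.score(X_test, y_test)

print("deep tree  train/test:", deep_train, deep_test)
print("small tree train/test:", small_train, small_test)
```

The deep tree scores perfectly on the data it was trained on but noticeably worse on the held-out quarter. Restricting the tree usually narrows that gap.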

Random Forest Classifier

• Random Forest Classifiers use many Decision Trees to build a classifier.

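• We introduce a bit of randomness: each tree is trained on a random sample of the data and considers a random subset of features at each split.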

• Each Tree can give a different answer (a vote). The final classification is the most common amongst the Trees.
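A sketch of a forest and its individual votes, on made-up data. The number of trees and the depth limit are illustrative settings. Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities, which here leads to the same answer as counting votes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data for illustration.
train = pd.DataFrame({
    "nominations": [2, 11, 5, 9, 3, 12, 4, 10],
    "rating": [6.5, 8.4, 7.0, 8.1, 6.8, 8.6, 7.2, 8.0],
    "winner": [0, 1, 0, 1, 0, 1, 0, 1],
})
features = train[["nominations", "rating"]].values
target = train["winner"].values

# 100 trees, each trained on a random resample of the rows.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
forest.fit(features, target)

# Ask every tree for its own answer on one unseen film.
new_film = np.array([[12, 8.5]])
votes = np.array([t.predict(new_film)[0] for t in forest.estimators_])

print("trees in the forest:", len(forest.estimators_))
print("share of trees voting 'winner':", votes.mean())
print("forest prediction:", forest.predict(new_film)[0])
```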

Predicting with Random Forest Classifiers
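Only one nominee wins each year, so one possible approach (an assumption, not necessarily what the workshop notebook does) is to rank a year's nominees by the forest's predicted probability of winning and pick the top one. Data and nominee names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data for illustration.
train = pd.DataFrame({
    "nominations": [2, 11, 5, 9, 3, 12, 4, 10],
    "rating": [6.5, 8.4, 7.0, 8.1, 6.8, 8.6, 7.2, 8.0],
    "winner": [0, 1, 0, 1, 0, 1, 0, 1],
})
features = train[["nominations", "rating"]].values
target = train["winner"].values
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(features, target)

# Hypothetical nominees for a single year.
nominee_names = ["Nominee A", "Nominee B", "Nominee C"]
nominees = np.array([
    [12, 8.5],
    [3, 6.9],
    [6, 7.5],
])

# predict_proba returns one column per class; column 1 is the "winner" class.
win_probability = forest.predict_proba(nominees)[:, 1]

# Pick the nominee with the highest predicted probability.
best_index = int(np.argmax(win_probability))
print(dict(zip(nominee_names, win_probability.round(2))))
print("predicted winner:", nominee_names[best_index])
```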

Results

• 1976: Rocky

• 1984: Amadeus

• 1996: The English Patient

• 2009: The Hurt Locker

And the Oscar goes to…

La La Land!!

The End

Nothing happened after that.

Right?? RIGHT??

We can predict the Oscars

Except for 2017 ¯\_(ツ)_/¯

More about Thinkful

• Anyone who’s committed can learn to code

• 1-on-1 mentorship is the best way to learn

• Flexibility! Learn anywhere, anytime, & at your own pace

Our Program

You’ll learn concepts, practice with drills, and build capstone projects — all guided by a personal mentor

Our Mentors

Mentors have, on average, 10+ years of experience

Data Science Syllabus

• Managing data with SQL and Python

• Modeling with both supervised and unsupervised models

• Data visualization and communicating with data

• Technical interviews + Career prep

Our Results

Job Titles after Graduation

Months until Employed

Special Introductory Offer

• Prep course for 50% off — $250 instead of $500

• Covers math, stats, Python, and data science toolkit

• Option to continue into full program

• Talk to me (or email me) if you’re interested

October 2015

Questions? [email protected]

schedule a call through thinkful.com

http://thinkful.com
