thinkful
Predicting the Oscars with Data Science
http://bit.ly/tf-predict-oscars
About me
• Jasjit Singh
• Self-taught developer
• Worked in finance & tech
• Co-Founder Hotspot
• Thinkful General Manager
About us
Thinkful prepares students for web development & data science jobs with 1-on-1 mentorship programs
What’s your background?
• I have a software background
• I have a math or stats background
• None of the above
Data Science Process
• Frame the question.
• Collect the raw data.
• Process the data.
• Explore the data.
• Communicate results.
Frame the question
• Who will win the Oscar for Best Picture?
Collect the Data
• What kind of data do we need?
• Financial data (Budget, box office…)
• Reviews, ratings and scores.
• Awards and nominations.
Process the data
• How’s the data “dirty” and how can we fix it?
• User input, redundancies, missing data…
• Formatting: adapt the data to meet certain specifications.
• Cleaning: detecting and correcting corrupt or inaccurate records.
Explore the data
• What are the meaningful patterns in the data?
• How meaningful is each data point for our predictions?
Goals
• Introduction to a data scientist's tools and methods:
• Jupyter notebooks, numpy, pandas, sklearn…
• Overview of basic machine learning concepts:
• Data formatting and cleaning, Decision trees, Overfitting, Random Forests…
Jupyter Notebooks
• One of a data scientist’s everyday tools.
• Find the links in our classroom tool.
• Contains cells with code.
NumPy
• The fundamental package for scientific computing with Python.
• Provides powerful multi-dimensional array objects.
• Many methods for fast operations on arrays.
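A minimal sketch of what "fast operations on arrays" looks like in practice. The budget and gross figures are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical budgets and box-office grosses (in millions) for three films.
budgets = np.array([10.0, 25.0, 40.0])
grosses = np.array([30.0, 50.0, 60.0])

# Element-wise arithmetic runs over whole arrays without an explicit loop.
ratios = grosses / budgets

# A two-dimensional array: one row per film, one column per measurement.
table = np.column_stack([budgets, grosses])

print(ratios)
print(table.shape)
```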
Pandas
• A fundamental high-level building block for practical, real-world data analysis in Python.
• Built on top of NumPy.
• Offers data structures and operations for manipulating numerical tables and time series.
Scikit-learn
• Python module for machine learning.
• Provides tools for classification, regression, clustering, preprocessing and model evaluation, built on top of NumPy and SciPy.
Initial imports and loading data with Pandas
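The code for this slide did not survive, so here is a rough sketch of the loading step. The inline CSV text, its column names and its values are all made up to keep the example self-contained; in the workshop the data would come from a file instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the workshop's data file; every value is invented.
csv_text = """film,year,rate,nominations,winner
Film A,2001,PG,5,0
Film B,2001,R,11,1
Film C,2002,PG-13,8,0
Film D,2002,R,,1
"""

# read_csv accepts a path or any file-like object; StringIO plays the file here.
movies = pd.read_csv(io.StringIO(csv_text))

print(movies.shape)
print(movies.dtypes)
```

Note that the empty nominations field in the last row is read as a missing value (NaN).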
Understanding your data
• .head(n) method: Returns first n rows.
• .value_counts() method: Returns the count of each unique value in a column (a pandas Series).
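A small sketch of both methods on a hypothetical table (the film names and rate values are invented):

```python
import pandas as pd

# Hypothetical data; only the method calls matter here.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG", "R", "PG-13", "R"],
})

# First two rows of the table.
first_two = movies.head(2)

# How often each rate appears in the "rate" column.
rate_counts = movies["rate"].value_counts()

print(first_two)
print(rate_counts)
```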
Formatting your Data
• Rate values are stored in a non-numeric format, so we need to assign each rate a unique integer that the model can work with.
• With the .ix indexer you select a subset of rows and assign a value to a chosen column for that subset. Note that .ix was deprecated and then removed in pandas 1.0; use .loc in current versions.
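A sketch of the formatting step, assuming a hypothetical "rate" column with three categories. The slide names the .ix indexer, which current pandas no longer provides, so this example uses .loc for the same row-subset assignment.

```python
import pandas as pd

# Hypothetical data; the rate categories are assumptions for illustration.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG", "R", "PG-13", "R"],
})

# Start from a sentinel value, then overwrite it one category at a time.
movies["rate_code"] = -1
movies.loc[movies["rate"] == "PG", "rate_code"] = 0
movies.loc[movies["rate"] == "PG-13", "rate_code"] = 1
movies.loc[movies["rate"] == "R", "rate_code"] = 2

print(movies)
```

Any row still holding -1 afterwards would point to a category the mapping missed.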
Cleaning your Data
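The original code for this slide is not available. One common cleaning step, shown here as an assumption rather than as the workshop's actual approach, is to fill missing numeric values with the column median:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing nominations value.
movies = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "nominations": [5.0, 11.0, np.nan, 8.0],
})

# median() skips missing values by default: the median of 5, 11 and 8 is 8.
median_nominations = movies["nominations"].median()

# Replace the gap so that every row can be used for training.
movies["nominations"] = movies["nominations"].fillna(median_nominations)

print(movies)
```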
Decision Trees
• A decision tree breaks down a dataset into smaller and smaller subsets.
• The final result is a model with a tree structure that has:
• Decision nodes: ask a question and have two or more branches.
• Leaf nodes: represent a classification or decision.
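To make decision nodes and leaf nodes concrete, this sketch fits a tree on a tiny hypothetical dataset (one feature, four films) and prints its structure as text. The feature name "nominations" is an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature per film, label 1 means "winner".
features = np.array([[2], [3], [9], [11]])
target = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(features, target)

# Lines with a comparison are decision nodes; lines with "class:" are leaves.
print(export_text(tree, feature_names=["nominations"]))
print("leaves:", tree.get_n_leaves())
```

With data this cleanly separated, a single question is enough, so the tree has one decision node and two leaves.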
Classification vs Regression
• Classification: predict categories.
• Identifying group membership.
• Regression: predict values.
• Involves estimating or predicting a response.
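A sketch of the difference using scikit-learn's two tree estimators. The data is hypothetical: the same feature is used once to predict a category (winner or not) and once to predict a number (an invented gross figure).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy data.
features = np.array([[2], [3], [9], [11]])
won = np.array([0, 0, 1, 1])                 # categories
gross = np.array([20.0, 30.0, 90.0, 110.0])  # continuous values

classifier = DecisionTreeClassifier(random_state=0).fit(features, won)
regressor = DecisionTreeRegressor(random_state=0).fit(features, gross)

new_film = np.array([[12]])
predicted_class = classifier.predict(new_film)  # a label from {0, 1}
predicted_value = regressor.predict(new_film)   # a number

print(predicted_class, predicted_value)
```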
Classification
Creating your first Decision Tree
You will use the scikit-learn and numpy libraries to build your first decision tree. To build it, we need the following:
• target: A one-dimensional numpy array containing the target from the train data.
• features: A multidimensional numpy array containing the features/predictors from the train data.
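A minimal sketch of these two inputs and the fitting step. The training table, its column names and its values are hypothetical; only the shape of the workflow is the point.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data; all columns and values are invented.
train = pd.DataFrame({
    "nominations": [2, 3, 9, 11, 4, 10],
    "rating_score": [6.1, 6.8, 8.2, 8.5, 7.0, 8.0],
    "winner": [0, 0, 1, 1, 0, 1],
})

# target: one-dimensional array with the value to predict.
target = train["winner"].values

# features: two-dimensional array, one row per film, one column per predictor.
features = train[["nominations", "rating_score"]].values

my_tree = DecisionTreeClassifier(random_state=0)
my_tree = my_tree.fit(features, target)

print(target.shape, features.shape)
```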
Importances and Score
• .feature_importances_ attribute: tells us how important the features are for the final result.
• .score() method: returns the mean accuracy of the model on the data and labels passed to it.
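A sketch of both calls on a hypothetical toy dataset. Scoring on the training data, as done here, tends to look optimistic, since the tree has already seen those rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: columns are nominations and a rating score.
features = np.array([
    [2, 6.1], [3, 6.8], [9, 8.2], [11, 8.5], [4, 7.0], [10, 8.0],
])
target = np.array([0, 0, 1, 1, 0, 1])

my_tree = DecisionTreeClassifier(random_state=0).fit(features, target)

# One value per feature column; the values sum to 1.
importances = my_tree.feature_importances_

# Fraction of rows whose predicted label matches the true label.
train_accuracy = my_tree.score(features, target)

print(importances)
print(train_accuracy)
```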
Predicting
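A sketch of the prediction step, assuming a fitted tree and a small set of unseen films. All values are hypothetical, and the toy data is far cleaner than a real Oscar dataset would be.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: columns are nominations and a rating score.
features = np.array([
    [2, 6.1], [3, 6.8], [9, 8.2], [11, 8.5], [4, 7.0], [10, 8.0],
])
target = np.array([0, 0, 1, 1, 0, 1])
my_tree = DecisionTreeClassifier(random_state=0).fit(features, target)

# Unseen films must have the same columns, in the same order, as the training data.
test_features = np.array([[12, 8.4], [1, 5.9]])
predictions = my_tree.predict(test_features)

print(predictions)
```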
Pretty bad results :(
Let’s improve it!
Modify the feature list
Run the prediction again
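One way to organize this loop is a small helper that takes a feature list, refits the tree and predicts again. The helper name fit_and_predict, the column names and the data are all hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training and test tables; every value is invented.
train = pd.DataFrame({
    "nominations": [2, 3, 9, 11, 4, 10],
    "rating_score": [6.1, 6.8, 8.2, 8.5, 7.0, 8.0],
    "budget": [40, 15, 30, 25, 60, 20],
    "winner": [0, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({
    "nominations": [12, 1],
    "rating_score": [8.4, 5.9],
    "budget": [35, 50],
})


def fit_and_predict(feature_names):
    """Fit a tree on the chosen columns and predict the test rows."""
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(train[feature_names].values, train["winner"].values)
    return tree.predict(test[feature_names].values)


# Try one feature list, then modify it and run the prediction again.
first_try = fit_and_predict(["budget"])
second_try = fit_and_predict(["nominations", "rating_score"])

print(first_try, second_try)
```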
Overfitting
• The resulting model is too closely tied to the training set.
• It doesn’t generalize to new data, and generalizing is the whole point of prediction.
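A rough way to spot overfitting is to compare accuracy on the training rows with accuracy on rows held back from training. This sketch uses synthetic data from scikit-learn's make_classification with some label noise, not Oscar data, and the parameter values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: flip_y randomly reassigns a share of the labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training rows exactly.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting depth is one common way to keep the tree from memorizing noise.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train = deep_tree.score(X_train, y_train)
deep_test = deep_tree.score(X_test, y_test)
shallow_train = shallow_tree.score(X_train, y_train)
shallow_test = shallow_tree.score(X_test, y_test)

print("deep:   ", deep_train, deep_test)
print("shallow:", shallow_train, shallow_test)
```

A large gap between the training and test accuracy of the deep tree is the typical symptom; whether the shallow tree actually does better on the test rows depends on the data.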
Random Forest Classifier
• Random Forest Classifiers use many Decision Trees to build a classifier.
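• We introduce a bit of randomness: each tree is trained on a random sample of the rows and considers a random subset of the features at each split.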
• Each Tree can give a different answer (a vote). The final classification is the most common amongst the Trees.
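A sketch of the voting idea with scikit-learn's RandomForestClassifier on hypothetical toy data. The number of trees and the random seed are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: columns are nominations and a rating score.
features = np.array([
    [2, 6.1], [3, 6.8], [9, 8.2], [11, 8.5], [4, 7.0], [10, 8.0],
])
target = np.array([0, 0, 1, 1, 0, 1])

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(features, target)

# Ask every individual tree for its answer on one unseen film.
new_film = np.array([[12, 8.4]])
votes = np.array([tree.predict(new_film)[0] for tree in forest.estimators_])
votes_for_winner = int((votes == 1).sum())

forest_prediction = forest.predict(new_film)[0]

print(votes_for_winner, "of", len(votes), "trees say winner")
print("forest says:", forest_prediction)
```

scikit-learn's forest averages the class probabilities predicted by its trees instead of counting hard votes, so its answer can occasionally differ from a simple majority count, though the two usually agree.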
Creating your first Decision Tree
Importances and Score
Predicting with Random Forest Classifiers
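A sketch of predicting with a fitted forest, again on hypothetical data. Besides the predicted label, predict_proba returns the forest's estimated probability for each class, which helps when ranking several nominees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: columns are nominations and a rating score.
features = np.array([
    [2, 6.1], [3, 6.8], [9, 8.2], [11, 8.5], [4, 7.0], [10, 8.0],
])
target = np.array([0, 0, 1, 1, 0, 1])
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(features, target)

# Two unseen films with the same columns as the training data.
test_features = np.array([[12, 8.4], [1, 5.9]])

predictions = forest.predict(test_features)          # one label per film
probabilities = forest.predict_proba(test_features)  # one row per film, one column per class

print(predictions)
print(probabilities)
```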
Results
• 1976: Rocky
• 1984: Amadeus
• 1996: The English Patient
• 2009: The Hurt Locker
And the Oscar goes to…
La La Land!!
The End
Nothing happened after that.
Right?? RIGHT??
We can predict the Oscars
Except for 2017 ¯\_(ツ)_/¯
More about Thinkful
• Anyone who’s committed can learn to code
• 1-on-1 mentorship is the best way to learn
• Flexibility! Learn anywhere, anytime, & at your own pace
Our Program
You’ll learn concepts, practice with drills, and build capstone projects — all guided by a personal mentor
Our Mentors
Mentors have, on average, 10+ years of experience
Data Science Syllabus
• Managing data with SQL and Python
• Modeling with both supervised and unsupervised models
• Data visualization and communicating with data
• Technical interviews + Career prep
Our Results
Job Titles after Graduation
Months until Employed
Special Introductory Offer
• Prep course for 50% off — $250 instead of $500
• Covers math, stats, Python, and data science toolkit
• Option to continue into full program
• Talk to me (or email me) if you’re interested