Analysis of features most influential in the success of
Jamie Fradkin
January 29, 2016
PREDICTING SUCCESS FOR MOVIES
Problem Statement/Motivation
Create a linear regression model that can predict Worldwide Gross of movies Based on a True Story by determining the features most influential to their success.
All Features
1) MPAA Rating (G/PG/PG-13/R)
2) Runtime
3) IMDB Score
4) Opening # Theaters (Domestic)
5) Opening Gross (Domestic)
6) Total # Theaters (Domestic)
7) Total Gross (Domestic)
8) Peak movie season*
9) Budget
10) Genre: Action, Adventure, Biography, Comedy, Crime, Documentary, Drama, Family, History, Horror, Music, Mystery, Romance, Sport, Thriller, War, Western
*May, June, July, November, and December are the highest-grossing months (BoxOfficeMojo.com)
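The two categorical features in the list above (peak movie season and genre) have to be turned into numbers before they can enter a regression. The slides do not say how this was done; the sketch below assumes a 0/1 indicator for each, and the function and column names (`peak_season_flag`, `genre_indicators`, `encode_movie`, `genre_*`) are hypothetical.

```python
# Peak months and genre list are taken from the slide; everything else
# (names, the 0/1 encoding itself) is an assumption for illustration.
PEAK_MONTHS = {5, 6, 7, 11, 12}  # May, June, July, November, December

GENRES = [
    "Action", "Adventure", "Biography", "Comedy", "Crime", "Documentary",
    "Drama", "Family", "History", "Horror", "Music", "Mystery", "Romance",
    "Sport", "Thriller", "War", "Western",
]


def peak_season_flag(release_month: int) -> int:
    """Return 1 if the release month (1-12) is a peak month, else 0."""
    return 1 if release_month in PEAK_MONTHS else 0


def genre_indicators(genres):
    """Return one 0/1 column per known genre for a movie's genre tags."""
    tagged = set(genres)
    return {f"genre_{g}": int(g in tagged) for g in GENRES}


def encode_movie(release_month, genres):
    """Combine the peak-season flag and genre indicators into one row."""
    row = {"peak_season": peak_season_flag(release_month)}
    row.update(genre_indicators(genres))
    return row
```

A movie can carry several genre tags, so each genre gets its own column here, and a movie may have more than one of them set to 1.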
Procedure
• Scrape all relevant data from various sources*, merge into data frame by Title
• Perform OLS regression on training set (70% of data), beginning with all features
• Evaluate model based on p-values for each feature and R², remove features as needed
• Apply new model to remainder of data set
*Boxofficemojo.com, TheNumbers.com, IMDB.com
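The merge-by-Title and 70/30 split steps above can be sketched without any external library. This is only an illustration: the slides mention a data frame but not the tooling, and the inner join on exact title match, the shuffle seed, and the function names are all assumptions.

```python
import random


def merge_by_title(*sources):
    """Inner-join several {title: {field: value}} mappings on title.

    Only titles present in every source are kept (an assumption; the
    slides do not say how unmatched titles were handled).
    """
    common = set(sources[0])
    for source in sources[1:]:
        common &= set(source)
    merged = []
    for title in sorted(common):
        row = {"Title": title}
        for source in sources:
            row.update(source[title])
        merged.append(row)
    return merged


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Matching on the raw title string is fragile in practice, since the same film can be listed under slightly different titles on different sites, so some normalization before the join would likely be needed.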
Training Set: Results
R²: 0.925
Adjusted R²: 0.918
Feature                      P>|t|
MPAA Rating                  0.123
Runtime                      0.300
IMDB Score                   0.155
Opening Theaters             0.002
Opening Gross                0.005
Total Theaters (Domestic)    0.000
Total Gross (Domestic)       0.000
Peak Movie Season            0.067
Budget                       0.000
Rule: remove feature if p-value > 0.100 (removes MPAA Rating, Runtime, and IMDB Score)
Training Set: Results
R²: 0.920
Adjusted R²: 0.915
Feature                      P>|t|
Opening Theaters             0.007
Opening Gross                0.011
Total Theaters (Domestic)    0.000
Total Gross (Domestic)       0.000
Peak Movie Season            0.069
Budget                       0.000
Rule: remove feature if p-value > 0.005 (removes Opening Theaters, Opening Gross, and Peak Movie Season)
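The two elimination passes can be checked by applying each threshold to the p-values reported on the slides. The sketch below only filters those reported numbers; it does not refit the regression, and the names `FULL_MODEL_P`, `REDUCED_MODEL_P`, and `keep_features` are hypothetical.

```python
# p-values copied from the two training-set result slides.
FULL_MODEL_P = {
    "MPAA Rating": 0.123,
    "Runtime": 0.300,
    "IMDB Score": 0.155,
    "Opening Theaters": 0.002,
    "Opening Gross": 0.005,
    "Total Theaters (Domestic)": 0.000,
    "Total Gross (Domestic)": 0.000,
    "Peak Movie Season": 0.067,
    "Budget": 0.000,
}

REDUCED_MODEL_P = {
    "Opening Theaters": 0.007,
    "Opening Gross": 0.011,
    "Total Theaters (Domestic)": 0.000,
    "Total Gross (Domestic)": 0.000,
    "Peak Movie Season": 0.069,
    "Budget": 0.000,
}


def keep_features(p_values, threshold):
    """Keep features whose p-value does not exceed the threshold."""
    return [name for name, p in p_values.items() if p <= threshold]


first_pass = keep_features(FULL_MODEL_P, 0.100)
second_pass = keep_features(REDUCED_MODEL_P, 0.005)
```

The p-values shift between passes because the model is refit after each removal, which is why the second pass uses the second table rather than reusing the first.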
Final Model
R²: 0.904
Adjusted R²: 0.902
Feature                      P>|t|
Total Theaters (Domestic)    0.000
Total Gross (Domestic)       0.000
Budget                       0.000
Next step: Add in genre categories to determine which one(s) have the lowest p-values and how they affect the model overall
Final Model
R²: 0.905
Adjusted R²: 0.902
Feature                      P>|t|
Total Theaters (Domestic)    0.000
Total Gross (Domestic)       0.000
Budget                       0.000
Romance ❤*                   0.053
*Runners-up: Thriller, Sport, Family
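Adding Romance raises R² from 0.904 to 0.905 while the adjusted R² stays at 0.902. The standard adjusted R² formula shows why: it penalizes each extra feature. The slides do not report the sample size, so the `n_obs` argument below must be supplied by the reader, and the function name is hypothetical.

```python
def adjusted_r2(r2: float, n_obs: int, n_features: int) -> float:
    """Standard adjusted R-squared.

    adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    where n is the number of observations and p the number of features
    (not counting the intercept).
    """
    if n_obs <= n_features + 1:
        raise ValueError("need more observations than features + 1")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)
```

For any sample size, `adjusted_r2(r2, n_obs, p)` is at most `r2`, and the gap widens as more features are added, so a small gain in R² from one more feature can leave the adjusted value essentially unchanged.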
Key Features: Trends in Raw Data
Model Fit: Actual Worldwide Gross vs. Predicted by Feature
[Figure legend: Model predictions, Raw Data, 95% Confidence Interval]
Model Fit: Prediction Residuals
• Residuals increase in magnitude as Worldwide Gross increases: model predictions are less accurate for extreme cases
• Accuracy could be improved with more data points in the extreme high-grossing group
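One simple way to check whether residuals grow with Worldwide Gross is to compare the average absolute residual in the lower-grossing and higher-grossing halves of the data. This is an illustrative check, not the method used in the slides, and the function names are hypothetical.

```python
def residuals(actual, predicted):
    """Residual = actual value minus model prediction, per movie."""
    return [a - p for a, p in zip(actual, predicted)]


def mean_abs_residual_by_half(actual, predicted):
    """Return (low_half, high_half) mean absolute residuals.

    Movies are sorted by actual gross and split at the midpoint; with an
    odd count, the middle movie goes to the high half.
    """
    if len(actual) < 2 or len(actual) != len(predicted):
        raise ValueError("need two or more matched actual/predicted values")
    pairs = sorted(zip(actual, residuals(actual, predicted)))
    mid = len(pairs) // 2

    def mean_abs(group):
        return sum(abs(r) for _, r in group) / len(group)

    return mean_abs(pairs[:mid]), mean_abs(pairs[mid:])
```

If the second value is much larger than the first, the prediction errors are concentrated among the high-grossing movies, which matches the pattern described above.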
Conclusion: Predicting Success of Movies Based on a True Story
• Domestic release (measured by Total Theaters and Total Domestic Gross) is a key indicator of Worldwide Gross
• Total Domestic Gross and Budget are linearly related to Worldwide Gross
• Among genres, Romance is the most highly correlated with Worldwide Gross
Next Steps
• Revisit larger data set knowing key features
• Additional features to explore:
  • Actors/actresses
  • Award nominations or wins
• Analyze model with more features, allow more lenient p-values