DS-240 Final Project – Movie Budgets vs. Success
Introduction
Hypothesis / Problem Statement
The question under study: is there any correlation between the budget spent to produce a movie and that
movie’s success? The first step, beyond finding a solid data set, is deciding how to measure success.
Five candidate variables are considered. First, straight “Revenue”, the simplest metric, widely reported and
present in most data sets. Second, Popularity, which is available through several of the movie database sites.
Third, Vote_Average, the viewer rating available from several of the same sites. Fourth and fifth, two
calculated variables: profit (Revenue-Budget) and ROI (Revenue/Budget). All five will be explored in the
following analysis.
Data Set Selection & Cleansing
After review of multiple movie data sets, there were several potential candidates in a wide range of sizes. The
selected data set is a middle-sized data set posted on www.kaggle.com from the tmdb movie database. (Link:
https://www.kaggle.com/kevinmariogerard/tmdbmovies) It has 10.9k rows with 21 columns listed below:
- Id (tmdb_id) (Qual.)
- Imdb_id (Qual.)
- Popularity (tmdb site data) (Quant.)
- Budget (Quant.)
- Revenue (Quant.)
- Original_title (Qual.)
- Cast (Qual.)
- Homepage (Qual.)
- Director (Qual.)
- Tagline (Qual.)
- Keywords (Qual.)
- Overview (Qual.)
- Runtime (Quant.)
- Genres (Qual.)
- Production_companies (Qual.)
- Release_date (Quant.)
- Vote_count (tmdb site data) (Quant.)
- Vote_average (tmdb site data) (Quant.)
- Release_year (Qual.)
- Budget_adj (Quant.)
- Revenue_adj (Quant.)
As part of data cleansing, several columns that were not necessary for this analysis were dropped. Some
fields had holes (NULL values or zeros), but mostly at the low tail end of the data, which was dropped as well.
The threshold settled on: any movie with a recorded budget or revenue lower than $1000 was removed. As a net
result, only two values remained NULL or zero. Those values were looked up manually and updated. The cleaned
data set landed at about 3.8k rows with 13 of the original columns remaining. The selected columns are shown
below, along with two added derived values (*):
- Id (tmdb_id)
- Imdb_id
- Popularity (tmdb site data)
- Budget
- Revenue
- Original_title
- Runtime
- Genres
- Production_companies
- Release_date
- Vote_count (tmdb site data)
- Vote_average (tmdb site data)
- Release_year
- *Pure_Gain (Revenue-Budget) (Quant.)
- *Return_Ratio (Revenue/Budget) (Quant.)
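The cleansing rule and the two derived columns can be sketched in a few lines. This is a minimal Python illustration, not the R workflow used for the report; the sample rows, the lowercase field names, and the helper name clean_and_derive are all hypothetical. Only the $1000 threshold and the two formulas come from the text above.

```python
# Hypothetical sample rows; the real values come from the TMDb data set.
movies = [
    {"original_title": "Film A", "budget": 150_000_000, "revenue": 450_000_000},
    {"original_title": "Film B", "budget": 0, "revenue": 2_000_000},
    {"original_title": "Film C", "budget": 20_000_000, "revenue": 500},
    {"original_title": "Film D", "budget": 10_000_000, "revenue": 8_000_000},
    {"original_title": "Film E", "budget": None, "revenue": 5_000_000},
]

THRESHOLD = 1000  # dollars; the cutoff stated in the report


def clean_and_derive(rows, threshold=THRESHOLD):
    """Drop rows whose budget or revenue is missing or below the threshold,
    then add the two derived columns Pure_Gain and Return_Ratio."""
    cleaned = []
    for row in rows:
        budget = row["budget"] or 0    # treat a missing (None) value as zero
        revenue = row["revenue"] or 0
        if budget < threshold or revenue < threshold:
            continue
        out = dict(row)
        out["pure_gain"] = revenue - budget      # Revenue - Budget
        out["return_ratio"] = revenue / budget   # Revenue / Budget
        cleaned.append(out)
    return cleaned


cleaned = clean_and_derive(movies)
for row in cleaned:
    print(row["original_title"], row["pure_gain"], round(row["return_ratio"], 2))
```

Filtering before dividing also guarantees that Return_Ratio is never computed against a zero budget.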
Description of Data
Once the data set was loaded into RStudio as movies.csv, a summary function was executed to produce a
high-level data summary. The following excerpt shows results for the previously selected “success” variables
and others.
Key Variable Ranges for our success variables and others (from summary(movies)):
Derived Variable Ranges from summary function:
Next up, a look at the selected methodology and how it unfolded for this analysis.
Methodology
For this analysis, the main focus will be on the Quantitative methodology. Below is an outline of the approach
through the three phases of data analysis.
Raw Data:
- Develop Hypothesis
- Data Selection
- Cleanse the data
Quantitative Analysis:
- Selection of significant variables
- List assumptions / Qualitative Factors
- Visual inspection and summary review of data
- Statistical analysis of significant variables
- Check for data Bias
Meaningful Information:
- Visualization development
- Present Findings
- Modify Hypothesis (if needed) and repeat
- Implement Solution / Results
Key Base Assumptions
Some key assumptions are being made with this data set. First and foremost, the reported budget and revenue
figures are assumed to be accurate. Second, dropping the low-end budget and revenue records is assumed to have
minimal to no effect on the analysis outcome. Third, Genre is not being included as a delineator for this
analysis, and it is assumed that this will have no appreciable impact. Finally, the initial assumption is
that qualitative factors have minimal impact and that the quantitative metrics will be enough to achieve a
reasonable correlation.
Preliminary Analysis
Visual Inspection and Summary Data Review
The first look is at how the data is distributed by release year (Release_Year), grouped by decade (see chart
below). The most recent decade (2010’s), as noted from the summary function, only includes years 2010-2015, six
of ten years, so it covers 60% of the span of the other segments. This explains the dip in the number of movies
released in that decade. It would be anticipated that the trend would continue if it were a full decade of
data (assumption).
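The decade grouping and the 60% coverage figure can be reproduced with a short Python sketch. The release years below are hypothetical, and the full-decade projection is a naive illustration that assumes a constant release rate; it is not a result from the report.

```python
from collections import Counter


def decade_of(year):
    """Map a release year to the first year of its decade, e.g. 2013 -> 2010."""
    return (year // 10) * 10


# Hypothetical release years standing in for the Release_Year column.
release_years = [1994, 1999, 2001, 2005, 2009, 2010, 2012, 2015]
counts = Counter(decade_of(y) for y in release_years)

# The data set stops at 2015, so the 2010s bucket holds 2010 through 2015.
years_covered = 2015 - 2010 + 1   # six years, counted inclusively
coverage = years_covered / 10     # share of a full decade

# Naive projection (assumption): scale the partial count up to ten years.
projected_2010s = counts[2010] * 10 / years_covered

print(dict(counts), coverage, projected_2010s)
```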
Returning to the hypothesis statement: Is there any correlation between budget spent and movie success? Let’s
take a look at the ggplot2 graphs of budget versus each of the “success” metrics.
Revenue, Popularity and Profit (Pure_Gain) show some directional correlation with budget, while Vote_Average
shows a very slight correlation, if any. The ROI (Return_Ratio) is clearly being skewed by two extreme outliers,
a pair of very successful low-budget films (Paranormal Activity and The Blair Witch Project). These two movies
both had budgets between 10-15k but made 12,000 and 9,000 times the film’s budget, respectively. The next
closest value was 700 times.
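A small Python sketch shows how strongly two such outliers can pull a summary of the return ratio. Only the three largest ratios (12,000, 9,000 and 700) come from the text; the four smaller values are hypothetical stand-ins for ordinary films.

```python
import statistics

# First four values are hypothetical; the last three are the ratios quoted above.
return_ratios = [0.5, 1.2, 2.0, 3.1, 700, 9_000, 12_000]

mean_ratio = statistics.mean(return_ratios)      # dominated by the outliers
median_ratio = statistics.median(return_ratios)  # robust to the outliers

print(round(mean_ratio, 1), median_ratio)
```

The mean lands in the thousands while the median stays near the typical film, which is why a ratio metric with a few extreme values is hard to read on a linear scale.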
Correlation and Linear Regression Analysis
Let’s dig a little deeper into a data correlation matrix and some linear regression models for budget and revenue
against the other success metrics (shown on the next page).
Some moderate, almost strong, correlations are seen in the 0.5-0.7 range with the key success variables
highlighted in light yellow above. As expected, the Revenue to Budget is the best at 0.69 followed closely by
Revenue to Popularity at 0.61.
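For reference, each entry in a correlation matrix of this kind is typically a Pearson correlation coefficient. The Python sketch below computes one by hand, assuming Pearson was the method used; the budget and revenue figures (in $ millions) are hypothetical and do not reproduce the 0.69 reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical budget and revenue values, in $ millions.
budget = [5, 20, 40, 80, 150, 200]
revenue = [12, 35, 30, 210, 260, 700]

r = pearson_r(budget, revenue)
print(round(r, 2))
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little linear relationship, which is the scale on which the 0.5-0.7 range reads as moderate.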
From a linear regression model standpoint, let’s look at the budget first, as our main analysis variable, then
revenue.
About 51% of the variance in Budget is explained by the selected variables (R-squared of 0.51). The selected
“success” variables also show strong significance, with varied levels of impact on explaining the budget.
The revenue linear regression model shows similar variable significance (on the following page).
The revenue model is the strongest of the linear regression models in the analysis, with the highest
R-squared: 61% of the variance in Revenue is explained by the selected variables.
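To make the R-squared figure concrete, here is a minimal Python sketch of a least-squares fit and its R-squared. It is a single-predictor simplification with hypothetical data; the models in this report used several predictors, so this does not reproduce the 0.51 or 0.61 values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical budget and revenue values, in $ millions.
budget = [5, 20, 40, 80, 150, 200]
revenue = [12, 35, 30, 210, 260, 700]

slope, intercept = fit_line(budget, revenue)
r2 = r_squared(budget, revenue, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(r2, 2))

# With a single predictor, R-squared equals the squared Pearson correlation,
# so a budget-only model at r = 0.69 would explain about 48% of the variance.
budget_only_r2 = 0.69 ** 2
print(round(budget_only_r2, 2))
```

This relation suggests why adding the other variables to the revenue model lifts the explained share from roughly 48% with budget alone to the 61% reported.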
Qualitative Impact Factors
With 61% being the highest R-squared among the linear regression results, consideration must be given to what
qualitative factors may explain the other 40-50% of the revenue and budget numbers. Several factors should be
investigated in deeper studies. Here are some of the candidate factors.
- Time of year the movie is released?
- Competition at time of release?
- Script Quality?
- Target Audience?
- Genre?
- Production Company?
- Inflation / Economic Cycles?
- Director?
- Other Human Behavior Factors?
- Marvel Magic? Disney Magic?
Budget Distribution Analysis
With budget being at the root of the hypothesis, looking deeper at the budget segmentation is the next area of
analysis. Below is the Budget distribution with mean line shown in red. Budget summary data shown as well.
By breaking the budget numbers into quartiles and looking at budget versus profit (Pure_Gain), trends can be
uncovered in the low-end, mid-range, or high-end segments. Below is the assumed “Movie Success Scale” and
two views of the quartile distribution (percentage and count) with the stated scale applied.
By applying this five-point scale we can glean a few things from the above three charts.
1. The higher the budget, the slightly higher the chance of profitability.
2. The trend is not as clear-cut in the mid-range Loss segment of the scale.
3. The Bombed and Smash Hit segments may be skewed by the way the scale has been set up, since for
larger budget films it would be harder to get to the x5 level.
4. Only 72% of the movies analyzed turn a profit.
5. The results may be influenced by how movies secure funding (extra commitment to fund a $50MM+ film?).
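The quartile split behind these observations can be sketched as follows. This is a Python illustration with hypothetical budget and revenue pairs (in $ millions); it bins films into budget quartiles and reports the share that turned a profit in each. It does not apply the five-point scale, whose cutoffs are given only in the chart.

```python
import statistics
from bisect import bisect_left

# Hypothetical (budget, revenue) pairs, in $ millions.
films = [(1, 0.5), (2, 3), (3, 2), (4, 9), (5, 4), (6, 10), (7, 20), (8, 12)]

budgets = [b for b, _ in films]
# Three cut points that split the budgets into four groups (Q1 to Q4).
cuts = statistics.quantiles(budgets, n=4)


def budget_quartile(budget):
    """Return 1 to 4 depending on where the budget falls among the cut points."""
    return bisect_left(cuts, budget) + 1


profitable = {q: 0 for q in range(1, 5)}
total = {q: 0 for q in range(1, 5)}
for budget, revenue in films:
    q = budget_quartile(budget)
    total[q] += 1
    if revenue - budget > 0:   # Pure_Gain above zero counts as profitable
        profitable[q] += 1

shares = {q: profitable[q] / total[q] for q in total if total[q]}
print(shares)
```

Comparing the per-quartile shares is one way to check an observation such as item 1 against the data rather than reading it off a chart.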
Results
There was a moderate to strong correlation of the key success metrics with budget (0.5-0.7 range). The p-values
show strong significance (***) in both the budget and revenue linear regression models. But the R-squared
values were a little soft at 0.51 and 0.61 for budget and revenue, respectively, compared to other
values seen in statistical analysis studies. So, is this a bad thing? Not necessarily.
With some online investigation, several sources noted that when dealing with “Human Behavior”, R-squared
values are typically lower than 50%. The excerpt below, from a blog called “Adventures in statistics”,
summarized it best. (Link to blog provided below the excerpt.)
http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
A decision like “Which movie to see?”, shaped by several variables, definitely qualifies as a “human behavior”
type variable. So, 51%-61% is not looking so bad now; it is actually very good!
Conclusion
In conclusion, there are correlations between budget spent and movie success, though they are clouded and
somewhat capped quantitatively by human behavior factors. The quantitative “success” variables account for 61%
of the variance in Revenue and 51% of the variance in Budget.