Group A-7 Project Report Section- A
INDIAN SCHOOL OF BUSINESS
Project Report
Forecasting daily maximum Carbon Monoxide level
Team Members:
Name PGID
Ankush Chetwani 61710630
Bharathi Rajan Muthu Krishnan 61710069
Surya Ramkumar 61710477
Shravanan Rudrapathy 61710165
Vaishnavi Gurusamy 61710732
Title: Forecasting daily maximum Carbon Monoxide (CO) level for an event management firm in Italy
Executive Summary
Business Objective
Considering the increasing pollution levels in the city and their harmful effects on children's health, an
event management firm has decided to conduct outdoor events only when Carbon Monoxide (CO) levels are
between 3 ppm and 9 ppm. To plan for this, the firm needs a model that gives the expected daily maximum
CO level one week in advance.
Forecasting Description
The goal is to forecast the daily maximum Carbon Monoxide (CO) level for the next week (5th April 2005 to
11th April 2005) using data on various air pollutants, including CO, from 10th March 2004 to 4th April 2005.
Data Description
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide
chemical sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from
10th March 2004 to 4th April 2005 (about one year). Ground-truth hourly averaged concentrations for CO,
Non-Methane Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were
provided by a co-located certified reference analyzer.
Source: UCI machine learning repository- Air Quality data set
(http://archive.ics.uci.edu/ml/datasets/Air+Quality#)
Attribute Information
0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in micro g/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in micro g/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in micro g/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
Key Characteristics
Missing values appeared in the data as “-200”. The data showed monthly seasonality and also varied by
day of the week, possibly because the number of automobiles (which emit air pollutants) differs
between weekdays and weekends.
Curves of output variable (CO): (after replacing missing values)
Input Variables: After examining the available variables, we decided that the following can affect
CO levels:
• Daily maximum C6H6 (lag 8)
• Daily maximum T (lag 7)
• Daily maximum AH (lag 7)
• Monthly dummy variables
• Day-of-week dummy variables
Output Variable: Daily maximum CO concentration
TRAINING DATA = 11 months data set; VALIDATION DATA = 1 month data set
Final Forecasting Method
“Multi-Linear Regression Model”
Performance
• The Multi-Linear regression model (best-fit model) gave a Root Mean Square Error (RMSE) of 1.2 mg/m^3,
which is much better than the Naïve prediction RMSE of 1.89.
• RMSE of the best-fit model in PPM = (1.2 * 24.45)/28 ≈ 1.05 PPM, rounded up to 1.1 PPM
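The unit conversion in the last bullet can be sketched as follows. This is a minimal illustration, assuming the usual ideal-gas conversion at 25 °C and 1 atm (molar volume 24.45 L/mol) and the rounded CO molecular weight of 28 used in the report; the function and constant names are ours, not the report's.

```python
# Assumed constants: 24.45 L/mol is the ideal-gas molar volume at 25 °C and 1 atm;
# 28.0 g/mol is the rounded molecular weight of CO used in the report (28.01 is more precise).
MOLAR_VOLUME_L_PER_MOL = 24.45
CO_MOLECULAR_WEIGHT = 28.0


def mg_per_m3_to_ppm(mg_per_m3, molecular_weight=CO_MOLECULAR_WEIGHT):
    """Convert a gas concentration from mg/m^3 to parts per million by volume."""
    return mg_per_m3 * MOLAR_VOLUME_L_PER_MOL / molecular_weight


# RMSE of the best-fit model, converted from mg/m^3 to PPM.
rmse_ppm = mg_per_m3_to_ppm(1.2)
print(round(rmse_ppm, 2))  # about 1.05
```

The same function converts the 3 ppm and 9 ppm limits back and forth if the thresholds need to be compared directly with the raw mg/m^3 readings.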
Conclusions
Taking the allowable CO range as 3 ppm to 9 ppm and assuming a Normal distribution centred on the middle
of that range (mean = 6) with standard deviation equal to the model RMSE (1.1), we find that
P( Z > ((X - µ)/σ)) = P( Z > ((9-6)/1.1)) = 1 - 0.9968 ≈ 0.003. The risk of exceeding the upper limit
is therefore only about 0.3%. Thus, the Multi-Linear regression model can be used to predict the daily
maximum CO level for the next week.
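The tail probability above can be checked with a few lines of Python. This is only a sketch of the arithmetic under the report's own assumptions (Normal distribution, mean 6, standard deviation 1.1, upper limit 9); the helper names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_above(limit, mean, std_dev):
    """P(X > limit) for X ~ Normal(mean, std_dev)."""
    z = (limit - mean) / std_dev
    return 1.0 - normal_cdf(z)


# Values taken from the report: upper CO limit 9 ppm, mean 6, RMSE 1.1 PPM.
risk = prob_above(limit=9.0, mean=6.0, std_dev=1.1)
print(f"Risk of exceeding 9 ppm: {risk:.2%}")
```

Note that this covers only the upper tail, as in the report; the probability of falling below 3 ppm is not included.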
Recommendations
To be on the safer side, if the forecasted outdoor CO level exceeds 8 PPM (about one standard deviation
below the 9 PPM threshold), the event management firm is advised not to conduct the outdoor event.
Technical summary
Data Preparation
Missing values were represented as “-200” in several columns. We first counted the missing values in
each column to make sure they were within acceptable limits for a good prediction. More than 90% of the
NMHC values were missing, so we excluded that variable from our model. All other variables had less
than 10% missing values. We replaced an isolated missing value with the previous hour's value; for runs
of consecutive missing values we used the values from the same hour one week earlier, because the level
was found to differ by day of the week. The data was then converted from hourly values to daily maxima
for the output and the predictor variables.
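The imputation and aggregation steps described above could look roughly like the sketch below. It is not the authors' implementation: the function names are hypothetical, the fallback for runs that occur in the first week (where no value from one week earlier exists) is our assumption, and the series is assumed to start at midnight so that each block of 24 values is one calendar day.

```python
MISSING = -200                      # sentinel used for missing readings in the dataset
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY


def fill_missing(hourly):
    """Fill isolated gaps with the previous hour and longer runs with the same hour last week."""
    filled = list(hourly)
    n = len(filled)
    i = 0
    while i < n:
        if hourly[i] != MISSING:
            i += 1
            continue
        # Find the end of this run of consecutive missing values.
        j = i
        while j < n and hourly[j] == MISSING:
            j += 1
        isolated = (j - i == 1)
        for k in range(i, j):
            if not isolated and k >= HOURS_PER_WEEK:
                filled[k] = filled[k - HOURS_PER_WEEK]   # same hour, one week earlier
            elif k > 0:
                # Isolated gap, or (assumption) a run within the first week.
                filled[k] = filled[k - 1]
            # k == 0: nothing earlier to copy from, so the sentinel is left in place.
        i = j
    return filled


def daily_max(hourly_filled):
    """Collapse an hourly series into daily maxima (assumes the series starts at midnight)."""
    return [max(hourly_filled[d:d + HOURS_PER_DAY])
            for d in range(0, len(hourly_filled), HOURS_PER_DAY)]


# Toy example with made-up values: one isolated gap and one run of two gaps.
toy = [float(h % 24) for h in range(200)]
toy[5] = toy[170] = toy[171] = MISSING
print(daily_max(fill_missing(toy))[:3])
```

Gaps are filled in time order, so a value copied from one week earlier has itself already been filled if it was originally missing.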
Post processed time series for predictor variables
Forecasting methods selection
SMOOTHING
Among the smoothing techniques, Holt-Winters smoothing with parameters α = 0.2 and γ = 0.05 (with no
trend) gave the lowest RMSE.
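For readers unfamiliar with the method, a no-trend Holt-Winters smoother can be sketched as below. Only α = 0.2, γ = 0.05 and the absence of a trend term come from the report; the additive form of the seasonality, the season length of 7 days and the initialisation from the first week are our assumptions, and the function name is hypothetical.

```python
def holt_winters_no_trend(y, season_length=7, alpha=0.2, gamma=0.05, horizon=7):
    """Additive Holt-Winters smoothing with level and seasonal terms only (no trend)."""
    m = season_length
    # Assumed initialisation: level = mean of the first season, seasonal = deviation from it.
    level = sum(y[:m]) / m
    seasonal = [y[i] - level for i in range(m)]
    for t in range(m, len(y)):
        s_prev = seasonal[t % m]
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * level
        seasonal[t % m] = gamma * (y[t] - new_level) + (1 - gamma) * s_prev
        level = new_level
    n = len(y)
    # Forecast for step n + h reuses the latest seasonal index for that position in the cycle.
    return [level + seasonal[(n + h) % m] for h in range(horizon)]


# Toy example with made-up values: a perfectly repeating weekly pattern is reproduced as-is.
weekly_pattern = [2.0, 3.0, 5.0, 4.0, 3.5, 1.5, 1.0]
print(holt_winters_no_trend(weekly_pattern * 10))
```

A small γ such as 0.05 means the weekly profile adapts slowly, which suits a day-of-week pattern that is fairly stable over the year.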
REGRESSION
Step 1: Captured two layers of seasonality (monthly and day-of-week) with dummy variables.
Step 2: Forecasted the future values of C6H6, since it was the best predictor of CO concentration.
However, feeding forecasted predictor values into the model increases the error.
Step 3: Hence we used lagged values instead: lag 8 of C6H6 along with lag 7 of the T and AH variables.
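The predictor set built in the steps above can be illustrated with a small sketch that assembles one row of the design matrix. The lags (8 for C6H6, 7 for T and AH) come from the report; the choice of January and Monday as the dropped reference categories and all names in the code are assumptions made for illustration.

```python
from datetime import date, timedelta


def build_row(t, dates, c6h6_max, temp_max, ah_max):
    """Predictors for day index t: lagged daily maxima plus month and day-of-week dummies."""
    d = dates[t]
    # 11 month dummies (January is the assumed reference month).
    month_dummies = [1 if d.month == m else 0 for m in range(2, 13)]
    # 6 day-of-week dummies (Monday, weekday() == 0, is the assumed reference day).
    weekday_dummies = [1 if d.weekday() == w else 0 for w in range(1, 7)]
    lagged = [c6h6_max[t - 8], temp_max[t - 7], ah_max[t - 7]]
    return lagged + month_dummies + weekday_dummies


# Toy example with made-up daily maxima, starting on the first day of the dataset.
dates = [date(2004, 3, 10) + timedelta(days=i) for i in range(20)]
toy_values = [float(i) for i in range(20)]
row = build_row(10, dates, toy_values, toy_values, toy_values)
print(len(row), row)
```

With an intercept, this gives 21 fitted parameters (3 lagged predictors, 11 month dummies, 6 day-of-week dummies), which is consistent with the residual degrees of freedom reported for the regression if the training set holds 352 days.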
NEURAL NETWORK
We also implemented a neural network model, but its results were not as good as those of the simple
smoothing models.
Forecasting methods used
a) Naïve forecast: We started with a naïve forecast that lags the data by 7 days, as shown in the curve below. It gave an RMSE of 1.89.
b) Multi-Linear regression model (best model): We then fitted the Multi-Linear regression model, which gave the curve below and an RMSE of 1.2.
Regression Model
Residual DF            331
R²                     0.502713
Adjusted R²            0.472665
Std. Error Estimate    1.372884
RSS                    623.8723
Training Data Scoring - Summary Report
Total sum of squared errors    RMS Error    Average Error
623.8723                       1.331302     -2.14349E-15
Validation Data Scoring - Summary Report
Total sum of squared errors    RMS Error    Average Error
45.07613                       1.205848     0.262738012
c) Holt-Winters without trend: We then used the Holt-Winters method (without trend); however, its RMSE was sqrt(1.678) ≈ 1.30, which is higher than that of the Multi-Linear regression model. (The curve is shown under ‘All models comparison’ below.)
d) Linear regression (with seasonality only): We also used linear regression with seasonality only; its RMSE was 1.25 (as shown below), again higher than that of the Multi-Linear regression model. (The curve is shown under ‘All models comparison’ below.)
e) Ensemble: We then built an ensemble by averaging the outputs of the Multi-Linear regression, Holt-Winters (without trend) and linear regression (with seasonality only) models. The resulting curves are shown below; however, its RMSE was higher than that of the Multi-Linear regression model.
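Two of the methods in this list are simple enough to sketch directly: the 7-day naïve forecast in (a) and the averaging ensemble in (e), together with the RMSE used to compare them. The function names are hypothetical and the numbers in the example are made up; this is an illustration of the technique, not the authors' workbook.

```python
import math


def rmse(actual, forecast):
    """Root mean square error between two equally long sequences."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def naive_lag_forecast(series, lag=7):
    """Forecast for day t is the value observed `lag` days earlier.

    The result lines up with series[lag:], i.e. the days that have a lagged value.
    """
    return series[:-lag]


def ensemble_average(*forecasts):
    """Point-by-point average of several forecasts of the same length."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


# Toy example with made-up daily maxima.
toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 2.0, 2.0, 3.0]
print(rmse(toy[7:], naive_lag_forecast(toy)))
print(ensemble_average([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
```

An unweighted average gives each member equal say, so a clearly weaker member can pull the ensemble's error above that of the best single model, which matches what the report observed.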
Ensemble Residuals
Validation Data Scoring - Summary Report
Total sum of squared errors    RMS Error    Average Error
48.53848                       1.251302     0.29832052
Validation Error Measures
Mean Absolute Percentage Error (MAPE)    24.88125501
Mean Absolute Deviation (MAD)            0.959511376
Mean Square Error (MSE)                  1.678859477
Tracking Signal Error (TSE)              6.649563945
Cumulative Forecast Error (CFE)          6.380332253
Mean Forecast Error (MFE)                0.205817169
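The validation error measures listed above follow standard definitions, sketched below. We assume the error is defined as actual minus forecast (the report does not state the sign convention) and that the tracking signal is CFE divided by MAD, which agrees with the reported values (6.380 / 0.960 ≈ 6.65). The function name is hypothetical.

```python
def error_measures(actual, forecast):
    """Common forecast error measures; error is taken as actual minus forecast (assumption).

    Assumes no actual value is zero (for MAPE) and that MAD is non-zero (for the tracking signal).
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    cfe = sum(errors)                              # Cumulative Forecast Error
    mad = sum(abs(e) for e in errors) / n          # Mean Absolute Deviation
    return {
        "MAPE": 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
        "MAD": mad,
        "MSE": sum(e * e for e in errors) / n,
        "TSE": cfe / mad,                          # tracking signal
        "CFE": cfe,
        "MFE": cfe / n,                            # Mean Forecast Error (bias)
    }


# Toy example with made-up values.
measures = error_measures([2.0, 4.0, 5.0], [1.0, 5.0, 3.0])
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

A positive MFE, as in the table, indicates that under this sign convention the forecasts were on average below the actual values over the validation month.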
All models comparison
Below is the plot of the actual data and the forecasts from the Multi-Linear regression model (best model),
the ensemble, Holt-Winters (no trend) and linear regression (with seasonality only), with prediction intervals: