Project Report - Galit Shmueli · Group A-7 Project Report Section- A Key Characteristics Data was found with missing values which were visible as “-200”. Data had monthly seasonality

Group A-7 Project Report Section- A

INDIAN SCHOOL OF BUSINESS

Project Report

Forecasting daily maximum Carbon Monoxide level

Team Members:

Name PGID

Ankush Chetwani 61710630

Bharathi Rajan Muthu Krishnan 61710069

Surya Ramkumar 61710477

Shravanan Rudrapathy 61710165

Vaishnavi Gurusamy 61710732


Title: Forecasting daily maximum Carbon Monoxide (CO) level for an event management firm in Italy

Executive Summary

Business Objective

Considering the rising pollution levels in the city and their harmful effects on children’s health, an event management firm has decided to conduct outdoor events only when Carbon Monoxide (CO) levels are between 3 ppm and 9 ppm. To plan for this, the firm needs a model that forecasts the daily maximum CO level one week in advance.

Forecasting Description

The goal is to forecast the daily maximum Carbon Monoxide (CO) level for the next week (5th April 2005 to 11th April 2005) using data on various air pollutants, including CO, recorded from 10th March 2004 to 4th April 2005.

Data Description

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from 10th March 2004 to 4th April 2005 (about one year). Ground-truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located certified reference analyzer.

Source: UCI machine learning repository- Air Quality data set

(http://archive.ics.uci.edu/ml/datasets/Air+Quality#)

Attribute Information

0 Date (DD/MM/YYYY)

1 Time (HH.MM.SS)

2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)

3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)

4 True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in micro g/m^3 (reference analyzer)

5 True hourly averaged Benzene concentration in micro g/m^3 (reference analyzer)

6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)

7 True hourly averaged NOx concentration in ppb (reference analyzer)

8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)

9 True hourly averaged NO2 concentration in micro g/m^3 (reference analyzer)

10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)

11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)

12 Temperature in °C

13 Relative Humidity (%)

14 AH Absolute Humidity


Key Characteristics

Missing values appear in the data as “-200”. The series shows monthly seasonality and also varies by day of the week, possibly because the number of automobiles (which emit air pollutants) differs between weekdays and weekends.

Curves of output variable (CO): (after replacing missing values)

Input Variables: After examining the available variables, we selected the following as predictors of CO levels:


• Daily maximum C6H6 (lag 8)

• Daily maximum T (lag 7)

• Daily maximum AH (lag 7)

• Monthly dummy variables

• Weekly dummy variables

Output Variable: Daily maximum CO concentration

TRAINING DATA = 11 months data set; VALIDATION DATA = 1 month data set

Final Forecasting Method

“Multi-Linear Regression Model”

Performance

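• The Multi-Linear regression model (best-fit model) gave a Root Mean Square Error (RMSE) of 1.2 (in mg/m^3) on the validation data, which is much better than the Naïve forecast RMSE of 1.89.

• RMSE of the best-fit model in PPM = (1.2 * 24.45)/28 ≈ 1.05 PPM, rounded up to 1.1 PPM, where 24.45 L/mol is the molar volume of a gas at 25 °C and 1 atm and 28 g/mol is the molar mass of CO.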
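The unit conversion above can be checked with a short sketch. The function and constant names are hypothetical, and the 25 °C, 1 atm conditions behind the 24.45 factor are an assumption, since the report does not state them.

```python
# Assumed conditions: ideal gas at 25 °C and 1 atm, where one mole occupies 24.45 L.
MOLAR_VOLUME_L_PER_MOL = 24.45
# Molar mass of CO as used in the report (28.01 g/mol is more precise).
CO_MOLAR_MASS_G_PER_MOL = 28.0


def mg_per_m3_to_ppm(conc_mg_m3, molar_mass=CO_MOLAR_MASS_G_PER_MOL):
    """Convert a gas concentration from mg/m^3 to ppm by volume."""
    return conc_mg_m3 * MOLAR_VOLUME_L_PER_MOL / molar_mass


# Validation RMSE of the regression model, in mg/m^3 (from the report).
rmse_ppm = mg_per_m3_to_ppm(1.2)
print(round(rmse_ppm, 2))
```

The same function converts any forecast from mg/m^3 to ppm before it is compared with the 3 ppm to 9 ppm range.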

Conclusions

Taking the allowable CO range as 3 ppm to 9 ppm and assuming a Normal distribution over this range with mean µ = 6 and standard deviation σ = 1.1, we find that P(Z > (X - µ)/σ) = P(Z > (9 - 6)/1.1) = 1 - 0.9968 = 0.003. Therefore, the risk associated with the forecasting model is only about 0.3%. Thus, the Multi-Linear regression model can be used to predict the daily maximum CO level for the next week.
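The tail probability above can be reproduced with the standard library alone. This is a sketch that takes the report's assumptions as given (Normal distribution, mean 6 ppm, standard deviation 1.1 ppm); the helper name `normal_cdf` is hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean_ppm = 6.0         # midpoint of the 3-9 ppm range (from the report)
sd_ppm = 1.1           # validation RMSE in ppm (from the report)
upper_limit_ppm = 9.0  # upper end of the allowable range

z = (upper_limit_ppm - mean_ppm) / sd_ppm
risk = 1.0 - normal_cdf(z)
print(f"z = {z:.3f}, P(CO > 9 ppm) = {risk:.4f}")
```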

Recommendations

To be on the safer side, if the outdoor CO level is above 8 PPM (about one standard deviation below the 9 PPM threshold), the event management firm is advised not to conduct the outdoor event.

Technical summary

Data Preparation

Missing values were represented as “-200” in the various columns. We first counted the missing values in each column to make sure they were within acceptable limits for a good prediction. For the variable NMHC, more than 90% of the values were missing, so we decided not to include it in our model. All other variables had less than 10% missing values. We replaced each isolated missing value with the previous hour's value; for consecutive missing values, we used the value from the same hour one week earlier, because the level was found to differ by day of the week. The data was then converted from hourly values to daily maximum values for the output and the predictor variables.
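The imputation and aggregation steps described above can be sketched as follows. This is an illustrative reading of the procedure, not the team's actual spreadsheet workflow: the function names are hypothetical, the series is assumed to be a plain list of hourly readings, and days are assumed to start at index 0.

```python
MISSING = -200        # missing-value marker used in the dataset
HOURS_PER_WEEK = 168


def impute_hourly(values):
    """Fill missing hourly readings.

    An isolated gap takes the previous hour's value; a reading inside a run
    of consecutive gaps takes the value from the same hour one week earlier.
    """
    filled = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v != MISSING:
            continue
        prev_missing = i > 0 and values[i - 1] == MISSING
        next_missing = i + 1 < n and values[i + 1] == MISSING
        in_run = prev_missing or next_missing
        if not in_run and i > 0:
            filled[i] = filled[i - 1]
        elif i >= HOURS_PER_WEEK and filled[i - HOURS_PER_WEEK] != MISSING:
            filled[i] = filled[i - HOURS_PER_WEEK]
        # Otherwise the reading stays missing: there is no history to borrow from.
    return filled


def daily_maximum(hourly, hours_per_day=24):
    """Collapse an hourly series into one maximum per complete day."""
    return [max(hourly[d:d + hours_per_day])
            for d in range(0, len(hourly) - hours_per_day + 1, hours_per_day)]


# Eight days of synthetic hourly data with one isolated gap and one run of gaps.
hourly = [2.0 + (h % 24) / 10 for h in range(24 * 8)]
hourly[30] = MISSING
hourly[170] = hourly[171] = MISSING
clean = impute_hourly(hourly)
print(clean[30], clean[170], clean[171])
print(daily_maximum(clean))
```

Gaps in the first week that form a run are left as they are in this sketch, because no earlier week exists to copy from; the report does not say how such cases were handled.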

Post processed time series for predictor variables

Forecasting methods selection

SMOOTHING

Among the smoothing techniques, Holt-Winters smoothing with the parameters α = 0.2 and γ = 0.05 (with no trend) gave the lowest RMSE.
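For readers unfamiliar with the method, here is a minimal sketch of Holt-Winters smoothing without a trend component. Several choices are assumptions not stated in the report: additive seasonality, a 7-day season, and initialization from the first week. Only α = 0.2 and γ = 0.05 come from the report, and the function name is hypothetical.

```python
def holt_winters_no_trend(y, period=7, alpha=0.2, gamma=0.05, horizon=7):
    """Additive Holt-Winters smoothing with level and seasonality, no trend.

    Returns forecasts for the `horizon` periods that follow the series.
    """
    if len(y) < period:
        raise ValueError("need at least one full season of data")
    # Assumed initialization: level is the first-season mean,
    # seasonal indices are deviations from that mean.
    level = sum(y[:period]) / period
    seasonal = [y[i] - level for i in range(period)]
    for t in range(period, len(y)):
        s_prev = seasonal[t % period]
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * level
        seasonal[t % period] = gamma * (y[t] - new_level) + (1 - gamma) * s_prev
        level = new_level
    n = len(y)
    return [level + seasonal[(n + k) % period] for k in range(horizon)]


# Synthetic daily maxima with a repeating weekly pattern.
weekly_pattern = [3.0, 3.5, 4.0, 4.2, 4.5, 2.5, 2.0]
history = weekly_pattern * 6
print(holt_winters_no_trend(history))
```

A small γ such as 0.05 means the seasonal indices change slowly, so the weekly shape is mostly fixed by the early data while the level adapts faster.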

REGRESSION

Step 1: Captured the two layers of seasonality (monthly and day-of-week) with dummy variables

Step 2: Forecasted the future values of C6H6, since it was the best predictor of CO concentration. However, using forecasted predictor values increased the error.

Step 3: Hence used lagged values instead, i.e. Lag 8 of C6H6 along with Lag 7 of the T and AH variables
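The predictor set built in the steps above can be sketched as one row per day. The function name, the choice of reference categories (January and Monday) and the reading of "weekly dummy variables" as day-of-week dummies are assumptions made for illustration.

```python
from datetime import date, timedelta


def build_row(day, index, c6h6, temp, abs_hum):
    """Return the predictor vector for the day at position `index`.

    c6h6, temp and abs_hum are daily-maximum series aligned with the dates.
    """
    # Lagged predictors: values already known one week before the forecast day.
    lagged = [c6h6[index - 8], temp[index - 7], abs_hum[index - 7]]
    # 11 month dummies (January is the reference month, an arbitrary choice).
    month_dummies = [1 if day.month == m else 0 for m in range(2, 13)]
    # 6 day-of-week dummies (Monday, weekday() == 0, is the reference day).
    weekday_dummies = [1 if day.weekday() == d else 0 for d in range(1, 7)]
    return lagged + month_dummies + weekday_dummies


# Synthetic daily series starting on the first date in the dataset.
start = date(2004, 3, 10)
days = [start + timedelta(days=i) for i in range(30)]
c6h6 = [10.0 + i for i in range(30)]
temp = [15.0 + i for i in range(30)]
abs_hum = [0.8 + i / 100 for i in range(30)]

# The first 8 days have no complete lag history, so rows start at index 8.
rows = [build_row(days[i], i, c6h6, temp, abs_hum) for i in range(8, 30)]
print(len(rows), len(rows[0]))
```

Because every predictor is either a calendar dummy or a value at least 7 days old, the whole row is available when a one-week-ahead forecast is made, which is the point of Step 3.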

NEURAL NETWORK

We also implemented a Neural Network model, but the results were not as good as those of the simple smoothing models.

Forecasting methods used


a) Naïve forecast: We started with a Naïve forecast obtained by lagging the data by 7 days, as shown in the curve below. It gave an RMSE of 1.89.
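A lag-7 naïve forecast and its RMSE can be sketched in a few lines. The function names are hypothetical and the data is synthetic, so the printed value is not the report's 1.89.

```python
import math


def seasonal_naive(y, lag=7):
    """Forecast each day with the value observed `lag` days earlier."""
    return [y[t - lag] for t in range(lag, len(y))]


def rmse(actual, forecast):
    """Root mean square error between two equal-length sequences."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Two synthetic weeks of daily maximum CO values.
daily_max = [4.1, 4.6, 5.0, 4.8, 5.3, 3.2, 2.9,
             4.4, 4.2, 5.5, 4.9, 5.0, 3.6, 2.7]
forecast = seasonal_naive(daily_max)
print(rmse(daily_max[7:], forecast))
```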

b) Multi-Linear regression model (best model): We then fitted the Multi-Linear regression model, which gave the curve below and an RMSE of 1.2 on the validation data.

Regression Model

Residual DF           331
R²                    0.502713
Adjusted R²           0.472665
Std. Error Estimate   1.372884
RSS                   623.8723

Training Data Scoring - Summary Report

Total sum of squared errors   RMS Error   Average Error
623.8723                      1.331302    -2.14349E-15

Validation Data Scoring - Summary Report

Total sum of squared errors   RMS Error   Average Error
45.07613                      1.205848    0.262738012


c) Holt-Winters without trend: We then used the Holt-Winters method (without trend); however, its RMSE was sqrt(1.678) ≈ 1.30, which was higher than that of the Multi-Linear regression model. (The curve is shown under ‘All models comparison’ below.)

d) Linear regression (with seasonality only): We also used linear regression with seasonality only; however, its RMSE of 1.25 (as shown below) was again higher than that of the Multi-Linear regression model. (The curve is shown under ‘All models comparison’ below.)

e) Ensemble: We then built an ensemble by averaging the outputs of the Multi-Linear regression, Holt-Winters (without trend) and linear regression (with seasonality only) models. The curves below show the ensemble results; however, its RMSE was higher than that of the Multi-Linear regression model.
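The simple-average ensemble described above amounts to a point-by-point mean of the individual forecasts. A minimal sketch, with a hypothetical function name and made-up forecast values:

```python
def ensemble_average(*forecasts):
    """Average several equal-length forecast series point by point."""
    lengths = {len(f) for f in forecasts}
    if len(lengths) != 1:
        raise ValueError("all forecast series must have the same length")
    return [sum(values) / len(values) for values in zip(*forecasts)]


# Made-up three-day forecasts from the three component models.
regression = [4.2, 4.8, 5.1]
holt_winters = [4.0, 5.0, 4.7]
seasonal_regression = [4.4, 4.6, 5.2]
print(ensemble_average(regression, holt_winters, seasonal_regression))
```

Equal weights give the weaker models as much influence as the best one, which is one plausible reason the ensemble did not beat the Multi-Linear regression model here.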

Ensemble Residuals

Validation Data Scoring - Summary Report

Total sum of squared errors   RMS Error   Average Error
48.53848                      1.251302    0.29832052

Validation Error Measures

Mean Absolute Percentage Error (MAPE)   24.88125501
Mean Absolute Deviation (MAD)           0.959511376
Mean Square Error (MSE)                 1.678859477
Tracking Signal Error (TSE)             6.649563945
Cumulative Forecast Error (CFE)         6.380332253
Mean Forecast Error (MFE)               0.205817169
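The error measures listed above follow standard definitions, sketched below. The sign convention (error = actual - forecast) and the function name are assumptions. The reported values are consistent with TSE = CFE / MAD (6.380 / 0.9595 ≈ 6.65) and with MFE = CFE / n for n = 31 validation days.

```python
def error_measures(actual, forecast):
    """Compute common forecast error measures for two equal-length series."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    cfe = sum(errors)                      # cumulative forecast error
    mad = sum(abs(e) for e in errors) / n  # mean absolute deviation
    return {
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "MAD": mad,
        "MSE": sum(e * e for e in errors) / n,
        "TSE": cfe / mad,                  # tracking signal
        "CFE": cfe,
        "MFE": cfe / n,                    # mean forecast error (bias)
    }


# Made-up actual and forecast values for illustration.
measures = error_measures([10, 20, 30, 40], [8, 22, 27, 40])
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

A positive MFE, as in the table above, indicates that under this sign convention the forecasts were on average below the actual values.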


All models comparison

Below is the plot of Actual data, Multi-linear regression model (best model), Ensemble, Holt-winter (No

trend), Linear regression (with seasonality only) with prediction intervals:
