

  • Group A-7 Project Report Section- A

    INDIAN SCHOOL OF BUSINESS

    Project Report

    Forecasting daily maximum Carbon Monoxide level

    Team Members:

    Name PGID

    Ankush Chetwani 61710630

    Bharathi Rajan Muthu Krishnan 61710069

    Surya Ramkumar 61710477

    Shravanan Rudrapathy 61710165

    Vaishnavi Gurusamy 61710732


    Title: Forecasting daily maximum Carbon Monoxide (CO) level for an event management firm in Italy

    Executive Summary

    Business Objective

    In view of the increasing pollution levels in the city and their harmful effects on children's health, an event management firm has decided to conduct outdoor events only when carbon monoxide levels are within 3 ppm to 9 ppm. To plan for this, the firm needs a model that forecasts the daily maximum level of carbon monoxide (CO) one week in advance.

    Forecasting Description

    The goal is to forecast the daily maximum carbon monoxide (CO) level for the next week (5th April 2005 to 11th April 2005), using data on various air pollutants, including CO, from 10th March 2004 to 4th April 2005.

    Data Description

    The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from 10th March 2004 to 4th April 2005 (about one year). Ground-truth hourly averaged concentrations for CO, Non-Metallic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located certified reference analyzer.

    Source: UCI machine learning repository- Air Quality data set

    (http://archive.ics.uci.edu/ml/datasets/Air+Quality#)

    Attribute Information

    0 Date (DD/MM/YYYY)

    1 Time (HH.MM.SS)

    2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)

    3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)

    4 True hourly averaged overall Non Metallic Hydro Carbons concentration in micro g/m^3 (reference analyzer)

    5 True hourly averaged Benzene concentration in micro g/m^3 (reference analyzer)

    6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)

    7 True hourly averaged NOx concentration in ppb (reference analyzer)

    8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)

    9 True hourly averaged NO2 concentration in micro g/m^3 (reference analyzer)

    10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)

    11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)

    12 Temperature in °C

    13 Relative Humidity (%)

    14 AH Absolute Humidity


    Key Characteristics

    Missing values appear in the data as “-200”. The series shows monthly seasonality and also varies by day of the week, which could be due to the differing number of automobiles (emitting air pollutants) on weekdays and weekends.

    Curves of output variable (CO): (after replacing missing values)

    Input Variables: After examining the available variables, we decided that the following variables can affect the CO level:


    • Daily maximum C6H6 (lag 8)

    • Daily maximum T (lag 7)

    • Daily maximum AH (lag 7)

    • Monthly dummy variables

    • Day-of-week dummy variables

    Output Variable: Daily maximum CO concentration

    TRAINING DATA = 11 months data set; VALIDATION DATA = 1 month data set

    Final Forecasting Method

    “Multi-Linear Regression Model”

    Performance

    • The multi-linear regression model (the best-fitting model) gave a validation Root Mean Square Error (RMSE) of 1.2 mg/m^3, much better than the naïve forecast's RMSE of 1.89.

    • RMSE of the best-fitting model in ppm = (1.2 × 24.45)/28 ≈ 1.1 ppm (1.05 ppm before rounding, using the unrounded RMSE of 1.2058)
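    The unit conversion in the second bullet can be reproduced with a few lines of Python. This is a minimal sketch, assuming the factor 24.45 is the ideal-gas molar volume in L/mol at 25 °C and 1 atm, and using the report's rounded molar mass of 28 g/mol for CO; the function and constant names are illustrative.

```python
MOLAR_VOLUME_L_PER_MOL = 24.45   # ideal-gas molar volume at 25 °C, 1 atm (assumed conditions)
CO_MOLAR_MASS_G_PER_MOL = 28.0   # rounded value used in the report (28.01 is more precise)


def mg_per_m3_to_ppm(conc_mg_m3, molar_mass=CO_MOLAR_MASS_G_PER_MOL):
    """Convert a gas concentration from mg/m^3 to ppm by volume."""
    return conc_mg_m3 * MOLAR_VOLUME_L_PER_MOL / molar_mass


# Unrounded validation RMSE taken from the scoring summary in the report.
rmse_ppm = mg_per_m3_to_ppm(1.205848)
print(round(rmse_ppm, 2))
```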

    Conclusions

    Taking 3 ppm to 9 ppm as the allowable CO range and assuming a Normal distribution over this range with mean µ = 6 and standard deviation σ = 1.1, we find that P(Z > (X − µ)/σ) = P(Z > (9 − 6)/1.1) = 1 − 0.9968 ≈ 0.003. The risk associated with the forecasting model is therefore only about 0.3%, so the multi-linear regression model can be used to predict the daily maximum CO level for the next week.
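    The tail probability quoted above can be checked numerically. The sketch below assumes the same Normal model as the report (mean 6 ppm, standard deviation 1.1 ppm) and uses only the standard library; the helper name normal_cdf is illustrative.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a Normal(mean, sd) variable."""
    z = (x - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


upper_limit_ppm = 9.0    # upper end of the allowable range
assumed_mean_ppm = 6.0   # mean assumed in the report
rmse_ppm = 1.1           # model RMSE in ppm, used as the standard deviation

# Probability that the CO level exceeds the upper limit under these assumptions.
p_exceed = 1.0 - normal_cdf(upper_limit_ppm, assumed_mean_ppm, rmse_ppm)
print(f"P(CO > 9 ppm) = {p_exceed:.4f}")
```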

    Recommendations

    To be on the safe side, the event management firm is advised not to conduct an outdoor event when the outdoor CO level is above 8 ppm (about one standard deviation below the 9 ppm threshold).

    Technical summary

    Data Preparation

    Missing values are represented as “-200” in several columns. We first counted the missing values in each column to make sure they were within acceptable limits for a good prediction. The NMHC variable had more than 90% missing values, so we excluded it from the model. All other variables had less than 10% missing values. We replaced each isolated missing value with the previous hour's value; for consecutive missing values we used the value from the same hour one week earlier, because the level was found to differ by day of the week. The data were then converted from hourly values to daily maximum values for the output and the predictor variables.
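    The imputation and aggregation rules described above can be sketched as follows. This is an illustration rather than the team's actual procedure: it assumes a contiguous hourly series that starts at midnight, treats any run of two or more “-200” markers as "consecutive", and uses hypothetical function names.

```python
MISSING = -200
HOURS_PER_WEEK = 7 * 24


def impute_hourly(values):
    """Fill -200 markers in an hourly series.

    Isolated gaps take the previous hour's value; runs of two or more
    take the value from the same hour one week earlier.
    """
    filled = list(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] != MISSING:
            i += 1
            continue
        j = i
        while j < n and values[j] == MISSING:
            j += 1                      # j is one past the end of the run
        run_length = j - i
        for k in range(i, j):
            if run_length == 1 and k >= 1:
                filled[k] = filled[k - 1]
            elif k >= HOURS_PER_WEEK:
                filled[k] = filled[k - HOURS_PER_WEEK]
            # otherwise no earlier value exists and the marker is left in place
        i = j
    return filled


def daily_maximum(hourly, hours_per_day=24):
    """Collapse an hourly series into one maximum per day."""
    return [max(hourly[d:d + hours_per_day])
            for d in range(0, len(hourly), hours_per_day)]


# Small demonstration: one week of 1.0, then an isolated gap and a two-hour gap.
demo = [1.0] * HOURS_PER_WEEK + [2.0, MISSING, 3.0, MISSING, MISSING, 4.0]
demo_filled = impute_hourly(demo)
print(demo_filled[HOURS_PER_WEEK:])
```

    Values early in the series that have no observation one week earlier stay marked as missing in this sketch and would need separate handling.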

    Post processed time series for predictor variables

    Forecasting methods selection

    SMOOTHING

    Among the smoothing techniques, Holt-Winters smoothing with parameters α = 0.2 and γ = 0.05 (no trend) gave the lowest RMSE.
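    For readers unfamiliar with the method, a Holt-Winters smoother without a trend component can be sketched as below. The report does not say whether the seasonal component was additive or multiplicative, nor which season length was used; this sketch assumes additive seasonality with a weekly cycle of 7 days and a simple initialisation from the first cycle. Only α = 0.2 and γ = 0.05 come from the report.

```python
def holt_winters_no_trend(y, season_length=7, alpha=0.2, gamma=0.05, horizon=7):
    """Additive Holt-Winters smoothing with level and seasonality, no trend.

    Returns forecasts for the `horizon` periods following the series `y`.
    """
    m = season_length
    if len(y) < m:
        raise ValueError("need at least one full seasonal cycle")

    # Initialise the level with the first-cycle mean and the seasonal
    # indices with the deviations from that mean.
    level = sum(y[:m]) / m
    seasonal = [y[i] - level for i in range(m)]

    for t in range(m, len(y)):
        pos = t % m                     # position of period t within the cycle
        old_season = seasonal[pos]
        level = alpha * (y[t] - old_season) + (1 - alpha) * level
        seasonal[pos] = gamma * (y[t] - level) + (1 - gamma) * old_season

    n = len(y)
    return [level + seasonal[(n + h) % m] for h in range(horizon)]


# A perfectly periodic series is reproduced by the forecast.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] * 4
toy_forecast = holt_winters_no_trend(toy_series)
print([round(f, 6) for f in toy_forecast])
```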

    REGRESSION

    Step 1: Captured the two layers of seasonality with dummy variables.

    Step 2: Forecasted the future values of C6H6, since it was the best predictor of CO concentration. However, relying on forecasted predictor values increases the error.

    Step 3: Therefore used lagged values instead: lag-8 C6H6 together with the lag-7 values of the T and AH variables.
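    The steps above amount to building one row of predictors per day. A minimal sketch of that construction follows; it assumes daily-maximum series aligned by day, picks Monday and January as the baseline categories for the dummies, and uses a hypothetical function name. The report does not state which baselines were used, and the model fitting itself is not shown.

```python
from datetime import date, timedelta


def build_rows(start, co, c6h6, temp, abs_hum):
    """Return (features, target) pairs for the lagged regression.

    features = [intercept, C6H6 lag 8, T lag 7, AH lag 7,
                6 day-of-week dummies, 11 month dummies]
    """
    rows = []
    for t in range(8, len(co)):          # the first 8 days lack a lag-8 value
        day = start + timedelta(days=t)
        weekday_dummies = [1 if day.weekday() == d else 0 for d in range(1, 7)]
        month_dummies = [1 if day.month == m else 0 for m in range(2, 13)]
        features = ([1.0, c6h6[t - 8], temp[t - 7], abs_hum[t - 7]]
                    + weekday_dummies + month_dummies)
        rows.append((features, co[t]))
    return rows


# Toy daily series of 20 days starting on the first day of the dataset.
n_days = 20
toy_co = [float(i) for i in range(n_days)]
toy_c6h6 = [10.0 + i for i in range(n_days)]
toy_temp = [20.0 + i for i in range(n_days)]
toy_ah = [0.5 + 0.01 * i for i in range(n_days)]
toy_rows = build_rows(date(2004, 3, 10), toy_co, toy_c6h6, toy_temp, toy_ah)
print(len(toy_rows), len(toy_rows[0][0]))
```

    Because every predictor is at least 7 days old, each row is available a full week before the day it predicts, which is what the one-week-ahead business objective requires.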

    NEURAL NETWORK

    We also implemented a neural network model, but its results were not as good as those of the simple smoothing models.

    Forecasting methods used


    a) Naïve forecast: We started with a naïve forecast obtained by lagging the data by 7 days, as shown in the curve below. It gave an RMSE of 1.89.
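    The naïve benchmark and its RMSE can be computed as in the sketch below. The function names and the toy series are illustrative; they are not the report's data.

```python
import math


def seasonal_naive(series, lag=7):
    """Return (actuals, forecasts), where each forecast is the value `lag` days earlier."""
    actuals = series[lag:]
    forecasts = series[:-lag]
    return actuals, forecasts


def rmse(actual, forecast):
    """Root mean square error between two equally long sequences."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


# Two weeks of toy daily maxima; only the first day of week two differs.
toy_daily = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
             2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
toy_actual, toy_naive = seasonal_naive(toy_daily)
print(rmse(toy_actual, toy_naive))
```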

    b) Multi-linear regression model (best model): We then fitted the multi-linear regression model, which produced the curve below. It gave an RMSE of 1.2.

    Regression Model

    Residual DF            331
    R²                     0.502713
    Adjusted R²            0.472665
    Std. Error Estimate    1.372884
    RSS                    623.8723

    Training Data Scoring - Summary Report

    Total sum of squared errors    RMS Error    Average Error
    623.8723                       1.331302     -2.14349E-15

    Validation Data Scoring - Summary Report

    Total sum of squared errors    RMS Error    Average Error
    45.07613                       1.205848     0.262738012
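    The figures in the regression summary are mutually consistent, which can be verified with the standard formulas. The sketch below infers the number of training observations from the reported RMS error; that count and the resulting parameter count are derived here and are not stated in the report.

```python
import math

rss = 623.8723          # residual sum of squares (training)
residual_df = 331       # residual degrees of freedom
r2 = 0.502713           # reported R²
train_rms = 1.331302    # reported training RMS error

# RMS error = sqrt(RSS / n), so n = RSS / RMS².
n_train = round(rss / train_rms ** 2)

# Standard error of the estimate = sqrt(RSS / residual DF).
std_error = math.sqrt(rss / residual_df)

# Parameters estimated (including the intercept) = n - residual DF.
n_params = n_train - residual_df

# Adjusted R² = 1 - (1 - R²) * (n - 1) / residual DF.
adj_r2 = 1 - (1 - r2) * (n_train - 1) / residual_df

print(n_train, n_params, round(std_error, 4), round(adj_r2, 4))
```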


    c) Holt-Winters without trend: We then applied the Holt-Winters method (without trend). Its RMSE was sqrt(1.678) ≈ 1.30, which is higher than that of the multi-linear regression model. (The curve is shown under ‘All models comparison’ below.)

    d) Linear regression (with seasonality only): We also fitted a linear regression with seasonality only. Its RMSE was 1.25 (as shown below), again higher than that of the multi-linear regression model. (The curve is shown under ‘All models comparison’ below.)

    e) Ensemble: We then built an ensemble by averaging the outputs of the multi-linear regression, Holt-Winters (without trend) and linear regression (with seasonality only) models. The ensemble produced the curves below, but its RMSE was higher than that of the multi-linear regression model.
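    An equal-weight ensemble of this kind is simply a day-by-day mean of the individual forecasts. A minimal sketch, with made-up forecast values and an illustrative function name:

```python
def ensemble_average(*forecasts):
    """Average several equally long forecast sequences element by element."""
    if not forecasts:
        raise ValueError("at least one forecast is required")
    if len({len(f) for f in forecasts}) != 1:
        raise ValueError("all forecasts must have the same length")
    return [sum(values) / len(values) for values in zip(*forecasts)]


# Illustrative forecasts from three models for two days (not the report's values).
regression_fc = [1.0, 2.0]
holt_winters_fc = [3.0, 4.0]
seasonal_only_fc = [5.0, 6.0]
combined = ensemble_average(regression_fc, holt_winters_fc, seasonal_only_fc)
print(combined)  # [3.0, 4.0]
```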

    Ensemble Residuals

    Validation Data Scoring - Summary Report

    Total sum of squared errors    RMS Error    Average Error
    48.53848                       1.251302     0.29832052

    Validation Error Measures

    Mean Absolute Percentage Error (MAPE)    24.88125501
    Mean Absolute Deviation (MAD)            0.959511376
    Mean Square Error (MSE)                  1.678859477
    Tracking Signal Error (TSE)              6.649563945
    Cumulative Forecast Error (CFE)          6.380332253
    Mean Forecast Error (MFE)                0.205817169
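    The error measures in this table follow standard definitions, sketched below with error defined as actual minus forecast and MAPE expressed in percent. The tracking signal is computed as CFE divided by MAD, which agrees with the reported values (6.3803 / 0.9595 ≈ 6.65). The function name and toy inputs are illustrative.

```python
def error_measures(actual, forecast):
    """Compute common forecast error measures for two equally long sequences.

    Assumes no actual value is zero (needed for MAPE) and that the
    forecasts are not all exact (needed for the tracking signal).
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    cfe = sum(errors)                                   # cumulative forecast error
    mad = sum(abs(e) for e in errors) / n               # mean absolute deviation
    return {
        "MAPE": 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
        "MAD": mad,
        "MSE": sum(e * e for e in errors) / n,
        "TSE": cfe / mad,                               # tracking signal
        "CFE": cfe,
        "MFE": cfe / n,                                 # mean forecast error (bias)
    }


toy_measures = error_measures([2.0, 4.0, 4.0], [1.0, 4.0, 6.0])
for name, value in toy_measures.items():
    print(f"{name}: {value:.4f}")
```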


    All models comparison

    Below is the plot of the actual data and the forecasts from the multi-linear regression model (best model), the ensemble, Holt-Winters (no trend) and linear regression (with seasonality only), with prediction intervals: