
Bell Labs Technical Journal 18(4), 135–158 (2014) © 2014 Alcatel-Lucent. Published by Wiley Periodicals, Inc. Published online in Wiley Online Library (wileyonlinelibrary.com) • DOI: 10.1002/bltj.21650

◆ Demand Forecasting in Smart Grids
Piotr Mirowski, Sining Chen, Tin Kam Ho, and Chun-Nam Yu

Data analytics in smart grids can be leveraged to channel the data downpour from individual meters into knowledge valuable to electric power utilities and end-consumers. Short-term load forecasting (STLF) can address issues vital to a utility but it has traditionally been done mostly at system (city or country) level. In this case study, we exploit rich, multi-year, and high-frequency annotated data collected via a metering infrastructure to perform STLF on aggregates of power meters in a mid-sized city. For smart meter aggregates complemented with geo-specific weather data, we benchmark several state-of-the-art forecasting algorithms, including kernel methods for nonlinear regression, seasonal and temperature-adjusted auto-regressive models, exponential smoothing and state-space models. We show how STLF accuracy improves at larger meter aggregation (at feeder, substation, and system-wide level). We provide an overview of our algorithms for load prediction and discuss system performance issues that impact real time STLF. © 2014 Alcatel-Lucent.

Introduction

Smart grid deployments carry the promise of allowing better control and balance of energy supply and demand through near real time, continuous visibility into detailed energy generation and consumption patterns. Methods to extract knowledge from near real time and accumulated observations are hence critical to the extraction of value from the infrastructure investment. On the demand side, widespread deployment of smart meters that provide frequent readings allows insight into continuous traces of usage patterns that are unique to each premise and each aggregate at different levels of the distribution hierarchy. This in turn enables better designs and triggers of demand response actions and pricing strategies, and provides input to the planning for growth and changes in the distribution network. Customers may also gain better awareness of their own consumption patterns.

In this paper we report a study on demand prediction, where we analyzed near real time power consumption monitored by tens of thousands of smart meters in a medium-size U.S. city. We developed and adapted several short-term forecasting methods for predicting the load at several levels of aggregation. In this context, short-term load forecasting (STLF) refers to the prediction of power consumption levels in the next hour, next day, or up to a week ahead. Within this time scope, one can have reliable weather forecasts, which provide important input to the prediction, as historically the load in this city is highly influenced by weather because electricity is used for both heating and cooling.


Scenario of the Study

In our study, the meters are deployed at customer locations and their readings are sampled every 15 minutes. The meter's network description includes its geographical location (latitude, longitude); date of installation and planned removal; type of customer served; as well as which pole, which feeder section, and which substation the meter is connected to. Weather data is collected by the utility company at the substation level, and consists of hourly temperature, wind speed, and wind chill temperature. Additional weather data, made available by the National Climatic Data Center (NCDC) and the National Oceanic and Atmospheric Administration (NOAA), provide additional measurements, such as humidity or sky cover at the location of the city airport, and hourly weather forecasts up to seven days ahead.

The load prediction algorithms that we have investigated and implemented are embedded in a module of a data analytic system being developed for the utility company. The module receives meter measurements and converts them to power usage values, and aggregates usage at different levels: individual meters, feeder sections, distribution substations, and at the system level. It then generates load forecasts at prediction horizons that range from 60 minutes (next-hour predictions) to 24 hours (next-day predictions), or even 168 hours (next-week predictions). As we will detail in a later section, the load forecasts operate independently for each meter and meter aggregate, communicating through a limited set of inputs/outputs with a database to read the latest weather forecasts and per-meter usage history and return corresponding load forecasts. This procedure can be parallelized, which enables some degree of asynchronous behavior within the prediction timeframe (the load forecasts are made with a granularity of one hour).

Short-term load forecasts generated at the meter (customer premise) level will provide the utility company with customer-level smart grid capabilities and help the company communicate with the customer about energy saving and billing issues. STLF generated at higher levels of aggregation (from feeder section to city-wide) will help in planning and operation of the relevant components of the electric grid.

Panel 1. Abbreviations, Acronyms, and Terms

ACF—Auto-correlation function
ARIMA—Auto-regressive integrated moving average
ARIMAX—Auto-regressive moving average with external inputs
ARMA—Auto-regressive moving average
DASARIMA—Dummy-adjusted seasonal auto-regressive integrated moving average
ENEL—Ente Nazionale d'Electricita
GARCH—Generalized auto-regressive conditional heteroscedastic
GDP—Gross domestic product
HWT—Holt-Winters model
i.i.d.—Independently and identically distributed
KPSS—Kwiatkowski, Phillips, Schmidt, and Shin
LOESS—Locally-weighted scatterplot smoothing
LSE—Least square error
LTLF—Long-term load forecast
MAPE—Mean absolute percentage error
ML—Machine learning
MTLF—Medium-term load forecast
NCDC—National Climatic Data Center
NOAA—National Oceanic and Atmospheric Administration
PACF—Partial ACF
RAM—Random access memory
SARIMA—Seasonal auto-regressive integrated moving average
SARIMAX—Seasonal auto-regressive integrated moving average with external inputs
SSM—State-space model
sSVR—Sigma SVR
STLF—Short-term load forecast
SVM—Support vector machine
SVR—Support vector regression
Wh—Watt hour
WKR—Weighted kernel regression


State-of-the-Art in Load Forecasting

Electric load forecasting is a mature field of investigation and the statistical methodologies have been implemented and deployed in industrial applications. Several meta-review papers provide a good overview of the demand prediction literature [13, 17, 25, 27] and identify three sub-fields, depending on the prediction horizon.

The prominent sub-field of investigation, short-term load forecasting (STLF), handles prediction horizons of one hour up to one week and typically relies on time series analysis and modeling. Daily, weekly, and sometimes yearly seasonality can be explicitly modeled. These methods consider variables such as date (e.g., day of week and hour of the day), temperature (including weather forecasts), humidity, temperature-humidity index, wind-chill index and, most importantly, historical load. Residential versus commercial or industrial uses are rarely specified.

Representative algorithms for STLF include time series models of linear dynamic systems involving load and weather regressors, typically relying on auto-regressive models such as the auto-regressive moving average (ARMA) [15] and the seasonal auto-regressive integrated moving average (SARIMA) [35]. State-space models offer further refinement to linear dynamics by defining additional (so-called "hidden" or "latent") state variables representing underlying load dynamics and seasonality, either by explicit variables as in the exponential smoothing methods [37] or in spline representations of daily load [19]. An alternative approach to modeling load and weather dynamics is to consider nonlinear models and a machine learning approach. A popular class of algorithms for STLF, which we do not report here but which has been used by several electric companies for system-wide predictions, is neural networks [22]. We focused instead on so-called kernel methods, starting from simple weighted kernel regression [6] all the way up to support vector machines [7] and kernel ridge regression. The section on Short-Term Load Forecasting Methodology provides more details about the methods that we implemented and investigated for this comparative study.

The remaining two fields of investigation, not covered in this paper, are medium-term load forecasting (MTLF), handling horizons of one week up to one year, and long-term load forecasting (LTLF), with predictions at horizons of multiple years. These methods typically proceed by regression on input variables, which, in addition to historical load and climate forecasts, typically incorporate demographic and economic factors such as the gross domestic product (GDP), real estate statistics, or population growth projections, as well as estimated demands of electric equipment.

Our key finding was that most of the research focused on large aggregated load data, typically at city level or even at country level, where most individual variations are averaged out by the effect of the law of large numbers. These methods were seldom tried on individual meters or at meter aggregate levels such as distribution feeders and substations, with a few exceptions such as recent work on STLF [3] in non-residential buildings or a clustering analysis of individual meters and aggregate load forecasting on feeder sections in a neighborhood of Seoul [32].

In this paper, we propose to continue bridging this gap by systematically evaluating when the state-of-the-art STLF algorithms break down, i.e., how the performance degrades when the number of considered meters goes down.

Other Datasets With Individual Meters

The specificity of the unique dataset that we investigated is that it contains energy consumption data from the system-wide (city-wide) level down to the level of individual meters. To our knowledge, few such complex datasets [32] have been investigated for short-term load forecasting, even though a few localized (e.g., building-specific) smart meter datasets have been studied [3].

The Italian energy provider Enel has deployed over 32 million smart meters. Remote monitoring is done by sending the readings from each customer's location [31] through a low-bandwidth network to data aggregators located at substations. Data is sampled and stored at 15 minute frequency. The readings are sent about every two weeks or every month. The motivation for the utility is the ability to leverage customized hourly-based tariffs [11] to price services for its customers. Although the individual meter data collected by ENEL has been used in studies on grouping individual customers based on clustering load profiles [16], we are unaware of analyses of these data from the perspective of load aggregates.

At the individual home level, there are several studies on peak load prediction [33] and on energy disaggregation of individual appliances in households [23]. However, these datasets are much smaller (typically < 100 meters) than our current dataset. The frequency of load measurement is also much higher (one measurement every few seconds or milliseconds) and is not typical of smart meters currently under deployment.

The rest of the paper is organized as follows. We begin by explaining the structure of our unique, hierarchical dataset of load consumption coming from a mid-size U.S. city. We then provide an overview of key algorithms for short-term load forecasting that exploit both historical load and weather data. The section titled "Short-Term Load Forecasting Results" details the essentially state-of-the-art STLF results that we obtain at the system (city) level and how STLF performance depends on the size of the load aggregate. We conclude with a discussion of performance, parallelism, and runtime issues raised by performing STLF at all levels of load aggregation; this discussion also introduces ensemble prediction that leverages multiple STLF algorithms for improved predictions.

Smart Grid Data

The specificity of our study on short-term load forecasting is in its unique dataset consisting of hundreds of thousands of individual meters interconnected in a hierarchy of feeders and substations. We provide details on how the meter data is aggregated and how we associate it with weather data.

System Hierarchy of Meters, Feeders and Substations

This study actually exploits two sets of data, collected in a mid-sized U.S. city (population of about 200,000 inhabitants) over the course of several years:

• System-wide data representing total city consumption (residential and industrial), collected over the course of 2007, 2008, and 2009 at hourly intervals. This dataset is typical of classic STLF studies.

• Individual meter readings coming from over a hundred thousand meters installed at customer locations. Out of this rich dataset, we use 32,000 mostly residential meters that satisfied a number of conditions detailed in a later section. This data was collected between January 2011 and June 2012.

The individual meters measure consumption (in Watt hours, Wh) at 15 minute intervals and are referenced in a meter−pole−feeder section−substation−district hierarchy. Meter measurements included in our analysis are those from the residential and small business customers (with contract demand under 5000 kW). Load predictions are made at these levels:

1. Single customer. Load measurements are derived from meter measurements by differentiation to obtain the value increment within a sampling time interval, divided over the duration of the sampling interval. Single-customer STLF performance and methods are not the object of this paper.

2. Feeder section. We define a feeder section as a subset of transformers connected to a feeder (such as serving a neighborhood). The system network topology considered in this paper consists of about 300 unique feeders employed in the time period 2011−2012. The historical load measurements at a given time and at the level of a feeder section are based on aggregating the load derived from all the meters connected to that feeder section.

3. Substation. Each substation serves a small geographical area. There were about 100 unique substations in the distribution network over the time period we consider. Aggregation at the substation level works in the same way as aggregation at the feeder section level.

4. System-wide. This highest level of aggregation comprises all the residential and small business meters indexed in the distribution hierarchy.

As explained in the next section, weather data are geo-located with the average (center) location of the meters connected to each feeder section or substation.


From Consumer Meters to Consistent Load Aggregates

The main problem in aggregating meter data into aggregates is the lack of consistency, across time, of the constituents of each aggregate. Fortunately, our dataset contained, in addition to meter readings, periodically updated metadata that described each meter, its connection to the feeder and substation, its geographical (latitude, longitude) coordinates, as well as customer-specific data. To obtain a consistent dataset for method evaluation, we used this metadata to discard meters that were disconnected and reconnected, keeping only meters that satisfied several consistency requirements: same owner, same feeder connection, and same geographical location throughout the evaluation period.

Although our per-meter dataset lists nearly a hundred thousand meters, only a subset of 32,000 meters satisfies the consistency requirements and contains non-zero meter readings. We concentrate on the load aggregates derived from these 32,000 meters. Load aggregates at feeder level are basically obtained by summing up the 15-minute load from all the meters connected to that feeder. Similarly, the load aggregates at a substation are determined by adding up loads at all the feeders connected to that substation. Each aggregate's load is then down-sampled to hourly time intervals.

While aggregating the loads at individual meters, we had to handle non-aligned time stamps, meter reading resets, missing values, or repeated readings, sometimes resorting to linear interpolation of the load. All processing for this 18-month dataset, representing about 100 GB of data, was done using Perl* and shell scripts. As shown in Figure 1, we were able to reconstruct a smooth load profile at the system-wide level.

Geo-Specific Weather Data

The area covered by the individual meters in our mid-sized U.S. city dataset encompasses a gently hilly area of about 40 km by 60 km, traversed by a river and subject to micro-climatic variations. The weather data (temperature and wind speed) are measured hourly at 22 substations across that area. It is common to measure a difference of 15 degrees (F) in temperature between weather substations.

Because the STLF methods detailed in the next section are temperature-dependent, we are prompted to interpolate the temperatures at the locations of all meters and all meter aggregates. This interpolation is done through the simple Kriging algorithm [14]. "Kriging" refers here to the temperature interpolation based on temperature regression against observed temperature values at a set of surrounding locations, each of them weighted according to spatial covariance. We employed the mGstat Matlab* toolbox for geo-statistics [18] and performed simple Kriging for each hour independently, using the latitude and longitude coordinates of about 400 feeder and substation meter load aggregates and the geographical coordinates and temperatures of 22 weather substations. A similar procedure was adopted for wind speed. The final result is illustrated in Figure 2, which shows the temperature interpolation at feeder and substation aggregate levels at two times of the year and illustrates the large temperature variations.
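The following NumPy sketch illustrates simple Kriging under an assumed Gaussian covariance model; the study itself used the mGstat Matlab toolbox, and the sill/range parameters and synthetic station layout below are placeholders rather than values from the paper.

```python
# Illustrative simple Kriging of hourly temperatures from 22 weather substations
# to ~400 feeder/substation centroids, with a Gaussian covariance model.
import numpy as np

def gaussian_cov(dist, sill=9.0, rng=0.3):
    """Spatial covariance as a function of distance (in degrees); parameters are illustrative."""
    return sill * np.exp(-(dist / rng) ** 2)

def simple_kriging(stations_xy, station_temps, targets_xy, mean_temp):
    """Predict temperatures at target points as covariance-weighted regressions on stations."""
    d_ss = np.linalg.norm(stations_xy[:, None, :] - stations_xy[None, :, :], axis=-1)
    d_ts = np.linalg.norm(targets_xy[:, None, :] - stations_xy[None, :, :], axis=-1)
    C = gaussian_cov(d_ss) + 1e-6 * np.eye(len(stations_xy))   # station-station covariance
    c0 = gaussian_cov(d_ts)                                    # target-station covariance
    weights = np.linalg.solve(C, c0.T)                         # one weight vector per target
    return mean_temp + weights.T @ (station_temps - mean_temp)

# Example with synthetic coordinates in the (longitude, latitude) box of Figure 2:
gen = np.random.default_rng(0)
stations = gen.uniform([-85.5, 34.95], [-84.9, 35.45], size=(22, 2))
temps = 55 + 3 * gen.standard_normal(22)
targets = gen.uniform([-85.5, 34.95], [-84.9, 35.45], size=(400, 2))
interpolated = simple_kriging(stations, temps, targets, mean_temp=temps.mean())
```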

Short-Term Load Forecasting Methodology

Time series modeling for short-term load forecasting (STLF) has been widely used over the last 30 years and a myriad of approaches have been developed. Kyriakides and Polycarpou [25] summarized these methods as follows:

1. Regression models that represent electricity load as a linear combination of variables related to weather factors, day type, and customer class.

2. Linear time series-based methods including the ARMA model, the auto-regressive integrated moving average (ARIMA) model, the auto-regressive moving average with external inputs (ARIMAX) model, the generalized auto-regressive conditional heteroscedastic (GARCH) model, and state-space models.

3. State-space models (SSMs) typically relying on a filtering- (e.g., Kalman) based technique and a characterization of dynamical systems.

4. Nonlinear time series modeling through machine learning methods such as nonlinear regression.

In the sections that follow, we will discuss the temperature regression and load residual, linear time series approaches, state-space models, and nonlinear time series models. Before delving into more detailed descriptions of learning algorithms, we will begin by outlining their commonalities in a section on the "Principles of Statistical Learning for Time Series."

Principles of Statistical Learning for Time Series

Supervised learning of the predictor. Supervised learning consists of fitting a predictive model to a training dataset (X, L), which consists of pairs (x_i, L_i) of data points or samples x_i and of associated target values L_i. In the case of load forecasting, samples x represent historical values of electric load, weather, or other types of data, collected over a short time interval (e.g., one day). The target labels L_i correspond to the electric load at the prediction horizon. The objective is to optimize a function f such that for each data point x_i, the prediction f(x_i) is as close as possible to the ground truth target L_i. The discrepancy between all the predictions and the target labels is quantified here by the mean absolute percentage error (MAPE), whose formula is given in "Short-Term Load Forecasting Results."
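As an illustration of how such (x_i, L_i) pairs can be assembled, the Python sketch below windows an hourly load series together with a temperature forecast at the horizon; the 24-hour history and this exact feature set are assumptions for illustration, not necessarily the paper's configuration.

```python
# Build supervised pairs: x_i = (previous 24 hourly loads, temperature at the horizon),
# L_i = load h hours ahead.
import numpy as np

def build_dataset(load, temperature, horizon=1, history=24):
    X, y = [], []
    for t in range(history - 1, len(load) - horizon):
        features = np.concatenate([load[t - history + 1: t + 1],    # past 24 h of load
                                   [temperature[t + horizon]]])     # temperature "forecast"
        X.append(features)
        y.append(load[t + horizon])                                 # target L at t + h
    return np.array(X), np.array(y)

# Example on a toy year of hourly data:
hours = 24 * 365
load = 10 + np.sin(np.arange(hours) * 2 * np.pi / 24)               # daily cycle
temp = 60 + 10 * np.sin(np.arange(hours) * 2 * np.pi / (24 * 365))  # seasonal cycle
X, y = build_dataset(load, temp, horizon=24)                        # X has 25 features per sample
```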

Figure 1. System load (in MW per 15 min) aggregated from about 32,000 individual meters over 18 months: (a) January 2011 through June 2012; (b) August 2011 through September 2011.


Training, validation and test sets. Good statistical learning algorithms are capable of extrapolating knowledge and of generalizing it on unseen data points. For this reason, we separate the known data points into a training (in-sample) set, used to define model f, and a test (out-of-sample) set, used exclusively to quantify the predictive power of f.

In the experiments reported in our section on short-term load forecasting results, we use one year of data for training and we test the model on the calendar month immediately following. When evaluating STLF on the 2007−2009 data, we retrain the model 24 times and provide predictions for January 2008 through December 2009. Using the 18-month aggregate dataset from January 2011 through June 2012, we trained six different STLF models for predicting results for January through June 2012.
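A rolling-origin evaluation of this kind could be scripted as sketched below; fit_model and evaluate_mape are placeholders standing in for any of the STLF methods and for the MAPE metric defined later.

```python
# Sketch of the rolling protocol: train on one year, test on the following calendar
# month, slide forward by one month, and retrain (24 folds for 2008-2009).
import pandas as pd

def rolling_monthly_evaluation(series: pd.Series, fit_model, evaluate_mape, n_folds=24):
    scores = []
    start = series.index.min()
    for fold in range(n_folds):
        train_start = start + pd.DateOffset(months=fold)
        train_end = train_start + pd.DateOffset(years=1)
        test_end = train_end + pd.DateOffset(months=1)
        train, test = series[train_start:train_end], series[train_end:test_end]
        model = fit_model(train)                 # retrained once per month
        scores.append(evaluate_mape(model, test))
    return scores
```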

Direct prediction versus iterated prediction in time series. In a time series prediction problem, as represented in Figure 3, the variable of interest (here, the load) might be present at the same time in the targets (output predictions) of the system and in the inputs, particularly when that variable is serially correlated or when it is produced by a dynamic system (e.g., the weather/climate model or a model of human activities). Knowing the history of the immediately previous time samples of that variable helps in that prediction.

In our study, we consider hourly load and weather data, and are interested in making load forecasts at prediction horizons ranging from h = 1 hour (next hour) to h = 168 hours (next week). Predictions at all these different horizons can be achieved in two different ways, through direct prediction and iterated prediction. Let us denote by t the current time and assume that we have access to historical load up to time t, as well as to weather forecasts up to time t + 168.

• Direct prediction. This predictor takes all the data known up to time t, for instance load values in the past 24 hours (L_{t−23}, L_{t−22}, … , L_{t−1}, L_t) and temperature forecasts at any horizon h, namely T_{t+h}, and directly predicts the load L_{t+h} that will occur h hours ahead (see Figure 3b). Direct prediction has a huge computational cost, because a different predictor needs to be trained for each prediction horizon (168 in our case).

• Iterated prediction. This predictor is simply designed to make one-step-ahead predictions, at horizon h = 1. As the predictive model moves forward in time, the outputs of the predictor (here, the load at time t + h) can in turn become its inputs (see Figure 3a), albeit introducing the prediction error directly into the model. This iterated prediction can be seen as the discretization of a dynamic system.

Figure 2. Example of temperature variations and of spatial temperature interpolation using Kriging, at two different times of the year: (a) temperatures (21,000 meters and weather stations) on 01-Jan-2011 05:45:00; (b) the same on 30-Jun-2011 00:00:00. The maps plot temperature (in degrees F) over longitude and latitude.
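The sketch below contrasts the two strategies with a generic, hypothetical one-step predictor; it only shows how iterated prediction feeds its own outputs back as inputs, whereas direct prediction would call a separate, horizon-specific model.

```python
# Iterated forecasting with a one-step-ahead model `predict_one_step` (hypothetical).
import numpy as np

def iterated_forecast(predict_one_step, load_history, temp_forecasts, horizon=24):
    """Roll a one-step predictor forward, feeding its predictions back as inputs."""
    window = list(load_history[-24:])            # last 24 observed hourly loads
    forecasts = []
    for h in range(horizon):
        x = np.concatenate([window[-24:], [temp_forecasts[h]]])
        next_load = predict_one_step(x)          # prediction one hour further ahead
        forecasts.append(next_load)
        window.append(next_load)                 # the output becomes an input
    return np.array(forecasts)

# A direct predictor would instead train one model per horizon h and call it once:
#   forecast_h = predict_at_horizon_h(np.concatenate([load_history[-24:], [temp_forecasts[h]]]))
```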

Temperature Regression and Load Residual

The simplest method for load forecasting relates the load to temperature. This is particularly relevant for residential and business-related consumption, where a significant portion of power usage might be due to electric heating in the winter and/or air conditioning in the summer.

In our data set, electricity was used to both heat and cool many buildings, in addition to gas heating. The total load decreases with temperature first and then increases, the minimum occurring at or around 66 degrees Fahrenheit. We observed that this relationship varies slightly throughout the day. We investigated two approaches for load regression.

Figure 3. Direct prediction versus iterated prediction in a time series: (a) iterated prediction on load with a 24-hour history of load values and the temperature at the prediction horizon; (b) direct prediction on load at horizon h = 3.

The first approach used local polynomial regression, i.e., locally-weighted scatterplot smoothing (LOESS) [8], to fit a surface of load on temperature and time of day (see Figure 4). Specifically, for the fit at point x, a polynomial surface of degree 1 or 2 is made using points in a neighborhood of x, weighted by their distance from x, to minimize the least square error (LSE). The size of the neighborhood is controlled by a parameter α, chosen to be 0.2 in this situation for a balance between smoothness and goodness of fit. The MAPE for this fit is between six and seven percent for system-wide prediction with an average load of approximately 0.7M kWh (2007−2009 system-wide load) when the surface is fitted to the previous full year's data.

This fit can be written as

log(L) = s(T, H) + ε

where L is the hourly load, T is the temperature, H is the hour of day, s is the smooth surface, and ε is the residual. The log transformation is used here to make the distribution more Gaussian-like and to stabilize the variance, such that the subsequent modeling assumptions hold. Note that the residuals ε are not independently and identically distributed (i.i.d.) and will continue to exhibit a daily cyclic pattern. The SARIMA, SSM, and Holt-Winters model (HWT) methods detailed in the following sections are applied to these residuals ε, not to the load time series itself.

A second method relies on fitting a cubic polynomial directly on the temperature values, using 24 sets of coefficients {a_0(H), a_1(H), a_2(H), a_3(H)}, one set for each hour H of the day. Temperature regression using cubic polynomials is a simple benchmark for STLF [21]:

L = a_0(H) + a_1(H) T + a_2(H) T^2 + a_3(H) T^3 + ε
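A minimal NumPy version of this hour-by-hour cubic benchmark is sketched below on synthetic data (illustrative only).

```python
# Fit 24 cubic polynomials of load on temperature, one per hour of the day.
import numpy as np

def fit_hourly_cubics(temperature, load, hour_of_day):
    """Return a (24, 4) array of cubic coefficients, one row per hour H."""
    coeffs = np.zeros((24, 4))
    for H in range(24):
        mask = hour_of_day == H
        coeffs[H] = np.polyfit(temperature[mask], load[mask], deg=3)  # [a3, a2, a1, a0]
    return coeffs

def predict_hourly_cubics(coeffs, temperature, hour_of_day):
    return np.array([np.polyval(coeffs[H], T) for T, H in zip(temperature, hour_of_day)])

# Synthetic example with a U-shaped load/temperature relationship around 66 F:
hours = np.arange(24 * 365)
hour_of_day = hours % 24
temp = 65 + 20 * np.sin(2 * np.pi * hours / (24 * 365)) + 8 * np.sin(2 * np.pi * hours / 24)
load = 5 + 0.01 * (temp - 66) ** 2 + 0.5 * np.sin(2 * np.pi * hour_of_day / 24)
coeffs = fit_hourly_cubics(temp, load, hour_of_day)
residual = load - predict_hourly_cubics(coeffs, temp, hour_of_day)   # input to SARIMA/SSM/HWT
```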

Note that we may use the apparent temperature, or the wind-chill temperature, or an average of both, instead of the raw temperature. The apparent temperature (temperature taking into account the nonlinear "heat index" due to humidity) may improve the fit in some cases, particularly during the hot and humid summer season [36]. Similarly, the wind-speed-dependent wind chill temperature may help for winter load forecasts. We make our choices based on cross-validation performance.

Hobby et al. [20] study the residential energy consumption measured at an aggregate of all residential meters by separating the weather- and illumination-dependent load consumption from the residual consumption. To fit the weather- and illumination-dependent component, they use 24 cubic spline surfaces, one per hour of the day, indexed by apparent temperature and illumination. They observe a strong cubic dependency of load on temperature and an almost negligible linear term due to illumination.

Linear Time Series Approaches

Linear time series models directly exploit the historical values of the load, and enable us to make iterated load forecasts thanks to previously observed load values. Gross and Galiana [15] wrote the reference paper on short-term load forecasting using statistical linear time series models, in particular the auto-regressive moving average (ARMA) model. These models have later been extended to cope with seasonality and non-stationarity in so-called seasonal auto-regressive integrated moving average (SARIMA) models. Further extensions have been made in the work of Soares and Medeiros [35], who compared a two-level seasonal auto-regressive model and a dummy-adjusted seasonal auto-regressive integrated moving average (DASARIMA) model on Brazilian electric load data.

Figure 4. Dependency among the temperature, the time of the day, and the load, modeled as a smooth surface of log(load + 1) over time of day and lag 1 hr temperature (F). Load is expressed on the logarithmic scale and the temperature is taken one hour prior to the load value.

Seasonal Auto-Regressive Integrated Moving Average Models

In seasonal auto-regressive integrated moving average (SARIMA) models, the seasonality component comes from the daily load cyclic pattern. In this paper we apply the SARIMA model to residuals from the LOESS fit (we refer to this method as "residual SARIMA"). We also considered the SARIMAX model, i.e., SARIMA with "exogenous" variables, namely temperature. However, the temperature coefficient is difficult to interpret and the model offers poor prediction accuracy compared to residual SARIMA. In contrast, residual SARIMA explicitly models the relationship between the time series and the exogenous variable. It is especially appealing when changes in the exogenous variable(s) are concurrent with changes in the original time series, which is the case with temperature and power usage.

A SARIMA model, written SARIMA(p, d, q) × (P, D, Q)_S, has seven order parameters. We can write the model as:

Φ_P(B^S) φ_p(B) (1 − B)^d (1 − B^S)^D X_t = Θ_Q(B^S) θ_q(B) ε_t

where B is the lag operator that satisfies B^i(X_t) = X_{t−i}; Φ_P(B^S), Θ_Q(B^S), and (1 − B^S)^D are the corresponding autoregressive, moving average, and differencing parts for the seasonal component, while φ_p(B), θ_q(B), and (1 − B)^d are the corresponding autoregressive, moving average, and differencing parts for the non-seasonal component. S is the period length (S = 24 with hourly load readings and a daily cyclic pattern).

The procedure for determining the order parameters follows Box-Jenkins procedures by examining the auto-correlation function (ACF) and partial ACF (PACF) of the differenced and original time series. Investigating the order parameters on the one-year training data, we concluded that d = 1, D = 1, p = 0, P = 0, while q = 1 and Q = 1, essentially ignoring the auto-regressive component. Stationarity of the differenced data was checked using the Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) test [2, 24]. The p-value was greater than 0.1, suggesting stationarity of the differenced data.

Note that for the residual SARIMA, shortening the training period for estimating the parameters of the model, from one year down to the last month immediately preceding the prediction (test) period, offered a better fit.
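For illustration, a residual SARIMA(0, 1, 1) × (0, 1, 1)_24 of this kind can be fitted with the Python statsmodels library as sketched below; the paper's models were fitted in R, and the synthetic residual series here is only a stand-in.

```python
# Residual SARIMA sketch: SARIMA(0,1,1)x(0,1,1) with period 24 on hourly load residuals.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

gen = np.random.default_rng(1)
hours = np.arange(24 * 60)                                  # ~2 months of hourly residuals
residuals = 0.05 * np.sin(2 * np.pi * hours / 24) + 0.02 * gen.standard_normal(hours.size)

model = SARIMAX(residuals,
                order=(0, 1, 1),                            # non-seasonal (p, d, q)
                seasonal_order=(0, 1, 1, 24))               # seasonal (P, D, Q) with S = 24
fitted = model.fit(disp=False)
residual_forecast = fitted.forecast(steps=24)               # next-day residual forecast
# The load forecast is recovered by adding back the temperature/hour-of-day fit
# and inverting the log transform.
```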

State-Space Models

The state-space model (SSM) is an online adaptive method for forecasting. SSMs introduce hidden (unknown) variables representing the quantity to be estimated. The main state-space model used across scientific disciplines is the Kalman filter. In their review paper, Pigazo and Moreno [30] described how the Kalman filter can predict electric load values from the previous load measurements, and then update that prediction using other regressors such as temperature data. Harvey and Koopman [19] modeled load time series through cubic spline interpolation on intra-daily and intra-weekly patterns, where the spline coefficients were time-varying and updated using a Kalman filter. Dordonnat et al. [10] defined a custom state-space that took into account calendar days and used it to predict nationwide French electric load. Taylor and McSharry [37] reformulated the state-space model as a multi-level linear time series model, which can handle weekly and daily seasonality in electric load.

State-space model on the spline fit of load residuals. The SSM in [19] does not require offline training and updates the model parameters in real time as each reading comes in. This method has been successfully applied to the online monitoring of time-varying network streams [4].

In that SSM, the computation for each update is inexpensive thanks to Kalman filtering, making it an ideal method for online forecasting. It uses B-splines to model the daily cyclic pattern, as the nonlinear trends in the load time series can be transformed into a linear model with respect to the spline basis. Moreover, a cyclic spline basis ensures the periodic constraint (namely, the daily cyclic pattern of the load). We place K equally spaced knots, or K−1 spline bases, to cover a full day (here K = 8 for 24 hourly load readings on a given day).

The state-space model consists of two equations: the observation equation, which generates the load data from the hidden variable, and the state equation, which explains dynamics in the hidden (spline coefficient) data. The observation equation is:

ε_t = B α_t + u_t,   u_t ~ N(0, σ_u I)

where ε_t is the one-day (24-hour) time series of load residuals on day t; B is a 24 by K matrix of B-spline bases, each column corresponding to one spline; α_t is the vector of coefficients for the splines; and u_t is a vector of i.i.d. Gaussian white noise with standard deviation σ_u. The vector α_t characterizes the daily pattern on day t.

To accommodate day-to-day variations in the daily pattern α_t, we use a random walk for the spline coefficients, specified by the state equation:

α_t = α_{t−1} + v_t,   v_t ~ N(0, σ_v I)

where the spline coefficients on day t are equal to those on day t−1, plus i.i.d. white noise of variance σ_v.

The above SSM is fitted online with a Kalman filter, such that the updating is done for each incoming data point. This ensures that forecasts are done in an online fashion. Hyper-parameters are estimated empirically by fitting them to spline coefficients for individual days.

We also applied this approach directly to the log-transformed load without the regression on temperature (results not reported here). The performance is slightly worse than using the residuals but still reasonable. This approach would work well if temperature forecasts were unavailable or unreliable.
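The sketch below shows the Kalman predict/update cycle for such a random-walk spline-coefficient state, using a Gaussian-bump basis as a stand-in for the cyclic B-splines and illustrative noise levels; it is not the authors' implementation or their hyper-parameter estimation.

```python
# Kalman filtering of a random-walk state of K spline coefficients, observed each day
# through a 24 x K basis matrix B (here a periodic Gaussian-bump stand-in for B-splines).
import numpy as np

K = 7                                                # number of cyclic basis functions
hours = np.arange(24)
centers = np.linspace(0, 24, K, endpoint=False)
gap = np.abs(hours[:, None] - centers[None, :])
B = np.exp(-0.5 * (np.minimum(gap, 24 - gap) / 2.0) ** 2)   # 24 x K periodic basis

sigma_u, sigma_v = 0.05, 0.01                        # observation and state noise (illustrative)
alpha, P = np.zeros(K), np.eye(K)                    # state mean and covariance

def kalman_day_update(alpha, P, day_residuals):
    """One step: random-walk prediction, then correction with a full day of residuals."""
    P_pred = P + sigma_v**2 * np.eye(K)              # state equation: alpha_t = alpha_{t-1} + v_t
    S = B @ P_pred @ B.T + sigma_u**2 * np.eye(24)   # innovation covariance
    K_gain = P_pred @ B.T @ np.linalg.solve(S, np.eye(24))
    innovation = day_residuals - B @ alpha           # observation equation: eps_t = B alpha_t + u_t
    alpha_new = alpha + K_gain @ innovation
    P_new = (np.eye(K) - K_gain @ B) @ P_pred
    return alpha_new, P_new

# Filter 30 days of synthetic residuals; the next-day residual forecast is B @ alpha.
gen = np.random.default_rng(0)
for day in range(30):
    day_resid = 0.1 * np.sin(2 * np.pi * hours / 24) + 0.02 * gen.standard_normal(24)
    alpha, P = kalman_day_update(alpha, P, day_resid)
next_day_forecast = B @ alpha
```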

Holt-Winters double seasonal exponential smoothing. The HWT model [37] is a variation on the state-space model designed specifically for data that have two seasonalities: an intra-day (24 h) seasonality, and an intra-week (168 h) seasonality. The state equations involve three state variables, essentially corresponding to the smoothing, daily, and weekly effects in the data.

y_t(k) = l_t + d_{t−m1+k1} + w_{t−m2+k2} + φ^k e_t

e_t = y_t − (l_{t−1} + d_{t−m1} + w_{t−m2})

l_t = l_{t−1} + α e_t

d_t = d_{t−m1} + δ e_t

w_t = w_{t−m2} + ω e_t

In the above equations, y is the estimated value of the load, l is the exponentially smoothed first-order auto-regressive component of the load, d is the intra-day seasonal component of the load (m1 = 24 hours), w is the intra-week seasonal component of the load (m2 = 168 hours), and e is the exponentially decaying error term. The values of the state variables are initialized by running the model on about one month of data. The four coefficients α, δ, ω, and φ are fitted by least square optimization (i.e., by minimizing the error between the actual observed load and the predicted load), and we use a simple heuristic search using genetic algorithms to find their optimal values.
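A direct transcription of these recursions is sketched below; the smoothing coefficients are fixed for illustration rather than fitted by the least-squares/genetic-algorithm search described above, and the initialization is a crude stand-in.

```python
# Holt-Winters double seasonal recursions with intra-day (m1 = 24) and intra-week
# (m2 = 168) seasonal states.
import numpy as np

def hwt_filter(y, alpha=0.1, delta=0.2, omega=0.2, phi=0.9, m1=24, m2=168):
    """Run the HWT state recursions over hourly series y; return one-step-ahead forecasts."""
    l = y[:m2].mean()                                # smoothed level (crude initialization)
    d_hist = list(np.tile(y[:m1] - l, m2 // m1))     # intra-day seasonal states, tiled
    w_hist = list(np.zeros(m2))                      # intra-week seasonal states
    forecasts, e = np.zeros_like(y, dtype=float), 0.0
    for t in range(m2, len(y)):
        forecasts[t] = l + d_hist[-m1] + w_hist[-m2] + phi * e    # forecast before seeing y[t]
        e = y[t] - (l + d_hist[-m1] + w_hist[-m2])
        l = l + alpha * e                            # l_t = l_{t-1} + alpha * e_t
        d_hist.append(d_hist[-m1] + delta * e)       # d_t = d_{t-m1} + delta * e_t
        w_hist.append(w_hist[-m2] + omega * e)       # w_t = w_{t-m2} + omega * e_t
    return forecasts

# Toy series with daily and weekly cycles:
t = np.arange(24 * 7 * 8)
y = 10 + 2 * np.sin(2 * np.pi * t / 24) + np.sin(2 * np.pi * t / 168)
one_step_ahead = hwt_filter(y)
```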

Nonlinear Time Series Models

Machine learning (ML) techniques focus on learning a prediction function that takes as input the historical load and other data such as weather, and outputs the predicted load. Unlike the statistical methods reviewed in the previous section, the ML methods chosen in our study enable us to learn a nonlinear prediction function. Parametric machine learning techniques focus on tuning the parameters of the load prediction function. Khontanzad et al. [22] described a state-of-the-art implementation of neural networks for load forecasting that has been used by several electrical companies. Fan and Chen [12] employed self-organizing maps to cluster the load and weather data into several regimes, before using them as inputs to a nonlinear regression function.


We focus in this paper on kernel-based methods, learning the relationship between data samples: in this case, each sample corresponds to a pair of historical load and weather data, taken over a short time interval, and the electric load at the next time point. We compared three standard, proven techniques: weighted kernel regression (WKR) [6], support vector regression (SVR) [7], and kernel ridge regression with learnable feature coefficients.

In addition to kernel methods, we investigated simple neural network models with one hidden layer. Although the latter achieved good performance at one-hour prediction horizons, they would perform poorly on iterated forecasts and the error would rapidly increase after a few iterations of the neural network predictor (results not reported). Research on modeling dynamic systems using one-hidden-layer neural networks indeed showed that these nonlinear models are very sensitive to noise and that they can generate predictions that diverge from the training set patterns. More complex neural network models that provide stable iterated predictions and are capable of learning long-term dependencies [1] are beyond the scope of this paper. In parallel, it has been proven experimentally that kernel methods such as SVR provide more stable iterated predictions on highly nonlinear time series than the basic embodiment of neural networks [26]. While they do not model long-term dependencies, they at least provide a solution that is bounded and stays within the patterns seen in the training set. This statement does not apply to more complex neural network architectures (that involve state-space models and learning hidden representations of time series).

Weighted kernel regression. Weighted kernel regression (WKR) [28] is the simplest among the non-parametric regression algorithms. It consists of computing the Euclidean distance metric between the input sample x and each data point sample y(t) at time t in the training set and then using it in a Gaussian kernel function k(x, y(t)) that can be seen as a measure of symmetric "similarity" between the two samples x and y(t). The Gaussian kernel takes a value equal to one when x and y(t) are identical and therefore when their distance is equal to zero. The kernel function takes decreasing values down to zero as the input sample x becomes "dissimilar" from the training point y(t) and therefore as their distance increases:

k(x, y(t)) = exp( −(1/2) Σ_k (x_k − y_k(t))² / σ² )

where the sum runs over the input features k.

The kernel function is used as the weight of data point y(t) in the decision function, which is a weighted interpolation over the entire training dataset:

L̂ = Σ_t L_t k(x, y(t)) / Σ_t k(x, y(t))

WKR assumes smoothness within the input data, controlled through a "spread" coefficient σ that depends on the dataset and is fitted by n-fold cross-validation on the training data. We resorted to five-fold cross-validation on five non-overlapping sets. More specifically, for each choice of hyperparameters, we used 80 percent of the training data to fit the model and the remaining 20 percent to compute the prediction performance, and repeated that step five times.
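In code, the WKR prediction is simply a Gaussian-weighted average of training loads, as in the NumPy sketch below (sigma would be chosen by the five-fold cross-validation just described).

```python
# Weighted kernel regression: kernel-weighted interpolation over the training set.
import numpy as np

def wkr_predict(X_train, L_train, x_query, sigma=1.0):
    """Predict the load for x_query as a Gaussian-kernel-weighted average of training loads."""
    sq_dist = np.sum((X_train - x_query) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dist / sigma**2)      # k(x, y(t)) for every training sample
    return np.sum(weights * L_train) / np.sum(weights)

# Usage with (X, y) pairs such as those built from load and temperature histories:
# L_hat = wkr_predict(X_train, y_train, x_new, sigma=5.0)
```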

Support vector regression. Support vector machines (SVMs) [9, 34] are a popular and efficient statistical learning tool that can be qualified as mostly non-parametric. SVMs are also called maximum margin classifiers, because their decision boundary is, by construction, as far as possible from the training data points, so that they remain well separated according to their labels. Maximum margin training enables better generalization of the classifier to unseen examples.

The work on support vector regression (SVR) by Chen et al. [7] was indeed the winning entry to a competition on the prediction of electric load and can be considered as a state-of-the-art method. SVR relies on the definition of a kernel function k(x, y(t)) and on using a decision function f(x) for a sample x that is defined in terms of the kernel function between x and the data points in the training set, but involving a minimal, sparse set of support vectors S = {y(t)} that are each given a weight α_t. Learning in SVM corresponds to finding a minimal set S of support vectors that minimizes the error on the training labels:

L̂ = Σ_t L_t α_t k(x, y(t))


We cross-validated the SVM's regularization coefficient C as well as the Gaussian spread coefficient using five-fold cross-validation.
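A comparable SVR baseline can be set up with scikit-learn as sketched below (the paper's implementation differs); C and the kernel spread (gamma) are selected here by five-fold cross-validation on synthetic stand-in data.

```python
# RBF-kernel support vector regression with five-fold cross-validated C and gamma.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

gen = np.random.default_rng(0)
X = gen.standard_normal((500, 25))                 # e.g., 24 past loads + 1 temperature forecast
y = 0.5 * X[:, -1] + X[:, :24].mean(axis=1) + 0.1 * gen.standard_normal(500)

grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]},
                    cv=5, scoring="neg_mean_absolute_error")
grid.fit(X, y)
load_forecast = grid.best_estimator_.predict(X[:24])
```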

Kernel ridge regression. Kernel ridge regression is a generalized version of support vector regression. One can see it as a trivial extension of SVR, where the Gaussian spread coefficient is tuned for each input regressor (feature) separately using a gradient-descent optimization procedure and cross-validation [5]. This method, which we call sigma-SVR, differs from SVR by this simple equation:

k(x, y(t)) = exp( −(1/2) Σ_k (x_k − y_k(t))² / σ_k² )

where each feature k now has its own spread coefficient σ_k.

Short-Term Load Forecasting Results

In our investigations, we used the standard demand prediction metric, mean absolute percentage error (MAPE), which, for a set of N load values L_t (e.g., in Watt hours, Wh) and associated load forecasts L̂_t, is defined as:

MAPE = (1/N) Σ_{t=1}^{N} |L_t − L̂_t| / |L_t|
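This metric is a one-line computation; for example, in NumPy:

```python
# Direct transcription of the MAPE definition above.
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error between observed loads and their forecasts."""
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return np.mean(np.abs(actual - forecast) / np.abs(actual))

# mape([100, 110, 95], [102, 108, 99]) is about 0.027, i.e., roughly 2.7 percent.
```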

In previously published STLF studies on city-wide and country-wide load forecasting, the MAPE typically was expected to range from a one to three percent error at next-hour horizon forecasts to about a four percent error at next-day horizons.

System-Wide Predictions

In a first series of experiments, we compared the performance of three iterated predictors relying on nonlinear time series models based on kernel methods: weighted kernel regression (WKR), support vector regression (SVR), and sigma-SVR (sSVR), on system-wide load from 2007 to 2009. We would train the predictors on one year of load and weather forecasts, and make predictions for the following month, repeating this procedure 24 times for January 2008 through December 2009 and averaging the MAPE performance, for each prediction horizon, over all 24 months. Our approach essentially simulated an STLF system retrained every month to fit mid- to long-term evolutions of the city-wide load consumption and of the climate.

Unsurprisingly, as reported in Figure 5, the more complex kernel method that enabled us both to weigh each input feature (e.g., load at a specific time, time of day, temperature or humidity forecast) individually and to select the support vectors, namely sSVR, achieved the best results (MAPE = 1.2 percent) at the one-hour horizon and MAPE = 4.7 percent after h = 24 hours. The Steadman apparent temperature would slightly outperform raw temperature (decreasing the MAPE).

We then compared the performance of iterated sSVR to the direct prediction using sSVR, as well as to the remaining, linear, models, namely Holt-Winters double-exponential smoothing (HWT), state-space models with B-spline fit on the load residue (SSM), and seasonal auto-regressive integrated moving average (SARIMA), all operating on the load residue after fitting the load on temperature and hour of the day (see "Temperature Regression and Load Residual"). As can be seen in Figure 6 and Figure 7, which provide details on the system-wide aggregated load from 2012, the overall best algorithms were HWT and sSVR. HWT achieved MAPE = 4 percent performance at h = 24 on the 2008−2009 dataset, slightly outperforming sSVR. The performance on the aggregated (2012) dataset was worse, because the set of meters considered (32,000) was only a subset of the total city load. Figure 8 and Figure 9 show what these predictions actually look like, at h = 1 and at h = 24 respectively.

Performance on Meter Aggregates

We observed that the load forecasting performance seemed to worsen for lower-level aggregates and tried to verify the hypothesis that, independently of the method, aggregates with large forecast errors are those with very few meters. As can be seen in Figure 10, we trained about 400 STLF predictors on different meter aggregates (feeders, substations, and system-wide) and plotted the performance (MAPE at h = 1) versus the size of the meter aggregate (which we can measure, for instance, as the number of meters interconnected to that aggregate, or as the peak hourly load measured at that meter aggregate). The MAPE would decrease as a function of meter aggregate size (the more meters in an aggregate, the better the MAPE). We hypothesize that aggregates connected to more meters tend to behave in a more predictable way: the effect of weather (temperature) is prominent and there is an averaging effect due to the large sample (hundreds or thousands) of meters. Some meter aggregates (see Figure 11) can nevertheless be relatively well predictable, despite their small size (here 12 meters).

At the substation or system level, accurate forecasts can be useful input to strategic cost-saving decisions. At the level of individual meters, the utility is not interested in predicting precisely how much electricity will be used every hour, but rather in detecting large spikes of abnormal activity. Such abnormal usage spikes could be indicative of a system failure in the home (e.g., a malfunctioning heat pump), and could be useful information to the customer. Accurate forecasts can serve as baselines for detecting such anomalies.

Discussion

In this section, we discuss the practical considerations for the implementation and deployment of a load forecasting system, including modularity and parallelization, running time considerations, and robustness of the forecasts.

Independent STLF for Each Meter Aggregate

As explained previously, the meters, feeders, and substations considered in this study of a mid-sized U.S. city are interconnected in a hierarchical distribution network.

Figure 5. System-wide load forecasts using kernel methods for nonlinear time series modeling (sigma-SVR, WKR, and SVR, using 24 h or 8 h load histories with raw or Steadman apparent temperature and, in one case, humidity). These curves are the average of monthly MAPE performance (%) over two years (2008–2009), plotted against the prediction horizon (h).

Such a rich hierarchy invites a study of the correlations or even interdependencies among all metered electrical components. The obvious advantage is in exploiting redundancies among all the meters (as households in the same urban area and under identical climatic conditions might present similar load consumption profiles).

From a systems perspective, it may be desirable to make the load prediction component as modular as possible and to forecast load independently for each meter or load aggregate. In this study, all the predictions at the same level of aggregation are considered independent from the point of view of load forecasting, despite the correlations between each feeder connected to a given substation and the substation itself.

There are several justifications for this approach. First of all, the meters in our system often are updated asynchronously or even suffer downtimes, not necessarily related to power outages. It could therefore be very detrimental, for the operation of the entire system, to make it wait for synchronous meter updates. Here, we allow for asynchronous data updates and load forecasts within the prediction timeframe, which happens at a granularity of one hour.

Figure 6. System-wide load forecasts using various families of prediction algorithms (HWT on residual load from the apparent temperature fit, iterated and direct sigma-SVR using apparent and wind chill temperature and load, and SSM and SARIMA on residual load from the temperature fit at h−1), using the total load consumption of a mid-sized U.S. city. The curves represent average monthly MAPE (%) over two years from 2008 to 2009, plotted against the prediction horizon (h).

Page 16: Demand Forecasting in Smart Grids - NYU Computer Sciencemirowski/pub/Mirowski_BLTJ2014_Demand... · DOI: 10.1002/bltj Bell Labs Technical Journal 137 State-of-the-Art in Load Forecasting

150 Bell Labs Technical Journal DOI: 10.1002/bltj

Secondly, enforcing independence at each level of aggregation enables us to trivially parallelize the operation of the STLF modules for all the aggregates. Each module's only points of input/output are database accesses, to read the latest meter historical load data as well as the associated geo-specific weather data, and to return forecasts at different horizons.
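
A minimal sketch of such a module is given below (our illustration only: the sqlite3 backend, table names, and column names are hypothetical stand-ins for the actual database, and the last-value predictor is a placeholder for HWT, sigma-SVR, SARIMA, or SSM):

```python
import sqlite3

def forecast_aggregate(db_path, aggregate_id, horizons=range(1, 25)):
    """Illustrative per-aggregate STLF module: read the latest load and weather
    history for one aggregate, produce forecasts, and write them back.
    Any other aggregate can be processed independently of this one."""
    conn = sqlite3.connect(db_path)
    load = conn.execute(
        "SELECT ts, kwh FROM load_history WHERE aggregate_id = ? ORDER BY ts",
        (aggregate_id,)).fetchall()
    weather = conn.execute(
        "SELECT ts, temperature FROM weather_history WHERE aggregate_id = ? ORDER BY ts",
        (aggregate_id,)).fetchall()

    # Placeholder predictor: persist the last observed value at every horizon.
    # (A real predictor would combine `load` with `weather`.)
    last_kwh = load[-1][1] if load else 0.0
    forecasts = [(aggregate_id, h, last_kwh) for h in horizons]

    conn.executemany(
        "INSERT INTO load_forecasts (aggregate_id, horizon_h, kwh) VALUES (?, ?, ?)",
        forecasts)
    conn.commit()
    conn.close()
```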

Running Time and Performance

The parallelism enabled by the independent STLF operations facilitates the implementation of our system in a multi-threaded environment. Essentially, the process for generating hourly load forecasts at an aggregate level can be run as soon as all the hourly weather and meter data for the aggregate's components have been collected. The system does not need to wait for the completion of all the prediction processes; each of them takes care of updating the database with forecasts independently.
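
The scheduling pattern can be sketched as follows (a simplified illustration; the worker below is a stand-in for the per-aggregate STLF module described above, and the aggregate identifiers are made up):

```python
from concurrent.futures import ProcessPoolExecutor

def run_stlf_module(aggregate_id):
    """Stand-in for one independent STLF module (see the previous sketch):
    it reads this aggregate's data, forecasts, and writes results on its own."""
    return f"forecasts written for aggregate {aggregate_id}"

def hourly_forecast_round(ready_aggregates, max_workers=16):
    """Launch one forecasting job per aggregate whose hourly meter and weather
    data have arrived; jobs complete and persist their output independently."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {a: pool.submit(run_stlf_module, a) for a in ready_aggregates}
        for aggregate_id, future in futures.items():
            exc = future.exception()  # waits for this job only
            if exc is not None:
                print(f"aggregate {aggregate_id} failed: {exc}")

if __name__ == "__main__":
    hourly_forecast_round(["feeder_017", "substation_03", "system_wide"])
```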

For model development purposes, we have been using a 16-core, 2.3 GHz Intel Xeon* Linux* server with 24 GB of random access memory (RAM), running Ubuntu*. The deployment system is a 32-core, 128 GB RAM Linux system running Red Hat. The sigma-SVR and HWT algorithms are implemented in Matlab (or its open source clone, Octave), and the SARIMA and SSM methods run in R.

Figure 7. Comparison of different load forecasting algorithms for 32,000-meter load aggregates. The curves represent average monthly MAPE performance over the six months from January to June 2012. (HWT—Holt-Winters model; MAPE—mean absolute percentage error; SARIMA—seasonal auto-regressive integrated moving average; SSM—state-space model; SVR—support vector regression. Plot: system-wide load predictions for Jan.–Jun. 2012, per-month MAPE averages; x-axis: prediction horizon in hours (1–24); y-axis: MAPE (%). Curves: HWT on residual load from apparent temperature fit at h−1; iterated sigma-SVR using apparent temperature, wind chill temperature, and load; direct sigma-SVR using apparent temperature, wind chill temperature, and load; SSM on residual load from temperature fit at h−1; SARIMA on residual load from temperature fit at h−1.)

Our system avoids major computational bottlenecks at runtime. The SARIMA, SSM, and HWT methods can make essentially instantaneous forecasts on the 400 or so meter aggregates. The kernel-methods-based predictions by the sigma-SVR algorithm require, for each meter aggregate and for each prediction horizon (up to 168), a few matrix multiplications with matrix dimensions on the order of 10,000. The latter can bring the computational time to several minutes, once per hour.
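
To give a sense of that cost, the prediction step of a generic kernel regressor (an illustration of the order of magnitude only, not the exact sigma-SVR code; the feature count and kernel width are assumed values) amounts to evaluating one kernel row against roughly 10,000 stored training vectors and weighting it by the learned coefficients:

```python
import numpy as np

# Generic kernel-regression prediction step (illustrative; not the exact
# sigma-SVR implementation). With on the order of 10,000 stored training
# vectors, the dominant cost is forming the kernel row k(x_new, X_train).
rng = np.random.default_rng(0)
n_train, n_features = 10_000, 30          # e.g., lagged loads + weather inputs
X_train = rng.standard_normal((n_train, n_features))
alpha = rng.standard_normal(n_train)      # learned dual coefficients
gamma = 0.1                               # RBF kernel width (assumed value)

def predict_one(x_new):
    """Predict one load value from one feature vector x_new."""
    sq_dist = np.sum((X_train - x_new) ** 2, axis=1)   # O(n_train * n_features)
    k_row = np.exp(-gamma * sq_dist)                   # RBF kernel row
    return float(k_row @ alpha)

# One such evaluation per aggregate and per horizon (up to 168 horizons).
print(predict_one(rng.standard_normal(n_features)))
```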

The largest computational requirements are due to training the prediction algorithms, which, as we explained, happens once a month. While the SARIMA and SSM methods are, again, negligible in terms of training time, it typically takes a few hours to cross-validate the state parameters of the HWT model and about one day to learn the feature and Lagrange coefficients of the sigma-SVR predictor. This is currently handled by scheduling learning for all the models over several days.
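
One simple way to stagger that monthly retraining (an assumption for illustration, not necessarily the exact schedule used in our deployment) is to hash each aggregate identifier onto a retraining day early in the month:

```python
import hashlib

def retraining_day(aggregate_id, first_day=1, n_days=5):
    """Illustrative staggering rule (hypothetical, not the production schedule):
    hash each aggregate id onto one of n_days calendar days so that the
    expensive monthly sigma-SVR and HWT retraining jobs are spread out."""
    digest = hashlib.md5(aggregate_id.encode()).hexdigest()
    return first_day + int(digest, 16) % n_days

# Maps each (made-up) aggregate to a retraining day between 1 and 5.
schedule = {a: retraining_day(a) for a in ["feeder_017", "feeder_018", "substation_03"]}
print(schedule)
```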

Ensemble Prediction

Given that we have four different prediction algorithms (HWT, sSVR, SARIMA, SSM), we can study methods for combining their predictions for potentially better accuracy and robustness to noise and random errors. We conjecture that possibility after observing that the predictions of the four algorithms have largely uncorrelated errors, as visible in Figure 8 for an example of system-wide load forecasts over one week at the one-hour horizon, and in Figure 9, on the same data and time period, at a 24-hour horizon.

Figure 8. Predictions and prediction errors by four algorithms and a simple weather fit model (in gray) over one week in 2008. These plots show the predictions at horizon h = 1 hour. (HWT—Holt-Winters model; SARIMA—seasonal auto-regressive integrated moving average; SSM—state-space model; sSVR—sigma SVR; SVR—support vector regression. Panels: (a) 1-hour-ahead predictions and (b) 1-hour-ahead prediction errors, in kWh, for Feb. 5–11, 2008. Curves: observed, regression on temperature, HWT, sSVR, SARIMA, SSM.)

Systematically generated ensembles are used extensively in numerical weather forecasting [29]. Our approach, on the other hand, needs to work with a small ensemble whose members must each be able to reach a certain level of accuracy on their own. In this case, simple combination strategies are desirable. We considered five simple schemes for combining the predictions (illustrated in the sketch that follows the list):

1. Mean of the four predictions,
2. Median of the four predictions,
3. Switching among the four predictions, using the one with the smallest absolute error at the time when the prediction is made,
4. Mean of HWT and sSVR, and
5. Switching between HWT and sSVR, using the one with the smallest absolute error at the time when the prediction is made.
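
The following minimal Python sketch (our illustration; the dictionaries, variable names, and example numbers are hypothetical, and the real system works on hourly series rather than single values) spells out these five combination rules; the switching rules pick whichever model had the smallest absolute error on the most recent observation available when the forecast is issued:

```python
import statistics

def combine(preds, last_abs_errors):
    """preds: dict of model name -> forecast for the target hour.
    last_abs_errors: dict of model name -> absolute error at forecast time.
    Returns the five combined forecasts described in the text."""
    models = ["HWT", "sSVR", "SARIMA", "SSM"]
    values = [preds[m] for m in models]
    best_all = min(models, key=lambda m: last_abs_errors[m])
    best_two = min(["HWT", "sSVR"], key=lambda m: last_abs_errors[m])
    return {
        "mean4": sum(values) / len(values),
        "median4": statistics.median(values),
        "switch4": preds[best_all],
        "mean_hwt_ssvr": (preds["HWT"] + preds["sSVR"]) / 2.0,
        "switch_hwt_ssvr": preds[best_two],
    }

# Example with made-up numbers (MWh):
preds = {"HWT": 101.0, "sSVR": 99.0, "SARIMA": 104.0, "SSM": 97.0}
errors = {"HWT": 2.0, "sSVR": 1.0, "SARIMA": 3.5, "SSM": 2.5}
print(combine(preds, errors))
```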

Figure 9. Predictions and prediction errors by four algorithms and a simple weather fit model (in gray) over one week in 2008. These plots show the predictions at horizon h = 24 hours. (HWT—Holt-Winters model; SARIMA—seasonal auto-regressive integrated moving average; SSM—state-space model; sSVR—sigma SVR; SVR—support vector regression. Panels: (a) 24-hour-ahead predictions and (b) 24-hour-ahead prediction errors, in kWh, for Feb. 5–11, 2008. Curves: observed, regression on temperature, HWT, sSVR, SARIMA, SSM.)

We summarize in Figure 12 the performance of these algorithms and the combined predictions for the system-wide aggregates from 2008 to 2009. The mean of the HWT and sSVR predictions on the system-wide data reaches around MAPE = 3 percent at a 24-hour prediction horizon, down from about 4 percent achieved by HWT alone. We can see that, by most of the performance criteria considered, either the mean or the median of the four predictors gives the best performance, and it is better than the best individual method except at the horizon of one hour ahead (which is best handled by sSVR). Further investigation will examine to what extent this observation generalizes to smaller meter aggregates.
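
For reference, the MAPE values reported throughout this section follow the standard definition of the mean absolute percentage error, with $y_t$ the observed load, $\hat{y}_t$ its forecast, and $N$ the number of hourly test points:

$$\mathrm{MAPE} = \frac{100}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right| .$$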

Conclusion

We methodically evaluated state-of-the-art STLF methods on a unique dataset consisting of load aggregates from individual meters, and showed a dependency of the load forecasting performance on the size of the aggregate. In this study, we considered load forecasting at each meter aggregate as an independent task, and did not fully exploit the pyramidal structure of the meter-feeder-substation network. Future investigations could explore such hierarchical time series prediction.

Acknowledgements

The authors wish to acknowledge the help and contribution of former and current members of Alcatel-Lucent Bell Labs: Gary Atkinson, Kenneth Budka, Jayant Deshpande, Frank Feather, Zhi He, Marina Thottan, and Kim Young Jin, as well as the utility company for providing the meter and weather dataset used in this study.

Figure 10. Relationship between load forecasting accuracy and the size of the load aggregate (i.e., the number of meters connected to the electrical structure). The monthly MAPE performance has been averaged over six months, from January to June 2012. (HWT—Holt-Winters model; MAPE—mean absolute percentage error; SVR—support vector regression. Plot: 1-hour-ahead predictions at aggregate level; x-axis: number of meters connected to the aggregate, on a log scale from 1 to 100,000; y-axis: MAPE (%) on a log scale. Curves: HWT and sigma-SVR; annotations mark a selected feeder and the system-wide aggregate.)

Figure 11. Predictions by two algorithms and a simple weather fit model (in gray) over three weeks in 2012 for a selected feeder connected to 12 individual meters. These plots show the predictions at horizon h = 1 hour. (HWT—Holt-Winters model; sSVR—sigma SVR; STLF—short-term load forecast; SVR—support vector regression. Panels (a), (b), and (c): 1-hour-ahead STLF for the selected feeder during the weeks of Feb. 12–18, Feb. 19–25, and Feb. 26–Mar. 3, 2012; y-axis in Wh. Curves: observed, regression on temperature, HWT, sSVR.)

Figure 12. Performance of the four predictors and their combinations, benchmarked against a predictor using only regression on temperature, for prediction horizons from one to 24 hours ahead. (HWT—Holt-Winters model; SARIMA—seasonal auto-regressive integrated moving averages; SSM—state-space model; sSVR—sigma SVR; SVR—support vector regression. Panels (a)–(e) plot, against the prediction horizon in hours: the mean absolute percentage error (%), the mean absolute error (kWh), the root mean squared error (kWh), the worst overshoot (kWh), and the worst undershoot (kWh). Curves: regression on temperature, HWT, sSVR, SARIMA, SSM, MEAN, MEDIAN, SWITCH, MEAN_HWT_sSVR, SWITCH_HWT_sSVR.)


*Trademarks
Linux is a trademark of Linus Torvalds.
Matlab is a registered trademark of The Mathworks, Inc.
Perl is a trademark of the Perl Foundation.
Ubuntu is a registered trademark of Canonical Limited.
Xeon is a registered trademark of Intel Corporation.

References
[1] Y. Bengio, P. Simard, and P. Frasconi, "Learning Long-Term Dependencies with Gradient Descent Is Difficult," IEEE Trans. Neural Networks, 5:2 (1994), 157–166.

[2] A. Bhargava, “On the Theory of Testing for Unit Roots in Observed Time Series,” Rev. Econom. Stud., 53:3 (1986), 369–384.

[3] C. E. Borges, Y. K. Penya, and I. Fernández, “Optimal Combined Short-Term Building Load Forecasting,” Proc. IEEE PES Innovative Smart Grid Technol. Asia Conf. (ISGT ’11) (Perth, Aus., 2011).

[4] J. Cao, A. Chen, T. Bu, and A. Buvaneswari, “Monitoring Time-Varying Network Streams Using State-Space Models,” Proc. 28th IEEE Internat. Conf. on Comput. Commun. (INFOCOM ’09) (Rio de Janeiro, Bra., 2009), pp. 2721–2725.

[5] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing Multiple Parameters for Support Vector Machines,” Mach. Learn., 46:1-3 (2002), 131–159.

[6] W. Charytoniuk, M. S. Chen, and P. Van Olinda, “Nonparametric Regression Based Short-Term Load Forecasting,” IEEE Trans. Power Syst., 13:3 (1998), 725–730.

[7] B.-J. Chen, M.-W. Chang, and C.-J. Lin, “Load Forecasting Using Support Vector Machines: A Study on EUNITE Competition 2001,” IEEE Trans. Power Syst., 19:4 (2004), 1821–1830.

[8] W. S. Cleveland and S. J. Devlin, “Locally-Weighted Regression: An Approach to Regression Analysis by Local Fitting,” J. Amer. Statist. Assoc., 83:403 (1988), 596–610.

[9] C. Cortes and V. Vapnik, “Support-Vector Networks,” Mach. Learn., 20:3 (1995), 273–297.

[10] V. Dordonnat, S. J. Koopman, M. Ooms, A. Dessertaine, and J. Collet, “An Hourly Periodic State Space Model for Modelling French National Electricity Load,” Internat. J. Forecasting, 24:4 (2008), 566–587.

[11] Enel, “Smart Metering System,” <http://www.enel.com/en-GB/innovation/project_technology/zero_emission_life/smart_networks/smart_meters.aspx>, accessed Mar. 5, 2012.

[12] S. Fan and L. Chen, “Short-Term Load Forecasting Based on an Adaptive Hybrid Method,” IEEE Trans. Power Syst., 21:1 (2006), 392–401.

[13] E. A. Feinberg and D. Genethliou, “Load Forecasting,” Applied Mathematics for Restructured Electric Power Systems: Optimization, Control, and Computational Intelligence (J. H. Chow, F. F. Wu, and J. A. Momoh, eds.), Springer, New York, 2005, pp. 269–285.

[14] P. Goovaerts, Geostatistics for Natural Resources Evaluation, Oxford University Press, New York, 1997.

[15] G. Gross and F. D. Galiana, “Short-Term Load Forecasting,” Proc. IEEE, 75:12 (1987), 1558–1573.

[16] F. Gullo, G. Ponti, A. Tagarelli, S. Iiritano, M. Ruffolo, and D. Labate, "Low-Voltage Electricity Customer Profiling Based on Load Data Clustering," Proc. 13th Internat. Database Eng. and Applications Symp. (IDEAS ’09) (Cetraro, Calabria, Ita., 2009), pp. 330–333.

[17] H. Hahn, S. Meyer-Nieberg, and S. Pickl, “Electric Load Forecasting Methods: Tools for Decision Making,” European J. Oper. Res., 199:3 (2009), 902–907.

[18] T. M. Hansen, “mGstat: A Geostatistical Matlab Toolbox,” <http://mgstat.sourceforge.net/>.

[19] A. Harvey and S. J. Koopman, “Forecasting Hourly Electricity Demand Using Time-Varying Splines,” J. Amer. Statist. Assoc., 88:424 (1993), 1228–1236.

[20] J. D. Hobby, A. Shoshitaishvili, and G. H. Tucci, “Analysis and Methodology to Segregate Residential Electricity Consumption in Different Taxonomies,” IEEE Trans. Smart Grid, 3:1 (2012), 217–224.

[21] T. Hong, P. Wang, and H. L. Willis, “A Naïve Multiple Linear Regression Benchmark for Short Term Load Forecasting,” Proc. IEEE Power and Energy Soc. Gen. Meeting (Detroit, MI, 2011).

[22] A. Khotanzad, R. Afkhami-Rohani, and D. Maratukulam, "ANNSTLF—Artificial Neural Network Short-Term Load Forecaster—Generation Three," IEEE Trans. Power Syst., 13:4 (1998), 1413–1422.

[23] J. Z. Kolter and M. J. Johnson, “REDD: A Public Data Set for Energy Disaggregation Research,” Proc. KDD Workshop on Data Mining Applications in Sustainability (SustKDD ’11) (San Diego, CA, 2011).

[24] D. Kwiatkowski, P. C. B. Phillips, P. Schmidt, and Y. Shin, “Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root,” J. Econometrics, 54:1-3 (1992), 159–178.

[25] E. Kyriakides and M. Polycarpou, “Short Term Electric Load Forecasting: A Tutorial,” Trends in Neural Computation (K. Chen and L. Wang, eds.), Springer, Berlin, New York, 2007, pp. 391–418.

[26] D. Mattera and S. Haykin, “Support Vector Machines for Dynamic Reconstruction of a Chaotic System,” Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds.), MIT Press, Cambridge, MA, 1999, pp. 211–242.

[27] A. Muñoz, E. F. Sánchez-Úbeda, A. Cruz, and J. Marín, “Short-Term Forecasting in Power Systems: A Guided Tour,” Handbook of Power Systems II: Energy Systems (S. Rebennack, P. M. Pardalos, M. V. F. Pereira, and N. A. Iliadis, eds.), Springer, Heidelberg, New York, 2010, pp. 129–160.

[28] E. A. Nadaraya, “On Estimating Regression,” Theory Probab. Appl., 9:1 (1964), 141–142.

[29] T. N. Palmer, G. J. Shutts, R. Hagedorn, F. J. Doblas-Reyes, T. Jung, and M. Leutbecher, “Representing Model Uncertainty in Weather and Climate Prediction,” Annual Review of Earth and Planetary Sciences, Volume 33 (R. Jeanloz, A. L. Albee, and K. C. Burke, eds.), Annual Reviews, Palo Alto, CA, May 2005, pp. 163–193.

[30] A. Pigazo and V. M. Moreno, “Estimation of Electrical Power Quantities by Means of Kalman Filtering,” Kalman Filter: Recent Advances and Applications (A. Pigazo and V. M. Moreno, eds.), In-Tech, Rijeka, Cro., Apr. 2009, pp. 375–396.

[31] S. Rogai, “Keynote I. Telegestore Project Progresses and Results,” IEEE Internat. Symp. on Power Line Commun. and Its Applications (ISPLC ’07) (Pisa, Ita., 2007).

[32] J.-H. Shin, B.-J. Yi, Y.-I. Kim, H.-G. Lee, and K. H. Ryu, “Spatiotemporal Load-Analysis Model for Electric Power Distribution Facilities

Using Consumer Meter-Reading Data,” IEEE Trans. Power Delivery, 26:2 (2011), 736–743.

[33] R. P. Singh, P. X. Gao, and D. J. Lizotte, “On Hourly Home Peak Load Prediction,” Proc. 3rd IEEE Internat. Conf. on Smart Grid Commun. (SmartGridComm ’12) (Tainan, Twn., 2012), pp. 163–168.

[34] A. J. Smola and B. Schölkopf, “A Tutorial on Support Vector Regression,” Stat. Comput., 14:3 (2004), 199–222.

[35] L. J. Soares and M. C. Medeiros, “Modeling and Forecasting Short-Term Electricity Load: A Comparison of Methods with an Application to Brazilian Data,” Internat. J. Forecasting, 24:4 (2008), 630–644.

[36] R. G. Steadman, “A Universal Scale of Apparent Temperature,” J. Climate Applied Meteorology, 23:12 (1984), 1674–1687.

[37] J. W. Taylor and P. E. McSharry, “Short-Term Load Forecasting Methods: An Evaluation Based on European Data,” IEEE Trans. Power Syst., 22:4 (2007), 2213–2219.

(Manuscript approved October 2013)

PIOTR MIROWSKI is a member of technical staff at Bell Labs in Murray Hill, New Jersey. He obtained his Ph.D. in computer science at the Courant Institute of Mathematical Sciences at New York University, New York City, with a thesis in machine learning under the supervision of Prof. Yann LeCun.

He also has a master’s degree in computer science from École Nationale Supérieure ENSEEIHT in Toulouse, France. Prior to joining Bell Labs he worked as a research engineer in geology at Schlumberger Research. During his Ph.D. studies, he interned at the NYU Medical Center (investigating epileptic seizure prediction from EEG), at Google, at the Quantitative Analytics department of Standard & Poor’s and at AT&T Labs Research. His current research focuses on machine learning methods for text analysis and query ranking, on computer vision and simultaneous localization and mapping for robotics, on indoor localization, on time series modeling and load forecasting for smart grids, and on deep learning.

SINING CHEN is a member of technical staff in the Statistics and Learning Research Department at Bell Labs in Murray Hill, New Jersey. She received a B.S. in applied mathematics from Tsinghua University, Beijing, China and a Ph.D. in statistics from


Duke University, Durham, North Carolina. After completing her doctorate, Dr. Chen worked at Johns Hopkins University, first as a postdoctoral fellow in the School of Medicine and then as an assistant professor in the School of Public Health. Prior to joining Bell Labs, she was an associate professor at the Department of Biostatistics, University of Medicine and Dentistry of New Jersey. Her current research interests include Bayesian methods and forecasting.

TIN KAM HO leads the Statistics of Communication Systems Research Activity in Bell Labs at Murray Hill. She pioneered research in multiple classifier systems, random decision forests, and data complexity analysis, and pursued applications of

automatic learning in many areas of science and engineering. She also led major efforts on modeling and monitoring large-scale optical transmission systems. Recently she worked on wireless geo-location, video surveillance, smart grid data mining, and customer experience modeling. Her contributions were recognized by a Bell Labs President’s Gold Award and two Bell Labs Teamwork Awards, a Young Scientist Award in 1999, and the 2008 Pierre Devijver Award for Statistical Pattern Recognition. She is an elected Fellow of IAPR (International Association for Pattern Recognition) and IEEE, and served as editor-in-chief of the journal Pattern Recognition Letters in 2004-2010. She received a Ph.D. in computer science from State University of New York (SUNY), Buffalo.

CHUN-NAM YU is a member of technical staff in the Statistics and Learning Department at Bell Labs in Murray Hill, New Jersey. He received a B.A. degree in mathematics and computer science from Oxford University, United Kingdom, and an M.S. degree and Ph.D. in computer science from Cornell

University, Ithaca, New York. Prior to joining Bell Labs, he was a postdoctoral fellow at the Alberta Innovates Centre of Machine Learning (AICML) at the University of Alberta in Edmonton, Alberta, Canada. His research interests include structured output learning, graphical models, kernel methods, optimization, and biomedical applications. ◆