
Renewable Energy 123 (2018) 513–525

Contents lists available at ScienceDirect

Renewable Energy

journal homepage: www.elsevier.com/locate/renene

A database infrastructure to implement real-time solar and wind power generation intra-hour forecasts

Hugo T.C. Pedro a, Edwin Lim b, Carlos F.M. Coimbra a, *

a Department of Mechanical and Aerospace Engineering, Jacobs School of Engineering, Center of Excellence in Renewable Resource Integration and Center for Energy Research, University of California San Diego, La Jolla, CA 92093, USA
b Temporal Dynamics, Martinsville, NJ, 08836, USA

Article info

Article history:
Received 9 October 2017
Received in revised form 24 January 2018
Accepted 9 February 2018
Available online 15 February 2018

Keywords:
Renewable generation forecast
Real-time implementation
Nearest neighbors forecast

* Corresponding author.
E-mail address: [email protected] (C.F.M. Coimbra).

https://doi.org/10.1016/j.renene.2018.02.043
0960-1481/© 2018 Elsevier Ltd. All rights reserved.

Abstract

This paper presents a simple forecasting database infrastructure implemented using the open-source database management system MySQL. This proposal aims at advancing the myriad of solar and wind forecast models present in the literature into a production stage. The paper gives all relevant details necessary to implement a MySQL infrastructure that collects the raw data, filters unrealistic values, classifies the data, and produces forecasts automatically and without the assistance of any other computational tools. The performance of this methodology is demonstrated by creating intra-hour power output forecasts for a 1 MW photovoltaic installation in Southern California and a 10 MW wind power plant in Central California. Several machine learning forecast models are implemented (persistence, auto-regressive and nearest neighbors) and tested. Both point forecasts and prediction intervals are generated with this methodology. Quantitative and qualitative analyses of solar and wind power forecasts were performed for an extended testing period (4 years and 6 years, respectively). Results show an acceptable and robust performance for the proposed forecasts.

© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Methods and algorithms for assessing renewable resources, forecasting wind and solar irradiance and renewable power generation have been the topic of many research papers in recent years. This interest is motivated by the increasing presence of renewable energy in our lives, from roof-top installations in our homes to large wind and solar farms that supply the electrical grids, and the challenges posed by the inherent variability of this energy source [1–3].

Forecasting models, both deterministic and stochastic in nature, and qualitative and quantitative performance metrics have been amply discussed in the recent literature [4–9]. Recent works in renewable generation forecasting follow one or more of the following trends: (i) exploring new exogenous predictors relevant to the forecasting; (ii) data preprocessing into predictors more amenable to the forecasting tools; (iii) creating probabilistic forecasts instead of the classic point forecast; (iv) expanding the forecasting to large spatial domains; (v) creating models dedicated to forecasting extreme events such as power ramps; (vi) exploring new metrics to better characterize forecasting performance [9,10]. Despite these advances, there is, to the best of our knowledge, no literature that addresses, in a practical way, how the models are implemented in a real case scenario. Motivated by this fact, we present here a simple strategy to deploy solar and wind forecasting tools in real-time. The proposed workflow can be readily implemented in any computing platform capable of running the open-source database management system (DBMS) MySQL. As shown below, all the data storage, data validation and forecasting tools can be implemented in this simple infrastructure.

The use of a MySQL database in local forecasting conveys a number of advantages. MySQL is robust, fast, has an open source license, and is available on all the common platforms. As such, MySQL has gained widespread use in diverse applications and is today one of the most popular DBMS [11]. MySQL can be set up with ease. With a little learning investment, it is easy to perform basic store and retrieve functions. For intermediate and advanced usage, the official documentation is comprehensive and freely available [12] (accessed June 2016). Accessing the data in MySQL is vastly simplified by the availability of connectors and APIs (application programming interfaces) for various third party software like C, Java, Python, Perl, etc. Being natively network oriented, MySQL adds an important dimension to the data accessibility: different users and tools can access the same data set from different computers simultaneously. The native text based client and the graphical interface (MySQL Workbench) provide convenient methods of manually querying the data during software development and troubleshooting problems in general.

MySQL also offers an extensive set of functions and operators that can be explored to change the DBMS from a mere data hosting infrastructure to much more. For example, MySQL stored procedures are used in Zach et al. [13] to improve building monitoring and in Ref. [14] to enable complex queries to a bioinformatic database.

The purpose of this paper is to demonstrate how one can take advantage of these functionalities to create a forecasting methodology completely implemented in MySQL. In the approach described here, all the operations are performed within MySQL as illustrated in Fig. 1. This is a departure from the more traditional approach where the database simply stores the measured data and the forecasts are produced by attendant jobs (e.g. cron jobs or Windows scheduled tasks) implemented in another programming language [15,16].

In the remainder of this paper we explain how to set up this database infrastructure for two real case scenarios. In the first case, the power output (PO) from a 1 MW solar canopy is predicted from the PO and global horizontal irradiance (GHI) measurements. In the second case, we consider telemetry (PO, wind speed and direction, temperature, pressure and density) from a 10 MW wind power plant and predict wind PO. We implement simple forecast techniques to create real-time point forecasts as well as prediction intervals for intra-hour horizons (15–60 min).

2. Methodology

In this section we describe the key steps necessary to create the database tables and the procedures used to compute the forecast. As mentioned above, several data streams are available in this case: power output and environmental data. This set up corresponds to a very common scenario for medium sized solar and wind power generation, where local telemetry data is used as predictors for the forecasting models. The methodology description will focus on the first case under consideration, the forecasting of PO from a 1 MW solar canopy. At the end of the section we explain the few modifications necessary to apply this database infrastructure for the forecasting of PO for the 10 MW wind power plant.

Fig. 1. (a) In this approach the database is used to store measured data and the forecast. External attendant jobs are responsible for computing the forecast. (b) In the approach described in this work all computations are performed inside the MySQL database. No external processes are required. (c) Flow chart for the processes that create the forecasts.

2.1. MySQL tables

All the data used in this work (sensor data, computed forecasts, etc.) are stored in MySQL tables. A table is a structure where the data is collected. Tables contain records or rows, and fields or columns. In this case most tables follow a very simple structure in which the first column contains the date and time (timestamp) that uniquely identifies the data. To improve search efficiency, tables should be indexed or keyed and the natural choice here is to use the timestamp as the index. All timestamps are in Coordinated Universal Time (UTC) to avoid any problems related to daylight saving time. Indices are kept in memory, which enables the database to process queries much faster (e.g. get sensor data for a given time period). Using indices requires more memory but reduces the query times substantially. Furthermore, if the indexed quantity is unique, the index can be declared as the primary key of the table. In this case, MySQL will enforce that there are no duplicates of any indexed quantity.
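The effect of declaring the timestamp as the primary key can be illustrated with a short sketch. The example below is not the MySQL code used in this work: it relies on Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server, and the column names _po and _ghi and both helper functions are assumptions made only for illustration.

```python
import sqlite3

def make_raw_table(conn):
    # Hypothetical layout: one UTC timestamp column plus the measured variables.
    conn.execute(
        "CREATE TABLE dataRaw ("
        " _timestamp TEXT PRIMARY KEY,"   # unique index on the timestamp
        " _po REAL,"                      # power output
        " _ghi REAL)"                     # global horizontal irradiance
    )

def insert_raw(conn, timestamp, po, ghi):
    # Returns True if the row was stored, False if the timestamp already exists.
    try:
        conn.execute("INSERT INTO dataRaw VALUES (?, ?, ?)", (timestamp, po, ghi))
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
make_raw_table(conn)
first = insert_raw(conn, "2014-06-01 18:00:00", 812.4, 930.1)
duplicate = insert_raw(conn, "2014-06-01 18:00:00", 800.0, 925.0)
row_count = conn.execute("SELECT COUNT(*) FROM dataRaw").fetchone()[0]
```

In this sketch the second insert is rejected because it reuses an existing timestamp, so only one row is stored.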

The tables used here can be categorized into two main types depending on how they are indexed. The first category comprises tables that store data indexed by the timestamp (primary key). These are the tables that contain sensor data and data directly derived from them such as normalized data or classification data, as explained below. Such tables contain one column that stores the timestamp followed by as many columns as necessary to store the indexed variables. For example, the table that stores the raw PO and GHI data can be created with the following MySQL code:

To the second category belong tables used to store forecasted data. Data in these tables are associated with two timestamps: the reference time (_reftime), which records the time at which the forecast is computed, and the valid time (_valtime), which is the time at which the forecast applies. These tables have a compound primary key made up by combining the reference time and valid time. This means that there can be duplicates in valid times, as long as the duplicates have different reference times.

Other columns always present in these tables include the forecasted value and the forecast error. There are several tables in this category, one for each one of the different forecast models implemented. In this work we implement three models: a persistence model (Pers), an auto-regressive model (AR) and a nearest neighbor model (kNN). The statement to create the table to store the nearest neighbor forecast is:

Note that in addition to the primary key of the combined _reftime and _valtime, three other indices are also declared to increase the data retrieval speed. The primary key here is mainly to enforce the uniqueness of the _reftime, _valtime pair, whereas most of the actual queries will be performed on the other indices. For example, the queries to update the forecast error with the latest PO measurements are done based on _valtime, as explained in the following section. If the additional indices are not declared, the performance will suffer. Another aspect of this particular table is that it also contains columns for the minimum and maximum values that define the prediction interval (PI) for PO (explained below).
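The behavior of the compound primary key and of a secondary index on the valid time can be sketched as follows. As before, this is a hedged illustration and not the statement used in this work: SQLite (through Python's sqlite3 module) stands in for MySQL, and the column names _po_forecast and _error, the index name idx_valtime and the helper try_insert are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical forecast table: the (_reftime, _valtime) pair is the primary key.
conn.execute(
    "CREATE TABLE forecastNN ("
    " _reftime TEXT NOT NULL,"
    " _valtime TEXT NOT NULL,"
    " _po_forecast REAL,"
    " _error REAL,"
    " PRIMARY KEY (_reftime, _valtime))"
)
# Secondary index on the valid time, since error updates look rows up by _valtime.
conn.execute("CREATE INDEX idx_valtime ON forecastNN (_valtime)")

def try_insert(reftime, valtime, value):
    # Returns True if the forecast row was stored, False on a key violation.
    try:
        conn.execute(
            "INSERT INTO forecastNN (_reftime, _valtime, _po_forecast) VALUES (?, ?, ?)",
            (reftime, valtime, value),
        )
        return True
    except sqlite3.IntegrityError:
        return False

# Two forecasts for the same valid time issued at different reference times: allowed.
ok_30min = try_insert("2014-06-01 18:00:00", "2014-06-01 18:30:00", 790.0)
ok_15min = try_insert("2014-06-01 18:15:00", "2014-06-01 18:30:00", 801.5)
# The same (reference, valid) pair again: rejected by the compound primary key.
repeat = try_insert("2014-06-01 18:15:00", "2014-06-01 18:30:00", 805.0)
```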

All the main tables used in this work can be created using variations of the two codes above. In total we use nine different tables: five that contain sensor and sensor-derived data, three that store the forecasted values, and a temporary table used to speed up the nearest neighbors forecast computation. Table 1 lists and describes all these tables in more detail. In the following sections we show how all these tables are populated automatically once data is inserted into the dataRaw table.

2.2. Data filtering and data normalization

In this scenario we assume that measurements are fed unfiltered from the telemetry hardware sensors into the database. Thus, the data must be filtered to discard missing and invalid measurements, e.g., physically impossible values such as negative irradiance and power generation measurements. In this case such measurements are tagged with the value -9999.9 in accordance with the convention used in data from the SURFRAD network [17] (accessed June 2016). Rather than relying on external tools to filter the data, we use MySQL triggers (procedures that are executed automatically in response to events in a table) to perform this task.

The triggering event in this case is the insertion of new data into the dataRaw table. The triggered procedure has access to the data that has just been inserted into dataRaw. It operates on these data and inserts the resulting filtered data into the dataFiltered table. In this case, the procedure discards flagged values and night time values ($\theta_z \geq 85^\circ$):

The key statement in this code is AFTER INSERT ON dataRaw, which indicates that the procedure is triggered after data is inserted into the table dataRaw (MySQL supports other trigger events such as BEFORE INSERT or AFTER DELETE). The data that was inserted in the triggering event is accessed by prepending “NEW.” to the respective column name. Also, note that in the snippets presented here, lines starting with # are comments and words prefixed by the underscore character “_” refer to columns in the different tables.
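The AFTER INSERT mechanism and the “NEW.” prefix can be tried out with the minimal sketch below. It is an illustration under stated assumptions rather than the trigger used in this work: SQLite (via Python's sqlite3 module) replaces MySQL, the zenith angle is supplied as a hypothetical column _z instead of being computed by a stored procedure, and the filtering rule is reduced to a single WHEN condition.

```python
import sqlite3

FLAG = -9999.9  # value used to tag missing or invalid measurements

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataRaw (_timestamp TEXT PRIMARY KEY, _po REAL, _ghi REAL, _z REAL);
    CREATE TABLE dataFiltered (_timestamp TEXT PRIMARY KEY, _po REAL, _ghi REAL);

    -- Runs automatically after each insert into dataRaw; NEW.<column> refers
    -- to the row that has just been inserted.
    CREATE TRIGGER filterData AFTER INSERT ON dataRaw
    WHEN NEW._po >= 0 AND NEW._ghi >= 0 AND NEW._z < 85.0
    BEGIN
        INSERT INTO dataFiltered VALUES (NEW._timestamp, NEW._po, NEW._ghi);
    END;
    """
)
rows = [
    ("2014-06-01 18:00:00", 812.4, 930.1, 22.0),   # valid daytime sample
    ("2014-06-01 18:15:00", FLAG, 925.0, 24.5),    # flagged power output
    ("2014-06-02 04:00:00", 0.0, 0.0, 110.0),      # night time (zenith >= 85 deg)
]
conn.executemany("INSERT INTO dataRaw VALUES (?, ?, ?, ?)", rows)
kept = [r[0] for r in conn.execute("SELECT _timestamp FROM dataFiltered")]
```

Only the first sample reaches dataFiltered; the flagged and the night time rows stay in dataRaw but are not propagated.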

In the same triggered procedure we also normalize the raw data values and insert them into the respective dataNormalized table. Normalization can be accomplished with accurate clear-sky models [18], most of which are not easily implemented in MySQL. A simpler approach is to normalize PO and GHI with the cosine of the zenith angle $\cos\theta_z$:

$$\cos\theta_z = \cos\phi \cos\delta \cos\omega + \sin\phi \sin\delta \tag{1}$$

where $\phi$ is the latitude, $\omega$ is the solar time expressed as an hour angle, and $\delta$ is the declination angle. These quantities can be computed easily for any time and geographical location [19] and can be implemented as a MySQL procedure or function (see Appendix A.1). This normalization accounts for much of the daily and seasonal variation in the PO, so that the models' task is to predict the effect of the atmosphere on the target variable.

Table 1
Description of the MySQL tables used in this work.

Name | Type | Description
dataRaw | Data | Stores the raw PO and exogenous data. The primary key is the data acquisition timestamp.
dataFiltered | Data | Stores the filtered raw data. The filtering removes night time values (in the first test case) and unrealistic values. The primary key is the data acquisition timestamp inherited from the dataRaw table.
dataNormalized | Data | Stores the normalized filtered data. The primary key is the data acquisition timestamp inherited from the dataRaw table.
dataClass | Data | Stores several features used to classify data. These features are used in the nearest neighbor forecast. The primary key is the timestamp inherited from the dataNormalized table.
dataTarget | Data | Stores the target data for the different forecast horizons corresponding to the data in dataClass. The primary key is the timestamp inherited from the dataNormalized table.
forecastPer | Forecast | Stores the persistence forecast. The primary key combines the _reftime and _valtime timestamps.
forecastAR | Forecast | Stores the auto-regressive forecast. The primary key combines the _reftime and _valtime timestamps.
forecastNN | Forecast | Stores the nearest neighbors forecast. The primary key combines the _reftime and _valtime timestamps.
tempTableNN | Temporary | A temporary table that stores the timestamps for the nearest neighbors. The contents of this table are deleted once the forecast is computed.
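The zenith-angle normalization of Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the declination is obtained with Cooper's approximation and the hour angle is taken as 15 degrees per hour from solar noon, both of which are assumptions and not necessarily the formulas of Appendix A.1; the function names and the sample PO value are hypothetical.

```python
import math

def declination_deg(day_of_year):
    # Cooper's approximation for the solar declination (assumed here).
    return 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))

def cos_zenith(lat_deg, decl_deg, solar_hour):
    # Eq. (1); the hour angle is 15 degrees per hour away from solar noon.
    omega = math.radians(15.0 * (solar_hour - 12.0))
    phi = math.radians(lat_deg)
    delta = math.radians(decl_deg)
    return (math.cos(phi) * math.cos(delta) * math.cos(omega)
            + math.sin(phi) * math.sin(delta))

def normalize(value, cos_z):
    # Normalized PO or GHI: the measured value divided by cos(zenith).
    return value / cos_z

lat = 32.959                                   # latitude of the solar site in degrees
cz_noon = cos_zenith(lat, declination_deg(172), 12.0)
n_po = normalize(800.0, cz_noon)               # hypothetical PO sample
```

At solar noon the hour angle vanishes and Eq. (1) reduces to $\cos(\phi - \delta)$, which gives a quick sanity check for any implementation.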

A third task implemented in this procedure is the update of the forecasted data tables with the forecast error. That is accomplished with the UPDATE query in the code above. With this implementation the forecast performance is updated in real-time as new measured data becomes available.

The above trigger is a very simple way of implementing data filtering, data normalization, and updating the forecast error directly in the database. It also assures the integrity of the information, that is, it guarantees that there is a one-to-one correspondence between the indexes in the tables necessary for the forecast.

Finally, since this is the first procedure to be triggered, it also defines several global session variables (variables that start with @) that are shared by all subsequent procedures. This avoids having hard-coded values dispersed in the MySQL code, which would make it difficult to modify the code for other locations, for instance. The global variables, their values and their description are listed in Table 2.

2.3. Classification features

Another type of data derived directly from the measured data is stored in the tables dataClass and dataTarget. The first table contains features such as backward averages, standard deviations, variability, etc., that are useful for classifying the data for machine learning tools such as nearest neighbors and artificial neural networks [20,21]. In the cases demonstrated here PO and exogenous data are classified by computing the average and standard deviation for the normalized variables for different backward windows (30, 60 and 120 min), the lagged values from 15 to 60 min, and the Sun's zenith angle. All these values are stored in the table dataClass, which is automatically populated by a procedure triggered when new rows are inserted into dataNormalized.
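The kind of features described above can be sketched as follows. The function is a hypothetical Python illustration, not the triggered MySQL procedure: it assumes a regularly sampled series with the most recent value last, backward windows that include the current sample, and the population standard deviation; the feature names are invented for the example.

```python
from statistics import mean, pstdev

def class_features(series, step_min=15, windows_min=(30, 60, 120), max_lag_min=60):
    """series: normalized values at regular intervals, most recent last."""
    feats = {}
    for w in windows_min:
        n = w // step_min                       # samples in the backward window
        window = series[-n:]
        feats[f"avg_{w}"] = mean(window)
        feats[f"std_{w}"] = pstdev(window)      # population standard deviation
    for lag in range(step_min, max_lag_min + 1, step_min):
        feats[f"lag_{lag}"] = series[-1 - lag // step_min]
    return feats

# Eight 15 min samples cover the longest (120 min) window.
series = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.9, 0.8]
features = class_features(series)
```

With 15 min data this yields six window statistics and four lagged values per variable; the zenith angle would be stored alongside them.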

The second table, dataTarget, stores the normalized PO data for the different horizons $k$ and forecasting issuing time $n$. The data in these two tables form the pair $\vec{f}_n \rightarrow (x_{n+1}, x_{n+2}, x_{n+3}, x_{n+4})$, where $\vec{f}_n$ denotes the classification features at time instance $n$. At that time the values of $x_{n+k}$ are not known, thus the indexed rows in dataTarget are initialized with NULL and later updated as the measured $x_{n+k}$ values become available.

The snippet below provides an abbreviated listing of the code that performs this task:

Table 2
Global variables used in the MySQL procedures.

Symbol | Value | Description
@lat_deg | 32.959 | Latitude of the target location in degrees.
@lon_deg | -117.190 | Longitude of the target location in degrees.
@utc2loc | 8 | Difference in hours between UTC time and local standard time.
@Zlim | 85.0 | The cut-off $\theta_z$ for the day-time/night-time differentiation.
@NNghb | 50 | The number of nearest neighbors in the kNN forecast.
@MAXPO | 1050 | The maximum power output.


Other features can easily be incorporated into this triggered procedure. However, too many classification features may lead to the deterioration of the forecast performance. In that case an optimization procedure similar to the one implemented in Ref. [20] may be used to find the set of features that minimizes the forecast error.

2.4. Forecast models

Once the dataClass and dataTarget tables are populated with new values a third procedure is triggered that automatically produces the different forecasts. In this work we consider three forecasting models: two auto-regressive models and a nearest neighbor model.

2.4.1. Auto-regressive forecast

The simplest forecasts implemented here are the auto-regressive models that follow the general form proposed in Ref. [22]:

$$\hat{x}_{n+k} = \left( a_0 + a_1 \frac{x_n}{\cos\theta_z(n)} + \cdots + a_m \frac{x_{n-m+1}}{\cos\theta_z(n-m+1)} \right) \cos\theta_z(n+k) \tag{2}$$

where $\hat{x}_{n+k}$ denotes the forecasted PO for time $n + k$, where $k \in \{1, 2, 3, 4\}$ corresponds to the four forecast horizons 15, 30, 45, and 60 min given that the data is in 15 min intervals. The variable $m$ indicates the number of lagged PO measurements used in the forecast. Two variations of Eq. (2) are implemented in this work: the persistence forecast (Pers) for which $m = 1$, $a_0 = 0$, $a_1 = 1$, and an auto-regressive (AR) forecast with $m = 5$ and $a_0, \ldots, a_5$ that minimize the error function:

$$\min_{a_0, \ldots, a_5} \left\lVert x_{n+k} - \hat{x}_{n+k} \right\rVert^2 \tag{3}$$

The AR model depends on external tools to solve the minimization problem. However, we include this model in this work to demonstrate how this type of model can be implemented in MySQL once the free parameters are known, and also as a benchmark forecast that performs better than persistence. In this case the parameters $a_0, \ldots, a_5$ were determined using least squares implemented in Matlab for the four different forecast horizons using the first 12 months of data available. The value $m = 5$ was determined using a grid search method in which $m \in \{1, \ldots, 10\}$.

The MySQL procedure that computes the Pers and AR forecasts is listed in Appendix A.2. That code includes the free parameters $a_0, \ldots, a_5$ determined for each one of the four forecast horizons. Any model that can be expressed in an algebraic form can be easily implemented using this methodology by modifying this procedure.
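To make the structure of Eq. (2) concrete, the sketch below evaluates the general auto-regressive form and recovers persistence as the special case $m = 1$, $a_0 = 0$, $a_1 = 1$. It is a Python illustration, not the procedure of Appendix A.2; the function names are hypothetical and the coefficients in the example are arbitrary numbers, not the fitted parameters of this work.

```python
def ar_forecast(coeffs, po_lags, cosz_lags, cosz_target):
    """Eq. (2). coeffs = [a0, a1, ..., am]; po_lags[0] is x_n, po_lags[1] is x_{n-1}, ...

    cosz_lags holds cos(zenith) at the same instants as po_lags and
    cosz_target is cos(zenith) at the forecast time n + k.
    """
    a0, weights = coeffs[0], coeffs[1:]
    total = a0
    for a, x, cz in zip(weights, po_lags, cosz_lags):
        total += a * x / cz            # lagged PO normalized by its own cos(zenith)
    return total * cosz_target         # back to physical units at time n + k

def persistence_forecast(po_now, cosz_now, cosz_target):
    # Special case of Eq. (2): m = 1, a0 = 0, a1 = 1.
    return ar_forecast([0.0, 1.0], [po_now], [cosz_now], cosz_target)

pers = persistence_forecast(800.0, 0.8, 0.75)
# Arbitrary illustrative coefficients with m = 2 (not fitted values).
ar = ar_forecast([10.0, 0.5, 0.5], [800.0, 600.0], [0.8, 0.6], 0.5)
```

Because the model is purely algebraic once the coefficients are known, the same expression can be evaluated inside a stored procedure.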

2.4.2. Nearest neighbors forecast

One type of forecast model that is especially suited for MySQL is the nearest neighbor forecast. This results from the fact that one of the main functions of MySQL is to search data within the tables based on some conditions. In this case, the condition is the similarity between the data classification features for the new data and previous classification features already present in the dataClass table. This forecast is computed in three steps:

1. identification of the list of the nearest neighbors based on the classification features;

2. retrieval of PO target data corresponding to those neighbors from the dataTarget table;

3. aggregation of those data.

Identifying the nearest neighbors can be done with a simple MySQL query:

The third line in this snippet computes the Euclidean distance between the features for the new data (identified by the “NEW.” prefix) and the same features already in the dataClass table. This line can be modified to include any combination of the different classification features available. It can also be modified to use a different distance metric. The last line in the snippet indicates that all neighbors are listed in ascending order and limits the result to the first NN items. In this work, we set the number of nearest neighbors NN to 50 following a previous study on the optimization of nearest neighbors forecast models [20]. The first line in this code indicates that the retrieved timestamps that index the nearest neighbors are stored in the temporary table tempTableNN. This speeds up the process by avoiding re-querying the table since the nearest neighbors are the same for all the four forecast horizons considered. Different combinations of features could be specified for each forecast horizon, although that was not tested in this work. We also limit the search in terms of solar geometry through the condition ABS(NEW._z-_z) <= 10. This illustrates how one of the classification features is used as a query condition instead of being considered in the distance calculation. Finally, the condition _timestamp < reftime guarantees that the current values are excluded from the list of neighbors given that the target values are not known.
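The logic of the neighbor search (distance computation, zenith-angle window, exclusion of current data, ascending sort and limit) can be mirrored in plain Python as a rough sketch. The row layout with the keys timestamp, z and features is an assumption made for this example, as are the function name and the toy data; in the MySQL implementation the same steps are expressed in a single query.

```python
import math

def nearest_neighbors(new, history, n_neighbors=50, z_window=10.0):
    """new / history rows: dicts with 'timestamp', 'z' and a 'features' list."""
    candidates = []
    for row in history:
        if row["timestamp"] >= new["timestamp"]:
            continue                                    # only data before the reference time
        if abs(new["z"] - row["z"]) > z_window:
            continue                                    # similar solar geometry only
        dist = math.dist(new["features"], row["features"])   # Euclidean distance
        candidates.append((dist, row["timestamp"]))
    candidates.sort()                                   # ascending distance
    return [ts for _, ts in candidates[:n_neighbors]]

history = [
    {"timestamp": 1, "z": 30.0, "features": [0.9, 0.1]},
    {"timestamp": 2, "z": 32.0, "features": [0.5, 0.3]},
    {"timestamp": 3, "z": 60.0, "features": [0.8, 0.1]},   # outside the zenith window
    {"timestamp": 4, "z": 35.0, "features": [0.7, 0.2]},
]
new = {"timestamp": 5, "z": 33.0, "features": [0.8, 0.1]}
neighbors = nearest_neighbors(new, history, n_neighbors=2)
```

Note that the row with timestamp 3 has identical features to the new data but is discarded by the solar geometry condition before any distance is computed.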

Once the list of timestamps ($t = \{m_i;\ i = 1, 2, \ldots, N\}$) is known we select the target data according to the forecast horizon $k$, resulting in the vector $\mathbf{x}_k = \{x_{m_i+k};\ i = 1, 2, \ldots, N\}$. The elements of this vector are then aggregated to determine the forecast value:

$$\hat{x}_{n+k} = \langle \mathbf{x}_k \rangle \cos\theta_z(n+k) = \left( \frac{1}{N} \sum_{i=1}^{N} x_{m_i+k} \right) \cos\theta_z(n+k) \tag{4}$$

In this implementation it is also possible to obtain a prediction interval (PI) for the forecasts. The lower and upper bounds for the PI can be estimated as:

$$\mathrm{PI}_{n+k} = \{\min(\mathbf{x}_k), \max(\mathbf{x}_k)\} \cos\theta_z(n+k) \tag{5}$$

The forecasted value and the PI bounds can be obtained with the following MySQL code (for the 15 min horizon):


This code computes the average, minimum and maximum values for the target PO values indexed by the nearest neighbors timestamps (stored in the tempTableNN table) and assigns them to the variables nPO_nn, nPO_nn_min, and nPO_nn_max. The non-normalized predicted values are computed in the last line of the snippet, where we also limit the value to the maximum power output. The complete procedure to compute the forecast and the PI is listed in Appendix A.3. Once these values are determined they are inserted in the table forecastNN.
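The aggregation of Eqs. (4) and (5) amounts to an average, a minimum and a maximum followed by a multiplication with $\cos\theta_z(n+k)$. The Python sketch below illustrates this under two assumptions that go beyond the text: the cap at the maximum power output is applied to the point forecast and to both PI bounds, and all neighbor targets are available (no NULL entries). The function name and the numbers are hypothetical.

```python
def knn_forecast(target_values, cosz_target, max_po):
    """Eqs. (4)-(5): target_values are the normalized PO targets of the neighbors."""
    def to_power(value):
        # De-normalize with cos(zenith) at n + k, then cap at the maximum PO.
        return min(value * cosz_target, max_po)

    avg = sum(target_values) / len(target_values)
    return to_power(avg), to_power(min(target_values)), to_power(max(target_values))

point, lower, upper = knn_forecast([1000.0, 1100.0, 1200.0, 1500.0], 0.8, 1050.0)
```

In this example the upper bound would be 1200 before capping and is limited to the maximum power output of 1050.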

The full implementation in MySQL for this case is about 400 lines long with comments. Table 3 lists and describes all the triggers and stored procedures necessary for the work here described and Fig. 2 depicts the entity-relationship diagram using "Crow's foot" notation. The code snippets provided above and in the appendix give all the relevant information but the interested reader can obtain the full code in the supplementary material.

3. Forecasting validation

This database infrastructure was implemented with the MySQL Community Server [23] (accessed June 2016) version 5.5 on an Ubuntu machine with 2.5 GHz Intel Xeon processors. The results were post-processed with Matlab version R2017a. We applied this methodology to two cases:

1. Forecasting the power output from a 1 MW PV non-tracking solar canopy installation at the Canyon Crest Academy (CCA) in San Diego, USA;

2. Forecasting the power output from a 10 MW wind power plant in central California.

For the first case, the monitoring system for the solar panels provides 15 min averaged data for PO and GHI every 15 min, and PO and GHI were collected from 2011-02-03 to 2014-12-31. In the second case, data was retrieved from the Wind Integration National Dataset (WIND) Toolkit [24] for the site 56345 with latitude and longitude 37.75°, -121.68°. This dataset contains modeled data rather than measured data. Comparisons between the modeled data and measured data in King et al. [25] and Draxl et al. [26] showed that the PO data models the power output of wind resources appropriately and is representative of production by modern wind power plants. In this case the data is provided in 5 min intervals from 2007 to 2012. In order to maintain a data resolution equal to the one in the first case the data was averaged to 15 min bins.

The MySQL implementation for the second case differs from the first case in a few aspects:

• The tables that store data and data-derived values are expanded to accommodate the different set of exogenous variables (wind speed and direction, temperature, etc.).

• Given that there is no clear-sky normalization equivalent for the case of wind power we normalize the PO based on the wind power plant nominal value of 10 MW. Wind speed and temperature are normalized based on maximum values for 2007. Wind direction is normalized by 360° and pressure and density by the respective sea level values for a standard atmosphere.

• The target data from which the forecasts are created is not the normalized PO at the desired horizon but the step changes $x_n - x_{n-k}$. This was chosen to illustrate an alternative way to define the target data for the kNN forecast.

• There is no data filtering based on solar geometry.

• The persistence model in this case is the naïve persistence $\hat{x}_{n+k} = x_n$, no auto-regressive model is implemented, and the distance formula used to define the nearest neighbors is changed to reflect the different set of exogenous variables.

The complete MySQL code for this case can also be found in the Supplementary material. Once both MySQL databases were initialized, data were inserted, row by row, into the dataRaw table, automatically triggering the different procedures that populate all other tables. The execution time increases with the size of the tables (mainly due to the selection of the nearest neighbors); however, even with six years of data in the second case it takes approximately 1 s to produce the forecast for a new data entry.

3.1. PV generation forecasting

As indicated above the forecast error is updated automatically as new data are inserted into the dataRaw table. We tracked the performance of the different forecast models as a function of the size of the database. For this analysis we used common quantitative metrics for the point forecast, such as the root mean square error

$$\mathrm{RMSE}^f = \sqrt{\frac{1}{M} \sum_{n}^{M} \left( x_{n+k} - \hat{x}^f_{n+k} \right)^2} \tag{6}$$

where the superscript $f$ denotes the forecast model ($f \in \{\mathrm{Pers}, \mathrm{AR}, \mathrm{kNN}\}$). Forecast improvement relative to the reference persistence model is also computed [39]:

$$\mathrm{skill}^f = \left( 1 - \frac{\mathrm{RMSE}^f}{\mathrm{RMSE}^{\mathrm{Pers}}} \right) \times 100 \tag{7}$$
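Eqs. (6) and (7) translate directly into code. The following Python sketch is an illustration with made-up measurements and forecasts (the function names and all numbers are assumptions for the example); it is not the post-processing code used for the results reported here.

```python
import math

def rmse(measured, forecast):
    # Eq. (6): root mean square error over M paired values.
    squared = [(x - f) ** 2 for x, f in zip(measured, forecast)]
    return math.sqrt(sum(squared) / len(squared))

def skill(rmse_model, rmse_persistence):
    # Eq. (7): improvement over persistence, in percent.
    return (1.0 - rmse_model / rmse_persistence) * 100.0

measured = [100.0, 200.0, 300.0]
pers = [110.0, 190.0, 320.0]        # hypothetical persistence forecasts
knn = [105.0, 195.0, 310.0]         # hypothetical kNN forecasts

rmse_pers = rmse(measured, pers)
rmse_knn = rmse(measured, knn)
skill_knn = skill(rmse_knn, rmse_pers)
```

A positive skill means the model has a lower RMSE than persistence; by construction the persistence model itself has zero skill.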

These metrics were computed daily in two ways: using all the data available up to that day, and the preceding three months of data. The former reflects the overall performance of the models, whereas the latter allows the study of seasonal variations in the models' performance. Fig. 3 plots the global (dashed lines) and seasonal (solid lines) evolution of the RMSE for the three forecast models and four forecast horizons as a function of time, which is proportional to the database size. Fig. 4 shows the forecast improvement evolution for the AR and nearest neighbors forecast with respect to the persistence model. Initially the global or cumulative values oscillate substantially due to the small data set. Once there is between one and one and a half years of data in the database the values stabilize, although small variations can be observed. The seasonal metrics show much more variability with lower errors and higher skill in summer periods, and the reverse for winter. These plots show that, with sufficient data, the nearest neighbors model outperforms the other models for all forecast horizons.

In order to further analyze the performance of the kNN model we also recorded, for each time instance, the average and minimum distances (normalized by the maximum power output) for the 50 neighbors. The shaded band in Fig. 4 shows the 3-month rolling average for these two values. This seasonal depiction of the nearest-neighbors distance helps to explain the large swings in the seasonal kNN forecasting skill. Fig. 4 shows that, in summer periods, the minimum and average distances are substantially lower than in winter. Lower average and minimum distances indicate that closer matches were found during the nearest-neighbors identification, which in turn benefits the forecasting performance. The seasonal disparity also indicates that, most likely, the best set of features to determine the nearest neighbors differs with the season. This could be explored by subjecting the kNN model to some optimization procedure that segregates between seasons (not explored in this work).

Table 3
Triggers and stored procedures.

Name | Type | Description
filterData | Trigger | Triggers when new data is inserted into the dataRaw table. Filters the new data by removing physically impossible measurements, normalizes the data and updates the error columns in the forecast tables.
classifyData | Trigger | Triggers when new data is inserted into the dataNormalized table. Computes several statistical features using the most recent normalized data and inserts those values into the dataClass table. Initializes the rows for the dataTarget table and updates previous rows with the new PO measurements.
createForecasts | Trigger | Triggers when new data is inserted into the dataClass table. Computes the AR and kNN forecasts based on the latest data classification values and inserts them in the respective tables.
calcZenith | Procedure | Receives a DATETIME value and computes the corresponding zenith angle, $\theta_z$.
nnForecast | Procedure | Receives date and time, $\cos\theta_z$, the forecast horizon, and the number of neighbors to consider in the forecast. Computes the forecast and inserts it into the table forecastNN.
arForecast | Procedure | Receives date and time, $\cos\theta_z$, the forecast horizon, and the latest lagged values for PO. Computes the AR and persistence forecasts and inserts them into the tables forecastAR and forecastPer, respectively.

Nevertheless, despite some indications that the kNN model could be further improved, these results show that the proposed methodology can be used to successfully implement a nearest neighbor algorithm that significantly outperforms simpler auto-regressive models. In terms of how this model compares against the state of the art in short-term PV generation forecast, we resort to the recent comprehensive review by Antonanzas et al. [27]. To do so, we use the results compiled in Fig. 9 in that paper, on which we overlay the forecasting skills here reported (Fig. 5). As remarked above, this paper focuses more on providing a general forecasting infrastructure than on the forecasting accuracy. Nevertheless, this comparison clearly shows that our results compare well with results from competing models. In general, methods with better skill are based on more complex machine learning tools and/or include additional predictors in the model. Two of the methods that clearly show better skill than what was presented here are identified explicitly in Fig. 5. In the case of the work by Soubdhan et al. [28] the higher forecasting skill was obtained using a Kalman filtering approach with a twofold parameter tuning procedure and exogenous solar and weather data. In the case of Lorenz et al. [29] the authors used several exogenous variables (satellite data and numerical weather predictions) and support vector regression to improve the forecast.

Fig. 2. Entity-relationship diagram for the proposed database infrastructure. Relationships are indicated using "Crow's foot" notation.

Fig. 3. Variation of the RMSE for the four forecast models with time. Increasing time corresponds to increasing size of the database as new values are added sequentially. The different colors indicate the different forecast models. The dashed lines and the solid lines correspond to cumulative and seasonal metrics, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 5. Comparison against recent works in short term PV generation forecast. The


In the previous analysis we studied the point forecast accuracy of the proposed model. Another important feature of the nearest neighbor forecast implemented here is the ability to produce prediction intervals (Eq. (5)). Two quantitative performance parameters are often used to test the quality of the PIs: the PI coverage probability (PICP) [30] and the PI normalized average width (PINAW) [31].

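The PICP measures the probability that an actual PO value is within the prediction interval and is defined as:

$$\mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} \varepsilon_i \qquad (8)$$

where $N$ is the number of forecast instances and $\varepsilon_i$ is a boolean variable given by

$$\varepsilon_i = \begin{cases} 1 & \text{if } x_i \in [L_i, U_i] \\ 0 & \text{if } x_i \notin [L_i, U_i] \end{cases} \qquad (9)$$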

where $L_{n+k}$ and $U_{n+k}$ denote the lower and upper bounds of the prediction interval, respectively (written $L_i$ and $U_i$ for forecast instance $i$ in Eq. (9)). Often the PICP is compared against a nominal coverage used to create the PIs. In this case that is not done, given that the PIs result directly from the nearest neighbors, for which no nominal value is preset.

Fig. 4. Variation of the improvement of the AR and kNN forecast with respect to the persistence forecast as a function of time. The dashed lines and the solid lines correspond to cumulative and seasonal metrics, respectively. The band indicates normalized minimum (band's lower boundary) and average (band's upper boundary) distance for the 50 nearest neighbors in the kNN model for each forecast instance.

The second metric measures the PI width relative to the maximum range observed in the forecasted variable:

$$\mathrm{PINAW} = \frac{1}{N\,\mathrm{PO}_{\max}} \sum_{i=1}^{N} (U_i - L_i) \qquad (10)$$

where $\mathrm{PO}_{\max}$ is set to 1050 kW.

Good PIs should have large PICP and low PINAW. If PINAW is large and approaches the maximum possible range, the PIs provide little information, as it is trivial to say that a future PO value will be within its extreme values.
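The two PI metrics of Eqs. (8)-(10) can be illustrated with a short Python sketch. The function names and the four sample intervals are hypothetical and serve only to show the arithmetic; the default normalization of 1050 kW is the value quoted above.

```python
def picp(actual, lower, upper):
    """Eqs. (8)-(9): fraction of observations inside [L_i, U_i]."""
    hits = [1 if lo <= x <= hi else 0
            for x, lo, hi in zip(actual, lower, upper)]
    return sum(hits) / len(hits)

def pinaw(lower, upper, po_max=1050.0):
    """Eq. (10): mean PI width normalized by the maximum PO (kW)."""
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    return sum(widths) / (len(widths) * po_max)

# Hypothetical PO values and PI bounds, in kW.
actual = [100.0, 500.0, 900.0, 300.0]
lower = [50.0, 520.0, 800.0, 250.0]
upper = [150.0, 700.0, 1000.0, 460.0]

coverage = picp(actual, lower, upper)   # 3 of 4 values fall inside
width = pinaw(lower, upper)             # 690 / (4 * 1050)
print(coverage, width)
```

In this example one observation falls outside its interval, so the coverage is 0.75, while the average interval spans roughly 16% of the maximum PO.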

Fig. 6 shows these two metrics for the PIs computed using Eq. (5) for the four forecast horizons. The figures show similar features to the previous ones for the cumulative error metrics, with strong variations at the beginning followed by a quick convergence. The seasonal curves show larger variations, with lower PINAW and lower PICP in winter periods, and higher values in summer.

The figures show that the cumulative PICP converges to ≈95% regardless of the forecast horizon, whereas for PINAW the convergence value varies much more. For instance, the cumulative PINAW converges to ≈25% and ≈35% for the 15 and 60 min forecasts, respectively.

Fig. 5 (continued). The figure is adapted from Fig. 9 in Antonanzas et al. [27]. The sources from where these values were obtained are listed in the legend in Fig. 9 in Antonanzas et al. [27]. The black lines and symbols correspond to the results obtained with the methodology proposed here.

Fig. 6. Performance of the nearest neighbor forecasted PIs, measured by the metrics PINAW (left) and PICP (right), as a function of time. The different colors indicate the forecast horizon. The dashed lines and the solid lines correspond to cumulative and seasonal metrics, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

The evolution of the global PINAW with time shows a slight but consistent decrease as more data is inserted into the table, an effect not so evident in the evolution of the point forecast metrics. This fact shows that increasing the pool of nearest neighbor candidates helps to define the PIs better (narrower PIs) without decreasing the PICP (PICP remains constant as PINAW decreases). This is a desirable characteristic of these forecasts, although, as Fig. 3 shows, the forecast error does not improve in the long term.

Fig. 7. Daily RMSE and forecast skill for the 30 min forecast horizon for several days with similar daily PO profiles. The circles on the time line at the top of the plots indicate the selected days. The measured PO for each day is shown below the time line (solid black) together with the PI bounds calculated with the kNN model. The daily PINAW is indicated by the height of the shaded area behind the daily PO profile. The scale to the right of the profiles is used to quantify the daily PINAW. The scatter plots show the daily RMSE for the three models and the gray dashes indicate the skill for the kNN forecast. The skill is measured using the gray scale to the right of the scatter plot. The three subplots show days of different PO variability: (a) low, (b) medium, and (c) high.

In order to better observe the effect of an increasing pool of candidates available to the nearest neighbors model, we selected several days from the entire dataset for three levels of PO variability: low, medium and high. The days were selected by matching days with similar daily characteristics: RMSE of the persistence model, PO average, PO standard deviation, and PO line-length. In this analysis we use the 30 min forecast results. The same analysis for other horizons yields a similar outcome. Fig. 7 shows the measured PO for the selected days, the daily RMSE values for the persistence and kNN models, the forecasting skill, and the PI bounds as well as the daily PINAW value for the kNN model. The plots are ordered from top to bottom from low to high PO variability.

The figure illustrates the reduction in the width of the PI with the increase of the database size. That effect is especially noticeable in the case of the low variability days. The same effect is also noticeable in the medium variability days but to a lesser extent, and not noticeable in the high variability days.

Fig. 8. Same as Fig. 3 but for the wind power plant PO forecast. No AR model was implemented in this case.

Fig. 9. Same as Fig. 4 but for the wind power plant PO forecast. No AR model was implemented in this case.

Fig. 10. Same as Fig. 6 but for the wind power plant PO forecast.

For the low and medium variability days we observe that the daily RMSE for the persistence model is very stable, as it does not depend on the database length. In the case of the kNN model, also as expected, the highest RMSE values and lowest skills are observed for days near the beginning of the time line (due to the small pool of candidates).

In the case of the high variability days, Fig. 7(c) shows that, although the predictions by the kNN model are always better than persistence, the error reduction is much smaller than in the other two cases.

3.2. Wind generation forecasting

The performance analysis for the second test-case follows exactly the same procedure as in the previous case. Figs. 8 and 9 show the cumulative and seasonal RMSE and forecast skill, respectively, for the wind power forecasts.

The global skill values at the end of the experiments vary between ≈5% and 8%. These values are substantially lower than the ones obtained in the previous case. This is expected given the higher variability of wind power relative to PV power. The seasonal skills plotted in Fig. 9 show values above 15% and 20% in the summer in contrast to values below 5% in winter, indicating a strong seasonality in the kNN forecasting performance.

It is well documented [32–34] that for very short forecasting horizons such as the ones targeted in this work the naïve persistence model is very competitive and hard to beat. Table 4 compares the results obtained with the proposed methodology against metrics provided in recent works on wind power short-term forecasting. As in the case of PV generation forecast presented above, we can see that our results are comparable to the state of the art.

Table 4
Comparison against recent works in short-term wind power forecasting with similar data resolution and forecast horizon. Some values are approximated given that they are only reported in figures. The nRMSE metric is defined as the RMSE (Eq. (6)) divided by the power plant nominal capacity and multiplied by 100.

| Authors/year | Resolution | Horizon | Method | nRMSE [%] |
| Dowell and Pinson (2016) [35] | 5 min | 5 min | sparse vector autoregressive models | 3.95 |
| Dong et al. (2017) [36] a | 15 min | 15–120 min | hybrid (wavelet decomposition + linear fuzzy neural network) | 2.66–14.78 |
| Cavalcante et al. (2017) [37] | 1 h | 1 h | vector autoregression with LASSO | ≈11 |
| Yan et al. (2016) [38] | 1 h | 1–12 h | hybrid (temporally local moving window + Gaussian process) | ≈7–≈24 |
| Present work | 15 min | 15–60 min | kNN implemented in a MySQL infrastructure | 4.59–10.97 |

a Results from Experiment II which are comparable to the forecast configuration presented here.

The PI performance metrics plotted in Fig. 10 show a very similar behavior to the previous case. Again the cumulative PICP converges to 95% and the PINAW values show a consistent decrease with increasing data availability. A strong seasonality is also observed in this plot.

The tightening of the PIs is illustrated in Fig. 11 for 10 days with similar wind power profiles. It is possible to observe a slight but consistent reduction in the width of the PIs as the size of the database increases. Cases that depart from this trend, such as days 7 and 8, also show a poorer point forecast performance (negative skill in those cases). Nevertheless, the analysis of point forecast and PI bounds for this case shows a good overall performance relative to the persistence forecast, although in this case seasonal effects are much stronger than in the previous case.

Fig. 11. Same as Fig. 7 but for days characterized with medium variability wind power. The upper and lower PI bounds correspond to the 30 min forecast. The inverted triangles over the x-axis indicate negative forecasting skill.
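The nRMSE reported in Table 4 and the skill relative to persistence can be illustrated with the following Python sketch. The nRMSE follows the definition in the table caption; the skill is assumed here to take the common form 1 − RMSE/RMSE of persistence, expressed in percent, since its defining equation is not restated in this section. All numerical values are made up for illustration, with a 10 MW nominal capacity expressed in kW.

```python
import math

def rmse(forecast, actual):
    """Root mean square error between two equally long sequences."""
    n = len(actual)
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / n)

def nrmse_percent(forecast, actual, nominal_capacity):
    """RMSE divided by the nominal capacity, multiplied by 100."""
    return 100.0 * rmse(forecast, actual) / nominal_capacity

def skill_percent(rmse_model, rmse_persistence):
    """Assumed skill definition: improvement over persistence, in percent."""
    return 100.0 * (1.0 - rmse_model / rmse_persistence)

# Hypothetical wind power values in kW.
actual = [4000.0, 5000.0, 6000.0, 5500.0]
knn = [4300.0, 4600.0, 6000.0, 5500.0]
persistence = [4500.0, 5500.0, 6500.0, 6000.0]

knn_nrmse = nrmse_percent(knn, actual, nominal_capacity=10000.0)
knn_skill = skill_percent(rmse(knn, actual), rmse(persistence, actual))
print(knn_nrmse, knn_skill)
```

A negative skill, as flagged by the inverted triangles in Fig. 11, simply means that the model RMSE exceeded the persistence RMSE for that day.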

4. Conclusion

Here we demonstrate a MySQL database infrastructure to acquire, filter, classify, and forecast solar and wind power generation data. From this work we can conclude that MySQL can be used for much more than just storing data for solar forecast models. Any forecast model that can be expressed in an algebraic form can be implemented in this manner, as was demonstrated with the auto-regressive model used to predict solar power generation. By adapting the data classification and forecasting procedures presented here, it is possible to deploy many of the forecast models present in the literature in real-time. This can be very relevant in the production of stand-alone forecasting units that provide local monitoring and forecasting for renewable generation. For instance, in the case of solar generation, a simple hardware configuration consisting of a portable computing unit (e.g. BeagleBone or Raspberry Pi) and a photodiode sensor could host the proposed methodology.

The forecasts produced in this work may not be able to compete with more sophisticated approaches (sky imagers, cloud tracking, etc.) but they are robust backup approaches. The nearest neighbor algorithm is especially well suited for this database infrastructure, given that one of the key functions of MySQL is to query data based on some condition (in this case the similarity of data classification features). Nearest neighbor forecasts are easy to implement and perform much better than simple auto-regressive models. This model also shows a very desirable property: its forecasting performance improves (mostly the prediction intervals) as more data becomes available. This improvement is achieved automatically (no model training is necessary) by virtue of the increase in candidates from which nearest neighbors are selected and the forecast produced.

The error analysis and prediction interval quality assessment revealed a strong seasonality in both test-cases (stronger in the second). The lower performance in some seasons may be addressed, perhaps, by using different classification features when querying for the nearest neighbors in different seasons. This observation, and the fact that no classification feature selection, no optimization of the number of neighbors, and no optimization in the aggregation of the neighbors when computing the forecast were implemented, suggest that better forecasting performance is possible with this database infrastructure.

Acknowledgments

The authors gratefully acknowledge funding from the California Energy Commission PIER PON-13-303 program, which is managed by Dr. Silvia Palma-Rojas.

Appendix A. MySQL procedures

In this appendix we include some MySQL code snippets that illustrate some of the operations available for the purpose of this work. As mentioned, the complete code is available in the Supplementary material.

Appendix A.1. Procedure to compute solar zenith angle

The zenith angle $\theta_z$ for a given timestamp (in UTC) and geographical location is easily computed using the following MySQL procedure (global variables start with @ and are listed in Table 2):
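For orientation only, the Python sketch below shows one textbook way of estimating $\theta_z$ from a UTC timestamp, latitude and longitude. It is not the MySQL procedure of this work (which is provided in the Supplementary material): it assumes Cooper's approximation for the solar declination and neglects the equation of time, and the function name and arguments are hypothetical.

```python
import math
from datetime import datetime, timezone

def solar_zenith_deg(ts_utc, lat_deg, lon_deg):
    """Approximate solar zenith angle in degrees.

    Assumptions: ts_utc is expressed in UTC, declination follows Cooper's
    formula, and the equation of time is neglected. Longitude is positive
    east of Greenwich.
    """
    day_of_year = ts_utc.timetuple().tm_yday
    decl = math.radians(23.45) * math.sin(2.0 * math.pi * (284 + day_of_year) / 365.0)
    utc_hours = ts_utc.hour + ts_utc.minute / 60.0 + ts_utc.second / 3600.0
    # Approximate local solar time: 15 degrees of longitude per hour.
    solar_hours = utc_hours + lon_deg / 15.0
    hour_angle = math.radians(15.0 * (solar_hours - 12.0))
    lat = math.radians(lat_deg)
    cos_z = (math.sin(lat) * math.sin(decl)
             + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    cos_z = max(-1.0, min(1.0, cos_z))  # guard against rounding
    return math.degrees(math.acos(cos_z))

# Near the March equinox, at the equator and on the Greenwich meridian.
noon = solar_zenith_deg(datetime(2017, 3, 22, 12, 0, tzinfo=timezone.utc), 0.0, 0.0)
evening = solar_zenith_deg(datetime(2017, 3, 22, 18, 0, tzinfo=timezone.utc), 0.0, 0.0)
print(noon, evening)
```

Under these simplifications the sun is overhead at solar noon and on the horizon six hours later for the example location; a production implementation would normally include the equation of time and a more accurate declination model.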


Appendix A.2. Procedure to compute the AR forecast

The persistence and AR models are implemented using a simple MySQL procedure that determines the predicted values and inserts them into the respective forecastPer and forecastAR tables:
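The idea of evaluating an algebraic forecast inside the database and inserting the result into a forecast table can be illustrated with the sketch below. It uses Python's built-in sqlite3 module instead of a MySQL stored procedure, so the syntax differs from the actual implementation. Only the table names forecastPer and forecastAR come from the text; the data table, the column names, the 15 min step, the AR order (two lags) and the coefficients are assumptions made for the example.

```python
import sqlite3

STEP = 900  # assumed sampling interval: 15 min, in seconds

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE data        (ts INTEGER PRIMARY KEY, po REAL);  -- measurements
    CREATE TABLE forecastPer (ts INTEGER PRIMARY KEY, po REAL);  -- ts = target time
    CREATE TABLE forecastAR  (ts INTEGER PRIMARY KEY, po REAL);  -- ts = target time
""")
cur.executemany("INSERT INTO data VALUES (?, ?)",
                [(0, 400.0), (900, 420.0), (1800, 450.0)])

def insert_forecasts(cur, now, steps_ahead, a0, a1, a2):
    """Insert persistence and AR forecasts issued at time `now`."""
    target = now + steps_ahead * STEP
    # Persistence: the latest measurement is carried forward unchanged.
    cur.execute("INSERT INTO forecastPer SELECT ?, po FROM data WHERE ts = ?",
                (target, now))
    # Two-lag AR: a0 + a1 * PO(now) + a2 * PO(now - STEP), evaluated in SQL.
    cur.execute("""
        INSERT INTO forecastAR
        SELECT ?, ? + ? * d0.po + ? * d1.po
        FROM data AS d0
        JOIN data AS d1 ON d1.ts = d0.ts - ?
        WHERE d0.ts = ?""",
                (target, a0, a1, a2, STEP, now))

# Hypothetical coefficients; in practice they would be fitted beforehand.
insert_forecasts(cur, now=1800, steps_ahead=1, a0=5.0, a1=0.75, a2=0.25)
per_rows = cur.execute("SELECT ts, po FROM forecastPer").fetchall()
ar_rows = cur.execute("SELECT ts, po FROM forecastAR").fetchall()
print(per_rows, ar_rows)
```

Because the AR prediction is a linear combination of stored values, the whole forecast reduces to a single INSERT ... SELECT statement, which is the property that makes such models straightforward to host in the database.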

Appendix A.3. Procedure to compute the kNN forecast

The key procedure in this work computes the kNN forecast from the list of nearest neighbors. This procedure retrieves the target data depending on the forecasting horizon and aggregates those values using the arithmetic mean operator AVG(). The list of nearest neighbors is retrieved from the table tempTableNN, which is populated according to the code snippet provided in section 2.4.2.
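The aggregation step described above can be sketched as a single query. The example below again relies on Python's sqlite3 module rather than MySQL, and it assumes that tempTableNN stores the timestamps of the selected neighbors and that the measurements reside in a table named data; these column and table layouts, the 15 min step and the sample values are hypothetical.

```python
import sqlite3

STEP = 900  # assumed sampling interval: 15 min, in seconds

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE data        (ts INTEGER PRIMARY KEY, po REAL);  -- measurements
    CREATE TABLE tempTableNN (ts INTEGER PRIMARY KEY);           -- neighbor times
""")
po_series = [300.0, 320.0, 500.0, 540.0, 410.0, 430.0]
cur.executemany("INSERT INTO data VALUES (?, ?)",
                [(i * STEP, po) for i, po in enumerate(po_series)])
# Three past instants assumed to have been selected as nearest neighbors.
cur.executemany("INSERT INTO tempTableNN VALUES (?)",
                [(0,), (2 * STEP,), (4 * STEP,)])

def knn_forecast(cur, steps_ahead):
    """Average the PO observed `steps_ahead` steps after each neighbor."""
    return cur.execute("""
        SELECT AVG(d.po), COUNT(*)
        FROM tempTableNN AS nn
        JOIN data AS d ON d.ts = nn.ts + ?""",
                       (steps_ahead * STEP,)).fetchone()

po_hat, n_used = knn_forecast(cur, steps_ahead=1)
print(po_hat, n_used)
```

Changing steps_ahead shifts the join to a different horizon without touching the neighbor list, which mirrors the way the target data are retrieved "depending on the forecasting horizon".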

Appendix B. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.renene.2018.02.043.

References

[1] C.L. Archer, H.P. Simão, W. Kempton, W.B. Powell, M.J. Dvorak, The challenge of integrating offshore wind power in the U.S. electric grid. Part I: wind forecast error, Renew. Energy 103 (2017) 346–360.

[2] D. Halamay, T. Brekken, A. Simmons, S. McArthur, Reserve requirement impacts of large-scale integration of wind, solar, and ocean wave power generation, IEEE Trans. Sustain. Energy 2 (2011) 321–328.

[3] D. Neves, M.C. Brito, C.A. Silva, Impact of solar and wind forecast uncertainties on demand response of isolated microgrids, Renew. Energy 87 (2016) 1003–1015.

[4] A.M. Foley, P.G. Leahy, A. Marvuglia, E.J. McKeogh, Current methods and advances in forecasting of wind power generation, Renew. Energy 37 (2012) 1–8.

[5] R.H. Inman, H.T.C. Pedro, C.F.M. Coimbra, Solar forecasting methods for renewable energy integration, Prog. Energy Combust. Sci. 39 (2013) 535–576.

[6] J. Kleissl, Solar Energy Forecasting and Resource Assessment, Academic Press, 2013.

[7] G. Reikard, S.E. Haupt, T. Jensen, Forecasting ground-level irradiance over short horizons: time series, meteorological, and time-varying parameter models, Renew. Energy 112 (2017) 474–485.

[8] C. Voyant, G. Notton, S. Kalogirou, M.L. Nivet, C. Paoli, F. Motte, A. Fouilloy, Machine learning methods for solar radiation forecasting: a review, Renew. Energy 105 (2017) 569–582.

[9] H.T.C. Pedro, R.H. Inman, C.F.M. Coimbra, Mathematical methods for optimized solar forecasting, in: G. Kariniotakis (Ed.), Renewable Energy Forecasting, Woodhead Publishing, 2017, pp. 111–152. Woodhead Publishing Series in Energy.

[10] M.Q. Raza, M. Nadarajah, C. Ekanayake, On recent advances in PV output power forecast, Sol. Energy 136 (2016) 125–144.

[11] A. Barua, S.W. Thomas, A.E. Hassan, What are developers talking about? An analysis of topics and trends in Stack Overflow, Empir. Software Eng. 19 (2012) 619–654.

[12] MySQL 5.5, MySQL 5.5 Reference Manual, 2016. http://dev.mysql.com/doc/refman/5.5/en/. (Accessed June 2016).

[13] R. Zach, M. Schuss, R. Bruer, A. Mahdavi, Improving building monitoring using a data preprocessing storage engine based on MySQL, in: eWork and EBusiness in Architecture, Engineering and Construction, CRC Press, 2012, pp. 151–157.

[14] G. Anders, S.D. Mackowiak, M. Jens, J. Maaskola, A. Kuntzagk, N. Rajewsky, M. Landthaler, C. Dieterich, doRiNA: a database of RNA interactions in post-transcriptional regulation, Nucleic Acids Res. 40 (2012) 180–186.

[15] Y. Chu, M. Li, H.T.C. Pedro, C.F.M. Coimbra, Real-time prediction intervals for intra-hour DNI forecasts, Renew. Energy 83 (2015a) 234–244.

[16] Y. Chu, H.T.C. Pedro, M. Li, C.F.M. Coimbra, Real-time forecasting of solar irradiance ramps with smart image processing, Sol. Energy 114 (2015b) 91–104.

[17] SURFRAD, SURFRAD Data README, 2016. ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Boulder_CO/README. (Accessed June 2016).

[18] V. Badescu, C.A. Gueymard, S. Cheval, C. Oprea, M. Baciu, A. Dumitrescu, F. Iacobescu, I. Milos, C. Rada, Accuracy analysis for fifty-four clear-sky solar radiation models using routine hourly global irradiance measurements in Romania, Renew. Energy 55 (2013) 85–103.

[19] V. Badescu, Modeling Solar Radiation at the Earth's Surface. Springer, 2014.

[20] H.T.C. Pedro, C.F.M. Coimbra, Nearest-neighbor methodology for prediction of intra-hour global horizontal and direct normal irradiances, Renew. Energy 80 (2015a) 770–782.

[21] H.T.C. Pedro, C.F.M. Coimbra, Short-term irradiance forecastability for various solar micro-climates, Sol. Energy 122 (2015b) 587–602.

[22] S.A. Fatemi, A. Kuh, Solar radiation forecasting using zenith angle, in: 2013 IEEE Global Conference on Signal and Information Processing, 2013, pp. 523–526.

[23] MySQL, MySQL Community Server, 2016. https://dev.mysql.com/downloads/mysql/. (Accessed June 2016).

[24] NREL, Wind Integration National Dataset Toolkit, 2017. https://www.nrel.gov/grid/wind-toolkit.html.

[25] J. King, A. Clifton, B.M. Hodge, Validation of Power Output for the WIND Toolkit, Technical Report, National Renewable Energy Laboratory, 2014.

[26] C. Draxl, A. Clifton, B.M. Hodge, J. McCaa, The wind integration national dataset (WIND) Toolkit, Appl. Energy 151 (2015) 355–366.

[27] J. Antonanzas, N. Osorio, R. Escobar, R. Urraca, F.J. Martinez-de Pison, F. Antonanzas-Torres, Review of photovoltaic power forecasting, Sol. Energy 136 (2016) 78–111.

[28] T. Soubdhan, J. Ndong, H. Ould-Baba, M.T. Do, A robust forecasting framework based on the Kalman filtering approach with a twofold parameter tuning procedure: application to solar and photovoltaic prediction, Sol. Energy 131 (2016) 246–259.

[29] E. Lorenz, J. Kühnert, B. Wolff, A. Hammer, O. Kramer, D. Heinemann, PV power predictions on different spatial and temporal scales integrating pv measurements, satellite data and numerical weather predictions, in: Proceedings of the 29-th European Photovoltaic Solar Energy Conference and Exhibition (EU PVSEC), 2014, pp. 22–26.

[30] A. Khosravi, S. Nahavandi, D. Creighton, Construction of optimal prediction intervals for load forecasting problems, IEEE Trans. Power Syst. 25 (2010) 1496–1503.

[31] A. Khosravi, S. Nahavandi, D. Creighton, Prediction interval construction and optimization for adaptive neurofuzzy inference systems, IEEE Trans. Fuzzy Syst. 19 (2011) 983–988.

[32] R. Blonbou, Very short-term wind power forecasting with neural networks and adaptive Bayesian learning, Renew. Energy 36 (2011) 1118–1124.

[33] H. Madsen, P. Pinson, G. Kariniotakis, H.A. Nielsen, T.S. Nielsen, Standardizing the performance evaluation of short-term wind power prediction models, Wind Eng. 29 (2005) 475–489.

[34] M. Milligan, M. Schwartz, Y.h. Wan, Statistical Wind Power Forecasting Models: Results for US Wind Farms, Technical Report, National Renewable Energy Laboratory (NREL), Golden, CO, 2003.

[35] J. Dowell, P. Pinson, Very-short-term probabilistic wind power forecasts by sparse vector autoregression, IEEE Trans. Smart Grid 7 (2016) 763–770.

[36] Q. Dong, Y. Sun, P. Li, A novel forecasting model based on a hybrid processing strategy and an optimized local linear fuzzy neural network to make wind power forecasting: a case study of wind farms in China, Renew. Energy 102 (2017) 241–257.

[37] L. Cavalcante, R.J. Bessa, M. Reis, J. Browell, LASSO vector autoregression structures for very short-term wind power forecasting, Wind Energy 20 (2017) 657–675.

[38] J. Yan, K. Li, E.W. Bai, J. Deng, A.M. Foley, Hybrid probabilistic wind power forecasting using temporally local Gaussian process, IEEE Trans. Sustain. Energy 7 (2016) 87–95.

[39] R. Marquez, C.F.M. Coimbra, Proposed Metric for Evaluating Solar Forecasting Models, ASME J. Solar Energy Eng. 135 (2013) 0110161–0110169.





























