The problem of multicollinearity in horizontal solar radiation estimation models and a new model for Turkey

Energy Conversion and Management 84 (2014) 334–345

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/enconman

Haydar Demirhan *,1

Department of Statistics, Hacettepe University, Beytepe, Ankara 06800, Turkey

* Tel.: +90 312 2977900; fax: +90 312 2977913.
E-mail address: [email protected]

1 http://yunus.hacettepe.edu.tr/~haydarde

http://dx.doi.org/10.1016/j.enconman.2014.04.035
0196-8904/© 2014 Elsevier Ltd. All rights reserved.

Article info

Article history:
Received 18 February 2014
Accepted 11 April 2014

Keywords:
Eccentricity correction factor
Entropy
Eureqa
Genetic programming
Maximum possible sunshine duration
Model selection criteria
Solar declination angle
Statistical modelling

Abstract

Due to the considerable decrease in energy resources and increasing energy demand, solar energy is an appealing field of investment and research. There are various modelling strategies and particular models for the estimation of the amount of solar radiation reaching a particular point on the Earth. In this article, global solar radiation estimation models are taken into account. To emphasize the severity of the multicollinearity problem in solar radiation estimation models, some of the models developed for Turkey are revisited. It is observed that these models have been identified as accurate under certain multicollinearity structures, and when the multicollinearity is eliminated, the accuracy of these models is controversial. Thus, a reliable model that does not suffer from multicollinearity and gives precise estimates of global solar radiation for the whole region of Turkey is necessary. A new nonlinear model for the estimation of average daily horizontal solar radiation is proposed making use of the genetic programming technique. There is no multicollinearity problem in the new model, and its estimation accuracy is better than that of the revisited models in terms of numerous statistical performance measures. According to the proposed model, temperature, precipitation, altitude, longitude, and monthly average daily extraterrestrial horizontal solar radiation have a significant effect on the average daily global horizontal solar radiation. Relative humidity and soil temperature are not included in the model due to their high correlation with precipitation and temperature, respectively. While altitude has the highest relative impact on the average daily horizontal solar radiation, the impact of temperature is greater than that of both longitude and precipitation.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Solar energy is an important renewable energy source. It has essential effects on environmental processes. Therefore, it is directly related to indicators of the development of a country, such as cereal yield, crop production, forest area, gas emissions, and energy production. Because it is possible to convert solar energy to electricity without contributing to climate change or damaging water resources, it seems more beneficial than the other methods of electricity generation and attracts great attention [1]. Accurate estimation of the amount of solar radiation absorbed or reflected by certain points on the Earth is an appealing research area. Throughout the article, we refer to “horizontal global solar radiation” as “solar radiation” unless otherwise stated.

Scientists use various statistical modelling tools for estimation of the amount of solar radiation, and compare proposed models to identify the most accurate one. Because there are many meteorological and geographical variables, terrestrial factors, and many types of models that can be used in solar energy modelling, identification of an accurate model is a very complex task. Some of those variables are temperature, soil temperature, precipitation, relative humidity, cloudiness, cloud types, sunshine duration, longitude, latitude, altitude, evapotranspiration, the Earth’s distance from the sun, solar elevation angle, aerosol concentration in the atmosphere, ozone concentration, air pollution, and the ground albedo [2]. Linear, trigonometric, polynomial, or logical models with main effect and interaction terms can be used to model the amount of solar radiation reaching a particular point on the Earth. A group of models includes large scale solar energy atlases such as the European Solar Radiation Atlas (ESRA), the Solar Radiation Potential Atlas (GEPA), and the TEKNOLOGIS (see Demirhan et al. [3] for details). Each of these atlases has a specific estimation model and an algorithm, and covers large regions. Another group of models has been developed for a particular region of the Earth or for a limited time span. There are many models for some cities of Turkey. Dinçer et al. [4] proposed a model for the site of Gebze, Turkey.

Togrul and Onat [5] introduced a model for the estimation of solar radiation for Elazig, Turkey. Sen and Sahin [6] proposed a cumulative semivariogram approach for the month of January over 29 cities of Turkey. Saylan et al. [7] presented solar radiation estimates for three cities of Turkey. Sozen et al. [8] gave solar radiation estimates for 12 cities of Turkey by using artificial neural networks (ANN). Sozen et al. [9,10] proposed other models for Turkey over 17 meteorological stations using ANN. Sozen et al. [11] gave a modelling strategy over data of 18 cities of Turkey. Menges et al. [12] proposed a model for Konya, Turkey. Senkal and Kuleli [13] introduced a model for 12 cities of Turkey. Senkal [14] focused on nine cities of Turkey. Senkal et al. [15] gave solar radiation estimates for two cities of Turkey. Koca et al. [16] proposed a model for seven cities from the Mediterranean region of Turkey. Some models were proposed for specific countries. Jin et al. [17] proposed a generic model for the monthly average daily solar radiation for China. Ozgoren et al. [18] developed a model to estimate monthly mean daily sum solar radiation for Turkey. Khorasanizadeh and Mohammadi [19] focused on the region of Iran. It is possible to apply a model or modelling strategy developed for a particular region to another region of the Earth. Ertekin and Evrendilek [20] applied 18 existing models to estimate average daily solar radiation for Turkey. In an extensive and valuable work, Evrendilek and Ertekin [21] applied 78 existing models, in which 17 variables are considered, for the region of Turkey. This work also provides a comprehensive review of the models proposed for the estimation of the average daily horizontal global solar radiation. Evrendilek and Ertekin [21] identified successful models in terms of estimation accuracy for the region of Turkey. Sonmete et al. [22] considered 82 existing models for two cities of Turkey in a comparative case study.

Statistical modelling is a mechanical task. Almost every kind of model has its own assumptions. Violations of model assumptions affect the significance tests of model components, the accuracy of parameter estimates, and model selection tools. Distributional assumptions, influential observations, and possible multicollinearity structures between explanatory variables should be considered for a successful modelling task. The main problem in solar radiation estimation models is multicollinearity. When a variable appears in a model more than once, as in polynomial models, or if inter-correlated variables are included in the same model, strong collinearity structures are formed. For example, if temperature ($T$) and its square ($T^2$) are included in a model as explanatory variables at the same time, the terms $T$ and $T^2$ generate a collinearity pattern. If a model suffering from multicollinearity is used for prediction only at the points of the parameterization dataset, multicollinearity does not cause serious problems. However, if the aim of modelling is to figure out the process generating the dataset of interest, to identify the most suitable model, or to draw inferences from parameter estimates, the impact of multicollinearity is serious [23, p. 352]. As a result of multicollinearity, variances of parameter estimates inflate and small changes in observations cause considerable changes in the values of parameter estimates [24, p. 216]. Statistical measures and significance tests based on variances of estimators become unreliable; and hence, some significant variables can appear to be non-significant. Specifically, in a polynomial model, various transformations of explanatory variables can be applied to minimize the effect of multicollinearity [25].
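The collinearity pattern between $T$ and $T^2$ can be illustrated with a small numerical sketch. The temperatures below are simulated (hypothetical, not the station data), and mean-centring is used only as one common example of a transformation; it is an assumption that this resembles the transformations referred to in [25]. With two predictors, the variance inflation factor reduces to $1/(1 - r^2)$, where $r$ is their correlation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


random.seed(0)
# Hypothetical monthly mean temperatures in degrees Celsius (simulated).
temps = [random.uniform(5.0, 30.0) for _ in range(200)]

# Raw polynomial terms: T and T^2 are almost perfectly correlated.
vif_raw = vif_two_predictors(temps, [t * t for t in temps])

# Mean-centred terms: (T - mean) and (T - mean)^2 are nearly uncorrelated.
mean_t = sum(temps) / len(temps)
centred = [t - mean_t for t in temps]
vif_centred = vif_two_predictors(centred, [c * c for c in centred])

print(f"VIF with raw T and T^2:     {vif_raw:.1f}")
print(f"VIF with centred T and T^2: {vif_centred:.2f}")
```

In this simulated setting the raw terms give a VIF well above the usual threshold of 10, while the centred terms give a VIF close to 1.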

In the comparison of estimation accuracy of existing models, several statistical measures are calculated over parameterisation and validation datasets. Commonly used measures are coefficient of determination ($R^2$), its adjusted version ($R^2_{adj}$), mean percentage error ($e$), mean bias (MB), mean squared error (MSE), root mean square error (RMSE), relative percentage error (RPE), mean prediction bias (MPB), mean squared prediction error (MSPE), correlation coefficient, amount of toleration, average absolute bias (AAB), average absolute prediction bias (AAPB), average bias (AB), model selection criteria, entropy, one-way analysis of variance (ANOVA), and goodness of fit tests [3,20–22]. It is very important to use suitable measures for the comparison of models. Note that $R^2$, $R^2_{adj}$, RMSE, MSE, and ANOVA can be unreliable in the presence of multicollinearity; and hence, the measures based on both variance of estimators and bias are all untrustworthy.

In this article, estimation of the amount of average daily solar radiation over the region between 36° and 42° N latitudes and 26° and 42° E longitudes is taken into account. This region is called “Turkey” throughout the manuscript. Our aim is twofold. First, we would like to draw attention to the multicollinearity issue in solar radiation modelling. Evrendilek and Ertekin [20,21] identified several models that give accurate estimates for the average daily solar radiation over Turkey. These models were revisited, and the effects of multicollinearity were figured out and discussed over a dataset of 65 weather stations in Turkey. Appropriate transformations were made on explanatory variables to reduce the impact of multicollinearity, and the models were reapplied over the transformed dataset. In this way, more reliable versions of the models are obtained for the estimation of daily solar radiation. However, estimation performances of these models became unsatisfactory after the transformation used to eliminate multicollinearity. Therefore, we need a model that does not suffer from multicollinearity and, at the same time, gives more precise estimates of global solar radiation than the existing models. Based on the results of the multicollinearity analysis, our second aim is to derive a new model including logical and trigonometric terms for the estimation of average daily solar radiation by using the dataset of 65 locations in Turkey. The new model does not suffer from the multicollinearity problem. Estimation accuracy of our model is validated and compared with the previously proposed models. Consequently, it is observed that the new model gives more precise estimates and predictions for the amount of average daily solar radiation than the previously proposed models.

In the second section, the dataset is described, revisited models are illustrated, and statistical measures used to compare and evaluate candidate models are defined. In the third section, the models are fitted to our dataset, and the multicollinearity issue is evaluated and discussed. A new model is proposed for the estimation of solar radiation. Also, the new and existing models are compared and validated in terms of estimation and prediction accuracy. In the fourth section, conclusions are given.

2. Data, empirical models and statistical measures

2.1. Data description

Our dataset contains measurements of solar radiation at 65 climate stations of the Turkish State Meteorological Service (DMI) between 2000 and 2013. In these stations, solar radiation is recorded hourly by using pyranometers calibrated in the Calibration Centre of DMI, which is accredited by the Turkish Accreditation Agency. The recording period for solar radiation measurements is the same for all considered stations. Locations of the stations and quartiles of the distribution of altitudes of sites are shown in Fig. 1. Out of the 65 stations, 15 were randomly chosen and assigned to the validation dataset, and the remaining 50 stations constitute the parameterisation dataset. In Fig. 1, rectangles represent the validation stations, whereas circles represent the stations used for parameterisation.
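A random 15/50 split of this kind can be sketched as follows. This is only an illustration: the station identifiers and the seed are hypothetical, and the actual assignment used in the study is the one shown in Fig. 1.

```python
import random

# Hypothetical identifiers for the 65 stations (not the real station codes).
station_ids = list(range(1, 66))

# Arbitrary seed, used here only to make the illustration reproducible.
rng = random.Random(2014)

# Draw 15 stations without replacement for validation.
validation = sorted(rng.sample(station_ids, 15))
validation_set = set(validation)

# The remaining 50 stations form the parameterisation dataset.
parameterisation = [s for s in station_ids if s not in validation_set]

print(len(validation), len(parameterisation))
```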

Fig. 1. Locations of measurement sites and quartiles of the distribution of their altitudes.

We focus on the monthly average daily solar radiation on a horizontal surface (MJ/m²/day) as the dependent variable in our models. For each month, the mean level and 95% confidence interval of average daily horizontal global solar radiation over 65 observations are given in Fig. 2. Due to the climate of Turkey, higher variation is seen in the amount of recorded solar radiation for the months between May and September. Thus, it is expected to have more accurate estimates of solar radiation for November, December, January, and February.

For each month, the probability–probability (PP) plot of the average daily solar radiation measurements with a 95% confidence interval is given in Fig. 3. The P-value of the Kolmogorov–Smirnov (KS) test of normality is 0.01 for January, and the smallest P-value for the rest of the months is 0.06 for October. The general P-value for the KS test of averages is 0.16. According to the test results and PP plots, the normality assumption is not violated, and there is no outlier in the parameterisation dataset.

Fig. 2. 95% confidence intervals of average daily solar radiation for each month.

Descriptive statistics for the average daily solar radiation measurements are given in Table 1. Both sample mean and standard deviation obtained over the dataset support the inferences drawn from Fig. 1. Distributions of monthly solar radiation measurements are nearly symmetric and sharper than the standard normal distribution for April, May, July, and August. The distribution of the overall data is nearly symmetric and moderately flatter than the standard normal distribution. Note that while the standard deviation of the mean annual solar radiation measurements used by Evrendilek and Ertekin [21], which cover the period between 1968 and 2004, is 1.86, it is 2.66 in our dataset, which covers the period between 2000 and 2013. There is a significant difference between the variances of the amount of global solar radiation over these time periods at the 5% level of significance (P < 0.001). This implies that in the last ten years, there has been a considerable increase in the dispersion of the amount of horizontal solar radiation reaching ground level. It also signifies the severity of climate change.

Longitude ($\varphi$, decimal degrees), latitude ($\lambda$, decimal degrees), altitude ($Z$, m), precipitation (PPT, m), mean air temperature ($T$, °C), relative humidity (RH, %), and soil temperature (ST, °C) are considered as explanatory variables. Descriptive statistics of these variables are calculated over the parameterization dataset and given in Table 2. Only the distribution of PPT values is positively skewed and sharper than the standard normal distribution. Distributions of the other explanatory variables are nearly symmetric and moderately flatter than the standard normal distribution. The standard error of the mean of altitudes implies that our terrain of interest has a complex structure in terms of elevation.

Fig. 3. PP plot of the average daily solar radiation for each month.

Table 1
Descriptive statistics for average daily solar radiation measurements.

| Month | Mean | SE of mean | StDev | Min | Q1 | Median | Q3 | Max | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6.856 | 0.209 | 1.492 | 3.559 | 5.577 | 7.204 | 8.17 | 9.363 | -0.42 | -0.91 |
| 2 | 9.974 | 0.328 | 2.367 | 5.185 | 8.359 | 9.71 | 11.473 | 16.276 | 0.49 | 0.16 |
| 3 | 14.931 | 0.338 | 2.435 | 9.333 | 13.074 | 15.376 | 16.566 | 19.692 | -0.18 | -0.68 |
| 4 | 18.666 | 0.385 | 2.778 | 8.986 | 17.001 | 18.39 | 20.279 | 25.073 | -0.37 | 2.07 |
| 5 | 23.144 | 0.377 | 2.715 | 14.23 | 21.503 | 23.444 | 25.037 | 28.217 | -0.96 | 2.14 |
| 6 | 25.963 | 0.517 | 3.731 | 15.478 | 23.665 | 25.568 | 29.18 | 32.474 | -0.4 | 0.47 |
| 7 | 26.889 | 0.511 | 3.685 | 14.994 | 24.681 | 27.246 | 29.752 | 32.664 | -0.82 | 1.25 |
| 8 | 24.123 | 0.487 | 3.51 | 13.802 | 22.045 | 24.097 | 27.407 | 30.163 | -0.77 | 0.97 |
| 9 | 19.197 | 0.452 | 3.262 | 10.413 | 16.713 | 19.272 | 21.926 | 25.356 | -0.38 | -0.09 |
| 10 | 13.357 | 0.293 | 2.109 | 8.855 | 11.237 | 13.833 | 14.841 | 18.091 | -0.13 | -0.74 |
| 11 | 8.862 | 0.271 | 1.952 | 4.817 | 7.346 | 9.132 | 10.009 | 14.699 | 0.19 | 0.6 |
| 12 | 6.351 | 0.191 | 1.366 | 3.242 | 5.201 | 6.693 | 7.455 | 9.092 | -0.38 | -0.58 |
| Overall | 16.558 | 0.308 | 7.682 | 3.242 | 9.327 | 16.201 | 23.134 | 32.664 | 0.15 | -1.17 |

SE, standard error; StDev, standard deviation; Min, minimum; Q1, first quartile; Q3, third quartile; Max, maximum; Skew, skewness; Kurt, kurtosis.

Table 2
Descriptive statistics for explanatory variables.

| Variable | Mean | SE of mean | StDev | Min | Q1 | Median | Q3 | Max | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|---|---|
| $T$ | 13.21 | 0.35 | 8.67 | -10.80 | 6.28 | 13.22 | 20.33 | 31.18 | -0.12 | -0.80 |
| PPT | 0.47 | 0.01 | 0.37 | 0.00 | 0.22 | 0.41 | 0.60 | 2.72 | 1.86 | 5.90 |
| RH | 58.61 | 0.51 | 12.85 | 19.84 | 49.86 | 60.25 | 68.06 | 84.85 | -0.44 | -0.47 |
| ST | 0.69 | 0.03 | 0.79 | -1.62 | 0.08 | 0.70 | 1.28 | 2.53 | -0.06 | -0.56 |
| $\lambda$ | 38.87 | 0.06 | 1.51 | 36.07 | 37.72 | 38.74 | 40.14 | 41.68 | 0.08 | -1.04 |
| $\varphi$ | 34.26 | 0.19 | 4.74 | 26.40 | 30.56 | 33.78 | 38.11 | 44.09 | 0.28 | -0.76 |
| $Z$ | 717.3 | 20.70 | 515.80 | 1.00 | 93.00 | 802.0 | 1074 | 1764 | 0.00 | -0.99 |

In addition to the explanatory variables given in Table 2, the monthly average daily extraterrestrial horizontal solar radiation ($H_0$, MJ/m²/day), the maximum possible sunshine duration ($S_0$, h), and day length ($S$, h) are considered. The monthly average daily extraterrestrial horizontal solar radiation is calculated by Eq. (1) [21,26]:

$$H_0 = \frac{24}{\pi} I_{gs} f \left[\cos(\lambda)\cos(\delta)\sin(w_s) + \frac{\pi}{180} w_s \sin(\lambda)\sin(\delta)\right], \tag{1}$$

where $I_{gs}$ is the solar constant (1367 W/m²), $f$ is the eccentricity correction factor calculated by Eq. (2):

$$f = 1 + 0.033\cos\left(\frac{360n}{365}\right), \tag{2}$$

$\lambda$ is the latitude of the site, $\delta$ is the solar declination angle calculated by Eq. (3):

$$\delta = 23.45\sin\left[\frac{360(284 + n)}{365}\right], \tag{3}$$

$w_s$ is the mean sunrise hour angle for a given month calculated by Eq. (4):

$$w_s = \cos^{-1}\left[-\tan(\lambda)\tan(\delta)\right], \tag{4}$$

and $n$ is the day number of the year, starting from January 1. For a given month, the maximum possible sunshine duration is calculated by Eq. (5):

$$S_0 = \frac{2}{15} w_s. \tag{5}$$
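Eqs. (1)–(5) can be sketched in Python as follows. This is an illustrative sketch, not the code used in the study. It assumes that angles in Eqs. (2)–(4) are in degrees, that the solar constant is about 1367 W/m² and is converted to MJ/m²/h so that $H_0$ comes out in MJ/m²/day, and that the latitude is far enough from the poles for Eq. (4) to be defined. All function names are hypothetical.

```python
import math

SOLAR_CONSTANT_W = 1367.0  # W/m^2 (assumed value of the solar constant)
# Assumption: express the solar constant in MJ/m^2/h (1 W = 3600 J per hour).
I_GS = SOLAR_CONSTANT_W * 3600.0 / 1e6


def eccentricity_factor(n):
    """Eq. (2): eccentricity correction factor for day number n."""
    return 1.0 + 0.033 * math.cos(math.radians(360.0 * n / 365.0))


def declination_deg(n):
    """Eq. (3): solar declination angle in degrees for day number n."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + n) / 365.0))


def sunrise_hour_angle_deg(lat_deg, n):
    """Eq. (4): sunrise hour angle in degrees (undefined near the poles)."""
    lat = math.radians(lat_deg)
    dec = math.radians(declination_deg(n))
    return math.degrees(math.acos(-math.tan(lat) * math.tan(dec)))


def extraterrestrial_radiation(lat_deg, n):
    """Eq. (1): daily extraterrestrial horizontal radiation in MJ/m^2/day."""
    lat = math.radians(lat_deg)
    dec = math.radians(declination_deg(n))
    ws_deg = sunrise_hour_angle_deg(lat_deg, n)
    ws_rad = math.radians(ws_deg)  # equals (pi / 180) * w_s
    bracket = (math.cos(lat) * math.cos(dec) * math.sin(ws_rad)
               + ws_rad * math.sin(lat) * math.sin(dec))
    return (24.0 / math.pi) * I_GS * eccentricity_factor(n) * bracket


def max_sunshine_duration(lat_deg, n):
    """Eq. (5): maximum possible sunshine duration in hours."""
    return (2.0 / 15.0) * sunrise_hour_angle_deg(lat_deg, n)


# Example: a site at 39.9 degrees N on day 172 (around the June solstice).
print(round(extraterrestrial_radiation(39.9, 172), 2))
print(round(max_sunshine_duration(39.9, 172), 2))
```

For monthly values, a representative day number of each month would be passed as `n`; which day is used is a modelling choice not specified here.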

Various models are available for the calculation of day length. We use the CBM model. In the model, the revolution angle ($\theta$) is predicted from the day of the year ($J$), the sun’s declination angle ($\phi$) is predicted from the Earth orbit revolution angle, and day length ($S$) is predicted from latitude ($L$) and the sun’s declination angle. For a point on the Earth, the CBM model predicts day length at elevation zero with non-sloping ground, and provides accurate predictions of the day length for a wide range of latitudes [27]. The formulation of the CBM model is as follows [27]:

$$S = 24 - \frac{24}{\pi}\cos^{-1}\left[\frac{\sin\left(\frac{0.8333\pi}{180}\right) + \sin\left(\frac{L\pi}{180}\right)\sin(\phi)}{\cos\left(\frac{L\pi}{180}\right)\cos(\phi)}\right], \tag{6}$$

where $\phi$ is defined in radians as follows:

$$\phi = \sin^{-1}\left[0.39795\cos(\theta)\right],$$

where $\theta$ is defined as follows:

$$\theta = 0.2163108 + 2\tan^{-1}\left\{0.9671396\tan\left[0.0086\,(J - 186)\right]\right\}.$$
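As an illustration, Eq. (6) and the two auxiliary formulas translate directly into a short Python function. The function name and the example inputs are hypothetical; the constants are those given above, and the latitude is assumed to be in decimal degrees.

```python
import math


def cbm_day_length(lat_deg, day_of_year, p=0.8333):
    """Day length in hours from the CBM model, Eq. (6).

    lat_deg     : latitude L in decimal degrees
    day_of_year : J, counted from January 1
    p           : the constant 0.8333 (degrees) appearing in Eq. (6)
    """
    # Revolution angle theta (radians) from the day of the year.
    theta = 0.2163108 + 2.0 * math.atan(
        0.9671396 * math.tan(0.0086 * (day_of_year - 186)))
    # Sun's declination angle phi (radians).
    phi = math.asin(0.39795 * math.cos(theta))
    lat = math.radians(lat_deg)
    ratio = ((math.sin(math.radians(p)) + math.sin(lat) * math.sin(phi))
             / (math.cos(lat) * math.cos(phi)))
    return 24.0 - (24.0 / math.pi) * math.acos(ratio)


# Example: latitude 40 degrees N near the June and December solstices.
print(round(cbm_day_length(40.0, 172), 2))
print(round(cbm_day_length(40.0, 355), 2))
```

At the latitudes considered in this study (roughly 36° to 42° N) the argument of the inverse cosine stays within [-1, 1]; at polar latitudes it can leave that range and would need separate handling.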

2.2. Revisited models

Evrendilek and Ertekin [21] considered 17 explanatory variables along with 78 empirical models for the estimation of average daily solar radiation. They considered linear, quadratic, cubic, power, exponential, and hybrid models. In 24 models, at least one of the explanatory variables appears in the model equation more than once; and hence, these models are thought to suffer from multicollinearity. The MPE, MB, and minimum and maximum RPE values of these 24 models were evaluated to figure out models giving a suitable fit. Three appropriate models (original model numbers given by Evrendilek and Ertekin [21] are 22.1, 46, and 47) were identified, two of which were also indicated by Evrendilek and Ertekin [21] as generic models. Two more models were identified as suitable (original model numbers are 29 and 32) by the evaluation of RMSE and validation $R^2$ values of the remaining 54 models. As a result, out of the 78 models, we revisited five candidate models for the average daily solar radiation over Turkey. Model equations, original model numbers given by Evrendilek and Ertekin [21], and the reference for each model are presented in Table 3. In the models presented in Table 3, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ is the error term with constant variance $\sigma_\varepsilon^2$.

These models were fitted to the parameterisation dataset, and then their estimation and prediction accuracies were evaluated and compared over both parameterization and validation datasets.

Table 3
Original model number, equation and reference for each revisited empirical model.

| Model no. | Original model no. | Equation | Refs. |
|---|---|---|---|
| 1 | 22.1 | $\frac{H}{H_0} = \beta_0 + \beta_1 T + \beta_2 \mathrm{PPT} + \beta_3 \frac{S}{S_0} + \beta_4 \frac{S}{S_0} T + \beta_5 \frac{S}{S_0} \mathrm{PPT} + \varepsilon$ | [28] |
| 2 | 46 | $\frac{H}{H_0} = \beta_0 + \beta_1 T + \beta_2 \mathrm{PPT} + \beta_3 \mathrm{PPT}^2 + \beta_4 \frac{S}{S_0} + \beta_5 \frac{S}{S_0} T + \beta_6 \frac{S}{S_0} \mathrm{PPT} + \beta_7 \frac{S}{S_0} \mathrm{PPT}^2 + \varepsilon$ | [29] |
| 3 | 47 | $\frac{H}{H_0} = \beta_0 + \beta_1 T + \beta_2 T^2 + \beta_3 \mathrm{PPT} + \beta_4 \mathrm{PPT}^2 + \beta_5 \frac{S}{S_0} + \beta_6 \frac{S}{S_0} T + \beta_7 \frac{S}{S_0} T^2 + \beta_8 \frac{S}{S_0} \mathrm{PPT} + \beta_9 \frac{S}{S_0} \mathrm{PPT}^2 + \varepsilon$ | [29] |
| 4 | 29 | $H = \beta_0 + \beta_1 H_0 + \beta_2 \frac{S}{S_0} + \beta_3 \mathrm{RH} + \beta_4 \mathrm{ST} + \beta_5 T + \varepsilon$ | [5] |
| 5 | 32 | $H = \beta_0 + \beta_1 H_0 + \beta_2 \frac{S}{S_0} + \beta_3 \sin(\delta) + \beta_4 \mathrm{RH} + \beta_5 \mathrm{ST} + \beta_6 T + \varepsilon$ | [30] |

2.3. Statistical measures used for model comparison and evaluation

The MSE, $R^2_{adj}$, $e$ (%), MB (MJ/m²/day), AAB (MJ/m²/day), AAPB (MJ/m²/day), MSPE, Akaike Information Criterion (AIC), Schwarz Bayesian Criterion (SBC), and entropy were used to compare the proposed model and those given in Table 3 with each other. Entropy is a measure of uncertainty in a random variable (rv) and is calculated in units of NAT. Because entropy is a calibrated measure, it is possible to compare a number of models by using entropy. The model with the lowest entropy gives the best fit [3].

Let $N$ be the size of the parameterization dataset; $M$ be the size of the validation dataset; for the site $i$, $y_i$ and $\hat{y}_i$ be the actual and estimated values of solar radiation over the parameterization dataset, respectively; $z_i$ and $\hat{z}_i$ be the actual and estimated values of solar radiation over the validation dataset, respectively; $S_r^2$ be the variance of errors; $\bar{r} = (1/N)\sum_{i=1}^{N}(y_i - \hat{y}_i)$; and $p(x)$ be the probability function of an rv. Formulations of the measures used for model evaluation and comparison are given in Table 4 [23]. To calculate MSPE, $\bar{r}$ and $S_r^2$ are calculated over the validation dataset, and $k$ is the number of parameters in the considered model.

There are several diagnostic tools to detect multicollinearity, such as the condition index, the variance inflation factor, and eigenanalysis. Let $X$ be an $N \times (k + 1)$ design matrix, $\eta_i$, $i = 1, \ldots, (k + 1)$, be the $i$th eigenvalue of the matrix $X'X$, and $\nu_{ij}$ be the $j$th element of the $i$th eigenvector. The condition index for the $i$th dimension is obtained by $c_i = \sqrt{\max(\eta_1, \ldots, \eta_{k+1})/\eta_i}$. A condition index of 10, 30, or 100 indicates weak, moderate, or strong dependency between explanatory variables, respectively; and hence, the case $c_i > 100$ indicates a serious multicollinearity problem. The variance inflation factor for the $j$th regression coefficient ($\mathrm{VIF}_j$) is another common diagnostic tool of multicollinearity. It is calculated by $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from the regression of the $j$th variable on the other explanatory variables. If $\mathrm{VIF}_j > 10$, then the $j$th regression coefficient is seriously affected by a collinearity structure. The structure of multicollinearity can be detected by eigenanalysis. It can also be carried out over the standardized design matrix. Eigenvalues near zero indicate the variables suffering from multicollinearity. The variance proportion of the $j$th explanatory variable attributed to the collinearity structure related to the eigenvalue $\eta_i$ is calculated by $p_{ji} = (\nu_{ij}^2/\eta_i)\big/\sum_{r=1}^{k+1}(\nu_{rj}^2/\eta_r)$. Thus, an eigenvalue near zero attached to high variance proportions indicates a collinearity structure that damages variance estimates of the corresponding variables [23,24,31].

3. Modelling

3.1. Evaluation of existing models

Models 1–5 of Table 3 were fitted to the parameterisation dataset, and the statistical measures given in Table 4 were calculated over both parameterisation and validation datasets. The results are given in Table 5.
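The multicollinearity diagnostics defined in Section 2.3 can be illustrated in the simplest setting of two standardized explanatory variables with correlation $r$. In that case the correlation matrix has eigenvalues $1 + r$ and $1 - r$, and every squared eigenvector element equals 1/2, so all three diagnostics have closed forms. This is a sketch under that two-variable assumption, not the general $(k + 1)$-dimensional computation used for Tables 6–10; the function name is hypothetical.

```python
import math


def two_predictor_diagnostics(r):
    """Diagnostics for two standardized predictors with correlation r, |r| < 1.

    Returns (eigenvalues, condition_indices, vif, variance_proportions),
    with eigenvalues sorted in decreasing order.
    """
    # Eigenvalues of the 2 x 2 correlation matrix [[1, r], [r, 1]].
    eigenvalues = sorted([1.0 + r, 1.0 - r], reverse=True)
    # Condition index c_i = sqrt(max eigenvalue / eigenvalue_i).
    condition_indices = [math.sqrt(eigenvalues[0] / eta) for eta in eigenvalues]
    # VIF = 1 / (1 - R^2); with two predictors R^2 equals r^2.
    vif = 1.0 / (1.0 - r * r)
    # Squared eigenvector elements are 1/2 for both eigenvectors, so the
    # variance proportions are identical for the two coefficients.
    weights = [0.5 / eta for eta in eigenvalues]
    total = sum(weights)  # equals the VIF
    variance_proportions = [w / total for w in weights]
    return eigenvalues, condition_indices, vif, variance_proportions


eigenvalues, condition_indices, vif, proportions = two_predictor_diagnostics(0.98)
print("eigenvalues:         ", [round(v, 3) for v in eigenvalues])
print("condition indices:   ", [round(c, 2) for c in condition_indices])
print("VIF:                 ", round(vif, 2))
print("variance proportions:", [round(p, 3) for p in proportions])
```

With $r = 0.98$, almost all of the variance of each coefficient is attributed to the near-zero eigenvalue, which is the pattern that the variance proportions in Tables 6–10 are used to detect.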

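Table 4
Formulae for statistical model comparison measures.

| Measure | Formulation | Eq. |
|---|---|---|
| MSE | $\mathrm{MSE} = \frac{(N-1)S_r^2}{N} + \bar{r}^2$ | (7) |
| $R^2_{adj}$ | $R^2_{adj} = 1 - \frac{\bar{r}}{S_r^2}$ | (8) |
| $e$ | $e = \frac{\lvert y_i - \hat{y}_i \rvert}{y_i} \times 100$ | (9) |
| RPE | $\mathrm{RPE} = \frac{\lvert z_i - \hat{z}_i \rvert}{z_i} \times 100$ | (10) |
| AAB | $\mathrm{AAB} = (1/N)\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ | (11) |
| AAPB | $\mathrm{AAPB} = (1/M)\sum_{i=1}^{M} \lvert z_i - \hat{z}_i \rvert$ | (12) |
| MSPE | $\mathrm{MSPE} = \frac{(M-1)S_r^2}{M} + \bar{r}^2$ | (13) |
| AIC | $\mathrm{AIC} = N\log(S_r^2/N) + 2k$ | (14) |
| SBC | $\mathrm{SBC} = N\log(S_r^2) + k\log(N)$ | (15) |
| Entropy | $H(X) = -\sum p(x)\log(p(x))$ | (16) |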
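Several of the measures in Table 4 can be sketched in Python as follows. This is an illustration under stated assumptions: $S_r^2$ is taken as the sample variance of the residuals with an $N - 1$ denominator (which makes Eq. (7) equal to the mean of the squared residuals), logarithms are natural (consistent with entropy in NAT), and all function names and the small example data are hypothetical.

```python
import math


def residual_stats(actual, estimated):
    """Mean residual (r-bar) and sample variance of residuals (S_r^2)."""
    residuals = [a - e for a, e in zip(actual, estimated)]
    n = len(residuals)
    mean_r = sum(residuals) / n
    var_r = sum((r - mean_r) ** 2 for r in residuals) / (n - 1)
    return mean_r, var_r


def mse(actual, estimated):
    """Eq. (7); the same form gives MSPE, Eq. (13), on validation data."""
    n = len(actual)
    mean_r, var_r = residual_stats(actual, estimated)
    return (n - 1) * var_r / n + mean_r ** 2


def average_absolute_bias(actual, estimated):
    """Eq. (11); the same form gives AAPB, Eq. (12), on validation data."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def percentage_errors(actual, estimated):
    """Eqs. (9) and (10): per-site absolute percentage errors."""
    return [abs(a - e) / a * 100.0 for a, e in zip(actual, estimated)]


def aic(actual, estimated, k):
    """Eq. (14), with k parameters in the model."""
    n = len(actual)
    _, var_r = residual_stats(actual, estimated)
    return n * math.log(var_r / n) + 2 * k


def sbc(actual, estimated, k):
    """Eq. (15), with k parameters in the model."""
    n = len(actual)
    _, var_r = residual_stats(actual, estimated)
    return n * math.log(var_r) + k * math.log(n)


def entropy(probabilities):
    """Eq. (16) in NAT; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Hypothetical observed and estimated values (MJ/m^2/day).
observed = [10.0, 12.0, 14.0, 16.0]
fitted = [9.0, 13.0, 14.0, 18.0]
print(mse(observed, fitted), average_absolute_bias(observed, fitted))
```

The minimum, median, and maximum of `percentage_errors` correspond to the Min, Median, and Max rows reported for $e$ and RPE in Table 5.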

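Table 6
Parameter estimates, significance P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 1.

| Coeff. | Est. | SE. | P | VIF | CI |
|---|---|---|---|---|---|
| $\beta_0$ | -1381.03 | 291.52 | 0.00 | 273,097 | 1.00 |
| $\beta_1$ | 24.06 | 15.78 | 0.13 | 287,099 | 2.45 |
| $\beta_2$ | 939.73 | 298.47 | 0.00 | 60,631 | 7.28 |
| $\beta_3$ | 25.28 | 5.05 | 0.00 | 277,806 | 358 |
| $\beta_4$ | -0.42 | 0.27 | 0.12 | 289,331 | 1057 |
| $\beta_5$ | -16.54 | 5.15 | 0.00 | 60,757 | 1994 |

| Coeff. | Eigenvalue | Variance proportion: $\beta_0$ | $\beta_1$ | $\beta_2$ | $\beta_3$ | $\beta_4$ | $\beta_5$ |
|---|---|---|---|---|---|---|---|
| $\beta_0$ | >200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| $\beta_1$ | >200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| $\beta_2$ | 188.6 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| $\beta_3$ | 0.078 | 0.053 | 0.001 | 0.121 | 0.053 | 0.000 | 0.122 |
| $\beta_4$ | 0.009 | 0.536 | 0.035 | 0.760 | 0.536 | 0.033 | 0.756 |
| $\beta_5$ | 0.003 | 0.411 | 0.964 | 0.119 | 0.412 | 0.967 | 0.122 |

Coeff., coefficient; Est., parameter estimate; SE., standard error of coefficient; P, significance value of t-test; CI, condition index.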

The $R^2_{adj}$ values for the models 1–3 are high over both parameterisation and validation datasets. Values of $R^2_{adj}$ calculated over the validation dataset are smaller than those calculated from the parameterization dataset for the models 1–3. However, it is notable that there is no considerable difference between the $R^2_{adj}$ values of the models 4 and 5 obtained over parameterisation and validation datasets. For all models, bias and absolute bias estimates are satisfactory. Although maximum mean percentage error values are very high over both datasets, median mean percentage error values are acceptable. Over the parameterisation dataset, $S_r^2$ is calculated as 10.67, 10.57, 10.47, 9.54, and 9.47 for the models 1–5, respectively.

For models 1–5, parameter estimates, their significance tests, VIF, condition index, eigenvalues, and variance proportions are given in Tables 6–10, respectively.

For model 1, the effect of temperature on the amount of solar radiation is found insignificant at the 5% level of significance. However, VIF (>10), CI (>100), and eigenvalues (≅0) for $\beta_3$, $\beta_4$, and $\beta_5$ indicate three collinearity structures. The first one is weak, includes $\beta_2$ and $\beta_5$, and is caused by the inclusion of both PPT and $(S/S_0)PPT$ in the model. The second one is moderate, includes $\beta_0$, $\beta_2$, $\beta_3$, and $\beta_5$, and is caused by the inclusion of the terms $H_0$, PPT, $S/S_0$, and $(S/S_0)PPT$ in the model. The last one is strong, includes $\beta_1$ and $\beta_4$, and is caused by the inclusion of both T and $(S/S_0)T$ in the model. Therefore, the $R^2_{adj}$ values, variance estimates, and significance values of the tests given under the column "P" of Table 6 are unreliable.

For model 2, the quadratic effect of precipitation on the amount of horizontal solar radiation is found insignificant at the 5% level of significance. However, VIF (>10), CI (>100), and eigenvalues (≅0) for $\beta_4, \ldots, \beta_7$ suggest three collinearity structures, similar to model 1. In Table 7, boldfaced variance proportions show collinearity structures. Each term of the model is included in at least one collinearity structure. Because the terms $PPT^2$ and $(S/S_0)PPT^2$ are in a strong collinearity structure, the results of their significance tests are unreliable, as is the $R^2_{adj}$ value of the model.

For model 3, the effect of temperature on the amount of horizontal solar radiation is found insignificant at the 5% level of significance. According to VIF (>10), CI (>100), and eigenvalues (≅0) for $\beta_5, \ldots, \beta_9$, there are two weak, one moderate, and one strong collinearity structure. Because each term of the model is included in at least one collinearity structure, the $R^2_{adj}$ of the model and the significance tests are unreliable.

Table 5
Values of statistical measures for the models 1–5.

Dataset           Measure            Model 1    Model 2    Model 3    Model 4    Model 5
Parameterisation  MB (MJ/m²/day)     0.011      0.015      0.007      0.000      0.000
                  AAB (MJ/m²/day)    2.154      2.147      2.127      2.092      2.085
                  MSE                10.65      10.55      10.46      9.522      9.458
                  R²adj              0.967      0.967      0.968      0.841      0.843
                  Min(e) (%)         0.042      0.035      0.031      0.027      0.090
                  Median(e) (%)      10.77      10.83      10.02      10.079     10.29
                  Max(e) (%)         702.0      686.2      670.8      589.9      596.2
                  AIC                −1117.01   −1115.75   −1114.34   −1148.10   −1149.95
                  SBC                670.38     673.25     676.26     639.29     637.44
                  Entropy (NAT)      1.746      1.765      1.777      1.765      1.762
Validation        MPB (MJ/m²/day)    −1.031     −0.762     −0.747     −0.817     −0.817
                  AAPB (MJ/m²/day)   5.405      5.247      5.278      5.729      5.738
                  MSPE               56.75      52.19      52.24      61.381     61.175
                  R²adj              0.849      0.860      0.859      0.836      0.837
                  Min(RPE) (%)       0.017      0.161      0.031      1.161      0.090
                  Median(RPE) (%)    29.59      29.46      31.00      97.928     29.93
                  Max(RPE) (%)       699.3      686.2      670.7      687.2      596.2
                  AIC                −152.78    −157.11    −152.97    −143.33    −143.73
                  SBC                456.41     452.04     456.94     464.99     464.59
                  Entropy (NAT)      1.571      1.767      1.759      1.739      1.728


Table 7
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 2.

Coeff.  Est.       SE.      P     VIF      CI
β0      −1531.53   374.36   0.00  453,401  1
β1      31.71      16.05    0.05  299,140  2.48
β2      1490.55    727.83   0.04  362,978  6.71
β3      −848.79    465.89   0.07  146,174  13.92
β4      27.96      6.46     0.00  458,437  524.9
β5      −0.55      0.28     0.05  301,853  1305
β6      −26.3      12.53    0.04  361,533  2414
β7      14.87      8.08     0.07  146,157  3569

                    Variance proportions
Coeff.  Eigenvalue  β0     β1     β2     β3     β4     β5     β6     β7
β0      >200        0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β1      >200        0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β2      >200        0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β3      87.01       0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β4      0.061       0.046  0.001  0.019  0.005  0.047  0.001  0.018  0.006
β5      0.010       0.222  0.007  0.031  0.159  0.219  0.007  0.029  0.163
β6      0.003       0.011  0.776  0.073  0.082  0.012  0.774  0.072  0.081
β7      0.001       0.722  0.216  0.877  0.754  0.722  0.219  0.881  0.749

Table 8
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 3.

Coeff.  Est.       SE.      P     VIF        CI
β0      −1745.87   481.22   0.00  753,837    1.00
β1      −38.78     34.94    0.27  1,425,865  2.38
β2      3.95       1.63     0.02  1,581,761  5.64
β3      2317.04    805.47   0.00  447,311    14.32
β4      −1143.48   484.82   0.02  159,276    29.07
β5      31.62      8.31     0.00  763,685    592.9
β6      0.66       0.60     0.28  1,451,163  1416
β7      −0.07      0.03     0.02  1,589,894  2720
β8      −40.49     13.85    0.00  444,776    3969
β9      19.94      8.41     0.02  159,261    8758

                    Variance proportions
Coeff.  Eigenvalue  β0     β1     β2     β3     β4     β5     β6     β7     β8     β9
β0      >200        0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β1      >200        0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β2      >200        0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β3      90.08       0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β4      21.86       0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
β5      0.053       0.027  0.000  0.000  0.018  0.008  0.028  0.000  0.000  0.017  0.009
β6      0.009       0.121  0.008  0.004  0.027  0.113  0.119  0.008  0.004  0.025  0.117
β7      0.002       0.074  0.059  0.041  0.099  0.346  0.075  0.062  0.039  0.101  0.347
β8      0.001       0.768  0.089  0.010  0.669  0.454  0.769  0.092  0.009  0.670  0.449
β9      0.000       0.011  0.843  0.946  0.187  0.080  0.010  0.838  0.948  0.186  0.078

Table 9
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 4.

Coeff.  Est.     SE.     P      VIF     CI
β0      −9.961   28.971  0.731  –       1.000
β1      70.637   2.930   0.000  3.525   2.761
β2      0.223    0.511   0.663  4.432   9.667
β3      −0.093   0.020   0.000  4.183   18.291
β4      −5.590   1.076   0.000  47.615  48.524
β5      0.543    0.122   0.000  73.772  771.10

                    Variance proportions
Coeff.  Eigenvalue  β0     β1     β2     β3     β4     β5
β0      5.239       0.000  0.001  0.000  0.000  0.000  0.000
β1      0.687       0.000  0.000  0.000  0.003  0.008  0.001
β2      0.056       0.000  0.352  0.000  0.042  0.036  0.000
β3      0.016       0.000  0.419  0.000  0.180  0.057  0.051
β4      0.002       0.001  0.044  0.001  0.773  0.857  0.860
β5      0.000       0.999  0.183  0.999  0.002  0.042  0.087

Coeff., coefficient; Est., parameter estimate; SE., standard error of coefficient; P, significance value of t-test; CI, condition index.

Table 10
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 5.

Coeff.  Est.     SE.     P      VIF     CI
β0      −0.961   29.227  0.974  –       1.000
β1      73.341   3.205   0.000  4.239   2.154
β2      0.061    0.515   0.906  4.538   3.112
β3      0.470    0.229   0.040  1.360   9.908
β4      −0.091   0.020   0.000  4.195   19.500
β5      −5.458   1.075   0.000  47.785  48.791
β6      0.511    0.123   0.000  74.987  781.293

                    Variance proportions
Coeff.  Eigenvalue  β0     β1     β2     β3     β4     β5     β6
β0      5.255       0.000  0.001  0.000  0.001  0.000  0.000  0.000
β1      1.133       0.000  0.000  0.000  0.485  0.001  0.001  0.000
β2      0.543       0.000  0.000  0.000  0.327  0.003  0.008  0.001
β3      0.054       0.000  0.284  0.000  0.039  0.047  0.043  0.000
β4      0.014       0.000  0.463  0.000  0.118  0.193  0.047  0.055
β5      0.002       0.001  0.051  0.001  0.008  0.753  0.864  0.870
β6      0.000       0.999  0.201  0.999  0.023  0.002  0.037  0.073

Coeff., coefficient; Est., parameter estimate; SE., standard error of coefficient; P, significance value of t-test; CI, condition index.


The effect of $S/S_0$ on the amount of solar radiation is found insignificant at the 5% level of significance in both models 4 and 5. Two VIF values, two eigenvalues, and one CI value exceed their threshold values in both models. Thus, at least two collinearity structures are expected in each model. When the variance proportions corresponding to the eigenvalues close to zero are taken into account, it is seen that the cause of multicollinearity is the inclusion of both soil temperature and temperature in the models. Because the correlation between ST and T is 0.976 (P = 0.000), there is a strong inter-relation between soil temperature and temperature. The information on the amount of solar radiation contained in each of these variables is nearly equivalent. This is the cause of the strong collinearity. The $R^2_{adj}$ values of the models and the significance tests are expected to be moderately affected by this collinearity structure.
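As a rough worked check (a sketch based on the standard identity $VIF_j = 1/(1 - R_j^2)$, not a computation reported in the article), the pairwise correlation between ST and T alone already bounds the VIF of either variable from below. Since the $R_j^2$ obtained by regressing one explanatory variable on all the others cannot be smaller than its squared correlation with any single one of them,

```latex
\[
VIF_j = \frac{1}{1 - R_j^2} \;\ge\; \frac{1}{1 - r_{ST,T}^2}
      = \frac{1}{1 - 0.976^2}
      = \frac{1}{0.047424}
      \approx 21.1 > 10 .
\]
```

This lower bound is consistent with the larger VIF values reported for the corresponding coefficients in Tables 9 and 10, where the remaining terms of the models also contribute.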

To overcome these collinearity issues, various transformations were applied to the original variables. We applied the transformations suggested by Shacham and Brauner [25], but those did not work for the mentioned collinearity structures. Then, the original variables were simply centred by subtracting their means from their original values. Parameter estimates, their significance tests, VIF, condition index, and eigenvalues for the centred versions of models 1–3 are given in Tables 11–13, respectively. Also, values of the statistical performance measures are given in Table 16 for all models.
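The effect of centring on a polynomial term can be illustrated with a small sketch. The example below uses a hypothetical, evenly spaced temperature series and only two explanatory variables (a variable and its square), in which case the VIF of both reduces to $1/(1 - r^2)$ with $r$ their Pearson correlation. The data and function names are assumptions made for illustration; the article's models contain more terms.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when a model has exactly two of them."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical temperatures 5, 6, ..., 29 (illustrative values only).
t = [float(v) for v in range(5, 30)]
raw_vif = vif_two_predictors(t, [v * v for v in t])

# Centre first, then form the quadratic term from the centred variable.
mean_t = sum(t) / len(t)
tc = [v - mean_t for v in t]
centred_vif = vif_two_predictors(tc, [v * v for v in tc])

print(raw_vif, centred_vif)
```

For this symmetric series the centred variable and its square are uncorrelated, so the VIF drops to 1, whereas the uncentred pair exceeds the threshold of 10 used in the text. With real, asymmetric data the reduction would be less complete.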

Small VIF and CI values and high eigenvalues imply that the centred version of model 1 does not suffer from a collinearity structure between explanatory variables. However, the significance test results and the values of the statistical measures change. Precipitation, rather than temperature, is now found to have an insignificant effect on the amount of solar radiation. The value of $R^2_{adj}$ decreases dramatically to 0.623, and $S_r^2$ increases to 75.11 for model 1.

Table 11
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 1 over centred data.

Coeff.  Est.     SE.    P     VIF   CI    Eigenvalue
β0      58.24    10.35  0.00  3.42  1.00  >200
β1      8.47     1.61   0.00  7.50  1.59  >200
β2      −23.91   29.68  0.42  4.58  1.71  >200
β3      179.07   27.34  0.00  8.82  3.87  187.3
β4      11.43    2.08   0.00  6.10  6.76  61.48
β5      −47.11   43.03  0.27  4.54  6.88  59.22

Table 12
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 2 over centred data.

Coeff.  Est.      SE.    P     VIF    CI     Eigenvalue
β0      66.05     10.89  0.00  3.85   1.00   >200
β1      10.52     1.76   0.00  9.04   1.35   >200
β2      7.96      33.60  0.81  5.96   1.73   >200
β3      −164.79   49.38  0.00  18.58  4.08   191.8
β4      184.85    28.60  0.00  9.79   4.19   181.6
β5      13.30     2.19   0.00  6.84   7.11   63.15
β6      −14.50    53.78  0.79  7.20   8.26   46.73
β7      −209.89   70.79  0.00  20.56  13.84  16.64

Table 13
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the model 3 over centred data.

Coeff.  Est.      SE.    P     VIF    CI     Eigenvalue
β0      79.03     11.23  0.00  4.35   1.00   >200
β1      13.54     2.01   0.00  12.66  1.35   >200
β2      −0.77     0.14   0.00  10.19  1.59   >200
β3      −84.97    35.60  0.02  7.11   3.73   >200
β4      −32.94    52.22  0.53  22.10  4.38   194.9
β5      187.09    28.71  0.00  10.49  6.50   88.44
β6      17.37     3.73   0.00  21.06  7.84   60.76
β7      −0.75     0.19   0.00  11.53  8.57   50.82
β8      −111.72   56.39  0.05  8.42   13.45  20.66
β9      −43.37    73.35  0.56  23.48  16.48  13.76

For both models 2 and 3, several VIF values are greater than 10. However, the corresponding CI values are not greater than 100 and the eigenvalues are not close to zero. Thus, centring successfully overcomes the multicollinearity problem for models 1–3. Because no collinearity structure is detected, variance proportions are omitted in Tables 11–13. At the 5% level of significance, the linear effect of precipitation, instead of its quadratic effect, is found to be insignificant in model 2; and the quadratic effect of precipitation, instead of the effect of temperature, is found to be insignificant in model 3. The values of $R^2_{adj}$ decrease considerably to 0.628 and 0.650, and those of $S_r^2$ increase to 75.53 and 73.19 for models 2 and 3, respectively. Over the validation dataset, the $R^2_{adj}$ values decrease to about 0.45, and the $S_r^2$ values increase to about 100 for all three centred models. As expected, variance estimates are seriously affected, whereas bias estimates are only slightly affected by the multicollinearity structures. These results show the apparent side effects of multicollinearity on the variances, the reliability of significance tests, and other statistical measures.

Because strongly correlated variables are included, centring does not provide a solution to the multicollinearity problem in models 4 and 5. Instead, temperature, which is one of the highly inter-correlated variables, was dropped from the full versions of models 4 and 5. In this case, the term $S/S_0$ was found insignificant at the 5% level in both models. It was then also dropped, and reduced versions of models 4 and 5 were obtained. Parameter estimates, their significance tests, and multicollinearity diagnostics are given in Tables 14 and 15. Only one of the CI values is greater than 10 in each table. This is due to a collinearity between the constant term and RH; hence, it is ignorable. Values of the statistical measures, including MSE and $R^2_{adj}$, are close to each other for the reduced and full versions of models 4 and 5. According to Tables 9 and 10, the constant term, RH, ST, and T are the variables included in collinearity structures in models 4 and 5. It is notable that the standard error estimates of the constant term and ST decrease after the elimination of multicollinearity.

In conclusion, the multicollinearity problem is observed in models that include inter-correlated terms or that include any term more than once. Collinearity structures have serious side effects on the variance estimates of terms, and hence on the statistical measures based on variance estimates. Because such models are frequently used in solar radiation modelling, one should take into account the multicollinearity problem and its solutions. Even if multicollinearity is eliminated by various transformations, the accuracy of estimates and predictions is dramatically reduced. Therefore, it is necessary to have a model that does not originally contain a collinearity structure and, at the same time, provides accurate estimates and predictions.

Table 14
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the reduced version of model 4.

Coeff.  Est.     SE.    P      VIF    CI      Eigenvalue
β0      9.659    1.117  0.000  –      1.000   3.414
β1      76.283   2.466  0.000  2.416  2.556   0.522
β3      −0.163   0.013  0.000  1.854  7.828   0.056
β4      −0.751   0.233  0.001  2.157  20.669  0.008

Table 15
Parameter estimates, P-values for t-tests, VIF, eigenvalue, and variance proportions for the reduced version of model 5.

Coeff.  Est.     SE.    P      VIF    CI      Eigenvalue
β0      8.639    1.165  0.000  –      1.000   3.427
β1      78.927   2.615  0.000  2.748  1.784   1.077
β3      0.655    0.225  0.004  1.285  2.804   0.436
β4      −0.153   0.014  0.000  1.984  8.035   0.053
β5      −1.006   0.248  0.000  2.468  21.785  0.007

Table 16
Values of statistical measures for the models purified from multicollinearity.

Dataset           Measure            Model 1   Model 2   Model 3   Model 4    Model 5
Parameterisation  MB (MJ/m²/day)     −6.879    −6.691    −6.288    0.000      0.000
                  AAB (MJ/m²/day)    9.050     8.841     8.126     2.144      2.125
                  MSE                122.3     120.2     112.6     9.878      9.747
                  R²adj              0.623     0.628     0.650     0.836      0.838
                  Min(e) (%)         0.218     0.146     0.036     0.021      0.013
                  Median(e) (%)      61.88     55.08     41.68     10.63      10.88
                  Max(e) (%)         948.1     990.5     922.8     613.78     620.3
                  AIC                −578.05   −572.50   −577.20   −1141.95   −1143.63
                  SBC                1344.45   1345.20   1332.86   643.83     642.96
                  Entropy (NAT)      1.549     1.558     1.656     1.761      1.773
Validation        MPB (MJ/m²/day)    −10.25    −10.04    −10.82    −1.175     −1.075
                  AAPB (MJ/m²/day)   12.18     201.7     214.5     5.815      5.725
                  MSPE               210.0     90.85     86.49     59.74      58.93
                  R²adj              0.441     0.459     0.420     0.841      0.843
                  Min(RPE) (%)       0.035     2.071     0.388     1.133      0.012
                  Median(RPE) (%)    76.67     72.57     73.82     98.020     33.867
                  Max(RPE) (%)       880.8     790.9     774.9     636.0      620.4
                  AIC                −83.40    −83.74    −83.70    −147.66    −148.76
                  SBC                599.62    600.02    611.51    462.03     460.53
                  Entropy (NAT)      1.722     1.668     1.737     1.741      1.749

3.2. A new solar radiation estimation model

As seen in the results given in Section 3.1, centring is a simple solution to the multicollinearity problem. However, the results obtained after the elimination of multicollinearity may considerably change the model comparison results and the inferences drawn for a particular model. In modelling, the primary purpose should be to derive models without multicollinearity instead of using a polynomial model and then dealing with multicollinearity.

Another important point in building models is the concept of parsimony. It concerns the use of a simpler model, which may contain fewer variables or parameters, instead of a large model to explain the phenomenon of interest. In the literature on the estimation of the amount of solar radiation reaching the ground, there are more than fifteen variables that can be used in a model. A model that is simple in terms of both the number of parameters and the number of variables is less exposed to the anomalies caused by violations of model assumptions and by measurement or random errors in the explanatory variables. Also, a simpler model is easier to interpret than a larger one.

Considering these discussions and the suitable models presented by Evrendilek and Ertekin [21], we focus on the monthly average daily extraterrestrial horizontal solar radiation, temperature, precipitation, soil temperature, relative humidity, latitude, longitude, and altitude of the site as explanatory variables for modelling the amount of solar radiation over Turkey. The genetic programming (GP) technique given by Koza [32] was used to obtain both the most suitable form and the parameter estimates of the new model. With classical modelling techniques, it is nearly impossible to search over a large model space that includes linear and nonlinear models with exponential, logarithmic, logical, and trigonometric terms. Even if it were possible to move over such a model space, it would be impossible to find suitable parameter estimates for the visited models. However, the GP technique is capable of both moving over such a model space and finding accurate estimates of model parameters. In practice, the GP technique is simply applied by using software called "Eureqa Pro." It automatically scans models including exponential, logarithmic, logical, and trigonometric terms by using the GP based symbolic regression technique [33,34].

In the GP based symbolic regression technique, correlated variables within a set of variables of interest are determined by using a measure of predictive ability based on partial derivatives. This is called variable pairing. The calculation of these partial derivatives is explained in detail by Schmidt and Lipson [33, supplementary material pp. 4–5]. A landscape is defined over a set of equations by the variable pairing metric, and the GP technique is used to explore this landscape. An initial expression is built by adding some constants, operators, and mathematical or logical functions, such as sine, cosine, floor, ceil, log, if, less than, and XOR. Then, the initial expression is evolved by evolutionary algorithms until a desired level of accuracy is reached. In this process, expressions that give an accurate fit survive, and unsuccessful ones are excluded from the set of possible models. The mathematical form of the model and the corresponding parameter values are searched simultaneously over the observed data. As time progresses, the model of interest evolves as the GP algorithm constructs and reconstructs model structures from a set of possible mathematical and logical functions [33]. In the early stages of the search, predictive ability is low and equation complexity is high, but over time the predictive ability increases and the equation complexity decreases. Eventually, the algorithm approaches the exact solution [33]. To find the parameter estimates of each model, similar models are refitted to numerically simulated data, and eventually a system of equations is found for each parameter of the original model. Then, explicit symbolic regression is used to solve the system of equations.
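The survive-and-evolve loop described above can be sketched with a deliberately small toy. The code below is not Eureqa's algorithm and does not implement variable pairing: it is a hypothetical, mutation-only evolutionary search over expression trees with one input variable, three operators, and truncation selection, written only to illustrate how candidate expressions are scored, kept, and varied. All names, settings, and the target function are assumptions.

```python
import math
import random

# Expression trees: ("x",), ("c", value), or (op, left, right) with op in OPS.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def random_tree(rng, depth=2):
    """Grow a random expression tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("c", rng.uniform(-2.0, 2.0))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    """Evaluate an expression tree at the input value x."""
    if tree[0] == "x":
        return x
    if tree[0] == "c":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))


def fitness(tree, data):
    """Mean squared error over the data; non-finite results count as worst."""
    total = 0.0
    for x, y in data:
        d = evaluate(tree, x) - y
        total += d * d
    err = total / len(data)
    return err if math.isfinite(err) else math.inf


def mutate(tree, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if tree[0] in ("x", "c") or rng.random() < 0.3:
        return random_tree(rng, depth=2)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))


def evolve(data, generations=40, pop_size=40, seed=0):
    """Keep the better half each generation and refill it with mutants."""
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, data))
        history.append(fitness(population[0], data))
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = min(population, key=lambda t: fitness(t, data))
    history.append(fitness(best, data))
    return best, history


# Hypothetical target y = x^2 + x sampled on a small grid.
data = [(x / 2.0, (x / 2.0) ** 2 + x / 2.0) for x in range(-6, 7)]
best, history = evolve(data)
print(best, history[-1])
```

Because the best expression of each generation is always carried over unchanged, the recorded best error can never increase from one generation to the next. A production tool would add crossover, constant optimisation, and a complexity penalty, none of which are shown here.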

An advantage of this GP approach is that, owing to the partial derivatives used in variable pairing, it is possible to draw inferences on both the magnitude and the sensitivity of the effect of each variable on the amount of global solar radiation without fixing the effects of the remaining variables. In classical linear or nonlinear regression models, by contrast, inferences on the effect of each variable are made by holding the effects of the remaining variables fixed. Thus, inferences drawn by the GP technique are more flexible and reliable than those made by the classical approaches.

By using the software "Eureqa Pro", the GP approach was run over the parameterisation portion of the average daily solar radiation dataset with the terms T, PPT, ST, RH, λ, φ, Z, $H_0$, and $S/S_0$. After $2 \times 10^{11}$ formula evaluations, a series of models and corresponding parameter estimates were generated. The model with the highest $R^2_{adj}$ and the smallest MSE, AIC, SBC, and entropy values over both the parameterisation and validation datasets is the following:

\[ H = \beta_0 + \beta_1 H_0 + \beta_2 T \phi + \lfloor \beta_3 Z \rfloor + \beta_4 PPT + \cos(\beta_5 Z)\, H_0 \log(Z) + \varepsilon, \qquad (17) \]

where $\lfloor \cdot \rfloor$ is the floor function and $\varepsilon$ is the error term. We refer to Eq. (17) as "model 6" throughout the article.

Table 17
Parameter estimates, significance test of each parameter, VIF, eigenvalue, and variance proportions for model 6.

Coeff.  Est.        SE.   P      VIF   CI     Eigenvalue
β0      −3.542963   0.65  0.000  –     1.00   >200
β1      67.90979    6.25  0.000  3.06  3.77   >200
β2      8.709125    1.70  0.000  4.02  7.56   100.5
β3      0.003728    0.10  0.000  1.94  7.91   91.8
β4      −1.000000   0.25  0.000  1.59  20.79  13.3
β5      93.512786   0.08  0.000  1.00  69.22  1.2
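The structure of model 6 can be made concrete with a short transcription of the fitted equation, using the point estimates listed in Table 17. This is a sketch under several assumptions that are not stated in this section: the function and argument names are hypothetical, the cosine argument is taken in radians, the logarithm is taken as natural, and the units or scaling of the inputs are unknown, so absolute output values from arbitrary inputs should not be interpreted.

```python
import math

# Point estimates of beta_0 .. beta_5 as listed in Table 17.
BETA = (-3.542963, 67.90979, 8.709125, 0.003728, -1.000000, 93.512786)


def model6(h0, t, phi, z, ppt, beta=BETA):
    """Evaluate the deterministic part of model 6 (error term omitted).

    h0: extraterrestrial horizontal solar radiation, t: temperature,
    phi: longitude, z: altitude (must be positive), ppt: precipitation.
    Input units and scaling are assumptions left to the caller.
    """
    b0, b1, b2, b3, b4, b5 = beta
    return (b0
            + b1 * h0
            + b2 * t * phi
            + math.floor(b3 * z)               # altitude enters through a floor
            + b4 * ppt
            + math.cos(b5 * z) * h0 * math.log(z))  # altitude interacts with h0


# Two hypothetical inputs that differ only in precipitation.
dry = model6(h0=1.0, t=1.0, phi=1.0, z=1000.0, ppt=0.0)
wet = model6(h0=1.0, t=1.0, phi=1.0, z=1000.0, ppt=1.0)
print(dry - wet)
```

Since the estimate of the precipitation coefficient is −1, a one-unit increase in precipitation lowers the fitted value by one unit regardless of the other inputs, which matches the purely negative effect of precipitation discussed for this model.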

Parameter estimates, the P-value for the significance test of each parameter, VIF, eigenvalue, and variance proportions for model 6 are presented in Table 17.

The GP technique does not readily provide significance tests of the terms in the model. However, it is possible to test the significance of the contribution of each term to the explanation of overall variance by the F-test. The P-values of Table 17 are obtained by testing the significance of the contribution of each term to the amount of variation explained by model 6. Thus, at the 5% level of significance, the monthly average daily extraterrestrial horizontal solar radiation, temperature, altitude, longitude, and precipitation significantly contribute to the amount of explained variation in the average daily solar radiation. Because all VIF values are less than 10, all CI values are less than 100, and none of the eigenvalues is close to zero, there is no collinearity problem with the new model; hence, the corresponding variance proportions are not reported.
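The exact form of the F-test is not spelled out here. One standard way to test the contribution of a term, given as an assumption rather than as the article's procedure, is the extra-sum-of-squares statistic that compares the full model with a reduced model from which the term is removed:

```latex
\[
F = \frac{\left( SSE_{\mathrm{reduced}} - SSE_{\mathrm{full}} \right) / q}
         {SSE_{\mathrm{full}} / (N - k)}
\;\sim\; F_{q,\,N-k} \quad \text{under the null hypothesis,}
\]
```

where $SSE$ denotes the residual sum of squares, $q$ is the number of parameters removed, $k$ is the number of parameters in the full model, and $N$ is the number of observations. A small P-value indicates that dropping the term significantly increases the unexplained variation.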

For model 6, the sensitivity and magnitude of the effect of each variable on the average daily horizontal solar radiation are given in Table 18. For each variable, the sensitivity measure is the mean absolute value of the partial derivative of the dependent variable with respect to that explanatory variable, evaluated at all data points and multiplied by the ratio of their standard deviations.

Table 18
Sensitivity and magnitude of each variable on the average daily solar radiation.

Variable  Sensitivity  Positive (%)  Positive magnitude  Negative (%)  Negative magnitude
φ         0.072        6             0.009               94            0.076
Z         5756.2       44            5463.1              56            5987
H0        0.721        100           0.721               0             0.000
PPT       0.036        0             0.000               100           0.036
T         0.312        100           0.312               0             0.000

Table 19
Values of statistical measures for model 6.

Parameterization dataset           Validation dataset
MB (MJ/m²/day)     0.203           MPB (MJ/m²/day)    −0.443
AAB (MJ/m²/day)    1.691           AAPB (MJ/m²/day)   5.455
MSE                7.969           MSPE               52.937
R²adj              0.975           R²adj              0.861
Min(e) (%)         0.011           Min(RPE) (%)       0.088
Median(e) (%)      7.482           Median(RPE) (%)    36.293
Max(e) (%)         678.6           Max(RPE) (%)       678.6
AIC                −1198.7         AIC                −158.736
SBC                590.11          SBC                448.80
Entropy (NAT)      1.379           Entropy (NAT)      1.378

The positive (negative) percentage is the percentage of data points where the partial derivative of the dependent variable with respect to the explanatory variable is greater (less) than zero. For the positive (negative) magnitude measure, the sensitivity measure is recalculated over the data points where the partial derivative of the dependent variable with respect to the explanatory variable is greater (less) than zero (see Schmidt and Lipson [33] for details of the calculations of sensitivity and magnitude).
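The measures defined above can be sketched numerically as follows. This is an illustrative reading of the definitions, not the software's implementation: partial derivatives are approximated by central differences, population standard deviations over all points are used for the scaling ratio (an assumption), and the function `toy_model` and the grid of points are hypothetical.

```python
import statistics


def partial_derivative(f, point, index, h=1e-5):
    """Central-difference approximation of df/dx_index at the given point."""
    up, down = list(point), list(point)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)


def sensitivity_summary(f, points, index):
    """Sensitivity, positive/negative percentages, and magnitudes for one input."""
    derivs = [partial_derivative(f, p, index) for p in points]
    sd_x = statistics.pstdev([p[index] for p in points])
    sd_y = statistics.pstdev([f(*p) for p in points])
    scale = sd_x / sd_y  # ratio of standard deviations, input over output
    pos = [d for d in derivs if d > 0]
    neg = [d for d in derivs if d < 0]
    n = len(derivs)
    return {
        "sensitivity": scale * sum(abs(d) for d in derivs) / n,
        "positive_pct": 100 * len(pos) / n,
        "positive_magnitude": scale * sum(pos) / len(pos) if pos else 0.0,
        "negative_pct": 100 * len(neg) / n,
        "negative_magnitude": scale * sum(abs(d) for d in neg) / len(neg) if neg else 0.0,
    }


def toy_model(x1, x2):
    """Hypothetical response: increasing in x1, non-monotone in x2."""
    return 3.0 * x1 - x2 ** 2


grid = [(float(a), float(b)) for a in range(1, 6) for b in range(-2, 3)]
summary_x1 = sensitivity_summary(toy_model, grid, 0)
summary_x2 = sensitivity_summary(toy_model, grid, 1)
print(summary_x1)
print(summary_x2)
```

In this sketch, points with a zero derivative are counted in neither percentage, so the two percentages need not sum to 100; in Table 18 they do sum to 100, which suggests that zero-derivative points are grouped with one of the two signs there.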

Altitude has the highest relative impact on the average daily solar radiation within model 6. The impact of temperature is greater than that of precipitation. With a likelihood of 94%, an increase in longitude results in a decrease in the amount of average daily solar radiation; with a likelihood of 6%, it either increases the amount of average daily solar radiation or has no influence. While an increase in altitude decreases the amount of average daily solar radiation with a likelihood of 56%, it either increases the amount of average daily solar radiation or has no influence with a likelihood of 44%. An increase in precipitation does not increase the amount of average daily solar radiation, and an increase in either temperature or the monthly average daily extraterrestrial horizontal solar radiation does not decrease the amount of average daily solar radiation.

Values of the statistical measures, model selection criteria, and entropy for model 6 are given in Table 19 over both the parameterisation and validation datasets. When compared to the centred versions of models 1–3 and the reduced versions of models 4 and 5 (Table 16), the estimation and prediction performances of model 6 are superior for both the parameterisation and validation datasets. All of the AIC, SBC, and entropy values for model 6 are less than those of models 1–5 over both datasets. Also, the performance of model 6 is better than that of models 1–5 in terms of the amount of uncertainty in estimates. Consequently, model 6 gives more precise estimates than models 1–3; hence, the proposed model brings an improvement in the accuracy of estimation and prediction of the amount of global solar radiation.

In models 1–3, the effects of the monthly average daily extraterrestrial horizontal solar radiation, the ratio of day length to the maximum possible sunshine duration, and temperature are found significant after the elimination of multicollinearity. In model 4, the effects of relative humidity and soil temperature are also found significant. In addition to these variables, the effect of the solar declination angle is found significant in model 5. Similar to these models, the effects of the monthly average daily extraterrestrial horizontal solar radiation and temperature are significant according to the proposed model. However, the ratio of day length to the maximum possible sunshine duration is found insignificant, and, in agreement with model 2, precipitation is found significant. Additionally, the proposed model suggests that both altitude and longitude have significant effects on the amount of solar radiation.

In conclusion, in the estimation of the amount of average daily solar radiation, model 6 has better estimation and prediction performances than the recent models giving an accurate fit. The uncertainty values in estimates and predictions are at their lowest levels for model 6. Also, the model selection criteria suggest the use of model 6 for the estimation of the amount of average daily solar radiation over Turkey. Model 6 suggests that, in addition to the monthly average daily extraterrestrial horizontal solar radiation, precipitation, and temperature, the impacts of altitude and longitude on the amount of average daily solar radiation are both significant.

4. Conclusion

In this article, models used for the estimation and prediction of the amount of average daily solar radiation are considered over the region of Turkey. A set of models from the literature that give an accurate fit for the region of Turkey is revisited. The multicollinearity problem is frequently seen in the nonlinear or polynomial models used for the estimation of solar radiation. The impacts of multicollinearity on the significance tests of model terms and on model evaluation and selection criteria are discussed and demonstrated. Then, a new model for the estimation and prediction of the amount of average daily solar radiation over Turkey is proposed. The accuracy of the new model is compared to that of the models from the revisited set.

In general, the models used in solar radiation modelling have complex structures. Some of the variables appear in a model more than once, as either interaction terms with other variables or higher order terms. Both of these situations cause severe collinearity structures between the parameters of the model. Simple solutions to the multicollinearity problem are centring the original variables before parameter estimation and including only one of the correlated variables in the model. As seen in Section 3, when the impact of multicollinearity is eliminated, there are considerable changes in the values of model evaluation statistics, such as the coefficient of determination, and in the inferences on the significance of model terms.

Five models, which were identified as suitable for the estimation of solar radiation, were fitted to our parameterisation dataset. When the problem of multicollinearity is not taken into account, all the models seem to give an accurate fit; and at the 5% level of significance, the linear effect of temperature and the quadratic effect of precipitation on the amount of horizontal solar radiation are found insignificant. In addition, a strong correlation (0.976) is detected between soil temperature and temperature, which appear together in some of the models considered. After the elimination of multicollinearity by centring, the coefficients of determination decrease dramatically and the variance estimates increase. At the 5% level of significance, the linear effect of precipitation, rather than that of temperature, is found to be insignificant on the amount of horizontal solar radiation. For the models including temperature and soil temperature simultaneously, centring did not provide a solution to the multicollinearity problem; hence, temperature was dropped from the model to overcome the multicollinearity. As these conclusions show, researchers in the field of solar radiation modelling should be aware of the multicollinearity problem and its solutions.

After the elimination of multicollinearity, the existing models give a worse fit. However, inferences about the significance of the model terms are unreliable in the presence of multicollinearity. Therefore, we propose a new model to overcome the multicollinearity problem and successfully estimate and predict the amount of average daily solar radiation. The variable altitude appears twice in the proposed model. In the first occurrence, it is placed within the floor function; in the second occurrence, it is placed within both the cosine and natural logarithm functions and interacts with the monthly average daily extraterrestrial horizontal solar radiation. Due to the structure of the interaction and the complexity of the floor, cosine, and natural logarithm functions, no effective collinearity structure is detected. According to the model selection criteria, entropy, and other model evaluation and comparison tools, both the estimation and prediction accuracy of the proposed model are found superior to those of the existing models that were previously identified as satisfactory for Turkey data. At the 5% level of significance, the effects of temperature, precipitation, altitude, longitude, and monthly average daily extraterrestrial horizontal solar radiation are found statistically significant, and neither relative humidity nor latitude has a significant impact on the explained variation of the amount of solar radiation. Because a model with temperature provides a better fit than a model with soil temperature within our modelling context, models including soil temperature did not survive the model search. According to the proposed model, the variable with the highest relative impact on the average daily solar radiation is altitude. Temperature has a greater degree of impact than precipitation. It is very likely that an increase in longitude decreases the amount of average daily solar radiation. An increase in altitude neither decreases nor increases the amount of average daily solar radiation. While an increase in precipitation does not increase the amount of average daily solar radiation, an increase in either temperature or the monthly average daily extraterrestrial horizontal solar radiation does not decrease the amount of average daily solar radiation.

Note that the proposed model is also suitable for regions similar to Turkey. In fact, the inferences on the impact of variables on the amount of solar radiation drawn from the new model are valid in general. The variation of the solar radiation measurements in the dataset used to estimate the parameters of the new model is higher than that of the datasets used in earlier modelling studies in the literature. Also, our terrain of interest, Turkey, has a complex structure. Considering these points and the degree of accuracy of the proposed model in the estimation and prediction of the amount of average daily solar radiation, we expect to get precise estimates when the model is used to estimate the amount of solar radiation for a location outside the region of Turkey. Therefore, one can confidently use the proposed model for other similar terrains over the globe. However, if the location or structure of the terrain of interest is very different from the region of Turkey, the proposed model may give inaccurate estimates.

References

[1] Kaygusuz K, Sari A. Renewable energy potential and utilization in Turkey. Energy Convers Manage 2003;44:459–78.

[2] Laska K, Prosek P, Budik L, Budikova M, Milinevsky G. Prediction of erythemally effective UVB radiation by means of nonlinear regression model. Environmetrics 2009;20:633–46.

[3] Demirhan H, Menteş T, Atilla M. Statistical comparison of global solar radiation estimation models over Turkey. Energy Convers Manage 2013;68:141–8.

[4] Dinçer I, Dilmac S, Ture IE, Edin M. A simple technique for estimating solar radiation parameters and its application for Gebze. Energy Convers Manage 1996;37:183–98.

[5] Togrul IT, Onat E. A study for estimating solar radiation in Elazig using geographical and meteorological data. Energy Convers Manage 1999;40:1577–84.

[6] Sen Z, Sahin AD. Spatial interpolation and estimation of solar irradiation by cumulative semivariograms. Sol Energy 2001;71:11–21.

[7] Saylan L, Sen O, Toros H, Arisoy A. Solar energy potential for heating cooling systems in big cities of Turkey. Energy Convers Manage 2002;43:1829–37.

[8] Sozen A, Arcaklioglu E, Ozalp M. Estimation of solar radiation in Turkey by artificial neural network using meteorological and geographical data. Energy Convers Manage 2004;45:3033–52.

[9] Sozen A, Arcaklioglu E, Ozalp M, Kanit EG. Use of artificial neural networks for mapping of solar potential in Turkey. Appl Energy 2004;77:273–86.

[10] Sozen A, Ozalp M, Arcaklioglu E, Kanit EG. A study for estimating solar resources in Turkey using artificial neural networks. Energy Sources 2004;26:1369–78.

[11] Sozen A, Arcaklioglu E, Ozalp M, Kanit EG. Solar energy potential in Turkey. Appl Energy 2005;80:367–81.

[12] Menges HO, Ertekin C, Sonmete H. Evaluation of global solar radiation models for Konya, Turkey. Energy Convers Manage 2006;47:3149–73.

[13] Senkal O, Kuleli T. Estimation of solar radiation over Turkey using artificial neural network and satellite data. Appl Energy 2009;86:1222–8.

[14] Senkal O. Modeling of solar radiation using remote sensing and artificial neural network in Turkey. Energy 2010;35:4795–801.

[15] Senkal O, Sahin M, Pestemalci V. The estimation of solar radiation for different time periods. Energy Sources, Part A 2010;32:1176–84.

[16] Koca A, Oztop HF, Varol Y, Koca GO. Estimation of solar radiation using artificial neural networks with different input parameters for Mediterranean region of Anatolia in Turkey. Expert Syst Appl 2011;38:8756–876.

[17] Jin Z, Yezheng W, Gang Y. General formula for estimation of monthly average daily global solar radiation in China. Energy Convers Manage 2005;46:257–68.

[18] Ozgoren M, Bilgili M, Sahin B. Estimation of global solar radiation using ANN over Turkey. Expert Syst Appl 2012;39:5043–51.

[19] Khorasanizadeh H, Mohammadi K. Prediction of daily global solar radiation by day of the year in four cities located in the sunny regions of Iran. Energy Convers Manage 2013;76:385–92.

[20] Ertekin C, Evrendilek F. Spatio-temporal modelling of global solar radiation dynamics as a function of sunshine duration for Turkey. Agric Forest Meteorol 2007;145:36–47.


H. Demirhan / Energy Conversion and Management 84 (2014) 334–345 345

[21] Evrendilek F, Ertekin C. Assessing solar radiation models using multiple variables over Turkey. Clim Dyn 2008;31:131–49.

[22] Sonmete MH, Ertekin C, Menges HO, Hacıseferogulları H, Evrendilek F. Assessing monthly average solar radiation models: a comparative case study in Turkey. Environ Monit Assess 2011;175:251–77.

[23] Rawlings JO. Applied regression analysis: a research tool. California: Wadsworth; 1988.

[24] Sanford W. Applied linear regression. New York: Wiley; 2005.

[25] Shacham M, Brauner N. Minimizing the effects of collinearity in polynomial regression. Ind Eng Chem Res 1997;36:4405–12.

[26] Duffie JA, Beckman WA. Solar engineering of thermal process. New York: Wiley; 2006.

[27] Forsythe WC, Rykiel EJ, Stahl RS, Wu H, Schoolfield RM. A model comparison for daylength as a function of latitude and day of year. Ecol Model 1995;80:87–95.

[28] Gariepy J. Estimation du rayonnement solaria global. In: Internal report. Service of meteorology, government of Quebec, Canada, 1980.

[29] Chen R, Kang E, Ji X, Yang J, Zhang Z. Trends of the global radiation and sunshine hours in 1961–1998 and their relationships in China. Energy Convers Manage 2006;47:2859–66.

[30] Alnaser WE. New model to estimate the solar global irradiation using astronomical and meteorological parameters. Renew Energy 1993;3:175–7.

[31] Myers RH. Classical and modern regression with applications. Boston: PWS-KENT; 1990.

[32] Koza JR. Genetic programming: on the programming of computers by means of natural selection. Cambridge: MIT Press; 1992.

[33] Schmidt M, Lipson H. Distilling free-form natural laws from experimental data. Science 2009;324(5923):81–5.

[34] Schmidt M, Lipson H. Eureqa (version 0.98 beta) [software], available from <www.nutonian.com>. [accessed: 2014.02.13].