Journal of Computer Engineering, Vol. 2, No. 1 (2011), ISSN: 2010-1619
Journal of Computer Engineering, journal homepage: http://icoci.org/jce
Data Mining Techniques for Wind Speed Analysis (A Case Study for Gaza Strip)
1 Marwa F. AlRoby, 2 Alaa M. ElHalees
Computer Science Department, Islamic University of Gaza, Palestine
[email protected] ,[email protected]
Abstract: Weather data mining is a form of data mining concerned with finding hidden patterns in widely available meteorological data, so that the retrieved information can be transformed into usable knowledge. In this paper we used meteorological data mining to analyze wind speed behavior. The data are daily historical records collected between January 2003 and November 2006 by the meteorological station of Gaza. After preprocessing the data, we applied four data mining techniques: association rules, classification, clustering and outlier analysis. Of these four tasks, we found classification to be the most appropriate for weather data, especially the neural network method, because the data are a time series.
Keywords: Data mining, Association rules, Classification, Cluster, Outlier analysis
I. INTRODUCTION
Weather prediction is one of the most interesting and fascinating domains. It plays an important role in natural and human life. Agriculture, aviation operations, fisheries, food security, shipping, safety at sea, monitoring of water resources, many industries and tourism all depend on weather conditions [7]. Weather prediction is often used to predict and warn about natural disasters caused by abrupt changes in climatic conditions, so that the necessary protective measures can be taken early [7]. It also helps in using wind energy as an alternative source of energy, mainly for electrical power generation. For these reasons, scientists have forecast meteorological characteristics using a large set of methods. One of them is data mining, a recently developed method that can be applied successfully in this domain [2].
This paper investigates wind speed prediction with data mining through a case study. It shows what kind of data can be collected, how the data can be preprocessed, how data mining methods can be applied to the data, and finally how we can benefit from the discovered knowledge. Many kinds of knowledge can be discovered from data. In this work we investigate the most common ones: association, classification, clustering and outlier detection.
The paper is structured as follows. Section 2 summarizes related work in weather data mining. Section 3 gives a general description of the data used in our case study. Section 4 describes the preprocessing of the data. Section 5 presents our experiments applying data mining methods to the wind speed data. Finally, we conclude the paper with a summary.
II. LITERATURE REVIEW
Much research has been done on wind speed data analysis. In this section we review some of it.
Motivated by the growing demand for energy that accompanies increasing agricultural and industrial activity, K. Sreelakshmi and P. Ramakanthkumar [4] introduced a feed-forward neural network model for short-term wind speed prediction that uses the back-propagation algorithm. They use six parameters (mean temperature, humidity, wind gust, wind direction, barometric pressure and wind speed) as input to the model, and consider 10 years of historical data in their experiments. The resulting model predicts wind speeds that differ from the actual values by at most 5%.
Elia Georgiana Petre [2] forecast weather using CART (Classification And Regression Trees), one of the most popular decision tree algorithms, to build a classification tree that predicts future temperature values. The meteorological data were registered over Hong Kong between 2002 and 2005. Because the CART algorithm as applied there works only with nominal variables, the data were converted to nominal type. The results obtained for their example are illustrated in Figure (1).
Paulo Cortez and Aníbal Morais [1] present a novel data mining approach to predict the burned area (or size) of forest fires using meteorological data. They used recent real-world
Figure 1: Evaluation on training set
data collected from the northeast region of Portugal. They ran several experiments with five data mining techniques: multiple regression, Decision Trees (DT), Random Forests (RF), Neural Networks (NN) and Support Vector Machines (SVM). They used four distinct feature selection setups: spatial, temporal, components from the Canadian Fire Weather Index (FWI) system, and meteorological data (rain, wind, temperature and humidity). They found that the SVM technique produces the best predictions for small fires; the drawback is lower predictive accuracy for large fires. T. Tugay Bilgin and A. Yılmaz Çamurcu [5] determined regions in Turkey that have similar air temperature characteristics. They applied density-based spatial clustering of applications with noise (DBSCAN) to an air temperature database collected from countrywide meteorology stations in Turkey, and obtained clusters with similar temperature trends. L. Fugon et al. [3] used different data mining models to evaluate wind power forecasting. The models evaluated include neural networks, Random Forests and support vector machines.
III. DATA COLLECTION
In our case study we collected the weather dataset from the meteorological station of Gaza. Gaza is located on the coast of the Mediterranean Sea, north of the Sinai Peninsula and southwest of Jerusalem, at 34° longitude and 31° latitude [8]. The Gaza Strip has a temperate, dry Mediterranean-type climate, with mild winters and hot summers subject to drought [9]. The observed wind speed data are daily historical records covering four years, from January 2003 to November 2006. The number of records is 1429. Our work uses five parameters acquired from the weather station report: average temperature, pressure, humidity, wind direction and wind speed. These parameters, shown in Table 1, are the inputs, in addition to the day and month attributes. The output is the degree of wind speed. All attributes are numeric.
IV. DATA PREPARATION AND PREPROCESSING
To get better input data for the data mining techniques, we preprocessed the collected data. After integrating the data into one file, we rearranged the data in lag format to increase interpretability and inclusiveness. A lag is the value of wind speed recorded on a preceding day. In this case we built three lags of wind speed [today, yesterday and the day before yesterday] using windowing. We then discretized the numerical wind speed attribute into categorical values, grouping it into five groups: very high, high, middle, low and very low. We also applied normalization to a subset of attributes [average temperature, pressure, humidity, wind direction and the three wind speed lags] to convert their values to the range between zero and one. Figure (2) illustrates the normalized data, the windowed wind speed attribute and the discretized output.
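The three preprocessing steps described above (lagging, normalization and discretization) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample wind speeds and the equal-width bin edges are hypothetical, since the bin boundaries and tooling used for this step are not stated in the paper.

```python
from typing import List, Tuple

# Class labels in increasing order of wind speed, as named in the text.
LABELS = ["very low", "low", "middle", "high", "very high"]


def make_lags(speeds: List[float], n_lags: int = 2) -> List[Tuple[float, ...]]:
    """Build rows (lag0, lag1, ..., lag_n); lag_k is the speed k days earlier."""
    rows = []
    for t in range(n_lags, len(speeds)):
        rows.append(tuple(speeds[t - k] for k in range(n_lags + 1)))
    return rows


def min_max(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values: List[float], labels: List[str] = LABELS) -> List[str]:
    """Equal-width binning (an assumption; the paper gives no bin edges)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels) or 1.0
    out = []
    for v in values:
        idx = min(int((v - lo) / width), len(labels) - 1)
        out.append(labels[idx])
    return out


# Made-up daily wind speeds, for illustration only.
speeds = [2.0, 4.0, 6.0, 8.0, 10.0, 3.0]
rows = make_lags(speeds)       # (today, yesterday, day before yesterday)
norm = min_max(speeds)         # values between zero and one
classes = discretize(speeds)   # five categorical groups
```

Each row of `rows` corresponds to one day with its own speed and the two preceding speeds, which mirrors the wind_sp0, wind_sp1 and wind_sp2 attributes used later in the association rules.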
After applying these preprocessing techniques, we also visualized the data using Weka software, as displayed in Figure (3).
V. DATA MINING TASKS IN WIND SPEED SYSTEMS
In the next sections, we describe the results of applying each of the four data mining tasks to our data.
Table 1: List of parameters
Figure 2: Normalized data, windowed wind speed attribute and discretized output.
Figure 3: Visualizing the data used in the case study using Weka software.
VI. ASSOCIATION RULES
Association rule mining, which plays a critical role in the field of data mining, searches for interesting relationships among items in a given data set [6]. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules.
For association rule mining, we used the FP-Growth mining algorithm with min support = 0.95 and min confidence = 0.8. It finds rules of the form: if [wind_sp1 = low, RH_avg = high and T_avg = high] then [wind_sp0 = low], which is the first rule in Figure (4). This rule says that when lag 1 of wind speed is low, the average humidity is high and the average temperature is high, then lag 0 of wind speed is low. The fourth rule is: if [wind_sp2 = low, wind_sp1 = low, output (wind speed) = low, RH_avg = high, T_avg = high] then [wind_sp0 = low]. This means that when lag 2 of wind speed is low, lag 1 of wind speed is low, the output wind speed is low, the average humidity is high and the average temperature is high, then lag 0 of wind speed is low. Stronger association rules can be obtained by changing min confidence to 0.987, as illustrated in Figure (4).
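To make the two thresholds concrete, the following Python sketch computes support and confidence for a single rule over a handful of made-up daily records. It is not an implementation of FP-Growth; it only evaluates one candidate rule by direct counting. The records and the helper names `support` and `confidence` are hypothetical and chosen for illustration.

```python
from typing import FrozenSet, List


def support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Conditional probability of the consequent given the antecedent."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / sup_a


# Four made-up days; each day is a set of attribute=value items.
days = [
    frozenset({"wind_sp1=low", "RH_avg=high", "T_avg=high", "wind_sp0=low"}),
    frozenset({"wind_sp1=low", "RH_avg=high", "T_avg=high", "wind_sp0=low"}),
    frozenset({"wind_sp1=low", "RH_avg=high", "T_avg=high", "wind_sp0=high"}),
    frozenset({"wind_sp1=high", "RH_avg=low", "T_avg=high", "wind_sp0=high"}),
]

antecedent = frozenset({"wind_sp1=low", "RH_avg=high", "T_avg=high"})
consequent = frozenset({"wind_sp0=low"})

# The rule holds on 2 of 4 days; its antecedent holds on 3 of 4 days.
rule_support = support(days, antecedent | consequent)
rule_confidence = confidence(days, antecedent, consequent)
```

A rule is reported only if its support and confidence both reach the chosen minimums, so raising min confidence keeps fewer but stronger rules.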
VII. CLASSIFICATION
Classification analysis is a data mining task that organizes data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set in which all objects are already associated with known class labels [8]. In wind speed data mining, we classify the degree of wind speed for each day into one of our classes (very high, high, middle, low and very low). We split the data into 70% for training and 30% for testing. In our case study we used three classification methods:
First method: Using rule induction
Rule induction is one of the most important techniques of machine learning: it extracts useful if-then rules from data based on statistical significance [9]. We used it to represent logical rules of the wind speed data, as illustrated in Figure (5). We reached an accuracy of 64.80%.
Second method: Using k-nearest neighbor
This technique classifies each record in a data set based on a combination of the classes of the k record(s) most similar to it in a historical data set [9]. In our case we chose k = 5. We reached an accuracy of 62.70%, which is lower than the accuracy of rule induction.
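The voting idea behind k-nearest neighbor can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and simple majority voting; the two-attribute training points and the function name `knn_predict` are made up and do not come from the study.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train_x: List[Sequence[float]],
    train_y: List[str],
    query: Sequence[float],
    k: int = 5,
) -> str:
    """Return the majority class among the k training records nearest to query."""
    # Rank training records by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up normalized records with two attributes each.
train_x = [
    (0.10, 0.20), (0.15, 0.25), (0.20, 0.10), (0.12, 0.18),
    (0.80, 0.90), (0.85, 0.80), (0.90, 0.95),
]
train_y = ["low", "low", "low", "low", "high", "high", "high"]

calm_day = knn_predict(train_x, train_y, (0.20, 0.20), k=5)
windy_day = knn_predict(train_x, train_y, (0.90, 0.90), k=5)
```

With k = 5, the first query collects four "low" votes and one "high" vote, and the second collects three "high" votes and two "low" votes.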
Third method: Using a neural network
A neural network is a simulation of the human brain that acquires knowledge through learning. It is a nonlinear predictive model that resembles biological neural networks in structure. A multilayer perceptron network (MLPN) from the RapidMiner program was trained on the data set using the feed-forward back-propagation (FFBP) algorithm with two hidden layers, as shown in Figure (6), and 1000 training cycles. On our wind speed data, we reached an accuracy of 67.37%.
As a result, we find that the neural network method performed quite well compared with the other classification methods. To determine which method is best, we applied a t-test to estimate the probability of the null hypothesis, that is, the probability of obtaining the same result from random values.
Figure 4: Top five association rules for the wind speed data
Figure 5: Rule model of wind speed
Figure 6: The structure of the feed-forward back-propagation neural network (FFBP).
VIII. CLUSTERING
Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown: the task is to find groups of objects such that the objects in one group are similar to one another and different from the objects in other groups. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels [8].
In wind speed data mining, clustering has been used to group the wind speed state of each day according to the variability of the weather on that day. In our case we used the K-Means algorithm to cluster the given data into five groups (k = 5) and to characterize the groups based on their behavior. Figure (7) gives the mean of each cluster for each attribute.
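The K-Means procedure alternates between assigning each record to its nearest cluster mean and recomputing the means. The Python sketch below shows this loop on a tiny made-up data set with k = 2 so that the result is easy to check by hand; the study itself used k = 5. The initialization from the first k points and the function name `k_means` are assumptions made for illustration, not the settings of the tool used in the paper.

```python
import math
from typing import List, Sequence, Tuple


def k_means(
    points: List[Sequence[float]], k: int, iterations: int = 100
) -> Tuple[List[Tuple[float, ...]], List[int]]:
    """Lloyd's algorithm: returns (cluster means, cluster index per point)."""
    # Naive deterministic start: the first k points (an assumption).
    centroids = [tuple(p) for p in points[:k]]
    assignment: List[int] = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Made-up normalized records forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0), (0.0, 0.1), (1.0, 0.9)]
centroids, assignment = k_means(points, k=2)
```

The per-attribute values of each returned centroid play the role of the cluster means reported in Figure (7).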
To show the clusters as a graph, we used Singular Value Decomposition (SVD) with two dimensions, as shown in Figure (8).
IX. OUTLIER DETECTION
Outliers are data elements that cannot be grouped into a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable [8].
In our case study, we used outlier analysis to detect outliers in the wind speed data set. Two methods were applied for outlier detection:
First method: Using Detect Outlier (Distances)
The Detect Outlier (Distances) method identifies n outliers in the given wind speed data set based on the distance to their k nearest neighbors. In this case we set the number of neighbors to be analyzed to k = 10 and the number of top-n outliers to 30, and we chose the Euclidean distance function for calculating the distance between two points. To show the graph we used Singular Value Decomposition (SVD) with two dimensions, as shown in Figure (9). Red points represent outliers and blue points represent normal instances; the green circle surrounds the outliers.
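A distance-based detector of this kind can be sketched as below. The sketch scores each record by the Euclidean distance to its k-th nearest neighbor and reports the n highest-scoring records; whether the operator used in the study scores records in exactly this way is an assumption. The sample points and the function name `knn_distance_outliers` are made up for illustration.

```python
import math
from typing import List, Sequence


def knn_distance_outliers(
    points: List[Sequence[float]], k: int, n: int
) -> List[int]:
    """Return the indices of the n points farthest from their k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)  # largest k-th neighbor distance first
    return sorted(i for _, i in scores[:n])


# Five made-up records close together and one far away.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0),
]
outliers = knn_distance_outliers(points, k=2, n=1)
```

With the settings reported in the text, the call would use k = 10 and n = 30 on the full data set.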
Second method: Using Detect Outlier (LOF)
The Detect Outlier (LOF) method identifies outliers in the given wind speed data set based on local outlier factors. In this case we set the lower bound for the minimal points to 10 and the upper bound to 30, and we again chose the Euclidean distance function for calculating the distance between two points. The LOF is computed for each minimal points value in the range by averaging the ratio between the local reachability density of all objects in the k-neighborhood and that of the object itself. To show the graph we used Singular Value Decomposition (SVD) with two dimensions, as shown in Figure (10). Red points represent outliers and blue points represent normal instances; the green circle surrounds the outliers.
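The ratio of local reachability densities described above can be sketched for a single minimal points value k as follows. This is a simplified illustration: it ignores distance ties when picking the k nearest neighbors, assumes no duplicate records, and does not combine scores over a range of k values as the study's 10 to 30 setting does. The sample points and the function name `local_outlier_factors` are made up.

```python
import math
from typing import List, Sequence


def local_outlier_factors(points: List[Sequence[float]], k: int) -> List[float]:
    """Compute a LOF score per point for one neighborhood size k."""
    n = len(points)
    neighbors: List[List[int]] = []
    k_distance: List[float] = []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])  # distance to the k-th neighbor

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), dist(p, o)).
    lrd: List[float] = []
    for i in range(n):
        reach = [
            max(k_distance[j], math.dist(points[i], points[j])) for j in neighbors[i]
        ]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Five made-up records close together and one far away.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0),
]
lofs = local_outlier_factors(points, k=2)
```

Scores near 1 indicate a record whose neighborhood is about as dense as its neighbors' neighborhoods, while scores well above 1 indicate an outlier.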
Figure 7: Clustering the wind speed data into five groups using the K-Means algorithm
Figure 8: Graph of wind speed clusters
Figure 9: Graph displaying the outliers in the wind speed data found by Detect Outlier (Distances)
X. CONCLUSION
In this paper, we applied data mining tasks (association rules, classification, clustering and outlier analysis) to a wind speed data set. The data set consists of daily historical observations recorded from January 2003 to November 2006 by the meteorological station of Gaza.
We applied association rules using the FP-Growth mining algorithm with min support = 0.95 and min confidence = 0.8, and tried to obtain five strong rules by changing min confidence to 0.987.
We then discovered classification rules by applying three methods: rule induction, k-nearest neighbor and neural network. With rule induction we reached an accuracy of 64.80%. With k-nearest neighbor we reached 62.70%, which is lower than the accuracy of rule induction. Finally, we applied the feed-forward back-propagation (FFBP) algorithm and reached an accuracy of 67.37%.
We also clustered the wind speed data into five groups and characterized them based on their behavior. After that, we used outlier analysis to detect outliers in the data, applying two methods: Detect Outlier (Distances) and Detect Outlier (LOF).
Finally, we found that the most appropriate of these techniques for weather data is classification, especially the neural network method, because the data are a time series.
REFERENCES
[1] Cortez, P. and A. Morais, "A data mining approach to predict forest fires using meteorological data", New Trends in Artificial Intelligence, Proceedings of the 13th EPIA Portuguese Conference on Artificial Intelligence, pp. 512-523, 2007.
[2] Elia Georgiana Petre, "A Decision Tree for Weather Prediction", pp. 77-82, LXI, No. 1/2009.
[3] Fugon L., Juban J., Kariniotakis G., "Data mining for wind power forecasting", European Wind Energy Conference, 2008.
[4] Sreelakshmi K., Ramkanthakumar P., "Neural Networks for short-term wind speed prediction", World Academy of Science, Engineering and Technology 42, pp. 721-725, 2008.
[5] Tugay Bilgin and Yilmaz Çamurcu, "A Data Mining Application on Air Temperature Database", in LNCS 3261, Advances in Information Systems, Springer Berlin/Heidelberg, ISBN 978-3-540-23478-4, pp. 68-76, 2004.
[6] S. Yau and L. Zhou, "Association rule mining and quantitative association rule mining among infrequent items".
[7] World Meteorological Organization (WMO), http://www.wmo.int/pages/index_en.html [Online]
[8] http://churmura.com/technology/computerscience/whatcandataminingdo/30684/ [Online]
[9] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm [Online]
[10] http://www.webgaza.net/background/Gaza_City.htm [Online]
[11] http://www.gazaweather.com/ [Online]
Figure 10: Graph displaying the outliers in the wind speed data found by Detect Outlier (LOF)