11
Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2013, Article ID 136241, 10 pages http://dx.doi.org/10.1155/2013/136241 Research Article An Improved Generalized-Trend-Diffusion-Based Data Imputation for Steel Industry Ying Liu, Zheng Lv, and Wei Wang School of Control Sciences and Engineering, Dalian University of Technology, Dalian 116023, China Correspondence should be addressed to Ying Liu; liu [email protected] Received 5 January 2013; Accepted 20 February 2013 Academic Editor: Jun Zhao Copyright © 2013 Ying Liu et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Integrality and validity of industrial data are the fundamental factors in the domain of data-driven modeling. Aiming at the data missing problem of gas flow in steel industry, an improved Generalized-Trend-Diffusion (iGTD) algorithm is proposed in this study, where in particular it considers the sort of problem with data properties of consecutively missing and small samples. And, the imputation accuracy can be greatly increased by the proposed Gaussian membership-based GTD which expands the useful knowledge of data samples. In addition, the imputation order is further discussed to enhance the sequential forecasting accuracy of gas flow. To verify the effectiveness of the proposed method, a series of experiments that consists of three categories of data features in the gas system is presented, and the results indicate that this method is comprehensively better for the imputation of the periodical-like data and the time-series-like data. 1. Introduction Data missing is one of the major obstacles to obtain valid data samples [1], which also might be common or even inevitable in some data-driven-based research fields, such as sample surveys, industrial productions, medical research, soſt engineering, and wireless broadcast environment [2, 3]. e data missing problem might destroy the samples integrality since every cell in database may not be independent, and furthermore a single missing value might call for dropping the entire observed values or the useful information [4, 5]. As such, some useful information or knowledge could be lost from the data set. Moreover, the data missing will also lead to the nonresponse bias of samples which could be a serious concern for the data-driven-based studies [610]. In the literatures, most of the existing methods for such problem were mainly based on the statistical techniques. For instance, the multiple imputation (MI), a kind of pop- ular technique, was used to resolve the missing data of gross domestic product (GDP) [11], cancer databases [12], and sample surveys [13]. And, a similar response pattern imputation (SRPI) was also implemented in [14]. In [15], the authors used the classic expectation-maximization (EM) algorithm, principal component analysis (PCA), and singular value decomposition, while [16, 17] utilized the maximum likelihood technique to carry out the missing data impu- tation. However, all the techniques mentioned previously might be hard to reflect the relationship among regression variables, since the imputed values were mere approximations of unknown values. Besides the statistical techniques, the machine learning was paying more and more attentions nowadays, as presented in [18, 19]. In industrial manufacturing process, the phenomenon of data missing oſten occurs due to the events such as data collector failures, transmission errors, or information storage errors, which directly result in some obstacles for establish- ing data-driven-models, such as scheduling models, data- driven based regression prediction models, and stochastic optimization models [2022]. ere were different types of approaches for industrial practitioners in the literatures to deal with these data missing problems. In [23, 24], the authors proposed a method called list-wise deletion that was easy to be implemented; however, it tended to reduce the sample data size. Considering that a lot of missing data in industry have the form of time series sampled in equal intervals in most cases, the integrality of sample data has to be broken

Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2013 Article ID 136241 10 pageshttpdxdoiorg1011552013136241

Research ArticleAn Improved Generalized-Trend-Diffusion-Based DataImputation for Steel Industry

Ying Liu Zheng Lv and Wei Wang

School of Control Sciences and Engineering Dalian University of Technology Dalian 116023 China

Correspondence should be addressed to Ying Liu liu yingdluteducn

Received 5 January 2013 Accepted 20 February 2013

Academic Editor Jun Zhao

Copyright copy 2013 Ying Liu et alThis is an open access article distributed under the Creative Commons Attribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Integrality and validity of industrial data are the fundamental factors in the domain of data-driven modeling Aiming at the datamissing problem of gas flow in steel industry an improved Generalized-Trend-Diffusion (iGTD) algorithm is proposed in thisstudy where in particular it considers the sort of problem with data properties of consecutively missing and small samples Andthe imputation accuracy can be greatly increased by the proposed Gaussian membership-based GTD which expands the usefulknowledge of data samples In addition the imputation order is further discussed to enhance the sequential forecasting accuracyof gas flow To verify the effectiveness of the proposed method a series of experiments that consists of three categories of datafeatures in the gas system is presented and the results indicate that this method is comprehensively better for the imputation of theperiodical-like data and the time-series-like data

1 Introduction

Data missing is one of the major obstacles to obtain validdata samples [1] which also might be common or eveninevitable in some data-driven-based research fields such assample surveys industrial productionsmedical research softengineering and wireless broadcast environment [2 3] Thedata missing problem might destroy the samples integralitysince every cell in database may not be independent andfurthermore a single missing value might call for droppingthe entire observed values or the useful information [4 5]As such some useful information or knowledge could belost from the data set Moreover the data missing will alsolead to the nonresponse bias of samples which could bea serious concern for the data-driven-based studies [6ndash10]In the literatures most of the existing methods for suchproblem were mainly based on the statistical techniquesFor instance the multiple imputation (MI) a kind of pop-ular technique was used to resolve the missing data ofgross domestic product (GDP) [11] cancer databases [12]and sample surveys [13] And a similar response patternimputation (SRPI) was also implemented in [14] In [15]the authors used the classic expectation-maximization (EM)

algorithm principal component analysis (PCA) and singularvalue decomposition while [16 17] utilized the maximumlikelihood technique to carry out the missing data impu-tation However all the techniques mentioned previouslymight be hard to reflect the relationship among regressionvariables since the imputed valuesweremere approximationsof unknown values Besides the statistical techniques themachine learning was paying more and more attentionsnowadays as presented in [18 19]

In industrial manufacturing process the phenomenonof data missing often occurs due to the events such as datacollector failures transmission errors or information storageerrors which directly result in some obstacles for establish-ing data-driven-models such as scheduling models data-driven based regression prediction models and stochasticoptimization models [20ndash22] There were different types ofapproaches for industrial practitioners in the literatures todeal with these datamissing problems In [23 24] the authorsproposed a method called list-wise deletion that was easyto be implemented however it tended to reduce the sampledata size Considering that a lot of missing data in industryhave the form of time series sampled in equal intervals inmost cases the integrality of sample data has to be broken

2 Mathematical Problems in Engineering

by such deleting the missing points Besides wasting a lotof costly collected data this method also led to invalidresults if the excluded group was a selective subsample fromthe entire sample [25] Mean imputation presented in [26]was another widely employed method However the meanvalues of the sample might eliminate the samples diversity intime series whose amplitude dramatically changes and thedistortion of samples was usually unacceptable for industrialpractitioners With respect to the other statistical or machinelearning techniques themaximum likelihood estimation andthe linear interpolatorwere respectively proposed in [27 28]where the effective experimentswere used to validate the timeseries imputation Yet all of these experiments showed highdemands of samples and as a result their applications inreal industrial process were rather limited As for all of theabove mentioned methods few of them can bring satisfyingimputation accuracy once the consecutive missing happensor the missing rate is high and the sample size is too small

The Generalized-Trend-Diffusion (GTD) is a methodof sample construction aiming at small data sets As thevirtual examples presented in [29] and the functional virtualpopulation in [30] the so-called shadow data and member-ship functions were employed to increase the knowledge ofsmall data sets see more details in [31] And the expandedsamples were provided for Back Propagate-based (BP) neuralnetworks to carry out the forecasting resulting in the pre-diction accuracy higher than that without expanding Thusthe most significant advantage of GTD was that it couldbring satisfactory forecasting accuracy with relatively smalldata sets On the other hand the original GTD described themembership degree to themean value of observed sample viaa triangular membership function As such each observeddata point deviation from the mean value is proportional tothe difference between the membership function values thatis the observed data points linearly deviate from the meanpoint However such description of deviation cannot bringexcellent accuracy in the imputation tasks for real industrialmanufacturing process

This paper aims at the missing data imputation of blastfurnace gas flow in steel energy system An improved GTDmodeling algorithmbased onGaussianmembership functionis proposed considering the diversity of the gas flow data andthe complex missing situations The Gaussian membershipshows that the observed data deviate from the mean valuenonuniformly and this deviation makes the close-to-meanvalues more likely to appear in the imputation The samplesexpanded by the membership function make the predictedvalues by BP-based network lean to the mean And suchpredicted values do not make the samples single as thoseby the mean imputation do In addition the imputationorder is essential to the accuracy of time series problem Aboth-side-toward-middle (BSTM) order is proposed in thispaper which is indicated to be more appropriate than thechronological order And the tests are implemented to verifythe effectiveness of the proposedmethod inwhich the sampledata comes from the practice of Shanghai Baosteel Co LtdThe results demonstrate that the improved GTD method ismuch better than the original version and other methods inseveral cases

This study is organized as follows In Section 2 the prac-tical conditions of blast furnace gas in Baosteel is describedAnd then the original GTD and its improved version areestablished in Section 3 where the details of how to use theimprovedGTD for the industrial missing data imputation arediscussed In Section 4 the validity of the improved GTD isverified by a series of comparative experiments Finally thisstudy is summarized in Section 5

2 Problem Description

Blast furnace gas (BFG) is a kind of byproduct gas generatedin the process of ironmaking [32] As an important secondaryenergy for blast furnaces coke ovens power stations andother units its proper utility can not only reduce the energyconsumption of steel enterprises but also improve theireconomic profits Figure 1 shows the BFG system structureof a steel plant where four blast furnaces supply the gasto consumers However BFG could be diffused if the flowprediction and the scheduling are inappropriately carried outwhich will seriously pollute the environment In this case thesupervision of BFGrsquos generation and consumption becomes acrucial task for the steel enterprises

Currently the on-site technicians perform the balancescheduling by estimating the BFG generation amount whichcomes from the observed data However the observed dataoften miss due to the collector failure transmission errorsinformation storage errors and so forth Furthermore thegenerating process of BFG is rather complex and the outputfluctuates irregularly therefore the data missing makes theworkers work hard to perceive the dynamics of gas flow viageneric model In practice the gas engineers in ShanghaiBaosteel employ the personal experience-based estimationas the current wide using method when encountering singlepoint missing However there are more consecutive missingpoints in real manufacturing process which make suchmethod relatively weak In addition if the missing rate ishigh the whole time series can be treated as a combination ofseveral small size series In this case the existing methodolo-gies like the recursive neural networks presented in [33 34]cannot be utilized because they need a large amount of sampledata to train the regression model

Aiming at the various features of a large number of gasunits of BFG system we summarize the flow tendencies ofthe generation and consumption units as three categorieswhich involve (1) the periodicity-like flow data (the gasconsumption amount of hot blast stove see Figure 2) (2)the concussive flow data (the gas consumption of coke ovensee Figure 3) and (3) the ordinary time series flow (thegeneration amount of blast furnace see Figure 4)

3 Improved Generalized-Trend-Diffusion

31 Generalized-Trend-Diffusion The GTD is a method ofsample construction aiming at small data sets which gen-erates shadow data using the real data and the occurrenceorder of the observed data The importance degree of thoseshadow data and observed data is quantified by the mem-bership function values based on fuzzy theories Both the

Mathematical Problems in Engineering 3

No 1 CP

Nos 5 6coke ovenNos 3 4

coke ovenNos 1 2

coke oven

Powerstation

CCP

Emissiontower

Cold rolling

Hot rolling

synthesis

synthesis

gas tank

No 1 blast furnace No 2 blast furnace No 3 blast furnace No 4 blast furnace

BLR

No 1No 2 CP

No 1

No 2

gas tankNo 2

Figure 1 BFG network of Baosteel

50 100 150 200 250 30050

100

150

200

250

Time

BFG on hot blast furnace

Am

plitu

de (k

m3)

Figure 2 The consumption flow of hot blast stove

20 40 60 80 100 120 140 160 180 200105110115120125130135140145150155

Time

BFG on coke oven

Am

plitu

de (k

m3)

Figure 3 The consumption flow of coke oven

membership values and the shadow data can be treated asthe additional hidden data-related information which helpsto improve the imputation accuracy All the previous featuresabove make the GTD fit for the missing data imputation oftime series because of their lack of more information except

20 40 60 80 100 120 140 160 180 200300

350

400

450

500

550

600

650

Time

BFG

Am

plitu

de (k

m3)

Figure 4 The generation flow of blast furnace

time [31] One can start by considering that observations arecollected with an empty set where each point occurs witheach observation (Figure 5) As the data increases the centrallocation symbolized ldquoCrdquo in Figure 5 of the data for eachobservationmoves fromone location to another If each pointdeviation from the central location can be obtained then thedetailed distribution of the whole sample is clear As such theGTDwithmembership function can be used to describe suchdeviation Let themembership function value at ldquoCrdquo be 1 andlet those of somemissing data beMF

119909 When these values get

closer to 1 the missing data approach ldquoCrdquo and vice versaIn the original GTD model one can let MF

119905be the

membership function for the data collected at Step 119905 Forexample MF

1at Step 1 refers to 119884

1only MF

2at Step 2 refers

to 1198841 1198842 MF

3at Step 3 refers to 119884

1 1198842 1198843 and so forth

The data like 1198841at Step 2 and 119884

1 1198842 at Step 3 are called

the shadow data They were called as such name becauseeach of them was used repeatedly in each step when formingthe corresponding membership functions while it occurredactually once

Then the imputation can be done by the shadow dataOne can suppose that a sequence of data denoted as

4 Mathematical Problems in Engineering

C

C

C

C

C

Time

Step 1

Step 2

Step 3

Step 4

Step 5

1198841

1198841

1198841

1198841

1198841

1198842

1198842

1198842

1198842

1198843

1198843

1198843

1198844

11988441198845

Figure 5 Time series and central location

C

C

C

CStep 1

Time

1198831119883119905

119883119905

119883119905

119883119905

119883119905minus1

119883119905minus1

119883119905minus1

119883119905minus2

119883119905minus2

Step 119905

Step 119905 minus 1

Step 119905 minus 2

Figure 6 Backward tracking process

1198831 1198832 119883

119905minus1 119883119905 with 119883

119905+1missing has been obtained

The shadow data can be built by unevenly repeating themore recent data which bring more important contemporaryinformation of system variation than that provided by theprevious data As shown in Figure 6 the most recent point119883119905is repeated 119905 times 119883

119905minus1is repeated (119905 minus 1) times 119883

119905minus2

is repeated (119905 minus 2) times and so forth And such repeatingwas called as the backward tracking progress since it is donein the backward tracking progress Then these repeated data(shadow data) with their membership function values help toenlarge the sample knowledge

32 Improved Generalized-Trend-Diffusion A triangularmembership function (Figure 7) was used to describe eachpointrsquos deviation from the mean value in the original GTD

A

1B

C

119884min 119884max119884

MF(119884119894)

MF(119884min )

MF(119884max )

Figure 7 Triangular membership function

A

B

C

1

119884min 119884max

MF(119884119894)

MF(119884min )

MF(119884max )

119884

Figure 8 Gaussian membership function

However such description was somewhat unreasonablesince it restricted the deviation form as a linear one thatis the deviation from the central location was proportionalto the difference between the memberships Under suchcondition the possibility of the data value to appear inthe imputation is equal However the mean-like data haveactually a higher possibility of appearance in the industrialmanufacturing process In this study we can call such datavividly as high frequency cloud If a membership functioncan describe the high frequency cloud more like the meanvalue then the data in the cloud (mean like) will reappearin the imputation with higher possibility Considering suchmotivation the Gaussian membership function (Figure 8)could be more competent to accomplish this job

The information diffusion principle [35] is another reasonfor choosing the Gaussian membership function Informa-tion diffusion has a function of filling in the blanks like themolecular diffusion and its cause lies on that some dataacquire little information from the sample knowledge whilemolecular diffusion is caused by the heterogeneity in thespace distribution As for themolecular diffusion it had beenproved that current molecular density is proportional to theconcentration gradient If this principle is linked with thelaw of conservation of mass the molecular diffusion can bedescribed in the same form as the probability density of Gaus-sian distribution As a kind of incremental learning methodthe GTD is a representation of information diffusion Sincethe causes of information diffusion and molecular diffusionare similar we here get an inspiration to employ the Gaussian

Mathematical Problems in Engineering 5

membership function in the improved GTD The form ofGaussian function is as follows

119891 (119909) = 119886119890minus(119909minus119887)

21198882

(1)

where119886 119887 and 119888 are real constants and119886 gt 0 In order tomakethe function adaptive to the sample construction in this studywe use (2) as the general form of Gaussian function insteadof (1) as follows

119891 (119909) = 119890minus(119909minus120583)

21205902

(2)

where 120583 is the mean value of sample and 120590 is the standarddeviation Here we make 119886 as 1 since the membership valueat the mean value should be 1 After its form confirmedthe Gaussian membership is capable to enlarge the sampleknowledge instead of the triangular one

33 Data Imputation TheBPalgorithm is a supervised learn-ing method in a network which is effected by altering theweights to minimize the difference between the output valueand the desired output value [36] The enlarged knowledgethen can be utilized by BP neural networks to finish theprediction

Missing data points need to be imputed one by one sothat the order of imputation should be another concern inthis study If the imputation is real time the chronologicalorder has to be taken because one cannot currently acquirethe future data points However the study in this paper is adata mining job which does not need real-time imputationsFurthermore if the imputation is not real time the BSTMorder is superior to the chronological one For instance letthere be five consecutive missing data as Figure 9 shows Ifthe imputation order is chronological the forecast error ofpoint number 1 will be amplified so as to affect the forecastaccuracy of point number 2 and the error of point number2 will be again propagated to that of point number 3 In suchway the errors will be cumulative

Besides data points in time series are always fluctuatingas Figure 9 shows If the missing happens on the hillside thechronological imputation result is very likely to be the sameas since it continues the peak However the result maybe like if we use the BSTM order That is points number1 and number 2 are on the peak while points number 4 andnumber 5 are on the plain which both continue the trends Asfor point number 3 we impute that it sings themean of pointsnumber 2 and number 4 Obviously deviates from thereal values more than which shows that the BSTM order issuperior to the chronological oneThis summary is consistentif analyzing the missing points on the peak or on the plain

Let (1198831 1198832 119883

119905minus1 119883119905) be a time series the index of the

firstmissing point denotes as119899 the number of the consecutive

Time

ObservedMissing

ChronologicalBSTM

12

3

4

5

Figure 9 Imputation by BSTM and chronological manner

missing points denotes as119898 and the variable 119889 represents theembedding dimension then we have

former119894

= net ( (119883119894minus119889MF119894minus119889)

(119883119894minus119889+1

MF119894minus119889+1

) (119894minus1MF119894minus1))

latter119894

= net ((119894+1MF119894+1)

(119894+2MF119894+2) (119883

119894+119889MF119894+119889) )

(3)

where former119894

is the imputation of the former half while

latter119894

is in the latter half Then all the imputations 119894can

be expressed as

119894=

former119894

119899 le 119894 le 119899 +119898

2minus 1

latter119894

119899 +119898

2le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 119896 isin 119873lowast

119894=

former119894

119899 le 119894 le 119899 +119898 minus 1

2minus 1

1

2(

former119894

+ latter119894

) 119894 = 119899 +119898 minus 1

2

latter119894

119899+119898 minus 1

2+1le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 minus 1 119896 isin 119873lowast(4)

4 Experimental Results and Analysis

The imputation tests of missing data in BFG flow are carriedout with the proposed Gaussian membership function-basedmethod called iGTD here First of all the superiority of theBSTM order to the chronological one is tested and verifiedA series of consecutive 800 data is picked from number 1blast furnace in Baosteel dating from 1434001382010 to354001482010 Considering that it is difficult to guaranteequantities of consecutive valid data in real industrialdatabases and the small set of samples is our concerns inthis study the embedding dimension is empirically chosen

6 Mathematical Problems in Engineering

Table 1 Imputation accuracies with two kinds of orders

Group Order RMSE NRMSE MAPE ()

1 Chronological order 16930 01486 4565BSTM 11054 00970 2597

2 Chronological order 9274 00669 1880BSTM 8240 00594 1456

3 Chronological order 10770 00842 2395BSTM 5732 00448 1141

4 Chronological order 12047 01022 2975BSTM 9564 00812 2144

0100200300400500600700

1 25 49 73 97 121

145

169

193

217

241

265

289

313

337

361

385

Figure 10 Overview of missing data

as 15 and the hidden neuron size is chosen as 10 in the samemanner We divide the sample data into 4 groups For eachgroup we randomly remove 3 consecutive points (A B andC) in 3 places The tests are respectively implemented inthe chronological order (A-B-C) and the BSTM order (A-C-B) Here we use three indexes as the evaluation criterion ofthe imputation accuracy which are root mean square error(RMSE) normalized root mean square error (NRMSE) andmean absolute percentage error (NRMSE) as follows

RMSE = radic 1119899

119899

sum

119894=1

(119894minus 119884119894)2

NRMSE = radic 1

1198991198842

119899

sum

119894=1

(119894minus 119884119894)2

MAPE = 100119899

119899

sum

119894=1

10038161003816100381610038161003816119894minus 119884119894

10038161003816100381610038161003816

119884119894

(5)

where 119899 is the total number of imputation is the imputationvalue and 119884 is the real value As for the separated 4 groupsof data the imputation accuracies for the different orderare shown in Table 1 It is apparent that the effectiveness ofBSTM is superior to that of the chronological order-basedimputation method

To further verify the effectiveness of the proposedGaussian-based membership function we comprehensivelytake the three categories of gas flow data mentioned inprevious section which include the periodicity-like flowdatashown like the BFG consumption amount by hot blast stovethe concussive flow data shown like the gas consumption by

coke oven and the ordinary time series flow shown like thegeneration amount by blast furnaceThe comparative experi-ments are carried out by using the EM method regressionspline and the original GTD EM algorithm is an iterativemethod for finding maximum likelihood or maximum aposteriori estimates of parameters in statistical models [37]which was widely used in dealing withmissing data since themaximum likelihood estimate of the unknown parameterscan be determined by the incomplete data set The regressionmethod employed multiple linear regressions to estimate themissing values

We still apply the real industrial data in Baosteel tocomplete the comparative experiment where the collecteddata are divided into several groups and some consecutive3 points 4 points and 5 points are removed from the timeseries In order to cover the all of the possible situations theremoved data involves the time series areas on peak troughand plain as Figure 10 shows in which the points in red areremoved

(1) BFG Consumption by Hot Blast Stove (PeriodicityLike) The experimental data are from number 2 hotblast stove in Shanghai Baosteel randomly selected from14280013082010 to 21550014082010 These data aredivided into 3 groups each of which is then divided into3 subgroups and each subgroup contains 200 points Theaccuracies of the imputation result are presented in Table 2

It can be found that the results by both the original GTDand the proposed iGTD are much more excellent in termsof the accuracy when the consecutive data missing occursFurthermore the effectiveness of the iGTD is generally betterthan that of GTD Then a conclusion can be drawn thatthe iGTD employed the Gaussian membership function canobtain the better data imputation results compared to thetriangle-based membership of GTD

(2) BFG Consumption by Coke Oven (Concussive) Theexperimental data are from number 1 coke oven in Shang-hai Baosteel randomly selected from 07140014082010 to14350015082010 The data-grouped measure is similarto that in the validation for periodicity-like data missingAnd the corresponding imputation accuracies are listed inTable 3 From the experiments results all the five methodsare almost same imputation accuracies and in particular EMshould be the best solution method of the five However it is

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

2 Mathematical Problems in Engineering

by such deleting the missing points Besides wasting a lotof costly collected data this method also led to invalidresults if the excluded group was a selective subsample fromthe entire sample [25] Mean imputation presented in [26]was another widely employed method However the meanvalues of the sample might eliminate the samples diversity intime series whose amplitude dramatically changes and thedistortion of samples was usually unacceptable for industrialpractitioners With respect to the other statistical or machinelearning techniques themaximum likelihood estimation andthe linear interpolatorwere respectively proposed in [27 28]where the effective experimentswere used to validate the timeseries imputation Yet all of these experiments showed highdemands of samples and as a result their applications inreal industrial process were rather limited As for all of theabove mentioned methods few of them can bring satisfyingimputation accuracy once the consecutive missing happensor the missing rate is high and the sample size is too small

The Generalized-Trend-Diffusion (GTD) is a methodof sample construction aiming at small data sets As thevirtual examples presented in [29] and the functional virtualpopulation in [30] the so-called shadow data and member-ship functions were employed to increase the knowledge ofsmall data sets see more details in [31] And the expandedsamples were provided for Back Propagate-based (BP) neuralnetworks to carry out the forecasting resulting in the pre-diction accuracy higher than that without expanding Thusthe most significant advantage of GTD was that it couldbring satisfactory forecasting accuracy with relatively smalldata sets On the other hand the original GTD described themembership degree to themean value of observed sample viaa triangular membership function As such each observeddata point deviation from the mean value is proportional tothe difference between the membership function values thatis the observed data points linearly deviate from the meanpoint However such description of deviation cannot bringexcellent accuracy in the imputation tasks for real industrialmanufacturing process

This paper aims at the missing data imputation of blastfurnace gas flow in steel energy system An improved GTDmodeling algorithmbased onGaussianmembership functionis proposed considering the diversity of the gas flow data andthe complex missing situations The Gaussian membershipshows that the observed data deviate from the mean valuenonuniformly and this deviation makes the close-to-meanvalues more likely to appear in the imputation The samplesexpanded by the membership function make the predictedvalues by BP-based network lean to the mean And suchpredicted values do not make the samples single as thoseby the mean imputation do In addition the imputationorder is essential to the accuracy of time series problem Aboth-side-toward-middle (BSTM) order is proposed in thispaper which is indicated to be more appropriate than thechronological order And the tests are implemented to verifythe effectiveness of the proposedmethod inwhich the sampledata comes from the practice of Shanghai Baosteel Co LtdThe results demonstrate that the improved GTD method ismuch better than the original version and other methods inseveral cases

This study is organized as follows In Section 2 the prac-tical conditions of blast furnace gas in Baosteel is describedAnd then the original GTD and its improved version areestablished in Section 3 where the details of how to use theimprovedGTD for the industrial missing data imputation arediscussed In Section 4 the validity of the improved GTD isverified by a series of comparative experiments Finally thisstudy is summarized in Section 5

2 Problem Description

Blast furnace gas (BFG) is a kind of byproduct gas generatedin the process of ironmaking [32] As an important secondaryenergy for blast furnaces coke ovens power stations andother units its proper utility can not only reduce the energyconsumption of steel enterprises but also improve theireconomic profits Figure 1 shows the BFG system structureof a steel plant where four blast furnaces supply the gasto consumers However BFG could be diffused if the flowprediction and the scheduling are inappropriately carried outwhich will seriously pollute the environment In this case thesupervision of BFGrsquos generation and consumption becomes acrucial task for the steel enterprises

Currently the on-site technicians perform the balancescheduling by estimating the BFG generation amount whichcomes from the observed data However the observed dataoften miss due to the collector failure transmission errorsinformation storage errors and so forth Furthermore thegenerating process of BFG is rather complex and the outputfluctuates irregularly therefore the data missing makes theworkers work hard to perceive the dynamics of gas flow viageneric model In practice the gas engineers in ShanghaiBaosteel employ the personal experience-based estimationas the current wide using method when encountering singlepoint missing However there are more consecutive missingpoints in real manufacturing process which make suchmethod relatively weak In addition if the missing rate ishigh the whole time series can be treated as a combination ofseveral small size series In this case the existing methodolo-gies like the recursive neural networks presented in [33 34]cannot be utilized because they need a large amount of sampledata to train the regression model

Aiming at the various features of a large number of gasunits of BFG system we summarize the flow tendencies ofthe generation and consumption units as three categorieswhich involve (1) the periodicity-like flow data (the gasconsumption amount of hot blast stove see Figure 2) (2)the concussive flow data (the gas consumption of coke ovensee Figure 3) and (3) the ordinary time series flow (thegeneration amount of blast furnace see Figure 4)

3 Improved Generalized-Trend-Diffusion

31 Generalized-Trend-Diffusion The GTD is a method ofsample construction aiming at small data sets which gen-erates shadow data using the real data and the occurrenceorder of the observed data The importance degree of thoseshadow data and observed data is quantified by the mem-bership function values based on fuzzy theories Both the

Mathematical Problems in Engineering 3

No 1 CP

Nos 5 6coke ovenNos 3 4

coke ovenNos 1 2

coke oven

Powerstation

CCP

Emissiontower

Cold rolling

Hot rolling

synthesis

synthesis

gas tank

No 1 blast furnace No 2 blast furnace No 3 blast furnace No 4 blast furnace

BLR

No 1No 2 CP

No 1

No 2

gas tankNo 2

Figure 1 BFG network of Baosteel

50 100 150 200 250 30050

100

150

200

250

Time

BFG on hot blast furnace

Am

plitu

de (k

m3)

Figure 2 The consumption flow of hot blast stove

20 40 60 80 100 120 140 160 180 200105110115120125130135140145150155

Time

BFG on coke oven

Am

plitu

de (k

m3)

Figure 3 The consumption flow of coke oven

membership values and the shadow data can be treated asthe additional hidden data-related information which helpsto improve the imputation accuracy All the previous featuresabove make the GTD fit for the missing data imputation oftime series because of their lack of more information except

20 40 60 80 100 120 140 160 180 200300

350

400

450

500

550

600

650

Time

BFG

Am

plitu

de (k

m3)

Figure 4 The generation flow of blast furnace

time [31] One can start by considering that observations arecollected with an empty set where each point occurs witheach observation (Figure 5) As the data increases the centrallocation symbolized ldquoCrdquo in Figure 5 of the data for eachobservationmoves fromone location to another If each pointdeviation from the central location can be obtained then thedetailed distribution of the whole sample is clear As such theGTDwithmembership function can be used to describe suchdeviation Let themembership function value at ldquoCrdquo be 1 andlet those of somemissing data beMF

119909 When these values get

closer to 1 the missing data approach ldquoCrdquo and vice versaIn the original GTD model one can let MF

119905be the

membership function for the data collected at Step 119905 Forexample MF

1at Step 1 refers to 119884

1only MF

2at Step 2 refers

to 1198841 1198842 MF

3at Step 3 refers to 119884

1 1198842 1198843 and so forth

The data like 1198841at Step 2 and 119884

1 1198842 at Step 3 are called

the shadow data They were called as such name becauseeach of them was used repeatedly in each step when formingthe corresponding membership functions while it occurredactually once

Then the imputation can be done by the shadow dataOne can suppose that a sequence of data denoted as

4 Mathematical Problems in Engineering

C

C

C

C

C

Time

Step 1

Step 2

Step 3

Step 4

Step 5

1198841

1198841

1198841

1198841

1198841

1198842

1198842

1198842

1198842

1198843

1198843

1198843

1198844

11988441198845

Figure 5 Time series and central location

C

C

C

CStep 1

Time

1198831119883119905

119883119905

119883119905

119883119905

119883119905minus1

119883119905minus1

119883119905minus1

119883119905minus2

119883119905minus2

Step 119905

Step 119905 minus 1

Step 119905 minus 2

Figure 6 Backward tracking process

1198831 1198832 119883

119905minus1 119883119905 with 119883

119905+1missing has been obtained

The shadow data can be built by unevenly repeating themore recent data which bring more important contemporaryinformation of system variation than that provided by theprevious data As shown in Figure 6 the most recent point119883119905is repeated 119905 times 119883

119905minus1is repeated (119905 minus 1) times 119883

119905minus2

is repeated (119905 minus 2) times and so forth And such repeatingwas called as the backward tracking progress since it is donein the backward tracking progress Then these repeated data(shadow data) with their membership function values help toenlarge the sample knowledge

32 Improved Generalized-Trend-Diffusion A triangularmembership function (Figure 7) was used to describe eachpointrsquos deviation from the mean value in the original GTD

A

1B

C

119884min 119884max119884

MF(119884119894)

MF(119884min )

MF(119884max )

Figure 7 Triangular membership function

A

B

C

1

119884min 119884max

MF(119884119894)

MF(119884min )

MF(119884max )

119884

Figure 8 Gaussian membership function

However such description was somewhat unreasonablesince it restricted the deviation form as a linear one thatis the deviation from the central location was proportionalto the difference between the memberships Under suchcondition the possibility of the data value to appear inthe imputation is equal However the mean-like data haveactually a higher possibility of appearance in the industrialmanufacturing process In this study we can call such datavividly as high frequency cloud If a membership functioncan describe the high frequency cloud more like the meanvalue then the data in the cloud (mean like) will reappearin the imputation with higher possibility Considering suchmotivation the Gaussian membership function (Figure 8)could be more competent to accomplish this job

The information diffusion principle [35] is another reasonfor choosing the Gaussian membership function Informa-tion diffusion has a function of filling in the blanks like themolecular diffusion and its cause lies on that some dataacquire little information from the sample knowledge whilemolecular diffusion is caused by the heterogeneity in thespace distribution As for themolecular diffusion it had beenproved that current molecular density is proportional to theconcentration gradient If this principle is linked with thelaw of conservation of mass the molecular diffusion can bedescribed in the same form as the probability density of Gaus-sian distribution As a kind of incremental learning methodthe GTD is a representation of information diffusion Sincethe causes of information diffusion and molecular diffusionare similar we here get an inspiration to employ the Gaussian

Mathematical Problems in Engineering 5

membership function in the improved GTD The form ofGaussian function is as follows

119891 (119909) = 119886119890minus(119909minus119887)

21198882

(1)

where119886 119887 and 119888 are real constants and119886 gt 0 In order tomakethe function adaptive to the sample construction in this studywe use (2) as the general form of Gaussian function insteadof (1) as follows

119891 (119909) = 119890minus(119909minus120583)

21205902

(2)

where 120583 is the mean value of sample and 120590 is the standarddeviation Here we make 119886 as 1 since the membership valueat the mean value should be 1 After its form confirmedthe Gaussian membership is capable to enlarge the sampleknowledge instead of the triangular one

33 Data Imputation TheBPalgorithm is a supervised learn-ing method in a network which is effected by altering theweights to minimize the difference between the output valueand the desired output value [36] The enlarged knowledgethen can be utilized by BP neural networks to finish theprediction

Missing data points need to be imputed one by one sothat the order of imputation should be another concern inthis study If the imputation is real time the chronologicalorder has to be taken because one cannot currently acquirethe future data points However the study in this paper is adata mining job which does not need real-time imputationsFurthermore if the imputation is not real time the BSTMorder is superior to the chronological one For instance letthere be five consecutive missing data as Figure 9 shows Ifthe imputation order is chronological the forecast error ofpoint number 1 will be amplified so as to affect the forecastaccuracy of point number 2 and the error of point number2 will be again propagated to that of point number 3 In suchway the errors will be cumulative

Besides data points in time series are always fluctuatingas Figure 9 shows If the missing happens on the hillside thechronological imputation result is very likely to be the sameas since it continues the peak However the result maybe like if we use the BSTM order That is points number1 and number 2 are on the peak while points number 4 andnumber 5 are on the plain which both continue the trends Asfor point number 3 we impute that it sings themean of pointsnumber 2 and number 4 Obviously deviates from thereal values more than which shows that the BSTM order issuperior to the chronological oneThis summary is consistentif analyzing the missing points on the peak or on the plain

Let (1198831 1198832 119883

119905minus1 119883119905) be a time series the index of the

firstmissing point denotes as119899 the number of the consecutive

Time

ObservedMissing

ChronologicalBSTM

12

3

4

5

Figure 9 Imputation by BSTM and chronological manner

missing points denotes as119898 and the variable 119889 represents theembedding dimension then we have

former119894

= net ( (119883119894minus119889MF119894minus119889)

(119883119894minus119889+1

MF119894minus119889+1

) (119894minus1MF119894minus1))

latter119894

= net ((119894+1MF119894+1)

(119894+2MF119894+2) (119883

119894+119889MF119894+119889) )

(3)

where former119894

is the imputation of the former half while

latter119894

is in the latter half Then all the imputations 119894can

be expressed as

119894=

former119894

119899 le 119894 le 119899 +119898

2minus 1

latter119894

119899 +119898

2le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 119896 isin 119873lowast

119894=

former119894

119899 le 119894 le 119899 +119898 minus 1

2minus 1

1

2(

former119894

+ latter119894

) 119894 = 119899 +119898 minus 1

2

latter119894

119899+119898 minus 1

2+1le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 minus 1 119896 isin 119873lowast(4)

4 Experimental Results and Analysis

The imputation tests of missing data in BFG flow are carriedout with the proposed Gaussian membership function-basedmethod called iGTD here First of all the superiority of theBSTM order to the chronological one is tested and verifiedA series of consecutive 800 data is picked from number 1blast furnace in Baosteel dating from 1434001382010 to354001482010 Considering that it is difficult to guaranteequantities of consecutive valid data in real industrialdatabases and the small set of samples is our concerns inthis study the embedding dimension is empirically chosen

6 Mathematical Problems in Engineering

Table 1 Imputation accuracies with two kinds of orders

Group Order RMSE NRMSE MAPE ()

1 Chronological order 16930 01486 4565BSTM 11054 00970 2597

2 Chronological order 9274 00669 1880BSTM 8240 00594 1456

3 Chronological order 10770 00842 2395BSTM 5732 00448 1141

4 Chronological order 12047 01022 2975BSTM 9564 00812 2144

0100200300400500600700

1 25 49 73 97 121

145

169

193

217

241

265

289

313

337

361

385

Figure 10 Overview of missing data

as 15 and the hidden neuron size is chosen as 10 in the samemanner We divide the sample data into 4 groups For eachgroup we randomly remove 3 consecutive points (A B andC) in 3 places The tests are respectively implemented inthe chronological order (A-B-C) and the BSTM order (A-C-B) Here we use three indexes as the evaluation criterion ofthe imputation accuracy which are root mean square error(RMSE) normalized root mean square error (NRMSE) andmean absolute percentage error (NRMSE) as follows

RMSE = radic 1119899

119899

sum

119894=1

(119894minus 119884119894)2

NRMSE = radic 1

1198991198842

119899

sum

119894=1

(119894minus 119884119894)2

MAPE = 100119899

119899

sum

119894=1

10038161003816100381610038161003816119894minus 119884119894

10038161003816100381610038161003816

119884119894

(5)

where 119899 is the total number of imputation is the imputationvalue and 119884 is the real value As for the separated 4 groupsof data the imputation accuracies for the different orderare shown in Table 1 It is apparent that the effectiveness ofBSTM is superior to that of the chronological order-basedimputation method

To further verify the effectiveness of the proposedGaussian-based membership function we comprehensivelytake the three categories of gas flow data mentioned inprevious section which include the periodicity-like flowdatashown like the BFG consumption amount by hot blast stovethe concussive flow data shown like the gas consumption by

coke oven and the ordinary time series flow shown like thegeneration amount by blast furnaceThe comparative experi-ments are carried out by using the EM method regressionspline and the original GTD EM algorithm is an iterativemethod for finding maximum likelihood or maximum aposteriori estimates of parameters in statistical models [37]which was widely used in dealing withmissing data since themaximum likelihood estimate of the unknown parameterscan be determined by the incomplete data set The regressionmethod employed multiple linear regressions to estimate themissing values

We still apply the real industrial data in Baosteel tocomplete the comparative experiment where the collecteddata are divided into several groups and some consecutive3 points 4 points and 5 points are removed from the timeseries In order to cover the all of the possible situations theremoved data involves the time series areas on peak troughand plain as Figure 10 shows in which the points in red areremoved

(1) BFG Consumption by Hot Blast Stove (PeriodicityLike) The experimental data are from number 2 hotblast stove in Shanghai Baosteel randomly selected from14280013082010 to 21550014082010 These data aredivided into 3 groups each of which is then divided into3 subgroups and each subgroup contains 200 points Theaccuracies of the imputation result are presented in Table 2

It can be found that the results by both the original GTDand the proposed iGTD are much more excellent in termsof the accuracy when the consecutive data missing occursFurthermore the effectiveness of the iGTD is generally betterthan that of GTD Then a conclusion can be drawn thatthe iGTD employed the Gaussian membership function canobtain the better data imputation results compared to thetriangle-based membership of GTD

(2) BFG Consumption by Coke Oven (Concussive) Theexperimental data are from number 1 coke oven in Shang-hai Baosteel randomly selected from 07140014082010 to14350015082010 The data-grouped measure is similarto that in the validation for periodicity-like data missingAnd the corresponding imputation accuracies are listed inTable 3 From the experiments results all the five methodsare almost same imputation accuracies and in particular EMshould be the best solution method of the five However it is

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

Mathematical Problems in Engineering 3

No 1 CP

Nos 5 6coke ovenNos 3 4

coke ovenNos 1 2

coke oven

Powerstation

CCP

Emissiontower

Cold rolling

Hot rolling

synthesis

synthesis

gas tank

No 1 blast furnace No 2 blast furnace No 3 blast furnace No 4 blast furnace

BLR

No 1No 2 CP

No 1

No 2

gas tankNo 2

Figure 1 BFG network of Baosteel

50 100 150 200 250 30050

100

150

200

250

Time

BFG on hot blast furnace

Am

plitu

de (k

m3)

Figure 2 The consumption flow of hot blast stove

20 40 60 80 100 120 140 160 180 200105110115120125130135140145150155

Time

BFG on coke oven

Am

plitu

de (k

m3)

Figure 3 The consumption flow of coke oven

membership values and the shadow data can be treated asthe additional hidden data-related information which helpsto improve the imputation accuracy All the previous featuresabove make the GTD fit for the missing data imputation oftime series because of their lack of more information except

20 40 60 80 100 120 140 160 180 200300

350

400

450

500

550

600

650

Time

BFG

Am

plitu

de (k

m3)

Figure 4 The generation flow of blast furnace

time [31] One can start by considering that observations arecollected with an empty set where each point occurs witheach observation (Figure 5) As the data increases the centrallocation symbolized ldquoCrdquo in Figure 5 of the data for eachobservationmoves fromone location to another If each pointdeviation from the central location can be obtained then thedetailed distribution of the whole sample is clear As such theGTDwithmembership function can be used to describe suchdeviation Let themembership function value at ldquoCrdquo be 1 andlet those of somemissing data beMF

119909 When these values get

closer to 1 the missing data approach ldquoCrdquo and vice versaIn the original GTD model one can let MF

119905be the

membership function for the data collected at Step 119905 Forexample MF

1at Step 1 refers to 119884

1only MF

2at Step 2 refers

to 1198841 1198842 MF

3at Step 3 refers to 119884

1 1198842 1198843 and so forth

The data like 1198841at Step 2 and 119884

1 1198842 at Step 3 are called

the shadow data They were called as such name becauseeach of them was used repeatedly in each step when formingthe corresponding membership functions while it occurredactually once

Then the imputation can be done by the shadow dataOne can suppose that a sequence of data denoted as

4 Mathematical Problems in Engineering

C

C

C

C

C

Time

Step 1

Step 2

Step 3

Step 4

Step 5

1198841

1198841

1198841

1198841

1198841

1198842

1198842

1198842

1198842

1198843

1198843

1198843

1198844

11988441198845

Figure 5 Time series and central location

C

C

C

CStep 1

Time

1198831119883119905

119883119905

119883119905

119883119905

119883119905minus1

119883119905minus1

119883119905minus1

119883119905minus2

119883119905minus2

Step 119905

Step 119905 minus 1

Step 119905 minus 2

Figure 6 Backward tracking process

1198831 1198832 119883

119905minus1 119883119905 with 119883

119905+1missing has been obtained

The shadow data can be built by unevenly repeating themore recent data which bring more important contemporaryinformation of system variation than that provided by theprevious data As shown in Figure 6 the most recent point119883119905is repeated 119905 times 119883

119905minus1is repeated (119905 minus 1) times 119883

119905minus2

is repeated (119905 minus 2) times and so forth And such repeatingwas called as the backward tracking progress since it is donein the backward tracking progress Then these repeated data(shadow data) with their membership function values help toenlarge the sample knowledge

32 Improved Generalized-Trend-Diffusion A triangularmembership function (Figure 7) was used to describe eachpointrsquos deviation from the mean value in the original GTD

A

1B

C

119884min 119884max119884

MF(119884119894)

MF(119884min )

MF(119884max )

Figure 7 Triangular membership function

A

B

C

1

119884min 119884max

MF(119884119894)

MF(119884min )

MF(119884max )

119884

Figure 8 Gaussian membership function

However such description was somewhat unreasonablesince it restricted the deviation form as a linear one thatis the deviation from the central location was proportionalto the difference between the memberships Under suchcondition the possibility of the data value to appear inthe imputation is equal However the mean-like data haveactually a higher possibility of appearance in the industrialmanufacturing process In this study we can call such datavividly as high frequency cloud If a membership functioncan describe the high frequency cloud more like the meanvalue then the data in the cloud (mean like) will reappearin the imputation with higher possibility Considering suchmotivation the Gaussian membership function (Figure 8)could be more competent to accomplish this job

The information diffusion principle [35] is another reasonfor choosing the Gaussian membership function Informa-tion diffusion has a function of filling in the blanks like themolecular diffusion and its cause lies on that some dataacquire little information from the sample knowledge whilemolecular diffusion is caused by the heterogeneity in thespace distribution As for themolecular diffusion it had beenproved that current molecular density is proportional to theconcentration gradient If this principle is linked with thelaw of conservation of mass the molecular diffusion can bedescribed in the same form as the probability density of Gaus-sian distribution As a kind of incremental learning methodthe GTD is a representation of information diffusion Sincethe causes of information diffusion and molecular diffusionare similar we here get an inspiration to employ the Gaussian

Mathematical Problems in Engineering 5

membership function in the improved GTD The form ofGaussian function is as follows

119891 (119909) = 119886119890minus(119909minus119887)

21198882

(1)

where119886 119887 and 119888 are real constants and119886 gt 0 In order tomakethe function adaptive to the sample construction in this studywe use (2) as the general form of Gaussian function insteadof (1) as follows

119891 (119909) = 119890minus(119909minus120583)

21205902

(2)

where 120583 is the mean value of sample and 120590 is the standarddeviation Here we make 119886 as 1 since the membership valueat the mean value should be 1 After its form confirmedthe Gaussian membership is capable to enlarge the sampleknowledge instead of the triangular one

33 Data Imputation TheBPalgorithm is a supervised learn-ing method in a network which is effected by altering theweights to minimize the difference between the output valueand the desired output value [36] The enlarged knowledgethen can be utilized by BP neural networks to finish theprediction

Missing data points need to be imputed one by one sothat the order of imputation should be another concern inthis study If the imputation is real time the chronologicalorder has to be taken because one cannot currently acquirethe future data points However the study in this paper is adata mining job which does not need real-time imputationsFurthermore if the imputation is not real time the BSTMorder is superior to the chronological one For instance letthere be five consecutive missing data as Figure 9 shows Ifthe imputation order is chronological the forecast error ofpoint number 1 will be amplified so as to affect the forecastaccuracy of point number 2 and the error of point number2 will be again propagated to that of point number 3 In suchway the errors will be cumulative

Besides data points in time series are always fluctuatingas Figure 9 shows If the missing happens on the hillside thechronological imputation result is very likely to be the sameas since it continues the peak However the result maybe like if we use the BSTM order That is points number1 and number 2 are on the peak while points number 4 andnumber 5 are on the plain which both continue the trends Asfor point number 3 we impute that it sings themean of pointsnumber 2 and number 4 Obviously deviates from thereal values more than which shows that the BSTM order issuperior to the chronological oneThis summary is consistentif analyzing the missing points on the peak or on the plain

Let (1198831 1198832 119883

119905minus1 119883119905) be a time series the index of the

firstmissing point denotes as119899 the number of the consecutive

Time

ObservedMissing

ChronologicalBSTM

12

3

4

5

Figure 9 Imputation by BSTM and chronological manner

missing points denotes as119898 and the variable 119889 represents theembedding dimension then we have

former119894

= net ( (119883119894minus119889MF119894minus119889)

(119883119894minus119889+1

MF119894minus119889+1

) (119894minus1MF119894minus1))

latter119894

= net ((119894+1MF119894+1)

(119894+2MF119894+2) (119883

119894+119889MF119894+119889) )

(3)

where former119894

is the imputation of the former half while

latter119894

is in the latter half Then all the imputations 119894can

be expressed as

119894=

former119894

119899 le 119894 le 119899 +119898

2minus 1

latter119894

119899 +119898

2le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 119896 isin 119873lowast

119894=

former119894

119899 le 119894 le 119899 +119898 minus 1

2minus 1

1

2(

former119894

+ latter119894

) 119894 = 119899 +119898 minus 1

2

latter119894

119899+119898 minus 1

2+1le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 minus 1 119896 isin 119873lowast(4)

4 Experimental Results and Analysis

The imputation tests of missing data in BFG flow are carriedout with the proposed Gaussian membership function-basedmethod called iGTD here First of all the superiority of theBSTM order to the chronological one is tested and verifiedA series of consecutive 800 data is picked from number 1blast furnace in Baosteel dating from 1434001382010 to354001482010 Considering that it is difficult to guaranteequantities of consecutive valid data in real industrialdatabases and the small set of samples is our concerns inthis study the embedding dimension is empirically chosen

6 Mathematical Problems in Engineering

Table 1 Imputation accuracies with two kinds of orders

Group Order RMSE NRMSE MAPE ()

1 Chronological order 16930 01486 4565BSTM 11054 00970 2597

2 Chronological order 9274 00669 1880BSTM 8240 00594 1456

3 Chronological order 10770 00842 2395BSTM 5732 00448 1141

4 Chronological order 12047 01022 2975BSTM 9564 00812 2144

0100200300400500600700

1 25 49 73 97 121

145

169

193

217

241

265

289

313

337

361

385

Figure 10 Overview of missing data

as 15 and the hidden neuron size is chosen as 10 in the samemanner We divide the sample data into 4 groups For eachgroup we randomly remove 3 consecutive points (A B andC) in 3 places The tests are respectively implemented inthe chronological order (A-B-C) and the BSTM order (A-C-B) Here we use three indexes as the evaluation criterion ofthe imputation accuracy which are root mean square error(RMSE) normalized root mean square error (NRMSE) andmean absolute percentage error (NRMSE) as follows

RMSE = radic 1119899

119899

sum

119894=1

(119894minus 119884119894)2

NRMSE = radic 1

1198991198842

119899

sum

119894=1

(119894minus 119884119894)2

MAPE = 100119899

119899

sum

119894=1

10038161003816100381610038161003816119894minus 119884119894

10038161003816100381610038161003816

119884119894

(5)

where 119899 is the total number of imputation is the imputationvalue and 119884 is the real value As for the separated 4 groupsof data the imputation accuracies for the different orderare shown in Table 1 It is apparent that the effectiveness ofBSTM is superior to that of the chronological order-basedimputation method

To further verify the effectiveness of the proposedGaussian-based membership function we comprehensivelytake the three categories of gas flow data mentioned inprevious section which include the periodicity-like flowdatashown like the BFG consumption amount by hot blast stovethe concussive flow data shown like the gas consumption by

coke oven and the ordinary time series flow shown like thegeneration amount by blast furnaceThe comparative experi-ments are carried out by using the EM method regressionspline and the original GTD EM algorithm is an iterativemethod for finding maximum likelihood or maximum aposteriori estimates of parameters in statistical models [37]which was widely used in dealing withmissing data since themaximum likelihood estimate of the unknown parameterscan be determined by the incomplete data set The regressionmethod employed multiple linear regressions to estimate themissing values

We still apply the real industrial data in Baosteel tocomplete the comparative experiment where the collecteddata are divided into several groups and some consecutive3 points 4 points and 5 points are removed from the timeseries In order to cover the all of the possible situations theremoved data involves the time series areas on peak troughand plain as Figure 10 shows in which the points in red areremoved

(1) BFG Consumption by Hot Blast Stove (PeriodicityLike) The experimental data are from number 2 hotblast stove in Shanghai Baosteel randomly selected from14280013082010 to 21550014082010 These data aredivided into 3 groups each of which is then divided into3 subgroups and each subgroup contains 200 points Theaccuracies of the imputation result are presented in Table 2

It can be found that the results by both the original GTDand the proposed iGTD are much more excellent in termsof the accuracy when the consecutive data missing occursFurthermore the effectiveness of the iGTD is generally betterthan that of GTD Then a conclusion can be drawn thatthe iGTD employed the Gaussian membership function canobtain the better data imputation results compared to thetriangle-based membership of GTD

(2) BFG Consumption by Coke Oven (Concussive) Theexperimental data are from number 1 coke oven in Shang-hai Baosteel randomly selected from 07140014082010 to14350015082010 The data-grouped measure is similarto that in the validation for periodicity-like data missingAnd the corresponding imputation accuracies are listed inTable 3 From the experiments results all the five methodsare almost same imputation accuracies and in particular EMshould be the best solution method of the five However it is

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

4 Mathematical Problems in Engineering

C

C

C

C

C

Time

Step 1

Step 2

Step 3

Step 4

Step 5

1198841

1198841

1198841

1198841

1198841

1198842

1198842

1198842

1198842

1198843

1198843

1198843

1198844

11988441198845

Figure 5 Time series and central location

C

C

C

CStep 1

Time

1198831119883119905

119883119905

119883119905

119883119905

119883119905minus1

119883119905minus1

119883119905minus1

119883119905minus2

119883119905minus2

Step 119905

Step 119905 minus 1

Step 119905 minus 2

Figure 6 Backward tracking process

1198831 1198832 119883

119905minus1 119883119905 with 119883

119905+1missing has been obtained

The shadow data can be built by unevenly repeating themore recent data which bring more important contemporaryinformation of system variation than that provided by theprevious data As shown in Figure 6 the most recent point119883119905is repeated 119905 times 119883

119905minus1is repeated (119905 minus 1) times 119883

119905minus2

is repeated (119905 minus 2) times and so forth And such repeatingwas called as the backward tracking progress since it is donein the backward tracking progress Then these repeated data(shadow data) with their membership function values help toenlarge the sample knowledge

32 Improved Generalized-Trend-Diffusion A triangularmembership function (Figure 7) was used to describe eachpointrsquos deviation from the mean value in the original GTD

A

1B

C

119884min 119884max119884

MF(119884119894)

MF(119884min )

MF(119884max )

Figure 7 Triangular membership function

A

B

C

1

119884min 119884max

MF(119884119894)

MF(119884min )

MF(119884max )

119884

Figure 8 Gaussian membership function

However such description was somewhat unreasonablesince it restricted the deviation form as a linear one thatis the deviation from the central location was proportionalto the difference between the memberships Under suchcondition the possibility of the data value to appear inthe imputation is equal However the mean-like data haveactually a higher possibility of appearance in the industrialmanufacturing process In this study we can call such datavividly as high frequency cloud If a membership functioncan describe the high frequency cloud more like the meanvalue then the data in the cloud (mean like) will reappearin the imputation with higher possibility Considering suchmotivation the Gaussian membership function (Figure 8)could be more competent to accomplish this job

The information diffusion principle [35] is another reasonfor choosing the Gaussian membership function Informa-tion diffusion has a function of filling in the blanks like themolecular diffusion and its cause lies on that some dataacquire little information from the sample knowledge whilemolecular diffusion is caused by the heterogeneity in thespace distribution As for themolecular diffusion it had beenproved that current molecular density is proportional to theconcentration gradient If this principle is linked with thelaw of conservation of mass the molecular diffusion can bedescribed in the same form as the probability density of Gaus-sian distribution As a kind of incremental learning methodthe GTD is a representation of information diffusion Sincethe causes of information diffusion and molecular diffusionare similar we here get an inspiration to employ the Gaussian

Mathematical Problems in Engineering 5

membership function in the improved GTD The form ofGaussian function is as follows

119891 (119909) = 119886119890minus(119909minus119887)

21198882

(1)

where119886 119887 and 119888 are real constants and119886 gt 0 In order tomakethe function adaptive to the sample construction in this studywe use (2) as the general form of Gaussian function insteadof (1) as follows

119891 (119909) = 119890minus(119909minus120583)

21205902

(2)

where 120583 is the mean value of sample and 120590 is the standarddeviation Here we make 119886 as 1 since the membership valueat the mean value should be 1 After its form confirmedthe Gaussian membership is capable to enlarge the sampleknowledge instead of the triangular one

33 Data Imputation TheBPalgorithm is a supervised learn-ing method in a network which is effected by altering theweights to minimize the difference between the output valueand the desired output value [36] The enlarged knowledgethen can be utilized by BP neural networks to finish theprediction

Missing data points need to be imputed one by one sothat the order of imputation should be another concern inthis study If the imputation is real time the chronologicalorder has to be taken because one cannot currently acquirethe future data points However the study in this paper is adata mining job which does not need real-time imputationsFurthermore if the imputation is not real time the BSTMorder is superior to the chronological one For instance letthere be five consecutive missing data as Figure 9 shows Ifthe imputation order is chronological the forecast error ofpoint number 1 will be amplified so as to affect the forecastaccuracy of point number 2 and the error of point number2 will be again propagated to that of point number 3 In suchway the errors will be cumulative

Besides data points in time series are always fluctuatingas Figure 9 shows If the missing happens on the hillside thechronological imputation result is very likely to be the sameas since it continues the peak However the result maybe like if we use the BSTM order That is points number1 and number 2 are on the peak while points number 4 andnumber 5 are on the plain which both continue the trends Asfor point number 3 we impute that it sings themean of pointsnumber 2 and number 4 Obviously deviates from thereal values more than which shows that the BSTM order issuperior to the chronological oneThis summary is consistentif analyzing the missing points on the peak or on the plain

Let (1198831 1198832 119883

119905minus1 119883119905) be a time series the index of the

firstmissing point denotes as119899 the number of the consecutive

Time

ObservedMissing

ChronologicalBSTM

12

3

4

5

Figure 9 Imputation by BSTM and chronological manner

missing points denotes as119898 and the variable 119889 represents theembedding dimension then we have

former119894

= net ( (119883119894minus119889MF119894minus119889)

(119883119894minus119889+1

MF119894minus119889+1

) (119894minus1MF119894minus1))

latter119894

= net ((119894+1MF119894+1)

(119894+2MF119894+2) (119883

119894+119889MF119894+119889) )

(3)

where former119894

is the imputation of the former half while

latter119894

is in the latter half Then all the imputations 119894can

be expressed as

119894=

former119894

119899 le 119894 le 119899 +119898

2minus 1

latter119894

119899 +119898

2le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 119896 isin 119873lowast

119894=

former119894

119899 le 119894 le 119899 +119898 minus 1

2minus 1

1

2(

former119894

+ latter119894

) 119894 = 119899 +119898 minus 1

2

latter119894

119899+119898 minus 1

2+1le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 minus 1 119896 isin 119873lowast(4)

4 Experimental Results and Analysis

The imputation tests of missing data in BFG flow are carriedout with the proposed Gaussian membership function-basedmethod called iGTD here First of all the superiority of theBSTM order to the chronological one is tested and verifiedA series of consecutive 800 data is picked from number 1blast furnace in Baosteel dating from 1434001382010 to354001482010 Considering that it is difficult to guaranteequantities of consecutive valid data in real industrialdatabases and the small set of samples is our concerns inthis study the embedding dimension is empirically chosen

6 Mathematical Problems in Engineering

Table 1 Imputation accuracies with two kinds of orders

Group Order RMSE NRMSE MAPE ()

1 Chronological order 16930 01486 4565BSTM 11054 00970 2597

2 Chronological order 9274 00669 1880BSTM 8240 00594 1456

3 Chronological order 10770 00842 2395BSTM 5732 00448 1141

4 Chronological order 12047 01022 2975BSTM 9564 00812 2144

0100200300400500600700

1 25 49 73 97 121

145

169

193

217

241

265

289

313

337

361

385

Figure 10 Overview of missing data

as 15 and the hidden neuron size is chosen as 10 in the samemanner We divide the sample data into 4 groups For eachgroup we randomly remove 3 consecutive points (A B andC) in 3 places The tests are respectively implemented inthe chronological order (A-B-C) and the BSTM order (A-C-B) Here we use three indexes as the evaluation criterion ofthe imputation accuracy which are root mean square error(RMSE) normalized root mean square error (NRMSE) andmean absolute percentage error (NRMSE) as follows

RMSE = radic 1119899

119899

sum

119894=1

(119894minus 119884119894)2

NRMSE = radic 1

1198991198842

119899

sum

119894=1

(119894minus 119884119894)2

MAPE = 100119899

119899

sum

119894=1

10038161003816100381610038161003816119894minus 119884119894

10038161003816100381610038161003816

119884119894

(5)

where 119899 is the total number of imputation is the imputationvalue and 119884 is the real value As for the separated 4 groupsof data the imputation accuracies for the different orderare shown in Table 1 It is apparent that the effectiveness ofBSTM is superior to that of the chronological order-basedimputation method

To further verify the effectiveness of the proposedGaussian-based membership function we comprehensivelytake the three categories of gas flow data mentioned inprevious section which include the periodicity-like flowdatashown like the BFG consumption amount by hot blast stovethe concussive flow data shown like the gas consumption by

coke oven and the ordinary time series flow shown like thegeneration amount by blast furnaceThe comparative experi-ments are carried out by using the EM method regressionspline and the original GTD EM algorithm is an iterativemethod for finding maximum likelihood or maximum aposteriori estimates of parameters in statistical models [37]which was widely used in dealing withmissing data since themaximum likelihood estimate of the unknown parameterscan be determined by the incomplete data set The regressionmethod employed multiple linear regressions to estimate themissing values

We still apply the real industrial data in Baosteel tocomplete the comparative experiment where the collecteddata are divided into several groups and some consecutive3 points 4 points and 5 points are removed from the timeseries In order to cover the all of the possible situations theremoved data involves the time series areas on peak troughand plain as Figure 10 shows in which the points in red areremoved

(1) BFG Consumption by Hot Blast Stove (PeriodicityLike) The experimental data are from number 2 hotblast stove in Shanghai Baosteel randomly selected from14280013082010 to 21550014082010 These data aredivided into 3 groups each of which is then divided into3 subgroups and each subgroup contains 200 points Theaccuracies of the imputation result are presented in Table 2

It can be found that the results by both the original GTDand the proposed iGTD are much more excellent in termsof the accuracy when the consecutive data missing occursFurthermore the effectiveness of the iGTD is generally betterthan that of GTD Then a conclusion can be drawn thatthe iGTD employed the Gaussian membership function canobtain the better data imputation results compared to thetriangle-based membership of GTD

(2) BFG Consumption by Coke Oven (Concussive) Theexperimental data are from number 1 coke oven in Shang-hai Baosteel randomly selected from 07140014082010 to14350015082010 The data-grouped measure is similarto that in the validation for periodicity-like data missingAnd the corresponding imputation accuracies are listed inTable 3 From the experiments results all the five methodsare almost same imputation accuracies and in particular EMshould be the best solution method of the five However it is

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

Mathematical Problems in Engineering 5

membership function in the improved GTD The form ofGaussian function is as follows

119891 (119909) = 119886119890minus(119909minus119887)

21198882

(1)

where119886 119887 and 119888 are real constants and119886 gt 0 In order tomakethe function adaptive to the sample construction in this studywe use (2) as the general form of Gaussian function insteadof (1) as follows

119891 (119909) = 119890minus(119909minus120583)

21205902

(2)

where 120583 is the mean value of sample and 120590 is the standarddeviation Here we make 119886 as 1 since the membership valueat the mean value should be 1 After its form confirmedthe Gaussian membership is capable to enlarge the sampleknowledge instead of the triangular one

33 Data Imputation TheBPalgorithm is a supervised learn-ing method in a network which is effected by altering theweights to minimize the difference between the output valueand the desired output value [36] The enlarged knowledgethen can be utilized by BP neural networks to finish theprediction

Missing data points need to be imputed one by one sothat the order of imputation should be another concern inthis study If the imputation is real time the chronologicalorder has to be taken because one cannot currently acquirethe future data points However the study in this paper is adata mining job which does not need real-time imputationsFurthermore if the imputation is not real time the BSTMorder is superior to the chronological one For instance letthere be five consecutive missing data as Figure 9 shows Ifthe imputation order is chronological the forecast error ofpoint number 1 will be amplified so as to affect the forecastaccuracy of point number 2 and the error of point number2 will be again propagated to that of point number 3 In suchway the errors will be cumulative

Besides data points in time series are always fluctuatingas Figure 9 shows If the missing happens on the hillside thechronological imputation result is very likely to be the sameas since it continues the peak However the result maybe like if we use the BSTM order That is points number1 and number 2 are on the peak while points number 4 andnumber 5 are on the plain which both continue the trends Asfor point number 3 we impute that it sings themean of pointsnumber 2 and number 4 Obviously deviates from thereal values more than which shows that the BSTM order issuperior to the chronological oneThis summary is consistentif analyzing the missing points on the peak or on the plain

Let (1198831 1198832 119883

119905minus1 119883119905) be a time series the index of the

firstmissing point denotes as119899 the number of the consecutive

Time

ObservedMissing

ChronologicalBSTM

12

3

4

5

Figure 9 Imputation by BSTM and chronological manner

missing points denotes as119898 and the variable 119889 represents theembedding dimension then we have

former119894

= net ( (119883119894minus119889MF119894minus119889)

(119883119894minus119889+1

MF119894minus119889+1

) (119894minus1MF119894minus1))

latter119894

= net ((119894+1MF119894+1)

(119894+2MF119894+2) (119883

119894+119889MF119894+119889) )

(3)

where former119894

is the imputation of the former half while

latter119894

is in the latter half Then all the imputations 119894can

be expressed as

119894=

former119894

119899 le 119894 le 119899 +119898

2minus 1

latter119894

119899 +119898

2le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 119896 isin 119873lowast

119894=

former119894

119899 le 119894 le 119899 +119898 minus 1

2minus 1

1

2(

former119894

+ latter119894

) 119894 = 119899 +119898 minus 1

2

latter119894

119899+119898 minus 1

2+1le 119894 le 119899 + 119898 minus 1

if 119898 = 2119896 minus 1 119896 isin 119873lowast(4)

4 Experimental Results and Analysis

The imputation tests of missing data in BFG flow are carriedout with the proposed Gaussian membership function-basedmethod called iGTD here First of all the superiority of theBSTM order to the chronological one is tested and verifiedA series of consecutive 800 data is picked from number 1blast furnace in Baosteel dating from 1434001382010 to354001482010 Considering that it is difficult to guaranteequantities of consecutive valid data in real industrialdatabases and the small set of samples is our concerns inthis study the embedding dimension is empirically chosen

6 Mathematical Problems in Engineering

Table 1 Imputation accuracies with two kinds of orders

Group Order RMSE NRMSE MAPE ()

1 Chronological order 16930 01486 4565BSTM 11054 00970 2597

2 Chronological order 9274 00669 1880BSTM 8240 00594 1456

3 Chronological order 10770 00842 2395BSTM 5732 00448 1141

4 Chronological order 12047 01022 2975BSTM 9564 00812 2144

0100200300400500600700

1 25 49 73 97 121

145

169

193

217

241

265

289

313

337

361

385

Figure 10 Overview of missing data

as 15 and the hidden neuron size is chosen as 10 in the samemanner We divide the sample data into 4 groups For eachgroup we randomly remove 3 consecutive points (A B andC) in 3 places The tests are respectively implemented inthe chronological order (A-B-C) and the BSTM order (A-C-B) Here we use three indexes as the evaluation criterion ofthe imputation accuracy which are root mean square error(RMSE) normalized root mean square error (NRMSE) andmean absolute percentage error (NRMSE) as follows

RMSE = radic 1119899

119899

sum

119894=1

(119894minus 119884119894)2

NRMSE = radic 1

1198991198842

119899

sum

119894=1

(119894minus 119884119894)2

MAPE = 100119899

119899

sum

119894=1

10038161003816100381610038161003816119894minus 119884119894

10038161003816100381610038161003816

119884119894

(5)

where 119899 is the total number of imputation is the imputationvalue and 119884 is the real value As for the separated 4 groupsof data the imputation accuracies for the different orderare shown in Table 1 It is apparent that the effectiveness ofBSTM is superior to that of the chronological order-basedimputation method

To further verify the effectiveness of the proposedGaussian-based membership function we comprehensivelytake the three categories of gas flow data mentioned inprevious section which include the periodicity-like flowdatashown like the BFG consumption amount by hot blast stovethe concussive flow data shown like the gas consumption by

coke oven and the ordinary time series flow shown like thegeneration amount by blast furnaceThe comparative experi-ments are carried out by using the EM method regressionspline and the original GTD EM algorithm is an iterativemethod for finding maximum likelihood or maximum aposteriori estimates of parameters in statistical models [37]which was widely used in dealing withmissing data since themaximum likelihood estimate of the unknown parameterscan be determined by the incomplete data set The regressionmethod employed multiple linear regressions to estimate themissing values

We still apply the real industrial data in Baosteel tocomplete the comparative experiment where the collecteddata are divided into several groups and some consecutive3 points 4 points and 5 points are removed from the timeseries In order to cover the all of the possible situations theremoved data involves the time series areas on peak troughand plain as Figure 10 shows in which the points in red areremoved

(1) BFG Consumption by Hot Blast Stove (PeriodicityLike) The experimental data are from number 2 hotblast stove in Shanghai Baosteel randomly selected from14280013082010 to 21550014082010 These data aredivided into 3 groups each of which is then divided into3 subgroups and each subgroup contains 200 points Theaccuracies of the imputation result are presented in Table 2

It can be found that the results by both the original GTDand the proposed iGTD are much more excellent in termsof the accuracy when the consecutive data missing occursFurthermore the effectiveness of the iGTD is generally betterthan that of GTD Then a conclusion can be drawn thatthe iGTD employed the Gaussian membership function canobtain the better data imputation results compared to thetriangle-based membership of GTD

(2) BFG Consumption by Coke Oven (Concussive) Theexperimental data are from number 1 coke oven in Shang-hai Baosteel randomly selected from 07140014082010 to14350015082010 The data-grouped measure is similarto that in the validation for periodicity-like data missingAnd the corresponding imputation accuracies are listed inTable 3 From the experiments results all the five methodsare almost same imputation accuracies and in particular EMshould be the best solution method of the five However it is

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

6 Mathematical Problems in Engineering

Table 1 Imputation accuracies with two kinds of orders

Group Order RMSE NRMSE MAPE ()

1 Chronological order 16930 01486 4565BSTM 11054 00970 2597

2 Chronological order 9274 00669 1880BSTM 8240 00594 1456

3 Chronological order 10770 00842 2395BSTM 5732 00448 1141

4 Chronological order 12047 01022 2975BSTM 9564 00812 2144

0100200300400500600700

1 25 49 73 97 121

145

169

193

217

241

265

289

313

337

361

385

Figure 10 Overview of missing data

as 15 and the hidden neuron size is chosen as 10 in the samemanner We divide the sample data into 4 groups For eachgroup we randomly remove 3 consecutive points (A B andC) in 3 places The tests are respectively implemented inthe chronological order (A-B-C) and the BSTM order (A-C-B) Here we use three indexes as the evaluation criterion ofthe imputation accuracy which are root mean square error(RMSE) normalized root mean square error (NRMSE) andmean absolute percentage error (NRMSE) as follows

RMSE = radic 1119899

119899

sum

119894=1

(119894minus 119884119894)2

NRMSE = radic 1

1198991198842

119899

sum

119894=1

(119894minus 119884119894)2

MAPE = 100119899

119899

sum

119894=1

10038161003816100381610038161003816119894minus 119884119894

10038161003816100381610038161003816

119884119894

(5)

where 119899 is the total number of imputation is the imputationvalue and 119884 is the real value As for the separated 4 groupsof data the imputation accuracies for the different orderare shown in Table 1 It is apparent that the effectiveness ofBSTM is superior to that of the chronological order-basedimputation method

To further verify the effectiveness of the proposedGaussian-based membership function we comprehensivelytake the three categories of gas flow data mentioned inprevious section which include the periodicity-like flowdatashown like the BFG consumption amount by hot blast stovethe concussive flow data shown like the gas consumption by

coke oven and the ordinary time series flow shown like thegeneration amount by blast furnaceThe comparative experi-ments are carried out by using the EM method regressionspline and the original GTD EM algorithm is an iterativemethod for finding maximum likelihood or maximum aposteriori estimates of parameters in statistical models [37]which was widely used in dealing withmissing data since themaximum likelihood estimate of the unknown parameterscan be determined by the incomplete data set The regressionmethod employed multiple linear regressions to estimate themissing values

We still apply the real industrial data in Baosteel tocomplete the comparative experiment where the collecteddata are divided into several groups and some consecutive3 points 4 points and 5 points are removed from the timeseries In order to cover the all of the possible situations theremoved data involves the time series areas on peak troughand plain as Figure 10 shows in which the points in red areremoved

(1) BFG Consumption by Hot Blast Stove (PeriodicityLike) The experimental data are from number 2 hotblast stove in Shanghai Baosteel randomly selected from14280013082010 to 21550014082010 These data aredivided into 3 groups each of which is then divided into3 subgroups and each subgroup contains 200 points Theaccuracies of the imputation result are presented in Table 2

It can be found that the results by both the original GTDand the proposed iGTD are much more excellent in termsof the accuracy when the consecutive data missing occursFurthermore the effectiveness of the iGTD is generally betterthan that of GTD Then a conclusion can be drawn thatthe iGTD employed the Gaussian membership function canobtain the better data imputation results compared to thetriangle-based membership of GTD

(2) BFG Consumption by Coke Oven (Concussive) Theexperimental data are from number 1 coke oven in Shang-hai Baosteel randomly selected from 07140014082010 to14350015082010 The data-grouped measure is similarto that in the validation for periodicity-like data missingAnd the corresponding imputation accuracies are listed inTable 3 From the experiments results all the five methodsare almost same imputation accuracies and in particular EMshould be the best solution method of the five However it is

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

Mathematical Problems in Engineering 7

Table 2 Imputation accuracies with different methods (hot blast stoversquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 5716 00713 2221 5596 00777 2856 3556 00489 1696EM 4356 00543 1788 4582 00636 2576 3446 00474 1758

3 points SPLINE 3672 00458 1380 6062 00842 3169 2210 00304 1068GTD 913 00114 288 1490 00207 585 1450 00199 342iGTD 719 00090 205 1442 00200 571 623 00086 287

Regression 4717 00578 2307 2910 00333 1152 3012 00356 1156EM 2615 00320 1265 2159 00247 1096 2644 00313 1392

4 points SPLINE 2518 00308 780 2106 00241 651 1499 00177 525GTD 1573 00193 474 1178 00135 382 1742 00206 585iGTD 1509 00185 425 980 00112 337 1663 00197 513

Regression 2884 00307 1190 5571 00612 3084 3992 00429 2031EM 2554 00272 1335 3999 00439 2438 4080 00439 2578

5 points SPLINE 4733 00505 2145 1521 00167 700 3342 00359 1918GTD 2012 00215 670 789 00087 393 667 00072 270iGTD 1503 00160 597 709 00078 355 515 00055 216

Table 3 Imputation accuracies with different methods (coke ovenrsquos BFG consumption)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 1673 00300 790 1033 00185 356 937 00167 324EM 995 00178 543 725 00130 405 372 00066 240

3 points SPLINE 1877 00336 773 1681 00301 779 2264 00402 949GTD 1583 00283 830 1582 00283 747 1567 00278 749iGTD 1394 00250 676 1376 00246 655 1374 00244 730

Regression 1666 00260 781 1110 00171 418 1082 00168 449EM 1032 00161 547 672 00104 380 671 00104 316

4 points SPLINE 1577 00246 650 3826 00591 933 1497 00231 637GTD 1593 00249 744 1975 00305 1055 2951 00459 1358iGTD 1381 00216 607 1936 00299 939 1302 00203 651

Regression 1324 00186 573 1324 00184 555 1598 00227 778EM 1034 00145 531 752 00104 420 1078 00153 527

5 points SPLINE 1690 00237 906 2157 00300 851 1385 00196 620GTD 1549 00217 833 2283 00317 1100 2023 00287 1148iGTD 1181 00166 545 1984 00276 902 1927 00273 945

mentionable that the effectiveness of the proposed iGTD stilldoes better than GTD in this test

(3) BFG Generation Amount (Normal Time Series) Theexperimental data are from number 1 blast furnace in Shang-hai Baosteel randomly selected from 02280027032010to 18330001042010 And the comparative accuracies arelisted in Table 4

From Table 3 we can discover that the regressionmethodpresents the worst performance while iGTD obtains the bestone For the data with normal property of time series iGTDis better than GTD while GTD wins all the other threemethods

A conclusion can be drawn from Tables 2ndash4 that theproposed iGTD and the GTD are superior to regression

EM and spline for the periodicity-like data and the normaltime-series-like data As for the data with concussive ampli-tude both iGTD and GTD do not have an advantage and yetiGTD still beats GTD which means the proposed Gaussianmembership function is superior to the triangular one in thereal industrial manufacturing process And for the visualimputation results of the BFG generation and consumptionthe comparative imputation curves are randomly chosen asFigures 11 12 and 13 showwhere the advantage of themethodproposed in this study can be easily presented

5 Conclusion

This study aims at the imputation of missing data of gas flowin steel industry In order to improve the imputation accuracy

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

8 Mathematical Problems in Engineering

Table 4 Imputation accuracies with different methods (BFG output)

Missing number Methods Group 1 Group 2 Group 3RMSE NRMSE MAPE RMSE NRMSE MAPE RMSE NRMSE MAPE

Regression 10817 00598 2194 9704 00476 1711 10920 00537 1660EM 8743 00483 1804 8343 00409 1538 7165 00352 1072

3 points SPLINE 9299 00514 1755 8674 00437 1541 12967 00638 1759GTD 7521 00416 1606 7073 00347 1250 5669 00279 918iGTD 6970 00385 1474 6864 00336 1226 5548 00273 932

Regression 9825 00457 1799 9309 00394 1545 9538 00408 1549EM 7794 00363 1470 7553 00319 1309 7374 00315 1157

4 points SPLINE 12822 00597 2051 8583 00363 1429 6695 00286 1086GTD 7570 00352 1421 6336 00268 1047 6231 00266 944iGTD 7091 00330 1315 6325 00267 1032 4151 00178 674

Regression 8876 00362 1709 7913 00296 1122 7624 00289 1147EM 7261 00296 1298 7174 00268 1198 7139 00270 1097

5 points SPLINE 9783 00399 1672 10518 00393 1678 10259 00389 1560GTD 6744 00275 1190 6100 00228 963 6723 00255 1047iGTD 6719 00274 1195 5775 00216 929 6295 00238 1014

75 80 85 90 95 100 105300

350

400

450

500

550

600

650

700

Time

Complete sequenceRegressionEM

SPLINEGTDiGTD

Am

plitu

de (m3)

Figure 11 Comparison of different methods (3 points)

Complete sequenceRegressionEM

SPLINEGTDiGTD

55 60 65 70 75 80300

350

400

450

500

550

600

650

700

Time

Am

plitu

de (m3)

Figure 12 Comparison of different methods (4 points)

275 280 285 290 295 300 305300350400450500550600650700

Time

Am

plitu

de (m3)

Complete sequenceRegressionEM

SPLINEGTDiGTD

Figure 13 Comparison of different methods (5 points)

the proposed iGTD replaces the triangular membershipfunction with the Gaussian one Furthermore the order ofimputation is further discussedThe verification experimentsshow that the BSTM order brings less error than the chrono-logical one does since more observed data are utilized Asfor the different data imputation method compared to theoriginal GTD EM regression and spline the proposed iGTDhas some advantages in the problem with data properties ofconsecutively missing and small samples And the satisfyingimputation accuracy provides the powerful support for thegas resources scheduling later On the other hand althoughthe approach developed in this study can handle some typesof missing in real industry some theoretical analyses and theexpanded application for example the type of concussiveflow data need to be given a further consideration in thefuture

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

Mathematical Problems in Engineering 9

Acknowledgments

This work is supported by the National Natural SciencesFoundation of China (no 61034003 no 61104157 and no61273037) and the Fundamental Research Funds for theCentral Universities of China (no DUT11RC(3)07) Thecooperation of energy center of Shanghai Baosteel Co LtdChina in this work is greatly appreciated

References

[1] J Han and M Kamber Data Mining Concepts and TechniquesElsevier 2006

[2] P C Austin and M D Escobar ldquoBayesian modeling of missingdata in clinical researchrdquo Computational Statistics and DataAnalysis vol 49 no 3 pp 821ndash836 2005

[3] S Y Yi S Jung and J Suh ldquoBettermobile clientrsquos cache reusabil-ity and data access time in a wireless broadcast environmentrdquoData and Knowledge Engineering vol 63 no 2 pp 293ndash3142007

[4] K Lakshminarayan S A Harp and T Samad ldquoImputation ofmissing data in industrial databasesrdquo Applied Intelligence vol11 no 3 pp 259ndash275 1999

[5] R Lenz andM Reichert ldquoIT support for healthcare processesmdashpremises challenges perspectivesrdquo Data and Knowledge Engi-neering vol 61 no 1 pp 39ndash58 2007

[6] A Mercatanti ldquoAnalyzing a randomized experiment withimperfect compliance and ignorable conditions for missingdata theoretical and computational issuesrdquo ComputationalStatistics and Data Analysis vol 46 no 3 pp 493ndash509 2004

[7] T Ozacar O Ozturk and M O Unalir ldquoANEMONE Anenvironment for modular ontology developmentrdquo Data andKnowledge Engineering vol 70 no 6 pp 504ndash526 2011

[8] A C Favre A Matei and Y Tille ldquoA variant of the Coxalgorithm for the imputation of non-response of qualitativedatardquo Computational Statistics and Data Analysis vol 45 no4 pp 709ndash719 2004

[9] A F deWinter A J Oldehinkel R Veenstra J A BrunnekreefF C Verhulst and J Ormel ldquoEvaluation of non-response biasin mental health determinants and outcomes in a large sampleof pre-adolescentsrdquo European Journal of Epidemiology vol 20no 2 pp 173ndash181 2005

[10] P A Patrician ldquoMultiple imputation for missing datardquo Researchin Nursing and Health vol 25 no 1 pp 76ndash84 2002

[11] J Honaker and G King ldquoWhat to do about missing valuesin time-series cross-section datardquo American Journal of PoliticalScience vol 54 no 2 pp 561ndash581 2010

[12] N Sartori A Salvan and K Thomaseth ldquoMultiple imputationof missing values in a cancer mortality analysis with estimatedexposure doserdquo Computational Statistics and Data Analysis vol49 no 3 pp 937ndash953 2005

[13] J M Lepkowski W D Mosher K E Davis R M Grovesand J Van Hoewyk ldquoThe 2006ndash2010 National Survey of FamilyGrowth sample design and analysis of a continuous surveyrdquoVital and Health Statistics no 150 pp 1ndash36 2010

[14] I Myrtveit E Stensrud and U H Olsson ldquoAnalyzing datasets with missing data an empirical evaluation of imputationmethods and likelihood-based methodsrdquo IEEE Transactions onSoftware Engineering vol 27 no 11 pp 999ndash1013 2001

[15] A L Bello ldquoImputation techniques in regression analysis look-ing closely at their implementationrdquo Computational Statisticsand Data Analysis vol 20 no 1 pp 45ndash57 1995

[16] D A Newman ldquoLongitudinal modeling with randomly andsystematically missing data a simulation of Ad Hoc maximumlikelihood and multiple imputation techniquesrdquo Organiza-tional Research Methods vol 6 no 3 pp 328ndash362 2003

[17] R H Jones ldquoMaximum likelihood fitting of ARMA models totime series with missing observationsrdquo Technometrics vol 22no 3 pp 389ndash395 1980

[18] G Jagannathan and R N Wright ldquoPrivacy-preserving imputa-tion of missing datardquo Data and Knowledge Engineering vol 65no 1 pp 40ndash56 2008

[19] J Van Hulse and T Khoshgoftaar ldquoKnowledge discovery fromimbalanced and noisy datardquo Data and Knowledge Engineeringvol 68 no 12 pp 1513ndash1542 2009

[20] A Bagheri and M Zandieh ldquoBi-criteria flexible job-shopscheduling with sequence-dependent setup timesmdashvariableneighborhood search approachrdquo Journal of Manufacturing Sys-tems vol 30 no 1 pp 8ndash15 2011

[21] M Zhang J Zhu D Djurdjanovic and J Ni ldquoA comparativestudy on the classification of engineering surfaces with dimen-sion reduction and coefficient shrinkage methodsrdquo Journal ofManufacturing Systems vol 25 no 3 pp 209ndash220 2006

[22] M S Pishvaee F Jolai and J Razmi ldquoA stochastic optimizationmodel for integrated forwardreverse logistics network designrdquoJournal of Manufacturing Systems vol 28 no 4 pp 107ndash1142009

[23] P A Ferrari P Annoni A Barbiero and G Manzi ldquoAnimputation method for categorical variables with applicationto nonlinear principal component analysisrdquo ComputationalStatistics and Data Analysis vol 55 no 7 pp 2410ndash2420 2011

[24] T H Lin ldquoA comparison of multiple imputation with EMalgorithm and MCMC method for quality of life missing datardquoQuality and Quantity vol 44 no 2 pp 277ndash287 2010

[25] S Van Buuren H C Boshuizen and D L Knook ldquoMultipleimputation of missing blood pressure covariates in survivalanalysisrdquo Statistics in Medicine vol 18 pp 681ndash694 1999

[26] K Hron M Templ and P Filzmoser ldquoImputation of missingvalues for compositional data using classical and robust meth-odsrdquo Computational Statistics and Data Analysis vol 54 no 12pp 3095ndash3107 2010

[27] F Dudbridge ldquoLikelihood-based association analysis fornuclear families and unrelated subjects with missing genotypedatardquo Human Heredity vol 66 no 2 pp 87ndash98 2008

[28] Z Lu and Y V Hui ldquo1198711linear interpolator for missing values

in time seriesrdquo Annals of the Institute of Statistical Mathematicsvol 55 no 1 pp 197ndash216 2003

[29] DDecoste and B Scholkopf ldquoTraining invariant support vectormachinesrdquoMachine Learning vol 46 no 1ndash3 pp 161ndash190 2002

[30] D Li L Chen and Y Lin ldquoUsing Functional Virtual Populationas assistance to learn scheduling knowledge in dynamic man-ufacturing environmentsrdquo International Journal of ProductionResearch vol 41 no 17 pp 4011ndash4024 2003

[31] Y S Lin and D C Li ldquoThe Generalized-Trend-Diffusionmodeling algorithm for small data sets in the early stagesof manufacturing systemsrdquo European Journal of OperationalResearch vol 207 no 1 pp 121ndash130 2010

[32] K GotoHOkabe F A Chowdhury S Shimizu Y Fujioka andM Onoda ldquoDevelopment of novel absorbents for CO

2capture

from blast furnace gasrdquo International Journal of Greenhouse GasControl vol 5 no 5 pp 1214ndash1219 2011

[33] H Jaeger ldquoA tutorial on training recurrent neural net-works covering BPTT RTRL EKF and ldquoEcho State Networkrdquo

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

10 Mathematical Problems in Engineering

approachrdquoGMDReport 159 GermanNational ResearchCenterfor Information Technology Berlin German 2002

[34] H Jaeger ldquoAdaptive nonlinear system identification with echostate networksrdquo Advances in Neural Information ProcessingSystems vol 15 pp 593ndash600 2003

[35] C FHuang ldquoPrinciple of information diffusionrdquo Fuzzy Sets andSystems vol 91 no 1 pp 69ndash90 1997

[36] E P Zhou and D K Harrison ldquoImproving error compensationvia a fuzzy-neural hybrid modelrdquo Journal of ManufacturingSystems vol 18 no 5 pp 335ndash344 1999

[37] G Ward T Hastie S Barry J Elith and J R LeathwickldquoPresence-only data and the EM algorithmrdquo Biometrics vol 65no 2 pp 554ndash563 2009

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article An Improved Generalized-Trend-Diffusion ...downloads.hindawi.com/journals/mpe/2013/136241.pdf · Research Article An Improved Generalized-Trend-Diffusion-Based Data

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of