
2012 11th International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, December 12-15, 2012



Automated Storage Tiering Using Markov Chain Correlation Based Clustering

Malak Alshawabkeh∗, Alma Riska†, Adnan Sahin†, and Motasem Awwad†
∗Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115

Email: [email protected]
†Research and Innovation Systems, EMC, Hopkinton, MA 01748

Email: {alma.riska, adnan.sahin, motasem.awwad}@emc.com

Abstract—In this paper, we develop an automated and adaptive framework that aims to move active data to high-performance storage tiers and inactive data to low-cost/high-capacity storage tiers by learning patterns of the storage workloads. The proposed framework is designed using an efficient Markov chain correlation based clustering method (MCC), which can quickly predict or detect any changes in the current workload based on what the system has experienced before. The workload data is first normalized, and Markov chains are constructed from the dynamics of the IO loads of the data storage units. Based on the correlation of the one-step Markov chain transition probabilities, the k-means method is employed to group the storage units that have similar behavior at each point. Such a framework can then easily be incorporated in various resource management policies that aim at enhancing performance, reliability, and availability. The predictive nature of the model, in particular, makes a storage system both faster and lower-cost at the same time, because it uses high-performance tiers only when needed, and low-cost/high-capacity tiers when possible.

Keywords-IO workloads; storage tiering; Markov chain; clustering

I. INTRODUCTION

Storage systems today have grown complex with regard to size and the host of features they provide to their users. The trend of providing higher performance, reliability, availability, integrity, and capacity without increasing cost has become even more prevalent as reliance on digitally stored data has taken center stage in all walks of life [5]. Independent of the applications that will utilize the storage system, i.e., enterprise or consumer, centralized or distributed, storage system design and operation is hierarchical, which means that the best performance is achieved when the data is accessed through the component with the highest performance.

For example, many storage systems are designed to automatically move data between high-cost and low-cost storage media. This data storage technique, known as Hierarchical Storage Management (HSM) [4], exists because high-speed storage devices, such as hard disk drive arrays, are more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high-speed devices all the time, this is prohibitively expensive for many organizations. Instead, HSM systems store the bulk of the enterprise's data on slower devices, and then copy data to faster disk drives when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage devices. The HSM system monitors the way data is used and makes best guesses as to which data can safely be moved to slower devices and which data should stay on the fast devices. In a typical HSM scenario, data files which are frequently used, i.e., hot data, are stored on disk drives, but are eventually migrated to tape if they are not used for a certain period of time, typically a few months. If a user does reuse a file which is on tape, it is automatically moved back to disk storage. The advantage is that the total amount of stored data can be much larger than the capacity of the disk storage available, but since only rarely-used files are on tape, most users will usually not notice any slowdown. HSM is also referred to as tiered storage.

In tiered storage, the goal is to have data at the highest-performing component, or tier, at the time of access. This is not a new problem. Caches, as an integral part of any computer system, aim to do just that. However, the scale and granularity of data and time are now different. Specifically, caches operate with small amounts of data, i.e., at most a few GBytes, and retain the data for at most seconds at a time. The systems mentioned above would benefit from having at the right component a large amount of the right data, i.e., hundreds of GBytes to a few TBytes, at the right approximate time, i.e., in the next hour.

In this paper, we propose an automated and adaptive modeling framework to monitor workloads (IO loads) of data storage units/devices in order to predict the future IO accesses (i.e., activity/behavior) of a data device, and thus to automate the identification of "busy" devices for the purposes of relocating application data across different performance/capacity tiers within an array.

Although all storage systems monitor their workload and operation, it is not straightforward to feed the monitored metrics into a decision-making algorithm, particularly since the logs can practically be stored only for a limited time. Our modeling framework aims at extracting from a monitored metric the spatial and temporal characteristics needed for a compact representation of the workload monitored in the system over a long period of time.

For this, we employ a clustering tool, as it plays an important role in identifying subsets of data that have similar behavior. The requirements of any clustering method are: 1)

2012 11th International Conference on Machine Learning and Applications
978-0-7695-4913-2/12 $26.00 © 2012 IEEE
DOI 10.1109/ICMLA.2012.71



to perform effectively without any prior knowledge of the data, 2) to perform effectively in the presence of noise in the dataset, and 3) to give information about the relationship between clusters [1]. To meet the above requirements, we apply an efficient clustering method based on Markov Chain Correlation (MCC) [2]. It converts each given device (at any given granularity from KBytes to GBytes) into a one-step Markov chain (MC) probability matrix, and then the k-means clustering method is applied to the MCs to, first, group together all devices that behave the same across time and, second, identify the time periods in which these groups exhibit the same characteristics or return to a previously observed characteristic.

Characterization of storage system workloads has shown that, among the large number of devices in the system (depending on the granularity, from a few thousand to millions of them), one can identify a few unique groups that retain the same characteristics over time. Similarly, the characterization has shown that the IO workload revisits previous behavior, which means that once a certain behavior is observed and learned, it can be stored for future reference.

The model does not limit itself to the metric it uses. The metric can be the data access frequency, the most commonly monitored metric. However, it can also be a less obvious metric, such as the period between two different accesses to the same data, or the type of access, like a read or a write. In this paper we develop our model assuming that the metric is access frequency measured in IOs per unit of time, to simplify the presentation.

The rest of the paper is organized as follows. In Section II, we describe the characteristics of the learning framework proposed in this paper. Section III develops the modeling framework. We show the effectiveness of the proposed framework in Section IV, and we conclude the paper in Section V.

II. FRAMEWORK CHARACTERISTICS

This section presents a short summary of a learning framework that aims at extracting information from statistics measured in a storage system [3]. The framework works with the IO counts per device [6], but can easily be generalized to account for other metrics or a combination of them. The goal of the framework is to extract from the voluminous measurements the necessary information that can guide data placement in a multi-tiered storage system. The framework is compact in the amount of information that it stores. It can be used proactively, to predict the workload that will be seen in the system, or reactively, to detect a workload change based on previous observations of the same activity. The framework has two parts:

• States capture the different levels of IO activity for a given device. A state represents a period of time during which a device experiences the same IO intensities. The states represent a temporal reduction of the monitored statistics. The main characteristics of the state model are:

– The average IO rate observed for the state and the number of units of time the device remains in the state.

– The matrix of empirical probabilities. Each row represents a state's respective probabilities to move to any of the other states.

– The ID of the state the device is currently in, as well as the amount of time that the device has spent in that state (since it entered the state).

• Clusters represent groups of devices that behave similarly. Devices in a cluster change their IO intensities all together, at the same pace. A cluster's goal is to capture any correlation that may exist among the devices in the system, such as the correlation between devices in the same storage group, or devices accessed by a single application. A device can participate in only one cluster. A cluster represents a spatial reduction of the monitored statistics. The expectation, based on trace analysis, is that the activity in hundreds of thousands of devices can be captured, without loss of information, via tens or a few hundred clusters.

The framework can be used to create a view of the entire system at any given time as the collection of the states that the clusters of the system are currently in.
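As a concrete illustration, the two-part framework above can be sketched as a per-device record. The field names below are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DeviceStateModel:
    """Per-device state model sketched from the description above.
    All field names are illustrative assumptions."""
    avg_io_rate: dict        # state id -> average IO rate observed in that state
    holding_time: dict       # state id -> units of time the device remains there
    transitions: np.ndarray  # empirical probabilities; row i = P(move from i to j)
    current_state: int       # ID of the state the device is currently in
    time_in_state: int       # time spent in current_state since entering it

# A hypothetical device that is mostly "Low" (state 0) with bursts of "High" (state 2):
dev = DeviceStateModel(
    avg_io_rate={0: 0.2, 1: 4.0, 2: 55.0},
    holding_time={0: 12, 1: 2, 2: 3},
    transitions=np.array([[0.0, 0.8, 0.2],
                          [0.5, 0.0, 0.5],
                          [0.3, 0.7, 0.0]]),
    current_state=0,
    time_in_state=5,
)
```

A system view at any instant is then simply the collection of the `current_state` fields of the cluster representatives.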

III. MODELING FRAMEWORK

A. Storage Data

Given a historic workload trace L, which can be represented as a D × T matrix X:

X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,T} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,T} \\
\vdots & \vdots & \ddots & \vdots \\
x_{D,1} & x_{D,2} & \cdots & x_{D,T}
\end{bmatrix} \quad (1)

where D is the total number of data storage units in the system, and T is the total number of intervals. x_{d,t} is the IO intensity of any chosen data unit/device d at time t. The devices behave in a certain way during each of the T intervals, and those with similar behavior are of importance. A device can also be represented as a row vector V = [x_{i1}, x_{i2}, ..., x_{iT}] of size T, where each point denotes the reading of the device subject to test conditions 1 to T. When D such devices are represented as row vectors, Eq. 1 can be rewritten as Eq. 2.

X = \begin{bmatrix}
V_1 \\
V_2 \\
\vdots \\
V_D
\end{bmatrix} \quad (2)
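For instance, the trace matrix of Eq. 1 and the row-vector view of Eq. 2 map directly onto a 2-D array. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical trace matrix X (Eq. 1): D = 4 devices, T = 6 intervals.
# X[d, t] is the IO intensity of device d during interval t.
X = np.array([
    [50.0, 48.0, 52.0, 49.0, 51.0, 50.0],  # consistently busy device
    [ 0.1,  0.2,  0.1,  0.0,  0.1,  0.2],  # consistently idle device
    [ 1.0,  5.0, 40.0, 45.0,  2.0,  1.0],  # bursty device
    [ 0.5,  0.4, 30.0, 35.0,  0.6,  0.5],  # bursty device, correlated with above
])
D, T = X.shape

# Row d of X is the device vector V_d of Eq. 2.
V = [X[d] for d in range(D)]
assert len(V) == D and V[0].shape == (T,)
```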



We model the IO intensity patterns of a single device as a Markov chain. The Markov chain model implies that the inter-arrival times between successive IO loads of the same device are exponentially distributed.

B. Markov Chain Estimation

A Markov chain (MC) is a stochastic process that proceeds through different states at certain time epochs. Its basic property is that the probability of entering a state depends only on the current state, not on the previous history (this is a first-order Markov chain; higher-order Markov chains are not relevant to this work). This property has the mathematical implication that the time for which the process resides in a given state must be an exponentially distributed random variable; different states may have different mean residence times, however. Thus, an MC with states denoted 1, 2, ..., K is uniquely described by a matrix P = (p_{ij}) of transition probabilities between states, and the mean residence times (or "state holding times") H_i of the states. Equivalently, one can specify the transition rates v_{ij} between states i and j, where v_{ij} = p_{ij}/H_i; the term 1/H_i is also known as the state departure rate and denoted v_i.
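The relation v_{ij} = p_{ij}/H_i can be checked numerically. The transition matrix and holding times below are invented for illustration.

```python
import numpy as np

# Illustrative 3-state MC: transition probabilities P (zero diagonal,
# rows sum to 1) and mean residence times H_i, in arbitrary time units.
P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.2, 0.8, 0.0],
])
H = np.array([5.0, 2.0, 1.0])

# Transition rates v_ij = p_ij / H_i, and state departure rates v_i = 1/H_i.
v = P / H[:, None]
v_i = 1.0 / H

# Since each row of P sums to 1, the rates out of state i sum to v_i.
assert np.allclose(v.sum(axis=1), v_i)
```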

C. States Definitions

In our application setting, we assume three levels of IO intensities based on a threshold mechanism:

1) High, marked with "H"
2) Medium, marked with "M"
3) Low, marked with "L"

We let the states of the MC correspond to the three IO intensity levels described above. For each device d, p_{ij} denotes the probability that the device changes behavior from state i to state j. The state residence time H_i corresponds to the time that the device resides in state i. We are interested in predicting the future IO intensity accesses of a device. In this prediction, we can exploit the knowledge of a device's current state. Thus, the first relevant measures that we are interested in are the probabilities p_{ij}(t) that a device will be in state j (i.e., will change its IO intensity behavior, and thus can be promoted or demoted) at time t from now, given that it currently resides in state i.

D. Threshold Mechanism

States capture the different levels of IO activity for a given device. A state represents a period of time during which the device experiences the same IO intensities. The goal is to identify three levels of IO intensities, "H", "M" and "L", based on the IO intensity patterns of each device. To determine these levels of IO intensities, a threshold mechanism based on a histogram is adopted [1]. Recall that a histogram is a probability distribution p(u) = D_u/D, that is, the number of devices D_u having IO activity u as a fraction of the total number of devices D. What makes thresholding difficult is that these IO intensity ranges usually overlap. What we want to do is to minimize the error of classifying an IO activity as low, medium or high.

To do this, we try to minimize the area under the histogram for one region that lies on the other region's side of the threshold. The problem is that we do not have the histograms for each region, only the histogram for the combined regions. We consider the values in the three regions as three clusters. The idea is to pick a threshold such that each IO activity on each side of the threshold is closer in intensity to the mean of all IO intensities on that side of the threshold than to the mean of all IO intensities on the other side of the threshold.

In other words, let μ_L be the mean of all IO intensities less than the threshold θ_1, μ_M be the mean of all IO intensities greater than the threshold θ_1 and less than the threshold θ_2, and μ_H be the mean of all IO intensities greater than θ_2. We want to find θ_1 such that the following holds

\forall u \ge \theta_1 : |u - \mu_L| > |u - \mu_M| \quad (3)

and

\forall u < \theta_1 : |u - \mu_L| < |u - \mu_M| \quad (4)

and for θ_2

\forall u \ge \theta_2 : |u - \mu_M| > |u - \mu_H| \quad (5)

and

\forall u < \theta_2 : |u - \mu_M| < |u - \mu_H| \quad (6)
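A minimal sketch of this threshold search, under the assumption that an iterative 1-D three-means style refinement is acceptable: placing each threshold at the midpoint between the adjacent region means satisfies Eqs. 3-6 by construction.

```python
import numpy as np

def three_level_thresholds(intensities, iters=50):
    """Find theta_1 < theta_2 so that each IO intensity is closer to the
    mean of its own region (L, M, H) than to the neighboring region's
    mean, per Eqs. 3-6. A sketch, not the paper's exact procedure."""
    u = np.sort(np.asarray(intensities, dtype=float))
    # initialize thresholds at the 1/3 and 2/3 quantiles
    t1, t2 = np.quantile(u, [1 / 3, 2 / 3])
    for _ in range(iters):
        lo, mid, hi = u[u < t1], u[(u >= t1) & (u < t2)], u[u >= t2]
        if len(lo) == 0 or len(mid) == 0 or len(hi) == 0:
            break
        mu_l, mu_m, mu_h = lo.mean(), mid.mean(), hi.mean()
        # midpoints between adjacent region means satisfy Eqs. 3-6
        nt1, nt2 = (mu_l + mu_m) / 2, (mu_m + mu_h) / 2
        if nt1 == t1 and nt2 == t2:
            break
        t1, t2 = nt1, nt2
    return t1, t2
```

For example, on intensities that fall into three well-separated groups around 0.1, 5 and 50, the returned thresholds land between the groups.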

E. Transition Probability Matrix

To compute the one-step transition probability matrices, we consider a sequence of device state points as a stochastic process X_t whose conditional probability of a future event, given the past events and the present state X_t = i, is independent of the past events and depends only upon the present state. It is assumed that the device data has the Markovian property, i.e.,

P\{X_{t+1} = j \mid X_0 = k_0, X_1 = k_1, \cdots, X_{t-1} = k_{t-1}, X_t = i\} = P\{X_{t+1} = j \mid X_t = i\} \quad (7)

for t = 0, 1, ..., T and every sequence i, j, k_0, k_1, ..., k_{t-1}. p_{ij}^{(n)} is the conditional probability, also called the n-step transition probability, that the random variable X, starting in state i, will be in state j after n steps. The transition probability has the following important properties:

(1) p_{ij}^{(n)} \ge 0, \forall i, j; \; n = 0, 1, 2, \cdots

(2) \sum_{j=0}^{s} p_{ij}^{(n)} = 1, \forall i; \; n = 0, 1, 2, \cdots

which implies that a process in a particular state must be in one of the states at the next time point, and that the sum of all probabilities over the states the process can be in at the next time step must equal 1. Based on this assumption, the one-step transition probability that the device will be in state j at data point t+1 depends only on the state at the previous data point t and is independent of all other data points. In our approach we restrict the transition probability matrix to exclude self-transitions, i.e., the probability of staying in the same state. We define a transition probability matrix P, which can be obtained from the transition frequency matrix R, where

P = \begin{bmatrix}
- & p_{1,2} & \cdots & p_{1,s} \\
p_{2,1} & - & \cdots & p_{2,s} \\
\vdots & \vdots & \ddots & \vdots \\
p_{s,1} & p_{s,2} & \cdots & -
\end{bmatrix} \quad (8)

and

R = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,s} \\
r_{2,1} & r_{2,2} & \cdots & r_{2,s} \\
\vdots & \vdots & \ddots & \vdots \\
r_{s,1} & r_{s,2} & \cdots & r_{s,s}
\end{bmatrix} \quad (9)

Each transition frequency r_{ij} records the number of transitions from state i to state j, where i, j = 1, 2, ..., s.

The entry p_{ij} in matrix P is computed from the transition frequency r_{ij}, i.e.,

p_{ij} = \frac{r_{ij}}{r_i} \quad (10)

where r_i is the total number of transitions out of state i.
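The construction of R (Eq. 9) and the row-normalized P with excluded self-transitions (Eqs. 8 and 10) can be sketched as follows. The state sequence is a made-up example.

```python
import numpy as np

# States: 0 = Low, 1 = Medium, 2 = High. A hypothetical per-interval
# state sequence for one device (derived from its IO intensities).
seq = [0, 0, 1, 2, 2, 1, 0, 1, 2, 1, 0]
s = 3

# Transition frequency matrix R (Eq. 9): R[i, j] counts i -> j moves.
R = np.zeros((s, s))
for a, b in zip(seq, seq[1:]):
    R[a, b] += 1

# One-step transition probabilities (Eq. 10), excluding self-transitions
# as in Eq. 8: zero the diagonal before normalizing each row by r_i.
np.fill_diagonal(R, 0)
row_sums = R.sum(axis=1, keepdims=True)
P = np.divide(R, row_sums, out=np.zeros_like(R), where=row_sums > 0)
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution
```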

We also compute the marginal distribution p(i) of each state. We then define a new transition probability matrix P′ that combines both the marginals and the entries of the P matrix.

P' = \begin{bmatrix}
p_1 & p_{1,2} & \cdots & p_{1,s} \\
p_{2,1} & p_2 & \cdots & p_{2,s} \\
\vdots & \vdots & \ddots & \vdots \\
p_{s,1} & p_{s,2} & \cdots & p_s
\end{bmatrix} \quad (11)

These transition probability matrices are used for clustering devices based on their correlation, as described in the next section. The reason for estimating the Markov chains and then finding their correlation, instead of doing this directly on the device data, is that the MC contains the dynamics information of each represented device. Each row of the MC represents the probabilities of moving from one state to another, and this information is used for extracting the representative profile of each device group based on its dynamics.

F. Markov Chain Clustering (MCC) Model

The MCC clustering method is described using matrix notation throughout the paper. The three matrices introduced so far are the trace (device) matrix X_{D×T}, the transition frequency matrix R_{s×s}, and the transition probability matrix P′_{s×s} for each device. A clustering method is adopted in the MC merging process to extract all the closely correlated devices. The MCC method is described in Algorithm 1.

Algorithm 1 Markov Chain Clustering (MCC)

• Obtain the transition probability matrix for each device d ∈ D and let it be P_j, where j = 1, 2, ..., D.

• Compute the residual time π_s for each state s.

• Convert the s columns of the matrix P_j to one column by appending each column one after the other, and append the residual times of all states to the end of this column. The resulting vector is named PN_j and has size k × 1, where k = (s × s) + s.

• Cluster all PN vectors using the k-means clustering algorithm, where the similarity function is the Pearson correlation coefficient.

• Run the prediction model on testing data, which works as follows:

– Assign each device to its own cluster as defined in the training phase.

– Define the initial state for each cluster based on the average of the initial IO intensities (first interval) of its members (devices).

– For each cluster (and so for its members) predict the next state to jump to based on the highest probability defined in the PN vector.
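The PN-vector construction and the correlation-based k-means step of Algorithm 1 can be sketched as below. The initialization (first k vectors) and tie handling are naive assumptions, not the paper's exact procedure.

```python
import numpy as np

def pn_vector(P_prime, residence):
    """Flatten the s x s matrix P' column by column and append the per-state
    residence times: the PN vector of size k = s*s + s from Algorithm 1."""
    return np.concatenate([np.asarray(P_prime).flatten(order="F"),
                           np.asarray(residence)])

def kmeans_correlation(vectors, k, iters=20):
    """k-means over PN vectors with the Pearson correlation coefficient as
    the similarity function. A sketch; empty clusters keep their old center."""
    X = np.asarray(vectors, dtype=float)
    centers = X[:k].copy()  # naive init: first k vectors
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # similarity(v, c) = Pearson correlation; assign to most similar center
        sims = np.array([[np.corrcoef(v, c)[0, 1] for c in centers] for v in X])
        labels = sims.argmax(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Devices whose PN vectors are positively correlated end up in the same cluster; anti-correlated devices are separated.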

IV. RESULTS AND ANALYSIS

In this section, two different traces are used for evaluation [3]. The first trace contains IO activities for 73272 devices gathered over 43 hours, while the second trace contains IO activities for 110427 devices gathered over a longer period of around 304 hours. We first show an example that demonstrates the characteristics of the storage workload used to support the learning model proposed in this paper.

A. Markov Chain Clustering Example

Figure 1 shows an example of applying the MCC method to twelve sample devices. The x-axis in the figure shows the T intervals, and the y-axis shows the IO intensities. The plots in Figure 2 show examples of the MCC clusters generated from the devices in Figure 1. The thresholds of the defined states are plotted as horizontal lines and marked as Theta_1 and Theta_2.

B. Performance Results

The effectiveness of the model was evaluated using the average prediction error ε̄ for each cluster C at each time t:

\bar{\varepsilon}_{C_t} = \frac{1}{C_n} \sum_{i=0}^{n} \varepsilon_{D \in C} \quad (12)

where C_n is the cluster size, and ε_{D∈C} is the prediction error for device D that belongs to cluster C.
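Eq. 12 is a per-cluster mean of the member-device errors; a small check with made-up error values:

```python
import numpy as np

def cluster_avg_error(device_errors):
    """Average prediction error for one cluster at time t (Eq. 12):
    sum of per-device errors epsilon_D over the C_n members, divided by C_n."""
    errs = np.asarray(device_errors, dtype=float)
    return errs.sum() / errs.size

# Hypothetical cluster of 4 devices whose predicted and observed states
# disagreed on 0%, 25%, 0% and 15% of the intervals:
assert np.isclose(cluster_avg_error([0.0, 0.25, 0.0, 0.15]), 0.1)
```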



Figure 1: IO intensities over time for 12 devices.

Theta_1 = 0.07879, Theta_2 = 3.23100

(a) cluster 1   (b) cluster 2   (c) cluster 3

Figure 2: Three clusters formed using the MCC method for the devices shown in Figure 1.

To speed up the clustering approach, we first look for devices that remain in the same state for more than 95% of the time (i.e., the one-state-cluster case). In this case we first create 3 clusters, each representing one of the states Low, Medium and High. We refer to them as LClust, MClust and HClust, respectively.
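The one-state-cluster pre-filter described above can be sketched as follows; the state encoding and the strict "more than 95%" cutoff are assumptions based on the text.

```python
import numpy as np

def one_state_devices(state_seqs, threshold=0.95):
    """Flag devices that sit in a single state (0=L, 1=M, 2=H) for more
    than `threshold` of the intervals, and report which state, so they can
    go straight to LClust/MClust/HClust without full MCC clustering."""
    result = {}
    for d, seq in enumerate(state_seqs):
        vals, counts = np.unique(seq, return_counts=True)
        frac = counts.max() / len(seq)
        if frac > threshold:
            result[d] = int(vals[counts.argmax()])
    return result

seqs = [
    [2] * 20,           # always High -> belongs in HClust
    [0] * 19 + [1],     # Low exactly 95% of the time (not strictly more)
    [0, 1, 2, 1] * 5,   # genuinely varying device -> full MCC clustering
]
assert one_state_devices(seqs) == {0: 2}
```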

Table I shows the average of the average prediction error results for all constructed clusters obtained with both traces. We obtain 0.1635 with the first trace and 0.0835 with the second trace. The second trace has better results due to the longer learning time of the MCC model.

There may be cases when the prediction is that the load of the system shifts completely toward devices that have been dormant. Wrong data placement in such scenarios may be detrimental to performance, and a prediction of the change may help by allowing some of the data movement to happen ahead of time, avoiding the impact on performance. Given that any change is predicted probabilistically, partial action on a prediction may be effective, to balance between current operation and possible changes in the future.

Figure 3 shows the average prediction error for trace one for each one-state-cluster case, i.e., LClust (red line with +), MClust (green line with ×) and HClust (blue line with ∗). The results show how effective the MCC model is in predicting the future IO load of those devices that almost never change their activity in the system. The same trend can be seen in Figure 4, which shows the average prediction error for trace two. The MCC model is effective in predicting the one-state-cluster cases.

Also note that, if there are any patterns in the incoming stream of IO intensities, then the framework identifies and captures them in a compact form. However, if the access



Table I: Performance Results.

Trace system | Number of devices | Number of intervals (hours) | Average prediction error | Number of clusters
Trace one    | 73272             | 43                          | 0.1635                   | 20
Trace two    | 110427            | 304                         | 0.0835                   | 50

Figure 3: Average prediction error for trace one (LClust, MClust, HClust).

Figure 4: Average prediction error for trace two (LClust, MClust, HClust).

patterns are not present, then the framework will not result in a compact model. As a result, the framework can be used effectively as a complement to a more generic method that handles IO count measurements for the purpose of data placement in a multi-tiered environment.

V. CONCLUSIONS

This research work has investigated the Markov chain correlation (MCC) based clustering method and its application to the design of an automated and adaptive storage tiering framework that aims to move active data to high-performance storage tiers and inactive data to low-cost/high-capacity storage tiers by learning patterns of the storage workloads. The model can be used to quickly predict or detect any changes in the current workload based on what the system has experienced before. Such models can then easily be incorporated in various resource management policies that aim at enhancing performance, reliability, and availability. The predictive nature of the model particularly helps with populating caches or high-performing storage devices with the right data at the right time. We evaluated our proposed framework on two different traces. The results show how effective the MCC model is in predicting the future IO load of the devices.

REFERENCES

[1] R. Chao, H. Wu, and Z. Chen. Image segmentation by automatic histogram thresholding. In Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human, ICIS '09, pages 136-141, New York, NY, USA, 2009. ACM.

[2] Y. Deng, V. Chokalingam, and C. Zhang. Markov chain correlation based clustering of gene expression data. In Information Technology: Coding and Computing, 2005. ITCC 2005. International Conference on, volume 2, pages 750-755, April 2005.

[3] EMC. EMC Symmetrix, 2011. http://www.emc.com/storage/symmetrix-vmax/symmetrix-vmax.htm.

[4] S. Ogawa, K. Kamimura, T. Kato, T. Uehara, and H. Okuda. Performance analysis of hierarchical storage management systems for video retrieval system. In Consumer Electronics, 2001. ICCE. International Conference on, pages 328-329, 2001.

[5] A. Riska and E. Riedel. Evaluation of disk-level workloads at different time scales. SIGMETRICS Perform. Eval. Rev., 37(2):67-68, October 2009.

[6] A. Sahin, S. More, and P. Hale. System and method for preparation of workload data for replaying in a data storage environment. EMC, US 7213113, May 2007.
