


Redundancy Aware Virtual Disk Mobility for Cloud Computing

Alexei Karve and Andrzej Kochut
IBM T.J. Watson Research Center

1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598
{karve,akochut}@us.ibm.com

Abstract—With a multiplicity of Cloud service providers offering geographically distributed Compute Clouds, both providers and customers find it necessary to quickly move virtual machine images between data centers. This is usually accomplished using a standard rsync-based transfer, which is slow and bandwidth intensive given the large size of virtual machine images. This article proposes a mechanism to reconstitute an image on the target data center using information about overlap among images and content already available in the target data center. No changes to the file system are needed, and the approach can immediately be used with traditional libraries storing images as regular files. Moreover, we use a peer-to-peer approach that allows simultaneous retrieval of fragments from multiple data centers. The system and algorithms have been implemented and evaluated on two Compute Cloud environments using three image libraries representative of a typical service provider. The evaluation shows an average 6 times reduction in network transfer volume and time; the reduction can be even larger for images with small configuration changes.

I. INTRODUCTION

Cloud Computing is emerging as a new IT delivery model. The economies of scale of cloud computing come from the capability to multiplex different workloads on a shared pool of physical resources. It is projected [11] to grow significantly and become one of the key IT delivery models throughout the next decade. It extensively leverages both virtualization technology [7], [12], [24], [1], [26], [6] and broad scale automation to minimize delivery costs while keeping high quality of service. Cloud customers utilize various types of services offered by specialized providers [13], [2], [19]. They utilize the Cloud's scalable and elastic computational capabilities. Based on their changing needs, they pay only for the actual use of computing resources, storage, and bandwidth.

Large IaaS providers tend to deliver their services out of multiple worldwide data centers. A typical cloud data center hosts thousands of VM images, each containing gigabytes of data. Transferring these images is very disk and network I/O intensive. Moreover, effective network bandwidth across data centers is often limited to between 1 MBps and 5 MBps. As a result, one of the key issues becomes efficient transfer of virtual machine images between Cloud data centers. There are multiple reasons why virtual machine images may need to be transferred. A service provider usually maintains a set of images that constitute its public catalog - a repository from which users can choose VM templates to provision from. These usually contain pre-configured virtual appliances consisting of both operating system and software stacks (e.g., web or database servers). As those images are updated or new ones are introduced, the service provider needs to transfer them to all global data centers. Another reason for image transfer is moving user private images, i.e., images based on saved

Fig. 1. Image transfer process overview: blocks available locally are copied from other images in the target data center while the remaining content is concurrently streamed from other data centers. Letters A to F denote image fragments.

virtual machine instances that users have customized. Users might transfer them to other data centers to build resilient applications (by placing redundant application components in multiple availability zones, which are usually in different geographies). Finally, it often happens that Compute Cloud customers, instead of moving all of their operations to the Cloud, maintain both their on-premise private Cloud and also use a service provider cloud in a remote location. The Cloud is often treated as “spill-over” capacity where computation can be moved in periods of high demand. In this case users may want to frequently move VM images between the private and public Cloud.

This paper proposes an efficient approach to transferring images between data centers that significantly reduces network bandwidth usage and, as a consequence, the time required to perform a transfer. The approach uses content similarity in image libraries to avoid transferring redundant data. The library of virtual disk images is logically divided into content-based clusters which are efficiently managed to avoid causing overhead while maximizing gains from redundancy elimination. When a virtual machine image needs to be transferred, only the content for clusters that are not present on the target data center is streamed over the network. Fig. 1 illustrates the approach. Consider an image img0 that needs to be transferred to the target data center. The system uses global image overlap information to identify blocks of the image that are already present at the target data center (in our example, blocks A and C are available as fragments of images img1 and img2, respectively). Those fragments are simply copied locally, therefore completely avoiding network transfer. The remaining content is concurrently streamed from other data centers, either containing image img0 or other images containing the required content. In the example, blocks F and D are streamed from image img0, available in data centers 1 and 3, respectively. Blocks B and E are streamed from image img3, available in data center 2. The decision of which peer data centers to use is based on available bandwidth and also transfer cost.

Realizing the above approach poses multiple research challenges that this paper addresses. In order to quantify the potential gain, we study similarity across virtual image libraries by examining the degree of repeating content in three image libraries: 94 images downloaded from VMWare Marketplace [25], 72 images from IBM Research Compute Cloud [8], and 100 images from the public IBM SmartCloud Enterprise [13]. Both our own study and those of other authors [15], [18] show that similarity is very common in virtual machine libraries, with up to 70% content redundancy in a typical library. This is due to sharing of software, such as operating system libraries, software packages, and configuration settings, as well as, in many cases, user data that can be replicated in multiple images. However, redundancy elimination requires the system to find common content blocks at a fine-grained level, such as 4KB blocks. Increasing the block size quickly diminishes the gain from de-duplication. That poses a significant challenge in terms of managing image overlap meta-data, which quickly reaches hundreds of millions of 4KB chunks for a typical library. We have developed a set of algorithms for efficient computation and steady-state maintenance of the similarity meta-data that is later used in deciding which data need to be transferred and which can be reused from other images already present at the target site. Another challenge that the paper addresses is how to manage the trade-off between redundancy elimination and fragmentation introduced by reading parts of a VM image from other VM images rather than a flat file. The algorithms are evaluated in multiple Compute Clouds spanning geographically distributed global data centers. Efficiency of image reconstitution based on the clustered content representation is also evaluated. Overall, our evaluation shows that the approach yields a reduction of as much as 6 times in terms of image transfer volume and time for the system in steady state.

II. RELATED WORK

The closest work to ours is VMTorrent [20], which uses unmodified BitTorrent technology to improve virtual appliance distribution. It allows downloading from peers that already have the content, therefore reducing download time and also the load on the systems of the virtual appliance publisher. However, VMTorrent does not leverage similarity across images. Therefore, in the common case of significantly redundant virtual images, it provides a much smaller gain in terms of both download time and consumed bandwidth than the approach proposed in this paper. LiveDFS [17] provides a live deduplication file system that reduces storage space by removing redundant data copies. It exploits spatial locality to reduce the disk access overhead for looking up fingerprints that are stored on disk. However, this system works within a single data center. Moreover, contrary to LiveDFS [17], our approach does not require any file system modification since it maintains image overlap meta-data, i.e., the image being transferred is reconstituted on the target data center by reading overlapping content fragments from other image files available in that data center. Hadoop's HDFS [3] breaks files up into blocks and stores them on different filesystem nodes. Hadoop HDFS prefers blocks of 64 MB or more, as this reduces the storage requirements of the NameNode. GPFS [14] breaks files up into blocks and makes the location of the data transparent. That improves both data resiliency and read performance (because of the number of replicas). However, no content similarity is used, so transfer of data between the hosts is not optimized. Our objective is to transfer images efficiently between data centers without changing the filesystem. An important area of related work that we leverage is redundancy detection, which identifies parts of virtual machine images that are shared between multiple images in the library. File-level de-duplication is intended to eliminate redundant files on a storage system by saving only a single instance of the file. A very good example of such an approach is the Mirage system [21]. Exploiting image similarity to enable version control for Virtual Machine snapshots is explored in [22]. Another option is segment-level de-duplication, which eliminates partial duplicate segments or pieces of the file, thus storing only new segments and indexes to the original segments to reconstruct the file. Splitting of data into segments can be done using either fixed-length blocks or variable-length segments. A very good discussion of redundancy elimination can be found in [9]. Other interesting sources on this topic are [10], [5]. The proposed approach uses block-level content de-duplication and focuses on mechanisms to optimize the representation of image redundancy for efficient transfers. It also focuses on studying the efficiency of the virtual machine transfer rather than de-duplication itself.

Fig. 2. Virtual machine image transfer system architecture.

III. SYSTEM ARCHITECTURE AND KEY ALGORITHMS

Fig. 2 shows the architecture of the image transfer system with multiple data centers. The Logical Image Library represents the complete set of images in all data centers. Each distinct image in the library is identified by a uuid. Each data center has a subset of images from the Logical Image Library. An image from this subset can be quickly instantiated on hypervisors within that data center. The images in the library are treated as read-only when VM instances are created, by storing instance changes in independent qcow images that are linked to the base image, thus leaving the base images unchanged. When a VM instance with changes is saved as an image, it is treated as a new image in the library. Images that need to be instantiated but are not locally available in the target data center need to be copied from other images that are available locally and also from remote data center(s). To facilitate this process, each data center has an Agent that is used for communication and for transferring blocks of data between data centers. The agent has direct access to the local image library in the data center where it runs. One of the agents takes on the role of a Global Tracker. The Global Tracker maintains a global view of the system including meta-data representing



Fig. 3. Example of clusters for 3 library images.

Fig. 4. Cluster data structure containing shared blocks of Images 0, 2 and 5 with bitset 100101. The cluster bitset denotes which images have blocks in this cluster.

image overlap structure, lists of images locally available in each of the data centers, and also transfer cost and available network bandwidth information. For high availability, the Global Tracker can be implemented either in fail-over mode or in a distributed fashion.

Image Overlap Meta-data Representation

In order to identify common content segments across VM images, we use standard content digest computation at the 4KB block level. We use sha1 as the hash, but could use sha-256 or sha-512 as stronger hashes at the expense of higher storage overhead. Each new image added to the Logical Image Library is analyzed: sha1 hashcodes are computed for all blocks in the image, sorted, and stored in a content digest file.
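As a concrete illustration, the following is a minimal sketch of the per-block digest computation, assuming images are stored as regular files; the function name and the in-memory result list are illustrative, not the authors' implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, matching the paper

def compute_content_digest(image_path):
    """Compute per-block sha1 hashcodes for a VM image stored as a
    regular file, returning them sorted by hash as in the content
    digest file described above."""
    digests = []
    with open(image_path, "rb") as f:
        block_number = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digests.append((hashlib.sha1(block).hexdigest(), block_number))
            block_number += 1
    digests.sort()  # sorted by sha1 so digests can be merged and intersected
    return digests
```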

A key concept used to represent the image overlap structure is a cluster. It is defined as a logical set of blocks that are present in one or more images. A cluster only describes the locations of shared blocks; it does not store the image data. For each block, the data structure of a cluster maintains the block offsets where the block appears within each image. The content unique to each image is contained in a singleton cluster related to that image. The Global Tracker maintains the data structure representing all of the clusters and efficiently implements image addition and removal.

Consider a simple example of clusters for three images, Image-0, Image-1 and Image-2, in an image library as shown in Fig. 3. Clusters CL-001, CL-010, CL-100 are singleton clusters, each containing the blocks only from Image-0, 1 and 2, respectively. The figure shows the actual cluster data blocks shared by the images. For example, the block with hash G is unique to Image-0, while the blocks with hashes C and D are shared by all three images. For internal redundancy within an image, we use subscripts to denote identical blocks. For example, Image-2 contains C1, C2 and C3; this indicates that hash C is repeated three times in the image. Image-2 consists of clusters CL-100, CL-110 and CL-111. CL-101 is empty. In order to compute the size (in terms of unique blocks) of the image, the cluster sizes are added, since clusters are non-overlapping. Each block is 4096B in this illustration. The total size of unique blocks for Image-2 is 12288B + 4096B + 8192B = 24576B. The actual disk size of Image-2 is 32768B; this is larger because of the internal redundancy of the block with hash C.

We persist each cluster using: 1) the cluster blocks meta file ClusterFile; 2) the hash index file ClusterFile.sha1index; and 3) the Bloom filter for each cluster, ClusterFile.bloom. The cluster blocks file consists of records containing block numbers of distinct sha1s shared by images. Each record can have a different number of blocks depending on how many blocks have the same sha1 in the image file. We store the total number of blocks, optional image indexes to reference the start position of the blocks, and lastly the actual list of block numbers in the corresponding images. The sha1index file contains the distinct sorted list of sha1s, each sha1 pointing to the position of that sha1 in the ClusterFile. The ClusterFile.bloom is a space-efficient probabilistic data structure to test whether a sha1 is a member of this cluster. We keep the Bloom filter size large enough to keep the false positive probability below 1%. For simplicity, we refer to each cluster uniquely by the bitset of its constituent images. For example, CL-100101 is referred to as cluster 100101. The least significant bit in the cluster bitset is for Image-0 on the right, and the most significant bit is for Image-5 on the left. If there are additional images in the image library numbered 6 and above, they are not present in this cluster. The cardinality of this cluster is 3 because three of the bits in this cluster are 1. Thus Image-0, Image-2, and Image-5 share the blocks present in this cluster. When an image is added, the cluster name is extended with the most significant bit on the left. When an image is removed, the higher image indexes are shifted to the right. Many records in the clusters have a single block for each image; in this case we do not store the indexes. Consider the earlier example with cluster bitset 100101 and cardinality 3. The block positions for the three images Image-0, Image-2 and Image-5 are stored in this cluster as shown in Fig. 4. Image-1, Image-3 and Image-4 do not have blocks in common with the illustrated cluster and may belong to other clusters. Consider one common sha1, called SHA1-A, shared by each image. If each file has exactly 1 block with SHA1-A, then the number of blocks is also 3. So in this case we do not store the indexes; we directly store the 3 block numbers from the corresponding image files. The record for SHA1-A then consists of 16 bytes: 4 bytes for the integer number of blocks and 4 bytes each for the three block numbers. Block number 101 is in Image-0, block number 31 is in Image-2, and block number 73 is in Image-5. If, instead of 1 block each, one or more images had two or more blocks with the same SHA1-B, then we would have to store the indexes that tell which image's blocks start at which position in the record. For example, if the indexes are 0, 3, 7, the number of blocks is 10, and the number of indexes is 3, then this record would consist of 4 bytes for the number of blocks + 12 bytes for the 3 indexes + 40 bytes for the 10 blocks. For SHA1-B, Image-0 starts at index 0 and contains 3 blocks: 102, 128, 199. Image-2 starts at index 3 and contains 4 blocks: 34, 45, 56, 78. Image-5 starts at index 7 and contains 3 blocks: 77, 88, 99.
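The record layout can be sketched with Python's struct module. The helper below is a hypothetical illustration of the two record shapes, not the actual ClusterFile format; it assumes the count of indexes equals the cluster cardinality and is therefore recoverable from the bitset rather than stored. The byte counts match the worked example.

```python
import struct

def pack_record(indexes, blocks):
    """Pack one cluster record: a 4-byte block count, an optional list of
    4-byte start indexes (one per image), then 4-byte block numbers."""
    if indexes is None:  # exactly one block per image: count + block numbers
        return struct.pack(f">i{len(blocks)}i", len(blocks), *blocks)
    return struct.pack(f">i{len(indexes)}i{len(blocks)}i",
                       len(blocks), *indexes, *blocks)

# SHA1-A: one block per image (blocks 101, 31, 73) -> 16 bytes
rec_a = pack_record(None, [101, 31, 73])
# SHA1-B: indexes 0, 3, 7 over 10 blocks -> 4 + 12 + 40 = 56 bytes
rec_b = pack_record([0, 3, 7],
                    [102, 128, 199, 34, 45, 56, 78, 77, 88, 99])
assert len(rec_a) == 16 and len(rec_b) == 56
```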

Cluster Meta-data Management Algorithms

The main algorithms for cluster maintenance on the Global Tracker are AddImage() and RemoveImage(), which update the meta-data to reflect the overlap structure across images in the Logical Image Library. Actual image transfer requires two additional functions: GenerateTransferPlan() and ExecuteTransfer(). The former creates a transfer plan for the retrieval of blocks required in the target image. The latter executes the transfer plan and results in the image being available on the target data center.

AddImage() takes the content digest representing the new image as an argument. The algorithm splits existing clusters, adds new block positions to existing clusters to include the image being added, and creates a new singleton cluster. The algorithm compares the list of sha1s of blocks in the image being added with the sha1s in each cluster in the current cluster list stored in the Global Tracker. If no blocks from the new image are present in a cluster, it updates the cluster to 0Cluster, where the 0 prepended as the most significant bit signifies the new image's position in the cluster bitset; a value of 0 means none of the blocks from the 0Cluster are present in the image. If there are one or more blocks in common with the existing cluster, the cluster is split into two smaller clusters: the first, 1Cluster, is the subset of blocks from the cluster that are present in the image, and the second, 0Cluster, holds the remaining blocks in the cluster. When the 1Cluster is created, the image id and block numbers from the digest of the image are added to the 1Cluster for the sha1s present in both the image and the cluster. The remaining sha1s from the cluster are added to the 0Cluster. It is possible that all blocks in a cluster are present in the image, in which case only the 1Cluster is created. The sha1s belonging to the 1Cluster are removed from the sha1s of the image being added, and the process continues until all clusters are handled. Finally, a new singleton cluster is created for the remaining sha1s in the new image. Section IV describes an additional optimization that avoids comparing each sha1 from the new image's content digest with each sha1 in the Image Library.
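The following is a simplified sketch of this cluster-splitting logic, modeling clusters as bitset/sha1-set pairs and omitting block positions as well as the Bloom filter and sampling optimizations of Section IV; all names are illustrative.

```python
def add_image(image_id, new_sha1s, clusters):
    """Sketch of AddImage() splitting. Clusters are modeled as dicts
    {'bitset': int, 'sha1s': set}."""
    remaining = set(new_sha1s)
    updated = []
    for cluster in clusters:
        common = cluster['sha1s'] & remaining
        if not common:
            # 0Cluster: the new image's (most significant) bit stays 0
            updated.append(cluster)
            continue
        # 1Cluster: blocks shared with the new image; set the image's bit
        updated.append({'bitset': cluster['bitset'] | (1 << image_id),
                        'sha1s': common})
        rest = cluster['sha1s'] - common
        if rest:  # 0Cluster: blocks the new image does not share
            updated.append({'bitset': cluster['bitset'], 'sha1s': rest})
        remaining -= common
    if remaining:  # unique content forms the new singleton cluster
        updated.append({'bitset': 1 << image_id, 'sha1s': remaining})
    return updated
```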

The RemoveImage() algorithm is used to remove a given image from the library and results in combining clusters. The singleton cluster is deleted, if present. Next, three possibilities for the clusters at the bit position of the image being deleted are considered: 1) Look for pairs of clusters containing exactly the same images except the image index being deleted, and combine each such pair into a single cluster. The image id and blocks of the image being deleted are removed from the combined cluster, which gets a new bitset (with the image id removed). 2) Once no such pairs are found, the algorithm looks for the clusters containing the image. For every such cluster, the block numbers of the image being deleted are removed and a new cluster created (with the image id removed from the bitset). 3) For the rest of the clusters, which do not contain the image being deleted, only the image id is deleted to indicate the image is no longer part of the cluster; there are no sha1s for blocks to remove because none were present in these clusters. At the end of these three steps we have the cluster list with the deleted image removed.
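A minimal sketch of the removal logic follows, on the same simplified cluster model as before. Note that dropping the image's bit and shifting higher image indexes right makes the bitsets of cluster pairs that differed only in that bit equal, so a dictionary merge performs the pairwise combination of step 1.

```python
def remove_image(image_id, clusters):
    """Sketch of RemoveImage(): drop the image's bit, shift higher image
    indexes right, and merge clusters whose bitsets become identical."""
    bit = 1 << image_id
    low_mask = bit - 1
    merged = {}
    for cluster in clusters:
        b = cluster['bitset']
        if b == bit:
            continue  # delete the removed image's singleton cluster
        new_bitset = (b & low_mask) | ((b >> (image_id + 1)) << image_id)
        if new_bitset == 0:
            continue  # cluster held only the removed image
        merged.setdefault(new_bitset, set()).update(cluster['sha1s'])
    return [{'bitset': b, 'sha1s': s} for b, s in merged.items()]
```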

GenerateTransferPlan() accepts the image identifier and the target data center as arguments and generates a transfer plan to instantiate the image at the target data center. The plan is based on both the overlap meta-data and the transfer cost and available bandwidth on network connections between data centers. First, all clusters required to create the image are identified. For each such cluster, the algorithm checks whether other images containing this cluster are available at the target. If so, the data will be copied from those images using local copy. If not, the best data center that contains an image belonging to the required cluster is identified. The metric used to identify the best data center can be based both on available network bandwidth between that data center and the target and on network bandwidth cost. The system allows plugging in custom optimization algorithms to select the best “source” data center. The details of the optimization solution are out of scope for this paper.
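A sketch of the plan generation under these rules, with a pluggable cost function; dc_images and cost_fn are assumed inputs for illustration, not part of the paper's interface.

```python
def generate_transfer_plan(image_id, target_dc, clusters, dc_images, cost_fn):
    """Sketch of GenerateTransferPlan(). dc_images maps data center ->
    set of locally available image ids; cost_fn(src, dst) stands in for
    the pluggable bandwidth/cost metric. Assumes every required cluster
    is held by at least one data center."""
    plan = []
    for cluster in clusters:
        if not cluster['bitset'] & (1 << image_id):
            continue  # no blocks of the transferred image in this cluster
        holders = {i for i in range(cluster['bitset'].bit_length())
                   if cluster['bitset'] & (1 << i)}
        if holders & dc_images[target_dc]:
            plan.append(('local-copy', target_dc, cluster))
        else:
            candidates = [dc for dc, imgs in dc_images.items()
                          if dc != target_dc and imgs & holders]
            best = min(candidates, key=lambda dc: cost_fn(dc, target_dc))
            plan.append(('stream', best, cluster))
    return plan
```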

ExecuteTransfer() executes the image transfer based on the plan created by GenerateTransferPlan(). The plan is generated on the Global Tracker and then sent to the target data center. The agent on the target data center first sends requests to the “source” data centers specifying which blocks of which images should be read (based on the transfer plan). It also starts the reconstitution process, which creates an empty image file of the appropriate size and starts filling it with the required data. The writing process concurrently reads data from locally available images and receives data from “source” data center agents that provide the data not available locally. The process ends when both the local copy and the remote streaming end. Image verification is done by maintaining an xor of non-zero block checksums computed as the blocks are received by the target data center in a non-sequential manner. This eliminates the need to wait for the image to become fully available before starting to verify it.
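The order-independent verification works because xor is commutative and associative; a small sketch (assuming sha1 block checksums) follows.

```python
import hashlib

def running_xor_checksum(blocks):
    """Fold per-block sha1 digests of non-zero blocks into a running xor.
    Blocks may arrive in any order, and the accumulator can be compared
    with the source's value as soon as the last block lands."""
    acc = bytes(20)  # width of a sha1 digest
    for block in blocks:
        if block.count(0) == len(block):
            continue  # zero blocks are excluded from the checksum
        digest = hashlib.sha1(block).digest()
        acc = bytes(a ^ b for a, b in zip(acc, digest))
    return acc
```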

IV. ALGORITHM OPTIMIZATIONS AND EXTENSIONS

We have implemented a number of optimizations to speed up the maintenance of the clusters and the overall system performance. In particular, we use Bloom filters and probabilistic sampling to improve new image addition, roll up small clusters to reduce the cluster count, and also provide an approach to handle hash collisions.

Bloom Filters

In AddImage, we need to check for the existence of the sha1 of each block in the content digest of the virtual machine image being added against the sha1s in each cluster. We utilize Bloom filters to check for the existence of a block in a cluster. A Bloom filter guarantees that a membership query will never produce a false negative, though it may produce false positives. If the Bloom filter check returns negative, we quickly remove the cluster from consideration because the sha1 does not belong to it. If the Bloom filter check finds a positive, then we need to make sure that it is a real positive. We store the clusters in sorted order of sha1s, so a binary search confirms the positive with a complexity of O(log(clusterSize)). We also maintain a Main Bloom Filter that covers all blocks in the image library. A negative against the Main Bloom Filter confirms the absence of the sha1 from all clusters, and we can therefore remove such sha1s from additional comparisons against individual clusters. We additionally achieve parallel execution by performing the test for the sha1 against the Bloom filters in multiple simultaneous threads.
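A sketch of the two-step membership test; cluster_bloom is assumed to be any Bloom filter object supporting the `in` operator.

```python
from bisect import bisect_left

def sha1_in_cluster(sha1, cluster_bloom, sorted_cluster_sha1s):
    """Two-step membership test: the Bloom filter (no false negatives)
    rules the cluster out cheaply; a positive is confirmed by binary
    search over the sorted sha1index in O(log(clusterSize))."""
    if sha1 not in cluster_bloom:
        return False  # definitely not in this cluster
    i = bisect_left(sorted_cluster_sha1s, sha1)
    return i < len(sorted_cluster_sha1s) and sorted_cluster_sha1s[i] == sha1
```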

Use of Representative Random Sample of Hashcodes

We compare only a sample of the sha1s remaining in the ContentDigest against each cluster's Bloom filter under consideration. Since sha1 distributes the domain of blocks evenly over the range of hash values, we can use a sample of blocks as representative of all blocks. If the percentage of positives from the sample of blocks from the new image being added is less than a small multiple (we used a factor of 4) of the false positive probability of the Bloom filter, we skip the cluster check. In other words, if the number of positives (which may include false positives) after comparison with the Bloom filter is too low, we consider it not worth comparing the remaining sha1s and simply ignore any common blocks of the current image with this cluster; the number of overlapping blocks is too low to warrant splitting the cluster. Even if a few blocks from the ContentDigest had real overlap with the cluster (not false positives), we would still not want to split the cluster, since that would result in a small cluster.
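A sketch of the sampling shortcut, with the factor of 4 from the text; the sample size and parameter names are illustrative.

```python
import random

def worth_checking(remaining_sha1s, cluster_bloom,
                   sample_size=100, fp_prob=0.01, factor=4):
    """Test a random sample of the image's remaining sha1s against the
    cluster's Bloom filter; skip the cluster when the positive rate is
    below factor x the filter's false positive probability."""
    sample = random.sample(list(remaining_sha1s),
                           min(sample_size, len(remaining_sha1s)))
    positives = sum(1 for s in sample if s in cluster_bloom)
    return positives / len(sample) >= factor * fp_prob
```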

Roll-up of Small Clusters

Typically, a cluster is formed out of common files belonging to an operating system, middleware, or an application. Any custom configuration files on the images would be moved to singleton clusters. Clusters are made up of possibly non-sequential blocks. Non-sequentiality may occur due to fragmentation in the image. Gaps in the parent clusters are caused because new tiny clusters are created when a larger cluster is split because of small overlaps. We would like the clusters to contain sufficient blocks so that most blocks are read sequentially from the image. If there are too many gaps, then we have multiple seeks that can slow down the reads from the source images and the writes to the target image.

Division of clusters into child clusters is good if the commonality of files between images is real and will recur in additional images. However, a growing number of clusters increases the computational overhead, so we roll up the small clusters into their parent clusters. This roll-up causes the blocks that were previously separated to be added to the relevant parent clusters. We select parent clusters containing the maximum number of shared images and add the blocks from the child cluster to each of the minimum required parent clusters. In certain cases the blocks cannot be merged into any parents for certain images; these left-over images in the cluster are converted into singletons. The roll-up results in replication of block information from the child cluster into multiple parent clusters. The overhead is small, as shown in Fig. 9a (more details are discussed in Section V).

We illustrate roll-up with a small cluster list for 12 images, where we represent each cluster with the bitset of images the cluster contains. Each cluster has sha1s, one for each distinct block. The least significant bit in the bitset is image 0 and the most significant bit is image 11. Consider the sample cluster list in Fig. 5a, represented by two columns: 1) the bitset of images in the cluster and 2) the number of distinct blocks in the cluster.

(a) Sample Clusters
Cluster Bitset    Size
111001000111      500
100001000001      334
010001000101      245
000011000000      200
110001000101      129
000000000110      100
110000000010       50
110001000111        2

(b) Ineligible rows
Cluster Bitset    Size
111001000111      500
000011000000      200

(c) Eligible rows
Cluster Bitset    Size
100001000001      334
010001000101      245
110001000101      129
000000000110      100
110000000010       50

Fig. 5. Illustration of the cluster roll-up process.

The cluster list is sorted in descending order by the number of distinct blocks. We select the smallest cluster at the bottom, with 2 blocks, for merge with parent clusters; call this tiny cluster Cm = 110001000111. We remove from consideration any parent clusters that have an image bit position set where Cm does not have that image bit position set. The rows in Fig. 5b are removed from consideration for roll-up of Cm because the first cluster has the extra image 9 set and the second cluster has the extra image 7 set. The list of clusters remaining in the eligible parent list is shown in Fig. 5c. The goal is to find the minimum number of parents that can satisfy the bits in Cm for roll-up. The Cm data can be merged with the two parent clusters 110001000101 and 000000000110. This introduces redundancy for Image-2; its meta information can be retrieved from either of the parent clusters. No singletons need to be created in this example. Note that if we were to continue the roll-up process up to the maximum blocks a cluster may contain, we would be left with singleton clusters, degenerating to one cluster for each image. A greedy sketch of the parent selection is shown below.
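This is a minimal greedy sketch of the parent selection (the paper seeks the minimum number of parents; greedy set cover is a simple stand-in), reproducing the Fig. 5 example.

```python
def pick_parents(cm_bitset, eligible):
    """Greedily cover the image bits of the tiny cluster Cm with eligible
    parent bitsets (those with no image bit that Cm lacks). Bits left
    uncovered would become singleton clusters."""
    uncovered = cm_bitset
    parents = []
    while uncovered:
        best = max(eligible, key=lambda b: bin(b & uncovered).count('1'))
        if not best & uncovered:
            break  # no parent covers the rest; leftovers become singletons
        parents.append(best)
        uncovered &= ~best
    return parents, uncovered

# Fig. 5 example: Cm is covered by parents 110001000101 and 000000000110
parents, leftover = pick_parents(
    0b110001000111,
    [0b100001000001, 0b010001000101, 0b110001000101,
     0b000000000110, 0b110000000010])
assert parents == [0b110001000101, 0b000000000110] and leftover == 0
```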

Handling Possible Hash Collisions

Although using the SHA-1 hash function makes the probability of a collision sufficiently unlikely, we have a mechanism to handle collisions to ensure correctness. When we add a new image on a data center, there are three possibilities for the sha1 hashcodes of the blocks from the new digest file. The hashcodes may be: 1) not present on any data center, 2) present in other images on the same data center, or 3) present in images on other data center(s) but not on the local data center. To check for hash code collisions, we compare the content of each block of the added image with the matching block of another image on either the local data center or one of the remote data centers. If there is a hash collision, i.e., the hash is the same but the bytes of the blocks do not match, we mark the corresponding cluster blocks in the new image as Collided and do not use them for reconstitution of other images. Consider the earlier Fig. 3 with three images, and assume that Image-0 is present on Datacenter-1, Image-1 and Image-2 are present on Datacenter-2, and Image-1 is also present on Datacenter-3, as shown in Fig. 6. Each data center shows the clusters with blocks belonging to the images. CL-101 is empty and therefore not shown.

We add a new fourth image, Image-3, locally on Datacenter-1. Reclusterization will cause new clusters to be created as in Fig. 7. If Image-3 only had overlap with the singleton clusters for Image-0 and Image-2, the clusters belonging to the new image would be CL-1001 and CL-1100, and the rest of the blocks would be in the singleton cluster CL-1000.


Fig. 6. Checking for hash collisions.

Fig. 7. Cluster overlap with four images.

By its nature, the singleton cluster CL-1000 does not have hash collisions because the blocks in the cluster have new hashes that were not present in any of the other clusters in the logical image library. CL-001 is split into CL-1001 and CL-0001. The content of blocks from CL-1001 in the new Image-3 is compared against the corresponding blocks in Image-0 to check for hash collisions locally on Datacenter-1. CL-100 is split into CL-0100 and CL-1100. Blocks from CL-1100 are not present on Datacenter-1 and must be compared on another data center. We have three strategies for verification of blocks: ImmediateVerify, LazyVerify, and VerifyDuringTransfer. For immediate verification, we transfer the blocks of CL-1100 from Image-3 to Datacenter-2 and verify them against Image-2, which contains blocks from the same cluster. If verified, we mark Image-3 as Available on Datacenter-1; otherwise we mark it as Collided. In the second strategy, LazyVerify, the cluster that is not verified causes Image-3 to remain Not Verified. Whenever another image with overlapping verified blocks arrives at Datacenter-1, we compare the local content with the verified content, and when all unverified clusters are verified, we mark Image-3 as Available. In the third strategy, VerifyDuringTransfer, the verification happens only during a transfer request. Consider that we want to transfer this unverified Image-3 to Datacenter-3. We would transfer the CL-1100 contents of Image-3 from Datacenter-1 to another data center that has an image containing CL-1100 (Datacenter-2 in our case) and verify the content. If verified, we can transfer the blocks to Datacenter-3, and Image-3 on Datacenter-3 is marked Available.

V. EXPERIMENTAL EVALUATION

In order to evaluate the efficiency of the proposed algorithms, we have performed extensive studies of the implemented system, including transfer, roll-up, and reconstitution. We have used VM images from IBM Research Compute Cloud [8], IBM SmartCloud Enterprise [13], and VMWare Marketplace [25]. The combined library contained 265 images from IBM and VMWare Marketplace, with image sizes ranging from 800MB to 100GB and containing Linux and Windows operating systems and a wide range of software packages. In the case of IBM, these were IBM Rational tools, Information Management tools (e.g., DB2 versions), Websphere Application Server, etc. VMWare Marketplace contains open source software stacks from Bitnami [4] and Turnkey [23].

Transfer Experiments on Compute Cloud

We have performed extensive experiments with transferring images within two different cloud environments: IBM SmartCloud Enterprise [13] and IBM Research Compute Cloud [8]. We have performed over 2000 image transfers across data centers in multiple geographies (e.g., Raleigh in the USA, Ehningen in Germany, Makuhari in Japan, etc.) and verified the correctness and performance of the algorithms. Observed average network bandwidth between the sites ranged between 0.5 MBps and 4 MBps. Fig. 8 presents example results of our transfer experiments. Each row of the table compares the image transfer time of a standard rsync-based transfer against the proposed algorithm. The overlap fraction and relative speedup factor are also provided. The average speedup factor is 6.21. Rsync uses its “delta-transfer” approach only within a single file, rather than across collections of files. Our approach achieves such a significant speedup because it considers both inter- and intra-image similarity and also multiple locations from which the content can be concurrently retrieved.

For the SmartCloud Enterprise image library with 100 images totaling 1.75TB, a block size of 4KB, and SHA1 with 20 bytes as the hash code, the meta-data storage size is 6.45GB. Thus the storage overhead is less than 0.4%. We have measured the overlap across images in terms of unique versus total 4KB blocks. The IBM library has 13% unique blocks, while the VMWare Marketplace library has 25% unique blocks, therefore giving significant opportunity for optimization.

Steady-state Transfer Simulation

In addition to experiments with the implemented system, we have conducted large scale simulations. The gain from using the proposed algorithm depends on the degree of similarity across virtual machine images, which is a function of the fraction of blocks shared across images (please refer to [16] for the specific definition), and also on which of the images are available on a given data center. In order to quantify the amount of saving we can expect during an average image transfer, we have executed a series of simulations of actual transfers designed to compute the gain for different transfer permutations. The experiment is set up with all images initially available in only one data center; they are then transferred, in sequence, to another data center. The transfer order was varied across simulation runs to establish typical behavior rather than report on specific runs.

Fig. 8. Example results of the image transfer experiments on IBM Cloud between Raleigh, USA, and Ehningen, Germany.

[Fig. 9 plots: panel (a) shows the redundant fraction (y-axis, 0 to 0.04) versus the fraction of clusters rolled up (x-axis, 0.76 to 0.9) for the VMWare Marketplace, Research Compute Cloud, IBM SmartCloud, and combined libraries; panels (b)-(d) show the fraction of content available locally (y-axis, 0 to 1.2) versus the number of images transferred.]

Fig. 9. Effect of cluster roll-up on meta-data size (a), local content availability in steady state for an example permutation (b), and averages of 100 image permutations (c, d) for image libraries from IBM Cloud and VMWare Marketplace, respectively.

Fig. 9b shows the result of an example run for a specific ordering of images to be transferred. The x-axis is the sequence number of the image being transferred; the y-axis represents the fraction of required content that was already present on the target data center. Each of the images being transferred can have a widely varying available content fraction depending on how unique the image is compared to the images that were already transferred. Some images have very high availability (almost 99%) because they are minor updates or modifications of previously transferred ones. On the contrary, the image transferred as number 46 has very low local content availability (under 10%) because it is a Windows image while all prior transfers were Linux-based images. We have repeated the experiment for 100 permutations and then computed the mean fraction of content available locally for each image in the sequence. The results of the experiments performed for the IBM and VMware image libraries are presented in Fig. 9c and 9d, respectively. As the number of images available at the target data center increases, the mean fraction of locally available content increases to above 80%. The behavior is remarkably similar for both libraries, even though the virtual machine images in the two are significantly different. That shows that, in steady state, the system can deliver an average of more than 80% saving in terms of the amount of data that needs to be transferred over the network. The value for any particular image transfer, however, can deviate significantly from the mean based on how much unique content the image brings.

Effect of Cluster Roll-up

We have computed the duplication overhead due to “rolling up” small clusters into parent clusters, as presented in Fig. 9a. This relationship allows selecting an appropriate number of clusters for a specific redundancy overhead. In practice, keeping only 10% of the total unique clusters (i.e., rolling up 90% of the clusters) is sufficient to reduce meta-data processing overhead (i.e., image addition and removal) while keeping the additional storage overhead for cluster images below 3%.

We have also studied the effect roll-up has on the sequentiality of the blocks in clusters. Non-sequentiality is the number of gaps in the sequences of blocks in a cluster. Clusters contain block numbers from multiple images. Each image in turn can have a different number of blocks represented in the cluster due to internal redundancy, so the non-sequentiality for each image in a cluster can differ. We compute the average non-sequentiality for a library of images as the sum of non-consecutive blocks for each image in all clusters divided by the sum of total blocks for all images in all clusters in the library. We computed the non-sequentiality of the SCE image library with 100 images by rolling up small clusters to a minimal number of parents. The initial non-sequentiality without any roll-ups was 5.04% with 3646 clusters. The initial small roll-ups result in a slight increase of the non-sequentiality because the gaps in child clusters are propagated to the parent clusters. As the number of clusters was reduced to around 1000 by rolling up blocks from child clusters to parent clusters, the non-sequentiality decreased and continued decreasing with larger roll-ups as the gaps in parent clusters filled up. With roll-up reducing the clusters to 250, the non-sequentiality dropped to 4.7%. This shows that rolling up clusters with a small number of blocks results in a large reduction in the number of clusters. Reducing the number of clusters reduces computational overhead with minimal increase in storage overhead. The small roll-ups may not increase sequentiality. However, having a target image created from dependent clusters that are selected from a small set of source images has the effect of a large roll-up; this increases the sequentiality of reads, allowing for faster transfers.
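Under this definition, the metric can be sketched as follows; the per-(cluster, image) block lists are an assumed input shape, not the paper's data layout.

```python
def non_sequentiality(per_image_block_lists):
    """Fraction of non-consecutive block numbers over all (cluster, image)
    block lists: the sum of gaps between successive block numbers divided
    by the total number of blocks."""
    gaps, total = 0, 0
    for blocks in per_image_block_lists:
        blocks = sorted(blocks)
        gaps += sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)
        total += len(blocks)
    return gaps / total if total else 0.0
```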

Performance of Image Reconstitution

Another aspect of the system is the reconstitution of virtual machine images based on the overlap meta-data and the content available locally on the hypervisor. Reading content from multiple images rather than one sequential image file might potentially have a negative effect due to breaking the sequentiality of read accesses. Therefore, we have investigated those effects closely. First, the studies suggest that since we create a relatively small number of large clusters and the division is closely related to content fragments (such as DB2 or another software package), the sequentiality of clusters is very high.


[Fig. 10 plots: reconstitution rate in MBps (y-axis) versus image ID (x-axis, 0 to 8) for P2P reconstitution and standard file read; panel (a) spans 0 to 16 MBps, panel (b) spans 0 to 80 MBps.]

Fig. 10. Image reconstitution performance with cache disabled (a) and enabled (b) for the proposed method and standard file read.

Precisely, our studies indicate an average sequential read fraction above 0.9. In addition, we have performed extensive reconstitution bandwidth tests to measure the effective read rate accomplished by reconstituting the image using the proposed algorithm, as compared to reading a traditional flat image file. We have run 1000 reconstitutions of 50 randomly selected images in random sequences. Results for 9 example images are provided in Fig. 10. Fig. 10a shows the comparison when the OS cache was cleared after each reconstitution. The standard file read has a constant rate that matches the physical disk bandwidth. Reconstitution based on the proposed algorithm yields some improvement due to internal image redundancy, i.e., content segments that repeat in the image may be cached after the first read, thus increasing the effective read rate. Fig. 10b shows the comparison with the OS cache enabled. The rate increases for both (due to caching effects) but more so for our reconstitution. The reason is that the effective caching ratio improves since the content is deduplicated; therefore some reconstitution runs benefit from cached blocks previously read.

VI. CONCLUSIONS AND FUTURE WORK

We have presented and evaluated a virtual machine image cloning and reconstitution system that uses similarity across virtual machine images to minimize the amount of data that has to be transmitted to the target data center. The clusterization and declusterization algorithms are validated. The use case provides important insights into the circumstances in which this scheme can be beneficial. The key parameters affecting the performance are the degree of similarity among the virtual machines and block sequentiality: the higher the overlap, the greater the benefit. The system is implemented in a testbed and also validated using extensive discrete event simulations based on a library representative of a typical Cloud provider's catalog. For the studied libraries the system achieves a 6 times reduction in the amount of data transferred to the target data center in steady state. For small image configuration changes the gain may be as high as 99%. Also, the p2p copying of cluster blocks allows us to concurrently transfer cluster data from multiple source data centers, further reducing transfer time. The key limitation of the approach is that the introduction of images with a significant fraction of unique content leads to low gains. For example, the transfer of a Windows image when the only images available locally are Linux-based will find almost no redundancy, resulting in transferring the entire content of the image and providing no gain. However, this situation is rare for typical libraries, i.e., any subsequent transfer of a Windows-based image will likely find redundant content with the first one, speeding up the transfer. Another limitation is that increased fragmentation in the images can adversely affect the speed of transfer. Our future work includes exploring algorithms that optimize between the cost and time of data transferred from source data centers, based on bandwidth cost and the priority of the transfer request.

REFERENCES

[1] Kernel Virtual Machines. Online. http://sourceforge.net/projects/kvm.

[2] Amazon Inc. Amazon Elastic Compute Cloud. Online, 2009. http://aws.amazon.com/ec2/.

[3] Apache Hadoop. HDFS Architecture Guide. http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html#Replica+Selection, 2013.

[4] Bitnami. Bitnami. http://bitnami.org/, 2012.

[5] J. Bonwick. ZFS Deduplication. Online, 2009. http://blogs.sun.com/bonwick/entry/zfs_dedup.

[6] Microsoft Corp. Microsoft Virtualization. Online, 2011. http://www.microsoft.com/virtualization/.

[7] R. Creasy. The Origin of the VM/370 Time-Sharing System. IBM Journal of Research and Development, 1981.

[8] Jim Doran, Frank Franco, Dilma M. Da Silva, Alexei Karve, et al. RC2 - a living lab for cloud computing. IBM Research Report, 2010.

[9] Fred Douglis, Jason Lavoie, John M. Tracey, and Purushottam Kulkarni. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, General Track, pages 59–72, 2004.

[10] EMC. Data Domain Replicator Software, Network-efficient replication for backup and archive data. Online, 2011. http://www.datadomain.com/pdf/DataDomain-Rep-Datasheet.pdf.

[11] Gartner Inc. Special Report on Cloud Computing. Online, 2011. http://www.gartner.com/technology/research/cloud-computing/.

[12] R. Goldberg. Survey of Virtual Machine Research. In IEEE Computer Magazine, 1974.

[13] IBM. IBM SmartCloud. Online, 2011. http://www.ibm.com/cloud.

[14] IBM. Implementing the IBM General Parallel File System (GPFS) in a Cross-Platform Environment. http://www.redbooks.ibm.com/redbooks/pdfs/sg247844.pdf, 2011.

[15] K. R. Jayaram, Chunyi Peng, Zhe Zhang, Minkyong Kim, Han Chen, and Hui Lei. An empirical analysis of similarity in virtual machine images. Middleware, 2011.

[16] A. Kochut and Alexei Karve. Leveraging local image redundancy for efficient virtual machine provisioning. IEEE Network Operations and Management Symposium, 2012.

[17] Chun-Ho Ng, Mingcao Ma, Tsz-Yeung Wong, Patrick P. C. Lee, and John C. S. Lui. Live deduplication storage of virtual machine images in an open-source cloud. Middleware, 2011.

[18] Chunyi Peng, Minkyong Kim, Zhe Zhang, and Hui Lei. VDN: Virtual machine image distribution network for cloud data centers. IEEE NOMS, 2012.

[19] Rackspace. Rackspace Cloud. http://www.rackspace.com/cloud/, 2011.

[20] Joshua Reich, Oren Laadan, et al. VMTorrent: Virtual appliances on-demand. ACM SIGCOMM, 2010.

[21] D. Reimer, A. Thomas, G. Ammons, T. Mummert, B. Alpern, and V. Bala. Opening black boxes: Using semantic information to combat virtual machine image sprawl. In Proc. of USENIX Virtual Execution Environments Workshop, 2008.

[22] Chung Pan Tang, Tsz Yeung Wong, and Patrick P. C. Lee. CloudVS: Enabling version control for virtual machines in an open-source cloud under commodity settings. IEEE NOMS, 2012.

[23] Turnkey. Turnkey. http://www.turnkeylinux.org/, 2012.

[24] VMware. Online. http://www.vmware.com.

[25] VMware Inc. VMware Virtual Appliance Marketplace. Online, 2011. http://www.vmware.com/appliances/.

[26] Xen. Online. http://www.xensource.com.
