
Available online at www.sciencedirect.com

Journal of Systems Architecture 54 (2008) 70–80

www.elsevier.com/locate/sysarc

RISC: A resilient interconnection network for scalable cluster storage systems

Yuhui Deng *

Center for Grid Computing, Cranfield University Campus, Bedfordshire MK430AL, United Kingdom

Received 20 November 2006; received in revised form 2 April 2007; accepted 17 April 2007. Available online 29 April 2007.

Abstract

The explosive growth of data generated by information digitization has been identified as the key driver behind escalating storage requirements. It is becoming a big challenge to design a resilient and scalable interconnection network which consolidates hundreds or even thousands of storage nodes to satisfy both the bandwidth and storage capacity requirements. This paper proposes a resilient interconnection network for storage cluster systems (RISC). The RISC divides storage nodes into multiple partitions to facilitate data access locality. Multiple spare links between any two storage nodes are employed to offer strong resilience and reduce the impact of failures of links, switches, and storage nodes. Scalability is guaranteed by plugging in additional switches and storage nodes without reconfiguring the overall system. Another salient feature is that the RISC achieves dynamic scalability of resilience by expanding the partition size incrementally with additional storage nodes, each with two associated network interfaces, which increases the resilience degree and balances the workload proportionally. A metric named resilience coefficient is proposed to measure the resilience of the interconnection network. A mathematical model and a corresponding case study are employed to illustrate the practicability and efficiency of the RISC.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Resilience; Interconnection network; Cluster storage; Scalability

1. Introduction

The explosive growth of data generated by information digitization has been identified as the key driver behind escalating storage requirements. There are two major technologies which impact the evolution of storage systems. The first is parallel processing, such as redundant arrays of inexpensive disks (RAID) [1]. The second is the influence of network technology on storage system architecture.

1383-7621/$ - see front matter © 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.sysarc.2007.04.002

* Tel.: +44 (0) 1234 750111x4773. E-mail addresses: y.deng@cranfield.ac.uk, yuhuid@hotmail.com.

Network based storage systems such as network attached storage (NAS) and storage area network (SAN) [2,3] offer a robust and easy method to control and access large amounts of storage resources. However, most modern high performance computing systems demand petabytes or even exabytes of storage capacity and aggregate bandwidth over 100 GB/s (e.g., the data produced by protein folding, global earth system models, high energy physics, etc.). NAS and SAN cannot meet these requirements. Storage systems must make the transition from relatively few high performance storage


engines to thousands of networked commodity-type storage devices [4]. With the ever increasing storage demand, it is a big challenge to design a scalable storage system which consolidates hundreds, even thousands, of storage nodes to satisfy both the bandwidth and storage capacity requirements.

A cluster is a group of loosely coupled nodes that work together so that in many respects they can be viewed as a single powerful node. The evolution of high performance processors and high speed networks is rapidly driving forward the development of clusters [5–8]. Because they are built from commodity-off-the-shelf (COTS) hardware components, clusters are becoming an appealing platform for parallel computing and supercomputing compared with traditional symmetric multi-processor (SMP) and massive parallel processing (MPP) systems. The components of a cluster are commonly connected to each other through fast local area networks.

The parallel algorithms in cluster computing have to communicate with each other frequently. The communication delay between sender and receiver is a system-imposed latency, and the key to an effective and scalable cluster computing system lies in minimizing the delays imposed by the system. The computing nodes normally exchange small, low latency messages. As a result, cluster computing requires a custom network to alleviate or eliminate the communication delay. The research community has been very active in improving various aspects of cluster communication performance during the past decade. Virtual interface architecture (VIA) [9,10] is a user-level memory-mapped communication architecture designed to achieve low latency and high bandwidth across a cluster. Fast messages (FM) [11,12] is a low-level message layer designed to deliver the hardware performance of the underlying network to applications; FM is also designed to provide a high performance layer for the APIs and protocols on top of it. Active Messages [13,14] is intended to serve as a substrate for building libraries that provide higher-level communication abstractions and for generating communication code from a parallel language compiler, rather than for direct use by programmers; it exposes the full hardware performance to higher layers. GM [15,16] is a message based light-weight communication layer for Myrinet. The design objectives of GM include low CPU overhead, portability, low latency, and high bandwidth.

One of the significant advances in cluster networks over the past several years is that it is now practical to connect up to tens of thousands of nodes with networks that have enormous scalable capacity, and in which the communication from one node to any other node has the same cost [17]. Cluster networks offer useful references for building storage interconnection networks, because they can deliver high performance, support mass storage capacity, and scale to very large systems. However, storage systems place different requirements on the interconnection network compared with cluster computing. Generally, cluster storage systems distribute data across multiple storage nodes and employ a parallel file system to boost the aggregate bandwidth by spreading read and write operations across the nodes [18]. The nodes in cluster computing typically adopt small and low latency messages to communicate with each other. Unlike the computing nodes, the storage nodes in a cluster storage system are loosely coupled with each other; there is very little communication between the storage nodes except for data migration, data reorganization, or data backup. Therefore, a cluster storage system requires high bandwidth rather than low latency. Storage interconnection networks, because of their tolerance for higher latency, can exploit commodity technologies such as gigabit Ethernet instead of the custom network hardware used by cluster computing networks [4].

A cluster storage system must provide resilience to guarantee high reliability and availability at a reasonable latency, in addition to high bandwidth, because the data it holds is valuable, possibly impossible to regenerate, and may not be reproducible [19]. Resilience is therefore a very important feature when designing a large-scale cluster storage system. However, few research efforts have been devoted to the resilience of cluster networks. The Beowulf cluster [20] employs a simple single subnet and a two-layer network topology: computational nodes are grouped by the physical cabinet in which they are mounted, each cabinet has its own Ethernet switch, and the cabinets and servers are connected through a server switch. Lustre [18], which builds a cluster file system for 1000 nodes, provides support for heterogeneous networks. It is possible to connect some clients over an Ethernet to the metadata servers (MDS) and object storage target (OST) servers, and others over a QSW network, all in a single installation. LVS [21] consists of two layers: an LVS router layer (one active and one backup) that balances the workload across the real servers, and a pool of real servers that provides the critical services. Some projects employed a failover metadata server or data redundancy mechanisms to provide availability [21,22], but they did not consider the failure of network components.

Fig. 1. Highly available star interconnection network.

Recent efforts have made important strides in designing interconnection networks for large-scale storage systems [4,18]. Hospodor and Miller [4] proposed to integrate 1 GB/s networks and small (4–12 port) switching elements into object based disk drives to construct a petabyte-scale high performance storage system, and discussed how to construct the system with different interconnection architectures. Their results indicate that the hypercube, 4-D and 5-D tori appear to be reasonable design choices for a 4096-node storage system capable of delivering 100 GB/s from one petabyte. However, these methods are not very cost-effective due to the tailored disk drives. Qin Xin et al. [19] found that the hypercube is more robust than a multi-stage butterfly network and a 2D mesh structure, because the fault tolerance of an N-dimensional hypercube network is N − 1. The hypercube was originally proposed as an efficient interconnection for massively parallel processors (MPP), and a large amount of research has been done on solving parallel problems and routing mechanisms on hypercubes [23–26]. However, the hypercube trades increased cost and a complicated routing mechanism for high fault tolerance, because the number of connections at each node grows as the system gets larger. For instance, a three-dimensional hypercube consists of eight nodes with each node connected to three other nodes. A four-dimensional hypercube contains two three-dimensional hypercubes arranged so that each node of one sub-cube is connected to the corresponding node of the other sub-cube; there are now four connections emanating from each node and a total of 16 nodes. The network can be generalized to higher dimensions, but the large number of connections emanating from each node causes engineering problems which limit the network scalability [27].
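To make the scaling contrast concrete, here is a minimal illustrative sketch (Python, not taken from the cited works) that tabulates how the per-node connection count of an N-dimensional hypercube grows with the dimension, whereas a RISC storage node always needs exactly two network interfaces:

```python
def hypercube_stats(dim: int):
    """Return (nodes, links per node, total links) for a dim-dimensional hypercube."""
    nodes = 2 ** dim                      # a dim-dimensional hypercube has 2^dim nodes
    degree = dim                          # each node is connected to dim neighbours
    total_links = dim * 2 ** (dim - 1)    # every link is shared by two nodes
    return nodes, degree, total_links

for dim in (3, 4, 12):
    nodes, degree, links = hypercube_stats(dim)
    # dim = 3: 8 nodes with 3 connections each; dim = 4: 16 nodes with 4 connections each;
    # dim = 12: a 4096-node system needs 12 network ports per storage node.
    print(f"{dim}-D hypercube: {nodes} nodes, {degree} links/node, {links} links in total")
```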

In this paper, based on COTS components, we propose a Resilient Interconnection network for Storage Cluster systems (RISC), drawing on previous experience from the cluster computing community. Since large-scale data intensive applications frequently involve a high degree of data access locality, the RISC divides storage nodes into multiple partitions to facilitate the locality and simplify intra-partition communications. Multiple spare links between any two storage nodes are employed to offer strong resilience and reduce the impact of the failure of links, switches, and storage nodes. Scalability is guaranteed by plugging in additional switches and storage nodes rather than replacing the switches with more expensive switches which have more ports available. Another salient feature is that the RISC achieves dynamic scalability of resilience by expanding the partition size incrementally. A resilience model of the RISC has been constructed to explore the features of the interconnection network, and a case study shows the practicability and efficiency of the RISC.

The remainder of the paper is organized as follows. Traditional storage interconnection networks and the RISC are introduced in Section 2. Section 3 discusses several different failure scenarios and the corresponding solutions of the RISC. A resilience model of the RISC is constructed in Section 4. Section 5 illustrates a case study of the RISC based on the model presented in Section 4. Section 6 concludes the paper with remarks on the main contributions.

2. System overview

2.1. Traditional storage interconnection networks

Traditionally, most cluster storage systems employ a star interconnection network where all storage nodes are connected to a single central switch, and the nodes communicate across the network by passing traffic through the switch [21,28]. The star architecture cannot provide constant availability, because there is only a single link between any two storage nodes in the system, and a failure at the switch brings down the overall system. One way to solve this problem is to use two switches (the second one for failover) in a star interconnection network (see Fig. 1). We name this the Highly Available Star (HAS) in the following discussion. Each storage node is connected to both switches, so there are two possible links between any two nodes to guarantee availability. However, in this architecture, one switch becomes overloaded when the other one fails, because all the data traffic goes through the remaining switch, and failure of both switches destroys the overall system.

Another problem is scalability. With ever increasing storage demands, star and HAS architectures cannot be expanded to a large scale by simply replacing the switches with ones that have more available ports; note that at present gigabit Ethernet switches are available with only up to 48 ports. Most existing storage systems are scaled by replacing the entire storage system in a "forklift upgrade". This method is unacceptable in a system containing petabytes of data because the system is too large [4]. A hierarchical interconnection network is a collection of star networks arranged in a hierarchy. The hierarchical architecture is normally adopted to extend a cluster to a larger scale, and its scalability is easily guaranteed by plugging in additional switches and storage nodes. However, the hierarchical interconnection network has the inherent drawback of a single point of failure: a group of storage nodes could be isolated from the network by a single-point failure of a switch.

Fig. 2. A storage cluster with RISC consisting of nine storage nodes and six switches.

2.2. The RISC interconnection network

Data locality is a measure of how well data can be selected, retrieved, compactly stored, and reused for subsequent accesses [29]. In general, there are two basic types of data locality: temporal and spatial. Temporal locality means that data accessed at one point in time is likely to be accessed again in the near future; it depends on the access pattern of different applications and can therefore change dynamically. Spatial locality means that the probability of accessing a data item is higher if an item near it was just accessed (e.g., prefetching). Unlike temporal locality, spatial locality is inherent in the data managed by a storage system. Spatial locality is relatively more stable and does not depend on applications, but rather on the data organization, which is closely related to the system architecture. Data locality is thus a property of both the access pattern of applications and the data organization of system architectures. Reshaping access patterns can be employed to improve temporal locality [30], while data reorganization is normally adopted to improve spatial locality.

Since large-scale data intensive applications frequently involve a high degree of data access locality [29], many research efforts have been invested in exploring the impact of the access pattern and data organization of applications on data locality [29–31]. Communication latency has long been a challenging problem for the cluster community, and a well designed interconnection network for scalable cluster storage systems should be able to enhance spatial locality to reduce the communication latency. Because different interconnection networks of a cluster storage system can have different impacts on the overall system performance and application circumstances, the RISC takes advantage of its interconnection network to divide the involved storage nodes into multiple partitions to facilitate spatial locality. Each partition is composed of a number of storage nodes and one local switch, and multiple partitions communicate with each other through multiple inter-partition switches. Each node in a partition is connected to two networks: the intra-partition network and the inter-partition network. The RISC enables the cluster storage system to confine the impact of local data access to one partition rather than the overall system, both during regular operations and under fault conditions.

Fig. 2 illustrates the RISC consisting of nine storage nodes and six switches (in the dashed rectangle), where MS is short for metadata server, N denotes a storage node, and S denotes a switch. The major role of metadata is to describe how the data is distributed across the storage nodes and how the data can be accessed in a system. The MS manages all the metadata of the system, which allows applications or data users to search, access, and evaluate the data by providing standardized descriptions of the stored data [18].

Fig. 3. Mapping scheme between the switches and storage nodes of RISC.

The storage node in the RISC is a commercially available PC consisting of processor, memory, motherboard, etc. Each node integrates a RAID subsystem to aggregate storage capacity, I/O performance, and reliability based on data striping and distribution. Unlike the hypercube architecture discussed in Section 1, each storage node in the RISC has two network interfaces which connect the node to the intra-partition network and the inter-partition network, respectively. Compared with the hypercube architecture, this approach significantly decreases the complexity of the routing strategy when the system grows to a large scale. It is very easy to extend the network interfaces of a storage node from one to two by plugging an additional network interface card (NIC) into a PCI slot of the motherboard.

A storage cluster mainly consists of the MS, clients, storage nodes, and the interconnection network [18]. Fig. 2 illustrates a typical storage cluster system using RISC. Please note that this paper mainly focuses on the interconnection network of cluster storage systems; the availability of the MS is beyond the scope of this paper. The nine storage nodes in Fig. 2 are divided into three partitions in terms of data locality. The partitioning depends on the number of hops of network communication, where a hop is defined as a link which an I/O has to traverse. For instance, the three partitions in Fig. 2 could be division1 {P11 = (N10, N11, N12), P12 = (N20, N21, N22), P13 = (N30, N31, N32)} or division2 {P21 = (N10, N20, N30), P22 = (N11, N21, N31), P23 = (N12, N22, N32)}, because the two divisions have the same number of hops. For division1, the storage nodes of partitions P11–P13 reach the corresponding switches S1–S3 in one hop; the same holds for division2, where the storage nodes of partitions P21–P23 reach the corresponding switches S4–S6 in one hop as well. The basic idea of the RISC comes from the set associative cache scheme which groups slots into sets. Drawing inspiration from the cache mapping scheme, we depict the mapping scheme between the nine storage nodes and the six switches in Fig. 3. The nine storage nodes are clustered into three sets in terms of the number of switch ports. The connections between the sets of storage nodes and the switches S1–S3 form a direct mapping, and each set of storage nodes has one connection to each of the switches S4–S6.
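The mapping can be expressed directly in code. The sketch below (Python; the node and switch names follow Fig. 2, while the helper itself is only an illustration) builds the 3 × 3 RISC as an adjacency structure and derives division1 and division2 from the switches each node is attached to:

```python
# Nodes N10..N32 form a 3 x 3 grid: switches S1-S3 connect the rows (direct mapping),
# switches S4-S6 connect the columns, so every node has exactly two links.
NODES = [f"N{row}{col}" for row in (1, 2, 3) for col in (0, 1, 2)]

def switches_of(node: str):
    """Return the (row switch, column switch) pair a node is attached to, e.g. N30 -> (S3, S4)."""
    row, col = int(node[1]), int(node[2])
    return f"S{row}", f"S{col + 4}"

# Adjacency structure of the interconnection network (node <-> switch links only).
LINKS = {}
for node in NODES:
    for switch in switches_of(node):
        LINKS.setdefault(node, set()).add(switch)
        LINKS.setdefault(switch, set()).add(node)

# division1 groups the nodes sharing an S1-S3 switch, division2 those sharing an S4-S6 switch;
# either way every node reaches its intra-partition switch in one hop.
division1 = {s: sorted(n for n in NODES if switches_of(n)[0] == s) for s in ("S1", "S2", "S3")}
division2 = {s: sorted(n for n in NODES if switches_of(n)[1] == s) for s in ("S4", "S5", "S6")}
print(division1)   # {'S1': ['N10', 'N11', 'N12'], 'S2': ['N20', ...], 'S3': ['N30', ...]}
print(division2)   # {'S4': ['N10', 'N20', 'N30'], 'S5': ['N11', ...], 'S6': ['N12', ...]}
```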

By adding more storage nodes and switches without having to reconfigure the overall interconnection network, the RISC guarantees dynamic scalability as the bandwidth and capacity requirements increase, and it continues to deliver improved reliability, higher performance, and better connectivity. For instance, we can add an additional storage node into a specific partition to extend the partition size, and a new partition consisting of multiple storage nodes can also be added into the RISC to expand the scale of the overall system. The high resilience of the RISC is offered by multiple spare links, which will be detailed in Section 4. The same factors employed to increase the resilience also result in a performance improvement: removing the centralized switch with many ports eliminates a potential performance bottleneck, and applying load balancing techniques to the multiple spare links provides consistent performance for the overall system. Compared with the HAS architecture, the RISC does not require more expensive switches with more ports for the spare links. It is easy to verify that the interconnection network illustrated in Fig. 2 requires 18 switch ports and 18 physical links, the same numbers as the HAS, but the RISC needs six cheaper three-port switches instead of two expensive nine-port switches.
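A short sketch (Python, illustrative only) generalizes this port and link count to m partitions of n nodes and reproduces the figures quoted for Fig. 2:

```python
def interconnect_cost(m: int, n: int):
    """Switches, ports and links for a RISC with m partitions of n nodes
    (m intra-partition switches with n ports, n inter-partition switches with m ports),
    compared with a HAS built from two (m*n)-port switches."""
    nodes = m * n
    risc = {"switches": m + n, "ports": 2 * m * n, "links": 2 * nodes}
    has = {"switches": 2, "ports": 2 * nodes, "links": 2 * nodes}
    return risc, has

print(interconnect_cost(3, 3))
# ({'switches': 6, 'ports': 18, 'links': 18}, {'switches': 2, 'ports': 18, 'links': 18})
# Same port and link totals, but the RISC uses six small 3-port switches
# instead of two 9-port switches.
```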

The RISC is devised for scalable cluster storage systems which can be expanded to a large scale while still providing high resilience, high scalability, and enhanced data access locality. Due to the virtual two-tiered interconnection, two-tiered parallelism and scalability of the intra-partition and inter-partition networks are achieved. Another feature is that the RISC is based on COTS components, making it much more cost-effective than a custom cluster network. Table 1 summarizes the characteristics of the different interconnection networks of cluster storage systems described in this section. We discuss the routing mechanism further in the next section.

Table 1
Characteristics of different interconnection networks of cluster storage systems

Interconnection network | Resilience               | Scalability | Spatial locality | Complexity of routing mechanism
Star                    | No                       | Low         | High             | Low
HAS                     | Middle                   | Low         | High             | Low
Hierarchy               | Single point of failure  | High        | Could be high    | Low
Hypercube               | Very high                | Low         | Could be high    | Very high
RISC                    | High                     | High        | High             | Middle

3. Failure analysis and routing mechanism

3.1. Failure scenarios

The RISC is designed for scalable cluster storage systems. Such a system could be expanded to thousands of storage nodes to satisfy both the bandwidth and storage capacity requirements imposed by data intensive applications. A storage system with multiple petabytes of data typically consists of over 4000 storage devices, thousands of connection nodes, and tens of thousands of network links [19]. Although a single component is fairly reliable, with a large number of components, including storage nodes, switches, links, etc., the aggregate rate of component failures can be very high. Zimmermann et al. [32] illustrated that if the mean time to failure (MTTF) of a single disk drive is of the order of 1,000,000 h, the expected time until some disk drive fails in a large storage system consisting of 1000 disk drives is of the order of 1000 h.
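The quoted scaling is the standard back-of-the-envelope estimate for the time to the first failure among n components, under the simplifying assumption of independent, exponentially distributed lifetimes (an assumption made here for illustration, not taken from [32]):

MTTF_{system} \approx \frac{MTTF_{component}}{n}, \qquad \text{e.g.} \quad \frac{1{,}000{,}000 \text{ h}}{1000 \text{ drives}} = 1000 \text{ h}.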

Because the data is invaluable, a failure in a large-scale cluster storage system can be fatal, and data loss can have a significant economic impact. Designing a scalable cluster storage system with high resilience has long been a challenging problem. To address this problem, we have to investigate the types of failure in a large-scale cluster storage system. In reality, there are many possible failures, such as failures of the network control system, power failures, software failures, etc. This paper mainly concentrates on the failure of the components which affect the interconnection network. There are three such types of failure scenarios: the link failure, the connection node failure (i.e., the switches in Fig. 2), and the storage node failure.

The link failure indicates that the connection between a pair of nodes (including storage nodes and connection nodes) gets interrupted, which causes all traffic between the pair of nodes to be totally disconnected. A resilient interconnection network must be able to provide multiple spare links between any pair of nodes to guarantee data availability and load balance. The connection node failure means the failure of the switches in Fig. 2; it can be caused by power failure, fire, etc. Compared with the link failure, the failure of a connection node is more serious since a number of attached links are simultaneously broken, and it can totally isolate the storage nodes that are connected through the connection node. Therefore, for a resilient interconnection network, it is very important to maintain high availability of the connection nodes. Because a storage node failure can directly incur data loss, many data redundancy mechanisms have been devised to protect against data loss in cluster storage systems [22]; these redundancy mechanisms are normally managed by metadata servers.

In our example, there are nine storage nodes labelled from N10 to N32 and six switches labelled from S1 to S6 (see Fig. 2). Applications can access the RISC through the six switches. Assume that a data intensive application accesses the data residing in the system through the switch S4 from the MS, and the specified I/O stream is sent from the MS to the storage node N30. We track the path of this I/O stream in various scenarios. In the initial state without any failure, the I/O request can simply be transferred through S4 to the target storage node N30. In the following analysis, we employ division2 discussed in Section 2.2 to divide the RISC into three partitions. We will analyze the three failure scenarios in two cases: the intra-partition failure and the inter-partition failure. A rerouting mechanism is introduced when a specific failure occurs.

Intra-partition failure. In terms of division2, the storage nodes (N10, N20, N30) and the switch S4 are within one partition. If the link between the switch S4 and the target storage node N30 is broken, the I/O stream can take the path (S5 → N31 → S3 → N30), or the path (S6 → N32 → S3 → N30), or the path (S5 → N11 → S1 → N12 → S6 → N32 → S3 → N30), etc. There are many possible alternate paths, but normally the shortest one is preferred. The choice of a new path is determined by the rerouting mechanism and the system situation at that moment. In the RISC, each storage node plays the role of a router which can forward packets when necessary. If a failure happens at the switch S4, the I/O stream will travel the same paths, determined by the rerouting mechanism, as when the link between the switch S4 and the node N30 is broken. If the target storage node N30 fails, the I/O stream will be detoured to the node where the redundant data is stored, according to the employed redundancy mechanism.

Inter-partition failure. Some inter-partition failures of switches and links will not cause any problems as long as the intra-partition components work well. Assume that one I/O stream goes from the MS to N30. Even if the switch S3 and the link between the target storage node N30 and the switch S3 both fail, the overall system keeps running. But if, in addition, any intra-partition component fails (e.g., the switch S4 or the link between S4 and N30), the storage node N30 will be isolated from the overall system, which results in the loss of the I/O stream. This problem can be solved by redirecting the I/O stream to other storage nodes where the redundant data resides. If the three paths (N30 → S3), (N10 → S1), (N20 → S2) are broken simultaneously (including links and switches), even though this is an unlikely event, the partition (N10, N20, N30) will be isolated from the cluster storage system. Fortunately, the RISC dynamically increases the number of spare paths which connect one partition to another as the partition size expands.
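The rerouting behaviour described above can be illustrated on the Fig. 2 topology with a small sketch (Python). The breadth-first search below is only an illustrative stand-in for the actual rerouting mechanism of Section 3.2, and only node-to-switch links are modelled (the MS itself is omitted):

```python
from collections import deque

def neighbours(element: str):
    """Adjacency of the Fig. 2 topology: rows N1x/N2x/N3x attach to S1/S2/S3,
    columns Nx0/Nx1/Nx2 attach to S4/S5/S6."""
    links = {}
    for row in (1, 2, 3):
        for col in (0, 1, 2):
            node = f"N{row}{col}"
            for switch in (f"S{row}", f"S{col + 4}"):
                links.setdefault(node, set()).add(switch)
                links.setdefault(switch, set()).add(node)
    return links.get(element, set())

def reroute(src: str, dst: str, failed):
    """Shortest surviving path from src to dst; 'failed' holds dead nodes, switches,
    or links written as 'N30-S4'. Returns None if dst has become unreachable."""
    def link_dead(a, b):
        return f"{a}-{b}" in failed or f"{b}-{a}" in failed
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbours(path[-1]):
            if nxt in failed or nxt in seen or link_dead(path[-1], nxt):
                continue
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

# Intra-partition failure: the N30-S4 link is down, but traffic entering at S5 still reaches N30.
print(reroute("S5", "N30", failed={"N30-S4"}))                       # ['S5', 'N31', 'S3', 'N30']
# Breaking all three inter-partition uplinks of partition (N10, N20, N30) isolates it
# from the other partitions, as noted above.
print(reroute("S5", "N30", failed={"N30-S3", "N10-S1", "N20-S2"}))   # None
```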

In the event of one component failure or multiple component failures in the RISC, there are many possible alternate paths to connect any two nodes. This makes it easy to replace the failed components with minimal impact on the overall system availability or downtime. The resilience degree keeps pace with the continuous growth of the partition size, due to the increased spare links associated with the added storage nodes. The RISC is more resilient and less susceptible to failures than the traditional interconnection networks illustrated in Table 1.

3.2. Data access and the routing mechanism

The interaction between the applications and the storage nodes of the RISC is mediated through the MS. When an application requests data stored in the RISC, it first contacts the MS to get metadata information such as the address of the required data; it then uses the address information to access the corresponding storage nodes and retrieve the data.

Hospodor and Miller [4] connected external clients through routers placed at regular locations within cube-style networks (meshes, tori, and hypercubes). The arrangement resulted in very poor performance due to the congestion near the router nodes. To avoid such a centralized routing mechanism, we decentralize the routing functions across all storage nodes in the RISC. Routing is the process of building up the forwarding tables that are used to choose paths for delivering packets. Each storage node in the RISC plays the role of a router which can forward I/O traffic when necessary. But under normal circumstances, the I/O traffic does not go through other storage nodes, because each node in the system has a direct connection to the MS through the inter-partition network.

In the event of a failure of the link which connects a storage node to the inter-partition network, the traffic emanating from the node has to be forwarded by the remaining nodes in the same partition which have links to the inter-partition network. For instance, if a data set is stored across the three storage nodes N30–N32 in partition P13, the application contacts the MS to retrieve the metadata first. According to the metadata, the application establishes three direct connections with the three storage nodes N30–N32 through the three inter-partition switches S4–S6 to transfer data, respectively. This procedure does not involve any node acting as a router. But if S4 or the link between S4 and N30 is broken, the I/O traffic emanating from N30 has to be forwarded by N31 and N32 to S5 and S6, respectively. More complicated failure scenarios have been discussed in Section 3.1. The implementation of the routing tables in the storage nodes is straightforward. Resilient overlay network (RON) [26] is an architecture that allows distributed Internet applications to detect and recover from path outages. RON's design supports routing through multiple intermediate nodes; however, Andersen et al. [26] discovered that using at most one intermediate RON node is sufficient most of the time. Therefore, it is feasible for each storage node to maintain a local routing table with a limited number of alternate paths, even when the RISC grows to a large scale.
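As an illustration of the small amount of per-node routing state this implies (the field names and structure below are assumptions for the sketch, not the paper's implementation), a node such as N30 only needs its default inter-partition switch plus its intra-partition neighbours as alternates:

```python
from dataclasses import dataclass, field

@dataclass
class RoutingTable:
    """Per-node forwarding state: a primary next hop plus a few intra-partition alternates."""
    primary: str                                        # the node's own inter-partition switch
    alternates: list = field(default_factory=list)      # partition neighbours, tried in order
    failed: set = field(default_factory=set)            # hops currently reported as failed

    def next_hop(self) -> str:
        if self.primary not in self.failed:
            return self.primary
        for hop in self.alternates:
            if hop not in self.failed:
                return hop
        raise RuntimeError("no usable path; fall back to redundant data on other nodes")

# Node N30 of partition P13 = (N30, N31, N32) in Fig. 2:
n30 = RoutingTable(primary="S4", alternates=["N31", "N32"])
print(n30.next_hop())     # 'S4' while everything is healthy
n30.failed.add("S4")      # S4 (or the N30-S4 link) is reported down
print(n30.next_hop())     # 'N31': the traffic is forwarded by a partition neighbour
```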

The key point of a resilient routing mechanism in the RISC is how to identify a failure and update the routing tables. A lot of research effort has been invested in designing fault-tolerant routing algorithms [19,24–26]. In contrast to these works, our method is based on the features of a typical cluster storage system. Because the overall information of a cluster storage system is normally managed by metadata servers, and the size of the system is relatively small compared to the Internet scale, the metadata servers are good candidates to monitor the failures of the whole system.

Each storage node in the RISC sends short messages to the MS at regular intervals. If a message is not acknowledged by the MS within a particular period, the path between the storage node and the MS is assumed to have failed, and the node will choose an alternate path from its local routing table to deliver the I/O traffic. This periodic heartbeat diffusion between the MS and the storage nodes is adopted to monitor failures in the RISC. A reasonably low frequency of heartbeat diffusion is feasible, because excessive failure detection would increase the maintenance cost significantly and is unnecessary. By tuning the diffusion period, the time to recover from a failure can be balanced against the bandwidth overhead of the periodic message transmissions.
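A minimal sketch (Python) of such a heartbeat-based failure detector; the period, timeout, and class names are illustrative assumptions rather than the paper's implementation:

```python
import time

HEARTBEAT_PERIOD = 5.0                 # seconds between heartbeats (the tunable diffusion period)
ACK_TIMEOUT = 2 * HEARTBEAT_PERIOD     # no ack for this long => the path is assumed failed

class HeartbeatMonitor:
    """Tracks, per storage node, when the MS last acknowledged a heartbeat."""
    def __init__(self):
        self.last_ack = {}

    def record_ack(self, node: str) -> None:
        self.last_ack[node] = time.monotonic()

    def failed_paths(self):
        now = time.monotonic()
        return [node for node, t in self.last_ack.items() if now - t > ACK_TIMEOUT]

# Each storage node sends a short message every HEARTBEAT_PERIOD seconds; when a path shows up
# in failed_paths(), the node switches to an alternate path from its local routing table.
# A longer period lowers the monitoring overhead but slows down failure recovery.
monitor = HeartbeatMonitor()
monitor.record_ack("N30")
print(monitor.failed_paths())          # [] while the acknowledgements are fresh
```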

4. Resilience model of the RISC

The most important aspect of a resilient interconnection network for cluster storage systems is that an I/O stream can be delivered to its target storage nodes successfully in any failure scenario. We define a resilience coefficient as a metric to measure the resilience.

Definition. The resilience coefficient is the ratio of the maximum number of data paths to the minimum number of data paths between any two storage nodes.

For simplicity, we assume that there is a cluster storage system consisting of m partitions, where partition k has N_k (1 ≤ k ≤ m) storage nodes. The numbers of storage nodes can be expressed by the following formula:

N = (N_1 \; N_2 \; \cdots \; N_m)    (1)

It is required to have at least one path between any two storage nodes to keep the RISC based cluster storage system running. Thus, there should be at least one link between any two partitions. It is easy to calculate the minimum number of data paths as follows:

Path_{minimum} = \frac{1}{2} \sum_{i=1}^{m} \left( N_i \sum_{j=1, j \neq i}^{m} N_j \right)    (2)

We assume that the RISC employs a full-duplex mode, which refers to the transmission of data in two directions simultaneously. Since we are counting paths, the coefficient \frac{1}{2} indicates that a full-duplex transmission is counted as one path.

The RISC adopts multiple spare links between any two partitions to guarantee high resilience. The numbers of links between partitions form an m × m connection matrix C as follows:

C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mm} \end{pmatrix}    (3)

where c_{ij} (1 ≤ i ≤ m, 1 ≤ j ≤ m) denotes the number of direct links between the ith and the jth partition; c_{ij} = 0 indicates that there is no direct link between the ith and the jth partition.

Since there are m partitions, the nodes in an originating partition can communicate with the nodes in a target partition through zero up to m − 2 intermediate partitions. The different scenarios are discussed below.

First, we assume that the ith and the jth partitions communicate with each other directly (i.e., no intermediate partitions). We have the number of data paths:

Path_{0\text{-}ij} = N_i \cdot c_{ij} \cdot N_j    (4)

In this scenario, the maximum number of data paths of the overall system is expressed as follows:

Path_0 = \frac{1}{2} \sum_{i=1}^{m} \left( N_i \sum_{j=1, j \neq i}^{m} c_{ij} \cdot N_j \right)    (5)

Second, if the data transmission between the ith partition and the jth partition is mediated through one intermediate partition labelled l, then, because the storage nodes in the lth partition can be configured to route data packets, the number of data paths can be calculated using the following formula:

Path_{1\text{-}ij} = N_i \cdot c_{il} \cdot N_l \cdot c_{lj} \cdot N_j    (6)

The maximum number of data paths of the overall system is calculated as follows:

Path_1 = \frac{1}{2} \sum_{i=1}^{m} \left( N_i \sum_{l=1, l \neq i}^{m} \left( (c_{il} \cdot N_l) \sum_{j=1, j \neq i, j \neq l}^{m} (c_{lj} \cdot N_j) \right) \right)    (7)

By analogy, the data transmission between the ith partition and the jth partition could be mediated through 2, 3, ..., m − 2 partitions, respectively. We have:

Path_{m-2} = \frac{1}{2} P_m^m \left( \prod_{i=1}^{m-1} (N_i \cdot c_{i(i+1)}) \right) \cdot N_m    (8)

where P_m^m is the number of full permutations of the m partitions. The maximum number of data paths of the whole system can be expressed as follows:

Path_{maximum} = \sum_{i=0}^{m-2} Path_i    (9)

According to the definition, the resilience coefficient R is calculated using the following formula:

R = Path_{maximum} / Path_{minimum}    (10)
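Eqs. (2), (5), (7), and (10) translate directly into code. The sketch below (Python, an illustration rather than the authors' implementation) covers paths with at most one intermediate partition, which is all that Eq. (9) requires for the three-partition configurations examined in Section 5; extending it to the longer chains of Eq. (8) follows the same pattern:

```python
def path_minimum(sizes):
    """Eq. (2): half of the sum of N_i * N_j over all ordered pairs of distinct partitions."""
    return sum(n_i * n_j for i, n_i in enumerate(sizes)
               for j, n_j in enumerate(sizes) if j != i) // 2

def path_direct(sizes, c):
    """Eq. (5): paths that use a direct inter-partition link, N_i * c_ij * N_j."""
    m = len(sizes)
    return sum(sizes[i] * c[i][j] * sizes[j]
               for i in range(m) for j in range(m) if j != i) // 2

def path_one_hop(sizes, c):
    """Eq. (7): paths relayed through exactly one intermediate partition l."""
    m = len(sizes)
    return sum(sizes[i] * c[i][l] * sizes[l] * c[l][j] * sizes[j]
               for i in range(m) for l in range(m) if l != i
               for j in range(m) if j != i and j != l) // 2

def resilience_coefficient(sizes, c):
    """Eq. (10), with Path_maximum truncated after Path_1 (exact when m = 3)."""
    return (path_direct(sizes, c) + path_one_hop(sizes, c)) / path_minimum(sizes)
```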

5. Case study

A case study is a research method which offers a systematic way of investigating events, collecting data, analyzing information, reporting the results, validating hypotheses, etc. As a case study, we adopt the resilience coefficient as a metric to examine the resilience degree, the failure scenarios, and the dynamic scalability of the 3 × 3 RISC illustrated in Fig. 2.

5.1. Resilience degree

Let us consider the simple RISC consisting of nine storage nodes illustrated in Fig. 2. According to Eqs. (1)–(3), we can easily calculate the storage node numbers N = (3 \; 3 \; 3), the minimum number of data paths Path_{minimum} = 3 · 3 + 3 · 3 + 3 · 3 = 27, and the connection matrix

C = \begin{pmatrix} 1 & 3 & 3 \\ 3 & 1 & 3 \\ 3 & 3 & 1 \end{pmatrix},

respectively. The maximum number of data paths between any two storage nodes is illustrated in Table 2 in terms of Eqs. (5) and (7).

Table 2
The maximum number of data paths of a RISC with nine storage nodes divided into three partitions (see Fig. 2)

                                 | Path            | Path number               | Total path number
Direct communication             | N1 → N2         | 3 · 3 · 3 = 27            | 81
                                 | N1 → N3         | 3 · 3 · 3 = 27            |
                                 | N2 → N3         | 3 · 3 · 3 = 27            |
Mediated through one partition   | N1 → N3 → N2    | 3 · 3 · 3 · 3 · 3 = 243   | 729
                                 | N1 → N2 → N3    | 3 · 3 · 3 · 3 · 3 = 243   |
                                 | N2 → N1 → N3    | 3 · 3 · 3 · 3 · 3 = 243   |

Therefore, in the light of Eq. (10), the resilience coefficient R is (81 + 729)/27 = 30. The resilience coefficient R = 30 shows that the RISC is much more resilient than the HAS architecture, which has a resilience coefficient of 2, while requiring the same number of switch ports and physical links.
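Using the sketch given after Eq. (10) (an illustration, not the authors' code), these numbers can be reproduced directly:

```python
sizes = (3, 3, 3)
C = [[1, 3, 3],
     [3, 1, 3],
     [3, 3, 1]]
print(path_minimum(sizes))               # 27
print(path_direct(sizes, C))             # 81   (Table 2, direct communication)
print(path_one_hop(sizes, C))            # 729  (Table 2, one intermediate partition)
print(resilience_coefficient(sizes, C))  # 30.0
```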

5.2. Failure scenarios

The storage nodes (N12, N22, N32) and the switch S6 are within one partition according to division2 discussed in Section 2.2. We assume that the switch S6 or the links {(N12 → S6), (N22 → S6), (N32 → S6)} fail simultaneously. The connection matrix is

C = \begin{pmatrix} 1 & 2 & 2 \\ 2 & 1 & 2 \\ 2 & 2 & 1 \end{pmatrix}.

According to Eqs. (5) and (7), the numbers of data paths between any two storage nodes are Path_0 = (3 · 2 · 3) · 3 = 54 and Path_1 = (3 · 2 · 3 · 2 · 3) · 3 = 324, respectively. We have the resilience coefficient R = (54 + 324)/27 = 14.

In the most serious scenario, we assume that the switches S5 and S6 crash simultaneously, or that the links {(N12 → S6), (N22 → S6), (N32 → S6)} and {(N11 → S5), (N21 → S5), (N31 → S5)} fail at the same time. The connection matrix is computed as

C = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}.

We have Path_0 = (3 · 1 · 3) · 3 = 27 and Path_1 = (3 · 1 · 3 · 1 · 3) · 3 = 81. The resilience coefficient R is (27 + 81)/27 = 4, which indicates that the RISC running in this failure mode is still more resilient than the traditional HAS architecture.

5.3. Dynamic scalability

A distinct feature of the RISC is that it achieves dynamic scalability of resilience by expanding the partition size incrementally with additional storage nodes, each with two associated network interfaces. Suppose we keep the three partitions depicted in Fig. 2, but increase the partition size from three nodes to four nodes and construct the RISC with three 4-port switches for the intra-partition network and four 3-port switches for the inter-partition network. Based on the same calculation, we have N = (4 \; 4 \; 4), the minimum number of data paths Path_{minimum} = 4 · 4 + 4 · 4 + 4 · 4 = 48, and the connection matrix

C = \begin{pmatrix} 1 & 4 & 4 \\ 4 & 1 & 4 \\ 4 & 4 & 1 \end{pmatrix},

respectively. It is easy to compute Path_0 = (4 · 4 · 4) · 3 = 192 and Path_1 = (4 · 4 · 4 · 4 · 4) · 3 = 3072 in terms of Eqs. (5) and (7). Therefore, the resilience coefficient R is (192 + 3072)/48 = 68.

The resilience coefficient of this configuration is increased by a factor of 2.27 compared with that of the configuration which has three partitions each consisting of three storage nodes. This indicates that increasing the partition size yields a significant improvement in resilience.
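The failure and expansion scenarios can be checked in the same way, again assuming the helper functions sketched after Eq. (10):

```python
# Section 5.2, most serious scenario: only one inter-partition link left between any two partitions.
print(resilience_coefficient((3, 3, 3), [[1, 1, 1], [1, 1, 1], [1, 1, 1]]))   # 4.0
# Section 5.3: four nodes per partition and four inter-partition switches.
r_small = resilience_coefficient((3, 3, 3), [[1, 3, 3], [3, 1, 3], [3, 3, 1]])
r_large = resilience_coefficient((4, 4, 4), [[1, 4, 4], [4, 1, 4], [4, 4, 1]])
print(r_large, round(r_large / r_small, 2))   # 68.0 2.27
```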

Another important feature of the RISC is dynamic load balancing. Because each storage node in the RISC plays the role of a router which forwards I/O packets when necessary, a specific failure does increase the workload of the storage nodes which forward the I/O traffic. But the workload is shared by multiple nodes, and the I/O traffic taken over by each node decreases as additional storage nodes are added to the partition. For instance, suppose that the required data resides in node N30 (see Fig. 2). If the switch S4 or the link between the node N30 and S4 fails, the I/O traffic emanating from N30 has to go through the nodes N31 and N32, which increases their workload, but the two nodes share the traffic. If we put one more storage node N34 into the partition, three nodes will take over the I/O traffic. This feature strikes a good balance between load balancing and the dynamic scalability of resilience.

6. Conclusions

In this paper, we proposed a resilient interconnection network for cluster storage systems named RISC, which takes a significant step towards a resilient and scalable cluster storage system by dividing the storage nodes into multiple partitions and providing multiple spare links between any pair of storage nodes. The RISC enhances data access locality by partitioning correlated storage nodes together, which can reduce the communication latency by limiting most of the messages within a local partition. To expand the cluster storage system, we can add more nodes and more small switches without having to reconfigure the whole architecture. Even though multiple spare links are employed to provide high resilience, the RISC requires the same number of switch ports and links as a HAS architecture. The resilience coefficient is proposed as a metric to measure the system's resilience degree. The case study provides useful insights into the behaviors of the RISC.

Acknowledgements

The author would like to thank the anonymous reviewers for their useful comments and feedback, which helped us refine our thoughts about the RISC. Their suggestions are very helpful for our future work.

References

[1] D. Patterson, G. Gibson, R. Katz, A case for redundant arrays of inexpensive disks (RAID), in: Proceedings of the ACM Conference on Management of Data, 1988, pp. 109–116.
[2] G.A. Gibson, R. Van Meter, Network attached storage architecture, Communications of the ACM 43 (11) (2000) 37–45.
[3] M. Mesnier, G.R. Ganger, E. Riedel, Object-based storage, IEEE Communications Magazine 41 (8) (2003) 84–90.
[4] A. Hospodor, E.L. Miller, Interconnection architectures for petabyte-scale high-performance storage systems, in: Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004, pp. 273–281.
[5] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, W. Su, Myrinet: a gigabit per second local area network, IEEE Micro 15 (1) (1995) 29–36.
[6] D. Garcia, W. Watson, ServerNet II, in: Proceedings of Parallel Computer Routing and Communication (PCRCW'97), 1997, pp. 119–136.
[7] D.B. Gustavson, Q. Li, The scalable coherent interface (SCI), IEEE Communications Magazine 34 (8) (1996) 52–63.
[8] S. Oral, A.D. George, A user-level multicast performance comparison of scalable coherent interface and Myrinet interconnects, in: Proceedings of the 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp. 518–527.
[9] Virtual Interface Architecture Specification 1.0, 1997. http://rimonbarr.com/repository/cs614/san_10.pdf.
[10] G. Amerson, A. Apon, Implementation and design analysis of a network messaging module using virtual interface architecture, in: Proceedings of the 2004 IEEE International Conference on Cluster Computing, 2004, pp. 255–265.
[11] S. Pakin, V. Karamcheti, A. Chien, Fast messages: efficient, portable communication for workstation clusters and massively parallel processors, IEEE Concurrency 5 (2) (1997) 60–73.
[12] M. Lauria, S. Pakin, A.A. Chien, Efficient layering for high speed communication: fast messages 2.x, in: Proceedings of the 7th International Symposium on High Performance Distributed Computing, 1998, pp. 10–20.
[13] T. von Eicken, D.E. Culler, S.C. Goldstein, K.E. Schauser, Active messages: a mechanism for integrated communication and computation, in: Proceedings of the 19th ISCA, 1992, pp. 256–266.
[14] T. von Eicken, D.E. Culler, K.E. Schauser, S.C. Goldstein, Retrospective: active messages: a mechanism for integrating computation and communication, in: Proceedings of 25 Years of the International Symposia on Computer Architecture, 1998, pp. 83–84.
[15] L.-Q. Lee, A. Lumsdaine, The generic message passing framework, in: Proceedings of the International Parallel and Distributed Processing Symposium, 2003, pp. 0–10.
[16] Generic Messages Documentation, http://www.myri.com/GM/doc/gm_toc.html.
[17] C.L. Seitz, Recent advances in cluster networks, in: Proceedings of the 2001 IEEE International Conference on Cluster Computing, 2001, p. 365.
[18] Lustre, http://www.lustre.org/.
[19] Q. Xin, E.L. Miller, T.J.E. Schwarz, D.D.E. Long, Impact of failure on interconnection networks for large storage systems, in: Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005, pp. 189–196.
[20] Beowulf, http://beowulf.cheme.cmu.edu/hardware/network.html.
[21] Linux Virtual Server, http://www.linuxvirtualserver.org/.
[22] K. Hwang, H. Jin, R.S.C. Ho, Orthogonal striping and mirroring in distributed RAID for I/O-centric cluster computing, IEEE Transactions on Parallel and Distributed Systems 13 (1) (2002) 26–44.
[23] Y. Saad, M.H. Schultz, Topological properties of hypercubes, IEEE Transactions on Computers 37 (7) (1988) 867–872.
[24] M.A. Sridhar, C.S. Raghavendra, Fault-tolerant networks based on the de Bruijn graph, IEEE Transactions on Computers 40 (10) (1991) 1167–1174.
[25] P.T. Gaughan, S. Yalamanchili, A family of fault-tolerant routing protocols for direct multiprocessor networks, IEEE Transactions on Parallel and Distributed Systems 5 (6) (1995) 482–487.
[26] D. Andersen, H. Balakrishnan, F. Kaashoek, R. Morris, Resilient overlay networks, in: Proceedings of the 18th ACM SOSP, 2001, pp. 131–145.
[27] A.J.G. Hey, High performance computing - past, present and future, Computing & Control Engineering Journal 8 (1) (1997) 33–42.
[28] W. Katsurashima, S. Yamakawa, et al., NAS switch: a novel CIFS server virtualization, in: Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), 2003, pp. 82–86.
[29] F.W. Jansen, E. Reinhard, Data locality in parallel rendering, in: Proceedings of the Second Eurographics Workshop on Parallel Graphics and Visualisation, 1998, pp. 1–15.
[30] A.J.C. Bik, Reshaping access patterns for improving data locality, in: Proceedings of the Sixth Workshop on Compilers for Parallel Computers, 1996, pp. 229–310.
[31] M.E. Gómez, V. Santonja, Characterizing temporal locality in I/O workload, in: Proceedings of the 2002 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
[32] R. Zimmermann, S. Ghandeharizadeh, Highly available and heterogeneous continuous media storage systems, IEEE Transactions on Multimedia 6 (6) (2004) 886–896.

Dr. Yuhui Deng received his PhD degree in computer architecture from Huazhong University of Science and Technology in July 2004. He was involved in several National Nature Science Foundation projects in the National Key Laboratory of Data Storage System of P.R. China when he was a PhD candidate. He joined Cranfield University in February 2005. He is now a research officer at the Centre for Grid Computing, Cranfield University. Presently he is working on a project named grid oriented storage (GOS) funded by EPSRC/DTI. His research interests cover computer architecture, network storage, cluster storage, parallel and distributed computing, etc.