
Scality, Simply Scaling Storage

White Paper March 2012


Contents

Summary
1  Introduction
2  Stakes and challenges
3  What is object storage?
4  Scality RING Organic Storage
   4.1  Philosophy
   4.2  Architecture and components
   4.3  Performance
   4.4  Data Protection
   4.5  Consistency model
   4.6  Scalability
   4.7  Management and Support
   4.8  Partners and ecosystem
5  Conclusion: Scality, a new vision of storage
References
Glossary


Summary

Scality’s RING Organic Storage is a market-proven, cost-effective software solution for the storage of unstructured data at petabyte scale. Scality RING is based on patented object storage technology, which delivers high availability, ease of operations and total control of your data. It is capable of handling billions of files, without the hassle of volume management or complex backup procedures. The organic design creates a scale-out system with distributed intelligence that has no single point of failure. As a result, RING Organic Storage is resilient and self-healing, and technology refreshes do not require any data migration or downtime. Thanks to its parallel design, Scality RING delivers very high performance for file read and write operations.

1 Introduction

Using only off-the-shelf components, Scality RING provides reliable, enterprise-grade mechanisms for data protection and continuity of service. Thanks to an intelligent mix of replication and erasure coding technologies, data durability* beyond twelve nines (99.9999999999%) can be reached.

The RING Organic Storage solution uses a distributed, decentralized and geo-redundant peer-to-peer architecture where data is evenly spread among all participating storage servers. The system is an aggregation of independent, loosely coupled servers in a “shared nothing” model, logically unified in a “ring” to provide unique linear scalability, cost efficiency and data protection. The Ring also provides exceptional fault tolerance, protecting against all types of outages (disk, server, silent data corruption, network, power…). It ensures high availability thanks to intelligent data placement within one site or across multiple sites. It accomplishes this without the use of a central database, which would otherwise conflict with the Ring philosophy of eliminating all potential single points of failure.

The essence of the Ring’s performance is its massively parallel architecture, fully utilizing all (potentially heterogeneous) storage servers in order to sustain very high aggregate data transfer rates and IOPS levels. It is designed to scale linearly to thousands of storage servers. Scality has developed a unique approach, primed for the exponential growth in storage demand. The Ring allows you to grow in step with your business needs without having to worry about costly and complex hardware refresh cycles.

* Data Durability: According to the Amazon web site, data durability is defined as follows: “durability (with respect to an object stored in S3) is defined as the probability that the object will remain intact and accessible after a period of one year. 100% durability would mean that there's no possible way for the object to be lost, 90% durability would mean that there's a 1-in-10 chance, and so forth.”
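To make the twelve-nines figure concrete, the following back-of-the-envelope sketch (not a Scality-published calculation; the object count is purely illustrative) converts an annual durability probability into an expected number of lost objects per year.

    # Back-of-the-envelope: expected annual object loss at a given durability level.
    durability = 0.999999999999          # twelve nines, per object, per year
    p_loss = 1.0 - durability            # probability that a given object is lost within a year
    stored_objects = 1_000_000_000       # one billion objects (illustrative figure)

    expected_losses = stored_objects * p_loss
    print(f"Expected objects lost per year: {expected_losses:.3f}")
    # ~0.001, i.e. roughly one lost object per thousand years at this scale.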


Compared to other commercial storage systems, the Ring cuts the TCO by 50%*. This is made possible by the use of commodity servers, simplified operations and a flexible pricing model.

2 Stakes and challenges

Increasing data challenges have prompted a proliferation of innovative technological responses, but until recently, these have only been point solutions. This has created islands of storage infrastructures of all types, with a mix of storage arrays (SAN†), file servers (NAS‡), and backup or archive appliances. This explosion of complexity is a source of cost due to low utilization rates, separate management integrations, hard-to-measure SLAs and the absence of upgrade paths.

Figure 1: Enterprise reality: islands of storage with a mix of vendors, models, technologies and versions

The natural response of corporate decision makers is to reduce this complexity by focusing solely on standard or market-proven solutions. Among these solutions, a move to SAN-based storage was once thought to provide the best value, in particular for high performance requirements. However, SAN has historically been based on expensive Fibre Channel connectivity; more recently, the industry has responded with an interesting evolution that combines SAN and Ethernet through the iSCSI protocol.

* See www.scality.com/for-capacity-larger-than-100tb-private-cloud-storage-is-significantly-cheaper-than-public-cloud-storage/
† SAN: Introduced in the mid-nineties, Storage Area Networks allow connections between servers and storage units via an intelligent network, historically based on Fibre Channel and more recently on SCSI over IP, also known as iSCSI.
‡ NAS: Network Attached Storage is essentially a file server, most of the time presented as a dedicated appliance.


For a few decades, RAID has been considered the standard mechanism for disk-based data protection. As RAID groups have become denser, it has become necessary to introduce disk groups with double parity, namely RAID 6, to limit exposure to data loss. Although this protection level has an attractive disk overhead ratio, good durability and thus a low probability of data loss, its availability and performance in the event of disk loss remain the main concern, especially with large configurations.

In addition, the use of a local or network file system presents some advantages, such as the ability to allocate, name and classify information, but it also introduces some real limits. File system characteristics often impose various limits, such as the maximum file size or file system size, as well as the number of inodes. An initial advantage can even become its opposite. For example, the search for a data item, such as a directory entry for a file, requires the navigation of the file directory itself. This sequential navigation, although rapid and satisfying for a small number of items, quickly becomes a stifling bottleneck with large data volumes.

Attempts to overcome these limitations have included the use of shared clusters with concurrent access to a file system. However, these methods have also quickly reached their limits, as the shared disk model has been able to successfully handle only a few dozen nodes. A “shared nothing” approach (which does not share disks) does push back the threshold of performance degradation beyond previously known limits, with the ability to aggregate thousands or tens of thousands of nodes. Applied to the NAS world, one of the current market leaders offers a very scalable solution in terms of performance and data redundancy, with a capacity of several PB, but the offer remains restricted to 144 nodes due to network limitations.


3 What is object storage?

Traditional access methods, such as file sharing or block protocols, have their own properties, advantages and limitations. Beyond SAN, NAS and scale-out NAS*, a fourth way, object storage, has been proposed by the industry. An object is a logical entity that groups data and metadata, using the latter for the management and description of the stored data. The key point involves the establishment of exchange protocols for interaction between clients and servers. Several such protocols have been proposed in the industry, but the one that seems to offer the best characteristics is the association of keys and values. This is what Scality is based on. This approach has several simultaneous advantages:

1. The protocol is simple to implement, which provides reliability.
2. Performance has a linear response, often based on calculation, and is therefore predictable, without the need for lookups or data centralization.
3. Scalability is no longer constrained by the limits of file system or block storage.
4. The guarantee of data location independence and the integration of data redundancy, by coupling multiple copy mechanisms, provides robustness.
5. Geo-distributed or stretched configurations with an Internet-like topology are feasible with object storage, but impossible to obtain with block mode or file systems.
6. Consistency is simplified, without the need to follow the strict rules introduced by block storage or historical file system design.

The object storage concept relies on a value known from outside the system, such as a path or the content of a file, from which a unique key is calculated. The key serves to locate the data, without response times being affected by the multiplication of systems for size constraints. Storage systems known as Content Addressable Storage (CAS) have emerged over the past decade. These systems are key-value based object stores with the specificity that keys are directly derived (hashed) from the content. Typically, the MD5 or SHA-1 algorithms, or a combination of both, are used to uniquely identify an object. Over the past few years, these storage systems have penetrated the market for archiving data, and several mature offerings exist. They are very efficient for such workloads, since identical objects have the same key, but they have many drawbacks when used as a general file store.

* Scale-out NAS: the capability to provide a logical aggregation of multiple file servers or NAS devices so as to present uniform file access and a single-server view. The complexity of the configuration is completely hidden. This approach addresses critical, high-demand environments both in terms of performance and availability.


More recently, it was recognized that a more general implementation of key-value based object storage, where the key is not constructed by the storage system but chosen by the application, would be much more useful for a wide variety of applications. Amazon led the way with the Amazon S3 service in 2006. Scality has developed such a modern key-value based object store, and has added performance, making RING Organic Storage an ideal technology for file storage at scale, for any use case.
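A minimal sketch of the two key styles discussed above, contrasting a content-derived (CAS) key with an application-chosen key. The in-memory dictionary stands in for a distributed key/value store; the names and structures here are illustrative assumptions, not Scality code.

    import hashlib

    store = {}  # stand-in for a distributed key/value store

    def cas_put(data: bytes) -> str:
        """Content Addressable Storage style: the key is derived from the content."""
        key = hashlib.sha1(data).hexdigest()   # identical content -> identical key
        store[key] = data
        return key

    def kv_put(key: str, data: bytes) -> str:
        """General object storage style: the key is chosen by the application."""
        store[key] = data                      # e.g. a path-like name picked by the app
        return key

    k1 = cas_put(b"same payload")              # deduplicates identical objects by design
    k2 = kv_put("mybucket/docs/report.pdf", b"same payload")
    print(k1, k2, store[k1] == store[k2])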

4 Scality RING Organic Storage

4.1 Philosophy

Scality, a storage infrastructure software vendor, developed the RING Organic Storage solution to address business and IT challenges associated with exponential data growth. It offers the capability to process huge numbers of transactions and store volumes of data at the petabyte scale and beyond. Scality’s philosophy is based on strong principles that are the building blocks of a hyperscalable and high performance infrastructure, assembled from very low cost standard components. It is about constructing a virtually infinite and elastic logical storage space based on standard commodity servers and disks.

Figure 2: multidimensional scalability: capacity only, performance only or both


Scality targets two sets of objectives:

1. IT efficiency
   a. using an innovative approach to deliver availability, performance and scalability
   b. the capability to deliver performance or capacity independently of the other, without any compromise on availability
2. Low TCO (producing a real and immediate ROI).

Figure 3 illustrates a perfect analogy between a traditional array with its common elements (controllers and storage units) and Scality RING components such as connectors and storage servers.

Figure 3: Scality RING Organic Storage high level architecture versus traditional storage

To fulfill these objectives, Scality RING is based on:

• Load sharing by several elementary units functioning in parallel (“Divide and Conquer” principle),

• An independent and decentralized approach pioneered by peer-to-peer (P2P) networks (“Divide and Deliver” principle)

• The distribution of objects written to multiple nodes (“Divide and Store” principle).

Unlike the typical implementation of static data distribution, Scality does not require any prior data sharding by the user, administrator or application. All of the splitting is implicit, transparent and integrated into the solution, without any impact on application users.


Figure 4: Scality RING deployment example with 2 storage tiers, a DR site and multiple connectors as the access layer, managed by the Supervisor

4.2 Architecture and components

From the start, Scality has based its development on the implementation of university research linked to the distribution and persistence of objects in a network of elementary systems. The ability to access any distributed object rapidly is a primary requirement. The Chord protocol [1, 2], invented at MIT, is a second-generation P2P system leveraged by Scality to map stored objects into a virtual keyspace. Unlike “unstructured” first-generation P2P systems, such as Gnutella, where requests are broadcast to different “producers” of storage, the second generation, known as structured P2P, relies on the effective routing of a request to the node owning the data being requested.

Scality has subsequently developed and extended Chord beyond data distribution. It has added functional components in order to reach enterprise-level performance and reliability, providing reduced access times, object persistence guarantees and self-healing capabilities. These components cover the generation of data and server keys, their format, and the concept of logical servers, called “storage nodes”, in place of physical servers. These logical servers are instantiated as independent UNIX processes, each responsible for part of a physical machine’s address space.
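The sketch below illustrates the general Chord-style idea described here: logical storage nodes (several per physical server) are placed on a circular keyspace, and an object key is served by its successor node. The hashing choices and data structures are simplifying assumptions for illustration only, not Scality's implementation.

    import hashlib
    from bisect import bisect_left

    def h(value: str) -> int:
        # SHA-1 yields a 160-bit integer, i.e. a position on a 2^160 circular keyspace.
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    # Four physical servers, each running three logical storage-node processes.
    node_keys = sorted(h(f"{server}:{proc}")
                       for server in ("A", "B", "C", "D") for proc in range(3))

    def responsible_node(object_key: int) -> int:
        """Successor rule: the first node key >= the object key owns it (wrapping around)."""
        idx = bisect_left(node_keys, object_key) % len(node_keys)
        return node_keys[idx]

    obj_key = h("mybucket/docs/report.pdf")
    print(f"object {obj_key:040x} -> node {responsible_node(obj_key):040x}")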


With these developments Scality implements an essential ingredient for handling infrastructure load increases, and automates the redistribution of object keys from a failing server to other servers and nodes. Each storage node has an automatically assigned key and acts as a simple key/value store. Scality has developed flexible and efficient key generation and allocation mechanisms that integrate fault tolerance, the number of copies required and the topology of the Ring (racks, single or multisite). Server keys also apply the concept of “class of storage” to stay aligned with application needs.

Scality’s load balancing algorithms result in a keyspace that is uniformly distributed over the cluster of nodes present. These algorithms prevent collisions of data replicas in normal operating conditions, but also after possible disk or server failures. The capability to glue together different server configurations is also important; some servers could exist in the system from the beginning, alongside new machines with different capacities or performance characteristics. Scality has implemented a dispersal technology that guarantees that all object replicas are stored on different node processes and on different physical storage servers, potentially in different datacenters. This guarantee is maintained even after a disk or server failure. The independence of nodes, servers and keys is what enables Scality’s RING system to guarantee an availability of 99.999%, which qualifies the solution as carrier grade and suitable for the most demanding environments.

Scality envisions a completely object-oriented philosophy to overcome traditional limits and support data storage on the order of several PB and more. The revolution proposed by Scality requires only the use of inexpensive, standard x86 computers as base storage servers. This goes well beyond traditional approaches that rely on centralized indexes, catalogs or databases of small entities, such as inodes or blocks of data. The paradigm shift is here: delivery of enterprise-quality storage with standard servers rather than the traditional approach of yet another array of hyperscalable disks. The aggregation of standard machines and their logical unification in the form of a ring conforming to a P2P philosophy offers the promise of an always-on system with unlimited scalability. It is a system capable of growing at the speed of all the demanding applications connected to the Internet.

The core of the Scality solution resides in its unique distributed architecture and intelligent self-healing software. There is no centralization of requests, no unique catalog of data, no hierarchy of systems, and therefore no notion of master or slave. The approach is purely symmetric: all nodes have the same role and run the same code.

Scality RING is complemented by two additional systems: the Supervisor, through which the infrastructure is managed, and the connectors, which link the Ring with consumers (often represented by applications or client terminals). The Supervisor is covered in chapter 4.7. The connectors take on the role of data translation between a typically standard interface (either open and widely deployed, or very specialized for various business or vertical needs) and the object model as understood by the Ring. Several types of connectors exist.


They can be clustered to obtain a redundant, high-performance access layer, and they can also be co-located with the application itself. The Scality RING storage infrastructure provides a scale-out architecture from both a connector and a storage standpoint.

Figure 5: Different infrastructure elements supporting Scality RING object storage

Internally, Scality software relies on two layers: the Scality Core Framework, which provides developers with an asynchronous, event-based exchange model, and Scality RING Core, a higher layer that implements the distributed logic and manages keyspaces, self-healing and a hierarchy of any number of storage spaces.

Figure 6: The two functional layers of Scality RING in user mode

In the jargon that Scality introduces, the term “node” is different from a system or a physical server. Several nodes can run on a single server. A configuration of six nodes per server is often recommended, and their existence is purely logical, being just processes in the Unix/Linux sense of the term. They are all independent of each other even if they operate on the same server. These storage node instances control their portion of the global keyspace to locate data and honor object requests.


The nodes are running on the storage servers and are responsible for a portion of the global ring, with each node in charge of an even portion of the global keyspace. Figure 7 illustrates an example with a ring of 12 nodes created from four storage servers, with each one operating 3 storage nodes. Each server is thus responsible for 1/4 of the keyspace. The maximum supported configuration is 6 storage nodes per storage server.

Figure 7: From physical servers to storage nodes organized in a logical ring (A0…D2 are labels illustrating that two consecutive nodes do not come from the same physical server)

Keys with a size of 20 bytes are generated by the connectors, which assign them to different nodes. This establishes a ring with a fairly balanced distribution, thanks to a dispersion factor present in the key itself.

Figure 8: Format of a Scality key with 160 bits, of which 24 bits are reserved for key dispersal
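As a hedged illustration of the key format in Figure 8, the sketch below assembles a 160-bit key from a 24-bit dispersal prefix and a hash of the object name. The exact field layout and prefix-generation rules are Scality internals not detailed in this paper, so the boundaries used here are assumptions only.

    import hashlib, os

    DISPERSAL_BITS = 24                     # leading bits reserved for dispersal (Figure 8)
    PAYLOAD_BITS = 160 - DISPERSAL_BITS

    def make_key(object_name: str) -> int:
        """Assemble an illustrative 160-bit key: 24-bit dispersal prefix + 136-bit hash."""
        dispersal = int.from_bytes(os.urandom(3), "big")   # 24 random bits spread the keyspace
        payload = int(hashlib.sha1(object_name.encode()).hexdigest(), 16) >> DISPERSAL_BITS
        return (dispersal << PAYLOAD_BITS) | payload

    key = make_key("mybucket/docs/report.pdf")
    print(f"{key:040x}")                    # 160 bits = 20 bytes = 40 hex digits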

Each node functions like a very simple key/value store. The key doesn’t embed location information, but the Chord algorithm always maps a key to a specific node at any given time. The internal logic of each node then determines the appropriate location of the object data on disk. Keys always contain either a hashed or randomly generated prefix, leading to a balanced distribution of data among all the nodes based on consistent hashing principles. Another essential point of the Scality solution is the notion of decentralization and the independence of the nodes, since nodes do not defer to a central intelligence.


As a peer-to-peer architecture, any node can receive a request for a key. The path to the right node follows ½ log2(n) hops, with n being the number of nodes in the Ring, when the topology changes. A key is assigned to a node, which has the responsibility to store the objects whose keys are immediately inferior to its own key and superior to the keys of the preceding node on the Ring.

At the heart of the same system, the In/Out daemons, known as iods, are responsible for the persistence of data on physical media. Their role is to write the data passed to the nodes on the same machine, monitor storage and ensure durability. Each iod is local to one machine, managing local storage space and communicating only with the nodes present on that same machine. There is therefore no exchange between a node on one machine and the iod of another machine. There can be multiple iods running on the same physical machine; a typical configuration is one iod per disk. The iods are the only links between the physical residence of a data entry and the layer of services represented by the nodes and connectors. The maximum number of iods on a server is 255, enough to support a very large load locally.

Physical storage local to a server consists of regular partitions formatted with the standard ext3 file system. Each iod controls its own file system and the data containers placed on top of it. These file containers are, in fact, the elementary storage units of the Ring; they receive written objects directed by the iod from requests to the nodes initiated by any connector. They store three types of information: the index to locate the object on the media, the object metadata, and the object data itself.

The unique connector-Ring-iod architecture provides a total hardware and network abstraction layer, with connectors at the top acting as an entry gate to the Ring, the nodes of the Ring acting as storage servers, and the iod daemons acting as storage producers responsible for the physical I/O operations.

Figure 9: Different elements of a storage server

The Scality philosophy is all about delivery of a storage infrastructure that doesn’t compromise on the three dimensions – performance, availability and scalability – even as they evolve over time based on end user or service provider requirements.


4.3 Performance

Performance is measured based on three criteria: latency, throughput and the number of operations per second. Latency is the time required to receive the first byte of data in answer to a READ request. It is a function of network and storage media speed, as well as of the Ring’s distributed storage protocols. IOPS is the number of operations that can be performed per second. It can be measured as either object IOPS (READ, WRITE, DELETE of entire objects) or physical media IOPS (disk blocks per second). Associated with this is the aggregate throughput of the transmission in Gbit/s, also known as the bandwidth.

The Scality RING storage platform is designed to adapt to different, often quite variable, workloads. These can include relatively small operations (several KB) with a large number of transactions per second, or conversely, large operations (several MB) with a relatively smaller number of transactions. The high degree of parallelism among all nodes, physical servers and connectors is a strong performance differentiator of the RING platform. The “Divide and Conquer” philosophy is implemented by spreading data among storage nodes running on several storage servers and making sure that all available resources are utilized.

To find where a file resides, legacy storage systems must go through centralized catalogs. By contrast, the Ring uses a very efficient distributed algorithm based on object keys and shows a very linear response to increased load; lookups scale logarithmically with the number of servers. Latency is predictable, and the time to locate data in a Ring remains appropriately small. Response times remain flat even with an increased number of servers or objects.

The second-generation peer-to-peer Chord protocol doesn’t need all nodes to know the complete list of other nodes (the peers) in the configuration. Chord’s main quality resides in its routing efficiency. In the original Chord protocol as documented by MIT, each node only needs to know its successor and predecessor nodes in the Ring. Thus, updates to data do not require the synchronization of a complete list of tables on every node, but still avoid the risk of stale information.


Figure 10: Connector lookup method based on a very efficient Chord algorithm

Using the intelligence of the Chord protocol, the ring provides complete coverage of the allotted keyspaces. An initial request is internally routed within the Chord ring until the right node is located. Multiple hops can occur, but the two key pieces of information – predecessors and successors – reduce the latency needed by the protocol to locate the right node. When the desired node is found, the node receiving the original request provides this information directly back to the connector.

Figure 10 illustrates a simple lookup request from the connector. Key #33 is requested by the connector, and this machine knows only keys 10 and 30. The connector picks the first entry and connects to that node, i.e. node 10. Nodes 15, 25, 45 and 85 are then contacted. The protocol determines that node 25, connected to node 35, matches the request for key 33. Node 25 sends the information back to node 10, and then to the connector.

The Scality RING implementation of Chord modifies the original algorithm so that:

• Each node knows its immediate successor and all power-of-two successors up to half the ring in terms of hops.

• As a consequence, most of the time only 1 hop is needed to find the data. In the case of a topology change, the number of hops follows ½ log2(n), with n being the number of nodes in the Ring. That leads to a maximum of 4 hops for a 100-node Ring, and only 5 hops for a 1000-node Ring (a quick check of these figures follows below).

• When changes occur, such as the insertion or failure of a node, a proxy mechanism is started together with a balance job to maintain the Ring in an optimized topology.

Globally, the routing table rarely changes even after an insert. The infrastructure doesn’t need to pause, stop or freeze the environment when storage servers are added.
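The hop-count bound quoted in the list above can be checked with a few lines of arithmetic; the results match the 100-node and 1000-node figures given in the text (this is only the bound during topology changes, with a single hop in the stable case).

    import math

    def max_hops_during_topology_change(n_nodes: int) -> int:
        """Upper bound on lookup hops while the ring topology is changing: 1/2 * log2(n)."""
        return math.ceil(0.5 * math.log2(n_nodes))

    for n in (12, 100, 1000):
        print(f"{n:>5} nodes -> up to {max_hops_during_topology_change(n)} hops "
              f"(1 hop in the common, stable case)")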


When a failure occurs, it is handled like a cache miss, and the lookup process refills the cache line after determining the new route to the data. Scality allows seamless topology changes as nodes join and leave the infrastructure. The Ring can continue to serve queries even while the system is continuously changing. Most of the time during normal operations, the mapping of connector-key-node is direct, and performance is optimal.

Scality RING leverages many recent technological innovations. For example, the ability to integrate Flash-based storage or SSD units in storage servers and connectors helps to deliver ever greater performance*. Similarly, improvements can be obtained with the integration of 10 GbE or higher network cards, the configuration of multiple ports, and high-frequency multicore CPUs. The integration of new elements is possible without disturbing or unbalancing the solution, and of course, without service interruptions.

The absence of centralization removes any potential bottlenecks in one or more servers (such as might occur with a server monopolizing access information). Thus the Scality system avoids any single points of failure (SPOF) that would otherwise have the potential to disrupt the entire system. Performance extends beyond its usual definition of throughput speed to also include service availability. Rapid and automatic recovery from an error or breakdown is also considered by Scality to be a key measure of performance.

Server Nodes    Software Nodes    4 kB GET (objects/sec)    4 kB PUT (objects/sec)
     3               36                  41,573                    26,274
     4               48                  51,882                    33,278
     5               60                  60,410                    39,160
     …                …                       …                         …
    24              288                 385,000                   257,000

Source: ESG

Figure 11: Performance results in objects/sec

Auto-Tiering

Scality provides its own storage tiering technology, embedded within the RING, named Auto-Tiering. The method ensures the right alignment between the value of the data and the cost of the storage where the data resides. It operates at the object level, is therefore independent of the data structure used by the application, and can be applied to many different IT environments. The RING receives the criteria to apply and operates the data movement accordingly. The policy engine autonomously and automatically performs the migration and movement of objects within a single RING or between RINGs. The object key continues to reference the same location, so the move is completely seamless for the user and the application. Various configurations can be imagined, such as storage consolidation with an N-to-1 model where N is the number of primary RINGs.

* See Scality Lab Report by ESG on the performance of Scality RING when using SSD storage.


These N RINGs migrate data to a single secondary RING and share its potentially massive capacity. This fundamental function for today's data centers allows you to configure, for example, a first RING with SSDs. That RING provides fast data access and stores only 10 to 20% of the entire data volume. It is connected to a second, less frequently accessed RING with more capacity, built from SATA drives, thus delivering a higher but still adequate response time. Figure 4 illustrates this Auto-Tiering function between two RINGs within the same data center. This optional feature fully achieves the financial goals of reducing the cost of storage infrastructure and optimizing the storage service.
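A hedged sketch of the kind of decision the Auto-Tiering policy engine automates: objects whose last access is older than a configurable threshold become candidates for migration from a fast primary RING (e.g. SSD) to a capacity-oriented secondary RING (e.g. SATA). The criteria, threshold and data structures are illustrative assumptions, not the actual RING policy configuration.

    from datetime import datetime, timedelta

    TIER_AFTER = timedelta(days=30)          # illustrative policy threshold

    # (object key, last access time) pairs; stand-ins for RING object metadata.
    objects = [
        ("key-hot",  datetime(2012, 3, 1)),
        ("key-cold", datetime(2012, 1, 5)),
    ]

    def plan_migrations(objects, now):
        """Return the keys the policy would move from the primary to the secondary RING."""
        return [key for key, last_access in objects if now - last_access > TIER_AFTER]

    print(plan_migrations(objects, now=datetime(2012, 3, 15)))
    # -> ['key-cold']: only the cold object moves; its key still resolves transparently.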

4.4 Data Protection

Scality RING provides several mechanisms for protecting data and the infrastructure on which it operates: Replication and ARC, the new Advanced Resilience Configuration mode.

Replication

Scality offers a built-in replication mode within the Ring to provide seamless data access even in case of failures. Data is copied in native format without any transformation, which offers a real performance gain. Scality Replication maintains multiple object copies, called replicas, across different storage nodes, with the guarantee that each replica resides on a different storage server thanks to the dispersion factor expressed in its key (the first 24 bits of each key). The mechanism developed by Scality involves projection guarantees that determine independent target nodes for the additional data copies. The maximum number of replicas is 6, although typical values are 3 or 4. Additionally, an option exists to enable replication across multiple rings, on the same or remote sites, with a flexible choice of unidirectional or multi-directional mode.

Figure 12: Replication with 2 replicas


Advanced Resilience Configuration

In addition to replication, Scality developed Advanced Resilience Configuration, known as ARC, its erasure-code technology based on the IDA (Information Dispersal Algorithm), well known and proven in telecommunications for a long time. Its reconstruction mechanism is based on Reed-Solomon [3] error-correcting theory. ARC is a new optional feature running within the RING to protect data intelligently against disk, server, rack or site failures. This configuration mode reduces the number of copies and avoids duplicating or triplicating the same information. This mechanism therefore significantly cuts hardware CapEx as well as the related OpEx.

As a brief description of Scality ARC, let's consider n objects which need to be stored and protected. For the simplicity of the demonstration, we will assume that they are all 1 MB in size. We assume we want to protect against k failures (which we note ARC(n,k)). Scality ARC will store each of the n pieces of content individually, and will in addition store k new objects called checksums. These checksums are mathematical combinations of the original n objects, constructed in such a way that all n objects can be reconstructed despite the loss of any k elements, whether objects or checksums. With Scality ARC, each of the k checksums would be 1 MB in size, giving protection against the loss of k disks or servers with just an extra k MB of storage required. To illustrate the benefits and the mechanism behind this theory, we use the following example:

Figure 13: Scality RING Advanced Resilience Configuration with (16,4) model

Here, with just 4 MB of additional storage, we have protected 16 MB against the loss of 4 disks or servers. That is an overhead of 4/16 = 25%, much lower than the 200% overhead of replication, and for much better protection than RAID 6. Note that erasure coding protects against server loss as well as disk loss, which is not the case for RAID.

Traditional implementations of erasure-code technology for storage introduce a “penalty on read”, meaning that the reader must retrieve several fragments and then extract the data to recover the original information. This is the drawback of the dispersed storage approach: it introduces 200-300 ms of latency to serve the data. To avoid that overhead and the large number of IOPS it implies, Scality chose to store the original data fragments and the checksum fragments independently. Scality implements a (16,4) model, referred to as ARC(16,4), meaning that the redundancy adds 25% to the data space.
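The storage accounting behind this comparison can be written out in a few lines (generic erasure-coding arithmetic, not Scality-specific code); it reproduces the 25% ARC(16,4) overhead and the 200% overhead of three-way replication used in the next paragraph.

    def erasure_overhead(n_data: int, k_checksums: int) -> float:
        """Extra storage as a fraction of the useful data for an (n, k) scheme."""
        return k_checksums / n_data

    def replication_overhead(copies: int) -> float:
        """Extra storage as a fraction of the useful data for N identical copies."""
        return copies - 1.0

    print(f"ARC(16,4): {erasure_overhead(16, 4):.0%} overhead, "
          f"survives any 4 disk or server losses")       # 25%
    print(f"3-way replication: {replication_overhead(3):.0%} overhead, "
          f"survives 2 losses")                          # 200%
    # Protection space alone: 4 MB vs 32 MB for 16 MB of data, i.e. 8x less with ARC(16,4).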


If we compare Replication and ARC in terms of cost-effectiveness at the same level of redundancy, replication with 3 copies needs 200% extra storage space, while ARC(16,4) needs just 25% more. For the storage dedicated to protection, that represents 8 times less storage, or just 12.5% of the replication case. Overall, for data and protection space combined, the ARC implementation requires 2.4 times less storage than the equivalent configuration with replication. This is what Scality has implemented and recommends for large-scale configurations. It thus becomes possible to achieve reliability 1,000,000 times better than RAID 6, with no more disk overhead and no performance bottleneck. In comparison with dispersed storage, the Scality approach avoids the penalty on read and continues to offer the best response time with a direct read operation on the data.

Figure 14: Scality RING ARC example vs Replication and Dispersed approaches

We summarize some results in the following table for a 1 PB storage configuration with RAID 5, RAID 6, Replication and Scality ARC. RAID 5, with 5 data disks and 1 parity disk, shows a very good storage overhead but a poor profile for data durability. RAID 6, with 6 data disks and 2 parity disks, presents a big gain, with interesting durability but a very limited tolerance of disk failures. This limitation demonstrates that RAID 6 is not the right choice for large configurations, especially those beyond 1 PB. Replication is a very reliable solution, but the cost represented by the storage overhead can be a drawback for certain accounts and configurations. Finally, Scality ARC with a (16,4) configuration adds just 20% of storage overhead, an exceptionally low risk of data loss and high durability, combined with the capability to tolerate multiple storage node failures.


Figure 15: Scality ARC comparison with RAID and Replication configurations

4.5 Consistency model

Scality allows some tuning of the three dimensions of Brewer’s CAP theorem [13] (consistency, availability and partition tolerance). Scality RING can behave in “Strong Consistency” (SC) or “Eventual Consistency” (EC) mode. SC potentially reduces performance, while EC provides better response times but softer consistency, depending on the configured environment and operating constraints. Scality RING enforces ACID (Atomicity, Consistency, Isolation and Durability) transactions, although administrators can decide to relax some constraints based on their requirements. The goal is that a transaction leaves the Ring in a known, stable state without breaking integrity constraints.

R + W > N  : Strong Consistency
R + W <= N : Eventual Consistency

R: number of replicas contacted to satisfy a read operation
W: number of replicas that must send an acknowledgement (ACK) to satisfy a write operation
N: number of storage nodes storing the replicas of the requested data

Particular cases:
• W >= 1: writes are always possible
• R >= 2: minimal redundancy

Figure 16: Illustration of Brewer’s theorem
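A minimal sketch of the quorum rule summarized above: when the read and write quorums are large enough to overlap (R + W > N), every read sees the latest acknowledged write. The classification function below is a simplified illustration of that rule, not Scality's consistency engine.

    def consistency_mode(n_replicas: int, w_acks: int, r_reads: int) -> str:
        """R + W > N guarantees that every read quorum overlaps every write quorum."""
        return "strong" if r_reads + w_acks > n_replicas else "eventual"

    for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]:
        print(f"N={n} W={w} R={r} -> {consistency_mode(n, w, r)} consistency")
    # N=3 W=2 R=2 -> strong; N=3 W=1 R=1 -> eventual; N=3 W=3 R=1 -> strong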

In theory, data consistency with multiple copies can be enforced in either the write (write()) or the read (read()) operation. The traditional approach in the industry is to include the coherency check in the typically synchronous write function, with no real control in the read function. All copies are identical thanks to the constraint on the write operation, so read operations are equivalent and deliver the same value. A second approach makes writes fast and always possible, because the consistency check is placed in the read.


However, for highly distributed systems or very large environments, partitioning can occur with a very asynchronous write mode. To meet the needs of read operations, and also to move towards the reconciliation of the different versions of data inherent in a distributed system, Scality extends the Merkle tree algorithm [14-18] for the identification of different versions and their alignment with the required number of copies. The I/O engine always wants to access and see a consistent version of an object before writing to it. Figure 17 illustrates the write() operation after key allocation, with the native ability to execute in parallel mode.

Figure 17: Schema for a write operation from the connector
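The sketch below shows the general Merkle-tree technique referenced above: two replicas can decide whether they hold the same versions of a set of objects by comparing a single root hash, and only divergent subtrees need to be walked. This is the textbook mechanism, not Scality's specific extension of it.

    import hashlib

    def sha1(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()

    def merkle_root(leaves):
        """Fold leaf hashes pairwise up to a single root hash."""
        level = [sha1(leaf) for leaf in leaves]
        while len(level) > 1:
            if len(level) % 2:                      # duplicate the last hash on odd levels
                level.append(level[-1])
            level = [sha1(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    replica_a = [b"obj1-v1", b"obj2-v1", b"obj3-v1", b"obj4-v1"]
    replica_b = [b"obj1-v1", b"obj2-v2", b"obj3-v1", b"obj4-v1"]   # one diverging version

    print("replicas in sync:", merkle_root(replica_a) == merkle_root(replica_b))
    # False: only the subtree containing obj2 needs to be reconciled.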

4.6 Scalability

Scality delivers an enterprise storage infrastructure capable of handling large volumes of data. Scality RING provides a solution that is highly scalable, offers high performance and is available at a reasonable and affordable cost in comparison with traditional storage market rates. To achieve such results, Scality uses its own virtual storage technology of distributed objects over standard, low-cost servers: standard servers equipped with x86 processors, SSD, SAS or SATA disks, and multi-Gb Ethernet ports running on Linux. The result is a storage farm providing a cost-effective solution for any given performance, capacity and feature set.

Scality delivers capacity rates 60% cheaper than the already attractive prices offered by Amazon S3. A recent Scality TCO study supports a storage rate of only 5 cents/GB/month for 1 PB. The TCO of storage with Scality declines further as the storage volume grows. It is therefore easier and less expensive to add capacity by simply adding storage components to a Scality RING. A very high level of storage service can be maintained simply by adding standard servers that function as storage media or connectors, or by joining several instances of Scality RING. The beauty of the architecture comes from its scalability: the greater the number of servers, the better the global performance and availability, and the cheaper and more efficient the cost of access becomes.


Scality’s approach was recently studied and validated by ESG (Enterprise Strategy Group), an independent consulting and business strategy firm. ESG conducted an in-depth analysis and tests covering manageability, flexibility, resiliency and object-based performance. The following charts illustrate the scalability of Scality RING. For more information, the report is available on the Scality web site at http://www.scality.com/research_reports.

Figure 18: Performance Scalability in Objects/sec units

Figure 19: Performance Scalability in Aggregate Throughput (MB/s)


4.7 Management and Support

Scality is easy to configure and manage with a command line interface (CLI) and a Web GUI. The CLI, named RingSH, allows self-created scripts and can also be integrated into a management framework. The second administration element, named the Supervisor, is very intuitive and offers many additional capabilities, such as monitoring from the Ring down to the individual disk, the status of storage servers and nodes, capacity statistics and alerts. The Supervisor allows a deep dive into the details of storage nodes by key or server, and provides the management capability to add or remove servers when needed, whether for a hardware refresh with newer technologies or for a maintenance task to replace defunct servers. During all these steps, the Ring continues to serve requests without any impact and redistributes data among the existing online servers and resources. This behavior demonstrates and reinforces the elastic character of Scality RING.

Figure 20: Scality RING dashboard with storage nodes and key space distribution

Scality provides 24x7 support and maintenance and delivers a variety of professional services, such as on-site experts or a dedicated care service.


4.8 Partners and ecosystem

The target markets for Scality RING are the service providers and large enterprises experiencing relentless demand for performance, availability and scalability. Companies wanting to establish their own private or hybrid cloud are a perfect use case for the Scality RING solution. Scality’s adoption in these markets is enabled by the availability of connectors that are customizable to specific requirements. Connectors provide the links and the translations between applications and the Ring. The Ring integrates sophisticated functions that allow the development of application services at a very high level. Connectors can also be developed independently of Scality, thanks to an available API, or they can be the object of cooperative development between the partner, the application users and Scality. The Scality Open Source Program (SCOP) is dedicated to this type of partnership. Several types of connectors are available, from the generic to the specialized:

Generic connectors (for standard or basic IT services):
• REST/http
• File System (FUSE)
• NAS (NFS)
• CDMI (SNIA standard)

IT connectors (for advanced IT services), with example partners:
• Email: Zimbra, Open-Xchange, Openwave, Critical Path, Dovecot, Cyrus
• REST Storage Service (RS2)
• Backup: CommVault
• ECM: Nuxeo
• Enterprise NAS: Gluster, Panzura, CTera Networks
• Cloud Gateway: TwinStrata, StorSimple
• Virtual Computing: Parallels
• Cloud Desktop/Agent: Gladinet, Mezeo, TeamDrive, CloudBerry Lab, OxygenCloud

Figure 21: Scality list of connectors and associated partners


5 Conclusion: Scality, a new vision of storage

Scality proposes a new way of consuming storage without any inherent limits, all in a very economical fashion. The solution offers enterprise-class features that overcome the costs and constraints of traditional storage mechanisms. Whatever the application or usage, Scality RING Organic Storage can deliver a custom-aligned interface to store and exchange data in block, file or native object mode. The Scality solution is the right answer for demanding applications with high IOPS or guaranteed throughput requirements. Scality protects IT investments no matter what software and hardware are already in place. Best of all, Scality can evolve to take advantage of the latest advances in server technology and consumer pricing.

Internet companies faced these challenges a decade ago, developed solutions to them on their own, and reached a leadership position that was not feasible with the commercial solutions available at that time. Today, you can deploy, use and rely on a similar storage datacenter running the Scality RING solution, and differentiate your business from the competition with unprecedented agility at a very cost-effective price point. Scality’s technology is available today, and represents the storage infrastructure of the future. The business value of Scality RING rests as much on its advantages in scalability, performance and availability as on its affordability.


References

1. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan - MIT Laboratory for Computer Science http://pdos.csail.mit.edu/6.824/papers/stoica-chord.pdf

2. Multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capability US and WIPO patent WO/2010/080533 Vianney Rancurel, Oliver Lemarie, Giorgio Regni, Alain Tauch, Benoit Artuso, Jonathan Gramain http://www.wipo.int/patentscope/search/en/WO2010080533

3. Probabilistic offload engine for distributed hierarchical object storage US and WIPO patent WO/2011/072178 Giorgio Regni, Jonathan Gramain, Vianney Rancurel, Benoit Artuso, Bertrand Demiddelaer, Alain Tauch http://www.wipo.int/patentscope/search/en/WO2011072178

4. On Routing in Distributed Hash Tables Fabius Klemm, Sarunas Girdzijauskas, Jean-Yves Le Boudec, Karl Aberer - School of Computer and Communication Sciences - Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland http://www.computer.org/portal/web/csdl/doi/10.1109/P2P.2007.38

5. Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing Fabius Klemm, Jean-Yves Le Boudec, Dejan Kostić, Karl Aberer - School of Computer and Communication Sciences - École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland http://research.microsoft.com/en-us/um/redmond/events/iptps2007/papers/klemmleboudeckosticaberer.pdf

6. An Architecture for Peer-to-Peer Information Retrieval Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu - School of Computer and Communication Sciences - EPFL, Lausanne, Switzerland http://infoscience.epfl.ch/record/54389/files/P2P-IR_Architecture.pdf

7. A High-Performance Distributed Hash Table for Peer-to-Peer Information Retrieval Thèse #4012 (2008) – Fabius Klemm – EPFL http://biblion.epfl.ch/EPFL/theses/2008/4012/4012_abs.pdf


8. Dynamo: Amazon’s Highly Available Key-value Store Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels - Amazon.com http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

9. Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber - Google http://research.google.com/archive/bigtable-osdi06.pdf

10. The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung - Google http://research.google.com/archive/gfs-sosp2003.pdf

11. Computing in the RAIN: A Reliable Array of Independent Nodes Vasken Bohossian, Charles C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu & Jehoshua Bruck – California Institute of Technology http://www.paradise.caltech.edu/papers/etr029.pdf

12. Time, Clocks, and the Ordering of Events in Distributed Systems L. Lamport Comm. ACM 21, 1978, pp. 558-565 http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf

13. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services Seth Gilbert & Nancy Lynch http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf

14. Providing Authentication and Integrity in Outsourced Databases using Merkle Hash Trees Einar Mykletun, Maithili Narasimha & Gene Tsudik - University of California Irvine http://sconce.ics.uci.edu/das/MerkleODB.pdf

15. Secrecy, authentication, and public key systems R. Merkle, Ph.D. dissertation, Dept. of Electrical Engineering - Stanford University, 1979 http://dl.acm.org/citation.cfm?id=909000

16. Fractal Merkle Tree Representation and Traversal M. Jakobsson, T. Leighton, S. Micali and M. Szydlo - RSA-CT ‘03 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.6133&rep=rep1&type=pdf&ei=dEdGT5aSPOjC0QWI3siJDg&usg=AFQjCNESUo-gXFi7gIDSD0H5zfO60UiTmw


17. Implementation of a Hash-based Digital Signature Scheme using Fractal Merkle Tree Representation D. Coluccio http://www.rsa.com/rsalabs/node.asp?id=2003

18. Merkle Tree Traversal in Log Space and Time M. Szydlo - Eurocrypt '04 http://www.szydlo.com/szydlo-loglog.pdf

19. An Analysis of Latent Sector Errors in Disk Drives Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy and Jiri Schindler http://research.cs.wisc.edu/wind/Publications/latent-sigmetrics07.html

20. Failure Trends in a Large Disk Drive Population Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso – Google http://research.google.com/archive/disk_failures.pdf

21. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? Bianca Schroeder and Garth A. Gibson http://www.cs.cmu.edu/~bianca/fast07.pdf


Glossary

Features and benefits:

Object Storage: The Scality RING Organic Storage is consumed through an object interface, independent of classic file and block protocols. This object layer enables better scalability and independence from application and hardware components. All the logic is controlled by the software layer developed by Scality. For integration in legacy environments, Scality RING also exposes file and block interfaces through its connectors.

2nd generation P2P: Scality employs a logically distributed and structured organization based on DHT* without any central intelligence, catalog or hierarchy across servers. It provides an SLA guarantee with linear and near-direct key-node mappings. Scality delivers comprehensive keyspace coverage through consistent hashing based on the Chord protocol.

Key/Value Store: Scality offers linear performance at scale with its key/value approach, which is simple by design but powerful in what it delivers. The generated 160-bit key carries implicit meaning for object placement, redundancy and service levels.

No data sharding: There is no need to segment or split data to use Scality RING. The immediate gain is simplicity, flexibility and durability of data during its entire lifetime.

Application integration: Applications access their data through connectors that translate the application data representation to the Scality internal model. There are multiple generic connectors, including http/REST, RS2, and file or block based accessors. More sophisticated connectors have been developed for mail system integration with Zimbra, Openwave, Critical Path, Dovecot, Open-Xchange and Cyrus. Additional connectors are available for:
• Mezeo, CloudBerry Lab, Gladinet, OxygenCloud, TeamDrive (Cloud),
• Parallels (Virtualization),
• Gluster/RedHat, Panzura, CTera Networks (NAS),
• TwinStrata, StorSimple (iSCSI Storage Gateway),
• CommVault (Backup and Archive),
• Nuxeo (ECM),
• plus SNIA CDMI** support (server and client).
Scality has a partner program, SCOP, to enable fast connector development via the Scality open API.

Object and entity number and size: There is no maximum number or size limitation for objects, files, file systems, databases or tables stored. The object key size is 20 bytes, with an identification key of 128 bits. Objects can also have multiple sizes during their lifetimes.

Self-Healing: Scality RING implements a comprehensive, fully automated mechanism to detect and fix failures at the object, storage or node level to maintain the SLA. The integrity of each object is protected by a checksum method used to compute and generate new copies.

Auto-Tiering: Scality allows object tiering within a RING or across multiple RINGs. This enables better SLA alignment, with inactive data migrated to the secondary level and active data kept on primary storage.

Load Balancing: Scality fully automates load balancing of the keyspaces within the storage infrastructure (across storage nodes). The load balancing ensures that objects and metadata are evenly distributed regardless of the root event (loss or addition of nodes, configuration changes, etc.) or failure.


Data Redundancy: Up to 5 replicas (6 copies) per object can be set. This number can differ among stored objects. All configuration parameters are controlled in the administration console running on the Supervisor node. The copies are managed by the connectors themselves.

Access Redundancy: Multiple connectors in stateless mode are configured to maintain access to the data infrastructure. Multiple copies and parallel access to different storage nodes provide continuous service to applications.

Advanced Resilience Configuration (ARC): ARC is the erasure-code implementation made by Scality to reduce the number of data copies and increase the resiliency of the data globally. The default configuration is (16,4), meaning 16 data fragments are stored plus 4 checksum fragments. It provides direct data access without additional computation and tolerates up to 4 component failures per request (disks, servers, networks…).

Management: The control and administration of the platform can be managed from the Supervisor node running the dashboard GUI. It is also possible to operate a command line interface (CLI) via RingSH, or to run a script with all desired options.

Hardware agnostic: Scality RING allows a completely independent selection of hardware without any constraints on CPU, memory, network or disk type. Scality leverages commodity components and runs on standard Linux distributions (CentOS, Ubuntu and Debian). Recommendations and guidelines are available to establish the right server farm for specific applications.

No limit to hardware: There is no limit in terms of the number or mixture of servers. Pick your own brand and model, then build your storage platform.

Node flavor: There are three kinds of logical system within a Scality infrastructure:
• Connector node: gateway to access the RING and transport data in or out; very important in the key generation process and for object redundancy mechanisms.
• Supervisor node: specific role in the management and the configuration of the platform.
• Storage node: virtual storage server running on a physical storage server. Multiple instances run at the same time to enable data dispersion, parallel access and redundancy.

*DHT: Distributed Hash Tables

**CDMI: SNIA Cloud Data Management Interface

***RAIN: Reliable | Random | Redundant Array of Inexpensive | Independent Nodes