DISTRIBUTED STORAGE
SYSTEM
Mr. Dương Công Lợi
Company: VNG-Corp
Tel: +84989510016/+84908522017
CONTENTS
1. What is a distributed-computing system?
2. Principles of distributed database/storage
systems
3. Distributed storage system paradigm
4. Canonical problems in distributed systems
5. Common solutions for canonical problems in
distributed systems
6. UniversalDistributedStorage
7. Appendix
1. WHAT IS A DISTRIBUTED-COMPUTING
SYSTEM?
Distributed-Computing is the process of solving a
computational problem using a distributed
system.
A distributed system is a computing system in
which a number of components on multiple
computers cooperate by communicating over a
network to achieve a common goal.
DISTRIBUTED DATABASE/STORAGE
SYSTEM
In a distributed database system, the database is
stored on several computers.
A distributed database is a collection of multiple,
logically interrelated databases distributed over a
computer network.
DISTRIBUTED SYSTEM ADVANTAGES
Advantages
Avoid bottlenecks & single points of failure
Greater scalability
Higher availability
Routing models
Client routing: the client sends its request to the appropriate
server to read/write data
Server routing: a server forwards the client's request to the
appropriate server and returns the result to the client
* The two models above can be combined in one system
DISTRIBUTED STORAGE SYSTEM
Store the data {1,2,3,4,6,7,8} on a single server:
{1,2,3,4,6,7,8}
Or spread it across 3 distributed servers:
{1,2,3}  {4,6}  {7,8}
2. PRINCIPLE OF DISTRIBUTED
DATABASE/STORAGE SYSTEM
Shard each data key and store it on the appropriate
server using a Distributed Hash Table (DHT)
The DHT hash function must suit consistent hashing:
Uniform distribution of generated values
Consistent
Jenkins and Murmur are good choices; others such as
MD5 and SHA are slower
3. DISTRIBUTED STORAGE SYSTEM
PARADIGM
Data Hashing/Addressing
Determine which server the data is stored on
Data Replication
Store data on multiple server nodes for higher
availability and fault tolerance
DISTRIBUTED STORAGE SYSTEM
ARCHITECTURE
Data Hashing/Addressing
Use the DHT to map each server (by server name) to a
number, placing it on a circle called the key
space
Use the DHT to hash each data key and find the server
that stores it by successor(k) = ceiling(addressing(k))
successor(k): the server that stores k
[Figure: key-space circle (starting at 0) with server1,
server2, and server3 placed at their hashed positions]
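The circle and the successor(k) lookup can be sketched as follows. This is a minimal illustration, not the system's actual code: it uses MD5 from the Python standard library in place of the Murmur function the deck recommends, and the server names and key are made up.

```python
import bisect
import hashlib

def addressing(name: str) -> int:
    """Hash a server name or data key to a point on the key-space circle."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Sort server positions so successor() is a binary search.
        self.ring = sorted((addressing(s), s) for s in servers)

    def successor(self, key: str) -> str:
        """The server that stores `key`: the first server at or after the
        key's point, wrapping around to the start of the circle."""
        point = addressing(key)
        idx = bisect.bisect_left(self.ring, (point, ""))
        if idx == len(self.ring):
            idx = 0  # wrap around the circle
        return self.ring[idx][1]

ring = HashRing(["server1", "server2", "server3"])
owner = ring.successor("k1")  # stable: the same key always maps to the same server
```

Because only the ring positions matter, adding or removing one server moves only the keys between it and its predecessor, which is the point of consistent hashing.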
DISTRIBUTED STORAGE SYSTEM
ARCHITECTURE
Addressing – Virtual Nodes
Each server is assigned multiple node IDs for a more
even distribution and better load balancing
Server1: n1, n4, n6
Server2: n2, n7
Server3: n3, n5, n8
[Figure: key-space circle (starting at 0) with virtual nodes
n1–n8 of server1, server2, and server3 interleaved around
the ring]
DISTRIBUTED STORAGE SYSTEM
ARCHITECTURE
Data Replication
Data item k1 is stored on server1 as the master copy
and on server2 as a slave
[Figure: key-space circle with k1 hashed to server1
(master) and replicated to server2 (slave)]
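Replica placement can be sketched as: the master for a key is its successor on the circle, and each slave is the next server clockwise. That walk-the-ring rule is a common convention (used by Chord- and Dynamo-style systems); the slides do not spell out the exact rule, so treat this as an assumption. MD5 again stands in for Murmur.

```python
import hashlib

def addressing(name: str) -> int:
    """Hash a server name or data key to a point on the key-space circle."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def replicas(key, servers, n=2):
    """Return the n servers holding `key`: master first, then slave(s),
    walking clockwise around the circle."""
    ring = sorted(servers, key=addressing)
    point = addressing(key)
    # Index of the successor: first server at or after the key's point.
    start = next((i for i, s in enumerate(ring) if addressing(s) >= point), 0)
    return [ring[(start + i) % len(ring)] for i in range(n)]

master, slave = replicas("k1", ["server1", "server2", "server3"])
```

With n=2 the system survives the loss of either copy's server, at the cost of one extra write per update.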
4. CANONICAL PROBLEMS IN DISTRIBUTED
SYSTEMS
Distributed transactions: ACID (Atomicity,
Consistency, Isolation, Durability) requirements
Distributed data independence
Fault tolerance
Transparency
5. COMMON SOLUTIONS FOR CANONICAL
PROBLEMS IN DISTRIBUTED SYSTEMS
Atomicity and consistency with the Two-Phase
Commit protocol
Distributed data independence with a consistent-
hashing algorithm
Fault tolerance with leader election, multi-master
replication and data replication
Transparency with server routing: the client sees the
distributed system as a single server
TWO-PHASE COMMIT PROTOCOL
What is it?
Two-phase commit is a transaction protocol designed for
the complications that arise with distributed resource
managers.
Two-phase commit technology is used for hotel and
airline reservations, stock market transactions, banking
applications, and credit card systems.
With a two-phase commit protocol, the distributed
transaction manager employs a coordinator to manage
the individual resource managers. The commit process
proceeds as follows:
TWO-PHASE COMMIT PROTOCOL
Phase 1: Obtaining a Decision
Step 1: The coordinator Ci asks all participants to prepare
to commit transaction T.
Ci adds the record <prepare T> to the log and
forces the log to stable storage (a log is a file that
maintains a record of all changes to the database),
then sends prepare T messages to all sites where T
executed.
TWO-PHASE COMMIT PROTOCOL
Phase 1: Making a Decision
Step 2: Upon receiving the message, the transaction
manager at each site determines whether it can commit the transaction.
If not:
add a record <no T> to the log and send an abort T message to Ci
If the transaction can be committed, then:
1) add the record <ready T> to the log
2) force all records for T to stable storage
3) send a ready T message to Ci
TWO-PHASE COMMIT PROTOCOL
Phase 2: Recording the Decision
Step 1: T can be committed if Ci received a ready T
message from all the participating sites; otherwise T
must be aborted.
Step 2: The coordinator adds a decision record, <commit
T> or <abort T>, to the log and forces the record onto stable
storage. Once the record is in stable storage, it cannot
be revoked (even if failures occur).
Step 3: The coordinator sends a message to each
participant informing it of the decision (commit or abort).
Step 4: Participants take the appropriate action locally.
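The two phases above can be sketched with in-process participants. This is a toy model of the protocol's control flow only: real resource managers would force each log record to stable storage and exchange messages over the network.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.log = name, can_commit, []

    def prepare(self, t):
        # Phase 1: log the vote, then answer ready (True) or no (False).
        self.log.append(f"<ready {t}>" if self.can_commit else f"<no {t}>")
        return self.can_commit

    def finish(self, t, commit):
        # Phase 2: apply the coordinator's decision locally.
        self.log.append(f"<commit {t}>" if commit else f"<abort {t}>")

def two_phase_commit(t, participants):
    # Phase 1: the coordinator collects votes from all participants.
    votes = [p.prepare(t) for p in participants]
    decision = all(votes)  # commit only if every site voted ready
    # Phase 2: record and broadcast the decision; everyone obeys it.
    for p in participants:
        p.finish(t, decision)
    return decision

committed = two_phase_commit("T1", [Participant("s1"), Participant("s2")])
aborted = two_phase_commit("T2", [Participant("s1"),
                                  Participant("s2", can_commit=False)])
```

A single "no" vote aborts the whole transaction everywhere, which is exactly the atomicity guarantee, and also the availability cost discussed next.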
TWO-PHASE COMMIT PROTOCOL
Costs and Limitations
If one database server is unavailable, none of the
servers gets the updates.
This is correctable through network tuning and by
building the data distribution correctly with
database optimization techniques.
LEADER ELECTION
Some leader election algorithms that can be used: LCR
(LeLann-Chang-Roberts), Peterson, HS
(Hirschberg-Sinclair)
LEADER ELECTION
Bully Leader Election algorithm
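In the Bully algorithm, a node that notices the leader is gone sends ELECTION to every node with a higher ID; if no live higher node answers, it declares itself leader, otherwise a live higher node takes over and repeats, so the highest live ID always wins. A minimal sketch of that outcome (message passing and timeouts elided; node IDs are illustrative):

```python
def bully_elect(starter, ids, alive):
    """Simulate one Bully round started by `starter` (assumed alive).
    Returns the ID of the node that ends up as leader."""
    higher = [i for i in ids if i > starter and i in alive]
    if not higher:
        return starter  # no live higher node answered: starter bullies its way in
    # A live higher node answers, takes over the election, and repeats.
    return bully_elect(min(higher), ids, alive)

# Old leader (ID 5) has died; node 1 notices and starts an election.
leader = bully_elect(1, [1, 2, 3, 4, 5], alive={1, 2, 4})  # → 4
```

The trade-off: election is simple and needs no ring topology, but a round of O(n²) messages can occur when the lowest node starts it.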
MULTI-MASTER
Multi-master replication
Problems with multi-master replication
MULTI-MASTER
Solution: 2 candidate models:
Two-phase commit (always consistent)
Asynchronous data sync among multiple nodes
Stays active even if some nodes die
Faster than 2PC
MULTI-MASTER
Asynchronous data sync
Data is stored on the main master (called the sub-leader),
then posted to a queue to be synced to the other masters.
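The sub-leader path can be sketched with a background thread draining a sync queue. The class and method names are hypothetical, and plain dicts stand in for the remote master nodes; the point is the write path: acknowledge locally first, replicate asynchronously afterwards.

```python
import queue
import threading

class SubLeader:
    def __init__(self, other_masters):
        self.store = {}
        self.other_masters = other_masters  # dicts standing in for remote masters
        self.sync_queue = queue.Queue()
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def write(self, key, value):
        # 1) Store on the main master; the client is acknowledged here.
        self.store[key] = value
        # 2) Post to the queue; replication happens off the request path.
        self.sync_queue.put((key, value))

    def _sync_loop(self):
        while True:
            key, value = self.sync_queue.get()
            for m in self.other_masters:  # replicate to the other masters
                m[key] = value
            self.sync_queue.task_done()

replica = {}
leader = SubLeader([replica])
leader.write("k1", "v1")
leader.sync_queue.join()  # in tests only: wait for the async sync to drain
```

This is why the model is faster than 2PC and survives dead nodes, but reads on the other masters can briefly lag the sub-leader.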
MULTI-MASTER
Asynchronous data sync
[Figure: req1 goes to Server1 (the leader); req2 arrives at
Server2 and is forwarded to the leader; the leader posts the
data to the data queue for syncing]
UNIVERSALDISTRIBUTEDSTORAGE
a distributed storage system
6. UNIVERSALDISTRIBUTEDSTORAGE
UniversalDistributedStorage is a distributed
storage system developed to provide:
Distributed transactions (ACID)
Distributed data independence
Fault tolerance
Leader election (decides when a server node joins or leaves)
Replication with multi-master replication
Transparency
UNIVERSALDISTRIBUTEDSTORAGE
ARCHITECTURE
Overview
[Figure: three servers, each built from a Business Layer,
a Distributed Layer, and a Storage Layer]
UNIVERSALDISTRIBUTEDSTORAGE
ARCHITECTURE
Internal Overview
[Figure: architecture overview: client request(s) enter the
Business Layer; the Distributed Layer resolves dataLocate() /
dataRemote(), queuing remote calls; the Storage Layer answers
localData(), and result(s) flow back to the client]
UNIVERSALDISTRIBUTEDSTORAGE
FEATURE
Data hashing/addressing
Uses the Murmur hash function
UNIVERSALDISTRIBUTEDSTORAGE
FEATURE
Leader election
Uses the Bully Leader Election algorithm
UNIVERSALDISTRIBUTEDSTORAGE
FEATURE
Multi-master replication
Uses asynchronous data sync among server nodes
UNIVERSALDISTRIBUTEDSTORAGE
STATISTICS
System information:
3 machines: 8 GB RAM, Core i5 @ 3.2 GHz
LAN/WAN network
7 physical servers on the 3 machines above
Concurrent writes: 16,500,000 items in 3,680 s, rate ≈ 4,480 req/sec (measured at the client)
Concurrent reads: 16,500,000 items in 1,458 s, rate ≈ 11,320 req/sec (measured at the client)
* This is not the limit of the system; the limit is at the clients (this test used 3 client threads)
Q & A
Contact:
Duong Cong Loi
https://www.facebook.com/duongcong.loi
7. APPENDIX
APPENDIX - 001
How to join/leave server(s)
1. A joining/leaving server sends its request to any server
2. The receiving server forwards the join/leave to the leader server
3. The leader processes the join/leave
4. The leader broadcasts the result to all servers (Server A, Server B, Server C)
APPENDIX - 002
How to move data when server(s) join/leave
Determine the appropriate data for the move
Move the data asynchronously on a background thread,
and throttle the speed of the move
APPENDIX - 003
How to detect that the leader or a sub-leader has died
Easily detected by polling the connection
APPENDIX - 004
How to make multiple virtual nodes for one server
Easily generate multiple virtual nodes for one server
by hashing the server name
Ex:
to make 200 virtual nodes for server ‘photoTokyo’,
use the hash values of: photoTokyo1, photoTokyo2, …,
photoTokyo200
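The example above can be sketched directly; MD5 from the standard library stands in for the Murmur function the system actually uses.

```python
import hashlib

def virtual_nodes(server_name, count=200):
    """Hash server_name1 .. server_nameN to N points on the key space."""
    return [int(hashlib.md5(f"{server_name}{i}".encode()).hexdigest(), 16)
            for i in range(1, count + 1)]

# 200 ring positions for one physical server, e.g. 'photoTokyo'.
points = virtual_nodes("photoTokyo", 200)
```

Because the suffixes produce unrelated hash values, the 200 points scatter roughly uniformly around the circle, which is what gives one server an even share of the key space.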
APPENDIX - 005
For fast data moves
Use a Bloom filter to detect whether the hash value of a
data key exists
Use a local store that holds all data keys for this
server
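A minimal Bloom filter sketch: a bit array plus k positions derived from a salted MD5 digest. The size and hash-count parameters are illustrative, not the system's actual tuning; a production filter would size the array from the expected key count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive k bit positions by salting one MD5 hash with the index.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("key-123")
```

During a data move this lets a node skip most absent keys in O(1) memory, falling back to the local key store only when the filter says "maybe".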
APPENDIX - 006
How to avoid hanging network connections
Use a client connection pool with a screening strategy
beforehand; this avoids many connections hanging when
making remote calls over the network between two servers
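The pool-with-screening idea can be sketched as follows. All names here are hypothetical, and `is_alive()` stands in for whatever health probe the real system uses before handing a connection to a caller.

```python
import queue

class ConnectionPool:
    def __init__(self, factory, size=4):
        self.factory = factory  # callable that creates a new connection
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(factory())

    def acquire(self, timeout=1.0):
        # Screening strategy: check the connection before handing it out,
        # replacing a dead one instead of letting the caller hang on it.
        conn = self.pool.get(timeout=timeout)
        if not conn.is_alive():
            conn = self.factory()
        return conn

    def release(self, conn):
        self.pool.put(conn)

class FakeConn:
    """Stand-in for a real socket/RPC connection."""
    def __init__(self, alive=True):
        self.alive = alive
    def is_alive(self):
        return self.alive

pool = ConnectionPool(lambda: FakeConn(), size=2)
conn = pool.acquire()
pool.release(conn)
```

The bounded queue also caps the number of simultaneous connections between two servers, which is the other half of avoiding hangs under load.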