DESCRIPTION

An architecture document for building a distributed storage system, together with a sample distributed storage system from the author.


Page 1: Distribute Storage System May-2014

DISTRIBUTED STORAGE SYSTEM

Mr. Dương Công Lợi

Company: VNG-Corp

Tel: +84989510016 / +84908522017

Email: [email protected] / [email protected]

Page 2: Distribute Storage System May-2014

CONTENTS

1. What is a distributed-computing system?

2. Principles of distributed database/storage systems

3. Distributed storage system paradigm

4. Canonical problems in distributed systems

5. Common solutions for canonical problems in distributed systems

6. UniversalDistributedStorage

7. Appendix

Page 3: Distribute Storage System May-2014

1. WHAT IS A DISTRIBUTED-COMPUTING SYSTEM?

Distributed computing is the process of solving a computational problem using a distributed system.

A distributed system is a computing system in which a number of components on multiple computers cooperate by communicating over a network to achieve a common goal.

Page 4: Distribute Storage System May-2014

DISTRIBUTED DATABASE/STORAGE SYSTEM

In a distributed database system, the database is stored on several computers.

A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.

Page 5: Distribute Storage System May-2014

DISTRIBUTED SYSTEM ADVANTAGES

Advantages

Avoids bottlenecks and single points of failure

Better scalability

Better availability

Routing models

Client routing: the client sends its request directly to the appropriate server to read/write data

Server routing: a server forwards the client's request to the appropriate server and sends the result back to the client

* The two models above can be combined in one system

Page 6: Distribute Storage System May-2014

DISTRIBUTED STORAGE SYSTEM

Store the data {1,2,3,4,6,7,8} on 1 server:

{1,2,3,4,6,7,8}

Or store it across 3 distributed servers:

{1,2,3}  {4,6}  {7,8}

Page 7: Distribute Storage System May-2014

2. PRINCIPLES OF DISTRIBUTED DATABASE/STORAGE SYSTEMS

Shard each data key and store it on the appropriate server using a Distributed Hash Table (DHT); see the sketch after this list

The DHT hash function must support consistent hashing:

Uniform distribution of generated values

Consistent

Jenkins and Murmur are good choices; others such as MD5 and SHA are slower
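As a concrete illustration, a minimal sketch of the addressing step in Python, assuming the third-party mmh3 package for MurmurHash3 (the deck recommends Murmur but names no specific library); the address() helper and the 32-bit key space are illustrative choices, not part of the original deck:

    import mmh3  # third-party MurmurHash3 bindings (pip install mmh3); an assumption

    def address(key: str) -> int:
        """Map a data key or server name to a position in the 32-bit key space."""
        return mmh3.hash(key, signed=False)

    print(address("user:42"))  # position of a data key
    print(address("server1"))  # position of a server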

Page 8: Distribute Storage System May-2014

3. DISTRIBUTED STORAGE SYSTEM PARADIGM

Data Hashing/Addressing

Determine which server the data is stored on

Data Replication

Store data on multiple server nodes for higher availability and fault tolerance

Page 9: Distribute Storage System May-2014

DISTRIBUTED STORAGE SYSTEM ARCHITECTURE

Data Hashing/Addressing

Use the DHT to map each server (by its server name) to a number, placing it on a circle called the key space

Use the DHT to address each data key k and find the server that stores it by successor(k) = ceiling(address(k)) (see the sketch below)

successor(k): the server that stores k

[Diagram: key-space ring starting at 0 with server1, server2, and server3 placed on the circle]
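A minimal sketch of the successor lookup, reusing the illustrative address() helper from the previous sketch; server names are placeholders:

    import bisect
    import mmh3

    def address(key: str) -> int:
        return mmh3.hash(key, signed=False)  # position on the 32-bit ring

    class Ring:
        def __init__(self, servers):
            # (position, server) pairs sorted by position around the circle
            self.nodes = sorted((address(s), s) for s in servers)
            self.positions = [p for p, _ in self.nodes]

        def successor(self, key: str) -> str:
            """successor(k) = ceiling(address(k)): first server at or past k, wrapping past 0."""
            i = bisect.bisect_left(self.positions, address(key))
            return self.nodes[i % len(self.nodes)][1]

    ring = Ring(["server1", "server2", "server3"])
    print(ring.successor("k1"))  # the server that stores k1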

Page 10: Distribute Storage System May-2014

DISTRIBUTED STORAGE SYSTEM ARCHITECTURE

Addressing – Virtual nodes

Each server node is given multiple node IDs for a more even distribution and better load balancing (see the sketch below)

Server1: n1, n4, n6

Server2: n2, n7

Server3: n3, n5, n8

[Diagram: the same ring with virtual nodes n1–n8 spread around the circle, each owned by one of the three servers]
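A sketch of the virtual-node variant, generating node IDs by hashing the server name plus an index as Appendix 004 describes; the count of 3 virtual nodes per server is illustrative:

    import bisect
    import mmh3

    def address(key: str) -> int:
        return mmh3.hash(key, signed=False)

    class VNodeRing:
        """Ring where each physical server owns several virtual positions."""
        def __init__(self, servers, vnodes=3):
            self.nodes = sorted(
                (address(f"{s}{i}"), s)
                for s in servers
                for i in range(1, vnodes + 1)
            )
            self.positions = [p for p, _ in self.nodes]

        def successor(self, key: str) -> str:
            i = bisect.bisect_left(self.positions, address(key))
            return self.nodes[i % len(self.nodes)][1]

    ring = VNodeRing(["server1", "server2", "server3"])
    print(ring.successor("user:42"))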

Page 11: Distribute Storage System May-2014

DISTRIBUTED STORAGE SYSTEM ARCHITECTURE

Data Replication

Data k1 is stored on server1 as master and on server2 as slave (see the sketch below)

[Diagram: the key-space ring with key k1 hashed to server1, which holds the master copy; server2 holds the slave copy]
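The deck does not spell out how the slave is chosen; a common scheme, sketched here as an assumption, places the master at successor(k) and the slave(s) at the next distinct servers around the ring:

    import bisect
    import mmh3

    def address(key: str) -> int:
        return mmh3.hash(key, signed=False)

    def replicas(servers, key, n=2):
        """Return [master, slave, ...]: the n distinct servers clockwise from the key."""
        nodes = sorted((address(s), s) for s in servers)
        positions = [p for p, _ in nodes]
        i = bisect.bisect_left(positions, address(key))
        out = []
        for j in range(len(nodes)):
            s = nodes[(i + j) % len(nodes)][1]
            if s not in out:
                out.append(s)
            if len(out) == n:
                break
        return out

    print(replicas(["server1", "server2", "server3"], "k1"))  # e.g. ['server1', 'server2']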

Page 12: Distribute Storage System May-2014

4. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS

Distributed transactions: the ACID (Atomicity, Consistency, Isolation, Durability) requirements

Distributed data independence

Fault tolerance

Transparency

Page 13: Distribute Storage System May-2014

5. COMMON SOLUTIONS FOR CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS

Atomicity and Consistency with the Two-Phase Commit protocol

Distributed data independence with a consistent hashing algorithm

Fault tolerance with leader election, multi-master replication, and data replication

Transparency with server routing: the client sees the distributed system as a single server

Page 14: Distribute Storage System May-2014

TWO-PHASE COMMIT PROTOCOL

What is it?

Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers.

Two-phase commit technology is used for hotel and airline reservations, stock market transactions, banking applications, and credit card systems.

With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers. The commit process proceeds as follows:

Page 15: Distribute Storage System May-2014

TWO-PHASE COMMIT PROTOCOL

Phase 1: Obtaining a Decision

Step 1: The coordinator Ci asks all participants to prepare to commit transaction Ti.

Ci adds the record <prepare T> to the log and forces the log to stable storage (a log is a file which maintains a record of all changes to the database), then sends prepare T messages to all sites where T executed.

Page 16: Distribute Storage System May-2014

TWO-PHASE COMMIT PROTOCOL

Phase 1: Making a Decision

Step 2: Upon receiving the message, the transaction manager at each site determines whether it can commit the transaction.

If not:

add a record <no T> to the log and send an abort T message to Ci

If the transaction can be committed, then:

1) add the record <ready T> to the log

2) force all records for T to stable storage

3) send a ready T message to Ci

Page 17: Distribute Storage System May-2014

TWO-PHASE COMMIT PROTOCOL

Phase 2: Recording the Decision

Step 1: T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.

Step 2: The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record is in stable storage, it cannot be revoked (even if failures occur).

Step 3: The coordinator sends a message to each participant informing it of the decision (commit or abort).

Step 4: Participants take the appropriate action locally (see the sketch below).
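A single-process sketch of the decision logic above; Participant and the in-memory logs are simplified stand-ins for real resource managers writing to stable storage and talking over a network:

    class Participant:
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit = name, can_commit
            self.log = []  # stand-in for a log forced to stable storage

        def prepare(self, t):  # Phase 1, step 2
            if self.can_commit:
                self.log.append(f"<ready {t}>")
                return "ready"
            self.log.append(f"<no {t}>")
            return "abort"

        def finish(self, t, decision):  # Phase 2, step 4
            self.log.append(f"<{decision} {t}>")

    def two_phase_commit(coordinator_log, participants, t):
        coordinator_log.append(f"<prepare {t}>")                 # Phase 1, step 1
        votes = [p.prepare(t) for p in participants]
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        coordinator_log.append(f"<{decision} {t}>")              # Phase 2, step 2: point of no return
        for p in participants:
            p.finish(t, decision)                                # Phase 2, steps 3-4
        return decision

    log = []
    sites = [Participant("site1"), Participant("site2", can_commit=False)]
    print(two_phase_commit(log, sites, "T1"))  # -> abort (one site voted no)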

Page 18: Distribute Storage System May-2014
Page 19: Distribute Storage System May-2014

TWO-PHASE COMMIT PROTOCOL

Costs and Limitations

If one database server is unavailable, none of the servers gets the updates.

This is correctable through network tuning and by correctly building the data distribution with database optimization techniques.

Page 20: Distribute Storage System May-2014

LEADER ELECTION

Some leader election algorithm can use: LCR

(LeLann-Chang-Roberts), Pitterson, HS

(Hirschberg-Sinclair)

Page 21: Distribute Storage System May-2014

LEADER ELECTION

Bully Leader Election algorithm
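The deck leaves the algorithm itself to diagrams (pages 22-23); as a compact sketch of the rule it implements, the highest-id reachable node wins. The node ids and alive set below are illustrative stand-ins for processes exchanging ELECTION/OK/COORDINATOR messages:

    def bully_election(starter: int, alive: set) -> int:
        """Return the elected leader as seen from `starter`."""
        higher = [n for n in alive if n > starter]
        if not higher:
            return starter  # no higher node answered: broadcast COORDINATOR
        # Each live higher node takes over the election in turn, so the
        # highest live id always ends up as leader.
        return max(higher)

    print(bully_election(2, alive={1, 2, 4, 5}))  # -> 5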

Page 22: Distribute Storage System May-2014
Page 23: Distribute Storage System May-2014

MULTI MASTER

Multi-master replication

Problems of multi-master replication (e.g., conflicting concurrent writes)

Page 24: Distribute Storage System May-2014

MULTI MASTER

Solution: two candidate models:

Two-phase commit (always consistent)

Asynchronous data sync among multiple nodes:

Stays active even if some nodes die

Faster than 2PC

Page 25: Distribute Storage System May-2014

MULTI MASTER

Asynchronous data sync

Data is stored on the main master (called the sub-leader), then posted to a queue to be synced to the other masters.

Page 26: Distribute Storage System May-2014

MULTI MASTER

Asynchronous data sync (see the sketch below)

[Diagram: Server1 (the leader) handles req1 directly; Server2 forwards req2 to Server1; Server1 posts writes to the data queue, which syncs them to Server2]
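A threaded sketch of this mechanism: writes are acknowledged on the sub-leader and drained to the other masters in the background; queue.Queue and the Master class are illustrative stand-ins for the real replication queue and remote servers:

    import queue
    import threading
    import time

    class Master:
        def __init__(self):
            self.store = {}

    class SubLeader(Master):
        """Main master: acknowledges writes locally, syncs peers via a queue."""
        def __init__(self, peers):
            super().__init__()
            self.peers = peers
            self.q = queue.Queue()
            threading.Thread(target=self._drain, daemon=True).start()

        def write(self, key, value):
            self.store[key] = value   # local write is the ack point
            self.q.put((key, value))  # replication happens asynchronously

        def _drain(self):
            while True:
                key, value = self.q.get()
                for peer in self.peers:
                    peer.store[key] = value  # a real system would retry on failure

    other = Master()
    leader = SubLeader(peers=[other])
    leader.write("k1", "v1")  # acked immediately
    time.sleep(0.1)           # give the background drain a moment in this toy demo
    print(other.store)        # {'k1': 'v1'}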

Page 27: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE

a distributed storage system

Page 28: Distribute Storage System May-2014

6. UNIVERSALDISTRIBUTEDSTORAGE

UniversalDistributedStorage is a distributed storage system developed to provide:

Distributed transactions (ACID)

Distributed data independence

Fault tolerance

Leader election (decides whether server nodes may join or leave)

Replication with multi-master replication

Transparency

Page 29: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE ARCHITECTURE

Overview

[Diagram: the system is a set of servers, each a stack of three layers: Business Layer, Distributed Layer, and Storage Layer]

Page 30: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE ARCHITECTURE

Internal Overview

[Diagram: client request(s) enter the Business Layer; the Distributed Layer resolves them via dataLocate()/dataRemote() and forwards remote requests through queuing; the Storage Layer serves local data via localData(); result(s) flow back up through the layers]

Page 31: Distribute Storage System May-2014

ARCHITECTURE OVERVIEW

Page 32: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE FEATURES

Data hashing/addressing

Use the Murmur hash function

Page 33: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE FEATURES

Leader election

Use the Bully Leader Election algorithm

Page 34: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE FEATURES

Multi-master replication

Use asynchronous data sync among server nodes

Page 35: Distribute Storage System May-2014

UNIVERSALDISTRIBUTEDSTORAGE STATISTICS

System information:

3 machines: 8 GB RAM, Core i5 3.22 GHz

LAN/WAN network

7 physical servers on the 3 machines above

Concurrent writes: 16,500,000 items in 3,680 s, rate ≈ 4,480 req/sec (measured at the clients)

Concurrent reads: 16,500,000 items in 1,458 s, rate ≈ 11,320 req/sec (measured at the clients)

* This is not the limit of the system; the limit is at the clients (this test used 3 client threads)

Page 37: Distribute Storage System May-2014

7. APPENDIX

Page 38: Distribute Storage System May-2014

APPENDIX - 001

How server(s) join/leave (see the sketch below)

1. A server sends a join/leave request to any server

2. That server forwards the join/leave request to the leader server

3. The leader processes the join/leave

4. The leader broadcasts the result to all servers (Server A, Server B, Server C)
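A minimal sketch of this membership flow; Cluster and its in-memory member set stand in for real servers exchanging the four messages above:

    class Cluster:
        def __init__(self, leader, members):
            self.leader = leader
            self.members = set(members)

        def request(self, op, server):
            """Steps 1-2: any server receives the request and forwards it to the leader."""
            self._leader_process(op, server)

        def _leader_process(self, op, server):  # step 3
            if op == "join":
                self.members.add(server)
            else:
                self.members.discard(server)
            self._broadcast()

        def _broadcast(self):                   # step 4
            for m in sorted(self.members):
                print(f"notify {m}: members={sorted(self.members)}")

    c = Cluster("ServerA", ["ServerA", "ServerB", "ServerC"])
    c.request("join", "ServerD")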

Page 39: Distribute Storage System May-2014

APPENDIX - 002

How to move data when server(s) join/leave

Determine the appropriate data for the move

Move the data asynchronously on a thread, and control the speed of the move (see the sketch below)
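A sketch of the throttled background move, assuming a simple items-per-second cap; move_item is a placeholder for the real transfer call:

    import threading
    import time

    def move_data(items, move_item, max_per_sec=1000):
        """Migrate items to their new owner without saturating the network."""
        for i, item in enumerate(items, 1):
            move_item(item)
            if i % max_per_sec == 0:
                time.sleep(1)  # crude rate control; real code would measure elapsed time

    # The slide runs this on a background thread:
    t = threading.Thread(target=move_data, args=(range(2500), lambda item: None))
    t.start()
    t.join()  # ~2 s with the 1000 items/sec cap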

Page 40: Distribute Storage System May-2014

APPENDIX - 003

How to detect that the leader or a sub-leader has died

Easily detected by polling the connection (see the sketch below)
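A sketch of the polling check, assuming a plain TCP connect as the probe; on repeated failures a node would start a new election:

    import socket

    def leader_alive(host: str, port: int, timeout: float = 1.0) -> bool:
        """Poll the leader by opening a connection; False means it looks dead."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False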

Page 41: Distribute Storage System May-2014

APPENDIX - 004

How to create multiple virtual nodes for one server

Simply generate multiple virtual nodes for one server by hashing the server name plus an index

Ex:

to create 200 virtual nodes for the server ‘photoTokyo’:

use the hash values of photoTokyo1, photoTokyo2, …, photoTokyo200 (see the code below)
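The same example as code, reusing the illustrative address() helper from the earlier sketches:

    import mmh3

    def address(key: str) -> int:
        return mmh3.hash(key, signed=False)

    # 200 ring positions for the single physical server 'photoTokyo'
    positions = [address(f"photoTokyo{i}") for i in range(1, 201)]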

Page 42: Distribute Storage System May-2014

APPENDIX - 005

For fast data moving

Use a Bloom filter to detect whether the hash value of a data key exists (see the sketch below)

Use a storage that holds all data keys for the local server
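A minimal Bloom filter sketch for the fast existence check; the bit-array size and hash count are illustrative, and MD5 stands in for whatever hash the real system uses:

    import hashlib

    class BloomFilter:
        def __init__(self, size=1 << 20, hashes=3):
            self.size, self.hashes = size, hashes
            self.bits = bytearray(size // 8)

        def _positions(self, key: str):
            for i in range(self.hashes):
                h = hashlib.md5(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, key: str):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, key: str):
            # May report false positives, never false negatives: a "no"
            # means the key is definitely not stored on this server.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    bf = BloomFilter()
    bf.add("k1")
    print("k1" in bf, "k2" in bf)  # True False (k2 could rarely be a false positive)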

Page 43: Distribute Storage System May-2014

APPENDIX - 006

How to avoid hanging network connections

Use a client connection pool with a screening strategy before use; this avoids many connections hanging when calling remote servers over the network (see the sketch below)
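A sketch of a screened pool; Conn and is_alive() are hypothetical stand-ins for the real client connection and its health check:

    import queue

    class Conn:
        """Hypothetical stand-in for a real client connection."""
        def is_alive(self) -> bool:
            return True  # a real check would ping the remote server

    class ConnectionPool:
        def __init__(self, factory, size=8):
            self.factory = factory
            self.pool = queue.Queue()
            for _ in range(size):
                self.pool.put(factory())

        def acquire(self):
            conn = self.pool.get()
            if not conn.is_alive():    # screening step before handing out
                conn = self.factory()  # replace dead connections instead of hanging a call
            return conn

        def release(self, conn):
            self.pool.put(conn)

    pool = ConnectionPool(Conn)
    c = pool.acquire()
    pool.release(c)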