DISTRIBUTED STORAGE
SYSTEM
Mr. Dương Công Lợi
Company: VNG-Corp
Tel: +84989510016/+84908522017
CONTENTS
1. What is a distributed-computing system?
2. Principles of distributed database/storage
systems
3. Distributed storage system paradigm
4. Canonical problems in distributed systems
5. Common solutions for canonical problems in
distributed systems
6. UniversalDistributedStorage
7. Appendix
1. WHAT IS A DISTRIBUTED-COMPUTING
SYSTEM?
Distributed-Computing is the process of solving a
computational problem using a distributed
system.
A distributed system is a computing system in
which a number of components on multiple
computers cooperate by communicating over a
network to achieve a common goal.
DISTRIBUTED DATABASE/STORAGE
SYSTEM
In a distributed database system, the database is
stored on several computers.
A distributed database is a collection of multiple,
logically interrelated databases distributed over a
computer network.
DISTRIBUTED SYSTEM ADVANTAGES
Advantages
Avoid bottlenecks & single points of failure
Greater scalability
Higher availability
Routing models
Client routing: the client sends its request to the appropriate
server to read/write data
Server routing: a server forwards the client's request to the
appropriate server and returns the result to the client
* The two models above can be combined in one system
DISTRIBUTED STORAGE SYSTEM
Store the data {1,2,3,4,6,7,8} on a single server:
{1,2,3,4,6,7,8}
Or spread it across 3 distributed servers:
{1,2,3}  {4,6}  {7,8}
2. PRINCIPLE OF DISTRIBUTED
DATABASE/STORAGE SYSTEM
Shard each data key and store it on the appropriate
server using a Distributed Hash Table (DHT)
The DHT hash function must suit consistent hashing:
Uniform distribution of generated values
Consistent
Jenkins and Murmur are good choices; others such as
MD5 and SHA are slower
3. DISTRIBUTED STORAGE SYSTEM
PARADIGM
Data Hashing/Addressing
Determine which server the data is stored on
Data Replication
Store data on multiple server nodes for higher
availability and fault tolerance
DISTRIBUTED STORAGE SYSTEM
ARCHITECTURE
Data Hashing/Addressing
Use the DHT to map each server (by server name) to a
number, placing it on a circle called the key
space
Use the DHT to hash each data key and find the server
that stores it by successor(k) = ceiling(addressing(k))
successor(k): the server that stores k
[Figure: key-space circle (starting at 0) with server1,
server2, and server3 placed at their hashed positions]
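The circle and the successor(k) lookup can be sketched as follows. This is a minimal illustration, not the system's actual code: it uses MD5 from the Python standard library in place of the Murmur function the deck recommends, and the server names and key are made up.

```python
import bisect
import hashlib

def addressing(name: str) -> int:
    """Hash a server name or data key to a point on the key-space circle."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Sort server positions so successor() is a binary search.
        self.ring = sorted((addressing(s), s) for s in servers)

    def successor(self, key: str) -> str:
        """The server that stores `key`: the first server at or after the
        key's point, wrapping around to the start of the circle."""
        point = addressing(key)
        idx = bisect.bisect_left(self.ring, (point, ""))
        if idx == len(self.ring):
            idx = 0  # wrap around the circle
        return self.ring[idx][1]

ring = HashRing(["server1", "server2", "server3"])
owner = ring.successor("k1")  # stable: the same key always maps to the same server
```

Because only the ring positions matter, adding or removing one server moves only the keys between it and its predecessor, which is the point of consistent hashing.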
DISTRIBUTED STORAGE SYSTEM
ARCHITECTURE
Addressing – Virtual Nodes
Each server is assigned multiple node IDs for a more
even distribution and better load balancing
Server1: n1, n4, n6
Server2: n2, n7
Server3: n3, n5, n8
[Figure: key-space circle (starting at 0) with virtual nodes
n1–n8 of server1, server2, and server3 interleaved around
the ring]
DISTRIBUTED STORAGE SYSTEM
ARCHITECTURE
Data Replication
Data item k1 is stored on server1 as the master copy
and on server2 as a slave
[Figure: key-space circle with k1 hashed to server1
(master) and replicated to server2 (slave)]
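Replica placement can be sketched as: the master for a key is its successor on the circle, and each slave is the next server clockwise. That walk-the-ring rule is a common convention (used by Chord- and Dynamo-style systems); the slides do not spell out the exact rule, so treat this as an assumption. MD5 again stands in for Murmur.

```python
import hashlib

def addressing(name: str) -> int:
    """Hash a server name or data key to a point on the key-space circle."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def replicas(key, servers, n=2):
    """Return the n servers holding `key`: master first, then slave(s),
    walking clockwise around the circle."""
    ring = sorted(servers, key=addressing)
    point = addressing(key)
    # Index of the successor: first server at or after the key's point.
    start = next((i for i, s in enumerate(ring) if addressing(s) >= point), 0)
    return [ring[(start + i) % len(ring)] for i in range(n)]

master, slave = replicas("k1", ["server1", "server2", "server3"])
```

With n=2 the system survives the loss of either copy's server, at the cost of one extra write per update.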
4. CANONICAL PROBLEMS IN DISTRIBUTED
SYSTEMS
Distributed transactions: ACID (Atomicity,
Consistency, Isolation, Durability) requirements
Distributed data independence
Fault tolerance
Transparency
5. COMMON SOLUTIONS FOR CANONICAL
PROBLEMS IN DISTRIBUTED SYSTEMS
Atomicity and consistency with the Two-Phase
Commit protocol
Distributed data independence with a consistent-
hashing algorithm
Fault tolerance with leader election, multi-master
replication and data replication
Transparency with server routing: the client sees the
distributed system as a single server
TWO-PHASE COMMIT PROTOCOL
What is it?
Two-phase commit is a transaction protocol designed for
the complications that arise with distributed resource
managers.
Two-phase commit technology is used for hotel and
airline reservations, stock market transactions, banking
applications, and credit card systems.
With a two-phase commit protocol, the distributed
transaction manager employs a coordinator to manage
the individual resource managers. The commit process
proceeds as follows:
TWO-PHASE COMMIT PROTOCOL
Phase 1: Obtaining a Decision
Step 1: The coordinator Ci asks all participants to prepare
to commit transaction T.
Ci adds the record <prepare T> to the log and
forces the log to stable storage (a log is a file that
maintains a record of all changes to the database),
then sends prepare T messages to all sites where T
executed.
TWO-PHASE COMMIT PROTOCOL
Phase 1: Making a Decision
Step 2: Upon receiving the message, the transaction
manager at each site determines whether it can commit the transaction.
If not:
add a record <no T> to the log and send an abort T message to Ci
If the transaction can be committed, then:
1) add the record <ready T> to the log
2) force all records for T to stable storage
3) send a ready T message to Ci
TWO-PHASE COMMIT PROTOCOL
Phase 2: Recording the Decision
Step 1: T can be committed if Ci received a ready T
message from all the participating sites; otherwise T
must be aborted.
Step 2: The coordinator adds a decision record, <commit
T> or <abort T>, to the log and forces the record onto stable
storage. Once the record is in stable storage, it cannot
be revoked (even if failures occur).
Step 3: The coordinator sends a message to each
participant informing it of the decision (commit or abort).
Step 4: Participants take the appropriate action locally.
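The two phases above can be sketched with in-process participants. This is a toy model of the protocol's control flow only: real resource managers would force each log record to stable storage and exchange messages over the network.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.log = name, can_commit, []

    def prepare(self, t):
        # Phase 1: log the vote, then answer ready (True) or no (False).
        self.log.append(f"<ready {t}>" if self.can_commit else f"<no {t}>")
        return self.can_commit

    def finish(self, t, commit):
        # Phase 2: apply the coordinator's decision locally.
        self.log.append(f"<commit {t}>" if commit else f"<abort {t}>")

def two_phase_commit(t, participants):
    # Phase 1: the coordinator collects votes from all participants.
    votes = [p.prepare(t) for p in participants]
    decision = all(votes)  # commit only if every site voted ready
    # Phase 2: record and broadcast the decision; everyone obeys it.
    for p in participants:
        p.finish(t, decision)
    return decision

committed = two_phase_commit("T1", [Participant("s1"), Participant("s2")])
aborted = two_phase_commit("T2", [Participant("s1"),
                                  Participant("s2", can_commit=False)])
```

A single "no" vote aborts the whole transaction everywhere, which is exactly the atomicity guarantee, and also the availability cost discussed next.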
TWO-PHASE COMMIT PROTOCOL
Costs and Limitations
If one database server is unavailable, none of the
servers gets the updates.
This is correctable through network tuning and by
building the data distribution correctly with
database optimization techniques.
LEADER ELECTION
Some leader election algorithms that can be used: LCR
(LeLann-Chang-Roberts), Peterson, HS
(Hirschberg-Sinclair)
LEADER ELECTION
Bully Leader Election algorithm
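In the Bully algorithm, a node that notices the leader is gone sends ELECTION to every node with a higher ID; if no live higher node answers, it declares itself leader, otherwise a live higher node takes over and repeats, so the highest live ID always wins. A minimal sketch of that outcome (message passing and timeouts elided; node IDs are illustrative):

```python
def bully_elect(starter, ids, alive):
    """Simulate one Bully round started by `starter` (assumed alive).
    Returns the ID of the node that ends up as leader."""
    higher = [i for i in ids if i > starter and i in alive]
    if not higher:
        return starter  # no live higher node answered: starter bullies its way in
    # A live higher node answers, takes over the election, and repeats.
    return bully_elect(min(higher), ids, alive)

# Old leader (ID 5) has died; node 1 notices and starts an election.
leader = bully_elect(1, [1, 2, 3, 4, 5], alive={1, 2, 4})  # → 4
```

The trade-off: election is simple and needs no ring topology, but a round of O(n²) messages can occur when the lowest node starts it.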
MULTI-MASTER
Multi-master replication
Problems with multi-master replication
MULTI-MASTER
Solution: 2 candidate models:
Two-phase commit (always consistent)
Asynchronous data sync among multiple nodes
Stays active even if some nodes die
Faster than 2PC
MULTI-MASTER
Asynchronous data sync
Data is stored on the main master (called the sub-leader),
then posted to a queue to be synced to the other masters.
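The sub-leader path can be sketched with a background thread draining a sync queue. The class and method names are hypothetical, and plain dicts stand in for the remote master nodes; the point is the write path: acknowledge locally first, replicate asynchronously afterwards.

```python
import queue
import threading

class SubLeader:
    def __init__(self, other_masters):
        self.store = {}
        self.other_masters = other_masters  # dicts standing in for remote masters
        self.sync_queue = queue.Queue()
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def write(self, key, value):
        # 1) Store on the main master; the client is acknowledged here.
        self.store[key] = value
        # 2) Post to the queue; replication happens off the request path.
        self.sync_queue.put((key, value))

    def _sync_loop(self):
        while True:
            key, value = self.sync_queue.get()
            for m in self.other_masters:  # replicate to the other masters
                m[key] = value
            self.sync_queue.task_done()

replica = {}
leader = SubLeader([replica])
leader.write("k1", "v1")
leader.sync_queue.join()  # in tests only: wait for the async sync to drain
```

This is why the model is faster than 2PC and survives dead nodes, but reads on the other masters can briefly lag the sub-leader.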
MULTI-MASTER
Asynchronous data sync
[Figure: req1 goes to Server1 (the leader); req2 arrives at
Server2 and is forwarded to the leader; the leader posts the
data to the data queue for syncing]
UNIVERSALDISTRIBUTEDSTORAGE
a distributed storage system
6. UNIVERSALDISTRIBUTEDSTORAGE
UniversalDistributedStorage is a distributed
storage system developed to provide:
Distributed transactions (ACID)
Distributed data independence
Fault tolerance
Leader election (decides when a server node joins or leaves)
Replication with multi-master replication
Transparency
UNIVERSALDISTRIBUTEDSTORAGE
ARCHITECTURE
Overview
[Figure: three servers, each built from a Business Layer,
a Distributed Layer, and a Storage Layer]
UNIVERSALDISTRIBUTEDSTORAGE
ARCHITECTURE
Internal Overview
[Figure: architecture overview: client request(s) enter the
Business Layer; the Distributed Layer resolves dataLocate() /
dataRemote(), queuing remote calls; the Storage Layer answers
localData(), and result(s) flow back to the client]
UNIVERSALDISTRIBUTEDSTORAGE
FEATURE
Data hashing/addressing
Uses the Murmur hash function
UNIVERSALDISTRIBUTEDSTORAGE
FEATURE
Leader election
Uses the Bully Leader Election algorithm
UNIVERSALDISTRIBUTEDSTORAGE
FEATURE
Multi-master replication
Uses asynchronous data sync among server nodes
UNIVERSALDISTRIBUTEDSTORAGE
STATISTICS
System information:
3 machines: 8 GB RAM, Core i5 @ 3.2 GHz
LAN/WAN network
7 physical servers on the 3 machines above
Concurrent writes: 16,500,000 items in 3,680 s, rate ≈ 4,480 req/sec (measured at the client)
Concurrent reads: 16,500,000 items in 1,458 s, rate ≈ 11,320 req/sec (measured at the client)
* This is not the limit of the system; the limit is at the clients (this test used 3 client threads)
Q & A
Contact:
Duong Cong Loi
https://www.facebook.com/duongcong.loi
7. APPENDIX
APPENDIX - 001
How to join/leave server(s)
1. A joining/leaving server sends its request to any server
2. The receiving server forwards the join/leave to the leader server
3. The leader processes the join/leave
4. The leader broadcasts the result to all servers (Server A, Server B, Server C)
APPENDIX - 002
How to move data when server(s) join/leave
Determine the appropriate data for the move
Move the data asynchronously on a background thread,
and throttle the speed of the move
APPENDIX - 003
How to detect that the leader or a sub-leader has died
Easily detected by polling the connection
APPENDIX - 004
How to make multiple virtual nodes for one server
Easily generate multiple virtual nodes for one server
by hashing the server name
Ex:
to make 200 virtual nodes for server ‘photoTokyo’,
use the hash values of: photoTokyo1, photoTokyo2, …,
photoTokyo200
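The example above can be sketched directly; MD5 from the standard library stands in for the Murmur function the system actually uses.

```python
import hashlib

def virtual_nodes(server_name, count=200):
    """Hash server_name1 .. server_nameN to N points on the key space."""
    return [int(hashlib.md5(f"{server_name}{i}".encode()).hexdigest(), 16)
            for i in range(1, count + 1)]

# 200 ring positions for one physical server, e.g. 'photoTokyo'.
points = virtual_nodes("photoTokyo", 200)
```

Because the suffixes produce unrelated hash values, the 200 points scatter roughly uniformly around the circle, which is what gives one server an even share of the key space.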
APPENDIX - 005
For fast data moves
Use a Bloom filter to detect whether the hash value of a
data key exists
Use a local store that holds all data keys for this
server
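A minimal Bloom filter sketch: a bit array plus k positions derived from a salted MD5 digest. The size and hash-count parameters are illustrative, not the system's actual tuning; a production filter would size the array from the expected key count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive k bit positions by salting one MD5 hash with the index.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("key-123")
```

During a data move this lets a node skip most absent keys in O(1) memory, falling back to the local key store only when the filter says "maybe".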
APPENDIX - 006
How to avoid hanging network connections
Use a client connection pool with a screening strategy
beforehand; this avoids many connections hanging when
making remote calls over the network between two servers
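The pool-with-screening idea can be sketched as follows. All names here are hypothetical, and `is_alive()` stands in for whatever health probe the real system uses before handing a connection to a caller.

```python
import queue

class ConnectionPool:
    def __init__(self, factory, size=4):
        self.factory = factory  # callable that creates a new connection
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(factory())

    def acquire(self, timeout=1.0):
        # Screening strategy: check the connection before handing it out,
        # replacing a dead one instead of letting the caller hang on it.
        conn = self.pool.get(timeout=timeout)
        if not conn.is_alive():
            conn = self.factory()
        return conn

    def release(self, conn):
        self.pool.put(conn)

class FakeConn:
    """Stand-in for a real socket/RPC connection."""
    def __init__(self, alive=True):
        self.alive = alive
    def is_alive(self):
        return self.alive

pool = ConnectionPool(lambda: FakeConn(), size=2)
conn = pool.acquire()
pool.release(conn)
```

The bounded queue also caps the number of simultaneous connections between two servers, which is the other half of avoiding hangs under load.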