Kishore Gopalakrishna (Staff Software Engineer @ LinkedIn) gave this talk at ApacheCon in February 2013.
Building distributed systems using Helix
Kishore Gopalakrishna, @kishoreg1980
http://www.linkedin.com/in/kgopalak
http://helix.incubator.apache.org
Apache Incubation Oct 2012
@apachehelix
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
Examples of distributed data systems
Lifecycle
• Single node
• Multi node: partitioning, discovery, co-location
• Fault tolerance: replication, fault detection, recovery
• Cluster expansion: throttle data movement, re-distribution
Typical Architecture

[diagram: application instances on each node, connected over the network to a cluster manager]
Distributed search service

[diagram: index shards and their replicas spread across Node 1, Node 2, Node 3]

Partition management
• Multiple replicas
• Even distribution
• Rack-aware placement

Fault tolerance
• Fault detection
• Auto-create replicas
• Controlled creation of replicas

Elasticity
• Re-distribute partitions
• Minimize movement
• Throttle data movement
Distributed data store

[diagram: MASTER and SLAVE partition replicas spread across Node 1, Node 2, Node 3]

Partition management
• Multiple replicas
• 1 designated master
• Even distribution

Fault tolerance
• Fault detection
• Promote slave to master
• Even distribution
• No SPOF

Elasticity
• Minimize downtime
• Minimize data movement
• Throttle data movement
Message consumer group
• Similar to Message groups in AcGveMQ – guaranteed ordering of the processing of related messages across a single queue
– load balancing of the processing of messages across mulGple consumers
– high availability / auto-‐failover to other consumers if a JVM goes down
• Applicable to many messaging pub/sub systems like kada, rabbitmq etc
Message consumer group

[diagram: queue partitions assigned to consumers — assignment, scaling, fault tolerance]
Application / Framework / Consensus system

• Application directly on Zookeeper: low-level primitives — file system, lock, ephemeral nodes
• Application on a framework over the consensus system: high-level primitives — node, partition, replica, state, transition

Zookeeper provides low-level primitives. We need high-level primitives.
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
Terminologies
• Node – a single machine
• Cluster – set of nodes
• Resource – a logical entity, e.g. database, index, task
• Partition – subset of the resource
• Replica – copy of a partition
• State – status of a partition replica, e.g. Master, Slave
• Transition – action that lets replicas change status, e.g. Slave → Master
Core concept

[state diagram: O (Offline), S (Slave), M (Master) with transitions t1–t4; COUNT=1 on Master, COUNT=2 on Slave]

State Machine
• States: Offline, Slave, Master
• Transitions: O→S, S→M, M→S, S→O

Constraints
• States: M=1, S=2
• Transitions: concurrent(O→S) < 5, e.g. t1 ≤ 5

Objectives
• Partition placement: minimize(max nj∈N S(nj)), minimize(max nj∈N M(nj))
• Failure semantics
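The state machine and constraints above can be sketched in plain Java. This is a minimal illustration, not the Helix API; the class and method names are ours, and the transition set is the standard MasterSlave model (O→S, S→M, M→S, S→O).

```java
import java.util.*;

// Minimal sketch of the MasterSlave state model from the slide:
// states Offline, Slave, Master; per-partition constraints M=1, S<=2.
public class StateModelSketch {
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "OFFLINE", Set.of("SLAVE"),
        "SLAVE", Set.of("MASTER", "OFFLINE"),
        "MASTER", Set.of("SLAVE"));

    // A transition is legal only if the target state is reachable
    // from the current state in one hop.
    static boolean legal(String from, String to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }

    // Check the per-partition state constraints: exactly 1 master,
    // at most 2 slaves (counts taken over a partition's replicas).
    static boolean satisfiesCounts(Collection<String> replicaStates) {
        long masters = replicaStates.stream().filter("MASTER"::equals).count();
        long slaves = replicaStates.stream().filter("SLAVE"::equals).count();
        return masters == 1 && slaves <= 2;
    }

    public static void main(String[] args) {
        System.out.println(legal("OFFLINE", "SLAVE"));   // true
        System.out.println(legal("OFFLINE", "MASTER"));  // false: must pass through SLAVE
        System.out.println(satisfiesCounts(List.of("MASTER", "SLAVE")));  // true
        System.out.println(satisfiesCounts(List.of("SLAVE", "SLAVE")));   // false: no master
    }
}
```

Note how an Offline→Master jump is rejected: promotion must go through Slave, which is what lets the controller throttle and order transitions.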
Helix solution

Message consumer group
• Offline → Online: start consumption; Online → Offline: stop consumption
• MAX=1

Distributed search
• MAX=3 (number of replicas)
• MAX per node=5
IDEALSTATE (replica placement + replica state)
P1 → N1:M, N2:S
P2 → N2:M, N3:S
P3 → N3:M, N1:S

Configuration
• 3 nodes
• 3 partitions
• 2 replicas
• StateMachine

Constraints
• 1 Master
• 1 Slave
• Even distribution

CURRENT STATE
N1: P1:OFFLINE, P3:OFFLINE
N2: P2:MASTER, P1:MASTER
N3: P3:MASTER, P2:SLAVE
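An IDEALSTATE like the table above (every node holds one master and one slave) can be produced by a simple round-robin assignment. The sketch below is our illustration of the idea, not Helix's actual placement algorithm.

```java
import java.util.*;

// Sketch: round-robin masters over the nodes, with each partition's
// slave on the next node, giving an even master/slave distribution.
public class IdealStateSketch {
    static Map<String, LinkedHashMap<String, String>> assign(
            List<String> nodes, int partitions, int replicas) {
        Map<String, LinkedHashMap<String, String>> ideal = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            LinkedHashMap<String, String> replicaMap = new LinkedHashMap<>();
            // first replica is the master, the rest are slaves
            for (int r = 0; r < replicas; r++) {
                String node = nodes.get((p + r) % nodes.size());
                replicaMap.put(node, r == 0 ? "MASTER" : "SLAVE");
            }
            ideal.put("P" + (p + 1), replicaMap);
        }
        return ideal;
    }

    public static void main(String[] args) {
        var ideal = assign(List.of("N1", "N2", "N3"), 3, 2);
        // P1 -> {N1=MASTER, N2=SLAVE}, P2 -> {N2=MASTER, N3=SLAVE},
        // P3 -> {N3=MASTER, N1=SLAVE}
        System.out.println(ideal);
    }
}
```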
EXTERNAL VIEW
P1 → N1:O, N2:M
P2 → N2:M, N3:S
P3 → N3:M, N1:O
Helix Based System Roles

[diagram — logical deployment: the CONTROLLER reads the CURRENT STATE and the IDEAL STATE and exchanges commands/responses with PARTICIPANT nodes, which host the partition replicas; SPECTATORs hold the partition routing logic]
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
Helix based solution
1. Define
2. Configure
3. Run
Define: state model definition
• States
  – all possible states
  – priority
• Transitions
  – legal transitions
  – priority
• Applicable to each partition of a resource
• e.g. MasterSlave
Define: state model
StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MASTERSLAVE");
// Add states and their rank to indicate priority.
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// Set the initial state when the node starts.
builder.initialState(OFFLINE);

// Add transitions between the states.
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Define: constraints

Scope       State   Transition
Partition   Y       Y
Resource    –       Y
Node        Y       Y
Cluster     –       Y
[state diagram: COUNT=1 on Master, COUNT=2 on Slave]

Example:

Scope       State       Transition
Partition   M=1, S=2    –
Define: constraints

// static constraint
builder.upperBound(MASTER, 1);

// dynamic constraint
builder.dynamicUpperBound(SLAVE, "R");

// unconstrained
builder.upperBound(OFFLINE, -1);
Define: participant plug-in code
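The participant plug-in code itself did not survive in this transcript. As a rough illustration of its shape — one handler per legal transition, invoked by the framework — here is a plain-Java stand-in; it is not the actual Helix callback API, and the handler bodies are placeholders for real open/serve/stop logic.

```java
import java.util.*;
import java.util.function.Function;

// Illustrative participant plug-in: one handler per legal transition,
// dispatched by a "FROM->TO" key.
public class ParticipantSketch {
    static final Map<String, Function<String, String>> HANDLERS = Map.of(
        "OFFLINE->SLAVE", p -> p + ": open replica, start catch-up",
        "SLAVE->MASTER",  p -> p + ": start serving writes",
        "MASTER->SLAVE",  p -> p + ": stop serving writes",
        "SLAVE->OFFLINE", p -> p + ": close replica");

    // The framework calls this for each state transition message.
    static String onTransition(String partition, String from, String to) {
        Function<String, String> h = HANDLERS.get(from + "->" + to);
        if (h == null)
            throw new IllegalStateException("illegal transition " + from + "->" + to);
        return h.apply(partition);
    }

    public static void main(String[] args) {
        System.out.println(onTransition("P1", "OFFLINE", "SLAVE"));
        System.out.println(onTransition("P1", "SLAVE", "MASTER"));
    }
}
```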
Step 2: configure
helix-admin --zkSvr <zkAddress>

CREATE CLUSTER
--addCluster <clusterName>

ADD NODE
--addNode <clusterName instanceId(host:port)>

CONFIGURE RESOURCE
--addResource <clusterName resourceName partitions statemodel>

REBALANCE → SET IDEALSTATE
--rebalance <clusterName resourceName replicas>
Zookeeper view: IDEALSTATE
Step 3: run

START CONTROLLER
run-helix-controller --zkSvr localhost:2181 --cluster MyCluster

START PARTICIPANT
Zookeeper view

Znode content: CURRENT STATE, EXTERNAL VIEW
Spectator plug-in code
Helix execution modes
IDEALSTATE (replica placement + replica state)
P1 → N1:M, N2:S
P2 → N2:M, N3:S
P3 → N3:M, N1:S

Configuration
• 3 nodes
• 3 partitions
• 2 replicas
• StateMachine

Constraints
• 1 Master
• 1 Slave
• Even distribution

Execution modes
• Who controls what?
                   AUTO REBALANCE   AUTO    CUSTOM
Replica placement  Helix            App     App
Replica state      Helix            Helix   App
Auto rebalance vs. Auto
In action

Auto rebalance: MasterSlave, p=3, r=2, N=3

Node1   Node2   Node3
P1:M    P2:M    P3:M
P2:S    P3:S    P1:S

On failure: auto-create replicas on the surviving nodes and assign states.

Auto: MasterSlave, p=3, r=2, N=3

Node1   Node2   Node3
P1:M    P2:M    P3:M
P2:S    P3:S    P1:S

On failure: only change the states of existing replicas to satisfy the constraints (e.g. promote a slave of the failed master).
Custom mode: example
Custom mode: handling failure
• Custom code invoker
  – code that lives on all nodes, but is active in only one place
  – invoked when a node joins/leaves the cluster
  – computes the new idealstate
  – the Helix controller fires the transitions without violating constraints
Old IDEALSTATE                New IDEALSTATE
P1 → N1:M, N2:S               P1 → N1:S, N2:M
P2 → N2:M, N3:S               P2 → N2:M, N3:S
P3 → N3:M, N1:S               P3 → N3:M, N1:S

Transitions
1. N1: M → S
2. N2: S → M

1 & 2 in parallel would violate the single-master constraint, so Helix sends 2 after 1 is finished.
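The ordering argument above can be made concrete: if N2's S→M fires before N1's M→S completes, partition P1 briefly has two masters. A sketch (our illustration, not Helix internals) that applies a transition only when the M=1 upper bound would survive:

```java
import java.util.*;

// Apply a transition on a partition's replica map only if at most
// one master remains afterwards; otherwise roll back and refuse.
public class TransitionOrderSketch {
    static boolean tryTransition(Map<String, String> replicas, String node, String to) {
        String old = replicas.put(node, to);
        long masters = replicas.values().stream().filter("MASTER"::equals).count();
        if (masters > 1) {
            replicas.put(node, old);   // would violate M=1: roll back
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> p1 = new LinkedHashMap<>();
        p1.put("N1", "MASTER");
        p1.put("N2", "SLAVE");
        // Wrong order: promoting N2 while N1 is still master is refused.
        System.out.println(tryTransition(p1, "N2", "MASTER"));  // false
        // Correct order: demote N1 first, then promote N2.
        System.out.println(tryTransition(p1, "N1", "SLAVE"));   // true
        System.out.println(tryTransition(p1, "N2", "MASTER"));  // true
    }
}
```

This is exactly why the controller serializes the two transitions rather than firing them in parallel.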
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
Tools
• Chaos monkey
• Data driven testing and debugging
• Rolling upgrade
• On-demand task scheduling and intra-cluster messaging
• Health monitoring and alerts
Data driven testing
• Instrument: Zookeeper, controller, participant logs
• Simulate: Chaos monkey
• Analyze: invariants, e.g.
  – respect state transition constraints
  – respect state count constraints
  – and so on
• Debugging made easy: reproduce the exact sequence of events
Structured log file – sample

timestamp  partition  instanceName  sessionId  state
1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
No more than R=2 slaves

Time   State    #Slaves  Instance
42632 OFFLINE 0 10.117.58.247_12918
42796 SLAVE 1 10.117.58.247_12918
43124 OFFLINE 1 10.202.187.155_12918
43131 OFFLINE 1 10.220.225.153_12918
43275 SLAVE 2 10.220.225.153_12918
43323 SLAVE 3 10.202.187.155_12918
85795 MASTER 2 10.220.225.153_12918
How long was it out of whack?

Number of Slaves   Time        Percentage
0                  1082319     0.5
1                  35578388    16.46
2                  179417802   82.99
3                  118863      0.05

Number of Masters  Time        Percentage
0                  15490456    7.16
1                  200706916   92.84

83% of the time, there were 2 slaves to a partition; 93% of the time, there was 1 master to a partition.
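The percentages above come from summing the time spent at each replica count between state-change events in the structured log. A minimal sketch of that analysis, using made-up event data rather than the slide's actual log:

```java
import java.util.*;

// Given a sequence of (timestamp, slaveCount) change events for one
// partition, sum the time spent at each slave count and report it as
// a percentage of the total observation window.
public class InvariantAnalysisSketch {
    record Event(long ts, int slaveCount) {}

    static Map<Integer, Double> percentages(List<Event> events, long end) {
        Map<Integer, Long> time = new TreeMap<>();
        for (int i = 0; i < events.size(); i++) {
            long until = (i + 1 < events.size()) ? events.get(i + 1).ts() : end;
            time.merge(events.get(i).slaveCount(), until - events.get(i).ts(), Long::sum);
        }
        long total = end - events.get(0).ts();
        Map<Integer, Double> pct = new TreeMap<>();
        time.forEach((k, v) -> pct.put(k, 100.0 * v / total));
        return pct;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(0, 0), new Event(100, 1), new Event(300, 2));
        System.out.println(percentages(events, 1000));
        // {0=10.0, 1=20.0, 2=70.0}: 2 slaves for 70% of the window
    }
}
```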
Invariant 2: State Transitions

FROM     TO       COUNT
MASTER SLAVE 55
OFFLINE DROPPED 0
OFFLINE SLAVE 298
SLAVE MASTER 155
SLAVE OFFLINE 0
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
Helix usage at LinkedIn
• Espresso
In flight
• Apache S4
  – partitioning, co-location
  – dynamic cluster expansion
• Archiva
  – partitioned replicated file store
  – rsync based replication
• Others in evaluation
  – Bigtop
Auto-scaling software deployment tool

• States
  – Download, Configure, Start
  – Active, Standby
• Constraint for each state
  – Download < 100
  – Active: 1000
  – Standby: 100

[state diagram: Offline → Download (< 100) → Configure → Start → Active (1000) ⇄ Standby (100)]
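The per-state caps above act as a throttle during rolling deploys: no more than 100 nodes downloading at once, 1000 active, 100 on standby. A sketch of checking a proposed cluster-wide distribution against those caps (the caps and state names are from the slide; the checker itself is our illustration, and it treats each cap as an upper bound):

```java
import java.util.*;

// Per-state node caps from the deployment-tool slide.
public class DeployCapsSketch {
    static final Map<String, Integer> CAPS = Map.of(
        "DOWNLOAD", 100, "ACTIVE", 1000, "STANDBY", 100);

    // Given counts of nodes per state, check every cap holds;
    // states without a cap (e.g. OFFLINE) are unconstrained.
    static boolean withinCaps(Map<String, Integer> counts) {
        return counts.entrySet().stream().allMatch(e ->
            e.getValue() <= CAPS.getOrDefault(e.getKey(), Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        System.out.println(withinCaps(Map.of("DOWNLOAD", 50, "ACTIVE", 900)));  // true
        System.out.println(withinCaps(Map.of("DOWNLOAD", 150)));  // false: throttle downloads
    }
}
```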
Summary
• Helix: a generic framework for building distributed systems
• Modifying/enhancing system behavior is easy
  – abstraction and modularity is key
• Simple programming model: declarative state machine
Roadmap
• Span multiple data centers
• Automatic load balancing
• Distributed health monitoring
• YARN generic application master for real-time apps
• Stand-alone Helix agent
website: http://helix.incubator.apache.org
user list: [email protected]
twitter: @apachehelix, @kishoreg1980