Debugging Distributed Systems
Donny Nadolny, PagerDuty
#Devoxx #distsys
What is ZooKeeper
• Distributed system for building distributed systems
• Small in-memory filesystem
ZooKeeper API
• create directory
• create file (ZooKeeper term: “node”)
• atomically update a file
• watch a file for changes
• create “ephemeral” file (goes away when client does)
• create sequential file (concurrent attempts to create are ordered)
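A toy, in-memory sketch of how the ephemeral and sequential primitives above can combine into a lock. `ToyZk`, its methods, the session names, and the `/locks/job-` prefix are hypothetical and exist only for illustration; this is not the ZooKeeper client API, and a real client talks to a server ensemble.

```python
import itertools


class ToyZk:
    """In-memory stand-in for a ZooKeeper server (illustration only)."""

    def __init__(self):
        self._nodes = {}               # path -> owning session id
        self._seq = itertools.count()  # one counter orders concurrent creates

    def create_ephemeral_sequential(self, prefix, session):
        # The server appends a monotonically increasing, zero-padded suffix.
        path = f"{prefix}{next(self._seq):010d}"
        self._nodes[path] = session
        return path

    def children(self, prefix):
        return sorted(p for p in self._nodes if p.startswith(prefix))

    def close_session(self, session):
        # Ephemeral nodes go away when the client that created them does.
        self._nodes = {p: s for p, s in self._nodes.items() if s != session}


def holds_lock(zk, prefix, my_path):
    # The lock holder is whoever owns the lowest-numbered node.
    return zk.children(prefix)[0] == my_path


zk = ToyZk()
a = zk.create_ephemeral_sequential("/locks/job-", session="client-a")
b = zk.create_ephemeral_sequential("/locks/job-", session="client-b")
a_first = holds_lock(zk, "/locks/job-", a)
b_first = holds_lock(zk, "/locks/job-", b)
zk.close_session("client-a")  # client A goes away, and so does its node
b_after = holds_lock(zk, "/locks/job-", b)
print(a_first, b_first, b_after)
```

The waiting client never polls in a real deployment; it would watch the node ahead of it and re-check when that node disappears.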
ZooKeeper at PagerDuty
• Distributed locking
• Consistent, highly available
• … over a WAN
[Diagram: three ZooKeeper nodes (ZK 1, ZK 2, ZK 3) spread across three data centers (DC-A, DC-B, DC-C), with inter-data-center latencies of 24 ms, 24 ms, and 3 ms]
Current Talk: Debugging Distributed Systems
For Cassandra Consistency Issues, See:
ZooKeeper Overview
The Failure
• Network trouble, one follower falls behind
• ZooKeeper gets stuck - leader still up
[Chart: DB Size]
Recovery
• Restart all nodes
• Restart leader
First Hint
• Leader logs: “Too busy to snap, skipping”
Fault Injection
• Disk slow? Let’s test:
  • sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring
Health Checks
• First warning: application monitoring
• High-level application checks are good because they catch many problems, but don’t tell you the cause
• Monitoring ZooKeeper: used ruok
Deep Health Checks
• Added deep health check:
• write to one ZooKeeper key
• read from ZooKeeper key
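The write-then-read check above could look roughly like the sketch below. `InMemoryStore`, the `/health/probe` key, and the one-second budget are assumptions made up for this example; in practice `store` would wrap a real ZooKeeper client.

```python
import time
import uuid


class InMemoryStore:
    """Hypothetical stand-in for a ZooKeeper client (get/set on one key)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


def deep_health_check(store, key="/health/probe", max_seconds=1.0):
    """Write a fresh value, read it back, and require both to finish in time."""
    token = uuid.uuid4().hex.encode()
    started = time.monotonic()
    try:
        store.set(key, token)
        ok = store.get(key) == token
    except Exception:
        return False
    return ok and (time.monotonic() - started) <= max_seconds


healthy = deep_health_check(InMemoryStore())
print(healthy)
```

A write against a stuck cluster may never return, so a real check would also need a client-side timeout or a watchdog; this sketch only measures elapsed time once the calls come back.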
The Stack Trace
"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        …
        at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
        …
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
        - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
        …
        at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)
Threads (Leader)
• Request processors
• Learner handler (one per follower)
• Client requests
[Diagram, built up over several slides: locks (🔒) being held across the leader's thread groups]
Write Snapshot Code (simplified)
void serializeNode(OutputArchive output, String path) {
    DataNode node = getNode(path);
    String[] children = {};
    synchronized (node) {
        // Blocking network write, performed while holding the node's lock
        output.writeString(path, "path");
        output.writeRecord(node, "node");
        children = node.getChildren();
    }
    for (String child : children) {
        serializeNode(output, path + "/" + child);
    }
}
ZooKeeper Heartbeat
• Why didn’t a follower take over?
• restart all nodes - cluster recovers
• restart leader - cluster recovers
• ZK heartbeat: message from leader to follower
• follower gets heartbeat, everything is fine
• follower doesn’t get heartbeat: start an election
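A minimal sketch of the follower-side rule in the last two bullets, and of why it did not fire here: as long as heartbeats keep arriving, the follower has no reason to call an election. The class name, the tick-based clock, and the timeout of 2 ticks are assumptions for illustration, not ZooKeeper's actual implementation or settings.

```python
class FollowerHeartbeatMonitor:
    """Follower-side view: start an election if heartbeats stop arriving."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_heartbeat = now

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def should_start_election(self, now):
        return now - self.last_heartbeat > self.timeout


def first_election_tick(leader_sends_heartbeats, ticks=10, timeout=2.0):
    """Return the first tick at which the follower would start an election, or None."""
    monitor = FollowerHeartbeatMonitor(timeout)
    for now in range(1, ticks + 1):
        if leader_sends_heartbeats:
            monitor.on_heartbeat(float(now))
        if monitor.should_start_election(float(now)):
            return now
    return None


# Leader cannot serve requests, but something on it still sends heartbeats.
stuck_leader = first_election_tick(leader_sends_heartbeats=True)
# Leader process is restarted: heartbeats stop and the follower reacts.
dead_leader = first_election_tick(leader_sends_heartbeats=False)
print(stuck_leader, dead_leader)
```

The check only proves that the heartbeat path works, which is consistent with a restart of the leader fixing the cluster while a stuck leader does not trigger a takeover.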
[Diagram: Threads (Leader) as before (Request processors, Learner handler (one per follower), Client requests, with locks 🔒 held), plus a Quorum Peer thread sending heartbeats (❤) to the Followers]
TCP
TCP Data Transmission
[Sequence diagram, Follower and Leader, both ESTABLISHED after … SYN, SYN-ACK, ACK …: in the normal case Packet 1 is sent and answered with an ACK. When no ACK arrives, Packet 1 is retransmitted after ~200ms, ~200ms, ~400ms, ~800ms, and so on, up to 120sec between attempts; after 15 retries… the connection is CLOSED]
TCP Retransmission (Linux Defaults)
• Retransmission timeout (RTO) is based on latency
• TCP_RTO_MIN = 200 ms
• TCP_RTO_MAX = 2 minutes
• /proc/sys/net/ipv4/tcp_retries2 = 15 retries
• 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (15.5 mins)
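The sum in the last bullet can be checked with a few lines. This is an idealized model that assumes the RTO starts at TCP_RTO_MIN and doubles on every retry until it is capped at TCP_RTO_MAX; real connections start from a measured RTO, so actual totals vary. Under these assumptions the total is 924.6 seconds; the slide's series lists 0.2 twice, which accounts for its extra 0.2 seconds.

```python
RTO_MIN_MS = 200      # TCP_RTO_MIN
RTO_MAX_MS = 120_000  # TCP_RTO_MAX (2 minutes)
RETRIES = 15          # /proc/sys/net/ipv4/tcp_retries2


def backoff_intervals_ms(retries=RETRIES, rto_min=RTO_MIN_MS, rto_max=RTO_MAX_MS):
    """Waits before each of the `retries` retransmissions, plus the final wait."""
    return [min(rto_min * 2 ** i, rto_max) for i in range(retries + 1)]


intervals = backoff_intervals_ms()
total_ms = sum(intervals)
print(intervals)
print(total_ms / 1000, "seconds")
```

Working in integer milliseconds avoids floating-point noise in the sum.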
[Sequence diagram, TCP Data Transmission, Follower and Leader: with these defaults Packet 1 keeps being retransmitted for 15.5 mins (or more) before the connection is CLOSED]
Timeline
1. Network trouble begins - packet loss / latency
2. Follower falls behind, restarts, requests snapshot
3. Leader begins to send snapshot
4. Snapshot transfer stalls
5. Follower ZooKeeper restarts, attempts to close connection
6. Network heals
7. … Leader still stuck
TCP Close Connection
[Sequence diagrams, Follower and Leader, both starting ESTABLISHED]
• Normal close: the follower sends FIN (FIN_WAIT1), the leader answers FIN/ACK (LAST_ACK), the follower sends ACK and sits in TIME_WAIT for 60 seconds; both sides end up CLOSED
• With packet loss: the follower retransmits its FIN (8 retries) from FIN_WAIT1 and goes to CLOSED after ~1m40s
• Meanwhile the leader keeps retransmitting Packet 1 and would only reach CLOSED after ~15.5 mins
• Expected once the network heals: the follower, now CLOSED, answers Packet 1 with RST and the leader goes to CLOSED
syslog - Dropped Packets on Follower
06:51:47 iptables:WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0
TCP Close Connection
[Sequence diagram, Follower and Leader: the follower goes from FIN_WAIT1 to CLOSED after ~1m40s; the leader's retransmitted Packet 1 is then blocked by iptables on the follower (X), so no RST gets back to the leader]
iptables
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP
But: iptables connections != netstat connections
conntrack Timeouts
• From linux/net/netfilter/nf_conntrack_proto_tcp.c:
• [TCP_CONNTRACK_LAST_ACK] = 30 SECS
[Timeline diagram on the follower, kernel TCP vs. conntrack (LAST_ACK): FINs are retransmitted from FIN_WAIT1 at ~12.8s, ~25.6s, and ~51.2s, and each one is followed by a 30s conntrack timeout. conntrack's entry is gone at ~81.2s, while kernel TCP only reaches CLOSED at ~102.4s]
TCP Close Connection
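The gap between the two views can be reproduced with a small simulation. It is a sketch under stated assumptions: FIN retransmissions back off from 200 ms and double each time, every FIN the follower sends restarts conntrack's 30-second timer, and kernel TCP gives up one more doubled wait after the 8th retry. The resulting 81.0 s and 102.2 s are close to the slide's ~81.2s and ~102.4s.

```python
RTO_MIN_MS = 200
FIN_RETRIES = 8                 # retries shown on the slide
CONNTRACK_LAST_ACK_MS = 30_000  # [TCP_CONNTRACK_LAST_ACK] = 30 SECS


def fin_send_times_ms(retries=FIN_RETRIES, rto_min=RTO_MIN_MS):
    """Times at which the FIN is sent: once at t=0, then after doubling waits."""
    times, now = [0], 0
    for i in range(retries):
        now += rto_min * 2 ** i
        times.append(now)
    return times


def simulate(retries=FIN_RETRIES):
    sends = fin_send_times_ms(retries)
    # Kernel TCP gives up one more (doubled) wait after the last retransmission.
    socket_closed = sends[-1] + RTO_MIN_MS * 2 ** retries
    conntrack_gone = None
    # conntrack forgets the connection once no packet is seen for 30 seconds.
    for sent, nxt in zip(sends, sends[1:] + [socket_closed]):
        if nxt - sent > CONNTRACK_LAST_ACK_MS:
            conntrack_gone = sent + CONNTRACK_LAST_ACK_MS
            break
    return sends, conntrack_gone, socket_closed


sends, conntrack_gone, socket_closed = simulate()
print(sends)
print(conntrack_gone / 1000, socket_closed / 1000)
```

Once conntrack has dropped the entry, packets from the leader no longer match the ESTABLISHED,RELATED rule, which is why they hit the final DROP rule.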
The Full Story
• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals
(Alternative: kill the follower)
Reproducing (1/3) - Setup
• Follower falls behind:
  tc qdisc add dev eth0 root netem delay 500ms 100ms loss 35%
• Wait for a few minutes
Reproducing (2/3) - Request Snapshot
• Remove latency / packet loss:
  tc qdisc del dev eth0 root netem
• Restrict bandwidth:
  tc qdisc add dev eth0 handle 1: root htb default 11
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
  tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
• Restart follower ZooKeeper process
Reproducing (3/3) - Close Connection
• Block traffic to leader:
  iptables -A OUTPUT -p tcp -d <leader ip> -j DROP
• Remove bandwidth restriction:
  tc qdisc del dev eth0 root
• Kill follower ZooKeeper process, kernel tries to close connection
• Monitor conntrack status, wait for entry to disappear, ~80 seconds:
  conntrack -L | grep <leader ip>
• Allow traffic to leader:
  iptables -D OUTPUT -p tcp -d <leader ip> -j DROP
IPsec
[Diagram, Follower and Leader: TCP data is wrapped by IPsec and travels between the hosts as ESP (UDP) packets]
IPsec - Establish Connection
[Sequence diagram, Follower and Leader: IPsec Phase 1, IPsec Phase 2, then TCP data]
IPsec - Dropped Packets
[Sequence diagram, Follower and Leader: TCP data, TCP data, IPsec Phase 1, IPsec Phase 2]
IPsec - Heartbeat
[Sequence diagram, Follower and Leader: IPsec Heartbeat, TCP data, TCP data, IPsec Phase 1, IPsec Phase 2]
Lessons
Lesson 1
• Don’t lock and block
• TCP can block for a really long time
• Interfaces / abstract methods make analysis harder
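One way to avoid locking and blocking, sketched in Python rather than ZooKeeper's Java: copy what is needed while holding the lock, release it, and only then do the potentially slow write. `Node`, the dict-based tree, and the record format are hypothetical, and this is not presented as the fix adopted for ZOOKEEPER-2201.

```python
import io
import threading


class Node:
    def __init__(self, data, children=()):
        self.lock = threading.Lock()
        self.data = data
        self.children = list(children)


def serialize_node(tree, path, output):
    """Copy the node's state under its lock, then write with the lock released."""
    node = tree[path]
    with node.lock:
        data = bytes(node.data)         # cheap in-memory copy
        children = list(node.children)
    # The potentially slow write happens outside the critical section.
    output.write(path.encode() + b"\0" + data + b"\n")
    for child in children:
        serialize_node(tree, path.rstrip("/") + "/" + child, output)


tree = {
    "/": Node(b"root", ["a", "b"]),
    "/a": Node(b"1"),
    "/b": Node(b"2"),
}
out = io.BytesIO()  # stands in for the socket to the follower
serialize_node(tree, "/", out)
print(out.getvalue())
```

The trade-off is an extra copy of each node's data, in exchange for a lock hold time that no longer depends on the network.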
Lesson 2
• Automate debug info collection (stack trace, heap dump, transaction logs, etc)
Lesson 3
• Application/dependency checks should be deep health checks!
• Leader/follower heartbeats should be deep health checks!
Questions?
Link: “Network issues can cause cluster to hang due to near-deadlock”
https://issues.apache.org/jira/browse/ZOOKEEPER-2201
“Mess With The Network” Cheat Sheet
# add latency
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 25%
# remove latency
tc qdisc del dev eth0 root netem
# restrict bandwidth
tc qdisc add dev eth0 handle 1: root htb default 11
tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps
# remove bandwidth restriction
tc qdisc del dev eth0 root
# tip: when doing latency / loss / bandwidth restriction:
# run "sleep 60 && <tc delete command> & disown" in case you lose ssh access
# capture packets, then open locally in wireshark
tcpdump -n "src host 123.45.67.89 or dst host 123.45.67.89" -i eth0 -s 65535 -w /tmp/packet.dump
iptables -A OUTPUT -p tcp --dport 4444 -j DROP  # block traffic
iptables -D OUTPUT -p tcp --dport 4444 -j DROP  # allow traffic
# can use INPUT / OUTPUT chain for incoming / outgoing traffic
# other options: --dport <dest port>, --sport <src port>, -s <source ip>, -d <dest ip>
# configure database / application local data directory to be /mnt, then:
sshfs [email protected]:/tmp/data /mnt
# alternative: nbd (network block device)
netstat -peanut  # network connections, regular kernel view
conntrack -L     # network connections, iptables view