SHARKFEST '09 | Stanford University | June 15–18, 2009

Protocol Analysis in a Complex Enterprise: The Importance of “The Art of Recognition.”

June 16th, 2009

Hansang Bae, Senior VP | Citi (f.k.a. Citigroup)
[email protected]




Challenges: As it turns out, size does matter!

- Citi’s branch network spans 5,000+ locations in the US.
- Citi’s network infrastructure includes 30,000+ devices.
- 300,000 users located in over 100 countries.
- The number of servers in use is mind-numbingly large!

Compliance/Security Quagmire

- Doing a full packet capture is difficult.
- Tools in use include NetVCR and Opnet’s ACE.
- Wireshark is the only approved protocol analyzer at Citi. It dislodged past market leaders.


Act I: Much Ado about Nothing!

- Old medical school saying: When you hear hooves beating, think horses and not zebras!
- Server SA reports extreme slowness during file transfers.
- What are the top issues that come to mind?
- Server SA started a ping script and in it showed…..

Lessons Learned:

- Learn to recognize what should and should not change as you go through the trace files.
- RFC1323 was not in play because they are on the same switch!
- Take a few minutes to scan the trace files. Learn to trust your brain’s ability to spot differences.
- Know how protocols work so you can rule out red herrings. This is what separates “techs” from “engineers”.
- Try not to filter. You might have missed the “arp” frames in this trace. This is different than capturing in “promiscuous” mode.

ICMP_BHNew2.pcap

ICMP_BHNew2ICMPOnly.pcap


Act II: Taming the SSH

Logging into a server via ssh takes over two minutes:

- What are the top issues that come to mind for slow telnet/ssh login?
- Let’s capture and find out. Packet captures are like Shakira’s hips. They don’t lie!

Lessons Learned:

- Scroll through the trace to look for patterns. Again, trust your brain.
- Develop a technique: a list of common filters to run through when troubleshooting, e.g. tcp.flags==02, tcp.analysis.flags
- Don’t forget UDP. What important function runs on UDP?
- Do not blindly trust the TCP analysis. Wireshark can only know what you feed it. It too suffers from GIGO (Garbage In, Garbage Out).
- Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words!
- Capture placement is important. If I had captured at the client, I would still be wondering why there is a delay!

SlowSSHLogin2.pcap
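One way to keep a “list of common filters” handy is as a small script. The following is a minimal sketch in Python that only builds tshark command lines (`-r` reads a capture file, `-Y` applies a display filter) and does not run them. The two TCP filters come from the slide; the `udp` entry, the labels, and the helper name `checklist_commands` are illustrative assumptions, not part of the original talk.

```python
# Hypothetical checklist helper: builds (does not execute) tshark commands.
COMMON_FILTERS = [
    ("SYN-only segments (new connection attempts)", "tcp.flags==0x02"),
    ("Anything Wireshark's TCP analysis flagged", "tcp.analysis.flags"),
    ("All UDP traffic (assumed addition: do not forget UDP)", "udp"),
]


def checklist_commands(pcap_path):
    """Return one tshark argument list per checklist entry."""
    return [["tshark", "-r", pcap_path, "-Y", flt] for _, flt in COMMON_FILTERS]


if __name__ == "__main__":
    commands = checklist_commands("SlowSSHLogin2.pcap")
    for (label, _), cmd in zip(COMMON_FILTERS, commands):
        print(f"{label}: {' '.join(cmd)}")
```

Each argument list can be passed to `subprocess.run` on a machine where tshark is installed. Running the same short list against every trace is meant to support scrolling through the trace, not replace it.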


Act III: To Stream or Not to Stream?

- Application developers report extreme slowness when ftp’ing a file.
- What are the top issues that come to mind for slow ftp sessions?

Lessons Learned:

- Scroll through the trace to look for patterns. Again, trust your brain.
- Develop a technique: a list of common filters to run through when troubleshooting, e.g. tcp.flags==02, tcp.analysis.flags
- Buffer tearing is pretty common. Applications are constantly trying to do TCP’s job. App bytes can help you identify it. Learn to recognize it! (Oracle, MS SQL, Sybase, they all do it.)
- Understand what “streaming” really means. TCP *HAS NO* byte boundaries.
- Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words!

Slow ftp anon.pcap
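To make “TCP has no byte boundaries” concrete, here is a small sketch in Python. It assumes a hypothetical application protocol in which every message is prefixed with a 4-byte big-endian length; that framing and the names `MessageReassembler` and `frame` are made up for illustration. The point is that one application write may arrive split across several reads, or glued to the next write, so the receiver has to rebuild message boundaries itself.

```python
import struct


class MessageReassembler:
    """Rebuilds length-prefixed messages from arbitrary TCP read chunks."""

    HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Add bytes as returned by recv(); return any complete messages."""
        self._buf += chunk
        messages = []
        while len(self._buf) >= self.HEADER.size:
            (length,) = self.HEADER.unpack_from(self._buf)
            end = self.HEADER.size + length
            if len(self._buf) < end:
                break  # body has not fully arrived yet
            messages.append(bytes(self._buf[self.HEADER.size:end]))
            del self._buf[:end]
        return messages


def frame(payload):
    """Prefix a payload with its length (sender side of the assumed framing)."""
    return MessageReassembler.HEADER.pack(len(payload)) + payload


if __name__ == "__main__":
    stream = frame(b"alert") + frame(b"actual data")
    reassembler = MessageReassembler()
    received = []
    # Deliver the same bytes in 3-byte pieces, as the network is free to do.
    for i in range(0, len(stream), 3):
        received += reassembler.feed(stream[i:i + 3])
    print(received)
```

The same two messages come out whether the stream is delivered in one piece or in many, which is what “streaming” means from the application's point of view.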


Act IV: Window’s Tale

- Call center servers are not able to keep up with call volume after a data center migration.
- The servers are not getting the data fast enough, causing a backlog. What simple change can increase the throughput?
- The path after the migration is longer by 50 ms.

Lessons Learned:

- If latency is causing a problem, look for RFC1323 related problems.
- Know what affects a transfer’s throughput: buffer tearing, window sizes, or packet loss.
- Use the graphical plots to zoom in on the problem, so let’s look at the window size. Should we look at the receive or send window?
- Argue your case. If you’re right, you’re right! But you had better be right. You earn your “cred” over time, but you can blow it in one shot!
- Use the graphical tools available in Wireshark. A picture *IS* worth a thousand words! See next page.

MQSlow.pcap MQSlowPrint.txt
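Why latency and window size interact: a sender can have at most one window of unacknowledged data in flight per round trip, so a single connection's throughput is capped at roughly window / RTT. The sketch below works that arithmetic in Python. The 65,535-byte window is the largest value that fits in the 16-bit TCP window field without RFC 1323 window scaling; the 1 ms pre-migration RTT and the 100 Mbit/s link are illustrative assumptions, and the added 50 ms is treated here as round-trip time.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on one TCP connection's throughput: window / RTT, in bits/s."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(link_bps, rtt_seconds):
    """Bandwidth-delay product: window needed to keep the path full."""
    return link_bps * rtt_seconds / 8


if __name__ == "__main__":
    window = 65_535  # max window without RFC 1323 scaling
    for rtt_ms in (1, 51):  # assumed: 1 ms before the move, 50 ms more after
        mbps = max_throughput_bps(window, rtt_ms / 1000) / 1e6
        print(f"RTT {rtt_ms:>2} ms -> at most {mbps:.1f} Mbit/s")
    # Window needed to fill an assumed 100 Mbit/s path at 51 ms RTT:
    print(f"{window_needed_bytes(100e6, 0.051):,.0f} bytes")
```

With these assumptions the same unscaled window that was ample on a local path allows only about 10 Mbit/s once 50 ms is added, which is why the window size is the first thing to plot.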


Act IV: Window’s Tale

Use STATISTICS, IO GRAPH to bring up this graph. Modify the highlighted items to bring up this view


Act V: A User’s Complaint

- Smith Barney Financial Consultants are complaining of slow page load times for their home page. The problem is sporadic and random but happens enough that it’s impacting their productivity.
- The problem is widespread and not easily reproducible… where do you start? What do you do? “Who you gonna call?”
- What’s common in the problem? Home page; use of load balancer; common backend servers; affecting many users.
- What’s the job of a load balancer?
- Where should we take the trace?
- What “bad things” can happen if you are using a load balancer with Source NAT configured?




Act V: A User’s Complaint (cont’d)

LBProblemNew.pcap


Act V: A User’s Complaint (cont’d)

Lessons Learned:

- Start by looking at what infrastructure is in common for all users experiencing the problem.
- What identifies a TCP connection? 2-tuple? 4-tuple?
- Remember that sequence numbers are nothing more than the number of bytes transferred. An acknowledgement is nothing more than an indication of how much of the data you received. If you receive something outside of what’s expected, something went horribly wrong!
- When you have a 22,000 user base, an ephemeral port range of 1024-5000 can be exhausted quickly.
- Sometimes, you have to resort to turning off “relative sequence numbers” for analysis. This is especially true when load balancers, or any device that NATs, are in the data path.
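A TCP connection is identified by the 4-tuple (source IP, source port, destination IP, destination port). With source NAT, the web server sees every user arriving from the load balancer's single address, so three of the four values are fixed and only the source port distinguishes connections. The sketch below does that arithmetic in Python using the addresses from the appendix. The destination port 80 and the function names are assumptions for illustration, and the slides do not say whether the source port is chosen by the client and preserved or chosen by the load balancer.

```python
from collections import namedtuple

FourTuple = namedtuple("FourTuple", "src_ip src_port dst_ip dst_port")


def distinct_connections(port_low, port_high):
    """Number of distinct 4-tuples when only the source port can vary."""
    return port_high - port_low + 1


def snat_tuple(src_port, lb_ip="172.16.20.20",
               server_ip="172.16.254.254", dst_port=80):
    """4-tuple the web server sees after the load balancer's source NAT.

    dst_port=80 is an assumed value; the IPs are from the appendix.
    """
    return FourTuple(lb_ip, src_port, server_ip, dst_port)


if __name__ == "__main__":
    ports = distinct_connections(1024, 5000)
    print(f"{ports} source ports shared by 22,000 users")
    # Two different users that end up on the same NAT source port are
    # indistinguishable to the web server:
    print(snat_tuple(1025) == snat_tuple(1025))
```

With fewer than 4,000 ports for 22,000 users, a port is likely to be reused while the server still holds state for the previous connection on the same 4-tuple.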


Act V: A User’s Complaint (cont’d)

Lessons Learned (cont’d): (Turn off relative sequence numbers)

- Frames 1-8 contain the orderly close of a connection.
- Frame 9, which occurs approx. 14 seconds later, is an attempt by a ‘new’ client to open a connection to the LB. (Frame 10 is the LB-translated request to the web server.)
- Frame 11 is an acknowledgement for the prior connection. This occurs because the web server still has this socket in FIN-WAIT. (Frame 12 is the translated request, LB to client.)
- Frames 13 and 14 are the RST generated by the client and the translated request, respectively.
- Frames 15-18 contain a connection creation. This is allowed to occur because of the RST. However, this causes the client to pause for approx. 3 seconds.

LBTCPHandshake.pcap
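The client's RST follows from standard TCP behavior: a host that has sent only a SYN accepts exactly one acknowledgement value, its own initial sequence number (ISN) plus one, and answers anything else with a reset. The sketch below, in Python with made-up ISN values, shows the check and why raw sequence numbers make the stale ACK easier to recognize: a relative number is only an offset from one connection's ISN.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def relative_seq(raw_seq, isn):
    """Offset of raw_seq from an initial sequence number, modulo 2**32."""
    return (raw_seq - isn) % MOD


def ack_acceptable_in_syn_sent(raw_ack, iss):
    """After sending only a SYN, the one acceptable ACK value is ISS + 1."""
    return raw_ack == (iss + 1) % MOD


if __name__ == "__main__":
    old_isn, new_isn = 3_000_000_000, 1_200_000  # made-up values
    stale_ack = (old_isn + 5_001) % MOD          # ACK left from the old connection
    # The new client has sent only its SYN, so this ACK is unacceptable -> RST.
    print(ack_acceptable_in_syn_sent(stale_ack, new_isn))
    # Relative to the OLD connection, the same ACK is a perfectly ordinary offset.
    print(relative_seq(stale_ack, old_isn))
```

Comparing the raw acknowledgement number against both connections' ISNs is what ties the ACK in frame 11 back to the connection closed in frames 1-8.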


Act VI: As You Log It

- After a data center migration, an application was no longer able to support the production traffic. The new data center was separated by 11 ms round trip latency. Before the move, both servers were located in the same DC.
- Naturally, the first inclination was to blame the network! After all, the problem started after the migration.
- The application generates a 3 byte “alert” message followed by another small packet with the actual data. What should be the first problem that comes to your mind?
- What looked like a slam-dunk turned out to be quite complicated!
- In the Army, we had a saying: Be, Know, Do. It applies to packet analysis.
- At the end of the day, in-depth knowledge of how TCP should work allowed us to find the problem.

DCMove_BothSideLookAt918.pcap

DCMove_OneSideLookAt10-11-12.pcap

DCMove_Original_LookAt197.pcap


Act VI: As You Log It (cont’d)

Lessons Learned:

- Nagle and Delayed Acknowledgment deadlock is very common when TCP is used to shuttle small amounts of data. This can be a “killer” when trading programs are involved.
- Turning on application level logging can help, but don’t forget to turn it off!
- Know what impact you can have if you decide to log. For us router-jockeys, it’s equivalent to doing a “debug ip ospf” on a production backbone router. Hint: not a good idea. It’s a self-correcting error: if you do it once, you’ll never do it again!
- If you know how TCP really works, you can argue your point with conviction because deep down inside, you know you’re right.
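The Nagle and delayed-ACK interaction shows up when an application does two small writes back to back, such as a 3-byte alert followed by a small data packet: Nagle holds the second write until the first is acknowledged, while the receiver's delayed-ACK timer holds that acknowledgement. Two common mitigations are sketched below in Python: hand TCP the header and body in a single write, or disable Nagle with `TCP_NODELAY`. The slides do not say which fix, if any, was applied in this case; the message contents and function names are assumptions for illustration.

```python
import socket


def build_message(alert, payload):
    """Coalesce header and body so the application hands TCP a single write."""
    return alert + payload


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    msg = build_message(b"ALR", b"small payload")  # made-up 3-byte alert
    sock = make_low_latency_socket()
    nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    print(len(msg), nodelay)
    sock.close()
    # On a connected socket, one sock.sendall(msg) replaces two send() calls.
```

Coalescing the writes is usually the gentler change, since it leaves Nagle in place for the rest of the traffic on that connection.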


Appendix: IPs used in the examples

ACT I: ICMP_BHNew*.pcap

- 192.168.1.1 and 192.168.1.254 are servers on the same switch.

ACT II: SlowSSHLogin2.pcap

- 192.168.1.1 is the client. 172.16.50.50 is the ssh server. 192.168.75.75 and 192.168.200.200 are NIS+ servers.

ACT III: SlowFtpAnon.pcap

- 10.10.10.10 is the ftp server. The client, 192.168.1.1, is pulling the file from the server.

ACT IV: MQSlow.pcap

- 172.16.50.50 is the MQ server. 192.168.1.1 is the MQ client. The server is pushing the file to the client.

ACT V: LBProblemNew.pcap

- 10.2.53.102 and 10.17.97.111 are users in different branches. 172.16.10.10 and 172.16.20.20 belong to the load balancer. 172.16.254.254 is the real web server. 172.16.10.10 is the end-user-facing IP of the LB and 172.16.20.20 is the IP used by the LB for source NAT’ing when talking to the real web server.

ACT VI: DCMove_*.pcap

- 192.168.1.102 and 172.16.1.125 are the two servers involved in the transfer. Both send data independently of one another.

Please email me at [email protected] if you would like “The Tool” Visio macro.