SCLENDS Basic Network Troubleshooting

Is Evergreen Slowing Down – Basic Network Troubleshooting

Rogan Hamby, June 13th 2013

Is Evergreen slowing down?

There could be several culprits.

Staff Client Issues

There are known memory leaks in the staff client. These are being actively addressed by the community.

If a memory leak is the cause, it probably won’t affect every station in the same way.

Reboot the troubled station.

Network Issues

From your local switch having fits, to a router in Tennessee dying, to someone in Atlanta doing a thirteen terabit backup, we are at the mercy of the pipes in between.

Usually these problems will grow slowly. All machines will be affected, but it may not seem like that at first because some activities are more prone to interruption.

Staff working with patrons and functions that move large amounts of data (e.g. cataloging) will usually notice first, because that is where lost packets and latency have the greatest perceivable impact.

Now it’s important to look at your network path. There are many common elements in the paths from SCLENDS member libraries to the hosting facility, but no universal ones except the last few hops.

If you use ICMP- or UDP-based tools, be aware of the false positives they can give, since that traffic is often blocked.

I recommend that you use TCP-based traceroutes.

Windows – PingPlotter Pro: http://www.pingplotter.com/pro/

Linux – traceroute -T

Mac – Path Analyzer Pro (uses protocol paths, not just hops): http://www.pathanalyzer.com/
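
As a rough cross-check alongside the tools above, TCP connection time to the server can be sampled with a few lines of Python. This is only a minimal sketch and not a traceroute: it times the end-to-end TCP handshake and says nothing about individual hops. The function names, the default port 443, and the sample count are assumptions made for illustration.

```python
import socket
import time


def tcp_connect_ms(host, port=443, timeout=3.0):
    """Time one TCP connection in milliseconds; return None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return None


def summarize(samples):
    """Summarize a list of timings, where None marks a failed attempt."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "failures": len(samples) - len(ok),
        "min_ms": min(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
        "avg_ms": sum(ok) / len(ok) if ok else None,
    }


def probe(host, port=443, count=5, pause=1.0):
    """Take `count` samples, pausing between them, and summarize the results."""
    samples = []
    for i in range(count):
        samples.append(tcp_connect_ms(host, port))
        if i < count - 1:
            time.sleep(pause)
    return summarize(samples)
```

Calling `probe("evergreen.example.org")` (a placeholder host name; substitute your own catalog server) from a troubled station and from a healthy one gives two summaries to compare. Failures or a rising average on every station suggest the path; trouble on a single station points back at that machine.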

If the issue is on your local LAN or anywhere in SC and is ongoing, you need to address it either internally or with the state-level E-Rate board.

If the issue is outside SC, we can look at appealing for a remedy or some kind of routing change, but we can’t guarantee results.

If the issue is at the hosting facility, we can fix it immediately.

[Figures: Standard Traceroute; TCP Based Path]

So… what if everything so far looks clear?

It’s a SERVER(s)!

Our Setup

Load Balancer

App Servers

Database Servers: Production; Replication and Reporting

How can I tell which has just gone to meet Werner Jacob?

(warning: broad simplifications ahead)

If it’s the DB servers, everything goes to heck, starting with database retrieval, and the errors will usually say ‘SQL’ in them somewhere. But it’s quick!

If it’s only the replication server, then only reports, including notices, will be affected.

App bricks – it’s very rare for all four app bricks to fail at once, so usually some machines will do fine while others have issues, or the problem appears random.

Example: When catalogers have template issues, they may have lost them on one brick but not others.

When a brick crashes you will usually get errors referencing various PM files (Perl modules) or specific scripts.

When it’s the load balancer – everything slows down painfully and everything goes to heck. Eventually stations will time out and errors will reflect that.

Don’t jump to conclusions, but these examples should give you some insight into the kinds of things to look for.
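
The rules of thumb above can be written down as a first-pass sorter for copied error text. This is a rough sketch that carries the same broad simplifications: the patterns and labels are assumptions made for illustration, not actual Evergreen error formats, and a match is a hint about where to look, not a diagnosis.

```python
import re

# Checked in order; the first matching rule wins.
RULES = [
    # Database trouble: the error usually says 'SQL' somewhere.
    ("database server", re.compile(r"sql", re.IGNORECASE)),
    # App brick trouble: the error references a Perl module or script file.
    ("app brick", re.compile(r"\.p[ml]\b", re.IGNORECASE)),
    # Load balancer trouble: stations eventually time out.
    ("load balancer or network", re.compile(r"time[d\s-]*out", re.IGNORECASE)),
]


def guess_culprit(error_text):
    """Return a rough guess at the failing component for one error message."""
    for label, pattern in RULES:
        if pattern.search(error_text):
            return label
    return "unknown"
```

For example, a made-up message such as `guess_culprit("died at Example.pm line 12")` falls under the app brick rule. Anything that comes back as "unknown" should still be copied and reported as-is.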

Copy errors. Observe and report. Communicate on the listserv. An IRC channel specific to SCLENDS is also available. Call Rogan in an emergency (he’s not always at his desk).
