Thursday, March 25, 2010

Troubleshoot Bandwidth Issues

Determining why your web application is slow can be difficult. Whether it is Workflows or Firefox or QuickTime, slow responsiveness is annoying, and so it grabs your attention. The fundamental problem, though, could be at your own PC, at the remote server, or at any point in between, and there could be any number of issues at any of those points. Even clever, experienced network people can have trouble. You'll soon see why I had to include that last sentence.

The task can be daunting if you're not a network engineer, and I suspect many of you are not. But I just went through a "teachable moment" at one of my clients, and that, as well as Suzanne's excellent recent post about bandwidth, compels me to describe what happened and what it could mean for you.

The users at the site complained about slow Workflows responsiveness between 2 and 4 in the afternoon. Workflows was the only thing they complained about. I have some software tools to look at the circuit from the library to the Internet and that always looked good. It was always well below what the circuit could carry. The router at the edge of the state firewall was always very responsive (I've seen problems there in the past), so I did what I almost always do when I first look at a problem. I assume it's someone else's fault.

They continued to complain. It would sometimes be so bad their Workflows application would close. This makes it hard to run a library. So, the next time I was there, I downloaded and ran a free program called Ping Plotter (www.pingplotter.com/freeware.html). It is basically an enhanced traceroute program. If you don't know what traceroute is, you should probably give this post to your tech. I point it to any place out on the Internet and see how responsive the path is. When troubleshooting Workflows, I always point to 216.146.126.246 because that is the last router I can see before the route goes inside the state firewall.
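
For the curious, here is roughly the bookkeeping a tool like this does for each hop: it sends repeated probes and tracks how many came back and how long they took. This is only a minimal Python sketch of that idea, not how Ping Plotter itself is written. The function name summarize_hop and the convention of using None for a probe that got no reply are my own assumptions for illustration.

```python
from statistics import mean


def summarize_hop(samples):
    """Summarize probe results for a single hop.

    samples: round-trip times in milliseconds, with None standing in
    for a probe that never got a reply (an assumed convention).
    Returns (loss_percent, average_ms); average_ms is None when no
    probe was answered at all.
    """
    if not samples:
        raise ValueError("need at least one probe result")
    replies = [s for s in samples if s is not None]
    lost = len(samples) - len(replies)
    loss_percent = 100.0 * lost / len(samples)
    average_ms = mean(replies) if replies else None
    return loss_percent, average_ms
```

Feed it the results for one hop, for example four probes where two were never answered, and it returns the loss percentage and the average response time of the ones that did come back.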

Within seconds of starting this tool I could see there was a problem, a big problem. Most of the packets our library was sending to the Internet were not being responded to. We were frequently seeing more than 70% packet loss. It was amazing that any of our applications were working at all. The curious part, though, was that we were seeing packet loss at all hops, not just one bad spot having trouble. Sticking with my previous assumption, that it's someone else's fault, I called Qwest. The Qwest guy was fantastic, but basically he proved to me that it wasn't his fault. And furthermore, he also proved that the problem was inside the building. Uh-oh, it looks like my fault.
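
The "loss at all hops" clue can be turned into a simple rule of thumb: find the hop where loss begins and then continues all the way to the destination. If that is the very first hop, suspect something on your side of the circuit. Loss that shows up at one middle hop and then disappears is often just a router that is slow to answer probes, not a real problem. Here is a minimal Python sketch of that rule. The function name, the 5 percent threshold, and the sample numbers in the usage note are all assumptions for illustration, not measurements from this site.

```python
def first_persistent_loss(loss_by_hop, threshold=5.0):
    """Return the index of the hop where loss starts and then persists
    through every later hop, or None if no such run reaches the end.

    loss_by_hop: packet loss percentage per hop, nearest hop first.
    threshold: loss percentage treated as significant (assumed value).
    """
    start = None
    for index, loss in enumerate(loss_by_hop):
        if loss > threshold:
            if start is None:
                start = index  # a lossy run begins here
        else:
            start = None  # a clean hop ends the run, so it was not persistent
    return start
```

With made-up numbers, first_persistent_loss([70, 72, 75, 71]) returns 0, meaning the trouble starts at the first hop, while first_persistent_loss([0, 0, 40, 0, 0]) returns None because the loss does not carry through to the destination.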

This site has a CentreCOM 24-port switch they got from one of the early Gates hardware distributions in Montana. I think this came with the 2002 distribution. When I finally replaced it with a different switch the problem was gone. The switch had just gotten to the point where it could not reliably move packets through itself and it looked like the problem was a bad Internet connection. Or actually, it looked like a bad Internet connection, until I used the right tool to look at the problem.

So one moral of the story is to try using Ping Plotter. It's free. It's easy to install. It might point you in the right direction if you are having a "slowness" problem, chronic or otherwise. It does require a little understanding to be able to use it.

Another is to be suspicious of these switches. Maybe this one had just been mistreated at some point, but maybe they are getting old as a group.

Another, possibly, is to not assume initially that it's someone else's fault. Naa... I think I'll continue to do that.
