This post focuses on some fundamental network troubleshooting issues that may be most helpful to less technical readers such as new administrators and small-business owners. It was inspired by the fact that I’ve had a number of experiences involving basic network troubleshooting “in the field” over the past month, with both clients and family/friends. I’d like to share the work that I did for them here, with the end goal of teaching a few skills that can potentially save you a headache and prevent you from (heaven forbid) calling Geek Squad. Today I’m going to focus primarily on the network side; I'll save the hardware and other server troubleshooting for future posts.
1) If you have more than five endpoints, you need a diagram…and backups!
I run into this all the time, at small businesses and large enterprises alike. There’s nothing like going to help someone out and finding a spaghetti mess of cables between the network rack and the switch, where nothing is labeled and you can’t get a feel for the network infrastructure without logging in.
A simple way to remedy this is to have a diagram, and either to make one person responsible for keeping it current or to keep it on a wiki where anyone can update it. It doesn’t need to be fancy; it just needs to do one job: serve as a one-stop shop for information pertaining to the network. At a minimum, the diagram should include hardware, IP scheme, cabling scheme, an idea of routing flow, servers and their purpose, default gateways, firewalls, and other special considerations. Too much detail is better than not enough, and it's definitely better than none whatsoever. And don't worry if you don't have access to a diagramming program such as Microsoft Visio -- a (legibly) hand-drawn diagram can easily get the job done.
An example of a good diagram could look something like this.
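If you'd like something a script can check alongside the drawing, the same information can also live in a small inventory. This is only a rough sketch, not a replacement for the diagram: the `Device` and `Network` classes, the `find_problems` helper, the gateway address, and the switch port names are all made up for illustration, and only 192.168.0.101 comes from my own home network.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    ip: str
    role: str
    switch_port: str  # where the cable lands (hypothetical naming scheme)


@dataclass
class Network:
    name: str
    cidr: str
    gateway: str
    devices: list


def find_problems(net):
    """Return a list of human-readable inconsistencies in one network entry."""
    subnet = ipaddress.ip_network(net.cidr)
    problems = []
    if ipaddress.ip_address(net.gateway) not in subnet:
        problems.append(f"gateway {net.gateway} is outside {net.cidr}")
    seen = {}  # maps IP address -> name of the first device using it
    for dev in net.devices:
        if ipaddress.ip_address(dev.ip) not in subnet:
            problems.append(f"{dev.name} ({dev.ip}) is outside {net.cidr}")
        if dev.ip in seen:
            problems.append(f"{dev.name} and {seen[dev.ip]} share {dev.ip}")
        seen[dev.ip] = dev.name
    return problems


# Hypothetical example data; replace with what is actually in your rack.
office = Network(
    name="office-lan",
    cidr="192.168.0.0/24",
    gateway="192.168.0.1",
    devices=[
        Device("media-server", "192.168.0.101", "streaming", "sw1/port4"),
        Device("printer", "192.168.0.50", "printing", "sw1/port7"),
    ],
)
```

Calling `find_problems(office)` returns an empty list when the gateway and every device sit inside the subnet and no two devices share an address.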
You’ll also need to have backups of your configurations and ALL data before and after any change goes in. This just saves headaches for later. Ask my friend how she feels about backups now after she lost all of her music, college data, and photos from a crashed hard drive.
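For configuration files that you can already reach as ordinary files, a before-and-after copy can be as simple as the sketch below. The function name `backup_config`, the label convention, and the example paths are my own assumptions; pulling the configuration off a router or switch in the first place depends on the device and isn't covered here.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_config(config_path, backup_dir, label):
    """Copy config_path into backup_dir with a timestamp and label in the name."""
    source = Path(config_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.name}.{stamp}.{label}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target


# Example usage (hypothetical paths):
# backup_config("router.conf", "backups", "before-change")
# ... make the change ...
# backup_config("router.conf", "backups", "after-change")
```

Keeping both copies means you can compare them later and see exactly what a change touched.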
2) Check physical connections and power.
You wouldn’t believe the number of issues that have been caused by a cat that somehow disconnected the power, or a random employee who “moved something” and managed to unplug a cable. Use your diagram and confirm your physical connections!
3) Network recon!
Three tools can diagnose most basic network issues: ping, traceroute, and telnet. This is where your diagram comes in: use it to guide some network reconnaissance as you try to locate the problem. All of these commands are run from the Windows command prompt or a Linux shell.
Ping -- Open a command prompt (Start -> Run -> cmd) and start by pinging the loopback address (ping 127.0.0.1), which confirms that your own machine's TCP/IP stack is working. Then work your way out toward the Internet and any servers whose services you need. Usually I ping localhost, then the default gateway (from the diagram), then a server in a different network (if applicable), and then 8.8.8.8 (Google’s public DNS server). If it fails at any point, make sure ping (ICMP echo) is allowed on the remote device, then investigate the hop where the replies stop.
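If you run this ladder often, it can be scripted. The following is a minimal sketch that shells out to the system ping command and reports the first rung that fails. The function names and the gateway address 192.168.0.1 are assumptions for illustration, so substitute the addresses from your own diagram. The exit code is only a rough signal, so read the real ping output when in doubt.

```python
import platform
import subprocess


def ping_command(host, count=2):
    """Build the ping command; Windows uses -n for the count, most others use -c."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def ping(host):
    """Return True if the system ping command exits successfully."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_failure(targets, check=ping):
    """Try each (label, host) pair in order; return the first label that fails, or None."""
    for label, host in targets:
        if not check(host):
            return label
    return None


# Nearest target first, farthest last. The gateway address is hypothetical.
LADDER = [
    ("loopback", "127.0.0.1"),
    ("default gateway", "192.168.0.1"),
    ("Google public DNS", "8.8.8.8"),
]

# Example usage:
# failed = first_failure(LADDER)
# print("all targets replied" if failed is None else f"first failure: {failed}")
```

Stopping at the first failure mirrors the manual approach: whatever sits between the last good rung and the first bad one is where to look.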
Traceroute -- Traceroute (tracert on Windows) shows the route that network packets take toward a specified target host, one hop at a time. Compare the output with your diagram and make sure that traffic is following the path you expect. Test, test, test.
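Checking a trace against the diagram comes down to finding the first hop that differs from what you expected. Here is a small sketch of that comparison, assuming you have already copied the hop addresses out of the trace output by hand. The function name and the sample addresses are hypothetical.

```python
def first_divergence(expected_hops, observed_hops):
    """Return (hop_number, expected, observed) for the first mismatch, or None.

    Hop numbers start at 1, the way traceroute prints them. Extra observed
    hops beyond the expected list are ignored.
    """
    for index, expected in enumerate(expected_hops):
        observed = observed_hops[index] if index < len(observed_hops) else None
        if observed != expected:
            return (index + 1, expected, observed)
    return None


# Hypothetical path from the diagram versus what a trace showed.
expected = ["192.168.0.1", "10.0.0.1"]
observed = ["192.168.0.1", "*"]  # "*" is how a timed-out hop is printed
```

With this sample data, the comparison flags hop 2, which tells you to start looking at the device that should have answered as 10.0.0.1.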
Telnet -- A ridiculously simple way to make sure a server’s services are running is to telnet to the port in question at the IP address of the server (diagram again!). If the connection opens, something is listening on that port; if it is refused or times out, the service is down or something in between is blocking it. For example, my server at home was being silly and not streaming data, so I ran “telnet 192.168.0.101 8888” to test my custom streaming port. Turns out my services were corrupt and had to be restarted. Aron also wrote about a Microsoft tool that does the same thing!
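If a telnet client isn't installed on the machine you're working from, the same check can be approximated with a short script. This sketch uses Python's standard socket module to attempt a TCP connection; the function name `port_is_open` and the three-second timeout are my own choices. Like telnet, it only shows that something accepts connections on the port, not that the service behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage, with the address and port from my home streaming server:
# print(port_is_open("192.168.0.101", 8888))
```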
4) If you can, reboot.
A goal at AppliedTrust is to understand the problem first. If the first three steps didn't help you figure out your problem, sometimes you just need to do a simple reboot. Network devices, like desktops and servers, can sometimes be cranky, and rebooting them might do the trick by reloading a fresh configuration. Make sure you have the configuration backed up first in case the device loses it (this happened to me last week at a client's site!). If you find yourself rebooting often, your problem is probably with the hardware. It might be time to start researching a code upgrade or digging a little deeper into the environmental status or the device logs to see if there is a problem or a bug.
5) If you need to call, please have information!
One of my favorite calls is the vague “this isn’t working, please come fix it” (*cough* DAD *cough*). It's always best to document as much information as you can about the problem and to provide that upfront. If I show up with no idea of what’s not working (shame on me!), it could end up being a waste of time and money: I might do all of the research on-site only to find that I can’t fix it immediately, or that I could have done it remotely. Here’s my personal template for troubleshooting questions that I’ll always ask:
- What changed in the infrastructure, and when? Did you have a change plan or a description of EXACTLY what was done?
- Does the issue happen all the time, or just sometimes? Multiple machines, or one machine? Multiple networks, or one network?
- What are the results from Steps 1-4? What have you tried?
Sometimes, just by asking these questions, the problem will become completely obvious and we’ll solve it then and there.
Hopefully this helps you with a few ideas. Feel free to talk to us at AppliedTrust for more complicated issues. We do infrastructure too!
