Troubleshooting is a form of problem-solving applied to repairing a failed system or application. Knowing how to proceed after a system failure, where to look, and how to find the root cause of the problem are skills that generally develop over time.
The life cycle of a troubleshooting session usually includes:
- detection: noticing that a problem exists
- identification: identifying what the problem is
- analysis: determining what causes the problem
- correction: fixing the problem
- prevention: ensuring that the problem doesn't happen again in the future
A step-by-step approach to troubleshooting can help you pinpoint the root cause of a problem that breaks the system or the application more quickly. Below are some questions to ask yourself.
What just changed?
The first reaction to something that stops working should be to ask "OK, so what changed?". Looking into recent changes is often the most productive place to start. You can look for files, especially configuration files, that might have been modified, applications or packages that were just added, services that were just started, and so on.
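As a minimal sketch of the "what changed?" check, the function below lists files modified within the last day or so. It assumes a POSIX-style find; the name recent_changes, the default directory /etc, and the one-day window are illustrative choices, not part of any standard tool.

```shell
# List regular files under a directory that were modified within the last N days.
# recent_changes is a hypothetical helper name.
#   $1: directory to search (assumed default: /etc)
#   $2: window in days (assumed default: 1)
recent_changes() {
    # -type f      : regular files only
    # -mtime -N    : modified less than N*24 hours ago
    # 2>/dev/null  : hide "permission denied" noise when not running as root
    find "${1:-/etc}" -type f -mtime "-${2:-1}" 2>/dev/null
}
```

For example, recent_changes /etc 2 would list configuration files touched in the last two days. Package installs and service restarts need other sources, such as the package manager's log, which vary by system.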
What errors am I seeing?
Pay close attention to any errors displayed on the system console or in your log files. Check whether they point to a particular cause. Have you seen errors like these before? Do you see any evidence of the same errors in older log files or on other systems? What do online searches tell you? No matter what kind of problem you've run into, you're not likely to be the first sysadmin who has run into it.
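One way to scan a log for errors is sketched below. The function name recent_errors and the search pattern are assumptions; real applications use their own wording, so the pattern will usually need adjusting.

```shell
# Print the most recent lines in a log file that look like errors.
# recent_errors is a hypothetical helper name.
#   $1: log file to search
#   $2: maximum number of matching lines to show (assumed default: 20)
recent_errors() {
    # -i: ignore case; -E: extended regex, so | means "or"
    grep -iE 'error|fail|critical' "$1" | tail -n "${2:-20}"
}
```

Running the same function against an older, rotated copy of the log shows whether the errors are new or have been there all along.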
How is the system or the service behaving?
Looking into the symptoms of the problem is also likely to pay off. Is the system or the service slow, or completely unusable? Are some users unable to log in? Are some functions not working? Noticing what works and what doesn't can help you find out what's wrong.
How is this system different from one that is still working?
If you have a working system that mirrors the faulty one and you can compare the two, the differences between them may lead you to the root cause of the problem.
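A comparison like this can be sketched with diff, assuming you have already copied the relevant file or directory from the working system to a local path (for example with scp or rsync). The function name compare_configs is hypothetical.

```shell
# Compare a file or directory tree from a working system with the faulty one.
# compare_configs is a hypothetical helper name.
#   $1: copy taken from the working system
#   $2: the same path on the faulty system
compare_configs() {
    # -r: recurse into directories; -u: unified output
    # Lines starting with "-" exist only in the working copy,
    # lines starting with "+" exist only on the faulty system.
    diff -ru "$1" "$2"
}
```

diff exits with status 1 when it finds differences, so a non-zero exit here is a finding, not a failure of the command.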
What are the probable points of failure?
Think about how the application or service works and where it is likely to have problems. Does it rely on a configuration file? Does it need to communicate with other servers? Is a database involved? Does it write to any log files? Does it involve multiple processes? Check whether the required processes are running.
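One way to check whether a required process is running is sketched below, assuming the service records its process ID in a PID file. Both the function name is_running and the PID file path in the usage note are hypothetical; many services use other conventions.

```shell
# Return success if the PID recorded in a PID file belongs to a live process.
# is_running is a hypothetical helper name.
#   $1: path to the service's PID file (an assumption about the service)
is_running() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    # Signal 0 sends nothing; it only checks that the process exists and
    # that we may signal it. Run as root or as the service's own user,
    # otherwise a live process owned by someone else is reported as missing.
    kill -0 "$pid" 2>/dev/null
}
```

A call such as is_running /var/run/myservice.pid && echo up would then report on that one service.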
Which troubleshooting commands should I have at hand?
- top -- for looking at performance, including some memory, swap space, and load issues
- df -- for examining disk usage
- find -- for locating files that have been modified in the last day or so
- tail -f -- for viewing recent log entries and watching to see if errors are still arriving
- lsof -- to determine what files a particular process has open
- ping -- quick network checking
- ifconfig -- checking network interfaces
- traceroute -- checking connections to remote systems
- netstat -- examining network connections
- nslookup -- checking host resolutions
- route -- verifying routing tables
- arp -- checking IP address to MAC address entries in your cache
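Not every system has all of these installed. On many current Linux distributions, ifconfig, netstat, route and arp come from the older net-tools package and may be missing, with ip and ss from iproute2 covering the same ground. The sketch below runs a diagnostic only if it is present. The function names run_if_present and snapshot, and the particular flags chosen, are illustrative assumptions.

```shell
# Run a diagnostic command only if it is installed, with a header line,
# so the same snapshot works on minimal systems.
# run_if_present and snapshot are hypothetical helper names.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@"
    else
        echo "== $1: not installed, skipped"
    fi
}

# Collect a quick, read-only picture of disk and network state.
snapshot() {
    run_if_present df -h         # disk usage, human-readable sizes
    run_if_present ifconfig -a   # all network interfaces
    run_if_present netstat -rn   # routing table, numeric addresses
    run_if_present arp -an       # ARP cache, no name lookups
}
```

Redirecting the output, as in snapshot > /tmp/snapshot.txt, keeps a record of the system's state at the time of the failure.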
What should I NOT do?
Don't confuse symptoms with causes. Whenever you identify a problem, ask yourself why the problem exists.
Be careful not to destroy "evidence" as you work furiously to get your system back online. Copy log files to another system if you need to recover disk space to get the system back to an operational state. Then you can examine them later to help figure out what caused the problems you're working to resolve. If you need to repair a configuration file, first make a copy of the file (e.g., cp -p config config.save) so that you can more easily look into how and when the file was modified and what you had to do to get things working.
Keep in mind that you might end up making a lot of changes in the process of tracking down your problem. Later on, you might want to think through which of those changes actually resolved the problem.
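The advice above about copying a configuration file before repairing it can be sketched as follows. The function name backup_config and the timestamp suffix are assumptions added so that repeated backups don't overwrite each other; the original cp -p config config.save works just as well for a single copy.

```shell
# Save a timestamped copy of a file before editing it, and print the copy's path.
# backup_config is a hypothetical helper name.
#   $1: file to back up
backup_config() {
    dest="$1.save.$(date +%Y%m%d-%H%M%S)"
    # -p preserves the modification time and permissions (and ownership where
    # permitted), so the copy still shows when the file was last changed.
    cp -p "$1" "$dest" && echo "$dest"
}
```

After the repair, diff -u against the saved copy shows exactly which changes were made.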
What should I do?
- Record your actions. If you're using PuTTY or MobaXterm to connect (or some other tool that lets you record your system interactions), turn on logging. This will help when you have to review what you did. If you're not out of disk space, you can also use the script command to record your login session (e.g., script troubleshooting.$(date +%m%d%y)).
- If you can't record, keep notes on what you did and what you saw. You might not remember it all later, especially if you're stressed. You might remember the steps, but not the order in which you ran them.
- After the problem is resolved, document what happened. You might see it again, and you might need to explain to your boss or your customers what happened and how you're going to prevent it from happening in the future.
- Whenever possible, think about how the problem could be avoided in the future. Can you improve your monitoring services so that disk space, memory and network issues, configuration changes, etc. are brought to your attention long before they affect running services?
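Recording a session with script can be sketched as below. The function name session_log_name is hypothetical; the file name follows the troubleshooting.MMDDYY pattern from the example above.

```shell
# Build a dated file name for a recorded troubleshooting session,
# e.g. troubleshooting.031524 for 15 March 2024.
# session_log_name is a hypothetical helper name.
session_log_name() {
    printf 'troubleshooting.%s\n' "$(date +%m%d%y)"
}

# Interactive usage (not run here):
#   script "$(session_log_name)"
# Everything typed and displayed is written to the file until you
# type exit or press Ctrl-D.
```

Note that a second session on the same day would overwrite the first unless you pass script its append option (-a) or add a time component to the name.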
Wrap up
Good troubleshooting skills can really save the day, and having a plan to follow when a problem arises can play a major role in getting your systems and applications back online.