Epic Fail: The Need for System Documentation

It’s not easy for me to admit failure! But anytime it can help someone learn a lesson, I’m more than willing to air my dirty laundry.

I haven’t always been in IT. Many of you who have followed me on Spiceworks know that once upon a time I was a police officer. The biggest lesson I learned was from my old patrol Sup, who said, “If it ain’t written down, it never happened.” And isn’t that the truth, as the following story illustrates.

One case I worked on was a sexual assault case. A young lady was raped at gunpoint, and she was able to provide us with a good deal of information, such as a description of the car, the individual involved, and the weapon he used. Within hours of it happening, I knew who our suspect was, had located the car, and had a good idea what kind of weapon was used. From her description, it was more than likely a Ruger 22 caliber automatic. But the problem came when I applied for a search warrant to locate the weapon. I didn’t spell out in detail how I arrived at the conclusion that it was a Ruger 22. So when we arrested him and searched his house, guess what we found? A Ruger 22. But at the pre-trial conference, because I’d failed to write out my thinking and how I arrived at the conclusion on the type of weapon involved, his lawyers had a field day. As a result, we lost the weapon for the trial and couldn’t even refer to it. That meant the difference between sending him away for three years versus the 25 years he so richly deserved. Not exactly one of my most shining moments. My documentation skills let the victim down, and at the end of the day, I had no one to blame except me.

OK, so what does all this have to do with IT, SOX, and so on? Simple! We live and die by our documentation. I think we can all agree that network and system documentation is important, but how about the more mundane things like tickets or how to get things done?

Anytime you need to explain how anything needs to be done, there are several questions you need to answer:

  • Who is doing the job? I’ve also found it useful to mention to whom something can be escalated.
  • What are you doing? Procedures should always be written in 4th grade English. And oh, lots of pretty pictures aren’t a bad thing. You might also want to mention symptoms that might be seen, especially when it comes to problems. If your documentation software accepts search tags, put in every tag you can think of that associates with the solution, the application, and the problem.
  • When: sometimes certain problems are seen associated with certain events (backups, virus scans, etc.).
  • Where might be a little hard to come by, but it can still apply, if you have names of servers, datacenters etc.
  • How can be associated with procedures, and again, remember the KISS rule (Keep It Simple Stupid).
  • Why: under some conditions, issues might occur. Knowing the “Why” can ease frustration.
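One way to make sure none of these questions gets skipped is to treat them as required fields. What follows is only a minimal sketch, not a feature of any particular documentation tool: the `ProcedureDoc` class, its field names, the `missing_answers` helper, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProcedureDoc:
    """One documented procedure; every field name here is hypothetical."""
    who: str = ""
    what: str = ""
    when: str = ""
    where: str = ""
    how: str = ""
    why: str = ""
    tags: tuple = ()  # search tags: solution, application, problem


def missing_answers(doc):
    """Return the questions that were left blank, in checklist order."""
    questions = ("who", "what", "when", "where", "how", "why")
    return [q for q in questions if not getattr(doc, q).strip()]


# Illustrative entry with three of the six questions answered.
doc = ProcedureDoc(
    who="Help desk; escalate to the on-call sysadmin",
    what="Reset a locked user account",
    how="Follow the account-reset SOP",
    tags=("account", "lockout", "password"),
)

print(missing_answers(doc))  # ['when', 'where', 'why']
```

A check like this could run before a procedure or ticket is saved, so a blank "Why" is caught by the author rather than by an auditor.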

The exact same information needs to be put into tickets. SOX auditors have been known to pull change management tickets and look at them. This is especially true for service and user accounts, and what’s going on in the ticket must be easily understood. For example: a person wants a user account for a new user in this department by this date, with the following permissions. Now the “How” is probably outlined in an SOP, but that doesn’t relieve you of the responsibility of mentioning what you did. And surely, you need to account for your time somewhere. Netwrix recently conducted a survey that revealed astonishing results regarding documentation tendencies.

Another place where you need to answer these questions is in server builds, relocations, etc. We recently went through a bit of brain damage where providing answers to the questions would have helped. We had a server that needed to be relocated from one site to another. The idea was that folks had done this a dozen times, and the process should be well understood. But there’s an old expression about best laid plans going awry, and we ended up with two newbies doing this for the first time. While the “How” was adequately answered, the rest was not. What looked on the surface like a Linux server relocation morphed first into the server supposedly already being in the box (a total misunderstanding), and then into going to another office, pulling a Windows server, and relocating it. In the final analysis, that server didn’t have to be moved at all.

A Field Technician will go to this office and locate this server (give Serial Number, etc.). The Field Tech will then give the On-Call Tech a phone call, telling him/her that the server has been located. The Field Tech will then RDP into the machine and configure the IP address, etc. to work at the location it will be moved to (again, this part of the story would contain information like location, IP address, etc.). Once that is done, the On-Call Tech will shut the server down, then call the Field Tech and notify him/her about it. The Field Tech will then move the server to the new location, plug it in, and wait till the On-Call Tech tells him/her that it’s up and going.

This little story, with some additional information, would leave little doubt as to what’s going on, who does what, and how. Additionally, all documentation for this task should be in one place. Even though we might have SOPs generated and refer to them, having the script for what needs to be done in one place (like cut and pasted into the ticket) might be very useful and less confusing.

Another place we have to account for things is in our ticketing system itself. A ticket always has a number assigned to it, and we should be looking at our tickets and literally go from one to the other. But sometimes we get a ticket in that’s a duplicate, or was a mistake, or was generated by accident. What do we do with them?

So many times they’ll simply be deleted, and that’s where the trouble starts. You now have a numerical gap in the record, and that will really raise a question. Gaps are something you can ill afford to have, and should there be issues later, an attorney can have some real fun with them. It’s as simple as this: how can we prove what that missing ticket was all about?

The courtroom is packed. Your ex-boss is on trial for shortchanging the company by a couple of million dollars. You’re on the witness stand, and a prosecuting attorney is asking you about the gaps found in the ticketing system. You answer, “Probably junk tickets, we deleted them”.

“Probably,” he says. “Are you sure?”

“Absolutely,” you answer.

“That was three years ago.  Can you honestly say for sure that they were junk tickets?”

You see where that one is going? If you’re lucky, you and your ex-boss will be sharing jail cells. At the least you’re going to be presented as incompetent. The tickets are there for your protection.
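Gaps like the ones in that courtroom exchange are easy to find yourself, long before anyone else goes looking. Here is a minimal sketch, assuming you can export the surviving ticket numbers from your ticketing system as a plain list of integers; the `find_gaps` function and the sample numbers are made up for illustration.

```python
def find_gaps(ticket_numbers):
    """Return (first_missing, last_missing) ranges in a set of ticket numbers.

    Only gaps between the lowest and highest number present are reported.
    """
    ordered = sorted(set(ticket_numbers))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


# Illustrative export: 1004-1006 and 1009 have been deleted.
print(find_gaps([1001, 1002, 1003, 1007, 1008, 1010]))
# [(1004, 1006), (1009, 1009)]
```

Every range this reports is a question somebody may ask you years later, so each one should have a written explanation behind it.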

So how do you cover it when there is a gap? We had something like that happen. A ticket got generated, but there was something wrong with it, and it just kept regenerating over and over. I finally had to get into the ticketing software, stop the services, and then delete the two hundred plus tickets generated. That’s a big gap. First, I collected information about the ticket, and I answered all the questions I needed to:

  • Who discovered the problem.
  • What was the problem, and what did I do to resolve it.
  • When all this happened, and where.
  • How it happened, and why. I also collected screen shots.

All this became part of what is called an MFR (Memorandum For Record), and it became part of the documentation we gave the SOX auditors. Because the incident was so well documented, it was never even questioned.
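The same idea can be kept in a form that is easy to check against the ticket numbers. The sketch below is hypothetical and is not how our MFR was actually stored: the `Memo` class, its fields, the `unexplained` helper, and every value in the example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memo:
    """A Memorandum For Record covering a range of deleted tickets (hypothetical layout)."""
    first_ticket: int
    last_ticket: int
    who: str
    what: str
    when: str
    where: str
    how: str
    why: str


def unexplained(gaps, memos):
    """Return the gaps that no single memo fully covers."""
    return [
        (lo, hi)
        for lo, hi in gaps
        if not any(m.first_ticket <= lo and hi <= m.last_ticket for m in memos)
    ]


# Illustrative values only; none of these come from a real incident record.
memo = Memo(
    first_ticket=2000,
    last_ticket=2210,
    who="Sysadmin on duty",
    what="Deleted tickets created by one ticket regenerating over and over",
    when="Date and time of the incident",
    where="Ticketing server",
    how="Stopped the services, then deleted the duplicates",
    why="A faulty ticket kept regenerating; screenshots attached",
)

print(unexplained([(2000, 2210), (2300, 2300)], [memo]))  # [(2300, 2300)]
```

Anything left in that list is a gap with no paper trail, which is exactly what you want to fix before the auditors arrive.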