Exchange Server Down: You Only Have Minutes

You just got the call: your Exchange server is not sending or receiving email. Your one and only Exchange server. OK, this is bad. You drop everything and begin to check the Exchange server for problems.

Is it up? Yes.

Are the needed services running? Yes.

Does it have enough disk space? Yes.

OK – time to check the Event Logs.

It can’t find a domain controller. Time for a ping test. Check. NSLookup? Check.

OK… what in the world is going on?

While this is a hypothetical problem for you, it was a real problem for one of our customers. How much time do you think you really have to solve this problem? With your only Exchange server down, you obviously don’t have weeks or days. You really don’t have hours, and seconds are, of course, unrealistic. So we’re really talking about minutes. You have minutes. So how can you quickly figure out the source of the problem?

Back to the Exchange server. It appears to be working fine. So what’s the issue? The old standard “what changed” is a key part of the answer. You’ve (hypothetically) already started down the path of seeing what’s changed by looking at the state of the server, and it all looks good.

To solve this problem, and any one like it, you’re going to need to know what’s changed on your servers and, possibly, within your entire environment. Could a password have been changed? Did someone change permissions in AD? Hmmm… this is going to be tougher than you thought.

Without a system where IT pros log each and every change made in your environment (something our “2014 State of IT Changes” survey shows isn’t consistently used), or a change auditing solution in place that tracks every change by auditing the systems themselves, you’re never going to be able to easily tell Who did What, When and Where and, therefore, determine how to fix the problem within minutes.

And the clock is still ticking…

Now everyone’s aware of the issue and helpdesk calls are coming in like crazy. If only you knew how to pull up all changes made in the last 10 minutes.

Our customer had a change auditing solution in place and was able to do exactly that.
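To make "all changes made in the last 10 minutes" concrete, here is a minimal sketch in Python. It assumes your change records are already available as simple Who/What/Where/When entries; the record layout, the sample entries, and the `recent_changes` helper are all made up for illustration and do not represent any particular auditing product.

```python
from datetime import datetime, timedelta

# Hypothetical change records (Who did What, Where and When).
# The names, timestamps and layout are invented for this sketch.
CHANGES = [
    {"who": "admin.a", "what": "Reset password for svc-backup",
     "where": "AD: Users", "when": datetime(2014, 6, 2, 8, 15)},
    {"who": "admin.b", "what": "Modified subnet mask",
     "where": "AD: Sites and Subnets", "when": datetime(2014, 6, 2, 9, 54)},
    {"who": "admin.c", "what": "Changed group membership",
     "where": "AD: Groups", "when": datetime(2014, 6, 2, 9, 58)},
]


def recent_changes(changes, now, window_minutes=10):
    """Return changes made within the last `window_minutes`, newest first."""
    cutoff = now - timedelta(minutes=window_minutes)
    hits = [c for c in changes if cutoff <= c["when"] <= now]
    return sorted(hits, key=lambda c: c["when"], reverse=True)


if __name__ == "__main__":
    # Pretend the outage call came in at 10:00.
    for c in recent_changes(CHANGES, now=datetime(2014, 6, 2, 10, 0)):
        print(f'{c["when"]:%H:%M}  {c["who"]}  {c["what"]}  ({c["where"]})')
```

The hard part in real life is not the filter. It is having trustworthy records to filter in the first place.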

The cause of the problem? You’ll never guess. One of the AD admins was making changes to Active Directory Sites and modified a subnet mask in just such a way that it logically isolated the Exchange server in its own AD site, so the Exchange server could no longer find a domain controller.
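If you want to see how a single mask change can do that, here is a rough Python sketch. It assumes a server is placed in the site whose subnet most specifically matches its IP address (longest prefix wins). The addresses, the site names, the idea that the second site has no domain controllers, and the `site_for` helper are all hypothetical; the customer's actual values are not known.

```python
import ipaddress


def site_for(ip, subnet_to_site):
    """Pick the site whose subnet is the most specific match for `ip`."""
    addr = ipaddress.ip_address(ip)
    matches = []
    for cidr, site in subnet_to_site.items():
        net = ipaddress.ip_network(cidr)
        if addr in net:
            matches.append((net.prefixlen, site))
    if not matches:
        return None
    # Longest prefix (most specific subnet) wins.
    return max(matches)[1]


# Hypothetical values, for illustration only.
EXCHANGE_IP = "10.1.20.25"
SITE_DCS = {"HQ": ["dc01", "dc02"], "Staging": []}  # assumed: no DCs in Staging

# Before: the Staging subnet is a /28 (10.1.20.0 to 10.1.20.15).
before = {"10.1.0.0/16": "HQ", "10.1.20.0/28": "Staging"}
# After: the mask is changed to /24, which now swallows the Exchange server.
after = {"10.1.0.0/16": "HQ", "10.1.20.0/24": "Staging"}

if __name__ == "__main__":
    for label, subnets in (("before", before), ("after", after)):
        site = site_for(EXCHANGE_IP, subnets)
        print(label, site, SITE_DCS[site])
```

Nothing changed on the Exchange server itself, which is exactly why checking the server turned up nothing.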

Think you would have guessed that one? Yeah, me neither. Glad they had a change auditing solution. Otherwise, minutes would have turned into hours or, worse, days.