A term I coined to describe the common sysadmin ability to stay cool in a perceived system-related disaster. The sysadmin calm is a nice side-effect of a sysadmin who is present and exercising his cool.

Since sysadmins are trained, by trade, to stay cool even as the world is falling apart all around them, they develop the inability to simply panic, because they've seen it all before. Hard disks crashing without a backup, network lines cut, monitors blowing up, routers failing, CPUs smoking: it's all part of the life of a sysadmin.

I came to this realization just recently during a series of events. One of them occurred while I was in an informal meeting with my boss. The power went out, plunging us both into darkness. While the front desk staff ran around like chickens with their heads cut off, the boss and I simply continued talking as if nothing had happened at all, despite the fact that several key (but not critical) computers did not have UPSs on them.

Nobody hires a hot-headed sysadmin who makes irrational decisions during a catastrophe. The reason is that sysadmins actually love the chaos of a large problem. They revel in the challenge that such a disaster presents them. It allows them to come up with unique and creative solutions to very complex problems, which, in turn, is very satisfying.

The ability to stay cool, sysadmin cool, is thus a required part of being a good system administrator.

This is indeed a very valuable trait to have in a sysadmin. In fact, if you are a manager hiring a sysadmin to look after your critical machines, this is a trait you will want to ensure the potential employee has.

I am a system administrator and my team and I (I would like to think) have this trait. My team looks after 160 midrange servers for Telstra, the Australian national telecommunications company. These servers are considered mission critical as they control customer billing, national work management, rostering etc. All these servers are stored in two large data centers.

Earlier in the year, the mains power became 'spiky' and somewhat unreliable. Consequently, the UPSs kicked in. Except the switching board shorted out for some reason and one of the data centers had a total and absolute blackout. I remember that day quite clearly. There I was in my city office, when at 10:10am almost every pager on the floor started madly beeping, our monitoring screens all started flashing red icons and the phone calls started. With the data center out of action, Telstra lost access to over 500 servers and associated equipment, ranging from mainframes and midrange servers to routers and switches. With that much equipment down, Telstra lost computer and network access nationwide.

As you might imagine, there was frantic activity as managers tried to coordinate the restoration of services. There was much shouting, screaming and frantic hurrying around ... except for my team. As one of the key sysadmin groups, we were the primary group for getting the systems up and available. Until we got them up, the application support teams would not be able to restart the applications. So there was a fair amount of focus on us. Nevertheless, my team was still joking around, chatting quite happily about various aspects of our lives and generally reveling in the excitement. I remember joking with a fellow sysadmin that with all the chaos, I might be able to sneak in a few server upgrades without having to do the paperwork!

Mind you, my manager was a little bit panicky and became a bit of a micro-manager. As team leader, I did my best to stop her coming into my area and agitating my team with her nervousness. I think that our coolness under pressure didn't ease her agitation and, in many ways, probably unsettled her further, since she probably thought we didn't take the situation seriously. But that is part of the sysadmin cool trait. My entire team knew the seriousness of the situation. If we took too long to get the systems up, our SLAs would have been affected and this would have resulted in hundreds of thousands of dollars in penalties. We fully understood what we needed to do - we just chose not to panic about it.

In the end, we had almost every system up and running within six hours. This may seem quite a long time compared with, say, starting your computer at home, but those six hours included:

  • confirming that we had a stable power supply
  • starting up servers in the correct sequence (e.g. network first, then mainframes, then midrange servers)
  • checking servers for damage (there were a few hardware failures)
  • starting up the applications (some take over an hour to start)
  • fixing corruptions to databases and applications
  • performing testing to ensure everything was working correctly, sending and receiving data etc.
And on top of that, all this had to be coordinated nationally. Not a small job.
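
For the curious, the "correct sequence" step in the list above is essentially a dependency ordering: nothing starts until the things it relies on are up. The short Python sketch below illustrates that idea only; it is not the procedure used that day. The tier names, the `DEPENDS_ON` table and the `start` callback are all made up for the example, and it assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical tiers, loosely following the list above.
# Each tier maps to the set of tiers that must be up before it starts.
DEPENDS_ON = {
    "power_check": set(),
    "network": {"power_check"},
    "mainframes": {"network"},
    "midrange_servers": {"mainframes"},
    "applications": {"midrange_servers"},
}


def startup_order(depends_on):
    """Return the tiers in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


def bring_up(depends_on, start):
    """Start each tier in dependency order and stop at the first failure.

    `start` is a caller-supplied function (an assumption of this sketch)
    that takes a tier name and returns True if the tier came up.
    Returns (started, failed), where failed is None if everything came up.
    """
    started = []
    for tier in startup_order(depends_on):
        if not start(tier):
            return started, tier
        started.append(tier)
    return started, None


if __name__ == "__main__":
    # Pretend every tier starts cleanly.
    started, failed = bring_up(DEPENDS_ON, lambda tier: True)
    print("started:", started, "failed:", failed)
```

Stopping at the first failed tier mirrors the point made earlier: until the systems underneath are up, the application teams cannot restart anything on top of them.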

Despite all this, my team remained cool and did its job. And that's why some sysadmins are paid extraordinary amounts - because when the crunch comes and your business has millions of dollars worth of transactions that can't be processed because of systems that are down, you want sysadmins who can remain calm and collected and get the job done quickly and efficiently. You want system administrators who have sysadmin cool!
