Linux.com

Feature

SysAdmin to SysAdmin: My Birthday "Bash"

By Brian Jones on November 24, 2004 (8:00:00 AM)


My 31st birthday just passed, on Sunday, November 14th. However, due to some events that I can only describe as "interesting," I celebrated it on Sunday, November 21st. For my birthday, I learned a good number of very important lessons. Just a few examples: UPS batteries are wired in series; out-of-band communication devices are a lifesaver when your mail server is down and two thirds of your crew is 800 miles away; document, document, document, and much more.

Here's the scoop

So it's the Friday before my birthday, and I'm saying goodbye to most of the people I work with, who won't be back until after the LISA conference. I'm not feeling stressed about their departure, as I've now been in the department for a little over three years, and feel fairly confident that I can muddle through pretty much anything that comes along. Simple resource requests from users, debugging wireless networking or DSL issues, and debugging services in our environment are all pretty much old hat. So I leave to begin celebrating my 31st birthday.

Sunday the 14th started as a very relaxing day. I had been out Saturday night quite late, and didn't rise until 10:30. On purpose. It was my birthday, and nobody was gonna rush me into doing anything. I walk past my office down to the kitchen and make some coffee, I have some breakfast, and I walk back upstairs to see what's new in my inbox. Only there is no inbox, per se. Just an error message from Evolution saying it couldn't reach my IMAP server. I'm not shaken a bit. I recently moved into a new place, and was forced to give up the awesome Speakeasy DSL service I had for Comcast cable internet. I figured it was a simple matter of having to reboot my cable modem to pick up a new IP address or something. But wait... I can get to Slashdot... I can get to Google... I can get to sites I've never been to before, so I know it's not cached... Odd... But I'm still not quite worried.

Hey, I can't get to the department website! Hey, I can't ping the web server! ACK! I can't ping any of our servers! Oh no. My new Blackberry 7290 lets out a buzz in the background of my office, and I perk up. "Mail! Someone got through to the mail server, so all is not lost!" Wrong. Blackberries can send PIN-to-PIN using nothing more than the cellular network, bypassing any notion of a mail server, and that's exactly how this message came to me, from a member of our group who was about to board a plane for the LISA conference in Atlanta. It read something like "Hey, something amiss here, anyone around?" Well, turns out, I was around, and that was pretty much it.

As a group, we keep a pretty close watch on our infrastructure even when we're not around. We get emails from our UPS to let us know if there's an outage, emails from our syslog server, emails from our air handlers, even emails from various key entry points to our machine rooms and networking cabinets. There was an email from our UPS earlier about a power hit, but then another email came that would seem to have indicated a recovery. I was hopeful that things weren't in some "day after" state when I suited up and headed into the office.

Early Discoveries

The first thing I noticed was via my ears, not my eyes. The UPS has a horn on it that is louder than I had remembered. I silenced it, and then noticed it was still in some faulty state which I hadn't seen before. Flipping through the menus on the LCD screen, I saw that the battery voltage was at 0%. This was bad. However, the room, and the building, had power.

Next, I logged into our console server to connect to our file server. My heart sank when I saw the prompt on the Sun 4500 server: hit ctrl-d to continue booting. Oh no. Why hadn't it booted? A quick glance over to the racks of disk showed that an entire T3 storage array was black. I quickly power cycled the array, continued booting the cycle server, and then quickly switched consoles to check on another machine that I knew would be lost without the file server. It was also in a weird, half-booted and broken state, and running uptime on both machines showed that they had been up only a couple of hours.

At this point, I'm thinking that at some point our entire machine room was black. Just then a member of the user community comes by and says his machine rebooted some time ago, and he's since been unable to get an IP address. So now I know the entire building lost power, which leads me to the only reasonable conclusion, which was kind of scary, because it's never happened before: our UPS completely and catastrophically failed. Oh joy. I call the UPS tech support, and they get somebody on the road to visit within the hour.

Of course, the fact that the power is back means that something else pretty catastrophic happened: every machine in our entire machine room was powered on and booted at the exact same time. Service dependencies be damned!

Calling for Backup

At this point, it's pretty clear that I'll be needing backup. By now, a Sun tech is on his way to replace a disk in our array, and a UPS tech is coming out to get our UPS back to normal. It's clear that the entire world will have to be brought back down, if only to ensure that it is brought up in the order it's supposed to come up in. My boss and coworker are still in town, and they come in to lend a hand.
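One way to make "the order it's supposed to come up in" explicit is to write down which service depends on which and let a topological sort produce the sequence. What follows is only a minimal Python sketch, not the procedure used in this incident; the service names and the dependency map are hypothetical, invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services: each key maps to the set of services it depends on.
DEPENDS_ON = {
    "fileserver": {"storage-array"},
    "dns": set(),
    "mail": {"fileserver", "dns"},
    "web": {"fileserver", "dns"},
}


def bring_up_order(depends_on):
    """Return a start order in which every service follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = bring_up_order(DEPENDS_ON)
print(" -> ".join(order))
```

Reversing the same list gives a plausible shutdown order. A dependency cycle in the map makes `static_order()` raise `graphlib.CycleError`, which is itself a useful thing to learn before an outage rather than during one.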

All this time, it should also be noted that, although we were without a mail server, probably 50 or so messages passed through my Blackberry to communicate with others in my group who were out of town. This was invaluable since there were slight changes and additions to our services which were not yet reflected in the documentation.

The UPS tech says a battery in the UPS failed, and shows us splatterings of battery acid inside the battery compartment. There are 20 batteries, so I'm a little baffled, as I'm not an expert on UPSes. Turns out, UPS batteries are wired in series, which makes some sense if you think about it. Wiring in series is the only way the UPS can get the benefit of aggregated power from all of the batteries. Wiring in parallel will only allow it to use power from one battery at a time. Of course, the side effect of this is that UPS batteries resemble Christmas lights: one goes out, they all go out.

The Sun guy shows up a little later, and, though he seems a bit confused by some of the errors in the logs, replaces the drive and things are well with the world. At this point, it's after 5 PM, and my boss instructs me to go home for my birthday dinner. I followed his orders, and left them there to bring the rest of the service machines back up.

Post Mortem

After all is said and done, we're not happy that this event took place, but the reality is that UPS batteries don't often blow up, and we don't often get simultaneous disk failures in T3 arrays that cause a file server not to boot. At the same time, we are still discussing ways to shorten both routine and emergency downtimes, and how to make the process smoother. The machine room is constantly evolving, and along with it our process for keeping up with the care and feeding of our systems and services.

Some investments, like the Blackberries, which we had upgraded only a couple of days before, were totally justified. Others, like a call-in number for downtime (similar to a school's inclement weather line), are being explored. Still others, like a paid-for inspection of our UPS batteries only a couple of weeks before the failure, are being questioned.

These are all signs of a healthy admin team. There was no infighting, no blaming, no finger pointing, no mumbling under our breath. Things needed doing, and they got done. What didn't get done is getting done with the help of others. What couldn't be done is being researched, and all is peaceful in the user community.


Comments

on SysAdmin to SysAdmin: My Birthday "Bash"

Note: Comments are owned by the poster. We are not responsible for their content.

Uh, no...

Posted by: Anonymous Coward on November 26, 2004 11:59 PM
Wiring in series is the only way the UPS can get the benefit of aggregated power from all of the batteries. Wiring in parallel will only allow it to use power from one battery at a time.


Wiring batteries in series increases the net voltage output from the pack (it becomes the sum of the voltage output from each battery).

Wiring batteries in parallel increases the net current output from the pack (it becomes the sum of the current output from each battery).

In both cases, power is consumed from all batteries simultaneously, and as long as there isn't a mixture of old very weak and new fresh cells in the pack, pretty much evenly divided as well.

The reason the UPS batteries are wired in series is so that the UPS box does not have to have its own internal voltage pump to push the 12v output (I'm assuming the cells in the UPS are 12v packs, that would be a typical size) of the cells up to the range of output for the UPS. Performing voltage pumping consumes power, power that would otherwise go to running your machines during the downtime. It's much more efficient (more power goes to your computers, less to conversion) for the battery pack to supply the UPS with a bit more voltage than it needs, and to have the UPS create its output AC power from a voltage source that's already the same as its AC output voltage.
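The arithmetic can be sketched in a few lines of Python. This is an idealized illustration only: it assumes identical batteries and ignores internal resistance and conversion losses. The 20-battery count comes from the article and the 12 V figure from the assumption above; the 10 A per-battery current is a made-up number.

```python
def series_pack(battery_volts, battery_amps, count):
    """Identical batteries in series: voltages add, current is that of one battery."""
    return battery_volts * count, battery_amps


def parallel_pack(battery_volts, battery_amps, count):
    """Identical batteries in parallel: currents add, voltage is that of one battery."""
    return battery_volts, battery_amps * count


def watts(volts, amps):
    """Power delivered by the pack."""
    return volts * amps


# 20 batteries (from the article), 12 V each (assumed), 10 A each (hypothetical).
series = series_pack(12, 10, 20)
parallel = parallel_pack(12, 10, 20)
print("series:  ", series, watts(*series), "W")
print("parallel:", parallel, watts(*parallel), "W")
```

Both arrangements draw on every battery and deliver the same total power in this idealized model; what changes is whether the pack presents high voltage at low current or low voltage at high current.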

#

Re:Uh, no...

Posted by: Anonymous Coward on November 27, 2004 06:29 AM
One problem with wiring in series (apart from the lack of redundant electron paths...) is that the internal resistances of the batteries are also in series, so the terminal potential difference across the UPS is quite a bit less than the EMF would suggest. That said, I've had an Interruptible Power Supply before, and the thing going through my mind wasn't "does this box have batteries in series or parallel" but "ohshitohshitohshit it's taken five hours and the fsck still hasn't done...."
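The internal-resistance point can be put into numbers with a short Python sketch. It assumes identical batteries and a simple EMF-plus-series-resistor model; the 0.05 ohm internal resistance and 10 A load are invented values chosen only to show the effect.

```python
def terminal_voltage(emf, internal_ohms, count, load_amps):
    """Terminal voltage of `count` identical batteries in series under load.

    Both the EMFs and the internal resistances add up along the string,
    so the drop across the internal resistance grows with the battery count.
    """
    total_emf = emf * count
    total_resistance = internal_ohms * count
    return total_emf - load_amps * total_resistance


# 20 batteries of 12 V each, 0.05 ohm each (hypothetical), 10 A load (hypothetical).
open_circuit = terminal_voltage(12.0, 0.05, 20, 0.0)
under_load = terminal_voltage(12.0, 0.05, 20, 10.0)
print(open_circuit, under_load)
```

With no load the string shows its full EMF; under load the reading falls by the load current times the summed internal resistance.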

#

Re:The layer 8 factor

Posted by: Anonymous Coward on November 27, 2004 04:21 PM
I'm still looking for an entry level sys/net-admin job, but one of the best tricks I've ever come up with was writing my own man page.

After sifting through a mound of scrawled notes on scraps of paper littered all over my apartment, I was reading up on some app when it suddenly occurred to me that this was exactly what I needed. Rather than writing yet another text file to be lost in the shuffle, this one could be used to keep all my notes and inspirations, as they happened, and it was all a 'man mynotes' away no matter where I was in the filestructure.

Picking apart a small existing man page, I discovered it had a markup language not unlike XML or HTML. Some quick noodling around and I had isolated a solid structure to add important system notes as I went along. Add a script to automagically insert ~/mynotes.8-current to /usr/man/man8/ and you have a simple solution to an annoying problem.

While I've not tried it this would seem to scale well, as you could add a cron job to spread the most recent file to multiple machines and limit access to other admins (still wouldn't put passwords and critical infrastructure data in it, but it'd be great for those niggling bits that were mentioned in the article).
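For anyone who wants to try the idea, here is a rough Python sketch of the two pieces described above: generating the markup and copying the file into place. It is an assumption-laden illustration, not the script mentioned in this comment. The function names and the sample note are hypothetical; `.TH`, `.SH`, and `.PP` are standard man-page macros, and the file names follow the ones given above.

```python
import shutil
from pathlib import Path


def render_man_page(title, sections):
    """Build man-page markup from a {heading: [paragraphs]} mapping.

    Caveat: a paragraph beginning with "." or "'" would be read as a
    roff request, so such text needs escaping before it goes in here.
    """
    lines = [f".TH {title.upper()} 8"]
    for heading, paragraphs in sections.items():
        lines.append(f".SH {heading.upper()}")
        for text in paragraphs:
            lines.append(".PP")
            lines.append(text)
    return "\n".join(lines) + "\n"


def install_notes(source, man_dir):
    """Copy the working notes file to <man_dir>/mynotes.8 and return that path."""
    target = Path(man_dir) / "mynotes.8"
    shutil.copyfile(source, target)
    return target


# Hypothetical content, purely for illustration.
page = render_man_page("mynotes", {
    "Name": ["mynotes \\- local admin notes"],
    "Boot order": ["Bring the storage array up before the file server."],
})
print(page)
```

Writing `page` to `~/mynotes.8-current` and then calling `install_notes(Path.home() / "mynotes.8-current", "/usr/man/man8")` would mirror the workflow described above; the copy needs write access to that directory, and the man path differs between systems.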

Enjoy!

#

Series or Parallel? How about Redundancy too?

Posted by: Anonymous Coward on November 28, 2004 01:01 AM
The batteries in your car are composed of cells in series. It is pretty much universal in car batteries.

If you have batteries in parallel, with no isolation components, they will all be of slightly different voltages, and current will be flowing out of the ones with higher voltages into those with less voltage (under the condition that they are not being heavily charged). That is not good for holding charge. You could put big, low resistance diodes in series with each battery, but that would take power when drawing current from the batteries. It is a balancing act. What is more important? No design is best for all situations.

Probably putting the batteries in series was the best decision from some angle, like less inefficiency of the circuitry that changes the DC of the batteries into the AC that runs the computers etc.

There might be a UPS design using parallel batteries, or even some banks of series batteries, making it series-parallel. You could even have a couple banks of batteries all charged up but not connected all at once, and when one bank is getting low on charge, it gets switched out and the next bank gets switched in. That would be great for this situation here, because if the one bank of batteries failed, another bank would automatically switch in and run things.

A couple of questions occur to me here. Why were the batteries bad? Did they die before the UPS was called upon, or did they die because they were weak and couldn't quite handle the load it was attached to? Were they overcharged? What happened so that the batteries were not good?

There should be an indicator on the UPS that says that the batteries are not taking a charge or the path is open or that the charger is going crazy and boiling acid out of the batteries. I hope different UPSs have these design features.

Sounds like a solution for this for the future might involve redundant batteries, and/or a UPS that checks its own batteries and tells you there is a problem. Otherwise the batteries need to be checked every so often by the personnel to make sure they work, check for acid leaking out, check voltage across the batteries, maybe even apply a load across the batteries and see how the voltage drops like the car battery testers do it. Then you know ahead of time if the batteries have failed.

Really, there should be a real test of the UPS at least every month, where you actually let it run something. Maybe you should have a couple of UPSs and take one off line and connect it to some load and then unplug that UPS and see how it runs the load, which could be a light bulb maybe. If it works fine, put it back on line and try another UPS. But at least develop some way to test the UPS, and make sure it isn't overloaded. If overload had anything to do with your situation, it might happen again next power failure.

Having a UPS is much better than having no UPS, but having a UPS that tells you it is having a problem is better, and having some sort of redundancy would have kept this problem from happening.

#

Battery check ...

Posted by: Anonymous Coward on June 28, 2005 01:51 AM
Another problem with attempting to verify whether the battery is good is that you actually have to put it under a specific load test; i.e., take your car battery to an autoshop and do a battery test on it to see what I mean.

Both good and bad batteries can show positive potential (i.e., 13.8v) when you do just a voltage check, but to see if a battery still has available capacity, you have to put it under a specific load (like, say, 10 amp load), THEN do a voltage check to see if it is handling the load; if the voltage still reads ~13.4-13.8v, then the battery is still good and can handle the load; if the voltage reads less than 12v when it should read >13v, then the battery is bad.
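As a rough Python sketch of that rule of thumb: the function takes a voltage reading made while the battery is under load and sorts it into a verdict. The thresholds are the figures quoted above; the "marginal" verdict for readings that fall between them is an added assumption, since the rule above does not say what to conclude there.

```python
def load_test_verdict(volts_under_load, good_min=13.4, bad_below=12.0):
    """Classify a battery from a voltage reading taken while it is under load.

    good_min and bad_below default to the figures quoted above; "marginal"
    is a hypothetical label for readings between the two thresholds.
    """
    if volts_under_load >= good_min:
        return "good"
    if volts_under_load < bad_below:
        return "bad"
    return "marginal"


for reading in (13.6, 12.8, 11.5):
    print(reading, load_test_verdict(reading))
```

The important part is the precondition rather than the code: the reading only means something if it was taken with the load applied.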

There are probably some UPS's that do that, but you have to search for them, and the price is not going to be pretty!

#


Re:Parallel & series

Posted by: Anonymous Coward on November 29, 2004 03:20 AM
Yes, but UPSes usually require 48 volts, made by daisy-chaining 6-volt lead-acid batteries. It makes little sense to wire the batteries in parallel; if the electronic circuitry blows out, you're toast anyway.
If you want it redundant, double the whole UPS (with enough capacity) not just the batteries.

#

Why series

Posted by: Anonymous Coward on November 29, 2004 06:15 AM
A better reason for wiring them in series is protection against catastrophic failure; there are 2 obvious ways a battery can fail catastrophically: a cell or battery can go open circuit, or it can go short circuit.

If they're wired in parallel you might lose a battery and never know, without more elaborate circuitry.

If they're wired in series and a battery goes short circuit things stop and you have a dead battery; if they're wired in parallel things stop, the other batteries discharge through the failed one as fast as physics allows, and you have a probable fire.

#


UPS Battery exploded?

Posted by: Anonymous Coward on November 30, 2004 04:11 AM
Huh! I wonder if, when the whole facility lost power, maybe the current load from *everything* plugged into the UPS was more than the UPS could handle, causing a very rapid discharge, which could cause a cell to overheat/vent gas/explode/etc.?

I.e., maybe more and more stuff, over time, got plugged into the UPS-backed-up circuit, without realizing the startup (transient) current load of all that stuff might have exceeded the UPS capability...?

#

Parallel & series

Posted by: Administrator on November 25, 2004 01:48 AM
Actually, wire batteries in parallel for higher current and in series for higher voltage (e.g., 2 x 1.5V cells in series to power a 3V torch/flashlight).

#

The layer 8 factor

Posted by: Administrator on November 25, 2004 01:20 AM
I really appreciate the focus of this article - people. I agree completely and often wish for an environment more like what you describe. Unfortunately, I work in a much smaller organization which permits many little kingdoms to develop. It seems that common personality characteristics among sys admins include isolation and control. Ten years ago I fit that description much more. Now, I want to interact in a healthy way that produces a whole greater than the sum of its parts. One obstacle to that may be the lack of demand from userland, especially higher management, when the organization is relatively small. I wonder what size organization you work in and what size organization is necessary to promote that kind of teamwork? The latter part of that question is somewhat rhetorical, as I believe that wherever two sys-admins coexist, efforts should be made to work as a team.

One of the key things I often find missing is a desire to communicate. Documentation and logging are important for even the smallest of teams. How often do they get done? Another factor is practice, or rather rehearsal, of response plans. Don't write it down as a plan then wing it when the emergency happens. Worse still would be to not have a plan and wing it, too.

All of this, just like clueless users, is what I refer to as layer 8 issues. People require tweaking and performance testing, too. What good is the perfect computing infrastructure when people cannot effectively interact with it? That goes for sys admins as well as clueless users.

#

This story has been archived. Comments can no longer be posted.



 