I post way too infrequently. It seems like every 4th post is about how I had some sort of elaborate hardware failure. So let me tell you about my most recent one.
Roughly a month ago, my NFS requests started failing. This was odd: the server still appeared to be happily running along, but, after further investigation, it was totally unresponsive. I knew this was bad; I just didn't know exactly how bad.
After resetting the system, nothing happened. Now I was really worried. Several resets later, I sat back and pondered the results. About half the time, the system would get halfway through POSTing. Once, the system nearly booted, but it hit a disk error and locked up. Sporadic results like these point to the motherboard, the CPU, or, most likely, the power supply.
I took the power supply out of my main desktop box and plugged it in. The system would now boot to the same point every time, but it still threw disk errors and refused to fully boot. This was a huge step forward: any issue that is reproducible is explainable and solvable.
It was time to buy a new power supply. It seems as though every time a component fails, I get to buy something better and more advanced. There is nothing that spawns learning quite like failure. The power supply I purchased was an Enermax Revolution 85+ (eight hundred and fifty watts!). Enermax is my favorite power supply maker at this point. This power supply had a few bonuses too: it is exceptionally efficient, it is fully modular, it can power two dozen or so hard disks, and it had a $70 rebate. I am totally pleased with the purchase.
The next step was to figure out the disk issue. With hard disk issues, your ears are an effective troubleshooting tool. Really? Really. If you hear a hard disk making sounds it doesn't normally make, back up your data immediately. This tip would've saved my bacon on many occasions. I noticed the server making odd noises days before the failure and should have acted then. After the new power supply was installed, the problem was totally apparent: I had two failed disks. The easiest way to see that a drive has failed is that it doesn't show up while the system is booting. During the power-on cycle, a system checks the disks attached to it and, most of the time, displays each disk's specifications. Two disks that were properly plugged in were not being detected; therefore, they were bad. That, and I could hear that they weren't spinning up properly.
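Had I acted on those noises, a couple of SMART queries would likely have confirmed the trouble early. A hedged sketch, assuming the smartmontools package is installed and using /dev/sda as a stand-in for the suspect drive:

```shell
# Early-warning checks for a drive that sounds wrong (assumes smartmontools
# is installed; /dev/sda is a placeholder for the suspect device).
smartctl -H /dev/sda                # overall SMART health verdict (PASSED/FAILED)
smartctl -A /dev/sda | grep -Ei 'reallocated|pending|uncorrect'  # bad-sector counters
dmesg | grep -Ei 'ata[0-9]+.*(error|fail)'   # kernel-logged disk errors
```

Rising reallocated or pending sector counts are the classic "back up now" signal, usually well before the drive vanishes from POST entirely.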
Disk failure shouldn’t be an issue in servers, and I had RAID implemented on the disks. RAID typically tolerates a disk failure, that is, unless you use a type of RAID that doesn’t. Because of space concerns I had when building out the box, I decided to use RAID level 0 on the disks. RAID 0 stripes data across many disks so they appear as one, combining the storage capacity of all of them. Unfortunately, when one disk fails, all data on the array is lost. Only data you can afford to lose should be put on an array configured this way.
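To make that trade-off concrete, here is a quick back-of-the-envelope sketch for a three-disk array (the 3 x 750GB figures match the drives mentioned below; the other RAID levels are shown purely for comparison, not something I ran):

```shell
# Back-of-the-envelope: usable capacity vs. fault tolerance, 3 x 750 GB disks.
disks=3
size=750   # GB per disk

echo "RAID 0: $((disks * size)) GB usable, survives 0 disk failures"
echo "RAID 5: $(((disks - 1) * size)) GB usable, survives 1 disk failure"
echo "RAID 1 (3-way mirror): ${size} GB usable, survives 2 disk failures"
```

RAID 0 maximizes space and speed precisely by giving up all redundancy, which is exactly the bargain that bit me here.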
Not all data was lost, however. I did follow my own rule and only put expendable data on arrays that could not withstand a failure. The problem was that I considered my main OS to be expendable. The virtual machines, like the one that runs this site, were protected and recovered. The flaw in this setup is obvious: when the main OS is down, the virtual machines can no longer run because they depend on it. This was a classic mistake on my part; I should've put the OS in a safer place. That won't happen again.
The disks I purchased to form the new storage core of the server are from the Western Digital Black family. I really like these drives because they are built for performance and because they are cheap. I purchased 3 of the 750GB model for $60 apiece. I won't know how reliable they are until one of them fails, but the drives get good reviews, so I'm not too worried about it.
Two disks and a power supply at the same time? How on earth could that happen? My current theory is that the power supply didn't fail outright; it degraded to the point where it couldn't muster the power to get the entire system running from a cold start. The system had to cold start when I got a disk failure on my main system array. The second disk was part of my backup array, which could survive a disk failure, so it is possible that disk had been in a failed state for some time.
I have to give props to Zalman and Seagate. Both companies stood by their products' warranties and replaced the faulty parts. There were only 3 months left in the 3-year warranty on the Zalman power supply that failed. The disk was an enterprise-quality disk (but it failed, so….) with roughly 2 years left on its warranty.
Props also go to volume management and filesystem resizing utilities. I used the CentOS 5.4 live CD as a recovery disk to transfer data off the surviving disks after the operating system had failed.
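For the curious, the recovery from the live CD went roughly like this. This is only a sketch: `VolGroup00` and `LogVol00` are the CentOS 5 default LVM names, and `/mnt/usbdisk` is a placeholder for wherever the rescued data lands.

```shell
# Rough sketch of rescuing LVM data from a live CD.
# VolGroup00/LogVol00 are CentOS 5 defaults; substitute your actual names.
pvscan                                  # find LVM physical volumes on the surviving disks
vgscan                                  # find the volume groups they belong to
vgchange -ay VolGroup00                 # activate the volume group
mkdir -p /mnt/recover
mount -o ro /dev/VolGroup00/LogVol00 /mnt/recover   # mount the old filesystem read-only
rsync -a /mnt/recover/ /mnt/usbdisk/rescued/        # copy the data somewhere safe
```

Mounting read-only is the important habit here: a half-dead array is no place to risk writes while you're pulling data off it.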
Another year, another hardware failure. This is why only professionals (like me) should host their own equipment. Typically, people are better off letting a hosting company handle problems like this for them.