Server Collapse

Wow. Where to begin? Last week I had a crazy sequence of events with one of my servers. After a series of segfaults and memory errors, I started seeing corrupted files on the disk. This particular server is housed at a dedicated hosting facility. On their first look at the hardware, they noticed that the chassis fan was dead. They quickly replaced it, and I figured that was a likely source of the memory errors.

The next day, a MySQL database was corrupted, so I requested that they take the server offline and run reiserfsck on /dev/hdc. hdc contained /home, /var, /tmp and swap. After an hour of downtime, I contacted them to see where we were on the fsck, and I was stunned to find out that the drive was dead, particularly /dev/hdc4, which was the /var partition. They do not support Gentoo and wanted to know how to proceed. I simply told them to format the drive to match the partition table that I would email them. I did have /var backed up and would send them a tarball (minus web sites) so they could at least get a bootable system. They agreed and things were moving forward.
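
For reference, this is roughly the check I asked them to run; I can't say exactly which flags the techs used, so treat it as a sketch:

reiserfsck --check /dev/hdc4          # read-only check of the ReiserFS partition
reiserfsck --fix-fixable /dev/hdc4    # repair minor corruption, if the check finds any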

An hour later, I received another call: the system still could not boot, and they believed it was bad RAM and were in the process of finding a set of RAM that all checked out. I was irritated with the additional downtime, but that could have explained the initial memory errors. An hour after that, I received yet another call: the system STILL could not boot, and they suspected a bad mobo or CPU and were in the process of fixing this. I was amazed how things were spiraling. After many hours of downtime, I finally got the call that the system was ‘up’. In the end they had to replace the CPU, RAM, chassis fan and a secondary hard drive. Crazy.

But the fun didn’t end there. I was using dirvish to perform incremental off-site backups of the server. One thing I quickly discovered was that I was not backing up /var/db. I remember adding it to the exclude list, thinking it would not be necessary and that I could speed up backups by skipping it. Dumb! Because of this, I was not backing up /var/db/pkg. This directory is a vital part of Gentoo’s package system: it lets the system know what it has in world and which versions of packages are installed, including slotted packages (NS).
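
To make the mistake concrete, here is roughly what the offending stanza in the vault's dirvish default.conf looked like. The client name and paths are illustrative, not my real config:

# default.conf for the dirvish vault (illustrative values)
client: myserver.example.com
tree: /
exclude:
    var/db/       # <- the fatal line: this also drops /var/db/pkg
    var/tmp/
    tmp/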

So after getting web and mail back up, I needed to figure out how to repair my system. Basically, my only real option was to rebuild the system. There are a few other ways of doing this, but this seemed to be the most thorough way of getting it done.


emerge -ev --nodeps system    # rebuild the entire system set from scratch (-e/--emptytree), skipping dependency checks
emerge -ve --nodeps world     # then rebuild everything in world the same way
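
A note for anyone attempting the same recovery: a full --emptytree rebuild takes many hours, and if a package fails partway through, emerge can pick the run back up rather than starting over:

emerge --resume               # continue the interrupted rebuild
emerge --resume --skipfirst   # or skip the package that failed and carry on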

After that I still had some issues with my world file, so I had to see what was wrong with it.


emaint --check world    # report world file entries that don't correspond to installed packages

This showed packages in world that were not in the system. Unfortunately, these could have been masked packages, or packages no longer in portage but still installed on the system. Like I mentioned above, all slotted items were lost. So I have several hardened-sources kernel trees installed, but the system now thinks only one version is installed. A Gentoo developer contacted me and said I could simply delete the kernel directories for those ‘lost’ packages and everything would be fine.
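
A rough sketch of that cleanup, for anyone in the same spot; the kernel version below is a placeholder, not one of my actual trees:

emaint --fix world                          # prune the invalid world file entries emaint flagged
rm -rf /usr/src/linux-2.6.XX-hardened-rY    # remove an orphaned kernel tree by hand (placeholder version)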

Moral of the story… make sure you understand what you’re NOT backing up. When in doubt, back it up!
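
One cheap way to do that check, sketched here with illustrative paths: dry-run rsync from the live filesystem against the most recent dirvish image and see what the backup would be missing.

rsync -ain /var/ /backup/myvault/latest/tree/var/ | head -n 40    # -n: dry run; lists files absent from or stale in the image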