Ashoat Posted November 24, 2010 Today's downtime (and Sunday's, earlier in the week) was the result of electrical problems at our datacenter (Hurricane Electric). Since I also recently de-journaled our /home filesystem (converting it from ext3 to ext2) to improve disk I/O performance, there was no journal to replay after the power cycle, which forced our machine to run a very long e2fsck to confirm disk integrity. Short answer: we should be a good bit faster because of the de-journaling, but when electrical problems like this force a power cycle, we might stay down for a day. Electrical problems are rare for datacenters, and hopefully this won't happen again.
Wizard Posted November 24, 2010 Don't we run the risk of screwing up the entire partition without journaling?
Ashoat (Author) Posted November 25, 2010 Yeah, the risk of data loss due to hardware failure is higher now.
madvic Posted November 26, 2010 Why don't you use an uninterruptible power supply to prevent this from happening?
papukai Posted November 26, 2010 "Why don't you use an uninterruptible power supply to prevent this from happening?" I hope that a UPS is already in use. I think that djbob meant hard drive failures, etc.
Ashoat (Author) Posted November 27, 2010 They do use UPSes. Those were actually what broke in this incident. On Saturday night around 9pm at Fremont 1, during a thunderstorm, a power incident involving the electric power utility lasted approximately 3 seconds and damaged two UPSes, causing them to fail. The UPS paralleling system automatically went into bypass in order to restore power. The UPS service technicians inspected the units, determined the specific components that failed, and ordered parts. The failed UPSes were damaged by power fluctuations that occurred during the thunderstorm. On Tuesday morning at 7:27am PST at Fremont 1, the electric power utility had a 1-second power incident. Because the system was running on bypass, this had an effect where it normally would not have. The UPS service technicians have the replacement parts and are on site performing the necessary repairs to restore the 2 failed UPS units to normal operational status.