Tuesday, January 11, 2011

DB goes haywire

Well, that was embarrassing.

Today at noon PST, FRED's database server stopped responding, which of course took the site offline. I'm still investigating, but I believe it was caused by a problem in the hourly data backup snapshot job.

Some background: In EC2, Amazon provides highly available redundant network "drive" storage called Elastic Block Store, or EBS. FRED's data files are stored on an EBS volume. EBS provides a way to take nearly instantaneous snapshots of a volume and store them for backup purposes. FRED's database server does this once an hour. The job flushes the database tables to disk, freezes the filesystem against changes (a nice feature of the XFS filesystem I put on the EBS volume), takes the snapshot, and then thaws the filesystem and the database. All this is done with an open source script written by others especially for use with a MySQL database hosted in EC2.
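To make the ordering concrete, here is a minimal sketch of that flush, freeze, snapshot, thaw sequence. It is not the open source script mentioned above; the function name and the five step callables are hypothetical placeholders. In practice the freeze and thaw steps would wrap something like `xfs_freeze -f` and `xfs_freeze -u` on the mount point, and the database lock would have to be held on one open MySQL session for the whole run. The point of the sketch is only that the thaw and unlock steps run even when the snapshot step fails.

```python
from typing import Callable

Step = Callable[[], None]


def consistent_snapshot(
    lock_db: Step,
    freeze_fs: Step,
    take_snapshot: Step,
    thaw_fs: Step,
    unlock_db: Step,
) -> None:
    """Run the backup steps in order and always undo the freeze and the lock.

    All five callables are hypothetical stand-ins for the real operations.
    """
    lock_db()  # flush tables and block writes
    try:
        freeze_fs()  # stop filesystem changes
        try:
            take_snapshot()  # request the EBS snapshot
        finally:
            thaw_fs()  # runs even if the snapshot raised
    finally:
        unlock_db()  # runs even if freeze or thaw raised
```

Note that `try`/`finally` only helps when a step fails with an error. It does nothing for a step that hangs forever, which is what appears to have happened here.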

Today at noon, it looks like this snapshot job hung somehow, which of course hung the database and the site(s). When I logged into the database server instance, the snapshot job was still "running"; in fact two of them were, since more than an hour had passed and the next hourly run had started on top of the first. I killed the snapshot jobs, but that still left me unable to get the MySQL database running again.
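One common way to keep a second hourly run from piling onto a stuck one is to wrap the job in a lock file and a timeout. The sketch below is an illustration under assumptions, not what FRED's server runs: `run_exclusive`, the lock path, and the status strings are all made up for this example, and it relies on the Unix-only `fcntl` module.

```python
import fcntl
import subprocess
import sys
from typing import Sequence


def run_exclusive(cmd: Sequence[str], lock_path: str, timeout_s: float) -> str:
    """Run cmd unless another run holds the lock; give up after timeout_s.

    Returns "ok", "failed", "skipped", or "timed_out" (hypothetical labels).
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if a run is active.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "skipped"
        try:
            # subprocess.run kills the child when the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return "timed_out"
        return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Example: a stand-in command that exits right away.
    print(run_exclusive([sys.executable, "-c", "pass"], "/tmp/snapshot.lock", 60))
```

Two caveats apply. Killing a job that has already frozen the filesystem leaves it frozen, so a timeout handler would still need to thaw it. And a process blocked in uninterruptible I/O may not respond to the kill at all.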

Rather than monkey with it, I simply launched another database server instance, detached the EBS volume from the "dead" instance, attached it to the new one, and swapped the new one into live production. That took about 5 minutes.
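For readers who want to script that kind of volume move, here is a rough sketch. It assumes a boto3-style EC2 client is passed in (boto3 is a later library, not what was used at the time), and the function name, the IDs, and the device name are placeholders. It covers only the detach and attach; mounting the filesystem, starting MySQL, and pointing the site at the new server are separate steps.

```python
from typing import Any


def move_volume(
    ec2: Any,
    volume_id: str,
    old_instance_id: str,
    new_instance_id: str,
    device: str = "/dev/sdf",  # assumed device name
) -> None:
    """Detach an EBS volume from one instance and attach it to another.

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    ec2.detach_volume(VolumeId=volume_id, InstanceId=old_instance_id)
    # Wait until the volume is free before attaching it elsewhere.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(
        VolumeId=volume_id, InstanceId=new_instance_id, Device=device
    )
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```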

So, overall: not too cool.
I've suspended the backup job until tonight, when I have more time to diagnose the root issue.

Sorry everyone for the outage!

-p
