Friday, January 21, 2011

So far so good

Well, it's been a week and change since the last problem related to the server move. Is it too early to declare the move a success? At this point the only thing that is still an open issue is the volume of email FRED sends. The new outgoing mail service I'm using charges a certain amount of money for a certain number of emails per month. At first I just guessed how much quota FRED would need, and as it turned out i guessed low. In just three days FRED had burned through the quota I bought for him, and I had to buy more. At this rate, FRED will be spending around $400 a year just to send email. Whew!

Tuesday, January 11, 2011

Ok, maybe the backup wasn't the problem...

Now it looks like FRED's PCI compliance scanner took him offline.

Background: Like all web based businesses that accept credit cards online, FRED is required by the credit card companies to have a security scan done periodically by an approved independent vendor. In FRED's case, that's done weekly. As it happens, they do it at noon PST monday, exactly the time FRED went offline today. There was also a coinciding huge spike in network traffic. I'd had problems with their scan being a little "enthusiastic" for FRED before, but the measures I put in place to deal with it were on the old server. So now that FRED moved, the problem came back.

Grrrrr......

DB goes haywire

Well, that was embarrassing.

Today at noon PST, FRED's database server stopped responding, which of course takes the site offline. I'm still investigating, but I believe it was caused by a problem in the hourly data backup snapshot job.

Some background: In EC2, Amazon provides highly available redundant network "drive" storage called Elastic Block Store, or EBS. FRED's data files are stored on such an EBS. EBS provides a way to take nearly instantaneous snapshots of the files and store those for backup purposes. FRED's database server does this once an hour. It consists of flushing the database tables to disk, freezing the filesystem against changes (a nice feature of the XFS filesystem I put on the EBS volume), then taking the snapshot, and then thawing the filesystem and db. All this is done with an open source script written by others especially for use on a MySQL database hosted in EC2.

Today at noon, it looks like this snapshot job hung up somehow, which of course hangs the database and the site(s). When I logged into the database server instance, the snapshot job was still "running"; in fact two of them were (it having been over an hour since the problem first happened). I killed the snapshots, but that still left me unable to get the MySQL database running again.

Rather than monkey with it, I simply launched another database server instance, detached the EBS volume from the "dead" instance, attached it to the new one, and swapped the new one into live production. That took about 5 minutes.

So, over all: not too cool.
I've suspended the backup job until tonight when i have more time to diagnose the root issue.

Sorry everyone for the outage!

-p

Tuesday, January 4, 2011

FRED moves to the cloud

Many FRED users may have noticed that in the past half year or so, FRED has developed a progressively worsening case of narcolepsy. That is, he seems to fall unconscious at times, and fails to respond when you come calling. At first it happened only now and then, and not for very long. These days though, it seems to happen at least once a week, sometimes a couple times a day, for as much as an hour at a time. I have monitoring to alert me when this happens, but I'm not always in range of an internet connection to wake FRED up quickly.

The basic problem is this: FRED has outgrown his home again. For the curious (and geeky), here's some history:

Back in 2002, FRED started out hosted in a cheap shared server setup whose actual hardware specs I never knew. He quickly outgrew that and moved to a dedicated but wimpy Celeron 1.7Ghz box. After that came a 2.0Ghz Xeon single proc server and then a dual proc 2.5Ghz, and for the past couple years, FRED has lived in a dual 3.2Ghz box with 4GB ram.

Up till now, FRED has been hosted in a single server, running Apache/PHP, MySQL, memcached, email, DNS, etc all on that one machine. It currently serves around 1.2 million page views per month, which is not really all that huge, but it's not tiny either. FRED is also a pretty heavyweight application, with lots of database queries, some on tables with a few million records in them, and some complex view definitions. At times of highest traffic, FRED is CPU bound in his current home, especially when some of that CPU is taken by virus scanning and spam filtering over incoming email.

Another part of the problem is this: because of how much work it is to set up a new machine, when FRED outgrows one, it takes me a long time to move him to another one. But virtualization and cloud computing have made this much easier. FRED is now moving into Amazon's Elastic Compute Cloud (EC2). There, I'll be able to provision additional server resources in a matter of minutes or hours, not weeks. It also means FRED can buy more server power for a little less money than with traditional dedicated servers.

FRED's new setup:

Database:
One Standard Large instance (4 CPU units, 7.5GB ram)
Web2 (askfred.net):
One High-CPU Medium instance (5 CPU units, 1.7GB ram)
Web1 (foc.askfred.net, usfaroc.askfred.net, thebaycup.askfred.net, demo.askfred.net):
One Standard Small Instance (1 CPU Unit, 1.7GB ram)


Some of you have also noticed that FRED's email delivery success rate has dropped. This is most likely because FRED's server got on a spam blacklist somewhere (though I've never been able to find out for sure if this is true, nor which blacklist). In conjunction with the move to EC2, I have obtained the services of an outgoing email service which should improve the email delivery rate dramatically. This is a company whose whole job is to deliver email, so they are pros at making sure their servers remain off blacklists and available to send email to you.

Incoming email to @askfred.net (except for support@askfred.net) will remain on FRED's single dedicated physical machine for the time being. This will keep it separate from web and db service, so those two website-critical, latency sensitive services can't be slowed by the very CPU hungry virus and spam scanning processes.

Thanks to everyone for your patience and tolerance while I got this move done. It took many hours of designing and configuring AMIs (virtual server images) upon which to base FRED's new virtual machines. I hope that this will improve both uptime and response times.

Cheers
-Peet Sasaki
FRED admin/developer