FRED tek: 2011

Wednesday, December 14, 2011

Upcoming event rating and size searches are back

Hello tournament seekers-

Over a year ago, I had to disable the "event size" and "expected event rating" search criteria in FRED's upcoming event list. At long last, they are back. These are among the more valuable features of FRED's event search, so I'm very happy to have them back, and I'm sure you will be too.

You'd be surprised how much work it was. Admittedly, most of it was "under the hood" work that will be useful for lots of other features, so it's not just this one feature that caused all the headache. It's kind of like building a car just to drive to the store for milk, but we'll be able to use that car for so much other stuff, I swear!

Anyhow, thanks everyone for your patience while these features were "on vacation".

-p

Warning: extreme geek-ness follows:

The problem:
These two filters were implemented in SQL, using some big joins and subselects on the preregistration table, and (brace yourself...) the USFA event classification chart expressed as an SQL view. Yeah, I admit, I did that one just to prove it could be done. This all worked fine while there were only a few tens of thousands of preregs in the db. But there are now over half a million. Whenever someone used one of these two criteria (especially the rating search), the database server would slow to a crawl and web page loads would time out, bringing the whole site to its knees for all users, not just the one searcher.

Ouch.

The Solution:
The current prereg count and predicted event rating are precalculated and saved in the event table so those filters are now just a simple where clause. But that's the simple and obvious part. The hard part is keeping them in sync in near-real-time as people preregister for the tournament. One way to do this would be to recalculate these values and update them as part of the same transaction as the user's preregistration. This is less than ideal because that could slow the response time to the user's prereg submission, just so we can accomplish some housekeeping tasks. Not cool.
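
For the geeky, here is a minimal sketch of the precalculated-column idea in Python with SQLite. The table and column names are hypothetical (FRED's real schema isn't shown here, and FRED itself runs PHP and MySQL), and only the prereg count is covered; a stored rating prediction would be filtered the same way.

```python
import sqlite3

def make_db():
    """Build an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE event (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            prereg_count INTEGER NOT NULL DEFAULT 0  -- precalculated value
        );
        CREATE TABLE prereg (
            id INTEGER PRIMARY KEY,
            event_id INTEGER NOT NULL REFERENCES event(id),
            fencer TEXT NOT NULL
        );
    """)
    return conn

def refresh_prereg_count(conn, event_id):
    """Housekeeping step: recount one event's preregs and store the result."""
    conn.execute(
        "UPDATE event SET prereg_count ="
        " (SELECT COUNT(*) FROM prereg WHERE event_id = ?)"
        " WHERE id = ?",
        (event_id, event_id),
    )

def events_with_at_least(conn, min_size):
    """Search filter: a plain WHERE on the stored column, no joins."""
    rows = conn.execute(
        "SELECT name FROM event WHERE prereg_count >= ? ORDER BY name",
        (min_size,),
    )
    return [name for (name,) in rows]
```

The expensive counting happens once per preregistration in refresh_prereg_count(), so the search itself never touches the prereg table.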

Instead, FRED now has a task-queue-based publish-and-subscribe system for deferring such processing to a background worker process. Each time someone preregisters for a tournament, a small message is sent to the queue in fire-and-forget style. A worker process is continually pulling messages from the queue and acting on them, in this case updating the event table's prereg count and rating prediction fields. The whole process takes about 5-10 seconds, so the search criteria are correct very soon after the preregistration happens.
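
The fire-and-forget pattern can be sketched in a few lines of Python. This is an in-process toy using a thread and the standard library queue, not FRED's actual queue (which runs between separate processes); the function names and the None sentinel are assumptions made for the illustration.

```python
import queue
import threading

def start_worker(jobs, handle):
    """Start a background thread that calls handle(event_id) per message."""
    def run():
        while True:
            event_id = jobs.get()
            try:
                if event_id is None:   # sentinel value: stop the worker
                    return
                handle(event_id)       # e.g. recalculate count and rating
            finally:
                jobs.task_done()
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker

def on_prereg(jobs, event_id):
    """Front-end path: enqueue a small message and return right away."""
    jobs.put(event_id)
```

The point of the split is that on_prereg() costs almost nothing, so the user's prereg submission never waits on the housekeeping done in handle().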

This pub-sub system will be super-useful for decoupling cause and effect in FRED's processes, and for deferring costly processing to the background, to preserve front end performance.

Whew!
-p

Sunday, August 7, 2011

Turns out it was a Chinese bot.

As it turns out, FRED's recent downtime was caused by an ill-behaved crawler run by a Chinese search engine. When this issue first arose, one of the first things I did was to look for excessive numbers of requests coming from single IPs, and this bot had been among the top 3 or 4 clients over the better part of that day. But because it had made fewer requests than well-behaved crawlers such as Google, Bing, and Yahoo, which FRED has no problem serving, I discounted it as a cause of the issue.

However, I had retrospectively counted the Chinese requests over the whole day in aggregate. When I had a chance to watch the server processes escalate in real time, I saw that one IP address was making as many as 100 concurrent requests! It was the IP of that Chinese spider.
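
The difference between the two ways of counting is easy to show in code. The sketch below is in Python and assumes the access log has already been parsed into (ip, start, end) tuples with times in seconds; that input format and the function names are made up for the illustration.

```python
from collections import Counter, defaultdict

def total_requests(requests):
    """Aggregate view: how many requests each IP made over the whole period."""
    return Counter(ip for ip, _start, _end in requests)

def peak_concurrency(requests):
    """Real-time view: the most requests each IP had in flight at once."""
    events = defaultdict(list)
    for ip, start, end in requests:
        events[ip].append((start, 1))    # a request begins
        events[ip].append((end, -1))     # a request finishes
    peaks = {}
    for ip, changes in events.items():
        # At equal timestamps -1 sorts before +1, so back-to-back
        # requests are not counted as overlapping.
        changes.sort()
        current = peak = 0
        for _time, delta in changes:
            current += delta
            peak = max(peak, current)
        peaks[ip] = peak
    return peaks
```

A crawler that makes many requests one at a time scores high on the first measure and low on the second. A bot that opens a burst of slow requests all at once does the opposite, and that is the pattern that fills up the web server.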

Adding a line to FRED's firewall config fixed the problem by blocking that IP (their whole class B subnet actually). So FRED's search rank in this Chinese search engine will suffer, but I'm ok with that. :^)
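
For scale: a class B sized block is a /16 in CIDR terms, which covers 65,536 addresses. The Python sketch below only illustrates what such a block matches; the real block is a firewall rule, and the network shown is a made-up private range, not the crawler's.

```python
import ipaddress

def is_blocked(ip, blocked_networks):
    """Return True if ip falls inside any of the blocked CIDR networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in blocked_networks)

# Hypothetical blocklist with a single /16 entry.
BLOCKED = ["10.20.0.0/16"]
```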

-p

Thursday, August 4, 2011

Tweaked Apache config

Ok so I opted for the "tweak apache" option. I lowered the MaxClients setting and set MaxRequestsPerChild to 1000. Neither of those will directly stop the number of processes from escalating, but they might cause different behavior to occur when the number of processes gets too high.

We'll see.

Wednesday, August 3, 2011

FRED is flapping

FRED's webserver has been down for a number of short (~3 minute) periods all day today. It began with a few incidents on Friday and a few over the weekend, and has escalated to 24 such incidents today (so far). Apache simply spawns more and more child processes until it exhausts memory on the server, at which point it fails to respond to a probe from FRED's auto-restart monitor. At that point the monitor restarts Apache, and all is well until the next time it fills the available memory.

I may wave a dead chicken over some Apache settings, but since this is suddenly happening with no config changes or significant change in traffic on the site, I'm tempted to just spin up a new EC2 instance and see if that helps. Maybe the current instance is just going bad in some impenetrable way?

-p

Thursday, April 21, 2011

Well that was painful....

FRED is back from today's EC2 outage. Amazon has three of the four availability zones in their us-east-1 region (Virginia) datacenter operating. Unfortunately, FRED's database server was running in the one zone that is still sick. However, I was able to snapshot the database's EBS volume and create a new volume from that snapshot in one of the other zones, then fire up a new DB server instance in that zone, attach the new volume to that instance and get things rolling again.

While riding the bus home from work.

Yay for wifi on the bus. Boo for Amazon having a full day outage.

I was getting pretty proud of FRED's 99.99% 30 day uptime. Now it's all shot to hell: 97.58%!
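
To put those percentages in hours, here is a quick Python sketch. It assumes the monitor reports uptime over a plain 30-day window; the function name is made up.

```python
def downtime_hours(uptime_percent, window_days=30):
    """Convert an uptime percentage over a window into hours of downtime."""
    return (100.0 - uptime_percent) / 100.0 * window_days * 24

# 99.99% over 30 days allows about 4.3 minutes of downtime;
# 97.58% works out to about 17.4 hours.
before = downtime_hours(99.99)
after = downtime_hours(97.58)
```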

Oh well. At least it wasn't on Friday or Saturday when everyone would be trying to download their preregs.

-p

EC2 outage

Hello faithful FRED users- Today Amazon EC2's us-east-1 region is experiencing a serious, sustained outage in EBS connectivity and creation. FRED lives in us-east-1a, so his database server (whose files live in an EBS volume) became inaccessible at about 1am PDT.

All I can do is wait for EC2 to fix the issue. Very curious or geeky folks can follow their progress here: http://status.aws.amazon.com

Please accept my apologies for the FRED outage.

-Peet

Saturday, March 26, 2011

Another try at handling the accented characters in fencers' names

FRED is getting used more and more in Canada these days, which is very cool. However, it's brought a long-standing problem with FRED closer to the surface: Multibyte characters.

FRED is written in PHP, a great language for quickly building complex web applications. However, PHP's support for multibyte encodings is less than awesome. Also, lots of the code in FRED was written in the early days of PHP4 when mb support was even worse.

Lots of our Canadian friends and their clubs have names with accented characters represented in the UTF-8 multibyte character encoding. Their names are accepted into FRED just fine, but when they are transmitted back and forth with Fencing Time, the XML-related PHP functions FRED uses to read and write preregistration and results handle the mb characters badly. There's a bunch of info out there on the web as to how best to handle this problem, and I've tried lots of the suggestions, with mixed results.

Today I deployed another attempted solution to write UTF-8 XML preregistration files for import into Fencing Time using the mb_convert_encoding() function to ensure that the stream output is valid UTF-8.
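
The general idea, normalizing every name to one encoding before it goes into the XML, can be sketched in Python. This is not FRED's PHP code: the element names are hypothetical, and the Latin-1 fallback for bytes that aren't valid UTF-8 is an assumption made for the illustration. It needs Python 3.8 or later for the xml_declaration argument.

```python
import xml.etree.ElementTree as ET

def to_text(raw, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to an assumed legacy encoding."""
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

def prereg_xml(names):
    """Serialize names as a UTF-8 XML document (returned as bytes)."""
    root = ET.Element("preregistrations")   # hypothetical element names
    for name in names:
        ET.SubElement(root, "fencer").text = to_text(name)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Decoding at the edge and encoding once at output means the XML writer only ever sees clean text, whatever encoding each name arrived in.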

Given how many systems this data passes through (FRED's webserver, db server, your browser, your OS, Fencing Time, and back again...), it's hard to be sure everything works 100%, but so far this change has performed well in my tests. Hopefully the real world will behave similarly.

-P

Thursday, March 3, 2011

And here I was getting all happy about the uptime...

Only to have FRED go down hard for 6 hours this morning! What happened: Every night logrotate rotates FRED's web server logs (among others) and restarts the web server process. Last night the webserver didn't come back from the restart for some reason. All it took was a simple (re)start to get FRED back up (not of the whole server, just the webserver process). It's hard to tell why it died, and I'm looking at its logs to see if I can tell, but ultimately the more important question is "how do we prevent this in the future?" Answer: I have installed SIM (http://www.rfxn.com/projects/system-integrity-monitor/), a cron-initiated script that periodically checks to make sure certain services are running and healthy and (re)starts them automatically if they are not.
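
The check-and-restart idea looks roughly like this in Python. It is not how SIM itself is implemented; the function names, the TCP probe, and the injected restart callback are all assumptions made for the sketch.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Crude health probe: can we open a TCP connection to the service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_restart(services, is_healthy, restart):
    """Run once per cron tick: restart whatever fails its health check."""
    restarted = []
    for name in services:
        if not is_healthy(name):
            restart(name)
            restarted.append(name)
    return restarted
```

A cron job would call check_and_restart() every few minutes with a probe such as port_is_open() and a restart function that invokes the service's init script.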

So hopefully this won't happen again.

Wednesday, February 16, 2011

Query of Death Part Deux

Well, I've made some code changes in an attempt to prevent the offending "expensive" query from being executed. To be honest, only time will tell if I've actually eliminated the exact code path that has been running that query, and even if I have, we'll have to see if that stops these little 5 minute outages from happening.

Wednesday, February 9, 2011

Query of death

FRED continues to have short periods of downtime ranging from 3-6 minutes a few times per week. These appear to be caused by a request that runs a very expensive query that takes a couple minutes to run and consumes most of the database's resources during that time. This slows or blocks other users' queries, causing page requests to stack up and take all of the available memory on the web server, subsequently causing all pages to fail to load.

I'm looking at how to prevent that query from occurring. It may actually be triggered by a search engine crawler, which would explain a lot: FRED wasn't built with that kind of behavior in mind, so he fails to handle it.

Sunday, February 6, 2011

Wow, FRED sends a lot of mail

With FRED's current SMTP (outgoing email) service, it'll cost about $550 per year to send email. Yikes! Amazon did just launch an outgoing email service that would be WAY more cost-effective (maybe even free for FRED since he might make it under their free tier), but it's not an SMTP service. Instead, it's a webservice API, which makes it harder for FRED to use it. I have hope that soon they'll add SMTP as an alternative method of using the service.

Friday, January 21, 2011

So far so good

Well, it's been a week and change since the last problem related to the server move. Is it too early to declare the move a success? At this point the only thing that is still an open issue is the volume of email FRED sends. The new outgoing mail service I'm using charges a certain amount of money for a certain number of emails per month. At first I just guessed how much quota FRED would need, and as it turned out I guessed low. In just three days FRED had burned through the quota I bought for him, and I had to buy more. At this rate, FRED will be spending around $400 a year just to send email. Whew!

Tuesday, January 11, 2011

Ok, maybe the backup wasn't the problem...

Now it looks like FRED's PCI compliance scanner took him offline.

Background: Like all web-based businesses that accept credit cards online, FRED is required by the credit card companies to have a security scan done periodically by an approved independent vendor. In FRED's case, that's done weekly. As it happens, they do it at noon PST Monday, exactly the time FRED went offline today. There was also a coinciding huge spike in network traffic. I'd had problems with their scan being a little "enthusiastic" for FRED before, but the measures I put in place to deal with it were on the old server. So now that FRED moved, the problem came back.

Grrrrr......

DB goes haywire

Well, that was embarrassing.

Today at noon PST, FRED's database server stopped responding, which of course takes the site offline. I'm still investigating, but I believe it was caused by a problem in the hourly data backup snapshot job.

Some background: In EC2, Amazon provides highly available redundant network "drive" storage called Elastic Block Store, or EBS. FRED's data files are stored on such an EBS. EBS provides a way to take nearly instantaneous snapshots of the files and store those for backup purposes. FRED's database server does this once an hour. It consists of flushing the database tables to disk, freezing the filesystem against changes (a nice feature of the XFS filesystem I put on the EBS volume), then taking the snapshot, and then thawing the filesystem and db. All this is done with an open source script written by others especially for use on a MySQL database hosted in EC2.
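
The order of those steps matters, and so does undoing them if something fails partway. Here is a minimal Python sketch of that sequence, with each step passed in as a function. It is not the open source script mentioned above, and all of the names are hypothetical.

```python
def snapshot_with_freeze(flush_tables, freeze_fs, take_snapshot,
                         thaw_fs, unlock_tables):
    """Flush, freeze, snapshot, then always thaw and unlock again."""
    flush_tables()                  # flush tables to disk and hold writes
    try:
        freeze_fs()                 # freeze the filesystem against changes
        try:
            return take_snapshot()  # near-instant EBS snapshot
        finally:
            thaw_fs()               # runs even if the snapshot raised
    finally:
        unlock_tables()             # let the database accept writes again
```

The try/finally blocks only cover a step that fails with an error. A step that hangs forever would still need a timeout or an outside watchdog.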

Today at noon, it looks like this snapshot job hung up somehow, which of course hangs the database and the site(s). When I logged into the database server instance, the snapshot job was still "running"; in fact two of them were (it having been over an hour since the problem first happened). I killed the snapshots, but that still left me unable to get the MySQL database running again.

Rather than monkey with it, I simply launched another database server instance, detached the EBS volume from the "dead" instance, attached it to the new one, and swapped the new one into live production. That took about 5 minutes.

So, over all: not too cool.
I've suspended the backup job until tonight when I have more time to diagnose the root issue.

Sorry everyone for the outage!

-p

Tuesday, January 4, 2011

FRED moves to the cloud

Many FRED users may have noticed that in the past half year or so, FRED has developed a progressively worsening case of narcolepsy. That is, he seems to fall unconscious at times, and fails to respond when you come calling. At first it happened only now and then, and not for very long. These days though, it seems to happen at least once a week, sometimes a couple times a day, for as much as an hour at a time. I have monitoring to alert me when this happens, but I'm not always in range of an internet connection to wake FRED up quickly.

The basic problem is this: FRED has outgrown his home again. For the curious (and geeky), here's some history:

Back in 2002, FRED started out hosted in a cheap shared server setup whose actual hardware specs I never knew. He quickly outgrew that and moved to a dedicated but wimpy Celeron 1.7GHz box. After that came a 2.0GHz Xeon single proc server and then a dual proc 2.5GHz, and for the past couple years, FRED has lived in a dual 3.2GHz box with 4GB RAM.

Up till now, FRED has been hosted in a single server, running Apache/PHP, MySQL, memcached, email, DNS, etc all on that one machine. It currently serves around 1.2 million page views per month, which is not really all that huge, but it's not tiny either. FRED is also a pretty heavyweight application, with lots of database queries, some on tables with a few million records in them, and some complex view definitions. At times of highest traffic, FRED is CPU bound in his current home, especially when some of that CPU is taken by virus scanning and spam filtering over incoming email.

Another part of the problem is this: because of how much work it is to set up a new machine, when FRED outgrows one, it takes me a long time to move him to another one. But virtualization and cloud computing have made this much easier. FRED is now moving into Amazon's Elastic Compute Cloud (EC2). There, I'll be able to provision additional server resources in a matter of minutes or hours, not weeks. It also means FRED can buy more server power for a little less money than with traditional dedicated servers.

FRED's new setup:

Database:

One Standard Large instance (4 CPU units, 7.5GB RAM)

Web2 (askfred.net):

One High-CPU Medium instance (5 CPU units, 1.7GB RAM)

Web1 (foc.askfred.net, usfaroc.askfred.net, thebaycup.askfred.net, demo.askfred.net):

One Standard Small instance (1 CPU unit, 1.7GB RAM)

Some of you have also noticed that FRED's email delivery success rate has dropped. This is most likely because FRED's server got on a spam blacklist somewhere (though I've never been able to find out for sure if this is true, nor which blacklist). In conjunction with the move to EC2, I have obtained the services of an outgoing email service which should improve the email delivery rate dramatically. This is a company whose whole job is to deliver email, so they are pros at making sure their servers remain off blacklists and available to send email to you.

Incoming email to @askfred.net (except for support@askfred.net) will remain on FRED's single dedicated physical machine for the time being. This will keep it separate from web and db service, so those two website-critical, latency sensitive services can't be slowed by the very CPU hungry virus and spam scanning processes.

Thanks to everyone for your patience and tolerance while I got this move done. It took many hours of designing and configuring AMIs (virtual server images) upon which to base FRED's new virtual machines. I hope that this will improve both uptime and response times.

Cheers

-Peet Sasaki

FRED admin/developer