Sunday, August 7, 2011

Turns out it was a Chinese bot.

As it turns out, FRED's recent downtime was caused by an ill-behaved crawler run by a Chinese search engine. When this issue first arose, one of the first things I did was to look for excessive numbers of requests coming from single IPs, and this bot had been among the top 3 or 4 clients over the better part of that day. But because it made fewer requests than other crawlers such as Google, Bing, and Yahoo, I discounted it as a cause of the issue. After all, it had made fewer requests than those other well-behaved bots, which FRED has no problem serving.

However, I had retrospectively counted the Chinese requests over the whole day in aggregate. When I had a chance to watch the server processes escalate in real time, I saw that one IP address was making as many as 100 concurrent requests! It was the IP of that Chinese spider.

Adding a line to FRED's firewall config fixed the problem by blocking that IP (their whole class B subnet actually). So FRED's search rank in this chinese search engine will suffer, but I'm ok with that. :^)


Thursday, August 4, 2011

Tweaked Apache config

Ok so I opted for the "tweak apache" option. I lowered the MaxClients setting and set MaxRequestsPerChild to 1000. Neither of those will directly stop the number of processes from  escalating, but they might cause different behavior to occur when the number of processes gets too high.

We'll see.

Wednesday, August 3, 2011

FRED is flapping

FRED's webserver has been down for a number of short (~3 minute) periods all day today. Beginning with a few incidents on friday and a few over the weekend, escalating to 24 such incidents today (so far). Apache simply spawns gradually more and more child processes until it exhausts memory on the server, at which point it fails to respond to a probe from FRED's auto-restart monitor. At that point the monitor restarts Apache, and all is well until the next time it fills the available memory.

I may wave a dead chicken over some Apache settings, but since this is suddenly happening with no config changes or significant change in traffic on the site, I'm tempted to just spin up a new EC2 instance and see if that helps. Maybe the current instance is just going bad in some impenetrable way?