Paul R. Brown
For anyone who's been wondering why this blog has been up and down over the past week, it's a slow-motion battle between the memory police at TextDrive killing the Typo instance that hosts the blog and either a FastCGI dispatcher or a nanny cron job starting it back up. The onus is clearly on me to figure out what's burning memory, and my first inclination was to naively google for Ruby profilers. Here's a rambling account of what I did to conclude that I'm probably out of luck as far as a quick cure for the issues, and then what I did to address them.
There are a couple of speed-oriented Ruby performance profilers, the built-in one and ruby-prof, but there are no space-oriented profilers. There was a brute-force approach based on ObjectSpace.each_object in an old mailing list post from Michael Garniss that looked suitable, so I integrated it into the main controller in Typo as an after_filter and fired up several concurrent wget commands to walk around on a production configuration on my development box at home:
while true; \
do wget -nv -r --delete-after http://localhost:3000; \
done
(There is no reason to try to set it on fire with something like ab.) That won't catch any issues with the vanilla two-dispatcher lighttpd/FastCGI configuration that I use on TextDrive, but it should catch any issues with Typo internals, badly behaved sidebars, etc.
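For a sense of what a brute-force census built on ObjectSpace.each_object can look like, here is a minimal sketch. It is not the code from the mailing list post; the names object_census and report are made up for illustration, and it assumes a reasonably modern Ruby (String#bytesize). It only counts live objects per class and adds up the bytes held in Strings, which is a rough proxy for memory use rather than a real measurement.

```ruby
# Hypothetical helper: tally live objects by class and sum String payload bytes.
# Walking every object is slow, which is why it should not run on every request.
def object_census
  counts = Hash.new(0)
  string_bytes = 0
  # Restricting to Object skips BasicObject instances, which lack #class.
  ObjectSpace.each_object(Object) do |obj|
    counts[obj.class] += 1
    string_bytes += obj.bytesize if String === obj
  end
  { counts: counts, string_bytes: string_bytes }
end

# Hypothetical helper: format the most numerous classes as a small text report.
def report(census, top = 10)
  lines = census[:counts].sort_by { |_, n| -n }.first(top).map do |klass, n|
    format("%8d  %s", n, klass)
  end
  lines << format("%8d  bytes held in Strings", census[:string_bytes])
  lines.join("\n")
end

puts report(object_census)
```

In a Rails controller, something along these lines could be called from an after_filter and written to the log, so that successive dumps can be compared for classes whose counts only ever grow.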
With the profiling code integrated, a request that includes the dump takes several seconds to complete, and there are several hits per page; so I added a class variable (@@no_sooner_than) and a little logic so that profiling requests would only run once a minute or so. With several wget walkers working, top reports that the server runs along at a happy 80-90MB, and eyeballing the profiling output shows memory usage oscillating between <7MB and ~20MB without any perceptible upward trend over the course of an hour and a half. (That said, that's all the data I captured, as WEBrick locked up completely after that hour and a half.)
Armed with the information that there wasn't an easy fix for the memory issues, I switched the FastCGI configuration for the production instance to a single dispatcher from the previous two, pointed a couple of wget walkers at it, and tracked memory usage and process id at the command line, like so:
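The once-a-minute throttle can be sketched roughly as follows. Only the @@no_sooner_than name comes from the account above; the ProfilingThrottle class, the due? method, and the 60-second INTERVAL are assumptions made for illustration, and the sketch makes no attempt at thread safety.

```ruby
# Hypothetical throttle: let an expensive profiling dump run at most once per interval.
class ProfilingThrottle
  INTERVAL = 60 # seconds between dumps (assumed value)

  # Earliest time at which the next dump is allowed; starts in the distant past.
  @@no_sooner_than = Time.at(0)

  # Returns true (and pushes the deadline forward) if a dump may run now.
  def self.due?(now = Time.now)
    return false if now < @@no_sooner_than
    @@no_sooner_than = now + INTERVAL
    true
  end
end

# Usage: guard the slow dump inside the filter.
puts "would dump the object census now" if ProfilingThrottle.due?
```

Passing the current time in as an argument keeps the check easy to exercise without waiting a real minute.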
while true; \
do ps mux | grep ruby | grep -v grep; \
read -t 30; done
I also changed the wget walker command to provide more useful information:
wget -S -r -b -l 4 --delete-after http://mult.ifario.us \
-a /tmp/log_id
where id is a unique number per walker, and so far, so good. Crunching the wget output through shell commands (awk, grep, cut, sort, uniq -c, etc.), e.g.:
cat log* | grep HTTP/1.1 | cut -f 4 -d ' ' | sort | uniq -c
says that mult.ifario.us is consistently returning snappy HTTP/1.1 200 responses about two nines (99.x%) of the time, which isn't great but isn't awful. (Really it's more like 2.5 nines, i.e., −log10(0.003), but who's counting?)
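The "nines" arithmetic is just the negative base-10 logarithm of the failure rate. Here is a small sketch that tallies status codes from wget -S style header lines and computes it. The helper names status_counts and nines are made up, and the sample counts (997 good responses out of 1000) are illustrative numbers chosen to match the 0.003 failure rate, not data from the actual logs.

```ruby
# Hypothetical helper: count HTTP status codes in lines such as "  HTTP/1.1 200 OK".
def status_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    m = line.match(%r{\A\s*HTTP/1\.1 (\d{3})})
    counts[m[1]] += 1 if m
  end
  counts
end

# Hypothetical helper: number of nines, i.e. -log10 of the fraction of failed requests.
# A run with no failures at all yields Infinity.
def nines(ok, total)
  raise ArgumentError, "no requests" if total.zero?
  -Math.log10((total - ok).to_f / total)
end

# Illustrative sample only: 997 OK responses and 3 errors.
sample = ["  HTTP/1.1 200 OK"] * 997 + ["  HTTP/1.1 500 Internal Server Error"] * 3
counts = status_counts(sample)
puts format("%.2f nines", nines(counts["200"], counts.values.sum)) # roughly 2.52
```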
This is one time when I've missed some of the Java runtime environment's capabilities (i.e., the JVMTI) in other language runtimes, but no rocket science was required to get Typo under control.