rubydbc February 25
gem install rubydbc
I'm making slow progress on my personal publishing platform rewrite in Haskell (see earlier posts here and here), so herein part 3 of n, wherein I experiment with data migration and an embarrassingly simple data model. A forthcoming part 4 will be really simple Atom serialization.
As I experiment with the new platform, I'd like a way to move the data from the typo instance into the new environment on-demand; this post is my lab notebook for the export/import experiment.
One of the things that Rails has gotten 100% right is the ability to (easily) access configured environments via interactive (script/console) or scripting (script/running) front-ends. (Using a framework like Spring in the Java space can provide similar functionality by constructing an application context, but it's more awkward to separate out the services that the runtime container would be providing.) My first thought on exporting was to use YAML, but the significant whitespace and cryptic annotations ("|" for a free-form text block with a trailing linebreak and "|-" for a free-form text block without a trailing linebreak) just rubbed me the wrong way. JSON turns out to be a better choice because ActiveRecord supports JSON export (via ActiveSupport::JSON), and there are a couple of JSON libraries for Haskell. One is a predecessor version of the other, and I'm going to work with the earlier version because it has no dependencies other than a baseline GHC 6.6 install.
Getting an entry out to play with is a piece of cake:
./script/runner 'puts Article.find_by_state("Published").to_json' \
> /tmp/entry.json Parsing the JSON is similarly straightforward:
$ ghci
___ ___ _
/ _ \ /\ /\/ __(_)
/ /_\// /_/ / / | | GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/
\____/\/ /_/\____/|_| Type :? for help.
Loading package base ... linking ... done.
Prelude> :load /tmp/JSON.hs
[1 of 1] Compiling JSON ( /tmp/JSON.hs, interpreted )
Ok, modules loaded: JSON.
*JSON> entry <- P.parseFromFile json "/tmp/entry.json"
Loading package parsec-2.0 ... linking ... done.
Right (Object (fromList [("attributes",Object (fromList [
[...] (JSON.hs aliases Data.Map as M and Text.ParserCombinators.Parsec as P, so that's where the P.parseFromFile is coming from.) The map of values is wrapped up a bit, but a couple of simple functions will get it out from behind the type constructors:
*JSON> let unR = \(Right r) -> r
*JSON> let unO = \(Object o) -> o
*JSON> :t (unO.unR) s
(unO.unR) s :: M.Map String Value which gets us down to the level of the first map with one entry under the key "attributes". To get the map of attributes we want out:
*JSON> let m = unO ((((M.!).unO.unR) s) "attributes")
*JSON> m
fromList [("allow_comments",String "1"),("allow_pings",String "1"),... (Haskell uses ! for dereferencing keys in a Data.Map.) And now the components of the entry are easy to extract:
*JSON> let atts = ((M.!) m)
*JSON> atts "allow_comments"
String "1"
*JSON> atts "updated_at"
String "2006-09-15 02:12:45"
*JSON> atts "body"
String "<p>Although I really do like the... The first-cut data model for entries looks like this:
data BlogPost = BlogPost { p_title :: String,
p_summary :: Maybe String,
p_permalink :: String,
p_metadata :: PostMetadata,
p_body :: String,
p_tags :: [String],
p_uid :: String,
p_comments :: [BlogPost]
}
deriving (Show)
data PostMetadata = PostMetadata { m_created :: CalendarTime,
m_publish :: CalendarTime,
m_updated :: CalendarTime,
m_author :: PostAuthor,
m_published :: Bool }
deriving (Show)
data PostAuthor = PostAuthor { p_name :: String,
p_uri :: Maybe String,
p_email :: Maybe String,
p_showEmail :: Bool
}
deriving (Show) And interpolating from typo's model to the new model is just putting the fields in the right place with a little bit of date munging, since the new model has the expectation that dates are represented as Haskell CalendarTime. The reuse of the BlogPost structure for comments is intentional, both for Atom syndication and to support threaded comments.
Pulling all of the entries out is also straightforward:
$ ./script/runner 'puts (Article.find_all_by_state("Published")).to_json' \
> /tmp/entry.json
$ ./script/runner 'puts (Article.find_all_by_state("ContentState::Published")).to_json' \
>> /tmp/entry.json and pulling comments and trackbacks is a similar exercise:
$ ./script/runner 'puts (Comment.find_all_by_state("ContentState::Ham")).to_json' \
> /tmp/comments.json
$ ./script/runner 'puts (Trackback.find_all_by_state("ContentState::Ham")).to_json' \
> /tmp/trackbacks.json although it takes a little doing to collate comments and trackbacks with their parent posts. So far, so good — unlike any of the other migrations (Radio Userland→SnipSnap, SnipSnap→WordPress, and WordPress→typo) I've done, this looks to be neither lossy nor labor-intensive.
As an aside, over about four years of blogging (2003-02-17 through the present), I've accumulated the equivalent of ~110 single-spaced pages of content.
I've been working with the ActiveRecord model for my typo installation using irb (via scripts/console), but some expressions were producing a little too much output for convenience, e.g.:
Article.find_all.collect { |a| a.permalink } Fortunately, there is a convenient configuration property that does the trick by suppressing the output of evaluated expressions:
conf.return_format = "" After which, what you get is what you puts.
Possibly the only thing I like about MySQL is when performing a search, the values 'SIMON' and 'sImOn' are considered equal—case-insensitive searching. PostgreSQL on the other hand considers them to be different—case-sensitive searching. Now I don't know about you but for %99.999~ of the applications I've ever written, I'd rather 'Australia' and 'AuStRaLiA' weren't considered different countries.
The "standard" approach to solving this problem is to change a query from this:
SELECT * FROM countries WHERE name = ?;
To something like this:
SELECT * FROM countries WHERE LOWER(name) = LOWER(?);
Thereby forcing the database to perform a pseudo case-insensitive search. The only problem is that all those nice indexes you've created to ensure fast, efficient searching are totally ignored. (Who can spell full-table scan?) Performance issues aside (I mean after all we know that premature optimisation is the root of all evil right?) what's just as annoying is that I can't actually guarantee uniqueness, which is pretty much the whole point! No, even with a unique index on countries.name, the database will still quite happily allow me to:
INSERT INTO countries (name) VALUES ('Australia');
INSERT INTO countries (name) VALUES ('AUSTRALIA');
INSERT INTO countries (name) VALUES ('aUsTrAlIa');
So when I perform a case-insensitive search as previously discussed, I'll end up with three (count 'em 3) records. Thankfully, there is a solution (of sorts): expression indexes.
PostgreSQL allows you to create indexes based on expressions, say for example LOWER(name), allowing us to create a unique, case-insensitive index as simply as:
CREATE UNIQUE INDEX index_countries_on_name ON countries (LOWER(name));
Ok, so perhaps you knew this already and you're wondering what all this has to do with Rails? Well I'm glad you asked.
Rails (as of 1.2) has a new option for validates_uniqueness_of named, oddly enough, case_sensitive. This is assumed to be true by default (meaning all searches are case-sensitive). Set it to false however and you'll magically get validation queries that look like:
SELECT * FROM countries WHERE (LOWER(countries.name) = 'australia') LIMIT 1;
To compliment this feature, I've recently enhanced the RedHill on Rails Plugins in two interesting (and hopefully useful) ways.
The first is in the core and supports the creation of case-insensitive indexes during schema migration:
add_index :countries, [:name], :unique => true, :case_sensitive => false
The second is in schema validations and causes case_sensitive => false to be passed as an option to validates_uniqueness_of whenever a case-insensitive index is detected.
(I also looked at the possibility of automagically surrounding query parameters, etc. with LOWER() inside find methods but given the myriad forms queries can take, it seems altogether too difficult for my feeble mind at this point.)
The upshot of all this is that at the very least, it should now be possible to add case-insensitivity to your queries and be assured that (bugs not withstanding) the performance of your application won't suddenly plummet as a consequence.
James has a home-grown GTD system which he'll hopefully write about if not release to the world one day. It's totally command-line driven using a combination of bash, perl and ruby to manipulate plain text files and, from what I can tell, it's not only blindingly fast, it seems to work seriously well.
One of the daily tasks that James performs is to prioritise the things he wants to (attempt) to get done and what better time to do that than on the 45 minute train trip to work. As you can well imagine, prioritising 40 items takes a fair amount of time not only due to the sheer number but also, being vision-impaired, reading (and re-reading) takes a bit of concentration. So, we came up with a nifty solution in two parts.
The first part was quite simple: use binary insertion. That way, instead of O(n^2) comparisons, we'd get O(n log n). So far so good. But the that was pretty obvious. The next bit was the real doozy!
Rather than have each pair printed out so that James could read them and make a decision, we instead chose to use some Text-To-Speech (TTS) to literally speak each pair; something along the lines of:
"Is calling your boss to discuss staff pay reviews more important than send wife a bunch of flowers for her birthday?"
In addition, we actually run the TTS in another thread so that we can easily interrupt it, primarily because James: has already chosen a suitable answer; needs to repeat the question; or wants to cancel the whole thing.
About the only problem we encountered was that, on his Powerbook, the say command (built-in to Mac OS X) ran a little too slowly for our liking so we simply aliased it to use the much faster swift command that comes as part of the commercial Cepstral William voice he uses for all his Text-To-Speech.
Now James can listen to the minimal set of questions, making the appropriate decision for each by way of a simple keystroke, and have it all done within 15 minutes all the while free to sit and enjoy the view (such that it is on a suburban train during peak hour) rather than staring at the screen.
So, the recent rash of controversial benchmarks have inspired me to do some of my own microbenchmarking. I want to test web servers, and specifically *slightly* dynamic stuff, basically how long does it take to setup and teardown the request processing stuff without doing anything significant with the request (like talking to a database) as I want to isolate and benchmark the actual web part, not anything else. Luckily, there is a perfet test for this, Hello World! All tests were performed on the same machine, and I won't provide the code or configurations (as then they might be considered useful). I did tune each one to the best of my ability for running in the environment it is in (4 core machine, plenty of ram, etc). So, without further ado, results of a dynamic Hello World on a number of different webapp-ish servers, sorted by performance:
| Server | Requests / Second | Notes |
|---|---|---|
| Apache HTTPD (Worker) via mod_hello_world | Requests per second: 18823.58 [#/sec] (mean) | This is basically a minimal httpd module which just prints hello world to the response |
| Apache HTTPD (Worker) via mod_wombat | Requests per second: 17856.76 [#/sec] (mean) | This uses a server-scoped mod_wombat handler with a default pool of fifty Lua virtual machines |
| Apache Tomcat 5.5.20 | Requests per second: 17644.40 [#/sec] (mean) | This used a JSP and a few hundred thousand requests before the benchmark to let the JVM warm up. |
| Jetty 6.1.1 | Requests per second: 12449.36 [#/sec] (mean) | This used a JSP and a few hundred thousand requests before the benchmark to let the JVM warm up. |
| Mongrel accessed directly | Requests per second: 2378.05 [#/sec] (mean) | This was done via an HttpHandler rather than Rails in order to cut down on overhead |
| Mongrel, four instances, proxied through Pen | Requests per second: 2109.91 [#/sec] (mean) | This configuration was basically a test to make sure that ruby's single-threaded nature wasn't the bottleneck. It seems the additional proxying was much more expensive than not using the extra four cores available. |
| TwistedWeb | Requests per second: 2089.55 [#/sec] (mean) | This is the current TwistedWeb, not TwiestedWeb 2. |
| Mongrel (four instances) proxied through LightTPD | n/a | This was a very unstable configuration. I was unable to run more than a thousand or so requests before lighttpd would start losing track of mongrel instances and start returning errors. LightTPD returns errors very quickly, however. |
Now, to complicate matters further. Mongrel doesn't support HTTP keep alive, so I re-ran all the other benchmarks with keep alive disabled in the client and the numbers dropped about in half in every case, the three-way-handshake seems to matter a lot for micro-benchmarks.
Finally, this is a wholly unscientific set of benchmarks which is basically useless, so please don't read anything important into it. Any real system is almost certainly not bound by request handling in the application server :-)
The Ruby plugin for IntelliJ IDEA has finally been released. IntelliJ is bar none the best Java IDE. Downloaded and installed.