rubydbc

Design by Contract for Ruby is now a Rubyforge project. Version 1.0.0 is available as a gem:

gem install rubydbc

JSON as a Migration Format

I'm making slow progress on my personal publishing platform rewrite in Haskell (see earlier posts here and here), so herein part 3 of n, wherein I experiment with data migration and an embarrassingly simple data model. A forthcoming part 4 will be really simple Atom serialization.

Data Out, Data In

As I experiment with the new platform, I'd like a way to move the data from the typo instance into the new environment on-demand; this post is my lab notebook for the export/import experiment.

One of the things that Rails has gotten 100% right is the ability to (easily) access configured environments via interactive (script/console) or scripting (script/running) front-ends. (Using a framework like Spring in the Java space can provide similar functionality by constructing an application context, but it's more awkward to separate out the services that the runtime container would be providing.) My first thought on exporting was to use YAML, but the significant whitespace and cryptic annotations ("|" for a free-form text block with a trailing linebreak and "|-" for a free-form text block without a trailing linebreak) just rubbed me the wrong way. JSON turns out to be a better choice because ActiveRecord supports JSON export (via ActiveSupport::JSON), and there are a couple of JSON libraries for Haskell. One is a predecessor version of the other, and I'm going to work with the earlier version because it has no dependencies other than a baseline GHC 6.6 install.

Getting an entry out to play with is a piece of cake:

./script/runner 'puts Article.find_by_state("Published").to_json' \
  > /tmp/entry.json

Parsing the JSON is similarly straightforward:

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :load /tmp/JSON.hs
[1 of 1] Compiling JSON             ( /tmp/JSON.hs, interpreted )
Ok, modules loaded: JSON.
*JSON> entry <- P.parseFromFile json "/tmp/entry.json"
Loading package parsec-2.0 ... linking ... done.
Right (Object (fromList [("attributes",Object (fromList [
[...]

(JSON.hs aliases Data.Map as M and Text.ParserCombinators.Parsec as P, so that's where the P.parseFromFile is coming from.) The map of values is wrapped up a bit, but a couple of simple functions will get it out from behind the type constructors:

*JSON> let unR = \(Right r) -> r
*JSON> let unO = \(Object o) -> o
*JSON> :t (unO.unR) s
(unO.unR) s :: M.Map String Value

which gets us down to the level of the first map with one entry under the key "attributes". To get the map of attributes we want out:

*JSON> let m = unO ((((M.!).unO.unR) s) "attributes")
*JSON> m
fromList [("allow_comments",String "1"),("allow_pings",String "1"),...

(Haskell uses ! for dereferencing keys in a Data.Map.) And now the components of the entry are easy to extract:

*JSON> let atts = ((M.!) m)
*JSON> atts "allow_comments"
String "1"
*JSON> atts "updated_at"
String "2006-09-15 02:12:45"
*JSON> atts "body" 
String "<p>Although I really do like the...

The first-cut data model for entries looks like this:

data BlogPost = BlogPost { p_title :: String,
               p_summary :: Maybe String,
               p_permalink :: String,
               p_metadata :: PostMetadata,
               p_body :: String,
               p_tags :: [String],
               p_uid :: String,
               p_comments :: [BlogPost]
             }
        deriving (Show)

data PostMetadata = PostMetadata { m_created :: CalendarTime,
                   m_publish :: CalendarTime,
                   m_updated :: CalendarTime,
                   m_author :: PostAuthor,
                   m_published :: Bool }
          deriving (Show)

data PostAuthor = PostAuthor { p_name :: String,
                   p_uri :: Maybe String,
                   p_email :: Maybe String,
                   p_showEmail :: Bool
                 }
          deriving (Show)

And interpolating from typo's model to the new model is just putting the fields in the right place with a little bit of date munging, since the new model has the expectation that dates are represented as Haskell CalendarTime. The reuse of the BlogPost structure for comments is intentional, both for Atom syndication and to support threaded comments.

Pulling all of the entries out is also straightforward:

$ ./script/runner 'puts (Article.find_all_by_state("Published")).to_json' \
  > /tmp/entry.json
$ ./script/runner 'puts (Article.find_all_by_state("ContentState::Published")).to_json' \
  >> /tmp/entry.json

and pulling comments and trackbacks is a similar exercise:

$ ./script/runner 'puts (Comment.find_all_by_state("ContentState::Ham")).to_json' \
  > /tmp/comments.json
$ ./script/runner 'puts (Trackback.find_all_by_state("ContentState::Ham")).to_json' \
  > /tmp/trackbacks.json

although it takes a little doing to collate comments and trackbacks with their parent posts. So far, so good — unlike any of the other migrations (Radio Userland→SnipSnap, SnipSnap→WordPress, and WordPress→typo) I've done, this looks to be neither lossy nor labor-intensive.

As an aside, over about four years of blogging (2003-02-17 through the present), I've accumulated the equivalent of ~110 single-spaced pages of content.

Cure for irbarrhea

I've been working with the ActiveRecord model for my typo installation using irb (via scripts/console), but some expressions were producing a little too much output for convenience, e.g.:

Article.find_all.collect { |a| a.permalink }

Fortunately, there is a convenient configuration property that does the trick by suppressing the output of evaluated expressions:

conf.return_format = ""

After which, what you get is what you puts.

Rails, PostgreSQL and Case-Sensitivity

Possibly the only thing I like about MySQL is when performing a search, the values 'SIMON' and 'sImOn' are considered equal—case-insensitive searching. PostgreSQL on the other hand considers them to be different—case-sensitive searching. Now I don't know about you but for %99.999~ of the applications I've ever written, I'd rather 'Australia' and 'AuStRaLiA' weren't considered different countries.

The "standard" approach to solving this problem is to change a query from this:

SELECT * FROM countries WHERE name = ?;

To something like this:

SELECT * FROM countries WHERE LOWER(name) = LOWER(?);

Thereby forcing the database to perform a pseudo case-insensitive search. The only problem is that all those nice indexes you've created to ensure fast, efficient searching are totally ignored. (Who can spell full-table scan?) Performance issues aside (I mean after all we know that premature optimisation is the root of all evil right?) what's just as annoying is that I can't actually guarantee uniqueness, which is pretty much the whole point! No, even with a unique index on countries.name, the database will still quite happily allow me to:

INSERT INTO countries (name) VALUES ('Australia');
INSERT INTO countries (name) VALUES ('AUSTRALIA');
INSERT INTO countries (name) VALUES ('aUsTrAlIa');

So when I perform a case-insensitive search as previously discussed, I'll end up with three (count 'em 3) records. Thankfully, there is a solution (of sorts): expression indexes.

PostgreSQL allows you to create indexes based on expressions, say for example LOWER(name), allowing us to create a unique, case-insensitive index as simply as:

CREATE UNIQUE INDEX index_countries_on_name ON countries (LOWER(name));

Ok, so perhaps you knew this already and you're wondering what all this has to do with Rails? Well I'm glad you asked.

Rails (as of 1.2) has a new option for validates_uniqueness_of named, oddly enough, case_sensitive. This is assumed to be true by default (meaning all searches are case-sensitive). Set it to false however and you'll magically get validation queries that look like:

SELECT * FROM countries WHERE (LOWER(countries.name) = 'australia') LIMIT 1;

To compliment this feature, I've recently enhanced the RedHill on Rails Plugins in two interesting (and hopefully useful) ways.

The first is in the core and supports the creation of case-insensitive indexes during schema migration:

add_index :countries, [:name], :unique => true, :case_sensitive => false

The second is in schema validations and causes case_sensitive => false to be passed as an option to validates_uniqueness_of whenever a case-insensitive index is detected.

(I also looked at the possibility of automagically surrounding query parameters, etc. with LOWER() inside find methods but given the myriad forms queries can take, it seems altogether too difficult for my feeble mind at this point.)

The upshot of all this is that at the very least, it should now be possible to add case-insensitivity to your queries and be assured that (bugs not withstanding) the performance of your application won't suddenly plummet as a consequence.

gspreadsheet: running formulas from the command line

I am spending time hacking away on Google APIs to really see what it is like. I remember seeing the Google Spreadsheet Data API that allows you access to spreadsheets in Google Docs & Spreadsheets. There is a full API that gives you access to create and modify spreadsheets even to the level of REST requests for each cell. I was also surprised at the number of formulas available. Suddenly I realised that I could create a spreadsheet and use a cell to do various calculations for me, so I hacked up a ruby script to do this: % gspreadsheet [insert formula] e.g. % gspreadsheet 'GoogleFinance("GOOG")' 467.3 % gspreadsheet 'sin(0.2)' 0.19866933079506 Behind the scenes the script used the Google APIs to put the formula in a cell, and then read from that field to get the calculated output. Writing the script To get this all working I just had to: Find a Ruby library to the Google APIs Work out how to authenticate using the Google ClientLogin API Work out the location for the REST requests Work out why I was getting a 404 error Find a Ruby library to the Google APIs I was surprised to not be able to find a nice API to Google Spreadsheets. I did find this script that helped a lot though. I am working on packaging a gdata-ruby module that I will place in rubyforge soon. It will start out with just APIs such as ClientLogin and Spreadsheets, but hopefully we can grow it to cover more of them. Work out how to authenticate using the Google ClientLogin API The key to authentication is using the ClientLogin API to get the auth token, and hiding it away in a member variable so other requests will add it to the headers: response = Net::HTTPS.post_form(https://www.google.com/accounts/ClientLogin, {'Email' => email, 'Passwd' => password, 'source' => "formula", 'service' => 'wise' }) @headers = { 'Authorization' => "GoogleLogin auth=#{response.body.split(/=/).last}", 'Content-Type' => 'application/atom+xml' } Work out the location for the REST requests To find out the location for feeds it helps to GET the feeds themselves and look for the post URLs in link tags. To get the A1 cell (a.k.a. R1C1) you would use something like: /feeds/cells/#{@spreadsheet_key}/1/#{@headers ? "private" : "public"}/basic/A1" The spreadsheet key is the magic key for each of your spreadsheets that looks something like: pSYwzniwpzSFfn0KFRg9oWB. The 1 right after that is the worksheet id. The private/public check get changed in this based on if the code is authenticating you or not (and if you have allowed public access to the spreadsheet in question). The 'basic' is a projection value. basic means just basic atom. 'values' means a full feed minus formula data, and 'full' means a full read/write feed with everything Finally we give the A1 cell info For reading this cell we just used basic projection, but for writing data into the cell we need to use the URL: "/feeds/cells/#{@spreadsheet_key}/1/#{@headers ? 'private' : 'public'}/full" Notice that we use full here (as we want full access) and yet we do not put in the cell in question. This is because we will POST or PUT an entry piece of XML: <entry xmlns='http://www.w3.org/2005/Atom' xmlns:gs='http://schemas.google.com/spreadsheets/2006'> <gs:cell row='1' col='1' inputValue='=sin("0.2")' /> </entry> This has the row and column and input value (in this case a formula). Work out why I was getting a 404 error At first I was getting 404 errors when I posted data up. The reason was that I wasn't putting the full namespaces in the entry doc: <entry xmlns='http://www.w3.org/2005/Atom' xmlns:gs='http://schemas.google.com/spreadsheets/2006'> vs. <entry> XML always seems so frustrating and when these things happen. Really? You couldn't work out what to do without that namespace? Really? Nice and forgiving. Give me JSON or YAML ;) The full code It was fun to be able to be productive with these APIs immediately because they are just basic REST actions that just require Net::HTTP to access. I will work on getting some nice helper libraries so the low level stuff doesn't even need to be done. For those that are interested, here is the quick hack as one script file. The code is ugly... I am sorry. The library version is a lot nicer (and is just a few lines of code after the require's). You can also get weird behaviour if you use single ticks. For now use " and all is well. gspreadsheet #!/usr/bin/env ruby require 'net/http' require 'net/https' require 'uri' require 'rubygems' require 'hpricot' # # Make it east to use some of the convenience methods using https # module Net class HTTPS email, 'Passwd' => password, 'source' => "formula", 'service' => 'wise' }) @headers = { 'Authorization' => "GoogleLogin auth=#{response.body.split(/=/).last}", 'Content-Type' => 'application/atom+xml' } end def evaluate_cell(cell) path = "/feeds/cells/#{@spreadsheet_key}/1/#{@headers ? "private" : "public"}/basic/#{cell}" doc = Hpricot(request(path)) result = (doc/"content[@type='text']").inner_html end def set_entry(entry) path = "/feeds/cells/#{@spreadsheet_key}/1/#{@headers ? 'private' : 'public'}/full" post(path, entry) end def entry(formula, row=1, col=1) <<XML <?xml version='1.0' ?> <entry xmlns='http://www.w3.org/2005/Atom' xmlns:gs='http://schemas.google.com/spreadsheets/2006'> <gs:cell row='#{row}' col='#{col}' inputValue='=#{formula}' /> </entry> XML end def add_to_cell(formula) #puts entry(formula) set_entry(entry(formula)) end private def request(path) response, data = get_http.get(path, @headers) data end def post(path, entry) get_http.post(path, entry, @headers) end def get_http http = Net::HTTP.new('spreadsheets.google.com', 80) #http.set_debug_output $stderr http end end if __FILE__ == $0 formula = ARGV.first || 'sin(0.2)' gs = GoogleSpreadSheet.new([INSERT YOUR SPREADSHEET KEY]) gs.authenticate('you@gmail.com', 'your password') gs.add_to_cell formula puts gs.evaluate_cell('A1') end

Hear Me Sort

James has a home-grown GTD system which he'll hopefully write about if not release to the world one day. It's totally command-line driven using a combination of bash, perl and ruby to manipulate plain text files and, from what I can tell, it's not only blindingly fast, it seems to work seriously well.

One of the daily tasks that James performs is to prioritise the things he wants to (attempt) to get done and what better time to do that than on the 45 minute train trip to work. As you can well imagine, prioritising 40 items takes a fair amount of time not only due to the sheer number but also, being vision-impaired, reading (and re-reading) takes a bit of concentration. So, we came up with a nifty solution in two parts.

The first part was quite simple: use binary insertion. That way, instead of O(n^2) comparisons, we'd get O(n log n). So far so good. But the that was pretty obvious. The next bit was the real doozy!

Rather than have each pair printed out so that James could read them and make a decision, we instead chose to use some Text-To-Speech (TTS) to literally speak each pair; something along the lines of:

"Is calling your boss to discuss staff pay reviews more important than send wife a bunch of flowers for her birthday?"

In addition, we actually run the TTS in another thread so that we can easily interrupt it, primarily because James: has already chosen a suitable answer; needs to repeat the question; or wants to cancel the whole thing.

About the only problem we encountered was that, on his Powerbook, the say command (built-in to Mac OS X) ran a little too slowly for our liking so we simply aliased it to use the much faster swift command that comes as part of the commercial Cepstral William voice he uses for all his Text-To-Speech.

Now James can listen to the minimal set of questions, making the appropriate decision for each by way of a simple keystroke, and have it all done within 15 minutes all the while free to sit and enjoy the view (such that it is on a suburban train during peak hour) rather than staring at the screen.

IntelliRuby

I've got to admit it's getting better a little better all the time IntelliJ Ruby support keeps getting better. I wish there was a lightweight mode where I could just open one .rb file and have it edit without needing the project setup.

Silly Micro-Benchmarks

So, the recent rash of controversial benchmarks have inspired me to do some of my own microbenchmarking. I want to test web servers, and specifically *slightly* dynamic stuff, basically how long does it take to setup and teardown the request processing stuff without doing anything significant with the request (like talking to a database) as I want to isolate and benchmark the actual web part, not anything else. Luckily, there is a perfet test for this, Hello World! All tests were performed on the same machine, and I won't provide the code or configurations (as then they might be considered useful). I did tune each one to the best of my ability for running in the environment it is in (4 core machine, plenty of ram, etc). So, without further ado, results of a dynamic Hello World on a number of different webapp-ish servers, sorted by performance:

ServerRequests / SecondNotes
Apache HTTPD (Worker) via mod_hello_world Requests per second: 18823.58 [#/sec] (mean) This is basically a minimal httpd module which just prints hello world to the response
Apache HTTPD (Worker) via mod_wombat Requests per second: 17856.76 [#/sec] (mean) This uses a server-scoped mod_wombat handler with a default pool of fifty Lua virtual machines
Apache Tomcat 5.5.20 Requests per second: 17644.40 [#/sec] (mean) This used a JSP and a few hundred thousand requests before the benchmark to let the JVM warm up.
Jetty 6.1.1 Requests per second: 12449.36 [#/sec] (mean) This used a JSP and a few hundred thousand requests before the benchmark to let the JVM warm up.
Mongrel accessed directly Requests per second: 2378.05 [#/sec] (mean) This was done via an HttpHandler rather than Rails in order to cut down on overhead
Mongrel, four instances, proxied through Pen Requests per second: 2109.91 [#/sec] (mean) This configuration was basically a test to make sure that ruby's single-threaded nature wasn't the bottleneck. It seems the additional proxying was much more expensive than not using the extra four cores available.
TwistedWeb Requests per second: 2089.55 [#/sec] (mean) This is the current TwistedWeb, not TwiestedWeb 2.
Mongrel (four instances) proxied through LightTPD n/a This was a very unstable configuration. I was unable to run more than a thousand or so requests before lighttpd would start losing track of mongrel instances and start returning errors. LightTPD returns errors very quickly, however.

Now, to complicate matters further. Mongrel doesn't support HTTP keep alive, so I re-ran all the other benchmarks with keep alive disabled in the client and the numbers dropped about in half in every case, the three-way-handshake seems to matter a lot for micro-benchmarks.

Finally, this is a wholly unscientific set of benchmarks which is basically useless, so please don't read anything important into it. Any real system is almost certainly not bound by request handling in the application server :-)

IntelliJ and Ruby

The Ruby plugin for IntelliJ IDEA has finally been released. IntelliJ is bar none the best Java IDE. Downloaded and installed.