faster_lsi: An Attempt To Accelerate Related Posts in Jekyll
Saturday, 8 December 2012; 9 pm
One thing I’d like to mention at this point is that I use the latent
semantic indexer (LSI) in Jekyll to build the related posts lists.
The Ruby module Classifier::LSI, part of the classifier gem, uses
the GNU Scientific Library for acceleration, but it’s still
agonisingly slow.
Currently, Classifier::LSI rebuilds the index every time an entry is
added. This carries a massive performance overhead: once it gets
to around fifty posts, the time to add a new entry begins to
increase noticeably, and eventually the whole run gets up into the
twenty minute range. Thankfully, Classifier::LSI has a little-known
knob that disables automatic index rebuilds, and by explicitly
rebuilding the LSI index once at the end of the LSI repopulation, it
kicks things along nicely.
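To make the idea concrete, here is a minimal sketch of deferring the rebuild. ToyIndex is a hypothetical stand-in so the example runs without the classifier gem; it only counts rebuilds rather than doing any real indexing. As far as I understand the gem, the real calls have the same shape: Classifier::LSI.new(:auto_rebuild => false), then add_item for each post, then a single build_index at the end.

```ruby
# ToyIndex is a hypothetical stand-in for Classifier::LSI. It mimics the
# auto-rebuild behaviour and counts rebuilds; it does no real indexing.
class ToyIndex
  attr_reader :rebuilds

  def initialize(options = {})
    # Rebuild after every add unless the caller explicitly opts out.
    @auto_rebuild = options[:auto_rebuild] != false
    @items = []
    @rebuilds = 0
  end

  def add_item(item)
    @items << item
    build_index if @auto_rebuild
  end

  def build_index
    # In the real library this is the expensive step.
    @rebuilds += 1
  end
end

posts = (1..76).map { |n| "post #{n}" }

# Default behaviour: one rebuild per added post.
slow = ToyIndex.new
posts.each { |post| slow.add_item(post) }

# Deferred behaviour: add everything, then rebuild once.
fast = ToyIndex.new(:auto_rebuild => false)
posts.each { |post| fast.add_item(post) }
fast.build_index

puts "rebuilds with auto rebuild: #{slow.rebuilds}"
puts "rebuilds when deferred:     #{fast.rebuilds}"
```

Each rebuild works over every post added so far, so rebuilding after every add makes the total cost grow much faster than the post count, which fits the timings further down.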
As a side note, here, I use pandoc-ruby to provide a more featureful Markdown transformer, so be mindful that it imposes I/O performance overheads.
With just the 76 posts I’d written this year (abysmal, I know), I come up with the following figures:
Without faster_lsi:
jekyll --lsi 16.91s user 0.88s system 97% cpu 18.302 total
With faster_lsi:
jekyll --lsi 2.72s user 0.77s system 88% cpu 3.940 total
With 109 posts, we begin to see even better improvements:
Without faster_lsi:
jekyll --lsi 51.00s user 1.47s system 98% cpu 53.060 total
With faster_lsi:
jekyll --lsi 5.04s user 1.12s system 91% cpu 6.735 total
At this point, we begin to see the I/O overheads taking longer than the LSI itself when faster_lsi is active. I call that fairly conclusive. But wait, there’s more. I have 273 posts lying around… I wonder what happens if I feed them all in. With faster_lsi, it was nice and clippy. Without it, I simply gave up and went to refill my cup of tea. And it was still going:
Without faster_lsi:
jekyll --lsi 1277.86s user 10.90s system 99% cpu 21:30.29 total
With faster_lsi:
jekyll --lsi 34.62s user 4.43s system 96% cpu 40.430 total
That is, in anyone’s books, a major improvement. Note, however, that
I don’t know just how well this will perform with jekyll --auto
because I don’t know how it does the LSI rebuilds. I think (but
please, don’t quote me on this) that the LSI is rebuilt every time
Jekyll picks up a file change.
So, all up, the performance improvement is massive, and it grows with the number of posts you have. At the last data point (21:30.29 against 40.43 seconds of wall-clock time), the run is roughly 32 times faster, or just on 3200% of the original speed.
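The speed-up at each data point can be checked from the wall-clock totals reported above. This is just arithmetic on those figures; the TIMINGS name is mine.

```ruby
# Wall-clock totals in seconds, copied from the timings above.
# 21:30.29 is written out as 21 * 60 + 30.29 seconds.
TIMINGS = {
  76  => { without: 18.302, with: 3.940 },
  109 => { without: 53.060, with: 6.735 },
  273 => { without: 21 * 60 + 30.29, with: 40.430 }
}

# Speed-up factor = time without faster_lsi / time with faster_lsi.
speedups = TIMINGS.transform_values do |t|
  (t[:without] / t[:with]).round(1)
end

speedups.each do |posts, factor|
  puts "#{posts} posts: #{factor}x faster"
end
```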
Admittedly, a better solution would be to cache the LSI index and/or content data between runs somehow. I’ll leave investigating that until faster_lsi takes over ten minutes to run (which may be some time away).
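For what it’s worth, one way such a cache might look is sketched below. This is not something faster_lsi does. The cached_build helper is hypothetical, the "index" is a plain word-count hash standing in for the real thing, and I haven’t checked whether a Classifier::LSI object can actually be serialised with Marshal (with the GSL bindings in play, it may well not be).

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: reuse a stored result when the input texts are
# unchanged, otherwise run the block and store what it returns.
def cached_build(cache_path, texts)
  key = Digest::SHA256.hexdigest(texts.join("\0"))
  if File.exist?(cache_path)
    stored_key, value = Marshal.load(File.binread(cache_path))
    return value if stored_key == key
  end
  value = yield
  File.binwrite(cache_path, Marshal.dump([key, value]))
  value
end

builds = 0
# Stand-in for the expensive index build: count the words in each post.
build = lambda do |texts|
  builds += 1
  texts.map { |text| text.split.size }
end

Dir.mktmpdir do |dir|
  cache = File.join(dir, "lsi.cache")
  posts = ["first post", "second post"]

  cached_build(cache, posts) { build.call(posts) }   # cache miss, builds
  cached_build(cache, posts) { build.call(posts) }   # cache hit, no build

  more = posts + ["a third post"]
  cached_build(cache, more) { build.call(more) }     # content changed, builds
end

puts "expensive builds performed: #{builds}"
```

Keying the cache on a digest of all the post content is the bluntest option: any edit to any post throws the whole index away. That would still help with repeated runs over unchanged content.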
About this post
- Date & Time
- 8 December 2012, 21:32:11
- Words
- 439
- Tags
- jekyll, markdown, ruby, and lsi