The Cybernetic Penguin: January 2012

I've gotten to the stage where my ereader has the basic functionality I wanted - you can read books with it, you can correct mis-spelled words on the fly, and the corrections are persistent. So I put it up on the Android market in case anyone wants to play with it -

https://market.android.com/details?id=org.kde.necessitas.example.calliope

It's glitchy, partly because of some bugs in my code, partly because Qt for Android is in alpha and has its own quirks (for example, it's a known bug that the onscreen keyboard defaults to upper case for some reason, and there seems to be an issue with settings not being saved; this'll no doubt be corrected by the next Qt release). I've also started working on making the UI work differently for Android versus the desktop; the button bar shows up by default above the page for the desktop, and pops up with the menu button on Android.

Still, bugs and all, you can read with it, and I've added some nice-to-have as opposed to essential features as well - the reader uses a filesystem watcher on the directories in which it searches for books, so drop a new one in and it'll show right up on the menu, and it interacts with Windows/X11 session management so you can log off and on again and not lose your place.

The coolest thing I've added, though, is the filter manager. Basically, the reader works with a stream of elements parsed out of the HTML - some images, some pagebreaks, but mostly paragraphs of text. Filters operate at various points in the paragraph's transition from 'list of words with attributes (e.g. bold, italic)' to 'group of words at given x/y coordinates in a bounding box', and also when someone clicks/touches the screen inside a paragraph.

The spelling-correction filter is run before the text layout process; it has a map of corrections of the form 'the third word of the paragraph, which is coler, should actually be colour' and makes the appropriate alterations in the list.
There's also a dictionary-lookup filter which is invoked (if set as the active touch filter) when a word is pressed, after the text has been laid out. At some point I'll likely also add a filter that operates after text layout but before rendering to justify the text (such that it lines up on both left and right margins as opposed to the default ragged right alignment).

I've not done much with the dictionary yet, and the API needs work (it should be asynchronous for a start). That done, it would be easy enough to
for example look a word up on Wikipedia from within the application given Qt's http support. Right now, though, the only dictionary is for Latin (which I'm in the process of learning), which in itself took a bit of work. Latin is a highly inflected language - that is, where we in English add words to change the meaning of a word, it tends to use different endings instead. 'I have' is habeo, 'we have' is habemus, 'we were having' is habebamus, and so forth, so it's not a straightforward thing for a computer to go from a random Latin word to the canonical form in which it appears in a conventional dictionary.

There is a program that does know how to do this, though, using some very clever algorithms and knowledge of Latin grammar; it's known as Whitaker's Words, and it is open source. Unfortunately, its author made the somewhat...unusual choice of Ada as its implementation language; unsurprisingly, an Ada compiler is not part of the Android NDK. The 'nice' way to bring that capability to Android would be to reimplement the program in C++, but that would involve quite a bit of work, and this is more for my own use than anything else, so I took a quick and dirty route.

Available for Whitakers Words is a list of every word understood by the program in all its forms (so it would include habeo, habemus, habebamus etc). I wrote a little program which reads that list, invokes Whitakers Words on each one, and writes the output into a file, writing an index into another file of the form source word, current position in the output file, length of the string from Words. This takes quite a while to run (about as long as ICS takes to build on my machine) and generates about a 250 meg output file and 20 meg index. My dictionary loads the index the first time a word is queried and uses that to seek into the output file and pull out the word's definition (I originally tried simply using Qt's built-in IO facilities to write out the QHash into a binary index file, but that actually ended up producing a bigger file for some reason).

On the off chance this would be useful for someone else in the same situation, the utility is at

https://github.com/jotheberlock/whitakerwords

and the source for the dictionary is whitaker.cpp in Calliope's source.

Incidentally, some of the books I've been working with make me really sympathetic towards the developers of browsers (an ebook reader is, after all, functionally a simplified HTML renderer with some special needs). For the most part Calliope basically displays anything in <p> tags as paragraphs, but one book in particular was a long stream of text, not in any form of block tag, broken only with <br>'s at the end of each paragraph. From what I can find on the web this was all the rage back in about 1992. I put something in to deal with that case, but there's at least one other book out there where most of the text doesn't show up; I'm investigating why.

The Cybernetic Penguin

Sunday, January 29, 2012

Taking another look at LLDB

Wednesday, January 18, 2012

Calliope hits alpha