Writing an Arabic Dictionary Application

There are no good electronic Arabic dictionaries for language learners, so I thought I'd roll my own.

Clojure version

The main problem with Arabic dictionaries is that most of them do not include the vowels. I was lucky enough to find a database (the Buckwalter wordlist), and I programmed the first version of the dictionary in Clojure, using a Java/Swing interface.

Searching in the dictionary had some non-trivial aspects: the search string may contain no vowels, or just some of them, and we still have to get the best-fitting word. Also, alifs can be written in several ways, and even the Arabs do not care for the correct spelling in everyday use. These could be handled by dynamically generated regexps, but it seemed to be easier to just write a function (ar=) that tests for equality between two strings.

Searching in the other direction was just a bonus, so I didn't waste much time one it, a simple regexp search sufficed.

As for data structure, there was nothing fancy, just a list of tuples, but still it was good enough. Note, that at this stage, the dictionary only searched when the user hit RET.

Java version

Next I wanted to move the dictionary to my Android phone, so as a first step, I've rewritten the dictionary in Java.

While I was at it, I've added several things. First, dictionary entries were searched on-line, i.e., as the user typed. This was more problematic than I originally thought. Searches took forever, so I've used a simple indexing -- just storing the index of each starting letter, as if I was opening a real-life dictionary. This was adequate most of the time, but I had to introduce a "limit", that the search routine is run only for a string of at least two characters (not much of a limitation in a practical sense, but still).

English-Arabic search was made somewhat more user friendly, and I have also resorted the database in a more meaningful way (by the way, the sorting / indexing / etc. were all written in Common Lisp).

Then I downloaded the Android SDK, played around, and ... didn't like it, so I abandoned the project for a while.

JavaScript version

Then I've read about jQuery Mobile, and thought that it would be sufficient for a simple dictionary interface, so I've written a JavaScript version from scratch.

It was obvious, that I need a better data structure, if I want the search to be fast in JavaScript, and a trie was the obvious choice. The idea was to generate the trie (from Common Lisp code, once again) as a JavaScript object, and just load it from file.

All the language-specific information was generated into the tree, and because of that, most words appeared at multiple nodes (e.g. once with full vocalization, once with only part of the vowels). Even using indices instead of the actual data, it is no wonder that the trie file got big, around 12MB.

The file format was like this:

var trie={idx:[],"a":{idx:[1,2],"b":{idx:[3]}},"c":{idx:[]},"d":{idx:[4,5,6]}};

This represents the following trie:

         [ ]
       /  |  \
      a   c   d
     /    |    \
  [1,2]  [ ] [4,5,6]
    |
    b
    |
   [3]

Using a trie, still, had its advantages. Search was instantaneous, even for querying the whole subtree (I've set up a limit for search results, but this has more to do with the speed of displaying them).

The main problem was the loading time and memory usage. On my desktop PC, Google Chrome took as much as 15 seconds to load the (local!) page, and its task manager reported 400MB memory usage for that tab.

Looks like creating a large number of objects is not cheap in JavaScript. After a good night's sleep, I've encoded the trie as a single array. For example, the data above became:

var trie=[1,"a",11,3,1,2,"b",5,2,3,0,0,"c",4,1,0,"d",7,4,4,5,6,0,0];

Maybe this one needs a bit of explaining. It starts with the relative offset of the first child, followed by the indices contained in the node. Every child starts with name of the child (or 0 if there are no more children), the relative offset of the next child, and then the child node.

This file was actually even smaller than the previous version, and loading time was reduced to 2-2.5 seconds, with a memory usage of 60MB. Not ideal, but much better. I could get even better values, if I didn't put the logic in the trie, but I suspect that would slow down the search substantially.

... or so I thought. But it turned out, that the search is still fast, even with all the extra subtree crawling for missing vowels and diacritics. What we gained, on the other hand, is no load time delay, and even the memory footprint was cut in half.

Android version

Phonegap did the rest, though there was quite a bit of bookkeeping, editing XML configuration files, adding an icon, signing the app (and creating a release key in the process)... but all in all, it worked out of the box. You can feel a slight search delay, but not much, and the app loads in 2-3 seconds on my Galaxy S3 Mini.