Muller's World

Mike Muller's Homepage

2015-01-12 Revamping spugmail, part 1

Many years ago, I wrote my own e-mail client called "spugmail." I wrote my own because I wanted something I could run in an ssh session and I had used elm and pine and they just too slow when dealing with an inbox as large as mine. So I wrote spugmail.

Spugmail is fast and light. Written in python with a curses UI, it stores messages in GDBM files, one file per folder. Since GDBM is strictly key/value, there is some trickery involved in preserving the ordering of a folder (which seemed like a good idea at the time). So I store the entire index of each folder as a single object in the database, writing it upon leaving the folder and storing journal entries for times when it goes down prematurely (usually due to a power-outage and a broken battery backup).

The whole thing works pretty well, but in recent years I've become more dissatisfied with it. For whatever reason, my GDBM setup doesn't garbage collect very well, so every once in a while my inbox file grows to 2G (even though there's no where near that much data in it) and I have to move it and start a new inbox. There's also the fact that curses doesn't automatically deal with window size changes, so when I run it in a 'screen' session and pull it into a window of a different size, the layout gets messed up.

After having used g-mail, there are also a lot of g-mail like features that I'd like to add to it, like search and the ability to see an entire thread as one view. These kinds of things don't fit very well with the "one file per folder" design, and (lacking good tests) I've grown rather change-averse because breaking something can leave me without e-mail, sometimes at a very inconvenient time.

So I've decided to finally undertake some major changes to the client. A big part of this is to make the system more modular and transparent, so I can experiment with new tools without fear of breaking the existing flow. The first step is to replace the GDBM backed folders with something a little more flexible and scalable. I've decided that I want to store my messages directly as ordinary files, one per message.

My e-mail database now consists of about 86,000 messages. Storing them in the filesystem presents a bit of a challenge. You can't just dump them all into one directory, filesystems and shell tools don't work very well with extremely large directories. There's also the question of what to name the files.

I decided to identify messages by the SHA-256 hash of their content. Encoded in hex, these translate to 64 character identifiers like these:

    1d58aac0c67b36ba2e3d8f91b19c28a5393fc05b61762ba30b82a1a425ab3878
    861c9ffe1bed32fc9647c29a14f6cdde12455b5dd71b5a044a1f867897814362
    5af0d8e038e09104896801ad1671e182aa51de6a6c11b86b5de9cb4aaedace58

Identifying messages by their hashes makes it very easy to construct the filename: just hash the content. You don't need any global counters or to examine the existing message set. It also makes it very easy to ensure that the same message is stored in exactly one file because the same message will always have the same hash. Having two different messages with the same hash is extremely unlikely.

These are all initially stored in a single directory. When the directory grows too large (beyond a certain constant number of entries, currently 256), the storage layer reorganizes it by creating subdirectories for each unique two character prefix and moving the files with that prefix to the new subdirectory (and truncating the prefix from the new basename). So a set of files like this:

    aabbcc
    aaxxyy
    abcdef
    abqrst

Would become this:

    aa/bbcc
    aa/xxyy
    ab/cdef
    ab/qrst

The git filesystem uses a similar mode of organization for its "objects" tree.

The new message store doesn't replace the existing message storage code: it supplements it. The code that stores a message in the old database now also writes it to the new filesystem. This way I don't have to worry about breaking any existing code.

There's still a lot of work to do. There's no indexes of any kind. Messages are essentially immutable (since they're referenced by their own hashes), so we don't currently have a way to track whether a message has been read or marked for deletion in the new system. But I can now do things like grep every e-mail message I've ever sent or received. And I can make interesting observations like this:

The old databases:

    $ du -k ~/.spugmail/mail
    11475300        /home/mmuller/.spugmail/mail

The new file-based storage:

    $ du -sk ~/.spugmail/messages
    1436752 /home/mmuller/.spugmail/messages

That's 11G to store all of my e-mails in GDBM files, 1.4G to store them in the filesystem. This will eventually be a pretty huge efficiency gain once I'm confident enough to turn off the existing database. Another potential savings will come from compressing the files.

The next step is to store indexes for each of the folders in parallel with the existing ones. This will allow me to store message hashes without messing with the existing indexes.