Muller's World

Mike Muller's Homepage

2015-01-15 Revamping Spugmail, part 2 - Indexes

Now that all of my messages are stored in my filesystem, my next task is getting to where I can look them up quickly.

The existing spugmail client stores its indexes as a single object in each folder database. Messages are stored under monotonically increasing integer keys, and each entry in the index contains a reference to its message object, in addition to some other fields (sender, subject, date...) that are displayed in the folder's list view.

I want the index lists to be stored as flat files so they can be easily manipulated by other tools. I chose the CSV format because it's both simple and ubiquitous. But CSV has some problems. For one thing, there's no real standard for CSV. Different tools do different things, and most allow some degree of configuration of things like separators and escaping rules. Even the "comma separated" part is not a standard.

Furthermore, none of the formats in the Python csv module deal with newlines very well. A newline embedded in a field seems to always be represented literally. This doesn't work very well with line-oriented tools because an embedded newline is indistinguishable from an end-of-row newline unless you fully parse the row.

So I wrote my own CSV module. The rules are fairly simple: fields containing a comma, a newline, a double quote, or leading or trailing spaces are quoted. In quoted fields, newlines, double quotes and backslashes are escaped with a backslash. This ensures one row per line while mostly retaining compatibility with standard CSV. In case I ever want to import it into a spreadsheet or something, conversion to an Excel-style CSV is pretty simple:

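    import csv
    import sys

    from spug.mail import csv as smcsv

    # Re-emit each one-line row through the standard csv writer (Excel dialect by default).
    out = csv.writer(sys.stdout)
    for line in sys.stdin:
        # Strip only the row terminator; line[:-1] would drop a real character
        # if the last line had no trailing newline.
        out.writerow(smcsv.stringToRow(line.rstrip('\n')))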
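
For the writing direction, the quoting rules described above could be sketched roughly as follows. This is not the actual spug.mail code: the names `quote_field` and `row_to_string` are made up for illustration, and two details are assumptions, namely that an escaped newline is written as a backslash followed by `n`, and that backslashes are only escaped inside quoted fields.

```python
def quote_field(field: str) -> str:
    """Encode one field (hypothetical helper, not the spug.mail API)."""
    needs_quotes = (
        any(c in field for c in ',\n"')   # comma, newline or double quote
        or field != field.strip(' ')      # leading or trailing spaces
    )
    if not needs_quotes:
        # Assumption: unquoted fields are written verbatim, backslashes included.
        return field
    # Escape backslashes first so the escapes added afterwards are not doubled.
    escaped = (
        field.replace('\\', '\\\\')
        .replace('"', '\\"')
        .replace('\n', '\\n')   # assumption: newline becomes backslash + 'n'
    )
    return '"' + escaped + '"'


def row_to_string(fields) -> str:
    """Join encoded fields into a single line with no embedded newlines."""
    return ','.join(quote_field(f) for f in fields)


if __name__ == '__main__':
    print(row_to_string(['Alice', 'Hello, world', 'line1\nline2', ' padded']))
```

Whatever the exact escape sequences are, the property that matters is that the encoded row never contains a literal newline, so every line in the file is exactly one row.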

Another advantage to one-row-per-line is that it makes it a lot easier to do a binary search on the file, but I'm getting ahead of myself.

In order for the existing client to be able to write the new index files, it needs to keep track of the message hashes. I really didn't want to mess with the existing persistence format of the system, but the prospect of maintaining an external mapping from the folder index to the message hash worried me. So I set about adding message hashes to my existing indexes.

This didn't work. Or rather, it worked until I ran into a folder that was already so big that I couldn't write its index.

I ended up adding two new index files per folder. One contains a representation of the data in the old index (sender, subject, etc. plus message hash) and the other is a mapping from the old integer message key to the message hash. The second will eventually go away once I can confidently get rid of the old GDBM code.

The one place where I did have to add the hashes to the old database was the journal entries.

Spugmail 1 only writes the complete index when terminating or switching folders. If something takes the client down before that happens, we lose changes since the last write. To avoid this, I added journal entries to the GDBM databases: whenever a change is made to the index, the client records a lightweight journal entry. These entries are applied to the index when the folder is reloaded, and then cleared when the index is written.

Because the new index writes happen at the same time as the old index stores, we would lose the hashes if we did a premature shutdown. So the journal entries now have a 'hash' field storing the message hash along with all of the other transient data.
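
The journal scheme can be illustrated with a small sketch. It uses a plain dict in place of a GDBM database, and every name in it (`record_change`, `load_index`, `write_index`, the `'index'` and `'journal'` keys) is invented for the example rather than taken from spugmail. Only the idea of a `'hash'` field riding along with the other journal data comes from the description above.

```python
def record_change(store, key, **fields):
    """Append a lightweight journal entry instead of rewriting the whole index."""
    store['journal'].append({'key': key, **fields})


def load_index(store):
    """Rebuild the in-memory index: the stored index plus any pending journal entries."""
    index = {k: dict(v) for k, v in store['index'].items()}
    for entry in store['journal']:
        fields = {k: v for k, v in entry.items() if k != 'key'}
        index.setdefault(entry['key'], {}).update(fields)
    return index


def write_index(store, index):
    """Write the complete index, after which the journal is no longer needed."""
    store['index'] = index
    store['journal'] = []


if __name__ == '__main__':
    # Stand-in for one folder database.
    store = {'index': {1: {'subject': 'hi', 'hash': 'aaa'}}, 'journal': []}
    record_change(store, 2, subject='new message', hash='bbb')
    # If the client died here, the stored index would still lack message 2,
    # but reloading replays the journal and recovers it, hash included.
    index = load_index(store)
    print(index[2]['hash'])
    write_index(store, index)
```

Because the hash travels with the journal entry, a reload after a premature shutdown has everything it needs to regenerate the new index files as well as the old index.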

Eventually we'll store the journal entries in CSV files, too. But I think my next step will be loading messages from the message store instead of GDBM.