BioRuby MAF blog

Multiple Alignment Format support for BioRuby and bio-alignment

Indexed MAF Access


My second week of working on this MAF library was more of a struggle than the first. I was implementing indexing of MAF files and using those indexes to enable searching by genomic regions. I changed direction partway through, but I eventually implemented the features I had planned, and ended up with a better implementation than I expected.

MAF files can be quite large, often hundreds of gigabytes. Frequently, too, a user will want to extract a relatively small subset of that data. Reading in and parsing an entire 200 GB file to find a few alignments of interest is not a very useful approach; instead, we want to build indexes to enable quick location of the alignments to parse.

This was, in fact, my milestone for this week: to be able to build an index for a MAF file, identify alignment blocks matching a region of interest, and then selectively parse those blocks from the MAF file. The two MAF implementations I’m aware of that use a similar indexing scheme are Biopython and bx-python; Biopython uses the SQLite embedded relational database to build its indexes, and bx-python uses a custom index file format. (I believe the UCSC genome browser uses MySQL for indexing in general, but I’m not sure whether this is applied to MAF files directly.)

SQLite vs. Kyoto Cabinet

The plan I started with was to use SQLite for indexing in a similar way to Biopython, in order to bring up functional code with indexing support as quickly as possible. I reasoned that being able to use SQL for declarative queries would make this simpler than, say, using a key-value store such as Kyoto Cabinet or Berkeley DB, let alone implementing a sui generis indexing system such as the one bx-python uses.

My first task, though, was to understand the UCSC bin indexing system, which all the MAF indexing implementations I’ve seen use. It was not particularly clear to me how it worked, even after looking at several implementations, until I finally read the paper describing it and illustrating its hierarchical nature. I still don’t understand why Biopython always searches bin 1, which puzzled me for a while; I need to ask the author.
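To make the hierarchical structure concrete, here is a minimal Ruby sketch of the standard binning scheme as described in the UCSC genome browser paper. The constants are the standard ones, but the function names are mine and not necessarily what bio-maf ends up using.

```ruby
# Minimal sketch of the standard UCSC bin scheme. Bins form a hierarchy:
# 1 bin of 512 Mb, 8 of 64 Mb, 64 of 8 Mb, 512 of 1 Mb, 4096 of 128 kb.
BIN_OFFSETS     = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # smallest bins cover 2**17 = 128 kb
BIN_NEXT_SHIFT  = 3   # each level is 8x larger than the one below it

# Smallest bin that entirely contains [start, stop) (0-based, half-open).
def bin_for_range(start, stop)
  start_bin = start >> BIN_FIRST_SHIFT
  end_bin   = (stop - 1) >> BIN_FIRST_SHIFT
  BIN_OFFSETS.each do |offset|
    return offset + start_bin if start_bin == end_bin
    start_bin >>= BIN_NEXT_SHIFT
    end_bin   >>= BIN_NEXT_SHIFT
  end
  raise "region #{start}-#{stop} is too large for the binning scheme"
end

# All bins whose contents could overlap [start, stop); a region query
# has to examine each of these.
def bins_overlapping(start, stop)
  start_bin = start >> BIN_FIRST_SHIFT
  end_bin   = (stop - 1) >> BIN_FIRST_SHIFT
  bins = []
  BIN_OFFSETS.each do |offset|
    (start_bin..end_bin).each { |b| bins << offset + b }
    start_bin >>= BIN_NEXT_SHIFT
    end_bin   >>= BIN_NEXT_SHIFT
  end
  bins
end
```

The key property is that a query for one region only ever has to consider a small, bounded set of bins, no matter how long the chromosome is.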

I ran into a series of problems with using SQLite, unfortunately. First, the standard sqlite3 gem uses the C extension API rather than FFI, and so does not work with JRuby. To use the same code with the native and JDBC drivers, I needed to use the unmaintained DBI/DBD library. (I subsequently learned that Sequel might have worked as well, but after I’d already decided to try another tack.) The DBI interface also had the unfortunate feature of returning all numeric values as strings, which was not what I would hope for in performance-critical code.

Also, I discovered that SQLite does not have clustered indexes, meaning that I needed to create a table just to hold the index data, and then a separate index within SQLite for that data. This duplicates the index data, which is also not ideal.

Most of all, though, SQL simply turned out to be too cumbersome. Explicitly creating and indexing tables and checking for their existence started consuming more of my effort than I would have liked. Biopython’s implementation ended up running a SQL query for each possible bin for each range of interest; I knew that it should be possible to execute a search like this while only having to make a single pass through the index data for each bin. It wasn’t clear that there was a good (and simple) way to do this in SQL, though.

In frustration, I opened up the Kyoto Cabinet documentation, studied the available features for prefix and range searches, and wrote out the algorithm I had in mind. My biggest decision point was how exactly to store the index data. In the past, I’ve always favored human-readable data representations; I’ve generally dealt with small amounts of data in situations where being able to see the data immediately with View Source or a packet trace is invaluable. Here, though, after adding things up I realized that with a binary representation I could get down to a mere 24 bytes per index record. This may actually be the first time I’ve written something to output a binary data representation, but I think it was well worth it.
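For what it’s worth, fixed-width binary records like this are easy to produce in Ruby with Array#pack and String#unpack. The layout below is purely hypothetical (it happens to total 24 bytes, but the fields and widths are not necessarily the ones the library uses); the point is just the mechanism, and that packing keys big-endian makes them sort correctly as plain byte strings.

```ruby
# Hypothetical 24-byte index record, for illustration only.
# Key:   1-byte sequence id, 2-byte bin, 4-byte start, 4-byte end (11 bytes)
# Value: 8-byte file offset, 4-byte block length, 1-byte species count (13 bytes)
# "Q>" (big-endian 64-bit) requires Ruby 1.9.3 or later.
KEY_FMT   = "CnNN"
VALUE_FMT = "Q>NC"

def encode_entry(seq_id, bin, start, stop, offset, length, n_species)
  key   = [seq_id, bin, start, stop].pack(KEY_FMT)   # big-endian, so keys sort numerically
  value = [offset, length, n_species].pack(VALUE_FMT)
  [key, value]
end

def decode_value(value)
  offset, length, n_species = value.unpack(VALUE_FMT)
  { offset: offset, length: length, n_species: n_species }
end
```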

Once I sat down to actually implement the Kyoto Cabinet version, I was amazed how much simpler it was than the SQLite version. I could step through the data just the way I wanted to, and do whatever computation seemed appropriate on the way. I pulled together a running version very quickly, and simply applied the same RSpec tests to both.
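To give a flavor of the kind of single-pass scan this makes possible, here is a rough sketch using the Kyoto Cabinet cursor API against a tree database (a .kct file), with keys packed big-endian as above so that lexicographic order matches numeric order. The method and key layout are illustrative, not the library’s real query code.

```ruby
require 'kyotocabinet'

# Sketch: collect all index entries for one (seq_id, bin) pair in a
# single cursor pass over a KyotoCabinet::DB opened on a tree database.
def entries_in_bin(db, seq_id, bin, start, stop)
  hits = []
  prefix = [seq_id, bin].pack("Cn")
  db.cursor_process do |cur|
    cur.jump(prefix)                 # position at the first key >= prefix
    while (pair = cur.get(true))     # fetch [key, value], then step forward
      key, value = pair
      break unless key.start_with?(prefix)
      _seq, _bin, b_start, b_stop = key.unpack("CnNN")
      hits << value if b_start < stop && b_stop > start
    end
  end
  hits
end
```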

For both implementations of this indexing code, I found RSpec invaluable. I made a fairly rigorous effort to implement this in BDD/TDD style, writing tests for each bit of behavior I could think of to specify. This gave me a great deal of confidence in optimizing and tweaking the index search code.
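Concretely, “the same RSpec tests” here mostly means shared examples, roughly along these lines; the class names, the query method, and the fixture file are purely illustrative:

```ruby
# Hypothetical sketch: one shared spec run against both index back ends.
shared_examples_for "a MAF index" do
  it "finds blocks overlapping a region of interest" do
    blocks = index.find("mm8.chr7", 80_082_000, 80_083_000)
    blocks.should_not be_empty
  end
end

describe "the SQLite-backed index" do
  let(:index) { SqliteIndex.build("spec/data/tiny.maf") }
  it_behaves_like "a MAF index"
end

describe "the Kyoto Cabinet index" do
  let(:index) { KyotoIndex.build("spec/data/tiny.maf") }
  it_behaves_like "a MAF index"
end
```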

Kyoto Cabinet results

I’m now very happy with the Kyoto Cabinet code; I also expect it to be easily adapted to other key-value stores such as LevelDB or Berkeley DB.

Getting Kyoto Cabinet working on JRuby has posed a bit of a headache too, unfortunately. Its Ruby library also uses the C extension API, so I plan to use the Kyoto Cabinet Java library instead, with a thin adaptation layer to present a compatible interface to the Ruby library. This approach has been promising, except that a few of the methods in the Java library seem to encounter string encoding problems with binary data, returning garbage strings (probably containing Unicode ‘REPLACEMENT CHARACTER’, I need to check) rather than the expected data. Many of the Java methods have separate String and byte[] versions, but not all. Ironically, the trickiest problems of this kind are caused by DB#match_prefix calls in my unit tests, rather than in the actual indexing code.
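The adaptation layer itself is thin. The sketch below shows the general shape, assuming kyotocabinet.jar is on the classpath and routing everything through the byte[] overloads to dodge the encoding problem; the class name and the subset of methods shown are illustrative, not the library’s actual interface.

```ruby
# Sketch of a JRuby adapter over the Kyoto Cabinet Java binding.
require 'java'
java_import 'kyotocabinet.DB'

class JavaKyotoAdapter
  def initialize(path, mode = DB::OWRITER | DB::OCREATE)
    @db = DB.new
    @db.open(path, mode) or raise "could not open #{path}"
  end

  # Always use the byte[] overloads so binary keys and values
  # round-trip unmangled instead of being decoded as Java Strings.
  def set(key, value)
    @db.set(key.to_java_bytes, value.to_java_bytes)
  end

  def get(key)
    bytes = @db.get(key.to_java_bytes)
    bytes && String.from_java_bytes(bytes)
  end

  def close
    @db.close
  end
end
```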

Another oddity has been that there don’t appear to be any Ubuntu packages in usable form for Kyoto Cabinet. This is strange, since it’s hardly the most obscure software I’ve wanted packages for. I don’t need these for development, but Travis CI doesn’t provide Kyoto Cabinet, so I have to install it myself there to get CI working again. Since I’m not about to teach myself Debian packaging just for this, I’ve written a horrible shell script instead for the time being. (What would really be nice is an easy way to run user-provided Chef recipes on Travis CI, but they don’t appear to provide one.)

Random-access parsing

The trickiest part of the whole effort may have been figuring out how to revise the parsing code to work in a truly random-access fashion, and in a way that will be compatible with BGZF compression later on. What I ended up doing was extracting a separate ChunkReader class, responsible for positioning within the file and reading appropriately sized, properly aligned chunks on request. Together with other code that merges requests for adjoining or nearby alignment blocks, this keeps the basic parsing logic intact while avoiding both an excessive number of reads and excessively large ones.
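Roughly, the idea is along the lines of the sketch below (a simplified stand-in, not the real ChunkReader interface): every read starts on a fixed chunk boundary, so the same access pattern can later be served from a compressed file.

```ruby
# Simplified sketch of chunk-aligned reading; names are illustrative.
class SimpleChunkReader
  attr_reader :chunk_size

  def initialize(io, chunk_size = 1 << 20)  # e.g. 1 MB chunks
    @io = io
    @chunk_size = chunk_size
  end

  # Read the fixed-size chunk containing the given file offset.
  # Reads always begin on a chunk boundary.
  def chunk_containing(offset)
    chunk_start = offset - (offset % chunk_size)
    @io.seek(chunk_start)
    @io.read(chunk_size)
  end
end

# Usage: the index supplies a block's starting offset; the parser reads
# the enclosing chunk and scans forward from that offset within it.
# reader = SimpleChunkReader.new(File.open("chr22.maf"))
# chunk  = reader.chunk_containing(block_offset)
```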

This also means that the indexing code simply provides the starting offset of each block, and the parser is responsible for reading ahead as appropriate. This should be perfectly compatible with a BGZF layer, even if an alignment block spans the boundary between separate BGZF blocks. (Alignment blocks, chunks, and now BGZF blocks… I may be heading for a terminology problem here.)

Next steps

This week, I’m focusing on providing additional logic for filtering and querying for sequences and alignment blocks. I’m also working on a few ideas to optimize index creation, and I hope to get my index code running on JRuby soon.
