Basic MAF parsing

Today is the official start of coding for GSoC. Following my initial plan, I’m going to begin by implementing the basics of MAF parsing in Ruby. I’ve already got Cucumber features defined for basic MAF parsing and conversion to FASTA, along with reference data as processed by bx-python.
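To make "basic MAF parsing and conversion to FASTA" concrete, here is a minimal sketch of the kind of thing I have in mind. Everything in it is an assumption for illustration: `MafSequence`, `MafBlock`, `parse_maf`, and `block_to_fasta` are hypothetical names, not the eventual API. It handles only `a` and `s` lines and assumes well-formed input, and the FASTA headers are not guaranteed to match what bx-python emits.

```ruby
require 'stringio'

# Hypothetical value types for this sketch only.
MafSequence = Struct.new(:source, :start, :size, :strand, :src_size, :text)
MafBlock    = Struct.new(:score, :sequences)

# Parse MAF text from an IO-like object into an array of MafBlock.
# Only 'a' (block start) and 's' (sequence) lines are interpreted;
# header, comment, and other line types are skipped.
def parse_maf(io)
  blocks = []
  current = nil
  io.each_line do |line|
    line = line.chomp
    case line
    when /\Aa(\s|\z)/
      # Block attributes are whitespace-separated key=value pairs.
      attrs = line.split[1..-1].select { |kv| kv.include?('=') }
                  .map { |kv| kv.split('=', 2) }.to_h
      current = MafBlock.new(attrs['score'] && attrs['score'].to_f, [])
      blocks << current
    when /\As\s/
      next unless current
      _, src, start, size, strand, src_size, text = line.split
      current.sequences << MafSequence.new(src, start.to_i, size.to_i,
                                           strand, src_size.to_i, text)
    when ''
      # A blank line ends the current block.
      current = nil
    end
  end
  blocks
end

# Render one block as FASTA, keeping gap characters as they appear.
def block_to_fasta(block)
  block.sequences.map { |s| ">#{s.source}\n#{s.text}\n" }.join
end

# Made-up sample data, not taken from the reference set.
SAMPLE_MAF = <<~MAF
  ##maf version=1
  a score=10.0
  s hg18.chr7 100 8 + 1000 ACGT-ACGT
  s mm8.chr6 200 7 + 2000 ACG--ACGT

  a score=5.0
  s hg18.chr7 300 4 + 1000 GGCC
  s mm8.chr6 400 4 + 2000 GGCA
MAF

blocks = parse_maf(StringIO.new(SAMPLE_MAF))
puts block_to_fasta(blocks.first)
```

Plain Structs like these are roughly what a "raw" representation would look like; the alternative would be to build bio-alignment objects directly in the parser.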

This should be fairly straightforward to implement, but I’m going to wait until I’ve got this basic functionality running before I define anything else in detail. I’m still weighing whether it makes sense to have my own ‘raw’ representation of sequences and alignment blocks, rather than just using (and presumably subclassing) the bio-alignment representations.

Once basic MAF parsing works, my next focus will be indexed access. Many use cases for MAF involve pulling a few alignments out of very large data files, rather than batch-processing the whole file. I’ll start with the indexed-access API, backed by a simple interim indexing scheme similar to Biopython’s, probably also built on SQLite. In developing the API, I’ll study those provided by bx-python and Biopython, the two other MAF implementations that provide persistent indexing.
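The idea behind indexed access can be sketched in a few lines: record the byte offset of each alignment block along with the interval it covers on the reference sequence, then seek straight to matching blocks at query time. The sketch below is only an assumption about how this might look. It keeps the index in an in-memory array as a stand-in for a SQLite table, treats the first `s` line of each block as the reference, assumes the reference is on the `+` strand, and uses hypothetical names (`IndexEntry`, `build_index`, `find_offsets`, `read_block_at`) throughout.

```ruby
require 'stringio'

# One index row: reference interval [start, stop) and the byte offset
# of the block's 'a' line. Hypothetical layout for illustration.
IndexEntry = Struct.new(:source, :start, :stop, :offset)

# Scan the file once, recording one entry per alignment block.
def build_index(io)
  entries = []
  block_offset = nil
  until io.eof?
    pos = io.pos
    line = io.gets
    if line =~ /\Aa(\s|\z)/
      block_offset = pos
    elsif line.start_with?('s ') && block_offset
      _, src, start, size = line.split
      entries << IndexEntry.new(src, start.to_i, start.to_i + size.to_i,
                                block_offset)
      block_offset = nil # index only the first (reference) row
    end
  end
  entries
end

# Offsets of blocks whose reference interval overlaps [q_start, q_end).
def find_offsets(entries, source, q_start, q_end)
  entries.select { |e| e.source == source && e.start < q_end && e.stop > q_start }
         .map(&:offset)
end

# Read the raw text of one block, up to the next blank line or EOF.
def read_block_at(io, offset)
  io.seek(offset)
  lines = []
  while (line = io.gets) && !line.strip.empty?
    lines << line
  end
  lines.join
end

# Made-up sample data.
INDEX_SAMPLE_MAF = <<~MAF
  ##maf version=1
  a score=10.0
  s hg18.chr7 100 8 + 1000 ACGT-ACGT
  s mm8.chr6 200 7 + 2000 ACG--ACGT

  a score=5.0
  s hg18.chr7 300 4 + 1000 GGCC
  s mm8.chr6 400 4 + 2000 GGCA
MAF

io = StringIO.new(INDEX_SAMPLE_MAF)
index = build_index(io)
find_offsets(index, 'hg18.chr7', 302, 303).each do |off|
  puts read_block_at(io, off)
end
```

A linear scan over the entries is fine for a sketch, but a real index needs a persistent store with an efficient interval lookup, which is where SQLite (or one of the alternatives) comes in.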

Ultimately, I plan to revisit my actual indexing method, and potentially implement support for bx-python’s interval index files. I’ll also take a careful look at other database alternatives such as Berkeley DB and Tokyo Cabinet.

P.S. For my next blog post, I think I will try using Markdown’s reference-style links, since the raw source for these posts is getting unwieldy with inline links.

BioRuby MAF blog

Multiple Alignment Format support for BioRuby and bio-alignment