BioRuby MAF blog

Multiple Alignment Format support for BioRuby and bio-alignment

First Milestone

| Comments

It’s the end of my first week of actual development work, and I’m happy with my progress. My first milestone, for batch MAF conversion to FASTA, is complete. I started with a simple line-based parser, and then spent some time developing a faster one; my MAF parser now appears to be quite competitive in performance. I can now convert simple MAF files to FASTA with output matching that from the Galaxy tools. And I’m fairly confident in its correctness, with RSpec unit tests and Cucumber features specifying its behavior relatively completely.

Parser development

My initial approach was to simply process the file line-by-line, calling IO#readline each time. This turned out to be a little more cumbersome than I expected. Also, there was a discussion on the Biopython list of the usefulness of BGZF blocked compression for MAF files, which would not fit well with such a linear line-oriented approach. Inspired by Dmitry Vyukov’s Wide Finder 2 entry, I wrote a new parser which reads the file in 8 MB chunks. It immediately locates the start of the last MAF alignment block in that chunk, and then proceeds to scan through the chunk, parsing an alignment block at a time, until the last block is reached. When it reaches the last block, it joins the block fragments at the end of one chunk and the beginning of the next chunk.

For separating the alignment blocks within each chunk, using a StringScanner and regexes worked well, but for parsing the individual alignment blocks, using String#split to split the block into lines, dispatching on the first character with a case statement and plain string comparisons, and splitting lines into fields turned out to perform substantially better. I suspect the overhead of setting up a regex matching operation is the reason for this; profiling indicated that it was spending considerable time in regex operations, at any rate.

After this optimization, the chunked parser was able to parse a 315 MB file and count its MAF blocks in 10.1 s, compared to 16.0 s for the earlier line-based parser and 22.7 s for the parser from Galaxy’s bx-python library. This seems like very respectable performance.

Also, the upcoming JRuby 1.7 on Java 7 appears to have excellent performance with the chunked parser, perhaps due to the new invokedynamic support. After warming up the JVM, it averaged 16 μs per alignment block parsed, compared to 25 μs for Ruby 1.9.3. See the performance page on the project wiki for details.

Behavior-Driven Development

At the beginning of this week, I was not at all clear on when it would make sense to use Cucumber and when to use RSpec. After writing a Cucumber feature to ensure that sequence data was exposed properly, it didn’t seem worthwhile to duplicate that in RSpec. However, the chunked parser was much harder to get right than the original line-based parser, and RSpec was an invaluable tool for that.

For most of the bugs I found, I developed an RSpec specification to call for the correct behavior. Some of these could go into Cucumber just as well, such as the one stipulating that a certain alignment block should have ten sequences in it. Most, though, are about low-level details such as correct parsing of sequence lines split across chunk boundaries.

In some cases it was challenging to follow the BDD principle of writing only the necessary code to make the next test pass. When I started on the chunked parser, I had a very clear idea of how it needed to work, and next to no idea how to specify that in terms of tests. So I just started writing code and relying on the existing parser tests. However, the set of RSpec tests I ended up with seems to specify the chunked parser’s behavior reasonably well; they specify where the last block in a chunk should be detected, how many blocks should be parsed in a file if chunk fragments are being joined properly, what happens if a sequence line is split across a chunk boundary, and so on.

This gives me a concrete idea of the kind of tests I might want to write to specify something like this next time. We’ll see how well I can apply this next week, when I start working on indexed access.

Tools

I knew that it was possible to set up continuous integration as a hosted service with Travis-CI, but it was a pleasant surprise to find that I could also have documentation generated the same way. It would be nice to have Travis-CI do hosted coverage reporting as well, but setting up simplecov was very easy.

I’ve been doing my development work with Emacs and using rspec-mode, cucumber.el, and Magit. These have all been better than nothing, but also somewhat frustrating; I’ve already sent in one pull request for cucumber.el, and over the next month or two I may explore some more radical ideas to make rspec-mode and cucumber.el more useful. I’ve been gradually learning how to use Magit, although I usually end up doing my more complex Git operations from the command line.

Basic MAF Parsing

| Comments

Today is the official start of coding for GSoC. Following my initial plan, I’m going to begin by implementing the basics of MAF parsing in Ruby. I’ve already got Cucumber features defined for basic MAF parsing and conversion to FASTA, along with reference data as processed by bx-python.

This should be fairly straightforward to implement, but I’m going to wait until I’ve got this basic functionality running before I define anything else in detail. I’m still weighing whether it will make sense to have a ‘raw’ representation of sequences and alignment blocks rather than just using (and presumably subclassing) the bio-alignment representations.

Once basic MAF parsing works, my next area to focus on will be indexed access. Many use cases for MAF involve pulling a few alignments out of very large data files, rather than batch-processing the whole file. I’ll be focusing on the indexed-access API at first, and building a simple interim indexing scheme similar to that used by Biopython, probably using SQLite in a similar way. In developing the API, I’ll study those provided by bx-python and Biopython, the two other MAF implementations providing persistent indexing.

Ultimately, I plan to revisit my actual indexing method, and potentially implement support for bx-python’s interval index files. I’ll also take a careful look at other database alternatives such as Berkeley DB and Tokyo Cabinet.

P.S. For my next blog post, I think I will try using Markdown’s reference-style links, since the raw source for these posts is getting unwieldy with inline links.

GSoC Week 2

| Comments

This was my second week of work on my GSoC project, and the last week of the ‘community bonding’ period before the official start of coding. A major focus of mine was BioRuby’s phyloXML support; it uses libxml, which has been causing unit test failures under JRuby. In the end, the best course of action seemed to separate the phyloXML support as a separate plugin, which I have done as the bio-phyloxml gem. This will remove BioRuby’s dependency on XML libraries entirely and that JRuby issue along with it. At the same time, users of the phyloXML code should be able to continue using it with no substantive changes.

Separately, I began porting this phyloXML code to use Nokogiri instead of libxml-ruby, but ran into difficulties with this effort. While it is possible, and the library APIs are very similar, the code uses relatively low-level XML processing APIs in ways that seem to be sensitive to subtle differences in text node and namespace semantics between the two libraries. Substantial restructuring of the code and the addition of quite a few unit tests might be necessary to carry out such a port with confidence that the resulting code would work well.

Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major causes of BioRuby’s unit test failures with JRuby; once a fix is integrated, we’ll be close to having all the tests passing under JRuby.

I identified another JRuby bug, JRUBY-6666, causing several unit test failures. This one affects BioRuby’s code for running external commands, so it would be likely to be encountered in production use. For this one, I also worked up a patch.

I also spent some time preparing a performance testing environment, for evaluating existing MAF implementations as well as my own. This will be important, since I will be considering the use of an existing C parser. I will also want to ensure that the performance of my code is competitive with the alternatives. Lacking any hardware more powerful than a MacBook Air, I am setting this up with Amazon EC2. To simplify environment setup, I’ll be using Chef. I’ve already set up a Chef repository with configuration logic, and some rudimentary code to streamline launching Ubuntu machines on EC2 and bootstrapping a Chef environment. To save money, I plan to make use of EC2 Spot Instances, which are perfect for instances that only need to run for a few hours for batch tasks.

GSoC Week 1

| Comments

This has been my first half-week of work on my Google Summer of Code project, and it’s off to an exciting start. The first order of business has been to get my development environment together; since I’ve been a microbiology student instead of a programmer for the last year, it’s taken some work. In that process, I’ve ended up making a few open source contributions just to get my tools working the way I want. I’m running GNU Emacs 24 and trying to take more advantage of it than I have in the past. I’ll have much more to say about this in a future post.

I’ve also started working on the BioRuby unit test failures under JRuby, as a way of familiarizing myself with the BioRuby code base as well as the community and its development processes. Right now, JRuby in 1.8 mode is showing 6 failures and 126 errors, which is hardly confidence-inspiring for people considering using JRuby with BioRuby. This is too bad, since JRuby has some definite advantages as a Ruby implementation. After looking into these failures, I’ve broken them down into a few categories:

  • temporary file permissions problems, likely due to some sort of Travis-CI environment issue
  • a bug in JRuby’s implementation of Open3.popen3 which I’m working up a bug report for
  • an odd autoload problem I’ve filed JRUBY-6658 for and sent an accompanying RubySpec patch for
  • a problem with libxml-jruby, which appears unmaintained, for which I’ve submitted a BioRuby patch plus JRUBY-6662
  • and a small test case bug relating to floating point handling, which I’ve submitted a patch for.

Once these are resolved, JRuby should be passing the BioRuby unit tests in 1.8 mode, and closer to passing in 1.9 mode. (There are a few extra failures under 1.9 that I haven’t sorted through yet.)

I’ve also gotten a start on my project itself, creating the bioruby-maf Github repository with a project skeleton and writing my first Cucumber feature for it. This is, in fact, my first Cucumber feature ever. However, I did spend a few cross-country flights reading the RSpec and Cucumber books last week; between that and cribbing from Pjotr’s code I feel like I have some idea what I’m doing. Just assembling that feature has been useful, too, since I’ve had to get several of the existing MAF tools running on my machine. In fact, my test MAF data and the FASTA version of it are courtesy of bx-python, which will be my reference implementation in many respects.

Hello World

| Comments

Hello World! This blog will be tracking the development of bio-maf, a Multiple Alignment Format (MAF) parser for the Ruby bioinformatics library BioRuby, as part of the Google Summer of Code 2012. I’m Clayton Wheeler, a programmer and a biology student at the University of Michigan. This is exciting for me as my first substantial open source project, and as a way to write hopefully-useful bioinformatics software. And thanks, Google, for making this possible.

In my next post I’ll discuss what MAF is, what it’s useful for, and how I’m planning to approach the project. I’ll be making weekly status posts, plus others as inspiration strikes.