BioRuby MAF blog

Multiple Alignment Format support for BioRuby and bio-alignment

Setting Up a Mac for ssh+Kerberos at the University of Michigan

Note: this is probably irrelevant if you’re not at the University of Michigan. However, if you’re a Mac user there and tired of having to type in your password to ssh into the shell account servers, read on.

This is what I had to do to get passwordless ssh access to the U of M servers working from my machine, running Mac OS X 10.8.2. The university doesn’t allow public-key-based ssh access, so I had to set up Kerberos.

  1. Get a copy of the krb5.conf file. You can find it at /etc/krb5.conf on any of the Linux shell servers; copy it to your machine with scp/sftp/etc.

  2. Install it under /Library/Preferences/. You’ll need root privileges to write to this directory, so:

     $ sudo cp /path/to/krb5.conf /Library/Preferences/
  3. Open up /Applications/Utilities/Keychain Access.

  4. Go to its application menu and select Ticket Viewer.

  5. Acquire a Kerberos ticket by choosing Ticket > Add Identity, giving <uniqname>@UMICH.EDU as your identity and your usual U of M password as your password.

  6. Set up ssh to use Kerberos authentication to U of M hosts, by adding something like the following to your ~/.ssh/config:

Host *
     User <your uniqname goes here>
     Compression yes
     GSSAPIAuthentication yes
     GSSAPIDelegateCredentials yes
     GSSAPITrustDNS yes

You should now be able to ssh to a university machine without being prompted for a password.

Bio-MAF 1.0.1

Bio-MAF 1.0.1 is released! As described in my last post, this includes support for BGZF-compressed MAF files and a command-line maf_extract(1) tool exposing almost all of the library’s random access functionality in an orthogonal way from the command line. It also includes a substantial overhaul of the maf_tile(1) tool, adding a --concat option for concatenation of several tiled intervals. This supports one of the original motivating use cases for this library, enabling alignments corresponding to a set of exons to be concatenated and any gaps filled from a reference sequence. The resulting output corresponds to their mRNA product.

Several other new maf_tile options support integration into Galaxy.

Since my last post, I have also made substantial performance improvements in BGZF compression and exercised the parser and indexing functionality on the entire set of UCSC multiz46way MAF files. This led to better memory management, handling of extended-range UCSC bins in indexing, better logging and warning generation, index version validation, index error reporting, and other work to make this library more robust and useful.

Sadly, this is the end of the Google Summer of Code. I’ve enjoyed working on Bio-MAF, getting to know the BioRuby community and especially my fantastic mentors, and most of all, delivering a useful piece of bioinformatics software. Thanks also to Google for organizing and funding this!

While the Summer of Code is over, I plan to continue to maintain Bio-MAF. While the originally envisioned feature set has been implemented, I will at least keep it up to date with dependencies. And, of course, if you use this library or are considering it, please file Github issues (or, better yet, pull requests) for any bugs, problem reports, feature requests, or suggestions you may have.

BGZF and Testing

This post was originally going to be the bio-maf 1.0 release announcement. I’ve added a new command-line maf_extract(1) tool as well as support for BGZF compression of MAF files, which allows random access to compressed files, saving disk space and also I/O bandwidth. However, in the process of testing my tools against UCSC’s multiz46way set of MAF files, I’ve discovered a number of issues requiring a bit more work.


maf_extract

Other MAF libraries, such as bx-python, provide a variety of command-line tools for extracting MAF alignments, each offering a different query or post-processing mode. I created maf_extract(1) to serve as a generic random-access MAF extraction tool, exposing most of the capabilities of bio-maf through a simple command-line interface. The man page has plenty of examples, but a few representative invocations are:

$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 --only-species hg18,mm8,rheMac2

$ maf_extract -d test/data --mode slice --interval mm8.chr7:80082592-80082766

I have already found this useful for ad hoc data extraction from MAF files, without writing any code at all.


BGZF

BGZF is the Blocked GZip Format, defined in the SAM/BAM specification. Blocks of up to 64 KB of data are separately compressed with gzip and concatenated into a single file. For random access, data is addressed with a virtual offset scheme: a virtual offset is a 64-bit quantity where the first 48 bits give the starting offset of the BGZF block within the file, and the last 16 bits give the offset within the decompressed data. Thus, given a virtual offset, one can read the relevant BGZF block, decompress it, and seek to the appropriate position within the decompressed data.
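A minimal sketch of the virtual offset arithmetic (the function names here are mine, not part of bio-bgzf):

```ruby
# BGZF virtual offsets: the upper 48 bits address the compressed BGZF
# block within the file; the lower 16 bits address a byte within that
# block's decompressed data.
def make_virtual_offset(block_offset, data_offset)
  raise ArgumentError, "data offset must fit in 16 bits" unless data_offset < (1 << 16)
  (block_offset << 16) | data_offset
end

def split_virtual_offset(voff)
  [voff >> 16, voff & 0xFFFF]
end
```

Since a decompressed block is at most 64 KB, 16 bits always suffice for the within-block offset.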

Artem Tarasov wrote a Ruby implementation of basic BGZF compression and decompression, which I forked and extended for my needs as bio-bgzf. This offers rudimentary IO-like interfaces for reading and writing BGZF files.

To use BGZF optimally with bio-maf, I’ve added the maf_bgzip(1) tool to compress (or recompress, in the case of gzipped files) MAF files, optionally building indexes on them in the process. The idea is that this can be used as an ‘intake’ tool to prepare a set of MAF files such as those from UCSC for random-access querying.

Stress testing

In light of that concept, I decided to recompress and index the entire set of MAF files from UCSC. It immediately became clear that I should have taken that step earlier, and not just tested with selected subsets of the 31 GB (compressed!) data set.

I discovered a memory leak ultimately traceable to JRuby’s very aggressive sharing of backing store for strings. This is a great feature, until you write code that holds references to occasional small strings parsed from a multi-GB file. I wasn’t happy with my fix; I suspect there is a better alternative for obtaining a totally independent copy of a string than "" << string, but if there is, I haven’t found it yet.
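For reference, the workaround in question is tiny (independent_copy is my name for it):

```ruby
# Appending onto a fresh empty string forces a copy with its own backing
# store, so the original multi-gigabyte buffer is no longer pinned in
# memory by references to small substrings parsed out of it.
def independent_copy(str)
  "" << str
end
```

Under MRI this is just an ordinary copy; the point of the idiom is its behavior under JRuby’s shared backing stores.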

I also found, to my surprise, that doing gzip decompression in the parser thread performs substantially better than opening a pipe to gzip -dc, cutting elapsed time by about 25%. I suspect I/O buffering may be the culprit, but it didn’t seem worth digging into further.

The slightly nonstandard upstream*.maf files from UCSC led me to add a :strict parser option (off by default), since the parser was not expecting their r lines; in non-strict mode, the parser now tolerates such variations where possible.

The performance of maf_bgzip(1) on large files turned out to be poor, but with a few improvements I managed to double the speed of recompressing and indexing files.

The next sticky issue I need to confront is the need for extended UCSC bin indexes; at least one sequence in the UCSC alignments has a genomic position beyond 2^29, the limit of the standard binning scheme, causing this unfortunate surprise.

Next steps

After getting a more thoroughly tested release out the door, I’m going to focus on integrating my tools into the Galaxy Tool Shed, to make it even easier for people to use them. I’ll post more on that once I’ve gotten them integrated with a local Galaxy instance.

Bio-MAF 0.3.0

bio-maf version 0.3.0 is released. This version adds features including joining adjacent MAF blocks when sequences that caused them to be split have been filtered out; returning bio-alignment objects; and truncating (or ‘slicing’) alignment blocks to only cover a given genomic interval.

For developers, this also adds a higher-level Bio::MAF::Access API for working with directories containing indexed MAF files (or, alternatively, single files), providing all relevant functionality for indexed access in a simpler way than using the KyotoIndex and Parser classes directly. The maf_tile(1) utility has been updated to use this functionality; a directory of indexed MAF files can now be specified, and the correct file will now be parsed as appropriate.

Usage of Enumerators and blocks has also been substantially improved: all access methods yielding multiple blocks, such as Access#find, Access#slice, and Parser#each_block, now accept a block parameter, which will be called for each block in turn. If no block is given, they instead return an Enumerator over the resulting blocks. This matches how most of the Ruby standard library, e.g. Array#each, works.
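The idiom behind this API looks like the following (BlockSource is a toy stand-in, not an actual bio-maf class):

```ruby
# Standard Ruby pattern: yield to a given block, or return an Enumerator
# when no block is supplied, so callers can use either style.
class BlockSource
  def initialize(blocks)
    @blocks = blocks
  end

  def each_block
    return enum_for(:each_block) unless block_given?
    @blocks.each { |b| yield b }
  end
end
```

Callers can then write `source.each_block { |b| ... }` or chain Enumerator methods such as `source.each_block.select { ... }`, exactly as with Array#each.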

Bio-MAF 0.2.0

I have released bio-maf version 0.2.0 as a Ruby gem. The main new feature in this release is MAF ‘tiling’: combining the alignment blocks covering a given genomic interval, optionally filling in gaps with reference sequence data, to output a single alignment block. See the README for more details. This release includes a maf_tile(1) tool. The name of this feature, and the inspiration for the implementation, comes from bx-python.

This release also enables gap removal in cases where removal of some species from an alignment block leaves a gap in all remaining sequences.

Other fixes and improvements include:

  • #59: returning blocks in sorted order
  • #56: avoiding a race condition in parallel parsing
  • Using all available cores by default under JRuby

Kyoto Cabinet Support for JRuby

Kyoto Cabinet is a useful database, with a convenient Ruby binding; unfortunately, this uses Ruby’s C extension API and so does not work with JRuby. Since JRuby has a considerable performance advantage over MRI for my MAF parser, I needed a way of using Kyoto Cabinet from JRuby. I have released the kyotocabinet-java gem to provide this. It is not complete, but it wraps enough of the API for my needs, and the rest should be simple to cover.

This gem contains a patched copy of the Kyoto Cabinet Java library, along with a thin Ruby adaptation layer to provide the usual Ruby API. Since the underlying Java library relies on native code integrated with JNI, installing the gem compiles the native library and associated Java code, and thus requires a JDK and working C++ compiler as well as a Kyoto Cabinet installation.

Mapping the Java API onto the Ruby API is generally quite simple. The chief issue is ensuring that Ruby strings are converted to Java byte arrays rather than Java strings; the Java string versions of the API methods fail completely with binary data. Also, the Java API lacks the Ruby API’s optional parameters. For instance, here is DB#get:

alias_method :_get, :get
def get(key)
  # convert the Ruby string key to a Java byte array, and the result back
  v = _get(key.to_java_bytes)
  v && String.from_java_bytes(v)
end

A few methods needed a tiny bit more work, such as DB#set_bulk and DB#cursor_process, but as you can see from kyotocabinet.rb, it wasn’t much.

Using this and the kyotocabinet-ruby library interchangeably for MRI and JRuby support in a Bundler application is straightforward:

gem "kyotocabinet-ruby", "~> 1.27.1", :platforms => [:mri, :rbx]
gem "kyotocabinet-java", "~> 0.2.0", :platforms => :jruby

Unfortunately, if you are developing a gem as I am with bioruby-maf, things are not quite so simple. While Bundler makes it easy to require different gems depending on one’s current Ruby platform, Gem::Specification provides no such feature. I ended up making a gemspec that builds two different versions of the gem: a java platform variant that depends on kyotocabinet-java, and a ‘normal’ gem with no platform specified that depends on kyotocabinet-ruby. This is a bit cumbersome; it requires running gem build under both JRuby and MRI to build both versions of the gem, and meant that I had to abandon use of Jeweler. Nonetheless, it works; under JRuby, gem install bio-maf will pull in the Java variant and kyotocabinet-java, while under MRI it will pull in the ‘normal’ gem and kyotocabinet-ruby. The relevant part of the gemspec is:

if RUBY_PLATFORM == 'java'
  s.platform = 'java'
  s.add_runtime_dependency('kyotocabinet-java', ["~> 0.2.0"])
else
  s.add_runtime_dependency('kyotocabinet-ruby', ["~> 1.27.1"])
end

Ultimately, it might be worth contacting the maintainer of the kyotocabinet-ruby gem and seeing about packaging this as a platform variant, to sidestep all this. For the time being, though, this works.

A note on performance: I found JRuby’s ObjectProxyCache to be a major performance bottleneck, especially for multithreaded access to Kyoto Cabinet. Starting JRuby with the -Xji.objectProxyCache=false option gave about a 2.5x speedup for Kyoto Cabinet searches. If the rest of your JRuby code works with this option (which will be the default in JRuby 2.0, anyway), I highly recommend using it.

Parallel I/O

Some applications of a MAF parser call for sequential processing of an entire MAF file, but in many cases random access is necessary. I found that a single-threaded parsing approach worked tolerably well for sequential access, with OS-level readahead helping, but that single-threaded random-access performance was very poor indeed: in one test scenario, I was only able to parse 10.5 MB/s from a SATA hard disk, and 34.3 MB/s from a MacBook Air SSD.

My next step was to parallelize random I/O for index-based queries. With Ruby 1.9.3, this was pointless, since its GIL prevents any useful parallelism, but with JRuby there was ample scope for improvement. In the end, I implemented several parallel I/O subsystems: one with a pool of independent threads reading and parsing groups of blocks, one with separate pools of reader and parser threads, and one using an adaptive parallel I/O subsystem modeled after Dmitry Vyukov’s Wide Finder entry.

As test data for this, I used chr22.maf from the hg18 data set and a set of 10,000 randomly generated genomic intervals from its reference sequence, along with an index for the MAF file. I tested performance with all three I/O subsystems on hard disk and SSD storage, and of course with various numbers of threads, queue sizes, and so forth.

To my surprise, I found that my first implementation, with independent worker threads reading and immediately parsing MAF blocks, outperformed the others. Quite possibly this is because the data being parsed has just been read into cache by the same thread. It is possible that the adaptive implementation could be competitive with it, but it was challenging to tune it properly for this workload in a way that worked with slow and fast storage alike. Further work on this might be promising.

In the end, the first implementation delivered a maximum of 26 MB/s from hard disk, and 136 MB/s from SSD; the implementation with separate reader and parser threads seemed to top out at 24 and 104 MB/s respectively.

Almost as striking, however, was the difference between a single run and the third or fourth run repeated back-to-back in the same JVM instance. In some cases, performance almost doubled, from 87 to 136 MB/s in a representative case. File system caching was ruled out, but intensive HotSpot compiler activity was observed, so I would attribute this to the combined optimization activity of the JRuby and HotSpot JIT compilers.

Index scanning

An unexpected bottleneck was observed: at peak performance, single-threaded scanning of the 16 MB index to select the blocks to parse takes 6.9 seconds, nearly as long as it took to parse the 869 MB of MAF data. Optimizing the match test used produced a significant speedup, but past that, no optimization attempts have had much success.

Parallelizing the index scan showed a modest improvement with two threads, topping out around 5.2 seconds; more than two threads have resulted in negligible improvements. Currently, the two-thread parallel scan is the default under JRuby.

Starting the scan at the first block of interest rather than at the start of the bin added noticeable complexity for no detectable performance gain. Running the index scan in parallel with the block parsing seemed like a reasonable approach, but turned out to be slower than performing the two operations sequentially.

JRuby and java.util.concurrent

A significant benefit of using JRuby is having access to the java.util.concurrent library. Its blocking queues made parallel parsing simple to implement, and parallel index scans were even easier with the ExecutorCompletionService. Since JRuby objects are JVM objects like any other, and JRuby threads are implemented atop JVM threads, it was perfectly easy to use these.
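As a rough illustration (class and method names such as scan_bin are hypothetical stand-ins, not bio-maf's actual API), a parallel fan-out with ExecutorCompletionService from JRuby might look like:

```ruby
# JRuby-only sketch: distributing per-bin index scans over a fixed
# thread pool using java.util.concurrent's ExecutorCompletionService.
if RUBY_PLATFORM == 'java'
  require 'java'

  java_import java.util.concurrent.Executors
  java_import java.util.concurrent.ExecutorCompletionService

  # Hypothetical per-bin scan; the real code walks index entries.
  def scan_bin(bin)
    [bin]
  end

  def scan_bins_in_parallel(bins, n_threads = 2)
    pool = Executors.new_fixed_thread_pool(n_threads)
    ecs = ExecutorCompletionService.new(pool)
    # JRuby converts the Ruby block to a java.util.concurrent.Callable
    bins.each { |bin| ecs.submit { scan_bin(bin) } }
    # take returns completed futures in completion order, one per task
    results = bins.map { ecs.take.get }
    pool.shutdown
    results.flatten
  end
end
```

On MRI this whole block is a no-op, since the java.util.concurrent classes are only available under JRuby.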

JRuby Support and I/O Performance

I/O performance and the Magic Number

My MAF parser is designed to read data from disk in relatively large chunks before parsing it, to minimize read(2) calls. I started out using a chunk size of 8 MB, but using the jvisualvm profiler with JRuby, I observed that during index builds with maf_index, it was spending about 37% of its time in read calls. (I had observed MRI spending a similar amount of time in read calls as well.) Since Mac OS X has DTrace, I used iosnoop to watch the I/O activity at the OS level, and observed that reads were actually being issued with sizes of 128k:

  501 55424 R 1139635624 131072       java ??/maf/chr22.maf
  501 55424 R 1139635880 131072       java ??/maf/chr22.maf
  501 55424 R 1139636136 131072       java ??/maf/chr22.maf
  501 55424 R 1139636392 131072       java ??/maf/chr22.maf
  501 55424 R 1139636904 131072       java ??/maf/chr22.maf

Suspecting that merging the data from these read calls into an 8 MB buffer might be a problem, I changed the chunk size constant (SEQ_CHUNK_SIZE) to 128k to match, and saw the processing rate jump from 19 MB/s to 30 MB/s, with only 2.4% of CPU time spent in read calls.

MRI 1.9.3 exhibited a similar improvement, from 18 to 24 MB/s.

It turns out, though, that 128k is not, as I guessed, a hard-coded Ruby setting. I later ran the same test on my MacBook Air’s built-in SSD instead of a USB-attached SATA disk, and saw I/O sizes of 1048576 instead. Here, changing the chunk size to match showed no significant improvement. Presumably this is due to the lower I/O latency of the SSD.

Requiring users to divine the best I/O size for their environment to achieve the best parsing performance seems a bit arcane; I wonder if it might be worthwhile to try to automatically detect the best-performing chunk size at runtime by defaulting to 128k, varying it in both directions, and tracking average latencies.
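A rough sketch of what such an auto-tuner might look like (entirely hypothetical; nothing like this is implemented in bio-maf):

```ruby
require 'benchmark'

# Hypothetical chunk-size auto-tuner: time reading a fixed sample of the
# file at each candidate chunk size, and keep whichever was fastest.
CANDIDATE_SIZES = [128 * 1024, 512 * 1024, 1024 * 1024, 8 * 1024 * 1024]

def best_chunk_size(io, sample_bytes)
  CANDIDATE_SIZES.min_by do |size|
    io.seek(0)
    Benchmark.realtime do
      remaining = sample_bytes
      # read the sample in +size+ chunks, stopping early at EOF
      while remaining > 0 && (chunk = io.read([size, remaining].min))
        remaining -= chunk.bytesize
      end
    end
  end
end
```

A real implementation would need to worry about cache effects between probes; this just shows the shape of the idea.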

JRuby support with Kyoto Cabinet

Kyoto Cabinet has turned out to perform well and be easy to use. However, while the parser code runs substantially faster under JRuby, there is not a straightforward out-of-the-box way to use Kyoto Cabinet with JRuby. The standard kyotocabinet-ruby gem uses the C Extension API, so it isn’t really practical to use with JRuby. (Technically, JRuby has partial support for the C Extension API, but at least on my machine, it’s broken in several ways under RVM, needs to be enabled with a specific runtime option, and is generally frowned upon.) However, Kyoto Cabinet also has a Java API; both it and the Ruby API are based on the same underlying C++ library, so the APIs are quite similar. I wrote a small Ruby wrapper around the Java library, so it can be used under JRuby in an identical way to the native Ruby library.

Unfortunately, this still requires users to build and install the JNI-based Java library for Kyoto Cabinet as well as Kyoto Cabinet itself. This isn’t ideal, but shouldn’t be too difficult. I plan to separate this out as its own gem.

Filtering Work

I think I made good progress in the end this week, although it was a bit frustrating along the way, with a few dead ends and some challenges in figuring out the right approaches to things. My main goal was to implement MAF file filtering on queries, as specified in a Cucumber feature and Github issue.

This is all working now, and I’ve managed to do all the alignment block filtering using the index, to avoid having to parse blocks to test whether they meet criteria for e.g. presence of species. I’m storing things like the number of sequences, the text size, and a bit vector of species present in the index, and checking those when building the list of MAF blocks to read from disk. I think this is an improvement over bx-python, which does this sort of thing by a combination of post-parsing filtering and writing out modified MAF files.
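The species test can be sketched as follows (the constants and names here are illustrative, not bio-maf's actual layout): each indexed species gets a bit position, a block's species set is stored as one integer in its index record, and the filter test is a single bitwise AND.

```ruby
# Illustrative species-to-bit assignment; the real mapping is built
# from the species seen while indexing.
SPECIES_BITS = { 'hg18' => 0, 'mm8' => 1, 'rheMac2' => 2 }

# Fold a species list into an integer bit vector.
def species_vector(species)
  species.reduce(0) { |vec, s| vec | (1 << SPECIES_BITS.fetch(s)) }
end

# A block passes the filter when every wanted species bit is set.
def block_has_all_species?(block_vec, wanted_vec)
  (block_vec & wanted_vec) == wanted_vec
end
```

This is why no block needs to be parsed just to discover which species it contains.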

I tried using external libraries to provide somewhat nicer interfaces for binary record access (bindata) and bit vector manipulation (bitstring); both are nice, well-thought-out libraries, but they turned out to have awful performance, so I ended up ripping them both out. (Index building got about a 120% performance boost from using raw Array#pack calls instead of bindata, and then roughly another 100% from using Integers and bit shifts instead of bitstring.)

I think in the future I will do microbenchmarks before integrating things like these; I wouldn’t have expected such a high performance penalty for such conceptually simple operations.

Ultimately, I wrote my own little library for building format strings, after making a few arithmetic mistakes when calculating byte offsets in my head. I also wrote a bit of code to dump out binary indexes in human-readable form, to help with debugging.
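The format-builder idea can be sketched in a few lines (this is a toy version with illustrative directive sizes, not the actual library):

```ruby
# Builds an Array#pack / String#unpack format string while tracking the
# byte offset of each named field, so offset arithmetic is done by the
# builder rather than by hand.
class RecordFormat
  SIZES = { 'C' => 1, 'S>' => 2, 'L>' => 4, 'Q>' => 8 }

  attr_reader :fmt, :size

  def initialize
    @fmt = ''
    @size = 0
    @offsets = {}
  end

  def field(name, code)
    @offsets[name] = @size
    @fmt += code
    @size += SIZES.fetch(code)
    self
  end

  def offset_of(name)
    @offsets.fetch(name)
  end
end
```

With field offsets tracked automatically, the pack format and any byte offsets stay consistent by construction.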

The API for this filtering code could perhaps use some work. I’ll need to think about that further.

I also made some progress on getting Kyoto Cabinet and JRuby working together, although it’s not quite there yet. Along the way, I discovered what looks like an interesting JRuby bug.

And I’ve got tests passing on Travis CI again, after adding a script to build and install Kyoto Cabinet there.

After working through and thinking about things this week, and particularly in light of Francesco’s comments, I have a few questions I need to look into about MAF and MAF workflows:

  • Is it possible for alignment blocks to overlap, at least with respect to the reference sequence?
  • Is it useful to build indexes on other sequences besides the reference sequence?
  • What is the longest (in terms of text size, i.e. sequence data including dashes) alignment block I should be prepared to encounter? (That is, does its length need to be 32 bits? 16? 64?)

Indexed MAF Access

My second week of working on this MAF library was more of a struggle than the first week. I was implementing indexing of MAF files and using those indexes to enable searching by genomic regions. In the process of doing this, I ended up changing directions partway through, but I eventually did implement the features I had planned to, and ended up with a better implementation than I thought I would.

MAF files can be quite large, often hundreds of gigabytes. Frequently, too, a user will want to extract a relatively small subset of that data. Reading in and parsing an entire 200 GB file to find a few alignments of interest is not a very useful approach; instead, we want to build indexes to enable quick location of the alignments to parse.

This was, in fact, my milestone for this week: to be able to build an index for a MAF file, identify alignment blocks matching a region of interest, and then selectively parse those blocks from the MAF file. The two MAF implementations I’m aware of that use a similar indexing scheme are Biopython and bx-python; Biopython uses the SQLite embedded relational database to build its indexes, and bx-python uses a custom index file format. (I believe the UCSC genome browser uses MySQL for indexing in general, but I’m not sure whether this is applied to MAF files directly.)

SQLite vs. Kyoto Cabinet

The plan I started with was to use SQLite for indexing in a similar way to Biopython, in order to bring up functional code with indexing support as quickly as possible. I reasoned that being able to use SQL for declarative queries would make this simpler than, say, using a key-value store such as Kyoto Cabinet or Berkeley DB, let alone implementing a sui generis indexing system such as the one bx-python uses.

My first task, though, was to understand the UCSC bin indexing system, which is used by all MAF indexing implementations. It was not particularly clear to me how it worked, even after looking at several implementations of it, until I finally read the paper describing it and illustrating its hierarchical nature. I still don’t understand why Biopython always searches bin 1, which was puzzling me for a while; I need to ask the author.
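For reference, the bin calculation from the published scheme works out to something like this (a sketch with the standard 128 kb bottom-level bins; function names are mine):

```ruby
# Standard UCSC binning: find the smallest bin that fully contains the
# zero-based, half-open interval [beg, chr_end). Offsets are the first
# bin numbers of each level, largest bins last.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]

def bin_from_range(beg, chr_end)
  beg_bin = beg >> 17           # 128 kb bottom-level bins
  end_bin = (chr_end - 1) >> 17
  BIN_OFFSETS.each do |offset|
    return offset + beg_bin if beg_bin == end_bin
    beg_bin >>= 3               # each level up covers 8x the range
    end_bin >>= 3
  end
  raise "interval out of range: #{beg}-#{chr_end}"
end
```

A query for an interval then needs to examine only the handful of bins, one per level, whose ranges can contain it; this hierarchical structure is what the paper's illustration makes clear.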

I ran into a series of problems with using SQLite, unfortunately. First, the standard sqlite3 gem uses the C extension API rather than FFI, and so does not work with JRuby. To use the same code with the native and JDBC drivers, I needed to use the unmaintained DBI/DBD library. (I subsequently learned that Sequel might have worked as well, but after I’d already decided to try another tack.) The DBI interface also had the unfortunate feature of returning all numeric values as strings, which was not what I would hope for in performance-critical code.

Also, I discovered that SQLite does not have clustered indexes, meaning that I needed to create a table just to hold the index data, and then a separate index within SQLite for that data. This duplicates the index data, which is also not ideal.

Most of all, though, SQL simply turned out to be too cumbersome. Explicitly creating and indexing tables and checking for their existence started consuming more of my effort than I would have liked. Biopython’s implementation ended up running a SQL query for each possible bin for each range of interest; I knew that it should be possible to execute a search like this while only having to make a single pass through the index data for each bin. It wasn’t clear that there was a good (and simple) way to do this in SQL, though.

In frustration, I opened up the Kyoto Cabinet documentation, studied the available features for prefix and range searches, and wrote out the algorithm I had in mind. My biggest decision point was how exactly to store the index data. In the past, I’ve always favored human-readable data representations; I’ve generally dealt with small amounts of data in situations where being able to see the data immediately with View Source or a packet trace is invaluable. Here, though, after adding things up I realized a binary representation would take a mere 24 bytes per index record. This actually may be the first time I’ve written something to output a binary data representation, but I think it was well worth it.
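To give a flavor of what a 24-byte packed record can look like (this layout is purely illustrative; the actual bio-maf record format may differ):

```ruby
# Hypothetical 24-byte index record: big-endian 64-bit file offset,
# then four big-endian 32-bit fields (block length, reference start,
# text size, species bit vector). 8 + 4*4 = 24 bytes.
RECORD_FMT = 'Q>L>L>L>L>'

def pack_record(offset, length, ref_start, text_size, species_vec)
  [offset, length, ref_start, text_size, species_vec].pack(RECORD_FMT)
end

def unpack_record(data)
  data.unpack(RECORD_FMT)
end
```

Fixed-width records like this also make it cheap to step through a bin's entries sequentially, which is exactly what the prefix/range scans need.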

Once I sat down to actually implement the Kyoto Cabinet version, I was amazed how much simpler it was than the SQLite version. I could step through the data just the way I wanted to, and do whatever computation seemed appropriate on the way. I pulled together a running version very quickly, and simply applied the same RSpec tests to both.

For both implementations of this indexing code, I found RSpec invaluable. I made a fairly rigorous effort to implement this in BDD/TDD style, writing tests for each bit of behavior I could think of to specify. This gave me a great deal of confidence in optimizing and tweaking the index search code.

Kyoto Cabinet results

I’m now very happy with the Kyoto Cabinet code; I also expect it to be easily adapted to other key-value stores such as leveldb or Berkeley DB.

Getting Kyoto Cabinet working on JRuby has posed a bit of a headache too, unfortunately. Its Ruby library also uses the C extension API, so I plan to use the Kyoto Cabinet Java library instead, with a thin adaptation layer to present a compatible interface to the Ruby library. This approach has been promising, except that a few of the methods in the Java library seem to encounter string encoding problems with binary data, returning garbage strings (probably containing Unicode ‘REPLACEMENT CHARACTER’, I need to check) rather than the expected data. Many of the Java methods have separate String and byte[] versions, but not all. Ironically, the trickiest problems of this kind are caused by DB#match_prefix calls in my unit tests, rather than in the actual indexing code.

Another oddity has been that there don’t appear to be any Ubuntu packages in usable form for Kyoto Cabinet. This is strange, since it’s hardly the most obscure software I’ve wanted packages for. I don’t need these for development, but Travis CI doesn’t provide Kyoto Cabinet, so I have to install it myself there to get CI working again. Since I’m not about to teach myself Debian packaging just for this, I’ve written a horrible shell script instead for the time being. (What would really be nice is an easy way to run user-provided Chef recipes on Travis CI, but they don’t appear to provide one.)

Random-access parsing

The trickiest part of the whole effort may have been figuring out how to revise the parsing code to work in a truly random-access fashion, and in a way that will be compatible with BGZF compression later on. What I ended up doing was extracting a separate ChunkReader class responsible for positioning within the file and reading appropriately-sized chunks of it as requested while maintaining proper chunk alignment. Together with other code to merge requests for adjoining or nearby alignment blocks, this keeps the basic parsing logic intact but should avoid either excessive or excessively large reads.

This also means that the indexing code simply provides the starting offset of each block, and the parser is responsible for reading ahead as appropriate. This should be perfectly compatible with a BGZF layer, even if an alignment block spans the boundary between separate BGZF blocks. (Alignment blocks, chunks, and now BGZF blocks… I may be heading for a terminology problem here.)
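The chunk-alignment idea can be sketched like this (a simplified stand-in for the real ChunkReader, not its actual implementation):

```ruby
# Reads are always issued at chunk-aligned file offsets: a request for
# any byte position fetches the fixed-size, power-of-two-sized chunk
# containing it, keeping OS-level reads uniform.
class ChunkReader
  def initialize(io, chunk_size)
    unless chunk_size > 0 && (chunk_size & (chunk_size - 1)) == 0
      raise ArgumentError, 'chunk size must be a positive power of 2'
    end
    @io = io
    @chunk_size = chunk_size
  end

  # Returns [chunk_start, data] for the chunk containing +offset+.
  def chunk_at(offset)
    chunk_start = offset & ~(@chunk_size - 1)
    @io.seek(chunk_start)
    [chunk_start, @io.read(@chunk_size)]
  end
end
```

A block starting near the end of a chunk simply triggers a read of the next aligned chunk as parsing continues, which is what keeps this compatible with a BGZF layer underneath.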

Next steps

This week, I’m focusing on providing additional logic for filtering and querying for sequences and alignment blocks. I’m also working on a few ideas to optimize index creation, and I hope to get my index code running on JRuby soon.