maf_extract(1) - extract blocks from MAF files

maf_extract(1)
BioRuby Manual
maf_extract(1)

NAME

maf_extract - extract blocks from MAF files

SYNOPSIS

maf_extract -m MAF [-i INDEX] --interval SEQ:START-END OPTIONS

maf_extract -m MAF [-i INDEX] --bed BED OPTIONS

maf_extract -d MAFDIR --interval SEQ:START-END OPTIONS

maf_extract -d MAFDIR --bed BED OPTIONS

DESCRIPTION

maf_extract extracts alignment blocks from one or more indexed MAF files, according to either a genomic interval specified with --interval or multiple intervals given in a BED file specified with --bed.

It can either match blocks intersecting the specified intervals with --mode intersect, the default, or extract slices of them which cover only the specified intervals, with --mode slice.

Blocks and the sequences they contain can be filtered with a variety of options including --only-species, --with-all-species, --min-sequences, --min-text-size, and --max-text-size.

With the --join-blocks option, adjacent parsed blocks can be joined if sequence filtering has removed a species causing them to be separated. The --remove-gaps option will remove columns containing only gaps (-).

Blocks can be output in MAF format, with --format maf (the default), or FASTA format, with --format fasta. Output can be directed to a file with --output.

This tool exposes almost all the random-access functionality of the Bio::MAF::Access class. The exception is MAF tiling, which is provided by maf_tile(1).

FILES

A single MAF file can be processed by specifying it with --maf. Its accompanying index, created by maf_index(1), is specified with --index. If --maf is given but no index is specified, the entire file will be parsed to build a temporary in-memory index. This facilitates processing small, transient MAF files. However, on a large file this will incur a great deal of overhead; files expected to be used more than once should be indexed with maf_index(1).

MAF files can optionally be BGZF-compressed, as produced by bgzip(1) from samtools.

Alternatively, a directory of indexed MAF files can be specified with --maf-dir; in this case, they will all be used to satisfy queries.

OPTIONS

MAF source options:

-m, --maf MAF: A single MAF file to process.
-i, --index INDEX: An index for the file specified with --maf, as created by maf_index(1).
-d, --maf-dir DIR: A directory of indexed MAF files.

Extraction options:

--mode (intersect | slice): The extraction mode to use. With --mode intersect, any alignment block intersecting the genomic intervals specified will be matched in its entirety. With --mode slice, intersecting blocks will be matched in the same way, but columns extending outside the specified interval will be removed.
--bed BED: The specified file will be parsed as a BED file, and each interval it contains will be matched in turn.
--interval SEQ:START-END: A single zero-based half-open genomic interval will be matched, with sequence identifier seq, (inclusive) start position start, and (exclusive) end position end.

Output options:

-f, --format (maf | fasta): Output will be written in the specified format, either MAF or FASTA.
-o, --output OUT: Output will be written to the file out.

Filtering options:

--only-species (SP1,SP2,SP3 | @FILE): Alignment blocks will be filtered to contain only the specified species. These can be given as a comma-separated list or as a file, prefixed with @, from which a list of species will be read.
--with-all-species (SP1,SP2,SP3 | @FILE): Only alignment blocks containing all the specified species will be matched. These can be given as a comma-separated list or as a file, prefixed with @, from which a list of species will be read.
--min-sequences N: Only alignment blocks containing at least n sequences will be matched.
--min-text-size N: Only alignment blocks with a text size (including gaps) of at least n will be matched.
--max-text-size N: Only alignment blocks with a text size (including gaps) of at most n will be matched.

Block processing options:

--join-blocks: If sequence filtering with --only-species removes a species which caused two adjacent blocks to be separate, this option will join them together into a single alignment block. The filtered blocks must contain the same sequences in contiguous positions and on the same strand.
--remove-gaps: If sequence filtering with --only-species leaves a block containing columns consisting only of gap characters (-), these will be removed.
--parse-extended: Parse i lines, giving information on the context of sequence lines, and q lines, giving quality scores.
--parse-empty: Parse e lines, indicating cases where a species does not align with the current block but does align with blocks before and after it.

Logging options:

-q, --quiet: Run quietly, with warnings suppressed.
-v, --verbose: Run verbosely, with additional informational messages.
--debug: Log debugging information.

EXAMPLES

Extract MAF blocks intersecting with a given interval:

$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766

As above, but operating on a single file:

$ maf_extract -m test/data/mm8_chr7_tiny.maf \
      -i test/data/mm8_chr7_tiny.kct \
      --interval mm8.chr7:80082592-80082766

Like the first case, but writing output to a file:

$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 \
      --output out.maf

Extract a slice of MAF blocks over a given interval:

$ maf_extract -d test/data --mode slice \
      --interval mm8.chr7:80082592-80082766

Filter for sequences from only certain species:

$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 \
      --only-species hg18,mm8,rheMac2

Extract only blocks with all specified species:

$ maf_extract -d test/data --interval mm8.chr7:80082471-80082730 \
      --with-all-species panTro2,loxAfr1

Extract blocks with at least a certain number of sequences:

$ maf_extract -d test/data --interval mm8.chr7:80082767-80083008 \
      --min-sequences 6

Extract blocks with text sizes in a certain range:

$ maf_extract -d test/data --interval mm8.chr7:0-80100000 \
      --min-text-size 72 --max-text-size 160

ENVIRONMENT

maf_index is a Ruby program and relies on ordinary Ruby environment variables.

BUGS

No provision exists for writing output to multiple files.

FASTA description lines are always in the format >source:start-end.