maf_tile - synthesize an alignment for a given region
maf_tile [options] -i [SEQ:]BEGIN:END [-s SPECIES[:NAME] ...] maf [index]
maf_tile [options] --bed BED -o BASE [-s SPECIES[:NAME] ...] maf [index]
maf_tile takes a MAF file, with optional index, or directory of indexed MAF files, extracts alignment blocks overlapping the given genomic interval, and constructs a single alignment block covering the entire interval for the specified species. Optionally, any gaps in coverage of the MAF file's reference sequence can be filled in from a FASTA sequence file.
If a single interval is specified, the output will be written to
stdout in FASTA format. If a directory of MAF files is supplied as the
maf parameter, the interval must include the sequence identifier in
the form sequence:begin:end. If the --output-base option is
specified, _<begin>:<end>.fa will be appended to the given base name
to form the output path. If --bed is specified, --output-base is also
required.
Species can be renamed for output by specifying them as SPECIES:NAME; the first component will be used to select the species from the MAF file, and the second will be used in the FASTA description line for output.
-r, --reference SEQ
The given FASTA reference sequence file, which may be gzipped, will be used to fill in any gaps between alignment blocks.
-i, --interval [CHR:]BEGIN:END
The given zero-based genomic interval will be used to select
alignment blocks from the MAF file. If the chromosome is not
specified, it will be taken from the first species specified with
--species or --species-file.
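The interval syntax shown in the synopsis, [SEQ:]BEGIN:END, can be illustrated with a short Python sketch. It is not maf_tile's own parser; the function name parse_interval and the validation rules are assumptions made for the example.

```python
from typing import Optional, Tuple


def parse_interval(spec: str) -> Tuple[Optional[str], int, int]:
    """Split '[SEQ:]BEGIN:END' into (seq, begin, end).

    seq is None when the sequence identifier is omitted.
    Hypothetical helper; the checks below are assumptions.
    """
    parts = spec.split(":")
    if len(parts) == 3:
        seq, begin, end = parts
    elif len(parts) == 2:
        seq = None
        begin, end = parts
    else:
        raise ValueError(f"expected [SEQ:]BEGIN:END, got {spec!r}")
    begin_i, end_i = int(begin), int(end)
    if begin_i < 0 or end_i < begin_i:
        raise ValueError(f"invalid zero-based interval: {spec!r}")
    return seq, begin_i, end_i


print(parse_interval("14400:15000"))            # (None, 14400, 15000)
print(parse_interval("hg19.chrY:14400:15000"))  # ('hg19.chrY', 14400, 15000)
```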
-s, --species SPECIES[:NAME]
The given species will be selected for output. If given as
species:name, it will appear in the FASTA output as name.
--species-file FILE
Species to select, and optional mapping names, will be read from
the given file, one species per line. If the species name is
followed by whitespace and an additional name, this will be taken
as the output name. Lines beginning with # will be ignored.
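The species file layout described above can be sketched as follows. This is an illustration of the format, not maf_tile's implementation: the function name parse_species_lines is hypothetical, skipping blank lines is an assumption, and the output name "platypus" is made up for the example.

```python
def parse_species_lines(lines):
    """Map each selected species to its output name.

    A species with no second column keeps its own name.
    Blank lines are skipped (an assumption); '#' lines are ignored.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 1)  # species, then optional output name
        species = fields[0]
        name = fields[1].strip() if len(fields) > 1 else species
        mapping[species] = name
    return mapping


example = [
    "# species   output name",
    "hg19 human",
    "petMar1",
    "ornAna1\tplatypus",
]
print(parse_species_lines(example))
# {'hg19': 'human', 'petMar1': 'petMar1', 'ornAna1': 'platypus'}
```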
-b, --bed BED
The given BED file will be used to provide a list of intervals to
process. If present, --interval will be ignored and --output-base
must be given as well.
--bed-species SPECIES
The given species name will be prepended to the chromosome name
indicated in the BED file, separated by a period. This is necessary
if the BED file simply indicates chr12, but the sequence
identifiers in the MAF file are e.g. hg19.chr12.
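The renaming performed by --bed-species amounts to joining the species and chromosome names with a period. A minimal Python sketch, using a hypothetical helper named qualify_chrom:

```python
def qualify_chrom(chrom, bed_species=None):
    """Return the MAF-style sequence identifier for a BED chromosome name.

    Hypothetical helper: with no species given, the name is left unchanged.
    """
    if bed_species is None:
        return chrom
    return f"{bed_species}.{chrom}"


print(qualify_chrom("chr12", "hg19"))  # hg19.chr12
print(qualify_chrom("chr12"))          # chr12
```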
--concat
The alignments specified in the BED file will be individually tiled and concatenated.
-o, --output-base BASE
The given path will be used as the base name for output files, as described above.
--fill-char C
Gaps where no aligning sequence data exists will be filled with the
given character instead of *.
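The role of the fill character can be pictured with a deliberately simplified Python sketch. It is not maf_tile's algorithm: it handles one species, assumes ungapped segments given as hypothetical (start, text) pairs, and lets later segments overwrite earlier ones where they overlap.

```python
def tile(begin, end, blocks, fill_char="*"):
    """Cover the zero-based interval [begin, end) with segment text.

    blocks is a list of (start, text) pairs; positions with no
    segment data keep fill_char. Simplified illustration only.
    """
    out = [fill_char] * (end - begin)
    for start, text in blocks:
        for offset, base in enumerate(text):
            pos = start + offset
            if begin <= pos < end:
                out[pos - begin] = base
    return "".join(out)


# One segment overlaps the left edge; another sits inside the interval.
print(tile(10, 20, [(8, "ACGT"), (15, "gg")]))  # GT***gg***
```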
--upcase
All sequence data will be folded to upper case.
-q, --quiet
Run quietly, with warnings suppressed.
-v, --verbose
Run verbosely, with additional informational messages.
--debug
Log debugging information.
Generate an alignment of the hg19, petMar1, and ornAna1 sequences
from chrY.maf over the interval 14400 to 15000 on the reference
sequence of the MAF file. Fills in gaps from chrY.refseq.fa.gz.
Writes FASTA output to stdout.
$ maf_tile --reference ~/maf/chrY.refseq.fa.gz \
--interval 14400:15000 \
-s hg19:human -s petMar1 -s ornAna1 \
chrY.maf chrY.kct
>human
GGGTGACGAAAAGAGCCGA-----[...]
>petMar1
gagtgccggggagtgccggggagt[...]
>ornAna1
AGGGATCTGGGAATTCTGG-----[...]
Write out a FASTA file for each interval in the given BED file,
prefixed with /tmp/mm8, and without filling in data from a reference
sequence:
$ maf_tile --bed /tmp/mm8.bed --output-base /tmp/mm8 \
-s mm8:mouse -s rn4:rat -s hg18:human \
mm8_chr7_tiny.maf mm8_chr7_tiny.kct
The output is generated in FASTA format, with one sequence per species.
The maf parameter must specify either a Multiple Alignment Format (MAF) file or a directory of such files, with indexes.
MAF files can optionally be BGZF-compressed, as produced by bgzip(1) from samtools.
The index must be a MAF index built with maf_index(1). This parameter is ignored if the maf parameter is a directory. It can be omitted if a single MAF file is given, but in this case the entire file will be parsed to build a temporary index. For large files which will be reused, this is not advisable.
If --bed is specified, its argument must be a BED file. Only
the second and third columns will be used, to specify the zero-based
start and end positions of intervals.
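As an illustration of how BED intervals relate to output files, the following Python sketch reads the start and end columns and builds paths by appending _<begin>:<end>.fa to the output base, as described earlier. The function names, the skipping of blank and # lines, and the coordinates are assumptions made for the example.

```python
def bed_intervals(lines):
    """Yield (begin, end) from the second and third BED columns.

    Blank lines and lines starting with '#' are skipped (an assumption).
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        yield int(fields[1]), int(fields[2])


def output_path(base, begin, end):
    """Append '_<begin>:<end>.fa' to the output base name."""
    return f"{base}_{begin}:{end}.fa"


bed = [
    "chr7\t80082334\t80082368",  # hypothetical coordinates
    "chr7\t80082471\t80082634",
]
for begin, end in bed_intervals(bed):
    print(output_path("/tmp/mm8", begin, end))
# /tmp/mm8_80082334:80082368.fa
# /tmp/mm8_80082471:80082634.fa
```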
maf_tile is a Ruby program and relies on ordinary Ruby environment
variables.
maf_tile is copyright (C) 2012 Clayton Wheeler.
maf_index(1), ruby(1), bgzip(1)