SaryFasta.py - index fasta files by suffix array

Subroutines for working on I/O of large genomic files.

Index a fasta file to retrieve sequences by suffix-array fragment search.

python SaryFasta.py [options] name [ files ]

SaryFasta.getHID(sequence)

returns a hash identifier for a sequence.
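The page does not say how the identifier is computed. As a rough illustration of the idea only, a sequence can be mapped to a stable identifier by hashing its normalised text. The helper name `sequence_hash_id`, the choice of MD5, and the case and whitespace normalisation are all assumptions made for this sketch, not the behaviour of `SaryFasta.getHID`.

```python
import hashlib


def sequence_hash_id(sequence: str) -> str:
    """Return a hex digest identifying a sequence.

    Hypothetical helper for illustration; not SaryFasta.getHID.
    """
    # Assumption: remove whitespace and upper-case the residues so that
    # line wrapping and soft-masking do not change the identifier.
    normalised = "".join(sequence.split()).upper()
    # MD5 is used here only as a compact fingerprint, not for security.
    return hashlib.md5(normalised.encode("ascii")).hexdigest()
```

With this sketch, identical sequences that differ only in line wrapping or case receive the same identifier, which is the property that makes a hash usable as a lookup key.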

SaryFasta.createDatabase(db, filenames, buf_size=400000000, force=False, regex_identifier=None)

index the files listed in filenames to create a database.

buf_size: buffer size for a sary chunk.

Two new files are created - db.fasta and db_name.idx

regex_identifier: pattern to extract identifier from description line. If None, the part until the first white-space character is used.
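The identifier rule described for regex_identifier can be sketched as follows. The function name `extract_identifier` is hypothetical, and the sketch assumes that the pattern carries the identifier in its first capturing group; the page does not state which group the module actually uses.

```python
import re


def extract_identifier(description, regex_identifier=None):
    """Extract a sequence identifier from a fasta description line.

    Hypothetical helper for illustration; not part of SaryFasta.
    """
    # Drop the leading '>' of a fasta header, if present.
    line = description.lstrip(">").strip()
    if not line:
        raise ValueError("empty description line")
    if regex_identifier is None:
        # Default rule: everything up to the first white-space character.
        return line.split(None, 1)[0]
    match = re.search(regex_identifier, line)
    if match is None:
        raise ValueError("pattern does not match: %r" % line)
    # Assumption: the identifier is the first capturing group.
    return match.group(1)


print(extract_identifier(">chr1 Homo sapiens chromosome 1"))
print(extract_identifier(">gi|12345|ref|NM_000001.1| example", r"ref\|([^|]+)\|"))
```

The first call uses the default rule, the second pulls the accession out of an NCBI-style header with a custom pattern.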

SaryFasta.benchmarkRandomFragment(fasta, size)

returns a random fragment of length size from fasta.

SaryFasta.verify(reference, fasta, num_iterations, fragment_size, stdout=sys.stdout, quiet=False)

verify two databases against each other.

Gets a segment from one database and checks for its presence in the other.
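The general shape of such a check can be sketched with plain dictionaries standing in for the two databases. Everything here is an assumption made for illustration: the names `random_fragment` and `count_missing_fragments`, the `seed` parameter, and the use of Python substring search in place of a suffix-array lookup. It is not the module's implementation.

```python
import random


def random_fragment(sequences, size, rng):
    """Pick a random fragment of length size from a dict of sequences."""
    # Only sequences at least as long as the fragment can be sampled.
    candidates = [name for name, seq in sequences.items() if len(seq) >= size]
    if not candidates:
        raise ValueError("no sequence is long enough for the fragment size")
    name = rng.choice(candidates)
    start = rng.randint(0, len(sequences[name]) - size)
    return sequences[name][start:start + size]


def count_missing_fragments(source, target, num_iterations, fragment_size, seed=0):
    """Count sampled fragments of source that occur nowhere in target."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    missing = 0
    for _ in range(num_iterations):
        fragment = random_fragment(source, fragment_size, rng)
        # Substring search stands in for the suffix-array query.
        if not any(fragment in seq for seq in target.values()):
            missing += 1
    return missing
```

A result of zero means every sampled fragment was found. Because the fragments are sampled, this is evidence that the two databases agree, not proof.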
