Subroutines for working on I/O of large genomic files.
Index a FASTA file to retrieve sequences by suffix-array fragment search.
Usage: python SaryFasta.py [options] name [ files ]
Returns a hash identifier for a sequence.
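A minimal sketch of how such a hash identifier could be computed. The source does not say which hash function or normalization is used, so the function name `sequence_hash`, the choice of SHA-1, and the whitespace/case normalization are all assumptions made for illustration.

```python
import hashlib


def sequence_hash(sequence: str) -> str:
    """Return a hex digest identifying a sequence (hypothetical helper)."""
    # Assumption: line breaks and letter case do not distinguish sequences,
    # so both are normalized away before hashing.
    normalized = "".join(sequence.split()).upper()
    # Assumption: SHA-1 is used; the original module may use another hash.
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()
```

With this normalization, the same sequence wrapped at different line widths maps to the same identifier.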
Index the files listed in filenames to create the database.
buf_size: buffer size for a sary chunk.
Two new files are created: db.fasta and db_name.idx.
regex_identifier: pattern to extract identifier from description line. If None, the part until the first white-space character is used.
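The identifier rule described for regex_identifier can be sketched as follows. The function name `extract_identifier` is hypothetical, and the handling of a leading `>` and of capture groups is an assumption rather than the module's documented behavior.

```python
import re
from typing import Optional


def extract_identifier(description: str,
                       regex_identifier: Optional[str] = None) -> str:
    """Extract a sequence identifier from a FASTA description line."""
    # Assumption: the description may still carry the leading '>' marker.
    text = description.lstrip(">")
    if regex_identifier is None:
        # Default rule: everything up to the first white-space character.
        # str.split(None, 1) also skips any leading white space.
        parts = text.split(None, 1)
        return parts[0] if parts else ""
    match = re.search(regex_identifier, text)
    if match is None:
        raise ValueError("pattern did not match: %r" % description)
    # Assumption: use the first capture group if the pattern defines one,
    # otherwise the whole match.
    return match.group(1) if match.groups() else match.group(0)
```

For a line such as `>sp|P12345|ABC_HUMAN some protein`, the default rule keeps `sp|P12345|ABC_HUMAN`, while a pattern like `\|(\w+)\|` selects only the accession between the first two bars.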
Returns a random fragment of the given size.
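One way to draw such a fragment from an in-memory sequence is sketched below. The name `random_fragment`, the `rng` parameter, and the uniform choice of start position are assumptions; the original may sample directly from the indexed file instead.

```python
import random
from typing import Optional


def random_fragment(sequence: str, size: int,
                    rng: Optional[random.Random] = None) -> str:
    """Return a contiguous fragment of `size` characters (hypothetical helper)."""
    if size <= 0 or size > len(sequence):
        raise ValueError("size must be between 1 and len(sequence)")
    rng = rng or random.Random()
    # Every valid start position is equally likely.
    start = rng.randrange(len(sequence) - size + 1)
    return sequence[start:start + size]
```

Passing a seeded `random.Random` instance makes the draw reproducible, which is convenient when the fragment is used in a test.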
Verify two databases.
Get a segment from fasta and check that it is present in fasta2.
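The verification idea can be illustrated on small in-memory data. This sketch uses plain substring search in place of the suffix-array lookup, and represents each database as a dict from identifier to sequence; the name `verify`, its parameters, and its return value are assumptions for illustration only.

```python
import random
from typing import Dict, List


def verify(fasta: Dict[str, str], fasta2: Dict[str, str],
           size: int = 20, seed: int = 0) -> List[str]:
    """Return identifiers in `fasta` whose sampled segment is absent from `fasta2`."""
    rng = random.Random(seed)
    targets = list(fasta2.values())
    missing = []
    for identifier, sequence in fasta.items():
        # Clamp the segment length for sequences shorter than `size`.
        length = min(size, len(sequence))
        start = rng.randrange(len(sequence) - length + 1)
        segment = sequence[start:start + length]
        # Substring search stands in for the suffix-array query here.
        if not any(segment in target for target in targets):
            missing.append(identifier)
    return missing
```

An empty result means every sampled segment was found. Because only one segment per sequence is sampled, this is a spot check and not a proof that the two databases are identical.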