Exons.py - A library to read/write/manage exons.

Author:
Release: $Id$
Date: December 09, 2013
Tags: Python

class Exons.Exon

class for exons.

contains info about the genomic location of an exon and its location within a peptide sequence.

The field mAlignment is set optionally.

Read(line, contig_sizes={}, format='exons', extract_id=None, converter=None)

read exon from tab-separated line.

extract_id is a regular expression object to extract the identifier from the identifier column.

if converter is given, it is used to convert to zero-based, open-closed, both-strand coordinates.

Merge(other)

Merge this exon with another (adjacent and preceding) exon.

Do not merge if the distance between exons is not divisible by 3. Merging two exons invalidates the peptide coordinates of all following exons; these need to be updated.
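
The merge rule can be sketched as follows. This is a minimal illustration, not the library's implementation: the `SimpleExon` tuple and the `merge_adjacent` helper are hypothetical, and the sketch assumes half-open genomic coordinates and peptide coordinates counted in nucleotides.

```python
from collections import namedtuple

# Hypothetical stand-in for Exons.Exon; the field names are illustrative only.
SimpleExon = namedtuple(
    "SimpleExon", "genome_from genome_to peptide_from peptide_to")


def merge_adjacent(prev, curr):
    """Merge curr into the preceding exon prev.

    Returns the merged exon, or None if the gap between the two exons
    is not divisible by 3 (merging would shift the reading frame).
    """
    gap = curr.genome_from - prev.genome_to
    if gap % 3 != 0:
        return None
    # Assumption: the gap becomes part of the merged exon, so the
    # peptide span covers the whole merged genomic span.
    merged_length = curr.genome_to - prev.genome_from
    return SimpleExon(prev.genome_from, curr.genome_to,
                      prev.peptide_from, prev.peptide_from + merged_length)
```

Because the merged exon is longer than the sum of its parts whenever the gap is non-zero, the peptide coordinates of every later exon shift by the size of the gap, which is why they have to be recomputed afterwards.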

InvertGenomicCoordinates(lgenome)

invert genomic alignment on sequence.

Negative strand is calculated from the other end.
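
The inversion can be illustrated with a short sketch. The helper `invert_interval` is hypothetical and assumes zero-based, half-open intervals on a sequence of length `lgenome`; the library may use a different convention.

```python
def invert_interval(genome_from, genome_to, lgenome):
    """Return the same half-open interval counted from the other end."""
    # The old end becomes the new start, and vice versa.
    return lgenome - genome_to, lgenome - genome_from
```

Under these assumptions the operation is its own inverse: applying it twice returns the original interval.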

Exons.UpdatePeptideCoordinates(exons)

updates peptide coordinates for a list of exons.

Exons have to be sorted.
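
A minimal sketch of such an update, assuming each exon is a plain dict with hypothetical keys `genome_from` and `genome_to` (half-open) and that peptide coordinates are counted in nucleotides starting at 0. It is meant to show why the input must be sorted, not to mirror the library's code.

```python
def update_peptide_coordinates(exons):
    """Assign cumulative peptide coordinates to exons in transcript order."""
    offset = 0
    for exon in exons:
        length = exon["genome_to"] - exon["genome_from"]
        # Each exon starts where the previous one ended.
        exon["peptide_from"] = offset
        offset += length
        exon["peptide_to"] = offset
    return exons
```

Each exon's coordinates depend on the total length of all exons before it, so an unsorted list would produce wrong offsets.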

Exons.PostProcessExons(all_exons, do_invert=None, remove_utr=None, filter=None, reset=False, require_increase=False, no_invert=False, contig_sizes={}, from_zero=False, delete_missing=False, set_peptide_coordinates=False, set_rank=False)

do post-processing of exons

all_exons is a dictionary of lists of exons.

Exons are sorted by mPeptideFrom.

Operations include:

-invert: sort out forward/reverse strand coordinates

-set_peptide_coordinates: sets the peptide coordinates of exons.

-set_rank: set rank of exons

-remove_utr: remove any UTR (needs peptide coordinates)

-delete_missing: if set to true, exons on contigs not in contig_sizes will be deleted.

-from_zero: exon genomic coordinates start at 0

-reset: exon genomic coordinates start at 0

Exons.GetExonBoundariesFromTable(dbhandle, table_name_predictions='predictions', table_name_exons='exons', only_good=False, do_invert=None, remove_utr=None, filter=None, reset=False, require_increase=False, contig_sizes={}, prediction_ids=None, table_name_quality='quality', table_name_redundant='redundant', non_redundant_filter=False, schema=None, quality_filter=None, from_zero=False, delete_missing=False)

get exon boundaries from table.

Exons.CountNumExons(exons)

return hash with number of exons per entry.
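
A minimal sketch of the idea, assuming `exons` is a dictionary mapping an identifier to a list of exons (the layout used by PostProcessExons). The helper name `count_num_exons` is hypothetical.

```python
def count_num_exons(exons):
    """Return a dict mapping each entry to its number of exons."""
    return {key: len(exon_list) for key, exon_list in exons.items()}
```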

Exons.SetRankToPositionFlag(exons)

set rank for all exons.

Set rank to 1 if it is the first exon, -1 if it is the last exon (single-exon genes are -1), and 0 if it is an internal exon.
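
The flag values can be sketched as below. The helper `position_flags` is hypothetical and only computes the flags for a gene with a given number of exons; how the library stores them on the exon objects is not shown here.

```python
def position_flags(num_exons):
    """Return one position flag per exon: 1 first, -1 last, 0 internal."""
    flags = [0] * num_exons
    if num_exons > 0:
        flags[0] = 1
        # Assigned last, so a single-exon gene ends up with -1.
        flags[-1] = -1
    return flags
```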

Exons.ReadExonBoundaries(file, do_invert=None, remove_utr=None, filter=None, reset=False, require_increase=False, no_invert=False, contig_sizes={}, converter=None, from_zero=False, delete_missing=False, format='exons', gtf_extract_id=None)

read exon boundaries from tab-separated file.

if remove_utr is set, the UTR of the first/last exon is removed.

if reset is set, then the genomic part is moved so that it starts at 1.

if require_increase is set, then exons are sorted in increasing order.

If do_invert is set: negative strand coordinates are converted to positive strand coordinates

if no_invert is set: coordinates are kept as they are.

if from_zero is set: coordinates are mapped from 0. Thus reverse strand coordinates will be negative.

if delete_missing is True and sbjct-token is not in contig_sizes but the exon needs to be inverted: delete transcript.

The exon file format is tab-separated and can be in one of two formats:

format='exons': id, contig, strand, frame, rank, peptide_from, peptide_to, genome_from, genome_to

format='gtf': contig, ignored, ignored, genome_from, genome_to, ignored, strand, frame, id

if converter is given, use it to convert to forward/reverse strand coordinates.

gtf_extract_id: regular expression object to extract id from id column.
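
To make the 'exons' column layout concrete, here is a minimal parsing sketch. The function `parse_exons_line`, the dict keys and the sample line are hypothetical, and the sketch assumes that the first capture group of the regular expression holds the identifier; it does not reproduce the coordinate conversion or filtering done by the library.

```python
import re


def parse_exons_line(line, extract_id=None):
    """Parse one tab-separated line in the 'exons' column layout."""
    fields = line.rstrip("\n").split("\t")
    (exon_id, contig, strand, frame, rank,
     peptide_from, peptide_to, genome_from, genome_to) = fields[:9]
    if extract_id is not None:
        # Assumption: the first capture group holds the identifier.
        exon_id = extract_id.search(exon_id).group(1)
    return {
        "id": exon_id,
        "contig": contig,
        "strand": strand,
        "frame": int(frame),
        "rank": int(rank),
        "peptide_from": int(peptide_from),
        "peptide_to": int(peptide_to),
        "genome_from": int(genome_from),
        "genome_to": int(genome_to),
    }


# Made-up example line, for illustration only.
line = "Gene:g1\tchr1\t+\t0\t1\t0\t60\t100\t160\n"
record = parse_exons_line(line, extract_id=re.compile(r"Gene:(\S+)"))
```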

Exons.Alignment2Exons(alignment, query_from=0, sbjct_from=0, add_stop_codon=1)

convert a Peptide2DNA alignment to exon boundaries.

Exons.Exons2Alignment(exons)

build alignment string from a (sorted) list of exons.

Exons.RemoveRedundantEntries(l)

remove redundant entries (and 0s) from list.

One liner?
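
As the note suggests, the operation fits in a single expression. A minimal sketch, assuming that the result may be returned in sorted order (whether the library preserves the input order is not stated here); the function name is hypothetical.

```python
def remove_redundant_entries(values):
    """Return the distinct non-zero values, sorted."""
    return sorted(set(v for v in values if v != 0))
```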

Exons.CompareGeneStructures(xcmp_exons, ref_exons, map_ref2cmp=None, cmp_sequence=None, ref_sequence=None, threshold_min_pide=0, threshold_slipping_exon_boundary=9, map_cmp2ref=None, threshold_terminal_exon=15)

Compare two gene structures.

This function is useful for comparing the exon boundaries of a predicted peptide with the exon boundaries of the query peptide.

xcmp_exons are the exons of the gene to test; ref_exons are the exons of the reference.

Exon boundaries are already mapped to the peptide for the reference.

map_ref2cmp: alignment of protein sequences for cmp and ref.

map_cmp2ref: alignment of cmp to ref. If given, mapping is done from cmp to ref.

Invalid exon boundaries can be set to -1.

threshold_terminal_exon: disregard terminal exons when counting missed boundaries if they are at most this many nucleotides long.

Exons.MapExons(exons, map_a2b)

map peptide coordinates of exons with map.

returns a list of mapped exons.

Exons.CountMissedBoundaries(cmp_boundaries, reference_boundaries, max_slippage=9, min_from=0, max_to=0)

count missed boundaries comparing cmp to ref.
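
One way to read this is sketched below: a reference boundary counts as missed when no boundary in cmp lies within `max_slippage` positions of it. This is an assumption about the matching rule, the function name is hypothetical, and the `min_from` and `max_to` parameters are deliberately left out.

```python
def count_missed_boundaries(cmp_boundaries, reference_boundaries,
                            max_slippage=9):
    """Count reference boundaries with no cmp boundary close enough."""
    missed = 0
    for ref in reference_boundaries:
        if not any(abs(ref - c) <= max_slippage for c in cmp_boundaries):
            missed += 1
    return missed
```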

Exons.GetExonsRange(exons, first, last, full=True, min_overlap=0, min_exon_size=0)

get exons in range (first:last) (peptide coordinates).

Set full to False, if you don’t require full coverage.

Exons.ClusterByExonIdentity(exons, max_terminal_num_exons=3, min_terminal_exon_coverage=0.0, max_slippage=0, loglevel=0)

build clusters of transcripts with identical exons.

The boundaries in the first/last exon can vary.

Returns two maps: map_cluster2transcripts and map_transcript2cluster.

Exons.ClusterByExonOverlap(exons, min_overlap=0, min_min_coverage=0, min_max_coverage=0, loglevel=0)

build clusters of transcripts with overlapping exons.

Exons need not be identical.

Returns two maps: map_cluster2transcripts and map_transcript2cluster.
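
The shape of the two returned maps can be illustrated with a simplified single-linkage sketch. It is not the library's algorithm: the function name is hypothetical, exons are assumed to be half-open `(from, to)` tuples, the coverage parameters are ignored, and `min_overlap` defaults to 1 here so that merely adjacent exons do not join clusters.

```python
def cluster_by_exon_overlap(exons, min_overlap=1):
    """Cluster transcripts that share at least one overlapping exon.

    exons: dict mapping transcript id -> list of (from, to) intervals.
    """
    parent = {t: t for t in exons}

    def find(t):
        # Follow parent links to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    ids = sorted(exons)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if any(min(ta, tb) - max(fa, fb) >= min_overlap
                   for fa, ta in exons[a] for fb, tb in exons[b]):
                parent[find(b)] = find(a)

    map_cluster2transcripts = {}
    map_transcript2cluster = {}
    for t in ids:
        root = find(t)
        map_cluster2transcripts.setdefault(root, []).append(t)
        map_transcript2cluster[t] = root
    return map_cluster2transcripts, map_transcript2cluster
```

In this sketch a cluster is named after one of its member transcripts; the library may label clusters differently.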

Exons.CheckOverlap(exons1, exons2, min_overlap=1)

check if exons overlap.

(does not check chromosome and strand.)
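
A minimal sketch of such a check, assuming exons are given as half-open `(from, to)` tuples in genomic coordinates and that a single overlapping pair is enough. The function name and the tuple representation are hypothetical.

```python
def check_overlap(exons1, exons2, min_overlap=1):
    """Return True if any exon pair overlaps by at least min_overlap bases."""
    for from1, to1 in exons1:
        for from2, to2 in exons2:
            if min(to1, to2) - max(from1, from2) >= min_overlap:
                return True
    return False
```

Like the function described above, the sketch looks only at coordinates, so callers have to make sure both exon sets lie on the same contig and strand.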

Exons.CheckCoverage(exons1, exons2, max_terminal_num_exons=3, min_terminal_exon_coverage=0.0, max_slippage=0)

check if one set of exons covers the other.

Note: does not check chromosome and strand, just genomic coordinates.

Exons.CheckContainedAinB(exons1, exons2, min_terminal_exon_coverage=0.0, loglevel=0)

check if all exons in exons1 are contained in exons2.

Note: does not check contig and strand.

Exons.CheckCoverageAinB(exons1, exons2, min_terminal_num_exons=3, min_terminal_exon_coverage=0.0, max_slippage=0, loglevel=0)

check if exons1 are all in exons2

Note: does not check contig and strand.

Exons.GetPeptideLengths(exons)

for all exons get maximum length in coding nucleotides.

Exons.GetGenomeLengths(exons)

for all exons get maximum nucleotide.

Exons.CalculateStats(exons)

calculate some statistics for all exons.

minimum/maximum intron/exon length, number of exons, gene length
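
These statistics can be sketched for a single gene as follows. The function name, the returned keys and the `(from, to)` tuple representation (half-open, non-empty list) are assumptions made for illustration; the library's return value may be structured differently.

```python
def calculate_stats(exons):
    """Compute exon/intron length statistics for one non-empty exon list."""
    exons = sorted(exons)
    exon_lengths = [end - start for start, end in exons]
    # An intron is the gap between one exon's end and the next exon's start.
    intron_lengths = [nxt[0] - cur[1] for cur, nxt in zip(exons, exons[1:])]
    return {
        "num_exons": len(exons),
        "min_exon": min(exon_lengths),
        "max_exon": max(exon_lengths),
        # Single-exon genes have no introns; report 0 in that case.
        "min_intron": min(intron_lengths, default=0),
        "max_intron": max(intron_lengths, default=0),
        "gene_length": exons[-1][1] - exons[0][0],
    }
```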

Exons.MatchExons(map_a2b, in_exons1, in_exons2, threshold_slipping_boundary=9)

returns a list of overlapping exons (mapped via map_a2b).
