Tool reference

This page summarizes prominent tools within the CGAT Code collection. The tools are grouped loosely by functionality.

Genomic intervals/features

Compute overlap statistics of multiple bed files.
Transform interval data in a bed formatted file into a fasta formatted file of sequence data.
Convert between interval data. Convert a bed formatted file to a gff or gtf formatted file.
Work on gff formatted files with genomic features. This tool sorts/renames feature files, reconciles chromosome names, and more.
Filter or merge interval data in a bed formatted file.
Compare two sets of genomic intervals and output a list of overlapping features.
Compute summary statistics of genomic intervals.
Annotate genomic intervals (composition, peak location, overlap, ...).
Decompose multiple sets of genomic intervals into various intersections and unions.
Compare multiple interval data sets. The tool computes all-vs-all pairwise overlap summaries and permits incremental updates of the similarity table.
Convert between formats.
Split a file in gff format into smaller files. The script ensures that overlapping intervals remain in the same file.
This script computes the genomic coverage of intervals in a gff formatted file. The coverage is computed per feature.
Output genomic sequences from intervals.
Compute distributions of interval sizes, intersegmental distances and interval overlap from a list of intervals.
Summarize features within a gff formatted file.
Convert between formats.
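
Several of the descriptions above rest on the same basic operation: measuring how much two sets of intervals overlap. The following is a minimal sketch of that idea in plain Python, not the implementation used by any CGAT tool. The function names `merge_intervals` and `overlap_bases` are hypothetical, and intervals are assumed to be half-open `(start, end)` pairs on a single chromosome, as in bed format.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bases(a, b):
    """Count the bases covered by both interval sets.

    Assumption: both sets lie on the same chromosome.
    """
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            total += end - start
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total
```

Merging each set first means that bases covered several times within one set are counted only once, which is usually what a per-base overlap statistic is meant to report.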

Gene sets

Translate a gene set into genomic annotations such as introns, intergenic regions, regulatory domains, etc.
Annotate transcripts in a gtf formatted file. Annotations can be in reference to a second gene set (fragments, extensions), aligned reads (coverage, intron overrun, ...) or densities.
Annotate each base in the genome according to its use within a transcript. Outputs lists of junctions.
Derive genomic intervals (intergenic regions, introns) from a gene set.
Merge exons/transcripts/genes, filter transcripts/genes, rename transcripts/genes, ...
Convert a gene set in gtf format to tabular format.
Compare two gene sets - output common and unique lists of genes.
Compare multiple gene sets. The tool computes all-vs-all pairwise overlap of exons, bases and genes. Permits incremental updates of the similarity table.
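
As an illustration of deriving genomic intervals from a gene set, the sketch below computes the introns of a single transcript from its exon coordinates. It is a simplified example, not the CGAT code: the function name `introns_from_exons` is hypothetical, and exons are assumed to be half-open `(start, end)` pairs that belong to one transcript and do not overlap.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons of one transcript.

    Assumption: exons are half-open (start, end) pairs on one
    chromosome and do not overlap each other.
    """
    exons = sorted(exons)
    introns = []
    # Pair each exon with the one that follows it.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start > prev_end:
            introns.append((prev_end, next_start))
    return introns
```

For example, `introns_from_exons([(100, 200), (300, 400), (500, 600)])` yields `[(200, 300), (400, 500)]`.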

Sequence data

Interleave paired reads from two fastq files into a single fasta file.
Build an index for a fasta file. Pre-requisite for many CGAT tools.
Count kmer content in a set of fasta sequences.
Compute features of sequences in fasta formatted files.
Compare two sets of sequences. Outputs missing, identical and fragmented sequences.
Segment sequences based on G+C content, gaps, ...
Concatenate sequences from multiple files.
In-silico creation of variants of protein coding sequences.
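
To make the k-mer counting task concrete, here is a minimal sketch using only the Python standard library. It is not the CGAT implementation. The function name `count_kmers` is hypothetical, and sequences are assumed to be plain strings that have already been read from a fasta file.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()  # treat soft-masked (lowercase) bases like uppercase
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

Sequences shorter than `k` contribute nothing, because the inner range is empty for them.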

NGS data

Compute meta-gene profiles from aligned reads in a bam formatted file. Also accepts bed or bigwig formatted files.
Operate on bam formatted files - filtering, stripping, setting flags.
Convert bam formatted file of genomic alignments into genomic intervals. Permits merging of paired read data and filtering by insert-size.
Save sequence and quality information from a bam formatted file.
Compute read densities over a collection of intervals. Also accepts bed or bigwig formatted files.
Compute summary statistics of a bam formatted file.
Convert read coverage in a bam formatted file into a wiggle or bigwig formatted file.
Compute stats on exon over-/underrun and spliced reads.
Compute coverage of reads within multiple interval types.
Outputs side-by-side comparison of residue level counts between multiple bam formatted files.
Perform quality score conversion between fastq formatted files.
Interleave paired end data.
Output bases below quality threshold, number of N’s, quality score distribution.
Ensure that paired read fastq formatted files are consistent after filtering on the individual files.
Perform read-by-read comparison of two bam-files.
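
Quality score conversion between fastq variants amounts to shifting each quality character by the difference between two ASCII offsets. The sketch below shows the arithmetic for converting Phred+64 to Phred+33 encoding. It is an illustration under stated assumptions rather than the CGAT code: the function name `convert_quality` and its parameters are hypothetical, and it covers only a pure offset change, not older Solexa-scaled scores.

```python
def convert_quality(qualities, from_offset=64, to_offset=33):
    """Re-encode a fastq quality string from one ASCII offset to another."""
    converted = []
    for char in qualities:
        score = ord(char) - from_offset
        if score < 0:
            # A negative score means the input was not in the assumed encoding.
            raise ValueError(f"character {char!r} is below offset {from_offset}")
        converted.append(chr(score + to_offset))
    return "".join(converted)
```

With the default offsets, `convert_quality("hhh")` gives `"III"`, since `h` encodes a score of 40 in Phred+64 and `I` encodes the same score in Phred+33.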


Sort a vcf file.


Report how many residues map to the same locations, to different locations, etc.
Output coverage statistics for a UCSC liftover chain file.

