==============
Tool reference
==============

This page summarizes prominent tools within the CGAT Code
collection. The tools are grouped loosely by functionality.

Genomic intervals/features
==========================

:doc:`scripts/beds2counts`
    Compute overlap statistics of multiple :term:`bed` files.

:doc:`scripts/bed2fasta`
    Transform interval data in a :term:`bed` formatted file into a
    :term:`fasta` formatted file of sequence data.

:doc:`scripts/bed2gff`
    Convert interval data in a :term:`bed` formatted file to a
    :term:`gff` or :term:`gtf` formatted file.

:doc:`scripts/gff2gff`
    Work on :term:`gff` formatted files with genomic features. This
    tool sorts/renames feature files, reconciles chromosome names,
    and more.

:doc:`scripts/bed2bed`
    Filter or merge interval data in a :term:`bed` formatted file.

:doc:`scripts/bed2graph`
    Compare two sets of genomic intervals and output a list of
    overlapping features.

:doc:`scripts/bed2stats`
    Compute summary statistics of genomic intervals.

:doc:`scripts/bed2table`
    Annotate genomic intervals (composition, peak location, overlap, ...).

:doc:`scripts/beds2beds`
    Decompose multiple sets of genomic intervals into various
    intersections and unions.

:doc:`scripts/diff_bed`
    Compare multiple sets of interval data. The tool computes
    all-vs-all pairwise overlap summaries and permits incremental
    updates of the similarity table.

:doc:`scripts/gff2bed`
    Convert a :term:`gff` or :term:`gtf` formatted file to a
    :term:`bed` formatted file.

:doc:`scripts/split_gff`
    Split a file in gff format into smaller files. The script ensures
    that overlapping intervals remain in the same file.

:doc:`scripts/gff2coverage`
    Compute the genomic coverage of intervals in a :term:`gff`
    formatted file. The coverage is computed per feature.

:doc:`scripts/gff2fasta`
    Output genomic sequences from intervals.

:doc:`scripts/gff2histogram`
    Compute distributions of interval sizes, intersegmental distances
    and interval overlap from a list of intervals.

:doc:`scripts/gff2stats`
    Summarize features within a :term:`gff` formatted file.

:doc:`scripts/gff2psl`
    Convert a :term:`gff` formatted file to a psl formatted file.

Gene sets
=========

:doc:`scripts/gtf2gff`
    Translate a gene set into genomic annotations such as introns,
    intergenic regions, regulatory domains, etc.

:doc:`scripts/gtf2table`
    Annotate transcripts in a :term:`gtf` formatted file. Annotations
    can be in reference to a second gene set (fragments, extensions), 
    aligned reads (coverage, intron overrun, ...) or densities.

:doc:`scripts/gtf2fasta`
    Annotate each base in the genome according to its use within
    a transcript. Outputs lists of junctions.

:doc:`scripts/gtf2gff`
    Derive genomic intervals (intergenic regions, introns) from
    a gene set.

:doc:`scripts/gtf2gtf`
    Merge exons/transcripts/genes, filter transcripts/genes, rename
    transcripts/genes, ...

:doc:`scripts/gtf2tsv`
    Convert a gene set in :term:`gtf` format to tabular format.

:doc:`scripts/gtfs2tsv`
    Compare two gene sets - output common and unique lists of genes.

:doc:`scripts/diff_gtf`
    Compare multiple gene sets. The tool computes all-vs-all pairwise
    overlap of exons, bases and genes and permits incremental updates
    of the similarity table.

Sequence data
=============

:doc:`scripts/fastqs2fasta`
    Interleave paired reads from two fastq files into a single fasta file.

:doc:`scripts/index_fasta`
    Build an index for a fasta file. Pre-requisite for many CGAT tools.

:doc:`scripts/fasta2kmercontent`
    Count kmer content in a set of :term:`fasta` sequences.

:doc:`scripts/fasta2table`
    Compute features of sequences in :term:`fasta` formatted files.

:doc:`scripts/diff_fasta`
    Compare two sets of sequences. Outputs missing, identical
    and fragmented sequences.

:doc:`scripts/fasta2bed`
    Segment sequences based on G+C content, gaps, ...

:doc:`scripts/fastas2fasta`
    Concatenate sequences from multiple files.

:doc:`scripts/fasta2variants`
    In-silico creation of variants of protein coding
    sequences.

NGS data
========

:doc:`scripts/bam2geneprofile`
    Compute meta-gene profiles from aligned reads in a :term:`bam`
    formatted file. Also accepts :term:`bed` or :term:`bigwig`
    formatted files.

:doc:`scripts/bam2bam`
    Operate on :term:`bam` formatted files - filtering, stripping, 
    setting flags.

:doc:`scripts/bam2bed`
    Convert :term:`bam` formatted file of genomic alignments
    into genomic intervals. Permits merging of paired read data
    and filtering by insert-size.

:doc:`scripts/bam2fastq`
    Save sequence and quality information from a :term:`bam` 
    formatted file.

:doc:`scripts/bam2peakshape`
    Compute read densities over a collection of intervals. Also 
    accepts :term:`bed` or :term:`bigwig` formatted files.

:doc:`scripts/bam2stats`
    Compute summary statistics of a :term:`bam` formatted file.

:doc:`scripts/bam2wiggle`
    Convert read coverage in a :term:`bam` formatted file into
    a :term:`wiggle` or :term:`bigwig` formatted file.

:doc:`scripts/bam_vs_gtf`
    Compute stats on exon over-/underrun and spliced reads.

:doc:`scripts/bam_vs_bed`
    Compute coverage of reads within multiple interval types.

:doc:`scripts/bam_vs_bam`
    Outputs side-by-side comparison of residue level counts
    between multiple :term:`bam` formatted files.
	 
:doc:`scripts/fastq2fastq`
    Perform quality score conversion between :term:`fastq` 
    formatted files.

:doc:`scripts/fastqs2fasta`
    Interleave paired end data.

:doc:`scripts/fastq2table`
    Output bases below quality threshold, number of N's, quality score
    distribution.

:doc:`scripts/fastqs2fastqs`
    Ensure that paired read :term:`fastq` formatted files are consistent
    after filtering on the individual files.

:doc:`scripts/diff_bam`
    Perform read-by-read comparison of two :term:`bam` formatted files.

Variants
========

:doc:`scripts/vcf2vcf`
    Sort a vcf file.

Genomics
========

:doc:`scripts/diff_chains`
    Compare chain files: report how many residues map to the same
    locations, to different locations, etc.

:doc:`scripts/chain2stats`
    Output coverage statistics for a UCSC liftover chain file.