CGAT - Computational Genomics Analysis Tools

CGAT is a collection of tools for the computational genomicist, written in Python. The tools have been developed and accumulated in various genome projects (Heger & Ponting, 2007; Warren et al., 2008) and NGS projects (Ramagopalan et al., 2010). The tools are continuously being developed as part of the CGAT Training programme.

The tools work from the command line, but can readily be installed within frameworks such as Galaxy.

Please note that the tools are part of a larger code base also including genomics and NGS pipelines. More information about those is here.

Detailed instructions on installation, on usage and a tool reference are below, followed by a Quickstart guide.

Quickstart

To install the CGAT tools, type:

pip install cgat

This will install the CGAT scripts and libraries together with the required dependencies. See Installation instructions for dependencies and troubleshooting.

CGAT tools are run from the Unix command line. Let's assume we have the binding locations from a ChIP-Seq experiment (chipseq.hg19.bed) in BED format and we want to know how many binding locations are intronic, intergenic or within exons.

To answer this, we need a set of genomic annotations denoting intronic regions, intergenic regions, etc. with respect to a reference gene set. Here, we download the GENCODE gene set (Harrow et al., 2012) in GTF format from ENSEMBL (Flicek et al., 2013).

The following Unix statement downloads the ENSEMBL gene set containing overlapping transcripts and outputs a set of non-overlapping genomic annotations in GFF format (annotations.gff.gz) by piping the data through various CGAT tools:

wget -qO- ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz \
| gunzip \
| awk '$2 == "protein_coding"' \
| cgat gff2gff --genome-file=hg19 --sanitize=ucsc --skip-missing \
| cgat gtf2gtf --sort=gene \
| cgat gtf2gtf --merge-exons --with-utr \
| cgat gtf2gtf --filter=longest-gene \
| cgat gtf2gtf --sort=position \
| cgat gtf2gff --genome-file=hg19 --flank=5000 --method=genome \
| gzip \
> annotations.gff.gz

Note

The statements above need an indexed genome. To create such an indexed genome for hg19, type the following:

wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz | index_fasta.py hg19 - > hg19.log

CGAT tools can be chained into a single workflow using Unix pipes. The above sequence of commands in turn (1) reconciles UCSC and ENSEMBL naming schemes for chromosome names, (2) merges all exons of alternative transcripts per gene, (3) keeps the longest gene in case of overlapping genes and (4) annotates exonic, intronic, intergenic and flanking regions (size=5kb) within and between genes.
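The first stages of the pipeline can be tried out on a few lines of input without any CGAT tools installed. The sketch below uses a made-up three-record GTF fragment. It applies the same awk filter as the pipeline and then imitates the kind of renaming that step (1) refers to by converting ENSEMBL-style chromosome names to UCSC style. The renaming rule shown here (prefix "chr", MT becomes chrM) is an assumption for illustration; the rules applied by gff2gff --sanitize=ucsc may differ.

```shell
# Toy GTF fragment: made-up records, tab-separated, 9 columns per line.
# In this ENSEMBL release the second column carries the biotype.
toy_gtf=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  1  protein_coding exon 100 200 . + . 'gene_id "G1"; transcript_id "T1";' \
  1  lincRNA        exon 500 600 . + . 'gene_id "G2"; transcript_id "T2";' \
  MT protein_coding exon 50  80  . - . 'gene_id "G3"; transcript_id "T3";')

# Same filter as in the pipeline: keep protein-coding records only.
kept=$(printf '%s\n' "$toy_gtf" | awk '$2 == "protein_coding"')
n_kept=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')

# Illustrative renaming only (assumed rule, not the gff2gff implementation).
ucsc_names=$(printf '%s\n' "$kept" \
  | awk 'BEGIN { FS = OFS = "\t" } { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }' \
  | cut -f1 | paste -sd, -)

echo "records kept: $n_kept"
echo "chromosome names after renaming: $ucsc_names"
```

Running each stage on a handful of records like this is a convenient way to check what a pipe does before applying it to the full gene set.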

Note that the creation of annotations.gff.gz goes beyond simple interval intersection, as gene structures have to be normalized from multiple possible alternative transcripts to a single transcript that is chosen by the user depending on what is most relevant for the analysis.
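To make the normalization step more concrete, the following sketch merges the overlapping exons of several hypothetical transcripts of one gene into non-overlapping blocks. It only illustrates the idea behind merging exons per gene; it is not how gtf2gtf --merge-exons is implemented, and the coordinates are invented.

```shell
# Hypothetical exons (start<TAB>end) from two transcripts of the same gene.
exons=$(printf '%s\t%s\n' 100 200 150 300 400 500 450 480)

# Sort by start, then sweep once and fuse every exon that overlaps
# the block currently being built.
merged=$(printf '%s\n' "$exons" | sort -k1,1n -k2,2n | awk -F'\t' '
  NR == 1     { s = $1 + 0; e = $2 + 0; next }
  $1 + 0 <= e { if ($2 + 0 > e) e = $2 + 0; next }   # overlap: extend block
              { print s "\t" e; s = $1 + 0; e = $2 + 0 }  # gap: emit, restart
  END         { if (NR > 0) print s "\t" e }')

printf '%s\n' "$merged"
```

With the input above, the four exons collapse into two blocks, 100-300 and 400-500. Selecting a single transcript instead would keep one transcript's exons unchanged rather than taking the union.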

Choosing different options can provide different sets of answers. Instead of merging all exons per gene, the longest transcript might be selected by replacing (2) with gtf2gtf --filter=longest-transcript. Or, instead of genomic annotations, regulatory domains such as defined by GREAT might be obtained by removing (3) and replacing (4) with gtf2gff --method=great-domains.

The generated annotations in annotations.gff.gz can then be used to count the number of transcription factor binding sites using bedtools or other interval intersection tools. Here, we will use another CGAT tool, gtf2table, to do the counting and classification:

zcat /ifs/devel/gat/tutorial/data/srf.hg19.bed \
| cgat bed2gff --as-gtf \
| cgat gtf2table --counter=classifier-chipseq --filename-gff=annotations.gff.gz
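As a rough picture of what counting binding sites per annotation class means, the sketch below assigns each interval to the annotation block that contains its midpoint and tallies the classes. This is a deliberately simplified stand-in and not the algorithm used by the classifier-chipseq counter. The file names, coordinates and the midpoint rule are all assumptions made for the example.

```shell
tmpdir=$(mktemp -d)

# Hypothetical annotation blocks: chrom, start, end, class (half-open).
printf '%s\t%s\t%s\t%s\n' \
  chr1 0    1000 intergenic \
  chr1 1000 1500 exon \
  chr1 1500 3000 intron > "$tmpdir/annot.tsv"

# Hypothetical binding sites in BED layout: chrom, start, end.
printf '%s\t%s\t%s\n' \
  chr1 1100 1200 \
  chr1 2000 2100 \
  chr1 2500 2600 \
  chr1 400  500 > "$tmpdir/peaks.bed"

# First file: load the annotation blocks. Second file: classify each
# site by the block containing its midpoint, then print class counts.
counts=$(awk -F'\t' '
  NR == FNR { a_chr[NR] = $1; a_beg[NR] = $2 + 0; a_end[NR] = $3 + 0
              a_cls[NR] = $4; n = NR; next }
  {
    mid = int(($2 + $3) / 2)
    for (i = 1; i <= n; i++)
      if ($1 == a_chr[i] && mid >= a_beg[i] && mid < a_end[i]) {
        count[a_cls[i]]++
        break
      }
  }
  END { for (c in count) print c "\t" count[c] }
' "$tmpdir/annot.tsv" "$tmpdir/peaks.bed" | sort)

rm -r "$tmpdir"
printf '%s\n' "$counts"
```

The linear scan over all blocks is fine for a toy example but would be slow on real data, where an interval-intersection tool is the better choice.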

The scripts follow a consistent naming scheme centered around common genomic formats. Because they read and write these common formats, the tools can be easily combined with other tools such as bedtools (Quinlan and Hall, 2010) or UCSC Tools (Kuhn et al., 2013).
