Installing CGAT pipelines

The CGAT pipelines, scripts and libraries make several assumptions about the computing environment. This section describes how to install the code and set up your computing environment.

Downloading and installing the source code

To obtain the latest code, check it out from the public Mercurial repository at:

hg clone cgat

Once checked out, you can get the latest changes by pulling and updating:

hg pull
hg update

Some scripts contain Cython code that needs to be recompiled if the script or the pysam installation has changed. To rebuild all scripts, for example after updating the repository, type:

python cgat/scripts/

Recompilation requires a C compiler to be installed.

Setting up the computing environment

The pipelines assume that Sun Grid Engine has been installed. Other queueing systems might work, but expect to be disappointed. A pipeline is started on a submit host and assumes a default queue named all.q. Other queues can be specified on the command line, for example:

python cgat/CGATPipelines/pipeline_<name>.py --cluster-queue=medium_jobs.q

A pipeline may start up to as many processes as are specified with the -p/--multiprocess option. Tasks are preferentially sent to the cluster, but for some tasks this is not possible. Such tasks run on the submit host, so make sure it is fairly powerful.

Pipelines expect that the working directory is accessible under the same path from both the submit host and the execution hosts.

Software requirements

On top of pipeline-specific bioinformatics software, CGAT pipelines make use of a variety of other software. Unfortunately, we can't support many versions. The following table lists the software we currently have installed:

Section Software Version
apps java jre1.6.0_26
apps gccxml 0.9
apps R 2.14.1
bio alignlib 0.4.4
apps python 2.7.1
apps perl 5.12.3
apps graphlib 0.1
bio abiwtap 1.2.1
bio bamstats 1.22
bio batman 0.2.3
bio bedtools 2.13.3
bio belvu 2.16
bio bfast 0.6.5a
bio bioprospector 2004
bio bowtie 0.12.7
bio bwa 0.5.9
bio cdhit 4.3
bio clustalw 2.1
bio cufflinks 1.3.0
bio cpc 0.9-r2
bio dialign 2.2.1
bio ensembl 62
bio ensembl-variation 62
bio exonerate 2.2.0
bio fastqc 0.9.2
bio fastx 0.0.13
bio gatk 1.0.5506
bio gblocks 0.91b
bio gcprofile 1.0
bio gmap 2011.03.28
bio galaxy dist
bio IGV 2.0.23
bio IGVTools 1.5.12
bio kent 1.0
bio hmmer 3.0
bio leotools 0.1
bio meme 4.7.0
bio muscle 3.8.31
bio mappability_map 1.0
bio ncbiblast 2.2.25+
bio newickutils 1.3.0
bio novoalign 2.07.11
bio novoalignCS 1.01.11
bio paml 4.4c
bio picard-tools 1.48
bio phylip 3.69
bio polyphen 2.0.23
bio samtools 0.1.18
bio shrimp 2.1.1
bio sicer 1.1
bio sift 4.0.3
bio simseq 72ce499
bio soap 2.21
bio soapsplice 1.0
bio sratoolkit 2.1.7
bio SpliceMap
bio stampy 1.0.17
bio statgen 0.1.4
bio storm 0.1
bio tabix 0.2.5
bio tophat 1.4.1
bio treebest 0.1
bio tv 0.5
bio vcftools 0.1.8a
bio emboss 6.3.1
bio velvet 1.1.04
bio perm 0.3.5
bio lastz 1.02.00
bio hpeak 2.1
bio boost 1.46.1
bio Trinity 2012-01-25
bio bowtie2 2.0.0-beta5
bio tophat2 2.0.0
bio all 1.0

What exactly is required will depend on the particular pipeline. The pipelines assume that the executables are in the user's PATH and that the rest of the environment has been set up for each tool.
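
Before starting a pipeline, it can help to confirm that the tools it needs are actually visible on the PATH. The following is a minimal sketch, not part of CGAT: the helper names find_executable and missing_tools are hypothetical, and the tool list at the bottom is only an illustrative selection that assumes the executable names match the software names in the table above.

```python
import os


def find_executable(name, path=None):
    """Return the full path of `name` if it is an executable file on PATH, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names, path=None):
    """Return the tools from `names` that cannot be found, in input order."""
    return [name for name in names if find_executable(name, path) is None]


if __name__ == "__main__":
    # Hypothetical selection; edit it to match the pipeline you intend to run.
    required = ["samtools", "bowtie", "tabix"]
    for tool in missing_tools(required):
        print("not found on PATH: %s" % tool)
```

The check only establishes that an executable exists. It says nothing about the installed version or about any further environment set-up a tool may need.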

In addition, some required software is usually shipped with the operating system as a source package. These are:


Python libraries

CGAT uses Python extensively and is currently developed against Python 2.7.1. Python 2.6 should work as well, but some libraries that are present in 2.7.1 but missing in 2.6 might need to be installed. Scripts have not yet been ported to Python 3.

CGAT requires the following in-house Python libraries to be installed:
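
The version statements above can be turned into a quick check of the interpreter in use. This is a sketch only: the function name python_support_status and the wording of the returned messages are assumptions made for illustration, and the classification simply mirrors the paragraph above.

```python
import sys


def python_support_status(version_info=None):
    """Classify a Python version according to the statements in this section."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if major >= 3:
        return "unsupported: scripts have not been ported to Python 3"
    if (major, minor) == (2, 7):
        return "supported: development version"
    if (major, minor) == (2, 6):
        return "should work: some libraries may need to be installed separately"
    return "untested"


if __name__ == "__main__":
    print("Python %d.%d: %s" % (sys.version_info[0], sys.version_info[1],
                                python_support_status()))
```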

Library Version Purpose Download
pysam 0.6.0 Python bindings for samtools hg clone pysam
alignlib 0.4.5 C++ sequence alignment library with Python bindings wget
sphinxreport latest report generator svn checkout sphinx-report

In addition, CGAT scripts make extensive use of the following Python libraries (the list below might not be complete):

Library Version Purpose

The full list of modules installed at CGAT is:

Module Version Method
pycairo 1.8.6 S
pygobject 2.20.0 S
pygtk 2.16.0 S
wxPython S
matplotlib 1 S
numpy 1.5.1 E
scipy 0.8.0 S
rpy 1.0.3 S
rpy2 2.2.0 S
networkx 1.3 E
pytables 2.2  
pygccxml 1 S
pyplusplus 1 S
pygresql 4 E
mysql-python 1.2.3 E
biopython 1.56 E
ply 3.3 E
pyrex 0.9.9 E
cython 0.13 E
sphinx 1.0.5 E
reportlab 2.5 E
guppy 0.1.9 E
pil 1.1.7 E
threadpool 1.2.7 E
progressbar 2.3 E
virtualenv 1.5.1 E
sqlalchemy 0.6.5 E
ruffus 2.2 E
drmaa 0.4b3 E
bx.python 12/01/10 S
corebio 0.5.0 E
weblogolib 3 E
mercurial 1.7.3 E
scikits.learn 0.7.1 E 0.34 E
pandas 0.5.0 E
pybedtools 0.6 E

Method : Installation method (E = easy_install/setuptools, S =, C = CGAT)
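
To compare a local installation against the module table, the installed versions can be collected from Python itself. The sketch below is not part of CGAT, and the helper names installed_version and report are hypothetical. It assumes that a module exposes its version as a __version__ attribute, which is a common convention but not universal. Note also that the import name can differ from the name in the table (for example, biopython is imported as Bio, pytables as tables and mysql-python as MySQLdb).

```python
import importlib


def installed_version(module_name):
    """Return the module's version string.

    Returns "unknown" if the module imports but has no __version__ attribute,
    and None if the module cannot be imported at all.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    return str(version) if version is not None else "unknown"


def report(module_names):
    """Map each import name to its installed version (see installed_version)."""
    return dict((name, installed_version(name)) for name in module_names)


if __name__ == "__main__":
    # Import names, not package names; an illustrative subset of the table.
    for name, version in sorted(report(["numpy", "scipy", "pysam", "Bio"]).items()):
        print("%s\t%s" % (name, version if version is not None else "NOT INSTALLED"))
```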

