The CGAT pipelines, scripts and libraries make several assumptions about the computing environment. This section describes how to install the code and set up your computing environment.
To obtain the latest code, check it out from the public mercurial repository at:
hg clone http://www.cgat.org/hg/cgat/ cgat
Once checked-out, you can get the latest changes via pulling and updating:
hg pull
hg update
Some scripts contain cython code that needs to be recompiled if the script or the pysam installation has changed. To rebuild all scripts, for example after updating the repository, type:
python cgat/scripts/cgat_rebuild_extensions.py
Recompilation requires a C compiler to be installed.
The pipelines assume that Sun Grid Engine has been installed. Other queueing systems might work, but expect to be disappointed. The pipeline is started on a submit host assuming a default queue all.q. Other queues can be specified on the command line, for example:
python cgat/CGATPipelines/pipeline_<name>.py --cluster-queue=medium_jobs.q
A pipeline might start up to -p/--multiprocess processes. Preferentially, tasks are sent to the cluster, but for some tasks this is not possible. These might thus run on the submit host, so make sure it is fairly powerful.
Pipelines expects that the working directory is accessible with the same path both from the submit and the execution host.
On top of pipeline specific bioinformatics software, CGAT pipelines make use a variety of software. Unfortunately we can’t support many versions. The following table gives a list software we have currently installed:
Section | Software | Version |
apps | java | jre1.6.0_26 |
apps | gccxml | 0.9 |
apps | R | 2.14.1 |
bio | alignlib | 0.4.4 |
apps | python | 2.7.1 |
apps | perl | 5.12.3 |
apps | graphlib | 0.1 |
bio | abiwtap | 1.2.1 |
bio | bamstats | 1.22 |
bio | batman | 0.2.3 |
bio | bedtools | 2.13.3 |
bio | belvu | 2.16 |
bio | bfast | 0.6.5a |
bio | bioprospector | 2004 |
bio | bowtie | 0.12.7 |
bio | bwa | 0.5.9 |
bio | cdhit | 4.3 |
bio | clustalw | 2.1 |
bio | cufflinks | 1.3.0 |
bio | cpc | 0.9-r2 |
bio | dialign | 2.2.1 |
bio | ensembl | 62 |
bio | ensembl-variation | 62 |
bio | exonerate | 2.2.0 |
bio | fastqc | 0.9.2 |
bio | fastx | 0.0.13 |
bio | gatk | 1.0.5506 |
bio | gblocks | 0.91b |
bio | gcprofile | 1.0 |
bio | gmap | 2011.03.28 |
bio | galaxy | dist |
bio | IGV | 2.0.23 |
bio | IGVTools | 1.5.12 |
bio | kent | 1.0 |
bio | hmmer | 3.0 |
bio | leotools | 0.1 |
bio | meme | 4.7.0 |
bio | muscle | 3.8.31 |
bio | mappability_map | 1.0 |
bio | ncbiblast | 2.2.25+ |
bio | newickutils | 1.3.0 |
bio | novoalign | 2.07.11 |
bio | novoalignCS | 1.01.11 |
bio | paml | 4.4c |
bio | picard-tools | 1.48 |
bio | phylip | 3.69 |
bio | polyphen | 2.0.23 |
bio | samtools | 0.1.18 |
bio | shrimp | 2.1.1 |
bio | sicer | 1.1 |
bio | sift | 4.0.3 |
bio | simseq | 72ce499 |
bio | soap | 2.21 |
bio | soapsplice | 1.0 |
bio | sratoolkit | 2.1.7 |
bio | SpliceMap | 3.3.5.2 |
bio | stampy | 1.0.17 |
bio | statgen | 0.1.4 |
bio | storm | 0.1 |
bio | tabix | 0.2.5 |
bio | tophat | 1.4.1 |
bio | treebest | 0.1 |
bio | tv | 0.5 |
bio | vcftools | 0.1.8a |
bio | emboss | 6.3.1 |
bio | velvet | 1.1.04 |
bio | perm | 0.3.5 |
bio | lastz | 1.02.00 |
bio | hpeak | 2.1 |
bio | boost | 1.46.1 |
bio | Trinity | 2012-01-25 |
bio | bowtie2 | 2.0.0-beta5 |
bio | tophat2 | 2.0.0 |
bio | all | 1.0 |
What exactly is required will depend on the particular pipeline. The pipeline assumes that the executables are in the users PATH and that the rest of the environment has been set up for each tool.
Additionally, there is a list of additional software that is required that are usually shipped as a source package with the operating system. These are:
sqlite
CGAT uses python extensively and is currently developed against python 2.7.1. Python 2.6 should work as well, but some libraries present in 2.7.1 but missing in 2.6 might need to be installed. Scripts have not yet been ported to python 3.
CGAT requires the following in-house python libraries to be installed:
Library | Version | Purpose | Download |
pysam | 0.6.0 | python bindings for samtools | hg clone https://code.google.com/p/pysam/ pysam |
alignlib | 0.4.5 | C++ sequence alignment library with python bindings. | wget http://downloads.sourceforge.net/project/alignlib/alignlib/alignlib-0.4.5.tar.gz |
sphinxreport | latest | report generator | svn checkout https://sphinx-report.googlecode.com/svn/trunk/ sphinx-report |
In addition, CGAT scripts make extensive use of the following python libraries (list below might not be complete):
Library | Version | Purpose |
numpy | ||
scipy | ||
rpy2 | ||
matplotlib | ||
ruffus | ||
drmaa_python |
The full list of modules installed at CGAT is:
Module | Version | Method |
pycairo | 01/08/06 | S |
pygjobject | 2.20.0 | S |
pygtk | 2.16.0 | S |
wxPython | 2.9.1.1 | S |
matplotlib | 1 | S |
numpy | 01/05/01 | E |
scipy | 0.8.0 | S |
rpy | 1.0.3 | S |
rpy2 | 02/02/00 | S |
networkx | 1.3 | E |
pytables | 2.2 | |
pygccxml | 1 | S |
pyplusplus | 1 | S |
bx.python | ||
pygresql | 4 | E |
myqsl-python | 01/02/03 | E |
biopython | 1.56 | E |
ply | 3.3 | E |
psyco | ||
pyrex | 0.9.9 | E |
cython | 0.13 | E |
sphinx | 1.0.5 | E |
reportlab | 2.5 | E |
guppy | 0.1.9 | E |
pil | 01/01/07 | E |
threadpool | 01/02/07 | E |
progressbar | 2.3 | E |
virtualenv | 01/05/01 | E |
sqlalchemy | 0.6.5 | E |
ruffus | 2.2 | E |
drmaa | 0.4b3 | E |
bx.python | 12/01/10 | S |
corebio | 0.5.0 | E |
weblogolib | 3 | E |
mercurial | 01/07/03 | E |
scikits.learn | 0.7.1 | E |
web.py | 0.34 | E |
pandas | 0.5.0 | E |
pybedtools | 0.6 | E |
Method : Installation method (E = easy_install/setuptools, S = setup.py/distutils, C = CGAT)