Using CGAT pipelines

This section provides a tutorial-like introduction to CGAT pipelines.


A pipeline takes input data and performs a series of automated steps (tasks) on it to produce some output data.

Each pipeline is usually coupled with a SphinxReport document to summarize and visualize the results.

It really helps if you are familiar with the following:

  • the unix command line to run and debug the pipeline
  • python in order to understand what happens in the pipeline
  • ruffus in order to understand the pipeline code
  • sge in order to monitor your jobs
  • mercurial in order to keep your code up to date

Setting up a pipeline

Before starting, check that your computing environment is appropriate (see Installing CGAT pipelines). Once all components are in place, setting up a pipeline involves the following steps:

Step 1: Get the latest clone of the cgat script repository:

hg clone src


You need to have mercurial installed.

The directory src is the source directory. It will be abbreviated <src> in the following commands. This directory will contain the pipeline master script named pipeline_<name>.py, the default configuration files and all the helper scripts and libraries to run the pipeline.

Step 2: Create a working directory and enter it. For example:

mkdir version1
cd version1

The pipeline will live there and all subsequent steps should be executed from within this directory.

Step 3: Obtain and edit an initial configuration file. Ruffus pipelines are controlled by a configuration file. A configuration file with all the default values can be obtained by running:

python <src>/pipeline_<name>.py config

This will create a new pipeline.ini file. YOU MUST EDIT THIS FILE. The default values are likely to use the wrong genome or point to non-existing locations of indices and databases. The configuration file should be well documented and the format is simple. The documentation for the ConfigParser Python module contains the full specification.
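Because the defaults often point to locations that do not exist, it can help to check the edited file before starting a run. The following is a minimal sketch using the standard library configparser module (named ConfigParser in Python 2). The section names, option names and the "_dir" suffix convention are hypothetical and serve only as illustration; the real names depend on the pipeline.

```python
import configparser
import os

# Hypothetical example content; real sections and options depend on the pipeline.
EXAMPLE_INI = """\
[general]
# hypothetical option: genome assembly to use
genome = hg19
# hypothetical option: directory with the genome index
genome_dir = /nonexistent/path/to/genomes

[report]
# hypothetical option: number of threads for report building
threads = 4
"""


def load_config(text):
    """Parse INI-formatted text and return a ConfigParser object."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return config


def missing_directories(config):
    """Return (section, option) pairs ending in '_dir' whose path does not exist."""
    missing = []
    for section in config.sections():
        for option, value in config.items(section):
            if option.endswith("_dir") and not os.path.isdir(value):
                missing.append((section, option))
    return missing
```

For a file on disk, config.read("pipeline.ini") can be used in place of read_string. Any pair reported by missing_directories is a candidate for editing.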

Step 4: Add the input files. The required input is specific to each pipeline; read the pipeline documentation to find out exactly which files are needed. Commonly, a pipeline works from input files copied or linked into the working directory and named following pipeline-specific conventions.
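Linking rather than copying keeps the working directory small. The sketch below shows one way to link files under the names a pipeline expects. The helper link_inputs and the file names in the usage note are hypothetical; the actual naming convention is given in each pipeline's documentation.

```python
import os


def link_inputs(name_map, workdir):
    """Link each source file into workdir under its pipeline-specific name.

    name_map maps an existing source path to the file name the pipeline
    expects. Existing links are left untouched. Returns the sorted link paths.
    """
    links = []
    for source, link_name in name_map.items():
        link_path = os.path.join(workdir, link_name)
        if not os.path.lexists(link_path):
            # Link to the absolute path so the link survives a change of directory.
            os.symlink(os.path.abspath(source), link_path)
        links.append(link_path)
    return sorted(links)
```

A call might look like link_inputs({"/data/run42.fastq": "sampleA-R1.fastq"}, "."), where both names are made up for this example.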

Running a pipeline

Pipelines are controlled by a single python script called pipeline_<name>.py that lives in the source directory. Command line usage information is available by running:

python <src>/pipeline_<name>.py --help

The basic syntax for pipeline_<name>.py is:

python <src>/pipeline_<name>.py [options] COMMAND

COMMAND can be one of the following:

make <task>
run all tasks required to build task
show <task>
show tasks required to build task without executing them
plot <task>
plot image (requires inkscape) of pipeline state for task
touch <task>
touch files without running task or its pre-requisites. This sets the timestamps for files in task and its pre-requisites such that they will seem up-to-date to the pipeline.
config
write a new configuration file pipeline.ini with default values. An existing configuration file will not be overwritten.
clone <srcdir>
clone a pipeline from srcdir into the current directory. Cloning attempts to conserve disk space by linking.
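The effect of touch is easiest to see with a simplified model of timestamp-based dependency checking. The sketch below is an assumption about the general principle, not the actual ruffus implementation; is_up_to_date and touch are hypothetical helper names.

```python
import os


def is_up_to_date(inputs, outputs):
    """Return True if every output exists and none is older than any input."""
    if not all(os.path.exists(path) for path in outputs):
        return False
    if not inputs or not outputs:
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output >= newest_input


def touch(path):
    """Create the file if it is missing and set its timestamps to the current time."""
    with open(path, "a"):
        pass
    os.utime(path, None)
```

In this model, touching an output file makes it newer than its inputs, so the task that produces it is considered done even though it was never run.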

In case you are running a long pipeline, make sure you start it appropriately, for example:

nice -19 nohup <src>/pipeline_<name>.py make full

This will keep the pipeline running if you close the terminal.


Many things can go wrong while running the pipeline. Look out for

  • bad input format. The pipeline does not perform sanity checks on the input format. If the input is bad, you might see wrong or missing results or an error message.
  • pipeline disruptions. Problems with the cluster, the file system or the controlling terminal might all cause the pipeline to abort.
  • bugs. The pipeline makes many implicit assumptions about the input files and the programs it runs. If program versions change or inputs change, the pipeline might not be able to deal with it. The result will be wrong or missing results or an error message.

If the pipeline aborts, locate the step that caused the error by reading the logfiles and the error messages on stderr (nohup.out). See if you can understand the error and guess the likely problem (new program versions, badly formatted input, ...). If you are able to fix the error, remove the output files of the step in which the error occurred and restart the pipeline. It should continue from the appropriate location.
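Log files of a long run can be large, so a small filter helps to find the first failure. The sketch below assumes that failures show up as lines containing "ERROR", "Traceback" or "Exception"; that pattern is a guess, not a documented log format, and find_error_lines is a hypothetical helper.

```python
import re


def find_error_lines(lines, pattern=r"ERROR|Traceback|Exception"):
    """Return (line_number, text) pairs for all lines matching pattern."""
    regex = re.compile(pattern)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]
```

To scan the stderr capture, pass an open file, for example find_error_lines(open("nohup.out")). The earliest match is usually the most informative one, since later errors are often consequences of the first.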


Look out for upstream errors. For example, the pipeline might build a geneset filtering by a certain set of contigs. If the contig names do not match, the geneset will be empty, but the geneset building step might conclude successfully. However, you might get an error in any of the downstream steps complaining that the gene set is empty. To resolve this, correct the underlying problem and delete the files created by the geneset building step, not just those of the step that threw the error.
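One quick way to spot a silently failed upstream step is to look for empty output files in the working directory. The following sketch lists files of size zero; find_empty_files is a hypothetical helper. Note the limitation: a compressed file without records is not zero bytes long, so such files need to be inspected separately.

```python
import os


def find_empty_files(directory):
    """Return the sorted paths of all zero-size regular files below directory."""
    empty = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

An empty file that belongs to an early step is a hint that the outputs of that step, and not only those of the step that reported the error, should be deleted before restarting.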

Updating to the latest code version

To get the latest bugfixes, go into the source directory and type:

hg pull
hg update

The first command retrieves the latest changes from the master repository and the second command updates your local version with these changes.

Building pipeline reports

Some of the pipelines are associated with an automated report generator to display summary information as a set of nicely formatted HTML pages. In order to build the report, drop the appropriate configuration files, including sphinxreport.ini, into the working directory and run the pipeline command:

nice -19 pipeline_<name>.py make build_report

This will create the report from scratch in the current directory. The report can be viewed by opening the file <work>/report/html/contents.html in your browser.

Sphinxreport is quite powerful, but also runs quite slowly on large projects that need to generate a multitude of plots and tables. In order to speed up this process, there are some advanced features that Sphinxreport offers:

  • caching of results
  • multiprocessing
  • incremental builds
  • separate build directory

Please see the sphinxreport documentation for more information.