This section provides some background on CGAT pipelines.
Broadly, there are two types of pipelines. In production pipelines, the inputs are usually the same every time the pipeline is run and the outputs are known beforehand; read mapping and quality control is a typical example. These pipelines can be well optimized and re-used with little change in configuration.
Analysis pipelines control scientific analyses and are much more in a state of flux. Here, the input might change over time as the analysis expands, and the output will change with every new insight or new direction the project takes. It is still a pipeline as long as the output can be generated from the input without manual intervention. These pipelines leave less scope for optimization than production pipelines, and adapting one to a new project usually involves significant refactoring.
In CGAT, we are primarily concerned with analysis pipelines, though we have some production pipelines for common tasks.
There are several ways to build pipelines. Generic workflow systems such as taverna even provide GUIs for connecting tasks: a developer writes some glue code that permits the output of one application to be used as input for another. There are also specialized workflow systems for genomics, for example galaxy, which allows you to save and share analyses; new tools can be added to the system and new data imported easily, for example from the UCSC genome browser.
There is probably no single toolset that satisfies all requirements. We use the following tools to build a pipeline (a minimal sketch combining them follows the list):
- ruffus to control the main computational steps
- sqlite to store the results of the computational steps
- sphinxreport to visualize the data in the sqlite database
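
To illustrate how these pieces fit together, here is a minimal, hypothetical sketch, not actual CGAT code: one ruffus task counts the reads in each FASTQ file in the working directory, and a second task collects the results into an sqlite database. The file names, the read_counts table and the csvdb database name are assumptions made for illustration::

    """A minimal, hypothetical ruffus pipeline sketch: the file names,
    table name and database name are assumptions for illustration."""

    import gzip
    import sqlite3

    from ruffus import merge, pipeline_run, suffix, transform


    @transform("*.fastq.gz", suffix(".fastq.gz"), ".nreads")
    def countReads(infile, outfile):
        # count reads in a gzipped FASTQ file (four lines per read)
        with gzip.open(infile, "rt") as inf:
            nlines = sum(1 for _ in inf)
        with open(outfile, "w") as outf:
            outf.write("%i\n" % (nlines // 4))


    @merge(countReads, "csvdb")
    def loadReadCounts(infiles, outfile):
        # collect the per-file counts into a single sqlite table
        conn = sqlite3.connect(outfile)
        conn.execute("DROP TABLE IF EXISTS read_counts")
        conn.execute("CREATE TABLE read_counts (track TEXT, nreads INT)")
        for infile in infiles:
            track = infile[:-len(".nreads")]
            nreads = int(open(infile).read())
            conn.execute("INSERT INTO read_counts VALUES (?, ?)",
                         (track, nreads))
        conn.commit()
        conn.close()


    if __name__ == "__main__":
        # run all tasks needed to bring loadReadCounts up to date
        pipeline_run([loadReadCounts])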
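
Downstream, sphinxreport trackers issue SQL queries against this database to render tables and plots. As a stand-in, the hypothetical read_counts table above could be inspected with plain sqlite3::

    import sqlite3

    # query the hypothetical read_counts table from the sketch above
    conn = sqlite3.connect("csvdb")
    for track, nreads in conn.execute(
            "SELECT track, nreads FROM read_counts ORDER BY track"):
        print("%s\t%i" % (track, nreads))
    conn.close()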