This document describes the pipeline of the Chris Ponting group for predicting genes by homology. The input is a set of known transcripts from a reference genome and the masked genomic sequence of a target genome.
The pipeline predicts genes in a two-step procedure:
The pipeline contains many options to mask sequences, analyse and quality control the predictions and store the results in a relational database. This manual describes how to setup up the pipeline, run it on our cluster, and analyse the results.
Note
The pipeline is tailored to the computational setup within the Ponting group. Porting it elsewhere will require a significant amount of software installation and configuration.
The pipeline consists of a set of scripts and a makefile, that glues together the various scripts. This section tells you what programs are needed to be installed (Requirements), how to install the software from the pipeline (Installation).
Next, the input files need to prepared (Input files).
Before starting the pipeline, you need to configure your environment and the pipeline (Configuration).
Gpipe requires following software to be installed:
- postgres
- Gpipe currently requires a postgres installation.
- seq
- Seg is a program to mask low complexity regions in protein sequences.
- exonerate
- Exonerate is a program to align a peptide sequence to a genomic sequence (other alignment modes are possible). It offers heuristic modes, that allow for fast scanning of large chunks of genomic DNA, and exhaustive modes, that do a full dynamic programming mode. It is available at http://www.ebi.ac.uk/~guy/exonerate.
- python
- Most scripts require python...
- perl
- ...unless they are written in perl
- alignlib
- a library for sequence alignments and its python interface. See http://sourceforge.net/projects/alignlib.
- SGE
- Sun Grid Engine. Other schedulers might work.
- seq_pairs_kaks
- Leo’s wrapper around PAML (optional, only required for step12).
Untar and unpack the source code tarball. The directory in which it ends up is the source directory.
Gpipe works in a working directory. To set up the pipeline with the current directory as the working directory, run:
python <src>setup.py --method=gpipe --project=project_name > setup.log
src is the location of the source directory.
This will create a makefile in the current directory and give the project the name project_name. The latter will be used as the name of the database schema in postgres in which data will be stored.
Note that gpipe assumes that you use bash as you shell. In particular, it requires that two functions are in your environment:
#-----------------------------------
# helper functions for detecting errors in pipes
#-----------------------------------
detect_pipe_error_helper()
{
while [ "$#" != 0 ] ; do
# there was an error in at least one program of the pipe
if [ "$1" != 0 ] ; then return 1 ; fi
shift 1
done
return 0
}
detect_pipe_error()
{
detect_pipe_error_helper "${PIPESTATUS[@]}"
return $?
}
Once the code is in place, add the input files to the working and make sure that all the other requirements are fulfilled.
Gpipe requires 5 files to run. These input files contain the reference gene set to predict with and the genome sequence to predict in. Sample data is available at http://genserv.anat.ox.ac.uk/downloads/software/gpipe/sampledata/gpipe_sample_data.tar. To drop the sample data into your working directory, type:
wget ttp://genserv.anat.ox.ac.uk/downloads/software/gpipe/sampledata/gpipe_sample_data.tar
tar -xf gpipe_sample_data.tar
gunzip *
The filenames and their contents are:
A fasta-formatted file with peptide sequences. Each sequence is on a single line. The identifier of a sequence is taken from the description line with the pattern >(\S+) (characters between > and first white-space). For example:
>CG11023-RA
MGERDQPQSSERISIFNPPVYTQHQVRNEAPYIPTTFDLLSDDEESSQRVANAGPSFRPL...
>CG2671-RA
MLKFIRGKGQQPSADRHRLQKDLFAYRKTAQHGFPHKPSALAYDPVLKLMAIGTQTGALK...
...
A fasta-formatted file with the genomic sequence together with its index. This file can be created from a collection of individual fasta formatted files (for example, per chromosome files) using the command:
python <src>index_fasta.py genome <somedir>my_dna_sequences*.fa.gz > genome.log
A table with gene models from the reference gene set. This is a tab-formatted table with the following columns:
Coordinates are 0-based, half-open intervals. Genomic coordinates are forward/reverse strand coordinates.
For example:
CG10000-RA chr3R -1 0 1 0 126 24577165 24577291
CG10000-RA chr3R -1 0 2 126 287 24576946 24577107
CG10000-RA chr3R -1 1 3 287 466 24576706 24576885
CG10000-RA chr3R -1 2 4 466 930 24576187 24576651
CG10000-RA chr3R -1 0 5 930 1100 24575892 24576062
CG10000-RA chr3R -1 1 6 1100 1280 24575573 24575753
CG10000-RA chr3R -1 1 7 1280 1677 24574936 24575330
CG10001-RA chr3R -1 0 1 0 540 24569207 24569747
CG10001-RA chr3R -1 0 2 540 819 24566427 24566706
CG10001-RA chr3R -1 0 3 819 978 24566193 24566352
A table linking genes to transcripts. This tab-formatted table contains the following columns
To configure the pipeline, options can be set in the Makefile in the working directory.
Options that might need to be changed:
- PARAM_PSQL_DATABASE
- The psql database
- PARAM_PSQL_HOST
- The psql host
- PARAM_PSQL_USER
- The psql username
The pipeline uses makefiles to control script logic. Before executing any make commands, run:
source setup.csh
to update your paths and other environment variables.
Before first running the pipeline, some maintenance work need to be performed like creating the database schema and the tables. To prepare the pipeline, run:
make prepare
This needs to be done only once.
To run the pipeline, type:
make all
Gpipe writes status messages to the file log in the working directory.
Results of the gpipe run are stored in the psql database.
The view overview aggregates most results into a single table for easy access.
If something goes wrong, the first step is to look at the command line that caused the problem. To see the command executed, run:
make -n <target>
We use Sun Grid Engine as job queueing system and assume that for all nodes the code and data can be reached via the same mount point. All jobs that are run on the cluster are prefixed by the MAKE variable $(CMD_REMOTE_SUBMIT). You can set this variable to the empty string to run everything locally or on a mosix cluster:
make all CMD_REMOTE_SUBMIT=
The pipeline proceeds in 12 steps, which are: