GPipe - Gene prediction pipeline¶

Introduction¶

This document describes the pipeline of the Chris Ponting group for predicting genes by homology. The input is a set of known transcripts from a reference genome and the masked genomic sequence of a target genome.

The pipeline predicts genes in a two-step procedure:

Regions of similarity between the known transcripts and the reference genome are identified using a quick heuristic search.
The regions of similarity of step 1 are submitted to a sensitive, but slower gene prediction program.

The pipeline contains many options to mask sequences, analyse and quality control the predictions and store the results in a relational database. This manual describes how to setup up the pipeline, run it on our cluster, and analyse the results.

Note

The pipeline is tailored to the computational setup within the Ponting group. Porting it elsewhere will require a significant amount of software installation and configuration.

Setting up¶

The pipeline consists of a set of scripts and a makefile, that glues together the various scripts. This section tells you what programs are needed to be installed (Requirements), how to install the software from the pipeline (Installation).

Next, the input files need to prepared (Input files).

Before starting the pipeline, you need to configure your environment and the pipeline (Configuration).

Requirements¶

Gpipe requires following software to be installed:

postgres

Gpipe currently requires a postgres installation.

seq

Seg is a program to mask low complexity regions in protein sequences.

exonerate

Exonerate is a program to align a peptide sequence to a genomic sequence (other alignment modes are possible). It offers heuristic modes, that allow for fast scanning of large chunks of genomic DNA, and exhaustive modes, that do a full dynamic programming mode. It is available at http://www.ebi.ac.uk/~guy/exonerate.

python

Most scripts require python...

perl

...unless they are written in perl

alignlib

a library for sequence alignments and its python interface. See http://sourceforge.net/projects/alignlib.

SGE

Sun Grid Engine. Other schedulers might work.

seq_pairs_kaks

Leo’s wrapper around PAML (optional, only required for step12).

Installation¶

Untar and unpack the source code tarball. The directory in which it ends up is the source directory.

Gpipe works in a working directory. To set up the pipeline with the current directory as the working directory, run:

python <src>setup.py --method=gpipe --project=project_name > setup.log

src is the location of the source directory.

This will create a makefile in the current directory and give the project the name project_name. The latter will be used as the name of the database schema in postgres in which data will be stored.

Note that gpipe assumes that you use bash as you shell. In particular, it requires that two functions are in your environment:

#-----------------------------------
# helper functions for detecting errors in pipes
#-----------------------------------
detect_pipe_error_helper()
{
 while [ "$#" != 0 ] ; do
     # there was an error in at least one program of the pipe
     if [ "$1" != 0 ] ; then return 1 ; fi
     shift 1
 done
 return 0
}

detect_pipe_error()
{
 detect_pipe_error_helper "${PIPESTATUS[@]}"
 return $?
}

Once the code is in place, add the input files to the working and make sure that all the other requirements are fulfilled.

Input files¶

Gpipe requires 5 files to run. These input files contain the reference gene set to predict with and the genome sequence to predict in. Sample data is available at http://genserv.anat.ox.ac.uk/downloads/software/gpipe/sampledata/gpipe_sample_data.tar. To drop the sample data into your working directory, type:

wget ttp://genserv.anat.ox.ac.uk/downloads/software/gpipe/sampledata/gpipe_sample_data.tar
tar -xf gpipe_sample_data.tar
gunzip *

The filenames and their contents are:

peptides.fasta

A fasta-formatted file with peptide sequences. Each sequence is on a single line. The identifier of a sequence is taken from the description line with the pattern >(\S+) (characters between > and first white-space). For example:

>CG11023-RA
MGERDQPQSSERISIFNPPVYTQHQVRNEAPYIPTTFDLLSDDEESSQRVANAGPSFRPL...
>CG2671-RA
MLKFIRGKGQQPSADRHRLQKDLFAYRKTAQHGFPHKPSALAYDPVLKLMAIGTQTGALK...
...

genome.fasta and genome.idx

A fasta-formatted file with the genomic sequence together with its index. This file can be created from a collection of individual fasta formatted files (for example, per chromosome files) using the command:

python <src>index_fasta.py genome <somedir>my_dna_sequences*.fa.gz > genome.log

reference.exons

A table with gene models from the reference gene set. This is a tab-formatted table with the following columns:

transcript name: name of the transcript consistent with peptides.fasta
contig name: name of the DNA segment the transcript is located on
strand: strand
phase: phase of that particular exon
exon-id: numerical number of exon
peptide_start: start of exon in transcript sequence
peptide_end: end of exon in transcript sequence
genome_start: start of exon on contig
genome_end: end of exon on contig

Coordinates are 0-based, half-open intervals. Genomic coordinates are forward/reverse strand coordinates.

For example:

CG10000-RA chr3R   -1      0       1       0       126     24577165        24577291
CG10000-RA chr3R   -1      0       2       126     287     24576946        24577107
CG10000-RA chr3R   -1      1       3       287     466     24576706        24576885
CG10000-RA chr3R   -1      2       4       466     930     24576187        24576651
CG10000-RA chr3R   -1      0       5       930     1100    24575892        24576062
CG10000-RA chr3R   -1      1       6       1100    1280    24575573        24575753
CG10000-RA chr3R   -1      1       7       1280    1677    24574936        24575330
CG10001-RA chr3R   -1      0       1       0       540     24569207        24569747
CG10001-RA chr3R   -1      0       2       540     819     24566427        24566706
CG10001-RA chr3R   -1      0       3       819     978     24566193        24566352

map_rep2mem

A table linking genes to transcripts. This tab-formatted table contains the following columns

rep: A gene identifier
mem: A transcript identifier
size: Transcript size

Configuration¶

To configure the pipeline, options can be set in the Makefile in the working directory.

Options that might need to be changed:

PARAM_PSQL_DATABASE

The psql database

PARAM_PSQL_HOST

The psql host

PARAM_PSQL_USER

The psql username

Running the pipeline¶

The pipeline uses makefiles to control script logic. Before executing any make commands, run:

source setup.csh

to update your paths and other environment variables.

Before first running the pipeline, some maintenance work need to be performed like creating the database schema and the tables. To prepare the pipeline, run:

make prepare

This needs to be done only once.

To run the pipeline, type:

make all

Gpipe writes status messages to the file log in the working directory.

Results¶

Results of the gpipe run are stored in the psql database.

The view overview aggregates most results into a single table for easy access.

Troubleshooting¶

If something goes wrong, the first step is to look at the command line that caused the problem. To see the command executed, run:

make -n <target>

We use Sun Grid Engine as job queueing system and assume that for all nodes the code and data can be reached via the same mount point. All jobs that are run on the cluster are prefixed by the MAKE variable $(CMD_REMOTE_SUBMIT). You can set this variable to the empty string to run everything locally or on a mosix cluster:

make all CMD_REMOTE_SUBMIT=

Steps¶

The pipeline proceeds in 12 steps, which are:

Step1: Masking of protein sequences
Step2: Selecting representative transcipts to search with
Step3: Running exonerate
Step4: Running TBLASTN (disabled)
Step5: Collating putative gene-containing regions
Step6: Predicting genes for representative transcripts.
Step7: Predicting genes for redundant (alternative) transcripts
Step8: Predicting genes for member sequences
Step9: Analysing the predictions
Step10: Quality control of predictions
Step11: Removing redundant/erroneous predictions
Step12: Filter by ks (optional)

Glossary¶

working directory: The working directory. Location of the data files and results. All commands in this tutorial are executed in the working directory.
source directory: The location of the source code. The place where the script setup.py resides.

GPipe - Gene prediction pipeline¶

Introduction¶

Setting up¶

Requirements¶

Installation¶

Input files¶

Configuration¶

Running the pipeline¶

Results¶

Troubleshooting¶

Steps¶

Glossary¶

Table Of Contents

Previous topic

Next topic

This Page

Navigation

GPipe - Gene prediction pipeline¶

Introduction¶

Setting up¶

Requirements¶

Installation¶

Input files¶

Configuration¶

Running the pipeline¶

Results¶

Troubleshooting¶

Steps¶

Glossary¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation