OPTIC

Purpose

The optic pipeline performs orthology assignment in a group of species.

Setting up

Download CGAT code

CGAT code can be obtained by checking the latest version out from mercurial:

hg clone http://www.cgat.org/hg/cgat

In the following, the location of the checked out code will be referred to as <src>.

Requirements

Optic requires the following tools to be installed.

Program Version Purpose
postgres   database
muscle >=3.81.3 Multiple alignment
phyop   read mapping
treebest >=0.1 bam/sam files
paml >=4.4c evolutionary rate estimation

Genomes

The first step is to build the files with the genomic sequences. Download fasta files with genomic secquences and index them using the <src>/IndexedFasta.py tools.

Store the genomes in a separate directory (referred to later as <genome>).

Genesets

The second step is to upload the genesets into the database. The pipeline expects protein coding transcripts from ENSEMBL and requires two files that can be downloaded from the ENSEMBL ftp site:

  • File with peptide sequences, usually called species.pep.all.fa.gz

  • File with exon coordinates in gtf format, usually called

    species.gtf.gz

The following example shows how to upload the human gene set. We are going to be using ensembl version 62 on hg19:

mkdir -p genesets/hs62

cd genesets/hs62

ln -s <genomes>hg19.fasta genome.fasta

ln -s <genomes>hg19.idx genome.idx

ln -s <mirror>Homo_sapiens.GRCh37.62.gtf.gz reference.gtf.gz

ln -s <mirror>Homo_sapiens.GRCh37.62.pep.all.fa.gz reference.pep.fa.gz

Create the makefile:

python <src>setup.py -m ensembl -p cgat_hs62

Create database tables:

make prepare

Upload data:

make all

To verify all was ok, look at the file predictions.check. This file compares the ENSEMBL supplied peptide sequences with those that have been uploaded into the database.

The data is now stored in the database schema cgat_hs62.

You need to do this for all species that you want to run OPTIC on.

Make sure that the naming is consistent with the genomes. Thus, hg19 should both refer to the gene set for human, but also to the genomic sequence files (hg19.fasta) in the <genomes> directory.

Running OPTIC

The optic pipeline works from several directories.

Create a working directory:

mkdir optic

Create a makefile:

python <src>setup.py -m optic -p optic
cd optic

Enter the data directory and create the following files:

Makefile.inc

common makefile file options. For example:

## Global configuration options for OPTIC
PARAM_PROJECT_NAME=cgat_proj007

DIR_TMP=/tmp/

PARAM_DIR_DATA=<optic>data/

PARAM_SRC_SCHEMAS=cgat_hs62 cgat_mm62 cgat_gg62 cgat_ac62 cgat_xt62
cgat_dr62 cgat_oa65

PARAM_SPECIES_TREE=((((((cgat_hs62,cgat_mm62),cgat_oa65),cgat_gg62),cgat_ac62),cgat_xt62),cgat_dr62);

PARAM_ANALYSIS_DUPLICATIONS_OUTGROUPS=cgat_dr62

## CGAT cluster params

DIR_SCRIPTS=<src>/

PARAM_QUEUE=all.q
PARAM_QUEUE_LOCAL=all.q
PARAM_QUEUE_SERVER=all.q
schema2sp

tsv file mapping species in database schema to swissprot name (required for treebest):

# map of species names to swiss prot taxonomic names
# used for njtree
schema  sp
cgat_hs62       HUMAN
cgat_mm62       MOUSE
cgat_xt62       XENTR
cgat_dr62       DANRE
cgat_ac62       ANOCA
cgat_gg62       CHICK
cgat_oa65       ORNAN
species_tree

the phylogeny of the species in newick format, for example:

((((((cgat_hs62,cgat_mm62),cgat_oa65),cgat_gg62),cgat_ac62),cgat_xt62),cgat_dr62);
species_tree_permutations
the phylogeny of the species and permutations of it. Usually just a copy of species_tree

files needed for web server:

schema2colour

map species name to colour:

cgat_hs62       255,204,0
cgat_dr62       255,102,204
cgat_xt62       204,204,0
cgat_mm62       204,102,255
cgat_ac62       102,255,255
cgat_gg62       125,125,255
cgat_oa65       255,255,102
schema2name

map species name to real name:

schema  name
cgat_hs62       H. sapiens
cgat_mm62       M. musculus
cgat_ac62       A. carolinensis
cgat_xt62       X. tropicalis
cgat_dr62       D. rerio
cgat_gg62       G. gallus
cgat_oa65       O. anatinus
schema2url

map species name to ENSEMBL URL:

cgat_hs62       displayGene?schema=%(species)s&gene_id=%(gene)s
cgat_mm62       displayGene?schema=%(species)s&gene_id=%(gene)s
cgat_ac62       displayGene?schema=%(species)s&gene_id=%(gene)s
cgat_xt62       displayGene?schema=%(species)s&gene_id=%(gene)s
cgat_dr62       displayGene?schema=%(species)s&gene_id=%(gene)s
cgat_gg62       displayGene?schema=%(species)s&gene_id=%(gene)s
cgat_oa65       displayGene?schema=%(species)s&gene_id=%(gene)s

Preparing data

Create export data files. This will create fasta file and exon boundary files for all species:

make -C export export_clustering

Phyop

Setup pairwise phyop runs and run them:

make -C orthology_pairwise prepare

nice -19 nohup make -C orthology_pairwise all

Wait a while...

Clustering

The clustering step combines pairwise orthology assignments into clusters of potential orthologs:

make -C orthology_multiple prepare
make -C orthology_mulitple all

Multiple alignment

Next, multiple alignments are built for each cluster:

make -C malis prepare
make -C malis all
make -C malis summary.dir
make -C malis summary

Orthologous groups

Based on the multiple alignments, trees are built within each cluster and the trees are split into orthologous groups:

make -C paralogy_trees prepare
make -C paralogy_trees all
make -C paralogy_trees analysis
make -C paralogy_trees summary

Configuration

Edit the Makefile to configure the pipeline.