Transcript comparison pipeline¶

Purpose¶

Map 454 reads onto a genome and assemble overlapping transcripts into transcript models.

The pipeline currently does not use base quality information during mapping and does not consider alternative transcripts.

Setting up¶

To set up the pipeline in the current directory run:

python setup.py --method=compare_transcripts > setup.log

Link to the genome in /net/cpp-data/backup/databases/indexed_fasta and name the links genome.fasta and genome.idx. For example:

ln -s /net/cpp-data/backup/databases/indexed_fasta/hs_ncbi36_softmasked.fasta genome.fasta
ln -s /net/cpp-data/backup/databases/indexed_fasta/hs_ncbi36_softmasked.idx genome.idx

Input (required):

%.gtf: gtf files with (experimental) transcripts. The % denotes the track name, for example heart.gtf, kidney.gtf, sample1.gtf, ...
genome.fasta, genome.idx: an indexed genome (see PARAM_GENOME). See also index_fasta.py.
ensembl.gtf: a gtf file with a reference sequence set: the default is ensembl, but can be changed in PARAM_MASTER_SET_GENES.
annotations.gff: a gff file with annotated genomic regions. See PARAM_GENOME_REGIONS. Use gtf2gff.py to create this file.
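Before running the pipeline it can help to confirm that the required inputs are in place. The following is a minimal sketch, not part of the pipeline: the helper name missing_inputs is hypothetical, and it assumes the default reference set name ensembl.gtf (which can be changed via PARAM_MASTER_SET_GENES).

```python
from pathlib import Path

# Fixed file names taken from the "Input (required)" list above.
# ensembl.gtf is only the default name of the reference set (assumption).
REQUIRED_FILES = ("genome.fasta", "genome.idx", "ensembl.gtf", "annotations.gff")


def missing_inputs(workdir="."):
    """Return the required inputs that are missing from workdir.

    Hypothetical helper for illustration; the pipeline does not ship it.
    """
    workdir = Path(workdir)
    # Path.exists() follows symbolic links, so a dangling genome link
    # is reported as missing as well.
    missing = [name for name in REQUIRED_FILES if not (workdir / name).exists()]
    # Tracks follow the %.gtf pattern; the reference set is not a track.
    tracks = [p for p in workdir.glob("*.gtf") if p.name != "ensembl.gtf"]
    if not tracks:
        missing.append("%.gtf (at least one track)")
    return missing


if __name__ == "__main__":
    for name in missing_inputs("."):
        print("missing:", name)
```

Run in the working directory, the script prints one line per missing input and nothing if all inputs are present.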

The pipeline includes additional information if it is present:

%.coverage: table with coverage information for a track. The output is from blat2assembly.py
%.polyA: information about polyA tails. The output is from blat2assembly.py
%.readstats: a table with read alignment statistics after filtering (see output from MapTranscripts)
%.readmap: a table mapping gene_ids to read_ids after filtering (see output from MapTranscripts)
%.readinfo: a table with read information.
%.readgtf: mapped locations of reads after filtering.

PARAM_FILE_REPEATS_RATES

gff formatted file of ancestral repeats. The score field contains the rate (see Makefile.ancestral_repeats)

PARAM_FILE_REPEATS

gff file with repeats in genome. These are used for masking in coding potential predictions.

PARAM_FILE_REPEATS_GC

gff formatted file of ancestral repeats. The score field contains the G+C content (see Makefile.ancestral_repeats)

PARAM_FILE_ALIGNMENTS

psl formatted file with genomic alignments, with this species as the query and another species at an appropriate evolutionary distance as the target.

PARAM_FILENAME_GO (PARAM_FILENAME_GOSLIM)

GO annotations for genes in the reference set. Example format is:

cell_location   ENSPPYG00000000676      GO:0016020      membrane        NA
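As an illustration of the format, the example line can be read as five tab-separated fields. The sketch below is an assumption-laden reading of that one line: the source does not name the columns, so the field names (category, gene_id, go_id, description, evidence) and the helper parse_go_line are hypothetical.

```python
from collections import namedtuple

# Column names are assumptions inferred from the single example line;
# the documentation does not name the columns.
GoRecord = namedtuple("GoRecord", "category gene_id go_id description evidence")


def parse_go_line(line):
    """Split one tab-separated GO annotation line into a GoRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    return GoRecord(*fields)


if __name__ == "__main__":
    example = "cell_location\tENSPPYG00000000676\tGO:0016020\tmembrane\tNA"
    record = parse_go_line(example)
    print(record.gene_id, record.go_id)
```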

PARAM_FILENAME_TERRITORIES

gene territories. GTF formatted file, an example entry would be:

chr1    protein_coding  exon    3979975 4199559 .       -       .       transcript_id "ENSPPYG00000000050"; gene_id "ENSPPYG00000000050";#
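To make the layout of such an entry concrete, here is a minimal sketch that splits a GTF line into its nine standard columns and extracts the key/value attributes. The function name parse_gtf_line and the dictionary keys are illustrative choices, not part of the pipeline; the line is assumed to be tab-separated.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dictionary (illustrative)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    contig, source, feature, start, end, score, strand, frame, attributes = fields
    return {
        "contig": contig,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based, inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        # Anything after the last "key "value";" pair (such as a trailing
        # comment marker) is ignored.
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }


if __name__ == "__main__":
    example = "\t".join([
        "chr1", "protein_coding", "exon", "3979975", "4199559", ".", "-", ".",
        'transcript_id "ENSPPYG00000000050"; gene_id "ENSPPYG00000000050";#',
    ])
    entry = parse_gtf_line(example)
    print(entry["contig"], entry["start"], entry["attributes"]["gene_id"])
```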

PARAM_CPC_UNIREF

uniref database to use for coding potential predictions.

PARAM_DATABASE

database name

Output from the mapTranscripts454 project can be imported with a single command:

make PATH_TO_MAPPING_DIR.add-tracks

Configuration¶

Edit the Makefile to configure the pipeline. See Parameters below.

Usage¶

The pipeline is controlled by running make targets. The results of the pipeline computation are stored as tab separated tables in the working directory. Most of these tables are then imported into an sqlite database called csvdb (see PARAM_DATABASE).
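Because the results end up in an sqlite database, they can be inspected with any sqlite client. As a minimal sketch using Python's standard sqlite3 module, the hypothetical helper below lists the tables present in the database; it assumes the default database name csvdb and makes no assumption about which tables the pipeline creates.

```python
import sqlite3


def list_tables(database="csvdb"):
    """Return the names of all tables in an sqlite database, sorted by name.

    Note that sqlite3.connect creates an empty database file if the
    given path does not exist yet.
    """
    connection = sqlite3.connect(database)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    for table in list_tables("csvdb"):
        print(table)
```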

Annotation¶

Type:

make all

to run all steps.

Fine grained control¶

A more complete list of targets:

all: build everything and import the results.
build: only build, but do not import.
import: only import the results.

Visualization¶

The following targets aid visualization:

ucsc-tracks-gtf

export the segments as compressed gtf files. These can be viewed as user tracks in the UCSC genome browser.

GO analysis¶

GO analysis will compute the relative enrichment/depletion of gene sets.

Requires PARAM_FILENAME_TERRITORIES, PARAM_FILENAME_GO and PARAM_FILENAME_GOSLIM to be set.

There are two counting methods. The first method (go) assigns GO terms associated with the reference gene set to TLs and counts these. The second method (territorygo) assigns TLs to genes in the reference set and then does a GO analysis on these.

Note

The conventional GO analysis based on gene lists is the territorygo method.

Usage¶

Usage:

make <track>:<slice>:<subset>:<background>.<go>.<method>analysis

The fields are:

track: the data track to be chosen.
slice: the slices correspond to flags in the table <track>_annotation. Use all to use all segments in a track.
subset: the subset corresponds to a table that is joined with <track>_annotation to restrict segments to a user-specified set. Use all for no restriction.
background: the background gene set
go: either go or goslim
method: either go or territorygo

Results will be in the directory <track>:<slice>:<subset>:<background>.<go>.<method>analysis.dir.
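The target name is assembled from the six fields in a fixed pattern. The following sketch builds the target and the matching results directory from the fields; the helper names go_target and go_results_dir are hypothetical, and only the go field is validated because its two values are stated above.

```python
def go_target(track, slice_, subset, background, go, method):
    """Assemble the make target for a GO analysis from its six fields.

    Hypothetical helper for illustration; the pipeline does not ship it.
    """
    if go not in ("go", "goslim"):
        raise ValueError("go must be 'go' or 'goslim', got %r" % (go,))
    return "%s:%s:%s:%s.%s.%sanalysis" % (
        track, slice_, subset, background, go, method)


def go_results_dir(track, slice_, subset, background, go, method):
    """Return the directory in which the results of the target are stored."""
    return go_target(track, slice_, subset, background, go, method) + ".dir"


if __name__ == "__main__":
    target = go_target("thoracic", "known", "all", "thoracic", "go", "go")
    print("make", target)
```

The printed command can then be run in the working directory.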

For example:

make thoracic:known:all:thoracic.go.goanalysis

will compute the enrichment of protein coding TL in the track thoracic using all thoracic genes as the background.

The command:

make thoracic:known:all:ensembl.goslim.territorygoanalysis

will compute goslim term enrichment. The foreground set consists of genes from the reference set (ensembl) overlapping protein coding TL in the track thoracic. The background is the complete reference gene set (ensembl).

Annotator analysis¶

Annotator computes the statistical significance of enrichment/depletion of genomic features (called segments) within genomic regions (called annotations).

To run annotator analysis, two files need to be present:

A workspace
A collection of annotations on the genome

Building workspaces¶

Workspaces are built using makefile targets. For example, to build genome.workspace, type:

make genome.workspace

All workspaces exclude contigs whose names match random.

genome.workspace: full genome
intergenic.workspace: only intergenic regions
intronic.workspace: only intronic regions
unknown.workspace: both intergenic and intronic regions
territories.workspace: workspace of territories
alignable.workspace: only segments that can be aligned to a reference genome.

There is a convenience target:

make annotator-workspaces

that will build all available workspaces.

Annotations¶

Annotations are built using makefile targets.

all.annotations: all subsets (all/known/unknown) for each track.
architecture.annotations: annotations according to genes (intronic, intergenic, ...).
{all,known,unknown}_sets.annotations: annotations of known, unknown, all transcripts
allgo_territories.annotations: territories annotation with GO categories
allgoslim_territories.annotations: territories annotation with GOSlim categories
intronicgo_territories.annotations: territories annotation with GO categories
intronicgoslim_territories.annotations: territories annotation with GOSlim categories
intergenicgo_territories.annotations: territories annotation with GO categories
intergenicgoslim_territories.annotations: territories annotation with GOSlim categories

There is a convenience target:

make annotator-annotations

that will build all available annotations.

Usage¶

To perform an Annotator analysis, run a make target:

make <track>:<slice>:<subset>:<workspace>:<workspace2>_<annotations>.annotators

The fields determine which segments are used for the enrichment analysis.

track: the data track to be chosen.
slice: the slices correspond to flags in the table <track>_annotation. Use all to use all segments in a track.
subset: the subset corresponds to a table that is joined with <track>_annotation to restrict segments to a user-specified set. Use all for no restriction.
workspace: the workspace to be used
workspace2: a second workspace. The actual workspace will be the intersection of both workspaces.
annotations: annotations to use.
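The effect of combining two workspaces can be pictured as an interval intersection. The sketch below is not Annotator's implementation, which is not shown here; it only illustrates the idea for a single contig, assuming each workspace is given as a sorted list of non-overlapping, half-open (start, end) intervals.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals.

    Illustrative only: both inputs are lists of (start, end) tuples on the
    same contig, and the result is the region covered by both.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


if __name__ == "__main__":
    intronic = [(0, 10), (20, 30)]
    territories = [(5, 25)]
    print(intersect(intronic, territories))
```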

Note

Annotations, segments and the workspace need to be chosen carefully for each experiment. For example, failing to use territories for goterritory analysis will measure enrichment of segments within goterritories in general, and not necessarily relative enrichment between go territories.

The results will be in the file <track>:<slice>:<subset>:<workspace>:<workspace2>_<annotations>.annotators.
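To see how the fields map onto a target name, the following sketch splits a target back into its six fields. The helper parse_annotator_target is hypothetical, and it assumes that the name of the second workspace contains no underscore, so that the first underscore after the last colon separates workspace2 from annotations.

```python
def parse_annotator_target(target):
    """Split an Annotator make target into its six fields (illustrative)."""
    suffix = ".annotators"
    if not target.endswith(suffix):
        raise ValueError("target must end in %s" % suffix)
    parts = target[: -len(suffix)].split(":")
    if len(parts) != 5:
        raise ValueError("expected 5 colon-separated parts, got %d" % len(parts))
    track, slice_, subset, workspace, rest = parts
    # Assumption: workspace2 itself contains no underscore.
    workspace2, separator, annotations = rest.partition("_")
    if not separator:
        raise ValueError("missing '_' between workspace2 and annotations")
    return {
        "track": track,
        "slice": slice_,
        "subset": subset,
        "workspace": workspace,
        "workspace2": workspace2,
        "annotations": annotations,
    }


if __name__ == "__main__":
    fields = parse_annotator_target(
        "thoracic:unknown:all:intergenic:all_unknownsets.annotators")
    for name, value in fields.items():
        print(name, "=", value)
```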

Examples¶

The command:

make thoracic:unknown:all:intergenic:all_unknownsets.annotators

will test for enrichment among unknown transcripts in the track thoracic, within intergenic segments, with respect to the other sets. The command:

make thoracic:intronic:all:intronic:territories_intronicgoslimterritories.annotators

will check for enrichment of intronic transcripts from the track thoracic within intronic genomic segments that also have GO assignments (the intersection of the workspaces intronic and territories). It will label GO territories by GOslim territories.

Association analysis¶

Association analysis computes the significance of finding segments close to annotations.

Type:

make annotator-distance-run

to run all association analyses.

Parameters¶

The following parameters can be set in the Makefile:
