*******************************
Transcript comparison pipeline
*******************************

Purpose
-------

Map 454 reads onto a genome and assemble overlapping transcripts into
transcript models. The pipeline currently does not use base quality
information during mapping and does not consider alternative transcripts.

Setting up
----------

To set up the pipeline in the current directory, run::

   python setup.py --method=compare_transcripts > setup.log

Link to the genome files in /net/cpp-data/backup/databases/indexed_fasta and
name the links genome.fasta and genome.idx. For example::

   ln -s /net/cpp-data/backup/databases/indexed_fasta/hs_ncbi36_softmasked.fasta genome.fasta
   ln -s /net/cpp-data/backup/databases/indexed_fasta/hs_ncbi36_softmasked.idx genome.idx

Input (required):

%.gtf
   gtf files with (experimental) transcripts. The % denotes the track name,
   for example heart.gtf, kidney.gtf, sample1.gtf, ...

genome.fasta, genome.idx
   an indexed genome ``PARAM_GENOME``. See also index_fasta.py.

ensembl.gtf
   a gtf file with a reference sequence set: the default is ``ensembl``,
   but it can be changed in ``PARAM_MASTER_SET_GENES``.

annotations.gff
   a gff file with annotated genomic regions. See ``PARAM_GENOME_REGIONS``.
   Use gtf2gff.py to create this file.

The pipeline includes additional information if it is present:

%.coverage
   table with coverage information for a track. The output is from
   blat2assembly.py

%.polyA
   information about polyA tails. The output is from blat2assembly.py

%.readstats
   a table with read alignment statistics after filtering (see output from
   MapTranscripts)

%.readmap
   a table mapping gene_ids to read_ids after filtering (see output from
   MapTranscripts)

%.readinfo
   a table with read information.

%.readgtf
   mapped locations of reads after filtering.

.. glossary::

   PARAM_FILE_REPEATS_RATES
      gff formatted file of ancestral repeats. The score field contains the
      rate (see Makefile.ancestral_repeats)

   PARAM_FILE_REPEATS
      gff file with repeats in genome.
      These are used for masking in coding potential predictions.

   PARAM_FILE_REPEATS_GC
      gff formatted file of ancestral repeats. The score field contains the
      G+C content (see Makefile.ancestral_repeats)

   PARAM_FILE_ALIGNMENTS
      psl formatted file with genomic alignments between this species in
      query and another at appropriate evolutionary distance in target.

   PARAM_FILENAME_GO (PARAM_FILENAME_GOSLIM)
      GO annotations for genes in the reference set. Example format is::

         cell_location ENSPPYG00000000676 GO:0016020 membrane NA

   PARAM_FILENAME_TERRITORIES
      gene territories. GTF formatted file, an example entry would be::

         chr1 protein_coding exon 3979975 4199559 . - . transcript_id "ENSPPYG00000000050"; gene_id "ENSPPYG00000000050";#

   PARAM_CPC_UNIREF
      uniref database to use for coding potential predictions.

   PARAM_DATABASE
      database name

Output from the mapTranscripts454 project can be imported with a single
command::

   make PATH_TO_MAPPING_DIR.add-tracks

Configuration
-------------

Edit the :file:`Makefile` to configure the pipeline. See Parameters_ below.

Usage
-----

The pipeline is controlled by running `make`_ targets. The results of the
pipeline computation are stored as tab separated tables in the working
directory. Most of these tables are then imported into an sqlite_ database
called ``csvdb`` (see :term:`PARAM_DATABASE`).

Annotation
~~~~~~~~~~

Type::

   make all

to run all steps.

Fine grained control
++++++++++++++++++++

A more complete list of targets:

all
   make all

build
   only build, but do not import.

import
   import

Visualization
~~~~~~~~~~~~~

The following targets aid visualization:

ucsc-tracks-gtf
   export the segments as compressed gtf files. These can be viewed as user
   tracks in the `ucsc`_ genome browser.

GO analysis
~~~~~~~~~~~

GO analysis computes the relative enrichment/depletion of gene sets. It
requires ``PARAM_FILENAME_TERRITORIES``, ``PARAM_FILENAME_GO`` and
``PARAM_FILENAME_GOSLIM`` to be set.

There are two counting methods.
The first method (``go``) assigns GO terms associated with the reference
gene set to TLs and counts these. The second method (``territorygo``)
assigns TLs to genes in the reference set and then does a GO analysis on
these.

.. note::
   The conventional GO analysis based on a gene list is the ``territorygo``
   method.

Usage
+++++

Usage::

   make <track>:<slice>:<subset>:<background>.<go>.<method>analysis

The fields are:

track
   the data track to be chosen.

slice
   the slices correspond to flags in the table _annotation.
   Use ``all`` to use all segments in a ``track``.

subset
   the subset corresponds to a table that is joined with _annotation to
   restrict segments to a user-specified set. Use ``all`` for no
   restriction.

background
   the background gene set

go
   either ``go`` or ``goslim``

method
   either ``go`` or ``territorygo``

Results will be in the directory
:file:`<track>:<slice>:<subset>:<background>.<go>.<method>analysis.dir`.

For example::

   make thoracic:known:all:thoracic.go.goanalysis

will compute the enrichment of protein coding TL in the track ``thoracic``
using all ``thoracic`` genes as the background.

The command::

   make thoracic:known:all:ensembl.goslim.territorygoanalysis

will compute ``goslim`` term enrichment. The foreground set consists of
genes from the reference set (``ensembl``) overlapping protein coding TL in
the track ``thoracic``. The background is the complete reference gene set
(``ensembl``).

Annotator analysis
~~~~~~~~~~~~~~~~~~

Annotator computes the statistical significance of enrichment/depletion of
genomic features (called segments) within genomic regions (called
annotations).

To run annotator analysis, two files need to be present:

1. A workspace
2. A collection of annotations on the genome

Building workspaces
+++++++++++++++++++

Workspaces are built using makefile targets. For example, to build
:file:`genome.workspace`, type::

   make genome.workspace

All workspaces exclude contigs whose names match ``random``.

genome.workspace
   full genome

intergenic.workspace
   only intergenic regions

intronic.workspace
   only intronic regions

unknown.workspace
   both intergenic and intronic regions

territories.workspace
   workspace of territories

alignable.workspace
   only segments that can be aligned to a reference genome.

There is a convenience target::

   make annotator-workspaces

that will build all available workspaces.

Annotations
+++++++++++

Annotations are built using makefile targets.

all.annotations
   all subsets (all/known/unknown) for each track.

architecture.annotations
   annotations according to genes (intronic, intergenic, ...).

{all,known,unknown}_sets.annotations
   annotations of known, unknown, all transcripts

allgo_territories.annotations
   territories annotation with GO categories

allgoslim_territories.annotations
   territories annotation with GOSlim categories

intronicgo_territories.annotations
   territories annotation with GO categories

intronicgoslim_territories.annotations
   territories annotation with GOSlim categories

intergenicgo_territories.annotations
   territories annotation with GO categories

intergenicgoslim_territories.annotations
   territories annotation with GOSlim categories

There is a convenience target::

   make annotator-annotations

that will build all available annotations.

Usage
+++++

To perform ``Annotator`` analyses, run a make target::

   make <track>:<slice>:<subset>:<workspace>:<workspace2>_<annotations>.annotators

The fields determine which segments are used for the enrichment analysis.

track
   the data track to be chosen.

slice
   the slices correspond to flags in the table _annotation.
   Use ``all`` to use all segments in a ``track``.

subset
   the subset corresponds to a table that is joined with _annotation to
   restrict segments to a user-specified set. Use ``all`` for no
   restriction.

workspace
   the workspace to be used

workspace2
   a second workspace. The actual workspace will be the intersection of
   both workspaces.

annotations
   annotations to use.

.. note::
   Annotations, segments and the workspace need to be chosen carefully for
   each experiment. For example, failing to use territories for goterritory
   analysis will measure enrichment of segments within goterritories in
   general, and not necessarily relative enrichment between go territories.

The results will be in the file
:file:`<track>:<slice>:<subset>:<workspace>:<workspace2>_<annotations>.annotators`.

Examples
++++++++

The command::

   make thoracic:unknown:all:intergenic:all_unknownsets.annotators

will test whether ``unknown`` transcripts in the track ``thoracic`` are
enriched, within intergenic segments, for the annotations ``unknownsets``.

The command::

   make thoracic:intronic:all:intronic:territories_intronicgoslimterritories.annotators

will check for enrichment of ``intronic`` transcripts from the track
``thoracic`` within intronic genomic segments that also have GO assignments
(the intersection of the workspaces ``intronic`` and ``territories``). It
will label GO territories by GOslim territories.

Association analysis
~~~~~~~~~~~~~~~~~~~~

Association analysis computes the significance of finding segments close to
annotations.

Type::

   make annotator-distance-run

to run all association analyses.

Parameters
----------

The following parameters can be set in the :file:`Makefile`:

.. report:: Trackers.MakefileParameters
   :render: glossary
   :tracks: Makefile.compare_transcripts
   :transpose:

   Overview of pipeline parameters.
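
The make targets for the GO and Annotator analyses are plain strings
assembled from the fields listed in their respective Usage sections. As a
rough illustration, the following Python sketch composes such target names.
The helper names ``go_target`` and ``annotator_target`` are hypothetical and
not part of the pipeline, and the field order is an assumption inferred from
the example targets in this document.

.. code-block:: python

   # Hypothetical helpers, not part of the pipeline. The field order is
   # inferred from the example targets shown in this document.

   GO_SETS = ("go", "goslim")
   GO_METHODS = ("go", "territorygo")


   def go_target(track, slice_, subset, background, go, method):
       """Compose the name of a GO analysis make target."""
       if go not in GO_SETS:
           raise ValueError("go must be one of %s" % (GO_SETS,))
       if method not in GO_METHODS:
           raise ValueError("method must be one of %s" % (GO_METHODS,))
       return "%s:%s:%s:%s.%s.%sanalysis" % (
           track, slice_, subset, background, go, method)


   def annotator_target(track, slice_, subset, workspace, workspace2,
                        annotations):
       """Compose the name of an Annotator make target."""
       return "%s:%s:%s:%s:%s_%s.annotators" % (
           track, slice_, subset, workspace, workspace2, annotations)


   if __name__ == "__main__":
       # Print the commands for the two kinds of analysis.
       print("make " + go_target(
           "thoracic", "known", "all", "thoracic", "go", "go"))
       print("make " + annotator_target(
           "thoracic", "unknown", "all", "intergenic", "all", "unknownsets"))

Building the names programmatically can help when looping over many tracks,
since a misplaced ``:`` or ``.`` would otherwise select a target that does
not exist.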