================
Using CGAT Tools
================

Command line usage
==================

CGAT tools are written for command line usage with a consistent
interface that makes them amenable to integration in pipelines.
Tools can be accessed through the :file:`cgat` front-end that will
be installed in your PATH.

To get a list of all available commands, type::

   cgat --help

Command line help for individual tools is available through 
each tool's ``--help`` option::

   cgat gff2gff --help

Logging
-------

CGAT scripts write logging information as comment lines starting with
``#``, either to stdout or to a separate log file given with ``--log``.

Below is an example of logging output::

    # output generated by /ifs/devel/andreas/cgat/beds2beds.py --force --exclusive --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz
    # job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
    # pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
    # exclusive                               : True
    # filename_update                         : None
    # ignore_strand                           : False
    # loglevel                                : 1
    # method                                  : unmerged-combinations
    # output_filename_pattern                 : 030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz
    # output_force                            : True
    # pattern_id                              : (.*).bed.gz
    # stderr                                  : <open file '<stderr>', mode 'w' at 0x2ba70e0c2270>
    # stdin                                   : <open file '<stdin>', mode 'r' at 0x2ba70e0c2150>
    # stdlog                                  : <open file '030m.intersection.tsv.log', mode 'a' at 0x1f1a810>
    # stdout                                  : <open file '<stdout>', mode 'w' at 0x2ba70e0c21e0>
    # timeit_file                             : None
    # timeit_header                           : None
    # timeit_name                             : all
    # tracks                                  : None

The header contains information about:

    * the script name (``beds2beds.py``)
    * the command line options (``--force --exclusive --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz``)
    * the time when the job was started (``Thu Mar 29 13:06:33 2012``)
    * the host on which it was executed (``cgat150.anat.ox.ac.uk``)
    * a unique job id (``e1c16e80-03a1-4023-9417-f3e44e33bdcd``)
    * the pid of the job (``16649``)
    * the system specification (``Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64``)
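
The header fields listed above can be pulled out of a log with a few
regular expressions. The following is a minimal sketch, not part of
CGAT: the function name ``parse_header`` and the returned dictionary
layout are assumptions made for illustration, and the sample text is an
abbreviated copy of the header shown above.

.. code-block:: python

    import re

    # Abbreviated copy of the example header (command line shortened).
    HEADER = """\
    # output generated by /ifs/devel/andreas/cgat/beds2beds.py --force --exclusive
    # job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
    # pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
    # exclusive                               : True
    # method                                  : unmerged-combinations
    """


    def parse_header(text):
        """Collect header fields from log comment lines (hypothetical helper)."""
        info = {"options": {}}
        for line in text.splitlines():
            line = line.strip()
            if not line.startswith("#"):
                continue
            body = line[1:].strip()
            m = re.match(r"output generated by (\S+)\s*(.*)", body)
            if m:
                info["script"], info["arguments"] = m.groups()
                continue
            m = re.match(r"job started at (.+) on (\S+) -- (\S+)", body)
            if m:
                info["started"], info["host"], info["job_id"] = m.groups()
                continue
            m = re.match(r"pid: (\d+), system: (.+)", body)
            if m:
                info["pid"] = int(m.group(1))
                info["system"] = m.group(2)
                continue
            # Remaining lines are "name : value" option dumps.
            m = re.match(r"(\w+)\s+: (.*)", body)
            if m:
                info["options"][m.group(1)] = m.group(2)
        return info


    info = parse_header(HEADER)

Option values are kept as strings here, because the log only records
their printed form.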

On successful completion, a script writes a footer to the log file. Below is typical output::

    # job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36  0.45  0.00  0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd

The footer contains information about:

   * the job status (``job finished``)
   * the time it took to execute (``11 seconds``)
   * the time of completion (``Thu Mar 29 13:06:44 2012``)
   * some benchmarking information (``11.36  0.45  0.00  0.01``): the
     ``user time``, ``system time``, ``child user time`` and
     ``child system time``
   * the unique job id (``e1c16e80-03a1-4023-9417-f3e44e33bdcd``)

The unique job id makes it easy to retrieve the matching entries for a
job from a concatenation of log files.
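
As an illustration of matching by job id, the sketch below pairs
``job started`` and ``job finished`` lines from a concatenated log. It
is not a CGAT utility: ``collect_jobs`` and the field names are
assumptions, and the second job in the sample (its id and timings) is
made up to show interleaved output.

.. code-block:: python

    import re

    # First job: lines from the example above. Second job: made up.
    LOG = """\
    # job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
    # job started at Thu Mar 29 13:06:40 2012 on cgat150.anat.ox.ac.uk -- 00000000-0000-0000-0000-000000000001
    # job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36  0.45  0.00  0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
    # job finished in 3 seconds at Thu Mar 29 13:06:43 2012 -- 2.90  0.10  0.00  0.00 -- 00000000-0000-0000-0000-000000000001
    """

    STARTED = re.compile(r"# job started at (.+) on (\S+) -- (\S+)$")
    FINISHED = re.compile(
        r"# job finished in (\d+) seconds at (.+?) -- "
        r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+) -- (\S+)$")


    def collect_jobs(text):
        """Group start and finish records by job id (hypothetical helper)."""
        jobs = {}
        for line in text.splitlines():
            line = line.strip()
            m = STARTED.match(line)
            if m:
                started, host, job_id = m.groups()
                jobs.setdefault(job_id, {}).update(started=started, host=host)
                continue
            m = FINISHED.match(line)
            if m:
                user, system, child_user, child_system = (
                    float(x) for x in m.group(3, 4, 5, 6))
                jobs.setdefault(m.group(7), {}).update(
                    seconds=int(m.group(1)), finished=m.group(2),
                    user=user, system=system,
                    child_user=child_user, child_system=child_system)
        return jobs


    jobs = collect_jobs(LOG)

A job whose entry has a ``started`` key but no ``finished`` key did not
write a footer, which, per the description above, means it did not
complete successfully.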

The logging level is set with the ``--verbose`` option. A level of
``0`` means no logging output, ``1`` outputs information messages
only, and ``2`` also outputs debugging information.

I/O redirection
----------------

Most scripts work by reading data from :term:`stdin` and outputting
data to :term:`stdout`. Both can be redirected to files with the 
``-I/--stdin`` and ``-O/--stdout`` options. :term:`stderr` can be 
redirected with ``-E/--stderr``.
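
The redirection convention can be pictured with a small stand-alone
script. This is only a sketch of the idea and not how CGAT implements
it: the option names ``-I/--stdin`` and ``-O/--stdout`` come from the
text above, while ``open_streams`` and ``copy_upper`` are hypothetical
helpers.

.. code-block:: python

    import argparse
    import sys


    def open_streams(argv):
        """Return (infile, outfile), honouring -I/--stdin and -O/--stdout."""
        parser = argparse.ArgumentParser()
        parser.add_argument("-I", "--stdin", dest="stdin_name", default=None)
        parser.add_argument("-O", "--stdout", dest="stdout_name", default=None)
        args = parser.parse_args(argv)
        # Fall back to the process streams when no file name is given.
        infile = open(args.stdin_name) if args.stdin_name else sys.stdin
        outfile = open(args.stdout_name, "w") if args.stdout_name else sys.stdout
        return infile, outfile


    def copy_upper(infile, outfile):
        """Stand-in for real work: upper-case every input line."""
        for line in infile:
            outfile.write(line.upper())

With this layout, ``script -I in.txt -O out.txt`` and
``script < in.txt > out.txt`` would read and write the same data.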

Indexing genomes
================

Many CGAT tools require genomic information. Some require the actual
genomic sequence, while many others require information about
chromosome sizes. For this reason, many tools have a mandatory
``--genome-file`` option.

The ``--genome-file`` option points to an indexed fasta file. CGAT
tools can read two kinds of index: files indexed with the supplied
:doc:`scripts/index_fasta` script, and files indexed with the
samtools_ ``faidx`` command.
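
To show how chromosome sizes can be recovered from an index, the sketch
below reads the tab-separated ``.fai`` file that ``samtools faidx``
writes (columns: name, length, offset, bases per line, bytes per line).
It does not reproduce how CGAT tools load an index;
``read_chromosome_sizes`` is a hypothetical helper and the index content
is made up.

.. code-block:: python

    import io

    # Made-up index: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
    FAI = "chr1\t1000\t6\t60\t61\nchr2\t500\t1029\t60\t61\n"


    def read_chromosome_sizes(handle):
        """Map sequence name to length from an open .fai file."""
        sizes = {}
        for line in handle:
            if not line.strip():
                continue
            name, length = line.rstrip("\n").split("\t")[:2]
            sizes[name] = int(length)
        return sizes


    sizes = read_chromosome_sizes(io.StringIO(FAI))

For a real index, pass an open file such as ``open("genome.fa.fai")``
instead of the in-memory sample; the file name here is only an example.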

Pipeline usage
==============

We use a light-weight workflow system called ruffus_, but others such
as galaxy_ (see :ref:`GalaxyInstallation`) are equally possible.
These systems allow CGAT tools to be run in an automated fashion.

Using unix pipes, CGAT tools can also easily be run in parallel. For
example, we have a script called ``farm.py`` (not part of the CGAT
collection, but within the CGAT repository) that splits input data
and runs the separate chunks on our compute cluster. Below is a
simple example of running the command::

   zcat geneset.gtf.gz |
   cgat gtf2table --counter=length --log=log |
   gzip > out.tsv.gz

in parallel on the cluster, running one job per chromosome::

   zcat geneset.gtf.gz |
   farm.py --split-at-column=1 \
           "cgat gtf2table --counter=length --log=log" |
   gzip > out.tsv.gz
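
The idea behind splitting at a column can be sketched in a few lines of
Python. This is not the implementation of ``farm.py``:
``split_at_column`` is a hypothetical function, the records are made
up, and the sketch assumes that lines sharing a value in the chosen
column are adjacent (for example, input sorted by chromosome).

.. code-block:: python

    from itertools import groupby

    # Made-up, tab-separated records; column 1 holds the chromosome name.
    LINES = [
        "chr1\texon\t100\t200",
        "chr1\texon\t300\t400",
        "chr2\texon\t150\t250",
    ]


    def split_at_column(lines, column=1):
        """Yield (value, chunk) for runs of lines sharing a 1-based column."""
        def key(line):
            return line.split("\t")[column - 1]

        for value, chunk in groupby(lines, key=key):
            yield value, list(chunk)


    chunks = dict(split_at_column(LINES))

Each chunk could then be handed to a separate job, with the results
concatenated in the original order.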