================
Using CGAT Tools
================

Command line usage
==================

CGAT tools are written for command line usage with a consistent
interface that makes them amenable to integration in pipelines. Tools
are accessed through the :file:`cgat` front-end, which is installed in
your PATH. To get a list of all available commands, type::

   cgat --help

Command line help for individual tools is available through each
tool's ``--help`` option::

   cgat gff2gff --help

Logging
-------

CGAT scripts output logging information as comments starting with a
``#`` into stdout or into a separate log file (``--log``). Below is an
example of logging output::

   # output generated by /ifs/devel/andreas/cgat/beds2beds.py --force --exclusive --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz
   # job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
   # pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
   # exclusive                : True
   # filename_update          : None
   # ignore_strand            : False
   # loglevel                 : 1
   # method                   : unmerged-combinations
   # output_filename_pattern  : 030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz
   # output_force             : True
   # pattern_id               : (.*).bed.gz
   # stderr                   : <open file '<stderr>', mode 'w' at 0x2ba70e0c2270>
   # stdin                    : <open file '<stdin>', mode 'r' at 0x2ba70e0c2150>
   # stdlog                   :
   # stdout                   : <open file '<stdout>', mode 'w' at 0x2ba70e0c21e0>
   # timeit_file              : None
   # timeit_header            : None
   # timeit_name              : all
   # tracks                   : None

The header contains information about:

* the script name (``beds2beds.py``)
* the command line options (``--force --exclusive --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz``)
* the time when the job was started (``Thu Mar 29 13:06:33 2012``)
* the host it was executed on (``cgat150.anat.ox.ac.uk``)
* a unique job id (``e1c16e80-03a1-4023-9417-f3e44e33bdcd``)
* the pid of the job (``16649``)
* the system specification (``Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64``)

Once a script has completed successfully, it writes a footer to the
log file. Below is typical output::

   # job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36 0.45 0.00 0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd

The footer contains information about:

* the fact that the job has finished (``job finished``)
* the time it took to execute (``11 seconds``)
* when it completed (``Thu Mar 29 13:06:44 2012``)
* some benchmarking information (``11.36 0.45 0.00 0.01``), which is
  ``user time``, ``system time``, ``child user time`` and
  ``child system time``
* the unique job id (``e1c16e80-03a1-4023-9417-f3e44e33bdcd``)

The unique job id can be used to retrieve matching header and footer
information from a concatenation of log files.

The logging level is set with the ``--verbose`` option. A level of
``0`` means no logging output, ``1`` outputs information messages
only, and ``2`` also outputs debugging information.

I/O redirection
---------------

Most scripts work by reading data from :term:`stdin` and outputting
data to :term:`stdout`. Both can be redirected to files with the
``-I/--stdin`` and ``-O/--stdout`` options. :term:`stderr` can be
redirected with ``-E/--stderr``.

Indexing genomes
================

Many CGAT tools require genomic information: some require the actual
genomic sequence, but many require information about chromosome
sizes. Many tools therefore have the obligatory option
``--genome-file``. The ``genome-file`` argument points to an indexed
fasta file. CGAT tools can read two different kinds of index: files
indexed with the supplied :doc:`scripts/index_fasta` script, or files
indexed with the samtools_ ``faidx`` command.

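
To make the link between a fasta index and chromosome sizes concrete,
the sketch below reads sequence lengths from a samtools_ ``faidx``
index (a ``.fai`` file), whose first two tab-separated columns are the
sequence name and its length. This is only an illustration and not
how CGAT tools read the index internally; the function name
``read_fai_sizes`` and the example index content are made up for this
sketch, and the index format written by ``index_fasta`` is not covered.

```python
import io


def read_fai_sizes(handle):
    """Return {sequence name: length} from a samtools faidx index.

    Hypothetical helper for illustration; not part of CGAT.
    """
    sizes = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        fields = line.split("\t")
        sizes[fields[0]] = int(fields[1])
    return sizes


# Made-up index content, standing in for a real "genome.fa.fai" file.
example_fai = io.StringIO("chr1\t1000\t6\t60\t61\nchr2\t500\t1029\t60\t61\n")
sizes = read_fai_sizes(example_fai)
print(sizes)
```

With a real index, the ``io.StringIO`` object would be replaced by an
open file handle on the ``.fai`` file.
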
Pipeline usage
==============

We use a light-weight workflow system called ruffus_, but others, such
as galaxy_ (see :ref:`GalaxyInstallation`), are equally possible.
These systems allow CGAT tools to be run in an automated fashion.

Using unix pipes, CGAT tools can also easily be run in parallel. For
example, we have a script called ``farm.py`` (not part of the CGAT
collection, but within the CGAT repository) that splits input data
and runs the separate chunks on our compute cluster. Below is a simple
example of running the command::

   zcat geneset.gtf.gz | cgat gtf2table --counter=length --log=log | gzip > out.tsv.gz

in parallel on the cluster, running one job per chromosome::

   zcat geneset.gtf.gz | farm.py --split-at-column=1 "cgat gtf2table --counter=length --log=log" | gzip > out.tsv.gz
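
The idea behind splitting at a column can be sketched in a few lines
of Python. This is not the implementation of ``farm.py``: the names
``split_at_column`` and ``total_length`` and the example records are
hypothetical, and the chunks are processed one after another in a
single process rather than being submitted as cluster jobs. The sketch
assumes that the column number is 1-based, so that column 1 of a GTF
file is the chromosome.

```python
import io
from collections import defaultdict


def split_at_column(lines, column=1):
    """Group tab-separated lines by the value in a 1-based column."""
    chunks = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key = line.rstrip("\n").split("\t")[column - 1]
        chunks[key].append(line)
    return chunks


def total_length(chunk):
    """Hypothetical per-chunk task: sum the interval lengths in a chunk."""
    total = 0
    for line in chunk:
        fields = line.split("\t")
        # GTF start (column 4) and end (column 5) are 1-based and inclusive.
        total += int(fields[4]) - int(fields[3]) + 1
    return total


# Made-up GTF-like records for illustration only.
gtf = io.StringIO(
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id \"g1\";\n"
    "chr2\tsrc\texon\t11\t20\t.\t-\t.\tgene_id \"g2\";\n"
    "chr1\tsrc\texon\t201\t250\t.\t+\t.\tgene_id \"g1\";\n"
)

# Each chunk could be handed to a separate job; here they run sequentially.
results = {
    chrom: total_length(chunk)
    for chrom, chunk in split_at_column(gtf).items()
}
print(results)
```

Because each chunk holds the records of a single chromosome, the
per-chunk results are independent of each other and can simply be
collected once all chunks have been processed.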