Experiment.py - Tools for scripts

Author: Andreas Heger
Date: December 09, 2013

The Experiment module contains utility functions for logging and record keeping of scripts.

This module is imported by most CGAT scripts. It provides convenient and consistent methods for logging and record keeping.

The basic usage of this module within a script is:

"""script_name.py - my script

Mode Documentation
"""
import sys
import optparse
import CGAT.Experiment as E

def main( argv = None ):
    """script main.

    parses command line options in sys.argv, unless *argv* is given.
    """

    if not argv: argv = sys.argv

    # setup command line parser
    parser = E.OptionParser( version = "%prog version: $Id$",
                             usage = globals()["__doc__"] )
    parser.add_option("-t", "--test", dest="test", type="string",
                      help="supply help"  )

    # add common options (-h/--help, ...) and parse command line
    (options, args) = E.Start( parser )

    # do something
    # ...
    E.info( "an information message" )
    E.warn( "a warning message")

    # write footer and output benchmark information.
    E.Stop()

if __name__ == "__main__":
    sys.exit( main( sys.argv) )

Record keeping

The central functions in this module are the Start() and Stop() methods, which are called before and after any work is done within a script, respectively.

Experiment.Start(parser=None, argv=None, quiet=False, no_parsing=False, add_csv_options=False, add_mysql_options=False, add_psql_options=False, add_pipe_options=True, add_cluster_options=False, add_output_options=False, return_parser=False)

set up an experiment.

Parameters:

  • parser: an E.OptionParser instance with command line options.
  • argv: command line options to parse. Defaults to sys.argv.
  • quiet: set loglevel to 0 (no logging).
  • no_parsing: do not parse command line options.
  • return_parser: return the parser object, no parsing.
  • add_csv_options: add common options for parsing tab-separated (tsv) files.
  • add_mysql_options: add common options for connecting to mysql databases.
  • add_psql_options: add common options for connecting to postgres databases.
  • add_pipe_options: add common options for redirecting input/output.
  • add_cluster_options: add common options for scripts submitting jobs to the cluster.
  • add_output_options: add common options for working with multiple output files.

Returns a tuple (options, args) with the parsed options and a list of positional arguments.

The Start() method will also set up a file logger.

The default options added by this method are:

  • the loglevel
  • turn on benchmarking information and save to file
  • name to use for timing information
  • output header for timing information

Optional options added are:


  • csv_dialect: the default is excel-tab, i.e. tab-separated (tsv) files
  • psql connection string
  • psql user name
  • use cluster
  • cluster priority to request
  • cluster queue to use
  • number of jobs to submit to the cluster at the same time
  • additional options to the cluster for each job
  • pattern to use for output filenames

Start() is called with an E.OptionParser object. It adds further command line arguments, such as --help for command line help or --verbose to control the loglevel. It can also add optional arguments for scripts that need database access, write to multiple output files, etc.
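As a rough illustration of this idea, the sketch below adds a --verbose option to a plain optparse parser before parsing. The helper name add_common_options_sketch is hypothetical and only stands in for what Start() does internally; the dest name loglevel and the default of 1 are assumptions taken from the loglevel entry in the sample log header, not from the module's source.

```python
import optparse


def add_common_options_sketch(parser):
    """Hypothetical stand-in for the option setup done by Start()."""
    # --verbose controls the loglevel; dest and default are assumptions
    parser.add_option("-v", "--verbose", dest="loglevel", type="int",
                      default=1, help="loglevel [default=%default]")
    return parser


# script-specific options are defined first, as in the usage example
parser = optparse.OptionParser(usage="%prog [options] files")
parser.add_option("-t", "--test", dest="test", type="string",
                  help="supply help")
add_common_options_sketch(parser)

# parse an explicit argument list instead of sys.argv
options, args = parser.parse_args(["--verbose=2", "-t", "x", "in.bed"])
```

After parsing, options.loglevel holds the value given on the command line and args holds the remaining positional arguments.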

Start() will write record keeping information to a logfile. Typically, logging information is output on stdout, prefixed by a #, but it can be re-directed to a separate file. Below is a typical output:

# output generated by /ifs/devel/andreas/cgat/beds2beds.py --force --exclusive --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz
# job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
# pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
# exclusive                               : True
# filename_update                         : None
# ignore_strand                           : False
# loglevel                                : 1
# method                                  : unmerged-combinations
# output_filename_pattern                 : 030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz
# output_force                            : True
# pattern_id                              : (.*).bed.gz
# stderr                                  : <open file '<stderr>', mode 'w' at 0x2ba70e0c2270>
# stdin                                   : <open file '<stdin>', mode 'r' at 0x2ba70e0c2150>
# stdlog                                  : <open file '030m.intersection.tsv.log', mode 'a' at 0x1f1a810>
# stdout                                  : <open file '<stdout>', mode 'w' at 0x2ba70e0c21e0>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tracks                                  : None

The header contains information about:

  • the script name (beds2beds.py)
  • the command line options (--force --exclusive --method=unmerged-combinations --output-filename-pattern=030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz --log=030m.intersection.tsv.log Irf5-030m-R1.bed.gz Rela-030m-R1.bed.gz)
  • the time when the job was started (Thu Mar 29 13:06:33 2012)
  • the host on which it was executed (cgat150.anat.ox.ac.uk)
  • a unique job id (e1c16e80-03a1-4023-9417-f3e44e33bdcd)
  • the pid of the job (16649)
  • the system specification (Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64)

It is followed by a list of all options that have been set in the script.
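Because the option list uses a regular "# name : value" layout, it can be read back into a dictionary. The following is a minimal sketch, assuming the layout of the sample header shown above; parse_option_lines is a hypothetical helper and not part of the Experiment module, and all values are kept as strings.

```python
import re

# a few lines copied from the sample header
HEADER = """\
# pid: 16649, system: Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64
# exclusive                               : True
# loglevel                                : 1
# method                                  : unmerged-combinations
"""


def parse_option_lines(text):
    """Collect '# name : value' lines into a dict of strings."""
    options = {}
    for line in text.splitlines():
        # option lines have whitespace between the name and the colon,
        # which distinguishes them from the 'pid:' line
        match = re.match(r"#\s+(\w+)\s+: (.*)$", line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


parsed = parse_option_lines(HEADER)
```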

Once completed, a script will call the Stop() function to signify the end of the experiment.


Experiment.Stop()

stop the experiment.

Stop() will output to the log file that the script has concluded successfully. Below is typical output:

# job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36  0.45  0.00  0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd

The footer contains information about:

  • that the job has finished (job finished)
  • the time it took to execute (11 seconds)
  • when it completed (Thu Mar 29 13:06:44 2012)
  • some benchmarking information (11.36  0.45  0.00  0.01), which is user time, system time, child user time and child system time
  • the unique job id (e1c16e80-03a1-4023-9417-f3e44e33bdcd)

The unique job id can be used to easily retrieve matching information from a concatenation of log files.
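A minimal sketch of such a lookup is shown here, assuming the header and footer layout of the sample output in this section. The helpers lines_for_job and parse_footer are hypothetical and not part of the Experiment module, and the second job id in the example log is a made-up placeholder.

```python
import re

JOB_ID = "e1c16e80-03a1-4023-9417-f3e44e33bdcd"

# two lines from the sample output plus one line with a placeholder job id
LOG = """\
# job started at Thu Mar 29 13:06:33 2012 on cgat150.anat.ox.ac.uk -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
# job started at Thu Mar 29 13:06:40 2012 on cgat150.anat.ox.ac.uk -- 00000000-0000-0000-0000-000000000000
# job finished in 11 seconds at Thu Mar 29 13:06:44 2012 -- 11.36  0.45  0.00  0.01 -- e1c16e80-03a1-4023-9417-f3e44e33bdcd
"""

FOOTER = re.compile(
    r"# job finished in (\d+) seconds at (.+?) -- "
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+) -- (\S+)$")


def lines_for_job(text, job_id):
    """Return all log lines that end with the given job id."""
    return [line for line in text.splitlines() if line.endswith(job_id)]


def parse_footer(line):
    """Split a footer line into its fields, or return None."""
    match = FOOTER.match(line)
    if match is None:
        return None
    seconds, finished, user, system, cuser, csystem, job_id = match.groups()
    return {"seconds": int(seconds), "finished": finished,
            "user": float(user), "system": float(system),
            "child_user": float(cuser), "child_system": float(csystem),
            "job_id": job_id}


matching = lines_for_job(LOG, JOB_ID)
footer = parse_footer(matching[-1])
```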


Complete reference

class Experiment.AppendCommaOption(*opts, **attrs)

Bases: optparse.Option

Option with additional parsing capabilities.

  • A "," in arguments to options that have the action 'append' is treated as separating a list of values. This is what Galaxy does, and it is generally convenient.
  • Option values of "None" and "" are treated as default values.
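The comma handling can be sketched with a small optparse.Option subclass. This is an illustration under assumptions, not the module's implementation: the class name CommaAppendSketch is hypothetical, the sketch hooks into take_action(), and it does not cover the handling of "None" and "" values.

```python
import optparse


class CommaAppendSketch(optparse.Option):
    """Illustrative Option that splits comma-separated 'append' values."""

    def take_action(self, action, dest, opt, value, values, parser):
        if action == "append" and isinstance(value, str) and "," in value:
            # append each comma-separated part as a separate value
            for part in value.split(","):
                values.ensure_value(dest, []).append(part)
            return 1
        return optparse.Option.take_action(
            self, action, dest, opt, value, values, parser)


parser = optparse.OptionParser(option_class=CommaAppendSketch)
parser.add_option("-t", "--track", dest="tracks", action="append",
                  type="string")
options, args = parser.parse_args(["-t", "a,b", "-t", "c"])
```

With this option class, "-t a,b -t c" yields the same list as "-t a -t b -t c".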

class Experiment.OptionParser(*args, **kwargs)

Bases: optparse.OptionParser

CGAT derivative of OptionParser.


add_option(opt_str, ..., kwarg=val, ...)

check_values(values : Values, args : [string])

-> (values : Values, args : [string])

Check that the supplied option values and leftover arguments are valid. Returns the option values and leftover arguments (possibly adjusted, possibly completely new – whatever you like). Default implementation just returns the passed-in values; subclasses may override as desired.


destroy()

Declare that you are done with this OptionParser. This cleans up reference cycles so the OptionParser (and all objects referenced by it) can be garbage-collected promptly. After calling destroy(), the OptionParser is unusable.


disable_interspersed_args()

Set parsing to stop on the first non-option. Use this if you have a command processor which runs another command that has options of its own and you want to make sure these options don't get confused.


enable_interspersed_args()

Set parsing to not stop on the first non-option, allowing interspersing switches with command arguments. This is the default behavior. See also disable_interspersed_args() and the class documentation description of the attribute allow_interspersed_args.

error(msg : string)

Print a usage message incorporating ‘msg’ to stderr and exit. If you override this in a subclass, it should not return – it should either exit or raise an exception.

parse_args(args=None, values=None)
parse_args(args : [string] = sys.argv[1:],
values : Values = None)

-> (values : Values, args : [string])

Parse the command-line options found in 'args' (default: sys.argv[1:]). Any errors result in a call to 'error()', which by default prints the usage message to stderr and calls sys.exit() with an error message. On success returns a pair (values, args) where 'values' is a Values instance (with all your option values) and 'args' is the list of arguments left over after parsing options.

print_help(file : file = stdout)

Print an extended help message, listing all options and any help text provided with them, to ‘file’ (default stdout).

print_usage(file : file = stdout)

Print the usage message for the current program (self.usage) to ‘file’ (default stdout). Any occurrence of the string “%prog” in self.usage is replaced with the name of the current program (basename of sys.argv[0]). Does nothing if self.usage is empty or not defined.

print_version(file : file = stdout)

Print the version message for this program (self.version) to ‘file’ (default stdout). As with print_usage(), any occurrence of “%prog” in self.version is replaced by the current program’s name. Does nothing if self.version is empty or undefined.

Experiment.openFile(filename, mode='r', create_dir=False)

open the file filename with mode mode.

If create_dir is set, the directory containing filename will be created if it does not exist.

gzip-compressed files are recognized by the suffix .gz and opened transparently.

Note that there are differences in the file-like objects returned, for example in the ability to seek.

returns a file or file-like object.
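The behaviour described here can be approximated with the standard library. The following is a minimal sketch, not the module's code; open_file_sketch is a hypothetical name, and the text modes used with gzip.open assume Python 3.

```python
import gzip
import os


def open_file_sketch(filename, mode="r", create_dir=False):
    """Open filename, using gzip for names ending in '.gz'."""
    if create_dir:
        dirname = os.path.dirname(filename)
        # create the containing directory if it is missing
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
    if filename.endswith(".gz"):
        # gzip.open returns a file-like object, not a plain file
        return gzip.open(filename, mode)
    return open(filename, mode)
```

A caller would use it like the built-in open(), for example open_file_sketch("out.dir/table.tsv.gz", "wt", create_dir=True).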


return a header string with command line options and timestamp


return a string containing script parameters.

Parameters are all variables that start with param_.


return a header string with command line options and timestamp.

Experiment.Start(parser=None, argv=None, quiet=False, no_parsing=False, add_csv_options=False, add_mysql_options=False, add_psql_options=False, add_pipe_options=True, add_cluster_options=False, add_output_options=False, return_parser=False)

set up an experiment.

Experiment.Stop()

stop the experiment.


decorator collecting wall clock time spent in decorated method.
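A decorator of this kind could look roughly like the sketch below. The name timed_sketch and the attributes total_seconds and calls are hypothetical; how the module's own decorator stores or reports the collected time is not documented here.

```python
import functools
import time


def timed_sketch(func):
    """Accumulate wall clock time spent in func on the wrapper itself."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            # record the elapsed time even if func raises
            wrapper.total_seconds += time.time() - start
            wrapper.calls += 1
    wrapper.total_seconds = 0.0
    wrapper.calls = 0
    return wrapper


@timed_sketch
def work(n):
    return sum(range(n))


work(1000)
work(1000)
```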


decorator for caching a method.


Decorator that caches a function’s return value each time it is called. If called later with the same arguments, the cached value is returned, and not re-evaluated.

Taken from http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize
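The caching behaviour described for this decorator can be sketched in a few lines. This is an illustration with a hypothetical name (memoize_sketch), not the code from the linked page; it assumes that all arguments are positional and hashable.

```python
import functools


def memoize_sketch(func):
    """Cache return values of func, keyed by its positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            # first call with these arguments: evaluate and store
            cache[args] = func(*args)
        return cache[args]
    return wrapper


evaluated = []


@memoize_sketch
def square(x):
    evaluated.append(x)
    return x * x


first = square(4)
second = square(4)  # served from the cache, square() is not re-evaluated
```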

Experiment.log(loglevel, message)

log message at loglevel.


log information message, see the logging module


log warning message, see the logging module


log warning message, see the logging module


log debugging message, see the logging module


log error message, see the logging module


log critical message, see the logging module
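These functions follow the conventions of the standard logging module. The sketch below shows how messages prefixed with "#" can be produced with that module alone; the helper name make_logger_sketch and the message format are assumptions, and the format actually used by the Experiment module may differ.

```python
import io
import logging


def make_logger_sketch(stream, name="experiment_sketch"):
    """Return a logger writing '#'-prefixed messages to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    # the '#' prefix marks log lines so they can be told apart from data
    handler.setFormatter(logging.Formatter("# %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = make_logger_sketch(stream)
logger.info("an information message")
logger.warning("a warning message")
logger.debug("a debugging message")  # suppressed at INFO level
```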


return filename to write to.

Experiment.openOutputFile(section, mode='w')

open file for writing substituting section in the output_pattern (if defined).

If the filename ends with ".gz", the output is opened as a gzipped file.
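The substitution step can be pictured as follows. This is a sketch under assumptions: output_filename_sketch is a hypothetical helper, the "%s" placeholder follows the output_filename_pattern shown in the sample log header, the section name "Irf5" is made up, and falling back to the section name when no pattern is defined is a guess.

```python
def output_filename_sketch(section, pattern=None):
    """Build an output filename by inserting section into pattern."""
    if pattern:
        # replace the %s placeholder with the section name
        return pattern.replace("%s", section)
    # assumption: without a pattern the section itself is used
    return section


pattern = "030m.intersection.tsv.dir/030m.intersection.tsv-%s.bed.gz"
filename = output_filename_sketch("Irf5", pattern)
is_gzipped = filename.endswith(".gz")
```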

class Experiment.Counter

Bases: object

a counter class.

The counter acts both as a dictionary and as an object permitting attribute access.

Counts are automatically initialized to 0.

Instantiate and use like this:

c = Counter()
c.input += 1
c.output += 2
c["skipped"] += 1

print str(c)

Store data returned by function.


return values as tab-separated table (without header).
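One way to obtain both dictionary and attribute access with counts starting at 0 is sketched below. CounterSketch is a hypothetical class written for illustration; the method name as_table, the sorted order and the exact output format are assumptions rather than the behaviour of Experiment.Counter.

```python
class CounterSketch(object):
    """Counter with item and attribute access; missing counts are 0."""

    def __init__(self):
        # bypass our own __setattr__ when creating the storage dict
        object.__setattr__(self, "_counts", {})

    def __getitem__(self, key):
        return self._counts.get(key, 0)

    def __setitem__(self, key, value):
        self._counts[key] = value

    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        return self._counts.get(name, 0)

    def __setattr__(self, name, value):
        self._counts[name] = value

    def as_table(self):
        """Return counts as tab-separated lines without a header."""
        return "\n".join("%s\t%i" % item
                         for item in sorted(self._counts.items()))


c = CounterSketch()
c.input += 1
c.output += 2
c["skipped"] += 1
table = c.as_table()
```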

Experiment.run(cmd, return_stdout=False, **kwargs)

execute a command line cmd.

returns the return code.

If return_stdout is True, the contents of stdout are returned.

kwargs are passed on to subprocess.call or subprocess.check_output.

raises OSError if process failed or was terminated.
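The described behaviour maps onto the subprocess module roughly as in the sketch below. run_sketch is a hypothetical name, running the command through the shell is an assumption, and note that subprocess.check_output itself raises CalledProcessError rather than OSError on a non-zero exit status.

```python
import subprocess


def run_sketch(cmd, return_stdout=False, **kwargs):
    """Run the command line cmd; return its exit code or its stdout."""
    if return_stdout:
        # returns the captured output (bytes in Python 3)
        return subprocess.check_output(cmd, shell=True, **kwargs)
    retcode = subprocess.call(cmd, shell=True, **kwargs)
    if retcode < 0:
        raise OSError("child was terminated by signal %i: %s"
                      % (-retcode, cmd))
    if retcode > 0:
        raise OSError("child returned error code %i: %s" % (retcode, cmd))
    return retcode


retcode = run_sketch("exit 0")
output = run_sketch("echo hello", return_stdout=True)
```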

