Clustering metagenomic contigs on tetranucleotide frequency
============================================================

Metagenomic sequencing is widely used to assess the functional potential of
microbial communities across many environments. Often the first step in a
metagenomic analysis is the assembly of short reads into longer contigs, which
permits gene and function predictions to be made. However, because samples are
complex, an assembly typically yields many contigs that derive from a variety of
species present in the community. Assigning contigs to species is non-trivial.
Nevertheless, researchers often use nucleotide composition as a first step in
clustering related contigs. A common method is to compute tetranucleotide
frequencies for each contig and cluster the results. Here we explain how to use
the CGAT script ``fasta2kmercontent.py`` to calculate tetranucleotide frequencies
for a set of contigs (the script supports k-mers of up to 8 bases).

Our input is a :term:`fasta` formatted file representing a set of contigs derived from a
metagenome assembly, ``metagenome_contigs.fa``. A simple command line statement will compute
the tetranucleotide frequency for the set of contigs::

    cat metagenome_contigs.fa | fasta2kmercontent --kmer 4 --proportion > metagenome_tetranucleotide_freq.tsv

Notice that we specify the ``--proportion`` option in this example. Contigs differ in
length, so raw tetramer counts scale with contig size; expressing each tetramer as a
proportion makes the profiles comparable between contigs.
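
To illustrate why proportions are used, here is a minimal Python sketch of the
calculation. It is not the code of ``fasta2kmercontent.py``: the function name
``kmer_proportions`` is hypothetical, and the choices to count overlapping windows
on a single strand and to skip windows containing ambiguous bases are assumptions
made for the example.

```python
from collections import Counter
from itertools import product


def kmer_proportions(sequence, k=4):
    """Return {kmer: proportion} over all A/C/G/T k-mers of one sequence.

    Assumptions (not taken from the CGAT script): overlapping windows,
    forward strand only, windows with non-ACGT characters are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        # Sequence shorter than k, or no unambiguous window at all.
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


# Two toy "contigs" with the same composition but a tenfold length difference.
short_contig = kmer_proportions("ACGTACGT" * 10)
long_contig = kmer_proportions("ACGTACGT" * 100)
print(short_contig["ACGT"], long_contig["ACGT"])
```

The raw count of ``ACGT`` differs tenfold between the two toy sequences, while the
proportions are close to each other, which is the property needed before clustering.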

The output will be a tab-delimited text file with contigs as columns and tetramers as rows.

+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|kmer|Streptococcus_suis26|Streptococcus_suis27|Streptococcus_suis24|Streptococcus_suis25|Bacteroides_thetaiotaomicron101|Bacteroides_thetaiotaomicron23|
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|GTAC|0.0016393442623     |0.00234100663285    |0.00522778192681    |0.00265428002654    |0.00303990610329               |0.00334864510152              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|CGAG|0.0016393442623     |0.00195083886071    |0.00124470998257    |0.000663570006636   |0.00129694835681               |0.00128348645102              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|GTAA|0.00327868852459    |0.00390167772142    |0.0049788399303     |0.00729927007299    |0.0037646713615                |0.00467073264881              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|CGAA|0.00327868852459    |0.00429184549356    |0.00224047796863    |0.00199071001991    |0.00422828638498               |0.0042847216861               |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|AAAT|0.0131147540984     |0.00819352321498    |0.00398307194424    |0.00729927007299    |0.00776115023474               |0.0080869296688               |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|CGAC|0.0016393442623     |0.000390167772142   |0.00199153597212    |0.00132714001327    |0.00261443661972               |0.00177565042847              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|GTAT|0.00655737704918    |0.00156067108857    |0.00373412994772    |0.00398142003981    |0.00450704225352               |0.00579981471474              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|AGTG|0.0                 |0.00546234880999    |0.00323624595469    |0.00398142003981    |0.00215962441315               |0.00340654674593              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|AGTA|0.00327868852459    |0.00429184549356    |0.00373412994772    |0.00331785003318    |0.00409330985915               |0.00409171620474              |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
|... |...                 |...                 |...                 |...                 |...                            |...                           |
+----+--------------------+--------------------+--------------------+--------------------+-------------------------------+------------------------------+
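
The layout above (one row per tetramer, one column per contig) can also be read
with the Python standard library. The sketch below is only an illustration: the
helper name ``read_kmer_table`` is hypothetical, and the embedded text reuses two
rows and three columns from the table above in place of the real output file.

```python
import csv
import io

# Two rows and three contig columns copied from the table above.
example_tsv = (
    "kmer\tStreptococcus_suis26\tStreptococcus_suis27\tStreptococcus_suis24\n"
    "GTAC\t0.0016393442623\t0.00234100663285\t0.00522778192681\n"
    "CGAG\t0.0016393442623\t0.00195083886071\t0.00124470998257\n"
)


def read_kmer_table(handle):
    """Return (kmers, {contig: [proportion per kmer]}) from a k-mer table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    contigs = header[1:]  # the first column holds the k-mer itself
    kmers = []
    profiles = {contig: [] for contig in contigs}
    for row in reader:
        if not row:
            continue  # skip blank lines
        kmers.append(row[0])
        for contig, value in zip(contigs, row[1:]):
            profiles[contig].append(float(value))
    return kmers, profiles


kmers, profiles = read_kmer_table(io.StringIO(example_tsv))
print(kmers)
print(profiles["Streptococcus_suis27"])
```

For a real run, the ``io.StringIO`` object would be replaced by the opened output
file; each contig then carries one vector of tetramer proportions.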


As the output is in tab-separated format, it is straightforward to load into statistical or plotting software such as R and perform
further downstream analysis. For example, we can perform a simple clustering analysis on the results. Start R and type::

    R version 2.15.2 (2012-10-26) -- "Trick or Treat"
    Copyright (C) 2012 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0
    Platform: x86_64-unknown-linux-gnu (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    R is a collaborative project with many contributors. 
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.


    > tetra <- read.csv("metagenome_tetranucleotide_freq.tsv", header = TRUE, stringsAsFactors = FALSE, sep = "\t", row.names = 1)
    > plot(hclust(dist(t(tetra))))
 
This will produce a cluster dendrogram like the one displayed below.


.. image:: ../plots/metagenome_contigs_tetra.png 
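
The dendrogram is built from pairwise distances between contig profiles: by
default, R's ``dist`` computes Euclidean distances and ``hclust`` then joins the
closest groups using complete linkage. As a rough sketch of the distance step only,
the Python below computes Euclidean distances for three contigs, using the
proportions of just three tetramers copied from the table above. The function name
``pairwise_euclidean`` is hypothetical, ``math.dist`` requires Python 3.8 or later,
and distances over three tetramers are not expected to reproduce the full plot.

```python
import math
from itertools import combinations

# Proportions for three k-mers (GTAC, CGAG, GTAA) copied from the table
# above; a real profile would hold all 256 tetranucleotides.
profiles = {
    "Streptococcus_suis26": [0.0016393442623, 0.0016393442623, 0.00327868852459],
    "Streptococcus_suis27": [0.00234100663285, 0.00195083886071, 0.00390167772142],
    "Bacteroides_thetaiotaomicron101": [0.00303990610329, 0.00129694835681, 0.0037646713615],
}


def pairwise_euclidean(profiles):
    """Return {(contig_a, contig_b): distance} for every unordered pair."""
    return {
        (a, b): math.dist(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


distances = pairwise_euclidean(profiles)
# List the pairs from most to least similar.
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a}\t{b}\t{d:.6f}")
```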


This example uses simulated metagenomic data, so we know the source of each contig. We can see that it
is possible to separate Streptococcus species from Bacteroides on the basis of tetranucleotide composition. There is less
separation between the two closely related Bacteroides species. Although this example dataset is unrealistically simple, it
shows how easily CGAT tools can be used for a quick assessment of data.