================================= Testing for Functional enrichment ================================= This tutorial demonstrates the usage of *gat* with a simple example - where does a transcription factor bind in the genome? This tutorial uses the SRF data set described in `Valouev et al. (2008)`_. The data sets used in this tutorial are available at: http://www.cgat.org/~andreas/sample_data/srf.hg19.bed.gz This :term:`bed` formatted file contains 556 high confidence peaks from the analysis of `Valouev et al. (2008)`_ mapped to human chromosome hg19. We want to find out, where these binding sites are located in the genome. First let us download the genomic sequence for hg19 and index it:: wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz | cgat index_fasta --file-format=tar.gz hg19 - > hg19.log Next, we need to define where intronic and intergenic regions are located in the genome. We do this by by obtaining the latest geneset from ENSEMBL_ and pushing it through a sequence of commands:: wget -qO- ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz | gunzip | awk '$2 == "protein_coding"' | cgat gff2gff --genome-file=hg19 --sanitize=genome --skip-missing | cgat gtf2gtf --sort=gene | cgat gtf2gtf --merge-exons --with-utr | cgat gtf2gtf --filter=longest-gene | cgat gtf2gtf --sort=position | cgat gtf2gff --genome-file=hg19 --flank=5000 --method=genome | gzip > annotations.hg19.gff.gz The commands do the following. 1. Reconcile the chromosome names in the gene set (ENSEMBL: 1,2,3) with the UCSC convention (chr1,chr2,chr3):: | cgat gff2ff --genome-file=hg19 --sanitize=genome --skip-missing 2. Sort the gene set by gene making sure that all exons within a gene appear in a block:: | cgat gtf2gtf --merge-exons --with-utr 3. Merge overlapping exons from alternative transcripts of the same gene:: | cgat gtf2gtf --merge-exons --with-utr 4. Resolve nested genes. In nested genes a genomic region might be both defined intronic and intergenic. Here, we select the longer one:: | cgat gtf2gtf --filter=longest-gene 5. Sort by genomic position:: | cgat gtf2gtf --sort=position 6. Define intronic, intergenic and other gene set based annotations:: | cgat gtf2gff --genome-file=hg19 --flank=5000 --method=genome The tool gff2stats We can now use the file :file:`annotations.hg19.gff.gz` to classify individual peaks with the :doc:`../../scripts/bed2table` tool:: zcat srf.hg19.bed.gz | cgat bed2table --genome-file=hg19 --counter=classify-chipseq --filename-gff=annotations.hg19.gff.gz | gzip > srf.hg19.tsv.gz The table :file:`srf.hg19.tsv.gz` contains a row for each interval in the input file :file:`srf.hg19.bed.gz` describing which genomic features it overlaps and assigns it to a category such as introning, intergenic, etc.. We can upload this file into a database or view in excel to easily filter and summarize the data. To get a more global view of where the transcription factor binds, we make use of the gat_ tool. Gat tests if two sets of genomic features are overlapping more - or less - than expected by chance through simulation. Note that gat_ needs to be installed separately (``pip install gat``). For gat, we need the file with genomic annotations (:file:`annotations.hg19.gff.gz`) we created previously and a workspace - a set of genomic regions that are accessible for simulation. Here, we will use the full genome for simulation excluding regions that are gaps as we do not expect to be able to detect transcription factor binding sites in those in an NGS experiment. To get these regions, we use the :doc:`../../scripts/fasta2bed` tool:: cat hg19.fasta | cgat fasta2bed --method=ungapped --min-gap-size=100 | awk '$1 ~ /^chr/' | cut -f 1,2,3 | gzip > ungapped.hg19.bed.gz Gat needs :term:`bed` formatted input files, so let us quickly convert :file:`annogations.hg19.gff.gz`:: zcat annogations.hg19.gff.gz | cgat gff2bed.py | gzip > annotations.bed.gz We are now ready to run gat:: gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz --annotations=annotations.hg19.bed.gz --workspace=ungapped.hg19.bed.gz --num-samples=1000 --log=gat.log | gzip > gat.out The option `--ignore-segment-tracks` tells *gat* to ignore the fourth column in the :term:`tracks` file and assume that all intervals in this file belong to the same :term:`track`. If not given, each interval would be treated separately. The above statement finishes in a few seconds. With large interval collections or many annotations, *gat* might take a while. It is thus good practice to always save the output in a file. The option `--log` tells gat to save information or warning messages into a separate log file. The first 11 columns of the output file are the most informative: +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |track |annotation|observed|expected |CI95low |CI95high |stddev |fold |l2fold |pvalue |qvalue | +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|telomeric |0 |69.7440 |0.0000 |200.0000 |59.6216 |0.0141 |-6.1445|2.5100e-01|3.9443e-01| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|intergenic|6200 |13909.1770|12989.0000|14800.0000|570.3231|0.4458 |-1.1656|1.0000e-03|2.2000e-03| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|intronic |8415 |11401.6660|10440.0000|12345.0000|577.7517|0.7381 |-0.4382|1.0000e-03|2.2000e-03| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|UTR3 |284 |305.5370 |114.0000 |500.0000 |120.2095|0.9297 |-0.1051|4.3000e-01|5.2556e-01| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|unknown |0 |0.0140 |0.0000 |0.0000 |0.3603 |0.9862 |-0.0201|9.9800e-01|9.9800e-01| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|frameshift|0 |0.0050 |0.0000 |0.0000 |0.0947 |0.9950 |-0.0072|9.9700e-01|9.9800e-01| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|3flank |800 |699.4930 |400.0000 |1045.0000 |187.2328|1.1435 |0.1934 |3.0300e-01|4.1662e-01| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|CDS |758 |392.1510 |192.0000 |611.0000 |131.0955|1.9306 |0.9490 |3.0000e-03|5.5000e-03| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|flank |1335 |176.1320 |50.0000 |350.0000 |90.7093 |7.5424 |2.9150 |1.0000e-03|2.2000e-03| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|5flank |6224 |742.0590 |450.0000 |1071.0000 |191.1824|8.3775 |3.0665 |1.0000e-03|2.2000e-03| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ |merged|UTR5 |3784 |104.0220 |0.0000 |237.0000 |68.5653 |36.0401|5.1715 |1.0000e-03|2.2000e-03| +------+----------+--------+----------+----------+----------+--------+-------+-------+----------+----------+ The first two columns contain the name of the :term:`track` and :term:`annotation` that are being compared. The columns :term:`observed` and :term:`expected` give the observed and expected nucleotide overlap, respectively, between the :term:`track` and :term:`annotation`. The following columns CI95low, CI95high, stddev give 95% confidence intervals and the standard deviation of the sample distribution, respectively. The :term:`fold` column is the fold enrichment or depletion and is computed as the ratio of :term:`observed` over :term:`expected`. The column :term:`l2fold` is the log2 of this ratio. The column :term:`pvalue` gives the empirical :term:`p-value`, i.e. in what proportion of samples was a higher enrichment or lower depletion found than the one that was observed. The column :term:`qvalue` lists a multiple testing corrected :term:`p-value`. Setting a qvalue threshold and accepting only those comparisons with a qvalue below that threshold corresponds to controlling the false discovery rate at that particular level. .. _Valouev et al. (2008): http://www.ncbi.nlm.nih.gov/pubmed/19160518 .. _GREAT: http://bejerano.stanford.edu/great/public/html/ .. _MacLean et al. (2010): http://www.ncbi.nlm.nih.gov/pubmed/20436461 .. _Ensembl: http:://www.ensembl.org .. _GO Gene Ontology: http://www.geneontology.org/ .. _gat: http://code.google.com/p/genomic-association-tester/