This tutorial demonstrates the usage of gat with a simple example - where does a transcription factor bind in the genome?
This tutorial uses the SRF data set described in Valouev et al. (2008). The data sets used in this tutorial are available at:
http://www.cgat.org/~andreas/sample_data/srf.hg19.bed.gz
This bed formatted file contains 556 high confidence peaks from the analysis of Valouev et al. (2008) mapped to human chromosome hg19.
We want to find out, where these binding sites are located in the genome. First let us download the genomic sequence for hg19 and index it:
wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
| cgat index_fasta --file-format=tar.gz hg19 -
> hg19.log
Next, we need to define where intronic and intergenic regions are located in the genome. We do this by by obtaining the latest geneset from ENSEMBL and pushing it through a sequence of commands:
wget -qO- ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz
| gunzip
| awk '$2 == "protein_coding"'
| cgat gff2gff --genome-file=hg19 --sanitize=genome --skip-missing
| cgat gtf2gtf --sort=gene
| cgat gtf2gtf --merge-exons --with-utr
| cgat gtf2gtf --filter=longest-gene
| cgat gtf2gtf --sort=position
| cgat gtf2gff --genome-file=hg19 --flank=5000 --method=genome
| gzip
> annotations.hg19.gff.gz
The commands do the following.
Reconcile the chromosome names in the gene set (ENSEMBL: 1,2,3) with the UCSC convention (chr1,chr2,chr3):
| cgat gff2ff --genome-file=hg19 --sanitize=genome --skip-missing
Sort the gene set by gene making sure that all exons within a gene appear in a block:
| cgat gtf2gtf --merge-exons --with-utr
Merge overlapping exons from alternative transcripts of the same gene:
| cgat gtf2gtf --merge-exons --with-utr
Resolve nested genes. In nested genes a genomic region might be both defined intronic and intergenic. Here, we select the longer one:
| cgat gtf2gtf --filter=longest-gene
Sort by genomic position:
| cgat gtf2gtf --sort=position
Define intronic, intergenic and other gene set based annotations:
| cgat gtf2gff --genome-file=hg19 --flank=5000 --method=genome
The tool gff2stats
We can now use the file annotations.hg19.gff.gz to classify individual peaks with the <no title> tool:
zcat srf.hg19.bed.gz
| cgat bed2table --genome-file=hg19 --counter=classify-chipseq --filename-gff=annotations.hg19.gff.gz
| gzip
> srf.hg19.tsv.gz
The table srf.hg19.tsv.gz contains a row for each interval in the input file srf.hg19.bed.gz describing which genomic features it overlaps and assigns it to a category such as introning, intergenic, etc.. We can upload this file into a database or view in excel to easily filter and summarize the data.
To get a more global view of where the transcription factor binds, we make use of the gat tool. Gat tests if two sets of genomic features are overlapping more - or less - than expected by chance through simulation. Note that gat needs to be installed separately (pip install gat).
For gat, we need the file with genomic annotations (annotations.hg19.gff.gz) we created previously and a workspace - a set of genomic regions that are accessible for simulation. Here, we will use the full genome for simulation excluding regions that are gaps as we do not expect to be able to detect transcription factor binding sites in those in an NGS experiment. To get these regions, we use the <no title> tool:
cat hg19.fasta
| cgat fasta2bed --method=ungapped --min-gap-size=100
| awk '$1 ~ /^chr/'
| cut -f 1,2,3
| gzip
> ungapped.hg19.bed.gz
Gat needs bed formatted input files, so let us quickly convert annogations.hg19.gff.gz:
zcat annogations.hg19.gff.gz
| cgat gff2bed.py
| gzip
> annotations.bed.gz
We are now ready to run gat:
gat-run.py
--ignore-segment-tracks
--segments=srf.hg19.bed.gz
--annotations=annotations.hg19.bed.gz
--workspace=ungapped.hg19.bed.gz
--num-samples=1000
--log=gat.log
| gzip
> gat.out
The option –ignore-segment-tracks tells gat to ignore the fourth column in the tracks file and assume that all intervals in this file belong to the same track. If not given, each interval would be treated separately.
The above statement finishes in a few seconds. With large interval collections or many annotations, gat might take a while. It is thus good practice to always save the output in a file. The option –log tells gat to save information or warning messages into a separate log file.
The first 11 columns of the output file are the most informative:
track | annotation | observed | expected | CI95low | CI95high | stddev | fold | l2fold | pvalue | qvalue |
merged | telomeric | 0 | 69.7440 | 0.0000 | 200.0000 | 59.6216 | 0.0141 | -6.1445 | 2.5100e-01 | 3.9443e-01 |
merged | intergenic | 6200 | 13909.1770 | 12989.0000 | 14800.0000 | 570.3231 | 0.4458 | -1.1656 | 1.0000e-03 | 2.2000e-03 |
merged | intronic | 8415 | 11401.6660 | 10440.0000 | 12345.0000 | 577.7517 | 0.7381 | -0.4382 | 1.0000e-03 | 2.2000e-03 |
merged | UTR3 | 284 | 305.5370 | 114.0000 | 500.0000 | 120.2095 | 0.9297 | -0.1051 | 4.3000e-01 | 5.2556e-01 |
merged | unknown | 0 | 0.0140 | 0.0000 | 0.0000 | 0.3603 | 0.9862 | -0.0201 | 9.9800e-01 | 9.9800e-01 |
merged | frameshift | 0 | 0.0050 | 0.0000 | 0.0000 | 0.0947 | 0.9950 | -0.0072 | 9.9700e-01 | 9.9800e-01 |
merged | 3flank | 800 | 699.4930 | 400.0000 | 1045.0000 | 187.2328 | 1.1435 | 0.1934 | 3.0300e-01 | 4.1662e-01 |
merged | CDS | 758 | 392.1510 | 192.0000 | 611.0000 | 131.0955 | 1.9306 | 0.9490 | 3.0000e-03 | 5.5000e-03 |
merged | flank | 1335 | 176.1320 | 50.0000 | 350.0000 | 90.7093 | 7.5424 | 2.9150 | 1.0000e-03 | 2.2000e-03 |
merged | 5flank | 6224 | 742.0590 | 450.0000 | 1071.0000 | 191.1824 | 8.3775 | 3.0665 | 1.0000e-03 | 2.2000e-03 |
merged | UTR5 | 3784 | 104.0220 | 0.0000 | 237.0000 | 68.5653 | 36.0401 | 5.1715 | 1.0000e-03 | 2.2000e-03 |
The first two columns contain the name of the track and annotation that are being compared. The columns observed and expected give the observed and expected nucleotide overlap, respectively, between the track and annotation.
The following columns CI95low, CI95high, stddev give 95% confidence intervals and the standard deviation of the sample distribution, respectively.
The fold column is the fold enrichment or depletion and is computed as the ratio of observed over expected. The column l2fold is the log2 of this ratio.
The column pvalue gives the empirical p-value, i.e. in what proportion of samples was a higher enrichment or lower depletion found than the one that was observed.
The column qvalue lists a multiple testing corrected p-value. Setting a qvalue threshold and accepting only those comparisons with a qvalue below that threshold corresponds to controlling the false discovery rate at that particular level.