1 Installation

Install the most recent stable version from Bioconductor:

source("https://bioconductor.org/biocLite.R")
biocLite("CAGEfightR")

And load CAGEfightR:

library(CAGEfightR)

Alternatetively, you can install the development version directly from GitHub using devtools:

devtools::install_github("MalteThodberg/CAGEfightR")

2 Citation

If you use CAGEfightR, please cite the following article:

citation("CAGEfightR")

3 Getting help

For general questions about the usage of CAGEfightR, use the official Bioconductor support forum and tag your question “CAGEfightR”. We strive to answer questions as quickly as possible.

For technical questions, bug reports and suggestions for new features, we refer to the CAGEfightR github page

4 Quick start for the impatient

A CAGEfightR anaysis usually starts with loading CAGE Transcription Start Sites (CTSSs) from BigWig-files (one for each strand):

# Load the example data
data("exampleDesign")
head(exampleDesign)

# Locate files on your harddrive
bw_plus <- system.file("extdata", exampleDesign$BigWigPlus, package = "CAGEfightR")
bw_minus <- system.file("extdata", exampleDesign$BigWigMinus, package = "CAGEfightR")

# Create two named BigWigFileList-objects:
bw_plus <- BigWigFileList(bw_plus)
bw_minus <- BigWigFileList(bw_minus)
names(bw_plus) <- exampleDesign$Name
names(bw_minus) <- exampleDesign$Name

The first step is to quantify CTSSs across all samples using quantifyCTSSs, which will return the results a RangedSummarizedExperiment:

# Get genome information
genomeInfo <- SeqinfoForUCSCGenome("mm9")

# Quantify CTSSs
CTSSs <- quantifyCTSSs(plusStrand = bw_plus, minusStrand = bw_minus, design = exampleDesign, 
    genome = genomeInfo)
#> Checking supplied genome compatibility...
#> Iterating over 28 genomic tiles in 3 samples using 4 worker(s)...
#> Importing CTSSs from plus strand...
#> Importing CTSSs from minus strand...
#> Merging strands...
#> ### CTSS summary ###
#> Number of samples: 3
#> Number of CTSSs: 0.04126 millions
#> Sparsity: 57.06 %
#> Final object size: 2.47 MB

The wrapper function quickTSSs will automatically find and quantify candidate TSSs, returning the results as a RangedSummarizedExperiment:

TSSs <- quickTSSs(CTSSs)
#>  - Running calcTPM and calcPooled:
#> Calculating library sizes...
#> Calculating TPM...
#> 
#>  - Running tuneTagClustering:
#> Finding thresholds to be tested...
#> Splitting coverage by strand...
#> Iterating over 11 thresholds using 4 worker(s)...
#> Analyzing plus strand...
#> Analyzing minus strand...
#> Preparing output...
#> Optimal pooled cutoff: 0
#> 
#>  - Running clusterUnidirectionally:
#> Splitting by strand...
#> Finding tag clusters...
#> Calculating tag cluster statistics...
#> Preparing output...
#> Tag clustering summary:
#>   Width Count Percent
#>   Total 21008 100.0 %
#>     >=1 18387  87.5 %
#>    >=10  2487  11.8 %
#>   >=100   134   0.6 %
#>  >=1000     0   0.0 %
#> 
#>  - Running quantifyClusters:
#> Finding overlaps...
#> Aggregating within clusters...

Similarly, quickEnhancers will find and quantify candidate enhancers:

enhancers <- quickEnhancers(CTSSs)
#>  - Running calcTPM and calcPooled:
#> Calculating library sizes...
#> Calculating TPM...
#> 
#>  - Running clusterUnidirectionally:
#> Calculating windowed coverage on plus strand...
#> Calculating windowed coverage on minus strand...
#> Calculating balance score...
#> Slice-reduce to find bidirectional clusters...
#> Calculating statistics...
#> Preparing output...
#> 
#>  - Running subsetByBidirectionality:
#> Calculating bidirectionality...
#> Subsetting...
#> Removed 285 out of 671 regions (42.5%)
#> 
#>  - Running quantifyClusters:
#> Finding overlaps...
#> Aggregating within clusters...

We can then use transcript models stored in a TxDb-object to annotate each candidate TSS and enhancer with their transcript context:

# Use the built in annotation for mm9
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

# Annotate both TSSs and enhancers
TSSs <- assignTxType(TSSs, txModels = txdb)
#> Finding hierachical overlaps...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter   275        1.3
#> 2   proximal   354        1.7
#> 3    fiveUTR   247        1.2
#> 4   threeUTR  1330        6.3
#> 5        CDS  1611        7.7
#> 6       exon   160        0.8
#> 7     intron 11945       56.9
#> 8  antisense  2745       13.1
#> 9 intergenic  2341       11.1
enhancers <- assignTxType(enhancers, txModels = txdb)
#> Finding hierachical overlaps...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter    23        6.0
#> 2   proximal    15        3.9
#> 3    fiveUTR     4        1.0
#> 4   threeUTR    29        7.5
#> 5        CDS    45       11.7
#> 6       exon     5        1.3
#> 7     intron   227       58.8
#> 8  antisense     0        0.0
#> 9 intergenic    38        9.8

Usually, candidate enhancers overlapping known transcripts are removed:

enhancers <- subset(enhancers, txType %in% c("intergenic", "intron"))

For usage with other Bioconductor package, it can be useful to merge candidate TSSs and enhancers into a single object:

# Add an identifier column
rowRanges(TSSs)$clusterType <- "TSS"
rowRanges(enhancers)$clusterType <- "enhancer"

# Combine TSSs and enhancers, discarding TSSs if they overlap enhancers
RSE <- combineClusters(TSSs, enhancers, removeIfOverlapping = "object1")
#> Removing overlapping features from object1: 866
#> Keeping assays: counts
#> Keeping columns: score, thick, txType, clusterType
#> Merging metadata...
#> Stacking and re-sorting...

Finally, for use with packages for calling differential expression (DESeq2, edgeR, limma, etc.) very lowly expressed features are removed:

# Only keep features with more than 0 counts in more than 1 sample:
RSE <- subsetBySupport(RSE, inputAssay = "counts", outputColumn = "support", unexpressed = 0, 
    minSamples = 1)
#> Calculating support...
#> Subsetting...
#> Removed 16025 out of 20407 regions (78.5%)

5 Introduction and overview

5.1 Introduction to CAGE data

Cap Analysis of Gene Expression (CAGE) is one of the most popular high-throughput assay for profiling Transcription Start Site (TSS) activity. CAGE is based on sequencing the first 20-30 basepairs of capped full-length RNAs, called CAGE-tags. When mapped to a reference genome, CAGE-tags can be used to identify both TSSs and enhancers. See the CAGEfightR paper for additional introduction.

CAGE-data is often represented as CAGE TSSs (CTSSs), which is the the number of CAGE-tag 5’-ends mapping to each genomic position. CTSSs thereby represent a genome-wide, basepair resolution, quantitative atlas of transcription. CTSSs are inherently sparse, as only very few genomic positions are CTSSs, and only very few CTSSs have a high number of CAGE-tags.

Analysis of CAGE data is often focused on identifying clusters of CTSSs that corresponds to activity of individual TSSs and enhancers. Having access to both TSSs and enhancers in a single experiment makes CAGE well suited for studying many aspects of transcriptional regulation, for example differential expression, motif analysis and alternative TSS usage.

5.2 Central S4-classes

There are two key S4-classes to understand when using CAGEfightR:

The GRanges-class: Genomic locations or ranges are stored as GRanges-objects from the GenomicRanges package. GRanges contain chromosome (seqnames), start (start) and end (end) positions of ranges, along with chromosome lengths (seqinfo). An important additional features is added information for each range via metadata columns (mcols). These can be anything almost anything, as long as they can be put into a DataFrame. Some metadata columns have special meanings: the score column (accesible via the score function) and thick column, as these columns can be exported to a standard BED-format file via the export function from the rtracklayer package.

The RangedSummarizedExperiment-class: Complete CAGE experiments can be stored as RangedSummarizedExperiment-objects from the SummarizedExperiment, which implements the idea of the “three tables of genomics”. A RangedSummarizedExperiment can store several matrices of the same shape, e.g. counts and normalized expression values (accesible via assay and assays), along with information about each sample as a DataFrame-object (accesible via colData) and about each feature as a GRanges-object like above (accesible via rowRanges). Many GRanges methods also work onRangedSummarizedExperiment-objects, like sort, findOverlaps, etc.. Storing all information as a single RangedSummarizedExperiment helps keeping data organized, for example when extracting subsets of the data, where subset can be used to simultaneously extract requested data from all three tables at once.

Other S4-classes includes TxDb-objects from GenomicFeatures and various track-objects from Gviz. See Annotation with transcript models and Plotting CAGE data in a genome browser for more details.

5.3 Overview of functions

CAGEfightR conceptually analyses CAGE-data at 3 levels:

CTSS-level: Analysis at the level of number of CAGE tags per CTSS.
Cluster-level: Analysis at the level of clusters of CTSSs, usually TSSs and/or enhancers.
Gene-level: Analysis at the level of annotated genes.

For easier overview, functions in CAGEfightR are organized by prefixes. The most important groups are:

The quantify*-functions summarizes CAGE data to a higher level. In all cases, data is a stored as a RangedSummarizedExperiment:

quantifyCTSSs: Quantify the number of CAGE-tags for each CTSSs from BigWig-files.
quantifyClusters: Quantify the number of CAGE-tags within clusters of CTSSs, usually TSSs or enhancers.
quantifyGenes: Quantify the number of CAGE-tags within annotated genes, by summing the number of CAGE-tags annotated clusters.

The calc*-functions modify data stored in a RangedSummarizedExperiment. The following calc* functions perform basic calculations applicable to all three levels:

calcTotalTags: Calculate the total number of CAGE tags in a library
calcTPM: Calcuate Tags-Per-Million (TPM)
calcPooled: Calculate a pooled signal across all samples
calcSupport: Count the number of samples expression a feature above some level. Useful for removing lowly expressed features via the subsetBySupport wrapper.

Some calc* functions only work for a specfic level:

calcShape: Quantify the shape of TCs using shape*-functions.
calcBidirectionality: Count the number of samples showing bidirectional transcription. Useful for finding candidate enhancers via the subsetByBidirectionality wrapper.
calcComposition: Calculate how much each TSSs contribute to gene expression. Useful for removing lowly expressed TSSs via the subsetByComposition wrapper.

The cluster*-functions can cluster CTSSs to find TSSs or enhancers:

clusterUnidirectionally: Find unidirectional or Tag Clusters (TCs). Search threshold can be be found using tuneTagClustering. TCs can be modified using the trim*-functions.
clusterBidirectionally: Find bidirectional clusters (BCs) for enhancer identification.

The assign*-functions can annotate CTSSs and clusters with transcript and genes models:

assignTxID: Annotate clusters with transcript IDs
assignTxType: Annotate clusters based on their overlap with known transcripts, for example to determine whether a TSS is known or novel.
assignGeneID: Annotate with gene IDs, for example prior to running quantifyGenes.

Other groups of functions are bw*-functions for checking BigWig-files, swap*-functions for manipulating GRanges-objects, track*-functions for visualization and combine*-functions for safe merging of different types of clusters. The quick*-functions are high-level wrappers for many other functions, intended to be used as single-line execution of the standard pipeline.

5.4 Pipeability

The majority of functions in CAGEfightR are endomorphisms, meaning they returned modified versions of the input objects. This often works by adding calculated values as new metadata columns, accesible via rowData or mcols.

While not used in this vignette, this means that CAGEfightR is highly compatible with the pipes from the magrittr package.

The main exceptions to pipeable functions this are the quantify* functions, that instead summarizes data between different levels.

6 Detailed Introduction

The following section contains a detailed walkthrough of most of the functions in CAGEfightR, using the built-in dataset. To keep this vignette compact, the example dataset is extremely small - for a more realistic analysis of a full CAGE dataset using using both CAGEfightR and additional packages, additional workflows will be added in the future.

6.1 CAGE Transcription Start Site (CTSS) level analysis

As shown in the introduction Quick start for the impatient, the first step of a CAGEfightR analysis is import CTSSs from BigWig-files with the quantifyCTSSs function. Rather than repeating this, we use the built-in exampleCTSSs object:

data(exampleCTSSs)
exampleCTSSs
#> class: RangedSummarizedExperiment 
#> dim: 41256 3 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(1): score
#> colnames(3): C547 C548 C549
#> colData names(4): Name BigWigPlus BigWigMinus totalTags

This is a somewhat special RangedSummarizedExperiment: The assay is not a normal matrix, but rather a sparse matrix (dgCMatrix from the Matrix package). Since CAGE data is inherently sparse (vast majority sites in the genome have no counts), storing CTSSs as a sparse matrix allows for much faster and memory efficient processing of data:

head(assay(exampleCTSSs))
#> 6 x 3 sparse Matrix of class "dgCMatrix"
#>      C547 C548 C549
#> [1,]    1    .    .
#> [2,]    .    .    1
#> [3,]    1    .    .
#> [4,]    .    1    .
#> [5,]    1    .    .
#> [6,]    .    .    1

The ranges are stored as GRanges, where all ranges a one basepair wide:

rowRanges(exampleCTSSs)
#> GRanges object with 41256 ranges and 1 metadata column:
#>           seqnames    ranges strand |              score
#>              <Rle> <IRanges>  <Rle> |          <numeric>
#>       [1]    chr18  72690884      + | 0.0804105958007014
#>       [2]    chr18  72730496      + | 0.0690737677447056
#>       [3]    chr18  73169663      + | 0.0804105958007014
#>       [4]    chr18  73190583      + | 0.0814477699478417
#>       [5]    chr18  73234436      + | 0.0804105958007014
#>       ...      ...       ...    ... .                ...
#>   [41252]    chr19  61304387      - | 0.0814477699478417
#>   [41253]    chr19  61304427      - | 0.0690737677447056
#>   [41254]    chr19  61332796      - | 0.0690737677447056
#>   [41255]    chr19  61333667      - |  0.162895539895683
#>   [41256]    chr19  61333668      - | 0.0814477699478417
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

6.1.1 Calculating pooled CTSSs

For clustering CTSSs, we first want to calculate the the overall CTSSs signal across all samples, called pooled CTSSs. However, since the different samples have different library sizes, CTSS counts must first be normalized. The simplest way of doing this is to use Tags-Per-Million (TPM, not to be confused with Transcripts-Per-Million or TxPM used in RNA-Seq):

exampleCTSSs <- calcTPM(exampleCTSSs, inputAssay = "counts", outputAssay = "TPM", 
    outputColumn = "subsetTags")
#> Calculating library sizes...
#> Calculating TPM...

calcTPM will calculate the total number of tags for each sample and store them in colData, and then scale counts into TPM and add them as a new assay:

# Library sizes
colData(exampleCTSSs)
#> DataFrame with 3 rows and 5 columns
#>             Name       BigWigPlus       BigWigMinus totalTags subsetTags
#>      <character>      <character>       <character> <numeric>  <numeric>
#> C547        C547 mm9.C547.plus.bw mm9.C547.minus.bw  12436172      63230
#> C548        C548 mm9.C548.plus.bw mm9.C548.minus.bw  12277807      52342
#> C549        C549 mm9.C549.plus.bw mm9.C549.minus.bw  14477276      69728

# TPM values
head(assay(exampleCTSSs, "TPM"))
#> 6 x 3 sparse Matrix of class "dgCMatrix"
#>          C547     C548     C549
#> [1,] 15.81528  .        .      
#> [2,]  .        .       14.34144
#> [3,] 15.81528  .        .      
#> [4,]  .       19.10512  .      
#> [5,] 15.81528  .        .      
#> [6,]  .        .       14.34144

As this is just a subset of the original dataset, we instead use the total number tags from the the complete dataset, by specifying the name of the column in colData (Note that a warning is passed that we are overwritting the previous TPM assay):

exampleCTSSs <- calcTPM(exampleCTSSs, inputAssay = "counts", totalTags = "totalTags", 
    outputAssay = "TPM")
#> Using supplied library sizes...
#> Warning in calcTPM(exampleCTSSs, inputAssay = "counts", totalTags =
#> "totalTags", : object already has an assay named TPM: It will be overwritten!
#> Calculating TPM...
head(assay(exampleCTSSs, "TPM"))
#> 6 x 3 sparse Matrix of class "dgCMatrix"
#>           C547       C548       C549
#> [1,] 0.0804106 .          .         
#> [2,] .         .          0.06907377
#> [3,] 0.0804106 .          .         
#> [4,] .         0.08144777 .         
#> [5,] 0.0804106 .          .         
#> [6,] .         .          0.06907377

calcPooled will calculate pooled CTSSs by summing up TPM across sample for each CTSS and store it in the score-column:

# Library sizes
exampleCTSSs <- calcPooled(exampleCTSSs, inputAssay = "TPM")
#> Warning in calcPooled(exampleCTSSs, inputAssay = "TPM"): object already has a
#> column named score in rowData: It will be overwritten!
rowRanges(exampleCTSSs)
#> GRanges object with 41256 ranges and 1 metadata column:
#>           seqnames    ranges strand |              score
#>              <Rle> <IRanges>  <Rle> |          <numeric>
#>       [1]    chr18  72690884      + | 0.0804105958007014
#>       [2]    chr18  72730496      + | 0.0690737677447056
#>       [3]    chr18  73169663      + | 0.0804105958007014
#>       [4]    chr18  73190583      + | 0.0814477699478417
#>       [5]    chr18  73234436      + | 0.0804105958007014
#>       ...      ...       ...    ... .                ...
#>   [41252]    chr19  61304387      - | 0.0814477699478417
#>   [41253]    chr19  61304427      - | 0.0690737677447056
#>   [41254]    chr19  61332796      - | 0.0690737677447056
#>   [41255]    chr19  61333667      - |  0.162895539895683
#>   [41256]    chr19  61333668      - | 0.0814477699478417
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

6.1.2 Calculating CTSS support to remove excess noise

In some cases you might have a very large number of and/or very noisy samples (For example due to poor RNA quality), which can lead to an increase in the number of single tags spread across the genome. To alleviate this issue, CTSSs appearing in only a single or few samples can be discarded. We refer to this as calculating the support: the number of samples expressing a feature above some threshold:

# Count number of samples with MORE ( > ) than 0 counts:
exampleCTSSs <- calcSupport(exampleCTSSs, inputAssay = "counts", outputColumn = "support", 
    unexpressed = 0)
table(rowRanges(exampleCTSSs)$support)
#> 
#>     1     2     3 
#> 33162  4297  3797

The majority of CTSSs are only expressed in a single sample. We can discard these sites using subset, and then recalculate TPM values:

supportedCTSSs <- subset(exampleCTSSs, support > 1)
supportedCTSSs <- calcTPM(supportedCTSSs, totalTags = "totalTags")
#> Using supplied library sizes...
#> Warning in calcTPM(supportedCTSSs, totalTags = "totalTags"): object already
#> has an assay named TPM: It will be overwritten!
#> Calculating TPM...
supportedCTSSs <- calcPooled(supportedCTSSs)
#> Warning in calcPooled(supportedCTSSs): object already has a column named
#> score in rowData: It will be overwritten!

Note, CAGEfightR warned that it was overwritting columns: Another option would be to save the output as new columns. The subsetBySupport function wraps the common task of calling calcSupport and subset.

6.1.3 Unidirectional (tag) clustering: Finding Transcription Start Sites (TSSs)

Once we have obtained pooled CTSSs, we can identify various types of clusters: Transcripts are rarely transcribed from just a single CTSS, but rather from a collection or cluster of nearby CTSSs corresponding to a single TSS.

The simplest way of doing this is to group nearby CTSSs on the same strand into Tag Clusters (TCs), which are likely candidates for real TSSs (although some post-filtering is almost always a good idea, see next section). CAGEfightR does this by a slice-reduce approach: It finds sites above some value (slice) and then merges nearby sites (reduce). In the simplest form this can be run as:

naive_TCs <- clusterUnidirectionally(exampleCTSSs, pooledCutoff = 0, mergeDist = 20)
#> Splitting by strand...
#> Finding tag clusters...
#> Calculating tag cluster statistics...
#> Preparing output...
#> Tag clustering summary:
#>   Width Count Percent
#>   Total 21008 100.0 %
#>     >=1 18387  87.5 %
#>    >=10  2487  11.8 %
#>   >=100   134   0.6 %
#>  >=1000     0   0.0 %

clusterUnidirectionally find TCs and outputs some summary statistics. We see that we have some very wide TCs (> 1000 bp wide). This is often the result of a just few spread-out tags that link several major TCs, which does not represent the TSS structure very well. This can be avoided by tuning the cutoff used for slicing: The smallest value for cutoff value producing the largest number of TCs is chosen:

tuned <- tuneTagClustering(exampleCTSSs, mergeDist = 20)
#> Finding thresholds to be tested...
#> Splitting coverage by strand...
#> Iterating over 11 thresholds using 4 worker(s)...
#> Analyzing plus strand...
#> Analyzing minus strand...
#> Preparing output...
tuned
#>     threshold  nTCs
#> 1  0.00000000 21008
#> 2  0.06907377 15628
#> 3  0.08041060  8861
#> 4  0.08144777  3392
#> 5  0.13814754  3044
#> 6  0.14948436  2694
#> 7  0.15052154  2421
#> 8  0.16082119  2006
#> 9  0.16185837  1743
#> 10 0.16289554  1417
#> 11 0.20722130  1389

Increasing the slice-cutoff by a very small amount greatly increases the number of TCs, by preventing merging of major TCs by weakly expressed CTSSs. We can then use this tuned cutoff value to identify TCs:

optimalCutoff <- tuned[which.max(tuned$nTCs), 1]
TCs <- clusterUnidirectionally(exampleCTSSs, pooledCutoff = optimalCutoff, mergeDist = 20)
#> Splitting by strand...
#> Finding tag clusters...
#> Calculating tag cluster statistics...
#> Preparing output...
#> Tag clustering summary:
#>   Width Count Percent
#>   Total 21008 100.0 %
#>     >=1 18387  87.5 %
#>    >=10  2487  11.8 %
#>   >=100   134   0.6 %
#>  >=1000     0   0.0 %

Note, we no longer have any very long TCs (For a real dataset it’s not uncommon to have a few very long TCs left). Let’s take a closer look at what’s returned:

TCs
#> GRanges object with 21008 ranges and 2 metadata columns:
#>                             seqnames            ranges strand |
#>                                <Rle>         <IRanges>  <Rle> |
#>   chr18:72690884-72690884;+    chr18          72690884      + |
#>   chr18:72730496-72730496;+    chr18          72730496      + |
#>   chr18:73169663-73169663;+    chr18          73169663      + |
#>   chr18:73190583-73190583;+    chr18          73190583      + |
#>   chr18:73234436-73234436;+    chr18          73234436      + |
#>                         ...      ...               ...    ... .
#>   chr19:61304092-61304092;-    chr19          61304092      - |
#>   chr19:61304278-61304387;-    chr19 61304278-61304387      - |
#>   chr19:61304427-61304427;-    chr19          61304427      - |
#>   chr19:61332796-61332796;-    chr19          61332796      - |
#>   chr19:61333667-61333668;-    chr19 61333667-61333668      - |
#>                                          score     thick
#>                                      <numeric> <IRanges>
#>   chr18:72690884-72690884;+ 0.0804105958007014  72690884
#>   chr18:72730496-72730496;+ 0.0690737677447056  72730496
#>   chr18:73169663-73169663;+ 0.0804105958007014  73169663
#>   chr18:73190583-73190583;+ 0.0814477699478417  73190583
#>   chr18:73234436-73234436;+ 0.0804105958007014  73234436
#>                         ...                ...       ...
#>   chr19:61304092-61304092;- 0.0690737677447056  61304092
#>   chr19:61304278-61304387;-   86.4736312947607  61304320
#>   chr19:61304427-61304427;- 0.0690737677447056  61304427
#>   chr19:61332796-61332796;- 0.0690737677447056  61332796
#>   chr19:61333667-61333668;-  0.244343309843525  61333667
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

The GRanges includes the chromosome, start and end positions of each TCs, as well as two other key values: The TC peak in the thick column: This the single position within the TC with the highest pooled TPM, and the sum of pooled TPM within the TC in the score column. In case you want to save TCs to a bed file, you can use the export function from the rtracklayer package.

6.1.4 Bidirectional clustering: Finding Enhancers

CAGE is a unique technology in that it allows for the robust identification of enhancers by transcription of enhancer RNAs (eRNAs). Active enhancers produce weak but consistent bidirectional transcription of capped eRNAs, resulting in a characteristic CTSS pattern of two diverging peaks approximally 400 basepairs apart.

CAGEfightR can identify this pattern by calculating a balance score ranging from 0 to 1, where 1 corresponds to perfectly divergent site (Technically, the balance score is the Bhattacharyya coefficient measuring agreement with the “ideal” divergent site, see the CAGEfightR paper). Again, a slice-reduce approach can be use to identify invidividual enhancers based on the balance scores. We refer to this approach as bidirectional clustering as opposed to the conventional unidirectonal cluster used to identify TSSs: and hence the name of the cluster functions: clusterBidirectionally and clusterUnidirectionally:

enhancers <- clusterBidirectionally(exampleCTSSs, balanceThreshold = 0.95)
#> Calculating windowed coverage on plus strand...
#> Calculating windowed coverage on minus strand...
#> Calculating balance score...
#> Slice-reduce to find bidirectional clusters...
#> Calculating statistics...
#> Preparing output...

Let’s look closer at the returned GRanges:

enhancers
#> GRanges object with 674 ranges and 3 metadata columns:
#>                           seqnames            ranges strand |
#>                              <Rle>         <IRanges>  <Rle> |
#>   chr18:73731202-73731784    chr18 73731202-73731784      * |
#>   chr18:73731956-73732359    chr18 73731956-73732359      * |
#>   chr18:73750120-73750564    chr18 73750120-73750564      * |
#>   chr18:73752492-73752953    chr18 73752492-73752953      * |
#>   chr18:73802190-73802733    chr18 73802190-73802733      * |
#>                       ...      ...               ...    ... .
#>   chr19:61182777-61183299    chr19 61182777-61183299      * |
#>   chr19:61193381-61193912    chr19 61193381-61193912      * |
#>   chr19:61194241-61194692    chr19 61194241-61194692      * |
#>   chr19:61202166-61202703    chr19 61202166-61202703      * |
#>   chr19:61260539-61260971    chr19 61260539-61260971      * |
#>                                       score     thick           balance
#>                                   <numeric> <IRanges>         <numeric>
#>   chr18:73731202-73731784 0.160821191601403  73731493                 1
#>   chr18:73731956-73732359  1.22166008691667  73732159 0.997127396724092
#>   chr18:73750120-73750564 0.242268961549244  73750342 0.985341340592472
#>   chr18:73752492-73752953 0.149484363545407  73752722 0.999279749024173
#>   chr18:73802190-73802733 0.150521537692547  73802461 0.999153450069061
#>                       ...               ...       ...               ...
#>   chr19:61182777-61183299  0.31134272929395  61183042 0.999802454365801
#>   chr19:61193381-61193912 0.762907342371592  61193599 0.999770184445126
#>   chr19:61194241-61194692 0.218558131290113  61194460 0.999279749024173
#>   chr19:61202166-61202703 0.149484363545407  61202434 0.999279749024173
#>   chr19:61260539-61260971 0.138147535489411  61260755                 1
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

The GRanges includes the chromosome, start and end positions of each enhancer (enhancers are unstranded), as well as two other key values: the score column holds the sum of pooled CTSSs of the enhancer (on both strands) and the enhancer midpoint in the thick column: This is the maximally balanced site in the enhancers. Again, the export function from the rtracklayer package can be used to export enhancers to a BED-file.

As balance here is solely defined on the pooled CTSSs, a useful filter is to make sure the bidirectional site is also bidirectional in at least a single sample. Similarly to how we calculated support, we can calculate the sample-wise bidirectionality of enhancers:

# Calculate number of bidirectional samples
enhancers <- calcBidirectionality(enhancers, samples = exampleCTSSs)

# Summarize
table(enhancers$bidirectionality)
#> 
#>   0   1   2   3 
#> 284 319  44  27

Many enhancers are not observed to be bidirectional in one or more samples. We can remove these using subset (The subsetBySupport function wraps the common task of calling calcBidirectionality and subset):

enhancers <- subset(enhancers, bidirectionality > 0)

This bidirectional clustering approach identifies any bidirectional site in the genome. This means that for example bidirectional promoters will also detected (given that they are sufficiently balanced). To remove these cases, annotation with transcript models can be used to remove bidirectional clusters overlapping known promoters and exons. This is described in the next section.

6.2 Cluster level analysis

Once interesting clusters (unidirectional TSSs candidates and bidirectional enhancer candidates) have been identified, they can be quantified and annotated with transcript models. We show this functionality using the built-in example data:

# Load the example datasets
data(exampleCTSSs)
data(exampleUnidirectional)
data(exampleBidirectional)

6.2.1 Quantifying expression at cluster level

To look at differential expression, we must quantify the expression of each TC in each sample (in almost all cases you want to quantify the raw CTSS counts). This can be done using the quantifyClusters function (The example datasets have already been quantified, so here we will re-quantify the clusters):

requantified_TSSs <- quantifyClusters(exampleCTSSs, clusters = rowRanges(exampleUnidirectional), 
    inputAssay = "counts")
#> Finding overlaps...
#> Aggregating within clusters...
requantified_enhancers <- quantifyClusters(exampleCTSSs, clusters = rowRanges(exampleBidirectional), 
    inputAssay = "counts")
#> Finding overlaps...
#> Aggregating within clusters...

This returns a RangedSummarizedExperiment with clusters in rowRanges and count matrix in assays.

6.2.2 Removing weakly expressed clusters

We often wish to remove weakly expressed clusters. Similarly to CTSSs, a simple approach is to remove cluster based on counts using the subsetBySupport function:

# Only keep enhancers expressed in more than one sample
supported_enhancers <- subsetBySupport(exampleBidirectional, inputAssay = "counts", 
    unexpressed = 0, minSamples = 1)
#> Calculating support...
#> Subsetting...
#> Removed 105 out of 390 regions (26.9%)

Similarly, we can also first calculate TPM values for clusters, and then filter on TPM values:

# Calculate TPM using pre-calculated total tags:
exampleUnidirectional <- calcTPM(exampleUnidirectional, totalTags = "totalTags")
#> Using supplied library sizes...
#> Calculating TPM...

# Only TSSs expressed at more than 1 TPM in more than 2 samples
exampleUnidirectional <- subsetBySupport(exampleUnidirectional, inputAssay = "TPM", 
    unexpressed = 1, minSamples = 2)
#> Calculating support...
#> Subsetting...
#> Removed 17141 out of 17313 regions (99%)

6.2.3 Annotation with transcript models

While CAGE can identify TSSs and enhancers completely independent of annotation, it is often useful to compare these to existing annotations. Bioconductor stores transcript models using the TxDb class from the GenomicFeatures package. There are multiple ways to obtain a TxDb object:

Use the TxDb-objects included as Bioconductor packages: For this vignette we use the TxDb.Mmusculus.UCSC.mm9.knownGene package.
Import a GTF or GFF3 file using the makeTxDbFromGFF function from GenomicFeatures.
Use the AnnotationHub package to directly download TxDb objects for a wide range of organisms.
Use makeTxDbFromBiomart, makeTxDbFromUCSC or makeTxDbFromEnsembl from from GenomicFeatures to create TxDb objects from various online sources.

In case you really do not want to use a TxDb object, all annotation functions in CAGEfightR also accepts GRanges or GRangesList as input.

library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
txdb
#> TxDb object:
#> # Db type: TxDb
#> # Supporting package: GenomicFeatures
#> # Data source: UCSC
#> # Genome: mm9
#> # Organism: Mus musculus
#> # Taxonomy ID: 10090
#> # UCSC Table: knownGene
#> # Resource URL: http://genome.ucsc.edu/
#> # Type of Gene ID: Entrez Gene ID
#> # Full dataset: yes
#> # miRBase build ID: NA
#> # transcript_nrow: 55419
#> # exon_nrow: 246570
#> # cds_nrow: 213117
#> # Db created by: GenomicFeatures package from Bioconductor
#> # Creation time: 2015-10-07 18:13:02 +0000 (Wed, 07 Oct 2015)
#> # GenomicFeatures version at creation time: 1.21.30
#> # RSQLite version at creation time: 1.0.0
#> # DBSCHEMAVERSION: 1.1

First, we might simply inquire as to what transcripts overlaps our clusters, which is done via the assignTxID function:

exampleUnidirectional <- assignTxID(exampleUnidirectional, txModels = txdb, outputColumn = "txID")
#> Extracting transcripts...
#> Finding hierachical overlaps...
#> ### Overlap Summary: ###
#> Features overlapping transcripts: 87.79 %
#> Number of unique transcripts: 232

Note, that a single TC can overlap multiple transcripts (names are separated with a ;):

rowRanges(exampleUnidirectional)[5:6]
#> GRanges object with 2 ranges and 4 metadata columns:
#>                             seqnames            ranges strand |
#>                                <Rle>         <IRanges>  <Rle> |
#>   chr18:74427830-74427996;+    chr18 74427830-74427996      + |
#>   chr18:74442758-74442837;+    chr18 74442758-74442837      + |
#>                                        score     thick   support
#>                                    <numeric> <IRanges> <integer>
#>   chr18:74427830-74427996;+ 63.4299739643187  74427933         3
#>   chr18:74442758-74442837;+ 11.2621271351077  74442817         3
#>                                              txID
#>                                       <character>
#>   chr18:74427830-74427996;+ uc008fph.2;uc008fpj.2
#>   chr18:74442758-74442837;+ uc008fpl.2;uc008fpm.2
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

In many cases, we are not so much interested in what specific transcripts a cluster overlaps, but rather where in a transcript a cluster lies - for example whether it is located at the annotated TSSs, in the 5’-UTR or in the CDS region. However, due to overlapping transcripts, these categories might overlap, for example a region might be an exon in one transcripts, but skipped in another due to alternative splicing. To get around this, CAGEfightR implements a hierachical annotation scheme (See the CAGEfightR paper for a detailed description), implemented in the assignTxType function:

exampleUnidirectional <- assignTxType(exampleUnidirectional, txModels = txdb, 
    outputColumn = "txTyp")
#> Finding hierachical overlaps...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter   101       58.7
#> 2   proximal    11        6.4
#> 3    fiveUTR    15        8.7
#> 4   threeUTR     3        1.7
#> 5        CDS     8        4.7
#> 6       exon     0        0.0
#> 7     intron    13        7.6
#> 8  antisense     6        3.5
#> 9 intergenic    15        8.7

The function returns a small summary of the annotation: At the top of the hierachy we have overlap with annotated promoter (+/- 100 basepairs from the annotated TSSs, with this distance being changeable when running the function), followed by proximal (- 1000 upstrema of the annotated TSS). Coding transcripts can be annotated as 5’-UTR, 3-’UTR, CDS and intron, while non-conding transcripts are simply annotated as exon or intron. At the bottom of the hierachy, clusters overlapping genes on the opposite strand are annotated as antisense. Finally, in case a cluster does not overlap any transcript, it is annotated as intergenic.

Highly expressed TSSs can be quite wide, and overlap many different categories. It can be useful to only annotate the TSSs peak rather than the the entire TSSs to obtain more representative annotations:

exampleUnidirectional <- assignTxType(exampleUnidirectional, txModels = txdb, 
    outputColumn = "peakTxType", swap = "thick")
#> Finding hierachical overlaps with swapped ranges...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter    94       54.7
#> 2   proximal    15        8.7
#> 3    fiveUTR    17        9.9
#> 4   threeUTR     3        1.7
#> 5        CDS     6        3.5
#> 6       exon     0        0.0
#> 7     intron    16        9.3
#> 8  antisense     6        3.5
#> 9 intergenic    15        8.7

Annotation with transcript models are particular useful for enhancer detecting, since bidirectional cluster might in some cases be very balanced bidirectional promoters. We can remove enhancers overlapping known transcripts:

# Annotate with TxTypes
exampleBidirectional <- assignTxType(exampleBidirectional, txModels = txdb, outputColumn = "txType")
#> Finding hierachical overlaps...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter    23        5.9
#> 2   proximal    15        3.8
#> 3    fiveUTR     4        1.0
#> 4   threeUTR    29        7.4
#> 5        CDS    45       11.5
#> 6       exon     5        1.3
#> 7     intron   231       59.2
#> 8  antisense     0        0.0
#> 9 intergenic    38        9.7

# Only keep intronic and intergenic enhancers
exampleBidirectional <- subset(exampleBidirectional, txType %in% c("intron", "intergenic"))

Of course, this way of filtering relies on having accurate transcript models!

6.2.4 Quantifying shape of Tag Clusters

We have already looked at two characteristics of Tag Clusters (TCs) in addition their genomic location: the expression level (score column) and TSSs peak (thick column). One can further describe TCs beyond these two measures. For example, it was found that mammalian TSSs can be divided into “sharp” or “broad” TSSs, by comparing the Interquartile Range (IQR). We refer to such a measure as a shape statistic and CAGEfightR includes a few built-in functions for shape statistics (IQR, entropy and multimodality) as well as an easy framework for implementing your own.

Let’s see how this works by calculating the IQR for TSSs. First, we need to calculate pooled CTSS:

# Recalculate pooled signal
exampleCTSSs <- calcTPM(exampleCTSSs, totalTags = "totalTags")
#> Using supplied library sizes...
#> Calculating TPM...
exampleCTSSs <- calcPooled(exampleCTSSs)
#> Warning in calcPooled(exampleCTSSs): object already has a column named score
#> in rowData: It will be overwritten!

# Calculate shape
exampleUnidirectional <- calcShape(exampleUnidirectional, pooled = exampleCTSSs, 
    outputColumn = "IQR", shapeFunction = shapeIQR, lower = 0.25, upper = 0.75)
#> Splitting coverage by strand...
#> Applying function to each TC...
#> Assembling output...

A plot of the distribution of IQR values reveals the distinction between broad and sharp TSSs:

hist(rowRanges(exampleUnidirectional)$IQR, breaks = max(rowRanges(exampleUnidirectional)$IQR), 
    xlim = c(0, 100), xlab = "IQR", col = "red")

shapeEntropy works in the same way. Advanced users may implement their own shape statistic by writing a new function:

# Write a function that quantifies the lean of a TSS
shapeLean <- function(x, direction) {
    # Coerce to normal vector
    x <- as.vector(x)
    
    # Find highest position:
    i <- which.max(x)
    
    # Calculate sum
    i_total <- sum(x)
    
    # Calculate lean fraction
    if (direction == "right") {
        i_lean <- sum(x[i:length(x)])
    } else if (direction == "left") {
        i_lean <- sum(x[1:i])
    } else {
        stop("direction must be left or right!")
    }
    
    # Calculate lean
    o <- i_lean/i_total
    
    # Return
    o
}

# Calculate lean statistics, additional arguments can be passed to calcShape
# via '...'
exampleUnidirectional <- calcShape(exampleUnidirectional, exampleCTSSs, outputColumn = "leanRight", 
    shapeFunction = shapeLean, direction = "right")
#> Splitting coverage by strand...
#> Applying function to each TC...
#> Assembling output...

exampleUnidirectional <- calcShape(exampleUnidirectional, exampleCTSSs, outputColumn = "leanLeft", 
    shapeFunction = shapeLean, direction = "left")
#> Splitting coverage by strand...
#> Applying function to each TC...
#> Assembling output...

6.3 Gene level analysis

CAGE can detect TSSs and enhancers completely independent of existing gene models. However many databases of functional annotation, such as GO and KEGG terms, are only available at gene level. One might need to compare CAGE data to existing RNA-Seq or Microarray data only available at the gene level.

Another typical analysis is to look at Differential TSS Usage (DTU), sometimes refered to as alternative TSSs or promoter switches. Here the aim is to identify if genes utilize different TSSs under different conditions, independently of whether the overall gene expression changes.

6.3.1 Annotation with gene models

In a similar fashion to transcript annotation described above Annotation with transcript models, gene models are obtained via TxDb objects:

# Load example TSS
data(exampleUnidirectional)

# Keep only TCs expressed at more than 1 TPM in more than 2 samples:
exampleUnidirectional <- calcTPM(exampleUnidirectional, totalTags = "totalTags")
#> Using supplied library sizes...
#> Calculating TPM...
exampleUnidirectional <- subsetBySupport(exampleUnidirectional, inputAssay = "TPM", 
    unexpressed = 1, minSamples = 2)
#> Calculating support...
#> Subsetting...
#> Removed 17141 out of 17313 regions (99%)

# Use the Bioconductor mm9 UCSC TxXb
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

The assignGeneID function annotate ranges with gene IDs:

exampleUnidirectional <- assignGeneID(exampleUnidirectional, geneModels = txdb, 
    outputColumn = "geneID")
#> Extracting genes...
#> Overlapping while taking distance to nearest TSS into account...
#> Finding hierachical overlaps...
#> ### Overlap Summary: ###
#> Features overlapping genes: 88.37 %
#> Number of unique genes: 89

TSSs not overlappping any genes receive NA, and in the (usually rare) case of a TC overlapping multiple genes, the gene with the closest annotated TSS is chosen:

What type of gene ID is returned of course depends on the source of the TxDb object supplied. In this case, it is Entrez IDs, the most commonly used ID system in Bioconductor. Entrez IDs can be used to find the corresponding gene symbols using an OrgDb object:

# Use Bioconductor OrgDb package
library(org.Mm.eg.db)
#> 
odb <- org.Mm.eg.db

# Match IDs to symbols
symbols <- mapIds(odb, keys = rowRanges(exampleUnidirectional)$geneID, keytype = "ENTREZID", 
    column = "SYMBOL")
#> 'select()' returned 1:1 mapping between keys and columns

# Add to object
rowRanges(exampleUnidirectional)$symbol <- as.character(symbols)

For most gene-level operations in CAGEfightR NAs (TCs not overlapping genes) are simply removed. If you wish to keep these TCs, you can relabel them as “Novel” genes via the assignMissingID function:

exampleUnidirectional <- assignMissingID(exampleUnidirectional, outputColumn = "symbol")
#> Assigned 2 missing IDs

6.3.2 Quantify expression at gene-level

Once TCs have been assigned to genes, TC expression can be summed up within genes to obtain a gene-level expression matrix via the quantifyGenes function:

genelevel <- quantifyGenes(exampleUnidirectional, genes = "geneID", inputAssay = "counts")

This returns a RangedSummarizedExperiment with the same number of rows as genes. The ranges are returned as a GRangesList of the individual TSSs making up the expression of the genes:

rowRanges(genelevel)
#> GRangesList object of length 88:
#> $100217439 
#> GRanges object with 5 ranges and 5 metadata columns:
#>                             seqnames            ranges strand |
#>                                <Rle>         <IRanges>  <Rle> |
#>   chr19:53899602-53899622;-    chr19 53899602-53899622      - |
#>   chr19:55817615-55817733;-    chr19 55817615-55817733      - |
#>   chr19:56738135-56738213;-    chr19 56738135-56738213      - |
#>   chr19:57432920-57432984;-    chr19 57432920-57432984      - |
#>   chr19:60842388-60842675;-    chr19 60842388-60842675      - |
#>                                        score     thick   support      geneID
#>                                    <numeric> <IRanges> <integer> <character>
#>   chr19:53899602-53899622;- 5.20115106166876  53899605         3   100217439
#>   chr19:55817615-55817733;- 7.01452555772393  55817695         3   100217439
#>   chr19:56738135-56738213;- 11.9559511230901  56738161         3   100217439
#>   chr19:57432920-57432984;- 7.02786464660017  57432963         3   100217439
#>   chr19:60842388-60842675;- 14.6610146789778  60842645         3   100217439
#>                                  symbol
#>                             <character>
#>   chr19:53899602-53899622;-     Snora19
#>   chr19:55817615-55817733;-     Snora19
#>   chr19:56738135-56738213;-     Snora19
#>   chr19:57432920-57432984;-     Snora19
#>   chr19:60842388-60842675;-     Snora19
#> 
#> $100217466 
#> GRanges object with 1 range and 5 metadata columns:
#>                             seqnames            ranges strand |
#>   chr18:74938323-74938515;-    chr18 74938323-74938515      - |
#>                                        score    thick support    geneID
#>   chr18:74938323-74938515;- 243.886676380891 74938434       3 100217466
#>                               symbol
#>   chr18:74938323-74938515;- Scarna17
#> 
#> $100502841 
#> GRanges object with 1 range and 5 metadata columns:
#>                             seqnames            ranges strand |
#>   chr18:78135139-78135265;+    chr18 78135139-78135265      + |
#>                                        score    thick support    geneID
#>   chr18:78135139-78135265;+ 101.861136805853 78135206       3 100502841
#>                             symbol
#>   chr18:78135139-78135265;+   Epg5
#> 
#> ...
#> <85 more elements>
#> -------
#> seqinfo: 35 sequences (1 circular) from mm9 genome

Gene-level values are accesible via the rowData function:

rowData(genelevel)
#> DataFrame with 88 rows and 1 column
#>     nClusters
#>     <integer>
#> 1           5
#> 2           1
#> 3           1
#> 4           1
#> 5           2
#> ...       ...
#> 84          1
#> 85          2
#> 86          1
#> 87          1
#> 88          1

The gene-level expression matrix can now be supplied to many other Bioconductor packages, including DESeq2, edgeR, limma, etc.

6.3.3 Filtering clusters based on gene composition

When looking at Differential TSS Usage (DTU) we are interested in changes in the contribution of individual TCs to overall gene expression. Since CAGE can detect even very lowly expressed TSSs, it is common to see TSSs making up only a very small fraction of the total gene expression. Removing these TSSs can improve power and clarity of downstream analyses.

CAGEfightR can calculate the composition value of intragenic TCs, defined as the number of samples expressing a TC at more than a certain fraction of total gene expression. This is implemented in the calcComposition function:

# Remove TSSs not belonging to any genes
intragenicTSSs <- subset(exampleUnidirectional, !is.na(geneID))

# Calculate composition: The number of samples expressing TSSs above 10% of
# the total gene expression.
intragenicTSSs <- calcComposition(intragenicTSSs, outputColumn = "composition", 
    unexpressed = 0.1, genes = "geneID")

# Overview of composition values:
table(rowRanges(intragenicTSSs)$composition)
#> 
#>   0   1   2   3 
#>  22   8   6 116

We can see that many TSSs makes up less than 10% in all samples. These can be removed with subset:

# Remove TSSs with a composition score less than 3
intragenicTSSs <- subset(intragenicTSSs, composition > 2)

The subsetByComposition function wraps the common task of calling calcComposition and subset. After filtering, the resulting expression matrix can be used with popular Bioconductor packages for DTU calling, such as DEXSeq, DRIMSeq, edgeR, limma, etc.

6.4 Plotting CAGE data in a genome browser

Zenbu is an excellent online genome browser for visualizing CAGE data in relation to annotation in an interative manner. For producing figures (or possibly a very large number of figures) of specific genomic regions, it can be useful to use R to generate genome browser-like plots.

CAGEfightR can produce genome browser plots via the popular and extensive Gviz package. CAGEfightR provides functions for generating genome browser tracks for both CTSSs and clusters. For general generation and customization of GViz plots we refer to the extensive Gviz manual. Below we include some basic examples of visualizing CAGE data.

First we load use the built-in datasets:

library(Gviz)
#> Loading required package: grid

data("exampleCTSSs")
data("exampleUnidirectional")
data("exampleBidirectional")

First we make a track of pooled CTSSs:

# Calculate pooled CTSSs
exampleCTSSs <- calcTPM(exampleCTSSs, totalTags = "totalTags")
#> Using supplied library sizes...
#> Calculating TPM...
exampleCTSSs <- calcPooled(exampleCTSSs)
#> Warning in calcPooled(exampleCTSSs): object already has a column named score
#> in rowData: It will be overwritten!

# Built the track
pooled_track <- trackCTSS(exampleCTSSs, name = "CTSSs")
#> Splitting pooled signal by strand...
#> Preparing track...

The returned object is a DataTrack, which we can plot a specific genomic region of:

# Plot the main TSS of the myob gene
plotTracks(pooled_track, from = 74601950, to = 74602100, chromosome = "chr18")

Secondly, we make a track of clusters: This function can handle both TSSs and enhancers, so we first merge them into a single object:

# Remove columns
exampleUnidirectional$totalTags <- NULL
exampleBidirectional$totalTags <- NULL

# Combine TSSs and enhancers, discarding TSSs if they overlap enhancers
CAGEclusters <- combineClusters(object1 = exampleUnidirectional, object2 = exampleBidirectional, 
    removeIfOverlapping = "object1")
#> Removing overlapping features from object1: 1049
#> Keeping assays: counts
#> Keeping columns: score, thick
#> Merging metadata...
#> Stacking and re-sorting...

# Only keep features with more than 0 counts in more than 2 samples
CAGEclusters <- subsetBySupport(CAGEclusters, inputAssay = "counts", unexpressed = 0, 
    minSamples = 2)
#> Calculating support...
#> Subsetting...
#> Removed 15198 out of 16654 regions (91.3%)

# Built track
cluster_track <- trackClusters(CAGEclusters, name = "Clusters", col = NA)
#> Setting thick and thin features...
#> Merging and sorting...
#> Preparing track...

The returned object is a GeneRegionTrack, which we can plot the same genomic region of:

# Plot the main TSS of the myob gene
plotTracks(cluster_track, from = 74601950, to = 74602100, chromosome = "chr18")

Of course, we want to combine several different tracks into a single overview. There is multiple ways of doing this, and the code below is just one such way:

# Genome axis tracks
axis_track <- GenomeAxisTrack()

# Transcript model track
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
tx_track <- GeneRegionTrack(txdb, name = "Gene Models", col = NA, fill = "bisque4", 
    shape = "arrow")

# Merge all tracks
all_tracks <- list(axis_track, tx_track, cluster_track, pooled_track)

# Plot the Hdhd2 gene
plotTracks(all_tracks, from = 77182700, to = 77184000, chromosome = "chr18")


# Plot an intergenic enhancer
plotTracks(all_tracks, from = 53499000, to = 53499600, chromosome = "chr19")

6.5 Parallel execution

Currently, two functions can utilize multiple cores for computation, quantifyCTSSs and tuneTagClustering. The central one is quantifyCTSSs, which is usually the slowest and most memory consuming step of a CAGEfightR analysis. quantifyCTSSs automatically uses the default backend registered with the BiocParallel package. This can be changed in the following way:

library(BiocParallel)

# Setup for parallel execustion with 3 threads:
register(MulticoreParam(workers = 3))

# Disable parallel execution
register(SerialParam())

7 Session Info

sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.4 LTS
#> 
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#>  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
#>  [8] datasets  methods   base     
#> 
#> other attached packages:
#>  [1] Gviz_1.24.0                            
#>  [2] org.Mm.eg.db_3.6.0                     
#>  [3] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
#>  [4] GenomicFeatures_1.32.0                 
#>  [5] AnnotationDbi_1.42.0                   
#>  [6] CAGEfightR_1.0.0                       
#>  [7] SummarizedExperiment_1.10.0            
#>  [8] DelayedArray_0.6.0                     
#>  [9] BiocParallel_1.14.0                    
#> [10] matrixStats_0.53.1                     
#> [11] Biobase_2.40.0                         
#> [12] rtracklayer_1.40.0                     
#> [13] GenomicRanges_1.32.0                   
#> [14] GenomeInfoDb_1.16.0                    
#> [15] IRanges_2.14.0                         
#> [16] S4Vectors_0.18.0                       
#> [17] BiocGenerics_0.26.0                    
#> [18] BiocStyle_2.8.0                        
#> 
#> loaded via a namespace (and not attached):
#>  [1] ProtGenerics_1.12.0      bitops_1.0-6            
#>  [3] bit64_0.9-7              RColorBrewer_1.1-2      
#>  [5] progress_1.1.2           httr_1.3.1              
#>  [7] rprojroot_1.3-2          GenomicFiles_1.16.0     
#>  [9] tools_3.5.0              backports_1.1.2         
#> [11] R6_2.2.2                 rpart_4.1-13            
#> [13] Hmisc_4.1-1              DBI_0.8                 
#> [15] lazyeval_0.2.1           colorspace_1.3-2        
#> [17] nnet_7.3-12              gridExtra_2.3           
#> [19] prettyunits_1.0.2        bit_1.1-12              
#> [21] curl_3.2                 compiler_3.5.0          
#> [23] formatR_1.5              htmlTable_1.11.2        
#> [25] bookdown_0.7             scales_0.5.0            
#> [27] checkmate_1.8.5          stringr_1.3.0           
#> [29] digest_0.6.15            Rsamtools_1.32.0        
#> [31] foreign_0.8-70           rmarkdown_1.9           
#> [33] XVector_0.20.0           pkgconfig_2.0.1         
#> [35] base64enc_0.1-3          dichromat_2.0-0         
#> [37] htmltools_0.3.6          ensembldb_2.4.0         
#> [39] BSgenome_1.48.0          htmlwidgets_1.2         
#> [41] rlang_0.2.0              rstudioapi_0.7          
#> [43] pryr_0.1.4               RSQLite_2.1.0           
#> [45] acepack_1.4.1            VariantAnnotation_1.26.0
#> [47] RCurl_1.95-4.10          magrittr_1.5            
#> [49] GenomeInfoDbData_1.1.0   Formula_1.2-2           
#> [51] Matrix_1.2-14            Rcpp_0.12.16            
#> [53] munsell_0.4.3            stringi_1.1.7           
#> [55] yaml_2.1.18              zlibbioc_1.26.0         
#> [57] plyr_1.8.4               blob_1.1.1              
#> [59] lattice_0.20-35          Biostrings_2.48.0       
#> [61] splines_3.5.0            knitr_1.20              
#> [63] pillar_1.2.2             codetools_0.2-15        
#> [65] biomaRt_2.36.0           XML_3.98-1.11           
#> [67] evaluate_0.10.1          biovizBase_1.28.0       
#> [69] latticeExtra_0.6-28      data.table_1.10.4-3     
#> [71] gtable_0.2.0             grr_0.9.5               
#> [73] assertthat_0.2.0         ggplot2_2.2.1           
#> [75] xfun_0.1                 AnnotationFilter_1.4.0  
#> [77] survival_2.42-3          tibble_1.4.2            
#> [79] GenomicAlignments_1.16.0 Matrix.utils_0.9.7      
#> [81] memoise_1.1.0            cluster_2.0.7-1

Introduction to CAGEfightR

30 April 2018

Abstract

Package

Contents