Contents

1 Introduction

The GRAVI workflow, for which this package is designed, uses sliding windows for differential binding analysis in a manner similar to the package csaw, but also incorporating macs2 peaks. The workflow itself extends to integrating multiple ChIP targets and external data sources, and as such, this package introduces a handful of functions to enable these analyses.

The majority of examples below use extremely simplified datasets to provide general guidance on using the functions. Some results may appear trivial as a result, but will hopefully prove far more useful in a true experimental context.

2 Installation

In order to use the package extraChIPs and follow this vignette, we recommend using the package BiocManager hosted on CRAN. Once this is installed, the additional packages required for this vignette (tidyverse, Rsamtools, csaw, BiocParallel and rtracklayer) can also be installed.

if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
pkg <- c("tidyverse", "Rsamtools", "csaw", "BiocParallel", "rtracklayer")
BiocManager::install(pkg)
BiocManager::install("extraChIPs")

3 Differential Binding and ChIP-Seq Analysis

3.1 Sliding Windows

The starting point for differential binding analysis using sliding windows is to define windows, then count reads within each window using the bam files. Commonly one or IP input/control samples is also produced during a ChIP-Seq experiment. The example files provided here contain a small subset of reads from chromosome 10 across two experimental and one input sample.

The approach taken below is to define a set of sliding windows, using the capabilities of csaw, but to then use macs peaks to define regions of most likely signal. First we can define our windows and count the alignments using existing tools. In the following, we’ll use a sliding window of 180bp and a step size of 60bp, meaning each nucleotide is covered by 3 windows.

library(tidyverse)
library(Rsamtools)
library(csaw)
library(BiocParallel)
library(rtracklayer)
bfl <- system.file(
  "extdata", "bam", c("ex1.bam", "ex2.bam", "input.bam"), package = "extraChIPs"
) %>% 
  BamFileList()
names(bfl) <- c("ex1", "ex2", "input")
rp <- readParam(
  pe = "none",
  dedup = TRUE,
  restrict = "chr10"
)
wincounts <- windowCounts(
  bam.files = bfl,
  spacing = 60,
  width = 180,
  ext = 200,
  filter = 1,
  param = rp
)

This produces a RangesSummarizedExperiment with windows included which passed the minimum threshold of 1 total read. As we’ve only counted reads within a very small window, the complete library sizes will be highly inaccurate. The true library sizes can be added here noting that this step is not normally required, but given these values are essential for accurate CPM values, they will be added here.

wincounts$totals <- c(964076L, 989543L, 1172179L)

We can also add some key information to the colData element of this object, which will also be propagated to all downstream objects.

wincounts$sample <- colnames(wincounts)
wincounts$treat <- as.factor(c("ctrl", "treat", NA))
colData(wincounts)
## DataFrame with 3 rows and 6 columns
##                    bam.files    totals       ext      rlen      sample    treat
##                  <character> <integer> <integer> <integer> <character> <factor>
## ex1   /tmp/Rtmp605fA8/Rins..    964076       200        73         ex1    ctrl 
## ex2   /tmp/Rtmp605fA8/Rins..    989543       200        74         ex2    treat
## input /tmp/Rtmp605fA8/Rins..   1172179       200        74       input    NA

A density plot can be simply drawn of these counts, with the vast majority of windows receiving very low counts, due to the nature of transcription factor binding, where long stretches are unbound. The windows with higher counts tend to be associated with the samples targeting a transcription factor (TF), as seen in the two treatment group samples.

library(extraChIPs)
plotAssayDensities(wincounts, colour = "treat", trans = "log1p")

3.2 Filtering of Sliding Windows

After counting all reads in the sliding windows, the next step is to discard windows for which counts are unlikely to represent TF binding. The package extraChIPs uses a set of consensus peaks to automatically set a threshold based on 1) counts strongly above the counts from the input sample, and 2) the windows with the overall highest signal. Thresholds are determined such that q = 0.5 of the retained windows overlap on of the supplied consensus peaks. Higher values for q will return more windows, however many of these will tend to only marginally overlap a peak. Experience has shown that values such as q = 0.5 tend to return a considerable proportion of windows containing true TF binding signal.

First we can load the peaks, supplied here as a simple bed file.

peaks <- import.bed(
  system.file("extdata", "peaks.bed.gz", package = "extraChIPs")
)
peaks <- granges(peaks)

The we can pass these to the function dualFilter() which utilises the strategy described above. On large datasets, this can be quite time-consuming, as can the initial counting step. Due to the small example dataset, a more inclusive threshold for q will be used here.

filtcounts <- dualFilter(
  x = wincounts[, !is.na(wincounts$treat)],
  bg = wincounts[, is.na(wincounts$treat)], 
  ref = peaks,
  q = 0.8 # Better to use q = 0.5 on real data
)

The returned object will by default contain counts and logCPM assays, with the complete library sizes used for the calculation of logCPM values.

plotAssayDensities(filtcounts, assay = "logCPM", colour = "treat")

plotAssayPCA(filtcounts, assay = "logCPM", colour = "treat", label = "sample")

Whilst the initial set of counts contained 1007 windows, these have now been reduced to 108 windows. Similarly, the input sample is no longer included in the data object.

dim(wincounts)
## [1] 1007    3
dim(filtcounts)
## [1] 108   2

The rowData element of the returned object will contain a logical column indicating where each specific retained window overlapped one of the supplied consensus peaks.

rowRanges(filtcounts)
## GRanges object with 108 ranges and 1 metadata column:
##         seqnames              ranges strand | overlaps_ref
##            <Rle>           <IRanges>  <Rle> |    <logical>
##     [1]    chr10 103865521-103865700      * |         TRUE
##     [2]    chr10 103865581-103865760      * |         TRUE
##     [3]    chr10 103865641-103865820      * |         TRUE
##     [4]    chr10 103865701-103865880      * |         TRUE
##     [5]    chr10 103874161-103874340      * |        FALSE
##     ...      ...                 ...    ... .          ...
##   [104]    chr10 103911901-103912080      * |        FALSE
##   [105]    chr10 103911961-103912140      * |        FALSE
##   [106]    chr10 103912021-103912200      * |        FALSE
##   [107]    chr10 103912081-103912260      * |        FALSE
##   [108]    chr10 103912261-103912440      * |        FALSE
##   -------
##   seqinfo: 1 sequence from an unspecified genome
mean(rowRanges(filtcounts)$overlaps_ref)
## [1] 0.7314815

3.3 Using Voom

Multiple approaches are available for analysis of differential binding, and given the small example dataset, only a brief example of conventional results will be used. extraChIPs does provide a simple coercion function to convert logCPM to a voom object, which requires the relationship between library sizes and logCPM values to be intact. Whist this will not be discussed further here should this be a viable approach for an analysis, the following code may prove helpful.

v <- voomWeightsFromCPM(
  cpm = assay(filtcounts, "logCPM"), 
  lib.size = filtcounts$totals, 
  isLogCPM = TRUE
)

3.4 Merging Windows

After an analysis has been performed, common values contained in the output may be estimated signal (logCPM), estimated change (logFC) with both raw and adjusted p-values. Given the dependency of neighbouring windows, any adjusted p-values will not be appropriate and a merging of overlapping windows will be performed.

For our example dataset we’ll add these manually, however this is just for demonstration purposes for the eventual merging of windows.

rowRanges(filtcounts)$logCPM <- rowMeans(assay(filtcounts,"logCPM"))
rowRanges(filtcounts)$logFC <- rowDiffs(assay(filtcounts,"logCPM"))[,1]
rowRanges(filtcounts)$PValue <- 1 - pchisq(rowRanges(filtcounts)$logFC^2, 1)

Now we have some example values, we can merge any overlapping windows using mergeByCol(). During this process, overlapping ranges are merged into a single range with representative values taken from one of the initial sliding windows. The recommended approach for retaining statistical independence between windows is to choose the window with the largest signal as representative of the entire merged window.

res_gr <- mergeByCol(filtcounts, col = "logCPM", pval = "PValue")
res_gr$overlaps_ref <- overlapsAny(res_gr, peaks)

A GRanges object is returned with representative values for each merged window. The mcol keyval_range provides the original range from which the representative values were taken. A column with adjusted p-values will also be added if p_adj_method is not set to “none”.

3.5 Mapping of Windows To Genes

Once the binding characteristics of a transcription factor have been characterised, a common next step is to assess which genes are likely to be under regulatory influence. Whilst no definitive, single methodology exists for this process, the function mapByFeature() offers an intuitive approach, taking into account any defined regulatory features. These regulatory features may be defined by simple proximity to TSS regions, by histone marks, downloaded from external repositories or any other possibility. Whilst these features can improve the precision of mapping, even without these this function can still enable a useful assignment of target gene to binding event.

The process undertaken inside mapByFeature() is a sequential checking of each range’s association with regulatory features and the most likely target as a result. These steps are:

  1. Check for any HiC interactions
  • All genes which directly overlap an interaction anchor are considered part of the regulatory network for that interaction, and as such, all genes associated with both anchors are assigned to a peak which overlaps a HiC Interaction
  1. Check for any overlaps with a promoter
  • All genes regulated by that promoter are assigned as regulatory targets. By default, this is by direct promoter/gene overlap (prom2gene = 0)
  1. Check for any overlaps with an enhancer
  • Peaks which overlap an enhancer are assigned to all genes within the distance specified by enh2gene (default = 100kb)
  1. Check for genes with no previous mappings
  • Peaks with no previous mappings are assigned to all directly overlapping genes, or the nearest gene within a specified distance (default gr2gene = 100kb)

As a result, if no promoters, enhancers or interactions are supplied, all genes will be mapped to peaks using step 4

The two essential data objects to perform simple gene assignment are 1) a set of ranges representing binding events of interest, such as res_gr above, and 2) a set of ranges defining genes, as contained in the example dataset ex_genes. This contains the two mcols gene and symbol, and we can ask for both in the returned object.

data("ex_genes")
data("ex_prom")
mapByFeature(
  res_gr, genes = ex_genes, prom = ex_prom, cols = c("gene", "symbol")
)
## GRanges object with 9 ranges and 11 metadata columns:
##       seqnames              ranges strand | n_windows      n_up    n_down
##          <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##   [1]    chr10 103865521-103865880      * |         4         0         3
##   [2]    chr10 103874161-103874940      * |         8         1         6
##   [3]    chr10 103876921-103878000      * |        16         0         6
##   [4]    chr10 103878601-103878840      * |         2         1         1
##   [5]    chr10 103878961-103881060      * |        33         0        17
##   [6]    chr10 103881181-103881540      * |         4         1         2
##   [7]    chr10 103892161-103893420      * |        19         5        12
##   [8]    chr10 103899841-103900560      * |        10         2         5
##   [9]    chr10 103911481-103912440      * |        12         3         7
##                    keyval_range    logCPM      logFC    PValue PValue_fdr
##                       <GRanges> <numeric>  <numeric> <numeric>  <numeric>
##   [1] chr10:103865521-103865700   3.46979 -0.5610608  0.574756   0.975993
##   [2] chr10:103874641-103874820   4.24295 -0.1115489  0.911181   0.975993
##   [3] chr10:103877281-103877460   5.61526 -1.4816426  0.138435   0.975993
##   [4] chr10:103878661-103878840   3.35589 -0.0300928  0.975993   0.975993
##   [5] chr10:103880281-103880460   7.72879 -0.6250501  0.531938   0.975993
##   [6] chr10:103881241-103881420   3.73440 -0.0318288  0.974609   0.975993
##   [7] chr10:103892581-103892760   7.40460 -0.0633132  0.949517   0.975993
##   [8] chr10:103900141-103900320   4.64587  0.1423445  0.886808   0.975993
##   [9] chr10:103911901-103912080   4.81325 -0.1924133  0.847419   0.975993
##       overlaps_ref            gene          symbol
##          <logical> <CharacterList> <CharacterList>
##   [1]         TRUE ENSG00000198728            LDB1
##   [2]        FALSE ENSG00000198728            LDB1
##   [3]         TRUE ENSG00000198728            LDB1
##   [4]        FALSE ENSG00000198728            LDB1
##   [5]         TRUE ENSG00000198728            LDB1
##   [6]         TRUE ENSG00000198728            LDB1
##   [7]         TRUE ENSG00000148840           PPRC1
##   [8]         TRUE ENSG00000148840           PPRC1
##   [9]        FALSE ENSG00000166197           NOLC1
##   -------
##   seqinfo: 1 sequence from an unspecified genome

For this dataset, we have an example HiC interaction, which we can now pass to the mapping process. (This time we’ll save the object)

data("ex_hic")
res_gr_mapped <- mapByFeature(
  res_gr, 
  genes = ex_genes, 
  prom = ex_prom,
  gi = ex_hic, 
  cols = c("gene", "symbol")
)
res_gr_mapped
## GRanges object with 9 ranges and 11 metadata columns:
##       seqnames              ranges strand | n_windows      n_up    n_down
##          <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##   [1]    chr10 103865521-103865880      * |         4         0         3
##   [2]    chr10 103874161-103874940      * |         8         1         6
##   [3]    chr10 103876921-103878000      * |        16         0         6
##   [4]    chr10 103878601-103878840      * |         2         1         1
##   [5]    chr10 103878961-103881060      * |        33         0        17
##   [6]    chr10 103881181-103881540      * |         4         1         2
##   [7]    chr10 103892161-103893420      * |        19         5        12
##   [8]    chr10 103899841-103900560      * |        10         2         5
##   [9]    chr10 103911481-103912440      * |        12         3         7
##                    keyval_range    logCPM      logFC    PValue PValue_fdr
##                       <GRanges> <numeric>  <numeric> <numeric>  <numeric>
##   [1] chr10:103865521-103865700   3.46979 -0.5610608  0.574756   0.975993
##   [2] chr10:103874641-103874820   4.24295 -0.1115489  0.911181   0.975993
##   [3] chr10:103877281-103877460   5.61526 -1.4816426  0.138435   0.975993
##   [4] chr10:103878661-103878840   3.35589 -0.0300928  0.975993   0.975993
##   [5] chr10:103880281-103880460   7.72879 -0.6250501  0.531938   0.975993
##   [6] chr10:103881241-103881420   3.73440 -0.0318288  0.974609   0.975993
##   [7] chr10:103892581-103892760   7.40460 -0.0633132  0.949517   0.975993
##   [8] chr10:103900141-103900320   4.64587  0.1423445  0.886808   0.975993
##   [9] chr10:103911901-103912080   4.81325 -0.1924133  0.847419   0.975993
##       overlaps_ref                            gene          symbol
##          <logical>                 <CharacterList> <CharacterList>
##   [1]         TRUE                 ENSG00000198728            LDB1
##   [2]        FALSE                 ENSG00000198728            LDB1
##   [3]         TRUE                 ENSG00000198728            LDB1
##   [4]        FALSE                 ENSG00000198728            LDB1
##   [5]         TRUE ENSG00000198728,ENSG00000148840      LDB1,PPRC1
##   [6]         TRUE ENSG00000198728,ENSG00000148840      LDB1,PPRC1
##   [7]         TRUE ENSG00000148840,ENSG00000198728      PPRC1,LDB1
##   [8]         TRUE                 ENSG00000148840           PPRC1
##   [9]        FALSE                 ENSG00000166197           NOLC1
##   -------
##   seqinfo: 1 sequence from an unspecified genome

The 5th to 7th windows are now mapped to both LDB1 and PPRC1, whereas previously these windows were only mapped to LDB1.

4 Visualisation of Results

4.1 Association with Features

The association of windows or peaks with defined features, such as histone marks or regulatory elements can be important for describing the binding characteristics of any given transcription factor. We have already defined the association of the merged windows with consensus peaks identified by macs2. We can easily visualise these using plotPie()

res_gr %>% 
  as_tibble() %>% 
  plotPie("overlaps_ref")