1 Overview

Here, we perform a window-based analysis to identify differentially bound (DB) regions for the CREB-binding protein (CBP). This provides an example of how to use csaw for transcription factor (TF) data, complementing our previous analysis of a histone mark. We use CBP ChIP-seq data from a study comparing wild-type (WT) and CBP knock-out (KO) animals (Kasper et al. 2014), with two biological replicates for each genotype. BAM files and indices are downloaded using chipseqDBData and cached for later use.

library(chipseqDBData)
cbpdata <- CBPData()
cbpdata
## DataFrame with 4 rows and 3 columns
##          Name       Description
##   <character>       <character>
## 1  SRR1145787 CBP wild-type (1)
## 2  SRR1145788 CBP wild-type (2)
## 3  SRR1145789 CBP knock-out (1)
## 4  SRR1145790 CBP knock-out (2)
##                                             Path
##                                      <character>
## 1 /tmp/RtmpYT6shT/file5dc424cfdaa/SRR1145787.bam
## 2 /tmp/RtmpYT6shT/file5dc424cfdaa/SRR1145788.bam
## 3 /tmp/RtmpYT6shT/file5dc424cfdaa/SRR1145789.bam
## 4 /tmp/RtmpYT6shT/file5dc424cfdaa/SRR1145790.bam

2 Pre-processing checks

We check some mapping statistics for the CBP dataset with Rsamtools, as previously described.

library(Rsamtools)
diagnostics <- list()
for (bam in cbpdata$Path) {
    # Total number of records in the BAM file.
    total <- countBam(bam)$records
    # Number of mapped reads.
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    # Number of mapped reads marked as duplicates.
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[basename(bam)]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
##                   Total   Mapped  Marked Prop.mapped Prop.marked
## SRR1145787.bam 28525952 24289396 2022868    85.14842    8.328194
## SRR1145788.bam 25514465 21604007 1939224    84.67356    8.976224
## SRR1145789.bam 34476967 29195883 2412650    84.68228    8.263665
## SRR1145790.bam 32624587 27348488 2617879    83.82784    9.572299

We construct a readParam object to standardize the parameter settings in this analysis. The ENCODE blacklist is again used to remove reads in problematic regions (ENCODE Project Consortium 2012). (Assuming you ran the previous workflow, the blacklist will be retrieved from the cache rather than being downloaded again.)

library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("https://www.encodeproject.org",
    "files/ENCFF547MET/@@download/ENCFF547MET.bed.gz"))

library(rtracklayer)
blacklist <- import(black.path)
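
As a quick optional check, we can inspect the number and widths of the imported blacklist regions; the number of regions should correspond to the number of discarded regions reported by readParam below. This snippet is a small sanity check rather than a required step.

# Optional: number and widths of the blacklisted regions.
length(blacklist)
summary(width(blacklist))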

We set the minimum mapping quality score to 10 to remove poorly or non-uniquely aligned reads.

library(csaw)
param <- readParam(minq=10, discard=blacklist)
param
##     Extracting reads in single-end mode
##     Duplicate removal is turned off 
##     Minimum allowed mapping score is 10 
##     Reads are extracted from both strands
##     No restrictions are placed on read extraction
##     Reads in 164 regions will be discarded

3 Computing the average fragment length

The average fragment length is estimated by maximizing the cross-correlation function (Figure 1), as previously described. Generally, cross-correlations for TF datasets are sharper than for histone marks as the TFs typically contact a smaller genomic interval. This results in more pronounced strand bimodality in the binding profile.

x <- correlateReads(cbpdata$Path, param=reform(param, dedup=TRUE))
frag.len <- maximizeCcf(x)
frag.len
## [1] 161
plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")

Figure 1: Cross-correlation function (CCF) against delay distance for the CBP dataset. The delay with the maximum correlation is shown as the red line.

4 Counting reads into windows

Reads are then counted into sliding windows using csaw (Lun and Smyth 2015). For TF data analyses, smaller windows are necessary to capture sharp binding sites. A large window size will be suboptimal as the count for a particular site will be “contaminated” by non-specific background in the neighbouring regions. In this case, a window size of 10 bp is used.

win.data <- windowCounts(cbpdata$Path, param=param, width=10, ext=frag.len)
win.data
## class: RangedSummarizedExperiment 
## dim: 9952827 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen
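
If desired, the coordinates of the counted windows and the per-sample metadata can be inspected directly from the RangedSummarizedExperiment object. The snippet below is an optional check rather than a required step.

# Optional: inspect the first few window coordinates and the sample metadata.
head(rowRanges(win.data))
colData(win.data)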

The default spacing of 50 bp is also used here. This may seem inappropriate given that the windows are only 10 bp. However, reads lying in the interval between adjacent windows will still be counted into several windows. This is because reads are extended to the value of frag.len, which is substantially larger than the 50 bp spacing. (Smaller spacings can be used but will provide little benefit, given that each extended read already overlaps multiple windows.)
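
As a rough back-of-the-envelope illustration, using the spacing stored in the metadata of win.data, each extended read spans the start positions of several adjacent windows, so little information is lost at the default spacing.

# Approximate number of adjacent window start positions spanned by
# each read after extension to frag.len.
spacing <- metadata(win.data)$spacing
frag.len / spacing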

5 Normalization for composition biases

Composition biases are introduced when the amount of DB in each condition is unbalanced (Robinson and Oshlack 2010; Lun and Smyth 2014). More binding in one condition means that more reads are sequenced at the binding sites, leaving fewer reads for the rest of the genome. This suppresses the genomic coverage at non-DB sites, resulting in spurious differences between samples. We expect unbalanced DB in this dataset as CBP function should be compromised in the KO cells, such that most - if not all - of the DB sites should exhibit increased CBP binding in the WT condition.

To remove this bias, we assign reads to large genomic bins and assume that most bins represent non-DB background regions. Any systematic differences in the coverage of those bins are attributed to composition bias and are normalized out. Specifically, the trimmed mean of M-values (TMM) method (Robinson and Oshlack 2010) is applied to compute normalization factors from the bin counts. These factors are stored in win.data (see the se.out= argument) so that they will be applied during the DB analysis with the window counts.

bins <- windowCounts(cbpdata$Path, bin=TRUE, width=10000, param=param)
win.data <- normFactors(bins, se.out=win.data)
(normfacs <- win.data$norm.factors)
## [1] 1.0125617 0.9083253 1.0443668 1.0410799
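
As a rough illustration of how these factors will be used downstream, the effective library size for each sample is the product of its total read count and its normalization factor, following the edgeR convention.

# Illustration only: effective library sizes after TMM normalization.
win.data$totals * normfacs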

We visualize the effect of normalization with mean-difference plots between pairs of samples (Figure 2). The dense cloud in each plot represents the majority of bins in the genome. These are assumed to mostly contain background regions. A non-zero log-fold change for these bins indicates that composition bias is present between samples. The red line represents the log-ratio of normalization factors and passes through the centre of the cloud in each plot, indicating that the bias has been successfully identified and removed.

bin.ab <- scaledAverage(bins)
adjc <- calculateCPM(bins, use.norm.factors=FALSE)

par(cex.lab=1.5, mfrow=c(1,3))
smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 4)")
abline(h=log2(normfacs[1]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,2]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (2 vs 4)")
abline(h=log2(normfacs[2]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,3]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (3 vs 4)")
abline(h=log2(normfacs[3]/normfacs[4]), col="red")