Contents

1 Overview

Here, we perform a window-based DB analysis to identify differentially bound (DB) regions for CREB-binding protein (CBP). We use CBP ChIP-seq data from a study comparing wild-type (WT) and CBP knock-out (KO) animals (Kasper et al. 2014), with two biological replicates for each genotype. Most, if not all, of the DB sites should be increased in the WT, given that protein function should be compromised in the KO. This provides an example of how to use the workflow with transcription factor (TF) data, to complement the previous H3K9ac analysis.

2 Aligning reads from CBP libraries

Libraries are downloaded from the NCBI GEO data series GSE54453, using the SRA accessions listed below. One file is available for each library, i.e., no technical replicates.

sra.numbers <- c("SRR1145787", "SRR1145788", "SRR1145789", "SRR1145790")
genotype <- c("wt", "wt", "ko", "ko")
all.sra <- paste0(sra.numbers, ".sra")
data.frame(SRA=all.sra, Condition=genotype)
##              SRA Condition
## 1 SRR1145787.sra        wt
## 2 SRR1145788.sra        wt
## 3 SRR1145789.sra        ko
## 4 SRR1145790.sra        ko

SRA files are unpacked to yield FASTQ files with the raw read sequences.

for (sra in all.sra) {
    code <- system(paste("fastq-dump", sra))
    stopifnot(code==0L)
}
all.fastq <- paste0(sra.numbers, ".fastq")

Reads are aligned to the mm10 genome using Rsubread. Here, the default consensus threshold is used as the reads are longer (75 bp). A Phred offset of +64 is also used, instead of the default +33 used in the previous analysis.

bam.files <- paste0(sra.numbers, ".bam")
align(index="index/mm10", readfile1=all.fastq, type=1, phredOffset=64,
    input_format="FASTQ", output_file=bam.files)

Alignments in each BAM file are sorted by coordinate. Duplicate reads are marked, and the resulting files are indexed.

temp.bam <- "cbp_temp.bam"
temp.file <- "cbp_metric.txt"
temp.dir <- "cbp_working"
dir.create(temp.dir)
for (bam in bam.files) {
    out <- suppressWarnings(sortBam(bam, "cbp_temp"))
    file.rename(out, bam)
    code <- system(sprintf("MarkDuplicates I=%s O=%s M=%s \\
        TMP_DIR=%s AS=true REMOVE_DUPLICATES=false \\
        VALIDATION_STRINGENCY=SILENT",
        bam, temp.bam, temp.file, temp.dir))
    stopifnot(code==0L)
    file.rename(temp.bam, bam)
}
indexBam(bam.files)

Some mapping statistics are reported as previously described.

library(Rsamtools)
diagnostics <- list()
for (bam in bam.files) {
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[bam]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
##                   Total   Mapped  Marked Prop.mapped Prop.marked
## SRR1145787.bam 28525952 24015041 2244935    84.18664    9.348037
## SRR1145788.bam 25514465 21288115 2062157    83.43547    9.686893
## SRR1145789.bam 34476967 28830024 2678297    83.62111    9.289958
## SRR1145790.bam 32624587 27067108 2912659    82.96537   10.760880

3 Counting reads into windows

First, a readParam object is constructed to standardize the parameter settings in this analysis. The ENCODE blacklist is again used to remove reads in problematic regions (ENCODE Project Consortium 2012). For consistency, the MAPQ threshold of 50 is also re-used here for removing poorly aligned reads. Lower thresholds (e.g., from 10 to 20) can be used for longer reads with more reliable mapping locations - though in practice, the majority of long read alignments reported by Rsubread tend to have very high or very low MAPQ scores, such that the exact choice of the MAPQ threshold is not a critical parameter.

library(rtracklayer)
blacklist <- import("mm10.blacklist.bed.gz")
library(csaw)
param <- readParam(minq=50, discard=blacklist)

The average fragment length is estimated by maximizing the cross-correlation function, as previously described.

x <- correlateReads(bam.files, param=reform(param, dedup=TRUE))
frag.len <- maximizeCcf(x)
frag.len
## [1] 162

Reads are then counted into sliding windows using csaw (Lun and Smyth 2015). For TF data analyses, smaller windows are necessary to capture sharp binding sites. A large window size will be suboptimal as the count for a particular site will be “contaminated” by non-specific background in the neighbouring regions. In this case, a window size of 10 bp is used.

win.data <- windowCounts(bam.files, param=param, width=10, ext=frag.len)
win.data
## class: RangedSummarizedExperiment 
## dim: 9144853 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

The default spacing of 50 bp is also used here. This may seem inappropriate, given that the windows are only 10 bp. However, reads lying in the interval between adjacent windows will still be counted into several windows. This is because reads are extended to the value of frag.len, which is substantially larger than the 50 bp spacing. Again, smaller spacings can be used but will provide little benefit, given that each extended read already overlaps multiple windows.

4 Normalization for composition biases

Composition biases are introduced when the amount of DB in each condition is unbalanced (Robinson and Oshlack 2010; Lun and Smyth 2014). More binding in one condition means that more reads are sequenced at the binding sites, leaving fewer reads for the rest of the genome. This suppresses the genomic coverage at non-DB sites, resulting in spurious differences between libraries. To remove this bias, reads are counted into large genomic bins. Most bins are assumed to represent non-DB background regions. Any systematic differences in the coverage of those bins is attributed to composition bias and is normalized out. Specifically, the TMM method (Robinson and Oshlack 2010) is applied to compute normalization factors from the bin counts. These factors are then applied to the DB analysis with the window counts.

bins <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
win.data <- normOffsets(bins, se.out=win.data)
normfacs <- win.data$norm.factors

The effect of normalization is visualized with some mean-difference plots between pairs of libraries (Figure 1). The dense cloud in each plot represents the majority of bins in the genome. These are assumed to mostly contain background regions. A non-zero log-fold change for these bins indicates that composition bias is present between libraries. The red line represents the log-ratio of normalization factors and passes through the centre of the cloud in each plot, indicating that the bias has been successfully identified and removed.

library(edgeR)
y.bin <- asDGEList(bins)
bin.ab <- aveLogCPM(y.bin)
adjc <- cpm(y.bin, log=TRUE)

par(cex.lab=1.5, mfrow=c(1,3))
smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 4)")
abline(h=log2(normfacs[1]/normfacs[4]), col="red")
smoothScatter(bin.ab, adjc[,2]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (2 vs 4)")
abline(h=log2(normfacs[2]/normfacs[4]), col="red")
smoothScatter(bin.ab, adjc[,3]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (3 vs 4)")
abline(h=log2(normfacs[3]/normfacs[4]), col="red")
Mean-difference plots for the bin counts, comparing library 4 to all other libraries. The red line represents the log-ratio of the normalization factors between libraries.

Figure 1: Mean-difference plots for the bin counts, comparing library 4 to all other libraries
The red line represents the log-ratio of the normalization factors between libraries.

Note that this normalization strategy is quite different from that in the H3K9ac analysis. Here, systematic DB in one direction is expected between conditions, given that CBP function is lost in the KO genotype. This means that the assumption of a non-DB majority (required for non-linear normalization of the H3K9ac data) is not valid. No such assumption is made by the binned-TMM approach described above, which makes it more appropriate for use in the CBP analysis.

4.0.1 Filtering of low-abundance windows

Removal of low-abundance windows is performed as previously described. The majority of windows in background regions are filtered out upon applying a modest fold-change threshold. This leaves a small set of relevant windows for further analysis.

filter.stat <- filterWindows(win.data, bins, type="global")
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
##    Mode   FALSE    TRUE 
## logical 8874505  270348
filtered.data <- win.data[keep,]

Note that the 10 kbp bins are used here for filtering, while smaller 2 kbp bins were used in the corresponding step for the H3K9ac analysis. This is purely for convenience – the 10 kbp counts for this data set were previously loaded for normalization, and can be re-used during filtering to save time. Changes in bin size will have little impact on the results, so long as the bins (and their counts) are large enough for precise estimation of the background abundance. While smaller bins provide greater spatial resolution, this is irrelevant for quantifying coverage in large background regions that span most of the genome.

5 Statistical modelling of biological variability

Counts for each window are modelled using edgeR as previously described (McCarthy, Chen, and Smyth 2012; Robinson, McCarthy, and Smyth 2010). First, a design matrix is constructed for our experimental design.

genotype <- factor(genotype)
design <- model.matrix(~0+genotype)
colnames(design) <- levels(genotype)
design
##   ko wt
## 1  0  1
## 2  0  1
## 3  1  0
## 4  1  0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"

Estimation of the NB and QL dispersions is performed for each window (Lund et al. 2012). The estimated NB dispersions are substantially larger than those observed in the H3K9ac data set. In addition, the estimated prior d.f. is infinite.

y <- asDGEList(filtered.data, norm.factors=normfacs)
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1204  0.1670  0.1861  0.1902  0.2150  0.2573
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$df.prior)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     Inf     Inf     Inf     Inf     Inf     Inf

These statistics are consistent with the presence of a batch effect between replicates. The dispersions for all windows are inflated to a similarly large value by the batch effect, resulting in low variability in the dispersions across windows. This is illustrated in Figure 2 where the WT libraries are clearly separated in both dimensions of the MDS plot. In particular, separation of replicates on the first dimension is indicative of a systematic difference of size comparable to that between genotypes.

plotMDS(cpm(y, log=TRUE), top=10000, labels=genotype,
    col=c("red", "blue")[as.integer(genotype)])
MDS plot with two dimensions for all libraries in the CBP data set. Libraries are labelled and coloured according to the genotype. A larger top set of windows was used to improve the visualization of the genome-wide differences between the WT libraries.

Figure 2: MDS plot with two dimensions for all libraries in the CBP data set
Libraries are labelled and coloured according to the genotype. A larger top set of windows was used to improve the visualization of the genome-wide differences between the WT libraries.

The presence of a large batch effect between replicates is not ideal. Nonetheless, the DB analysis can proceed, albeit with some loss of power due to the inflated NB dispersions.

6 Testing for DB

DB windows are identified using the QL F-test. Windows are clustered into regions, and the region-level FDR is controlled using Simes’ method (Simes 1986; Lun and Smyth 2014). All significant regions have increased CBP binding in the WT genotype. This is expected, given that protein function should be lost in the KO genotype.

contrast <- makeContrasts(wt-ko, levels=design)
res <- glmQLFTest(fit, contrast=contrast)
merged <- mergeWindows(rowRanges(filtered.data), tol=100, max.width=5000)
tabcom <- combineTests(merged$id, res$table)
tabbest <- getBestTest(merged$id, res$table)
is.sig <- tabcom$FDR <= 0.05
summary(is.sig)
##    Mode   FALSE    TRUE 
## logical   55952    1828
table(tabcom$direction[is.sig])
## 
##   up 
## 1828
is.sig.pos <- (tabbest$logFC > 0)[is.sig]
summary(is.sig.pos)
##    Mode    TRUE 
## logical    1828

These results are saved to file, as previously described. Key objects are also saved for convenience.

out.ranges <- merged$region
elementMetadata(out.ranges) <- data.frame(tabcom,
    best.pos=mid(ranges(rowRanges(filtered.data[tabbest$best]))),
    best.logFC=tabbest$logFC)
saveRDS(file="cbp_results.rds", out.ranges)
save(file="cbp_objects.Rda", win.data, bins, y)

7 Annotation and visualization

Annotation for each region is added using the detailRanges function, as previously described.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
anno <- detailRanges(out.ranges, orgdb=org.Mm.eg.db,
            txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
meta <- elementMetadata(out.ranges)
elementMetadata(out.ranges) <- data.frame(meta, anno)

One of the top-ranked DB regions will be visualized here. This corresponds to a simple DB event as all windows are changing in the same direction, i.e., up in the WT. The binding region is also quite small relative to some of the H3K9ac examples, consistent with sharp TF binding to a specific recognition site.

o <- order(out.ranges$PValue)    
cur.region <- out.ranges[o[2]]
cur.region
## GRanges object with 1 range and 11 metadata columns:
##       seqnames            ranges strand |  nWindows  logFC.up logFC.down
##          <Rle>         <IRanges>  <Rle> | <integer> <integer>  <integer>
##   [1]    chr16 70313851-70314860      * |        21        21          0
##                     PValue                 FDR   direction  best.pos
##                  <numeric>           <numeric> <character> <integer>
##   [1] 3.72676828075784e-08 0.00107666335631094          up  70314555
##             best.logFC    overlap     left    right
##              <numeric>   <factor> <factor> <factor>
##   [1] 4.54600951570794 Gbe1|0-1|+                  
##   -------
##   seqinfo: 66 sequences from an unspecified genome

Plotting is performed using Gviz (F. and R. 2016) with two tracks for each library – one for the forward-strand coverage, another for the reverse-strand coverage. This allows visualization of the strand bimodality that is characteristic of genuine TF binding sites. In Figure 3, two adjacent sites are present at the Gbe1 promoter, both of which exhibit increased binding in the WT genotype. Coverage is also substantially different between the WT replicates, consistent with the presence of a batch effect.

library(Gviz)
collected <- list()
lib.sizes <- filtered.data$totals/1e6

for (i in seq_along(bam.files)) {
    reads <- extractReads(bam.file=bam.files[i], cur.region, param=param)
    pcov <- as(coverage(reads[strand(reads)=="+"])/lib.sizes[i], "GRanges")
    ncov <- as(coverage(reads[strand(reads)=="-"])/-lib.sizes[i], "GRanges")
    ptrack <- DataTrack(pcov, type="histogram", lwd=0, ylim=c(-5, 5),
        name=bam.files[i], col.axis="black", col.title="black",
        fill="blue", col.histogram=NA)
    ntrack <- DataTrack(ncov, type="histogram", lwd=0, ylim=c(-5, 5),
        fill="red", col.histogram=NA)
    collected[[i]] <- OverlayTrack(trackList=list(ptrack, ntrack))
}

gax <- GenomeAxisTrack(col="black", fontsize=15, size=2)
greg <- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE,
    geneSymbol=TRUE, name="", background.title="transparent")
plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))
Coverage tracks for TF binding sites that are differentially bound in the WT (top two tracks) against the KO (last two tracks). Blue and red tracks represent forward- and reverse-strand coverage, respectively, on a per-million scale (capped at 5 in SRR1145788, for visibility).

Figure 3: Coverage tracks for TF binding sites that are differentially bound in the WT (top two tracks) against the KO (last two tracks)
Blue and red tracks represent forward- and reverse-strand coverage, respectively, on a per-million scale (capped at 5 in SRR1145788, for visibility).

Note that that the gax and greg objects are the same as those used in the visualization of the H3K9ac data.

8 Session information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Gviz_1.24.0                             
##  [2] ChIPpeakAnno_3.14.0                     
##  [3] VennDiagram_1.6.20                      
##  [4] futile.logger_1.4.3                     
##  [5] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
##  [6] GenomicFeatures_1.32.0                  
##  [7] org.Mm.eg.db_3.6.0                      
##  [8] AnnotationDbi_1.42.0                    
##  [9] edgeR_3.22.0                            
## [10] limma_3.36.0                            
## [11] csaw_1.14.0                             
## [12] SummarizedExperiment_1.10.0             
## [13] DelayedArray_0.6.0                      
## [14] BiocParallel_1.14.0                     
## [15] matrixStats_0.53.1                      
## [16] Biobase_2.40.0                          
## [17] rtracklayer_1.40.0                      
## [18] Rsamtools_1.32.0                        
## [19] Biostrings_2.48.0                       
## [20] XVector_0.20.0                          
## [21] GenomicRanges_1.32.0                    
## [22] GenomeInfoDb_1.16.0                     
## [23] IRanges_2.14.1                          
## [24] S4Vectors_0.18.1                        
## [25] BiocGenerics_0.26.0                     
## [26] Rsubread_1.30.0                         
## [27] knitr_1.20                              
## [28] BiocStyle_2.8.0                         
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.3-2         seqinr_3.4-5            
##  [3] rprojroot_1.3-2          biovizBase_1.28.0       
##  [5] htmlTable_1.11.2         base64enc_0.1-3         
##  [7] dichromat_2.0-0          rstudioapi_0.7          
##  [9] bit64_0.9-7              splines_3.5.0           
## [11] ade4_1.7-11              Formula_1.2-2           
## [13] cluster_2.0.7-1          GO.db_3.6.0             
## [15] graph_1.58.0             compiler_3.5.0          
## [17] httr_1.3.1               backports_1.1.2         
## [19] assertthat_0.2.0         Matrix_1.2-14           
## [21] lazyeval_0.2.1           formatR_1.5             
## [23] acepack_1.4.1            htmltools_0.3.6         
## [25] prettyunits_1.0.2        tools_3.5.0             
## [27] gtable_0.2.0             GenomeInfoDbData_1.1.0  
## [29] Rcpp_0.12.16             multtest_2.36.0         
## [31] xfun_0.1                 stringr_1.3.0           
## [33] ensembldb_2.4.0          statmod_1.4.30          
## [35] XML_3.98-1.11            idr_1.2                 
## [37] zlibbioc_1.26.0          MASS_7.3-50             
## [39] scales_0.5.0             BSgenome_1.48.0         
## [41] VariantAnnotation_1.26.0 BiocInstaller_1.30.0    
## [43] ProtGenerics_1.12.0      RBGL_1.56.0             
## [45] AnnotationFilter_1.4.0   lambda.r_1.2.2          
## [47] RColorBrewer_1.1-2       yaml_2.1.19             
## [49] curl_3.2                 memoise_1.1.0           
## [51] gridExtra_2.3            ggplot2_2.2.1           
## [53] biomaRt_2.36.0           rpart_4.1-13            
## [55] latticeExtra_0.6-28      stringi_1.1.7           
## [57] RSQLite_2.1.0            Rhtslib_1.12.0          
## [59] highr_0.6                checkmate_1.8.5         
## [61] rlang_0.2.0              pkgconfig_2.0.1         
## [63] bitops_1.0-6             evaluate_0.10.1         
## [65] lattice_0.20-35          GenomicAlignments_1.16.0
## [67] htmlwidgets_1.2          bit_1.1-12              
## [69] plyr_1.8.4               magrittr_1.5            
## [71] bookdown_0.7             R6_2.2.2                
## [73] Hmisc_4.1-1              DBI_0.8                 
## [75] pillar_1.2.2             foreign_0.8-70          
## [77] survival_2.42-3          RCurl_1.95-4.10         
## [79] nnet_7.3-12              tibble_1.4.2            
## [81] futile.options_1.0.1     KernSmooth_2.23-15      
## [83] rmarkdown_1.9            progress_1.1.2          
## [85] locfit_1.5-9.1           data.table_1.11.0       
## [87] blob_1.1.1               digest_0.6.15           
## [89] regioneR_1.12.0          munsell_0.4.3

References

ENCODE Project Consortium. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature 489 (7414):57–74.

F., Hahne, and Ivanek R. 2016. “Visualizing Genomic Data Using Gviz and Bioconductor.” In Statistical Genomics: Methods and Protocols, edited by Ewy Mathé and Sean Davis, 335–51. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_16.

Kasper, L. H., C. Qu, J. C. Obenauer, D. J. McGoldrick, and P. K. Brindle. 2014. “Genome-wide and single-cell analyses reveal a context dependent relationship between CBP recruitment and gene expression.” Nucleic Acids Res. 42 (18):11363–82.

Lun, A. T., and G. K. Smyth. 2014. “De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly.” Nucleic Acids Res. 42 (11):e95.

———. 2015. “csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows.” Nucleic Acids Res. https://doi.org/10.1093/nar/gkv1191.

Lund, S. P., D. Nettleton, D. J. McCarthy, and G. K. Smyth. 2012. “Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates.” Stat. Appl. Genet. Mol. Biol. 11 (5):Article 8.

McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Res. 40 (10):4288–97.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1):139–40.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3):R25.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3):751–54.