1 Overview

Here, we perform a window-based differential binding (DB) analysis to identify regions of differential H3K9ac enrichment between pro-B and mature B cells (Revilla-I-Domingo et al. 2012). H3K9ac is associated with active promoters and tends to exhibit narrower regions of enrichment than other marks such as H3K27me3. We download the BAM files using the relevant function from the chipseqDBData package. (BAM files are cached upon the first call to this function, so subsequent calls do not need to re-download the files.) The experimental design contains two biological replicates for each of the two cell types.

library(chipseqDBData)
acdata <- H3K9acData()
acdata

## DataFrame with 4 rows and 3 columns
##                  Name            Description
##           <character>            <character>
## 1    h3k9ac-proB-8113    pro-B H3K9ac (8113)
## 2    h3k9ac-proB-8108    pro-B H3K9ac (8108)
## 3 h3k9ac-matureB-8059 mature B H3K9ac (8059)
## 4 h3k9ac-matureB-8086 mature B H3K9ac (8086)
##                                                      Path
##                                               <character>
## 1    /tmp/RtmpYT6shT/file5dc2acd2228/h3k9ac-proB-8113.bam
## 2    /tmp/RtmpYT6shT/file5dc2acd2228/h3k9ac-proB-8108.bam
## 3 /tmp/RtmpYT6shT/file5dc2acd2228/h3k9ac-matureB-8059.bam
## 4 /tmp/RtmpYT6shT/file5dc2acd2228/h3k9ac-matureB-8086.bam

2 Pre-processing checks

2.1 Examining mapping statistics

We use methods from the Rsamtools package to compute some mapping statistics for each BAM file. Ideally, the proportion of mapped reads should be high (70-80% or higher), while the proportion of marked reads (i.e., mapped reads flagged as potential PCR duplicates) should be low (generally below 20%).

library(Rsamtools)
diagnostics <- list()
for (bam in acdata$Path) {
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[basename(bam)]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats

##                            Total  Mapped Marked Prop.mapped Prop.marked
## h3k9ac-proB-8113.bam    10724526 8832006 434884    82.35335    4.923955
## h3k9ac-proB-8108.bam    10413135 7793913 252271    74.84694    3.236770
## h3k9ac-matureB-8059.bam 16675372 4670364 396785    28.00756    8.495805
## h3k9ac-matureB-8086.bam  6347683 4551692 141583    71.70635    3.110558

Note that all csaw functions that read from a BAM file require BAM indices with .bai suffixes. In this case, index files have already been downloaded by H3K9acData(), but users supplying their own files should take care to ensure that BAM indices are available with appropriate names.
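A check for missing indices can be sketched in base R as follows. This is only an illustration and not part of csaw: the helper name has_bam_index is hypothetical, it assumes the index sits next to the BAM file with ".bai" appended to the full file name, and the demonstration uses empty placeholder files rather than real alignments.

```r
# Hypothetical helper: TRUE for each BAM file that has an index
# named "<file>.bam.bai" in the same directory.
has_bam_index <- function(bam.files) {
    file.exists(paste0(bam.files, ".bai"))
}

# Demonstration with empty placeholder files (not real BAM files).
tmp <- tempfile("bamdemo")
dir.create(tmp)
bams <- file.path(tmp, c("a.bam", "b.bam"))
invisible(file.create(bams))
invisible(file.create(paste0(bams[1], ".bai")))   # only a.bam gets an index

index.ok <- has_bam_index(bams)
index.ok
```

Files that fail this check can be indexed with indexBam() from Rsamtools, provided that they are sorted by coordinate.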

2.2 Obtaining the ENCODE blacklist for mm10

A number of genomic regions contain high artifactual signal in ChIP-seq experiments. These often correspond to genomic features like telomeres or microsatellite repeats. For example, multiple tandem repeats in the real genome are reported as a single unit in the genome build. Alignment of all (non-specifically immunoprecipitated) reads from the former will result in artificially high coverage of the latter. Moreover, differences in repeat copy numbers between conditions can lead to detection of spurious DB.

As such, these problematic regions must be removed prior to further analysis. This is done with an annotated blacklist for the mm10 build of the mouse genome, constructed by identifying consistently problematic regions from ENCODE datasets (ENCODE Project Consortium 2012). We download this BED file and save it into a local cache with the BiocFileCache package. This allows it to be used again in later workflows without being re-downloaded.

library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("https://www.encodeproject.org",
    "files/ENCFF547MET/@@download/ENCFF547MET.bed.gz"))

Genomic intervals in the blacklist are loaded using the import() method from the rtracklayer package. All reads mapped within the blacklisted intervals will be ignored during processing in csaw by specifying the discard= parameter (see below).

library(rtracklayer)
blacklist <- import(black.path)
blacklist

## GRanges object with 164 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]    chr10     3110061-3110270      *
##     [2]    chr10   22142531-22142880      *
##     [3]    chr10   22142831-22143070      *
##     [4]    chr10   58223871-58224100      *
##     [5]    chr10   58225261-58225500      *
##     ...      ...                 ...    ...
##   [160]     chr9     3038051-3038300      *
##   [161]     chr9   24541941-24542200      *
##   [162]     chr9   35305121-35305620      *
##   [163]     chr9 110281191-110281400      *
##   [164]     chr9 123872951-123873160      *
##   -------
##   seqinfo: 19 sequences from an unspecified genome; no seqlengths

Any user-defined set of regions can be used as a blacklist in this analysis.

For example, one could use predicted repeat regions from the UCSC genome annotation (Rosenbloom et al. 2015). This tends to remove a greater number of problematic regions (especially microsatellites) compared to the ENCODE blacklist. However, the size of the UCSC list means that genuine DB sites may also be removed. Thus, the ENCODE blacklist is preferred for most applications.

Alternatively, if negative control samples are available, they can be used to empirically identify problematic regions with the GreyListChIP package. These regions should be ignored as they have high coverage in the controls and are unlikely to be genuine binding sites.

2.3 Setting up the read extraction parameters

In the csaw package, the readParam object determines which reads are extracted from the BAM files. The intention is to set this up once and to re-use it in all relevant functions. For this analysis, reads are ignored if they map to blacklist regions or do not map to the standard set of mouse nuclear chromosomes. (In this case, we are not interested in the mitochondrial genome, as it should not be bound by histones anyway.)

library(csaw)
standard.chr <- paste0("chr", c(1:19, "X", "Y"))
param <- readParam(minq=20, discard=blacklist, restrict=standard.chr)

Reads are also ignored if they have a mapping quality (MAPQ) score below 20. This threshold is more stringent than usual, to account for the fact that the short reads used here (32-36 bp) are more difficult to align accurately. It avoids spurious results due to weak or non-unique alignments that should be assigned low MAPQ scores by the aligner. Note that the range of MAPQ scores will vary between aligners, so some inspection of the BAM files is necessary to choose an appropriate value.

3 Computing the average fragment length

Strand bimodality is often observed in ChIP-seq experiments involving narrow binding events like H3K9ac marking. This refers to the presence of distinct subpeaks on each strand and is quantified with cross-correlation plots (Kharchenko, Tolstorukov, and Park 2008). A strong peak in the cross-correlations should be observed if immunoprecipitation was successful. The delay distance at the peak corresponds to the distance between forward- and reverse-strand subpeaks. This is identified from Figure 1 and is used as the average fragment length for this analysis.

x <- correlateReads(acdata$Path, param=reform(param, dedup=TRUE))
frag.len <- maximizeCcf(x)
frag.len

## [1] 154

plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")

Figure 1: Cross-correlation function (CCF) against delay distance for the H3K9ac data set. The delay with the maximum correlation is shown as the red line.
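The idea behind the cross-correlation can be illustrated with a toy example in base R. This is a sketch under idealized assumptions and not the correlateReads() implementation: the reverse-strand profile is a noise-free copy of the forward-strand profile shifted by a known distance, and all object names (fwd.cov, rev.cov, true.shift) are made up for the illustration.

```r
set.seed(100)
n <- 5000           # length of the toy chromosome (bp)
true.shift <- 150   # assumed distance between strand-specific subpeaks

# Forward-strand read starts: sparse Poisson counts per base.
fwd.cov <- rpois(n, lambda=0.2)
# Reverse-strand counts: the same profile shifted downstream by 'true.shift'.
rev.cov <- c(integer(true.shift), fwd.cov[seq_len(n - true.shift)])

# Correlation between forward and reverse counts at each delay distance.
max.delay <- 300
toy.ccf <- vapply(0:max.delay, function(k) {
    cor(fwd.cov[seq_len(n - k)], rev.cov[(1 + k):n])
}, numeric(1))

# Delays start at zero, so subtract 1 from the index of the maximum.
est.shift <- which.max(toy.ccf) - 1
est.shift
```

In this idealized setting the maximum sits exactly at the simulated shift. Real data give a broader, noisier peak because fragment lengths vary and many reads are background.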

Only unmarked reads (i.e., not potential PCR duplicates) are used to calculate the cross-correlations. This reduces noise from variable PCR amplification and decreases the size of the “phantom” peak at the read length (Landt et al. 2012). However, general removal of marked reads is risky as it caps the signal in high-coverage regions of the genome. This can result in loss of power to detect DB, or introduction of spurious DB when the same cap is applied to libraries of different sizes. Thus, the marking status of each read will be ignored in the rest of the analysis, i.e., no duplicates will be removed in downstream steps.

4 Counting reads into windows

csaw uses a sliding window strategy to quantify protein binding intensity across the genome. Each read is directionally extended to the average fragment length (Figure 2) to represent the DNA fragment from which that read was sequenced. Any position within the inferred fragment is a potential contact site for the protein of interest. To quantify binding in a genomic window, the number of these fragments overlapping the window is counted. The window is then moved to its next position on the genome and counting is repeated. (Each read is usually counted into multiple windows, which will introduce correlations between adjacent windows but will not otherwise affect the analysis.) This is done for all samples such that a count is obtained for each window in each sample.

Figure 2: Directional extension of reads by the average fragment length ext in single-end ChIP-seq data. Each extended read represents an imputed fragment, and the number of fragments overlapping a window of a given width is counted.
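The extension and counting scheme can be sketched in base R with three made-up reads. This is an illustration only and not how windowCounts() is implemented; the read coordinates and the toy.* object names are hypothetical.

```r
# Toy single-end reads (36 bp each) on one chromosome.
toy.reads <- data.frame(
    start  = c(100, 180, 400),
    end    = c(135, 215, 435),
    strand = c("+", "-", "+")
)
toy.ext <- 150       # average fragment length
toy.width <- 150     # window width
toy.spacing <- 50    # distance between adjacent window starts

# Extend each read in its 5' to 3' direction to the fragment length:
# forward reads grow to the right, reverse reads grow to the left.
frag.start <- ifelse(toy.reads$strand == "+",
    toy.reads$start, toy.reads$end - toy.ext + 1)
frag.end <- ifelse(toy.reads$strand == "+",
    toy.reads$start + toy.ext - 1, toy.reads$end)

# Sliding windows covering the first 650 bp.
win.start <- seq(1, 501, by=toy.spacing)
win.end <- win.start + toy.width - 1

# Count the imputed fragments overlapping each window.
toy.counts <- vapply(seq_along(win.start), function(i) {
    sum(frag.start <= win.end[i] & frag.end >= win.start[i])
}, integer(1))
data.frame(win.start, win.end, count=toy.counts)
```

Each fragment is counted into several adjacent windows, which is the source of the correlations between neighbouring windows.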

The windowCounts() function produces a RangedSummarizedExperiment object containing a matrix of such counts. Each row corresponds to a window; each column represents a BAM file corresponding to a single sample; and each entry of the matrix represents the number of fragments overlapping a particular window in a particular sample. Counting can be parallelized across files using the BPPARAM= argument.

win.data <- windowCounts(acdata$Path, param=param, width=150, ext=frag.len)
win.data

## class: RangedSummarizedExperiment 
## dim: 1671254 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

To analyze H3K9ac data, a window size of 150 bp is used here. This corresponds roughly to the length of the DNA in a nucleosome (Humburg et al. 2011), which is the smallest relevant unit for studying histone mark enrichment. The spacing between windows is set to the default of 50 bp, i.e., the start positions for adjacent windows are 50 bp apart. Smaller spacings can be used to improve spatial resolution, but will increase memory usage and runtime by increasing the number of windows required to cover the genome. This is unnecessary as increased resolution confers little practical benefit for this data set – counts for very closely spaced windows will be practically identical. Finally, windows with very low counts (by default, less than a sum of 10 across all samples) are removed to reduce memory usage. This represents a preliminary filter to remove uninteresting windows corresponding to likely background regions.

5 Filtering windows by abundance

As previously mentioned, low-abundance windows are unlikely to contain binding sites and need to be filtered out. This improves power by removing irrelevant tests prior to the multiple testing correction; avoids problems with discreteness in downstream statistical methods; and reduces computational work for further analyses. Here, filtering is performed using the average abundance of each window (McCarthy, Chen, and Smyth 2012), which is defined as the average log-count per million for that window. This performs well as an independent filter statistic for NB-distributed count data (Lun and Smyth 2014).

The filter threshold is defined based on the assumption that most regions in the genome are not marked by H3K9ac. Reads are counted into large bins and the median coverage across those bins is used as an estimate of the background abundance. (Large bins are necessary to obtain a precise estimate of background coverage, which would otherwise be too low in individual windows.) This estimate is then compared to the average abundances of the windows, after rescaling to account for differences in the window and bin sizes. A window is only retained if its coverage is 3-fold higher than that of the background regions, i.e., the abundance of the window is greater than the background abundance estimate by log₂(3) or more. This removes a large number of windows that are weakly or not marked and are likely to be irrelevant.

bins <- windowCounts(acdata$Path, bin=TRUE, width=2000, param=param)
filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)

##    Mode   FALSE    TRUE 
## logical  982167  689087
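The rescaling between bins and windows can be illustrated with simple arithmetic. The sketch below is a simplified stand-in for filterWindowsGlobal(): it works on raw counts from a single hypothetical sample and ignores library sizes, prior counts, and the contribution of the fragment length to the effective width. All counts and toy.* names are made up.

```r
bin.width <- 2000
win.width <- 150
toy.min.fc <- 3

# Toy counts: most bins are background, one overlaps an enriched region.
toy.bin.counts <- c(36, 40, 44, 38, 42, 400, 40)
toy.win.counts <- c(2, 8, 12, 30)

# Median bin count, rescaled to the count expected in a single window.
toy.background <- median(toy.bin.counts) * win.width / bin.width

# Log2-fold enrichment of each window over the rescaled background.
toy.stat <- log2(toy.win.counts / toy.background)
toy.keep <- toy.stat > log2(toy.min.fc)
data.frame(count=toy.win.counts, log2.enrich=round(toy.stat, 2), keep=toy.keep)
```

Using the median makes the background estimate robust to the minority of bins that overlap genuinely enriched regions, such as the bin with 400 reads here.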

The effect of the fold-change threshold is examined visually in Figure 3. The chosen threshold is greater than the abundances of most bins in the genome – presumably, those that contain background regions. This suggests that the filter will remove most windows lying within background regions.

hist(filter.stat$back.abundances, main="", breaks=50,
    xlab="Background abundance (log2-CPM)")
threshold <- filter.stat$abundances[1] - filter.stat$filter[1] + log2(min.fc)
abline(v=threshold, col="red")

Figure 3: Histogram of average abundances across all 2 kbp genomic bins. The filter threshold is shown as the red line.

The filtering itself is done by simply subsetting the RangedSummarizedExperiment object.

filtered.data <- win.data[keep,]

6 Normalizing for sample-specific trended biases

Normalization is required to eliminate confounding sample-specific biases prior to any comparisons between samples. Here, a trended bias is present between samples in Figure 4. This refers to a systematic fold-difference in per-window coverage between samples that changes according to the average abundance of the window.

win.ab <- scaledAverage(filtered.data)
adjc <- calculateCPM(filtered.data, use.offsets=FALSE)
logfc <- adjc[,4] - adjc[,1]
smoothScatter(win.ab, logfc, ylim=c(-6, 6), xlim=c(0, 5),
    xlab="Average abundance", ylab="Log-fold change")

lfit <- smooth.spline(logfc~win.ab, df=5)
o <- order(win.ab)
lines(win.ab[o], fitted(lfit)[o], col="red", lty=2)

Figure 4: Abundance-dependent trend in the log-fold change between two H3K9ac samples (mature B over pro-B), across all windows retained after filtering. A smoothed spline fitted to the log-fold change against the average abundance is also shown in red.

Trended biases cannot be removed by scaling methods like TMM normalization (Robinson and Oshlack 2010), as the amount of scaling required varies with the abundance of the window. Rather, non-linear normalization methods must be used. csaw implements a version of the fast loess method (Ballman et al. 2004) that has been modified to handle count data (Lun and Smyth 2015). This produces a matrix of offsets that can be used during model fitting.

filtered.data <- normOffsets(filtered.data)
offsets <- assay(filtered.data, "offset")
head(offsets)

##           [,1]      [,2]       [,3]       [,4]
## [1,] 0.5372350 0.3457091 -0.4860457 -0.3968984
## [2,] 0.5082160 0.3238545 -0.4601253 -0.3719453
## [3,] 0.5021301 0.3192584 -0.4546073 -0.3667813
## [4,] 0.6274643 0.4119203 -0.5571155 -0.4822690
## [5,] 0.7049858 0.4638779 -0.6039740 -0.5648898
## [6,] 0.7260917 0.4778642 -0.6174239 -0.5865321
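The principle of loess-based offsets can be sketched with simulated data in base R. This is written in the spirit of the approach described above and is not the normOffsets() implementation: it works on continuous log-values rather than counts, the simulated bias is an arbitrary linear function of abundance, and all object names are hypothetical.

```r
set.seed(42)
n <- 2000
true.ab <- runif(n, 0, 5)       # true log-abundance of each window
toy.bias <- 0.3 * true.ab       # abundance-dependent bias between samples
toy.logc <- cbind(true.ab - toy.bias/2 + rnorm(n, sd=0.2),
                  true.ab + toy.bias/2 + rnorm(n, sd=0.2))

avg.ab <- rowMeans(toy.logc)
# Fit a loess curve to each sample's log-value against the average abundance.
toy.fitted <- apply(toy.logc, 2, function(y) fitted(loess(y ~ avg.ab)))
# Centre the fitted values so that offsets sum to zero within each window.
toy.offsets <- toy.fitted - rowMeans(toy.fitted)

toy.raw.fc <- toy.logc[,2] - toy.logc[,1]
toy.norm.fc <- toy.raw.fc - (toy.offsets[,2] - toy.offsets[,1])

# Slope of the log-fold change against abundance, before and after.
slope.before <- unname(coef(lm(toy.raw.fc ~ avg.ab))[2])
slope.after <- unname(coef(lm(toy.norm.fc ~ avg.ab))[2])
c(before=slope.before, after=slope.after)
```

Subtracting the offsets removes whatever systematic difference exists at each abundance, which is why the method relies on most windows not being DB.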

The effect of non-linear normalization is visualized with another mean-difference plot. Once the offsets are applied to adjust the log-fold changes, the trend is eliminated from the plot (Figure 5). The cloud of points is also centred at a log-fold change of zero. This indicates that normalization was successful in removing the differences between samples.

norm.adjc <- calculateCPM(filtered.data, use.offsets=TRUE)
norm.fc <- norm.adjc[,4]-norm.adjc[,1]
smoothScatter(win.ab, norm.fc, ylim=c(-6, 6), xlim=c(0, 5),
    xlab="Average abundance", ylab="Log-fold change")

lfit <- smooth.spline(norm.fc~win.ab, df=5)
lines(win.ab[o], fitted(lfit)[o], col="red", lty=2)

Figure 5: Effect of non-linear normalization on the trended bias between two H3K9ac samples. Normalized log-fold changes are shown for all windows retained after filtering. A smoothed spline fitted to the log-fold change against the average abundance is also shown in red.

The implicit assumption of non-linear methods is that most windows at each abundance are not DB. Any systematic difference between samples is attributed to bias and is removed. The assumption of a non-DB majority is reasonable for this data set, given that the cell types being compared are quite closely related. However, it is not appropriate in cases where large-scale DB is expected, as removal of the difference would result in loss of genuine DB. An alternative normalization strategy for these situations will be described later in the CBP analysis.

7 Statistical modelling of biological variability

7.1 Setting up the design matrix

Counts are modelled using negative binomial generalized linear models (NB GLMs) in the edgeR package (McCarthy, Chen, and Smyth 2012; Robinson, McCarthy, and Smyth 2010). The NB distribution is useful as it can handle low, discrete counts for each window. The NB dispersion parameter allows modelling of biological variability between replicate samples. GLMs can also accommodate complex experimental designs, though a simple design is sufficient for this study.

celltype <- acdata$Description
celltype[grep("pro", celltype)] <- "proB"
celltype[grep("mature", celltype)] <- "matureB"

celltype <- factor(celltype)
design <- model.matrix(~0+celltype)
colnames(design) <- levels(celltype)
design

##   matureB proB
## 1       0    1
## 2       0    1
## 3       1    0
## 4       1    0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$celltype
## [1] "contr.treatment"

As a general rule, the experimental design should contain at least two replicates in each of the biological conditions. This ensures that the results for each condition are replicable and are not the result of technical artifacts such as PCR duplicates. Obviously, more replicates will provide more power to detect DB accurately and reliably, albeit at the cost of time and experimental resources.

7.2 Estimating the NB dispersion

The RangedSummarizedExperiment object is coerced into a DGEList object (plus offsets) for use in edgeR. Estimation of the NB dispersion is performed using the estimateDisp() function. Specifically, an NB dispersion trend is fitted to all windows against the average abundance. This means that empirical mean-dispersion trends can be flexibly modelled.

library(edgeR)
y <- asDGEList(filtered.data)
str(y)

## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 3
##   .. ..$ : int [1:689087, 1:4] 6 6 7 12 15 17 24 22 25 24 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:689087] "1" "2" "3" "4" ...
##   .. .. .. ..$ : chr [1:4] "Sample1" "Sample2" "Sample3" "Sample4"
##   .. ..$ :'data.frame':  4 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 1 level "1": 1 1 1 1
##   .. .. ..$ lib.size    : int [1:4] 8392971 7269175 3792141 4241789
##   .. .. ..$ norm.factors: num [1:4] 1 1 1 1
##   .. ..$ : num [1:689087, 1:4] 16.1 16 16 16.2 16.2 ...

y <- estimateDisp(y, design)
summary(y$trended.dispersion)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04096 0.05252 0.06165 0.06075 0.07209 0.07395

The NB dispersion trend is visualized in Figure 6 as the biological coefficient of variation (BCV), i.e., the square root of the NB dispersion. Note that only the trended dispersion will be used in the downstream steps – the common and tagwise values are only shown for diagnostic purposes. Specifically, the common BCV provides an overall measure of the variability in the data set, averaged across all windows. Data sets with common BCVs ranging from 10 to 20% are considered to have low variability, i.e., counts are highly reproducible. The tagwise BCVs should also be dispersed above and below the fitted trend, indicating that the fit was successful.

plotBCV(y)

Figure 6: Abundance-dependent trend in the BCV for each window, represented by the blue line. Common (red) and tagwise estimates (black) are also shown.
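The relationship between the NB dispersion and the BCV can be made concrete with a small calculation. The sketch uses the median trended dispersion from the summary above and the standard NB variance function, in which the variance equals the mean plus the dispersion times the squared mean. The example means and toy.* names are arbitrary.

```r
toy.disp <- 0.06165            # median trended dispersion from the summary above
toy.bcv <- sqrt(toy.disp)      # BCV is the square root of the NB dispersion
toy.mu <- c(10, 100, 1000)     # arbitrary expected counts

# NB variance: Poisson (sampling) part plus biological part.
toy.nb.var <- toy.mu + toy.disp * toy.mu^2
# The coefficient of variation of the counts approaches the BCV
# as the mean increases and Poisson noise becomes negligible.
toy.nb.cv <- sqrt(toy.nb.var) / toy.mu
round(c(BCV=toy.bcv, CV=toy.nb.cv), 3)
```

This is one way to see why variability tends to decrease and then plateau with increasing abundance: at large counts only the biological component remains.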

For most sequencing count data, we expect to see a decreasing trend that plateaus with increasing average abundance. This reflects the greater reliability of large counts, where the effects of stochasticity and technical artifacts (e.g., mapping errors, PCR duplicates) are averaged out. We observe no clear trend in Figure 6 as the windows have already been filtered to the plateau. This is still a satisfactory result as it indicates that the retained windows have low variability and more power to detect DB.

7.3 Estimating the QL dispersion

Additional modelling is provided with the quasi-likelihood (QL) methods in edgeR (Lund et al. 2012). This introduces a QL dispersion parameter for each window, which captures variability in the NB dispersion around the fitted trend for each window. Thus, the QL dispersion can model window-specific variability, whereas the NB dispersion trend is averaged across many windows. However, with limited replicates, there is not enough information for each window to stably estimate the QL dispersion. This is overcome by sharing information between windows with empirical Bayes (EB) shrinkage. The instability of the QL dispersion estimates is reduced by squeezing the estimates towards an abundance-dependent trend (Figure 7).

fit <- glmQLFit(y, design, robust=TRUE)
plotQLDisp(fit)

Figure 7: Effect of EB shrinkage on the raw QL dispersion estimate for each window (black) towards the abundance-dependent trend (blue) to obtain squeezed estimates (red).

The extent of shrinkage is determined by the prior degrees of freedom (d.f.). Large prior d.f. indicates that the dispersions were similar across windows, such that strong shrinkage to the trend could be performed to increase stability and power. Small prior d.f. indicates that the dispersions were more variable. In such cases, less squeezing is performed as strong shrinkage would be inappropriate.

summary(fit$df.prior)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2238 15.4949 15.4949 15.2544 15.4949 15.4949
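The role of the prior d.f. can be illustrated with the usual moderated-variance formula, in which the squeezed estimate is a weighted average of the raw estimate and the trend, weighted by the residual and prior d.f. respectively. The sketch below is an assumption-laden simplification: it omits the robustness weighting and any d.f. adjustments that edgeR may apply, the helper name squeeze_toy is hypothetical, and the raw and trend values are invented.

```r
# Hypothetical helper: weighted average of the trend and the raw estimate.
squeeze_toy <- function(raw, trend, df.residual, df.prior) {
    (df.prior * trend + df.residual * raw) / (df.prior + df.residual)
}

toy.raw <- c(0.2, 1, 3)   # invented raw QL dispersions for three windows
toy.trend <- 1            # invented trend value at their abundance
toy.df.res <- 2           # nominal residual d.f.: 4 samples minus 2 coefficients

# Largest and smallest prior d.f. from the summary above.
strong <- squeeze_toy(toy.raw, toy.trend, toy.df.res, df.prior=15.4949)
weak <- squeeze_toy(toy.raw, toy.trend, toy.df.res, df.prior=0.2238)
round(rbind(raw=toy.raw, strong=strong, weak=weak), 3)
```

With the large prior d.f., the estimates are pulled most of the way to the trend. With the small prior d.f., as assigned to outlier windows, they stay close to their raw values.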

Also note the use of robust=TRUE in the glmQLFit() call, which reduces the sensitivity of the EB procedures to outlier variances. This is particularly noticeable in Figure 7 with highly variable windows that (correctly) do not get squeezed towards the trend.

7.4 Examining the data with MDS plots

Multi-dimensional scaling (MDS) plots are used to examine the similarities between samples. The distance between a pair of samples on this plot represents the overall log-fold change between those samples. Ideally, replicates should cluster together while samples from different conditions should be separate. While the mature B replicates are less tightly grouped, samples still separate by cell type in Figure 8. This suggests that our downstream analysis will be able to detect significant differences in enrichment between cell types.

plotMDS(norm.adjc, labels=celltype,
    col=c("red", "blue")[as.integer(celltype)])

Figure 8: MDS plot with two dimensions for all samples in the H3K9ac data set. Samples are labelled and coloured according to the cell type.
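The distance used in such a plot can be sketched in base R. By default, plotMDS() defines the distance between two samples as the root-mean-square of the largest log-fold changes between them (the top 500 features, chosen separately for each pair). The code below mimics this on simulated log-CPM values and should be read as an approximation of the idea only; the data, sample names, and toy.* objects are all made up.

```r
set.seed(1)
n.win <- 1000
toy.lcpm <- matrix(rnorm(n.win * 4, mean=5, sd=0.3), ncol=4,
    dimnames=list(NULL, c("proB.1", "proB.2", "matureB.1", "matureB.2")))
# Simulate DB in the first 200 windows for the second cell type.
toy.lcpm[1:200, 3:4] <- toy.lcpm[1:200, 3:4] + 2

top <- 500
n.samp <- ncol(toy.lcpm)
toy.dist <- matrix(0, n.samp, n.samp,
    dimnames=list(colnames(toy.lcpm), colnames(toy.lcpm)))
for (i in seq_len(n.samp - 1)) {
    for (j in (i + 1):n.samp) {
        # Root-mean-square of the 'top' largest squared log-fold changes.
        sq <- sort((toy.lcpm[,i] - toy.lcpm[,j])^2, decreasing=TRUE)
        toy.dist[i,j] <- toy.dist[j,i] <- sqrt(mean(sq[seq_len(top)]))
    }
}

# Classical MDS on the pairwise distances.
toy.coords <- cmdscale(as.dist(toy.dist), k=2)
round(toy.dist, 2)
```

Replicates end up closer to each other than to samples of the other cell type, which is the pattern to look for in the real plot.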

8 Testing for DB and controlling the FDR

8.1 Testing for DB with QL F-tests

Each window is tested for significant differences between cell types using the QL F-test (Lund et al. 2012). This is superior to the likelihood ratio test that is typically used for GLMs, as the QL F-test accounts for the uncertainty in dispersion estimation. One \(p\)-value is produced for each window, representing the evidence against the null hypothesis (i.e., that no DB is present in the window). For this analysis, the comparison is parametrized such that the reported log-fold change for each window represents that of the coverage in pro-B cells over their mature B counterparts.

contrast <- makeContrasts(proB-matureB, levels=design)
res <- glmQLFTest(fit, contrast=contrast)
head(res$table)

##       logFC    logCPM        F     PValue
## 1 1.3657940 0.3096885 2.305514 0.14677966
## 2 1.3564260 0.2624458 2.304629 0.14685257
## 3 2.0015383 0.2502879 3.940307 0.06304923
## 4 2.0779767 0.5033489 5.576372 0.03003408
## 5 0.8842093 0.8051419 1.651922 0.21545238
## 6 0.9678112 0.8948537 2.054733 0.16936408

8.2 Controlling the FDR across regions

One might attempt to control the FDR by applying the Benjamini-Hochberg (BH) method to the window-level \(p\)-values (Benjamini and Hochberg 1995). However, the features of interest are not windows, but the genomic regions that they represent. Control of the FDR across windows does not guarantee control of the FDR across regions (Lun and Smyth 2014). The latter is arguably more relevant for the final interpretation of the results.

We instead control the region-level FDR by aggregating windows into regions and combining the \(p\)-values. Here, adjacent windows less than 100 bp apart are aggregated into clusters. Each cluster represents a genomic region. Smaller values of tol allow distinct marking events to be kept separate, while larger values provide a broader perspective, e.g., by considering adjacent co-regulated sites as a single entity. Chaining effects are mitigated by setting a maximum cluster width of 5 kbp.

merged <- mergeResults(filtered.data, res$table, tol=100, 
    merge.args=list(max.width=5000))
merged$regions

## GRanges object with 41616 ranges and 0 metadata columns:
##           seqnames            ranges strand
##              <Rle>         <IRanges>  <Rle>
##       [1]     chr1   4775451-4775750      *
##       [2]     chr1   4785001-4786300      *
##       [3]     chr1   4807251-4807750      *
##       [4]     chr1   4808001-4808600      *
##       [5]     chr1   4857051-4858950      *
##       ...      ...               ...    ...
##   [41612]     chrY 73038001-73038400      *
##   [41613]     chrY 75445801-75446200      *
##   [41614]     chrY 88935951-88936350      *
##   [41615]     chrY 90554201-90554400      *
##   [41616]     chrY 90812801-90813100      *
##   -------
##   seqinfo: 21 sequences from an unspecified genome

A combined \(p\)-value is computed for each cluster using the method of Simes (1986), based on the \(p\)-values of the constituent windows. This represents the evidence against the global null hypothesis for each cluster, i.e., that no DB exists in any of its windows. Rejection of this global null indicates that the cluster (and the region that it represents) contains DB. Applying the BH method to the combined \(p\)-values allows the region-level FDR to be controlled.

tabcom <- merged$combined
tabcom

## DataFrame with 41616 rows and 6 columns
##        nWindows  logFC.up logFC.down              PValue                FDR
##       <integer> <integer>  <integer>           <numeric>          <numeric>
## 1             3         3          0   0.146852572988762   0.24642809183469
## 2            24         9          0  0.0882966655233521  0.168735548166406
## 3             8         1          3   0.526424531180179  0.648041273430584
## 4            10         1          2   0.729650939287535  0.829875744449031
## 5            36        14          6  0.0208882037608781 0.0605414172269767
## ...         ...       ...        ...                 ...                ...
## 41612         6         0          6  0.0058750474593641 0.0265929927201323
## 41613         6         0          6  0.0386801713741882 0.0930954838880468
## 41614         6         0          6  0.0208215544419839 0.0604134288264382
## 41615         2         0          2  0.0334464643173884 0.0836784933889885
## 41616         4         0          4 0.00147494061744926 0.0114325067490721
##         direction
##       <character>
## 1              up
## 2              up
## 3           mixed
## 4           mixed
## 5              up
## ...           ...
## 41612        down
## 41613        down
## 41614        down
## 41615        down
## 41616        down

Each row of the above table contains the statistics for a single cluster, including the combined \(p\)-value before and after the BH correction. Additional fields include nWindows, the total number of windows in the cluster; logFC.up, the number of windows with a log-fold change above 0.5; and logFC.down, the number of windows with a log-fold change below -0.5.
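Simes' method is simple enough to write out directly. The sketch below is an unweighted base R version, not the csaw implementation, and the helper name simes_toy is hypothetical. It is applied to the \(p\)-values of the first three windows shown in the output of head(res$table), which make up the first cluster (nWindows is 3 for that row).

```r
# Hypothetical helper: unweighted Simes combined p-value.
# Sort the p-values, scale the i-th smallest by n/i, and take the minimum.
simes_toy <- function(p) {
    n <- length(p)
    min(n * sort(p) / seq_len(n))
}

# Window-level p-values of the three windows in the first cluster.
p.win <- c(0.14677966, 0.14685257, 0.06304923)
toy.combined <- simes_toy(p.win)
toy.combined
```

The result agrees, to the printed precision, with the combined PValue of 0.1468526 reported for the first cluster. Note that the combined \(p\)-value here is larger than the smallest window-level \(p\)-value, as a single moderately small value among three is not strong evidence against the global null.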

8.3 Examining the scope and direction of DB

We determine the total number of DB regions at an FDR of 5% by applying the Benjamini-Hochberg method to the combined \(p\)-values.

is.sig <- tabcom$FDR <= 0.05
summary(is.sig)

##    Mode   FALSE    TRUE 
## logical   28515   13101
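The BH adjustment itself is available in base R through p.adjust(). The following sketch applies it to five invented combined \(p\)-values to show how the threshold on the adjusted values is used; the values and toy.* names are hypothetical.

```r
toy.p <- c(0.001, 0.008, 0.02, 0.03, 0.6)   # invented combined p-values
toy.fdr <- p.adjust(toy.p, method="BH")     # BH-adjusted p-values
toy.sig <- toy.fdr <= 0.05                  # regions called DB at an FDR of 5%
data.frame(PValue=toy.p, FDR=toy.fdr, significant=toy.sig)
```

Each adjusted value is the smallest FDR threshold at which that region would be called significant, so thresholding the FDR column at 0.05 controls the expected proportion of false discoveries among the reported regions at 5%.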

Determining the direction of DB is more complicated, as clusters may contain windows that are changing in opposite directions. One approach is to use the direction of DB from the windows that contribute most to the combined \(p\)-value, as reported in the direction field for each cluster. If significance is driven by windows changing in both directions, the direction for the cluster is defined as "mixed". Otherwise, the reported direction is the same as that of the windows, i.e., "up" or "down".

table(tabcom$direction[is.sig])

## 
##  down mixed    up 
##  8580   154  4367

Another approach is to use the log-fold change of the most significant window as a proxy for the log-fold change of the cluster.

tabbest <- merged$best
tabbest

## DataFrame with 41616 rows and 6 columns
##            best              logFC            logCPM                F
##       <integer>          <numeric>         <numeric>        <numeric>
## 1             3   2.00153829256075 0.250287926080891  3.9403070089186
## 2            15   6.45488225628961 0.712521465636628 11.9826122454754
## 3            35    1.1783996686851 0.727376262089356 2.51421074291099
## 4            43 -0.908825402006814   1.0234078969562 2.74637354889958
## 5            60   6.57273805081489 0.809667875887879 14.9826446406268
## ...         ...                ...               ...              ...
## 41612    689064  -6.96911276199788  1.40799011361021 22.8173045433537
## 41613    689070  -5.44350288147529 0.481214860211134 7.93926118453806
## 41614    689076    -6.683218886502  1.21098784780083 15.4258356080577
## 41615    689082  -5.00403851736606 0.313859418771516 7.30295013039867
## 41616    689086  -4.08130627831824 0.513547709822732 17.0075267948004
##                    PValue                FDR
##                 <numeric>          <numeric>
## 1       0.189147676521167  0.335560137526852
## 2      0.0882966655233521  0.190582668484939
## 3                       1                  1
## 4                       1                  1
## 5      0.0464345919412199  0.121535973473321
## ...                   ...                ...
## 41612  0.0176251423780923 0.0634640306584145
## 41613   0.195680102159668  0.344562772021383
## 41614  0.0614117484768238   0.14635845404945
## 41615  0.0668929286347768  0.154992267583368
## 41616 0.00421596727709738 0.0261244333239554

In the table above, the best column is the index of the window that is the most significant in each cluster, while the logFC field is the log-fold change of that window. We use this to obtain a summary of the direction of DB across all clusters.

is.sig.pos <- (tabbest$logFC > 0)[is.sig]
summary(is.sig.pos)

##    Mode   FALSE    TRUE 
## logical    8664    4437
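
To make the role of the best column concrete, the following base R sketch picks the most significant window in each of two made-up clusters and takes its log-fold change as the cluster-level proxy. The data frame and its values are hypothetical; in the workflow, this information comes from merged$best rather than being computed by hand.

```r
# Made-up window-level statistics for two clusters.
windows <- data.frame(
    cluster=c(1, 1, 1, 2, 2),
    logFC=c(0.5, 2.1, -0.3, -1.8, -0.2),
    PValue=c(0.20, 0.001, 0.60, 0.01, 0.70)
)

# For each cluster, find the row index of the window with the lowest p-value.
by.cluster <- split(seq_len(nrow(windows)), windows$cluster)
best <- sapply(by.cluster, function(idx) idx[which.min(windows$PValue[idx])])

# Use the log-fold change of that window as a proxy for the whole cluster.
best.logFC <- windows$logFC[best]
table(best.logFC > 0)
```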

This approach is generally satisfactory, though it will not capture multiple changes in opposite directions (try mixedClusters() to formally detect clusters that contain significant changes in both directions). It also tends to overstate the magnitude of the log-fold change in each cluster.

9 Saving results to file

One approach to saving results is to store all statistics in the metadata of a GRanges object. This is useful as it keeps the statistics and coordinates together for each cluster, avoiding problems with synchronization in downstream steps. We also store the midpoint and log-fold change of the most significant window in each cluster. The updated GRanges object is then saved to file as a serialized R object with the saveRDS function.

out.ranges <- merged$regions
mcols(out.ranges) <- DataFrame(tabcom,
    best.pos=mid(ranges(rowRanges(filtered.data[tabbest$best]))),
    best.logFC=tabbest$logFC)
saveRDS(file="h3k9ac_results.rds", out.ranges)

For input into other programs like genome browsers, results need to be saved in a more conventional format. Here, coordinates of DB regions are saved in BED format via rtracklayer, using the log-transformed FDR as the score.

simplified <- out.ranges[is.sig]
simplified$score <- -10*log10(simplified$FDR)
export(con="h3k9ac_results.bed", object=simplified)
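
The score transformation is a phred-like scaling, so smaller FDRs give larger scores. The sketch below uses made-up FDR values to show the behaviour. The cap at 1000 is an optional extra, added on the assumption that the downstream tool expects BED scores in the 0-1000 range described in the UCSC format documentation; it is not part of the code above.

```r
# Made-up FDR values; the last one is deliberately extreme.
fdr.toy <- c(0.05, 1e-3, 1e-6, 1e-120)

# Same transformation as used for the BED score.
score <- -10 * log10(fdr.toy)

# Optional: cap scores for tools that enforce a maximum of 1000 (assumption).
score.capped <- pmin(score, 1000)
score.capped
```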

10 Interpreting the DB results

10.1 Adding gene-centric annotation

10.1.1 Using the `detailRanges` function

csaw provides its own annotation function, detailRanges(). This identifies all genic features overlapping each region and reports them in a compact string form. Briefly, features are reported as SYMBOL:STRAND:TYPE where SYMBOL represents the gene symbol; STRAND reports the strand of the gene; and TYPE reports the type(s) of overlapped feature, e.g., E for exons, P for promoters, I for introns (introns are only reported if an exon is not overlapped).

library(org.Mm.eg.db)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
anno <- detailRanges(out.ranges, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
head(anno$overlap)

## [1] "Mrpl15:-:E"            "Mrpl15:-:PE"           "Lypla1:+:P"           
## [4] "Lypla1:+:PE"           "Lypla1:+:I,Tcea1:+:PE" "Rgs20:-:I"

Annotated features that flank the region of interest are also reported. The description for each feature is formatted as described above but the TYPE instead represents the distance (in base pairs) between the closest exon of the gene and the region. By default, only flanking features within 5 kbp of each region are considered.

head(anno$left)

## [1] "Mrpl15:-:935" "Mrpl15:-:896" ""             "Lypla1:+:19" 
## [5] ""             ""

head(anno$right)

## [1] "Mrpl15:-:627" ""             "Lypla1:+:38"  ""            
## [5] ""             ""

The annotation for each region is stored in the metadata of the GRanges object. The compact string form is useful for human interpretation, as it allows rapid examination of all genic features neighbouring each region.

meta <- mcols(out.ranges)
mcols(out.ranges) <- data.frame(meta, anno)
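
For programmatic use, the compact strings can be split back into their components. The helper below is a hypothetical base R sketch, not part of csaw; it assumes that gene symbols contain neither commas nor colons, and that the input is a character string (the stored columns may be factors, in which case as.character() would be needed first).

```r
# Hypothetical helper: split one "SYMBOL:STRAND:TYPE,..." string into a table.
parse_features <- function(x) {
    # An empty string means that no feature was reported.
    if (!nzchar(x)) {
        return(data.frame(symbol=character(0), strand=character(0),
            type=character(0), stringsAsFactors=FALSE))
    }
    entries <- strsplit(x, ",", fixed=TRUE)[[1]]   # one entry per feature
    fields <- strsplit(entries, ":", fixed=TRUE)    # three fields per entry
    data.frame(
        symbol=vapply(fields, function(f) f[1], character(1)),
        strand=vapply(fields, function(f) f[2], character(1)),
        type=vapply(fields, function(f) f[3], character(1)),
        stringsAsFactors=FALSE)
}

parsed <- parse_features("Lypla1:+:I,Tcea1:+:PE")
parsed
```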

10.1.2 Using the ChIPpeakAnno package

As its name suggests, the ChIPpeakAnno package is designed to annotate peaks from ChIP-seq experiments (Zhu et al. 2010). A GRanges object containing all regions of interest is supplied to the relevant function after removing all previous metadata fields to reduce clutter. The gene closest to each region is then reported. Gene coordinates are taken from the NCBI mouse 38 annotation, which is roughly equivalent to the annotation in the mm10 genome build.

library(ChIPpeakAnno)
data(TSS.mouse.GRCm38)
minimal <- out.ranges
elementMetadata(minimal) <- NULL
anno.regions <- annotatePeakInBatch(minimal, AnnotationData=TSS.mouse.GRCm38)
colnames(elementMetadata(anno.regions))

## [1] "peak"                     "feature"                 
## [3] "start_position"           "end_position"            
## [5] "feature_strand"           "insideFeature"           
## [7] "distancetoFeature"        "shortestDistance"        
## [9] "fromOverlappingOrNearest"

Alternatively, identification of all overlapping features within, say, 5 kbp can be achieved by setting maxgap=5000 and output="overlapping" in annotatePeakInBatch. This will report each overlapping feature in a separate entry of the returned GRanges object, i.e., each input region may have multiple output values. In contrast, detailRanges() will report all overlapping features for a region as a single string, i.e., each input region has one output value. Which is preferable depends on the purpose of the annotation – the detailRanges() output is more convenient for direct annotation of a DB list, while the annotatePeakInBatch() output contains more information and is more convenient for further manipulation.

10.2 Reporting gene-based results

Another approach to annotation is to flip the problem around such that DB statistics are reported directly for features of interest like genes. This is more convenient when the DB analysis needs to be integrated with, e.g., differential expression analyses of matched RNA-seq data. In the code below, promoter coordinates and gene symbols are obtained from various annotation objects.

prom <- suppressWarnings(promoters(TxDb.Mmusculus.UCSC.mm10.knownGene,
    upstream=3000, downstream=1000, columns=c("tx_name", "gene_id")))
entrez.ids <- sapply(prom$gene_id, FUN=function(x) x[1]) # Using the first Entrez ID.
gene.name <- select(org.Mm.eg.db, keys=entrez.ids, keytype="ENTREZID", columns="SYMBOL")
prom$gene_name <- gene.name$SYMBOL[match(entrez.ids, gene.name$ENTREZID)]
head(prom)

## GRanges object with 6 ranges and 3 metadata columns:
##                        seqnames          ranges strand |
##                           <Rle>       <IRanges>  <Rle> |
##   ENSMUST00000193812.1     chr1 3070253-3074252      + |
##   ENSMUST00000082908.1     chr1 3099016-3103015      + |
##   ENSMUST00000192857.1     chr1 3249757-3253756      + |
##   ENSMUST00000161581.1     chr1 3463587-3467586      + |
##   ENSMUST00000192183.1     chr1 3528795-3532794      + |
##   ENSMUST00000193244.1     chr1 3677155-3681154      + |
##                                     tx_name         gene_id   gene_name
##                                 <character> <CharacterList> <character>
##   ENSMUST00000193812.1 ENSMUST00000193812.1            <NA>        <NA>
##   ENSMUST00000082908.1 ENSMUST00000082908.1            <NA>        <NA>
##   ENSMUST00000192857.1 ENSMUST00000192857.1            <NA>        <NA>
##   ENSMUST00000161581.1 ENSMUST00000161581.1            <NA>        <NA>
##   ENSMUST00000192183.1 ENSMUST00000192183.1            <NA>        <NA>
##   ENSMUST00000193244.1 ENSMUST00000193244.1            <NA>        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
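
The sketch below spells out the strand-aware interval arithmetic that, to our understanding, promoters() applies around each transcription start site (TSS). The helper promoter_interval() is hypothetical, and the TSS of 3073253 is inferred from the first row of the output above by adding the 3000 bp upstream distance to the reported start.

```r
# Hypothetical helper: promoter interval around a TSS, respecting strand.
promoter_interval <- function(tss, strand, upstream=3000, downstream=1000) {
    if (strand == "-") {
        # On the reverse strand, "upstream" lies at higher coordinates.
        c(start=tss - downstream + 1, end=tss + upstream)
    } else {
        c(start=tss - upstream, end=tss + downstream - 1)
    }
}

plus <- promoter_interval(3073253, "+")
minus <- promoter_interval(3073253, "-")
plus
minus
```

Both intervals span upstream + downstream = 4000 bp, matching the widths of the ranges shown above.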

All windows overlapping each promoter are defined as a cluster. DB statistics are computed for each cluster/promoter using Simes’ method, which directly yields DB results for the annotated features. Promoters with no overlapping windows are assigned NA values for the various fields and are filtered out below for demonstration purposes.

olap.out <- overlapResults(filtered.data, regions=prom, res$table)
olap.out

## DataFrame with 142446 rows and 3 columns
##                             regions     combined         best
##                           <GRanges>  <DataFrame>  <DataFrame>
## 1            chr1:3070253-3074252:+ NA:NA:NA:... NA:NA:NA:...
## 2            chr1:3099016-3103015:+ NA:NA:NA:... NA:NA:NA:...
## 3            chr1:3249757-3253756:+ NA:NA:NA:... NA:NA:NA:...
## 4            chr1:3463587-3467586:+ NA:NA:NA:... NA:NA:NA:...
## 5            chr1:3528795-3532794:+ NA:NA:NA:... NA:NA:NA:...
## ...                             ...          ...          ...
## 142442 chrUn_GL456381:15722-19721:- NA:NA:NA:... NA:NA:NA:...
## 142443 chrUn_GL456385:28243-32242:+ NA:NA:NA:... NA:NA:NA:...
## 142444 chrUn_GL456385:29719-33718:+ NA:NA:NA:... NA:NA:NA:...
## 142445 chrUn_JH584304:58668-62667:- NA:NA:NA:... NA:NA:NA:...
## 142446 chrUn_JH584304:58691-62690:- NA:NA:NA:... NA:NA:NA:...

simple <- DataFrame(ID=prom$tx_name, Gene=prom$gene_name, olap.out$combined)
simple[!is.na(simple$PValue),]

## DataFrame with 57380 rows and 8 columns
##                          ID        Gene  nWindows  logFC.up logFC.down
##                 <character> <character> <integer> <integer>  <integer>
## 1      ENSMUST00000134384.7      Lypla1        18         2          5
## 2     ENSMUST00000027036.10      Lypla1        18         2          5
## 3      ENSMUST00000150971.7      Lypla1        18         2          5
## 4      ENSMUST00000155020.1      Lypla1        18         2          5
## 5      ENSMUST00000119612.8      Lypla1        18         2          5
## ...                     ...         ...       ...       ...        ...
## 57376  ENSMUST00000150715.1         Uty        18         0         14
## 57377  ENSMUST00000154527.1         Uty        18         0         14
## 57378 ENSMUST00000091190.11       Ddx3y        17         0         17
## 57379  ENSMUST00000188484.1       Ddx3y        17         0         17
## 57380  ENSMUST00000187962.1          NA         3         0          3
##                     PValue                  FDR   direction
##                  <numeric>            <numeric> <character>
## 1        0.700464901716033    0.739134871831733       mixed
## 2        0.700464901716033    0.739134871831733       mixed
## 3        0.700464901716033    0.739134871831733       mixed
## 4        0.700464901716033    0.739134871831733       mixed
## 5        0.700464901716033    0.739134871831733       mixed
## ...                    ...                  ...         ...
## 57376 6.45129747826827e-06 0.000324146628111237        down
## 57377 6.45129747826827e-06 0.000324146628111237        down
## 57378 6.82321075683763e-05  0.00142185516847767        down
## 57379 6.82321075683763e-05  0.00142185516847767        down
## 57380  0.00293752372968205   0.0138024166073662        down
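
For intuition, the combination of window-level \(p\)-values within a promoter can be sketched in base R. The function below is a minimal, unweighted version of Simes' method (Simes 1986) applied to made-up values; the name simes() is hypothetical, and the sketch ignores any window weighting or NA handling that csaw may apply.

```r
# Unweighted Simes combined p-value: min over ranks of n * p_(i) / i.
simes <- function(p) {
    n <- length(p)
    min(sort(p) * n / seq_len(n))
}

# Made-up p-values for three windows overlapping one promoter.
p.windows <- c(0.01, 0.04, 0.5)
p.promoter <- simes(p.windows)
p.promoter
```

A single strongly significant window is enough to give a low combined \(p\)-value, which is why sharp events inside a promoter are not diluted by the surrounding non-DB windows.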

Note that this strategy is distinct from counting reads across promoters. Using promoter-level counts would not provide enough spatial resolution to detect sharp binding events that only occur in a subinterval of the promoter. In particular, detection may be compromised by non-specific background or the presence of multiple opposing DB events in the same promoter. Combining window-level statistics is preferable as resolution is maintained for optimal performance.
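
A toy calculation illustrates the point about opposing DB events. The counts below are made up: four windows tile one promoter, with marking lost in the first half and gained in the second half.

```r
# Made-up window counts across one promoter in two conditions.
counts.a <- c(50, 50, 10, 10)
counts.b <- c(10, 10, 50, 50)

# Window-level log-fold changes retain both events.
window.logFC <- log2(counts.a / counts.b)

# Summing counts over the whole promoter cancels them out.
promoter.logFC <- log2(sum(counts.a) / sum(counts.b))

window.logFC
promoter.logFC
```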

11 Visualizing DB results

11.1 Overview

Here, the Gviz package is used to visualize read coverage across the data set at regions of interest (Hahne and Ivanek 2016). Coverage in each BAM file will be represented by a single track. Several additional tracks will also be included in each plot. One is the genome axis track, to display the genomic coordinates across the plotted region. The other is the annotation track containing gene models, with gene IDs replaced by symbols (where possible) for easier reading.

library(Gviz)
gax <- GenomeAxisTrack(col="black", fontsize=15, size=2)
greg <- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE,
    geneSymbol=TRUE, name="", background.title="transparent")
symbols <- unlist(mapIds(org.Mm.eg.db, gene(greg), "SYMBOL",
    "ENTREZID", multiVals = "first"))
symbol(greg) <- symbols[gene(greg)]

We will also sort the DB regions by \(p\)-value for easier identification of regions of interest.

o <- order(out.ranges$PValue)
sorted.ranges <- out.ranges[o]
sorted.ranges

## GRanges object with 41616 ranges and 11 metadata columns:
##           seqnames              ranges strand |  nWindows  logFC.up
##              <Rle>           <IRanges>  <Rle> | <integer> <integer>
##       [1]    chr17   34285101-34290050      * |        97         0
##       [2]     chr9 109050201-109053150      * |        57         0
##       [3]    chr17   34261151-34265850      * |        92        11
##       [4]    chr17   34306001-34308650      * |        51         0
##       [5]    chr18   60802751-60805750      * |        55         0
##       ...      ...                 ...    ... .       ...       ...
##   [41612]    chr18   23751901-23753200      * |        22         0
##   [41613]    chr12   83922051-83922650      * |        10         2
##   [41614]    chr15   99395101-99395650      * |         8         0
##   [41615]     chr3   67504201-67504500      * |         4         0
##   [41616]     chr4   43043401-43043700      * |         4         0
##           logFC.down               PValue                  FDR   direction
##            <integer>            <numeric>            <numeric> <character>
##       [1]         97 4.04797773668499e-11 1.22569557219491e-06        down
##       [2]         57 7.13783570863201e-11 1.22569557219491e-06        down
##       [3]         78 8.83575239471534e-11 1.22569557219491e-06        down
##       [4]         51 1.23282008545495e-10 1.28262601690733e-06        down
##       [5]         55 2.06286478303051e-10 1.54430072588262e-06        down
##       ...        ...                  ...                  ...         ...
##   [41612]          2    0.999832725328934    0.999908062838253       mixed
##   [41613]          0      0.9998854212888    0.999908062838253       mixed
##   [41614]          0    0.999908062838253    0.999908062838253       mixed
##   [41615]          0    0.999908062838253    0.999908062838253          up
##   [41616]          0    0.999908062838253    0.999908062838253       mixed
##            best.pos         best.logFC                  overlap
##           <integer>          <numeric>                 <factor>
##       [1]  34287575  -7.18686332236642               H2-Aa:-:PE
##       [2] 109051575  -6.19603369122881              Shisa5:+:PE
##       [3]  34262025  -7.70114852451639              H2-Ab1:+:PE
##       [4]  34306075  -5.80798257689994              H2-Eb1:+:PE
##       [5]  60804525  -5.98346376178937                Cd74:+:PE
##       ...       ...                ...                      ...
##   [41612]  23752525  -0.77704982053454 Gm15972:-:PE,Mapre2:+:PE
##   [41613]  83922125  0.880874875114293                 Numb:-:P
##   [41614]  99395425 -0.411300240851034               Tmbim6:+:I
##   [41615]  67504275  0.491618329274888              Rarres1:-:I
##   [41616]  43043575   0.17425353617959              Fam214b:-:I
##                     left                                         right
##                 <factor>                                      <factor>
##       [1]    H2-Aa:-:565                                              
##       [2]                Atrip-trex1:-:4783,Trex1:-:4788,Shisa5:+:1713
##       [3]  H2-Ab1:+:3314                                 H2-Ab1:+:1252
##       [4]                                                 H2-Eb1:+:925
##       [5]                                                  Cd74:+:2158
##       ...            ...                                           ...
##   [41612]   Gm15972:-:78                                  Mapre2:+:525
##   [41613]     Numb:-:117                                              
##   [41614]  Tmbim6:+:1371                                 Tmbim6:+:4007
##   [41615]                                                             
##   [41616] Fam214b:-:3106                                Fam214b:-:1948
##   -------
##   seqinfo: 21 sequences from an unspecified genome

11.2 Simple DB across a broad region

We start by visualizing one of the top-ranking DB regions. This represents a simple DB event where the entire region changes in one direction (Figure 9). Specifically, it represents an increase in H3K9ac marking at the H2-Aa locus in mature B cells. This is consistent with the expected biology – H3K9ac is a mark of active gene expression (Karmodiya et al. 2012) and MHCII components are upregulated in mature B cells (Hoffmann et al. 2002).

cur.region <- sorted.ranges[1]
cur.region

## GRanges object with 1 range and 11 metadata columns:
##       seqnames            ranges strand |  nWindows  logFC.up logFC.down
##          <Rle>         <IRanges>  <Rle> | <integer> <integer>  <integer>
##   [1]    chr17 34285101-34290050      * |        97         0         97
##                     PValue                  FDR   direction  best.pos
##                  <numeric>            <numeric> <character> <integer>
##   [1] 4.04797773668499e-11 1.22569557219491e-06        down  34287575
##              best.logFC    overlap        left    right
##               <numeric>   <factor>    <factor> <factor>
##   [1] -7.18686332236642 H2-Aa:-:PE H2-Aa:-:565         
##   -------
##   seqinfo: 21 sequences from an unspecified genome

One track is plotted for each sample, in addition to the coordinate and annotation tracks. Coverage is plotted in terms of sequencing depth-per-million at each base. This corrects for differences in library sizes between tracks.

collected <- list()
lib.sizes <- filtered.data$totals/1e6
for (i in seq_along(acdata$Path)) {
    reads <- extractReads(bam.file=acdata$Path[i], cur.region, param=param)
    cov <- as(coverage(reads)/lib.sizes[i], "GRanges")
    collected[[i]] <- DataTrack(cov, type="histogram", lwd=0, ylim=c(0,10),
        name=acdata$Description[i], col.axis="black", col.title="black",
        fill="darkgray", col.histogram=NA)
}
plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))

Figure 9: Coverage tracks for a simple DB event between pro-B and mature B cells, across a broad region in the H3K9ac data set. Read coverage for each sample is shown as a per-million value at each base.
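
The per-million scaling used for these tracks can be sketched without any BAM files. The base R example below computes per-base coverage from three made-up read intervals and divides by a made-up library size in millions; all values and variable names are hypothetical.

```r
# Made-up read intervals (1-based, inclusive) within an 8 bp region.
read.start <- c(1, 3, 3)
read.end <- c(4, 6, 5)
positions <- 1:8

# Raw coverage: number of reads overlapping each base.
raw.cov <- vapply(positions, function(pos) {
    sum(read.start <= pos & read.end >= pos)
}, integer(1))

# Scale by library size in millions so that tracks are comparable.
lib.size <- 2e6
cov.per.million <- raw.cov / (lib.size / 1e6)
cov.per.million
```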

11.3 Complex DB across a broad region

Complex DB refers to situations where multiple DB events are occurring within the same enriched region. These are identified as those clusters that contain windows changing in both directions. (Technically, we should use mixedClusters() for rigorous identification of regions with significant changes in both directions; for simplicity, we just use a more ad hoc approach here.) Here, one of the top-ranking complex clusters is selected for visualization.

complex <- sorted.ranges$logFC.up > 0 & sorted.ranges$logFC.down > 0
cur.region <- sorted.ranges[complex][2]
cur.region

## GRanges object with 1 range and 11 metadata columns:
##       seqnames              ranges strand |  nWindows  logFC.up logFC.down
##          <Rle>           <IRanges>  <Rle> | <integer> <integer>  <integer>
##   [1]     chr5 122987201-122991450      * |        83        18         43
##                     PValue                  FDR   direction  best.pos
##                  <numeric>            <numeric> <character> <integer>
##   [1] 1.30976135102916e-08 1.33826057750574e-05        down 122990925
##              best.logFC                       overlap         left
##               <numeric>                      <factor>     <factor>
##   [1] -5.48534588563145 A930024E05Rik:+:PE,Kdm2b:-:PE Kdm2b:-:2230
##                      right
##                   <factor>
##   [1] A930024E05Rik:+:2913
##   -------
##   seqinfo: 21 sequences from an unspecified genome

This region contains a bidirectional promoter where different genes are marked in the different cell types (Figure 10). Upon differentiation to mature B cells, loss of marking in one part of the region is balanced by a gain in marking in another part of the region. This represents a complex DB event that would not be detected if reads were counted across the entire region.

collected <- list()
for (i in seq_along(acdata$Path)) {
    reads <- extractReads(bam.file=acdata$Path[i], cur.region, param=param)
    cov <- as(coverage(reads)/lib.sizes[i], "GRanges")
    collected[[i]] <- DataTrack(cov, type="histogram", lwd=0, ylim=c(0,3),
        name=acdata$Description[i], col.axis="black", col.title="black",
        fill="darkgray", col.histogram=NA)
}
plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))

Figure 10: Coverage tracks for a complex DB event in the H3K9ac data set, shown as per-million values.

11.4 Simple DB across a small region

Both of the examples above involve differential marking within broad regions spanning several kilobases. This is consistent with changes in the marking profile across a large number of nucleosomes. However, H3K9ac marking can also be concentrated into small regions, involving only a few nucleosomes. csaw is equally capable of detecting sharp DB within these small regions. This is demonstrated by examining those clusters that contain a smaller number of windows.

sharp <- sorted.ranges$nWindows < 20
cur.region <- sorted.ranges[sharp][1]
cur.region

## GRanges object with 1 range and 11 metadata columns:
##       seqnames            ranges strand |  nWindows  logFC.up logFC.down
##          <Rle>         <IRanges>  <Rle> | <integer> <integer>  <integer>
##   [1]    chr16 36665551-36666200      * |        11         0         11
##                     PValue                  FDR   direction  best.pos
##                  <numeric>            <numeric> <character> <integer>
##   [1] 1.29839663897595e-08 1.33826057750574e-05        down  36665925
##              best.logFC   overlap        left    right
##               <numeric>  <factor>    <factor> <factor>
##   [1] -4.93341819257933 Cd86:-:PE Cd86:-:3937         
##   -------
##   seqinfo: 21 sequences from an unspecified genome

Marking is increased for mature B cells within a 500 bp region (Figure 11), which is sharper than the changes in the previous two examples. This also coincides with the promoter of the Cd86 gene. Again, this makes biological sense as CD86 is involved in regulating immunoglobulin production in activated B-cells (Podojil and Sanders 2003).

collected <- list()
for (i in seq_along(acdata$Path)) {
    reads <- extractReads(bam.file=acdata$Path[i], cur.region, param=param)
    cov <- as(coverage(reads)/lib.sizes[i], "GRanges")
    collected[[i]] <- DataTrack(cov, type="histogram", lwd=0, ylim=c(0,3),
        name=acdata$Description[i], col.axis="black", col.title="black",
        fill="darkgray", col.histogram=NA)
}
plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))

Figure 11: Coverage tracks for a sharp and simple DB event in the H3K9ac data set, shown as per-million values.

Note that the window size will determine whether sharp or broad events are preferentially detected. Larger windows provide more power to detect broad events (as the counts are higher), while smaller windows provide more resolution to detect sharp events. Optimal detection of all features can be obtained by performing analyses with multiple window sizes and consolidating the results (see ?consolidateWindows and ?consolidateTests for further information), though – for brevity – this will not be described here. In general, smaller window sizes are preferred as strong DB events with sufficient coverage will always be detected. For larger windows, detection may be confounded by other events within the window that distort the log-fold change in the counts between conditions.
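
One way in which a larger window can distort the log-fold change is by mixing a sharp event with unchanged flanking signal. The toy numbers below are made up: a 1 kbp stretch is split into ten 100 bp bins, and only the first bin differs between two hypothetical conditions.

```r
# Made-up counts per 100 bp bin; only the first bin is differentially marked.
bins.a <- c(10, rep(10, 9))
bins.b <- c(50, rep(10, 9))

# A 100 bp window centred on the event sees the full change.
small.logFC <- log2(bins.b[1] / bins.a[1])

# A 1 kbp window averages the event with the unchanged bins.
large.logFC <- log2(sum(bins.b) / sum(bins.a))

c(small=small.logFC, large=large.logFC)
```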

12 Session information

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] ChIPpeakAnno_3.20.0                      
##  [2] VennDiagram_1.6.20                       
##  [3] futile.logger_1.4.3                      
##  [4] Gviz_1.30.0                              
##  [5] org.Mm.eg.db_3.10.0                      
##  [6] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
##  [7] GenomicFeatures_1.38.0                   
##  [8] AnnotationDbi_1.48.0                     
##  [9] edgeR_3.28.0                             
## [10] limma_3.42.0                             
## [11] csaw_1.20.0                              
## [12] SummarizedExperiment_1.16.0              
## [13] DelayedArray_0.12.0                      
## [14] BiocParallel_1.20.0                      
## [15] matrixStats_0.55.0                       
## [16] Biobase_2.46.0                           
## [17] rtracklayer_1.46.0                       
## [18] BiocFileCache_1.10.0                     
## [19] dbplyr_1.4.2                             
## [20] Rsamtools_2.2.0                          
## [21] Biostrings_2.54.0                        
## [22] XVector_0.26.0                           
## [23] GenomicRanges_1.38.0                     
## [24] GenomeInfoDb_1.22.0                      
## [25] IRanges_2.20.0                           
## [26] S4Vectors_0.24.0                         
## [27] BiocGenerics_0.32.0                      
## [28] chipseqDBData_1.1.0                      
## [29] knitr_1.25                               
## [30] BiocStyle_2.14.0                         
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.5               Hmisc_4.2-0                  
##   [3] AnnotationHub_2.18.0          lazyeval_0.2.2               
##   [5] splines_3.6.1                 ggplot2_3.2.1                
##   [7] digest_0.6.22                 ensembldb_2.10.0             
##   [9] htmltools_0.4.0               GO.db_3.10.0                 
##  [11] magrittr_1.5                  checkmate_1.9.4              
##  [13] memoise_1.1.0                 BSgenome_1.54.0              
##  [15] cluster_2.1.0                 askpass_1.1                  
##  [17] prettyunits_1.0.2             colorspace_1.4-1             
##  [19] blob_1.2.0                    rappdirs_0.3.1               
##  [21] xfun_0.10                     dplyr_0.8.3                  
##  [23] crayon_1.3.4                  RCurl_1.95-4.12              
##  [25] graph_1.64.0                  zeallot_0.1.0                
##  [27] survival_2.44-1.1             VariantAnnotation_1.32.0     
##  [29] glue_1.3.1                    gtable_0.3.0                 
##  [31] zlibbioc_1.32.0               seqinr_3.6-1                 
##  [33] scales_1.0.0                  futile.options_1.0.1         
##  [35] DBI_1.0.0                     Rcpp_1.0.2                   
##  [37] xtable_1.8-4                  progress_1.2.2               
##  [39] htmlTable_1.13.2              foreign_0.8-72               
##  [41] bit_1.1-14                    Formula_1.2-3                
##  [43] htmlwidgets_1.5.1             httr_1.4.1                   
##  [45] RColorBrewer_1.1-2            acepack_1.4.1                
##  [47] pkgconfig_2.0.3               XML_3.98-1.20                
##  [49] nnet_7.3-12                   locfit_1.5-9.1               
##  [51] tidyselect_0.2.5              rlang_0.4.1                  
##  [53] later_1.0.0                   munsell_0.5.0                
##  [55] BiocVersion_3.10.1            tools_3.6.1                  
##  [57] RSQLite_2.1.2                 ExperimentHub_1.12.0         
##  [59] ade4_1.7-13                   evaluate_0.14                
##  [61] stringr_1.4.0                 fastmap_1.0.1                
##  [63] yaml_2.2.0                    bit64_0.9-7                  
##  [65] purrr_0.3.3                   AnnotationFilter_1.10.0      
##  [67] RBGL_1.62.0                   mime_0.7                     
##  [69] formatR_1.7                   biomaRt_2.42.0               
##  [71] compiler_3.6.1                rstudioapi_0.10              
##  [73] curl_4.2                      interactiveDisplayBase_1.24.0
##  [75] tibble_2.1.3                  statmod_1.4.32               
##  [77] stringi_1.4.3                 idr_1.2                      
##  [79] highr_0.8                     lattice_0.20-38              
##  [81] ProtGenerics_1.18.0           Matrix_1.2-17                
##  [83] multtest_2.42.0               vctrs_0.2.0                  
##  [85] pillar_1.4.2                  BiocManager_1.30.9           
##  [87] data.table_1.12.6             bitops_1.0-6                 
##  [89] httpuv_1.5.2                  R6_2.4.0                     
##  [91] latticeExtra_0.6-28           bookdown_0.14                
##  [93] promises_1.1.0                KernSmooth_2.23-16           
##  [95] gridExtra_2.3                 codetools_0.2-16             
##  [97] lambda.r_1.2.4                dichromat_2.0-0              
##  [99] MASS_7.3-51.4                 assertthat_0.2.1             
## [101] openssl_1.4.1                 regioneR_1.18.0              
## [103] GenomicAlignments_1.22.0      GenomeInfoDbData_1.2.2       
## [105] hms_0.5.1                     rpart_4.1-15                 
## [107] rmarkdown_1.16                biovizBase_1.34.0            
## [109] shiny_1.4.0                   base64enc_0.1-3

References

Ballman, K. V., D. E. Grill, A. L. Oberg, and T. M. Therneau. 2004. “Faster cyclic loess: normalizing RNA arrays via linear models.” Bioinformatics 20 (16):2778–86.

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J. Royal Stat. Soc. B 57:289–300.

ENCODE Project Consortium. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature 489 (7414):57–74.

Hahne, F., and R. Ivanek. 2016. “Visualizing Genomic Data Using Gviz and Bioconductor.” In Statistical Genomics: Methods and Protocols, edited by Ewy Mathé and Sean Davis, 335–51. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_16.

Hoffmann, R., T. Seidl, M. Neeb, A. Rolink, and F. Melchers. 2002. “Changes in gene expression profiles in developing B cells of murine bone marrow.” Genome Res. 12 (1):98–111.

Humburg, P., C. A. Helliwell, D. Bulger, and G. Stone. 2011. “ChIPseqR: analysis of ChIP-seq experiments.” BMC Bioinformatics 12:39.

Karmodiya, K., A. R. Krebs, M. Oulad-Abdelghani, H. Kimura, and L. Tora. 2012. “H3K9 and H3K14 acetylation co-occur at many gene regulatory elements, while H3K14ac marks a subset of inactive inducible promoters in mouse embryonic stem cells.” BMC Genomics 13:424.

Kharchenko, P. V., M. Y. Tolstorukov, and P. J. Park. 2008. “Design and analysis of ChIP-seq experiments for DNA-binding proteins.” Nat. Biotechnol. 26 (12):1351–9.

Landt, S. G., G. K. Marinov, A. Kundaje, P. Kheradpour, F. Pauli, S. Batzoglou, B. E. Bernstein, et al. 2012. “ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.” Genome Res. 22 (9):1813–31.

Lun, A. T., and G. K. Smyth. 2014. “De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly.” Nucleic Acids Res. 42 (11):e95.

———. 2015. “csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows.” Nucleic Acids Res. https://doi.org/10.1093/nar/gkv1191.

Lund, S. P., D. Nettleton, D. J. McCarthy, and G. K. Smyth. 2012. “Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates.” Stat. Appl. Genet. Mol. Biol. 11 (5):Article 8.

McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Res. 40 (10):4288–97.

Podojil, J. R., and V. M. Sanders. 2003. “Selective regulation of mature IgG1 transcription by CD86 and beta 2-adrenergic receptor stimulation.” J. Immunol. 170 (10):5143–51.

Revilla-I-Domingo, R., I. Bilic, B. Vilagos, H. Tagoh, A. Ebert, I. M. Tamir, L. Smeenk, et al. 2012. “The B-cell identity factor Pax5 regulates distinct transcriptional programmes in early and late B lymphopoiesis.” EMBO J. 31 (14):3130–46.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1):139–40.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3):R25.

Rosenbloom, K. R., J. Armstrong, G. P. Barber, J. Casper, H. Clawson, M. Diekhans, T. R. Dreszer, et al. 2015. “The UCSC Genome Browser database: 2015 update.” Nucleic Acids Res. 43 (Database issue):D670–81.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3):751–54.

Zhu, L. J., C. Gazin, N. D. Lawson, H. Pages, S. M. Lin, D. S. Lapointe, and M. R. Green. 2010. “ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data.” BMC Bioinformatics 11:237.

Detecting differential enrichment of H3K9ac in murine B cells

2019-10-30

1 Overview

2 Pre-processing checks

2.1 Examining mapping statistics

2.2 Obtaining the ENCODE blacklist for mm10

2.3 Setting up the read extraction parameters

3 Computing the average fragment length

4 Counting reads into windows

5 Filtering windows by abundance

6 Normalizing for sample-specific trended biases

7 Statistical modelling of biological variability

7.1 Setting up the design matrix

7.2 Estimating the NB dispersion

7.3 Estimating the QL dispersion

7.4 Examining the data with MDS plots

8 Testing for DB and controlling the FDR

8.1 Testing for DB with QL F-tests

8.2 Controlling the FDR across regions

8.3 Examining the scope and direction of DB

9 Saving results to file

10 Interpreting the DB results

10.1 Adding gene-centric annotation

10.1.1 Using the `detailRanges` function

10.1.2 Using the ChIPpeakAnno package

10.2 Reporting gene-based results

11 Visualizing DB results

11.1 Overview

11.2 Simple DB across a broad region

11.3 Complex DB across a broad region

11.4 Simple DB across a small region

12 Session information

References
