1 Overview

Here, we perform a window-based DB analysis to identify regions of differential H3K27me3 enrichment in mouse lung epithelium. H3K27me3 is associated with transcriptional repression and is usually observed with broad regions of enrichment. The aim of this workflow is to demonstrate how to analyze these broad marks with csaw, especially at variable resolutions with multiple window sizes. We use H3K27me3 ChIP-seq data from a study comparing wild-type (WT) and Ezh2 knock-out (KO) animals (Galvis et al. 2015), contains two biological replicates for each genotype. We download BAM files and indices using chipseqDBData.

library(chipseqDBData)
h3k27me3data <- H3K27me3Data()
h3k27me3data
## DataFrame with 4 rows and 3 columns
##          Name                 Description
##   <character>                 <character>
## 1  SRR1274188        control H3K27me3 (1)
## 2  SRR1274189        control H3K27me3 (2)
## 3  SRR1274190 Ezh2 knock-out H3K27me3 (1)
## 4  SRR1274191 Ezh2 knock-out H3K27me3 (2)
##                                             Path
##                                      <character>
## 1 /tmp/RtmpYT6shT/file5dc4b70252c/SRR1274188.bam
## 2 /tmp/RtmpYT6shT/file5dc4b70252c/SRR1274189.bam
## 3 /tmp/RtmpYT6shT/file5dc4b70252c/SRR1274190.bam
## 4 /tmp/RtmpYT6shT/file5dc4b70252c/SRR1274191.bam

2 Pre-processing checks

We check some mapping statistics with Rsamtools.

library(Rsamtools)
diagnostics <- list()
for (bam in h3k27me3data$Path) {
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[basename(bam)]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
##                   Total   Mapped  Marked Prop.mapped Prop.marked
## SRR1274188.bam 24445704 18605240 2769679    76.10842    14.88655
## SRR1274189.bam 21978677 17014171 2069203    77.41217    12.16164
## SRR1274190.bam 26910067 18606352 4361393    69.14272    23.44034
## SRR1274191.bam 21354963 14092438 4392541    65.99140    31.16949

We construct a readParam object to standardize the parameter settings in this analysis. For consistency with the original analysis by Galvis et al. (2015), we will define the blacklist using the predicted repeats from the RepeatMasker software.

library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("http://hgdownload.cse.ucsc.edu",
    "goldenPath/mm10/bigZips/chromOut.tar.gz"))
tmpdir <- tempfile()
dir.create(tmpdir)
untar(black.path, exdir=tmpdir)

# Iterate through all chromosomes.
collected <- list()
for (x in list.files(tmpdir, full=TRUE)) {
    f <- list.files(x, full=TRUE, pattern=".fa.out")
    to.get <- vector("list", 15)
    to.get[[5]] <- "character"
    to.get[6:7] <- "integer"
    collected[[length(collected)+1]] <- read.table(f, skip=3, 
        stringsAsFactors=FALSE, colClasses=to.get)
}

collected <- do.call(rbind, collected)
blacklist <- GRanges(collected[,1], IRanges(collected[,2], collected[,3]))
blacklist
## GRanges object with 5147737 ranges and 0 metadata columns:
##                         seqnames          ranges strand
##                            <Rle>       <IRanges>  <Rle>
##         [1]                 chr1 3000001-3002128      *
##         [2]                 chr1 3003153-3003994      *
##         [3]                 chr1 3003994-3004054      *
##         [4]                 chr1 3004041-3004206      *
##         [5]                 chr1 3004207-3004270      *
##         ...                  ...             ...    ...
##   [5147733] chrY_JH584303_random   152557-155890      *
##   [5147734] chrY_JH584303_random   155891-156883      *
##   [5147735] chrY_JH584303_random   157070-157145      *
##   [5147736] chrY_JH584303_random   157909-157960      *
##   [5147737] chrY_JH584303_random   157953-158099      *
##   -------
##   seqinfo: 66 sequences from an unspecified genome; no seqlengths

We set the minimum mapping quality score to 10 to remove poorly or non-uniquely aligned reads.

library(csaw)
param <- readParam(minq=10, discard=blacklist)
param
##     Extracting reads in single-end mode
##     Duplicate removal is turned off 
##     Minimum allowed mapping score is 10 
##     Reads are extracted from both strands
##     No restrictions are placed on read extraction
##     Reads in 5147737 regions will be discarded

3 Counting reads into windows

Reads are then counted into sliding windows using csaw (Lun and Smyth 2015). At this stage, we use a large 2 kbp window to reflect the fact that H3K27me3 exhibits broad enrichment. This allows us to increase the size of the counts and thus detection power, without having to be concerned about loss of genomic resolution to detect sharp binding events.

win.data <- windowCounts(h3k27me3data$Path, param=param, width=2000,
    spacing=500, ext=200)
win.data
## class: RangedSummarizedExperiment 
## dim: 4614717 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

We use spacing=500 to avoid redundant work when sliding a large window across the genome. The default spacing of 50 bp would result in many windows with over 90% overlap in their positions, increasing the amount of computational work without a meaningful improvement in resolution. We also set the fragment length to 200 bp based on experimental knowledge of the size selection procedure. Unlike the previous analyses, the fragment length cannot easily estimated here due to weak strand bimodality of diffuse marks.

4 Normalization for composition biases

As in the CBP example, we normalize for composition biases resulting from imbalanced DB between conditions (Lun and Smyth 2014). We expect systematic DB in one direction as Ezh2 function (and thus some H3K27me3 deposition activity) is lost in the KO genotype. We apply the TMM method (Robinson and Oshlack 2010) to counts for large 10 kbp bins, and store the resulting normalization factors back in win.data for use in the DB analysis with the window counts.

bins <- windowCounts(h3k27me3data$Path, bin=TRUE, width=10000, param=param)
win.data <- normFactors(bins, se.out=win.data)
(normfacs <- win.data$norm.factors)
## [1] 0.9946875 0.9984751 1.0080895 0.9987965

Figure 1 shows the effect of normalization on the relative enrichment between pairs of samples. We see that log-ratio of normalization factors passes through the centre of the cloud of background regions in each plot, indicating that the bias has been successfully identified and removed.

bin.ab <- scaledAverage(bins)
adjc <- calculateCPM(bins, use.norm.factors=FALSE)

par(cex.lab=1.5, mfrow=c(1,3))
smoothScatter(bin.ab, adjc[,1]-adjc[,2], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 2)")
abline(h=log2(normfacs[1]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,1]-adjc[,3], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 3)")
abline(h=log2(normfacs[2]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 4)")
abline(h=log2(normfacs[3]/normfacs[4]), col="red")
Mean-difference plots for the bin counts, comparing sample 1 to all other samples. The red line represents the log-ratio of the normalization factors between samples.

Figure 1: Mean-difference plots for the bin counts, comparing sample 1 to all other samples. The red line represents the log-ratio of the normalization factors between samples.

5 Filtering of low-abundance windows

We estimate the global background and remove low-abundance windows that are not enriched above this background level. To retain a window, we require it to have at least 2-fold more coverage than the average background. This is less stringent than the thresholds used in previous analyses, owing the weaker enrichment observed for diffuse marks.

filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 2 

Figure 2 shows that chosen threshold is greater than the abundances of most bins in the genome, presumably those corresponding to background regions. This suggests that the filter will remove most windows lying within background regions.

hist(filter.stat$back.abundances, main="", breaks=50,
    xlab="Background abundance (log2-CPM)")
threshold <- filter.stat$abundances[1] - filter.stat$filter[1] + log2(min.fc)
abline(v=threshold, col="red")
Histogram of average abundances across all 10 kbp genomic bins. The filter threshold is shown as the red line.

Figure 2: Histogram of average abundances across all 10 kbp genomic bins. The filter threshold is shown as the red line.

We filter out the majority of windows in background regions upon applying a modest fold-change threshold. This leaves a small set of relevant windows for further analysis.

keep <- filter.stat$filter > log2(min.fc)
summary(keep)
##    Mode   FALSE    TRUE 
## logical 4555355   59362
filtered.data <- win.data[keep,]

6 Statistical modelling of biological variability

Counts for each window are modelled using edgeR (McCarthy, Chen, and Smyth 2012; Robinson, McCarthy, and Smyth 2010). We first convert our RangedSummarizedExperiment object into a DGEList.

library(edgeR)
y <- asDGEList(filtered.data)
str(y)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ : int [1:59362, 1:4] 17 19 32 33 30 31 31 36 33 29 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:59362] "1" "2" "3" "4" ...
##   .. .. .. ..$ : chr [1:4] "Sample1" "Sample2" "Sample3" "Sample4"
##   .. ..$ :'data.frame':  4 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 1 level "1": 1 1 1 1
##   .. .. ..$ lib.size    : int [1:4] 10602308 9971590 8605634 5659274
##   .. .. ..$ norm.factors: num [1:4] 0.995 0.998 1.008 0.999

We then construct a design matrix for our experimental design. Here, we use a simple one-way layout with two groups of two replicates.

genotype <- h3k27me3data$Description
genotype[grep("control", genotype)] <- "wt"
genotype[grep("knock-out", genotype)] <- "ko"

genotype <- factor(genotype)
design <- model.matrix(~0+genotype)
colnames(design) <- levels(genotype)
design
##   ko wt
## 1  0  1
## 2  0  1
## 3  1  0
## 4  1  0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"

We estimate the negative binomial (NB) and quasi-likelihood (QL) dispersions for each window (Lund et al. 2012). We again observe an increasing trend in the NB dispersions with respect to abundance (Figure 3).

y <- estimateDisp(y, design)
summary(y$trended.dispersion)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.008979 0.009484 0.010091 0.010970 0.010446 0.025661
plotBCV(y)