The following vignette describes the nullranges implementation of the block bootstrap with respect to a genomic segmentation. See the main nullranges vignette for an overview of the idea of bootstrapping, and there is additionally a vignette on block boostrapping without respect to segmentation.
As proposed by Bickel et al. (2010), nullranges contains an implementation of a block bootstrap, such that features are sampled from the genome in blocks. Blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states 1,2,…,S, blocks from state s will be used to tile the ranges of state s in each bootstrap sample.
The segmented block bootstrap has two options, either:
In this vignette, we give an example of segmenting the hg38 genome by Ensembl gene density, then we profile the time to generate a single block bootstrap sample. Finally, we use a toy dataset to visualize what a segmented block bootstrap sample looks like with respect to a genome segmentation. Future versions of this vignette will demonstrate the functions used within an overlap analysis. See also the unsegmented block bootstrap vignette in nullranges, if it is not desired to bootstrap with respect to a genome segmentation.
A finally consideration is whether the blocks should scale proportionally to the segment state length, with the default setting of proportionLength=TRUE
. When blocks scale proportionally, blockLength
provides the maximal length of a block, while the actual block length used for a segmentation state is proportional to the fraction of genomic basepairs covered by that state. This option is visualized on toy data at the end of this vignette.
First we obtain the Ensembl genes (Howe et al. 2020) for segmenting by gene density. We obtain these using the ensembldb package (Rainer, Gatto, and Weichenberger 2019).
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86
filt <- AnnotationFilterList(GeneIdFilter("ENSG", "startsWith"))
g <- genes(edb, filter = filt)
We perform some processing to align the sequences (chromosomes) of g
with our excluded regions and our features of interest (DNase hypersensitive sites, or DHS, defined below).
library(GenomeInfoDb)
g <- keepStandardChromosomes(g, pruning.mode = "coarse")
seqlevels(g, pruning.mode="coarse") <- setdiff(seqlevels(g), "MT")
# normally we would assign a new style, but for recent host issues
## seqlevelsStyle(g) <- "UCSC"
seqlevels(g) <- paste0("chr", seqlevels(g))
genome(g) <- "hg38"
g <- sortSeqlevels(g)
g <- sort(g)
table(seqnames(g))
##
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13
## 5194 3971 3010 2505 2868 2863 2867 2353 2242 2204 3235 2940 1304
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
## 2224 2152 2511 2995 1170 2926 1386 835 1318 2359 523
We next import excluded regions, as created by Amemiya, Kundaje, and Boyle (2019), and packaged for R/Bioconductor in the excluderanges package by Mikhail Dozmorov.
## snapshotDate(): 2021-10-20
## loading from cache
## [1] TRUE
We first demonstrate the use a CBS segmentation as implemented in DNAcopy (Olshen et al. 2004).
We load the nullranges and plyranges packages, and patchwork in order to produce grids of plots.
We subset the excluded ranges to those which are 500 bp or larger. The motivation for this step is to avoid segmenting the genome into many small pieces due to an abundance of short excluded regions. Note that we re-save the excluded ranges to exclude
.
set.seed(5)
exclude <- exclude %>%
plyranges::filter(width(exclude) >= 500)
L_s <- 1e6
seg <- segmentDensity(g, n = 3, L_s = L_s,
exclude = exclude, type = "cbs")
## Analyzing: Sample.1
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
plotSegment(seg, exclude, type = t)
})
plots[[1]]
Note here, the default ranges plot gives whole genome and every fractured bind regions represents state transformations happens. However, some transformations within small ranges cannot be visualized, e.g 1kb. If user want to look into specific ranges of segmentation state, the region argument is flexible to support.
region <- GRanges("chr16", IRanges(3e7,4e7))
plotSegment(seg, exclude, type="ranges", region=region)
Here we use an alternative segmentation implemented in the RcppHMM CRAN package, using the initGHMM
, learnEM
, and viterbi
functions.
## Finished at Iteration: 100 with Error: 9.31905e-06
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
plotSegment(seg_hmm, exclude, type = t)
})
plots[[1]]