Introduction

The following vignette describes the nullranges implementation of the block bootstrap with respect to a genomic segmentation. See the main nullranges vignette for an overview of the idea of bootstrapping, or below diagram, and there is additionally a vignette on block boostrapping without respect to segmentation-Unsegmented block bootstrap.

As proposed by Bickel et al. (2010), nullranges contains an implementation of a block bootstrap, such that features are sampled from the genome in blocks. The original block bootstrapping algorithm is implemented in a python software called Genome Structure Correlation, GSC.

In a segmented block bootstrap, the blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states 1,2,…,S, blocks from state s will be used to tile the ranges of state s in each bootstrap sample. The process can be visualized in (A), a block with length \(L_b\) is randomly selected from state “red” and move to a tile block across chromosome. Additionally, the workflow of bootRanges is diagrammed in (B) and listed as:

  1. Overlap GRanges of feature \(x\) and GRanges of feature \(y\) to derive interested observed statistics
  2. bootRanges() with optional segmentation and exclude creates a bootRanges object \(y'\)
  3. Overlap GRanges of feature \(x\) and \(y'\) to derive interested bootstrap statistics
  4. \(z\) test is performed for testing the null hypothesis that there is no true biological enrichment

The segmented block bootstrap has two options, either:

  • Perform a de-novo segmentation of the genome using feature density, e.g. gene density
  • Use exiting segmentation (e.g. ChromHMM, etc.) downloaded from AnnotationHub or external to Bioconductor (BED files imported with rtracklayer)

In this vignette, we give an example of segmenting the hg38 genome by Ensembl gene density, create bootstrapped peaks and evaluate overlaps for observed peaks and bootstrap peaks, then we profile the time to generate a single block bootstrap sample. Finally, we use a toy dataset to visualize what a segmented block bootstrap sample looks like with respect to a genome segmentation.

A finally consideration is whether the blocks should scale proportionally to the segment state length, with the default setting of proportionLength=TRUE. When blocks scale proportionally, blockLength provides the maximal length of a block, while the actual block length used for a segmentation state is proportional to the fraction of genomic basepairs covered by that state. This option is visualized on toy data at the end of this vignette.

Pre-built segmentations

nullranges has generated pre-built segmentations for easy use by following below section Segmentation by gene density. Either pre-built segmentations using CBS or HMM methods with \(L_s=2e6\) considering excludable regions can be selected from ExperimentHub.

## snapshotDate(): 2022-04-19
## see ?nullrangesData and browseVignettes('nullrangesData') for documentation
## loading from cache
## see ?nullrangesData and browseVignettes('nullrangesData') for documentation
## loading from cache

Segmentation by gene density

First we obtain the Ensembl genes (Howe et al. 2020) for segmenting by gene density. We obtain these using the ensembldb package (Rainer, Gatto, and Weichenberger 2019).

We perform some processing to align the sequences (chromosomes) of g with our excluded regions and our features of interest (DNase hypersensitive sites, or DHS, defined below).

## 
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
##  5194  3971  3010  2505  2868  2863  2867  2353  2242  2204  3235  2940  1304 
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
##  2224  2152  2511  2995  1170  2926  1386   835  1318  2359   523

Import excluded regions

We next import excluded regions including ENCODE-produced excludable regions(Amemiya, Kundaje, and Boyle 2019), telomeres from UCSC, centromeres (Commo 2022). For easy use, pre-combined excludable regions is stored in ExperimentHub. These steps using excluderanges package (Dozmorov et al. 2022) are included in nullrangesData in the inst/scripts/make-segmentation-hg38.R script.

## see ?nullrangesData and browseVignettes('nullrangesData') for documentation
## loading from cache
## [1] TRUE

CBS segmentation

We first demonstrate the use a CBS segmentation as implemented in DNAcopy (Olshen et al. 2004).

We load the nullranges and plyranges packages, and patchwork in order to produce grids of plots.

We subset the excluded ranges to those which are 500 bp or larger. The motivation for this step is to avoid segmenting the genome into many small pieces due to an abundance of short excluded regions. Note that we re-save the excluded ranges to exclude.

Here, and below, we need to specify plyranges::filter as it conflicts with filter exported by ensembldb.

## Analyzing: Sample.1

Note here, the default ranges plot gives whole genome and every fractured bind regions represents state transformations happens. However, some transformations within small ranges cannot be visualized, e.g 1kb. If user want to look into specific ranges of segmentation state, the region argument is flexible to support.

Alternatively: HMM segmentation

Here we use an alternative segmentation implemented in the RcppHMM CRAN package, using the initGHMM, learnEM, and viterbi functions.

## Finished at Iteration: 111 with Error: 9.22278e-06

Segmented block bootstrap

We use a set of DNase hypersensitivity sites (DHS) from the ENCODE project (ENCODE 2012) in A549 cell line (ENCSR614GWM). Here, for speed, we work with a pre-processed data object from ExperimentHub, created using the following steps:

  • Download ENCODE DNase hypersensitive peaks in A549 from AnnotationHub
  • Subset to standard chromosomes and remove mitochondrial DNA
  • Use a chain file from UCSC to lift ranges from hg19 to hg38
  • Sort the DHS features to be bootstrapped

These steps are included in nullrangesData in the inst/scripts/make-dhs-data.R script.

For speed of the vignette, we restrict to a smaller number of DHS, filtering by the signal value. We also remove metadata columns that we don’t need for the bootstrap analysis. Consider, when creating bootstrapped data, that you will be creating an object many times larger than your original features, so filtering and trimming extra metadata can help make the analysis more efficient.

## see ?nullrangesData and browseVignettes('nullrangesData') for documentation
## loading from cache
## [1] 6214
## 
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
##  1436   252   108    30   148    51   184   146   155   443   436   526    20 
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
##   197   265   214   715    20   649   142    31    19    17    10

Now we apply a segmented block bootstrap with blocks of size 500kb, to the peaks. Here we show generation of 50 iterations of a block bootstrap followed by a typical overlap analysis using plyranges (Lee, Cook, and Lawrence 2019). (We might normally do 100 iterations or more, depending on the granularity of the bootstrap distribution that is needed.)

## bootRanges object with 310726 ranges and 4 metadata columns:
##            seqnames            ranges strand |        id     block  iter
##               <Rle>         <IRanges>  <Rle> | <integer> <integer> <Rle>
##        [1]     chr1     242791-242940      * |       347         5     1
##        [2]     chr1     256031-256180      * |       348         5     1
##        [3]     chr1     391535-391684      * |      5301         8     1
##        [4]     chr1     421046-421195      * |      5302         8     1
##        [5]     chr1     438186-438335      * |      5303         8     1
##        ...      ...               ...    ... .       ...       ...   ...
##   [310722]     chrY 27090908-27091057      * |      2133     12441    50
##   [310723]     chrY 27194968-27195117      * |      2134     12441    50
##   [310724]     chrY 27224188-27224337      * |      2135     12441    50
##   [310725]     chrY 27234153-27234302      * |      2136     12441    50
##   [310726]     chrY 27789879-27790028      * |      2116     12442    50
##            blockLength
##                  <Rle>
##        [1]      500000
##        [2]      500000
##        [3]      500000
##        [4]      500000
##        [5]      500000
##        ...         ...
##   [310722]      500000
##   [310723]      500000
##   [310724]      500000
##   [310725]      500000
##   [310726]      500000
##   -------
##   seqinfo: 24 sequences from hg38 genome

What is returned here? The bootRanges function returns a bootRanges object, which is a simple sub-class of GRanges. The iteration (iter) and the block length (blockLength) are recorded as metadata columns, accessible via mcols. We return the bootstrapped ranges as GRanges rather than GRangesList, as the former is more compatible with downstream overlap joins with plyranges, where the iteration column can be used with group_by to provide per bootstrap summary statistics, as shown below.

Note that we use the exclude object from the previous step, which does not contain small ranges. If one wanted to also avoid generation of bootstrapped features that overlap small excluded ranges, then omit this filtering step (use the original, complete exclude feature set).

We can examine the number and extent of the bootstrapped data:

## DataFrame with 50 rows and 3 columns
##      iter         n total_width
##     <Rle> <integer>   <integer>
## 1       1      6150      923802
## 2       2      6276      945143
## 3       3      6249      939485
## 4       4      6293      946714
## 5       5      6365      957108
## ...   ...       ...         ...
## 46     46      6214      936107
## 47     47      6177      930282
## 48     48      6225      936523
## 49     49      6185      931428
## 50     50      6250      937598

which is similar to the orig inal set, after filtering out those overlapping the excluded region:

## DataFrame with 1 row and 2 columns
##           n total_width
##   <integer>   <integer>
## 1      6166      927017

Use with plyranges

Suppose we have a set of features x and we are interested in evaluating the overlap of this set with the DHS. We can calculate for example the mean observed number of overlaps for features in x (or something more complicated, e.g. the maximum log fold change or signal value for DHS peaks within a maxgap window of x).

## [1] 1.26

We can repeat this with the bootstrapped features using a group_by command, a summarize, followed by a second group_by and summarize. It may help to step through these commands one by one to understand what the intermediate output is.

Note that we need to use tidyr::complete in order to fill in combinations of x and iter where the overlap was 0.

The above code, first grouping by x_id and iter, then subsequently by iter is general and allows for more complex analysis than just mean overlap (e.g. how many times an x range has 1 or more overlap, what is the mean or max signal value for peaks overlapping ranges in x, etc.).

Finally we can plot a histogram. In this case, as the x features were arbitrary, our observed value falls within the distribution of mean overlap with bootstrapped data.

For more examples of combining bootRanges from nullranges with plyranges piped operations, see the relevant chapter in the tidy-ranges-tutorial book.

Timing on DHS peaks

Here, we test the speed of the various options for bootstrapping (see below for visualization of the difference).

## Unit: milliseconds
##     expr       min       lq     mean   median       uq      max neval cld
##     prop 106.30244 112.7007 120.8499 116.5004 120.5533 148.7172    10   b
##  no_prop  99.76908 103.1025 106.5121 107.8988 109.0851 114.0771    10  a

Visualizing toy bootstrap samples

Below we present a toy example for visualizing the segmented block bootstrap. First, we define a helper function for plotting GRanges using plotgardener (Kramer et al. 2021). A key aspect here is that we color the original and bootstrapped ranges by the genomic state (the state of the segmentation that the original ranges fall in).

Create a toy genome segmentation:

We can visualize with our helper function:

Now create small ranges distributed uniformly across the toy genome:

Not scaling by segmentation

We can visualize block bootstrapped ranges when the blocks do not scale to segment state length:

Scaling by segmentation

This time the blocks scale to the segment state length. Note that in this case blockLength is the maximal block length possible, but the actual block lengths per segment will be smaller (proportional to the fraction of basepairs covered by that state in the genome segmentation).

Note that some ranges from adjacent states are allowed to be placed within different states in the bootstrap sample. This is because, during the random sampling of blocks of original data, a block is allowed to extend beyond the segmentation region of the state being sampled, and features from adjacent states are not excluded from the sampled block.

Session information

## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] microbenchmark_1.4.9        tidyr_1.2.0                
##  [3] EnsDb.Hsapiens.v86_2.99.0   ensembldb_2.20.0           
##  [5] AnnotationFilter_1.20.0     GenomicFeatures_1.48.0     
##  [7] AnnotationDbi_1.58.0        patchwork_1.1.1            
##  [9] plyranges_1.16.0            nullrangesData_1.1.1       
## [11] ExperimentHub_2.4.0         AnnotationHub_3.4.0        
## [13] BiocFileCache_2.4.0         dbplyr_2.1.1               
## [15] ggplot2_3.3.5               plotgardener_1.2.0         
## [17] nullranges_1.2.0            InteractionSet_1.24.0      
## [19] SummarizedExperiment_1.26.0 Biobase_2.56.0             
## [21] MatrixGenerics_1.8.0        matrixStats_0.62.0         
## [23] GenomicRanges_1.48.0        GenomeInfoDb_1.32.0        
## [25] IRanges_2.30.0              S4Vectors_0.34.0           
## [27] BiocGenerics_0.42.0        
## 
## loaded via a namespace (and not attached):
##   [1] plyr_1.8.7                    RcppHMM_1.2.2                
##   [3] lazyeval_0.2.2                splines_4.2.0                
##   [5] BiocParallel_1.30.0           TH.data_1.1-1                
##   [7] digest_0.6.29                 yulab.utils_0.0.4            
##   [9] htmltools_0.5.2               fansi_1.0.3                  
##  [11] magrittr_2.0.3                memoise_2.0.1                
##  [13] ks_1.13.5                     Biostrings_2.64.0            
##  [15] sandwich_3.0-1                prettyunits_1.1.1            
##  [17] jpeg_0.1-9                    colorspace_2.0-3             
##  [19] blob_1.2.3                    rappdirs_0.3.3               
##  [21] xfun_0.30                     dplyr_1.0.8                  
##  [23] crayon_1.5.1                  RCurl_1.98-1.6               
##  [25] jsonlite_1.8.0                survival_3.3-1               
##  [27] zoo_1.8-10                    glue_1.6.2                   
##  [29] gtable_0.3.0                  zlibbioc_1.42.0              
##  [31] XVector_0.36.0                strawr_0.0.9                 
##  [33] DelayedArray_0.22.0           scales_1.2.0                 
##  [35] mvtnorm_1.1-3                 DBI_1.1.2                    
##  [37] Rcpp_1.0.8.3                  xtable_1.8-4                 
##  [39] progress_1.2.2                gridGraphics_0.5-1           
##  [41] bit_4.0.4                     mclust_5.4.9                 
##  [43] httr_1.4.2                    RColorBrewer_1.1-3           
##  [45] speedglm_0.3-4                ellipsis_0.3.2               
##  [47] pkgconfig_2.0.3               XML_3.99-0.9                 
##  [49] farver_2.1.0                  sass_0.4.1                   
##  [51] utf8_1.2.2                    DNAcopy_1.70.0               
##  [53] ggplotify_0.1.0               tidyselect_1.1.2             
##  [55] labeling_0.4.2                rlang_1.0.2                  
##  [57] later_1.3.0                   munsell_0.5.0                
##  [59] BiocVersion_3.15.2            tools_4.2.0                  
##  [61] cachem_1.0.6                  cli_3.3.0                    
##  [63] generics_0.1.2                RSQLite_2.2.12               
##  [65] ggridges_0.5.3                evaluate_0.15                
##  [67] stringr_1.4.0                 fastmap_1.1.0                
##  [69] yaml_2.3.5                    knitr_1.38                   
##  [71] bit64_4.0.5                   purrr_0.3.4                  
##  [73] KEGGREST_1.36.0               mime_0.12                    
##  [75] pracma_2.3.8                  xml2_1.3.3                   
##  [77] biomaRt_2.52.0                compiler_4.2.0               
##  [79] filelock_1.0.2                curl_4.3.2                   
##  [81] png_0.1-7                     interactiveDisplayBase_1.34.0
##  [83] tibble_3.1.6                  bslib_0.3.1                  
##  [85] stringi_1.7.6                 highr_0.9                    
##  [87] lattice_0.20-45               ProtGenerics_1.28.0          
##  [89] Matrix_1.4-1                  vctrs_0.4.1                  
##  [91] pillar_1.7.0                  lifecycle_1.0.1              
##  [93] BiocManager_1.30.17           jquerylib_0.1.4              
##  [95] data.table_1.14.2             bitops_1.0-7                 
##  [97] httpuv_1.6.5                  rtracklayer_1.56.0           
##  [99] R6_2.5.1                      BiocIO_1.6.0                 
## [101] promises_1.2.0.1              KernSmooth_2.23-20           
## [103] codetools_0.2-18              MASS_7.3-57                  
## [105] assertthat_0.2.1              rjson_0.2.21                 
## [107] withr_2.5.0                   GenomicAlignments_1.32.0     
## [109] Rsamtools_2.12.0              multcomp_1.4-19              
## [111] GenomeInfoDbData_1.2.8        parallel_4.2.0               
## [113] hms_1.1.1                     rmarkdown_2.14               
## [115] shiny_1.7.1                   restfulr_0.0.13

References

Amemiya, Haley M, Anshul Kundaje, and Alan P Boyle. 2019. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Scientific Reports 9 (1): 9354. https://doi.org/10.1038/s41598-019-45839-z.

Bickel, Peter J., Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang. 2010. “Subsampling Methods for Genomic Inference.” The Annals of Applied Statistics 4 (4): 1660–97. https://doi.org/10.1214/10-{AOAS363}.

Commo, Frederic. 2022. RCGH: Comprehensive Pipeline for Analyzing and Visualizing Array-Based Cgh Data. https://github.com/fredcommo/rCGH.

Dozmorov, Mikhail G., Eric Davis, Wancen Mu, Stuart Lee, Tim Triche, Douglas Phanstiel, and Michael Love. 2022. Excluderanges. https://github.com/mdozmorov/excluderanges.

ENCODE. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature 489 (7414): 57–74. https://doi.org/10.1038/nature11247.

Howe, Kevin L, Premanand Achuthan, James Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, et al. 2020. “Ensembl 2021.” Nucleic Acids Research 49 (D1): D884–D891. https://doi.org/10.1093/nar/gkaa942.

Kramer, Nicole E, Eric S Davis, Craig D Wenger, Erika M Deoudes, Sarah M Parker, Michael I Love, and Douglas H Phanstiel. 2021. “Plotgardener: Cultivating precise multi-panel figures in R.” bioRxiv. https://doi.org/10.1101/2021.09.08.459338.

Lee, Stuart, Dianne Cook, and Michael Lawrence. 2019. “Plyranges: A Grammar of Genomic Data Transformation.” Genome Biology 20 (1): 4. https://doi.org/10.1186/s13059-018-1597-8.

Olshen, A. B., E. S. Venkatraman, R. Lucito, and M. Wigler. 2004. “Circular binary segmentation for the analysis of array-based DNA copy number data.” Biostatistics 5 (4): 557–72.

Rainer, Johannes, Laurent Gatto, and Christian X Weichenberger. 2019. “ensembldb: an R package to create and use Ensembl-based annotation resources.” Bioinformatics 35 (17): 3151–3. https://doi.org/10.1093/bioinformatics/btz031.