Introduction

The following vignette describes the nullranges implementation of the block bootstrap with respect to a genomic segmentation. See the main nullranges vignette for an overview of the idea of bootstrapping, and there is additionally a vignette on block boostrapping without respect to segmentation.

As proposed by Bickel et al. (2010), nullranges contains an implementation of a block bootstrap, such that features are sampled from the genome in blocks. Blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states 1,2,…,S, blocks from state s will be used to tile the ranges of state s in each bootstrap sample.

The segmented block bootstrap has two options, either:

  • Perform a de-novo segmentation of the genome using feature density, e.g. gene density
  • Use exiting segmentation (e.g. ChromHMM, etc.) downloaded from AnnotationHub or external to Bioconductor (BED files imported with rtracklayer)

In this vignette, we give an example of segmenting the hg38 genome by Ensembl gene density, then we profile the time to generate a single block bootstrap sample. Finally, we use a toy dataset to visualize what a segmented block bootstrap sample looks like with respect to a genome segmentation. Future versions of this vignette will demonstrate the functions used within an overlap analysis. See also the unsegmented block bootstrap vignette in nullranges, if it is not desired to bootstrap with respect to a genome segmentation.

A finally consideration is whether the blocks should scale proportionally to the segment state length, with the default setting of proportionLength=TRUE. When blocks scale proportionally, blockLength provides the maximal length of a block, while the actual block length used for a segmentation state is proportional to the fraction of genomic basepairs covered by that state. This option is visualized on toy data at the end of this vignette.

Segmentation by gene density

First we obtain the Ensembl genes (Howe et al. 2020) for segmenting by gene density. We obtain these using the ensembldb package (Rainer, Gatto, and Weichenberger 2019).

We perform some processing to align the sequences (chromosomes) of g with our excluded regions and our features of interest (DNase hypersensitive sites, or DHS, defined below).

## 
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
##  5194  3971  3010  2505  2868  2863  2867  2353  2242  2204  3235  2940  1304 
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
##  2224  2152  2511  2995  1170  2926  1386   835  1318  2359   523

Import excluded regions

We next import excluded regions, as created by Amemiya, Kundaje, and Boyle (2019), and packaged for R/Bioconductor in the excluderanges package by Mikhail Dozmorov.

## snapshotDate(): 2021-10-20
## loading from cache
## [1] TRUE

CBS segmentation

We first demonstrate the use a CBS segmentation as implemented in DNAcopy (Olshen et al. 2004).

We load the nullranges and plyranges packages, and patchwork in order to produce grids of plots.

We subset the excluded ranges to those which are 500 bp or larger. The motivation for this step is to avoid segmenting the genome into many small pieces due to an abundance of short excluded regions. Note that we re-save the excluded ranges to exclude.

## Analyzing: Sample.1

Note here, the default ranges plot gives whole genome and every fractured bind regions represents state transformations happens. However, some transformations within small ranges cannot be visualized, e.g 1kb. If user want to look into specific ranges of segmentation state, the region argument is flexible to support.

Alternatively: HMM segmentation

Here we use an alternative segmentation implemented in the RcppHMM CRAN package, using the initGHMM, learnEM, and viterbi functions.

## Finished at Iteration: 100 with Error: 9.31905e-06