# 1 Overview

In this vignette, we provide a brief overview of the ChIPexoQual package, a statistical quality control (QC) pipeline for exploring and analyzing ChIP-exo/nexus experiments. To illustrate the pipeline, we use the reads aligned to chr1 in the mouse liver ChIP-exo experiment (Serandour et al. 2013). We load the packages with:

    library(ChIPexoQual)
    library(ChIPexoQualExample)

ChIPexoQual takes a set of aligned reads from a ChIP-exo (or ChIP-nexus) experiment as input and performs the following steps:

1. Identify read islands, i.e., clusters of overlapping reads separated by gaps, from the read coverage.
2. Compute $$D_i$$, the number of reads in island $$i$$, and $$U_i$$, the number of positions in island $$i$$ with at least one aligning read, $$i=1, \cdots, I$$.
3. For each island $$i$$, $$i=1, \cdots, I$$, compute the island statistics: \begin{align*} \mbox{ARC}_i &= \frac{D_i}{W_i}, \quad \mbox{URC}_i = \frac{U_i}{D_i}, \\ \mbox{FSR}_i &= \frac{(\text{Number of forward strand reads aligning to island } i)}{D_i}, \end{align*} where $$W_i$$ denotes the width of island $$i$$.
4. Generate diagnostic plots: (i) URC vs. ARC plot; (ii) Region Composition plot; (iii) FSR distribution plot.
5. Randomly sample $$M$$ (at least 1000) islands and fit \begin{align*} D_i = \beta_1 U_i + \beta_2 W_i + \varepsilon_i, \end{align*} where $$\varepsilon_i$$ denotes the independent error term. Repeat this process $$B$$ times and generate box plots of the estimated $$\beta_1$$ and $$\beta_2$$.
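
The island statistics defined in the steps above can be computed by hand for a single island. The following is a minimal base R sketch on made-up reads, not the package's implementation; the names `read_pos`, `read_strand`, `island_start`, and `island_end` are hypothetical, and the toy data are chosen so that no position is shared between strands.

```r
# Toy island (hypothetical data, not taken from the example BAM files).
island_start <- 101L
island_end   <- 136L
read_pos     <- c(105L, 105L, 105L, 110L, 120L, 120L)  # read positions
read_strand  <- c("+", "+", "+", "+", "-", "-")        # read strands

W <- island_end - island_start + 1L   # island width W_i
D <- length(read_pos)                 # number of reads D_i
U <- length(unique(read_pos))         # positions with at least one read U_i

ARC <- D / W                          # average read coefficient
URC <- U / D                          # unique read coefficient
FSR <- sum(read_strand == "+") / D    # forward strand ratio

c(W = W, D = D, U = U, ARC = ARC, URC = URC, FSR = FSR)
```

Here the island has 6 reads on 3 distinct positions over 36 bp, so ARC = 6/36, URC = 3/6, and FSR = 4/6.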

We analyzed a larger collection of ChIP-exo/nexus experiments, including complete versions of these samples, in (Welch et al. 2016).

# 2 Creating an ExoData object

The minimum input required by ChIPexoQual is the set of aligned reads from a ChIP-exo/nexus experiment. ChIPexoQual accepts either the name of the bam file or the reads in a GAlignments object:

    files = list.files(system.file("extdata",
        package = "ChIPexoQualExample"),full.names = TRUE)
    basename(files[1])
## [1] "ChIPexo_carroll_FoxA1_mouse_rep1_chr1.bam"
    ex1 = ExoData(file = files[1],mc.cores = 2L,verbose = FALSE)
    ex1
## ExoData object with 655785 ranges and 11 metadata columns:
##            seqnames              ranges strand |  fwdReads  revReads    fwdPos
##               <Rle>           <IRanges>  <Rle> | <integer> <integer> <integer>
##        [1]     chr1     3000941-3000976      * |         2         0         1
##        [2]     chr1     3001457-3001492      * |         0         1         0
##        [3]     chr1     3001583-3001618      * |         0         2         0
##        [4]     chr1     3001647-3001682      * |         1         0         1
##        [5]     chr1     3001852-3001887      * |         1         0         1
##        ...      ...                 ...    ... .       ...       ...       ...
##   [655781]     chr1 197192012-197192047      * |         0         1         0
##   [655782]     chr1 197192421-197192456      * |         0         1         0
##   [655783]     chr1 197193059-197193094      * |         1         0         1
##   [655784]     chr1 197193694-197193729      * |         0         3         0
##   [655785]     chr1 197194986-197195021      * |         0         2         0
##               revPos     depth uniquePos       ARC       URC       FSR
##            <integer> <integer> <integer> <numeric> <numeric> <numeric>
##        [1]         0         2         1 0.0555556       0.5         1
##        [2]         1         1         1 0.0277778       1.0         0
##        [3]         1         2         1 0.0555556       0.5         0
##        [4]         0         1         1 0.0277778       1.0         1
##        [5]         0         1         1 0.0277778       1.0         1
##        ...       ...       ...       ...       ...       ...       ...
##   [655781]         1         1         1 0.0277778  1.000000         0
##   [655782]         1         1         1 0.0277778  1.000000         0
##   [655783]         0         1         1 0.0277778  1.000000         1
##   [655784]         1         3         1 0.0833333  0.333333         0
##   [655785]         1         2         1 0.0555556  0.500000         0
##                    M         A
##            <numeric> <numeric>
##        [1]      -Inf       Inf
##        [2]      -Inf      -Inf
##        [3]      -Inf      -Inf
##        [4]      -Inf       Inf
##        [5]      -Inf       Inf
##        ...       ...       ...
##   [655781]      -Inf      -Inf
##   [655782]      -Inf      -Inf
##   [655783]      -Inf       Inf
##   [655784]      -Inf      -Inf
##   [655785]      -Inf      -Inf
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
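    reads = readGAlignments(files[1],param = NULL)
    # build a second object from the GAlignments reads instead of the file name
    ex2 = ExoData(reads = reads,mc.cores = 2L,verbose = FALSE)
    identical(GRanges(ex1),GRanges(ex2))
## [1] FALSE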

For the rest of the vignette, we generate an ExoData object for each replicate:

    files = files[grep("bai",files,invert = TRUE)] ## ignore index files
    exampleExoData = lapply(files,ExoData,mc.cores = 2L,verbose = FALSE)

Finally, we can recover the number of reads that compose an ExoData object by using the nreads function:

    sapply(exampleExoData,nreads)
## [1] 1654985 1766665 1670117

## 2.1 Enrichment analysis and library complexity

To create the ARC vs URC plot proposed in (Welch et al. 2016), we use the ARCvURCplot function. This function makes it possible to compare different samples visually:

    ARCvURCplot(exampleExoData,names.input = paste("Rep",1:3,sep = "-"))

The three panels illustrate the patterns this plot typically exhibits for a given sample. In all three panels we can observe two arms: the first with low Average Read Coefficient (ARC) and varying Unique Read Coefficient (URC), and the second where the URC decreases as the ARC increases. The first and third replicates exhibit a well-defined decreasing trend in URC as the ARC increases, which indicates that these samples have higher ChIP enrichment than the second replicate. On the other hand, the overall URC level of the first two replicates is higher than that of the third replicate, indicating that the libraries for the first two replicates are more complex than that of the third.
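
One way to see why the second arm must slope downward follows from the definitions alone: an island cannot have more occupied positions than its width or than its number of reads, so URC is bounded above by min(1, 1/ARC). The sketch below checks this bound on simulated islands in base R; the widths, depths, and read placements are hypothetical and only serve to illustrate the inequality.

```r
set.seed(1)
n_islands <- 500L
W <- sample(36:200, n_islands, replace = TRUE)   # hypothetical island widths
D <- sample(1:2000, n_islands, replace = TRUE)   # hypothetical island depths

# Place each island's D reads uniformly at random and count occupied positions.
U <- mapply(function(w, d) length(unique(sample.int(w, d, replace = TRUE))),
            W, D)

ARC <- D / W
URC <- U / D

# U <= W and U <= D imply URC <= min(1, 1 / ARC); allow for rounding error.
upper <- pmin(1, 1 / ARC)
all(URC <= upper + 1e-12)
```

Islands with large ARC therefore necessarily have small URC, while islands with small ARC can take any URC value up to 1, which matches the two arms described above.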

## 2.2 Strand imbalance

To create the FSR distribution and Region Composition plots suggested in (Welch et al. 2016), we use the FSRDistplot and regionCompplot functions, respectively.

    p1 = regionCompplot(exampleExoData,names.input = paste("Rep",1:3,
        sep = "-"),depth.values = seq_len(50))
    p2 = FSRDistplot(exampleExoData,names.input = paste("Rep",1:3,sep = "-"),
        quantiles = c(.25,.5,.75),depth.values = seq_len(100))
    gridExtra::grid.arrange(p1,p2,nrow = 1)

The left panel displays the Region Composition plot and the right panel shows the Forward Strand Ratio (FSR) distribution plot, both of which highlight specific problems with replicates 2 and 3. The Region Composition plot exhibits apparent decreasing trends in the proportion of regions formed by fragments on a single strand. High quality experiments tend to show exponential decay in the proportion of single-stranded regions, while for lower quality experiments the trend may be linear or even constant. The FSR distributions of replicates 2 and 3 are more spread out around their respective medians. The rate at which the FSR distribution concentrates around the median reflects the aforementioned lower enrichment in the second replicate and the low complexity in the third. The asymmetric behavior of the second replicate is characteristic of low enrichment, while the constant values of the third replicate at low minimum numbers of reads indicate that this replicate has islands composed of reads aligned to very few unique positions.
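
The two summaries behind these panels can be approximated by hand. The base R sketch below uses simulated islands and assumes that both summaries are computed over islands with at least a given number of reads; the package may stratify by depth differently, and all object names here are hypothetical.

```r
set.seed(42)
n     <- 2000L
depth <- sample(1:60, n, replace = TRUE)        # hypothetical island depths
fwd   <- rbinom(n, size = depth, prob = 0.5)    # forward-strand reads per island
fsr   <- fwd / depth                            # forward strand ratio

# Label each island by its strand composition.
label <- ifelse(fsr == 1, "fwd only", ifelse(fsr == 0, "rev only", "both"))
label <- factor(label, levels = c("both", "fwd only", "rev only"))

depth_values <- c(1L, 5L, 10L, 25L, 50L)

# Region composition: label proportions among islands with depth >= d.
composition <- t(sapply(depth_values, function(d)
  prop.table(table(label[depth >= d]))))
rownames(composition) <- paste0("depth>=", depth_values)

# FSR distribution: quartiles of FSR among islands with depth >= d.
fsr_quartiles <- t(sapply(depth_values, function(d)
  quantile(fsr[depth >= d], c(.25, .5, .75))))
rownames(fsr_quartiles) <- paste0("depth>=", depth_values)

composition
fsr_quartiles
```

In this balanced simulation the single-stranded proportions vanish quickly as the depth threshold grows and the FSR quartiles tighten around 0.5, which is the behavior the text associates with a high quality sample.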

### 2.2.1 Further exploration of ChIP-exo data

All the plot functions in ChIPexoQual accept either a list of ExoData objects or several separate ExoData objects. This makes it possible to explore island subsets for each replicate. For example, to show that the first arm is composed of regions formed by reads aligned to few positions, we can generate the following plot:

    ARCvURCplot(exampleExoData[[1]],
        subset(exampleExoData[[1]],uniquePos > 10),
        subset(exampleExoData[[1]],uniquePos > 20),
        names.input = c("All", "uniquePos > 10", "uniquePos > 20"))

For this figure, we used the ARC vs URC plot to show that many of the regions with low ARC values are composed of reads that align to a small number of unique positions. The same strategy can be followed to explore the data further: with any of the previously listed plotting functions, we may compare different subsets of the islands in the partition.

## 2.3 Quality evaluation

The last step of the quality control pipeline is to evaluate the linear model:

\begin{align*} D_i = \beta_1 U_i + \beta_2 W_i + \varepsilon_i. \end{align*}

The distribution of the parameters of this model is built by sampling nregions regions (the default value is 1,000), fitting the model and repeating the process ntimes (the default value is 100). We visualize the distributions of the parameters with box-plots:

    p1 = paramDistBoxplot(exampleExoData,which.param = "beta1", names.input = paste("Rep",1:3,sep = "-"))
    p2 = paramDistBoxplot(exampleExoData,which.param = "beta2", names.input = paste("Rep",1:3,sep = "-"))
    gridExtra::grid.arrange(p1,p2,nrow = 1)
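
The sample-and-refit procedure described above can be sketched in base R on simulated islands. This is an illustration under stated assumptions rather than the package's code: the island table is made up, the model is fitted without an intercept because the displayed equation has none, and `nregions` and `ntimes` simply mirror the argument names mentioned in the text.

```r
set.seed(7)
n_islands <- 5000L
W <- sample(36:300, n_islands, replace = TRUE)            # hypothetical widths
U <- pmax(1L, rbinom(n_islands, size = W, prob = 0.1))    # hypothetical unique positions
D <- U + rpois(n_islands, lambda = U)                     # hypothetical depths, D >= U
islands <- data.frame(D = D, U = U, W = W)

nregions <- 1000L   # islands drawn per fit
ntimes   <- 100L    # number of refits

# Each replicate samples nregions islands and fits D ~ beta1 * U + beta2 * W.
fits <- t(replicate(ntimes, {
  idx <- sample.int(nrow(islands), nregions)
  coef(lm(D ~ 0 + U + W, data = islands[idx, ]))
}))
colnames(fits) <- c("beta1", "beta2")

apply(fits, 2, median)   # one-number summary of each parameter distribution
```

Calling `boxplot(fits)` on the resulting matrix gives a rough base R analogue of the box plots produced by paramDistBoxplot.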

Further details on this analysis are given in (Welch et al. 2016). In short, when the ChIP-exo/nexus sample is not deeply sequenced, high values of $$\hat{\beta}_1$$ indicate that the library complexity is low. In contrast, lower values correspond to higher quality ChIP-exo experiments. We concluded that samples with estimated $$\hat{\beta}_1 \leq 10$$ seem to be high quality samples. Similarly, samples with estimated $$\hat{\beta}_2 \approx 0$$ can be considered high quality samples. The estimated values of these parameters can be accessed with the beta1, beta2, and paramDist methods. For example, using the median to summarize these parameter distributions, we conclude that these three replicates (in chr1) are high quality samples:

    sapply(exampleExoData,function(x)median(beta1(x)))
## [1] 1.841214 1.484552 8.343228
    sapply(exampleExoData,function(x)median(-beta2(x)))
## [1] 0.013901431 0.008013086 0.050511523
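
The rule of thumb above can be applied directly to the medians just printed. In this small sketch the values are copied from the output above; the text gives no numeric cutoff for "approximately zero", so the tolerance of 0.1 is a hypothetical choice made only for illustration.

```r
# Medians copied from the output above (chr1 reads only).
beta1_med <- c(Rep1 = 1.841214, Rep2 = 1.484552, Rep3 = 8.343228)
beta2_med <- c(Rep1 = 0.013901431, Rep2 = 0.008013086, Rep3 = 0.050511523)

beta1_ok <- beta1_med <= 10          # threshold stated in the text
beta2_ok <- abs(beta2_med) < 0.1     # hypothetical tolerance for "close to zero"

data.frame(beta1_med, beta2_med, high_quality = beta1_ok & beta2_ok)
```

All three replicates pass both checks, although the third replicate's median for beta1 is much closer to the threshold than the other two.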

## 2.4 Subsampling reads from the experiment to assess quality

The behavior of the third FoxA1 replicate may indicate problems in the sample. However, the same pattern is also common in deeply sequenced experiments. For convenience, we added the function ExoDataSubsampling, which performs the analysis suggested in (Welch et al. 2016) for deeply sequenced experiments. To use this function, we proceed as follows:

    sample.depth = seq(1e5,2e5,5e4)
    exoList = ExoDataSubsampling(file = files[3],sample.depth = sample.depth, verbose=FALSE)

The output of ExoDataSubsampling is a list of ExoData objects, so it can be used with any of the plotting functions to assess the quality of the samples. For example, we may use paramDistBoxplot to get the following figures:

    p1 = paramDistBoxplot(exoList,which.param = "beta1")
    p2 = paramDistBoxplot(exoList,which.param = "beta2")
    gridExtra::grid.arrange(p1,p2,nrow = 1)

Both plots show clearly increasing trends. Because we are only using the reads in chromosome 1, we observe fewer reads than in a typical ChIP-exo/nexus experiment. A higher quality experiment is expected to show lower $$\hat{\beta}_1$$ and $$\hat{\beta}_2$$ levels. Additionally, the estimated $$\hat{\beta}_2$$ parameter is expected to increase faster in a lower quality experiment.
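
A rough intuition for why these estimates grow with sequencing depth is that the number of distinct positions saturates while the number of reads keeps increasing. The base R sketch below subsamples a simulated read set at several depths; it is not how ExoDataSubsampling is implemented, and the library size, number of positions, and depths are hypothetical.

```r
set.seed(3)
# Hypothetical library: 20000 reads falling on at most 2000 distinct positions.
read_pos <- sample.int(2000L, 20000L, replace = TRUE)

sample_depth <- c(1000L, 5000L, 10000L, 20000L)

# Subsample n reads without replacement and count reads and occupied positions.
summary_by_depth <- t(sapply(sample_depth, function(n) {
  sub <- sample(read_pos, n)
  c(depth = n, uniquePos = length(unique(sub)))
}))

# Reads per occupied position at each sampling depth.
summary_by_depth <- cbind(
  summary_by_depth,
  reads_per_pos = summary_by_depth[, "depth"] / summary_by_depth[, "uniquePos"]
)
summary_by_depth
```

Because at most 2000 positions can be occupied, the full sample of 20000 reads has at least 10 reads per occupied position, whereas the smallest subsample has far fewer.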

# 3 Conclusions

We presented a systematic exploration of a ChIP-exo experiment and showed how to use the QC pipeline provided in ChIPexoQual. ChIPexoQual takes aligned reads as input and automatically generates several diagnostic plots and summary measures that enable assessing enrichment and library complexity. The implications of the diagnostic plots and the summary measures align well with more elaborate analyses that are computationally more expensive to perform and/or require additional inputs that often may not be available.

# 4 SessionInfo

    sessionInfo("ChIPexoQual")
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] ChIPexoQual_1.28.0
##
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2                   bitops_1.0-7
##   [3] gridExtra_2.3               rlang_1.1.3
##   [5] magrittr_2.0.3              biovizBase_1.52.0
##   [7] matrixStats_1.3.0           compiler_4.4.0
##   [9] RSQLite_2.3.6               GenomicFeatures_1.56.0
##  [11] png_0.1-8                   vctrs_0.6.5
##  [13] ProtGenerics_1.36.0         stringr_1.5.1
##  [15] pkgconfig_2.0.3             crayon_1.5.2
##  [17] fastmap_1.1.1               magick_2.8.3
##  [19] backports_1.4.1             XVector_0.44.0
##  [21] labeling_0.4.3              utf8_1.2.4
##  [23] Rsamtools_2.20.0            rmarkdown_2.26
##  [25] grDevices_4.4.0             UCSC.utils_1.0.0
##  [27] tinytex_0.50                purrr_1.0.2
##  [29] bit_4.0.5                   xfun_0.43
##  [31] zlibbioc_1.50.0             cachem_1.0.8
##  [33] graphics_4.4.0              GenomeInfoDb_1.40.0
##  [35] jsonlite_1.8.8              blob_1.2.4
##  [37] highr_0.10                  DelayedArray_0.30.0
##  [39] BiocParallel_1.38.0         broom_1.0.5
##  [41] parallel_4.4.0              cluster_2.1.6
##  [43] VariantAnnotation_1.50.0    R6_2.5.1
##  [45] bslib_0.7.0                 stringi_1.8.3
##  [47] RColorBrewer_1.1-3          rtracklayer_1.64.0
##  [49] rpart_4.1.23                GenomicRanges_1.56.0
##  [51] jquerylib_0.1.4             Rcpp_1.0.12
##  [53] bookdown_0.39               SummarizedExperiment_1.34.0
##  [55] knitr_1.46                  base64enc_0.1-3
##  [57] IRanges_2.38.0              Matrix_1.7-0
##  [59] nnet_7.3-19                 tidyselect_1.2.1
##  [61] viridis_0.6.5               rstudioapi_0.16.0
##  [63] dichromat_2.0-0.1           abind_1.4-5
##  [65] yaml_2.3.8                  codetools_0.2-20
##  [67] curl_5.2.1                  lattice_0.22-6
##  [69] tibble_3.2.1                withr_3.0.0
##  [71] Biobase_2.64.0              KEGGREST_1.44.0
##  [73] evaluate_0.23               foreign_0.8-86
##  [75] base_4.4.0                  Biostrings_2.72.0
##  [77] pillar_1.9.0                BiocManager_1.30.22
##  [79] MatrixGenerics_1.16.0       checkmate_2.3.1
##  [81] stats4_4.4.0                generics_0.1.3
##  [83] RCurl_1.98-1.14             ensembldb_2.28.0
##  [85] S4Vectors_0.42.0            ggplot2_3.5.1
##  [87] munsell_0.5.1               scales_1.3.0
##  [89] BiocStyle_2.32.0            stats_4.4.0
##  [91] glue_1.7.0                  Hmisc_5.1-2
##  [93] lazyeval_0.2.2              ChIPexoQualExample_1.27.0
##  [95] tools_4.4.0                 datasets_4.4.0
##  [97] hexbin_1.28.3               BiocIO_1.14.0
##  [99] data.table_1.15.4           BSgenome_1.72.0
## [101] GenomicAlignments_1.40.0    XML_3.99-0.16.1
## [103] grid_4.4.0                  utils_4.4.0
## [105] tidyr_1.3.1                 methods_4.4.0
## [107] AnnotationDbi_1.66.0        colorspace_2.1-0
## [109] GenomeInfoDbData_1.2.12     htmlTable_2.4.2
## [111] restfulr_0.0.15             Formula_1.2-5
## [113] cli_3.6.2                   fansi_1.0.6
## [115] viridisLite_0.4.2           S4Arrays_1.4.0
## [117] dplyr_1.1.4                 AnnotationFilter_1.28.0
## [119] gtable_0.3.5                sass_0.4.9
## [121] digest_0.6.35               BiocGenerics_0.50.0
## [123] SparseArray_1.4.0           farver_2.1.1
## [125] rjson_0.2.21                htmlwidgets_1.6.4
## [127] memoise_2.0.1               htmltools_0.5.8.1
## [129] lifecycle_1.0.4             httr_1.4.7
## [131] bit64_4.0.5

# References

Serandour, Aurelien, Gordon Brown, Joshua Cohen, and Jason Carroll. 2013. “Development of an Illumina-Based ChIP-Exonuclease Method Provides Insight into FoxA1-DNA Binding Properties.” Genome Biology.

Welch, Rene, Dongjun Chung, Jeffrey Grass, Robert Landick, and Sündüz Keleş. 2016. “Data Exploration, Quality Control, and Statistical Analysis of ChIP-Exo/Nexus Experiments.” Submitted.