RESOLVE
R packageRESOLVE 1.0.0
Cancer is a genetic disease caused by somatic mutations in genes controlling key biological functions such as cellular growth and division. Such mutations may arise both through cell-intrinsic and exogenous processes, generating characteristic mutational patterns over the genome named mutational signatures. The study of mutational signatures have become a standard component of modern genomics studies, since it can reveal which (environmental and endogenous) mutagenic processes are active in a tumor, and may highlight markers for therapeutic response.
Mutational signatures computational analyses fall mostly within two categories: (i) de novo signatures extraction and (ii) signatures exposure estimation. In the first case, the presence of mutational processes is first assessed from the data, signatures are identified and extracted and finally assigned to samples. This task is typically performed by Non-Negative Matrix Factorization (NMF). While other approaches have been proposed, NMF-based methods are by far the most used. The estimation of signatures exposures is performed by holding a set of signatures fixed (see, e.g., COSMIC mutational signatures catalogue) and assigning them to samples by minimizing, e.g., mean squared error between observed and estimated mutational patterns for each sample.
However, available mutational signatures computational tools presents many pitfalls. First, the task of determining the number of signatures is very complex and depends on heuristics. Second, several signatures have no clear etiology, casting doubt on them being computational artifacts rather than due to mutagenic processes. Last, approaches for signatures assignment are greatly influenced by the set of signatures used for the analysis. To overcome these limitations, we developed RESOLVE (Robust EStimation Of mutationaL signatures Via rEgularization), a framework that allows the efficient extraction and assignment of mutational signatures.
The RESOLVE R package implements a new de novo signatures extraction algorithm, which employs a regularized Non-Negative Matrix Factorization procedure. The method incorporates a background signature during the inference step and adopts elastic net regression to reduce the impact of overfitting. The estimation of the optimal number of signatures is performed by bi-cross-validation. Furthermore, the package also provide a procedure for confidence estimation of signatures activities in samples.
As such, RESOLVE represent an addition to other Bioconductor packages, such as, e.g., SparseSignatures, MutationalPatterns, musicatk among others, that implements a novel approach for detecting mutational signatures.
In this vignette, we give an overview of the package by presenting some of its main functions.
The RESOLVE package can be installed from Bioconductor as follow.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("RESOLVE")
We now present some of the main features of the package. We notice that the package supports different types of mutational signatures such as: SBS (single base substitutions) and MNV (multi-nucleotide variant) (see Degasperi, Andrea, et al. “Substitution mutational signatures in whole-genome–sequenced cancers in the UK population.” Science 376.6591 (2022): abl9283), CX (chromosomal instability) (see Drews, Ruben M., et al. “A pan-cancer compendium of chromosomal instability.” Nature 606.7916 (2022): 976-983) and CN (copy number) signatures (see Steele, Christopher D., et al. “Signatures of copy number alterations in human cancer.” Nature 606.7916 (2022): 984-991). But, for the sake of this vignette, we present only results on the classical SBS signatures. We refer to the manual for details.
First, we show how to load example data and import them into a count matrix to perform the signatures analysis.
library("RESOLVE")
data(ssm560_reduced)
These data are a reduced version of the 560 breast tumors provided by Nik-Zainal, Serena, et al. (2016) comprising only 3 patients. We notice that these data are provided purely as an example, and, as they are a reduced and partial version of the original dataset, they should not be used to draw any biological conclusion.
We now import such data into a count matrix to perform the signatures discovery. To do so, we also need to specify the reference genome as a BSgenome object to be considered. This can be done as follows, where in the example we used hs37d5 as reference genome as provided as data object within the package.
library("BSgenome.Hsapiens.1000genomes.hs37d5")
imported_data = getSBSCounts(data = ssm560_reduced, reference = BSgenome.Hsapiens.1000genomes.hs37d5)
head(imported_data)
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T
## PD10010a 37 25 8 24 35 5 16 25
## PD10011a 103 59 16 73 113 54 31 102
## PD10014a 235 241 37 234 158 71 26 180
## A[C>T]A A[C>T]C A[C>T]G A[C>T]T A[T>A]A A[T>A]C A[T>A]G A[T>A]T
## PD10010a 49 31 100 42 21 15 17 30
## PD10011a 116 73 228 109 61 70 56 165
## PD10014a 229 89 178 186 105 90 126 174
## A[T>C]A A[T>C]C A[T>C]G A[T>C]T A[T>G]A A[T>G]C A[T>G]G A[T>G]T
## PD10010a 48 20 29 44 8 6 10 23
## PD10011a 184 116 113 169 77 41 73 105
## PD10014a 261 122 167 211 76 27 84 59
## C[C>A]A C[C>A]C C[C>A]G C[C>A]T C[C>G]A C[C>G]C C[C>G]G C[C>G]T
## PD10010a 34 28 8 23 15 19 20 26
## PD10011a 105 75 30 102 60 37 22 65
## PD10014a 244 238 35 243 107 105 40 144
## C[C>T]A C[C>T]C C[C>T]G C[C>T]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T
## PD10010a 48 37 55 43 12 7 18 16
## PD10011a 71 52 108 103 116 80 89 103
## PD10014a 136 124 144 197 116 139 145 217
## C[T>C]A C[T>C]C C[T>C]G C[T>C]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T
## PD10010a 14 17 20 30 6 8 5 13
## PD10011a 103 78 102 158 40 65 55 188
## PD10014a 103 144 112 129 47 54 70 107
## G[C>A]A G[C>A]C G[C>A]G G[C>A]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T
## PD10010a 31 22 11 22 6 12 9 14
## PD10011a 78 50 14 55 55 66 13 87
## PD10014a 146 126 24 160 63 70 25 120
## G[C>T]A G[C>T]C G[C>T]G G[C>T]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T
## PD10010a 40 32 82 25 6 6 6 13
## PD10011a 76 63 118 81 69 41 56 86
## PD10014a 141 99 180 163 62 66 83 126
## G[T>C]A G[T>C]C G[T>C]G G[T>C]T G[T>G]A G[T>G]C G[T>G]G G[T>G]T
## PD10010a 22 9 16 24 7 1 8 10
## PD10011a 96 62 82 93 56 46 35 99
## PD10014a 110 81 102 135 32 18 61 78
## T[C>A]A T[C>A]C T[C>A]G T[C>A]T T[C>G]A T[C>G]C T[C>G]G T[C>G]T
## PD10010a 40 40 12 48 54 37 12 85
## PD10011a 78 80 12 83 116 104 29 194
## PD10014a 202 191 17 253 198 159 33 325
## T[C>T]A T[C>T]C T[C>T]G T[C>T]T T[T>A]A T[T>A]C T[T>A]G T[T>A]T
## PD10010a 67 55 53 71 39 13 3 35
## PD10011a 119 94 78 126 121 43 64 91
## PD10014a 188 153 93 184 124 89 73 221
## T[T>C]A T[T>C]C T[T>C]G T[T>C]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T
## PD10010a 19 13 11 25 18 11 11 35
## PD10011a 125 79 83 113 68 90 140 251
## PD10014a 143 118 75 148 71 54 76 160
Now, we present an example of visualization feature provided by the package, showing the counts for the first patient, i.e., PD10010a, in the following plot.
patientsSBSPlot(trinucleotides_counts=imported_data,samples="PD10010a")
After the data are loaded, we can perform signatures de novo extraction. To do so, we need to define a range for the number of signatures (variable K) to be considered. We now show how to perform the inference on the first 3 patients of the dataset from Nik-Zainal, Serena, et al. (2016), whose counts are provided within the package.
data(background)
data(patients)
set.seed(12345)
res_denovo = signaturesDecomposition(x = patients[1:3,],
K = 3:4,
background_signature = background,
nmf_runs = 2,
sparsify = FALSE,
num_processes = 1)
## Performing signatures discovery and rank estimation...
## Performing inference for K=3...
## Performing inference for K=4...
We notice that this function can be also used to perform de novo estimation for other types of mutational signatures, such as SBS, MNV, CX and CN.
Now that we have performed the de novo inferece, we need to decide the optimal number of signatures to be extracted from our dataset. To do so, we provide a procedure based on cross-validation.
set.seed(12345)
res_cv = signaturesCV(x = patients[1:3,],
beta = res_denovo$beta,
cross_validation_iterations = 2,
cross_validation_repetitions = 2,
num_processes = 1)
## Estimating the optimal number of signatures with a total of 2 cross validation repetitions...
## Performing repetition 1 out of 2...
## Performing estimation for K=3...
## Performing cross validation iteration 1 out of 2...
## Performing cross validation iteration 2 out of 2...
## Progress 50%...
## Performing estimation for K=4...
## Performing cross validation iteration 1 out of 2...
## Performing cross validation iteration 2 out of 2...
## Progress 100%...
## Performing repetition 2 out of 2...
## Performing estimation for K=3...
## Performing cross validation iteration 1 out of 2...
## Performing cross validation iteration 2 out of 2...
## Progress 50%...
## Performing estimation for K=4...
## Performing cross validation iteration 1 out of 2...
## Performing cross validation iteration 2 out of 2...
## Progress 100%...
We notice that the computations for this task can be very time consuming, expecially when many iterations of cross validations are performed (see manual) and a large set of configurations of the parameters are tested.
We conclude this vignette by plotting the discovered signatures for the best configuration, i.e., K = 4.
signatures = res_denovo$beta[[2]]
signaturesSBSPlot(beta=signatures, xlabels=FALSE)
We refer to the manual for a detailed description of each parameter and to the RESOLVE manuscript for details on the method.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
## [2] BSgenome_1.66.0
## [3] rtracklayer_1.58.0
## [4] Biostrings_2.66.0
## [5] XVector_0.38.0
## [6] GenomicRanges_1.50.0
## [7] GenomeInfoDb_1.34.0
## [8] IRanges_2.32.0
## [9] S4Vectors_0.36.0
## [10] RESOLVE_1.0.0
## [11] NMF_0.24.0
## [12] bigmemory_4.6.1
## [13] Biobase_2.58.0
## [14] BiocGenerics_0.44.0
## [15] cluster_2.1.4
## [16] rngtools_1.5.2
## [17] pkgmaker_0.32.2
## [18] registry_0.5-1
## [19] BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] lsa_0.73.3 bitops_1.0-7
## [3] matrixStats_0.62.0 doParallel_1.0.17
## [5] RColorBrewer_1.1-3 SnowballC_0.7.0
## [7] tools_4.2.1 bslib_0.4.0
## [9] utf8_1.2.2 R6_2.5.1
## [11] DBI_1.1.3 colorspace_2.0-3
## [13] withr_2.5.0 tidyselect_1.2.0
## [15] gridExtra_2.3 compiler_4.2.1
## [17] glmnet_4.1-4 cli_3.4.1
## [19] DelayedArray_0.24.0 labeling_0.4.2
## [21] bookdown_0.29 sass_0.4.2
## [23] scales_1.2.1 nnls_1.4
## [25] stringr_1.4.1 digest_0.6.30
## [27] Rsamtools_2.14.0 rmarkdown_2.17
## [29] pkgconfig_2.0.3 htmltools_0.5.3
## [31] MatrixGenerics_1.10.0 highr_0.9
## [33] fastmap_1.1.0 rlang_1.0.6
## [35] farver_2.1.1 shape_1.4.6
## [37] jquerylib_0.1.4 BiocIO_1.8.0
## [39] generics_0.1.3 jsonlite_1.8.3
## [41] BiocParallel_1.32.0 dplyr_1.0.10
## [43] RCurl_1.98-1.9 magrittr_2.0.3
## [45] GenomeInfoDbData_1.2.9 Matrix_1.5-1
## [47] Rcpp_1.0.9 munsell_0.5.0
## [49] fansi_1.0.3 lifecycle_1.0.3
## [51] stringi_1.7.8 yaml_2.3.6
## [53] SummarizedExperiment_1.28.0 zlibbioc_1.44.0
## [55] plyr_1.8.7 grid_4.2.1
## [57] parallel_4.2.1 bigmemory.sri_0.1.3
## [59] crayon_1.5.2 lattice_0.20-45
## [61] splines_4.2.1 magick_2.7.3
## [63] knitr_1.40 pillar_1.8.1
## [65] uuid_1.1-0 rjson_0.2.21
## [67] reshape2_1.4.4 codetools_0.2-18
## [69] XML_3.99-0.12 glue_1.6.2
## [71] evaluate_0.17 data.table_1.14.4
## [73] BiocManager_1.30.19 vctrs_0.5.0
## [75] foreach_1.5.2 gtable_0.3.1
## [77] assertthat_0.2.1 cachem_1.0.6
## [79] ggplot2_3.3.6 xfun_0.34
## [81] gridBase_0.4-7 xtable_1.8-4
## [83] restfulr_0.0.15 survival_3.4-0
## [85] tibble_3.1.8 iterators_1.0.14
## [87] GenomicAlignments_1.34.0