This document is a presentation of the R implementation of the tool CIMICE. It shows the main features of this software and how it is built as a modular pipeline, with the goal of making it easy to change and update.
CIMICE is a tool in the field of tumor phylogenetics and its goal is to build a Markov Chain (called Cancer Progression Markov Chain, CPMC) in order to model tumor subtypes evolution. The input of CIMICE is a Mutational Matrix, so a boolean matrix representing altered genes in a collection of samples. These samples are assumed to be obtained with single-cell DNA analysis techniques and the tool is specifically written to use the peculiarities of this data for the CMPC construction.
CIMICE data processing and analysis can be divided in four section:
These steps will be presented in the following sections.
This implementation of CIMICE is built as a single library on its own:
library(CIMICE)
and it requires the following libraries:
# Dataframe manipulation
library(dplyr)
# Plot display
library(ggplot2)
# Improved string operations
library(glue)
# Dataframe manipulation
library(tidyr)
# Graph data management
library(igraph)
# Remove transitive edges on a graph
# library(relations)
# Interactive graph visualization
library(networkD3)
# Interactive graph visualization
library(visNetwork)
# Correlation plot visualization
library(ggcorrplot)
# Functional R programming
library(purrr)
# Graph Visualization
library(ggraph)
# sparse matrices
library(Matrix)
CIMICE requires a boolean dataframe as input, structured as follows:
It is possible to load this information from a file. The default input format for CIMICE is the “CAPRI/CAPRESE” TRONCO format:
This is a scheme of CIMICE’s input format:
s\g gene_1 gene_2 ... gene_n
sample_1 1 0 ... 0
...
sample_m 1 1 ... 1
and this an example on how to load a dataset from the file system:
# Read input dataset in CAPRI/CAPRESE format
dataset.big <- read_CAPRI(system.file("extdata", "example.CAPRI", package = "CIMICE", mustWork = TRUE))
sfmF | ulaF | dapB | yibT | phnL | ispA | |
---|---|---|---|---|---|---|
GCF_001281685.1_ASM128168v1_genomic.fna | 1 | 1 | 0 | 0 | 0 | 0 |
GCF_001281725.1_ASM128172v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281755.1_ASM128175v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281775.1_ASM128177v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281795.1_ASM128179v1_genomic.fna | 0 | 1 | 0 | 0 | 0 | 0 |
GCF_001281815.1_ASM128181v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
## [1] "ncol: 999 - nrow: 160"
Another option is to define directly the dataframe in R. This is made easy by the functions make_dataset
and update_df
, used as follows:
# genes
dataset <- make_dataset(A,B,C,D) %>%
# samples
update_df("S1", 0, 0, 0, 1) %>%
update_df("S2", 1, 0, 0, 0) %>%
update_df("S3", 1, 0, 0, 0) %>%
update_df("S4", 1, 0, 0, 1) %>%
update_df("S5", 1, 1, 0, 1) %>%
update_df("S6", 1, 1, 0, 1) %>%
update_df("S7", 1, 0, 1, 1) %>%
update_df("S8", 1, 1, 0, 1)
with the following outcome:
## 8 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S3 1 0 0 0
## S4 1 0 0 1
## S5 1 1 0 1
## S6 1 1 0 1
## S7 1 0 1 1
## S8 1 1 0 1
In the case your data is composed by samples with associated frequencies it is possible to use an alternative format that we call “CAPRIpop”:
s/g gene_1 gene_2 ... gene_n freq
sample_1 1 0 ... 0 freq_s1
...
sample_m 1 1 ... 1 freq_sm
where the freq
column is mandatory and sample must not be repeated. Frequencies
in the freq
column will be automatically normalized. This format is meant
to be used with the functions quick_run(dataset, mode="CAPRIpop")
for the
full analysis and dataset_preprocessing_population(...)
for the preprocessing
stage only. The subsequent operations remain otherwise equal to those
of the default format.
Another option is to compute a mutational matrix directly from a MAF file, which can be done as follows:
# path to MAF file
read_MAF(system.file("extdata", "paac_jhu_2014_500.maf", package = "CIMICE", mustWork = TRUE))[1:5,1:5]
## -Reading
## -Validating
## -Silent variants: 49
## -Summarizing
## --Possible FLAGS among top ten genes:
## HMCN1
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.097s elapsed (0.119s cpu)
## 5 x 5 sparse Matrix of class "dgCMatrix"
## CSMD2 C2orf61 DCHS2 DPYS GPR158
## ACINAR28 1 1 . . .
## ACINAR27 1 . . . .
## ACINAR25 1 . . . .
## ACINAR02 . . 1 . 1
## ACINAR13 . . 1 . .
This implementation of CIMICE includes simple functions to quickly analyze the distributions of mutations among genes and samples.
The following code displays an histogram showing the distribution of the number of mutations hitting a gene:
gene_mutations_hist(dataset.big)
And this does the same but from the samples point of view:
sample_mutations_hist(dataset.big, binwidth = 10)
In case of huge dataset, it could be necessary to focus only on a subset of the input samples or genes. The following procedures aim to provide an easy way to do so when the goal is to use the most (or least) mutated samples or genes.
Keeps the first \(n\) (=100) most mutated genes:
select_genes_on_mutations(dataset.big, 100)
eptA | yohC | yedK | yeeO | narU | rsxC | |
---|---|---|---|---|---|---|
GCF_001281685.1_ASM128168v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281725.1_ASM128172v1_genomic.fna | 1 | 1 | 1 | 0 | 0 | 0 |
GCF_001281755.1_ASM128175v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281775.1_ASM128177v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 0 |
GCF_001281795.1_ASM128179v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281815.1_ASM128181v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
## [1] "ncol: 100 - nrow: 160"
Keeps the first \(n\) (=100) least mutated samples:
select_samples_on_mutations(dataset.big, 100, desc = FALSE)
sfmF | ulaF | dapB | yibT | phnL | ispA | |
---|---|---|---|---|---|---|
GCF_001281725.1_ASM128172v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281855.1_ASM128185v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281815.1_ASM128181v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001281775.1_ASM128177v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001607735.1_ASM160773v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
GCF_001297965.1_ASM129796v1_genomic.fna | 0 | 0 | 0 | 0 | 0 | 0 |
## [1] "ncol: 999 - nrow: 100"
It is easy to combine these selections by using the pipe operator %>%
:
select_samples_on_mutations(dataset.big , 100, desc = FALSE) %>% select_genes_on_mutations(100)
eptA | yohC | yedK | argK | yeeO | narU | |
---|---|---|---|---|---|---|
GCF_001281725.1_ASM128172v1_genomic.fna | 1 | 1 | 1 | 1 | 0 | 0 |
GCF_001281855.1_ASM128185v1_genomic.fna | 1 | 1 | 1 | 1 | 0 | 0 |
GCF_001281815.1_ASM128181v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001281775.1_ASM128177v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001607735.1_ASM160773v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
GCF_001297965.1_ASM129796v1_genomic.fna | 1 | 1 | 1 | 1 | 1 | 1 |
## [1] "ncol: 100 - nrow: 100"
It may be of interest to show correlations among gene or sample mutations. The library corrplots
provides an easy way to do so by preparing an
heatmap based on the correlation matrix. We can show these plots by using the following comands:
gene mutations correlation:
corrplot_genes(dataset)
sample mutations correlation:
corrplot_samples(dataset)
The first step of the CIMICE algorithm is based on grouping the genotypes contained in the dataset to compute their observed frequencies.
In this implementation we used a simple approach using the library dplyr
. However, this solution is not optimal from an efficiency
point of view and might be problematic with very large datasets. An Rcpp implementation is planned and, moreover, it is possible to easily modify
this step by changing the algorithm as long as its output is a dataframe containing only unique genotypes with an additional column named “freq” for the observed frequencies count.
# groups and counts equal genotypes
compactedDataset <- compact_dataset(dataset)
## $matrix
## 5 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
##
## $counts
## [1] 1 2 1 1 3
##
## $row_names
## $row_names[[1]]
## [1] "S1"
##
## $row_names[[2]]
## [1] "S2, S3"
##
## $row_names[[3]]
## [1] "S4"
##
## $row_names[[4]]
## [1] "S7"
##
## $row_names[[5]]
## [1] "S5, S6, S8"
The subsequent stage goal is to prepare the topology for the final Cancer Progression Markov Chain. We racall that this topology is assumed to be a DAG. These eraly steps are required to prepare the information necessary for this and the following pahses.
Convert dataset to matricial form:
samples <- compactedDataset$matrix
## 5 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
Extract gene names:
genes <- colnames(samples)
## [1] "A" "B" "C" "D"
Compute observed frequency of each genotype:
freqs <- compactedDataset$counts/sum(compactedDataset$counts)
## [1] 0.125 0.250 0.125 0.125 0.375
Add “Clonal” genotype to the dataset (if not present) that will be used as DAG root:
# prepare node labels listing the mutated genes for each node
labels <- prepare_labels(samples, genes)
if( is.null(compactedDataset$row_names) ){
compactedDataset$row_names <- rownames(compactedDataset$matrix)
}
matching_samples <- compactedDataset$row_names
# fix Colonal genotype absence, if needed
fix <- fix_clonal_genotype(samples, freqs, labels, matching_samples)
samples = fix[["samples"]]
freqs = fix[["freqs"]]
labels = fix[["labels"]]
matching_samples <- fix[["matching_samples"]]
## 6 x 4 Matrix of class "dgeMatrix"
## A B C D
## S1 0 0 0 1
## S2 1 0 0 0
## S4 1 0 0 1
## S7 1 0 1 1
## S5 1 1 0 1
## Clonal 0 0 0 0
Build the topology of the graph based on the “superset” relation:
# compute edges based on subset relation
edges <- build_topology_subset(samples)
and finally prepare and show with the current topology of the graph:
# remove transitive edges and prepare igraph object
g <- build_subset_graph(edges, labels)
that can be (badly) plotted using basic igraph:
In this sections, it is shown how to call the procedures to the four steps weight computation used in CIMICE. This is in fact based in computing “UP” weights, normalized “UP” weights, “DOWN” weights and normalized “DOWN” weights.
The process is based on the graph adjacency matrix “A”:
A <- as_adj(g)
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 1 . . .
## [2,] . . 1 . . .
## [3,] . . . 1 1 .
## [4,] . . . . . .
## [5,] . . . . . .
## [6,] 1 1 . . . .
and on the number of successors for each node:
no.of.children <- get_no_of_children(A,g)
## [1] 1 1 2 0 0 2
\[W_{up}(\langle a,b \rangle \in E) = \frac{1}{ |\Lambda_a|}(P(a) + \sum_{x\in \Pi_a}W_{up}(\langle x,a \rangle)) \] given that \(\Lambda_a\) and \(\Pi_a\) denote the set of children of a node \(a\) and the set of parents of a node \(a\) respectively and that \(P(a)\) is the observed frequency of node \(a\).
upWeights <- computeUPW(g, freqs, no.of.children, A)
## [1] 0.000 0.000 0.125 0.250 0.250 0.250
\[\overline{W}_{up}(\langle a,b \rangle \in E_1) = \begin{cases} 1 & \mbox{if $a[0]=\emptyset$} \\ \frac{ W_{up}(\langle a,b \rangle \in E)}{\sum_{x \in \Pi_b} W_{up}(\langle x,b \rangle)} & \mbox{else} \end{cases}\]
normUpWeights <- normalizeUPW(g, freqs, no.of.children, A, upWeights)
## [1] 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000 1.0000000
\[W_{down}(\langle a,b \rangle) = \overline{W}_{up}(\langle a,b \rangle)(P(b) + \sum_{x\in \Lambda_b}W_{down}(\langle b,x \rangle))\]
downWeights <- computeDWNW(g, freqs, no.of.children, A, normUpWeights)
## [1] 0.3333333 0.6666667 0.2083333 0.4166667 0.1250000 0.3750000
\[ P(\langle a,b \rangle) = \overline{W}_{down}(\langle a,b \rangle) = \frac{W_{down}(\langle a,b \rangle)}{\sum_{x\in \Lambda_a}{W_{down}(\langle a,x \rangle)}} \]
normDownWeights <- normalizeDWNW(g, freqs, no.of.children, A, downWeights)
## [1] 0.3333333 0.6666667 1.0000000 1.0000000 0.2500000 0.7500000
To better show the results of the analysis there were prepared three ouput methods based on three different libraries: ggraph
, networkD3
and
visNetwork
. These libraries improve the dafault igraph
output visualization. Note that output interaction is disabled in this document,
check the Quick Guide instead.
This is the output based on ggraph
, it is ideal for small graphs but, for legibility reasons, it is better not to use it with long labels.
draw_ggraph(quick_run(example_dataset()))
The networkD3
is a quite valid interactive approach, but it lacks the option to draw labels on edges, limiting the representation to thicker or thinner edges.
draw_networkD3(quick_run(example_dataset()))
The visNetwork
approach is overall the best for interactive purposes. It allows almost arbitrary long labels, as it is compatible with HTML
elements and
in particular with textboxes and the “hovering condition” for vertex and edges.
draw_visNetwork(quick_run(example_dataset()))
Finally, it is also possible to export CIMICE’s output to the standard dot format for use in other visualization applications.
cat(to_dot(quick_run(example_dataset())))
## digraph G{
## 1[label="S1"]
## 2[label="S2, S3"]
## 3[label="S4"]
## 4[label="S7"]
## 5[label="S5, S6, S8"]
## 6[label="Clonal"]
## 6 -> 1[label="0.333"]
## 6 -> 2[label="0.667"]
## 1 -> 3[label="1"]
## 2 -> 3[label="1"]
## 3 -> 4[label="0.25"]
## 3 -> 5[label="0.75"]
## }
This vignette was prepared using a R session with the following specifications:
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggraph_2.2.1 ggcorrplot_0.1.4.1 visNetwork_2.1.2 networkD3_0.4
## [5] tidyr_1.3.1 glue_1.7.0 ggplot2_3.5.1 BiocStyle_2.32.0
## [9] purrr_1.0.2 Matrix_1.7-0 igraph_2.0.3 dplyr_1.1.4
## [13] CIMICE_1.12.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.43 bslib_0.7.0
## [4] htmlwidgets_1.6.4 ggrepel_0.9.5 lattice_0.22-6
## [7] vctrs_0.6.5 tools_4.4.0 generics_0.1.3
## [10] tibble_3.2.1 fansi_1.0.6 highr_0.10
## [13] pkgconfig_2.0.3 data.table_1.15.4 RColorBrewer_1.1-3
## [16] assertthat_0.2.1 lifecycle_1.0.4 stringr_1.5.1
## [19] compiler_4.4.0 farver_2.1.1 tinytex_0.50
## [22] munsell_0.5.1 ggforce_0.4.2 graphlayouts_1.1.1
## [25] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.8
## [28] pillar_1.9.0 jquerylib_0.1.4 MASS_7.3-60.2
## [31] cachem_1.0.8 magick_2.8.3 viridis_0.6.5
## [34] tidyselect_1.2.1 digest_0.6.35 stringi_1.8.3
## [37] reshape2_1.4.4 bookdown_0.39 labeling_0.4.3
## [40] splines_4.4.0 maftools_2.20.0 polyclip_1.10-6
## [43] fastmap_1.1.1 grid_4.4.0 colorspace_2.1-0
## [46] expm_0.999-9 cli_3.6.2 magrittr_2.0.3
## [49] tidygraph_1.3.1 survival_3.6-4 utf8_1.2.4
## [52] withr_3.0.0 scales_1.3.0 rmarkdown_2.26
## [55] gridExtra_2.3 memoise_2.0.1 DNAcopy_1.78.0
## [58] evaluate_0.23 knitr_1.46 viridisLite_0.4.2
## [61] rlang_1.1.3 Rcpp_1.0.12 tweenr_2.0.3
## [64] BiocManager_1.30.22 jsonlite_1.8.8 plyr_1.8.9
## [67] R6_2.5.1