DifferentialRegulation 1.0.0
DifferentialRegulation is a method for detecting differentially regulated genes between two groups of samples (e.g., healthy vs. disease, or treated vs. untreated samples), by targeting differences in the balance of spliced and unspliced mRNA abundances, obtained from single-cell RNA-sequencing (scRNA-seq) data.
DifferentialRegulation accounts for the sample-to-sample variability, and embeds multiple samples in a Bayesian hierarchical model.
In particular, when equivalence classes data are provided (via EC_list), reads that are compatible with multiple genes, or with multiple splicing versions of a gene (unspliced, spliced or ambiguous), are allocated to their gene of origin and splicing version.
Parameters are inferred via Markov chain Monte Carlo (MCMC) techniques (Metropolis-within-Gibbs).
To access the R code used in the vignettes, type:
browseVignettes("DifferentialRegulation")
Questions about DifferentialRegulation should be reported as a new issue at BugReports.
To cite DifferentialRegulation, type:
citation("DifferentialRegulation")
##
## To cite package 'DifferentialRegulation' in publications use:
##
## Tiberi S (2022). _DifferentialRegulation: Differentially regulated
## genes from scRNA-seq data_. R package version 1.0.0,
## <https://github.com/SimoneTiberi/DifferentialRegulation>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {DifferentialRegulation: Differentially regulated genes from scRNA-seq data},
## author = {Simone Tiberi},
## year = {2022},
## note = {R package version 1.0.0},
## url = {https://github.com/SimoneTiberi/DifferentialRegulation},
## }
DifferentialRegulation is available on Bioconductor and can be installed with the command:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DifferentialRegulation")
DifferentialRegulation inputs scRNA-seq data, aligned via alevin-fry (He et al. 2022).
NOTE: when using alevin-fry, set options --d (or --dump-eqclasses), to obtain the equivalence classes, and --use-mtx, to store counts in a quants_mat.mtx file (as expected by our load_USA function).
We also recommend using the --CR-like-EM option, which also allows equivalence classes of reads that map to multiple genes.
Load DifferentialRegulation.
library(DifferentialRegulation)
We use a real droplet scRNA-seq dataset from Velasco et al. (2019). In particular, we compare two groups of three samples, consisting of human brain organoids, cultured for 3 and 6 months. For computational reasons, we stored in our package a subset of this dataset, consisting of 100 genes and 3,493 cells, belonging to two cell types. Cell-type assignment was done in the original study (Velasco et al. 2019).
We specify the directory of the data (internal in the package).
data_dir = system.file("extdata", package = "DifferentialRegulation")
Specify the directory of the USA (unspliced, spliced and ambiguous) estimated counts, inferred via alevin-fry.
# specify samples ids:
sample_ids = paste0("organoid", c(1:3, 16:18))
# set directories of each sample input data (obtained via alevin-fry):
base_dir = file.path(data_dir, "alevin-fry", sample_ids)
# Note that alevin-fry needs to be run with `--use-mtx` option to store counts in a `quants_mat.mtx` file.
path_to_counts = file.path(base_dir, "alevin", "quants_mat.mtx")
path_to_cell_id = file.path(base_dir, "alevin", "quants_mat_rows.txt")
path_to_gene_id = file.path(base_dir, "alevin", "quants_mat_cols.txt")
Specify the directory of the ECs and respective counts, inferred via alevin-fry.
path_to_EC_counts = file.path(base_dir, "alevin", "geqc_counts.mtx")
path_to_EC = file.path(base_dir, "alevin", "gene_eqclass.txt.gz")
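Before loading the data, it can help to confirm that every expected file is present for every sample. The sketch below is only an illustration: missing_paths is a hypothetical helper (it is not part of DifferentialRegulation), and the temporary directory, sample names and empty files are placeholders that mimic the sample/alevin/ layout used above.

```r
# Hypothetical helper (not part of DifferentialRegulation): returns the paths
# that do not exist on disk, or character(0) when all of them are present.
missing_paths = function(...) {
  paths = c(...)
  paths[!file.exists(paths)]
}

# Demo on a temporary directory that mimics the <sample>/alevin/ layout;
# the sample names and the empty files below are placeholders.
demo_dir  = file.path(tempdir(), "alevin_fry_demo")
demo_ids  = c("sampleA", "sampleB")
demo_base = file.path(demo_dir, demo_ids, "alevin")
for (d in demo_base) dir.create(d, recursive = TRUE, showWarnings = FALSE)

demo_counts = file.path(demo_base, "quants_mat.mtx")
demo_EC     = file.path(demo_base, "gene_eqclass.txt.gz")
file.create(demo_counts)   # create the count files only, not the EC files

missing_paths(demo_counts)            # nothing missing
missing_paths(demo_counts, demo_EC)   # the two EC files are reported
```

With the objects defined in this vignette, the same helper could be called as missing_paths(path_to_counts, path_to_cell_id, path_to_gene_id, path_to_EC_counts, path_to_EC).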
Load the unspliced, spliced and ambiguous (USA) counts, quantified by alevin-fry, in a SingleCellExperiment.
By default, counts (stored in assays(sce)$counts) are defined as the sum of spliced reads and 50% of ambiguous reads (i.e., reads compatible with both spliced and unspliced versions of a gene): counts = spliced + 0.5 * ambiguous.
sce = load_USA(path_to_counts,
               path_to_cell_id,
               path_to_gene_id,
               sample_ids)
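To make the default definition of counts concrete, the arithmetic can be sketched on a toy example. The three matrices below are invented for illustration (2 genes and 3 cells) and are not produced by load_USA; only the formula counts = spliced + 0.5 * ambiguous comes from the description above.

```r
# Toy USA counts for 2 genes (rows) and 3 cells (columns); all values are invented.
nms = list(c("geneA", "geneB"), paste0("cell", 1:3))
spliced   = matrix(c(10, 0, 4, 2, 6, 8), nrow = 2, dimnames = nms)
unspliced = matrix(c( 3, 1, 0, 5, 2, 2), nrow = 2, dimnames = nms)
ambiguous = matrix(c( 2, 4, 0, 6, 1, 0), nrow = 2, dimnames = nms)

# Default definition described above: spliced reads plus half of the ambiguous
# reads. Unspliced reads do not enter this summary.
counts = spliced + 0.5 * ambiguous
counts
```

For instance, geneA in cell1 has 10 spliced and 2 ambiguous reads, which gives 10 + 0.5 * 2 = 11.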
Cell types should be assigned to each cell; here we load pre-computed cell types.
path_to_DF = file.path(data_dir,"DF_cell_types.txt")
DF_cell_types = read.csv(path_to_DF, sep = "\t", header = TRUE)
matches = match(colnames(sce), DF_cell_types$cell_id)
sce$cell_type = DF_cell_types$cell_type[matches]
table(sce$cell_type)
##
## Cycling      RG
##    2399    1094
Here, we assume that basic quality control and filtering of low quality cells have been performed.
Load the equivalence classes and respective counts (only needed when performing differential testing on ECs).
EC_list = load_EC(path_to_EC_counts,
                  path_to_EC,
                  path_to_cell_id,
                  path_to_gene_id,
                  sample_ids)
## The percentage of multi-gene mapping reads in sample 'organoid1' is: 11.09
## The percentage of multi-gene mapping reads in sample 'organoid2' is: 13.53
## The percentage of multi-gene mapping reads in sample 'organoid3' is: 7.19
## The percentage of multi-gene mapping reads in sample 'organoid16' is: 8.59
## The percentage of multi-gene mapping reads in sample 'organoid17' is: 3.75
## The percentage of multi-gene mapping reads in sample 'organoid18' is: 6.76
For every sample, load_EC prints the percentage of reads compatible with multiple genes (i.e., multi-gene mapping reads).
Here the percentage of multi-gene reads is relatively low, because we are considering a subset of 100 genes only; however, in the full dataset we found that approximately 40% of reads map to multiple genes.
Intuitively, the larger these numbers, the greater the benefits one may achieve by using ECs and modelling the variability of these uncertain gene allocations.
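As a rough illustration of what such a percentage measures, consider a toy set of equivalence classes. The genes and read counts below are invented, and the calculation is a sketch of the concept, not the code used internally by load_EC.

```r
# Toy equivalence classes: each EC lists the genes its reads are compatible with.
ec_genes  = list("geneA", "geneB", c("geneA", "geneB"), c("geneB", "geneC"))
ec_counts = c(50, 30, 15, 5)   # invented number of reads observed in each EC

# An EC is "multi-gene" when it is compatible with more than one gene.
is_multi_gene = lengths(ec_genes) > 1

# Percentage of reads that fall in multi-gene ECs.
pct_multi = 100 * sum(ec_counts[is_multi_gene]) / sum(ec_counts)
pct_multi
```

In this toy case, 20 of the 100 reads (20%) cannot be assigned to a single gene from the mapping alone; these are the reads whose gene allocation is treated as uncertain when ECs are used.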
Quality control (QC) and filtering of low quality cells can be performed as usual on the sce object.
The sce object computed via load_USA contains a counts assay, defined as the sum of spliced counts and 50% of ambiguous counts.
Importantly, cells only need to be filtered in the sce object: even when using ECs, cells that have been filtered out of the sce object will not be used for differential testing by the DifferentialRegulation function.
Differential testing can be performed on USA estimated counts (faster) or on ECs (slower, but more accurate). Using EC counts makes it possible to explicitly model the uncertainty from reads that map to multiple genes.
First, we define the design of the study: in our case we have 2 groups, which we call “3 mon” and “6 mon”, of 3 samples each.
design = data.frame(sample = sample_ids,
                    group = c( rep("3 mon", 3), rep("6 mon", 3) ))
design
## sample group
## 1 organoid1 3 mon
## 2 organoid2 3 mon
## 3 organoid3 3 mon
## 4 organoid16 6 mon
## 5 organoid17 6 mon
## 6 organoid18 6 mon
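Since the design links each sample to a group, a few quick sanity checks can catch mismatches early. The checks below are a suggestion of ours and are not required by the package; the objects are re-created so that the snippet runs on its own.

```r
# Re-create the objects defined above, so that this snippet is self-contained.
sample_ids = paste0("organoid", c(1:3, 16:18))
design = data.frame(sample = sample_ids,
                    group = c(rep("3 mon", 3), rep("6 mon", 3)))

# Every sample id should appear exactly once in the design.
stopifnot(!anyDuplicated(design$sample),
          setequal(design$sample, sample_ids))

# The comparison is between two groups; tabulate the samples in each of them.
samples_per_group = table(design$group)
stopifnot(length(samples_per_group) == 2)
samples_per_group
```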
To perform differential testing on USA estimated counts, set EC_list to NULL (or leave it unspecified).
# sce-based test:
set.seed(169612)
results_USA = DifferentialRegulation(sce = sce,
                                     EC_list = NULL,
                                     design = design,
                                     sample_col_name = "sample",
                                     group_col_name = "group",
                                     sce_cluster_name = "cell_type")
## the following cell types have more than 100 cells and will be analyzed:
## Cycling RG
## 'n_cores' was left 'NULL'.
## Since tasks are paralellized on cell clusters, we will set 'n_cores' to the number of clusters that will be analyzed.
## 'n_cores' set to: 2
## 'EC_list' was not provided: estimated counts in 'sce' will be used to perform differential testing (faster, but marginally less accurate).
## If you want to use equivalence classes counts (recommended option: slower, but marginally more accurate), provide an 'EC_list' object, computed via 'load_EC' function.
We can sort results by significance, if we want, before visualizing them.
# sort results by significance:
results_USA = results_USA[ order(results_USA$p_val), ]
# visualize head of results:
head(results_USA)
## Gene_id Cluster_id p_val p_adj.loc p_adj.glb
## 21 ENSG00000067606.17 Cycling 3.775885e-06 9.439713e-05 0.0001661389
## 4 ENSG00000228794.10 Cycling 1.182506e-03 1.264521e-02 0.0222555746
## 16 ENSG00000160072.20 Cycling 1.517426e-03 1.264521e-02 0.0222555746
## 7 ENSG00000187608.10 Cycling 3.498055e-03 2.186284e-02 0.0384785997
## 44 ENSG00000235169.11 RG 8.201387e-03 1.558263e-01 0.0721722012
## 30 ENSG00000187608.10 RG 2.613639e-02 2.482957e-01 0.1916668587
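The two adjusted columns printed above are numerically consistent with a Benjamini-Hochberg (BH) correction applied at two scopes: within each cell type (p_adj.loc) and across all tests (p_adj.glb). This reading is our inference from the printed numbers and should be checked against the package documentation. The sketch below re-derives the adjusted values from the six raw p-values shown above, assuming 25 tests in the Cycling cluster and 19 in the RG cluster; both numbers are deduced from the ratio p_adj.loc / p_val of the top row of each cluster.

```r
# Raw p-values of the six rows printed above (already sorted by p_val).
head_USA = data.frame(
  Cluster_id = c("Cycling", "Cycling", "Cycling", "Cycling", "RG", "RG"),
  p_val = c(3.775885e-06, 1.182506e-03, 1.517426e-03,
            3.498055e-03, 8.201387e-03, 2.613639e-02))

# Assumption: number of tests per cluster, inferred from the printed output.
n_tests = c(Cycling = 25, RG = 19)

# Passing only the smallest p-values together with 'n' reproduces BH here; in
# general, the omitted (larger) p-values could still lower the adjusted values.
adj_glb = p.adjust(head_USA$p_val, method = "BH", n = sum(n_tests))

adj_loc = numeric(nrow(head_USA))
for (cl in names(n_tests)) {
  sel = head_USA$Cluster_id == cl
  adj_loc[sel] = p.adjust(head_USA$p_val[sel], method = "BH", n = n_tests[[cl]])
}

cbind(head_USA, adj_loc, adj_glb)
```

Under these assumptions, adj_loc and adj_glb agree with the printed p_adj.loc and p_adj.glb columns up to rounding. In practice, a threshold on p_adj.glb controls the false discovery rate over all cell types jointly, while p_adj.loc does so separately within each cell type.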
To perform differential testing on EC counts, we set EC_list to the object computed above via load_EC.
# EC-based test:
set.seed(169612)
results_EC = DifferentialRegulation(sce = sce,
                                    EC_list = EC_list,
                                    design = design,
                                    sample_col_name = "sample",
                                    group_col_name = "group",
                                    sce_cluster_name = "cell_type")
## the following cell types have more than 100 cells and will be analyzed:
## Cycling RG
## 'n_cores' was left 'NULL'.
## Since tasks are paralellized on cell clusters, we will set 'n_cores' to the number of clusters that will be analyzed.
## 'n_cores' set to: 2
As above, we can sort results by significance before visualizing them.
# sort results by significance:
results_EC = results_EC[ order(results_EC$p_val), ]
# visualize head of results:
head(results_EC)
## Gene_id Cluster_id p_val p_adj.loc p_adj.glb
## 16 ENSG00000224870.7 Cycling 0.000000e+00 0.000000e+00 0.000000e+00
## 31 ENSG00000179403.12 RG 0.000000e+00 0.000000e+00 0.000000e+00
## 20 ENSG00000237973.1 Cycling 5.662137e-15 6.794565e-14 7.549517e-14
## 39 ENSG00000237973.1 RG 2.914624e-08 2.331699e-07 2.914624e-07
## 6 ENSG00000160072.20 Cycling 1.560144e-06 1.248116e-05 1.248116e-05
## 8 ENSG00000179403.12 Cycling 3.209818e-06 1.925891e-05 2.139879e-05
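To compare the USA-based and EC-based tests, the two result tables can be joined on gene and cell type. The sketch below uses only the rows printed in the two head() outputs above (re-typed by hand so that the snippet runs on its own); with the full objects, results_USA and results_EC would take the place of usa_head and ec_head.

```r
# Rows from head(results_USA), with the globally adjusted p-values.
usa_head = data.frame(
  Gene_id = c("ENSG00000067606.17", "ENSG00000228794.10", "ENSG00000160072.20",
              "ENSG00000187608.10", "ENSG00000235169.11", "ENSG00000187608.10"),
  Cluster_id = c("Cycling", "Cycling", "Cycling", "Cycling", "RG", "RG"),
  p_adj.glb = c(0.0001661389, 0.0222555746, 0.0222555746,
                0.0384785997, 0.0721722012, 0.1916668587))

# Rows from head(results_EC), with the globally adjusted p-values.
ec_head = data.frame(
  Gene_id = c("ENSG00000224870.7", "ENSG00000179403.12", "ENSG00000237973.1",
              "ENSG00000237973.1", "ENSG00000160072.20", "ENSG00000179403.12"),
  Cluster_id = c("Cycling", "RG", "Cycling", "RG", "Cycling", "Cycling"),
  p_adj.glb = c(0, 0, 7.549517e-14, 2.914624e-07, 1.248116e-05, 2.139879e-05))

# Inner join on gene and cell type; suffixes label the origin of each column.
both = merge(usa_head, ec_head,
             by = c("Gene_id", "Cluster_id"),
             suffixes = c(".USA", ".EC"))
both
```

Only one gene-cluster pair (ENSG00000160072.20 in the Cycling cluster) appears in both printed heads, so the joined table has a single row here; joining the full result tables would cover every tested gene and cell type.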
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DifferentialRegulation_1.0.0 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.56.0 MatrixGenerics_1.8.0
## [3] sass_0.4.1 edgeR_3.38.0
## [5] jsonlite_1.8.0 foreach_1.5.2
## [7] R.utils_2.11.0 bslib_0.3.1
## [9] assertthat_0.2.1 BiocManager_1.30.17
## [11] stats4_4.2.0 doRNG_1.8.2
## [13] GenomeInfoDbData_1.2.8 yaml_2.3.5
## [15] pillar_1.7.0 lattice_0.20-45
## [17] glue_1.6.2 limma_3.52.0
## [19] digest_0.6.29 GenomicRanges_1.48.0
## [21] XVector_0.36.0 colorspace_2.0-3
## [23] R.oo_1.24.0 htmltools_0.5.2
## [25] Matrix_1.4-1 plyr_1.8.7
## [27] pkgconfig_2.0.3 bookdown_0.26
## [29] zlibbioc_1.42.0 purrr_0.3.4
## [31] scales_1.2.0 BiocParallel_1.30.0
## [33] tibble_3.1.6 generics_0.1.2
## [35] IRanges_2.30.0 ggplot2_3.3.5
## [37] ellipsis_0.3.2 SummarizedExperiment_1.26.0
## [39] BiocGenerics_0.42.0 cli_3.3.0
## [41] magrittr_2.0.3 crayon_1.5.1
## [43] evaluate_0.15 R.methodsS3_1.8.1
## [45] fansi_1.0.3 doParallel_1.0.17
## [47] MASS_7.3-57 BANDITS_1.12.0
## [49] tools_4.2.0 data.table_1.14.2
## [51] lifecycle_1.0.1 matrixStats_0.62.0
## [53] stringr_1.4.0 S4Vectors_0.34.0
## [55] munsell_0.5.0 locfit_1.5-9.5
## [57] DelayedArray_0.22.0 rngtools_1.5.2
## [59] compiler_4.2.0 jquerylib_0.1.4
## [61] GenomeInfoDb_1.32.0 rlang_1.0.2
## [63] grid_4.2.0 RCurl_1.98-1.6
## [65] iterators_1.0.14 DRIMSeq_1.24.0
## [67] SingleCellExperiment_1.18.0 bitops_1.0-7
## [69] rmarkdown_2.14 gtable_0.3.0
## [71] codetools_0.2-18 DBI_1.1.2
## [73] reshape2_1.4.4 R6_2.5.1
## [75] knitr_1.39 dplyr_1.0.8
## [77] fastmap_1.1.0 utf8_1.2.2
## [79] stringi_1.7.6 parallel_4.2.0
## [81] Rcpp_1.0.8.3 vctrs_0.4.1
## [83] tidyselect_1.1.2 xfun_0.30
He, Dongze, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. 2022. “Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data.” Nature Methods 19 (3): 316–22.
Velasco, Silvia, Amanda J Kedaigle, Sean K Simmons, Allison Nash, Marina Rocha, Giorgia Quadrato, Bruna Paulsen, et al. 2019. “Individual Brain Organoids Reproducibly Form Cell Diversity of the Human Cerebral Cortex.” Nature 570 (7762): 523–27.