Dropout events make lowly expressed genes indistinguishable from genes with true zero expression, and make their measured values differ from the low expression seen in other cells of the same type. This complicates any downstream analysis. ccImpute (Malec, Kurban, and Dalkilic 2022) is an imputation tool that uses cell similarity established by consensus clustering to impute the most probable dropout events in scRNA-seq datasets. ccImpute outperforms existing imputation approaches while introducing the least new noise, as measured by clustering performance on datasets with known cell identities.
To install this package, start R (version "4.2") and enter:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ccImpute")
ccImpute is an imputation tool that does not provide functions for pre-processing the data; the user is expected to pre-process the data before using it. The input data must be log-normalized. This manual includes a minimal example of pre-processing a dataset from the scRNAseq database using the scater tool.
library(scRNAseq)
library(scater)
library(ccImpute)
library(SingleCellExperiment)
library(stats)
library(mclust)
The following code loads the Darmanis dataset (Darmanis et al. 2015) and computes log-transformed normalized counts:
sce <- DarmanisBrainData()
sce <- logNormCounts(sce)
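For readers unfamiliar with the log-normalized format, the following base R sketch shows roughly what such a transformation amounts to: counts are divided by per-cell size factors derived from library sizes, then log2-transformed with a pseudocount of 1. The helper name log_norm_sketch and the toy matrix are made up for illustration, and the sketch reflects an assumption about default settings; it is not a replacement for logNormCounts.

```r
# Hypothetical helper: library-size normalization followed by log2(x + 1).
# Rows are genes, columns are cells.
log_norm_sketch <- function(counts) {
  lib_sizes <- colSums(counts)
  # Scale the size factors so that they average to 1 across cells.
  size_factors <- lib_sizes / mean(lib_sizes)
  # Divide each column (cell) by its size factor.
  normalized <- sweep(counts, 2, size_factors, "/")
  log2(normalized + 1)
}

# Toy data: cell2 has exactly twice the sequencing depth of cell1.
toy_counts <- matrix(c(1, 2, 3,
                       2, 4, 6),
                     nrow = 3,
                     dimnames = list(paste0("gene", 1:3), c("cell1", "cell2")))
toy_logcounts <- log_norm_sketch(toy_counts)
toy_logcounts
```

Because the two toy cells differ only in depth, their normalized profiles come out identical, which is the effect the normalization is meant to achieve.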
A user may consider performing feature selection prior to running the imputation. ccImpute only imputes the most probable dropout events, so it is unlikely to benefit from the presence of scarcely expressed genes or to make any corrections to their expression.
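One simple form of feature selection is to drop genes detected in very few cells. The sketch below keeps genes with nonzero expression in at least a chosen fraction of cells. The helper name filter_scarce_genes, the toy matrix, and the thresholds are illustrative assumptions, not recommendations from the ccImpute authors.

```r
# Hypothetical helper: keep genes detected (expression > 0) in at least
# `min_fraction` of the cells. Rows are genes, columns are cells.
filter_scarce_genes <- function(logcounts_mat, min_fraction = 0.05) {
  detected_fraction <- rowMeans(logcounts_mat > 0)
  logcounts_mat[detected_fraction >= min_fraction, , drop = FALSE]
}

# Toy data: geneA is detected in 1 of 4 cells, geneB in 3, geneC in none.
toy <- matrix(c(0,   0, 0,   1.2,
                2.1, 0, 1.7, 3.0,
                0,   0, 0,   0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              paste0("cell", 1:4)))
filtered <- filter_scarce_genes(toy, min_fraction = 0.5)
rownames(filtered)
#> [1] "geneB"
```

The same logical row index could, in principle, be used to subset the sce object before imputation, for example sce[rowMeans(logcounts(sce) > 0) >= 0.05, ]; the right threshold depends on the dataset.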
The Adjusted Rand Index (ARI) is a measure of the similarity between two clusterings of the same data, adjusted for the chance grouping of elements. This measure lets us evaluate a clustering by how closely it matches the reference assignments derived from cell labels.
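To make the measure concrete, here is a small base R sketch that computes the index from the contingency table of two label vectors, following the standard pair-counting definition. The function name ari_sketch is hypothetical and exists only for illustration; the analysis in this manual uses adjustedRandIndex from mclust.

```r
# Hypothetical base R version of the Adjusted Rand Index, written from
# the standard pair-counting definition.
ari_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)
  # Pairs of elements grouped together in both clusterings.
  same_both <- sum(choose(tab, 2))
  # Pairs grouped together in x, and in y, respectively.
  same_x <- sum(choose(rowSums(tab), 2))
  same_y <- sum(choose(colSums(tab), 2))
  total_pairs <- choose(length(x), 2)
  # Agreement expected by chance, and the maximum attainable agreement.
  expected <- same_x * same_y / total_pairs
  max_index <- (same_x + same_y) / 2
  # Degenerate case (one cluster in both, or all singletons in both):
  # the partitions are identical, so return 1.
  if (max_index == expected) return(1)
  (same_both - expected) / (max_index - expected)
}

# Identical partitions score 1, regardless of how the labels are named.
ari_sketch(c(1, 1, 2, 2), c("b", "b", "a", "a"))
#> [1] 1

# A partition that cuts across the other scores below zero.
ari_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
#> [1] -0.5
```

A value near 0 indicates agreement no better than chance, which is why the index is preferred over raw pair agreement when comparing against cell labels.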
# Set seed for reproducibility purposes.
set.seed(0)
# Compute PCA reduction of the dataset
reducedDims(sce) <- list(PCA=prcomp(t(logcounts(sce)))$x)
# Get the actual number of cell types
k <- length(unique(colData(sce)$cell.type))
# Cluster the PCA reduced dataset and store the assignments
set.seed(0)
assgmts <- kmeans(reducedDim(sce, "PCA"), centers = k, iter.max = 1e+09,
                  nstart = 1000)$cluster
# Use ARI to compare the k-means assignments to label assignments
adjustedRandIndex(assgmts, colData(sce)$cell.type)
#> [1] 0.5206793
assay(sce, "imputed") <- ccImpute(logcounts(sce), k = k)
#> Running ccImpute on dataset with 466 cells.
#> Imputation finished.
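One quick way to see what an imputation step changed is to compare the zero entries before and after. The helper below is a hypothetical base R sketch (summarize_imputation is not part of ccImpute) shown on made-up toy matrices; it makes no assumption about how ccImpute decides which zeros to fill.

```r
# Hypothetical helper: count zeros before imputation, how many of them
# were filled, and how many originally nonzero entries were altered.
summarize_imputation <- function(before, after) {
  stopifnot(identical(dim(before), dim(after)))
  was_zero <- before == 0
  c(zeros_before     = sum(was_zero),
    zeros_filled     = sum(was_zero & after != 0),
    nonzeros_changed = sum(!was_zero & before != after))
}

# Toy matrices (2 genes x 3 cells), filled column by column.
toy_before <- matrix(c(0,   1.5, 0, 2.0, 0, 0),   nrow = 2)
toy_after  <- matrix(c(0.8, 1.5, 0, 2.0, 0, 1.1), nrow = 2)
summarize_imputation(toy_before, toy_after)
```

On the real data, the same helper could be applied to as.matrix(logcounts(sce)) and as.matrix(assay(sce, "imputed")), assuming both fit in memory as dense matrices.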
# Recompute PCA reduction of the dataset
reducedDim(sce, "PCA_imputed") <- prcomp(t(assay(sce, "imputed")))$x
# Cluster the PCA reduced dataset and store the assignments
assgmts <- kmeans(reducedDim(sce, "PCA_imputed"), centers = k, iter.max = 1e+09,
                  nstart = 1000)$cluster
# Use ARI to compare the k-means assignments to label assignments
adjustedRandIndex(assgmts, colData(sce)$cell.type)
#> [1] 0.704984
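The two evaluation passes above repeat the same steps: PCA, k-means, then comparison against the labels. A small helper could factor out the first two. The sketch below is one hypothetical way to do that in base R; the name cluster_cells, its defaults, and the synthetic data are assumptions made for illustration, and the defaults are deliberately lighter than the iter.max and nstart values used above.

```r
# Hypothetical helper: PCA on cells, then k-means on the PCA scores.
# `expr` is a genes x cells matrix; returns one cluster id per cell.
cluster_cells <- function(expr, k, n_start = 10, iter_max = 100) {
  pcs <- prcomp(t(expr))$x
  kmeans(pcs, centers = k, iter.max = iter_max, nstart = n_start)$cluster
}

# Synthetic data: 5 genes x 20 cells, two well-separated groups of 10.
set.seed(1)
toy <- cbind(matrix(rnorm(50, mean = 0,  sd = 0.1), nrow = 5),
             matrix(rnorm(50, mean = 10, sd = 0.1), nrow = 5))
truth <- rep(c("a", "b"), each = 10)

toy_clusters <- cluster_cells(toy, k = 2)
table(toy_clusters, truth)

# With the objects from this manual, the result could be scored as before:
# adjustedRandIndex(cluster_cells(assay(sce, "imputed"), k),
#                   colData(sce)$cell.type)
```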
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.2 (2022-10-31)
#> os Ubuntu 20.04.5 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-01-19
#> pandoc 2.5 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> AnnotationDbi 1.60.0 2023-01-19 [2] Bioconductor
#> AnnotationFilter 1.22.0 2023-01-19 [2] Bioconductor
#> AnnotationHub 3.6.0 2023-01-19 [2] Bioconductor
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.2.2)
#> beachmat 2.14.0 2023-01-19 [2] Bioconductor
#> beeswarm 0.4.0 2021-06-01 [2] CRAN (R 4.2.2)
#> Biobase * 2.58.0 2023-01-19 [2] Bioconductor
#> BiocFileCache 2.6.0 2023-01-19 [2] Bioconductor
#> BiocGenerics * 0.44.0 2023-01-19 [2] Bioconductor
#> BiocIO 1.8.0 2023-01-19 [2] Bioconductor
#> BiocManager 1.30.19 2022-10-25 [2] CRAN (R 4.2.2)
#> BiocNeighbors 1.16.0 2023-01-19 [2] Bioconductor
#> BiocParallel 1.32.5 2023-01-19 [2] Bioconductor
#> BiocSingular 1.14.0 2023-01-19 [2] Bioconductor
#> BiocStyle * 2.26.0 2023-01-19 [2] Bioconductor
#> BiocVersion 3.16.0 2023-01-19 [2] Bioconductor
#> biomaRt 2.54.0 2023-01-19 [2] Bioconductor
#> Biostrings 2.66.0 2023-01-19 [2] Bioconductor
#> bit 4.0.5 2022-11-15 [2] CRAN (R 4.2.2)
#> bit64 4.0.5 2020-08-30 [2] CRAN (R 4.2.2)
#> bitops 1.0-7 2021-04-24 [2] CRAN (R 4.2.2)
#> blob 1.2.3 2022-04-10 [2] CRAN (R 4.2.2)
#> bookdown 0.32 2023-01-17 [2] CRAN (R 4.2.2)
#> bslib 0.4.2 2022-12-16 [2] CRAN (R 4.2.2)
#> cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.2)
#> ccImpute * 1.0.2 2023-01-19 [1] Bioconductor
#> cli 3.6.0 2023-01-09 [2] CRAN (R 4.2.2)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.2)
#> colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.2.2)
#> crayon 1.5.2 2022-09-29 [2] CRAN (R 4.2.2)
#> curl 5.0.0 2023-01-12 [2] CRAN (R 4.2.2)
#> DBI 1.1.3 2022-06-18 [2] CRAN (R 4.2.2)
#> dbplyr 2.3.0 2023-01-16 [2] CRAN (R 4.2.2)
#> DelayedArray 0.24.0 2023-01-19 [2] Bioconductor
#> DelayedMatrixStats 1.20.0 2023-01-19 [2] Bioconductor
#> digest 0.6.31 2022-12-11 [2] CRAN (R 4.2.2)
#> dplyr 1.0.10 2022-09-01 [2] CRAN (R 4.2.2)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.2)
#> ensembldb 2.22.0 2023-01-19 [2] Bioconductor
#> evaluate 0.20 2023-01-17 [2] CRAN (R 4.2.2)
#> ExperimentHub 2.6.0 2023-01-19 [2] Bioconductor
#> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.2.2)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.2)
#> filelock 1.0.2 2018-10-05 [2] CRAN (R 4.2.2)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.2.2)
#> GenomeInfoDb * 1.34.7 2023-01-19 [2] Bioconductor
#> GenomeInfoDbData 1.2.9 2022-11-08 [2] Bioconductor
#> GenomicAlignments 1.34.0 2023-01-19 [2] Bioconductor
#> GenomicFeatures 1.50.3 2023-01-19 [2] Bioconductor
#> GenomicRanges * 1.50.2 2023-01-19 [2] Bioconductor
#> ggbeeswarm 0.7.1 2022-12-16 [2] CRAN (R 4.2.2)
#> ggplot2 * 3.4.0 2022-11-04 [2] CRAN (R 4.2.2)
#> ggrepel 0.9.2 2022-11-06 [2] CRAN (R 4.2.2)
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.2)
#> gridExtra 2.3 2017-09-09 [2] CRAN (R 4.2.2)
#> gtable 0.3.1 2022-09-01 [2] CRAN (R 4.2.2)
#> hms 1.1.2 2022-08-19 [2] CRAN (R 4.2.2)
#> htmltools 0.5.4 2022-12-07 [2] CRAN (R 4.2.2)
#> httpuv 1.6.8 2023-01-12 [2] CRAN (R 4.2.2)
#> httr 1.4.4 2022-08-17 [2] CRAN (R 4.2.2)
#> interactiveDisplayBase 1.36.0 2023-01-19 [2] Bioconductor
#> IRanges * 2.32.0 2023-01-19 [2] Bioconductor
#> irlba 2.3.5.1 2022-10-03 [2] CRAN (R 4.2.2)
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.2)
#> jsonlite 1.8.4 2022-12-06 [2] CRAN (R 4.2.2)
#> KEGGREST 1.38.0 2023-01-19 [2] Bioconductor
#> knitr 1.41 2022-11-18 [2] CRAN (R 4.2.2)
#> later 1.3.0 2021-08-18 [2] CRAN (R 4.2.2)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.2)
#> lazyeval 0.2.2 2019-03-15 [2] CRAN (R 4.2.2)
#> lifecycle 1.0.3 2022-10-07 [2] CRAN (R 4.2.2)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.2)
#> Matrix 1.5-3 2022-11-11 [2] CRAN (R 4.2.2)
#> MatrixGenerics * 1.10.0 2023-01-19 [2] Bioconductor
#> matrixStats * 0.63.0 2022-11-18 [2] CRAN (R 4.2.2)
#> mclust * 6.0.0 2022-10-31 [2] CRAN (R 4.2.2)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.2)
#> mime 0.12 2021-09-28 [2] CRAN (R 4.2.2)
#> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.2.2)
#> pillar 1.8.1 2022-08-19 [2] CRAN (R 4.2.2)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.2)
#> png 0.1-8 2022-11-29 [2] CRAN (R 4.2.2)
#> pracma 2.4.2 2022-09-22 [2] CRAN (R 4.2.2)
#> prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.2.2)
#> progress 1.2.2 2019-05-16 [2] CRAN (R 4.2.2)
#> promises 1.2.0.1 2021-02-11 [2] CRAN (R 4.2.2)
#> ProtGenerics 1.30.0 2023-01-19 [2] Bioconductor
#> purrr 1.0.1 2023-01-10 [2] CRAN (R 4.2.2)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.2)
#> rappdirs 0.3.3 2021-01-31 [2] CRAN (R 4.2.2)
#> Rcpp 1.0.9 2022-07-08 [2] CRAN (R 4.2.2)
#> RcppAnnoy 0.0.20 2022-10-27 [2] CRAN (R 4.2.2)
#> RCurl 1.98-1.9 2022-10-03 [2] CRAN (R 4.2.2)
#> restfulr 0.0.15 2022-06-16 [2] CRAN (R 4.2.2)
#> rjson 0.2.21 2022-01-09 [2] CRAN (R 4.2.2)
#> rlang 1.0.6 2022-09-24 [2] CRAN (R 4.2.2)
#> rmarkdown 2.19 2022-12-15 [2] CRAN (R 4.2.2)
#> Rsamtools 2.14.0 2023-01-19 [2] Bioconductor
#> RSpectra 0.16-1 2022-04-24 [2] CRAN (R 4.2.2)
#> RSQLite 2.2.20 2022-12-22 [2] CRAN (R 4.2.2)
#> rsvd 1.0.5 2021-04-16 [2] CRAN (R 4.2.2)
#> rtracklayer 1.58.0 2023-01-19 [2] Bioconductor
#> S4Vectors * 0.36.1 2023-01-19 [2] Bioconductor
#> sass 0.4.4 2022-11-24 [2] CRAN (R 4.2.2)
#> ScaledMatrix 1.6.0 2023-01-19 [2] Bioconductor
#> scales 1.2.1 2022-08-20 [2] CRAN (R 4.2.2)
#> scater * 1.26.1 2023-01-19 [2] Bioconductor
#> scRNAseq * 2.12.0 2023-01-19 [2] Bioconductor
#> scuttle * 1.8.4 2023-01-19 [2] Bioconductor
#> sessioninfo * 1.2.2 2021-12-06 [2] CRAN (R 4.2.2)
#> shiny 1.7.4 2022-12-15 [2] CRAN (R 4.2.2)
#> SIMLR 1.24.2 2023-01-19 [2] Bioconductor
#> SingleCellExperiment * 1.20.0 2023-01-19 [2] Bioconductor
#> sparseMatrixStats 1.10.0 2023-01-19 [2] Bioconductor
#> stringi 1.7.12 2023-01-11 [2] CRAN (R 4.2.2)
#> stringr 1.5.0 2022-12-02 [2] CRAN (R 4.2.2)
#> SummarizedExperiment * 1.28.0 2023-01-19 [2] Bioconductor
#> tibble 3.1.8 2022-07-22 [2] CRAN (R 4.2.2)
#> tidyselect 1.2.0 2022-10-10 [2] CRAN (R 4.2.2)
#> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.2.2)
#> vctrs 0.5.1 2022-11-16 [2] CRAN (R 4.2.2)
#> vipor 0.4.5 2017-03-22 [2] CRAN (R 4.2.2)
#> viridis 0.6.2 2021-10-13 [2] CRAN (R 4.2.2)
#> viridisLite 0.4.1 2022-08-22 [2] CRAN (R 4.2.2)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.2)
#> xfun 0.36 2022-12-21 [2] CRAN (R 4.2.2)
#> XML 3.99-0.13 2022-12-04 [2] CRAN (R 4.2.2)
#> xml2 1.3.3 2021-11-30 [2] CRAN (R 4.2.2)
#> xtable 1.8-4 2019-04-21 [2] CRAN (R 4.2.2)
#> XVector 0.38.0 2023-01-19 [2] Bioconductor
#> yaml 2.3.6 2022-10-18 [2] CRAN (R 4.2.2)
#> zlibbioc 1.44.0 2023-01-19 [2] Bioconductor
#>
#> [1] /tmp/RtmppQvcCP/Rinst1608d74e6b2376
#> [2] /home/biocbuild/bbs-3.16-bioc/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Darmanis, Spyros, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. 2015. “A Survey of Human Brain Transcriptome Diversity at the Single Cell Level.” Proceedings of the National Academy of Sciences 112 (23): 7285–90.
Malec, Marcin, Hasan Kurban, and Mehmet Dalkilic. 2022. “ccImpute: An Accurate and Scalable Consensus Clustering Based Algorithm to Impute Dropout Events in the Single-Cell RNA-Seq Data.” BMC Bioinformatics 23 (1): 1–17.