1 Introduction

Dropout events make lowly expressed genes indistinguishable from genes with true zero expression, and make their measured values differ from the low expression observed in other cells of the same type. This complicates any downstream analysis. ccImpute (Malec, Kurban, and Dalkilic 2022) is an imputation tool that uses cell similarity established by consensus clustering to impute the most probable dropout events in scRNA-seq datasets. ccImpute demonstrates performance that exceeds that of existing imputation approaches while introducing the least amount of new noise, as measured by clustering performance characteristics on datasets with known cell identities.

1.1 Installation.

To install this package, start R (version "4.2") and enter:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ccImpute")

2 Data Pre-processing

ccImpute does not provide functions for pre-processing the data. The user is expected to pre-process the data before running the imputation, and the input must be in a log-normalized format. This manual includes a minimal pre-processing example that uses the scater package on a dataset from the scRNAseq package.
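To make "log-normalized" concrete, the following base R sketch normalizes a toy count matrix by library size and applies a log2 transform with a pseudo-count of 1. The matrix values and the variable names (counts, size_factors, log_norm) are made up for illustration. This is only an approximation of what a library-size log-normalization does; the exact defaults of logNormCounts() should be checked in its documentation.

```r
# Toy raw count matrix: rows are genes, columns are cells (made-up values).
counts <- matrix(
    c(0, 2, 4,
      1, 0, 3,
      10, 0, 6,
      5, 5, 0),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4))
)

# Library-size factors, scaled so that they average to 1 across cells.
lib_sizes <- colSums(counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each cell (column) by its size factor, then log2-transform
# with a pseudo-count of 1 so that zero counts remain zero.
norm_counts <- sweep(counts, 2, size_factors, "/")
log_norm <- log2(norm_counts + 1)
```

Because the pseudo-count is 1, zero counts stay at zero after the transform, so candidate dropout events remain identifiable in the log-normalized matrix.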

3 Sample Usage

3.1 Required libraries

library(scRNAseq)
library(scater)
library(ccImpute)
library(SingleCellExperiment)
library(stats)
library(mclust)

3.2 Input Data.

The following code loads the Darmanis dataset (Darmanis et al. 2015) and computes log-transformed normalized counts:

sce <- DarmanisBrainData()
sce <- logNormCounts(sce)

3.3 Pre-processing data.

A user may consider performing feature selection prior to running the imputation. ccImpute only imputes the most probable dropout events, so it is unlikely either to benefit from the presence of scarcely expressed genes or to make any corrections to their expression.
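One simple form of feature selection is to drop genes that are detected in very few cells. The sketch below does this in base R on a toy log-normalized matrix. The 20% cutoff and the names logexpr, min_fraction, and keep are hypothetical choices for illustration, not recommendations from ccImpute.

```r
# Toy log-normalized matrix: rows are genes, columns are cells (made-up values).
logexpr <- matrix(
    c(0.0, 1.2, 0.0, 2.1, 0.0, 1.7,   # geneA: detected in 3 of 6 cells
      0.0, 0.0, 0.0, 0.0, 0.9, 0.0,   # geneB: detected in 1 of 6 cells
      2.3, 1.9, 0.0, 2.8, 2.2, 3.0),  # geneC: detected in 5 of 6 cells
    nrow = 3, byrow = TRUE,
    dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6))
)

# Hypothetical cutoff: keep genes detected in at least 20% of cells.
min_fraction <- 0.2
detected_fraction <- rowMeans(logexpr > 0)
keep <- detected_fraction >= min_fraction

# Subset the rows; drop = FALSE keeps the result a matrix.
filtered <- logexpr[keep, , drop = FALSE]
rownames(filtered)
```

The same logical vector could be used to subset a SingleCellExperiment by rows, for example sce[keep, ] with keep computed from logcounts(sce), assuming the assay is an ordinary in-memory matrix.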

3.4 Adjusted Rand Index (ARI)

Adjusted Rand Index is a measure of the similarity between two data clusterings adjusted for the chance grouping of elements. This measure allows us to evaluate the performance of the clustering algorithm as a similarity to the optimal clustering assignments derived from cell labels.
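For readers who want to see how the index is computed, here is a minimal base R sketch of the standard Hubert-Arabie formula built from the contingency table of two label vectors. The function name ari is hypothetical; the examples in this manual use adjustedRandIndex() from mclust instead. The sketch does not handle the degenerate case where the denominator is zero (for example, when both partitions place all elements in one cluster).

```r
# Illustrative re-implementation of the Adjusted Rand Index.
# x and y are label vectors of equal length.
ari <- function(x, y) {
    stopifnot(length(x) == length(y))
    tab <- table(x, y)                       # contingency table of the two partitions
    n <- sum(tab)
    sum_ij <- sum(choose(tab, 2))            # pairs grouped together in both partitions
    sum_a <- sum(choose(rowSums(tab), 2))    # pairs grouped together in x
    sum_b <- sum(choose(colSums(tab), 2))    # pairs grouped together in y
    expected <- sum_a * sum_b / choose(n, 2) # agreement expected by chance
    max_index <- (sum_a + sum_b) / 2
    (sum_ij - expected) / (max_index - expected)
}

# Identical partitions with swapped label names.
same <- ari(c(1, 1, 2, 2), c("b", "b", "a", "a"))

# Partitions that split every pair the other one groups.
crossed <- ari(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The first call illustrates that the index depends only on the grouping and not on the label names. The second shows that the index can be negative when two partitions agree less than would be expected by chance.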

3.5 Compute Adjusted Rand Index (ARI) without imputation.

# Set seed for reproducibility purposes.
set.seed(0) 
# Compute PCA reduction of the dataset
reducedDims(sce) <- list(PCA=prcomp(t(logcounts(sce)))$x)

# Get an actual number of cell types
k <- length(unique(colData(sce)$cell.type))

# Cluster the PCA reduced dataset and store the assignments
set.seed(0) 
assgmts <- kmeans(reducedDim(sce, "PCA"), centers = k, iter.max = 1e+09,
    nstart = 1000)$cluster

# Use ARI to compare the k-means assignments to label assignments
adjustedRandIndex(assgmts, colData(sce)$cell.type)
#> [1] 0.5206793

3.6 Perform the imputation and store the result in a new assay.

assay(sce, "imputed") <- ccImpute(logcounts(sce), k = k)
#> Running ccImpute on dataset with 466 cells.
#> Imputation finished.
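After imputation it can be useful to check how many entries were actually altered. The helper below, summarize_imputation, is a hypothetical function (not part of ccImpute) that compares two matrices of equal dimensions and counts how many zeros were filled and how many non-zero values changed. It is shown on toy matrices with made-up values and makes no assumption about what ccImpute itself returns.

```r
# Hypothetical helper (not part of ccImpute): compare an expression matrix
# before and after imputation.
summarize_imputation <- function(before, after) {
    stopifnot(identical(dim(before), dim(after)))
    changed <- before != after
    c(
        zeros_before     = sum(before == 0),
        zeros_after      = sum(after == 0),
        zeros_filled     = sum(before == 0 & after != 0),
        nonzeros_changed = sum(before != 0 & changed)
    )
}

# Toy matrices with made-up values: two zeros are replaced, nothing else moves.
before <- matrix(c(0, 1.5, 0, 2.0, 0, 3.1), nrow = 2)
after  <- matrix(c(0.8, 1.5, 0, 2.0, 1.2, 3.1), nrow = 2)
res <- summarize_imputation(before, after)
```

On the dataset used in this manual, a call such as summarize_imputation(logcounts(sce), assay(sce, "imputed")) would be the natural application, assuming both assays are ordinary in-memory matrices.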

3.7 Re-compute Adjusted Rand Index (ARI) with imputation.

# Recompute PCA reduction of the dataset
reducedDim(sce, "PCA_imputed") <- prcomp(t(assay(sce, "imputed")))$x

# Cluster the PCA reduced dataset and store the assignments
assgmts <- kmeans(reducedDim(sce, "PCA_imputed"), centers = k, iter.max = 1e+09,
    nstart = 1000)$cluster

# Use ARI to compare the k-means assignments to label assignments
adjustedRandIndex(assgmts, colData(sce)$cell.type)
#> [1] 0.704984

4 R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       Ubuntu 20.04.5 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-01-19
#>  pandoc   2.5 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package                * version   date (UTC) lib source
#>  AnnotationDbi            1.60.0    2023-01-19 [2] Bioconductor
#>  AnnotationFilter         1.22.0    2023-01-19 [2] Bioconductor
#>  AnnotationHub            3.6.0     2023-01-19 [2] Bioconductor
#>  assertthat               0.2.1     2019-03-21 [2] CRAN (R 4.2.2)
#>  beachmat                 2.14.0    2023-01-19 [2] Bioconductor
#>  beeswarm                 0.4.0     2021-06-01 [2] CRAN (R 4.2.2)
#>  Biobase                * 2.58.0    2023-01-19 [2] Bioconductor
#>  BiocFileCache            2.6.0     2023-01-19 [2] Bioconductor
#>  BiocGenerics           * 0.44.0    2023-01-19 [2] Bioconductor
#>  BiocIO                   1.8.0     2023-01-19 [2] Bioconductor
#>  BiocManager              1.30.19   2022-10-25 [2] CRAN (R 4.2.2)
#>  BiocNeighbors            1.16.0    2023-01-19 [2] Bioconductor
#>  BiocParallel             1.32.5    2023-01-19 [2] Bioconductor
#>  BiocSingular             1.14.0    2023-01-19 [2] Bioconductor
#>  BiocStyle              * 2.26.0    2023-01-19 [2] Bioconductor
#>  BiocVersion              3.16.0    2023-01-19 [2] Bioconductor
#>  biomaRt                  2.54.0    2023-01-19 [2] Bioconductor
#>  Biostrings               2.66.0    2023-01-19 [2] Bioconductor
#>  bit                      4.0.5     2022-11-15 [2] CRAN (R 4.2.2)
#>  bit64                    4.0.5     2020-08-30 [2] CRAN (R 4.2.2)
#>  bitops                   1.0-7     2021-04-24 [2] CRAN (R 4.2.2)
#>  blob                     1.2.3     2022-04-10 [2] CRAN (R 4.2.2)
#>  bookdown                 0.32      2023-01-17 [2] CRAN (R 4.2.2)
#>  bslib                    0.4.2     2022-12-16 [2] CRAN (R 4.2.2)
#>  cachem                   1.0.6     2021-08-19 [2] CRAN (R 4.2.2)
#>  ccImpute               * 1.0.2     2023-01-19 [1] Bioconductor
#>  cli                      3.6.0     2023-01-09 [2] CRAN (R 4.2.2)
#>  codetools                0.2-18    2020-11-04 [2] CRAN (R 4.2.2)
#>  colorspace               2.0-3     2022-02-21 [2] CRAN (R 4.2.2)
#>  crayon                   1.5.2     2022-09-29 [2] CRAN (R 4.2.2)
#>  curl                     5.0.0     2023-01-12 [2] CRAN (R 4.2.2)
#>  DBI                      1.1.3     2022-06-18 [2] CRAN (R 4.2.2)
#>  dbplyr                   2.3.0     2023-01-16 [2] CRAN (R 4.2.2)
#>  DelayedArray             0.24.0    2023-01-19 [2] Bioconductor
#>  DelayedMatrixStats       1.20.0    2023-01-19 [2] Bioconductor
#>  digest                   0.6.31    2022-12-11 [2] CRAN (R 4.2.2)
#>  dplyr                    1.0.10    2022-09-01 [2] CRAN (R 4.2.2)
#>  ellipsis                 0.3.2     2021-04-29 [2] CRAN (R 4.2.2)
#>  ensembldb                2.22.0    2023-01-19 [2] Bioconductor
#>  evaluate                 0.20      2023-01-17 [2] CRAN (R 4.2.2)
#>  ExperimentHub            2.6.0     2023-01-19 [2] Bioconductor
#>  fansi                    1.0.3     2022-03-24 [2] CRAN (R 4.2.2)
#>  fastmap                  1.1.0     2021-01-25 [2] CRAN (R 4.2.2)
#>  filelock                 1.0.2     2018-10-05 [2] CRAN (R 4.2.2)
#>  generics                 0.1.3     2022-07-05 [2] CRAN (R 4.2.2)
#>  GenomeInfoDb           * 1.34.7    2023-01-19 [2] Bioconductor
#>  GenomeInfoDbData         1.2.9     2022-11-08 [2] Bioconductor
#>  GenomicAlignments        1.34.0    2023-01-19 [2] Bioconductor
#>  GenomicFeatures          1.50.3    2023-01-19 [2] Bioconductor
#>  GenomicRanges          * 1.50.2    2023-01-19 [2] Bioconductor
#>  ggbeeswarm               0.7.1     2022-12-16 [2] CRAN (R 4.2.2)
#>  ggplot2                * 3.4.0     2022-11-04 [2] CRAN (R 4.2.2)
#>  ggrepel                  0.9.2     2022-11-06 [2] CRAN (R 4.2.2)
#>  glue                     1.6.2     2022-02-24 [2] CRAN (R 4.2.2)
#>  gridExtra                2.3       2017-09-09 [2] CRAN (R 4.2.2)
#>  gtable                   0.3.1     2022-09-01 [2] CRAN (R 4.2.2)
#>  hms                      1.1.2     2022-08-19 [2] CRAN (R 4.2.2)
#>  htmltools                0.5.4     2022-12-07 [2] CRAN (R 4.2.2)
#>  httpuv                   1.6.8     2023-01-12 [2] CRAN (R 4.2.2)
#>  httr                     1.4.4     2022-08-17 [2] CRAN (R 4.2.2)
#>  interactiveDisplayBase   1.36.0    2023-01-19 [2] Bioconductor
#>  IRanges                * 2.32.0    2023-01-19 [2] Bioconductor
#>  irlba                    2.3.5.1   2022-10-03 [2] CRAN (R 4.2.2)
#>  jquerylib                0.1.4     2021-04-26 [2] CRAN (R 4.2.2)
#>  jsonlite                 1.8.4     2022-12-06 [2] CRAN (R 4.2.2)
#>  KEGGREST                 1.38.0    2023-01-19 [2] Bioconductor
#>  knitr                    1.41      2022-11-18 [2] CRAN (R 4.2.2)
#>  later                    1.3.0     2021-08-18 [2] CRAN (R 4.2.2)
#>  lattice                  0.20-45   2021-09-22 [2] CRAN (R 4.2.2)
#>  lazyeval                 0.2.2     2019-03-15 [2] CRAN (R 4.2.2)
#>  lifecycle                1.0.3     2022-10-07 [2] CRAN (R 4.2.2)
#>  magrittr                 2.0.3     2022-03-30 [2] CRAN (R 4.2.2)
#>  Matrix                   1.5-3     2022-11-11 [2] CRAN (R 4.2.2)
#>  MatrixGenerics         * 1.10.0    2023-01-19 [2] Bioconductor
#>  matrixStats            * 0.63.0    2022-11-18 [2] CRAN (R 4.2.2)
#>  mclust                 * 6.0.0     2022-10-31 [2] CRAN (R 4.2.2)
#>  memoise                  2.0.1     2021-11-26 [2] CRAN (R 4.2.2)
#>  mime                     0.12      2021-09-28 [2] CRAN (R 4.2.2)
#>  munsell                  0.5.0     2018-06-12 [2] CRAN (R 4.2.2)
#>  pillar                   1.8.1     2022-08-19 [2] CRAN (R 4.2.2)
#>  pkgconfig                2.0.3     2019-09-22 [2] CRAN (R 4.2.2)
#>  png                      0.1-8     2022-11-29 [2] CRAN (R 4.2.2)
#>  pracma                   2.4.2     2022-09-22 [2] CRAN (R 4.2.2)
#>  prettyunits              1.1.1     2020-01-24 [2] CRAN (R 4.2.2)
#>  progress                 1.2.2     2019-05-16 [2] CRAN (R 4.2.2)
#>  promises                 1.2.0.1   2021-02-11 [2] CRAN (R 4.2.2)
#>  ProtGenerics             1.30.0    2023-01-19 [2] Bioconductor
#>  purrr                    1.0.1     2023-01-10 [2] CRAN (R 4.2.2)
#>  R6                       2.5.1     2021-08-19 [2] CRAN (R 4.2.2)
#>  rappdirs                 0.3.3     2021-01-31 [2] CRAN (R 4.2.2)
#>  Rcpp                     1.0.9     2022-07-08 [2] CRAN (R 4.2.2)
#>  RcppAnnoy                0.0.20    2022-10-27 [2] CRAN (R 4.2.2)
#>  RCurl                    1.98-1.9  2022-10-03 [2] CRAN (R 4.2.2)
#>  restfulr                 0.0.15    2022-06-16 [2] CRAN (R 4.2.2)
#>  rjson                    0.2.21    2022-01-09 [2] CRAN (R 4.2.2)
#>  rlang                    1.0.6     2022-09-24 [2] CRAN (R 4.2.2)
#>  rmarkdown                2.19      2022-12-15 [2] CRAN (R 4.2.2)
#>  Rsamtools                2.14.0    2023-01-19 [2] Bioconductor
#>  RSpectra                 0.16-1    2022-04-24 [2] CRAN (R 4.2.2)
#>  RSQLite                  2.2.20    2022-12-22 [2] CRAN (R 4.2.2)
#>  rsvd                     1.0.5     2021-04-16 [2] CRAN (R 4.2.2)
#>  rtracklayer              1.58.0    2023-01-19 [2] Bioconductor
#>  S4Vectors              * 0.36.1    2023-01-19 [2] Bioconductor
#>  sass                     0.4.4     2022-11-24 [2] CRAN (R 4.2.2)
#>  ScaledMatrix             1.6.0     2023-01-19 [2] Bioconductor
#>  scales                   1.2.1     2022-08-20 [2] CRAN (R 4.2.2)
#>  scater                 * 1.26.1    2023-01-19 [2] Bioconductor
#>  scRNAseq               * 2.12.0    2023-01-19 [2] Bioconductor
#>  scuttle                * 1.8.4     2023-01-19 [2] Bioconductor
#>  sessioninfo            * 1.2.2     2021-12-06 [2] CRAN (R 4.2.2)
#>  shiny                    1.7.4     2022-12-15 [2] CRAN (R 4.2.2)
#>  SIMLR                    1.24.2    2023-01-19 [2] Bioconductor
#>  SingleCellExperiment   * 1.20.0    2023-01-19 [2] Bioconductor
#>  sparseMatrixStats        1.10.0    2023-01-19 [2] Bioconductor
#>  stringi                  1.7.12    2023-01-11 [2] CRAN (R 4.2.2)
#>  stringr                  1.5.0     2022-12-02 [2] CRAN (R 4.2.2)
#>  SummarizedExperiment   * 1.28.0    2023-01-19 [2] Bioconductor
#>  tibble                   3.1.8     2022-07-22 [2] CRAN (R 4.2.2)
#>  tidyselect               1.2.0     2022-10-10 [2] CRAN (R 4.2.2)
#>  utf8                     1.2.2     2021-07-24 [2] CRAN (R 4.2.2)
#>  vctrs                    0.5.1     2022-11-16 [2] CRAN (R 4.2.2)
#>  vipor                    0.4.5     2017-03-22 [2] CRAN (R 4.2.2)
#>  viridis                  0.6.2     2021-10-13 [2] CRAN (R 4.2.2)
#>  viridisLite              0.4.1     2022-08-22 [2] CRAN (R 4.2.2)
#>  withr                    2.5.0     2022-03-03 [2] CRAN (R 4.2.2)
#>  xfun                     0.36      2022-12-21 [2] CRAN (R 4.2.2)
#>  XML                      3.99-0.13 2022-12-04 [2] CRAN (R 4.2.2)
#>  xml2                     1.3.3     2021-11-30 [2] CRAN (R 4.2.2)
#>  xtable                   1.8-4     2019-04-21 [2] CRAN (R 4.2.2)
#>  XVector                  0.38.0    2023-01-19 [2] Bioconductor
#>  yaml                     2.3.6     2022-10-18 [2] CRAN (R 4.2.2)
#>  zlibbioc                 1.44.0    2023-01-19 [2] Bioconductor
#> 
#>  [1] /tmp/RtmppQvcCP/Rinst1608d74e6b2376
#>  [2] /home/biocbuild/bbs-3.16-bioc/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

References

Darmanis, Spyros, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. 2015. “A Survey of Human Brain Transcriptome Diversity at the Single Cell Level.” Proceedings of the National Academy of Sciences 112 (23): 7285–90.

Malec, Marcin, Hasan Kurban, and Mehmet Dalkilic. 2022. “ccImpute: An Accurate and Scalable Consensus Clustering Based Algorithm to Impute Dropout Events in the Single-Cell RNA-Seq Data.” BMC Bioinformatics 23 (1): 1–17.