Contents

1 Introduction

DecontX is a Bayesian hierarchical model to estimate and remove cross-contamination from ambient RNA in single-cell RNA-seq count data generated from droplet-based sequencing devices. DecontX will take the count matrix with/without the cell labels and estimate the contamination level and deliver a decontaminted count matrix for downstream analysis.

In this vignette we will demonstrate how to use DecontX to estimate and remove contamination.

2 Installation

celda can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE)){
    install.packages("BiocManager")}
BiocManager::install("celda")

The package can be loaded using the library command.

library(celda)

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.

3 Reproducibility note

Many functions in celda make use of stochastic algorithms or procedures which require the use of random number generator (RNG) for simulation or sampling. To maintain reproducibility, all these functions use a default seed of 12345 to make sure same results are generated each time one of these functions is called. Explicitly setting the seed arguments is needed for greater control and randomness.

4 Generation of a cross-contaminated dataset

DecontX will take a matrix of counts (referred as observed counts) where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of DecontX, we will apply it to a simulated dataset.

In the function simulateContaminatedMatrix, the K parameter designates the number of cell clusters, the C parameter determines the number of cells, the G parameter determines the number of genes in the simulated dataset.

simCounts <- simulateContaminatedMatrix(G = 300, C = 100, K = 3)

The nativeCounts is the natively expressed counts matrix, and observedCounts is the observed counts matrix that contains both contaminated and natively expressed transctripts. The NByC is the total number of observed transcripts per cell. The counts matrix which only contains contamianted transcripts can be obtained by subtracting the observed counts matrix from the observed counts matrix.

contamination <- simCounts$observedCounts - simCounts$nativeCounts

The z variable contains the population label for each cell.

table(simCounts$z)
## 
##  1  2  3 
## 26 36 38

The phi and eta variables contain the expression distributions and contamination distributions for each population, respectively. Each column corresponds to a population, each row represents a gene. The sum of the rows equal to 1.

colSums(simCounts$phi)
## [1] 1 1 1
colSums(simCounts$eta)
## [1] 1 1 1

5 Decontamination using DecontX

DecontX uses bayesian method to estimate and remove contamination via varitaional inference.

decontxModel <- decontX(counts = simCounts$observedCounts, z = simCounts$z)

5.1 Check convergence

Use log-likelihood to check convergence

plot(decontxModel$resList$logLikelihood)

5.2 Evaluate model performance

decontX estimates a contamination proportion for each cell. We compare the estimated contamination proportion with the real contamination proportion.

plot(decontxModel$resList$estConp,
    colSums(contamination) / simCounts$NByC, col = simCounts$z)
abline(0, 1)

6 Session Information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] celda_1.0.4      BiocStyle_2.12.0
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.44.0              httr_1.4.0                 
##  [3] jsonlite_1.6                splines_3.6.0              
##  [5] foreach_1.4.4               assertthat_0.2.1           
##  [7] BiocManager_1.30.4          stats4_3.6.0               
##  [9] GenomeInfoDbData_1.2.1      yaml_2.2.0                 
## [11] ggrepel_0.8.1               pillar_1.4.1               
## [13] lattice_0.20-38             reticulate_1.12            
## [15] glue_1.3.1                  RcppEigen_0.3.3.5.0        
## [17] digest_0.6.19               GenomicRanges_1.36.0       
## [19] RColorBrewer_1.1-2          XVector_0.24.0             
## [21] minqa_1.2.4                 colorspace_1.4-1           
## [23] enrichR_1.0                 htmltools_0.3.6            
## [25] Matrix_1.2-17               plyr_1.8.4                 
## [27] pkgconfig_2.0.2             bookdown_0.11              
## [29] zlibbioc_1.30.0             purrr_0.3.2                
## [31] scales_1.0.0                Rtsne_0.15                 
## [33] BiocParallel_1.18.0         lme4_1.1-21                
## [35] tibble_2.1.2                combinat_0.0-8             
## [37] IRanges_2.18.1              ggplot2_3.1.1              
## [39] withr_2.1.2                 umap_0.2.2.0               
## [41] SummarizedExperiment_1.14.0 BiocGenerics_0.30.0        
## [43] lazyeval_0.2.2              magrittr_1.5               
## [45] crayon_1.3.4                evaluate_0.14              
## [47] doParallel_1.0.14           nlme_3.1-140               
## [49] MASS_7.3-51.4               MAST_1.10.0                
## [51] tools_3.6.0                 data.table_1.12.2          
## [53] matrixStats_0.54.0          stringr_1.4.0              
## [55] S4Vectors_0.22.0            munsell_0.5.0              
## [57] DelayedArray_0.10.0         compiler_3.6.0             
## [59] GenomeInfoDb_1.20.0         rlang_0.3.4                
## [61] blme_1.0-4                  grid_3.6.0                 
## [63] RCurl_1.95-4.12             nloptr_1.2.1               
## [65] iterators_1.0.10            rjson_0.2.20               
## [67] SingleCellExperiment_1.6.0  bitops_1.0-6               
## [69] rmarkdown_1.13              boot_1.3-22                
## [71] gtable_0.3.0                codetools_0.2-16           
## [73] abind_1.4-5                 reshape2_1.4.3             
## [75] R6_2.4.0                    gridExtra_2.3              
## [77] knitr_1.23                  dplyr_0.8.1                
## [79] stringi_1.4.3               parallel_3.6.0             
## [81] Rcpp_1.0.1                  MCMCprecision_0.3.9        
## [83] tidyselect_0.2.5            xfun_0.7