Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single-cell counts can aid in downstream clustering and visualization.
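The mixture assumption can be made concrete with a small simulation. The sketch below uses base R only and is not decontX code: the gene names, the two expression profiles, and the 10% contamination fraction are made-up values chosen for illustration.

```r
set.seed(1)
genes <- c("CD3D", "LYZ", "CD79A")      # toy gene set (illustrative only)

# Hypothetical expression profiles; each sums to 1
native_profile  <- c(0.90, 0.05, 0.05)  # the cell's own population
ambient_profile <- c(0.10, 0.60, 0.30)  # pooled from the other populations

total_counts <- 1000                    # transcripts observed in one cell
theta <- 0.10                           # assumed contamination fraction

# Split the total between the two sources, then draw each part
# from its own multinomial distribution
n_contam <- rbinom(1, size = total_counts, prob = theta)
n_native <- total_counts - n_contam
native_counts <- rmultinom(1, size = n_native, prob = native_profile)[, 1]
contam_counts <- rmultinom(1, size = n_contam, prob = ambient_profile)[, 1]

# Only the sum is observed; decontamination aims to recover native_counts
observed <- native_counts + contam_counts
names(observed) <- genes
observed
```

In real data only `observed` is available, which is why the contamination fraction and the two profiles have to be estimated.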
The decontX package can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("decontX")
Then the package can be loaded in R using the following command:
library(decontX)
To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/decontX.
DecontX can take either a SingleCellExperiment object or a counts matrix as input. decontX will attempt to convert any input matrix to class dgCMatrix from package Matrix before starting the analysis.
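Because decontX performs this conversion itself, doing it beforehand is optional. For readers who want to control the step, the following is a minimal sketch of one way to coerce a dense matrix to a column-compressed sparse matrix with the Matrix package. The toy matrix and its dimnames are invented for illustration.

```r
library(Matrix)

# Toy dense count matrix: 3 genes x 2 cells (values are made up)
mat <- matrix(c(0, 2, 0, 5, 0, 1), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2")))

# Coerce to a column-compressed sparse matrix
sp <- as(mat, "CsparseMatrix")

class(sp)   # inspect the resulting sparse class
dim(sp)     # dimensions are unchanged by the coercion
```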
To import datasets directly into an SCE object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in the filtered and raw matrices for multiple samples processed with CellRanger:
library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))
Within each sample directory, there should be subfolders called "outs/filtered_feature_bc_matrix/" or "outs/raw_feature_bc_matrix/" with files called matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz. If these files are in different subdirectories, the importCellRangerV3Sample function can be used to import data from a different directory instead.
Optionally, the “raw” or “droplet” matrix can also be easily imported by setting the dataType argument to “raw”:
sce.raw <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"), dataType = "raw")
The raw matrix can be passed to the background parameter in decontX as described below. If using Seurat, go to the Working with Seurat section for details on how to convert between SCE and Seurat objects.
We will use the 10X PBMC 4K dataset as an example in this vignette. This data can be retrieved from the package TENxPBMCData. Make sure the column names are set before running decontX.
# Load PBMC data
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_")
rownames(sce) <- rowData(sce)$Symbol_TENx
counts(sce) <- as(counts(sce), "dgCMatrix")
A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the x parameter. The matrix to use in an SCE object can be specified with the assayName parameter, which is set to "counts" by default. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. Here is an example of running decontX without supplying the background:
sce <- decontX(sce)
In this scenario, decontX will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset. The estimated contamination results can be found in colData(sce)$decontX_contamination and the decontaminated counts can be accessed with decontXcounts(sce). decontX will perform heuristic clustering to quickly define major cell clusters. However, if you have your own cell cluster labels, they can be specified with the z parameter. These results will be used throughout the rest of the vignette.
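As an illustration of supplying your own labels through z, the sketch below runs decontX on simulated data where the true cluster labels are known. It uses simulateContamination from the decontX package to generate counts. The object name sce_sim is arbitrary and kept separate from sce so that the PBMC results above are not overwritten.

```r
library(decontX)
library(SingleCellExperiment)

# Simulated counts with known cluster labels (not real data)
sim <- simulateContamination()
sce_sim <- SingleCellExperiment(list(counts = sim$observedCounts))

# Supply the known labels instead of relying on the built-in clustering
sce_sim <- decontX(sce_sim, z = sim$z)

# Per-cell contamination estimates and decontaminated counts
head(colData(sce_sim)$decontX_contamination)
dim(decontXcounts(sce_sim))
```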
The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as an SCE object or a sparse matrix using the background parameter:
sce <- decontX(sce, background = sce.raw)
Only empty droplets in the background matrix should be used to estimate the ambient RNA. If any cell ids (i.e. colnames) in the raw/droplet matrix supplied to the background parameter are also found in the filtered counts matrix (x), decontX will automatically remove them from the raw matrix. However, if the cell ids are not available for the input matrices, decontX will treat the entire background input as empty droplets. All of the outputs are the same as when running decontX without setting the background parameter.
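The barcode-matching rule described above can be pictured with a few lines of base R. This is only a sketch of the behavior as stated in the text, not decontX's internal code. The matrices, barcodes, and counts are invented for illustration.

```r
# Toy raw/droplet matrix: 2 genes x 4 droplets (values are made up)
raw <- matrix(c(1, 0, 0, 3, 2, 0, 0, 1), nrow = 2,
              dimnames = list(c("g1", "g2"),
                              c("bc1", "bc2", "bc3", "bc4")))

# Pretend droplets bc2 and bc4 were called as cells
filtered <- raw[, c("bc2", "bc4"), drop = FALSE]

# Keep only droplets whose barcode is absent from the filtered matrix
empty <- raw[, !(colnames(raw) %in% colnames(filtered)), drop = FALSE]
colnames(empty)
```

The matching relies entirely on column names, so the raw and filtered matrices need to use the same barcode naming scheme (including any sample prefix) for overlapping cells to be recognized.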
Note: If the input object is just a matrix and not an SCE object, make sure to save the output into a variable with a different name (e.g. result <- decontX(mat)). The result object will be a list with contamination in result$contamination and the decontaminated counts in result$decontXcounts.
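A minimal sketch of the matrix workflow, assuming simulated counts from simulateContamination stand in for a real count matrix (the variable names mat and result are arbitrary):

```r
library(decontX)

sim <- simulateContamination()   # simulated data, not real cells
mat <- sim$observedCounts        # plain count matrix (genes x cells)

# With matrix input, the output is a list rather than an SCE object,
# so it is stored under a new name instead of overwriting mat
result <- decontX(mat)

summary(result$contamination)    # per-cell contamination estimates
dim(result$decontXcounts)        # decontaminated counts
```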
DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find “broad” cell types rather than individual cell subpopulations within a cell type.
library(celda) # To use plotting functions in celda
##
## Attaching package: 'celda'
## The following object is masked from 'package:S4Vectors':
##
## params
## The following objects are masked from 'package:decontX':
##
## decontX, decontXcounts, decontXcounts<-, plotDecontXContamination,
## plotDecontXMarkerExpression, plotDecontXMarkerPercentage,
## retrieveFeatureIndex, simulateContamination
umap <- reducedDim(sce, "decontX_UMAP")
plotDimReduceCluster(x = sce$decontX_clusters,
    dim1 = umap[, 1], dim2 = umap[, 2])
The percentage of contamination in each cell can be plotted on the UMAP to visualize which clusters may have higher levels of ambient RNA.
plotDecontXContamination(sce)
Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes.
library(scater)
sce <- logNormCounts(sce)
plotDimReduceFeature(as.matrix(logcounts(sce)),
    dim1 = umap[, 1],
    dim2 = umap[, 2],
    features = c("CD3D", "CD3E", "GNLY",
        "LYZ", "S100A8", "S100A9",
        "CD79A", "CD79B", "MS4A1"),
    exactMatch = TRUE)
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size
## 1.1 GiB
The percentage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original counts assay is shown:
markers <- list(Tcell_Markers = c("CD3E", "CD3D"),
    Bcell_Markers = c("CD79A", "CD79B", "MS4A1"),
    Monocyte_Markers = c("S100A8", "S100A9", "LYZ"),
    NKcell_Markers = "GNLY")
cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6)
plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "counts")