estimateAmbience {DropletUtils} | R Documentation |
Estimate the transcript proportions in the ambient solution using the Good-Turing method.
estimateAmbience(m, lower = 100, round = TRUE, good.turing = TRUE)
m |
A numeric matrix object - usually a dgTMatrix or dgCMatrix - containing droplet data prior to any filtering or cell calling. Columns represent barcoded droplets, rows represent genes. |
lower |
A numeric scalar specifying the lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. |
round |
Logical scalar indicating whether to check for non-integer values in |
good.turing |
Logical scalar indicating whether to perform Good-Turing estimation of the proportions. |
This function obtains an estimate of the composition of the ambient pool of RNA based on the barcodes with total UMI counts less than or equal to lower
.
For each gene, counts are summed across all low-count barcodes and the resulting count vector is used for Good-Turing estimation of the proportions for each transcript.
The aim here is to obtain reasonable proportions for genes with counts of zero in low-count barcodes but non-zero counts for other barcodes (thus avoiding likelihoods of zero when modelling the latter with the proportions).
This function will also attempt to detect whether m
contains non-integer values by seeing if the column and row sums are discrete.
If such values are present, m
is first round
ed to the nearest integer value before proceeding.
This may be relevant when the count matrix is generated from pseudo-alignment methods like Alevin (see the tximeta package for details).
Rounding is performed by default as discrete count values are necessary for the Good-Turing algorithm, but if m
is known to be discrete, setting round=FALSE
can provide a small efficiency improvement.
Good-Turing returns zero probabilities for zero counts if none of the summed counts are equal to 1. This is technically correct but not helpful, so we protect against this by adding a single “pseudo-feature” with a count of 1 to the profile. The modified profile is used to calculate a Good-Turing estimate of observing any feature that has zero counts, which is then divided to get the per-feature probability. We scale down all the other probabilities to make space for this new pseudo-probability, which has some properties of unclear utility (see https://github.com/MarioniLab/DropletUtils/issues/39).
Note that genes with counts of zero across all barcodes in m
automatically have proportions of zero.
This ensures that the estimation is not affected by the presence/absence of non-expressed genes in different annotations.
In any case, such genes are likely to be completely irrelevant to downstream steps and can be safely ignored.
Setting good.turing=FALSE
may be convenient to obtain raw counts for use in further modelling.
A numeric vector of length equal to nrow(m)
,
containing the estimated proportion of each transcript in the ambient solution.
If good.turing=FALSE
, the vector instead contains the sum of counts for each gene across all low-count barcodes.
Aaron Lun
emptyDrops
and hashedDrops
, where the ambient profile estimates are used for testing.
# Mocking up some data: set.seed(0) my.counts <- DropletUtils:::simCounts() ambience <- estimateAmbience(my.counts) head(ambience)