maximumAmbience {DropletUtils} | R Documentation |
Estimate the maximum contribution of the ambient solution to a particular expression profile.
maximumAmbience(
    y,
    ambient,
    threshold = 0.1,
    dispersion = 0,
    num.points = 100,
    num.iter = 5,
    mode = c("scale", "profile", "proportion")
)
y |
A numeric vector of gene-level counts. |
ambient |
A numeric vector of length equal to |
threshold |
Numeric scalar specifying the p-value threshold to use, see Details. |
dispersion |
Numeric scalar specifying the dispersion to use in the negative binomial model. Defaults to zero, i.e., a Poisson model. |
num.points |
Integer scalar specifying the number of points to use for the grid search. |
num.iter |
Integer scalar specifying the number of iterations to use for the grid search. |
mode |
String indicating the output to return - the scaling factor, the maximum ambient profile or the maximum proportion of each gene's counts in |
On occasion, it is useful to estimate the maximum possible contribution of the ambient solution to a count profile. This represents the most pessimistic explanation of a particular expression pattern and can be used to identify and discard suspect genes or clusters prior to downstream analyses.
This function implements the following algorithm:
It computes the mean ambient contribution for each gene by scaling ambient by some factor. ambient itself is usually derived by summing counts across barcodes with low total counts; see the output of emptyDrops for an example.
It computes a p-value for each gene based on the probability of observing a count equal to or below that in y, using the lower tail of a negative binomial (or Poisson) distribution with mean set to the ambient contribution.
It combines p-values across genes using Simes' method. The joint null hypothesis is that the expectation of y is equal to the sum of the scaled ambient proportions and some (non-negative) contribution from actual intra-cellular transcripts.
It finds the largest scaling factor that fails to reject this joint null at the specified threshold. If sum(ambient) is equal to unity, this scaling factor can be interpreted as the maximum number of transcript molecules contributed to y by the ambient solution.
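The first three steps can be sketched in base R for a single candidate scaling factor. This is only an illustration under stated assumptions, not the package's implementation: combined_p is a hypothetical helper name, the negative binomial is parameterised with size = 1/dispersion, and Simes' combined p-value is taken as the minimum of n * p_(i) / i over the sorted p-values.

```r
# Hypothetical helper (not part of DropletUtils): combined p-value for one
# candidate scaling factor.
combined_p <- function(y, ambient, scale, dispersion = 0) {
    # Step 1: mean ambient contribution for each gene.
    mu <- scale * ambient

    # Step 2: lower-tail probability of a count equal to or below that in 'y'.
    # Assumption: dispersion = 0 means Poisson, otherwise size = 1/dispersion.
    p <- if (dispersion == 0) {
        ppois(y, lambda = mu)
    } else {
        pnbinom(y, size = 1 / dispersion, mu = mu)
    }

    # Step 3: Simes' method, min over i of n * p_(i) / i on the sorted p-values.
    min(sort(p) * length(p) / seq_along(p))
}

# Simulated data in the style of the Examples section.
set.seed(100)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 50) + rpois(1000, 5)

small <- combined_p(y, ambient, scale = 10)   # well below the simulated scaling of 50
large <- combined_p(y, ambient, scale = 500)  # far above it
c(small = small, large = large)
```

A small scaling factor leaves every gene's count comfortably above its ambient mean, so the joint null is not rejected; a very large one pushes many lower-tail probabilities towards zero and the combined p-value collapses.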
The process of going from a scaling factor to a combined p-value has no clean analytical solution, so we use an iterative grid search to identify the largest possible scaling factor at a decent resolution.
num.points and num.iter control the resolution of the grid search, and generally do not need to be changed.
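One way such an iterative grid search could look is sketched below, purely as an illustration. The helper names combined_p and grid_search_scale are hypothetical, only the Poisson case is covered, and the starting upper bound of sum(y) / sum(ambient) is an assumption of this sketch rather than something taken from the package. Each iteration evaluates num.points scaling factors, keeps the largest one that is not rejected, and narrows the search to the interval just above it.

```r
# Combined p-value for one scaling factor (Poisson case only, as a sketch).
combined_p <- function(y, ambient, scale) {
    p <- ppois(y, lambda = scale * ambient)      # lower-tail p-value per gene
    min(sort(p) * length(p) / seq_along(p))      # Simes' method
}

# Hypothetical helper (not part of DropletUtils): iterative grid search for
# the largest scaling factor whose combined p-value stays above 'threshold'.
grid_search_scale <- function(y, ambient, threshold = 0.1,
                              num.points = 100, num.iter = 5) {
    lower <- 0
    upper <- sum(y) / sum(ambient)  # assumed starting upper bound
    for (i in seq_len(num.iter)) {
        grid <- seq(lower, upper, length.out = num.points)
        p <- vapply(grid, function(s) combined_p(y, ambient, s), numeric(1))
        last <- max(which(p > threshold))  # largest grid point not rejected
        lower <- grid[last]
        upper <- grid[min(last + 1L, num.points)]
    }
    lower
}

set.seed(100)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 50) + rpois(1000, 5)

est <- grid_search_scale(y, ambient)
est
```

Because each iteration shrinks the interval by roughly a factor of num.points, a handful of iterations is enough to locate the boundary to many decimal places, which is consistent with the defaults rarely needing adjustment.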
A numeric scalar quantifying the “contribution” of the ambient solution to y.
The product of this scalar and ambient yields the expected number of ambient transcripts for each gene in y.
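As a small worked illustration with made-up numbers (three genes, ambient proportions summing to one, and an assumed scaling factor of 40), the product gives the expected ambient count per gene, and its total equals the scaling factor:

```r
# Made-up values for illustration only.
ambient <- c(0.5, 0.3, 0.2)   # ambient proportions, sum to 1
scaling <- 40                 # assumed scaling factor

# Expected number of ambient transcripts for each gene.
expected <- scaling * ambient
expected

# With sum(ambient) equal to unity, the total equals the scaling factor.
sum(expected)
```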
The algorithm implemented in this function is, admittedly, rather ad hoc and offers little in the way of theoretical guarantees.
The reported scaling often exceeds the actual contribution, especially at low counts where the reduced power fails to penalize overly large scaling factors.
The p-value is largely used as a score rather than providing any meaningful error control.
Empirically, decreasing threshold will return a higher scaling factor by making the estimation more robust to drop-outs in y, at the cost of increasing the risk of over-estimation of the ambient contribution.
It is also important to note that this function returns the maximum possible contribution of the ambient solution to y, not the actual contribution.
It is probably unwise to attempt to obtain a “cleaned” expression profile by subtracting the scaled ambient proportions from y.
In the most extreme case, if the ambient profile is similar to the expectation of y (e.g., due to sequencing a relatively homogeneous cell population), the maximum possible contribution of the ambient solution would be 100% of y, and subtraction would yield an empty count vector!
Aaron Lun
emptyDrops, which uses the ambient profile to call cells.
estimateAmbience, to obtain an estimate to use in ambient.
# Making up some data.
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 50)
y <- y + rpois(1000, 5) # actual biology.

# Estimating the maximum possible scaling factor:
scaling <- maximumAmbience(y, ambient)
scaling

# Estimating the maximum contribution to 'y' by 'ambient'.
contribution <- maximumAmbience(y, ambient, mode="profile")
DataFrame(ambient=contribution, total=y)