maximumAmbience {DropletUtils}    R Documentation

Maximum ambient contribution

Description

Estimate the maximum contribution of the ambient solution to a particular expression profile.

Usage

maximumAmbience(
  y,
  ambient,
  threshold = 0.1,
  dispersion = 0,
  num.points = 100,
  num.iter = 5,
  mode = c("scale", "profile", "proportion")
)

Arguments

y

A numeric vector of gene-level counts.

ambient

A numeric vector of length equal to y, containing the proportions of transcripts for each gene in the ambient solution.

threshold

Numeric scalar specifying the p-value threshold to use; see Details.

dispersion

Numeric scalar specifying the dispersion to use in the negative binomial model. Defaults to zero, i.e., a Poisson model.

num.points

Integer scalar specifying the number of points to use for the grid search.

num.iter

Integer scalar specifying the number of iterations to use for the grid search.

mode

String indicating the output to return: the scaling factor ("scale"), the maximum ambient profile ("profile") or the maximum proportion of each gene's counts in y that is attributable to ambient contamination ("proportion").

Details

On occasion, it is useful to estimate the maximum possible contribution of the ambient solution to a count profile. This represents the most pessimistic explanation of a particular expression pattern and can be used to identify and discard suspect genes or clusters prior to downstream analyses.

This function implements the following algorithm:

  1. It computes the mean ambient contribution for each gene by scaling ambient by some factor. ambient itself is usually derived by summing counts across barcodes with low total counts; see the output of emptyDrops for an example.

  2. It computes a p-value for each gene based on the probability of observing a count equal to or below that in y, using the lower tail of a negative binomial (or Poisson) distribution with mean set to the ambient contribution.

  3. It combines p-values across genes using Simes' method. The joint null hypothesis is that the expectation of y is equal to the sum of the scaled ambient proportions and some (non-negative) contribution from actual intra-cellular transcripts.

  4. It finds the largest scaling factor that fails to reject this joint null at the specified threshold. If sum(ambient) is equal to unity, this scaling factor can be interpreted as the maximum number of transcript molecules contributed to y by the ambient solution.
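
Steps 1 to 3 can be sketched in base R as follows. This is an illustrative sketch rather than the package's internal code: the helper name combinedPValue is hypothetical, and it assumes that the negative binomial is parametrized with size equal to the reciprocal of dispersion.

```r
# Hypothetical helper (not a DropletUtils function): the combined p-value
# obtained for a single candidate scaling factor.
combinedPValue <- function(y, ambient, scale, dispersion = 0) {
    # Step 1: mean ambient contribution for each gene.
    mu <- ambient * scale

    # Step 2: lower-tail probability of a count at or below 'y'.
    if (dispersion > 0) {
        # Assumes size = 1/dispersion, i.e., variance = mu + dispersion * mu^2.
        p <- pnbinom(y, mu = mu, size = 1 / dispersion)
    } else {
        p <- ppois(y, lambda = mu)
    }

    # Step 3: Simes' combined p-value, min(n * p_(i) / i) over sorted p-values.
    n <- length(p)
    min(sort(p) * n / seq_len(n))
}

# Toy data: 100 genes with equal ambient proportions and 5 counts each.
ambient <- rep(0.01, 100)
y <- rep(5, 100)

small <- combinedPValue(y, ambient, scale = 100)   # ambient mean of 1 per gene
large <- combinedPValue(y, ambient, scale = 5000)  # ambient mean of 50 per gene
c(small = small, large = large)
```

A small scaling factor is compatible with the observed counts and yields a large combined p-value, whereas an implausibly large one pushes the expected ambient counts above y and drives the p-value towards zero.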

The process of going from a scaling factor to a combined p-value has no clean analytical solution, so we use an iterative grid search to identify the largest possible scaling factor at a reasonable resolution. num.points and num.iter control the resolution of the grid search, and generally do not need to be changed.
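
One way to picture the iterative grid search is the sketch below, restricted to the Poisson case. The helper names simesLowerTail and gridSearchScale are hypothetical, and the choice of search range (doubling the upper bound until the joint null is rejected) is an assumption made for this example, not necessarily what the function does internally.

```r
# Hypothetical helpers (not DropletUtils functions), Poisson case only.
simesLowerTail <- function(scale, y, ambient) {
    p <- ppois(y, lambda = ambient * scale)
    n <- length(p)
    min(sort(p) * n / seq_len(n))
}

gridSearchScale <- function(y, ambient, threshold = 0.1,
                            num.points = 100, num.iter = 5) {
    stopifnot(any(ambient > 0), threshold < 1, num.points >= 3)

    # Assumption: bracket the answer by doubling the upper bound until the
    # joint null is rejected there. A scale of zero is never rejected.
    lower <- 0
    upper <- 1
    while (simesLowerTail(upper, y, ambient) > threshold) {
        upper <- upper * 2
    }

    for (i in seq_len(num.iter)) {
        grid <- seq(lower, upper, length.out = num.points)
        p <- vapply(grid, simesLowerTail, numeric(1), y = y, ambient = ambient)

        # The combined p-value does not increase with the scale, so the last
        # grid point that is not rejected brackets the solution from below.
        keep <- max(which(p > threshold))
        lower <- grid[keep]
        upper <- grid[min(keep + 1L, num.points)]
    }
    lower
}

# Toy data: 100 genes with equal ambient proportions and 5 counts each.
ambient <- rep(0.01, 100)
y <- rep(5, 100)
s <- gridSearchScale(y, ambient)
s
```

Each iteration shrinks the bracket by a factor of num.points - 1, which is why a handful of iterations is enough to reach a fine resolution.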

Value

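If mode="scale" (the default), a numeric scalar quantifying the maximum “contribution” of the ambient solution to y. The product of this scalar and ambient yields the maximum expected number of ambient transcripts for each gene in y. For the other settings of mode, the maximum ambient profile or the maximum proportion of each gene's counts attributable to ambient contamination is returned instead, as described for the mode argument.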
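
As a worked illustration of how the scaling factor relates to the per-gene quantities, the snippet below uses a made-up scaling factor in place of a real call to maximumAmbience. Capping the proportion at 1 is an assumption made here so that the values stay interpretable as proportions; the function's own handling of genes with zero counts may differ.

```r
# Toy inputs; 'ambient' sums to 1 so that the scaling factor is on the
# scale of transcript molecules.
y <- c(10, 0, 3, 25)
ambient <- c(0.1, 0.05, 0.05, 0.8)

# Made-up value standing in for the scalar returned by the function.
scaling <- 20

# Expected number of ambient transcripts for each gene.
profile <- scaling * ambient

# Assumed definition: share of each gene's count that the ambient profile
# could explain, capped at 1 (this also maps the division by zero to 1).
proportion <- pmin(profile / y, 1)

data.frame(y = y, profile = profile, proportion = proportion)
```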

Caveats

The algorithm implemented in this function is, admittedly, rather ad hoc and offers little in the way of theoretical guarantees. The reported scaling often exceeds the actual contribution, especially at low counts where the reduced power fails to penalize overly large scaling factors. The p-value is largely used as a score rather than providing any meaningful error control. Empirically, decreasing threshold will return a higher scaling factor by making the estimation more robust to drop-outs in y, at the cost of increasing the risk of over-estimation of the ambient contribution.

It is also important to note that this function returns the maximum possible contribution of the ambient solution to y, not the actual contribution. It is probably unwise to attempt to obtain a “cleaned” expression profile by subtracting the scaled ambient proportions from y. In the most extreme case, if the ambient profile is similar to the expectation of y (e.g., due to sequencing a relatively homogeneous cell population), the maximum possible contribution of the ambient solution would be 100% of y, and subtraction would yield an empty count vector!
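
The extreme case can be made concrete with a few lines of arithmetic. The numbers are invented for illustration, and the scaling factor is set by hand to the value at which the ambient solution accounts for all of y, not obtained from the function.

```r
# Toy counts from a homogeneous population.
y <- c(50, 30, 20)

# Ambient proportions that match the composition of 'y' exactly.
ambient <- y / sum(y)

# Hypothetical scaling factor for the extreme case in which the ambient
# solution accounts for all of 'y'.
scaling <- sum(y)

# Naive "cleaning" by subtraction.
cleaned <- y - scaling * ambient
cleaned
```

Every entry of cleaned is zero up to floating-point error, so the subtraction removes all of the signal even though the counts may be entirely of cellular origin.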

Author(s)

Aaron Lun

See Also

emptyDrops, which uses the ambient profile to call cells.

estimateAmbience, to obtain an estimate to use in ambient.

Examples

library(DropletUtils)

# Making up some data.
set.seed(1000)
ambient <- c(runif(900, 0, 0.1), runif(100))
y <- rpois(1000, ambient * 50)
y <- y + rpois(1000, 5) # actual biology.

# Estimating the maximum possible scaling factor:
scaling <- maximumAmbience(y, ambient)
scaling

# Estimating the maximum contribution to 'y' by 'ambient'.
contribution <- maximumAmbience(y, ambient, mode="profile")
DataFrame(ambient=contribution, total=y)


[Package DropletUtils version 1.8.0 Index]