combineRecomputedResults {SingleR} | R Documentation |
Combine results from multiple runs of classifySingleR (usually against different references) into a single DataFrame.
The label from the results with the highest score for each cell is retained.
Unlike combineCommonResults, this does not assume that each run of classifySingleR was performed using the same set of common genes, instead recomputing the scores for comparison across references.
combineRecomputedResults(
  results,
  test,
  trained,
  quantile = 0.8,
  assay.type.test = "logcounts",
  check.missing = TRUE,
  allow.lost = FALSE,
  warn.lost = TRUE,
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)
results |
A list of DataFrame prediction results as returned by |
test |
A numeric matrix of single-cell expression values where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix. |
trained |
A list of Lists containing the trained outputs of multiple references,
equivalent to either (i) the output of |
quantile |
Further arguments to pass to |
assay.type.test |
An integer scalar or string specifying the assay of |
check.missing |
Logical scalar indicating whether rows should be checked for missing values (and if found, removed). |
allow.lost |
Logical scalar indicating whether to use lost markers in references where they are available. |
warn.lost |
Logical scalar indicating whether to emit a warning if markers from one reference in |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use for building nearest neighbor indices. |
BPPARAM |
A BiocParallelParam object specifying how parallelization should be performed, if any. |
Here, the strategy is to perform classification separately within each reference,
then collate the results to choose the label with the highest score across references.
For a given cell in test, we extract its assigned label from results for each reference.
We also retrieve the marker genes associated with that label and take the union of markers across all references.
This defines a common feature space in which the score for each reference's assigned label is recomputed using ref;
the label from the reference with the top recomputed score is then reported as the combined annotation for that cell.
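The strategy above can be illustrated with a toy sketch in base R. This is not the SingleR implementation: the helper recompute_score() and all of the data are hypothetical, and the sketch assumes that a score is the chosen quantile of the Spearman correlations between the test cell and the reference samples carrying the assigned label, computed over the union of markers.

```r
# Toy data (all values made up): one test cell and two references,
# each with two samples for the label assigned to this cell.
genes <- paste0("g", 1:6)
cell <- setNames(c(6, 5, 4, 3, 2, 1), genes)
ref1_A <- matrix(c(6, 5, 4, 3, 2, 1,
                   5, 6, 4, 3, 2, 1), ncol = 2,
                 dimnames = list(genes, c("a1", "a2")))
ref2_x <- matrix(c(1, 2, 3, 4, 5, 6,
                   2, 1, 3, 4, 6, 5), ncol = 2,
                 dimnames = list(genes, c("x1", "x2")))

# Markers for the assigned label in each reference, and their union.
markers_ref1 <- c("g1", "g2")
markers_ref2 <- c("g5", "g6")
common <- union(markers_ref1, markers_ref2)

# Hypothetical helper: quantile of Spearman correlations between the
# cell and each reference sample, restricted to the common markers.
recompute_score <- function(cell, ref_samples, genes, quantile = 0.8) {
  rho <- apply(ref_samples[genes, , drop = FALSE], 2,
               function(s) cor(cell[genes], s, method = "spearman"))
  unname(stats::quantile(rho, probs = quantile))
}

scores <- c(ref1 = recompute_score(cell, ref1_A, common),
            ref2 = recompute_score(cell, ref2_x, common))

# The reference with the top recomputed score supplies the combined label.
assigned <- c(ref1 = "A", ref2 = "x")
best <- unname(which.max(scores))
combined_label <- unname(assigned[best])
```

Because both scores are computed over the same genes, they can be compared directly, which is the point of the recomputation.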
Unlike combineCommonResults, the union of markers is not used for the within-reference calls.
This avoids the inclusion of noise from irrelevant genes in the within-reference assignments.
combineRecomputedResults is slower as it requires recomputation of the scores,
but the within-reference calls are faster as there are fewer genes in the union of markers for assigned labels
(compared to the union of markers across all labels, as required by combineCommonResults),
so the net compute time is likely to be lower.
A DataFrame is returned containing the annotation statistics for each cell or cluster (row).
This mimics the output of classifySingleR and contains the following fields:
scores, a numeric matrix of correlations containing the recomputed scores. For any given cell, entries of this matrix are only non-NA for the assigned label in each reference; scores are not recomputed for the other labels.
labels, a character vector containing the per-cell combined label across references.
references, an integer vector specifying the reference from which the combined label was derived.
orig.results, a DataFrame containing results.
It may also contain first.labels and pruned.labels if these were also present in results.
The metadata contains label.origin, a DataFrame specifying the reference of origin for each label in scores.
Note that, unlike combineCommonResults, no common.genes is reported as this function does not use a common set of genes across all references.
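To make the layout of scores concrete, the following base R sketch builds a plain matrix with the same sparsity pattern. It is only a mock-up under stated assumptions: the real output is a DataFrame produced by the function, whereas the labels, the score values and the variable names here (assigned, recomputed, label_origin, reference_index) are invented for illustration.

```r
# Hypothetical assigned labels and recomputed scores for three cells
# in two references (all values made up).
assigned <- list(ref1 = c("A", "B", "A"), ref2 = c("a", "a", "b"))
recomputed <- list(ref1 = c(0.9, 0.4, 0.7), ref2 = c(0.5, 0.8, 0.6))

# One column per label across all references; label_origin plays the
# role described for label.origin (reference of origin of each column).
all_labels <- c("A", "B", "a", "b")
label_origin <- c(1L, 1L, 2L, 2L)

# Fill only the entry for the assigned label of each reference;
# every other entry stays NA because it is never recomputed.
ncells <- 3
scores <- matrix(NA_real_, nrow = ncells, ncol = length(all_labels),
                 dimnames = list(NULL, all_labels))
for (r in seq_along(assigned)) {
  idx <- cbind(seq_len(ncells), match(assigned[[r]], all_labels))
  scores[idx] <- recomputed[[r]]
}

# The combined label is the column with the highest non-NA score,
# and its reference of origin follows from label_origin.
best_col <- unname(apply(scores, 1, which.max))
labels <- all_labels[best_col]
reference_index <- label_origin[best_col]
```

Each row of this mock scores matrix has exactly one non-NA entry per reference, matching the description of the field above.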
It is strongly recommended that the universe of genes be the same across all references in trained.
If this is not the case, the intersection of genes across all entries of trained will be used in the recomputation.
This at least provides a common feature space for comparing correlations,
though differences in the availability of markers between references may have unpredictable effects on the results
(and so a warning will be emitted by default, i.e., when warn.lost=TRUE).
That said, the intersection may be too stringent when dealing with many references with diverse feature annotations.
In such cases, we can set allow.lost=TRUE so that the recomputation for each reference will use all available markers in that reference.
The idea here is to avoid penalizing all references by removing an informative marker when it is only absent in a single reference.
We hope that the recomputed scores are still roughly comparable if the number of lost markers is relatively low,
helped by the use of ranks in the calculation of the Spearman-based scores to reduce the influence of individual markers.
This is perhaps not as reliable as one might hope, so setting allow.lost=TRUE should be considered a last resort.
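The difference between the two behaviours can be sketched with set operations in base R. This only illustrates the gene selection described above, not the package's internal code; the gene names and the variables genes_ref1, genes_ref2 and markers are made up.

```r
# Hypothetical gene universes of two references; g3 is missing from
# the second reference and g5 from the first.
genes_ref1 <- c("g1", "g2", "g3", "g4")
genes_ref2 <- c("g1", "g2", "g4", "g5")

# Hypothetical union of markers for the labels assigned to one cell.
markers <- c("g2", "g3", "g4")

# Default behaviour as described: restrict to the intersection of
# genes across all references, so g3 is lost for every reference.
universe <- Reduce(intersect, list(genes_ref1, genes_ref2))
markers_default <- intersect(markers, universe)

# With allow.lost=TRUE as described: each reference keeps every marker
# it actually has, so the first reference still uses g3.
markers_lost_ok <- lapply(
  list(ref1 = genes_ref1, ref2 = genes_ref2),
  function(g) intersect(markers, g)
)
```

In the second case the two references are scored on different gene sets, which is why the comparability of their scores is only approximate.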
Aaron Lun
Lun A, Bunis D, Andrews J (2020). Thoughts on a more scalable algorithm for multiple references. https://github.com/LTLA/SingleR/issues/94
SingleR and classifySingleR, for generating predictions to use in results.
combineCommonResults, for another approach to combining predictions.
# Making up data.
ref <- .mockRefData(nreps=8)
ref1 <- ref[,1:2%%2==0]
ref2 <- ref[,1:2%%2==1]
ref2$label <- tolower(ref2$label)

test <- .mockTestData(ref)

# Performing classification within each reference.
test <- scuttle::logNormCounts(test)

ref1 <- scuttle::logNormCounts(ref1)
train1 <- trainSingleR(ref1, labels=ref1$label)
pred1 <- classifySingleR(test, train1)

ref2 <- scuttle::logNormCounts(ref2)
train2 <- trainSingleR(ref2, labels=ref2$label)
pred2 <- classifySingleR(test, train2)

# Combining results with recomputation of scores.
combined <- combineRecomputedResults(
    results=list(pred1, pred2),
    test=test,
    trained=list(train1, train2))

combined[,1:5]