This document describes some important parameters of the UCell algorithm, and how they can be adapted depending on your dataset. Here we will use single-cell data stored in a Seurat object, but the same considerations apply to SingleCellExperiment or matrix input formats.
For this demo, we will download a single-cell dataset of lung cancer (Zilionis et al. (2019) Immunity) through the scRNAseq package. This dataset contains >170,000 single cells; for the sake of simplicity, in this demo we will focus on immune cells, according to the annotations by the authors, and subset to the first 5000 cells.
library(scRNAseq)
library(ggplot2)
lung <- ZilionisLungData()
immune <- lung$Used & lung$used_in_NSCLC_immune
lung <- lung[,immune]
lung <- lung[,1:5000]
exp.mat <- Matrix::Matrix(counts(lung),sparse = TRUE)
colnames(exp.mat) <- paste0(colnames(exp.mat), seq(1,ncol(exp.mat)))
Save it as a Seurat object:
library(Seurat)
seurat.object <- CreateSeuratObject(counts = exp.mat,
project = "Zilionis_immune")
seurat.object <- NormalizeData(seurat.object)
Note: because UCell scores are based on relative gene ranks, UCell can be applied to either raw counts or normalized data. As long as the normalization preserves the relative ranks between genes within each cell, the results will be equivalent.
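The rank-invariance argument in the note above can be checked on a toy example. The following base R sketch uses made-up counts for a single cell and a library-size scaling followed by log1p, assumed here as a stand-in for log-normalization. It does not call UCell or Seurat.

```r
# Toy raw counts for one cell, named by gene (values are made up).
counts <- c(CD8A = 0, CD8B = 3, CD4 = 1, LYZ = 12, NKG7 = 3)

# Library-size scaling followed by log1p: a monotonic transformation within the cell.
normalized <- log1p(counts / sum(counts) * 1e4)

# Rank genes from highest (rank 1) to lowest expression; ties share the average rank.
raw_ranks <- rank(-counts, ties.method = "average")
norm_ranks <- rank(-normalized, ties.method = "average")

# The two rankings are the same, so any rank-based score is unaffected.
same_ranks <- identical(raw_ranks, norm_ranks)
print(same_ranks)
```

A transformation that is not monotonic within a cell (for example, per-gene scaling across cells) can reorder genes, and would not be expected to give equivalent scores.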
UCell supports positive and negative gene sets within a signature. Simply append + or - signs to the genes to include them in positive and negative sets, respectively. For example:
signatures <- list(
CD8T = c("CD8A+","CD8B+","CD4-"),
CD4 = c("TRAC+","CD4+","CD40LG+","CD8A-","CD8B-"),
NK = c("KLRD1+","NCR1+","NKG7+","CD3D-","CD3E-")
)
UCell evaluates the positive and negative gene sets separately, then
subtracts the negative-set score from the positive-set score. The parameter w_neg controls the
relative weight of the negative gene set compared to the positive set
(w_neg=1.0 means equal weight). Note that negative values of the combined
score are set to zero, to preserve UCell scores in the [0, 1]
range.
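The combination rule described above can be sketched in a few lines of base R. The helper combine_scores and the per-cell scores below are hypothetical and only illustrate the weighted subtraction and clipping. This is not UCell's internal code.

```r
# Hypothetical helper: subtract the weighted negative-set score, then clip at zero.
combine_scores <- function(pos, neg, w_neg = 1.0) {
  pmax(pos - w_neg * neg, 0)
}

# Made-up positive-set and negative-set scores for three cells.
pos <- c(cell1 = 0.70, cell2 = 0.30, cell3 = 0.10)
neg <- c(cell1 = 0.00, cell2 = 0.10, cell3 = 0.40)

equal_weight <- combine_scores(pos, neg, w_neg = 1.0)
half_weight <- combine_scores(pos, neg, w_neg = 0.5)
print(rbind(equal_weight, half_weight))
```

In this toy case cell3 is clipped to zero under both weights, while cell2 is penalized less when w_neg is lowered.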
library(UCell)
seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
w_neg = 1.0, name = NULL)
scores <- seurat.object[[names(signatures)]]
head(scores,15)
## CD8T CD4 NK
## bcHTNA1 0.000000 0.14975523 0.000000
## bcHNVA2 0.000000 0.02503338 0.000000
## bcALZN3 0.000000 0.00000000 0.000000
## bcFWBP4 0.000000 0.00000000 0.000000
## bcBJYE5 0.000000 0.28627058 0.000000
## bcGSBJ6 0.000000 0.00000000 0.000000
## bcHQGJ7 0.000000 0.00000000 0.000000
## bcHKKM8 0.000000 0.21161549 0.000000
## bcIGQU9 0.000000 0.28649310 0.000000
## bcDVGG10 0.000000 0.21384068 0.000000
## bcEPCC11 0.707374 0.00000000 0.000000
## bcDOHD12 0.000000 0.00000000 0.000000
## bcFPZF13 0.000000 0.27403204 0.274032
## bcHRXV14 0.000000 0.00000000 0.000000
## bcFGME15 0.000000 0.31753449 0.000000
maxRank parameter
Single-cell data are sparse: for any given cell, only
a few hundred to a few thousand genes (out of tens of thousands) are
detected with at least one UMI count. Because UCell scores are based on
ranking genes by their expression values, it is essential to account for
data sparsity when calculating ranks. This is implemented by capping
ranks at a maxRank parameter: only the top
maxRank genes are ranked, and the rest are treated as
equivalent, at the lowest ranking value.
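Rank capping as described above can be illustrated on a single toy cell. In this base R sketch the expression vector and the cap max_rank are made up, and assigning max_rank + 1 to all genes beyond the cap is an assumption about how "the lowest ranking value" is encoded. It is not taken from UCell's source.

```r
# Toy expression vector for one cell: 10 genes, most of them undetected.
expr <- c(g1 = 9, g2 = 0, g3 = 4, g4 = 0, g5 = 0,
          g6 = 7, g7 = 0, g8 = 1, g9 = 0, g10 = 0)

# Rank genes from highest (rank 1) to lowest expression; ties share the average rank.
ranks <- rank(-expr, ties.method = "average")

# Cap the ranking: every gene beyond the top `max_rank` gets the same worst value.
max_rank <- 4  # hypothetical cap, equal to the number of detected genes here
capped <- pmin(ranks, max_rank + 1)
print(capped)
```

Without the cap, the six undetected genes would share an arbitrary tied rank that depends on how many genes are in the matrix. With the cap, they all sit at one fixed value just below the ranked genes.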
It is often useful to adjust maxRank depending on
the sparsity of your dataset. A good rule of thumb is to examine the
median number of expressed genes per cell, and set maxRank
to the same order of magnitude. For example, the test dataset
has relatively low depth, so it is advisable to choose a
maxRank around 800-1000 (instead of the default 1500).
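The rule of thumb above needs only the number of detected genes per cell. The sketch below computes it on a small simulated count matrix in base R. The matrix dimensions and values are made up and are not the lung dataset. For a Seurat object, the same per-cell quantity is typically available in the nFeature_RNA metadata column.

```r
# Simulated count matrix (genes x cells) with many zeros; values are made up.
set.seed(1)
counts <- matrix(rpois(200 * 50, lambda = 0.3), nrow = 200, ncol = 50,
                 dimnames = list(paste0("gene", 1:200), paste0("cell", 1:50)))

# Number of genes detected with at least one count in each cell.
genes_per_cell <- colSums(counts > 0)

# Median across cells: a rough guide for the order of magnitude of maxRank.
median_genes <- median(genes_per_cell)
print(median_genes)
```

Looking at the full distribution (for example with summary(genes_per_cell) or a histogram) is often more informative than the median alone when cell types differ strongly in complexity.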
This is even more important when applying UCell to
technologies/modalities of much lower dimensionality, for example
probe-based spatial transcriptomics data (e.g. Xenium, CosMx), or
antibody tags (ADT) in CITE-seq experiments. Xenium panels contain a few
hundred to a few thousand genes; CITE-seq can detect a few hundred
proteins, as opposed to thousands of genes in scRNA-seq. The
maxRank parameter should then be adapted to reflect
the new dimensionality, and set to at most the number of probes in
the panel.
If a subset of the genes in your signature are absent from the count matrix, how should they be handled?
UCell offers two alternative ways of handling missing genes:
missing_genes="impute" (default): assumes that
absence from the count matrix means zero expression, so all values for the missing
gene are imputed to zero. This can be appropriate for processed
scRNA-seq datasets deposited in public repositories, where poorly
detected genes are often dropped from the count matrix.
missing_genes="skip": simply excludes all missing genes
from the signatures; they won’t contribute to the scores.
Here’s an example with a missing gene:
signatures <- list(
Myeloid = c("LYZ","CSF1R","not_a_gene")
)
seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
missing_genes="impute")
scores1 <- seurat.object$Myeloid_UCell
seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
missing_genes="skip")
scores2 <- seurat.object$Myeloid_UCell
scores <- cbind(scores1, scores2)
head(scores)
## scores1 scores2
## bcHTNA1 0.3319982 0.4978312
## bcHNVA2 0.5263685 0.7892893
## bcALZN3 0.3333333 0.4998332
## bcFWBP4 0.0000000 0.0000000
## bcBJYE5 0.2078327 0.3116450
## bcGSBJ6 0.4755229 0.7130464
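In the output above, the "impute" scores are roughly two thirds of the "skip" scores, which is what one would expect when one of three signature genes always contributes as an undetected gene. The two strategies can be sketched on a toy matrix in base R. The matrix below is made up, and the code only shows how the inputs to scoring differ. It does not reproduce UCell's scoring.

```r
# Toy count matrix (genes x cells) that lacks one of the signature genes.
mat <- matrix(c(5, 0, 2,
                3, 1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("LYZ", "CSF1R"), c("cell1", "cell2", "cell3")))
signature <- c("LYZ", "CSF1R", "not_a_gene")
missing <- setdiff(signature, rownames(mat))

# "impute"-like handling: add an all-zero row for each missing gene,
# so the gene stays in the signature and counts as undetected in every cell.
zero_rows <- matrix(0, nrow = length(missing), ncol = ncol(mat),
                    dimnames = list(missing, colnames(mat)))
mat_impute <- rbind(mat, zero_rows)
sig_impute <- signature

# "skip"-like handling: drop missing genes from the signature instead.
sig_skip <- intersect(signature, rownames(mat))

print(mat_impute)
print(sig_skip)
```

Which option is appropriate depends on why the gene is absent: a gene filtered out for low detection is plausibly near zero, whereas a gene that was never measured (for example, not on a targeted panel) carries no information and is better skipped.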
UCell scores are calculated individually for each cell (though they
may be later smoothed by nearest-neighbor similarity). This means that computation can be
easily split into batches, reducing the memory footprint of gene
ranking and enabling parallel processing (see below). The size of the
batches is controlled by the chunk.size parameter. Large
chunks take up more RAM, while small chunks incur more overhead
from dataset splitting and merging. A sweet spot for
chunk.size is usually on the order of 100-1000 cells per
batch.
If your machine has multi-core capabilities and enough RAM, running
UCell in parallel can considerably speed up your analysis. UCell can run
on a single core, or you may parallelize it by setting
e.g. workers=8 to use 8 processes.
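Because each cell is scored independently, batching and parallel execution do not change the result. The sketch below illustrates this with base R's parallel package on a simulated matrix. The function score_cells is a hypothetical stand-in for a per-cell rank-based score, and chunk_size and n_workers are illustrative names. This is not UCell's own batching or parallel backend.

```r
library(parallel)

# Simulated count matrix (genes x cells); values are made up.
set.seed(42)
mat <- matrix(rpois(20 * 10, lambda = 1), nrow = 20,
              dimnames = list(paste0("g", 1:20), paste0("cell", 1:10)))

# Hypothetical per-cell score: mean rank of two genes within each cell.
score_cells <- function(m, genes = c("g1", "g2")) {
  apply(m, 2, function(x) mean(rank(-x)[genes]))
}

# Split the cell indices into batches of at most `chunk_size` cells.
chunk_size <- 4
idx <- seq_len(ncol(mat))
chunks <- split(idx, ceiling(idx / chunk_size))

# Score each batch independently. mc.cores = 1 runs serially;
# a larger value forks worker processes (forking is not supported on Windows).
n_workers <- 1
per_chunk <- mclapply(chunks, function(i) score_cells(mat[, i, drop = FALSE]),
                      mc.cores = n_workers)

# Merge the batches and compare with scoring all cells at once.
chunked <- unlist(per_chunk, use.names = FALSE)
names(chunked) <- colnames(mat)
same <- isTRUE(all.equal(chunked, score_cells(mat)))
print(same)
```

Only one batch per worker needs to be ranked in memory at a time, which is why smaller chunks lower peak RAM at the cost of more splitting and merging.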
To mitigate sparsity in single-cell data, it can be useful to
‘impute’ scores by neighboring cells. The function
SmoothKNN performs smoothing of single-cell scores by
weighted average of the k-nearest neighbors in a given dimensionality
reduction. A crucial parameter is the number of neighbors k
that are used for smoothing. A small k only borrows from
very close neighbors, whereas a large k takes weighted averages
over large portions of transcriptional space:
seurat.object <- NormalizeData(seurat.object)
seurat.object <- FindVariableFeatures(seurat.object,
selection.method = "vst", nfeatures = 500)
seurat.object <- ScaleData(seurat.object)
seurat.object <- RunPCA(seurat.object, npcs = 20,
features=VariableFeatures(seurat.object))
seurat.object <- RunUMAP(seurat.object, reduction = "pca",
dims = 1:20, seed.use=123)
signatures <- list(
Tcell = c("CD3D","CD3E","CD3G","CD2","TRAC"),
Myeloid = c("CD14","LYZ","CSF1R","FCER1G","SPI1","LCK-"),
NK = c("KLRD1","NCR1","NKG7","CD3D-","CD3E-"),
Plasma_cell = c("MZB1","DERL3","CD19-")
)
seurat.object <- AddModuleScore_UCell(seurat.object, features=signatures,
name=NULL)
seurat.object <- SmoothKNN(seurat.object, reduction="pca",
signature.names = names(signatures),
k=3, suffix = "_kNN3")
seurat.object <- SmoothKNN(seurat.object, reduction="pca",
signature.names = names(signatures),
k=100, suffix = "_kNN100")
FeaturePlot(seurat.object, reduction = "umap",
features = c("Tcell","Tcell_kNN3")) &
theme(aspect.ratio = 1)
FeaturePlot(seurat.object, reduction = "umap",
features = c("Tcell","Tcell_kNN100")) &
theme(aspect.ratio = 1)
The decay parameter controls the relative influence of
close vs distant neighbors. Lower the decay parameter to
increase the weight of distant neighbors; increase decay
to give higher weight to close neighbors.
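The roles of k and decay can be made concrete with a toy smoother in base R. The function smooth_knn below is hypothetical, and two choices in it are assumptions rather than confirmed SmoothKNN behavior: the cell itself is included in the average, and weights fall off geometrically with neighbor rank as (1 - decay)^n.

```r
# Toy data: 6 cells on a line (a 1-D "reduction") with noisy per-cell scores.
coords <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
scores <- c(0.9, 0.0, 0.8, 0.1, 0.0, 0.2)

# Hypothetical smoother: weighted average over each cell and its k nearest neighbors.
smooth_knn <- function(scores, coords, k = 2, decay = 0.1) {
  d <- as.matrix(dist(coords))            # pairwise Euclidean distances
  vapply(seq_along(scores), function(i) {
    nn <- order(d[i, ])[seq_len(k + 1)]   # the cell itself, then its k nearest neighbors
    w <- (1 - decay)^(0:k)                # assumed weighting: geometric decay with rank
    sum(w * scores[nn]) / sum(w)
  }, numeric(1))
}

flat <- smooth_knn(scores, coords, k = 2, decay = 0)     # plain mean of the neighborhood
steep <- smooth_knn(scores, coords, k = 2, decay = 0.9)  # dominated by the cell itself
print(round(rbind(raw = scores, flat = flat, steep = steep), 3))
```

With decay = 0 every neighbor counts equally, so the dropout-like zero in the second cell is pulled up toward its neighbors. With a high decay the smoothed values stay close to the raw scores.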
seurat.object <- SmoothKNN(seurat.object, reduction="pca",
signature.names = names(signatures),
k=100, decay=0.001, suffix = "_decay0.001")
seurat.object <- SmoothKNN(seurat.object, reduction="pca",
signature.names = names(signatures),
k=100, decay=0.5, suffix = "_decay0.5")
FeaturePlot(seurat.object, reduction = "umap",
features = c("Tcell_decay0.5","Tcell_decay0.001")) &
theme(aspect.ratio = 1)
Please report any issues at the UCell GitHub repository.
More demos available on the Bioc landing page and at the UCell demo repository.
If you find UCell useful, you may also check out the scGate package, which relies on UCell scores to automatically purify populations of interest based on gene signatures.
See also SignatuR for easy storing and retrieval of gene signatures.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.3.2 ggplot2_4.0.2
## [3] UCell_2.15.1 Seurat_5.4.0
## [5] SeuratObject_5.3.0 sp_2.2-0
## [7] scRNAseq_2.25.0 SingleCellExperiment_1.33.0
## [9] SummarizedExperiment_1.41.1 Biobase_2.71.0
## [11] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [13] IRanges_2.45.0 S4Vectors_0.49.0
## [15] BiocGenerics_0.57.0 generics_0.1.4
## [17] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [19] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.23 splines_4.5.2 later_1.4.5
## [4] BiocIO_1.21.0 bitops_1.0-9 filelock_1.0.3
## [7] tibble_3.3.1 polyclip_1.10-7 XML_3.99-0.22
## [10] fastDummies_1.7.5 lifecycle_1.0.5 httr2_1.2.2
## [13] globals_0.19.0 lattice_0.22-9 ensembldb_2.35.0
## [16] MASS_7.3-65 alabaster.base_1.11.2 magrittr_2.0.4
## [19] plotly_4.12.0 sass_0.4.10 rmarkdown_2.30
## [22] jquerylib_0.1.4 yaml_2.3.12 httpuv_1.6.16
## [25] otel_0.2.0 sctransform_0.4.3 spam_2.11-3
## [28] spatstat.sparse_3.1-0 reticulate_1.44.1 cowplot_1.2.0
## [31] pbapply_1.7-4 DBI_1.2.3 buildtools_1.0.0
## [34] RColorBrewer_1.1-3 abind_1.4-8 Rtsne_0.17
## [37] purrr_1.2.1 AnnotationFilter_1.35.0 RCurl_1.98-1.17
## [40] rappdirs_0.3.4 ggrepel_0.9.6 irlba_2.3.7
## [43] spatstat.utils_3.2-1 listenv_0.10.0 alabaster.sce_1.11.0
## [46] maketools_1.3.2 goftest_1.2-3 RSpectra_0.16-2
## [49] spatstat.random_3.4-4 fitdistrplus_1.2-6 parallelly_1.46.1
## [52] codetools_0.2-20 DelayedArray_0.37.0 tidyselect_1.2.1
## [55] UCSC.utils_1.7.1 farver_2.1.2 spatstat.explore_3.7-0
## [58] BiocFileCache_3.1.0 GenomicAlignments_1.47.0 jsonlite_2.0.0
## [61] BiocNeighbors_2.5.4 progressr_0.18.0 ggridges_0.5.7
## [64] survival_3.8-6 tools_4.5.2 ica_1.0-3
## [67] Rcpp_1.1.1 glue_1.8.0 gridExtra_2.3
## [70] SparseArray_1.11.10 xfun_0.56 GenomeInfoDb_1.47.2
## [73] dplyr_1.2.0 HDF5Array_1.39.0 gypsum_1.7.0
## [76] withr_3.0.2 BiocManager_1.30.27 fastmap_1.2.0
## [79] rhdf5filters_1.23.3 digest_0.6.39 R6_2.6.1
## [82] mime_0.13 scattermore_1.2 tensor_1.5.1
## [85] spatstat.data_3.1-9 RSQLite_2.4.6 cigarillo_1.1.0
## [88] h5mread_1.3.1 tidyr_1.3.2 data.table_1.18.2.1
## [91] rtracklayer_1.71.3 htmlwidgets_1.6.4 httr_1.4.7
## [94] S4Arrays_1.11.1 uwot_0.2.4 pkgconfig_2.0.3
## [97] gtable_0.3.6 blob_1.3.0 lmtest_0.9-40
## [100] S7_0.2.1 XVector_0.51.0 sys_3.4.3
## [103] htmltools_0.5.9 dotCall64_1.2 ProtGenerics_1.43.0
## [106] scales_1.4.0 alabaster.matrix_1.11.0 png_0.1-8
## [109] spatstat.univar_3.1-6 knitr_1.51 reshape2_1.4.5
## [112] rjson_0.2.23 nlme_3.1-168 curl_7.0.0
## [115] cachem_1.1.0 zoo_1.8-15 rhdf5_2.55.13
## [118] stringr_1.6.0 BiocVersion_3.23.1 KernSmooth_2.23-26
## [121] vipor_0.4.7 parallel_4.5.2 miniUI_0.1.2
## [124] AnnotationDbi_1.73.0 ggrastr_1.0.2 restfulr_0.0.16
## [127] pillar_1.11.1 grid_4.5.2 alabaster.schemas_1.11.0
## [130] vctrs_0.7.1 RANN_2.6.2 promises_1.5.0
## [133] dbplyr_2.5.1 xtable_1.8-4 cluster_2.1.8.2
## [136] beeswarm_0.4.0 evaluate_1.0.5 GenomicFeatures_1.63.1
## [139] cli_3.6.5 compiler_4.5.2 Rsamtools_2.27.0
## [142] rlang_1.1.7 crayon_1.5.3 future.apply_1.20.1
## [145] labeling_0.4.3 ggbeeswarm_0.7.3 plyr_1.8.9
## [148] stringi_1.8.7 deldir_2.0-4 viridisLite_0.4.3
## [151] alabaster.se_1.11.0 BiocParallel_1.45.0 Biostrings_2.79.4
## [154] lazyeval_0.2.2 spatstat.geom_3.7-0 Matrix_1.7-4
## [157] ExperimentHub_3.1.0 RcppHNSW_0.6.0 bit64_4.6.0-1
## [160] future_1.69.0 Rhdf5lib_1.33.0 KEGGREST_1.51.1
## [163] shiny_1.12.1 alabaster.ranges_1.11.0 AnnotationHub_4.1.0
## [166] ROCR_1.0-12 igraph_2.2.2 memoise_2.0.1
## [169] bslib_0.10.0 bit_4.6.0