scAnnotatR is an R package for cell type prediction on single cell RNA-sequencing data. Currently, this package supports data in the form of a Seurat object or a SingleCellExperiment object.
More information about the Seurat object can be found here: https://satijalab.org/seurat/
More information about the SingleCellExperiment object can be found here: https://osca.bioconductor.org/
scAnnotatR provides 2 main features:
The scAnnotatR package can be directly installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!require(scAnnotatR))
    BiocManager::install("scAnnotatR")
For more information, see https://bioconductor.org/install/.
The scAnnotatR package comes with several pre-trained models to classify cell types.
# load scAnnotatR into working space
library(scAnnotatR)
#> Loading required package: Seurat
#> Attaching SeuratObject
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
#>
#> Attaching package: 'SummarizedExperiment'
#> The following object is masked from 'package:SeuratObject':
#>
#> Assays
#> The following object is masked from 'package:Seurat':
#>
#> Assays
#> Warning: replacing previous import 'utils::findMatches' by
#> 'S4Vectors::findMatches' when loading 'AnnotationDbi'
#> Warning: replacing previous import 'ape::where' by 'dplyr::where' when loading
#> 'scAnnotatR'
The models are stored in the default_models object:
default_models <- load_models("default")
#> loading from cache
names(default_models)
#> [1] "B cells" "Plasma cells" "NK"
#> [4] "CD16 NK" "CD56 NK" "T cells"
#> [7] "CD4 T cells" "CD8 T cells" "Treg"
#> [10] "NKT" "ILC" "Monocytes"
#> [13] "CD14 Mono" "CD16 Mono" "DC"
#> [16] "pDC" "Endothelial cells" "LEC"
#> [19] "VEC" "Platelets" "RBC"
#> [22] "Melanocyte" "Schwann cells" "Pericytes"
#> [25] "Mast cells" "Keratinocytes" "alpha"
#> [28] "beta" "delta" "gamma"
#> [31] "acinar" "ductal" "Fibroblasts"
The default_models object is a named list of classifiers. Each classifier is an instance of the scAnnotatR S4 class. For example:
default_models[['B cells']]
#> An object of class scAnnotatR for B cells
#> * 31 marker genes applied: CD38, CD79B, CD74, CD84, RASGRP2, TCF3, SP140, MEF2C, DERL3, CD37, CD79A, POU2AF1, MVK, CD83, BACH2, LY86, CD86, SDC1, CR2, LRMP, VPREB3, IL2RA, BLK, IRF8, FLI1, MS4A1, CD14, MZB1, PTEN, CD19, MME
#> * Predicting probability threshold: 0.5
#> * No parent model
To identify cell types available in a dataset, we need to load the dataset as a Seurat or SingleCellExperiment object.
For this vignette, we use a small sample dataset that is available as a Seurat object as part of the package.
# load the example dataset
data("tirosh_mel80_example")
tirosh_mel80_example
#> An object of class Seurat
#> 91 features across 480 samples within 1 assay
#> Active assay: RNA (91 features, 0 variable features)
#> 1 dimensional reduction calculated: umap
The example dataset already contains the clustering results as part of the metadata. This is not necessary for the classification process.
head(tirosh_mel80_example[[]])
#> orig.ident nCount_RNA nFeature_RNA percent.mt
#> Cy80_II_CD45_B07_S883_comb SeuratProject 42.46011 8 0
#> Cy80_II_CD45_C09_S897_comb SeuratProject 74.35907 14 0
#> Cy80_II_CD45_H07_S955_comb SeuratProject 42.45392 8 0
#> Cy80_II_CD45_H09_S957_comb SeuratProject 63.47043 12 0
#> Cy80_II_CD45_B11_S887_comb SeuratProject 47.26798 9 0
#> Cy80_II_CD45_D11_S911_comb SeuratProject 69.12167 13 0
#> RNA_snn_res.0.8 seurat_clusters RNA_snn_res.0.5
#> Cy80_II_CD45_B07_S883_comb 4 4 2
#> Cy80_II_CD45_C09_S897_comb 4 4 2
#> Cy80_II_CD45_H07_S955_comb 4 4 2
#> Cy80_II_CD45_H09_S957_comb 4 4 2
#> Cy80_II_CD45_B11_S887_comb 4 4 2
#> Cy80_II_CD45_D11_S911_comb 1 1 1
To launch cell type identification, we simply call the classify_cells function. A detailed description of all parameters can be found through the function’s help page ?classify_cells.
Here we use only 3 classifiers (B cells, T cells and NK cells) to reduce the computational cost of this vignette. To apply all pretrained classifiers to a dataset, use cell_types = 'all'.
seurat.obj <- classify_cells(classify_obj = tirosh_mel80_example,
                             assay = 'RNA', slot = 'counts',
                             cell_types = c('B cells', 'NK', 'T cells'),
                             path_to_models = 'default')
#> loading from cache
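The classifiers to apply can be selected either by cell type name or by passing the classifier objects themselves, for example:
cell_types = c('B cells', 'T cells')
classifiers = c(default_models[['B cells']], default_models[['T cells']])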
The classify_cells function returns the input object but with additional columns in the metadata table.
# display the additional metadata fields
seurat.obj[[]][c(50:60), c(8:ncol(seurat.obj[[]]))]
#> B_cells_p B_cells_class NK_p
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb 0.007754246 no 0.4881285
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb 0.999385770 yes 0.4440553
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb 0.998317662 yes 0.4416114
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb 0.997774856 yes 0.4398997
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb 0.998874031 yes 0.4541005
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb 0.999944282 yes 0.4511450
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb 0.015978230 no 0.4841041
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb 0.099311534 no 0.4858084
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb 0.055754074 no 0.4924746
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb 0.048558881 no 0.5002238
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb 0.996979702 yes 0.4994867
#> NK_class T_cells_p T_cells_class
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb no 0.94205232 yes
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb no 0.11269306 no
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb no 0.09834696 no
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb no 0.22256938 no
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb no 0.12903487 no
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb no 0.27242536 no
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb no 0.94929624 yes
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb no 0.93390248 yes
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb no 0.98161289 yes
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb yes 0.96436674 yes
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb no 0.94848597 yes
#> predicted_cell_type
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb T cells
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb B cells
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb B cells
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb T cells
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb T cells
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb T cells
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb NK/T cells
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb B cells/T cells
#> most_probable_cell_type
#> cy80.Cd45.pos.PD1.pos.B09.S45.comb T cells
#> cy80.Cd45.pos.Pd1.neg.S366.H06.S366.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S202.A10.S202.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S201.A09.S201.comb B cells
#> cy80.Cd45.pos.Pd1.neg.S221.B05.S221.comb B cells
#> cy80.Cd45.pos.PD1.pos.A03.S15.comb B cells
#> cy80.Cd45.pos.PD1.pos.B11.S47.comb T cells
#> cy80.Cd45.pos.PD1.pos.S189.H09.S189.comb T cells
#> cy80.Cd45.pos.PD1.pos.A05.S17.comb T cells
#> cy80.Cd45.pos.PD1.pos.C02.S62.comb T cells
#> cy80.Cd45.pos.PD1.pos.D12.S96.comb B cells
New columns are:
predicted_cell_type: the predicted cell type, including any ambiguous assignments. In these cases, the possible cell types are separated by a “/”.
most_probable_cell_type: the most probable cell type, ignoring any ambiguous assignments.
columns with syntax [celltype]_p: the probability that a cell belongs to the given cell type. Unknown cell types are marked as NAs.
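How these columns relate to one another can be illustrated with a few lines of base R. This is only a sketch, not scAnnotatR's implementation: it copies three rows of probabilities from the table above, assumes a threshold of 0.5 for all three classifiers, and assumes that a cell passing no classifier is labelled NA. The names probs, cell_labels, threshold, passed, predicted and most_probable are illustrative.

```r
# Probabilities copied from three rows of the metadata table above
# (columns follow the [celltype]_p naming pattern).
probs <- data.frame(
  B_cells_p = c(0.007754246, 0.048558881, 0.996979702),
  NK_p      = c(0.4881285,   0.5002238,   0.4994867),
  T_cells_p = c(0.94205232,  0.96436674,  0.94848597)
)
cell_labels <- c("B cells", "NK", "T cells")
threshold <- 0.5  # assumption: same threshold for all three classifiers

# A cell passes a classifier when its probability reaches the threshold
passed <- as.matrix(probs) >= threshold

# Ambiguous label: every passing cell type joined by "/"
# (NA when nothing passes, which is an assumption of this sketch)
predicted <- apply(passed, 1, function(hit) {
  if (any(hit)) paste(cell_labels[hit], collapse = "/") else NA_character_
})

# Unambiguous label: the passing cell type with the highest probability
most_probable <- vapply(seq_len(nrow(probs)), function(i) {
  hit <- passed[i, ]
  if (!any(hit)) return(NA_character_)
  p <- unlist(probs[i, ])
  cell_labels[hit][which.max(p[hit])]
}, character(1))

data.frame(predicted, most_probable)
```

For these three rows the sketch gives the same predicted_cell_type and most_probable_cell_type values as the table: T cells, NK/T cells and B cells/T cells for the former, and T cells, T cells and B cells for the latter.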
The predicted cell types can now simply be visualized using the matching plotting functions. In this example, we use Seurat’s DimPlot function:
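The plotting call itself is not reproduced in this text. A minimal sketch, assuming seurat.obj is the object returned by classify_cells above and that the cells are coloured by the most_probable_cell_type column (predicted_cell_type could be used in the same way), might look like this:

```r
# Assumes `seurat.obj` was created by the classify_cells() call above.
# The choice of grouping column is an assumption of this sketch.
p <- Seurat::DimPlot(seurat.obj, group.by = "most_probable_cell_type")
p
```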
With the current number of cell classifiers, we identify cells belonging to 2 cell types (B cells and T cells) and to 2 subtypes of T cells (CD4+ T cells and CD8+ T cells). The other cells (red points) are not among the cell types that can be classified by the predefined classifiers. Hence, they have an empty label.
For a certain cell type, users can also view the prediction probability. Here we show an example of B cell prediction probability:
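The call that produces this plot is not reproduced in this text. One way to draw it, assuming seurat.obj is the classified object from above and using the B_cells_p metadata column that classify_cells added, is Seurat's FeaturePlot function:

```r
# Assumes `seurat.obj` was created by the classify_cells() call above.
# FeaturePlot can colour cells by a numeric metadata column such as B_cells_p.
p_prob <- Seurat::FeaturePlot(seurat.obj, features = "B_cells_p")
p_prob
```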
Cells with a higher predicted probability of being B cells are shown in a darker color, while lighter colors indicate a low or even zero probability. For the B cell classifier, the prediction probability threshold is currently 0.5, so cells with a prediction probability of 0.5 or above are predicted as B cells.
The automatic cell identification by scAnnotatR matches the traditional cell assignment, i.e. the approach based on canonical marker expression. As a simple example, we use CD19 and CD20 (MS4A1) to identify B cells:
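The marker plot is not reproduced in this text. A sketch of how it could be drawn, assuming seurat.obj is the classified object from above and that both genes are among the 91 features of the example dataset, is:

```r
# Assumes `seurat.obj` was created by the classify_cells() call above
# and that CD19 and MS4A1 are present in its RNA assay.
p_markers <- Seurat::FeaturePlot(seurat.obj,
                                 features = c("CD19", "MS4A1"),
                                 ncol = 2)
p_markers
```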
We see that the marker expression of B cells exactly overlaps the B cell prediction made by scAnnotatR.
sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scAnnotatR_1.6.0 SingleCellExperiment_1.22.0
#> [3] SummarizedExperiment_1.30.0 Biobase_2.60.0
#> [5] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0
#> [7] IRanges_2.34.0 S4Vectors_0.38.0
#> [9] BiocGenerics_0.46.0 MatrixGenerics_1.12.0
#> [11] matrixStats_0.63.0 SeuratObject_4.1.3
#> [13] Seurat_4.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.20 splines_4.3.0
#> [3] later_1.3.0 bitops_1.0-7
#> [5] filelock_1.0.2 tibble_3.2.1
#> [7] polyclip_1.10-4 hardhat_1.3.0
#> [9] pROC_1.18.0 rpart_4.1.19
#> [11] lifecycle_1.0.3 globals_0.16.2
#> [13] lattice_0.21-8 MASS_7.3-59
#> [15] magrittr_2.0.3 plotly_4.10.1
#> [17] sass_0.4.5 rmarkdown_2.21
#> [19] jquerylib_0.1.4 yaml_2.3.7
#> [21] httpuv_1.6.9 sctransform_0.3.5
#> [23] sp_1.6-0 spatstat.sparse_3.0-1
#> [25] reticulate_1.28 cowplot_1.1.1
#> [27] pbapply_1.7-0 DBI_1.1.3
#> [29] RColorBrewer_1.1-3 lubridate_1.9.2
#> [31] abind_1.4-5 zlibbioc_1.46.0
#> [33] Rtsne_0.16 purrr_1.0.1
#> [35] RCurl_1.98-1.12 nnet_7.3-18
#> [37] rappdirs_0.3.3 ipred_0.9-14
#> [39] lava_1.7.2.1 GenomeInfoDbData_1.2.10
#> [41] data.tree_1.0.0 ggrepel_0.9.3
#> [43] irlba_2.3.5.1 listenv_0.9.0
#> [45] spatstat.utils_3.0-2 goftest_1.2-3
#> [47] spatstat.random_3.1-4 fitdistrplus_1.1-11
#> [49] parallelly_1.35.0 leiden_0.4.3
#> [51] codetools_0.2-19 DelayedArray_0.26.0
#> [53] tidyselect_1.2.0 farver_2.1.1
#> [55] BiocFileCache_2.8.0 spatstat.explore_3.1-0
#> [57] jsonlite_1.8.4 caret_6.0-94
#> [59] e1071_1.7-13 ellipsis_0.3.2
#> [61] progressr_0.13.0 ggridges_0.5.4
#> [63] survival_3.5-5 iterators_1.0.14
#> [65] foreach_1.5.2 tools_4.3.0
#> [67] ica_1.0-3 Rcpp_1.0.10
#> [69] glue_1.6.2 prodlim_2023.03.31
#> [71] gridExtra_2.3 xfun_0.39
#> [73] dplyr_1.1.2 withr_2.5.0
#> [75] BiocManager_1.30.20 fastmap_1.1.1
#> [77] fansi_1.0.4 digest_0.6.31
#> [79] timechange_0.2.0 R6_2.5.1
#> [81] mime_0.12 colorspace_2.1-0
#> [83] scattermore_0.8 tensor_1.5
#> [85] spatstat.data_3.0-1 RSQLite_2.3.1
#> [87] utf8_1.2.3 tidyr_1.3.0
#> [89] generics_0.1.3 data.table_1.14.8
#> [91] recipes_1.0.6 class_7.3-21
#> [93] httr_1.4.5 htmlwidgets_1.6.2
#> [95] ModelMetrics_1.2.2.2 uwot_0.1.14
#> [97] pkgconfig_2.0.3 gtable_0.3.3
#> [99] timeDate_4022.108 blob_1.2.4
#> [101] lmtest_0.9-40 XVector_0.40.0
#> [103] htmltools_0.5.5 scales_1.2.1
#> [105] png_0.1-8 gower_1.0.1
#> [107] knitr_1.42 reshape2_1.4.4
#> [109] nlme_3.1-162 curl_5.0.0
#> [111] proxy_0.4-27 cachem_1.0.7
#> [113] zoo_1.8-12 stringr_1.5.0
#> [115] BiocVersion_3.17.1 KernSmooth_2.23-20
#> [117] parallel_4.3.0 miniUI_0.1.1.1
#> [119] AnnotationDbi_1.62.0 pillar_1.9.0
#> [121] grid_4.3.0 vctrs_0.6.2
#> [123] RANN_2.6.1 promises_1.2.0.1
#> [125] dbplyr_2.3.2 xtable_1.8-4
#> [127] cluster_2.1.4 evaluate_0.20
#> [129] cli_3.6.1 compiler_4.3.0
#> [131] rlang_1.1.0 crayon_1.5.2
#> [133] future.apply_1.10.0 labeling_0.4.2
#> [135] plyr_1.8.8 stringi_1.7.12
#> [137] viridisLite_0.4.1 deldir_1.0-6
#> [139] munsell_0.5.0 Biostrings_2.68.0
#> [141] lazyeval_0.2.2 spatstat.geom_3.1-0
#> [143] Matrix_1.5-4 patchwork_1.1.2
#> [145] bit64_4.0.5 future_1.32.0
#> [147] ggplot2_3.4.2 KEGGREST_1.40.0
#> [149] shiny_1.7.4 highr_0.10
#> [151] interactiveDisplayBase_1.38.0 AnnotationHub_3.8.0
#> [153] kernlab_0.9-32 ROCR_1.0-11
#> [155] igraph_1.4.2 memoise_2.0.1
#> [157] bslib_0.4.2 bit_4.0.5
#> [159] ape_5.7-1