Training model classifying a cell subtype from scRNA-seq data

Introduction

Apart from a basic model for a basic independent cell type, our methods also supports cell types considered as the children of another cell type. In this case, the so-called parent cell type must already be represented by a classification model.

Here we consider the model classifying child cell type as child model and of course, parent model is used for classification of parent cell type.

Child model is used to distinguish a particular child cell type from other children cell types of the same parent. Therefore, our methods will examine all cells by parent model before training and testing for child model. Only cells that are considered as parent cell type will be used to train and test the new model.

The basis for this approach is that while markers that are very well suited to differentiate a CD4+ T cell from a CD8+ T cell, these markers may worsen the differentiation of a T cell vs. other cells.

Parent model

A first prerequisite of training for a child model is the parent model. A parent model is of class scAnnotatR and must be available in the working space, among default pretrained models of the package or among trained models in a user supplied database.

In this example, we load B cells classifier in the package default models to our working place.

# use BiocManager to install from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# the scAnnotatR package
if (!require(scAnnotatR))
  BiocManager::install("scAnnotatR")

library(scAnnotatR)

default_models <- load_models('default')
#> loading from cache
classifier_B <- default_models[['B cells']]
classifier_B
#> An object of class scAnnotatR for B cells 
#> * 31 marker genes applied: CD38, CD79B, CD74, CD84, RASGRP2, TCF3, SP140, MEF2C, DERL3, CD37, CD79A, POU2AF1, MVK, CD83, BACH2, LY86, CD86, SDC1, CR2, LRMP, VPREB3, IL2RA, BLK, IRF8, FLI1, MS4A1, CD14, MZB1, PTEN, CD19, MME 
#> * Predicting probability threshold: 0.5 
#> * No parent model

Preparing train object and test object

Same as training for basic models, training for a child model also requires a train (Seurat/SCE) object and a test (Seurat/SCE) object. All objects must have a slot in meta data indicating the type of cells. Tag slot indicating parent cell type can also be provided. In this case, parent cell type tag will further be tested for coherence with the provided parent classifier.

Cell tagged as child cell type but incoherent to parent cell type will be removed from training and testing for the child cell type classifier.

In this vignette, we use the human lung dataset from Zilionis et al., 2019, which is available in the scRNAseq (2.4.0) library. The dataset is stored as a SCE object.

To start the training workflow, we first load the neccessary libraries.

# we use the scRNAseq package to load example data
if (!require(scRNAseq))
  BiocManager::install("scRNAseq")

library(scRNAseq)

First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.

zilionis <- ZilionisLungData()
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
zilionis <- zilionis[, 1:5000]

We split this dataset into two parts, one for the training and the other for the testing.

pivot = ncol(zilionis)%/%2
train_set <- zilionis[, 1:pivot]
test_set <- zilionis[, (1+pivot):ncol(zilionis)]

In this dataset, the cell type meta data is stored in the Most likely LM22 cell type slot of the SingleCellExperiment object (in both the train object and test object).

If the cell type is stored not stored as the default identification (set through Idents for Seurat object) the slot must be set as a parameter in the training and testing function (see below).

table(train_set$`Most likely LM22 cell type`)
#> 
#>               B cells memory                B cells naive 
#>                           36                            2 
#>    Dendritic cells activated      Dendritic cells resting 
#>                           31                           41 
#>                  Eosinophils               Macrophages M0 
#>                           14                          155 
#>               Macrophages M1               Macrophages M2 
#>                           18                          116 
#>         Mast cells activated           Mast cells resting 
#>                           63                           13 
#>                    Monocytes           NK cells activated 
#>                          130                            5 
#>             NK cells resting                  Neutrophils 
#>                           20                           86 
#>                 Plasma cells T cells CD4 memory activated 
#>                           32                           37 
#>   T cells CD4 memory resting                  T cells CD8 
#>                          215                           28 
#>    T cells follicular helper   T cells regulatory (Tregs) 
#>                           29                           23

table(test_set$`Most likely LM22 cell type`)
#> 
#>               B cells memory                B cells naive 
#>                           36                            3 
#>    Dendritic cells activated      Dendritic cells resting 
#>                           35                           35 
#>                  Eosinophils               Macrophages M0 
#>                           16                          185 
#>               Macrophages M1               Macrophages M2 
#>                            7                           92 
#>         Mast cells activated           Mast cells resting 
#>                           79                           14 
#>                    Monocytes           NK cells activated 
#>                          135                           11 
#>             NK cells resting                  Neutrophils 
#>                           21                           83 
#>                 Plasma cells T cells CD4 memory activated 
#>                           51                           39 
#>   T cells CD4 memory resting                  T cells CD8 
#>                          195                           21 
#>    T cells follicular helper   T cells regulatory (Tregs) 
#>                           31                           25

Unlike the example of the training basic model, we will remove all NAs cells in order to reduce the computational complexity for this example.

# remove NAs cells
train_set <- train_set[, !is.na(train_set$`Most likely LM22 cell type`)]
test_set <- test_set[, !is.na(test_set$`Most likely LM22 cell type`)]

# convert cell label: 
# 1 - positive to plasma cells, 
# 0 - negative to plasma cells
train_set$plasma <- unlist(lapply(train_set$`Most likely LM22 cell type`,
                                  function(x) if (x == 'Plasma cells') {1} else {0}))

test_set$plasma <- unlist(lapply(test_set$`Most likely LM22 cell type`,
                                 function(x) if (x == 'Plasma cells') {1} else {0}))

We may want to check the number of cells in each category:

table(train_set$plasma)
#> 
#>    0    1 
#> 1062   32
# 1: plasma cells, 0: not plasma cells

Defining set of marker genes

Next, we define a set of marker genes, which will be used for the child classification model. Supposing we are training a model for classifying plasma cells, we define the set of marker genes as follows:

selected_marker_genes_plasma <- c('BACH2', 'BLK', 'CD14', 'CD19', 'CD27', 'CD37', 
'CD38', 'CD40LG', 'CD74', 'CD79A', 'CD79B', 'CD83', 'CD84', 'CD86', 'CR2', 
'DERL3', 'FLI1', 'IGHG1', 'IGHG2', 'IGHM', 'IL2RA', 'IRF8', 'LRMP', 'LY86', 
'MCL1', 'MEF2C', 'MME', 'MS4A1', 'MVK', 'MZB1', 'POU2AF1', 'PTEN', 'RASGRP2', 
'SDC1', 'SP140', 'TCF3', 'VPREB3')

Train model

Training for a child model needs more parameters than training a basic one. Users must indicate the parent classifier. There are three ways to indicate the parent classifier to the train method:

Users can use an available model in current working place.
Users can give name of a model among default pretrained models, for example: parent_cell = ‘B cells’
Users can give name of a model among models available in users’ database AND the path to that database, for example: parent_cell = 'B cells', path_to_models = '.'

Train the child classifier:

set.seed(123)
classifier_plasma <- train_classifier(train_obj = train_set, 
marker_genes = selected_marker_genes_plasma, cell_type = "Plasma cells", 
assay = 'counts', tag_slot = 'plasma', parent_classifier = classifier_B)
#> Apply pretrained model for parent cell type.
#> Warning: Some annotated Plasma cells are negative to B cells classifier. They are removed from training/testing for Plasma cells classifier.
#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

If the cells classifier has not been loaded to the current working space, an equivalent training process should be:

set.seed(123)
classifier_plasma <- train_classifier(train_obj = train_set, 
marker_genes = selected_marker_genes_plasma, cell_type = "Plasma cells", 
assay = 'counts', tag_slot = 'plasma', parent_cell = 'B cells')
#> Parent classifier not provided. Try finding available model.
#> loading from cache
#> Apply pretrained model for parent cell type.
#> Warning: Some annotated Plasma cells are negative to B cells classifier. They are removed from training/testing for Plasma cells classifier.
#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

classifier_plasma
#> An object of class scAnnotatR for Plasma cells 
#> * 37 marker genes applied: BACH2, BLK, CD14, CD19, CD27, CD37, CD38, CD40LG, CD74, CD79A, CD79B, CD83, CD84, CD86, CR2, DERL3, FLI1, IGHG1, IGHG2, IGHM, IL2RA, IRF8, LRMP, LY86, MCL1, MEF2C, MME, MS4A1, MVK, MZB1, POU2AF1, PTEN, RASGRP2, SDC1, SP140, TCF3, VPREB3 
#> * Predicting probability threshold: 0.5 
#> * A child model of: B cells

caret_model(classifier_plasma)
#> Support Vector Machines with Linear Kernel 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 83, 83, 84, 84, 83, 85, ... 
#> Addtional sampling using down-sampling
#> 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9205556  0.7647773
#> 
#> Tuning parameter 'C' was held constant at a value of 1

Test model

The parent classifier must be also set in test method.

classifier_plasma_test <- test_classifier(test_obj = test_set, 
classifier = classifier_plasma, assay = 'counts', tag_slot = 'plasma', 
parent_classifier = classifier_B)
#> Apply pretrained model for parent cell type.
#> Warning: Some annotated Plasma cells are negative to B cells classifier. They are removed from training/testing for Plasma cells classifier.
#> Current probability threshold: 0.5
#>          Positive    Negative    Total
#> Actual       24      73      97
#> Predicted    21      76      97
#> Accuracy: 0.948453608247423
#> Sensivity (True Positive Rate) for Plasma cells: 0.833333333333333
#> Specificity (1 - False Positive Rate) for Plasma cells: 0.986301369863014
#> Area under the curve: 0.913812785388128

Interpreting test model result

The test result obtained from a child model can be interpreted in the same way as we do with the model for basic cell types. We can change the prediction probability threshold according to the research project or personal preference and plot a roc curve.

print(classifier_plasma_test$auc)
#> Area under the curve: 0.9138
roc_curve <- plot_roc_curve(test_result = classifier_plasma_test)
plot(roc_curve)

Save child classification model for further use

In order to save child classifier, the parent classifier must already exist in the classifier database, either in the package default database or in the user-defined database.

# see list of available model in package
default_models <- load_models('default')
#> loading from cache
names(default_models)
#>  [1] "B cells"           "Plasma cells"      "NK"               
#>  [4] "CD16 NK"           "CD56 NK"           "T cells"          
#>  [7] "CD4 T cells"       "CD8 T cells"       "Treg"             
#> [10] "NKT"               "ILC"               "Monocytes"        
#> [13] "CD14 Mono"         "CD16 Mono"         "DC"               
#> [16] "pDC"               "Endothelial cells" "LEC"              
#> [19] "VEC"               "Platelets"         "RBC"              
#> [22] "Melanocyte"        "Schwann cells"     "Pericytes"        
#> [25] "Mast cells"        "Keratinocytes"     "alpha"            
#> [28] "beta"              "delta"             "gamma"            
#> [31] "acinar"            "ductal"            "Fibroblasts"

In our package, the default models already include a model classifying plasma cells. Therefore, we will save this model to a new local database specified by the path_to_models parameter. If you start with a fresh new local database, there is no available parent classifier of plasma cells’ classifier. Therefore, we have to save the parent classifier first, e.g. the classifier for B cells.

# no copy of pretrained models is performed
save_new_model(new_model = classifier_B, path_to_models = tempdir(), 
               include.default = FALSE) 
#> Saving new models to /tmp/RtmpCyelaV/new_models.rda...
#> Finished saving new model
save_new_model(new_model = classifier_plasma, path_to_models = tempdir(), 
               include.default = FALSE) 
#> Saving new models to /tmp/RtmpCyelaV/new_models.rda...
#> Finished saving new model

Applying newly trained models for cell classification

When we save the B cells’ classifier and the plasma cells’ classifier, a local database is newly created. We can use this new database to classify cells in a Seurat or SingleCellExperiment object.

Let’s try to classify cells in the test set:

classified <- classify_cells(classify_obj = test_set, assay = 'counts', 
                             cell_types = 'all', path_to_models = tempdir())

Using the classify_cells() function, we have to indicate exactly the repository containing the database that the models has recently been saved to. In the previous section, we saved our new models to the current working directory.

In the classified object, the classification process added new columns to the cell meta data, including the predicted_cell_type and most_probable_cell_type columns.

If we use the full prediction to compare with actual plasma tag, we obtain this result:

# compare the prediction with actual cell tag
table(classified$predicted_cell_type, classified$plasma)
#>                       
#>                          0   1
#>   B cells               72   4
#>   B cells/Plasma cells   1  20
#>   unknown              990  27
# plasma cell is child cell type of B cell
# so of course, all predicted plasma cells are predicted B cells

When comparing the actual tag with the most probable prediction, we obtain:

# compare the prediction with actual cell tag
table(classified$most_probable_cell_type, classified$plasma)
#>               
#>                  0   1
#>   B cells       72   4
#>   Plasma cells   1  20
#>   unknown      990  27

The number of identified plasma cells is different in the predicted_cell_type slot and in the most_probable_cell_type. This is because the predicted_cell_type takes all predictions having the probabilities satisfying the corresponding probability thresholds. Meanwhile, the most_probable_cell_type takes only the cell type which gives highest prediction probability.

To have all plasma cells specified as plasma cells, we can set the ignore_ambiguous_result to TRUE. This option will hide all ambiguous predictions where we have more than one possible cell type. In the parent-chid(ren) relationship of cell types, the more specified cell types/phenotypes will be reported.

classified <- classify_cells(classify_obj = test_set, assay = 'counts',
                             cell_types = 'all', path_to_models = tempdir(),
                             ignore_ambiguous_result = TRUE)
table(classified$predicted_cell_type, classified$plasma)
#>            
#>               0   1
#>   B cells    72   4
#>   ambiguous   1  20
#>   unknown   990  27

Session Info

sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] caret_6.0-94                lattice_0.21-8             
#>  [3] ggplot2_3.4.2               scRNAseq_2.13.0            
#>  [5] scAnnotatR_1.6.0            SingleCellExperiment_1.22.0
#>  [7] SummarizedExperiment_1.30.0 Biobase_2.60.0             
#>  [9] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
#> [11] IRanges_2.34.0              S4Vectors_0.38.0           
#> [13] BiocGenerics_0.46.0         MatrixGenerics_1.12.0      
#> [15] matrixStats_0.63.0          SeuratObject_4.1.3         
#> [17] Seurat_4.3.0               
#> 
#> loaded via a namespace (and not attached):
#>   [1] ProtGenerics_1.32.0           spatstat.sparse_3.0-1        
#>   [3] bitops_1.0-7                  lubridate_1.9.2              
#>   [5] httr_1.4.5                    RColorBrewer_1.1-3           
#>   [7] tools_4.3.0                   sctransform_0.3.5            
#>   [9] utf8_1.2.3                    R6_2.5.1                     
#>  [11] lazyeval_0.2.2                uwot_0.1.14                  
#>  [13] withr_2.5.0                   sp_1.6-0                     
#>  [15] prettyunits_1.1.1             gridExtra_2.3                
#>  [17] progressr_0.13.0              cli_3.6.1                    
#>  [19] spatstat.explore_3.1-0        labeling_0.4.2               
#>  [21] sass_0.4.5                    spatstat.data_3.0-1          
#>  [23] proxy_0.4-27                  ggridges_0.5.4               
#>  [25] pbapply_1.7-0                 Rsamtools_2.16.0             
#>  [27] parallelly_1.35.0             RSQLite_2.3.1                
#>  [29] generics_0.1.3                BiocIO_1.10.0                
#>  [31] ica_1.0-3                     spatstat.random_3.1-4        
#>  [33] dplyr_1.1.2                   Matrix_1.5-4                 
#>  [35] fansi_1.0.4                   abind_1.4-5                  
#>  [37] lifecycle_1.0.3               yaml_2.3.7                   
#>  [39] recipes_1.0.6                 BiocFileCache_2.8.0          
#>  [41] Rtsne_0.16                    grid_4.3.0                   
#>  [43] blob_1.2.4                    promises_1.2.0.1             
#>  [45] ExperimentHub_2.8.0           crayon_1.5.2                 
#>  [47] miniUI_0.1.1.1                cowplot_1.1.1                
#>  [49] GenomicFeatures_1.52.0        KEGGREST_1.40.0              
#>  [51] pillar_1.9.0                  knitr_1.42                   
#>  [53] rjson_0.2.21                  future.apply_1.10.0          
#>  [55] codetools_0.2-19              leiden_0.4.3                 
#>  [57] glue_1.6.2                    data.table_1.14.8            
#>  [59] vctrs_0.6.2                   png_0.1-8                    
#>  [61] gtable_0.3.3                  kernlab_0.9-32               
#>  [63] cachem_1.0.7                  gower_1.0.1                  
#>  [65] xfun_0.39                     mime_0.12                    
#>  [67] prodlim_2023.03.31            survival_3.5-5               
#>  [69] timeDate_4022.108             iterators_1.0.14             
#>  [71] hardhat_1.3.0                 lava_1.7.2.1                 
#>  [73] interactiveDisplayBase_1.38.0 ellipsis_0.3.2               
#>  [75] fitdistrplus_1.1-11           ROCR_1.0-11                  
#>  [77] ipred_0.9-14                  nlme_3.1-162                 
#>  [79] bit64_4.0.5                   progress_1.2.2               
#>  [81] filelock_1.0.2                RcppAnnoy_0.0.20             
#>  [83] data.tree_1.0.0               bslib_0.4.2                  
#>  [85] irlba_2.3.5.1                 KernSmooth_2.23-20           
#>  [87] rpart_4.1.19                  colorspace_2.1-0             
#>  [89] DBI_1.1.3                     nnet_7.3-18                  
#>  [91] tidyselect_1.2.0              bit_4.0.5                    
#>  [93] compiler_4.3.0                curl_5.0.0                   
#>  [95] xml2_1.3.3                    DelayedArray_0.26.0          
#>  [97] plotly_4.10.1                 rtracklayer_1.60.0           
#>  [99] scales_1.2.1                  lmtest_0.9-40                
#> [101] rappdirs_0.3.3                stringr_1.5.0                
#> [103] digest_0.6.31                 goftest_1.2-3                
#> [105] spatstat.utils_3.0-2          rmarkdown_2.21               
#> [107] XVector_0.40.0                htmltools_0.5.5              
#> [109] pkgconfig_2.0.3               highr_0.10                   
#> [111] dbplyr_2.3.2                  fastmap_1.1.1                
#> [113] ensembldb_2.24.0              rlang_1.1.0                  
#> [115] htmlwidgets_1.6.2             shiny_1.7.4                  
#> [117] farver_2.1.1                  jquerylib_0.1.4              
#> [119] zoo_1.8-12                    jsonlite_1.8.4               
#> [121] BiocParallel_1.34.0           ModelMetrics_1.2.2.2         
#> [123] RCurl_1.98-1.12               magrittr_2.0.3               
#> [125] GenomeInfoDbData_1.2.10       patchwork_1.1.2              
#> [127] munsell_0.5.0                 Rcpp_1.0.10                  
#> [129] ape_5.7-1                     reticulate_1.28              
#> [131] stringi_1.7.12                pROC_1.18.0                  
#> [133] zlibbioc_1.46.0               MASS_7.3-59                  
#> [135] AnnotationHub_3.8.0           plyr_1.8.8                   
#> [137] parallel_4.3.0                listenv_0.9.0                
#> [139] ggrepel_0.9.3                 deldir_1.0-6                 
#> [141] Biostrings_2.68.0             splines_4.3.0                
#> [143] tensor_1.5                    hms_1.1.3                    
#> [145] igraph_1.4.2                  spatstat.geom_3.1-0          
#> [147] reshape2_1.4.4                biomaRt_2.56.0               
#> [149] BiocVersion_3.17.1            XML_3.99-0.14                
#> [151] evaluate_0.20                 BiocManager_1.30.20          
#> [153] foreach_1.5.2                 httpuv_1.6.9                 
#> [155] RANN_2.6.1                    tidyr_1.3.0                  
#> [157] purrr_1.0.1                   polyclip_1.10-4              
#> [159] future_1.32.0                 scattermore_0.8              
#> [161] xtable_1.8-4                  AnnotationFilter_1.24.0      
#> [163] restfulr_0.0.15               e1071_1.7-13                 
#> [165] later_1.3.0                   viridisLite_0.4.1            
#> [167] class_7.3-21                  tibble_3.2.1                 
#> [169] memoise_2.0.1                 AnnotationDbi_1.62.0         
#> [171] GenomicAlignments_1.36.0      cluster_2.1.4                
#> [173] timechange_0.2.0              globals_0.16.2