1 Introduction

ILoReg is a novel tool for cell population identification from single-cell RNA-seq (scRNA-seq) data. In our study [1], we showed that ILoReg was able to identify, by both unsupervised clustering and visually, rare cell populations that other scRNA-seq data analysis pipelines were unable to identify.

The figure below illustrates the workflows of ILoReg and a typical pipeline that applies feature selection prior to dimensionality reduction by principal component analysis (PCA).

Figure: Workflows of ILoReg and a feature-selection based approach.

In contrast to most scRNA-seq data analysis pipelines, ILoReg does not reduce the dimensionality of the gene expression matrix by feature selection. Instead, it performs probabilistic feature extraction using iterative clustering projection (ICP), yielding an \(N \times k\) -dimensional probability matrix, which contains probabilities of each of the \(N\) cells belonging to the \(k\) clusters. ICP is a novel self-supervised learning algorithm that iteratively seeks a clustering with \(k\) clusters that maximizes the adjusted Rand index (ARI) between the clustering \(C\) and its projection \(C'\) by L1-regularized logistic regression. In the ILoReg consensus approach, ICP is run \(L\) times and the \(L\) probability matrices are merged into a joint probability matrix and subsequently transformed by principal component analysis (PCA) into a lower dimensional (\(N \times p\)) matrix (consensus matrix). The final clustering step is performed using hierarhical clustering by the Ward’s method, after which the user can extract a clustering with \(K\) consensus clusters. Two-dimensional visualization is supported using two popular nonlinear dimensionality reduction methods: t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP). Additionally, ILoReg provides user-friendly functions that enable identification of differentially expressed (DE) genes and visualization of gene expression.

2 Installation

ILoReg can be downloaded from Bioconductor and installed by executing the following command in the R console.

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ILoReg")

3 Example: Peripheral Blood Mononuclear Cells

In the following, we go through the different steps of ILoReg’s workflow and demonstrate it using a peripheral blood mononuclear cell (PBMC) dataset. The toy dataset included in the ILoReg R package (pbmc3k_500) contains 500 cells that have been downsampled from the pbmc3k dataset [2]. The preprocessing was rerun with a newer reference genome (GRCh38.p12) and Cell Ranger v2.2.0 [3].

3.1 Setup a SingleCellExperiment object and prepare it for ILoReg analysis

The only required input for ILoReg is a gene expression matrix that has been normalized for the library size, with genes/features in rows and cells/samples in columns. The input can be of matrix, data.frame or dgCMatrix class, which is then transformed into a sparse object of dgCMatrix class. Please note that the method has been designed to work with sparse data, i.e. with a high proportion of zero values. If, for example, the features of your dataset have been standardized, the run time and the memory usage of ILoReg will likely be much higher.

suppressMessages(library(ILoReg))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(cowplot))
# The dataset was normalized using the LogNormalize method from the Seurat R package.
sce <- SingleCellExperiment(assays = list(logcounts = pbmc3k_500))
sce <- PrepareILoReg(sce)

## Data in `logcounts` slot already of `dgCMatrix` class...

## 13865/13865 genes remain after filtering genes with only zero values.

3.2 Run the ICP clustering algorithm \(L\) times

Running ICP \(L\) times in parallel is the most computationally demanding part of the workflow.

In the following, we give a brief summary of the parameters.

\(k\): The number of initial clusters in ICP (default \(15\)). Along with decreasing \(d\), increasing \(k\) increases the resolution of the outcome, i.e. more sub-populations with subtle differences are identifiable in the result.
\(d\): A real number greater than \(0\) and smaller than \(1\) that determines how many cells \(n\) are down- or oversampled from each cluster into the training data (\(n= \lceil Nd/k \rceil\)), where \(N\) is the total number of cells (default \(0.3\)). Decreasing \(d\) below \(0.2\) is not recommended due to the increased risk of ICP becoming unstable (\(k\) starts to decrease during the iteration). By contrast, increasing \(d\) above 0.3 will generate more dissimilar ICP runs, which will decrease the resolution of the result.
\(C\): A positive real number that rules the trade-off between correct classification and regularization in L1-regularized logistic regression: \[\displaystyle \min_w {\Vert w \Vert}_1 + C \sum_{i=1}^{n} \log (1+ e^{-y_i w^T w})\] with the default value being \(0.3\). Decreasing \(C\) increases the stringency of the L1-regularized feature selection, i.e. less genes are selected into the logistic regression model. With a lower \(C\) the outcome will be determined by fewer genes.
\(r\): A positive integer that denotes the maximum number of reiterations performed until the ICP algorithm stops (default \(500\)).
\(L\): The number of ICP runs. The default is \(200\), which should be generally used in all situations. For the toy dataset used in this example \(L=30\) is enough.
\(reg.type\): “L1” or “L2”. “L2” denotes L2-regularization (ridge regression). The default is “L1” (lasso regresssion).
\(threads\): The number of threads to use in parallel computing. The default is \(0\): use all available threads but one. The parallelization can be disabled with \(threads=1\).

As general guidelines on how to fine-tune the parameters, we recommend leaving \(C\), \(r\) and \(L\) as their defaults (\(C=0.3\), \(r=5\) and \(L=200\)). However, increasing \(k\) from 15 to, for example, 30 can reveal new cell subsets that are of potential interest. Regarding \(d\), increasing it to somewhere between 0.4-0.6 helps if the user wants lower resolution (less distinguishable populations). Setting \(reg.type="L2"\) disables the feature selection in L1-regularization, and all the genes weights are consequently non-zero in the model, which typically leads to a lower resolution.

# ICP is stochastic. To obtain reproducible results, use set.seed().
set.seed(1)
# Run ICP L times. This is  the slowest step of the workflow, 
# and parallel processing can be used to greatly speed it up.
sce <- RunParallelICP(object = sce, k = 15,
                      d = 0.3, L = 30, 
                      r = 5, C = 0.3, 
                      reg.type = "L1", threads = 0)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================| 100%

3.3 PCA transformation of the joint probability matrix

The \(L\) probability matrices are merged into a joint probability matrix, which is then transformed into a lower dimensionality by PCA. Before applying PCA, the user can optionally scale the cluster probabilities to unit-variance.

# p = number of principal components
sce <- RunPCA(sce,p=50,scale = FALSE)

Optional: PCA requires the user to specify the number of principal components, for which we selected the default value \(p=50\). To aid in decision making, the elbow plot is commonly used to seek an elbow point, of which proximity the user selects \(p\). In this case the point would be close to \(p=10\). Trying both a \(p\) that is close to the elbow point and the default \(p=50\) is recommended.

PCAElbowPlot(sce)

3.4 Nonlinear dimensionality reduction

To visualize the data in two-dimensional space, nonlinear dimensionality reduction is performed using t-SNE or UMAP. The input data for this step is the \(N \times p\) -dimensional consensus matrix.

sce <- RunUMAP(sce)
sce <- RunTSNE(sce,perplexity=30)

3.5 Gene expression visualization

Visualize the t-SNE and UMAP transformations using the GeneScatterPlot function, highlighting expression levels of CD3D (T cells), CD79A (B cells), CST3 (monocytes, dendritic cells, platelets), FCER1A (myeloid dendritic cells).

GeneScatterPlot(sce,c("CD3D","CD79A","CST3","FCER1A"),
                dim.reduction.type = "umap",
                point.size = 0.3)

GeneScatterPlot(sce,c("CD3D","CD79A","CST3","FCER1A"),
                dim.reduction.type = "tsne",
                point.size = 0.3)

3.6 Hierarchical clustering using the Ward’s method

The \(N \times p\) -dimensional consensus matrix is hierarchically clustered using the Ward’s method.

sce <- HierarchicalClustering(sce)

3.7 Extracting a consensus clustering with \(K\) clusters

After the hierarchical clustering, the user needs to define how many consensus clusters (\(K\)) to extract from the tree dendrogram. The SelectKClusters function enables extracting a consensus clustering with \(K\) clusters. Please note that the clustering is overwritten every time the function is called.

# Extract K=13 clusters.
sce <- SelectKClusters(sce,K=13)

Next, we use the ClusteringScatterPlot function to draw the t-SNE and UMAP transformations and color each cell according to the cluster labels.

# Use plot_grid function from the cowplot R package to combine the two plots into one.
plot_grid(ClusteringScatterPlot(sce,
                                dim.reduction.type = "umap",
                                return.plot = TRUE,
                                title = "UMAP",
                                show.legend=FALSE),
          ClusteringScatterPlot(sce,
                                dim.reduction.type = "tsne",
                                return.plot = TRUE
                                ,title="t-SNE",
                                show.legend=FALSE),
          ncol = 1
)

3.8 Identification of gene markers

TheILoReg R package provides functions for the identification of gene markers of clusters. This is accomplished by DE analysis, where gene expression levels of the cells from each cluster are compared against the rest of the cells. Currently, the only supported statistical test is the the Wilcoxon rank-sum test (aka Mann-Whitney U test). The p-values are corrected for multiple comparisons using the Bonferroni method. To accelerate the analysis, genes that are less likely to be DE can be filtered out prior to the statistical testing using multiple criteria.

gene_markers <- FindAllGeneMarkers(sce,
                                   clustering.type = "manual",
                                   test = "wilcox",
                                   log2fc.threshold = 0.25,
                                   min.pct = 0.25,
                                   min.diff.pct = NULL,
                                   pseudocount.use = 1,
                                   min.cells.group = 3,
                                   return.thresh = 0.01,
                                   only.pos = TRUE,
                                   max.cells.per.cluster = NULL)

## -----------------------------------
## testing cluster 1
## 885 genes left after min.pct filtering
## 885 genes left after min.diff.pct filtering
## 152 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 2
## 873 genes left after min.pct filtering
## 873 genes left after min.diff.pct filtering
## 279 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 3
## 858 genes left after min.pct filtering
## 858 genes left after min.diff.pct filtering
## 194 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 4
## 973 genes left after min.pct filtering
## 973 genes left after min.diff.pct filtering
## 430 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 5
## 1366 genes left after min.pct filtering
## 1366 genes left after min.diff.pct filtering
## 679 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 6
## 1005 genes left after min.pct filtering
## 1005 genes left after min.diff.pct filtering
## 378 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 7
## 993 genes left after min.pct filtering
## 993 genes left after min.diff.pct filtering
## 414 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 8
## 918 genes left after min.pct filtering
## 918 genes left after min.diff.pct filtering
## 357 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 9
## 1180 genes left after min.pct filtering
## 1180 genes left after min.diff.pct filtering
## 528 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 10
## 1069 genes left after min.pct filtering
## 1069 genes left after min.diff.pct filtering
## 409 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 11
## 948 genes left after min.pct filtering
## 948 genes left after min.diff.pct filtering
## 142 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 12
## 879 genes left after min.pct filtering
## 879 genes left after min.diff.pct filtering
## 212 genes left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 13
## 1654 genes left after min.pct filtering
## 1654 genes left after min.diff.pct filtering
## 878 genes left after log2fc.threshold filtering
## -----------------------------------

3.9 Selecting top gene markers

Select top 10 and 1 gene markers based on the log2 fold-change and the Bonferroni adjusted p-value.

top10_log2FC <- SelectTopGenes(gene_markers,
                               top.N = 10,
                               criterion.type = "log2FC",
                               inverse = FALSE)
top1_log2FC <- SelectTopGenes(gene_markers,
                              top.N = 1,
                              criterion.type = "log2FC",
                              inverse = FALSE)
top10_adj.p.value <- SelectTopGenes(gene_markers,
                                    top.N = 10,
                                    criterion.type = "adj.p.value",
                                    inverse = TRUE)
top1_adj.p.value <- SelectTopGenes(gene_markers,
                                   top.N = 1,
                                   criterion.type = "adj.p.value",
                                   inverse = TRUE)

Draw the t-SNE and UMAP transformations, highlighting expression levels of the top 1 gene markers based on the log2 fold-change.

GeneScatterPlot(sce,
                genes = unique(top1_log2FC$gene),
                dim.reduction.type = "tsne",
                point.size = 0.5,ncol=2)

3.10 Gene marker heatmap

GeneHeatmap function enables visualizing gene markers in a heatmap, where cells and genes are grouped by the clustering.

GeneHeatmap(sce,
            clustering.type = "manual",
            gene.markers = top10_log2FC)

3.11 Renaming clusters

RenameAllClusters enables renaming all clusters at once.

sce <- RenameAllClusters(sce,
                         new.cluster.names = c("GZMK+/CD8+ T cells",
                                               "IGKC+ B cells",
                                               "Naive CD4+ T cells",
                                               "NK cells",
                                               "CD16+ monocytes",
                                               "CD8+ T cells",
                                               "CD14+ monocytes",
                                               "IGLC+ B cells",
                                               "Intermediate monocytes",
                                               "IGKC+/IGLC+ B cells",
                                               "Memory CD4+ T cells",
                                               "Naive CD8+ T cells",
                                               "Dendritic cells"))

Draw the gene heatmap again, but with the clusters renamed.

GeneHeatmap(sce,gene.markers = top10_log2FC)

3.12 Violin plot visualization

VlnPlot enables visualization of gene expression with cells grouped by clustering.

# Visualize CD3D: a marker of T cells
VlnPlot(sce,genes = c("CD3D"),return.plot = FALSE,rotate.x.axis.labels = TRUE)

4 Additional functionality

ILoReg provides additional functionality for performing tasks, which are sometimes required in scRNA-seq data analysis.

4.1 Estimating the optimal number of clusters

The optimal number of clusters can be estimated by calculating the average silhouette value across the cells for a set of clusterings within a range of different \(K\) values (e.g. \([2,50]\)). The clustering mathing to the highest average silhouette is saved to clustering.optimal slot. Therefore, the clustering acquired using the SelectKClusters function is not overwritten.

sce <- CalcSilhInfo(sce,K.start = 2,K.end = 50)

## optimal K: 5, average silhouette score: 0.418375297485223

SilhouetteCurve(sce,return.plot = FALSE)

4.2 Renaming one cluster

sce <- SelectKClusters(sce,K=20)
# Rename cluster 1 as A
sce <- RenameCluster(sce,old.cluster.name = 1,new.cluster.name = "A")

4.3 Visualize with a custom annotation

# Select a clustering with K=20 clusters
sce <- SelectKClusters(sce,K=5)
# Generate a custom annotation with K=5 clusters and change the names to the five first alphabets.
custom_annotation <- plyr::mapvalues(metadata(sce)$iloreg$clustering.manual,c(1,2,3,4,5),LETTERS[1:5])
# Visualize the annotation
AnnotationScatterPlot(sce,
                      annotation = custom_annotation,
                      return.plot = FALSE,
                      dim.reduction.type = "tsne",
                      show.legend = FALSE)

4.4 Merging clusters

# Merge clusters 3 and 4
sce <- SelectKClusters(sce,K=20)
sce <- MergeClusters(sce,clusters.to.merge  = c(3,4))

4.5 Identification of differentially expressed genes between two arbitrary sets of clusters

sce <- SelectKClusters(sce,K=13)
sce <- RenameAllClusters(sce,
                         new.cluster.names = c("GZMK+/CD8+ T cells",
                                               "IGKC+ B cells",
                                               "Naive CD4+ T cells",
                                               "NK cells",
                                               "CD16+ monocytes",
                                               "CD8+ T cells",
                                               "CD14+ monocytes",
                                               "IGLC+ B cells",
                                               "Intermediate monocytes",
                                               "IGKC+/IGLC+ B cells",
                                               "Memory CD4+ T cells",
                                               "Naive CD8+ T cells",
                                               "Dendritic cells"))
# Identify DE genes between naive and memory CD4+ T cells
GM_naive_memory_CD4 <- FindGeneMarkers(sce,
                                       clusters.1 = "Naive CD4+ T cells",
                                       clusters.2 = "Memory CD4+ T cells",
                                       logfc.threshold = 0.25,
                                       min.pct = 0.25,
                                       return.thresh = 0.01,
                                       only.pos = TRUE)

## clustering.type='manual'testing cluster group.1
## 918 genes left after min.pct filtering
## 918 genes left after min.diff.pct filtering
## 164 genes left after logfc.threshold filtering

# Find gene markers for dendritic cells
GM_dendritic <- FindGeneMarkers(sce,
                                clusters.1 = "Dendritic cells",
                                logfc.threshold = 0.25,
                                min.pct = 0.25,
                                return.thresh = 0.01,
                                only.pos = TRUE)

## clustering.type='manual'testing cluster group.1
## 1654 genes left after min.pct filtering
## 1654 genes left after min.diff.pct filtering
## 878 genes left after logfc.threshold filtering

4.6 Session info

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cowplot_1.1.0               SingleCellExperiment_1.12.0
##  [3] SummarizedExperiment_1.20.0 Biobase_2.50.0             
##  [5] GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
##  [7] IRanges_2.24.0              S4Vectors_0.28.0           
##  [9] BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
## [11] matrixStats_0.57.0          ILoReg_1.0.0               
## [13] knitr_1.30                  BiocStyle_2.18.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6           RColorBrewer_1.1-2     tools_4.0.3           
##  [4] doRNG_1.8.2            R6_2.4.1               colorspace_1.4-1      
##  [7] withr_2.3.0            tidyselect_1.1.0       gridExtra_2.3         
## [10] Exact_2.1              compiler_4.0.3         parallelDist_0.2.4    
## [13] SparseM_1.78           expm_0.999-5           DelayedArray_0.16.0   
## [16] labeling_0.4.2         bookdown_0.21          scales_1.1.1          
## [19] mvtnorm_1.1-1          askpass_1.1            stringr_1.4.0         
## [22] digest_0.6.27          rmarkdown_2.5          aricode_1.0.0         
## [25] XVector_0.30.0         pkgconfig_2.0.3        htmltools_0.5.0       
## [28] umap_0.2.6.0           rlang_0.4.8            rstudioapi_0.11       
## [31] farver_2.0.3           generics_0.0.2         jsonlite_1.7.1        
## [34] dendextend_1.14.0      dplyr_1.0.2            RCurl_1.98-1.2        
## [37] magrittr_1.5           GenomeInfoDbData_1.2.4 Matrix_1.2-18         
## [40] Rcpp_1.0.5             DescTools_0.99.38      munsell_0.5.0         
## [43] reticulate_1.18        viridis_0.5.1          lifecycle_0.2.0       
## [46] stringi_1.5.3          yaml_2.2.1             MASS_7.3-53           
## [49] rootSolve_1.8.2.1      zlibbioc_1.36.0        plyr_1.8.6            
## [52] Rtsne_0.15             grid_4.0.3             LiblineaR_2.10-8      
## [55] crayon_1.3.4           doSNOW_1.0.19          lmom_2.8              
## [58] lattice_0.20-41        magick_2.5.0           pillar_1.4.6          
## [61] fastcluster_1.1.25     boot_1.3-25            gld_2.6.2             
## [64] rngtools_1.5           reshape2_1.4.4         codetools_0.2-16      
## [67] glue_1.4.2             evaluate_0.14          BiocManager_1.30.10   
## [70] RcppParallel_5.0.2     vctrs_0.3.4            foreach_1.5.1         
## [73] openssl_1.4.3          gtable_0.3.0           purrr_0.3.4           
## [76] ggplot2_3.3.2          xfun_0.18              e1071_1.7-4           
## [79] RSpectra_0.16-0        class_7.3-17           viridisLite_0.3.0     
## [82] tibble_3.0.4           snow_0.4-3             pheatmap_1.0.12       
## [85] iterators_1.0.13       cluster_2.1.0          ellipsis_0.3.1

5 References

Johannes Smolander, Sini Junttila, Mikko S Venäläinen, Laura L Elo. “ILoReg enables high-resolution cell population identification from single-cell RNA-seq data”. bioRxiv (2020).
“3k PBMCs from a Healthy Donor”. 10X Genomics.
“What is Cell Ranger?” 10X Genomics.

6 Contact information

If you have questions related to ILoReg, please contact us here.

ILoReg package manual

2020-10-27

Contents