ILoReg
is a novel tool for cell population identification from singlecell RNAseq (scRNAseq) data. In our study [1], we showed that ILoReg
was able to identify, by both unsupervised clustering and visually, rare cell populations that other scRNAseq data analysis pipelines were unable to identify.
The figure below illustrates the workflows of ILoReg
and a typical pipeline that applies feature selection prior to dimensionality reduction by principal component analysis (PCA).
In contrast to most scRNAseq data analysis pipelines, ILoReg
does not reduce the dimensionality of the gene expression matrix by feature selection. Instead, it performs probabilistic feature extraction using iterative clustering projection (ICP), yielding an \(N \times k\) dimensional probability matrix, which contains probabilities of each of the \(N\) cells belonging to the \(k\) clusters. ICP is a novel selfsupervised learning algorithm that iteratively seeks a clustering with \(k\) clusters that maximizes the adjusted Rand index (ARI) between the clustering \(C\) and its projection \(C'\) by L1regularized logistic regression. In the ILoReg consensus approach, ICP is run \(L\) times and the \(L\) probability matrices are merged into a joint probability matrix and subsequently transformed by principal component analysis (PCA) into a lower dimensional (\(N \times p\)) matrix (consensus matrix). The final clustering step is performed using hierarhical clustering by the Ward’s method, after which the user can extract a clustering with \(K\) consensus clusters. Twodimensional visualization is supported using two popular nonlinear dimensionality reduction methods: tdistributed stochastic neighbor embedding (tSNE) and uniform manifold approximation and projection (UMAP). Additionally, ILoReg provides userfriendly functions that enable identification of differentially expressed (DE) genes and visualization of gene expression.
ILoReg
can be downloaded from Bioconductor and installed by executing the following command in the R console.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ILoReg")
In the following, we go through the different steps of ILoReg
’s workflow and demonstrate it using a peripheral blood mononuclear cell (PBMC) dataset. The toy dataset included in the ILoReg
R package (pbmc3k_500
) contains 500 cells that have been downsampled from the pbmc3k dataset [2]. The preprocessing was rerun with a newer reference genome (GRCh38.p12) and Cell Ranger v2.2.0 [3].
The only required input for ILoReg
is a gene expression matrix that has been normalized for the library size, with genes/features in rows and cells/samples in columns. The input can be of matrix
, data.frame
or dgCMatrix
class, which is then transformed into a sparse object of dgCMatrix
class. Please note that the method has been designed to work with sparse data, i.e. with a high proportion of zero values. If, for example, the features of your dataset have been standardized, the run time and the memory usage of ILoReg
will likely be much higher.
suppressMessages(library(ILoReg))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(cowplot))
# The dataset was normalized using the LogNormalize method from the Seurat R package.
sce < SingleCellExperiment(assays = list(logcounts = pbmc3k_500))
sce < PrepareILoReg(sce)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 13865/13865 genes remain after filtering genes with only zero values.
Running ICP \(L\) times in parallel is the most computationally demanding part of the workflow.
In the following, we give a brief summary of the parameters.
As general guidelines on how to finetune the parameters, we recommend leaving \(C\), \(r\) and \(L\) as their defaults (\(C=0.3\), \(r=5\) and \(L=200\)). However, increasing \(k\) from 15 to, for example, 30 can reveal new cell subsets that are of potential interest. Regarding \(d\), increasing it to somewhere between 0.40.6 helps if the user wants lower resolution (less distinguishable populations). Setting \(reg.type="L2"\) disables the feature selection in L1regularization, and all the genes weights are consequently nonzero in the model, which typically leads to a lower resolution.
# ICP is stochastic. To obtain reproducible results, use set.seed().
set.seed(1)
# Run ICP L times. This is the slowest step of the workflow,
# and parallel processing can be used to greatly speed it up.
sce < RunParallelICP(object = sce, k = 15,
d = 0.3, L = 30,
r = 5, C = 0.3,
reg.type = "L1", threads = 0)
##

  0%

==  3%

=====  7%

=======  10%

==========  14%

============  17%

==============  21%

=================  24%

===================  28%

======================  31%

========================  34%

===========================  38%

=============================  41%

===============================  45%

==================================  48%

====================================  52%

=======================================  55%

=========================================  59%

===========================================  62%

==============================================  66%

================================================  69%

===================================================  72%

=====================================================  76%

========================================================  79%

==========================================================  83%

============================================================  86%

===============================================================  90%

=================================================================  93%

====================================================================  97%

====================================================================== 100%
The \(L\) probability matrices are merged into a joint probability matrix, which is then transformed into a lower dimensionality by PCA. Before applying PCA, the user can optionally scale the cluster probabilities to unitvariance.
# p = number of principal components
sce < RunPCA(sce,p=50,scale = FALSE)
Optional: PCA requires the user to specify the number of principal components, for which we selected the default value \(p=50\). To aid in decision making, the elbow plot is commonly used to seek an elbow point, of which proximity the user selects \(p\). In this case the point would be close to \(p=10\). Trying both a \(p\) that is close to the elbow point and the default \(p=50\) is recommended.
PCAElbowPlot(sce)
To visualize the data in twodimensional space, nonlinear dimensionality reduction is performed using tSNE or UMAP. The input data for this step is the \(N \times p\) dimensional consensus matrix.
sce < RunUMAP(sce)
sce < RunTSNE(sce,perplexity=30)
Visualize the tSNE and UMAP transformations using the GeneScatterPlot
function, highlighting expression levels of CD3D (T cells), CD79A (B cells), CST3 (monocytes, dendritic cells, platelets), FCER1A (myeloid dendritic cells).
GeneScatterPlot(sce,c("CD3D","CD79A","CST3","FCER1A"),
dim.reduction.type = "umap",
point.size = 0.3)