Single-Cell Consensus Clustering (SC3) is a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. An interactive graphical implementation makes SC3 accessible to a wide audience of users. SC3 also aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. A manuscript describing SC3 in detail is published in Nature Methods.
scater
SC3 is purely a clustering tool and does not provide functions for sequencing quality control (QC) or normalisation. Instead, the user is expected to perform these preprocessing steps in advance. To encourage preprocessing, SC3 is built on top of Bioconductor’s scater package. To our knowledge, scater is the most comprehensive toolkit for QC and normalisation of single-cell RNA-seq data.
The basic scater data container is an SCESet object. SC3 implements several methods that allow one to cluster the expression data contained in the SCESet object. All results of SC3 calculations are written to the sc3 slot of the SCESet object.
SC3 Input
If you already have an SCESet object created and QCed using scater, then proceed to the next chapter.
If you have a matrix containing expression data that was QCed and normalised by some other tool, then you first need to form an SCESet object containing the data. For illustrative purposes we will use an example expression matrix provided with SC3. This matrix (treutlein) represents FPKM gene expression of 80 cells derived from the distal lung epithelium of mice. The authors (Treutlein et al.) computationally identified 5 clusters in the data. The rows in the treutlein dataset correspond to genes and the columns correspond to cells. Column names correspond to the clusters identified by the authors.
library(scater)
library(SC3)
treutlein[1:3, 1:3]
## 2 4 3
## 0610005C13Rik 0 0.00000 0
## 0610007C21Rik 0 17.74195 0
## 0610007L01Rik 0 290.31680 0
It is easy to create an SCESet object from the treutlein expression matrix. We will follow the scater manual:
# cell annotation
ann <- data.frame(cell_type1 = colnames(treutlein))
pd <- new("AnnotatedDataFrame", data = ann)
# cell expression
tmp <- treutlein
colnames(tmp) <- rownames(ann)
# SCESet object
sceset <- newSCESet(fpkmData = tmp, phenoData = pd, logExprsOffset = 1)
It is also essential for SC3 that the QC metrics are computed for the created object:
is_exprs(sceset) <- exprs(sceset) > 0
sceset <- calculateQCMetrics(sceset)
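To make the expression threshold above concrete, the following base-R sketch computes two simple per-cell and per-gene summaries from a small made-up matrix. The matrix values and the variable names (is_expressed, total_features, gene_detection_rate) are assumptions chosen for illustration; this is not scater's internal implementation.

```r
# Toy FPKM-like matrix: 4 genes (rows) x 3 cells (columns); values are made up.
expr <- matrix(
    c(0,   17.7,  0,
      0,   290.3, 5.1,
      3.2, 0,     0,
      0,   0,     0),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# A gene counts as expressed in a cell when its value is above 0,
# mirroring the exprs(sceset) > 0 threshold used above.
is_expressed <- expr > 0

# Number of expressed genes in each cell.
total_features <- colSums(is_expressed)

# Fraction of cells in which each gene is expressed.
gene_detection_rate <- rowMeans(is_expressed)
```

Genes with a detection rate of 0, such as gene4 in this toy matrix, carry no information for clustering.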
The ann data frame contains just the cell_type1 column, which corresponds to the cell labels provided by the authors of the original publication. In general it can also contain more information about the cells, such as plate, run, well, date etc.
After the SCESet object is created and QC is run, scater allows a user to quickly visualize and assess the data, for example using a PCA plot:
plotPCA(sceset, colour_by = "cell_type1")
If you would like to explore clustering of your data for a range of k values (the number of clusters) from 2 to 4, you just need to run the main sc3 method and define the range using the ks parameter (here we also ask SC3 to calculate biological features based on the identified cell clusters):
# Note that n_cores = 1 is required for compilation of this vignette.
# Please remove this parameter when running on your computer:
# sceset <- sc3(sceset, ks = 2:4, biology = TRUE)
sceset <- sc3(sceset, ks = 2:4, biology = TRUE, n_cores = 1)
## Setting SC3 parameters...
## Setting a range of k...
## Calculating distances between the cells...
## Performing transformations and calculating eigenvectors...
## Performing k-means clustering...
## Calculating consensus matrix...
## Calculating biology...
To quickly and easily explore the SC3 solutions using an interactive Shiny application, use the following method:
sc3_interactive(sceset)
Visual exploration can provide a reasonable estimate of the number of clusters k. Once a preferable k is chosen, it is also possible to export the results into an Excel file:
sc3_export_results_xls(sceset)
This will write all results to the sc3_results.xls file. The name of the file can be controlled by the filename parameter.
SC3 writes all its results obtained for cells to the phenoData slot of the SCESet object by adding additional columns to it. This slot also contains all other cell features calculated by the scater package, either automatically during the SCESet object creation or during the calculateQCMetrics call. One can identify the SC3 results using the "sc3_" prefix:
p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])
## sc3_2_clusters sc3_3_clusters sc3_4_clusters sc3_2_log2_outlier_score
## 1 1 1 1 0
## 2 1 1 1 0
## 3 1 3 4 0
## 4 1 1 1 0
## 5 1 1 1 0
## 6 1 1 1 0
## sc3_3_log2_outlier_score sc3_4_log2_outlier_score
## 1 0.000000 0.000000
## 2 3.294804 3.232004
## 3 0.000000 0.000000
## 4 3.422047 3.312717
## 5 0.000000 0.000000
## 6 3.355677 3.262667
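When reference labels such as cell_type1 are available, a contingency table is a quick way to compare them with the SC3 cluster assignments. The sketch below uses a small hypothetical annotation table rather than the real pData(sceset) output, so every value in it is made up for illustration.

```r
# Hypothetical per-cell annotation table; all values are made up.
p_data <- data.frame(
    cell_type1     = c("A", "A", "B", "B", "C", "C"),
    total_features = c(3100, 2950, 4020, 3980, 2500, 2610),
    sc3_2_clusters = c(1, 1, 2, 2, 2, 2),
    sc3_3_clusters = c(1, 1, 2, 2, 3, 3)
)

# Anchoring the pattern with ^ keeps only columns that start with "sc3_".
sc3_cols <- grep("^sc3_", colnames(p_data), value = TRUE)

# Contingency table of reference labels (rows) against SC3 clusters (columns).
confusion <- table(reference = p_data$cell_type1, sc3 = p_data$sc3_3_clusters)
print(confusion)
```

A table with one non-zero entry per row and per column indicates that the clustering reproduces the reference labels up to a renaming of the clusters.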
Additionally, having SC3 results stored in the same slot makes it possible to highlight them in any of scater’s plotting functions, for example:
plotPCA(
    sceset,
    colour_by = "sc3_3_clusters",
    size_by = "sc3_3_log2_outlier_score"
)
SC3 writes all its results obtained for features (genes/transcripts) to the featureData slot of the SCESet object by adding additional columns to it. This slot also contains all other feature values calculated by the scater package, either automatically during the SCESet object creation or during the calculateQCMetrics call. One can identify the SC3 results using the "sc3_" prefix:
f_data <- fData(sceset)
head(f_data[ , grep("sc3_", colnames(f_data))])
## sc3_gene_filter sc3_2_markers_clusts sc3_2_markers_padj
## 0610005C13Rik FALSE NA NA
## 0610007C21Rik TRUE 2 1
## 0610007L01Rik TRUE 2 1
## 0610007N19Rik TRUE 1 1
## 0610007P08Rik TRUE 2 1
## 0610007P14Rik TRUE 2 1
## sc3_2_markers_auroc sc3_3_markers_clusts sc3_3_markers_padj
## 0610005C13Rik NA NA NA
## 0610007C21Rik 0.6709957 2 1
## 0610007L01Rik 0.6250000 2 1
## 0610007N19Rik 0.7662338 1 1
## 0610007P08Rik 0.5043290 1 1
## 0610007P14Rik 0.6060606 2 1
## sc3_3_markers_auroc sc3_4_markers_clusts sc3_4_markers_padj
## 0610005C13Rik NA NA NA
## 0610007C21Rik 0.6709957 2 1
## 0610007L01Rik 0.6250000 3 1
## 0610007N19Rik 0.6203175 3 1
## 0610007P08Rik 0.5187302 3 1
## 0610007P14Rik 0.6060606 2 1
## sc3_4_markers_auroc sc3_2_de_padj sc3_3_de_padj
## 0610005C13Rik NA NA NA
## 0610007C21Rik 0.6709957 1 1
## 0610007L01Rik 0.6734234 1 1
## 0610007N19Rik 0.7139640 1 1
## 0610007P08Rik 0.5202703 1 1
## 0610007P14Rik 0.6060606 1 1
## sc3_4_de_padj
## 0610005C13Rik NA
## 0610007C21Rik 1
## 0610007L01Rik 1
## 0610007N19Rik 1
## 0610007P08Rik 1
## 0610007P14Rik 1
Because the biological features were also calculated for each k, one can find adjusted p-values for both differential expression and marker genes, as well as the area under the ROC curve values (see ?sc3_calc_biology for more information).
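To give a feel for what an area under the ROC curve and an adjusted p-value mean for a single gene, here is a minimal base-R sketch on made-up data. It computes the AUROC from ranks and a Wilcoxon rank-sum p-value per gene; the choice of test and of the "holm" correction are assumptions of this sketch and are not taken from the text above, and the auroc helper is hypothetical.

```r
# Toy data: 2 genes x 20 cells; the first 10 cells form the cluster of interest.
in_cluster <- rep(c(TRUE, FALSE), each = 10)
expr <- rbind(
    marker = c(11:20, 1:10),                           # always higher inside the cluster
    flat   = c(seq(1, 19, by = 2), seq(2, 20, by = 2)) # interleaved values
)

# AUROC from ranks: the probability that a random in-cluster cell outranks
# a random out-of-cluster cell (the scaled Mann-Whitney U statistic).
auroc <- function(x, positive) {
    r  <- rank(x)
    n1 <- sum(positive)
    n0 <- sum(!positive)
    (sum(r[positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

aurocs <- apply(expr, 1, auroc, positive = in_cluster)

# Per-gene Wilcoxon rank-sum p-values, followed by a multiple-testing correction.
p_values <- apply(expr, 1, function(x) {
    wilcox.test(x[in_cluster], x[!in_cluster])$p.value
})
p_adj <- p.adjust(p_values, method = "holm")
```

An AUROC near 1 together with a small adjusted p-value marks a gene that separates one cluster from the rest, whereas an AUROC near 0.5 indicates no separation.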
Again, having SC3 results stored in the same slot makes it possible to highlight them in any of scater’s plotting functions, for example:
plotFeatureData(
    sceset,
    aes(
        x = sc3_3_markers_clusts,
        y = sc3_3_markers_auroc,
        colour = sc3_3_markers_padj
    )
)
## Warning: Removed 17518 rows containing missing values
## (position_quasirandom).
The default settings of SC3 allow one to cluster (using a single k) a dataset of 2,000 cells in about 20-30 minutes.
For datasets with more than 2,000 cells SC3 automatically adjusts some of its parameters (see below). This makes it possible to cluster a dataset of 5,000 cells in about 20-30 minutes. The parameters can also be manually adjusted for datasets with any number of cells.
For datasets with more than 5,000 cells SC3 utilizes a hybrid approach that combines unsupervised and supervised clustering (see below). Namely, SC3 selects a subset of cells uniformly at random and obtains clusters from this subset. Subsequently, the inferred labels are used to train a Support Vector Machine (SVM), which is employed to assign labels to the remaining cells. Training cells can also be manually selected by providing their indices.
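The subsample-then-classify idea can be sketched in base R as follows. To keep the example free of extra packages, a nearest-centroid rule is swapped in for the SVM and plain k-means stands in for the consensus clustering, so this illustrates the workflow only and is not SC3's implementation. The data, the training-set size of 50 and the nearest_centroid helper are all hypothetical.

```r
set.seed(42)

# Toy data: 200 cells (rows) x 5 features, drawn from two well-separated groups.
n_cells <- 200
truth <- rep(1:2, each = n_cells / 2)
cells <- matrix(rnorm(n_cells * 5), ncol = 5) + (truth - 1) * 10

# 1. Select a training subset of cells uniformly at random.
train_idx <- sample(n_cells, 50)

# 2. Cluster only the training subset (k-means stands in for the consensus step).
km <- kmeans(cells[train_idx, ], centers = 2, nstart = 10)

# 3. Assign each remaining cell to the closest training centroid
#    (a simple stand-in for the SVM classifier).
nearest_centroid <- function(x, centers) {
    d <- sapply(seq_len(nrow(centers)), function(j) {
        rowSums(sweep(x, 2, centers[j, ])^2)  # squared distance to centroid j
    })
    max.col(-d, ties.method = "first")        # index of the smallest distance
}

labels <- integer(n_cells)
labels[train_idx] <- km$cluster
labels[-train_idx] <- nearest_centroid(cells[-train_idx, , drop = FALSE], km$centers)

# Compare the propagated labels with the known groups of the toy data.
print(table(truth, labels))
```

The cost of the unsupervised step then depends on the size of the training subset rather than on the full dataset, which is the motivation for the hybrid approach.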
SC3 also provides methods for plotting all figures from the interactive session.
The consensus matrix is an N by N matrix, where N is the number of cells in the input dataset. It represents the similarity between cells, obtained by averaging the clustering results from all combinations of clustering parameters. Similarity 0 (blue) means that the two cells are always assigned to different clusters. In contrast, similarity 1 (red) means that the two cells are always assigned to the same cluster. The consensus matrix is clustered by hierarchical clustering and has a diagonal-block structure. Intuitively, the perfect clustering is achieved when all diagonal blocks are completely red and all off-diagonal elements are completely blue.
sc3_plot_consensus(sceset, k = 3)
It is also possible to annotate cells (columns of the consensus matrix) with any column of the phenoData slot of the SCESet object.
sc3_plot_consensus(
    sceset, k = 3,
    show_pdata = c(
        "cell_type1",
        "log10_total_features",
        "sc3_3_clusters",
        "sc3_3_log2_outlier_score"
    )
)
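The averaging behind a consensus matrix can be illustrated with a short base-R sketch. Here several k-means solutions that differ only in the features they use stand in for SC3's combinations of clustering parameters; that substitution, the toy data, the co_cluster helper and the complete-linkage choice are assumptions of this sketch.

```r
set.seed(1)

# Toy data: 30 cells (rows) x 4 features in three well-separated groups.
truth <- rep(1:3, each = 10)
cells <- matrix(rnorm(30 * 4, sd = 0.5), ncol = 4) + truth * 10

# Several clustering solutions, each using a different subset of features.
feature_sets <- list(1:2, 2:3, 3:4, c(1, 4), 1:4)
solutions <- lapply(feature_sets, function(cols) {
    kmeans(cells[, cols, drop = FALSE], centers = 3, nstart = 20)$cluster
})

# Co-clustering indicator for one solution: 1 if two cells share a label, else 0.
co_cluster <- function(labels) outer(labels, labels, "==") * 1

# Consensus matrix: the average of the indicator matrices over all solutions.
consensus <- Reduce(`+`, lapply(solutions, co_cluster)) / length(solutions)

# Hierarchical clustering of the consensus matrix, cut into k = 3 groups.
hc <- hclust(as.dist(1 - consensus), method = "complete")
final <- cutree(hc, k = 3)
```

Each entry of consensus is the fraction of solutions in which two cells share a cluster, so it lies between 0 and 1 as described above.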
A silhouette is a quantitative measure of the diagonality of the consensus matrix. An average silhouette width (shown at the bottom left of the silhouette plot) varies from 0 to 1, where 1 represents a perfectly block-diagonal consensus matrix and 0 represents a situation where there is no block-diagonal structure. The best clustering is achieved when the average silhouette width is close to 1.
sc3_plot_silhouette(sceset, k = 3)
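As a rough illustration of how a silhouette width relates to the block structure, the sketch below computes silhouette widths by hand from a small made-up consensus matrix, treating 1 - consensus as a dissimilarity. That choice and the silhouette_width helper are assumptions of this sketch, not a description of SC3's internals. In the general definition an individual width can lie anywhere between -1 and 1, and the helper does not handle clusters with a single cell.

```r
# Toy consensus matrix for 6 cells in two clusters (values are made up).
labels <- c(1, 1, 1, 2, 2, 2)
consensus <- outer(labels, labels, "==") * 1
consensus[1, 2] <- consensus[2, 1] <- 0.6  # imperfect agreement inside cluster 1
consensus[1, 4] <- consensus[4, 1] <- 0.2  # some agreement across clusters

# Treat 1 - consensus as a dissimilarity between cells.
d <- 1 - consensus

silhouette_width <- function(d, labels) {
    sapply(seq_along(labels), function(i) {
        same <- labels == labels[i]
        same[i] <- FALSE
        other <- labels != labels[i]
        a <- mean(d[i, same])                            # mean dissimilarity to own cluster
        b <- min(tapply(d[i, other], labels[other], mean))  # closest other cluster
        (b - a) / max(a, b)
    })
}

widths <- silhouette_width(d, labels)
avg_width <- mean(widths)
```

Cells whose consensus values deviate from the ideal block pattern, such as the first two cells here, receive widths below 1 and pull the average down.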
The expression panel represents the original input expression matrix (cells in columns and genes in rows) after cell and gene filters. Genes are clustered by k-means with k = 100 (dendrogram on the left) and the heatmap represents the expression levels of the gene cluster centers after log2-scaling.
sc3_plot_expression(sceset, k = 3)
It is also possible to annotate cells (columns of the expression matrix) with any column of the phenoData slot of the SCESet object.
sc3_plot_expression(
    sceset, k = 3,
    show_pdata = c(
        "cell_type1",
        "log10_total_features",
        "sc3_3_clusters",
        "sc3_3_log2_outlier_score"
    )
)
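The gene-clustering step behind the expression panel can be sketched in base R as follows. The toy matrix, the offset of 1 inside the logarithm, the order of the log transform and the clustering, and the use of 5 gene clusters instead of 100 (the toy matrix has only 60 genes) are all assumptions made for illustration.

```r
set.seed(7)

# Toy expression matrix: 60 genes (rows) x 8 cells (columns); values are made up.
expr <- matrix(
    rexp(60 * 8, rate = 0.1), nrow = 60,
    dimnames = list(paste0("gene", 1:60), paste0("cell", 1:8))
)

# log2-scale with an offset of 1 so that zero values stay finite.
log_expr <- log2(expr + 1)

# Cluster the genes (rows) with k-means.
n_gene_clusters <- 5
km <- kmeans(log_expr, centers = n_gene_clusters, nstart = 10)

# Each row of the centers matrix is the mean profile of one gene cluster
# across cells; these rows are what a heatmap of cluster centers would display.
centers <- km$centers
```

Summarising genes by cluster centers keeps the heatmap at a fixed number of rows regardless of how many genes pass the filters.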