3 Quick Start

3.1 `SC3` Input

If you already have an SCESet object created and QCed using scater then proceed to the next chapter.

If you have a matrix containing expression data that was QCed and normalised by some other tool, then we first need to form an SCESet object containing the data. For illustrative purposes we will use an example expression matrix provided with SC3. This matrix (treutlein) represents FPKM gene expression of 80 cells derived from the distal lung epithelium of mice. The authors (Treutlein et al.) had computationally identified 5 clusters in the data. The rows in the treutlein dataset correspond to genes and the columns correspond to cells. Column names correspond to the clusters identified by the authors.

library(scater)
library(SC3)
treutlein[1:3, 1:3]

##               2         4 3
## 0610005C13Rik 0   0.00000 0
## 0610007C21Rik 0  17.74195 0
## 0610007L01Rik 0 290.31680 0

It is easy to create an SCESet object from the treutlein expression matrix. We will follow the scater manual:

# cell annotation
ann <- data.frame(cell_type1 = colnames(treutlein))
pd <- new("AnnotatedDataFrame", data = ann)
# cell expression
tmp <- treutlein
colnames(tmp) <- rownames(ann)
# SCESet object
sceset <- newSCESet(fpkmData = tmp, phenoData = pd, logExprsOffset = 1)
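As a rough illustration of the input layout described above, the following sketch builds a toy matrix (made-up values, not the treutlein data) with genes in rows and cells in columns, derives the cell annotation from the column names, and applies a log2 transformation with an offset of 1. That logExprsOffset = 1 corresponds to log2(x + 1) is an assumption here; the scater documentation is the reference for the exact behaviour. The names toy, toy_ann and toy_log are hypothetical.

```r
# Toy FPKM matrix: 3 genes (rows) x 4 cells (columns); values are made up.
# matrix() fills column by column, so each line below is one cell.
toy <- matrix(
    c(0, 17.7, 290.3,
      0,  0.0,  12.1,
      5,  3.2,   0.0,
      1,  0.0,  80.5),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("1", "1", "2", "2"))
)

# Cell annotation taken from the column names (the cluster labels).
toy_ann <- data.frame(cell_type1 = colnames(toy))

# There must be exactly one annotation row per cell (column).
stopifnot(nrow(toy_ann) == ncol(toy))

# Assumed transformation: log2 with an offset of 1, so zeros stay at zero.
toy_log <- log2(toy + 1)

# Number of cells per annotated cluster.
cluster_sizes <- table(toy_ann$cell_type1)
print(cluster_sizes)
```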

It is also essential for SC3 that the QC metrics are computed for the created object:

sceset <- calculateQCMetrics(sceset)

The ann data frame contains just the cell_type1 column, which corresponds to the cell labels provided by the authors of the original publication. Note that in general it can also contain more information about the cells, such as plate, run, well, date etc.

After the SCESet object is created and QC is run, scater allows a user to quickly visualize and assess the data, for example using a PCA plot:

plotPCA(sceset, colour_by = "cell_type1")

3.2 Run SC3

If you would like to explore clustering of your data in the range of ks (the number of clusters) from 2 to 4, you just need to run the main sc3 method and define the range of ks using the ks parameter (here we also ask SC3 to calculate biological features based on the identified cell clusters):

# Note that n_cores = 1 is required for compilation of this vignette.
# Please remove this parameter when running on your computer:
# sceset <- sc3(sceset, ks = 2:4, biology = TRUE)
sceset <- sc3(sceset, ks = 2:4, biology = TRUE, n_cores = 1)

## Setting SC3 parameters...

## Setting a range of k...

## Calculating distances between the cells...

## Performing transformations and calculating eigenvectors...

## Performing k-means clustering...

## Calculating consensus matrix...

## Calculating biology...

To quickly and easily explore the SC3 solutions using an interactive Shiny application use the following method:

sc3_interactive(sceset)

Visual exploration can provide a reasonable estimate of the number of clusters k. Once a preferable k is chosen it is also possible to export the results into an Excel file:

sc3_export_results_xls(sceset)

This will write all results to sc3_results.xls file. The name of the file can be controlled by the filename parameter.

3.3 phenoData

SC3 writes all its results obtained for cells to the phenoData slot of the SCESet object by adding additional columns to it. This slot also contains all other cell features calculated by the scater package either automatically during the SCESet object creation or during the calculateQCMetrics call. One can identify the SC3 results using the "sc3_" prefix:

p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

##   sc3_2_clusters sc3_3_clusters sc3_4_clusters sc3_2_log2_outlier_score
## 1              1              1              1                        0
## 2              1              1              1                        0
## 3              1              3              4                        0
## 4              1              1              1                        0
## 5              1              1              1                        0
## 6              1              1              1                        0
##   sc3_3_log2_outlier_score sc3_4_log2_outlier_score
## 1                 0.000000                 0.000000
## 2                 3.231017                 3.295159
## 3                 0.000000                 0.000000
## 4                 3.323645                 3.416875
## 5                 0.000000                 0.000000
## 6                 3.256986                 3.350090

Additionally, having the SC3 results stored in the same slot makes it possible to highlight them in any of scater's plotting functions, for example:

plotPCA(
    sceset, 
    colour_by = "sc3_3_clusters", 
    size_by = "sc3_3_log2_outlier_score"
)

3.4 featureData

SC3 writes all its results obtained for features (genes/transcripts) to the featureData slot of the SCESet object by adding additional columns to it. This slot also contains all other feature values calculated by the scater package either automatically during the SCESet object creation or during the calculateQCMetrics call. One can identify the SC3 results using the "sc3_" prefix:

f_data <- fData(sceset)
head(f_data[ , grep("sc3_", colnames(f_data))])

##               sc3_gene_filter sc3_2_markers_clusts sc3_2_markers_padj
## 0610005C13Rik           FALSE                   NA                 NA
## 0610007C21Rik            TRUE                    2                  1
## 0610007L01Rik            TRUE                    2                  1
## 0610007N19Rik            TRUE                    1                  1
## 0610007P08Rik            TRUE                    2                  1
## 0610007P14Rik            TRUE                    2                  1
##               sc3_2_markers_auroc sc3_3_markers_clusts sc3_3_markers_padj
## 0610005C13Rik                  NA                   NA                 NA
## 0610007C21Rik           0.6709957                    2                  1
## 0610007L01Rik           0.6250000                    2                  1
## 0610007N19Rik           0.7662338                    1                  1
## 0610007P08Rik           0.5043290                    1                  1
## 0610007P14Rik           0.6060606                    2                  1
##               sc3_3_markers_auroc sc3_4_markers_clusts sc3_4_markers_padj
## 0610005C13Rik                  NA                   NA                 NA
## 0610007C21Rik           0.6709957                    2                  1
## 0610007L01Rik           0.6250000                    3                  1
## 0610007N19Rik           0.6203175                    3                  1
## 0610007P08Rik           0.5187302                    3                  1
## 0610007P14Rik           0.6060606                    2                  1
##               sc3_4_markers_auroc sc3_2_de_padj sc3_3_de_padj
## 0610005C13Rik                  NA            NA            NA
## 0610007C21Rik           0.6709957             1             1
## 0610007L01Rik           0.6734234             1             1
## 0610007N19Rik           0.7139640             1             1
## 0610007P08Rik           0.5202703             1             1
## 0610007P14Rik           0.6060606             1             1
##               sc3_4_de_padj
## 0610005C13Rik            NA
## 0610007C21Rik             1
## 0610007L01Rik             1
## 0610007N19Rik             1
## 0610007P08Rik             1
## 0610007P14Rik             1

Because the biological features were also calculated for each k, one can find adjusted p-values for both differential expression and marker genes, as well as the area under the ROC curve values (see ?sc3_calc_biology for more information).

Again, having the SC3 results stored in the same slot makes it possible to highlight them in any of scater's plotting functions, for example:

plotFeatureData(
    sceset, 
    aes(
        x = sc3_3_markers_clusts, 
        y = sc3_3_markers_auroc, 
        colour = sc3_3_markers_padj
    )
)

## Warning: Removed 17518 rows containing missing values
## (position_quasirandom).

4 Number of Cells

With the default settings, SC3 clusters (using a single k) a dataset of 2,000 cells in about 20-30 minutes.

For datasets with more than 2,000 cells SC3 automatically adjusts some of its parameters (see below). This makes it possible to cluster a dataset of 5,000 cells in about 20-30 minutes. The parameters can also be manually adjusted for datasets with any number of cells.

For datasets with more than 5,000 cells SC3 utilizes a hybrid approach that combines unsupervised and supervised clusterings (see below). Namely, SC3 selects a subset of cells uniformly at random, and obtains clusters from this subset. Subsequently, the inferred labels are used to train a Support Vector Machine (SVM), which is employed to assign labels to the remaining cells. Training cells can also be manually selected by providing their indices.

5 Plot Functions

SC3 also provides methods for plotting all figures from the interactive session.

5.1 Consensus Matrix

The consensus matrix is an N by N matrix, where N is the number of cells in the input dataset. It represents similarity between the cells based on the averaging of clustering results from all combinations of clustering parameters. Similarity 0 (blue) means that the two cells are always assigned to different clusters. In contrast, similarity 1 (red) means that the two cells are always assigned to the same cluster. The consensus matrix is clustered by hierarchical clustering and has a diagonal-block structure. Intuitively, the perfect clustering is achieved when all diagonal blocks are completely red and all off-diagonal elements are completely blue.

sc3_plot_consensus(sceset, k = 3)

It is also possible to annotate cells (columns of the consensus matrix) with any column of the phenoData slot of the SCESet object.

sc3_plot_consensus(
    sceset, k = 3, 
    show_pdata = c(
        "cell_type1", 
        "log10_total_features",
        "sc3_3_clusters", 
        "sc3_3_log2_outlier_score"
    )
)
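To make the definition of the consensus matrix concrete, here is a minimal base R sketch that averages co-clustering indicators over three hypothetical clusterings of five cells. It illustrates the idea of "the fraction of clusterings in which two cells share a cluster" and is not a copy of the SC3 implementation.

```r
# Three hypothetical clusterings of the same five cells.
labels <- list(
    c(1, 1, 2, 2, 2),
    c(1, 1, 1, 2, 2),
    c(2, 2, 1, 1, 1)
)

# Binary co-clustering matrix: 1 if two cells share a label, 0 otherwise.
co_cluster <- function(l) outer(l, l, "==") * 1

# Consensus matrix: element-wise average over all clusterings.
consensus <- Reduce(`+`, lapply(labels, co_cluster)) / length(labels)

print(round(consensus, 2))
```

Cells 1 and 2 are always together, so their entry is 1; cells 1 and 5 are never together, so their entry is 0.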

5.2 Silhouette Plot

A silhouette is a quantitative measure of the diagonality of the consensus matrix. An average silhouette width (shown at the bottom left of the silhouette plot) varies from 0 to 1, where 1 represents a perfectly block-diagonal consensus matrix and 0 represents a situation where there is no block-diagonal structure. The best clustering is achieved when the average silhouette width is close to 1.

sc3_plot_silhouette(sceset, k = 3)
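The sketch below computes silhouette widths by hand for a perfectly block-diagonal toy consensus matrix, using the standard definition (b - a) / max(a, b). Using 1 - consensus as the dissimilarity between cells is an assumption made for illustration; the function silhouette_width is hypothetical and assumes at least two clusters.

```r
# Hypothetical 4-cell consensus matrix with two perfect blocks.
consensus <- matrix(
    c(1, 1, 0, 0,
      1, 1, 0, 0,
      0, 0, 1, 1,
      0, 0, 1, 1),
    nrow = 4, byrow = TRUE
)
clusters <- c(1, 1, 2, 2)

# Assumption: dissimilarity between two cells is 1 - consensus.
d <- 1 - consensus

silhouette_width <- function(d, clusters) {
    sapply(seq_along(clusters), function(i) {
        own <- clusters == clusters[i]
        own[i] <- FALSE
        # Singleton clusters get a width of 0 by convention.
        if (!any(own)) return(0)
        # a: mean dissimilarity to the other cells of the same cluster.
        a <- mean(d[i, own])
        # b: smallest mean dissimilarity to any other cluster.
        others <- setdiff(unique(clusters), clusters[i])
        b <- min(sapply(others, function(cl) mean(d[i, clusters == cl])))
        if (max(a, b) == 0) 0 else (b - a) / max(a, b)
    })
}

avg_width <- mean(silhouette_width(d, clusters))
print(avg_width)
```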

5.3 Expression Matrix

The expression panel represents the original input expression matrix (cells in columns and genes in rows) after cell and gene filters. Genes are clustered by kmeans with k = 100 (dendrogram on the left) and the heatmap represents the expression levels of the gene cluster centers after log2-scaling.

sc3_plot_expression(sceset, k = 3)

It is also possible to annotate cells (columns of the expression matrix) with any column of the phenoData slot of the SCESet object.

sc3_plot_expression(
    sceset, k = 3, 
    show_pdata = c(
        "cell_type1", 
        "log10_total_features",
        "sc3_3_clusters", 
        "sc3_3_log2_outlier_score"
    )
)
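The gene-clustering step behind this panel can be sketched with base R's kmeans. The example uses random toy data and k = 5 instead of the k = 100 mentioned above, simply because the toy matrix has only 60 genes; the rows of the resulting centers matrix are the kind of per-cluster profiles that a heatmap would display.

```r
set.seed(1)

# Toy log-expression matrix: 60 genes (rows) x 8 cells (columns).
expr <- matrix(
    rnorm(60 * 8), nrow = 60,
    dimnames = list(paste0("gene", 1:60), paste0("cell", 1:8))
)

# Cluster the genes (rows); the text uses k = 100, the toy example k = 5.
n_gene_clusters <- 5
km <- kmeans(expr, centers = n_gene_clusters, nstart = 10)

# Each row of 'centers' is the mean profile of one gene cluster across cells.
centers <- km$centers
print(dim(centers))
```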

5.4 Cluster Stability

The stability index shows how stable each cluster is across the selected range of ks. The stability index varies between 0 and 1, where 1 means that the same cluster appears in every solution for different k.

sc3_plot_cluster_stability(sceset, k = 3)

5.5 DE genes

Differential expression is calculated using the non-parametric Kruskal-Wallis test. A significant p-value indicates that gene expression in at least one cluster stochastically dominates one other cluster. SC3 provides a list of all differentially expressed genes with adjusted p-values < 0.01 and plots gene expression profiles of the 50 genes with the lowest p-values. Note that the calculation of differential expression after clustering can introduce a bias in the distribution of p-values, and thus we advise using the p-values for ranking the genes only.

sc3_plot_de_genes(sceset, k = 3)

It is also possible to annotate cells (columns of the matrix containing DE genes) with any column of the phenoData slot of the SCESet object.

sc3_plot_de_genes(
    sceset, k = 3, 
    show_pdata = c(
        "cell_type1", 
        "log10_total_features",
        "sc3_3_clusters", 
        "sc3_3_log2_outlier_score"
    )
)
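A per-gene Kruskal-Wallis test followed by ranking can be sketched with the base R functions kruskal.test and p.adjust. The data below are simulated, and the choice of the Holm adjustment is an assumption made for the example rather than something stated in this manual.

```r
set.seed(1)

# Three hypothetical clusters of 10 cells each.
clusters <- factor(rep(1:3, each = 10))

# Toy expression: gene1 is shifted upwards in cluster 3, gene2 is pure noise.
expr <- rbind(
    gene1 = c(rnorm(20, mean = 0), rnorm(10, mean = 5)),
    gene2 = rnorm(30)
)

# Kruskal-Wallis p-value for each gene (row).
p_values <- apply(expr, 1, function(x) kruskal.test(x, clusters)$p.value)

# Multiple-testing adjustment (Holm is an assumed choice).
p_adj <- p.adjust(p_values, method = "holm")

# Use the adjusted p-values only to rank the genes.
ranked_genes <- names(sort(p_adj))
print(ranked_genes)
```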

5.6 Marker Genes

To find marker genes, for each gene a binary classifier is constructed based on the mean cluster expression values. The classifier prediction is then calculated using the gene expression ranks. The area under the receiver operating characteristic (ROC) curve is used to quantify the accuracy of the prediction. A p-value is assigned to each gene by using the Wilcoxon signed rank test. By default the genes with the area under the ROC curve (AUROC) > 0.85 and with the p-value < 0.01 are selected and the top 10 marker genes of each cluster are visualized in this heatmap.

sc3_plot_markers(sceset, k = 3)

It is also possible to annotate cells (columns of the matrix containing marker genes) with any column of the phenoData slot of the SCESet object.

sc3_plot_markers(
    sceset, k = 3, 
    show_pdata = c(
        "cell_type1", 
        "log10_total_features",
        "sc3_3_clusters", 
        "sc3_3_log2_outlier_score"
    )
)
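The marker gene idea can be sketched for a single gene as follows: pick the cluster with the highest mean expression, treat membership in that cluster as the positive class, and compute the AUROC from the expression ranks. The helper names are hypothetical, and the p-value here comes from the two-sample form of wilcox.test; whether this matches the test SC3 runs internally is an assumption.

```r
# AUROC computed from ranks (equivalent to the scaled Mann-Whitney statistic).
auroc_from_ranks <- function(x, is_positive) {
    r <- rank(x)
    n_pos <- sum(is_positive)
    n_neg <- sum(!is_positive)
    (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

marker_stats <- function(x, clusters) {
    # Candidate cluster: the one with the highest mean expression.
    means <- tapply(x, clusters, mean)
    top <- names(means)[which.max(means)]
    is_positive <- clusters == top
    # Assumed test: two-sample Wilcoxon test, normal approximation.
    p <- wilcox.test(x[is_positive], x[!is_positive], exact = FALSE)$p.value
    list(cluster = top, auroc = auroc_from_ranks(x, is_positive), p_value = p)
}

# Toy gene that is clearly higher in cluster "2".
clusters <- rep(c("1", "2"), each = 5)
x <- c(1, 2, 3, 4, 5, 10, 11, 12, 13, 14)
res <- marker_stats(x, clusters)
print(res$cluster)
print(res$auroc)
```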

6 SC3 in Detail

The main sc3 method explained above is a wrapper that calls several other SC3 methods in the following order:

sc3_prepare
(optional) sc3_estimate_k
sc3_calc_dists
sc3_calc_transfs
sc3_kmeans
sc3_calc_consens
(optional) sc3_calc_biology

Let us go through each of them independently.

6.1 `sc3_prepare`

We start with sc3_prepare. This method prepares an object of SCESet class for SC3 clustering. This method also defines all parameters needed for clustering and stores them in the sc3 slot. The parameters have their own defaults but can be manually changed. For more information on the parameters please use ?sc3_prepare.

# Note that n_cores = 1 is required for compilation of this vignette.
# Please remove this parameter when running on your computer:
# sceset <- sc3_prepare(sceset, ks = 2:4)
sceset <- sc3_prepare(sceset, ks = 2:4, n_cores = 1)

## Setting SC3 parameters...

## Setting a range of k...

str(sceset@sc3)

## List of 6
##  $ kmeans_iter_max: num 1e+09
##  $ kmeans_nstart  : num 1000
##  $ n_dim          : int [1:4] 3 4 5 6
##  $ rand_seed      : num 1
##  $ n_cores        : num 1
##  $ ks             : int [1:3] 2 3 4

6.2 (optional) `sc3_estimate_k`

When the SCESet object is prepared for clustering, SC3 can also estimate the optimal number of clusters k in the dataset. SC3 utilizes the Tracy-Widom theory on random matrices to estimate k. The sc3_estimate_k method creates and populates the following items of the sc3 slot:

k_estimation - contains the estimated value of k.

sceset <- sc3_estimate_k(sceset)

## Estimating k...

str(sceset@sc3)

## List of 7
##  $ kmeans_iter_max: num 1e+09
##  $ kmeans_nstart  : num 1000
##  $ n_dim          : int [1:4] 3 4 5 6
##  $ rand_seed      : num 1
##  $ n_cores        : num 1
##  $ ks             : int [1:3] 2 3 4
##  $ k_estimation   : num 3

6.3 `sc3_calc_dists`

Now we are ready to perform the clustering itself. First SC3 calculates distances between the cells. Method sc3_calc_dists calculates the distances, creates and populates the following items of the sc3 slot:

distances - contains a list of distance matrices corresponding to Euclidean, Pearson and Spearman distances.

sceset <- sc3_calc_dists(sceset)

## Calculating distances between the cells...

names(sceset@sc3$distances)

## [1] "euclidean" "pearson"   "spearman"
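For orientation, the three kinds of cell-to-cell distances can be computed in base R as shown below. Converting correlations to distances as 1 - correlation is a common convention and an assumption here; the exact formulas used inside sc3_calc_dists may differ.

```r
set.seed(1)

# Toy expression matrix: 20 genes (rows) x 5 cells (columns).
expr <- matrix(
    rexp(20 * 5), nrow = 20,
    dimnames = list(NULL, paste0("cell", 1:5))
)

# Cell-to-cell distances; dist() works on rows, hence the transpose.
distances <- list(
    euclidean = as.matrix(dist(t(expr))),
    pearson   = 1 - cor(expr, method = "pearson"),
    spearman  = 1 - cor(expr, method = "spearman")
)

print(names(distances))
print(sapply(distances, dim))
```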

6.4 `sc3_calc_transfs`

Next the distance matrices are transformed using PCA and graph Laplacian. Method sc3_calc_transfs calculates transformations of the distance matrices contained in the distances item of the sc3 slot. It then creates and populates the following items of the sc3 slot:

transformations - contains a list of transformations of the distance matrices corresponding to PCA and graph Laplacian transformations.

sceset <- sc3_calc_transfs(sceset)

## Performing transformations and calculating eigenvectors...

names(sceset@sc3$transformations)

## [1] "euclidean_pca"       "pearson_pca"         "spearman_pca"       
## [4] "euclidean_laplacian" "pearson_laplacian"   "spearman_laplacian"

It also removes the previously calculated distances item from the sc3 slot:

sceset@sc3$distances

## NULL
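The two transformations can be sketched in base R on a toy distance matrix. The particular choices below (scaled PCA via prcomp, a similarity of exp(-d / max(d)), and the symmetric normalized Laplacian) are assumptions made for illustration and should not be read as the exact SC3 implementation.

```r
set.seed(1)

# Toy data: 20 genes x 6 cells, and the 6 x 6 Euclidean distance matrix.
expr <- matrix(rexp(20 * 6), nrow = 20)
d <- as.matrix(dist(t(expr)))

# PCA transformation: principal axes of the (centred, scaled) distance matrix.
pca_transf <- prcomp(d, center = TRUE, scale. = TRUE)$rotation

# Graph Laplacian transformation.
A <- exp(-d / max(d))                         # distances -> similarities
D_inv_sqrt <- diag(1 / sqrt(rowSums(A)))      # D^(-1/2)
L <- diag(nrow(A)) - D_inv_sqrt %*% A %*% D_inv_sqrt
eig <- eigen(L, symmetric = TRUE)

# Eigenvectors ordered by increasing eigenvalue.
lap_transf <- eig$vectors[, order(eig$values)]

print(dim(pca_transf))
print(dim(lap_transf))
```

In both cases the first few columns of the transformed matrix would then be passed on to k-means.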

6.5 `sc3_kmeans`

kmeans should then be performed on the transformed distance matrices contained in the transformations item of the sc3 slot. Method sc3_kmeans creates and populates the following items of the sc3 slot:

kmeans - contains a list of kmeans clusterings.

By default, the nstart parameter passed to kmeans, defined in the sc3_prepare method, is set to 1000 and written to the kmeans_nstart item of the sc3 slot. If the dataset contains more than 2,000 cells, this parameter is set to 50. A user can also manually define this parameter by changing the value of the kmeans_nstart item of the sc3 slot.

sceset <- sc3_kmeans(sceset)

## Performing k-means clustering...

names(sceset@sc3$kmeans)

##  [1] "euclidean_pca_2_3"       "pearson_pca_2_3"        
##  [3] "spearman_pca_2_3"        "euclidean_laplacian_2_3"
##  [5] "pearson_laplacian_2_3"   "spearman_laplacian_2_3" 
##  [7] "euclidean_pca_3_3"       "pearson_pca_3_3"        
##  [9] "spearman_pca_3_3"        "euclidean_laplacian_3_3"
## [11] "pearson_laplacian_3_3"   "spearman_laplacian_3_3" 
## [13] "euclidean_pca_4_3"       "pearson_pca_4_3"        
## [15] "spearman_pca_4_3"        "euclidean_laplacian_4_3"
## [17] "pearson_laplacian_4_3"   "spearman_laplacian_4_3" 
## [19] "euclidean_pca_2_4"       "pearson_pca_2_4"        
## [21] "spearman_pca_2_4"        "euclidean_laplacian_2_4"
## [23] "pearson_laplacian_2_4"   "spearman_laplacian_2_4" 
## [25] "euclidean_pca_3_4"       "pearson_pca_3_4"        
## [27] "spearman_pca_3_4"        "euclidean_laplacian_3_4"
## [29] "pearson_laplacian_3_4"   "spearman_laplacian_3_4" 
## [31] "euclidean_pca_4_4"       "pearson_pca_4_4"        
## [33] "spearman_pca_4_4"        "euclidean_laplacian_4_4"
## [35] "pearson_laplacian_4_4"   "spearman_laplacian_4_4" 
## [37] "euclidean_pca_2_5"       "pearson_pca_2_5"        
## [39] "spearman_pca_2_5"        "euclidean_laplacian_2_5"
## [41] "pearson_laplacian_2_5"   "spearman_laplacian_2_5" 
## [43] "euclidean_pca_3_5"       "pearson_pca_3_5"        
## [45] "spearman_pca_3_5"        "euclidean_laplacian_3_5"
## [47] "pearson_laplacian_3_5"   "spearman_laplacian_3_5" 
## [49] "euclidean_pca_4_5"       "pearson_pca_4_5"        
## [51] "spearman_pca_4_5"        "euclidean_laplacian_4_5"
## [53] "pearson_laplacian_4_5"   "spearman_laplacian_4_5" 
## [55] "euclidean_pca_2_6"       "pearson_pca_2_6"        
## [57] "spearman_pca_2_6"        "euclidean_laplacian_2_6"
## [59] "pearson_laplacian_2_6"   "spearman_laplacian_2_6" 
## [61] "euclidean_pca_3_6"       "pearson_pca_3_6"        
## [63] "spearman_pca_3_6"        "euclidean_laplacian_3_6"
## [65] "pearson_laplacian_3_6"   "spearman_laplacian_3_6" 
## [67] "euclidean_pca_4_6"       "pearson_pca_4_6"        
## [69] "spearman_pca_4_6"        "euclidean_laplacian_4_6"
## [71] "pearson_laplacian_4_6"   "spearman_laplacian_4_6"
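The 72 names above are consistent with one k-means run per combination of distance, transformation, k and n_dim (3 x 2 x 3 x 4 = 72). The sketch below reproduces the names with expand.grid; the naming scheme is inferred from the printed output rather than taken from the SC3 source.

```r
# All combinations; expand.grid varies the first factor fastest.
grid <- expand.grid(
    distance = c("euclidean", "pearson", "spearman"),
    transformation = c("pca", "laplacian"),
    k = 2:4,
    n_dim = 3:6,
    stringsAsFactors = FALSE
)

# Name pattern inferred from the output: <distance>_<transformation>_<k>_<n_dim>.
run_names <- paste(
    grid$distance, grid$transformation, grid$k, grid$n_dim,
    sep = "_"
)

print(length(run_names))
print(head(run_names))
```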

6.6 `sc3_calc_consens`

In this step SC3 will provide you with a clustering solution. Let’s first check that there are no SC3 related columns in the phenoData slot:

p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

## data frame with 0 columns and 6 rows

When calculating consensus for each value of k SC3 averages the clustering results of kmeans using a consensus approach. Method sc3_calc_consens calculates consensus matrices based on the clustering solutions contained in the kmeans item of the sc3 slot. It then creates and populates the following items of the sc3 slot:

consensus - for each value of k it contains: a consensus matrix, an hclust object corresponding to hierarchical clustering of the consensus matrix, and the silhouette indices of the clusters.

sceset <- sc3_calc_consens(sceset)

## Calculating consensus matrix...

names(sceset@sc3$consensus)

## [1] "2" "3" "4"

names(sceset@sc3$consensus$`3`)

## [1] "consensus"  "hc"         "silhouette"
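How cluster labels can be read off a consensus matrix is sketched below with base R's hclust and cutree on a small hand-written matrix. Computing Euclidean distances between the rows of the consensus matrix and using complete linkage are assumptions of this example.

```r
# Hypothetical consensus matrix for five cells with two visible blocks.
consensus <- matrix(
    c(1.0, 0.9, 0.1, 0.0, 0.0,
      0.9, 1.0, 0.0, 0.1, 0.0,
      0.1, 0.0, 1.0, 0.8, 0.9,
      0.0, 0.1, 0.8, 1.0, 0.9,
      0.0, 0.0, 0.9, 0.9, 1.0),
    nrow = 5, byrow = TRUE
)

# Assumption: Euclidean distance between rows, complete linkage.
hc <- hclust(dist(consensus), method = "complete")

# Cut the tree into k = 2 clusters to obtain one label per cell.
labels <- cutree(hc, k = 2)
print(labels)
```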

It also removes the previously calculated kmeans item from the sc3 slot:

sceset@sc3$kmeans

## NULL

As mentioned before all the clustering results (cell-related information) are written to the phenoData slot of the SCESet object:

p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

##   sc3_2_clusters sc3_3_clusters sc3_4_clusters
## 1              1              1              1
## 2              1              1              1
## 3              1              3              4
## 4              1              1              1
## 5              1              1              1
## 6              1              1              1

We can see that SC3 calculated clusters for k = 2, 3 and 4 and wrote them to the phenoData slot of the SCESet object.

6.7 (optional) `sc3_calc_biology`

SC3 can also calculate DE genes, marker genes and cell outliers based on the calculated consensus clusterings. Similarly to the clustering solutions, the sc3_calc_biology method writes the results for the cell outliers (cell-related information) to the phenoData slot of the SCESet object. In contrast, the DE and marker gene results (gene-related information) are written to the featureData slot. In addition, the biology item of the sc3 slot is set to TRUE.

sceset <- sc3_calc_biology(sceset)

## Calculating biology...

6.7.1 Cell Outliers

Now we can see that cell outlier scores have been calculated for each value of k:

p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

##   sc3_2_clusters sc3_3_clusters sc3_4_clusters sc3_2_log2_outlier_score
## 1              1              1              1                        0
## 2              1              1              1                        0
## 3              1              3              4                        0
## 4              1              1              1                        0
## 5              1              1              1                        0
## 6              1              1              1                        0
##   sc3_3_log2_outlier_score sc3_4_log2_outlier_score
## 1                 0.000000                 0.000000
## 2                 3.224458                 3.258437
## 3                 0.000000                 0.000000
## 4                 3.306006                 3.310073
## 5                 0.000000                 0.000000
## 6                 3.269309                 3.300530

For more information on how the cell outliers are calculated please see ?get_outl_cells.

6.7.2 DE and marker genes

We can also see that DE and marker gene characteristics (adjusted p-values and area under the ROC curve) have been calculated for each value of k:

f_data <- fData(sceset)
head(f_data[ , grep("sc3_", colnames(f_data))])

##               sc3_gene_filter sc3_2_markers_clusts sc3_2_markers_padj
## 0610005C13Rik           FALSE                   NA                 NA
## 0610007C21Rik            TRUE                    2                  1
## 0610007L01Rik            TRUE                    2                  1
## 0610007N19Rik            TRUE                    1                  1
## 0610007P08Rik            TRUE                    2                  1
## 0610007P14Rik            TRUE                    2                  1
##               sc3_2_markers_auroc sc3_3_markers_clusts sc3_3_markers_padj
## 0610005C13Rik                  NA                   NA                 NA
## 0610007C21Rik           0.6709957                    2                  1
## 0610007L01Rik           0.6250000                    2                  1
## 0610007N19Rik           0.7662338                    1                  1
## 0610007P08Rik           0.5043290                    1                  1
## 0610007P14Rik           0.6060606                    2                  1
##               sc3_3_markers_auroc sc3_4_markers_clusts sc3_4_markers_padj
## 0610005C13Rik                  NA                   NA                 NA
## 0610007C21Rik           0.6709957                    2                  1
## 0610007L01Rik           0.6250000                    3                  1
## 0610007N19Rik           0.6203175                    3                  1
## 0610007P08Rik           0.5187302                    3                  1
## 0610007P14Rik           0.6060606                    2                  1
##               sc3_4_markers_auroc sc3_2_de_padj sc3_3_de_padj
## 0610005C13Rik                  NA            NA            NA
## 0610007C21Rik           0.6709957             1             1
## 0610007L01Rik           0.6734234             1             1
## 0610007N19Rik           0.7139640             1             1
## 0610007P08Rik           0.5202703             1             1
## 0610007P14Rik           0.6060606             1             1
##               sc3_4_de_padj
## 0610005C13Rik            NA
## 0610007C21Rik             1
## 0610007L01Rik             1
## 0610007N19Rik             1
## 0610007P08Rik             1
## 0610007P14Rik             1

For more information on how the DE and marker genes are calculated please see ?get_de_genes and ?get_marker_genes.

7 Hybrid `SVM` Approach

For datasets with more than 5,000 cells SC3 automatically utilizes a hybrid approach that combines unsupervised and supervised clusterings. Namely, SC3 selects a subset of cells uniformly at random (5,000), and obtains clusters from this subset. The inferred labels can be used to train a Support Vector Machine (SVM), which is employed to assign labels to the remaining cells.

The hybrid approach can also be triggered by defining either the svm_num_cells parameter (the number of training cells, which is different from 5,000) or svm_train_inds parameter (training cells are manually selected by providing their indexes).
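The overall shape of the hybrid approach can be sketched on toy data: draw training cells at random, cluster only those, then train a classifier to label the rest. In this sketch kmeans stands in for the SC3 clustering of the training cells, and the SVM step uses e1071::svm with a linear kernel only if the e1071 package is installed; both choices are assumptions for illustration.

```r
set.seed(1)
n_cells <- 80
svm_num_cells <- 50

# Toy data: cells in rows, two features, two well-separated groups.
features <- rbind(
    matrix(rnorm(40 * 2, mean = 0), ncol = 2),
    matrix(rnorm(40 * 2, mean = 6), ncol = 2)
)

# Training cells drawn uniformly at random; the rest are to be predicted.
train_inds <- sort(sample(n_cells, svm_num_cells))
rest_inds <- setdiff(seq_len(n_cells), train_inds)

# Stand-in for the unsupervised step on the training cells.
train_labels <- kmeans(features[train_inds, ], centers = 2, nstart = 10)$cluster

# Supervised step (runs only if e1071 is available).
if (requireNamespace("e1071", quietly = TRUE)) {
    model <- e1071::svm(
        features[train_inds, ], factor(train_labels), kernel = "linear"
    )
    predicted <- predict(model, features[rest_inds, ])
    print(table(predicted))
}
```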

Let us first save the SC3 results for k = 3 obtained without using the hybrid approach:

no_svm_labels <- pData(sceset)$sc3_3_clusters

Now let us trigger the hybrid approach by asking for 50 training cells:

# Note that n_cores = 1 is required for compilation of this vignette.
# Please remove this parameter when running on your computer:
# sceset <- sc3(sceset, ks = 2:4, biology = TRUE, svm_num_cells = 50)
sceset <- sc3(sceset, ks = 2:4, biology = TRUE, svm_num_cells = 50, n_cores = 1)

## Setting SC3 parameters...

## Defining training cells for SVM using svm_num_cells parameter...

## Setting a range of k...

## Calculating distances between the cells...

## Performing transformations and calculating eigenvectors...

## Performing k-means clustering...

## Calculating consensus matrix...

## Calculating biology...

Note that when SVM is used all results (including marker genes, DE genes and cell outliers) correspond to the training cells only (50 cells), and values of all other cells are set to NA:

p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

##   sc3_2_clusters sc3_3_clusters sc3_4_clusters sc3_2_log2_outlier_score
## 1             NA             NA             NA                       NA
## 2             NA             NA             NA                       NA
## 3              2              3              4                 0.000000
## 4              2              1              1                 2.372392
## 5              2              1              1                 0.000000
## 6              2              1              3                 0.000000
##   sc3_3_log2_outlier_score sc3_4_log2_outlier_score
## 1                       NA                       NA
## 2                       NA                       NA
## 3                 0.000000                 0.000000
## 4                 2.438169                 3.305473
## 5                 0.000000                 0.000000
## 6                 1.752293                 0.000000

Now we can run the SVM and predict labels of all the other cells:

sceset <- sc3_run_svm(sceset)
p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

##   sc3_2_clusters sc3_3_clusters sc3_4_clusters sc3_2_log2_outlier_score
## 1              2              1              1                       NA
## 2              2              1              1                       NA
## 3              2              3              4                 0.000000
## 4              2              1              1                 2.372392
## 5              2              1              1                 0.000000
## 6              2              1              3                 0.000000
##   sc3_3_log2_outlier_score sc3_4_log2_outlier_score
## 1                       NA                       NA
## 2                       NA                       NA
## 3                 0.000000                 0.000000
## 4                 2.438169                 3.305473
## 5                 0.000000                 0.000000
## 6                 1.752293                 0.000000

Note that the cell outlier scores (and also DE and marker genes values) were not updated and they still contain NA values for non-training cells. To recalculate biological characteristics using the labels predicted by SVM one needs to clear the svm_train_inds item in the sc3 slot and rerun the sc3_calc_biology method:

sceset@sc3$svm_train_inds <- NULL
sceset <- sc3_calc_biology(sceset)

## Calculating biology...

p_data <- pData(sceset)
head(p_data[ , grep("sc3_", colnames(p_data))])

##   sc3_2_clusters sc3_3_clusters sc3_4_clusters sc3_2_log2_outlier_score
## 1              2              1              1                        0
## 2              2              1              1                        0
## 3              2              3              4                        0
## 4              2              1              1                        0
## 5              2              1              1                        0
## 6              2              1              3                        0
##   sc3_3_log2_outlier_score sc3_4_log2_outlier_score
## 1                        0                 0.000000
## 2                        0                 0.000000
## 3                        0                 0.000000
## 4                        0                 2.654047
## 5                        0                 0.000000
## 6                        0                 0.000000

Now the biological characteristics are calculated for all cells (including those predicted by the SVM). Let us save the new labels for k = 3:

svm_labels <- pData(sceset)$sc3_3_clusters

Now we can compare the labels using the adjusted rand index (ARI):

if (require("mclust")) {
  adjustedRandIndex(no_svm_labels, svm_labels)
}

## Loading required package: mclust

## Package 'mclust' version 5.2.3

## Type 'citation("mclust")' for citing this R package in publications.

## [1] 0.6212237

The ARI is less than 1, which means that the SVM results differ from the non-SVM results. At about 0.62, however, it still indicates that the two solutions largely agree.
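For readers without mclust, the adjusted Rand index can also be computed from the contingency table of the two label vectors in a few lines of base R. The function name adjusted_rand_index and the toy label vectors are hypothetical; the sketch follows the standard pair-counting formula.

```r
adjusted_rand_index <- function(a, b) {
    tab <- table(a, b)
    n <- sum(tab)
    # Pairs of cells that fall in the same cell of the contingency table.
    sum_cells <- sum(choose(tab, 2))
    # Pairs that share a label within each labelling separately.
    sum_rows <- sum(choose(rowSums(tab), 2))
    sum_cols <- sum(choose(colSums(tab), 2))
    expected <- sum_rows * sum_cols / choose(n, 2)
    max_index <- (sum_rows + sum_cols) / 2
    (sum_cells - expected) / (max_index - expected)
}

x <- c(1, 1, 1, 2, 2, 2)
y <- c(2, 2, 2, 1, 1, 1)   # same partition as x, labels swapped
z <- c(1, 1, 2, 2, 3, 3)   # a different partition

print(adjusted_rand_index(x, y))
print(adjusted_rand_index(x, z))
```

Because the index compares partitions rather than label values, swapping the label names (x versus y) does not change the result.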

SC3 package manual

Vladimir Kiselev

2017-05-10

Contents

1 Introduction

2 Quality Control, Normalisation and `scater`

3 Quick Start

3.1 `SC3` Input

3.2 Run SC3

3.3 phenoData

3.4 featureData

4 Number of Cells

5 Plot Functions

5.1 Consensus Matrix

5.2 Silhouette Plot

5.3 Expression Matrix

5.4 Cluster Stability

5.5 DE genes

5.6 Marker Genes

6 SC3 in Detail

6.1 `sc3_prepare`

6.2 (optional) `sc3_estimate_k`

6.3 `sc3_calc_dists`

6.4 `sc3_calc_transfs`

6.5 `sc3_kmeans`

6.6 `sc3_calc_consens`

6.7 (optional) `sc3_calc_biology`

6.7.1 Cell Outliers

6.7.2 DE and marker genes

7 Hybrid `SVM` Approach
