SC3 manual

Vladimir Kiselev

2016-05-15

Introduction

Single-Cell Consensus Clustering (SC3) is a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by integrating different clustering solutions through a consensus approach. An interactive graphical implementation makes SC3 accessible to a wide audience of users. SC3 also aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. A manuscript describing SC3 in detail is currently under review, and a copy of it is available on bioRxiv.

Prerequisites (optional)

Note that this prerequisite is optional. If it is not installed you can still run SC3, but the GO analysis functionality will be disabled.

SC3 imports some of the RSelenium functionality. RSelenium depends on a stand-alone Java binary file (see the RSelenium documentation for more details). You can download and install this binary file (about 30 MB) by running:

RSelenium::checkForServer()

Note that this command needs to be executed only once, before running SC3 for the first time. Also note that the minimum Java version required by RSelenium is 1.7 (see this post for details).
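Before installing the binary, it can help to confirm that Java is visible from R. The snippet below is a minimal sketch using base R only. It assumes the java executable is expected on the PATH, and it only prints the version string for you to compare against the 1.7 requirement; it does not parse it.

```r
# Locate the java executable on the PATH; Sys.which() returns "" when it is absent.
java_path <- Sys.which("java")
java_found <- nzchar(java_path)[[1]]

if (java_found) {
    # `java -version` prints to stderr, so capture both streams.
    version_info <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
    message(paste(version_info, collapse = "\n"))
} else {
    message("No java executable found on the PATH.")
}
```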

“Built-in” datasets

There is one built-in dataset that is automatically loaded with SC3:

Dataset     Source                   N cells   k clusters
treutlein   Distal lung epithelium   80        5

Quick start

To run SC3 with the built-in dataset, start R and then type:

library(SC3)
sc3(treutlein)

This should open SC3 in a browser window without reporting any errors.

SC3 parameters

By default SC3 clusters the input dataset for a range of k from 3 to 7. This range can be changed by adjusting the ks parameter of SC3. For example, if you would like to check the clustering of your own dataset for k from 2 to 5, run the following:

sc3(dataset, ks = 2:5)

where dataset is either an R matrix / data.frame / data.table object OR a path to your input file containing an expression matrix.

For the full list of SC3 parameters please read the documentation by typing ?sc3.

Running SC3 without interactivity

If you would like to separate the clustering calculations from the visualisation, e.g. when your dataset is very large and you need to run it on a computing cluster, you can disable interactivity by setting the interactivity parameter to FALSE:

sc3(treutlein, interactivity = FALSE)

In this case all clustering results will be saved in the sc3.interactive.arg variable in your R session. You can then visualise your results by running:

sc3_interactive(sc3.interactive.arg)

Number of cells

Combining multiple clustering results through the consensus approach is computationally expensive: running SC3 on a dataset with 1,000 cells may take ~10-15 mins. We do not recommend using the default options on datasets with more than 1,000 cells.

To apply SC3 to larger datasets, we have implemented a hybrid approach that combines unsupervised and supervised methodologies. When the number of cells is large (N>1,000), SC3 selects a small subset of cells uniformly at random and obtains clusters from this subset. The inferred labels are then used to train a Support Vector Machine (SVM), which is employed to assign labels to the remaining cells. Training the SVM typically takes only a few minutes, thus allowing SC3 to be applied to very large datasets. Based on our results, we set the defaults of SC3 so that when N>1,000 only 20% of the cells are used and when N>5,000 only 1,000 cells are used for clustering before training the SVM. To enable the SVM training, set the svm parameter to TRUE:

sc3(treutlein, svm = TRUE)

You can also control the number of cells used to train the SVM with the svm.num.cells parameter:

sc3(treutlein, svm = TRUE, svm.num.cells = 25)
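To make the default subset sizes concrete, the rule stated above (20% of the cells when N>1,000, capped at 1,000 cells when N>5,000) can be restated as a small function. This is only an illustrative sketch: n_training_cells is a hypothetical helper and not part of SC3, the rounding is an assumption, and so is the choice to use all cells when N is 1,000 or fewer.

```r
# Hypothetical helper, not part of SC3: restates the default rule from the text.
n_training_cells <- function(n_cells) {
    if (n_cells > 5000) {
        1000                    # N > 5,000: a fixed 1,000 cells
    } else if (n_cells > 1000) {
        round(0.2 * n_cells)    # N > 1,000: 20% of the cells (rounding is an assumption)
    } else {
        n_cells                 # assumption: small datasets are clustered in full
    }
}

small  <- n_training_cells(800)
medium <- n_training_cells(3000)
large  <- n_training_cells(20000)
```

Note that the two branches agree at N = 5,000, where 20% of the cells is exactly 1,000.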

Input file format

To run SC3 on an input file containing an expression matrix, you need to preprocess the input file so that it looks as follows:

cell1 cell2 cell3 cell4 cell5
gene1 1 2 3 4 5
gene2 1 2 3 4 5
gene3 1 2 3 4 5

The first row of the expression matrix (with cell labels, e.g. cell1, cell2, etc.) should contain one fewer field than all other rows. Separators should be either spaces or tabs. If the separators are commas (,) then the file extension must be .csv. If the path to your input file is “/path/to/input/file/expression-matrix.txt”, run:

sc3("/path/to/input/file/expression-matrix.txt", ks = 2:5)
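If your expression matrix already lives in R, one way to produce a file in this layout is base R's write.table(), whose defaults write a header row with one fewer field than the data rows. The sketch below uses made-up values and a temporary file path chosen for illustration; it reads the file back with read.table() to confirm the layout.

```r
# Toy expression matrix: genes in rows, cells in columns (values are made up).
expr <- matrix(
    c(1, 2, 3, 4, 5,
      1, 2, 3, 4, 5,
      1, 2, 3, 4, 5),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:5))
)

# With the default row.names = TRUE and col.names = TRUE, write.table()
# writes the cell labels as a header with one fewer field than the data rows.
path <- file.path(tempdir(), "expression-matrix.txt")
write.table(expr, file = path, sep = "\t", quote = FALSE)

# Count the tab-separated fields in the header and in the first data row.
file_lines <- readLines(path)
n_header_fields <- length(strsplit(file_lines[1], "\t")[[1]])
n_row_fields <- length(strsplit(file_lines[2], "\t")[[1]])

# read.table() detects the shorter header and uses the first column as row names.
check <- as.matrix(read.table(path))
```

The resulting path can then be passed to sc3() in the same way as the path in the call above.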

Saving results

After finding the best clustering solution, SC3 allows you to either export the results to an Excel file or save them to the current R session. If you export, the Excel file will be written to the default Downloads folder used by your browser. If you save to the R session, a variable (list) named SC3.results containing all the results from the interactive session will be created.