1 Introduction

Single-cell RNA sequencing (scRNA-seq) is a widely used technique for profiling gene expression in individual cells. This allows molecular biology to be studied at a resolution that cannot be matched by bulk sequencing of cell populations. The scran package implements methods to perform low-level processing of scRNA-seq data, including cell cycle phase assignment, scaling normalization, batch correction, variance modelling and testing for corrrelated genes. This vignette provides brief descriptions of these methods and some toy examples to demonstrate their use.

2 Setting up the data

We start off with a count matrix where each row is a gene and each column is a cell. These can be obtained by mapping read sequences to a reference genome, and then counting the number of reads mapped to the exons of each gene. (See, for example, the Rsubread package to do both of these tasks.) Alternatively, pseudo-alignment methods can be used to quantify the abundance of each transcript in each cell. For simplicity, though, we’ll just simulate some counts here from a negative binomial distribution.

ngenes <- 10000
ncells <- 200
mu <- 2^runif(ngenes, -1, 5)
gene.counts <- matrix(rnbinom(ngenes*ncells, mu=mu, size=10), nrow=ngenes)

We add some arbitrary Ensembl gene IDs to give the impression that this is real (mouse) data.

library(org.Mm.eg.db)
all.ensembl <- unique(toTable(org.Mm.egENSEMBL)$ensembl_id)
rownames(gene.counts) <- sample(all.ensembl, ngenes)

We also have a set of counts for spike-in transcripts. These are appended to the counts for the endogenous genes. In practice, the reads should have been mapped to the spike-in transcipts by including the spike-in sequences in the genome index.

nspikes <- 100
ncells <- 200
mu <- 2^runif(nspikes, -1, 5)
spike.counts <- matrix(rnbinom(nspikes*ncells, mu=mu, size=10), nrow=nspikes)
rownames(spike.counts) <- paste0("ERCC-", seq_len(nspikes))
all.counts <- rbind(gene.counts, spike.counts)

Finally, we construct a SingleCellExperiment object to store all of the data. We also indicate which rows correspond to spike-in transcripts. This is done through the calculateQCMetrics method from scater, which takes a named list of sets of control genes. We indicate which sets of controls are spike-ins using the setSpike setter function. (In this case, there is only one control set, so the process may seem more complicated than necessary. The usefulness of this setup becomes more obvious when multiple control sets are present.) This information can be easily extracted later on using the isSpike, spikes and whichSpike methods.

library(scran)
sce <- SingleCellExperiment(list(counts=all.counts))
isSpike(sce, "MySpike") <- grep("^ERCC", rownames(sce))

This is simulated data, so we assume that quality control has already been applied to remove low-quality cells or low-abundance genes. Check out the scater and cellity packages for more details. Also see the simpleSingleCell workflow where all these steps are used in real data analyses.

3 Cell cycle phase assignment

We use a pre-defined classifier to assign cells into their cell cycle phases (Scialdone et al. 2015). This classifier was constructed from a training data set by identifying pairs of genes where the difference in expression within each pair changed sign across phases. Thus, by examining the sign of the difference in test data, the phase to which the cell belongs can be identified. Classifiers for human and mouse data are provided with the package – for other systems, classifiers can be constructed from a training set using the sandbag function.

mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))

The classification itself is done using the cyclone function, given the count data and the trained classifier. This yields a number of scores representing the consistency of the signs with each phase.

assigned <- cyclone(sce, pairs=mm.pairs)
head(assigned$scores)
##      G1     S   G2M
## 1 0.867 0.198 0.261
## 2 0.742 0.446 0.130
## 3 0.910 0.151 0.036
## 4 0.785 0.422 0.103
## 5 0.986 0.405 0.009
## 6 0.711 0.087 0.247

Cells are considered to be in G1 phase, if the G1 score is above 0.5 and the G2/M score is below 0.5; to be in G2/M phase, if the G2/M score is above 0.5 and the G1 score is below 0.5; to be in S phase, if both scores are below 0.5; and to be unknown, if both scores are above 0.5. Despite the availability of a S score, it tends to be more accurate to assign cells based on the G1 and G2/M scores only.

table(assigned$phases)
## 
##  G1 
## 200

Note that it is generally best practice to perform cell cycle phase assignment before filtering out low-abundance genes. This is because the lack of expression of particular genes can provide some information about the cell cycle.

4 Normalizing cell-specific biases

4.1 Based on the gene counts

Cell-specific biases are normalized using the computeSumFactors method, which implements the deconvolution strategy for scaling normalization (A. T. Lun, Bach, and Marioni 2016). This computes size factors that are used to scale the counts in each cell. The assumption is that most genes are not differentially expressed (DE) between cells, such that any differences in expression across the majority of genes represents some technical bias that should be removed.

sce <- computeSumFactors(sce)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9792  0.9950  0.9997  1.0000  1.0044  1.0229

For larger data sets, clustering should be performed with the quickCluster function before normalization. Briefly, cells are grouped into clusters of similar expression; normalization is applied within each cluster to compute size factors for each cell; and the factors are rescaled by normalization between clusters. This reduces the risk of violating the above assumption when many genes are DE between clusters in a heterogeneous population.

larger.sce <- SingleCellExperiment(list(counts=cbind(all.counts, all.counts, all.counts)))
clusters <- quickCluster(larger.sce, min.size=100)
larger.sce <- computeSumFactors(larger.sce, cluster=clusters)

Note that computeSumFactors will automatically remove low-abundance genes, which provides some protection against zero or negative size factor estimates. We also assume that quality control on the cells has already been performed, as low-quality cells with few expressed genes can often have negative size factor estimates.

4.2 Based on the spike-in counts

An alternative approach is to normalize based on the spike-in counts (Lun et al. 2017). The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells. The main practical difference is that spike-in normalization preserves differences in total RNA content between cells, whereas computeSumFactors and other non-DE methods do not.

sce2 <- computeSpikeFactors(sce)
summary(sizeFactors(sce2))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8240  0.9605  1.0003  1.0000  1.0367  1.1914

Even if we decide to use the deconvolution size factors, it is strongly recommended to compute a separate set of size factors for the spike-ins. This is because the spike-ins are not affected by total mRNA content. Using the deconvolution size factors will over-normalize the spike-in counts, whereas the spike-in size factors are more appropriate. To obtain the latter without overwriting the former, we set general.use=FALSE in our call to computeSpikeFactors. This means that the spike-in-based size factors will be computed and stored in the SingleCellExperiment object, but will only be used by the spike-in transcripts. (Obviously, if the spike-in size factors were already being used for normalization, e.g., in sce2, then this extra step is unnecessary.)

sce <- computeSpikeFactors(sce, general.use=FALSE)

4.3 Computing normalized expression values

Normalized expression values are calculated using the normalize method from scater (McCarthy et al. 2017). This will use the deconvolution size factors for the endogenous genes, and the spike-in-based size factors for the spike-in transcripts. Each expression value can be interpreted as a log-transformed “normalized count”, and can be used in downstream applications like clustering or dimensionality reduction.

sce <- normalize(sce)

5 Variance modelling

We identify genes that drive biological heterogeneity in the data set by modelling the per-gene variance. The aim is use a subset of highly variable genes in downstream analyses like clustering, to improve resolution by removing genes driven by technical noise. We first decompose the total variance of each gene into its biological and technical components (A. T. Lun, McCarthy, and Marioni 2016). We fit a mean-variance trend to the normalized log-expression values with trendVar. By default, this done using only the spike-in transcripts, as these should only exhibit technical noise.

fit <- trendVar(sce, parametric=TRUE)

The fitted value of the trend is used as an estimate of the technical component. We subtract the fitted value from the total variance to obtain the biological component for each gene. We can then extract some certain number of top genes for use in downstream procedures; or more generally, take all potentially interesting genes with positive biological components.

decomp <- decomposeVar(sce, fit)
top.hvgs <- order(decomp$bio, decreasing=TRUE)
head(decomp[top.hvgs,])
## DataFrame with 6 rows and 6 columns
##                                mean             total               bio
##                           <numeric>         <numeric>         <numeric>
## ENSMUSG00000075199 2.62012318702266 0.758774047419923 0.252821655945469
## ENSMUSG00000107563 2.52676580707279  0.73423352251127 0.215208436760169
## ENSMUSG00000021680 2.03840180002968 0.808128950218944 0.209653487338615
## ENSMUSG00000100433 1.86651846215236 0.814491846948946 0.199300800472213
## ENSMUSG00000029442 2.58466784376709 0.702617285220879 0.191933381880768
## ENSMUSG00000015653 1.75209554061534 0.810751614137527    0.189085759463
##                                 tech              p.value                FDR
##                            <numeric>            <numeric>          <numeric>
## ENSMUSG00000075199 0.505952391474455 6.29225150577593e-06 0.0314612575288796
## ENSMUSG00000107563 0.519025085751102 0.000105854769199466 0.0993162760622193
## ENSMUSG00000021680 0.598475462880329 0.000720868348452389  0.215862190178204
## ENSMUSG00000100433 0.615191046476732  0.00149296774758345  0.281692027845934
## ENSMUSG00000029442 0.510683903340111 0.000344711537495918  0.167437935678128
## ENSMUSG00000015653 0.621665854674527  0.00252198713541815   0.36020547449175

We examine this in more detail by constructing a mean-variance plot. Here, the black points represent the endogenous genes; the red points represent spike-in transcripts; and the red line represents the mean-variance trend fitted to the spike-ins.

plot(decomp$mean, decomp$total, xlab="Mean log-expression", ylab="Variance")
o <- order(decomp$mean)
lines(decomp$mean[o], decomp$tech[o], col="red", lwd=2)
points(fit$mean, fit$var, col="red", pch=16)

If spike-ins are absent or of poor quality, an alternative is to fit the trend to the gene variances directly with use.spikes=FALSE. This assumes that technical noise is the major contributor to the variance of most genes in the data set, such that the trend still represents the technical component. The resulting fit can then be used in decomposeVar as described above.

alt.fit <- trendVar(sce, use.spikes=FALSE) 
alt.decomp <- decomposeVar(sce, alt.fit)

If the data set already contains some uninteresting substructure (e.g., batch effects), we can block on this by setting the block= argument in trendVar. This ensures that the substructure does not inflate the variance estimates. For example, if the cells were prepared in two separate batches, we can set the batch of origin as block. The same blocking information will also be used in decomposeVar.

batch <- rep(c("1", "2"), each=100)
alt.fit2 <- trendVar(sce, block=batch)
alt.decomp2 <- decomposeVar(sce, alt.fit)

See this workflow for more discussion about variance modelling with trendVar and decomposeVar. Other alternatives include the DM and technicalCV2 functions, which quantify expression variance based on the coefficient of variation of the (normalized) counts. These provide more power for detecting genes that are only expressed in rare subpopulations, but are also more sensitive to outliers. Also see the improvedCV2 function, which is intended as a more stable counterpart of technicalCV2.

6 Detecting correlated genes

Another useful procedure is to identify significant pairwise correlations between pairs of HVGs. The idea is to distinguish between HVGs caused by random stochasticity, and those that are driving systematic heterogeneity, e.g., between subpopulations. Correlations are computed in the correlatePairs method using a slightly modified version of Spearman’s rho. Testing is performed against the null hypothesis of independent genes, using a permutation method in correlateNull to construct a null distribution.

null.dist <- correlateNull(ncol(sce))
# Only using the first 200 genes as a demonstration.
cor.pairs <- correlatePairs(sce, subset.row=top.hvgs[1:200], null.dist=null.dist)
head(cor.pairs)
## DataFrame with 6 rows and 6 columns
##                gene1              gene2                rho
##          <character>        <character>          <numeric>
## 1 ENSMUSG00000039219 ENSMUSG00000068917 -0.266637665941649
## 2 ENSMUSG00000070420 ENSMUSG00000057716 -0.263810095252381
## 3 ENSMUSG00000039099 ENSMUSG00000069295  -0.25940748518713
## 4 ENSMUSG00000025316 ENSMUSG00000097891   0.25815945398635
## 5 ENSMUSG00000024184 ENSMUSG00000007097 -0.255451886297158
## 6 ENSMUSG00000076272 ENSMUSG00000107068 -0.241354533863347
##                p.value               FDR   limited
##              <numeric>         <numeric> <logical>
## 1 0.000123999876000124 0.872630327369673     FALSE
## 2 0.000141999858000142 0.872630327369673     FALSE
## 3 0.000193999806000194 0.872630327369673     FALSE
## 4 0.000223999776000224 0.872630327369673     FALSE
## 5 0.000225999774000226 0.872630327369673     FALSE
## 6 0.000563999436000564   0.9002385869409     FALSE

As with variance estimation, if uninteresting substructure is present, this should be blocked on using the block= argument in both correlateNull and correlatePairs. This avoids strong correlations due to the blocking factor.

null.dist2 <- correlateNull(block=batch, iter=1e5) # fewer iterations, to speed it up.
cor.pairs2 <- correlatePairs(sce, subset.row=top.hvgs[1:200], 
    null.dist=null.dist2, block=batch)

The pairs can be used for choosing marker genes in experimental validation, and to construct gene-gene association networks. In other situations, the pairs may not be of direct interest - rather, we just want to know whether a gene is correlated with any other gene. This is often the case if we are to select a set of correlated HVGs for use in downstream steps like clustering or dimensionality reduction. To do so, we set per.gene=TRUE to compute a single set of statistics for each gene, rather than for each pair.

cor.genes <- correlatePairs(sce, subset.row=top.hvgs[1:200], 
    null.dist=null.dist, per.gene=TRUE)

Significant correlations are defined at a false discovery rate (FDR) threshold of, e.g., 5%. Note that the p-values are calculated by permutation and will have a lower bound. If there were insufficient permutation iterations, a warning will be issued suggesting that more iterations be performed.

7 Batch correction

Batch correction is performed by detecting mutual nearest neighbors (MNNs) (Haghverdi et al. 2018). We assume that two batches contain at least one common cell type, and that the batch effect is orthogonal to the biological differences in each batch. We then apply the fastMNN function to compute corrected values in a low-dimensional subspace defined by the first 50 PCs.

b1 <- sce
b2 <- sce

# Adding a very simple batch effect.
logcounts(b2) <- logcounts(b2) + runif(nrow(b2), -1, 1) 

out <- fastMNN(b1, b2)
dim(out$corrected)
## [1] 400  50
out$batch
## integer-Rle of length 400 with 2 runs
##   Lengths: 200 200
##   Values :   1   2

We see that out simple batch effect is removed in the corrected values:

combined <- cbind(b1, b2)
reducedDim(combined, "corrected") <- out$correct
combined$batch <- gl(2, ncol(b1))

library(scater)
multiplot(
    plotPCA(combined, colour_by="batch") + ggtitle("Without correction"),
    plotReducedDim(combined, "corrected", colour_by="batch") + ggtitle("With correction"),
    cols=2
)

The advantage of the MNN approach (that is not immediately obvious from the example above) is that it can handle differences in population composition between batches. This provides correct batch correction in situations where the cell type proportions change between samples, unlike standard methods like removeBatchEffect(). We suggest reading this workflow for more details.

8 Converting to other formats

The SingleCellExperiment object can be easily converted into other formats using the convertTo method. This allows analyses to be performed using other pipelines and packages. For example, if DE analyses were to be performed using edgeR, the count data in sce could be used to construct a DGEList.

y <- convertTo(sce, type="edgeR")

By default, rows corresponding to spike-in transcripts are dropped when get.spikes=FALSE. As such, the rows of y may not correspond directly to the rows of sce – users should match by row name to ensure correct cross-referencing between objects. Normalization factors are also automatically computed from the size factors.

The same conversion strategy roughly applies to the other supported formats. DE analyses can be performed using DESeq2 by converting the object to a DESeqDataSet. Cells can be ordered on pseudotime with monocle by converting the object to a CellDataSet (in this case, normalized unlogged expression values are stored).

9 Summary

This vignette describes the main functions in the scran package for basic analysis of single-cell RNA-seq data. We cover normalization, cell cycle phase assignment, HVG detection and correlation testing. Conversion to other formats can also be performed in preparation for analyses with other packages in the Bioconductor project. Further information can be obtained by examining the documentation for each function (e.g., ?convertTo); reading the simpleSingleCell workflow; or asking for help on the Bioconductor support site (please read the posting guide beforehand).

10 Session information

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.10.1               ggplot2_3.1.0              
##  [3] org.Mm.eg.db_3.7.0          AnnotationDbi_1.44.0       
##  [5] scran_1.10.2                SingleCellExperiment_1.4.1 
##  [7] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
##  [9] matrixStats_0.54.0          Biobase_2.42.0             
## [11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [13] IRanges_2.16.0              S4Vectors_0.20.1           
## [15] BiocGenerics_0.28.0         BiocParallel_1.16.5        
## [17] knitr_1.21                  BiocStyle_2.10.0           
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1            dynamicTreeCut_1.63-1   
##  [3] edgeR_3.24.3             bit64_0.9-7             
##  [5] viridisLite_0.3.0        DelayedMatrixStats_1.4.0
##  [7] assertthat_0.2.0         statmod_1.4.30          
##  [9] BiocManager_1.30.4       blob_1.1.1              
## [11] GenomeInfoDbData_1.2.0   vipor_0.4.5             
## [13] yaml_2.2.0               RSQLite_2.1.1           
## [15] pillar_1.3.1             lattice_0.20-38         
## [17] glue_1.3.0               limma_3.38.3            
## [19] digest_0.6.18            XVector_0.22.0          
## [21] colorspace_1.3-2         cowplot_0.9.3           
## [23] htmltools_0.3.6          Matrix_1.2-15           
## [25] plyr_1.8.4               pkgconfig_2.0.2         
## [27] bookdown_0.9             zlibbioc_1.28.0         
## [29] purrr_0.2.5              scales_1.0.0            
## [31] HDF5Array_1.10.1         tibble_2.0.0            
## [33] withr_2.1.2              lazyeval_0.2.1          
## [35] magrittr_1.5             crayon_1.3.4            
## [37] memoise_1.1.0            evaluate_0.12           
## [39] beeswarm_0.2.3           tools_3.5.2             
## [41] stringr_1.3.1            Rhdf5lib_1.4.2          
## [43] munsell_0.5.0            locfit_1.5-9.1          
## [45] bindrcpp_0.2.2           compiler_3.5.2          
## [47] rlang_0.3.0.1            rhdf5_2.26.2            
## [49] grid_3.5.2               RCurl_1.95-4.11         
## [51] BiocNeighbors_1.0.0      igraph_1.2.2            
## [53] labeling_0.3             bitops_1.0-6            
## [55] rmarkdown_1.11           gtable_0.2.0            
## [57] DBI_1.0.0                reshape2_1.4.3          
## [59] R6_2.3.0                 gridExtra_2.3           
## [61] dplyr_0.7.8              bit_1.1-14              
## [63] bindr_0.1.1              stringi_1.2.4           
## [65] ggbeeswarm_0.6.0         Rcpp_1.0.0              
## [67] tidyselect_0.2.5         xfun_0.4

References

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

Lun, A. T., D. J. McCarthy, and J. C. Marioni. 2016. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res 5:2122.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Scialdone, A., K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (September):54–61.