Food for the mind: using scran to perform basic analyses of single-cell RNA-seq data

Package: scran
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: 2016-08-10

Introduction

Single-cell RNA sequencing (scRNA-seq) is a widely used technique for profiling gene expression in individual cells. This allows molecular biology to be studied at a resolution that cannot be matched by bulk sequencing of cell populations. Common analyses include detection of highly variable and correlated genes across cells, or assignment of cells to cell cycle phases. Cell-specific biases also need to be normalized in a manner that is robust to low counts and technical noise. The scran package implements methods to perform these analyses. This vignette provides a brief description of each method and some toy examples for how they are used.

Setting up the data

We start off with a count matrix where each row is a gene and each column is a cell.

ngenes <- 10000
ncells <- 200
mu <- 2^runif(ngenes, 3, 10)
gene.counts <- matrix(rnbinom(ngenes*ncells, mu=mu, size=2), nrow=ngenes)

We add some arbitrary Ensembl gene IDs to give the impression that this is real (mouse) data.

library(org.Mm.eg.db)
all.ensembl <- unique(toTable(org.Mm.egENSEMBL)$ensembl_id)
rownames(gene.counts) <- sample(all.ensembl, ngenes)

We also have a set of counts for spike-in transcripts. These are appended to the counts for the endogenous genes.

nspikes <- 100
ncells <- 200
mu <- 2^runif(nspikes, 3, 10)
spike.counts <- matrix(rnbinom(nspikes*ncells, mu=mu, size=2), nrow=nspikes)
rownames(spike.counts) <- paste0("ERCC-", seq_len(nspikes))
all.counts <- rbind(gene.counts, spike.counts)

Finally, we construct a SCESet object to store all of the data. We also indicate which rows correspond to spike-in transcripts. This information can be easily extracted using the isSpike or spikes methods.

library(scran)
sce <- newSCESet(countData=data.frame(all.counts))
isSpike(sce) <- rep(c(FALSE, TRUE), c(ngenes, nspikes))

This is simulated data, so we assume that quality control has already been applied to remove low-quality cells or low-abundance genes. Check out the scater and cellity packages for more details.

Normalizing cell-specific biases

Cell-specific biases can be normalized using the computeSumFactors method. This computes size factors that are used to scale the counts in each cell. The assumption is that most genes are not differentially expressed (DE) between cells, such that any differences in expression across the majority of genes represents some technical bias that should be removed.

sce <- computeSumFactors(sce)
summary(sizeFactors(sce))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9680  0.9926  1.0010  1.0000  1.0080  1.0370

For larger data sets, clustering can be performed with the quickCluster function before normalization. Briefly, cells are grouped into clusters of similar expression; normalization is applied within each cluster to compute size factors for each cell; and the factors are rescaled by normalization between clusters. This reduces the risk of violating the above assumption when many genes are DE across a heterogeneous population.

larger.sce <- newSCESet(countData=data.frame(cbind(all.counts, all.counts, all.counts)))
clusters <- quickCluster(larger.sce)
larger.sce <- computeSumFactors(larger.sce, cluster=clusters)

An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors can then be computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells. The main practical difference is that spike-in normalization preserves differences in total RNA content between cells, whereas computeSumFactors and other non-DE methods do not.

sce2 <- computeSpikeFactors(sce)
summary(sizeFactors(sce2))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7498  0.9260  0.9993  1.0060  1.0780  1.2910

Cell cycle phase assignment

We use a pre-defined classifier to assign cells into their cell cycle phases. This classifier was constructed from a training data set by identifying pairs of genes where the difference in expression within each pair changed sign across phases. Thus, by examining the sign of the difference in test data, the phase to which the cell belongs can be identified. Classifiers for human and mouse data are provided with the package – for other systems, classifiers can be constructed from a training set using the sandbag function.

mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))

The classification itself is done using the cyclone function, given the count data and the trained classifier. This yields a number of scores representing the consistency of the signs with each phase.

assigned <- cyclone(sce, mm.pairs)
head(assigned$scores)

##      G1     S   G2M
## 1 0.712 0.733 0.225
## 2 0.592 0.977 0.124
## 3 0.849 0.980 0.026
## 4 0.703 0.912 0.234
## 5 0.598 0.966 0.475
## 6 0.583 0.996 0.145

Cells are considered to be in G1 phase, if the G1 score is above 0.5 and the G2/M score is below 0.5; to be in G2/M phase, if the G2/M score is above 0.5 and the G1 score is below 0.5; to be in S phase, if both scores are below 0.5; and to be unknown, if both scores are above 0.5. Despite the availability of a S score, it tends to be more accurate to assign cells based on the G1 and G2/M scores only.

phase <- rep("S", ncol(sce))
phase[assigned$scores$G1 > 0.5] <- "G1"
phase[assigned$scores$G2M > 0.5] <- "G2M"
phase[assigned$scores$G1 > 0.5 & assigned$scores$G2M > 0.5] <- "unknown"
table(phase)

## phase
##      G1     G2M       S unknown 
##     151       3      44       2

Detecting highly variable genes

Highly variable genes (HVGs) are detected by decomposing the total variance of each gene into its biological and technical components. This avoids prioritizing low-abundance genes that have large variances due to technical noise. First, we fit a mean-variance trend to the normalized log-expression values with trendVar. By default, this done using only the spike-in transcripts, as these should only exhibit technical noise.

fit <- trendVar(sce)

The fitted value of the trend can then be used as an estimate of the technical component. We subtract the fitted value from the total variance to obtain the biological component for each gene. HVGs can be defined as the top set of genes with the largest biological components.

decomp <- decomposeVar(sce, fit)
top.hvgs <- order(decomp$bio, decreasing=TRUE)
head(decomp[top.hvgs,])

##                        mean    total       bio     tech
## ENSMUSG00000102715 6.476175 1.955696 0.6508840 1.304812
## ENSMUSG00000079116 6.264082 1.917904 0.6223778 1.295526
## ENSMUSG00000015120 5.786924 1.884615 0.6041144 1.280501
## ENSMUSG00000028758 6.452648 1.893207 0.5895256 1.303682
## ENSMUSG00000037605 7.209836 1.935166 0.5824988 1.352667
## ENSMUSG00000039512 7.107341 1.898387 0.5536334 1.344754

This can be examined more visually by constructing a mean-variance plot. Here, the black points represent the endogenous genes; the red points represent spike-in transcripts; and the red line represents the mean-variance trend fitted to the spike-ins.

plot(decomp$mean, decomp$total, xlab="Mean log-expression", ylab="Variance")
o <- order(decomp$mean)
lines(decomp$mean[o], decomp$tech[o], col="red", lwd=2)
points(fit$mean, fit$var, col="red", pch=16)

plot of chunk unnamed-chunk-14

If spike-ins are absent or of poor quality, an alternative is to fit the trend to the gene variances directly with use.spikes=FALSE. This assumes that technical noise is the major contributor to the variance of most genes in the data set, such that the trend still represents the technical component. The resulting fit can then be used in decomposeVar as described above.

alt.fit <- trendVar(sce, use.spikes=FALSE) 
alt.decomp <- decomposeVar(sce, alt.fit)

If the data set already contains some uninteresting substructure (e.g., batch effects), we can block on this by setting the design argument in trendVar. This ensures that the substructure does not inflate the variance estimates. For example, if the cells were prepared in two separate batches, we could construct a design matrix incorporating this information with model.matrix and pass it to trendVar. The same design will also be used in decomposeVar.

batch <- rep(c("1", "2"), each=100)
design <- model.matrix(~batch)
alt.fit2 <- trendVar(sce, design=design)
alt.decomp2 <- decomposeVar(sce, alt.fit)

Detecting correlated genes

The top set of HVGs can be used to identify significant correlations between pairs of genes. The idea is to distinguish between HVGs caused by random stochasticity, and those that are driving systematic heterogeneity, e.g., between subpopulations. Correlations are computed in the correlatePairs method using a slightly modified version of Spearman's rho. Testing is performed against the null hypothesis of independent genes, using a permutation method in correlateNull to construct a null distribution.

null.dist <- correlateNull(ncol(sce))
cor.pairs <- correlatePairs(sce[top.hvgs[1:200],], null.dist=null.dist)
head(cor.pairs)

##                gene1              gene2        rho      p.value       FDR
## 1 ENSMUSG00000031503 ENSMUSG00000037646 -0.3040381 2.399998e-05 0.3780996
## 2 ENSMUSG00000030413 ENSMUSG00000001569  0.2868222 3.799996e-05 0.3780996
## 3 ENSMUSG00000044231 ENSMUSG00000025193 -0.2761794 8.999991e-05 0.5173995
## 4 ENSMUSG00000026751 ENSMUSG00000105712  0.2713418 1.039999e-04 0.5173995
## 5 ENSMUSG00000030615 ENSMUSG00000026175 -0.2549959 2.979997e-04 0.7425180
## 6 ENSMUSG00000038822 ENSMUSG00000051802  0.2462732 4.639995e-04 0.7425180

As with variance estimation, if uninteresting substructure is present, this should be blocked on using the design argument in both correlateNull and correlatePairs.

null.dist2 <- correlateNull(design=design)
cor.pairs2 <- correlatePairs(sce[top.hvgs[1:200],], null.dist=null.dist2, design=design)

Significant correlations between pairs of genes can be defined at a false discovery rate (FDR) threshold of, e.g., 5%. In this case, no correlations are significant as the counts were randomly generated for each gene. In other situations when correlated gene pairs are present, these can be used to construct heatmaps to verify whether subpopulations exist; for choosing marker genes in experimental validation; and to construct gene-gene association networks.

Converting to other formats

The SCESet object can be easily converted into other formats using the convertTo method. This allows analyses to be performed using other pipelines and packages. For example, if DE analyses were to be performed using edgeR, the count data in sce could be used to construct a DGEList.

y <- convertTo(sce, type="edgeR")

By default, rows corresponding to spike-in transcripts are dropped when get.spikes=FALSE. As such, the rows of y may not correspond directly to the rows of sce – users should match by row name to ensure correct cross-referencing between objects. Normalization factors are also automatically computed from the size factors.

The same conversion strategy roughly applies to the other supported formats. DE analyses can be performed using DESeq2 by converting the object to a DESeqDataSet. Cells can be ordered on pseudotime with monocle by converting the object to a CellDataSet (in this case, normalized unlogged expression values are stored).

Summary

This vignette describes the main functions in the scran package for basic analysis of single-cell RNA-seq data. We cover normalization, cell cycle phase assignment, HVG detection and correlation testing. Conversion to other formats can also be performed in preparation for analyses with other packages in the Bioconductor project. Further information can be obtained by reading the documentation for each function (e.g., ?convertTo), or asking for help on the Bioconductor support site (please read the posting guide beforehand).

Session information

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Mm.eg.db_3.3.0   AnnotationDbi_1.34.4 IRanges_2.6.1       
##  [4] S4Vectors_0.10.2     scran_1.0.4          scater_1.0.4        
##  [7] ggplot2_2.1.0        Biobase_2.32.0       BiocGenerics_0.18.0 
## [10] BiocParallel_1.6.5   BiocStyle_2.0.3      knitr_1.13          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6           formatR_1.4           plyr_1.8.4           
##  [4] zlibbioc_1.18.0       viridis_0.3.4         bitops_1.0-6         
##  [7] tools_3.3.1           biomaRt_2.28.0        digest_0.6.10        
## [10] lattice_0.20-33       rhdf5_2.16.0          tibble_1.1           
## [13] evaluate_0.9          RSQLite_1.0.0         gtable_0.2.0         
## [16] Matrix_1.2-6          shiny_0.13.2          DBI_0.4-1            
## [19] gridExtra_2.2.1       dplyr_0.5.0           stringr_1.0.0        
## [22] grid_3.3.1            shinydashboard_0.5.1  data.table_1.9.6     
## [25] R6_2.1.2              XML_3.98-1.4          limma_3.28.17        
## [28] reshape2_1.4.1        edgeR_3.14.0          magrittr_1.5         
## [31] matrixStats_0.50.2    scales_0.4.0          htmltools_0.3.5      
## [34] dynamicTreeCut_1.63-1 tximport_1.0.3        assertthat_0.1       
## [37] mime_0.5              xtable_1.8-2          colorspace_1.2-6     
## [40] httpuv_1.3.3          stringi_1.1.1         RCurl_1.95-4.8       
## [43] munsell_0.4.3         rjson_0.2.15          chron_2.3-47         
## [46] zoo_1.7-13