Further strategies for analyzing single-cell RNA-seq data

Aaron T. L. Lun1, Davis J. McCarthy2,3 and John C. Marioni1,2,4

1Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom
2EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
3St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria 3065, Australia
4Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

2018-05-25

1 Overview
2 Quality control on cells
3 Normalizing based on spike-in coverage
4 Detecting highly variable genes
- 4.1 Setting up the data
- 4.2 Testing for significantly positive biological components
5 Advanced modelling of the technical noise
- 5.1 Trend fitting when spike-ins are unavailable
- 5.2 Blocking on uninteresting factors of variation
6 Identifying correlated gene pairs with Spearman’s rho
7 Using parallel analysis to choose the number of PCs
8 Blocking on the cell cycle phase
9 Concluding remarks
References

1 Overview

The previous workflows focused on analyzing single-cell RNA-seq data with “standard” procedures. However, a number of alternative parameter settings and strategies can be used at some steps of the workflow. This workflow describes a few of these alternative settings as well as the rationale behind choosing them instead of the defaults.

2 Quality control on cells

2.1 Assumptions of outlier identification

An outlier-based definition for low-quality cells assumes that most cells are of high quality. This is usually reasonable and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. Another assumption is that the QC metrics are independent of the biological state of each cell. This ensures that any outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses.

The second assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. Such cell types are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). Nonetheless, filtering based on outliers may not be appropriate in extreme cases where one cell type is very different from the others.

Systematic differences in the QC metrics can be handled to some extent using the batch= argument in the isOutlier() function. For example, setting batch to the plate of origin will identify outliers within each level of batch, using plate-specific median and MAD estimates. This is useful for accommodating known differences in experimental processing, e.g., sequencing at different depths or adding different amounts of spike-in RNA. We can also include biological factors in batch, if those factors could result in systematically fewer expressed genes or lower RNA content. However, this is not applicable in experiments where the factors are not known in advance.
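The effect of computing the median and MAD within each batch can be illustrated with a small base R sketch. The helper mad_outlier_lower() and the toy library sizes are hypothetical and written only for this illustration; this is not the isOutlier() implementation.

```r
# Hypothetical helper: flag values more than `nmads` MADs below the median,
# with the median and MAD computed separately within each level of `batch`.
mad_outlier_lower <- function(x, nmads = 3, batch = rep("all", length(x))) {
    out <- logical(length(x))
    for (b in unique(batch)) {
        idx <- batch == b
        med <- median(x[idx])
        dev <- mad(x[idx])  # stats::mad() applies the 1.4826 scaling by default
        out[idx] <- x[idx] < med - nmads * dev
    }
    out
}

# Toy log10 library sizes (made up): plate B was sequenced more deeply than
# plate A, and the last cell on each plate has unusually low coverage.
log.lib <- c(5.0, 5.1, 4.9, 5.05, 4.95, 3.0,
             6.0, 6.1, 5.9, 6.05, 5.95, 5.0)
plate <- rep(c("A", "B"), each = 6)

pooled   <- mad_outlier_lower(log.lib)                 # one median/MAD overall
by.plate <- mad_outlier_lower(log.lib, batch = plate)  # plate-specific estimates
which(pooled)
which(by.plate)
```

In this toy input, the difference in depth between plates inflates the pooled MAD so that neither low-coverage cell is flagged, whereas the plate-specific call flags the last cell on each plate.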

2.2 Checking for discarded cell types

We can diagnose loss of distinct cell types during QC by looking for differences in gene expression between the discarded and retained cells (Figure 1). If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 1, indicating that the QC step did not inadvertently filter out a cell type in the 416B dataset.

library(SingleCellExperiment)
sce.full.416b <- readRDS("416B_preQC.rds")

library(scater)
suppressWarnings({
    lost <- calcAverage(counts(sce.full.416b)[,!sce.full.416b$PassQC])
    kept <- calcAverage(counts(sce.full.416b)[,sce.full.416b$PassQC])
})
logfc <- log2((lost+1)/(kept+1))
head(sort(logfc, decreasing=TRUE), 20)

##               Retn ENSMUSG00000102352 ENSMUSG00000104647            Klhdc8b 
##           3.243140           3.106658           3.090012           2.984891 
## ENSMUSG00000075015              Nmur1      1700029I15Rik             Gm4952 
##           2.840926           2.788588           2.721717           2.692160 
##               Fut9 ENSMUSG00000107955      1700101I11Rik ENSMUSG00000102379 
##           2.495632           2.468569           2.399518           2.390858 
## ENSMUSG00000075014               Jph3            Tgfb1i1            Gramd1c 
##           2.356524           2.322170           2.316763           2.299913 
## ENSMUSG00000106341 ENSMUSG00000106680             Nat8f1 ENSMUSG00000092418 
##           2.162966           2.126438           2.102481           2.056228

plot(lost, kept, xlab="Average count (discarded)", 
    ylab="Average count (retained)", log="xy", pch=16)
is.spike <- isSpike(sce.full.416b)
points(lost[is.spike], kept[is.spike], col="red", pch=16)
is.mito <- rowData(sce.full.416b)$is_feature_control_Mt
points(lost[is.mito], kept[is.mito], col="dodgerblue", pch=16)

Figure 1: Average counts across all discarded and retained cells in the 416B dataset
Each point represents a gene, with spike-in and mitochondrial transcripts in red and blue respectively.

By comparison, a more stringent filter in the PBMC dataset would remove the previously identified platelet population (see the previous workflow). This manifests in Figure 2 as a shift to the bottom-right for a number of genes, including PF4 and PPBP.

sce.pbmc <- readRDS("pbmc_data.rds")
wrong.keep <- sce.pbmc$total_counts >= 1000
suppressWarnings({
    lost <- calcAverage(counts(sce.pbmc)[,!wrong.keep])
    kept <- calcAverage(counts(sce.pbmc)[,wrong.keep])
})
logfc <- log2((lost+1)/(kept+1))
head(sort(logfc, decreasing=TRUE), 20)

##      PPBP       PF4 HIST1H2AC     GNG11      SDPR     TUBB1       CLU 
## 1.8542024 1.7744368 1.5009613 1.3586439 1.2340323 1.0523433 0.9190291 
##     ACRBP     RGS18      NRGN  MAP3K7CL       MMD     SPARC    PGRMC1 
## 0.8077801 0.7879625 0.7757343 0.6454622 0.5226932 0.5144267 0.4893572 
##     CMTM5   TSC22D1    HRAT92       GP9    ITGA2B      CTSA 
## 0.3733080 0.3672441 0.3600693 0.3568626 0.3465166 0.3306618

plot(lost, kept, xlab="Average count (discarded)", 
    ylab="Average count (retained)", log="xy", pch=16)
platelet <- c("PF4", "PPBP", "SDPR")
points(lost[platelet], kept[platelet], col="orange", pch=16)

Figure 2: Average counts across all discarded and retained cells in the PBMC dataset
Each point represents a gene, with platelet-related genes highlighted in orange.

If cell types are being incorrectly discarded, the solution is to relax the QC filters. This can be achieved by either increasing nmads= in isOutlier() or by dropping the filter altogether. However, keep in mind that low-quality cells will often increase the apparent heterogeneity, usually because of increased sampling noise at low sequencing coverage. This can interfere with variance modelling and PCA, e.g., where the first few PCs separate cells according to quality rather than any biology. At worst, low-quality cells can form their own cluster, which requires additional care during interpretation of the results. These considerations motivate the use of a stricter filter (at least on the first pass) in our workflows.

2.3 Alternative approaches to quality control

2.3.1 Using fixed thresholds

One alternative strategy is to set pre-defined thresholds on each QC metric. For example, we might remove all cells with library sizes below 100000 and numbers of expressed genes below 4000. This generally requires considerable experience to determine appropriate thresholds for each experimental protocol and biological system. For example, thresholds for read count-based data are simply not applicable for UMI-based data, and vice versa. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of RNA capture and sequencing.
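A fixed-threshold filter reduces to a pair of logical comparisons, as in the base R sketch below. The QC values are made up, and the sketch assumes that a cell is discarded if it fails either threshold.

```r
# Toy QC metrics for five cells (values are made up for illustration).
qc <- data.frame(
    total_counts   = c(250000, 80000, 400000, 120000, 95000),
    total_features = c(6000, 4500, 3500, 5200, 3000)
)

# Thresholds quoted in the text; suitable values depend on the protocol.
min.libsize  <- 100000
min.features <- 4000

# Assumption: a cell is discarded if it fails either threshold.
discard <- qc$total_counts < min.libsize | qc$total_features < min.features
data.frame(qc, discard)
```

The same thresholds would be meaningless for UMI counts, which is why the values need to be revisited for each protocol.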

2.3.2 Using PCA-based outliers

Another strategy is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features and the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. This is demonstrated below on a brain cell dataset from Tasic et al. (2016), using functions from the scater package (McCarthy et al. 2017).

# Obtaining the dataset.
library(scRNAseq)
data(allen)

# Setting up the data.
sce.allen <- as(allen, "SingleCellExperiment")
assayNames(sce.allen) <- "counts"
isSpike(sce.allen, "ERCC") <- grep("ERCC", rownames(sce.allen))

# Computing the QC metrics and running PCA.
library(scater)
sce.allen <- calculateQCMetrics(sce.allen)
sce.allen <- runPCA(sce.allen, use_coldata=TRUE, detect_outliers=TRUE)
table(sce.allen$outlier)

## 
## FALSE  TRUE 
##   374     5

Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.
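The general idea of detecting outliers on a PCA of the QC metrics can be sketched in base R. This is a deliberately simplified stand-in, not the method used by runPCA(): the toy metrics, the MAD-scaled distance and the cutoff of 5 are all assumptions made for illustration.

```r
# Toy QC metrics (made up): 19 typical cells plus one aberrant cell (row 20)
# with low coverage, few detected features and a high mitochondrial proportion.
n <- 19
qc <- data.frame(
    log10_total_counts   = c(5 + 0.02 * (seq_len(n) - 10), 3.5),
    log10_total_features = c(3.8 + 0.01 * ((seq_len(n) * 7) %% n - 9), 2.9),
    pct_counts_Mt        = c(5 + 0.2 * ((seq_len(n) * 3) %% n - 9), 40)
)

# PCA on the centred and scaled metrics, so that no single metric dominates.
pcs <- prcomp(qc, center = TRUE, scale. = TRUE)$x

# Robust distance of each cell from the bulk: deviations from the median on
# each PC, expressed in units of that PC's MAD and combined across PCs.
rz <- apply(pcs, 2, function(v) (v - median(v)) / mad(v))
dist <- sqrt(rowSums(rz^2))

# Arbitrary cutoff chosen for this sketch only.
outlier <- dist > 5
which.max(dist)
```

The aberrant cell receives the largest distance because it is extreme on all three metrics at once. This also shows the loss of interpretability: the distance alone does not say which metric was responsible.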

For completeness, we note that outliers can also be identified from PCA on the gene expression profiles, rather than QC metrics. We consider this to be a risky strategy as it can remove high-quality cells in rare populations.

3 Normalizing based on spike-in coverage

3.1 Motivation

Scaling normalization strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not differentially expressed (DE) between samples, as previously described. The second class uses the fact that the same amount of spike-in RNA was added to each cell (Lun et al. 2017). Differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. Scaling normalization is then applied to equalize spike-in coverage across cells.

The choice between these two normalization strategies depends on the biology of the cells and the features of interest. If the majority of genes are expected to be DE and there is no reliable house-keeping set, spike-in normalization may be the only option for removing cell-specific biases. Spike-in normalization should also be used if differences in the total RNA content of individual cells are of interest. In any particular cell, an increase in the amount of endogenous RNA will not increase spike-in coverage (with or without library quantification). Thus, the former will not be represented as part of the bias in the latter, which means that the effects of total RNA content on expression will not be removed upon scaling. With non-DE normalization, an increase in RNA content will systematically increase the expression of all genes in the non-DE subset, such that it will be treated as bias and removed.
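The difference between the two strategies can be made concrete with a toy example in base R. The counts are made up, library size scaling is used as a simple stand-in for non-DE normalization, and the two cells are assumed to have identical spike-in counts.

```r
# Two toy cells. Cell B contains twice as much endogenous RNA as cell A;
# both are assumed to yield the same spike-in counts.
counts <- rbind(
    gene1  = c(A = 100, B = 200),
    gene2  = c(A = 300, B = 600),
    spike1 = c(A = 50,  B = 50),
    spike2 = c(A = 150, B = 150)
)
is.spike <- grepl("^spike", rownames(counts))

# Spike-in size factors: total spike-in count per cell, scaled to a mean of 1.
spike.total <- colSums(counts[is.spike, ])
spike.sf <- spike.total / mean(spike.total)

# Library size factors over the endogenous genes (stand-in for non-DE methods).
lib.total <- colSums(counts[!is.spike, ])
lib.sf <- lib.total / mean(lib.total)

# Divide each cell (column) of the endogenous counts by its size factor.
norm.spike <- t(t(counts[!is.spike, ]) / spike.sf)
norm.lib   <- t(t(counts[!is.spike, ]) / lib.sf)

norm.spike["gene1", "B"] / norm.spike["gene1", "A"]  # two-fold difference kept
norm.lib["gene1", "B"] / norm.lib["gene1", "A"]      # difference removed
```

With the spike-in factors, the two-fold difference in RNA content survives normalization. With the library size factors, it is treated as bias and scaled away.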

3.2 Setting up the data

We demonstrate the use of spike-in normalization on a dataset involving different cell types – namely, mouse embryonic stem cells (mESCs) and mouse embryonic fibroblasts (MEFs) (Islam et al. 2011). The count table was obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file using the accession number GSE29087. We load the counts into R, using colClasses to speed up read.table by pre-defining the type of each column. We also specify the rows corresponding to spike-in transcripts.

library(SingleCellExperiment)
counts <- read.table("GSE29087_L139_expression_tab.txt.gz", 
    colClasses=c(list("character", NULL, NULL, NULL, NULL, NULL, NULL), 
    rep("integer", 96)), skip=6, sep='\t', row.names=1)

is.spike <- grep("SPIKE", rownames(counts)) 
sce.islam <- SingleCellExperiment(list(counts=as.matrix(counts)))
isSpike(sce.islam, "spike") <- is.spike
dim(sce.islam)

## [1] 22936    96

We perform some quality control to remove low-quality cells using the calculateQCMetrics function. Outliers are identified within each cell type to avoid issues with systematic differences in the metrics between cell types. The negative control wells do not contain any cells and are useful for quality control (as they should manifest as outliers for the various metrics), but need to be removed prior to downstream analysis.

library(scater)
sce.islam <- calculateQCMetrics(sce.islam)
sce.islam$grouping <- rep(c("mESC", "MEF", "Neg"), c(48, 44, 4))

libsize.drop <- isOutlier(sce.islam$total_counts, nmads=3, type="lower", 
    log=TRUE, batch=sce.islam$grouping)
feature.drop <- isOutlier(sce.islam$total_features, nmads=3, type="lower", 
    log=TRUE, batch=sce.islam$grouping)
spike.drop <- isOutlier(sce.islam$pct_counts_spike, nmads=3, type="higher", 
    batch=sce.islam$grouping)
    
sce.islam <- sce.islam[,!(libsize.drop | feature.drop | 
    spike.drop | sce.islam$grouping=="Neg")]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=ncol(sce.islam))

##   ByLibSize ByFeature BySpike Remaining
## 1         4         6      12        77

3.3 Calculating spike-in size factors

We apply the computeSpikeFactors method to estimate size factors for all cells. This method computes the total count over all spike-in transcripts in each cell, and calculates size factors to equalize the total spike-in count across cells. Here, we set general.use=TRUE as we intend to apply the spike-in factors to all counts.

library(scran)
sce.islam <- computeSpikeFactors(sce.islam, general.use=TRUE)

Running normalize will use the spike-in-based size factors to compute normalized log-expression values. Unlike the previous analyses, we do not have to define separate size factors for the spike-in transcripts. This is because the relevant factors are already being used for all genes and spike-in transcripts when general.use=TRUE. (The exception is if the experiment uses multiple spike-in sets that behave differently and need to be normalized separately.)

sce.islam <- normalize(sce.islam)
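Conceptually, the normalized log-expression values are obtained by dividing each cell's counts by its size factor and log-transforming. The base R sketch below uses made-up counts and assumes a pseudo-count of 1 and size factors centred at unity; it is not a substitute for normalize().

```r
# Toy counts for two genes (rows) in three cells (columns).
counts <- rbind(geneA = c(10, 40, 0), geneB = c(3, 12, 6))

# Made-up size factors, centred so that their mean is 1.
sf <- c(0.5, 2, 0.5)
sf <- sf / mean(sf)

# Scale each cell (column) by its size factor, then take log2 after adding
# a pseudo-count of 1 (assumed here) to avoid undefined values at zero.
logcounts <- log2(t(t(counts) / sf) + 1)
logcounts
```

Cells 1 and 2 differ four-fold in their raw counts for geneA but end up with the same normalized value, because the size factors attribute that difference to cell-specific bias.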

For comparison, we also compute the deconvolution size factors (Lun, Bach, and Marioni 2016) and plot them against the spike-in factors. We observe a negative correlation between the two sets of values (Figure 3). This is because MEFs contain more endogenous RNA, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the deconvolution size factors). If the spike-in size factors were applied to the counts, the expression values in MEFs would be scaled up while expression in mESCs would be scaled down. However, the opposite would occur if deconvolution size factors were used.

colours <- c(mESC="red", MEF="grey")
deconv.sf <- computeSumFactors(sce.islam, sf.out=TRUE, cluster=sce.islam$grouping)
plot(sizeFactors(sce.islam), deconv.sf, col=colours[sce.islam$grouping], pch=16, 
    log="xy", xlab="Size factor (spike-in)", ylab="Size factor (deconvolution)")
legend("bottomleft", col=colours, legend=names(colours), pch=16)

Figure 3: Size factors from spike-in normalization, plotted against the size factors from deconvolution for all cells in the mESC/MEF dataset
Axes are shown on a log-scale, and cells are coloured according to their identity. Deconvolution size factors were computed with small pool sizes owing to the low number of cells of each type.

Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In the HSC and brain analyses, variability in total RNA across the population was treated as noise and removed by non-DE normalization. This may not always be appropriate if total RNA is associated with a biological difference of interest. For example, Islam et al. (2011) observe a 5-fold difference in total RNA between mESCs and MEFs. Similarly, the total RNA in a cell changes across phases of the cell cycle (Buettner et al. 2015). Spike-in normalization will preserve these differences in total RNA content such that the corresponding biological groups can be easily resolved in downstream analyses.

Comments from Aaron:

- We only use genes with average counts greater than 1 (as specified in min.mean) to compute the deconvolution size factors. This avoids problems with discreteness as mentioned in our previous uses of computeSumFactors.
- Setting sf.out=TRUE will directly return the size factors, rather than a SingleCellExperiment object containing those factors. This is more convenient when only the size factors are required for further analysis.

4 Detecting highly variable genes

4.1 Setting up the data

Highly variable genes (HVGs) are defined as genes with biological components that are significantly greater than zero. These genes are interesting as they drive differences in the expression profiles between cells, and should be prioritized for further investigation. Formal detection of HVGs allows us to avoid genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation. This adds another level of statistical rigour to our previous analyses, in which we only modelled the technical component.

To demonstrate, we use data from haematopoietic stem cells (HSCs) (Wilson et al. 2015), generated using the Smart-seq2 protocol (Picelli et al. 2014) with ERCC spike-ins. Counts were obtained from NCBI GEO as a supplementary file using the accession number GSE61533. Our first task is to load the count matrix into memory. In this case, some work is required to retrieve the data from the Gzip-compressed Excel format.

library(R.utils)
gunzip("GSE61533_HTSEQ_count_results.xls.gz", remove=FALSE, overwrite=TRUE)
library(gdata)
all.counts <- read.xls('GSE61533_HTSEQ_count_results.xls', sheet=1, header=TRUE)
rownames(all.counts) <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])

We store the results in a SingleCellExperiment object and identify the rows corresponding to the spike-ins based on the row names.

sce.hsc <- SingleCellExperiment(list(counts=all.counts))
dim(sce.hsc)

## [1] 38498    96

is.spike <- grepl("^ERCC", rownames(sce.hsc))
isSpike(sce.hsc, "ERCC") <- is.spike
summary(is.spike)

##    Mode   FALSE    TRUE 
## logical   38406      92

For each cell, we calculate quality control metrics using the calculateQCMetrics function as previously described. We filter out HSCs that are outliers for any metric, under the assumption that these represent low-quality libraries.

sce.hsc <- calculateQCMetrics(sce.hsc)
libsize.drop <- isOutlier(sce.hsc$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce.hsc$total_features, nmads=3, type="lower", log=TRUE)
spike.drop <- isOutlier(sce.hsc$pct_counts_ERCC, nmads=3, type="higher")
sce.hsc <- sce.hsc[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=ncol(sce.hsc))

##   ByLibSize ByFeature BySpike Remaining
## 1         2         2       3        92

We remove genes that are not expressed in any cell to reduce computational work in downstream steps.

to.keep <- nexprs(sce.hsc, byrow=TRUE) > 0
sce.hsc <- sce.hsc[to.keep,]
summary(to.keep)

##    Mode   FALSE    TRUE 
## logical   17229   21269

We apply the deconvolution method to compute size factors for the endogenous genes (Lun, Bach, and Marioni 2016). Separate size factors for the spike-in transcripts are also calculated, as previously discussed. We then calculate log-transformed normalized expression values for further use.

sce.hsc <- computeSumFactors(sce.hsc)
summary(sizeFactors(sce.hsc))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4065  0.8116  0.9549  1.0000  1.1581  2.0055

sce.hsc <- computeSpikeFactors(sce.hsc, type="ERCC", general.use=FALSE)
summary(sizeFactors(sce.hsc, "ERCC"))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2562  0.6198  0.8623  1.0000  1.2122  3.0289

sce.hsc <- normalize(sce.hsc)

4.2 Testing for significantly positive biological components

We fit a mean-variance trend to the spike-in transcripts to quantify the technical component of the variance, as previously described. The biological component for each gene is defined as the difference between its total variance and the fitted value of the trend (Figure 4).

var.fit <- trendVar(sce.hsc, parametric=TRUE, loess.args=list(span=0.3))
var.out <- decomposeVar(sce.hsc, var.fit)
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
cur.spike <- isSpike(sce.hsc)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col="red", pch=16)

Figure 4: Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression
The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).
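The decomposition itself is a subtraction once the trend is available. The following base R sketch uses made-up means and variances and a plain loess fit, which is an assumption for illustration and simpler than the parametric fit requested from trendVar above.

```r
# Toy spike-in statistics (made up): technical variance falls with abundance.
spike.mean <- seq(1, 10, length.out = 20)
spike.var  <- 4 * exp(-0.3 * spike.mean) + 0.1

# Fit a smooth mean-variance trend to the spike-ins only.
fit <- loess(spike.var ~ spike.mean, span = 0.5)
trend <- function(x) predict(fit, data.frame(spike.mean = x))

# Toy endogenous genes: mean and total variance of log-expression.
gene.mean  <- c(g1 = 2, g2 = 5, g3 = 8)
gene.total <- c(g1 = 2.3, g2 = 3.0, g3 = 0.45)

# Technical component is the fitted trend at each gene's mean;
# biological component is whatever total variance is left over.
gene.tech <- trend(gene.mean)
gene.bio  <- gene.total - gene.tech
round(gene.bio, 2)
```

Only g2 sits well above the trend in this toy input, so it is the only gene with a clearly positive biological component. Genes lying on or slightly below the trend yield components near zero or negative.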

We define HVGs as those genes that have a biological component that is significantly greater than zero. We use a false discovery rate (FDR) of 5% after correcting for multiple testing with the Benjamini-Hochberg method.

hvg.out <- var.out[which(var.out$FDR <= 0.05),]
nrow(hvg.out)

## [1] 508

We rank the results to focus on genes with larger biological components. This highlights an interesting aspect of the underlying hypothesis test, which is based on the ratio of the total variance to the expected technical variance. Ranking based on p-value tends to prioritize HVGs that are more likely to be true positives but, at the same time, less likely to be interesting. This is because the ratio can be very large for HVGs that have very low total variance and do not contribute much to the cell-cell heterogeneity.
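The difference between the two rankings can be seen in a small base R sketch. It assumes that the variance ratio, scaled by the residual degrees of freedom, is compared to a chi-squared distribution. This is one way such a test could be set up and is not taken from the scran source; the variances are made up.

```r
# Residual degrees of freedom for 92 cells with no blocking factors.
n.cells <- 92
df <- n.cells - 1

# Two toy genes: one with low total variance but a large ratio to the
# technical variance, one with high total variance but a smaller ratio.
total <- c(lowvar = 0.30, highvar = 12)
tech  <- c(lowvar = 0.05, highvar = 6)

# Assumed test: (total / tech) * df against a chi-squared distribution.
ratio <- total / tech
p <- pchisq(ratio * df, df = df, lower.tail = FALSE)

# Benjamini-Hochberg adjustment, as used for the FDR threshold above.
fdr <- p.adjust(p, method = "BH")

bio <- total - tech
data.frame(ratio, bio, p, fdr)
```

The low-variance gene has the smaller p-value because its ratio is larger, yet its biological component is far smaller than that of the high-variance gene. Ranking by the biological component puts the gene with the greater contribution to heterogeneity first.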

hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),] 
write.table(file="hsc_hvg.tsv", hvg.out, sep="\t", quote=FALSE, col.names=NA)
head(hvg.out)

## DataFrame with 6 rows and 6 columns
##                      mean            total              bio             tech
##                 <numeric>        <numeric>        <numeric>        <numeric>
## Fos      6.46169240839893 19.5686550413897  12.811243808293 6.75741123309673
## Dusp1    6.82129642811909 15.6268319695528  10.100913551922 5.52591841763087
## Rgs1     5.31215019519132 20.3108473714393  10.008463855624 10.3023835158153
## Ppp1r15a 6.66626841008152 14.5289776738074 8.47989805102701 6.04907962278038
## Ly6a     8.40441263707119 10.0751654284422 8.07599238035247 1.99917304808974
## Egr1     6.71450506741529 13.8497677999492  7.9654406350664 5.88432716488281
##                       p.value                  FDR
##                     <numeric>            <numeric>
## Fos       1.0802113713091e-18 7.39108496672172e-16
## Dusp1    8.36978881636986e-18 4.79815109686544e-15
## Rgs1     9.58553382470494e-08 1.11850140421996e-05
## Ppp1r15a 1.68662371654905e-12 4.76999675356292e-10
## Ly6a     1.97564673295629e-50 5.98649183610514e-47
## Egr1     6.21292730433103e-12 1.60710245185568e-09

We check the distribution of expression values for the genes with the largest biological components. This ensures that the variance estimate is not driven by one or two outlier cells (Figure 5).

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
plotExpression(sce.hsc, features=rownames(hvg.out)[1:10]) + fontsize

Figure 5: Violin plots of normalized log-expression values for the top 10 genes with the largest biological components in the HSC dataset
Each point represents the log-expression value in a single cell.

There are many other strategies for defining HVGs, based on a variety of metrics:

- the coefficient of variation (Brennecke et al. 2013; Kołodziejczyk et al. 2015; Kim et al. 2015)
- the dispersion parameter in the negative binomial distribution (McCarthy, Chen, and Smyth 2012)
- a proportion of total variability (Vallejos, Marioni, and Richardson 2015)

Some of these methods are available in scran – for example, see DM or technicalCV2 for calculations based on the coefficient of variation. Here, we use the variance of the log-expression values because the log-transformation protects against genes with strong expression in only one or two cells. This ensures that the set of top HVGs is not dominated by genes with (mostly uninteresting) outlier expression patterns.
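The protective effect of the log-transformation can be checked on two toy genes in base R. The expression values are made up, and cv2() and logvar() are hypothetical helpers that compute the raw metrics only; they are not the DM or technicalCV2 calculations.

```r
# Normalized expression of two toy genes across 10 cells.
outlier.gene <- c(rep(50, 9), 5000)         # strong expression in a single cell
bimodal.gene <- c(rep(10, 5), rep(500, 5))  # two subpopulations of equal size

# Hypothetical helpers for the two metrics.
cv2    <- function(x) var(x) / mean(x)^2  # squared coefficient of variation
logvar <- function(x) var(log2(x + 1))    # variance of log-expression

c(outlier = cv2(outlier.gene), bimodal = cv2(bimodal.gene))
c(outlier = logvar(outlier.gene), bimodal = logvar(bimodal.gene))
```

On the raw scale, the single outlier cell gives the first gene the larger squared coefficient of variation. After the log-transformation the ordering reverses, and the gene that separates two subpopulations has the larger variance.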

5 Advanced modelling of the technical noise

5.1 Trend fitting when spike-ins are unavailable

If spike-in RNA has not been added in appropriate quantities (or at all), an alternative approach is to fit the trend to the variance estimates of the endogenous genes. This is done using the use.spikes=FALSE setting in trendVar, as shown below for the HSC dataset.

var.fit.nospike <- trendVar(sce.hsc, parametric=TRUE, 
    use.spikes=FALSE, loess.args=list(span=0.2))
var.out.nospike <- decomposeVar(sce.hsc, var.fit.nospike)

The simplest interpretation of the results assumes that the majority of genes are not variably expressed. This means that the technical component dominates the total variance for most genes, such that the fitted trend can be treated as an estimate of the technical component. In Figure 6, the trend passes through or close to most of the spike-in variances, indicating that our assumption is valid.

plot(var.out.nospike$mean, var.out.nospike$total, pch=16, cex=0.6, 
    xlab="Mean log-expression", ylab="Variance of log-expression")
curve(var.fit.nospike$trend(x), col="dodgerblue", lwd=2, add=TRUE)
points(var.out.nospike$mean[cur.spike], var.out.nospike$total[cur.spike], col="red", pch=16)

Figure 6: Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression
The blue line represents the mean-dependent trend fitted to the variances of the endogenous genes (black), with spike-in transcripts shown in red.

If our assumption does not hold, the output of decomposeVar is more difficult to interpret. The fitted value of the trend can no longer be generally interpreted as the technical component, as it contains some biological variation as well. Instead, recall that the biological component reported by decomposeVar represents the residual for each gene over the majority of genes with the same abundance. One could assume that the variabilities of most genes are driven by constitutive “house-keeping” processes, which are biological in origin but generally uninteresting. Any gene with an increase in its variance is relatively highly variable and can be prioritized for further study.

5.2 Blocking on uninteresting factors of variation

5.2.1 Using the `block=` argument

Our previous analysis of the 416B dataset specified block= in trendVar() to ensure that systematic differences between plates do not inflate the variance. This involves estimating the mean and variance of the log-expression separately in each plate, followed by fitting a single trend to the plate-specific means and variances of all spike-in transcripts. In doing so, we implicitly assume that the trend is the same between plates, which is fairly reasonable for this dataset (Figure 7).

# Loading the saved object.
sce.416B <- readRDS("416B_data.rds") 

# Repeating the trendVar() call.
var.fit <- trendVar(sce.416B, parametric=TRUE, block=sce.416B$Plate,
    loess.args=list(span=0.3))

matplot(var.fit$means, var.fit$vars, col=c("darkorange", "forestgreen"))

Figure 7: Plate-specific variance estimates for all spike-in transcripts in the 416B dataset, plotted against the plate-specific means
Each point represents a spike-in transcript, numbered by the plate from which the values were estimated.
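The reason for estimating the mean and variance within each plate can be seen with a single toy gene in base R. The log-expression values are made up, and the plate effect is assumed to be a constant shift.

```r
# Log-expression of one toy gene in six cells, three per plate.
# Plate P2 is shifted upwards by a constant plate effect.
logexpr <- c(5.0, 5.2, 4.8, 7.0, 7.2, 6.8)
plate <- factor(rep(c("P1", "P2"), each = 3))

# Ignoring the plate folds the between-plate shift into the variance.
var.pooled <- var(logexpr)

# Blocking: the mean and variance are estimated within each plate.
mean.by.plate <- tapply(logexpr, plate, mean)
var.by.plate  <- tapply(logexpr, plate, var)

var.pooled
var.by.plate
```

The within-plate variances are identical and small, while the pooled variance is many times larger because it also captures the shift between plates.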

The use of block= also assumes that the endogenous genes have comparable abundances between the two plates. This is most easily examined by comparing the distribution of size factors for the endogenous genes across the plates. Similar size factor distributions in Figure 8 indicate that the coverage of most genes is not systematically different between plates.

tmp.416B <- sce.416B
tmp.416B$log_size_factor <- log(sizeFactors(sce.416B))
plotColData(tmp.416B, x="Plate", y="log_size_factor")

Figure 8: Plate-specific distribution of the size factors for endogenous genes

However, these assumptions may not hold for other datasets. For example, if more spike-in RNA is added in a particular batch, the technical noise (and thus the trend) will decrease due to increased coverage. The mean log-expression of the spike-ins would also shift relative to the endogenous genes in that batch. The use of a single trend would subsequently be inappropriate, resulting in inaccurate estimates of the technical component for each gene.

5.2.2 Fitting batch-specific trends

For datasets containing multiple batches, an alternative strategy is to perform trend fitting and variance decomposition separately for each batch. This accommodates differences in the mean-variance trends between batches, especially if a different amount of spike-in RNA was added to the cells in each batch. We demonstrate this approach by treating each plate in the 416B dataset as a different batch, using the multiBlockVar() function. This yields plate-specific estimates of the biological and technical components for each gene.

sce.416B.2 <- normalize(sce.416B, size_factor_grouping=sce.416B$Plate)
comb.out <- multiBlockVar(sce.416B.2, block=sce.416B.2$Plate,
    trend.args=list(parametric=TRUE, loess.args=list(span=0.4)))

Statistics are combined across multiple batches using the combineVar() function within multiBlockVar(). This function computes a weighted average across batches for the means and variances, and applies Fisher’s method for combining the p-values. These results can be used in downstream functions such as denoisePCA, or for detecting highly variable genes (see below).
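Both combining steps are short calculations, sketched here in base R for a single gene. The per-batch values are made up, and weighting each batch by its residual degrees of freedom is an assumption of this sketch rather than a statement about combineVar().

```r
# Per-batch statistics for one toy gene (values are made up).
bio    <- c(plate1 = 0.8, plate2 = 1.2)
pval   <- c(plate1 = 0.01, plate2 = 0.04)
ncells <- c(plate1 = 90, plate2 = 30)

# Weighted average of the biological components. Weights are the residual
# degrees of freedom in each batch (an assumption made for this sketch).
w <- ncells - 1
bio.combined <- sum(w * bio) / sum(w)

# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution with
# 2k degrees of freedom under the null, for k independent p-values.
fisher.stat <- -2 * sum(log(pval))
p.combined <- pchisq(fisher.stat, df = 2 * length(pval), lower.tail = FALSE)

c(bio = bio.combined, p.value = p.combined)
```

The combined p-value is smaller than either per-batch p-value, reflecting consistent evidence in both plates. Fisher's method assumes that the batches are independent.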

head(comb.out[,1:6])

## DataFrame with 6 rows and 6 columns
##                                   mean               total
##                              <numeric>           <numeric>
## ENSMUSG00000103377 0.00815549634688983  0.0121709791028416
## ENSMUSG00000103147  0.0339595735047905  0.0694766348373121
## ENSMUSG00000103161 0.00510126095695608 0.00476246997593087
## ENSMUSG00000102331  0.0185339126652833  0.0326551273294061
## ENSMUSG00000102948  0.0584131375963464  0.0864398505564873
## Rp1                 0.0969636806440467   0.454546479553771
##                                    bio               tech            p.value
##                              <numeric>          <numeric>          <numeric>
## ENSMUSG00000103377 -0.0266092921411351 0.0387802712439767                  1
## ENSMUSG00000103147  -0.075037896786809  0.144514531624121  0.999999999977276
## ENSMUSG00000103161 -0.0169458821913668 0.0217083521672976                  1
## ENSMUSG00000102331 -0.0517597559595797 0.0844148832889857   0.99999999999993
## ENSMUSG00000102948  -0.166023531759371  0.252463382315858                  1
## Rp1                 0.0191589020570154  0.435387577496755 0.0990115610294779
##                                  FDR
##                            <numeric>
## ENSMUSG00000103377                 1
## ENSMUSG00000103147                 1
## ENSMUSG00000103161                 1
## ENSMUSG00000102331                 1
## ENSMUSG00000102948                 1
## Rp1                0.353997588088405
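
The combination step described above can be illustrated with a minimal base R sketch. This is not the combineVar() implementation: the per-batch statistics, the cell counts and the use of the number of cells as the weight are all assumptions made for illustration.

```r
# Hypothetical per-batch statistics for three genes (rows) in two batches (columns).
means <- cbind(batch1=c(1.0, 2.0, 0.5), batch2=c(1.4, 1.0, 0.7))
p.mat <- cbind(batch1=c(0.01, 0.50, 0.20), batch2=c(0.03, 0.80, 0.04))
ncells <- c(batch1=96, batch2=96)  # assumed weights: number of cells per batch

# Weighted average of the per-batch means for each gene.
comb.mean <- drop(means %*% ncells) / sum(ncells)

# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution
# with 2k degrees of freedom under the null, for k independent p-values.
fisher.stat <- -2 * rowSums(log(p.mat))
comb.p <- pchisq(fisher.stat, df=2 * ncol(p.mat), lower.tail=FALSE)
```

Fisher's method assumes that the p-values being combined are independent, which is plausible here because each batch contains a different set of cells.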

We visualize the quality of the batch-specific trend fits by extracting the relevant statistics from comb.out (Figure 9).

par(mfrow=c(1,2))
is.spike <- isSpike(sce.416B.2)
for (plate in levels(sce.416B.2$Plate)) {
    cur.out <- comb.out$per.block[[plate]]
    plot(cur.out$mean, cur.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
        ylab="Variance of log-expression", main=plate)
    curve(metadata(cur.out)$trend(x), col="dodgerblue", lwd=2, add=TRUE)
    points(cur.out$mean[is.spike], cur.out$total[is.spike], col="red", pch=16)
}

Figure 9: Variance of normalized log-expression values for each gene in each plate of the 416B dataset, plotted against the mean log-expression
The blue line represents the mean-dependent trend fitted to the variances of the spike-in transcripts (red).

By fitting separate trends, we avoid the need to assume that a single trend is present across batches. However, this also reduces the precision of each trend fit, as less information is available within each batch. We recommend using block= as the default unless there is clear evidence for differences in the trends between batches.

Comments from Aaron:

We run normalize() with size_factor_grouping= to centre the size factors within each level of the blocking factor. This adjusts the size factors across cells in each batch so that the mean is equal to 1, for both the spike-in and gene-based sets of size factors. Log-normalized expression values are then recalculated using these centred size factors. This procedure ensures that the average abundances of the spike-in transcripts are comparable to the endogenous genes, avoiding problems due to differences in the quantity of spike-in RNA between batches. Otherwise, if the globally-centred size factors were used, there would be a systematic difference in the scaling of spike-in transcripts compared to endogenous genes. The fitted trend would then be shifted along the x-axis and fail to accurately capture the technical component for each gene.
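
The centring described above can be sketched in base R as follows. The size factors and plate labels are hypothetical values chosen for illustration, and the code only mimics the effect of size_factor_grouping= rather than reproducing the normalize() internals.

```r
# Hypothetical size factors for five cells on two plates.
sf <- c(0.5, 1.5, 2.0, 4.0, 3.0)
plate <- factor(c("A", "A", "B", "B", "B"))

# Divide each size factor by the mean size factor of its own plate,
# so that the mean within every plate becomes 1.
centred <- sf / ave(sf, plate, FUN=mean)
plate.means <- tapply(centred, plate, mean)
```

Applying the same operation separately to the spike-in and gene-based size factors places both sets on a comparable scale within each plate.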

5.2.3 Using the design= argument

For completeness, it is worth mentioning the design= argument in trendVar(). This will estimate the residual variance from a linear model fitted to the log-normalized expression values for each gene. The linear model can include blocking factors for known unwanted factors of variation, ensuring that they do not inflate the variance estimate. The technical component for each gene is obtained at the average abundance across all cells.

lfit <- trendVar(sce.416B, design=model.matrix(~sce.416B$Plate))

We do not recommend using this approach for categorical blocking factors in one-way layouts. This is because it does not consider the mean of each blocking level, resulting in an inaccurate estimate of the technical component in the presence of a strong blocking effect. However, it is the only choice for dealing with real covariates or multiple blocking factors in an additive model.
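
To make the idea of a residual variance concrete, the sketch below fits a one-way linear model to simulated log-expression values for a single gene. The data, the plate labels and the size of the plate effect are all assumptions, and the code does not reproduce what trendVar() does internally.

```r
set.seed(1)

# Simulated log-expression for one gene in 20 cells across two plates,
# with a strong (hypothetical) plate effect of 3 log-units.
plate <- factor(rep(c("P1", "P2"), each=10))
expr <- rnorm(20) + ifelse(plate == "P2", 3, 0)

# Residual variance after fitting the plate effect.
design <- model.matrix(~plate)
fit <- lm.fit(design, expr)
resid.var <- sum(fit$residuals^2) / fit$df.residual

# Variance computed while ignoring the plate effect, for comparison.
naive.var <- var(expr)
```

The residual variance excludes the plate effect, whereas the naive variance is inflated by it. Note that the residual variance says nothing about the mean in each plate, which is the limitation discussed above.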

6 Identifying correlated gene pairs with Spearman’s rho

Another use for scRNA-seq data is to identify correlations between the expression profiles of different genes. This is quantified by computing Spearman’s rho, which accommodates non-linear relationships in the expression values. Non-zero correlations between pairs of genes provide evidence for their co-regulation. However, the noise in the data requires some statistical analysis to determine whether a correlation is significantly non-zero.

To demonstrate, we use the correlatePairs function to identify significant correlations between the various histocompatibility antigens in the HSC dataset. The significance of each correlation is determined using a permutation test. For each pair of genes, the null hypothesis is that the expression profiles of the two genes are independent. Shuffling the profiles and recalculating the correlation yields a null distribution that is used to obtain a p-value for each observed correlation value (Phipson and Smyth 2010).

set.seed(100)
var.cor <- correlatePairs(sce.hsc, subset.row=grep("^H2-", rownames(sce.hsc)))
head(var.cor)

## DataFrame with 6 rows and 6 columns
##         gene1       gene2               rho             p.value
##   <character> <character>         <numeric>           <numeric>
## 1      H2-Ab1      H2-Eb1 0.497202657090455  1.999998000002e-06
## 2       H2-Aa      H2-Ab1 0.488402200884669  1.999998000002e-06
## 3       H2-D1       H2-K1 0.421928702433611 2.5999974000026e-05
## 4       H2-Aa      H2-Eb1 0.409598816330934 4.1999958000042e-05
## 5      H2-Ab1     H2-DMb1 0.359262056316755 0.00045999954000046
## 6       H2-Q6       H2-Q7 0.344990213152906 0.00080999919000081
##                    FDR   limited
##              <numeric> <logical>
## 1 0.000434999565000435      TRUE
## 2 0.000434999565000435      TRUE
## 3  0.00376999623000377     FALSE
## 4  0.00456749543250457     FALSE
## 5     0.04001995998004     FALSE
## 6   0.0587249412750587     FALSE
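
The permutation test described above can be sketched for a single pair of genes in base R. This is a simplified illustration on simulated data, not the correlatePairs implementation; the variable names and the number of permutations are assumptions.

```r
set.seed(42)

# Simulated expression profiles for two genes across 50 cells.
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
obs <- cor(x, y, method="spearman")

# Null distribution of rho, obtained by shuffling one profile.
n.perm <- 1000
null <- replicate(n.perm, cor(x, sample(y), method="spearman"))

# Two-sided permutation p-value. Adding 1 to the numerator and denominator
# ensures that the p-value is never zero (Phipson and Smyth 2010).
n.extreme <- sum(abs(null) >= abs(obs))
p.value <- (n.extreme + 1) / (n.perm + 1)

# TRUE if the p-value sits at the lower bound set by the number of permutations.
limited <- n.extreme == 0
```

The smallest achievable p-value in this sketch is 1/(n.perm + 1), so more permutations are needed to resolve very strong correlations.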

Correction for multiple testing across many gene pairs is performed by controlling the FDR at 5%.

sig.cor <- var.cor$FDR <= 0.05
summary(sig.cor)

##    Mode   FALSE    TRUE 
## logical     430       5
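
For readers unfamiliar with FDR control, the sketch below applies the Benjamini-Hochberg adjustment from base R to a small vector of hypothetical p-values. The values are made up for illustration and are unrelated to the HSC results.

```r
# Hypothetical p-values for five gene pairs.
p <- c(0.001, 0.008, 0.039, 0.041, 0.2)

# Benjamini-Hochberg adjusted p-values.
fdr <- p.adjust(p, method="BH")

# Number of pairs retained when controlling the FDR at 5%.
n.sig <- sum(fdr <= 0.05)
```

Two of the pairs with raw p-values below 0.05 are no longer significant after adjustment, which shows how the correction becomes more severe as more pairs are tested.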

We can also compute correlations between specific pairs of genes, or between all pairs of genes drawn from two distinct sets. The example below computes the correlation between Fos and Jun, which dimerize to form the AP-1 transcription factor (Angel and Karin 1991).

correlatePairs(sce.hsc, subset.row=cbind("Fos", "Jun"))

## DataFrame with 1 row and 6 columns
##         gene1       gene2               rho            p.value
##   <character> <character>         <numeric>          <numeric>
## 1         Fos         Jun 0.469460413359432 1.999998000002e-06
##                  FDR   limited
##            <numeric> <logical>
## 1 1.999998000002e-06      TRUE

Examination of the expression profiles in Figure 10 confirms the presence of a modest correlation between these two genes.

plotExpression(sce.hsc, features="Fos", x="Jun")

Figure 10: Expression of Fos plotted against the expression of Jun for all cells in the HSC dataset

The use of correlatePairs is primarily intended to identify correlated gene pairs for validation studies. Obviously, non-zero correlations do not provide evidence for a direct regulatory interaction, let alone specify causality. To construct regulatory networks involving many genes, we suggest using dedicated packages such as WGCNA.

Comments from Aaron:

We suggest only computing correlations between a subset of genes of interest, known either a priori or empirically defined, e.g., as HVGs. Computing correlations across all genes will take too long; unnecessarily increase the severity of the multiple testing correction; and may prioritize strong but uninteresting correlations, e.g., between tightly co-regulated house-keeping genes.
The correlatePairs function can also return gene-centric output by setting per.gene=TRUE. This calculates a combined p-value (Simes 1986) for each gene that indicates whether it is significantly correlated to any other gene. From a statistical perspective, this is a more natural approach to correcting for multiple testing when genes, rather than pairs of genes, are of interest.
The limited field indicates whether the p-value was lower-bounded by the number of permutations. If this is TRUE for any non-significant gene at the chosen FDR threshold, consider increasing the number of permutations to improve power.
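
The combined p-value of Simes (1986) mentioned above can be written in a few lines of base R. The helper name simes() and the input p-values are hypothetical, and this sketch is not taken from the correlatePairs source.

```r
# Simes' combined p-value: the minimum of p_(i) * n / i over the sorted
# p-values p_(1) <= ... <= p_(n).
simes <- function(p) {
    n <- length(p)
    min(sort(p) * n / seq_len(n))
}

# Hypothetical p-values for the correlations between one gene and three others.
gene.p <- simes(c(0.04, 0.01, 0.5))
```

Here the combined p-value is driven by the smallest pairwise p-value, scaled up by the number of partners that were tested for that gene.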

7 Using parallel analysis to choose the number of PCs

An alternative strategy for choosing the number of PCs is to use the parallelPCA() function. This performs Horn’s parallel analysis (Horn 1965), which involves permuting the matrix and repeating the PCA to determine the variance explained by each PC under a random null model. Repeated permutations can be used to obtain “p-values” for each PC (Buja and Eyuboglu 1992). PCs are discarded if they do not explain significantly more variance than expected under the null.

This is demonstrated using the batch-corrected expression values from the 416B dataset. To focus on relevant features, we only use genes with positive biological components from the variability analysis above. This procedure relies on random permutations, so setting the random seed is necessary for reproducible results.

set.seed(1000)
npcs <- parallelPCA(sce.416B, assay.type="corrected", 
    subset.row=comb.out$bio > 0, value="n")
as.integer(npcs)

## [1] 19

Parallel analysis tends to yield more accurate estimates of the true rank of the matrix than denoisePCA(). It is also applicable to expression values that have been transformed such that the gene-wise variances are distorted (and thus denoisePCA() cannot be used). This means that parallel analysis can be used to choose the number of PCs after certain non-linear operations like batch correction (Haghverdi et al. 2018). However, it is obviously much slower than denoisePCA() due to the need for multiple permutations.
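
The logic of parallel analysis can be sketched in base R on a simulated matrix. This is an illustrative simplification and not the parallelPCA() implementation: the simulated two-factor structure, the number of permutations and the 0.1 threshold are all assumptions.

```r
set.seed(1000)

# Simulated matrix (cells in rows, genes in columns) with two underlying factors.
n.cells <- 100
n.genes <- 50
factors <- matrix(rnorm(n.cells * 2), ncol=2)
loadings <- matrix(rnorm(n.genes * 2), ncol=2)
mat <- 2 * factors %*% t(loadings) + matrix(rnorm(n.cells * n.genes), ncol=n.genes)

# Variance explained by each PC.
pc.var <- function(m) prcomp(m)$sdev^2
obs <- pc.var(mat)

# Null variances: permute each gene independently across cells to break
# the correlations between genes, then repeat the PCA.
n.iter <- 20
null <- replicate(n.iter, pc.var(apply(mat, 2, sample)))

# For each PC, the proportion of permutations in which the null variance
# is at least as large as the observed variance.
p.like <- rowMeans(null >= obs)

# Retain PCs up to (but not including) the first that exceeds the threshold.
keep <- p.like <= 0.1
npcs <- if (all(keep)) length(keep) else which(!keep)[1] - 1
```

Each permutation requires a full PCA, which is why this strategy is slower than one based on a fitted technical component.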

8 Blocking on the cell cycle phase

Cell cycle phase is usually uninteresting in studies focusing on other aspects of biology. However, the effects of cell cycle on the expression profile can mask other effects and interfere with the interpretation of the results. This cannot be avoided by simply removing cell cycle marker genes, as the cell cycle can affect a substantial number of other transcripts (Buettner et al. 2015). Rather, more sophisticated strategies are required, one of which is demonstrated below using data from a study of T Helper 2 (T_H2) cells (Mahata et al. 2014). Buettner et al. (2015) have already applied quality control and normalized the data, so we can use them directly as log-expression values (accessible as Supplementary Data 1 of https://dx.doi.org/10.1038/nbt.3102).

library(readxl)
incoming <- as.data.frame(read_excel("nbt.3102-S7.xlsx", sheet=1))
rownames(incoming) <- incoming[,1]
incoming <- incoming[,-1]
incoming <- incoming[,!duplicated(colnames(incoming))] # Remove duplicated genes.
sce.th2 <- SingleCellExperiment(list(logcounts=t(incoming)))

We empirically identify the cell cycle phase using the pair-based classifier in cyclone. The majority of cells in Figure 11 seem to lie in G1 phase, with small numbers of cells in the other phases.

library(org.Mm.eg.db)
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.th2), keytype="SYMBOL", column="ENSEMBL")

set.seed(100)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", 
    package="scran"))
assignments <- cyclone(sce.th2, mm.pairs, gene.names=ensembl, assay.type="logcounts")

plot(assignments$score$G1, assignments$score$G2M, 
    xlab="G1 score", ylab="G2/M score", pch=16)

Figure 11: Cell cycle phase scores from applying the pair-based classifier on the T_H2 dataset, where each point represents a cell
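
The idea behind a pair-based classifier can be sketched with a toy example. The gene names, expression values and marker pairs below are hypothetical, and the sketch only computes a raw proportion for one cell and one phase. It omits the further steps that cyclone performs to convert such proportions into the final phase scores.

```r
# Hypothetical log-expression values for one cell.
cell <- c(geneA=5.2, geneB=1.1, geneC=3.0, geneD=4.5)

# Hypothetical marker pairs for one phase: in that phase, the 'first' gene
# is expected to be more highly expressed than the 'second' gene.
marker.pairs <- data.frame(
    first=c("geneA", "geneC", "geneD"),
    second=c("geneB", "geneD", "geneB"),
    stringsAsFactors=FALSE
)

first.val <- cell[marker.pairs$first]
second.val <- cell[marker.pairs$second]

# Ignore tied pairs, then compute the proportion of pairs whose ordering
# agrees with the expectation for this phase.
informative <- first.val != second.val
raw.score <- mean(first.val[informative] > second.val[informative])
```

Because only the relative ordering within each pair is used, this type of score does not depend on how the expression values were scaled for each cell.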

We can block directly on the phase scores in downstream analyses. This is more graduated than using a strict assignment of each cell to a specific phase, as the magnitude of the score considers the uncertainty of the assignment. The phase covariates in the design matrix will absorb any phase-related effects on expression such that they will not affect estimation of the effects of other experimental factors. Users should also ensure that the phase score is not confounded with other factors of interest. For example, model fitting is not possible if all cells in one experimental condition are in one phase, and all cells in another condition are in a different phase.

design <- model.matrix(~ G1 + G2M, assignments$score)
fit.block <- trendVar(sce.th2, design=design, parametric=TRUE, use.spikes=NA)
dec.block <- decomposeVar(sce.th2, fit.block)

library(limma)
sce.th2.block <- sce.th2
assay(sce.th2.block, "corrected") <- removeBatchEffect(
    logcounts(sce.th2), covariates=design[,-1])

sce.th2.block <- denoisePCA(sce.th2.block, technical=dec.block, 
    assay.type="corrected")
dim(reducedDim(sce.th2.block, "PCA"))

## [1] 81  5
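
The effect of regressing out the phase scores can be illustrated for a single gene in base R. This is a sketch in the spirit of removeBatchEffect() with covariates=, not a copy of its implementation. The simulated scores, coefficients and noise level are assumptions.

```r
set.seed(10)

# Simulated phase scores and log-expression for one gene across 60 cells.
n <- 60
G1 <- runif(n)
G2M <- runif(n)
gene <- 2 + 1.5 * G1 - 1 * G2M + rnorm(n, sd=0.3)

# Centre the covariates so that removing their effect preserves the gene's mean.
covs <- scale(cbind(G1, G2M), center=TRUE, scale=FALSE)

# Fit a linear model with an intercept and the two phase covariates.
fit <- lm.fit(cbind(1, covs), gene)

# Subtract the fitted covariate effects, keeping the intercept and residuals.
corrected <- drop(gene - covs %*% fit$coefficients[-1])
```

After this step the corrected values are uncorrelated with both phase scores, so any remaining structure cannot be a linear function of them.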

The result of blocking on design is visualized with some PCA plots in Figure 12. Before removal, the distribution of cells along the first two principal components is strongly associated with their G1 and G2/M scores. This is no longer the case after removal, which suggests that the cell cycle effect has been mitigated.

sce.th2$G1score <- sce.th2.block$G1score <- assignments$score$G1
sce.th2$G2Mscore <- sce.th2.block$G2Mscore <- assignments$score$G2M

# Without blocking on phase score.
fit <- trendVar(sce.th2, parametric=TRUE, use.spikes=NA) 
sce.th2 <- denoisePCA(sce.th2, technical=fit$trend)
out <- plotReducedDim(sce.th2, use_dimred="PCA", ncomponents=2, colour_by="G1score", 
    size_by="G2Mscore") + fontsize + ggtitle("Before removal")

# After blocking on the phase score.
out2 <- plotReducedDim(sce.th2.block, use_dimred="PCA", ncomponents=2, 
    colour_by="G1score", size_by="G2Mscore") + fontsize + 
    ggtitle("After removal")
multiplot(out, out2, cols=2)

Figure 12: PCA plots before (left) and after (right) removal of the cell cycle effect in the T_H2 dataset
Each cell is represented by a point with colour and size determined by the G1 and G2/M scores, respectively.

As an aside, this dataset contains cells at various stages of differentiation (Mahata et al. 2014). This is an ideal use case for diffusion maps, which perform dimensionality reduction along a continuous process. In Figure 13, cells are arranged along a trajectory in the low-dimensional space. The first diffusion component is likely to correspond to T_H2 differentiation, given that a key regulator Gata3 (Zhu et al. 2006) changes in expression from left to right.

plotDiffusionMap(sce.th2.block, colour_by="Gata3",
    run_args=list(use_dimred="PCA", sigma=25)) + fontsize

Figure 13: A diffusion map for the T_H2 dataset, where each cell is coloured by its expression of Gata3
A larger sigma is used compared to the default value to obtain a smoother plot.

9 Concluding remarks

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] readxl_1.1.0                          
##  [2] gdata_2.18.0                          
##  [3] R.utils_2.6.0                         
##  [4] R.oo_1.22.0                           
##  [5] R.methodsS3_1.7.1                     
##  [6] scRNAseq_1.6.0                        
##  [7] EnsDb.Hsapiens.v86_2.99.0             
##  [8] ensembldb_2.4.1                       
##  [9] AnnotationFilter_1.4.0                
## [10] DropletUtils_1.0.1                    
## [11] pheatmap_1.0.10                       
## [12] cluster_2.0.7-1                       
## [13] dynamicTreeCut_1.63-1                 
## [14] limma_3.36.1                          
## [15] scran_1.8.2                           
## [16] scater_1.8.0                          
## [17] ggplot2_2.2.1                         
## [18] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
## [19] GenomicFeatures_1.32.0                
## [20] org.Mm.eg.db_3.6.0                    
## [21] AnnotationDbi_1.42.1                  
## [22] SingleCellExperiment_1.2.0            
## [23] SummarizedExperiment_1.10.1           
## [24] DelayedArray_0.6.0                    
## [25] BiocParallel_1.14.1                   
## [26] matrixStats_0.53.1                    
## [27] Biobase_2.40.0                        
## [28] GenomicRanges_1.32.3                  
## [29] GenomeInfoDb_1.16.0                   
## [30] IRanges_2.14.10                       
## [31] S4Vectors_0.18.2                      
## [32] BiocGenerics_0.26.0                   
## [33] knitr_1.20                            
## [34] BiocStyle_2.8.1                       
## 
## loaded via a namespace (and not attached):
##   [1] shinydashboard_0.7.0     tidyselect_0.2.4        
##   [3] RSQLite_2.1.1            htmlwidgets_1.2         
##   [5] grid_3.5.0               trimcluster_0.1-2       
##   [7] Rtsne_0.13               munsell_0.4.3           
##   [9] destiny_2.10.0           statmod_1.4.30          
##  [11] DT_0.4                   sROC_0.1-2              
##  [13] colorspace_1.3-2         BiocInstaller_1.30.0    
##  [15] highr_0.6                robustbase_0.93-0       
##  [17] vcd_1.4-4                VIM_4.7.0               
##  [19] TTR_0.23-3               labeling_0.3            
##  [21] tximport_1.8.0           GenomeInfoDbData_1.1.0  
##  [23] cvTools_0.3.2            bit64_0.9-7             
##  [25] rhdf5_2.24.0             rprojroot_1.3-2         
##  [27] xfun_0.1                 ggthemes_3.5.0          
##  [29] diptest_0.75-7           R6_2.2.2                
##  [31] ggbeeswarm_0.6.0         robCompositions_2.0.7   
##  [33] RcppEigen_0.3.3.4.0      locfit_1.5-9.1          
##  [35] mvoutlier_2.0.9          flexmix_2.3-14          
##  [37] bitops_1.0-6             reshape_0.8.7           
##  [39] assertthat_0.2.0         promises_1.0.1          
##  [41] scales_0.5.0             nnet_7.3-12             
##  [43] beeswarm_0.2.3           gtable_0.2.0            
##  [45] rlang_0.2.0              scatterplot3d_0.3-41    
##  [47] splines_3.5.0            rtracklayer_1.40.2      
##  [49] lazyeval_0.2.1           yaml_2.1.19             
##  [51] reshape2_1.4.3           abind_1.4-5             
##  [53] backports_1.1.2          httpuv_1.4.3            
##  [55] tools_3.5.0              bookdown_0.7            
##  [57] zCompositions_1.1.1      RColorBrewer_1.1-2      
##  [59] proxy_0.4-22             Rcpp_0.12.17            
##  [61] plyr_1.8.4               progress_1.1.2          
##  [63] zlibbioc_1.26.0          purrr_0.2.4             
##  [65] RCurl_1.95-4.10          prettyunits_1.0.2       
##  [67] viridis_0.5.1            cowplot_0.9.2           
##  [69] zoo_1.8-1                haven_1.1.1             
##  [71] magrittr_1.5             data.table_1.11.2       
##  [73] openxlsx_4.0.17          lmtest_0.9-36           
##  [75] truncnorm_1.0-8          mvtnorm_1.0-7           
##  [77] ProtGenerics_1.12.0      mime_0.5                
##  [79] evaluate_0.10.1          xtable_1.8-2            
##  [81] smoother_1.1             XML_3.98-1.11           
##  [83] rio_0.5.10               mclust_5.4              
##  [85] gridExtra_2.3            compiler_3.5.0          
##  [87] biomaRt_2.36.1           tibble_1.4.2            
##  [89] KernSmooth_2.23-15       htmltools_0.3.6         
##  [91] pcaPP_1.9-73             later_0.7.2             
##  [93] rrcov_1.4-4              DBI_1.0.0               
##  [95] MASS_7.3-50              fpc_2.1-11              
##  [97] boot_1.3-20              Matrix_1.2-14           
##  [99] car_3.0-0                sgeostat_1.0-27         
## [101] bindr_0.1.1              igraph_1.2.1            
## [103] forcats_0.3.0            pkgconfig_2.0.1         
## [105] GenomicAlignments_1.16.0 foreign_0.8-70          
## [107] laeken_0.4.6             sp_1.2-7                
## [109] vipor_0.4.5              XVector_0.20.0          
## [111] NADA_1.6-1               stringr_1.3.1           
## [113] digest_0.6.15            pls_2.6-0               
## [115] Biostrings_2.48.0        rmarkdown_1.9           
## [117] cellranger_1.1.0         edgeR_3.22.2            
## [119] DelayedMatrixStats_1.2.0 curl_3.2                
## [121] kernlab_0.9-26           gtools_3.5.0            
## [123] shiny_1.1.0              Rsamtools_1.32.0        
## [125] modeltools_0.2-21        rjson_0.2.19            
## [127] bindrcpp_0.2.2           Rhdf5lib_1.2.1          
## [129] carData_3.0-1            viridisLite_0.3.0       
## [131] pillar_1.2.2             lattice_0.20-35         
## [133] GGally_1.4.0             httr_1.3.1              
## [135] DEoptimR_1.0-8           survival_2.42-3         
## [137] xts_0.10-2               glue_1.2.0              
## [139] FNN_1.1                  prabclus_2.2-6          
## [141] bit_1.1-13               class_7.3-14            
## [143] stringi_1.2.2            blob_1.1.1              
## [145] memoise_1.1.0            dplyr_0.7.5             
## [147] irlba_2.3.2              e1071_1.6-8

References

Angel, P., and M. Karin. 1991. “The role of Jun, Fos and the AP-1 complex in cell-proliferation and transformation.” Biochim. Biophys. Acta 1072 (2-3):129–57.

Brennecke, P., S. Anders, J. K. Kim, A. A. Kołodziejczyk, X. Zhang, V. Proserpio, B. Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” Nat. Methods 10 (11):1093–5.

Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2):155–60.

Buja, A., and N. Eyuboglu. 1992. “Remarks on Parallel Analysis.” Multivariate Behav Res 27 (4):509–40.

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol., April.

Horn, J. L. 1965. “A Rationale and Test for the Number of Factors in Factor Analysis.” Psychometrika 30 (2):179–85. https://doi.org/10.1007/BF02289447.

Ilicic, T., J. K. Kim, A. A. Kołodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1):29.

Islam, S., U. Kjallquist, A. Moliner, P. Zajac, J. B. Fan, P. Lonnerberg, and S. Linnarsson. 2011. “Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq.” Genome Res. 21 (7):1160–7.

Kim, J. K., A. A. Kołodziejczyk, T. Ilicic, S. A. Teichmann, and J. C. Marioni. 2015. “Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression.” Nat. Commun. 6:8687.

Kołodziejczyk, A. A., J. K. Kim, J. C. Tsang, T. Ilicic, J. Henriksson, K. N. Natarajan, A. C. Tuck, et al. 2015. “Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.” Cell Stem Cell 17 (4):471–85.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

Mahata, B., X. Zhang, A. A. Kołodziejczyk, V. Proserpio, L. Haim-Vilmovsky, A. E. Taylor, D. Hebenstreit, et al. 2014. “Single-cell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis.” Cell Rep. 7 (4):1130–42.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Res. 40 (10):4288–97.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9:Article 39.

Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1):171–81.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3):751–54.

Tasic, B., V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, et al. 2016. “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics.” Nat. Neurosci. 19 (2):335–46.

Vallejos, C. A., J. C. Marioni, and S. Richardson. 2015. “BASiCS: Bayesian analysis of single-cell sequencing data.” PLoS Comput. Biol. 11 (6):e1004333.

Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.” Cell Stem Cell 16 (6):712–24.

Zhu, J., H. Yamane, J. Cote-Sierra, L. Guo, and W. E. Paul. 2006. “GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors.” Cell Res. 16 (1):3–10.
