In this workflow, we use a relatively simple dataset (Lun et al. 2017) to introduce most of the concepts of scRNA-seq data analysis. This dataset contains two plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. Counts for all genes/transcripts in each cell were obtained from ArrayExpress using the accession number E-MTAB-5522.
Our first task is to load the count matrices into memory. One matrix was generated for each plate of cells used in the study. In each matrix, each row represents an endogenous gene or a spike-in transcript, and each column represents a cell. Subsequently, the count in each entry of the matrix represents the number of reads mapped to a particular gene/transcript in a particular cell.
unzip("E-MTAB-5522.processed.1.zip")
# Reading in the count tables for each of the two plates.
plate1 <- read.delim("counts_Calero_20160113.tsv",
header=TRUE, row.names=1, check.names=FALSE)
plate2 <- read.delim("counts_Calero_20160325.tsv",
header=TRUE, row.names=1, check.names=FALSE)
gene.lengths <- plate1$Length # First column is the gene length.
plate1 <- as.matrix(plate1[,-1]) # Discarding gene length (as it is not a cell).
plate2 <- as.matrix(plate2[,-1])
rbind(Plate1=dim(plate1), Plate2=dim(plate2))
## [,1] [,2]
## Plate1 46703 96
## Plate2 46703 96
We combine the two matrices into a single object for further processing. This is done after verifying that the genes are in the same order between the two matrices.
stopifnot(identical(rownames(plate1), rownames(plate2)))
all.counts <- cbind(plate1, plate2)
For convenience, the count matrix is stored in a SingleCellExperiment
object from the SingleCellExperiment package.
This allows different types of row- and column-level metadata to be stored alongside the counts for synchronized manipulation throughout the workflow.
library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=all.counts))
rowData(sce)$GeneLength <- gene.lengths
sce
## class: SingleCellExperiment
## dim: 46703 192
## metadata(0):
## assays(1): counts
## rownames(46703): ENSMUSG00000102693 ENSMUSG00000064842 ... SIRV7
## CBFB-MYH11-mcherry
## rowData names(1): GeneLength
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1
## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
We identify the rows corresponding to ERCC spike-in transcripts from the row names.
We store this information in the SingleCellExperiment
object for future use.
This is necessary as spike-ins require special treatment in downstream steps such as normalization.
isSpike(sce, "ERCC") <- grepl("^ERCC", rownames(sce))
summary(isSpike(sce, "ERCC"))
## Mode FALSE TRUE
## logical 46611 92
This dataset is slightly unusual in that it contains information from another set of spike-in transcripts, the Spike-In RNA Variants (SIRV) set.
For simplicity, we will only use the ERCC spike-ins in this analysis.
Thus, we must remove the rows corresponding to the SIRV transcripts prior to further analysis, which can be done simply by subsetting the SingleCellExperiment
object.
is.sirv <- grepl("^SIRV", rownames(sce))
sce <- sce[!is.sirv,]
summary(is.sirv)
## Mode FALSE TRUE
## logical 46696 7
Comments from Aaron:
colData
) prior to further analyses.^ERCC
regular expression for human data where the row names of the count matrix are gene symbols.
An ERCC gene family actually exists in human annotation, so this would result in incorrect identification of genes as spike-in transcripts.
This problem can be avoided by publishing count matrices with standard identifiers (e.g., Ensembl, Entrez).We load in the metadata for each library/cell from the sdrf.txt
file.
It is important to check that the rows of the metadata table are in the same order as the columns of the count matrix.
Otherwise, incorrect metadata will be assigned to each cell.
metadata <- read.delim("E-MTAB-5522.sdrf.txt", check.names=FALSE, header=TRUE)
m <- match(colnames(sce), metadata[["Source Name"]]) # Enforcing identical order.
stopifnot(all(!is.na(m))) # Checking that nothing's missing.
metadata <- metadata[m,]
head(colnames(metadata))
## [1] "Source Name" "Comment[ENA_SAMPLE]"
## [3] "Comment[BioSD_SAMPLE]" "Characteristics[organism]"
## [5] "Characteristics[cell line]" "Characteristics[cell type]"
We only retain relevant metadata fields to avoid storing unnecessary information in the colData
of the SingleCellExperiment
object.
In particular, we keep the plate of origin (i.e., block
) and phenotype of each cell.
The second field is relevant as all of the cells contain a CBFB-MYH11 oncogene, but the expression of this oncogene is only induced in a subset of the cells.
colData(sce)$Plate <- factor(metadata[["Factor Value[block]"]])
pheno <- metadata[["Factor Value[phenotype]"]]
levels(pheno) <- c("induced", "control")
colData(sce)$Oncogene <- pheno
table(colData(sce)$Oncogene, colData(sce)$Plate)
##
## 20160113 20160325
## induced 48 48
## control 48 48
Feature-counting tools typically report genes in terms of standard identifiers from Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to the gene symbols which are more commonly used in the literature. Given the Ensembl identifiers, we obtain the corresponding gene symbols using annotation packages like org.Mm.eg.db.
library(org.Mm.eg.db)
symb <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="ENSEMBL", column="SYMBOL")
rowData(sce)$ENSEMBL <- rownames(sce)
rowData(sce)$SYMBOL <- symb
head(rowData(sce))
## DataFrame with 6 rows and 3 columns
## GeneLength ENSEMBL SYMBOL
## <integer> <character> <character>
## 1 1070 ENSMUSG00000102693 NA
## 2 110 ENSMUSG00000064842 NA
## 3 6094 ENSMUSG00000051951 Xkr4
## 4 480 ENSMUSG00000102851 NA
## 5 2819 ENSMUSG00000103377 NA
## 6 2233 ENSMUSG00000104017 NA
It is often desirable to rename the row names of sce
to the gene symbols, as these are easier to interpret.
However, this requires some work to account for missing and duplicate symbols.
The code below will replace missing symbols with the Ensembl identifier and concatenate duplicated symbols with the (unique) Ensembl identifiers.
new.names <- rowData(sce)$SYMBOL
missing.name <- is.na(new.names)
new.names[missing.name] <- rowData(sce)$ENSEMBL[missing.name]
dup.name <- new.names %in% new.names[duplicated(new.names)]
new.names[dup.name] <- paste0(new.names, "_", rowData(sce)$ENSEMBL)[dup.name]
rownames(sce) <- new.names
head(rownames(sce))
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "Xkr4"
## [4] "ENSMUSG00000102851" "ENSMUSG00000103377" "ENSMUSG00000104017"
We also determine the chromosomal location for each gene using the TxDb.Mmusculus.UCSC.mm10.ensGene package. This will be useful later as several quality control metrics will be computed from rows corresponding to mitochondrial genes.
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
location <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=rowData(sce)$ENSEMBL,
column="CDSCHROM", keytype="GENEID")
rowData(sce)$CHR <- location
summary(location=="chrM")
## Mode FALSE TRUE NA's
## logical 22428 13 24255
Alternatively, annotation from BioMart resources can be directly added to the object using the getBMFeatureAnnos
function from scater.
This may be more convenient than the approach shown above, but depends on an available internet connection to the BioMart databases.
Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics:
For each cell, we calculate these quality control metrics using the calculateQCMetrics
function from the scater package (McCarthy et al. 2017).
These are stored in the row- and column-wise metadata of the SingleCellExperiment
for future reference.
library(scater)
mito <- which(rowData(sce)$CHR=="chrM")
sce <- calculateQCMetrics(sce, feature_controls=list(Mt=mito))
head(colnames(colData(sce)), 10)
## [1] "Plate" "Oncogene"
## [3] "is_cell_control" "total_features_by_counts"
## [5] "log10_total_features_by_counts" "total_counts"
## [7] "log10_total_counts" "pct_counts_in_top_50_features"
## [9] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"
The distributions of these metrics are shown in Figure 1, stratified by oncogene induction status and plate of origin. The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions.
sce$PlateOnco <- paste0(sce$Oncogene, ".", sce$Plate)
multiplot(
plotColData(sce, y="total_counts", x="PlateOnco"),
plotColData(sce, y="total_features", x="PlateOnco"),
plotColData(sce, y="pct_counts_ERCC", x="PlateOnco"),
plotColData(sce, y="pct_counts_Mt", x="PlateOnco"),
cols=2)
It is also valuable to examine how the QC metrics behave with respect to each other (Figure 2). Generally, they will be in rough agreement, i.e., cells with low total counts will also have low numbers of expressed features and high ERCC/mitochondrial proportions. Clear discrepancies may correspond to technical differences between batches of cells (see below) or genuine biological differences in RNA content.
par(mfrow=c(1,3))
plot(sce$total_features, sce$total_counts/1e6, xlab="Number of expressed genes",
ylab="Library size (millions)")
plot(sce$total_features, sce$pct_counts_ERCC, xlab="Number of expressed genes",
ylab="ERCC proportion (%)")
plot(sce$total_features, sce$pct_counts_Mt, xlab="Number of expressed genes",
ylab="Mitochondrial proportion (%)")
Picking a threshold for these metrics is not straightforward as their absolute values depend on the experimental protocol. For example, sequencing to greater depth will lead to more reads and more expressed features, regardless of the quality of the cells. Similarly, using more spike-in RNA in the protocol will result in higher spike-in proportions. To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells, and identify cells that are outliers for the various QC metrics.
Outliers are defined based on the median absolute deviation (MADs) from the median value of each metric across all cells. We remove cells with log-library sizes that are more than 3 MADs below the median log-library size. A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median. We also remove cells where the log-transformed number of expressed genes is 3 MADs below the median value.
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower",
log=TRUE, batch=sce$PlateOnco)
feature.drop <- isOutlier(sce$total_features, nmads=3, type="lower",
log=TRUE, batch=sce$PlateOnco)
The batch=
argument ensures that outliers are identified within each level of the specified plate/oncogene factor.
This allows isOutlier()
to accommodate systematic differences in the QC metrics across plates (Figure 1),
which can arise due to technical differences in processing (e.g., differences in sequencing depth) rather than any changes in quality.
The same reasoning applies to the oncogene induction status, where induced cells may have naturally fewer expressed genes for biological reasons.
Failing to account for these systematic differences would inflate the MAD estimate and compromise the removal of low-quality cells.
We identify outliers for the proportion-based metrics in a similar manner. Here, no transformation is required as we are identifying large outliers, for which the distinction should be fairly clear on the raw scale. We do not need to use the mitochondrial proportions as we already have the spike-in proportions (which serve a similar purpose) for this dataset. This avoids potential issues arising from genuine differences in mitochondrial content between cell types that may confound outlier identification.
spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher",
batch=sce$PlateOnco)
Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality.
keep <- !(libsize.drop | feature.drop | spike.drop)
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
BySpike=sum(spike.drop), Remaining=sum(keep))
## ByLibSize ByFeature BySpike Remaining
## 1 5 4 6 183
We then subset the SingleCellExperiment
object to retain only the putative high-quality cells.
We also save the original object to file for later use.
sce$PassQC <- keep
saveRDS(sce, file="416B_preQC.rds")
sce <- sce[,keep]
dim(sce)
## [1] 46696 183
Comments from Aaron:
We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another.
This approach is implemented in the cyclone
function from the scran package.
The package contains a pre-trained set of marker pairs for mouse data, which we can load in the the readRDS
function.
We use the Ensembl identifiers for each gene in our dataset to match up with the names in the pre-trained set of gene pairs.
set.seed(100)
library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
package="scran"))
assignments <- cyclone(sce, mm.pairs, gene.names=rowData(sce)$ENSEMBL)
The cyclone
result for each cell in the HSC dataset is shown in Figure 3.
Each cell is assigned a score for each phase, with a higher score corresponding to a higher probability that the cell is in that phase.
We focus on the G1 and G2/M scores as these are the most informative for classification.
plot(assignments$score$G1, assignments$score$G2M,
xlab="G1 score", ylab="G2/M score", pch=16)