1 Introduction

scuttle provides various low-level utilities for single-cell RNA-seq data analysis, focusing on basic normalization, quality control and transformation of expression values. These functions are intended for use by analysts, usually at the start of the analysis workflow; or by developers of other packages, to assemble more complex functions in downstream analysis steps.

To demonstrate, we will obtain the classic Zeisel dataset from the scRNAseq package. In this case, the dataset is provided as a SingleCellExperiment object. However, most scuttle functions can also be used with raw expression matrices or instances of the more general SummarizedExperiment class.

library(scRNAseq)
sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## altExpNames(2): ERCC repeat

Note: A more comprehensive description of the use of scuttle functions (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.

2 Cell-level quality control

2.1 Computing metrics

The perCellQCMetrics() function computes a variety of basic cell-level metrics, including sum, total number of counts for the cell (i.e., the library size); detected, the number of features for the cell that have counts above the detection limit (default of zero); and subsets_X_percent, percentage of all counts that come from the feature control set named X.

library(scuttle)
is.mito <- grep("mt-", rownames(sce))
per.cell <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
summary(per.cell$sum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2574    8130   12913   14954   19284   63505
summary(per.cell$detected)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     785    2484    3656    3777    4929    8167
summary(per.cell$subsets_Mito_percent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.992   6.653   7.956  10.290  56.955

It is often convenient to store this in the colData() of our SingleCellExperiment object for future reference. One can either do so manually:

colData(sce) <- cbind(colData(sce), per.cell)

… or just use the addPerCellQC() function:

sce2 <- addPerCellQC(sce, subsets=list(Mito=is.mito))
colnames(colData(sce2))
##  [1] "tissue"                  "group #"                
##  [3] "total mRNA mol"          "well"                   
##  [5] "sex"                     "age"                    
##  [7] "diameter"                "cell_id"                
##  [9] "level1class"             "level2class"            
## [11] "sum"                     "detected"               
## [13] "subsets_Mito_sum"        "subsets_Mito_detected"  
## [15] "subsets_Mito_percent"    "altexps_ERCC_sum"       
## [17] "altexps_ERCC_detected"   "altexps_ERCC_percent"   
## [19] "altexps_repeat_sum"      "altexps_repeat_detected"
## [21] "altexps_repeat_percent"  "total"                  
## [23] "sum"                     "detected"               
## [25] "subsets_Mito_sum"        "subsets_Mito_detected"  
## [27] "subsets_Mito_percent"    "altexps_ERCC_sum"       
## [29] "altexps_ERCC_detected"   "altexps_ERCC_percent"   
## [31] "altexps_repeat_sum"      "altexps_repeat_detected"
## [33] "altexps_repeat_percent"  "total"

2.2 Filtering low-quality cells

We identify low-quality cells by setting a threshold on each of these metrics using the isOutlier() function. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median; values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using type="lower") for the log-total counts at 3 MADs from the median.

keep.total <- !isOutlier(per.cell$sum, type="lower", log=TRUE)
filtered <- sce[,keep.total]

For typical scRNA-seq applications, quickPerCellQC() will conveniently detect outliers across several common metrics. This uses the total count, number of detected features and the percentage of counts in gene sets of diagnostic value (e.g., mitochondrial genes, spike-in transcripts) to identify which cells to discard and for what reason.

qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
colSums(as.matrix(qc.stats))
##              low_lib_size            low_n_features high_subsets_Mito_percent 
##                         0                         3                       128 
##                   discard 
##                       131
filtered <- sce[,!qc.stats$discard]

The isOutlier approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, differences in total RNA content between cell types. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the book for more details.

3 Computing feature-level statistics

Some basic feature-level statistics are computed by the perFeatureQCMetrics() function. This includes mean, the mean count of the gene/feature across all cells; detected, the percentage of cells with non-zero counts for each gene; subsets_Y_ratio, ratio of mean counts between the cell control set named Y and all cells.

# Pretending that the first 10 cells are empty wells, for demonstration.
per.feat <- perFeatureQCMetrics(sce, subsets=list(Empty=1:10))
summary(per.feat$mean)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0007   0.0097   0.1338   0.7475   0.5763 732.1524
summary(per.feat$detected)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.03328  0.76539  9.01830 18.87800 31.24792 99.96672
summary(per.feat$subsets_Empty_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.601   1.872   2.016 300.500

A more refined calculation of the average is provided by the calculateAverage() function, which adjusts the counts by the relative library size (or size factor) prior to taking the mean.

ave <- calculateAverage(sce)
summary(ave)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0002   0.0109   0.1443   0.7475   0.5674 850.6880

These metrics tend to be more useful for informing the analyst about the overall behavior of the experiment, rather than being explicitly used to filter out genes. For example, one would hope that the most abundant genes are the “usual suspects”, e.g., ribosomal proteins, actin, histones.

4 Utilities for normalizaton

4.1 Computing size factors

scuttle provides a number of utilities for global scaling normalization, where the counts for each cell are divided by a cell-specific “size factor” to adjust for uninteresting differences in sequencing depth and capture efficiency. By default, the size factor is automatically computed from the library size of each cell using the librarySizeFactors() function. This calculation simply involves scaling the library sizes so that they have a mean of 1 across all cells.

summary(librarySizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1721  0.5437  0.8635  1.0000  1.2895  4.2466

scuttle also implements two other basic methods of computing size factors, either from the geometric mean or using a DESeq-esque approach based on the median. These have their own strengths and weaknesses that are discussed in the relevant documentation pages.

summary(geometricSizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8304  0.9077  0.9761  1.0000  1.0653  1.5320
summary(medianSizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    3005

Note that if size factors are explicitly provided in the SingleCellExperiment, they will be used by the downstream normalization functions. Size factors can be manually added by sizeFactors()<- or with the functions like:

sce <- computeLibraryFactors(sce)
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1721  0.5437  0.8635  1.0000  1.2895  4.2466

4.2 Normalizing expression values

The most commonly used function is logNormCounts(), which calculates log2-transformed normalized expression values by dividing each count by the size factor for the cell, adding a constant pseudo-count and log-transforming. The resulting values can be roughly interpreted on the same scale as log-transformed counts and are stored in "logcounts".

sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts"    "logcounts"

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay. Here, we use the normalizeCounts() function to perform some custom normalization with random size factors.

assay(sce, "normed") <- normalizeCounts(sce, 
    size.factors=runif(ncol(sce)), pseudo.count=1.5)

scuttle can also calculate counts-per-million using the aptly-named calculateCPM() function. The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object. Related functions include calculateTPM() and calculateFPKM(), which do pretty much as advertised.

assay(sce, "cpm") <- calculateCPM(sce)

5 Aggregation across groups or clusters

The aggregateAcrossCells() function is helpful for aggregating expression values across groups of cells. For example, we might wish to sum together counts for all cells in the same cluster, possibly to use as a summary statistic for downstream analyses (e.g., for differential expression with edgeR). This will also perform the courtesy of sensibly aggregating the column metadata for downstream use.

agg.sce <- aggregateAcrossCells(sce, ids=sce$level1class)
head(assay(agg.sce))
##          astrocytes_ependymal endothelial-mural interneurons microglia
## Tspan12                    74               138          186        14
## Tshz1                      96               105          335        43
## Fnbp1l                     89               169          689        33
## Adamts15                    2                23           64         0
## Cldn12                     50                27          160         5
## Rxfp1                       0                 3           74         0
##          oligodendrocytes pyramidal CA1 pyramidal SS
## Tspan12                99           269           58
## Tshz1                 297            65          332
## Fnbp1l                239           859         1056
## Adamts15               19            11           15
## Cldn12                214           411          273
## Rxfp1                  13            48           30
colData(agg.sce)[,c("ids", "ncells")]
## DataFrame with 7 rows and 2 columns
##                                       ids    ncells
##                               <character> <integer>
## astrocytes_ependymal astrocytes_ependymal       224
## endothelial-mural       endothelial-mural       235
## interneurons                 interneurons       290
## microglia                       microglia        98
## oligodendrocytes         oligodendrocytes       820
## pyramidal CA1               pyramidal CA1       939
## pyramidal SS                 pyramidal SS       399

It is similarly possible to sum across multiple factors, as shown below for the cell type and the tissue of origin. This yields one column per combination of cell type and tissue, which allows us to conveniently perform downstream analyses with both factors.

agg.sce <- aggregateAcrossCells(sce, 
    ids=colData(sce)[,c("level1class", "tissue")])
head(assay(agg.sce))
##          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## Tspan12    23   51   10  128   52  134    0   14   10    89   269    58
## Tshz1      38   58   10   95   90  245    1   42   50   247    65   332
## Fnbp1l     23   66    9  160  227  462    3   30   18   221   859  1056
## Adamts15    0    2    2   21   15   49    0    0    0    19    11    15
## Cldn12      5   45    0   27   61   99    0    5   22   192   411   273
## Rxfp1       0    0    0    3    3   71    0    0    1    12    48    30
colData(agg.sce)[,c("level1class", "tissue", "ncells")]
## DataFrame with 12 rows and 3 columns
##              level1class         tissue    ncells
##              <character>    <character> <integer>
## 1   astrocytes_ependymal ca1hippocampus        81
## 2   astrocytes_ependymal       sscortex       143
## 3      endothelial-mural ca1hippocampus        33
## 4      endothelial-mural       sscortex       202
## 5           interneurons ca1hippocampus       126
## ...                  ...            ...       ...
## 8              microglia       sscortex        84
## 9       oligodendrocytes ca1hippocampus       121
## 10      oligodendrocytes       sscortex       699
## 11         pyramidal CA1 ca1hippocampus       939
## 12          pyramidal SS       sscortex       399

Summation across rows may occasionally be useful for obtaining a measure of the activity of a gene set, e.g., in a pathway. Given a list of gene sets, we can use the sumCountsAcrossFeatures() function to aggregate expression values across features. This is usually best done by averaging the log-expression values as shown below.

agg.feat <- sumCountsAcrossFeatures(sce,
    ids=list(GeneSet1=1:10, GeneSet2=11:50, GeneSet3=1:100),
    average=TRUE, exprs_values="logcounts")
agg.feat[,1:10]
##          1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06
## GeneSet1      0.7717764      0.2177634      0.7577241      0.5380195
## GeneSet2      1.0602489      0.9955551      1.2558629      1.1482845
## GeneSet3      1.1410273      1.0018648      1.2727650      1.1995992
##          1772067065_H06 1772071017_E02 1772067065_B07 1772067060_B09
## GeneSet1      0.4987205      0.3528216      0.5262005      0.0749138
## GeneSet2      1.2010854      1.2855797      1.2889492      1.1806065
## GeneSet3      1.1244981      1.1567887      1.1057770      1.1353704
##          1772071014_E04 1772071015_D04
## GeneSet1      0.7267285      0.7924899
## GeneSet2      1.1728326      1.2871506
## GeneSet3      1.0098231      1.1846439

Similar functions are available to compute the number or proportion of cells with detectable expression in each group. To wit:

agg.n <- numDetectedAcrossCells(sce,
    ids=colData(sce)[,c("level1class", "tissue")])
head(assay(agg.n))
##          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## Tspan12    16   21    6   56   36   59    0    8    7    50   177    36
## Tshz1      17   22    7   45   46   78    1   12   26   136    48   135
## Fnbp1l     13   27    7   63   80  123    2   18    9   106   432   287
## Adamts15    0    2    2   10    6   29    0    0    0    10     9     9
## Cldn12      4   20    0   12   37   60    0    4   17   103   271   141
## Rxfp1       0    0    0    2    2   35    0    0    1     8    37    19

6 Miscellaneous functions

6.1 Reading in sparse matrices

Normally, sparse matrices are provided in the MatrixMarket (.mtx) format, where they can be read efficiently into memory using the readMM() function from the Matrix package. However, for some reason, it has been popular to save these files in dense form as tab- or comma-separate files. This is an inefficient and inconvenient approach, requiring users to read in the entire dataset in dense form with functions like read.delim() or read.csv() (and hope that they have enough memory on their machines to do so).

In such cases, scuttle provides the readSparseCounts() function to overcome excessive memory requirements. This reads in the dataset in a chunkwise manner, progressively coercing each chunk into a sparse format and combining them into a single sparse matrix to be returned to the user. In this manner, we never attempt to load the entire dataset in dense format to memory.

# Mocking up a dataset to demonstrate:
outfile <- tempfile()
write.table(counts(sce)[1:100,], file=outfile, sep="\t", quote=FALSE)

# Reading it in as a sparse matrix:
output <- readSparseCounts(outfile)
class(output)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

6.2 Making gene symbols unique

When publishing a dataset, the best practice is to provide gene annotations in the form of a stable identifier like those from Ensembl or Entrez. This provides an unambiguous reference to the identity of the gene, avoiding difficulties with synonynms and making it easier to cross-reference. However, when working with a dataset, it is more convenient to use the gene symbols as these are easier to remember.

Thus, a common procedure is to replace the stable identifiers in the row names with gene symbols. However, this is not straightforward as the gene symbols may not exist (NAs) or may be duplicated. To assist this process, scuttle provides the uniquifyFeatureNames() function that emit gene symbols if they are unique; append the identifier, if they are duplicated; and replace the symbol with the identifier if the former is missing.

# Original row names are Ensembl IDs.
sce.ens <- ZeiselBrainData(ensembl=TRUE)
head(rownames(sce.ens)) 
## [1] "ENSMUSG00000029669" "ENSMUSG00000046982" "ENSMUSG00000039735"
## [4] "ENSMUSG00000033453" "ENSMUSG00000046798" "ENSMUSG00000034009"
# Replacing with guaranteed unique and non-missing symbols:
rownames(sce.ens) <- uniquifyFeatureNames(
    rownames(sce.ens), rowData(sce.ens)$originalName
)
head(rownames(sce.ens)) 
## [1] "Tspan12"  "Tshz1"    "Fnbp1l"   "Adamts15" "Cldn12"   "Rxfp1"

6.3 Creating a data.frame

The makePerCellDF() and makePerFeatureDF() functions create data.frames from the SingleCellExperiment object. In the makePerCellDF() case, each row of the output data.frame corresponds to a cell and each column represents the expression of a specified feature across cells, a field of the column metadata, or reduced dimensions (if any are available).

out <- makePerCellDF(sce, features="Tspan12")
colnames(out)
##  [1] "Tspan12"                 "tissue"                 
##  [3] "group.."                 "total.mRNA.mol"         
##  [5] "well"                    "sex"                    
##  [7] "age"                     "diameter"               
##  [9] "cell_id"                 "level1class"            
## [11] "level2class"             "sum"                    
## [13] "detected"                "subsets_Mito_sum"       
## [15] "subsets_Mito_detected"   "subsets_Mito_percent"   
## [17] "altexps_ERCC_sum"        "altexps_ERCC_detected"  
## [19] "altexps_ERCC_percent"    "altexps_repeat_sum"     
## [21] "altexps_repeat_detected" "altexps_repeat_percent" 
## [23] "total"                   "sizeFactor"

In the makePerFeatureDF() case, each row of the output data.frame corresponds to a gene and each column represents the expression profile of a specified cell or the values of a row metadata field.

out2 <- makePerFeatureDF(sce, cells=c("1772063062_D05",
    "1772063061_D01", "1772060240_F02", "1772062114_F05"))
colnames(out2)
## [1] "1772063062_D05" "1772063061_D01" "1772060240_F02" "1772062114_F05"
## [5] "featureType"

The aim is to enable the data in a SingleCellExperiment to be easilybe used in functions like model.matrix() or in ggplot(), without requiring users to manually extract the desired fields from the SingleCellExperiment to construct their own data.frame.

Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ensembldb_2.14.0            AnnotationFilter_1.14.0    
##  [3] GenomicFeatures_1.42.1      AnnotationDbi_1.52.0       
##  [5] scuttle_1.0.4               scRNAseq_2.4.0             
##  [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
##  [9] Biobase_2.50.0              GenomicRanges_1.42.0       
## [11] GenomeInfoDb_1.26.2         IRanges_2.24.1             
## [13] S4Vectors_0.28.1            BiocGenerics_0.36.0        
## [15] MatrixGenerics_1.2.0        matrixStats_0.57.0         
## [17] BiocStyle_2.18.1           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2                    bit64_4.0.5                  
##  [3] AnnotationHub_2.22.0          DelayedMatrixStats_1.12.1    
##  [5] shiny_1.5.0                   assertthat_0.2.1             
##  [7] askpass_1.1                   interactiveDisplayBase_1.28.0
##  [9] BiocManager_1.30.10           BiocFileCache_1.14.0         
## [11] blob_1.2.1                    Rsamtools_2.6.0              
## [13] GenomeInfoDbData_1.2.4        yaml_2.2.1                   
## [15] progress_1.2.2                BiocVersion_3.12.0           
## [17] pillar_1.4.7                  RSQLite_2.2.1                
## [19] lattice_0.20-41               beachmat_2.6.3               
## [21] glue_1.4.2                    digest_0.6.27                
## [23] promises_1.1.1                XVector_0.30.0               
## [25] htmltools_0.5.0               httpuv_1.5.4                 
## [27] Matrix_1.2-18                 XML_3.99-0.5                 
## [29] pkgconfig_2.0.3               biomaRt_2.46.0               
## [31] bookdown_0.21                 zlibbioc_1.36.0              
## [33] purrr_0.3.4                   xtable_1.8-4                 
## [35] later_1.1.0.1                 BiocParallel_1.24.1          
## [37] openssl_1.4.3                 tibble_3.0.4                 
## [39] generics_0.1.0                ellipsis_0.3.1               
## [41] withr_2.3.0                   lazyeval_0.2.2               
## [43] magrittr_2.0.1                crayon_1.3.4                 
## [45] mime_0.9                      memoise_1.1.0                
## [47] evaluate_0.14                 xml2_1.3.2                   
## [49] prettyunits_1.1.1             tools_4.0.3                  
## [51] hms_0.5.3                     lifecycle_0.2.0              
## [53] stringr_1.4.0                 DelayedArray_0.16.0          
## [55] Biostrings_2.58.0             compiler_4.0.3               
## [57] rlang_0.4.9                   grid_4.0.3                   
## [59] RCurl_1.98-1.2                rappdirs_0.3.1               
## [61] bitops_1.0-6                  rmarkdown_2.6                
## [63] ExperimentHub_1.16.0          DBI_1.1.0                    
## [65] curl_4.3                      R6_2.5.0                     
## [67] GenomicAlignments_1.26.0      rtracklayer_1.50.0           
## [69] knitr_1.30                    dplyr_1.0.2                  
## [71] fastmap_1.0.1                 bit_4.0.4                    
## [73] ProtGenerics_1.22.0           stringi_1.5.3                
## [75] Rcpp_1.0.5                    vctrs_0.3.6                  
## [77] sparseMatrixStats_1.2.0       dbplyr_2.0.0                 
## [79] tidyselect_1.1.0              xfun_0.19