scater 1.14.6
This document gives an introduction to and overview of the quality control functionality of the scater package.
scater contains tools to help with the analysis of single-cell transcriptomic data,
focusing on low-level steps such as quality control, normalization and visualization.
It is based on the SingleCellExperiment class (from the SingleCellExperiment package),
and thus is interoperable with many other Bioconductor packages such as scran, batchelor and iSEE.
Note: A more comprehensive description of the use of scater (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
SingleCellExperiment object
We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.).
First, we create a SingleCellExperiment object containing the data, as demonstrated below with the Zeisel brain dataset from the scRNAseq package.
Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell ’omics data.
library(scRNAseq)
example_sce <- ZeiselBrainData()
example_sce
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat
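If the counts are already held in an ordinary R matrix rather than in a packaged dataset, the object can be built directly with the SingleCellExperiment() constructor. The following is a minimal sketch under that assumption; the simulated matrix and the names toy_counts and toy_sce are purely illustrative and are not part of the original example.

```r
library(SingleCellExperiment)

# Simulated counts: 200 features (rows) by 10 cells (columns).
# These values are made up for illustration only.
set.seed(100)
toy_counts <- matrix(rpois(2000, lambda = 5), nrow = 200, ncol = 10,
    dimnames = list(paste0("Gene", 1:200), paste0("Cell", 1:10)))

# Store the matrix under the conventional assay name "counts".
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))
dim(toy_sce)
```

Naming the assay "counts" is what allows the counts() accessor to find it later.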
We usually expect (raw) count data to be labelled as "counts" in the assays, which can be easily retrieved with the counts accessor.
Getters and setters are also provided for exprs, tpm, cpm, fpkm and versions of these with the prefix norm_.
str(counts(example_sce))
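To illustrate how a getter and a setter work together, the sketch below retrieves the counts from a small simulated object and stores a hand-computed counts-per-million matrix with the cpm setter. The object toy_sce and the manual CPM calculation are assumptions made for this illustration, not a recommended normalization.

```r
library(SingleCellExperiment)

# Small simulated object (illustrative values only).
set.seed(100)
toy_counts <- matrix(rpois(500, lambda = 5), nrow = 50, ncol = 10,
    dimnames = list(paste0("Gene", 1:50), paste0("Cell", 1:10)))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Getter: counts() is shorthand for retrieving the assay named "counts".
identical(counts(toy_sce), assay(toy_sce, "counts"))

# Setter: divide each column by its library size, scale to one million,
# and store the result as the "cpm" assay.
lib_sizes <- colSums(counts(toy_sce))
cpm(toy_sce) <- sweep(counts(toy_sce), 2, lib_sizes, "/") * 1e6
assayNames(toy_sce)
```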
Row and column-level metadata are easily accessed (or modified) as shown below.
There are also dedicated getters and setters for size factor values (sizeFactors()); reduced dimensionality results (reducedDim()); and alternative experimental features (altExp()).
example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)
## DataFrame with 3005 rows and 11 columns
## tissue group # total mRNA mol well sex
## <character> <numeric> <numeric> <numeric> <numeric>
## 1772071015_C02 sscortex 1 1221 3 3
## 1772071017_G12 sscortex 1 1231 95 1
## 1772071017_A05 sscortex 1 1652 27 1
## 1772071014_B06 sscortex 1 1696 37 3
## 1772067065_H06 sscortex 1 1219 43 3
## ... ... ... ... ... ...
## 1772067059_B04 ca1hippocampus 9 1997 19 1
## 1772066097_D04 ca1hippocampus 9 1415 21 1
## 1772063068_D01 sscortex 9 1876 34 3
## 1772066098_A12 ca1hippocampus 9 1546 88 1
## 1772058148_F03 sscortex 9 1970 15 3
## age diameter cell_id level1class level2class
## <numeric> <numeric> <character> <character> <character>
## 1772071015_C02 2 1 1772071015_C02 interneurons Int10
## 1772071017_G12 1 353 1772071017_G12 interneurons Int10
## 1772071017_A05 1 13 1772071017_A05 interneurons Int6
## 1772071014_B06 2 19 1772071014_B06 interneurons Int10
## 1772067065_H06 6 12 1772067065_H06 interneurons Int9
## ... ... ... ... ... ...
## 1772067059_B04 4 382 1772067059_B04 endothelial-mural Peric
## 1772066097_D04 7 12 1772066097_D04 endothelial-mural Vsmc
## 1772063068_D01 7 268 1772063068_D01 endothelial-mural Vsmc
## 1772066098_A12 7 324 1772066098_A12 endothelial-mural Vsmc
## 1772058148_F03 7 6 1772058148_F03 endothelial-mural Vsmc
## whee
## <character>
## 1772071015_C02 F
## 1772071017_G12 A
## 1772071017_A05 H
## 1772071014_B06 X
## 1772067065_H06 X
## ... ...
## 1772067059_B04 T
## 1772066097_D04 H
## 1772063068_D01 K
## 1772066098_A12 E
## 1772058148_F03 Y
rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)
## DataFrame with 20006 rows and 2 columns
## featureType stuff
## <character> <numeric>
## Tspan12 endogenous 0.531340830726549
## Tshz1 endogenous 0.245747287524864
## Fnbp1l endogenous 0.841682275990024
## Adamts15 endogenous 0.47632492124103
## Cldn12 endogenous 0.631566006690264
## ... ... ...
## mt-Co2 mito 0.542126515181735
## mt-Co1 mito 0.915390015114099
## mt-Rnr2 mito 0.665483738295734
## mt-Rnr1 mito 0.612728938227519
## mt-Nd4l mito 0.610844046343118
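The dedicated accessors sizeFactors(), reducedDim() and altExp() follow the same getter and setter pattern as colData() and rowData(). The sketch below shows each one on a small simulated object; all values (including the placeholder coordinates stored as "fake_PCA" and the "Spikes" feature set) are invented for illustration.

```r
library(SingleCellExperiment)

# Small simulated object (illustrative values only).
set.seed(100)
toy_counts <- matrix(rpois(500, lambda = 5), nrow = 50, ncol = 10,
    dimnames = list(paste0("Gene", 1:50), paste0("Cell", 1:10)))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Size factors: one positive value per cell (here, scaled library sizes).
lib_sizes <- colSums(counts(toy_sce))
sizeFactors(toy_sce) <- lib_sizes / mean(lib_sizes)

# Reduced dimensions: a matrix with one row per cell (placeholder coordinates).
reducedDim(toy_sce, "fake_PCA") <- matrix(rnorm(20), nrow = ncol(toy_sce), ncol = 2)

# Alternative experiment: a second feature set measured in the same cells.
spike_counts <- matrix(rpois(50, lambda = 10), nrow = 5, ncol = 10,
    dimnames = list(paste0("Spike", 1:5), colnames(toy_sce)))
altExp(toy_sce, "Spikes") <- SummarizedExperiment(list(counts = spike_counts))

reducedDimNames(toy_sce)
altExpNames(toy_sce)
```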
Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner.
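As a small illustration of synchronized subsetting, the sketch below subsets a simulated object by rows and columns at once and checks that the metadata follow along. The batch and is_even fields are hypothetical annotations created only for this example.

```r
library(SingleCellExperiment)

# Small simulated object with made-up row and column annotations.
set.seed(100)
toy_counts <- matrix(rpois(500, lambda = 5), nrow = 50, ncol = 10,
    dimnames = list(paste0("Gene", 1:50), paste0("Cell", 1:10)))
toy_sce <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(batch = rep(c("a", "b"), each = 5)),
    rowData = DataFrame(is_even = seq_len(50) %% 2 == 0)
)

# Keep even-numbered features and cells from batch "a" in a single step.
sub_sce <- toy_sce[rowData(toy_sce)$is_even, toy_sce$batch == "a"]

# The assay, rowData and colData are all reduced consistently.
dim(sub_sce)
nrow(rowData(sub_sce))
nrow(colData(sub_sce))
```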
More details about the SingleCellExperiment class can be found in the documentation for the SingleCellExperiment package.
Count matrices stored as CSV files or equivalent can be easily read into an R session using read.table() from utils or fread() from the data.table package.
It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment object.
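A minimal sketch of this route is shown below. It writes a tiny, hypothetical count table to a temporary CSV file so that the example is self-contained, reads it back with read.table(), and coerces the result to a matrix before constructing the object.

```r
library(SingleCellExperiment)

# Write a small hypothetical count table to a temporary CSV file.
csv_file <- tempfile(fileext = ".csv")
toy_table <- data.frame(Cell1 = c(0, 3, 1), Cell2 = c(5, 0, 2),
    row.names = c("GeneA", "GeneB", "GeneC"))
write.csv(toy_table, csv_file)

# read.table() returns a data.frame; the first column holds the feature names.
count_df <- read.table(csv_file, header = TRUE, sep = ",", row.names = 1)

# Coerce to a matrix before building the SingleCellExperiment.
count_mat <- as.matrix(count_df)
csv_sce <- SingleCellExperiment(assays = list(counts = count_mat))
dim(csv_sce)
```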
For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package.
This is performed using the readSparseCounts() function and reduces memory usage by not explicitly storing zeroes in memory.
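The sketch below shows what a call to readSparseCounts() might look like on a small tab-delimited file. The file contents are invented, and the chunk size of two rows is chosen only to make the chunked reading visible on such a small table; real data would use a much larger value.

```r
library(scater)

# Write a small hypothetical tab-delimited count table with many zeroes.
tsv_file <- tempfile(fileext = ".tsv")
toy_table <- data.frame(Cell1 = c(0, 3, 0, 0), Cell2 = c(5, 0, 0, 2),
    row.names = paste0("Gene", 1:4))
write.table(toy_table, file = tsv_file, sep = "\t", quote = FALSE, col.names = NA)

# Read the file two rows at a time; the result is a sparse matrix
# from the Matrix package, so zero entries are not stored explicitly.
sparse_counts <- readSparseCounts(tsv_file, chunk = 2L)
class(sparse_counts)
dim(sparse_counts)
```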
Data from 10X Genomics experiments can be read in using the read10xCounts() function from the DropletUtils package.
This will automatically generate a SingleCellExperiment with a sparse matrix; see the documentation for more details.
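Since read10xCounts() expects a directory in the 10X Genomics output layout, the sketch below first creates a mock directory with write10xCounts() (also from DropletUtils, and used here only to fabricate input for the illustration) and then reads it back. The counts, gene identifiers and barcodes are all made up.

```r
library(DropletUtils)
library(Matrix)

# Simulated sparse counts: 100 genes by 10 cells (illustrative values only).
set.seed(100)
mock_counts <- Matrix(matrix(rpois(1000, lambda = 0.5), nrow = 100, ncol = 10),
    sparse = TRUE)
gene_ids <- paste0("ENSG0000", seq_len(nrow(mock_counts)))
gene_symbols <- paste0("GENE", seq_len(nrow(mock_counts)))
barcodes <- paste0("BARCODE-", seq_len(ncol(mock_counts)))

# Write a mock 10X-style directory, then read it back in.
mock_dir <- tempfile()
write10xCounts(mock_dir, mock_counts, barcodes = barcodes,
    gene.id = gene_ids, gene.symbol = gene_symbols)
tenx_sce <- read10xCounts(mock_dir)
tenx_sce
```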
Transcript abundances from the kallisto and Salmon pseudo-aligners can be imported using methods from the tximeta package.
This produces a SummarizedExperiment object that can be coerced into a SingleCellExperiment simply with as(se, "SingleCellExperiment").
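The coercion step can be tried without any quantification files. The sketch below builds a small SummarizedExperiment by hand as a stand-in for tximeta output (an assumption made for illustration; real tximeta objects carry additional assays and metadata) and then converts it.

```r
library(SingleCellExperiment)

# Stand-in for imported abundances: 20 transcripts by 3 samples (made-up values).
set.seed(100)
toy_abundance <- matrix(rpois(60, lambda = 5), nrow = 20, ncol = 3,
    dimnames = list(paste0("Tx", 1:20), paste0("Sample", 1:3)))
se <- SummarizedExperiment(assays = list(counts = toy_abundance))

# Coerce to a SingleCellExperiment; assays and dimension names are carried over.
sce <- as(se, "SingleCellExperiment")
class(sce)
```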
scater provides functionality for three levels of quality control (QC):
Cell-level metrics are computed by the perCellQCMetrics() function and include:
sum: total number of counts for the cell (i.e., the library size).
detected: the number of features for the cell that have counts above the detection limit (default of zero).
subsets_X_percent: percentage of all counts that come from the feature control set named X.
library(scater)
per.cell <- perCellQCMetrics(example_sce,
subsets=list(Mito=grep("mt-", rownames(example_sce))))
summary(per.cell$sum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2574 8130 12913 14954 19284 63505
summary(per.cell$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 785 2484 3656 3777 4929 8167
summary(per.cell$subsets_Mito_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.992 6.653 7.956 10.290 56.955
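To make the definitions of these three metrics concrete, the sketch below recomputes them by hand in base R on a tiny invented count matrix. It is intended only to show the arithmetic behind each metric and is not how perCellQCMetrics() is implemented.

```r
# Toy count matrix: 6 features by 3 cells; the "mt-" rows form the subset.
toy_counts <- matrix(c(
    10,  0, 4,
     0,  0, 6,
     5,  2, 0,
     0,  8, 0,
     3,  0, 5,
     2, 10, 5),
    nrow = 6, byrow = TRUE,
    dimnames = list(c("GeneA", "GeneB", "GeneC", "GeneD", "mt-Co1", "mt-Nd1"),
                    c("Cell1", "Cell2", "Cell3")))
is_mito <- grepl("^mt-", rownames(toy_counts))

# 'sum': total count per cell (the library size).
total <- colSums(toy_counts)

# 'detected': number of features with a count above zero.
detected <- colSums(toy_counts > 0)

# 'subsets_Mito_percent': share of each cell's counts from the subset, in percent.
mito_pct <- colSums(toy_counts[is_mito, , drop = FALSE]) / total * 100

data.frame(sum = total, detected = detected, subsets_Mito_percent = mito_pct)
```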
It is often convenient to store this in the colData() of our SingleCellExperiment object for future reference.
(In fact, the addPerCellQC() function will do this automatically.)
colData(example_sce) <- cbind(colData(example_sce), per.cell)
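As an alternative to the manual cbind() step, addPerCellQC() computes the metrics and appends them in one call. The sketch below uses a small simulated object (toy_sce, with five invented "mt-" features) rather than example_sce, so that it does not add a second copy of the columns already stored above.

```r
library(scater)

# Small simulated object; the last five features mimic mitochondrial genes.
set.seed(100)
toy_counts <- matrix(rpois(500, lambda = 5), nrow = 50, ncol = 10,
    dimnames = list(c(paste0("Gene", 1:45), paste0("mt-", 1:5)),
                    paste0("Cell", 1:10)))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Compute the per-cell metrics and store them in colData() in one step.
toy_sce <- addPerCellQC(toy_sce,
    subsets = list(Mito = grep("^mt-", rownames(toy_sce))))
colnames(colData(toy_sce))
```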
Metadata variables can be plotted against each other using the plotColData() function, as shown below.
We expect to see an increasing number of detected genes with increasing total count.
Each point represents a cell that is coloured according to its tissue of origin.
plotColData(example_sce, x = "sum", y="detected", colour_by="tissue")