To install the package, please use the following.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("miQC")
This vignette provides a basic example of how to run miQC, which allows users to perform cell-wise filtering of single-cell RNA-seq data for quality control. Single-cell RNA-seq data is very sensitive to tissue quality and choice of experimental workflow; it’s critical to ensure compromised cells and failed cell libraries are removed. A high proportion of reads mapping to mitochondrial DNA is one sign of a damaged cell, so most analyses will remove cells with mtRNA over a certain threshold, but those thresholds can be arbitrary and/or detrimentally stringent, especially for archived tumor tissues. miQC jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to identify the low-quality cells in a given dataset.
For more information about the statistical background of miQC and for biological use cases, consult the miQC paper (Hippen et al. 2021).
You’ll need the following packages installed to run this tutorial:
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(scRNAseq)
library(scater)
library(flexmix)
library(splines)
library(miQC)
})
To demonstrate how to run miQC on a single-cell RNA-seq dataset, we’ll use data from mouse brain cells, originating from an experiment by Zeisel et al (Zeisel et al. 2015), and available through the Bioconductor package scRNAseq.
sce <- ZeiselBrainData()
sce
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC repeat
In order to calculate the percent of reads in a cell that map to mitochondrial genes, we first need to establish which genes are mitochondrial. For genes listed as HGNC symbols, this is as simple as searching for genes starting with mt-. For other IDs, we recommend using a biomaRt query to map to chromosomal location and identify all mitochondrial genes.
mt_genes <- grepl("^mt-", rownames(sce))
feature_ctrls <- list(mito = rownames(sce)[mt_genes])
feature_ctrls
## $mito
## [1] "mt-Tl1" "mt-Tn" "mt-Tr" "mt-Tq" "mt-Ty" "mt-Tk" "mt-Ta"
## [8] "mt-Tf" "mt-Tp" "mt-Tc" "mt-Td" "mt-Tl2" "mt-Te" "mt-Tv"
## [15] "mt-Tg" "mt-Tt" "mt-Tw" "mt-Tm" "mt-Ti" "mt-Nd3" "mt-Nd6"
## [22] "mt-Nd4" "mt-Atp6" "mt-Nd2" "mt-Nd5" "mt-Nd1" "mt-Co3" "mt-Cytb"
## [29] "mt-Atp8" "mt-Co2" "mt-Co1" "mt-Rnr2" "mt-Rnr1" "mt-Nd4l"
miQC is designed to be run with the Bioconductor package scater, which has a built-in function addPerCellQC to calculate basic QC metrics like number of unique genes detected per cell and total number of reads. When we pass in our list of mitochondrial genes, it will also calculate percent mitochondrial reads.
sce <- addPerCellQC(sce, subsets = feature_ctrls)
head(colData(sce))
## DataFrame with 6 rows and 22 columns
## tissue group # total mRNA mol well sex
## <character> <numeric> <numeric> <numeric> <numeric>
## 1772071015_C02 sscortex 1 1221 3 3
## 1772071017_G12 sscortex 1 1231 95 1
## 1772071017_A05 sscortex 1 1652 27 1
## 1772071014_B06 sscortex 1 1696 37 3
## 1772067065_H06 sscortex 1 1219 43 3
## 1772071017_E02 sscortex 1 1378 5 1
## age diameter cell_id level1class level2class
## <numeric> <numeric> <character> <character> <character>
## 1772071015_C02 2 1 1772071015_C02 interneurons Int10
## 1772071017_G12 1 353 1772071017_G12 interneurons Int10
## 1772071017_A05 1 13 1772071017_A05 interneurons Int6
## 1772071014_B06 2 19 1772071014_B06 interneurons Int10
## 1772067065_H06 6 12 1772067065_H06 interneurons Int9
## 1772071017_E02 1 21 1772071017_E02 interneurons Int9
## sum detected subsets_mito_sum subsets_mito_detected
## <numeric> <numeric> <numeric> <numeric>
## 1772071015_C02 22354 4871 774 23
## 1772071017_G12 22869 4712 1121 27
## 1772071017_A05 32594 6055 952 27
## 1772071014_B06 33525 5852 611 28
## 1772067065_H06 21694 4724 164 23
## 1772071017_E02 25919 5427 1122 19
## subsets_mito_percent altexps_ERCC_sum altexps_ERCC_detected
## <numeric> <numeric> <numeric>
## 1772071015_C02 3.462468 6706 43
## 1772071017_G12 4.901832 6435 46
## 1772071017_A05 2.920783 6335 47
## 1772071014_B06 1.822521 7032 43
## 1772067065_H06 0.755969 5931 39
## 1772071017_E02 4.328871 6671 43
## altexps_ERCC_percent altexps_repeat_sum altexps_repeat_detected
## <numeric> <numeric> <numeric>
## 1772071015_C02 18.0070 8181 419
## 1772071017_G12 15.6349 11854 480
## 1772071017_A05 11.1238 18021 582
## 1772071014_B06 12.8999 13955 512
## 1772067065_H06 17.1908 6876 363
## 1772071017_E02 13.3543 17364 618
## altexps_repeat_percent total
## <numeric> <numeric>
## 1772071015_C02 21.9677 37241
## 1772071017_G12 28.8012 41158
## 1772071017_A05 31.6435 56950
## 1772071014_B06 25.5999 54512
## 1772067065_H06 19.9299 34501
## 1772071017_E02 34.7600 49954
Using the QC metrics generated via scater, we can use the plotMetrics function to visually inspect the quality of our dataset.
plotMetrics(sce)