augere.screen
implements an augere
pipeline to detect differentially abundant barcodes in a functional
screen. Given a matrix of barcode-by-sample counts, this uses
voom() from the limma
package to test for differences between conditions. It then consolidates
barcode-level statistics into per-gene inferences with a variety of
statistical methods. The pipeline also creates a self-contained
Rmarkdown report that documents each step in the analysis.
To install this package, follow the usual instructions for Bioconductor:
Now let’s mock up a dataset from a functional screen. This contains a count matrix of barcode abundances, where each row is a barcode and each column is a sample. Each barcode is associated to a target gene via a 1:many mapping. Some barcodes are also negative controls, i.e., they do not target any gene.
make_mock_dataset <- function() {
set.seed(100)
mu <- 2^runif(1000, 3, 6)
grouping <- rep(c("A", "B", "ctrl"), each=3)
counts <- matrix(rnbinom(length(mu)* length(grouping), mu=mu, size=20), ncol=length(grouping))
rownames(counts) <- sprintf("BARCODE-%s", seq_along(mu))
library(SummarizedExperiment)
se <- SummarizedExperiment(list(counts=counts), colData=DataFrame(group=grouping))
rowData(se)$type <- "other"
rowData(se)$gene <- paste0("GENE-", sample(LETTERS, length(mu), replace=TRUE))
rowData(se)$type[1:50] <- "NEG"
rowData(se)$gene[1:50] <- "NEG"
se
}
se <- make_mock_dataset()
se## class: SummarizedExperiment
## dim: 1000 9
## metadata(0):
## assays(1): counts
## rownames(1000): BARCODE-1 BARCODE-2 ... BARCODE-999 BARCODE-1000
## rowData names(2): type gene
## colnames: NULL
## colData names(1): group
The runScreen() function uses voom() to
test for differential abundance across a comparison of interest. Let’s
say we want to compare A versus B:
library(augere.screen)
output.dir <- tempfile()
res <- runScreen(
se,
group="group",
comparison=c("A", "B"),
gene.field="gene",
filter.reference.factor="group",
filter.reference.levels="ctrl",
norm.control.type.field="type",
norm.control.types="NEG",
output.dir=output.dir
)## [1] "Increase in `A` over `B`"
## DataFrame with 1000 rows and 7 columns
## gene LogFC AveAb t PValue FDR
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## BARCODE-1 NEG -0.445506 9.14812 -1.09034 0.2756355 0.979159
## BARCODE-2 NEG -0.556312 9.02579 -1.31320 0.1891996 0.979159
## BARCODE-3 NEG 0.593803 9.97191 1.73677 0.0825132 0.979159
## BARCODE-4 NEG NA NA NA NA NA
## BARCODE-5 NEG 0.251991 9.84108 0.72858 0.4663056 0.979159
## ... ... ... ... ... ... ...
## BARCODE-996 GENE-H 0.0682531 10.4678 0.2127503 0.831534 0.979159
## BARCODE-997 GENE-N NA NA NA NA NA
## BARCODE-998 GENE-X NA NA NA NA NA
## BARCODE-999 GENE-D -0.3067536 9.1611 -0.7524626 0.451822 0.979159
## BARCODE-1000 GENE-W 0.0178532 10.4276 0.0555378 0.955713 0.990915
## B
## <numeric>
## BARCODE-1 -4.59061
## BARCODE-2 -4.57672
## BARCODE-3 -4.51731
## BARCODE-4 NA
## BARCODE-5 -4.61479
## ... ...
## BARCODE-996 -4.63948
## BARCODE-997 NA
## BARCODE-998 NA
## BARCODE-999 -4.60827
## BARCODE-1000 -4.64117
## [1] "simes" "fry" "holm-min"
## class: SummarizedExperiment
## dim: 1000 6
## metadata(0):
## assays(2): counts logcounts
## rownames(1000): BARCODE-1 BARCODE-2 ... BARCODE-999 BARCODE-1000
## rowData names(3): type gene retained
## colnames: NULL
## colData names(3): group lib.size norm.factors
The function generates an Rmarkdown report containing (almost) all of the steps required to reproduce the analysis. Let’s have a look at it:
## ---
## title: Differential abundance analysis of functional screens
## author:
## - root
## date: "`r Sys.Date()`"
## output: BiocStyle::html_document
## ---
##
## ```{r, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
##
## # Data loading
##
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a barcode while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## original <- se # making a copy for downstream use.
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
## This ensures that our variance estimates are not affected by irrelevant variability in other groups.
##
## ```{r}
## se <- se[,SummarizedExperiment::colData(se)[,"group"] %in% c("A", "B")]
## ncol(se)
## ```
##
## We extract the information into a `DGEList` object in preparation for analysis.
##
## ```{r}
## y <- edgeR::DGEList(SummarizedExperiment::assay(se, 1))
##
## # Also tracking the original row index from the SummarizedExperiment,
## # so that we can match them up after filtering out low-abundance barcodes.
## y$genes <- data.frame(origin=seq_len(nrow(se)))
## ```
##
## Finally, we build the design matrix.
##
## ```{r}
## model.data <- list()
## cd <- SummarizedExperiment::colData(se)
## model.data$group. <- factor(cd[,"group"])
## design <- model.matrix(~ 0 + group., data=model.data)
The function will also save the results to file inside a
results subdirectory, which can be loaded back into the
session using the readResult() function:
reloaded <- readResult(file.path(output.dir, "results", "barcode-1"))
reloaded$x # the result itself## DataFrame with 1000 rows and 7 columns
## gene LogFC AveAb t PValue FDR
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## BARCODE-1 NEG -0.445506 9.14812 -1.09034 0.2756355 0.979159
## BARCODE-2 NEG -0.556312 9.02579 -1.31320 0.1891996 0.979159
## BARCODE-3 NEG 0.593803 9.97191 1.73677 0.0825132 0.979159
## BARCODE-4 NEG NA NA NA NA NA
## BARCODE-5 NEG 0.251991 9.84108 0.72858 0.4663056 0.979159
## ... ... ... ... ... ... ...
## BARCODE-996 GENE-H 0.0682531 10.4678 0.2127503 0.831534 0.979159
## BARCODE-997 GENE-N NA NA NA NA NA
## BARCODE-998 GENE-X NA NA NA NA NA
## BARCODE-999 GENE-D -0.3067536 9.1611 -0.7524626 0.451822 0.979159
## BARCODE-1000 GENE-W 0.0178532 10.4276 0.0555378 0.955713 0.990915
## B
## <numeric>
## BARCODE-1 -4.59061
## BARCODE-2 -4.57672
## BARCODE-3 -4.51731
## BARCODE-4 NA
## BARCODE-5 -4.61479
## ... ...
## BARCODE-996 -4.63948
## BARCODE-997 NA
## BARCODE-998 NA
## BARCODE-999 -4.60827
## BARCODE-1000 -4.64117
## List of 4
## $ title : chr "Increase in `A` over `B` (barcode)"
## $ authors :List of 1
## ..$ : chr "root"
## $ type : chr "differential_barcode_abundance"
## $ differential_barcode_abundance:List of 3
## ..$ contrast :List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "A"
## .. ..$ right:List of 1
## .. .. ..$ : chr "B"
## ..$ normalization: chr "TMM on controls (NEG)"
## ..$ method : chr "voom-limma"
As one might expect, the group and
comparisons arguments specify to the grouping factor in
colData(se) and the comparison(s) to be performed,
respectively. Many of the arguments to runScreen() are
shared with those in the runVoom() function in the augere.de
package for differential expression, so readers are referred to the
latter’s vignette for more details on the parametrization of the design
and contrasts.
The gene.field argument specifies the column of
rowData(se) that defines the gene associated with each
barcode. This is used to consolidate the barcode-level statistics into
per-gene results. Doing so is usually desirable as genes are the
features of scientific interest, not barcodes. If this argument is not
supplied, no consolidation is performed.
The filter.reference.factor and
filter.reference.levels arguments specify the reference
samples in the dataset. These are used to quantify the abundances in the
original barcode library. We then remove barcodes with unusually low
abundances in the references, e.g., due to defects in library
manufacture. If no reference samples are available,
runScreen() will still use the default filtering from
edgeR to
remove low-abundance barcodes.
The norm.control.type.field and
norm.control.types arguments specify the control barcodes
that should not change in abundance during the experiment. Thus, any
differences in abundance are technical and can be used to quantify the
scaling bias for normalization. In this case, our negative controls are
marked as targeting NEGs, i.e., non-essential genes, though
one could also use non-targeting controls (NTC). If no control barcodes
are present, runScreen() will fall back to TMM
normalization on all barcodes, assuming that the majority of barcodes do
not change between conditions.
When consolidating barcode-level information into gene-level results,
the joint null hypothesis for each gene is that none of its barcodes are
differentially abundant. Different methods are used by
runScreen() to test this joint null, each of which has
different statistical properties.
The simplest approach involves applying Simes’ method to the per-barcode \(p\)-values. This is the most sensitive approach as it can reject the joint null if a small number of barcodes (possibly just 1) are differentially abundant. It is most suited for CRISPRa where few guides are expected to succeed for each gene, and is also useful for studying sporadic off-target effects in CRISPR knockout or CRISPRi.
## [1] "Increase in `A` over `B`"
## DataFrame with 27 rows and 8 columns
## NumBarcodes PValue FDR Direction NumUp NumDown
## <integer> <numeric> <numeric> <character> <integer> <integer>
## GENE-A 31 1.368298 1.000000 up 0 0
## GENE-B 28 0.165247 0.727021 down 0 0
## GENE-C 31 1.062718 1.000000 down 0 0
## GENE-D 33 0.190504 0.727021 up 0 0
## GENE-E 22 0.323121 0.727021 down 0 0
## ... ... ... ... ... ... ...
## GENE-W 31 0.4680670 0.843643 down 0 0
## GENE-X 24 0.0664626 0.727021 up 0 0
## GENE-Y 31 0.2946920 0.727021 down 0 0
## GENE-Z 39 0.0475706 0.727021 up 0 0
## NEG 49 1.0107865 1.000000 up 0 0
## LogFC AveAb
## <numeric> <numeric>
## GENE-A -0.508967 10.96911
## GENE-B -0.925260 10.21210
## GENE-C -1.022414 8.35278
## GENE-D 0.978704 9.89606
## GENE-E -0.969849 9.31064
## ... ... ...
## GENE-W -1.049680 8.98693
## GENE-X 1.382716 8.76149
## GENE-Y -0.954461 9.23219
## GENE-Z 0.999664 10.87339
## NEG 0.648934 10.50059
A more stringent approach uses an adaptation of the Holm method on the per-barcode \(p\)-values. This only rejects the joint null if a user-specified minimum number/proportion of barcodes for a gene are differentially abundant. It is suitable for all CRISPR modalities but tends to be more conservative than the other methods.
## DataFrame with 27 rows and 8 columns
## NumBarcodes PValue FDR Direction NumUp NumDown
## <integer> <numeric> <numeric> <character> <integer> <integer>
## GENE-A 31 1 1 down 0 0
## GENE-B 28 1 1 down 0 0
## GENE-C 31 1 1 down 0 0
## GENE-D 33 1 1 down 0 0
## GENE-E 22 1 1 down 0 0
## ... ... ... ... ... ... ...
## GENE-W 31 1 1 down 0 0
## GENE-X 24 1 1 down 0 0
## GENE-Y 31 1 1 down 0 0
## GENE-Z 39 1 1 down 0 0
## NEG 49 1 1 down 0 0
## LogFC AveAb
## <numeric> <numeric>
## GENE-A -0.508967 10.96911
## GENE-B -0.925260 10.21210
## GENE-C -1.022414 8.35278
## GENE-D 0.978704 9.89606
## GENE-E -0.969849 9.31064
## ... ... ...
## GENE-W -1.049680 8.98693
## GENE-X 1.382716 8.76149
## GENE-Y -0.954461 9.23219
## GENE-Z 0.999664 10.87339
## NEG 0.648934 10.50059
Finally, we can use the fry() method from limma to
perform a “barcode set test” on the barcode abundances for each gene.
This favors consistent differences across the majority of barcodes for a
given gene. It is most suited to CRISPR knockout or CRISPRi screens.
## DataFrame with 27 rows and 10 columns
## NumBarcodes Direction PValue FDR PValue.Mixed FDR.Mixed
## <integer> <character> <numeric> <numeric> <numeric> <numeric>
## GENE-A 31 up 0.809955 0.949216 0.960329 0.966116
## GENE-B 28 up 0.901654 0.949216 0.583051 0.874576
## GENE-C 31 down 0.936767 0.949216 0.704725 0.921348
## GENE-D 33 up 0.420938 0.947110 0.559190 0.874576
## GENE-E 22 down 0.847173 0.949216 0.430066 0.874576
## ... ... ... ... ... ... ...
## GENE-W 31 up 0.917329 0.949216 0.442895 0.874576
## GENE-X 24 up 0.140628 0.931903 0.100161 0.874576
## GENE-Y 31 down 0.786387 0.949216 0.227278 0.874576
## GENE-Z 39 up 0.295041 0.931903 0.146459 0.874576
## NEG 49 up 0.617632 0.949216 0.836563 0.966116
## NumUp NumDown LogFC AveAb
## <integer> <integer> <numeric> <numeric>
## GENE-A 0 0 -0.508967 10.96911
## GENE-B 0 0 -0.925260 10.21210
## GENE-C 0 0 -1.022414 8.35278
## GENE-D 0 0 0.978704 9.89606
## GENE-E 0 0 -0.969849 9.31064
## ... ... ... ... ...
## GENE-W 0 0 -1.049680 8.98693
## GENE-X 0 0 1.382716 8.76149
## GENE-Y 0 0 -0.954461 9.23219
## GENE-Z 0 0 0.999664 10.87339
## NEG 0 0 0.648934 10.50059
For each gene, we also report the number of barcodes changing in each direction, the net direction of change, and the average abundance and log-fold change of the most significant barcode. This can be helpful for interpreting the gene’s consolidated statistics.
We can set save.results=FALSE to instruct
runEdgeR() to not write results to disk. This is more
efficient if we only need the results in the current R session.
Setting dry.run=TRUE will instruct
runEdgeR() to only generate the Rmarkdown report but not
evaluate it. This can be helpful for generating a draft report for
further refinement.
Careful readers might notice that the Rmarkdown report generated by
runScreen() is not quite self-contained as the pipeline
doesn’t know how se was constructed. In the absence of this
information, the report contains a placeholder stop() to
remind the user to insert the relevant commands:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a barcode while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## original <- se # making a copy for downstream use.
## se
## ```
If we know how the object was generated, we can pass this information
to runScreen():
output.dry.dir <- tempfile()
runScreen(
wrapInput(se, commands=as.character(body(make_mock_dataset))[-1]),
group="group",
comparison=c("A", "B"),
gene.field="gene",
filter.reference.factor="group",
filter.reference.levels="ctrl",
norm.control.type.field="type",
norm.control.types="NEG",
output.dir=output.dry.dir,
dry.run=TRUE
)## NULL
This instructs runScreen() to insert the relevant
commands into the report for full reproducibility:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a barcode while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## set.seed(100)
## mu <- 2^runif(1000, 3, 6)
## grouping <- rep(c("A", "B", "ctrl"), each = 3)
## counts <- matrix(rnbinom(length(mu) * length(grouping), mu = mu, size = 20), ncol = length(grouping))
## rownames(counts) <- sprintf("BARCODE-%s", seq_along(mu))
## library(SummarizedExperiment)
We can also attach rowData() columns to the result
tables or add custom fields to the result metadata:
output.custom.dir <- tempfile()
res <- runScreen(
se,
group="group",
comparison=c("A", "B"),
gene.field="gene",
filter.reference.factor="group",
filter.reference.levels="ctrl",
norm.control.type.field="type",
norm.control.types="NEG",
row.data=c("gene", "type"),
gene.data="type",
metadata=list(
project="my_phd_project",
collaborators=list("anne", "bob", "charlie"),
data_source="https://foo.bar.com"
),
output.dir=output.custom.dir
)## DataFrame with 1000 rows and 8 columns
## gene type LogFC AveAb t PValue
## <character> <character> <numeric> <numeric> <numeric> <numeric>
## BARCODE-1 NEG NEG -0.445506 9.14812 -1.09034 0.2756355
## BARCODE-2 NEG NEG -0.556312 9.02579 -1.31320 0.1891996
## BARCODE-3 NEG NEG 0.593803 9.97191 1.73677 0.0825132
## BARCODE-4 NEG NEG NA NA NA NA
## BARCODE-5 NEG NEG 0.251991 9.84108 0.72858 0.4663056
## ... ... ... ... ... ... ...
## BARCODE-996 GENE-H other 0.0682531 10.4678 0.2127503 0.831534
## BARCODE-997 GENE-N other NA NA NA NA
## BARCODE-998 GENE-X other NA NA NA NA
## BARCODE-999 GENE-D other -0.3067536 9.1611 -0.7524626 0.451822
## BARCODE-1000 GENE-W other 0.0178532 10.4276 0.0555378 0.955713
## FDR B
## <numeric> <numeric>
## BARCODE-1 0.979159 -4.59061
## BARCODE-2 0.979159 -4.57672
## BARCODE-3 0.979159 -4.51731
## BARCODE-4 NA NA
## BARCODE-5 0.979159 -4.61479
## ... ... ...
## BARCODE-996 0.979159 -4.63948
## BARCODE-997 NA NA
## BARCODE-998 NA NA
## BARCODE-999 0.979159 -4.60827
## BARCODE-1000 0.990915 -4.64117
## DataFrame with 27 rows and 9 columns
## type NumBarcodes PValue FDR Direction NumUp
## <character> <integer> <numeric> <numeric> <character> <integer>
## GENE-A other 31 1.368298 1.000000 up 0
## GENE-B other 28 0.165247 0.727021 down 0
## GENE-C other 31 1.062718 1.000000 down 0
## GENE-D other 33 0.190504 0.727021 up 0
## GENE-E other 22 0.323121 0.727021 down 0
## ... ... ... ... ... ... ...
## GENE-W other 31 0.4680670 0.843643 down 0
## GENE-X other 24 0.0664626 0.727021 up 0
## GENE-Y other 31 0.2946920 0.727021 down 0
## GENE-Z other 39 0.0475706 0.727021 up 0
## NEG NEG 49 1.0107865 1.000000 up 0
## NumDown LogFC AveAb
## <integer> <numeric> <numeric>
## GENE-A 0 -0.508967 10.96911
## GENE-B 0 -0.925260 10.21210
## GENE-C 0 -1.022414 8.35278
## GENE-D 0 0.978704 9.89606
## GENE-E 0 -0.969849 9.31064
## ... ... ... ...
## GENE-W 0 -1.049680 8.98693
## GENE-X 0 1.382716 8.76149
## GENE-Y 0 -0.954461 9.23219
## GENE-Z 0 0.999664 10.87339
## NEG 0 0.648934 10.50059
# The custom metadata is also saved to disk.
str(readResult(file.path(output.custom.dir, "results", "barcode-1"))$metadata)## List of 7
## $ project : chr "my_phd_project"
## $ collaborators :List of 3
## ..$ : chr "anne"
## ..$ : chr "bob"
## ..$ : chr "charlie"
## $ data_source : chr "https://foo.bar.com"
## $ title : chr "Increase in `A` over `B` (barcode)"
## $ authors :List of 1
## ..$ : chr "root"
## $ type : chr "differential_barcode_abundance"
## $ differential_barcode_abundance:List of 3
## ..$ contrast :List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "A"
## .. ..$ right:List of 1
## .. .. ..$ : chr "B"
## ..$ normalization: chr "TMM on controls (NEG)"
## ..$ method : chr "voom-limma"
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] limma_3.69.0 augere.screen_0.99.2
## [3] SummarizedExperiment_1.43.0 Biobase_2.73.1
## [5] GenomicRanges_1.65.0 Seqinfo_1.3.0
## [7] IRanges_2.47.0 S4Vectors_0.51.1
## [9] BiocGenerics_0.59.1 generics_0.1.4
## [11] MatrixGenerics_1.25.0 matrixStats_1.5.0
## [13] BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 SparseArray_1.13.2 lattice_0.22-9
## [4] h5mread_1.5.0 alabaster.se_1.13.0 digest_0.6.39
## [7] evaluate_1.0.5 grid_4.6.0 augere.de_0.99.2
## [10] augere.core_0.99.3 fastmap_1.2.0 jsonlite_2.0.0
## [13] Matrix_1.7-5 alabaster.schemas_1.13.0 BiocManager_1.30.27
## [16] HDF5Array_1.41.0 jquerylib_0.1.4 abind_1.4-8
## [19] cli_3.6.6 rlang_1.2.0 XVector_0.53.0
## [22] cachem_1.1.0 DelayedArray_0.39.1 yaml_2.3.12
## [25] metapod_1.21.0 S4Arrays_1.13.0 tools_4.6.0
## [28] Rhdf5lib_2.1.0 locfit_1.5-9.12 alabaster.ranges_1.13.0
## [31] alabaster.matrix_1.13.0 alabaster.base_1.13.0 buildtools_1.0.0
## [34] R6_2.6.1 lifecycle_1.0.5 rhdf5_2.57.0
## [37] edgeR_4.11.0 bslib_0.10.0 Rcpp_1.1.1-1.1
## [40] statmod_1.5.1 xfun_0.57 sys_3.4.3
## [43] knitr_1.51 rhdf5filters_1.25.0 htmltools_0.5.9
## [46] rmarkdown_2.31 maketools_1.3.2 compiler_4.6.0