augere.de
implements augere
pipelines for differential expression (DE) analysis of RNA-seq data
based on edgeR and
voom(). Each analysis pipeline creates a self-contained
Rmarkdown report that documents each step in the DE analysis, after
which it compiles the report and returns the analysis results to the
user in the current R session. We can then inspect the report to
determine exactly how the analysis was performed. We can also modify the
report to customize the analysis beyond the options provided by the
pipeline function.
To install this package, follow the usual instructions for Bioconductor:
Let’s demonstrate on a dataset derived from the airway package. This has two groups (treated and untreated) so the parametrization of the analysis is fairly simple:
library(augere.de)
se <- loadExampleDataset()
# Testing for DE between 'trt' and 'untrt' in the 'dex' grouping factor.
output.dir <- tempfile()
res <- runEdgeR(se, group="dex", comparisons=c("trt", "untrt"), output.dir=output.dir)## [1] "Increase in `trt` over `untrt`"
## DataFrame with 63677 rows and 5 columns
## LogFC AveExpr F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3862784 5.05487 6.7849013 0.0287436 0.130627
## ENSG00000000005 NA NA NA NA NA
## ENSG00000000419 0.1967653 4.60947 6.0508824 0.0364214 0.150396
## ENSG00000000457 0.0254907 3.48214 0.0637723 0.8063598 0.911506
## ENSG00000000460 -0.1225528 1.48030 0.2075723 0.6595808 0.827011
## ... ... ... ... ... ...
## ENSG00000273489 NA NA NA NA NA
## ENSG00000273490 NA NA NA NA NA
## ENSG00000273491 NA NA NA NA NA
## ENSG00000273492 NA NA NA NA NA
## ENSG00000273493 NA NA NA NA NA
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(0):
## assays(2): counts logCPM
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(11): gene_id gene_name ... symbol retained
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(11): SampleName cell ... lib.size norm.factors
As mentioned, runEdgeR() generates an Rmarkdown report
containing (almost) all of the steps required to reproduce the analysis.
Let’s have a look at it:
## ---
## title: Differential expression with edgeR
## author:
## - root
## date: "`r Sys.Date()`"
## output: BiocStyle::html_document
## ---
##
## ```{r preamble, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
##
## # Data loading
##
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
## This ensures that our variance estimates are not affected by irrelevant variability in other groups.
##
## ```{r}
## se <- se[,SummarizedExperiment::colData(se)[,"dex"] %in% c("trt", "untrt")]
## ncol(se)
## ```
##
## We extract the information into a `DGEList` object in preparation for analysis.
##
## ```{r}
## y <- edgeR::DGEList(SummarizedExperiment::assay(se, 1))
##
## # Also tracking the original row index from the SummarizedExperiment,
## # so that we can match them up after filtering out low-abundance genes.
## y$genes <- data.frame(origin=seq_len(nrow(se)))
## ```
##
## Finally, we build the design matrix.
##
## ```{r}
## model.data <- list()
## cd <- SummarizedExperiment::colData(se)
## model.data$group. <- factor(cd[,"dex"])
## design <- model.matrix(~ 0 + group., data=model.data)
## ```
The function will also save the results to file inside a
results subdirectory, which can be loaded back into the
session using the readResult() function:
## DataFrame with 63677 rows and 5 columns
## LogFC AveExpr F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3862784 5.05487 6.7849013 0.0287436 0.130627
## ENSG00000000005 NA NA NA NA NA
## ENSG00000000419 0.1967653 4.60947 6.0508824 0.0364214 0.150396
## ENSG00000000457 0.0254907 3.48214 0.0637723 0.8063598 0.911506
## ENSG00000000460 -0.1225528 1.48030 0.2075723 0.6595808 0.827011
## ... ... ... ... ... ...
## ENSG00000273489 NA NA NA NA NA
## ENSG00000273490 NA NA NA NA NA
## ENSG00000273491 NA NA NA NA NA
## ENSG00000273492 NA NA NA NA NA
## ENSG00000273493 NA NA NA NA NA
## List of 4
## $ title : chr "Increase in `trt` over `untrt`"
## $ authors :List of 1
## ..$ : chr "root"
## $ type : chr "differential_gene_expression"
## $ differential_gene_expression:List of 2
## ..$ contrast:List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "trt"
## .. ..$ right:List of 1
## .. .. ..$ : chr "untrt"
## ..$ method : chr "edgeR QL"
runVoom() does the same but for voom()
instead of edgeR. For
simplicity, we will focus on runEdgeR() in the rest of this
vignette.
We can block on uninteresting factors of variation - for example, the cell line of origin:
output.block.dir <- tempfile()
res.block <- runEdgeR(
se,
group="dex",
block="cell",
comparisons=c("trt", "untrt"),
output.dir=output.block.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)If continuous covariates are of interest, we can include them in the model and test their effects on expression:
# Making up a continuous covariate.
se$tumor_fraction <- runif(ncol(se))
# 'comparisons' of length 1 indicates that we want to test a single covariate.
output.cov.dir <- tempfile()
res.cov <- runEdgeR(
se,
group="dex",
covariate="tumor_fraction",
comparisons="tumor_fraction",
output.dir=output.cov.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)For more complex cases, users can also directly pass in their own design matrices and contrasts:
We perform an ANOVA-like comparison by providing more than two groups
in comparisons. Here, the null hypothesis is that there is
no DE between any of the cell lines.
output.anova.dir <- tempfile()
res.anova <- runEdgeR(
se,
group="cell",
comparisons=c("N052611", "N061011", "N080611", "N61311"),
output.dir=output.anova.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)We test for differences between the averages of two sets of groups by
passing a named character vector to comparisons=. Each
unique name defines a set of groups, e.g., resistant versus
sensitive.
output.sets.dir <- tempfile()
# Specify a named vector between the 'resistant' and 'sensitive' sets of cell lines.
res.sets <- runEdgeR(
se,
group="cell",
comparisons=c(resistant="N052611", resistant="N061011", sensitive="N080611", sensitive="N61311"),
output.dir=output.sets.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)We perform multiple comparisons in a single call by passing a list of
character vectors to comparisons. This can be convenient
but does have some implications for subsetting, see comments below.
output.multiple.dir <- tempfile()
res.multiple <- runEdgeR(
se,
group="cell",
comparisons=list(c("N052611", "N061011"), c("N080611", "N61311")),
output.dir=output.multiple.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
names(res.multiple$results)## [1] "Increase in `N052611` over `N061011`"
## [2] "Increase in `N080611` over `N61311`"
We can test against a log-fold change threshold to focus on genes with larger changes rather than those with small variances. This is better than post hoc filtering of the DE table as it captures the uncertainty of the log-fold change estimates, otherwise, we’d enrich for genes with variable log-fold changes that only pass the threshold by chance.
If we only care about some of the samples, we can subset them by a pre-defined factor. For example, say we only care about a subset of the cell lines:
output.sub.dir <- tempfile()
res.sub <- runEdgeR(
se,
group="dex",
comparisons=c("trt", "untrt"),
subset.factor="cell",
subset.levels=c("N052611", "N080611"),
output.dir=output.lfc.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)By default, even if no subsetting is specified by the user,
runEdgeR() will automatically subset the dataset to only
those groups that are involved in the comparisons. This ensures that the
variance/dispersion estimates are not affected by any unusual behavior
in other (irrelevant) groups. So, for example, the
runEdgeR() call below would only consider samples in the
N052611 and N061011 groups:
output.sub.group.dir <- tempfile()
res.sub.group <- runEdgeR(
se,
group="cell",
comparisons=c("N052611", "N061011"),
output.dir=output.sub.group.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)The subsetting is only done once per runEdgeR() call, so
if multiple comparisons are specified, the subdataset will
retain groups from all comparisons. This is occasionally helpful for
meta-analyses as it ensures that the dispersion estimates are consistent
across comparisons. Most of the time, though, it is better to call
runEdgeR() separately for each comparison:
comparisons <- list(c("N052611", "N061011"), c("N080611", "N61311"))
# Multiple comparisons in one call:
combined.results <- runEdgeR(
se,
group="cell",
comparisons=comparisons,
output.dir=tempfile(),
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
# Or, repeated calls with different comparisons, which is usually the safer
# approach unless you know better.
separate.results <- list()
for (i in seq_along(comparisons)) {
separate.results[[i]] <- runEdgeR(
se,
group="cell",
comparisons=comparisons[[i]],
output.dir=tempfile(),
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
}As shown above, we can set save.results=FALSE to
instruct runEdgeR() to not write results to disk. This is
more efficient if we only need the results in the current R session.
Setting dry.run=TRUE will instruct
runEdgeR() to only generate the Rmarkdown report but not
evaluate it. This can be helpful for generating a draft report for
further refinement.
Careful readers will note that the auto-generated report is not quite
self-contained as runEdgeR() doesn’t know how
se was constructed. In the absence of this information, the
report contains a placeholder stop() to remind the user to
insert the relevant commands:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
If we know how the object was generated, we can pass this information
to runEdgeR():
output.dry.dir <- tempfile()
runEdgeR(
wrapInput(se, commands="augere.de::loadExampleDataset()"),
group="dex",
comparisons=c("trt", "untrt"),
output.dir=output.dry.dir,
dry.run=TRUE
)## NULL
This instructs runEdgeR() to insert the relevant
commands into the report for full reproducibility:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## augere.de::loadExampleDataset()
## })
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
We can also attach rowData() columns to the result
tables or add custom fields to the result metadata:
output.custom.dir <- tempfile()
res <- runEdgeR(
se,
group="dex",
comparisons=c("trt", "untrt"),
row.data=c("symbol", "gene_biotype"),
metadata=list(
project="my_phd_project",
collaborators=list("anne", "bob", "charlie"),
data_source="https://foo.bar.com"
),
output.dir=output.custom.dir
)## DataFrame with 63677 rows and 7 columns
## symbol gene_biotype LogFC AveExpr F
## <character> <character> <numeric> <numeric> <numeric>
## ENSG00000000003 TSPAN6 protein_coding -0.3862784 5.05487 6.7849013
## ENSG00000000005 TNMD protein_coding NA NA NA
## ENSG00000000419 DPM1 protein_coding 0.1967653 4.60947 6.0508824
## ENSG00000000457 SCYL3 protein_coding 0.0254907 3.48214 0.0637723
## ENSG00000000460 C1orf112 protein_coding -0.1225528 1.48030 0.2075723
## ... ... ... ... ... ...
## ENSG00000273489 RP11-180C16.1 antisense NA NA NA
## ENSG00000273490 TSEN34 protein_coding NA NA NA
## ENSG00000273491 RP11-138A9.2 lincRNA NA NA NA
## ENSG00000273492 AP000230.1 lincRNA NA NA NA
## ENSG00000273493 RP11-80H18.4 lincRNA NA NA NA
## PValue FDR
## <numeric> <numeric>
## ENSG00000000003 0.0287436 0.130627
## ENSG00000000005 NA NA
## ENSG00000000419 0.0364214 0.150396
## ENSG00000000457 0.8063598 0.911506
## ENSG00000000460 0.6595808 0.827011
## ... ... ...
## ENSG00000273489 NA NA
## ENSG00000273490 NA NA
## ENSG00000273491 NA NA
## ENSG00000273492 NA NA
## ENSG00000273493 NA NA
# The custom metadata is also saved to disk.
str(readResult(file.path(output.custom.dir, "results", "de-1"))$metadata)## List of 7
## $ project : chr "my_phd_project"
## $ collaborators :List of 3
## ..$ : chr "anne"
## ..$ : chr "bob"
## ..$ : chr "charlie"
## $ data_source : chr "https://foo.bar.com"
## $ title : chr "Increase in `trt` over `untrt`"
## $ authors :List of 1
## ..$ : chr "root"
## $ type : chr "differential_gene_expression"
## $ differential_gene_expression:List of 2
## ..$ contrast:List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "trt"
## .. ..$ right:List of 1
## .. .. ..$ : chr "untrt"
## ..$ method : chr "edgeR QL"
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] augere.de_0.99.2 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4
## [3] SparseArray_1.13.2 lattice_0.22-9
## [5] h5mread_1.5.0 alabaster.se_1.13.0
## [7] digest_0.6.39 evaluate_1.0.5
## [9] grid_4.6.0 augere.core_0.99.3
## [11] fastmap_1.2.0 jsonlite_2.0.0
## [13] Matrix_1.7-5 alabaster.schemas_1.13.0
## [15] limma_3.69.0 BiocManager_1.30.27
## [17] HDF5Array_1.41.0 jquerylib_0.1.4
## [19] abind_1.4-8 cli_3.6.6
## [21] rlang_1.2.0 XVector_0.53.0
## [23] Biobase_2.73.1 cachem_1.1.0
## [25] DelayedArray_0.39.1 yaml_2.3.12
## [27] S4Arrays_1.13.0 tools_4.6.0
## [29] Rhdf5lib_2.1.0 locfit_1.5-9.12
## [31] alabaster.ranges_1.13.0 SummarizedExperiment_1.43.0
## [33] alabaster.matrix_1.13.0 BiocGenerics_0.59.1
## [35] alabaster.base_1.13.0 buildtools_1.0.0
## [37] R6_2.6.1 matrixStats_1.5.0
## [39] stats4_4.6.0 lifecycle_1.0.5
## [41] rhdf5_2.57.0 Seqinfo_1.3.0
## [43] S4Vectors_0.51.1 edgeR_4.11.0
## [45] IRanges_2.47.0 bslib_0.10.0
## [47] Rcpp_1.1.1-1.1 statmod_1.5.1
## [49] xfun_0.57 GenomicRanges_1.65.0
## [51] sys_3.4.3 MatrixGenerics_1.25.0
## [53] knitr_1.51 rhdf5filters_1.25.0
## [55] htmltools_0.5.9 rmarkdown_2.31
## [57] maketools_1.3.2 compiler_4.6.0