Feature: Iterative Benchmarking

Patrick K. Kimes, Alejandro Reyes

27 April 2020

Abstract

“Benchmark comparisons are often performed iteratively, with new methods added to the comparison or existing methods updated with new versions or features. Here, we describe the features currently available in the SummarizedBenchmark package for handling iterative benchmarking using an example case study. SummarizedBenchmark package version: 2.6.0”

Package

SummarizedBenchmark 2.6.0

1 Introduction
- 1.1 Example Case Study
2 Using updateBench
- 2.1 Benchmark Sessions
3 Updating Performance Metrics
References

1 Introduction

Often, benchmarking analyses are performed in multiple steps, iteratively, with methods added or updated based on previous results. Additionally, over the course of a benchmarking study, developers may update methods to improve the implementation or add new features. Ideally, whenever a method is updated or added, the benchmark results should also be updated. Naively, all methods in the benchmark could be re-computed by calling buildBench() on the original BenchDesign whenever the benchmarker notices that a method has been changed. However, not only is this computational inefficient, but requires the benchmarker to manually check if any methods have become outdated.

To make this process easier, the package includes the updateBench() function which can be used to both check if any results in a SummarizedBenchmark object need to be updated, and to update these results if necessary.

1.1 Example Case Study

library("SummarizedBenchmark")
library("magrittr")

To demonstrate the use of updateBench(), we use the same example of comparing methods for differential expression used in the SummarizedBenchmark: Full Case Study vignette. The BenchDesign is initialized using RNA-seq count data and ground truth information included with the package. The data is described in more detail in the Full Case Study vignette.

library("limma")
library("edgeR")
library("DESeq2")
library("tximport")

data(soneson2016)

mycounts <- round(txi$counts)
mycoldat <- data.frame(condition = factor(rep(c(1, 2), each = 3)))
rownames(mycoldat) <- colnames(mycounts)
mydat <- list(coldat = mycoldat, cntdat = mycounts,
              status = truthdat$status, lfc = truthdat$logFC)

bd <- BenchDesign(data = mydat)

Three methods for differential expression testing implemented in the DESeq2, edgeR, and limma packages are added to the benchmark. The p-values and log-fold change (LFC) values are stored for each method, as before.

deseq2_run <- function(countData, colData, design, contrast) {
    dds <- DESeqDataSetFromMatrix(countData, colData = colData, design = design)
    dds <- DESeq(dds)
    results(dds, contrast = contrast)
}
deseq2_pv <- function(x) { x$pvalue }
deseq2_lfc <- function(x) { x$log2FoldChange }

edgeR_run <- function(countData, group, design) {
    y <- DGEList(countData, group = group)
    y <- calcNormFactors(y)
    des <- model.matrix(design)
    y <- estimateDisp(y, des)
    fit <- glmFit(y, des)
    glmLRT(fit, coef=2)
}
edgeR_pv <- function(x) { x$table$PValue }
edgeR_lfc <- function(x) { x$table$logFC }

voom_run <- function(countData, group, design) {
    y <- DGEList(countData, group = group)
    y <- calcNormFactors(y)
    des <- model.matrix(design)
    y <- voom(y, des)
    eBayes(lmFit(y, des))
}
voom_pv <- function(x) { x$p.value[, 2] }
voom_lfc <- function(x) { x$coefficients[, 2] }

bd <- bd %>%
    addMethod(label = "deseq2", func = deseq2_run,
              post = list(pv = deseq2_pv, lfc = deseq2_lfc),
              params = rlang::quos(countData = cntdat,
                                   colData = coldat, 
                                   design = ~condition,
                                   contrast = c("condition", "2", "1"))) %>%
    addMethod(label = "edgeR", func = edgeR_run,
              post = list(pv = edgeR_pv, lfc = edgeR_lfc),
              params = rlang::quos(countData = cntdat,
                                   group = coldat$condition,
                                   design = ~coldat$condition)) %>%
    addMethod(label = "voom", func = voom_run,
              post = list(pv = voom_pv, lfc = voom_lfc), 
              params = rlang::quos(countData = cntdat,
                                   group = coldat$condition,
                                   design = ~coldat$condition))

sb <- buildBench(bd, truthCols = c(pv = "status", lfc = "lfc"))

2 Using updateBench

First, the updateBench() can be called with just a single SummarizedBenchmark object to see if any of the methods have become outdated. By default, the function will not actually run any methods.

updateBench(sb = sb)

## Update SummarizedBenchmark (dryrun) ---------------------------------- 
##   benchmark data: unchanged (full data missing)
##     MD5 hash: 3f41eeab0a157bf127e2b7ffb98ab685 
##     names: ??
##   benchmark methods:
##     |              | Need to  |           Outdated            |
##     | Method       | (Re)Run  |  Func Param  Meta  Post  Vers |
##     | deseq2       |       N  |     N     N     N     N     N |
##     | edgeR        |       N  |     N     N     N     N     N |
##     | voom         |       N  |     N     N     N     N     N |

The output is a table showing whether any methods need to be re-evaluated, as well as which components of the methods are now outdated. Here, the headings correspond to:

Func: the main method function,
Param: the method parameters,
Meta: the manually specified method metadata,
Post: the post-processing functions,
Vers: the corresponding package version for the main method.

As can be seen by the Ns under Need to (Re)Run, all results in the SummarizedBenchmark are still valid, and no method needs to be re-run. Also note that the benchmark data is listed as full data missing. This is because our SummarizedBenchmark object only contains a MD5 hash of the original data. (Optionally, the full data could have been stored with the SummarizedBenchmark object by specifying keepData = TRUE in the original buildBench() call above. However, by default this is not done to prevent making duplicating large data sets.) If any of the methods needed to be updated, the original data must be specified to updateBench() using the data = parameter.

BDData(BenchDesign(sb))

## BenchDesign Data (BDData) ---------------------------------- 
##   type: md5hash 
##   MD5 hash: 3f41eeab0a157bf127e2b7ffb98ab685

The second way to call updateBench() is to specify both a SummarizedBenchmark object and a BenchDesign object. Intuitively, this will result in updating the results of the SummarizedBenchmark object to include any new or modified methods in the BenchDesign. Methods between the two objects are matched by the method label, e.g. deseq2. As an example, suppose we modify one of the methods in the BenchDesign object and want to rerun the benchmarking experiment. We first create a modified BenchDesign object, where we modify a single method’s manually specified meta data. Note that this is just a trivial update and normally would not warrant updating the results.

bd2 <- modifyMethod(bd, "deseq2", rlang::quos(bd.meta = list(note = "minor update")))

Next, we pass the object to updateBench() along with the original SummarizedBenchmark object.

updateBench(sb = sb, bd = bd2)

## Update SummarizedBenchmark (dryrun) ---------------------------------- 
##   benchmark data: unchanged
##     MD5 hash: 3f41eeab0a157bf127e2b7ffb98ab685 
##     names: coldat cntdat status lfc 
##   benchmark methods:
##     |              | Need to  |           Outdated            |
##     | Method       | (Re)Run  |  Func Param  Meta  Post  Vers |
##     | deseq2       |       Y  |     N     N     Y     N     N |
##     | edgeR        |       N  |     N     N     N     N     N |
##     | voom         |       N  |     N     N     N     N     N |

Notice that now the Need to (Re)Run and Meta columns are now set to Y for the deseq2 method, indicating that the metadata of the method has become outdated in the SummarizedBenchmark object, and the method now needs to be rerun. Next, suppose we also decide to add a new method to our comparison. Here, we add an alternate version of DESeq2 to the comparison based on adaptive shrinkage (ASH) (Stephens 2017).

deseq2_ashr <- function(countData, colData, design, contrast) {
    dds <- DESeqDataSetFromMatrix(countData,
                                  colData = colData,
                                  design = design)
    dds <- DESeq(dds)
    lfcShrink(dds, contrast = contrast, type = "ashr")
}

bd2 <- addMethod(bd2, "deseq2_ashr", 
                 deseq2_ashr,
                 post = list(pv = deseq2_pv, lfc = deseq2_lfc),
                 params = rlang::quos(countData = cntdat,
                                      colData = coldat, 
                                      design = ~condition,
                                      contrast = c("condition", "2", "1")))

Now we see that bo the modified deseq2 and deseq2_ashr methods are listed as needing to be rerun.

updateBench(sb = sb, bd = bd2)

## Update SummarizedBenchmark (dryrun) ---------------------------------- 
##   benchmark data: unchanged
##     MD5 hash: 3f41eeab0a157bf127e2b7ffb98ab685 
##     names: coldat cntdat status lfc 
##   benchmark methods:
##     |              | Need to  |           Outdated            |
##     | Method       | (Re)Run  |  Func Param  Meta  Post  Vers |
##     | deseq2       |       Y  |     N     N     Y     N     N |
##     | edgeR        |       N  |     N     N     N     N     N |
##     | voom         |       N  |     N     N     N     N     N |
##     | deseq2_ashr  |       Y  |     -     -     -     -     - |

To run the update, we simply modify the call to updateBench() with dryrun = FALSE. When we do this, only the two methods needing to be rerun are evaluated.

sb2 <- updateBench(sb = sb, bd = bd2, dryrun = FALSE)
sb2

## class: SummarizedBenchmark 
## dim: 15677 4 
## metadata(1): sessions
## assays(2): pv lfc
## rownames: NULL
## rowData names(2): pv lfc
## colnames(4): edgeR voom deseq2_ashr deseq2
## colData names(10): func.pkg func.pkg.vers ... session.idx meta.note

We can check the output to see that the results of the deseq2_ashr method have been added along with the updated deseq2 results.

head(assay(sb2, "lfc"))

##            edgeR       voom deseq2_ashr     deseq2
## [1,]   3.6767779  3.5232524  3.40956483  3.7329019
## [2,]  -5.3773940 -5.5543949 -5.32293225 -5.3905346
## [3,] -10.2467488 -8.2533275 -9.20816380 -9.7125453
## [4,]  -0.4547095 -1.0510456 -0.08956300 -0.4706410
## [5,]   3.7024426  3.9084940  3.61844584  3.7048724
## [6,]  -0.1504299 -0.7607029 -0.02921223 -0.1554938

2.1 Benchmark Sessions

Notice that a session index is also stored in the column data of each method. This can be used to map methods quickly to the corresponding entry of the sessions list stored in the object metadata.

colData(sb2)[, "session.idx", drop = FALSE]

## DataFrame with 4 rows and 1 column
##             session.idx
##               <numeric>
## edgeR                 1
## voom                  1
## deseq2_ashr           2
## deseq2                2

As described in the SummarizedBenchmark: Class Details vignette, sessions is a list of session information stored in the metadata of the SummarizedBenchmark object. Each entry of the sessions list includes the names of methods for which the results were generated during that session, the corresponding method results, the run parameters, and the session info. We can check the list of methods for each session and see that it matches the session.idx values above.

lapply(metadata(sb2)$sessions, `[[`, "methods")

## [[1]]
## [1] "edgeR" "voom" 
## 
## [[2]]
## [1] "deseq2_ashr" "deseq2"

As another example, we can also compare the new SummarizedBenchmark object against the original BenchDesign. By default, updateBench() will keep all methods in the SummarizedBenchmark object even if a corresponding method is not found in the new BenchDesign object. To avoid this behavior, we can specify keepAll = FALSE.

updateBench(sb2, bd, keepAll = FALSE)

## Update SummarizedBenchmark (dryrun) ---------------------------------- 
##   benchmark data: unchanged
##     MD5 hash: 3f41eeab0a157bf127e2b7ffb98ab685 
##     names: coldat cntdat status lfc 
##   benchmark methods:
##     |              | Need to  |           Outdated            |
##     | Method       | (Re)Run  |  Func Param  Meta  Post  Vers |
##     | deseq2       |       Y  |     N     N     Y     N     N |
##     | edgeR        |       N  |     N     N     N     N     N |
##     | voom         |       N  |     N     N     N     N     N |
##     | deseq2_ashr  |    Drop  |     -     -     -     -     - |

As expected, the deseq2 method must again be updated to return to match the original method definition, and the newer deseq2_ashr method must be dropped.

3 Updating Performance Metrics

If users update the BenchDesign, it is possible to calculate the performance metrics only for methods that have been added or modified. This feature is useful, for example, when performance metrics are computationally expensive to calculate. In the example below, we calculate the number of rejections at a nominal 10% FDR threshold for the original SummarizedBenchmark object. (Note that we do this rather than directly calculating the performance metrics for the newer SummarizedBenchmark object for illustrative purposes only.)

sb <- addPerformanceMetric(sb, assay = "pv", 
                           evalMetric = "rejections", 
                           evalFunction = function(query, truth) { 
                               sum(p.adjust(query, method = "BH") < 0.1, na.rm = TRUE) 
                           })
sb <- estimatePerformanceMetrics(sb, addColData = TRUE)

Then, we update the BenchDesign and rerun the function estimatePerformanceMetrics() with the parameter rerun = FALSE. Setting this parameter to FALSE will detect which performance metrics have been calculated before and will only recalculate metrics for methods that have been added or modified.

sb2 <- updateBench(sb, bd2, dryrun = FALSE)
estimatePerformanceMetrics(sb2, rerun = FALSE, addColData = TRUE)

## 
## Option rerun is set to `FALSE`:
## Rerunning performance metrics only for the following methods: deseq2_ashr, deseq2

## class: SummarizedBenchmark 
## dim: 15677 4 
## metadata(1): sessions
## assays(2): pv lfc
## rownames: NULL
## rowData names(2): pv lfc
## colnames(4): edgeR voom deseq2_ashr deseq2
## colData names(12): func.pkg func.pkg.vers ... pm.session meta.note

References

Stephens, Matthew. 2017. “False discovery rates: a new deal.” Biostatistics 18 (2):275–94. https://doi.org/10.1093/biostatistics/kxw041.