1 Introduction

This vignette provides an introductory example on how to work with the analysis framework proposed in [1].

The package is named benchdamic, acronym for “BENCHmarking of Differential Abundance detection methods for MICrobial data”. Not only does the package structure allow the users to test a variety of commonly used methods for differential abundance analysis, but it also enables them to set benchmarks including custom methods on their datasets. Performances of each method are evaluated with respect to i) suitability of distributional assumptions, ii) ability to control false discoveries, iii) concordance of the findings, and iv) enrichment of differentially abundant microbial species in specific conditions. Each step of the assessment is flexible when it comes to the choice of differential abundance methods, their parameters, and input data types. Various graphic outputs lead the users to an informed decision when evaluating the most suitable method to use for their data.

2 Installation

The recommended way to install the benchdamic package is

devtools::install_github("mcalgaro93/benchdamic")

Then, let’s load some packages for basic functions and data.

library(benchdamic)
# Data management
library(phyloseq)
library(plyr)
# Graphics
library(ggplot2)
library(cowplot)

3 Goodness of Fit

3.1 Data loading

We consider a homogeneous group of samples (e.g. only samples from a specific experimental condition, phenotype, treatment, body site…).

In this demonstrative example, datasets are downloaded from the HMP16SData Bioconductor package.

data("ps_stool_16S")
ps_stool_16S

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 71 taxa and 32 samples ]
## sample_data() Sample Data:       [ 32 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 71 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 71 tips and 70 internal nodes ]

## 16S HMP data download
library(HMP16SData)
# Extracting V3-V5 16S sequenced regions' count data
ps_stool_16S_raw <- V35() %>%
  subset(select = HMP_BODY_SUBSITE == "Stool" & # Only fecal samples
    RUN_CENTER == "BI" & # Only sequenced at the BI RUN CENTER
    SEX == "Male" & # Only male subject
    VISITNO == 1 & # Only the first visit
    !duplicated(RSID)) %>% # Duplicated SubjectID removal
  as_phyloseq()
# Remove low depth samples
ps_stool_16S_pruned <- prune_samples(sample_sums(ps_stool_16S_raw) >= 10^3, ps_stool_16S_raw)
# Remove features with zero counts
ps_stool_16S_filtered <- filter_taxa(ps_stool_16S_pruned, function(x) {
  sum(x > 0) > 0
}, 1)
# Collapse counts to the genus level
ps_stool_16S <- tax_glom(ps_stool_16S_filtered, taxrank = "GENUS")

3.2 GOF structure

As different methods rely on different statistical distributions to perform DA analysis, we assess the goodness of fit (GOF) of the statistical models underlying each method on a 16S dataset. For each model, we evaluate its ability to correctly estimate the average counts and the proportion of zeroes by taxon.

We consider five distributions: (1) the negative binomial (NB) used in edgeR and DeSeq2 [2,3], (2) the zero-inflated negative binomial (ZINB) used in ZINB-WaVE [4], (3) the truncated Gaussian Hurdle model of MAST [5], (4) the zero-inflated Gaussian (ZIG) mixture model of metagenomeSeq [6], and (5) the Dirichlet-Multinomial (DM) distribution underlying ALDEx2 [7].

The relationships between the functions used in this section are explained by the following diagram:

Goodness of Fit - package structure

3.3 Model estimation

3.3.1 Negative Binomial and Zero-Inflated Negative Binomial Models

For any \(\mu \ge 0\) and \(\theta > 0\), let \(f_{NB}(\cdot;\mu,\theta)\) denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean \(\mu\) and inverse dispersion parameter \(\theta\), namely:\[ f_{NB} = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)}\left(\frac{\theta}{\theta+1} \right)^\theta\left(\frac{\mu}{\mu+\theta} \right)^y, \forall y \in \mathbb{N} \]Note that another parametrization of the NB PMF is in terms of the dispersion parameter \(\psi = \theta^{-1}\) (although \(\theta\) is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is \(\mu\) and its variance is:\[ \sigma^2 = \mu + \frac{\mu^2}{\theta} = \mu+\psi\mu^2 \]In particular, the NB distribution boils down to a Poisson distribution when \(\psi=0 \iff \theta=+ \infty\).

For any \(\pi\in[0,1]\), let \(f_{ZINB}(\cdot;\mu,\theta,\pi)\) be the PMF of the ZINB distribution given by:

\[ f_{ZINB}(\cdot;\mu,\theta,\pi) = \pi\delta_0(y)+(1-\pi)f_{NB}(y;\mu,\theta), \forall y\in\mathbb{N} \]

where \(\delta_0(\cdot)\) is the Dirac function. Here, \(\pi\) can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.

The packages we rely on to estimate these distributions on real count data are edgeR [2] and ZINB-WaVE [4] but we can easily call the benchdamic functions fitNB and fitZINB.

3.3.2 Zero-Inflated Gaussian Model

The raw count for sample j and feature i is denoted by \(c_{ij}\). The zero-inflated model is defined for the continuity-corrected logarithm of the raw count data: \(y_{ij} = log_2(c_{ij}+1)\) as a mixture of a point mass at zero \(I_{0}(y)\) and a count distribution \(f_{count}(y;\mu,\sigma^2) \sim N(\mu,\sigma^2)\). Given mixture parameters \(\pi_j\), we have that the density of the ZIG distribution for feature i, in sample j with \(s_j\) total counts is: \[f_{ZIG}(y_{ij};s_j,\beta,\mu_i,\sigma^2_i) = \pi_j(s_j)\cdot I_{0}(y_{ij})+(1-\pi_j(s_j))\cdot f_{count}(y_{ij};\mu,\sigma^2)\]

The mean model is specified as:\[E(y_{ij})=\pi_{j} + (1-\pi_j)\cdot\left(b_{i0}+\eta_ilog_2\left( \frac{s_j^\hat{l}}{N}+1 \right) \right)\]

In this case, parameter \(b_{i0}\) is the intercept of the model while the term including the logged normalization factor \(log_2\left(\frac{s_j^\hat{l}}{N}+1 \right)\) captures feature-specific normalization factors through parameter \(\eta_i\). In details, \(s_j^\hat{l}\) is the median scaling factor resulted from the Cumulative Sum Scaling (CSS) normalization procedure. \(N\) is a constant fixed by default at 1000 but it should be a number close to the scaling factors to be used as a reference, for this reason a good choice could be the median of the scaling factors. The mixture parameters \(\pi_j(s_j)\) are modeled as a binomial process:

\[log\frac{\pi_j}{1-\pi_j} = \beta_0+\beta_1\cdot log(s_j)\]

The package we rely on to estimate this distribution on real count data is metagenomeSeq [6] but we can easily call the benchdamic function fitZIG.

3.3.3 Truncated Gaussian Hurdle Model

The original field of application of this method was the single-cell RNAseq data, where \(y = log_2(TPM+1)\) expression matrix was modeled as a two-part generalized regression model [5]. In microbiome data that starting point translates to a \(y_{ij} = log_2\left(counts_{ij}\cdot\frac{10^6}{libSize_{j}}+1 \right)\) or a \(log_2\left(counts_{ij}\cdot\frac{ median(libSize)}{libSize_{j}}+1\right)\).

The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.

Given normalized, possibly thresholded, abundance \(y_{ij}\), the rate of presence and the level of abundance for the samples were the taxon is present, are modeled conditionally independent for each gene \(i\). Define the indicator \(z_{ij}\), indicating whether taxon \(i\) is expressed in sample \(j\) (i.e., \(z_{ij} = 0\) if \(y_{ij} = 0\) and \(z_{ij} = 1\) if \(y_{ij} > 0\)). We fit logistic regression models for the discrete variable \(Z\) and a Gaussian linear model for the continuous variable \((Y|Z=1)\) independently, as follows:

\[ logit(Pr(Z_{ij}=1))=X_j\beta_i^D \]

\[ P(Y_{ij}=y|Z_{ij}=1) \sim N(X_j\beta^C_i,\sigma^2_i)\]

The package we rely on to estimate this distribution on real count data is MAST [5] but we can easily call the benchdamic function fitHURDLE.

3.3.4 Dirichlet-Multinomial Mixture Model

The probability mass function of a \(n\) dimensional multinomial sample \(y = (y_1,...,y_n)^T\) with library size \(libSize = \sum_{i=1}^ny_i\) and parameter \(p=(p_1,...,p_n)\) is:

\[ f(y;p)= {libSize\choose y}\prod_{i=1}^np_i^{y_i} \]

The mean-variance structure of the MN model doesn’t allow over-dispersion, which is common in real data. DM distribution models the probability parameter \(p\) in the MN model by a Dirichlet distribution. The probability mass of a n-category count vector \(y\) over \(libSize\) trials under DM with parameter \(\alpha=(\alpha_1,...,\alpha_n)\), \(a_i>0\) and proportion vector \(p \in \Delta_n=\{(p_1,...,p_n):p_i\ge0,\sum_ip_i=1 \}\) is:

\[ f(y|\alpha)={libSize\choose y}\frac{\prod_{i=1}^n(a_i)y_i}{(\sum_i\alpha_i)\cdot libSize} \]

The mean value for the \(i^{th}\) taxon and \(j^{th}\) sample of the count matrix is given by \(libSize_j\cdot \frac{\alpha_{ij}}{\sum_i a_{ij}}\).

The package we rely on to estimate this distribution on real count data is MGLM [8] but we can easily call the benchdamic function fitDM.

3.4 Comparing Estimated and Observed counts

The goodness of fit for several distributions is assessed comparing estimated and observed values. On one side we can compare, for each taxon, the observed mean abundance with the estimated one, obtaining the Mean Difference (MD). On the other side we can compare the proportion of samples which have zero counts for a specific taxon with the estimated probability to observe a zero count, obtaining the Zero Probability Difference (ZPD).

To easily compare estimated and observed mean values the natural logarithm transformation, with the continuity correction (\(log(counts+1)\)), is well suited, indeed it reduces count range making the differences more stable.

Except for fitHURDLE, which performs a CPM transformation on the counts (or the one with the median library size), and fitZIG which models the \(log_2(counts+1)\), the other methods, fitNB, fitZINB, and fitDM, model the \(counts\) directly. For these reasons, fitHURDLE’s output should not be compared directly to the observed \(log(counts+1)\) mean values as for the other methods. Instead, the logarithm of the observed CPM (or the one with the median library size) should be used.

example_HURDLE <- fitHURDLE(
  counts = ps_stool_16S,
  scale = "median"
)
head(example_HURDLE)

##                     Y          Y0
## OTU_97.192   2.152592 0.312760011
## OTU_97.1666        NA 0.967681879
## OTU_97.31213 4.057621 0.094509772
## OTU_97.39094 1.123198 0.749635900
## OTU_97.40451 7.327534 0.001949454
## OTU_97.19587 4.716914 0.063341989

The values above are the fitHURDLE estimated values (NA value in the second row due to the sparsity of that taxon). The internally used function to prepare observed counts is prepareObserved(), specifying the scale parameter if the HURDLE model is considered.

observed_hurdle <- prepareObserved(
  counts = ps_stool_16S, 
  scale = "median")
head(observed_hurdle)

##                       Y      Y0
## OTU_97.192   3.21387169 0.31250
## OTU_97.1666  0.03923904 0.96875
## OTU_97.31213 4.88350981 0.09375
## OTU_97.39094 4.89778959 0.75000
## OTU_97.40451 7.51646781 0.00000
## OTU_97.19587 5.26661203 0.06250

Which are different from the non-scaled observed values:

head(prepareObserved(counts = ps_stool_16S))

##                       Y      Y0
## OTU_97.192   3.34109346 0.31250
## OTU_97.1666  0.03077166 0.96875
## OTU_97.31213 4.77675725 0.09375
## OTU_97.39094 4.44081133 0.75000
## OTU_97.40451 7.53824550 0.00000
## OTU_97.19587 5.36801832 0.06250

The function to compute mean differences (MD) and zero probability difference (ZPD) between estimated and observed values, is meanDifferences() (also used internally).

head(meanDifferences(
  estimated = example_HURDLE,
  observed = observed_hurdle
))

##           MD           ZPD
## 1 -1.0612798  0.0002600107
## 2         NA -0.0010681206
## 3 -0.8258883  0.0007597716
## 4 -3.7745917 -0.0003641001
## 5 -0.1889343  0.0019494545
## 6 -0.5496976  0.0008419888

A wrapper function to simultaneously perform the estimates and the mean differences is fitModels().

GOF_stool_16S <- fitModels(
  counts = ps_stool_16S,
  models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
  scale_ZIG = c("median", "default"),
  scale_HURDLE = c("median", "default"),
  verbose = FALSE
)

Exploiting the internal structure of the fitModels’s output we can obtain the Root Mean Squared Error (RMSE) values for MD values (the lower, the better):

plotRMSE(GOF_stool_16S, difference = "MD", plotIt = FALSE)

##                 Model       RMSE
## 1                  NB 0.07057518
## 2                ZINB 0.14298859
## 3                  DM 0.97766699
## 4          ZIG_median 0.53150840
## 5         ZIG_default 0.54067339
## 6       HURDLE_median 0.85437589
## 7 HURDLE_defaultFALSE 3.11439757

and for ZPD values:

plotRMSE(GOF_stool_16S, difference = "ZPD", plotIt = FALSE)

##                 Model         RMSE
## 1                  NB 0.0741340631
## 2                ZINB 0.0219383951
## 3                  DM 0.0436436801
## 4          ZIG_median 0.1129285606
## 5         ZIG_default 0.1248956620
## 6       HURDLE_median 0.0008776059
## 7 HURDLE_defaultFALSE 0.0008832596

3.5 Visualization

3.5.1 Mean Differences

To plot estimated and observed values we use the function plotMD, based on ggplot2.

plotMD(
  data = GOF_stool_16S,
  difference = "MD",
  split = TRUE
)

For the HURDLE_default model, a quite different range of values of mean differences is displayed because of the excessive ‘default’ scaling proposed (1 million). It is also possible to plot only a subset of the estimated models:

plotMD(
  data = GOF_stool_16S[1:6],
  difference = "MD",
  split = TRUE
)

And the mean differences for their Zero Probability/Proportion:

plotMD(
  data = GOF_stool_16S[1:6],
  difference = "ZPD",
  split = TRUE
)

MDs and ZPDs are also available in this other output layout:

plot_grid(plotMD(data = GOF_stool_16S[1:6], difference = "MD", split = FALSE),
  plotMD(data = GOF_stool_16S[1:6], difference = "ZPD", split = FALSE),
  ncol = 2
)

3.5.2 RMSE

While a summary statistics for the overall performance is visible through:

plot_grid(plotRMSE(GOF_stool_16S, difference = "MD"),
  plotRMSE(GOF_stool_16S, difference = "ZPD"),
  ncol = 2
)

4 Type I Error Control

We next try to evaluate Type I Error rate Control of each differential abundance detection method, i.e., the probability of the statistical test to call a feature DA when it is not. To do so, we consider mock comparisons on HMP Stool samples in which no true DA is present. Briefly, we randomly assign each sample to one of two experimental groups and perform DA analysis between these groups, repeating the process 10 times (at least 1000 suggested). In this setting, the p-values of a perfect test should be uniformly distributed between 0 and 1 and the false positive rate (FPR or observed α), which is the observed proportion of significant tests, should match the nominal value (e.g., α=0.05).

4.1 TIEC structure

The Type I Error Control (TIEC) analysis is summarized by the following diagram. The main functions are the createMocks() which randomly generates the mock groups, runMocks() which relies on runDA() function and performs the DA analysis on mock comparisons, and createTIEC() which produces the data.frame for plotting the results.

TIEC - package structure

Where the DA framework decided by the user is composed by two main steps:

normalization;
differential abundance.

These steps can be performed both directly, using the norm_*() and DA_*() methods for normalization and differential abundance respectively, or indirectly, using the setNormalizations() and set_*() functions. This allows a higher flexibility to the user which can set the instructions for normalization or DA analysis only at the beginning. If some modifications are needed, the user can re-sets the methods or modify the list of instructions directly.

Normalization and Differential Abundance structures

4.2 Create mock comparisons

Using createMocks() function we can randomly group the samples, N = 10 times. A higher N is suggested (at least 1000) but in that case a longer running time is required.

set.seed(123)
my_mocks <- createMocks(
  nsamples = phyloseq::nsamples(ps_stool_16S),
  N = 10
) # At least N = 1000 is suggested

4.3 Differential abundance

Once the mocks have been generated, we perform DA analysis. Firstly we add to the phyloseq object some scaling factors, such as TMM from edgeR and CSS from metagenomeSeq, and some normalization factor such as poscounts from DESeq2. This can be done, manually, using the norm_edgeR(), norm_DESeq2(), and norm_CSS() methods:

ps_stool_16S <- norm_edgeR(
  object = ps_stool_16S,
  method = "TMM"
)
ps_stool_16S <- norm_DESeq2(
  object = ps_stool_16S,
  method = "poscounts"
)
ps_stool_16S <- norm_CSS(
  object = ps_stool_16S,
  "median"
)

Or, more fastly, using the setNormalizations() and runNormalizations() methods:

my_normalizations <- setNormalizations(fun = c("norm_edgeR", "norm_DESeq2", "norm_CSS"), method = c("TMM", "poscounts", "median"))
ps_stool_16S <- runNormalizations(normalization_list = my_normalizations, object = ps_stool_16S, verbose = TRUE)

##       + Running now: norm_edgeR 
##         Parameters: method=TMM 
##       + Running now: norm_DESeq2 
##         Parameters: method=poscounts 
##       + Running now: norm_CSS 
##         Parameters: method=median

We also compute the zero-inflated negative binomial weights using the weights_ZINB function.

zinbweights <- weights_ZINB(
  object = ps_stool_16S,
  K = 0,
  design = "~ 1"
)

For each row of the mock_df data frame we run a bunch of DA methods. In this demonstrative example we use:

edgeR with TMM scaling factors;
edgeR with TMM scaling factors and ZINB weights;
DESeq2 with poscounts normalization factors;
DESeq2 with poscounts normalization factors and ZINB weights;
limma-voom with TMM scaling factors;
limma-voom with TMM scaling factors and ZINB weights;
limma-voom with CSS scaling factors;

my_edgeR <- set_edgeR(
    pseudo_count = FALSE,
    group_name = "group", 
    design = ~ group, 
    robust = FALSE, 
    coef = 2, 
    norm = "TMM", 
    weights_logical = c(TRUE, FALSE), 
    expand = TRUE
)
my_DESeq2 <- set_DESeq2(
    pseudo_count = FALSE,
    design = ~ group,
    contrast = c("group", "grp2", "grp1"),
    norm = "poscounts", 
    weights_logical = c(TRUE, FALSE), 
    alpha = 0.05,
    expand = TRUE
)
my_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ group,
    coef = 2,
    norm = c("TMM", "TMM", "CSSmedian"),
    weights_logical = c(FALSE, TRUE, FALSE), 
    expand = FALSE)

my_methods <- c(my_edgeR, my_DESeq2, my_limma)

# Random grouping each time
Stool_16S_mockDA <- runMocks(mocks = my_mocks, method_list = my_methods, object = ps_stool_16S, weights = zinbweights, verbose = FALSE)

The structure of the output in this example is the following:

Comparison1 to Comparison10 on the first level, which contains:
- Method1 to Method7 output lists on the second level:
  - pValMat which contains the matrix of raw p-values and adjusted p-values in rawP and adjP columns respectively;
  - statInfo which contains the matrix of summary statistics for each feature, such as the logFC, standard errors, test statistics and so on;
  - dispEsts which contains the dispersion estimates for methods like edgeR and DESeq2;
  - name which contains the complete name of the used method.

Be sure to create a similar result structure (respecting the names and types of the objects) if a custom method is added: dispEsts vector is optional, while the matrix pValMat, and the character name are mandatory for the “Type I Error Control” results’ visualization (see the example below for more information). statInfo matrix is also very important for results visualization, however it will be more useful later when the direction of the differential abundance will be investigated.

DA_yourMethod <- function(object, parameters) # others
{
    ### your method code ###
    
    ### extract important statistics ###
    vector_of_pval <- NA # contains the p-values
    vector_of_adjusted_pval <- NA # contains the adjusted p-values
    name_of_your_features <- NA # contains the OTU, or ASV, or other feature 
                                # names. Usually extracted from the rownames of 
                                # the count data
    vector_of_logFC <- NA # contains the logFCs
    vector_of_statistics <- NA # contains other statistics
    
    ### prepare the output ###
    pValMat <- data.frame("rawP" = vector_of_pval,
                          "adjP" = vector_of_adjusted_pval)
    statInfo <- data.frame("logFC" = vector_of_logFC,
                           "statistics" = vector_of_statistics) 
    name <- "write.here.the.name"
    # Be sure that your method hasn't changed the order of the features. If it 
    # happens, you'll need to re-establish the original order.
    rownames(pValMat) <- rownames(statInfo) <- name_of_your_features 
    
    # Return the output as a list
    return(list("pValMat" = pValMat, "statInfo" = statInfo, "name" = name))
} # END - function: DA_yourMethod

Once the custom method is set, it is very simple to add it to the framework:

direct way - just run the method using the function DA_yourMethod();
indirect way - create a list containing one or more instances of the custom method with the desired combination of parameters and add it to the my_methods list.

my_custom_method <- list(
    customMethod.1 = list(method = "DA_yourMethod", parameters),
    customMethod.2 = list(method = "DA_yourMethod", parameters)
)

In this case, the ‘method’ field, containing the name of the method to call, is mandatory. Instead, the argument containing the data object is not needed in order to keep the my_custom_method object more lightweight. To run the methods:

# Add the custom method instances to the others
my_methods <- c(my_edgeR, my_DESeq2, my_limma, my_custom_method)
# Run all the methods on a specific data object...
runDA(my_methods = my_methods, object = dataObject)
# ... Or on the mock datasets to investigate TIEC
runMocks(mocks = mock_df, my_methods = my_methods, object = dataObject)

4.4 Visualization

Since the visualization rely on ggplot2 package, firstly we run the createTIEC() function of the benchdamic package, in order to produce a list of 4 data.frames:

df_pval is a 3 columns and number_of_features x methods x comparisons rows data.frame. The three columns are called Comparison, pval, and method;
df_FPR is a 5 columns and methods x comparisons rows data.frame. For each set of method and comparison, the proportion of false positives, considering 3 threshold (0.01, 0.05, 0.1) are reported;
df_qq contains the coordinates to draw the QQ-plot for comparing the mean observed p-value distribution across comparisons, with the theoretical uniform distribution. Indeed, the observed p-values should follow a uniform distribution under the null hypothesis of no differential abundant features presence;
df_KS is a 5 columns and methods x comparisons rows data.frame. For each set of method and comparison, the Kolmogorov-Smirnov test statistics and pvalues are reported in KS and KS_pval columns respectively.

TIEC_summary <- createTIEC(Stool_16S_mockDA)

4.4.1 False Positive Rate

The false positive rate (FPR or observed α), which is the observed proportion of significant tests, should match the nominal value because all the findings are false positive by construction. In this example edgeR.TMM, edgeR.TMM.weighted, and limma.TMM.weighted appear to be quite over all the thresholds, differently DESeq2.poscounts, DESeq2.poscounts.weighted, limma.CSSmedian, and limma.TMM methods are close to each threshold.

cols <- createColors(variable = levels(TIEC_summary$df_pval$Method))
plotFPR(df_FPR = TIEC_summary$df_FPR, cols = cols)

4.4.2 QQ-Plot

The p-value distribution under the null hypothesis is uniform. This is qualitatively summarized in the QQ-plot below where the bisector represents a perfect correspondence between observed and theoretical quantiles of p-values. For each theoretical quantile, the corresponding observed quantile is obtained averaging the observed p-values’ quantiles from all 10 mock datasets. The plotting area is zoomed-in to show clearly the area between 0 and 0.1.

Methods over the bisector show a conservative behavior, while methods below the bisector a liberal one.

The starting point is determined by the total number of features. In our example the starting point for the theoretical p-value is computed as 1 divided by 71, rounded to the second digit.

plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 0.1), cols = cols)

4.4.3 Kolmogorov-Smirnov test

Departure from uniformity is evaluated through the Kolmogorov-Smirnov test which is reported for each method across all mock datasets using the the plotKS function.

plotKS(df_KS = TIEC_summary$df_KS, cols = cols)

5 Concordance

If we want to measure the ability of each method to produce replicable results in independent data and the amount of concordance of the results of each method on two random subsets of the same dataset, we’ll perform the Between Method Concordance (BMC) and the Within Method Concordance (WMC) analyses.

We consider the samples from the supragingival and subgingival plaques of 30 subjects.

data("ps_plaque_16S")

ps_plaque_16S_raw <- V35() %>% # Extracting V3-V5 16S sequenced regions' count data
  subset(select = HMP_BODY_SUBSITE %in% c("Supragingival Plaque", "Subgingival Plaque") & # Only gingival plaque samples
    RUN_CENTER == "WUGC" & # Only sequenced at the WUCG RUN CENTER
    SEX == "Male" & # Only male subject
    VISITNO == 1) %>% # Only the first visit
  as_phyloseq()

# Only paired samples
paired <- names(which(table(sample_data(ps_plaque_16S_raw)[, "RSID"]) == 2))
ps_plaque_16S_paired <- subset_samples(ps_plaque_16S_raw, RSID %in% paired)

# Remove low depth samples
ps_plaque_16S_pruned <- prune_samples(sample_sums(ps_plaque_16S_paired) >= 10^3, ps_plaque_16S_paired)

# Remove features with zero counts
ps_plaque_16S_filtered <- filter_taxa(ps_plaque_16S_pruned, function(x) sum(x > 0) > 0, 1)

# Collapse counts to the genus level
ps_plaque_16S <- tax_glom(ps_plaque_16S_filtered, taxrank = "GENUS")

5.1 Concordance structure

Several functions are involved in the concordance analysis. First of all, the normalization methods need to be set up using the setNormalizations() function. The differential abundance framework is set up by using set_edgeR(), set_DESeq2(), set_limma(), and so on, functions. Once all these instructions are set up, the user can call the runSplits() which relies on runNormalizations() and runDA() functions. Finally, the createConcordance() and plotConcordance() functions are used to compute and plot the results.

Concordance structure

5.2 Split datasets

Using the createSplits() function, the dataset is randomly divided by half. In this particular demonstrative dataset, samples are paired: 1 sample for supragingival plaque and 1 sample for subgingival plaque are considered for each subject. The paired parameter is passed to the method and it contains the name of the variable which describes the subject IDs. In this specific case, the two groups of samples are balanced between conditions, reflecting the starting dataset. However, if the starting dataset had been unbalanced, the balanced option would have allowed to keep the two splits unbalanced or not.

set.seed(123)
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- factor(sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE)

my_splits <- createSplits(
  object = ps_plaque_16S,
  varName = "HMP_BODY_SUBSITE",
  paired = "RSID",
  balanced = TRUE,
  N = 10
) # At least 100 is suggested

The structure produced by createSplits() function consists in a list of two matrices: Subset1 and Subset2. Each matrix contains the randomly chosen sample IDs. The number of rows of both matrices is equal to the number of comparisons/splits.

5.3 Differential abundance

For some of the methods implemented in this package it is possible to perform differential abundance testings for the repeated measurements experimental designs (e.g. by adding the subject ID in the model formula of DESeq2). However, for the sake of simplicity we consider the samples as independent between body sub sites.

Once again, to set the differential abundance methods to use, the set_*() methods can be exploited. For a faster demonstration, differential abundance methods without weighting are used:

my_edgeR_noWeights <- set_edgeR(group_name = "HMP_BODY_SUBSITE", design = ~ HMP_BODY_SUBSITE, coef = 2, norm = "TMM")
my_DESeq2_noWeights <- set_DESeq2(contrast = c("HMP_BODY_SUBSITE","Supragingival Plaque", "Subgingival Plaque"), design = ~ HMP_BODY_SUBSITE, norm = "poscounts")
my_limma_noWeights <- set_limma(design = ~ HMP_BODY_SUBSITE, coef = 2, norm = c("TMM", "CSSmedian"))

my_methods_noWeights <- c(my_edgeR_noWeights, my_DESeq2_noWeights, my_limma_noWeights)

Similarly, to set the normalization methods, the setNormalizations() function can be used. In this case it has already been set up for the TIEC analysis.

str(my_normalizations)

## List of 3
##  $ norm_edgeR :List of 2
##   ..$ fun   : chr "norm_edgeR"
##   ..$ method: chr "TMM"
##  $ norm_DESeq2:List of 2
##   ..$ fun   : chr "norm_DESeq2"
##   ..$ method: chr "poscounts"
##  $ norm_CSS   :List of 2
##   ..$ fun   : chr "norm_CSS"
##   ..$ method: chr "median"

The runSplits() function generates the subsets and performs DA analysis:

Plaque_16S_splitsDA <- runSplits(split_list = my_splits, method_list = my_methods_noWeights, normalization_list = my_normalizations, object = ps_plaque_16S, verbose = FALSE)

The structure of the output in this example is the following:

Subset1 and Subset2 on the first level, which contains:
- Comparison1 to Comparison10 output lists on the second level:
  - results of 4 methods on the third level: edgeR with TMM scaling factors, DESeq2 with poscounts normalization factors, limma-voom with TMM scaling factors, limma-voom with CSS scaling factors. They are organized as already described in the “Goodness of Fit” chapter:
    - pValMat which contains the matrix of raw p-values and adjusted p-values in rawP and adjP columns respectively;
    - statInfo which contains the matrix of summary statistics for each feature, such as the logFC, standard errors, test statistics and so on;
    - dispEsts which contains the dispersion estimates for methods like edgeR and DESeq2;
    - name which contains the complete name of the used method.

For each pair of methods the concordance is computed by the createConcordance() function. It produces a long format data.frame object with several columns:

comparison which indicates the comparison number;
n_features which indicates the total number of taxa in the comparison dataset;
name of method1;
name of method2;
rank;
concordance which is defined as the cardinality of the intersection of the top rank elements of each list, divided by rank, i.e., \(\frac{L_{1:rank} \bigcap M_{1:rank}}{rank}\), where L and M represent the lists of p-values of method1 and method2 respectively.

concordance <- createConcordance(
    object = Plaque_16S_splitsDA, 
    slot = "pValMat", 
    colName = "rawP", 
    type = "pvalue"
)
head(concordance)

##    comparison n_features          method1          method2 rank concordance
## 1 Comparison1         69 DESeq2.poscounts DESeq2.poscounts    1   0.0000000
## 2 Comparison1         69 DESeq2.poscounts DESeq2.poscounts    2   0.5000000
## 3 Comparison1         69 DESeq2.poscounts DESeq2.poscounts    3   0.6666667
## 4 Comparison1         69 DESeq2.poscounts DESeq2.poscounts    4   0.5000000
## 5 Comparison1         69 DESeq2.poscounts DESeq2.poscounts    5   0.4000000
## 6 Comparison1         69 DESeq2.poscounts DESeq2.poscounts    6   0.5000000

The createConcordance() method is very flexible. In the example below the concordances are built using the log fold changes instead of the p-values. To do so, it is necessary to know the column names generated by each differential abundance method in the statInfo matrix.

Firstly inspect the method order:

names(Plaque_16S_splitsDA$Subset1$Comparison1)

## [1] "edgeR.TMM"        "DESeq2.poscounts" "limma.CSSmedian"  "limma.TMM"

Then look at the column names:

names(Plaque_16S_splitsDA$Subset1$Comparison1$edgeR.TMM$statInfo)

## [1] "logFC"  "logCPM" "F"      "PValue"

names(Plaque_16S_splitsDA$Subset1$Comparison1$DESeq2.poscounts$statInfo)

## [1] "baseMean"       "log2FoldChange" "lfcSE"          "stat"          
## [5] "pvalue"         "padj"

names(Plaque_16S_splitsDA$Subset1$Comparison1$limma.CSSmedian$statInfo)

## [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val" "B"

names(Plaque_16S_splitsDA$Subset1$Comparison1$limma.TMM$statInfo)

## [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val" "B"

All methods, except for DESeq2, contain the log fold change values in the logFC column of statInfo matrix. Knowing this, the concordance data.frame can be built using:

concordance_alternative <- createConcordance(
  object = Plaque_16S_splitsDA,
  slot = "statInfo",
  colName = c("logFC", "log2FoldChange", "logFC", "logFC"),
  type = "logfc"
)

5.4 Visualization

Starting from the table of concordances, the plotConcordance() function can produce 2 graphical results:

the dendrogram of methods, clustered by the area over the concordance bisector in concordanceDendrogram slot;
the heatmap of the between and within method concordances in concordanceHeatmap slot. For each tile of the symmetric heatmap, which corresponds to a pair of methods, the concordance from rank 1 to a threshold rank is drawn.

The area between the curve and the bisector is colored to highlight concordant methods (blue) and non-concordant ones (red). The two graphical results should be drawn together for the best experience.

pC <- plotConcordance(concordance = concordance, threshold = 30)
cowplot::plot_grid(plotlist = pC, ncol = 2, align = "h", axis = "tb",
        rel_widths = c(1, 3))

The WMC and BMC from rank 1 to rank 30 are reported in the plot above. 69 is the maximum rank because it is the number of taxa for which all methods have been able to calculate p-values (in all comparisons). However, a custom threshold of 30 was supplied .

It is common that WMC values (in red rectangles) are lower than BMC ones. Indeed, BMC is computed between different methods on the same data, while WMC is computed for a single method, run in different datasets (some taxa are dataset-specific).

limma.CSSmedian and limma.TMM methods show the highest BMC values. Regarding the WMC, edgeR.TMM has the lowest values while other methods have comparable ones.

6 Enrichment analysis

While mock comparisons and random splits allow to evaluate type I error and concordance, these analyses do not assess the correctness of the discoveries. Even the method with the highest WMC could nonetheless consistently identify false positive DA taxa.

While the lack of ground truth makes it challenging to assess the validity of DA results in real data, enrichment analysis can provide an alternative solution to rank methods in terms of their ability to identify, as significant, taxa that are known to be differentially abundant between two groups.

6.1 A priori knowledge

Here, we leveraged the peculiar environment of the gingival site: the supragingival biofilm is directly exposed to the open atmosphere of the oral cavity, favoring the growth of aerobic species. In the subgingival biofilm, however, the atmospheric conditions gradually become strict anaerobic, favoring the growth of anaerobic species [9]. From the comparison of the two sites, we thus expected to find an abundance of aerobic microbes in the supragingival plaque and of anaerobic bacteria in the subgingival plaque. DA analysis should reflect this difference by finding an enrichment of aerobic (anaerobic) bacteria among the DA taxa with a positive (negative) log-fold-change.

Firstly, the microbial metabolism information is necessary. These data comes from [10] paper github repository (https://github.com/waldronlab/nychanesmicrobiome), but they can be loaded using data("microbial_metabolism").

data("microbial_metabolism")
head(microbial_metabolism)

##              Genus        Type
## 1     Acholeplasma F Anaerobic
## 2 Actinomycetaceae F Anaerobic
## 3    Aeriscardovia     Aerobic
## 4       Aerococcus F Anaerobic
## 5  Aggregatibacter F Anaerobic
## 6    Alloscardovia   Anaerobic

The microbial genus and its type of metabolism are specified in the first and second column respectively. To match each taxon of the phyloseq object to its type of metabolism the next chunk of code can be used.

# Extract genera from the phyloseq tax_table slot
genera <- tax_table(ps_plaque_16S)[, "GENUS"]
# Genera as rownames of microbial_metabolism data.frame
rownames(microbial_metabolism) <- microbial_metabolism$Genus
# Match OTUs to their metabolism
priorInfo <- data.frame(genera, "Type" =  microbial_metabolism[genera, "Type"])
unknown_metabolism <- is.na(priorInfo$Type)
priorInfo[unknown_metabolism, "Type"] <- "Unknown"
# Relabel 'F Anaerobic' to 'F_Anaerobic' to remove space
priorInfo$Type <- factor(priorInfo$Type, levels = c("Aerobic","Anaerobic","F Anaerobic","Unknown"), labels = c("Aerobic","Anaerobic","F_Anaerobic","Unknown"))
# Add a more informative names column
priorInfo[, "newNames"] <- paste0(rownames(priorInfo), "|", 
    priorInfo[, "GENUS"])

6.2 Differential abundance

Prepare some normalization/scaling factors.

none_normalization <- setNormalizations(fun = "norm_edgeR", method = "none")
my_normalizations_enrichment <- c(my_normalizations, none_normalization)
ps_plaque_16S <- runNormalizations(normalization_list = my_normalizations_enrichment, object = ps_plaque_16S, verbose = FALSE)

Differently from the Type I Error Control and Concordance analyses, the enrichment analysis rely on a single phyloseq object. For this reason many methods can be assessed without computational trade-offs.

my_metagenomeSeq <- set_metagenomeSeq(design = ~ HMP_BODY_SUBSITE, coef = 2, norm = "CSSmedian")
my_ALDEx2 <- set_ALDEx2(conditions = "HMP_BODY_SUBSITE", test = "t", norm = "none")
my_corncob <- set_corncob(formula = ~ HMP_BODY_SUBSITE, phi.formula = ~ HMP_BODY_SUBSITE, formula_null = ~ 1, phi.formula_null = ~ HMP_BODY_SUBSITE, test = "Wald", coefficient = "HMP_BODY_SUBSITESupragingival Plaque", norm = "none")
my_MAST <- set_MAST(rescale = "median", design = ~ HMP_BODY_SUBSITE, coefficient = "HMP_BODY_SUBSITESupragingival Plaque", norm = "none")
my_Seurat <- set_Seurat(test.use = "wilcox", contrast = c("HMP_BODY_SUBSITE", "Supragingival Plaque", "Subgingival Plaque"), norm = "none")

my_methods_enrichment <- c(
    my_methods_noWeights, 
    my_metagenomeSeq, 
    my_ALDEx2, 
    my_corncob, 
    my_MAST, 
    my_Seurat
)

# Convert to factor
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- factor(sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE)
# Reference level = "Subgingival Plaque"
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- relevel(
    x = sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE, 
    ref = "Subgingival Plaque"
)
Plaque_16S_DA <- runDA(method_list = my_methods_enrichment, object = ps_plaque_16S, weights = zinbweights)

##       * Running now: DA_edgeR 
##         Parameters: pseudo_count=FALSE, group_name=HMP_BODY_SUBSITE, design=~HMP_BODY_SUBSITE, coef=2, robust=FALSE, norm=TMM, weights=FALSE 
##       * Running now: DA_DESeq2 
##         Parameters: pseudo_count=FALSE, design=~HMP_BODY_SUBSITE, contrast=HMP_BODY_SUBSITE.Supragingival Plaque.Subgingival Plaque, alpha=0.05, norm=poscounts, weights=FALSE 
##       * Running now: DA_limma 
##         Parameters: pseudo_count=FALSE, design=~HMP_BODY_SUBSITE, coef=2, norm=CSSmedian, weights=FALSE 
##       * Running now: DA_limma 
##         Parameters: pseudo_count=FALSE, design=~HMP_BODY_SUBSITE, coef=2, norm=TMM, weights=FALSE 
##       * Running now: DA_metagenomeSeq 
##         Parameters: pseudo_count=FALSE, design=~HMP_BODY_SUBSITE, coef=2, norm=CSSmedian 
## it= 0, nll=95.09, log10(eps+1)=Inf, stillActive=88
## it= 1, nll=96.15, log10(eps+1)=0.04, stillActive=47
## it= 2, nll=93.80, log10(eps+1)=0.06, stillActive=44
## it= 3, nll=91.68, log10(eps+1)=0.09, stillActive=35
## it= 4, nll=89.93, log10(eps+1)=0.09, stillActive=28
## it= 5, nll=90.79, log10(eps+1)=0.07, stillActive=20
## it= 6, nll=91.82, log10(eps+1)=0.06, stillActive=8
## it= 7, nll=91.82, log10(eps+1)=0.05, stillActive=5
## it= 8, nll=92.16, log10(eps+1)=0.04, stillActive=3
## it= 9, nll=91.97, log10(eps+1)=0.07, stillActive=2
## it=10, nll=91.89, log10(eps+1)=0.06, stillActive=2
## it=11, nll=92.35, log10(eps+1)=0.00, stillActive=0
##       * Running now: DA_ALDEx2 
##         Parameters: pseudo_count=FALSE, conditions=HMP_BODY_SUBSITE, mc.samples=128, test=t, denom=iqlr, norm=none 
## |------------(25%)----------(50%)----------(75%)----------|
##       * Running now: DA_corncob 
##         Parameters: pseudo_count=FALSE, formula=~.HMP_BODY_SUBSITE, formula_null=~.1, phi.formula=~.HMP_BODY_SUBSITE, phi.formula_null=~.HMP_BODY_SUBSITE, coefficient=HMP_BODY_SUBSITESupragingival Plaque, test=Wald, boot=FALSE, norm=none 
##       * Running now: DA_MAST 
##         Parameters: pseudo_count=FALSE, design=~.HMP_BODY_SUBSITE, coefficient=HMP_BODY_SUBSITESupragingival Plaque, rescale=median, norm=none 
##       * Running now: DA_Seurat 
##         Parameters: pseudo_count=FALSE, contrast=HMP_BODY_SUBSITE.Supragingival Plaque.Subgingival Plaque, test.use=wilcox, norm=none

6.3 Visualization

Plaque_16_DA object contains the results for 9 methods. In order to extract p-values, the optional direction of DA (DA vs non-DA, or UP Abundant vs DOWN Abundant), and to add any a priori information, the createEnrichment() function can be used.

In the direction argument, which is set to NULL by default, the column name containing the direction (e.g. logfc, logFC, logFoldChange…) of each method’s statInfo matrix must be supplied.

Firstly, the order of methods should be investigated:

names(Plaque_16S_DA)

## [1] "edgeR.TMM"               "DESeq2.poscounts"       
## [3] "limma.CSSmedian"         "limma.TMM"              
## [5] "metagenomeSeq.CSSmedian" "ALDEx2.none.iqlr.t"     
## [7] "corncob.none.Wald"       "MAST.none.median"       
## [9] "Seurat.none.wilcox"

Following the methods’ order, the direction parameter is supplied together with other parameters:

threshold_pvalue, threshold_logfc, and top (optional), to set differential abundance thresholds;
slot, colName, and type, which specify where to apply the above thresholds;
priorKnowledge, enrichmentCol, and namesCol, to add enrichment information to DA analysis;

enrichment <- createEnrichment(
    object = Plaque_16S_DA, priorKnowledge = priorInfo, enrichmentCol = "Type",
    namesCol = "newNames", slot = "pValMat", colName = "adjP", type = "pvalue",
    direction = c("logFC", # edgeR
        "log2FoldChange", # DEseq2
        "logFC", # limma
        "logFC", # limma
        "HMP_BODY_SUBSITESupragingival Plaque", # metagenomeSeq
        "effect", # ALDEx2
        "Estimate", # corncob
        "logFC",# MAST
        "avg_log2FC"), # Seurat
    threshold_pvalue = 0.1,
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = TRUE
)

The produced enrichment object consists in a list of elements as long as the number of tested methods:

the data slot contains information for each feature. P-values, adjusted p-values (or other statistics) in stats column, log fold changes (or other statistics, if specified) in direction column, differential abundace information in the DA column (according to the thresholds), the variable of interest for the enrichment analysis, and the name of the feature in the feature column;
in the tables slot a maximum of 2 x (levels of enrichment variable) contingency tables (2x2) are present;
in the tests slot, the list of Fisher exact tests produced by the fisher.test() function are saved for each contingency table;
in the summaries slot, the first elements of the contingency tables and the respective p-values are collected for graphical purposes.

Considering the most liberal method of our example, metagenomeSeq.CSSmedian, we obtain 8 contingency tables. Both UP Abundant and DOWN Abundant taxa are found and the enrichment variable has Aerobic, Anaerobic, F_Anaerobic, and Unknown levels. For each level, we can build 2 contingency tables: one for DOWN Abundant vs non.DOWN Abundant features and one for UP Abundant vs non.UP Abundant features. The enrichment is tested using Fisher exact test. The plotContingency() function summarize all these information.

plotContingency(enrichment = enrichment, levels_to_plot = c("Aerobic", "Anaerobic"), method = "metagenomeSeq.CSSmedian")

To summarize enrichment analysis for all the methods simultaneously, the plotEnrichment() function can be used. Only Aerobic and Anaerobic levels are plotted:

plotEnrichment(enrichment = enrichment, enrichmentCol = "Type", levels_to_plot = c("Aerobic", "Anaerobic"))

Since “Subgingival Plaque” is the reference level for each method, the coefficients extracted from the methods are referred to the “Supragingival Plaque” class. Five out of nine methods identify, as expected, a statistically significant (\(0.001 < p \le 0.05\)) amount of DOWN Abundant Anaerobic features in Supragingival Plaque. Moreover, metagenomeSeq.CSSmedian finds some UP Abundant Aerobic genera in Supragingival Plaque but, together with edgeR.TMM, some Anaerobic genera as UP Abundant, which are probably false positives. Seurat.none.wilcox, ALDEx2.none.iqlr.t, and limma.CSSmedian have instead a more conservative behavior, but still they can find some DOWN Abundant genera towards the expected direction. Finally, MAST.none doesn’t find any DA feature.

To investigate the DA features, the plotMutualFindings() function can be used. While levels_to_plot argument allows to choose which levels of the enrichment variable to plot, n_methods argument allows to extract only features which are mutually found as DA by more than 1 method.

plotMutualFindings(enrichment, enrichmentCol = "Type", levels_to_plot = c("Aerobic", "Anaerobic"), n_methods = 1)

In this example, only Anaerobic genera are found as DA by more than 1 method simultaneously. Among them Prevotella, Fusobacterium, Catonella, and Dialister genera are found as DOWN Abundant in Supragingival Plaque by the majority of methods. Only the Bacteroides genus is considered UP Abundant in Supragingival Plaque by edgeR.TMM. However, the same feature is found as DOWN Abundant by metagenomeSeq.CSSmedian.

6.4 True and False Positives

To evaluate the overall performances we can study a statistic based on the difference between putative True Positives (TP) and the putative False Positives (FP). To build the matrix to plot, the createPositives() can be used. In details, the correctness of the DA features is evaluated comparing the direction of the top ranked features to the expected direction supplied by the user in the TP and FP lists. The procedure is performed for several thresholds of top parameter in order to observe a trend, if present.

positives <- createPositives(
    object = Plaque_16S_DA, priorKnowledge = priorInfo, enrichmentCol = "Type",
    namesCol = "newNames", slot = "pValMat", colName = "rawP", type = "pvalue",
    direction = c("logFC", # edgeR
        "log2FoldChange", # DEseq2
        "logFC", # limma
        "logFC", # limma
        "HMP_BODY_SUBSITESupragingival Plaque", # metagenomeSeq
        "effect", # ALDEx2
        "Estimate", # corncob
        "logFC",# MAST
        "avg_log2FC"), # Seurat
    threshold_pvalue = 1,
    threshold_logfc = 0,
    top = seq.int(from = 0, to = 50, by = 5), 
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DOWN Abundant", "Anaerobic"), c("UP Abundant", "Aerobic")),
    FP = list(c("DOWN Abundant", "Aerobic"), c("UP Abundant", "Anaerobic"))
)
head(positives)

##   top                  method TP FP
## 1   5               edgeR.TMM  4  0
## 2   5        DESeq2.poscounts  3  0
## 3   5         limma.CSSmedian  4  0
## 4   5               limma.TMM  5  0
## 5   5 metagenomeSeq.CSSmedian  5  0
## 6   5      ALDEx2.none.iqlr.t  4  0

Finally, the plotPositives() function can be used to summarize the methods’ performances. Higher values usually represents better performances. In our example, all methods show similar values of the statistics from the top 5, to the top 20 ranked features.

plotPositives(positives)

MAST.none.median and ALDEx2.none.iqlr.t methods are obviously very conservative and are located on the lower part of the plot. Very liberal methods, such as metagenomeSeq.CSSmedian and edgeR.TMM show instead the highest values. This means that their findings are in line with the a priori knowledge supplied by the user, however we are considering a toy example with only 88 features.

7 Session Info

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] cowplot_1.1.1    ggplot2_3.3.5    plyr_1.8.6       phyloseq_1.38.0 
## [5] benchdamic_1.0.0 BiocStyle_2.22.0
## 
## loaded via a namespace (and not attached):
##   [1] ica_1.0-2                   zinbwave_1.16.0            
##   [3] Rsamtools_2.10.0            foreach_1.5.1              
##   [5] lmtest_0.9-38               glmnet_4.1-2               
##   [7] crayon_1.4.1                spatstat.core_2.3-0        
##   [9] MASS_7.3-54                 rhdf5filters_1.6.0         
##  [11] MAST_1.20.0                 nlme_3.1-153               
##  [13] ALDEx2_1.26.0               rlang_0.4.12               
##  [15] XVector_0.34.0              ROCR_1.0-11                
##  [17] irlba_2.3.3                 limma_3.50.0               
##  [19] minfi_1.40.0                filelock_1.0.2             
##  [21] BiocParallel_1.28.0         rjson_0.2.20               
##  [23] bit64_4.0.5                 glue_1.4.2                 
##  [25] ffpe_1.38.0                 rngtools_1.5.2             
##  [27] sctransform_0.3.2           sfsmisc_1.1-12             
##  [29] parallel_4.1.1              methylumi_2.40.0           
##  [31] spatstat.sparse_2.0-0       AnnotationDbi_1.56.0       
##  [33] BiocGenerics_0.40.0         VGAM_1.1-5                 
##  [35] spatstat.geom_2.3-0         tidyselect_1.1.1           
##  [37] SummarizedExperiment_1.24.0 SeuratObject_4.0.2         
##  [39] NADA_1.6-1.1                fitdistrplus_1.1-6         
##  [41] XML_3.99-0.8                nleqslv_3.3.2              
##  [43] tidyr_1.1.4                 zoo_1.8-9                  
##  [45] GenomicAlignments_1.30.0    xtable_1.8-4               
##  [47] magrittr_2.0.1              evaluate_0.14              
##  [49] zlibbioc_1.40.0             doRNG_1.8.2                
##  [51] miniUI_0.1.1.1              metagenomeSeq_1.36.0       
##  [53] bslib_0.3.1                 rpart_4.1-15               
##  [55] zCompositions_1.3.4         shiny_1.7.1                
##  [57] xfun_0.27                   askpass_1.1                
##  [59] multtest_2.50.0             cluster_2.1.2              
##  [61] caTools_1.18.2              biomformat_1.22.0          
##  [63] KEGGREST_1.34.0             tibble_3.1.5               
##  [65] ggrepel_0.9.1               base64_2.0                 
##  [67] ape_5.5                     scrime_1.3.5               
##  [69] listenv_0.8.0               Biostrings_2.62.0          
##  [71] png_0.1-7                   permute_0.9-5              
##  [73] reshape_0.8.8               future_1.22.1              
##  [75] withr_2.4.2                 lumi_2.46.0                
##  [77] bitops_1.0-7                slam_0.1-48                
##  [79] RcppZiggurat_0.1.6          pillar_1.6.4               
##  [81] bumphunter_1.36.0           gplots_3.1.1               
##  [83] cachem_1.0.6                GenomicFeatures_1.46.0     
##  [85] TTR_0.24.2                  DelayedMatrixStats_1.16.0  
##  [87] xts_0.12.1                  vctrs_0.3.8                
##  [89] ellipsis_0.3.2              generics_0.1.1             
##  [91] ROI.plugin.lpsolve_1.0-1    tools_4.1.1                
##  [93] munsell_0.5.0               DelayedArray_0.20.0        
##  [95] fastmap_1.1.0               compiler_4.1.1             
##  [97] abind_1.4-5                 httpuv_1.6.3               
##  [99] rtracklayer_1.54.0          beanplot_1.2               
## [101] plotly_4.10.0               GenomeInfoDbData_1.2.7     
## [103] gridExtra_2.3               edgeR_3.36.0               
## [105] lattice_0.20-45             deldir_1.0-6               
## [107] utf8_1.2.2                  later_1.3.0                
## [109] dplyr_1.0.7                 BiocFileCache_2.2.0        
## [111] jsonlite_1.7.2              affy_1.72.0                
## [113] scales_1.1.1                pbapply_1.5-0              
## [115] sparseMatrixStats_1.6.0     genefilter_1.76.0          
## [117] lazyeval_0.2.2              promises_1.2.0.1           
## [119] goftest_1.2-3               spatstat.utils_2.2-0       
## [121] reticulate_1.22             rmarkdown_2.11             
## [123] nor1mix_1.3-0               siggenes_1.68.0            
## [125] Rtsne_0.15                  softImpute_1.4-1           
## [127] Biobase_2.54.0              uwot_0.1.10                
## [129] igraph_1.2.7                HDF5Array_1.22.0           
## [131] survival_3.2-13             numDeriv_2016.8-1.1        
## [133] yaml_2.2.1                  htmltools_0.5.2            
## [135] memoise_2.0.0               BiocIO_1.4.0               
## [137] Seurat_4.0.5                locfit_1.5-9.4             
## [139] IRanges_2.28.0              quadprog_1.5-8             
## [141] viridisLite_0.4.0           digest_0.6.28              
## [143] assertthat_0.2.1            mime_0.12                  
## [145] rappdirs_0.3.3              registry_0.5-1             
## [147] RSQLite_2.2.8               Rfast_2.0.3                
## [149] future.apply_1.8.1          data.table_1.14.2          
## [151] blob_1.2.2                  vegan_2.5-7                
## [153] S4Vectors_0.32.0            preprocessCore_1.56.0      
## [155] detectseparation_0.2        lpSolveAPI_5.5.2.0-17.7    
## [157] splines_4.1.1               labeling_0.4.2             
## [159] Rhdf5lib_1.16.0             illuminaio_0.36.0          
## [161] RCurl_1.98-1.5              hms_1.1.1                  
## [163] rhdf5_2.38.0                colorspace_2.0-2           
## [165] BiocManager_1.30.16         GenomicRanges_1.46.0       
## [167] shape_1.4.6                 sass_0.4.0                 
## [169] GEOquery_2.62.0             Rcpp_1.0.7                 
## [171] mclust_5.4.7                bookdown_0.24              
## [173] RANN_2.6.1                  fansi_0.5.0                
## [175] tzdb_0.1.2                  truncnorm_1.0-8            
## [177] parallelly_1.28.1           R6_2.5.1                   
## [179] grid_4.1.1                  ggridges_0.5.3             
## [181] lifecycle_1.0.1             curl_4.3.2                 
## [183] affyio_1.64.0               leiden_0.3.9               
## [185] jquerylib_0.1.4             Matrix_1.3-4               
## [187] corncob_0.2.0               RcppAnnoy_0.0.19           
## [189] RColorBrewer_1.1-2          iterators_1.0.13           
## [191] stringr_1.4.0               htmlwidgets_1.5.4          
## [193] Wrench_1.12.0               polyclip_1.10-0            
## [195] biomaRt_2.50.0              purrr_0.3.4                
## [197] ROI_1.0-0                   mgcv_1.8-38                
## [199] globals_0.14.0              openssl_1.4.5              
## [201] patchwork_1.1.1             codetools_0.2-18           
## [203] matrixStats_0.61.0          gtools_3.9.2               
## [205] prettyunits_1.1.1           SingleCellExperiment_1.16.0
## [207] dbplyr_2.1.1                GenomeInfoDb_1.30.0        
## [209] gtable_0.3.0                DBI_1.1.1                  
## [211] stats4_4.1.1                tensor_1.5                 
## [213] httr_1.4.2                  highr_0.9                  
## [215] KernSmooth_2.23-20          stringi_1.7.5              
## [217] progress_1.2.2              reshape2_1.4.4             
## [219] farver_2.1.0                annotate_1.72.0            
## [221] magick_2.7.3                xml2_1.3.2                 
## [223] ggdendro_0.1.22             trust_0.1-8                
## [225] restfulr_0.0.13             ade4_1.7-18                
## [227] readr_2.0.2                 geneplotter_1.72.0         
## [229] scattermore_0.7             DESeq2_1.34.0              
## [231] bit_4.0.4                   MatrixGenerics_1.6.0       
## [233] spatstat.data_2.1-0         pkgconfig_2.0.3            
## [235] MGLM_0.2.0                  knitr_1.36

References

1. Calgaro M. mcalgaro93/sc2meta [Internet]. 2020 [cited 2020 Jul 13]. Available from: https://github.com/mcalgaro93/sc2meta

2. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.

3. Thorsen J, Brejnrod A, Mortensen M, Rasmussen MA, Stokholm J, Al-Soud WA, et al. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome. 2016;4:62.

4. Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert J-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 2018;9:284.

5. Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. 2015.

6. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat Methods. 2013;10:1200–2.

7. Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15.

8. Kim J, Zhang Y, Day J, Zhou H. MGLM: An R Package for Multivariate Categorical Data Analysis. The R Journal [Internet]. The R Foundation; 2018;10:73. Available from: http://dx.doi.org/10.32614/rj-2018-015

9. Thurnheer T, Bostanci N, Belibasakis GN. Microbial dynamics during conversion from supragingival to subgingival biofilms in an in vitro model. Molecular Oral Microbiology [Internet]. 2016 [cited 2020 May 28];31:125–35. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1111/omi.12108

10. Beghini F, Renson A, Zolnik CP, Geistlinger L, Usyk M, Moody TU, et al. Tobacco exposure associated with oral microbiota oxygen utilization in the New York City Health and Nutrition Examination Study. Ann Epidemiol. 2019;34:18–25.e3.

An introduction to benchdamic

26 October 2021

Contents