MMUPHin is an R package implementing meta-analysis methods for microbial
community profiles. It has interfaces for: a) covariate-controlled batch and
study effect adjustment, b) meta-analytic differential abundance testing, and meta-analytic discovery of c) discrete (cluster-based) or d) continuous
unsupervised population structure.
MMUPHin enables the normalization and combination of multiple
microbial community studies. It can then help in identifying microbes, genes,
or pathways that are differential with respect to combined phenotypes. Finally,
it can find clusters or gradients of sample types that reproduce consistently
MMUPHin is a Bioconductor package and can be installed via the following command.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MMUPHin")
This vignette is intended to provide working examples for all four
MMUPHin can be loaded into R with the first
library() command shown below. We also load some utility packages for data manipulation and visualization.
library(MMUPHin) library(magrittr) library(dplyr) library(ggplot2)
MMUPHin requires a properly
formatted collection of microbial community studies, with both feature
abundances and accompanying metadata. Here we use the five published
colorectal cancer (CRC)
stool metagenomic studies, incorporated in Thomas et al. (2019).
Data for the studies are already
conveniently packaged and accessible through the Biocondcutor package
though additional wranglings are needed to format input for
MMUPHin asks that feature abundances be provided as a
feature-by-sample matrix, and the metadata be provided as a data frame.
The two objects shoud agree on sample IDs, that is,
colname of the feature
abundance matrix and
rowname of the metadata data frame must agree.
Many popular ’omic data classes in R already enforce this standard, such as
ExpressionSet from Biobase, or
To minimize users’ efforts in loading data to run the examples, we have properly formatted the five studies for easy access. The feature abundances and metadata can be loaded with the following code chunk. For the interested user, the commented out scripts were used for accessing data directly from curatedMetagenomicData and formatting. It might be worthwhile to read through these as they perform many of the common tasks for preprocessing microbial feature abundance data in R, including sample/feature subsetting, normalization, filtering, etc.
data("CRC_abd", "CRC_meta") CRC_abd[1:5, 1, drop = FALSE]
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 ## s__Faecalibacterium_prausnitzii 0.11110668 ## s__Streptococcus_salivarius 0.09660736 ## s__Ruminococcus_sp_5_1_39BFAA 0.09115385 ## s__Subdoligranulum_unclassified 0.05806767 ## s__Bacteroides_stercoris 0.05685503
## subjectID body_site ## FengQ_2015.metaphlan_bugs_list.stool:SID31004 SID31004 stool ## study_condition ## FengQ_2015.metaphlan_bugs_list.stool:SID31004 CRC ## disease ## FengQ_2015.metaphlan_bugs_list.stool:SID31004 CRC;fatty_liver;hypertension ## age_category ## FengQ_2015.metaphlan_bugs_list.stool:SID31004 adult
## ## FengQ_2015.metaphlan_bugs_list.stool ## 107 ## HanniganGD_2017.metaphlan_bugs_list.stool ## 55 ## VogtmannE_2016.metaphlan_bugs_list.stool ## 104 ## YuJ_2015.metaphlan_bugs_list.stool ## 128 ## ZellerG_2014.metaphlan_bugs_list.stool ## 157
adjust_batch aims to correct for technical study/batch effects in microbial
feature abundances. It takes as input the feature-by-sample abundance matrix,
and performs batch effect adjustment given provided batch and optional covariate variables. It returns the batch-adjusted abundance matrix. Check
help(adjust_batch) for additional details and options. This figure from the paper (figure 2b) shows the desired effect of the batch adjustment: batch effects are removed, making the effect of the biological variable of interest easier to detect.
Here we use
adjust_batch to correct for differences in the five studies,
while controlling for the effect of CRC versus control (variable
adjust_batch returns a list of multiple outputs. The part we are most interested in now is the adjusted abundance table, which we assign to
fit_adjust_batch <- adjust_batch(feature_abd = CRC_abd, batch = "studyID", covariates = "study_condition", data = CRC_meta, control = list(verbose = FALSE)) CRC_abd_adj <- fit_adjust_batch$feature_abd_adj
One way to evaluate the effect of batch adjustment is to assess the total
variability in microbial profiles attributable to study differences, before
and after adjustment. This is commonly known as a PERMANOVA test
(Tang, Chen, and Alekseyenko 2016), and can be performed with the
adonis function in
vegan. The inputs to
adonis() are indices of microbial compositional dissimilarity.
library(vegan, quietly = TRUE)
## This is vegan 2.6-4
D_before <- vegdist(t(CRC_abd)) D_after <- vegdist(t(CRC_abd_adj)) set.seed(1) fit_adonis_before <- adonis(D_before ~ studyID, data = CRC_meta)
## 'adonis' will be deprecated: use 'adonis2' instead
fit_adonis_after <- adonis(D_after ~ studyID, data = CRC_meta)
## 'adonis' will be deprecated: use 'adonis2' instead
## Permutation: free ## Number of permutations: 999 ## ## Terms added sequentially (first to last) ## ## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F) ## studyID 4 13.36 3.3400 11.855 0.07991 0.001 *** ## Residuals 546 153.82 0.2817 0.92009 ## Total 550 167.18 1.00000 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free ## Number of permutations: 999 ## ## Terms added sequentially (first to last) ## ## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F) ## studyID 4 4.939 1.23463 4.2855 0.03044 0.001 *** ## Residuals 546 157.298 0.28809 0.96956 ## Total 550 162.237 1.00000 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that, before study effect adjustment, study differences can expalin a total of 7.99% of the variability in microbial abundance profiles, whereas after adjustment this was reduced to 3.04%, though the effect of study is significant in both cases.
One of the most common meta-analysis goals is to combine association effects
across batches/studies to identify consistent overall effects.
provides a straightforward interface to this task, by first performing
regression analysis in individual batches/studies using the well-validated
Maaslin2 package, and then aggregating results with
established fixed/mixed effect models, realized via the
vegan package. Here, we use
lm_meta to test for
consistently differential abundant species between CRC and control samples
across the five studies, while controlling for demographic covariates (gender,
age, BMI). Again, lm_meta returns a list of more than one components.
meta_fits provides the final meta-analytical testing results. See
help(lm_meta) for the meaning of other components.
fit_lm_meta <- lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", covariates = c("gender", "age", "BMI"), data = CRC_meta, control = list(verbose = FALSE))
## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate gender is missing or has only one non-missing value in the following batches; will be excluded from model for these batches: ## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate age is missing or has only one non-missing value in the following batches; will be excluded from model for these batches: ## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate BMI is missing or has only one non-missing value in the following batches; will be excluded from model for these batches: ## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
meta_fits <- fit_lm_meta$meta_fits
You’ll note that this command produces a few error messages. This is because for one of the studies the specified covariates were not measured. As the message states, these covariates are not included when analyzing data from that study. We also note that batch correction is standard but not necessary for the analysis here.
We can visualize the significant (FDR q < 0.05) species associated with CRC in these studies/samples. Comparing them with Figure 1b of Thomas et al. (2019), we can see that many of the significant species do agree.
meta_fits %>% filter(qval.fdr < 0.05) %>% arrange(coef) %>% mutate(feature = factor(feature, levels = feature)) %>% ggplot(aes(y = coef, x = feature)) + geom_bar(stat = "identity") + coord_flip()
Clustering analysis of microbial profiles can help identify meaningful
discrete population subgroups (Ravel et al. 2011), but must be carried out
carefully with validations to ensure that the identified structures are
consistent (Koren et al. 2013).
discrete_discover provides the functionality to
use prediction strength (Tibshirani and Walther 2005) to evaluate discrete
clustering structures within individual microbial studies, as well as a
“generalized prediction strength” to evaluate their reproducibility in
other studies. These jointly provide meta-analytical evidence for
(or against) identifying discrete population structures in microbial profiles.
help(discrete_discover) to see more details on the method and additional
options. We can contrast discrete structure (clusters) with continuous structure (which will be overviewed in the next section):