Contents

1 Introduction

It is well established that the microbiome play a key role in human health and disease, due to its function such as host nutrition production (e.g. short-chain fatty acids, SCFA), defense against pathogens, and development of immunity (Gilbert et al. 2018). The microbiome provide novel biomarkers for many disease, and characterizing biomarkers based on microbiome profiles has great potential for translational medicine and precision medicine (Manor et al. 2020).

Differential analysis (DA) is a widely used approach to identify biomarkers. To date, a number of methods have been developed for microbiome marker discovery based on metagenomic profiles, e.g. simple statistical analysis methods STAMP (Parks et al. 2014), RNA-seq based methods such as edgeR (Robinson, McCarthy, and Smyth 2010) and DESeq2 (Love, Huber, and Anders 2014), metagenomeSeq (Paulson et al. 2013), and Linear Discriminant Analysis Effect Size (LEfSe) (Segata et al. 2011). However, all of these methods have its own advantages and disadvantages, and none of them is considered standard or universal. Moreover, the programs/softwares for different DA methods may be development using different programming languages, even in different operating systems. Here, we have developed an all-in-one R/Bioconductor package microbiomeMarker that integrates commonly used differential analysis methods as well as three machine learning-based approaches (Logistic regression, Random forest, and Support vector machine) to facilitate the identification of microbiome markers.

2 Installation

Install the package from Bioconductor directly:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("microbiomeMarker")

Or install the development version of the package from Github.

if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}
remotes::install_github("yiluheihei/microbiomeMarker")

3 Package loading

Load the microbiomeMarker into the R session:

library(microbiomeMarker)

4 Data structure

4.1 Input phyloseq-class object

phyloseq is the most popular Biocondcutor package used by the microbiome research community, and phyloseq-class objects are a great data-standard for microbiome data in R. Therefore, the core functions in microbiomeMarker take phyloseq-class object as input. Conveniently, microbiomeMarker provides features to import external metagenomic abundance profiles from two popular microbiome analysis pipelines, qiime2 (Bolyen et al. 2019) and dada2 (Callahan et al. 2016), and return a phyloseq-class object.

4.1.1 Import from dada2

The output of the dada2 pipeline is a feature table of amplicon sequence variants (an ASV table): A matrix with rows corresponding to samples and columns to ASVs, in which the value of each entry is the number of times that ASV was observed in that sample. This table is analogous to the traditional OTU table. Conveniently, taxa names are saved as

seq_tab <- readRDS(
    system.file(
        "extdata", "dada2_seqtab.rds",
        package = "microbiomeMarker"
    )
)
tax_tab <- readRDS(
    system.file(
        "extdata", "dada2_taxtab.rds",
        package = "microbiomeMarker"
    )
)
sam_tab <- read.table(
    system.file(
        "extdata", "dada2_samdata.txt",
        package = "microbiomeMarker"
    ),
    sep = "\t",
    header = TRUE,
    row.names = 1
)
ps <- import_dada2(seq_tab = seq_tab, tax_tab = tax_tab, sam_tab = sam_tab)
ps
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 232 taxa and 20 samples ]
#> sample_data() Sample Data:       [ 20 samples by 4 sample variables ]
#> tax_table()   Taxonomy Table:    [ 232 taxa by 6 taxonomic ranks ]
#> refseq()      DNAStringSet:      [ 232 reference sequences ]

4.1.2 Import from qiime2

qiime2 is the most widely used software for metagenomic analysis. User can import the feature table, taxonomic table, phylogenetic tree, representative sequence and sample metadata from qiime2 using import_qiime2().

otuqza_file <- system.file(
    "extdata", "table.qza",
    package = "microbiomeMarker"
)
taxaqza_file <- system.file(
    "extdata", "taxonomy.qza",
    package = "microbiomeMarker"
)
sample_file <- system.file(
    "extdata", "sample-metadata.tsv",
    package = "microbiomeMarker"
)
treeqza_file <- system.file(
    "extdata", "tree.qza",
    package = "microbiomeMarker"
)

ps <- import_qiime2(
    otu_qza = otuqza_file, taxa_qza = taxaqza_file,
    sam_tab = sample_file, tree_qza = treeqza_file
)
ps
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 770 taxa and 34 samples ]
#> sample_data() Sample Data:       [ 34 samples by 9 sample variables ]
#> tax_table()   Taxonomy Table:    [ 770 taxa by 7 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 770 tips and 768 internal nodes ]

4.1.3 Other import functions reexport from phyloseq

Moreover, microbiomeMarker reexports three import functions from phyloseq, including import_biom(), import_qiime() and import_mothur(), to help users to import abundance data from biom file, qiime1, and mothur. More details on these three import functions can be see from here.

Users can also import the external files into phyloseq-class object manually. For more details on how to create phyloseq-class object from manually imported data, please see this tutorial.

4.2 Output microbiomeMaker-class object

The object class used by the microbiomeMarker package to store the result of microbiome marker analysis (also referred as DA) is the microbiomeMarker-class object. The microbiomeMarker-class extends the phyloseq-class by adding three custom slots:

  • marker_table: also a new S4 class to store the markers, which is inherit from data.frame. Rows represent the microbiome markers and variables represents feature of the marker, such as feature names, effect size and p value.
  • norm_method: normalization method.
  • diff_method: DA method.

Once users have a microbiomeMarker-class object, many accessor functions are available to query aspects of the data set. The function name and its purpose can be seen here.

5 Diferential analysis

A number of methods have been developed for identifying differentially metagenomic features. microbiomeMarker provides the most commonly used DA methods which can be divided into three main categories: a) simple statistical tests; b) RNA-seq based methods; c) metagenomic based methods. All the names of DA functions in microbiomeMarker are prefixed with run_ (the run_* family of functions).

By default, all the methods will perform DA on all levels of features (taxa_rank = "all" in DA functions) like LEfSe (Segata et al. 2011), therefore, the corrected p value in the result (var padj in the marker_table object) may be over-corrected. Users can change the para taxa_rank to a specific level of interest, and the DA will only perform in the specified level. For simplicity, DA on a specific level of feature is not contained in this vignette.

5.1 Normalization

It is critical to normalize the metagenomic data to eliminate artifactual bias in the original measurements prior to DA (Weiss et al. 2017). Here in microbiomeMarker, we provides seven popular normalization methods, including:

  • rarefy: random subsampling counts to the smallest library size in the data set.
  • TSS: total sum scaling, also referred to as “relative abundance”, the abundances were normalized by dividing the corresponding sample library size.
  • TMM: trimmed mean of m-values. First, a sample is chosen as reference. The scaling factor is then derived using a weighted trimmed mean over the differences of the log-transformed gene-count
    fold-change between the sample and the reference.
  • RLE: relative log expression, RLE uses a pseudo-reference calculated using the geometric mean of the gene-specific abundances over all samples. The scaling factors are then calculated as the median of the gene counts ratios between the samples and the reference.
  • CSS: cumulative sum scaling, calculates scaling factors as the cumulative sum of gene abundances up to a data-derived threshold.
  • CLR: centered log-ratio normalization.
  • CPM: pre-sample normalization of the sum of the values to 1e+06.

We can use norm_*() family of functions or a wrapper function normalize to normalize the original metagenomic abundance data.

# take tss as example
norm_tss(ps)
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 770 taxa and 34 samples ]
#> sample_data() Sample Data:       [ 34 samples by 9 sample variables ]
#> tax_table()   Taxonomy Table:    [ 770 taxa by 7 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 770 tips and 768 internal nodes ]

normalize(ps, method = "TSS")
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 770 taxa and 34 samples ]
#> sample_data() Sample Data:       [ 34 samples by 9 sample variables ]
#> tax_table()   Taxonomy Table:    [ 770 taxa by 7 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 770 tips and 768 internal nodes ]

Note: all the DA functions provides a para to specify the normalization method. We emphasize that users should specify the normalization method in the DA functions rather than using these normalization functions directly. If you use normalize data first and then perform DA, you should set the norm_method manually. We recommend to use the default normalization methods for the corresponding DA methods, e.g. “CPM” for LEfSe and “CSS” for metagenomeSeq, and the default values of norm in the DA functions is set as their default normalization methods.

data(kostic_crc)
mm_test <- normalize(kostic_crc, method = "CPM") %>%
    run_lefse(
        wilcoxon_cutoff = 0.01,
        norm = "none", # must be "none" since the input has been normalized
        group = "DIAGNOSIS",
        kw_cutoff = 0.01,
        multigrp_strat = TRUE,
        lda_cutoff = 4
    )
# equivalent to
run_lefse(
    wilcoxon_cutoff = 0.01,
    norm = "CPM",
    group = "DIAGNOSIS",
    kw_cutoff = 0.01,
    multigrp_strat = TRUE,
    lda_cutoff = 4
)

5.2 Simple statitical tests

In practice, simple statitical tests such as t-test (for two groups comparison) and Kruskal-Wallis rank sum test (for multiple groups comparison) are frequently used for metagenomic differential analysis. STAMP [parks2014stamp] is a widely-used graphical software package that provides “best pratices” in choose appropriate statistical methods for metagenomic analysis. Here in microbiomeMarker, t-test, Welch’s t-test, and White’s non-parametric t-test are provided for two groups comparison, and ANOVA and Kruskal–Wallis test for multiple groups comparisons.

We can use test_two_groups() to perform simple statistical differential test between two groups.

data(enterotypes_arumugam)
tg_welch <- run_test_two_groups(
    enterotypes_arumugam,
    group = "Gender",
    method = "welch.test"
)

# three significantly differential genera (marker)
tg_welch
#> microbiomeMarker-class inherited from phyloseq-class
#> normalization method:              [ TSS ]
#> microbiome marker identity method: [ welch.test ]
#> marker_table() Marker Table:       [ 3 microbiome markers with 5 variables ]
#> otu_table()    OTU Table:          [ 244 taxa and  39 samples ]
#> sample_data()  Sample Data:        [ 39 samples by  9 sample variables ]
#> tax_table()    Taxonomy Table:     [ 244 taxa by 1 taxonomic ranks ]

# details of result of the three markers
head(marker_table(tg_welch))
#>                                     feature enrich_group  ef_diff_mean
#> marker1     p__Firmicutes|g__Heliobacterium            M -8.542172e-06
#> marker2         p__Firmicutes|g__Parvimonas            M -1.339857e-05
#> marker3 p__Firmicutes|g__Peptostreptococcus            M -6.695045e-05
#>             pvalue       padj
#> marker1 0.02940341 0.02940341
#> marker2 0.03281399 0.03281399
#> marker3 0.01714937 0.01714937

Function run_test_multiple_groups() is constructed for statistical differential test for multiple groups.

# three groups
ps <- phyloseq::subset_samples(
    enterotypes_arumugam,
    Enterotype %in% c("Enterotype 3", "Enterotype 2", "Enterotype 1")
)
mg_anova <- run_test_multiple_groups(
    ps,
    group = "Enterotype",
    method = "anova"
)

# 24 markers
mg_anova
#> microbiomeMarker-class inherited from phyloseq-class
#> normalization method:              [ TSS ]
#> microbiome marker identity method: [ anova ]
#> marker_table() Marker Table:       [ 24 microbiome markers with 5 variables ]
#> otu_table()    OTU Table:          [ 238 taxa and  32 samples ]
#> sample_data()  Sample Data:        [ 32 samples by  9 sample variables ]
#> tax_table()    Taxonomy Table:     [ 238 taxa by 1 taxonomic ranks ]

head(marker_table(mg_anova))
#>                                     feature enrich_group ef_eta_squared
#> marker1                    p__Bacteroidetes Enterotype 1      0.5821619
#> marker2                     p__Unclassified Enterotype 3      0.4497271
#> marker3      p__Actinobacteria|g__Scardovia Enterotype 2      0.2196652
#> marker4       p__Bacteroidetes|g__Alistipes Enterotype 3      0.2001541
#> marker5     p__Bacteroidetes|g__Bacteroides Enterotype 1      0.7633661
#> marker6 p__Bacteroidetes|g__Parabacteroides Enterotype 1      0.2582573
#>               pvalue         padj
#> marker1 3.196070e-06 3.196070e-06
#> marker2 1.731342e-04 1.731342e-04
#> marker3 2.742042e-02 2.742042e-02
#> marker4 3.922758e-02 3.922758e-02
#> marker5 8.396825e-10 8.396825e-10
#> marker6 1.314233e-02 1.314233e-02

Moreover, a wrapper of run_test_two_groups() and run_test_multiple_groups() named run_simple_stat() is provided for simple statistical differential analysis.

5.3 RNA-seq based DA methods

Some models developed specifically for RNA-Seq data have been proposed for metagenomic differential analysis. Three popular methods, including DESeq2 (Love, Huber, and Anders 2014) (run_deseq2()), edgeR (Robinson, McCarthy, and Smyth 2010) (run_edger()), and Voom (Law et al. 2014) (run_limma_voom()) are provided in microbiomeMarker.

Here we take edgeR method as an example.

# contrast must be specified for two groups comparison
data(pediatric_ibd)
mm_edger <- run_edger(
    pediatric_ibd,
    group = "Class",
    pvalue_cutoff = 0.1,
    p_adjust = "fdr"
)
mm_edger
#> microbiomeMarker-class inherited from phyloseq-class
#> normalization method:              [ TMM ]
#> microbiome marker identity method: [ edgeR: LRT ]
#> marker_table() Marker Table:       [ 34 microbiome markers with 5 variables ]
#> otu_table()    OTU Table:          [ 786 taxa and  43 samples ]
#> sample_data()  Sample Data:        [ 43 samples by  2 sample variables ]
#> tax_table()    Taxonomy Table:     [ 786 taxa by 1 taxonomic ranks ]

# multiple groups
data(cid_ying)
cid <- phyloseq::subset_samples(
    cid_ying,
    Consistency %in% c("formed stool", "liquid", "semi-formed")
)
mm_edger_mg <- run_edger(
    cid,
    group = "Consistency",
    method  = "QLFT",
    pvalue_cutoff = 0.05,
    p_adjust = "fdr"
)
mm_edger_mg
#> microbiomeMarker-class inherited from phyloseq-class
#> normalization method:              [ TMM ]
#> microbiome marker identity method: [ edgeR: QLFT ]
#> marker_table() Marker Table:       [ 325 microbiome markers with 5 variables ]
#> otu_table()    OTU Table:          [ 669 taxa and  413 samples ]
#> sample_data()  Sample Data:        [ 413 samples by  6 sample variables ]
#> tax_table()    Taxonomy Table:     [ 669 taxa by 1 taxonomic ranks ]

5.4 metagenomic based methods

Five methods, LEfSe (Segata et al. 2011), metagenomeSeq (Paulson et al. 2013), ALDEx2 (Fernandes et al. 2014), ANCOM (Mandal et al. 2015), and ANCOMBC (Lin and Peddada 2020), which were developed specifically for microbiome data (contain many more zeros that RNA-seq data), are also provided in our package. All these methods have greater power to detect differentially features than simple statistical tests by incorporating more sensitive tests.

Curently, LEfSe is the most popular tool for microbiome biomarker discovery. Here we take LEfSe method for example:

data(kostic_crc)
kostic_crc_small <- phyloseq::subset_taxa(
    kostic_crc,
    Phylum %in% c("Firmicutes")
)
mm_lefse <- run_lefse(
    kostic_crc_small,
    wilcoxon_cutoff = 0.01,
    group = "DIAGNOSIS",
    kw_cutoff = 0.01,
    multigrp_strat = TRUE,
    lda_cutoff = 4
)

mm_lefse
#> microbiomeMarker-class inherited from phyloseq-class
#> normalization method:              [ CPM ]
#> microbiome marker identity method: [ lefse ]
#> marker_table() Marker Table:       [ 12 microbiome markers with 5 variables ]
#> otu_table()    OTU Table:          [ 276 taxa and  177 samples ]
#> sample_data()  Sample Data:        [ 177 samples by  71 sample variables ]
#> tax_table()    Taxonomy Table:     [ 276 taxa by 1 taxonomic ranks ]
head(marker_table(mm_lefse))
#>                                                                                                                         feature
#> marker1                                             k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae
#> marker2                         k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Faecalibacterium
#> marker3 k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Faecalibacterium|s__Faecalibacterium_s__
#> marker4                                                                                 k__Bacteria|p__Firmicutes|c__Clostridia
#> marker5                                                                k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales
#> marker6                      k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcaceae_g__
#>         enrich_group   ef_lda       pvalue         padj
#> marker1      Healthy 5.013283 7.154793e-05 7.154793e-05
#> marker2      Healthy 4.866423 5.914547e-04 5.914547e-04
#> marker3      Healthy 4.864530 6.043983e-04 6.043983e-04
#> marker4      Healthy 4.649214 7.176046e-04 7.176046e-04
#> marker5      Healthy 4.649214 7.176046e-04 7.176046e-04
#> marker6      Healthy 4.289835 6.990210e-03 6.990210e-03

5.5 Supervised machine learning methods

Given that supervised learning (SL) methods can be used to predict differentiate samples based on there metagenomic profiles efficiently (Knights et al. 2011). microbiomeMarker also provides three SL classification models, random forest, logistic regression, and support vector machine, to identify microbiome biomarkers. In addition, the feature importance score for each marker will be provided too.

Here we take random forest for example:

# must specify the importance para for random forest
set.seed(2021)
# small example phyloseq object for test
ps_small <- phyloseq::subset_taxa(
    enterotypes_arumugam,
    Phylum %in% c("Firmicutes", "Bacteroidetes")
)
mm_lr <- run_sl(
    ps_small,
    group = "Gender",
    nfolds = 2,
    nrepeats = 1,
    taxa_rank = "Genus",
    top_n = 15,
    norm = "TSS",
    method = "LR",
)

marker_table(mm_lr)
#>                     feature enrich_group    ef_imp
#> marker1         Dyadobacter            M 100.00000
#> marker2       Paenibacillus            M  80.09623
#> marker3           Weissella            M  52.90190
#> marker4          Bacillales            F  49.21685
#> marker5        Zunongwangia            F  47.67223
#> marker6         Macrococcus            M  46.78631
#> marker7      Heliobacterium            M  42.77684
#> marker8             Gemella            M  40.85011
#> marker9    Syntrophothermus            M  36.87214
#> marker10        Geobacillus            M  33.29326
#> marker11    Symbiobacterium            M  32.72664
#> marker12 Desulfitobacterium            F  30.36170
#> marker13 Thermoanaerobacter            F  29.51966
#> marker14 Porphyromonadaceae            F  28.72514
#> marker15     Syntrophomonas            M  25.87942

Please note that SL methods can be biased for data with sample size due to the model overfitting. Thus, we advise users to use these SL methods with caution for a smaller dataset.

5.6 Pair-wise comparison of multiple groups

All the DE methods in microbiomeMarker, except for simple statistical tests for two groups comparison (test_mulitple_groups()), can be used for multiple groups comparison, that is to find markers that differ between any of the groups by analyze all groups at once. Users can perform post-hoc test to identify which pairs of groups may differ from each other using run_posthoc_test(). Apparently, the mutliple groups comparison will result in a larger number of genes than the individual pair-wise comparisons.

pht <- run_posthoc_test(ps, group = "Enterotype")
pht
#> postHocTest-class object
#> Pairwise test result of 238  features,  DataFrameList object, each DataFrame has five variables:
#>         comparisons    : pair groups to test which separated by '-'
#>         diff_mean: difference in mean proportions
#>         pvalue        : post hoc test p values
#>         ci_lower : lower confidence interval
#>         ci_upper : upper confidence interval
#> Posthoc multiple comparisons of means  using  tukey  method

# 24 significantly differential genera
markers <- marker_table(mg_anova)$feature
markers
#>                        p__Bacteroidetes                         p__Unclassified 
#>                      "p__Bacteroidetes"                       "p__Unclassified" 
#>          p__Actinobacteria|g__Scardovia           p__Bacteroidetes|g__Alistipes 
#>        "p__Actinobacteria|g__Scardovia"         "p__Bacteroidetes|g__Alistipes" 
#>         p__Bacteroidetes|g__Bacteroides     p__Bacteroidetes|g__Parabacteroides 
#>       "p__Bacteroidetes|g__Bacteroides"   "p__Bacteroidetes|g__Parabacteroides" 
#>          p__Bacteroidetes|g__Prevotella              p__Firmicutes|g__Bulleidia 
#>        "p__Bacteroidetes|g__Prevotella"            "p__Firmicutes|g__Bulleidia" 
#>        p__Firmicutes|g__Catenibacterium              p__Firmicutes|g__Catonella 
#>      "p__Firmicutes|g__Catenibacterium"            "p__Firmicutes|g__Catonella" 
#>             p__Firmicutes|g__Holdemania          p__Firmicutes|g__Lactobacillus 
#>           "p__Firmicutes|g__Holdemania"        "p__Firmicutes|g__Lactobacillus" 
#>            p__Firmicutes|g__Macrococcus     p__Firmicutes|g__Peptostreptococcus 
#>          "p__Firmicutes|g__Macrococcus"   "p__Firmicutes|g__Peptostreptococcus" 
#>           p__Firmicutes|g__Ruminococcus            p__Firmicutes|g__Selenomonas 
#>         "p__Firmicutes|g__Ruminococcus"          "p__Firmicutes|g__Selenomonas" 
#>          p__Firmicutes|g__Streptococcus        p__Firmicutes|g__Subdoligranulum 
#>        "p__Firmicutes|g__Streptococcus"      "p__Firmicutes|g__Subdoligranulum" 
#>         p__Proteobacteria|g__Bartonella           p__Proteobacteria|g__Brucella 
#>       "p__Proteobacteria|g__Bartonella"         "p__Proteobacteria|g__Brucella" 
#>      p__Proteobacteria|g__Granulibacter     p__Proteobacteria|g__Rhodospirillum 
#>    "p__Proteobacteria|g__Granulibacter"   "p__Proteobacteria|g__Rhodospirillum" 
#>   p__Proteobacteria|g__Stenotrophomonas         p__Unclassified|g__Unclassified 
#> "p__Proteobacteria|g__Stenotrophomonas"       "p__Unclassified|g__Unclassified"

# take a marker "p__Bacteroidetes|g__Bacteroides"
# for example, we will show "p__Bacteroidetes|g__Bacteroides"  differ from
# between Enterotype 2-Enterotype 1 and Enterotype 3-Enterotype 2.
extract_posthoc_res(pht, "p__Bacteroidetes|g__Bacteroides")[[1]]
#> DataFrame with 3 rows and 5 columns
#>                                      comparisons  diff_mean      pvalue
#>                                      <character>  <numeric>   <numeric>
#> Enterotype 2-Enterotype 1 Enterotype 2-Enterot.. -0.2813948 4.77015e-08
#> Enterotype 3-Enterotype 1 Enterotype 3-Enterot.. -0.2604547 1.63635e-09
#> Enterotype 3-Enterotype 2 Enterotype 3-Enterot..  0.0209401 7.88993e-01
#>                             ci_lower   ci_upper
#>                            <numeric>  <numeric>
#> Enterotype 2-Enterotype 1 -0.3713469 -0.1914428
#> Enterotype 3-Enterotype 1 -0.3312286 -0.1896808
#> Enterotype 3-Enterotype 2 -0.0575765  0.0994567

In addition, for the five linear models-based methods, including edgeR, DESeq2, metagenoSeq, limma-voom, and ANCOMBC, users can perform pair-wise comparisons by setting the argument contrast, a two length character in which the first element is the reference level (donominator of the logFC) and the second element is used as baseline (numerator for fold change). For more details on contrast argument, please see the help page of the corresponding functions. Here we take limma-voom method as example:

# comparison between Enterotype 3 and Enterotype 2
mm_lv_pair <- run_limma_voom(
    ps,
    "Enterotype",
    contrast = c("Enterotype 3", "Enterotype 2"),
    pvalue_cutoff = 0.05,
    p_adjust = "fdr"
)
mm_lv_pair
#> microbiomeMarker-class inherited from phyloseq-class
#> normalization method:              [ none ]
#> microbiome marker identity method: [ limma_voom ]
#> marker_table() Marker Table:       [ 3 microbiome markers with 5 variables ]
#> otu_table()    OTU Table:          [ 238 taxa and  32 samples ]
#> sample_data()  Sample Data:        [ 32 samples by  9 sample variables ]
#> tax_table()    Taxonomy Table:     [ 238 taxa by 1 taxonomic ranks ]
head(marker_table(mm_lv_pair))
#>                                   feature enrich_group  ef_logFC       pvalue
#> marker1    p__Bacteroidetes|g__Prevotella Enterotype.2  5.853735 1.051399e-12
#> marker2 p__Verrucomicrobia|g__Akkermansia Enterotype.3 -8.651612 3.104720e-04
#> marker3                p__Verrucomicrobia Enterotype.3 -8.299989 5.426191e-04
#>                 padj
#> marker1 2.502331e-10
#> marker2 3.694617e-02
#> marker3 4.304778e-02

6 Visualization

In microbiomeMarker, users can visualize the microbiome biomarker in different ways, such as box plot, bar plot, dot plot, heatmap, and cladogram. Except for heatmap, all these plots are generated using the most flexible and popular data visualization package ggplot2. Therefore, these plots can be easily customized before they are generated using the build-in functions of ggplot2, e.g. using theme() to modify the titles and labels. Heatmap is generated using a fantastic Bioconductor package ComplexHeatmap package.

6.1 Abundance box plot

First of all, users can visualize the abundances of markers using box plots with function plot_abundance(). We emphasize a concern that the group para for plot_abunance() must be keep same with the group para in the differential analysis function. By default, plot_abundance() will plot all the markers, users can plot the specificity markers using para markers.

p_abd <- plot_abundance(mm_lefse, group = "DIAGNOSIS")
p_abd


# customize the plot with ggplot2, modify the fill color manually
library(ggplot2)
p_abd + scale_fill_manual(values = c("Healthy" = "grey", "Tumor" = "red"))

6.2 Heat map

Moreover, users can also visualize the abundances of markers using heatmap, in which rows represents the markers and columns represents the samples. Like the above abundance box plot, users should pay attention to the para group, and control which markers to display by setting para markers.

plot_heatmap(mm_edger, transform = "log10p", group = "Class")