The MicrobiomeBenchmarkData package provides access to a collection of
datasets with biological ground truth for benchmarking differential
abundance methods. The datasets are deposited on Zenodo:
https://doi.org/10.5281/zenodo.6911026
## Install BiocManager if it is not already installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## Release version
BiocManager::install("MicrobiomeBenchmarkData")
## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData")
library(MicrobiomeBenchmarkData)
library(purrr)
All sample metadata is merged into a single data frame and provided as a data object:
data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |>
    discard(~any(is.na(.x))) |>
    head()
knitr::kable(sample_metadata)
dataset | sample_id | body_site | library_size | pmid | study_condition | sequencing_method |
---|---|---|---|---|---|---|
HMP_2012_16S_gingival_V13 | 700103497 | oral_cavity | 5356 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700106940 | oral_cavity | 4489 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097304 | oral_cavity | 3043 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700099015 | oral_cavity | 2832 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097644 | oral_cavity | 2815 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097247 | oral_cavity | 6333 | 22699609 | control | 16S |
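The `discard()` call above drops every column that contains at least one missing value, which is why only the variables recorded for all samples remain in the table. The following is a minimal base R sketch of the same idea; the `toy` data frame and its columns are made up for illustration and are not part of the package.

```r
# Toy metadata (hypothetical): 'age' is missing for one sample,
# so it is not a column shared by all samples
toy <- data.frame(
    sample_id = c("s1", "s2", "s3"),
    body_site = c("oral_cavity", "oral_cavity", "stool"),
    age = c(31, NA, 45)
)

# Keep only the columns without missing values; this is roughly what
# purrr::discard(~any(is.na(.x))) does on a data frame
complete_cols <- Filter(function(col) !anyNA(col), toy)
names(complete_cols)
```

Only `sample_id` and `body_site` survive here, because `age` has a missing entry.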
Currently, there are 6 datasets available through the MicrobiomeBenchmarkData
package. These datasets are accessed through the getBenchmarkData function.
If no arguments are provided, the list of available datasets is printed on screen and a data.frame is returned with the description of the datasets:
dats <- getBenchmarkData()
#> 1 HMP_2012_16S_gingival_V13
#> 2 HMP_2012_16S_gingival_V35
#> 3 HMP_2012_16S_gingival_V35_subset
#> 4 HMP_2012_WMS_gingival
#> 5 Stammler_2016_16S_spikein
#> 6 Ravel_2011_16S_BV
#>
#> Use vignette('datasets', package = 'MicrobiomeBenchmarkData') for a detailed description of the datasets.
#>
#> Use getBenchmarkData(dryrun = FALSE) to import all of the datasets.
dats
#> Dataset Dimensions Body.site
#> 1 HMP_2012_16S_gingival_V13 33127 x 311 Gingiva
#> 2 HMP_2012_16S_gingival_V35 17949 x 311 Gingiva
#> 3 HMP_2012_16S_gingival_V35_subset 892 x 76 Gingiva
#> 4 HMP_2012_WMS_gingival 235 x 16 Gingiva
#> 5 Stammler_2016_16S_spikein 247 x 394 Stool
#> 6 Ravel_2011_16S_BV 4036 x 17 Vagina
#> Contrasts
#> 1 Subgingival vs Supragingival plaque.
#> 2 Subgingival vs Supragingival plaque.
#> 3 Subgingival vs Supragingival plaque.
#> 4 Subgingival vs Supragingival plaque.
#> 5 Pre-ASCT (allogeneic stem cell transplantation) vs 14 days after treatment.
#> 6 Healthy vs bacterial vaginosis
#> Biological.ground.truth
#> 1 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 2 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 3 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 4 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 5 Same bacterial loads of the spike-in bacteria across all samples: Salinibacter ruber (extreme halophilic), Rhizobium radiobacter (found in soils and plants), and Alicyclobacillus acidiphilu (thermo-acidophilic).
#> 6 Decrease of Lactobacillus and increase of bacteria isolated during bacterial vaginosis in samples with high Nugent scores (bacterial vaginosis).
To import a dataset, call the getBenchmarkData function with the name of the
dataset as the first argument (x) and the dryrun argument set to FALSE. The
output is a list containing the imported dataset as a
TreeSummarizedExperiment object, so a single dataset is extracted with [[1]].
tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment
#> dim: 892 76
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#> variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULL
Several datasets can be imported simultaneously by giving the names of the different datasets in a character vector:
list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
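Instead of positional indexing such as `dats$Dataset[2:4]`, the character vector of names can also be built by filtering the description table. The sketch below is only an illustration: `dats_toy` reconstructs two columns from the output printed earlier so that the code runs without the package or a network connection.

```r
# Two columns copied from the printed description table
# (a stand-in for the data.frame returned by getBenchmarkData())
dats_toy <- data.frame(
    Dataset = c(
        "HMP_2012_16S_gingival_V13", "HMP_2012_16S_gingival_V35",
        "HMP_2012_16S_gingival_V35_subset", "HMP_2012_WMS_gingival",
        "Stammler_2016_16S_spikein", "Ravel_2011_16S_BV"
    ),
    Body.site = c("Gingiva", "Gingiva", "Gingiva", "Gingiva", "Stool", "Vagina")
)

# Names of the gingival datasets; a vector like this could be passed as
# the first argument of getBenchmarkData() together with dryrun = FALSE
gingival <- dats_toy$Dataset[dats_toy$Body.site == "Gingiva"]
gingival
```

Selecting by a column value is less fragile than selecting by row position if the order of the datasets ever changes.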
To import all of the datasets, provide the dryrun = FALSE argument alone.
mbd <- getBenchmarkData(dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#> $ HMP_2012_16S_gingival_V13 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Ravel_2011_16S_BV :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Stammler_2016_16S_spikein :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
The biological annotations of each taxon are provided as a column in the
rowData slot of the TreeSummarizedExperiment.
## In this case, the column is named taxon_annotation
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#> kingdom phylum class order
#> <character> <character> <character> <character>
#> OTU_97.31247 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44487 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34979 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34572 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.42259 Bacteria Firmicutes Bacilli Lactobacillales
#> ... ... ... ... ...
#> OTU_97.44294 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45429 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44375 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45365 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45307 Bacteria Firmicutes Bacilli Lactobacillales
#> family genus taxon_annotation
#> <character> <character> <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ... ... ... ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobic
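Since the annotation column carries the biological ground truth, a natural first step is to tabulate it and to pick out the features in a given class. The sketch below uses an invented stand-in for a few rows of `rowData(tse)`; the OTU identifiers, genus names, and the mix of annotation values are assumptions for illustration only. On the real object, `as.data.frame(rowData(tse))` should give a comparable data frame.

```r
# Invented stand-in for a few rows of rowData(tse); values are illustrative
row_data_toy <- data.frame(
    genus = c("Streptococcus", "Streptococcus", "GenusA", "GenusB"),
    taxon_annotation = c(
        "facultative_anaerobic", "facultative_anaerobic",
        "aerobic", "anaerobic"
    ),
    row.names = c("OTU_1", "OTU_2", "OTU_3", "OTU_4")
)

# Number of features per annotation class
annotation_counts <- table(row_data_toy$taxon_annotation)
annotation_counts

# Identifiers of the features annotated as anaerobic
is_anaerobic <- row_data_toy$taxon_annotation == "anaerobic"
anaerobic_ids <- rownames(row_data_toy)[is_anaerobic]
anaerobic_ids
```

A benchmark would then check whether the features reported by a differential abundance method fall into the annotation class expected for each condition.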
The datasets are cached so they’re only downloaded once. The cache and all of
the files contained in it can be removed with the removeCache function.
removeCache()
sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] purrr_1.0.2 MicrobiomeBenchmarkData_1.6.0
#> [3] TreeSummarizedExperiment_2.12.0 Biostrings_2.72.0
#> [5] XVector_0.44.0 SingleCellExperiment_1.26.0
#> [7] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [9] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
#> [11] IRanges_2.38.0 S4Vectors_0.42.0
#> [13] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
#> [15] matrixStats_1.3.0 BiocStyle_2.32.0
#>
#> loaded via a namespace (and not attached):
#> [1] xfun_0.43 bslib_0.7.0 lattice_0.22-6
#> [4] yulab.utils_0.1.4 vctrs_0.6.5 tools_4.4.0
#> [7] generics_0.1.3 curl_5.2.1 parallel_4.4.0
#> [10] RSQLite_2.3.6 tibble_3.2.1 fansi_1.0.6
#> [13] blob_1.2.4 pkgconfig_2.0.3 Matrix_1.7-0
#> [16] dbplyr_2.5.0 lifecycle_1.0.4 GenomeInfoDbData_1.2.12
#> [19] compiler_4.4.0 treeio_1.28.0 codetools_0.2-20
#> [22] htmltools_0.5.8.1 sass_0.4.9 lazyeval_0.2.2
#> [25] yaml_2.3.8 tidyr_1.3.1 pillar_1.9.0
#> [28] crayon_1.5.2 jquerylib_0.1.4 BiocParallel_1.38.0
#> [31] DelayedArray_0.30.0 cachem_1.0.8 abind_1.4-5
#> [34] nlme_3.1-164 tidyselect_1.2.1 digest_0.6.35
#> [37] dplyr_1.1.4 bookdown_0.39 fastmap_1.1.1
#> [40] grid_4.4.0 cli_3.6.2 SparseArray_1.4.0
#> [43] magrittr_2.0.3 S4Arrays_1.4.0 utf8_1.2.4
#> [46] ape_5.8 withr_3.0.0 filelock_1.0.3
#> [49] UCSC.utils_1.0.0 bit64_4.0.5 rmarkdown_2.26
#> [52] httr_1.4.7 bit_4.0.5 memoise_2.0.1
#> [55] evaluate_0.23 knitr_1.46 BiocFileCache_2.12.0
#> [58] rlang_1.1.3 Rcpp_1.0.12 DBI_1.2.2
#> [61] glue_1.7.0 tidytree_0.4.6 BiocManager_1.30.22
#> [64] jsonlite_1.8.8 R6_2.5.1 fs_1.6.4
#> [67] zlibbioc_1.50.0