The metagenomeFeatures package and associated annotation packages (e.g. greengene13.5MgDb) can be used to and taxonomic annotations to MRexperiment objects. The metagenomeSeq package includes the MRexperiment-class, and has a number of methods for analyzing marker-gene metagenome datasets. This vignette demonstrate how to add taxonomic annotation to a MRexperiment-class object using metagenomeFeatures.
To generate a MRexperiment object with taxonomic information you need;
For MRexperiment objects, the assayData and phenoData need to have the sample sample order similarly assayData and featureData need to have the sample OTU order.
Required Packages
library(Biostrings)
library(metagenomeFeatures)
## Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
library(metagenomeSeq)
The magrittr
pipe operator %>%
is used throughout this vignette (exported by metagenomeFeatures), see the magrittr package vignette for more information about this handy operator (https://cran.r-project.org/web/packages/magrittr/index.html).
The assayData and phenoData we will use is in a file included in the metagenomeFeatures package with count data derived from the msd16s study.
Creating an initial MRexperiment with count and phenoData.
assay_data <- metagenomeFeatures:::vignette_assay_data()
pheno_data <- metagenomeFeatures:::vignette_pheno_data()
demoMRexp = newMRexperiment(assay_data$counts, phenoData=pheno_data)
demoMRexp
## MRexperiment (storageMode: environment)
## assayData: 384 features, 4 samples
## element names: counts
## protocolData: none
## phenoData
## sampleNames: 1686.HMPMockV1.1.Even1.673832
## 1686.HMPMockV1.2.Staggered1.673834
## 1686.HMPMockV1.2.Staggered2.673833
## 1686.HMPMockV1.1.Even2.673831
## varLabels: artifact_ids command_id reference_id study
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
Currently the featureData
slot is empty, we will now use the annotateMRexp_fData
function in the metagenomeFeatures package to load the featureData
.
You first need a MgDb
object for the database used to classify the OTUs. A subset of the greengenes database, mock_MgDb
is used to generate a MgDb-object for annotating the msd16s data. This database is only used here for demonstration purposes, the gg13.5MgDb
database in the greengenes13.5MgDb package can also be used to generate the metagenomeAnnotation object.
The annotateMRexp_fData
function takes an MgDb
and MRexperiment
objects as input and defines the featureData
slot using the OTU ids from the assayData
slot. The slot is defined with a mgFeatures
object which extends the AnnotatedDataFrame
object with additional slots with the database reference sequences (refDbSeq
), phylogenetic tree (refDbtree
, if available), and metadata about the reference database used.
demoMRexp <- annotateMRexp_fData(mgdb = mock_MgDb, MRobj = demoMRexp)
## Warning in if (select_type != "all") {: the condition has length > 1 and
## only the first element will be used
demoMRexp
## MRexperiment (storageMode: environment)
## assayData: 384 features, 4 samples
## element names: counts
## protocolData: none
## phenoData
## sampleNames: 1686.HMPMockV1.1.Even1.673832
## 1686.HMPMockV1.2.Staggered1.673834
## 1686.HMPMockV1.2.Staggered2.673833
## 1686.HMPMockV1.1.Even2.673831
## varLabels: artifact_ids command_id reference_id study
## varMetadata: labelDescription
## featureData
## featureNames: 86
## fvarLabels: Keys Kingdom ... Species (8 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
The featureData
slot in the demoMRexp
object now contains the taxonomic data for the OTUs as well as the database reference sequences as well as the phylogenetic tree.
Each mgFeature slot can be accessed using the approporiate accessor function.
Seq Data
featureData(demoMRexp) %>% mgF_seq() %>% head()
## A DNAStringSet instance of length 1
## width seq names
## [1] 1334 GCTACTGCTATTGGGATTCGA...TGGACACACCGCCCGTCACG 86
Taxonomy
featureData(demoMRexp) %>% mgF_taxa() %>% head()
## Keys Kingdom Phylum Class
## 86 86 k__Archaea p__Euryarchaeota c__Methanobacteria
## Order Family Genus
## 86 o__Methanobacteriales f__Methanobacteriaceae g__Methanobrevibacter
## Species
## 86 s__
Metadata
featureData(demoMRexp) %>% mgF_meta() %>% head()
## $ACCESSION_DATE
## [1] "3/31/2015"
##
## $URL
## [1] "https://greengenes.microbio.me"
##
## $DB_TYPE_NAME
## [1] "GreenGenes-MgDb-Mock"
##
## $DB_TYPE_VALUE
## [1] "MgDb"
##
## $DB_SCHEMA_VERSION
## [1] "1.0"
Phylogenetic Tree The tree slot is not defined for demoMRexp
. If available the tree would be accessed using the following command.
featureData(demoMRexp) %>% mgF_tree()
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] metagenomeSeq_1.18.0 RColorBrewer_1.1-2
## [3] glmnet_2.0-10 foreach_1.4.3
## [5] Matrix_1.2-10 limma_3.32.3
## [7] metagenomeFeatures_1.8.1 Biobase_2.36.2
## [9] Biostrings_2.44.1 XVector_0.16.0
## [11] IRanges_2.10.2 S4Vectors_0.14.3
## [13] BiocGenerics_0.22.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.6.3 gtools_3.5.0
## [3] lattice_0.20-35 htmltools_0.3.6
## [5] yaml_2.1.14 blob_1.1.0
## [7] rlang_0.1.1 glue_1.1.1
## [9] DBI_0.7 BiocParallel_1.10.1
## [11] bit64_0.9-7 dbplyr_1.1.0
## [13] bindrcpp_0.2 matrixStats_0.52.2
## [15] GenomeInfoDbData_0.99.0 bindr_0.1
## [17] stringr_1.2.0 zlibbioc_1.22.0
## [19] hwriter_1.3.2 caTools_1.17.1
## [21] codetools_0.2-15 memoise_1.1.0
## [23] evaluate_0.10.1 latticeExtra_0.6-28
## [25] knitr_1.16 GenomeInfoDb_1.12.2
## [27] Rcpp_0.12.12 KernSmooth_2.23-15
## [29] backports_1.1.0 gdata_2.18.0
## [31] DelayedArray_0.2.7 ShortRead_1.34.0
## [33] bit_1.1-12 Rsamtools_1.28.0
## [35] gplots_3.0.1 digest_0.6.12
## [37] stringi_1.1.5 dplyr_0.7.1
## [39] GenomicRanges_1.28.4 grid_3.4.1
## [41] rprojroot_1.2 tools_3.4.1
## [43] bitops_1.0-6 magrittr_1.5
## [45] lazyeval_0.2.0 RCurl_1.95-4.8
## [47] tibble_1.3.3 RSQLite_2.0
## [49] pkgconfig_2.0.1 assertthat_0.2.0
## [51] rmarkdown_1.6 iterators_1.0.8
## [53] R6_2.2.2 GenomicAlignments_1.12.1
## [55] compiler_3.4.1