Example 16S Annotation Workflow

Nathan D. Olson

2017-10-30

Overview

The metagenomeFeatures package and associated annotation packages (e.g. greengene13.5MgDb) can be used to and taxonomic annotations to MRexperiment objects. The metagenomeSeq package includes the MRexperiment-class, and has a number of methods for analyzing marker-gene metagenome datasets. This vignette demonstrate how to add taxonomic annotation to a MRexperiment-class object using metagenomeFeatures.

To generate a MRexperiment object with taxonomic information you need;

For MRexperiment objects, the assayData and phenoData need to have the sample sample order similarly assayData and featureData need to have the sample OTU order.

Required Packages

library(Biostrings)
library(metagenomeFeatures)
library(metagenomeSeq)

The magrittr pipe operator %>% is used throughout this vignette (exported by metagenomeFeatures), see the magrittr package vignette for more information about this handy operator (https://cran.r-project.org/web/packages/magrittr/index.html).

Creating Inital MRexperiment Object

The assayData and phenoData we will use is in a file included in the metagenomeFeatures package with count data derived from the msd16s study.

Creating an initial MRexperiment with count and phenoData.

assay_data <- metagenomeFeatures:::vignette_assay_data()
pheno_data <- metagenomeFeatures:::vignette_pheno_data()
demoMRexp = newMRexperiment(assay_data$counts, phenoData=pheno_data)
demoMRexp
## MRexperiment (storageMode: environment)
## assayData: 384 features, 4 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: 1686.HMPMockV1.1.Even1.673832
##     1686.HMPMockV1.2.Staggered1.673834
##     1686.HMPMockV1.2.Staggered2.673833
##     1686.HMPMockV1.1.Even2.673831
##   varLabels: artifact_ids command_id reference_id study
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

Currently the featureData slot is empty, we will now use the annotateMRexp_fData function in the metagenomeFeatures package to load the featureData.

Loading featureData with metagenomeFeatures

You first need a MgDb object for the database used to classify the OTUs. A subset of the greengenes database, mock_MgDb is used to generate a MgDb-object for annotating the msd16s data. This database is only used here for demonstration purposes, the gg13.5MgDb database in the greengenes13.5MgDb package can also be used to generate the metagenomeAnnotation object.

annotateMRexp_fData

The annotateMRexp_fData function takes an MgDb and MRexperiment objects as input and defines the featureData slot using the OTU ids from the assayData slot. The slot is defined with a mgFeatures object which extends the AnnotatedDataFrame object with additional slots with the database reference sequences (refDbSeq), phylogenetic tree (refDbtree, if available), and metadata about the reference database used.

demoMRexp <- annotateMRexp_fData(mgdb = mock_MgDb, MRobj = demoMRexp)
## Warning in if (select_type != "all") {: the condition has length > 1 and
## only the first element will be used
demoMRexp
## MRexperiment (storageMode: environment)
## assayData: 384 features, 4 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: 1686.HMPMockV1.1.Even1.673832
##     1686.HMPMockV1.2.Staggered1.673834
##     1686.HMPMockV1.2.Staggered2.673833
##     1686.HMPMockV1.1.Even2.673831
##   varLabels: artifact_ids command_id reference_id study
##   varMetadata: labelDescription
## featureData
##   featureNames: 86
##   fvarLabels: Keys Kingdom ... Species (8 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

The featureData slot in the demoMRexp object now contains the taxonomic data for the OTUs as well as the database reference sequences as well as the phylogenetic tree.

Accessing mgFeature Slots

Each mgFeature slot can be accessed using the approporiate accessor function.

Seq Data

featureData(demoMRexp) %>% mgF_seq() %>% head()
##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]  1334 GCTACTGCTATTGGGATTCGA...TGGACACACCGCCCGTCACG 86

Taxonomy

featureData(demoMRexp) %>% mgF_taxa() %>% head()
##    Keys    Kingdom           Phylum              Class
## 86   86 k__Archaea p__Euryarchaeota c__Methanobacteria
##                    Order                 Family                 Genus
## 86 o__Methanobacteriales f__Methanobacteriaceae g__Methanobrevibacter
##    Species
## 86     s__

Metadata

featureData(demoMRexp) %>% mgF_meta() %>% head()
## $ACCESSION_DATE
## [1] "3/31/2015"
## 
## $URL
## [1] "https://greengenes.microbio.me"
## 
## $DB_TYPE_NAME
## [1] "GreenGenes-MgDb-Mock"
## 
## $DB_TYPE_VALUE
## [1] "MgDb"
## 
## $DB_SCHEMA_VERSION
## [1] "1.0"

Phylogenetic Tree The tree slot is not defined for demoMRexp. If available the tree would be accessed using the following command.

featureData(demoMRexp) %>% mgF_tree()
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] metagenomeSeq_1.20.0      RColorBrewer_1.1-2       
##  [3] glmnet_2.0-13             foreach_1.4.3            
##  [5] Matrix_1.2-11             limma_3.34.0             
##  [7] metagenomeFeatures_1.10.0 Biobase_2.38.0           
##  [9] Biostrings_2.46.0         XVector_0.18.0           
## [11] IRanges_2.12.0            S4Vectors_0.16.0         
## [13] BiocGenerics_0.24.0      
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.8.0 gtools_3.5.0              
##  [3] lattice_0.20-35            htmltools_0.3.6           
##  [5] yaml_2.1.14                blob_1.1.0                
##  [7] rlang_0.1.2                glue_1.2.0                
##  [9] DBI_0.7                    BiocParallel_1.12.0       
## [11] bit64_0.9-7                dbplyr_1.1.0              
## [13] bindrcpp_0.2               matrixStats_0.52.2        
## [15] GenomeInfoDbData_0.99.1    bindr_0.1                 
## [17] stringr_1.2.0              zlibbioc_1.24.0           
## [19] hwriter_1.3.2              caTools_1.17.1            
## [21] codetools_0.2-15           memoise_1.1.0             
## [23] evaluate_0.10.1            latticeExtra_0.6-28       
## [25] knitr_1.17                 GenomeInfoDb_1.14.0       
## [27] Rcpp_0.12.13               KernSmooth_2.23-15        
## [29] backports_1.1.1            gdata_2.18.0              
## [31] DelayedArray_0.4.0         ShortRead_1.36.0          
## [33] bit_1.1-12                 Rsamtools_1.30.0          
## [35] gplots_3.0.1               digest_0.6.12             
## [37] stringi_1.1.5              dplyr_0.7.4               
## [39] GenomicRanges_1.30.0       grid_3.4.2                
## [41] rprojroot_1.2              tools_3.4.2               
## [43] bitops_1.0-6               magrittr_1.5              
## [45] lazyeval_0.2.1             RCurl_1.95-4.8            
## [47] tibble_1.3.4               RSQLite_2.0               
## [49] pkgconfig_2.0.1            assertthat_0.2.0          
## [51] rmarkdown_1.6              iterators_1.0.8           
## [53] R6_2.2.2                   GenomicAlignments_1.14.0  
## [55] compiler_3.4.2