Overview

The metagenomeFeatures package and associated annotation packages (e.g. greengene13.5MgDb) can be used to and taxonomic annotations to MRexperiment objects. The metagenomeSeq package includes the MRexperiment-class, and has a number of methods for analyzing marker-gene metagenome datasets. This vignette demonstrate how to add taxonomic annotation to a MRexperiment-class object using metagenomeFeatures.

To generate a MRexperiment object with taxonomic information you need;

For MRexperiment objects, the assayData and phenoData need to have the sample sample order similarly assayData and featureData need to have the sample OTU order.

Required Packages

library(metagenomeFeatures)
library(metagenomeSeq)

Loading assayData

A file with count data from the msd16s study is included in the metagenomeFeatures package.

dataDirectory <- system.file("extdata", package = "metagenomeFeatures")
assay_data <- load_meta(file.path(dataDirectory, "msd16s_counts.csv"),sep = ",")

Loading phenoData

pheno_data <- AnnotatedDataFrame(read.csv(file.path(dataDirectory, 
                                                    "msd16s_sample_data.csv"), 
                                          row.names = 1, stringsAsFactors = F))

featureData

Generating a metagenomeAnnotation-class object

The msd16s_query_df is a dataframe with greengenes taxonomic ids for the msd16s data.

data("msd16s_query_df")
head(msd16s_query_df)
##     Keys OTU
## 1 228054  54
## 2 228057  94
## 3 228050 113
## 4 228052 117
## 5  73399 145
## 6  11545 202

A subset of the greengenes database, msd16s_MgDb is used to generate a MgDb-object for annotating the msd16s data. This database is only used here for demonstration purposes, the gg13.5MgDb database in the greengenes13.5MgDb package can also be used to generate the metagenomeAnnotation object.

msd16s_feature_data <- annotate(msd16s_MgDb, query_df = msd16s_query_df)
## Joining by: "Keys"
## Using the greengenes13.5MgDb annotation package
# library(greengenes13.5MgDb)
# msd16s_feature_data <- annotate(gg13.5MgDb, query_df = msd16s_query_df)

Adding featureNames and ordering features

Prepare the metagneomeAnnotation object for use in a MRexperiment by defining featureName and ordering the OTUs to that they agree with the assayData. Note that the OTU column in msd16s_feature_data are factors and are first converted to numeric before ordering.

OTU_numbers <- as.numeric(as.character(msd16s_feature_data$OTU))
featureNames(msd16s_feature_data) <- OTU_numbers
ord <-  sort(OTU_numbers)
msd16s_feature_data <- msd16s_feature_data[order(OTU_numbers),]

Creating MRexperiment object

obj = newMRexperiment(assay_data$counts,
                      phenoData=pheno_data,
                      featureData = msd16s_feature_data)
obj
## MRexperiment (storageMode: environment)
## assayData: 13530 features, 992 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: 100259 100262 ... 602385 (992 total)
##   varLabels: Type Country ... Dysentery (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 54 94 ... 275111 (13530 total)
##   fvarLabels: Keys OTU ... Species (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Biostrings_2.38.0        XVector_0.10.0          
##  [3] IRanges_2.4.0            S4Vectors_0.8.0         
##  [5] metagenomeSeq_1.12.0     RColorBrewer_1.1-2      
##  [7] glmnet_2.0-2             foreach_1.4.3           
##  [9] Matrix_1.2-2             limma_3.26.0            
## [11] metagenomeFeatures_1.0.0 Biobase_2.30.0          
## [13] BiocGenerics_0.16.0     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1                formatR_1.2.1             
##  [3] futile.logger_1.4.1        GenomeInfoDb_1.6.0        
##  [5] bitops_1.0-6               futile.options_1.0.0      
##  [7] iterators_1.0.8            tools_3.2.2               
##  [9] zlibbioc_1.16.0            digest_0.6.8              
## [11] evaluate_0.8               RSQLite_1.0.0             
## [13] lattice_0.20-33            DBI_0.3.1                 
## [15] yaml_2.1.13                dplyr_0.4.3               
## [17] stringr_1.0.0              hwriter_1.3.2             
## [19] knitr_1.11                 caTools_1.17.1            
## [21] gtools_3.5.0               grid_3.2.2                
## [23] R6_2.1.1                   BiocParallel_1.4.0        
## [25] rmarkdown_0.8.1            gdata_2.17.0              
## [27] latticeExtra_0.6-26        lambda.r_1.1.7            
## [29] magrittr_1.5               gplots_2.17.0             
## [31] matrixStats_0.14.2         Rsamtools_1.22.0          
## [33] codetools_0.2-14           htmltools_0.2.6           
## [35] GenomicRanges_1.22.0       GenomicAlignments_1.6.0   
## [37] assertthat_0.1             ShortRead_1.28.0          
## [39] SummarizedExperiment_1.0.0 KernSmooth_2.23-15        
## [41] stringi_0.5-5              lazyeval_0.1.10