curatedMetagenomicData vignette: microbial taxonomy and functional data for the human microbiome

19 January 2018

Abstract

The curatedMetagenomicData package provides taxonomic, functional, and gene marker abundance for samples collected from different human body sites. It provides processed data from whole-metagenome sequencing for thousands of human microbiome samples, including the Human Microbiome Project. Microbiome data and associated subject, specimen, and sequencing information are integrated as Bioconductor ExpressionSet objects, allowing straightforward analyses. A conversion function also links taxonomic datasets to the phyloseq package for ecological analyses.

1 What `curatedMetagenomicData` provides

curatedMetagenomicData provides 6 types of data for each dataset:

1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
2. Presence of unique, clade-specific markers
3. Abundance of unique, clade-specific markers
4. Abundance of gene families
5. Metabolic pathway coverage
6. Metabolic pathway abundance

Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.

Currently, curatedMetagenomicData provides:

5,716 samples from 26 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor ExpressionSet objects
~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
Processing of data through the MetaPhlAn2 pipeline for taxonomic abundance, and HUMAnN2 pipeline for metabolic analysis
This effort required analyzing ~100 TB of raw sequencing data

These datasets are documented in the reference manual.

2 Using `curatedMetagenomicData` Resources

Bioconductor’s ExperimentHub platform simplifies access to the resources in curatedMetagenomicData. First, install curatedMetagenomicData using BiocInstaller and load it as a library. Because the package is aware of the resources present in ExperimentHub S3 buckets, each dataset can then be called as a function.

# BiocInstaller::biocLite("curatedMetagenomicData")  #Bioconductor version
# BiocInstaller::biocLite("waldronlab/curatedMetagenomicData")  #bleeding edge version
suppressPackageStartupMessages(library(curatedMetagenomicData))

2.1 Available samples and metadata

The manually curated metadata for all available samples are provided in a single table combined_metadata:

?combined_metadata
View(combined_metadata)

table(combined_metadata$antibiotics_current_use)

## 
##   no  yes 
## 1855  558

table(combined_metadata$disease)

## 
##                                   AD                                AD;AR 
##                                   10                                   16 
##                            AD;asthma                         AD;asthma;AR 
##                                    4                                    8 
##                                   AR                               CDI;NA 
##                                    2                                   14 
##                       CDI;cellulitis                   CDI;osteoarthritis 
##                                    2                                    1 
##                        CDI;pneumonia                    CDI;ureteralstone 
##                                   15                                    1 
##                                  CRC                 CRC;T2D;hypertension 
##                                  258                                    2 
##                      CRC;fatty_liver     CRC;fatty_liver;T2D;hypertension 
##                                    3                                    4 
##         CRC;fatty_liver;hypertension                     CRC;hypertension 
##                                   12                                   12 
##                                  IBD                                  IGT 
##                                  148                                   49 
##                                   NK                                 STEC 
##                                    1                                   43 
##                                  T1D                      T1D;coeliac;IBD 
##                                   24                                    3 
##                                  T2D                     T2D;hypertension 
##                                  223                                    1 
##                      abdominalhernia                              adenoma 
##                                    2                                   77 
##                          adenoma;T2D             adenoma;T2D;hypertension 
##                                    1                                    2 
##                  adenoma;fatty_liver              adenoma;fatty_liver;T2D 
##                                   12                                    1 
## adenoma;fatty_liver;T2D;hypertension     adenoma;fatty_liver;hypertension 
##                                    1                                   14 
##                 adenoma;hypertension                            arthritis 
##                                    8                                   11 
##                            asthma;AR                           bronchitis 
##                                    8                                   18 
##                           cellulitis                            cirrhosis 
##                                    6                                    8 
##                        cirrhosis;HBV                    cirrhosis;HBV;HDV 
##                                   39                                    3 
##            cirrhosis;HBV;HDV;ascites                     cirrhosis;HBV;HE 
##                                    2                                    2 
##                    cirrhosis;HBV;HEV            cirrhosis;HBV;HEV;ascites 
##                                    3                                    4 
##                cirrhosis;HBV;ascites    cirrhosis;HBV;schistosoma;ascites 
##                                   44                                    1 
##         cirrhosis;HBV;wilson;ascites                        cirrhosis;HCV 
##                                    1                                    1 
##                     cirrhosis;HEV;HE                    cirrhosis;ascites 
##                                    1                                    9 
##        cirrhosis;pbcirrhosis;ascites    cirrhosis;schistosoma;HEV;ascites 
##                                    1                                    2 
##        cirrhosis;schistosoma;ascites             cirrhosis;wilson;ascites 
##                                    1                                    1 
##     coeliac;gestational_diabetes;CMV                                cough 
##                                    2                                    2 
##                             cystitis                          fatty_liver 
##                                    1                                    8 
##                      fatty_liver;T2D         fatty_liver;T2D;hypertension 
##                                    3                                    9 
##             fatty_liver;hypertension                                fever 
##                                   13                                    3 
##                             gangrene                              healthy 
##                                    1                                 2483 
##                            hepatitis                         hypertension 
##                                    3                                   11 
##            infectiousgastroenteritis                                 none 
##                                    5                                  260 
##                       osteoarthritis                               otitis 
##                                    1                                  107 
##                            pneumonia                            psoriasis 
##                                    7                                   36 
##                  psoriasis;arthritis                        pyelonefritis 
##                                   12                                    2 
##                       pyelonephritis                       respiratoryinf 
##                                    6                                   13 
##                        salmonellosis                        schizophrenia 
##                                    1                                   12 
##                     schizophrenia;CD                    schizophrenia;T2D 
##                                    1                                    3 
##                               sepsis                              skininf 
##                                    1                                    2 
##                           stomatitis                              suspinf 
##                                    2                                    1 
##                          tonsillitis 
##                                    3
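Several disease entries above combine conditions with semicolons (for example CRC;T2D;hypertension), so a plain table() counts each combination separately. The following is a minimal sketch of tallying individual conditions instead. It uses a small made-up vector, disease_toy, in place of combined_metadata$disease; the values and the variable names are illustrative assumptions only.

```r
# Toy stand-in for combined_metadata$disease; values are illustrative only
disease_toy <- c("CRC", "CRC;T2D;hypertension", "T2D", "healthy", NA,
                 "T2D;hypertension")

# Drop missing entries, split each entry on ";", and count every condition
conditions <- unlist(strsplit(disease_toy[!is.na(disease_toy)], ";",
                              fixed = TRUE))
condition_counts <- table(conditions)
condition_counts
```

Replacing disease_toy with combined_metadata$disease should give per-condition totals for the full table, assuming the semicolon is the only delimiter used.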

2.1.1 Read depth of all samples across all studies

combined_metadata also provides technical information for each sample, such as sequencing platform, read length, and read depth. The following uses combined_metadata to create a boxplot of read depth by study, with boxes colored by body site. First, create a ranking of datasets by median read depth:

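suppressPackageStartupMessages(library(dplyr))  # provides %>%, group_by(), summarize(), mutate(), arrange()
dsranking <- combined_metadata %>%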
  group_by(dataset_name) %>%
  summarize(mediandepth = median(number_reads) / 1e6) %>%
  mutate(dsorder = rank(mediandepth)) %>%
  arrange(dsorder)
dsranking

## # A tibble: 28 x 3
##    dataset_name    mediandepth dsorder
##    <chr>                 <dbl>   <dbl>
##  1 TettAJ_2016            1.19    1.00
##  2 HanniganGD_2017        5.86    2.00
##  3 LomanNJ_2013           6.66    3.00
##  4 OhJ_2014              10.5     4.00
##  5 ChngKR_2016           14.1     5.00
##  6 VincentC_2016         15.0     6.00
##  7 RampelliS_2015        15.2     7.00
##  8 VatanenT_2016         19.1     8.00
##  9 AsnicarF_2017         21.1     9.00
## 10 KarlssonFH_2013       27.8    10.0 
## # ... with 18 more rows

Create a factor ds that is ordered according to the ranking by median read depth, to show datasets in order from lowest to highest median read depth, then create the box plot.

suppressPackageStartupMessages(library(ggplot2))
mutate(combined_metadata, 
       ds = factor(combined_metadata$dataset_name, levels=dsranking$dataset_name)) %>%
  ggplot(aes(ds, number_reads / 1e6, fill=body_site)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  labs(x="Dataset", y="Read Depth (millions)")

In datasets that provide both stool and oral cavity profiles, the oral cavity profiles have a much lower read depth, because a higher proportion of their reads are human DNA and are removed.
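The ranking step can also be written in base R. The sketch below is an illustration on a made-up data frame, toy_meta, that borrows the column names dataset_name and number_reads from combined_metadata; the dataset names and read counts are assumptions, not real values.

```r
# Toy stand-in for combined_metadata (column names follow the text; values are made up)
toy_meta <- data.frame(
  dataset_name = c("A_2015", "A_2015", "B_2016", "B_2016", "C_2017", "C_2017"),
  number_reads = c(30e6, 50e6, 2e6, 4e6, 10e6, 14e6)
)

# Median read depth (in millions) per dataset, sorted from lowest to highest
median_depth <- sort(tapply(toy_meta$number_reads, toy_meta$dataset_name,
                            median) / 1e6)

# Factor whose level order follows the ranking, so plots show datasets in that order
toy_meta$ds <- factor(toy_meta$dataset_name, levels = names(median_depth))
levels(toy_meta$ds)
```

Plotting functions order discrete axes by factor levels, which is why the level order, rather than the row order, controls the layout of the boxplot.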

3 Accessing datasets

Individual data projects can be fetched via per-dataset functions or the curatedMetagenomicData() function. A function call to a dataset name returns a Bioconductor ExpressionSet object:

suppressPackageStartupMessages(library(curatedMetagenomicData))
loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()

However, the above approach lacks additional options and versioning provided by the curatedMetagenomicData function, which returns a list of datasets. In this case the list has length 1:

loman <- curatedMetagenomicData("LomanNJ_2013.metaphlan_bugs_list.stool", dryrun = FALSE)

## Working on LomanNJ_2013.metaphlan_bugs_list.stool

## snapshotDate(): 2017-10-30

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## loading from cache '/home/biocbuild//.ExperimentHub/451'

loman

## List of length 1
## names(1): LomanNJ_2013.metaphlan_bugs_list.stool

loman.eset <- loman[[1]]
loman.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 736 features, 43 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: OBK1122 OBK1196 ... OBK2535b (43 total)
##   varLabels: subjectID body_site ... NCBI_accession (20 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23571589 
## Annotation:

The following creates a list of two ExpressionSet objects providing the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles:

oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
esl
esl[[1]]
esl[[2]]

And the following would provide all stool metaphlan datasets if dryrun = FALSE were set:

curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)

## Dry run: see return values for datasets that would be downloaded. Run with `dryrun=FALSE` to actually download these datasets.

##  [1] "AsnicarF_2017.metaphlan_bugs_list.stool"       
##  [2] "BritoIL_2016.metaphlan_bugs_list.stool"        
##  [3] "FengQ_2015.metaphlan_bugs_list.stool"          
##  [4] "HMP_2012.metaphlan_bugs_list.stool"            
##  [5] "HanniganGD_2017.metaphlan_bugs_list.stool"     
##  [6] "Heitz-BuschartA_2016.metaphlan_bugs_list.stool"
##  [7] "KarlssonFH_2013.metaphlan_bugs_list.stool"     
##  [8] "LeChatelierE_2013.metaphlan_bugs_list.stool"   
##  [9] "LiJ_2014.metaphlan_bugs_list.stool"            
## [10] "LiuW_2016.metaphlan_bugs_list.stool"           
## [11] "LomanNJ_2013.metaphlan_bugs_list.stool"        
## [12] "NielsenHB_2014.metaphlan_bugs_list.stool"      
## [13] "Obregon-TitoAJ_2015.metaphlan_bugs_list.stool" 
## [14] "QinJ_2012.metaphlan_bugs_list.stool"           
## [15] "QinN_2014.metaphlan_bugs_list.stool"           
## [16] "RampelliS_2015.metaphlan_bugs_list.stool"      
## [17] "RaymondF_2016.metaphlan_bugs_list.stool"       
## [18] "SchirmerM_2016.metaphlan_bugs_list.stool"      
## [19] "VatanenT_2016.metaphlan_bugs_list.stool"       
## [20] "VincentC_2016.metaphlan_bugs_list.stool"       
## [21] "VogtmannE_2016.metaphlan_bugs_list.stool"      
## [22] "XieH_2016.metaphlan_bugs_list.stool"           
## [23] "YuJ_2015.metaphlan_bugs_list.stool"            
## [24] "ZellerG_2014.metaphlan_bugs_list.stool"
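To see how a wildcard such as the one above selects dataset names, the pattern can be tried on a few names copied from the listings in this section. The sketch uses utils::glob2rx() to turn the wildcard into a regular expression. Whether curatedMetagenomicData() performs exactly this translation internally is an assumption here.

```r
# A few dataset names copied from the listings above
dataset_names <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
                   "BritoIL_2016.metaphlan_bugs_list.stool",
                   "LomanNJ_2013.metaphlan_bugs_list.stool",
                   "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")

# Translate the wildcard into a regular expression, then filter the names
pattern <- utils::glob2rx("*metaphlan_bugs_list.stool*")
stool_names <- grep(pattern, dataset_names, value = TRUE)
stool_names
```

Only the two stool entries are returned; the oral cavity entries do not contain "metaphlan_bugs_list.stool".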

3.1 Merging multiple datasets

The following merges the two oral cavity datasets downloaded above into a single ExpressionSet.

eset <- mergeData(esl)
eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1914 features, 172 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.1.SA
##     BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.10.SA ...
##     Castro-NallarE_2015.metaphlan_bugs_list.oralcavity:ES_080 (172
##     total)
##   varLabels: subjectID body_site ... studyID (20 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

This works for any number of datasets. The function will not stop you from merging different data types (e.g. metaphlan bugs lists with gene families), but you probably don’t want to do that.
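The idea of combining datasets with different sets of taxa can be illustrated on plain matrices. The sketch below is not the implementation of mergeData(). It assumes that a taxon not reported in a study has zero abundance there, and the matrices study1 and study2 and the helper expand() are made up for illustration.

```r
# Two toy abundance matrices (taxa in rows, samples in columns); values are made up
study1 <- matrix(c(60, 40, 70, 30), nrow = 2,
                 dimnames = list(c("taxonA", "taxonB"), c("s1", "s2")))
study2 <- matrix(c(20, 80), nrow = 2,
                 dimnames = list(c("taxonB", "taxonC"), "s3"))

# Union of taxa across both studies
all_taxa <- union(rownames(study1), rownames(study2))

# Hypothetical helper: pad a matrix to the full taxon list, filling gaps with 0
expand <- function(m) {
  out <- matrix(0, nrow = length(all_taxa), ncol = ncol(m),
                dimnames = list(all_taxa, colnames(m)))
  out[rownames(m), ] <- m
  out
}

merged <- cbind(expand(study1), expand(study2))
merged
```

The merged matrix has one row per taxon seen in either study and one column per sample, which matches the shape of the merged ExpressionSet shown above (more features than either input, and the samples of both).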

4 Using `ExpressionSet` Objects

All datasets are represented as ExpressionSet objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the experimentData() function is used to return a MIAME (Minimum Information About a Microarray Experiment) object.

experimentData( loman.eset )

## Experiment data
##   Experimenter name: Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ 
##   Laboratory: Institute of Microbiology and Infection, University of Birmingham, Birmingham, England. 
##   Contact information:  
##   Title: A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. 
##   URL:  
##   PMIDs: 23571589 
## 
##   Abstract: A 308 word abstract is available. Use 'abstract' method.
##   notes:
##    Sequencing platform:     
##       IlluminaHiSeq

To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.

head( pData( loman.eset ) )

##         subjectID body_site antibiotics_current_use study_condition disease
## OBK1122   OBK1122     stool                    <NA>            STEC    STEC
## OBK1196   OBK1196     stool                    <NA>            STEC    STEC
## OBK1253   OBK1253     stool                    <NA>            STEC    STEC
## OBK2535   OBK2535     stool                    <NA>            STEC    STEC
## OBK2638   OBK2638     stool                    <NA>            STEC    STEC
## OBK2661   OBK2661     stool                    <NA>            STEC    STEC
##         age age_category gender country non_westernized DNA_extraction_kit
## OBK1122  NA        adult   <NA>     DEU              no             Qiagen
## OBK1196  NA        adult   <NA>     DEU              no             Qiagen
## OBK1253  NA        adult   <NA>     DEU              no             Qiagen
## OBK2535  NA        adult   <NA>     DEU              no             Qiagen
## OBK2638  NA        adult   <NA>     DEU              no             Qiagen
## OBK2661  NA        adult   <NA>     DEU              no             Qiagen
##         number_reads number_bases minimum_read_length median_read_length
## OBK1122       464060       464060                 150                150
## OBK1196     30181380     30181380                 150                150
## OBK1253     65704818     65704818                 150                150
## OBK2535     43597736     43597736                 150                150
## OBK2638      1105492      1105492                 150                150
## OBK2661     16760944     16760944                 150                150
##         days_after_onset stec_count shigatoxin_2_elisa stool_texture
## OBK1122               NA       <NA>               <NA>          <NA>
## OBK1196               NA       <NA>               <NA>          <NA>
## OBK1253               NA       <NA>               <NA>          <NA>
## OBK2535                3   moderate           positive        smooth
## OBK2638                5       high           positive        bloody
## OBK2661                7        low           negative        watery
##         NCBI_accession
## OBK1122           <NA>
## OBK1196           <NA>
## OBK1253           <NA>
## OBK2535           <NA>
## OBK2638           <NA>
## OBK2661           <NA>

To access the data itself (in this case relative abundance), the exprs() function returns a variables by samples (rows by columns) numeric matrix. Note the presence of “synthetic” clades at all levels of the taxonomy, starting with kingdom, e.g. k__Bacteria here:

exprs( loman.eset )[1:6, 1:5]  #first 6 rows and 5 columns

##                                               OBK1122   OBK1196   OBK1253
## k__Bacteria                                 100.00000 100.00000 100.00000
## k__Bacteria|p__Bacteroidetes                 83.55662  30.41764  79.41203
## k__Bacteria|p__Firmicutes                    15.14819  68.30191  19.28784
## k__Bacteria|p__Proteobacteria                 1.29520   0.06303   0.97218
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia  83.55662  30.41764  79.41203
## k__Bacteria|p__Firmicutes|c__Clostridia      14.93141  33.22720  11.26228
##                                              OBK2535   OBK2638
## k__Bacteria                                 99.39953 100.00000
## k__Bacteria|p__Bacteroidetes                15.14525   7.39179
## k__Bacteria|p__Firmicutes                   25.71252   0.80938
## k__Bacteria|p__Proteobacteria               58.53298  91.58526
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 15.14525   7.39179
## k__Bacteria|p__Firmicutes|c__Clostridia     16.50313   0.51321

Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.

4.1 Estimating Absolute Raw Count Data

Absolute raw count data can be estimated from the relative abundance data by multiplying the columns of the ExpressionSet data by the number of reads for each sample, as found in the pData column “number_reads”. For demo purposes you could (but don’t have to!) do this manually by dividing by 100 and multiplying by the number of reads:

loman.counts = sweep(exprs( loman.eset ), 2, loman.eset$number_reads / 100, "*")
loman.counts = round(loman.counts)
loman.counts[1:6, 1:5]

##                                             OBK1122  OBK1196  OBK1253
## k__Bacteria                                  464060 30181380 65704818
## k__Bacteria|p__Bacteroidetes                 387753  9180464 52177530
## k__Bacteria|p__Firmicutes                     70297 20614459 12673040
## k__Bacteria|p__Proteobacteria                  6011    19023   638769
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia  387753  9180464 52177530
## k__Bacteria|p__Firmicutes|c__Clostridia       69291 10028427  7399861
##                                              OBK2535 OBK2638
## k__Bacteria                                 43335945 1105492
## k__Bacteria|p__Bacteroidetes                 6602986   81716
## k__Bacteria|p__Firmicutes                   11210077    8948
## k__Bacteria|p__Proteobacteria               25519054 1012468
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia  6602986   81716
## k__Bacteria|p__Firmicutes|c__Clostridia      7194991    5673

or just set the counts argument in curatedMetagenomicData() to TRUE:

loman.eset2 = curatedMetagenomicData("LomanNJ_2013.metaphlan_bugs_list.stool",
                                     counts = TRUE, dryrun = FALSE)[[1]]

## Working on LomanNJ_2013.metaphlan_bugs_list.stool

## snapshotDate(): 2017-10-30

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## loading from cache '/home/biocbuild//.ExperimentHub/451'

all.equal(exprs(loman.eset2), loman.counts)

## [1] TRUE
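The same arithmetic is easy to verify by hand on a small example. The sketch below uses a made-up matrix of percentages, rel_abund, and made-up read depths, number_reads, so that every result can be checked mentally.

```r
# Toy relative abundances in percent (taxa in rows, samples in columns)
rel_abund <- matrix(c(100, 80, 20,
                      100, 25, 75), nrow = 3,
                    dimnames = list(c("k__Bacteria", "p__A", "p__B"),
                                    c("s1", "s2")))
number_reads <- c(s1 = 1000, s2 = 400)   # made-up read depths

# Multiply each column by its read depth / 100, then round to whole reads
est_counts <- round(sweep(rel_abund, 2, number_reads / 100, "*"))
est_counts
```

For sample s1 this gives 1000, 800 and 200 reads; for s2 it gives 400, 100 and 300.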

5 E. coli prevalence

Here’s a direct, exploratory analysis of E. coli prevalence in the Loman dataset using the ExpressionSet object. More elegant solutions will be provided later using subsetting methods provided by the phyloseq package, but for users familiar with grep() and the ExpressionSet object, such manual methods may suffice.

First, which E. coli-related taxa are available?

grep("coli", rownames(loman.eset), value=TRUE)

## [1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli"                                 
## [2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__Escherichia_coli_unclassified"
## [3] "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis"                                          
## [4] "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis|t__GCF_000154565"

Create a vector of E. coli relative abundances. This grep call with a “$” at the end selects the only row that ends with “s__Escherichia_coli”:

x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset)), ]
summary( x )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6047  2.3710 13.6897 11.8291 90.4474

This could be plotted as a histogram:

hist( x, xlab = "Relative Abundance", main="Prevalence of E. coli",
      breaks="FD")
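The effect of the “$” anchor can be checked without downloading anything, using three of the row names printed above as a plain character vector:

```r
# Row names copied from the grep() output above
taxa <- c(
  "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli",
  "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__Escherichia_coli_unclassified",
  "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis"
)

# Without an anchor, "coli" also matches the strain row and Anaerotruncus colihominis
unanchored <- grep("coli", taxa)

# With "$", only the row that ends at the species level matches
anchored <- grep("s__Escherichia_coli$", taxa)

unanchored
anchored
```

The unanchored pattern matches all three rows, while the anchored pattern matches only the first.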

6 Taxonomy-Aware Analysis using `phyloseq`

For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of taxonomy-aware, ecological analysis and plotting by conversion to a phyloseq class object. curatedMetagenomicData provides the ExpressionSet2phyloseq() function to make this easy:

suppressPackageStartupMessages(library(phyloseq))
loman.pseq = ExpressionSet2phyloseq( loman.eset )

Note the simplified row names of the OTU table, showing only the most detailed level of the taxonomy. This results from the default argument simplify=TRUE, which is convenient and lossless because taxonomic information is now attainable by tax_table(loman.pseq).
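As an illustration of what simplified names look like, the sketch below shortens full lineage strings to their last rank with a regular expression and recovers the ranks by splitting on "|". This is only a sketch of the idea on made-up input; it is not the code used by ExpressionSet2phyloseq().

```r
# Full lineage strings in the style of the ExpressionSet row names
full_names <- c("k__Bacteria",
                "k__Bacteria|p__Firmicutes",
                "k__Bacteria|p__Firmicutes|c__Clostridia")

# Drop everything up to and including the last "|" to keep the most detailed rank
short_names <- sub(".*\\|", "", full_names)
short_names

# Recover the individual ranks by splitting on "|" (fixed = TRUE treats "|" literally)
lineage <- strsplit(full_names, "|", fixed = TRUE)
lineage[[3]]
```

Shortening is safe only because the full lineage is kept elsewhere, which is the role the taxonomy table plays in the phyloseq object.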

6.1 Phylogenetic Trees and UniFrac Distances

Set phylogenetictree = TRUE to include a phylogenetic tree in the phyloseq object:

loman.tree <- ExpressionSet2phyloseq( loman.eset, phylogenetictree = TRUE)

wt = UniFrac(loman.tree, weighted=TRUE, normalized=FALSE, 
             parallel=FALSE, fast=TRUE)
plot(hclust(wt), main="Weighted UniFrac distances")

6.2 Components of a phyloseq object

This phyloseq object contains 3 components, with extractor functions hinted at by its show method:

loman.pseq

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 736 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 736 taxa by 8 taxonomic ranks ]

otu_table() returns the same thing as exprs( loman.eset ) did, the Operational Taxonomic Unit (OTU) table. Here are the first 6 rows and 5 columns:

otu_table( loman.pseq )[1:6, 1:5]

## OTU Table:          [6 taxa and 5 samples]
##                      taxa are rows
##                     OBK1122   OBK1196   OBK1253  OBK2535   OBK2638
## k__Bacteria       100.00000 100.00000 100.00000 99.39953 100.00000
## p__Bacteroidetes   83.55662  30.41764  79.41203 15.14525   7.39179
## p__Firmicutes      15.14819  68.30191  19.28784 25.71252   0.80938
## p__Proteobacteria   1.29520   0.06303   0.97218 58.53298  91.58526
## c__Bacteroidia     83.55662  30.41764  79.41203 15.14525   7.39179
## c__Clostridia      14.93141  33.22720  11.26228 16.50313   0.51321

The same patient or participant data that was available from pData( loman.eset ) is now available using sample_data() on the phyloseq object:

sample_data( loman.pseq )[1:6, 1:5]

##         subjectID body_site antibiotics_current_use study_condition disease
## OBK1122   OBK1122     stool                    <NA>            STEC    STEC
## OBK1196   OBK1196     stool                    <NA>            STEC    STEC
## OBK1253   OBK1253     stool                    <NA>            STEC    STEC
## OBK2535   OBK2535     stool                    <NA>            STEC    STEC
## OBK2638   OBK2638     stool                    <NA>            STEC    STEC
## OBK2661   OBK2661     stool                    <NA>            STEC    STEC

But this object also is aware of the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.

head( tax_table( loman.pseq ) )

## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##                   Kingdom    Phylum           Class         Order Family
## k__Bacteria       "Bacteria" NA               NA            NA    NA    
## p__Bacteroidetes  "Bacteria" "Bacteroidetes"  NA            NA    NA    
## p__Firmicutes     "Bacteria" "Firmicutes"     NA            NA    NA    
## p__Proteobacteria "Bacteria" "Proteobacteria" NA            NA    NA    
## c__Bacteroidia    "Bacteria" "Bacteroidetes"  "Bacteroidia" NA    NA    
## c__Clostridia     "Bacteria" "Firmicutes"     "Clostridia"  NA    NA    
##                   Genus Species Strain
## k__Bacteria       NA    NA      NA    
## p__Bacteroidetes  NA    NA      NA    
## p__Firmicutes     NA    NA      NA    
## p__Proteobacteria NA    NA      NA    
## c__Bacteroidia    NA    NA      NA    
## c__Clostridia     NA    NA      NA

6.3 Subsetting / Pruning

The process of subsetting begins with the names of taxonomic ranks:

rank_names( loman.pseq )

## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
## [8] "Strain"

Taxa can be filtered by these rank names. For example, to return an object with only species and strains:

subset_taxa( loman.pseq, !is.na(Species))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 535 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 535 taxa by 8 taxonomic ranks ]

To keep only phylum-level data (not class or finer, and not kingdom-level):

subset_taxa( loman.pseq, is.na(Class) & !is.na(Phylum))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 8 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 8 taxa by 8 taxonomic ranks ]

Or, to keep only the Bacteroidetes phylum. Note that taxa names have been shortened from the rownames of the ExpressionSet object, for nicer plotting.

loman.bd = subset_taxa( loman.pseq, Phylum == "Bacteroidetes")
head( taxa_names( loman.bd ) )

## [1] "p__Bacteroidetes"      "c__Bacteroidia"        "o__Bacteroidales"     
## [4] "f__Bacteroidaceae"     "f__Porphyromonadaceae" "f__Rikenellaceae"
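The logic behind these rank-based filters can be seen on a plain character matrix laid out like the taxonomy table above, where NA marks a rank that a row does not reach. The matrix tax below is a made-up miniature; this is a base R sketch of the filtering conditions, not of subset_taxa() itself.

```r
# Toy taxonomy table in the same layout as tax_table() output (NA = rank not reached)
tax <- matrix(c("Bacteria", NA,              NA,
                "Bacteria", "Bacteroidetes", NA,
                "Bacteria", "Firmicutes",    NA,
                "Bacteria", "Bacteroidetes", "Bacteroidia"),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("k__Bacteria", "p__Bacteroidetes",
                                "p__Firmicutes", "c__Bacteroidia"),
                              c("Kingdom", "Phylum", "Class")))

# Phylum-level rows: Phylum is filled in, Class is not
phylum_only <- rownames(tax)[!is.na(tax[, "Phylum"]) & is.na(tax[, "Class"])]
phylum_only

# All rows within Bacteroidetes; which() drops the NA produced by the kingdom row
bacteroidetes <- rownames(tax)[which(tax[, "Phylum"] == "Bacteroidetes")]
bacteroidetes
```

Selecting on a rank therefore returns that rank and everything nested below it, which is why the Bacteroidetes subset above includes class, order and family rows.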

6.4 Advanced Pruning

The phyloseq package provides advanced pruning of taxa, such as the following which keeps only taxa that are among the most abundant 5% in at least five samples:

keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)

##    Mode   FALSE    TRUE 
## logical     624     112

subset_taxa(loman.pseq, keepotu)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 112 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 112 taxa by 8 taxonomic ranks ]

Note that phyloseq also provides topk() for selecting the most abundant k taxa, and other functions for advanced pruning of taxa.
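The rule "among the most abundant fraction p in at least A samples" can be sketched in base R as follows. This is an illustration of the rule as described in the text, on a made-up matrix, and not a copy of phyloseq's topp() or genefilter_sample(). The helper in_top_fraction() and its handling of ties are assumptions.

```r
# Toy abundance matrix: 10 taxa (rows) by 3 samples (columns); values are made up
abund <- cbind(s1 = c(50, 20, 10, 5, 5, 4, 3, 1, 1, 1),
               s2 = c(45, 6, 25, 5, 5, 5, 4, 2, 2, 1),
               s3 = c(5, 60, 4, 10, 8, 5, 3, 2, 2, 1))
rownames(abund) <- paste0("taxon", 1:10)

# Hypothetical helper: flag the taxa in the top fraction p of one sample
in_top_fraction <- function(x, p) {
  cutoff <- sort(x, decreasing = TRUE)[ceiling(length(x) * p)]
  x >= cutoff   # taxa tied with the cutoff value are also flagged
}

# Apply the per-sample filter, then keep taxa flagged in at least A = 2 samples
flags <- apply(abund, 2, in_top_fraction, p = 0.25)
keep <- rowSums(flags) >= 2
names(keep)[keep]
```

With 10 taxa and p = 0.25, the cutoff in each sample is its third-largest value, so three taxa are flagged per sample and only taxon1, taxon2 and taxon3 are flagged in two or more samples.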

6.5 Taxonomy Heatmap

The phyloseq package provides the plot_heatmap() function to create heatmaps using a variety of built-in dissimilarity metrics for clustering. Here, we apply the same abundance filter as above and keep only strain-level OTUs. This function supports a large number of distance and ordination methods; here we use Bray-Curtis dissimilarity for distance and PCoA as the ordination method for organizing the heatmap.

loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Strain))
plot_heatmap(loman.filt, method="PCoA", distance="bray")

6.6 Taxonomy Histogram

Here we plot the 20 most abundant species (not strains), ranked by the sum of abundance across all samples in the dataset and shown as mean abundance per sample:

loman.sp = subset_taxa(loman.pseq, !is.na(Species) & is.na(Strain))
par(mar = c(20, 4, 0, 0) + 0.15) #increase margin size on the bottom
barplot(sort(taxa_sums(loman.sp), TRUE)[1:20] / nsamples(loman.sp),
        ylab = "Mean relative abundance (%)", las = 2)  # sum / nsamples is a per-sample mean

6.7 Alpha Diversity Estimation

The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity measures in the species-level data, stratifying by stool texture:

alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.sp, "stool_texture", measures = alphas)

Let’s compare these three alpha diversity measures:

pairs( estimate_richness(loman.sp, measures = alphas) )
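For reference, the three measures can be computed by hand from a vector of abundances. The sketch below uses the common textbook definitions (natural-log Shannon index, Gini-Simpson index, and inverse Simpson index). That estimate_richness() uses exactly these definitions is an assumption, and the function alpha_measures() and the sample toy_sample are made up for illustration.

```r
# Hypothetical helper computing three alpha diversity measures for one sample
alpha_measures <- function(x) {
  p <- x[x > 0] / sum(x)             # proportions of the taxa that are present
  sum_sq <- sum(p^2)
  c(Shannon    = -sum(p * log(p)),   # natural-log Shannon index
    Simpson    = 1 - sum_sq,         # Gini-Simpson index
    InvSimpson = 1 / sum_sq)         # inverse Simpson index
}

# Toy sample with proportions 0.5, 0.25, 0.25 and one absent taxon
toy_sample <- c(taxonA = 50, taxonB = 25, taxonC = 25, taxonD = 0)
toy_alpha <- alpha_measures(toy_sample)
round(toy_alpha, 4)
```

Here the sum of squared proportions is 0.375, so Simpson is 0.625 and InvSimpson is about 2.6667, while Shannon equals 1.5 * log(2), about 1.0397. Simpson and InvSimpson are monotone functions of the same quantity, which helps explain why they track each other closely in a pairs plot.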

6.8 Beta Diversity / Dissimilarity Clustering

Numerous beta diversity / dissimilarity measures are provided by the distance() function when provided a phyloseq object, and these can be used for any kind of clustering or classification scheme. For example, here is a hierarchical clustering dendrogram produced by the hclust() function from the base R stats package with “Ward” linkage:

mydist = distance(loman.sp, method="bray")
myhclust = hclust( mydist )
plot(myhclust, main="Bray-Curtis Dissimilarity", 
     method="ward.D", xlab="Samples", sub = "")
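
For readers who want to see what the dissimilarity means, the following base R sketch computes Bray-Curtis by hand on a small hypothetical abundance matrix and clusters the result with Ward linkage. The data and the helper name bray() are invented for illustration; for two samples x and y the dissimilarity is sum(|x - y|) / sum(x + y).

```r
# Hypothetical toy data: 4 samples (rows) by 3 taxa (columns)
abund <- matrix(c(10,  0, 5,
                   8,  2, 5,
                   0,  9, 6,
                   1, 10, 4),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("sample", 1:4), paste0("taxon", 1:3)))

# Bray-Curtis dissimilarity between two abundance vectors
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# All pairwise dissimilarities, converted to a "dist" object
idx <- seq_len(nrow(abund))
dmat <- outer(idx, idx,
              Vectorize(function(i, j) bray(abund[i, ], abund[j, ])))
dimnames(dmat) <- list(rownames(abund), rownames(abund))
d <- as.dist(dmat)

# The linkage method is chosen in hclust(), not in plot()
hc <- hclust(d, method = "ward.D")
plot(hc, main = "Bray-Curtis Dissimilarity (toy data)", xlab = "Samples", sub = "")

# Cut the tree into two groups
cutree(hc, k = 2)
```

In this toy example the first two samples share a dominant taxon, as do the last two, so cutting the tree at two clusters separates those pairs.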

6.9 Ordination Analysis

The phyloseq package provides a variety of ordination methods, with convenient options for labelling points. Here is a Principal Coordinates Analysis plot of species-level taxa from the Loman dataset, using Bray-Curtis distance:

ordinated_taxa = ordinate(loman.sp, method="PCoA", distance="bray")
plot_ordination(loman.sp, ordinated_taxa, color="stool_texture", 
                title = "Bray-Curtis Principal Coordinates Analysis")

plot_scree(ordinated_taxa, title="Screeplot")
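
The idea behind the ordination and scree plot can be sketched in base R with cmdscale(), which performs classical multidimensional scaling (equivalent to Principal Coordinates Analysis). This is a stand-in for illustration, not the code path phyloseq uses, and the abundance matrix and bray() helper are hypothetical.

```r
# Hypothetical toy data: 4 samples (rows) by 3 taxa (columns)
abund <- matrix(c(10,  0, 5,
                   8,  2, 5,
                   0,  9, 6,
                   1, 10, 4),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("sample", 1:4), paste0("taxon", 1:3)))

# Pairwise Bray-Curtis dissimilarities
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)
idx <- seq_len(nrow(abund))
dmat <- outer(idx, idx,
              Vectorize(function(i, j) bray(abund[i, ], abund[j, ])))
dimnames(dmat) <- list(rownames(abund), rownames(abund))

# Classical MDS / PCoA on the dissimilarities, keeping eigenvalues
pcoa <- cmdscale(as.dist(dmat), k = 2, eig = TRUE)

# Bray-Curtis is not Euclidean, so some eigenvalues can be negative;
# here they are truncated at zero before computing proportions
pos_eig <- pmax(pcoa$eig, 0)
prop <- pos_eig / sum(pos_eig)

# Ordination plot of the first two axes, then a simple scree plot
plot(pcoa$points, xlab = "Axis 1", ylab = "Axis 2",
     main = "PCoA of toy data (Bray-Curtis)")
text(pcoa$points, labels = rownames(abund), pos = 3)
barplot(prop, names.arg = seq_along(prop),
        xlab = "Axis", ylab = "Proportion of positive eigenvalue sum",
        main = "Screeplot (toy data)")
```

The scree plot shows how much of the total variation each axis captures, which helps judge whether a two-dimensional plot is an adequate summary.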

7 Using ExperimentHub directly

We recommend that most users use the convenience functions shown above to find curatedMetagenomicData datasets and navigate versions. However, it is also possible to use ExperimentHub directly for consistency with other ExperimentHub packages.

7.1 Browsing ExperimentHub

The convenience functions shown earlier make ExperimentHub transparent, but it can be useful and powerful to interact with ExperimentHub directly. A “hub” connects you to the ExperimentHub server and its metadata, without downloading any data:

suppressPackageStartupMessages(library(ExperimentHub))
eh = ExperimentHub()

## snapshotDate(): 2017-10-30

The following queries ExperimentHub for all records matching curatedMetagenomicData:

myquery = query(eh, "curatedMetagenomicData")

This is an abbreviated list of what was found:

myquery

## ExperimentHub with 384 records
## # snapshotDate(): 2017-10-30 
## # $dataprovider: Department of Psychology, Abdul Haq Campus, Federal Urdu...
## # $species: Homo Sapiens
## # $rdataclass: ExpressionSet
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH178"]]' 
## 
##            title                                              
##   EH178  | HMP_2012.genefamilies_relab.anterior_nares         
##   EH179  | HMP_2012.genefamilies_relab.buccal_mucosa          
##   EH180  | HMP_2012.genefamilies_relab.hard_palate            
##   EH181  | HMP_2012.genefamilies_relab.keratinized_gingiva    
##   EH182  | HMP_2012.genefamilies_relab.l_retroauricular_crease
##   ...      ...                                                
##   EH1024 | 20171006.HanniganGD_2017.marker_abundance.stool    
##   EH1025 | 20171006.HanniganGD_2017.marker_presence.stool     
##   EH1026 | 20171006.HanniganGD_2017.metaphlan_bugs_list.stool 
##   EH1027 | 20171006.HanniganGD_2017.pathabundance_relab.stool 
##   EH1028 | 20171006.HanniganGD_2017.pathcoverage.stool

Note that the first column (“EH178”, etc.) contains the unique ID of each record. For a full list, mcols(myquery) returns a DataFrame. The following is unevaluated in this script, but can be used to display an interactive spreadsheet of the results:

View(mcols(myquery))

Or to write the search results to a csv file:

write.csv(mcols(myquery), file="curatedMetagenomicData_allrecords.csv")

7.2 Advanced searching of ExperimentHub

Tags can (eventually) help identify useful datasets:

head(myquery$tags)

## [1] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [2] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [3] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [4] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [5] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [6] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"

Titles can be used to select body site and data type. Here are the MetaPhlAn2 taxonomy tables for all the available Human Microbiome Project (HMP) body sites, found using a Perl regular expression with two lookaheads: “HMP” must occur in the title, followed by any number of characters “.*”, followed by the word “metaphlan”. The pattern is not anchored to the start of the string, so titles with a date prefix also match:

grep("(?=HMP)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)

##  [1] "HMP_2012.metaphlan_bugs_list.anterior_nares"         
##  [2] "HMP_2012.metaphlan_bugs_list.buccal_mucosa"          
##  [3] "HMP_2012.metaphlan_bugs_list.hard_palate"            
##  [4] "HMP_2012.metaphlan_bugs_list.keratinized_gingiva"    
##  [5] "HMP_2012.metaphlan_bugs_list.l_retroauricular_crease"
##  [6] "HMP_2012.metaphlan_bugs_list.mid_vagina"             
##  [7] "HMP_2012.metaphlan_bugs_list.palatine_tonsils"       
##  [8] "HMP_2012.metaphlan_bugs_list.posterior_fornix"       
##  [9] "HMP_2012.metaphlan_bugs_list.r_retroauricular_crease"
## [10] "HMP_2012.metaphlan_bugs_list.saliva"                 
## [11] "HMP_2012.metaphlan_bugs_list.stool"                  
## [12] "HMP_2012.metaphlan_bugs_list.subgingival_plaque"     
## [13] "HMP_2012.metaphlan_bugs_list.supragingival_plaque"   
## [14] "HMP_2012.metaphlan_bugs_list.throat"                 
## [15] "HMP_2012.metaphlan_bugs_list.tongue_dorsum"          
## [16] "HMP_2012.metaphlan_bugs_list.vaginal_introitus"      
## [17] "20170526.HMP_2012.metaphlan_bugs_list.nasalcavity"   
## [18] "20170526.HMP_2012.metaphlan_bugs_list.oralcavity"    
## [19] "20170526.HMP_2012.metaphlan_bugs_list.stool"         
## [20] "20170526.HMP_2012.metaphlan_bugs_list.vagina"
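
The behaviour of the lookahead pattern can be checked without connecting to ExperimentHub. The sketch below applies it to a few titles copied from the listings in this section and compares it with an anchored alternative; the variable names are chosen for illustration only.

```r
# A few titles copied from the query results
titles <- c("HMP_2012.metaphlan_bugs_list.stool",
            "20170526.HMP_2012.metaphlan_bugs_list.stool",
            "HMP_2012.genefamilies_relab.stool",
            "LomanNJ_2013_Hi.metaphlan_bugs_list.stool")

# Lookahead pattern used above: "HMP" anywhere, with "metaphlan" later on
unanchored <- grep("(?=HMP)(?=.*metaphlan)", titles, perl = TRUE, value = TRUE)

# Anchored alternative: "HMP" must be at the very start of the title
anchored <- grep("^HMP.*metaphlan", titles, value = TRUE)

unanchored
anchored
```

The lookahead version also returns the title with the date prefix, whereas the anchored version returns only the title that begins with “HMP”.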

Here are all the MetaPhlAn2 taxonomic profiles for the stool body site:

grep("(?=.*stool)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)

##  [1] "HMP_2012.metaphlan_bugs_list.stool"                     
##  [2] "KarlssonFH_2013.metaphlan_bugs_list.stool"              
##  [3] "LeChatelierE_2013.metaphlan_bugs_list.stool"            
##  [4] "LomanNJ_2013_Hi.metaphlan_bugs_list.stool"              
##  [5] "LomanNJ_2013_Mi.metaphlan_bugs_list.stool"              
##  [6] "NielsenHB_2014.metaphlan_bugs_list.stool"               
##  [7] "Obregon_TitoAJ_2015.metaphlan_bugs_list.stool"          
##  [8] "QinJ_2012.metaphlan_bugs_list.stool"                    
##  [9] "QinN_2014.metaphlan_bugs_list.stool"                    
## [10] "RampelliS_2015.metaphlan_bugs_list.stool"               
## [11] "ZellerG_2014.metaphlan_bugs_list.stool"                 
## [12] "20170526.AsnicarF_2017.metaphlan_bugs_list.stool"       
## [13] "20170526.BritoIL_2016.metaphlan_bugs_list.stool"        
## [14] "20170526.FengQ_2015.metaphlan_bugs_list.stool"          
## [15] "20170526.Heitz-BuschartA_2016.metaphlan_bugs_list.stool"
## [16] "20170526.HMP_2012.metaphlan_bugs_list.stool"            
## [17] "20170526.KarlssonFH_2013.metaphlan_bugs_list.stool"     
## [18] "20170526.LiuW_2016.metaphlan_bugs_list.stool"           
## [19] "20170526.LomanNJ_2013.metaphlan_bugs_list.stool"        
## [20] "20170526.NielsenHB_2014.metaphlan_bugs_list.stool"      
## [21] "20170526.Obregon-TitoAJ_2015.metaphlan_bugs_list.stool" 
## [22] "20170526.QinJ_2012.metaphlan_bugs_list.stool"           
## [23] "20170526.RampelliS_2015.metaphlan_bugs_list.stool"      
## [24] "20170526.RaymondF_2016.metaphlan_bugs_list.stool"       
## [25] "20170526.SchirmerM_2016.metaphlan_bugs_list.stool"      
## [26] "20170526.VatanenT_2016.metaphlan_bugs_list.stool"       
## [27] "20170526.VincentC_2016.metaphlan_bugs_list.stool"       
## [28] "20170526.VogtmannE_2016.metaphlan_bugs_list.stool"      
## [29] "20170526.XieH_2016.metaphlan_bugs_list.stool"           
## [30] "20170526.YuJ_2015.metaphlan_bugs_list.stool"            
## [31] "20170526.ZellerG_2014.metaphlan_bugs_list.stool"        
## [32] "20170526.QinN_2014.metaphlan_bugs_list.stool"           
## [33] "20170526.LeChatelierE_2013.metaphlan_bugs_list.stool"   
## [34] "20170907.LiJ_2014.metaphlan_bugs_list.stool"            
## [35] "20171006.HanniganGD_2017.metaphlan_bugs_list.stool"

Or, here are all the data products available for the “Loman” study:

(lomannames = grep("LomanNJ_2013", myquery$title, perl=TRUE, val=TRUE))

##  [1] "LomanNJ_2013_Hi.genefamilies_relab.stool"       
##  [2] "LomanNJ_2013_Hi.marker_abundance.stool"         
##  [3] "LomanNJ_2013_Hi.marker_presence.stool"          
##  [4] "LomanNJ_2013_Hi.metaphlan_bugs_list.stool"      
##  [5] "LomanNJ_2013_Hi.pathabundance_relab.stool"      
##  [6] "LomanNJ_2013_Hi.pathcoverage.stool"             
##  [7] "LomanNJ_2013_Mi.genefamilies_relab.stool"       
##  [8] "LomanNJ_2013_Mi.marker_abundance.stool"         
##  [9] "LomanNJ_2013_Mi.marker_presence.stool"          
## [10] "LomanNJ_2013_Mi.metaphlan_bugs_list.stool"      
## [11] "LomanNJ_2013_Mi.pathabundance_relab.stool"      
## [12] "LomanNJ_2013_Mi.pathcoverage.stool"             
## [13] "20170526.LomanNJ_2013.genefamilies_relab.stool" 
## [14] "20170526.LomanNJ_2013.marker_abundance.stool"   
## [15] "20170526.LomanNJ_2013.marker_presence.stool"    
## [16] "20170526.LomanNJ_2013.metaphlan_bugs_list.stool"
## [17] "20170526.LomanNJ_2013.pathabundance_relab.stool"
## [18] "20170526.LomanNJ_2013.pathcoverage.stool"

We could create a list of ExpressionSet objects containing all of the Loman products as follows (not evaluated):

(idx = grep("LomanNJ_2013", myquery$title, perl=TRUE))
loman.list = lapply(idx, function(i){
  return(myquery[[i]])
})
names(loman.list) = lomannames
loman.list
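
The titles listed in this section appear to follow a regular layout: an optional eight-digit prefix, then study, data type, and body site separated by periods. As a rough sketch under that assumption (in particular, treating the prefix as a date-like version stamp is a guess based on the listings, not documented behaviour), titles can be split into a data.frame with base R to make filtering easier. The column names are chosen for illustration.

```r
# A few titles copied from the Loman listing
titles <- c("LomanNJ_2013_Hi.metaphlan_bugs_list.stool",
            "20170526.LomanNJ_2013.metaphlan_bugs_list.stool",
            "20170526.LomanNJ_2013.pathcoverage.stool")

# Optional eight-digit prefix (assumed to be a version stamp)
has_prefix <- grepl("^[0-9]{8}\\.", titles)
version <- ifelse(has_prefix, substr(titles, 1, 8), NA_character_)

# Remove the prefix, then split the remainder on literal periods
rest  <- sub("^[0-9]{8}\\.", "", titles)
parts <- do.call(rbind, strsplit(rest, ".", fixed = TRUE))

parsed <- data.frame(version   = version,
                     study     = parts[, 1],
                     data_type = parts[, 2],
                     body_site = parts[, 3],
                     stringsAsFactors = FALSE)
parsed
```

With the components in separate columns, ordinary subsetting such as parsed[parsed$data_type == "pathcoverage", ] can replace more elaborate regular expressions.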

8 Addition of Datasets to `curatedMetagenomicData`

The authors welcome the addition of new metagenomic datasets, provided that the raw data are hosted by NCBI/SRA and can be run through our MetaPhlAn2 and HUMAnN2 pipelines. You can request the addition of a dataset by opening an issue on the issue tracker, pointing us to the publication and raw data.

You can speed up our ability to incorporate a new dataset by providing curated metadata. See https://github.com/waldronlab/curatedMetagenomicData for how to curate a new dataset.

9 Reporting Bugs and Errors in Curation

Development of the curatedMetagenomicData package occurs on GitHub. Please visit the project repository and report software bugs and data problems on our issue tracker.

10 Other Issues

If you have an issue that is not documented elsewhere, visit the Bioconductor support site at https://support.bioconductor.org/, briefly describe your issue, and add the tag curatedMetagenomicData.