curatedMetagenomicData
package provides taxonomic, functional, and gene marker abundance for samples collected from different human bodysites. It provides processed data from whole-metagenome sequencing for thousand of human microbiome samples, including the Human Microbiome Project. Microbiome data and associated subject, specimen, and sequencing information are integrated as Bioconductor ExpressionSet objects, allowing straightfoward analyses. A conversion function also links taxonomic datasets to the phyloseq package for ecological analyses.
curatedMetagenomicData
providescuratedMetagenomicData
ResourcesExpressionSet
Objectsphyloseq
curatedMetagenomicData
curatedMetagenomicData
providescuratedMetagenomicData
provides 6 types of data for each dataset:
Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.
Currently, curatedMetagenomicData
provides:
These datasets are documented in the reference manual.
curatedMetagenomicData
ResourcesBecause curatedMetagenomicData
utilizes Bioconductor’s ExperimentHub platform, data are stored as Amazon S3 buckets and only downloaded individually when requested. Local caching occurs automatically; see ExperimentHub documentation to customize options such as where the cache is stored (by default it is ~/.ExperimentHub
). In case you haven’t installed it yet, do:
# BiocManager::install("curatedMetagenomicData")
library(curatedMetagenomicData)
Note that newly datasets require installing the development version of Bioconductor.
The manually curated metadata for all available samples are provided in a single table combined_metadata
:
?combined_metadata
View(combined_metadata)
table(combined_metadata$antibiotics_current_use)
##
## no yes
## 4605 656
table(combined_metadata$disease)
##
## ACVD AD
## 214 10
## AD;AR AD;AR;asthma
## 16 8
## AD;asthma AR
## 4 2
## AR;asthma BD
## 8 20
## CD;schizophrenia CDI;NA
## 1 14
## CDI;cellulitis CDI;osteoarthritis
## 2 1
## CDI;pneumonia CDI;ureteralstone
## 15 1
## CRC CRC;T2D;hypertension
## 305 2
## CRC;cholesterolemia CRC;fatty_liver
## 1 3
## CRC;fatty_liver;T2D;hypertension CRC;fatty_liver;hypertension
## 4 12
## CRC;hypercholesterolemia CRC;hypertension
## 3 20
## CRC;hypertension;hypercholesterolemia CRC;metastases
## 1 1
## HBV;HDV;ascites;cirrhosis HBV;HDV;cirrhosis
## 2 3
## HBV;HE;cirrhosis HBV;HEV;ascites;cirrhosis
## 2 4
## HBV;HEV;cirrhosis HBV;ascites;cirrhosis
## 3 44
## HBV;ascites;cirrhosis;schistosoma HBV;ascites;cirrhosis;wilson
## 1 1
## HBV;cirrhosis HCV;cirrhosis
## 39 1
## HE;HEV;cirrhosis HEV;ascites;cirrhosis;schistosoma
## 1 2
## IBD IGT
## 148 49
## NK RA
## 1 97
## STEC T1D
## 43 55
## T1D;coeliac;IBD T2D
## 3 223
## T2D;hypertension T2D;schizophrenia
## 1 3
## abdominalhernia acute_diarrhoea
## 2 18
## adenoma adenoma;T2D
## 93 2
## adenoma;T2D;hypertension adenoma;fatty_liver
## 2 12
## adenoma;fatty_liver;T2D adenoma;fatty_liver;T2D;hypertension
## 1 1
## adenoma;fatty_liver;hypertension adenoma;hypercholesterolemia
## 14 4
## adenoma;hypercholesterolemia;metastases adenoma;hypertension
## 1 12
## adenoma;metastases arthritis
## 1 1
## ascites;cirrhosis ascites;cirrhosis;pbcirrhosis
## 9 1
## ascites;cirrhosis;schistosoma ascites;cirrhosis;wilson
## 1 1
## bronchitis cellulitis
## 18 6
## cirrhosis coeliac;gestational_diabetes;CMV
## 8 2
## cough cystitis
## 2 1
## fatty_liver fatty_liver;T2D
## 94 3
## fatty_liver;T2D;hypertension fatty_liver;hypertension
## 9 13
## fever gangrene
## 3 1
## healthy hepatitis
## 6229 3
## hypercholesterolemia hypertension
## 5 169
## hypertension;metastases infectiousgastroenteritis
## 1 20
## melanoma;metastases metabolic_syndrome
## 64 50
## metastases osteoarthritis
## 1 1
## otitis periodontitis
## 107 48
## pneumonia psoriasis
## 7 74
## psoriasis;arthritis pyelonefritis
## 22 2
## pyelonephritis respiratoryinf
## 6 13
## salmonellosis schizophrenia
## 1 12
## sepsis skininf
## 1 2
## stomatitis suspinf
## 2 1
## tonsillitis
## 3
combined_metadata
also provides technical information for each sample like sequencing platform, read length, and read depth. The following uses combined_metadata
to create a boxplot of read depth for each sample in each study, with boxes colored by body site. First, create a ranking of datasets by median read depth:
dsranking <- combined_metadata %>%
as.data.frame() %>%
group_by(dataset_name) %>%
summarize(mediandepth = median(number_reads) / 1e6) %>%
mutate(dsorder = rank(mediandepth)) %>%
arrange(dsorder)
dsranking
## # A tibble: 57 x 3
## dataset_name mediandepth dsorder
## <chr> <dbl> <dbl>
## 1 TettAJ_2016 1.19 1
## 2 SmitsSA_2017 3.81 2
## 3 HanniganGD_2017 5.86 3
## 4 DavidLA_2015 6.12 4
## 5 LomanNJ_2013 6.66 5
## 6 DhakanDB_2019 8.59 6
## 7 PehrssonE_2016 8.84 7
## 8 OhJ_2014 10.5 8
## 9 ChngKR_2016 14.1 9
## 10 FerrettiP_2018 14.6 10
## # … with 47 more rows
Create a factor ds
that is ordered according to the ranking by median read depth, to show datasets in order from lowest to highest median read depth, then create the box plot.
suppressPackageStartupMessages(library(ggplot2))
combined_metadata %>%
as.data.frame() %>%
mutate(ds = factor(combined_metadata$dataset_name, levels=dsranking$dataset_name)) %>%
ggplot(aes(ds, number_reads / 1e6, fill=body_site)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
labs(x="Dataset", y="Read Depth (millions)")
For the datasets that provide both stool and oral cavity profiles, notice how much lower the read depth of oral cavity profiles is due to the higher proportions of removed human DNA reads.
Individual data projects can be fetched via per-dataset functions or the curatedMetagenomicData()
function. A function call to a dataset name returns a Bioconductor ExpressionSet object:
suppressPackageStartupMessages(library(curatedMetagenomicData))
loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()
However, the above approach lacks additional options and versioning provided by the curatedMetagenomicData function, which returns a list of datasets. In this case the list has length 1:
loman <- curatedMetagenomicData("LomanNJ_2013.metaphlan_bugs_list.stool", dryrun = FALSE)
## Working on LomanNJ_2013.metaphlan_bugs_list.stool
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
loman
## List of length 1
## names(1): LomanNJ_2013.metaphlan_bugs_list.stool
loman.eset <- loman[[1]]
loman.eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 736 features, 43 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: OBK1122 OBK1196 ... OBK5066 (43 total)
## varLabels: subjectID body_site ... NCBI_accession (19 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## pubMedIds: 23571589
## Annotation:
The following creates a list of two ExpressionSet
objects providing the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles:
oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
"Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
esl
esl[[1]]
esl[[2]]
And the following would provide all stool metaphlan datasets if dryrun = FALSE
were set:
curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)
## Dry run: see return values for datasets that would be downloaded. Run with `dryrun=FALSE` to actually download these datasets.
## [1] "AsnicarF_2017.metaphlan_bugs_list.stool"
## [2] "BackhedF_2015.metaphlan_bugs_list.stool"
## [3] "Bengtsson-PalmeJ_2015.metaphlan_bugs_list.stool"
## [4] "BritoIL_2016.metaphlan_bugs_list.stool"
## [5] "ChengpingW_2017.metaphlan_bugs_list.stool"
## [6] "CosteaPI_2017.metaphlan_bugs_list.stool"
## [7] "DavidLA_2015.metaphlan_bugs_list.stool"
## [8] "DhakanDB_2019.metaphlan_bugs_list.stool"
## [9] "FengQ_2015.metaphlan_bugs_list.stool"
## [10] "FerrettiP_2018.metaphlan_bugs_list.stool"
## [11] "GopalakrishnanV_2018.metaphlan_bugs_list.stool"
## [12] "HMP_2012.metaphlan_bugs_list.stool"
## [13] "HanniganGD_2017.metaphlan_bugs_list.stool"
## [14] "HansenLBS_2018.metaphlan_bugs_list.stool"
## [15] "Heitz-BuschartA_2016.metaphlan_bugs_list.stool"
## [16] "JieZ_2017.metaphlan_bugs_list.stool"
## [17] "KarlssonFH_2013.metaphlan_bugs_list.stool"
## [18] "KieserS_2018.metaphlan_bugs_list.stool"
## [19] "KosticAD_2015.metaphlan_bugs_list.stool"
## [20] "LeChatelierE_2013.metaphlan_bugs_list.stool"
## [21] "LiJ_2014.metaphlan_bugs_list.stool"
## [22] "LiJ_2017.metaphlan_bugs_list.stool"
## [23] "LiSS_2016.metaphlan_bugs_list.stool"
## [24] "LiuW_2016.metaphlan_bugs_list.stool"
## [25] "LomanNJ_2013.metaphlan_bugs_list.stool"
## [26] "LoombaR_2017.metaphlan_bugs_list.stool"
## [27] "LouisS_2016.metaphlan_bugs_list.stool"
## [28] "MatsonV_2018.metaphlan_bugs_list.stool"
## [29] "NielsenHB_2014.metaphlan_bugs_list.stool"
## [30] "Obregon-TitoAJ_2015.metaphlan_bugs_list.stool"
## [31] "OlmMR_2017.metaphlan_bugs_list.stool"
## [32] "PasolliE_2018.metaphlan_bugs_list.stool"
## [33] "PehrssonE_2016.metaphlan_bugs_list.stool"
## [34] "QinJ_2012.metaphlan_bugs_list.stool"
## [35] "QinN_2014.metaphlan_bugs_list.stool"
## [36] "RampelliS_2015.metaphlan_bugs_list.stool"
## [37] "RaymondF_2016.metaphlan_bugs_list.stool"
## [38] "SchirmerM_2016.metaphlan_bugs_list.stool"
## [39] "SmitsSA_2017.metaphlan_bugs_list.stool"
## [40] "TettAJ_2019_a.metaphlan_bugs_list.stool"
## [41] "TettAJ_2019_b.metaphlan_bugs_list.stool"
## [42] "TettAJ_2019_c.metaphlan_bugs_list.stool"
## [43] "ThomasAM_2018a.metaphlan_bugs_list.stool"
## [44] "ThomasAM_2018b.metaphlan_bugs_list.stool"
## [45] "VatanenT_2016.metaphlan_bugs_list.stool"
## [46] "VincentC_2016.metaphlan_bugs_list.stool"
## [47] "VogtmannE_2016.metaphlan_bugs_list.stool"
## [48] "XieH_2016.metaphlan_bugs_list.stool"
## [49] "YeZ_2018.metaphlan_bugs_list.stool"
## [50] "YuJ_2015.metaphlan_bugs_list.stool"
## [51] "ZeeviD_2015.metaphlan_bugs_list.stool"
## [52] "ZellerG_2014.metaphlan_bugs_list.stool"
The following merges the two oral cavity datasets downloaded above into a single ExpressionSet.
eset <- mergeData(esl)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1914 features, 172 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.1.SA
## BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.10.SA ...
## Castro-NallarE_2015.metaphlan_bugs_list.oralcavity:ES_211 (172
## total)
## varLabels: subjectID body_site ... studyID (19 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
This works for any number of datasets. The function will not stop you from merging different data types (e.g. metaphlan bugs lists with gene families), but you probably don’t want to do that.
ExpressionSet
ObjectsAll datasets are represented as ExpressionSet
objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase
package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.
To access the experiment-level metadata the experimentData()
function is used to return a MIAME
(Minimum Information About a Microarray Experiment) object.
experimentData( loman.eset )
## Experiment data
## Experimenter name: Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ
## Laboratory: [1] Institute of Microbiology and Infection, University of Birmingham, Birmingham, England.
## Contact information:
## Title: A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4.
## URL:
## PMIDs: 23571589
##
## Abstract: A 308 word abstract is available. Use 'abstract' method.
## notes:
## Sequencing platform:
## IlluminaHiSeq
To access the subject-level metadata the pData()
function is used to return a data.frame
containing subject-level variables of study.
head( pData( loman.eset ) )
## subjectID body_site antibiotics_current_use study_condition disease
## OBK1122 OBK1122 stool <NA> STEC STEC
## OBK1196 OBK1196 stool <NA> STEC STEC
## OBK1253 OBK1253 stool <NA> STEC STEC
## OBK2535 OBK2535 stool <NA> STEC STEC
## OBK2535b OBK2535b stool <NA> STEC STEC
## OBK2638 OBK2638 stool <NA> STEC STEC
## age_category country non_westernized DNA_extraction_kit number_reads
## OBK1122 adult DEU no Qiagen 464060
## OBK1196 adult DEU no Qiagen 30181380
## OBK1253 adult DEU no Qiagen 65704818
## OBK2535 adult DEU no Qiagen 43597736
## OBK2535b adult DEU no Qiagen 22253314
## OBK2638 adult DEU no Qiagen 1105492
## number_bases minimum_read_length median_read_length days_after_onset
## OBK1122 69609000 150 150 NA
## OBK1196 4527207000 150 150 NA
## OBK1253 9855722700 150 150 NA
## OBK2535 6539660400 150 150 3
## OBK2535b 3337997100 150 150 NA
## OBK2638 165823800 150 150 5
## stec_count shigatoxin_2_elisa stool_texture curator
## OBK1122 <NA> <NA> <NA> Paolo_Manghi
## OBK1196 <NA> <NA> <NA> Paolo_Manghi
## OBK1253 <NA> <NA> <NA> Paolo_Manghi
## OBK2535 moderate positive smooth Paolo_Manghi
## OBK2535b <NA> <NA> <NA> Paolo_Manghi
## OBK2638 high positive bloody Paolo_Manghi
## NCBI_accession
## OBK1122 ERR262939;ERR262938
## OBK1196 ERR262941;ERR262940
## OBK1253 ERR262942;ERR260476
## OBK2535 ERR260478;ERR260477
## OBK2535b ERR260479
## OBK2638 ERR262943;ERR260480
To access the data itself (in this case relative abundance), the exprs()
function returns a variables by samples (rows by columns) numeric matrix. Note the presence of “synthetic” clades at all levels of the taxonomy, starting with kingdom, e.g. k__bacteria here:
exprs( loman.eset )[1:6, 1:5] #first 6 rows and 5 columns
## OBK1122 OBK1196 OBK1253
## k__Bacteria 100.00000 100.00000 100.00000
## k__Bacteria|p__Bacteroidetes 83.55662 30.41764 79.41203
## k__Bacteria|p__Firmicutes 15.14819 68.30191 19.28784
## k__Bacteria|p__Proteobacteria 1.29520 0.06303 0.97218
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 83.55662 30.41764 79.41203
## k__Bacteria|p__Firmicutes|c__Clostridia 14.93141 33.22720 11.26228
## OBK2535 OBK2535b
## k__Bacteria 99.39953 99.63084
## k__Bacteria|p__Bacteroidetes 15.14525 7.03101
## k__Bacteria|p__Firmicutes 25.71252 45.17325
## k__Bacteria|p__Proteobacteria 58.53298 47.29051
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 15.14525 7.03101
## k__Bacteria|p__Firmicutes|c__Clostridia 16.50313 13.97996
Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.
Absolute raw count data can be estimated from the relative count data by multiplying the columns of the ExpressionSet
data by the number of reads for each sample, as found in the pData
column “number_reads”. For demo purposes you could (but don’t have to!) do this manually by dividing by 100 and multiplying by the number of reads:
loman.counts = sweep(exprs( loman.eset ), 2, loman.eset$number_reads / 100, "*")
loman.counts = round(loman.counts)
loman.counts[1:6, 1:5]
## OBK1122 OBK1196 OBK1253 OBK2535
## k__Bacteria 464060 30181380 65704818 43335945
## k__Bacteria|p__Bacteroidetes 387753 9180464 52177530 6602986
## k__Bacteria|p__Firmicutes 70297 20614459 12673040 11210077
## k__Bacteria|p__Proteobacteria 6011 19023 638769 25519054
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 387753 9180464 52177530 6602986
## k__Bacteria|p__Firmicutes|c__Clostridia 69291 10028427 7399861 7194991
## OBK2535b
## k__Bacteria 22171164
## k__Bacteria|p__Bacteroidetes 1564633
## k__Bacteria|p__Firmicutes 10052545
## k__Bacteria|p__Proteobacteria 10523706
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 1564633
## k__Bacteria|p__Firmicutes|c__Clostridia 3111004
or just set the counts
argument in curatedMetagenomicData()
to TRUE
:
loman.eset2 = curatedMetagenomicData("LomanNJ_2013.metaphlan_bugs_list.stool",
counts = TRUE, dryrun = FALSE)[[1]]
## Working on LomanNJ_2013.metaphlan_bugs_list.stool
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
all.equal(exprs(loman.eset2), loman.counts)
## [1] TRUE
Here’s a direct, exploratory analysis of E. coli prevalence in the Loman dataset using the ExpressionSet
object. More elegant solutions will be provided later using subsetting methods provided by the phyloseq package, but for users familiar with grep()
and the ExpressionSet
object, such manual methods may suffice.
First, which E. coli-related taxa are available?
grep("coli", rownames(loman.eset), value=TRUE)
## [1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli"
## [2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__Escherichia_coli_unclassified"
## [3] "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis"
## [4] "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis|t__GCF_000154565"
Create a vector of E. coli relative abundances. This grep
call with a “$” at the end selects the only row that ends with “s__Escherichia_coli“:
x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset)), ]
summary( x )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6047 2.3710 13.6897 11.8291 90.4474
This could be plotted as a histogram:
hist( x, xlab = "Relative Abundance", main="Prevalence of E. Coli",
breaks="FD")
phyloseq
For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of taxonomy-aware, ecological analysis and plotting by conversion to a phyloseq class object. curatedMetagenomicData provides the ExpressionSet2phyloseq()
function to make this easy:
suppressPackageStartupMessages(library(phyloseq))
loman.pseq = ExpressionSet2phyloseq( loman.eset )
Note the simplified row names of the OTU table, showing only the most detailed level of the taxonomy. This results from the default argument simplify=TRUE
, which is convenient and lossless because taxonomic information is now attainable by tax_table(loman.pseq2)
Set phylogenetictree = TRUE to include a phylogenetic tree in the phyloseq
object:
loman.tree <- ExpressionSet2phyloseq( loman.eset, phylogenetictree = TRUE)
wt = UniFrac(loman.tree, weighted=TRUE, normalized=FALSE,
parallel=FALSE, fast=TRUE)
plot(hclust(wt), main="Weighted UniFrac distances")
This phyloseq objects contain 3 components, with extractor functions hinted at by its show method:
loman.pseq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 736 taxa and 43 samples ]
## sample_data() Sample Data: [ 43 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 736 taxa by 8 taxonomic ranks ]
otu_table()
returns the same thing as exprs( loman.eset )
did, the Operational Taxanomic Unit (OTU) table. Here are the first 6 rows and 5 columns:
otu_table( loman.pseq )[1:6, 1:5]
## OTU Table: [6 taxa and 5 samples]
## taxa are rows
## OBK1122 OBK1196 OBK1253 OBK2535 OBK2535b
## k__Bacteria 100.00000 100.00000 100.00000 99.39953 99.63084
## p__Bacteroidetes 83.55662 30.41764 79.41203 15.14525 7.03101
## p__Firmicutes 15.14819 68.30191 19.28784 25.71252 45.17325
## p__Proteobacteria 1.29520 0.06303 0.97218 58.53298 47.29051
## c__Bacteroidia 83.55662 30.41764 79.41203 15.14525 7.03101
## c__Clostridia 14.93141 33.22720 11.26228 16.50313 13.97996
The same patient or participant data that was available from pData( loman.eset )
is now availble using sample_data()
on
the phyloseq object:
sample_data( loman.pseq )[1:6, 1:5]
## subjectID body_site antibiotics_current_use study_condition disease
## OBK1122 OBK1122 stool <NA> STEC STEC
## OBK1196 OBK1196 stool <NA> STEC STEC
## OBK1253 OBK1253 stool <NA> STEC STEC
## OBK2535 OBK2535 stool <NA> STEC STEC
## OBK2535b OBK2535b stool <NA> STEC STEC
## OBK2638 OBK2638 stool <NA> STEC STEC
But this object also is aware of the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.
head( tax_table( loman.pseq ) )
## Taxonomy Table: [6 taxa by 8 taxonomic ranks]:
## Kingdom Phylum Class Order Family Genus
## k__Bacteria "Bacteria" NA NA NA NA NA
## p__Bacteroidetes "Bacteria" "Bacteroidetes" NA NA NA NA
## p__Firmicutes "Bacteria" "Firmicutes" NA NA NA NA
## p__Proteobacteria "Bacteria" "Proteobacteria" NA NA NA NA
## c__Bacteroidia "Bacteria" "Bacteroidetes" "Bacteroidia" NA NA NA
## c__Clostridia "Bacteria" "Firmicutes" "Clostridia" NA NA NA
## Species Strain
## k__Bacteria NA NA
## p__Bacteroidetes NA NA
## p__Firmicutes NA NA
## p__Proteobacteria NA NA
## c__Bacteroidia NA NA
## c__Clostridia NA NA
The process of subsetting begins with the names of taxonomic ranks:
rank_names( loman.pseq )
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## [8] "Strain"
Taxa can be filtered by these rank names. For example, to return an object with only species and strains:
subset_taxa( loman.pseq, !is.na(Species))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 535 taxa and 43 samples ]
## sample_data() Sample Data: [ 43 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 535 taxa by 8 taxonomic ranks ]
To keep only phylum-level data (not class or finer, and not kingdom-level):
subset_taxa( loman.pseq, is.na(Class) & !is.na(Phylum))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 8 taxa and 43 samples ]
## sample_data() Sample Data: [ 43 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 8 taxa by 8 taxonomic ranks ]
Or to keep only Bacteroidetes phylum. Note that taxa names have been shortened from the rownames of the ExpressionSet
object, for nicer plotting.
loman.bd = subset_taxa( loman.pseq, Phylum == "Bacteroidetes")
head( taxa_names( loman.bd ) )
## [1] "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales"
## [4] "f__Bacteroidaceae" "f__Porphyromonadaceae" "f__Rikenellaceae"
The phyloseq package provides advanced pruning of taxa, such as the following which keeps only taxa that are among the most abundant 5% in at least five samples:
keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)
## Mode FALSE TRUE
## logical 624 112
subset_taxa(loman.pseq, keepotu)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 112 taxa and 43 samples ]
## sample_data() Sample Data: [ 43 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 112 taxa by 8 taxonomic ranks ]
Note that phyloseq also provides topk()
for selecting the most abundant k
taxa, and other functions for advanced pruning of taxa.
The phyloseq package provides the plot_heatmap()
function to create heatmaps using a variety of built-in dissimilarity metrics for clustering. Here, we apply the same abundance filter as above, keep only strain-level OTUs. This function supports a large number of distance and ordination methods, here we use Bray-Curtis dissimilarity for distance and PCoA as the ordination method for organizing the heatmap.
loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Strain))
plot_heatmap(loman.filt, method="PCoA", distance="bray")
Here we plot the top 20 most abundant species (not strains), defined by the sum of abundance across all samples in the dataset:
loman.sp = subset_taxa(loman.pseq, !is.na(Species) & is.na(Strain))
par(mar = c(20, 4, 0, 0) + 0.15) #increase margin size on the bottom
barplot(sort(taxa_sums(loman.sp), TRUE)[1:20] / nsamples(loman.sp),
ylab = "Total counts", las = 2)
The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity in the species-level data, stratifying by stool texture:
alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.sp, "stool_texture", measures = alphas)
Let’s compare these three alpha diversity measures:
pairs( estimate_richness(loman.sp, measures = alphas) )
Numerous beta diversity / dissimilarity are provided by the distance()
function when provided a phyloseq object, and these can be used for any kind of clustering or classification scheme. For example, here is a hierarchical clustering dendrogram produced by the hclust()
from the base R stats
package with “Ward” linkage:
mydist = distance(loman.sp, method="bray")
myhclust = hclust( mydist )
plot(myhclust, main="Bray-Curtis Dissimilarity",
method="ward.D", xlab="Samples", sub = "")
The phyloseq package provides a variety of ordination methods, with convenient options for labelling points. Here is a Principal Coordinates Analysis plot of species-level taxa from the Loman dataset, using Bray-Curtis distance:
ordinated_taxa = ordinate(loman.sp, method="PCoA", distance="bray")
plot_ordination(loman.sp, ordinated_taxa, color="stool_texture",
title = "Bray-Curtis Principal Coordinates Analysis")
plot_scree(ordinated_taxa, title="Screeplot")
We recommend that most users use the convenience functions shown above to find curatedMetagenomicData
datasets and navigate versions. However, it is also possible to use ExperimentHub directly for consistency with other ExperimentHub packages.
In the next section we will demonstrate convenience functions that make ExperimentHub transparent, but it can be useful and powerful to interact directly with ExperimentHub. A “hub” connects you to the ExperimentHub server and its metadata, without downloading any data:
suppressPackageStartupMessages(library(ExperimentHub))
eh = ExperimentHub()
## snapshotDate(): 2019-10-22
The following queries ExperimentHub for all records matching curatedMetagenomicData
:
myquery = query(eh, "curatedMetagenomicData")
This is an abbreviated list of what was found:
myquery
## ExperimentHub with 1020 records
## # snapshotDate(): 2019-10-22
## # $dataprovider: NA, Department of Psychology, Abdul Haq Campus, Federal Urd...
## # $species: Homo sapiens
## # $rdataclass: ExpressionSet
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH178"]]'
##
## title
## EH178 | HMP_2012.genefamilies_relab.anterior_nares
## EH179 | HMP_2012.genefamilies_relab.buccal_mucosa
## EH180 | HMP_2012.genefamilies_relab.hard_palate
## EH181 | HMP_2012.genefamilies_relab.keratinized_gingiva
## EH182 | HMP_2012.genefamilies_relab.l_retroauricular_crease
## ... ...
## EH3205 | 20190919.ZeeviD_2015.marker_abundance.stool
## EH3206 | 20190919.ZeeviD_2015.marker_presence.stool
## EH3207 | 20190919.ZeeviD_2015.metaphlan_bugs_list.stool
## EH3208 | 20190919.ZeeviD_2015.pathabundance_relab.stool
## EH3209 | 20190919.ZeeviD_2015.pathcoverage.stool
Note that the first column, “EH178” etc, are unique IDs for each dataset. For a full list, mcols(myquery)
produces a data.frame
. The following is unevaluated in this script, but can be used to display an interactive spreadsheet of the results:
View(mcols(myquery))
Or to write the search results to a csv file:
write.csv(mcols(myquery), file="curatedMetagenomicData_allrecords.csv")
Tags can (eventually) help identify useful datasets:
head(myquery$tags)
## [1] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [2] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [3] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [4] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [5] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [6] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
Titles can be used to select body site and data type. Here are the MetaPhlan2 taxonomy tables for all the available Human Microbiome Project (HMP) body sites, found using a Perl regular expression to find “HMP” at the beginning of a string, followed by any number of characters “.*“, followed by the word”metaphlan“:
grep("(?=HMP)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
## [1] "HMP_2012.metaphlan_bugs_list.anterior_nares"
## [2] "HMP_2012.metaphlan_bugs_list.buccal_mucosa"
## [3] "HMP_2012.metaphlan_bugs_list.hard_palate"
## [4] "HMP_2012.metaphlan_bugs_list.keratinized_gingiva"
## [5] "HMP_2012.metaphlan_bugs_list.l_retroauricular_crease"
## [6] "HMP_2012.metaphlan_bugs_list.mid_vagina"
## [7] "HMP_2012.metaphlan_bugs_list.palatine_tonsils"
## [8] "HMP_2012.metaphlan_bugs_list.posterior_fornix"
## [9] "HMP_2012.metaphlan_bugs_list.r_retroauricular_crease"
## [10] "HMP_2012.metaphlan_bugs_list.saliva"
## [11] "HMP_2012.metaphlan_bugs_list.stool"
## [12] "HMP_2012.metaphlan_bugs_list.subgingival_plaque"
## [13] "HMP_2012.metaphlan_bugs_list.supragingival_plaque"
## [14] "HMP_2012.metaphlan_bugs_list.throat"
## [15] "HMP_2012.metaphlan_bugs_list.tongue_dorsum"
## [16] "HMP_2012.metaphlan_bugs_list.vaginal_introitus"
## [17] "20170526.HMP_2012.metaphlan_bugs_list.nasalcavity"
## [18] "20170526.HMP_2012.metaphlan_bugs_list.oralcavity"
## [19] "20170526.HMP_2012.metaphlan_bugs_list.stool"
## [20] "20170526.HMP_2012.metaphlan_bugs_list.vagina"
## [21] "20180425.HMP_2012.metaphlan_bugs_list.nasalcavity"
## [22] "20180425.HMP_2012.metaphlan_bugs_list.oralcavity"
## [23] "20180425.HMP_2012.metaphlan_bugs_list.skin"
## [24] "20180425.HMP_2012.metaphlan_bugs_list.stool"
## [25] "20180425.HMP_2012.metaphlan_bugs_list.vagina"
## [26] "20181025.HMP_2012.metaphlan_bugs_list.nasalcavity"
## [27] "20181025.HMP_2012.metaphlan_bugs_list.oralcavity"
## [28] "20181025.HMP_2012.metaphlan_bugs_list.skin"
## [29] "20181025.HMP_2012.metaphlan_bugs_list.stool"
## [30] "20181025.HMP_2012.metaphlan_bugs_list.vagina"
## [31] "20190423.HMP_2012.metaphlan_bugs_list.nasalcavity"
## [32] "20190423.HMP_2012.metaphlan_bugs_list.oralcavity"
## [33] "20190423.HMP_2012.metaphlan_bugs_list.skin"
## [34] "20190423.HMP_2012.metaphlan_bugs_list.stool"
## [35] "20190423.HMP_2012.metaphlan_bugs_list.vagina"
Here are all the metabolic pathway abundance for the stool body site:
grep("(?=.*stool)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
## [1] "HMP_2012.metaphlan_bugs_list.stool"
## [2] "KarlssonFH_2013.metaphlan_bugs_list.stool"
## [3] "LeChatelierE_2013.metaphlan_bugs_list.stool"
## [4] "LomanNJ_2013_Hi.metaphlan_bugs_list.stool"
## [5] "LomanNJ_2013_Mi.metaphlan_bugs_list.stool"
## [6] "NielsenHB_2014.metaphlan_bugs_list.stool"
## [7] "Obregon_TitoAJ_2015.metaphlan_bugs_list.stool"
## [8] "QinJ_2012.metaphlan_bugs_list.stool"
## [9] "QinN_2014.metaphlan_bugs_list.stool"
## [10] "RampelliS_2015.metaphlan_bugs_list.stool"
## [11] "ZellerG_2014.metaphlan_bugs_list.stool"
## [12] "20170526.AsnicarF_2017.metaphlan_bugs_list.stool"
## [13] "20170526.BritoIL_2016.metaphlan_bugs_list.stool"
## [14] "20170526.FengQ_2015.metaphlan_bugs_list.stool"
## [15] "20170526.Heitz-BuschartA_2016.metaphlan_bugs_list.stool"
## [16] "20170526.HMP_2012.metaphlan_bugs_list.stool"
## [17] "20170526.KarlssonFH_2013.metaphlan_bugs_list.stool"
## [18] "20170526.LiuW_2016.metaphlan_bugs_list.stool"
## [19] "20170526.LomanNJ_2013.metaphlan_bugs_list.stool"
## [20] "20170526.NielsenHB_2014.metaphlan_bugs_list.stool"
## [21] "20170526.Obregon-TitoAJ_2015.metaphlan_bugs_list.stool"
## [22] "20170526.QinJ_2012.metaphlan_bugs_list.stool"
## [23] "20170526.RampelliS_2015.metaphlan_bugs_list.stool"
## [24] "20170526.RaymondF_2016.metaphlan_bugs_list.stool"
## [25] "20170526.SchirmerM_2016.metaphlan_bugs_list.stool"
## [26] "20170526.VatanenT_2016.metaphlan_bugs_list.stool"
## [27] "20170526.VincentC_2016.metaphlan_bugs_list.stool"
## [28] "20170526.VogtmannE_2016.metaphlan_bugs_list.stool"
## [29] "20170526.XieH_2016.metaphlan_bugs_list.stool"
## [30] "20170526.YuJ_2015.metaphlan_bugs_list.stool"
## [31] "20170526.ZellerG_2014.metaphlan_bugs_list.stool"
## [32] "20170526.QinN_2014.metaphlan_bugs_list.stool"
## [33] "20170526.LeChatelierE_2013.metaphlan_bugs_list.stool"
## [34] "20170907.LiJ_2014.metaphlan_bugs_list.stool"
## [35] "20171006.HanniganGD_2017.metaphlan_bugs_list.stool"
## [36] "20180425.AsnicarF_2017.metaphlan_bugs_list.stool"
## [37] "20180425.BritoIL_2016.metaphlan_bugs_list.stool"
## [38] "20180425.FengQ_2015.metaphlan_bugs_list.stool"
## [39] "20180425.HanniganGD_2017.metaphlan_bugs_list.stool"
## [40] "20180425.Heitz_BuschartA_2016.metaphlan_bugs_list.stool"
## [41] "20180425.HMP_2012.metaphlan_bugs_list.stool"
## [42] "20180425.KarlssonFH_2013.metaphlan_bugs_list.stool"
## [43] "20180425.LeChatelierE_2013.metaphlan_bugs_list.stool"
## [44] "20180425.LiJ_2014.metaphlan_bugs_list.stool"
## [45] "20180425.LiJ_2017.metaphlan_bugs_list.stool"
## [46] "20180425.LiuW_2016.metaphlan_bugs_list.stool"
## [47] "20180425.LomanNJ_2013.metaphlan_bugs_list.stool"
## [48] "20180425.LouisS_2016.metaphlan_bugs_list.stool"
## [49] "20180425.NielsenHB_2014.metaphlan_bugs_list.stool"
## [50] "20180425.Obregon_TitoAJ_2015.metaphlan_bugs_list.stool"
## [51] "20180425.QinJ_2012.metaphlan_bugs_list.stool"
## [52] "20180425.QinN_2014.metaphlan_bugs_list.stool"
## [53] "20180425.RampelliS_2015.metaphlan_bugs_list.stool"
## [54] "20180425.RaymondF_2016.metaphlan_bugs_list.stool"
## [55] "20180425.SchirmerM_2016.metaphlan_bugs_list.stool"
## [56] "20180425.SmitsSA_2017.metaphlan_bugs_list.stool"
## [57] "20180425.VatanenT_2016.metaphlan_bugs_list.stool"
## [58] "20180425.VincentC_2016.metaphlan_bugs_list.stool"
## [59] "20180425.VogtmannE_2016.metaphlan_bugs_list.stool"
## [60] "20180425.XieH_2016.metaphlan_bugs_list.stool"
## [61] "20180425.YuJ_2015.metaphlan_bugs_list.stool"
## [62] "20180425.ZellerG_2014.metaphlan_bugs_list.stool"
## [63] "20181025.AsnicarF_2017.metaphlan_bugs_list.stool"
## [64] "20181025.Bengtsson-PalmeJ_2015.metaphlan_bugs_list.stool"
## [65] "20181025.DavidLA_2015.metaphlan_bugs_list.stool"
## [66] "20181025.FengQ_2015.metaphlan_bugs_list.stool"
## [67] "20181025.HanniganGD_2017.metaphlan_bugs_list.stool"
## [68] "20181025.Heitz-BuschartA_2016.metaphlan_bugs_list.stool"
## [69] "20181025.HMP_2012.metaphlan_bugs_list.stool"
## [70] "20181025.KarlssonFH_2013.metaphlan_bugs_list.stool"
## [71] "20181025.KosticAD_2015.metaphlan_bugs_list.stool"
## [72] "20181025.LiJ_2014.metaphlan_bugs_list.stool"
## [73] "20181025.LiJ_2017.metaphlan_bugs_list.stool"
## [74] "20181025.LiSS_2016.metaphlan_bugs_list.stool"
## [75] "20181025.LiuW_2016.metaphlan_bugs_list.stool"
## [76] "20181025.LomanNJ_2013.metaphlan_bugs_list.stool"
## [77] "20181025.LoombaR_2017.metaphlan_bugs_list.stool"
## [78] "20181025.LouisS_2016.metaphlan_bugs_list.stool"
## [79] "20181025.Obregon-TitoAJ_2015.metaphlan_bugs_list.stool"
## [80] "20181025.OlmMR_2017.metaphlan_bugs_list.stool"
## [81] "20181025.PasolliE_2018.metaphlan_bugs_list.stool"
## [82] "20181025.QinN_2014.metaphlan_bugs_list.stool"
## [83] "20181025.RampelliS_2015.metaphlan_bugs_list.stool"
## [84] "20181025.RaymondF_2016.metaphlan_bugs_list.stool"
## [85] "20181025.SchirmerM_2016.metaphlan_bugs_list.stool"
## [86] "20181025.SmitsSA_2017.metaphlan_bugs_list.stool"
## [87] "20181025.ThomasAM_2018a.metaphlan_bugs_list.stool"
## [88] "20181025.VatanenT_2016.metaphlan_bugs_list.stool"
## [89] "20181025.VincentC_2016.metaphlan_bugs_list.stool"
## [90] "20181025.VogtmannE_2016.metaphlan_bugs_list.stool"
## [91] "20181025.WenC_2017.metaphlan_bugs_list.stool"
## [92] "20181025.XieH_2016.metaphlan_bugs_list.stool"
## [93] "20181025.YuJ_2015.metaphlan_bugs_list.stool"
## [94] "20181025.ZellerG_2014.metaphlan_bugs_list.stool"
## [95] "20190422.ThomasAM_2018b.metaphlan_bugs_list.stool"
## [96] "20190422.FerrettiP_2018.metaphlan_bugs_list.stool"
## [97] "20190422.ChengpingW_2017.metaphlan_bugs_list.stool"
## [98] "20190423.HMP_2012.metaphlan_bugs_list.stool"
## [99] "20190423.BackhedF_2015.metaphlan_bugs_list.stool"
## [100] "20190424.CosteaPI_2017.metaphlan_bugs_list.stool"
## [101] "20190919.DhakanDB_2019.metaphlan_bugs_list.stool"
## [102] "20190919.GopalakrishnanV_2018.metaphlan_bugs_list.stool"
## [103] "20190919.HansenLBS_2018.metaphlan_bugs_list.stool"
## [104] "20190919.JieZ_2017.metaphlan_bugs_list.stool"
## [105] "20190919.KieserS_2018.metaphlan_bugs_list.stool"
## [106] "20190919.MatsonV_2018.metaphlan_bugs_list.stool"
## [107] "20190919.PehrssonE_2016.metaphlan_bugs_list.stool"
## [108] "20190919.TettAJ_2019_a.metaphlan_bugs_list.stool"
## [109] "20190919.TettAJ_2019_b.metaphlan_bugs_list.stool"
## [110] "20190919.TettAJ_2019_c.metaphlan_bugs_list.stool"
## [111] "20190919.YeZ_2018.metaphlan_bugs_list.stool"
## [112] "20190919.ZeeviD_2015.metaphlan_bugs_list.stool"
Or, here are all the data products available for the “Loman” study:
(lomannames = grep("LomanNJ_2013", myquery$title, perl=TRUE, val=TRUE))
## [1] "LomanNJ_2013_Hi.genefamilies_relab.stool"
## [2] "LomanNJ_2013_Hi.marker_abundance.stool"
## [3] "LomanNJ_2013_Hi.marker_presence.stool"
## [4] "LomanNJ_2013_Hi.metaphlan_bugs_list.stool"
## [5] "LomanNJ_2013_Hi.pathabundance_relab.stool"
## [6] "LomanNJ_2013_Hi.pathcoverage.stool"
## [7] "LomanNJ_2013_Mi.genefamilies_relab.stool"
## [8] "LomanNJ_2013_Mi.marker_abundance.stool"
## [9] "LomanNJ_2013_Mi.marker_presence.stool"
## [10] "LomanNJ_2013_Mi.metaphlan_bugs_list.stool"
## [11] "LomanNJ_2013_Mi.pathabundance_relab.stool"
## [12] "LomanNJ_2013_Mi.pathcoverage.stool"
## [13] "20170526.LomanNJ_2013.genefamilies_relab.stool"
## [14] "20170526.LomanNJ_2013.marker_abundance.stool"
## [15] "20170526.LomanNJ_2013.marker_presence.stool"
## [16] "20170526.LomanNJ_2013.metaphlan_bugs_list.stool"
## [17] "20170526.LomanNJ_2013.pathabundance_relab.stool"
## [18] "20170526.LomanNJ_2013.pathcoverage.stool"
## [19] "20180425.LomanNJ_2013.genefamilies_relab.stool"
## [20] "20180425.LomanNJ_2013.marker_abundance.stool"
## [21] "20180425.LomanNJ_2013.marker_presence.stool"
## [22] "20180425.LomanNJ_2013.metaphlan_bugs_list.stool"
## [23] "20180425.LomanNJ_2013.pathabundance_relab.stool"
## [24] "20180425.LomanNJ_2013.pathcoverage.stool"
## [25] "20181025.LomanNJ_2013.genefamilies_relab.stool"
## [26] "20181025.LomanNJ_2013.marker_abundance.stool"
## [27] "20181025.LomanNJ_2013.marker_presence.stool"
## [28] "20181025.LomanNJ_2013.metaphlan_bugs_list.stool"
## [29] "20181025.LomanNJ_2013.pathabundance_relab.stool"
## [30] "20181025.LomanNJ_2013.pathcoverage.stool"
We could create a list of ExpressionSet
objects containing all of the Loman products as follows (not evaluated):
(idx = grep("LomanNJ_2013", myquery$title, perl=TRUE))
loman.list = lapply(idx, function(i){
return(myquery[[i]])
})
names(loman.list) = lomannames
loman.list
curatedMetagenomicData
Authors welcome the addition of new metagenomic datasets provided that the raw data are hosted by NCBI/SRA and can be run through our MetaPhlAn2 and HUMAnN2 pipeline. You can request the addition of a dataset by opening an issue on the issue tracker, pointing us to the publication and raw data.
You can speed up our ability to incorporate a new dataset by providing curated metadata. See https://github.com/waldronlab/curatedMetagenomicData for how to curate a new dataset.
Development of the curatedMetagenomicData
package occurs on GitHub. Please visit the
project repository and report software bugs and data problems on our issue tracker.
If you have an issue that is not documented elsewhere, visit the Bioconductor support site at https://support.bioconductor.org/, briefly describe your issue, and add the tag curatedMetagenomicData
.