Paul J. McMurdie and Susan Holmes
If you find phyloseq and/or its tutorials useful, please acknowledge and cite phyloseq in your publications:
phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217 http://dx.plos.org/10.1371/journal.pone.0061217
The phyloseq project also has a number of supporting online resources, most of which can by found at the phyloseq home page, or from the phyloseq stable release page on Bioconductor.
To post feature requests or ask for help, try the phyloseq Issue Tracker.
The analysis of microbiological communities brings many challenges: the integration of many different types of data with methods from ecology, genetics, phylogenetics, network analysis, visualization and testing. The data itself may originate from widely different sources, such as the microbiomes of humans, soils, surface and ocean waters, wastewater treatment plants, industrial facilities, and so on; and as a result, these varied sample types may have very different forms and scales of related data that is extremely dependent upon the experiment and its question(s). The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as single experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of OTU-clustered high-throughput phylogenetic sequencing data.
A separate vignette is included within the phyloseq-package that describes the basics of importing pre-clustered phylogenetic sequencing data, data filtering, as well as some transformations and some additional details about the package and installation. A quick way to load it is:
vignette("phyloseq-basics")
By contrast, this vignette is intended to provide functional examples of the analysis tools and wrappers included in phyloseq. All necessary code for performing the analysis and producing graphics will be included with its description, and the focus will be on the use of example data that is included and documented within the phyloseq-package.
Let’s start by loading the `phyloseq-package:
library("phyloseq")
library("ggplot2")
And because we will show examples of custom modifications to ggplot2 plots, we also loaded ggplot2 as well. Here I’ll set as default my favorite ggplot2 theme. These are completely optional, and modifiable.
theme_set(theme_bw())
See the microbio_me_qiime tutorial for more details and examples downloading and importing into phyloseq/R directly from this public database.
To facilitate testing and exploration of tools in phyloseq, this package includes example data from published studies. Many of the examples in this vignette use either the Global Patterns or enterotype
datasets as source data. The Global Patterns data was described in a 2011 article in PNAS(Caporaso 2011), and compares the microbial communities of 25 environmental samples and three known “mock communities” — a total of 9 sample types — at a depth averaging 3.1 million reads per sample. The human enterotype dataset was described in a 2011 article in Nature (Arumugam 2011), which compares the faecal microbial communities from 22 subjects using complete shotgun DNA sequencing. The authors further compare these microbial communities with the faecal communities of subjects from other studies, for a total of 280 faecal samples / subjects, and 553 genera. Sourcing data from different studies invariable leads to gaps in the data for certain variables, and this is easily handled by R's core
NA features.
Because this data is included in the package, the examples can easily be run on your own computer using the code shown in this vignette. The data is loaded into memory using the data
command. Let’s start by loading the Global Patterns data.
data(GlobalPatterns)
Later on we will use an additional categorical designation — human versus non-human associated samples — that was not in the original dataset. Now is a good time to add it as an explicit variable of the sample_data
, and because we don’t want to type long words over and over, we’ll choose a shorter name for this modified version of Global Patterns, call it GP
, and also remove a handful of taxa that are not present in any of the samples included in this dataset (probably an artifact):
# prune OTUs that are not present in at least one sample
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
# Define a human-associated versus non-human categorical variable:
human <- get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
# Add new human variable to sample data:
sample_data(GP)$human <- factor(human)
For further details, see the plot_richness tutorial
We can easily create a complex graphic that compares the richness estimates of samples from different environment types in the Global Patterns dataset, using the plot_richness
function. Note that it is important to use raw (untrimmed) OTU-clustered data when performing richness estimates, as they can be highly dependent on the number of singletons in a sample.
alpha_meas = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson")
(p <- plot_richness(GP, "human", "SampleType", measures=alpha_meas))
Add a ggplot2 box plot layer to the previous plot
p + geom_boxplot(data=p$data, aes(x=human, y=value, color=NULL), alpha=0.1)
Alpha diversity estimators for samples in the Global Patterns dataset. Each panel shows a different type of estimator. Individual color-shaded points and brackets represent the richness estimate and the theoretical standard error range associated with that estimate, respectively. The colors group the sample-sources into “types”. Within each panel, the samples are further organized into human-associated (TRUE
) or not (FALSE
), and a boxplot is overlayed on top of this for the two groups, illustrating that these human-associated samples are less rich than the environmental samples (although the type of environment appears to matter a great deal as well).
For further details, see the plot_tree tutorial
phyloseq also contains a method for easily plotting an annotated phylogenetic tree with information regarding the sample in which a particular taxa was observed, and optionally the number of individuals that were observed.
For the sake of creating a readable tree, let’s subset the data to just the Chlamydiae phylum, which consists of obligate intracellular pathogens and is present in only a subset of environments in this dataset.
GP.chl <- subset_taxa(GP, Phylum=="Chlamydiae")
And now we will create the tree graphic form this subset of Global Patterns, shading by the “`SampleType” variable, which indicates the environment category from which the microbiome samples originated. The following command also takes the option of labeling the number of individuals observed in each sample (if at all) of each taxa. The symbols are slightly enlarged as the number of individuals increases.
plot_tree(GP.chl, color="SampleType", shape="Family", label.tips="Genus", size="Abundance")