Paul J. McMurdie and Susan Holmes

phyloseq Home Page

If you find phyloseq and/or its tutorials useful, please acknowledge and cite phyloseq in your publications:

phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217

0.1 Other resources

The phyloseq project also has a number of supporting online resources, most of which can by found at the phyloseq home page, or from the phyloseq stable release page on Bioconductor.

To post feature requests or ask for help, try the phyloseq Issue Tracker.

1 Introduction

The analysis of microbiological communities brings many challenges: the integration of many different types of data with methods from ecology, genetics, phylogenetics, network analysis, visualization and testing. The data itself may originate from widely different sources, such as the microbiomes of humans, soils, surface and ocean waters, wastewater treatment plants, industrial facilities, and so on; and as a result, these varied sample types may have very different forms and scales of related data that is extremely dependent upon the experiment and its question(s). The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as single experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of OTU-clustered high-throughput phylogenetic sequencing data.

2 About this vignette

2.1 Typesetting Legend

  • bold - Bold is used for emphasis.
  • italics - Italics are used for package names, and special words, phrases.
  • code font - The font for code, usually courrier-like, but depends on the theme.
  • myFun() - Code font word with () attached at the right-end, is a function name.
  • Hyperlink - Hyperlinks are clickable text that will jump to sections and external pages.

3 phyloseq classes

The class structure in the phyloseq package follows the inheritance diagram shown in the figure below. Currently, phyloseq uses 4 core data classes. They are (1) the OTU abundance table (otu_table), a table of sample data (sample_data); (2) a table of taxonomic descriptors (taxonomyTable); and (3) a phylogenetic tree ("phylo"-class, ape package.

The otu_table class can be considered the central data type, as it directly represents the number and type of sequences observed in each sample. otu_table extends the numeric matrix class in the R base, and has a few additonal feature slots. The most important of these feature slots is the taxa_are_rows slot, which holds a single logical that indicates whether the table is oriented with taxa as rows (as in the genefilter package in Bioconductor or with taxa as columns (as in vegan and picante packages). In phyloseq methods, as well as its extensions of methods in other packages, the taxa_are_rows value is checked to ensure proper orientation of the otu_table. A phyloseq user is only required to specify the otu_table orientation during initialization, following which all handling is internal.

The sample_data class directly inherits R’s data.frame class, and thus effectively stores both categorical and numerical data about each sample. The orientation of a data.frame in this context requires that samples/trials are rows, and variables are columns (consistent with vegan and other packages). The taxonomyTable class directly inherits the matrix class, and is oriented such that rows are taxa/OTUs and columns are taxonomic levels (e.g. Phylum).

The phyloseq-class can be considered an “experiment-level class” and should contain two or more of the previously-described core data classes. We assume that phyloseq users will be interested in analyses that utilize their abundance counts derived from the phylogenetic sequencing data, and so the phyloseq() constructor will stop with an error if the arguments do not include an otu_table. There are a number of common methods that require either an otu_table and sample_data combination, or an otu_table and phylogenetic tree combination. These methods can operate on instances of the phyloseq-class, and will stop with an error if the required component data is missing.