Paul J. McMurdie and Susan Holmes
If you find phyloseq and/or its tutorials useful, please acknowledge and cite phyloseq in your publications:
phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217 http://dx.plos.org/10.1371/journal.pone.0061217
The phyloseq project also has a number of supporting online resources, most of which can be found at the phyloseq home page, or from the phyloseq stable release page on Bioconductor.
To post feature requests or ask for help, try the phyloseq Issue Tracker.
The analysis of microbiological communities brings many challenges: the integration of many different types of data with methods from ecology, genetics, phylogenetics, network analysis, visualization and testing. The data itself may originate from widely different sources, such as the microbiomes of humans, soils, surface and ocean waters, wastewater treatment plants, industrial facilities, and so on. As a result, these varied sample types may have very different forms and scales of related data that depend heavily on the experiment and its question(s). The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, a phylogenetic tree, and/or taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as a single experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of OTU-clustered high-throughput phylogenetic sequencing data.
code font
- The font for code, usually courier-like, but depends on the theme.
myFun()
- A code-font word with () attached at the right end is a function name.
An overview of phyloseq’s intended functionality, goals, and design is provided in the following free and open access article:
McMurdie and Holmes (2013). phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE e61217.
The most up-to-date examples are posted in our online tutorials, available from the phyloseq home page.
A separate vignette describes analysis tools included in phyloseq along with various examples using included example data. A quick way to load it is:
vignette("phyloseq_analysis")
By contrast, this vignette is intended to provide functional examples of the basic data import and manipulation infrastructure included in phyloseq. This includes example code for importing OTU-clustered data from different clustering pipelines, as well as performing clear and reproducible filtering tasks that can be altered later and checked for robustness. The motivation for including tools like this in phyloseq is to save time, and also to build in a structure that requires consistency across related data tables from the same experiment. This not only reduces code repetition, but also decreases the likelihood of mistakes during data filtering and analysis. For example, it is intentionally difficult in phyloseq to create an experiment-level object in which a component tree and OTU table have different OTU names. The import functions, the trimming tools, and the main tool for creating an experiment-level object, phyloseq(), all automatically trim the OTU and sample indices to their intersection, such that these component data types are exactly coherent.
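The trimming behavior described above can be illustrated with a small sketch. The OTU names, sample names, and counts below are made up for illustration; only the phyloseq functions themselves (phyloseq(), otu_table(), tax_table(), taxa_names()) come from the package.

```r
library("phyloseq")

# Toy OTU table: 3 OTUs (rows) by 2 samples (columns). All names are hypothetical.
otumat <- matrix(
  c(5, 0, 3, 2, 7, 1), nrow = 3,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2"))
)

# Toy taxonomy table that describes only two of the three OTUs.
taxmat <- matrix(
  c("Firmicutes", "Bacteroidetes"), nrow = 2,
  dimnames = list(c("OTU1", "OTU2"), "Phylum")
)

# Build the experiment-level object from the two components.
ps <- phyloseq(otu_table(otumat, taxa_are_rows = TRUE), tax_table(taxmat))

# Only the OTUs present in both components should remain.
taxa_names(ps)
```

Because OTU3 has no taxonomy entry in this toy data, it is expected to be dropped from the combined object rather than carried along with missing taxonomy.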
The class structure in the phyloseq package follows the inheritance diagram shown in the figure below.
Currently, phyloseq uses 4 core data classes. They are (1) the OTU abundance table (otu_table), (2) a table of sample data (sample_data), (3) a table of taxonomic descriptors (taxonomyTable), and (4) a phylogenetic tree ("phylo"-class, ape package).
The otu_table class can be considered the central data type, as it directly represents the number and type of sequences observed in each sample. otu_table extends the numeric matrix class in base R, and has a few additional feature slots. The most important of these feature slots is the taxa_are_rows slot, which holds a single logical that indicates whether the table is oriented with taxa as rows (as in the genefilter package in Bioconductor) or with taxa as columns (as in the vegan and picante packages). In phyloseq methods, as well as its extensions of methods in other packages, the taxa_are_rows value is checked to ensure proper orientation of the otu_table. A phyloseq user is only required to specify the otu_table orientation during initialization, following which all handling is internal.
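As a minimal sketch of how orientation is declared once and then handled internally, the following builds the same toy counts in both orientations. The matrix contents and names are hypothetical.

```r
library("phyloseq")

# Hypothetical counts: 3 OTUs (rows) by 2 samples (columns).
counts <- matrix(
  c(10, 0, 4, 6, 2, 8), nrow = 3,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2"))
)

# Same data, declared with taxa as rows ...
otu_rows <- otu_table(counts, taxa_are_rows = TRUE)
# ... and transposed, declared with taxa as columns.
otu_cols <- otu_table(t(counts), taxa_are_rows = FALSE)

# The orientation flag is stored with each object.
taxa_are_rows(otu_rows)
taxa_are_rows(otu_cols)

# Accessors consult the flag, so both objects report the same taxa and samples.
identical(taxa_names(otu_rows), taxa_names(otu_cols))
identical(sample_names(otu_rows), sample_names(otu_cols))
```

The point of the design is that downstream code can call taxa_names() or sample_names() without checking which dimension holds the taxa.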
The sample_data class directly inherits R’s data.frame class, and thus effectively stores both categorical and numerical data about each sample. The orientation of a data.frame in this context requires that samples/trials are rows, and variables are columns (consistent with vegan and other packages). The taxonomyTable class directly inherits the matrix class, and is oriented such that rows are taxa/OTUs and columns are taxonomic levels (e.g. Phylum).
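A short sketch of these two orientations, using invented sample variables and taxonomy values, might look like this:

```r
library("phyloseq")

# Hypothetical sample data: rows are samples, columns are variables.
samdf <- data.frame(
  SampleType = c("soil", "ocean"),  # categorical variable
  Depth = c(10, 250),               # numerical variable
  row.names = c("S1", "S2")
)
sam <- sample_data(samdf)

# Hypothetical taxonomy: rows are OTUs, columns are taxonomic levels.
taxmat <- matrix(
  c("Bacteria", "Bacteria", "Firmicutes", "Proteobacteria"), nrow = 2,
  dimnames = list(c("OTU1", "OTU2"), c("Kingdom", "Phylum"))
)
tax <- tax_table(taxmat)

sample_names(sam)  # sample identifiers come from the row names
colnames(sam)      # sample variables are the columns
taxa_names(tax)    # OTU identifiers come from the row names
colnames(tax)      # taxonomic levels are the columns
```

Since sample_data extends data.frame and taxonomyTable extends matrix, the usual base R accessors such as colnames() apply to them directly.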
The phyloseq-class can be considered an “experiment-level class” and should contain two or more of the previously-described core data classes. We assume that phyloseq users will be interested in analyses that utilize their abundance counts derived from the phylogenetic sequencing data, and so the phyloseq() constructor will stop with an error if the arguments do not include an otu_table. There are a number of common methods that require either an otu_table and sample_data combination, or an otu_table and phylogenetic tree combination. These methods can operate on instances of the phyloseq-class, and will stop with an error if the required component data is missing.
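The two failure modes described above can be sketched as follows. All data here are invented, and tryCatch() is used only so the errors can be inspected instead of halting the session.

```r
library("phyloseq")

# Hypothetical components (no otu_table among the first two).
sam <- sample_data(data.frame(
  SampleType = c("soil", "ocean"),
  row.names = c("S1", "S2")
))
tax <- tax_table(matrix(
  c("Firmicutes", "Proteobacteria"), nrow = 2,
  dimnames = list(c("OTU1", "OTU2"), "Phylum")
))

# Case 1: constructing an experiment-level object without an otu_table.
no_otu <- tryCatch(phyloseq(sam, tax), error = function(e) e)
inherits(no_otu, "error")

# Case 2: asking for a component that the object does not contain.
otu <- otu_table(
  matrix(c(5, 0, 3, 2), nrow = 2,
         dimnames = list(c("OTU1", "OTU2"), c("S1", "S2"))),
  taxa_are_rows = TRUE
)
ps <- phyloseq(otu, tax)  # valid object, but it has no sample data
no_sam <- tryCatch(sample_data(ps), error = function(e) e)
inherits(no_sam, "error")
```

Failing early in this way is intended to surface a missing component at the point of construction or access, rather than later inside an analysis.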
Classes and inheritance in the phyloseq package. The class name and its slots are shown with red- or blue-shaded text, respectively. Coercibility is indicated graphically by arrows with the coercion function shown. Lines without arrows indicate that the more complex class (“phyloseq”) contains a slot with the associated data class as its components.
Now let’s get started by loading phyloseq, and describing some methods for importing data.
To use phyloseq in a new R session, it will have to be loaded. This can be done in your package manager, or at the command line using the library() command:
library("phyloseq")
An important feature of phyloseq is its set of methods for importing phylogenetic sequencing data from common taxonomic clustering pipelines. These methods take file pathnames as input, read and parse those files, and return a single object that contains all of the data.
Some additional background details are provided below. The best reproducible examples on importing data with phyloseq can be found on the official data import tutorial page.
New versions of QIIME (see below) produce a file in version 2 of the biom file format, which is a specialized definition of the HDF5 format.
The phyloseq package provides the import_biom() function, which can import both Version 1 (JSON) and Version 2 (HDF5) of the BIOM file format.
The phyloseq package fully supports both taxa and sample observations of the biom format standard, and works with the BIOM files output from QIIME, RDP, MG-RAST, etc.
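As a hedged sketch of a BIOM import, the following reads one of the small example files that ship in the package's extdata directory. The choice of parse_taxonomy_greengenes as the taxonomy parser is an assumption that suits Greengenes-style taxonomy strings; other data may need a different parser or the default.

```r
library("phyloseq")

# Path to a small example BIOM file bundled with phyloseq.
rich_sparse_biom <- system.file(
  "extdata", "rich_sparse_otu_table.biom", package = "phyloseq"
)

# Import the file; the parse function is an assumption about the
# taxonomy string format, not a requirement of import_biom().
biom_data <- import_biom(
  rich_sparse_biom,
  parseFunction = parse_taxonomy_greengenes
)

# Print a summary of the components that were found in the file.
biom_data
```

For your own data, replace the system.file() call with the path to your .biom file.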
The default output from modern versions of QIIME is a BIOM-format file (among others). This is supported in phyloseq.
Sometimes inaccurately referred to as metadata, additional observations on samples provided as a mapping file to QIIME have not typically been output in the BIOM files, even though the BIOM format supports it. This failure to support the full capability of the BIOM format means that you’ll have to provide sample observations as a separate file. There are many ways to do this, but the QIIME sample map is supported.
Two QIIME output files (.biom, .tre) are recognized by the import_biom() function.
One QIIME input file (sample map, tab-delimited) is recognized by the import_qiime_sample_data() function.
The objects created by each of the import functions above should be merged using merge_phyloseq() to create one coordinated, self-consistent object.
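The merge step can be sketched with toy stand-ins for the imported objects. In a real workflow the otu object would come from import_biom() and the sam object from import_qiime_sample_data(); here both are built by hand from invented values so the example runs without any QIIME files.

```r
library("phyloseq")

# Stand-in for the object returned by import_biom() (hypothetical counts).
otu <- otu_table(
  matrix(c(5, 0, 3, 2), nrow = 2,
         dimnames = list(c("OTU1", "OTU2"), c("S1", "S2"))),
  taxa_are_rows = TRUE
)

# Stand-in for the object returned by import_qiime_sample_data()
# (hypothetical sample variable).
sam <- sample_data(data.frame(
  Treatment = c("control", "treated"),
  row.names = c("S1", "S2")
))

# Combine the separately created components into one object.
merged <- merge_phyloseq(otu, sam)

merged
sample_variables(merged)
```

The sample names must agree between the components, since the merged object keeps only the samples they share.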
Before merging with merge_phyloseq(), the output from these import activities is the three separate objects listed in the previous table.
QIIME’s “Moving Pictures” example tutorial output is a little too large to include within the phyloseq package (and thus is not directly included in this vignette). However, the phyloseq home page includes a full reproducible example of the import procedure described above:
Link HERE
For reference, or if you want to try it yourself, the following are the relative paths within the QIIME tutorial directory for each of the files you will need.
QIIME is a free, open-source OTU clustering and analysis pipeline written for Unix (mostly Linux). It is distributed in a number of different forms (including a pre-installed virtual machine). See the QIIME home page for details.
One QIIME input file (sample map), and two QIIME output files (otu_table.txt, .tre) are recognized by the import_qiime() function. Only one of the three input files is required to run, although an "otu_table.txt" file is required if import_qiime() is to return a complete experiment object.
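A minimal sketch of a legacy QIIME import, assuming the small example files bundled in the package's extdata directory are available in your installed version:

```r
library("phyloseq")

# Small example files shipped with phyloseq (assumed present in extdata).
otufile <- system.file(
  "extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq"
)
mapfile <- system.file(
  "extdata", "master_map.txt", package = "phyloseq"
)
trefile <- system.file(
  "extdata", "GP_tree_rand_short.newick.gz", package = "phyloseq"
)

# Import the OTU table, sample map, and tree into one object.
qiimedata <- import_qiime(otufile, mapfile, trefile)

# Print a summary of the imported components.
qiimedata
```

For your own QIIME run, replace the three system.file() calls with the paths to your OTU table, sample map, and tree files.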
In practice, you will have to find the relevant QIIME files among a number of other files created by the QIIME pipeline. A screenshot of the directory structure created during a typical QIIME run is shown in the QIIME Directory Figure.