Paul J. McMurdie and Susan Holmes

mcmurdie@stanford.edu

phyloseq Home Page

If you find phyloseq and/or its tutorials useful, please acknowledge and cite phyloseq in your publications:

phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217 http://dx.plos.org/10.1371/journal.pone.0061217

0.1 Other resources

The phyloseq project also has a number of supporting online resources, most of which can be found at the phyloseq home page or from the phyloseq stable release page on Bioconductor.

To post feature requests or ask for help, try the phyloseq Issue Tracker.

1 Introduction

The analysis of microbiological communities brings many challenges, chief among them the integration of many different types of data with methods from ecology, genetics, phylogenetics, network analysis, visualization, and testing. The data may originate from widely different sources, such as the microbiomes of humans, soils, surface and ocean waters, wastewater treatment plants, industrial facilities, and so on. As a result, these varied sample types may have very different forms and scales of related data, depending heavily on the experiment and its question(s). The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, a phylogenetic tree, and/or a taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced, flexible graphics systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as a single experiment-level object, making it easier to share data and reproduce analyses. In general, phyloseq seeks to facilitate the use of R for efficient interactive and reproducible analysis of OTU-clustered high-throughput phylogenetic sequencing data.

2 About this vignette

2.1 Typesetting Legend

  • bold - Bold is used for emphasis.
  • italics - Italics are used for package names and for special words or phrases.
  • code font - The font for code, usually Courier-like, but depends on the theme.
  • myFun() - A word in code font with () attached at the right end is a function name.
  • Hyperlink - Hyperlinks are clickable text that will jump to sections and external pages.

3 phyloseq classes

The class structure in the phyloseq package follows the inheritance diagram shown in the figure below. Currently, phyloseq uses 4 core data classes. They are (1) the OTU abundance table (otu_table); (2) a table of sample data (sample_data); (3) a table of taxonomic descriptors (taxonomyTable); and (4) a phylogenetic tree ("phylo"-class, ape package).

The otu_table class can be considered the central data type, as it directly represents the number and type of sequences observed in each sample. otu_table extends the numeric matrix class in base R, and has a few additional feature slots. The most important of these is the taxa_are_rows slot, which holds a single logical that indicates whether the table is oriented with taxa as rows (as in the genefilter package in Bioconductor) or with taxa as columns (as in the vegan and picante packages). In phyloseq methods, as well as its extensions of methods in other packages, the taxa_are_rows value is checked to ensure proper orientation of the otu_table. A phyloseq user is only required to specify the otu_table orientation during initialization, after which all handling is internal.
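As a minimal sketch of the orientation handling described above, the following builds the same abundance table in both orientations. The counts and the OTU and sample names are made up for illustration; only otu_table(), taxa_names(), sample_names(), ntaxa(), and taxa_are_rows() come from phyloseq.

```r
library("phyloseq")

# Hypothetical counts: 3 OTUs (rows) observed in 2 samples (columns).
counts <- matrix(
  c(10, 0,
     3, 7,
     0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("SampleA", "SampleB"))
)

# The orientation is declared once, at construction.
otu_by_row <- otu_table(counts, taxa_are_rows = TRUE)

# The same data with taxa as columns (the vegan/picante convention).
otu_by_col <- otu_table(t(counts), taxa_are_rows = FALSE)

# Accessors consult taxa_are_rows, so both objects describe the same taxa and samples.
taxa_names(otu_by_row)
taxa_names(otu_by_col)
sample_names(otu_by_col)
taxa_are_rows(otu_by_col)
```

Because the accessors check the slot, downstream code does not need to know which orientation was supplied.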

The sample_data class directly inherits R’s data.frame class, and thus effectively stores both categorical and numerical data about each sample. The orientation of a data.frame in this context requires that samples/trials are rows, and variables are columns (consistent with vegan and other packages). The taxonomyTable class directly inherits the matrix class, and is oriented such that rows are taxa/OTUs and columns are taxonomic levels (e.g. Phylum).
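The two orientations described above can be sketched with small, made-up tables. All values and names below are hypothetical. Note that the taxonomyTable class is built with the tax_table() constructor.

```r
library("phyloseq")

# Hypothetical sample observations: rows are samples, columns are variables.
sam <- sample_data(data.frame(
  SampleType = c("soil", "ocean"),
  pH         = c(6.5, 8.1),
  row.names  = c("SampleA", "SampleB")
))

# Hypothetical taxonomy: rows are OTUs, columns are taxonomic levels.
tax <- tax_table(matrix(
  c("Bacteria", "Proteobacteria",
    "Bacteria", "Firmicutes",
    "Archaea",  "Euryarchaeota"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("Kingdom", "Phylum"))
))

sample_names(sam)  # row names of the sample data
colnames(sam)      # sample variables
taxa_names(tax)    # row names of the taxonomy table
colnames(tax)      # taxonomic levels
```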

The phyloseq-class can be considered an “experiment-level class” and should contain two or more of the previously-described core data classes. We assume that phyloseq users will be interested in analyses that utilize their abundance counts derived from the phylogenetic sequencing data, and so the phyloseq() constructor will stop with an error if the arguments do not include an otu_table. There are a number of common methods that require either an otu_table and sample_data combination, or an otu_table and phylogenetic tree combination. These methods can operate on instances of the phyloseq-class, and will stop with an error if the required component data is missing.
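A minimal sketch of the constructor behavior described above, using made-up component data (all values and names are hypothetical). The second call omits the otu_table on purpose; tryCatch() captures the resulting condition instead of halting the session.

```r
library("phyloseq")

# Hypothetical components that share OTU and sample names.
otu <- otu_table(
  matrix(c(10, 0, 3, 7, 0, 5), nrow = 3, byrow = TRUE,
         dimnames = list(c("OTU1", "OTU2", "OTU3"), c("SampleA", "SampleB"))),
  taxa_are_rows = TRUE
)
sam <- sample_data(data.frame(
  SampleType = c("soil", "ocean"),
  row.names  = c("SampleA", "SampleB")
))
tax <- tax_table(matrix(
  c("Bacteria", "Bacteria", "Archaea"), ncol = 1,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), "Kingdom")
))

# Two or more components, including an otu_table: an experiment-level object.
ps <- phyloseq(otu, sam, tax)
ps

# No otu_table among the arguments: the constructor is expected to stop with an error.
no_otu <- tryCatch(phyloseq(sam, tax), error = function(e) e)
inherits(no_otu, "error")
```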

Figure: phyloseq class structure. Classes and inheritance in the phyloseq package. The class name and its slots are shown with red- or blue-shaded text, respectively. Coercibility is indicated graphically by arrows with the coercion function shown. Lines without arrows indicate that the more complex class ("phyloseq") contains a slot with the associated data class as one of its components.

4 Load phyloseq and import data

Now let’s get started by loading phyloseq, and describing some methods for importing data.

4.1 Load phyloseq

To use phyloseq in a new R session, it will have to be loaded. This can be done in your package manager, or at the command line using the library() command:

library("phyloseq")

4.2 Import data

An important feature of phyloseq is its set of methods for importing phylogenetic sequencing data from common taxonomic clustering pipelines. These methods take file pathnames as input, read and parse those files, and return a single object that contains all of the data.

Some additional background details are provided below. The best reproducible examples on importing data with phyloseq can be found on the official data import tutorial page:

http://joey711.github.com/phyloseq/import-data

4.3 Import from biom-format

New versions of QIIME (see below) produce a file in version 2 of the biom file format, which is a specialized definition of the HDF5 format.

The phyloseq package provides the import_biom() function, which can import both Version 1 (JSON) and Version 2 (HDF5) of the BIOM file format.

The phyloseq package fully supports both taxa and sample observations of the biom format standard, and works with the BIOM files output from QIIME, RDP, MG-RAST, etc.
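As a minimal sketch, import_biom() can be tried on one of the small example BIOM files installed with phyloseq. The file name below is assumed to be present in the package's extdata directory in your installed version; the check on the path guards against that assumption.

```r
library("phyloseq")

# Example BIOM file (Version 1, JSON) assumed to ship with phyloseq.
biom_file <- system.file("extdata", "rich_dense_otu_table.biom",
                         package = "phyloseq")
stopifnot(nzchar(biom_file))  # system.file() returns "" if the file is missing

# Read and parse the file. A "rich" BIOM file carries taxa and sample
# observations in addition to the abundance counts.
biom_data <- import_biom(biom_file)
biom_data

ntaxa(biom_data)
nsamples(biom_data)
```

For a file of your own, replace biom_file with its pathname.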

4.4 Import from QIIME (Modern)

The default output from modern versions of QIIME is a BIOM-format file (among others). This is supported in phyloseq.

4.4.1 Sample data from QIIME

Sometimes inaccurately referred to as metadata, the additional observations on samples that are provided to QIIME as a mapping file have not typically been output in the BIOM files, even though the BIOM format supports it. This failure to support the full capability of the BIOM format means that you’ll have to provide sample observations as a separate file. There are many ways to do this, but the QIIME sample map is supported.

4.4.2 Input

Two QIIME output files (.biom, .tre) are recognized by the import_biom() function. One QIIME input file (sample map, tab-delimited), is recognized by the import_qiime_sample_data() function.

The objects created by each of the import functions above should be merged using merge_phyloseq() to create one coordinated, self-consistent object.

4.4.3 Output

  • Before Merging - Before merging with merge_phyloseq(), the output from these import activities is the set of separate objects returned by the import functions described in the Input section.
  • After Merging - After merging you have a single self-consistent phyloseq object that contains an OTU table, taxonomy table, sample-data, and a phylogenetic tree.

4.4.4 QIIME Example Tutorial

QIIME’s “Moving Pictures” example tutorial output is a little too large to include within the phyloseq package (and thus is not directly included in this vignette). However, the phyloseq home page includes a full reproducible example of the import procedure described above:

Link HERE

For reference, or if you want to try it yourself, the following are the relative paths within the QIIME tutorial directory for each of the files you will need.

  • BIOM file, originally at: moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom
  • Tree file, originally at: moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/rep_set.tre
  • Map File, originally at: moving_pictures_tutorial-1.9.0/illumina/map.tsv
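The import-then-merge procedure described above can be sketched as follows, assuming the tutorial archive has been unpacked locally. The helper name import_moving_pictures() and its tutorial_dir argument are hypothetical and not part of phyloseq; the file paths are the ones listed above.

```r
library("phyloseq")

# Hypothetical helper: `tutorial_dir` is wherever the QIIME tutorial was unpacked.
import_moving_pictures <- function(tutorial_dir = "moving_pictures_tutorial-1.9.0") {
  otus_dir  <- file.path(tutorial_dir, "illumina", "precomputed-output", "otus")
  biom_file <- file.path(otus_dir, "otu_table_mc2_w_tax_no_pynast_failures.biom")
  tree_file <- file.path(otus_dir, "rep_set.tre")
  map_file  <- file.path(tutorial_dir, "illumina", "map.tsv")

  # Fail early, with a clear message, if any input file is missing.
  stopifnot(file.exists(biom_file), file.exists(tree_file), file.exists(map_file))

  # OTU table, taxonomy, and tree from the two QIIME output files.
  otu_tax_tree <- import_biom(biom_file, treefilename = tree_file)

  # Sample observations from the QIIME sample map.
  sam <- import_qiime_sample_data(map_file)

  # Combine everything into one self-consistent phyloseq object.
  merge_phyloseq(otu_tax_tree, sam)
}
```

Calling import_moving_pictures() with the path to your copy of the tutorial should return the merged object; the full worked example remains the one on the phyloseq home page.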

4.5 Import from QIIME Legacy

QIIME is a free, open-source OTU clustering and analysis pipeline written for Unix (mostly Linux). It is distributed in a number of different forms (including a pre-installed virtual machine). See the QIIME home page for details.

4.5.1 Input

One QIIME input file (sample map), and two QIIME output files (otu_table.txt, .tre) are recognized by the import_qiime() function. Only one of the three input files is required to run, although an "otu_table.txt" file is required if import_qiime() is to return a complete experiment object.

In practice, you will have to find the relevant QIIME files among a number of other files created by the QIIME pipeline. A screenshot of the directory structure created during a typical QIIME run is shown in the QIIME Directory Figure.