Exploring sequence and phylogenetic diversity for a taxonomic group of interest.

Nate D. Olson

2018-04-30

Overview

MgDb objects for 16S rRNA sequence reference databases can be used to explore sequnce and phylogenetic diversity of taxonomic groups of interest. In this vignette we will explore the 16S rRNA diversity for the Enterobacteriace family. The Greengenes 16S rRNA database version 13.8 clusted at the 85% threshold will be used in this vignette. A MgDb object with this database (gg85) is included in the metagenomeFeatures package. We will first load the MgDb class object, then select the taxa of interest, and finally provide a explor the phylogenetic and sequence diversity and taxonomic composition of the Enterobacteriace family.

The database exploration will consist of four parts.

  1. Loading and subsetting the database.
  2. Taxonomic analysis.
  3. Phylogenetic diversity.
  4. Sequence diversity.

Loading and subsetting the database

First we need to load the metagenomeFeatues package. The tidyverse packages, dplyr, tidyr, and ggplot are used display the results.

library(metagenomeFeatures) 
library(tidyverse)

The gg85 database is loaded using get_gg13.8_85MgDb().

gg85 <- get_gg13.8_85MgDb()

Other databases are avilable as Bioconductor AnnotationData.

Next the mgDb_select() function is first used to subset the database, the key arguments is used to define the taxonmic group(s) of interest and keytype is used to define the taxonomic level for the group(s) of interest. With the subsetted database you can analyze the taxonomy, sequences, and phylogeny for the taxonomic group of interest. The select function returns single object with a subset of the database taxonomic, sequence, or phylogenetic data, as well as a named list with any two or all three data types.

gamma_16S <- mgDb_select(gg85, 
                          type = "all", 
                          keys = "Gammaproteobacteria", 
                          keytype = "Class")

Taxonomic Analysis

We want to know how many genera in the Order Enterobacteriaceae there are sequences for in the Greengenes 85% OTU database as well as how many sequences are assigned to each genera. We will first create a dataframe with the desired information then a plot to summarize the results.

## Per genus count data
gamma_df <- gamma_16S$taxa %>% 
    group_by(Genus) %>% 
    summarise(Count = n()) %>% 
    ungroup() %>% 
    mutate(Genus = fct_reorder(Genus, Count)) 

## Count info for text 
total_otus <- sum(gamma_df$Count)
no_genus_assignment <- gamma_df$Count[gamma_df$Genus == "g__"]
escherichia_count <- gamma_df$Count[gamma_df$Genus == "g__Escherichia"]

## excluding unassigned genera and genera with only one assigned sequence
gamma_trim_df <- gamma_df %>% filter(Genus != "g__", Count > 1)

For the Greengenes database there are a total of 247 OTUs assigned to 56 genera in the Class Gammaproteobacteria. The number of OTUs assigned to specific genera, range from 76 to 1 @ref(fig:generaCount). As this database is preclustered to 85% similarity the number of OTUs per genus is more of an indicator of genera diversity than how well the genera is represented in the database. For example no sequences present in the database are assigned to the genus Shigella and only 1 are assigned to Escherichia. OTUs only assigned to the family, g__, is the most abundant group, 154, many of which are likely from these two genera. Next we will use the phylogenetic information to evaluate this assumption.

 ggplot(gamma_trim_df) + 
    geom_bar(aes(x = Genus, y = Count), stat = "identity") + 
    labs(y = "Number of OTUs") + 
    coord_flip() + 
    theme_bw() 
Number of seqeunces assigned to genera in the Class Gammaproteobacteria. Only Genera with more than one assigned sequence are shown.

Number of seqeunces assigned to genera in the Class Gammaproteobacteria. Only Genera with more than one assigned sequence are shown.

Phylogenetic Diversity

We will use the ggtree package to visualize the phylogenetic tree and annotate the tips with Genera information. A number of OTUs unassigned at the genus level are in the same clade as OTUs assigned to the Genera Escherichia and a related Genera Salmonella @ref(fig:annoTree).

library(ggtree)

genus_anno_df <- gamma_16S$taxa %>% 
    group_by(Genus) %>% 
    mutate(Count = n()) %>% 
    ungroup() %>% 
    mutate(Genus_lab = if_else(Genus %in% paste0("g__", c("","Escherichia", "Salmonella")), Genus, ""))

ggtree(gamma_16S$tree) %<+% 
    genus_anno_df + 
    geom_tippoint(aes(color = Genus_lab), size = 3) + 
    scale_color_manual(values = c("#FF000000","darkorange","blue", "darkgreen")) + 
    theme(legend.position = "bottom") + 
    labs(color = "Genus")
Phylogenetic tree for Gammaproteobacteria class OTU representative sequences. OTUs not classified to the genus level are annotated with orange dots, and Escherichia and Salmonella OTUs are annotated with blue and green dots respectively.

Phylogenetic tree for Gammaproteobacteria class OTU representative sequences. OTUs not classified to the genus level are annotated with orange dots, and Escherichia and Salmonella OTUs are annotated with blue and green dots respectively.

Sequence Diversity

The mgDb_select function returns the subsetted sequence data as a biostring object. The sequence data can be used for additional analysis such as, multiple sequencing alignment, primer region extraction, or PCR primer design.

gamma_16S$seq
#>   A DNAStringSet instance of length 247
#>       width seq                                        names               
#>   [1]  1503 AGAGTTTGATCCTGGCTCAG...GTGAAGTCGTAACAAGGTA 1111561
#>   [2]  1526 AGAGTTTGATCATGGCTCAC...TTGTAACAAGGTAGCCGTA 1110088
#>   [3]  1502 TAGAGTTTGATCCTGGCTCA...GTCGTAACAAGGTAGCCGT 1107824
#>   [4]  1366 ATTGAACGCTGGCGGCAGGC...TGAATACGTTCCCGGGCCT 1047956
#>   [5]  1365 TGAACGCTGGCGGCAGGCCT...TGAATACGTTCCCGGGCCT 1047401
#>   ...   ... ...
#> [243]  1540 AGAGTTTGATCCTGGCTCAG...AAGTCGTAACAAGGTAACC 4466628
#> [244]  1486 ATTGAACCCTGGCGGCAGGC...GTGAAGTCGTAACAAGGTA 4472725
#> [245]  1575 GCGGAAACGATGGTAGCTTG...GCTTAACCTTCGGGGAGGC 4481079
#> [246]  1501 AGAGTTTGTTCCTGGCTCAG...AAGTCGTAACAAGGTAACC 4481801
#> [247]  1508 TGAGTTTGATCCTGGCTCAG...GGGGTGAAGTCGTAACAAG 4482291

System Information

s_info <- devtools::session_info()
print(s_info$platform)
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  C                           
#>  tz       America/New_York            
#>  date     2018-04-30

Package Versions

s_info$packages
#>  package            * version date       source        
#>  ape                  5.1     2018-04-04 CRAN (R 3.5.0)
#>  assertthat           0.2.0   2017-04-11 CRAN (R 3.5.0)
#>  backports            1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base               * 3.5.0   2018-04-27 local         
#>  bindr                0.1.1   2018-03-13 CRAN (R 3.5.0)
#>  bindrcpp           * 0.2.2   2018-03-29 CRAN (R 3.5.0)
#>  Biobase            * 2.40.0  2018-04-30 Bioconductor  
#>  BiocGenerics       * 0.26.0  2018-04-30 Bioconductor  
#>  Biostrings           2.48.0  2018-04-30 Bioconductor  
#>  bit                  1.1-12  2014-04-09 CRAN (R 3.5.0)
#>  bit64                0.9-7   2017-05-08 CRAN (R 3.5.0)
#>  blob                 1.1.1   2018-03-25 CRAN (R 3.5.0)
#>  broom                0.4.4   2018-03-29 CRAN (R 3.5.0)
#>  cellranger           1.1.0   2016-07-27 CRAN (R 3.5.0)
#>  cli                  1.0.0   2017-11-05 CRAN (R 3.5.0)
#>  colorspace           1.3-2   2016-12-14 CRAN (R 3.5.0)
#>  compiler             3.5.0   2018-04-27 local         
#>  crayon               1.3.4   2017-09-16 CRAN (R 3.5.0)
#>  datasets           * 3.5.0   2018-04-27 local         
#>  DBI                  0.8     2018-03-02 CRAN (R 3.5.0)
#>  dbplyr               1.2.1   2018-02-19 CRAN (R 3.5.0)
#>  DECIPHER             2.8.0   2018-04-30 Bioconductor  
#>  devtools             1.13.5  2018-02-18 CRAN (R 3.5.0)
#>  digest               0.6.15  2018-01-28 CRAN (R 3.5.0)
#>  dplyr              * 0.7.4   2017-09-28 CRAN (R 3.5.0)
#>  evaluate             0.10.1  2017-06-24 CRAN (R 3.5.0)
#>  forcats            * 0.3.0   2018-02-19 CRAN (R 3.5.0)
#>  foreign              0.8-70  2017-11-28 CRAN (R 3.5.0)
#>  ggplot2            * 2.2.1   2016-12-30 CRAN (R 3.5.0)
#>  ggtree             * 1.12.0  2018-04-30 Bioconductor  
#>  glue                 1.2.0   2017-10-29 CRAN (R 3.5.0)
#>  graphics           * 3.5.0   2018-04-27 local         
#>  grDevices          * 3.5.0   2018-04-27 local         
#>  grid                 3.5.0   2018-04-27 local         
#>  gtable               0.2.0   2016-02-26 CRAN (R 3.5.0)
#>  haven                1.1.1   2018-01-18 CRAN (R 3.5.0)
#>  highr                0.6     2016-05-09 CRAN (R 3.5.0)
#>  hms                  0.4.2   2018-03-10 CRAN (R 3.5.0)
#>  htmltools            0.3.6   2017-04-28 CRAN (R 3.5.0)
#>  httr                 1.3.1   2017-08-20 CRAN (R 3.5.0)
#>  IRanges              2.14.0  2018-04-30 Bioconductor  
#>  jsonlite             1.5     2017-06-01 CRAN (R 3.5.0)
#>  knitr                1.20    2018-02-20 CRAN (R 3.5.0)
#>  labeling             0.3     2014-08-23 CRAN (R 3.5.0)
#>  lattice              0.20-35 2017-03-25 CRAN (R 3.5.0)
#>  lazyeval             0.2.1   2017-10-29 CRAN (R 3.5.0)
#>  lubridate            1.7.4   2018-04-11 CRAN (R 3.5.0)
#>  magrittr             1.5     2014-11-22 CRAN (R 3.5.0)
#>  memoise              1.1.0   2017-04-21 CRAN (R 3.5.0)
#>  metagenomeFeatures * 2.0.0   2018-05-01 Bioconductor  
#>  methods            * 3.5.0   2018-04-27 local         
#>  mnormt               1.5-5   2016-10-15 CRAN (R 3.5.0)
#>  modelr               0.1.1   2017-07-24 CRAN (R 3.5.0)
#>  munsell              0.4.3   2016-02-13 CRAN (R 3.5.0)
#>  nlme                 3.1-137 2018-04-07 CRAN (R 3.5.0)
#>  parallel           * 3.5.0   2018-04-27 local         
#>  pillar               1.2.2   2018-04-26 CRAN (R 3.5.0)
#>  pkgconfig            2.0.1   2017-03-21 CRAN (R 3.5.0)
#>  plyr                 1.8.4   2016-06-08 CRAN (R 3.5.0)
#>  psych                1.8.3.3 2018-03-30 CRAN (R 3.5.0)
#>  purrr              * 0.2.4   2017-10-18 CRAN (R 3.5.0)
#>  R6                   2.2.2   2017-06-17 CRAN (R 3.5.0)
#>  Rcpp                 0.12.16 2018-03-13 CRAN (R 3.5.0)
#>  readr              * 1.1.1   2017-05-16 CRAN (R 3.5.0)
#>  readxl               1.1.0   2018-04-20 CRAN (R 3.5.0)
#>  reshape2             1.4.3   2017-12-11 CRAN (R 3.5.0)
#>  rlang                0.2.0   2018-02-20 CRAN (R 3.5.0)
#>  rmarkdown            1.9     2018-03-01 CRAN (R 3.5.0)
#>  rprojroot            1.3-2   2018-01-03 CRAN (R 3.5.0)
#>  RSQLite              2.1.0   2018-03-29 CRAN (R 3.5.0)
#>  rstudioapi           0.7     2017-09-07 CRAN (R 3.5.0)
#>  rvcheck              0.0.9   2017-07-10 CRAN (R 3.5.0)
#>  rvest                0.3.2   2016-06-17 CRAN (R 3.5.0)
#>  S4Vectors            0.18.0  2018-04-30 Bioconductor  
#>  scales               0.5.0   2017-08-24 CRAN (R 3.5.0)
#>  stats              * 3.5.0   2018-04-27 local         
#>  stats4               3.5.0   2018-04-27 local         
#>  stringi              1.1.7   2018-03-12 CRAN (R 3.5.0)
#>  stringr            * 1.3.0   2018-02-19 CRAN (R 3.5.0)
#>  tibble             * 1.4.2   2018-01-22 CRAN (R 3.5.0)
#>  tidyr              * 0.8.0   2018-01-29 CRAN (R 3.5.0)
#>  tidyselect           0.2.4   2018-02-26 CRAN (R 3.5.0)
#>  tidytree             0.1.8   2018-04-19 CRAN (R 3.5.0)
#>  tidyverse          * 1.2.1   2017-11-14 CRAN (R 3.5.0)
#>  tools                3.5.0   2018-04-27 local         
#>  treeio               1.4.0   2018-04-30 Bioconductor  
#>  utf8                 1.1.3   2018-01-03 CRAN (R 3.5.0)
#>  utils              * 3.5.0   2018-04-27 local         
#>  withr                2.1.2   2018-03-15 CRAN (R 3.5.0)
#>  xml2                 1.2.0   2018-01-24 CRAN (R 3.5.0)
#>  XVector              0.20.0  2018-04-30 Bioconductor  
#>  yaml                 2.1.18  2018-03-08 CRAN (R 3.5.0)
#>  zlibbioc             1.26.0  2018-04-30 Bioconductor