SECOM Tutorial

Huang Lin\(^1\)

\(^1\)NICHD, 6710B Rockledge Dr, Bethesda, MD 20817

December 18, 2022

knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA, 
                      fig.width = 6.25, fig.height = 5)
library(ANCOMBC)
library(tidyverse)
get_upper_tri = function(cormat){
    cormat[lower.tri(cormat)] = NA
    diag(cormat) = NA
    return(cormat)
}

1. Introduction

Sparse Estimation of Correlations among Microbiomes (SECOM) (Lin, Eggesbø, and Peddada 2022) is a methodology that aims to detect both linear and nonlinear relationships between a pair of taxa within an ecosystem (e.g., gut) or across ecosystems (e.g., gut and tongue). SECOM corrects both sample-specific and taxon-specific biases and obtains a consistent estimator for the correlation matrix of microbial absolute abundances while maintaining the underlying true sparsity. For more details, please refer to the SECOM paper.

2. Installation

Download package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ANCOMBC")

Load the package.

library(ANCOMBC)

3. Example Data

The HITChip Atlas dataset contains genus-level microbiota profiling with HITChip for 1006 western adults with no reported health complications, reported in (Lahti et al. 2014). The dataset is also available via the microbiome R package (Lahti et al. 2017) in phyloseq (McMurdie and Holmes 2013) format.

data(atlas1006)

# Subset to baseline
tse = atlas1006[, atlas1006$time == 0]

# Re-code the bmi group
tse$bmi = recode(tse$bmi_group,
                 obese = "obese",
                 severeobese = "obese",
                 morbidobese = "obese")
# Subset to lean, overweight, and obese subjects
tse = tse[, tse$bmi %in% c("lean", "overweight", "obese")]

# Create the region variable
tse$region = recode(as.character(tse$nationality),
                    Scandinavia = "NE", UKIE = "NE", SouthEurope = "SE", 
                    CentralEurope = "CE", EasternEurope = "EE",
                    .missing = "unknown")

# Discard "EE" as it contains only 1 subject
# Discard subjects with missing values of region
tse = tse[, ! tse$region %in% c("EE", "unknown")]

print(tse)
class: TreeSummarizedExperiment 
dim: 130 873 
metadata(0):
assays(1): counts
rownames(130): Actinomycetaceae Aerococcus ... Xanthomonadaceae
  Yersinia et rel.
rowData names(3): Phylum Family Genus
colnames(873): Sample-1 Sample-2 ... Sample-1005 Sample-1006
colData names(12): age sex ... bmi region
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL

4. Run SECOM on a Single Ecosystem

4.1 Run secom functions

4.2 Visualizations

5. Run SECOM on Multiple Ecosystems

5.1 Data manipulation

To compute correlations whithin and across different ecosystems, one needs to make sure that there are samples in common across these ecosystems.

class: TreeSummarizedExperiment 
dim: 130 578 
metadata(0):
assays(1): counts
rownames(130): Actinomycetaceae Aerococcus ... Xanthomonadaceae
  Yersinia et rel.
rowData names(3): Phylum Family Genus
colnames(578): Sample-1 Sample-2 ... Sample-577 Sample-578
colData names(12): age sex ... bmi region
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL
class: TreeSummarizedExperiment 
dim: 130 181 
metadata(0):
assays(1): counts
rownames(130): Actinomycetaceae Aerococcus ... Xanthomonadaceae
  Yersinia et rel.
rowData names(3): Phylum Family Genus
colnames(181): Sample-1 Sample-2 ... Sample-180 Sample-181
colData names(12): age sex ... bmi region
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL

5.2 Run secom functions

5.3 Visualizations

Pearson correlation with thresholding

corr_linear = res_linear$corr_th
cooccur_linear = res_linear$mat_cooccur

# Filter by co-occurrence
overlap = 10
corr_linear[cooccur_linear < overlap] = 0

df_linear = data.frame(get_upper_tri(corr_linear)) %>%
  rownames_to_column("var1") %>%
  pivot_longer(cols = -var1, names_to = "var2", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(var2 = gsub("\\...", " - ", var2),
         value = round(value, 2))

tax_name = sort(union(df_linear$var1, df_linear$var2))
df_linear$var1 = factor(df_linear$var1, levels = tax_name)
df_linear$var2 = factor(df_linear$var2, levels = tax_name)
txt_color = ifelse(grepl("CE", tax_name), "#1B9E77", "#D95F02")

heat_linear_th = df_linear %>%
  ggplot(aes(var2, var1, fill = value)) +
  geom_tile(color = "black") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       na.value = "grey", midpoint = 0, limit = c(-1,1), 
                       space = "Lab", name = NULL) +
  scale_x_discrete(drop = FALSE) +
  scale_y_discrete(drop = FALSE) +
  geom_text(aes(var2, var1, label = value), color = "black", size = 4) +
  labs(x = NULL, y = NULL, title = "Pearson (Thresholding)") +
  theme_bw() +
  geom_vline(xintercept = 6.5, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = 6.5, color = "blue", linetype = "dashed") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1, 
                                   face = "italic", color = txt_color),
        axis.text.y = element_text(size = 12, face = "italic", 
                                   color = txt_color),
        strip.text.x = element_text(size = 14),
        strip.text.y = element_text(size = 14),
        legend.text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 15),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none") +
  coord_fixed()

heat_linear_th

Pearson correlation with p-value filtering

corr_linear = res_linear$corr_th
cooccur_linear = res_linear$mat_cooccur

# Filter by co-occurrence
overlap = 10
corr_linear[cooccur_linear < overlap] = 0

df_linear = data.frame(get_upper_tri(corr_linear)) %>%
  rownames_to_column("var1") %>%
  pivot_longer(cols = -var1, names_to = "var2", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(var2 = gsub("\\...", " - ", var2),
         value = round(value, 2))

tax_name = sort(union(df_linear$var1, df_linear$var2))
df_linear$var1 = factor(df_linear$var1, levels = tax_name)
df_linear$var2 = factor(df_linear$var2, levels = tax_name)
txt_color = ifelse(grepl("CE", tax_name), "#1B9E77", "#D95F02")

heat_linear_fl = df_linear %>%
  ggplot(aes(var2, var1, fill = value)) +
  geom_tile(color = "black") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       na.value = "grey", midpoint = 0, limit = c(-1,1), 
                       space = "Lab", name = NULL) +
  scale_x_discrete(drop = FALSE) +
  scale_y_discrete(drop = FALSE) +
  geom_text(aes(var2, var1, label = value), color = "black", size = 4) +
  labs(x = NULL, y = NULL, title = "Pearson (Filtering)") +
  theme_bw() +
  geom_vline(xintercept = 6.5, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = 6.5, color = "blue", linetype = "dashed") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1, 
                                   face = "italic", color = txt_color),
        axis.text.y = element_text(size = 12, face = "italic", 
                                   color = txt_color),
        strip.text.x = element_text(size = 14),
        strip.text.y = element_text(size = 14),
        legend.text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 15),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none") +
  coord_fixed()

heat_linear_fl

Distance correlation with p-value filtering

corr_dist = res_dist$dcorr_fl
cooccur_dist = res_dist$mat_cooccur

# Filter by co-occurrence
overlap = 10
corr_dist[cooccur_dist < overlap] = 0

df_dist = data.frame(get_upper_tri(corr_dist)) %>%
  rownames_to_column("var1") %>%
  pivot_longer(cols = -var1, names_to = "var2", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(var2 = gsub("\\...", " - ", var2),
         value = round(value, 2))

tax_name = sort(union(df_dist$var1, df_dist$var2))
df_dist$var1 = factor(df_dist$var1, levels = tax_name)
df_dist$var2 = factor(df_dist$var2, levels = tax_name)
txt_color = ifelse(grepl("CE", tax_name), "#1B9E77", "#D95F02")

heat_dist_fl = df_dist %>%
  ggplot(aes(var2, var1, fill = value)) +
  geom_tile(color = "black") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       na.value = "grey", midpoint = 0, limit = c(-1,1), 
                       space = "Lab", name = NULL) +
  scale_x_discrete(drop = FALSE) +
  scale_y_discrete(drop = FALSE) +
  geom_text(aes(var2, var1, label = value), color = "black", size = 4) +
  labs(x = NULL, y = NULL, title = "Distance (Filtering)") +
  theme_bw() +
  geom_vline(xintercept = 6.5, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = 6.5, color = "blue", linetype = "dashed") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1, 
                                   face = "italic", color = txt_color),
        axis.text.y = element_text(size = 12, face = "italic", 
                                   color = txt_color),
        strip.text.x = element_text(size = 14),
        strip.text.y = element_text(size = 14),
        legend.text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 15),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none") +
  coord_fixed()

heat_dist_fl

Session information

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] caret_6.0-93    lattice_0.20-45 DT_0.26         forcats_0.5.2  
 [5] stringr_1.5.0   dplyr_1.0.10    purrr_0.3.5     readr_2.1.3    
 [9] tidyr_1.2.1     tibble_3.1.8    ggplot2_3.4.0   tidyverse_1.3.2
[13] ANCOMBC_2.0.2  

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                     tidyselect_1.2.0              
  [3] lme4_1.1-31                    RSQLite_2.2.19                
  [5] htmlwidgets_1.6.0              grid_4.2.2                    
  [7] BiocParallel_1.32.4            gmp_0.6-9                     
  [9] pROC_1.18.0                    munsell_0.5.0                 
 [11] ScaledMatrix_1.6.0             codetools_0.2-18              
 [13] interp_1.1-3                   future_1.30.0                 
 [15] withr_2.5.0                    colorspace_2.0-3              
 [17] Biobase_2.58.0                 energy_1.7-10                 
 [19] highr_0.9                      knitr_1.41                    
 [21] rstudioapi_0.14                stats4_4.2.2                  
 [23] SingleCellExperiment_1.20.0    DescTools_0.99.47             
 [25] listenv_0.9.0                  labeling_0.4.2                
 [27] MatrixGenerics_1.10.0          Rdpack_2.4                    
 [29] emmeans_1.8.3                  GenomeInfoDbData_1.2.9        
 [31] farver_2.1.1                   bit64_4.0.5                   
 [33] parallelly_1.33.0              coda_0.19-4                   
 [35] vctrs_0.5.1                    treeio_1.22.0                 
 [37] generics_0.1.3                 TH.data_1.1-1                 
 [39] ipred_0.9-13                   xfun_0.35                     
 [41] timechange_0.1.1               R6_2.5.1                      
 [43] doParallel_1.0.17              GenomeInfoDb_1.34.4           
 [45] ggbeeswarm_0.7.1               rsvd_1.0.5                    
 [47] bitops_1.0-7                   cachem_1.0.6                  
 [49] DelayedArray_0.24.0            assertthat_0.2.1              
 [51] scales_1.2.1                   multcomp_1.4-20               
 [53] nnet_7.3-18                    googlesheets4_1.0.1           
 [55] beeswarm_0.4.0                 rootSolve_1.8.2.3             
 [57] gtable_0.3.1                   beachmat_2.14.0               
 [59] globals_0.16.2                 lmom_2.9                      
 [61] sandwich_3.0-2                 timeDate_4021.107             
 [63] rlang_1.0.6                    splines_4.2.2                 
 [65] lazyeval_0.2.2                 ModelMetrics_1.2.2.2          
 [67] gargle_1.2.1                   broom_1.0.2                   
 [69] checkmate_2.1.0                modelr_0.1.10                 
 [71] yaml_2.3.6                     reshape2_1.4.4                
 [73] crosstalk_1.2.0                backports_1.4.1               
 [75] Hmisc_4.7-2                    lava_1.7.0                    
 [77] tools_4.2.2                    ellipsis_0.3.2                
 [79] decontam_1.18.0                jquerylib_0.1.4               
 [81] RColorBrewer_1.1-3             proxy_0.4-27                  
 [83] BiocGenerics_0.44.0            MultiAssayExperiment_1.24.0   
 [85] Rcpp_1.0.9                     plyr_1.8.8                    
 [87] base64enc_0.1-3                sparseMatrixStats_1.10.0      
 [89] zlibbioc_1.44.0                RCurl_1.98-1.9                
 [91] rpart_4.1.19                   TreeSummarizedExperiment_2.6.0
 [93] deldir_1.0-6                   viridis_0.6.2                 
 [95] S4Vectors_0.36.1               zoo_1.8-11                    
 [97] SummarizedExperiment_1.28.0    haven_2.5.1                   
 [99] ggrepel_0.9.2                  cluster_2.1.4                 
[101] fs_1.5.2                       DECIPHER_2.26.0               
[103] magrittr_2.0.3                 data.table_1.14.6             
[105] lmerTest_3.1-3                 reprex_2.0.2                  
[107] googledrive_2.0.0              mvtnorm_1.1-3                 
[109] matrixStats_0.63.0             gsl_2.1-7.1                   
[111] hms_1.1.2                      evaluate_0.19                 
[113] xtable_1.8-4                   jpeg_0.1-10                   
[115] readxl_1.4.1                   IRanges_2.32.0                
[117] gridExtra_2.3                  compiler_4.2.2                
[119] scater_1.26.1                  crayon_1.5.2                  
[121] minqa_1.2.5                    htmltools_0.5.4               
[123] tzdb_0.3.0                     mgcv_1.8-41                   
[125] Formula_1.2-4                  expm_0.999-6                  
[127] Exact_3.2                      lubridate_1.9.0               
[129] DBI_1.1.3                      dbplyr_2.2.1                  
[131] MASS_7.3-58.1                  boot_1.3-28.1                 
[133] Matrix_1.5-3                   permute_0.9-7                 
[135] cli_3.4.1                      rbibutils_2.2.11              
[137] gower_1.0.0                    parallel_4.2.2                
[139] GenomicRanges_1.50.2           pkgconfig_2.0.3               
[141] numDeriv_2016.8-1.1            foreign_0.8-84                
[143] recipes_1.0.3                  scuttle_1.8.3                 
[145] xml2_1.3.3                     foreach_1.5.2                 
[147] hardhat_1.2.0                  vipor_0.4.5                   
[149] bslib_0.4.2                    DirichletMultinomial_1.40.0   
[151] rngtools_1.5.2                 XVector_0.38.0                
[153] prodlim_2019.11.13             estimability_1.4.1            
[155] mia_1.6.0                      rvest_1.0.3                   
[157] CVXR_1.0-11                    doRNG_1.8.2                   
[159] yulab.utils_0.0.5              digest_0.6.31                 
[161] vegan_2.6-4                    Biostrings_2.66.0             
[163] rmarkdown_2.19                 cellranger_1.1.0              
[165] tidytree_0.4.2                 htmlTable_2.4.1               
[167] gld_2.6.6                      DelayedMatrixStats_1.20.0     
[169] nloptr_2.0.3                   lifecycle_1.0.3               
[171] nlme_3.1-161                   jsonlite_1.8.4                
[173] BiocNeighbors_1.16.0           viridisLite_0.4.1             
[175] fansi_1.0.3                    pillar_1.8.1                  
[177] fastmap_1.1.0                  httr_1.4.4                    
[179] survival_3.4-0                 glue_1.6.2                    
[181] png_0.1-8                      iterators_1.0.14              
[183] bit_4.0.5                      class_7.3-20                  
[185] stringi_1.7.8                  sass_0.4.4                    
[187] blob_1.2.3                     BiocSingular_1.14.0           
[189] latticeExtra_0.6-30            memoise_2.0.1                 
[191] Rmpfr_0.8-9                    future.apply_1.10.0           
[193] irlba_2.3.5.1                  e1071_1.7-12                  
[195] ape_5.6-2                     

References

Lahti, Leo, Jarkko Salojärvi, Anne Salonen, Marten Scheffer, and Willem M De Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 5 (1): 1–10.

Lahti, Leo, Sudarshan Shetty, T Blake, J Salojarvi, and others. 2017. “Tools for Microbiome Analysis in R.” Version 1: 10013.

Lin, Huang, Merete Eggesbø, and Shyamal Das Peddada. 2022. “Linear and Nonlinear Correlation Estimators Unveil Undescribed Taxa Interactions in Microbiome Data.” Nature Communications 13 (1): 1–16.

McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PloS One 8 (4): e61217.