1. Introduction

Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) (Lin and Peddada 2020) is a methodology of differential abundance (DA) analysis for microbial absolute abundance data. ANCOM-BC estimates the unknown sampling fractions, corrects the bias induced by their differences among samples, and identifies taxa that are differentially abundant according to the covariate of interest.

For more details, please refer to the ANCOM-BC paper.

2. Installation

Download package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ANCOMBC")

Load the package.

library(ANCOMBC)

3. Running ANCOMBC

3.1 Intestinal microbiota profiling data

The HITChip Atlas data set (Lahti et al. 2014) is available via the microbiome R package (Lahti et al. 2017) in phyloseq (McMurdie and Holmes 2013) format, and via Data Dryad in tabular format. This data set comes with 130 genus-like taxonomic groups across 1006 western adults with no reported health complications. Some subjects have also short time series.

data(atlas1006) 
# Subset to baseline
pseq = subset_samples(atlas1006, time == 0)
# Re-code the bmi group
sample_data(pseq)$bmi_group = recode(sample_data(pseq)$bmi_group,
                                     underweight = "lean",
                                     lean = "lean",
                                     overweight = "overweight",
                                     obese = "obese",
                                     severeobese = "obese",
                                     morbidobese = "obese")
# Re-code the nationality group
sample_data(pseq)$nation = recode(sample_data(pseq)$nationality,
                                  Scandinavia = "NE",
                                  UKIE = "NE",
                                  SouthEurope = "SE",
                                  CentralEurope = "CE",
                                  EasternEurope = "EE")

# Aggregate to phylum level
phylum_data = aggregate_taxa(pseq, "Phylum")

3.2 Data summary

options(qwraps2_markup = "markdown")
summary_template =
  list("Age" =
         list("min" = ~ min(age, na.rm = TRUE),
              "max" = ~ max(age, na.rm = TRUE),
              "mean (sd)" = ~ mean_sd(age, na_rm = TRUE, 
                                      show_n = "never")),
       "Gender" =
         list("F" = ~ n_perc0(sex == "female", na_rm = TRUE),
              "M" = ~ n_perc0(sex == "male", na_rm = TRUE),
              "NA" = ~ n_perc0(is.na(sex))),
       "Nationality" =
         list("Central Europe" = ~ n_perc0(nation == "CE", na_rm = TRUE),
              "Eastern Europe" = ~ n_perc0(nation == "EE", na_rm = TRUE),
              "Northern Europe" = ~ n_perc0(nation == "NE", na_rm = TRUE),
              "Southern Europe" = ~ n_perc0(nation == "SE", na_rm = TRUE),
              "US" = ~ n_perc0(nation == "US", na_rm = TRUE),
              "NA" = ~ n_perc0(is.na(nation))),
       "BMI" =
         list("Lean" = ~ n_perc0(bmi_group == "lean", na_rm = TRUE),
              "Overweight" = ~ n_perc0(bmi_group == "overweight", 
                                       na_rm = TRUE),
              "Obese" = ~ n_perc0(bmi_group == "obese", na_rm = TRUE),
              "NA" = ~ n_perc0(is.na(bmi_group)))
       )
data_summary = summary_table(meta(pseq), summary_template)
data_summary

	meta(pseq) (N = 1,006)
Age
min	18
max	77
mean (sd)	44.71 ± 14.44
Gender
F	553 (57)
M	416 (43)
NA	37 (4)
Nationality
Central Europe	617 (63)
Eastern Europe	15 (2)
Northern Europe	212 (22)
Southern Europe	86 (9)
US	44 (5)
NA	32 (3)
BMI
Lean	461 (51)
Overweight	154 (17)
Obese	285 (32)
NA	106 (11)

The number of samples: 1006.
The number of phyla: 8.

3.3 Running ancombc function

out = ancombc(phyloseq = phylum_data, formula = "age + nation + bmi_group", 
              p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000, 
              group = "nation", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5, 
              max_iter = 100, conserve = TRUE, alpha = 0.05, global = TRUE)

res = out$res
res_global = out$res_global

3.4 ANCOMBC primary result

Result from the ANCOM-BC log-linear model to determine taxa that are differentially abundant according to the covariate of interest. It contains: 1) coefficients; 2) standard errors; 3) test statistics; 4) p-values; 5) adjusted p-values; 6) indicators whether the taxon is differentially abundant (TRUE) or not (FALSE).

3.41 Coefficients

tab_coef = res$beta
col_name = c("Age", "EE - CE", "NE - CE", "SE - CE", "US - CE", 
             "Oerweight - Lean", "Obese - Lean")
colnames(tab_coef) = col_name
tab_coef %>% datatable(caption = "Coefficients from the Primary Result") %>%
      formatRound(col_name, digits = 2)

3.42 SEs

tab_se = res$se
colnames(tab_se) = col_name
tab_se %>% datatable(caption = "SEs from the Primary Result") %>%
      formatRound(col_name, digits = 2)

3.43 Test statistics

tab_w = res$W
colnames(tab_w) = col_name
tab_w %>% datatable(caption = "Test Statistics from the Primary Result") %>%
      formatRound(col_name, digits = 2)

3.44 P-values

tab_p = res$p_val
colnames(tab_p) = col_name
tab_p %>% datatable(caption = "P-values from the Primary Result") %>%
      formatRound(col_name, digits = 2)

3.45 Adjusted p-values

tab_q = res$q
colnames(tab_q) = col_name
tab_q %>% datatable(caption = "Adjusted p-values from the Primary Result") %>%
      formatRound(col_name, digits = 2)

3.46 Differentially abundant taxa

tab_diff = res$diff_abn
colnames(tab_diff) = col_name
tab_diff %>% 
  datatable(caption = "Differentially Abundant Taxa 
            from the Primary Result")

3.47 Bias-adjusted abundances

Step 1: estimate sample-specific sampling fractions (log scale). Note that for each sample, if it contains missing values for any variable specified in the formula, the corresponding sampling fraction estimate for this sample will return NA since the sampling fraction is not estimable with the presence of missing values.

Step 2: adjust the log observed abundances by subtracting the estimated sampling fraction from log observed abundances of each sample.

samp_frac = out$samp_frac
# Replace NA with 0
samp_frac[is.na(samp_frac)] = 0 

# Add pesudo-count (1) to avoid taking the log of 0
log_obs_abn = log(abundances(phylum_data) + 1) 
# Adjust the log observed abundances
log_obs_abn_adj = t(t(log_obs_abn) - samp_frac)
# Show the first 6 samples
round(log_obs_abn_adj[, 1:6], 2) %>% 
  datatable(caption = "Bias-adjusted log observed abundances")

3.48 Visualizations for “age”

“Age” is a continuous variable.

df_fig1 = data.frame(res$beta * res$diff_abn, check.names = FALSE) %>% 
  rownames_to_column("taxon_id")
df_fig2 = data.frame(res$se * res$diff_abn, check.names = FALSE) %>% 
  rownames_to_column("taxon_id")
colnames(df_fig2)[-1] = paste0(colnames(df_fig2)[-1], "SD")
df_fig = df_fig1 %>% left_join(df_fig2, by = "taxon_id") %>%
  transmute(taxon_id, age, ageSD) %>%
  filter(age != 0) %>% arrange(desc(age)) %>%
  mutate(group = ifelse(age > 0, "g1", "g2"))
df_fig$taxon_id = factor(df_fig$taxon_id, levels = df_fig$taxon_id)
  
p = ggplot(data = df_fig, 
           aes(x = taxon_id, y = age, fill = group, color = group)) + 
  geom_bar(stat = "identity", width = 0.7, 
           position = position_dodge(width = 0.4)) +
  geom_errorbar(aes(ymin = age - ageSD, ymax = age + ageSD), width = 0.2,
                position = position_dodge(0.05), color = "black") + 
  labs(x = NULL, y = "Log fold change", 
       title = "Waterfall Plot for the Age Effect") + 
  theme_bw() + 
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1))
p

3.5 ANCOMBC global test result

Result from the ANCOM-BC global test to determine taxa that are differentially abundant between three or more groups of multiple samples. It contains: 1) test statistics; 2) p-values; 3) adjusted p-values; 4) indicators whether the taxon is differentially abundant (TRUE) or not (FALSE).

3.51 Test statistics

tab_w = res_global[, "W", drop = FALSE]
colnames(tab_w) = "Nationality"
tab_w %>% datatable(caption = "Test Statistics 
                    from the Global Test Result") %>%
      formatRound(c("Nationality"), digits = 2)

3.52 P-values

tab_p = res_global[, "p_val", drop = FALSE]
colnames(tab_p) = "Nationality"
tab_p %>% datatable(caption = "P-values 
                    from the Global Test Result") %>%
      formatRound(c("Nationality"), digits = 2)

3.53 Adjusted p-values

tab_q = res_global[, "q_val", drop = FALSE]
colnames(tab_q) = "Nationality"
tab_q %>% datatable(caption = "Adjusted p-values 
                    from the Global Test Result") %>%
      formatRound(c("Nationality"), digits = 2)

3.54 Differentially abundant taxa

tab_diff = res_global[, "diff_abn", drop = FALSE]
colnames(tab_diff) = "Nationality"
tab_diff %>% datatable(caption = "Differentially Abundant Taxa 
                       from the Global Test Result")

Session information

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DT_0.18           ANCOMBC_1.2.2     qwraps2_0.5.2     magrittr_2.0.1   
 [5] microbiome_1.14.0 phyloseq_1.36.0   forcats_0.5.1     stringr_1.4.0    
 [9] dplyr_1.0.7       purrr_0.3.4       readr_2.0.1       tidyr_1.1.3      
[13] tibble_3.1.3      ggplot2_3.3.5     tidyverse_1.3.1  

loaded via a namespace (and not attached):
 [1] Rtsne_0.15             colorspace_2.0-2       ellipsis_0.3.2        
 [4] XVector_0.32.0         fs_1.5.0               rstudioapi_0.13       
 [7] farver_2.1.0           fansi_0.5.0            lubridate_1.7.10      
[10] xml2_1.3.2             codetools_0.2-18       splines_4.1.0         
[13] knitr_1.33             ade4_1.7-17            jsonlite_1.7.2        
[16] nloptr_1.2.2.2         broom_0.7.9            cluster_2.1.2         
[19] dbplyr_2.1.1           compiler_4.1.0         httr_1.4.2            
[22] backports_1.2.1        assertthat_0.2.1       Matrix_1.3-4          
[25] cli_3.0.1              htmltools_0.5.1.1      tools_4.1.0           
[28] igraph_1.2.6           gtable_0.3.0           glue_1.4.2            
[31] GenomeInfoDbData_1.2.6 reshape2_1.4.4         Rcpp_1.0.7            
[34] Biobase_2.52.0         cellranger_1.1.0       jquerylib_0.1.4       
[37] vctrs_0.3.8            Biostrings_2.60.2      rhdf5filters_1.4.0    
[40] multtest_2.48.0        ape_5.5                nlme_3.1-152          
[43] iterators_1.0.13       crosstalk_1.1.1        xfun_0.25             
[46] rbibutils_2.2.3        rvest_1.0.1            lifecycle_1.0.0       
[49] zlibbioc_1.38.0        MASS_7.3-54            scales_1.1.1          
[52] hms_1.1.0              parallel_4.1.0         biomformat_1.20.0     
[55] rhdf5_2.36.0           yaml_2.2.1             sass_0.4.0            
[58] stringi_1.7.3          highr_0.9              S4Vectors_0.30.0      
[61] foreach_1.5.1          permute_0.9-5          BiocGenerics_0.38.0   
[64] GenomeInfoDb_1.28.1    Rdpack_2.1.2           rlang_0.4.11          
[67] pkgconfig_2.0.3        bitops_1.0-7           evaluate_0.14         
[70] lattice_0.20-44        Rhdf5lib_1.14.2        htmlwidgets_1.5.3     
[73] labeling_0.4.2         tidyselect_1.1.1       plyr_1.8.6            
[76] R6_2.5.0               IRanges_2.26.0         generics_0.1.0        
[79] DBI_1.1.1              pillar_1.6.2           haven_2.4.3           
[82] withr_2.4.2            mgcv_1.8-36            survival_3.2-12       
[85] RCurl_1.98-1.3         modelr_0.1.8           crayon_1.4.1          
[88] utf8_1.2.2             tzdb_0.1.2             rmarkdown_2.10        
[91] grid_4.1.0             readxl_1.3.1           data.table_1.14.0     
[94] vegan_2.5-7            reprex_2.0.1           digest_0.6.27         
[97] stats4_4.1.0           munsell_0.5.0          bslib_0.2.5.1

References

Lahti, Leo, Jarkko Salojärvi, Anne Salonen, Marten Scheffer, and Willem M De Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 5 (1): 1–10.

Lahti, Leo, Sudarshan Shetty, T Blake, J Salojarvi, and others. 2017. “Tools for Microbiome Analysis in R.” Version 1: 10013.

Lin, Huang, and Shyamal Das Peddada. 2020. “Analysis of Compositions of Microbiomes with Bias Correction.” Nature Communications 11 (1): 1–11.

McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PloS One 8 (4): e61217.

ANCOM-BC

Huang Lin\(^1\)

\(^1\)Department of Biostatistics, University of Pittsburgh, 130 De Soto Street, Pittsburgh, PA 15261 USA

August 15, 2021