This workflow identifies enterotypes from the relative abundance of gut microbiota, following the approach reported by Arumugam et al.[^2]
First, the relative abundance dataset was retrieved from the enterotypes website. Missing values were imputed with the k-nearest-neighbors (KNN) method from the `impute` package.
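library(data.table)  # provides setDT()
# Download the MetaHIT reference abundance table.
dat <- read.delim('http://enterotypes.org/ref_samples_abundance_MetaHIT.txt')
# Impute missing values with KNN, then add a pseudocount of 0.001.
dat <- impute::impute.knn(as.matrix(dat), k = 100)
dat <- as.data.frame(dat$data + 0.001)
# Convert to a data.table, keeping the row names as the first column ('rn').
setDT(dat, keep.rownames = TRUE)
dat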
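To make the imputation step more concrete, here is a minimal base-R sketch of the idea behind KNN imputation. It is a simplified illustration, not the algorithm that `impute::impute.knn` actually runs. The helper `knn_impute_sketch` and the toy matrix are hypothetical. Each missing entry is replaced by the mean of that column over the `k` complete rows closest to the incomplete row, with distance measured on the columns that the incomplete row has observed.

```r
# Hypothetical helper: a simplified illustration of KNN imputation.
# Rows are features, columns are samples; at least one complete row is assumed.
knn_impute_sketch <- function(m, k = 2) {
  out <- m
  complete <- which(stats::complete.cases(m))  # candidate neighbour rows
  for (i in which(!stats::complete.cases(m))) {
    obs <- !is.na(m[i, ])                      # columns observed in row i
    # Euclidean distance to every complete row, using observed columns only
    d <- sqrt(rowSums(sweep(m[complete, obs, drop = FALSE], 2, m[i, obs])^2))
    nn <- complete[order(d)[seq_len(min(k, length(d)))]]
    # Fill each gap with the neighbours' mean in that column
    out[i, !obs] <- colMeans(m[nn, !obs, drop = FALSE])
  }
  out
}

# Toy matrix with made-up values; g4 has one missing entry.
toy <- rbind(
  g1 = c(0.10, 0.20, 0.30),
  g2 = c(0.12, 0.22, 0.32),
  g3 = c(0.90, 0.80, 0.70),
  g4 = c(0.11, NA,   0.31)
)
imputed <- knn_impute_sketch(toy, k = 2)
imputed["g4", ]
```

In this toy case the two nearest rows to `g4` are `g1` and `g2`, so the gap is filled with the mean of their values in the second column.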
Construct the `bSet` object and then estimate the appropriate number of clusters with the `estimate_k` function. `estimate_k` clusters the samples using the Jensen-Shannon divergence, and the number of clusters is optimized using the Calinski-Harabasz (CH) index and the silhouette coefficient. `estimate_k` returns a `verCHI` object, an S3 class containing the optimal clustering result, the optimal number of clusters, the maximum CH index, the silhouette value, and the Jensen-Shannon divergence matrix.
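For reference, the Jensen-Shannon divergence between two abundance profiles p and q is JSD(p, q) = KL(p || m)/2 + KL(q || m)/2 with m = (p + q)/2, and its square root is commonly used as the distance between samples. The base-R sketch below illustrates the computation on a toy matrix. The helpers `kld`, `jsd`, and `jsd_dist` are hypothetical names introduced here and are not part of mbOmic, whose internal implementation may differ.

```r
# Kullback-Leibler divergence (natural log); zero entries of p contribute 0.
kld <- function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))

# Jensen-Shannon divergence between two probability vectors.
jsd <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kld(p, m) + 0.5 * kld(q, m)
}

# Pairwise distance (square root of JSD) between the columns of x.
jsd_dist <- function(x) {
  x <- sweep(x, 2, colSums(x), "/")   # normalise each sample to sum to 1
  n <- ncol(x)
  d <- matrix(0, n, n, dimnames = list(colnames(x), colnames(x)))
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] <- sqrt(max(jsd(x[, i], x[, j]), 0))  # guard against tiny negative rounding
  }
  as.dist(d)
}

# Toy profiles with made-up values: rows are taxa, columns are samples.
toy <- cbind(s1 = c(0.7, 0.2, 0.1),
             s2 = c(0.6, 0.3, 0.1),
             s3 = c(0.1, 0.1, 0.8))
jsd_dist(toy)
```

With the natural logarithm the divergence lies between 0 (identical profiles) and log(2) (profiles with no taxa in common), so `s1` and `s2` end up much closer to each other than either is to `s3`.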
dat <- bSet(b = dat)
res <- estimate_k(dat)
res
#> optimal number of cluster: 4
#> Max CHI: 164.642158008611
#> Silhouette: 0.181445495999067
The optimal number of clusters is 4.
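The selection of the cluster number can be illustrated by scanning candidate values of k and keeping the one with the best score. The sketch below is only an illustration under stated assumptions: it uses `pam` from the recommended `cluster` package on made-up two-dimensional data with a Euclidean distance, and it scores each k by the average silhouette width alone. The CH index is left out to keep the example free of extra dependencies, so this is not a reproduction of what `estimate_k` does internally.

```r
library(cluster)  # pam(); a recommended package shipped with R

set.seed(1)
# Toy data: three well-separated groups of 20 points in two dimensions.
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(20, centres[g, 1], 0.3), rnorm(20, centres[g, 2], 0.3))
}))
d <- dist(x)  # in the enterotyping workflow this would be a JSD-based distance

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- pam(d, k = k, diss = TRUE)
  fit$silinfo$avg.width
})
names(avg_sil) <- ks

# Keep the k with the highest average silhouette width.
best_k <- ks[which.max(avg_sil)]
best_k
```

On real abundance data the scores are usually far less clear-cut than in this toy case, which is one reason to look at more than one index before settling on k.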
Next, the `enterotyping` function identifies the enterotype of each cluster. It returns a list of length three: a table mapping each enterotype to its cluster (`enterotypes`), a table assigning each sample to an enterotype (`data`), and a vector of unidentified samples (`UnIdentifiedSamples`). Clusters 2, 3, and 4 correspond to the Bacteroides, Prevotella, and Ruminococcus enterotypes, respectively.
ret <- enterotyping(dat, res$verOptCluster)
ret
#> $enterotypes
#>      Enterotype        max which   cluster
#> 1:  Bacteroides 0.36724946     2 cluster 2
#> 2:   Prevotella 0.29692944     3 cluster 3
#> 3: Ruminococcus 0.02416713     4 cluster 4
#>
#> $data
#>      Samples   Enterotype   cluster
#>   1:  MH0087  Bacteroides cluster 2
#>   2:  MH0156  Bacteroides cluster 2
#>   3:  MH0444  Bacteroides cluster 2
#>   4:  MH0333  Bacteroides cluster 2
#>   5:  MH0233  Bacteroides cluster 2
#>  ---
#> 234:  MH0012 Ruminococcus cluster 4
#> 235:  MH0415 Ruminococcus cluster 4
#> 236:  MH0457 Ruminococcus cluster 4
#> 237:  MH0442 Ruminococcus cluster 4
#> 238:  MH0448 Ruminococcus cluster 4
#>
#> $UnIdentifiedSamples
#> [1] "MH0277" "MH0161" "MH0046" "MH0175" "MH0152" "MH0104" "MH0151" "MH0189"
#> [9] "MH0030" "MH0157" "MH0063" "MH0075" "MH0141" "MH0169" "MH0050" "MH0286"
#> [17] "MH0096" "MH0053" "MH0217" "MH0098" "MH0009" "MH0197" "MH0065" "MH0173"
#> [25] "MH0168" "MH0070" "MH0077" "MH0288" "MH0200" "MH0031" "MH0183" "MH0132"
#> [33] "MH0144" "MH0124" "MH0430" "MH0276" "MH0407" "MH0428" "MH0126" "MH0447"
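The `max` and `which` columns in the `$enterotypes` table suggest that each marker genus is matched to the cluster in which it is most abundant. The sketch below shows one plausible way to do this on made-up data. It is an assumption about the general idea, not a description of how `enterotyping` is implemented. The objects `abund`, `cluster`, `cluster_means`, and `labels` are hypothetical.

```r
# Toy relative abundances: rows are genera, columns are samples (made-up values).
abund <- rbind(
  Bacteroides  = c(0.40, 0.35, 0.05, 0.06, 0.10, 0.12),
  Prevotella   = c(0.02, 0.03, 0.30, 0.28, 0.01, 0.02),
  Ruminococcus = c(0.01, 0.01, 0.01, 0.02, 0.04, 0.05)
)
colnames(abund) <- paste0("S", 1:6)
cluster <- c(1, 1, 2, 2, 3, 3)  # hypothetical cluster label per sample

# Mean abundance of each genus within each cluster (genera x clusters).
cluster_ids <- sort(unique(cluster))
cluster_means <- sapply(cluster_ids, function(cl) {
  rowMeans(abund[, cluster == cl, drop = FALSE])
})
colnames(cluster_means) <- cluster_ids

# For each marker genus, the cluster where it is most abundant on average.
labels <- data.frame(
  Enterotype = rownames(cluster_means),
  max        = apply(cluster_means, 1, max),
  which      = as.integer(colnames(cluster_means)[apply(cluster_means, 1, which.max)])
)
labels
```

A cluster that is never selected by any marker genus would receive no label under this scheme, which is consistent with some samples being reported as unidentified.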
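Finally, the clustering was validated against the enterotype assignments published on the enterotypes website. In the cross-tabulation, rows are the clusters estimated above and columns are the reference enterotype labels.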
enterotypes <- read.table(system.file('extdata', 'enterotype.txt', package = 'mbOmic'))
enterotypes <- enterotypes[samples(dat),]
table(res$verOptCluster, enterotypes$ET)
#>
#>     ET_B ET_F ET_P
#>   1    0   21   19
#>   2   67    5    0
#>   3    0    0   40
#>   4    3  123    0
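The agreement in this table can be summarised with a single number. The sketch below re-enters the counts shown above and computes cluster purity, that is, the fraction of samples in each cluster that carry the cluster's most common reference label. Purity is a summary chosen here for illustration and is not computed by the original workflow.

```r
# Cross-tabulation copied from the output above
# (rows: estimated clusters, columns: reference enterotype labels).
tab <- matrix(c( 0,  21, 19,
                67,   5,  0,
                 0,   0, 40,
                 3, 123,  0),
              nrow = 4, byrow = TRUE,
              dimnames = list(cluster = as.character(1:4),
                              ET = c("ET_B", "ET_F", "ET_P")))

# Fraction of each cluster that falls into its most common reference label.
purity_by_cluster <- apply(tab, 1, max) / rowSums(tab)
round(purity_by_cluster, 3)

# Overall purity, with and without cluster 1 (which received no enterotype label).
overall  <- sum(apply(tab, 1, max)) / sum(tab)
assigned <- sum(apply(tab[-1, ], 1, max)) / sum(tab[-1, ])
round(c(overall = overall, assigned_only = assigned), 3)
```

Clusters 2, 3, and 4 each map almost entirely onto a single reference label, while cluster 1 is split between two labels.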
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.0 RC (2022-04-19 r82224)
#> os Ubuntu 20.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2022-04-27
#> pandoc 2.5 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ade4 1.7-19 2022-04-19 [2] CRAN (R 4.2.0)
#> AnnotationDbi 1.58.0 2022-04-27 [2] Bioconductor
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [2] CRAN (R 4.2.0)
#> base64enc 0.1-3 2015-07-28 [2] CRAN (R 4.2.0)
#> Biobase 2.56.0 2022-04-27 [2] Bioconductor
#> BiocGenerics 0.42.0 2022-04-27 [2] Bioconductor
#> Biostrings 2.64.0 2022-04-27 [2] Bioconductor
#> bit 4.0.4 2020-08-04 [2] CRAN (R 4.2.0)
#> bit64 4.0.5 2020-08-30 [2] CRAN (R 4.2.0)
#> bitops 1.0-7 2021-04-24 [2] CRAN (R 4.2.0)
#> blob 1.2.3 2022-04-10 [2] CRAN (R 4.2.0)
#> brio 1.1.3 2021-11-30 [2] CRAN (R 4.2.0)
#> bslib 0.3.1 2021-10-06 [2] CRAN (R 4.2.0)
#> cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
#> callr 3.7.0 2021-04-20 [2] CRAN (R 4.2.0)
#> checkmate 2.1.0 2022-04-21 [2] CRAN (R 4.2.0)
#> class 7.3-20 2022-01-16 [2] CRAN (R 4.2.0)
#> cli 3.3.0 2022-04-25 [2] CRAN (R 4.2.0)
#> cluster 2.1.3 2022-03-28 [2] CRAN (R 4.2.0)
#> clusterSim 0.49-2 2021-01-06 [2] CRAN (R 4.2.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.0)
#> colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.2.0)
#> crayon 1.5.1 2022-03-26 [2] CRAN (R 4.2.0)
#> data.table * 1.14.2 2021-09-27 [2] CRAN (R 4.2.0)
#> DBI 1.1.2 2021-12-20 [2] CRAN (R 4.2.0)
#> desc 1.4.1 2022-03-06 [2] CRAN (R 4.2.0)
#> devtools 2.4.3 2021-11-30 [2] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
#> doParallel 1.0.17 2022-02-07 [2] CRAN (R 4.2.0)
#> dplyr 1.0.8 2022-02-08 [2] CRAN (R 4.2.0)
#> dynamicTreeCut 1.63-1 2016-03-11 [2] CRAN (R 4.2.0)
#> e1071 1.7-9 2021-09-16 [2] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
#> evaluate 0.15 2022-02-18 [2] CRAN (R 4.2.0)
#> extrafont 0.18 2022-04-12 [2] CRAN (R 4.2.0)
#> extrafontdb 1.0 2012-06-11 [2] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.2.0)
#> fastcluster 1.2.3 2021-05-24 [2] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
#> foreach 1.5.2 2022-02-02 [2] CRAN (R 4.2.0)
#> foreign 0.8-82 2022-01-16 [2] CRAN (R 4.2.0)
#> Formula 1.2-4 2020-10-16 [2] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
#> generics 0.1.2 2022-01-31 [2] CRAN (R 4.2.0)
#> GenomeInfoDb 1.32.0 2022-04-27 [2] Bioconductor
#> GenomeInfoDbData 1.2.8 2022-04-21 [2] Bioconductor
#> ggplot2 3.3.5 2021-06-25 [2] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
#> GO.db 3.15.0 2022-04-21 [2] Bioconductor
#> gridExtra 2.3 2017-09-09 [2] CRAN (R 4.2.0)
#> gtable 0.3.0 2019-03-25 [2] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [2] CRAN (R 4.2.0)
#> Hmisc 4.7-0 2022-04-19 [2] CRAN (R 4.2.0)
#> htmlTable 2.4.0 2022-01-04 [2] CRAN (R 4.2.0)
#> htmltools 0.5.2 2021-08-25 [2] CRAN (R 4.2.0)
#> htmlwidgets 1.5.4 2021-09-08 [2] CRAN (R 4.2.0)
#> httr 1.4.2 2020-07-20 [2] CRAN (R 4.2.0)
#> igraph 1.3.1 2022-04-20 [2] CRAN (R 4.2.0)
#> impute 1.70.0 2022-04-27 [2] Bioconductor
#> IRanges 2.30.0 2022-04-27 [2] Bioconductor
#> iterators 1.0.14 2022-02-05 [2] CRAN (R 4.2.0)
#> jpeg 0.1-9 2021-07-24 [2] CRAN (R 4.2.0)
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.0)
#> jsonlite 1.8.0 2022-02-22 [2] CRAN (R 4.2.0)
#> KEGGREST 1.36.0 2022-04-27 [2] Bioconductor
#> knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.0)
#> latticeExtra 0.6-29 2019-12-19 [2] CRAN (R 4.2.0)
#> lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.0)
#> MASS 7.3-57 2022-04-22 [2] CRAN (R 4.2.0)
#> Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.0)
#> matrixStats 0.62.0 2022-04-19 [2] CRAN (R 4.2.0)
#> mbOmic * 1.0.0 2022-04-27 [1] Bioconductor
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.0)
#> mnormt 2.0.2 2020-09-01 [2] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.2.0)
#> nlme 3.1-157 2022-03-25 [2] CRAN (R 4.2.0)
#> nnet 7.3-17 2022-01-16 [2] CRAN (R 4.2.0)
#> pillar 1.7.0 2022-02-01 [2] CRAN (R 4.2.0)
#> pkgbuild 1.3.1 2021-12-20 [2] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
#> pkgdown 2.0.3 2022-04-24 [2] CRAN (R 4.2.0)
#> pkgload 1.2.4 2021-11-30 [2] CRAN (R 4.2.0)
#> png 0.1-7 2013-12-03 [2] CRAN (R 4.2.0)
#> preprocessCore 1.58.0 2022-04-27 [2] Bioconductor
#> prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.2.0)
#> processx 3.5.3 2022-03-25 [2] CRAN (R 4.2.0)
#> proxy 0.4-26 2021-06-07 [2] CRAN (R 4.2.0)
#> ps 1.7.0 2022-04-23 [2] CRAN (R 4.2.0)
#> psych 2.2.3 2022-03-19 [2] CRAN (R 4.2.0)
#> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.2.0)
#> R2HTML 2.3.2 2016-06-23 [2] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
#> RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.2.0)
#> Rcpp 1.0.8.3 2022-03-17 [2] CRAN (R 4.2.0)
#> RCurl 1.98-1.6 2022-02-08 [2] CRAN (R 4.2.0)
#> remotes 2.4.2 2021-11-30 [2] CRAN (R 4.2.0)
#> rgl 0.108.3 2021-11-21 [2] CRAN (R 4.2.0)
#> rlang 1.0.2 2022-03-04 [2] CRAN (R 4.2.0)
#> rmarkdown 2.14 2022-04-25 [2] CRAN (R 4.2.0)
#> rpart 4.1.16 2022-01-24 [2] CRAN (R 4.2.0)
#> rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.2.0)
#> RSQLite 2.2.12 2022-04-02 [2] CRAN (R 4.2.0)
#> rstudioapi 0.13 2020-11-12 [2] CRAN (R 4.2.0)
#> Rttf2pt1 1.3.10 2022-02-07 [2] CRAN (R 4.2.0)
#> S4Vectors 0.34.0 2022-04-27 [2] Bioconductor
#> sass 0.4.1 2022-03-23 [2] CRAN (R 4.2.0)
#> scales 1.2.0 2022-04-13 [2] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
#> stringi 1.7.6 2021-11-29 [2] CRAN (R 4.2.0)
#> stringr 1.4.0 2019-02-10 [2] CRAN (R 4.2.0)
#> survival 3.3-1 2022-03-03 [2] CRAN (R 4.2.0)
#> testthat 3.1.4 2022-04-26 [2] CRAN (R 4.2.0)
#> tibble 3.1.6 2021-11-07 [2] CRAN (R 4.2.0)
#> tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.2.0)
#> tmvnsim 1.0-2 2016-12-15 [2] CRAN (R 4.2.0)
#> usethis 2.1.5 2021-12-09 [2] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.2.0)
#> visNetwork 2.1.0 2021-09-29 [2] CRAN (R 4.2.0)
#> WGCNA 1.71 2022-04-22 [2] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
#> xfun 0.30 2022-03-02 [2] CRAN (R 4.2.0)
#> XVector 0.36.0 2022-04-27 [2] Bioconductor
#> yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
#> zlibbioc 1.42.0 2022-04-27 [2] Bioconductor
#>
#> [1] /tmp/RtmpvToPOC/Rinst2c681c42aa7dd8
#> [2] /home/biocbuild/bbs-3.15-bioc/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────