1 Introduction

Mechanism-Aware Imputation (MAI) is a two-step approach to imputing missing data in metabolomics (Dekermanjian et al. 2022). Step 1 uses a random forest classifier to classify each missing value as either Missing Completely At Random/Missing At Random (MCAR/MAR) or Missing Not At Random (MNAR). MCAR and MAR are combined because the two mechanisms are often difficult to distinguish in metabolomics data. Step 2 imputes each missing value with an algorithm appropriate to its predicted missingness mechanism. Imputation algorithms tested and available for MCAR/MAR include Bayesian Principal Component Analysis (BPCA), Multiple Imputation No-Skip K-Nearest Neighbors (Multi_nsKNN), and Random Forest. Imputation algorithms tested and available for MNAR include No-Skip K-Nearest Neighbors (nsKNN) and a single imputation approach for metabolites where left-censoring is present.

2 Installation

MAI is available from Bioconductor and can be installed with the BiocManager package.
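A minimal sketch of the installation, assuming the standard Bioconductor workflow through BiocManager (the package name MAI is taken from the session information at the end of this vignette):

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install MAI from Bioconductor
BiocManager::install("MAI")
```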

3 Using MAI when your data is a data.frame or matrix
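
The call that produces the output in this section can be sketched as follows. This is a reconstruction under assumptions rather than a verbatim copy of the original chunk: the example dataset name untargeted_LCMS_data, the argument names, and the names of the returned list elements (Imputed_data, Estimated_Params) are assumed from the MAI package interface and should be checked against ?MAI. The seed value is arbitrary, so the exact numbers may differ from those printed here.

```r
library(MAI)

# Example data with missing values (dataset name assumed to ship with MAI)
data("untargeted_LCMS_data")

# Missingness is imposed at random while the parameters are estimated,
# so fix a seed for reproducibility (the value itself is arbitrary)
set.seed(137690)

Results <- MAI(
    data_miss = untargeted_LCMS_data,  # matrix or data.frame containing NAs
    MCAR_algorithm = "BPCA",           # used for values predicted MCAR/MAR
    MNAR_algorithm = "Single",         # used for values predicted MNAR
    verbose = TRUE                     # print the progress messages
)

# Top-left 5 x 5 corner of the imputed data
Results[["Imputed_data"]][1:5, 1:5]

# Estimated mixed-missingness parameters (Alpha, Beta, Gamma)
Results[["Estimated_Params"]]
```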

## Estimating pattern of missingness
## Imposing missingness
## Generating features
## Training
## Predicting
## Imputing
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 12.87164 12.70516 12.11793 12.07897 11.18725
## [2,] 12.10813 12.36043 12.08463 12.07897 14.30504
## [3,] 12.10813 11.16815 12.08463 12.07897 11.79195
## [4,] 12.01395 12.28791 12.12017 12.07897 10.05397
## [5,] 11.32825 12.55314 12.23772 11.31292 12.28869
## $Alpha
## [1] 20
## 
## $Beta
## [1] 80
## 
## $Gamma
## [1] 60

The estimated parameters describe the ratio of MCAR/MAR to MNAR in the data. The parameters \(\alpha\) and \(\beta\) separate high, medium, and low average abundance metabolites, while \(\gamma\) is used to impose missingness in the medium and low abundance metabolites. A smaller \(\alpha\) corresponds to more MCAR/MAR being present, while larger \(\beta\) and \(\gamma\) values imply more MNAR being present.

The estimated parameters are then used to impose missingness with a known mechanism on the complete subset of the input data, and a random forest classifier is trained on that subset to classify the imposed missingness. Once trained, the classifier is applied to the full input data to predict the mechanism behind each originally missing value. Finally, the missing values are imputed with the algorithm the user chose for the predicted missingness mechanism.
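
To make the role of the three parameters concrete, the sketch below imposes labelled missingness on a simulated complete matrix using base R only. It is an illustration under stated assumptions, not MAI's implementation: the helper impose_missingness, its argument mcar_rate, the reading of \(\alpha\) and \(\beta\) as percentile cut-offs on average abundance, and the way \(\gamma\) sets a per-metabolite censoring percentile (halved for medium abundance metabolites) are all hypothetical.

```r
set.seed(42)

# Hypothetical complete data: 100 metabolites (rows) x 20 samples (columns)
X <- matrix(rnorm(100 * 20, mean = 12, sd = 1), nrow = 100)

# Hypothetical helper, not part of the MAI API
impose_missingness <- function(X, alpha, beta, gamma, mcar_rate = 0.05) {
    avg <- rowMeans(X)
    # Assumption: alpha and beta are percentile cut-offs on average abundance
    cuts <- quantile(avg, probs = c(alpha, beta) / 100)
    group <- cut(avg, breaks = c(-Inf, cuts, Inf),
                 labels = c("low", "medium", "high"))

    label <- matrix("observed", nrow(X), ncol(X))
    for (i in which(group != "high")) {
        # Assumption: gamma sets a per-metabolite percentile below which
        # values are censored; medium abundance rows use half of it
        p <- if (group[i] == "low") gamma / 100 else gamma / 200
        label[i, X[i, ] < quantile(X[i, ], probs = p)] <- "MNAR"
    }

    # MCAR: remove a small random fraction of the still-observed values
    observed <- which(label == "observed")
    label[sample(observed, floor(mcar_rate * length(observed)))] <- "MCAR"

    X_miss <- X
    X_miss[label != "observed"] <- NA
    list(data = X_miss, label = label)
}

sim <- impose_missingness(X, alpha = 20, beta = 80, gamma = 60)

# Number of cells per known mechanism
table(sim$label)
```

Because the mechanism behind every imposed missing value is known in sim$label, a classifier can be trained on features derived from sim$data and then applied to values whose mechanism is unknown, which mirrors the training and prediction steps described above.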

4 Using MAI when your data is a SummarizedExperiment (SE) class
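
The same workflow on a SummarizedExperiment object can be sketched as follows. As before, this is a reconstruction under assumptions: the dataset name untargeted_LCMS_data, the argument names (including assay_ix for selecting the assay), and the location of the estimated parameters in the returned object's metadata are assumed and should be checked against ?MAI. The seed value is arbitrary.

```r
library(MAI)
library(SummarizedExperiment)

# Example data with missing values (dataset name assumed to ship with MAI)
data("untargeted_LCMS_data")

# Wrap the data in a SummarizedExperiment with a single assay
se <- SummarizedExperiment(assays = list(as.matrix(untargeted_LCMS_data)))

# Fix a seed for reproducibility (the value itself is arbitrary)
set.seed(137690)

Results <- MAI(
    data_miss = se,
    MCAR_algorithm = "BPCA",    # used for values predicted MCAR/MAR
    MNAR_algorithm = "Single",  # used for values predicted MNAR
    assay_ix = 1,               # index of the assay to impute (assumed name)
    verbose = TRUE
)

# Top-left 5 x 5 corner of the imputed assay
assay(Results)[1:5, 1:5]

# The estimated parameters are assumed to be stored in the metadata;
# str() shows where, without relying on a specific element name
str(metadata(Results))
```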

## Estimating pattern of missingness
## Imposing missingness
## Generating features
## Training
## Predicting
## Imputing
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 12.87164 12.70516 12.11793 12.07897 11.18725
## [2,] 12.10813 12.36043 12.08463 12.07897 14.30504
## [3,] 12.10813 11.16815 12.08463 12.07897 11.79195
## [4,] 12.01395 12.28791 12.12017 12.07897 10.05397
## [5,] 11.32825 12.55314 12.23772 11.31292 12.28869
## $Alpha
## [1] 20
## 
## $Beta
## [1] 80
## 
## $Gamma
## [1] 60

5 Session Information

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.28.0 Biobase_2.58.0             
##  [3] GenomicRanges_1.50.0        GenomeInfoDb_1.34.0        
##  [5] IRanges_2.32.0              S4Vectors_0.36.0           
##  [7] BiocGenerics_0.44.0         MatrixGenerics_1.10.0      
##  [9] matrixStats_0.62.0          caret_6.0-93               
## [11] lattice_0.20-45             ggplot2_3.3.6              
## [13] MAI_1.4.0                   BiocStyle_2.26.0           
## 
## loaded via a namespace (and not attached):
##   [1] googledrive_2.0.0      colorspace_2.0-3       ellipsis_0.3.2        
##   [4] class_7.3-20           XVector_0.38.0         fs_1.5.2              
##   [7] proxy_0.4-27           listenv_0.8.0          prodlim_2019.11.13    
##  [10] fansi_1.0.3            lubridate_1.8.0        xml2_1.3.3            
##  [13] codetools_0.2-18       splines_4.2.1          doParallel_1.0.17     
##  [16] cachem_1.0.6           knitr_1.40             itertools_0.1-3       
##  [19] jsonlite_1.8.3         pROC_1.18.0            broom_1.0.1           
##  [22] dbplyr_2.2.1           missForest_1.5         BiocManager_1.30.19   
##  [25] readr_2.1.3            compiler_4.2.1         httr_1.4.4            
##  [28] backports_1.4.1        assertthat_0.2.1       Matrix_1.5-1          
##  [31] fastmap_1.1.0          gargle_1.2.1           cli_3.4.1             
##  [34] htmltools_0.5.3        tools_4.2.1            GenomeInfoDbData_1.2.9
##  [37] gtable_0.3.1           glue_1.6.2             reshape2_1.4.4        
##  [40] dplyr_1.0.10           doRNG_1.8.2            Rcpp_1.0.9            
##  [43] cellranger_1.1.0       jquerylib_0.1.4        vctrs_0.5.0           
##  [46] nlme_3.1-160           iterators_1.0.14       timeDate_4021.106     
##  [49] gower_1.0.0            xfun_0.34              stringr_1.4.1         
##  [52] globals_0.16.1         rvest_1.0.3            lifecycle_1.0.3       
##  [55] rngtools_1.5.2         googlesheets4_1.0.1    future_1.28.0         
##  [58] zlibbioc_1.44.0        MASS_7.3-58.1          scales_1.2.1          
##  [61] ipred_0.9-13           pcaMethods_1.90.0      hms_1.1.2             
##  [64] parallel_4.2.1         tidyverse_1.3.2        yaml_2.3.6            
##  [67] sass_0.4.2             rpart_4.1.19           stringi_1.7.8         
##  [70] randomForest_4.7-1.1   foreach_1.5.2          e1071_1.7-12          
##  [73] hardhat_1.2.0          lava_1.7.0             rlang_1.0.6           
##  [76] pkgconfig_2.0.3        bitops_1.0-7           evaluate_0.17         
##  [79] purrr_0.3.5            recipes_1.0.2          tidyselect_1.2.0      
##  [82] parallelly_1.32.1      plyr_1.8.7             magrittr_2.0.3        
##  [85] bookdown_0.29          R6_2.5.1               generics_0.1.3        
##  [88] DelayedArray_0.24.0    DBI_1.1.3              pillar_1.8.1          
##  [91] haven_2.5.1            withr_2.5.0            survival_3.4-0        
##  [94] RCurl_1.98-1.9         nnet_7.3-18            tibble_3.1.8          
##  [97] future.apply_1.9.1     modelr_0.1.9           utf8_1.2.2            
## [100] tzdb_0.3.0             rmarkdown_2.17         grid_4.2.1            
## [103] readxl_1.4.1           data.table_1.14.4      forcats_0.5.2         
## [106] ModelMetrics_1.2.2.2   reprex_2.0.2           digest_0.6.30         
## [109] tidyr_1.2.1            munsell_0.5.0          bslib_0.4.0

6 References

Dekermanjian, J.P., Shaddox, E., Nandy, D. et al. Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics. BMC Bioinformatics 23, 179 (2022). https://doi.org/10.1186/s12859-022-04659-1