omicsGMF is an R package for generalized matrix
factorization and missing value imputation in omics data. It is designed
for dimensionality reduction and visualization, specifically handling
count data and missing values efficiently. Unlike conventional PCA,
omicsGMF does not require log-transformation of RNA-seq
data or prior imputation of proteomics data.
A key advantage of omicsGMF is its ability to control
for known sample- and feature-level covariates, such as batch effects.
This improves downstream analyses like clustering. Additionally,
omicsGMF includes model selection to optimize the number of latent
confounders, ensuring an optimal dimensionality for analysis. Its
stochastic optimization algorithms allow it to remain fast while
handling these complex data structures.
omicsGMF builds on the sgdGMF framework
provided in the sgdGMF CRAN package, and provides easy
integration with SingleCellExperiment,
SummarizedExperiment, and QFeatures classes,
with adapted default values for the optimization arguments when dealing
with omics data.
All details about the sgdGMF framework, such as the
adaptive learning rates, exponential gradient averaging, and subsampling
of the data, are described in our preprint (Castiglione et al. 2024). There, we show the
use of the sgdGMF framework on single-cell RNA-seq data. In
our newest preprint (2025), we show how
omicsGMF can be used to visualize (single-cell) proteomics
data and impute missing values.
This vignette provides a step-by-step workflow for using
omicsGMF for dimensionality reduction of omics data. The
main functions are:
- runCVGMF or calculateCVGMF performs
  cross-validation to determine the optimal number of latent confounders.
  These results can be visualized using plotCV. This
  cross-validation avoids arbitrarily choosing ncomponents,
  but requires some computational time. An alternative is
  calculateRankGMF, which performs an eigenvalue
  decomposition on the deviance residuals. This allows for model selection
  based on a scree plot using plotRank, for example using the
  elbow method.
- runGMF or calculateGMF estimates the
  latent confounders and the rotation matrix, together with the
  parameters of the sample-level and feature-level covariates.
- plotGMF plots the samples using their
  decomposition.
- imputeGMF creates a new assay with missing values
  imputed using the estimates of runGMF.
Here, we apply omicsGMF to proteomics data.
sgdGMF can be installed through CRAN.
omicsGMF can be installed from GitHub, and will soon be
available through Bioconductor.
To perform dimensionality reduction on proteomics data, one can use
the log-transformed intensities, which are approximately Gaussian
distributed. Normalization, such as median normalization, is optional:
it is not required, but it might enhance
numerical stability and convergence speed. For proteomics data,
family = gaussian() should be used in the data analysis,
and missing values should not be imputed prior to the matrix
factorization, as omicsGMF deals with these internally. If the goal is to
impute missing values after matrix decomposition, one should include all
features in the analysis and therefore leave the ntop
argument unset.
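The preprocessing described above can be sketched in base R. This is only an illustration under stated assumptions: the matrix `raw` and its protein and sample names are made up for the example, and the code uses no omicsGMF functions. It log2-transforms the intensities, subtracts each sample's median, and leaves the missing values untouched.

```r
# Hypothetical raw intensities (rows = proteins, columns = samples).
set.seed(1)
raw <- matrix(2^rnorm(6 * 4, mean = 20, sd = 2), nrow = 6,
              dimnames = list(paste0("P_", 1:6), paste0("S_", 1:4)))
# Introduce three missing values at positions (1,2), (3,4) and (5,1).
raw[cbind(c(1, 3, 5), c(2, 4, 1))] <- NA

# Log2-transform so that the intensities are roughly Gaussian.
logint <- log2(raw)

# Median normalization: subtract each sample's median, ignoring NAs.
sample_medians <- apply(logint, 2, median, na.rm = TRUE)
logint_norm <- sweep(logint, 2, sample_medians, FUN = "-")

# The NA pattern is unchanged: nothing is imputed at this stage.
sum(is.na(logint_norm))
```

A matrix like `logint_norm` would then be stored as an assay and passed on with family = gaussian().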
For the proteomics vignette, we will simulate some artificial
Gaussian data, and introduce some missing values completely at random.
We store the simulated intensities in the logintensities
assay of a SingleCellExperiment. For the sake of exposition, we
also include a batch effect in the colData.
omicsGMF can internally correct for these batch effects,
and therefore does not require prior correction with other tools.
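
# Packages used in this vignette (all are attached in the session info below)
library(SingleCellExperiment)
library(omicsGMF)
library(dplyr)

# Simulate 50 features x 20 samples of Gaussian log-intensities
sim_intensities <- matrix(rnorm(n = 20*50, mean = 1, sd = 1),
                          ncol = 20)
# Set roughly 30% of the values to missing, completely at random
NAs <- rbinom(n = prod(dim(sim_intensities)), size = 1, prob = 0.3) == 1
sim_intensities[NAs] <- NA
colnames(sim_intensities) <- paste0("S_", c(1:20))
rownames(sim_intensities) <- paste0("G_", c(1:50))
example_sce <- SingleCellExperiment(
    assays = SimpleList("logintensities" = sim_intensities),
    colData = data.frame("Batch" = rep(c("Batch1", "Batch2"), each = 10)))
# Design matrix of the known sample-level covariates
X <- model.matrix(~Batch, data = colData(example_sce))

A recommended step is to estimate the optimal dimensionality in the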
model by using cross-validation. This cross-validation masks a
proportion of the values as missing, and tries to reconstruct these.
Using the out-of-sample deviances, one can estimate the optimal
dimensionality of the latent space. This cross-validation can be done
with the runCVGMF or calculateCVGMF function,
which builds on the sgdgmf.cv function from the sgdGMF
package. Although the sgdGMF framework allows great
flexibility regarding the optimization algorithm, sensible default
values are here introduced for omics data. One should choose the correct
distribution family (family), the number of components in
the dimensionality reduction for which the cross-validation is run
(ncomponents), and the known covariate matrices to account
for (X and Z). Also, one should select the
right assay that is used for dimensionality reduction
(exprs_values or assay.type).
Visualization of the cross-validation results can be done using
plotCV. If multiple cross-validation results are
available in the metadata, it is possible to visualize
them together by giving the names of all relevant metadata slots. The optimal
dimensionality is the one with the lowest out-of-sample
deviance.

example_sce <- runCVGMF(
    example_sce,
    X = X,                          # Covariate matrix
    exprs_values="logintensities",  # Use log-transformed intensities
    family = gaussian(),            # Gaussian model for proteomics data
    ncomponents = c(1:5))           # Test components from 1 to 5
metadata(example_sce)$cv_GMF %>%
    group_by(ncomp) %>%
    summarise(mean_dev = mean(dev),
              mean_aic = mean(aic),
              mean_bic = mean(bic),
              mean_mae = mean(mae),
              mean_mse = mean(mse))

## # A tibble: 5 × 6
##   ncomp mean_dev mean_aic mean_bic mean_mae mean_mse
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1     1     1.17    0.996     2.23    0.690     1.17
## 2     2     1.36    1.06      2.75    0.743     1.36
## 3     3     1.61    1.16      3.30    0.806     1.61
## 4     4     1.76    1.27      3.87    0.845     1.76
## 5     5     1.88    1.42      4.47    0.882     1.88

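To make the masking idea behind this cross-validation concrete, here is a minimal base R sketch for Gaussian data. It is not the sgdGMF optimizer: as a stand-in, it fits each rank with a simple iterative truncated SVD, it ignores covariates, and the helper names `fit_lowrank` and `cv_mse` are hypothetical. Observed entries are split into folds, each fold is masked in turn, and the out-of-sample mean squared error is compared across ranks.

```r
set.seed(42)
n <- 60; p <- 40; true_rank <- 2
U <- matrix(rnorm(n * true_rank), n, true_rank)
V <- matrix(rnorm(p * true_rank), p, true_rank)
Y <- U %*% t(V) + matrix(rnorm(n * p, sd = 0.5), n, p)

# Rank-k fit of a matrix with NAs: alternate between filling the
# missing cells with the current fit and truncating the SVD.
fit_lowrank <- function(Y, k, n_iter = 50) {
  miss <- is.na(Y)
  fitted <- matrix(mean(Y, na.rm = TRUE), nrow(Y), ncol(Y))
  for (i in seq_len(n_iter)) {
    filled <- Y
    filled[miss] <- fitted[miss]
    s <- svd(filled, nu = k, nv = k)
    fitted <- s$u %*% (s$d[seq_len(k)] * t(s$v))
  }
  fitted
}

# Mask each fold of observed entries in turn and score the
# reconstruction of the masked entries only.
cv_mse <- function(Y, ranks, n_folds = 5, n_iter = 50) {
  obs <- which(!is.na(Y))
  fold <- sample(rep(seq_len(n_folds), length.out = length(obs)))
  sapply(ranks, function(k) {
    fold_mse <- sapply(seq_len(n_folds), function(f) {
      held_out <- obs[fold == f]
      Y_train <- Y
      Y_train[held_out] <- NA
      fitted <- fit_lowrank(Y_train, k, n_iter)
      mean((Y[held_out] - fitted[held_out])^2)
    })
    mean(fold_mse)
  })
}

ranks <- 1:5
mse <- cv_mse(Y, ranks)
names(mse) <- ranks
best_rank <- ranks[which.min(mse)]
print(round(mse, 3))
print(best_rank)
```

Because the toy data are simulated with two latent factors, the out-of-sample error is expected to be smallest near rank 2. The same reconstruction of masked entries is what makes model-based imputation possible once the final model is fitted.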
If the dataset is large or you are unsure about the optimal range of components to test, an alternative is the scree plot approach. This method uses PCA on deviance residuals to estimate eigenvalues, providing a fast approximation of the optimal dimensionality.
This can be done using runRankGMF or
calculateRankGMF followed by plotRank or
screeplot_rank respectively. Note that the
maxcomp argument sets the number of
eigenvalues to compute.
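As a rough illustration of this scree plot approach, the following base R sketch assumes a Gaussian family with identity link, complete data, and samples in rows (unlike the features-by-samples layout of the assay above). Under these assumptions the deviance residuals reduce to ordinary residuals after regressing out the known covariates. The simulated data and all variable names are made up for the example; this is not the internal code of calculateRankGMF.

```r
set.seed(7)
n <- 50; p <- 30; true_rank <- 3
batch <- factor(rep(c("Batch1", "Batch2"), each = n / 2))
X <- model.matrix(~ batch)                    # n x 2 design matrix
B <- matrix(rnorm(ncol(X) * p), ncol(X), p)   # covariate effects per feature
U <- matrix(rnorm(n * true_rank, sd = 2), n, true_rank)
V <- matrix(rnorm(p * true_rank), p, true_rank)
Y <- X %*% B + U %*% t(V) + matrix(rnorm(n * p, sd = 0.5), n, p)

# Residuals of every feature after regressing out the covariates in X.
res <- qr.resid(qr(X), Y)

# Leading eigenvalues of the residual covariance, via the singular values.
maxcomp <- 10
eig <- svd(res, nu = 0, nv = 0)$d[seq_len(maxcomp)]^2 / (n - 1)

# Scree plot: look for the elbow where the eigenvalues level off.
plot(seq_len(maxcomp), eig, type = "b",
     xlab = "Component", ylab = "Eigenvalue")
```

With three simulated latent factors, the eigenvalues should drop sharply after the third component, which is where the elbow method would place the dimensionality.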

example_sce <- runRankGMF(example_sce,
                          X = X,
                          exprs_values="logintensities",
                          family = gaussian(),
                          maxcomp = 10)
plotRank(example_sce, maxcomp = 10)

After choosing the number of components to use in the final
dimensionality reduction, runGMF or
calculateGMF can be used. Again, one should select the
distribution family (family), the dimensionality
(ncomponents), the known covariate matrices to account for
(X and Z) and the assay used
(exprs_values or assay.type). Unlike
runPCA, runGMF uses all features by default.
If you want to select the most variable features instead, set
ntop. However, when the goal is to impute missing values,
all features should be used. The results are stored in the
reducedDim slot of the SingleCellExperiment
object. Additional information, such as the rotation matrix,
parameter estimates, and the optimization history of the sgdGMF
framework, is available in the attributes.
See runGMF for all outputs.

example_sce <- runGMF(
    example_sce,
    exprs_values="logintensities",
    family = gaussian(),
    ncomponents = 3, # Use optimal dimensionality, here arbitrarily chosen as 3
    name = "GMF")

reducedDimNames(example_sce)
head(reducedDim(example_sce, type = "GMF"))
names(attributes(reducedDim(example_sce, type = "GMF")))
head(attr(reducedDim(example_sce, type = "GMF"), "rotation"))
tail(attr(reducedDim(example_sce, type = "GMF"), "trace"))

To visualize the reduced dimensions, you can use
plotReducedDim from the scater package,
specifying “GMF” as the dimension reduction method. Alternatively, the
plotGMF function provides a direct wrapper for this.
Finally, it is possible to impute the missing values with the
model-based estimates. This can be done with the imputeGMF
function:

example_sce <- imputeGMF(example_sce,
                         exprs_values = "logintensities",
                         reducedDimName = "GMF",
                         name = "logintensities_imputed")
assay(example_sce,'logintensities')[1:5,1:5]
assay(example_sce,'logintensities_imputed')[1:5,1:5]

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.2.0 omicsGMF_1.1.0
## [3] scater_1.39.2 scuttle_1.21.0
## [5] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.1
## [7] Biobase_2.71.0 GenomicRanges_1.63.1
## [9] Seqinfo_1.1.0 IRanges_2.45.0
## [11] S4Vectors_0.49.0 BiocGenerics_0.57.0
## [13] generics_0.1.4 MatrixGenerics_1.23.0
## [15] matrixStats_1.5.0 sgdGMF_1.0.1
## [17] ggplot2_4.0.2 knitr_1.51
## [19] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.7
## [3] magrittr_2.0.4 clue_0.3-66
## [5] otel_0.2.0 compiler_4.5.2
## [7] vctrs_0.7.1 reshape2_1.4.5
## [9] stringr_1.6.0 ProtGenerics_1.43.0
## [11] pkgconfig_2.0.3 fastmap_1.2.0
## [13] XVector_0.51.0 labeling_0.4.3
## [15] rmarkdown_2.30 ggbeeswarm_0.7.3
## [17] purrr_1.2.1 xfun_0.56
## [19] MultiAssayExperiment_1.37.2 cachem_1.1.0
## [21] beachmat_2.27.2 jsonlite_2.0.0
## [23] DelayedArray_0.37.0 BiocParallel_1.45.0
## [25] irlba_2.3.7 parallel_4.5.2
## [27] cluster_2.1.8.2 R6_2.6.1
## [29] bslib_0.10.0 stringi_1.8.7
## [31] RColorBrewer_1.1-3 jquerylib_0.1.4
## [33] Rcpp_1.1.1 iterators_1.0.14
## [35] Matrix_1.7-4 igraph_2.2.1
## [37] tidyselect_1.2.1 abind_1.4-8
## [39] yaml_2.3.12 viridis_0.6.5
## [41] doParallel_1.0.17 codetools_0.2-20
## [43] lattice_0.22-7 tibble_3.3.1
## [45] plyr_1.8.9 withr_3.0.2
## [47] S7_0.2.1 evaluate_1.0.5
## [49] pillar_1.11.1 BiocManager_1.30.27
## [51] foreach_1.5.2 scales_1.4.0
## [53] RcppArmadillo_15.2.3-1 glue_1.8.0
## [55] lazyeval_0.2.2 maketools_1.3.2
## [57] tools_4.5.2 BiocNeighbors_2.5.3
## [59] sys_3.4.3 ScaledMatrix_1.19.0
## [61] QFeatures_1.21.0 RSpectra_0.16-2
## [63] buildtools_1.0.0 cowplot_1.2.0
## [65] grid_4.5.2 tidyr_1.3.2
## [67] MsCoreUtils_1.23.2 beeswarm_0.4.0
## [69] BiocSingular_1.27.1 vipor_0.4.7
## [71] cli_3.6.5 rsvd_1.0.5
## [73] S4Arrays_1.11.1 viridisLite_0.4.3
## [75] AnnotationFilter_1.35.0 gtable_0.3.6
## [77] sass_0.4.10 digest_0.6.39
## [79] SparseArray_1.11.10 ggrepel_0.9.6
## [81] farver_2.1.2 htmltools_0.5.9
## [83] lifecycle_1.0.5 MASS_7.3-65