Contents

1 Introduction

omicsGMF is an R package for generalized matrix factorization and missing value imputation in omics data. It is designed for dimensionality reduction and visualization, specifically handling count data and missing values efficiently. Unlike conventional PCA, omicsGMF does not require log-transformation of RNA-seq data or prior imputation of proteomics data.

A key advantage of omicsGMF is its ability to control for known sample- and feature-level covariates, such as batch effects. This improves downstream analyses like clustering. Additionally, omicsGMF includes model selection to optimize the number of latent confounders, ensuring an optimal dimensionality for analysis. Its stochastic optimization algorithms allow it to remain fast while handling these complex data structures.

omicsGMF builds on the sgdGMF framework provided in the sgdGMF CRAN package, and provides easy integration with SingleCellExperiment, SummarizedExperiment, and QFeature classes, with adapted default values for the optimization arguments when dealing with omics data.

All details about the sgdGMF framework, such as the adaptive learning rates, exponential gradient averaging and subsampling of the data are described in our preprint (Castiglione et al. 2024). There, we show the use of the sgdGMF-framework on single-cell RNA-seq data. In our newest preprint (2025), we show how omicsGMF can be used to visualize (single-cell) proteomics data and impute missing values.

This vignette provides a step-by-step workflow for using omicsGMF for dimensionality reduction of omics data. The main function are:

  1. runCVGMF or calculateCVGMF performs cross-validation to determine the optimal number of latent confounders. These results can be visualized using plotCV. This cross-validation avoids arbitrarily choosing ncomponents, but requires some computational time. An alternative is calculateRankGMF, which performs an eigenvalue decomposition on the deviance residuals. This allows for model selection based on a scree plot using plotRank, for example using the elbow method.

  2. runGMF or calculateGMF estimates the latent confounders and the rotation matrix, and estimates the respective parameters of the sample-level and feature-level covariates.

  3. plotGMF plots the samples using its decomposition.

  4. imputeGMF creates a new assay with missing values imputed using the estimates of runGMF.

We here apply omicsGMF on RNA-seq data.

2 Package installation

sgdGMF can be installed through CRAN. omicsGMF can be installed from github, and will be soon available through Bioconductor.

if(!requireNamespace("sgdGMF", quietly = TRUE))
    install.packages("sgdGMF")

BiocManager::install("omicsGMF")
library(sgdGMF)
library(omicsGMF)
library(dplyr)
library(scuttle)
set.seed(100)

3 RNA-seq analysis

To perform dimensionality reduction on RNA-seq data, one can use the original count matrices, without normalizing or log-transforming the sequencing counts to the Gaussian scale. By using family = poisson(), omicsGMF optimizes the dimensionality reduction with respect to the likelihood of the Poisson family. Further, by including a known covariate matrix, X, omicsGMF corrects for known confounders in the dimensionality reduction.

First, we simulate a small dataset using the scuttle package. For sake of exposition, we will further account for the Treatment covariate from the colData. omicsGMF can internally correct for these treatment effects, and therefore does not require prior correction with other tools.

example_sce <- mockSCE(ncells = 20, ngenes = 50)

X <- model.matrix(~Treatment, colData(example_sce))

A recommended step is to estimate the optimal dimensionality in the model by using cross-validation. This cross-validation masks a proportion of the values as missing, and tries to reconstruct these. Using the out-of-sample deviances, one can estimate the optimal dimensionality of the latent space. This cross-validation can be done with the runCVGMF or calculateCVGMF function, which builds on the sgdgmf.cv function from the sgdGMF package. Although the sgdGMF framework allows great flexibility regarding the optimization algorithm, sensible default values are here introduced for omics data. One should choose the correct distribution family (family), the number of components in the dimensionality reduction for which the cross-validation is run (ncomponents), and the known covariate matrices to account for (X and Z). Also, one should select the right assay that is used for dimensionality reduction (exprs_values or assay.type).

Visualization of the cross-validation results can be done using plotCV. In case that multiple cross-validation results are available in the metadata, it is possible to visualize these by giving all names of the metadata slots. The optimal dimensionality is the one that has the lowest out-of-sample deviances.

example_sce <- runCVGMF(
    example_sce, 
    X = X,                   # Covariate matrix
    exprs_values = "counts", # Use raw counts (no normalization)
    family = poisson(),      # Poisson model for RNA-seq count data
    ncomponents = 1:5,       # Test components from 1 to 5
    ntop = 50               # Use top 50 most variable genes
)             

metadata(example_sce)$cv_GMF %>% 
    group_by(ncomp) %>% 
    summarise(mean_dev = mean(dev),
              mean_aic = mean(aic),
              mean_bic = mean(bic),
              mean_mae = mean(mae),
              mean_mse = mean(mse))
## # A tibble: 5 × 6
##   ncomp mean_dev mean_aic mean_bic mean_mae mean_mse
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1     1    1077.     227.     228.     137.  112100.
## 2     2    1141.     218.     220.     161.  161930.
## 3     3    1217.     214.     216.     186.  226963.
## 4     4    1250.     208.     210.     200.  259911.
## 5     5    1300.     219.     222.     224.  324342.
plotCV(example_sce, name = "cv_GMF")

If the dataset is large or you are unsure about the optimal range of components to test, an alternative is the scree plot approach. This method uses PCA on deviance residuals to estimate eigenvalues, providing a fast approximation of the optimal dimensionality.

This can be done using runRankGMF or calculateRankGMF followed by plotRank or screeplot_rank respectively. Note that now, the maxcomp argument can be defined, which is the number of eigenvalues computed.

example_sce <- runRankGMF(
    example_sce, 
    X = X, 
    exprs_values="counts", 
    family = poisson(), 
    maxcomp = 10,
    ntop = 50)

plotRank(example_sce, maxcomp = 10)

After choosing the number of components to use in the final dimensionality reduction, runGMF or calculateGMF can be used. Again, one should select the distribution family (family), the dimensionality (ncomponents), the known covariate matrices to account for (X and Z) and the assay used (exprs_values or assay.type). Unlike runPCA, runGMF uses all features by default. If you want to select the most variable genes instead, set ntop. The results are stored in the reducedDim slot of the SingleCellExperiment object. Additional information such as the rotation matrix, parameter estimates, the optimization history of sgdGMF framework and many more are available in the attributes. See runGMF for all outputs.

example_sce <- runGMF(
    example_sce, 
    X = X, 
    exprs_values="counts", 
    family = poisson(), 
    ncomponents = 3, # Use optimal dimensionality, here arbitrarily chosen as 3
    ntop = 50,
    name = "GMF")
reducedDimNames(example_sce)
head(reducedDim(example_sce))

names(attributes(reducedDim(example_sce, type = "GMF")))
head(attr(reducedDim(example_sce, type = "GMF"), "rotation"))
tail(attr(reducedDim(example_sce, type = "GMF"), "trace"))

To visualize the reduced dimensions, you can use plotReducedDim from the scater package, specifying “GMF” as the dimension reduction method. Alternatively, the plotGMF function provides a direct wrapper for this.

plotReducedDim(example_sce, dimred = "GMF", colour_by = "Mutation_Status")

sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.1.4                 omicsGMF_0.99.7            
##  [3] scater_1.37.0               scuttle_1.19.0             
##  [5] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.2
##  [7] Biobase_2.69.1              GenomicRanges_1.61.6       
##  [9] Seqinfo_0.99.3              IRanges_2.43.6             
## [11] S4Vectors_0.47.5            BiocGenerics_0.55.4        
## [13] generics_0.1.4              MatrixGenerics_1.21.0      
## [15] matrixStats_1.5.0           sgdGMF_1.0.1               
## [17] ggplot2_4.0.0               knitr_1.50                 
## [19] BiocStyle_2.37.1           
## 
## loaded via a namespace (and not attached):
##  [1] gridExtra_2.3               rlang_1.1.6                
##  [3] magrittr_2.0.4              clue_0.3-66                
##  [5] compiler_4.5.1              vctrs_0.6.5                
##  [7] reshape2_1.4.4              stringr_1.5.2              
##  [9] ProtGenerics_1.41.0         pkgconfig_2.0.3            
## [11] crayon_1.5.3                fastmap_1.2.0              
## [13] magick_2.9.0                XVector_0.49.1             
## [15] labeling_0.4.3              rmarkdown_2.30             
## [17] ggbeeswarm_0.7.2            tinytex_0.57               
## [19] purrr_1.1.0                 xfun_0.53                  
## [21] MultiAssayExperiment_1.35.9 cachem_1.1.0               
## [23] beachmat_2.25.5             jsonlite_2.0.0             
## [25] DelayedArray_0.35.3         BiocParallel_1.43.4        
## [27] irlba_2.3.5.1               parallel_4.5.1             
## [29] cluster_2.1.8.1             R6_2.6.1                   
## [31] bslib_0.9.0                 stringi_1.8.7              
## [33] RColorBrewer_1.1-3          jquerylib_0.1.4            
## [35] Rcpp_1.1.0                  bookdown_0.45              
## [37] iterators_1.0.14            BiocBaseUtils_1.11.2       
## [39] Matrix_1.7-4                igraph_2.2.0               
## [41] tidyselect_1.2.1            dichromat_2.0-0.1          
## [43] abind_1.4-8                 yaml_2.3.10                
## [45] viridis_0.6.5               doParallel_1.0.17          
## [47] codetools_0.2-20            lattice_0.22-7             
## [49] tibble_3.3.0                plyr_1.8.9                 
## [51] withr_3.0.2                 S7_0.2.0                   
## [53] evaluate_1.0.5              pillar_1.11.1              
## [55] BiocManager_1.30.26         foreach_1.5.2              
## [57] scales_1.4.0                RcppArmadillo_15.0.2-2     
## [59] glue_1.8.0                  lazyeval_0.2.2             
## [61] tools_4.5.1                 BiocNeighbors_2.3.1        
## [63] ScaledMatrix_1.17.0         QFeatures_1.19.4           
## [65] RSpectra_0.16-2             cowplot_1.2.0              
## [67] grid_4.5.1                  tidyr_1.3.1                
## [69] MsCoreUtils_1.21.0          beeswarm_0.4.0             
## [71] BiocSingular_1.25.1         vipor_0.4.7                
## [73] cli_3.6.5                   rsvd_1.0.5                 
## [75] S4Arrays_1.9.1              viridisLite_0.4.2          
## [77] AnnotationFilter_1.33.0     gtable_0.3.6               
## [79] sass_0.4.10                 digest_0.6.37              
## [81] SparseArray_1.9.1           ggrepel_0.9.6              
## [83] farver_2.1.2                htmltools_0.5.8.1          
## [85] lifecycle_1.0.4             MASS_7.3-65

Castiglione, Cristian, Alexandre Segers, Lieven Clement, and Davide Risso. 2024. “Stochastic Gradient Descent Estimation of Generalized Matrix Factorization Models with Application to Single-Cell Rna Sequencing Data.” https://arxiv.org/abs/2412.20509.

2025.