omicsGMF is an R package for generalized matrix factorization and missing
value imputation in omics data. It is designed for dimensionality reduction and
visualization, specifically handling count data and missing values efficiently.
Unlike conventional PCA, omicsGMF does not require log-transformation of
RNA-seq data or prior imputation of proteomics data.
A key advantage of omicsGMF is its ability to control for known sample- and
feature-level covariates, such as batch effects. This improves downstream
analyses like clustering. Additionally, omicsGMF includes model selection to
optimize the number of latent confounders, ensuring an optimal dimensionality
for analysis. Its stochastic optimization algorithms allow it to remain fast
while handling these complex data structures.
omicsGMF builds on the sgdGMF framework provided in the sgdGMF CRAN
package, and provides easy integration with SingleCellExperiment,
SummarizedExperiment,
and QFeature classes, with adapted default values for the optimization
arguments when dealing with omics data.
All details about the sgdGMF framework, such as the adaptive learning rates,
exponential gradient averaging and subsampling of the data are
described in our preprint (???). There, we show the use of the
sgdGMF-framework on single-cell RNA-seq data. In our newest preprint
(???), we show how omicsGMF can be used to visualize (single-cell)
proteomics data and impute missing values.
This vignette provides a step-by-step workflow for using omicsGMF for
dimensionality reduction of omics data. The main function are:
runCVGMF or calculateCVGMF performs cross-validation to determine the
optimal number of latent confounders. These results can be visualized using
plotCV. This cross-validation avoids arbitrarily choosing ncomponents, but
requires some computational time. An alternative is calculateRankGMF, which
performs an eigenvalue decomposition on the deviance residuals. This allows for
model selection based on a scree plot using plotRank, for example using
the elbow method.
runGMF or calculateGMF estimates the latent confounders and the rotation
matrix, and estimates the respective parameters of the sample-level and
feature-level covariates.
plotGMF plots the samples using its decomposition.
imputeGMF creates a new assay with missing values imputed using the
estimates of runGMF.
We here apply omicsGMF on proteomics data.
sgdGMF can be installed through CRAN.
omicsGMF can be installed from github, and will be soon available
through Bioconductor.
if(!requireNamespace("sgdGMF", quietly = TRUE))
install.packages("sgdGMF")
BiocManager::install("omicsGMF")
library(sgdGMF)
library(omicsGMF)
library(dplyr)
library(scuttle)
set.seed(100)
To perform dimensionality reduction on proteomics data, one can use the
log-transformed intensities, which makes the data Gaussian distributed.
Optionally, one can opt to perform normalization such as median-normalization,
although this is not required, but might enhance numerical stability and
convergence speed. For proteomics data, family = gaussian() should be used
in the data analysis, and missing values should not be imputed prior to the
matrix factorization as omicsGMF deals with these internally. If the goal is to
impute missing values after matrix decomposition, one should include all
features in the analysis, therefore ignoring the ntop argument.
For the proteomics vignette, we will simulate some artificial Gaussian
data, and introduce some missing values completely ad random.
We store the simulated intensities in the logintensities assay of a
SingleCellExperiment. For sake of exposition, we also include a batch
effect in the colData. omicsGMF can internally correct for these batch
effects, and therefore does not require prior correction with other tools.
sim_intensities <- matrix(rnorm(n = 20*50, mean = 1, sd = 1),
ncol = 20)
NAs <- rbinom(n = prod(dim(sim_intensities)), size = 1, prob = 0.3) == 1
sim_intensities[NAs] <- NA
colnames(sim_intensities) <- paste0("S_", c(1:20))
rownames(sim_intensities) <- paste0("G_", c(1:50))
example_sce <- SingleCellExperiment(
assays = SimpleList("logintensities" = sim_intensities),
colData = data.frame("Batch" = rep(c("Batch1", "Batch2"), each = 10)))
X <- model.matrix(~Batch, data = colData(example_sce))
A recommended step is to estimate the optimal dimensionality in the model
by using cross-validation. This cross-validation masks a proportion of the
values as missing, and tries to reconstruct these. Using the out-of-sample
deviances, one can estimate the optimal dimensionality of the latent space.
This cross-validation can be done with the runCVGMF or calculateCVGMF
function, which builds
on the sgdgmf.cv function from the sgdGMF package. Although the
sgdGMF framework allows great flexibility regarding the optimization
algorithm, sensible default values are here introduced for omics data. One
should choose the correct distribution family (family), the number
of components in the dimensionality reduction for which the cross-validation is
run (ncomponents), and the known covariate matrices to account for
(X and Z). Also, one should select the right assay that is used for
dimensionality reduction (exprs_values or assay.type).
Visualization of the cross-validation results can be done using
plotCV. In case that multiple cross-validation results are available in the
metadata, it is possible to visualize these by giving all names of the
metadata slots. The optimal dimensionality is the one that has the lowest
out-of-sample deviances.
example_sce <- runCVGMF(
example_sce,
X = X, # Covariate matrix
exprs_values="logintensities", # Use log-transformed intensities
family = gaussian(), # Gaussian model for proteomics data
ncomponents = c(1:5)) # Test components from 1 to 5
metadata(example_sce)$cv_GMF %>%
group_by(ncomp) %>%
summarise(mean_dev = mean(dev),
mean_aic = mean(aic),
mean_bic = mean(bic),
mean_mae = mean(mae),
mean_mse = mean(mse))
## # A tibble: 5 × 6
## ncomp mean_dev mean_aic mean_bic mean_mae mean_mse
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.17 0.996 2.23 0.690 1.17
## 2 2 1.36 1.06 2.75 0.743 1.36
## 3 3 1.61 1.16 3.30 0.806 1.61
## 4 4 1.76 1.27 3.87 0.845 1.76
## 5 5 1.88 1.42 4.47 0.882 1.88
plotCV(example_sce, name = "cv_GMF")
If the dataset is large or you are unsure about the optimal range of components to test, an alternative is the scree plot approach. This method uses PCA on deviance residuals to estimate eigenvalues, providing a fast approximation of the optimal dimensionality.
This can be done using runRankGMF or calculateRankGMF followed
by plotRank or screeplot_rank respectively. Note that now, the
maxcomp argument can be defined, which is the number of
eigenvalues computed.
example_sce <- runRankGMF(example_sce,
X = X,
exprs_values="logintensities",
family = gaussian(),
maxcomp = 10)
plotRank(example_sce, maxcomp = 10)
After choosing the number of components to use in the final dimensionality
reduction, runGMF or calculateGMF can be used. Again, one should select the
distribution family (family), the dimensionality (ncomponents),
the known covariate matrices to account for (X and Z) and the
assay used (exprs_values or assay.type). Unlike runPCA, runGMF uses all
features by default. If you want to select the most variable genes instead,
set ntop. However, when the goal is imputing missing values, all
features should be used. The results are stored in the reducedDim slot of the
SingleCellExperiment object. Additional information such as the rotation
matrix, parameter estimates, the optimization history of sgdGMF framework
and many more are available in the attributes. See runGMF for all
outputs.
example_sce <- runGMF(
example_sce,
exprs_values="logintensities",
family = gaussian(),
ncomponents = 3, # Use optimal dimensionality, here arbitrarily chosen as 3
name = "GMF")
reducedDimNames(example_sce)
head(reducedDim(example_sce, type = "GMF"))
names(attributes(reducedDim(example_sce, type = "GMF")))
head(attr(reducedDim(example_sce, type = "GMF"), "rotation"))
tail(attr(reducedDim(example_sce, type = "GMF"), "trace"))
To visualize the reduced dimensions, you can use plotReducedDim from the
scater package, specifying “GMF” as the dimension reduction method.
Alternatively, the plotGMF function provides a direct wrapper for this.
plotReducedDim(example_sce, dimred = "GMF", colour_by = "Batch")
Finally, it is possible to impute the missing values with the model-based
estimates. This can be done with the imputeGMF function:
example_sce <- imputeGMF(example_sce,
exprs_values = "logintensities",
reducedDimName = "GMF",
name = "logintensities_imputed")
assay(example_sce,'logintensities')[1:5,1:5]
assay(example_sce,'logintensities_imputed')[1:5,1:5]
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.1.4 omicsGMF_0.99.7
## [3] scater_1.37.0 scuttle_1.19.0
## [5] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.2
## [7] Biobase_2.69.1 GenomicRanges_1.61.6
## [9] Seqinfo_0.99.3 IRanges_2.43.6
## [11] S4Vectors_0.47.5 BiocGenerics_0.55.4
## [13] generics_0.1.4 MatrixGenerics_1.21.0
## [15] matrixStats_1.5.0 sgdGMF_1.0.1
## [17] ggplot2_4.0.0 knitr_1.50
## [19] BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.6
## [3] magrittr_2.0.4 clue_0.3-66
## [5] compiler_4.5.1 vctrs_0.6.5
## [7] reshape2_1.4.4 stringr_1.5.2
## [9] ProtGenerics_1.41.0 pkgconfig_2.0.3
## [11] crayon_1.5.3 fastmap_1.2.0
## [13] magick_2.9.0 XVector_0.49.1
## [15] labeling_0.4.3 rmarkdown_2.30
## [17] ggbeeswarm_0.7.2 tinytex_0.57
## [19] purrr_1.1.0 xfun_0.53
## [21] MultiAssayExperiment_1.35.9 cachem_1.1.0
## [23] beachmat_2.25.5 jsonlite_2.0.0
## [25] DelayedArray_0.35.3 BiocParallel_1.43.4
## [27] irlba_2.3.5.1 parallel_4.5.1
## [29] cluster_2.1.8.1 R6_2.6.1
## [31] bslib_0.9.0 stringi_1.8.7
## [33] RColorBrewer_1.1-3 jquerylib_0.1.4
## [35] Rcpp_1.1.0 bookdown_0.45
## [37] iterators_1.0.14 BiocBaseUtils_1.11.2
## [39] Matrix_1.7-4 igraph_2.2.0
## [41] tidyselect_1.2.1 dichromat_2.0-0.1
## [43] abind_1.4-8 yaml_2.3.10
## [45] viridis_0.6.5 doParallel_1.0.17
## [47] codetools_0.2-20 lattice_0.22-7
## [49] tibble_3.3.0 plyr_1.8.9
## [51] withr_3.0.2 S7_0.2.0
## [53] evaluate_1.0.5 pillar_1.11.1
## [55] BiocManager_1.30.26 foreach_1.5.2
## [57] scales_1.4.0 RcppArmadillo_15.0.2-2
## [59] glue_1.8.0 lazyeval_0.2.2
## [61] tools_4.5.1 BiocNeighbors_2.3.1
## [63] ScaledMatrix_1.17.0 QFeatures_1.19.4
## [65] RSpectra_0.16-2 cowplot_1.2.0
## [67] grid_4.5.1 tidyr_1.3.1
## [69] MsCoreUtils_1.21.0 beeswarm_0.4.0
## [71] BiocSingular_1.25.1 vipor_0.4.7
## [73] cli_3.6.5 rsvd_1.0.5
## [75] S4Arrays_1.9.1 viridisLite_0.4.2
## [77] AnnotationFilter_1.33.0 gtable_0.3.6
## [79] sass_0.4.10 digest_0.6.37
## [81] SparseArray_1.9.1 ggrepel_0.9.6
## [83] farver_2.1.2 htmltools_0.5.8.1
## [85] lifecycle_1.0.4 MASS_7.3-65