When performing multiple imputation with {rbmi}
using many imputations (e.g., 100-1000), the full imputed dataset can
become very large. However, most of this data is redundant: observed
values are identical across all imputations.
The {rbmiUtils} package provides two functions to address this:

- `reduce_imputed_data()`: extract only the imputed values (those that were originally missing)
- `expand_imputed_data()`: reconstruct the full dataset when needed

This approach can reduce storage requirements by 90% or more, depending on the proportion of missing data.
Consider a typical clinical trial dataset:
- Full storage: 2,500 rows × 1,000 imputations = 2.5 million rows
- Reduced storage: 125 missing values × 1,000 imputations = 125,000 rows (5% of the full size)
The {rbmiUtils} package includes example datasets we can
use:
data("ADMI", package = "rbmiUtils") # Full imputed dataset
data("ADEFF", package = "rbmiUtils") # Original data with missing values
# Check dimensions
cat("Full imputed dataset (ADMI):", nrow(ADMI), "rows\n")
#> Full imputed dataset (ADMI): 100000 rows
cat("Number of imputations:", length(unique(ADMI$IMPID)), "\n")
#> Number of imputations: 100First, prepare the original data to match the imputed data structure:
```r
original <- ADEFF |>
  mutate(
    TRT = TRT01P,
    USUBJID = as.character(USUBJID)
  )

# Count missing values
n_missing <- sum(is.na(original$CHG))
cat("Missing values in original data:", n_missing, "\n")
#> Missing values in original data: 44
```

Define the variables specification:
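A minimal sketch, assuming the standard `rbmi::set_vars()` helper and the variable names seen in `ADMI`; adjust the covariates to match your own imputation model:

```r
library(rbmi)

vars <- set_vars(
  subjid  = "USUBJID",
  visit   = "AVISIT",
  outcome = "CHG",
  group   = "TRT",
  covariates = c("BASE", "STRATA", "REGION")
)
```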
Now reduce the imputed data:
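Using the objects defined above (the same call signature appears in the workflow section below):

```r
reduced <- reduce_imputed_data(ADMI, original, vars)
```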
The reduced dataset contains only the rows that were originally missing:
```r
# First few rows
head(reduced)
#> # A tibble: 6 × 12
#>   IMPID STRATA REGION  REGIONC TRT    BASE   CHG AVISIT USUBJID CRIT1FLN CRIT1FL
#>   <dbl> <chr>  <chr>     <dbl> <chr> <dbl> <dbl> <chr>  <chr>      <dbl> <chr>
#> 1     1 A      North …       1 Plac…    12 -1.96 Week … ID011          0 N
#> 2     1 A      Europe        3 Drug…     3 -3.71 Week … ID014          0 N
#> 3     1 B      Europe        3 Drug…     9 -1.96 Week … ID018          0 N
#> 4     1 A      Asia          4 Drug…    10 -5.55 Week … ID033          0 N
#> 5     1 A      Asia          4 Drug…     0 -1.28 Week … ID061          0 N
#> 6     1 A      South …       2 Plac…     5 -2.60 Week … ID071          0 N
#> # ℹ 1 more variable: CRIT <chr>

# Structure matches original imputed data
cat("\nColumns in reduced data:\n")
#>
#> Columns in reduced data:
cat(paste(names(reduced), collapse = ", "))
#> IMPID, STRATA, REGION, REGIONC, TRT, BASE, CHG, AVISIT, USUBJID, CRIT1FLN, CRIT1FL, CRIT
```

Each row represents an imputed value for a specific subject-visit-imputation combination.
When you need to run analyses, expand the reduced data back to full form:
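For example (the call signature matches the workflow shown below):

```r
expanded <- expand_imputed_data(reduced, original, vars)
```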
Let’s verify that the round-trip preserves data integrity:
```r
# Sort both datasets for comparison
admi_sorted <- ADMI |>
  arrange(IMPID, USUBJID, AVISIT)

expanded_sorted <- expanded |>
  arrange(IMPID, USUBJID, AVISIT)

# Compare CHG values
all_equal <- all.equal(
  admi_sorted$CHG,
  expanded_sorted$CHG,
  tolerance = 1e-10
)
cat("Data integrity check:", all_equal, "\n")
#> Data integrity check: TRUE
```

Here's how to integrate efficient storage into your workflow:
```r
# After imputation
impute_obj <- impute(
  draws_obj,
  references = c("Placebo" = "Placebo", "Drug A" = "Placebo")
)
full_imputed <- get_imputed_data(impute_obj)

# Reduce for storage
reduced <- reduce_imputed_data(full_imputed, original_data, vars)

# Save both (the reduced file is much smaller)
saveRDS(reduced, "imputed_reduced.rds")
saveRDS(original_data, "original_data.rds")
```

Later, reload and expand only when an analysis needs the full dataset:

```r
# Load saved data
reduced <- readRDS("imputed_reduced.rds")
original_data <- readRDS("original_data.rds")

# Expand when needed for analysis
full_imputed <- expand_imputed_data(reduced, original_data, vars)

# Run analysis
ana_obj <- analyse_mi_data(
  data = full_imputed,
  vars = vars,
  method = method,
  fun = ancova
)
```

Here's a comparison of storage requirements for different scenarios:
| Subjects | Visits | Missing % | Imputations | Full Rows | Reduced Rows | Savings |
|---|---|---|---|---|---|---|
| 500 | 5 | 5% | 100 | 250,000 | 12,500 | 95% |
| 500 | 5 | 5% | 1,000 | 2,500,000 | 125,000 | 95% |
| 1,000 | 8 | 10% | 500 | 4,000,000 | 400,000 | 90% |
| 200 | 4 | 20% | 1,000 | 800,000 | 160,000 | 80% |
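The arithmetic behind the table can be checked with a small helper (hypothetical, not part of {rbmiUtils}):

```r
# Full vs. reduced row counts for a given scenario (hypothetical helper)
storage_rows <- function(subjects, visits, missing_prop, imputations) {
  full    <- subjects * visits * imputations
  reduced <- subjects * visits * missing_prop * imputations
  c(full = full, reduced = reduced, savings = 1 - reduced / full)
}

storage_rows(500, 5, 0.05, 1000)
# full 2,500,000; reduced 125,000; savings 0.95
```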
The savings depend directly on the proportion of missing data: as the table shows, the fraction saved is roughly 100% minus the missing-data percentage, regardless of the number of imputations.

Use reduced storage when imputations number in the hundreds or thousands, or when storage and memory are constrained. Keep the full data when the dataset is small or when analyses are rerun frequently, since each run would otherwise need an expansion step.
If the original data has no missing values, `reduce_imputed_data()` returns an empty data.frame:
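A sketch of this edge case, using a hypothetical fully observed version of the example data (filling the missing outcomes here is purely illustrative):

```r
# Hypothetical: fill in the missing outcomes so the "original"
# data is fully observed
complete_original <- original
complete_original$CHG[is.na(complete_original$CHG)] <- 0

reduced_none <- reduce_imputed_data(ADMI, complete_original, vars)
nrow(reduced_none)
# Expected: 0, since no values were originally missing
```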
The functions work with any number of imputations, including just one.
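For instance, a single-imputation round trip (a sketch using the objects from above):

```r
# Keep only the first imputation, then reduce and expand it
single <- ADMI |> filter(IMPID == 1)
single_reduced <- reduce_imputed_data(single, original, vars)
single_full <- expand_imputed_data(single_reduced, original, vars)
```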
The `reduce_imputed_data()` and `expand_imputed_data()` functions provide an efficient way to store imputed datasets:

- storage savings of roughly 80-95%, driven by the proportion of missing data
- a lossless round trip, so the full dataset can be reconstructed exactly when needed
- direct compatibility with the {rbmi} workflow, since the expanded data feeds straight into `analyse_mi_data()`
This approach is particularly valuable when working with large numbers of imputations or when storage and memory are constrained.
For the complete analysis workflow using imputed data, see
vignette('pipeline').