---
title: "Preparing your data: from fitted models to F_hat"
output:
  rmarkdown::html_vignette: default
vignette: >
  %\VignetteIndexEntry{Preparing your data: from fitted models to F_hat}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r dp-knit-opts, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4,
  fig.align = "center"
)
```

```{r setup}
library(MetaHunt)
set.seed(1)
```

## Why this step matters

Most users arrive at MetaHunt with **one fitted model per study** (a random forest from each centre, a causal forest from each trial site, a linear model from each cohort), not with the `m`-by-`G` numeric matrix `F_hat` that the rest of the package consumes. The two onramp helpers in MetaHunt, `build_grid()` and `f_hat_from_models()`, bridge that gap.

The conceptual picture is simple. Pick a finite set of "patient profiles" (the **grid**); evaluate every centre's model at every profile; stack the resulting numeric vectors as rows. The output is `F_hat`, where `F_hat[i, g]` is the prediction of centre `i`'s model at grid point `g`.

This tutorial walks through that two-step flow and covers:

- the dispatch table `f_hat_from_models()` uses,
- a fully worked, dependency-free example with `lm`,
- the `predict_fn` escape hatch for custom model classes,
- a multi-dimensional grid,
- the sanity checks worth running before plugging the result into `metahunt()`.

## `build_grid()`

`build_grid()` constructs a data frame of grid points from any **reference patient-level dataset**. The reference data should have the same columns and live on the same scale as the data each centre's model was trained on. Common choices, in rough order of preference:

1. **A held-out sample from the target population.** Best if you have one; even 50–500 rows are enough. The grid is then a draw from the measure you care about, and the default uniform `grid_weights` is exactly right.
2. **A public dataset of the same population.** For clinical work, e.g. NHANES, MEPS, or a national survey. For political-science replications, a public benchmark such as the World Values Survey or the ANES individual-level files works.
3. **A small individual-level sample from a single source site**, if your privacy regime allows one site to share its covariates (but not outcomes). This is common in multi-site clinical trials where one site can provide a covariate sample under a covariate-only DUA.
4. **A synthetic grid based on summary statistics.** If no individual-level data are available anywhere, construct a grid by sampling from marginal distributions whose moments match the summary statistics you have agreed on. This is the most permissive option but the least faithful: distances over the grid will reflect the independence assumption you imposed.

Whatever you pick, **the grid must use the same column names and encodings the per-site models were trained with**; `f_hat_from_models()` will call `predict(model, newdata = grid)` and silently produce nonsense if (say) factor levels disagree.

```{r build-grid-1d}
ref <- data.frame(age = rnorm(500, 60, 10),
                  bp  = rnorm(500, 130, 15),
                  bmi = rnorm(500, 28, 4))
grid <- build_grid(ref, n_grid = 50, seed = 1)
dim(grid)
head(grid)
```

If `n_grid` is `NULL` (or at least `nrow(reference_data)`), the full reference data is returned unchanged. Otherwise `n_grid` rows are sampled uniformly without replacement.

The empirical distribution of the returned grid implicitly defines the measure `mu` used by the `L^2(mu)` inner product downstream. If your grid is itself a representative sample of the population you care about, the default uniform `grid_weights` is appropriate. If the grid was sampled from one population but you want distances to reflect another, pass non-uniform `grid_weights` proportional to the target density at each grid point. See `vignette("grid-weights")`.
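As a concrete illustration of that last case, suppose the grid ages were sampled from a population with `age ~ N(60, 10)` but you want distances to reflect a target population with `age ~ N(65, 8)`. A minimal sketch of the density-ratio reweighting, using only base R (the distributions, and the idea that the resulting vector is passed as `grid_weights`, are illustrative assumptions, not MetaHunt output):

```r
# Hypothetical setting: grid ages drawn from N(60, 10), but distances should
# reflect a target population with ages ~ N(65, 8).
set.seed(1)
grid_age <- rnorm(50, 60, 10)   # stands in for grid$age

# Importance ratio: target density / sampling density at each grid point.
w <- dnorm(grid_age, 65, 8) / dnorm(grid_age, 60, 10)
grid_weights <- w / sum(w)      # normalise so the weights sum to 1

length(grid_weights)  # one weight per grid point
sum(grid_weights)     # 1
```

The resulting `grid_weights` vector is then what you would pass downstream in place of the uniform default.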
## `f_hat_from_models()`

`f_hat_from_models(models, grid)` takes a list of fitted model objects and the grid you just built, and returns the `m`-by-`G_grid` numeric matrix `F_hat`.

### Dispatch table

Internally, `f_hat_from_models()` dispatches on each model's class:

| Class                                     | Call form                                    |
|-------------------------------------------|----------------------------------------------|
| `ranger`                                  | `predict(model, data = grid)$predictions`    |
| `grf::causal_forest`, `regression_forest` | `predict(model, newdata = grid)$predictions` |
| anything else                             | `as.numeric(predict(model, newdata = grid))` |

The default branch covers `lm`, `glm`, `randomForest`, and most other R model objects whose `predict()` method returns a numeric vector when called with `newdata`.

#### `ranger` example (not run)

```{r ranger-example, eval = FALSE}
library(ranger)
centre_models <- lapply(centre_data_list, function(d) ranger(y ~ ., data = d))
F_hat <- f_hat_from_models(centre_models, grid)
```

#### `grf::causal_forest` example (not run)

```{r grf-example, eval = FALSE}
library(grf)
centre_models <- lapply(centre_data_list, function(d) causal_forest(d$X, d$Y, d$W))
F_hat <- f_hat_from_models(centre_models, grid)
```

These chunks are `eval = FALSE` so this vignette does not depend on `ranger` or `grf`; you do not need to install either just to read the tutorial.

### Worked `lm` example

To keep the tutorial fully reproducible without external dependencies, here is a complete onramp using `lm()` as the per-centre model. (This is the same `lm-onramp` flow that previously appeared in the introductory vignette, broken out and annotated.)

```{r lm-onramp}
m <- 8
centre_meta <- data.frame(
  region = factor(sample(c("N", "S", "E", "W"), m, replace = TRUE)),
  mean_age = round(runif(m, 50, 70)),
  pct_female = round(runif(m, 0.4, 0.6), 2)
)

# Each centre fits a quadratic on a single covariate `x`.
make_centre_data <- function(i) {
  x <- runif(80, -1, 1)
  beta <- centre_meta$mean_age[i] / 60  # toy effect of metadata
  data.frame(x = x, y = beta * x + 0.3 * x^2 + rnorm(80, sd = 0.2))
}
centre_models <- lapply(seq_len(m), function(i)
  stats::lm(y ~ poly(x, 2), data = make_centre_data(i)))

# A 1-D grid in the centres' covariate space.
grid_centres <- data.frame(x = seq(-1, 1, length.out = 30))
F_hat_centres <- f_hat_from_models(centre_models, grid_centres)
dim(F_hat_centres)  # 8 x 30
F_hat_centres[1:3, 1:5]
```

You now have `F_hat_centres` (`m × G_grid`) and `centre_meta` (`m × p`), which are everything `metahunt()` needs.

### `predict_fn` for custom S4 / bespoke classes

If your fitted models are S4 objects, ensembles, or otherwise need custom handling, override the dispatcher with `predict_fn`. The function must accept `(model, grid)` and return a length-`nrow(grid)` numeric vector:

```{r predict-fn-demo}
# Toy "model" that is just a list with a slope. predict_fn evaluates it.
fake_models <- lapply(seq_len(4), function(i) list(slope = i / 4, intercept = 0))
custom_predict <- function(model, grid) {
  model$intercept + model$slope * grid$x
}
F_hat_custom <- f_hat_from_models(fake_models, grid_centres,
                                  predict_fn = custom_predict)
dim(F_hat_custom)  # 4 x 30
```

This same pattern extends to S4 model objects: write a one-line adapter that pulls the predictions out and coerces them to a numeric vector.

## Multi-dimensional grids (3 covariates)

The grid does not need to be one-dimensional. With more than one patient-level covariate, sample the grid from a multivariate reference dataset and let `predict()` evaluate at each row.

```{r multi-d-grid}
# A 3-covariate reference dataset and a sub-sampled grid.
ref3 <- data.frame(age = rnorm(400, 60, 10),
                   bp  = rnorm(400, 130, 15),
                   bmi = rnorm(400, 28, 4))
grid3 <- build_grid(ref3, n_grid = 25, seed = 1)
dim(grid3)

# Each centre fits an lm on (age, bp, bmi); slopes vary across centres
# so there is genuine cross-centre heterogeneity to recover.
set.seed(2)
m3 <- 8
centre_data3 <- lapply(seq_len(m3), function(i) {
  n_i <- 60
  age <- rnorm(n_i, 60, 10)
  bp  <- rnorm(n_i, 130, 15)
  bmi <- rnorm(n_i, 28, 4)
  beta_age <- 0.02 + 0.03 * (i / m3)
  beta_bp  <- -0.01 + 0.02 * cos(pi * i / m3)
  beta_bmi <- 0.05 - 0.04 * (i / m3)
  y <- beta_age * age + beta_bp * bp + beta_bmi * bmi + rnorm(n_i, sd = 0.3)
  data.frame(age = age, bp = bp, bmi = bmi, y = y)
})
centre_models3 <- lapply(centre_data3, function(d)
  stats::lm(y ~ age + bp + bmi, data = d))
F_hat3 <- f_hat_from_models(centre_models3, grid3)
dim(F_hat3)  # 8 x 25
```

The number of grid points `G_grid` is yours to choose. Larger grids give finer-resolution function estimates; smaller grids run faster. A few dozen to a few hundred is typical.

## Sanity checks

After building `F_hat`, run these quick checks before passing it into the rest of the pipeline:

```{r sanity}
# Right shape: m studies x G grid points.
dim(F_hat3)

# Numeric, no NA.
is.numeric(F_hat3)
anyNA(F_hat3)

# Per-study (min, max): rows should look like functions of similar
# magnitude (large outliers can dominate d-fSPA's `Delta`).
t(apply(F_hat3, 1, range))
```

A quick visual check on a 1-D grid is also worth the cost:

```{r sanity-plot}
matplot(grid_centres$x, t(F_hat_centres), type = "l", lty = 1, col = "grey50",
        xlab = "x", ylab = expression(hat(f)(x)),
        main = "Per-centre fitted functions on the shared grid")
```

If a single centre's curve is wildly far from the rest, that usually flags a bug (a mis-scaled covariate, an `lm` with the wrong response) rather than a real low-rank-violating outlier.
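The visual check can also be made numeric. A base-R sketch (no MetaHunt functions involved; the `F_demo` matrix and the 5-MAD threshold are illustrative choices) that flags rows whose L2 distance from the cross-centre mean function is extreme:

```r
# Hypothetical F_hat-like matrix with one deliberately mis-scaled centre (row 3),
# mimicking, e.g., an lm fit on an unscaled response.
set.seed(1)
F_demo <- matrix(rnorm(8 * 30, sd = 0.1), nrow = 8) +
  matrix(rep(sin(seq(-1, 1, length.out = 30)), each = 8), nrow = 8)
F_demo[3, ] <- F_demo[3, ] * 100

# L2 distance of each row from the cross-centre mean function.
mean_fn <- colMeans(F_demo)
d <- sqrt(rowSums(sweep(F_demo, 2, mean_fn)^2))

# Flag rows more than, say, 5 MADs above the median distance.
flagged <- which(d > median(d) + 5 * mad(d))
flagged  # should pick out row 3
```

Any flagged row is worth inspecting by eye (and against its training data) before concluding it is a genuine outlier centre.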
## Common pitfalls

- **Different columns across centres.** Every centre's model must accept rows of the same `grid` data frame. If centre A was fit on `(age, bp)` and centre B was fit on `(age, bmi)`, you cannot share a grid; harmonise the covariates first.
- **Categorical levels.** Factor levels in `grid` must match those the model saw at training. Pre-build levels explicitly with `factor(x, levels = ...)` if needed.
- **Wrong return shape from `predict_fn`.** Must be a length-`G_grid` numeric vector. `f_hat_from_models()` errors if the length is wrong or any value is `NA`.
- **Using a non-representative grid.** The grid implicitly defines the measure `mu`. A grid concentrated in one corner of covariate space will under-weight other regions. Either resample the grid more evenly, or use non-uniform `grid_weights` (see `vignette("grid-weights")`).
- **Confusing `m × p` study covariates `W` with the patient grid.** `W` lives at the *study* level (one row per centre, columns like region or year). The grid lives at the *patient* level (one row per profile, columns like age and BMI). They are different objects.
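The categorical-levels pitfall is cheap to guard against. A base-R sketch of pre-building a shared level set for both training data and grid (the `demo_train`/`demo_grid` data and the `sex` column are illustrative, not from any MetaHunt example):

```r
# Illustrative training data and grid sharing a categorical covariate `sex`.
demo_train <- data.frame(age = c(55, 62, 70), sex = c("F", "M", "F"))
demo_grid  <- data.frame(age = c(58, 66),     sex = c("M", "M"))

# Fix the level set once, and apply it to both data frames.
sex_levels <- c("F", "M")
demo_train$sex <- factor(demo_train$sex, levels = sex_levels)
demo_grid$sex  <- factor(demo_grid$sex,  levels = sex_levels)

# predict(model, newdata = demo_grid) now sees the same contrasts the model
# was trained with, even though "F" never appears in the grid itself.
identical(levels(demo_train$sex), levels(demo_grid$sex))  # TRUE
```

Without the explicit `levels = ...`, `factor()` would infer `"M"` as the only level in the grid, and model matrices built from it would no longer match the training contrasts.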