Methods for targeted and semiparametric inference rely on fitting nuisance models to observed data when estimating the target parameter of interest. The {targeted} package implements the R6 reference class learner to harmonize common statistical and machine learning models for use as nuisance models across the various implemented estimators. Commonly used models are constructed as learner class objects through the learner_ functions. These functions are wrappers for statistical and machine learning models implemented in other R packages. This is conceptually similar to packages such as {caret} and {mlr3}. Besides implementing a large number of prediction models, these packages provide extensive functionality for data preprocessing, model selection and diagnostics, and hyperparameter optimization. The design of the learner class and its accompanying functions is much leaner, as our use cases often only require a standardized interface for estimating models and generating predictions rather than a full predictive modelling pipeline.
This vignette uses the Mayo Clinic Primary Biliary Cholangitis data (?survival::pbc) to exemplify the usage of learner class objects. Our interest lies in predicting the composite event of death or transplant within 2 years.
data(pbc, package="survival")
pbc <- transform(pbc, y = (time < 730) * (status > 0))
A logistic regression model with a single age covariate is defined and estimated via
lr <- learner_glm(y ~ age, family = binomial())
lr$estimate(pbc)
and predictions for the event y = 1 (class 2 probabilities) for new data are obtained by
lr$predict(newdata = data.frame(age = c(20, 40, 60, 80)))
         1          2          3          4 
0.02578001 0.06868011 0.17047725 0.36415988
The remainder of this vignette is structured as follows. We first provide more details about the built-in learners that wrap statistical and machine learning models from other R packages. Once the usage of the returned learner class objects has been introduced, we provide details on how to implement new learners.
The various constructors for commonly used statistical and machine learning models are listed as part of the learner class documentation
?learner # help(learner)
which contains all essential information for the usage of the learner class objects. With the exception of learner_sl, all constructors require a formula argument that specifies the response vector and design matrix for the underlying model. This way, the construction of a learner is separated from the estimation of the model, as shown in the above logistic regression example. A result of this design is the convenient specification of ensemble learners (superlearner), where individual learners operate on different subsets of features.
lr_sl <- learner_sl(
  learners = list(
    glm1 = learner_glm(y ~ age * bili, family = "binomial"),
    glm2 = learner_glm(y ~ age, family = "binomial"),
    gam = learner_gam(y ~ s(age) + s(bili), family = "binomial")
  )
)
lr_sl$estimate(pbc, nfolds = 10)
lr_sl
────────── learner object ──────────
superlearner
glm1
glm2
gam
Estimate arguments: learners=<list>, nfolds=5, meta.learner=<function>, model.score=<function>
Predict arguments:
Formula: y ~ age * bili
─────────────────────────────────────
score weight
glm1 0.09904905 0.12160103
glm2 0.10740116 0.08248735
gam 0.09654980 0.79591162
Most constructors have additional arguments that impact the resulting model fit, ranging from the specification of a link function for generalized linear models to hyperparameters of machine learning models. Arguments naturally vary between the constructors and are documented for each function (e.g. ?learner_gam). The implemented constructors only provide arguments for the most relevant parameters of the underlying fit function (mgcv::gam in case of learner_gam). As described in the documentation, the ellipsis argument (...) can be used to pass additional arguments to the fit function. However, this is rarely required in practice.
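As a minimal sketch of the latter, assuming the forwarded argument is accepted by the underlying fit function (stats::glm in the case of learner_glm):
# sketch: `control` is not an explicit learner_glm argument and is
# forwarded via `...` to stats::glm()
lr <- learner_glm(y ~ age, family = "binomial",
  control = glm.control(maxit = 50))
lr$estimate(pbc)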
It is often necessary to consider a range of models with different hyperparameters, and to facilitate this we can use the learner_expand_grid function
lrs <- learner_expand_grid(
  learner_xgboost,
  list(formula = Sepal.Length ~ ., eta = c(0.2, 0.5, 0.3))
)
lrs
$`xgboost reg:squarederror`
────────── learner object ──────────
xgboost reg:squarederror
Estimate arguments: max_depth=2, learning_rate=1, nrounds=2, subsample=1, reg_lambda=1, objective=reg:squarederror, eta=0.2
Predict arguments:
Formula: Sepal.Length ~ .
$`xgboost reg:squarederror.1`
────────── learner object ──────────
xgboost reg:squarederror
Estimate arguments: max_depth=2, learning_rate=1, nrounds=2, subsample=1, reg_lambda=1, objective=reg:squarederror, eta=0.5
Predict arguments:
Formula: Sepal.Length ~ .
$`xgboost reg:squarederror.2`
────────── learner object ──────────
xgboost reg:squarederror
Estimate arguments: max_depth=2, learning_rate=1, nrounds=2, subsample=1, reg_lambda=1, objective=reg:squarederror, eta=0.3
Predict arguments:
Formula: Sepal.Length ~ .
The list of models can then serve as input for learner_sl or for assessing the generalization error of the model across different hyperparameters with the cv function.
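As a sketch of the latter use case, where the iris data matches the Sepal.Length ~ . formula of the grid above:
# sketch: cross-validate every learner in the hyperparameter grid
cv(lrs, data = iris, nfolds = 5)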
The basic usage is to construct a new learner by providing the formula argument and the set of additional parameters that control the model fitting process and the task (binary classification in this example).
lr_xgboost <- learner_xgboost(
  formula = y ~ age + sex + bili,
  eta = 0.3, nrounds = 5, # hyperparameters
  objective = "binary:logistic" # learning task
)
lr_xgboost
────────── learner object ──────────
xgboost binary:logistic
Estimate arguments: max_depth=2, learning_rate=1, nrounds=5, subsample=1, reg_lambda=1, objective=binary:logistic, eta=0.3
Predict arguments:
Formula: y ~ age + sex + bili
The model is estimated via the estimate method
lr_xgboost$estimate(data = pbc)
The default behavior is to assign the fitted model to the learner object, which can be accessed via
class(lr_xgboost$fit)
[1] "xgb.Booster"
Once the model has been fitted, predictions are generated with
lr_xgboost$predict(head(pbc))
[1] 0.66811311 0.10537987 0.32460621 0.10537987 0.11731242 0.03387715
where the fitted model is used inside the learner object to generate the predictions.
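The stored fit can also be used directly; a minimal sketch, assuming xgboost's predict method accepts a plain numeric matrix with the same columns as used during estimation:
# sketch: predict with the stored booster on the constructed design matrix
predict(lr_xgboost$fit, lr_xgboost$design(head(pbc))$x)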
S3 methods are implemented for the learner class objects as an alternative to the R6 class syntax for fitting the model and making predictions.
lr <- learner_glm(y ~ age, family = "binomial")
estimate(lr, pbc)
predict(lr, head(pbc))
         1          2          3          4          5          6 
0.16171479 0.14624529 0.25614862 0.13566571 0.06272415 0.22071201
The estimate and predict functions of learner class objects are by design immutable. Both functions can be inspected as part of the return value of the summary method
lr_xgboost$summary()$estimate
function (x, y, ...)
{
params <- list(...)
par1 <- intersect(formalArgs(xgboost::xgb.params), names(params))
xgb_params <- params[par1]
xgb_train_args <- params
xgb_train_args[par1] <- NULL
d <- xgboost::xgb.DMatrix(x, label = y)
res <- do.call(xgboost::xgb.train, c(list(data = d), xgb_train_args,
list(params = xgb_params)), )
return(res)
}
<bytecode: 0x9164f8708>
<environment: 0x913b66708>
Rare situations may arise where one wants to update the formula argument. This is supported and implemented via the update method
lr_xgboost$update(y ~ age + sex)
The design matrix that results from the specified formula can be inspected via
head(lr_xgboost$design(pbc)$x)
       age sexf
1 58.76523 1
2 56.44627 1
3 70.07255 0
4 54.74059 1
5 38.10541 1
6 66.25873 1
and the response vector via
head(lr_xgboost$response(pbc))
1 2 3 4 5 6 
1 0 0 0 0 0
See ?learner for the differences between learner$design and learner$response and ?design for more details about the construction of the design matrix and response vector.
The {targeted} package provides a generic implementation for repeated k-fold cross-validation with learner class objects. Parallelization is supported via the {future} and {parallel} packages (see ?cv for more details).
# future::plan("multicore")
lrs <- list(
  glm = learner_glm(y ~ age, family = "binomial"),
  gam = learner_gam(y ~ s(age) + s(bili), family = "binomial")
)
# 2 times repeated 5-fold cross-validation
cv(lrs, data = pbc, rep = 2, nfolds = 5)
Call: cv.default(object = lrs, data = pbc, nfolds = 5, rep = 2)
5-fold cross-validation with 2 repetitions
mse mae
glm 0.1072236 0.2128743
gam 0.1010563 0.1885859
Statistical or machine learning models for which no constructors are provided can be implemented with a few lines of code. In what follows we cover three general cases where the input to the fit function differs.
The first general case covers fit functions that expect a formula and a data argument. Both arguments are then used by the fitting routine to construct the design matrix and response vector. Statistical models are usually implemented with such an interface, with examples including stats::glm, mgcv::gam and earth::earth. The learner R6 class supports these fitting routines by checking if the provided estimate function has a formula and a data argument. If it does, then the formula and data arguments are passed on to the estimate function without further modification. This is exemplified in the following for stats::glm, where we define a new constructor that returns a learner class object fitting a generalized linear model
new_glm <- function(formula, ...) {
  learner$new(
    formula = formula,
    estimate = stats::glm,
    predict = stats::predict,
    predict.args = list(type = "response"),
    estimate.args = list(...),
    info = "new glm learner" # optional
  )
}
lr <- new_glm(y ~ age, family = "binomial")
lr
────────── learner object ──────────
new glm learner
Estimate arguments: family=binomial
Predict arguments: type=response
Formula: y ~ age
It can be seen that the optional arguments of new_glm define the estimate.args of a constructed learner object. When estimating the model with
lr$estimate(pbc)
the parameters defined in estimate.args are passed on with the formula and data to stats::glm (i.e. the function specified via the estimate argument). Thus, the above is equivalent to
fit <- glm(y ~ age, family = "binomial", data = pbc)
all(coef(fit) == coef(lr$fit))
[1] TRUE
The code further specifies that the constructed learner object uses stats::predict to make predictions. The learner object similarly passes on the predict.args to stats::predict for
lr$predict(head(pbc))
         1          2          3          4          5          6 
0.16171479 0.14624529 0.25614862 0.13566571 0.06272415 0.22071201
which is equivalent to
predict(fit, newdata = head(pbc), type = "response")
         1          2          3          4          5          6 
0.16171479 0.14624529 0.25614862 0.13566571 0.06272415 0.22071201
Indeed, the documentation of ?learner reveals that the predict method always requires an object and a newdata argument. An important implementation detail is that the learner R6 class allows the defined estimate.args and predict.args to be overruled in method calls.
lr$estimate(pbc, family = "poisson")
lr$predict(head(pbc), type = "link")
        1         2         3         4         5         6 
-1.837279 -1.939001 -1.341276 -2.013822 -2.743535 -1.508572
Most fitting routines for machine learning models expect a design matrix and response vector as inputs. The R6 learner class supports these fitting functions by internally processing the formula argument via the targeted::design function. Take the example of grf::probability_forest, which expects X (design matrix) and Y (response vector) as inputs.
new_grf <- function(formula, ...) {
  learner$new(
    formula = formula,
    estimate = function(x, y, ...) grf::probability_forest(X = x, Y = y, ...),
    predict = function(object, newdata) {
      predict(object, newdata = newdata)$predictions
    },
    estimate.args = list(...),
    info = "grf::probability_forest"
  )
}
lr <- new_grf(as.factor(y) ~ age + bili, num.trees = 100)
lr$estimate(pbc)
Compared to the previous case, the pbc data object is not directly passed on to the fit function. Instead, targeted::design constructs the design matrix and response vector from the defined formula argument. As shown previously,
dsgn <- lr$design(pbc)
can be used to inspect the design object. The x and y attributes of the returned design object are then passed on to the fit function.
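For instance, assuming the x and y entries of the design object are accessed in the same way as in the lr_xgboost$design(pbc)$x example above:
head(dsgn$x) # design matrix passed as X to grf::probability_forest
head(dsgn$y) # response vector passed as Y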
To support ensemble/meta learners, it is also possible to construct a learner without providing a formula argument.
new_sl <- function(learners, ...) {
  learner$new(
    info = "new superlearner",
    estimate = superlearner,
    predict = targeted:::predict.superlearner,
    estimate.args = c(list(learners = learners), list(...))
  )
}
lrs <- list(
  glm = learner_glm(y ~ age, family = "binomial"),
  gam = learner_gam(y ~ s(age), family = "binomial")
)
lr <- new_sl(lrs, nfolds = 2)
lr$estimate(pbc)
lr
────────── learner object ──────────
new superlearner
Estimate arguments: learners=<list>, nfolds=2
Predict arguments:
Formula: NULL
─────────────────────────────────────
score weight
glm 0.1067067 0.6825316
gam 0.1067525 0.3174684
In this case, all arguments provided to lr$estimate are joined together with the specified estimate.args and passed on to the defined estimate function (i.e. superlearner).
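A sketch of this behavior, where an argument supplied at estimation time overrules the corresponding entry of estimate.args (here the nfolds = 2 from the constructor call):
# sketch: `nfolds` given here takes precedence over the stored estimate.args
lr$estimate(pbc, nfolds = 5)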
Certain learner functions allow for specifying special terms in the formula. For example, let’s consider an aggregated dataset that includes only the response variable, treatment, and sex:
library("data.table")
dd <- data.table(pbc)[!is.na(trt), .(.N), by=.(y,trt,sex)]
print(dd)
       y   trt    sex     N
<int> <int> <fctr> <int>
1: 1 1 f 13
2: 0 1 f 124
3: 0 1 m 19
4: 0 2 f 123
5: 1 2 f 16
6: 0 2 m 12
7: 1 2 m 3
8: 1 1 m 2
Next, we fit a Naive Bayes classifier using the weights special term to specify the frequency weights for the estimation
lr <- learner_naivebayes(y ~ trt + sex + weights(N))
lr$estimate(dd)
lr$predict(dd)
[1] 0.09138473 0.09138473 0.12139346 0.11913520 0.11913520 0.15668515 0.15668515
[8] 0.12139346
Here weights.numeric is simply the identity function
targeted:::weights.numeric
function (object, ...) 
object
<bytecode: 0x915b0f3b8>
<environment: namespace:targeted>
To illustrate how to define a custom learner that utilizes special terms, consider the following example with a strata term. Here we define an estimation method that fits a separate linear regression model for each level of the strata variable, as well as a corresponding prediction method
est <- function(formula, data, strata, ...)
  lapply(levels(strata), \(s) lm(formula, data[which(strata == s), ]))
pred <- function(object, newdata, strata, ...) {
  res <- numeric(length(strata))
  for (i in seq_along(levels(strata))) {
    idx <- which(strata == levels(strata)[i])
    res[idx] <- predict(object[[i]], newdata[idx, ], ...)
  }
  return(res)
}
The new learner is now defined by including the argument specials = "strata", which ensures that the strata variable is correctly passed to both the estimate and predict functions
lr <- learner$new(y ~ sex + strata(trt),
  estimate = est, predict = pred, specials = "strata")
des <- lr$design(head(pbc))
des
────────── design object ──────────
response (length: 6)
1 1
2 0
---
5 0
6 0
specials
- strata [factor]
design matrix (dim: 6, 1)
sexf
1 1
2 1
---
5 1
6 1
des$strata
    1     2     3     4     5     6 
trt=1 trt=1 trt=1 trt=1 trt=2 trt=2
Levels: trt=1 trt=2
lr$estimate(pbc)
lr
────────── learner object ──────────
Estimate arguments:
Predict arguments:
Formula: y ~ sex + strata(trt)
─────────────────────────────────────
[[1]]
Call:
lm(formula = formula, data = data[which(strata == s), ])
Coefficients:
(Intercept) sexf
0.0952381 -0.0003476
[[2]]
Call:
lm(formula = formula, data = data[which(strata == s), ])
Coefficients:
(Intercept) sexf
0.20000 -0.08489
lr$predict(head(pbc))
[1] 0.09489051 0.09489051 0.09523810 0.09489051 0.11510791 0.11510791