Methods for targeted and semiparametric inference rely on fitting
nuisance models to observed data when estimating the target parameter of
interest. The {targeted} package implements the R6 reference class learner
to harmonize common statistical and machine learning models for
use as nuisance models across the various implemented estimators.
Commonly used models are constructed as learner class
objects through the learner_ functions. These functions are
wrappers for statistical and machine learning models that have been
implemented in other R packages. This is conceptually similar to
packages such as {caret} and {mlr3}. Besides
implementing a large number of prediction models, these packages provide
extensive functionality for data preprocessing, model selection and
diagnostics, and hyperparameter optimization. The design of the
learner class and its accompanying functions is much leaner, as
our use-cases often only require a standardized interface for estimating
models and generating predictions, instead of the full predictive
modelling pipeline.
This vignette uses the Mayo Clinic Primary Biliary Cholangitis data
(?survival::pbc) to exemplify the usage of the
learner class objects. Our interest lies in the prediction
of the composite event of death or transplant before 2 years.
data(pbc, package="survival")
pbc <- transform(pbc, y = (time < 730) * (status > 0))
A logistic regression model with a single age covariate
is defined and estimated via
lr <- learner_glm(y ~ age, family = binomial())
lr$estimate(pbc)
and predictions for the event y = 1 (class 2
probabilities) for new data are obtained by
lr$predict(newdata = data.frame(age = c(20, 40, 60, 80)))
#> 1 2 3 4
#> 0.02578001 0.06868011 0.17047725 0.36415988
The remainder of this vignette is structured as follows. We first
provide more details about the built-in learners that wrap statistical
and machine learning models from other R packages. Once the usage of the
returned learner class objects has been introduced, we
explain how to implement new learners.
The various constructors for commonly used statistical and machine
learning models are listed as part of the learner class
documentation
?learner # help(learner)
which contains all essential information for using the
learner class objects. With the exception of
learner_sl, all constructors require a formula
argument that specifies the response vector and design matrix for the
underlying model. This way the construction of a learner is separated
from the estimation of the model, as shown in the above logistic
regression example. A result of this design is the convenient
specification of ensemble learners (superlearner), where individual
learners operate on different subsets of features.
lr_sl <- learner_sl(
  learners = list(
    glm1 = learner_glm(y ~ age * bili, family = "binomial"),
    glm2 = learner_glm(y ~ age, family = "binomial"),
    gam = learner_gam(y ~ s(age) + s(bili), family = "binomial")
  )
)
lr_sl$estimate(pbc, nfolds = 10)
lr_sl
#> ────────── learner object ──────────
#> superlearner
#> glm1
#> glm2
#> gam
#>
#> Estimate arguments: learners=<list>, nfolds=5, meta.learner=<function>, model.score=<function>
#> Predict arguments:
#> Formula: y ~ age * bili
#> ─────────────────────────────────────
#> score weight
#> glm1 0.09742563 0.31631808
#> glm2 0.10687034 0.01128399
#> gam 0.09500222 0.67239793
Most constructors have additional arguments that impact the resulting
model fit, ranging from the specification of a link function for
generalized linear models to hyperparameters of machine learning
models. Arguments naturally vary between the constructors and are
documented for each function (e.g. ?learner_gam). The
implemented constructors only provide arguments for the most relevant
parameters of the underlying fit function (mgcv::gam in
the case of learner_gam). As described in the documentation,
the ellipsis argument (...) can be used to pass additional
arguments to the fit function, although this is rarely required.
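For instance, an extra fitting option can be forwarded through the
ellipsis as in the following minimal sketch, where we assume the
control argument is simply passed on unchanged to the underlying
stats::glm call:
lr_ctrl <- learner_glm(y ~ age, family = "binomial", control = glm.control(maxit = 50))
lr_ctrl$estimate(pbc)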
It is often necessary to consider a range of models with different
hyperparameters; to facilitate this we can use the
learner_expand_grid function
lrs <- learner_expand_grid(
  learner_xgboost,
  list(formula = Sepal.Length ~ ., eta = c(0.2, 0.5, 0.3))
)
lrs
#> $`xgboost reg:squarederror`
#> ────────── learner object ──────────
#> xgboost reg:squarederror
#>
#> Estimate arguments: max_depth=2, eta=0.2, nrounds=2, subsample=1, lambda=1, verbose=0, objective=reg:squarederror
#> Predict arguments:
#> Formula: Sepal.Length ~ .
#>
#> $`xgboost reg:squarederror.1`
#> ────────── learner object ──────────
#> xgboost reg:squarederror
#>
#> Estimate arguments: max_depth=2, eta=0.5, nrounds=2, subsample=1, lambda=1, verbose=0, objective=reg:squarederror
#> Predict arguments:
#> Formula: Sepal.Length ~ .
#>
#> $`xgboost reg:squarederror.2`
#> ────────── learner object ──────────
#> xgboost reg:squarederror
#>
#> Estimate arguments: max_depth=2, eta=0.3, nrounds=2, subsample=1, lambda=1, verbose=0, objective=reg:squarederror
#> Predict arguments:
#> Formula: Sepal.Length ~ .
The list of models can then serve as input for
learner_sl or for assessing the generalization error of
the models across different hyperparameters with the cv
function, as sketched below.
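A brief sketch of the latter: the learners above model
Sepal.Length, so the iris data is a natural choice for comparing
the grid with cross-validation.
cv(lrs, data = iris, nfolds = 5)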
The basic usage is to construct a new learner by providing the
formula argument and the set of additional parameters that
control the model fitting process and the task (binary classification in
this example).
lr_xgboost <- learner_xgboost(
  formula = y ~ age + sex + bili,
  eta = 0.3, nrounds = 5, # hyperparameters
  objective = "binary:logistic" # learning task
)
lr_xgboost
#> ────────── learner object ──────────
#> xgboost binary:logistic
#>
#> Estimate arguments: max_depth=2, eta=0.3, nrounds=5, subsample=1, lambda=1, verbose=0, objective=binary:logistic
#> Predict arguments:
#> Formula: y ~ age + sex + bili
The model is estimated via the estimate method
lr_xgboost$estimate(data = pbc)
The default behavior is to assign the fitted model to the learner object, which can be accessed via
class(lr_xgboost$fit)
#> [1] "xgb.Booster"
Once the model has been fitted, predictions are generated with
lr_xgboost$predict(head(pbc))
#> [1] 0.5164376 0.1311248 0.3392217 0.1311248 0.1655048 0.1311248
where the fitted model is used inside the learner object to generate the predictions.
S3 methods are implemented for the learner class objects
as an alternative to the R6 class syntax for fitting the model and
making predictions.
lr <- learner_glm(y ~ age, family = "binomial")
estimate(lr, pbc)
predict(lr, head(pbc))
#> 1 2 3 4 5 6
#> 0.16171479 0.14624529 0.25614862 0.13566571 0.06272415 0.22071201
The estimate and predict functions of learner class
objects are by design immutable. Both functions can be inspected as part
of the return values of the summary method
lr_xgboost$summary()$estimate
#> function (x, y, ...)
#> {
#> d <- xgboost::xgb.DMatrix(x, label = y)
#> res <- do.call(xgboost::xgboost, c(list(data = d), list(...)))
#> return(res)
#> }
#> <bytecode: 0x9c00e3508>
#> <environment: 0x9c0804938>
Rare situations may arise where one wants to update the formula
argument. This is supported via the update
method
lr_xgboost$update(y ~ age + sex)
The design matrix that results from the specified formula can be inspected via
head(lr_xgboost$design(pbc)$x)
#> age sexf
#> 1 58.76523 1
#> 2 56.44627 1
#> 3 70.07255 0
#> 4 54.74059 1
#> 5 38.10541 1
#> 6 66.25873 1
and the response vector via
head(lr_xgboost$response(pbc))
#> 1 2 3 4 5 6
#> 1 0 0 0 0 0
See ?learner for the differences between
learner$design and learner$response and
?design for more details about the construction of the
design matrix and response vector.
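For a standalone look at this machinery, a design object can also be
constructed directly from a formula and data (a minimal sketch,
assuming the formula-and-data interface described in ?design):
d <- targeted::design(y ~ age + sex, data = pbc)
head(d$x)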
The {targeted} package provides a generic implementation
for repeated k-fold cross-validation with learner class
objects. Parallelization is supported via the {future} and
{parallel} packages (see ?cv for more
details).
# future::plan("multicore")
lrs <- list(
  glm = learner_glm(y ~ age, family = "binomial"),
  gam = learner_gam(y ~ s(age) + s(bili), family = "binomial")
)
# 2 times repeated 5-fold cross-validation
cv(lrs, data = pbc, rep = 2, nfolds = 5)
#> Call: cv.default(object = lrs, data = pbc, nfolds = 5, rep = 2)
#>
#> 5-fold cross-validation with 2 repetitions
#>
#> mse mae
#> glm 0.10736745 0.2127521
#> gam 0.09894063 0.1888139
Statistical or machine learning models for which no constructors are provided can be implemented with a few lines of code. In what follows we cover three general cases where the input to the fit function differs.
The first general case covers fit functions that expect a formula
and a data argument. Both arguments are then used by the fitting routine
to construct the design matrix and response vector. Statistical models
are usually implemented with such an interface, with examples including
stats::glm, mgcv::gam and
earth::earth. The learner R6 class supports
these fitting routines by checking whether the provided estimate function has
a formula and a data argument. If it does,
the formula and data arguments are passed on to
the estimate function without further modification. This is exemplified
in the following for stats::glm, where we define a new
constructor that returns a learner class object fitting a
generalized linear model
new_glm <- function(formula, ...) {
  learner$new(
    formula = formula,
    estimate = stats::glm,
    predict = stats::predict,
    predict.args = list(type = "response"),
    estimate.args = list(...),
    info = "new glm learner" # optional
  )
}
lr <- new_glm(y ~ age, family = "binomial")
lr
#> ────────── learner object ──────────
#> new glm learner
#>
#> Estimate arguments: family=binomial
#> Predict arguments: type=response
#> Formula: y ~ age
It can be seen that the optional arguments of new_glm
define the estimate.args of a constructed learner object.
When estimating the model with
lr$estimate(pbc)
the parameters defined in estimate.args are passed on
with the formula and data to
stats::glm (i.e. the function specified via the
estimate argument). Thus, the above is equivalent to
fit <- glm(y ~ age, family = "binomial", data = pbc)
all(coef(fit) == coef(lr$fit))
#> [1] TRUE
The code above further constructs a learner object that uses
stats::predict to make predictions. The learner object
similarly passes on the predict.args to
stats::predict when generating predictions with
lr$predict(head(pbc))
#> 1 2 3 4 5 6
#> 0.16171479 0.14624529 0.25614862 0.13566571 0.06272415 0.22071201
which is equivalent to
predict(fit, newdata = head(pbc), type = "response")
#> 1 2 3 4 5 6
#> 0.16171479 0.14624529 0.25614862 0.13566571 0.06272415 0.22071201
Indeed, the documentation of ?learner reveals that the
predict method always requires an object and a
newdata argument. An important implementation detail is
that the learner R6 class allows the defined
estimate.args and predict.args to be overruled in the method
calls.
lr$estimate(pbc, family = "poisson")
lr$predict(head(pbc), type = "link")
#> 1 2 3 4 5 6
#> -1.837279 -1.939001 -1.341276 -2.013822 -2.743535 -1.508572
Most fitting routines for machine learning models expect a design
matrix and response vector as inputs. The R6 learner class
supports these fitting functions by internally processing the
formula argument via the targeted::design
function. Take the example of grf::probability_forest,
which expects X (design matrix) and Y
(response vector) arguments as inputs.
new_grf <- function(formula, ...) {
  learner$new(
    formula = formula,
    estimate = function(x, y, ...) grf::probability_forest(X = x, Y = y, ...),
    predict = function(object, newdata) {
      predict(object, newdata = newdata)$predictions
    },
    estimate.args = list(...),
    info = "grf::probability_forest"
  )
}
lr <- new_grf(as.factor(y) ~ age + bili, num.trees = 100)
lr$estimate(pbc)
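Predictions of the class probabilities are then obtained as before (a
brief sketch):
lr$predict(head(pbc))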
Compared to the previous case, the pbc data object is
not directly passed on to the fit function.
Instead, targeted::design constructs the design matrix and
response vector from the defined formula argument. As shown
previously,
dsgn <- lr$design(pbc)
can be used to inspect the design object. The x and
y attributes of the returned design object are
then passed on to the fit function.
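These attributes can be inspected directly, as in the following brief
sketch:
head(dsgn$x) # design matrix passed to the fit function
head(dsgn$y) # response vector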
To support ensemble/meta learners, it is also possible to construct a learner without providing a formula argument.
new_sl <- function(learners, ...) {
  learner$new(
    info = "new superlearner",
    estimate = superlearner,
    predict = targeted:::predict.superlearner,
    estimate.args = c(list(learners = learners), list(...))
  )
}
lrs <- list(
  glm = learner_glm(y ~ age, family = "binomial"),
  gam = learner_gam(y ~ s(age), family = "binomial")
)
lr <- new_sl(lrs, nfolds = 2)
lr$estimate(pbc)
lr
#> ────────── learner object ──────────
#> new superlearner
#>
#> Estimate arguments: learners=<list>, nfolds=2
#> Predict arguments:
#> Formula: NULL
#> ─────────────────────────────────────
#> score weight
#> glm 0.1081636 0.6177615
#> gam 0.1082369 0.3822385
In this case, all arguments provided to lr$estimate are
joined together with the specified estimate.args and passed
on to the defined estimate function (i.e.
superlearner).
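As with the other learners, arguments supplied at estimation time take
precedence, so the number of folds chosen in the constructor could for
instance be overruled (a brief sketch):
lr$estimate(pbc, nfolds = 10)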
Certain learner functions allow for specifying special terms in the formula. For example, let’s consider an aggregated dataset that includes only the response variable, treatment, and sex:
library("data.table")
dd <- data.table(pbc)[!is.na(trt), .(.N), by = .(y, trt, sex)]
print(dd)
#> y trt sex N
#> <int> <int> <fctr> <int>
#> 1: 1 1 f 13
#> 2: 0 1 f 124
#> 3: 0 1 m 19
#> 4: 0 2 f 123
#> 5: 1 2 f 16
#> 6: 0 2 m 12
#> 7: 1 2 m 3
#> 8: 1 1 m 2
Next, we fit a Naive Bayes classifier using the weights
special term to specify the frequency weights for the estimation
lr <- learner_naivebayes(y ~ trt + sex + weights(N))
lr$estimate(dd)
lr$predict(dd)
#> [1] 0.09138473 0.09138473 0.12139346 0.11913520 0.11913520 0.15668515 0.15668515
#> [8] 0.12139346
Here weights.numeric is simply the identity function
targeted:::weights.numeric
#> function (object, ...)
#> object
#> <bytecode: 0x9c1e5d380>
#> <environment: namespace:targeted>
To illustrate how to define a custom learner that utilizes special
terms, consider the following example with a
strata term. We define an estimation
method that fits a linear regression model for each value of the strata
variable, as well as a corresponding prediction method
# fit a separate linear regression model within each stratum
est <- function(formula, data, strata, ...)
  lapply(levels(strata), \(s) lm(formula, data[which(strata == s), ]))
# predict each observation from its stratum-specific model
pred <- function(object, newdata, strata, ...) {
  res <- numeric(length(strata))
  for (i in seq_along(levels(strata))) {
    idx <- which(strata == levels(strata)[i])
    res[idx] <- predict(object[[i]], newdata[idx, ], ...)
  }
  return(res)
}
The new learner is now defined by including the argument
specials = "strata", which ensures that the strata variable is
correctly passed to both the estimate and predict functions
lr <- learner$new(y ~ sex + strata(trt),
  estimate = est, predict = pred, specials = "strata")
des <- lr$design(head(pbc))
des
#> ────────── design object ──────────
#>
#> response (length: 6)
#> 1 1
#> 2 0
#> ---
#> 5 0
#> 6 0
#>
#> specials
#> - strata [factor]
#>
#> design matrix (dim: 6, 1)
#> sexf
#> 1 1
#> 2 1
#> ---
#> 5 1
#> 6 1
des$strata
#> 1 2 3 4 5 6
#> trt=1 trt=1 trt=1 trt=1 trt=2 trt=2
#> Levels: trt=1 trt=2
lr$estimate(pbc)
lr
#> ────────── learner object ──────────
#> Estimate arguments:
#> Predict arguments:
#> Formula: y ~ sex + strata(trt)
#> ─────────────────────────────────────
#> [[1]]
#>
#> Call:
#> lm(formula = formula, data = data[which(strata == s), ])
#>
#> Coefficients:
#> (Intercept) sexf
#> 0.0952381 -0.0003476
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = formula, data = data[which(strata == s), ])
#>
#> Coefficients:
#> (Intercept) sexf
#> 0.20000 -0.08489
lr$predict(head(pbc))
#> [1] 0.09489051 0.09489051 0.09523810 0.09489051 0.11510791 0.11510791