--- title: "Estimating a relative risk or risk difference with a binary exposure" author: Klaus Kähler Holst date: "`r Sys.Date()`" format: html: toc: true html-math-method: mathjax vignette: > %\VignetteIndexEntry{Estimating a relative risk or risk difference with a binary exposure} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} execute: echo: true warning: false collapse: true comment: "#>" --- ```{r, include = FALSE} library(targeted) ``` # Introduction Let $Y$ be a *binary response*, $A$ a *binary exposure*, and $V$ a *vector of covariates*. ![DAG for the statistical model with the dashed edge representing a potential interaction between exposure $A$ and covariates $V$.](dag1.svg) In a common setting, the main interest lies in quantifying the treatment effect, $\nu$, of $A$ on $Y$ adjusting for the set of covariates, and often a standard approach is to use a Generalized Linear Model (GLM): $$g\{ E(Y\mid A,V) \} = A\nu^tW + \underset{\mathrm{nuisance}}{\mu^tZ}$$ with link function $g$, and $W = w(V)$, $Z= v(V)$ known vector functions of $V$. The canonical link (logit) leads to nice computational properties (logistic regression) and parameters with an odds-ratio interpretation. But ORs are not *collapsible* even under randomization. For example $$ E(Y\mid X) = E[ E(Y\mid X,Z) \mid X ] = E[\operatorname{expit}( \mu + \alpha X + \beta Z ) \mid X] \neq \operatorname{expit}[\mu + \alpha X + \beta E(Z\mid X)], $$ When marginalizing we leave the class of logistic regression. This non-collapsibility makes it hard to interpret odds-ratios and to compare results from *different studies* **Relative risks** (and risk differences) are collapsible and generally considered easier to interpret than odds-ratios. [Richardson et al](https://doi.org/10.1080/01621459.2016.1192546) (JASA, 2017) proposed a regression model for a binary exposures which solves the computational problems and need for parameter contraints that are associated with using for example binomial regression with a log-link function (or identify link for the risk difference) to obtain such parameter estimates. In the following we consider the relative risk as the **target parameter** $$ \mathrm{RR}(v) = \frac{P(Y=1\mid A=1, V=v)}{P(Y=1\mid A=0, V=v)}. $$ Let $p_a(V) = P(Y \mid A=a, V), a\in\{0,1\}$, the idea is then to posit a linear model for $$ \theta(v) = \log \big(RR(v)\big) $$, i.e., $$\log \big(RR(v)\big) = \alpha^Tv,$$ and a **nuisance model** for the *odds-product* $$ \phi(v) = \log\left(\frac{p_{0}(v)p_{1}(v)}{(1-p_{0}(v))(1-p_{1}(v))}\right) $$ noting that these two parameters are *variation independent* as illustrated by the below L'Abbé plot. ```{r} #| fig.width=5, fig.height=5 p0 <- seq(0,1,length.out=100) p1 <- function(p0,op) 1/(1+(op*(1-p0)/p0)^-1) plot(0, type="n", xlim=c(0,1), ylim=c(0,1), xlab="P(Y=1|A=0)", ylab="P(Y=1|A=1)", main="Constant odds product") for (op in exp(seq(-6,6,by=.25))) lines(p0,p1(p0,op), col="lightblue") ``` ```{r} #| fig.width=5, fig.height=5, #| fig.cap=' ' p0 <- seq(0,1,length.out=100) p1 <- function(p0,rr) rr*p0 plot(0, type="n", xlim=c(0,1), ylim=c(0,1), xlab="P(Y=1|A=0)", ylab="P(Y=1|A=1)", main="Constant relative risk") for (rr in exp(seq(-3,3,by=.25))) lines(p0,p1(p0,rr), col="lightblue") ``` Similarly, a model can be constructed for the risk-difference on the following scale $$\theta(v) = \operatorname{arctanh} \big(RD(v)\big).$$ # Simulation First the `targeted` package is loaded ```{r setup} library(targeted) ``` This automatically imports [lava](https://arxiv.org/abs/1206.3421) [(CRAN)](https://CRAN.R-project.org/package=lava) which we can use to simulate from the Relative-Risk Odds-Product (RR-OP) model. ```{r} m <- lava::lvm(a ~ x, lp.target ~ 1, lp.nuisance ~ x+z) m <- lava::binomial.rr(m, response="y", exposure="a", target.model="lp.target", nuisance.model="lp.nuisance") ``` The `lvm` call first defines the linear predictor for the exposure to be of the form $$\mathrm{LP}_A := \mu_A + \alpha X$$ and the linear predictors for the /target parameter/ (relative risk) and the /nuisance parameter/ (odds product) to be of the form $$\mathrm{LP}_{RR} := \mu_{RR},$$ $$\mathrm{LP}_{OP} := \mu_{OP} + \beta_x X + \beta_z Z.$$ The covariates are by default assumed to be independent and standard normal $X, Z\sim\mathcal{N}(0,1)$, but their distribution can easily be altered using the `lava::distribution` method. The `binomial.rr` function ```{r} args(lava::binomial.rr) ``` then defines the link functions, i.e., $$\operatorname{logit}(E[A\mid X,Z]) = \mu_A + \alpha X,$$ $$\operatorname{log}(E[Y\mid X,Z, A=1]/E[Y\mid X, A=0]) = \mu_{RR},$$ $$\operatorname{log}\{p_1(X,Z)p_0(X,Z)/[(1-p_1(X,Z))(1-p_0(X,Z))]\} = \mu_{OP}+\beta_x X + \beta_z Z$$ with $p_a(X,Z)=E(Y\mid A=a,X,Z)$. The risk-difference model with the RD parameter modeled on the $\operatorname{arctanh}$ scale can be defined similarly using the `binomial.rd` method ```{r} args(lava::binomial.rd) ``` We can inspect the parameter names of the modeled ```{r} coef(m) ``` Here the intercepts of the model are simply given the same name as the variables, such that $\mu_A$ becomes `a`, and the other regression coefficients are labeled using the "~"-formula notation, e.g., $\alpha$ becomes `a~x`. Intercepts are by default set to zero and regression parameters set to one in the simulation. Hence to simulate from the model with $(mu_A, \mu_{RR}, \mu_{OP}, \alpha, \beta_x, \beta_z)^T = (-1,1,-2,2,1,1)^T$, we define the parameter vector ``p`` given by ```{r} p <- c('a'=-1, 'lp.target'=1, 'lp.nuisance'=-1, 'a~x'=2) ``` and then simulate from the model using the ``sim`` method ```{r} d <- lava::sim(m, 1e4, p=p, seed=1) head(d) ``` Notice, that in this simulated data the **target parameter** $\mu_{RR}$ has been set to `lp.target =` `r p['lp.target']`. # Estimation ## MLE We start by fitting the model using the maximum likelihood estimator. ```{r} args(riskreg_mle) ``` The ``riskreg_mle`` function takes vectors/matrices as input arguments with the response ``y``, exposure ``a``, target parameter design matrix ``x1`` (i.e., the matrix $W$ at the start of this text), and the nuisance model design matrix ``x2`` (odds product). We first consider the case of a correctly specified model, hence we do not consider any interactions with the exposure for the odds product and simply let ``x1`` be a vector of ones, whereas the design matrix for the log-odds-product depends on both $X$ and $Z$ ```{r} x1 <- model.matrix(~1, d) x2 <- model.matrix(~x+z, d) fit1 <- with(d, riskreg_mle(y, a, x1, x2, type="rr")) fit1 ``` The parameters are presented in the same order as the columns of ``x1``and ``x2``, hence the target parameter estimate is in the first row ```{r} estimate(fit1, keep=1) ``` ## DRE We next fit the model using a double robust estimator (DRE) which introduces a model for the exposure $E(A=1\mid V)$ (propensity model). The double-robustness stems from the fact that the this estimator remains consistent in the union model where either the odds-product model or the propensity model is correctly specified. With both models correctly specified the estimator is efficient. ```{r} with(d, riskreg_fit(y, a, target=x1, nuisance=x2, propensity=x2, type="rr")) ``` The usual /formula/-syntax can be applied using the ``riskreg`` function. Here we illustrate the double-robustness by using a *wrong propensity model* but a correct nuisance paramter (odds-product) model: ```{r} riskreg(y~a, nuisance=~x+z, propensity=~z, data=d, type="rr") ``` Or vice-versa ```{r} riskreg(y~a, nuisance=~z, propensity=~x+z, data=d, type="rr") ``` whereas the MLE in this case yields a biased estimate of the relative risk: ```{r} fit2 <- with(d, riskreg_mle(y, a, x1=model.matrix(~1,d), x2=model.matrix(~z, d))) estimate(fit2, keep=1) ``` ## Interactions The more general model where $$\log RR(V) = A \alpha^TV$$ for a subset $V$ of the covariates can be estimated using the ``target`` argument: ```{r} fit <- riskreg(y~a, target=~x, nuisance=~x+z, data=d) fit ``` As expected we do not see any evidence of an effect of $X$ on the relative risk with the 95% confidence limits clearly overlapping zero. Note, that when the ``propensity`` argument is omitted as above, the same design matrix is used for both the odds-product model and the propensity model. ## Risk-difference The syntax for fitting the risk-difference model is similar. To illustrate this we simulate some new data from this model ```{r} m2 <- lava::binomial.rd(m, response="y", exposure="a", target.model="lp.target", nuisance.model="lp.nuisance") d2 <- lava::sim(m2, 1e4, p=p) ``` And we can then fit the DRE with the syntax ```{r} riskreg(y~a, nuisance=~x+z, data=d2, type="rd") ``` ## Influence-function The DRE is a *regular and asymptotic linear* (RAL) estimator, hence $$\sqrt{n}(\widehat{\alpha}_{\mathrm{DRE}} - \alpha) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \phi_{\mathrm{eff}}(Z_{i}) + o_{p}(1)$$ where $Z_i = (Y_i, A_i, V_i), i=1,\ldots,n$ are the i.i.d. observations and $\phi_{\mathrm{eff}}$ is the *influence function*. The influence function can be extracted using the ``IC`` method ```{r} head(IC(fit)) ``` # SessionInfo ```{r} sessionInfo() ```