---
title: "Multivariate missingness and monotonicity"
author: "Janick Weberpals"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multivariate missingness and monotonicity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dpi = 150,
  fig.width = 6,
  fig.height = 4.5
  )
```
```{r setup}
library(smdi)
library(gt)
suppressPackageStartupMessages(library(dplyr))
```
# Multivariate missing data in `smdi`
In this article, we want to briefly highlight two aspect regarding **multivariate missingness**: 
1. How does `smdi` handle multivariate missingness?
2. What is the link between missing data patterns and missing data mechanisms and how does this affect the behavior and performance of the `smdi` functionality?
## Established taxonomies
In general, there are two classic established missing data taxonomies:
* Mechanisms: Missing completely at random (MCAR), at random (MAR) and not at random (MNAR)
* Patterns: Monotone versus Non-monotone missingness
# How does `smdi` handle multivariate missingness?
In all `smdi` functions, except `smdi_little()`, binary missing indicator variables are created for each partially observed variable (either specified by the analyst using the `covar` parameter or automatically identified via `smdi_check_covar()` if `covar` = NULL) and the columns with the actual variable values are dropped. For the variable importance visualization in `smdi_rf()`, these variables are indicated with a *"_NA"* suffix. Missing values are accordingly indicated with a *1* and complete observations with a *0*. This functionality is controlled via the `smdi_na_indicator()` utility function. 
```{r, fig.cap="Illustrating missing indicator variable generation within `smdi` functions"}
smdi_data %>% 
  smdi_na_indicator(
    drop_NA_col = FALSE # usually TRUE, but for demonstration purposes set to FALSE
    ) %>% 
  select(
    ecog_cat, ecog_cat_NA, 
    egfr_cat, egfr_cat_NA, 
    pdl1_num, pdl1_num_NA
    ) %>% 
  head() %>% 
  gt()
```
Now, let's assume we have three partially observed covariates *X*, *Y* and *Z*, which we would like to include in our missingness diagnostics. All `smdi_diagnose()` functions, except `smdi_little()`, create *X_NA*, *Y_NA* and *Z_NA* and *X*, *Y* and *Z* are discarded from the dataset. The functions will then iterate the diagnostics through all *X_NA*, *Y_NA* and *Z_NA* one-by-one. That is, if, for example, *X_NA* is assessed, *Y_NA* and *Z_NA* serve as predictor variables along with all other covariates in the dataset. If *Y_NA* is assessed, *X_NA* and *Z_NA* are included as predictor variables, and so forth.
  Important! It is important to notice that this strategy is the default to deal with multivariate missingness in the `smdi` package, however, another possible approach could be to *not* consider the other partially observed variables in the first place (e.g. by dropping them before applying any `smdi` function) and stacking the diagnostics focusing on one partially observed variable at a time. Such a strategy would be advisable in scenarios of monotone missing data patterns (see next section).
# `smdi` in case of monotone missing data patterns
While in the `smdi` package we mainly focus on missing data mechanisms, missing data patterns always need to be considered, too. Please refer to the [routine structural missing data diagnostics article](https://janickweberpals.gitlab-pages.partners.org/smdi/articles/b_routine_diagnostics.html#descriptives), where we highlight the importance of describing missingness proportions and patterns before running any of the `smdi` diagnostics.
As mentioned in the section before, in case of monotone missing data patterns, the `smdi` functionality may be misleading.
  Monotonicity A missing data pattern is said to be monotone if the variables Yj can be ordered such that if Yj is missing then all variables Yk with k > j are also missing (taken from Stef van Buuren [^1]).
[^1]: For more information on missing data patterns see 
A good example for monotone missing data could be clinical blood laboratory tests ("labs") which are often tested together in a lab panel. If one lab is missing, typically the other labs of this panel are also missing.
```{r}
# we simulatea monotone missingness pattern
# following an MCAR mechanism
set.seed(42)
data_monotone <- smdi_data_complete %>% 
  mutate(
    lab1 = rnorm(nrow(smdi_data_complete), mean = 5, sd = 0.5),
    lab2 = rnorm(nrow(smdi_data_complete), mean = 10, sd = 2.25)
    )
data_monotone[3:503, "lab1"] <- NA
data_monotone[1:500, "lab2"] <- NA
```
```{r}
smdi::gg_miss_upset(data = data_monotone)
```
```{r}
smdi::md.pattern(data_monotone[, c("lab1", "lab2")], plot = FALSE)
```
In extreme cases of perfect linearity, this can lead to multiple warnings and errors such as `system is exactly singular` or `-InfWarning: Variable has only NA's in at least one stratum`.
In cases in which monotonicity is still clearly present but not as extreme (like in the example above), `smdi` will prompt a message to the analyst to raise awareness of this issue as the `smdi` output can be **highly misleading** in such instances.
```{r}
diagnostics_jointly <- smdi_diagnose(
  data = data_monotone,
  covar = NULL, # NULL includes all covariates with at least one NA
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )
```
```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
diagnostics_jointly %>% 
  smdi_style_gt()
```
In such cases, it may be advisable to *not* consider including *lab2* in the missingness diagnostics of *lab1* and vice versa and stack the diagnostics focusing on one partially observed variable at a time.
## Lab 1 analyzed without Lab 2
```{r, fig.cap="Diagnostics of lab 1 if analyzed separately."}
# lab 1
lab1_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab2),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )
lab1_diagnostics %>% 
  smdi_style_gt()
```
## Lab 2 analyzed without Lab 1
```{r, fig.cap="Diagnostics of lab 2 if analyzed separately."}
# lab 2
lab2_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab1),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
  )
lab2_diagnostics %>% 
  smdi_style_gt()
```
## Presented in one table using `smdi_style_gt()`
We can also combine the output of individually stacked `smdi_diagnose` tables and enhance it with a global Little's test that takes into account the multivariate missingness of the entire dataset.
```{r}
# computing a gloabl p-value for Little's test including both lab1 and lab2
little_global <- smdi_little(data = data_monotone)
# combining two individual lab smdi tables and global Little's test
smdi_style_gt(
  smdi_object = rbind(lab1_diagnostics$smdi_tbl, lab2_diagnostics$smdi_tbl), 
  include_little = little_global
  )
```
Since the missingness follows an MCAR mechanism, `smdi_diagnose()` now shows the expected missingness diagnostics patterns one would expect from an MCAR mechanism.