---
title: "Common errors & warnings encountered in DImulti()"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Common errors & warnings encountered in DImulti()}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{css styling, echo=FALSE}
span.R {
  font-family: Courier New;
}
span.bad {
  color: red;
}
```
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(crayon.enabled = TRUE)
ansi_aware_handler <- function(x, options)
{
  paste0(
    "
",
    fansi::sgr_to_html(x = x, warn = FALSE, term.cap = "256"),
    "
"
  )
}
old_hooks <- fansi::set_knit_hooks(knitr::knit_hooks,
                                 which = c("output", "message", "error", "warning"))
knitr::knit_hooks$set(
  output = ansi_aware_handler,
  message = ansi_aware_handler,
  warning = ansi_aware_handler,
  error = ansi_aware_handler
)
```
```{r setup}
library(DImodelsMulti)
data("dataBEL"); data("dataSWE"); data("simMV"); data("simRM"); data("simMVRM")
```
Incorrect argument specifications 
The DImulti() function has a long list of possible parameters, which can make
it difficult to ensure that all arguments supplied are in the requested format. See below for an 
example of each parameter and the included datasets, 
dataBEL, 
dataSWE, 
simMV, 
simRM and 
simMVRM, for examples of dataset formatting.
y
This is a required argument. It can be passed in three forms, using either column names or indices:
* Univariate
In this case, we pass a single name or index of a numerical column containing the ecosystem function
response value. See ?dataSWE for an example of a univariate dataset suitable 
for use with DImulti().
```{r y_univar, eval = FALSE}
DImulti(y = "YIELD", ..., data = dataSWE)
DImulti(y = 9, ..., data = dataSWE)
```
```{r y_dataSWE, eval = TRUE, echo = FALSE}
dataSWE[1:6, "YIELD", drop = FALSE]
```
* Multivariate stacked
In this case, we pass a single name or index of a numerical column containing the ecosystem function
response value. See ?dataBEL for an example of a multivariate stacked dataset 
suitable for use with DImulti().
```{r y_MVstack, eval = FALSE}
DImulti(y = "Y", ..., data = dataBEL)
DImulti(y = 9, ..., data = dataBEL)
```
```{r y_dataBEL, eval = TRUE, echo = FALSE}
dataBEL[1:6, "Y", drop = FALSE]
```
* Multivariate wide
In this case, we pass multiple names or indices of numerical columns containing the 
ecosystem function response values. See ?simMV for an example of a 
multivariate wide dataset suitable for use with DImulti().
```{r y_MVwide, eval = FALSE}
DImulti(y = c("Y1", "Y2", "Y3", "Y4"), ..., data = simMV)
DImulti(y = 9:12, ..., data = simMV)
```
```{r y_simMV, eval = TRUE, echo = FALSE}
simMV[1:6, c("Y1", "Y2", "Y3", "Y4"), drop = FALSE]
```
eco_func
This is a required argument for multivariate data and excluded for univariate data. It can be passed 
in two forms:
* Multivariate stacked
In this case, we pass a vector of size two. The first index contains a single column name for the 
ecosystem function response type, a factor, the second contains the autocorrelation structure 
desired, either "UN" or "CS". See 
?dataBEL for an example of a multivariate stacked dataset suitable for use 
with DImulti().
```{r ecoFunc_MVstack, eval = FALSE}
DImulti(eco_func = c("Var", "UN"), ..., data = dataBEL)
```
```{r ecoFunc_dataBEL, eval = TRUE, echo = FALSE}
dataBEL[1:6, "Var", drop = FALSE]
```
* Multivariate wide
In this case, we pass a vector of size two. The first index contains the string "NA", the second 
contains the autocorrelation structure desired, either "UN" or "CS". See ?simMV for an example of a multivariate stacked 
dataset suitable for use with DImulti().
```{r ecoFunc_MVwide, eval = FALSE}
DImulti(eco_func = c("NA", "UN"), ..., data = simMV)
```
time
This is a required argument for repeated measures data and excluded for data taken at a single time 
point. It can be passed in the following form:
* Repeated Measures
In this case, we pass a vector of size two. The first index contains a single column name for the 
time point reference, a factor, the second contains the autocorrelation structure 
desired, either "UN", "CS", or 
"AR1". See ?dataSWE for an example of a repeated 
measures dataset suitable for use with DImulti().
```{r time, eval = FALSE}
DImulti(time = c("YEARN", "AR1"), ..., data = dataSWE)
```
```{r time_dataSWE, eval = TRUE, echo = FALSE}
dataSWE[c(1, 49, 97, 2, 50, 98), "YEARN", drop = FALSE]
```
unit_IDs
This is a required argument for any data. 
It can be passed a column name or index, referencing a unique reference for each experimental unit, 
used for grouping responses from different ecosystem functions or time points. 
```{r unit_IDs, eval = FALSE}
DImulti(unit_IDs = "plot", ..., data = simMVRM)
DImulti(unit_IDs = 1, ..., data = simMVRM)
```
```{r unit_IDs_simVMRM, eval = TRUE, echo = FALSE}
simMVRM[1:6, "plot", drop = FALSE]
```
prop
This is a required argument for any data. 
It can be passed as a vector of column names or indices, referencing the (numerical) initial 
proportions of any species included in the experimental design. 
In the event that a species is included as a factor as opposed to a numerical value, e.g. if it only
appears at values 0 or 1, only the numerical species should be included here, while the factor 
species is passed through the extra_fixed parameter. A warning will be 
printed to the console warning the user of not meeting the simplex requirement.
```{r prop, eval = FALSE}
DImulti(unit_IDs = paste0("p", 1:4), ..., data = simMVRM)
DImulti(unit_IDs = 2:5, ..., data = simMVRM)
```
```{r prop_simVMRM, eval = TRUE, echo = FALSE}
simMVRM[1:6, 2:5, drop = FALSE]
```
data
This is a required argument for any data. 
This argument should be a dataframe or tibble which contains all columns referenced in the 
DImulti() call. There are some restrictions on columns names allowed in this
dataset. An error will be printed to the console to inform a user of name changes required.
```{r data, eval = FALSE}
DImulti(..., data = simMVRM)
DImulti(..., data = simMVRM)
```
```{r data_simVMRM, eval = TRUE, echo = FALSE}
simMVRM[1:6, , drop = FALSE]
```
DImodel
This is a required argument for any data. 
This argument should be a single string selected from the list of included interaction structures:
"STR",
"ID",
"FULL",
"E",
"AV",
"ADD",
"FG".
```{r DImodel, eval = FALSE}
DImulti(DImodel = "FULL", ...)
DImulti(DImodel = "AV", ...)
```
FG
This is a required argument if the argument "FG" is passed through 
DImodel and can be excluded otherwise.
This argument should be a vector of strings, of the same length as prop, 
which maps each species in the experiment to a functional grouping.
```{r FG, eval = FALSE}
DImulti(prop = c("G1", "G2", "L1", "L2"), DImodel = "FG", 
        FG = c("Grass", "Grass", "Legume", "Legume"), ..., data = dataBEL)
```
ID
This is an optional argument.
This argument should be a vector of strings, of the same length as prop, 
which maps each species in the experiment to a identity grouping, assuming functional redundancy 
between within-group species for their ID effects, this grouping does not affect interactions.
```{r ID, eval = FALSE}
DImulti(prop = c("G1", "G2", "L1", "L2"), ID = c("Grass", "Grass", "Legume", "Legume"), 
        ..., data = dataBEL)
```
extra_fixed
This is an optional argument.
This parameter can be passed a formula of any additional fixed effects to be included, e.g. 
treatments. These effects can be easily crossed with the automatic DI model formula by prefacing 
the formula passed with 1: or 1*. See 
?dataSWE for a dataset containing treatments suitable for use with 
DImulti().
```{r extra_fixed, eval = FALSE}
DImulti(extra_fixed = ~DENS, ..., data = dataSWE)
DImulti(extra_fixed = ~DENS + TREAT, ..., data = dataSWE)
DImulti(extra_fixed = ~1*DENS, ..., data = dataSWE)
DImulti(extra_fixed = ~1:(DENS+TREAT), ..., data = dataSWE)
DImulti(extra_fixed = ~1:DENS:TREAT, ..., data = dataSWE)
```
```{r extra_fixed_dataSWE, eval = TRUE, echo = FALSE}
dataSWE[1:6, c("DENS", "TREAT"), drop = FALSE]
```
estimate_theta
This is an optional argument.
This parameter can be passed a boolean value (FALSE), indicating whether the user wants the function to use profile 
likelihood to estimate values for the non-linear parameter $\theta$. Defaults to 
"STR" or "ID" was passed 
through the parameter DImodel, then setting 
estimate_theta = TRUE will cause a warning to be printed to the console, as
$\theta$ only applies to interaction terms, which do not exist for those options.
```{r estimate_theta, eval = FALSE}
DImulti(estimate_theta = TRUE, ...)
DImulti(estimate_theta = FALSE, ...)
```
theta
This is an optional argument.
This argument should contain numerical values representing the non-linear parameter $\theta$, which
will be applied as a power to the products of pairs of species in the interaction terms of the 
model. A single value may be passed, which will be applied to each ecosystem function, or a 
different value may be passed for each ecosystem function.
If "STR" or "ID" was passed 
through the parameter DImodel, then setting a non-1 value for
theta will cause a warning to be printed to the console, as
$\theta$ only applies to interaction terms, which do not exist for those options.
```{r theta, eval = FALSE}
DImulti(theta = 1, ...)
DImulti(theta = c(0.5, 1, 1.2), ...)
```
method
This is an optional argument.
This parameter is used to change the estimation method used by DImulti(). The 
options are "REML", referring to restricted maximum likelihood, and 
"ML", referring to maximum likelihood. 
Defaults to "REML".
```{r method, eval = FALSE}
DImulti(method = "REML", ...)
DImulti(method = "ML", ...)
```
Singular fit 
A singular fit error, thrown by nlme::gls() when fitting the model, occurs 
when an element with value of exactly zero exists in the fixed effect variance covariance matrix of 
the model. This usually caused by rank deficiency in the model, i.e. two terms explaining the same
variance. The fix is usually to simplify/change the fixed effect terms, be it the interactions 
structure, treatment terms, or theta values. To aid in this change, the fixed effect formula and any 
estimated theta values are printed to the console with the error.
As an example, we use the dataset dataSWEXXDO NOT RUNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```{r singular_dataSWE, eval = TRUE, error = TRUE}
DImulti(prop = 5:8, y = "YIELD", time = c("YEARN", "CS"), unit_IDs = "PLOT", data = dataSWE, 
        DImodel = "ID", extra_fixed = ~DENS+TREAT, method = "REML")
```
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
As we include no theta values or interaction terms, and use the simple structure 
"CS", compound symmetry, for our repeated measure, the only complication in 
this model is the inclusion of the two treatment terms, DENS and 
TREAT. When multiple treatments are included, it is usually good to check 
that one cannot be inferred from the other, e.g. if DENS = High then we know 
that TREAT = 1. Luckily this is not the case here.
```{r singular_dataSWE_DENSTREAT, eval = TRUE, echo = FALSE}
dataSWE[c(1, 16, 31, 40), c("DENS", "TREAT"), drop = FALSE]
```
Instead, we change the formatting of our formula, from creating intercepts to crossing the treatment
with our ID effects, but not each other.
```{r singular_dataSWE_fix, eval = TRUE}
DImulti(prop = 5:8, y = "YIELD", time = c("YEARN", "CS"), unit_IDs = "PLOT", data = dataSWE, 
        DImodel = "ID", extra_fixed = ~1:(DENS+TREAT), method = "REML")
```
Which has resolved our issue.
We show another example using the multivariate dataset dataBEL and estimating
$\theta$. We would first set theta = 1 and use model selection to choose an
appropriate interaction structure, which gives us the following model, with a functional group
structure:
```{r singular_dataBEL_Works, eval = TRUE}
model1 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL,
                  FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", extra_fixed = ~Density, 
                  method = "ML", theta = 1)
```
Comparing models: ML vs REML 
The optional parameter method in the function 
DImulti() allows a user to change the estimation method used, between the 
default "REML", restricted maximum likelihood, and 
"ML", maximum likelihood.
"REML" was chosen as the default as it provides unbiased estimates, as 
opposed to "ML", however, "REML" is not appropriate
for use with model comparison where fixed effects vary between models, as we can see from the 
following warning message:
XXDO NOT RUNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```{r comparison_dataBEL, eval = TRUE}
model1 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL,
                  FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", 
                  method = "REML")
model2 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL,
                  FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", extra_fixed = ~Density,
                  method = "REML")
anova(model1, model2)
```
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
In this case, we should use "ML" to fit the models for comparison, then refit
the chosen model using "REML", for the unbiased estimates.