--- title: "Common errors & warnings encountered in DImulti()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Common errors & warnings encountered in DImulti()} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{css styling, echo=FALSE} span.R { font-family: Courier New; } span.bad { color: red; } ``` ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(crayon.enabled = TRUE) ansi_aware_handler <- function(x, options) { paste0( "

",
    fansi::sgr_to_html(x = x, warn = FALSE, term.cap = "256"),
    "

" ) } old_hooks <- fansi::set_knit_hooks(knitr::knit_hooks, which = c("output", "message", "error", "warning")) knitr::knit_hooks$set( output = ansi_aware_handler, message = ansi_aware_handler, warning = ansi_aware_handler, error = ansi_aware_handler ) ``` ```{r setup} library(DImodelsMulti) data("dataBEL"); data("dataSWE"); data("simMV"); data("simRM"); data("simMVRM") ```

Incorrect argument specifications

The DImulti() function has a long list of possible parameters, which can make it difficult to ensure that all arguments supplied are in the requested format. See below for an example of each parameter and the included datasets, dataBEL, dataSWE, simMV, simRM and simMVRM, for examples of dataset formatting.

y

This is a required argument. It can be passed in three forms, using either column names or indices: *

Univariate

In this case, we pass a single name or index of a numerical column containing the ecosystem function response value. See ?dataSWE for an example of a univariate dataset suitable for use with DImulti(). ```{r y_univar, eval = FALSE} DImulti(y = "YIELD", ..., data = dataSWE) DImulti(y = 9, ..., data = dataSWE) ``` ```{r y_dataSWE, eval = TRUE, echo = FALSE} dataSWE[1:6, "YIELD", drop = FALSE] ``` *

Multivariate stacked

In this case, we pass a single name or index of a numerical column containing the ecosystem function response value. See ?dataBEL for an example of a multivariate stacked dataset suitable for use with DImulti(). ```{r y_MVstack, eval = FALSE} DImulti(y = "Y", ..., data = dataBEL) DImulti(y = 9, ..., data = dataBEL) ``` ```{r y_dataBEL, eval = TRUE, echo = FALSE} dataBEL[1:6, "Y", drop = FALSE] ``` *

Multivariate wide

In this case, we pass multiple names or indices of numerical columns containing the ecosystem function response values. See ?simMV for an example of a multivariate wide dataset suitable for use with DImulti(). ```{r y_MVwide, eval = FALSE} DImulti(y = c("Y1", "Y2", "Y3", "Y4"), ..., data = simMV) DImulti(y = 9:12, ..., data = simMV) ``` ```{r y_simMV, eval = TRUE, echo = FALSE} simMV[1:6, c("Y1", "Y2", "Y3", "Y4"), drop = FALSE] ```

eco_func

This is a required argument for multivariate data and excluded for univariate data. It can be passed in two forms: *

Multivariate stacked

In this case, we pass a vector of size two. The first index contains a single column name for the ecosystem function response type, a factor, the second contains the autocorrelation structure desired, either "UN" or "CS". See ?dataBEL for an example of a multivariate stacked dataset suitable for use with DImulti(). ```{r ecoFunc_MVstack, eval = FALSE} DImulti(eco_func = c("Var", "UN"), ..., data = dataBEL) ``` ```{r ecoFunc_dataBEL, eval = TRUE, echo = FALSE} dataBEL[1:6, "Var", drop = FALSE] ``` *

Multivariate wide

In this case, we pass a vector of size two. The first index contains the string "NA", the second contains the autocorrelation structure desired, either "UN" or "CS". See ?simMV for an example of a multivariate stacked dataset suitable for use with DImulti(). ```{r ecoFunc_MVwide, eval = FALSE} DImulti(eco_func = c("NA", "UN"), ..., data = simMV) ```

time

This is a required argument for repeated measures data and excluded for data taken at a single time point. It can be passed in the following form: *

Repeated Measures

In this case, we pass a vector of size two. The first index contains a single column name for the time point reference, a factor, the second contains the autocorrelation structure desired, either "UN", "CS", or "AR1". See ?dataSWE for an example of a repeated measures dataset suitable for use with DImulti(). ```{r time, eval = FALSE} DImulti(time = c("YEARN", "AR1"), ..., data = dataSWE) ``` ```{r time_dataSWE, eval = TRUE, echo = FALSE} dataSWE[c(1, 49, 97, 2, 50, 98), "YEARN", drop = FALSE] ```

unit_IDs

This is a required argument for any data. It can be passed a column name or index, referencing a unique reference for each experimental unit, used for grouping responses from different ecosystem functions or time points. ```{r unit_IDs, eval = FALSE} DImulti(unit_IDs = "plot", ..., data = simMVRM) DImulti(unit_IDs = 1, ..., data = simMVRM) ``` ```{r unit_IDs_simVMRM, eval = TRUE, echo = FALSE} simMVRM[1:6, "plot", drop = FALSE] ```

prop

This is a required argument for any data. It can be passed as a vector of column names or indices, referencing the (numerical) initial proportions of any species included in the experimental design. In the event that a species is included as a factor as opposed to a numerical value, e.g. if it only appears at values 0 or 1, only the numerical species should be included here, while the factor species is passed through the extra_fixed parameter. A warning will be printed to the console warning the user of not meeting the simplex requirement. ```{r prop, eval = FALSE} DImulti(unit_IDs = paste0("p", 1:4), ..., data = simMVRM) DImulti(unit_IDs = 2:5, ..., data = simMVRM) ``` ```{r prop_simVMRM, eval = TRUE, echo = FALSE} simMVRM[1:6, 2:5, drop = FALSE] ```

data

This is a required argument for any data. This argument should be a dataframe or tibble which contains all columns referenced in the DImulti() call. There are some restrictions on columns names allowed in this dataset. An error will be printed to the console to inform a user of name changes required. ```{r data, eval = FALSE} DImulti(..., data = simMVRM) DImulti(..., data = simMVRM) ``` ```{r data_simVMRM, eval = TRUE, echo = FALSE} simMVRM[1:6, , drop = FALSE] ```

DImodel

This is a required argument for any data. This argument should be a single string selected from the list of included interaction structures: "STR", "ID", "FULL", "E", "AV", "ADD", "FG". ```{r DImodel, eval = FALSE} DImulti(DImodel = "FULL", ...) DImulti(DImodel = "AV", ...) ```

FG

This is a required argument if the argument "FG" is passed through DImodel and can be excluded otherwise. This argument should be a vector of strings, of the same length as prop, which maps each species in the experiment to a functional grouping. ```{r FG, eval = FALSE} DImulti(prop = c("G1", "G2", "L1", "L2"), DImodel = "FG", FG = c("Grass", "Grass", "Legume", "Legume"), ..., data = dataBEL) ```

ID

This is an optional argument. This argument should be a vector of strings, of the same length as prop, which maps each species in the experiment to a identity grouping, assuming functional redundancy between within-group species for their ID effects, this grouping does not affect interactions. ```{r ID, eval = FALSE} DImulti(prop = c("G1", "G2", "L1", "L2"), ID = c("Grass", "Grass", "Legume", "Legume"), ..., data = dataBEL) ```

extra_fixed

This is an optional argument. This parameter can be passed a formula of any additional fixed effects to be included, e.g. treatments. These effects can be easily crossed with the automatic DI model formula by prefacing the formula passed with 1: or 1*. See ?dataSWE for a dataset containing treatments suitable for use with DImulti(). ```{r extra_fixed, eval = FALSE} DImulti(extra_fixed = ~DENS, ..., data = dataSWE) DImulti(extra_fixed = ~DENS + TREAT, ..., data = dataSWE) DImulti(extra_fixed = ~1*DENS, ..., data = dataSWE) DImulti(extra_fixed = ~1:(DENS+TREAT), ..., data = dataSWE) DImulti(extra_fixed = ~1:DENS:TREAT, ..., data = dataSWE) ``` ```{r extra_fixed_dataSWE, eval = TRUE, echo = FALSE} dataSWE[1:6, c("DENS", "TREAT"), drop = FALSE] ```

estimate_theta

This is an optional argument. This parameter can be passed a boolean value (FALSE), indicating whether the user wants the function to use profile likelihood to estimate values for the non-linear parameter $\theta$. Defaults to "STR" or "ID" was passed through the parameter DImodel, then setting estimate_theta = TRUE will cause a warning to be printed to the console, as $\theta$ only applies to interaction terms, which do not exist for those options. ```{r estimate_theta, eval = FALSE} DImulti(estimate_theta = TRUE, ...) DImulti(estimate_theta = FALSE, ...) ```

theta

This is an optional argument. This argument should contain numerical values representing the non-linear parameter $\theta$, which will be applied as a power to the products of pairs of species in the interaction terms of the model. A single value may be passed, which will be applied to each ecosystem function, or a different value may be passed for each ecosystem function. If "STR" or "ID" was passed through the parameter DImodel, then setting a non-1 value for theta will cause a warning to be printed to the console, as $\theta$ only applies to interaction terms, which do not exist for those options. ```{r theta, eval = FALSE} DImulti(theta = 1, ...) DImulti(theta = c(0.5, 1, 1.2), ...) ```

method

This is an optional argument. This parameter is used to change the estimation method used by DImulti(). The options are "REML", referring to restricted maximum likelihood, and "ML", referring to maximum likelihood. Defaults to "REML". ```{r method, eval = FALSE} DImulti(method = "REML", ...) DImulti(method = "ML", ...) ```

Singular fit

A singular fit error, thrown by nlme::gls() when fitting the model, occurs when an element with value of exactly zero exists in the fixed effect variance covariance matrix of the model. This usually caused by rank deficiency in the model, i.e. two terms explaining the same variance. The fix is usually to simplify/change the fixed effect terms, be it the interactions structure, treatment terms, or theta values. To aid in this change, the fixed effect formula and any estimated theta values are printed to the console with the error. As an example, we use the dataset dataSWEXXDO NOT RUNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ```{r singular_dataSWE, eval = TRUE, error = TRUE} DImulti(prop = 5:8, y = "YIELD", time = c("YEARN", "CS"), unit_IDs = "PLOT", data = dataSWE, DImodel = "ID", extra_fixed = ~DENS+TREAT, method = "REML") ``` XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX As we include no theta values or interaction terms, and use the simple structure "CS", compound symmetry, for our repeated measure, the only complication in this model is the inclusion of the two treatment terms, DENS and TREAT. When multiple treatments are included, it is usually good to check that one cannot be inferred from the other, e.g. if DENS = High then we know that TREAT = 1. Luckily this is not the case here. ```{r singular_dataSWE_DENSTREAT, eval = TRUE, echo = FALSE} dataSWE[c(1, 16, 31, 40), c("DENS", "TREAT"), drop = FALSE] ``` Instead, we change the formatting of our formula, from creating intercepts to crossing the treatment with our ID effects, but not each other. ```{r singular_dataSWE_fix, eval = TRUE} DImulti(prop = 5:8, y = "YIELD", time = c("YEARN", "CS"), unit_IDs = "PLOT", data = dataSWE, DImodel = "ID", extra_fixed = ~1:(DENS+TREAT), method = "REML") ``` Which has resolved our issue. We show another example using the multivariate dataset dataBEL and estimating $\theta$. We would first set theta = 1 and use model selection to choose an appropriate interaction structure, which gives us the following model, with a functional group structure: ```{r singular_dataBEL_Works, eval = TRUE} model1 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL, FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", extra_fixed = ~Density, method = "ML", theta = 1) ```

Comparing models: ML vs REML

The optional parameter method in the function DImulti() allows a user to change the estimation method used, between the default "REML", restricted maximum likelihood, and "ML", maximum likelihood. "REML" was chosen as the default as it provides unbiased estimates, as opposed to "ML", however, "REML" is not appropriate for use with model comparison where fixed effects vary between models, as we can see from the following warning message: XXDO NOT RUNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ```{r comparison_dataBEL, eval = TRUE} model1 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL, FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", method = "REML") model2 <- DImulti(prop = 2:5, y = "Y", eco_func = c("Var", "UN"), unit_IDs = 1, data = dataBEL, FG = c("Grass", "Grass", "Leg", "Leg"), DImodel = "FG", extra_fixed = ~Density, method = "REML") anova(model1, model2) ``` XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX In this case, we should use "ML" to fit the models for comparison, then refit the chosen model using "REML", for the unbiased estimates.