---
title: "DImodelsVis with black-box models"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{DImodelsVis with black-box models}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
fig.width = 5,
fig.height = 5,
fig.align = "center"
)
```
```{css, echo = FALSE}
/* Highlight box */
.highlightbox {
background-color: #e6f0fa; /* light blue */
border: 2px solid #156082; /* darker blue */
border-radius: 6px;
padding: 10px 14px;
margin: 12px 0;
box-shadow: 2px 2px 6px rgba(0,0,0,0.1);
}
/* Note box */
.notebox {
background-color: #f7f7f7; /* light gray */
border: 1px solid #999999; /* medium gray */
border-radius: 4px;
padding: 8px 12px;
margin: 10px 0;
box-shadow: 1px 1px 4px rgba(0,0,0,0.08);
}
```
This vignette shows an example of how `DImodelsVis` can be used to visualise the response surface across the simplex space using black-box style models such as neural networks or random forests. Note that, while we show limited examples in this vignetted, it is worth noting that all visualisations in the package except for model diagnostics and prediction contributions plots can be created for black-box style models.
### Loading necessary packages
```{r, setup, message = FALSE}
library(DImodelsVis)
library(dplyr)
library(ggplot2)
library(nnet)
library(randomForest)
```
### Data
For this example we simulate a dataset containing 200 soil samples with varying proportions of sand, silt and clay, constrained to sum to one. The response variable is soil organic carbon (SOC) and was simulated to reflect that soils richer in clay have higher SOC, those with more silt have moderate SOC, and those dominated by sand have lower SOC.
```{r, simulate_data}
# For reproducability
set.seed(123)
# Number of samples
n <- 200
# Simulate 3d compositional data
# Assume equal density for each variable
alpha <- c(1, 1, 1)
X <- matrix(rgamma(n * length(alpha), shape = alpha), ncol = length(alpha), byrow = TRUE)
X <- X / rowSums(X) # normalize to sum to 1
colnames(X) <- c("Sand", "Silt", "Clay")
# Simulate Soil organic carbon (SOC) to be used as a response
# We assume, SOC is higher for higher clay content, lower for sand, moderate for silt
SOC <- 4 + 5*X[, "Clay"] - 2*X[, "Sand"] + 1*X[, "Silt"] + rnorm(n, 0, 0.5)
# Combine the compositional predictors and continuous response into data frame
soil_data <- data.frame(X, SOC)
# Snippet of data
head(soil_data)
```
### Visualising raw data
We create a ternary diagram using the `tenrary_plot()` function showing the spread of data across the 3d simplex space. The points are coloured by SOC values, with higher values colour green and lower values given shades of pink and white.
```{r, raw_data, fig.alt = "Plot of raw data in ternary"}
ternary_plot(
# Compositional variables
prop = c("Sand", "Silt", "Clay"),
# Labels for ternary axes
tern_labels = c("Sand", "Silt", "Clay"),
# Data with compositional variables
data = soil_data,
# Show raw data as points instead of a contour map
show = "points",
# Colour points by SOC variable
col_var = "SOC")
```
### Neural network
We fit a basic neural network containing a single hidden layer with seven nodes to the soil data using the proportions of sand, silt, and clay as predictors. Additional parameters such as `decay`, `linout`, `maxit` are set (explained in code) at specific values to ensure predictions are consistent across successive runs of the neural network.
```{r, nn_model}
# Seed to ensure weight initialisation is reproducible
set.seed(737)
nn_model <- nnet(SOC ~ Sand + Silt + Clay,
data = soil_data,
size = 7, # Seven nodes in hidden layer
decay = 0.01, # Parameter for weight decay (helps stabilise predictions)
linout = TRUE, # Boolean to indicate continuous predictions
maxit = 1000) # Number of iterations
nn_model
```
#### Visualise response surface across ternary
The predicted SOC surface across the 3d simplex space is visualised here. We first generated the grid of points across the simplex space using the `ternary_data()` function and then use the neural network to predict the SOC. The data.frame containing the grid of points and corresponding predicted SOC is then passed to the `ternary_plot()` function for visualising the response surface.
:::{.notebox}
While we set the `prediction = FALSE` argument here and add the predictions at a later step using the `add_prediction()` function from `DImodelsVis`. Users can also use the base R `predict()` function or any custom function of their choice which returns the predicted response from a black-box model.
:::
```{r, data_nn_model}
# Prepare data
ternary_grid <- ternary_data(# Compositional predictors
prop = c("Sand", "Silt", "Clay"),
# Don't make predictions now and only return template
prediction = FALSE)
head(ternary_grid)
```
The predicted response from the neural network can now be added to the data template created by the `ternary_data()` function.
```{r, predict_nn_model}
# Add the predicted response to data
plot_data <- add_prediction(# Grid data from ternary_data() function
data = ternary_grid,
# Model to use for making predictions
model = nn_model,
# No uncertainty interval added to data
interval = "none")
# The predicted response is added as a column called .Pred
head(plot_data)
```
The response surface can be visualised as follows
```{r, visualise_nn_model, fig.alt = "Ternary diagram showing SOC"}
# Create ternary plot
ternary_plot(data = plot_data, # Compositional data with predicted response
lower_lim = 1.5, # Lower limit of fill scale
upper_lim = 9.5, # Upper limit of fill scale
nlevels = 8, # Number of colours for fill scale
# Labels for ternary axes
tern_labels = c("Sand", "Silt", "Clay")) +
# ggplot function to add labels on plot
labs(fill = "Predicted\nSOC")
```
The predicted SOC is higher in soil containing high proportions of clay and lower in soil containing high amounts of sand.
#### Effects plots for the compositional predictors
We can also generate effects plots for the compositional predictors in the neural network. First, the dataset needed for generating the effects plot is created using the `visualise_effects_data()` function. The predicted SOC is then added to the data using the `add_prediction()` function and the prepared data is passed onto the `visualise_effects_plot()` function for visualising the effects.
:::{.notebox}
Users could also set `prediction = TRUE` and have the predictions added to the data in the data preparation step itself, as shown in the random-forest example later in the vignette.
:::
```{r, nn_model_effects, fig.alt = "Effects plot for compositional predictors", fig.height = 5, fig.width=9}
# Compositions to be used for generating the effects plots.
# Not all are needed but having more initial compositions gives higher accuracy.
eff_data <- tribble(~"Sand", ~"Silt", ~"Clay",
1, 0, 0,
0, 1, 0,
0, 0, 1,
0.5, 0.5, 0,
0.5, 0, 0.5,
0, 0.5, 0.5,
0.8, 0.2, 0,
0.2, 0.8, 0,
0.8, 0, 0.2,
0.2, 0, 0.8,
0, 0.8, 0.2,
0, 0.2, 0.8,
0.6, 0.2, 0.2,
0.2, 0.6, 0.2,
0.2, 0.2, 0.6,
1/3, 1/3, 1/3)
# The data-preparation, adding predictions and plotting are done in a single dplyr pipeline here
# Prepare data
visualise_effects_data(# Data with initial compostions
data = eff_data,
# Column names for compositional predictors in data
prop = c("Sand", "Silt", "Clay"),
# Show effects plot for all compostional variables
var_interest = c("Sand", "Silt", "Clay"),
# Don't make predictions now and only return template
prediction = FALSE) %>%
# Add predictions
add_prediction(interval = "none", # No uncertainty intervals
model = nn_model) %>% # Model to use for making predictions
# Print first few rows of data
as_tibble() %>% print(n = 6) %>%
# Create plot
visualise_effects_plot() +
# ggplot function to add labels on plot
labs(y = "Predicted SOC")
```
Increasing the proportion of sand lowers SOC, while increasing clay raises it. The effect of silt depends on the soil context: it tends to increase SOC in sand-rich soils but reduce it in clay-rich soils, leading to an overall flat effect of silt on SOC.
### Random forest model
We also show an example of using `DImodelsVis` with random forest models. The same pipeline as depicted above can be used with random forest models to create visualisations from `DImodelsVis`.
We first a random forest model using the proportions of sand, silt, and clay as well as their pairwise interactions as predictors.
```{r, forest_model}
forest_model <- randomForest(# Predictor for random forest model
SOC ~ Sand + Silt + Clay +
Sand:Silt + Silt:Clay + Sand:Clay,
# Data
data = soil_data,
# 5000 trees to be used random forest
ntree = 5000)
forest_model
```
#### Response surface across ternary using random forests
Similar to above, we could visualise the predicted response surface for SOC across the three dimensional simplex space using a ternary diagram. For this example, we also showcase the use of setting `prediction = TRUE` in the `ternary_data()` function to automatically add predictions to created grid of points for the ternary diagram. This prepared data is then passed to the `ternary_plot()` function to visualise the response surface.
:::{.notebox}
The `prediction = TRUE` argument can be specified in any of the data preparation functions in the package to automatically add predictions to the prepared data. However, when working with complex models with multiple predictions, care should be taken as users may wish to supply values for these predictors other than the function's defaults.
:::
```{r, visualise_forest_model, fig.alt = "Response surface from random forest"}
# Prepare data
ternary_data(# Names of compositional predictors
prop = c("Sand", "Silt", "Clay"),
# Model to use for making predictions
model = forest_model,
# Add predictions directly
prediction = TRUE) %>%
ternary_plot(lower_lim = 1.5, # Lower limit of fill scale
upper_lim = 9.5, # Upper limit of fill scale
nlevels = 8, # Number of colours for fill scale
# Labels for ternary axes
tern_labels = c("Sand", "Silt", "Clay")) +
# ggplot function to add labels on plot
labs(fill = "Predicted\nSOC")
```
A similar inference can be drawn as the neural network: predicted SOC is higher in soils with greater clay content and lower in those with higher sand content. However, the predicted response surface is noticeably more jagged for a random forest compared to a neural network or standard linear regression model. This could be improved by either fine tuning the model parameters further or by applying a post-processing smoothing operation over the predicted values.