We’ll need to load a few packages:
library(filtro)
library(dplyr)
library(modeldata)

Predictor importance can be assessed using three different random forest models. They can be accessed via the following score class objects:
score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique

These models are powered by the following packages:
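One way to look these up is from the score objects themselves. A minimal sketch, assuming each score object exposes a packages property (an assumption on our part, not confirmed against the filtro API):

# Assumed property name: packages
score_imp_rf@packages
score_imp_rf_conditional@packages
score_imp_rf_oblique@packages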
#> [1] "ranger"
#> [1] "partykit"
#> [1] "aorsf"Regarding score types:
- The {ranger} random forest computes the importance scores.
- The {partykit} conditional random forest computes the conditional importance scores.
- The {aorsf} oblique random forest computes the permutation importance scores.
The {modeldata} package contains a data set used to predict which cells in a high content screen were well segmented. It has 57 predictor columns and a factor column, class (the outcome).

Since case is only used to indicate Train/Test, not for data analysis, it will be set to NULL. Furthermore, for efficiency, we will use only the first 50 of the original 2019 observations.
cells_subset <- modeldata::cells |> 
  # Use a small example for efficiency
  dplyr::slice(1:50)
cells_subset$case <- NULL
# cells_subset |> str() # Uncomment to see the structure of the data

First, we create a score class object to specify a {ranger} random forest, and then use the fit() method with the standard formula to compute the importance scores.
# Specify random forest and fit score
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset, 
    seed = 42 
  )

The data frame of results can be accessed via object@results.
cells_imp_rf_res@results
#> # A tibble: 56 × 4
#>    name        score outcome predictor                   
#>    <chr>       <dbl> <chr>   <chr>                       
#>  1 imp_rf  0.000967  class   angle_ch_1                  
#>  2 imp_rf -0.0000620 class   area_ch_1                   
#>  3 imp_rf  0.00438   class   avg_inten_ch_1              
#>  4 imp_rf  0.00916   class   avg_inten_ch_2              
#>  5 imp_rf -0.000426  class   avg_inten_ch_3              
#>  6 imp_rf -0.000296  class   avg_inten_ch_4              
#>  7 imp_rf  0.00836   class   convex_hull_area_ratio_ch_1 
#>  8 imp_rf  0.00133   class   convex_hull_perim_ratio_ch_1
#>  9 imp_rf  0.000739  class   diff_inten_density_ch_1     
#> 10 imp_rf -0.00128   class   diff_inten_density_ch_3     
#> # ℹ 46 more rows

A couple of notes here:

- The random forest filter, including all three types of random forests, supports both regression tasks and classification tasks.
- In cases where an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value. A short sketch after this list shows one way to apply it.
- Larger values indicate more important predictors.
- For this specific filter, i.e., score_imp_rf_*, case weights are supported.
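To illustrate the two notes on scores, a minimal sketch: it ranks predictors so the largest (most important) scores come first, then substitutes the fallback value for any NA scores. The fallback_value property is taken from the note above; the dplyr calls are our own illustration:

# Rank predictors from most to least important
cells_imp_rf_res@results |>
  dplyr::arrange(dplyr::desc(score))

# Replace any NA scores with the safe fallback value
# (assumes fallback_value is a numeric scalar)
cells_imp_rf_res@results |>
  dplyr::mutate(score = dplyr::coalesce(score, cells_imp_rf_res@fallback_value))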
As in {parsnip}, the argument names are harmonized. For example, the arguments that set the number of trees, num.trees in {ranger}, ntree in {partykit}, and n_tree in {aorsf}, are all standardized to the single name trees, so users only need to remember one name. The same applies to the number of variables to split at each node, mtry, and the minimum node size for splitting, min_n.
# Set hyperparameters
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    trees = 100, 
    mtry = 2,
    min_n = 1
  )

However, there is one argument name specific to {ranger}. For reproducibility, instead of using the standard set.seed() approach, we use the seed argument.
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    trees = 100,
    mtry = 2,
    min_n = 1, 
    seed = 42 # Set seed for reproducibility
  )

If users supply {ranger} argument names, intentionally or not, they still work; the necessary adjustments are handled internally. The following code chunk can be used to obtain the same fitted score:
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    num.trees = 100,
    mtry = 2,
    min.node.size = 1, 
    seed = 42 
  )

The same applies to {partykit}- and {aorsf}-specific arguments.
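For instance, based on the name mapping above, the {aorsf}-specific n_tree should likewise be translated to trees. A sketch (the fully worked oblique fit appears below):

# {aorsf}-specific name n_tree is translated to the harmonized trees
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, n_tree = 100)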
For the {partykit} conditional random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores.
The data frame of results can be accessed via object@results.
# Set seed for reproducibility
set.seed(42)
# Specify conditional random forest and fit score
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 100)
cells_imp_rf_conditional_res@results
#> # A tibble: 40 × 4
#>    name                  score outcome predictor                   
#>    <chr>                 <dbl> <chr>   <chr>                       
#>  1 imp_rf_conditional -0.0306  class   angle_ch_1                  
#>  2 imp_rf_conditional  0.178   class   area_ch_1                   
#>  3 imp_rf_conditional  0.158   class   avg_inten_ch_1              
#>  4 imp_rf_conditional  0.132   class   avg_inten_ch_2              
#>  5 imp_rf_conditional  0.0927  class   convex_hull_area_ratio_ch_1 
#>  6 imp_rf_conditional  0.963   class   convex_hull_perim_ratio_ch_1
#>  7 imp_rf_conditional -0.0842  class   diff_inten_density_ch_1     
#>  8 imp_rf_conditional  0.0688  class   diff_inten_density_ch_3     
#>  9 imp_rf_conditional  0.147   class   entropy_inten_ch_1          
#> 10 imp_rf_conditional  0.00105 class   entropy_inten_ch_3          
#> # ℹ 30 more rows

Note that when a predictor’s importance score is 0, partykit::cforest() may exclude its name from the output. In such cases, a score of 0 is assigned to the missing predictors.
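To see which predictors were affected here, one option (our own illustration) is to compare the predictor names in the results against the full set of predictors:

# All predictors: every column except the outcome
all_predictors <- setdiff(names(cells_subset), "class")

# Predictors absent from the conditional results
# (per the note above, these are treated as having a score of 0)
setdiff(all_predictors, cells_imp_rf_conditional_res@results$predictor)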
For the {aorsf} oblique random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores.
The data frame of results can be accessed via object@results.
# Set seed for reproducibility
set.seed(42)
# Specify oblique random forest and fit score
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, trees = 100, mtry = 2)
cells_imp_rf_oblique_res@results
#> # A tibble: 56 × 4
#>    name             score outcome predictor               
#>    <chr>            <dbl> <chr>   <chr>                   
#>  1 imp_rf_oblique 0.0165  class   fiber_width_ch_1        
#>  2 imp_rf_oblique 0.0121  class   inten_cooc_contrast_ch_3
#>  3 imp_rf_oblique 0.0109  class   inten_cooc_max_ch_3     
#>  4 imp_rf_oblique 0.00850 class   shape_p_2_a_ch_1        
#>  5 imp_rf_oblique 0.00777 class   entropy_inten_ch_1      
#>  6 imp_rf_oblique 0.00725 class   eq_ellipse_lwr_ch_1     
#>  7 imp_rf_oblique 0.00589 class   inten_cooc_asm_ch_3     
#>  8 imp_rf_oblique 0.00543 class   diff_inten_density_ch_1 
#>  9 imp_rf_oblique 0.00513 class   shape_lwr_ch_1          
#> 10 imp_rf_oblique 0.00506 class   fiber_length_ch_1       
#> # ℹ 46 more rows

The list of score class objects for random forests, their corresponding engines, and supported tasks:
| object | engine | task | 
|---|---|---|
| score_imp_rf | ranger::ranger | regression, classification | 
| score_imp_rf_conditional | partykit::cforest | regression, classification | 
| score_imp_rf_oblique | aorsf::orsf | regression, classification |
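As the table indicates, these filters also handle regression tasks. A minimal sketch with a numeric outcome, assuming the same fit() interface shown above (mtcars is our own choice of example data):

# Regression task: score predictors of a numeric outcome
mtcars_imp_rf_res <- score_imp_rf |>
  fit(mpg ~ ., data = mtcars, trees = 100, seed = 42)
mtcars_imp_rf_res@results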