Random Forest in R

Nick J Lyon

Prepare

  • First, you’ll need to install and load a few R packages
    • While not technically necessary, the librarian package makes library management much simpler


# Install librarian (if you need to)
# install.packages("librarian")

# Install (if not already present) and load needed libraries
librarian::shelf(tidyverse, randomForest, permimp, vegan)

Lichen Data

  • The vegan package includes vegetation cover data from lichen pastures (varespec) and matching soil chemistry (varechem) that we can use for exploratory purposes


  • We’ll begin by loading that data (with some minor wrangling)
# Load vegan's lichen dataset & associated chemistry dataset
utils::data("varespec", package = 'vegan')
utils::data("varechem", package = 'vegan')

# Pull out cover for a single species (Callvulg is Calluna vulgaris, a heather rather than a lichen)
lichen_sp <- dplyr::select(varespec, Callvulg)

# Attach the single species to the chemistry data
lichen_df <- cbind(lichen_sp, varechem)
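
  • cbind() pairs rows purely by position, so it is worth checking that both tables list the same sites in the same order. The sketch below is an optional sanity check that is not part of the original deck; rows_match is a hypothetical name

```r
# Load the two vegan datasets used in this section
utils::data("varespec", package = "vegan")
utils::data("varechem", package = "vegan")

# cbind() does not match on row names, so compare the site labels directly
rows_match <- identical(rownames(varespec), rownames(varechem))
rows_match

# Warn (rather than stop) if the site labels disagree
if (!rows_match) {
  warning("Site labels differ between varespec and varechem")
}

# Both tables should also have the same number of sites
nrow(varespec) == nrow(varechem)
```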

Data Structure

  • This data object now has the following structure:
# Check lichen data structure
str(lichen_df)
'data.frame':   24 obs. of  15 variables:
 $ Callvulg: num  0.55 0.67 0.1 0 0 ...
 $ N       : num  19.8 13.4 20.2 20.6 23.8 22.8 26.6 24.2 29.8 28.1 ...
 $ P       : num  42.1 39.1 67.7 60.8 54.5 40.9 36.7 31 73.5 40.5 ...
 $ K       : num  140 167 207 234 181 ...
 $ Ca      : num  519 357 973 834 777 ...
 $ Mg      : num  90 70.7 209.1 127.2 125.8 ...
 $ S       : num  32.3 35.2 58.1 40.7 39.5 40.8 33.8 27.1 42.5 60.2 ...
 $ Al      : num  39 88.1 138 15.4 24.2 ...
 $ Fe      : num  40.9 39 35.4 4.4 3 ...
 $ Mn      : num  58.1 52.4 32.1 132 50.1 ...
 $ Zn      : num  4.5 5.4 16.8 10.7 6.6 9.1 7.4 5.2 9.3 9.1 ...
 $ Mo      : num  0.3 0.3 0.8 0.2 0.3 0.4 0.3 0.3 0.3 0.5 ...
 $ Baresoil: num  43.9 23.6 21.2 18.7 46 40.5 23 29.8 17.6 29.9 ...
 $ Humdepth: num  2.2 2.2 2 2.9 3 3.8 2.8 2 3 2.2 ...
 $ pH      : num  2.7 2.8 3 2.8 2.7 2.7 2.8 2.8 2.8 2.8 ...

Random Forest

  • Fit the random forest with randomForest(), from the package of the same name
# Fit the random forest (a regression forest, because Callvulg is numeric)
lich_rf <- randomForest::randomForest(Callvulg ~ ., data = lichen_df,
                                      ntree = 1000, mtry = 2,
                                      na.action = na.omit,
                                      # Keep the trees and in-bag records for later use by permimp
                                      keep.forest = TRUE, keep.inbag = TRUE)


  • Quick argument explanation
    • The ‘Y ~ .’ formula means every other column is a (potential) predictor
    • ntree is the number of trees in the forest
    • mtry is the number of predictors randomly sampled as candidates at each split
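
  • For context on mtry = 2: the randomForest documentation gives the regression default as one third of the predictors (rounded down, minimum 1). A minimal sketch of that arithmetic for this dataset, where n_pred and default_mtry are hypothetical names

```r
# 15 columns in lichen_df minus the response leaves 14 predictors
n_pred <- 15 - 1

# Documented randomForest default for regression forests
default_mtry <- max(floor(n_pred / 3), 1)
default_mtry
```

  • So the mtry = 2 used here samples fewer candidate predictors per split than the default would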

Variable Importance Plot

  • We can now generate a variable importance plot based on that random forest
# Create variable importance plot
randomForest::varImpPlot(x = lich_rf, sort = TRUE,
                         # Show every predictor (all columns except the response)
                         n.var = (ncol(lichen_df) - 1),
                         main = "Variable Importance")
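
  • The forest above was fit without importance = TRUE, so only the node purity measure (IncNodePurity) is stored. The sketch below shows one way to also get permutation importance (%IncMSE) as a table. It refits the model under a hypothetical name (rf_imp) and uses an arbitrary seed; neither is in the original deck

```r
library(randomForest)

# Rebuild the data so this block runs on its own
utils::data("varespec", package = "vegan")
utils::data("varechem", package = "vegan")
lichen_df <- cbind(varespec["Callvulg"], varechem)

# Arbitrary seed (an assumption) so the sketch is repeatable
set.seed(42)

# importance = TRUE stores permutation importance alongside node purity
rf_imp <- randomForest(Callvulg ~ ., data = lichen_df,
                       ntree = 1000, mtry = 2, importance = TRUE)

# One row per predictor; columns are %IncMSE and IncNodePurity
imp <- importance(rf_imp)

# Sort predictors from most to least important by %IncMSE
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```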

Conditional Permutation Importance (CPI)

  • We can pass that random forest to permimp to compute conditional permutation importance, which permutes each predictor conditional on the other predictors it is associated with
# Implement conditional permutation
high_thresh <- permimp::permimp(object = lich_rf, conditional = TRUE,
                                # Note the threshold is set to 0.95
                                threshold = 0.95, do_check = FALSE, progressBar = FALSE)

# Make CPI plot
plot(high_thresh, type = "box", horizontal = TRUE)

CPI - Thresholds

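  • The threshold sets how strongly another predictor must be associated with the focal predictor (measured as 1 - p-value) before it is conditioned on, so the value you pick can have a dramatic effect!
# Repeat conditional permutation with a lower threshold
low_thresh <- permimp::permimp(object = lich_rf, conditional = TRUE,
                               # Note the lower threshold
                               threshold = 0.50, do_check = FALSE, progressBar = FALSE)

# Make CPI plot
plot(low_thresh, type = "box", horizontal = TRUE)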
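
  • To compare the two thresholds as numbers rather than plots, the mean importance per predictor can be placed side by side. This is a sketch only: it refits a smaller forest (rf_fit, 200 trees, arbitrary seed) so that it runs on its own, and it assumes the object returned by permimp stores its mean importances in $values

```r
library(randomForest)
library(permimp)

# Rebuild the data so this block runs on its own
utils::data("varespec", package = "vegan")
utils::data("varechem", package = "vegan")
lichen_df <- cbind(varespec["Callvulg"], varechem)

# Arbitrary seed and a smaller forest (both assumptions) to keep this quick
set.seed(42)
rf_fit <- randomForest(Callvulg ~ ., data = lichen_df,
                       ntree = 200, mtry = 2,
                       keep.forest = TRUE, keep.inbag = TRUE)

# Run conditional permutation importance once per threshold
cpi_by_threshold <- sapply(c(high = 0.95, low = 0.50), function(th) {
  permimp(rf_fit, conditional = TRUE, threshold = th,
          do_check = FALSE, progressBar = FALSE)$values
})

# One row per predictor, one column per threshold, sorted by the 0.95 column
cpi_by_threshold[order(cpi_by_threshold[, "high"], decreasing = TRUE), ]
```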

Exploratory Plotting

  • Let’s graph the response against the four ‘most important’ variables
    • This part is just for fun!
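
  • One possible version of that plot is sketched below. The deck does not say which importance measure picks the four variables, so this sketch assumes ranking by IncNodePurity from a refit forest (rf_fit, arbitrary seed); top4 and plot_df are hypothetical names

```r
library(randomForest)
library(ggplot2)

# Rebuild the data so this block runs on its own
utils::data("varespec", package = "vegan")
utils::data("varechem", package = "vegan")
lichen_df <- cbind(varespec["Callvulg"], varechem)

# Arbitrary seed (an assumption) so the ranking is repeatable
set.seed(42)
rf_fit <- randomForest(Callvulg ~ ., data = lichen_df, ntree = 1000, mtry = 2)

# Rank predictors by node purity and keep the four highest
imp <- importance(rf_fit)
top4 <- names(sort(imp[, "IncNodePurity"], decreasing = TRUE))[1:4]

# Reshape to long format: one row per site per predictor
plot_df <- tidyr::pivot_longer(lichen_df[, c("Callvulg", top4)],
                               cols = dplyr::all_of(top4),
                               names_to = "predictor", values_to = "value")

# One panel per predictor, each with its own x-axis scale
p <- ggplot(plot_df, aes(x = value, y = Callvulg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(x = "Predictor value", y = "Callvulg cover")
p
```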

Thanks! Questions?