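I am using the iml package to derive ALE values from an rf model trained with caret. In a classification task, the levels of the dependent variable are syntactically invalid strings, which can cause problems because these levels end up as column names during prediction.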
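To make "syntactically invalid" concrete, here is a minimal base-R sketch of what make.names does to the levels used in this question, and of how data.frame changes column names by default. Whether iml or caret performs exactly this step internally is an assumption on my part; the variable names `lv`, `valid` and `m` are only for illustration.

```r
# Levels used in this question (copied from column "Three")
lv <- c("A-1_x", "B-0_y", "1 C-$_3.5")

# make.names() replaces invalid characters with "." and prepends "X"
# when a name does not start with a letter (or a dot not followed by a digit)
valid <- make.names(lv)
print(valid)

# The same conversion is applied when a matrix with such column names is
# passed through data.frame() with the default check.names = TRUE
m <- matrix(0, nrow = 1, ncol = 3, dimnames = list(NULL, lv))
print(names(data.frame(m)))                       # converted names
print(names(data.frame(m, check.names = FALSE)))  # original names kept
```

If a column is later selected by its original level string after such a conversion, the lookup fails, which would be consistent with an undefined columns selected error.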
Here is a silly example that throws an undefined columns selected error on the last line of code:
# ----- Packages -----
library(randomForest)
library(caret)
library(iml)
# ----- Dummy Data -----
One <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)
# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results
In some of my cases a very simple modification is enough: calling make.names in the second-to-last line of code, like this:
Pred <- Predictor$new(rf, data=df, class=make.names(ALE.ClassOfInterest))
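In the example above, however, this does not help. The only solution I have found is to apply make.names at the very start, so that all levels are converted to syntactically valid strings before the model is trained (see column "Four"). For various reasons I would like to stick with the original strings, though, and I have noticed that other equally invalid levels such as "0" and "1" (see column "One") do not need any workaround. This works: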
# ----- Modelling + IML for syntactically invalid levels from "One" -----
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(Two, Three, Four)
rf <- caret::train(TrainData, One, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results
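For reference, the make.names-at-start workaround can be paired with a lookup table so that the original strings remain recoverable for reporting. This is only a base-R sketch of that idea, not a fix for the underlying behaviour; `lookup`, `Three_valid`, `class_valid` and the seed are hypothetical names and values introduced for illustration.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))

# Lookup table: names are the valid versions, values are the original levels
lookup <- setNames(levels(Three), make.names(levels(Three)))

# Copy of the response with valid levels, to be used for training instead of Three
Three_valid <- Three
levels(Three_valid) <- make.names(levels(Three))

# Class of interest, translated for the Predictor call
ALE.ClassOfInterest <- "1 C-$_3.5"
class_valid <- make.names(ALE.ClassOfInterest)

# Translate back to the original string when labelling results
print(unname(lookup[class_valid]))
```

The model would then be trained on `Three_valid` and `class_valid` passed as the class argument, with `lookup` used afterwards to relabel the output. This still changes the levels inside the model, which is what I would like to avoid.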
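Does anyone know what is happening under the hood, if it is not a plain make.names call, or can anyone suggest a solution that lets me keep the original factor levels in the model?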
Thanks,
Mark