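I am teaching myself machine learning using the excellent tidymodels collection of packages.
In the example below, I am essentially trying to replicate Julia Silge's blog post (https://juliasilge.com/blog/water-sources/) about predicting water sources with the ranger package.
Instead of the dataset from that post, I am using the built-in diamonds dataset for practice.
I can recreate everything except the yardstick::roc_curve() step, where I try to plot the truth against the predictions.
The error I get is shown below: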
Error in `dplyr::summarise()`:
! Problem while computing `.estimate = metric_fn(...)`.
ℹ The error occurred in group 1: id = "Fold01".
Caused by error in `validate_class()`:
! `estimate` should be a numeric but a factor was supplied.
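Although the dataset and the transformation steps differ, the steps below roughly correspond to those in the post linked above.
I realize that, statistically speaking, there may be more efficient or better ways to do this, but I am only trying to get more familiar with these tools and packages and gain experience using them.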
library(tidyverse)
library(tidymodels)

# set an outcome variable that I want to try to predict (e.g. price is above $10,000)
diamonds <- diamonds %>%
  mutate(high_price_indicator = if_else(price > 10000, "high", "low"))

# split data sets
data_split <- rsample::initial_split(diamonds, strata = high_price_indicator)
training_split <- rsample::training(data_split)
testing_split <- rsample::testing(data_split)

# cross-validation folds
diamonds_fold <- rsample::vfold_cv(training_split, strata = high_price_indicator)

# choose model, set engine and mode
rf_spec <- parsnip::rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# set recipe and do some transformations - not sure if the error is here
rec <- recipes::recipe(high_price_indicator ~ ., data = training_split) %>%
  recipes::step_normalize(all_numeric_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_dummy(c("cut", "color", "clarity"), one_hot = TRUE)

# create the workflow
workflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(rec)

# fit workflow to the cross-validation folds and save predictions
fit_folds <- tune::fit_resamples(
  workflow,
  resamples = diamonds_fold,
  control = control_resamples(save_pred = TRUE)
)

# this is where I get the error
collect_predictions(fit_folds) %>%
  group_by(id) %>%
  roc_curve(high_price_indicator, .pred_class) %>%
  autoplot()
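Thanks for any guidance!
My steps are above. If anyone can help me understand what I am doing wrong when plotting the predictions against the outcome variable, I would greatly appreciate it.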
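The error message points at the `estimate` argument: `roc_curve()` from yardstick expects a numeric class-probability column there, while `.pred_class` is the factor of hard class predictions. The sketch below illustrates the difference on simulated data. The `preds` tibble is a hypothetical stand-in for `collect_predictions(fit_folds)`, and the column name `.pred_high` assumes the outcome levels are `"high"` and `"low"` with `"high"` as the first level (the default event level in yardstick). Checking `names(collect_predictions(fit_folds))` would confirm the actual column names.

```r
library(dplyr)
library(yardstick)
library(ggplot2)

set.seed(123)
n <- 200

# Hypothetical stand-in for collect_predictions(fit_folds): two folds,
# a factor truth column, class probabilities, and the hard class prediction.
preds <- tibble(
  id = rep(c("Fold01", "Fold02"), each = n / 2),
  high_price_indicator = factor(
    sample(c("high", "low"), n, replace = TRUE),
    levels = c("high", "low")
  )
) %>%
  mutate(
    # simulated probabilities, loosely related to the truth column
    .pred_high = if_else(high_price_indicator == "high",
                         rbeta(n, 4, 2), rbeta(n, 2, 4)),
    .pred_low = 1 - .pred_high,
    .pred_class = factor(if_else(.pred_high > 0.5, "high", "low"),
                         levels = c("high", "low"))
  )

# Pass the numeric probability of the event level, not the factor .pred_class
roc_data <- preds %>%
  group_by(id) %>%
  roc_curve(high_price_indicator, .pred_high)

# One ROC curve per fold
autoplot(roc_data)
```

Under the same assumptions, the equivalent change in the code above would be to replace `.pred_class` with `.pred_high` in the `roc_curve()` call.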