Understanding the YeoJohnson transformation in R

Posted on August 19

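I am trying to perform a YeoJohnson transformation with caret and recipes, but I think I am not specifying the call correctly, or I am missing some additional arguments.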

library(tidyverse)
library(tidytuesdayR)

# Data is all numeric except for column 7
# get it from
# https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv
# or load it with tt_load()
spam <- tt_load(2023, week=33)$spam


# pre-process
pp_hpc <- caret::preProcess(spam[,1:6], 
                            method = c("center", "scale", "YeoJohnson"))
# fails to transform all variables: only 1 of the 6 gets a Yeo-Johnson lambda
pp_hpc
Created from 4601 samples and 6 variables

Pre-processing:
  - centered (6)
  - ignored (0)
  - scaled (6)
  - Yeo-Johnson transformation (1)

Lambda estimates for Yeo-Johnson transformation:
0

# I can apply the transformation, but it obviously does not do the expected transformation on all the columns
transformed <- predict(pp_hpc, newdata = spam[, 1:6])

Now trying with recipes:

# recipes package 
library(recipes)
# do I really need this just to transform the data?
rec <- recipe(
  yesno ~ .,
  data = spam
)

yj_transform <- step_YeoJohnson(rec, all_numeric())
# only transforms some variables
yj_estimates <- prep(yj_transform, verbose = TRUE)
yj_estimates

── Recipe ────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:   1
predictor: 6

── Training information 
Training data contained 4601 data points and no
incomplete rows.

── Operations 
• Yeo-Johnson transformation on: crl.tot, bang | Trained

Again, applying it works, but not all columns are transformed (I also did not center/scale here, since that is not the issue).

yj_te <- bake(yj_estimates, spam)

The bestNormalize package seems to have no problem here:

# works as expected
df_transformed <- select(spam, where(is.numeric)) %>% 
  mutate_all(.funs = function(x) predict(bestNormalize::yeojohnson(x), newdata = x))

Just in case, this is how I would do it in Python, or via reticulate:

# Python version
library(reticulate)
repl_python()
from sklearn import preprocessing
X = r.spam.drop('yesno', axis = 1)
scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
X = scaler.fit_transform(X)

> tidy(yj_estimates, number = 1)
# A tibble: 6 × 3
  terms        value id
  <chr>        <dbl> <chr>
1 crl.tot   0.000979 YeoJohnson_jKN6C
2 dollar  -13.1      YeoJohnson_jKN6C
3 bang     -3.88     YeoJohnson_jKN6C
4 money   -14.6      YeoJohnson_jKN6C
5 n000    -13.4      YeoJohnson_jKN6C
6 make    -11.0      YeoJohnson_jKN6C
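
The `value` column in the `tidy()` output above holds the estimated lambda for each column. To make it clearer what such an estimate represents, here is a minimal sketch in plain Python of the Yeo-Johnson transform for a fixed lambda, together with a brute-force grid search for the lambda that maximizes the normal profile log-likelihood. The function names (`yeo_johnson`, `log_likelihood`, `estimate_lambda`), the grid search itself, the search bounds, and the sample data are all illustrative assumptions. This is not how caret, recipes, bestNormalize, or scikit-learn implement the estimation.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value y for a fixed lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:  # lambda == 0 branch for non-negative values
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:  # lambda == 2 branch for negative values
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def log_likelihood(values, lam):
    """Normal profile log-likelihood of lambda, up to an additive constant."""
    n = len(values)
    transformed = [yeo_johnson(v, lam) for v in values]
    mean = sum(transformed) / n
    var = sum((t - mean) ** 2 for t in transformed) / n
    if var <= 0.0:  # degenerate case: the transform collapsed all values
        return float("-inf")
    # Jacobian term: (lambda - 1) * sum(sign(y) * log(|y| + 1))
    jacobian = (lam - 1.0) * sum(
        math.copysign(math.log1p(abs(v)), v) for v in values
    )
    return -0.5 * n * math.log(var) + jacobian


def estimate_lambda(values, lower=-5.0, upper=5.0, steps=1001):
    """Grid search for lambda; the bounds are hypothetical, chosen for illustration."""
    grid = [lower + i * (upper - lower) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))


# Made-up, right-skewed sample (not taken from the spam data)
sample = [0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 200.0]
lam_hat = estimate_lambda(sample)
print("estimated lambda:", lam_hat)
```

Because the search is bounded, an estimate sitting exactly on `lower` or `upper` only says that the likelihood is still rising at the edge of the grid, so it helps to check the bounds whenever an estimate looks extreme.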


