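I am trying to perform a Yeo-Johnson transformation using caret and recipes, but I think I am not specifying the call correctly, or I am missing some extra parameter.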
library(tidyverse)
library(tidytuesdayR)
# Data is all numeric except for column 7
# get it from
# https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv
# or load it with tt_load()
spam <- tt_load(2023, week=33)$spam
# pre-process
pp_hpc <- caret::preProcess(spam[, 1:6],
                            method = c("center", "scale", "YeoJohnson"))
# fails to transform all variables
pp_hpc
Created from 4601 samples and 6 variables
Pre-processing:
- centered (6)
- ignored (0)
- scaled (6)
- Yeo-Johnson transformation (1)
Lambda estimates for Yeo-Johnson transformation:
0
# I can apply the transformation, but it obviously doesn't transform all the columns as expected
transformed <- predict(pp_hpc, newdata = spam[, 1:6])
Now trying with recipes:
# recipes package
library(recipes)
# do I really need this just to transform the data?
rec <- recipe(
  yesno ~ .,
  data = spam
)
yj_transform <- step_YeoJohnson(rec, all_numeric())
# only transforms some variables
yj_estimates <- prep(yj_transform, verbose = TRUE)
yj_estimates
── Recipe ────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 6
── Training information
Training data contained 4601 data points and no
incomplete rows.
── Operations
• Yeo-Johnson transformation on: crl.tot, bang | Trained
Again, applying it works, but not all the columns are transformed (I am also not centering/scaling here, because that is not the issue).
yj_te <- bake(yj_estimates, spam)
The bestNormalize package does not seem to have any problem here:
# works as expected
df_transformed <- select(spam, where(is.numeric)) %>%
mutate_all(.funs = function(x) predict(bestNormalize::yeojohnson(x), newdata = x))
Just in case, this is how I would do it in Python, or using reticulate:
# Python version
library(reticulate)
repl_python()
from sklearn import preprocessing
X = r.spam.drop('yesno', axis = 1)
scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
X = scaler.fit_transform(X)
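For reference, the per-column transformation I expect can be sketched in plain Python. This is only a minimal sketch of the standard Yeo-Johnson formula for a fixed lambda. The helper name yeo_johnson and the sample values are made up for illustration and are not part of caret, recipes, bestNormalize, or scikit-learn, and the sketch does not estimate lambda. With lambda = 0, as in the caret output above, non-negative values reduce to log(x + 1).

```python
import math


def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson transformation to one value for a fixed lambda.

    Hypothetical helper for illustration only; it does not estimate lambda.
    """
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)                       # x >= 0, lambda == 0
        return ((x + 1.0) ** lam - 1.0) / lam          # x >= 0, lambda != 0
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)                         # x < 0, lambda == 2
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)  # x < 0, lambda != 2


# Made-up sample values, not taken from the spam data
values = [0.0, 1.0, 3.5, -2.0]

# lambda = 0: log(x + 1) for the non-negative values
transformed = [yeo_johnson(v, 0.0) for v in values]

# lambda = 1: the transformation leaves every value unchanged
identity = [yeo_johnson(v, 1.0) for v in values]
```

A column whose estimated lambda is close to 1 would therefore look untransformed even after the step has been applied, which is one thing worth checking when comparing the outputs.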