
My code is:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
car_data = pd.read_csv('car_data.csv')

# Create X
X = car_data.drop('Buy Rate', axis=1)

# Create Y
y = car_data['Buy Rate']

clf = RandomForestClassifier()
clf.get_params()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)

After the line with clf.fit, the following error pops up:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_51905/2395142735.py in ?()
----> 1 clf.fit(X_train, y_train)

~/Desktop/ml-course/env/lib/python3.10/site-packages/sklearn/base.py in ?(estimator, *args, **kwargs)
   1147                 skip_parameter_validation=(
   1148                     prefer_skip_nested_validation or global_skip_validation
   1149                 )
   1150             ):
-> 1151                 return fit_method(estimator, *args, **kwargs)

~/Desktop/ml-course/env/lib/python3.10/site-packages/sklearn/ensemble/_forest.py in ?(self, X, y, sample_weight)
    344         """
    345         # Validate or convert input data
    346         if issparse(y):
    347             raise ValueError("sparse multilabel-indicator for y is not supported.")
--> 348         X, y = self._validate_data(
    349             X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
    350         )
    351         if sample_weight is not None:

~/Desktop/ml-course/env/lib/python3.10/site-packages/sklearn/base.py in ?(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    617                 if "estimator" not in check_y_params:
    618                     check_y_params = {**default_check_params, **check_y_params}
    619                 y = check_array(y, input_name="y", **check_y_params)
    620             else:
--> 621                 X, y = check_X_y(X, y, **check_params)
    622             out = X, y
    623 
    624         if not no_val_X and check_params.get("ensure_2d", True):

~/Desktop/ml-course/env/lib/python3.10/site-packages/sklearn/utils/validation.py in ?(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1143         raise ValueError(
   1144             f"{estimator_name} requires y to be passed, but the target y is None"
   1145         )
   1146 
-> 1147     X = check_array(
   1148         X,
   1149         accept_sparse=accept_sparse,
   1150         accept_large_sparse=accept_large_sparse,

~/Desktop/ml-course/env/lib/python3.10/site-packages/sklearn/utils/validation.py in ?(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    914                         )
    915                     array = xp.astype(array, dtype, copy=False)
    916                 else:
    917                     array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
--> 918             except ComplexWarning as complex_warning:
    919                 raise ValueError(
    920                     "Complex data not supported\n{}\n".format(array)
    921                 ) from complex_warning

~/Desktop/ml-course/env/lib/python3.10/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp)
    376         # Use NumPy API to support order
    377         if copy is True:
    378             array = numpy.array(array, order=order, dtype=dtype)
    379         else:
--> 380             array = numpy.asarray(array, order=order, dtype=dtype)
    381 
    382         # At this point array is a NumPy ndarray. We convert it to an array
    383         # container that is consistent with the input's namespace.

~/Desktop/ml-course/env/lib/python3.10/site-packages/pandas/core/generic.py in ?(self, dtype)
   2082     def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   2083         values = self._values
-> 2084         arr = np.asarray(values, dtype=dtype)
   2085         if (
   2086             astype_is_view(values.dtype, arr.dtype)
   2087             and using_copy_on_write()

ValueError: could not convert string to float: 'Hyundai'

I have looked at similar questions posted here, but none of them helped.

Recommended answer

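The error occurs because some of the features in your X (namely Make and Color) are categorical, i.e. they hold strings such as 'Hyundai'. If you encode them as numeric variables with a label encoder,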

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
car_data['Make'] = le.fit_transform(car_data['Make'])
car_data['Color'] = le.fit_transform(car_data['Color'])

then running car_data.head(2) gives a result like this:

    Make    Year    Price   Mileage Color   Buy Rate
0   30      2018    20000   50000   1       0.80
1   13      2019    25000   40000   4       0.70

This fixes the string-to-float conversion error!
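A side note on the encoding step: the scikit-learn documentation describes LabelEncoder as meant for the target y, and offers OrdinalEncoder for feature columns. The sketch below shows the same idea with OrdinalEncoder. The small DataFrame is made-up data standing in for car_data.csv (an assumption, not the real file); only the column names Make, Year and Color come from the table above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows standing in for car_data.csv (assumption, not the real data)
car_data = pd.DataFrame({
    "Make": ["Hyundai", "Toyota", "Ford", "Toyota"],
    "Year": [2018, 2019, 2017, 2020],
    "Color": ["Red", "Blue", "Red", "White"],
})

# Find the non-numeric columns; these are the ones that trigger
# "could not convert string to float"
cat_cols = car_data.select_dtypes(exclude="number").columns.tolist()

# Encode all string columns in one call; each column gets its own
# mapping, with categories numbered in sorted order
encoder = OrdinalEncoder()
car_data[cat_cols] = encoder.fit_transform(car_data[cat_cols])

print(cat_cols)
print(car_data)
```

Either encoder assigns integers in sorted order of the category names, so the numbers carry no real meaning. Tree-based models usually cope with that, but it is worth keeping in mind.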

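However, because your target variable (Buy Rate) is continuous, training with RandomForestClassifier will still raise an error.

To do classification, you first need to put the target variable into bins: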

num_bins = 3
# Note: bin_boundaries is not passed to pd.cut below; bins=num_bins gives 3 equal-width bins
bin_boundaries = [0, 0.5, 0.75, 1]
car_data['Buy Rate'] = pd.cut(car_data['Buy Rate'], bins=num_bins, labels=False)
car_data['Buy Rate'] = car_data['Buy Rate'].map({0: 'Low', 1: 'Medium', 2: 'High'})

Result:

0     Medium
1     Medium
2        Low
.     
.
.
32      High
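Note that the snippet above defines bin_boundaries but calls pd.cut with bins=num_bins, so the bins are three equal-width intervals over the observed range of Buy Rate. If fixed cut points are what you want, a minimal sketch could look like the following. The Buy Rate values here are made up for illustration, and type_of_target is used only to show how scikit-learn sees the target before and after binning.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Made-up Buy Rate values (assumption, not the real data)
buy_rate = pd.Series([0.80, 0.70, 0.30, 0.95, 0.50])

# A float target is reported as "continuous", which classifiers reject
print(type_of_target(buy_rate))

# Explicit edges: [0, 0.5] -> Low, (0.5, 0.75] -> Medium, (0.75, 1] -> High
bin_boundaries = [0, 0.5, 0.75, 1]
binned = pd.cut(
    buy_rate,
    bins=bin_boundaries,
    labels=["Low", "Medium", "High"],
    include_lowest=True,  # so a value of exactly 0 falls in the first bin
)

print(binned.tolist())
# After binning, the target is a set of discrete class labels
print(type_of_target(binned.astype(str).tolist()))
```

Passing labels directly to pd.cut also removes the need for the separate .map() step.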

After binning, you can train on your data with the RandomForestClassifier.

import pandas as pd
from sklearn.model_selection import train_test_split
car_data = pd.read_csv('car_data.csv')

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
car_data['Make'] = le.fit_transform(car_data['Make'])
car_data['Color'] = le.fit_transform(car_data['Color'])

num_bins = 3
# Note: bin_boundaries is not passed to pd.cut below; bins=num_bins gives 3 equal-width bins
bin_boundaries = [0, 0.5, 0.75, 1]
car_data['Buy Rate'] = pd.cut(car_data['Buy Rate'], bins=num_bins, labels=False)
car_data['Buy Rate'] = car_data['Buy Rate'].map({0: 'Low', 1: 'Medium', 2: 'High'})

# Create X
X = car_data.drop('Buy Rate', axis=1)
# Create Y
y = car_data['Buy Rate'] # target variable

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.get_params()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)

In the end, this is how your code should be modified.
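If you would rather predict Buy Rate as a number than as Low/Medium/High, one alternative (not part of the fix above, just an option) is to skip the binning and use RandomForestRegressor. This is a minimal sketch on synthetic, already-encoded data; the values are randomly generated stand-ins for car_data.csv, so the predictions themselves mean nothing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, already-numeric stand-in for car_data.csv (assumption)
rng = np.random.default_rng(0)
n = 50
car_data = pd.DataFrame({
    "Make": rng.integers(0, 5, size=n),
    "Year": rng.integers(2010, 2023, size=n),
    "Price": rng.integers(10_000, 40_000, size=n),
    "Mileage": rng.integers(5_000, 120_000, size=n),
    "Color": rng.integers(0, 4, size=n),
    "Buy Rate": rng.uniform(0, 1, size=n),
})

X = car_data.drop("Buy Rate", axis=1)
y = car_data["Buy Rate"]  # left continuous, no binning

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The regressor accepts a continuous target directly
reg = RandomForestRegressor(random_state=0)
reg.fit(X_train, y_train)

predictions = reg.predict(X_test)
print(predictions[:3])
```

Whether classification on binned values or regression on the raw value fits better depends on what you need the prediction for.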
