I'm building a logistic regression model on a simple dataset in Python. [Image: simple dataset with 5 rows]

My goal is to predict whether or not someone survived. After cleaning the dataset and dropping the NaN values and the string columns, I used the following code to convert every column to float64 (the cleaned dataset is shown below as well). [Image: dataset cleaned and values converted to float64]

titanic_data['Survived'] = titanic_data['Survived'].astype(float)
titanic_data['Sibling/Spouse'] = titanic_data['Sibling/Spouse'].astype(float)
titanic_data['Parents/Children'] = titanic_data['Parents/Children'].astype(float)
titanic_data['male'] = titanic_data['male'].astype(float)
titanic_data['Q'] = titanic_data['Q'].astype(float)
titanic_data['S'] = titanic_data['S'].astype(float)
titanic_data[2] = titanic_data[2].astype(float)
titanic_data[3] = titanic_data[3].astype(float)

Output of the code above:

Survived            float64
Age                 float64
Sibling/Spouse      float64
Parents/Children    float64
Fare                float64
male                float64
Q                   float64
S                   float64
2                   float64
3                   float64
dtype: object

When I run the logistic regression code (shown below), I get the error "mixed type of string and non-string is not supported".

My regression code:

# Logistic regression
# Split the dataset

x = titanic_data.drop("Survived",axis=1)
y = titanic_data["Survived"]

from sklearn.model_selection import train_test_split
# train_test_split returns the splits in the order x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

But as you can see, I have already converted every column to the same data type, so why am I getting this error, and what can I do to fix it?

EDIT: The error message I got: [Image: error message]

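Recommended answer

The error you're seeing is not about the column contents but about the column names. Beware of naming columns with non-strings (for example, 0/1/2/3 used as quantile labels or one-hot encoding levels). scikit-learn's sanity checks expect all column names to be strings. To be safe, try: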

X.columns = X.columns.astype(str)
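
To see why the float64 dtypes do not help, it can be useful to inspect the types of the column names themselves. The snippet below is a minimal sketch on a hypothetical miniature of the asker's frame (the values are made up, and `column_name_types` is an illustrative helper, not a pandas or scikit-learn function). It only uses pandas and does not try to reproduce the scikit-learn error itself.

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame: string names plus the
# integer names 2 and 3, all holding float64 values (made-up data).
titanic_data = pd.DataFrame({
    "Survived": [0.0, 1.0, 1.0, 0.0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "male": [1.0, 0.0, 0.0, 1.0],
    2: [0.0, 0.0, 0.0, 0.0],
    3: [1.0, 0.0, 1.0, 1.0],
})

def column_name_types(df):
    """Illustrative helper: sorted names of the Python types used as column labels."""
    return sorted({type(name).__name__ for name in df.columns})

# The values are all float64, but the labels mix int and str.
types_before = column_name_types(titanic_data)

# Convert every label to a string; the column values are untouched.
titanic_data.columns = titanic_data.columns.astype(str)
types_after = column_name_types(titanic_data)

print(types_before, types_after)
```

In the asker's code the same conversion would be applied to `titanic_data` (or to `x`) before calling `fit`.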

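To avoid this kind of problem in the first place (rather than fixing it after the fact), use a more canonical way of manipulating and encoding the data, such as pd.get_dummies or similar. Here is a complete working example: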


# Fetch Titanic

from sklearn.datasets import fetch_openml
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True,)
dropped_cols = ['boat', 'body', 'home.dest', 'name', 'cabin', 'embarked', 'ticket']
X.drop(dropped_cols, axis=1, inplace=True)

# Encode (one-hot for categories) & impute (naive)

import pandas as pd
X = pd.get_dummies(X,columns=['sex', 'pclass'], drop_first=True)
y = y.astype(float)
X = X.fillna(0)

# Logistic regression

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
logreg.score(X,y) # 0.7868601986249045

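Here, get_dummies one-hot encodes only the named columns and prefixes each new column with the original column name, which keeps the column names as strings. X.columns looks like this: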

Index(['age', 'sibsp', 'parch', 'fare', 'sex_male', 'pclass_2.0',
       'pclass_3.0'],
      dtype='object')
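
The difference the prefix makes can be sketched on a tiny hypothetical frame (the `passengers` data below is invented for illustration). Calling `pd.get_dummies` on a bare integer Series yields integer column names, which is presumably how columns named 2 and 3 arose in the question; that is an assumption, since the question does not show that step. Passing the whole frame with `columns=[...]` yields prefixed string names instead.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic dataset.
passengers = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "pclass": [3, 1, 2],
})

# Encoding a bare Series: the new column labels are the raw integer levels.
bare = pd.get_dummies(passengers["pclass"], drop_first=True)
bare_names = [int(name) for name in bare.columns]

# Encoding through the frame: each label is "<column>_<level>", a string.
encoded = pd.get_dummies(passengers, columns=["sex", "pclass"], drop_first=True)
encoded_names = list(encoded.columns)

print(bare_names)
print(encoded_names)
```

If the bare-Series form is needed anyway, `pd.get_dummies` also accepts a `prefix` argument, so string names can still be produced at encoding time.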
