I created a function that flags missing values in a DataFrame.


import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Create a sample dataset
iris = load_iris()

df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                 columns= iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Here we replace all values of setosa with 'missing_value'
df = df.applymap(lambda x: 'missing_value' if x == 'setosa' else x)

# Here we want to create a flag for the missing values
def add_missing_value_flags(mydf, column):
    
    # Generate the new column name
    new_col = "missing_" + column
    
    # Flag rows where the column holds a placeholder string
    # that stands in for a missing value
    mydf[new_col]= np.where(mydf[column] == 'missing_value', True,
                            np.where(mydf[column] == '', True,
                            np.where(mydf[column] == 'N/A', True,  
                            np.where(mydf[column] == r'N\A', True,
                            np.where(mydf[column] == 'NA', True,
                            np.where(mydf[column] == 'N.A.', True,
                            np.where(mydf[column] == 'NONE', True,
                            np.where(mydf[column] == '.', True, 
                            np.where(mydf[column].str.len() == 1, True, 
                            np.where(mydf[column] == '..', True, False))))))))))
    
    return mydf


add_missing_value_flags(df, 'species')

     sepal length (cm)  sepal width (cm)  ...        species  missing_species
0                  5.1               3.5  ...  missing_value             True
1                  4.9               3.0  ...  missing_value             True
2                  4.7               3.2  ...  missing_value             True
3                  4.6               3.1  ...  missing_value             True
4                  5.0               3.6  ...  missing_value             True
..                 ...               ...  ...            ...              ...
145                6.7               3.0  ...      virginica            False
146                6.3               2.5  ...      virginica            False
147                6.5               3.0  ...      virginica            False
148                6.2               3.4  ...      virginica            False
149                5.9               3.0  ...      virginica            False

Is there a way in Python to apply my function to the rest of my columns, similar to: mydf[mydf.columns[mydf.columns.str.contains('species|plant|earth')]].apply...

Recommended answer

# Create a sample dataset
iris = load_iris()

df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                 columns= iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Here we replace all values of setosa with 'missing_value'
df = df.applymap(lambda x: 'missing_value' if x == 'setosa' else x)


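import re


def function_to_apply(series):
    # Skip flag columns that a previous run already added
    if not re.match("missing_", series.name):
        new_column_name = "missing_" + series.name
        # Flag the placeholder strings that stand in for missing values
        new_column_values = series.isin([
            'missing_value', '', 'N/A', r'N\A',
            'NA', 'N.A.', 'NONE', '.', '..'
        ])
        try:
            # Also flag single-character strings
            new_column_values = new_column_values | (series.str.len() == 1)
        except AttributeError:
            # Non-string columns have no .str accessor; keep the isin flags
            pass
        # Side effect: writes the new column into the global df
        df[new_column_name] = new_column_values
    return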

To modify df in place, I wrote function_to_apply so that it returns None, which is why:

df.apply(function_to_apply)
#RETURNS
#sepal length (cm)    None
#sepal width (cm)     None
#petal length (cm)    None
#petal width (cm)     None
#target               None
#species              None
#dtype: object

However, applying this function has the side effect of adding the new missing_* flag columns to df.

I know this is not the cleanest solution, but it works and is relatively fast.
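
For comparison, here is a minimal sketch of a variant that avoids the side effect and only touches columns whose names match a pattern, as the question asks. It is one possible approach, not the method used in the answer above. The names PLACEHOLDERS and flag_missing and the small sample frame are hypothetical and chosen only for illustration. The sketch selects columns with DataFrame.filter(regex=...), computes the flags per column, and joins them back under a "missing_" prefix.

```python
import pandas as pd

# Placeholder strings treated as missing (copied from the question's list).
PLACEHOLDERS = ['missing_value', '', 'N/A', r'N\A', 'NA', 'N.A.', 'NONE', '.', '..']


def flag_missing(series):
    """Return a boolean Series, True where `series` holds a placeholder."""
    in_list = series.isin(PLACEHOLDERS)
    # Like the original function, also flag any one-character string.
    # Non-string values (numbers, NaN) are never flagged by this check.
    one_char = series.astype(object).map(
        lambda v: isinstance(v, str) and len(v) == 1
    )
    return in_list | one_char


# Hypothetical sample data, small enough to check by eye.
df = pd.DataFrame({
    "species": ["missing_value", "virginica", "N/A", "x"],
    "plant_height": [1.2, 3.4, 5.6, 7.8],
    "notes": ["ok", "", "fine", "NA"],
})

# Keep only the columns whose names match the pattern from the question.
selected = df.filter(regex="species|plant|earth")

# Build the flag columns without mutating df, then join them on the index.
flags = selected.apply(flag_missing).add_prefix("missing_")
result = df.join(flags)
print(result)
```

Because flag_missing returns a Series instead of writing into a global, df itself stays unchanged and the flags can be inspected or discarded before joining. The "notes" column gets no flag column here because its name does not match the pattern.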

P.S. To run this code you need import re in addition to the libraries imported in the question.
