I am trying to winsorize a dataset, and I am doing it in several steps.

First: I need to winsorize on a ratio, namely each item divided by TotalAssets (a column in my dataset).

FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'])

Then I reuse the same expression (to use less memory), extract the values as a NumPy array, and apply winsorization (I need to remove the top/bottom 5%).

winsorize(FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'], axis=0).values,limits=[0.05,0.05])
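
To make this step concrete, here is a minimal sketch on made-up data (the `ratios` array is purely illustrative, not my real data). As far as I understand it, `scipy.stats.mstats.winsorize` does not drop the extreme observations; it replaces them with the nearest value that falls inside the limits, and it returns a masked array of the same length.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up ratios: 20 values, so 5% on each side corresponds to one value.
ratios = np.arange(1.0, 21.0)  # 1.0, 2.0, ..., 20.0

# Clip the lowest 5% and the highest 5% of the values.
clipped = winsorize(ratios, limits=[0.05, 0.05])

# The array keeps its length; only the extreme values are replaced.
print(len(clipped), clipped.min(), clipped.max())
```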

Then I need to turn it back into a DataFrame. This is where the error actually occurs.

pd.DataFrame(winsorize(FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'],axis=0).values,limits=[0.05,0.05]),columns=item)

Then I multiply it by FirmMonthlyAccountingData['totalAssets'] so that I get the original values back.

Copy_of_firmmonthlydata[item]=pd.DataFrame(winsorize(FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'],axis=0).values,limits=[0.05,0.05]),columns=item)*FirmMonthlyAccountingData['totalAssets']

Finally, I need to do this for all columns in a for loop, to save as much memory as possible.

columns_to_winsorize= ['Mcap', 'first', 'second', 'third']

for item in columns_to_winsorize:
    Copy_of_firmmonthlydata[item]=pd.DataFrame(winsorize(FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'], axis=0).values,limits=[0.05,0.05]),columns=item)*FirmMonthlyAccountingData['totalAssets']

But I get this error:

  TypeError                                 Traceback (most recent call last)
Cell In[27], line 10
      3 columns_to_winsorize= ['Mcap', 'first', 'second']
      9 for item in columns_to_winsorize:
---> 10     Copy_of_firmmonthlydata=pd.DataFrame(winsorize(FirmMonthlyAccountingData[f'{item}'].div(FirmMonthlyAccountingData['TotalAssets'], axis=0).values,limits=[0.05,0.05]),columns=item)*FirmMonthlyAccountingData['totalAssets']

File c:\Users\anaconda3\envs\PythonCourse2023\Lib\site-packages\pandas\core\frame.py:722, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    720     # a masked array
    721     data = sanitize_masked_array(data)
--> 722     mgr = ndarray_to_mgr(
    723         data,
    724         index,
    725         columns,
    726         dtype=dtype,
    727         copy=copy,
    728         typ=manager,
    729     )
    731 elif isinstance(data, (np.ndarray, Series, Index, ExtensionArray)):
    732     if data.dtype.names:
    733         # i.e. numpy structured array

File c:\Users\anaconda3\envs\PythonCourse2023\Lib\site-packages\pandas\core\internals\construction.py:333, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    324     values = sanitize_array(
    325         values,
    326         None,
   (...)
    329         allow_2d=True,
    330     )
    332 # _prep_ndarraylike ensures that values.ndim == 2 at this point
--> 333 index, columns = _get_axes(
    334     values.shape[0], values.shape[1], index=index, columns=columns
    335 )
    337 _check_values_indices_shape_match(values, index, columns)
    339 if typ == "array":

File c:\Users\anaconda3\envs\PythonCourse2023\Lib\site-packages\pandas\core\internals\construction.py:738, in _get_axes(N, K, index, columns)
    736     columns = default_index(K)
    737 else:
--> 738     columns = ensure_index(columns)
    739 return index, columns
...
   5066         f"{cls.__name__}(...) must be called with a collection of some "
   5067         f"kind, {repr(data)} was passed"
   5068     )

TypeError: Index(...) must be called with a collection of some kind, 'Mcap' was passed

Any help would be greatly appreciated.

Recommended answer

A DataFrame is not needed here; also convert the TotalAssets column to a NumPy array:

for item in columns_to_winsorize:
    Copy_of_firmmonthlydata[item]= winsorize(FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'], axis=0).values,limits=[0.05,0.05]) *FirmMonthlyAccountingData['TotalAssets'].values
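
As for why the original call fails: the `columns` argument of `pd.DataFrame` expects a list-like, so passing the bare string `item` raises the `TypeError` shown in the question, whereas `columns=[item]` would be accepted. Below is a minimal sketch of the array-based loop on made-up data. The frame `df`, its values, and the `result` copy are hypothetical stand-ins for `FirmMonthlyAccountingData` and `Copy_of_firmmonthlydata`.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical stand-in for FirmMonthlyAccountingData.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "TotalAssets": rng.uniform(50, 150, size=40),
    "Mcap": rng.uniform(10, 500, size=40),
    "first": rng.uniform(1, 20, size=40),
})
result = df.copy()

for item in ["Mcap", "first"]:
    # Ratio of the item to total assets, as a plain NumPy array.
    ratio = df[item].div(df["TotalAssets"]).to_numpy()
    clipped = winsorize(ratio, limits=[0.05, 0.05])
    # np.asarray turns the masked array into a plain ndarray; multiplying
    # two arrays is purely positional, so no index alignment is involved.
    result[item] = np.asarray(clipped) * df["TotalAssets"].to_numpy()

print(result.head())
```

Working with arrays on both sides of the multiplication keeps everything positional, which is why no intermediate DataFrame is required.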

Or use a Series:

for item in columns_to_winsorize:
    Copy_of_firmmonthlydata[item]=pd.Series(winsorize(FirmMonthlyAccountingData[item].div(FirmMonthlyAccountingData['TotalAssets'], axis=0).values,limits=[0.05,0.05]))*FirmMonthlyAccountingData['TotalAssets']
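
One caveat with the Series variant: `pd.Series(array)` gets a default 0..n-1 index, and pandas aligns on index labels when multiplying two Series. If the frame does not have a default index (for example after filtering rows), the product can come out as NaN. A small sketch with made-up numbers, where `assets` and `values` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index (e.g. after filtering rows).
assets = pd.Series([100.0, 200.0, 300.0], index=[10, 11, 12])
values = np.array([0.5, 0.25, 0.125])

# Default index 0, 1, 2 does not match 10, 11, 12, so nothing lines up.
misaligned = pd.Series(values) * assets

# Reusing the original index keeps the rows matched.
aligned = pd.Series(values, index=assets.index) * assets

print(misaligned.isna().all())
print(aligned.tolist())
```

If the index is not the default one, passing `index=FirmMonthlyAccountingData.index` to `pd.Series` should avoid this.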
   
