I have two dataframes, primary_tumor_df and healthy_tissue_df, on which I want to perform a Mann-Whitney U test. I have also removed the NaN values from both dataframes.

Structure of primary_tumor_df: [image: primary tumor dataframe]

Structure of healthy_tissue_df: [image: healthy tissue dataframe]

primary_tumor_df.dropna(inplace=True)
healthy_tissue_df.dropna(inplace=True)

This shows that there are no NaN or null values. [images: checking for any NaN values in primary_tumor_df and in healthy_tissue_df]

But when I run the test, it gives me the following error:

from scipy.stats import mannwhitneyu
p_value_dict = {}
for gene in primary_tumor_df.columns:
    stats, p_value = mannwhitneyu(primary_tumor_df[gene],
                                  healthy_tissue_df[gene],
                                  alternative='two-sided')
    p_value_dict[gene] = p_value

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 4
      2 p_value_dict = {}
      3 for gene in primary_tumor_df.columns:
----> 4     stats, p_value = mannwhitneyu(primary_tumor_df[gene],
      5                                  healthy_tissue_df[gene],
      6                                  alternative='two-sided')
      7     p_value_dict[gene] = p_value
      9 # converting into DataFrame

File ~/.local/lib/python3.10/site-packages/scipy/stats/_axis_nan_policy.py:502, in _axis_nan_policy_factory.<locals>.axis_nan_policy_decorator.<locals>.axis_nan_policy_wrapper(***failed resolving arguments***)
    500 if sentinel:
    501     samples = _remove_sentinel(samples, paired, sentinel)
--> 502 res = hypotest_fun_out(*samples, **kwds)
    503 res = result_to_tuple(res)
    504 res = _add_reduced_axes(res, reduced_axes, keepdims)

File ~/.local/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py:460, in mannwhitneyu(x, y, use_continuity, alternative, axis, method)
    249 @_axis_nan_policy_factory(MannwhitneyuResult, n_samples=2)
    250 def mannwhitneyu(x, y, use_continuity=True, alternative="two-sided",
    251                  axis=0, method="auto"):
    252     r'''Perform the Mann-Whitney U rank test on two independent samples.
    253 
    254     The Mann-Whitney U test is a nonparametric test of the null hypothesis
   (...)
    456 
    457     '''
    459     x, y, use_continuity, alternative, axis_int, method = (
--> 460         _mwu_input_validation(x, y, use_continuity, alternative, axis, method))
    462     x, y, xy = _broadcast_concatenate(x, y, axis)
    464     n1, n2 = x.shape[-1], y.shape[-1]

File ~/.local/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py:200, in _mwu_input_validation(x, y, use_continuity, alternative, axis, method)
    198 # Would use np.asarray_chkfinite, but infs are OK
    199 x, y = np.atleast_1d(x), np.atleast_1d(y)
--> 200 if np.isnan(x).any() or np.isnan(y).any():
    201     raise ValueError('`x` and `y` must not contain NaNs.')
    202 if np.size(x) == 0 or np.size(y) == 0:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Why does it raise an error even though there are no NaN values in the dataframes?

Recommended answer

The problem is that at least one column of primary_tumor_df or healthy_tissue_df has object dtype, not that any of the columns contains NaN.

You can tell because the line that ultimately raises the error:

if np.isnan(x).any() or np.isnan(y).any():

is checking for NaNs in the inputs x and y of mannwhitneyu, and it complains:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In plain English, this just says "isnan doesn't work on the data type you provided, and the data can't safely be converted to a type that isnan does work on."

This error is not raised for numeric data types:

import numpy as np
for dtype in [np.uint8, np.int16, np.float32, np.complex64]:
    x = np.arange(10).astype(dtype)  # use the loop's dtype each time
    np.isnan(x)  # no error

regardless of whether they contain NaNs:

y = x.copy()
y[0] = np.nan
np.isnan(y)  # no error

After all, the purpose of isnan is to find NaNs and report their locations in a boolean array.

The problem is with non-numeric data types.

x = np.asarray(x, dtype=object)
np.isnan(x)  # error
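
This also suggests why the NaN checks in the question can pass while SciPy still fails: pandas' own missing-value check accepts object columns, whereas NumPy's isnan ufunc does not. A minimal sketch, assuming a made-up column `col` that holds ordinary floats but is stored with object dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical column: numeric values stored with object dtype.
col = pd.Series([0.1, 0.2, 0.3], dtype=object)

# pandas' missing-value check works on object dtype and finds nothing ...
has_missing = bool(col.isna().any())

# ... but NumPy's isnan ufunc has no implementation for object arrays.
try:
    np.isnan(np.asarray(col))
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(col.dtype, has_missing, isnan_failed)
```

So a clean result from isna() or dropna() says nothing about whether the dtype is one that isnan can handle.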

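The error says the data can't safely be converted to a type that isnan works on, but that doesn't mean it can't be converted at all. If the data really is numeric but pandas is storing it as some more general object type, you should be able to solve the problem by converting it to a floating-point type before passing it into SciPy.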

import numpy as np
import pandas as pd
from scipy import stats
rng = np.random.default_rng(435982435982345)
primary_tumor_df = pd.DataFrame(rng.random((10, 3)).astype(object))
healthy_tissue_df = pd.DataFrame(rng.random((10, 3)).astype(object))

# generates your error:
# for gene in primary_tumor_df.columns:
#     res = stats.mannwhitneyu(primary_tumor_df[gene],
#                              healthy_tissue_df[gene],
#                              alternative='two-sided')

# no error    
for gene in primary_tumor_df.columns:
    res = stats.mannwhitneyu(primary_tumor_df[gene].astype(np.float64),
                             healthy_tissue_df[gene].astype(np.float64),
                             alternative='two-sided')
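
If astype(np.float64) itself fails, the object columns probably hold something that is not a number, such as a stray string. The following is only a sketch of one way to locate such entries, using a made-up frame `df` with hypothetical column names; it relies on pd.to_numeric with errors="coerce", which turns unparseable values into NaN:

```python
import pandas as pd

# Hypothetical frame: "gene_a" is float64, "gene_b" is stored as object
# and contains numbers as strings plus one unparseable entry.
df = pd.DataFrame({
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": pd.Series(["4.0", "5.5", "n/a"], dtype=object),
})

# Columns that pandas stores as generic Python objects.
object_cols = [c for c in df.columns if df[c].dtype == object]

# Coerce those columns to numbers; anything unparseable becomes NaN.
converted = df[object_cols].apply(pd.to_numeric, errors="coerce")

# Entries that were present before but are NaN after conversion.
bad_mask = converted.isna() & df[object_cols].notna()

print(object_cols)
print(df[object_cols][bad_mask].stack())
```

Entries flagged in `bad_mask` would need to be cleaned or dropped before running the test, since coercion replaces them with NaN.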

That said, you don't even need the for loop. mannwhitneyu is vectorized, and by default it works along axis=0 (your columns).

res = stats.mannwhitneyu(primary_tumor_df.astype(np.float64),
                         healthy_tissue_df.astype(np.float64),
                         alternative='two-sided')
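
The vectorized call returns one p-value per column, in column order, so the per-gene dictionary from the question can be rebuilt by labelling res.pvalue with the column names. A minimal sketch on random data, where the gene names and the seed are placeholders rather than anything from the question:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
genes = ["gene_a", "gene_b", "gene_c"]  # hypothetical column names
primary_tumor_df = pd.DataFrame(rng.random((10, 3)), columns=genes).astype(object)
healthy_tissue_df = pd.DataFrame(rng.random((10, 3)), columns=genes).astype(object)

# Convert to float arrays up front, then test all columns in one call.
res = stats.mannwhitneyu(primary_tumor_df.to_numpy(dtype=np.float64),
                         healthy_tissue_df.to_numpy(dtype=np.float64),
                         alternative='two-sided')

# Label each p-value with its gene.
p_values = pd.Series(res.pvalue, index=primary_tumor_df.columns, name="p_value")
p_value_dict = p_values.to_dict()
print(p_values)
```

This assumes both dataframes have the same columns in the same order; if they might not, reindexing one frame to the other's columns first would be a sensible precaution.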
