Python 如何使用pandasDataFrames和scipy高度优化相关性计算

发布于04月20日

我想计算pandas DataFrame列的皮尔逊相关性.我不仅仅想使用DataFrame.corr()，因为我还需要相关性的p值;因此，我使用scipy.stats.pearsonr(x, y).我现在的问题是我的 pyramid 很大(形状:(1166，49262))，所以我正在寻找(49262 ' 2 -49262)/2相关性.

请建议如何优化它以减少计算时间.

我的相关性代码:

# the variable `data` contains the dataframe of shape (1166, 49262)

# setting up output dataframes
dfcols = pd.DataFrame(columns=data.columns)
correlation = dfcols.T.join(dfcols, how='outer')
pvalues = correlation.copy()
# pairwise calculation
for r in range(len(data.columns)):
    for c in range(r+1, len(data.columns)):
        # iterate over all combinations of columns to calculate correlation
        tmp = input.iloc[:, [r,c]].dropna()
        if len(tmp) < 2:
            # too few data points to calculate correlation coefficient
            result = (0, 1) 
        else:
            result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1])
        correlation.iloc[r, c] = result[0]
        pvalues.iloc[r, c] = result[1]

一些注意事项:

我还愿意接受scipy以外的软件包的建议;我只需要相关性的p值.
我认为通过integer索引而不是列名迭代列更快;有人能确认/否认这一点吗？
数据帧有很多缺失的数据，因此I .dropna()并捕捉剩余数据点少于两个的情况.
简单地迭代框架并成对提取列将需要超过16.5天的时间.甚至没有做任何计算.(使用以下代码从前5次完整通过推断)

def foo():
    data = load_df()        # the pd.DataFrame of shape (1166, 49262)
    cols = data.columns
    for i in range(len(cols)):
        logging.info(f"{i+1}/{len(cols)}")
        for j in range(i+1, len(cols)):
            tmp = data.iloc[:, [i, j]].dropna()
            if len(tmp) < 2:
                # You may ignore this for this post; I was looking for columns pairs with too few data points to correlate
                logging.warn(f"correlating columns '{cols[i]}' and '{cols[j]}' results in less than 2 usable data points")

foo()

我认为我可以使用多线程来至少使用更多线程来进行相关性计算.
以防有人认为这一点很重要:我正在处理的数据是一个蛋白质组数据集，包含约50，000个肽和1166名患者;我想以成对的方式将所有患者的肽表达关联起来.

import pandas as pd import numpy as np from scipy.stats import pearsonr import scipy M = 1000 N = 200 P = 0.1 A = np.random.rand(M, N) A[np.random.rand(M, N) < P] = np.nan df = pd.DataFrame(A, columns=[f"a{i}" for i in range(1, N + 1)]) # setting up output dataframes def calculate_corr(data): dfcols = pd.DataFrame(columns=data.columns) correlation = dfcols.T.join(dfcols, how='outer') pvalues = correlation.copy() # pairwise calculation for r in range(len(data.columns)): for c in range(r, len(data.columns)): # iterate over all combinations of columns to calculate correlation tmp = data.iloc[:, [r,c]].dropna() if len(tmp) < 2: # too few data points to calculate correlation coefficient result = (0, 1) else: result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1]) correlation.iloc[r, c] = result[0] correlation.iloc[c, r] = result[0] pvalues.iloc[r, c] = result[1] pvalues.iloc[c, r] = result[1] return correlation, pvalues def get_pvalue_vectorized(r, ab, alternative='two-sided'): """Get p-value from beta dist given the statistic, and alternative.""" assert len(r.shape) == 2 assert len(ab.shape) == 2 diag = np.arange(r.shape[0]) # This is just to keep squareform happy. These don't actually # get sent to the beta distribution function. r[diag, diag] = 0 ab[diag, diag] = 0 # Avoid doing repeated computations of r,c and c,r rsq = scipy.spatial.distance.squareform(r) r[diag, diag] = 1 absq = scipy.spatial.distance.squareform(ab) kwargs = dict(a=absq, b=absq, loc=-1, scale=2) if alternative == 'less': pvalue = scipy.stats.beta.cdf(rsq, **kwargs) elif alternative == 'greater': pvalue = scipy.stats.beta.sf(rsq, **kwargs) elif alternative == 'two-sided': pvalue = 2 * (scipy.stats.beta.sf(np.abs(rsq), **kwargs)) else: message = "`alternative` must be 'less', 'greater', or 'two-sided'." raise ValueError(message) # Put back into 2d matrix pvalue = scipy.spatial.distance.squareform(pvalue) return pvalue def calculate_corr_fast(data): correlation = data.corr() # For each pair of data values, count how many cases where both data values are # defined at the same position, using matrix multiply as a pairwise boolean and. data_notna = data.notna().values.astype('float32') value_count = data_notna.T @ data_notna # This is the a and b parameter for the beta distribution. It is different # for every p-value, because each one can potentially have a different number # of missing values ab = value_count / 2 - 1 pvalues = get_pvalue_vectorized(correlation.values, ab) invalid = value_count < 2 pvalues[invalid] = np.nan # Convert back to dataframe pvalues = pd.DataFrame(pvalues, columns=correlation.columns, index=correlation.index) return correlation, pvalues correlation, pvalues = calculate_corr(df) correlation_fast, pvalues_fast = calculate_corr_fast(df) assert np.allclose(pvalues_fast.values.astype('float64'), pvalues.values.astype('float64')) assert np.allclose(correlation_fast.values.astype('float64'), correlation.values.astype('float64'))

Original code 40.5 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each) @Andrej Kesely's answer ~20 seconds My answer 190 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Python 如何使用pandasDataFrames和scipy高度优化相关性计算

推荐答案

Python相关问答推荐

如何将带有逗号分隔的数字的字符串解析为int Array？

当值是一个integer时，在Python中使用JMESPath来验证字典中的值(例如：1)

添加包含中具有任何值的其他列的计数的列

使用scipy. optimate.least_squares()用可变数量的参数匹配两条曲线

如何在Deliveryter笔记本中从同步上下文正确地安排和等待Delivercio代码中的结果？

Pythind 11无法弄清楚如何访问tuple元素

如何在具有重复数据的pandas中对groupby进行总和，同时保留其他列

对Numpy函数进行载体化

按顺序合并2个词典列表

Streamlit应用程序中的Plotly条形图中未正确显示Y轴刻度

在Python中动态计算范围

Pre—Commit MyPy无法禁用非错误消息

组/群集按字符串中的子字符串或子字符串中的字符串轮询数据框

Pandas Loc Select 到NaN和值列表

Pandas Data Wrangling/Dataframe Assignment

在Python中使用if else或使用regex将二进制数据如111转换为001""

Flash只从html表单中获取一个值

在代码执行后关闭ChromeDriver窗口

根据客户端是否正在传输响应来更改基于Flask的API的行为

一个telegram 机器人应该发送一个测验如何做？""