By using pd.corr() and then converting the R values into probabilities with the beta distribution, you can get roughly a 200x speedup.

I suggest implementing this by looking at how SciPy does it, and seeing whether there are any simplifications that apply to your case. The source code can be found here. It tells you exactly how they implement the p-value. Specifically, they take a beta distribution with a = b = n / 2 - 1, ranging from -1 to 1, and find either the cumulative distribution function or the survival function of that distribution at the given R value.
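As a quick sanity check, here is a minimal sketch (with made-up sample data; `p_manual` is just an illustrative name) showing that the two-sided p-value reported by pearsonr() can be reproduced directly from that beta distribution:

```python
import numpy as np
from scipy.stats import beta, pearsonr

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = x + rng.normal(size=50)

r, p = pearsonr(x, y)

# Beta distribution with a = b = n/2 - 1, stretched onto [-1, 1]
# via loc=-1, scale=2; two-sided p-value is twice the survival
# function evaluated at |r|.
n = len(x)
ab = n / 2 - 1
p_manual = 2 * beta.sf(abs(r), a=ab, b=ab, loc=-1, scale=2)

print(p, p_manual)  # the two p-values agree
```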
So while pearsonr() doesn't support vectorization across all pairs of columns, the underlying beta distribution does. Using this, you can turn the correlations that pd.corr() gives you into correlations plus p-values.
I've checked this against your existing algorithm, and it agrees with it to within machine epsilon. I also tested it with NA values.

Speed-wise, it is roughly 200x faster than the original solution, and it beats the multi-core solution while using only one core.
Here is the code. Note that only calculate_corr_fast and get_pvalue_vectorized matter to the solution. The rest just sets up test data or does comparisons.
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import scipy
M = 1000  # rows
N = 200   # columns
P = 0.1   # fraction of values to replace with NaN
A = np.random.rand(M, N)
A[np.random.rand(M, N) < P] = np.nan
df = pd.DataFrame(A, columns=[f"a{i}" for i in range(1, N + 1)])
# setting up output dataframes
def calculate_corr(data):
    dfcols = pd.DataFrame(columns=data.columns)
    correlation = dfcols.T.join(dfcols, how='outer')
    pvalues = correlation.copy()
    # pairwise calculation
    for r in range(len(data.columns)):
        for c in range(r, len(data.columns)):
            # iterate over all combinations of columns to calculate correlation
            tmp = data.iloc[:, [r, c]].dropna()
            if len(tmp) < 2:
                # too few data points to calculate correlation coefficient
                result = (0, 1)
            else:
                result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1])
            correlation.iloc[r, c] = result[0]
            correlation.iloc[c, r] = result[0]
            pvalues.iloc[r, c] = result[1]
            pvalues.iloc[c, r] = result[1]
    return correlation, pvalues
def get_pvalue_vectorized(r, ab, alternative='two-sided'):
    """Get p-value from beta dist given the statistic, and alternative."""
    assert len(r.shape) == 2
    assert len(ab.shape) == 2
    diag = np.arange(r.shape[0])
    # This is just to keep squareform happy. These don't actually
    # get sent to the beta distribution function.
    r[diag, diag] = 0
    ab[diag, diag] = 0
    # Avoid doing repeated computations of r,c and c,r
    rsq = scipy.spatial.distance.squareform(r)
    r[diag, diag] = 1
    absq = scipy.spatial.distance.squareform(ab)
    kwargs = dict(a=absq, b=absq, loc=-1, scale=2)
    if alternative == 'less':
        pvalue = scipy.stats.beta.cdf(rsq, **kwargs)
    elif alternative == 'greater':
        pvalue = scipy.stats.beta.sf(rsq, **kwargs)
    elif alternative == 'two-sided':
        pvalue = 2 * scipy.stats.beta.sf(np.abs(rsq), **kwargs)
    else:
        message = "`alternative` must be 'less', 'greater', or 'two-sided'."
        raise ValueError(message)
    # Put back into 2d matrix
    pvalue = scipy.spatial.distance.squareform(pvalue)
    return pvalue
def calculate_corr_fast(data):
    correlation = data.corr()
    # For each pair of data values, count how many cases where both data values are
    # defined at the same position, using matrix multiply as a pairwise boolean and.
    data_notna = data.notna().values.astype('float32')
    value_count = data_notna.T @ data_notna
    # This is the a and b parameter for the beta distribution. It is different
    # for every p-value, because each one can potentially have a different number
    # of missing values
    ab = value_count / 2 - 1
    pvalues = get_pvalue_vectorized(correlation.values, ab)
    invalid = value_count < 2
    pvalues[invalid] = np.nan
    # Convert back to dataframe
    pvalues = pd.DataFrame(pvalues, columns=correlation.columns, index=correlation.index)
    return correlation, pvalues
correlation, pvalues = calculate_corr(df)
correlation_fast, pvalues_fast = calculate_corr_fast(df)
assert np.allclose(pvalues_fast.values.astype('float64'), pvalues.values.astype('float64'))
assert np.allclose(correlation_fast.values.astype('float64'), correlation.values.astype('float64'))
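To see why the matrix multiply in calculate_corr_fast counts pairwise overlaps, here is a minimal sketch with a tiny made-up array (`present` and `counts` are just illustrative names):

```python
import numpy as np

A = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0,    np.nan],
              [7.0, 8.0,    9.0]])

# 1.0 where a value is present, 0.0 where it is missing
present = (~np.isnan(A)).astype('float32')

# counts[i, j] = number of rows where column i AND column j
# are both defined; the diagonal holds per-column counts
counts = present.T @ present
print(counts)
```

Each entry of the product sums 1*1 terms over rows, which is exactly a boolean AND followed by a count.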
Benchmark results for a 1000 x 200 dataframe:
Original code
40.5 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Andrej Kesely's answer
~20 seconds
My answer
190 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Benchmark notes:

- I used a dataframe with 1000 rows and 200 columns. I tried to use roughly the same number of rows to preserve the split of work between finding the correlations and computing the p-values. I reduced the number of columns so that the benchmark would finish within my lifetime. :)
- My benchmark system has 4 cores. Specifically, it is an Intel(R) Core(TM) i7-8850H CPU with some of its cores unavailable.