Below I have two versions of code that produce the same output. I'm running through a list of possible number combinations, finding which filters make each combination true, and then keeping only the unique filters (for example, filter (5,6) is not unique because it "passes" all the same combinations that filter (5,7) does).

I read that vectorization can speed things up, so I tried it after the simple for loop below. The vectorized version is much slower. So much slower that I must be doing something wrong (though it still produces the same, correct result).

Any help would be appreciated. (Also, I'm not sure why, in the output below, finding the unique values takes so long after vectorizing.)
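The two-stage deduplication described above can be shown on toy data (the filters and combinations here are made up purely for illustration, not taken from the question's frame):

```python
import pandas as pd

# Toy "pass" records: two filters pass combination (5, 6), one passes (5, 7).
passes = pd.DataFrame({
    'combination': [(5, 6), (5, 6), (5, 7)],
    'lower': [5, 5, 5],
    'upper': [6, 7, 7],
})

# Keep one filter per combination, then one row per distinct filter.
unique_filters = (passes.drop_duplicates(subset=['combination'])
                        .drop_duplicates(subset=['lower', 'upper']))
print(len(unique_filters))  # 2
```

Because the frame is sorted first in the real code, `keep='first'` decides which of several equivalent filters survives.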

#########################################
### Compare for loop to vectorization ###
#########################################
import pandas as pd
import itertools
import numpy as np
import time

size_value = 8
#df = pd.DataFrame({'var_1_lower_limit':[1,2,1,3,2],'var_1_upper_limit':[5,5,4,7,6],'var_2_lower_limit':[5,5,3,6,4],'var_2_upper_limit':[9,9,8,9,7],'var_3_lower_limit':[6,7,4,2,4],'var_3_upper_limit':[8,8,6,8,7]})
df = pd.DataFrame({'var_1_lower_limit':np.random.randint(0,5,size=size_value),'var_1_upper_limit':np.random.randint(5,10,size=size_value),'var_2_lower_limit':np.random.randint(0,5,size=size_value),'var_2_upper_limit':np.random.randint(5,10,size=size_value),'var_3_lower_limit':np.random.randint(0,5,size=size_value),'var_3_upper_limit':np.random.randint(5,10,size=size_value)})
df.sort_values(['var_1_lower_limit','var_1_upper_limit','var_2_lower_limit','var_2_upper_limit','var_3_lower_limit','var_3_upper_limit'],ascending=True, inplace=True) # I think important to sort here
display(df)


########## for loop
checkpoint_01 = time.perf_counter()
print('Beginning for loop run...')
default_possibilities = [0,1,2,3,4,5,6,7,8,9]
possibilities = list(itertools.product(default_possibilities,default_possibilities,default_possibilities,default_possibilities,default_possibilities,default_possibilities)) #Creates all combinations of possibilities
print('length of possibilities',len(possibilities))
possibilities = [item for item in possibilities if item[1]>=item[0] and item[3]>=item[2] and item[5]>=item[4]] # removes combinations that don't make sense such as first value is 5 and second value is 3
print('length of revised possibilities',len(possibilities))

filtering_output = []
checkpoint_02 = time.perf_counter()
for i in possibilities:
  a = i[0]
  b = i[1]
  c = i[2]
  d = i[3]
  e = i[4]
  f = i[5]
  for z in df.itertuples():
    if z.var_1_lower_limit <=a and z.var_1_upper_limit >=b and z.var_2_lower_limit <=c and z.var_2_upper_limit >=d and z.var_3_lower_limit <=e and z.var_3_upper_limit >=f:
      filtering_output.append({'combination':i,'var_1_lower_limit':z.var_1_lower_limit,'var_1_upper_limit':z.var_1_upper_limit, 'var_2_lower_limit':z.var_2_lower_limit,'var_2_upper_limit':z.var_2_upper_limit,'var_3_lower_limit':z.var_3_lower_limit,'var_3_upper_limit':z.var_3_upper_limit,'pass_or_fail':'YES'})
    else:
      continue

filtering_output_df = pd.DataFrame(filtering_output)
checkpoint_03 = time.perf_counter()
print('Iteration loop time was',(checkpoint_03-checkpoint_02)/60,'minutes')
unique_combinations = filtering_output_df.drop_duplicates(subset=['combination'], keep='first')
final_filters = unique_combinations.drop_duplicates(subset=['var_1_lower_limit','var_1_upper_limit','var_2_lower_limit','var_2_upper_limit','var_3_lower_limit','var_3_upper_limit'], keep='first') # since df was sorted, keeping only filters that had a passing combination
checkpoint_04 = time.perf_counter()
print('Done finding unique',(checkpoint_04-checkpoint_03)/60,'minutes')
display(final_filters)
checkpoint_05 = time.perf_counter()
print('Total time for loop',(checkpoint_05-checkpoint_01)/60,'minutes')

######### vectorization
checkpoint_06 = time.perf_counter()
print('Starting vectorization...')
default_possibilities_new = np.array(list(itertools.product(range(10), repeat=6)))
print('Length of default possibilities',len(default_possibilities_new))

mask_new = (default_possibilities_new[:, 1] >= default_possibilities_new[:, 0]) & \
       (default_possibilities_new[:, 3] >= default_possibilities_new[:, 2]) & \
       (default_possibilities_new[:, 5] >= default_possibilities_new[:, 4])

possibilities_new = default_possibilities_new[mask_new]
print('length of revised possibilities',len(possibilities_new))

filtering_output_new = []
checkpoint_07 = time.perf_counter()
for i in possibilities_new:
    mask_new = (
            (df['var_1_lower_limit'] <= i[0]) & (df['var_1_upper_limit'] >= i[1]) &
            (df['var_2_lower_limit'] <= i[2]) & (df['var_2_upper_limit'] >= i[3]) &
            (df['var_3_lower_limit'] <= i[4]) & (df['var_3_upper_limit'] >= i[5])
    )

    if mask_new.any():
        row = df[mask_new].iloc[0]
        filtering_output_new.append({
            'combination': i,
            'var_1_lower_limit': row['var_1_lower_limit'],
            'var_1_upper_limit': row['var_1_upper_limit'],
            'var_2_lower_limit': row['var_2_lower_limit'],
            'var_2_upper_limit': row['var_2_upper_limit'],
            'var_3_lower_limit': row['var_3_lower_limit'],
            'var_3_upper_limit': row['var_3_upper_limit'],
            'pass_or_fail': 'YES'
        })

filtering_output_df_new = pd.DataFrame(filtering_output_new)
checkpoint_08 = time.perf_counter()
print('Done with vectorization',(checkpoint_08-checkpoint_06)/60,'minutes')
unique_combinations_new = filtering_output_df_new.drop_duplicates(subset=['combination'], keep='first')
final_filters_new = unique_combinations_new.drop_duplicates(subset=['var_1_lower_limit', 'var_1_upper_limit', 'var_2_lower_limit', 'var_2_upper_limit', 'var_3_lower_limit','var_3_upper_limit'], keep='first')
checkpoint_09 = time.perf_counter()
print('Done finding unique',(checkpoint_09-checkpoint_08)/60,'minutes')
display(final_filters_new)

checkpoint_10 = time.perf_counter()
print("Vectorization done in", (checkpoint_10 - checkpoint_06)/60,'minutes')

Output: (screenshot of the timing results attached in the original post)

Recommended answer

The "Done with vectorization" step is improved by removing the loop. The frame's data is converted to a NumPy array. `mask_faster` gets a boolean mask for every row at once (each column of it is a mask for selecting rows from the dataframe, which is exactly what you were doing inside your loop).

`bl` is a mask that is true wherever at least one row passes. `index` holds the indices used to retrieve the matching rows from `arr`. `mask_faster` is then narrowed down with the mask `bl`: `mask_faster[:, bl]`. In `new_df`, the necessary values are filled in with a list comprehension. As a result, this runs 60-70x faster.
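The broadcasting trick behind `mask_faster` can be seen in miniature (the limits and candidate values here are hypothetical):

```python
import numpy as np

# One lower limit per dataframe row, as a column vector ...
lower = np.array([1, 3, 2])
# ... compared against one candidate value per combination, as a row vector.
candidates = np.array([0, 2, 4])

# Broadcasting produces the full (rows x combinations) comparison at once.
mask = lower[:, np.newaxis] <= candidates
print(mask.shape)   # (3, 3)
print(mask[:, 2])   # column 2 = which rows pass for candidate value 4 (all True)
```

ANDing six such masks together gives one boolean column per combination, replacing the per-combination Python loop entirely.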

import numpy as np
import pandas as pd
import time

dpn = default_possibilities_new[mask_new]  # reuses the arrays built in the question's code

checkpoint_mask_faster = time.perf_counter()

mask_faster = (
            (df['var_1_lower_limit'].values[:, np.newaxis] <= dpn[:, 0]) &
            (df['var_1_upper_limit'].values[:, np.newaxis]  >= dpn[:, 1]) &
            (df['var_2_lower_limit'].values[:, np.newaxis]  <= dpn[:, 2]) &
            (df['var_2_upper_limit'].values[:, np.newaxis]  >= dpn[:, 3]) &
            (df['var_3_lower_limit'].values[:, np.newaxis]  <= dpn[:, 4]) &
            (df['var_3_upper_limit'].values[:, np.newaxis]  >= dpn[:, 5])
    )

bl = np.any(mask_faster, axis=0)
index = np.where(bl)[0]
arr = df.values
mask_faster = mask_faster[:, bl]

new_df = pd.DataFrame([{'combination': dpn[index[i]],
                        'var_1_lower_limit': arr[mask_faster[:,i]][0][0],
                        'var_1_upper_limit': arr[mask_faster[:,i]][0][1],
                        'var_2_lower_limit': arr[mask_faster[:,i]][0][2],
                        'var_2_upper_limit': arr[mask_faster[:,i]][0][3],
                        'var_3_lower_limit': arr[mask_faster[:,i]][0][4],
                        'var_3_upper_limit': arr[mask_faster[:,i]][0][5],
                        'pass_or_fail': 'YES'
                        } for i in range(mask_faster.shape[1])])

print('Done with mask_faster',(time.perf_counter()-checkpoint_mask_faster)/60,'minutes')

You can also compare your dataframe result against mine:

print('comparison of dataframes', new_df.equals(filtering_output_df_new))

I think the slowest part is `unique_combinations_new`, because each row holds a list. I tried `np.unique`; I'm not entirely sure it behaves identically to `drop_duplicates`, but it runs very fast. You'll want to verify the following:

ind_uniq = np.unique(dpn[index], axis=0, return_index=True)[1]

unique_combinations_new = new_df.loc[ind_uniq, :]
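What `return_index=True` gives you can be checked on a tiny array (made up here for illustration). Note one caveat: `np.unique` sorts the unique rows, so the returned indices follow sorted order, not the original row order that `drop_duplicates(keep='first')` preserves:

```python
import numpy as np

# Two identical rows and one distinct row.
combos = np.array([[0, 1],
                   [0, 1],
                   [2, 3]])

# return_index=True yields, per unique row, the index of its first occurrence.
uniq, first_idx = np.unique(combos, axis=0, return_index=True)
print(first_idx)  # [0 2]
```

Those indices are what `ind_uniq` then feeds into `new_df.loc[...]` to keep one row per distinct combination.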
