我有以下问题,我很难为它编写优化的程序.

我得到了一个2D二进制数组(所有元素都是0或1,其形状是(n,m)),我需要从数组中删除最大数量的元素,同时保持每行之和大于min_x,每列之和大于min_y的属性.

我已经编写了以下非优化代码,它可以很好地用于小数组:

import numpy as np
from typing import Optional, Tuple
from itertools import product
from tqdm import tqdm

def min_dataset(data, min_x, min_y):
    '''Data is a row-major numpy array of shape (n, m), consisting of only 0s and 1s (a mask).
    min_x and min_y are the minimum number of 1s in each row and column, respectively.
    This function returns a new data array, where each row and column has at least min_x and min_y 1s, respectively, and there are as few 1s as possible.
    '''
    best_solution = data
    score = np.sum(best_solution)
    # format: {matrix.tobytes(): (best_matrix, score)}

    memo = {}

    if np.any(np.sum(data, axis=0) < min_y) or np.any(np.sum(data, axis=1) < min_x):
        return None

    def get_best(data, main=False) -> Tuple[np.ndarray, int]:
        '''Gets the matrix with the lowest score, given the constraints.
        Uses memoization.
        '''
        nonlocal memo
        if data.tobytes() in memo:
            return memo[data.tobytes()]
        
        # Find the rows with more than min_x 1s
        can_remove_rows = np.sum(data, axis=1) > min_x
        # Find the columns with more than min_y 1s
        can_remove_columns = np.sum(data, axis=0) > min_y

        if (not np.any(can_remove_rows)) or (not np.any(can_remove_columns)):
            # If there's no row or column where we can remove a 1, we're done
            ans = (data, np.sum(data))
            memo[data.tobytes()] = (data, np.sum(data))
            return ans
        
        # Try removing each combination of rows and columns where we can remove a 1
        best = data, np.sum(data)
        # print(np.nonzero(can_remove_rows))
        # print(np.nonzero(can_remove_columns))
        iterator = product(np.nonzero(can_remove_rows)[0], np.nonzero(can_remove_columns)[0])
        if main:
            iterator = tqdm(list(iterator))
        for row, col in iterator:
            # print(row, col)
            if data[row, col] == 0:
                continue
            
            # Remove the 1 at (row, col)
            new_data = data.copy()
            new_data[row, col] = 0

            # Check if the new matrix is valid
            if np.any(np.sum(new_data, axis=0) < min_y) or np.any(np.sum(new_data, axis=1) < min_x):
                continue

            # Recurse
            new_best = get_best(new_data)
            if new_best[1] < best[1]:
                best = new_best

        memo[data.tobytes()] = best
        return best

    get_best(data, main=True)
    best = memo[data.tobytes()][0]
    return best

if __name__ == "__main__":
    # Create a random 5x5 binary matrix
    data = np.random.randint(0, 2, (5, 5))
    print(data)
    print("computing")
    ans = min_dataset(data, 1, 1)
    print("Solution:")
    print(ans)
    print("Score (lower is better; None means no solution was found.):")
    print(np.sum(ans))

示例输出:

 [[0 1 0 0 0]
 [1 1 1 1 1]
 [1 1 0 1 1]
 [0 1 1 0 1]
 [0 1 1 0 1]]
computing
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.27it/s]
Solution:
[[0 1 0 0 0]
 [0 0 0 1 0]
 [1 0 0 0 0]
 [0 0 0 0 1]
 [0 0 1 0 0]]
Score (lower is better; None means no solution was found.):
5

然而,一旦我开始使用10x10矩阵,代码就运行得非常慢,我需要能够运行20000x500矩阵.

我如何才能加速我的代码?

推荐答案

这里有一种非常不同的方法.

这个问题可以表示为一个易于编写但非常非常大的MIP(混合整数规划)模型:

 # data
 i = 1..20000
 j = 1..500  
 A[i,j] : boolean matrix with about 10% ones
 r[i]   : minimum row sums (about 0.5 of row sums in A)
 c[j]   : minimum column sums (about 0.5 of column sums in A)     

 # model
 minimize sum((i,j), A[i,j]*x[i,j])
 subject to
        sum(j, A[i,j]*x[i,j]) >= r[i]   ∀i
        sum(i, A[i,j]*x[i,j]) >= c[j]   ∀j
        x[i,j] ∈ {0,1}  

这是一个非常大的MIP,有超过20k的约束和100万个二进制变量.(不包括A[i,j]=0的x[i,j].我跳过了这些.)我们可以通过 slack 二元变量来作为LP来解决这个问题.解决方案将自动为整数值.这将把解决方案的时间缩短到只有14秒.如果需要,这也可以表示为网络优化问题.这只需要几秒钟.

统计数据

大小

  ---   20,501 rows  999,470 columns  2,998,408 non-zeroes
  ---   999,469 discrete-columns

LP计时

Dual simplex solved model.
--- LP status (1): optimal.
--- Cplex Time: 14.11sec (det. 10018.76 ticks)
Optimal solution found
Objective:       504711.000000

网络解算器

Network - Optimal:  Objective =    5.0471100000e+05
Network time = 3.22 sec. (1050.56 ticks)  Iterations = 1235117 (1235117)
--- LP status (1): optimal.
--- Cplex Time: 4.05sec (det. 1859.71 ticks)
Optimal solution found
Objective:       504711.000000

直接使用网络算法的例子可以在这里找到:https://yetanothermathprogrammingconsultant.blogspot.com/2023/05/solving-as-network-with-lowerbounds.html.


由海报添加

使用GPT-4将其转换为工作代码,我们最终得到:

import random
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, LpStatus # pip install pulp
import numpy as np # pip install numpy (not required - just used for printing at the end)

def create_data(n_rows, n_cols):
    A = [[random.choice([0, 1]) for _ in range(n_cols)] for _ in range(n_rows)]
    # in my case, this will be r = [2, 2, 2, 2, 2] and c = [3, 3, 3, 3, 3] or something similar
    r = [int(sum(row) * 0.5) for row in A]
    c = [int(sum(col) * 0.5) for col in zip(*A)]
    return A, r, c

def solve_mip(A, r, c):
    n_rows = len(A)
    n_cols = len(A[0])

    prob = LpProblem("MIP_Problem", LpMinimize)

    x = [[LpVariable(f"x_{i}_{j}", cat=LpBinary) for j in range(n_cols)] for i in range(n_rows)]

    prob += lpSum(A[i][j] * x[i][j] for i in range(n_rows) for j in range(n_cols))

    for i in range(n_rows):
        prob += lpSum(A[i][j] * x[i][j] for j in range(n_cols)) >= r[i]

    for j in range(n_cols):
        prob += lpSum(A[i][j] * x[i][j] for i in range(n_rows)) >= c[j]

    prob.solve()

    if LpStatus[prob.status] == "Optimal":
        return [[int(x[i][j].value()) if x[i][j].value() is not None else None for j in range(n_cols)] for i in range(n_rows)]
    else:
        return None

if __name__ == "__main__":
    n_rows = 5
    n_cols = 5
    A, r, c = create_data(n_rows, n_cols)
    print("A:", A)
    print("r:", r)
    print("c:", c)

    result = solve_mip(A, r, c)
    if result is not None:
        print("Result:", result)
    else:
        print("No solution found.")

    print(f"A: {np.array(A)}")
    print(f"r: {np.array(r).reshape(-1, 1)}")
    print(f"c: {np.array(c)}")
    # nan means that it didn't matter what the value was
    # these can be set to 0 to get the array we want
    print(f"Result:\n{np.array(result, dtype=np.float32)}")

Python相关问答推荐

具有2D功能的Python十六进制图

在Python中管理多个OpenGVBO和VAO实例

实现的差异取决于计算出的表达是直接返回还是首先存储在变量中然后返回

难以在Manim中正确定位对象

在Wayland上使用setCellWidget时,try 编辑QTable Widget中的单元格时,PyQt 6崩溃

在Mac上安装ipython

修复mypy错误-赋值中的类型不兼容(表达式具有类型xxx,变量具有类型yyy)

我如何使法国在 map 中完全透明的代码?

梯度下降:简化要素集的运行时间比原始要素集长

如何从数据库上传数据到html?

Pandas DataFrame中行之间的差异

使用groupby方法移除公共子字符串

如何在Python中获取`Genericums`超级类型?

可以bcrypts AES—256 GCM加密损坏ZIP文件吗?

在Python中使用if else或使用regex将二进制数据如111转换为001""

如何杀死一个进程,我的Python可执行文件以sudo启动?

Python pint将1/华氏度转换为1/摄氏度°°

如何在Python 3.9.6和MacOS Sonoma 14.3.1下安装Pyregion

如何获得3D点的平移和旋转,给定的点已经旋转?

一个telegram 机器人应该发送一个测验如何做?""