I am currently working on a project where I have a very large DataFrame (500,000 rows) containing polygons as rows, each polygon representing a geographical area. The columns of the DataFrame represent different landcover classes (34 classes), and the values in the cells represent the area covered by each landcover class in square kilometres.

My goal is to subsample this DataFrame based on target requirements for the landcover classes. Specifically, I want to select a subset of polygons that collectively satisfy specific target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.

Some colleagues hinted that this could be interpreted as an optimisation problem with an objective function. However, I have not found a solution yet and instead tried a different, slow, iterative approach (see below).

To give you a better understanding, here is a minimal reproducible example of my DataFrame structure with just 4 polygons and 3 classes:

import pandas as pd

# Create a sample DataFrame
data = {
    'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14]
}

df = pd.DataFrame(data)

For example, say I have the following target requirements for the landcover classes:

target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}

Based on these target requirements, I want to subsample the DataFrame by selecting a subset of polygons that collectively meet or closely approximate the target area coverage for each landcover class. In this example, Polygons A and C make a good subsample, since their landcover sums come close to the requirements I set.
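
For illustration, summing those two rows on the example data confirms the claim (this check is not part of my pipeline, it just verifies the example above):

# Sum the landcover areas of Polygons A and C and compare with the targets
subset = df[df['Polygon'].isin(['Polygon A', 'Polygon C'])]
print(subset[['Landcover 1', 'Landcover 2', 'Landcover 3']].sum())
# Landcover 1    17   (target: 15)
# Landcover 2    19   (target: 20)
# Landcover 3    29   (target: 25)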

My [extended] code so far

Here is the code I have written so far. You will see a couple of extra steps implemented here:

  1. Weights: using deficits and surpluses to guide the selection of polygons
  2. Random sampling from the top 0.5%: based on the weights, I select the top 0.5% of polygons and randomly pick 1 from that selection.
  3. Tolerance: I set a tolerance for the difference between the cumulative areas found for the current subsample and the required targets.
  4. Progress bar: for aesthetics.
import numpy as np
import pandas as pd
from tqdm import tqdm

def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]

    # Add the selected polygon to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values

    return cumulative_coverages, subsample

df_data = ...            # your DataFrame with polygons and landcover classes
landcover_columns = ...  # list of landcover columns in the DataFrame
target_coverages = ...   # dictionary of target coverages for each landcover class

total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, index=landcover_columns)

df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)

# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005

# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)

# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)

# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()

    normalized_weights = deficit / deficit_sum

    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus

    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)

    # Select the top 0.5% of polygons based on weight sum
    top_percentile = max(1, int(len(df_data) * top_percentage))  # ensure at least one candidate
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)

    selected_polygon = df_data.loc[selected_polygon_index]

    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )

    # Add the selected polygon to the subsample
    subsample = pd.concat([subsample, subsample_iteration])  # DataFrame.append was removed in pandas 2.0

    df_data = df_data.drop(selected_polygon_index)

    # Check if all polygons have been selected or the cumulative coverage matches or is close to the target coverage
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break

    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100

    # Update tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)

progress_bar.close()

subsample.reset_index(drop=True, inplace=True)

The problem

The code is slow (10 iterations/s) and does not manage the tolerance well: I can get cumulative coverages far above the targets while the tolerance is still not satisfied (my selection guidance is not good enough). Also, there must be a better optimisation to get what I want.

Any help/suggestions would be greatly appreciated.

Recommended answer

The problem description lends itself to formulation as a Mixed-Integer Linear Program (MIP).

A convenient library for solving mixed-integer problems is PuLP, which ships with the Coin-OR suite built in, in particular the integer solver CBC.

Note that I changed the data structure from the DataFrame in the example: lookups without an index would need a lot of .loc, which I personally dislike, so the code would get quite messy with the original DataFrame.

data = {
    'Polygon A': {
        'Landcover 1': 10,
        'Landcover 2': 15,
        'Landcover 3': 20
    },
    'Polygon B': {
        'Landcover 1': 5,
        'Landcover 2': 8,
        'Landcover 3': 12
    },
    'Polygon C': {
        'Landcover 1': 7,
        'Landcover 2': 4,
        'Landcover 3': 9
    },
    'Polygon D': {
        'Landcover 1': 3,
        'Landcover 2': 6,
        'Landcover 3': 14
    }
}
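
If the real data already sits in a DataFrame like the one in the question, this nested-dict structure can be derived in one line (a sketch using pandas' to_dict, assuming the df from the question):

# Convert the question's DataFrame into the {polygon: {landcover: area}} layout
data = df.set_index('Polygon').to_dict(orient='index')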

target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}

We then formulate the linear problem:

from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value, PULP_CBC_CMD

model = LpProblem("LandcoverOptimization", LpMinimize)

# Create a binary variable for each polygon to indicate whether it was selected in the solution
polygons = list(data.keys())
selected = LpVariable.dicts("Selected", polygons, cat='Binary')

# Minimize the amount of selected polygons
model += lpSum(selected[polygon] for polygon in polygons)


# We need to approximately cover the target required landcover +- tolerance
# Also: We have to select a polygon to add it to the solution, this will be minimized by the objective
tolerance = 0.1
# We create the sums only once for performance
sums = {}
for landcover in target_requirements:
    sums[landcover] = lpSum(data[polygon][landcover] * selected[polygon] for polygon in polygons)
for landcover in target_requirements:
    model += sums[landcover] >= target_requirements[landcover] * (1 - tolerance)
    model += sums[landcover] <= target_requirements[landcover] * (1 + tolerance)

model.solve(PULP_CBC_CMD(fracGap=0.9))  # set the MIP gap very permissively for performance
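
A note on solver options: fracGap is the legacy name for the relative MIP gap; newer PuLP releases call it gapRel, and PULP_CBC_CMD also accepts a time limit and a verbosity flag. A sketch of a more controlled solve (the concrete values are placeholders):

# Alternative solve: 1% relative MIP gap, 10-minute cap, live CBC log
model.solve(PULP_CBC_CMD(gapRel=0.01, timeLimit=600, msg=True))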

To visualize the output, we extract the solution:

selected_polygons = [polygon for polygon in polygons if value(selected[polygon]) == 1.0]
print("Selected Polygons:")
for polygon in selected_polygons:
    print(polygon)

achieved_landcovers = {}
for landcover in target_requirements:
    total_landcover = sum(data[polygon][landcover] for polygon in selected_polygons)
    achieved_landcovers[landcover] = total_landcover

print("Required Landcovers:")
for landcover, requirement in target_requirements.items():
    print(f"{landcover}: {requirement}")

print("Achieved Landcovers:")
for landcover, coverage in achieved_landcovers.items():
    print(f"{landcover}: {coverage}")

Output:

Selected Polygons:
Polygon A
Polygon C
Required Landcovers:
Landcover 1: 15
Landcover 2: 20
Landcover 3: 25
Achieved Landcovers:
Landcover 1: 17
Landcover 2: 19
Landcover 3: 29

On tractability:

I have solved MIP models with around 3 million binary variables. This question reports that 5 million variables would not solve for them. Playing with the tolerance, as well as the optimality gap, will let you dial in a useful and practical solution for your problem. The output will also contain a log of the optimality measures and/or prove that the problem is infeasible (e.g. if you set the tolerance too tightly):

Result - Optimal solution found

Objective value:                2.00000000
Enumerated nodes:               0
Total iterations:               1
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.00

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01
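
Rather than parsing that log, the outcome can also be checked programmatically via PuLP's status mapping; a minimal sketch:

from pulp import LpStatus

# Maps the numeric status code to 'Optimal', 'Infeasible', 'Unbounded', ...
print("Solver status:", LpStatus[model.status])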

If the amount of data is truly huge and the tolerances need to be very tight, CBC may at some point no longer be able to cope. Commercially available solvers such as Gurobi will handle much larger problem instances, but they are not available for free.
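
Since PuLP abstracts over solver backends, swapping in Gurobi is a one-line change; a sketch, assuming Gurobi and its license are installed (GUROBI_CMD is PuLP's command-line Gurobi interface):

from pulp import GUROBI_CMD

# Same model as above, solved with the Gurobi backend instead of CBC
model.solve(GUROBI_CMD())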

In my experience, if the solver cannot find a satisfying solution, there is no way you can write Python/pandas code that will.

To get into the mindset of describing problems rather than thinking up solutions, I recommend Raymond Hettinger's excellent talk at PyCon 2019 - Modern solvers: Problems well-defined are problems solved.

To install PuLP:

pip install pulp

Edit: for the OP's 88MB problem instance, on my MacBook Pro:

Result - Optimal solution found (within gap tolerance)

Objective value:                14178.00000000
Lower bound:                    14080.357
Gap:                            0.01
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             218.06
Time (Wallclock seconds):       219.21

Option for printingOptions changed from normal to all
Total time (CPU seconds):       293.59   (Wallclock seconds):       296.26

Tweaking the tolerance and the MIP gap will improve runtime or solution quality.

It proved that at least 14081 polygons need to be selected, and it found a solution that selects 14178 polygons (a 1% optimality gap) with a coverage tolerance of ±3%:

Landcover Comparison:
n20: Required=3303.7, Achieved=3204.7, Difference=99.0
n25: Required=20000.0, Achieved=19401.1, Difference=598.9
n5: Required=20000.0, Achieved=19400.0, Difference=600.0
n16: Required=8021.1, Achieved=7781.3, Difference=239.8
....
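
For completeness, the comparison above can be printed with the same extraction pattern as in the toy example (a sketch; class names like n20 come from the OP's real dataset):

# Required vs. achieved coverage per landcover class
print("Landcover Comparison:")
for landcover, required in target_requirements.items():
    achieved = sum(data[polygon][landcover] for polygon in selected_polygons)
    print(f"{landcover}: Required={required}, Achieved={achieved}, "
          f"Difference={required - achieved}")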
