I am currently working on a project where I have a very large DataFrame (500,000 rows) containing polygons as rows, each polygon representing a geographical area. The columns of the DataFrame represent different landcover classes (34 classes), and the values in the cells represent the area covered by each landcover class in square kilometres.

My goal is to subsample this DataFrame based on target requirements for the landcover classes. Specifically, I want to select a subset of polygons that collectively satisfy specific target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.

Some colleagues hinted that this could be interpreted as an optimisation problem with an objective function. However, I have not found a solution yet and instead tried a different, slow, iterative approach (see below).

To give you a better understanding, here is a minimal reproducible example of my DataFrame structure with just 4 polygons and 3 classes:

import pandas as pd

# Create a sample DataFrame
data = {
    'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14]
}

df = pd.DataFrame(data)

For example, say I have the following target requirements for the landcover classes:

target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}

Based on these target requirements, I want to subsample the DataFrame by selecting a subset of polygons that collectively meet or closely approximate the target area coverage for each landcover class. In this example, Polygons A and C make a good subsample, since their landcover sums come close to the requirements I set.
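
For illustration, summing those two rows on the example data confirms the claim (this check is not part of my pipeline, it just verifies the example above):

# Sum the landcover areas of Polygons A and C and compare with the targets
subset = df[df['Polygon'].isin(['Polygon A', 'Polygon C'])]
print(subset[['Landcover 1', 'Landcover 2', 'Landcover 3']].sum())
# Landcover 1    17   (target: 15)
# Landcover 2    19   (target: 20)
# Landcover 3    29   (target: 25)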

My [extended] code so far

Here is the code I have written so far. You will see a couple of extra steps implemented here:

  1. Weights: using deficits and surpluses to guide the selection of polygons
  2. Random sampling from the top 0.5%: based on the weights, I select the top 0.5% of polygons and randomly pick 1 from that selection.
  3. Tolerance: I set a tolerance for the difference between the cumulative areas found for the current subsample and the required targets.
  4. Progress bar: for aesthetics.
import numpy as np
import pandas as pd
from tqdm import tqdm

def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]

    # Add the selected polygon to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values

    return cumulative_coverages, subsample

df_data = ...            # your DataFrame with polygons and landcover classes
landcover_columns = ...  # list of landcover columns in the DataFrame
target_coverages = ...   # dictionary of target coverages for each landcover class

total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, index=landcover_columns)

df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)

# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005

# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)

# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)

# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()

    normalized_weights = deficit / deficit_sum

    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus

    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)

    # Select the top 0.5% of polygons based on weight sum
    top_percentile = max(1, int(len(df_data) * top_percentage))  # ensure at least one candidate
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)

    selected_polygon = df_data.loc[selected_polygon_index]

    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )

    # Add the selected polygon to the subsample
    subsample = pd.concat([subsample, subsample_iteration])  # DataFrame.append was removed in pandas 2.0

    df_data = df_data.drop(selected_polygon_index)

    # Check if all polygons have been selected or the cumulative coverage matches or is close to the target coverage
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break

    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100

    # Update tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)

progress_bar.close()

subsample.reset_index(drop=True, inplace=True)

The problem

The code is slow (10 iterations/s) and does not manage the tolerance well: I can get cumulative coverages far above the targets while the tolerance is still not satisfied (my selection guidance is not good enough). Also, there must be a better optimisation to get what I want.

Any help/suggestions would be greatly appreciated.

Recommended answer

The problem description lends itself to formulation as a Mixed-Integer Linear Program (MIP).

A convenient library for solving mixed-integer problems is PuLP, which ships with the Coin-OR suite built in, in particular the integer solver CBC.

Note that I changed the data structure from the DataFrame in the example: lookups without an index would need a lot of .loc, which I personally dislike, so the code would get quite messy with the original DataFrame.

data = {
    'Polygon A': {
        'Landcover 1': 10,
        'Landcover 2': 15,
        'Landcover 3': 20
    },
    'Polygon B': {
        'Landcover 1': 5,
        'Landcover 2': 8,
        'Landcover 3': 12
    },
    'Polygon C': {
        'Landcover 1': 7,
        'Landcover 2': 4,
        'Landcover 3': 9
    },
    'Polygon D': {
        'Landcover 1': 3,
        'Landcover 2': 6,
        'Landcover 3': 14
    }
}
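
If the real data already sits in a DataFrame like the one in the question, this nested-dict structure can be derived in one line (a sketch using pandas' to_dict, assuming the df from the question):

# Convert the question's DataFrame into the {polygon: {landcover: area}} layout
data = df.set_index('Polygon').to_dict(orient='index')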

target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}

We then formulate the linear problem:

from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value, PULP_CBC_CMD

model = LpProblem("LandcoverOptimization", LpMinimize)

# Create a binary variable for each polygon to indicate whether it was selected in the solution
polygons = list(data.keys())
selected = LpVariable.dicts("Selected", polygons, cat='Binary')

# Minimize the amount of selected polygons
model += lpSum(selected[polygon] for polygon in polygons)


# We need to approximately cover the target required landcover +- tolerance
# Also: We have to select a polygon to add it to the solution, this will be minimized by the objective
tolerance = 0.1
# We create the sums only once for performance
sums = {}
for landcover in target_requirements:
    sums[landcover] = lpSum(data[polygon][landcover] * selected[polygon] for polygon in polygons)
for landcover in target_requirements:
    model += sums[landcover] >= target_requirements[landcover] * (1 - tolerance)
    model += sums[landcover] <= target_requirements[landcover] * (1 + tolerance)

model.solve(PULP_CBC_CMD(fracGap=0.9))  # set the MIP gap very permissively for performance
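
A note on solver options: fracGap is the legacy name for the relative MIP gap; newer PuLP releases call it gapRel, and PULP_CBC_CMD also accepts a time limit and a verbosity flag. A sketch of a more controlled solve (the concrete values are placeholders):

# Alternative solve: 1% relative MIP gap, 10-minute cap, live CBC log
model.solve(PULP_CBC_CMD(gapRel=0.01, timeLimit=600, msg=True))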

To visualize the output, we extract the solution:

selected_polygons = [polygon for polygon in polygons if value(selected[polygon]) == 1.0]
print("Selected Polygons:")
for polygon in selected_polygons:
    print(polygon)

achieved_landcovers = {}
for landcover in target_requirements:
    total_landcover = sum(data[polygon][landcover] for polygon in selected_polygons)
    achieved_landcovers[landcover] = total_landcover

print("Required Landcovers:")
for landcover, requirement in target_requirements.items():
    print(f"{landcover}: {requirement}")

print("Achieved Landcovers:")
for landcover, coverage in achieved_landcovers.items():
    print(f"{landcover}: {coverage}")

Output:

Selected Polygons:
Polygon A
Polygon C
Required Landcovers:
Landcover 1: 15
Landcover 2: 20
Landcover 3: 25
Achieved Landcovers:
Landcover 1: 17
Landcover 2: 19
Landcover 3: 29

On tractability:

I have solved MIP models with around 3 million binary variables. This question reports that 5 million variables would not solve for them. Playing with the tolerance, as well as the optimality gap, will let you dial in a useful and practical solution for your problem. The output will also contain a log of the optimality measures and/or prove that the problem is infeasible (e.g. if you set the tolerance too tightly):

Result - Optimal solution found

Objective value:                2.00000000
Enumerated nodes:               0
Total iterations:               1
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.00

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01
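
Rather than parsing that log, the outcome can also be checked programmatically via PuLP's status mapping; a minimal sketch:

from pulp import LpStatus

# Maps the numeric status code to 'Optimal', 'Infeasible', 'Unbounded', ...
print("Solver status:", LpStatus[model.status])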

If the amount of data is truly huge and the tolerances need to be very tight, CBC may at some point no longer be able to cope. Commercially available solvers such as Gurobi will handle much larger problem instances, but they are not available for free.
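
Since PuLP abstracts over solver backends, swapping in Gurobi is a one-line change; a sketch, assuming Gurobi and its license are installed (GUROBI_CMD is PuLP's command-line Gurobi interface):

from pulp import GUROBI_CMD

# Same model as above, solved with the Gurobi backend instead of CBC
model.solve(GUROBI_CMD())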

In my experience, if the solver cannot find a satisfying solution, there is no way you can write Python/pandas code that will.

To get into the mindset of describing problems rather than thinking up solutions, I recommend Raymond Hettinger's excellent talk at PyCon 2019 - Modern solvers: Problems well-defined are problems solved.

To install PuLP:

pip install pulp

Edit: for the OP's 88MB problem instance, on my MacBook Pro:

Result - Optimal solution found (within gap tolerance)

Objective value:                14178.00000000
Lower bound:                    14080.357
Gap:                            0.01
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             218.06
Time (Wallclock seconds):       219.21

Option for printingOptions changed from normal to all
Total time (CPU seconds):       293.59   (Wallclock seconds):       296.26

Tweaking the tolerance and the MIP gap will improve runtime or solution quality.

It proved that at least 14081 polygons need to be selected, and it found a solution that selects 14178 polygons (a 1% optimality gap) with a coverage tolerance of ±3%:

Landcover Comparison:
n20: Required=3303.7, Achieved=3204.7, Difference=99.0
n25: Required=20000.0, Achieved=19401.1, Difference=598.9
n5: Required=20000.0, Achieved=19400.0, Difference=600.0
n16: Required=8021.1, Achieved=7781.3, Difference=239.8
....
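
For completeness, the comparison above can be printed with the same extraction pattern as in the toy example (a sketch; class names like n20 come from the OP's real dataset):

# Required vs. achieved coverage per landcover class
print("Landcover Comparison:")
for landcover, required in target_requirements.items():
    achieved = sum(data[polygon][landcover] for polygon in selected_polygons)
    print(f"{landcover}: Required={required}, Achieved={achieved}, "
          f"Difference={required - achieved}")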
