I am currently working on a project where I have a large DataFrame (500'000 rows) that contains polygons as rows, each polygon representing a geographical region. The columns of the DataFrame represent different landcover classes (34 classes), and the values in the cells represent the area covered by each landcover class in square kilometres.

My goal is to subsample this DataFrame based on target requirements for the landcover classes. Specifically, I want to select a subset of polygons that collectively meet specific target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.

Some colleagues hinted that this could be interpreted as an optimisation problem with an objective function. However, I have not found a solution yet, and instead tried a different, slow, iterative approach (see below).

To give you a better understanding, here is a minimal reproducible example of my DataFrame structure with just 4 polygons and 3 classes:
import pandas as pd
# Create a sample DataFrame
data = {
'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
'Landcover 1': [10, 5, 7, 3],
'Landcover 2': [15, 8, 4, 6],
'Landcover 3': [20, 12, 9, 14]
}
df = pd.DataFrame(data)
For example, let's say I have the following target requirements for the landcover classes:
target_requirements = {
'Landcover 1': 15,
'Landcover 2': 20,
'Landcover 3': 25
}
Based on these target requirements, I want to subsample the DataFrame by selecting a subset of polygons that collectively meet or closely approximate the target area coverage for each landcover class. In this example, polygons A and C would be a good subsample, as their summed landcover areas are close to the requirements I set.
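To make "closely approximate" concrete: on the toy example the claim about A and C can be checked by brute force, scoring each subset by its total absolute deviation from the targets (the choice of objective is my own; enumerating all subsets is obviously only feasible for a handful of polygons, not for 500'000 rows):

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14],
}, index=['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'])
target = pd.Series({'Landcover 1': 15, 'Landcover 2': 20, 'Landcover 3': 25})

# Enumerate every non-empty subset and score it by total absolute deviation
best_subset, best_score = None, float('inf')
for r in range(1, len(df) + 1):
    for subset in combinations(df.index, r):
        score = (df.loc[list(subset)].sum() - target).abs().sum()
        if score < best_score:
            best_subset, best_score = subset, score

print(best_subset, best_score)  # best subset is ('Polygon A', 'Polygon C'), deviation 7
```

A and C sum to (17, 19, 29) against targets (15, 20, 25), i.e. a total deviation of 7, which no other subset beats.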
My [extended] code so far
Here is the code I have written so far. You will see a few extra steps implemented here:
- Weights: deficits and surpluses are used to guide the polygon selection
- Random sampling from the top 0.5%: based on the weights, I select the top 0.5% of polygons and randomly pick 1 from that selection.
- Tolerance: I set a tolerance for the difference between the cumulative areas of the current subsample and the required targets.
- Progress bar: aesthetics.
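To illustrate the weighting idea on the toy example, here is roughly what one iteration of the deficit/surplus scheme computes before any polygon has been picked (a sketch; the 0.1 damping factor on the surplus mirrors the full code below):

```python
import pandas as pd

df = pd.DataFrame({
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14],
}, index=['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'])
target = pd.Series({'Landcover 1': 15, 'Landcover 2': 20, 'Landcover 3': 25})
cumulative = pd.Series(0, index=df.columns)  # nothing selected yet

remaining = target - cumulative
deficit = remaining.clip(lower=0)            # area still missing per class
surplus = remaining.clip(upper=0) * 0.1      # damped penalty once a class overshoots
weights = df.mul(deficit / deficit.sum()) + surplus
weight_sum = weights.sum(axis=1)             # one score per polygon
print(weight_sum.idxmax())                   # prints "Polygon A"
```

Classes with a large remaining deficit dominate the weights, so polygons rich in those classes score highest.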
import numpy as np
import pandas as pd
from tqdm import tqdm
def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]
    # Add the selected polygon to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values
    return cumulative_coverages, subsample

df_data = # Your DataFrame with polygons and landcover classes
landcover_columns = # List of landcover columns in the DataFrame
target_coverages = # Dictionary of target coverages for each landcover class

total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, landcover_columns)

df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)

# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005

# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)

# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)

# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()

    normalized_weights = deficit / deficit_sum

    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus

    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)

    # Select the top 0.5% of polygons based on weight sum
    top_percentile = int(len(df_data) * top_percentage)
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)

    selected_polygon = df_data.loc[selected_polygon_index]

    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )

    # Add the selected polygon to the subsample (DataFrame.append was removed in pandas 2.0)
    subsample = pd.concat([subsample, subsample_iteration])
    df_data = df_data.drop(selected_polygon_index)

    # Check if all polygons have been selected or the cumulative coverage is close to the target coverage
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break

    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100

    # Update tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)

progress_bar.close()

subsample.reset_index(drop=True, inplace=True)
The problem
The code is slow (10 iterations/second) and does not manage the tolerance well, i.e. I can reach a cumulative coverage far above 100% for some classes while the tolerance is still not met (my selection guidance is not good enough). Also, there must be a better optimisation approach to get what I want.
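For completeness, here is a sketch of how I understand the optimisation my colleagues hinted at, cast as a mixed-integer linear program on the toy example (this assumes SciPy >= 1.9 for `scipy.optimize.milp`; the absolute-deviation objective and the auxiliary deviation variables `d` are my own modelling choices, and I have not tried this at the 500'000-row scale):

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Toy data: rows = polygons A-D, columns = the 3 landcover classes
area = np.array([[10, 15, 20],
                 [5,  8, 12],
                 [7,  4,  9],
                 [3,  6, 14]], dtype=float)
target = np.array([15, 20, 25], dtype=float)
n_poly, n_class = area.shape

# Variables: x (binary, polygon selected) followed by d (per-class deviation)
c = np.concatenate([np.zeros(n_poly), np.ones(n_class)])  # minimise sum of deviations

# d_j >= |sum_i area_ij * x_i - target_j| encoded as two linear constraints
upper = LinearConstraint(np.hstack([area.T, -np.eye(n_class)]), -np.inf, target)
lower = LinearConstraint(np.hstack([area.T,  np.eye(n_class)]), target, np.inf)

res = milp(
    c=c,
    constraints=[upper, lower],
    integrality=np.concatenate([np.ones(n_poly), np.zeros(n_class)]),  # x integer, d continuous
    bounds=Bounds(np.zeros(n_poly + n_class),
                  np.concatenate([np.ones(n_poly), np.full(n_class, np.inf)])),
)
selected = np.flatnonzero(res.x[:n_poly] > 0.5)  # indices of the chosen polygons
```

On the toy data this picks polygons A and C with a total deviation of 7, matching the subsample I described above; whether a MILP solver scales to 500'000 binary variables is exactly the kind of thing I would like advice on.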
Any help/suggestions would be greatly appreciated.