I am trying to compute a kind of similarity between points that have coordinates in geographic space. An example should make this clearer:

import numpy as np
import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point

df = pd.DataFrame({
    'Name':['a','b','c','d'],
    'Value':[1,2,3,4],
    'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
  Name  Value                 geometry
0    a      1  POINT (1.00000 0.00000)
1    b      2  POINT (1.00000 2.00000)
2    c      3  POINT (1.00000 0.00000)
3    d      4  POINT (3.00000 3.00000)

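I need a new dataframe that contains, for each pair of points, the distance between them, their similarity (the Manhattan distance in this example), and any other variables they carry (here only Name as an additional variable).

My solution is as follows: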

def calc_values_for_row(row, sourcepoint):  # sourcepoint is the row df.loc[i]; row is a later row taken from tdf
    sourcename = sourcepoint['Name']
    targetname = row['Name']
    manhattan = abs(sourcepoint['Value']-row['Value'])
    sourcecoord = sourcepoint['geometry']
    targetcoord = row['geometry']
    dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters

    new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
    new_row = pd.Series(new_row)
    new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
    return new_row

def calc_dist_df(df):
    full_df = pd.DataFrame()
    for i in df.index:
        tdf = df.loc[df.index>i]
        if not tdf.empty:
            sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
            full_df = pd.concat([full_df, sliced_df])
    return full_df.reset_index(drop=True)

calc_dist_df(gdf)


### EXPECTED RESULT
  SourceName TargetName  Manhattan  SourceCoord  TargetCoord   Distance (m)
0          a          b          1  POINT (1 0)  POINT (1 2)  222605.296097
1          a          c          2  POINT (1 0)  POINT (1 0)       0.000000
2          a          d          3  POINT (1 0)  POINT (3 3)  400362.335920
3          b          c          1  POINT (1 2)  POINT (1 0)  222605.296097
4          b          d          2  POINT (1 2)  POINT (3 3)  247555.571681
5          c          d          1  POINT (1 0)  POINT (3 3)  400362.335920

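It works as expected, but it is very slow on large datasets. I iterate over every row of the dataframe, slice the GDF once per row, and then call .apply() on the slice. Is there a way to avoid the outer for loop, or is there another approach that would make this algorithm faster?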

NOTE
itertools.combinations might not be the solution because the geometry column can contain repeated values.
EDIT
This is the distribution of repeated values for the 'geometry' column. As you can see, most of the points are repeated and only a few are unique.
[Figure: Distribution of PointID counts]

Recommended answer

You can use scipy.spatial.distance_matrix. Extract the coordinates from the Shapely Points with the .x and .y attributes:

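import numpy as np
from scipy.spatial import distance_matrix

RADIUS = 6371.009 * 1e3  # mean Earth radius in meters

# Cross join: every row paired with every row (N * N rows, in row-major order)
cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
# Euclidean distance between the coordinates in radians, scaled to meters.
# This is a flat approximation, so it differs slightly from the geodesic values.
cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS

# Keep only the pairs above the diagonal (source position < target position)
r, c = np.triu_indices(len(gdf), k=1)
cx = cx.loc[r * len(gdf) + c]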

Output:

>>> cx
   SourceName  SourceValue           Sourcegeometry TargetName  TargetValue           Targetgeometry       Distance
1           a            1  POINT (1.00000 0.00000)          b            2  POINT (1.00000 2.00000)  222390.167448
2           a            1  POINT (1.00000 0.00000)          c            3  POINT (1.00000 0.00000)       0.000000
3           a            1  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947
6           b            2  POINT (1.00000 2.00000)          c            3  POINT (1.00000 0.00000)  222390.167448
7           b            2  POINT (1.00000 2.00000)          d            4  POINT (3.00000 3.00000)  248639.765971
11          c            3  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947
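
The output above has no Manhattan column, and its distances are a flat approximation, so they differ slightly from the geodesic values in the question. As a rough sketch of one way to close both gaps, the pairs can be built directly from np.triu_indices, without materializing the full cross join, and the distance can be swapped for a vectorized haversine formula. The function name pairwise_table and its parameters are made up for this illustration. The haversine formula assumes a spherical Earth, so it still will not reproduce geopy's ellipsoidal distances exactly. The example passes x as latitude and y as longitude only because that is how the geopy call in the question reads the coordinates; if your data uses the usual GeoPandas convention (x = longitude, y = latitude), swap the two arguments.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371009.0  # mean Earth radius in meters, same value as RADIUS above


def pairwise_table(names, values, lat_deg, lon_deg):
    """Return one row per unordered pair (i < j) of input rows (hypothetical helper)."""
    names = np.asarray(names)
    values = np.asarray(values)
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))

    # Index pairs of the strict upper triangle: (0,1), (0,2), ..., (n-2,n-1).
    # Pairs are formed by position, so repeated coordinates are kept.
    i, j = np.triu_indices(len(names), k=1)

    # Haversine great-circle distance, evaluated for all pairs at once.
    dlat = lat[j] - lat[i]
    dlon = lon[j] - lon[i]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[i]) * np.cos(lat[j]) * np.sin(dlon / 2) ** 2
    dist_m = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    return pd.DataFrame({
        'SourceName': names[i],
        'TargetName': names[j],
        'Manhattan': np.abs(values[j] - values[i]),
        'Distance (m)': dist_m,
    })


# Same four points as in the question. x is passed as latitude here
# only to mirror how the question's geopy call reads the coordinates.
x = [1, 1, 1, 3]
y = [0, 2, 0, 3]
pairs = pairwise_table(['a', 'b', 'c', 'd'], [1, 2, 3, 4], lat_deg=x, lon_deg=y)
print(pairs)
```

With the GeoDataFrame from the question, the inputs would come from gdf['Name'], gdf['Value'], gdf.geometry.x and gdf.geometry.y. The result still has N * (N - 1) / 2 rows, so memory grows quadratically with the number of points.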
