我正在试图计算地理空间中具有坐标的点之间的某种相似性. 我将用一个例子让事情更清楚一些:
import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point
df = pd.DataFrame({
'Name':['a','b','c','d'],
'Value':[1,2,3,4],
'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
Name Value geometry
0 a 1 POINT (1.00000 0.00000)
1 b 2 POINT (1.00000 2.00000)
2 c 3 POINT (1.00000 0.00000)
3 d 4 POINT (3.00000 3.00000)
我需要一个新的数据帧,其中包含每对点之间的距离、它们的相似性(本例中为曼哈顿距离)以及它们的其他可能变量(本例中仅有name
作为附加变量).
我的解决方案如下:
def calc_values_for_row(row, sourcepoint): ## sourcepoint is a row of tdf
sourcename = sourcepoint['Name']
targetname = row['Name']
manhattan = abs(sourcepoint['Value']-row['Value'])
sourcecoord = sourcepoint['geometry']
targetcoord = row['geometry']
dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters
new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
new_row = pd.Series(new_row)
new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
return new_row
def calc_dist_df(df):
full_df = pd.DataFrame()
for i in df.index:
tdf = df.loc[df.index>i]
if tdf.empty == False:
sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
full_df = pd.concat([full_df, sliced_df])
return full_df.reset_index(drop=True)
calc_dist_df(gdf)
### EXPECTED RESULT
SourceName TargetName Manhattan SourceCoord TargetCoord Distance (m)
0 a b 1 POINT (1 0) POINT (1 2) 222605.296097
1 a c 2 POINT (1 0) POINT (1 0) 0.000000
2 a d 3 POINT (1 0) POINT (3 3) 400362.335920
3 b c 1 POINT (1 2) POINT (1 0) 222605.296097
4 b d 2 POINT (1 2) POINT (3 3) 247555.571681
5 c d 1 POINT (1 0) POINT (3 3) 400362.335920
它像预期的那样工作得很好,但对于 Big Data 集来说,它的速度非常慢.
我对数据帧的每一行迭代一次,对GDF切片一次,然后在切片的GDF上使用.apply()
,但我想知道是否有方法可以避免第一个for
循环,或者可能有其他解决方案使该算法更快.
NOTE
combination
from itertools might not be the solution because the geometry column can contain repeated values
EDIT
This is the distribution of repeated values for the 'geometry' column. As you can see most of the points are repeated and only a few are unique.