Edit: added a faster route to the desired groups, as suggested by @jqurious.
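First, let me extend the example `df`, because things get more complicated when separate groups follow each other directly, and I assume that is a real possibility.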
Data
import pandas as pd
data = {'word': {0: 'sugar', 1: 'ice', 2: 'cream', 3: 'hello', 4: 'there',
5: 'world', 6: 'sugar'},
'start': {0: 10, 1: 32, 2: 36, 3: 50, 4: 56, 5: 62, 6: 70},
'end': {0: 16, 1: 35, 2: 41, 3: 55, 4: 61, 5: 67, 6: 75}}
df = pd.DataFrame(data)
df
word start end
0 sugar 10 16
1 ice 32 35 # join `ice cream` (start: 32, end: 41)
2 cream 36 41
3 hello 50 55 # join `hello there world` (start: 50, end: 67)
4 there 56 61
5 world 62 67
6 sugar 70 75
Solution 1
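A better route, suggested by @jqurious in the comments:
- Shift `end` forward by one row (`Series.shift`) and subtract it from `start` (`Series.sub`).
- Check for `> 1` (`Series.gt`) to mark the start of each group as `True`, and apply `Series.cumsum` to assign a consecutive number to each group.
- Apply `df.groupby` and chain `.agg` to get the relevant aggregation for each column.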
g = df['start'].sub(df['end'].shift(1)).gt(1).cumsum()
# like: [0, 1, 1, 2, 2, 2, 3]
out = (df.groupby(g).agg({'word':' '.join, 'start':'first','end':'last'}))
out
word start end
0 sugar 10 16
1 ice cream 32 41
2 hello there world 50 67
3 sugar 70 75
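To see how the grouper `g` is built, the chain can be unpacked one step at a time. This is only a sketch on the same data; the intermediate names `prev_end`, `gap`, and `new_group` are mine and are used purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'word':  ['sugar', 'ice', 'cream', 'hello', 'there', 'world', 'sugar'],
    'start': [10, 32, 36, 50, 56, 62, 70],
    'end':   [16, 35, 41, 55, 61, 67, 75],
})

# end of the previous row (NaN for the first row)
prev_end = df['end'].shift(1)
# distance between the previous row's end and this row's start
gap = df['start'].sub(prev_end)
# True where a row does not directly follow the previous one;
# the NaN in the first row compares as False
new_group = gap.gt(1)
# running count of True values: one label per group
g = new_group.cumsum()

# put the steps side by side for inspection
steps = pd.DataFrame({'word': df['word'], 'gap': gap,
                      'new_group': new_group, 'g': g})
print(steps)
```

Because the first row compares as `False`, it gets label `0`, and every later `True` bumps the label by one.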
Solution 2 (verbose)
# get `True` for each row where col `end` + 1 equals next row `start`
s = df['end'].add(1).eq(df['start'].shift(-1))
# create groups for consecutive `True` values in `s`, and reindex
out = (s.diff().ne(0).cumsum()[s]).reindex(df.index)
# like: [nan, 2.0, nan, 4.0, 4.0, nan, nan] (still missing the group value
# for the last member of a group)
# fill the first `np.nan` after each `True` with a shift of `out`
out = out.where(out.notna(),out.shift(1))
# getting: [nan, 2.0, 2.0, 4.0, 4.0, 4.0, nan]
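# add another `Series.where` to fill remaining `np.nan` with unique negative labels
# derived from the index of `df` (assumed to be a default integer index); negative,
# so they cannot collide with the positive group numbers.
# these will be the `non-group` values
out = out.where(out.notna(), -(df.index.to_numpy() + 1))
# use `out` as a grouper and get the relevant aggregations for each column,
# keeping the original row order with `sort=False` and
# resetting the index with `as_index=False`
out = (df.groupby(out, as_index=False, sort=False)
.agg({'word':' '.join, 'start':'first','end':'last'}))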
# same result as `Solution 1`
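
For reuse, the shorter route can be wrapped in a small helper. The sketch below does this for the Solution 1 logic; the function name `join_adjacent` and the `max_gap` parameter are hypothetical, introduced only for illustration, and the input is assumed to be sorted by `start`.

```python
import pandas as pd

def join_adjacent(df: pd.DataFrame, max_gap: int = 1) -> pd.DataFrame:
    """Join consecutive rows whose `start` lies at most `max_gap` past the previous `end`.

    Hypothetical helper; assumes `df` is sorted by `start`.
    """
    # a row opens a new group when the gap to the previous row exceeds `max_gap`
    g = df['start'].sub(df['end'].shift(1)).gt(max_gap).cumsum()
    return (df.groupby(g)
              .agg({'word': ' '.join, 'start': 'first', 'end': 'last'})
              .reset_index(drop=True))

df = pd.DataFrame({
    'word':  ['sugar', 'ice', 'cream', 'hello', 'there', 'world', 'sugar'],
    'start': [10, 32, 36, 50, 56, 62, 70],
    'end':   [16, 35, 41, 55, 61, 67, 75],
})

out = join_adjacent(df)
print(out)
```

With the default `max_gap=1`, this should reproduce the four rows shown for Solution 1.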