Hi all,

I have a DataFrame containing words extracted from a text, with columns indicating their positions in that text. For example:

word    start   end
sugar   10      16
ice     32      35
cream   36      41

I want to write a loop that detects when words sit right next to each other in the text, so that "ice cream" becomes a single entry:

word      start   end
sugar     10      16
ice cream 32      41

I tried the following, but got a KeyError:

treats = pd.DataFrame()

for i in range(len(df)):
  if df['end'][i]+1 == df['start'][i+1]:
    word = df['word'][i]+' '+df['word'][i+1]
    start = df['start'][i]
    end = df['end'][i+1]
    i = i+2
  else:
    word = df['word'][i]
    start = df['start'][i]
    end = df['end'][i]
  row = [site,url,ent,word,start,end]
  treats = pd.concat([treats,row])

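What am I doing wrong?

EDIT: Wow. You came up with really thoughtful answers. I wish I could buy each of you a coffee. I just discovered an additional complication: the solutions work well for the problem as I originally defined it, but after trying them I found that the df also contains words that were split in two, so that df['end'][x] == df['start'][x+1]. Is there a simple workaround?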

Recommended answer

Edit: added the faster route to the desired groups, as suggested by @jqurious.


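First, let me extend the example df, because things get more complicated when separate groups follow each other directly, and I assume that can well happen.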

Data

import pandas as pd

data = {'word': {0: 'sugar', 1: 'ice', 2: 'cream', 3: 'hello', 4: 'there', 
                 5: 'world', 6: 'sugar'}, 
        'start': {0: 10, 1: 32, 2: 36, 3: 50, 4: 56, 5: 62, 6: 70}, 
        'end': {0: 16, 1: 35, 2: 41, 3: 55, 4: 61, 5: 67, 6: 75}}
df = pd.DataFrame(data)

df

    word  start  end
0  sugar     10   16
1    ice     32   35 # join `ice cream` (start: 32, end: 41)
2  cream     36   41
3  hello     50   55 # join `hello there world` (start: 50, end: 67)
4  there     56   61
5  world     62   67
6  sugar     70   75

Solution 1

The better route, suggested by @jqurious in the comments:

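  1. Shift `end` forward by one row (Series.shift) and subtract it from `start` (Series.sub).
  2. Check for > 1 (Series.gt) so that each row starting a new group is True, then apply Series.cumsum to assign a consecutive number to each group.
  3. Apply df.groupby and chain .agg to get the relevant aggregation for each column.

g = df['start'].sub(df['end'].shift(1)).gt(1).cumsum()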
# like: [0, 1, 1, 2, 2, 2, 3]

out = (df.groupby(g).agg({'word':' '.join, 'start':'first','end':'last'}))

out

                word  start  end
0              sugar     10   16
1          ice cream     32   41
2  hello there world     50   67
3              sugar     70   75
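
For the follow-up in the question's EDIT (words split in two, so that `end` equals the next `start`), the grouping key from Solution 1 already keeps such rows together, because a gap of 0 is not greater than 1. What changes is the separator. The following is only a sketch under assumptions: the sample rows `choco` / `late` are invented for illustration, and I am assuming that a gap of 0 should be joined without a space and a gap of 1 with a single space.

```python
import pandas as pd

# Hypothetical sample: 'choco' + 'late' is a word split in two (end == next start)
df = pd.DataFrame({
    "word": ["sugar", "ice", "cream", "choco", "late"],
    "start": [10, 32, 36, 50, 55],
    "end": [16, 35, 41, 55, 59],
})

# Distance between each start and the previous end (NaN for the first row)
gap = df["start"].sub(df["end"].shift(1))

# A new group starts wherever the gap is larger than 1
g = gap.gt(1).cumsum().rename("group")

# Prefix a word with a space only when it follows the previous word by exactly 1
sep = gap.eq(1).map({True: " ", False: ""})

out = (
    df.assign(word=sep + df["word"])
      .groupby(g)
      .agg({"word": "".join, "start": "first", "end": "last"})
      .reset_index(drop=True)
)
print(out)
```

The grouper is renamed to `group` here only so that it does not share a name with the `start` column; that rename is my choice, not part of the original answer.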

Solution 2 (verbose)

# get `True` for each row where col `end` + 1 equals next row `start`
s = df['end'].add(1).eq(df['start'].shift(-1))

# create groups for consecutive `True` values in `s`, and reindex
out = (s.diff().ne(0).cumsum()[s]).reindex(df.index)
# like: [nan, 2.0, nan, 4.0, 4.0, nan, nan] (still missing the group value
# for the last member of a group)

# fill the first `np.nan` after each `True` with a shift of `out`
out = out.where(out.notna(),out.shift(1))
# getting: [nan, 2.0, 2.0, 4.0, 4.0, 4.0, nan]

# add another `Series.where` to fill remaining `np.nan` with the index from `df`;
# these will be the `non-group` values
out = out.where(out.notna(),df.index.to_numpy())

# use `out` as a grouper and get the relevant aggregations for each column,
# resetting the index with `as_index=False`
out = (df.groupby(out, as_index=False)
       .agg({'word':' '.join, 'start':'first','end':'last'}))

# same result as `Solution 1`
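
As for the KeyError in the original loop: on the last iteration `i + 1` equals `len(df)`, so `df['start'][i+1]` asks for a label that does not exist. Two further points would bite afterwards: reassigning `i` inside a `for ... in range(...)` loop does not skip the next iteration, and `pd.concat` expects pandas objects rather than a plain list. If a loop is preferred over the vectorized solutions, a minimal sketch could look like this (the `site`, `url` and `ent` values from the question are left out because they are not defined there):

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["sugar", "ice", "cream"],
    "start": [10, 32, 36],
    "end": [16, 35, 41],
})

rows = []
i = 0
while i < len(df):
    word = df["word"].iloc[i]
    start = df["start"].iloc[i]
    end = df["end"].iloc[i]
    # Absorb following rows for as long as the next word starts
    # exactly one position after the current end
    while i + 1 < len(df) and df["start"].iloc[i + 1] == end + 1:
        i += 1
        word += " " + df["word"].iloc[i]
        end = df["end"].iloc[i]
    rows.append({"word": word, "start": start, "end": end})
    i += 1

# Build the result once, instead of concatenating inside the loop
treats = pd.DataFrame(rows)
print(treats)
```

Collecting dicts in a list and constructing the DataFrame at the end avoids repeated copying; on larger frames the groupby solutions above should still be considerably faster.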
