Python: comparing values in two different rows and variables in an iteration

Posted on June 3

Folks,

I have a DataFrame containing words extracted from a text, with columns indicating each word's position in the text. For example:

word    start   end
sugar   10      16
ice     32      35
cream   36      41

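I want to write a loop that identifies when words sit next to each other in the text and merges them, so that "ice cream" becomes its own entry: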

word      start   end
sugar     10      16
ice cream 32      41

I tried the following, but I get a KeyError:

treats = pd.DataFrame()

for i in range(len(df)):
  if df['end'][i]+1 == df['start'][i+1]:
    word = df['word'][i]+' '+df['word'][i+1]
    start = df['start'][i]
    end = df['end'][i+1]
    i = i+2
  else:
    word = df['word'][i]
    start = df['start'][i]
    end = df['end'][i]
  row = [site,url,ent,word,start,end]
  treats = pd.concat([treats,row])

What am I doing wrong?

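EDIT: Wow. You all came up with very, very thoughtful answers. I wish I could buy each of you a coffee. I just discovered an additional complication: the solutions work well for the problem as I originally defined it, but after trying them I found that the df also contains words that were split in two, so that df['end'][x] == df['start'][x+1]. Is there a simple workaround?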

import pandas as pd

data = {'word': {0: 'sugar', 1: 'ice', 2: 'cream', 3: 'hello',
                 4: 'there', 5: 'world', 6: 'sugar'},
        'start': {0: 10, 1: 32, 2: 36, 3: 50, 4: 56, 5: 62, 6: 70},
        'end': {0: 16, 1: 35, 2: 41, 3: 55, 4: 61, 5: 67, 6: 75}}
df = pd.DataFrame(data)

df

    word  start  end
0  sugar     10   16
1    ice     32   35   # join `ice cream` (start: 32, end: 41)
2  cream     36   41
3  hello     50   55   # join `hello there world` (start: 50, end: 67)
4  there     56   61
5  world     62   67
6  sugar     70   75

Solution 1

# a new group starts wherever `start` is more than 1 past the previous row's `end`
g = df['start'].sub(df['end'].shift(1)).gt(1).cumsum()
# like: [0, 1, 1, 2, 2, 2, 3]

out = (df.groupby(g).agg({'word':' '.join, 'start':'first','end':'last'}))

out

                word  start  end
0              sugar     10   16
1          ice cream     32   41
2  hello there world     50   67
3              sugar     70   75
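For the split-word case raised in the EDIT, the grouping above appears to carry over unchanged: a gap of 0 (`end` equal to the next `start`) is also not greater than 1, so the two halves would already land in the same group. Only the joiner needs care, since no space should be inserted between the halves of a split word. The following is a minimal sketch under that assumption; the sample rows ('choco' + 'late') and the helper name `join_by_gap` are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical sample: 'ice' + 'cream' are adjacent words (gap of 1),
# 'choco' + 'late' are two halves of one split word (gap of 0).
df = pd.DataFrame({
    'word':  ['sugar', 'ice', 'cream', 'choco', 'late'],
    'start': [10, 32, 36, 50, 55],
    'end':   [16, 35, 41, 55, 59],
})

# Gap between each row's start and the previous row's end (NaN for the first row).
gap = df['start'].sub(df['end'].shift(1))

# A new group starts wherever the gap exceeds 1; gaps of 0 and 1 stay together.
g = gap.gt(1).cumsum().rename('group')

def join_by_gap(words):
    """Join one group's words: glue on a gap of 0, insert a space otherwise."""
    parts = []
    for idx, w in words.items():
        if parts and gap[idx] == 0:
            parts[-1] += w      # split word: append to the previous piece
        else:
            parts.append(w)     # first word of the group, or a separate word
    return ' '.join(parts)

out = df.groupby(g).agg(
    word=('word', join_by_gap),
    start=('start', 'first'),
    end=('end', 'last'),
)
print(out)
```

The helper reads `gap` by index label, so it relies on the DataFrame keeping its default integer index.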

Solution 2

# get `True` for each row where col `end` + 1 equals next row `start`
s = df['end'].add(1).eq(df['start'].shift(-1))

# create groups for consecutive `True` values in `s`, and reindex
out = (s.diff().ne(0).cumsum()[s]).reindex(df.index)
# like: [nan, 2.0, nan, 4.0, 4.0, nan, nan] (still missing the group value
# for the last member of a group)

# fill the first `np.nan` after each `True` with a shift of `out`
out = out.where(out.notna(),out.shift(1))
# getting: [nan, 2.0, 2.0, 4.0, 4.0, 4.0, nan]

# add another `Series.where` to fill remaining `np.nan` with the index from `df`;
# these will be the `non-group` values
out = out.where(out.notna(),df.index.to_numpy())

# use `out` as a grouper and get the relevant aggregations for each column,
# resetting the index with `as_index=False`
out = (df.groupby(out, as_index=False)
       .agg({'word':' '.join, 'start':'first','end':'last'}))
# same result as `Solution 1`
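As for the KeyError in the question's loop: on the last iteration `i+1` points past the final row, so `df['start'][i+1]` has no matching label. Reassigning `i` inside a `for` loop also does not skip rows, because `range` supplies the next value regardless. A minimal sketch of an explicit loop that avoids both problems is shown here, using a `while` loop and collecting rows in a list before building the DataFrame. The `site`, `url` and `ent` fields from the question are left out because their source is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    'word':  ['sugar', 'ice', 'cream'],
    'start': [10, 32, 36],
    'end':   [16, 35, 41],
})

rows = []
i = 0
n = len(df)
while i < n:
    word = df['word'].iloc[i]
    start = df['start'].iloc[i]
    end = df['end'].iloc[i]
    # Absorb following rows while the next word starts right after this one ends.
    # The `i + 1 < n` check keeps the lookup inside the DataFrame.
    while i + 1 < n and end + 1 == df['start'].iloc[i + 1]:
        i += 1
        word += ' ' + df['word'].iloc[i]
        end = df['end'].iloc[i]
    rows.append({'word': word, 'start': start, 'end': end})
    i += 1

# Build the result once at the end instead of concatenating inside the loop.
treats = pd.DataFrame(rows)
print(treats)
```

The inner `while` also merges runs of three or more adjacent words, which a single `i+1` comparison would not.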

