I have a dataframe containing posts and comments. Each comment has an id and a parent id (the id of the comment or post it replies to). Posts only have an id, since they do not reply to anything.
submission | id | parent id |
---|---|---|
post1 | 1 | |
comment1 | 2 | 1 |
comment2 | 3 | 1 |
comment3 | 4 | 2 |
comment4 | 5 | 4 |
post2 | 6 | |
comment5 | 7 | 6 |
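For anyone who wants to reproduce this, the sample table could be built roughly as follows. This is only a sketch: the column names `id` and `parent_id` and the empty string for a missing parent are assumptions taken from the code further down, not necessarily how the real data is stored.

```python
import pandas as pd

# Sample data from the table above; "" marks a post (no parent).
df = pd.DataFrame({
    "submission": ["post1", "comment1", "comment2", "comment3",
                   "comment4", "post2", "comment5"],
    "id": [1, 2, 3, 4, 5, 6, 7],
    "parent_id": ["", 1, 1, 2, 4, "", 6],
})

print(df)
```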
For each comment, I want to retrieve the id of the original post, and get something like the following:
submission | id | parent id | ancestor id |
---|---|---|---|
post1 | 1 | | |
comment1 | 2 | 1 | 1 |
comment2 | 3 | 1 | 1 |
comment3 | 4 | 2 | 1 |
comment4 | 5 | 4 | 1 |
post2 | 6 | | |
comment5 | 7 | 6 | 6 |
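To do this, I tried looping from the end of the dataframe to the beginning, repeatedly following the parent_id of the parent_id until I reach an empty parent_id cell. It works on a test dataframe, but it is far too slow on the main dataframe. Is there a way to make it more efficient?
Here is my original code: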
# creating a column for the id of the original post
df["ancestor"] = df.id
# obtaining the id of the original post for every comment
for i in reversed(range(len(df.id))):  # looping through the rows
    id = df["parent_id"][i]  # variable to initialize the inner loop
    parent = id
    while parent != "":  # only looping through comments
        df.ancestor[i] = id
        parent = df.parent_id[df.id == id].values[0]
        id = parent
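
One direction that might help, as a minimal sketch rather than a definitive fix: the loop above scans the whole dataframe with `df.id == id` at every step of the walk, so replacing that scan with a dictionary lookup and remembering roots that were already found should cut the work considerably. The names `parent_of`, `find_root` and `cache` are made up for illustration. The sketch assumes the columns are called `id` and `parent_id`, that posts have `""` as their parent, that every `parent_id` also appears in `id`, and that there are no cycles.

```python
import pandas as pd

# Sample data (assumed layout: "" marks a post with no parent).
df = pd.DataFrame({
    "submission": ["post1", "comment1", "comment2", "comment3",
                   "comment4", "post2", "comment5"],
    "id": [1, 2, 3, 4, 5, 6, 7],
    "parent_id": ["", 1, 1, 2, 4, "", 6],
})

# Dictionary lookup instead of scanning the dataframe on every step.
parent_of = dict(zip(df["id"], df["parent_id"]))


def find_root(node, cache):
    """Walk up the parent links to the post, caching every id on the way."""
    path = []
    while node not in cache and parent_of[node] != "":
        path.append(node)
        node = parent_of[node]
    # Either we hit a cached id, or node is a post (its own root).
    root = cache.get(node, node)
    for visited in path:
        cache[visited] = root
    return root


cache = {}
# Posts keep an empty ancestor; comments get the id of their original post.
df["ancestor"] = [
    find_root(i, cache) if p != "" else ""
    for i, p in zip(df["id"], df["parent_id"])
]

print(df)
```

Because each id is resolved at most once and then served from `cache`, the total work grows roughly linearly with the number of rows instead of with rows times rows. A `parent_id` that does not exist in `id` would raise a `KeyError` here, so real data may need a check for that first.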