Python: removing common substrings using groupby

Posted on April 3

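Within each name, I want to remove the duplicates that share a common substring and keep only that common substring, which is the shortest string in the frame.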

I tried the code below and it works as expected, but I would like to improve it. The filter logic runs after the groupby, inside the reformat function.

import pandas as pd
import tabulate

def dumpdf(df):
    s = tabulate.tabulate(df, tablefmt='plain', headers='keys', showindex=True)
    print(s)
    return

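def reformat(df):
    # Collect the filtered rows of each 'name' group, then concatenate once.
    parts = []
    for name, group in df.groupby('name'):
        # Shortest packages first, so a prefix is visited before its extensions.
        group = group.sort_values(by="package", key=lambda x: x.str.len())
        for idx, row in group.iterrows():
            pkg = row['package']
            # Collapse every package that starts with pkg down to pkg itself.
            group.loc[group['package'].str.startswith(pkg, na=False), 'package'] = pkg
        # The collapsed rows are now duplicates; keep the first of each.
        group = group.drop_duplicates(['package'], keep='first')
        group = group.reset_index(drop=True)
        if len(group) > 0:
            parts.append(group)
    # Return an empty frame with the same columns if no rows survive.
    return pd.concat(parts, ignore_index=True) if parts else df.iloc[0:0]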

def main():

    data = [
        ['A','com.example'],
        ['A','com.example.a'],
        ['A','com.example.b.c'],
        ['A','com.fun'],
        ['B','com.demo'],
        ['B','com.demo.b.c'],
        ['B','com.fun'],
        ['B','com.fun.e'],
        ['B','com.fun.f.g']
        ]
    df = pd.DataFrame(data,columns=['name','package'])

    df = reformat(df)
    df = df.groupby('name', as_index=False).agg('\n'.join)

    dumpdf(df)
    return

main()

Output:

    name    package
 0  A       com.fun
            com.example
 1  B       com.fun
            com.demo
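
One possible simplification, offered as a sketch rather than a definitive answer: move the prefix filter into a small pure-Python helper and let groupby apply it per name, which avoids iterrows and the repeated concat. The names drop_prefixed and reformat_sketch are hypothetical and not part of pandas. The sketch keeps the plain startswith test from the code above, so it assumes that 'com.example' should also absorb a string such as 'com.examples'; a stricter check on the '.' boundary would be needed if that is not wanted.

```python
import pandas as pd


def drop_prefixed(packages):
    """Keep only the packages that do not start with a shorter kept package."""
    kept = []
    # Shortest first (ties broken alphabetically), so a prefix is always
    # seen before any of its extensions.
    for pkg in sorted(set(packages), key=lambda p: (len(p), p)):
        if not any(pkg.startswith(k) for k in kept):
            kept.append(pkg)
    return kept


def reformat_sketch(df):
    # Filter each name's packages and join the survivors with newlines,
    # giving one row per name as in the final frame of the question.
    joined = df.groupby('name')['package'].agg(
        lambda s: '\n'.join(drop_prefixed(s))
    )
    return joined.reset_index()


data = [
    ['A', 'com.example'],
    ['A', 'com.example.a'],
    ['A', 'com.example.b.c'],
    ['A', 'com.fun'],
    ['B', 'com.demo'],
    ['B', 'com.demo.b.c'],
    ['B', 'com.fun'],
    ['B', 'com.fun.e'],
    ['B', 'com.fun.f.g'],
]
df = pd.DataFrame(data, columns=['name', 'package'])
result = reformat_sketch(df)
print(result)
```

On the sample data this leaves com.fun and com.example for A, and com.fun and com.demo for B, matching the output shown above.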
