Start_Year   End_Year   Opp1              Opp2          Duration
1500         1501       ['A','B']        ['C','D']      1
1500         1510       ['P','Q','R']    ['X','Y']      10
1520         1520       ['A','X']        ['C']          0
...          ....        ........        .....          ..
1809         1820       ['M']            ['F','H','Z']  11

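My dataset (a CSV file) records armed wars between different entities (countries, states, and factions, represented by capital letters such as A, B, P, Q in the lists in the Opp1 and Opp2 (opponent) columns). Start_Year and End_Year are the years in which a war started and ended. The Duration column was created by subtracting Start_Year from End_Year.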

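I want to replicate the rows whose Duration is greater than 0 by a factor of the war's duration: if the duration is 6 years, replicate the row 6 times, decreasing Duration by 1 and increasing Start_Year by 1 in each successive copy, while keeping the values in the other columns the same. (If the duration is 1 year, the row should appear 2 times, so that after the last replication step each war's duration has reached 0 years. In other words, a war with Duration n ends up as n + 1 rows.)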

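I don't know how to approach something like this, as I am a beginner in data science and analytics. Please excuse me for not showing any attempted code here.

Expected output: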

Start_Year   End_Year   Opp1              Opp2          Duration
1500         1501       ['A','B']        ['C','D']      1
1501         1501       ['A','B']        ['C','D']      0
1500         1510       ['P','Q','R']    ['X','Y']      10
1501         1510       ['P','Q','R']    ['X','Y']      9
1502         1510       ['P','Q','R']    ['X','Y']      8
1503         1510       ['P','Q','R']    ['X','Y']      7
1504         1510       ['P','Q','R']    ['X','Y']      6
1505         1510       ['P','Q','R']    ['X','Y']      5
....         ....       .............    ........       ..
1510         1510       ['P','Q','R']    ['X','Y']      0
1520         1520       ['A','X']        ['C']          0
...          ....        ........        .....          ..
1809         1820       ['M']            ['F','H','Z']  11
1810         1820       ['M']            ['F','H','Z']  10
....         ....       .....            .............. ..
1820         1820       ['M']            ['F','H','Z']  0 


Recommended answer

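You can use pandas.Index.repeat to repeat each row according to its Duration value (Duration + 1 times, so that the last copy reaches Duration 0), and then use pandas.core.groupby.GroupBy.cumcount to add an increasing cumulative count to the Start_Year column.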
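
Before applying this to the war data, it may help to see the two building blocks on their own. The following is a minimal sketch on a toy frame; the column names key, n and step are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Toy frame (hypothetical column names, not the war data).
toy = pd.DataFrame({"key": ["a", "b"], "n": [2, 3]})

# Index.repeat repeats each index label n times; .loc then selects those rows.
repeated = toy.loc[toy.index.repeat(toy["n"])]

# cumcount numbers the rows inside each group 0, 1, 2, ...
repeated = repeated.assign(step=repeated.groupby("key").cumcount())

print(repeated)
```

Row "a" appears twice and row "b" three times, and step restarts at 0 for each group. That restarting counter is what gets added to Start_Year in the steps that follow.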

Reading data

import pandas as pd

data = [[1500, 1501, ['A','B'], ['C','D'], 1],
        [1500, 1510, ['P','Q','R'], ['X','Y'], 10],
        [1520, 1520, ['A','X'], ['C'], 0],
        [1809, 1820, ['M'], ['F','H','Z'], 11]]
df = pd.DataFrame(data, columns = ['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

Repeating the values

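# Rows with Duration 0 need no copies; they are set aside and re-attached later.
mask = df['Duration'].gt(0)
df1 = df[mask].copy()
# Repeat each index label Duration + 1 times, then select those rows.
df1 = df1.loc[df1.index.repeat(df1['Duration'] + 1)]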

Assigning increasing values to each group

df1['Start_Year'] += df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount()
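
The astype(str) call in the line above is there because list-valued cells are unhashable and cannot be used as group keys directly. A possible alternative, offered only as a sketch and not as part of the original answer, is to group on the repeated index label instead: Index.repeat keeps the original row label on every copy, so each war forms its own group. This assumes the original frame has a unique index.

```python
import pandas as pd

df = pd.DataFrame(
    [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
     [1500, 1510, ['P', 'Q', 'R'], ['X', 'Y'], 10]],
    columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'],
)

# Each copy keeps the index label of the row it came from.
df1 = df.loc[df.index.repeat(df['Duration'] + 1)].copy()

# Group on that label (level=0) instead of on stringified columns.
offset = df1.groupby(level=0).cumcount()
df1['Start_Year'] += offset

print(df1[['Start_Year', 'End_Year']])
```

Grouping on the index also keeps two distinct wars apart if they happen to share the same years and opponents, which the string-based keys would merge into a single group.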

Generating output

df1['Duration'] = df1['End_Year'] - df1['Start_Year']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)
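
The kind = 'mergesort' argument matters because the repeated rows share index labels: a stable sort keeps rows with equal labels in their existing order, so the copies of each war stay in ascending Start_Year order. A small sketch on a toy Series (the values are arbitrary) shows the effect:

```python
import pandas as pd

# Toy series with duplicated index labels, standing in for the repeated rows.
s = pd.Series([10, 20, 30, 40], index=[1, 0, 1, 0])

# A stable sort keeps entries with equal labels in their original order.
stable = s.sort_index(kind="mergesort")

print(stable.tolist())
```

Within label 0 the values stay in the order 20, 40, and within label 1 in the order 10, 30.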

This gives us the expected output:

    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0

Alternatively

After the "Repeating the values" step, you can also try another approach: first assign Duration as a decreasing cumulative count, then recompute Start_Year from it.

df1['Duration'] = df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount(ascending=False)
df1['Start_Year'] = df1['End_Year'] - df1['Duration']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)

Output:

This gives the same expected output:

    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0

You can use pandas.DataFrame.reset_index to reset the index.

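Summary:

Basically, what we do here is replicate rows based on the value of the Duration column and a condition on it.

We first set aside the rows (those with Duration 0) that could vanish when rows are repeated with pandas.Index.repeat. We then replicate the rows with Duration > 0 and replace their column values with successive increasing or decreasing cumulative counts from pandas.core.groupby.GroupBy.cumcount. Finally, we concatenate the two dataframes and sort them on the index with pandas.DataFrame.sort_index. Because pandas.Index.repeat repeats the index labels along with the rows, sorting on the index returns the dataframe in the same order as the original.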
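
For reuse, the steps can be folded into one function. The version below is a sketch rather than the answer's exact method, and the name expand_by_duration is hypothetical. It differs in two ways, both of them assumptions: it groups on the repeated index label instead of on stringified columns, and it skips the mask, since a row with Duration 0 is repeated 0 + 1 = 1 time and so survives on its own. It assumes a unique index and non-negative Duration values.

```python
import pandas as pd

def expand_by_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per year from Start_Year to End_Year."""
    # Repeat every row Duration + 1 times; Duration 0 rows are kept once.
    out = df.loc[df.index.repeat(df['Duration'] + 1)].copy()
    # Copies share their source row's index label, so count within each label.
    out['Start_Year'] += out.groupby(level=0).cumcount()
    out['Duration'] = out['End_Year'] - out['Start_Year']
    return out.reset_index(drop=True)

data = [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
        [1500, 1510, ['P', 'Q', 'R'], ['X', 'Y'], 10],
        [1520, 1520, ['A', 'X'], ['C'], 0],
        [1809, 1820, ['M'], ['F', 'H', 'Z'], 11]]
df = pd.DataFrame(data, columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

result = expand_by_duration(df)
print(result)
```

Because Index.repeat keeps the rows in their original order, no concatenation or sorting step is needed in this variant.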
