我试图通过将"FAULT_START"列标记为"True"来解决这样一个问题,即当故障代码第一次出现时,我需要计算它们的行数.如果它们按顺序重复,且没有出现任何零,则将其标记为"False".如果出现零,则表示故障(S)已清除,如果相同故障(S)再次出现(S),则将其标记为"真".

我已经包括了下面的数据样本.请注意,"Rn"列只是行号,如果需要,可以忽略它.此外,"FAULT_START"是滞后一行的"FAULT_CODE"列.

truck_fault_counts_1hz_df = 
Truck                       Timestamp                       fault_code  rn  fault_start
251_PS1_PB_520_REF_111199   2023-07-18T00:01:02.739+0000    3           63  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:03.739+0000    244         64  3
251_PS1_PB_520_REF_111199   2023-07-18T00:01:04.739+0000    3           65  244
251_PS1_PB_520_REF_111199   2023-07-18T00:01:05.739+0000    244         66  3
251_PS1_PB_520_REF_111199   2023-07-18T00:01:06.739+0000    3           67  244
251_PS1_PB_520_REF_111199   2023-07-18T00:01:07.739+0000    244         68  3
251_PS1_PB_520_REF_111199   2023-07-18T00:01:08.739+0000    3           69  244
251_PS1_PB_520_REF_111199   2023-07-18T00:01:09.739+0000    244         70  3
251_PS1_PB_520_REF_111199   2023-07-18T00:01:10.739+0000    3           71  244
251_PS1_PB_520_REF_111199   2023-07-18T00:01:11.739+0000    244         72  3
251_PS1_PB_520_REF_111199   2023-07-18T00:01:12.741+0000    3           73  244
251_PS1_PB_520_REF_111199   2023-07-18T00:01:13.741+0000    244         74  3
251_PS1_PB_520_REF_111199   2023-07-18T00:01:14.839+0000    49          75  244
251_PS1_PB_520_REF_111199   2023-07-18T00:01:15.839+0000    0           76  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:16.840+0000    49          77  0
251_PS1_PB_520_REF_111199   2023-07-18T00:01:17.840+0000    0           78  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:18.840+0000    49          79  0
251_PS1_PB_520_REF_111199   2023-07-18T00:01:19.840+0000    0           80  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:20.840+0000    49          81  0
251_PS1_PB_520_REF_111199   2023-07-18T00:01:21.840+0000    0           82  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:22.840+0000    49          83  0
251_PS1_PB_520_REF_111199   2023-07-18T00:01:23.840+0000    0           84  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:24.840+0000    49          85  0
251_PS1_PB_520_REF_111199   2023-07-18T00:01:25.841+0000    0           86  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:26.842+0000    49          87  0
251_PS1_PB_520_REF_111199   2023-07-18T00:01:27.940+0000    3           88  49
251_PS1_PB_520_REF_111199   2023-07-18T00:01:28.940+0000    244         89  3

下面的解决方案统计了所有不为零的故障,但没有计算连续重复出现的故障:

# Initialize a list of numbers to not be included in the distinct_faults list
x = [0.0]
# Select the distinct faults in the fault_code column and convert it to Pandas
distinct_faults_df = truck_fault_counts_1hz_df.dropDuplicates(["fault_code"]).select("fault_code").toPandas()["fault_code"]
# Create a list of the distinct faults in the fault_code column and convert the values to ints
distinct_faults = [int(i) for i in distinct_faults_df if i not in x]
# Sort the list in ascending order
distinct_faults.sort()
# Convert the values to strings
distinct_faults = [str(i) for i in distinct_faults]

# Convert the fault_code and fault_start column to int then to string from float
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_code",truck_fault_counts_1hz_df.fault_code.cast('int'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_code",truck_fault_counts_1hz_df.fault_code.cast('string'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_start",truck_fault_counts_1hz_df.fault_start.cast('int'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_start",truck_fault_counts_1hz_df.fault_start.cast('string'))

# Fill any Null values with "0"
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.fillna(value="0", subset=["fault_start"])

for i in distinct_faults:
    # Create a column for row count and fault start
    # fault_start lags fault_code by one row
    # Any row that has a number greater than 0 in fault_code
    truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.\
        withColumn("fault_start",\
        when(((col("fault_code") == i) & (col("fault_start") != i)), 
        lit("True")).\
        when(((col("fault_code") == 0) & (col("fault_start") == 0)), 
        lit("False")).\
        when(((col("fault_code") == i) & (col("fault_start") == i)), 
        lit("False")).\
        when(((col("fault_code") == 0) & (col("fault_start") == i)), 
        lit("False")).\
        otherwise(truck_fault_counts_1hz_df.fault_start))

truck_fault_counts_1hz_noSpam_df = truck_fault_counts_1hz_df.where(truck_fault_counts_1hz_df.fault_start == True).drop("rn")

display(truck_fault_counts_1hz_df)

我目前拥有的输出是:

Truck                       Timestamp                       fault_code  rn  fault_start
251_PS1_PB_520_REF_111199   2023-07-18T00:01:02.739+0000    3           63  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:03.739+0000    244         64  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:04.739+0000    3           65  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:05.739+0000    244         66  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:06.739+0000    3           67  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:07.739+0000    244         68  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:08.739+0000    3           69  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:09.739+0000    244         70  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:10.739+0000    3           71  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:11.739+0000    244         72  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:12.741+0000    3           73  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:13.741+0000    244         74  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:14.839+0000    49          75  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:15.839+0000    0           76  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:16.840+0000    49          77  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:17.840+0000    0           78  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:18.840+0000    49          79  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:19.840+0000    0           80  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:20.840+0000    49          81  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:21.840+0000    0           82  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:22.840+0000    49          83  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:23.840+0000    0           84  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:24.840+0000    49          85  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:25.841+0000    0           86  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:26.842+0000    49          87  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:27.940+0000    3           88  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:28.940+0000    244         89  True

我希望实现的输出是:

Truck                       Timestamp                       fault_code  rn  fault_start
251_PS1_PB_520_REF_111199   2023-07-18T00:01:02.739+0000    3           63  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:03.739+0000    244         64  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:04.739+0000    3           65  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:05.739+0000    244         66  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:06.739+0000    3           67  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:07.739+0000    244         68  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:08.739+0000    3           69  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:09.739+0000    244         70  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:10.739+0000    3           71  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:11.739+0000    244         72  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:12.741+0000    3           73  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:13.741+0000    244         74  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:14.839+0000    49          75  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:15.839+0000    0           76  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:16.840+0000    49          77  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:17.840+0000    0           78  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:18.840+0000    49          79  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:19.840+0000    0           80  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:20.840+0000    49          81  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:21.840+0000    0           82  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:22.840+0000    49          83  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:23.840+0000    0           84  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:24.840+0000    49          85  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:25.841+0000    0           86  False
251_PS1_PB_520_REF_111199   2023-07-18T00:01:26.842+0000    49          87  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:27.940+0000    3           88  True
251_PS1_PB_520_REF_111199   2023-07-18T00:01:28.940+0000    244         89  True

我还一直在try 实现一个字典,将故障代码作为键,故障开始(True/False)作为值,以跟踪它们已经启动,但还没有找到一种实用的方法来实现,它不涉及迭代每一行的时间和内存密集型操作(该数据集包含1900万行,因此这是不实际的).我还考虑过 for each 新的故障代码创建一个新的专栏,但也没有找到实现它的方法.

任何帮助都不胜感激!!

推荐答案

Code

W1 = Window.partitionBy('Truck').orderBy('Timestamp')
df = df.withColumn('reset_cnt', F.sum((F.col('fault_code') == 0).cast('int')).over(W1))

W2 = Window.partitionBy('Truck', 'reset_cnt', 'fault_code').orderBy('Timestamp')
df = df.withColumn('fault_start', F.row_number().over(W2) == 1)
df = df.withColumn('fault_start', F.expr("IF(fault_code = 0, false, fault_start)"))

How this works

将数据帧除以Truck并按Timestamp排序,然后在fault_code等于0的情况下计算累积和.这将创建一个名为reset_cnt的列,它将帮助我们区分重置fault_code的不同行块,即清除故障(S).

+--------------------+--------------------+----------+---+---------+
|               Truck|           Timestamp|fault_code| rn|reset_cnt|
+--------------------+--------------------+----------+---+---------+
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 63|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 64|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 65|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 66|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 67|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 68|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 69|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 70|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 71|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 72|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 73|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 74|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 75|        0|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 76|        1|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 77|        1|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 78|        2|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 79|        2|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 80|        3|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 81|        3|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 82|        4|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 83|        4|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 84|        5|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 85|        5|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 86|        6|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 87|        6|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 88|        6|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 89|        6|
+--------------------+--------------------+----------+---+---------+

创建fault_start列:按Truckreset_cntfault_code对数据帧进行分区,并分配行号以标识每个分区的副本fault_codes.

+--------------------+--------------------+----------+---+---------+-----------+
|               Truck|           Timestamp|fault_code| rn|reset_cnt|fault_start|
+--------------------+--------------------+----------+---+---------+-----------+
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 63|        0|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 64|        0|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 65|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 66|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 67|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 68|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 69|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 70|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 71|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 72|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 73|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 74|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 75|        0|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 76|        1|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 77|        1|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 78|        2|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 79|        2|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 80|        3|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 81|        3|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 82|        4|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 83|        4|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 84|        5|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 85|        5|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 86|        6|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 87|        6|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 88|        6|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 89|        6|       true|
+--------------------+--------------------+----------+---+---------+-----------+

使用FALSE掩码fault_start中的行,其中fault_code0

+--------------------+--------------------+----------+---+---------+-----------+
|               Truck|           Timestamp|fault_code| rn|reset_cnt|fault_start|
+--------------------+--------------------+----------+---+---------+-----------+
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 63|        0|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 64|        0|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 65|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 66|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 67|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 68|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 69|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 70|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 71|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 72|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 73|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 74|        0|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 75|        0|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 76|        1|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 77|        1|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 78|        2|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 79|        2|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 80|        3|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 81|        3|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 82|        4|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 83|        4|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 84|        5|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 85|        5|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         0| 86|        6|      false|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|        49| 87|        6|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|         3| 88|        6|       true|
|251_PS1_PB_520_RE...|2023-07-18T00:01:...|       244| 89|        6|       true|
+--------------------+--------------------+----------+---+---------+-----------+

Python-3.x相关问答推荐

Pandas -我们如何在一行中应用多个要求

在 Python 中比较和排序列之间的值(带有不匹配列)

在Python中基于组/ID将两个数据帧进行映射,找出较接近的值

Sunburst 折线图可视化

将数据框中的值与另一个数据框中的多列进行比较,以获取条目以有效方式匹配的列表列表

转换Pandas 数据框 - 添加行

如何通过从特定列创建分组多标题来reshape 数据框?

当我判断另一个 checkButton 时,如何判断两个 python tkinter checkButtons?

在 Python 3.5 中使用 aiohttp 获取多个 url

多个返回函数的列表理解?

创建一个可旋转的 3D 地球

str.format_map(mapping) 和 str.format 有什么区别

具有两个或多个返回参数的函数注释

在 Python 3 中使用 unittest.mock 修补 input()

如何使 Python3 成为 Geany 中的默认 Python

如何编写可 Select 充当常规函数的 asyncio 协程?

django - 值更改后自动更新日期

如何找出从哪个模块导入名称?

通过字典有效地替换Pandas 系列中的值

__iter__ 和 __getitem__ 有什么区别?