我试图通过将"FAULT_START"列标记为"True"来解决这样一个问题,即当故障代码第一次出现时,我需要计算它们的行数.如果它们按顺序重复,且没有出现任何零,则将其标记为"False".如果出现零,则表示故障(S)已清除,如果相同故障(S)再次出现(S),则将其标记为"真".
我已经包括了下面的数据样本.请注意,"Rn"列只是行号,如果需要,可以忽略它.此外,"FAULT_START"是滞后一行的"FAULT_CODE"列.
truck_fault_counts_1hz_df =
Truck Timestamp fault_code rn fault_start
251_PS1_PB_520_REF_111199 2023-07-18T00:01:02.739+0000 3 63 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:03.739+0000 244 64 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:04.739+0000 3 65 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:05.739+0000 244 66 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:06.739+0000 3 67 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:07.739+0000 244 68 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:08.739+0000 3 69 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:09.739+0000 244 70 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:10.739+0000 3 71 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:11.739+0000 244 72 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:12.741+0000 3 73 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:13.741+0000 244 74 3
251_PS1_PB_520_REF_111199 2023-07-18T00:01:14.839+0000 49 75 244
251_PS1_PB_520_REF_111199 2023-07-18T00:01:15.839+0000 0 76 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:16.840+0000 49 77 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:17.840+0000 0 78 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:18.840+0000 49 79 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:19.840+0000 0 80 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:20.840+0000 49 81 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:21.840+0000 0 82 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:22.840+0000 49 83 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:23.840+0000 0 84 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:24.840+0000 49 85 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:25.841+0000 0 86 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:26.842+0000 49 87 0
251_PS1_PB_520_REF_111199 2023-07-18T00:01:27.940+0000 3 88 49
251_PS1_PB_520_REF_111199 2023-07-18T00:01:28.940+0000 244 89 3
下面的解决方案统计了所有不为零的故障,但没有计算连续重复出现的故障:
# Initialize a list of numbers to not be included in the distinct_faults list
x = [0.0]
# Select the distinct faults in the fault_code column and convert it to Pandas
distinct_faults_df = truck_fault_counts_1hz_df.dropDuplicates(["fault_code"]).select("fault_code").toPandas()["fault_code"]
# Create a list of the distinct faults in the fault_code column and convert the values to ints
distinct_faults = [int(i) for i in distinct_faults_df if i not in x]
# Sort the list in ascending order
distinct_faults.sort()
# Convert the values to strings
distinct_faults = [str(i) for i in distinct_faults]
# Convert the fault_code and fault_start column to int then to string from float
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_code",truck_fault_counts_1hz_df.fault_code.cast('int'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_code",truck_fault_counts_1hz_df.fault_code.cast('string'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_start",truck_fault_counts_1hz_df.fault_start.cast('int'))
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.withColumn("fault_start",truck_fault_counts_1hz_df.fault_start.cast('string'))
# Fill any Null values with "0"
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.fillna(value="0", subset=["fault_start"])
for i in distinct_faults:
# Create a column for row count and fault start
# fault_start lags fault_code by one row
# Any row that has a number greater than 0 in fault_code
truck_fault_counts_1hz_df = truck_fault_counts_1hz_df.\
withColumn("fault_start",\
when(((col("fault_code") == i) & (col("fault_start") != i)),
lit("True")).\
when(((col("fault_code") == 0) & (col("fault_start") == 0)),
lit("False")).\
when(((col("fault_code") == i) & (col("fault_start") == i)),
lit("False")).\
when(((col("fault_code") == 0) & (col("fault_start") == i)),
lit("False")).\
otherwise(truck_fault_counts_1hz_df.fault_start))
truck_fault_counts_1hz_noSpam_df = truck_fault_counts_1hz_df.where(truck_fault_counts_1hz_df.fault_start == True).drop("rn")
display(truck_fault_counts_1hz_df)
我目前拥有的输出是:
Truck Timestamp fault_code rn fault_start
251_PS1_PB_520_REF_111199 2023-07-18T00:01:02.739+0000 3 63 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:03.739+0000 244 64 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:04.739+0000 3 65 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:05.739+0000 244 66 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:06.739+0000 3 67 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:07.739+0000 244 68 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:08.739+0000 3 69 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:09.739+0000 244 70 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:10.739+0000 3 71 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:11.739+0000 244 72 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:12.741+0000 3 73 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:13.741+0000 244 74 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:14.839+0000 49 75 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:15.839+0000 0 76 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:16.840+0000 49 77 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:17.840+0000 0 78 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:18.840+0000 49 79 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:19.840+0000 0 80 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:20.840+0000 49 81 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:21.840+0000 0 82 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:22.840+0000 49 83 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:23.840+0000 0 84 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:24.840+0000 49 85 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:25.841+0000 0 86 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:26.842+0000 49 87 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:27.940+0000 3 88 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:28.940+0000 244 89 True
我希望实现的输出是:
Truck Timestamp fault_code rn fault_start
251_PS1_PB_520_REF_111199 2023-07-18T00:01:02.739+0000 3 63 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:03.739+0000 244 64 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:04.739+0000 3 65 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:05.739+0000 244 66 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:06.739+0000 3 67 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:07.739+0000 244 68 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:08.739+0000 3 69 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:09.739+0000 244 70 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:10.739+0000 3 71 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:11.739+0000 244 72 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:12.741+0000 3 73 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:13.741+0000 244 74 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:14.839+0000 49 75 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:15.839+0000 0 76 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:16.840+0000 49 77 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:17.840+0000 0 78 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:18.840+0000 49 79 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:19.840+0000 0 80 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:20.840+0000 49 81 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:21.840+0000 0 82 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:22.840+0000 49 83 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:23.840+0000 0 84 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:24.840+0000 49 85 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:25.841+0000 0 86 False
251_PS1_PB_520_REF_111199 2023-07-18T00:01:26.842+0000 49 87 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:27.940+0000 3 88 True
251_PS1_PB_520_REF_111199 2023-07-18T00:01:28.940+0000 244 89 True
我还一直在try 实现一个字典,将故障代码作为键,故障开始(True/False)作为值,以跟踪它们已经启动,但还没有找到一种实用的方法来实现,它不涉及迭代每一行的时间和内存密集型操作(该数据集包含1900万行,因此这是不实际的).我还考虑过 for each 新的故障代码创建一个新的专栏,但也没有找到实现它的方法.
任何帮助都不胜感激!!