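I'm looking for a way to evaluate multiple conditions within a group. First, I need to detect overlaps between records (per id). Second, I need to make an exception for the record with the highest tran number within the same run of contiguously overlapping records. On top of that, a single id can contain several separate overlaps. Here is an example: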

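from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Reuse the active session if one exists (e.g. in a notebook), otherwise create it.
spark = SparkSession.builder.getOrCreate()

# Columns: id, tran, start, end
data = [('A',1000,1,100),
   ('B',1001,0,10),
   ('B',1002,10,15),
   ('B',1003,20,22),
   ('B',1004,25,50),
   ('B',1005,50,55),
   ('B',1006,53,56),
   ('B',1007,60,100),
   ('C',1008,1,100)
 ]

schema = StructType([
   StructField("id",StringType(),True),
   StructField("tran",IntegerType(),True),
   StructField("start",IntegerType(),True),
   StructField("end",IntegerType(),True),
 ])

df = spark.createDataFrame(data=data,schema=schema)
df.show()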

+---+----+-----+---+
| id|tran|start|end|
+---+----+-----+---+
|  A|1000|    1|100|
|  B|1001|    0| 10|
|  B|1002|   10| 15|
|  B|1003|   20| 22|
|  B|1004|   25| 50|
|  B|1005|   50| 55|
|  B|1006|   53| 56|
|  B|1007|   60|100|
|  C|1008|    1|100|
+---+----+-----+---+

The desired DataFrame should look like this:

+---+----+-----+---+-----+
| id|tran|start|end|valid|
+---+----+-----+---+-----+
|  A|1000|    1|100|  yes| # valid: no other record with this id overlaps it
|  B|1001|    0| 10|   no| # invalid: overlaps the next record of this id and does not have the higher tran
|  B|1002|   10| 15|  yes| # overlaps the previous record but has the higher tran of the two
|  B|1003|   20| 22|  yes| # valid: no overlap
|  B|1004|   25| 50|   no| # invalid: overlaps and its tran is not the highest
|  B|1005|   50| 55|   no| # invalid: overlaps and its tran is not the highest
|  B|1006|   53| 56|  yes| # overlaps the previous ones but has the highest tran of the three contiguously overlapping records
|  B|1007|   60|100|  yes| # valid: no overlap
|  C|1008|    1|100|  yes| # valid: no overlap
+---+----+-----+---+-----+
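
The rule spelled out in the comments above can be pinned down with a small plain-Python reference, which may be handy for checking any Spark solution against. This is only a sketch under a few assumptions: `mark_valid` is a hypothetical helper name, tran values are taken to be unique, and touching endpoints (a start equal to the previous end) are treated as overlapping, as the table comments suggest.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows as (id, tran, start, end), with tran values as in the printed table.
rows = [
    ('A', 1000, 1, 100),
    ('B', 1001, 0, 10),
    ('B', 1002, 10, 15),
    ('B', 1003, 20, 22),
    ('B', 1004, 25, 50),
    ('B', 1005, 50, 55),
    ('B', 1006, 53, 56),
    ('B', 1007, 60, 100),
    ('C', 1008, 1, 100),
]


def mark_valid(rows):
    """Return {tran: 'yes' or 'no'} under the rule read off the table comments."""
    result = {}
    by_id_then_start = sorted(rows, key=itemgetter(0, 2))
    for _, group in groupby(by_id_then_start, key=itemgetter(0)):
        chains = []        # each chain holds rows that overlap transitively
        chain_end = None   # furthest end seen so far in the current chain
        for row in group:
            start, end = row[2], row[3]
            # Assumption: start == previous end counts as an overlap.
            if chain_end is not None and start <= chain_end:
                chains[-1].append(row)
                chain_end = max(chain_end, end)
            else:
                chains.append([row])
                chain_end = end
        for chain in chains:
            top = max(r[1] for r in chain)  # highest tran in the chain
            for r in chain:
                result[r[1]] = 'yes' if r[1] == top else 'no'
    return result


valid = mark_valid(rows)
for id_, tran, start, end in rows:
    print(id_, tran, start, end, valid[tran])
```

A record that overlaps nothing forms a chain of one and is therefore its own highest tran, so it comes out as 'yes' without a special case.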

Many thanks to the legend who solves this :)

Recommended answer

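Self-join the DataFrame and group by the original columns: for each record, count the other records with the same id whose start falls inside its [start, end] range. A record with a count of zero is marked valid.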

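from pyspark.sql import functions as f

# Pair each record (a) with every other record (b) of the same id whose start
# lies inside a's [start, end] range; between() is inclusive on both ends.
overlaps = (f.col('a.id') == f.col('b.id')) \
    & (f.col('a.start') != f.col('b.start')) \
    & (f.col('b.start').between(f.col('a.start'), f.col('a.end')))

# Left join keeps records with no match; count('b.id') ignores the resulting nulls.
df.alias('a').join(df.alias('b'), overlaps, 'left') \
  .groupBy('a.id', 'a.tran', 'a.start', 'a.end') \
  .agg(f.count('b.id').alias('valid')) \
  .withColumn('valid', f.when(f.col('valid') == f.lit(0), 'yes').otherwise('no')) \
  .show()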

+---+----+-----+---+-----+
| id|tran|start|end|valid|
+---+----+-----+---+-----+
|  B|1001|    0| 10|   no|
|  A|1000|    1|100|  yes|
|  B|1002|   10| 15|  yes|
|  B|1003|   20| 22|  yes|
|  B|1004|   25| 50|   no|
|  B|1005|   50| 55|   no|
|  B|1006|   53| 56|  yes|
|  B|1007|   60|100|  yes|
|  C|1008|    1|100|  yes|
+---+----+-----+---+-----+
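
One thing worth noting about the join above is that its condition never reads tran: a record is marked 'no' whenever another record of the same id starts inside its range. That agrees with the desired output here because tran grows with start in the sample. The plain-Python sketch below (not Spark) emulates the same count to make this visible. `self_join_valid` is a hypothetical helper name, the tran values follow the question's printed table, and the `swapped` rows are made up purely for illustration.

```python
# Sample rows as (id, tran, start, end), with tran values as in the question's table.
rows = [
    ('A', 1000, 1, 100),
    ('B', 1001, 0, 10),
    ('B', 1002, 10, 15),
    ('B', 1003, 20, 22),
    ('B', 1004, 25, 50),
    ('B', 1005, 50, 55),
    ('B', 1006, 53, 56),
    ('B', 1007, 60, 100),
    ('C', 1008, 1, 100),
]


def self_join_valid(rows):
    """Plain-Python emulation of the self-join and count; returns {tran: 'yes'/'no'}."""
    out = {}
    for a in rows:
        # Count rows b with the same id, a different start,
        # and a start inside a's inclusive [start, end] range.
        followers = sum(
            1 for b in rows
            if b[0] == a[0] and b[2] != a[2] and a[2] <= b[2] <= a[3]
        )
        out[a[1]] = 'yes' if followers == 0 else 'no'
    return out


sample = self_join_valid(rows)
print(sample)

# Hypothetical rows (not from the question): the higher tran starts first.
swapped = [('X', 2, 0, 10), ('X', 1, 5, 15)]
print(self_join_valid(swapped))  # {2: 'no', 1: 'yes'}
```

On the sample this reproduces the desired valid column. On the made-up `swapped` rows it marks tran 2 as 'no' even though it is the highest tran in its overlap, so if tran is not guaranteed to increase with start, the condition would probably need to compare tran values explicitly.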
