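PySpark: find the index of the first positive number in an array column

Posted on August 24

I have an array column in a PySpark DataFrame, and I want to find the index of the first positive number in each array. The data looks like this: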

id	arr
Cell 1	-1, -1, -1, -1
Cell 2	-1, -1, 5, -1
Cell 3	-1, 3, -1, -1

I would like to get output similar to the following:

id	arr	first_positive_element_index
Cell 1	-1, -1, -1, -1	null
Cell 2	-1, -1, 5, -1	2
Cell 3	-1, 3, -1, -1	1

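I can do this with a UDF, but the data is very large, which makes that approach very slow. I would prefer a better way that avoids a UDF.

Note: all non-positive numbers are -1.

Recommended answer

You can use expr with array_position: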

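from pyspark.sql import functions as func

df_pos = df.select(
    'id', 'arr',
    # one row per array element
    func.explode('arr').alias('arr_explode_value')
).filter(
    # keep only positive elements
    func.col('arr_explode_value') > 0
).withColumn(
    # array_position is 1-based, so subtract 1 for a 0-based index
    'pos', func.expr('array_position(arr, arr_explode_value)') - 1
).groupBy(
    'id'
).agg(
    # smallest index per id = index of the first positive element
    func.min('pos').alias('pos')
)
df_pos.show(10, False)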
+------+---+
|id    |pos|
+------+---+
|Cell 2|2  |
|Cell 3|1  |
+------+---+

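This builds the lookup DataFrame df_pos in three steps: explode the array into one row per element, keep only the positive values, and take the smallest index per id. Cell 1 has no positive value, so it does not appear in df_pos.

The remaining step is to left-join this lookup table back to the original DataFrame, which leaves pos as null for ids such as Cell 1.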

df.select('id', 'arr').join(df_pos.select('id', 'pos'), on=['id'], how='left')

Edit 1:

If you don't want to use explode because the arrays are long, you can use transform and array_position instead:

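# func.transform with a Python lambda requires Spark 3.1+
df.select(
    'id', 'arr',
    # 1 marks a positive element, 0 anything else
    func.transform(func.col('arr'), lambda value: func.when(value > 0, 1).otherwise(0)).alias('transformed_arr')
).withColumn(
    # array_position is 1-based and returns 0 when there is no match
    'pos', func.array_position('transformed_arr', 1) - 1
).show(
    10, False
)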
+------+----------------+---------------+---+
|id    |arr             |transformed_arr|pos|
+------+----------------+---------------+---+
|Cell 1|[-1, -1, -1, -1]|[0, 0, 0, 0]   |-1 |
|Cell 2|[-1, -1, 5, -1] |[0, 0, 1, 0]   |2  |
|Cell 3|[-1, 3, -1, -1] |[0, 1, 0, 0]   |1  |
+------+----------------+---------------+---+

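Because arr is an array column, you can use transform to apply a function to each element. Note that array_position returns 0 when no element matches, so pos is -1 rather than null for an array with no positive number (Cell 1 in the output above).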