Python: using a wildcard to identify files by extension

Posted on February 14

After mounting my data lake in Databricks, I tried to load all JSON files into a single DataFrame using a *.json suffix, but it does not work:

df = spark.read.option("recursiveFileLookup", "true") \
    .json("/mnt/adls_gen/prod/**/*.json")

Running the code above produces the following error:

[PATH_NOT_FOUND] Path does not exist: dbfs:/mnt/adls_gen/prod/**/*.json.

If I remove the file extension, the read succeeds:

df = spark.read.option("recursiveFileLookup", "true") \
    .json("/mnt/adls_gen/prod/**/*")

...but it also reads other files, such as those with the *.json_old and *.txt extensions.

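I am not aware of any alternative option for this scenario. Is there another way to filter by file extension? The files in my data lake have a variety of extensions, so I am looking for a solution that can cope with that variety.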

The Apache Spark version is 3.4.1 (Scala 2.12).

Recommended answer

I think this is what Path Glob Filter, one of the generic file source options, is for:

Path Glob Filter
pathGlobFilter is used to only include files with file names matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.
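
As a rough illustration of what matching on the file name means, the sketch below uses Python's standard fnmatch module as a stand-in. Hadoop's GlobFilter has its own implementation, so this only approximates simple patterns such as *.json, and the file names are hypothetical:

```python
from fnmatch import fnmatchcase

# Hypothetical file names, mirroring the extensions mentioned in the question.
names = ["events.json", "events.json_old", "notes.txt"]

# The pattern must match the whole name, so a trailing "_old" disqualifies a file.
kept = [n for n in names if fnmatchcase(n, "*.json")]
print(kept)  # ['events.json']
```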

df_filtered = spark.read \
    .option("header", "true") \
    .option("recursiveFileLookup", "true") \
    .csv("s3a://mybucket/testdata/csvs", pathGlobFilter="*.csv")

from pyspark.sql.functions import input_file_name
df_filtered.select(input_file_name()).distinct().show(truncate=False)
+-------------------------------------+
|input_file_name()                    |
+-------------------------------------+
|s3a://mybucket/testdata/csvs/c000.csv|
|s3a://mybucket/testdata/csvs/c001.csv|
|         :         :          :      |
+-------------------------------------+
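
Applied to the mount point from the question, the same option might look like the sketch below. It assumes an existing SparkSession is passed in (on Databricks this is usually the predefined `spark`). The helper name `read_json_only` is hypothetical and the default path is the one from the question. The wildcard is moved out of the path and into pathGlobFilter, while recursiveFileLookup takes care of walking the subdirectories:

```python
def read_json_only(spark, root="/mnt/adls_gen/prod"):
    """Hypothetical helper: recursively load only files whose names end in .json."""
    return (
        spark.read
        .option("recursiveFileLookup", "true")  # descend into all subdirectories
        .option("pathGlobFilter", "*.json")     # keep only names matching *.json
        .json(root)                             # plain directory path, no wildcards
    )
```

Calling `read_json_only(spark)` should then skip files such as *.json_old and *.txt, because the filter is matched against the whole file name rather than a prefix of it.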