In my PySpark DataFrame there is a column CallDate of type string, containing values like these:
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
I tried using pyspark.sql.functions.to_timestamp() to convert this column from string to timestamp.
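For reference, a minimal sketch to reproduce this setup (it assumes an active SparkSession named spark; the DataFrame and column name mirror the sample values above):

from pyspark.sql.functions import col, to_timestamp

# Build a small DataFrame with the string values shown above
df = spark.createDataFrame([("2008-04-01T00:00:00",)] * 6, ["CallDate"])
df.printSchema()  # CallDate is a string column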
When I run this code:
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy/MM/dd')).select('CallDate', 'IncidentDate').show()
...I get the following output:
+-------------------+------------+
| CallDate|IncidentDate|
+-------------------+------------+
|2008-04-01T00:00:00| NULL|
|2008-04-01T00:00:00| NULL|
|2008-04-01T00:00:00| NULL|
|2008-04-01T00:00:00| NULL|
|2008-04-01T00:00:00| NULL|
|2008-04-01T00:00:00|        NULL|
+-------------------+------------+
I assume the NULL values appear because the format I specified doesn't match the actual date strings: since no match is found, to_timestamp returns NULL.
But when I run this code instead:
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy-MM-dd')).select('CallDate', 'IncidentDate').show()
I get an error:
Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2008-04-01T00:00:00' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
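For context, the setting the error message refers to can be changed at the session level; a minimal sketch using the standard conf API:

# Restore the pre-3.0 (SimpleDateFormat-based) parser behavior
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Or keep the new parser and have unparseable strings treated as invalid (NULL)
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")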
I know the correct parsing format should be "yyyy-MM-dd'T'HH:mm:ss", like this:
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), "yyyy-MM-dd'T'HH:mm:ss")).select('CallDate', 'IncidentDate').show()
But my question is: why does Spark return NULL values when I set the parsing format to yyyy/MM/dd, yet throw an error when I set it to yyyy-MM-dd?