我想从一个亚秒精度的TimeSeries.csv
文件的一列中解析时间,但对于一些时间戳,它返回NaT
.
该数据集的一个奇怪之处在于,每一非整秒将以%Y-%m-%d %H:%M:%S.%f
表示,而每一整秒将以%Y-%m-%d %H:%M:%S
表示
我观察到,格式将在第一行将被转换,其他将是NaT
例如:
import pandas as pd
# Example data
timestamps_full_first = [
"2023-12-30 00:00:00",
"2023-12-30 00:00:00.1",
"2023-12-30 00:00:00.9",
"2023-12-30 00:00:01"
]
timestamps_sub_first = [
"2023-12-30 00:00:00.1",
"2023-12-30 00:00:00.9",
"2023-12-30 00:00:01",
"2023-12-30 00:00:01.1"
]
# Convert to datetime
datetime_series_full_first = pd.to_datetime(timestamps_full_first, errors='coerce', utc=True)
datetime_series_sub_first = pd.to_datetime(timestamps_sub_first, errors='coerce', utc=True)
print(datetime_series_full_first)
print(datetime_series_sub_first)
输出:
DatetimeIndex(['2023-12-30 00:00:00+00:00', 'NaT', 'NaT',
'2023-12-30 00:00:01+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2023-12-30 00:00:00.100000+00:00',
'2023-12-30 00:00:00.900000+00:00',
'NaT',
'2023-12-30 00:00:01.100000+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
Initial Solution个
我最初的解决方案是编写一个定制的parse_date
函数,我可以在其中以列表的形式提供多种格式.
def parse_date(self, date_str, formats = ["none"]):
for fmt in formats:
try:
return pd.to_datetime(date_str, format=fmt, utc=True)
except ValueError:
continue
return pd.NaT # Return 'Not a Time' for unrecognized formats
用途:
data[self.timestamp_col] = data[self.timestamp_col].apply(lambda x: self.parse_date(x, formats = self.timestamp_formats))
它是有效的,但与Pandas 的内部解析相比,它非常非常慢.
Optimized Solution(GPT-4建议):
GPT建议将数据帧矢量化,然后使用掩码进行一次解析,然后使用另一种格式解析每秒包含NaT
行的行,这应该会提高性能.
import pandas as pd
import numpy as np
def vectorized_parse_date(date_series, formats):
result_series = pd.Series(np.full(date_series.shape, pd.NaT), index=date_series.index)
for fmt in formats:
mask = result_series.isna() & ~date_series.isna() # Only try to parse where we don't have a result and the date is not NaN
try:
result_series[mask] = pd.to_datetime(date_series[mask], format=fmt, errors='raise', utc=True)
except ValueError:
continue
return result_series
# Usage
data[self.timestamp_col] = vectorized_parse_date(data[self.timestamp_col], formats=self.timestamp_formats)
我还没有试过,因为我觉得GPT对我的方法有点偏见,试图弄清楚如何在我的人为约束下做到这一点.
所以也许你们中的一些人看到了另一个使用pandas功能的解决方案.